Ethan Chiu · My personal blog

Creating Realtime, Client-side Twitter Analysis Script for the United Timeline

I recently started working on a project called the United Timeline, which shows and analyzes real-time Conservative, Moderate, and Liberal viewpoints across social media.

Here’s the mission statement: “The media constantly bombards us with polarizing rhetoric that support our own viewpoints. Without viewing other perspectives, we create an echo chamber that limits the extent to which we can have an open dialogue that critically engages with important topics and situations. Ignoring arguments from the other political side creates divisiveness, furthering the divide between Americans and bringing no real progress when it comes to reforms and policies in Congress and elsewhere.

The United Timeline provides real-time social media posts and analysis of liberal, moderate, and conservative viewpoints. Our goal is to bridge the divide between liberals, conservatives, and moderates by sharing the current conversation from each side on social media.”

Currently, we show what Twitter news feeds would look like from those different viewpoints. To achieve this, I started out by creating Twitter timelines for the Conservative, Moderate, and Liberal viewpoints. This was pretty straightforward. I created a Twitter account and created three different lists containing the most prominent and intelligent pundits from each side. Then, I simply embedded them side by side.

After that, I wanted to generate word clouds of the most recent Tweets. The first approach that came to mind was to simply use Twitter’s API. Unfortunately, the API has rate limits that real-time analysis would easily exceed. Another method was using Selenium to constantly scrape the most recent Tweets. This would require me to create a server with a backend and also might not work since Twitter blocks scraping after a certain limit.

So, I created a way to analyze the 20 most recent Tweets from each list entirely on the client side (no backend). As a result, I don’t have to rely on backend processing or worry about server costs. To analyze these Tweets in real time, I first looked at how Twitter’s embedded lists load onto a site. I saw that the embedded link transforms into an asynchronous script, meaning it loads in the background of the website as the user interacts with the page. So, I programmed a script which checks when the async Twitter script is loaded for each list. Afterwards, it grabs the top 20 Tweets’ text content using jQuery to identify the class names. Then, I use regex to eliminate any URLs, @ tags, or punctuation so that only words are generated in the word clouds. After that, I calculate the word frequency and format that into a list that the wordcloud2 JavaScript library can understand.

So, my script allows the user to see realtime graphs of the 20 most recent Tweets from each list every time he or she refreshes the page!

Here’s the code:

$( document ).ready(function() {
	var $canvas = $('#word_cloud');

	//Temp solution since Twitter uses async for embedded links.
	//Assumption: the call that starts everything is missing from this listing.
	generateWordCloud();
});

function wordFreq(string) {
	// get rid of urls, @ mentions, common connecting words, punctuation + spaces
	var no_url = string.replace(/(?:https?|ftp):\/\/[\n\S]+/g, '')
		.replace(/\S*@\S*\s?/g, "")
		// \b keeps words that merely end in "a", "to", etc. intact
		.replace(/\b(?:the|a|an|and|of|what|to|for|about) +/g, "")
		.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");
	var words = no_url.replace(/[.]/g, '').split(/\s/);
	var freqMap = {};
	words.forEach(function(w) {
		if (!freqMap[w]) {
			freqMap[w] = 0;
		}
		freqMap[w] += 1;
	});

	return freqMap;
}

function generateWordFreqs(raw_data, name) {
	var tweets = [];
	for(var i = 0; i<raw_data.length; i++){
		//Assumption: the loop body is missing from this listing; collect each Tweet's text
		tweets.push($(raw_data[i]).text());
	}
	//Join words into one coherent group (wordFreq strips tags, etc.)
	tweets = tweets.join(' ');

	var listA = wordFreq(tweets);
	var list = [];
	for (var key in listA) {
		list.push([key, listA[key]*5]);
	}
	WordCloud.minFontSize = "15px";
	WordCloud($('#' + name + '_word_cloud')[0], { list: list });
}

function generateWordCloud(){
	//Assumption: the timer call is missing from this listing; setTimeout fits the trailing ", 250)"
	setTimeout(
	    function() {
	    	//Set up variables, get document values from iframes loaded asynchronously
	    	//Work around for not being able to explicitly call $("#twitter-widget-.....")
	    	var liberal = $(document.getElementById('twitter-widget-0').contentWindow.document).find(".timeline-Tweet-text");
	    	var moderate = $(document.getElementById('twitter-widget-1').contentWindow.document).find(".timeline-Tweet-text");
	    	var conservative = $(document.getElementById('twitter-widget-2').contentWindow.document).find(".timeline-Tweet-text");

	    	//Generate Word Clouds
	    	generateWordFreqs(liberal, "liberal");
	    	generateWordFreqs(moderate, "moderate");
	    	generateWordFreqs(conservative, "conservative");
	    }, 250);
}

Here’s the link where you can see this all in action.

[Screenshot: United Timeline Analysis]

Obviously, this isn’t the prettiest solution out there. I could have used recursion and for loops to eliminate some repetition, but I feel like this is the clearest explanation. For a full list of my other attempts at analyzing the Tweets in real time completely client-side, please check out the JavaScript file on GitHub.

Since this script only works for the 20 most recent Tweets, I’m most likely going to create a Python backend to process more Tweets (like every Tweet in a day) and present different word clouds daily. This would also allow me to use libraries such as NLTK to filter out connecting words.
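As a rough idea of what that backend step might look like, here is a minimal Python 3 sketch of the same clean-up-and-count logic. It is only an illustration under assumptions: the `STOPWORDS` set is a small hand-picked stand-in for a real stopword list such as NLTK’s, and the names `word_freq` and `to_wordcloud_list` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stand-in for a real stopword list (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "and", "of", "what", "to", "for", "about"}

URL_RE = re.compile(r"(?:https?|ftp)://\S+")
MENTION_RE = re.compile(r"\S*@\S*")
NON_WORD_RE = re.compile(r"[^\w\s]")


def word_freq(text):
    """Count words after removing URLs, @ mentions, punctuation, and stopwords."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub("", text)
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return Counter(words)


def to_wordcloud_list(freqs, scale=5):
    """Format counts as [word, weight] pairs, most frequent first."""
    return [[word, count * scale] for word, count in freqs.most_common()]


tweets = [
    "The debate about taxes continues https://example.com/a",
    "@someone Taxes, taxes, and more taxes!",
]
freqs = word_freq(" ".join(tweets))
print(to_wordcloud_list(freqs))
```

Filtering against a set of whole words avoids the partial-word matches that a plain regex replacement can cause, and the set can be swapped for a larger list without touching the rest of the code.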

This website is completely open source and open to contributions. We are currently working on adding more analytic tools and integrations. If you have any suggestions/concerns, feel free to open up an issue over here.

Identifying Hate with NLTK, difflib, and Python code!

Recently, for our research project, we worked on identifying hate on anonymous platforms.

Here’s an example of a graph of the percentage of the top 15 hate words with dummy data in a random server:

[Figure: CentipedeCentralHateSpeechBar.png]

We were inspired by the method of identifying hate outlined in the paper “Kek, Cucks, and God Emperor Trump: A Measurement Study of 4chan’s Politically Incorrect Forum and Its Effects on the Web”.

To code this up, I first had to import a list of hate words. I used the same database that the aforementioned paper used: Hatebase. I used a Python wrapper to get the top 1000 US hate words:


from json import loads
from hatebase import HatebaseAPI

#key = your Hatebase API key (must be defined before this point)
hatebase = HatebaseAPI({"key": key})
#"language": "eng"
pagenumber = 1
responses = []

#Loads top 1000 results
while pagenumber <= 11:
    filters = {"country":"US", "page": str(pagenumber)}
    output = "json"
    query_type = "vocabulary"
    response = hatebase.performRequest(filters, output, query_type)

    # convert to Python object
    #(assumption: the parse-and-store line is missing from this listing)
    responses.append(loads(response))
    pagenumber += 1
print "Done getting API results"

Then, I processed the JSON data received from the API by using Python’s convenient iteritems() function for dictionaries:

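#Process Hate Words
from collections import OrderedDict

data = []
for r in responses:
    #Assumption: this line is missing from the listing; it presumes each
    #page of results sits under a "data" field
    data.append(r["data"])
#print len(data)
listofHatewords = []

#print len(data)
for z in data:
    for a, v in z.iteritems():
        for b in v:
            #Assumption: each entry stores its term under "vocabulary"
            listofHatewords.append(b["vocabulary"])
print listofHatewords
#Remove duplicates while keeping the original order
listofHatewords = list(OrderedDict.fromkeys(listofHatewords))
print len(listofHatewords)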
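The de-duplication step at the end is worth a second look: `OrderedDict.fromkeys` keeps the first occurrence of each word and preserves the original order, which a plain `set` would not guarantee. A minimal Python 3 sketch with made-up placeholder words (the helper name `dedupe_keep_order` is hypothetical):

```python
from collections import OrderedDict


def dedupe_keep_order(items):
    """Drop repeated items while keeping the first occurrence of each, in order."""
    return list(OrderedDict.fromkeys(items))


words = ["alpha", "beta", "alpha", "gamma", "beta"]
unique_words = dedupe_keep_order(words)
print(unique_words)  # prints ['alpha', 'beta', 'gamma']
```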

Then, I used this code from StackOverflow to find different forms of the hate words:

from nltk.corpus import wordnet as wn
#We'll store the derivational forms in a set to eliminate duplicates
#(versionsOfWord is one of the 1000-slot lists initialized in the next snippet)
index2 = 0
for word in listofHatewords:
    forms = set()
    for happy_lemma in wn.lemmas(word): #for each lemma of the word in WordNet
        forms.add(happy_lemma.name()) #add the lemma itself
        for related_lemma in happy_lemma.derivationally_related_forms(): #for each related lemma
            forms.add(related_lemma.name()) #add the related lemma
    versionsOfWord[index2] = forms
    index2 += 1
print len(versionsOfWord)

Then, I initialize several lists, each of size 1000, to gather as much info as possible about the hate words and their occurrences in messages, such as the ID of the message and the message content associated with the hate word:

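frequency = []
versionsOfWord = []
#frequencyID = []
frequencyTime = []
listmessageID = []
listauthors = []
frequencyIndex = []
#Assumption: the loop bodies are missing from this listing. Each loop
#presumably gave one list a starting value per hate word, which a single
#loop does here.
for x in range(0, 1000):
    frequency.append(0)
    versionsOfWord.append([])
    frequencyTime.append([])
    listmessageID.append([])
    listauthors.append([])
    frequencyIndex.append([])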

For the main loop, I first go through each message in the messages list (which was imported from prior data). I then make that message lowercase. After that, I split it into words and check whether a hate word or any of its forms is in the message. I also check for censored words (ex: “nger”) via the difflib library:

import difflib

totalNumberofWords = 0
counter = 0
print len(message)
print len(messageID)
print len(timestamps)
for m, m_id, date, a_id in zip(message, messageID, timestamps, authorID):
    #print m
    totalNumberofWords += len(m)
    lower = m.lower()
    #Split once per message instead of once per hate word
    listof_lower = lower.split(" ")
    index = 0

    if counter%100000==0:
        print counter
    #print counter
    #Need to tokenize to get all frequencies
    for word in listofHatewords:
        wordLowered = word.lower()
        similarWords = versionsOfWord[index]
        found = False

        #matchesHate = difflib.get_close_matches(word, listof_lower, 1, .5)
        #Else if check the NLTK forms of words
        #Check if there are versions of the word first though
        #TOOK out "word in lower" since it was inaccurate
        if wordLowered in listof_lower or len(difflib.get_close_matches(wordLowered, listof_lower, 1, .75)) >= 1:
            #frequencyID[index].append(str(m_id) + " " + m)
            found = True
        elif len(similarWords) > 0:
            for a in similarWords:
                aLowered = a.lower()
                if aLowered in listof_lower or len(difflib.get_close_matches(aLowered, listof_lower, 1, .75)) >= 1:
                    #frequencyID[index].append(str(m_id) + " " + m)
                    #print "test" + str(counter)
                    found = True
                    break
        #Assumption: the lines that record a hit are missing from this listing;
        #this fills the lists initialized earlier (frequencyIndex is assumed
        #to hold the position of the message)
        if found:
            frequency[index] += 1
            listmessageID[index].append(m_id)
            listauthors[index].append(a_id)
            frequencyTime[index].append(date)
            frequencyIndex[index].append(counter)
        #Increase index to make sense
        index += 1
        if index > len(listofHatewords):
            print "Length error"
    counter += 1
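To see how the difflib check catches censored spellings, here is a small self-contained sketch (Python 3) that uses a harmless placeholder word. The `message_mentions` helper is hypothetical; the 0.75 cutoff is the same one used in the loop above.

```python
import difflib


def message_mentions(term, message, cutoff=0.75):
    """Return True if `term` appears in `message` exactly or as a close variant."""
    tokens = message.lower().split()
    term = term.lower()
    if term in tokens:
        return True
    # get_close_matches returns up to n tokens whose similarity ratio to
    # `term` is at least `cutoff` (0.0 = nothing in common, 1.0 = identical).
    return len(difflib.get_close_matches(term, tokens, n=1, cutoff=cutoff)) >= 1


print(message_mentions("hello", "well h*llo there"))    # censored form, still matched
print(message_mentions("hello", "good morning world"))  # nothing similar enough
```

The cutoff is a trade-off: lowering it catches more heavily obfuscated spellings but also lets in unrelated words. For instance, “well” scores about 0.67 against “hello”, so it is rejected at 0.75 but would be accepted at 0.6.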

Here is how I saved it in a JSON file:

#Process data
jsonList = []

for i in range(0,1000):
    jsonList.append({'hateword': listofHatewords[i], 'messageID': listmessageID[i], 'authorID':listauthors[i], 'frequency': frequency[i], 'frequencyIndex':frequencyIndex[i], 'frequencyTime':frequencyTime[i]})
#print(json.dumps(jsonList, indent = 1))

#Put to file
import simplejson
import time
timestr = time.strftime("%Y%m%d-%H%M%S")
try:
    f = open(ChannelName + 'Allfrequencies'  + str(timestr) +'.json', 'w')
    simplejson.dump(jsonList, f)
    f.close()
except NameError:
    print "Almost erased" + ChannelName + "Allfrequencies.json! Be careful!!!"

Here is how I graphed it in Pandas:

#Graphs percentage of mentions of ____ hate word in posts
import pandas as pd

#print totalNumberofWords
#print frequency

#TODO percentages + save list of hate words into file for further analysis
#Test for every word
#Create matrix where this is the message || 
#parse vector => how many words is mentioned
#CountVectorizer => scikit-learn. vocabulary is list of 1000 words
#^count how many times a word occurs
#Sum of rows
#Find which of the words occur the most

#Use pandas

df = pd.DataFrame({'words':listofHatewords, 'frequency':frequency})

df = df.sort_values('frequency')
#print df

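#Cut to the top 15 most frequent hate words
#(.copy() so that adding a column below doesn't modify a slice of df)
gb = df.tail(15).copy()

#total number of messages
lengthOfMessages = len(message)
#print gb

#Calculate percentage (float() guards against integer division)
gb["percentage"] = gb["frequency"]/float(lengthOfMessages)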

#print df
del gb["frequency"]

#Rename Columns
gb.columns = ["Hate Word", "Percentage of Appearances of Messages"]
print gb
#Graph percentages

ax = gb.set_index('Hate Word').plot(kind='bar')
vals = ax.get_yticks()
ax.set_yticklabels(['{:3.0f}%'.format(x*100) for x in vals])
im = ax
image = im.get_figure()
image.savefig(ChannelName + 'HateSpeechBar.png')

And that’s it! This code does not detect the context of conversation such as sarcasm and solely tests for existence of words in messages.

Importance of Studying Humanities as a Programmer

I think it’s important to study humanities because many programmers don’t understand the ramifications of the technology they create.

For example, Facebook and other social media platforms have such strong algorithms and information tools that they basically let any advertiser target any demographic they want by gender, age, etc. So, with enough money, any entity can target a certain group of people and change their opinion pretty easily. Thus, platforms like Facebook allow entities to manufacture consent in any group of people if they have enough money.

When I first started learning about computer science, I dreamed about working for the top tech companies because of how cool it was. I just wanted to create the coolest and most advanced technology out there. Little did I understand the great harm some technology like Facebook does to people.

For example, Facebook utilizes machine learning and other optimization tools to basically create an instant gratification cycle which keeps users hooked onto the site as long as possible. As a result, people see content that Facebook’s algorithms deem they will like which sometimes includes misinformation and creates echo chambers. For example, if a user was a troll and loved Donald Trump, he or she might have been shown misinformation like Pizzagate. That person might become so convinced that that information is true that he or she might bring a gun to that pizza place.

Personally speaking, studying courses in the humanities at my community college, such as Economics and English classes, and doing interdisciplinary research have really changed my mindset as a person and programmer. For example, in my latest English classes, we read literature such as Noam Chomsky’s Media Control and articles like “Teen Depression and Anxiety: Why the Kids Are Not Alright”. Those types of literature woke me up in a sense. Before then, I had attended a lot of hackathons, creating a vast array of cool technology with other programmers. What I dreamed of was becoming a software programmer at a big tech company like Facebook. I never thought about the negative effects of those types of technology. All I saw when I dreamed about working for these companies was creating impactful technology that makes our world more connected. After reading writings like the ones mentioned above in the past semesters, I’ve realized the importance of truly understanding the consequences and the multifaceted nature of creating novel technologies.

Moreover, studying these types of courses in the humanities motivated me to do my own investigation and research. I talked to my cousin who is only 6. Guess what? He plays Minecraft and voice chats with random people daily!!! To me, that’s frightening. While technology like this VoIP seems cool due to its low barrier of entry, accessibility, and freedom from consequences, it also paves an easy way for trolls to bully others, like young kids, with no consequences. I also talked to a friend who I gamed with and was shocked to hear that he attempted suicide a few years ago at the age of 12. Based on my conversations, it was clear it stemmed from the messages he got on social platforms and the sense of needing to fit in based on the image that social media paints.

Nevertheless, I think platforms like Facebook are realizing the harm they are creating after seeing former executives like Chamath Palihapitiya speak out. Facebook recently announced they are changing the newsfeed to show more friends and family posts, thus creating more meaningful conversations. I’m also happy to see powerful and influential tech moguls such as Elon Musk create initiatives like OpenAI which focus on creating a safe pathway to artificial intelligence. Hopefully, other powerful tech executives follow suit before technology creates long-term damage.

Thus, I think it’s important that programmers study subjects in the humanities such as psychology, history, and ethics so that they have a strong understanding of what they are creating.

Counting Discord Channels Using Selenium

For my research project, I wanted to get a better sense of how many Discord servers exist, considering there was no official number out there.

So, I searched on Google for a list of Discord servers and found a few websites.

I started out with finding a way to scrape the first website that was listed when I searched for Discord servers’ list:

There wasn’t an official number of Discord servers listed, so I investigated what the website looked like. It was fairly simple, with “Next” and “Previous” buttons that helped navigate between different pages of Discord servers.

Using Inspect Element, I identified the element associated with the titles of the channels as well as identified the element associated with the “next” button. Then, using Selenium, I constructed a loop which clicked on the next button, waited for 3 seconds, and added all the servers to the list:

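from selenium import webdriver
import time

# The path works because I have moved the "chromedriver" file to /usr/bin
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')
#(The browser.get(...) call that opens the server list is not shown here.)

listofChannels = []
a = 0

#Use infinite loop since the total number of pages is unknown. 
#(At least in jupyter notebooks, it'll stop when it can't find the "Next" button on the last page)
#Loop body reconstructed from the description above: collect names, click "Next", wait 3 seconds
while True:
    names = browser.find_elements_by_class_name("server-name")
    for name in names:
        listofChannels.append(name.text)
    moveforward = browser.find_elements_by_xpath("//*[contains(text(), 'Next')]")[0]
    moveforward.click()
    time.sleep(3)
    a += 1
    print a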

Then, I used the Pandas library to make sure there were no duplicate servers:

#Find Duplicates using Pandas library
cleanList = pd.unique(listofChannels).tolist()

Finally, to tally up all of the servers, I just used Python’s len() function:

count = len(cleanList)
print count

In the end, I counted 13,040 servers listed on the site.

Associating Reddit Links to Descriptions using Selenium and MatPlotLib

Recently, I wrote about a script I wrote for extracting subreddit names from URLs.
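For readers who haven’t seen that post, the idea can be sketched roughly as follows (Python 3). This is not the original script: the pattern, the `subreddit_from_url` name, and the example URLs are assumptions made for illustration.

```python
import re

# Assumed pattern: the subreddit name follows "/r/" and consists of
# letters, digits, and underscores.
SUBREDDIT_RE = re.compile(r"reddit\.com/r/([A-Za-z0-9_]+)")


def subreddit_from_url(url):
    """Return the subreddit name found in a Reddit URL, or None if there is none."""
    match = SUBREDDIT_RE.search(url)
    return match.group(1) if match else None


links = [
    "https://www.reddit.com/r/learnpython/comments/abc123/some_title/",
    "https://www.reddit.com/user/someone",
]
print([subreddit_from_url(link) for link in links])
```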

From there, I programmed a way to extract the time from the JSON data so that I could eventually construct a time series using the Pandas library. Note that I had to take a substring of the result due to some bugs in the GoogleScraper:

#Process time stamps correctly for data. 
#Note that some are in the general data instead of the correct time slot
for a in data:
	for b in a['results']:
		#Fix why none is not working
		#workaround with length of 4?
		if 'None' in b['time_stamp']:

Next, I worked on programming a way to find more info about the subreddit associated with each link. I originally tried to find a JSON dataset that already had this information, or a list of subreddits with descriptions. Unfortunately, I couldn’t find any list that had ALL subreddits. So, I used the official Reddit website to get each description.

First, I had to set up Selenium. I had to install Selenium and Jupyter notebooks. It was pretty simple: I just had to install the package via terminal with apt-get, download the Chrome web driver from Google, and move it to /usr/bin. I added the following libraries to the top of my code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

After testing the Jupyter notebook to make sure that the libraries were running as expected, I went back to the website to identify patterns within classes and structure to extract descriptions. From there, I simply ran a for loop over the subreddit list which searched each of the subreddit names and then extracted the description of the first result (since those names in the list are 100% accurate). I also included a try/except block to prevent an index-out-of-range error: if the subreddit doesn’t have a description, the HTML element “md” is not created, which would raise an error that stops the whole script:

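# Path = where the "chromedriver" file is
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')
#(The browser.get(...) call that opens the Reddit search page is not shown here.)

#Get info on each search term
inputBox = browser.find_element_by_name("q")
info = []
for channel in subReddits:
    #Assumption: the search step is missing from this listing; type the name and submit
    inputBox.clear()
    inputBox.send_keys(channel)
    inputBox.send_keys(Keys.RETURN)
    time.sleep(3) #wait for the results to load (the exact delay is a guess)
    #find element 
    try:
        elem = browser.find_elements_by_class_name("md")[1]
        info.append(elem.text)
    except IndexError:
        #No description element; keep info aligned with subReddits
        info.append("")
    #Solution to stale element issue since element changes from the original q element
    inputBox = browser.find_element_by_name("q")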

And finally, creating a table for readability:

#Combine Data for data processing
arrays = zip(links, subReddits, timestamp, info)
df = pd.DataFrame(data=arrays)

Full code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import urllib, json
from pprint import pprint
import time

#Load Json
data = json.load(open('discordgg/December2016November2017.json'))

#Get Only Links from JSON

for a in data:
	for b in a['results']:
#Process time stamps correctly for data. 
#Note that some are in the general data instead of the correct time slot
for a in data:
	for b in a['results']:
		#Fix why none is not working
		#shitty workaround with length of 4?
		if 'None' in b['time_stamp']:
#Slice to appropriate date from original data
#for stamp in timestamp:
#Pattern => 11 or 12 characters. the next character is a space.
#Better way is to find 2016 or 2017 and then slice till that point

for what in timestamp:

#Process data using regex to get subreddits
for y in links:
# Path = where the "chromedriver" file is
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver')

#Get info on each search term
inputBox = browser.find_element_by_name("q")
info = []
for channel in subReddits:
    #find element 
#Combine Data for data processing
arrays = zip(links, subReddits, timestamp, info)
df = pd.DataFrame(data=arrays)

And that’s it! ^_^

Later, I plan to categorize each of these links using a bag-of-words algorithm.
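As a rough preview of that idea (not the eventual implementation), a bag-of-words model simply represents each description by its word counts and ignores word order. A minimal Python 3 sketch using only the standard library; the function names and sample descriptions are made up for illustration:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a description to word counts, ignoring case, punctuation, and order."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def shared_word_count(bag_a, bag_b):
    """Count the word occurrences two bags have in common (multiset intersection)."""
    return sum((bag_a & bag_b).values())


gaming = bag_of_words("A community for PC gaming news and gaming discussion")
news = bag_of_words("Breaking news and discussion from around the world")
print(gaming["gaming"], shared_word_count(gaming, news))
```

Descriptions that share many words end up with a high overlap, which is the kind of signal a categorization step could build on.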

Thanks for reading!