Ethan Chiu My personal blog

Extracting Subreddit Names From Urls

For my research on the platform Discord, I had gathered a list of URLs that mentioned the platform during the beginning of Discord’s rise in popularity. I wanted to investigate how Discord grew on Reddit.

After gathering those URLs with a Google scraper I had modified, I needed to extract the links from that data to get a list of subreddits where the Discord platform was being mentioned.

The data I collected looked like this:

[{
  "effective_query": "",
  "id": "1721",
  "no_results": "False",
  "num_results": "10",
  "num_results_for_query": "About 2,150,000 results (0.33 seconds)\u00a0",
  "page_number": "9",
  "query": "discord.gg site:reddit.com",
  "requested_at": "2017-11-21 06:58:48.987283",
  "requested_by": "localhost",
  "results": [
    {
      "domain": "www.reddit.com",
      "id": "2665",
      "link": "https://www.reddit.com/r/HeistTeams/comments/6543q7/join_heistteams_offical_discord_server_invite/",
      "link_type": "results",
      "rank": "1",
      "serp_id": "1721",
      "snippet": "http://discord.gg/gtao. ... The good thing about discord is if you're like me and your Mic don't work there's a .... Sub still kinda active, but the discord is much more.",
      "time_stamp": "Apr 13, 2017 - 100+ posts - \u200e100+ authors",
      "title": "Join HeistTeam's Offical Discord Server! Invite: discord.gg/gtao - Reddit",
      "visible_link": "https://www.reddit.com/r/HeistTeams/.../join_heistteams_offical_discord_server_invite..."
    },
    {
      "domain": "www.reddit.com",
      "id": "2666",
      "link": "https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/",
      "link_type": "results",
      "rank": "2",
      "serp_id": "1721",
      "snippet": "Ive changed the link in the sidebar over to the official discord or you can join it by following this link here. http://discord.gg/neebsgaming. Here are the rules for\u00a0...",
      "time_stamp": "Jul 28, 2017 - 5 posts - \u200e4 authors",
      "title": "The Official Neebs Gaming Discord! : NeebsGaming - Reddit",
      "visible_link": "https://www.reddit.com/r/NeebsGaming/.../the_official_neebs_gaming_discord/"
    },

I first needed to extract just the links from the “results” part of the data. So, I used a simple nested for loop to pull out each link and appended it to an empty list for later use:

#Load Json
data = json.load(open('discordgg/November2015December2016.json'))

#Get Only Links from JSON
links=[]
for a in data:
	for b in a['results']:
		links.append(b['link'])
		#pprint(b['link'])

Then, I needed to extract just the part of the URL that corresponds to the subreddit name. For example, for the URL “https://www.reddit.com/r/NeebsGaming/comments/6q3wlk/the_official_neebs_gaming_discord/”, I wanted to extract just “NeebsGaming”. Luckily, all of the links I collected from Reddit followed the same pattern, where the subreddit name appears between “/r/” and the next “/”, so I just split each URL on “/” and selected the index of the segment that holds the subreddit name:

#Split the URLs to get subreddit names
subReddits=[]

for y in links:
	subReddits.append(y.split('/')[4])
	pprint(y.split('/')[4])

Code in its totality:

import urllib, json 
from pprint import pprint

#Load Json
data = json.load(open('discordgg/November2015December2016.json'))

#Get Only Links from JSON
links=[]
for a in data:
	for b in a['results']:
		links.append(b['link'])
		#pprint(b['link'])

#Split the URLs to get subreddit names
subReddits=[]

for y in links:
	subReddits.append(y.split('/')[4])
	pprint(y.split('/')[4])
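If any non-Reddit links ever sneak into the results, a slightly more defensive version could match the subreddit name with a regular expression instead of relying on a fixed index. This is just a sketch (extract_subreddit is a hypothetical helper, not part of my actual pipeline):

import re

#Hypothetical helper: pull the subreddit name out of a Reddit URL,
#returning None for links that don't contain the /r/<name>/ pattern
SUBREDDIT_RE = re.compile(r'reddit\.com/r/([^/]+)/')

def extract_subreddit(url):
    match = SUBREDDIT_RE.search(url)
    return match.group(1) if match else None

subreddits = [name for name in (extract_subreddit(u) for u in links) if name]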

Right now, I’m using the Reddit API to get short descriptions of those subreddits and then running a simple bag-of-words algorithm over those descriptions to categorize them; a rough sketch of that step is below. Stay tuned!
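Here is a minimal sketch of that categorization step, assuming the praw library for the Reddit API; the credential values and keyword buckets are placeholders I made up for illustration:

from collections import Counter
import praw  #assumed Reddit API wrapper; credentials below are placeholders

reddit = praw.Reddit(client_id='CLIENT_ID',
                     client_secret='CLIENT_SECRET',
                     user_agent='subreddit-categorizer')

#Made-up keyword buckets purely for illustration
CATEGORIES = {
    'gaming': {'game', 'gaming', 'players', 'server'},
    'tech': {'software', 'programming', 'developers'},
}

def categorize(subreddit_name):
    #Fetch the subreddit's short public description and score it against each bucket
    description = reddit.subreddit(subreddit_name).public_description.lower()
    words = Counter(description.split())
    scores = {cat: sum(words[w] for w in keywords)
              for cat, keywords in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else 'uncategorized'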

Scraping Google Within A Time Range

For the past few weeks, I’ve been trying to program a way to scrape Google search results within a certain time period.

I have been using the GoogleScraper repo developed by NikolaiT, which has robust scraping functionality that speeds up my data collection. Unfortunately, it lacked two things: gathering timestamps and searching within a time range. A few days ago, I solved the former issue.

While adding the search-within-a-time-range functionality, I first ran multiple searches using Google’s “Custom range” date setting to limit the searches to different time ranges. After running through a couple of searches, I noticed a common pattern that I could easily replicate. Here is an example of the time parameters I saw:

&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2015%2Ccd_max%3A2016&tbm=

As you can see, there are two main attributes: cd_min, which is the beginning of the time range, and cd_max, which is the end.
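To make the pattern easier to read, that tbs value can be URL-decoded with Python’s standard library (a quick illustration, separate from the scraper code):

from urllib.parse import unquote

#Decode the tbs parameter from the example above
print(unquote('cdr%3A1%2Ccd_min%3A2015%2Ccd_max%3A2016'))
#Prints: cdr:1,cd_min:2015,cd_max:2016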

From there, I dug through the GoogleScraper code and tried to find where the URL is formed. After searching and tracing different functions, I found the right location and added a simple line of code to restrict searches to a specific time range:

def get_base_search_url_by_search_engine(config, search_engine_name, search_mode):
    assert search_mode in SEARCH_MODES, 'search mode "{}" is not available'.format(search_mode)

    specific_base_url = config.get('{}_{}_search_url'.format(search_mode, search_engine_name), None) 

    if not specific_base_url:
        specific_base_url = config.get('{}_search_url'.format(search_engine_name), None)

    ipfile = config.get('{}_ip_file'.format(search_engine_name), '')

    if os.path.exists(ipfile):
        with open(ipfile, 'rt') as file:
            ips = file.read().split('\n')
            random_ip = random.choice(ips)
            return random_ip
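    # My addition: hard-code a custom date range (3/1/2015 - 11/1/2015) through the tbs parameter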
    specific_base_url += "&source=lnt&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F2015%2Ccd_max%3A11%2F1%2F2015&tbm=&"
    return specific_base_url 

After validating that this worked correctly, I just appended the right parameters to the URL manually every time I wanted to search a different time range, to save time. Here is my commit for better clarification.
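If I ever want to avoid editing that string by hand, a small helper along these lines could build the fragment from a pair of dates (just a sketch, not part of the committed change; the dates mirror the 3/1/2015-style format Google uses):

from urllib.parse import quote

def date_range_params(cd_min, cd_max):
    """Build Google's custom-date-range URL fragment for dates like '3/1/2015'."""
    tbs = quote('cdr:1,cd_min:{},cd_max:{}'.format(cd_min, cd_max), safe='')
    return '&source=lnt&tbs=' + tbs + '&tbm=&'

#e.g. specific_base_url += date_range_params('3/1/2015', '11/1/2015')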

Scraping Google

Recently, I’ve been working on scraping millions of Google results for a research project of mine tracing how anonymous social platforms have been gaining popularity.

I tried many of the existing approaches for scraping Google efficiently, and none of them worked except the GoogleScraper library developed by the GitHub user NikolaiT.

When I saw what kind of data the scraper produced, I was amazed. It was able to collect links and metadata, as well as scrape through proxies.

Unfortunately, there was no way to get the timestamps of posts for the links scraped from Google’s search results pages.

So, I programmed it.

While the code was well commented and written, there were many different functions and variables that made it hard to piece everything together.

I first used “inspect element” on a random Google search of mine and identified a few common CSS classes shared by result links. Then, I searched for those within GoogleScraper’s code to see if I could find the matching code that parses the HTML of Google’s results page:

normal_search_selectors = {
        'results': {
            'us_ip': {
                'container': '#center_col',
                'result_container': 'div.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip': {
                'container': '#center_col',
                'result_container': 'li.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip_news_items': {
                'container': 'li.card-section',
                'link': 'a._Dk::attr(href)',
                'snippet': 'span._dwd::text',
                'title': 'a._Dk::text',
                'visible_link': 'cite::text'
            },
        },

Afterwards, I used “inspect element” on a Google search results page with timestamps and was able to find a similar class: “slp f”.

From there, I basically followed its track and added another instance using my own “time_stamp” variable, first where each Link is constructed and then in the selector definitions:

        for key, value in parser.search_results.items():
            if isinstance(value, list):
                for link in value:
                    parsed = urlparse(link['link'])

                    # fill with nones to prevent key errors
                    [link.update({key: None}) for key in ('snippet', 'time_stamp','title', 'visible_link') if key not in link]

                    Link(
                        link=link['link'],
                        snippet=link['snippet'],
                        time_stamp=link['time_stamp'],
                        title=link['title'],
                        visible_link=link['visible_link'],
                        domain=parsed.netloc,
                        rank=link['rank'],
                        serp=self,
                        link_type=key
                    )
    normal_search_selectors = {
        'results': {
            'us_ip': {
                'container': '#center_col',
                'result_container': 'div.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'time_stamp' : 'div.slp::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip': {
                'container': '#center_col',
                'result_container': 'li.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'time_stamp' : 'div.slp::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip_news_items': {
                'container': 'li.card-section',
                'link': 'a._Dk::attr(href)',
                'snippet': 'span._dwd::text',
                'time_stamp' : 'div.slp::text',
                'title': 'a._Dk::text',
                'visible_link': 'cite::text'
            },
        },
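Since the scraped time_stamp values come back as strings like “Apr 13, 2017 - 100+ posts - 100+ authors”, downstream code can peel off just the date. Here is a small sketch of one way to do that (not part of the pull request itself):

from datetime import datetime

def parse_time_stamp(raw):
    """Pull the date out of strings like 'Apr 13, 2017 - 100+ posts - 100+ authors'."""
    date_part = raw.split(' - ')[0].strip()
    return datetime.strptime(date_part, '%b %d, %Y')

print(parse_time_stamp('Jul 28, 2017 - 5 posts - 4 authors'))  #2017-07-28 00:00:00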

You can see my whole process through my forked version or my pull request! :)

Now, I’m working on finding a way to search within a specific time range. This will require me to translate the common patterns in Google’s time-range searches into actual code. Stay tuned for the next post!

Misinformation

Recently, I saw an alarming number of posts on r/The_Donald, the largest online group of Trump supporters, claiming that a Washington Post author offered to pay women $1000 to accuse Roy Moore of sexual misconduct, in response to an article about Moore’s unwanted sexual advances towards a 14-year-old girl. Here’s the tweet that sparked all of this: https://twitter.com/umpire43/status/928783143099883520

Here is a small sampling of those posts:

Unfortunately, this spread of misinformation stems from a multitude of issues. First off, the moderators of r/The_Donald custom programmed the subreddit’s CSS so that it is impossible to downvote any post unless it appears on a user’s main Reddit feed:

Secondly, the top comments on most of these posts seem to reinforce the claim without fact-checking it:

Thirdly, although there seems to be a bot in place that is supposed to fact check these types of posts, it didn’t comment on the validity of most of the posts claiming the reporter paid off women to speak poorly about Moore. This community needs a better way to make sure its readers don’t get false information, since most people seem to take these posts at face value.

Nevertheless, some members of this community spoke out against the misinformation after seeing the bot comment that the source could be unreliable, which shows how effective a bot that encourages users to check the source can be.

This incident speaks volumes about how readily people believe information that aligns with their political views. In this instance, I wish the people who believed this claim about the reporter had checked the source of the information. A quick fact check would have shown that the source was a tweet from a user who espouses controversial and radical viewpoints. Moderators of big political subreddits like r/The_Donald and r/politics need to share resources with their communities about what counts as a reliable source on the internet.

On a side note, I think it’s worth investigating suspicious Reddit accounts to see if any are potentially bots trying to stir something up. Especially with the recent news about Russia creating fake Twitter accounts, I wouldn’t be surprised if “Doug Lewis” (the person who started all these false accusations) is a Russian bot.