Ethan Chiu My personal blog

Scraping Google Within A Time Range

For the past few weeks, I’ve been trying to program a way to scrape Google search results within a certain time period.

I have been using the GoogleScraper repo developed by NikolaiT, whose robust scraping functionality speeds up my data collection. Unfortunately, it lacked two things: gathering timestamps and searching within a time range. A few days ago, I solved the former issue.

While trying to add the time range search, I first ran multiple searches using the “Custom range time” setting to limit the results to different time ranges. After a couple of searches, I noticed a common pattern that I could easily replicate. Here is an example of the time parameters I saw:

&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2015%2Ccd_max%3A2016&tbm=

As you can see, there are two main attributes: cd_min, which is the start of the time range, and cd_max, which is the end.
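To make the pattern easier to read, here is a minimal sketch that decodes the example string above with Python's standard library. The function name parse_time_range is my own for illustration and is not part of GoogleScraper; the sketch only assumes the tbs value is a comma-separated list of key:value pairs, as the example suggests.

```python
from urllib.parse import parse_qs


def parse_time_range(query: str) -> dict:
    """Split the `tbs` value of a query string into its key/value parts.

    Hypothetical helper for illustration; not part of GoogleScraper.
    """
    # parse_qs percent-decodes the values (%3A -> ':', %2C -> ',')
    params = parse_qs(query.lstrip("&"), keep_blank_values=True)
    tbs = params.get("tbs", [""])[0]  # e.g. "cdr:1,cd_min:2015,cd_max:2016"
    return dict(part.split(":", 1) for part in tbs.split(",") if ":" in part)


example = "&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2015%2Ccd_max%3A2016&tbm="
print(parse_time_range(example))
# {'cdr': '1', 'cd_min': '2015', 'cd_max': '2016'}
```

Decoded this way, the parameter is just cdr:1 followed by the two range bounds.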

From there, I dug through the GoogleScraper code to find where the URL is formed. After searching and tracing different functions, I found the right place to restrict the search to a specific time range and added a single line of code:

def get_base_search_url_by_search_engine(config, search_engine_name, search_mode):
    assert search_mode in SEARCH_MODES, 'search mode "{}" is not available'.format(search_mode)

    specific_base_url = config.get('{}_{}_search_url'.format(search_mode, search_engine_name), None) 

    if not specific_base_url:
        specific_base_url = config.get('{}_search_url'.format(search_engine_name), None)

    ipfile = config.get('{}_ip_file'.format(search_engine_name), '')

    if os.path.exists(ipfile):
        with open(ipfile, 'rt') as file:
            ips = file.read().split('\n')
            random_ip = random.choice(ips)
            return random_ip
    # Added line: hard-coded custom date range (3/1/2015 to 11/1/2015), URL-encoded
    specific_base_url += "&source=lnt&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F2015%2Ccd_max%3A11%2F1%2F2015&tbm=&"
    return specific_base_url

After validating that this worked correctly, I just edited the parameters in the URL manually every time I wanted to search a different time range, to save time. Here is my commit for better clarification.
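Instead of editing the string by hand, the parameters could be generated from two dates. The following is only a sketch under stated assumptions: the helper names are hypothetical (they are not in GoogleScraper), and the month/day/year ordering is inferred from the hard-coded example above rather than from any Google documentation.

```python
from datetime import date
from urllib.parse import urlencode


def format_google_date(d: date) -> str:
    """Format a date as M/D/YYYY without zero padding (assumed from the example)."""
    return "{}/{}/{}".format(d.month, d.day, d.year)


def build_time_range_params(start: date, end: date) -> str:
    """Return the URL-encoded time range suffix for a search URL.

    Hypothetical helper for illustration; not part of GoogleScraper.
    """
    if start > end:
        raise ValueError("start must not be after end")
    tbs = "cdr:1,cd_min:{},cd_max:{}".format(
        format_google_date(start), format_google_date(end)
    )
    # urlencode percent-encodes ':', ',' and '/' in the tbs value
    return "&" + urlencode({"source": "lnt", "tbs": tbs, "tbm": ""})


print(build_time_range_params(date(2015, 3, 1), date(2015, 11, 1)))
# &source=lnt&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F2015%2Ccd_max%3A11%2F1%2F2015&tbm=
```

The returned string could then be appended to specific_base_url in place of the hard-coded value.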

Scraping Google

Recently, I’ve been working on scraping millions of Google results for a research project of mine tracing how anonymous social platforms have become popular.

I tried many approaches to scraping Google efficiently, and none of them worked except the GoogleScraper library developed by the GitHub user NikolaiT.

When I saw what kind of data the scraper produced, I was amazed. It was able to get links and metadata as well as scrape through proxies.

Unfortunately, there was no way of getting the timestamps of posts for the links scraped from Google’s search results page.

So, I programmed it.

While the code was well commented and well written, there were many different functions and variables that made it hard to piece everything together.

I first used Inspect Element on a random Google search of mine and identified a few classes that the links had in common. Then I searched for those classes within GoogleScraper’s code to find the matching code that parses the HTML of Google’s results page:

normal_search_selectors = {
        'results': {
            'us_ip': {
                'container': '#center_col',
                'result_container': 'div.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip': {
                'container': '#center_col',
                'result_container': 'li.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip_news_items': {
                'container': 'li.card-section',
                'link': 'a._Dk::attr(href)',
                'snippet': 'span._dwd::text',
                'title': 'a._Dk::text',
                'visible_link': 'cite::text'
            },
        },

Afterwards, I inspected a Google results page that showed timestamps and found a similar class: “slp f”.

From there, I basically followed the same pattern and added another entry, using my own “time_stamp” field:

        for key, value in parser.search_results.items():
            if isinstance(value, list):
                for link in value:
                    parsed = urlparse(link['link'])

                    # fill missing fields with None to prevent key errors
                    for field in ('snippet', 'time_stamp', 'title', 'visible_link'):
                        link.setdefault(field, None)

                    Link(
                        link=link['link'],
                        snippet=link['snippet'],
                        time_stamp=link['time_stamp'],
                        title=link['title'],
                        visible_link=link['visible_link'],
                        domain=parsed.netloc,
                        rank=link['rank'],
                        serp=self,
                        link_type=key
                    )

    normal_search_selectors = {
        'results': {
            'us_ip': {
                'container': '#center_col',
                'result_container': 'div.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'time_stamp' : 'div.slp::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip': {
                'container': '#center_col',
                'result_container': 'li.g ',
                'link': 'h3.r > a:first-child::attr(href)',
                'snippet': 'div.s span.st::text',
                'time_stamp' : 'div.slp::text',
                'title': 'h3.r > a:first-child::text',
                'visible_link': 'cite::text'
            },
            'de_ip_news_items': {
                'container': 'li.card-section',
                'link': 'a._Dk::attr(href)',
                'snippet': 'span._dwd::text',
                'time_stamp' : 'div.slp::text',
                'title': 'a._Dk::text',
                'visible_link': 'cite::text'
            },
        },
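To show what a selector like 'div.slp::text' is meant to pick out, here is a small standard-library sketch. It does not use GoogleScraper's own parser, and the sample markup is invented for illustration from the class names mentioned above; Google's real markup may differ and changes over time.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; not copied from a real results page.
SAMPLE_HTML = """
<div class="g">
  <h3 class="r"><a href="https://example.com/a">First result</a></h3>
  <div class="s"><div class="slp f">Mar 3, 2015 - </div><span class="st">Snippet A</span></div>
</div>
<div class="g">
  <h3 class="r"><a href="https://example.com/b">Second result</a></h3>
  <div class="s"><span class="st">Snippet B, no date</span></div>
</div>
"""


class TimestampParser(HTMLParser):
    """Collect the text found directly inside every <div class="slp ..."> element."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one bool per open <div>: does it have class "slp"?
        self.timestamps = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        is_slp = "slp" in classes
        self.div_stack.append(is_slp)
        if is_slp:
            self.timestamps.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()

    def handle_data(self, data):
        # Only keep text while the innermost open <div> is a div.slp
        if self.div_stack and self.div_stack[-1]:
            self.timestamps[-1] += data


def extract_timestamps(html: str) -> list:
    parser = TimestampParser()
    parser.feed(html)
    return [text.strip() for text in parser.timestamps]


print(extract_timestamps(SAMPLE_HTML))
# ['Mar 3, 2015 -']
```

Results without a div.slp element simply produce no timestamp here, which is why the Link code above fills missing fields with None.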

You can see my whole process through my forked version or my pull request! :)

Now, I’m working on finding a way to search within a specific time range. This will require me to actually program different functions to translate the commonalities of time range searches into a feasible solution. Stay tuned for the next post!

Misinformation

Recently, I saw an alarming number of posts on r/The_Donald, the largest online group of Trump supporters, claiming that a Washington Post reporter offered to pay women $1000 to accuse Roy Moore of sexual misconduct for an article about Moore’s unwanted sexual advances towards a 14-year-old girl. Here’s the tweet that sparked all of this: https://twitter.com/umpire43/status/928783143099883520

Here is a small sampling of those posts:

Unfortunately, this increase in misinformation stems from a multitude of issues. First, the mods of r/The_Donald custom-programmed the subreddit’s CSS so that it is impossible to downvote a post unless it appears on a user’s main Reddit feed:

Secondly, the top comments on most of these posts seem to reinforce the post without fact-checking it:

Thirdly, although there seems to be a bot in place that is supposed to fact-check these types of posts, that bot didn’t comment on the validity of most of the posts claiming the reporter paid women to speak poorly about Moore. This community needs a better way to make sure that its readers don’t get false information, since most of them seem to take the information at face value.

Nevertheless, some members of this community spoke out against this misinformation after seeing the bot comment that the source could be unreliable, which shows the effectiveness of a bot that encourages users to check the source.

This incident speaks volumes about how people believe information that aligns with their political views. In this instance, I wish the people who believed this claim about the reporter had checked the source of the information. A quick fact check would have shown that the source was a tweet from a user who espouses controversial and radical viewpoints. Moderators of big political subreddits like r/The_Donald and r/politics need to share resources with their communities on what counts as a reliable source on the internet.

On a side note, I think it’s worth investigating suspicious Reddit accounts to see if any are potentially bots trying to stir something up. Especially with the recent news about Russia creating fake Twitter accounts, I wouldn’t be surprised if “Doug Lewis” (the person who started all these false accusations) is a Russian bot.

The Importance of Defining the Alt-right Correctly

The term alt-right is tossed around quite a bit. Pundits on the left often cast outspoken conservatives as alt-right as a veiled way of calling them racist, while white supremacists take ownership of the term to further the preservation of white ethno-culture. Moreover, the term is defined differently in the media, online, and in face-to-face conversations. Some define the Alt-right simply as conservatives who are anti-establishment and favor big money, while others define it as a group that fights to preserve white culture over any other culture. One thing is certain: the Alt-right grew out of the political incorrectness of anonymous social platforms around the advent of Donald Trump’s presidential campaign. Furthermore, after the recent events in Charlottesville, it is clear that those who take ownership of the term Alt-right favor white nationalism and ethnocentrism rather than anything associated with the political right (conservatism). The Alt-right needs to be defined more clearly and tied to its roots in white supremacy; without a clear definition, the term becomes weaponized to the point where it harms both conservatives and liberals while favoring white nationalists.

Since the Alt-right was formed by people who support white ethnocentrism, it should be defined as a group of people furthering white nationalism rather than be associated with any other ideology. The dictionary definition of the Alt-right has many good aspects, but it simply doesn’t do the term justice. Merriam-Webster, like many other dictionaries, defines the Alt-right as a group of people who reject mainstream conservatism and espouse extremist beliefs “typically centered on ideas of white nationalism” (“Alt-right”, Merriam-Webster). The rejection of mainstream conservatism makes sense in this definition, but I think it is a much bigger part of the definition than it should be. While it is true that the alt-right rejects mainstream conservatives, its members still lean towards conservatism and libertarianism on policy. More importantly, I believe the definition should rely mainly on the notion that the alt-right believes in white ethnocentrism on all issues. Most recently, the Alt-right blog claimed that the Las Vegas shooter, who killed more than 50 people, did so because he was dating an Indonesian woman (“The Las Vegas Shooting”, Law). They truly desire to get rid of all minority cultures in favor of Western and European culture. They believe that history shows that European and Western culture dominates all other cultures and that America was founded on that culture. Moreover, in my own experience interacting with the alt-right through their main anonymous chat server on Discord (before they were banned), they made it clear that they discriminate against non-whites. According to their Rules section, you had to submit a photo of yourself to prove that you were white in order to join the member channels. Thus, since the people who take ownership of the term Alt-right are clearly focused mainly on eradicating other cultures in favor of white nationalism, the definition should clearly underscore that.

The way the Alt-right term is used in the media is not an accurate representation; the term should be defined the way online communities have used it. The term has exploded and has been portrayed in the media in a variety of ways. Recently, on CNN or any polarizing left-leaning media outlet, one can see pundits labeling conservatives like Ben Shapiro as Alt-right members in an attempt to discredit their conservative beliefs. One interesting test case is the election campaign, when Hillary Clinton denounced the Alt-right and stressed that Donald Trump should disavow them as supporters. She used the term as a tool to cast a large base of Donald Trump’s supporters as racist. Thus, the term has slowly turned into a buzzword similar to the word “racist”. Whenever someone is called “Alt-right”, they quickly deny it unless they are a white supremacist such as Richard Spencer. Since the term carries such a negative connotation and is so closely associated with racial discrimination and white nationalism, it is used to discredit conservative ideas while signaling to viewers that the accused person is a racist. This use of the term is often unfair and creates an easy way to group all conservatives as racist. More importantly, the way the media uses the term ultimately sets a precedent for conservative media outlets to use other extremist terms, such as “Antifa”, to cast liberals in a bad light. Unless the alt-right is clearly defined as a group of anti-establishment people who favor ideologies that support only white ideals, the term will be used as a wide net to describe angry conservatives or Trump supporters, which is unfair.

The Alt-right term shouldn’t be defined like a political ideology such as conservatism, because the people who take ownership of the term focus mainly on race in policy. When watching interviews with people who are proudly in the alt-right, one can see that they have no clear unifying message when it comes to economic policy. They all seem to just be angry at non-whites and thus complain about policies that seem to eliminate white culture. For example, the “Unite the Right” rally in Charlottesville stemmed from anger over statues of Confederate soldiers being taken down. Similarly, there have been protests against removing the Confederate flag from government buildings. More importantly, Alt-righters rarely talk about economic policy or their thoughts on improving American infrastructure without blaming non-whites. Even those who contributed to the mainstream use of the term, such as the anti-establishment conservative news site Breitbart, don’t align with it anymore, citing its racist ideologies. Thus, the definition of the Alt-right should be mainly about white nationalism and should incorporate their belief in discriminating against non-whites. If this is not included in the definition, people who are truly in the alt-right will be viewed not as racist but as anti-establishment, which they really aren’t, and that ultimately harms the conservative movement.

The inclusion of conservatism in the alt-right definition harms the conservative ideology and casts many conservatives as racists, which is unfair. Conservatives rarely adopt the Alt-right term. Even the most outspoken, anti-establishment conservatives like Milo Yiannopoulos reject the term as a description of their ideas. Even conservatives who are against mainstream conservatism don’t align with it. Thus, I feel that it is unfair to define the Alt-right as people who are against mainstream conservatism when in reality the Alt-right are just people who are frustrated by the state of America, believe that white culture is disappearing, and believe that this is ultimately harmful to America. The people who identify with the alt-right often dwell on anonymous platforms such as their blogging platforms and anonymous communication platforms like Discord. When looking at their conversations, one can see that they don’t really care about conservative policies like smaller government and tax cuts but rather focus on policies that kick out undocumented immigrants or limit non-whites from coming to America. They are shunned by most reputable conservatives, and should be. Thus, I think it’s disingenuous to include conservatism in the Alt-right definition, and I also believe that the anti-establishment part of the definition should not be underscored so much. The people in the Alt-right movement care solely about white nationalism and its use in policy.

The Alt-right term should be defined solely as people who favor white nationalism, instead of along the lines of both anti-establishment ideals and white nationalism. Including conservatism or associating the term with the anti-establishment movement harms those valid ideas and associates those viewpoints with truly racist people. With the alt-right term being used a lot in the mainstream now, one must be careful in using it. It is an easy way to discredit and reject anti-establishment conservatives as racist. Defining the Alt-right in the light of anti-establishment conservatism has paved and will ultimately pave the way for opponents of liberals to cast liberalism as “Antifa”, creating an easy way to describe liberal movements as violent and thus discredit them.

Works Cited:

  • Law, Vincent. “The Las Vegas Shooting.” AltRight.com, AltRight, 3 Oct. 2017, altright.com/2017/10/02/the-las-vegas-shooting/. Accessed 3 Oct. 2017.
  • “Alt-right.” Merriam-Webster.com. Merriam-Webster, n.d. Web. 3 Oct. 2017.