Scraping Google Within a Time Range · Ethan Chiu

For the past few weeks, I’ve been working on a way to scrape Google search results within a specific time period.

I have been using the GoogleScraper repo developed by NikolaiT, whose robust scraping functionality speeds up my data collection. Unfortunately, it could neither gather the timestamps of results nor restrict a search to a time range. A few days ago, I solved the former issue.

To add searching within a time range, I first ran several searches by hand, using the “Custom range time” setting to limit the results to different time ranges. After a couple of searches, I noticed a common pattern in the resulting URLs that I could easily replicate. Here is an example of the time parameters I saw:

&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2015%2Ccd_max%3A2016&tbm=

As you can see, there are two main attributes: cd_min, which is the start of the time range, and cd_max, which is the end.
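The percent-encoding makes the fragment hard to read, so here is a small sketch that decodes it with Python's standard library. The variable names are my own, and treating tbs as a comma-separated list of key:value pairs is an assumption based only on the example above, not on any documented Google format.

```python
from urllib.parse import parse_qs

# Query fragment copied from the example above (leading "&" stripped).
fragment = "&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2015%2Ccd_max%3A2016&tbm=".lstrip("&")

# parse_qs percent-decodes the values; keep_blank_values keeps the empty tbm.
params = parse_qs(fragment, keep_blank_values=True)
tbs = params["tbs"][0]

# Assumption: tbs is a comma-separated list of key:value pairs.
time_filter = dict(item.split(":", 1) for item in tbs.split(","))

print(tbs)          # cdr:1,cd_min:2015,cd_max:2016
print(time_filter)  # {'cdr': '1', 'cd_min': '2015', 'cd_max': '2016'}
```

Decoded, %3A is ":" and %2C is ",", so the whole time filter lives inside the single tbs parameter.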

From there, I dug through the GoogleScraper code to find where the URL is formed. After searching and tracing different functions, I found the right place to restrict a search to a specific time range and added a single line of code:

import os
import random

def get_base_search_url_by_search_engine(config, search_engine_name, search_mode):
    # SEARCH_MODES is defined elsewhere in GoogleScraper.
    assert search_mode in SEARCH_MODES, 'search mode "{}" is not available'.format(search_mode)

    # Prefer a URL specific to this search mode, then fall back to the engine-wide one.
    specific_base_url = config.get('{}_{}_search_url'.format(search_mode, search_engine_name), None)

    if not specific_base_url:
        specific_base_url = config.get('{}_search_url'.format(search_engine_name), None)

    ipfile = config.get('{}_ip_file'.format(search_engine_name), '')

    # If an IP file is configured, a random entry from it is returned instead.
    if os.path.exists(ipfile):
        with open(ipfile, 'rt') as file:
            ips = file.read().split('\n')
            random_ip = random.choice(ips)
            return random_ip
    # The line I added: a hard-coded time range (3/1/2015 to 11/1/2015).
    specific_base_url += "&source=lnt&tbs=cdr%3A1%2Ccd_min%3A3%2F1%2F2015%2Ccd_max%3A11%2F1%2F2015&tbm=&"
    return specific_base_url

After validating that this worked correctly, I saved time by simply editing these parameters in the URL by hand every time I wanted to search a different time range. Here is my commit for better clarification.
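Editing the string by hand works, but the same suffix could also be generated from two dates. The sketch below is not part of GoogleScraper or of my commit. The function name build_time_range_suffix is hypothetical, and it assumes the dates in the hard-coded string are in month/day/year order, which the example (3/1/2015 and 11/1/2015) does not settle on its own.

```python
from datetime import date
from urllib.parse import quote


def build_time_range_suffix(start: date, end: date) -> str:
    """Hypothetical helper: build the time-range suffix for a Google search URL."""
    if start > end:
        raise ValueError("start must not be after end")
    # Assumption: month/day/year without zero padding, as in the hard-coded string.
    cd_min = f"{start.month}/{start.day}/{start.year}"
    cd_max = f"{end.month}/{end.day}/{end.year}"
    tbs = f"cdr:1,cd_min:{cd_min},cd_max:{cd_max}"
    # safe="" makes quote() percent-encode ':', ',' and '/' as well.
    return "&source=lnt&tbs=" + quote(tbs, safe="") + "&tbm=&"


suffix = build_time_range_suffix(date(2015, 3, 1), date(2015, 11, 1))
print(suffix)
```

For these two dates the result is identical to the string appended in the function above, so the hard-coded line could be swapped for a call like this once start and end dates are read from the config.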