Sooner or later, specialists who deal with web data face the problem of collecting URLs from Google. The main obstacle is constant IP bans, the result of Google's methods for detecting automated access.
When you start scraping Google, its typical “reaction” looks like this:
1. At first you’ll get warnings about “unsafe” or “dangerous” activity (for example, an on-screen warning about a virus or a Trojan, along with advice on how to deal with it)
2. Once that warning has been issued, you’ll have to solve a Captcha, which sets an authentication cookie, before you can continue scraping
3. Finally, Google will block the IP, either temporarily (for a few minutes or hours) or for a long time. At this point additional IPs are needed (see the detection sketch below).
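It pays to recognize these responses programmatically so you can stop and switch IPs before the situation escalates. Below is a minimal sketch, assuming the requests library is used; the markers it checks (HTTP 429/503 status codes, the “/sorry/” Captcha redirect, the “unusual traffic” phrase) are heuristics based on commonly observed block pages and may change at any time.

```python
import requests

def looks_blocked(response: requests.Response) -> bool:
    """Heuristic check for Google's anti-scraping responses (sketch, not exhaustive)."""
    if response.status_code in (429, 503):            # typical rate-limit / block status codes
        return True
    if "/sorry/" in response.url:                     # redirect to the Captcha interstitial
        return True
    if "unusual traffic" in response.text.lower():    # phrase shown on the warning page
        return True
    return False
```

As soon as this returns True, pause the job and move on to a fresh IP rather than retrying from the same address.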
To identify scraping, Google primarily looks for patterns in IP addresses, keyword modifications, and request regularity.
Below are some of the most important points to pay attention to while scraping SERPs:
Choose a reliable proxy source so you can change IP addresses on a constant basis. Make sure the proxies are anonymous, fast, have no bad history (they were never used to access Google before), and are preferably rotating.
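One simple way to cycle through such a pool is shown below. This is a minimal sketch with the requests library; the proxy URLs are placeholders, and a real project would take the list from its proxy provider.

```python
import itertools
import requests

# Hypothetical rotating-proxy pool; replace with addresses from your proxy provider.
PROXY_POOL = itertools.cycle([
    "http://user:pass@proxy1.example.net:8080",
    "http://user:pass@proxy2.example.net:8080",
    "http://user:pass@proxy3.example.net:8080",
])

def fetch_via_next_proxy(url: str) -> requests.Response:
    """Send each request through the next proxy in the pool."""
    proxy = next(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
```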
Use around 100 proxies, adjusting the number based on the results of each search query; larger projects may need more than 100. Always stop scraping as soon as Google detects the process.
Change your IP address consistently and at the right point in the scraping process. Timing is crucial to your scraping success!
After you change the IP address, clear your cookies or disable them entirely.
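In practice the easiest way to do this is to start a fresh session for every IP, so nothing set while using the previous proxy carries over. A small sketch, again assuming requests:

```python
import requests

def fresh_session(proxy: str) -> requests.Session:
    """Start a clean session whenever the IP changes, so cookies from the
    previous proxy are not carried over to the new one."""
    session = requests.Session()
    session.proxies = {"http": proxy, "https": proxy}
    session.cookies.clear()   # a new Session starts empty; this just makes the intent explicit
    return session
```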
Do not fetch more than a thousand results per keyword, and rotate the IP address whenever you switch to a new keyword.
If you scrape fewer than 300 results per keyword, you can scrape several different keywords with the same IP, but only after a pause.
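The pacing logic behind these two rules is simple to encode. In the sketch below, fetch_serp is a hypothetical callable that collects up to max_results results for one keyword; the 300-result cap and the pause length come from the guidance above and should be tuned to your own project.

```python
import time

def scrape_keywords(keywords, fetch_serp, max_results=300, pause_seconds=60):
    """Pacing sketch: keep the volume per keyword low and pause before
    reusing the same IP for the next keyword."""
    for keyword in keywords:
        fetch_serp(keyword, max_results)   # hypothetical fetcher for one keyword
        time.sleep(pause_seconds)          # cool-down before the next keyword on the same IP
```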
Use another source of IPs if you need more than 100 proxies.
You can request up to 100 results per page by appending the &num=100 parameter to the search URL.
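A minimal sketch of building such a URL, assuming the standard google.com/search endpoint and using the start parameter to paginate in steps of 100:

```python
from urllib.parse import urlencode

def build_serp_url(query: str, start: int = 0) -> str:
    """Build a results-page URL asking for 100 results per page."""
    params = {"q": query, "num": 100, "start": start}   # start paginates in steps of 100
    return "https://www.google.com/search?" + urlencode(params)

# Example: the first and second pages of up to 100 results each.
first_page = build_serp_url("web scraping")
second_page = build_serp_url("web scraping", start=100)
```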
Make sure your XPath/CSS selectors exclude universal results, such as image or video blocks, from the organic results; for most data projects these are not what you need.
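The sketch below shows the idea with BeautifulSoup. The selectors are only illustrative assumptions: the div.g container and the g-img/video/carousel markers reflect markup Google has used in the past, and they change frequently, so adjust them to the HTML you actually receive.

```python
from bs4 import BeautifulSoup

def extract_organic_links(html: str) -> list[str]:
    """Pull only organic result links; selectors are illustrative and will
    need adjusting to the markup Google actually serves you."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for result in soup.select("div.g"):                      # assumed organic-result container
        if result.select_one("g-img, video, g-scrolling-carousel"):
            continue                                          # skip image/video/carousel blocks
        anchor = result.select_one("a[href]")
        if anchor:
            links.append(anchor["href"])
    return links
```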
When requesting a page, Google may often redirect you to the domain associated with the country the request originates from. The &gws_rd=cr parameter helps to control this.
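It can simply be added alongside the other query parameters, for example:

```python
from urllib.parse import urlencode

# gws_rd=cr discourages the redirect to a country-specific Google domain.
params = {"q": "web scraping", "num": 100, "gws_rd": "cr"}
url = "https://www.google.com/search?" + urlencode(params)
```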
Using a consistent User-Agent helps avoid trouble; sometimes simply rotating the User-Agent string randomly works too.
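If you go the rotation route, picking a header at random per request is enough. The strings below are just examples of common desktop browsers; keep your own list current.

```python
import random
import requests

# Example desktop browser User-Agent strings (illustrative; update them over time).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def get_with_random_agent(url: str) -> requests.Response:
    """Send the request with a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=15)
```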
With proper planning, it’s possible to scrape Google 24/7 without being detected.