The Hypocrisy of Google Parsing and Google Scraping
By soma56
Google is the most popular search engine in the world, a household name that has turned into a verb. Millions of people "Google" something several times a day, yet many are unaware of the hypocrisy surrounding the search engine.
To understand Google you have to understand the underlying premise behind search engines in general. A search engine is basically a large database of websites. It sends out scripts known as 'spiders' or 'bots' that continuously crawl the Internet, and each search engine has its own secret formula for determining what ranks at the top of its search engine results pages (SERPs). Website owners have some degree of control when it comes to spiders 'scraping' their website. Through a file known as "robots.txt" they can explicitly state which pages a given spider may crawl, and therefore which pages are meant to stay out of the search engine's index. Although Google provides clear instructions on how to keep its spider away from specific pages, it repeatedly ignores such requests. Google, to put it mildly, scrapes and indexes everything on the Internet. Many web developers have wondered why Google would outline specific rules only to ignore them.
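To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler is supposed to read such a file before fetching a page. It uses Python's standard urllib.robotparser module. The file contents, the example.com URLs and the "OtherBot" name are invented for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, made up for this example.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant spider asks this question before requesting each URL.
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/report.html")
googlebot_home = parser.can_fetch("Googlebot", "https://example.com/index.html")
otherbot_tmp = parser.can_fetch("OtherBot", "https://example.com/tmp/cache.html")

print(googlebot_private, googlebot_home, otherbot_tmp)
```

Nothing in the file enforces these rules. Compliance is entirely up to the spider, which is why a crawler that ignores the file can still reach every page.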
Google is the world's most proficient scraper. The hypocrisy lies within its terms of service. According to section 5.3 of Google's terms of service, users must "agree not to access (or attempt to access) any of the Services through any automated means". However, Google has made a business out of accessing websites through automated means.
Web developers have published various tests of Google's crawling methods online. They set out to explicitly deny Google permission to index certain web pages, and they report that Google repeatedly ignored those requests.
Google makes a point of blocking automated scripts. It has altered the way incoming requests are accepted. If any anomaly is detected, it redirects the user to a page requiring a CAPTCHA. A CAPTCHA is a human verification tool that requires users to type in a random string of letters and/or words. The letters are formed in a way that is very difficult for robots or scripts to decipher. In other instances Google will temporarily ban an IP address.
In most cases the solution for scraping Google is to use proxies. Programmers rotate them on a continuous basis to avoid detection. But even this requires skill. Proxies die fast and take time to return results. Only a handful of people on the web have found a way to scrape Google in real time without being banned. These "Google Parser" companies enjoy the benefits of that information for marketing, SEO and any number of other uses. Their practice goes against Google's terms of service. However, because those terms are hypocritical in nature, it is difficult for some users to take them seriously.
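The rotation idea can be sketched roughly as follows. This is only an in-memory illustration, assuming a simple round-robin policy with a failure threshold. The ProxyPool class, its method names and the addresses (taken from a range reserved for documentation) are hypothetical, and the sketch makes no network requests.

```python
from collections import deque


class ProxyPool:
    """Round-robin pool that retires a proxy after repeated failures."""

    def __init__(self, proxies, max_failures=2):
        self._queue = deque(proxies)
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures

    def next_proxy(self):
        """Return the proxy at the front, then move it to the back."""
        if not self._queue:
            raise RuntimeError("no live proxies left")
        proxy = self._queue[0]
        self._queue.rotate(-1)
        return proxy

    def report_failure(self, proxy):
        """Count a failure; drop the proxy once it reaches the threshold."""
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures and proxy in self._queue:
            self._queue.remove(proxy)

    def live(self):
        """Return the proxies that are still in rotation."""
        return list(self._queue)


# Placeholder addresses, not real proxies.
pool = ProxyPool(["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"])
first_four = [pool.next_proxy() for _ in range(4)]

# Simulate one proxy dying: two failures remove it from the rotation.
pool.report_failure("203.0.113.11:8080")
pool.report_failure("203.0.113.11:8080")
remaining = sorted(pool.live())
```

Because proxies die fast, most of the work in practice lies in keeping the pool stocked and deciding when a slow proxy counts as dead, which this sketch reduces to a single failure counter.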