Searching the Web with Google: Some Oddities and the Problem of Inflation

While working on one of my papers (for more, click here [1]), I had the opportunity to learn more about the nature of Google’s search engine, and in particular about the number of hits that it returns.

The number of hits Google gives is just an estimate, a fact that Google itself emphasises. For instance, you might have noticed that Google only gives three significant digits in its results (see Matt Cutts’ comment here [2]). There is more, as Greg pointed out: see an entry by Prof Jean Véronis here [3] and another by Mark Liberman here [4]. These distortions are rather sophisticated and might not affect your search queries.

Incidentally, you might wonder why I’ve been using Google’s results for an academic paper. First, there is simply no practical alternative. Second, I made sure that I actually saw the number of hits deflate as I paged through the results. Third, I assume that all my search queries are distorted to the same degree. Since I’m looking at the ratios between certain search queries, any distortion that is common to both queries cancels out and does not affect the results.
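
The point about ratios can be illustrated with a minimal Python sketch. The counts and the distortion factor below are made-up numbers for illustration only, not data from the paper, and the sketch assumes that both queries are inflated by the same multiplicative factor:

```python
def hit_ratio(count_a, count_b):
    """Return the ratio between the hit counts of two queries."""
    return count_a / count_b

# Hypothetical "true" counts and a hypothetical common inflation factor.
true_a, true_b = 300, 120
factor = 4

# What the search engine would report if both counts were inflated equally.
reported_a = true_a * factor
reported_b = true_b * factor

print(hit_ratio(true_a, true_b))
print(hit_ratio(reported_a, reported_b))
```

Both lines print the same ratio, because the common factor appears in the numerator and the denominator. If the two queries were inflated by different factors, the ratio would shift accordingly.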

But despite the usefulness of the Google counts, there are limitations that might cause trouble. One oddity occurs in big searches: one can only access a tiny fraction of the results Google returns (e.g., when googling “Volkswagen”, one will get some 800-900 million results, but one will only be able to click through a few hundred of them).

Another problem is this: well into my research, I noticed that the results seem to be inflated. Certain search queries will return thousands of hits at first, but as you browse through them, the number collapses. Try the following search parameters and see for yourself:

“{menace OR menaces OR menaced} {me OR you OR him OR her OR us OR them} to” -secrecy -silence

(Note the double inverted commas and the negatives.) At first, Google gives some 40 000 hits for this query, but as you page through, the number collapses to nothing, really.

For my research, this was a serious blow, as I relied on the viability of Google’s numbers. I got in touch with Alex Chitu of the blog Google Operating System [5], and he explained to me that the reason for this distortion lies in the syntactic complexity of the search queries: they use quotation marks, logical operators, and exclusions, all of which make Google’s estimates coarser.

This might explain the extreme inflation (by a multiplier of about 200), but even simple search queries give results that are inflated to a certain degree. Try googling ‘Eierschalensollbruchstelle’ (from Eierschalensollbruchstellenverursacher, German for egg puncher; a small kitchen utensil used to punch a hole into an egg so that it doesn’t crack when boiling) and you’ll get some 500 or 600 results. As you click through them, the number deflates to well under 200 results. Or try the proper name ‘ “Adam Roseneck” ’ (it’s important to put it into inverted commas): it gives 500 – 600 hits, which deflate to well under 50.

Now, the point with ‘Eierschalensollbruchstelle’ and ‘Adam Roseneck’ is that they are not complex queries in terms of their syntactic structure, but there is still significant inflation going on, namely by factors of about 4 and 10, respectively.
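
For the record, the multipliers quoted here are simply the initial estimate divided by the number of results left after paging through. A small sketch in Python follows. The function name is my own, and the final counts are illustrative stand-ins chosen to match the rough figures above, not exact measurements:

```python
def inflation_factor(initial_estimate, final_count):
    """How many times larger the first estimate is than the count
    that remains after paging through the results."""
    if final_count <= 0:
        raise ValueError("final_count must be positive")
    return initial_estimate / final_count

# Illustrative stand-in: about 600 hits at first, about 150 after paging.
print(inflation_factor(600, 150))

# Illustrative stand-in: about 40 000 hits at first, about 200 after paging.
print(inflation_factor(40000, 200))
```

With these stand-in numbers the factors come out at 4 and 200, in line with the rough multipliers mentioned for the simple and the complex query.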

Further, when comparing my results from 2008 to the results you’d get today, I noticed that the current Google results are completely off (maybe this points to a change in Google’s search algorithm?). Yahoo’s 2011 results, however, remained nearly the same compared to 2008 (apart from a slight overall increase, which you’d expect from a rapidly growing internet).

All these issues call Google’s estimate into doubt. As Google say, their “results estimates are just that — estimates” (again here [2]). It is for you to judge how reliable these estimates are.