Xkcd, the comic, often runs numerically oriented cartoons. Recently, though, its author used the xkcd blog to point out a real problem, one that calls a fair amount of real linguistic work into question.
The problem seems to be that those useful result counts provided by Google may not be at all reliable.
He writes:
The “number of results” count that Google gives when you search is clearly fabricated. This is clear for a few reasons. When Google says this:
You can tell that it’s wrong first by scrolling to the end of the results. When you get to page 32, it suddenly becomes:
This doesn’t usually matter, since nobody looks much past the first few pages of results, but it’s annoying if you’re trying to use the number of results as a measure of something.
The trouble is that these numbers have actually been used for linguistic research, because it is vastly easier to use Google’s numbers than to collect a corpus of data and count the occurrences on your own. Please read the blog post for more details.
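For comparison, counting occurrences yourself is not much code once you have a corpus in hand. The following is only a minimal sketch: the tiny `corpus` list and the function names `count_phrase` and `document_frequency` are made up for illustration, and a real study would need a much larger, properly sampled collection of texts.

```python
import re

def _phrase_pattern(phrase):
    # Case-insensitive, whole-word match for the literal phrase.
    return re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)

def count_phrase(documents, phrase):
    """Total number of times the phrase occurs across all documents."""
    pattern = _phrase_pattern(phrase)
    return sum(len(pattern.findall(doc)) for doc in documents)

def document_frequency(documents, phrase):
    """Number of documents containing the phrase at least once.

    This is closer in spirit to a search engine's result count,
    which counts pages rather than individual occurrences.
    """
    pattern = _phrase_pattern(phrase)
    return sum(1 for doc in documents if pattern.search(doc))

# Hypothetical three-document corpus, for illustration only.
corpus = [
    "Different from what I expected.",
    "It is different than the others, and different from mine.",
    "Nothing relevant here.",
]

for phrase in ("different from", "different than"):
    print(phrase, count_phrase(corpus, phrase), document_frequency(corpus, phrase))
```

Unlike a search engine's estimate, these numbers can be checked and reproduced by anyone holding the same corpus.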
Note added 14 Feb 2011:
This problem was already around in 2005. See these blog posts by Jean Véronis 1 and Mark Liberman 2. It looks like anyone who wants to use Google counts for linguistics research needs to prove them