A Foundation of Linguistics Trembles

Xkcd, the comic, often runs numerically oriented cartoons.  But recently, on the xkcd blog, its author announced a real problem that calls a fair amount of real linguistic work into question.

The problem seems to be that those useful result counts provided by Google may not be at all reliable.

How many uses of "zombie lawyers"?

He writes:

The “number of results” count that Google gives when you search is clearly fabricated.  This is clear for a few reasons.  When Google says this:

Excellent!  That's a lot!

You can tell that it’s wrong first by scrolling to the end of the results.  When you get to page 32, it suddenly becomes:

I learned in AP Calculus that 316 is WAY less than 190,000.

This doesn’t usually matter, since nobody looks much past the first few pages of results, but it’s annoying if you’re trying to use the number of results as a measure of something.

But the problem is that these numbers have actually been used for linguistic research, because it is vastly easier to use Google’s numbers than to collect a corpus of data and count the occurrences on your own.  Please read the blog post for more details.
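
For comparison, counting occurrences yourself need not be elaborate once you have some text in hand. The following is only a minimal sketch in Python: the three-sentence corpus is made up for illustration, and the function name count_phrase is my own invention, not anything from the xkcd post or from Google. It counts case-insensitive, whole-word matches of a phrase, allowing any whitespace between the words.

```python
import re


def count_phrase(texts, phrase):
    """Count case-insensitive, whole-word occurrences of `phrase` in `texts`.

    Hypothetical helper for illustration only. Words in the phrase may be
    separated by any run of whitespace (including a line break) in the text.
    """
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)


# A made-up toy corpus; a real study would load actual documents instead.
corpus = [
    "Zombie lawyers filed the brief at midnight.",
    "Nobody expected the zombie lawyers, or the zombie\nlawyers' fees.",
    "A zombie lawyer works alone.",
]

count = count_phrase(corpus, "zombie lawyers")
print(count)
```

Unlike a search engine's estimate, a count like this can be checked by hand against the text, which is the whole point. The hard part, of course, is collecting a corpus large and representative enough to matter.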

Note added 14 Feb 2011:

This problem was also around in 2005.  See these blog posts by Jean Véronis (1) and Mark Liberman (2).  It looks like anyone who wants to use Google counts for linguistics research needs to prove them