Predicting Rare Events


There’s a certain class of rare events that people want to predict, but cannot.  These involve death, earthquakes, and (this week) disk crashes.  Now, obviously, I lie.  It’s easy to predict an earthquake: all you have to do is say “Earthquake tomorrow!” every day, and sooner or later you’ll be right.  If you live in San Francisco, you’ll be right every few decades.  So, it is possible to predict, and successfully, too.  The only problem is all the trouble you cause with those thousands of failed predictions before the earthquake actually happens.

What this is going to turn into is a discussion of false negatives and false positives in statistics.  Tomorrow, or next week, I’ll put a linguistic spin on it, but today I’ll talk about disk drives.

Recently, I installed the new Ubuntu release (Karmic Koala, 9.10) on my computers at home.   When I did the first one, it told me that its disk was failing.   Repeatedly, each time I logged in, I got a big warning message.  I actually believed it.   I threw out the disk and spent two hours re-installing software on a new disk, which promptly started producing failure warnings too.  Was I cursed?

As it turns out, no.  Rather, there was a badly written piece of software called “palimpsest” in the new Ubuntu which took it upon itself to warn me about possible disk failures.  When I say “badly written”, I don’t want to malign the author in terms of software engineering: he/she is probably rather better at it than I am.  But, I do mean to malign him/her for not thinking about the statistical basis of the software, and, in the end, writing a program that probably should not have been written.

Why not?  Because of the false positives and the costs.

As it turns out, hard disks last for 30 years on average.  This is an important number, because it puts an upper limit on the value of a prediction.  Under the most optimistic assumptions, it cannot possibly save my data more often than the disk fails: once every 30 years.  So, assuming I back things up once a week, it cannot save me more than a week per failure, or about three weeks’ work over a lifetime of computing.
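Concretely, here is that upper-bound arithmetic as a sketch (the ninety-year lifetime-of-computing figure is my own assumption, added just to make the numbers explicit):

    # Upper bound on what a perfect disk-failure predictor can save.
    # Assumptions: ~30-year mean disk life, weekly backups, and (my
    # guess) about 90 years of computing in a lifetime.
    disk_life_years = 30
    backup_interval_days = 7    # at most one backup interval of work at risk
    lifetime_years = 90         # assumption, not a measurement

    failures = lifetime_years / disk_life_years        # about 3 failures
    max_saved_days = failures * backup_interval_days   # about 21 days
    print(f"At most {max_saved_days:.0f} days, i.e. {failures:.0f} weeks, saved")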

Actually, rather less than that.  I back up important stuff more often than once a week, these days.  Moreover, I’ve lived through some data losses, and it takes rather less than a week to recreate a week’s work.  Usually, you remember some of it and you can avoid the dead-ends you followed the first time around.

But, even so.  Suppose it saves a week (five working days) every thirty years, or one day every six years.  How much are its warnings costing me?

Well, if I believe it and replace the disk, I spend about a day’s work on that task.  Buy the new disk, back up the data, insert the new disk, copy the data across, remove the old one.  (Then, as last time, find that it doesn’t boot.  A few hours of struggle show that the new disk has a different name: /dev/sda instead of /dev/hda.  I ended up manually rejiggering a bunch of configuration files in /etc.) It was indeed a full day of work, with a certain risk of massive data loss.  (I always wonder if I’m copying the data from the old disk to the new disk or from the empty new disk to the old disk...)

Thus, if the software gives me a warning more often than every 6 years, it’s a losing proposition.  I end up spending excess time replacing disks that aren’t about to fail. And, that only applies if it eventually gives a correct and useful warning before each disk failure.  (By useful, I mean a few days in advance so I have a chance to copy data off the failing disk.)

If it were to miss half of the disk failures, it would be a loss if it warned me more often than every 12 years.   And, if it were to miss 80% of the disk failures, it would be a loss if it even warned me once in the 30-year life of a disk.
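To spell out that break-even logic, here is a small sketch (the function and its framing are mine; palimpsest computes nothing of the sort):

    # Break-even warning interval: if warnings arrive more often than
    # this, the predictor wastes more time than it saves.
    def break_even_years(detection_rate, disk_life_years=30,
                         days_saved_per_catch=5, days_cost_per_warning=1):
        # Expected days saved per year of disk operation:
        saved_per_year = detection_rate * days_saved_per_catch / disk_life_years
        return days_cost_per_warning / saved_per_year

    print(break_even_years(1.0))   # catches every failure:  6.0 years
    print(break_even_years(0.5))   # misses half:           12.0 years
    print(break_even_years(0.2))   # misses 80%:            30.0 years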

It’s just a balance of time saved vs. time wasted.  Like an earthquake warning.  It’s such an annoyance if you have to evacuate San Francisco every time there’s a false earthquake prediction.  It may be best never to make the prediction at all.  Mathematically, if the cost of a false positive times the rate at which false positives happen exceeds the saving from a correct prediction times the rate of correct predictions, then you’re better off never making a prediction.
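As a one-line decision rule (generic; the names are mine, and the rates are per unit time):

    def worth_predicting(false_alarm_rate, cost_per_false_alarm,
                         hit_rate, saving_per_hit):
        # Predict only if expected savings beat expected false-alarm costs.
        return hit_rate * saving_per_hit > false_alarm_rate * cost_per_false_alarm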

So, palimpsest’s developer really ought to think about costs and failure rates, and decide whether or not palimpsest ought to exist at all.  The software might be simply a cause of trouble.  I rather suspect that it is, actually.

Notes:  A bug report on palimpsest, actual measurements by Pinheiro, Weber, and Barroso, and an attempt to predict drive failures by Hughes, Murray, Kreutz-Delgado, and Elkan.