Data is sacred, but…


One of the things that came out in the CRU climate leak was some code (specifically osborn-tree6/briffasep98d.pro) that people have made much fuss over.  It potentially plots some fudged curves.   But, it doesn’t actually plot them.  Tim Lambert points out (here) that the part that does the plotting was commented out, so no fudged plot would be produced.  Moreover, if you look at the papers they actually published, the potential fudged plot isn’t there.   It wasn’t published.

Now, I think this CRU leak is more smoke than fire.  I’ve not seen any evidence of scientific fraud, and I’ve now looked at a fair chunk of the hacked e-mails and code. [Note that this is a cautious statement: it doesn’t say I’ve seen evidence that there was no fraud, and it doesn’t say that I approve of everything they did.]

But, I don’t want to talk about CRU here.  I want to talk about modifying data.  In fact, I want to sing the praises of modifying data, of simulating data, and of imagining data, at least when it is kept properly separate from real data.

In praise of imaginary data

Computers are very powerful tools, but like any power tools, they are dangerous.  Suppose that you write some interesting code and then depend on it for a decade.  Maybe it’s a climate model, maybe it’s a fancy analysis tool, maybe it’s just some scruffy subroutine that finds odd perfect numbers fast and efficiently.  But the point is, you use it regularly and it is part of all your published papers.  What happens if it’s wrong?  [This is the moral equivalent of losing a fingertip to your circular saw.] You’re in trouble.  If you’re unlucky and it is wrong in an important way, you may have wasted a decade; if you’re lucky, it won’t make an important difference but a lot of work may still have to be re-checked by someone.

So, how do you avoid this?  Two ways: one is the computer science route, and the other involves testing how your code behaves.

Tricks from computer science

This involves a bunch of things that work quite well for “normal” programming:

  • Break your code into small, independent modules,
  • Test each module separately before you use it,
  • Write cleanly, so that errors are as obvious as possible,
    • Use object-oriented design if practical,
    • Use a language (like python) that does much of the dirty work for you (if the slow speed doesn’t cause too much trouble),
  • Get other people to look at your code,
  • Turn on all possible warning messages on your compiler and use tools like lint (for C) or pychecker (for python) that find many common errors,
  • Use existing libraries when possible to minimize the amount of new code you need to write,
  • Write extensive error-checking into your code.  [The “assert” statement is the most useful invention in any programming language.  You say “assert something” and if that something is true, nothing happens.   If it isn’t true, the assert statement crashes the program and lets you dig around to find out what went wrong.  For instance in a computer voting system, you probably want “assert total_votes_cast <= number_of_registered_voters”, since everyone might turn out but nobody should vote twice.]
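
To make the assert idea concrete, here is a minimal sketch in python.  The function name tally_votes and the shape of the ballots list are my own inventions for illustration; only the two variable names come from the voting example above.  I use “<=” rather than “<” so that 100% turnout is allowed.

```python
def tally_votes(ballots, number_of_registered_voters):
    """Count ballots per candidate, with a sanity check on the total.

    ballots is assumed to be a list of candidate names, one per ballot.
    """
    totals = {}
    for candidate in ballots:
        totals[candidate] = totals.get(candidate, 0) + 1

    total_votes_cast = sum(totals.values())
    # If this fails, something upstream is badly wrong: stop right here
    # rather than report a result.
    assert total_votes_cast <= number_of_registered_voters, (
        f"{total_votes_cast} votes cast but only "
        f"{number_of_registered_voters} registered voters"
    )
    return totals
```

One caveat: python skips assert statements when it is run with the -O flag, so checks that must never be turned off are better written as an explicit “if” that raises an exception.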

All of these are good and should be done where practical, but they aren’t always practical.  For one thing, scientific computations often involve big, messy formulae involving 10 variables and they do not always break up into sensible parts.  Often, you need the speed, so the helpful interpreted language or the libraries that would simplify your code are not an option.

Object-oriented code can make things even harder to read if the computation doesn’t have parts that naturally behave like objects.  For instance, consider a program that models speech.  Words make pretty good objects from many points of view, but do syllables?  It’s tempting to treat a syllable as an object in your code and to say that the word “give” contains 1 syllable object, but people don’t always talk that way.  For instance “Did you give me a text?” can be said many ways, including “D’yagimmea text?” where the syllables have all smeared into each other and it’s not at all clear where one syllable begins and ends, or which words contain which syllables, or even how many syllables there are.  [Phonemes are even less object-like, and just think about an object-oriented quantum description of an electron if you really want to envision perverse code…]

The same goes with modular design.  It works well if the problem naturally divides into modules, but the real world doesn’t always do so.  [Computer scientists have the advantage here.  First, they manufacture their own world to a degree.  Second, they deal with people, and people love to think of things as objects.  Scientists need to deal with Mother Nature, and she doesn’t have a special preference for objects.]

Anyway, the computer science approach has its limits.  This should be obvious to anyone who uses software: just about all known software has known bugs.  For instance, Debian, one of the more important Linux distributions, has 79047 known bugs today on 13007 software packages.  What you have to do is to test your code;  you cannot depend on all your errors becoming obvious through clean programming practices.

Testing your code

To test your code, you need to run it on some data where you know the answer.  But, obviously, you don’t know what the answer is for your real experimental data.  If you knew the answer, why would you waste your time doing the experiment?  [Of course, you may (almost) know the answer for your control group.  That can provide one useful test of your analysis software, but it is only one test.  Also, you don’t entirely know the answer for your control group: you may think you do, but your subjects may have other ideas.  Think about the placebo effect in medicine: the control group gets sugar pills, but some of them get better because they think they have been treated…]

So, this is where imaginary data comes in.  You need to test your code in many ways, and test it on data that you design so that you know the answer.  It’s as simple as that.

And you’ll want some of the test data to be similar to your real data.  You’ll need to test your code under realistic conditions.  One way to get imaginary data that’s close to your real data is to copy the real stuff and change it a bit.  To any outsider, this will look much like fraudulent data, except hopefully you will have the sense to call the files something like “fake3123.dat” and put headers in the file that say “# Modified data: for testing only.”
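
Here is a rough sketch of the copy-and-change-a-bit approach, with the labelling built in so that you can’t forget it.  Everything in it is an assumption made for illustration: the function names, the one-number-per-line file layout, and the choice of a small random jitter as the “change”.  Only the header text comes from the paragraph above.

```python
import random

FAKE_HEADER = "# Modified data: for testing only."

def make_fake_copy(real_values, jitter=0.05, seed=0):
    """Return a copy of real_values with each value nudged by up to
    +/- jitter (as a fraction of the value).  Seeded, so it is repeatable."""
    rng = random.Random(seed)
    return [v * (1.0 + rng.uniform(-jitter, jitter)) for v in real_values]

def render_fake_file(values):
    """Build the text of a data file whose first line marks it as fake."""
    lines = [FAKE_HEADER] + [repr(v) for v in values]
    return "\n".join(lines) + "\n"

def looks_fake(text):
    """True if the file text starts with the fake-data header."""
    lines = text.splitlines()
    return bool(lines) and lines[0].strip() == FAKE_HEADER
```

You would then write the result of render_fake_file to something like “fake3123.dat”, and have whatever script produces your publication figures refuse any input for which looks_fake is true.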

The commonest and best use of modified data is to check that your analysis could detect an effect, if an effect were there.   Suppose you are measuring earthquakes and you want to make sure your software can reliably detect a magnitude 2 (i.e. small) earthquake amidst the traffic noise.   No problem!   Take some traffic noise and add in a fake earthquake, then run it through your software.   Hopefully, it will flash its lights and announce a detection.  [In speech, if you’re trying to detect prosodic prominence, you modify some speech to include false prominences, then run your analysis.  The interesting bit is that you can choose the acoustic properties that make your fake prominence.  Deciding what to simulate can be a sophisticated theoretical exercise because not everyone will agree on what the properties of a prominent syllable should be.]
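
The earthquake check might look something like the sketch below.  It is a toy, and nearly everything in it is assumed: the “traffic noise” is just Gaussian random numbers standing in for a real recording, the fake quake is a decaying sine burst in arbitrary amplitude units (not a real magnitude), and the detector is a crude one that compares the loudest window against the typical window.  Your real detector would take its place.

```python
import math
import random

def traffic_noise(n, seed=0):
    """Stand-in for recorded traffic noise: unit-variance Gaussian samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def add_fake_quake(signal, start, amplitude, length=200, period=20.0):
    """Return a copy of signal with a decaying sine burst added at start."""
    out = list(signal)
    for i in range(length):
        if start + i >= len(out):
            break
        burst = amplitude * math.exp(-i / 50.0) * math.sin(2.0 * math.pi * i / period)
        out[start + i] += burst
    return out

def window_rms(signal, window):
    """RMS of each consecutive, non-overlapping window of the signal."""
    return [
        math.sqrt(sum(x * x for x in signal[i:i + window]) / window)
        for i in range(0, len(signal) - window + 1, window)
    ]

def detects_event(signal, window=50, threshold=2.0):
    """Toy detector: fires if the loudest window is more than
    threshold times louder than the median window."""
    rms = sorted(window_rms(signal, window))
    median = rms[len(rms) // 2]
    return rms[-1] > threshold * median

noise = traffic_noise(2000)
with_quake = add_fake_quake(noise, start=1000, amplitude=10.0)
print("noise only:", detects_event(noise))
print("noise + fake quake:", detects_event(with_quake))
```

Because you chose the amplitude yourself, you can also turn it down step by step and see roughly where the detector stops noticing, which tells you how small an effect your analysis can find.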

Suppose you want to make sure your earthquake-detection software doesn’t detect convoys of trucks?   Well, one way is to rent several trucks, load ’em up, and drive down the highway.  You may want to do that eventually, but the sensible first step is to fake up a convoy by adding together the signals from several trucks.  When you feed that into your software, it ought not to report an earthquake.   Then, after you have gained some confidence, you can spend some money and rent some real trucks if you need to, just to make sure.
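
The opposite check, that the software stays quiet when it should, can be sketched the same way.  Again this is all assumption: I made up the truck signature (a slow rumble that swells and fades smoothly), and the detector is a toy that fires only on an abrupt sample-to-sample jump.  With real recordings you would add together actual single-truck signals instead.

```python
import math

def truck_signal(n, center, amplitude=3.0, width=100.0, period=60.0):
    """Hypothetical truck signature: a slow rumble that swells as the
    truck approaches sample index center and fades as it leaves."""
    return [
        amplitude
        * math.exp(-((i - center) ** 2) / (2.0 * width ** 2))
        * math.sin(2.0 * math.pi * i / period)
        for i in range(n)
    ]

def fake_convoy(n, centers):
    """Add together one truck signal per entry in centers."""
    convoy = [0.0] * n
    for center in centers:
        for i, value in enumerate(truck_signal(n, center)):
            convoy[i] += value
    return convoy

def sharp_onset_detector(signal, threshold=2.0):
    """Toy detector: fires if any sample-to-sample jump exceeds threshold."""
    return any(abs(b - a) > threshold for a, b in zip(signal, signal[1:]))

# Five trucks, 100 samples apart: louder than any single truck, but smooth.
convoy = fake_convoy(1000, centers=[300, 400, 500, 600, 700])
print("convoy triggers detector:", sharp_onset_detector(convoy))
```

It is worth running the same detector on a signal that does contain a fake quake as well, so you know it is staying quiet because the convoy is harmless and not because it never fires at all.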

So, that’s imaginary data.  Good, useful stuff.  No one told me this when I started science.  I had to discover it for myself, and it felt a bit dirty.  But, that’s because I’m an old timer and complex software analyses weren’t a big part of science when I started.  It wasn’t a necessary technique for many people.

It wasn’t really dirty then, either.   The only mortal sin in science is publishing something knowingly misleading or lying about what you did.  That hasn’t changed.  Imaginary data is only a problem when it leaks into a publication; it’s not a problem on your disk unless you forget that it is imaginary.  Or, perhaps if someone breaks in…