Doing statistics first


I have been teaching statistics to linguists for a while now, and it has become obvious that many linguists look at statistics as something that one does almost at the very end.  Statistics tends to get done while one is writing, after the experiment is over and just before the paper is finished.

When I get called in as an advisor, I sometimes end up saying one of three difficult things: it’s not significant, you’re doing the wrong analysis, or you don’t have enough data to draw any real conclusions.  This essay will look at the first case.

It’s not significant.

“It looks like you did it right; there really isn’t much of an effect.”

In principle, this isn’t a problem at all.  A well-designed experiment that fails to find the expected result can be important.  It can often mean that the expected effect isn’t there, or at least that it is a lot smaller than people thought.  In an ideal world, this would be OK; if the expectation came from a theory, then the null result would be a black mark against the theory.

But, we’re not in an ideal world, at least in linguistics.  Partly, the field has a subtly wrong culture when it comes to experiments.  People plan experiments to support a theory, and linguists are taught the primacy of theory, so they walk into the analysis with strong expectations and are humanly reluctant to change their minds.  [Even at Oxford, where we have a lot of experimental research in the linguistics department, the students’ readings are still heavily weighted towards theoretical accounts of language, and undergraduates read few first-hand experimental studies.] And there’s politics involved, or at least the perception of politics.  Few students or new post-docs have much desire to go against established opinion.  [Perhaps they worry that rocking the boat might affect future job prospects, or perhaps they just don’t trust themselves sufficiently.]

In an ideal world, it would be easy to publish “no significance” papers and people would notice them.  But, in the real world, people will assume that if you get the expected result, then you did your experiment properly; if you get an unexpected result, reviewers will look at the paper more carefully.  Bad reviewers might disbelieve it simply because it gets the “wrong” answer; good reviewers will force you to prove that the methodology is good and the experimental design is clean.  All reviewers will wonder whether your experiment was actually sensitive enough to see the expected effect.

Even assuming everything was done right, this makes for more work and a delayed paper.  If a paper comes back from the reviewers with questions and demands for improvements, it’ll mean another round of review, and that will delay publication by a few months.  That delay can matter a lot to an early-career researcher who needs publications on his/her CV in order to go job hunting.  So, there’s a lot of incentive to find some kind of significant result.

The right way to deal with this bias toward significant results is to design your experiment so that it is almost certain to yield one.  There are two strategies that work and can give good science.  One approach is to step away from the psychology model of a binary effect/no-effect experiment.  Design the experiment so that the interesting question is not “does the effect exist?” but rather “how big is the effect?”  Then plan the experiment to be big enough so that you can get a reasonably accurate measurement.
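To make that concrete, here is a rough sketch of the kind of back-of-the-envelope planning I mean.  It is in Python purely because that is a convenient choice; the two-group design, the guessed standard deviation, and the target precision are all illustrative assumptions, not anything from a real study.  The question it answers is: how many subjects do I need before the confidence interval around the estimated difference is usefully narrow?

    # A minimal sketch of planning for precision rather than significance.
    # Assumed for illustration: a two-group design, roughly normal
    # measurements, a guessed standard deviation, and a target half-width
    # for the confidence interval on the group difference.
    import math
    from scipy.stats import norm

    def n_per_group(sigma, half_width, confidence=0.95):
        """Subjects per group so that the confidence interval for the
        difference of two group means is about +/- half_width
        (known-sigma approximation)."""
        z = norm.ppf(1 - (1 - confidence) / 2)   # e.g. 1.96 for 95%
        # half_width = z * sigma * sqrt(2 / n), solved for n:
        return math.ceil(2 * (z * sigma / half_width) ** 2)

    # Example: measurements with a standard deviation of about 100 ms,
    # and we want the group difference pinned down to within +/- 25 ms.
    print(n_per_group(sigma=100, half_width=25))   # about 123 per group

The exact numbers are invented; the point is that the calculation happens before anyone is recruited.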

The other approach is to design your experiment so that you can measure more than one thing at a time.  Suppose you can measure three things.  Then (if we assume each is an independent 50-50 chance at a significant result) there’s only a 1-in-8 chance of nothing turning up significant.  You can write a paper that says X is significant, and in the discussion section, you can happily discuss why Y and Z are not, and perhaps what this says about the differences between X, Y, and Z.
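For what it’s worth, the 1-in-8 figure is just ½ × ½ × ½.  Here is the same arithmetic in a couple of lines of Python, generalized to any number of measurements; the 50-50 assumption is the same deliberately crude one as above.

    # Chance that none of k measurements comes out significant, if each one
    # is an independent shot with probability p_each of being significant.
    def p_nothing_significant(k, p_each=0.5):
        return (1 - p_each) ** k

    print(p_nothing_significant(3))        # 0.125, i.e. 1 in 8
    print(p_nothing_significant(3, 0.8))   # 0.008, if each test is well powered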

Of course, if you get too enthusiastic about this approach, it can lead to trouble.  I’ve seen a paper where there were 50 statistical tests on different sections of a rich data set, each done well.  But it all fell apart in the discussion section, where the author could pick and choose combinations of results to interpret.  Had the author been badly intentioned, the paper could have been a meaningless exercise in selecting whatever results were necessary to support the author’s pre-conceived opinions.  In reality, it was done honestly enough, but it still left me with no idea how far I should believe its conclusions.  Could the author honestly have come to different conclusions by combining the tests differently?  Probably.  The opposite conclusion?  I don’t think so, but it was hard to tell…  Statistical tests should be as close to the final operation as possible.  Don’t make them an intermediate stage of the analysis.

Everyone knows the wrong things to do: fudging the data, selecting the data, or doing multiple statistical tests without Bonferroni corrections (or, strictly speaking, doing any unplanned statistical tests).  Selecting data should be avoided, if possible.  Often, you can use robust statistical techniques to make an occasional outlier unimportant.  If you cannot design an analysis that’ll work with anything nature provides, you should define rules – in advance – that specify what data is acceptable.  But, in human experiments, sometimes unexpected things happen, such as the subject who thinks your carefully designed sentences are funny and bursts out laughing in the middle of the experiment.  Drop him, if you want, but publish what you did.  But, once the analysis has progressed far enough that you could know how each data point would affect the result, data selection ends.
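In case it is useful, here is what a Bonferroni correction looks like in practice.  I am assuming Python and the statsmodels library, and the p-values are made up for illustration; the correction itself just multiplies each p-value by the number of planned tests.

    # A minimal sketch of a Bonferroni correction for a few planned tests.
    # The library choice and the p-values are assumptions for illustration.
    from statsmodels.stats.multitest import multipletests

    p_values = [0.012, 0.030, 0.240]   # hypothetical p-values for X, Y, Z
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05,
                                             method='bonferroni')
    print(reject)       # which null hypotheses survive the correction
    print(p_adjusted)   # each p-value times the number of tests, capped at 1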

Doing it right means bringing statistics in early, at the design stage of the experiment.  You need to figure out how big the experiment needs to be, and that happens before you collect any data.
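Concretely, that pre-experiment sizing is an ordinary power calculation.  Here is a sketch, again in Python with statsmodels; the effect size is a guess of the kind you would take from pilot data or the literature, not a number from any particular study.

    # A sketch of answering "how big?" before any data are collected:
    # a power calculation for a two-sample t-test with an assumed effect size.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(effect_size=0.4,  # Cohen's d, a guess
                                              alpha=0.05,
                                              power=0.8)
    print(round(n_per_group))   # roughly 100 subjects per group for these numbers

If that number is far bigger than anything you can afford to run, better to find out now than in the discussion section.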