Greg Kochanski

What does it mean when you say that a statistical test was "significant at the P < 0.05 level"? Oddly enough, it's possible to have a disagreement even on this little bit of mathematics. This is a lightly-edited transcript of an e-mail exchange.

There is a problem in Table 1 and Table 2.

One would not expect the SDT model to fail more often at the
P<0.01 level than at the P<0.05 level, yet that seems to
be the case on a number of conditions.

In fact, one can be stronger: anything that fails at the
P<0.01 level will certainly fail at the P<0.05 level. So,
the proportion of unacceptable fits *must* be larger for
P<0.05.

So, something is wrong. Hopefully, the numbers are just put in
the wrong columns, but it needs checking.

It's a well tested program, and I am not defending it. I only
report what it says about OUR simulations.

Well, perhaps its output is a little misleading?

The program probably does what all psychologists do. It breaks
up the range into 0.001 < P < 0.005,
0.005 < P < 0.01,
0.01 < P < 0.05, and
P > 0.05, and then it reports which part you fell into.
So, what it probably means when it says "P < 0.05"
is *really* "0.01 < P < 0.05".

That's OK (if you make it clear what it's doing), but it has to
be said explicitly in the paper or in the table heading. For
instance, I expect that events you label
"P < 0.05" should occur by chance 5% of the time,
not 4%.
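The difference between the two readings can be sketched in a few lines. This is only an illustration: the names `THRESHOLDS`, `bin_counts` and `cumulative_counts` are hypothetical, the thresholds are the ones guessed at above rather than anything taken from the actual program, and the P values are made up.

```python
# Hypothetical thresholds, strictest first (guessed from the description above,
# not taken from the real program).
THRESHOLDS = [0.001, 0.005, 0.01, 0.05]

def bin_counts(p_values):
    """Count each P value under the strictest threshold it falls below (exclusive bins)."""
    counts = [0] * len(THRESHOLDS)
    for p in p_values:
        for i, threshold in enumerate(THRESHOLDS):
            if p < threshold:
                counts[i] += 1
                break
    return counts

def cumulative_counts(counts):
    """Turn exclusive bin counts into true 'P < threshold' counts."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out

p_values = [0.006, 0.007, 0.009, 0.03, 0.2]  # made-up illustration
binned = bin_counts(p_values)
print(binned)                     # [0, 0, 3, 1]
print(cumulative_counts(binned))  # [0, 0, 3, 4]
```

With exclusive bins the "0.01" entry (3) can exceed the "0.05" entry (1), which is the pattern that looked wrong in the tables. The cumulative counts can never decrease from one threshold to the next.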

I don't know quite what you're doing, but if it is correct, it
certainly confuses one of your readers (i.e. me).

Evidently.

Therefore the paper needs fixing, because a paper is intended
to be a device for communicating knowledge. In your small
sample of test readers (i.e. me), that section fails 100% of
the time.

You're repeating a common, faintly incorrect practice. To do
the statistics correctly, you define the significance level
before the data is collected. Then, you test to see if you
reach the significance level or not. If you then go on and
report a better significance level, that's going beyond the
test; it's bragging.

That's when you're testing hypotheses. We're concerned with
goodness of fit. If you set .05 for that and get .01, why not
report it? It shows an even worse fit. It's NOT bragging; it's
reporting a fact.

There's nothing horribly wrong with reporting it, but if it's
read carelessly, it will over-inflate the significance of a
test.

Consider an exaggerated case where we report 0.05, 0.04, 0.03,
0.02, 0.01 probability levels. We'll do 100 identical
statistical tests where the null hypothesis is true. So, we
expect that 5 of them will report a result significantly
different from H0. Right? Now, ask yourself: what significance
levels do you expect to find? They won't all be near 0.05!

Of the 5 "significant" results, you expect one in
0.04<P<0.05, one in 0.03<P<0.04, one in
0.02<P<0.03, one in 0.01<P<0.02 and one in
0<P<0.01. So, when you write the paper, you write that
one is significant at the 0.05 level, one at the 0.04 level,
... and one at the 0.01 level.

Then people will be tempted to say "Wow! He's getting better
than a 5% significance level. Hmm. Five successes at the, uh,
0.03 level out of 100... Maybe there's something real in that
data!" (Unfortunately, there really isn't.)
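A quick simulation shows the same thing. It is a rough sketch that assumes the P values of a true null hypothesis are uniformly distributed between 0 and 1 (which holds for a continuous test statistic); the names `LEVELS` and `count_by_level`, the seed, and the number of repeats are arbitrary choices made for illustration.

```python
import random

LEVELS = (0.01, 0.02, 0.03, 0.04, 0.05)  # the exaggerated reporting levels

def count_by_level(p_values):
    """Count each P value under the strictest level it reaches; ignore P >= 0.05."""
    counts = {level: 0 for level in LEVELS}
    for p in p_values:
        for level in LEVELS:
            if p < level:
                counts[level] += 1
                break
    return counts

rng = random.Random(0)  # fixed seed, chosen arbitrarily
n_tests = 100           # tests per "paper", all with H0 true
n_repeats = 2000        # number of simulated papers

totals = {level: 0 for level in LEVELS}
for _ in range(n_repeats):
    p_values = [rng.random() for _ in range(n_tests)]  # uniform P values under H0
    for level, count in count_by_level(p_values).items():
        totals[level] += count

average = {level: total / n_repeats for level, total in totals.items()}
print(average)  # each level averages close to one result per 100 tests
```

Every reporting level collects about one "significant" result per 100 tests, even though nothing real is in the data.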

However, if you restrict yourself to the standard 0.05, 0.01,
0.005, 0.001 set, the harm is less, but not zero. In that same
example of 100 trials, you'd expect to get one at the
P<0.01 level. No big surprise there for anyone with a little
statistical knowledge. But, 10% of researchers will be lucky
enough to get one of the results at the P<0.001 level,
following the same protocol. And, boy, will they brag! People
will say "Well, yes, you do expect 5 results to turn up true
just by chance, but look! One of them is at the 0.001 level.
That one must be real..."
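The arithmetic behind the "10%" figure is short enough to check directly. The sketch assumes the 100 tests are independent and that H0 is true for all of them; `prob_at_least_one` is just an illustrative helper name.

```python
def prob_at_least_one(alpha, n_tests):
    """Chance that at least one of n_tests independent null tests reaches P < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(100 * 0.01)                               # expected results at P < 0.01: 1.0
print(round(prob_at_least_one(0.001, 100), 3))  # 0.095, i.e. roughly 10%
```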

Basically, it's a bit of grade inflation. (And I do it myself.)
Maybe it's best thought of as using the data twice: once to
test H0, another time to estimate how significant the result
is...

Reporting significance levels beyond your tested level is just a
way of bragging.

The entire definition of the P<0.05 significance level is
that significant events happen by chance 5% of the time. (If
the situation meets the correct assumptions.) In fact, this is
the *only* requirement. (For instance, there are an infinite
number of different two-tailed 95% confidence intervals, ranging
from 0% in the left tail to 5% in the left tail, and everywhere
in between. It's just convention that we stick with the
completely symmetric and completely asymmetric confidence
intervals.)
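To see that family of intervals explicitly, here is a small sketch for a standard Gaussian. The function name `interval` and the particular left-tail values are illustrative assumptions; any split of the 5% between the two tails gives an interval holding 95% of the probability.

```python
from statistics import NormalDist

def interval(left_tail, alpha=0.05, dist=NormalDist()):
    """Interval with `left_tail` probability below it and alpha - left_tail above it."""
    right_tail = alpha - left_tail
    lower = dist.inv_cdf(left_tail) if left_tail > 0 else float("-inf")
    upper = dist.inv_cdf(1 - right_tail) if right_tail > 0 else float("inf")
    return lower, upper

for left_tail in (0.0, 0.01, 0.025, 0.04, 0.05):
    lo, hi = interval(left_tail)
    print(f"left tail {left_tail:.3f}: ({lo:.3f}, {hi:.3f})")
```

The 0.025 case is the usual symmetric interval, and the 0.0 and 0.05 cases are the two one-tailed versions. Everything in between is equally valid as far as the 95% requirement goes.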

I agree. The .05 and .01 levels are conventions. But where have
you found anything in the literature that uses an UNBALANCED
2-tail test?

Defining the shape of a confidence "interval" is a real issue
when you want to do confidence intervals on a 2-D plane. Then
the shape of the confidence interval is fairly obviously
undefined. If you happen to have multivariate Gaussian data and
your analysis is linear, then it's reasonable to make your
confidence intervals into ellipses, but otherwise, the shape is
arbitrary.

People often make a confidence region that follows a contour of
probability density (if they can), or they try to find the
confidence region with the smallest possible area. Either way,
you define the P<0.05 confidence interval to include 95% of
the probability.

The smallest-area confidence interval is also relevant in the
1-dimensional case. It then becomes the shortest confidence
interval that contains 95% of the probability. If your
probability distribution is symmetrical, then it is just your
standard two-tailed symmetrical confidence interval. However,
if you end up with non-symmetrical probability distributions,
then the shortest interval is asymmetric.

Suppose you were fitting y = log(a) * x to
some (x,y) data. Then, *a* would have a log-normal
distribution and it could be strongly asymmetrical if
stdev(log(a))>1.

Generally speaking, if you do a nonlinear coordinate transform,
you end up with a different (non-equivalent) confidence
interval.
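A rough numerical sketch of the log-normal case follows. It assumes log(a) is Gaussian with mean 0 and standard deviation 1 (values picked only for illustration) and finds the shortest 95% interval by a simple grid search over how much probability is left in the lower tail. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_quantile(q, mu=0.0, sigma=1.0):
    """Quantile of a when log(a) is Gaussian with mean mu and standard deviation sigma."""
    return math.exp(NormalDist(mu, sigma).inv_cdf(q))

def interval_for_left_tail(left_tail, alpha=0.05, sigma=1.0):
    """Interval for a that leaves `left_tail` probability below it and covers 1 - alpha."""
    lower = lognormal_quantile(left_tail, sigma=sigma)
    upper = lognormal_quantile(left_tail + 1 - alpha, sigma=sigma)
    return lower, upper

def shortest_interval(alpha=0.05, sigma=1.0, steps=1000):
    """Grid search over the left-tail probability for the shortest interval."""
    best = None
    for i in range(1, steps):
        left_tail = alpha * i / steps
        lower, upper = interval_for_left_tail(left_tail, alpha, sigma)
        if best is None or upper - lower < best[1] - best[0]:
            best = (lower, upper, left_tail)
    return best

sym_lo, sym_hi = interval_for_left_tail(0.025)
short_lo, short_hi, short_tail = shortest_interval()
print("equal tails:", sym_lo, sym_hi, "length", sym_hi - sym_lo)
print("shortest:   ", short_lo, short_hi, "length", short_hi - short_lo)
print("left-tail probability of the shortest interval:", short_tail)
```

Both intervals hold 95% of the probability, but the shortest one puts far less than 2.5% in the lower tail, so it is not the symmetric two-tailed interval.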

The trouble with your definition is that events at the
P<0.05 column only happen 4% of the time. That's wrong.
Period.

You're wrong. Full stop. How do you get such a silly idea?
Defend your position.

Do the integrals. Integrate a Gaussian between the 0.01 and
0.05 confidence levels. Bet you $100 the integral is 0.04. So,
the region you are *calling* a 5% confidence interval only has
4% of probability mass.
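The integral can be checked without doing any calculus by hand. The sketch below assumes a one-tailed test on a standard Gaussian; the two-tailed case gives the same 0.04, just split between the two sides.

```python
from statistics import NormalDist

std_normal = NormalDist()
z_05 = std_normal.inv_cdf(1 - 0.05)  # one-tailed critical value for P = 0.05
z_01 = std_normal.inv_cdf(1 - 0.01)  # one-tailed critical value for P = 0.01

mass_between = std_normal.cdf(z_01) - std_normal.cdf(z_05)  # 0.01 < P < 0.05
mass_beyond_05 = 1 - std_normal.cdf(z_05)                   # everything with P < 0.05

print(round(mass_between, 6))    # 0.04
print(round(mass_beyond_05, 6))  # 0.05
```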

The trouble with your definition is that events in the column
that you label as "P<0.05" only happen 4% of the time.
That's wrong. Period.

I'll do it to satisfy you; the reviewers will take it out. Go
ahead and try to publish it.

That last isn't about correctness. It's a statement about
culture and the sociology of science.

Last Modified Thu May 29 17:23:41 2008