Reporting Confidence Levels

What does it mean when you say that a statistical test was "significant at the P < 0.05 level"? Oddly enough, it's possible to have a disagreement even on this little bit of mathematics. This is a lightly-edited transcript of an e-mail exchange.

One would not expect the SDT model to fail more often at the P<0.01 level than at the P<0.05 level, yet that seems to be the case on a number of conditions.

In fact, one can be stronger: anything that fails at the P<0.01 level will certainly fail at the P<0.05 level. So, the proportion of unacceptable fits *must* be larger for P<0.05.

The program probably does what all psychologists do. It breaks up the range into 0.001 < P < 0.005, 0.005 < P < 0.01, 0.01 < P < 0.05, and P > 0.05, and then it reports which part you fell into. So, what it probably means when it says "P < 0.05" is *really* "0.01 < P < 0.05".
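Here is a minimal sketch of the bucketing I am guessing at. The function name `bucket_label` and the threshold list are my own assumptions for illustration, not anything taken from the actual program, and I have added a "P < 0.001" bucket for values below the smallest threshold.

```python
import bisect

# Conventional thresholds, smallest first (assumed, not read from the program).
THRESHOLDS = [0.001, 0.005, 0.01, 0.05]

def bucket_label(p):
    """Label p with the smallest conventional threshold it falls below."""
    i = bisect.bisect_right(THRESHOLDS, p)
    if i == len(THRESHOLDS):
        return "not significant"
    return f"P < {THRESHOLDS[i]}"

for p in (0.0005, 0.008, 0.03, 0.2):
    print(p, "->", bucket_label(p))
```

Under this scheme a result with P = 0.008 is reported only as "P < 0.01", so it never shows up in the "P < 0.05" count even though it also fails at that level.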

That's OK (if you make it clear what it's doing), but it has to be said explicitly in the paper or in the table heading. For instance, I expect that events you label "P < 0.05" should occur by chance 5% of the time, not 4%.

Therefore the paper needs fixing, because a paper is intended to be a device for communicating knowledge. In your small sample of test readers (i.e. me), that section fails 100% of the time.

You're repeating a common, faintly incorrect practice. To do the statistics correctly, you define the significance level before the data is collected. Then, you test to see if you reach the significance level or not. If you then go on and report a better significance level, that's going beyond the test; it's bragging.

That's when you're testing hypotheses. We're concerned with goodness of fit. If you set .05 for that and get .01, why not report it? It shows an even worse fit. It's NOT bragging; it's reporting a fact.

There's nothing horribly wrong with reporting it, but if it's read carelessly, it will over-inflate the significance of a test.

Consider an exaggerated case where we report 0.05, 0.04, 0.03, 0.02, 0.01 probability levels. We'll do 100 identical statistical tests where the null hypothesis is true. So, we expect that 5 of them will report a result significantly different from H0. Right? Now, ask yourself: what significance levels do you expect to find? They won't all be near 0.05!

Of the 5 "significant" results, you expect one in 0.04<P<0.05, one in 0.03<P<0.04, one in 0.02<P<0.03, one in 0.01<P<0.02 and one in 0<P<0.01. So, when you write the paper, you write that one is significant at the 0.05 level, one at the 0.04 level, ... and one at the 0.01 level.
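If you want to check that arithmetic, here is a rough simulation. It assumes the test statistic is continuous, so that P-values are uniform on [0, 1] when the null hypothesis is true. The function name, the seed, and the number of simulated tests are arbitrary choices of mine.

```python
import random

def count_p_values_by_bin(n_tests, edges, seed=0):
    """Draw P-values under a true null (uniform on [0, 1]) and count them per bin."""
    rng = random.Random(seed)
    counts = [0] * (len(edges) - 1)
    for _ in range(n_tests):
        p = rng.random()
        for i in range(len(edges) - 1):
            if edges[i] <= p < edges[i + 1]:
                counts[i] += 1
                break  # P-values of 0.05 or more fall in no bin and are skipped
    return counts

edges = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05]
n_tests = 200_000  # many more than 100, so the averages are stable
counts = count_p_values_by_bin(n_tests, edges)

# Rescale to "per 100 tests" to match the example in the text.
per_100 = [100 * c / n_tests for c in counts]
for lo, hi, rate in zip(edges, edges[1:], per_100):
    print(f"{lo:.2f} <= P < {hi:.2f}: about {rate:.2f} per 100 tests")
```

Each bin should come out close to one result per 100 tests, with about five in total below 0.05.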

Then people will be tempted to say "Wow! He's getting better than a 5% significance level. Hmm. Five successes at the, uh, 0.03 level out of 100... Maybe there's something real in that data!" (Unfortunately, there really isn't.)

However, if you restrict yourself to the standard 0.05, 0.01, 0.005, 0.001 set, the harm is less, but not zero. In that same example of 100 trials, you'd expect to get one at the P<0.01 level. No big surprise there for anyone with a little statistical knowledge. But, 10% of researchers will be lucky enough to get one of the results at the P<0.001 level, following the same protocol. And, boy, will they brag! People will say "Well, yes, you do expect 5 results to turn up true just by chance, but look! One of them is at the 0.001 level. That one must be real..."
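The 10% figure is a quick calculation, sketched here on the assumption that the 100 tests are independent and the null hypothesis is true in every one of them. The variable names are mine.

```python
n_tests = 100

# Expected number of results below 0.01 when every null hypothesis is true.
expected_below_01 = n_tests * 0.01

# Chance that at least one of the tests lands below 0.001:
# one minus the chance that none of them do.
p_at_least_one_below_001 = 1 - (1 - 0.001) ** n_tests

print(f"expected results with P < 0.01: {expected_below_01:.1f}")
print(f"chance of at least one P < 0.001: {p_at_least_one_below_001:.3f}")
```

The second number comes out a little under 0.1, which is where "10% of researchers" comes from.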

Basically, it's a bit of grade inflation. (And I do it myself.) Maybe it's best thought of as using the data twice: once to test H0, another time to estimate how significant the result is...

The entire definition of the P<0.05 significance level is that significant events happen by chance 5% of the time. (If the situation meets the correct assumptions.) In fact, this is the *only* requirement. (For instance, there are an infinite number of different two-tailed 95% confidence intervals, ranging from 0% in the left tail to 5% in the left tail, and everywhere in between. It's just convention that we stick with the completely symmetric and completely asymmetric confidence intervals.)

I agree. The .05 and .01 levels are conventions. But where have you found anything in the literature that uses an UNBALANCED 2-tail test?

Defining the shape of a confidence "interval" is a real issue when you want to do confidence intervals on a 2-D plane. Then the shape of the confidence interval is fairly obviously undefined. If you happen to have multivariate Gaussian data and your analysis is linear, then it's reasonable to make your confidence intervals into ellipses, but otherwise, the shape is arbitrary.

People often make a confidence region that follows a contour of probability density (if they can), or they try to find the confidence region with the smallest possible area. Either way, you define the P<0.05 confidence interval to include 95% of the probability.

The smallest-area confidence interval is also relevant in the 1-dimensional case. It then becomes the shortest confidence interval that contains 95% of the probability. If your probability distribution is symmetrical, then it is just your standard two-tailed symmetrical confidence interval. However, if you end up with non-symmetrical probability distributions, then the shortest interval is asymmetric.

Suppose you were fitting y = log(a) * x to some (x,y) data. Then, a would follow a log-normal distribution, and that distribution could be strongly asymmetrical if stdev(log(a))>1.
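To make the log-normal case concrete, here is a sketch that compares the usual equal-tailed 95% interval with the shortest 95% interval. It assumes log(a) is Gaussian with mean 0 and standard deviation 1 (values I picked for illustration), and it finds the shortest interval by a crude grid search over how much probability goes in the left tail.

```python
import math
from statistics import NormalDist

def lognormal_quantile(q, mu, sigma):
    """Quantile of a, where log(a) is Gaussian with mean mu and stdev sigma."""
    return math.exp(NormalDist(mu, sigma).inv_cdf(q))

def interval_for_left_tail(left_mass, mu, sigma, coverage=0.95):
    """Interval that leaves left_mass below it and holds `coverage` inside."""
    lo = lognormal_quantile(left_mass, mu, sigma)
    hi = lognormal_quantile(left_mass + coverage, mu, sigma)
    return lo, hi

def length(interval):
    lo, hi = interval
    return hi - lo

mu, sigma = 0.0, 1.0  # assumed parameters of log(a)

# Conventional choice: 2.5% of the probability in each tail.
equal_tailed = interval_for_left_tail(0.025, mu, sigma)

# Try left-tail masses strictly between 0% and 5% and keep the shortest interval.
steps = 1000
candidates = [0.05 * k / steps for k in range(1, steps)]
best_left = min(candidates,
                key=lambda t: length(interval_for_left_tail(t, mu, sigma)))
shortest = interval_for_left_tail(best_left, mu, sigma)

print(f"equal-tailed: {equal_tailed}, length {length(equal_tailed):.3f}")
print(f"shortest:     {shortest}, length {length(shortest):.3f}")
print(f"left-tail mass of shortest interval: {best_left:.5f}")
```

Both intervals contain 95% of the probability. The shortest one puts far less than 2.5% in the left tail, which is the asymmetry I mean.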

Generally speaking, if you do a nonlinear coordinate transform, you end up with a different (non-equivalent) confidence interval.

The trouble with your definition is that events in the P<0.05 column only happen 4% of the time. That's wrong. Period.

Do the integrals. Integrate a Gaussian between the 0.01 and 0.05 confidence levels. Bet you $100 the integral is 0.04. So, the region you are *calling* a 5% confidence interval only has 4% of probability mass.
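Here is that integral done numerically, as a sketch that assumes a two-tailed test on a standard Gaussian. The variable names are mine.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-tailed critical values: |z| beyond these gives P < 0.05 and P < 0.01.
z_05 = std_normal.inv_cdf(1 - 0.05 / 2)
z_01 = std_normal.inv_cdf(1 - 0.01 / 2)

# Probability mass where 0.01 < P < 0.05, counting both tails.
mass_between = 2 * (std_normal.cdf(z_01) - std_normal.cdf(z_05))

# Probability mass where P < 0.05, which is what the label promises.
mass_below_05 = 2 * (1 - std_normal.cdf(z_05))

print(f"mass with 0.01 < P < 0.05: {mass_between:.4f}")
print(f"mass with P < 0.05:        {mass_below_05:.4f}")
```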

The trouble with your definition is that events in the column that you label as "P<0.05" only happen 4% of the time. That's wrong. Period.