The problem with p-values

An interesting article in Aeon explains why p-values may not be the best way to judge the probability that a study has observed a real effect.

Tests of statistical significance proceed by calculating the probability of making our observations (or the more extreme ones) if there were no real effect. This isn’t an assertion that there is no real effect, but rather a calculation of what would be expected if there were no real effect. The postulate that there is no real effect is called the null hypothesis, and the probability is called the p-value. Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. But that turns out to be very difficult.

The problem is that the p-value gives the right answer to the wrong question. What we really want to know is not the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there is a real effect – that the hypothesis is true – given the observations. And that is a problem of induction.

The article rightly points out that the p<0.05 threshold is arbitrary. Is a study with a p-value of 0.047 much better than one with a p-value of 0.053? Of course not. This point is well known, however.

Notice, though, that it’s possible to calculate the disastrous false-positive rate for screening tests only because we have estimates for the prevalence of the condition in the whole population being tested. This is the prior probability that we need to use Bayes’s theorem. If we return to the problem of tests of significance, it’s not so easy. The analogue of the prevalence of disease in the population becomes, in the case of significance tests, the probability that there is a real difference between the pills before the experiment is done – the prior probability that there’s a real effect. And it’s usually impossible to make a good guess at the value of this figure.

An example should make the idea more concrete. Imagine testing 1,000 different drugs, one at a time, to sort out which works and which doesn’t. You’d be lucky if 10 per cent of them were effective, so let’s proceed by assuming a prevalence or prior probability of 10 per cent. Say we observe a ‘just significant’ result, for example, a P = 0.047 in a single test, and declare that this is evidence that we have made a discovery. That claim will be wrong, not in 5 per cent of cases, as is commonly believed, but in 76 per cent of cases. That is disastrously high. Just as in screening tests, the reason for this large number of mistakes is that the number of false positives in the tests where there is no real effect outweighs the number of true positives that arise from the cases in which there is a real effect.
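To see how false positives can swamp true positives, here is a minimal sketch of the simpler version of this calculation: count every result with p at or below 0.05 as a "discovery" and ask what fraction of those discoveries come from drugs that do nothing. The 80 per cent power and the function name false_discovery_fraction are my own assumptions, not figures from the article.

```python
def false_discovery_fraction(prior, alpha=0.05, power=0.8):
    """Fraction of 'significant' results (p <= alpha) that are false positives.

    prior : assumed share of tested drugs that really work
    alpha : significance threshold
    power : assumed chance that a drug that works gives p <= alpha
            (0.8 is an illustrative assumption, not from the article)
    """
    false_positives = (1 - prior) * alpha   # ineffective drugs that pass the test
    true_positives = prior * power          # effective drugs that pass the test
    return false_positives / (false_positives + true_positives)


n_drugs = 1000
prior = 0.10
print("ineffective drugs flagged:", (1 - prior) * n_drugs * 0.05)  # about 45 drugs
print("effective drugs flagged:  ", prior * n_drugs * 0.8)         # about 80 drugs
print("false discovery fraction: ", round(false_discovery_fraction(prior), 2))
```

Under these assumptions about 45 of the roughly 125 "discoveries" are false, or 36 per cent, already far above 5 per cent. As I understand it, the article's 76 per cent comes from a stricter calculation that looks only at results with a p-value close to 0.047 rather than at everything below 0.05.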

If we don’t know the prior probability for most interesting research questions, however, what is the solution?

So, although we can calculate the p-value, we can’t calculate the number of false positives. But what we can do is give a minimum value for the false positive rate. To do this, we need only assume that it’s not legitimate to say, before the observations are made, that the odds that an effect is real are any higher than 50:50. To do so would be to assume you’re more likely than not to be right before the experiment even begins.

If we repeat the drug calculations using a prevalence of 50 per cent rather than 10 per cent, we get a false positive rate of 26 per cent, still much bigger than 5 per cent. Any lower prevalence will result in an even higher false positive rate.
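The dependence on the prior can be sketched with a rough calculation. This is my own approximation, not the article's method: it treats the test statistic as normally distributed, assumes 80 per cent power at the 0.05 level, and asks how much more likely a result with exactly p = 0.047 is under a real effect than under the null. The function name and all parameter defaults are illustrative assumptions.

```python
from statistics import NormalDist


def false_positive_risk_at_p(p_obs, prior, alpha=0.05, power=0.8):
    """Approximate probability that a result with p-value p_obs is a false positive.

    Assumes a two-sided z-test and a real effect sized to give the stated
    power at level alpha (both are illustrative assumptions).
    """
    std = NormalDist()  # standard normal distribution
    z_obs = std.inv_cdf(1 - p_obs / 2)  # test statistic matching p_obs
    # Shift of the test statistic under a real effect, chosen to give the
    # assumed power (ignoring the negligible opposite tail).
    delta = std.inv_cdf(1 - alpha / 2) + std.inv_cdf(power)
    # Density of seeing |z| = z_obs when there is no real effect ...
    density_null = 2 * std.pdf(z_obs)
    # ... and when there is a real effect of the assumed size.
    density_real = std.pdf(z_obs - delta) + std.pdf(z_obs + delta)
    # Bayes' theorem: weight each density by its prior probability.
    null_weight = (1 - prior) * density_null
    real_weight = prior * density_real
    return null_weight / (null_weight + real_weight)


for prior in (0.5, 0.1):
    risk = false_positive_risk_at_p(0.047, prior)
    print(f"prior {prior:.0%}: false positive risk {risk:.0%}")
```

With these assumptions the risk comes out at roughly 28 per cent for a 50 per cent prior and roughly 78 per cent for a 10 per cent prior, close to the article's 26 and 76 per cent. The small gap is presumably down to the article using a different test and sample size.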

Getting this type of statistical analysis into mainstream research, however, will certainly be a challenge.


  1. This is an important point.

    Perhaps authors should be required to state their prior probability estimate in the paper and calculate positive/negative predictive values accordingly.

    Reviewers and readers could judge whether the prior probabilities used are reasonable.

  2. It is hard to see where the problem really is. I would suggest, first of all, a consistent use of the capital « P », so that we know what we are talking about. Then, I would also suggest that the author look up how the “P” value is calculated. This might help understand that the P value has something to do only with how “exact” the obtained result probably is – and NOT how scientifically important it is, i.e. whether the result suggests some increase in knowledge or not, or whether it is an eventual discovery or not. Finally, the question of whether the obtained P values of 0.053 and 0.047 should be interpreted as “really” different – is trivial and amounts to the well known “sorites” problem. Certainly, there always will be some young journalists who would confront the issue for the first time and find it fabulous.
