Testing multiple hypotheses simultaneously

Most people know about the good ol’ t-test. You present a null hypothesis (e.g., the Healthcare Economist is the most popular blog covering health economics on the web), collect data to conduct the test, and use the mean and variance of the data to decide whether the null hypothesis can be rejected.

By convention, most t-tests are conducted with an alpha of 0.05; this means that the false positive rate is only 5%. In other words, only 5% of the time will we reject the null hypothesis (e.g., Healthcare Economist is not popular) when in fact it is true.

What if we are conducting multiple t-tests simultaneously? For instance, suppose we want to know whether an individual has an above-average likelihood of having a specific disease based on their genetic information. For simplicity, assume that there are only 100 genes. If we tested all of them and used a 5% alpha, we would expect about 5 of the 100 tests to reject the null hypothesis through pure statistical chance, even if none of the genes were truly related to the disease. However, some individuals may take the elevated disease risk for these 5 genes as an important finding. The question is what to do about this.
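To make the arithmetic concrete, here is a small sketch of the 100-gene example. It assumes that every null hypothesis is true and that the tests are independent of one another; the variable names are purely illustrative.

```python
alpha = 0.05     # per-test significance level
n_tests = 100    # one test per gene in the simplified example

# Expected number of false positives if every null hypothesis is true.
expected_false_positives = alpha * n_tests

# Probability of at least one false positive, assuming independent tests.
prob_at_least_one = 1 - (1 - alpha) ** n_tests

print(expected_false_positives)
print(round(prob_at_least_one, 3))  # 0.994
```

Under these assumptions, at least one spurious "finding" is almost guaranteed.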

The answer is basically to scale down the alpha so that we can identify the results which are truly significant.

One option is the Bonferroni correction. This method simply divides the alpha by the number of statistical tests. Thus, if there are 100 genes we are comparing, the alpha will be 0.05% rather than 5%. One problem with the Bonferroni correction is that it fails to reject the null hypothesis too frequently; in essence, the Bonferroni correction is too harsh.
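A minimal sketch of the Bonferroni correction might look like the following. The function name and the p-values are made up for illustration only.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return True for each test whose null hypothesis is rejected."""
    if not p_values:
        return []
    # Each test is compared against alpha divided by the number of tests.
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values from four tests; the threshold is 0.05 / 4 = 0.0125.
p_values = [0.0001, 0.004, 0.03, 0.2]
print(bonferroni_reject(p_values))  # [True, True, False, False]
```

Note that the third test (p = 0.03) would have been significant on its own at the 5% level but no longer is after the correction.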

Another approach is to control the False Discovery Rate (FDR). Wikipedia states:

False discovery rate (FDR) control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of findings (i.e. studies where the null-hypotheses are rejected), FDR procedures are designed to control the expected proportion of incorrectly rejected null hypotheses (“false discoveries”). FDR controlling procedures exert a less stringent control over false discovery compared to familywise error rate (FWER) procedures (such as the Bonferroni correction), which seek to reduce the probability of even one false discovery, as opposed to the expected proportion of false discoveries. Thus FDR procedures have greater power at the cost of increased rates of type I errors, i.e., rejecting the null hypothesis of no effect when it should be accepted.
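One widely used FDR-controlling procedure is the Benjamini-Hochberg step-up method. The sketch below is an illustration rather than a definitive implementation: the function name and p-values are hypothetical, and the procedure's FDR guarantee assumes the tests are independent (or positively dependent).

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Return True for each test rejected at target FDR level q."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank

    # Reject the null for the max_k smallest p-values.
    reject = [False] * m
    for idx in order[:max_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg_reject(p_values))
```

With these made-up numbers the two smallest p-values are rejected, whereas a Bonferroni threshold of 0.05 / 10 = 0.005 would reject only the smallest.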

This spreadsheet provides examples of how to calculate both corrections, as well as a third correction known as Holm’s method.
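Holm’s method is not described in detail here, so the following is only a rough sketch of the standard step-down procedure, with a hypothetical function name and made-up p-values. It compares the smallest p-value with alpha/m, the next smallest with alpha/(m-1), and so on, stopping at the first test that fails.

```python
def holm_reject(p_values, alpha=0.05):
    """Return True for each test rejected by Holm's step-down method."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    reject = [False] * m
    for step, idx in enumerate(order):
        # The threshold loosens as more hypotheses are rejected.
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # stop at the first non-rejection
    return reject


# Hypothetical p-values from four tests.
p_values = [0.01, 0.2, 0.015, 0.005]
print(holm_reject(p_values))  # [True, False, True, True]
```

In this example the test with p = 0.015 is rejected under Holm’s method, while the Bonferroni threshold of 0.05 / 4 = 0.0125 would not reject it.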

Reference:

Yoav Benjamini, Yosef Hochberg. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B, Vol 57, No. 1, (1995), 289-300.
