In statistics, most tests are designed around a trade-off between Type I and Type II errors. A Type I error is the incorrect rejection of a true null hypothesis, in other words a false positive. A Type II error is the incorrect retention of a false null hypothesis, in other words a false negative. Often, the null hypothesis is posed as some parameter being equal to zero.
A paper by Gelman, Hill and Yajima (2012), however, argues that this framing may not be appropriate for social science. In genetics, where one tests whether a specific gene influences some biological process, it is plausible that some genes have a precisely zero effect. If we compare the effect of teacher quality on test scores, by contrast, it is highly unlikely that any teacher has a precisely zero effect. The effect may be positive or negative, small or large; the chance that it is exactly zero is exceedingly small.
Thus, when doing social science research, the authors prefer to think of statistical errors as Type S and Type M.
- Type S error. This is an error of sign. A Type S error occurs if one rejects the null of no effect in favor of a positive effect when the true effect is negative, or, conversely, rejects the null of no effect in favor of a negative effect when the true effect is positive.
- Type M error. This is an error of magnitude. Type M errors occur when we say that “…a treatment effect is near zero when it is actually large, or saying that it’s large when it’s near zero”. Type M errors are especially likely in underpowered studies where uncertainty is high (see the simulation sketch after this list).
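To see how both errors behave in practice, here is a minimal simulation sketch in Python. All of the numbers (a true effect of 0.1, a standard error of 0.3) are hypothetical, chosen to mimic an underpowered study rather than taken from either paper:

```python
import numpy as np
from scipy import stats

true_effect = 0.1   # hypothetical true effect (small but nonzero)
se = 0.3            # hypothetical standard error (noisy, underpowered design)
alpha = 0.05
n_sims = 100_000

rng = np.random.default_rng(0)
estimates = rng.normal(true_effect, se, n_sims)
z_crit = stats.norm.ppf(1 - alpha / 2)
significant = np.abs(estimates) > z_crit * se  # two-sided test at level alpha

power = significant.mean()
# Type S rate: among significant results, how often is the sign wrong?
type_s = (significant & (estimates < 0)).mean() / power
# Type M ("exaggeration ratio"): among significant results, how much does
# the estimate overstate the true effect on average?
type_m = np.abs(estimates[significant]).mean() / true_effect

print(f"power: {power:.3f}")
print(f"P(sign error | significant): {type_s:.3f}")
print(f"exaggeration ratio: {type_m:.1f}")
```

With these made-up numbers, power is only about 6%, roughly one in six significant results has the wrong sign, and significant estimates overstate the true effect several times over.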
I think this is certainly an intuitive way to frame the problem of statistical inference in social science.
The same paper also describes the use of Bayesian multilevel modelling for conducting multiple hypothesis tests simultaneously (e.g., one test per teacher) as an alternative to more commonly used classical approaches such as the Bonferroni or false discovery rate (FDR) corrections. Whereas the classical approaches adjust the confidence intervals of the individual tests to reflect the fact that multiple hypotheses are being tested, the Bayesian approach shrinks the lower-level estimates towards the group-level mean, which serves as each unit’s prior. In the case of teacher quality, for instance, the Bayesian multilevel modelling approach would shrink each teacher’s estimated effect on test scores closer to the population-level effect.
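To make the classical side concrete, here is a minimal sketch with made-up p-values of the two corrections mentioned above: the Bonferroni adjustment and the Benjamini-Hochberg step-up procedure, which is the standard way of controlling the FDR. (The Bayesian alternative is sketched after the shrinkage formula below.)

```python
import numpy as np

# Hypothetical p-values from, say, per-teacher tests of "effect = 0".
pvals = np.array([0.001, 0.008, 0.020, 0.041, 0.300, 0.650])
m, alpha = len(pvals), 0.05

# Bonferroni: test each hypothesis at level alpha/m (equivalently, widen
# each confidence interval to the 1 - alpha/m level).
bonferroni_reject = pvals < alpha / m

# Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) * alpha and
# reject the hypotheses with the k smallest p-values.
order = np.argsort(pvals)
ranked = pvals[order]
below = ranked <= (np.arange(1, m + 1) / m) * alpha
k = below.nonzero()[0].max() + 1 if below.any() else 0
bh_reject = np.zeros(m, dtype=bool)
bh_reject[order[:k]] = True

print("Bonferroni rejects:", bonferroni_reject.sum(), "of", m)
print("BH (FDR) rejects:  ", bh_reject.sum(), "of", m)
```

Both procedures operate on the tests: neither changes the point estimates themselves, which is exactly what the multilevel approach does instead.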
As for the math, consider the application by Kane, Rockoff and Staiger (2008), who measure the effect of teacher quality on test scores. In their approach, for each teacher who taught in t different classrooms, the teacher-quality impact was measured using an empirical Bayes (shrinkage) estimator as follows:
$$E(\mu_j \mid \varepsilon_{j,1}, \ldots, \varepsilon_{j,t}) = \bar{\varepsilon}_j \, \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_\zeta^2 / t}$$

where

$$\bar{\varepsilon}_j = \frac{1}{t} \sum_{s=1}^{t} \varepsilon_{j,s}.$$
The key quantity for each teacher is $\bar{\varepsilon}_j$, the teacher’s effectiveness as measured by their average residual. To control for multiple testing, the authors shrink this estimate towards the population mean by multiplying it by a scaling factor equal to “the proportion of variance in the average residual that is due to signal variance (i.e., the reliability).”
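As a minimal sketch, the Python below implements that estimator directly. The variance components and residuals here are invented for illustration; in the actual application they would be estimated from the New York City data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical variance components (not taken from the paper):
sigma2_mu = 0.04    # signal variance: variance of true teacher effects
sigma2_zeta = 0.25  # noise variance: classroom-level residual variance

def shrunk_effect(residuals, sigma2_mu, sigma2_zeta):
    """Empirical Bayes estimate E(mu_j | eps_j1, ..., eps_jt) for one teacher."""
    t = len(residuals)
    eps_bar = np.mean(residuals)  # average residual over t classrooms
    # Reliability: share of variance in eps_bar that is signal, not noise.
    reliability = sigma2_mu / (sigma2_mu + sigma2_zeta / t)
    return eps_bar * reliability

# A teacher observed in 3 classrooms vs. one observed in 12, both with the
# same underlying quality (0.2 in residual units):
few = rng.normal(0.2, np.sqrt(sigma2_zeta), size=3)
many = rng.normal(0.2, np.sqrt(sigma2_zeta), size=12)
print(shrunk_effect(few, sigma2_mu, sigma2_zeta))   # shrunk hard toward zero
print(shrunk_effect(many, sigma2_mu, sigma2_zeta))  # shrunk less
```

The estimator pulls noisy averages (few classrooms) strongly toward the population mean of zero, while better-measured teachers (many classrooms) keep more of their raw average, which is the shrinkage behaviour described above.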
Sources:
- Gelman, Andrew, Jennifer Hill, and Masanao Yajima. “Why we (usually) don’t have to worry about multiple comparisons.” Journal of Research on Educational Effectiveness 5, no. 2 (2012): 189-211.
- Kane, Thomas J., Jonah E. Rockoff, and Douglas O. Staiger. “What does certification tell us about teacher effectiveness? Evidence from New York City.” Economics of Education Review 27, no. 6 (2008): 615-631.