If you work in research, you may have heard that you need to worry about confounding. But what is confounding? And how can you address the problems it causes?
What is confounding? An example
Let us say that you are interested in the effect of an Ivy League education on a child’s long-term income. You could measure this relationship, and you would likely find that individuals who attend Ivy League schools have higher incomes. However, is this relationship causal? One may believe that the socioeconomic status of the child’s parents is a confounder here. Children who grew up in families with higher socioeconomic status are more likely to attend Ivy League schools; they are also more likely to have higher incomes (e.g., perhaps due to family connections at better-paying jobs). Thus, it could be the case that these children would have higher incomes no matter what school they attended.
What is confounding? A definition
There are three necessary conditions for a variable to be considered a confounder:
- The variable must be independently associated with the outcome (e.g., in my example, socioeconomic status is associated with income)
- The variable must also be associated with the exposure under study in the source population (e.g., socioeconomic status affects the likelihood of attending an Ivy League school)
- The variable should not lie on the causal pathway between exposure and outcome
All three conditions hold true in my example. Socioeconomic status is clearly related to the outcome (income). A child’s socioeconomic status also affects the likelihood of attending an Ivy League school. And finally, socioeconomic status is not on the causal pathway between exposure and outcome: attending an Ivy League school does not affect the socioeconomic status of your family when you were growing up.
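To make the three conditions concrete, here is a minimal simulation sketch of my example. Everything in it is an assumption made for illustration: the variable names, the logistic attendance model, and the coefficients (including a "true" Ivy League effect of 10 income units) are invented, not estimates from real data. It shows how a naive comparison of attendees and non-attendees can overstate the effect when socioeconomic status (SES) drives both attendance and income.

```python
import numpy as np

# Hypothetical data-generating process; every number below is an assumption.
rng = np.random.default_rng(0)
n = 100_000
true_effect = 10.0  # assumed causal effect of Ivy League attendance on income

ses = rng.normal(size=n)                     # standardized parental SES
p_ivy = 1.0 / (1.0 + np.exp(-(ses - 2.0)))   # higher SES -> more likely to attend
ivy = rng.random(n) < p_ivy                  # boolean attendance indicator

# SES raises income directly, independent of which school the child attends.
income = 50.0 + true_effect * ivy + 15.0 * ses + rng.normal(scale=10.0, size=n)

# Naive comparison: attendees vs. non-attendees, ignoring SES.
naive = income[ivy].mean() - income[~ivy].mean()
ses_gap = ses[ivy].mean() - ses[~ivy].mean()

print(f"assumed true effect:             {true_effect:.1f}")
print(f"naive difference in mean income: {naive:.1f}")
print(f"SES gap (attendees - others):    {ses_gap:.2f}")
```

In this setup the naive difference mixes the effect of the school with the income advantage that high-SES children would have had anyway, which is exactly the confounding problem described above.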
So now we know the problem. How do we address it?
Addressing confounding: study design
- Randomization: This solves the confounding problem because, on average, confounders (both observed and unobserved) will be balanced across the treatment and control groups. In my example, one could randomize who attends Ivy League schools. This would solve the confounding issue, but the Ivy League schools probably would not go for this admissions policy.
- Restriction: Another approach would be to limit the study to only people of a certain socioeconomic status. In my example, this is difficult because socioeconomic status is a continuous measure, so there would always be some differences. However, to the extent that one can limit the sample to be homogeneous with respect to confounders, one has solved the issue.
- Matching: Approaches such as propensity score matching ensure that observable confounders are balanced across the treatment and control groups. In my example, one would compare individuals who did vs. did not go to Ivy League schools, but only after matching individuals with similar socioeconomic status growing up.
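The three design approaches above can be sketched on simulated data. This is a toy illustration under invented assumptions: the data-generating process, the coefficients, and the restriction band of 0.1 are all hypothetical, and the matching step pairs each attendee with the nearest non-attendee on SES directly (with replacement), which is a simplification of propensity score matching that works here only because there is a single observed confounder.

```python
import numpy as np

# Hypothetical data-generating process; all coefficients are assumptions.
rng = np.random.default_rng(1)
n = 100_000
true_effect = 10.0

ses = rng.normal(size=n)                 # standardized parental SES
noise = rng.normal(scale=10.0, size=n)

def income_of(ivy):
    """Income under a given attendance vector (SES also raises income)."""
    return 50.0 + true_effect * ivy + 15.0 * ses + noise

def diff_in_means(income, ivy, mask=None):
    mask = np.ones(n, dtype=bool) if mask is None else mask
    return income[mask & ivy].mean() - income[mask & ~ivy].mean()

# 1) Randomization: attendance is a coin flip, unrelated to SES.
ivy_rand = rng.random(n) < 0.5
est_random = diff_in_means(income_of(ivy_rand), ivy_rand)

# Observational world: attendance depends on SES.
ivy_obs = rng.random(n) < 1.0 / (1.0 + np.exp(-(ses - 2.0)))
inc_obs = income_of(ivy_obs)
est_naive = diff_in_means(inc_obs, ivy_obs)

# 2) Restriction: keep only a narrow SES band so the sample is nearly homogeneous.
est_restricted = diff_in_means(inc_obs, ivy_obs, mask=np.abs(ses) < 0.1)

# 3) Matching: pair each attendee with the nearest non-attendee on SES.
ses_t, inc_t = ses[ivy_obs], inc_obs[ivy_obs]
order = np.argsort(ses[~ivy_obs])
ses_c, inc_c = ses[~ivy_obs][order], inc_obs[~ivy_obs][order]
pos = np.clip(np.searchsorted(ses_c, ses_t), 1, len(ses_c) - 1)
use_left = (ses_t - ses_c[pos - 1]) <= (ses_c[pos] - ses_t)
nearest = np.where(use_left, pos - 1, pos)
est_matched = (inc_t - inc_c[nearest]).mean()

for name, est in [("naive", est_naive), ("randomized", est_random),
                  ("restricted", est_restricted), ("matched", est_matched)]:
    print(f"{name:>10}: {est:.2f}")
```

Note the trade-off visible in the restriction step: narrowing the SES band removes more confounding but discards most of the sample, so the estimate becomes noisier.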
Addressing confounding: analysis
- Stratification: This is similar to the “restriction” approach above. You would simply evaluate the question of interest within different strata, and if you wanted an overall effect you could take a weighted average across the strata. In my example, you could compare later-in-life income for high socioeconomic status children who did vs. did not go to Ivy League schools, and then make the same comparison for low socioeconomic status children.
- Multivariable regression: Statistical modelling (e.g., multivariable regression analysis) is used to control for more than one confounder at the same time, and allows for the interpretation of the effect of each confounder individually. This approach is very commonly used, especially in the field of economics. In my example, you would simply control for parents’ socioeconomic status (e.g., parents’ education, parents’ income, etc.) in the statistical model.
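Both analysis approaches can be sketched with NumPy on simulated data. As before, the data-generating process, the coefficients, and the choice of 20 SES strata are assumptions made purely for illustration; the regression is plain ordinary least squares via `np.linalg.lstsq`, not a recommendation for any particular model.

```python
import numpy as np

# Hypothetical data-generating process; all coefficients are assumptions.
rng = np.random.default_rng(2)
n = 100_000
true_effect = 10.0

ses = rng.normal(size=n)
ivy = rng.random(n) < 1.0 / (1.0 + np.exp(-(ses - 2.0)))
income = 50.0 + true_effect * ivy + 15.0 * ses + rng.normal(scale=10.0, size=n)

est_naive = income[ivy].mean() - income[~ivy].mean()

# Stratification: compare attendees and non-attendees within SES strata,
# then average the within-stratum differences weighted by stratum size.
n_strata = 20
edges = np.quantile(ses, np.linspace(0.0, 1.0, n_strata + 1))
stratum = np.digitize(ses, edges[1:-1])      # labels 0 .. n_strata - 1
diffs, weights = [], []
for s in range(n_strata):
    in_s = stratum == s
    treated, control = in_s & ivy, in_s & ~ivy
    if treated.any() and control.any():      # skip strata missing a group
        diffs.append(income[treated].mean() - income[control].mean())
        weights.append(in_s.sum())
est_stratified = np.average(diffs, weights=weights)

# Multivariable regression: income ~ intercept + ivy + ses.
X = np.column_stack([np.ones(n), ivy.astype(float), ses])
coef, *_ = np.linalg.lstsq(X, income, rcond=None)
est_regression = coef[1]                     # coefficient on attendance

print(f"naive:      {est_naive:.2f}")
print(f"stratified: {est_stratified:.2f}")
print(f"regression: {est_regression:.2f}")
```

Because SES is continuous, each stratum still contains some variation in SES, so the stratified estimate can retain a small amount of bias; finer strata reduce it at the cost of fewer observations per stratum.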
What is residual confounding?
In the example above, I assume that all confounding factors are observed in the data. For instance, I assume that one has data on parents’ socioeconomic status for all children. However, if some confounding factors are unobserved, adjusting for the observed ones will not remove all of the bias. We call what remains “residual confounding”: because the confounder is not observed, its influence necessarily goes into the residual.
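A small simulation can illustrate the idea. Here I assume, hypothetically, that socioeconomic status has two components: parental income, which the researcher observes, and family connections, which the researcher does not. The names, the logistic attendance model, and every coefficient are invented for this sketch.

```python
import numpy as np

# Hypothetical data-generating process; all coefficients are assumptions.
rng = np.random.default_rng(3)
n = 100_000
true_effect = 10.0

parent_income = rng.normal(size=n)   # observed confounder
connections = rng.normal(size=n)     # unobserved confounder (hypothetical)

logit = parent_income + connections - 2.0
ivy = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))
income = (50.0 + true_effect * ivy + 15.0 * parent_income
          + 15.0 * connections + rng.normal(scale=10.0, size=n))

def ivy_coefficient(*controls):
    """OLS coefficient on attendance, adjusting for the given controls."""
    X = np.column_stack([np.ones(n), ivy.astype(float), *controls])
    coef, *_ = np.linalg.lstsq(X, income, rcond=None)
    return coef[1]

# What the researcher can actually run: adjust for the observed confounder only.
est_observed_only = ivy_coefficient(parent_income)

# What would be possible if family connections were also measured.
est_all = ivy_coefficient(parent_income, connections)

print(f"assumed true effect:          {true_effect:.1f}")
print(f"adjusting for parent income:  {est_observed_only:.2f}")
print(f"adjusting for both:           {est_all:.2f}")
```

In this setup, controlling for parental income alone still leaves the estimate above the assumed true effect, because the influence of family connections is absorbed by the attendance coefficient and the error term.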
What is effect modification and do we need to adjust for it?
Effect modification or interaction occurs when the direction or magnitude of an association between two variables varies according to the level of a third variable (the effect modifier). In my example, let us assume that the socioeconomic status of parents had no direct effect on a child’s long-term income. Think of a world where an individual’s job prospects depend entirely on (i) the school they attended and (ii) the grades they received at that school. Further assume that Ivy League schools began using randomization to admit children. In this case, a parent’s socioeconomic status has no direct effect on income, nor does it affect the likelihood of attending an Ivy League school. However, it could be the case that children of families with higher socioeconomic status could afford tutors in high school, which led to better study habits. If these better study habits lead to better grades, then we would see that while Ivy League schools lead to higher income overall, the effect would be larger for children of high-status parents because they would get higher grades.
In this case, one does not need to control for a parent’s socioeconomic status to measure the overall population effect. However, socioeconomic status does affect the magnitude of the relationship for specific subpopulations. Typically, one would show the overall effect and then show stratified results or use interaction terms to capture some of the heterogeneity that makes up the overall treatment effect.
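The hypothetical world described above can be sketched in a few lines. All of the numbers are assumptions for illustration: admission is randomized, SES is reduced to a binary high/low indicator, SES has no direct effect on income, and the Ivy League effect is set to 10 for low-SES and 20 for high-SES children.

```python
import numpy as np

# Hypothetical data-generating process; all numbers are assumptions.
rng = np.random.default_rng(4)
n = 100_000

high_ses = rng.random(n) < 0.5            # effect modifier (binary for simplicity)
ivy = rng.random(n) < 0.5                 # randomized admission, unrelated to SES
effect = np.where(high_ses, 20.0, 10.0)   # larger payoff for high-SES children
income = 50.0 + effect * ivy + rng.normal(scale=10.0, size=n)  # no direct SES term

def diff_in_means(mask):
    return income[mask & ivy].mean() - income[mask & ~ivy].mean()

# Overall effect: no adjustment for SES is needed under randomization.
overall = diff_in_means(np.ones(n, dtype=bool))

# Stratified results show how the effect differs by subgroup.
low = diff_in_means(~high_ses)
high = diff_in_means(high_ses)

# Equivalent regression with an interaction term:
# income ~ intercept + ivy + high_ses + ivy * high_ses
X = np.column_stack([np.ones(n), ivy, high_ses, ivy & high_ses]).astype(float)
coef, *_ = np.linalg.lstsq(X, income, rcond=None)

print(f"overall effect:          {overall:.2f}")
print(f"effect, low SES:         {low:.2f}")
print(f"effect, high SES:        {high:.2f}")
print(f"interaction coefficient: {coef[3]:.2f}")
```

The coefficient on `ivy` estimates the effect for low-SES children, and the interaction coefficient estimates how much larger the effect is for high-SES children, which matches the difference between the two stratified estimates.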