How linear mixed models can help with missing data

What should you do when you want to conduct a cost-effectiveness analysis based on efficacy estimates from a clinical trial, but the trial has missing data? One common approach, known as complete case analysis (CCA), is to discard the participants with incomplete observations. This approach is problematic: not only does the smaller sample size make the estimator less efficient, but the estimates may also be biased if the data are not missing completely at random. Common approaches to address this issue include multiple imputation (MI) (see Leurent et al. 2018), Bayesian methods (see Gabrio et al. 2019), and linear mixed models (LMM). In this post, we provide an overview of the LMM approach, drawn largely from Gabrio et al. (2022).
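To make the bias point concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration (the utility distributions, the dropout rule, the function name `simulate_trial`); it is not taken from the papers cited above. Follow-up data go missing more often for participants with low baseline utility, so the complete cases are not representative of the full sample.

```python
import random
import statistics


def simulate_trial(n=20000, seed=1):
    """Simulate follow-up utilities where dropout depends on baseline utility."""
    rng = random.Random(seed)
    all_followup, observed_followup = [], []
    for _ in range(n):
        baseline = rng.gauss(0.60, 0.15)
        followup = baseline + 0.05 + rng.gauss(0.0, 0.10)
        all_followup.append(followup)
        # Assumed dropout rule: low-baseline participants miss follow-up far more often.
        p_missing = 0.7 if baseline < 0.60 else 0.1
        if rng.random() >= p_missing:
            observed_followup.append(followup)
    return all_followup, observed_followup


all_followup, observed_followup = simulate_trial()
full_mean = statistics.fmean(all_followup)
cca_mean = statistics.fmean(observed_followup)
retained = len(observed_followup) / len(all_followup)

print(f"mean follow-up utility, everyone:       {full_mean:.3f}")
print(f"mean follow-up utility, complete cases: {cca_mean:.3f}")
print(f"share of participants retained:         {retained:.2f}")
```

Under this dropout rule the complete-case mean overstates the true follow-up mean, and a sizeable share of the sample is discarded as well.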

Consider the following regression structure:
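Written out from the symbol definitions in the next paragraph, a random-intercept model of this kind takes the form below. The normality of ω_i and ε_ij is the usual LMM assumption and is stated here as an assumption rather than quoted from the paper.

```latex
Y_{ij} = \beta_1 + \beta_2 X_{i1} + \dots + \beta_{P+1} X_{iP} + \omega_i + \varepsilon_{ij},
\qquad \omega_i \sim N(0, \sigma^2_\omega), \quad \varepsilon_{ij} \sim N(0, \sigma^2_\varepsilon)
```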

In this equation, Y_ij is the outcome of interest for person i at time point j. There are P predictors X_i1,…,X_iP, and the coefficients β₁,…,β_P+1 comprise an intercept plus one coefficient per predictor. The usual error term is ε_ij, and ω_i is a random intercept. The equation treats the data as having a 2-level structure, where σ²_ε and σ²_ω capture the variance of the responses within (level 1) and between (level 2) individuals, respectively.
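To see what the two variance components mean in practice, the sketch below simulates outcomes with a person-level random intercept and recovers both variances. It uses a simple method-of-moments calculation for a balanced design, not the likelihood-based fitting that LMM software performs. The parameter values and the function names `simulate_outcomes` and `variance_components` are illustrative assumptions.

```python
import random
import statistics


def simulate_outcomes(n_people=2000, n_times=3, sd_between=0.15, sd_within=0.10, seed=7):
    """Simulate Y_ij = 0.6 + omega_i + eps_ij with a person-level random intercept."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        omega_i = rng.gauss(0.0, sd_between)  # shared by all measurements of person i
        data.append([0.6 + omega_i + rng.gauss(0.0, sd_within) for _ in range(n_times)])
    return data


def variance_components(data):
    """Method-of-moments estimates for a balanced design (not REML or ML)."""
    n_times = len(data[0])
    # Within-person variance: average of each person's sample variance.
    var_within = statistics.fmean([statistics.variance(row) for row in data])
    # Var(person mean) = sigma2_omega + sigma2_eps / J, so subtract the second part.
    var_of_means = statistics.variance([statistics.fmean(row) for row in data])
    var_between = max(var_of_means - var_within / n_times, 0.0)
    return var_between, var_within


var_between, var_within = variance_components(simulate_outcomes())
icc = var_between / (var_between + var_within)

print(f"between-person variance (sigma2_omega): {var_between:.4f}")
print(f"within-person variance  (sigma2_eps):   {var_within:.4f}")
print(f"intraclass correlation:                 {icc:.2f}")
```

The simulated true values are 0.15² = 0.0225 between individuals and 0.10² = 0.01 within individuals, so the estimates should land close to those.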

The paper also describes one type of LMM, the mixed model for repeated measures (MMRM). Consider the case where we model patients' quality of life data (i.e., utilities), collected at three times during the trial (i.e., baseline and 2 follow-ups). We can write this model mathematically as:
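One form consistent with the description that follows is given below. The notation is assumed here rather than taken from the paper: u_ij is the utility of person i at time j = 0, 1, 2 (baseline and the two follow-ups), trt_i is a treatment-arm indicator, and I(·) is an indicator function.

```latex
u_{ij} = \beta_1 I(j=0) + \beta_2 I(j=1) + \beta_3 I(j=2)
       + \beta_4 I(j=1)\,\mathrm{trt}_i + \beta_5 I(j=2)\,\mathrm{trt}_i
       + \omega_i + \varepsilon_{ij}
```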

In this equation, utilities have a fixed indicator for whether they were collected at baseline, the first follow-up, or the second follow-up. The follow-up terms also include an interaction between treatment and the time the utilities were collected. Note that the random effects term lets us separate within-person from between-person variability in utilities; if there is substantial heterogeneity in utility across individuals, any missing data would increase the uncertainty of the estimates relative to cases where there is little variation in baseline utility levels across individuals. When data are missing, one can still estimate utility or QALY impacts based on weighted linear combinations of the coefficient estimates of this utility model.
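As a rough sketch of that last step, suppose utilities are measured at baseline, 6 months, and 12 months, and QALYs are computed as the area under the utility curve. The trapezoid rule then gives the weights for the linear combination. The assessment times, the coefficient values, and the covariance matrix below are all hypothetical numbers for illustration, and the helper names are ours, not the authors'.

```python
import math

# Assumed assessment times in years: baseline, 6 months, 12 months.
times = [0.0, 0.5, 1.0]


def auc_weights(times):
    """Trapezoid-rule weights w_j such that QALY = sum(w_j * utility_j)."""
    weights = [0.0] * len(times)
    for j in range(len(times) - 1):
        half_width = (times[j + 1] - times[j]) / 2
        weights[j] += half_width
        weights[j + 1] += half_width
    return weights


def linear_combination(weights, estimates, covariance):
    """Point estimate w'b and standard error sqrt(w'Vw) of a linear combination."""
    k = len(weights)
    point = sum(w * b for w, b in zip(weights, estimates))
    variance = sum(
        weights[r] * covariance[r][c] * weights[c] for r in range(k) for c in range(k)
    )
    return point, math.sqrt(variance)


weights = auc_weights(times)

# Hypothetical treatment effects on utility at each time. The baseline effect is
# fixed at 0 because the model described above has no treatment term at baseline.
effects = [0.0, 0.04, 0.06]

# Hypothetical covariance matrix of those effects (baseline row/column is zero).
cov = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0004, 0.0002],
    [0.0, 0.0002, 0.0009],
]

inc_qaly, inc_qaly_se = linear_combination(weights, effects, cov)
print(f"weights: {weights}")
print(f"incremental QALYs: {inc_qaly:.4f} (SE {inc_qaly_se:.4f})")
```

With equally spaced assessments over one year, the first follow-up gets twice the weight of the final one, because it sits in the middle of the utility curve.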

The authors note that one key limitation of LMM is that it requires all baseline covariates to be fully observed. That will often be the case; where it is not, the authors argue that “in randomized controlled trials, missing baseline data can be usually addressed by implementing single imputation techniques (e.g., mean-imputation) to obtain complete data prior to fitting the model, without loss of validity or efficiency.”
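Mean imputation of a baseline covariate is simple enough to sketch in a few lines. The function name `mean_impute` and the example values are hypothetical, and computing the mean over all participants regardless of treatment arm is an assumption of this sketch.

```python
import statistics


def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical baseline utilities for five participants, two of them missing.
baseline_utility = [0.50, None, 0.70, 0.90, None]
completed = mean_impute(baseline_utility)
print(completed)
```

The completed baseline values can then be used as a covariate when fitting the model.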

Gabrio and co-authors also post their code for Stata and R on GitHub (see here).
