Let’s say we are interested in determining whether a particular treatment improves quality of life. Common measures of quality of life include EQ-5D, SF-12, and SF-36, among others. However, a systematic review of 237 randomized controlled trials found that only 43% collected SF-12 or SF-36 measures. If you are measuring a treatment’s effect on quality of life, how would you deal with the missing data problem?
As described in Halme and Tannenbaum (2018), the missing data could be assumed to follow one of three processes:
- Missing completely at random (MCAR): The probability that a value is missing is unrelated to any variable, observed or unobserved.
- Missing at random (MAR): The probability that a value is missing depends only on variables observed in the data, not on the missing value itself.
- Missing not at random (MNAR): The probability that a value is missing depends on the unobserved value itself.
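To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than anything from the paper: the QoL score, its dependence on age, and the missingness probabilities are all made up.

```python
import numpy as np

def simulate_missingness(n=100_000, seed=0):
    """Return the complete-case mean of a made-up QoL score under each mechanism."""
    rng = np.random.default_rng(seed)
    age = rng.normal(60, 10, n)                          # observed covariate
    qol = 70 - 0.5 * (age - 60) + rng.normal(0, 10, n)   # hypothetical QoL score

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # MCAR: every value has the same 30% chance of being missing.
    miss_mcar = rng.random(n) < 0.3
    # MAR: missingness depends only on the observed covariate (age).
    miss_mar = rng.random(n) < sigmoid((age - 60) / 5)
    # MNAR: missingness depends on the unobserved QoL value itself
    # (people with low QoL are less likely to respond).
    miss_mnar = rng.random(n) < sigmoid(-(qol - 70) / 5)

    return {
        "true": qol.mean(),
        "MCAR": qol[~miss_mcar].mean(),
        "MAR": qol[~miss_mar].mean(),
        "MNAR": qol[~miss_mnar].mean(),
    }

means = simulate_missingness()
```

In this setup the MCAR complete-case mean should stay close to the true mean, while the MNAR mean is pushed upward because low scores are the ones that go unreported. The MAR mean is also shifted, but that shift can in principle be corrected by conditioning on age, which is observed.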
If clinical trial data on quality of life are MCAR or MAR, one could use single imputation, linear regression, personal mean score, a maximum likelihood approach, or other methods to impute the missing values. But what if the process is MNAR? What if people don’t report their QoL when they are feeling sick or are at the end of life? In this case, failing to account for this reporting selection would lead to biased estimates.
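As an illustration of one of the simpler methods mentioned above, here is a sketch of personal mean score imputation, where each respondent's missing items are filled with the mean of the items that person did answer. The function name and the toy data are hypothetical.

```python
import numpy as np

def personal_mean_impute(responses):
    """Fill each respondent's missing items with the mean of their answered items."""
    responses = np.asarray(responses, dtype=float)
    filled = responses.copy()
    for row in filled:                  # each row is one respondent (a view into filled)
        answered = ~np.isnan(row)
        if answered.any():              # rows with no answers at all are left untouched
            row[~answered] = row[answered].mean()
    return filled

# Toy data: two respondents, four items, NaN marks a missing answer.
scores = np.array([
    [3.0, 4.0, np.nan, 5.0],
    [2.0, np.nan, np.nan, 4.0],
])
imputed = personal_mean_impute(scores)
```

This kind of fill is only defensible when the data are plausibly MCAR or MAR. If sicker respondents skip questions, the answered items overstate that person's QoL and the imputed values inherit the bias.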
To address this issue, Halme and Tannenbaum (2018) use a Bayesian approach to account for this bias. To do this, the authors first create a prior distribution conditional on age and education (which are known to affect QoL). Because SF-12 questions are categorical variables, the probabilities of the possible answers must sum to 1:
$$q_{ij} \sim \text{categorical}(\pi_{ji1}, \pi_{ji2}, \ldots, \pi_{ji6})$$
such that
$$\sum_{n=1}^{C} \pi_{jin} = 1$$
The categorical probability vector was calculated using a latent normally distributed variable $u_{ji}$, which depends on age, education, and answers to other questions. For questions with more than 3 responses, the authors applied a beta distribution to ensure the resulting probabilities fell between 0 and 1.
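One common way to turn a latent normal variable into category probabilities is an ordered-probit style construction with cutpoints. The sketch below uses that construction purely as an illustration. It is an assumption, not the authors' exact specification, and the coefficients and cutpoints are invented.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def category_probs(latent_mean, cutpoints):
    """Probability that latent_mean + N(0, 1) noise falls in each cutpoint interval."""
    cdf = [normal_cdf(c - latent_mean) for c in cutpoints]
    bounds = [0.0] + cdf + [1.0]
    return [hi - lo for lo, hi in zip(bounds[:-1], bounds[1:])]

# Hypothetical linear predictor for the latent variable (made-up coefficients).
age, education_years = 65, 12
u = 0.5 - 0.02 * (age - 60) + 0.05 * (education_years - 12)

# Five increasing cutpoints give six answer categories.
probs = category_probs(u, cutpoints=[-1.0, 0.0, 1.0, 2.0, 3.0])
```

Because the intervals partition the real line, the resulting probabilities are positive and sum to 1 by construction, which is the constraint the categorical distribution requires.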
For the MNAR approach, the authors also included a binary indicator for “missingness”, which was modeled with a logistic regression on age and the answers to the other SF-12 questions.
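A logistic model for the missingness indicator can be sketched as follows. The coefficient values here are fixed and hypothetical. In a Bayesian fit they would instead be given priors and estimated jointly with the rest of the model.

```python
import numpy as np

def missingness_prob(age, other_answers, intercept, b_age, b_answers):
    """Logistic model for P(item is missing | age, answers to other items)."""
    linear = intercept + b_age * age + np.dot(other_answers, b_answers)
    return 1.0 / (1.0 + np.exp(-linear))

# Hypothetical respondent and made-up coefficients.
p_missing = missingness_prob(
    age=70,
    other_answers=np.array([2, 3, 1]),
    intercept=-3.0,
    b_age=0.03,
    b_answers=np.array([0.1, -0.2, 0.05]),
)
```

With a positive `b_age`, older respondents get a higher predicted probability of skipping the item, which is how the model lets missingness vary with respondent characteristics.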
To test whether this algorithm worked well, the authors use a validation sample of individuals with complete answers and artificially set some responses to missing in a pattern that mirrored people with missing data. In their data, about 3 in 5 people had at least one response that was missing. They then apply the Bayesian approach described above as well as a variety of other imputation methods, including assuming all missing responses were 0, using the mean of the sample, and two different linear regression-based predictions.
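The mask-and-compare idea behind this validation can be sketched as below. Two simplifications are assumptions of this sketch: the complete answers are randomly generated, and values are masked uniformly at random rather than in a pattern that mirrors real nonrespondents. Only the two simplest baselines (zero fill and item mean fill) are shown.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical complete answers: 200 respondents, 12 items, scored 1 to 5.
complete = rng.integers(1, 6, size=(200, 12)).astype(float)

# Hide roughly 10% of the answers to create a validation target.
mask = rng.random(complete.shape) < 0.1
observed = complete.copy()
observed[mask] = np.nan

# Baseline 1: treat every missing response as 0.
zero_filled = np.where(mask, 0.0, observed)

# Baseline 2: fill with the mean of the observed answers to the same item.
item_means = np.nanmean(observed, axis=0)
mean_filled = np.where(mask, item_means, observed)

# Compare imputed values with the hidden truth on the masked cells only.
mae_zero = np.abs(zero_filled[mask] - complete[mask]).mean()
mae_mean = np.abs(mean_filled[mask] - complete[mask]).mean()
```

Any other imputation method can be dropped into the same harness by producing its own filled array and scoring it on the masked cells.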
Based on metrics such as the mean absolute error, mean relative absolute error, and root mean squared error, the authors find that the Bayesian approach predicts the held-out responses in the validation sample better than the other approaches. However, the cost of the Bayesian approach is that it is much more complex and requires significantly more computing resources.
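For reference, the three error metrics can be computed as in the sketch below. The definition of mean relative absolute error used here (absolute error divided by the absolute true value) is an assumption, since the source does not spell out the formula, and the example values are made up.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.abs(y_pred - y_true).mean()

def mrae(y_true, y_pred):
    """Mean relative absolute error, assumed here to be |error| / |true value|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return (np.abs(y_pred - y_true) / np.abs(y_true)).mean()

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(((y_pred - y_true) ** 2).mean())

# Made-up true and imputed answers for three masked responses.
y_true = [2, 4, 5]
y_pred = [3, 4, 3]
scores = {"MAE": mae(y_true, y_pred), "MRAE": mrae(y_true, y_pred), "RMSE": rmse(y_true, y_pred)}
```

RMSE penalizes large misses more heavily than MAE, so reporting both gives a sense of whether an imputation method is occasionally far off or consistently slightly off.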
The authors noted that other approaches to missing data are available, such as “hot deck”, multiple imputation methods, and full information maximum likelihood. However, “hot deck” may be problematic with small sample sizes and does not address MNAR, and full information maximum likelihood was an option but requires continuous rather than categorical variables.