Dealing with missing data

If you are doing a cost-effectiveness analysis (CEA) that relies on clinical trial data, what should you do if some of the trial data are missing? A paper by Faria et al. (2014) helps to provide the answer. The first question is: how are the data missing? There are a few ways to classify this, as I have mentioned in the past.

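  • Missing completely at random (MCAR): The probability that a value is missing is unrelated to any variable, observed or unobserved.
  • Missing at random (MAR): The probability that a value is missing depends on observed data (e.g., baseline covariates or earlier outcomes) but, conditional on these, not on the unobserved value itself.
  • Missing not at random (MNAR): The probability that a value is missing depends on the unobserved value itself, even after conditioning on the observed data.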
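
To make the three mechanisms concrete, here is a minimal simulation sketch in Python (standard library only). It is not from Faria et al.; the variable names, the outcome model and the missingness probabilities are all assumptions chosen for illustration. It generates a baseline covariate x and an outcome y, deletes values of y under each mechanism, and compares the mean of the remaining values with the mean of the full data.

```python
import math
import random
from statistics import mean


def sigmoid(z):
    """Logistic function, used here to turn a score into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n=20000, seed=1):
    """Return the full-data mean of y and the observed mean under each mechanism."""
    rng = random.Random(seed)
    # Hypothetical data: x is an always-observed baseline covariate,
    # y is an outcome that increases with x.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [10 + 2 * xi + rng.gauss(0, 1) for xi in x]

    observed = {"MCAR": [], "MAR": [], "MNAR": []}
    for xi, yi in zip(x, y):
        # MCAR: every value has the same 30% chance of being missing.
        if rng.random() > 0.3:
            observed["MCAR"].append(yi)
        # MAR: the chance of being missing depends only on the observed x.
        if rng.random() > sigmoid(xi):
            observed["MAR"].append(yi)
        # MNAR: the chance of being missing depends on y itself.
        if rng.random() > sigmoid(yi - 10):
            observed["MNAR"].append(yi)

    return mean(y), {name: mean(values) for name, values in observed.items()}


full_mean, observed_means = simulate()
print(f"Full data: {full_mean:.2f}")
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:.2f}")
```

In this setup the observed mean stays close to the full-data mean under MCAR but is pulled downwards under MAR and MNAR, because high values are more likely to go missing. The difference between the last two is that under MAR the distortion can in principle be corrected using x, whereas under MNAR it cannot be corrected from the observed data alone.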

That is fine in theory, but how do you know in which way your data are missing? A few checks to run include:

  • Compare missingness by trial arm. Data are unlikely to be MCAR if the proportion of missing data differs by treatment arm.
  • Use graphical tools to identify missing data patterns. Graphical tools (such as ‘misspattern’ in Stata) are useful to visualise and understand the pattern of missing data. These patterns can guide whether missing data could be modelled using individual baseline characteristics or some type of aggregate score.
  • Check how well baseline variables predict missingness. Logistic regressions can be used to evaluate which baseline covariates and post-randomisation variables predict missingness. Data are not MCAR if a baseline variable predicts missingness in a statistically significant manner.
  • Check if missingness predicts outcomes. Logistic regressions can also be used to examine if missingness is associated with previously observed outcomes. If so, MAR rather than MCAR may be a more plausible assumption.
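
As a rough illustration of the first and third checks, the sketch below uses simulated data rather than real trial data. It compares the share of missing outcomes by arm and fits a one-covariate logistic regression of a missingness indicator on a standardised baseline variable. The data-generating values, the variable names and the hand-written Newton-Raphson fit are all assumptions for illustration; in practice you would use your statistics package's logistic regression command.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, m, iterations=25):
    """Newton-Raphson fit of P(m = 1) = sigmoid(b0 + b1 * x).

    Returns the slope b1, its standard error and a two-sided Wald p-value.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, mi in zip(x, m):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += mi - p             # score for the intercept
            g1 += (mi - p) * xi      # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the inverse information matrix, evaluated at the
    # start of the final iteration (the estimates have converged by then).
    se_b1 = math.sqrt(h00 / det)
    z = b1 / se_b1
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return b1, se_b1, p_value


# Hypothetical trial: missingness depends on standardised baseline age
# (true slope 0.8 on the log-odds scale) but not on treatment arm.
rng = random.Random(2014)
n = 2000
arm = [rng.randint(0, 1) for _ in range(n)]
age_std = [rng.gauss(0, 1) for _ in range(n)]
missing = [1 if rng.random() < sigmoid(-1.0 + 0.8 * a) else 0 for a in age_std]

# Check 1: share of missing outcomes in each arm.
share_missing = {}
for k in (0, 1):
    flags = [mi for mi, ai in zip(missing, arm) if ai == k]
    share_missing[k] = sum(flags) / len(flags)
    print(f"Arm {k}: {share_missing[k]:.1%} missing")

# Check 2: does the baseline covariate predict missingness?
slope, se, p_value = fit_logistic(age_std, missing)
print(f"Slope for baseline age: {slope:.2f} (SE {se:.2f}), p = {p_value:.3g}")
```

Here the baseline covariate clearly predicts missingness, so MCAR would not be a plausible assumption for these simulated data. Note that checks like this can argue against MCAR, but they cannot distinguish MAR from MNAR, since that depends on values that were never observed.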

Ok, so now you know the types of missing data and some basic tests to do. What should you do to “correct” for the missing data? Here is the advice from Faria et al.

  • Missing baseline values. If this is the case, don’t throw out these observations. Mean imputation and multiple imputation (MI) are often good approaches. Multiple imputation “…replaces each missing observation with a set of plausible imputed (predicted) values, drawn from the posterior predictive distribution of the missing data given the observed data.” Note that MI may be less efficient for addressing missingness in baseline values because it imputes in an arm-dependent way and thus may worsen covariate balance, but in most other cases MI is superior. There are two types of MI: joint modelling (MI-JM) and chained equations (MICE). In Stata, one can use the ‘mi impute mvn’ command for MI-JM or the ‘mi impute chained’ command for MICE.
  • Other approaches. Complete case analysis (CCA) only includes individuals in the analysis when there are complete data on all variables at all follow-up points. “This assumes that individuals with complete data are representative of those with missing data, conditional on the variables included in the analysis model.” While this approach is inefficient, it is a useful starting point. Another approach is available case analysis, in which you use the patients with available data for each analysis separately (e.g., cost, QALY). The problem is that the sample for one analysis (e.g., costs) can differ from the sample used for another (e.g., QALYs), making the analyses non-comparable. Another popular approach is inverse probability weighting (IPW), which weights the observed cases by the inverse of their probability of being observed. “IPW is suitable for a monotonic pattern of missing data, in which individuals lost to follow-up do not return to the study.”
  • Can we use simpler approaches? One simple approach is last value carried forward (LVCF). The problem here is that LVCF has been shown to bias estimates (even with MCAR).
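
To see how two of these approaches behave, here is a small sketch comparing complete case analysis with inverse probability weighting on simulated cost data. Everything in it is an assumption for illustration: the binary "severe" baseline indicator, the cost distributions and the missingness rates are hypothetical, and the weights are estimated simply as the observed share within each severity group rather than from a fitted regression model.

```python
import random
from statistics import mean

rng = random.Random(7)
n = 20000

# Hypothetical trial: severe patients have higher costs and are more
# likely to have missing cost data (MAR given the severity indicator).
records = []
for _ in range(n):
    severe = rng.random() < 0.4
    cost = rng.gauss(5000 if severe else 2000, 500)
    observed = rng.random() < (0.5 if severe else 0.9)
    records.append((severe, cost, observed))

# Benchmark that is only available because the data are simulated.
true_mean = mean(cost for _, cost, _ in records)

# Complete case analysis: use only patients whose cost was observed.
cca_mean = mean(cost for _, cost, obs in records if obs)

# IPW step 1: estimate the probability of being observed in each group.
p_observed = {}
for group in (False, True):
    flags = [obs for severe, _, obs in records if severe == group]
    p_observed[group] = sum(flags) / len(flags)

# IPW step 2: weight each observed cost by 1 / P(observed | severity).
weighted_sum = sum(cost / p_observed[severe]
                   for severe, cost, obs in records if obs)
weight_total = sum(1 / p_observed[severe]
                   for severe, _, obs in records if obs)
ipw_mean = weighted_sum / weight_total

print(f"Full-data mean cost: {true_mean:.0f}")
print(f"Complete case mean:  {cca_mean:.0f}")
print(f"IPW mean:            {ipw_mean:.0f}")
```

In this simulation the complete case mean understates costs because severe patients are under-represented among the observed cases, while the weighted mean lands close to the full-data mean. That only works because missingness depends on a variable that is observed for everyone and is included in the weights.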

The paper also discusses likelihood-based models and provides an empirical example. Do read the whole thing.

