Let’s say you are studying patients taking Drug A for Disease X. Further assume that you want to identify predictors of whether a patient is adherent to Drug A. Using health insurance claims data, you define your outcome, medication adherence, as whether the patient’s proportion of days covered (PDC) is greater than or equal to 80%, the threshold PQA most often uses for measuring adherence. Now the question is, which statistical approach do you use? A paper by Zullig et al. (2019) tests out three options:
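To make the outcome concrete, here is a simplified PDC calculation in Python. This is a sketch with made-up fill dates, not the full PQA specification, which also shifts overlapping fills forward in time and handles details such as inpatient stays:

```python
from datetime import date, timedelta

def proportion_days_covered(fills, period_start, period_end):
    """Share of days in the measurement period on which the patient had
    medication on hand, based on fill dates and days supplied."""
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if period_start <= day <= period_end:
                covered.add(day)
    total_days = (period_end - period_start).days + 1
    return len(covered) / total_days

# Hypothetical patient: three 30-day fills over a 90-day period,
# with one late refill creating a gap in coverage
fills = [(date(2019, 1, 1), 30), (date(2019, 2, 10), 30), (date(2019, 3, 12), 30)]
pdc = proportion_days_covered(fills, date(2019, 1, 1), date(2019, 3, 31))
adherent = pdc >= 0.80
```

Despite the gap in early February, this patient is covered for 80 of 90 days, so they clear the 80% threshold and would be coded as adherent.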
Logistic regression. The most common approach is parametric estimation such as a logistic regression. In the paper, patient characteristics were identified using backward selection, with an alpha of 0.05 required for a variable to remain as a predictor. This approach, however, may overfit the data.
Least absolute shrinkage and selection operator (LASSO). LASSO is a shrinkage estimator. Whereas a logistic regression retains every variable that meets the alpha threshold, LASSO penalizes models that have too many predictors. More formally, LASSO uses L1 regularization, which adds a penalty proportional to the sum of the absolute values of the coefficients. The LASSO tuning parameter sets the strength of the penalty: if the tuning parameter is 0, all regression coefficients are kept at their unpenalized values; as the tuning parameter grows, more coefficients are shrunk to exactly zero and the model becomes more parsimonious.
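The shrinkage mechanism can be illustrated with the soft-thresholding operator that LASSO coordinate-descent solvers apply to each coefficient. This is a minimal sketch; the `ols_estimates` values below are made-up numbers, not coefficients from the paper:

```python
def soft_threshold(z, lam):
    """LASSO's per-coefficient update: shrink the unpenalized estimate z
    toward zero by lam, and set it to exactly zero when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalized coefficient estimates for four predictors
ols_estimates = [1.2, -0.4, 0.05, 0.8]

no_penalty = [soft_threshold(z, 0.0) for z in ols_estimates]  # identical to the unpenalized fit
moderate   = [soft_threshold(z, 0.3) for z in ols_estimates]  # the smallest effect is zeroed out
heavy      = [soft_threshold(z, 1.0) for z in ols_estimates]  # only the largest effect survives
```

With no penalty every predictor stays in the model; as the tuning parameter rises, small coefficients drop out entirely, which is how LASSO performs variable selection and model fitting in one step.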
Random forest. Random forests rely on machine learning to identify predictors. As the name implies, a random forest consists of a large number of individual decision trees that operate as an ensemble. As described in Towards Data Science, “Each tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction…” Random forests use bootstrapped data samples (the bootstrap is drawn with replacement to the same size as the original data set), and each individual tree draws from a random subset of patient characteristics. Random forests allow for interactions and nonlinearity in a way that does not need to be pre-specified by the researcher. Thus, random forests would likely perform better if the true model were complex (i.e., nonlinear).
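A toy version of those two mechanisms, bootstrapping the data and majority voting across trees, can be sketched as follows. This is an illustration with made-up data, not the paper’s model, and it uses single-split “stumps” on a randomly chosen feature where a real implementation grows full trees:

```python
import random

def bootstrap_sample(data, rng):
    """Draw with replacement to the same size as the original data set."""
    return [rng.choice(data) for _ in data]

def train_stump(sample, n_features, rng):
    """Stand-in for a decision tree: split on a random feature at the
    sample mean, and let each side vote for its majority class."""
    j = rng.randrange(n_features)
    threshold = sum(x[j] for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x[j] <= threshold]
    right = [y for x, y in sample if x[j] > threshold]
    vote_left = round(sum(left) / len(left)) if left else 0
    vote_right = round(sum(right) / len(right)) if right else 1
    return j, threshold, vote_left, vote_right

def forest_predict(stumps, x):
    """Each tree votes; the majority class is the ensemble prediction."""
    votes = [vl if x[j] <= t else vr for j, t, vl, vr in stumps]
    return round(sum(votes) / len(votes))

rng = random.Random(0)
# Toy data: (features, adherent?), with both features related to adherence
data = [([i / 10, i / 10 + rng.uniform(-0.05, 0.05)], int(i >= 5))
        for i in range(10)]
stumps = [train_stump(bootstrap_sample(data, rng), 2, rng) for _ in range(25)]
pred_low = forest_predict(stumps, [0.1, 0.1])   # low-feature patient
pred_high = forest_predict(stumps, [0.9, 0.9])  # high-feature patient
```

Individual stumps can be noisy or trained on unrepresentative bootstrap draws, but the majority vote across 25 of them recovers the underlying pattern, which is the core idea behind the ensemble.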
Which model works best? Of course, there is not a single answer, as model fit will vary by data set. However, Zullig et al. (2019) look at medication adherence to statins after an acute myocardial infarction. They find that:
…[for] the standard logistic regression model with backward selection; discrimination was moderate, with a C‐index of 0.673. There were small differences in the coefficients after re‐estimation of the LASSO regression model…The discrimination of the LASSO regression model remained moderate (C‐index 0.677 for training dataset and C‐index 0.664 for validation dataset). The discrimination of the third model, the random forest‐generated model, was also moderate (C‐index 0.666).
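For a binary outcome like adherence, the C-index reported above is the probability that the model scores a randomly chosen adherent patient higher than a randomly chosen non-adherent patient (equivalent to the area under the ROC curve). A minimal sketch with made-up predicted probabilities, not the paper’s data:

```python
from itertools import product

def c_index(scores, outcomes):
    """Concordance: among all (adherent, non-adherent) patient pairs, the
    share in which the adherent patient received the higher predicted
    score. Tied scores count as half-concordant."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = sum(1.0 if p > n else 0.5 if p == n else 0.0
                     for p, n in product(pos, neg))
    return concordant / (len(pos) * len(neg))

# Hypothetical predicted adherence probabilities and observed outcomes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
outcomes = [1, 1, 0, 1, 0, 0]
c = c_index(scores, outcomes)
```

A C-index of 0.5 is no better than chance and 1.0 is perfect discrimination, which is why values in the 0.66 to 0.68 range for all three models are described as only moderate.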
After all that, the authors recommended the logistic regression in this case. While LASSO has the benefit of creating more parsimonious models and random forests can better handle nonlinear relationships, when the models’ predictive ability is similar, the logistic regression’s simplicity and straightforward approach won out. In short, consider the basic models first, but also consider more sophisticated models for cases where predictive ability could be improved.
- Zullig, Leah L., Shelley A. Jazowski, Tracy Y. Wang, Anne Hellkamp, Daniel Wojdyla, Laine Thomas, Lisa Egbuonu‐Davis, Anne Beal, and Hayden B. Bosworth. “Novel application of approaches to predicting medication adherence using medical claims data.” Health Services Research 54, no. 6 (2019): 1255-1262.