When creating risk adjustment models to predict health care spending, many researchers aim to maximize the goodness-of-fit of the model. Maximizing goodness of fit, however, can lead to overfitting.
Overfitting occurs when a model describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.
For instance, if you want to predict a hospital’s cost for treating different patients, you may wish to include patient comorbidities in your model. Because complex patients are often very costly, you may also wish to include comorbidity interactions. If your sample size is not sufficiently large and you include too many comorbidities, the model may be overfit: many of the comorbidity interactions describe only a handful of patients (e.g., 2 or 3). In other words, the excessive number of parameters means that the variables are in essence measuring an individual-level effect rather than a comorbidity effect, and the predictive power of the model is likely to be weak in future time periods.
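To get a feel for how quickly interaction terms run out of supporting patients, the following minimal sketch counts how many simulated patients carry each pair of comorbidities. The sample size, number of comorbidities, prevalence, and all variable names are illustrative assumptions, and the comorbidities are drawn independently, which real data would not satisfy.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative values (not taken from any real data set).
n_patients, n_comorbidities, prevalence = 200, 10, 0.10

# Each row is a patient, each column a 0/1 comorbidity indicator.
flags = rng.random((n_patients, n_comorbidities)) < prevalence

# For every pair of comorbidities, count patients who have both.
pair_counts = {
    (i, j): int(np.sum(flags[:, i] & flags[:, j]))
    for i, j in itertools.combinations(range(n_comorbidities), 2)
}

# Interaction terms supported by only a handful of patients.
sparse_pairs = [pair for pair, count in pair_counts.items() if count <= 3]

print(f"{len(pair_counts)} pairwise interaction terms, "
      f"{len(sparse_pairs)} supported by 3 or fewer patients")
```

With ten comorbidities there are already 45 pairwise interactions, and at this assumed prevalence each pair is expected to appear in only about two patients, so most interaction coefficients would be fit to a few individuals.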
Does your model suffer from overfitting? The remainder of this post describes a statistical test which can provide the answer.
One popular test of overfitting is the Copas test, developed by John B. Copas.
Here are the steps:
- Randomly split the data into two subsets A and B (e.g., 50% versus 50%, 67% versus 33%) for cross validation. This function chooses equal sizes as default.
- Fit a linear or generalized linear model Y ~ X*betas on subset A, and retain the coefficients (betas).
- Compute Yhat for subset B as the predicted values from the regression in Step 2 (fit on subset A), using the coefficients (betas) obtained from Step 2.
- Fit a linear model of true values (Y) versus predicted values (Yhat), Y ~ alpha + beta*Yhat, on subset B.
- Test the null hypothesis (H0: alpha=0 and beta=1) from Step 4. Ellis and Mookim recommend simply testing the null hypothesis
- Repeat steps 1 through 5 multiple times.
If you reject the null hypothesis, then overfitting may be a problem. To reduce overfitting, you should both reduce the number of variables in your model and check for outliers.
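The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: it uses ordinary least squares via NumPy for the model in Step 2, tests the joint hypothesis alpha=0 and beta=1 with a standard F-test, and runs on simulated data. The function names `copas_test` and `repeated_copas` and the parameter `frac_a` are hypothetical.

```python
import numpy as np

def copas_test(X, y, frac_a=0.5, rng=None):
    """One random split; returns (alpha, beta, F statistic, p-value)."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # Step 1: random split into subsets A and B.
    perm = rng.permutation(n)
    n_a = int(round(frac_a * n))
    idx_a, idx_b = perm[:n_a], perm[n_a:]

    # Step 2: OLS on subset A (an intercept column is added here).
    Xa = np.column_stack([np.ones(len(idx_a)), X[idx_a]])
    betas = np.linalg.lstsq(Xa, y[idx_a], rcond=None)[0]

    # Step 3: predictions for subset B using the subset-A coefficients.
    Xb = np.column_stack([np.ones(len(idx_b)), X[idx_b]])
    yb, yhat = y[idx_b], Xb @ betas

    # Step 4: regress Y on Yhat within subset B.
    Z = np.column_stack([np.ones(len(yb)), yhat])
    alpha, beta = np.linalg.lstsq(Z, yb, rcond=None)[0]

    # Step 5: F-test of H0: alpha=0 and beta=1 (two restrictions).
    rss_unrestricted = np.sum((yb - (alpha + beta * yhat)) ** 2)
    rss_restricted = np.sum((yb - yhat) ** 2)  # imposes alpha=0, beta=1
    df = len(yb) - 2
    f_stat = ((rss_restricted - rss_unrestricted) / 2) / (rss_unrestricted / df)
    # Closed-form upper tail of an F(2, df) distribution.
    p_value = (1 + 2 * f_stat / df) ** (-df / 2)
    return alpha, beta, f_stat, p_value

def repeated_copas(X, y, n_splits=100, seed=0):
    """Step 6: repeat over many random splits; columns: alpha, beta, F, p."""
    rng = np.random.default_rng(seed)
    return np.array([copas_test(X, y, rng=rng) for _ in range(n_splits)])

# Simulated data with assumed coefficients, for illustration only.
rng = np.random.default_rng(42)
n = 400
true_betas = np.array([1.0, -0.5, 0.25, 0.0, 2.0])
X = rng.normal(size=(n, len(true_betas)))
y = X @ true_betas + rng.normal(size=n)

res = repeated_copas(X, y, n_splits=50)
print("mean beta across splits:", res[:, 1].mean())
print("share of splits rejecting H0 at 5%:", np.mean(res[:, 3] < 0.05))
```

In this sketch a slope well below 1 in the second-stage regression, or a large share of splits rejecting the null, would point toward overfitting. For a generalized linear model, Step 2 would need to be swapped for the appropriate estimator.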
- J. B. Copas. Regression, Prediction and Shrinkage. Journal of the Royal Statistical Society, Series B (Methodological), Vol. 45, No. 3 (1983), pp. 311-354.
- Randall P. Ellis and Pooja G. Mookim. Cross-Validation Methods for Risk Adjustment Models. Working Paper, Boston University, July 1, 2009.
- Estimating Log Models: To Transform or Not to Transform? NBER Technical Working Papers, National Bureau of Economic Research, Inc.