Oftentimes, researchers will take an existing risk adjustment model and apply it to new data. The question is, which risk adjustment model should be used? The most popular approach is to calibrate the model (i.e., estimate its parameters or coefficients) using one sample and validate it using a separate data file. This standard approach, however, is not the only way to choose a risk adjustment algorithm.
Today, the Healthcare Economist draws on a 2009 working paper by Ellis and Mookim to explain their K-fold cross-validation approach.
K-fold Algorithm
The algorithm for the K-fold technique can be described in the following steps:
- Randomly split the sample into K equal parts
- For the kth part, fit the model to the other K-1 parts of the data, and use this model to calculate the prediction errors in the kth part of the data.
- Repeat the above step for k = 1, 2, …, K and combine the K sets of prediction errors to create a full sample of prediction errors (a code sketch follows this list).
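To make these steps concrete, here is a minimal sketch in Python/NumPy. It is illustrative only: the function name, variable names, and the use of a plain OLS fit as the risk adjustment model are my assumptions, not code from the paper. X is the N×M matrix of risk adjusters (including a constant column) and y is the N×1 vector of outcomes (e.g., annual spending).

```python
import numpy as np

def kfold_prediction_errors(X, y, K=10, seed=0):
    """Return the full sample of cross-validated prediction errors."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # randomly split into K parts
    errors = np.empty(n)
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # Fit OLS on the other K-1 parts of the data
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Prediction errors on the held-out k-th part
        errors[test] = y[test] - X[test] @ beta
    return errors
```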
If K equals the sample size (N), this is called N-fold or “leave-one-out” cross-validation. “Leave-v-out” is a more elaborate and computationally time-consuming version of cross-validation that involves leaving out all possible subsets of v cases.
K-fold cross-validation is useful for model implementers, not model developers. In other words, it can be used to compare alternative specifications that were previously developed on other data, not to develop new models and specifications on the same data. The standard 50/50 approach with separate estimation and validation samples is still needed for model development.
Ellis and Mookim apply the K-fold algorithm to commercial claims data from the Medstat MarketScan Research Databases.
Using K-fold to test for overfitting
One can test for overfitting using the K-fold algorithm as follows. If Y is an N×1 array of the dependent variable and X is the N×M matrix of explanatory variables, let Z = [X Y]. The paper notes that the cross-product matrix Z′Z contains all of the information needed to generate all conventional regression statistics, including the coefficient estimates (betas), the residual standard error (RSE), and R-squared. The algorithm the authors implement for a sample of size N is as follows.
- Randomly sort the observations so that the original ordering has no significance.
- Estimate the OLS model using the full data set Z, retaining Q = Z′Z.
- For each of the K subsamples of size N/K, created without replacement, generate Qₖ = Zₖ′Zₖ and take the matrix difference to generate Q₋ₖ = Q − Qₖ = Z′Z − Zₖ′Zₖ.
- Use Q₋ₖ to calculate the array of OLS regression coefficients β₋ₖ(Q₋ₖ), and then generate predicted values Ŷₖ for the k-th subsample, whose observations were not used in estimating β₋ₖ(Q₋ₖ). Save these fitted values in a single array of cross-validated predictions, Ŷ^K-fold.
- After repeating the previous two steps for all K subsamples, generate validated RSE and R-squared measures for the full sample of size N using the original Y and Ŷ^K-fold.
- Run the Copas regression of Y on Ŷ^K-fold (sketched in the code below).
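A compact sketch of this cross-product-matrix approach, again in Python/NumPy, might look like the following. It is an illustrative reconstruction under my own assumptions (function and variable names, a constant column already included in X, and the usual reading of the Copas regression in which a slope well below 1 signals overfitting), not the authors' implementation.

```python
import numpy as np

def kfold_overfitting_test(X, y, K=10, seed=0):
    """Cross-validated R-squared, RSE, and Copas slope via the Z'Z trick."""
    n, m = X.shape
    Z = np.column_stack([X, y])                # Z = [X Y]
    Q = Z.T @ Z                                # full-sample cross-product matrix Z'Z
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), K)   # random subsamples, no replacement
    y_hat = np.empty(n)                        # Yhat^K-fold
    for test in folds:
        Zk = Z[test]
        Q_minus_k = Q - Zk.T @ Zk              # Q_{-k} = Z'Z - Zk'Zk
        # Partition Q_{-k} into its X'X and X'Y blocks and solve for beta_{-k}
        beta_minus_k = np.linalg.solve(Q_minus_k[:m, :m], Q_minus_k[:m, m])
        y_hat[test] = X[test] @ beta_minus_k   # predictions for the held-out fold
    # Validated fit statistics on the full sample of N observations
    resid = y - y_hat
    rse = np.sqrt(resid @ resid / (n - m))     # RSE with n - m degrees of freedom (one convention)
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    # Copas regression of Y on the cross-validated predictions;
    # a slope well below 1 is taken as evidence of overfitting
    A = np.column_stack([np.ones(n), y_hat])
    copas_coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return r2, rse, copas_coef[1]
```

The appeal of working with Q = Z′Z is that each fold's regression reuses the full-sample cross-product matrix: only the small matrix Zₖ′Zₖ needs to be recomputed, which keeps the procedure inexpensive even for large N.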
To increase the precision of these estimates, one can replicate these steps a number of times. Although K-fold cross-validation can help identify overfitting that results from estimation, it cannot easily be used to understand overfitting due to model design and selection.
Source
- Randall P. Ellis and Pooja G. Mookim. “Cross-Validation Methods for Risk Adjustment Models.” Boston University Working Paper, July 1, 2009.
Given that most people seem to do risk adjustment with (1) large, (2) administrative datasets, (3) using linear OLS, my thought is that the models tend to be underfitted rather than overfitted.
The R-squareds etc. I’ve seen from training runs tend to indicate poor model fit rather than overfitted models.
But your early emphasis, in this series on risk adjustment, on methods of detecting and minimizing overfitting certainly suggests your experience and/or expectations are different from mine. Please expand on this.
Enjoying the series, BTW, thanks.