Let’s say that you want to predict the impact of some policy intervention, and that a randomized controlled trial (RCT) has examined the impact of that policy on some outcome of interest. At first glance, the way to get the best model fit is to use all of the data in the RCT to estimate the model. While doing so improves in-sample fit, it also risks overfitting the model, which makes out-of-sample predictions less accurate. Further, researchers may engage in data mining, altering the model structure so that the results better match their prior expectations.
Another approach is to set aside a hold-out sample and test how well different models predict outcomes for data on which they were not fit. Why isn’t hold-out sampling done more often? One key reason is that the estimation sample is smaller, so fewer parameters can be estimated (i.e., model fit is worse) and there is more uncertainty. A paper by Todd and Wolpin (2023) argues for the use of hold-out samples by framing the issue as a principal-agent problem. Citing an earlier paper by Schorfheide and Wolpin (2016), they write:
A policy maker, the principal, would like to predict the effects of a treatment at varying treatment levels. The data are available to the policy maker from an RCT that has been conducted for a single treatment level. To assess the impact of alternative treatments, the policy maker engages two modelers, the agents, each of whom estimates their preferred structural model and provides measures of predictive fit.
Modelers are rewarded in terms of model fit. SW [Schorfheide and Wolpin] consider two data venues available to the policy maker. In the first, the no-holdout venue, the modelers have access to the full sample of observations and are evaluated based on the marginal likelihood function they report, which, in a Bayesian framework, is used to update model probabilities. Because the modelers have access to the full sample, there is an incentive to modify their model specifications and thus overstate the marginal likelihood values. SW refer to this behavior as data mining. More specifically, data mining takes the form of data-based modifications of the prior distributions used to obtain posteriors.
In the second, the holdout venue, on the other hand, the modelers have access only to a subset of observations and are asked by the policy maker to predict features of the sample that is held out for model evaluation. Data mining creates a trade-off between providing the full sample, which would otherwise be optimal for prediction, and withholding data. SW provide a qualitative characterization of the behavior of the modelers under the two venues based on analytical derivations and use a numerical example to illustrate how the size and the composition (in terms of observations from the control and treatment groups) of the holdout sample affects the risk of the policy maker. Their numerical example shows that it is possible for the holdout venue to dominate the no-holdout venue because of the data mining that occurs if the modelers have access to the full sample.
This is an interesting approach, and an interesting rationale for making greater use of hold-out samples in model fit exercises.
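The basic tension between in-sample fit and out-of-sample prediction can be illustrated with a small simulation. The sketch below is not the Schorfheide and Wolpin setup: it swaps their Bayesian marginal likelihoods and structural models for ordinary polynomial regressions scored by mean squared error, and the data-generating process, the sample size, the polynomial degrees, and the function names (`compare_venues`, `fit_and_score`) are all assumptions made purely for illustration.

```python
import numpy as np


def mse(y, y_hat):
    """Mean squared prediction error."""
    return float(np.mean((y - y_hat) ** 2))


def fit_and_score(x_fit, y_fit, x_eval, y_eval, degree):
    """Fit a polynomial of the given degree on one sample, score it on another."""
    coefs = np.polyfit(x_fit, y_fit, degree)
    return mse(y_eval, np.polyval(coefs, x_eval))


def compare_venues(n=60, holdout_share=0.5, degrees=(1, 8), seed=0):
    """Compare in-sample fit with hold-out fit for models of varying flexibility.

    Assumption for illustration: the true relationship is linear with noise,
    so the higher-degree polynomial plays the role of an over-flexible model.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)

    # Randomly withhold a share of the observations from the "modelers".
    idx = rng.permutation(n)
    n_holdout = int(n * holdout_share)
    hold, est = idx[:n_holdout], idx[n_holdout:]

    results = {}
    for d in degrees:
        results[d] = {
            # Fit and score on the same data: rewards flexibility.
            "in_sample_mse": fit_and_score(x[est], y[est], x[est], y[est], d),
            # Fit on the estimation sample, score on the withheld data.
            "holdout_mse": fit_and_score(x[est], y[est], x[hold], y[hold], d),
        }
    return results


if __name__ == "__main__":
    for degree, scores in compare_venues().items():
        print(degree, scores)
```

Because the linear model is nested in the degree-8 polynomial, the more flexible model can never look worse in-sample, which is exactly why rewarding in-sample fit invites over-fitting. Whether it also predicts the withheld observations worse depends on the particular draw, although under these assumptions it usually will.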