Sample Selection vs. Two-part Model

Much health care data is characterized by a large cluster of observations at 0 and a right-skewed distribution of the remaining outcomes. For instance, people who do not get sick generally use $0 of medical care. Those who do get sick use a varying amount of medical care dollars, and a substantial number of them are outliers with extremely expensive care. How do health economists take these features into account?

David Madden looks at two alternatives to correct for the shape of the distribution in his 2008 JHE paper: sample selection and two-part models. Zero consumption of medical care can result from two different decisions: a participation decision and a consumption decision. For instance, in the case of smoking, individuals may decide not to smoke no matter how cheap cigarettes get (participation decision). On the other hand, some smokers may decide not to smoke during a given time period because cigarettes are very expensive or they have low income (consumption decision). Since people cannot smoke a negative number of cigarettes, there may still be a cluster of observations at zero.

Assume that an individual's utility from participation is w=α’Z + v. If w>0, then d=1 (the individual participates), and if w≤0, then d=0 (the individual does not participate). For consumption, individuals choose y**=max[0,y*], where y*= β’X + u. A general likelihood, with Π0 taken over observations with zero consumption and Π+ over observations with positive consumption, can be written as follows:

  • L0 = Π0 [1-P(v>-α’Z) P(u>-β’X |v>-α’Z)] Π+ P(v>-α’Z) P(u > -β’X|v>-α’Z) g(y|v>-α’Z,u > -β’X)
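To make the setup concrete, here is a minimal simulation sketch of the participation and consumption decisions. It rests on assumptions that are not in the paper summary: one covariate each in Z and X plus an intercept, standard normal errors with correlation rho, and observed consumption equal to d times y**. The function name `simulate` and all parameter values are hypothetical.

```python
import random

def simulate(n, alpha, beta, rho=0.0, seed=0):
    """Draw n observations from the participation/consumption setup.

    Assumed for illustration: alpha and beta are (intercept, slope) pairs,
    and (v, u) are standard normal with correlation rho.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        x = rng.gauss(0, 1)
        v = rng.gauss(0, 1)
        # u is correlated with v whenever rho != 0
        u = rho * v + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        w = alpha[0] + alpha[1] * z + v       # participation utility w = α'Z + v
        d = 1 if w > 0 else 0                 # participation indicator
        y_star = beta[0] + beta[1] * x + u    # latent consumption y* = β'X + u
        y = d * max(0.0, y_star)              # zero unless both hurdles are cleared
        rows.append((z, x, d, y))
    return rows

data = simulate(1000, alpha=(0.0, 1.0), beta=(0.5, 1.0), rho=0.3)
share_zero = sum(1 for _, _, _, y in data if y == 0) / len(data)
print(f"share of zeros: {share_zero:.2f}")
```

Setting rho to 0 gives independent errors; a nonzero rho lets the unobservables behind participation and consumption move together.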

If u and v are independent, then we have the Cragg model:

  • L1 = Π0 [1-P(v>-α’Z) P(u>-β’X)] Π+ P(v>-α’Z) P(u > -β’X) g(y|u > -β’X)
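As a rough sketch, the Cragg likelihood can be evaluated in a few lines once distributions are chosen. The version below assumes v is standard normal and u is normal with standard deviation sigma; these distributional choices, the function names, and the row layout (Z, X, y) are illustrative assumptions rather than anything specified in the paper summary.

```python
import math

def norm_cdf(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def norm_logpdf(t):
    """Log of the standard normal density."""
    return -0.5 * t * t - 0.5 * math.log(2.0 * math.pi)

def cragg_loglik(alpha, beta, sigma, rows):
    """Log-likelihood with independent u and v.

    rows: iterable of (Z, X, y), where Z and X are lists of covariates
    and y >= 0 is observed consumption.
    """
    ll = 0.0
    for Z, X, y in rows:
        az = sum(a * z for a, z in zip(alpha, Z))
        bx = sum(b * x for b, x in zip(beta, X))
        p_part = norm_cdf(az)           # P(v > -α'Z), assuming v ~ N(0, 1)
        p_cons = norm_cdf(bx / sigma)   # P(u > -β'X), assuming u ~ N(0, sigma^2)
        if y == 0:
            # a zero arises unless both hurdles are cleared
            ll += math.log(1.0 - p_part * p_cons)
        else:
            # P(u > -β'X) g(y | u > -β'X) collapses to the untruncated density of y
            ll += math.log(p_part) + norm_logpdf((y - bx) / sigma) - math.log(sigma)
    return ll
```

In practice this function would be handed to a numerical optimizer; the point here is only to show how the zero and positive observations enter differently.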

If we assume that the participation constraint dominates the consumption constraint (which is likely in the smoking example, but maybe not for drinking), then we have P(y*>0|d=1)=1 and g(y*|y*>0,d=1)=g(y*|d=1). This means that if you are a smoker, you will have at least one cigarette per period. When the participation constraint dominates, we can ignore the consumption decision, and we have the following likelihood function, which corresponds to the Heckman selection model.

  • L2 = Π0 [1-P(v>-α’Z)] Π+ P(v>-α’Z) g(y|v>-α’Z)

If, in addition, u and v are assumed to be independent, then we are left with a probit for participation and OLS for consumption among participants. This is the two-part model:

  • L3 = Π0 [1-P(v>-α’Z)] Π+ P(v>-α’Z) g(y)
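The practical appeal of the two-part model is that its log-likelihood splits into two pieces that share no parameters, so each can be estimated on its own. The sketch below illustrates this under assumptions of my own: v is standard normal (giving a probit) and g(y) is a normal density in levels with mean β’X and standard deviation sigma. The function name and the (Z, X, y) row layout are hypothetical.

```python
import math

def norm_cdf(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def two_part_loglik(alpha, beta, sigma, rows):
    """Return (participation piece, consumption piece) of the log-likelihood.

    rows: iterable of (Z, X, y) with Z, X lists of covariates and y >= 0.
    """
    part = 0.0  # probit piece: depends only on alpha
    cons = 0.0  # positive-outcome piece: depends only on (beta, sigma)
    for Z, X, y in rows:
        az = sum(a * z for a, z in zip(alpha, Z))
        if y == 0:
            part += math.log(1.0 - norm_cdf(az))
        else:
            bx = sum(b * x for b, x in zip(beta, X))
            r = (y - bx) / sigma
            part += math.log(norm_cdf(az))
            # assumed normal density for g(y); maximizing this over beta is OLS
            cons += -0.5 * r * r - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return part, cons
```

Because the first piece involves only alpha and the second only beta and sigma, maximizing their sum is the same as running a probit on all observations and a regression on the positive ones.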

Which of these models works best empirically?


Madden looks at the fit of regressions that model smoking and drinking behavior using a wide variety of covariates. In general, the two-part model seems to perform better in the data used for this study, but the author wisely notes that the choice between the Heckman selection model and the two-part model should be made on a case-by-case basis.


  1. Personally, I don't think he adds anything to the debate.
    More interestingly, how do you implement it in a longitudinal setting?
    Can I still assume independence and estimate the two parts separately?
