Addressing 0 values with econometrics

Health care data, particularly spending data, often has a right-skewed distribution with a large number of 0’s. For instance, US health care spending in 2019 was $11,852 per person. However, many people don’t get sick and have no health care spending. Moreover, people generally don’t have negative health care spending. Further, some patients with serious diseases rack up very high health care costs. Clearly this distribution is non-normal.

How do we deal with such an issue?

An NBER white paper by Anirban Basu (2023) examines some potential solutions.

Least squares. Researchers use transformation models to avoid running non-linear specifications of covariates on spending. For instance, they could log-transform spending data and run their regression on log cost (which may be more normally distributed). However, an ordinary least squares regression on ln(spending) measures the impact of covariates on the geometric mean of spending rather than the arithmetic mean; to recover the impact of a covariate on spending itself from the log-transformed model, one must apply a retransformation such as the Duan smearing technique. When there are 0’s in the data, this is even more problematic because ln(0) is undefined. Thus, researchers may add a small constant to the cost variable before taking logs, but the choice of constant is, well, arbitrary.
Two-part models. These models often take the form of first estimating a logit (or probit) model for the likelihood that an individual has zero versus positive cost, and then separately estimating a transformed model conditional on having positive (i.e., non-zero) spending.
Tobit model. This model assumes that there is a latent variable Y* (latent spending). When Y* ≤ 0, observed spending Y is 0. When Y* is positive, observed spending equals latent spending, i.e., Y = Y* if Y* > 0.
Double hurdle model. In his paper, Basu makes the distinction that some 0’s mean different things than others. Consider the case of smoking. Many people don’t smoke. Those who do smoke have highly variable levels of cigarettes smoked per day. Thus, observing a time period (e.g., week, month) where a person has 0 cigarettes could mean that the person was not a smoker, or was a smoker but decided to take the week or month off. Thus, Basu models the 0’s as arising from two separate decisions: a participation decision (smoke or not smoke) and a consumption decision (i.e., how many cigarettes to smoke in a given time period, which could include 0). In this set-up, the double hurdle requires that your latent utility be such that you decide to become a smoker and that, if you are a smoker, your utility from smoking in a given time period is higher than the cost, so that you smoke a positive number of cigarettes. Empirically, however, one may assume that “individuals always smoke once the first hurdle [i.e., smoker vs. non-smoker] is passed…Consequently, the second hurdle is irrelevant, and none of the zeros are generated via a consumption decision.” In this case, one could use a Tobit or Heckman selection model, since all smokers have positive cigarette consumption. On the other hand, if the error terms in the participation and consumption hurdles are independent, then a standard two-part model would suffice.
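
To make the two-part approach and the smearing retransformation a bit more concrete, here is a minimal Python sketch. It is not taken from Basu’s paper: the toy data, the function names (fit_two_part, predict_mean) and the single binary covariate are all assumptions made for illustration. With only one binary covariate, the logit in part 1 reduces to the share of people with positive spending in each group, and the OLS on ln(spending) in part 2 reduces to group means of ln(spending), so no statistics library is needed. The sketch also assumes one common smearing factor, i.e., errors with the same spread across groups on the log scale.

```python
import math

# Hypothetical toy data: (treated flag, annual spending in dollars).
# The values are made up purely for illustration.
data = [
    (0, 0.0), (0, 0.0), (0, 120.0), (0, 450.0), (0, 3000.0),
    (1, 0.0), (1, 300.0), (1, 900.0), (1, 2500.0), (1, 20000.0),
]

def fit_two_part(rows):
    """Fit a two-part model with a single binary covariate.

    Part 1: P(spending > 0 | group), estimated as the group share of
            positive spenders (a logit on one dummy gives the same fit).
    Part 2: OLS of ln(spending) on the dummy among positive spenders,
            which reduces to group means of ln(spending).
    Assumes every group has at least one positive spender.
    """
    groups = sorted({g for g, _ in rows})
    p_positive, mean_log = {}, {}
    residuals = []
    for g in groups:
        ys = [y for gg, y in rows if gg == g]
        pos = [y for y in ys if y > 0]
        p_positive[g] = len(pos) / len(ys)
        mean_log[g] = sum(math.log(y) for y in pos) / len(pos)
        residuals += [math.log(y) - mean_log[g] for y in pos]
    # Duan smearing factor: mean of the exponentiated log-scale residuals.
    smear = sum(math.exp(r) for r in residuals) / len(residuals)
    return p_positive, mean_log, smear

def predict_mean(g, p_positive, mean_log, smear):
    """E[spending | group] = P(spending > 0) * E[spending | spending > 0]."""
    return p_positive[g] * math.exp(mean_log[g]) * smear

p_positive, mean_log, smear = fit_two_part(data)
for g in (0, 1):
    # Naive retransformation: exp of the mean log, i.e. the geometric mean.
    naive = p_positive[g] * math.exp(mean_log[g])
    smeared = predict_mean(g, p_positive, mean_log, smear)
    print(f"group {g}: P(y>0)={p_positive[g]:.2f}, "
          f"naive={naive:,.0f}, smeared={smeared:,.0f}")
```

Because the log-scale residuals average to zero, the smearing factor is always at least 1, so the smeared prediction is never below the naive one that simply exponentiates the mean of ln(spending). With real data and several covariates, the same two steps would normally be run with a regression package rather than by hand.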

The paper also describes how to calculate marginal effects with many zero observations as well as a variety of other empirical applications. The full paper is here.