Difference in Difference Estimation

Difference in Difference (DD) is a commonly used empirical estimation technique in economics. Let us take a hypothetical example where a state (Wisconsin) passes a bill which makes employer-provided health insurance tax deductible. Let us also assume that in the year after the bill passed (year 2) the percentage of firms offering health insurance increased by 50% compared to the year before the bill was passed (year 1). In order to estimate the impact of the bill on the percentage of firms offering health insurance, we could simply do a ‘before and after’ analysis and conclude that the bill increased insurance offerings by 50%. The problem is that there could be a trend over time for more employers to offer insurance. A before-and-after comparison cannot tell us whether the tax deductibility or the time trend caused this increase in firm offering.

One way to identify the impact of the bill is to run a DD regression. If there is a state (California) that did not change the way it treated employer-provided health insurance, we could use it as a control group and compare the changes in Wisconsin and California between the two years.

We will run the regression:

Y=β_0 + β_1*T + β_2*WI + β_3*(T*WI) + e

Y is the percentage of firms offering health insurance in each state in each time period. T is a time dummy, WI is a state dummy for Wisconsin, and T*WI is the interaction of the time dummy and the Wisconsin state dummy.

The chart below displays the percentage of firms offering insurance in each state and time period.

          California   Wisconsin
Year 1    a            b
Year 2    c            d

The next chart explains what each coefficient in the regression represents.

Coefficient   Calculation
β_0           a
β_1           c-a
β_2           b-a
β_3           (d-b)-(c-a)

We can see that β_0 is the baseline average (California in year 1), β_1 represents the time trend in the control group, β_2 represents the difference between the two states in year 1, and β_3 represents the difference in the changes over time. Assuming that both states would have followed the same health insurance trend over time had the bill not passed, we have now controlled for a possible national time trend. Under that assumption, β_3 identifies the impact of the tax deductibility on employers offering insurance.
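
The mapping from the table of cell values to the four coefficients can be checked numerically. The sketch below uses made-up values for a, b, c and d (they are assumptions for illustration, not figures from the example) and solves the regression on the four state-year cells. With four cells and four coefficients the fit is exact, so the coefficients come out equal to the calculations in the table above.

```python
import numpy as np

# Hypothetical cell values (percent of firms offering insurance).
# These numbers are invented for illustration only.
a, b = 40.0, 30.0   # Year 1: California, Wisconsin
c, d = 44.0, 45.0   # Year 2: California, Wisconsin

# One row per state-year cell; columns are [constant, T, WI, T*WI].
X = np.array([
    [1, 0, 0, 0],   # California, year 1
    [1, 0, 1, 0],   # Wisconsin, year 1
    [1, 1, 0, 0],   # California, year 2
    [1, 1, 1, 1],   # Wisconsin, year 2
], dtype=float)
y = np.array([a, b, c, d])

# Four equations and four unknowns, so the system is solved exactly.
beta = np.linalg.solve(X, y)
b0, b1, b2, b3 = beta

naive_before_after = d - b          # Wisconsin change only
dd_estimate = (d - b) - (c - a)     # Wisconsin change minus California change

print("beta:", beta)
print("before/after estimate:", naive_before_after)
print("DD estimate:", dd_estimate)
```

With these invented numbers the before-and-after comparison attributes the whole Wisconsin change to the bill, while β_3 subtracts the change that California experienced over the same period.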


  1. This is the most concise and clear explanation I have ever read.

    Y=b_0 + b_1*T + b_2*CA + b_3*(T*CA) + e

    However, one point is still not clear to me. Could you please explain more about the dummy variable T? Does it mean year1 (yes=1, no=0) and year2 (yes=1, no=0), or year (year1=1 and year2=0)? One variable or two variables?
    Also, do I have to put the Wisconsin dummy into the same regression? If I don’t include Wisconsin, how could I know whether there is a policy effect?

    I would very much appreciate it if you could take the time to guide me. Thanks.


  2. It should be one variable: T=0 for year 1 and T=1 for year 2. If you went with the two-variable misspecification you mentioned, a statistical package that knows what it is doing would just wind up dropping one of them, and you would be left with the one variable! The same thing applies to the Wisconsin/CA dummies.

    The WI in his formulation is the Wisconsin dummy. You decided you liked CA better, so you have a CA dummy. If those are your only two states, then you cannot have dummies for both. You would have a situation of perfect collinearity. Basically, the vector of ones for the intercept would equal the sum of the CA and WI vectors. So you only use 1 or the other, not both.

    As a rule of thumb, when you run a regression you are not allowed to have a complete set of dummy variables unless you get rid of the constant term. For example, you cannot have variables for both male and female, but rather just one or the other. If you have age bands for under 18, 19-39, 40-64, and 65+, you can only use three of those dummies, not all four.

    Also, the original post does have one problem, in that he says “Y is the percentage of firms offering health insurance in each state in each time period.” Clearly if you did that, you would have exactly 4 observations: CA year 1, CA year 2, WI year 1, and WI year 2. With four coefficients to estimate, that leaves no degrees of freedom. You need matched data from individual firms within each state both before and after the policy change. For example, say the policy change happened in 2004. You would want to have the insurance provision data for 50 CA firms, both in 2003 and 2005, as well as the insurance provision data for 50 WI firms, both in 2003 and 2005. And since you are likely to be working with binary data on the LHS (did the firm offer health insurance?), you’d want to run a probit or logit in this case rather than simple OLS.
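
The collinearity and degrees-of-freedom points in the reply above can be illustrated with simulated firm-level data. This is a rough sketch under stated assumptions: the offer probabilities are invented, the firms are drawn independently in each year rather than tracked as a matched panel, and the fit is plain least squares on the 0/1 outcome (a linear probability model), not the probit or logit the reply recommends.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # firms per state per year, matching the size used in the reply

# Hypothetical offer probabilities per (state, T) cell; illustration only.
p = {("CA", 0): 0.40, ("CA", 1): 0.44, ("WI", 0): 0.30, ("WI", 1): 0.45}

rows = []
for (state, t), prob in p.items():
    offers = rng.random(n) < prob            # True if the firm offers insurance
    wi = 1.0 if state == "WI" else 0.0
    for offered in offers:
        rows.append((float(offered), float(t), wi))
data = np.array(rows)
y, T, WI = data[:, 0], data[:, 1], data[:, 2]
CA = 1.0 - WI
const = np.ones_like(y)

# Constant plus BOTH state dummies: three columns, but const = CA + WI,
# so the matrix only has rank 2 (perfect collinearity).
X_trap = np.column_stack([const, CA, WI])
rank_trap = np.linalg.matrix_rank(X_trap)

# DD design with a single state dummy has full column rank.
X = np.column_stack([const, T, WI, T * WI])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
dof = len(y) - X.shape[1]                    # 200 observations - 4 coefficients

def cell_mean(t, wi):
    """Share of firms offering insurance in one state-year cell."""
    return y[(T == t) & (WI == wi)].mean()

dd_from_means = (cell_mean(1, 1) - cell_mean(0, 1)) - (cell_mean(1, 0) - cell_mean(0, 0))

print("rank with both state dummies:", rank_trap)
print("degrees of freedom:", dof)
print("interaction coefficient:", beta[3], "DD of cell means:", dd_from_means)
```

Because the model has one coefficient per cell, the least-squares interaction coefficient equals the difference in differences of the cell shares, but now with 196 degrees of freedom left over for inference.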

  3. What if I have more than two years data, say 4 years.
    Should T= 0, 1, 2, 3? or use T1=0,1 T2=0,1 T3=0,1?

  4. It depends on whether the trend is linear or not. You can fit both models and compare them using a likelihood ratio test. If the categorical model is significantly “more explanatory”, then you need the “more complicated” categorical model. If it is not, the linear model is sufficient. If you use the categorical model, keep in mind that each coefficient compares that time period to time 0: the coefficient for T1 represents the difference between T1 and T0, the coefficient for T2 represents the difference between T2 and T0, etc.

    If you generally see the values of the coefficients getting progressively larger (or smaller), then you might have a linear trend. If they go up and down, you’ll likely need the categorical model.
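
The comparison described in the reply above could be sketched as follows for four years of data. Everything here is an assumption for illustration: the data are simulated with a deliberately non-linear pattern over time, the errors are treated as Gaussian so the likelihood ratio statistic can be written in terms of residual sums of squares, and the treatment dummy and interaction are left out to keep the focus on the time coding.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n_per_year = 100
years = np.repeat(np.arange(4), n_per_year)          # T = 0, 1, 2, 3

# Hypothetical outcome: year effects that go up and down, plus noise.
year_effect = np.array([0.0, 2.0, 1.0, 3.0])
y = 10.0 + year_effect[years] + rng.normal(0.0, 1.0, years.size)

def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

const = np.ones(years.size)

# Linear coding: a single variable T taking values 0, 1, 2, 3.
X_linear = np.column_stack([const, years.astype(float)])

# Categorical coding: dummies T1, T2, T3, with year 0 as the reference.
X_categorical = np.column_stack(
    [const] + [(years == k).astype(float) for k in (1, 2, 3)]
)

rss_lin = rss(X_linear, y)
rss_cat = rss(X_categorical, y)

# The linear model is nested in the categorical one (T = T1 + 2*T2 + 3*T3),
# so under Gaussian errors the likelihood ratio statistic is n * ln(RSS ratio).
n = y.size
lr_stat = n * math.log(rss_lin / rss_cat)
df = X_categorical.shape[1] - X_linear.shape[1]       # 4 - 2 = 2 extra parameters

# A chi-square variable with exactly 2 degrees of freedom has upper tail exp(-x/2).
p_value = math.exp(-lr_stat / 2.0)

print("LR statistic:", lr_stat, "df:", df, "p-value:", p_value)
```

The closed-form p-value only works because the two codings differ by exactly two parameters here; with a different number of years a chi-square routine from a statistics library would be needed.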

  5. What if you have multiple policy changes over a number of years?

    My treatment group is Medicare patients with a control of non-Medicare patients, but I have several policy changes (1 each year over 5 years).
