Northwestern finance Prof Bernard Black describes some interesting causality bloopers, a valuable caution for students and teachers alike!
Regression anatomy revealed
Valerio Filoso from the University of Naples has written a neat Stata routine that automates the regression anatomy formula and makes a complete family of partial regression plots. Check it out!
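For readers without Stata, the regression anatomy formula can also be checked in a few lines of NumPy. This is only a rough sketch on simulated data, not Filoso's routine: the function name regression_anatomy and the made-up coefficients are assumptions for illustration. It computes each multivariate coefficient as Cov(y, x̃_k)/Var(x̃_k), where x̃_k is the residual from regressing x_k on the other covariates, and compares the result with the full OLS fit.

```python
import numpy as np

def regression_anatomy(y, X, k):
    """Coefficient on column k of X via the regression anatomy formula.

    X holds the regressors without a constant; the constant is added here.
    """
    n = X.shape[0]
    others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
    # Residual from regressing x_k on the constant and all other regressors.
    gamma, *_ = np.linalg.lstsq(others, X[:, k], rcond=None)
    x_tilde = X[:, k] - others @ gamma
    # x_tilde has mean zero, so the ratio of sums equals Cov(y, x_tilde) / Var(x_tilde).
    return (x_tilde @ y) / (x_tilde @ x_tilde)

# Simulated data with correlated regressors (all numbers are hypothetical).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
X[:, 1] += 0.5 * X[:, 0]
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

# Full multivariate regression for comparison.
full = np.column_stack([np.ones(n), X])
beta_full, *_ = np.linalg.lstsq(full, y, rcond=None)

anatomy = np.array([regression_anatomy(y, X, k) for k in range(X.shape[1])])
print(anatomy)
print(beta_full[1:])
```

Plotting y against x_tilde for a given k gives the corresponding partial regression plot, whose slope is the multivariate coefficient.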
The RD bandwidth thing
Vanderson Amadeu da Rocha, a student of economics at FEA-RP / USP, Brazil, asks:
My questions are about the chapter of Regression Discontinuity Designs. What criteria are used to determine the neighborhood size in nonparametric RDD Fuzzy and Sharp?
Great question, Vanderson. The bandwidth is indeed at the business end of nonparametric RD, though until recently we simply would have had to say “try a few.”
Happily, a new paper by Imbens and Kalyanaraman provides a better answer by deriving formulas for an MSE-minimizing choice.
Good luck with your project!
Can I get an indulgence for bad control?
We get a lot of questions about bad control. Here’s an interesting one from Colin Vance:
I'd like to estimate the effect of fuel price (which I assume is exogenous) on distance driven. As a control, I would like to include the fuel efficiency of the driver's car. Although efficiency is likely to be endogenous, leaving it out of the specification runs the risk of imparting omitted bias on my fuel price estimate. But since it is *just* a control, I'm inclined to leave efficiency as is in the model and not worry about whether it is endogenous. Wise move? Any insights would be appreciated!
Before tackling the metrics, think about a likely motivation for the research question. Suppose the government is considering a rise in the gas tax. Policy-makers would like to know how this will affect driving habits and fuel consumption. The government is unlikely to forbid people from buying a new, more fuel-efficient car in response to the tax; in fact, it would probably like to encourage that. So who needs to know the causal effect of a price rise conditional on being locked in to my current vehicle? I think this observation neatly answers Colin’s question. Prices will go up, and driving behavior will change for a number of reasons. There is no scenario in which only one response (driving in the same car) is allowed. Then there is the econometric problem that conditioning on fuel efficiency will not actually answer the question of how driving behavior changes for those who don’t buy a more fuel-efficient car. That’s the bad control problem described in MHE, but that’s just metrics.
JA
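The econometric side of the bad control point can be illustrated with a small simulation. Everything here is a made-up assumption for illustration, not Colin's data: fuel price is exogenous, fuel efficiency responds to price, and an unobserved taste shifts both efficiency and distance driven. The total effect of price on distance is set to -0.75 and the effect holding efficiency fixed to -1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all coefficients are assumptions).
price = rng.normal(size=n)                       # exogenous fuel price
taste = rng.normal(size=n)                       # unobserved, drives both outcomes
efficiency = 0.5 * price + taste + rng.normal(size=n)
distance = -1.0 * price + 0.5 * efficiency + taste + rng.normal(size=n)

def ols(y, *regressors):
    """OLS coefficients with a constant prepended."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Short regression: recovers the total effect, -1.0 + 0.5 * 0.5 = -0.75.
b_short = ols(distance, price)[1]

# Long regression with the endogenous control: the price coefficient converges
# to -1.25 in this design, which is neither the total effect (-0.75) nor the
# effect holding efficiency fixed (-1.0).
b_long = ols(distance, price, efficiency)[1]

print(b_short, b_long)
```

In this sketch the regression without the control answers the policy question, while adding the endogenous control produces a coefficient with no causal interpretation.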
How many df in that?!
Reading pp 298-299 with somewhat more care than they were written, Tobias Wuergler from Zurich writes:
In order to demonstrate that robust standard errors are likely to be more biased than non-robust under homoskedasticity, you use a bivariate example, where the single regressor is assumed to be in deviations-from-means form. Wouldn't one need, strictly speaking, the regressand "y" to be in deviations-from-means form, too, in order to partial out the constant? If so, the appropriate degree-of-freedom correction should be (1-2/N), since the residual maker in a demeaned regression is M(x)M(1), where M(1) is the annihilator associated with the vector of ones (which one needs to demean). The square of this residual maker is (M(1)-H(x)), hence E(e_hat(i)^2) = sigma^2*(m(ii,1)-h(ii,x)), and the sum of (m(ii,1)-h(ii,x)) is equal to (N-1-1) since m(ii,1)=(1-1/N). Intuitively, a demeaned simple regression (with the original model having a constant) still needs a degree of freedom correction of 2, as an average needs to be estimated apart from the single beta. Or am I misunderstanding your example? (In order to circumvent this complication one could assume a simple regression through the origin, which would not require x (nor y) to be demeaned.)
Good catch, Toby. Partialing out the constant does not change the underlying df in the estimated residual; you can't fool mother nature. So the df correction should be 2 and not 1. The argument about relative bias of robust and conventional standard errors still goes through, but to get the details right, change 1-1/N to 1-2/N and make sure the leverage adds up to 2 and not 1.
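The leverage arithmetic is easy to verify numerically. The following is a minimal sketch on simulated data (the sample size and regressor are arbitrary assumptions): with a constant in the model, leverage is h_ii = 1/N + (x_i - xbar)^2 / sum_j (x_j - xbar)^2, so it sums to 2, and the average of (1 - h_ii) is 1 - 2/N.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)

# Bivariate regression with a constant: design matrix [1, x].
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat (projection) matrix
leverage = np.diag(H)

# Leverage computed from the demeaned regressor alone omits the 1/n
# contributed by the constant, so it sums to 1 instead of 2.
xd = x - x.mean()
h_demeaned = xd**2 / (xd @ xd)

# Under homoskedasticity E[e_hat_i^2] = sigma^2 * (1 - h_ii), so the expected
# sum of squared residuals divided by sigma^2 is n - 2.
expected_ssr_over_sigma2 = np.sum(1 - leverage)

print(leverage.sum(), h_demeaned.sum(), expected_ssr_over_sigma2)
```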
P-score in the reg?
Geo. from GA asks this interesting question 'bout the propensity score:
I was wondering whether replacing high dimensional covariates (X) in the regression model with their propensity scores (p(X)) was a good idea? That is, Y = a + bT + cX + e becomes Y = a + bT + c(p(X)) + e. The book does not really address it unless I missed it. What are the implications? Thanks.
George: it's certainly not a crazy idea. In fact, Dehejia-Wahba (1999) tried this (Table 5, estimates labeled quadratic in score). But it's not clear what the theoretical justification is here; once you are using regression, why do this two-step procedure instead of just sticking the covs you've put in the score right into the reg (since you're implicitly assuming these are the only source of OVB)? Also, as we know from chapter 3, regression does not estimate the population ATE or the effect of treatment on the treated except under constant effects or if the score is constant. Score fiends are often after those parameters instead of the variance-weighted average that regression produces.
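The last point, that regression delivers a variance-weighted average rather than the ATE or the effect on the treated, can be seen in a toy example. The sketch below uses three hypothetical covariate cells with made-up sizes, treatment shares, and effects (no noise, so the algebra is exact). A regression saturated in the cells returns the average of cell effects weighted by n_x * p(x) * (1 - p(x)), where p(x) is the share treated in cell x.

```python
import numpy as np

# Hypothetical cells: (size, number treated, treatment effect, baseline outcome).
cells = [(100, 50, 1.0, 0.0), (100, 10, 3.0, 2.0), (200, 150, 0.5, -1.0)]

rows = []
for g, (n, n_treated, effect, base) in enumerate(cells):
    for i in range(n):
        t = 1.0 if i < n_treated else 0.0
        rows.append((g, t, base + effect * t, effect))
g, t, y, effect = map(np.array, zip(*rows))

# Regression of y on treatment and a full set of cell dummies.
dummies = (g[:, None] == np.arange(1, len(cells))).astype(float)
design = np.column_stack([np.ones_like(t), t, dummies])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
ols_effect = beta[1]

# Target parameters that score-based estimators usually aim for.
ate = effect.mean()             # average treatment effect
att = effect[t == 1].mean()     # effect of treatment on the treated

# Variance-weighted average of the cell-specific effects.
sizes = np.array([c[0] for c in cells], dtype=float)
p = np.array([c[1] / c[0] for c in cells])
deltas = np.array([c[2] for c in cells])
w = sizes * p * (1 - p)
variance_weighted = (w * deltas).sum() / w.sum()

print(ols_effect, variance_weighted, ate, att)
```

With heterogeneous effects and a score that varies across cells, the three numbers differ; they coincide if the effects are constant or the score is the same in every cell.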
42 clusters references swap
The references to Hansen (2007a) and Hansen (2007b) on pages 322-323 are swapped. On page 322, it should be Hansen (2007b) referenced as discussing bias-correction of serial correlation parameters, and on page 323, it should be Hansen (2007a) referenced as showing pretty good results for state clustering with modest numbers of states.
Steve must have been dozin’ on his galleys by this point.
ivreg2 update
If you’re going to run multiple endogenous variables (not something we’re all that crazy about) you at least oughta look at the appropriate first stage Fs. And, as explained in an earlier post, we didn’t give the right formula in MHE. Luckily, a routine for first-stage F-stats in models with multiple endogenous variables is now programmed in ivreg2. The same update includes other useful routines, like two-way clustering. More information below:
New versions of and extensions to the Baum-Schaffer-Stillman packages ivreg2, xtivreg2, ranktest and xtoverid, and a new program, ivreg29, are now available from SSC. The main extensions and upgrades are:
1. 2-way clustering. 2-way clustering, introduced by Cameron, Gelbach and Miller (2006) and Thompson (2009), is now supported. 2-way clustering, e.g.,
    ivreg2 y x1 x2, cluster(id year)
or
    ivreg2 y (x = z1 z2), gmm2s cluster(id year)
allows for arbitrary within-cluster correlation in two cluster dimensions. In the examples above, standard errors and statistics are robust to disturbances that are autocorrelated (correlated within panels, clustering on id) and common (correlated across panels, clustering on year). In the second example, estimates also are efficient in the presence of arbitrary within-panel and within-year clustering. As with 1-way clustering, the numbers of clusters in both dimensions should be large.
2. Angrist-Pischke first-stage F statistics. ivreg2 and xtivreg2 now provide Angrist-Pischke first-stage F statistics. Angrist and Pischke (2009, pp. 217-18) introduced first-stage F statistics for tests of under- and weak identification when there is more than one endogenous regressor. In contrast to the Cragg-Donald and Kleibergen-Paap statistics, which test the identification of the equation as a whole, the AP first-stage F statistics are tests of whether one of the endogenous regressors is under- or weakly identified.
3. SEs that are robust to autocorrelated across-panel disturbances. Following Thompson (2009), cluster-robust and kernel-robust SEs can be combined and applied to panel data to produce SEs that are robust to arbitrary common autocorrelated disturbances. This can also be combined with 2-way clustering to provide SEs and statistics that are robust to autocorrelated within-panel disturbances (clustering on panel id) and to autocorrelated across-panel disturbances (clustering on time combined with kernel-based HAC).
4. ivreg2 has been Mata-ized ... and is noticeably faster, in particular with time series and the CUE (continuously-updated) GMM estimator.
5. ivreg29 for users who don't yet have Stata 10 or 11. ivreg2 requires Stata 10 or later. For those who have only Stata 9, we have provided a new program, ivreg29. ivreg29 is basically the previous version of ivreg2 plus support for AP F-statistics and some minor bug fixes. ivreg29 does not support the other features described above.
For full details and examples, see the new help files accompanying the programs.
Multiple endogenous variables – now what?!
Diligent reader Daniela Falzon, who works at the World Bank (in France . . . or Washington, DC) writes us with the following interesting problem concerning multiple endogenous variables in 2SLS:
I am estimating Y = b0 + b1*X1 + b2*X2 + b3*X1*X2 + X3
X1 is a dummy variable and endogenous,
X2 is continuous and endogenous
X3 is a set of additional control variables.
Do you have a better idea of how I should do it or should I just focus on the interaction term and instrument it?
Many thanks in advance for your response and best regards,
Thanks for your question, Daniela. Models with multiple endogenous variables are indeed hard to identify and the results can be hard to interpret.
So we don’t usually like to see them. For one thing, it’s not clear why you’re tackling two causal questions at the same time; one is hard enough.
You may have noticed that the only model with more than one endogenous regressor in MHE is the peer effects regression (equation 4.6.6, based on Acemoglu and Angrist, 2000). Here we have both individual and state-level schooling endogenous in a wage equation.
But we are really only interested in the peer effect in this case – the effect of state average schooling. Individual schooling is there because we realize that any instrument for average schooling must also be correlated with individual schooling. We therefore try to fix this violation of the exclusion restriction by treating individual schooling as endogenous as well. This is the best reason for having a second endog variable that I can think of. And the model may work – in the case of schooling we have enough instruments. But not very often, I would think.
More generally, it doesn’t make sense to think of one endogenous variable as a “control” when looking at the effects of another, at least not a good one (in the sense in which we use the terms good and bad control in chapter 3). So any time someone shows me a problem with more than one endogenous variable, my first question is always: why?
Corrections Coming!
Princeton University Press has graciously released a corrected version of MHE. This is not a new edition (we’re still recovering from the first!). But we’ve corrected the mistakes uncovered by careful readers in the past 18 months. The corrected version is now in print and should be shipping soon from Amazon and other big retailers. PUP plans to fulfill Fall 2010 course orders using the new version.
Which isn’t to say there are no more mistakes, so keep those corrections coming.
JA