Thursday, September 17, 2020

Are You Really Controlling for that Variable?

Yes, we love RCTs and RDs (and maybe sometimes IVs and diff-in-diffs, I guess), but remember that the tried-and-true way to get at causal estimates is good old-fashioned controlling for omitted variables. From a "research is so cool" perspective, it's quite fascinating to see an estimate of interest drop suddenly in response to adding an important control variable. From an "I won't get this published unless I have those stars" perspective, you often hope this doesn't happen... and when is it least likely to happen? When you're sloppy in constructing those control variables or when the variables themselves only imperfectly measure the true omitted variable.

So what happens when you add a variable measured with error as a control variable in your model? Supplysideliberal.com explains it all here with excellent intuition as well as matrix algebra! Who could ask for anything more? I've copy-pasted the important bit here:

"Compare the coefficient estimates in a large-sample, ordinary-least-squares, multiple regression with (a) an accurately measured statistical control variable, (b) instead only that statistical control variable measured with error and (c) without the statistical control variable at all. Then all coefficient estimates with the statistical control variable measured with error (b) will be a weighted average of (a) the coefficient estimates with that statistical control variable measured accurately and (c) that statistical control variable excluded. The weight showing how far inclusion of the error-ridden statistical control variable moves the results toward what they would be with an accurate measure of that variable is equal to the fraction of signal in (signal + noise), where “signal” is the variance of the accurately measured control variable that is not explained by variables that were already in the regression, and “noise” is the variance of the measurement error."
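To see the weighted-average result in action, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the data-generating process, the true coefficients of 1.0 and 2.0, the noise variance of 0.8, and the helper name `ols`); none of it comes from the quoted post. It fits the three regressions (a), (b), and (c) and compares the coefficient of interest from (b) with the blend predicted by the signal/(signal + noise) weight.

```python
import numpy as np

def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares. Hypothetical helper."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200_000  # large sample, since the result is a large-sample one

# Assumed data-generating process (illustrative only).
z = rng.normal(size=n)                       # true control variable
x = 0.5 * z + rng.normal(size=n)             # regressor of interest, correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # outcome; true effect of x is 1.0

noise_var = 0.8                              # variance of the measurement error
z_noisy = z + rng.normal(scale=np.sqrt(noise_var), size=n)

b_accurate = ols(y, x, z)[1]        # (a) control measured accurately
b_noisy = ols(y, x, z_noisy)[1]     # (b) control measured with error
b_omitted = ols(y, x)[1]            # (c) control left out

# "Signal": variance of z not explained by the regressors already in the model.
g = ols(z, x)
signal = (z - (g[0] + g[1] * x)).var()
weight = signal / (signal + noise_var)

blend = weight * b_accurate + (1 - weight) * b_omitted
print(f"(a) {b_accurate:.3f}  (b) {b_noisy:.3f}  (c) {b_omitted:.3f}")
print(f"weight {weight:.3f}  predicted (b) {blend:.3f}")
```

Under these assumed parameters the signal and the noise variances are about equal, so the weight is about one half and the estimate in (b) should land roughly midway between (a) and (c). Raising `noise_var` pushes (b) toward the no-control estimate.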

But now for some practical advice from me: When you add an important control variable to your model, be sure to show its estimated coefficient in the table (vs. just having an X signifying that you controlled for it). Why? If you know that variable is an important omitted variable, but its estimated coefficient is close to zero and not statistically significant, the culprit may be measurement error. If that's the case, then we shouldn't be surprised that adding this poorly measured variable doesn't change the estimated coefficient of interest. On the other hand, if you can show that the control variable has its expected impact on the outcome AND it doesn't budge your estimate of the coefficient of interest, you're probably good!  
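Here is a rough sketch of why the control's own coefficient is worth reporting. The setup is entirely hypothetical (simulated data, made-up true coefficients, and an illustrative helper named `fit`): as the proxy for the control gets noisier, its estimated coefficient shrinks toward zero and the coefficient of interest drifts back toward what you would get with no control at all.

```python
import numpy as np

def fit(y, *regressors):
    """OLS coefficients (intercept first) via least squares. Hypothetical helper."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 200_000

# Assumed data-generating process (illustrative only).
z = rng.normal(size=n)                       # true omitted variable
x = 0.5 * z + rng.normal(size=n)             # regressor of interest
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # true coefficients: 1.0 on x, 2.0 on z

b_omitted = fit(y, x)[1]                     # estimate with no control at all
print(f"no control: coef on x = {b_omitted:.3f}")

results = {}
for noise_sd in (0.0, 1.0, 5.0):
    # Proxy = true control plus measurement error of the given standard deviation.
    proxy = z + noise_sd * rng.normal(size=n)
    _, b_x, b_proxy = fit(y, x, proxy)
    results[noise_sd] = (b_x, b_proxy)
    print(f"noise sd {noise_sd}: coef on x = {b_x:.3f}, coef on control = {b_proxy:.3f}")
```

With a clean control, the control's coefficient comes out near its true value and the coefficient on x is corrected. With the noisiest proxy, the control's coefficient is close to zero and the coefficient on x barely moves from the no-control estimate, which is exactly the pattern a table showing only an X would hide.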

Conclusion 1: Minimize measurement error by coding carefully. 

Conclusion 2 (copied from the blog): "I strongly encourage everyone reading this to vigorously criticize any researcher who claims to be statistically controlling for something simply by putting a noisy proxy for that thing in a regression. This is wrong. Anyone doing it should be called out, so that we can get better statistical practice and get scientific activities to better serve our quest for the truth about how the world works."
