Wednesday, July 7, 2021

All About Robustness Checks: "Do Not Run a Regression Just Because You Can!"

Tatyana Deryugina provides some excellent tips on how to choose robustness checks for a paper here. Running way too many regressions is a typical mistake I see many really smart and hardworking graduate students make. It's funny that she starts the blog entry with the wise words of her advisor, "You need to think more and do less," because I still remember my advisor saying something very similar: "In general, there is no good substitute for thought." (In case you're wondering, we did not have the same advisor.) 

So the key is to think about which robustness checks (or tests for heterogeneity or placebo tests) are most important for telling your story. I'm all for playing around with the data when you first start a project to get a sense of what it looks like. Readers of this blog will know that I'm a big advocate of tables of descriptive statistics (with maximums and minimums--these can help you find coding issues) as well as pretty pictures. But when you are in the final stages of a project, it's important to think carefully about which robustness checks to include in the paper. Tatyana provides an excellent guide to help you through the thought process. I also really like her advice about choosing the preferred specification--not just in terms of which controls to include but also the baseline sample:

"Your preferred specification should be based on the most natural sample of treated and control units for your study (e.g., counties in hurricane-prone states). This will often be the sample that includes the largest number of treated units and enough high-quality control units to estimate a credible counterfactual. For example, try not to pare down your control units so much that you have 3 times as many treated as control units. At the other extreme, it is also unlikely that having a sample with 10 times as many control as treated units will be more useful than something closer to a 1-to-1 ratio. If you have a panel dataset, your preferred specification should be based on a balanced panel."
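
To make the sample advice a bit more concrete, here is a minimal Python sketch of two quick checks you might run before settling on a baseline sample: how many control units you have per treated unit, and whether the panel is balanced. The data layout (a county identifier, a year, and a treated flag), the toy values, and the function names are all my own assumptions for illustration; none of this comes from Tatyana's post.

```python
from collections import defaultdict

# Hypothetical panel: one row per (county, year), with a treatment flag per county.
# All names and values here are invented for illustration.
panel = [
    {"county": "A", "year": 2001, "treated": True},
    {"county": "A", "year": 2002, "treated": True},
    {"county": "B", "year": 2001, "treated": False},
    {"county": "B", "year": 2002, "treated": False},
    {"county": "C", "year": 2001, "treated": False},
    # County C has no 2002 row, so this panel is unbalanced.
]


def control_to_treated_ratio(rows, unit_key="county", flag_key="treated"):
    """Return (n_treated, n_control, controls per treated unit), counting units, not rows."""
    status = {}
    for row in rows:
        status[row[unit_key]] = row[flag_key]
    n_treated = sum(1 for is_treated in status.values() if is_treated)
    n_control = len(status) - n_treated
    ratio = n_control / n_treated if n_treated else float("inf")
    return n_treated, n_control, ratio


def units_breaking_balance(rows, unit_key="county", time_key="year"):
    """Return units that are not observed in every period that appears in the data."""
    periods_by_unit = defaultdict(set)
    for row in rows:
        periods_by_unit[row[unit_key]].add(row[time_key])
    all_periods = set().union(*periods_by_unit.values()) if periods_by_unit else set()
    return sorted(u for u, seen in periods_by_unit.items() if seen != all_periods)


n_treated, n_control, ratio = control_to_treated_ratio(panel)
print(f"treated units: {n_treated}, control units: {n_control}, ratio: {ratio:.1f}")
print("units breaking balance:", units_breaking_balance(panel))
```

In this toy example there are two control counties for one treated county, and county C is flagged because it is missing a year. What you do with a flagged unit (drop it, or investigate why the observation is missing) is a judgment call that the code cannot make for you, which is rather the point of the advice above.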