Wednesday, December 28, 2016

Are Those Unobservables Really a Big Problem?

Is there anything to be done when we get that nagging suspicion that some unobserved factor is driving our results? Should we just send the paper off to a journal and hope for the best...or can we learn something about omitted variable bias using the variables we do have in the data?

One commonly used trick is simply to check whether the estimate of our coefficient of interest changes very much as more and more control variables are added to the regression model. To take a classic example, if we want to estimate the impact of schooling on earnings, we may be concerned that higher-ability people get more formal education but would earn higher wages regardless of schooling. To address this issue, researchers control for parental education, AFQT score, number of books in the childhood home, and the list goes on and on. If the estimated impact of schooling doesn't change very much as more and more of these variables are added, we might feel reasonably confident that the estimated treatment effect is not severely biased. Is it possible to control for everything? No. But the implicit logic is that if adding the observables we do have barely moves the estimate, then selection on the unobservables we don't have is (hopefully) not doing much damage either, and our identification strategy, whatever it might be, is probably pretty good.
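To make the check concrete, here is a minimal Stata sketch. The dataset and variable names (earnings_data.dta, lwage, schooling, parent_educ, books_home, afqt) are hypothetical, purely for illustration.

* Hypothetical example: does the schooling coefficient move as controls are added?
use earnings_data.dta, clear    // hypothetical dataset

* (1) No controls
regress lwage schooling

* (2) Add family-background controls
regress lwage schooling parent_educ books_home

* (3) Add an ability proxy
regress lwage schooling parent_educ books_home afqt

* If the coefficient on schooling barely moves from (1) to (3), we take some
* comfort -- but as the paper discussed below points out, we should also
* watch how much the R-squared moves at each step.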

Sounds nice, right? I do this kind of thing all the time. But what does it mean to "not change very much"? And shouldn't adding some controls mean more than others? Emily Oster has a forthcoming paper in the Journal of Business & Economic Statistics that formally connects coefficient stability to omitted variable bias. The key insight is that it's important to take into account both coefficient movements and R-squared movements. She even provides Stata code on her webpage for performing the necessary calculations. I have no idea if her technique will catch on, but even if you never use her code, I recommend reading the paper just for the very clear discussion of omitted variable bias.
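If you want to run the numbers yourself, I believe her Stata code is also distributed on SSC as the psacalc package. A rough sketch, reusing the hypothetical variable names from the example above; check the help file for the exact options and defaults.

ssc install psacalc    // one-time install (assuming the SSC package name)

* Run the fully controlled regression first
regress lwage schooling parent_educ books_home afqt

* How much stronger would selection on unobservables have to be than
* selection on observables to drive the schooling coefficient to zero,
* for a hypothetical maximum R-squared of 0.9?
psacalc delta schooling, rmax(0.9)

* Bias-adjusted coefficient under equal selection (delta = 1)
psacalc beta schooling, rmax(0.9) delta(1)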

Happy Holidays!

Monday, December 19, 2016

Machine Learning and Big Data: It's All the Rage this Year

Just like Hatchimals are the "must have" Christmas toy this year, it seems that machine learning and big data are the "must have" words on econ job market candidates' CVs. Have a look below at the key words in NBER working-paper abstracts by year (you can also read the full article in The Economist). I definitely think it's time to pay attention to machine learning if you haven't already thought much about it.



For a very easy-to-read summary of machine learning techniques, I recommend this new Athey-Imbens paper. I blogged about it here. But even for those of us without access to big data, it's important to think about the proper use of these tools. This article does a great job of explaining the promise and danger of these techniques. Good news #1: It seems like machine learning techniques can help us create better policies. Good news #2: It doesn't seem like those of us trained in spotting good natural experiments will be out of a job anytime soon. Read the article for the details.

Saturday, December 3, 2016

Rule of Thumb for Coding: Embrace Your Fallibility!

My esteemed colleague, Jorge Aguero, just showed me this fabulous article, written by a political scientist, providing hints on how not to make mistakes when coding.

The grand insight: The answer is not "just be more careful." The answer is to program in such a way that you find mistakes very quickly.

I urge you to read the entire article at least once every six months, but I'll summarize my favorite pieces of advice here. Again, the most important thing to keep in mind: "humans are effectively incapable of writing error-free code, and that if we wish to improve the quality of the code we write, we must start learning and teaching coding skills that maximize the probability our mistakes will be found and corrected."

How? 
1. Add tests to your code and run them every time you run the code. Examples: if you have state-level data, then you know you can only have 50 observations; if a variable is a percentage, then you know it can never be above 100 or below 0. Use the assert command in Stata to make sure these things are true. Yeah, this is definitely my favorite tip of the article. Better researchers are probably really good at coming up with these types of tests and use them often. (There's a short Stata sketch of tips 1 and 2 after this list.)
2. Copy-paste is the enemy. Never ever copy-paste. Use outreg2 to get tables into Excel. Use local (or global?) macros if you are going to do the same thing to several variables. 
3. Comment! Comment! Comment! Things that are obvious when you write the code will not at all be obvious when you come back to it for revisions six months (or six years) later. 
4. Use informative variable names. You'll get better at this with practice. 

My own addition (sort of): 
5. Copy-edit your code. Make it look nice. Add spacing and use indents. Delete unnecessary code. This will make it easier to read through often. The more you and your coauthors look at it, the more likely you are to find mistakes that weren't caught by your tests. 
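Here is the promised sketch of tips 1 and 2 in Stata. The dataset and variable names (state_panel.dta, unemp_rate, povrate, medinc, state_fips) are made up for the example.

* Tip 1: build tests into the do-file and run them every time
use state_panel.dta, clear                  // hypothetical dataset
assert _N == 50                             // state-level data: exactly 50 observations
assert unemp_rate >= 0 & unemp_rate <= 100  // a percentage can't leave [0, 100]
assert !missing(state_fips)                 // no missing identifiers

* Tip 2: loop with a local macro instead of copy-pasting the same lines
foreach v of varlist unemp_rate povrate medinc {
    summarize `v'
    generate z_`v' = (`v' - r(mean)) / r(sd)    // standardized version of `v'
}

* ...and let outreg2 write the table instead of copy-pasting results
regress unemp_rate povrate medinc
outreg2 using results, excel replace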

Any other tips? Comment below.