The collapse command is surely one of the Stata commands I use most in my work. This week I remembered one of the classic mistakes I've made with this command, so I thought I'd give all of you Stata beginners a nice little warning.
What did I want to do? Create a new variable showing the number of non-missing observations within particular cells (let's say, country of origin-state-year cells). One way to do this:
bysort country state year: generate N_countrystateyear=_N
It is also easy enough to create this variable within a collapse command. You would be tempted to use the 'count' option, since the help file says that it will "count the number of nonmissing observations."
generate x=1
collapse (count) N_countrystateyear=x, by(country state year)
This works perfectly if you are not using weights. If you are using weights (pweights in particular), however, the count option will instead give you the sum of the weights over the observations in the group. Yes, this is explained in the help file, but all the way at the bottom. Hence my warning: read those things carefully, and always look at the variables you create to make sure they make sense. So what do you do if you have weights and need them for the summary statistics of your other variables? Use the 'rawsum' option, which sums the variable while ignoring the weights. For example,
generate x=1
collapse (mean) age schooling etc (rawsum) N_countrystateyear=x [pweight=perwt], by(country state year)
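To see the difference for yourself, here is a minimal sketch on made-up data. The variable names (state, year, income, perwt) and all the values are hypothetical, chosen only so the numbers are easy to check by hand. It also shows one way to count only the observations with non-missing income, assuming that is the count you are after.

```stata
* Toy data: names and values are invented for illustration only
clear
input str2 state year income perwt
"CA" 2000 100 2
"CA" 2000 200 3
"NY" 2000 150 1.5
"NY" 2000 .   4
end

* x is 1 for every row; has_income is 1 only where income is non-missing
generate x = 1
generate byte has_income = !missing(income)

* (count) uses the pweights; (rawsum) ignores them
collapse (count) n_count=x (rawsum) n_raw=x n_income=has_income ///
    [pweight=perwt], by(state year)

list
```

Going by the weight definitions in the help file, n_count should come out as the sum of perwt in each cell (5 for CA and 5.5 for NY), while n_raw should be 2 in both cells and n_income should be 2 for CA and 1 for NY. One caveat from the same help file: rawsum still drops observations whose weight is zero.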
For more information on the different types of weights you can use in Stata, see here. Short version: Use pweights. Sometimes, Stata will refuse to calculate something using pweights. Ever wonder why? See here.
PS
Thank you Kerry Papps and Nikos Theodoropoulos for encouraging me to write this blog post!