tools of the trade

Is the file drawer too large? Standard Errors in Stata Strike Back

We've all got one - a "file drawer" of project ideas that we got a little way into and abandoned, never to see the light of day. Killing projects without telling anybody is bad for science - both because it likely leads to duplicate work, and because it makes it hard to know how much we should trust published findings. Are the papers that end up in journals just the lucky 5%? Do green jelly beans really cause cancer if a journal tells me so?!

I suspect that lots of projects die as a result of t < 1.96. It's hard to publish or get a job with results that aren't statistically significant, so if a simple test of a main hypothesis doesn't come up with stars, chances are that project ends up tabled (cabineted? drawered and quartered?). 

But what if too many papers are ending up in the file drawer? Let's set aside broader issues surrounding publishing statistically insignificant results - it turns out that Stata* might be contributing to our file drawer problem. Or, rather, Stata users who don't know exactly what their fancy software is doing. Watch out - things are about to get a little bit technical.

Thanks to a super influential paper, Bertrand, Duflo, and Mullainathan (2004), whenever applied microeconometricians like me have multiple observations per individual, we're terrified that OLS standard errors will be biased towards zero. To deal with this problem, we generally cluster our standard errors. Great - standard errors get bigger, problem solved, right?

Turns out that's not quite the end of the story. Little-known - but very important! - fact: in short panels (like two-period diff-in-diffs!), clustered standard errors require a small-sample correction. With few observations per cluster, you should use the variance of the within-estimator to calculate standard errors, rather than the full variance. Failing to apply this correction can dramatically inflate standard errors - and turn a file-drawer-robust t-statistic of 1.96 into a t-statistic of, say, 1.36. Back to the drawing board.** Are you running through a mental list of all the diff-in-diffs you've run recently and sweating yet?

Here's where knowing what happens under the hood of your favorite regression command is super important. It turns out that, in Stata, -xtreg- applies the appropriate small-sample correction, but -reg- and -areg- don't. Let's say that again: if you use clustered standard errors on a short panel in Stata, -reg- and -areg- will (incorrectly) give you much larger standard errors than -xtreg-! Let that sink in for a second. -reghdfe-, a user-written command for Stata that runs high-dimensional fixed effects models in a computationally efficient way, also gets this right. (Digression: it's great. I use it almost exclusively for regressions in Stata these days.)

Edited to add: The difference between what -areg- and -xtreg- are doing is that -areg- counts all of the fixed effects against the regression's degrees of freedom, whereas -xtreg- does not. But in situations where fixed effects are nested within clusters, which is usually true in diff-in-diff settings, clustering already accounts for this, so you don't need to count these fixed effects in your DoF calculation. Doing so would be akin to "double-counting" them, so -xtreg- is doing the right thing. See pp. 17--18 of Cameron and Miller (ungated), Gormley and Matsa, Hanson and Sunderam, this Statalist post, and the -reghdfe- FAQ, many of which also cite Wooldridge (2010) on this topic. I finally convinced myself this was real with a little simulation, posted below, showing that if you apply a placebo treatment, -xtreg- will commit a Type I error the expected 5% of the time, but -areg- will do so only 0.5% of the time, suggesting that it's overly conservative relative to what we'd expect.

So: spread the Good News - if you've been using clustered standard errors with -reg- or -areg- on a short panel, you should switch to -xtreg- or -reghdfe-, and for once have correctly smaller standard errors. If for whatever reason you're unwilling to make the switch, you can multiply your -reg- or -areg- standard error by 1/sqrt((N-1)/(N-J-1)), where N is the total number of observations in your dataset, and J is the number of panel units (individuals) in your data, and you'll get the right answer again.***
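Here's a quick back-of-the-envelope version of that correction, using the numbers from the example code at the bottom of this post (N = 2,000 observations, J = 1,000 individuals, and an -areg- clustered standard error of .1281184 on the treatment variable):

local N = 2000
local J = 1000
local se_areg = .1281184
* apply the correction factor from above
local se_corrected = `se_areg' * (1/sqrt((`N' - 1)/(`N' - `J' - 1)))
di "corrected SE: " `se_corrected'
* ~ .0906, which (up to rounding) matches the -xtreg- clustered SE of .0905707 below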

Adjust your do files, shrink your standard errors, empty out your file drawer. Happy end of summer, y'all.

 

*For the smug R users among us (you know who you are), note that felm doesn't apply this correction either. Edited to add: Also, if you're an felm user, it turns out that felm uses the wrong degrees of freedom to calculate its p-value with clustered standard errors. If you have a large number of clusters, this won't matter, since the t distribution converges decently quickly, but in smaller samples, this can make a difference. Use the exactDOF option to set your degrees of freedom equal to the number of clusters to fix this problem.

**Note: I'm not advocating throwing away results with t=1.36. That would be Bad Science.

*** What about cross-sectional data? When is -areg- right? For more details, please scroll (all the way) down below to read David Drukker's comment on when -areg- is appropriate. Here's a small piece of his comment:

Sometimes I have cross-sectional data and I want to condition on state-level fixed effects. (If I add more individuals to the sample, the number of fixed effects does not change.) Sometimes I have a short panel and I want to condition on individual-level fixed effects. (Every new individual in the sample adds a fixed effect on which I must condition.)

That is: -areg- is appropriate in the first case, and -xtreg- in the second. All of this highlights for me the importance of understanding what your favorite statistical package is doing, and why it's doing it. Read the help documentation, code up simulations, and figure out what's going on under the hood before blindly running regressions.
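In code, the distinction looks something like this - a stylized sketch with made-up variable names, not a recipe:

* Cross-sectional data with, say, state fixed effects: adding individuals to the
* sample does not add fixed effects, so -areg-'s degrees-of-freedom adjustment is fine
areg y x, absorb(state) vce(cluster state)

* Short panel with individual fixed effects: every new individual adds a fixed
* effect that is nested within its own cluster, so use -xtreg, fe- (or -reghdfe-)
xtset ind post
xtreg y x, fe vce(cluster ind)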

H/t to my applied-econometrician-partners in crime for helping me to do just that.

See also: More Stata standard error hijinks.

Simple example code for Stata -- notice that t goes from 1.39 to 1.97 when we switch from the incorrect to the correct clustered standard errors! Edited to add: The first chunk of code just demonstrates that the SE's are different for different approaches. The second chunk of code runs a simulation that applies a placebo treatment. I wrote it quickly. It's not super computationally efficient.

*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- SETUP
*************************************************************************

clear all
version 14
set more off
set matsize 10000
set seed 12345

* generate 1000 obs
set obs 1000
* create unit ids
gen ind = _n

* create unit fixed effects
gen u_i = rnormal(1, 10)
* and 2 time periods per unit
expand 2
bysort ind: gen post = _n - 1

* generate a time effect 
gen nu_t = rnormal(3, 5)
replace nu_t = nu_t[1]
replace nu_t = rnormal(3,5) if post == 1
replace nu_t = nu_t[2] if post == 1

* ``randomize'' half into treatment
gen trtgroup = 0
replace trtgroup = 1 if ind > 500

* and treat units in the post-period only
gen treatment = 0
replace treatment = 1 if trtgroup == 1 & post == 1 

* generate a random error
gen eps = rnormal()

**** DGP ****
gen y = 3 + 0.15*treatment + u_i + nu_t + eps

*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- ESTIMATION RESULTS
*************************************************************************
*** ESTIMATE USING -reg-

* might want to comment this out if your computer is short on memory

reg y treatment i.post i.ind, vce(cluster ind)
/*

Linear regression                               Number of obs     =      2,000
                                                F(1, 999)         =          .
                                                Prob > F          =          .
                                                R-squared         =     0.9957
                                                Root MSE          =     1.0126

                                 (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .1281184     1.39   0.164    -.0731525    .4296715
*/

*** ESTIMATE USING -areg-

areg y treatment i.post, a(ind) vce(cluster ind)
/*

Linear regression, absorbing indicators         Number of obs     =      2,000
                                                F(2, 999)         =    6284.86
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9957
                                                Adj R-squared     =     0.9913
                                                Root MSE          =     1.0126

                                 (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .1281184     1.39   0.164    -.0731525    .4296715

*/
*** ESTIMATE USING -xtreg-
xtset ind post
xtreg y treatment i.post, fe vce(cluster ind)
/*

Fixed-effects (within) regression               Number of obs     =      2,000
Group variable: ind                             Number of groups  =      1,000

R-sq:                                           Obs per group:
     within  = 0.9618                                         min =          2
     between = 0.0010                                         avg =        2.0
     overall = 0.1091                                         max =          2

                                                F(2, 999)         =   12576.01
corr(u_i, Xb) = -0.0004                         Prob > F          =     0.0000

                                 (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .0905707     1.97   0.049     .0005289    .3559901
*/

*** ESTIMATE USING -reghdfe-

reghdfe y treatment, a(ind post) vce(cluster ind)
/*
HDFE Linear regression                          Number of obs     =      2,000
Absorbing 2 HDFE groups                         F(1, 999)         =       3.88
Statistics robust to heteroskedasticity         Prob > F          =     0.0493
                                                R-squared         =     0.9957
                                                Adj R-squared     =     0.9913
                                                Within R-sq.      =     0.0039
Number of clusters (ind) = 1,000                Root MSE          =     1.0126

                                 (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595    .090548     1.97   0.049     .0005734    .3559457
------------------------------------------------------------------------------
*/

 

*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- SIMULATIONS
*************************************************************************

clear all
version 14
set more off
set seed 12345

local nsims = 10000
** set up dataset to save results **
set obs `nsims'
gen pval_areg = .
gen pval_xtreg = .
save "/Users/fburlig/Desktop/file_drawer_sims_out.dta", replace

*** SIMULATION
** NOTE: THIS IS NOT A SUPER EFFICIENT LOOP. IT'S SLOW. 
** YOU MAY WANT TO ADJUST THE NUMBER OF SIMS DOWN.
clear 
forvalues i = 1/`nsims' {
clear
* generate 1000 obs
set obs 1000
* create unit ids
gen ind = _n

* create unit fixed effects
gen u_i = rnormal(1, 10)
* randomize units into treatment
gen randomizer = runiform()

* ``randomize'' half into treatment
gen trtgroup = 0
replace trtgroup = 1 if randomizer > 0.5
drop randomizer

* and 2 time periods per unit
expand 2
bysort ind: gen post = _n - 1

* generate a time effect 
gen nu_t = rnormal(3, 5)
replace nu_t = nu_t[1]
replace nu_t = rnormal(3,5) if post == 1
replace nu_t = nu_t[2] if post == 1

* and treat units in the post-period only
gen treatment = 0
replace treatment = 1 if trtgroup == 1 & post == 1 

* generate a random error
gen eps = rnormal()

**** XTSET
xtset ind post

**** DGP: TREATMENT EFFECT OF ZERO ****
gen y = 3 + 0*treatment + u_i + nu_t + eps

*** store p-value -- -areg-
areg y treatment i.post, absorb(ind) vce(cluster ind)
local pval_areg = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
di `pval_areg'

*** store p-value -- -xtreg-
xtreg y treatment i.post, fe vce(cluster ind)
local pval_xtreg = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
di `pval_xtreg'

use "/Users/fburlig/Desktop/file_drawer_sims_out.dta", clear
replace pval_areg = `pval_areg' in `i'
replace pval_xtreg = `pval_xtreg' in `i'
save "/Users/fburlig/Desktop/file_drawer_sims_out.dta", replace
}

*** COMPUTE TYPE I ERROR RATES
use "/Users/fburlig/Desktop/file_drawer_sims_out.dta", clear

gen rej_xtreg = 0
replace rej_xtreg = 1 if pval_xtreg < 0.05

gen rej_areg = 0
replace rej_areg = 1 if pval_areg < 0.05

sum rej_xtreg
/*
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   rej_xtreg |     10,000       .0501    .2181622          0          1

*/

sum rej_areg
/*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    rej_areg |     10,000       .0052    .0719269          0          1
*/

*** NOTE: xtreg commits a type I error 5% of the time
** areg does so 0.5% of the time!


Standard errors in Stata: a (somewhat) cautionary tale

Last week, a colleague and I were having a conversation about standard errors. He had a new discovery for me - "Did you know that clustered standard errors and robust standard errors are the same thing with panel data?"

I argued that this couldn't be right - but he said that he'd run -xtreg- in Stata with robust standard errors and with clustered standard errors and gotten the same result - and then sent me the relevant citations in the Stata help documentation. I'm highly skeptical - especially when it comes to standard errors - so I decided to dig into this a little further. 

Turns out Andrew was wrong after all - but through very little fault of his own. Stata pulled the wool over his eyes a little bit here. It turns out that in panel data settings, "robust" - AKA heteroskedasticity-robust - standard errors aren't consistent. Oops. This important insight comes from James Stock and Mark Watson's 2008 Econometrica paper. So using -xtreg, fe robust- is bad news. In light of this result, StataCorp made an executive decision: when you specify -xtreg, fe robust-, Stata actually calculates standard errors as though you had written -xtreg, vce(cluster panelvar)- !
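If you want to see this for yourself, a two-line check - reusing, say, the simulated two-period panel (ind, post, treatment, y) from the setup code in the previous post - will do it; the two commands report identical standard errors:

xtset ind post
* asking for heteroskedasticity-robust standard errors...
xtreg y treatment i.post, fe robust
* ...silently gets you cluster-robust standard errors, clustered on the panel variable
xtreg y treatment i.post, fe vce(cluster ind)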

 

Standard errors: giving sandwiches a bad name since 1967.


On the one hand, it's probably a good idea not to allow users to compute robust standard errors in panel settings anymore. On the other hand, computing something other than what users think is being computed, without an explicit warning that this is happening, is less good. 

To be fair, Stata does tell you that "(Std. Err. adjusted for N clusters in panelvar)", but this is easy to miss - there's no "Warning - Clustered standard errors computed in place of robust standard errors" label, or anything like that. The help documentation mentions (on p. 25) that specifying -vce(robust)- is equivalent to specifying -vce(cluster panelvar)-, but what's actually going on is pretty hard to discern, I think. Especially because there's a semantics issue here: a cluster-robust standard error and a heteroskedasticity-robust standard error are two different things. In the econometrics classes I've taken, "robust" refers to heteroskedasticity- (but not cluster-) robust errors. In fact, StataCorp refers to errors this way in a very thorough and useful FAQ answer posted online - and, in another, clearly states that the Huber and White papers don't deal with clustering.

All that to say: when you use canned routines, it's very important to know what exactly they're doing! I have a growing appreciation for Max's requirement that his econometrics students build their own functions up from first principles. This is obviously impractical in the long run, but helps to instill a healthy mistrust of others' code. So: caveat econometricus - let the econometrician beware!

Officially SSMART!

I've been a bad blogger over the past month or so, something I'm hoping to remedy over the coming weeks. (Somewhere out there, a behavioral economist is grumbling about me being present-biased and naive about it. Whatever, grumbly behavioral economist.) I'm writing this from SFO, about to head off to Bangalore (via Seattle and Paris, where I'll meet my coauthor/adventure buddy Louis), thanks to USAID and Berkeley's Development Impact Lab. We're hoping to study the effects of the smart grid in urban India, as well as to learn more about what energy consumption looks like in Bangalore in general. There is a small but quickly growing body of evidence on energy use in developing countries (see Gertler, Shelef, Wolfram, and Fuchs -- forthcoming AER, and one of my favorites of Catherine's papers! -- and Jack and Smith -- AER P&P on pre-paid metering in South Africa -- for a couple of recent examples). Still, there's a lot that we don't know, and, of course, a lot more that we don't know that we don't know. Thanks a lot, Rumsfeld.

Feeling SSMARTer already!


In other exciting news, the Berkeley Initiative for Transparency in the Social Sciences (BITSS) has released its SSMART grant awardees - and my new project (with Matt and Louis, and overseen by Catherine) on improving power calculations and making sure researchers get their standard errors right has been funded! Very exciting. Check out the official announcement here, and our page on the Open Science Framework here. Since this is a grant explicitly about transparency, we'll be making our results public as we go through the process. Our money is officially for this coming summer, so look for an update / more details in a few months.

Where we are currently: there are theoretical reasons to be handling standard errors differently than we currently do in a lot of empirical applications, and there are also theoretical reasons that existing formula-based power calculations might be leaving studies underpowered. In progress: how badly wrong are we when we use current methods?

My flight is boarding, so I'll leave you with that lovely teaser!

The white man's (research) burden

Please, please, please don't take this title as anything other than a goofy comment. No offense meant. Now that we have that out of the way, moving on: a really interesting paper just came out in the Journal of Economic Behavior and Organization.

In this work, titled "The white-man effect: How foreigner presence affects behavior in experiments", Jacobus Cilliers, Oeindrila Dube, and Bilal Siddiqi provide some of the first concrete and well-identified evidence that the composition of fieldwork implementation teams really matters for measuring outcomes. It probably doesn't come as a surprise to anybody that the identity of people administering experiments can sway the results - we have lots of evidence from the psychology literature to suggest that even question framing can have dramatic effects - but demonstrating this point in a developing-country context makes me want to think really carefully about how I conduct fieldwork. Development economists have known for some time now that employing locals is key to successful research in the field; this paper suggests yet another reason to work with enumerators from the survey region.

Too much of this in development research. Before you start to worry that I'm calling out an actual person, realize that this picture comes from The Onion (whose original headline is spot on: "6-Day Visit To Rural African Village Completely Changes Woman’s Facebook Profile Picture")

It's always the case that as a white American development economist, I want to be sensitive about doing research in a way that can come off as paternalistic - one of the best things we can do in the field is to give our subjects agency. In addition to that, this paper highlights yet again why it is problematic that, for example, African scholars are underrepresented in the highest ranks of research about Africa.  I feel very lucky to be able to work in India, and to get to know (small pieces of) the country. I'm sure that I will keep this paper in mind as I do more work that requires original, in-person data collection. 

WWP: An oldie but a goodie

I always appreciate papers that teach me something about methodology as well as about their particular research question. A lot of the papers I really like for this reason have already been published (David McKenzie at the World Bank has a bunch of papers that fall into this category - and excellent blog posts as well.)

This week's WWP isn't particularly new, but is definitely both interesting and useful methodologically (and it is still a working paper!). Many readers of this blog (ha, as if this blog has many readers) have probably read this paper before, or at least skimmed it, or at least seen the talk. But if you haven't read it carefully, I urge you to go back and give it another look. Yes, I'm talking about Sol Hsiang and Amir Jina's hurricanes paper (the actual title is: The Causal Effect of Environmental Catastrophe on Long-Run Economic Growth: Evidence from 6,700 Cyclones). Aside from being interesting and cool (and having scarily large estimates), it also provides really clear discussions of how to do a bunch of things that applied microeconomists might want to know how to do.

It describes in a good bit of detail how to map environmental events to economic observations (don't miss the page-long footnote 13...). It also discusses how to estimate a distributed lag model, and then explains how to recover the cumulative effect from this model (something that I never saw in an econometrics class). It provides really clear visualizations of the main results (we should expect no less from Sol at this point). A lot of the methodological meat is also contained in the battery of robustness checks, including a randomization inference procedure, a variety of cuts of the data, more discussion of distributed lag and spatial lag models, modeling potential adaptation, etc etc etc. Finally, they do an interesting exercise where they use their model to simulate what growth would have looked like in the absence of cyclones, and (of course) do a climate projection - but also add an NPV calculation on top of it.
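For what it's worth, the cumulative-effect piece is less mysterious than it sounds: in a distributed lag model, the cumulative effect is just the sum of the contemporaneous and lagged coefficients, and -lincom- will hand you that sum along with a standard error. A toy sketch with hypothetical variable names - not the paper's actual specification - might look like:

* hypothetical country-year panel: outcome growth, exposure variable cyclone
xtset country year
xtreg growth L(0/5).cyclone i.year, fe vce(cluster country)
* cumulative effect after five years = sum of the contemporaneous and lag coefficients
lincom cyclone + L.cyclone + L2.cyclone + L3.cyclone + L4.cyclone + L5.cyclone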

All in all, I think I'll use this paper as a great reference for how to implement different techniques for a while - and I look forward to reading the eventual published version. I'll let the authors describe their results themselves. Their abstract:

Does the environment have a causal effect on economic development? Using meteorological data, we reconstruct every country’s exposure to the universe of tropical cyclones during 1950-2008. We exploit random within-country year-to-year variation in cyclone strikes to identify the causal effect of environmental disasters on long-run growth. We compare each country’s growth rate to itself in the years immediately before and after exposure, accounting for the distribution of cyclones in preceding years. The data reject hypotheses that disasters stimulate growth or that short-run losses disappear following migrations or transfers of wealth. Instead, we find robust evidence that national incomes decline, relative to their pre-disaster trend, and do not recover within twenty years. Both rich and poor countries exhibit this response, with losses magnified in countries with less historical cyclone experience. Income losses arise from a small but persistent suppression of annual growth rates spread across the fifteen years following disaster, generating large and significant cumulative effects: a 90th percentile event reduces per capita incomes by 7.4% two decades later, effectively undoing 3.7 years of average development. The gradual nature of these losses render them inconspicuous to a casual observer, however simulations indicate that they have dramatic influence over the long-run development of countries that are endowed with regular or continuous exposure to disaster. Linking these results to projections of future cyclone activity, we estimate that under conservative discounting assumptions the present discounted cost of “business as usual” climate change is roughly $9.7 trillion larger than previously thought.

Edited to add: This turned out to be especially timely due to the record number of hurricanes in the Pacific at the moment. (Luckily, none of them are threatening landfall as of August 31st.)