We've all got one - a "file drawer" of project ideas that we got a little way into and abandoned, never to see the light of day. Killing projects without telling anybody about it is bad for science - both because it likely leads to duplicate work, and because it makes it hard to know how much we should trust published findings. Are the papers that end up in journals just the lucky 5%? Do green jelly beans really cause cancer if a journal tells me so?!

I suspect that lots of projects die as a result of t < 1.96. It's hard to publish or get a job with results that aren't statistically significant, so if a simple test of a main hypothesis doesn't come up with stars, chances are that project ends up tabled (cabineted? drawered and quartered?).

But what if too many papers are ending up in the file drawer? Let's set aside broader issues surrounding publishing statistically insignificant results - it turns out that Stata* might be contributing to our file drawer problem. Or, rather, Stata users who don't know exactly what their fancy software is doing. Watch out - things are about to get a little bit technical.

Thanks to a super influential paper, Bertrand, Duflo, and Mullainathan (2004), whenever applied microeconometricians like me have multiple observations per individual, we're terrified that OLS standard errors will be biased towards zero. To deal with this problem, we generally cluster our standard errors. Great - standard errors get bigger, problem solved, right?
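To see the problem in miniature, here's a quick Monte Carlo sketch (in Python for portability - this illustrates the mechanism, not Bertrand et al.'s exact setup): with a regressor that's constant within unit and errors that are correlated within unit, conventional iid OLS standard errors dramatically understate the true sampling variability.

```python
import numpy as np

# Sketch of the Bertrand-Duflo-Mullainathan point: a unit-level regressor
# plus a unit-level error component makes iid OLS standard errors too small.
rng = np.random.default_rng(0)
J, T, nsims = 50, 10, 500              # units, obs per unit, simulations
betas, iid_ses = [], []
for _ in range(nsims):
    x_unit = rng.normal(size=J)        # regressor, constant within unit
    u = rng.normal(size=J)             # unit-level error component
    x = np.repeat(x_unit, T)
    y = 1.0 * x + np.repeat(u, T) + rng.normal(size=J * T)
    xd = x - x.mean()
    b = (xd @ y) / (xd @ xd)           # OLS slope (with intercept)
    resid = (y - y.mean()) - b * xd
    s2 = resid @ resid / (J * T - 2)
    iid_ses.append(np.sqrt(s2 / (xd @ xd)))  # conventional iid OLS SE
    betas.append(b)

emp_sd = np.std(betas)                 # true sampling variability
avg_se = np.mean(iid_ses)              # what iid OLS reports
print(avg_se, emp_sd)                  # the reported SE is far too small
```

Clustering on unit is the standard fix; the point here is just how large the gap can be even in a clean simulated setting.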

Turns out that's not quite the end of the story. Little-known - but very important! - fact: **in short panels (like two-period diff-in-diffs!), clustered standard errors require a small-sample correction.** With few observations per cluster, you should be using just the variance of the within-estimator to calculate standard errors, rather than the full variance. Failing to apply this correction can dramatically inflate standard errors - and turn a file-drawer-robust t-statistic of 1.96 into a t-statistic of, say, 1.36. Back to the drawing board.** Are you running through a mental list of all the diff-in-diffs you've run recently and sweating yet?

Here's where knowing what happens under the hood of your favorite regression command is super important. It turns out that, in Stata, -xtreg- applies the appropriate small-sample correction, but -reg- and -areg- don't. Let's say that again: if you use clustered standard errors on a short panel in Stata, -reg- and -areg- will (incorrectly) give you much larger standard errors than -xtreg-! Let that sink in for a second. -reghdfe-, a user-written command for Stata that runs high-dimensional fixed effects models in a computationally-efficient way, also gets this right. (Digression: it's great. I use it almost exclusively for regressions in Stata these days.)

**Edited to add:** The difference between what -areg- and -xtreg- are doing is that -areg- counts all of the fixed effects against the regression's degrees of freedom, whereas -xtreg- does not. But *in situations where fixed effects are nested within clusters*, which is usually true in diff-in-diff settings, clustering already accounts for this, so you don't need to include these fixed effects in your DoF calculation. Doing so would be akin to "double-counting" these fixed effects, so -xtreg- is doing the right thing. See pp. 17--18 of Cameron and Miller (ungated), Gormley and Matsa, Hanson and Sunderam, this Statalist post, and the -reghdfe- FAQ, many of which also cite Wooldridge (2010) on this topic. I finally convinced myself this was real with a little simulation, posted below, showing that if you apply a placebo treatment, -xtreg- will commit a Type I error the expected 5% of the time, but -areg- will do so only 0.5% of the time - suggesting that -areg- is overly conservative relative to what we'd expect it to do.
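For the curious, here's a language-neutral sketch of the same placebo logic (Python, scaled down so it runs fast). I compute the within estimator and the cluster-robust variance by hand; the two degrees-of-freedom corrections are stylized versions of what -xtreg- and -areg- do, ignoring some of Stata's exact finite-sample details.

```python
import numpy as np

rng = np.random.default_rng(12345)
J, T = 100, 2                      # panel units, time periods
N, K = J * T, 2                    # observations; regressors after demeaning
nsims, crit = 1000, 1.96           # normal critical value, for simplicity
rej_xtreg = rej_areg = 0

unit = np.repeat(np.arange(J), T)  # rows ordered (unit, period)
post = np.tile(np.arange(T), J).astype(float)

def wi(v):
    """Within transformation: demean by unit."""
    return v - np.repeat(v.reshape(J, T).mean(axis=1), T)

for _ in range(nsims):
    u_i = rng.normal(1, 10, J)                    # unit fixed effects
    nu_t = rng.normal(3, 5, T)                    # time effects
    trt = (rng.uniform(size=J) > 0.5).astype(float)
    D = trt[unit] * post                          # placebo: true effect = 0
    y = 3 + u_i[unit] + nu_t[post.astype(int)] + rng.normal(size=N)

    X = np.column_stack([wi(D), wi(post)])
    A = np.linalg.inv(X.T @ X)
    b = A @ (X.T @ wi(y))
    e = wi(y) - X @ b

    # cluster-robust "meat": outer products of within-cluster score sums
    S = (X * e[:, None]).reshape(J, T, K).sum(axis=1)
    V = A @ (S.T @ S) @ A

    # xtreg-style correction: FEs nested in clusters, so don't count them
    c_xt = (J / (J - 1)) * ((N - 1) / (N - K - 1))
    # areg-style correction: all J absorbed FEs count against the DoF
    c_ar = (J / (J - 1)) * ((N - 1) / (N - K - J))
    rej_xtreg += abs(b[0]) / np.sqrt(c_xt * V[0, 0]) > crit
    rej_areg += abs(b[0]) / np.sqrt(c_ar * V[0, 0]) > crit

print(rej_xtreg / nsims, rej_areg / nsims)   # roughly 0.05 vs 0.005
```

The two t-statistics use the identical point estimate and "meat"; only the degrees-of-freedom scaling differs, and that alone moves the rejection rate by an order of magnitude.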

So: spread the Good News - if you've been using clustered standard errors with -reg- or -areg- on a short panel, you should switch to -xtreg- or -reghdfe-, and for once have *correctly smaller* standard errors. If for whatever reason you're unwilling to make the switch, you can multiply your -reg- or -areg- standard error by 1/sqrt((N-1)/(N-J-1)), where N is the total number of observations in your dataset and J is the number of panel units (individuals), and you'll get the right answer again.***
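To check that back-of-the-envelope factor against the numbers in the example code at the bottom of this post (it ignores the handful of non-absorbed regressors, so it's approximate):

```python
import math

# Example code below reports: -areg- SE = .1281184, -xtreg- SE = .0905707,
# with N = 2,000 observations and J = 1,000 panel units.
N, J = 2000, 1000
b, se_areg = .1782595, .1281184

factor = 1 / math.sqrt((N - 1) / (N - J - 1))  # = sqrt((N-J-1)/(N-1)) < 1
se_fixed = se_areg * factor

print(round(se_fixed, 7))      # ~.0905707, matching -xtreg-
print(round(b / se_fixed, 2))  # t = 1.97 instead of 1.39
```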

Adjust your do files, shrink your standard errors, empty out your file drawer. Happy end of summer, y'all.

*For the smug R users among us (you know who you are), note that felm doesn't apply this correction either. **Edited to add:** Also, if you're an felm user, it turns out that felm uses the wrong degrees of freedom to calculate its p-value with clustered standard errors. If you have a large number of clusters, this won't matter, since the t distribution converges decently quickly, but in smaller samples, this can make a difference. Use the exactDOF option to set your degrees of freedom equal to the number of clusters to fix this problem.
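To see why the degrees of freedom matter when clusters are few, here's a purely illustrative stdlib-Python calculation (numerically integrating the t density, no stats packages): the same t-statistic of 2.0 is insignificant against a t distribution with 10 degrees of freedom but significant against one with 1,000.

```python
import math

def t_pvalue(t, df, hi=60.0, n=100000):
    """Two-sided p-value under Student's t, via trapezoid-rule integration."""
    logc = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(logc) / math.sqrt(df * math.pi)
    h = (hi - t) / n
    fs = [c * (1 + (t + i * h) ** 2 / df) ** (-(df + 1) / 2)
          for i in range(n + 1)]
    tail = h * (sum(fs) - 0.5 * (fs[0] + fs[-1]))
    return 2 * tail

# Same t-statistic, different degrees of freedom:
print(round(t_pvalue(2.0, 10), 3))    # ~0.073: not significant with 10 df
print(round(t_pvalue(2.0, 1000), 3))  # ~0.046: significant with 1000 df
```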

**Note: I'm not advocating throwing away results with t=1.36. That would be Bad Science.

*** **What about cross-sectional data? When is -areg- right?** For more details, please scroll (all the way) down to read David Drukker's comment on when -areg- is appropriate. Here's a small piece of his comment:

That is: -areg- is appropriate in the former case, -xtreg- in the latter. All of this highlights for me the importance of understanding what your favorite statistical package is doing, and why it's doing it. Read the help documentation, code up simulations, and figure out what's going on under the hood before blindly running regressions.

H/t to my applied-econometrician-partners in crime for helping me to do just that.

**See also:** More Stata standard error hijinks.

Simple example code for Stata -- notice that t goes from 1.39 to 1.97 when we switch from the incorrect to the correct clustered standard errors! **Edited to add:** The first chunk of code just demonstrates that the SEs differ across approaches. The second chunk runs a simulation that applies a placebo treatment. I wrote it quickly - it's not super computationally efficient.

```
*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- SETUP
*************************************************************************
clear all
version 14
set more off
set matsize 10000
set seed 12345

* generate 1000 obs
set obs 1000
* create unit ids
gen ind = _n
* create unit fixed effects
gen u_i = rnormal(1, 10)
* and 2 time periods per unit
expand 2
bysort ind: gen post = _n - 1
* generate a time effect
gen nu_t = rnormal(3, 5)
replace nu_t = nu_t[1]
replace nu_t = rnormal(3,5) if post == 1
replace nu_t = nu_t[2] if post == 1
* ``randomize'' half into treatment
gen trtgroup = 0
replace trtgroup = 1 if ind > 500
* and treat units in the post-period only
gen treatment = 0
replace treatment = 1 if trtgroup == 1 & post == 1
* generate a random error
gen eps = rnormal()

**** DGP ****
gen y = 3 + 0.15*treatment + u_i + nu_t + eps

*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- ESTIMATION RESULTS
*************************************************************************

*** ESTIMATE USING -reg-
* might want to comment this out if your computer is short on memory
reg y treatment i.post i.ind, vce(cluster ind)
/*
Linear regression                           Number of obs   =      2,000
                                            F(1, 999)       =          .
                                            Prob > F        =          .
                                            R-squared       =     0.9957
                                            Root MSE        =     1.0126

                           (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .1281184     1.39   0.164    -.0731525    .4296715
------------------------------------------------------------------------------
*/

*** ESTIMATE USING -areg-
areg y treatment i.post, a(ind) vce(cluster ind)
/*
Linear regression, absorbing indicators     Number of obs   =      2,000
                                            F(2, 999)       =    6284.86
                                            Prob > F        =     0.0000
                                            R-squared       =     0.9957
                                            Adj R-squared   =     0.9913
                                            Root MSE        =     1.0126

                           (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .1281184     1.39   0.164    -.0731525    .4296715
------------------------------------------------------------------------------
*/

*** ESTIMATE USING -xtreg-
xtset ind post
xtreg y treatment i.post, fe vce(cluster ind)
/*
Fixed-effects (within) regression           Number of obs    =      2,000
Group variable: ind                         Number of groups =      1,000
R-sq:                                       Obs per group:
     within  = 0.9618                                    min =          2
     between = 0.0010                                    avg =        2.0
     overall = 0.1091                                    max =          2
                                            F(2, 999)        =   12576.01
corr(u_i, Xb) = -0.0004                     Prob > F         =     0.0000

                           (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .0905707     1.97   0.049     .0005289    .3559901
------------------------------------------------------------------------------
*/

*** ESTIMATE USING -reghdfe-
reghdfe y treatment, a(ind post) vce(cluster ind)
/*
HDFE Linear regression                      Number of obs   =      2,000
Absorbing 2 HDFE groups                     F(1, 999)       =       3.88
Statistics robust to heteroskedasticity     Prob > F        =     0.0493
                                            R-squared       =     0.9957
                                            Adj R-squared   =     0.9913
                                            Within R-sq.    =     0.0039
Number of clusters (ind) = 1,000            Root MSE        =     1.0126

                           (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595    .090548     1.97   0.049     .0005734    .3559457
------------------------------------------------------------------------------
*/
```

```
*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- SIMULATIONS
*************************************************************************
clear all
version 14
set more off
set seed 12345

local nsims = 10000

** set up dataset to save results **
set obs `nsims'
gen pval_areg = .
gen pval_xtreg = .
save "/Users/fburlig/Desktop/file_drawer_sims_out.dta", replace

*** SIMULATION
** NOTE: THIS IS NOT A SUPER EFFICIENT LOOP. IT'S SLOW.
** YOU MAY WANT TO ADJUST THE NUMBER OF SIMS DOWN.
clear
forvalues i = 1/`nsims' {
    clear
    * generate 1000 obs
    set obs 1000
    * create unit ids
    gen ind = _n
    * create unit fixed effects
    gen u_i = rnormal(1, 10)
    * randomize units into treatment
    gen randomizer = runiform()
    * ``randomize'' half into treatment
    gen trtgroup = 0
    replace trtgroup = 1 if randomizer > 0.5
    drop randomizer
    * and 2 time periods per unit
    expand 2
    bysort ind: gen post = _n - 1
    * generate a time effect
    gen nu_t = rnormal(3, 5)
    replace nu_t = nu_t[1]
    replace nu_t = rnormal(3,5) if post == 1
    replace nu_t = nu_t[2] if post == 1
    * and treat units in the post-period only
    gen treatment = 0
    replace treatment = 1 if trtgroup == 1 & post == 1
    * generate a random error
    gen eps = rnormal()

    **** XTSET
    xtset ind post

    **** DGP: TREATMENT EFFECT OF ZERO ****
    gen y = 3 + 0*treatment + u_i + nu_t + eps

    *** store p-value -- -areg-
    areg y treatment i.post, absorb(ind) vce(cluster ind)
    local pval_areg = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
    di `pval_areg'

    *** store p-value -- -xtreg-
    xtreg y treatment i.post, fe vce(cluster ind)
    local pval_xtreg = 2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
    di `pval_xtreg'

    use "/Users/fburlig/Desktop/file_drawer_sims_out.dta", clear
    replace pval_areg = `pval_areg' in `i'
    replace pval_xtreg = `pval_xtreg' in `i'
    save "/Users/fburlig/Desktop/file_drawer_sims_out.dta", replace
}

*** COMPUTE TYPE I ERROR RATES
use "/Users/fburlig/Desktop/file_drawer_sims_out.dta", clear
gen rej_xtreg = 0
replace rej_xtreg = 1 if pval_xtreg < 0.05
gen rej_areg = 0
replace rej_areg = 1 if pval_areg < 0.05

sum rej_xtreg
/*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   rej_xtreg |    10,000       .0501    .2181622          0          1
*/

sum rej_areg
/*
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    rej_areg |    10,000       .0052    .0719269          0          1
*/

*** NOTE: xtreg commits a type I error 5% of the time
*** areg does so 0.5% of the time!
```