Goldilocks RCTs

January 02, 2017

What better way to ring in the new year than to announce that Louis Preonas, Matt Woerman, and I have posted a new working paper, "Panel Data and Experimental Design"? The online appendix is here (warning - it's math heavy!), and we've got a software package, called pcpanel, available for Stata via ssc, with the R version to follow.*

TL;DR: Existing methods for performing power calculations with panel data only allow for very limited types of serial correlation, and will result in improperly powered experiments in real-world settings. We've got new methods (and software) for choosing sample sizes in panel data settings that properly account for arbitrary within-unit serial correlation, and yield properly powered experiments in simulated and real data.

The basic idea here is that researchers should aim to have appropriately-sized ("Goldilocks") experiments: too many participants, and your study is more expensive than it should be; too few, and you won't be able to statistically distinguish a true treatment effect from zero effect. It turns out that doing this right gets complicated in panel data settings, where you observe the same individual multiple times over the study. Intuitively, applied econometricians know that we have to cluster our standard errors to handle arbitrary within-unit correlation over time in panel data settings.** This will (in general) make our standard errors larger, and so we need to account for this ex ante, generally by increasing our sample sizes, when we design experiments. The existing methods for choosing sample sizes in panel data experiments only allow for very limited types of serial correlation, and require strong assumptions that are unlikely to be satisfied in most panel data settings. In this paper, we develop new methods for power calculations that accommodate the panel data settings that researchers typically encounter. In particular, we allow for arbitrary within-unit serial correlation, allowing researchers to design appropriately powered (read: correctly sized) experiments, even when using data with complex correlation structures.

I prefer pretty pictures to words, so let's illustrate that. The existing methods for power calculations in panel data only allow for serial correlation that can be fully described with fixed effects - that is, once you put a unit fixed effect into your model, your errors are no longer serially correlated, like this:

But we often think that real panel data exhibits more complex types of serial correlation - things like this:

Okay, that's a pretty stylized example - but we usually think of panel data being correlated over time - electricity consumption data, for instance, generally follows some kind of sinusoidal pattern; maize prices in East Africa at a given market typically exhibit correlation over time that can't just be described with a level shift; etc; etc; etc. And of course, in the real world, data are never nice enough that including a unit fixed effect can completely account for the correlation structure.

So what happens if I use the existing methods when I've got this type of data structure? I can get the answer wildly wrong. In the figure below, we've generated some difference-in-difference type data (with a treatment group that sees treatment turn on halfway through the dataset, and a control group that never experiences treatment) with a simple AR(1) process, calculated what the existing methods imply the minimum detectable effect (MDE) of the experiment should be, and simulated 10,000 "experiments" using this treatment effect size. To do this, we implement a simple difference-in-difference regression model. Because of the way we've designed this setup, if the assumptions of the model are correct, every line on the left panel (which shows realized power, or the fraction of these experiments where we reject the null of no treatment) should be at 0.8. Every line on the right panel should be at 0.05 - this shows the realized false rejection rate, or what happens when we apply a treatment effect size of 0.

Adapted from Figure 2 of Burlig, Preonas, Woerman (2017). The y-axis of the left panel shows the realized power, or fraction of the simulated "experiments" described above that reject the (false) null hypothesis; the y-axis of the right panel shows the realized false rejection rate, or fraction of simulated "experiments" that reject the true null under a zero treatment effect. The x-axis in both plots is the number of pre- and post-treatment periods in the "experiment." The colors show increasing levels of AR(1) correlation. If the Frison and Pocock (FP) model were performing properly, we should expect all of the lines on the left panel to be at 0.80, and all of the lines in the right panel to be at 0.05. Because we're clustering our standard errors in this setup, the right panel is getting things right - but the left panel is wildly off, because the FP model doesn't account for serial correlation.

We're clustering our standard errors, so as expected, our false rejection rate is always right at 0.05. But we're overpowered in short panels, and wildly underpowered in longer ones. The easiest way to think about statistical power is that if you aimed to be powered at 80% (generally the accepted standard), you're going to fail to reject the null - even when there is a true treatment effect - 20% of the time. So that means if you end up powered to, say, 20%, as happens with some of these simulations, you're going to fail to reject the (false) null 80% of the time. Yikes! What's happening here is essentially that by not taking serial correlation into account, in long panels, we think we can detect a smaller effect than we actually can. Because we're clustering our standard errors, though, our false rejection rate is disciplined - so we get stars on our estimates way less often than we were expecting.***

By contrast, when we apply our "serial-correlation-robust" method, which takes the serial correlation into account ex ante, this happens:

Adapted from Figure 2 of Burlig, Preonas, Woerman (2017). Same y- and x-axes as above, but now we're able to design appropriately-powered experiments for all levels of AR(1) correlation and all panel lengths - and we're still clustering our standard errors, so the false rejection rates are still right. Whoo!

That is, we get right on 80% power and 5% false rejection rates, regardless of the panel length and strength of the AR(1) parameter. This is the central result of the paper. Slightly more formally, our method starts with the existing power calculation formula, and extends it by adding three terms that we show are sufficient to characterize the full covariance structure of the data (see Equation (8) in the paper for more details).

If you're an economist (and, given that you've read this far, you probably are), you've got a healthy (?) level of skepticism. To demonstrate that we haven't just cooked this up by simulating our own data, we do the same thought experiment, but this time, using data from an actual RCT that took place in China (thanks, QJE open data policy!), where we don't know the underlying correlation structure. To be clear, what we're doing here is taking the pre-experimental period from the actual experimental data, and calculating the minimum detectable effect size for this data using the FP model, and again using our model.**** This is just like what we did above, except this time, we don't know the correlation structure. We non-parametrically estimate the parameters that both models need in order to calculate the minimum detectable effect. The idea here is to put ourselves in the shoes of real researchers, who have some pre-existing data, but don't actually know the underlying process that generated their data. So what happens?

The dot-dashed line shows the results when we use the existing methods; the dashed line shows the results when we guess that the correlation structure is AR(1); and the solid navy line shows the results using our method.

Adapted from Figure 3 of Burlig, Preonas, Woerman (2017). The axes are the same as above. The dot-dashed line shows the realized power, over varying panel lengths, of simulated experiments using the Bloom et al (2015) data in conjunction with the Frison and Pocock model. The dashed line instead assumes that the correlation structure in the data is AR(1), and calibrates our model under this assumption. Neither of these two methods performs particularly well - with 10 pre and 10 post periods, the FP model yields experiments that are powered to ~45 percent. While the AR(1) model performs better, it is still far off from the desired 80% power. In contrast, the solid navy line shows our serial-correlation-robust approach, demonstrating that even when we don't know the true correlation structure in the data, our model performs well, and delivers the expected 80% power across the full range of panel lengths.

Again, only our method achieves the desired 80% power across all panel lengths. While an AR(1) assumption gets closer than the existing method, it's still pretty off, highlighting the importance of thinking through the whole covariance structure.

In the remainder of the paper, we A) show that a similar result holds using high-frequency electricity consumption data from the US; B) show that collapsing your data to two periods won't solve all of your problems; C) think through what happens when using ANCOVA (common in economics RCTs) -- there are some efficiency gains, but much fewer than you'd think if you ignored the serial correlation; and D) couch these power calculations in a budget constrained setup to think about the trade-offs between more time periods and more units. *wipes brow*

A few last practical considerations that are worth addressing here:

All of the main results in the paper (and, indeed, in existing work on power calculations) are designed for the case where you know the true parameters governing the data generating process. In the appendix, we prove that (with small correction factors) you can also use estimates of these parameters to get to the right answer.
People are often worried enough about estimating the one parameter that the old formulas needed, let alone the four that our formula requires. While we don't have a perfect answer for this (more data is always better), simply ignoring these additional 3 parameters implicitly assumes they're zero, which is likely wrong. The paper and the appendix provide some thoughts on dealing with insufficient data.
Estimating these parameters can be complicated...so we've provided software that does it for you!
We'd also like to put in a plug for doing power calculations by simulation when you've got representative data - this makes it much easier to vary your model, assumptions on standard errors, etc, etc, etc.
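
To make that last point concrete, here's a rough sketch of what a power calculation by simulation can look like in Stata. To be clear about what's assumed: this is not the pcpanel package or the formula from the paper, the data are fake AR(1) draws standing in for your own representative data, and every name and number (nsims, J, T, rho, tau) is made up for illustration. The idea is just to simulate many "experiments" with a candidate effect size, run the regression you actually plan to run, and count how often you reject.

```stata
clear all
set seed 2017

* assumed design values, for illustration only
* nsims = simulated experiments, J = units, T = periods per unit
* rho = AR(1) parameter, tau = candidate treatment effect
local nsims = 200
local J     = 100
local T     = 10
local rho   = 0.5
local tau   = 0.25
local nrej  = 0

forvalues s = 1/`nsims' {
    quietly {
        clear
        set obs `J'
        gen id = _n
        * randomize units into treatment (before expanding to a panel)
        gen trt = runiform() > 0.5
        gen u_i = rnormal()
        expand `T'
        bysort id: gen t = _n
        * AR(1) errors within unit (first period is a plain N(0,1) draw)
        gen eps = rnormal() if t == 1
        bysort id (t): replace eps = `rho'*eps[_n-1] + rnormal() if t > 1
        * treatment turns on halfway through, for treated units only
        gen post = t > `T'/2
        gen d = trt*post
        gen y = `tau'*d + u_i + eps
        * the regression you plan to run, with clustered standard errors
        xtset id t
        xtreg y d i.t, fe vce(cluster id)
        local p = 2*ttail(e(df_r), abs(_b[d]/_se[d]))
        if `p' < 0.05 local ++nrej
    }
}

* share of simulated experiments that reject the null of no effect
scalar realized_power = `nrej'/`nsims'
display "realized power: " realized_power
```

Setting tau to zero turns the same loop into a check on the false rejection rate, and swapping in a different regression or standard error assumption is a one-line change, which is most of the appeal.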

Phew - managed to get through an entire blog post about econometrics without any equations! You're welcome. Overall, we're excited to have this paper out in the world - and looking forward to seeing an increasing number of (well-powered) panel RCTs start to hit the econ literature!

* We've debugged the software quite a bit in house, but there are likely still bugs. Let us know if you find something that isn't working!

** Yes, even in experiments. Though Chris Blattman is (of course) right that you don't need to cluster your standard errors in a unit-level randomization in the cross-section, this is no longer true in the panel. See the appendix for proofs.

*** What's going on with the short panels? Short panels will be overpowered in the AR(1) setup, because a difference-in-differences design is identified off of the comparison between treatment and control in the post-period vs this same difference in the pre-period. In short panels, more serial correlation means that it's actually easier to identify the "jump" at the point of treatment. In longer panels, this is swamped by the fact that each observation is now "worth less". See Equation (9) of the paper for more details.

**** We're not actually saying anything about what the authors should have done - we have no idea what they actually did! They find statistically significant results at the end of the day, suggesting that they did something right with respect to power calculations, but we remain agnostic about this.

Full disclosure: funding for this research was provided by the Berkeley Initiative for Transparency in the Social Sciences, a program of the Center for Effective Global Action (CEGA), with support from the Laura and John Arnold Foundation.

Is the file drawer too large? Standard Errors in Stata Strike Back

August 16, 2016

We've all got one - a "file drawer" of project ideas that we got a little way into and abandoned, never to see the light of day. Killing projects without telling anybody about it is bad for science - both because it likely leads to duplicate work, and because it makes it hard to know how much we should trust published findings. Are the papers that end up in journals just the lucky 5%? Do green jelly beans really cause cancer if a journal tells me so?!

I suspect that lots of projects die as a result of t < 1.96. It's hard to publish or get a job with results that aren't statistically significant, so if a simple test of a main hypothesis doesn't come up with stars, chances are that project ends up tabled (cabineted? drawered and quartered?).

But what if too many papers are ending up in the file drawer? Let's set aside broader issues surrounding publishing statistically insignificant results - it turns out that Stata* might be contributing to our file drawer problem. Or, rather, Stata users who don't know exactly what their fancy software is doing. Watch out - things are about to get a little bit technical.

Thanks to a super influential paper, Bertrand, Duflo, and Mullainathan (2004), whenever applied microeconometricians like me have multiple observations per individual, we're terrified that OLS standard errors will be biased towards zero. To deal with this problem, we generally cluster our standard errors. Great - standard errors get bigger, problem solved, right?

Turns out that's not quite the end of the story. Little-known - but very important! - fact: in short panels (like two-period diff-in-diffs!), clustered standard errors require a small-sample correction. With few observations per cluster, you should be just using the variance of the within-estimator to calculate standard errors, rather than the full variance. Failing to apply this correction can dramatically inflate standard errors - and turn a file-drawer-robust t-statistic of 1.96 into a t-statistic of, say 1.36. Back to the drawing board.** Are you running through a mental list of all the diff-in-diffs you've run recently and sweating yet?

Here's where knowing what happens under the hood of your favorite regression command is super important. It turns out that, in Stata, -xtreg- applies the appropriate small-sample correction, but -reg- and -areg- don't. Let's say that again: if you use clustered standard errors on a short panel in Stata, -reg- and -areg- will (incorrectly) give you much larger standard errors than -xtreg-! Let that sink in for a second. -reghdfe-, a user-written command for Stata that runs high-dimensional fixed effects models in a computationally-efficient way, also gets this right. (Digression: it's great. I use it almost exclusively for regressions in Stata these days.)

Edited to add: The difference between what -areg- and what -xtreg- are doing is that -areg- is counting all of the fixed effects against the regression's degrees of freedom, whereas -xtreg- is not. But in situations where fixed effects are nested within clusters, which is usually true in diff-in-diff settings, clustering already accounts for this, so you don't need to include these fixed effects in your DoF calculation. This would be akin to "double-counting" these fixed effects, so -xtreg- is doing the right thing. See pp. 17--18 of Cameron and Miller (ungated), Gormley and Matsa, Hanson and Sunderam, this Statalist post, and the -reghdfe- FAQ, many of which also cite Wooldridge (2010) on this topic. I finally convinced myself this was real with a little simulation, posted below, showing that if you apply a placebo treatment, -xtreg- will commit a Type I error the expected 5% of the time, but -areg- will do so only 0.5% of the time, suggesting that it's being overly conservative relative to what we'd expect it to do.

So: spread the Good News - if you've been using clustered standard errors with -reg- or -areg- on a short panel, you should switch to -xtreg- or -reghdfe-, and for once have correctly smaller standard errors. If for whatever reason you're unwilling to make the switch, you can multiply your -reg- or -areg- standard error by 1/sqrt((N-1)/(N-J-1)), where N is the total number of observations in your dataset, and J is the number of panel units (individuals) in your data, and you'll get the right answer again.***
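
As a quick sanity check on that rescaling, here's a small sketch that plugs in the numbers from the example output at the bottom of this post: N = 2,000 observations, J = 1,000 units, and a clustered standard error of .1281184 from -reg- and -areg-. The scalar names are just illustrative.

```stata
* numbers taken from the example regression output in this post
* n_obs = total observations, n_units = panel units (individuals)
scalar n_obs   = 2000
scalar n_units = 1000
* clustered standard error reported by -reg- and -areg-
scalar se_areg = .1281184

* rescale: multiply by 1/sqrt((N-1)/(N-J-1))
scalar se_fixed = se_areg / sqrt((n_obs - 1)/(n_obs - n_units - 1))
display "rescaled SE = " %9.7f se_fixed
```

The rescaled value should come out very close to the .0905707 that -xtreg- reports in the same example.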

Adjust your do files, shrink your standard errors, empty out your file drawer. Happy end of summer, y'all.

*For the smug R users among us (you know who you are), note that felm doesn't apply this correction either. Edited to add: Also, if you're an felm user, it turns out that felm uses the wrong degrees of freedom to calculate its p-value with clustered standard errors. If you have a large number of clusters, this won't matter, since the t distribution converges decently quickly, but in smaller samples, this can make a difference. Use the exactDOF option to set your degrees of freedom equal to the number of clusters to fix this problem.

**Note: I'm not advocating throwing away results with t=1.36. That would be Bad Science.

*** What about cross-sectional data? When is -areg- right? For more details, please scroll (all the way) down below to read David Drukker's comment on when -areg- is appropriate. Here's a small piece of his comment:

“Sometimes I have cross-sectional data and I want to condition on a
state-level fixed effects. (If I add more individuals to the sample, the
number of fixed effects does not change.) Sometimes I have a short panel and
I want to condition on individual-level fixed effects. (Every new
individual in the sample adds a fixed effect on which I must condition.)”

That is: -areg- is appropriate in the first case, -xtreg- is appropriate in the latter case. All of this highlights for me the importance of understanding what your favorite statistical package is doing, and why it's doing it. Read the help documentation, code up simulations, and figure out what's going on under the hood before blindly running regressions.

H/t to my applied-econometrician-partners in crime for helping me to do just that.

See also: More Stata standard error hijinks.

Simple example code for Stata -- notice that t goes from 1.39 to 1.97 when we switch from the incorrect to the correct clustered standard errors! Edited to add: The first chunk of code just demonstrates that the SE's are different for different approaches. The second chunk of code runs a simulation that applies a placebo treatment. I wrote it quickly. It's not super computationally efficient.

*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- SETUP
*************************************************************************

clear all
version 14
set more off
set matsize 10000
set seed 12345

* generate 1000 obs (one per unit)
set obs 1000
* create unit ids
gen ind = _n

* create unit fixed effects
gen u_i = rnormal(1, 10)
* and 2 time periods per unit
expand 2
bysort ind: gen post = _n - 1

* generate a time effect 
gen nu_t = rnormal(3, 5)
replace nu_t = nu_t[1]
replace nu_t = rnormal(3,5) if post == 1
replace nu_t = nu_t[2] if post == 1

* ``randomize'' half into treatment
gen trtgroup = 0
replace trtgroup = 1 if ind > 500

* and treat units in the post-period only
gen treatment = 0
replace treatment = 1 if trtgroup == 1 & post == 1 

* generate a random error
gen eps = rnormal()

**** DGP ****
gen y = 3 + 0.15*treatment + u_i + nu_t + eps

*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- ESTIMATION RESULTS
*************************************************************************
*** ESTIMATE USING -reg-

* might want to comment this out if your computer is short on memory

reg y treatment i.post i.ind, vce(cluster ind)
/*

Linear regression                               Number of obs     =      2,000
                                                F(1, 999)         =          .
                                                Prob > F          =          .
                                                R-squared         =     0.9957
                                                Root MSE          =     1.0126

                                  (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .1281184     1.39   0.164    -.0731525    .4296715
*/

*** ESTIMATE USING -areg-

areg y treatment i.post, a(ind) vce(cluster ind)
/*

Linear regression, absorbing indicators         Number of obs     =      2,000
                                                F(2, 999)         =    6284.86
                                                Prob > F          =     0.0000
                                                R-squared         =     0.9957
                                                Adj R-squared     =     0.9913
                                                Root MSE          =     1.0126

                                  (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .1281184     1.39   0.164    -.0731525    .4296715

*/
*** ESTIMATE USING -xtreg-
xtset ind post
xtreg y treatment i.post, fe vce(cluster ind)
/*

Fixed-effects (within) regression               Number of obs     =      2,000
Group variable: ind                             Number of groups  =      1,000

R-sq:                                           Obs per group:
     within  = 0.9618                                         min =          2
     between = 0.0010                                         avg =        2.0
     overall = 0.1091                                         max =          2

                                                F(2,999)          =   12576.01
corr(u_i, Xb)  = -0.0004                        Prob > F          =     0.0000

                                  (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595   .0905707     1.97   0.049     .0005289    .3559901
*/

*** ESTIMATE USING -reghdfe-

reghdfe y treatment, a(ind post) vce(cluster ind)
/*
HDFE Linear regression                            Number of obs   =      2,000
Absorbing 2 HDFE groups                           F(1, 999)       =       3.88
Statistics robust to heteroskedasticity           Prob > F        =     0.0493
                                                  R-squared       =     0.9957
                                                  Adj R-squared   =     0.9913
                                                  Within R-sq.    =     0.0039
Number of clusters (ind)     =      1,000         Root MSE        =     1.0126

                                  (Std. Err. adjusted for 1,000 clusters in ind)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   treatment |   .1782595    .090548     1.97   0.049     .0005734    .3559457
------------------------------------------------------------------------------
*/

*************************************************************************
***** IS THE FILE DRAWER TOO LARGE? -- SIMULATIONS
*************************************************************************

clear all
version 14
set more off
set seed 12345

local nsims = 10000
** set up dataset to save results **
set obs `nsims'
gen pval_areg = .
gen pval_xtreg = .
save "/Users/fburlig/Desktop/file_drawer_sims_out.dta", replace

*** SIMULATION
** NOTE: THIS IS NOT A SUPER EFFICIENT LOOP. IT'S SLOW. 
** YOU MAY WANT TO ADJUST THE NUMBER OF SIMS DOWN.
clear 
forvalues i = 1/`nsims' {
clear
* generate 1000 obs
set obs 1000
* create unit ids
gen ind = _n

* create unit fixed effects
gen u_i = rnormal(1, 10)
* randomize units into treatment
gen randomizer = runiform()

* ``randomize'' half into treatment
gen trtgroup = 0
replace trtgroup = 1 if randomizer > 0.5
drop randomizer

* and 2 time periods per unit
expand 2
bysort ind: gen post = _n - 1

* generate a time effect 
gen nu_t = rnormal(3, 5)
replace nu_t = nu_t[1]
replace nu_t = rnormal(3,5) if post == 1
replace nu_t = nu_t[2] if post == 1

* and treat units in the post-period only
gen treatment = 0
replace treatment = 1 if trtgroup == 1 & post == 1 

* generate a random error
gen eps = rnormal()

**** XTSET
xtset ind post

**** DGP:TREATMENT EFFECT OF ZERO ****
gen y = 3 + 0*treatment + u_i + nu_t + eps

*** store p-value -- -areg-
areg y treatment i.post, absorb(ind) vce(cluster ind)
local pval_areg =2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
di `pval_areg'

*** store p-value -- -xtreg-
xtreg y treatment i.post, fe vce(cluster ind)
local pval_xtreg =2*ttail(e(df_r), abs(_b[treatment]/_se[treatment]))
di `pval_xtreg'

use "/Users/fburlig/Desktop/file_drawer_sims_out.dta", clear
replace pval_areg = `pval_areg' in `i'
replace pval_xtreg = `pval_xtreg' in `i'
save "/Users/fburlig/Desktop/file_drawer_sims_out.dta", replace
}

*** COMPUTE TYPE I ERROR RATES
use "/Users/fburlig/Desktop/file_drawer_sims_out.dta", clear

gen rej_xtreg = 0
replace rej_xtreg = 1 if pval_xtreg < 0.05

gen rej_areg = 0
replace rej_areg = 1 if pval_areg < 0.05

sum rej_xtreg
/*
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   rej_xtreg |     10,000       .0501    .2181622          0          1

*/

sum rej_areg
/*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    rej_areg |     10,000       .0052    .0719269          0          1
*/

*** NOTE: xtreg commits a type I error 5% of the time
** areg does so 0.5% of the time!

Standard errors in Stata: a (somewhat) cautionary tale

January 29, 2016

Last week, a colleague and I were having a conversation about standard errors. He had a new discovery for me - "Did you know that clustered standard errors and robust standard errors are the same thing with panel data?"

I argued that this couldn't be right - but he said that he'd run -xtreg- in Stata with robust standard errors and with clustered standard errors and gotten the same result - and then sent me the relevant citations in the Stata help documentation. I'm highly skeptical - especially when it comes to standard errors - so I decided to dig into this a little further.

Turns out Andrew was wrong after all - but through very little fault of his own. Stata pulled the wool over his eyes a little bit here. It turns out that in panel data settings, "robust" - AKA heteroskedasticity-robust - standard errors aren't consistent. Oops. This important insight comes from James Stock and Mark Watson's 2008 Econometrica paper. So using -xtreg, fe robust- is bad news. In light of this result, StataCorp made an executive decision: when you specify -xtreg, fe robust-, Stata actually calculates standard errors as though you had written -xtreg, fe vce(cluster panelvar)- !
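
If you'd like to see this for yourself, here's a minimal sketch on a made-up panel (the variable names and the data generating process are invented for illustration). It runs -xtreg, fe- once with vce(robust) and once with vce(cluster id), and stores the standard error on x from each.

```stata
clear all
set seed 2016

* toy panel: 200 units observed for 5 periods each
set obs 200
gen id = _n
gen a_i = rnormal()
expand 5
bysort id: gen t = _n
gen x = rnormal() + 0.5*a_i
gen y = 1 + 2*x + a_i + rnormal()
xtset id t

* ask for "robust" standard errors...
xtreg y x, fe vce(robust)
scalar se_robust = _se[x]

* ...and then cluster explicitly on the panel variable
xtreg y x, fe vce(cluster id)
scalar se_cluster = _se[x]

display "robust: " se_robust "   cluster: " se_cluster
```

On a version of Stata that behaves as described above, the two standard errors should match, and both sets of output should carry the "adjusted for clusters" note.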

Standard errors: giving sandwiches a bad name since 1967.

On the one hand, it's probably a good idea not to allow users to compute robust standard errors in panel settings anymore. On the other hand, computing something other than what users think is being computed, without an explicit warning that this is happening, is less good.

To be fair, Stata does tell you that "(Std. Err. adjusted for N clusters in panelvar)", but this is easy to miss - there's no "Warning - Clustered standard errors computed in place of robust standard errors" label, or anything like that. The help documentation mentions (on p. 25) that specifying -vce(robust)- is equivalent to specifying -vce(cluster panelvar)-, but what's actually going on is pretty hard to discern, I think. Especially because there's a semantics issue here: a cluster-robust standard error and a heteroskedasticity-robust standard error are two different things. In the econometrics classes I've taken, "robust" is used to refer to heteroskedasticity- (but not cluster-) robust errors. In fact, StataCorp refers to errors this way in a very thorough and useful FAQ answer posted online - and clearly states in another FAQ answer that the Huber and White papers don't deal with clustering.

All that to say: when you use canned routines, it's very important to know what exactly they're doing! I have a growing appreciation for Max's requirement that his econometrics students build their own functions up from first principles. This is obviously impractical in the long run, but helps to instill a healthy mistrust of others' code. So: caveat econometricus - let the econometrician beware!