Goldilocks RCTs

January 02, 2017

What better way to ring in the new year than to announce that Louis Preonas, Matt Woerman, and I have posted a new working paper, "Panel Data and Experimental Design"? The online appendix is here (warning - it's math heavy!), and we've got a software package, called pcpanel, available for Stata via ssc, with the R version to follow.*

TL;DR: Existing methods for performing power calculations with panel data only allow for very limited types of serial correlation, and will result in improperly powered experiments in real-world settings. We've got new methods (and software) for choosing sample sizes in panel data settings that properly account for arbitrary within-unit serial correlation, and yield properly powered experiments in simulated and real data.

The basic idea here is that researchers should aim to have appropriately-sized ("Goldilocks") experiments: too many participants, and your study is more expensive than it should be; too few, and you won't be able to statistically distinguish a true treatment effect from zero effect. It turns out that doing this right gets complicated in panel data settings, where you observe the same individual multiple times over the study. Intuitively, applied econometricians know that we have to cluster our standard errors to handle arbitrary within-unit correlation over time in panel data settings.** This will (in general) make our standard errors larger, and so we need to account for this ex ante, generally by increasing our sample sizes, when we design experiments. The existing methods for choosing sample sizes in panel data experiments only allow for very limited types of serial correlation, and require strong assumptions that are unlikely to be satisfied in most panel data settings. In this paper, we develop new methods for power calculations that accommodate the panel data settings that researchers typically encounter. In particular, we allow for arbitrary within-unit serial correlation, allowing researchers to design appropriately powered (read: correctly sized) experiments, even when using data with complex correlation structures.

I prefer pretty pictures to words, so let's illustrate that. The existing methods for power calculations in panel data only allow for serial correlation that can be fully described with fixed effects - that is, once you put a unit fixed effect into your model, your errors are no longer serially correlated, like this:

But we often think that real panel data exhibits more complex types of serial correlation - things like this:

Okay, that's a pretty stylized example - but we usually think of panel data being correlated over time - electricity consumption data, for instance, generally follows some kind of sinusoidal pattern; maize prices in East Africa at a given market typically exhibit correlation over time that can't just be described with a level shift; etc; etc; etc. And of course, in the real world, data are never nice enough that including a unit fixed effect can completely account for the correlation structure.

So what happens if I use the existing methods when I've got this type of data structure? I can get the answer wildly wrong. In the figure below, we've generated some difference-in-difference type data (with a treatment group that sees treatment turn on halfway through the dataset, and a control group that never experiences treatment) with a simple AR(1) process, calculated what the existing methods imply the minimum detectable effect (MDE) of the experiment should be, and simulated 10,000 "experiments" using this treatment effect size. To do this, we implement a simple difference-in-difference regression model. Because of the way we've designed this setup, if the assumptions of the model are correct, every line on the left panel (which shows realized power, or the fraction of these experiments where we reject the null of no treatment) should be at 0.8. Every line on the right panel should be at 0.05 - this shows the realized false rejection rate, or what happens when we apply a treatment effect size of 0.

Adapted from Figure 2 of Burlig, Preonas, Woerman (2017). The y-axis of the left panel shows the realized power, or fraction of the simulated "experiments" described above that reject the (false) null hypothesis; the y-axis of the right panel shows the realized false rejection rate, or fraction of simulated "experiments" that reject the true null under a zero treatment effect. The x-axis in both plots is the number of pre- and post-treatment periods in the "experiment." The colors show increasing levels of AR(1) correlation. If the Frison and Pocock (FP) model were performing properly, all of the lines in the left panel would be at 0.80, and all of the lines in the right panel would be at 0.05. Because we're clustering our standard errors in this setup, the right panel is getting things right - but the left panel is wildly off, because the FP model doesn't account for this serial correlation.

We're clustering our standard errors, so as expected, our false rejection rate is always right at 0.05. But we're overpowered in short panels, and wildly underpowered in longer ones. The easiest way to think about statistical power is that if you aimed to be powered at 80% (generally the accepted standard), you're going to fail to reject the null - even when there is a true treatment effect - 20% of the time. So that means if you end up powered to, say, 20%, as happens with some of these simulations, you're going to fail to reject the (false) null 80% of the time. Yikes! What's happening here is essentially that by not taking serial correlation into account, in long panels, we think we can detect a smaller effect than we actually can. Because we're clustering our standard errors, though, our false rejection rate is disciplined - so we get stars on our estimates way less often than we were expecting.***

By contrast, when we apply our "serial-correlation-robust" method, which takes the serial correlation into account ex ante, this happens:

Adapted from Figure 2 of Burlig, Preonas, Woerman (2017). Same y- and x-axes as above, but now we're able to design appropriately-powered experiments for all levels of AR(1) correlation and all panel lengths - and we're still clustering our standard errors, so the false rejection rates are still right. Whoo!

That is, we get right on 80% power and 5% false rejection rates, regardless of the panel length and strength of the AR(1) parameter. This is the central result of the paper. Slightly more formally, our method starts with the existing power calculation formula, and extends it by adding three terms that we show are sufficient to characterize the full covariance structure of the data (see Equation (8) in the paper for more details).
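For the code-inclined, here is a minimal sketch of why the covariance structure matters for the MDE. It is not the paper's Equation (8) and not what pcpanel does; it just computes, from first principles, the variance of a difference-in-differences estimate in a balanced panel for any within-unit covariance matrix. The function names (`ar1_cov`, `did_unit_variance`, `mde`), the normal (rather than t) critical values, and the "naive" benchmark that treats errors as independent across periods are all assumptions made for illustration.

```python
from statistics import NormalDist


def ar1_cov(n_periods, rho, sigma2=1.0):
    """Covariance matrix of a stationary AR(1) error with variance sigma2."""
    return [[sigma2 * rho ** abs(s - t) for t in range(n_periods)]
            for s in range(n_periods)]


def did_unit_variance(cov, m):
    """Variance of (post-period mean - pre-period mean) for a single unit.

    cov is the full T x T within-unit error covariance matrix; the first m
    periods are pre-treatment and the remaining r = T - m are post-treatment.
    """
    n_periods = len(cov)
    r = n_periods - m
    # Weights that turn the T observations into "post mean minus pre mean".
    w = [-1.0 / m] * m + [1.0 / r] * r
    return sum(w[s] * cov[s][t] * w[t]
               for s in range(n_periods) for t in range(n_periods))


def mde(cov, m, n_units, share_treated=0.5, alpha=0.05, power=0.80):
    """Minimum detectable effect using normal critical values (an approximation)."""
    z = NormalDist().inv_cdf
    var_hat = did_unit_variance(cov, m) / (share_treated * (1 - share_treated) * n_units)
    return (z(1 - alpha / 2) + z(power)) * var_hat ** 0.5


if __name__ == "__main__":
    for periods in (1, 5, 10):  # number of pre periods = number of post periods
        naive = mde(ar1_cov(2 * periods, 0.0), periods, n_units=100)
        robust = mde(ar1_cov(2 * periods, 0.5), periods, n_units=100)
        print(periods, round(naive, 3), round(robust, 3))
```

With independent errors, `did_unit_variance` collapses to the familiar sigma^2 * (m + r) / (m * r); the extra pieces that show up under serial correlation are the off-diagonal covariances within the pre-period, within the post-period, and across the two. In this toy setup the naive MDE is too large in very short panels and too small in long ones, which is the same qualitative pattern as in the figure above.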

If you're an economist (and, given that you've read this far, you probably are), you've got a healthy (?) level of skepticism. To demonstrate that we haven't just cooked this up by simulating our own data, we do the same thought experiment, but this time, using data from an actual RCT that took place in China (thanks, QJE open data policy!), where we don't know the underlying correlation structure. To be clear, what we're doing here is taking the pre-experimental period from the actual experimental data, and calculating the minimum detectable effect size for this data using the FP model, and again using our model.**** This is just like what we did above, except this time, we don't know the correlation structure. We non-parametrically estimate the parameters that both models need in order to calculate the minimum detectable effect. The idea here is to put ourselves in the shoes of real researchers, who have some pre-existing data, but don't actually know the underlying process that generated their data. So what happens?

The dot-dashed line shows the results when we use the existing methods; the dashed line shows the results when we guess that the correlation structure is AR(1); and the solid navy line shows the results using our method.

Adapted from Figure 3 of Burlig, Preonas, Woerman (2017). The axes are the same as above. The dot-dashed line shows the realized power, over varying panel lengths, of simulated experiments using the Bloom et al (2015) data in conjunction with the Frison and Pocock model. The dashed line instead assumes that the correlation structure in the data is AR(1), and calibrates our model under this assumption. Neither of these two methods performs particularly well - with 10 pre and 10 post periods, the FP model yields experiments that are powered to ~45 percent. While the AR(1) model performs better, it is still far off from the desired 80% power. In contrast, the solid navy line shows our serial-correlation-robust approach, demonstrating that even when we don't know the true correlation structure in the data, our model performs well, and delivers the expected 80% power across the full range of panel lengths.

Again, only our method achieves the desired 80% power across all panel lengths. While an AR(1) assumption gets closer than the existing method, it's still pretty off, highlighting the importance of thinking through the whole covariance structure.

In the remainder of the paper, we A) show that a similar result holds using high-frequency electricity consumption data from the US; B) show that collapsing your data to two periods won't solve all of your problems; C) think through what happens when using ANCOVA (common in economics RCTs) -- there are some efficiency gains, but far fewer than you'd think if you ignored the serial correlation; and D) couch these power calculations in a budget constrained setup to think about the trade-offs between more time periods and more units. *wipes brow*

A few last practical considerations that are worth addressing here:

All of the main results in the paper (and, indeed, in existing work on power calculations) are designed for the case where you know the true parameters governing the data generating process. In the appendix, we prove that (with small correction factors) you can also use estimates of these parameters to get to the right answer.
People are often worried enough about estimating the one parameter that the old formulas needed, let alone the four that our formula requires. While we don't have a perfect answer for this (more data is always better), simply ignoring these additional 3 parameters implicitly assumes they're zero, which is likely wrong. The paper and the appendix provide some thoughts on dealing with insufficient data.
Estimating these parameters can be complicated...so we've provided software that does it for you!
We'd also like to put in a plug for doing power calculations by simulation when you've got representative data - this makes it much easier to vary your model, assumptions on standard errors, etc, etc, etc.
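To make the simulation plug above a bit more concrete, here's a rough sketch of a simulation-based power calculation. Everything in it is an assumption for illustration: the data are generated from an AR(1) process with unit variance rather than drawn from real data, the function names are made up, and each unit is collapsed to its post-minus-pre change and compared across arms with an unequal-variance z-test, which is a simple stand-in for a difference-in-differences regression with clustered standard errors in a balanced panel. It is not the pcpanel implementation.

```python
import random
from statistics import NormalDist, variance


def avg(xs):
    return sum(xs) / len(xs)


def unit_change(m, r, rho, effect, rng):
    """Post-period mean minus pre-period mean for one simulated unit."""
    innovation_sd = (1 - rho ** 2) ** 0.5  # keeps the error variance at 1
    e = rng.gauss(0, 1)                    # start from the stationary distribution
    ys = []
    for t in range(m + r):
        if t > 0:
            e = rho * e + rng.gauss(0, innovation_sd)
        # The treatment effect only switches on in the post-period.
        ys.append(e + (effect if t >= m else 0.0))
    return avg(ys[m:]) - avg(ys[:m])


def experiment_rejects(n_units, m, r, rho, effect, rng, alpha=0.05):
    """Simulate one experiment; return True if the null of no effect is rejected."""
    half = n_units // 2
    treated = [unit_change(m, r, rho, effect, rng) for _ in range(half)]
    control = [unit_change(m, r, rho, 0.0, rng) for _ in range(n_units - half)]
    estimate = avg(treated) - avg(control)
    se = (variance(treated) / len(treated) + variance(control) / len(control)) ** 0.5
    return abs(estimate / se) > NormalDist().inv_cdf(1 - alpha / 2)


def simulated_power(n_units, m, r, rho, effect, n_sims=1000, seed=0):
    """Share of simulated experiments that reject the null."""
    rng = random.Random(seed)
    hits = sum(experiment_rejects(n_units, m, r, rho, effect, rng)
               for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    print("false rejection rate:", simulated_power(100, 5, 5, 0.5, effect=0.0))
    print("power at effect=0.5: ", simulated_power(100, 5, 5, 0.5, effect=0.5))
```

Setting `effect` to the MDE implied by whatever formula you're using and checking whether the result lands near 0.80 is the same exercise as in the figures above. With real pre-period data, you'd replace `unit_change` with draws from your own panel.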

Phew - managed to get through an entire blog post about econometrics without any equations! You're welcome. Overall, we're excited to have this paper out in the world - and looking forward to seeing an increasing number of (well-powered) panel RCTs start to hit the econ literature!

* We've debugged the software quite a bit in house, but there are likely still bugs. Let us know if you find something that isn't working!

** Yes, even in experiments. Though Chris Blattman is (of course) right that you don't need to cluster your standard errors in a unit-level randomization in the cross-section, this is no longer true in the panel. See the appendix for proofs.

*** What's going on with the short panels? Short panels will be overpowered in the AR(1) setup, because a difference-in-differences design is identified off of the comparison between treatment and control in the post-period vs this same difference in the pre-period. In short panels, more serial correlation means that it's actually easier to identify the "jump" at the point of treatment. In longer panels, this is swamped by the fact that each observation is now "worth less". See Equation (9) of the paper for more details.

**** We're not actually saying anything about what the authors should have done - we have no idea what they actually did! They find statistically significant results at the end of the day, suggesting that they did something right with respect to power calculations, but we remain agnostic about this.

Full disclosure: funding for this research was provided by the Berkeley Initiative for Transparency in the Social Sciences, a program of the Center for Effective Global Action (CEGA), with support from the Laura and John Arnold Foundation.

WWP: Forests without borders?

May 09, 2016

As the semester winds to a close, I've been trying to get back into a better habit of reading new papers. Really, I'm just looking for more excuses to hang out at my local Blue Bottle coffee shop - great lighting, tasty snacks, wonderful caffeine, homegrown in Oakland...what more could I ask for? Sometimes doing lots of reading can feel like a slog - but this week, I stumbled across a really cool new working paper that's well worth a look. The paper, by Robin Burgess, Francisco Costa, and Ben Olken, is called "The Power of the State: National Borders and the Deforestation of the Amazon." Deforestation is a pretty big problem in the Amazon, especially when we think that (on top of the idea that we might want to be careful to protect exhaustible natural resources for their own sake, and not pave paradise to build a parking lot) tropical forests have a large role to play in combating climate change, because trees serve as a pretty darn effective carbon sink. Plus they produce oxygen. Win-win! Despite these benefits, trees are also lucrative, and there are also economic benefits to converting forests into farmland. This has led to a spate of deforestation.

To combat the destruction of the Amazon, Brazil enacted a series of anti-deforestation policies in 2005-06. It's important to understand whether these policies worked, and it's not obvious ex ante that they would: according to evidence from a friend (and badass spatial data guru/data scientist), Dan Hammer, a deforestation moratorium in Indonesia was unsuccessful: if anything, it led to an increased rate of deforestation. Oops.

The key figure from Dan's paper. The grey bars indicate times when the Indonesian deforestation moratorium was in effect. The teal line is the deforestation rate in Malaysia, and the salmon line is the rate in Indonesia. Not much evidence here that Indonesia's deforestation rate decreased relative to Malaysia during (after) the grey periods.

The big challenge with studying deforestation, especially when it's been made illegal, is that it's tough to get data on. Going out and counting trees requires a huge effort, and people don't usually like to self-report illegal activity. Like Dan before them, Burgess, Costa, and Olken (hereafter BCO) turn to satellite imagery. As I've said before, I'm excited about the advances in remote sensing - there's an explosion of data, which can be harnessed to measure all kinds of things where we don't necessarily have good surveys (Marshall Burke, ahead of the curve as usual). BCO use 30 x 30 meter resolution data on forest cover from a paper published in Science - ahh, interdisciplinary (and a win for open data!).

Of course, it's not enough to just have a measurement of forest cover - in order to figure out the causal effect of Brazil's deforestation policies, the authors also need an identification strategy. Maybe this is because I've got regression discontinuities on the brain lately, but I think what BCO do is super cool. They use the border between Brazil and its neighbors in the Amazon to identify the effects of Brazil's policy. The argument is that, other than the deforestation policies in the different countries, parts of the Amazon just to the Brazilian side of the border look just like parts of the Amazon just to the opposite side of the border. This obviously isn't bullet-proof - you might worry that governance, institutions, infrastructure, populations, languages, etc change discontinuously at the border. They do some initial checks to show that this isn't true (including a nice anecdote where the Brazilian president-elect accidentally walked into Bolivia for an hour before being stopped by the border patrol), which are decently compelling (though we're always worried about unobservables). Under this assumption, BCO run an RD comparing deforestation rates in Brazil to its neighbors:

The first key figure from BCO: in 2000, well before Brazil's aggressive anti-deforestation policies, the percent of forest cover was much lower on the Brazilian (right-hand) side of the border.

There's a clear visual difference between deforestation in Brazil and in its neighbors. But here's what I really like about the paper: even if you don't completely buy the static RD identifying assumption here, you have to agree that the following sequence of RD figures is pretty compelling.

This is a little annoying to compare to the first figure, since the y-axis here is different: this time, it's percent of forest cover lost each year - that's why Brazil appears higher in these figures than in the earlier graph. But: it clearly pops out that in 2006, when Brazil's policies went into effect, the discontinuity disappears.

The cool thing about the data that the authors have, which differentiates it from many spatial RD papers, is that it's not static - they've got multiple years of data. This allows them to look at deforestation over time. Critically, even though Brazil's forest cover is much lower than its neighbors prior to the policy, its annual rate of forest cover loss slows dramatically in 2006, when the deforestation policies came into effect, and appears to remain equal to the neighboring country rate in 2007 and 2008.
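For readers who haven't run a spatial RD before, the year-by-year border comparison described above can be sketched roughly as follows. This is a toy version, not BCO's specification: the data are made up and noise-free, the bandwidth is arbitrary, and the function names and the sign convention (positive distance = Brazilian side) are my own assumptions. It fits a separate line on each side of the border and takes the gap between the two fitted values at the border itself.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def rd_jump(distance, outcome, bandwidth):
    """Gap in fitted outcomes at the border (distance = 0).

    Positive distances are on the Brazilian side (an arbitrary convention).
    """
    left = [(d, y) for d, y in zip(distance, outcome) if -bandwidth <= d < 0]
    right = [(d, y) for d, y in zip(distance, outcome) if 0 <= d <= bandwidth]
    a_left, _ = fit_line(*zip(*left))
    a_right, _ = fit_line(*zip(*right))
    return a_right - a_left


if __name__ == "__main__":
    distance_km = list(range(-50, 51))
    # Hypothetical discontinuities in annual forest loss (percentage points).
    hypothetical_jump_by_year = {2004: 0.3, 2005: 0.3, 2006: 0.0}
    for year, jump in hypothetical_jump_by_year.items():
        loss = [0.2 + 0.001 * d + (jump if d >= 0 else 0.0) for d in distance_km]
        print(year, round(rd_jump(distance_km, loss, bandwidth=50), 6))
```

Running the same estimator separately for each year is what lets you watch the discontinuity in the loss rate shrink when the policy kicks in. A real application would add standard errors, a data-driven bandwidth, and the robustness checks mentioned below.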

This is pretty strong evidence that the anti-deforestation policies put into place by the Brazilian government worked! You should still be slightly skeptical, and want to see a bunch of robustness checks (many of which are in the paper), but I really like this paper. It combines awesome remote sensing data with a quasi-experimental research design to study the effectiveness of important policies. It's not too often that we can be optimistic about the future of the Amazon - but it looks like we've got some reason to be hopeful here.

If you've made it all the way through this post, I'll reward you with John Oliver's new video on science.

Conference Recap: PacDev 2016

March 11, 2016

I had the pleasure of driving down to rainy Stanford for PacDev last weekend. Aside from the obviously-most-important benefit of having fondue for dinner at my parents' house, I got to present my ongoing work with Louis. Our paper, on the effects of electrification on economic development in India, is finally almost ready to see the light of day (for real this time, we swear!) - there will of course be an announcement on this blog when that happens. But enough about me. I also got to see a bunch of interesting new papers in a diverse range of subsets of development economics, and I'm pretty excited about them. A few highlights:

I went to an (early) morning session on taxes, which I know almost nothing about, and really enjoyed all four papers I saw: Anne Brockmeyer's on third-party information and withholding decisions in Costa Rica; Spencer Smith's on some cool experimental evidence on information (also in Costa Rica); Jose Tessada's on regulations about firm gender ratios and capital substitution; and the always-excellent (follow her on Twitter!) Dina Pomeranz's on multinational corporations and tax havens.
Yong Suk Lee (previously a professor at Williams!) has an interesting new paper looking at the effect of sanctions in North Korea on economic activity. He shows that the sanctions and subsequent shifts in international trading partners led to changes in the locational composition of economic activity, as proxied by nighttime lights. While I'm always wary of the use of nightlights to proxy for sub-national GDP, this paper deals with these issues nicely, making within-country comparisons between regions and talking about relative concentration and identification. Plus, we have so little to go on with North Korea. Cool stuff.
I also went to two political economy sessions - like economic history, political economy often has some of my favorite papers at conferences like this. I particularly enjoyed listening to my old boss, Saumitra Jha, talk about an amazing experiment in which giving Israelis stocks led them to be more willing to vote for a peace-promoting government, and hearing Kweku Opoku-Agyemang (a Berkeley post-doc) talk about the effects of raising police salaries on bribery.

All in all, a great conference. Hopefully I'll get to go back next year!

If you read all the way down here, here's a reward. Amazing. h/t Josh.

WWP: Rain, rain, go away, you're hurting my human capital development

February 10, 2016

Greetings from String Matching Land! Since I seem to be trapped in an endless loop of making small edits to code and then waiting 25' for it to run (break), I'm going to break out for a little bit and tell you about a cool new paper that I just finished reading. Also, it's Wednesday, so my WWP acronym remains intact. Nice.

Manisha Shah (a Berkeley ARE graduate!) at UCLA and Bryce Millett Steinberg, a Harvard PhD currently doing a postdoc at Brown who is on the job market this year (hire her! this profession needs more women!), have a new paper that I really like. I thought of this same idea (with slightly different data) recently, and then realized that their paper is forthcoming in the JPE (damn) - and it's excellently done. The writing is clear, the set-up is interesting, the data are cool, the empirics are credible, and the results are intuitive. Did I mention it's forthcoming in the JPE?

In this paper, Shah and Steinberg tackle a prominent strand of development economics: what do economic shocks do to children at various stages of growth? There's a long literature on this, including the canonical Maccini and Yang paper (2009 AER), which finds that good rainfall shocks in early life dramatically improve outcomes for women as adults (in Indonesia). That paper does a great job of documenting a treatment effect (if you haven't read it yet, metaphorically put down this blog and go read that instead), but has less to say about the mechanisms behind it.

Steinberg and Shah take seriously the idea that rainfall shocks might affect human capital through multiple channels: good rain shocks could mean more income, and therefore consumption and human capital, or good rain shocks might mean a higher opportunity cost of schooling, leading to less education and human capital development. They put together a very simple but elegant model of human capital decisions, and test its implications using a large dataset including some simple math and verbal tests from India. They show that good rain shocks are beneficial for human capital (as proxied by test scores) early in life, but lead to a decrease in human capital later in life. They demonstrate that children are in fact substituting labor for schooling in good harvesting years, and show that rainfall experienced in childhood matters for total years of schooling as well, which could help explain the Maccini and Yang result, though they don't find differential effects by gender.

In the authors' own words (from the abstract):

“Higher wages are generally thought to increase human capital production, particularly in the developing world. We introduce a simple model of human capital production in which investments and time allocation differ by age. Using data on test scores and schooling from rural India, we show that higher wages increase human capital investment in early life (in utero to age 2) but decrease human capital from ages 5-16. Positive rainfall shocks increase wages by 2% and decrease math test scores by 2-5% of a standard deviation, school attendance by 2 percentage points, and the probability that a child is enrolled in school by 1 percentage point. These results are long-lasting; adults complete 0.2 fewer total years of schooling for each year of exposure to a positive rainfall shock from ages 11-13. We show that children are switching out of school enrollment into productive work when rainfall is higher. These results suggest that the opportunity cost of schooling, even for fairly young children, is an important factor in determining overall human capital investment.”

Obligatory stock photo of Indian school kids during the rainy season. Obviously not my own photo.

A few nitpicky points: I could've missed this, but the data are a repeated cross-section rather than a panel of students, so I wanted a little more discussion of whether selection into the dataset was driving the empirics. Also, when they start splitting things by age group, I'm surprised that they still have enough variation in test performance among 11-16-year-olds to estimate effects. I would've expected these students to max out the test metrics, given that the exam being administered covers incredibly basic numeracy and literacy skills. But maybe not. Finally, since I'm teaching my 1st-year econometrics students about figures soon: these graphs convey the message but aren't the sexiest. Personal gripe. All in all, though, this is a really nice paper - I urge you to go read it.

A final caveat: this is of course context-specific. I don't at all mean to suggest (and nor do the authors), for instance, that these results should have Californians glad that we're done with the rain and back to sunny weather. As much as I enjoy sunrise runs (n = 1) and sitting outside reading papers, I'd be happy with a little more of what El Nino's got to offer the West Coast.

Weekend Op-Ed: Delhi driving restrictions actually work [so far]!

January 25, 2016

New semester, new blog-resolutions. We're back with a WWP...except that the analysis I'm talking about here hasn't actually made its way into a working paper yet. That said, the work is interesting and cool, and extremely policy-relevant, so it's worth taking a minute to discuss, I think.

For those of you not up on your India news, Delhi's air pollution is horrendous. Air pollution data suggest that Delhi's PM2.5 and PM10 concentrations are the worst in the world - the city has even less breathable air than notoriously dirty Beijing. Having spent some time in Delhi last January, I can add some of my own anecdata (my new favorite Fowlie-ism) as well: after three days of staying and moving around in the city, I was hacking up a lung trying to walk up three flights of stairs to our airbnb. I'm certainly not the fittest environmental economist around, but a few steps don't usually give me trouble.

So I was glad to hear that Delhi has recently been undertaking some efforts to improve its air quality. I was less glad to hear the method for doing so: between January 1 and January 15, Delhi implemented a pilot driving restriction. Cars with license plates ending in odd numbers would be allowed to drive on odd-numbered dates only, while cars with license plates ending in even numbers would be allowed to drive on even-numbered dates only. This sounds good - cutting the number of cars on the road by about half should have a drastic effect on air quality, right? The problem is that Mexico City has had a similar rule in place for years - Hoy No Circula - and rockstar professor Lucas Davis took a look at its effects in a 2008 paper, published in the Journal of Political Economy. Unfortunately (I thought) for the Indian regulation, Lucas finds that license-plate-based restrictions lead to no detectable effect on air quality across a range of pollutants.

Here's Lucas' graphical evidence for Nitrogen Dioxide. If the policy had worked, we would've expected a discontinuous jump downwards at the gray vertical line. He shows similar figures for CO, NOx, Ozone, and SO2.

Lucas provides an interesting possible explanation for the lack of change: he has suggestive evidence that drivers responded to the regulation by buying additional vehicles - in the Delhi case, if I have a license plate ending in 1, but really value driving on even-numbered days, I might go out and get a car with a plate ending in 2 instead. In light of this evidence, I was less-than-optimistic about the Delhi case.

So what actually happened in Delhi? New evidence from Michael Greenstone, Santosh Harish, Anant Sudarshan, and Rohini Pande suggests that the Delhi driving restriction pilot did have a meaningful effect on pollution levels - on the order of 10-13 percent! (A more detailed overview of what they did is available here). These authors use a difference-in-differences design, in which they compare Delhi to similar cities before and after the policy went into effect, doing something like this:

Effect = (Delhi - Others)_Post - (Delhi - Others)_Pre
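As a quick illustration of that arithmetic, here is a toy calculation. The pollution readings below are invented for the example (they are not from the Greenstone et al. analysis), and `did_effect` is just a hypothetical helper name.

```python
def avg(xs):
    return sum(xs) / len(xs)


def did_effect(treated_pre, treated_post, control_pre, control_post):
    """(Treated - Control) after the policy minus (Treated - Control) before it."""
    post_gap = avg(treated_post) - avg(control_post)
    pre_gap = avg(treated_pre) - avg(control_pre)
    return post_gap - pre_gap


if __name__ == "__main__":
    # Made-up PM2.5 readings, for illustration only.
    delhi_pre, delhi_post = [300, 320], [280, 290]
    others_pre, others_post = [200, 210], [215, 225]
    print(did_effect(delhi_pre, delhi_post, others_pre, others_post))  # prints -40.0
```

In this made-up example, pollution in the comparison cities rose over the period, so the simple before-after change in Delhi (-25) understates the estimated effect (-40): the comparison group is what nets out the common shock.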

Under the identifying assumption that Delhi and the chosen comparison cities were on parallel emissions trajectories before the program went into effect, this estimation strategy is nice because it removes common shocks in air pollution. The money figure from this analysis shows the dip in pollution in Delhi starkly:

It looks like, in its brief pilot, Delhi was successful at reducing pollution using this policy. So why is the result so different than in Mexico City? Obviously, India and Mexico are very different contexts. It also seems like the channel Lucas highlighted, about vehicle owners purchasing more cars, is something that people would only do after being convinced that the policy would be permanent - so there might be additional adjustment that occurs that isn't picked up in a pilot like this. (Would you go out and buy a new car in response to someone telling you that over the next two weeks they're trying out a driving restriction? I don't think I have that kind of disposable income...) Also, the control group obviously matters a lot. I'd like to see (and expect to see, if this gets turned into an actual working paper) a further analysis of what's going on in the comparison cities over the same time period. The pollutants being measured are also different - though I doubt that this actually affects much, given how highly correlated PM is with the pollutants measured in Lucas' paper.

In general, I'm encouraged to see both that Delhi is taking active steps to attempt to reduce air pollution, and that these steps are being evaluated in credible ways. As the authors point out in their Op-Ed, and as I've tried to highlight above, we should be cautious about extrapolating the successes of this pilot to the long run - a congestion or emissions pricing scheme might be a more effective long-run approach to tackling air pollution.

I'd also like to briefly highlight the importance of making air pollution data available for these kinds of analysis. There's a cool new initiative online called OpenAQ that's trying to download administrative data from pollution monitors and make this information publicly available - and they're not the only ones. Berkeley Earth is also providing some amazing data on Chinese air quality - and rumor has it they'll be adding more locations soon. Understanding the implications of air quality on health, productivity, and welfare is increasingly important as developing country cities grow and house millions in dirty environments - the more data that's out there to aid in this effort, the better.