
Goldilocks RCTs

What better way to ring in the new year than to announce that Louis Preonas, Matt Woerman, and I have posted a new working paper, "Panel Data and Experimental Design"? The online appendix is here (warning - it's math-heavy!), and we've got a software package, called pcpanel, available for Stata via SSC, with the R version to follow.*

TL;DR: Existing methods for performing power calculations with panel data only allow for very limited types of serial correlation, and will result in improperly powered experiments in real-world settings. We've got new methods (and software) for choosing sample sizes in panel data settings that properly account for arbitrary within-unit serial correlation, and yield properly powered experiments in simulated and real data. 

The basic idea here is that researchers should aim to have appropriately-sized ("Goldilocks") experiments: too many participants, and your study is more expensive than it should be; too few, and you won't be able to statistically distinguish a true treatment effect from zero effect. It turns out that doing this right gets complicated in panel data settings, where you observe the same individual multiple times over the study. Intuitively, applied econometricians know that we have to cluster our standard errors to handle arbitrary within-unit correlation over time in panel data settings.** This will (in general) make our standard errors larger, and so we need to account for this ex ante, generally by increasing our sample sizes, when we design experiments. The existing methods for choosing sample sizes in panel data experiments only allow for very limited types of serial correlation, and require strong assumptions that are unlikely to be satisfied in most panel data settings. In this paper, we develop new methods for power calculations that accommodate the panel data settings that researchers typically encounter. In particular, we allow for arbitrary within-unit serial correlation, allowing researchers to design appropriately powered (read: correctly sized) experiments, even when using data with complex correlation structures. 

I prefer pretty pictures to words, so let's illustrate that. The existing methods for power calculations in panel data only allow for serial correlation that can be fully described with fixed effects - that is, once you put a unit fixed effect into your model, your errors are no longer serially correlated, like this:

But we often think that real panel data exhibits more complex types of serial correlation - things like this:

Okay, that's a pretty stylized example - but we usually think of panel data as being correlated over time: electricity consumption data, for instance, generally follows some kind of sinusoidal pattern; maize prices at a given market in East Africa typically exhibit correlation over time that can't just be described with a level shift; etc, etc, etc. And of course, in the real world, data are never nice enough that including a unit fixed effect can completely account for the correlation structure.
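To make the difference concrete, here's a minimal sketch (in Python, with made-up parameters) of the two error structures: in the first, a unit fixed effect soaks up all of the within-unit correlation; in the second, an AR(1) component survives even after the fixed effect is removed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 100, 20
rho, sigma = 0.7, 1.0  # made-up AR(1) parameter and noise s.d.

def fe_only_errors():
    """Unit fixed effect + i.i.d. noise: demeaning removes all serial correlation."""
    alpha = rng.normal(size=(n_units, 1))  # unit fixed effects
    eps = rng.normal(scale=sigma, size=(n_units, n_periods))
    return alpha + eps

def ar1_errors():
    """Unit fixed effect + AR(1) noise: serial correlation survives demeaning."""
    alpha = rng.normal(size=(n_units, 1))
    eps = np.zeros((n_units, n_periods))
    eps[:, 0] = rng.normal(scale=sigma, size=n_units)
    for t in range(1, n_periods):
        eps[:, t] = rho * eps[:, t - 1] + rng.normal(scale=sigma, size=n_units)
    return alpha + eps

def lag1_corr_within(y):
    """Lag-1 correlation of the within-unit demeaned errors, pooled across units."""
    y = y - y.mean(axis=1, keepdims=True)
    return np.corrcoef(y[:, :-1].ravel(), y[:, 1:].ravel())[0, 1]

print(lag1_corr_within(fe_only_errors()))  # close to zero
print(lag1_corr_within(ar1_errors()))      # clearly positive
```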

So what happens if I use the existing methods when I've got this type of data structure? I can get the answer wildly wrong. In the figure below, we've generated some difference-in-difference type data (with a treatment group that sees treatment turn on halfway through the dataset, and a control group that never experiences treatment) with a simple AR(1) process, calculated what the existing methods imply the minimum detectable effect (MDE) of the experiment should be, and simulated 10,000 "experiments" using this treatment effect size. To do this, we implement a simple difference-in-difference regression model. Because of the way we've designed this setup, if the assumptions of the model are correct, every line on the left panel (which shows realized power, or the fraction of these experiments where we reject the null of no treatment) should be at 0.8. Every line on the right panel should be at 0.05 - this shows the realized false rejection rate, or what happens when we apply a treatment effect size of 0. 
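If you want to see the mechanics for yourself, here's a stylized version of that exercise in Python - not the simulations from the paper, just a hedged sketch with an assumed sample size, an assumed AR(1) parameter, and an assumed candidate effect size. It runs a two-way fixed effects difference-in-difference regression with unit-clustered standard errors on each simulated dataset and reports the fraction of rejections.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_units, n_pre, n_post = 50, 5, 5   # assumed design
rho, sigma = 0.5, 1.0               # assumed AR(1) parameter and noise s.d.
n_periods = n_pre + n_post

def one_experiment(effect):
    # unit fixed effects plus AR(1) errors
    alpha = rng.normal(size=(n_units, 1))
    eps = np.zeros((n_units, n_periods))
    eps[:, 0] = rng.normal(scale=sigma, size=n_units)
    for t in range(1, n_periods):
        eps[:, t] = rho * eps[:, t - 1] + rng.normal(scale=sigma, size=n_units)
    treat = (np.arange(n_units) < n_units // 2).astype(int)   # half of units treated
    post = (np.arange(n_periods) >= n_pre).astype(int)        # treatment turns on halfway through
    y = alpha + eps + effect * np.outer(treat, post)
    df = pd.DataFrame({
        "y": y.ravel(),
        "unit": np.repeat(np.arange(n_units), n_periods),
        "period": np.tile(np.arange(n_periods), n_units),
        "treat_post": np.outer(treat, post).ravel(),
    })
    fit = smf.ols("y ~ treat_post + C(unit) + C(period)", data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["unit"]}
    )
    return fit.pvalues["treat_post"] < 0.05

def rejection_rate(effect, n_sims=500):
    return np.mean([one_experiment(effect) for _ in range(n_sims)])

print(rejection_rate(effect=0.25))  # realized power at an assumed effect size
print(rejection_rate(effect=0.0))   # false rejection rate; ~0.05 with clustered SEs
```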

Adapted from Figure 2 of Burlig, Preonas, Woerman (2017). The y-axis of the left panel shows the realized power, or fraction of the simulated "experiments" described above that reject the (false) null hypothesis; the y-axis of the right panel shows the realized false rejection rate, or fraction of simulated "experiments" that reject the true null under a zero treatment effect. The x-axis in both plots is the number of pre- and post-treatment periods in the "experiment." The colors show increasing levels of AR(1) correlation. If the Frison and Pocock model were performing properly, we would expect all of the lines in the left panel to be at 0.80, and all of the lines in the right panel to be at 0.05. Because we're clustering our standard errors in this setup, the right panel is getting things right - but the left panel is wildly off, because the FP model doesn't account for serial correlation. 

We're clustering our standard errors, so as expected, our false rejection rate is always right at 0.05. But we're overpowered in short panels, and wildly underpowered in longer ones. The easiest way to think about statistical power: if your experiment is powered at 80% (generally the accepted standard), you'll fail to reject the null - even when there is a true treatment effect - 20% of the time. So if you end up powered at, say, 20%, as happens with some of these simulations, you're going to fail to reject the (false) null 80% of the time. Yikes! What's happening here is essentially that by not taking serial correlation into account, in long panels, we think we can detect a smaller effect than we actually can. Because we're clustering our standard errors, though, our false rejection rate is disciplined - so we get stars on our estimates way less often than we were expecting.***

By contrast, when we apply our "serial-correlation-robust" method, which takes the serial correlation into account ex ante, this happens:

Adapted from Figure 2 of Burlig, Preonas, Woerman (2017). Same y- and x-axes as above, but now we're able to design appropriately-powered experiments for all levels of AR(1) correlation and all panel lengths - and we're still clustering our standard errors, so the false rejection rates are still right. Whoo!

That is, we get right on 80% power and 5% false rejection rates, regardless of the panel length and strength of the AR(1) parameter. This is the central result of the paper. Slightly more formally, our method starts with the existing power calculation formula, and extends it by adding three terms that we show are sufficient to characterize the full covariance structure of the data (see Equation (8) in the paper for more details).
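Without reproducing Equation (8) here, it's worth noting that every power calculation of this flavor bottoms out in the same textbook minimum-detectable-effect expression; what differs across methods is the variance of the treatment effect estimator that gets plugged into it. A hedged sketch, using the usual normal approximation:

```python
from scipy.stats import norm

def mde(se_tau_hat, alpha=0.05, power=0.80):
    """Textbook MDE: (z_{1 - alpha/2} + z_{power}) * SE(tau_hat).
    The FP formula and the serial-correlation-robust formula differ only in how
    SE(tau_hat) is built - the latter adds the covariance terms needed to handle
    arbitrary within-unit serial correlation (see Equation (8) in the paper)."""
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se_tau_hat
```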

If you're an economist (and, given that you've read this far, you probably are), you've got a healthy (?) level of skepticism. To demonstrate that we haven't just cooked this up by simulating our own data, we do the same thought experiment, but this time, using data from an actual RCT that took place in China (thanks, QJE open data policy!), where we don't know the underlying correlation structure. To be clear, what we're doing here is taking the pre-experimental period from the actual experimental data, and calculating the minimum detectable effect size for this data using the FP model, and again using our model.**** This is just like what we did above, except this time, we don't know the correlation structure. We non-parametrically estimate the parameters that both models need in order to calculate the minimum detectable effect. The idea here is to put ourselves in the shoes of real researchers, who have some pre-existing data, but don't actually know the underlying process that generated their data. So what happens?
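For a flavor of what "non-parametrically estimate the parameters" means in practice, here is a rough sketch with assumed variable names (this is not the pcpanel implementation): residualize the pre-existing outcome data on unit and time effects, then average the within-unit residual variances and covariances, which are the kinds of inputs the formulas need.

```python
import numpy as np
import pandas as pd

def avg_variance_and_covariance(df, unit="unit", period="period", y="y"):
    """Two-way demean the outcome, then average (i) the residual variance and
    (ii) the within-unit covariance across all pairs of periods. Sketch only -
    see the paper / pcpanel for the actual estimators and correction factors."""
    d = df.copy()
    d["r"] = (d[y]
              - d.groupby(unit)[y].transform("mean")
              - d.groupby(period)[y].transform("mean")
              + d[y].mean())
    wide = d.pivot(index=unit, columns=period, values="r")
    periods = list(wide.columns)
    covs = [(wide[s] * wide[t]).mean()
            for i, s in enumerate(periods) for t in periods[i + 1:]]
    return wide.var(axis=0).mean(), float(np.mean(covs))
```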

The dot-dashed line shows the results when we use the existing methods; the dashed line shows the results when we guess that the correlation structure is AR(1); and the solid navy line shows the results using our method. 

Adapted from Figure 3 of Burlig, Preonas, Woerman (2017). The axes are the same as above. The dot-dashed line shows the realized power, over varying panel lengths, of simulated experiments using the Bloom et al (2015) data in conjunction with the Frison and Pocock model. The dashed line instead assumes that the correlation structure in the data is AR(1), and calibrates our model under this assumption. Neither of these two methods performs particularly well - with 10 pre and 10 post periods, the FP model yields experiments that are powered to ~45 percent. While the AR(1) model performs better, it is still far off from the desired 80% power. In contrast, the solid navy line shows our serial-correlation-robust approach, demonstrating that even when we don't know the true correlation structure in the data, our model performs well, and delivers the expected 80% power across the full range of panel lengths. 

Again, only our method achieves the desired 80% power across all panel lengths. While an AR(1) assumption gets closer than the existing method, it's still pretty far off, highlighting the importance of thinking through the whole covariance structure.

In the remainder of the paper, we A) show that a similar result holds using high-frequency electricity consumption data from the US; B) show that collapsing your data to two periods won't solve all of your problems; C) think through what happens when using ANCOVA (common in economics RCTs) -- there are some efficiency gains, but much smaller than you'd think if you ignored the serial correlation; and D) couch these power calculations in a budget-constrained setup to think about the trade-offs between more time periods and more units. *wipes brow*

A few last practical considerations that are worth addressing here:

  • All of the main results in the paper (and, indeed, in existing work on power calculations) are designed for the case where you know the true parameters governing the data generating process. In the appendix, we prove that (with small correction factors) you can also use estimates of these parameters to get to the right answer.
  • People are often worried enough about estimating the one parameter that the old formulas needed, let alone the four that our formula requires. While we don't have a perfect answer for this (more data is always better), simply ignoring the three additional parameters implicitly assumes they're zero, which is likely wrong. The paper and the appendix provide some thoughts on dealing with insufficient data.
  • Estimating these parameters can be complicated...so we've provided software that does it for you!
  • We'd also like to put in a plug for doing power calculations by simulation when you've got representative data - this makes it much easier to vary your model, assumptions on standard errors, etc, etc, etc.  

Phew - managed to get through an entire blog post about econometrics without any equations! You're welcome. Overall, we're excited to have this paper out in the world - and looking forward to seeing an increasing number of (well-powered) panel RCTs start to hit the econ literature!

* We've debugged the software quite a bit in house, but there are likely still bugs. Let us know if you find something that isn't working!

** Yes, even in experiments. Though Chris Blattman is (of course) right that you don't need to cluster your standard errors in a unit-level randomization in the cross-section, this is no longer true in the panel. See the appendix for proofs.

*** What's going on with the short panels? Short panels will be overpowered in the AR(1) setup, because a difference-in-differences design is identified off of the difference between treatment and control in the post-period vs. the same difference in the pre-period. In short panels, more serial correlation means that it's actually easier to identify the "jump" at the point of treatment. In longer panels, this is swamped by the fact that each observation is now "worth less". See Equation (9) of the paper for more details.

**** We're not actually saying anything about what the authors should have done - we have no idea what they actually did! They find statistically significant results at the end of the day, suggesting that they did something right with respect to power calculations, but we remain agnostic about this. 

Full disclosure: funding for this research was provided by the Berkeley Initiative for Transparency in the Social Sciences, a program of the Center for Effective Global Action (CEGA), with support from the Laura and John Arnold Foundation.

Out of the Darkness and Into the Light? Development Effects of Rural Electrification in India

At long last, Louis and I have a working paper version of our paper, "Out of the Darkness and Into the Light? Development Effects of Rural Electrification in India" (Appendix here - warning: it's a pretty big file). The paper has, as you might expect, a lot of gory details in it, and the appendix even more so, so I'll try to provide a less technical overview here. 

The basic motivation behind the paper is the following: there are still over a billion people around the world without access to electricity, most of whom are poor, rural, and live in South Asia or Sub-Saharan Africa. Developing country governments and NGOs are pouring billions of dollars into rural electrification programs with the explicit goal of bringing people out of poverty - one of the new UN Sustainable Development Goals is to "Ensure Access to Affordable, Reliable, Sustainable and Modern Energy for All."

In light of the big pushes towards universal energy access (pun only half intended), it's important that we understand what these investment dollars are going towards - even more so when we're talking about developing country dollars, which have opportunity costs like schools and health clinics and roads. It turns out, though, that we know surprisingly little about the actual effects of electrification on economic development. There is a strong positive correlation between GDP per capita and electricity consumption, with rich countries like the USA and Japan consuming much more energy per capita than poorer places like Nigeria and Bangladesh:

Data source: World Bank

As we all know, though, correlation does not equal causation (thanks, XKCD!). Figuring out the causality here is definitely not trivial: at the country level, there are lots of things that drive income differences between Japan and Bangladesh that have nothing to do with electricity. It turns out that these omitted variable bias problems plague sub-national-level studies as well. If we just regressed incomes (or better yet, welfare!) on electricity infrastructure at the village level in a developing country (say, India), we'd likely end up with a number that was way too high. Why? Energy infrastructure projects are large and expensive, and (correctly) not randomly placed: governments think long and hard about where to put the electricity grid. In practice, this usually means that places that are already doing well economically, or places that are expected to do well in the future, are the first to get access to electricity. In more technical terms, that regression I described above is subject to a ton of omitted variable bias. 

In our new paper, Louis and I try to shed light on the following question: What does electrification do to rural economies? To do this, we take advantage of a natural experiment built into India's national rural electrification program, RGGVY, which would eventually expand electricity access in over 400,000 villages across 27 states. In order to keep costs down, in the first phase of the program, only villages with neighborhoods of 300+ people were eligible for electrification. This means that we can compare villages with 299-person neighborhoods to those with 301-person neighborhoods to look at the effect of electrification. The idea is that, in the absence of RGGVY, these villages would be virtually indistinguishable - and indeed, in the paper, we show that prior to the program, villages just above and just below the 300-person cutoff look the same. Having a cutoff of this kind built into the program is really nice from an evaluation standpoint, because it allows us to mimic a randomized experiment couched in the natural (large-scale) rollout of RGGVY. 

Having identified this cutoff, we get to bring a bunch of cool data to bear on the problem. The first thing we need to do is to demonstrate that RGGVY actually led to increases in electricity access for eligible villages. We're concerned that a simple 1/0 electrification indicator doesn't actually capture power availability and use - it would mask variation in power quality or in the number of households that have access to the grid, for example. Instead, we turn to remote sensing. We got access to boundary shapefiles (outlines) of [almost] every village in India, and superimposed them on top of NOAA's DMSP-OLS nighttime lights dataset - this is a satellite-based measure of nighttime brightness. We combine these two datasets and "cookie-cutter" out the nighttime lights values for each village, which allows us to create statistics like the average or maximum brightness in each location.
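For the GIS-curious, the "cookie-cutter" step is just zonal statistics. Here's a minimal sketch (file names are made up; the real inputs are the village boundary shapefiles and the DMSP-OLS rasters):

```python
import geopandas as gpd
from rasterstats import zonal_stats

villages = gpd.read_file("village_boundaries.shp")  # village polygons (hypothetical file name)
stats = zonal_stats("village_boundaries.shp",        # same polygons
                    "dmsp_ols_2011.tif",              # nighttime lights raster (hypothetical file name)
                    stats=["mean", "max"])
villages["lights_mean"] = [s["mean"] for s in stats]
villages["lights_max"] = [s["max"] for s in stats]
```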

Yum, village cookies!

We pair nighttime brightness data from 2011, 6 years after the announcement of the program, with population information from the 2001 Census (the official population of record for RGGVY), and look at the effects of eligibility for RGGVY on night lights. Another advantage of the cutoff-based ("regression discontinuity", for those in the know) analysis? It lends itself naturally to blogging, because you should be able to see the difference between ineligible villages and eligible villages very clearly in a figure. Et voila:

Effects of eligibility for RGGVY on nighttime brightness. Each dot contains ~1,800 villages.

We find that RGGVY eligibility (aka crossing the 300-person threshold) led to a sharp increase in nighttime brightness, as visible from space - remember that we're looking at villages of around 300 people, so we're pretty impressed that we can detect this. It helps that we're studying India, where there are upwards of 30,000 villages in our main estimation sample. It's a little hard to interpret this effect directly, since it's in units of brightness points, but previous papers that have gone out and groundtruthed the relationship between nighttime lights and electrification suggest that this is consistent with about a 50-percent increase in the household electrification rate in our sample. We do a bunch of work in the paper and in the appendix to show that these changes are actually attributable to RGGVY - if you're curious, check it out!
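For readers who want to see the bones of the estimator, here's a stylized regression discontinuity sketch on fake data (variable names, bandwidth, and coefficients are all made up; the real analysis uses the 2001 Census population as the running variable and has a lot more going on):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)

# Fake stand-in data: one row per village, 2001 population and 2011 brightness.
df = pd.DataFrame({"population_2001": rng.integers(100, 500, size=5000)})
df["running"] = df["population_2001"] - 300
df["eligible"] = (df["running"] >= 0).astype(int)
df["brightness_2011"] = (5 + 0.01 * df["running"]
                         + 0.6 * df["eligible"]          # built-in "jump" at the cutoff
                         + rng.normal(size=len(df)))

bw = 150                                                  # assumed bandwidth
sample = df[df["running"].abs() <= bw]

# Local linear RD: separate slopes on each side of the cutoff; the coefficient
# on `eligible` is the estimated jump in brightness at the 300-person threshold.
rd = smf.ols("brightness_2011 ~ eligible * running", data=sample).fit(cov_type="HC1")
print(rd.params["eligible"], rd.bse["eligible"])
```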

 

Okay, great: it looks like RGGVY brought electricity to Indian villages - but what about the effects of electricity access on things we care about, like the workforce, asset ownership, housing stock characteristics, village-wide outcomes, and education? We gathered a bunch of village-level data from the Indian Census of 2011 and the District Information System for Education, and look, among other things, at the effects of electrification on outcomes like men and women working in agriculture vs. the more formal sector; ownership of assets like TVs and motorcycles; whether households are classified as "dilapidated" or have mud floors, and other housing stock outcomes; the presence of mobile phone coverage, agricultural credit societies, and other village-wide services; and the number of children enrolled in primary and upper-primary school. Here's a snapshot of (some of) our results:

For all of the outcomes I've shown here, we see very little evidence of sharp discontinuities at the 300-person threshold. We do see clear evidence of men shifting from agricultural to non-agricultural employment (see p. 20 in the paper), but little else. The graphical evidence is borne out by the regression estimates as well: in almost no cases can we reject the null hypothesis of zero effect, but more importantly, we can reject even modest effects in nearly all outcomes. (The exception to this is education: we again see no visual evidence of effects on education, but our sample size is much smaller for these outcomes, since not all villages actually contain schools, so we have less precision with which to rule out effects; the schools effects are also sensitive to specification choices, bandwidths, etc in a way that the other outcomes are not.) To reiterate: in the medium term, we find that eligibility for RGGVY caused a substantial increase in electricity use, but can reject small effects on labor markets, asset ownership, housing characteristics, and village-wide outcomes; we do not find robust evidence that RGGVY led to changes in education. We were pretty surprised by these results, so we threw a bunch of checks at them (see the Appendix) - but they seem to hold up. (Turns out that our results are also consistent with new evidence from our Berkeley colleagues Ken Lee, Ted Miguel, and Catherine Wolfram in Kenya - see the abstract here).

A couple of these tests in particular are worth highlighting:

You might be concerned that we're not finding much because the program wasn't implemented well, or because we're lumping a bunch of villages where electrification did a lot in with villages where electrification didn't do anything, so things are averaging out to zero. When we cherry-pick the states that saw the largest increase in nighttime brightness as a result of the program, however, we don't see evidence of this. Among this selected sample, the nighttime lights effect approximately doubles - but the effects on the other outcomes stay the same. 

You might also be concerned that villages with around 300 people are unlikely to see big effects - they might be too poor, too credit-constrained, etc, etc, etc. A couple of responses to this: first, if we care about electrification from a poverty-reduction standpoint, then we should be worried about people being too poor to take advantage of electrification. But that's more speculative than data-driven, so we do a more formal test to think about effects of electrification for the rest of the villages in India. Rather than relying on our nice cutoff, we instead do a difference-in-differences (DD) analysis: we compare villages electrified in the first wave of the program (like a "treatment" group), before and after electrification, to villages electrified in the second wave of the program (like a "control" group). When we do this, and calculate different effects by population groups, here's what we find:

Comparison of our cutoff-based (RD) estimates (navy dots) with difference-in-differences (DD) estimates, by village population.

There are a couple important takeaways from this figure: first, our cutoff-based estimates (navy dots) line up remarkably well with the DD point estimates for the appropriately sized villages; and second, the brightness effect is increasing in population, but the other outcomes (proportion of men working in agriculture shown here) are not, suggesting that our original estimates might actually generalize to the rest of the population. 

So what does it all mean? We present well-identified quasi-experimental evidence from the world's largest unelectrified population. Taken together, our results suggest that rural electrification may not be as beneficial as previously thought. Does our paper say that we shouldn't be implementing these kinds of electrification programs, or that electricity isn't making people better off? No - we explicitly don't make any statements about overall welfare in the paper, because we don't have the data to support these types of claims. In fact, we visited some villages in Karnataka in December, which made it pretty clear that people like having access to power (so do I!). But we can say that it's difficult to find evidence in the data that electrification is dramatically transforming rural India after 5 years or so. We think this is a case where highlighting a null result is really important - take that, file drawer! At the end of the day, in the medium term, rural electrification just doesn't appear to be a silver bullet for development. In typical Ivory Tower fashion, we think more research is needed to understand where and when power can transform economies - maybe we should be targeting electricity infrastructure upgrades to urban areas, for instance. We've got a couple projects in the works to look at some of these questions - so check back with us in a few years, and hopefully we'll have some more answers!

 

Officially SSMART!

I've been a bad blogger over the past month or so, something I'm hoping to remedy over the coming weeks. (Somewhere out there, a behavioral economist is grumbling about me being present-biased and naive about it. Whatever, grumbly behavioral economist.) I'm writing this from SFO, about to head off to Bangalore (via Seattle and Paris, where I'll meet my coauthor/adventure buddy Louis), thanks to USAID and Berkeley's Development Impact Lab. We're hoping to study the effects of the smartgrid in urban India, as well as to learn more about what energy consumption looks like in Bangalore in general. There is a small but quickly-growing body of evidence on energy use in developing countries (see Gertler, Shelef, Wolfram, and Fuchs -- forthcoming AER, and one of my favorite of Catherine's papers! -- and Jack and Smith -- AER P&P on pre-paid metering in South Africa -- for a couple of recent examples). Still, there's a lot that we don't know, and, of course, a lot more that we don't know that we don't know. Thanks a lot, Rumsfeld.

Feeling SSMARTer already!

In other exciting news, the Berkeley Initiative for Transparency in the Social Sciences (BITSS) has released its SSMART grant awardees - and my new project (with Matt and Louis, and overseen by Catherine) on improving power calculations and making sure researchers get their standard errors right has been funded! Very exciting. Check out the official announcement here, and our page on the Open Science Framework here. Since this is a grant explicitly about transparency, we'll be making our results public as we go through the process. Our money is officially for this coming summer, so look for an update / more details in a few months.

Where we are currently: there are theoretical reasons to handle standard errors differently than we currently do in a lot of empirical applications, and there are also theoretical reasons that existing formula-based power calculations might be ending up underpowered. In progress: how badly wrong are we when we use current methods? 

My flight is boarding, so I'll leave you with that lovely teaser!

Conference recap: IGC's Energy and Growth

Last week, I was lucky to have been invited to the IGC’s first annual (we hope!) Energy and Growth conference in London. Organized by heavy-hitter energy economists Michael Greenstone and Nick Ryan, this cool conference brought together economists who work on energy and development with policymakers from a range of international organizations and governments. The IGC seems to have facilitated many useful collaborations between researchers and the "real world," and this was an interesting venue to highlight some of that work (and some other work as well). A few highlights:

  • Not to plug my advisor again, but I can’t help it: Catherine’s work with Ted Miguel and Ken Lee (fellow ARE PhD student) on an RCT they’re conducting in Kenya might have been the hit of the conference. First, they noticed that a large number of households were “under-grid”, or had electricity infrastructure nearby, but not in, their homes. This is an example of the last mile (last centimeter?) problem. They’ve randomly assigned subsidies for grid connections to households in rural villages, and have varied the subsidy level across villages, letting them estimate a demand curve for electrification among these households. They also have cost information from the Kenyan Rural Electrification Authority, so that they can estimate a supply curve as well. The surprising punchline? The supply curve sits above the demand curve basically everywhere: households aren’t interested in taking up connections at the cost at which the Kenyan government can provide them. Also some cool stuff about credit constraints. I’d be interested in learning more about what’s driving this effect: is it concerns about reliability? Lack of information about the power of electricity (eheheh)? Something else? They’re also conducting a follow-up survey which will let them understand the effects of electrification in these regions, which will also hopefully shed some light on what’s going on here. Fascinating stuff.
  • Kelsey Jack has some really cool new work with Grant Smith (PhD student at UCT – one of my favorite places) on the consequences of pre-paid electricity metering in Cape Town, also using an RCT. When these meters are installed, some households do reduce their energy consumption – but many also dramatically increase their electricity-related transactions costs, by going to the shop to fill their meters multiple times a week. It turns out that pre-paid meters don’t substantially reduce consumption of the households that are being subsidized, and do reduce consumption among the subsidizers, so they don’t seem to be an effective tactic for additional revenue recovery, either. Again, lots more to be learned about pre-paid metering. I’m hopeful that they’ll be able to expand their study area as well.

  • Koichiro Ito has new work looking at willingness to pay for air quality in China, exploiting the Huai River discontinuity. While I’ve always been highly skeptical of a certain previous paper using this result (sorry, MG), this one is definitely more convincing, in particular because it measures PM10, which is a fairly local pollutant, because it includes longitude controls, and, most importantly, because the effect is only visible in the winter. Combining this with scanner data on retail outlet purchases of air purifiers, Koichiro is able to give us what’s probably a lower bound on the Chinese WTP for air quality (and in turn, quite a low VSL – though not so low as Kremer et al found using clean water in Kenya). I want to see more about getting from this number to an implied VSL, and the discount rates or lack of information that would be required to rationalize a VSL closer to the one we use in the US, but that should be doable in a fairly simple back-of-the-envelope calculation. Neat!

  • Rohini Pande made some insightful comments as part of a panel discussion about the importance of government oversight and regulatory strength in keeping local and global pollutants in check in the context of growing energy demand in the developing world. It sounds like she’s also got some interesting new work about the placement of energy infrastructure: should it be near people (a la China) or near coal (a la India)? These have vastly different distributional consequences, which we know from some of her earlier work (“Dams” with Esther Duflo) can be very important. Looking forward to seeing what she does. One of my favorite researchers.

Lots of other interesting work was also presented (check the program for some details); I’m inspired to do more work in this area. There are obviously a number of smart people working in this space – but luckily it’s a vast and important research space, so hopefully there’s room for another grad student or two.

 

Stay tuned to this blog for the next few weeks for a couple of exciting announcements about upcoming work and new results! 

No, really, climate change will be bad!

I would be remiss to purport to blog about the economics of energy, the environment, and the developing world if I failed to highlight a new (important) study that came out in Nature this week.

The all-star team of Marshall Burke, Sol Hsiang (who has a fancy new website), and Ted Miguel is at it again, with a paper on the effects of temperature on GDP around the world. Before they even get to the empirics, they provide some really nice insight into why, when there are sharp non-linearities in micro temperature response functions, we shouldn't expect to see the same kinks in macro response functions. The idea is basically the following: a micro response function tells us the marginal effect of having an additional (hour, day) in a given temperature range. Imagine, as with US maize and lots of other things, that the response is increasing in temperature up to a point and then drops sharply beyond that point. The macro response function aggregates these days or hours up to a longer time period (a year, say), meaning that the overall effect of annual temperature on annual output will be a weighted average of the two slopes of the micro response, weighted by the number of days in each temperature range. Was that confusing? Check out Figure 1, panels d, e, and f (the math to derive this is all in the supplement to the paper as well):

This key insight is really important in allowing us to understand how we should expect micro responses to differ from macro ones. Cool. 
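In symbols (my notation, not theirs): if daily output is a kinked function of temperature, with slope β_lo below the kink and slope β_hi above it, then annual output is the sum of the daily responses, and a small increase in average temperature moves every day along its local slope:

```latex
Y_{\text{annual}} = \sum_{d=1}^{365} f(T_d)
\quad\Longrightarrow\quad
\frac{\partial Y_{\text{annual}}}{\partial \bar{T}}
\approx n_{\text{lo}}\,\beta_{\text{lo}} + n_{\text{hi}}\,\beta_{\text{hi}},
```

where n_lo and n_hi count the days on each side of the kink. That's a weighted average of the micro slopes rather than a kink, which is exactly why the macro response comes out smooth.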

The authors then go on to empirically estimate the global macro temperature response function, settling on (after many robustness checks) a quadratic in temperature. What they come up with is a strong inverted-U-shaped relationship, with an optimum around 55F (that might seem low, but remember that we're talking about annual average temperature here). This suggests that some (colder) countries might benefit from global warming, and hotter countries have a lot to lose. They tackle several points that are often brought up in this literature, and end up unable to reject that the rich and poor country responses are the same (though the confidence intervals are quite large as well. Minor gripe: 90% confidence intervals are shown in the paper. Yes, I know that 95% is arbitrary too, but it is the empirical economics standard...); they show that agriculture takes a big hit in both poor and rich countries, and that non-ag GDP seems to take slightly less of one in richer countries, but the relationship between temperature and non-ag GDP is still downward sloping; and finally, that the response functions in 1960-1989 look almost identical to the 1990-2010 response functions, suggesting that there hasn't been a ton of adaptation during the time period of their data.
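In generic notation (again mine, not the paper's exact specification): with a quadratic in annual average temperature plus country and year fixed effects, the estimated optimum is just the turning point of the parabola:

```latex
y_{it} = \beta_1 T_{it} + \beta_2 T_{it}^2 + \mu_i + \theta_t + \varepsilon_{it},
\qquad
T^{*} = -\frac{\beta_1}{2\beta_2} \quad (\beta_2 < 0),
```

which is where that roughly-55F number comes from.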

Using these estimates, they go on to make some beautiful figures showing climate damage projections out to 2100 (IMHO, as much as I know that they like Figure 3, I think it's aesthetically pleasing but not the most legible). They find that, using fairly reasonable assumptions about growth and emissions paths, global GDP is projected to be approximately 25% lower in 2100 with climate change than without -- a much larger effect than all three current IAMs used in US policy (DICE, FUND, and PAGE) would suggest. There are wide confidence intervals around this estimate to be sure - but it's also worth noting that the majority of the uncertainty here comes from Europe and North America. These are large economies, and so have a large effect on GDP per capita overall, but are also close to the estimated global optimum, meaning that if the optimum is off by a little bit, the effects for these countries could even flip in sign.

I think this paper is a really important contribution to the climate-economics space. The effects are huge, and the paper (and supplemental information, and stuff that got left out of the supplemental information but was in an earlier non-circulating working paper version) is very thorough.

A few small comments: it is worth noting that there's a ton of statistical uncertainty floating around here. Panel C of the first extended data figure shows the estimated marginal effects with lags included - and in every estimate that includes lags, the confidence interval includes zero (and I think these are still the 90% CIs?). The confidence intervals on Figure 5a, the main estimate, also sit squarely on top of zero. And, as with every projection exercise, we should take this one with a giant brick of salt. These guys do a good job, but remember that they're also using short-run fluctuations in temperatures to trace out this response function. This is nice because, conditional on the right fixed effects, we generally think that it's as good as randomly assigned, but it does make plugging the estimates into a projection a little tricky to interpret. It's standard in this literature to do this kind of thing - and the fact that they find no evidence of adaptation in the 50+ year period they're looking at helps shore up the argument for doing so - but it's worth keeping in mind that that's what's being done.

It's also really important to think carefully (in all of these papers - not just BHM) about what's actually being used for identification. We know from Wolfram Schlenker and Craig McIntosh that using higher-order polynomials in fixed effects models re-introduces cross-sectional variation (and any omitted variable bias that comes with it!). I think in an earlier version of the paper, I saw a binned model floating around, which removes this concern and had similar point estimates, but this general point is something that's under-appreciated, I think. (And, even with binned models, we need to be really careful when presenting something as the aggregate temperature response function if there are only a few countries that ever end up in the really hot bins. That's a soapbox for another day.)

Also, as I mentioned above a little bit, while it's true that these guys aren't able to statistically reject that the poor and rich country responses are the same, that doesn't mean that the true responses aren't different - it could be that there's not enough statistical power to address these questions in the data. That's going to be especially true at the colder end of the distribution - there are so few poor countries there that it's really hard to say anything concrete.

All that said, I think this is a super interesting and important paper, and I'm glad that it's out in time for Paris. I've already learned a lot from these guys, and I continue to do so - they're some of the most careful, thorough, and productive researchers out there working on really policy-relevant topics. Plus, they make beautiful figures. This is a paper that's really worth diving into - I highly recommend actually reading the paper, the extended data, and the supplemental information (which is something I won't say very often)!

One last thing before I close: Marshall, Sol, and Ted have put up a really good companion site to their paper that makes the results accessible and digestible. Plus, they've put up replication code - very important when you're working on such hot (ha) issues as climate and GDP. Take a look!

Edited to add: Marshall just posted a response to some frequent criticism on his blog. Worth a read.