New resource: Intro to econometrics in R

I've added a new resource to this site - all of my section materials from ARE 212, Max Auffhammer's first-year PhD econometrics course, which build off of notes written by Dan Hammer, Patrick Baylis, and Kenny Bell.

These section notes simultaneously provide a gentle introduction to econometrics and to R. I covered very basic coding, including matrix operations and functions; partitioned regression and goodness of fit; hypothesis testing; ggplot2; generalized least squares and maximum likelihood; large sample properties of OLS; non-standard standard errors (twice!); instrumental variables; power calculations; spatial data; and replication, with a bonus intro to Monte Carlo simulation. Quite a full semester! 

It's so nice when everything is well behaved! OLS is even consistent!

I hope these will be a helpful resource to others. A warning: I made many of the materials from scratch and/or expanded existing notes, so there are almost certainly errors. Please let me know if you find any. A more important warning: These notes are rife with bad jokes. Prepare yourselves.

Finally, I owe a debt of gratitude to all of the ARE-212-ers of 2016, who braved 8 AM section (not my choice) and have already dramatically improved these materials. Thanks for being a super fun class!

An end-of-semester gift from one of my students (did I mention I had a great class?) and excellent in-joke for those in the know.

Standard errors in Stata: a (somewhat) cautionary tale

Last week, a colleague and I were having a conversation about standard errors. He had a new discovery for me - "Did you know that clustered standard errors and robust standard errors are the same thing with panel data?"

I argued that this couldn't be right - but he said that he'd run -xtreg- in Stata with robust standard errors and with clustered standard errors and gotten the same result - and then sent me the relevant citations in the Stata help documentation. I'm highly skeptical - especially when it comes to standard errors - so I decided to dig into this a little further. 

Turns out Andrew was wrong after all - but through very little fault of his own. Stata pulled the wool over his eyes a little bit here. It turns out that in fixed-effects panel settings, "robust" - AKA heteroskedasticity-robust - standard errors aren't consistent. Oops. This important insight comes from James Stock and Mark Watson's 2008 Econometrica paper. So using -xtreg, fe robust- is bad news. In light of this result, StataCorp made an executive decision: when you specify -xtreg, fe robust-, Stata actually calculates standard errors as though you had written -xtreg, vce(cluster panelvar)-!


Standard errors: giving sandwiches a bad name since 1967.

On the one hand, it's probably a good idea not to allow users to compute robust standard errors in panel settings anymore. On the other hand, computing something other than what users think is being computed, without an explicit warning that this is happening, is less good. 

To be fair, Stata does tell you that "(Std. Err. adjusted for N clusters in panelvar)", but this is easy to miss - there's no "Warning - Clustered standard errors computed in place of robust standard errors" label, or anything like that. The help documentation mentions (on p. 25) that specifying -vce(robust)- is equivalent to specifying -vce(cluster panelvar)-, but what's actually going on is pretty hard to discern, I think. Especially because there's a semantics issue here: a cluster-robust standard error and a heteroskedasticity-robust standard error are two different things. In the econometrics classes I've taken, "robust" is used to refer to heteroskedasticity- (but not cluster-) robust errors. In fact, StataCorp itself uses "robust" this way in one very thorough and useful FAQ answer posted online - and clearly states in another that the Huber and White papers don't deal with clustering.

All that to say: when you use canned routines, it's very important to know what exactly they're doing! I have a growing appreciation for Max's requirement that his econometrics students build their own functions up from first principles. This is obviously impractical in the long run, but helps to instill a healthy mistrust of others' code. So: caveat econometricus - let the econometrician beware!
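
In that spirit, here's a minimal from-scratch sketch of the two sandwich estimators at issue, written in Python/numpy (the function and the simulated panel are my own illustration, not Stata's code or anything from the course materials). On fake panel data with within-unit correlated errors, the heteroskedasticity-robust and cluster-robust recipes give visibly different answers:

```python
import numpy as np

def ols_se(X, y, clusters=None):
    """OLS point estimates with sandwich standard errors: White (HC0)
    heteroskedasticity-robust by default, cluster-robust if `clusters` given."""
    n, k = X.shape
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    e = y - X @ beta
    if clusters is None:
        # White "meat": sum_i e_i^2 * x_i x_i'
        meat = (X * e[:, None] ** 2).T @ X
    else:
        # Cluster "meat": sum_g (X_g' e_g)(X_g' e_g)'
        meat = np.zeros((k, k))
        for g in np.unique(clusters):
            s = X[clusters == g].T @ e[clusters == g]
            meat += np.outer(s, s)
    V = bread @ meat @ bread
    return beta, np.sqrt(np.diag(V))

# Simulated panel: 50 units x 10 periods, errors correlated within unit
rng = np.random.default_rng(212)
G, T = 50, 10
ids = np.repeat(np.arange(G), T)
x = rng.normal(size=G * T)
u = np.repeat(rng.normal(size=G), T) + rng.normal(size=G * T)  # unit shock + noise
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(G * T), x])

b, se_robust = ols_se(X, y)
_, se_cluster = ols_se(X, y, clusters=ids)
```

Because the errors are correlated within unit, the cluster-robust standard errors should come out noticeably larger than the heteroskedasticity-robust ones here - which is exactly why quietly swapping one for the other matters.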

WWP: An oldie but a goodie

I always appreciate papers that teach me something about methodology as well as about their particular research question. A lot of the papers I really like for this reason have already been published (David McKenzie at the World Bank has a bunch of papers that fall into this category - and excellent blog posts as well).

This week's WWP isn't particularly new, but is definitely both interesting and useful methodologically (and it is still a working paper!). Many readers of this blog (ha, as if this blog has many readers) have probably read this paper before, or at least skimmed it, or at least seen the talk. But if you haven't read it carefully, I urge you to go back and give it another look. Yes, I'm talking about Sol Hsiang and Amir Jina's hurricanes paper (the actual title is: The Causal Effect of Environmental Catastrophe on Long-Run Economic Growth: Evidence from 6,700 Cyclones). Aside from being interesting and cool (and having scarily large estimates), it also provides really clear discussions of how to do a bunch of things that applied microeconomists might want to know how to do.

It describes in a good bit of detail how to map environmental events to economic observations (don't miss the page-long footnote 13...). It also discusses how to estimate a distributed lag model, and then explains how to recover the cumulative effect from this model (something that I never saw in an econometrics class). It provides really clear visualizations of the main results (we should expect no less from Sol at this point). A lot of the methodological meat is also contained in the battery of robustness checks, including a randomization inference procedure, a variety of cuts of the data, more discussion of distributed lag and spatial lag models, modeling potential adaptation, etc etc etc. Finally, they do an interesting exercise where they use their model to simulate what growth would have looked like in the absence of cyclones, and (of course) do a climate projection - but also add a NPV calculation on top of it.
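
To make the distributed-lag point concrete, here's a stylized sketch in Python/numpy, with made-up data - this is my illustration, not the paper's actual specification. Regress the outcome on the current and lagged shock, then recover the cumulative effect by summing the lag coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 5000
x = rng.normal(size=T)  # the "shock" series (e.g., cyclone exposure)

# True dynamic effect: -0.5 on impact, -0.3 and -0.2 in the next two
# periods, so the true cumulative effect of a one-unit shock is -1.0
true_lags = np.array([-0.5, -0.3, -0.2])
y = np.zeros(T)
for j, b in enumerate(true_lags):
    y[j:] += b * x[: T - j]
y += rng.normal(scale=0.5, size=T)

# Distributed-lag regression: y_t on a constant, x_t, x_{t-1}, x_{t-2}
nlags = len(true_lags)
X = np.column_stack(
    [np.ones(T - nlags + 1)]
    + [x[nlags - 1 - j : T - j] for j in range(nlags)]
)
beta, *_ = np.linalg.lstsq(X, y[nlags - 1 :], rcond=None)

# The cumulative (long-run) effect is just the sum of the lag coefficients
cumulative = beta[1:].sum()
```

In a real application you'd also want a standard error for the cumulative effect (a linear combination of coefficients), but the basic recovery step really is this simple.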

All in all, I think I'll use this paper as a great reference for how to implement different techniques for a while - and I look forward to reading the eventual published version. I'll let the authors describe their results themselves. Their abstract:

Does the environment have a causal effect on economic development? Using meteorological data, we reconstruct every country’s exposure to the universe of tropical cyclones during 1950-2008. We exploit random within-country year-to-year variation in cyclone strikes to identify the causal effect of environmental disasters on long-run growth. We compare each country’s growth rate to itself in the years immediately before and after exposure, accounting for the distribution of cyclones in preceding years. The data reject hypotheses that disasters stimulate growth or that short-run losses disappear following migrations or transfers of wealth. Instead, we find robust evidence that national incomes decline, relative to their pre-disaster trend, and do not recover within twenty years. Both rich and poor countries exhibit this response, with losses magnified in countries with less historical cyclone experience. Income losses arise from a small but persistent suppression of annual growth rates spread across the fifteen years following disaster, generating large and significant cumulative effects: a 90th percentile event reduces per capita incomes by 7.4% two decades later, effectively undoing 3.7 years of average development. The gradual nature of these losses render them inconspicuous to a casual observer, however simulations indicate that they have dramatic influence over the long-run development of countries that are endowed with regular or continuous exposure to disaster. Linking these results to projections of future cyclone activity, we estimate that under conservative discounting assumptions the present discounted cost of “business as usual” climate change is roughly $9.7 trillion larger than previously thought.

Edited to add: This turned out to be especially timely due to the record number of hurricanes in the Pacific at the moment. (Luckily, none of them are threatening landfall as of August 31st.)


Back on track

I was going to make this post a Wednesday Working Paper, but because of my fantastic Seattle vacation (and less fantastic return to 2 vet trips in as many days with my cat), I haven't actually read anything new. Sorry I'm not sorry. To get the ball rolling again, I want to highlight two great websites that were brought to my attention this week (both via Twitter - have I mentioned my ongoing love affair with Twitter yet?).

Great data visualization or the greatest data visualization? Proof that both analysis and presentation of (social science) data are hard.

First, FiveThirtyEight has a really nice piece on the state of science. Like the Economist article I blogged about a little while ago (first link to my own blog - oooh, meta), this post has an interactive infographic where you can play with p-hacking, this time using actual data to show statistically significant effects of Republicans/Democrats on the economy. The article does a nice job explaining potentially complex issues, like p-hacking, differences in methodological approaches across disciplines, and the degree to which science is self-correcting, in a digestible way. As a (social) scientist myself, I appreciate the article's headline and subtitle: "Science Isn't Broken - It's just a hell of a lot harder than we give it credit for." Truth.

One important thing is missing from this article, though: the author spends essentially zero time talking about causality. The p-hacking exercise (and, as far as I can tell, the fascinating soccer player example... which includes an author from BITSS, Garret Christensen) deals only with correlations. Figuring out whether something is causal or merely correlational might be the biggest part of my job as a young economist - and actually nailing down causality is really hard to do. So consider that yet another (extremely large) item on the why-(social)-science-is-hard list. We would also benefit from more media highlighting the differences between causal and correlational work - both are very important but should have different policy implications, yet they're often treated as one and the same in newspaper or online articles about research.

Kudos overall, though, to FiveThirtyEight for a detailed but readable piece on the challenges of doing science currently (and how far we've come at doing better science - we've got a ways to go, but I'm optimistic that a great deal of progress has already been made).
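
To see how big a deal this is, here's a quick Monte Carlo sketch (my own toy example, not FiveThirtyEight's) of the simplest form of p-hacking: under a null of no effect, a researcher who tests 20 outcomes and reports the best one "finds" a significant result most of the time.

```python
import math
import numpy as np

def p_value(x, y):
    """Two-sided p-value for the slope in a bivariate regression of y on x,
    using a normal approximation (fine for n = 100)."""
    n = len(x)
    xd = x - x.mean()
    b = xd @ y / (xd @ xd)
    resid = y - y.mean() - b * xd
    se = math.sqrt(resid @ resid / (n - 2) / (xd @ xd))
    z = abs(b / se)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

rng = np.random.default_rng(538)
n, n_outcomes, n_sims = 100, 20, 500
false_pos = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    # 20 candidate outcomes, none of which is actually related to x
    ys = rng.normal(size=(n_outcomes, n))
    if min(p_value(x, y) for y in ys) < 0.05:
        false_pos += 1  # "found" a significant effect under the null

rate = false_pos / n_sims  # roughly 1 - 0.95**20, far above the nominal 5%
```

With 20 independent shots at a 5% test, the chance of at least one false positive is about 64% - and that's before any of the subtler specification-searching the article describes.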

On a lighter note (and not to be outdone), here's what might be my new favorite time-waster website: bad data presentation from WTF Visualization. Seems like the creators of these awful graphics need to read some Tufte.