
What to do about overly open data?

Last week's Ashley Madison hack pretty much induces slimy feelings all around. I generally think that cheating is bad, hacking into people's personal information online is bad, posting said personal information is probably bad, etc., etc., etc. But, because we're good little grad students, the leaked files also sparked some debate among my colleagues and me over whether it is reasonable to use data obtained in this manner in research.

On the one hand, hacked data (or data obtained in a manner that violates a website's terms of use) can often provide insights into areas that we have no data on otherwise. The Ashley Madison data is an obvious example (it turns out that getting credible survey data on infidelity is tricky!), but so are the WikiLeaks cables and data scraped from the infamous Silk Road.

Obligatory cool-looking-but-meaningless Matrix-esque hacker graphic.

On the other hand, these datasets potentially violate standard IRB protocol about informed consent, could butt up against copyright law, and just generally feel icky to use. The use of hacked data in research could also deepen privacy advocates' concerns about academics, which has the potential for large negative externalities.

So I'm curious: has anybody tried to use this type of data in research before? Was there pushback from colleagues, IRB, or journal editors? Whether you've tried or not, is this an inherently bad thing? Should we leave leaked/hacked data alone, or try to learn something from it? Leave a comment below with your thoughts!