Ben Metcalfe Blog

AOL releases search data on 500k users… and then tries to take it back

AOL Research Labs released for download the data of all searches performed by 500,000 (some say it’s more like 650k+) of their users over a three-month period.

That’s just amazing – 20 million search queries, each with:

  • UserID – numeric unique id rather than actual name
  • Query
  • QueryTime
  • ClickedRank – i.e. whether the first, second, third link was clicked on, etc
  • DestinationDomainUrl
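To make that record layout concrete, here is a rough sketch of how one line of the data might be read into a typed record. I'm assuming a tab-separated text file with the five fields in the order listed above, and that the last two fields are blank when nothing was clicked. The names SearchRecord and read_records are mine, not anything AOL published.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class SearchRecord:
    user_id: int                           # numeric id, not a real name
    query: str
    query_time: str                        # kept as raw text here
    clicked_rank: Optional[int]            # None when no result was clicked
    destination_domain_url: Optional[str]  # None when no result was clicked


def read_records(handle: TextIO) -> Iterator[SearchRecord]:
    """Yield one SearchRecord per data row (assumed tab-separated)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip a header row or anything too short to be a real record.
        if len(row) < 3 or not row[0].isdigit():
            continue
        rank = row[3] if len(row) > 3 else ""
        url = row[4] if len(row) > 4 else ""
        yield SearchRecord(
            user_id=int(row[0]),
            query=row[1],
            query_time=row[2],
            clicked_rank=int(rank) if rank.isdigit() else None,
            destination_domain_url=url or None,
        )


# Made-up sample rows, purely for illustration.
sample = (
    "UserID\tQuery\tQueryTime\tClickedRank\tDestinationDomainUrl\n"
    "1\texample query\t2006-03-01 12:00:00\t2\thttp://www.example.com\n"
    "1\tanother query\t2006-03-01 12:05:00\t\t\n"
)
records = list(read_records(io.StringIO(sample)))
```

The sample rows are invented; the real file may use different column names or a different timestamp format, so check a few lines by eye before trusting the parser.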

The things you can do with this data make your mind boggle – both within the shady SEO business and in the slightly more positive mash-up scene.

To the SEO industry, this data is priceless – assuming it’s a statistically unbiased sample, all sorts of analysis can be done on popular search terms, on the times at which searches are made, and so on.
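As a rough idea of what that kind of analysis looks like, here is a minimal sketch that counts the most popular search terms and the number of searches per hour of the day. It assumes you already have (query, query_time) pairs and that timestamps look like "2006-03-01 09:15:00"; that format, and the function names, are my assumptions.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List, Tuple

Row = Tuple[str, str]  # (query, query_time)


def top_queries(rows: Iterable[Row], n: int = 10) -> List[Tuple[str, int]]:
    """Most common queries, ignoring case and surrounding whitespace."""
    counts = Counter(query.strip().lower() for query, _ in rows)
    return counts.most_common(n)


def searches_per_hour(rows: Iterable[Row]) -> Counter:
    """Count searches by hour of day (0-23), assuming the format below."""
    return Counter(
        datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour for _, ts in rows
    )


# Made-up rows, purely for illustration.
rows = [
    ("Weather", "2006-03-01 09:15:00"),
    ("weather ", "2006-03-01 09:45:00"),
    ("news", "2006-03-02 21:00:00"),
]
popular = top_queries(rows, n=1)
by_hour = searches_per_hour(rows)
```

Lower-casing and trimming the query is a design choice: it merges trivially different spellings, at the cost of hiding any difference the raw text might carry.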

But then…
…Someone (presumably higher up in AOL?) realised this was shocking. Simply shocking. Think about the amount of personal data there must be in this sample.

Quite quickly the original link on the AOL Research Blog went dead.

People’s names, identifiable locations, sensitive subjects, etc – it all gets concerning when you can cluster search terms to a specific user via the unique user id. It’s even alleged on the blogosphere that there are signs of someone plotting murder from their search terms.

No shit, it is literally jaw-dropping how stupid AOL has been. Don’t forget this is the very data that Google refused to hand over to the US DoJ – citing reasons of privacy. AOL decides to give it away to anyone and everyone!

I wonder whether there will be a class-action lawsuit out of this? TechCrunch has more on this.

But then… #2

So I’m no stranger to controversy when it comes to data.

(BTW I still maintain that there was nothing wrong with me releasing the urls to BBC weather data that were, and still are, referenced in their public javascript. I decided to take them down because it was causing hassle for my friends who worked on the project.)

So here’s another controversial thing from dotBen (eat this Ian Betteridge): here are some mirrors where you can download this AOL search data (it’s 439MB remember):

(The MD5 Hash of the original file is 31cd27ce12c3a3f2df62a38050ce4c0a – so make sure you have an untampered copy.)
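If you want to check a copy against that hash, something like the following should do it. This is a minimal sketch using Python's standard hashlib; the function names are mine, and the file name in the usage note is a placeholder for whatever your download is called.

```python
import hashlib

# MD5 of the original file, as quoted above.
EXPECTED_MD5 = "31cd27ce12c3a3f2df62a38050ce4c0a"


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, read in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

Usage would be along the lines of matches_expected("your-download-file"), which returns True only for a byte-identical copy. Reading in chunks matters here because the file is 439MB.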

It’s up to you whether you feel you want to download the data. And I guess from a karma perspective I hope you do something positive with it.

My reason for pointing to these urls is simply that the link was on the AOL site for some hours and many people will have downloaded it. It’s now in the public domain – rightly or wrongly – and so some positives might as well be derived from it.

So go build something good with it!