AOL releases search data on 500k users… and then tries to take it back

AOL Research Labs released for download the data of all searches performed by 500,000 (some say it’s more like 650k+) of their users over a three-month period.

That’s just amazing – 20 million search queries, each with:

  • UserID – numeric unique id rather than actual name
  • Query
  • QueryTime
  • ClickedRank – i.e. whether the first, second, third link was clicked on, etc.
  • DestinationDomainUrl
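For anyone wondering what a single record might look like in code, here is a minimal sketch. The five field names come from the list above; everything else is my assumption, not something confirmed by AOL: that the file is tab-separated, that the fields appear in this order, that the timestamp looks like `2006-03-01 12:34:56`, and that the last two fields are empty when nothing was clicked. The sample line is made up and is not taken from the real data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SearchRecord:
    user_id: int                           # numeric id, not a real name
    query: str
    query_time: datetime
    clicked_rank: Optional[int]            # None when no result was clicked
    destination_domain_url: Optional[str]  # None when no result was clicked


def parse_record(line: str) -> SearchRecord:
    """Parse one line, ASSUMING tab-separated fields in the order listed above."""
    fields = line.rstrip("\n").split("\t")
    # Pad short lines so a query without a click still yields five fields.
    fields += [""] * (5 - len(fields))
    user_id, query, query_time, rank, url = fields[:5]
    return SearchRecord(
        user_id=int(user_id),
        query=query,
        # The timestamp format is an assumption.
        query_time=datetime.strptime(query_time, "%Y-%m-%d %H:%M:%S"),
        clicked_rank=int(rank) if rank else None,
        destination_domain_url=url or None,
    )


# Invented sample line, for illustration only.
sample = "12345\texample query\t2006-03-01 12:34:56\t2\thttp://www.example.com"
record = parse_record(sample)
print(record.user_id, record.clicked_rank, record.destination_domain_url)
```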

The things you can do with this data make your mind boggle – both within the shady SEO business and in the slightly more positive mash-up scene.

To the SEO industry, this data is priceless – assuming it’s a statistically unbiased sample, all sorts of analysis can be made on popular search terms, temporal analysis of when searches are made, etc.
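As a rough idea of what “temporal analysis” could mean in practice, here is a small sketch that counts queries per hour of day. It assumes the QueryTime values are strings like `2006-03-01 09:15:00` (my guess at the format), and the timestamps below are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def queries_per_hour(timestamps):
    """Count how many queries fall into each hour of the day (0-23)."""
    hours = Counter()
    for ts in timestamps:
        # The timestamp format is an assumption, not confirmed by the source.
        hours[datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour] += 1
    return hours


# Invented timestamps, not taken from the real data.
sample_times = [
    "2006-03-01 09:15:00",
    "2006-03-01 09:47:12",
    "2006-03-02 21:03:44",
]

for hour, count in sorted(queries_per_hour(sample_times).items()):
    print(f"{hour:02d}:00  {count}")
```

The same counting approach would work for day of week or for individual search terms, simply by changing what gets used as the key.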

But then…
…Someone (presumably higher up in AOL?) realised this was shocking. Simply shocking. Think about the amount of personal data there must be in this sample.

Quite quickly the original link on the AOL Research Blog went dead.

People’s names, identifiable locations, sensitive subjects, etc – it all gets concerning when you can cluster search terms to a specific user via the unique user id. It’s even alleged on the blogosphere that there are signs of someone plotting murder from their search terms.

No shit, it is literally jaw-dropping how stupid AOL has been. Don’t forget this is the very data that Google refused to hand over to the US DoJ – citing reasons of privacy. AOL decides to give it away to anyone and everyone!

I wonder whether there will be a class-action lawsuit out of this? TechCrunch has more on this.

But then… #2

So I’m no stranger to controversy when it comes to data.

(BTW I still maintain that there was nothing wrong with me releasing the urls to BBC weather data that were, and still are, referenced in their public javascript. I decided to take them down because it was causing hassle for my friends who worked on the project.)

So here’s another controversial thing from dotBen (eat this Ian Betteridge): here are some mirrors where you can download this AOL search data (it’s 439MB remember):

(The MD5 Hash of the original file is 31cd27ce12c3a3f2df62a38050ce4c0a – so make sure you have an untampered copy.)
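If you want to check a copy against that hash, something like this sketch would do it. The function names and the chunk size are my own choices; only the hash value comes from above. Bear in mind that MD5 is fine for spotting a corrupted or casually altered download, but it is no longer considered collision-resistant, so it is not a strong guarantee against a determined tamperer.

```python
import hashlib

# The hash quoted above for the original file.
EXPECTED_MD5 = "31cd27ce12c3a3f2df62a38050ce4c0a"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so a 439MB file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_untampered(path, expected=EXPECTED_MD5):
    """True if the file's MD5 matches the expected hash (case-insensitive)."""
    return md5_of_file(path).lower() == expected.lower()
```

Call `is_untampered("path/to/your-download")` with wherever you saved the file; the path here is just a placeholder.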

It’s up to you whether you feel you want to download the data. And I guess from a karma perspective I hope you do something positive with it.

My reason for pointing to these urls is simply because the link was on the AOL site for some hours and many people will have downloaded it. It’s now in the public domain – rightly or wrongly – and so some positives might as well be derived from it.

So go build something good with it!

19 thoughts on “AOL releases search data on 500k users… and then tries to take it back”

  1. The data sets are extremely valuable to researchers. They should be released to researchers with the understanding that they are used to improve the quality of search and not to violate the privacy of searchers.

  2. Oh, and the fact that you think there’s some kind of comparison between the cases of AOL and the BBC shows that you still just don’t get why you acted like a shit.

  3. you still just don’t get why you acted like a shit

    Hey, as my old dear used to say: it takes one to know one!

  4. BTW: if you think I’m such a dick, why do you continue to read my shit?

    You’re like one of those peeps who’ll watch a tv programme, get offended by it and then watch the rest of the programme to continue to get offended by it.

    Switch off, go make a cup of tea and do something else. Like remove the cock from your arse.

  5. I imagine he takes issue with the actions you are performing and thinks he should try and dissuade you from doing them. I’m not sure I’d use the same language but I can’t say I’m particularly impressed by you … sorry, dotBen’s … move here either.

    It seems to me that while the information is now in the public eye, it was against the wishes of the people the information is about. Their privacy was violated by the release – a violation that will be made worse by drawing the attention of the world to that data. I think it’s probably wrong to link to the data in the way you have here. Surely what you’re doing is hitting AOL over the head with the victims of their action – whether or not the victims suffer more in the process.

    I know you have this idea that things are either public or they’re not and that it’s all black and white and simple and easy. All I can ask is that you think of all the things you could possibly have said ever in any public place or in front of any people that you’d be uncomfortable communicated to the world. You’d think that kind of action was unfair and damaging and would think people who expressly decided to communicate it were being unpleasant and behaving badly. Surely this is the same thing?

    And with the BBC move – so a piece of information has been mistakenly presented to the world in contravention of the organisation’s agreement with a third party. The sensible thing to do there if you cared about the organisation would be to alert them to their error, not advertise the mistake to the whole world and advise them to make use of it.

    I know that you and I don’t get on, but can I say quite seriously that I don’t go looking for reasons to fight – I genuinely disagree with your tactics and I think some of the things you’re doing are clumsy and damaging and more about getting public attention and being looked at and showing-off than they are about doing good things in the world. I get the feeling that you haven’t really thought about the people who get harmed by the things you write. Actively taking issue with what people are doing is one thing – you should have a platform to protest or argue – but not noticing that your actions have consequences seems to me to be quite another. I very much believe that there are times to fight and stand up and be counted and to rail against the system or whatever. But I think you have to choose those times pretty carefully if you’re not just going to alienate everyone in the process.

  6. Thanks Tom,

    I’m not doing any of this ‘for public attention’ but it’s hard to prove you’re not doing something, etc.

    I choose not to play the straight and narrow. I choose not to follow the corporate line, especially as it’s not bestowed upon me anymore.

    I do what feels right, what’s within my conscience, rather than play some watered-down position to placate others or take a stance that is best for my personal or professional position.

    I wear my heart on my sleeve and I do have a subversive streak running through me. Sure that’s going to cause fire because it’s not what people expect – they expect me to be all nice-as-pie.

    On the AOL issue – I think it’s awful what AOL have done. And yes, by posting the links I’m perpetuating it. But I also think that some good can be realised out of this, and so seeing as the data is now out there anyway (and people will download it if they really want to) I’d rather support positive uses of it than help the rest of ‘em try to sweep this under the carpet.

    That’s probably a great example of where two people agreeing “that it’s awful that AOL has abused so many people’s trust with their privacy” can come to two different conclusions. The obvious, mainstream, corporate line: ‘let’s suppress it’ and the alternative, think outside the box: ‘the genie is out of the bottle, there’s no going back, so let’s see what we can do with it now it’s there’.

    I think both have positives and negatives but I’ve chosen to go with the second option for reasons already explained.

    We’ll see what happens. Oh, and I hope you will come to the BarCampLondon, Tom

  7. Ben I totally agree with you on the

    “The obvious, mainstream, corporate line: ‘let’s suppress it’ and the alternative, think outside the box: ‘the genie is out of the bottle, there’s no going back, so let’s see what we can do with it now it’s there’.”

    turn it around and turn a cockup into something good.
    Just what, I don’t know, but I’ve been making graphs in asp all day so can only see pie charts

  8. So if you’re not doing it for the attention, Ben, what ARE you doing this for?

    On the AOL front, in what way are you “supporting positive uses of the data”? Reposting isn’t supporting positive use: doing something with the data that was positive yourself as an example for others would be “supporting positive uses”. Simply reposting is exactly that: reposting. It’s at best ethically neutral, and at worst irresponsible.

    It’s also gesture politics, because as you and I both know, that data is only ever going to be a Google search away now. So again: what’s the point of what you’re doing? What’s the overall effect of it, other than to bolster your ego?

  9. I think your metaphors are interesting, and your statement that you can come to two different conclusions which both have ‘positives’ and ‘negatives’ is rather overwhelmed by your description of one position as ‘obvious and corporate’ versus the other which is ‘thinking out of the box’.

    I don’t think it’s a corporate mindset that says ‘don’t hurt real people in the pursuit of an agenda’. I think it’s very much a non-corporate position. Berate AOL by all means, rip them to shreds if you want. But I’m afraid I think you’re dressing up irresponsibility as radicalism, which is sort of the definition of not being a grown-up.

  10. AOL could be in some trouble, now that at least one user has been matched to their unique ID in the search data. I’d imagine a few more are likely to be known in the next few weeks too.

  11. I just wanted to say thanks for the links. I was slow on the draw and did not have time to pick up a data file until this evening.

    This is exactly the wake-up call that the new high speed internet generation needs. Their information is very personal and will paint a portrait of each user that is more distorted than if they appeared on a reality TV show. The info is out there, it is never well protected and the government is getting reams of it without suing Google for it. Remember the stink with Google was that they refused, and that is the only reason we heard about it!

    I think the release is part of a long term strategy and not an F’up. Someone was counting on both the exposure for AOL and that the data would instantly be linked all over the internet and therefore in Ben’s world part of the ‘public domain’.

    Whatever the motivation for this, I bet the fallout will be that Americans lose more personal freedom over it and the internet shifts farther to the corporate than the community side of the online world. Not insinuating the corporate side is the dark side – just the “not in the people’s best interest” side.

    If this were boxing: we are all looking at the jab and not seeing the knock out coming. :)

    BTW: Sad to see so much sexual tension between two male personas (Ben and Ian) on a blog, you two should get together, do the nasty and get over it. Or are you already together and just playing it this way on the blog for attention? Hmm, that’s genius, really.

  12. Ben!, thanks so much! for the links to the downloads :D. I hope that everyone connected to the internet gets the chance to see this data, the stuff on it is amazing and will open your eyes WIDE. All you a$$holes trying to sweep it under the rug are the unpatriotic scums. …(Ian Betteridge/probably a paid AOL employee trying to help out AOL). This Data needs to be flagged as high as CNN or FOXNEWS or the front page of a sssssh!t load of newspapers, plus the entertainment value of this data will keep many computer nerds like me not so bored. Thanks for the quick links Ben, you’re the best and F@ck all the haters, let them hate as long as they fear.

  13. This is funny as hell. Blog wars.

    You made a mistake, no big deal, time to both get over it. (Great reading though–not too often blog authors go public with their battles, with words volleying back and forth across websites. Almost like watching tennis.)

    I wouldn’t directly link to the AOL files because I don’t want to be responsible for the loss of privacy of their mostly unaware customers. That would lie on my conscience most uncomfortably.

    On the other hand there’s hundreds of mirrors and a site that lets you view the files so the damage is done, and one more well-known blogger here and there hotlinking to it isn’t going to make much difference. It’s not your fault this info is available–it’s AOL’s, for being stupid enough to release it in the first place.

    I think it’s a horrible invasion of privacy for their members. I’ve read comments from blogs all over the Web saying, “Privacy is non-existent on the Web, anyway–get used to it.” I don’t think that needs to be true.

    If search engines and ISPs didn’t keep thorough records of people’s online activities our expectation of privacy and anonymity would be much greater. No one wants this sort of information out there for everyone to look at, except people who plan on using the data to benefit their SEO practices, and people who are just plain nosy or determined to find out about one or more people using the info AOL so carelessly provided.

    There is no “good” use for it, Ben–sorry. But since it will always be easily available online from now through Eternity, you are, as I said, not adding much more to the damage AOL has done.

    I’ve noticed talk about this fiasco wandering, thanks in large part to sites like http://plentyoffish.com, into how to figure out if some guy means to kill his wife, how much porn people really look at, etc., with the discussion on his site becoming quite involved and lively.

    I wish he’d never mentioned the “guy killing his wife” part because he distracts from the real issue: what those data sets contain is deeply personal to each member of AOL. It’s none of our business what members searched for, who they are, what they’ve got planned, what they buy, where they live, what their credit card # or SS # is.

    The fact that AOL released that information to the public doesn’t and never will *make* it our business.

    What AOL did is morally wrong, a complete invasion of member privacy and absolutely reprehensible from every angle, including the fact that as plentyoffish says, correctly, “Google’s gonna get mega-spammed.” With a goldmine of search engine information like that, what else would SEO masters do with it?

    It will be interesting, to say the least, to see how this affects more highly prized Adwords–will some of them be considered overvalued after studying this data? Will some cheaper Adwords actually prove to be a better buy? If so, they’ll be snatched up while Google’s price is still low, and the higher-priced ones that don’t do as well based on these data sets will slowly fall by the wayside, so that Google will be forced to lower prices for them.

    I’m sure one thing Google never expected any company or person to do (much less AOL, who they have an ad and SEO partnership with) is give away their trade secrets to the public for free. If I were them I’d be pissed.

    Some sympathy for the other side (members of AOL who are mostly clueless that this has happened and what it means) is needed now.

    It could be *your* searches analyzed by the general public, the United States government and webmasters eager to market to you. *Your* name, credit card and SS # and search details now known to the world. Your most private online moments digested and interpreted by everybody. Doesn’t that sound embarrassing? If those search records don’t include you, be grateful that you don’t use AOL, but think of *their* members–too ignorant to even know what’s going on–being researched, made fun of and snooped on as we speak.

    There’s a saying, “Treat your neighbor as you want to be treated,” and another, “The Web connects us all to each other–like a global community.” If everyone on the Web is part of a global community, then we’re neighbors who aren’t treating each other too well if we tolerate privacy invasions like this. Benefiting from them is another step down a slippery slope that will make the Web so fraught with privacy invasions that no one will think it’s worth bothering with anymore.
