Month: January 2013

Ruby RSS curl trick / partial download beats Conditional GET any day


I’ve been playing with RSS ingestion and I gotta say there are some issues.  Life is made MUCH easier by gems like Feedzirra, but all is not perfect in RSS world.

Here are some challenges that I came across:

  1. RSS feeds can be HUGE. There is no way to get around the fact that if you have time sensitive RSS needs, simply pulling the whole feed every minute doesn’t cut it. Even if you make sensible choices based on the feed timestamp and item publish dates after you get it, still.. you don’t want to grab 2MB / minute if you don’t have to.
  2. 304s suck.  The idea that you can do a conditional GET on a page, by sending back the Last-Modified or ETag value from the previous response (as If-Modified-Since or If-None-Match), and have the server say “304 – nothing new” just doesn’t work very well.  You are absolutely at the mercy of the server you are calling and that’s a problem.  Not sure if it’s just inherently janky, or if the issue is the result of actually hitting different servers behind a load balancer, I don’t know.. but it’s a bit hateful.
  3. My first thought after giving up on 304s was to use net/http to just grab the headers, and compare them with what I found the last time.. Makes sense, right?  You have Last-Modified, ETag and Content-Length to play with, but the sad truth is that these change for reasons you don’t think of.  RSS feeds get updated not just because something has popped in at the top, but also sometimes because it’s a feed covering X hours, and something has dropped off the bottom.  RSS providers also adopt caching strategies.. and their systems, dumb as they are, will recreate the same content after 90 seconds, “just because”, and from a header perspective, it’s “new”.  Still, this was working out for me pretty well.  I went from 4MB / MINUTE ingestion across various feeds down to a point where that 4MB was spread over about 5 minutes, when I actually got a header match that I trusted. Still.. UGH..  some of these feeds are stale for HOURS over the weekend, but the headers imply that they are working their brains out, making me grab a big ass RSS feed every 90 seconds or so just so I can throw it away.
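
A rough sketch of what items 2 and 3 look like with plain net/http. The method names (conditional_headers, header_fingerprint, head_fingerprint) are made up for illustration, and which headers to trust is an assumption; as described above, servers can change all three without the content changing.

```ruby
require 'net/http'
require 'uri'
require 'digest'

# Item 2: build the request headers for a conditional GET from values saved
# on the previous fetch. Either value may be nil if the server never sent it.
def conditional_headers(etag, last_modified)
  headers = {}
  headers['If-None-Match'] = etag if etag
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

# Item 3: reduce the headers we care about to one comparable string.
# `headers` is anything that answers [] with a header name: a
# Net::HTTPResponse, or a plain Hash with lowercase keys.
def header_fingerprint(headers)
  parts = %w[etag last-modified content-length].map { |name| headers[name].to_s }
  Digest::MD5.hexdigest(parts.join('|'))
end

# HEAD request only: fetches the headers, never the body.
def head_fingerprint(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  header_fingerprint(response)
end
```

If head_fingerprint returns the same value twice in a row you can skip the download, but a different value only means the headers moved, not that there is a new item.
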
So.. thinking outside the box.. What are we REALLY interested in?  What is the real goal?
  1. Find out if there is a new RSS entry
  2. Do so in a way that is time / resources / bandwidth / cpu efficient.
Well when you put it like that..

$ curl -r 0-2000

Say what? EURETHRA! BOOHYA.. etc.
No really.. I know curl has been round the block more than yo mama, but hey once in a while the wonderfully beautiful elegance of Rails can benefit from a bit of ghetto.
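So what curl does with -r is ask the server for just a byte range (here, roughly the first 2000 bytes of the page), then call it quits.  How nice is that??  Just curl enough to get the first item title, do something smart with it, and yer done.  The one catch: the server has to honor HTTP Range requests, and one that doesn’t will simply send the whole feed.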
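
For the curious, -r just makes curl send a Range: bytes=0-2000 request header, so the same request can be sketched in pure Ruby with net/http. This is an alternative to shelling out, shown only to make the mechanism visible; the method names and the default user agent string are made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Build a GET request asking for only bytes 0..last_byte of the resource.
def build_range_request(uri, last_byte, user_agent: 'feed-checker-sketch')
  request = Net::HTTP::Get.new(uri)
  request['Range'] = "bytes=0-#{last_byte}"
  request['User-Agent'] = user_agent
  # Ask for an uncompressed body so the partial bytes are readable text.
  request['Accept-Encoding'] = 'identity'
  request
end

# Fetch the start of the document. Returns the body on 206 (partial content)
# or 200 (server ignored the Range header and sent everything), else nil.
def fetch_head(url, last_byte)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: 10) do |http|
    http.request(build_range_request(uri, last_byte))
  end
  %w[200 206].include?(response.code) ? response.body : nil
end
```
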
We aren’t there yet though.  So I looked at about 12 ruby curl wrappers, and maybe I’m missing something, but I didn’t see support for the myriad of curl parameters, and I needed proxy, user-agent AND this -r wizardry.. so what the hell.. let’s just go ghetto.
So let’s be clear, we don’t really need to get the REAL title.. we just need to do something consistent where if we do it twice in a row and get a different result, then we know there is a new item.
There are many ways you can go with this, but here’s what I do centered around this method:

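require 'digest'
require 'shellwords'
def get_rss_MD5(url, length)
  # curl -r fetches only bytes 0..length; the URL is escaped so '&' and friends survive the shell
  data = `curl -r 0-#{length} --connect-timeout 10 --fail #{Shellwords.escape(url)}`
  if data.include?('<item>') && data.include?('</title>')
    return Digest::MD5.hexdigest(data.split('<item>')[1].split('</title>')[0])
  end
  nil # fetch failed, or the chunk was too short to reach the first item title
end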

Pass it a URL and a length (one you find is enough to get through the RSS headers and reach the first post title + some) and what you get back is an MD5 hash of the first item title. It’s probably going to have some junk around it, but that doesn’t matter for our purposes.  Store the result. Next time you hit the RSS feed, if the hash is different then you have new content.
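
The “store the result and compare next time” step could look something like this. It’s a minimal sketch, not a finished design: first_item_fingerprint and FeedTracker are made-up names, the state lives in memory only, and the fingerprint function is slightly stricter than the method above in that it insists the </title> comes after the <item>.

```ruby
require 'digest'

# Hash the text between the first <item> and the next </title>.
# Returns nil if the fetched chunk did not reach that far.
def first_item_fingerprint(data)
  after_item = data.split('<item>', 2)[1]
  return nil unless after_item && after_item.include?('</title>')
  Digest::MD5.hexdigest(after_item.split('</title>', 2)[0])
end

# Remembers the last fingerprint per feed URL (in memory only).
class FeedTracker
  def initialize
    @last_seen = {}
  end

  # True when the fingerprint differs from the one seen last time.
  # The very first check of a URL counts as new. A nil fingerprint
  # (fetch failed or chunk too short) never counts as new.
  def new_content?(url, fingerprint)
    return false if fingerprint.nil?
    changed = @last_seen[url] != fingerprint
    @last_seen[url] = fingerprint
    changed
  end
end
```

Only when new_content? says true do you go back for the full feed and hand it to the real parser.
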

It’s soooooooooo fast, so efficient, and so beautifully ghetto.  Sure – it means an extra http request when there is data, but my bandwidth has been cut by 95%!

Looking down the road, there are even more possibilities for optimization.  Right now I’ve added this method to the rails app process that gets fired off by a cron every minute.   I could instead write the “check if it’s changed” logic into a ruby script, or even a bash script so that an actual rake task and all the heavy plodding that goes along with it only happens when I’m sure that there is a new RSS entry.
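
As a rough idea of what that standalone checker might look like, here is a sketch that keeps the fingerprints in a small JSON file and only calls the expensive step when something moved. Everything specific in it is an assumption: the method names, the state file and the rake task name in the example wiring are all made up.

```ruby
require 'json'

# Load the saved {url => fingerprint} map, or start empty.
def load_state(path)
  File.exist?(path) ? JSON.parse(File.read(path)) : {}
end

def save_state(path, state)
  File.write(path, JSON.generate(state))
end

# fingerprint_for: callable returning a fingerprint string (or nil) for a URL.
# Updates `state` in place and returns the URLs whose fingerprint changed.
def changed_feeds(feeds, state, fingerprint_for)
  feeds.select do |url|
    fingerprint = fingerprint_for.call(url)
    next false if fingerprint.nil? || state[url] == fingerprint
    state[url] = fingerprint
    true
  end
end

# Check every feed; call on_change once if any fingerprint moved.
def run_check(feeds, state_path, fingerprint_for, on_change)
  state = load_state(state_path)
  changed = changed_feeds(feeds, state, fingerprint_for)
  save_state(state_path, state)
  on_change.call(changed) unless changed.empty?
  changed
end

# Example wiring (state path and task name are hypothetical):
#   run_check(ARGV, 'feed_state.json',
#             ->(url) { get_rss_MD5(url, 2000) },
#             ->(_urls) { system('bundle', 'exec', 'rake', 'feeds:ingest') })
```

The point of the design is that the cron job stays a tiny Ruby process most of the time, and Rails only boots when a fingerprint has actually changed.
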

Good times had by all.