I’ve been playing with RSS ingestion and I gotta say there are some issues. Life is made MUCH easier by gems like Feedzirra, but all is not perfect in the RSS world.
Here are some challenges that I came across:
- RSS feeds can be HUGE. If you have time-sensitive RSS needs, there is no getting around the fact that simply pulling the whole feed every minute doesn’t cut it. Even if you make sensible choices afterwards based on the feed timestamp and item publish dates, you still don’t want to grab 2 MB a minute if you don’t have to.
- 304s suck. The idea that you can do a conditional GET on a page, passing back the Last-Modified or ETag value from the previous response (as If-Modified-Since or If-None-Match), and have the server say “304 – nothing new” just doesn’t work very well. You are absolutely at the mercy of the server you are calling, and that’s a problem. Not sure if it’s just inherently janky, or if the issue is the result of actually hitting different copies of the feed behind a load balancer, I don’t know.. but it’s a bit hateful.
- My first thought after giving up on 304s was to use net/http to grab just the headers and compare them with what I found the last time. Makes sense, right? You have Last-Modified, ETag and Content-Length to play with, but the sad truth is that these change for reasons you don’t think of. RSS feeds get updated not just because something has popped in at the top, but also sometimes because the feed covers the last X hours and something has dropped off the bottom. RSS providers also adopt caching strategies, and their systems, dumb as they are, will regenerate the same content after 90 seconds “just because”, so from a header perspective it’s “new”. Still, this was working out pretty well for me. I went from 4 MB / MINUTE of ingestion across various feeds down to a point where that 4 MB was spread over about 5 minutes, once I had a header match that I trusted. Still.. UGH.. some of these feeds are stale for HOURS over the weekend, but the headers imply that they are working their brains out, making me grab a big ass RSS feed every 90 seconds or so just so I can throw it away.
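For the curious, the header-comparison idea from that last bullet looks roughly like this. It’s a minimal sketch, not the code I actually ran: the method names and the choice to mash the three headers into one MD5 fingerprint are just assumptions for illustration.

```ruby
require "net/http"
require "uri"
require "digest/md5"

# The three response headers mentioned above.
WATCHED_HEADERS = %w[last-modified etag content-length].freeze

# Reduce a hash of headers to one fingerprint string.
# Header names are compared case-insensitively; missing headers count as "".
def header_fingerprint(headers)
  normalized = headers.each_with_object({}) do |(name, value), acc|
    acc[name.to_s.downcase] = value.to_s
  end
  Digest::MD5.hexdigest(WATCHED_HEADERS.map { |name| normalized[name].to_s }.join("|"))
end

# Issue a HEAD request (no body comes back) and fingerprint the headers.
def fetch_header_fingerprint(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  headers = WATCHED_HEADERS.each_with_object({}) { |name, acc| acc[name] = response[name] }
  header_fingerprint(headers)
end
```

Store the fingerprint, and only pull the full feed when the next one differs. As described above, that still fires far too often when the server regenerates identical content.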
So what I actually need to do is:
- Find out if there is a new RSS entry.
- Do so in a way that is time / resource / bandwidth / CPU efficient.
$ curl -r 0-2000 http://www.somedomain.com
Pass it a URL and a length (one you’ve found is enough to get past the RSS headers and reach the first post title, plus a bit) and what you get back is an MD5 hash of the first item title. It’s probably going to have some junk around it, but that doesn’t matter for our purposes. Store the result. Next time you hit the RSS feed, if the hash is different then you have new content.
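Here is a rough Ruby sketch of that kind of method, assuming net/http and a server that honors the Range header. The names `feed_digest`, `title_digest` and `first_item_title` are made up for this example, and the regex-based title grab is just one crude way to do it, not necessarily how you’d want to parse things.

```ruby
require "net/http"
require "uri"
require "digest/md5"

# Grab the first <item> (RSS) or <entry> (Atom) title from a partial feed chunk.
# If the chunk is cut off before a complete title, fall back to whatever
# follows the item tag (or the whole chunk), junk and all.
def first_item_title(chunk)
  chunk = chunk.b # treat as raw bytes; a range may cut a multi-byte character in half
  item_start = chunk.index(/<(item|entry)[\s>]/)
  return chunk if item_start.nil?
  rest = chunk[item_start..-1]
  match = rest.match(/<title[^>]*>(.*?)<\/title>/m)
  match ? match[1] : rest
end

def title_digest(chunk)
  Digest::MD5.hexdigest(first_item_title(chunk))
end

# Fetch only the first `length` bytes or so of the feed and hash the first item title.
def feed_digest(url, length = 2000)
  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri.request_uri)
  request["Range"] = "bytes=0-#{length}"
  # Ask for an uncompressed body so the partial response is readable text.
  request["Accept-Encoding"] = "identity"
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  # A server that ignores Range replies 200 with the full body, so trim it
  # to keep the digest comparable either way.
  title_digest(response.body.to_s[0, length + 1])
end
```

If the server ignores the Range header you get the whole feed back and save nothing, so it’s worth checking for a 206 response on each feed before relying on this.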
It’s soooooooooo fast, so efficient, and so beautifully ghetto. Sure – it means an extra HTTP request when there is new data, but my bandwidth has been cut by 95%!
Looking down the road, there are even more possibilities for optimization. Right now I’ve added this method to the Rails app process that gets fired off by a cron job every minute. I could instead write the “check if it’s changed” logic into a standalone Ruby script, or even a bash script, so that an actual rake task and all the heavy plodding that goes along with it only happens when I’m sure that there is a new RSS entry.
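The “only do the heavy stuff when something changed” part of such a script could be as small as the sketch below. It assumes you keep the last hash in a plain file; `digest_changed?`, the state path and the rake task name are all placeholders, not something from my actual setup.

```ruby
# Returns true when `digest` differs from the one stored at `state_path`,
# and records the new digest so the next run compares against it.
def digest_changed?(digest, state_path)
  previous = File.exist?(state_path) ? File.read(state_path).strip : nil
  return false if previous == digest
  File.write(state_path, digest)
  true
end

# Hypothetical usage from a cron-driven script (task name and path are placeholders):
#   system("rake", "feeds:ingest") if digest_changed?(current_digest, "/tmp/feed.md5")
```

That way cron can fire a tiny script every minute, and Rails only gets loaded when the hash has actually moved.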
Good times had by all.