I have no idea how often people ACTUALLY comb their logs for someone scraping their pages, but sometimes you want to just fly under the radar. I generally don’t agree with stealing web content by scraping, but I do believe that if someone is in the data distribution business and they suck at it, it’s OK to bend the rules a little. For example, if they offer an RSS feed that is buggy, slow, huge, etc., but their homepage offers the same information more reliably – go for it.
There are basically two things at play here:
- Spoofing a user-agent (pretending to be a browser, not a script)
- Spoofing the source of the request.
Here’s a little function you can call to get a random user agent, based on a list of really common user agents. Thanks to the guy who posted this to a blog – sorry, I don’t have the reference anymore.
I usually put such things in a utility.rb model so I can just call it with Utility.random_desktop_agent.
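A minimal sketch of what that might look like – the agent strings below are just a few illustrative examples of common desktop browsers, not the original list, so swap in whatever you’ve collected:

```ruby
# app/models/utility.rb
class Utility
  # A handful of common desktop browser user-agent strings.
  # (Illustrative examples -- replace with your own collected list.)
  DESKTOP_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0"
  ]

  # Pick one at random so successive requests don't all look identical.
  def self.random_desktop_agent
    DESKTOP_AGENTS.sample
  end
end
```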
That takes care of the user agent; now on to the proxy. You’re going to go out and get your own list of proxies, whether you find some reliable free ones or pay for a service. With the best services you call a single proxy of theirs, and it cycles through a bunch of IP addresses – each call goes through a different one, round robin. They then dump those IPs every 30 minutes or so. Not bad.
Again – I put that in the Utility.rb model.
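Something along these lines in the same Utility model works – the proxy URLs here are placeholders, and the method name random_proxy is my own invention, not the original:

```ruby
# app/models/utility.rb (reopened)
class Utility
  # Placeholder proxy URLs -- fill in your own free proxies, or point
  # at a paid service's single rotating endpoint instead.
  PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:3128"
  ]

  # Hand back one proxy at random per request.
  def self.random_proxy
    PROXIES.sample
  end
end
```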
So, put it all together.
Calling a page looks something like this with open-uri:
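A sketch, assuming a Utility.random_desktop_agent helper as described, plus a hypothetical Utility.random_proxy companion that returns one proxy URL:

```ruby
require 'open-uri'

# Fetch a page with a spoofed user agent and a proxy.
# Utility.random_desktop_agent is the helper described above;
# Utility.random_proxy is a hypothetical companion returning a proxy URL.
def fetch_page(url)
  URI.open(
    url,
    "User-Agent" => Utility.random_desktop_agent,  # request header
    :proxy       => URI.parse(Utility.random_proxy) # open-uri proxy option
  ).read
end

# html = fetch_page("http://www.somedomain.com/")
```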
That’s about it.
I’ve been playing with RSS ingestion and I gotta say there are some issues. Life is made MUCH easier by gems like Feedzirra, but all is not perfect in RSS world.
Here are some challenges that I came across:
- RSS feeds can be HUGE. There is no way around the fact that if you have time-sensitive RSS needs, simply pulling an RSS feed every minute doesn’t cut it. Even if, after you get it, you make sensible choices based on the feed timestamp and item publish dates, you still don’t want to grab 2MB a minute if you don’t have to.
- 304s suck. The idea that you can do a conditional GET on a page, by passing either the Last-Modified or ETag value, and have the server say “304 – Not Modified” just doesn’t work very well in practice. You are absolutely at the mercy of the server you are calling, and that’s a problem. I’m not sure if it’s just inherently janky, or whether the issue is actually hitting different RSS servers behind a load balancer, I don’t know.. but it’s a bit hateful.
- My first thought after giving up on 304s was to use net/http to just grab the headers and compare them with what I found last time. Makes sense, right? You have Last-Modified, ETag and Content-Length to play with, but the sad truth is that these change for reasons you don’t think of. RSS feeds get updated not just because something has popped in at the top, but sometimes because it’s a feed covering X hours and something has dropped off the bottom. RSS providers also adopt caching strategies, and their systems, dumb as they are, will recreate the same content after 90 seconds “just because” – and from a header perspective, it’s “new”. Still, this was working out for me pretty well. I went from 4MB / MINUTE of ingestion across various feeds down to a point where that 4MB was spread over about 5 minutes, whenever I actually got a header match I trusted. Still.. UGH.. some of these feeds are stale for HOURS over the weekend, but the headers imply that they are working their brains out, making me grab a big-ass RSS feed every 90 seconds or so just so I can throw it away.
So the goal boils down to two things:
- Find out if there is a new RSS entry
- Do so in a way that is time / resources / bandwidth / cpu efficient.
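For reference, the header-comparison approach described above can be sketched with net/http – a HEAD request that pulls only the response headers so they can be compared between polls. The function name feed_fingerprint is my own, and remember the caveat: these headers change for reasons you don’t expect.

```ruby
require 'net/http'
require 'uri'

# Grab just the response headers for a feed URL (HEAD request, no body),
# returning the fields worth comparing between polls.
def feed_fingerprint(url)
  uri = URI.parse(url)
  res = Net::HTTP.start(uri.host, uri.port) do |http|
    http.head(uri.request_uri)
  end
  {
    etag:           res['etag'],
    last_modified:  res['last-modified'],
    content_length: res['content-length']
  }
end

# Store the hash returned above; if the next poll's fingerprint differs,
# the feed *may* have new content (with all the caveats about false positives).
```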
The trick is an HTTP range request – only pull the first couple of thousand bytes of the feed:
$ curl -r 0-2000 http://www.somedomain.com
Pass it a URL and a length (one you find is enough to get through the RSS headers and reach the first post title, plus some), and what you get back is an MD5 hash of the first item title. It’s probably going to have some junk around it, but that doesn’t matter for our purposes. Store the result. Next time you hit the RSS feed, if the hash is different then you have new content.
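A sketch of what that function might look like – the name feed_head_hash and the crude regex grab around the first item title are my choices; adjust the byte length and extraction to taste:

```ruby
require 'net/http'
require 'uri'
require 'digest/md5'

# Fetch only the first `length` bytes of a feed via an HTTP Range request
# and return an MD5 of the region around the first item title.
# The extraction is deliberately crude -- junk around the title is fine,
# since we only ever compare hashes between polls.
def feed_head_hash(url, length = 2000)
  uri = URI.parse(url)
  res = Net::HTTP.start(uri.host, uri.port) do |http|
    req = Net::HTTP::Get.new(uri.request_uri)
    req['Range'] = "bytes=0-#{length}"   # ask for just the head of the feed
    http.request(req)
  end
  chunk = res.body.to_s
  # Rough grab: everything from the first <item> up to its closing </title>.
  slice = chunk[/<item.*?<\/title>/m] || chunk
  Digest::MD5.hexdigest(slice)
end

# Store the hash; if the next poll's hash differs, go grab the full feed.
```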
It’s soooooooooo fast, so efficient, and so beautifully ghetto. Sure – it means an extra HTTP request when there is new data, but my bandwidth has been cut by 95%!
Looking down the road, there are even more possibilities for optimization. Right now I’ve added this method to the Rails app process that gets fired off by a cron every minute. I could instead write the “check if it’s changed” logic into a standalone Ruby script, or even a bash script, so that an actual rake task – and all the heavy plodding that goes along with it – only happens when I’m sure that there is a new RSS entry.
Good times had by all.