How to add some anonymity to your data scraping Ruby and Rails apps

I have no idea how often people ACTUALLY comb their logs looking for someone scraping their pages, but sometimes you just want to fly under the radar. I generally don’t agree with stealing web content by scraping, but I do believe that if someone is in the data distribution business and they suck at it, it’s OK to bend the rules a little. For example, if they offer an RSS feed that is buggy, slow, huge, etc., but their homepage offers the same information more reliably – go for it.

There are basically two things at play here:

  1. Spoofing the user agent (pretending to be a browser, not a script).
  2. Spoofing the source of the request (the IP address it comes from).

Here’s a little function you can call to get a random user agent, based on a list of really common user agents. Thanks to whoever originally posted this to a blog – sorry, I don’t have the reference anymore.

def self.random_desktop_user_agent
    user_agents = [
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:16.0) Gecko/20100101 Firefox/16.0",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/536.26.17 (KHTML, like Gecko) Version/6.0.2 Safari/536.26.17",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
      "Mozilla/5.0 (Windows NT 5.1; rv:13.0) Gecko/20100101 Firefox/13.0.1",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; FunWebProducts; .NET CLR 1.1.4322; PeoplePal 6.2)",
      "Mozilla/5.0 (Windows NT 5.1; rv:5.0.1) Gecko/20100101 Firefox/5.0.1",
      "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
      "Opera/9.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.01",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) )",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)",
      "Mozilla/5.0 (Windows NT 6.1; rv:2.0b7pre) Gecko/20100921 Firefox/4.0b7pre",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322)",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11",
      "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 3.5.30729)",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      "Mozilla/4.0 (compatible; MSIE 6.0; MSIE 5.5; Windows NT 5.0) Opera 7.02 Bork-edition [en]",
      "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.02",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.8 (build 4157); .NET CLR 2.0.50727; AskTbPTV/5.11.3.15590)",
      "Mozilla/5.0 (Windows NT 5.1; rv:16.0) Gecko/20100101 Firefox/16.0",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0",
      "Mozilla/5.0 (Windows NT 6.1; rv:16.0) Gecko/20100101 Firefox/16.0",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/536.26.17 (KHTML, like Gecko) Version/6.0.2 Safari/536.26.17",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:16.0) Gecko/20100101 Firefox/16.0",
      "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/17.0 Firefox/17.0",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; TencentTraveler ; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727)",
      "Mozilla/5.0 (iPad; CPU OS 6_0_1 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A523 Safari/8536.25",
      "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:16.0) Gecko/20100101 Firefox/16.0",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11",
      "Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20100101 Firefox/17.0",
      "Mozilla/5.0 (Windows NT 5.1; rv:17.0) Gecko/20100101 Firefox/17.0"]
    user_agents.sample
end

I usually put such things in a model file, utility.rb, so I can just call Utility.random_desktop_user_agent from anywhere.
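For what it’s worth, the wiring is nothing fancy – a minimal sketch (the file name and class name are just my habit, not anything Rails requires):

# app/models/utility.rb
class Utility
  def self.random_desktop_user_agent
    user_agents = [
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      # ... the full list from above ...
    ]
    user_agents.sample
  end
end

Utility.random_desktop_user_agent # => one random string from the list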

That takes care of the user agent; now on to the proxy. You’re going to go out and get your own list of proxies, whether you find some reliable free ones or pay for a service. With the best services, you call a single proxy of theirs and it cycles through a bunch of IP addresses for you – each call goes out through a different one, round robin. They then dump those IPs every 30 minutes or so.. not bad.

def self.random_proxy_server
  proxies = [
    ["proxy_server1", "proxy_port", "proxyuser_if_authenticated", "proxypassword_if_authenticated"],
    ["proxy_server2", "proxy_port", "proxyuser_if_authenticated", "proxypassword_if_authenticated"]
  ]
  proxies.sample
end

Again – I put that in the utility.rb model.

So let’s put it all together..

Calling a page with open-uri looks something like this:

require 'open-uri'

proxy = Utility.random_proxy_server
# open-uri wants the proxy as a full URI (scheme included), plus the credentials
open(url,
  :proxy_http_basic_authentication => ["http://#{proxy[0]}:#{proxy[1]}", proxy[2], proxy[3]],
  "User-Agent" => Utility.random_desktop_user_agent)
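End to end, a minimal sketch (the URL is a placeholder, and I’m assuming the Utility methods from above):

require 'open-uri'

url   = "http://www.somedomain.com/page.html" # placeholder
proxy = Utility.random_proxy_server

html = open(url,
  :proxy_http_basic_authentication => ["http://#{proxy[0]}:#{proxy[1]}", proxy[2], proxy[3]],
  "User-Agent" => Utility.random_desktop_user_agent).read

puts html[0, 200] # quick sanity check on what came back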

That’s about it.

K

Ruby RSS curl trick / partial download beats Conditional GET any day

I’ve been playing with RSS ingestion and I gotta say, there are some issues. Life is made MUCH easier by gems like Feedzirra, but all is not perfect in RSS world.

Here are some challenges that I came across:

  1. RSS feeds can be HUGE. There is no way around the fact that if you have time-sensitive RSS needs, simply pulling a feed every minute doesn’t cut it. Even if, after you get it, you make sensible choices based on the feed timestamp and item publish dates, you still don’t want to grab 2 MB a minute if you don’t have to.
  2. 304s suck. The idea that you can do a conditional GET on a page, by passing either the Last-Modified or ETag value, and have the server say “304 – nothing new” just doesn’t work very well. You are absolutely at the mercy of the server you are calling, and that’s a problem. I’m not sure if it’s just inherently janky, or if the issue is the result of hitting different RSS feeds behind a load balancer.. I don’t know, but it’s a bit hateful.
  3. My first thought after giving up on 304s was to use net/http to grab just the header and compare things in it with what I found the last time (there’s a minimal sketch of this below). Makes sense, right? You have Last-Modified, ETag, and Content-Length to play with, but the sad truth is that these change for reasons you don’t think of. RSS feeds get updated not just because something has popped in at the top, but sometimes because the feed covers the last X hours and something has dropped off the bottom. RSS providers also adopt caching strategies, and their systems, dumb as they are, will recreate the same content after 90 seconds “just because” – and from a header perspective, it’s “new”. Still, this was working out for me pretty well. I went from 4 MB / MINUTE of ingestion across various feeds down to a point where that 4 MB was spread over about 5 minutes, once I actually got a header match that I trusted. Still.. UGH.. some of these feeds are stale for HOURS over the weekend, but the headers imply they are working their brains out, making me grab a big-ass RSS feed every 90 seconds or so just so I can throw it away.
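For reference, the header check looked roughly like this (the feed URL is a placeholder):

require 'net/http'
require 'uri'

# Fetch only the response headers and build a fingerprint from the
# fields that *should* change when the feed changes.
def feed_header_fingerprint(url)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.head(uri.request_uri)
  end
  [response['last-modified'], response['etag'], response['content-length']]
end

# Store this and compare it next run; a changed fingerprint *suggests*
# new content – subject to all the false positives described above.
feed_header_fingerprint('http://www.somedomain.com/feed.rss')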
So.. thinking outside the box.. what are we REALLY interested in? What is the real goal?

  1. Find out if there is a new RSS entry.
  2. Do so in a way that is time / resource / bandwidth / CPU efficient.

Well, when you put it like that..

$ curl -r 0-2000 http://www.somedomain.com

Say what? EUREKA! BOOYAH.. etc.
No really.. I know curl has been around the block more than yo mama, but hey, once in a while the wonderfully beautiful elegance of Rails can benefit from a bit of ghetto.
So what curl does with -r is ask the server for just a byte range – here, the first ~2000 bytes of the page – and then call it quits. (Under the hood it’s an HTTP Range request, so a server that ignores ranges will just send the whole thing – no worse off than before.) How nice is that?? Just curl enough to get the first item title, do something smart with it, and yer done.
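If you’re curious what that trick looks like without shelling out, here it is in plain net/http (placeholder URL; this is just to show the mechanics, not what I run):

require 'net/http'
require 'uri'

uri = URI.parse('http://www.somedomain.com/feed.rss') # placeholder
request = Net::HTTP::Get.new(uri.request_uri)
request['Range'] = 'bytes=0-2000' # same as curl -r 0-2000

response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
puts response.code          # "206" when the server honored the range
puts response.body.bytesize # ~2001 bytes instead of the whole feed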
We aren’t there yet though. I looked at about a dozen Ruby curl wrappers, and maybe I’m missing something, but I didn’t see support for the myriad of curl parameters, and I needed proxy, user-agent AND this -r wizardry.. so what the hell, let’s just go ghetto.
So let’s be clear: we don’t really need to get the REAL title.. we just need to do something consistent, where if we do it twice in a row and get a different result, we know there is a new item.
There are many ways you can go with this, but here’s what I do, centered around this method:

require 'digest'
require 'shellwords'

def get_rss_MD5(url, length)
  # Partial download: just the first `length` bytes. --fail keeps curl
  # from handing back an HTML error page on a bad response.
  data = `curl -r 0-#{length.to_i} --connect-timeout 10 --fail #{Shellwords.escape(url)}`
  if data.include?('<item>') && data.include?('</title>')
    # Hash whatever sits between the first <item> and the next </title> –
    # roughly the first item title, junk included.
    Digest::MD5.hexdigest(data.split('<item>')[1].split('</title>')[0])
  else
    nil
  end
end

Pass it a URL and a length (one you find is enough to get through the feed’s preamble and into the first post title, plus some slack), and what you get back is an MD5 hash of the first item title. It’s probably going to have some junk around it, but that doesn’t matter for our purposes. Store the result; the next time you hit the RSS feed, if the hash is different, you have new content.

It’s soooooooooo fast, so efficient, and so beautifully ghetto. Sure – it means an extra HTTP request when there is new data, but my bandwidth has been cut by 95%!

Looking down the road, there are even more possibilities for optimization. Right now I’ve added this method to the Rails app process that gets fired off by a cron job every minute. I could instead move the “check if it’s changed” logic into a standalone Ruby script, or even a Bash script, so that the actual Rake task – and all the heavy plodding that goes along with it – only happens when I’m sure there is a new RSS entry.
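Something along these lines, say (every name and path here is hypothetical – a sketch of the idea, not what I’ve shipped):

#!/usr/bin/env ruby
# Hypothetical cron-side checker: only wake up the heavy Rails/Rake
# machinery when the feed fingerprint has actually changed.
require 'digest'
require 'shellwords'

URL   = 'http://www.somedomain.com/feed.rss' # placeholder feed
STORE = '/tmp/feed_md5.txt'                  # where the last hash lives

data = `curl -r 0-2000 --connect-timeout 10 --fail #{Shellwords.escape(URL)}`
exit unless data.include?('<item>') && data.include?('</title>')

md5  = Digest::MD5.hexdigest(data.split('<item>')[1].split('</title>')[0])
last = File.exist?(STORE) ? File.read(STORE).strip : nil

if md5 != last
  File.write(STORE, md5)
  system('cd /path/to/app && bundle exec rake rss:ingest') # hypothetical task name
end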

Good times had by all.