Understanding Proxy Servers. What are they for? Why would you use one?
Proxy servers are really used for one of two things:
1) A stepping stone.
Think of a Proxy Server as a step between you and where you want to go. For example, you can configure a proxy server in most browser configurations. The result is something like this:
– you want to view google.com
– your browser sends the request to the proxy server
– the proxy server pulls up google.com
– the proxy server tunnels whatever it gets back to you.
That gets you two things:
– ANONYMITY. Google.com doesn’t see your address in the request. It just sees the Proxy Server. Of course there is no real anonymity on the web, but we are talking about what some sysadmin at Google can tell about the traffic.
– SPOOFING LOCATION. If you are in the US, and there’s a website in the UK that is locked down to only allow people in the UK to access content, then by using a UK Proxy Server, that website thinks you are in the UK and you are good to go.
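Browsers aren’t the only thing you can point at a proxy. Here is a minimal Ruby sketch of the same stepping-stone idea using the standard Net::HTTP library. The proxy host and port are made-up placeholders, and the helper name http_via_proxy is just for illustration.

```ruby
require "net/http"
require "uri"

# Hypothetical proxy details; substitute a proxy you actually have access to.
PROXY_HOST = "proxy.example.com"
PROXY_PORT = 8080

# Build an HTTP client that hands its requests to the proxy instead of
# connecting directly to the target site.
def http_via_proxy(target_url, proxy_host = PROXY_HOST, proxy_port = PROXY_PORT)
  uri = URI.parse(target_url)
  # The third and fourth arguments to Net::HTTP.new are the proxy address and port.
  http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)
  http.use_ssl = (uri.scheme == "https")
  http
end

# Usage (makes a real network request, so it is left commented out):
# http = http_via_proxy("http://www.google.com/")
# response = http.get("/")
# puts response.code
```

The target site only ever sees a connection coming from the proxy’s address, which is the whole point.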
2) A Gateway
This is frequently more of a corporate thing, where a company comes up with various reasons why they need a Proxy Server, but in general it’s so they can control the crap out of you. If you work for a big company, then while on the internal network you might only be able to surf the internet through the corporate Proxy Server. There are legit and useful reasons to do things this way, but more often than not, a biggie is that with all traffic tunneling through a single point it’s much easier for the IT dept to block access to certain sites or make sure you aren’t watching donkey porn at work.
For this post, we are more interested in (1), the stepping stone, and what options you have, and that really depends on what you are doing.
TYPE A: FREE (or pay once) PROXY SERVICES
If you are on vacation in Mexico and you want to use http://www.hulu.com, then the best thing to do is go to somewhere like http://hidemyass.com/proxy-list which publishes lists of Open Proxy Servers. The more recent the listing, the more likely it’s going to work.
Results will vary. A proxy might work, or it might not, and it might be horribly slow. For sure, a proxy that works today probably won’t work tomorrow. You also have no idea WHY this proxy is available, and it’s definitely a possibility that it’s sniffing whatever traffic you are putting through it, so not the best time to Instant Message your credit card details to someone.
If these services have a Premium offering, the basic message is the same: you just get the list delivered to you in a better format.
TYPE B: PAID PROXY SERVICES
There are a bunch out there, and if you are doing some sort of web scraping or web spidering or similar where reliability and performance are important to you, there is no other way to go. PAID proxies typically ensure that they can keep up with demand by charging by throughput. That makes them a BAD idea for your Vacation Property in Mexico for Hulu / Netflix because media bandwidth adds up quickly.
Here are some of the things that differentiate offerings.
1) Revolving IPs
This is a good thing. Basically this means that you always call the same proxy server, but every time that proxy server makes an outbound call it cycles round-robin through a bunch of IP addresses. So if there are 10 revolving IP addresses and you hit a page 10 times, then in theory each hit comes from a different IP.
Typically the IP addresses are sequential, so it’s not exactly rocket science for someone on the receiving end to see what is going on, but it’s still a bonus.
2) Multiple Proxy Servers
A decent service will give you credentials for a number of proxy servers, possibly in different countries. That means that you can round-robin call each of them from your application to further tangle the path between you and your final destination (See this post for an example of how you can easily do this with Ruby).
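A rough sketch of round-robin on your side of the fence, in Ruby. The proxy URLs are invented placeholders (a real service gives you its own hosts and credentials), and next_proxy is a hypothetical helper name.

```ruby
# Hypothetical proxy endpoints, one per country.
PROXIES = [
  "http://us.proxy.example.com:31280",
  "http://uk.proxy.example.com:31280",
  "http://de.proxy.example.com:31280"
].freeze

# Array#cycle without a block returns an endless enumerator that walks
# the list in order and wraps around at the end.
PROXY_ROTATION = PROXIES.cycle

# Return the next proxy in round-robin order.
def next_proxy
  PROXY_ROTATION.next
end
```

Call next_proxy before each request and every request goes out through a different server until the list wraps around.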
3) Short Term IP use
Not only do decent services round-robin between different IP addresses, they often throw those IP addresses away periodically and start with new ones. The advantage here is that if someone sees incoming traffic and blocks an IP, then that block is only good until the Proxy Service throws that IP away.
A really good PAID proxy service. ProxyMesh.com
It’s been about 6 months since I really dug in, but I really like ProxyMesh.com.
– Their prices are reasonable, starting at $1 / gig.
– They have highly maintained US and UK proxies with revolving Short Term IPs.
– They also manage a list of Open Proxy servers (like the ones listed at hidemyass.com) but THEY manage the list on their end. You just call the same ProxyMesh Proxy, and they farm out the request to any one of hundreds of proxy servers, and that list is very fluid as proxy servers come and go worldwide. These proxy servers are of course much less reliable than their core service, but offer significantly more anonymity.
That’s all folks.
I have no idea how often people ACTUALLY look at their logs looking for someone scraping their pages, but sometimes you want to just fly under the radar. I generally don’t agree with stealing web content by scraping, but I do believe that if someone is in the data distribution business and they suck at it, it’s ok to bend the rules a little. For example, if they offer an RSS feed that is buggy, slow, huge etc. but their homepage offers the same information more reliably, go for it.
There are basically two things at play here:
- Spoofing a user-agent (pretending to be a browser not a script)
- Spoofing the source of the request.
Here’s a little function you can call to get a random user agent, based on a list of really common user agents. Thanks to the guy who posted this to a blog; sorry, I don’t have the reference anymore.
I usually put such things in a model utility.rb so I can just call it with Utility.random_desktop_agent.
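A minimal sketch of what such a helper might look like. The three user agent strings are only illustrative examples of common desktop browsers, and the constant name DESKTOP_AGENTS is an assumption. You would want a longer, more current list.

```ruby
class Utility
  # Illustrative examples only; swap in your own list of common user agents.
  DESKTOP_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"
  ].freeze

  # Pick one user agent string at random from the list.
  def self.random_desktop_agent
    DESKTOP_AGENTS.sample
  end
end
```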
That takes care of the user agent, now on to the proxy. You’re going to go out and get your own list of proxies, whether you find some reliable free ones or pay for a service. With the best services you call a single proxy of theirs, and it cycles round-robin through a bunch of IP addresses, using a different one for each call. They then dump those IPs every 30 minutes or so. Not bad.
Again – I put that in the Utility.rb model.
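A sketch of what a proxy helper along those lines could look like. The method name random_proxy, the constant PROXY_LIST, and the proxy URLs are all placeholders, not real endpoints. Fill the list with whatever proxies you end up with.

```ruby
class Utility
  # Hypothetical proxy URLs; replace with your own free or paid proxies.
  PROXY_LIST = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:3128"
  ].freeze

  # Pick one proxy URL at random from the list.
  def self.random_proxy
    PROXY_LIST.sample
  end
end
```

If your service already revolves IPs behind a single proxy, the list can be just that one entry.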
So, putting it all together, calling a page with open-uri looks something like this:
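A minimal sketch, assuming a recent Ruby where open-uri is called through URI.open (older versions used a bare open). The helper names request_options and fetch_page are made up for this example. The string key becomes a request header, and the :proxy option tells open-uri which proxy to go through.

```ruby
require "open-uri"

# Build the options hash open-uri expects: string keys are sent as
# HTTP headers, and :proxy selects the proxy for the request.
def request_options(user_agent, proxy_url)
  { "User-Agent" => user_agent, proxy: URI.parse(proxy_url) }
end

# Fetch a page through the given proxy while presenting the given user agent.
def fetch_page(url, user_agent, proxy_url)
  URI.open(url, request_options(user_agent, proxy_url)).read
end

# Usage (makes a real network request, so it is left commented out):
# html = fetch_page("http://example.com/",
#                   "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
#                   "http://proxy.example.com:8080")
```

In practice you would pass in a random user agent and a proxy from your own list on every call rather than hard-coding them.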
That’s about it.