Monday, March 30, 2009

SSH for web proxying

OpenSSH can be trivially used as a SOCKS proxy. From your machine, simply run


ssh -D 2000 destination -N


to establish the proxy. The -N flag starts the tunnel without running a shell on the other side and is optional.
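
If you find yourself doing this often, the forwarding can live in your SSH configuration instead; DynamicForward is the ssh_config equivalent of -D. A minimal ~/.ssh/config entry (the host alias "proxy" here is just an example) would be:

 Host proxy
     HostName destination
     DynamicForward 2000

After that, ssh -N proxy sets up the same tunnel.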

Then, tell your web browser to use localhost:2000 as a SOCKS proxy. In Firefox, this is in the Preferences/Network/Connection menu. Alternatively, you can use FoxyProxy, an extension that makes managing multiple proxies easier.
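
Incidentally, you can sanity-check the tunnel from the command line before touching any browser settings; curl has supported SOCKS proxies for a while, so something like

 curl --socks5 localhost:2000 http://example.com/

should fetch the page through the proxy.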

By default, your DNS requests still go directly to your local DNS server. If you wish to change that, you need to tell your web browser to resolve names through the SOCKS proxy as well; in FoxyProxy, that setting is under FoxyProxy Options/Global Settings/Miscellaneous.
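
If you configure the proxy by hand rather than through FoxyProxy, Firefox has (as far as I know) an about:config preference for the same thing:

 network.proxy.socks_remote_dns = true

curl's analogue is --socks5-hostname, which hands name resolution to the far end of the tunnel as well.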

Why would you want to do this? Well, for one, your HTTP traffic is transmitted in plaintext, so if you are on an untrusted network, someone could be snooping on all of it. By proxying it through a secure link to a known machine, you prevent other people on that network from seeing your web browsing, any passwords sent in the clear, and so on.

More interestingly, suppose you have an intranet with resources that are not available to the outside world. By proxying through a machine on the intranet, you can use those resources remotely; for example, by proxying through a machine on your college campus, you can use site-licensed resources like paper archives. Alternatively, if you use Amazon's EC2 (for example, in conjunction with Hadoop), you could use the proxy to access web services running on your EC2 instances without opening up public ports, and even to use the machines' internal domain names.

Hadoop, EC2 and S3

Hadoop is a framework for performing MapReduce computations on a cluster. Recently, Amazon introduced Elastic MapReduce, which makes it easier to run Hadoop jobs on their infrastructure. However, that comes with extra costs (a higher per-hour rate, plus extra S3 usage), and the existing way of using EC2 with Hadoop is already quite good. This is what I'll describe here.

First, you need to get an Amazon AWS account and set up the EC2 API tools on your machine; Amazon's instructions are pretty clear. You also need to download and unzip the latest version of Hadoop on a local machine. For convenience, add the src/contrib/ec2/bin subdirectory of the Hadoop distribution to your path.
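
Something along these lines in your shell startup file does the trick (the path below is a placeholder for wherever you unpacked Hadoop):

 export PATH=$PATH:/path/to/hadoop/src/contrib/ec2/bin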

Now, you must edit
 src/contrib/ec2/bin/hadoop-ec2-env.sh  

and set up the variables specifying your AWS credentials. For KEYNAME, use the name that you selected when you created your key pair; the private key file is expected to be called id_rsa-$KEYNAME, though you can change that on the PRIVATE_KEY_PATH line. You shouldn't need to change anything other than the first five variables in the file.
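
For orientation, the credential section of the file looks roughly like the sketch below; I'm quoting the variable names from memory of the stock script, so treat the exact spellings and defaults as approximate, and the values as placeholders.

 # Your AWS credentials
 AWS_ACCOUNT_ID=<your account id>
 AWS_ACCESS_KEY_ID=<your access key id>
 AWS_SECRET_ACCESS_KEY=<your secret access key>
 # The EC2 key pair used to launch instances, and the matching private key
 KEY_NAME=mykey
 PRIVATE_KEY_PATH=$HOME/.ec2/id_rsa-$KEY_NAME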

If you've done things correctly, you should now be able to run
 hadoop-ec2 launch-cluster my-cluster 2

which should churn for a while, create the right groups, permissions, etc., and boot up all the machines; after that, your cluster is good to go. "my-cluster" is the name you chose for the cluster and can be anything you want; 2 is the number of slave nodes to run. If you have an existing cluster running, you can add machines to it by running
 hadoop-ec2 launch-slaves my-cluster n 

to launch n more slaves for that cluster. It will take a little while for them to boot up, but within a few minutes they should join the cluster.
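
A quick way to watch that happen: if your copy of the scripts has the login subcommand (the stock ones should), then

 hadoop-ec2 login my-cluster

gives you a shell on the master, and from there hadoop dfsadmin -report shows how many datanodes have registered.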

The cluster can be turned off with
 hadoop-ec2 terminate-cluster my-cluster
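
Since stray instances keep accruing charges, it's worth double-checking with the EC2 API tools that nothing was left running:

 ec2-describe-instances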


The next Hadoop-related post will talk about how the cluster can be used and how data can be imported from and exported to S3.