Monday, March 30, 2009

Hadoop, EC2 and S3

Hadoop is a framework for running MapReduce computations on a cluster. Amazon recently introduced Elastic MapReduce, which makes it easier to run Hadoop jobs on their infrastructure, but it comes with extra costs (a higher per-hour price, plus additional S3 usage), and the existing way of running Hadoop on EC2 is already quite good. That is what I'll describe here.

First, you need an Amazon AWS account and the EC2 API tools set up on your machine; Amazon's instructions for this are pretty clear. You also need to download and unpack the latest version of Hadoop on a local machine. For convenience, add the src/contrib/ec2/bin subdirectory of Hadoop to your path.
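
For example, assuming Hadoop was unpacked into ~/hadoop-0.19.1 (the directory name here is just a placeholder for whatever version you grabbed), the path setup looks something like this:

 export HADOOP_HOME=~/hadoop-0.19.1                    # wherever you unpacked Hadoop
 export PATH=$PATH:$HADOOP_HOME/src/contrib/ec2/bin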

Now, you must edit
 src/contrib/ec2/bin/hadoop-ec2-env.sh

and set the variables holding your AWS credentials. For KEYNAME, use the name you chose when you created your EC2 key pair; the private key file is expected to be called id_rsa-$KEYNAME, though you can change that on the PRIVATE_KEY_PATH line. You shouldn't need to change anything other than the first 5 variables in the file.
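
To give a rough idea, the top of the file ends up looking something like the snippet below. The exact variable names differ a little between Hadoop versions, and all the values are placeholders:

 AWS_ACCOUNT_ID=123456789012             # your account ID, without dashes (placeholder)
 AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY       # placeholder
 AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY   # placeholder
 KEYNAME=my-ec2-keypair                  # the key pair name you chose
 PRIVATE_KEY_PATH=$HOME/.ec2/id_rsa-$KEYNAME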

If you've done things correctly, you should now be able to run
 hadoop-ec2 launch-cluster my-cluster 2

which should churn for a while, create the right security groups and permissions, and boot up all the machines. Your cluster is then good to go. "my-cluster" is the name you chose for the cluster and can be anything you want; 2 is the number of slave nodes to run. If you already have a cluster running, you can add machines to it by running
 hadoop-ec2 launch-slaves my-cluster n 

to launch n more slaves for that cluster. It will take a few minutes for them to boot up and join the cluster.
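
As a quick sanity check once the machines are up, you can log in to the master and ask HDFS how many datanodes it sees. The login subcommand below is part of the same hadoop-ec2 script and dfsadmin -report is a standard Hadoop command, but take the exact invocation as a sketch; the path to the hadoop binary on the instance may vary:

 hadoop-ec2 login my-cluster
 # then, on the master node:
 hadoop dfsadmin -report    # the live datanode count should match the number of slaves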

The cluster can be turned off with
 hadoop-ec2 terminate-cluster my-cluster


The next Hadoop-related post will talk about how the cluster can be used and how data can be imported from and exported to S3.
