Monday, March 30, 2009

Hadoop, EC2 and S3

Hadoop is a framework for running MapReduce computations on a cluster. Amazon recently introduced Elastic MapReduce, which makes it easier to run Hadoop jobs on their infrastructure, but it comes with extra costs (a higher per-hour price, plus additional S3 usage), and the existing way of running Hadoop on EC2 is already quite good. That is what I'll describe here.

First, you need an Amazon AWS account and the EC2 API tools set up on your machine; Amazon's instructions for this are pretty clear. You also need to download and unpack the latest version of Hadoop on a local machine. For convenience, add the src/contrib/ec2/bin subdirectory of Hadoop to your path.
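
For example, assuming Hadoop was unpacked into ~/hadoop-0.19.1 (the directory name here is just a placeholder for whatever version you grabbed), the path setup looks something like this:

 export HADOOP_HOME=~/hadoop-0.19.1                    # wherever you unpacked Hadoop
 export PATH=$PATH:$HADOOP_HOME/src/contrib/ec2/bin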

Now, you must edit
 src/contrib/ec2/bin/hadoop-ec2-env.sh

and set the variables holding your AWS credentials. For KEYNAME, use the name you chose when you created your EC2 key pair; the private key file is expected to be called id_rsa-$KEYNAME, though you can change that on the PRIVATE_KEY_PATH line. You shouldn't need to change anything other than the first 5 variables in the file.
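
To give a rough idea, the top of the file ends up looking something like the snippet below. The exact variable names differ a little between Hadoop versions, and all the values are placeholders:

 AWS_ACCOUNT_ID=123456789012             # your account ID, without dashes (placeholder)
 AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY       # placeholder
 AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY   # placeholder
 KEYNAME=my-ec2-keypair                  # the key pair name you chose
 PRIVATE_KEY_PATH=$HOME/.ec2/id_rsa-$KEYNAME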

If you've done things correctly, you should now be able to run
 hadoop-ec2 launch-cluster my-cluster 2

which should churn for a while, create the right security groups and permissions, and boot up all the machines. Your cluster is then good to go. "my-cluster" is the name you chose for the cluster and can be anything you want; 2 is the number of slave nodes to run. If you already have a cluster running, you can add machines to it by running
 hadoop-ec2 launch-slaves my-cluster n 

to launch n more slaves for that cluster. It will take a few minutes for them to boot up and join the cluster.
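
As a quick sanity check once the machines are up, you can log in to the master and ask HDFS how many datanodes it sees. The login subcommand below is part of the same hadoop-ec2 script and dfsadmin -report is a standard Hadoop command, but take the exact invocation as a sketch; the path to the hadoop binary on the instance may vary:

 hadoop-ec2 login my-cluster
 # then, on the master node:
 hadoop dfsadmin -report    # the live datanode count should match the number of slaves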

The cluster can be turned off with
 hadoop-ec2 terminate-cluster my-cluster


The next Hadoop-related post will talk about how the cluster can be used and how data can be imported from and exported to S3.
