Friday, April 3, 2009

Using the EC2 Hadoop Cluster (and S3).

A previous post describes how to get going and start an EC2 Hadoop cluster. Now we'll see how it can be used.

The "hadoop-ec2" command can be used to interact with your cluster; type it in to see the help prompt that will describe some uses. Some handy things are
 hadoop-ec2 push my-cluster filename

to copy a file to the cluster (though, as we'll see below, you can also use S3 to move files around) and
 hadoop-ec2 login my-cluster

to open an ssh session to the master machine of the cluster.

OK, suppose you actually want to run a job. For now, use "hadoop-ec2 push" to transfer all the required files to the cluster, log in, and use the 'hadoop' command to execute tasks just as you would on a local machine. For example, to run one of the bundled examples, you could log in to the master and run
 hadoop jar /usr/local/hadoop-0.19.0/hadoop-0.19.0-examples.jar pi 4 10000 
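
For your own code the workflow is the same: push the jar, log in, and launch it with 'hadoop jar'. Here is a minimal sketch; my-job.jar, my.pkg.MyJob and the input/output paths are placeholder names, not anything Hadoop ships with:

 # On your local machine: copy the job jar to the master.
 hadoop-ec2 push my-cluster my-job.jar
 # Open a shell on the master.
 hadoop-ec2 login my-cluster
 # On the master: run the job (input-dir must exist on HDFS, output-dir must not).
 hadoop jar my-job.jar my.pkg.MyJob input-dir output-dir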


Suppose you want to do some real work. Since you're using Hadoop, you probably have some big files, and repeatedly copying them from your machine to the cluster is a bad idea; furthermore, they may not even fit on the cluster master. Not to worry -- that's what S3 is for! Hadoop has a built-in interface to S3: you refer to files on S3 as
s3n://AWS_ID:AWS_SECRET_KEY@bucket/filename

So, you can list all the files in a bucket by
hadoop fs -ls s3n://AWS_ID:AWS_SECRET_KEY@bucket 

and copy all the files in the bucket to HDFS (using your cluster to distribute the work) by
hadoop distcp s3n://AWS_ID:AWS_SECRET_KEY@bucket/ newDir

which will place all the files on HDFS, in a directory called newDir. If you reverse the last two arguments, it will copy files from HDFS over to S3, so you can use them later.
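
For instance, a quick sketch of the reverse direction -- the directory and bucket names are just placeholders:

 # HDFS -> S3: same command, arguments swapped.
 hadoop distcp newDir s3n://AWS_ID:AWS_SECRET_KEY@bucket/newDir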

If your secret key contains a slash, you will have problems; two possible workarounds are to regenerate the key or to edit the
conf/hadoop-site.xml

file and add

<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>

<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>THE_KEY_ITSELF</value>
</property>


and you won't need to specify the id/key on the command line.
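
Once the keys are in hadoop-site.xml, the earlier commands shrink to something like this (the bucket name is still a placeholder):

 hadoop fs -ls s3n://bucket/
 hadoop distcp s3n://bucket/ newDir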

If some of your files are too big to store on S3 directly (greater than 5 GB in size), you can use s3:// instead of s3n:// and Hadoop will automatically split them, but the files will then be stored in a special Hadoop-only block format and you won't be able to access them with other tools.
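
The commands look the same, only the scheme changes; here is a sketch with the same placeholder names as above (note that the s3:// filesystem reads its keys from fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey rather than the fs.s3n.* properties):

 # Copy an HDFS directory into the S3 block store, and back out again.
 hadoop distcp newDir s3://AWS_ID:AWS_SECRET_KEY@bucket/
 hadoop distcp s3://AWS_ID:AWS_SECRET_KEY@bucket/ newDir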

4 comments:

  1. Regarding slashes, the slash can be replaced with %2F

  2. Silly comment: distcp won't copy a directory when the destination folder already exists. Then again, why should Hadoop, right?

  3. This comment has been removed by the author.

  4. Any secret tips or hints? I can't get this to fully work. For example:

    # bin/hadoop fs -ls s3n://dwh-hdfs-test/tmp
    ls: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/tmp' - ResponseCode=403, ResponseMessage=Forbidden

    But the namenode -format command was able to set up the directories and files on S3... I'm baffled...
