Friday, April 3, 2009

Using the EC2 Hadoop Cluster (and S3).

A previous post describes how to get going and start an EC2 Hadoop cluster. Now we'll see how it can be used.

The "hadoop-ec2" command can be used to interact with your cluster; type it in to see the help prompt that will describe some uses. Some handy things are
 hadoop-ec2 push my-cluster filename

to copy a file to the cluster (though, as we'll see below, you can also use S3 to move files around) and
 hadoop-ec2 login my-cluster

to open an ssh session to the master machine of the cluster.

OK, suppose you actually want to run a job. For now, use "hadoop-ec2 push" to transfer all the required files to the cluster, log in, and use the 'hadoop' command to execute tasks just as you would on a local machine. For example, to run one of the bundled examples, you could log in to the master and run
 hadoop jar /usr/local/hadoop-0.19.0/hadoop-0.19.0-examples.jar pi 4 10000 
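
For your own code the workflow is the same: push the jar, log in, and launch it with 'hadoop jar'. Here is a minimal sketch; my-job.jar, my.pkg.MyJob and the input/output paths are placeholder names, not anything Hadoop ships with:

 # On your local machine: copy the job jar to the master.
 hadoop-ec2 push my-cluster my-job.jar
 # Open a shell on the master.
 hadoop-ec2 login my-cluster
 # On the master: run the job (input-dir must exist on HDFS, output-dir must not).
 hadoop jar my-job.jar my.pkg.MyJob input-dir output-dir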


Suppose you want to do some real work. Since you're using Hadoop, you probably have some big files, and repeatedly copying them from your machine to the cluster is a bad idea; furthermore, they may not even fit on the cluster master. Not to worry -- that's what S3 is for! Hadoop has a built-in interface to S3: you refer to files on S3 as
s3n://AWS_ID:AWS_SECRET_KEY@bucket/filename

So, you can list all the files in a bucket by
hadoop fs -ls s3n://AWS_ID:AWS_SECRET_KEY@bucket 

and copy all the files in the bucket to HDFS (using your cluster to distribute the work) by
hadoop distcp s3n://AWS_ID:AWS_SECRET_KEY@bucket/ newDir

which will place all the files on HDFS, in a directory called newDir. If you reverse the last two arguments, it will copy files from HDFS over to S3, so you can use them later.
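
For instance, a quick sketch of the reverse direction -- the directory and bucket names are just placeholders:

 # HDFS -> S3: same command, arguments swapped.
 hadoop distcp newDir s3n://AWS_ID:AWS_SECRET_KEY@bucket/newDir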

If your secret key contains a slash, you will have problems; two possible workarounds are to regenerate the key or to edit the
conf/hadoop-site.xml

file and add

<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>

<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>THE_KEY_ITSELF</value>
</property>


and you won't need to specify the id/key on the command line.
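
Once the keys are in hadoop-site.xml, the earlier commands shrink to something like this (the bucket name is still a placeholder):

 hadoop fs -ls s3n://bucket/
 hadoop distcp s3n://bucket/ newDir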

If some of your files are too big to store on S3 directly (greater than 5 GB in size), you can use s3:// instead of s3n:// and Hadoop will automatically split them, but the files will then be stored in a special Hadoop-only block format and you won't be able to access them with other tools.
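
The commands look the same, only the scheme changes; here is a sketch with the same placeholder names as above (note that the s3:// filesystem reads its keys from fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey rather than the fs.s3n.* properties):

 # Copy an HDFS directory into the S3 block store, and back out again.
 hadoop distcp newDir s3://AWS_ID:AWS_SECRET_KEY@bucket/
 hadoop distcp s3://AWS_ID:AWS_SECRET_KEY@bucket/ newDir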

4 comments:

  1. Regarding slashes, the slash can be replaced with %2F

  2. Silly comment: distcp won't copy a directory when the destination folder already exists. Then again, why should Hadoop, right?

  3. This comment has been removed by the author.

  4. Any secret tips or hints? I can't get this to fully work. For example:

    # bin/hadoop fs -ls s3n://dwh-hdfs-test/tmp
    ls: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/tmp' - ResponseCode=403, ResponseMessage=Forbidden

    But the namenode -format command was able to set up the directories and files on S3... I'm baffled...
