A previous
post describes how to get set up and start an EC2 Hadoop cluster. Now, we'll see how it can be used.
The "hadoop-ec2" command can be used to interact with your cluster; type it in to see the help prompt that will describe some uses. Some handy things are
hadoop-ec2 push my-cluster filename
to copy a file to the cluster (though, as we'll see below, you can also use S3 to move files around) and
hadoop-ec2 login my-cluster
to open an SSH session to the cluster's master machine.
OK, suppose you actually want to do a job. For now, use "hadoop-ec2 push" to transfer all the required files to the cluster, log in, and use the 'hadoop' command to run tasks, just as you would on a local machine. For example, to run one of the bundled examples, you could log in to the master and run
hadoop jar /usr/local/hadoop-0.19.0/hadoop-0.19.0-examples.jar pi 4 10000
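For your own jobs the workflow is the same: push the jar and any input files, log in, and run them with 'hadoop'. Here's a rough sketch (the jar, class, and file names are just placeholders):
hadoop-ec2 push my-cluster wordcount.jar
hadoop-ec2 push my-cluster input.txt
hadoop-ec2 login my-cluster
and then, on the master,
hadoop fs -put input.txt input
hadoop jar wordcount.jar org.myorg.WordCount input output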
Suppose you want to do some real work. Since you're using Hadoop, you probably have some big files, and repeatedly copying them from your machine to the cluster is a bad idea; they may not even fit on the cluster master. Not to worry -- that's what S3 is for! Hadoop has a built-in interface to S3: you refer to files on S3 as
s3n://AWS_ID:AWS_SECRET_KEY@bucket/filename
So, you can list all the files in a bucket by
hadoop fs -ls s3n://AWS_ID:AWS_SECRET_KEY@bucket
and copy all the files in the bucket to HDFS (using your cluster to distribute the work) by
hadoop distcp s3n://AWS_ID:AWS_SECRET_KEY@bucket/ newDir
which will place all the files in an HDFS directory called newDir. If you swap the last two arguments, it'll copy files from HDFS to S3 instead, so you can keep them around for later.
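For example, to copy results back out of HDFS (the directory name here is just a placeholder), something like
hadoop distcp newDir s3n://AWS_ID:AWS_SECRET_KEY@bucket/newDir
should upload the contents of the HDFS directory newDir to your bucket.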
If your secret key contains a slash, you will run into problems; two options are to regenerate the key, or to edit the
conf/hadoop-site.xml
file and add
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>THE_KEY_ITSELF</value>
</property>
and you won't need to specify the id/key on the command line.
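With those properties in place, the earlier commands shrink to, for example,
hadoop fs -ls s3n://bucket
and
hadoop distcp s3n://bucket/ newDir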
If some of your files are too big to store on S3 directly (larger than 5 GB), you can use s3:// instead of s3n:// and Hadoop will automatically split them, but the files will then be stored in a special Hadoop-only block format and you won't be able to access them with other tools.
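For instance, to push a large HDFS directory into the block store, something along the lines of
hadoop distcp newDir s3://AWS_ID:AWS_SECRET_KEY@bucket/newDir
should work (the directory name is again a placeholder), and you'd read it back later with the same s3:// URL -- just remember that only Hadoop will understand what ends up in that bucket.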