Sunday, April 19, 2009

Hadoop & EC2: Adding slaves to an existing cluster

Suppose you have a cluster called my-cluster running, and you wish to add more slaves. This is really easy to do -- simply run


hadoop-ec2 launch-slaves my-cluster 5


to launch 5 additional slaves. It will take a few minutes for the slaves to join the cluster, but you don't need to do anything else.

Stopping nodes is more complicated.

Hadoop Sequence Files

One way to store data for Hadoop to work with is through sequence files. These are files, possibly compressed, containing Writable key/value pairs. Hadoop can split sequence files to run jobs in parallel, even when they are compressed, making them a convenient way of storing your data without inventing your own format. To import your own data into a sequence file, execute something like:



// Configuration, Path, SequenceFile, Text and BytesWritable live in org.apache.hadoop.{conf,fs,io}
Configuration conf = new Configuration();
Path path = new Path("filename.of.sequence.file");
org.apache.hadoop.fs.RawLocalFileSystem fs = new org.apache.hadoop.fs.RawLocalFileSystem();
fs.setConf(conf); // give the file system a configuration before writing

SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, BytesWritable.class);

for ( /* loop through your data here */ ) {
    // BytesWritable wraps a byte[], so convert string values first
    writer.append(new Text("key"), new BytesWritable("value".getBytes()));
}
writer.close();



Obviously, in the example above, the key is of type Text and the value of type BytesWritable; you can use other types.
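
Reading the data back follows the same pattern with SequenceFile.Reader. Here is a minimal sketch, assuming the file name and key/value types from the writer above; next() fills in the key and value objects and returns false at the end of the file:

Configuration conf = new Configuration();
Path path = new Path("filename.of.sequence.file");
org.apache.hadoop.fs.RawLocalFileSystem fs = new org.apache.hadoop.fs.RawLocalFileSystem();
fs.setConf(conf);

SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
BytesWritable value = new BytesWritable();

while (reader.next(key, value)) {
    // do something with the pair; here we just print the key
    System.out.println(key);
}
reader.close();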

Friday, April 3, 2009

Using the EC2 Hadoop Cluster (and S3).

A previous post describes how to get going and start an EC2 Hadoop cluster. Now, we'll see how it can be used.

The "hadoop-ec2" command can be used to interact with your cluster; type it in to see the help prompt that will describe some uses. Some handy things are
 hadoop-ec2 push my-cluster filename

to copy a file to the cluster (though, as we'll see later, you can use S3 to move files around) and
 hadoop-ec2 login my-cluster

to open an SSH terminal to the master machine on the cluster.

OK, suppose you actually want to do a job. For now, use "hadoop-ec2 push" to transfer all the required files to the cluster, log in, and use the 'hadoop' command to execute tasks, just as you would on a local machine. For example, to run one of the provided examples, you could log in to the master and run
 hadoop jar /usr/local/hadoop-0.19.0/hadoop-0.19.0-examples.jar pi 4 10000 


Suppose you want to do some real work. Since you're using Hadoop, you probably have some big files, and repeatedly copying them from your machine to the cluster is a bad idea; furthermore, they may not even fit on the cluster master. Not to worry -- that's what S3 is for! Hadoop has a built-in interface to S3: you refer to files on S3 as
s3n://AWS_ID:AWS_SECRET_KEY@bucket/filename

So, you can list all the files in a bucket by
hadoop fs -ls s3n://AWS_ID:AWS_SECRET_KEY@bucket 

and copy all the files in the bucket to HDFS (using your cluster to distribute the work) by
hadoop distcp s3n://AWS_ID:AWS_SECRET_KEY@bucket/ newDir

which will place all the files on HDFS, in a directory called newDir. If you reverse the last two arguments, it will copy files from HDFS to S3 instead, so you can use them later.

If your secret key contains a slash, you will have problems; two possible fixes are to regenerate the key or to edit the
conf/hadoop-site.xml

file and add

<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>

<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>THE_KEY_ITSELF</value>
</property>


and you won't need to specify the id/key on the command line.
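
Once the keys are in the configuration, you can also reach S3 from your own Java code through Hadoop's FileSystem API. A minimal sketch, assuming conf/hadoop-site.xml is on the classpath and using a placeholder bucket name:

Configuration conf = new Configuration(); // picks up hadoop-site.xml, including the S3 keys
FileSystem s3 = FileSystem.get(java.net.URI.create("s3n://bucket/"), conf);

// list everything at the top level of the bucket
for (FileStatus status : s3.listStatus(new Path("/"))) {
    System.out.println(status.getPath());
}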

If some of your files are too big to store on S3 directly (greater than 5 GB in size), you can use s3:// instead of s3n:// and Hadoop will automatically split them, but the files will then be stored in a special Hadoop-only block format and you won't be able to access them with other tools.