Sunday, April 19, 2009

Hadoop & EC2: Adding slaves to an existing cluster

Suppose you have a cluster called my-cluster running, and you wish to add more slaves. This is really easy to do -- simply run


hadoop-ec2 launch-slaves my-cluster 5


to launch 5 additional slaves. It will take a few minutes for the slaves to join the cluster, but you don't need to do anything else.

Stopping nodes is more complicated.

Hadoop Sequence Files

One way to store data for Hadoop to work with is through sequence files. These are files, possibly compressed, containing Writable key/value pairs. Hadoop can split sequence files to run jobs in parallel, even when they are compressed, making them a convenient way of storing your data without inventing your own format. To import your own data into a sequence file, execute something like:



// Configuration, Path, SequenceFile, Text and BytesWritable live in org.apache.hadoop.{conf,fs,io}
Configuration conf = new Configuration();
Path path = new Path("filename.of.sequence.file");
org.apache.hadoop.fs.RawLocalFileSystem fs = new org.apache.hadoop.fs.RawLocalFileSystem();
fs.setConf(conf); // give the file system a configuration before writing

SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, BytesWritable.class);

for ( /* loop through your data here */ ) {
    // BytesWritable wraps a byte[], so convert string values first
    writer.append(new Text("key"), new BytesWritable("value".getBytes()));
}
writer.close();



Obviously, in the example above, the key is of type Text and the value of type BytesWritable; you can use other types.
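
Reading the data back follows the same pattern with SequenceFile.Reader. Here is a minimal sketch, assuming the file name and key/value types from the writer above; next() fills in the key and value objects and returns false at the end of the file:

Configuration conf = new Configuration();
Path path = new Path("filename.of.sequence.file");
org.apache.hadoop.fs.RawLocalFileSystem fs = new org.apache.hadoop.fs.RawLocalFileSystem();
fs.setConf(conf);

SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
BytesWritable value = new BytesWritable();

while (reader.next(key, value)) {
    // do something with the pair; here we just print the key
    System.out.println(key);
}
reader.close();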

Friday, April 3, 2009

Using the EC2 Hadoop Cluster (and S3).

A previous post describes how to get going and start an EC2 Hadoop cluster. Now, we'll see how it can be used.

The "hadoop-ec2" command can be used to interact with your cluster; type it in to see the help prompt that will describe some uses. Some handy things are
 hadoop-ec2 push my-cluster filename

to copy a file to the cluster (though, as we'll see later, you can use S3 to move files around) and
 hadoop-ec2 login my-cluster

to open an SSH terminal to the master machine on the cluster.

OK, suppose you actually want to do a job. For now, use "hadoop-ec2 push" to transfer all the required files to the cluster, log in, and use the 'hadoop' command to execute tasks, just as you would on a local machine. For example, to run one of the provided examples, you could log in to the master and run
 hadoop jar /usr/local/hadoop-0.19.0/hadoop-0.19.0-examples.jar pi 4 10000 


Suppose you want to do some real work. Since you're using Hadoop, you probably have some big files, and repeatedly copying them from your machine to the cluster is a bad idea; furthermore, they may not even fit on the cluster master. Not to worry -- that's what S3 is for! Hadoop has a built-in interface to S3: you refer to files on S3 as
s3n://AWS_ID:AWS_SECRET_KEY@bucket/filename

So, you can list all the files in a bucket by
hadoop fs -ls s3n://AWS_ID:AWS_SECRET_KEY@bucket 

and copy all the files in the bucket to HDFS (using your cluster to distribute the work) by
hadoop distcp s3n://AWS_ID:AWS_SECRET_KEY@bucket/ newDir

which will place all the files on HDFS, in a directory called newDir. If you reverse the last two arguments, it will copy files from HDFS to S3 instead, so you can use them later.

If your secret key contains a slash, you will have problems; two possible fixes are to regenerate the key or to edit the
conf/hadoop-site.xml

file and add

<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>

<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>THE_KEY_ITSELF</value>
</property>


and you won't need to specify the id/key on the command line.
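
Once the keys are in the configuration, you can also reach S3 from your own Java code through Hadoop's FileSystem API. A minimal sketch, assuming conf/hadoop-site.xml is on the classpath and using a placeholder bucket name:

Configuration conf = new Configuration(); // picks up hadoop-site.xml, including the S3 keys
FileSystem s3 = FileSystem.get(java.net.URI.create("s3n://bucket/"), conf);

// list everything at the top level of the bucket
for (FileStatus status : s3.listStatus(new Path("/"))) {
    System.out.println(status.getPath());
}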

If some of your files are too big to store on S3 directly (greater than 5 GB in size), you can use s3:// instead of s3n:// and Hadoop will automatically split them, but the files will then be stored in a special Hadoop-only block format and you won't be able to access them with other tools.