Sunday, July 18, 2010

ScalaHadoop

I just pushed ScalaHadoop to GitHub. It's a small piece of infrastructure that makes using Scala together with Hadoop a lot easier.

For example, the main code for a word count app looks like


object WordCount extends ScalaHadoopTool {
  def run(args: Array[String]): Int = {
    (MapReduceTaskChain.init() -->
      IO.Text[LongWritable, Text](args(0)).input -->
      MapReduceTask.MapReduceTask(TokenizerMap1, SumReducer1) -->
      IO.Text[Text, LongWritable](args(1)).output) execute;
    return 0;
  }
}


with the mapper and reducer looking like


object TokenizerMap1 extends TypedMapper[LongWritable, Text, Text, LongWritable] {
  override def doMap: Unit =
    v split " |\t" foreach ((word) => context.write(word, 1L))
}

object SumReducer1 extends TypedReducer[Text, LongWritable, Text, LongWritable] {
  override def doReduce: Unit =
    context.write(k, (0L /: v) ((total, next) => total + next))
}


It's possible to chain map and reduce tasks together, use custom file formats, etc., but a lot of the features aren't documented yet.

Sunday, April 19, 2009

Hadoop & EC2: Adding slaves to an existing cluster

Suppose you have a cluster called my-cluster running, and you wish to add more slaves. This is really easy to do -- simply run


hadoop-ec2 launch-slaves my-cluster 5


to launch 5 additional slaves. It will take a few minutes for the slaves to join the cluster, but you don't need to do anything else.

Stopping nodes is more complicated.

Hadoop Sequence Files

One way to store data that Hadoop works with is through sequence files. These are files, possibly compressed, containing key/value pairs of Writable objects. Hadoop knows how to split sequence files so that jobs can process them in parallel, even when they are compressed, which makes them a convenient way of storing your data without inventing your own format. To write your own data into a sequence file, do something like this:



Configuration conf = new Configuration();
Path path = new Path("filename.of.sequence.file");
org.apache.hadoop.fs.RawLocalFileSystem fs = new org.apache.hadoop.fs.RawLocalFileSystem();
fs.setConf(conf);

SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, BytesWritable.class);

// loop through your data here, appending one key/value pair per record
// (myData stands in for your own collection of byte[] values; BytesWritable wraps a byte[])
for (byte[] value : myData)
    writer.append(new Text("key"), new BytesWritable(value));
writer.close();



Obviously, in the example above, the key is of type Text and the value of type BytesWritable; you can use other types.
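Reading the pairs back is symmetric. Below is a minimal sketch, assuming the same local file and the same key/value types as above (the class name is just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        org.apache.hadoop.fs.RawLocalFileSystem fs = new org.apache.hadoop.fs.RawLocalFileSystem();
        fs.setConf(conf);
        Path path = new Path("filename.of.sequence.file");

        // Walk through every key/value pair in the file.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}

The same pattern works for files on HDFS; just pass the appropriate FileSystem instead of the local one.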

Friday, April 3, 2009

Using the EC2 Hadoop Cluster (and S3).

A previous post describes how to get going and start an EC2 Hadoop cluster. Now we'll see how it can be used.

The "hadoop-ec2" command can be used to interact with your cluster; type it in to see the help prompt that will describe some uses. Some handy things are
 hadoop-ec2 push my-cluster filename

to copy a file to the cluster (though, as we'll see later, you can use S3 to move files around) and
 hadoop-ec2 login my-cluster

to open a ssh terminal to the master machine on the cluster.

OK, suppose you actually want to run a job. For now, use "hadoop-ec2 push" to transfer all the required files to the cluster, log in to the cluster, and use the 'hadoop' command to execute tasks just as you would on a local machine. For example, to run one of the provided examples, you could log in to the master and run
 hadoop jar /usr/local/hadoop-0.19.0/hadoop-0.19.0-examples.jar pi 4 10000 


Suppose you want to do some real work. Since you're using Hadoop, you probably have some big files, and repeatedly copying them from your machine to the cluster is a bad idea; furthermore, they may not even fit on the cluster master. Not to worry -- that's what S3 is for! Hadoop has a built-in interface to S3: you refer to files on S3 as
s3n://AWS_ID:AWS_SECRET_KEY@bucket/filename

So, you can list all the files in a bucket by
hadoop fs -ls s3n://AWS_ID:AWS_SECRET_KEY@bucket 

and copy all the files in the bucket to HDFS (using your cluster to distribute the work) by
hadoop distcp s3n://AWS_ID:AWS_SECRET_KEY@bucket/ newDir

which will place all the files on HDFS in a directory called newDir. If you swap the last two arguments, it will copy files from HDFS to S3 instead, so you can use them later.

If your secret key contains a slash, you will have problems; two options are to regenerate the key, or to edit the
conf/hadoop-site.xml

file and add

<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>

<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>THE_KEY_ITSELF</value>
</property>


and you won't need to specify the id/key on the command line.
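With the keys configured (either in hadoop-site.xml as above, or set directly on a Configuration object), the same s3n paths can also be used from Java code through Hadoop's generic FileSystem API. Here is a minimal sketch that lists a bucket programmatically; the bucket name and class name are placeholders:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBucket {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to the hadoop-site.xml entries above.
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
        conf.set("fs.s3n.awsSecretAccessKey", "THE_KEY_ITSELF");

        // FileSystem.get picks the native S3 implementation based on the URI scheme.
        FileSystem fs = FileSystem.get(URI.create("s3n://bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("s3n://bucket/"))) {
            System.out.println(status.getPath());
        }
    }
}

This is the programmatic counterpart of the "hadoop fs -ls" command above, and the same FileSystem handle can be used to open or copy individual files as well.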

If some of your files are too big to store on S3 directly (greater than 5 GB), you can use s3:// instead of s3n:// and Hadoop will automatically split them, but the files will then be stored in a special Hadoop-only block format and you won't be able to access them using other tools.

Monday, March 30, 2009

SSH for web proxying

OpenSSH can be trivially used as a SOCKS proxy. From your machine, simply run


ssh -D 2000 destination -N


to establish the proxy. The -N flag starts the tunnel without running a shell on the other side and is optional.

Then, tell your web browser to use localhost:2000 as a SOCKS proxy. With Firefox, this is in the Preferences/Network/Connection menu. Alternatively, you can use foxyproxy, a plugin that better manages your proxies.

By default, your DNS requests still go directly to your local DNS server. If you wish to change that, you need to tell your web browser to use the SOCKS proxy to resolve names as well; in the case of foxyproxy, that is under Foxyproxy Options/Global Settings/Miscellaneous.

Why would you want to do this? Well, for one, your HTTP traffic is transmitted in plaintext, so if you are on an untrusted network, someone could be snooping on all your traffic. By proxying it through a secure link to a known machine, you prevent other people on that network from seeing your web browsing, possibly your passwords, and so on.

More interestingly, suppose you have an intranet with resources that are not available to the outside world. By proxying through a machine on the intranet, you can use those resources remotely; for example, by proxying through a machine on your college campus, you can use site-licensed resources like paper archives. Alternatively, if you use Amazon's EC2 (for example, in conjunction with Hadoop), you can use the proxy to reach web services running on your EC2 instances without opening up public ports, even using the machines' internal domain names.

Hadoop, EC2 and S3

Hadoop is a library for performing MapReduce computations on a cluster. Recently, Amazon introduced Elastic MapReduce, which makes it easier to run Hadoop jobs on their infrastructure. However, that comes with extra costs (a higher per-hour price, plus extra S3 usage), and the existing way of using EC2 with Hadoop is already quite good. That is what I'll describe here.

First, you need to get an Amazon AWS account and set up the EC2 API on your machine; Amazon's instructions are pretty clear. You also need to download and unzip the latest version of Hadoop on a local machine. For convenience, add the
src/contrib/ec2/bin
subdirectory of hadoop to your path.

Now, you must edit
 src/contrib/ec2/bin/hadoop-ec2-env.sh  

and set the variables specifying your AWS credentials. For KEYNAME, use the name that you selected when you created a new key; the private key file should be called id_rsa-$KEYNAME, though you can change that on the PRIVATE_KEY_PATH line. You shouldn't need to change anything other than the first five variables in the file.

If you've done things correctly, you should now be able to run
 hadoop-ec2 launch-cluster my-cluster 2

which should churn for a while, create the right groups, permissions, etc., and boot up all the machines. Your cluster is then good to go. "my-cluster" is the name you chose for the cluster and can be anything you want; 2 is the number of slave nodes to launch. If you have an existing cluster running, you can add machines to it by running
 hadoop-ec2 launch-slaves my-cluster n 

to launch n more slaves for that cluster. It will take a few minutes for them to boot up and join the cluster.

The cluster can be turned off with
 hadoop-ec2 terminate-cluster my-cluster


The next Hadoop-related post will talk about how the cluster can be used and how data can be imported from and exported to S3.