Sunday, July 18, 2010

ScalaHadoop

I just pushed , ScalaHadoop to github. It's a small infrastructure that makes using Scala together with Hadoop a lot easier.

For example, the main code for a word count app looks like


object WordCount extends ScalaHadoopTool{
def run(args: Array[String]) : Int = {
(MapReduceTaskChain.init() -->
IO.Text[LongWritable, Text](args(0)).input -->
MapReduceTask.MapReduceTask(TokenizerMap1, SumReducer1) -->
IO.Text[Text, LongWritable](args(1)).output) execute;
return 0;
}
}


with the mapper and reducers looking like


object TokenizerMap1 extends TypedMapper[LongWritable, Text, Text, LongWritable] {
override def doMap : Unit =
v split " |\t" foreach ((word) => context.write(word, 1L))
}

object SumReducer1 extends TypedReducer[Text, LongWritable, Text, LongWritable] {
override def doReduce : Unit =
context.write(k, (0L /: v) ((total, next) => total+next))
}


It's possible to chain map and reduce tasks together, use custom file formats, etc but a lot of the features aren't documented yet.