Programmer Land: ScalaHadoop

I just pushed , ScalaHadoop to github. It's a small infrastructure that makes using Scala together with Hadoop a lot easier.

For example, the main code for a word count app looks like


object WordCount extends ScalaHadoopTool{ 
  def run(args: Array[String]) : Int = {  
    (MapReduceTaskChain.init() -->
     IO.Text[LongWritable, Text](args(0)).input                -->  
     MapReduceTask.MapReduceTask(TokenizerMap1, SumReducer1)   -->
     IO.Text[Text, LongWritable](args(1)).output) execute;
    return 0;
  }
}

with the mapper and reducers looking like


object TokenizerMap1 extends TypedMapper[LongWritable, Text, Text, LongWritable] {
  override def doMap : Unit = 
    v split " |\t" foreach ((word) => context.write(word, 1L))
}

object SumReducer1 extends TypedReducer[Text, LongWritable, Text, LongWritable] {
  override def  doReduce : Unit = 
    context.write(k, (0L /: v) ((total, next) => total+next))
}

It's possible to chain map and reduce tasks together, use custom file formats, etc but a lot of the features aren't documented yet.

Programmer Land

Sunday, July 18, 2010

ScalaHadoop

No comments:

Post a Comment

Blog Archive