This post originated from an RSS feed registered with Scala Buzz
by David Bernard.
Original Post: A Scalable Language, and a Scalable Framework
Feed Title: Scala Blog
Feed URL: http://www.scala-blogs.org/feeds/posts/default?alt=rss
I'm a pretty big fan of the MapReduce framework. With two fairly simple classes, a Map and a Reduce (influenced by, but not the same as, the functional programming constructs of the same name), you can easily write programs that operate on terabytes of data. Apache's Hadoop is a popular open-source implementation of MapReduce, and it's used by Yahoo, Amazon, and Facebook, to name a few.
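To make the model concrete, here is a tiny single-machine sketch of the MapReduce contract in plain Scala (the name `MiniMapReduce` is mine for illustration; this is not Hadoop's API): the map function turns one input record into zero or more key/value pairs, and the reduce function folds all values that share a key into a result.

```scala
object MiniMapReduce {
  // A toy, in-memory version of the MapReduce contract:
  //   map:    (K, V) => Seq[(K2, V2)]   -- emit pairs per input record
  //   reduce: (K2, Seq[V2]) => R        -- fold the values for each key
  def run[K, V, K2, V2, R](records: Seq[(K, V)])
                          (map: (K, V) => Seq[(K2, V2)])
                          (reduce: (K2, Seq[V2]) => R): Map[K2, R] =
    records.flatMap { case (k, v) => map(k, v) }   // "map" phase
           .groupBy(_._1)                          // "shuffle": group by key
           .map { case (k2, pairs) =>              // "reduce" phase
             (k2, reduce(k2, pairs.map(_._2)))
           }
}
```

A real framework distributes the map and reduce phases across machines and spills to disk; the point here is only the shape of the two functions.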
I'm also a pretty big fan of Scala, and luckily Hadoop is written in Java, which means you can use Hadoop from Scala pretty easily. Unfortunately, like many Java libraries, Hadoop requires a lot of boilerplate, and it's of course not well integrated with the higher-order paradigms that Scala supports so well. But maybe we can fix that.
Since my group is just now moving to both Scala (slowly) and Hadoop (somewhat more quickly), I thought it would be good to help combine the two. The result is SMR, which provides a wrapper around Hadoop to make everything much more Scala-like.
WordCount: Java
A pretty basic introduction to MapReduce is the word count example. The task: given a collection of files, segment all the words in them and return a list of each word along with the number of times it appears. Here's the vanilla Java Hadoop example, taken from Hadoop's tutorial, leaving out imports:
```java
public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
```
The basic idea is pretty simple. For the map: take each line in each file, tokenize it, and for each word emit a pair of that word and 1. The reduce receives a word and an iterator of counts and then sums them together, and emits the word with the sum.
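The same two steps can be sketched with ordinary Scala collections. This is a local, single-machine illustration of the logic, not Hadoop code:

```scala
object LocalWordCount {
  // "map" phase: tokenize each line and emit a (word, 1) pair per token;
  // "shuffle": group the pairs by word;
  // "reduce" phase: sum the counts in each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split("\\s+"))
         .map(word => (word, 1))
         .groupBy(_._1)
         .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    wordCount(Seq("the quick brown fox", "the lazy dog")).foreach(println)
}
```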
WordCount: Scala
The problem is that this remarkably simple program gets lost in all that boilerplate. Let's take a look at how SMR tackles the problem:
```scala
object WordCount {
  def main(args: Array[String]) {
    val (h, remainingArgs) = Hadoop.fromArgs(args, new Path("output"));
    val words = for ((offset, line) <- h.loadLines(remainingArgs);
                     word <- line.split(" \n\r\t\f"))
                yield (word, 1);
    // (the excerpt ends here; the reduce step that sums the counts is not shown)
```
Both programs accomplish the same task, but one takes 1/4 the number of lines. There are several reasons for the disparity, but a few key features of Scala account for most of it:
Syntactic Support for Map: By treating for loops as syntactic sugar for map and its cousins, Scala allows you to create new classes that interact with the primitive looping construct in intuitive ways.
Strong Type System: Scala's type system not only takes a lot of the boilerplate away from the code, but it makes library writers' lives easier by making much richer type information available at compile time. In SMR, the type system is used to automatically select the correct "input format type" and figure out what the types of the Maps and the Reduces are.
Higher-Order and Anonymous Functions: The combination of functions that take other functions and the ability to create new functions on the fly easily and succinctly obviates a lot of the boilerplate associated with defining a new "Mapper" and "Reducer" for every task. Instead, the compiler automatically creates the functions that perform the task for you.
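As a small illustration of the first and third points, a for-comprehension over two generators, like the one in the SMR example, is exactly syntactic sugar for nested flatMap/map calls built from anonymous functions:

```scala
object ForDesugaring {
  // A for-comprehension with two generators...
  def pairs(lines: Seq[String]): Seq[(String, Int)] =
    for (line <- lines; word <- line.split(" ")) yield (word, 1)

  // ...is what the compiler rewrites into this: an anonymous function
  // passed to flatMap, whose body maps another anonymous function.
  def pairsDesugared(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(line => line.split(" ").map(word => (word, 1)))
}
```

Because the rewrite is purely mechanical, a library like SMR only has to provide suitable `map` and `flatMap` methods on its distributed collections, and for-comprehensions over them work automatically.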
SMR is still very much in development, as my group is still transitioning to Hadoop, but I hope that people find the code useful in scaling up their problems to deal with terabytes and terabytes of data.