The Artima Developer Community

Java Community News
IBM Releases MapReduce Tools

Frank Sommers

Posts: 2642
Nickname: fsommers
Registered: Jan, 2002

IBM Releases MapReduce Tools Posted: Mar 29, 2007 3:01 PM
A new IBM alphaWorks tool makes it easier to work with the popular distributed computing technique MapReduce and the open-source Apache MapReduce implementation, Hadoop.

MapReduce is a distributed computing technique popularized by Google: it extends the functional programming constructs map and reduce with the ability to execute in parallel across a compute cluster. While map iterates over the elements of a collection, applying some function to each element, reduce computes a single value from the collection's elements. Map operations, and to a lesser extent reduce operations, can be performed in parallel, speeding up both.
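The map-and-reduce idea described above can be sketched in plain, single-process Java using the streams API; this is only an illustration of the two constructs, not Hadoop's distributed API, and the class and method names here are hypothetical:

```java
import java.util.List;

public class MapReduceSketch {
    // Single-process sketch of the map/reduce pattern (illustrative names,
    // not Hadoop APIs): map applies a function per element, reduce folds
    // the results into one value. parallelStream() lets the map steps
    // run concurrently, mirroring MapReduce's parallelism on one machine.
    static int sumOfSquares(List<Integer> input) {
        return input.parallelStream()
                    .map(x -> x * x)          // map: per-element function
                    .reduce(0, Integer::sum); // reduce: fold to one value
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4))); // prints 30
    }
}
```

Because each map step depends only on its own element, the steps can run in any order or in parallel; reduce then combines the per-element results, which is what a MapReduce framework does across machines rather than threads.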

Hadoop, an Apache project, is an open-source implementation of the MapReduce technique, and is also a distributed computing framework built around MapReduce. With its origins in Nutch, a subproject of Apache Lucene, Hadoop can execute map operations on large files by automatically splitting a large file into smaller segments:

As the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it will affect seek time, it will be split into several Splits. The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. A new map task is then created per FileSplit.
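The byte-boundary splitting described in the quote above can be sketched in a few lines of plain Java; this is a simplified stand-in for what a framework like Hadoop does, with illustrative names rather than Hadoop's actual classes:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Illustrative sketch: cut input bytes into fixed-size splits,
    // ignoring logical structure such as line boundaries (as the
    // Hadoop documentation quoted above describes).
    static List<byte[]> split(byte[] data, int splitSize) {
        List<byte[]> splits = new ArrayList<>();
        for (int off = 0; off < data.length; off += splitSize) {
            int len = Math.min(splitSize, data.length - off);
            byte[] piece = new byte[len];
            System.arraycopy(data, off, piece, 0, len);
            splits.add(piece);
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] text = "line one\nline two\n".getBytes(StandardCharsets.UTF_8);
        // A 7-byte split size cuts mid-line, showing that splits fall
        // on arbitrary byte boundaries, not line boundaries.
        for (byte[] s : split(text, 7)) {
            System.out.println("[" + new String(s, StandardCharsets.UTF_8) + "]");
        }
    }
}
```

Because splits can begin mid-record, the real framework's record readers must resynchronize on record boundaries, which is why the split logic itself can stay oblivious to file structure.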

Hadoop provides several other facilities as well for operating on large amounts of data, such as search indexes, in a distributed fashion.

IBM's alphaWorks project recently released an Eclipse-based tool, MapReduce Tools, for working with MapReduce, and in particular with Hadoop. According to the project's documentation,

The plug-in automatically creates projects with the Hadoop libraries for development and testing. Templates for MapReduce drivers are also provided. After a project is completed, the plug-in uses SCP (secure copy) to deploy the code to a Hadoop server and then remotely executes it via SSH (secure shell). During execution, the plug-in communicates with the Hadoop task tracker via HTTP and displays the job status.

What do you think of Hadoop and IBM's new MapReduce Tools?

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use