MapReduce

From Wikipedia, the free encyclopedia

MapReduce is a software framework originally developed by Google to support parallel computations over large (greater than 100 terabyte) data sets on unreliable clusters of computers. The name is derived from the map and reduce functions commonly used in functional programming.[1]

MapReduce implementations have been written in C++, Python and Java.

Contents

[edit] Dataflow

The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:

  • a input reader
  • a Map function
  • a partition function
  • a compare function
  • a Reduce function
  • a output writer

[edit] Input Reader

The input reader divides the input into 16MB to 128MB splits and the framework assigns one split to each Map function. The input reader reads the data from stable storage (typically a distributed file system like Google File System) and generates key, value pairs.

A common example will read a directory full of text files and return each line as a record.

[edit] Map Function

Each Map function gets series of key, value pairs; processes each; and generates 0 or more output key, value pairs. The input and output types of the map can be and often are different from each other.

If the application is doing a word count, the map function would break the line into words and output the word as the key and 1 as the value.

[edit] Partition Function

The output of all of the maps are allocated to particular reduces by the applications's partition function. The partition function is given the key and the number of reduces and returns the index of the desired reduce.

A typical default is to hash the key and modulo the number of reduces.

[edit] Comparison Function

The input for each reduce is pulled from the machine where the map ran and sorted using the application's comparison function.

[edit] Reduce Function

The framework calls the application's reduce function once for each unique key in the sorted order. The reduce can iterate through the values that are associated with that key and output 0 or more key, value pairs.

In the word count example, the reduce function takes the input values, sums them and generates a single output of the word and the final sum.

[edit] Output Writer

The Output Writer writes the output of the reduce to stable storage, usually a distributed file system, such as Google File System.

[edit] Distribution and reliability

MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network; each node is expected to report back periodically with completed work and status updates. If a node falls silent for longer than that interval, the master node (similar to the master server in the Google File System) records the node as dead, and sends out the node's assigned work to other nodes. Individual operations use atomic operations for naming file outputs as a double check to ensure that there are not parallel conflicting threads running; when files are renamed, it is possible to also copy them to another name in addition to the name of the task (allowing for side-effects).

The reduce operations operate much the same way, but because of their inferior properties with regard to parallel operations, the master node attempts to schedule reduce operations on the same node, or as close as possible to the node holding the data being operated on; this property is desirable for Google as it conserves bandwidth.

[edit] Uses

MapReduce is useful in a wide range of applications, including: "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation..." Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the Internet, and replaced the old ad hoc programs that updated the index and ran the various analyses. [2]

MapReduce's stable inputs and outputs are usually stored in a distributed file system, such as Google File System. The transient data is usually stored on local disk and fetched remotely by the reduces.

[edit] Implementations

  • The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
  • The Hadoop project [1] is a free open source Java MapReduce implementation.
  • A Ruby implementation is available in the Starfish gem [2].

[edit] References

  1. ^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." -"MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
  2. ^ "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes." *"How Google Works"

[edit] External links

[edit] Papers

[edit] Articles

[edit] Software

In other languages