Hadoop MapReduce is a software framework for developing applications that process large datasets in parallel in a reliable, fault-tolerant manner. A MapReduce application processes the input dataset in chunks, in parallel, on multiple nodes.

The diagram below shows the different phases of a MapReduce application:

MapReduce Framework Phases

There are two main phases: the Map phase and the Reduce phase. Each input data chunk is first processed in the Map phase, and its output is then fed to the Reduce phase, which finally generates the resulting dataset.

MapReduce Key Terminology

The diagram below shows the MapReduce application flow, i.e. how the different phases of MapReduce interact with the data:

MapReduce Job Execution Flow

Inputs and Outputs: The framework operates exclusively on key-value pairs, which means that the input dataset fed to the Map phase, the intermediate results of the Map phase, and the final output of the application are all processed as <key, value> pairs.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
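For example, in the classic word-count application (assuming TextInputFormat supplies the input), the concrete types would be:

(input) <LongWritable, Text> -> map -> <Text, IntWritable> -> combine -> <Text, IntWritable> -> reduce -> <Text, IntWritable> (output)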

Mapper: It maps the input dataset, presented as <key, value> pairs, to a set of intermediate <key, value> records. The structure and number of the intermediate output records may be the same as or different from those of the input chunk. The logic for producing this intermediate output is defined in the map() method of the application. These Mappers run on the different chunks of data available on different DataNodes and produce the output for their own chunk.
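As an illustration, a minimal word-count Mapper written against the classic org.apache.hadoop.mapred API referenced in this article (OutputCollector, Reporter) might look like the sketch below; the class and field names are chosen here purely for illustration:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        // Emits an intermediate <word, 1> pair for every token in the input line.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }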

The number of map tasks executed by a MapReduce application depends on the total size of the input, that is, the total number of blocks of the input files.

Reducer: It reduces the set of intermediate values produced by the Mappers for each key to a smaller set of values. It has three phases:

  1. Shuffle: In this phase, the sorted output of the Mappers is fetched by the Reducer over HTTP.
  2. Sort: The framework groups the Reducer inputs by key; because different Mappers may have emitted the same key, their outputs are merged and sorted in this phase.

The shuffle and sort phases occur simultaneously: while the Mapper outputs are being fetched, they are also merged and sorted.

  3. Reduce: In this phase, the reduce() method is called once for each <key, (list of values)> pair in the grouped inputs.

The output of the reduce task is written to the filesystem by OutputCollector.collect(). The output of the reducer is not sorted.
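For illustration, a matching word-count Reducer in the same classic org.apache.hadoop.mapred API might look like the following sketch (class name chosen here for illustration):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        // Called once per key with an iterator over all values grouped under that key.
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }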

Reducer NONE: It is possible to have zero Reducers if no reduction is needed. In this case, the output of the Mappers is written directly to the filesystem without sorting.
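With the classic JobConf API, this is typically requested by setting the number of reduce tasks to zero, for example (WordCount is an assumed driver class used only for illustration):

    JobConf conf = new JobConf(WordCount.class);   // assumed driver class
    conf.setNumReduceTasks(0);                     // map output goes straight to the output path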

Partitioner: It partitions the key space, i.e. it controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of it) is used to derive the partition, typically via a hash function. The total number of partitions is the same as the number of reduce tasks. HashPartitioner is the default Partitioner. The Partitioner therefore controls which of the m reduce tasks an intermediate key (and hence its record) is sent to for reduction.
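As a sketch, a custom Partitioner in the classic API mirroring the default hash-based behaviour could look as follows (class name chosen here for illustration):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class WordPartitioner implements Partitioner<Text, IntWritable> {

        public void configure(JobConf job) {
            // no configuration needed for this simple example
        }

        // Mimics the default HashPartitioner: hash of the key modulo the number of reduce tasks.
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }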

Reporter: It allows the MapReduce application to report progress, update Counters and set application-level status messages. Mappers and Reducers use the Reporter to report progress and to indicate that they are alive.

OutputCollector: It is the facility the MapReduce framework provides to collect the data output by the Mapper and the Reducer.
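For example, inside a map() method of a Mapper such as the WordCountMapper sketched earlier, these two facilities might be used together as follows (the counter group and counter name are assumptions chosen for illustration):

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        // report a status message and bump an application-level counter
        reporter.setStatus("processing record at offset " + key.get());
        reporter.incrCounter("WordCount", "RECORDS_SEEN", 1);

        // hand results back to the framework via the OutputCollector
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            output.collect(new Text(tokenizer.nextToken()), new IntWritable(1));
        }
    }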

Job Input: InputFormat describes the input specification for a MapReduce job. The framework relies on the InputFormat of the job to:

  1. Validate the input-specification.
  2. Split the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
  3. Provide the RecordReader implementation used to parse input records from the logical InputSplit.

The default behaviour of file-based InputFormat implementations, typically sub-classes of FileInputFormat, is to split the input into logical InputSplit instances based on the total size, in bytes, of the input files.

However, the FileSystem block-size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size.
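For example, with the classic JobConf API this could be configured as sketched below; the driver class, input path and split-size value are assumptions used only for illustration:

    JobConf conf = new JobConf(WordCount.class);                     // assumed driver class
    conf.setInputFormat(TextInputFormat.class);                      // a file-based InputFormat (FileInputFormat subclass)
    FileInputFormat.setInputPaths(conf, new Path("/data/input"));    // assumed input path
    // lower bound on split size, in bytes (here 64 MB, for illustration)
    conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);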

In many cases, logical splits based purely on input size are not sufficient to precisely define record boundaries, since a split may begin or end in the middle of a record. In such cases the application should rely on (or implement) a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.

InputSplit: It represents the data to be processed by an individual Mapper. Typically an InputSplit presents a byte-oriented view of the input, and it is the responsibility of the RecordReader to process it and present a record-oriented view. FileSplit is the default InputSplit; it sets map.input.file to the path of the input file for the logical split.
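For instance, in the classic API a Mapper (such as the one sketched earlier) can read this property in its configure() method:

    private String inputFile;

    // Called once per task before map(); the JobConf carries split-specific properties.
    public void configure(JobConf job) {
        inputFile = job.get("map.input.file");   // path of the file backing this task's split
    }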

RecordReader: It reads <key, value> pairs from an InputSplit. Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, into a record-oriented view and presents it to the Mapper implementations for processing. The RecordReader thus assumes the responsibility of handling record boundaries and presents the tasks with keys and values.
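To make the relationship concrete, the sketch below shows, in simplified form, how the framework drives a RecordReader to feed a Mapper (roughly what MapRunner does in the classic API); this is an illustrative sketch, not the framework's actual code:

    import java.io.IOException;

    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class MapRunnerSketch {

        // Drives a Mapper over one InputSplit: the RecordReader turns the
        // byte-oriented split into typed <key, value> records.
        public static <K1, V1, K2, V2> void runMap(
                RecordReader<K1, V1> reader,
                Mapper<K1, V1, K2, V2> mapper,
                OutputCollector<K2, V2> output,
                Reporter reporter) throws IOException {

            K1 key = reader.createKey();
            V1 value = reader.createValue();
            try {
                // next() fills key/value with the next record and returns false at the end of the split
                while (reader.next(key, value)) {
                    mapper.map(key, value, output, reporter);
                }
            } finally {
                reader.close();
            }
        }
    }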

Reference: Apache
