Hadoop is a framework for handling Big Data. It uses HDFS as the distributed storage layer and MapReduce as the parallel processing paradigm for data residing in HDFS.
The key components of MapReduce are the Mapper and the Reducer.
When a MapReduce job runs on a large dataset, the Mappers generate a large volume of intermediate data that is passed to the Reducers for further processing, which can lead to massive network congestion. The MapReduce framework offers a function known as the 'Combiner' that can play a crucial role in reducing this congestion.
The Combiner is also known as a "mini-reduce" process because it operates only on the data generated by a single machine. It reduces the data at the machine level so that less data has to be transferred to the Reducers.
Consider the Hive query below:
select a,count(a) from stock_data group by a;
Here the filtering of the data, i.e. selecting column a from the dataset, is done on the Mapper side, and the aggregation of the values is done on the Reducer side.
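The division of work described above can be sketched in pure Python (the table name and rows are made-up sample data; real Hadoop would run the mapper and reducer as distributed Java tasks):

```python
from collections import defaultdict

def mapper(row):
    # Emit the grouping column with a count of 1, one pair per input row.
    yield (row["a"], 1)

def reducer(key, values):
    # Aggregate all counts for one key, like count(a) in the Hive query.
    return (key, sum(values))

# Hypothetical sample rows standing in for the stock_data table.
stock_data = [{"a": "AAPL"}, {"a": "MSFT"}, {"a": "AAPL"}, {"a": "AAPL"}]

# Map phase: one (key, 1) pair per input row.
pairs = [kv for row in stock_data for kv in mapper(row)]

# Shuffle phase: group values by key (the framework does this for us).
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# Reduce phase: one reducer call per key.
result = dict(reducer(k, vs) for k, vs in groups.items())
print(result)  # {'AAPL': 3, 'MSFT': 1}
```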
We know that the output from the mappers is stored on the local file system, and then the shuffle-and-sort phase starts: the sort phase guarantees that the keys presented to a reducer arrive in ascending order, and the shuffle phase passes the values for each key to the reducer as an iterable. This shuffle-and-sort step is the most time-consuming part of a MapReduce job, because the data travels to the reducers over the network, and the sorting and shuffling consume resources as well.
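The sort-and-shuffle guarantee can be mimicked in a few lines of Python (the word pairs are made-up sample data): sorting the pairs and then grouping them yields each key in ascending order with its values collected into an iterable, which is exactly what a reducer sees.

```python
from itertools import groupby

# Hypothetical intermediate (key, value) pairs from the mappers.
pairs = [("cat", 1), ("apple", 1), ("cat", 1), ("bat", 1)]

# Sort by key, then group: keys arrive in ascending order and the values
# for each key are handed over as one iterable, as the framework guarantees.
grouped = {key: [v for _, v in group]
           for key, group in groupby(sorted(pairs), key=lambda kv: kv[0])}
print(grouped)  # {'apple': [1], 'bat': [1], 'cat': [1, 1]}
```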
To reduce the shuffling time we use a Combiner as a workaround: it lowers the processing load on the reducer by cutting down the number of values that need to be iterated over and sent to the reducer.
We need the summation of the values for each key, this summation happens on the reducer side, and the values are delivered to the reducer as an iterable by the sort-and-shuffle phase. The more values there are, the longer it takes to shuffle them and send them to the reducer.
One workaround for this is to reduce the number of values that need to be shuffled for each key.
Notice that the output generated by the mappers can be shrunk by reducing the number of interim key-value pairs they produce. A simple way to do this is to apply the reduce function to the interim output of each mapper before that data is written to the local hard drive.
And this is where the combiner comes into action.
Compare the key-value pairs generated by the framework without the Combiner and with the Combiner.
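The difference can be made concrete with a small simulation (a hypothetical word-count job with two mappers; the input lines are made-up sample data). Counting the pairs that would cross the network shows the saving:

```python
from collections import Counter

# Two hypothetical input splits, one per mapper.
splits = [["to", "be", "or", "not", "to", "be"],
          ["to", "see", "or", "not", "to", "see"]]

def map_phase(words):
    # Each mapper emits one (word, 1) pair per word.
    return [(w, 1) for w in words]

def combine(pairs):
    # Same aggregation as the reducer, applied to one mapper's output.
    counts = Counter()
    for k, v in pairs:
        counts[k] += v
    return list(counts.items())

without = [kv for s in splits for kv in map_phase(s)]
with_combiner = [kv for s in splits for kv in combine(map_phase(s))]

print(len(without))        # 12 pairs would be shuffled over the network
print(len(with_combiner))  # 8 pairs would be shuffled over the network
```

The final per-key totals are identical either way; only the number of shuffled pairs changes.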
A Combiner is code that performs the same action as the reducer, but on the output of a mapper, before that output reaches the sort-and-shuffle phase.
Note: a Combiner runs on the output of each individual map task, independently of the other mappers. Defining a Combiner does not guarantee that it will execute: the framework may run it zero, one, or several times. This does not affect the overall output, because the reducer still performs the final aggregation before the result is written to HDFS; for this to hold, the combine function must be commutative and associative, as summation is.
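This "zero, one, or many times" property can be checked directly in a small sketch (the pairs are made-up sample data): applying a sum-based combine step any number of times before the final reduce leaves the result unchanged, precisely because summation is commutative and associative.

```python
from collections import Counter

# Hypothetical intermediate pairs from a mapper.
pairs = [("x", 1), ("y", 1), ("x", 1), ("x", 1)]

def combine(ps):
    # Partial aggregation; may legally run any number of times.
    counts = Counter()
    for k, v in ps:
        counts[k] += v
    return list(counts.items())

def reduce_all(ps):
    # Final aggregation performed by the reducer.
    counts = Counter()
    for k, v in ps:
        counts[k] += v
    return dict(counts)

# Zero, one, or two combiner passes: the final output is identical.
assert reduce_all(pairs) \
    == reduce_all(combine(pairs)) \
    == reduce_all(combine(combine(pairs)))
print(reduce_all(pairs))  # {'x': 3, 'y': 1}
```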
The reason a declared Combiner may not execute is that it works on the spills of data (the interim files a mapper writes to the local file system). By default, the minimum number of spills required for the Combiner to run is 3.
This is configurable: set the parameter "min.num.spills.for.combine" to 1 in mapred-site.xml (it is a MapReduce setting, not an HDFS one); then the Combiner will execute even when a mapper produces only a single spill.
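An illustrative configuration fragment (property name as used by classic MapReduce; the surrounding file structure is the standard Hadoop configuration layout):

```xml
<!-- mapred-site.xml: lower the spill threshold so the Combiner
     runs even when a mapper produces a single spill file. -->
<property>
  <name>min.num.spills.for.combine</name>
  <value>1</value>
</property>
```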