MapReduce Tutorial : Combiners
Sometimes the reduce is a binary operation which is associative and commutative, e.g. +. In that case it is inefficient to produce all the (key, value) pairs in the mappers and send them through the network.
Instead, the reducer can be executed right after the map, on some portion of the values belonging to the same key. Only the results are then sent through the network.
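The saving can be illustrated with a minimal sketch in plain Perl that does not use Hadoop at all. The word list is a hypothetical input split handled by one mapper, and the variable names are chosen only for this illustration. It counts how many (key, value) pairs would leave the mapper with and without local summing.

```perl
use strict;
use warnings;

# Hypothetical input split handled by a single mapper.
my @words = qw(a b a c a b);

# Without a combiner: the mapper emits one (word, 1) pair per occurrence.
my @emitted = map { [$_, 1] } @words;

# With a combiner: pairs with the same key are summed locally first.
my %partial;
$partial{ $_->[0] } += $_->[1] for @emitted;
my @combined = map { [$_, $partial{$_}] } sort keys %partial;

printf "pairs sent without combiner: %d\n", scalar @emitted;
printf "pairs sent with combiner:    %d\n", scalar @combined;
```

For this input, six pairs shrink to three, one per distinct word.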
A Hadoop job can have such a locally executed reducer, called a combiner. If a combiner is specified, the output of a mapper is processed by the combiner before the pairs are sent to the reducer. The combiner may be invoked zero, one or multiple times, usually when the data are written to disk.
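Because the number of combiner invocations is not fixed, the final result must not depend on it. The following sketch, again plain Perl without Hadoop and with hypothetical values for a single key, suggests why summation satisfies this: applying it to arbitrary portions first and then to the partial results gives the same total.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical values emitted by one mapper for a single key.
my @values = (1, 1, 1, 1, 1);

# Combiner not invoked: the reducer sees every value.
my $zero_times = sum(@values);

# Combiner invoked once over all values; the reducer sums the single result.
my $once = sum( sum(@values) );

# Combiner invoked twice, on two separate portions of the values.
my $twice = sum( sum(@values[0 .. 1]), sum(@values[2 .. 4]) );

print "$zero_times $once $twice\n";
```

All three totals are 5 here. An operation that is not associative and commutative would not give this guarantee.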
Typically, the combiner is the same as the reducer of a MR job.
package Mapper;
...
package Reducer;
...
package Main;
use Hadoop::Runner;

my $runner = Hadoop::Runner->new(
  mapper => Mapper->new(),
  combiner => Reducer->new(), # Specify the combiner.
  reducer => Reducer->new(),
  input_format => 'KeyValueTextInputFormat');
...
Exercise
Compare the effect of adding the combiner to a MR job which counts occurrences of words: wc-without-combiner.pl and wc-with-combiner.pl.
How would you explain the results?