Sometimes the reduce operation is a binary operation that is associative and commutative, e.g. +. In that case it is inefficient to produce all the (key, value) pairs in the mappers and send them through the network.
Instead, the reducer can be executed right after the map, on some portion of the values belonging to the same key. Only the aggregated results are then sent through the network.
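The saving described above can be illustrated with a minimal sketch in plain Perl. It does not use Hadoop or Hadoop::Runner; the names map_line, sum_by_key and run_job and the two tiny input splits are hypothetical, chosen only to count how many pairs would cross the network in a word count.

```perl
use strict;
use warnings;

# Mapper: emit one (word, 1) pair per word of the input line.
sub map_line {
    my ($line) = @_;
    return map { [$_, 1] } split ' ', $line;
}

# Shared combiner/reducer logic: sum the values of each key.
# Returns a list of [key, sum] pairs sorted by key.
sub sum_by_key {
    my (@pairs) = @_;
    my %sum;
    $sum{ $_->[0] } += $_->[1] for @pairs;
    return map { [$_, $sum{$_}] } sort keys %sum;
}

# Simulate a job: every split is handled by its own mapper, the pairs
# in @sent stand for the data sent through the network to the reducer.
# Returns (number of pairs sent, hashref with the final counts).
sub run_job {
    my ($use_combiner, @splits) = @_;
    my @sent;
    for my $split (@splits) {
        my @mapped = map { map_line($_) } @$split;
        push @sent, $use_combiner ? sum_by_key(@mapped) : @mapped;
    }
    my %counts;
    $counts{ $_->[0] } = $_->[1] for sum_by_key(@sent);
    return (scalar @sent, \%counts);
}

# Two hypothetical input splits, each a list of lines.
my @splits = (['a b a', 'a c'], ['b b', 'a b']);

for my $use_combiner (0, 1) {
    my ($sent, $counts) = run_job($use_combiner, @splits);
    printf "%-17s %d pairs sent, counts: %s\n",
        $use_combiner ? 'with combiner:' : 'without combiner:',
        $sent, join ' ', map { "$_=$counts->{$_}" } sort keys %$counts;
}
```

For these two splits the sketch sends 9 pairs without the combiner and 5 with it, while the final counts are identical. On real data with many repeated words per mapper the difference is correspondingly larger.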
A Hadoop job can have such a locally executed reducer, called a combiner. If a combiner is specified, the output of a mapper is processed by the combiner before the pairs are sent to the reducer. The combiner may be invoked zero, one or multiple times, usually when the data are written to disk, so the result of the job must not depend on how often it runs.
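Because the number of combiner invocations is not fixed, the final result has to be the same however the values happen to be grouped. The following plain Perl sketch (no Hadoop involved; combine_chunks, $add and $mean are hypothetical names) applies an operation to chunks of different sizes and then once more to the partial results. Summation is unaffected by the chunking, whereas a mean, which is not associative, is used here only as a contrasting assumption-labelled example of an operation that is not safe as a combiner.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Apply $op to consecutive chunks of at most $size values, mimicking a
# combiner that runs on whatever portion of the values is available.
sub combine_chunks {
    my ($op, $size, @values) = @_;
    my @partial;
    while (my @chunk = splice @values, 0, $size) {
        push @partial, $op->(@chunk);
    }
    return @partial;
}

my $add  = sub { sum(@_) };        # associative and commutative
my $mean = sub { sum(@_) / @_ };   # not associative

my @values = (1, 2, 3, 4, 10);

# Chunk size 1 leaves the values unchanged (as if the combiner never
# ran), larger chunks correspond to one or more combiner invocations.
for my $size (1, 2, 5) {
    printf "chunk size %d: sum = %d, mean = %.2f\n", $size,
        $add->(combine_chunks($add, $size, @values)),
        $mean->(combine_chunks($mean, $size, @values));
}
```

In this sketch the sum is 20 for every chunk size, while the mean comes out as 4.00, 5.00 and 4.00, which is why only operations like + can be reused directly as a combiner.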
Typically, the combiner is the same as the reducer of the MR job.
package My::Mapper;
...

package My::Reducer;
...

package main;
use Hadoop::Runner;

my $runner = Hadoop::Runner->new(
  mapper => My::Mapper->new(),
  combiner => My::Reducer->new(), # Specify the combiner.
  reducer => My::Reducer->new(),
  input_format => 'KeyValueTextInputFormat');
...
Compare the effect of adding the combiner to an MR job which counts occurrences of words in /home/straka/wiki/cs-text-medium: step-10-wc-without-combiner.pl and step-10-wc-with-combiner.pl.
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-10-wc-without-combiner.pl'
# NOW VIEW THE FILE
# $EDITOR step-10-wc-without-combiner.pl
rm -rf step-10-out-wout; time perl step-10-wc-without-combiner.pl /home/straka/wiki/cs-text-medium/ step-10-out-wout

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-10.txt' -O 'step-10-wc-with-combiner.pl'
# NOW VIEW THE FILE
# $EDITOR step-10-wc-with-combiner.pl
rm -rf step-10-out-with; time perl step-10-wc-with-combiner.pl /home/straka/wiki/cs-text-medium/ step-10-out-with
How would you explain the results?