MapReduce Tutorial : Initialization and cleanup of MR tasks, performance of combiners

During the mapper or reducer task execution the following steps take place:

1. The Perl script is executed in the current directory, i.e., the directory from which the job was submitted.
2. The Mapper/Reducer object is constructed.
3. The method setup($self, $context) is called on this object. The $context can already be used to produce (key, value) pairs or increment counters.
4. The method map or reduce is called for all input values.
5. The method cleanup($self, $context) is called after all (key, value) pairs of this task are processed. Again, the $context can be used to produce (key, value) pairs or increment counters.
6. The Perl script finishes.
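
The order of calls listed above can be modelled with a small toy driver. This is only a sketch in Python, not the tutorial's Perl API: the Context class, its write and increment methods, the LineLengthMapper class and the run_task function are all hypothetical names introduced for illustration.

```python
# Toy model of one task's lifecycle; all names here are hypothetical
# and do not belong to the tutorial's Perl API.
class Context:
    """Stand-in for $context: collects output pairs and counters."""

    def __init__(self):
        self.pairs = []
        self.counters = {}

    def write(self, key, value):
        self.pairs.append((key, value))

    def increment(self, counter, amount=1):
        self.counters[counter] = self.counters.get(counter, 0) + amount


class LineLengthMapper:
    def __init__(self):
        # Construction stays cheap: nothing is loaded or opened here.
        self.longest = None

    def setup(self, context):
        # Runs once, before any input; the context is already usable.
        self.longest = 0
        context.increment("setup calls")

    def map(self, key, value, context):
        self.longest = max(self.longest, len(value))
        context.write(key, len(value))

    def cleanup(self, context):
        # Runs once, after the last input; it may still produce output.
        context.write("longest", self.longest)


def run_task(mapper_class, records):
    context = Context()
    mapper = mapper_class()               # object is constructed
    mapper.setup(context)                 # setup is called once
    for key, value in records:            # map is called for all input
        mapper.map(key, value, context)
    mapper.cleanup(context)               # cleanup is called once
    return context


context = run_task(LineLengthMapper, [(0, "a b"), (1, "hello")])
print(context.pairs)     # [(0, 3), (1, 5), ('longest', 5)]
print(context.counters)  # {'setup calls': 1}
```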

The setup and cleanup methods are very useful for initialization and cleanup of the tasks.

Please note that complex initialization should not be performed during construction of the Mapper and Reducer objects, as these are constructed every time the script is executed. Put such initialization in the setup method instead.

Exercise

Improve the step-11-wc-without-combiner.pl script by manually combining the results in the Mapper – create a hash of word occurrences, populate it during the map calls without outputting results and finally output all (key, value) pairs in the cleanup method.

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-11-wc-without-combiner.pl'
rm -rf step-11-out-wout; time perl step-11-wc-without-combiner.pl run /home/straka/wiki/cs-text-medium/ step-11-out-wout
less step-11-out-wout/part-*

Measure the improvement.

Solution

You can also download the solution step-11-wc-with-perl-hash.pl and check the correct output.

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-11-solution.txt' -O 'step-11-wc-with-perl-hash.pl'
rm -rf step-11-out-with-hash; time perl step-11-wc-with-perl-hash.pl run /home/straka/wiki/cs-text-medium/ step-11-out-with-hash
less step-11-out-with-hash/part-*
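
The idea behind the exercise can be sketched independently of the Perl API. The following Python model is not the downloadable solution. The HashWordCountMapper class and the emit callback are hypothetical names, and splitting lines on whitespace is an assumption about how words are separated.

```python
from collections import Counter


class HashWordCountMapper:
    """Counts words in memory and outputs them only in cleanup."""

    def setup(self, emit):
        self.counts = Counter()

    def map(self, key, value, emit):
        # Update the in-memory hash; nothing is output here.
        self.counts.update(value.split())

    def cleanup(self, emit):
        # Output one (word, count) pair per distinct word.
        for word, count in sorted(self.counts.items()):
            emit(word, count)


output = []


def emit(key, value):
    output.append((key, value))


mapper = HashWordCountMapper()
mapper.setup(emit)
for offset, line in enumerate(["a b a", "b a c"]):
    mapper.map(offset, line, emit)
mapper.cleanup(emit)
print(output)  # [('a', 3), ('b', 2), ('c', 1)]
```

This approach assumes that the hash of all distinct words seen by one task fits into the memory of that task.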

Combiners and Perl API performance

As you have seen, combiners are not very efficient when using the Perl API. The cause lies in the Perl API itself: reading and writing (key, value) pairs is relatively slow, and a combiner does not help because it in fact increases the number of (key, value) pairs that have to be read and written.
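
One way to read this claim is to count how many (key, value) pairs the scripts read and write in total. The Python sketch below uses a deliberately simplified cost model that is an assumption, not a measurement: there is a single map task, the combiner sees all of its output, and every pair read or written by a script costs one unit. The pair_io function is a hypothetical helper.

```python
def pair_io(lines, strategy):
    """Total pairs written plus read by the scripts, under the toy model."""
    words = [word for line in lines for word in line.split()]
    emitted = len(words)        # one (word, 1) pair per occurrence
    distinct = len(set(words))  # one (word, count) pair per distinct word
    if strategy == "no combiner":
        # Mapper writes every occurrence, reducer reads every occurrence.
        return emitted + emitted
    if strategy == "combiner":
        # Mapper writes and combiner reads every occurrence,
        # then combiner writes and reducer reads the combined pairs.
        return emitted + emitted + distinct + distinct
    if strategy == "hash in mapper":
        # Mapper writes and reducer reads only the combined pairs.
        return distinct + distinct
    raise ValueError(strategy)


lines = ["a b a", "b a c"]
for strategy in ("no combiner", "combiner", "hash in mapper"):
    print(strategy, pair_io(lines, strategy))
# no combiner 12
# combiner 18
# hash in mapper 6
```

Under this model the combiner adds reads and writes on top of what the mapper already produces, while the hash in the mapper avoids them.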

This is even more obvious with larger input data:

Script	Time to complete on `/home/straka/wiki/cs-text`
step-11-wc-without-combiner.pl	5mins, 4sec
step-11-wc-with-combiner.pl	5mins, 33sec
step-11-wc-with-perl-hash.pl	2mins, 24sec

Commands used for each script:

step-11-wc-without-combiner.pl

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-11-wc-without-combiner.pl'
rm -rf step-11-out-wout; time perl step-11-wc-without-combiner.pl run /home/straka/wiki/cs-text/ step-11-out-wout

step-11-wc-with-combiner.pl

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-10.txt' -O 'step-11-wc-with-combiner.pl'
rm -rf step-11-out-with-combiner; time perl step-11-wc-with-combiner.pl run /home/straka/wiki/cs-text/ step-11-out-with-combiner

step-11-wc-with-perl-hash.pl

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-11-solution.txt' -O 'step-11-wc-with-perl-hash.pl'
rm -rf step-11-out-with-perl-hash; time perl step-11-wc-with-perl-hash.pl run /home/straka/wiki/cs-text/ step-11-out-with-perl-hash

For comparison, here are times of Java solutions:

Program	Time to complete on `/home/straka/wiki/cs-text`	Size of map output
Wordcount without combiner	2mins, 26sec	367MB
Wordcount with combiner	1min, 51sec	51MB
Wordcount with hash in mapper	1min, 14sec	51MB

Using the combiner is beneficial, although manually combining the word occurrences in the mapper is still faster.
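
To put the reported times side by side, they can be converted to seconds and expressed as a speedup relative to the run without a combiner. The Python sketch below only does this arithmetic on the numbers quoted on this page. The seconds helper and the dictionary names are hypothetical.

```python
def seconds(minutes, secs):
    return 60 * minutes + secs


# Times reported on this page for /home/straka/wiki/cs-text.
perl = {
    "without combiner": seconds(5, 4),
    "with combiner": seconds(5, 33),
    "with hash in mapper": seconds(2, 24),
}
java = {
    "without combiner": seconds(2, 26),
    "with combiner": seconds(1, 51),
    "with hash in mapper": seconds(1, 14),
}

for name, times in (("Perl", perl), ("Java", java)):
    base = times["without combiner"]
    for variant, t in times.items():
        # Values above 1 mean faster than the run without a combiner.
        print(f"{name} {variant}: {t} s, speedup {base / t:.2f}")
# Perl without combiner: 304 s, speedup 1.00
# Perl with combiner: 333 s, speedup 0.91
# Perl with hash in mapper: 144 s, speedup 2.11
# Java without combiner: 146 s, speedup 1.00
# Java with combiner: 111 s, speedup 1.32
# Java with hash in mapper: 74 s, speedup 1.97
```

In these measurements the combiner slows the Perl job down slightly and speeds the Java job up, while the hash in the mapper roughly halves the running time in both cases.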


Institute of Formal and Applied Linguistics Wiki
