MapReduce Tutorial : Initialization and cleanup of MR tasks
During the mapper or reducer task execution, the following steps take place:
- The Perl script is executed in the current directory, i.e., in the directory where the job was submitted from.
- The Mapper/Reducer object is constructed.
- The method `setup($self, $context)` is called on this object. The `$context` can already be used to produce (key, value) pairs or increment counters.
- The method `map` or `reduce` is called for all input values.
- The method `cleanup($self, $context)` is called after all (key, value) pairs of this task are processed. Again, the `$context` can be used to produce (key, value) pairs or increment counters.
- The Perl script finishes.
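The lifecycle above can be sketched in plain Perl. This is a minimal, framework-free illustration: the `MyMapper` and `ToyContext` packages and the driver loop are hypothetical stand-ins for the tutorial's real API and runtime, which may differ in method names and signatures.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical Mapper illustrating the task lifecycle;
# the real framework API may differ.
package MyMapper;
sub new     { my ($class) = @_; return bless {}, $class; }      # object is constructed
sub setup   { my ($self, $context) = @_; $self->{seen} = 0; }   # one-time initialization
sub map     { my ($self, $key, $value, $context) = @_;
              $self->{seen}++;
              $context->write($key, $value); }                  # called for every input pair
sub cleanup { my ($self, $context) = @_;
              $context->write("pairs_seen", $self->{seen}); }   # called once at the end

# Toy stand-in for the real $context, collecting (key, value) pairs.
package ToyContext;
sub new   { my ($class) = @_; return bless { pairs => [] }, $class; }
sub write { my ($self, $key, $value) = @_; push @{ $self->{pairs} }, [$key, $value]; }

# The driver below mimics what the MR runtime does for one task.
package main;
my $context = ToyContext->new;
my $mapper  = MyMapper->new;
$mapper->setup($context);
$mapper->map($_->[0], $_->[1], $context) for ([a => 1], [b => 2]);
$mapper->cleanup($context);
print "$_->[0]\t$_->[1]\n" for @{ $context->{pairs} };
```

Note that both `setup` and `cleanup` receive the `$context`, so either of them may emit (key, value) pairs in addition to those produced by `map`.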
The `setup` and `cleanup` methods are very useful for initialization and cleanup of the tasks.
Please note that complex initialization should not be performed during construction of Mapper and Reducer objects, as these are constructed every time the script is executed.
Exercise
Improve the wc-without-combiner.pl script by manually combining the results in the Mapper – create a hash of word occurrences, fill it during the `map` calls, and output the (key, value) pairs in the `cleanup` method.
Then measure the improvement.
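The technique asked for here is often called in-mapper combining. A minimal, framework-free sketch follows; as before, the `ToyContext` package and the driver are hypothetical stand-ins for the real API, and the core of the exercise is the hash filled in `map` and flushed in `cleanup`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word-count Mapper using in-mapper combining;
# the real tutorial API may differ in method signatures.
package CombiningMapper;
sub new   { my ($class) = @_; return bless { counts => {} }, $class; }
sub setup { my ($self, $context) = @_; $self->{counts} = {}; }
sub map   { my ($self, $key, $value, $context) = @_;
            # Accumulate counts locally instead of writing one pair per word.
            $self->{counts}{$_}++ for split ' ', $value; }
sub cleanup { my ($self, $context) = @_;
              # Emit each distinct word once, with its aggregated count.
              $context->write($_, $self->{counts}{$_})
                  for sort keys %{ $self->{counts} }; }

# Toy stand-in for the real $context, collecting (key, value) pairs.
package ToyContext;
sub new   { my ($class) = @_; return bless { pairs => [] }, $class; }
sub write { my ($self, $key, $value) = @_; push @{ $self->{pairs} }, [$key, $value]; }

package main;
my $context = ToyContext->new;
my $mapper  = CombiningMapper->new;
$mapper->setup($context);
$mapper->map(undef, $_, $context) for ("to be or not to be", "be");
$mapper->cleanup($context);
print "$_->[0]\t$_->[1]\n" for @{ $context->{pairs} };
# be 3, not 1, or 1, to 2
```

Because the hash keeps only one entry per distinct word, the mapper writes far fewer (key, value) pairs than the naive version, which is exactly where the speedup in the measurements below comes from.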
Combiners and Perl API
As you have seen, combiners are not efficient when using the Perl API. This is a shortcoming of the Perl API – reading and writing the (key, value) pairs is relatively slow, so a combiner does not help; in fact, it increases the number of (key, value) pairs that need to be read and written.
This is even more obvious with larger input data:
| Script | Time to complete on /home/straka/wiki/cs-text |
|---|---|
| wc-without-combiner.pl | 5mins, 4sec |
| wc-with-combiner.pl | 5mins, 33sec |
| wc-with-perl-hash.pl | 2mins, 24sec |
For comparison, here are times of Java solutions:
| Program | Time to complete on /home/straka/wiki/cs-text |
|---|---|
| Wordcount without combiner | |
| Wordcount with combiner | |
| Wordcount with hash in mapper | |