====== MapReduce Tutorial : Initialization and cleanup of MR tasks, performance of combiners ======

During the execution of a mapper or reducer task, the following steps take place:
  * The Perl script is executed in the current directory, i.e., the directory from which the job was executed or submitted.
  * The Mapper/Reducer object is constructed.
  * The method ''setup($self, $context)'' is called on this object. The ''$context'' can already be used to produce (key, value) pairs or increment counters.
  * The method ''map'' or ''reduce'' is called for all input values.
  * The method ''cleanup($self, $context)'' is called after all (key, value) pairs of this task have been processed. Again, the ''$context'' can be used to produce (key, value) pairs or increment counters.
  * The Perl script finishes.

The ''setup'' and ''cleanup'' methods are useful for initializing and cleaning up tasks. Note that complex initialization should not be performed during construction of the Mapper and Reducer objects, because these objects are constructed every time the script is executed.

===== Exercise =====

Improve the {{:courses:mapreduce-tutorial:step-5-solution1.txt|step-11-wc-without-combiner.pl}} script by combining the results manually in the Mapper: create a hash of word occurrences, populate it during the ''map'' calls without producing any output, and finally output all (key, value) pairs in the ''cleanup'' method.

<code>
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-11-wc-without-combiner.pl'
# NOW EDIT THE FILE
# $EDITOR step-11-wc-without-combiner.pl
rm -rf step-11-out-wout; time perl step-11-wc-without-combiner.pl /home/straka/wiki/cs-text-medium/ step-11-out-wout
less step-11-out-wout/part-*
</code>

Measure the improvement.

==== Solution ====

You can also download the solution {{:courses:mapreduce-tutorial:step-11-solution.txt|step-11-wc-with-perl-hash.pl}} and check the correct output.
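The idea behind the manual combining can be sketched without Hadoop at all. The following is only an illustration under stated assumptions, not the downloadable solution: ''Sketch::Context'', ''Sketch::Mapper'' and ''run_mapper'' are hypothetical names, the context is a stand-in that merely records what is written, and both the ''write'' method and the argument order of ''map'' are assumed rather than taken from the Perl API. The driver calls the methods by hand in the order described at the top of this page.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the framework's $context (assumption):
# it only records the (key, value) pairs that are written.
package Sketch::Context;
sub new   { my ($class) = @_; return bless { pairs => [] }, $class; }
sub write { my ($self, $key, $value) = @_; push @{ $self->{pairs} }, [$key, $value]; }

# Hypothetical mapper that combines counts in a hash.
package Sketch::Mapper;
sub new { my ($class) = @_; return bless {}, $class; }  # construction stays cheap

sub setup {
    my ($self, $context) = @_;
    $self->{counts} = {};            # per-task state is initialized here
}

sub map {                            # argument order is an assumption
    my ($self, $key, $value, $context) = @_;
    foreach my $word (split /\W+/, $value) {
        next unless length $word;
        $self->{counts}{$word}++;    # accumulate only, nothing is written yet
    }
}

sub cleanup {
    my ($self, $context) = @_;
    # One pair per distinct word, written after all input has been seen.
    $context->write($_, $self->{counts}{$_}) for sort keys %{ $self->{counts} };
}

package main;

# Drive the lifecycle by hand: construct, setup, map per record, cleanup.
sub run_mapper {
    my (@lines) = @_;
    my $context = Sketch::Context->new;
    my $mapper  = Sketch::Mapper->new;
    $mapper->setup($context);
    my $n = 0;
    $mapper->map($n++, $_, $context) for @lines;
    $mapper->cleanup($context);
    return $context->{pairs};
}

my $pairs = run_mapper('a b a', 'b a c');
print "$_->[0]\t$_->[1]\n" for @$pairs;
```

For the two input lines above, the sketch writes three pairs (one per distinct word) instead of the six pairs that a mapper writing ''(word, 1)'' for every occurrence would produce. Keys are sorted in ''cleanup'' only to make the printed output deterministic.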
<code>
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-11-solution.txt' -O 'step-11-wc-with-perl-hash.pl'
# NOW VIEW THE FILE
# $EDITOR step-11-wc-with-perl-hash.pl
rm -rf step-11-out-with-hash; time perl step-11-wc-with-perl-hash.pl /home/straka/wiki/cs-text-medium/ step-11-out-with-hash
less step-11-out-with-hash/part-*
</code>

===== Combiners and Perl API performance =====

As you have seen, combiners are not very efficient when used with the Perl API. The cause lies in the Perl API itself: reading and writing (key, value) pairs is relatively slow, and a combiner does not help because it in fact increases the number of (key, value) pairs that have to be read and written. This is even more obvious with larger input data, where the combiner makes the job about 10% slower, while the hash in the mapper makes it roughly twice as fast:

^ Script ^ Time to complete on ''/home/straka/wiki/cs-text'' ^ Commands ^
| {{:courses:mapreduce-tutorial:step-5-solution1.txt|step-11-wc-without-combiner.pl}} | 5 min 4 s | <code>wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-11-wc-without-combiner.pl'
rm -rf step-11-out-wout; time perl step-11-wc-without-combiner.pl /home/straka/wiki/cs-text/ step-11-out-wout</code> |
| {{:courses:mapreduce-tutorial:step-10.txt|step-11-wc-with-combiner.pl}} | 5 min 33 s | <code>wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-10.txt' -O 'step-11-wc-with-combiner.pl'
rm -rf step-11-out-with-combiner; time perl step-11-wc-with-combiner.pl /home/straka/wiki/cs-text/ step-11-out-with-combiner</code> |
| {{:courses:mapreduce-tutorial:step-11-solution.txt|step-11-wc-with-perl-hash.pl}} | 2 min 24 s | <code>wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-11-solution.txt' -O 'step-11-wc-with-perl-hash.pl'
rm -rf step-11-out-with-perl-hash; time perl step-11-wc-with-perl-hash.pl /home/straka/wiki/cs-text/ step-11-out-with-perl-hash</code> |

For comparison, here are the times of the Java solutions:

^ Program ^ Time to complete on ''/home/straka/wiki/cs-text'' ^ Size of map output ^
| Wordcount without combiner | 2 min 26 s | 367 MB |
| Wordcount with combiner | 1 min 51 s | 51 MB |
| Wordcount with hash in mapper | 1 min 14 s | 51 MB |

In Java, using the combiner is beneficial: it shrinks the map output from 367 MB to 51 MB and shortens the run. Combining the word occurrences manually in the mapper is still faster.

----
[[step-10|Step 10]]: Combiners. [[.|Overview]] [[step-12|Step 12]]: Additional output from mappers and reducers.