Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-11 [2012/01/25 15:46]
straka created
+++ courses:mapreduce-tutorial:step-11 [2012/01/31 09:39] (current)
straka Change Perl commandline syntax.
@@ Line 1: / Line 1: @@
-====== MapReduce Tutorial :  ======
+====== MapReduce Tutorial : Initialization and cleanup of MR tasks, performance of combiners ======
+During the execution of a mapper or reducer task, the following steps take place:
+  * The Perl script is executed in the current directory, i.e. in the directory from which the job was executed or submitted.
+  * The Mapper/Reducer object is constructed.
+  * The method ''setup($self, $context)'' is called on this object. The ''$context'' can already be used to produce (key, value) pairs or increment counters.
+  * The method ''map'' or ''reduce'' is called for all input values.
+  * The method ''cleanup($self, $context)'' is called after all (key, value) pairs of this task are processed. Again, the ''$context'' can be used to produce (key, value) pairs or increment counters.
+  * The Perl script finishes.
+The ''setup'' and ''cleanup'' methods are very useful for initialization and cleanup of the tasks.
+Please note that complex initialization should not be performed in the constructors of the Mapper and Reducer objects, because these objects are constructed every time the script is executed. Use the ''setup'' method for such initialization instead.
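+The sequence of calls described above can be sketched in plain Perl, without a Hadoop installation. Everything in this sketch is a hypothetical stand-in and not the tutorial's Perl API: ''MockContext'' only records pairs, ''LineLengthMapper'' is an illustrative mapper, and ''run_task'' imitates the framework's driver. Only the method names ''setup'', ''map'', ''cleanup'' and the ''$context'' argument come from the text; the ''write'' method and the ''($key, $value, $context)'' argument order of ''map'' are assumptions.
```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the framework's context object:
# it only records the (key, value) pairs that were emitted.
package MockContext;
sub new { my ($class) = @_; return bless { pairs => [] }, $class }
sub write {
  my ($self, $key, $value) = @_;
  push @{ $self->{pairs} }, [$key, $value];
}

# Hypothetical mapper. The constructor stays trivial; per-task
# state is created in setup and flushed in cleanup.
package LineLengthMapper;
sub new { my ($class) = @_; return bless {}, $class }
sub setup {
  my ($self, $context) = @_;
  $self->{lines} = 0;                      # per-task initialization
}
sub map {
  my ($self, $key, $value, $context) = @_;
  $self->{lines}++;
  $context->write($key, length $value);    # one pair per input record
}
sub cleanup {
  my ($self, $context) = @_;
  # The context is still usable here, so a summary pair can be emitted.
  $context->write('TOTAL_LINES', $self->{lines});
}

package main;

# Imitates the order of calls listed above: setup, map for every
# input record, then cleanup. Each record is a [key, value] array ref.
sub run_task {
  my ($mapper, $context, @records) = @_;
  $mapper->setup($context);
  $mapper->map($_->[0], $_->[1], $context) for @records;
  $mapper->cleanup($context);
  return $context;
}

my $context = run_task(LineLengthMapper->new, MockContext->new,
                       [0, 'hello world'], [12, 'map reduce']);
print join("\t", @$_), "\n" for @{ $context->{pairs} };
```
+The summary pair is written last because ''cleanup'' runs only after every input record has been passed to ''map''.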
+===== Exercise =====
+Improve the {{:courses:mapreduce-tutorial:step-5-solution1.txt|step-11-wc-without-combiner.pl}} script by manually combining the results in the Mapper -- create a hash of word occurrences, populate it during the ''map'' calls without outputting results and finally output all (key, value) pairs in the ''cleanup'' method.
+  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-11-wc-without-combiner.pl'
+  # NOW EDIT THE FILE
+  # $EDITOR step-11-wc-without-combiner.pl
+  rm -rf step-11-out-wout; time perl step-11-wc-without-combiner.pl /home/straka/wiki/cs-text-medium/ step-11-out-wout
+  less step-11-out-wout/part-*
+Measure the improvement.
+==== Solution ====
+You can also download the solution {{:courses:mapreduce-tutorial:step-11-solution.txt|step-11-wc-with-perl-hash.pl}} and check the correct output.
+  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-11-solution.txt' -O 'step-11-wc-with-perl-hash.pl'
+  # NOW VIEW THE FILE
+  # $EDITOR step-11-wc-with-perl-hash.pl
+  rm -rf step-11-out-with-hash; time perl step-11-wc-with-perl-hash.pl /home/straka/wiki/cs-text-medium/ step-11-out-with-hash
+  less step-11-out-with-hash/part-*
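+The idea behind the solution, combining in the mapper with a Perl hash, can be illustrated by the following minimal sketch. It is not the downloadable solution and does not use the tutorial's Perl API: ''MockContext'' and ''HashWordCountMapper'' are hypothetical names, the ''write'' method and the argument order of ''map'' are assumptions, and splitting on ''\W+'' is a simplified tokenization.
```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the context object; records emitted pairs.
package MockContext;
sub new { my ($class) = @_; return bless { pairs => [] }, $class }
sub write {
  my ($self, $key, $value) = @_;
  push @{ $self->{pairs} }, [$key, $value];
}

# Hypothetical word-count mapper that combines occurrences itself.
package HashWordCountMapper;
sub new { my ($class) = @_; return bless {}, $class }
sub setup {
  my ($self, $context) = @_;
  $self->{counts} = {};                    # word => occurrences so far
}
sub map {
  my ($self, $key, $value, $context) = @_;
  for my $word (split /\W+/, $value) {     # simplified tokenization
    next unless length $word;
    $self->{counts}{$word}++;              # count only, nothing is written
  }
}
sub cleanup {
  my ($self, $context) = @_;
  # One pair per distinct word, written after all input was processed.
  for my $word (sort keys %{ $self->{counts} }) {
    $context->write($word, $self->{counts}{$word});
  }
}

package main;

my @lines   = ('to be or not to be', 'to map or to reduce');
my $context = MockContext->new;
my $mapper  = HashWordCountMapper->new;
$mapper->setup($context);
$mapper->map($_, $lines[$_], $context) for 0 .. $#lines;
$mapper->cleanup($context);
print join("\t", @$_), "\n" for @{ $context->{pairs} };
```
+The price of this approach is memory: the hash holds every distinct word seen by the task until ''cleanup'' runs.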
+===== Combiners and Perl API performance =====
+As you have seen, combiners are not very efficient when used through the Perl API. The cause lies in the Perl API itself: reading and writing (key, value) pairs is relatively slow, so a combiner does not help. In fact, it increases the number of (key, value) pairs that must be read and written by Perl code.
+This is even more obvious with larger input data:
+^ Script ^ Time to complete on ''/home/straka/wiki/cs-text'' ^ Commands ^
+| {{:courses:mapreduce-tutorial:step-5-solution1.txt|step-11-wc-without-combiner.pl}} | 5mins, 4sec | <html><pre>wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-11-wc-without-combiner.pl'<br>rm -rf step-11-out-wout; time perl step-11-wc-without-combiner.pl /home/straka/wiki/cs-text/ step-11-out-wout</pre></html> |
+| {{:courses:mapreduce-tutorial:step-10.txt|step-11-wc-with-combiner.pl}} | 5mins, 33sec  | <html><pre>wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-10.txt' -O 'step-11-wc-with-combiner.pl'<br>rm -rf step-11-out-with-combiner; time perl step-11-wc-with-combiner.pl /home/straka/wiki/cs-text/ step-11-out-with-combiner</pre></html>|
+| {{:courses:mapreduce-tutorial:step-11-solution.txt|step-11-wc-with-perl-hash.pl}} | 2mins, 24sec | <html><pre>wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-11-solution.txt' -O 'step-11-wc-with-perl-hash.pl'<br>rm -rf step-11-out-with-perl-hash; time perl step-11-wc-with-perl-hash.pl /home/straka/wiki/cs-text/ step-11-out-with-perl-hash</pre></html>|
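+A rough way to see why the combiner adds work is to count the (key, value) pairs that pass through Perl code. The sketch below uses a deliberately simplified model that is an assumption, not a measurement: a single map task, every pair written or read by a Perl mapper, combiner or reducer counts once, and reading the input and writing the final output are ignored because they are the same in all three variants. The function name ''pairs_through_perl'' is hypothetical.
```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified model (an assumption): count pairs written or read by Perl code.
#   without combiner: mapper writes N,  reducer reads N
#   with combiner:    mapper writes N,  combiner reads N and writes U,
#                     reducer reads U
#   hash in mapper:   mapper writes U,  reducer reads U
# N = number of word occurrences, U = number of distinct words.
sub pairs_through_perl {
  my ($total_words, $distinct_words) = @_;
  return {
    without_combiner => 2 * $total_words,
    with_combiner    => 2 * $total_words + 2 * $distinct_words,
    with_hash        => 2 * $distinct_words,
  };
}

my @words = qw(a b a c b a);
my %seen;
$seen{$_}++ for @words;

my $stats = pairs_through_perl(scalar @words, scalar keys %seen);
printf "%-17s %d\n", $_, $stats->{$_} for sort keys %$stats;
```
+Under this model the combiner variant always handles more pairs in Perl than the variant without a combiner, while the hash variant handles the fewest.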
+For comparison, here are times of Java solutions:
+^ Program ^ Time to complete on ''/home/straka/wiki/cs-text'' ^ Size of map output ^
+| Wordcount without combiner | 2mins, 26sec | 367MB |
+| Wordcount with combiner | 1min, 51sec | 51MB |
+| Wordcount with hash in mapper | 1min, 14sec | 51MB |
+In Java, using the combiner is beneficial, although manually combining the word occurrences in the mapper is still faster.
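+The times from both tables can be turned into speedups relative to the variant without a combiner. The short script below only does this arithmetic on the figures quoted above; the helper ''seconds'' and the hash layout are illustrative choices.
```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert "minutes, seconds" from the tables above into seconds.
sub seconds { my ($min, $sec) = @_; return 60 * $min + $sec }

# Times copied from the two tables above.
my %perl = (
  'without combiner' => seconds(5, 4),
  'with combiner'    => seconds(5, 33),
  'hash in mapper'   => seconds(2, 24),
);
my %java = (
  'without combiner' => seconds(2, 26),
  'with combiner'    => seconds(1, 51),
  'hash in mapper'   => seconds(1, 14),
);

# Speedup = time without combiner / time of the variant.
for my $set (['Perl', \%perl], ['Java', \%java]) {
  my ($name, $times) = @$set;
  my $base = $times->{'without combiner'};
  for my $variant ('with combiner', 'hash in mapper') {
    printf "%-4s %-15s %.2fx\n", $name, $variant, $base / $times->{$variant};
  }
}
```
+A value below 1 means the variant is slower than the baseline, which is the case only for the Perl combiner.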
+----
+<html>
+<table style="width:100%">
+<tr>
+<td style="text-align:left; width: 33%; "></html>[[step-10|Step 10]]: Combiners.<html></td>
+<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
+<td style="text-align:right; width: 33%; "></html>[[step-12|Step 12]]: Additional output from mappers and reducers.<html></td>
+</tr>
+</table>
+</html>


Institute of Formal and Applied Linguistics Wiki
