Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-8 [2012/01/25 15:54]
straka
+++ courses:mapreduce-tutorial:step-8 [2012/01/29 20:58]
straka
@@ Line 6: / Line 6: @@
 The number of mappers is determined automatically according to the sizes of the input files. Every input file is divided into //splits//. The default split size is 32MB. Every file split is then processed by a different mapper.
-The size of file split can be overridden by ''mapred.min.split.size'' and ''maperd.max.split.size''. See the next tutorial step for how to set these flags.
+The size of a file split can be overridden by ''mapred.min.split.size'' and ''mapred.max.split.size''. See the next step for how to set these properties.
 ===== Multiple reducers =====
-Then number of reducers is specified by the job, default number is one. As the outputs of reducers are not merged, there are as many output files as reducers.
+The number of reducers is specified by the job, defaulting to one if unspecified. As the outputs of reducers are not merged, there are as many output files as reducers.
 To use multiple reducers, the MR job must be run on a cluster (even a cluster consisting of one computer), not locally. The number of reducers is specified by the ''-r'' flag:
-  perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]
+  perl script.pl run [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]
+The optimal number of reducers is the same as the number of machines in the cluster, so that all the reducers can run in parallel.
 ==== Partitioning ====
 When there are multiple reducers, it is important how the (key, value) pairs are distributed between the reducers.
-By default, (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//. This guarantees that for one key, all its values are processed by unique reducer.
+By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//. This guarantees that all values belonging to one key are processed by the same reducer.
-To override the default behaviour, MR job can specify a //partitioner//. A partitioner is given each (key, value) pair produced by a mapper, number of reducers, and outputs the zero-based number of reducer, where this (key, value) pair belongs:
+To override the default behaviour, an MR job can specify a //partitioner//. A partitioner is given every (key, value) pair produced by a mapper together with the number of reducers, and it outputs the zero-based number of the reducer to which this (key, value) pair belongs:
 <code perl>
@@ Line 45: / Line 47: @@
 An MR job must have a reducer if it specifies a partitioner. Also, the partitioner is not called if there is only one reducer.
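As an illustration of the default rule //hash(key) modulo number_of_reducers// described above, here is a minimal sketch in plain Perl. It does not use the tutorial's Hadoop API: ''simple_hash'' and ''default_partition'' are hypothetical names introduced only for this example, and the hash function is an arbitrary stand-in, not the one Hadoop actually uses.

<code perl>
use strict;
use warnings;

# Hypothetical stand-in hash: an arbitrary polynomial hash over the
# characters of the key. The real hash used by Hadoop is NOT assumed here.
sub simple_hash {
    my ($key) = @_;
    my $h = 0;
    $h = ($h * 31 + ord($_)) % 2**31 for split //, $key;
    return $h;
}

# Hypothetical helper mirroring the default rule:
# zero-based reducer number = hash(key) modulo number_of_reducers.
sub default_partition {
    my ($key, $number_of_reducers) = @_;
    return simple_hash($key) % $number_of_reducers;
}

my $number_of_reducers = 3;
for my $key (qw(apple banana apple cherry banana)) {
    printf "%s -> reducer %d\n", $key, default_partition($key, $number_of_reducers);
}
</code>

Because the reducer number depends only on the key, repeated keys (''apple'' and ''banana'' in this sketch) always map to the same reducer, whatever their values are.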
-===== Example =====
+===== The order of keys during reduce =====
+It is guaranteed that every reducer processes the keys in //ascending order//.
+On the other hand, the order of values belonging to one key is undefined.
+===== Exercise =====
+Run one MR job on '/home/straka/wiki/cs-text-medium' that creates two output files: one with an ascending list of unique article names and the other with an ascending list of unique words. You can download the template {{:courses:mapreduce-tutorial:step-8-exercise.txt|step-8-exercise.pl}} and execute it.
+  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-8-exercise.txt' -O 'step-8-exercise.pl'
+  rm -rf step-8-out-ex; perl step-8-exercise.pl run /home/straka/wiki/cs-text-medium/ step-8-out-ex
+  less step-8-out-ex/part-*
+==== Solution ====
+You can also download the solution {{:courses:mapreduce-tutorial:step-8-solution.txt|step-8-solution.pl}} and check the correct output.
+  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-8-solution.txt' -O 'step-8-solution.pl'
+  rm -rf step-8-out-sol; perl step-8-solution.pl run /home/straka/wiki/cs-text-medium/ step-8-out-sol
+  less step-8-out-sol/part-*
-Run a MR job on '/home/straka/wiki/cs-text-medium', which creates a list of unique article names and at the same time list of unique words.
+----
-{{:courses:mapreduce-tutorial:step-8-solution.txt|Solution.pl}}
+<html>
+<table style="width:100%">
+<tr>
+<td style="text-align:left; width: 33%; "></html>[[step-7|Step 7]]: Dynamic Hadoop cluster for several computations.<html></td>
+<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
+<td style="text-align:right; width: 33%; "></html>[[step-9|Step 9]]: Hadoop properties.<html></td>
+</tr>
+</table>
+</html>

Institute of Formal and Applied Linguistics Wiki