courses:mapreduce-tutorial:step-8 — created 2012/01/25 14:46, last modified 2012/01/29 21:01 by straka
====== MapReduce Tutorial : Multiple reducers and partitioning ======
+ | |||
+ | It is important for a job, which should run on many computers at the same time, to use multiple mappers and reducers. It is possible to control these numbers to some degree. | ||
+ | |||
+ | ===== Multiple mappers ===== | ||
+ | The number of mappers is determined automatically according to input files sizes. Every input file is divided into //splits//. The default split size is 32MB. Every file split is then executed by a different mapper. | ||
+ | |||
+ | The size of file split can be overridden by '' | ||
+ | |||

===== Multiple reducers =====
The number of reducers is specified by the job, defaulting to one if unspecified. As the outputs of the reducers are not merged, there are as many output files as there are reducers.

To use multiple reducers, the MR job must be executed by a cluster (even a single-machine one), not locally. The number of reducers is specified by the ''-r'' option:
  perl script.pl run [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]

The optimal number of reducers equals the number of machines in the cluster, so that all the reducers can run in parallel at the same time.

==== Partitioning ====
When there are multiple reducers, it is important how the (key, value) pairs are distributed among the reducers.

By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//.

To override the default behaviour, the MR job can specify a //partitioner//: given a key, a value and the number of partitions, it returns the number of the partition (i.e., the reducer) the pair should be sent to:

<code perl>
package Partitioner;
use Moose;
with 'Hadoop::Partitioner';

sub getPartition {
  my ($self, $key, $value, $partitions) = @_;

  return $key % $partitions;
}

...
package Main;
use Hadoop::Runner;

my $runner = Hadoop::Runner->new(
  ...
  partitioner => Partitioner->new(),
  ...);
...
</code>
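The ''getPartition'' contract can also be exercised outside Hadoop. The following stand-in class (hypothetical: plain Perl instead of Moose, no Hadoop involved) implements the same //key modulo partitions// logic so it can be run and checked locally:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Moose-based Partitioner above:
# same getPartition signature, same modulo logic, but dependency-free.
package ModuloPartitioner;
sub new { return bless {}, shift }

sub getPartition {
    my ($self, $key, $value, $partitions) = @_;
    return $key % $partitions;    # numeric keys assumed
}

package main;
my $p = ModuloPartitioner->new;
print $p->getPartition(10, "some value", 3), "\n";    # 10 % 3 -> 1
```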
+ | |||
+ | A MR job must have a reducer if it specifies a partitioner. Also, the partitioner is not called if there is only one reducer. | ||
+ | |||
===== The order of keys during reduce =====
It is guaranteed that every reducer processes the keys in //ascending order//.

On the other hand, the order of values belonging to one key is undefined.

===== Exercise =====

Run one MR job on ''/
  wget --no-check-certificate '
  rm -rf step-8-out-ex;
  less step-8-out-ex/

==== Solution ====
You can also download the solution {{:
  wget --no-check-certificate '
  rm -rf step-8-out-sol;
  less step-8-out-sol/
+ | |||
+ | ---- | ||
+ | |||
+ | < | ||
+ | <table style=" | ||
+ | < | ||
+ | <td style=" | ||
+ | <td style=" | ||
+ | <td style=" | ||
+ | </ | ||
+ | </ | ||
+ | </ |