====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ======
A Hadoop job, which is expected to run on many machines at the same time, achieves parallelism by using multiple mappers and reducers.
===== Multiple mappers =====
To use multiple reducers, the MR job must be executed by a cluster (even with one computer), not locally. The number of reducers is specified by the ''-r'' option of the script:

  perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]

The optimal number of reducers is the same as the number of machines in the cluster, so that all the reducers can run in parallel at the same time.
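
As a concrete illustration, a job with four reducers could be started on a one-machine cluster as sketched below. The ''input'' and ''output'' arguments are placeholders assumed for this example (the earlier steps pass the input path and the output directory as positional arguments); they are not prescribed by this page.

  # a sketch, assuming positional input/output arguments as in the earlier steps
  perl script.pl -c 1 -r 4 input output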
==== Partitioning ====
By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//, so all pairs with the same key are processed by the same reducer.
To override the default behaviour, an MR job can specify a //partitioner//, which for every (key, value) pair chooses the number of the reducer it is sent to.

A partitioner should be provided if
  * the default partitioner fails to distribute the data between reducers equally, i.e., some of the reducers operate on much more data than others,
  * you need explicit control of (key, value) placement. This can happen for example when [[.:step-13|sorting data]].
<code perl>
package My::Partitioner;
use Moose;
with 'Hadoop::Partitioner';

...

package main;
use Hadoop::Runner;

my $runner = Hadoop::Runner->new(
  ...
  partitioner => My::Partitioner->new(),
  ...);
...
</code>
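
For illustration, a fuller partitioner sketch follows. It assumes that the ''Hadoop::Partitioner'' role expects a ''getPartition'' method which receives the key, the value and the number of partitions and returns the zero-based number of the target reducer (mirroring the Java ''Partitioner'' interface); the method name and signature are an assumption of this sketch, so check the Perl API before relying on them.

<code perl>
package My::Partitioner;
use Moose;
with 'Hadoop::Partitioner';

# Assumed interface: getPartition(key, value, number_of_partitions) returns
# the zero-based index of the reducer that receives this (key, value) pair.
sub getPartition {
  my ($self, $key, $value, $partitions) = @_;

  # Example policy: partition by the first character of the key, so that all
  # keys starting with the same letter are processed by the same reducer.
  return ord(lc(substr($key, 0, 1))) % $partitions;
}
</code>

Such a partitioner gives the explicit control of (key, value) placement mentioned above, at the cost of possibly unequal load if the first letters of the keys are distributed unevenly.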
===== The order of keys during reduce =====
It is guaranteed that every reducer processes the keys in //ascending order//. On the other hand, the order of values belonging to one key is undefined.
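
To make the guarantee concrete, the sketch below shows a reducer that depends only on the ascending order of keys, not on any ordering of the values. The ''Hadoop::Reducer'' role and the value iterator (''$values->next'', ''$values->value'') follow the pattern used in the earlier steps of this tutorial; treat them as assumptions and adjust if the wrapper's API differs.

<code perl>
package My::Reducer;
use Moose;
with 'Hadoop::Reducer';

# reduce() is called on keys in ascending order, so the pairs written here
# end up sorted by key in this reducer's output. The values of one key
# arrive in an undefined order, so they are combined with an
# order-independent operation (a sum) instead of relying on their order.
sub reduce {
  my ($self, $key, $values, $context) = @_;

  my $sum = 0;
  while ($values->next) {
    $sum += $values->value;
  }
  $context->write($key, $sum);
}
</code>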
===== Exercise =====

Run one MR job on '/

  wget --no-check-certificate '
  # NOW EDIT THE FILE
  # $EDITOR step-8-exercise.pl
  rm -rf step-8-out-ex;
  less step-8-out-ex/
==== Solution ====

You can also download the solution {{:

  wget --no-check-certificate '
  # NOW VIEW THE FILE
  # $EDITOR step-8-solution.pl
  rm -rf step-8-out-sol;
  less step-8-out-sol/
