
Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:step-8 [2012/01/28 15:36]
majlis
courses:mapreduce-tutorial:step-8 [2012/01/31 15:55] (current)
straka
====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ======
  
A Hadoop job that is expected to run on many computers at the same time needs to use multiple mappers and reducers. It is possible to control their numbers to some degree.
  
===== Multiple mappers =====
===== Multiple reducers =====
  
To use multiple reducers, the MR job must be executed by a cluster (even with a single computer), not locally. The number of reducers is specified by the ''-r'' flag:
  perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]

The optimal number of reducers is the same as the number of machines in the cluster, so that all the reducers can run in parallel at the same time.
  
==== Partitioning ====
By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//. This guarantees that all values of one key are processed by a single reducer.
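The default rule can be sketched as follows (in Python for illustration only; Python's built-in ''hash'' merely stands in for Hadoop's key hash -- the point is the //modulo number_of_reducers// scheme itself):

```python
# Sketch of the default partitioning rule: reducer = hash(key) % r.
# Python's hash() stands in for Hadoop's key hash; only the
# "modulo number_of_reducers" scheme is being illustrated.
def default_partition(key, num_reducers):
    return hash(key) % num_reducers

num_reducers = 3
pairs = [("dog", 1), ("cat", 2), ("dog", 3), ("ant", 4)]

# Group the pairs by the reducer they would be sent to.  Every value
# of a given key necessarily lands at the same reducer, because the
# partition depends on the key alone.
by_reducer = {}
for key, value in pairs:
    by_reducer.setdefault(default_partition(key, num_reducers), []).append((key, value))
```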
  
To override the default behaviour, an MR job can specify a //partitioner//. A partitioner is given every (key, value) pair produced by a mapper, together with the number of reducers, and outputs the zero-based number of the reducer where this (key, value) pair belongs.

A partitioner should be provided if
  * the default partitioner fails to distribute the data between reducers equally, i.e., some of the reducers operate on much more data than others;
  * you need explicit control of (key, value) placement. This can happen for example when [[.:step-13|sorting data]].
  
<code perl>
package My::Partitioner;
use Moose;
with 'Hadoop::Partitioner';
  
...
package main;
use Hadoop::Runner;
  
my $runner = Hadoop::Runner->new(
  ...
  partitioner => My::Partitioner->new(),
  ...);
...
</code>
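To make the second bullet point concrete outside the Perl framework, here is a conceptual //range// partitioner sketched in Python (the split points and the reducer counts in the tests are invented for illustration; this is not the tutorial's API): each key goes to the reducer whose range it falls into, so concatenating the sorted outputs of reducer 0, 1, ... yields globally sorted data.

```python
# Conceptual sketch of a custom (range) partitioner, not the tutorial's
# Perl API.  Keys below the i-th boundary go to reducer i; the last
# reducer takes everything else.  Because reducer ranges are disjoint
# and ordered, the concatenated reducer outputs are globally sorted.
def range_partition(key, boundaries):
    """Return the zero-based reducer number for `key`.

    `boundaries` holds num_reducers - 1 split points in ascending order
    (chosen by hand here; a real job would sample the data first).
    """
    for i, boundary in enumerate(boundaries):
        if key < boundary:
            return i
    return len(boundaries)  # the tail goes to the last reducer
```

With ''boundaries = ["m"]'' there are two reducers: one for keys before ''"m"'', one for the rest.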
  
===== The order of keys during reduce =====
It is guaranteed that every reducer processes the keys in //ascending lexicographic order//.
  
On the other hand, the order of values belonging to one key is undefined.
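Note that //lexicographic// means the keys are compared as strings, which differs from numeric order when the keys happen to look like numbers -- a quick Python illustration:

```python
# Keys reach a reducer in ascending lexicographic (string) order.
# For number-like keys this differs from the numeric order one
# might expect.
keys = ["1", "10", "2", "9"]
lexicographic = sorted(keys)     # the order the reducer sees
numeric = sorted(keys, key=int)  # numeric order, for comparison
```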
  
===== Exercise =====

Run one MR job on ''/home/straka/wiki/cs-text-medium'' which creates two output files -- one with an ascending list of unique article names and the other with an ascending list of unique words. You can download the template {{:courses:mapreduce-tutorial:step-8-exercise.txt|step-8-exercise.pl}} and execute it.
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-8-exercise.txt' -O 'step-8-exercise.pl'
  # NOW EDIT THE FILE
  # $EDITOR step-8-exercise.pl
  rm -rf step-8-out-ex; perl step-8-exercise.pl -c 2 -r 2 /home/straka/wiki/cs-text-medium/ step-8-out-ex
  less step-8-out-ex/part-*
  
==== Solution ====
You can also download the solution {{:courses:mapreduce-tutorial:step-8-solution.txt|step-8-solution.pl}} and check the correct output.
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-8-solution.txt' -O 'step-8-solution.pl'
  # NOW VIEW THE FILE
  # $EDITOR step-8-solution.pl
  rm -rf step-8-out-sol; perl step-8-solution.pl -c 2 -r 2 /home/straka/wiki/cs-text-medium/ step-8-out-sol
  less step-8-out-sol/part-*
  
  
----
  
<html>
<table style="width:100%">
<tr>
<td style="text-align:left; width: 33%; "></html>[[step-7|Step 7]]: Dynamic Hadoop cluster for several computations.<html></td>
<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
<td style="text-align:right; width: 33%; "></html>[[step-9|Step 9]]: Hadoop properties.<html></td>
</tr>
</table>
</html>
