
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-8 [2012/01/25 14:46]
straka created
courses:mapreduce-tutorial:step-8 [2012/01/25 15:54]
straka
Line 1: Line 1:
-====== MapReduce Tutorial : Multiple reducers and partitioning ======
+====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ======
 + 
 +To achieve parallelism, a job must run several mappers and several reducers at the same time.
 + 
 +===== Multiple mappers ===== 
 +The number of mappers is determined automatically from the sizes of the input files. Every input file is divided into //splits//. The default split size is 32MB. Every file split is then processed by a different mapper.
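As a rough illustration of the split rule above, the following sketch estimates how many mappers a job would get. The helper name ''estimate_mappers'' and the sample file sizes are hypothetical. The sketch assumes the 32MB default and simple rounding up per file; the framework itself may treat a small trailing remainder differently.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Assumed split size: the 32MB default stated in the text.
my $SPLIT_SIZE = 32 * 1024 * 1024;

# Estimate the number of mappers for a list of input file sizes (in bytes).
# Each file is split independently, so even a tiny file needs its own mapper.
sub estimate_mappers {
    my @file_sizes = @_;
    my $mappers = 0;
    for my $size (@file_sizes) {
        next if $size == 0;    # assumption: an empty file yields no split
        $mappers += ceil($size / $SPLIT_SIZE);
    }
    return $mappers;
}

# Hypothetical input: three files of 100MB, 10MB and 32MB.
my $mb = 1024 * 1024;
print estimate_mappers(100 * $mb, 10 * $mb, 32 * $mb), "\n";    # prints 6 (4 + 1 + 1)
```

Because splits never span files, many small input files produce many mappers even when the total input is small.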
 + 
 +The split size can be overridden by ''mapred.min.split.size'' and ''mapred.max.split.size''. See the next tutorial step for how to set these flags.
 + 
 +===== Multiple reducers ===== 
 +The number of reducers is specified by the job; the default is one. Because the outputs of the reducers are not merged, there are as many output files as reducers.
 + 
 +To use multiple reducers, the MR job must be executed by a cluster (even one consisting of a single computer), not locally. The number of reducers is specified by the ''-r'' flag:
 +  perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers] 
 + 
 +==== Partitioning ==== 
 +When there are multiple reducers, it is important how the (key, value) pairs are distributed between the reducers. 
 + 
 +By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//. This guarantees that all values of one key are processed by the same reducer.
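The default rule can be sketched in a few lines of Perl. This is only an illustration: ''toy_hash'' and ''default_partition'' are hypothetical names, and the hash function actually used by the framework is not given here and may well differ.

```perl
use strict;
use warnings;

# Toy string hash, for illustration only (an assumption, not the framework's hash).
sub toy_hash {
    my ($key) = @_;
    my $hash = 0;
    # Polynomial rolling hash, kept below 2**31 so it stays a non-negative integer.
    $hash = ($hash * 31 + ord($_)) % 2**31 for split //, $key;
    return $hash;
}

# Default rule from the text: hash(key) modulo number_of_reducers.
sub default_partition {
    my ($key, $partitions) = @_;
    return toy_hash($key) % $partitions;
}

# The same key always maps to the same zero-based reducer number.
for my $key (qw(apple banana apple cherry)) {
    printf "%s -> reducer %d\n", $key, default_partition($key, 3);
}
```

Since the result depends only on the key and the number of reducers, repeated occurrences of a key from different mappers end up at one reducer.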
 + 
 +To override the default behaviour, an MR job can specify a //partitioner//. A partitioner is given each (key, value) pair produced by a mapper together with the number of reducers, and returns the zero-based number of the reducer to which this (key, value) pair belongs:
 + 
 +<code perl> 
 +package Partitioner; 
 +use Moose; 
 +with 'Hadoop::Partitioner'; 
 + 
 +sub getPartition {
 +  my ($self, $key, $value, $partitions) = @_;
 + 
 +  # Assumes numeric keys; the result is the zero-based reducer number.
 +  return $key % $partitions;
 +}
 + 
 +... 
 +package Main; 
 +use Hadoop::Runner; 
 + 
 +my $runner = Hadoop::Runner->new( 
 +  ... 
 +  partitioner => Partitioner->new(), 
 +  ...); 
 +... 
 +</code> 
 + 
 +An MR job must have a reducer if it specifies a partitioner. Also, the partitioner is not called if there is only one reducer.
 + 
 +===== Example ===== 
 + 
 +Run an MR job on '/home/straka/wiki/cs-text-medium' that creates a list of unique article names and, at the same time, a list of unique words.
 + 
 +{{:courses:mapreduce-tutorial:step-8-solution.txt|Solution.pl}}
