MapReduce Tutorial : Multiple mappers, reducers and partitioning
To achieve parallelism, a job must run multiple mappers and multiple reducers at the same time.
Multiple mappers
The number of mappers is determined automatically from the sizes of the input files. Every input file is divided into splits; the default split size is 32MB. Each file split is then processed by a different mapper.
The split size can be overridden with the mapred.min.split.size and mapred.max.split.size flags. See the next tutorial step for how to set these flags.
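The relation between file sizes and the number of mappers can be sketched as follows. This is only an approximation under the assumption that every file is cut into splits of at most the split size and that each split gets one mapper; the function name count_splits and the file sizes are made up for illustration, and the framework may treat edge cases (such as a very small last split) differently.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Estimate the number of mappers: each file is cut into splits of at most
# $split_size bytes and every split is processed by its own mapper.
# (Hypothetical helper, not part of the framework.)
sub count_splits {
    my ($split_size, @file_sizes) = @_;
    my $total = 0;
    for my $size (@file_sizes) {
        # A file that is not an exact multiple of the split size
        # needs one extra, shorter split; an empty file adds nothing here.
        $total += ceil($size / $split_size);
    }
    return $total;
}

my $MB = 1024 * 1024;

# Hypothetical input: files of 100MB, 32MB and 1MB with the default 32MB split.
my $mappers = count_splits(32 * $MB, 100 * $MB, 32 * $MB, 1 * $MB);
print "$mappers\n";    # prints 6 (4 + 1 + 1 splits)
```

Under these assumptions, lowering the maximum split size increases the number of mappers, and raising the minimum split size decreases it.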
Multiple reducers
The number of reducers is specified by the job; the default is one. Because the outputs of the reducers are not merged, there are as many output files as reducers.
To use multiple reducers, the MR job must be executed by a cluster (a cluster consisting of a single computer is sufficient), not locally. The number of reducers is specified by the -r flag:
perl script.pl [-j cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]
Partitioning
When there are multiple reducers, it matters how the (key, value) pairs are distributed among them.
By default, a (key, value) pair is sent to the reducer number hash(key) modulo number_of_reducers. This guarantees that all values belonging to one key are processed by the same reducer.
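The default rule can be illustrated with a small standalone sketch. The hash function actually used by the framework is not described in this tutorial, so simple_hash below is an assumed stand-in (a plain polynomial string hash), and default_partition is a hypothetical name. The point is only that the reducer number depends on nothing but the key.

```perl
use strict;
use warnings;

# Deterministic string hash, an assumed stand-in for the framework's hash.
# The running value is kept small so it always fits into a native integer.
sub simple_hash {
    my ($key) = @_;
    my $hash = 0;
    for my $char (split //, $key) {
        $hash = ($hash * 31 + ord($char)) % 1_000_003;
    }
    return $hash;
}

# Zero-based reducer number for a key: hash(key) modulo number_of_reducers.
sub default_partition {
    my ($key, $reducers) = @_;
    return simple_hash($key) % $reducers;
}

my $reducers = 3;
for my $key (qw(apple banana apple cherry)) {
    # Both occurrences of "apple" are sent to the same reducer.
    printf "%s -> reducer %d\n", $key, default_partition($key, $reducers);
}
```

Because the result is a pure function of the key, repeated keys always land on the same reducer, but different keys may share a reducer and the load is balanced only as well as the hash spreads the keys.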
To override the default behaviour, an MR job can specify a partitioner. The partitioner is given each (key, value) pair produced by a mapper together with the number of reducers, and returns the zero-based number of the reducer to which this (key, value) pair belongs:
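  package Partitioner;
  use Moose;
  with 'Hadoop::Partitioner';

  # Called for every (key, value) pair produced by a mapper.
  sub getPartition {
    my ($self, $key, $value, $partitions) = @_;

    # Numeric keys are assumed here; returns a number from 0 to $partitions - 1.
    return $key % $partitions;
  }

  ...
  package Main;
  use Hadoop::Runner;

  my $runner = Hadoop::Runner->new(
    ...
    partitioner => Partitioner->new(),
    ...);
  ...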
An MR job must have a reducer if it specifies a partitioner. Also, the partitioner is not called when there is only one reducer.
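The modulo partitioner above only makes sense for numeric keys. As a further illustration, here is a minimal sketch of partitioning logic for string keys that assigns keys to reducers by their first letter, so that the reducer outputs, taken in order, cover the alphabet from a to z. It is written as a plain subroutine so that it runs on its own; letter_partition is a hypothetical name, and in a real job the body would presumably go into getPartition of a class consuming the Hadoop::Partitioner role.

```perl
use strict;
use warnings;

# Standalone sketch of partitioner logic for string keys.
# Arguments mirror getPartition without $self: key, value, number of reducers.
sub letter_partition {
    my ($key, $value, $partitions) = @_;
    my $first = lc substr($key, 0, 1);

    # Keys that do not start with a letter a-z go to reducer 0
    # (an arbitrary choice made for this sketch).
    return 0 unless $first =~ /^[a-z]$/;

    # Map the 26 letters proportionally onto 0 .. $partitions - 1.
    return int((ord($first) - ord('a')) * $partitions / 26);
}

for my $key (qw(apple mango zebra)) {
    printf "%s -> reducer %d\n", $key, letter_partition($key, 1, 4);
}
# With 4 reducers this prints:
#   apple -> reducer 0
#   mango -> reducer 1
#   zebra -> reducer 3
```

Such a partitioner still sends all values of one key to a single reducer, but unlike hashing it can produce unbalanced reducers when many keys start with the same letter.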