Differences
This shows you the differences between two versions of the page.
courses:mapreduce-tutorial:step-8 [2012/01/25 14:47] straka |
courses:mapreduce-tutorial:step-8 [2012/01/25 15:00] straka |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ====== | ====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ====== | ||
+ | In order to achieve parallelism, | ||
+ | ===== Multiple mappers ===== | ||
+ | The number of mappers is determined automatically from the sizes of the input files. Every input file is divided into //splits//; the default split size is 32MB. Each split is then processed by a different mapper. | ||
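The relationship between file sizes and the number of mappers can be sketched as follows. This is only an illustration in Python, not the tutorial's Perl API: it assumes a plain ceiling rule (one split per started 32MB of each file), and the function names and file sizes are hypothetical. A real framework may round the last split differently.

```python
MB = 1024 * 1024
DEFAULT_SPLIT_SIZE = 32 * MB  # default split size stated in the text


def count_splits(file_size, split_size=DEFAULT_SPLIT_SIZE):
    """Number of splits for one file, assuming a simple ceiling rule."""
    if file_size <= 0:
        return 0
    # Integer ceiling division: every started split_size chunk is one split.
    return (file_size + split_size - 1) // split_size


def count_mappers(file_sizes, split_size=DEFAULT_SPLIT_SIZE):
    """Each split is processed by its own mapper, so sum the splits of all files."""
    return sum(count_splits(size, split_size) for size in file_sizes)


# Hypothetical input: three files of 100MB, 10MB and 32MB.
sizes = [100 * MB, 10 * MB, 32 * MB]
print(count_mappers(sizes))  # 4 + 1 + 1 = 6
```

Under this model, lowering the split size increases the number of mappers, and files are never merged into a shared split.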
+ | |||
+ | The size of a file split can be overridden by '' | ||
+ | |||
+ | ===== Multiple reducers ===== | ||
+ | The number of reducers is specified by the job; the default is one. Because the outputs of the reducers are not merged, there are as many output files as reducers. | ||
+ | |||
+ | To use multiple reducers, the MR job must be executed by a cluster (a cluster consisting of one computer is enough), not locally. The number of reducers is specified by '' | ||
+ | perl script.pl [-j cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers] | ||
+ | |||
+ | ==== Partitioning ==== | ||
+ | When there are multiple reducers, it is important how the (key, value) pairs are distributed between the reducers. | ||
+ | |||
+ | By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers// | ||
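The default rule can be illustrated with a short Python sketch. The text does not say which hash function the framework uses, so `zlib.crc32` is an assumption chosen only because it gives the same value on every run; the function name `default_partition` and the sample pairs are hypothetical as well.

```python
import zlib


def default_partition(key, number_of_reducers):
    """Reducer index for a key: hash(key) modulo number_of_reducers."""
    # crc32 stands in for the framework's hash, which the text does not name.
    return zlib.crc32(key.encode("utf-8")) % number_of_reducers


# Hypothetical mapper output.
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
number_of_reducers = 3

# Group the pairs by the reducer that would receive them.
buckets = {r: [] for r in range(number_of_reducers)}
for key, value in pairs:
    buckets[default_partition(key, number_of_reducers)].append((key, value))

for reducer, received in buckets.items():
    print(reducer, received)
```

The property that matters is that equal keys always produce the same index, so all values of one key reach the same reducer. How evenly the keys spread over reducers depends on the hash function and on the keys themselves.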
+ | |||
+ | To override the default behaviour, the MR job can specify a // | ||
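As a rough picture of what overriding the default distribution can mean, the Python sketch below maps a key to a reducer index by alphabetical range instead of by hash. It does not show the tutorial's Perl interface, which is not visible in this revision; the function name and the routing rule are hypothetical.

```python
def first_letter_partition(key, number_of_reducers):
    """Hypothetical rule: route keys to reducers by alphabetical range of the first letter."""
    if not key or not key[0].isascii() or not key[0].isalpha():
        return 0  # keys without a leading ASCII letter go to the first reducer
    position = ord(key[0].lower()) - ord("a")  # 0 for 'a' .. 25 for 'z'
    # Scale the 26 letters onto the available reducers.
    return min(position * number_of_reducers // 26, number_of_reducers - 1)


for key in ["apple", "mango", "nut", "zebra"]:
    print(key, first_letter_partition(key, 2))
```

With two reducers, keys starting with a to m go to reducer 0 and keys starting with n to z go to reducer 1, so each output file covers a contiguous alphabetical range. The cost of such a rule is that the load is only as balanced as the key distribution.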