Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-8 [2012/01/25 14:47]
straka
+++ courses:mapreduce-tutorial:step-8 [2012/01/25 15:00]
straka
@@ Line 1: / Line 1: @@
 ====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ======
+To achieve parallelism, a job must run multiple mappers and reducers at the same time.
+===== Multiple mappers =====
+The number of mappers is determined automatically from the sizes of the input files. Every input file is divided into //splits//; the default split size is 32MB. Each split is then processed by a different mapper.
+The split size can be overridden with ''mapred.min.split.size'' and ''mapred.max.split.size''. See the next tutorial step for how to set these flags.
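As a rough illustration of how input sizes translate into mappers, the following Python sketch estimates the split count. It assumes each file is cut independently into chunks of at most the split size and ignores edge cases such as empty files; the function name ''count_splits'' is hypothetical and not part of the tutorial's API.

```python
SPLIT_SIZE = 32 * 1024 * 1024  # default split size stated above (32MB)

def count_splits(file_sizes, split_size=SPLIT_SIZE):
    """Estimate the number of splits (one mapper per split).

    Assumption: every file is split on its own, so a split never
    spans two files. Sizes are given in bytes.
    """
    # -(-a // b) is integer ceiling division, avoiding float rounding.
    return sum(-(-size // split_size) for size in file_sizes)

# A 100MB file needs 4 splits and a 10MB file needs 1.
mappers = count_splits([100 * 1024 * 1024, 10 * 1024 * 1024])
```

Under these assumptions, many small files yield at least one mapper each, while one large file yields roughly its size divided by the split size.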
+===== Multiple reducers =====
+The number of reducers is specified by the job; the default is one. Because the outputs of the reducers are not merged, there are as many output files as reducers.
+To use multiple reducers, the MR job must be executed by a cluster (even one consisting of a single computer), not locally. The number of reducers is specified by the ''-r'' flag:
+  perl script.pl [-j cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]
+==== Partitioning ====
+When there are multiple reducers, it is important how the (key, value) pairs are distributed between the reducers.
+By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//. This guarantees that all values for a given key are processed by the same reducer.
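The default rule can be sketched in a few lines of Python. The hash function the framework actually uses is not stated here, so ''zlib.crc32'' serves as a stable stand-in; ''default_partition'' is a hypothetical name used only for illustration.

```python
import zlib

def default_partition(key, number_of_reducers):
    """Return the reducer index for a key: hash(key) modulo number_of_reducers."""
    # Stand-in hash (assumption): the framework's real hash may differ.
    return zlib.crc32(key.encode("utf-8")) % number_of_reducers

pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("plum", 1), ("pear", 1)]

# Group the pairs by the reducer they would be sent to.
buckets = {}
for key, value in pairs:
    buckets.setdefault(default_partition(key, 3), []).append((key, value))
```

Because the reducer index depends only on the key, both ''apple'' pairs land in the same bucket, whichever bucket that happens to be.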
+To override the default behaviour, an MR job can specify a //partitioner//.
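One possible reason to replace the default is to keep keys in alphabetical ranges, so that reducer 0 receives the earliest keys and the last reducer the latest. The Python sketch below shows the idea only; ''first_letter_partitioner'' is a hypothetical function, not the interface the tutorial's framework expects.

```python
def first_letter_partitioner(key, number_of_reducers):
    """Map keys to reducers by contiguous ranges of their first letter (a..z)."""
    first = key[:1].lower()
    if not ("a" <= first <= "z"):
        # Assumption: empty and non-alphabetic keys all go to reducer 0.
        return 0
    # Scale the letter index 0..25 into the range 0..number_of_reducers-1.
    return (ord(first) - ord("a")) * number_of_reducers // 26
```

With two reducers, keys starting with a to m go to reducer 0 and keys starting with n to z go to reducer 1. As with the default rule, the result depends only on the key, so all values of one key still reach the same reducer.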


Institute of Formal and Applied Linguistics Wiki
