courses:mapreduce-tutorial:step-8 — created 2012/01/25 14:46, last modified 2012/01/29 21:01 by straka
====== MapReduce Tutorial : Multiple reducers and partitioning ======
+ | |||
+ | It is important for a job, which should run on many computers at the same time, to use multiple mappers and reducers. It is possible to control these numbers to some degree. | ||
+ | |||
+ | ===== Multiple mappers ===== | ||
+ | The number of mappers is determined automatically according to input files sizes. Every input file is divided into //splits//. The default split size is 32MB. Every file split is then executed by a different mapper. | ||
+ | |||
+ | The size of file split can be overridden by '' | ||
+ | |||

===== Multiple reducers =====
The number of reducers is specified by the job, defaulting to one if unspecified. As the outputs of the reducers are not merged, there are as many output files as there are reducers.

To use multiple reducers, the MR job must be executed by a cluster (even a single-machine one), not locally. The number of reducers is specified by the ''-r'' option:
  perl script.pl run [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]

The optimal number of reducers equals the number of machines in the cluster, so that all the reducers can run in parallel at the same time.

==== Partitioning ====
When there are multiple reducers, it is important how the (key, value) pairs are distributed among the reducers.

By default, a (key, value) pair is sent to reducer number //hash(key) modulo number_of_reducers//.

To override the default behaviour, the MR job can specify a //partitioner//: given a key, a value and the number of partitions, it returns the number of the partition (i.e., the reducer) the pair should be sent to:

<code perl>
package Partitioner;
use Moose;
with 'Hadoop::Partitioner';

sub getPartition {
  my ($self, $key, $value, $partitions) = @_;

  return $key % $partitions;
}

...
package Main;
use Hadoop::Runner;

my $runner = Hadoop::Runner->new(
  ...
  partitioner => Partitioner->new(),
  ...);
...
</code>
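The ''getPartition'' contract can also be exercised outside Hadoop. The following stand-in class (hypothetical: plain Perl instead of Moose, no Hadoop involved) implements the same //key modulo partitions// logic so it can be run and checked locally:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the Moose-based Partitioner above:
# same getPartition signature, same modulo logic, but dependency-free.
package ModuloPartitioner;
sub new { return bless {}, shift }

sub getPartition {
    my ($self, $key, $value, $partitions) = @_;
    return $key % $partitions;    # numeric keys assumed
}

package main;
my $p = ModuloPartitioner->new;
print $p->getPartition(10, "some value", 3), "\n";    # 10 % 3 -> 1
```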
+ | |||
+ | A MR job must have a reducer if it specifies a partitioner. Also, the partitioner is not called if there is only one reducer. | ||
+ | |||
===== The order of keys during reduce =====
It is guaranteed that every reducer processes the keys in //ascending order//.

On the other hand, the order of values belonging to one key is undefined.

===== Exercise =====

Run one MR job on ''/
  wget --no-check-certificate '
  rm -rf step-8-out-ex;
  less step-8-out-ex/

==== Solution ====
You can also download the solution {{:
  wget --no-check-certificate '
  rm -rf step-8-out-sol;
  less step-8-out-sol/
+ | |||
+ | ---- | ||
+ | |||
+ | < | ||
+ | <table style=" | ||
+ | < | ||
+ | <td style=" | ||
+ | <td style=" | ||
+ | <td style=" | ||
+ | </ | ||
+ | </ | ||
+ | </ |