Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
courses:mapreduce-tutorial:step-8 [2012/01/25 14:46] straka created |
courses:mapreduce-tutorial:step-8 [2012/01/31 15:55] (current) straka |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== MapReduce Tutorial : Multiple reducers and partitioning ====== | + | ====== MapReduce Tutorial : Multiple |
+ | |||
+ | A Hadoop job, which is expected to run on many computers at the same time, needs to use multiple mappers and reducers. The number of each can be controlled to some degree. | ||
+ | |||
+ | ===== Multiple mappers ===== | ||
+ | The number of mappers is determined automatically from the sizes of the input files. Every input file is divided into //splits//; the default split size is 32MB. Each split is then processed by a different mapper. | ||
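As a rough illustration of the paragraph above, the number of mappers can be estimated from the input file sizes. This is only a simplified sketch: it assumes that every file is cut independently into splits of at most the split size and that each split gets one mapper. The helper name ''estimate_mappers'' is hypothetical and not part of the tutorial's framework; the count actually chosen by Hadoop can differ slightly.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical helper: estimates the number of mappers for a list of
# input file sizes (in bytes), assuming one mapper per split and that
# splits never span several files.
sub estimate_mappers {
    my ($split_size, @file_sizes) = @_;
    my $mappers = 0;
    for my $size (@file_sizes) {
        next if $size == 0;    # assumption: empty files produce no split
        $mappers += ceil($size / $split_size);
    }
    return $mappers;
}

my $MB = 1024 * 1024;

# Three files of 100MB, 32MB and 1MB with a 32MB split size:
# 4 splits + 1 split + 1 split.
print estimate_mappers(32 * $MB, 100 * $MB, 32 * $MB, 1 * $MB), "\n";    # prints 6
```

Under this model, many small files produce many mappers, because every non-empty file contributes at least one split.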
+ | |||
+ | The size of file split can be overridden by '' | ||
+ | |||
+ | ===== Multiple reducers ===== | ||
+ | The number of reducers is specified by the job, defaulting to one if unspecified. As the outputs of reducers are not merged, there are as many output files as reducers. | ||
+ | |||
+ | To use multiple reducers, the MR job must be executed by a cluster (even with one computer), not locally. The number of reducers is specified by '' | ||
+ | perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers] | ||
+ | |||
+ | The optimal number of reducers equals the number of machines in the cluster, so that all reducers can run in parallel. | ||
+ | |||
+ | ==== Partitioning ==== | ||
+ | When there are multiple reducers, the way the (key, value) pairs are distributed among them matters. | ||
+ | |||
+ | By default, a (key, value) pair is sent to the reducer with number //hash(key) modulo number_of_reducers// |
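The default rule can be sketched in a few lines of Perl. This is an illustration only: ''simple_hash'' is a hypothetical stand-in for whatever hash function Hadoop applies to keys, and ''default_partition'' is not a function of the tutorial's framework.

```perl
use strict;
use warnings;

# Hypothetical string hash, for illustration only; Hadoop applies its
# own hash function to keys.
sub simple_hash {
    my ($key) = @_;
    my $hash = 0;
    for my $char (split //, $key) {
        # Keep the value below 2**31 so it stays a non-negative integer.
        $hash = ($hash * 31 + ord($char)) % 2147483648;
    }
    return $hash;
}

# Reducer index for a key: hash(key) modulo number_of_reducers.
sub default_partition {
    my ($key, $number_of_reducers) = @_;
    return simple_hash($key) % $number_of_reducers;
}

for my $key (qw(apple banana cherry apple)) {
    printf "%s -> reducer %d\n", $key, default_partition($key, 3);
}
```

Because the reducer index depends only on the key, all pairs with the same key reach the same reducer. How evenly the keys spread over the reducers depends on the hash function and on the keys themselves.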
+ | |||
+ | To override the default behaviour, MR job can specify a // | ||
+ | |||
+ | A partitioner should be provided if | ||
+ | * the default partitioner fails to distribute the data evenly between reducers, i.e., some of the reducers operate on much more data than others. | ||
+ | * you need explicit control of (key, value) placement. This can happen for example when [[.: |
+ | |||
+ | <code perl> | ||
+ | package My:: | ||
+ | use Moose; | ||
+ | with ' | ||
+ | |||
+ | sub getPartition { | ||
+ |   my ($self, $key, $value, $partitions) = @_; | ||
+ | | ||
+ |   # Assumes integer keys; the result is a reducer index in 0 .. $partitions - 1. | ||
+ |   return $key % $partitions; | ||
+ | } | ||
+ | |||
+ | ... | ||
+ | package main; | ||
+ | use Hadoop:: | ||
+ | |||
+ | my $runner = Hadoop:: | ||
+ | ... | ||
+ | partitioner => My:: | ||
+ | ...); | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | An MR job must have a reducer if it specifies a partitioner. Also, the partitioner is not called if there is only one reducer. | ||
+ | |||
+ | ===== The order of keys during reduce ===== | ||
+ | It is guaranteed that every reducer processes the keys in //ascending lexicographic order//. | ||
+ | |||
+ | On the other hand, the order of values belonging to one key is undefined. | ||
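Lexicographic order compares keys as strings, which matters when the keys are numbers. The sketch below uses Perl's default string ''sort'' to show the effect. It assumes that the keys are compared as plain strings, and the zero-padding is a hypothetical workaround rather than something the framework requires.

```perl
use strict;
use warnings;

my @keys = qw(9 10 100 2);

# Perl's default sort compares strings, i.e. lexicographically.
my @lexicographic = sort @keys;
print "@lexicographic\n";    # 10 100 2 9

# Hypothetical workaround: zero-pad numeric keys to a fixed width so
# that lexicographic order coincides with numeric order.
my @padded = sort map { sprintf "%05d", $_ } @keys;
print "@padded\n";           # 00002 00009 00010 00100
```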
+ | |||
+ | ===== Exercise ===== | ||
+ | |||
+ | Run one MR job on '/ | ||
+ | wget --no-check-certificate ' | ||
+ | # NOW EDIT THE FILE | ||
+ | # $EDITOR step-8-exercise.pl | ||
+ | rm -rf step-8-out-ex; | ||
+ | less step-8-out-ex/ | ||
+ | |||
+ | ==== Solution ==== | ||
+ | You can also download the solution {{: | ||
+ | wget --no-check-certificate ' | ||
+ | # NOW VIEW THE FILE | ||
+ | # $EDITOR step-8-solution.pl | ||
+ | rm -rf step-8-out-sol; | ||
+ | less step-8-out-sol/ | ||
+ | |||
+ | |||