====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ======
A Hadoop job, which is expected to run on multiple machines in parallel, achieves this parallelism by using multiple mappers and reducers.
===== Multiple mappers =====
The number of mappers is determined automatically according to the sizes of the input files. Every input file is divided into //splits//. The default split size is 32MB. Every file split is then executed by a different mapper.
The size of a file split can be overridden by a job configuration option.
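To make the relation between input sizes and the number of mappers concrete, here is a small self-contained Perl sketch (it is not part of the tutorial's Hadoop scripts, and the file sizes in it are made up) that counts the splits, and therefore the mappers, produced with the default 32MB split size mentioned above:
<code perl>
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(ceil);

my $split_size = 32 * 1024 * 1024;   # the default split size from the text above

# Hypothetical input file sizes in bytes (illustration only).
my @file_sizes = (10 * 2**20, 100 * 2**20, 33 * 2**20);

# Every input file is divided into splits independently, so the number of
# mappers is the sum of per-file split counts, not ceil(total / split_size).
my $mappers = 0;
$mappers += ceil($_ / $split_size) for @file_sizes;

print "splits (= mappers): $mappers\n";   # prints 7 (1 + 4 + 2)
</code>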
===== Multiple reducers =====
The number of reducers is specified by the job, defaulting to one if unspecified. As the outputs of reducers are not merged, there are as many output files as reducers.
To use multiple reducers, the MR job must be executed by a cluster (even with one computer), not locally. The number of reducers is specified by the ''-r'' option of the script:
  perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers]

The optimal number of reducers is the same as the number of machines in the cluster, so that all the reducers can run in parallel at the same time.
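For instance, assuming a script that follows the usage line above (the cluster size, waiting time and number of reducers below are chosen only for illustration), running with ten reducers on a ten-machine cluster could look like this:
  # allocate a cluster of 10 machines (waiting up to 600 s for it) and use 10 reducers
  perl script.pl -c 10 -w 600 -r 10
With fewer reducers some machines would stay idle during the reduce phase; with more, not all reducers could run in parallel.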
==== Partitioning ====
When there are multiple reducers, it is important how the (key, value) pairs are distributed between the reducers.
By default, a (key, value) pair is sent to the reducer number //hash(key) modulo number_of_reducers//.
To override the default behaviour, an MR job can specify a //partitioner//, which assigns a reducer to every (key, value) pair.

A partitioner should be provided if
  * the default partitioner fails to distribute the data between the reducers equally, i.e., some of the reducers operate on much more data than others,
  * you need explicit control of (key, value) placement. This can happen for example when [[.:step-13|sorting data]].
<code perl>
package My::Partitioner;
use Moose;
with 'Hadoop::Partitioner';
...
package main;
use Hadoop::Runner;
my $runner = Hadoop::Runner->new(
  ...
  partitioner => My::Partitioner->new(),
  ...);
...
</code>
An MR job must have a reducer if it specifies a partitioner. Also, the partitioner is not called if there is only one reducer.
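To illustrate the default //hash(key) modulo number_of_reducers// rule in isolation, here is a standalone Perl sketch. The hash function in it is invented for the example and is not the hash Hadoop actually uses; the point is only how a key is mapped to a reducer number:
<code perl>
#!/usr/bin/perl
use strict;
use warnings;

# Toy string hash, for illustration only (NOT Hadoop's real hash function).
sub toy_hash {
  my ($key) = @_;
  my $h = 0;
  $h = ($h * 31 + ord($_)) % 2**31 for split //, $key;
  return $h;
}

my $number_of_reducers = 4;
for my $key (qw(apple banana cherry date)) {
  # Default behaviour: the (key, value) pair goes to reducer hash(key) mod number_of_reducers.
  my $reducer = toy_hash($key) % $number_of_reducers;
  print "key '$key' -> reducer $reducer\n";
}
</code>
A custom partitioner replaces exactly this computation with its own mapping of keys to reducer numbers.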
===== The order of keys during reduce =====
It is guaranteed that every reducer processes the keys in //ascending lexicographic order//.
On the other hand, the order of values belonging to one key is undefined.
===== Exercise =====
Run one MR job on '/

  wget --no-check-certificate '
  # NOW EDIT THE FILE
  # $EDITOR step-8-exercise.pl
  rm -rf step-8-out-ex;
  less step-8-out-ex/
==== Solution ====
You can also download the solution {{:

  wget --no-check-certificate '
  # NOW VIEW THE FILE
  # $EDITOR step-8-solution.pl
  rm -rf step-8-out-sol;
  less step-8-out-sol/
| + | |||
| + | |||
| + | ---- | ||
| + | |||
| + | < | ||
| + | <table style=" | ||
| + | < | ||
| + | <td style=" | ||
| + | <td style=" | ||
| + | <td style=" | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
