Differences
This shows you the differences between two versions of the page.
courses:mapreduce-tutorial:step-8 [2012/01/28 15:36] majlis |
courses:mapreduce-tutorial:step-8 [2012/01/31 15:55] (current) straka |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ====== | ====== MapReduce Tutorial : Multiple mappers, reducers and partitioning ====== | ||
- | In order to achieve parallelism, mappers and reducers | + | A Hadoop job, which is expected |
===== Multiple mappers ===== | ===== Multiple mappers ===== | ||
Line 12: | Line 12: | ||
To use multiple reducers, the MR job must be executed on a cluster (even a one-machine cluster), not locally. The number of reducers is specified by '' | To use multiple reducers, the MR job must be executed on a cluster (even a one-machine cluster), not locally. The number of reducers is specified by '' | ||
- | perl script.pl | + | perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers] |
+ | |||
+ | The optimal number of reducers equals the number of machines in the cluster, so that all the reducers can run in parallel at the same time. | ||
==== Partitioning ==== | ==== Partitioning ==== | ||
Line 19: | Line 21: | ||
By default, a (key, value) pair is sent to the reducer with number //hash(key) modulo number_of_reducers// | By default, a (key, value) pair is sent to the reducer with number //hash(key) modulo number_of_reducers// | ||
- | To override the default behaviour, an MR job can specify a // | + | To override the default behaviour, an MR job can specify a // |
+ | |||
+ | A partitioner should be provided if | ||
+ | * the default partitioner fails to distribute the data evenly across the reducers, i.e., some reducers process much more data than others. | ||
+ | * you need explicit control over (key, value) placement. This can happen, for example, when [[.:step-13|sorting data]]. | ||
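The default rule described above, and the way it can load reducers unevenly, can be illustrated with a small standalone Perl sketch. It does not use the tutorial's Hadoop classes: `toy_hash`, `default_partition` and `reducer_loads` are hypothetical names introduced only for this illustration, and the hash function is a simple stand-in, not necessarily the one the framework uses.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the framework's hash function: a simple
# polynomial string hash. The hash actually used by Hadoop may differ.
sub toy_hash {
    my ($key) = @_;
    my $h = 0;
    $h = ($h * 31 + ord($_)) % 2**31 for split //, $key;
    return $h;
}

# Default rule from the text: reducer = hash(key) modulo number_of_reducers.
sub default_partition {
    my ($key, $number_of_reducers) = @_;
    return toy_hash($key) % $number_of_reducers;
}

# Count how many (key, value) pairs each reducer would receive.
sub reducer_loads {
    my ($number_of_reducers, @keys) = @_;
    my @load = (0) x $number_of_reducers;
    $load[ default_partition($_, $number_of_reducers) ]++ for @keys;
    return @load;
}

# One frequent key plus a few rare ones: every pair with the frequent key
# lands on the same reducer, so that reducer gets most of the data.
my @keys = (('the') x 6, qw(map reduce hadoop perl));
my @load = reducer_loads(3, @keys);
print "reducer $_: $load[$_] pairs\n" for 0 .. $#load;
```

Because all pairs sharing a key must reach the same reducer, no hash function can split a single very frequent key; this is the kind of skew where a custom partitioner may help.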
<code perl> | <code perl> | ||
- | package Partitioner; | + | package |
use Moose; | use Moose; | ||
with ' | with ' | ||
Line 33: | Line 39: | ||
... | ... | ||
- | package | + | package |
use Hadoop:: | use Hadoop:: | ||
my $runner = Hadoop:: | my $runner = Hadoop:: | ||
... | ... | ||
- | partitioner => Partitioner-> | + | partitioner => My::Partitioner-> |
...); | ...); | ||
... | ... | ||
Line 46: | Line 52: | ||
===== The order of keys during reduce ===== | ===== The order of keys during reduce ===== | ||
- | It is guaranteed that every reducer processes the keys in //ascending order//. | + | It is guaranteed that every reducer processes the keys in // |
On the other hand, the order of values belonging to one key is undefined. | On the other hand, the order of values belonging to one key is undefined. | ||
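The ordering guarantee can be pictured with a toy simulation of the reduce phase. This is only a sketch under stated assumptions, not the framework's implementation: `simulate_reduce_order` is a hypothetical helper, the partitioner is passed in as a plain code reference, and keys are assumed to be compared as strings.

```perl
use strict;
use warnings;

# Toy simulation: group (key, value) pairs by reducer, then list the keys
# in the order in which each reducer would process them.
# $partition is a hypothetical code reference mapping a key to a reducer index.
sub simulate_reduce_order {
    my ($number_of_reducers, $partition, @pairs) = @_;
    my @groups = map { +{} } 1 .. $number_of_reducers;
    for my $pair (@pairs) {
        my ($key, $value) = @$pair;
        # Values are collected in arrival order; a reducer must not rely on it.
        push @{ $groups[ $partition->($key) ]{$key} }, $value;
    }
    # Each reducer visits its own keys in ascending (string) order.
    return map { my $g = $_; [ sort keys %$g ] } @groups;
}

my @pairs = map { [ $_, 1 ] } qw(pear apple fig kiwi apple date);
my @order = simulate_reduce_order(2, sub { length($_[0]) % 2 }, @pairs);
print "reducer $_: @{ $order[$_] }\n" for 0 .. $#order;
```

Note that the keys are sorted only within one reducer. Concatenating the outputs of all reducers yields globally sorted data only if the partitioner itself sends smaller keys to lower-numbered reducers.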
- | ===== Example | + | ===== Exercise |
+ | |||
+ | Run one MR job on '/ | ||
+ | wget --no-check-certificate ' | ||
+ | # NOW EDIT THE FILE | ||
+ | # $EDITOR step-8-exercise.pl | ||
+ | rm -rf step-8-out-ex; | ||
+ | less step-8-out-ex/ | ||
- | Run a MR job on '/ | + | ==== Solution ==== |
+ | You can also download the solution {{: | ||
+ | wget --no-check-certificate | ||
+ | # NOW VIEW THE FILE | ||
+ | # $EDITOR step-8-solution.pl | ||
+ | rm -rf step-8-out-sol; | ||
+ | less step-8-out-sol/ | ||