===== Day 1 =====

Today we will be using the [[.:mapreduce-tutorial:Perl API]] (there is no need to study it now, the tutorial will explain it).

=== Environment ===
  * [[.:mapreduce-tutorial:Step 1]]: Setting the environment.

=== MapReduce basics ===
  * [[.:mapreduce-tutorial:Step 2]]: Input and output format, testing data.
  * [[.:mapreduce-tutorial:Step 3]]: Basic mapper.
  * [[.:mapreduce-tutorial:Step 4]]: Counters.
  * [[.:mapreduce-tutorial:Step 5]]: Basic reducer.

=== Controlling the cluster ===
  * [[.:mapreduce-tutorial:Step 6]]: Running on a cluster.
  * [[.:mapreduce-tutorial:Step 7]]: Dynamic Hadoop cluster for several computations.

=== MapReduce extended ===
From now on, it is best to run MR jobs on a one-machine cluster -- create one using ''hadoop-cluster'' for 3h (10800s) and run jobs using ''-jt cluster_master''. Running the scripts locally without any cluster has several disadvantages, most notably having only one reducer per job.
  * [[.:mapreduce-tutorial:Step 8]]: Multiple mappers, reducers and partitioning.
  * [[.:mapreduce-tutorial:Step 9]]: Hadoop properties.
  * [[.:mapreduce-tutorial:Step 10]]: Combiners.
  * [[.:mapreduce-tutorial:Step 11]]: Initialization and cleanup of MR tasks, performance of combiners.
  * [[.:mapreduce-tutorial:Step 12]]: Additional output from mappers and reducers.

=== Advanced MapReduce exercises ===
The exercises in this section can be done in any order, but it is recommended to try solving all of them. The [[.:mapreduce-tutorial:Perl API|Perl API reference]] may come in handy.
  * [[.:mapreduce-tutorial:Step 13]]: Sorting.
  * [[.:mapreduce-tutorial:Step 14]]: N-gram language model.
  * [[.:mapreduce-tutorial:Step 15]]: K-means clustering.

===== Day 2 =====

Today we will be using the [[http://hadoop.apache.org/common/docs/r1.0.0/api/index.html|Java API]].

=== Environment ===
  * [[.:mapreduce-tutorial:Step 21]]: Preparing the environment.
  * [[.:mapreduce-tutorial:Step 22]]: Optional -- setting up Eclipse.
=== Java Hadoop basics ===
  * [[.:mapreduce-tutorial:Step 23]]: Predefined formats and types.
  * [[.:mapreduce-tutorial:Step 24]]: Mappers, running Java Hadoop jobs.
  * [[.:mapreduce-tutorial:Step 25]]: Reducers, combiners and partitioners.
  * [[.:mapreduce-tutorial:Step 26]]: Counters and job configuration.

=== Custom data types and formats ===
  * Custom data types -- Pair<A, B>, BerIntWritable.
  * Custom input formats -- WholeFile and WholeFileAsPath.

=== Exercises ===
  * Inverted index.
  * Are [[.:mapreduce-tutorial:Step 13]], [[.:mapreduce-tutorial:Step 14]] and [[.:mapreduce-tutorial:Step 15]] enough?

===== Other =====
  * [[user:majlis:hadoop|Further information]]