
Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:hadoop-job-overview [2012/02/06 06:11] (current)
straka
====== MapReduce Tutorial : Hadoop job overview ======

A regular Hadoop job consists of:
  * [required] //a mapper// -- processes input (key, value) pairs and produces (key, value) pairs. There can be multiple mappers: each input file is divided into splits (32MB by default) and each split is processed by one mapper. The script ''/net/projects/hadoop/bin/compute-splitsize input nr_of_mappers'' can be used to compute a split size such that the resulting job consists of the specified number of mappers.
  * [optional] //a reducer// -- processes the keys in ascending order, each together with all its associated values, and produces (key, value) pairs. The user can specify the number of reducers: 0, 1 or more; the default is 1.
  * [optional] //a combiner// -- a reducer which is executed locally on the output of a mapper.
  * [optional] //a partitioner// -- executed on every (key, value) pair produced by a mapper, it outputs the number of the reducer which should process the pair. When no partitioner is specified, the partition is derived from the hash of the key.
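To illustrate how these components fit together, the following pure-Java sketch simulates a word-count job without Hadoop: a mapper emits (word, 1) pairs for each split, a partitioner assigns each key to a reducer using the hash of the key (the same rule Hadoop's default partitioner applies), and each reducer then processes its keys in ascending order. All class and method names here are illustrative, not part of the Hadoop API.

```java
import java.util.*;

public class MiniMapReduce {
    // Mapper: splits an input line into words, emits (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+"))
            if (!word.isEmpty())
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
        return out;
    }

    // Partitioner: hash of the key modulo the number of reducers
    // (the rule Hadoop uses when no partitioner is specified).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Reducer: receives a key with all its associated values; here it sums them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] splits = { "hello world", "hello hadoop world" };
        int numReducers = 2;

        // Shuffle phase: group mapper output by partition, then by key;
        // a TreeMap keeps each reducer's keys in ascending order.
        List<TreeMap<String, List<Integer>>> groups = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) groups.add(new TreeMap<>());
        for (String split : splits)                       // one mapper per split
            for (Map.Entry<String, Integer> kv : map(split))
                groups.get(partition(kv.getKey(), numReducers))
                      .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());

        // Reduce phase: each reducer sees its keys in ascending order.
        for (int r = 0; r < numReducers; r++)
            for (Map.Entry<String, List<Integer>> e : groups.get(r).entrySet())
                System.out.println(e.getKey() + "\t"
                                   + reduce(e.getKey(), e.getValue()));
    }
}
```

A combiner would be the same ''reduce'' function applied to each mapper's local output before the shuffle, cutting the amount of data sent to the reducers.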

An AllReduce Hadoop job ([[.:step-16|Perl version]], [[.:step-31|Java version]]) consists of a mapper only. All the mappers must be executed simultaneously and can communicate using an ''allReduce'' function.
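The semantics of ''allReduce'' can be sketched in plain Java: every mapper contributes a local value and, once all of them have done so, each receives the same combined result (a sum in this sketch). The method names and the use of one thread per mapper are our illustration only; the actual API is described in the steps linked above.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class AllReduceSketch {
    // Runs one simulated mapper per input value, all concurrently (with
    // allReduce, the mappers must execute simultaneously). Each mapper
    // contributes its value and receives the sum of all contributions.
    static int[] runJob(int[] values) throws InterruptedException {
        int n = values.length;
        AtomicInteger total = new AtomicInteger();
        CyclicBarrier barrier = new CyclicBarrier(n);
        int[] results = new int[n];
        Thread[] mappers = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            mappers[i] = new Thread(() -> {
                try {
                    total.addAndGet(values[id]);  // contribute the local value
                    barrier.await();              // wait for all mappers
                    results[id] = total.get();    // every mapper sees the sum
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            mappers[i].start();
        }
        for (Thread t : mappers) t.join();
        return results;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int r : runJob(new int[] {1, 2, 3}))
            System.out.println("mapper got " + r);  // every mapper gets 6
    }
}
```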

Any Hadoop job can run:
  * on a cluster. A separate process is used for every mapper and reducer.
  * locally. No processes are created; the computation runs in a single thread. This is useful for debugging. //Warning: in this mode, there cannot be more than 1 reducer. This is a deficiency of Hadoop which is already fixed in the development version.//
