MapReduce Tutorial: Hadoop job overview
A regular Hadoop job consists of:
- [required] a mapper – processes input (key, value) pairs and produces (key, value) pairs. There can be multiple mappers: each input file is divided into splits (32MB by default) and each split is processed by one mapper. The script
/net/projects/hadoop/bin/compute-splitsize input nr_of_mappers
can be used to compute the split size for which the resulting job consists of the specified number of mappers.
- [optional] a reducer – processes keys in ascending order, receiving each key together with all its associated values, and produces (key, value) pairs. The user can specify the number of reducers: 0, 1 or more; the default is 1.
- [optional] a combiner – a reducer which is executed locally on the output of a single mapper.
- [optional] a partitioner – executed on every (key, value) pair produced by a mapper; it outputs the number of the reducer which should process this pair. When no partitioner is specified, the partition is derived from the hash of the key.
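The way these four components fit together can be sketched as a small in-memory simulation. This is only an illustration in plain Python, not the Hadoop API: the function names (mapper, combiner, partitioner, reducer, run_job) are hypothetical, the split size is counted in records rather than bytes, and crc32 merely stands in for whatever hash Hadoop actually uses. The example job counts words.

```python
from collections import defaultdict
from zlib import crc32


def mapper(key, value):
    """Emit (word, 1) for every word of one input line."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum all the counts belonging to one key."""
    yield key, sum(values)


# A combiner is a reducer executed locally on the output of one mapper.
combiner = reducer


def partitioner(key, num_reducers):
    """Pick a reducer from a hash of the key (crc32 is an assumption)."""
    return crc32(key.encode("utf-8")) % num_reducers


def group_and_reduce(pairs, reduce_fn):
    """Group values by key and call reduce_fn on keys in ascending order."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    out = []
    for k in sorted(groups):
        out.extend(reduce_fn(k, groups[k]))
    return out


def run_job(records, split_size, num_reducers):
    """Simulate a job: one mapper per split, then combine, partition, reduce."""
    splits = [records[i:i + split_size]
              for i in range(0, len(records), split_size)]
    mapped_per_split = [
        [pair for k, v in split for pair in mapper(k, v)] for split in splits
    ]
    if num_reducers == 0:
        # With no reducers, the mapper output is the job output.
        return mapped_per_split
    partitions = [[] for _ in range(num_reducers)]
    for mapped in mapped_per_split:
        for k, v in group_and_reduce(mapped, combiner):
            partitions[partitioner(k, num_reducers)].append((k, v))
    return [group_and_reduce(part, reducer) for part in partitions]


lines = ["a b a", "b c", "a c c"]
records = list(enumerate(lines))  # (line number, line text) pairs
outputs = run_job(records, split_size=2, num_reducers=2)
for reducer_id, part in enumerate(outputs):
    print(reducer_id, part)
```

Each reducer sees its keys in ascending order, and every key ends up at exactly one reducer. Using the reducer as the combiner works here only because summation gives the same result when applied to partial groups.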
An AllReduce Hadoop job (Perl version, Java version) consists of a mapper only. All the mappers must be executed simultaneously and can communicate using an allReduce function.
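To illustrate the idea, the sketch below imitates an allReduce operation with threads inside a single Python process. It is not the Perl or Java interface of the tutorial: the class AllReduce, its method all_reduce_sum and the helper run_mappers are hypothetical names, and summation is just one possible reduction. It assumes the usual meaning of allReduce, namely that every participant contributes a value and every participant receives the combined result.

```python
import threading


class AllReduce:
    """Hypothetical in-process stand-in for an allReduce function."""

    def __init__(self, num_mappers):
        self._values = [0] * num_mappers
        self._barrier = threading.Barrier(num_mappers)

    def all_reduce_sum(self, mapper_id, value):
        """Contribute a value and return the sum over all mappers."""
        self._values[mapper_id] = value
        self._barrier.wait()  # wait until every mapper has contributed
        total = sum(self._values)
        self._barrier.wait()  # wait until every mapper has read the values
        return total


def run_mappers(local_values):
    """Run one thread per simulated mapper and collect what each one receives."""
    comm = AllReduce(len(local_values))
    results = [None] * len(local_values)

    def mapper(i):
        results[i] = comm.all_reduce_sum(i, local_values[i])

    threads = [threading.Thread(target=mapper, args=(i,))
               for i in range(len(local_values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_mappers([1, 2, 3, 4])
print(results)
```

The barrier also shows why all mappers have to run at the same time: if one of them never started, the others would wait for its contribution forever.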
Any Hadoop job can run:
- on a cluster. A separate process is used for every mapper and reducer.
- locally. No separate processes are created; the whole computation runs in a single thread. Useful for debugging. Warning: in this mode, there cannot be more than 1 reducer. This is a deficiency of Hadoop that is already fixed in the development version.