MapReduce Tutorial : Hadoop job overview
A Hadoop job consists of:
- [required] a mapper – processes input (key, value) pairs and produces (key, value) pairs. There can be multiple mappers: each input file is divided into splits (32MB by default) and each split is processed by one mapper. The script
  /net/projects/hadoop/bin/compute-splitsize input nr_of_mappers
  can be used to compute a split size such that the resulting job consists of the specified number of mappers.
- [optional] a reducer – processes a key and all its associated values, in ascending order of keys, and produces (key, value) pairs. The user can specify the number of reducers: 0, 1 or more; the default is 1.
- [optional] a combiner – a reducer which is executed locally on the output of a mapper.
- [optional] a partitioner – executed on every (key, value) pair produced by a mapper, it outputs the number of the reducer which should process this pair. A sketch of all four components is shown below.
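As a concrete illustration of these components, here is a minimal sketch in Java of the classic word-count job. The class names (WordCountMapper, WordCountReducer, SimpleHashPartitioner) are hypothetical and chosen only for this example; they are not part of the tutorial's code.

<code java>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: processes input (key, value) pairs and emits (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: receives a key together with all its values (keys arrive in
// ascending order) and emits the total count. The same class can also be
// used as a combiner, because summing partial counts is associative.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

// Partitioner: for every (key, value) pair produced by a mapper, returns
// the number of the reducer which should process it.
class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReducers) {
    return (key.toString().hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}
</code>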
A Hadoop job can run:
- on a cluster. A separate process is used for every mapper and reducer.
- locally. No extra processes are created; the computation runs in a single thread. Useful for debugging. Warning: in this mode, there cannot be more than one reducer. This is a deficiency of Hadoop which is already fixed in the development version.
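The following is a minimal sketch of a job driver that wires the components together and configures the number of reducers, assuming the hypothetical classes from the sketch above. The exact configuration key for forcing local execution depends on the Hadoop version (e.g. mapreduce.framework.name in newer releases), so treat that line as illustrative only.

<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // For a single-threaded local run (useful for debugging), the framework
    // can be switched to "local"; on a cluster this setting is normally left
    // to the cluster configuration. The key name is version-dependent.
    // conf.set("mapreduce.framework.name", "local");

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);

    job.setMapperClass(WordCountMapper.class);              // required
    job.setCombinerClass(WordCountReducer.class);           // optional
    job.setReducerClass(WordCountReducer.class);            // optional
    job.setPartitionerClass(SimpleHashPartitioner.class);   // optional

    // 0, 1 or more reducers; remember the one-reducer limit in local mode.
    job.setNumReduceTasks(1);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</code>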