MapReduce Tutorial : Hadoop job overview
A Hadoop job consists of:
- [required] a mapper – processes input (key, value) pairs and produces (key, value) pairs. There can be multiple mappers: each input file is divided into splits (32MB by default) and each split is processed by one mapper. The script
  /net/projects/hadoop/bin/compute-splitsize input nr_of_mappers
  can be used to compute a split size such that the resulting job consists of the specified number of mappers.
- [optional] a reducer – processes a key and all its associated values, in ascending order of keys, and produces (key, value) pairs. The user can specify the number of reducers: 0, 1 or more; the default is 1.
- [optional] a combiner – a reducer which is executed locally on the output of a mapper.
- [optional] a partitioner – executed on every (key, value) pair produced by a mapper, it outputs the number of the reducer which should process this pair. A sketch of all four components is shown below.
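As a concrete illustration of these components, here is a minimal sketch in Java of the classic word-count job. The class names (WordCountMapper, WordCountReducer, SimpleHashPartitioner) are hypothetical and chosen only for this example; they are not part of the tutorial's code.

<code java>
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: processes input (key, value) pairs and emits (word, 1) pairs.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: receives a key together with all its values (keys arrive in
// ascending order) and emits the total count. The same class can also be
// used as a combiner, because summing partial counts is associative.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

// Partitioner: for every (key, value) pair produced by a mapper, returns
// the number of the reducer which should process it.
class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReducers) {
    return (key.toString().hashCode() & Integer.MAX_VALUE) % numReducers;
  }
}
</code>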
A Hadoop job can run:
- on a cluster. A separate process is used for every mapper and reducer.
- locally. No extra processes are created; the computation runs in a single thread. Useful for debugging. Warning: in this mode, there cannot be more than one reducer. This is a deficiency of Hadoop which is already fixed in the development version.
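The following is a minimal sketch of a job driver that wires the components together and configures the number of reducers, assuming the hypothetical classes from the sketch above. The exact configuration key for forcing local execution depends on the Hadoop version (e.g. mapreduce.framework.name in newer releases), so treat that line as illustrative only.

<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // For a single-threaded local run (useful for debugging), the framework
    // can be switched to "local"; on a cluster this setting is normally left
    // to the cluster configuration. The key name is version-dependent.
    // conf.set("mapreduce.framework.name", "local");

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);

    job.setMapperClass(WordCountMapper.class);              // required
    job.setCombinerClass(WordCountReducer.class);           // optional
    job.setReducerClass(WordCountReducer.class);            // optional
    job.setPartitionerClass(SimpleHashPartitioner.class);   // optional

    // 0, 1 or more reducers; remember the one-reducer limit in local mode.
    job.setNumReduceTasks(1);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</code>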