
MapReduce Tutorial : Additional output from mappers and reducers

Sometimes it is useful to create output files manually in reducers, either because a reducer needs to produce multiple files or because a specific file format is required.

The problem is that the Hadoop framework can run the same reducer multiple times, either because of speculative execution or because a reducer is presumed to have crashed even though it has not.

For these reasons, Hadoop creates a separate output directory for every reduce attempt it makes. If the attempt finishes successfully, the files in this directory are moved to the job output directory. The user must still ensure that different reducers produce different filenames, usually by including the serial number of the reducer in the file name.

Both pieces of information are available in the Perl API through environment variables:

HADOOP_TASK_ID – available in every mapper and reducer. The serial number of the mapper or reducer task (in the range 0..number_of_tasks-1).
HADOOP_WORK_OUTPUT_PATH – the output directory created for the current task attempt, as described above.
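
The two variables can be combined to build a file name that is unique to the task and lives in the attempt-specific directory. The following is only a minimal sketch in plain Perl: the helper names `additional_output_path` and `open_additional_output` and the `name-NNNNN` naming pattern are hypothetical, not part of the Perl API, and the code assumes that the directory in HADOOP_WORK_OUTPUT_PATH is reachable as an ordinary filesystem path.

```perl
use strict;
use warnings;
use File::Spec;

# Build a per-task file name inside the attempt-specific output directory.
# The helper name and the "name-NNNNN" pattern are illustrative only.
sub additional_output_path {
    my ($basename) = @_;
    my $dir     = $ENV{HADOOP_WORK_OUTPUT_PATH};
    my $task_id = $ENV{HADOOP_TASK_ID};
    die "HADOOP_WORK_OUTPUT_PATH is not set\n" unless defined $dir;
    die "HADOOP_TASK_ID is not set\n"          unless defined $task_id;
    # Zero-pad the task serial number so the files sort naturally.
    return File::Spec->catfile($dir, sprintf("%s-%05d", $basename, $task_id));
}

# Open the additional output file for writing.
# Assumption: the directory is an ordinary path visible to the task.
sub open_additional_output {
    my ($basename) = @_;
    my $path = additional_output_path($basename);
    open(my $fh, '>', $path) or die "Cannot open $path: $!\n";
    return $fh;
}
```

A reducer could then call `my $fh = open_additional_output('extra');`, print to `$fh`, and close it when it finishes. Because the file is written into the attempt directory, output from a failed or duplicate attempt should not reach the final output directory.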


Institute of Formal and Applied Linguistics Wiki
