MapReduce Tutorial : Additional output from mappers and reducers
Sometimes it would be useful to create output files manually in reducers – either multiple files are needed per reducer, or a specific file format is desired.
The problem is that the Hadoop framework can spawn several task attempts for the same reducer task, either because of speculative execution or because one reduce attempt is presumed to have crashed, even if it in fact did not.
For this reason Hadoop creates a separate output directory for every reduce attempt it makes. If the attempt finishes successfully, the files in this directory are moved to the job output directory. The user must still ensure that different reducers produce different filenames, usually by including the serial number of the reducer in the filename.
Both the attempt directory and the serial number are available in the Perl API through environment variables:
HADOOP_TASK_ID – available in every mapper and reducer. The serial number of the mapper or reducer task (in the range 0..number_of_tasks-1).
HADOOP_WORK_OUTPUT_PATH – available in a reducer. It contains the path of an existing directory where the reducer can create files. If the reducer finishes successfully, all files and subdirectories are moved to the output directory of the job.
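The two variables can be combined to build a filename that is unique per task and lives in the attempt's own directory. The following is a minimal sketch in plain Perl, not part of the tutorial's Perl API: the helper names task_output_path and write_records, the "prefix-00003" naming scheme, and the fallback values used when the variables are unset are all assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Build the path of a per-task file inside the attempt's work directory.
# HADOOP_WORK_OUTPUT_PATH and HADOOP_TASK_ID are the variables described
# above; the fallbacks ('.' and 0) are assumptions for runs outside Hadoop.
sub task_output_path {
    my ($prefix) = @_;
    my $dir     = $ENV{HADOOP_WORK_OUTPUT_PATH} // '.';
    my $task_id = $ENV{HADOOP_TASK_ID}          // 0;
    # The "<prefix>-<5-digit id>" naming scheme is hypothetical.
    return File::Spec->catfile($dir, sprintf('%s-%05d', $prefix, $task_id));
}

# Write [key, value] pairs as tab-separated lines to the given file.
sub write_records {
    my ($path, $records) = @_;
    open(my $fh, '>', $path) or die "Cannot open $path: $!";
    print {$fh} join("\t", @$_), "\n" for @$records;
    close($fh) or die "Cannot close $path: $!";
    return $path;
}
```

A reducer could call task_output_path once with a prefix of its choice and then write its results to that file instead of emitting them through the framework. Because the task number is part of the filename, files from different reducers do not collide once they are moved to the job output directory.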
Reduce-less jobs
If an MR job runs without reducers, the output of the mappers is written to the output directory without further processing. In this case, the environment variable HADOOP_WORK_OUTPUT_PATH is also present in mappers, and the files created in this directory are copied to the job output directory.
Exercise
Change the word counting script step-12-exercise.pl to produce results in reducers manually using the mentioned environmental variables, and execute it using four reducers.
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-12-exercise.pl'
# NOW EDIT THE FILE
# $EDITOR step-12-exercise.pl
rm -rf step-12-out-ex; perl step-12-exercise.pl -c 4 -r 4 /home/straka/wiki/cs-text-medium/ step-12-out-ex
less step-12-out-ex/part-*
Solution
You can also download the solution step-12-solution.pl and check the correct output.
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-12-solution.txt' -O 'step-12-solution.pl'
# NOW VIEW THE FILE
# $EDITOR step-12-solution.pl
rm -rf step-12-out-sol; perl step-12-solution.pl -c 4 -r 4 /home/straka/wiki/cs-text-medium/ step-12-out-sol
less step-12-out-sol/vystup-*