MapReduce Tutorial : Dynamic Hadoop cluster for several computations
When multiple Hadoop jobs need to be executed, it is better to reuse one cluster instead of allocating a new one for every computation.
A cluster can be created using
/net/projects/hadoop/bin/hadoop-cluster -c number_of_machines -w sec_to_run_the_cluster_for
The syntax is the same as in perl script.pl run.
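For example, a minimal sketch of allocating a two-machine cluster that stays up for one hour (both values are illustrative, not recommendations):
/net/projects/hadoop/bin/hadoop-cluster -c 2 -w 3600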
The associated SGE job is named HadoopCluster. The running cluster can be stopped either by removing the HadoopCluster.c$SGE_JOBID file or by deleting the SGE job using qdel.
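A minimal sketch of both ways of stopping the cluster; the job id 1234567 is hypothetical, and the control file is assumed to be in the directory from which the cluster was started:
# find the SGE job id of the job named HadoopCluster
qstat
# either delete the SGE job directly ...
qdel 1234567
# ... or remove the control file the cluster watches
rm HadoopCluster.c1234567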
Using a running cluster
A running cluster is identified by its master. When running a Hadoop job using the Perl API, an existing cluster can be used with
perl script.pl run -jt cluster_master:9001 ...
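This makes it possible to submit several jobs to the same cluster in a row, e.g. (the script names and paths are illustrative; cluster_master stands for the master machine of the running cluster):
perl first-job.pl run -jt cluster_master:9001 input1 output1
perl second-job.pl run -jt cluster_master:9001 input2 output2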
Example
Try running the same script step-7-wordcount.pl as in the last step, this time by creating the cluster and submitting the job to it:
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-6.txt' -O 'step-7-wordcount.pl'
/net/projects/hadoop/bin/hadoop-cluster -c 1 -w 600
rm -rf step-7-out-sol; perl step-7-wordcount.pl run -jt cluster_master:9001 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium step-7-out-sol
less step-7-out-sol/part-*
Remarks:
- The reducers seem to start running before the mappers finish. In the web interface, the running time of reducers is divided into thirds:
- during the first 33%, the mapper outputs are copied to the machine where the reducer runs.
- during the second 33%, the (key, value) pairs are sorted.
- during the last 33%, the user-defined reducer runs.
Step 6: Running on cluster | Overview | Step 8: Multiple mappers, reducers and partitioning