MapReduce Tutorial : Dynamic Hadoop cluster for several computations
When several Hadoop jobs need to be run, it is better to reuse one cluster than to allocate a new one for every computation.
A cluster can be created using
/net/projects/hadoop/bin/hadoop-cluster -c number_of_machines -w sec_to_run_the_cluster_for
The syntax is the same as in perl script.pl run.
The associated SGE job name is HadoopCluster. The running cluster can be stopped either by removing the HadoopCluster.c$SGE_JOBID file or by deleting the SGE job using qdel.
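The two ways of stopping a cluster can be sketched as the following shell helper. It only prints the commands instead of running them. The function name stop_cluster_commands and the job ID 123456 are made up for illustration, and the marker file is assumed to be in the current directory, which this page does not state.

```shell
#!/bin/sh
# Print (do not execute) the two ways of stopping a cluster
# that runs as the SGE job with the given ID.
stop_cluster_commands() {
    job_id="$1"
    # Option 1: remove the marker file (assumed to be in the current directory).
    echo "rm HadoopCluster.c$job_id"
    # Option 2: delete the SGE job itself.
    echo "qdel $job_id"
}

# 123456 is a made-up job ID used only for illustration.
stop_cluster_commands 123456
```

Either printed command should be enough on its own; there is no need to run both.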
Using a running cluster
A running cluster is identified by its master. When running a Hadoop job using the Perl API, an existing cluster can be used by
perl script.pl run -jt cluster_master:9001 ...
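Since the point of a shared cluster is to run several computations on it, a small wrapper can submit multiple jobs to the same master. The following is only a sketch: the function submit_job, the variables MASTER and DRY_RUN, and the directory names are hypothetical, and by default the commands are just printed so they can be checked before anything is submitted.

```shell
#!/bin/sh
# Sketch: submit several jobs to one already-running cluster.
# MASTER is a placeholder; set it to the master host of your cluster.
MASTER="${MASTER:-cluster_master}"
# With DRY_RUN=1 (the default here) the commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

submit_job() {
    # $1 is the Perl script; the remaining arguments are passed on unchanged.
    script="$1"; shift
    if [ "$DRY_RUN" = 1 ]; then
        echo perl "$script" run -jt "$MASTER:9001" "$@"
    else
        perl "$script" run -jt "$MASTER:9001" "$@"
    fi
}

# Two jobs sharing the same cluster; the directory names are placeholders.
submit_job wordcount.pl input_directory output_directory_1
submit_job wordcount.pl input_directory output_directory_2
```

Setting DRY_RUN=0 and MASTER to the real master host would run the jobs one after another on the same cluster.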
Example
Try running the same script wordcount.pl as in the previous step, this time creating the cluster first and then submitting the job to it:
/net/projects/hadoop/bin/hadoop-cluster -c 1 -w 600
perl wordcount.pl run -jt cluster_master:9001 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium some_output_directory