When multiple Hadoop jobs need to be executed, it is better to reuse one cluster than to allocate a new one for every computation.
A cluster can be created using
/net/projects/hadoop/bin/hadoop-cluster -c number_of_machines -w sec_to_wait_after_all_jobs_completed
The syntax is the same as in perl script.pl run.
The associated SGE job name is HadoopCluster. The running job can be stopped either by removing the HadoopCluster.c$SGE_JOBID file or by deleting the SGE job using qdel.
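The two ways of stopping the cluster can be sketched as follows. This is only an illustration: the helper names job_id_from_marker and stop_commands are hypothetical, the job id 123456 is made up, and the directory in which the marker file lives is not stated here, so the sketch takes its path as an argument. The commands are printed rather than executed.

```shell
# Hypothetical helper: extract the SGE job id from a marker file name of the
# form HadoopCluster.c<SGE_JOBID>.
job_id_from_marker() {
    local name
    name=$(basename "$1")
    echo "${name#HadoopCluster.c}"
}

# Hypothetical helper: print (not run) both ways of stopping the cluster
# whose marker file is given as $1.
stop_commands() {
    echo "rm $1"
    echo "qdel $(job_id_from_marker "$1")"
}

# Example with a made-up job id.
stop_commands HadoopCluster.c123456
```

Either of the two printed commands should be enough on its own; there is no need to run both.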
A running cluster is identified by its master. When running a Hadoop job using the Perl API, an existing cluster can be used by passing its master with the -jt option:
perl script.pl -jt cluster_master:9001 ...
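Reusing one cluster for several jobs then amounts to passing the same -jt value to each script. A minimal sketch, assuming the master listens on port 9001 as in the command above; the helper name submit_command, the script names and the input/output paths are hypothetical placeholders. The commands are printed rather than executed.

```shell
# Hypothetical helper: print the command that submits a script to an
# existing cluster. $1 = master host name, $2 = script, rest = script args.
submit_command() {
    local master=$1 script=$2
    shift 2
    echo "perl $script -jt ${master}:9001 $*"
}

# Two jobs reusing the same cluster; all names below are placeholders.
submit_command cluster_master first.pl input out-1
submit_command cluster_master second.pl input out-2
```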
From now on, it is best to run MR jobs on a one-machine cluster: create one using hadoop-cluster for 3h (10800s) and run jobs using -jt cluster_master:9001. Running the scripts locally without any cluster has several disadvantages, most notably having only one reducer per job.
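The 3h figure corresponds to -w 10800. A small sketch of that conversion, with hypothetical helper names hours_to_seconds and cluster_command that are not part of the hadoop-cluster tooling; it only prints the command so that it can be checked before being run.

```shell
# Hypothetical helper: convert whole hours to seconds for the -w option.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

# Hypothetical helper: print (not run) the hadoop-cluster command for a
# cluster of $1 machines that waits $2 hours after all jobs have completed.
cluster_command() {
    local machines=$1 hours=$2
    echo "/net/projects/hadoop/bin/hadoop-cluster -c $machines -w $(hours_to_seconds "$hours")"
}

# One machine, three hours of waiting.
cluster_command 1 3
```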
Try running the same script step-7-wordcount.pl as in the last step, this time by creating the cluster and submitting the job to it:
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-6.txt' -O 'step-7-wordcount.pl'
/net/projects/hadoop/bin/hadoop-cluster -c 1 -w 600
# NOW VIEW THE FILE
# $EDITOR step-7-wordcount.pl
rm -rf step-7-out-sol; perl step-7-wordcount.pl -jt cluster_master:9001 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium step-7-out-sol
less step-7-out-sol/part-*
Remarks: