MapReduce Tutorial : Running on cluster
Probably the most important feature of MapReduce is its ability to run computations in a distributed fashion.
So far, all our Hadoop jobs have been executed locally, but any of them can also be executed on multiple machines. It suffices to add the parameter -c number_of_machines when running them:
perl script.pl run -c number_of_machines [-w sec_to_wait_after_job_completion] input_directory output_directory
This command creates a cluster with the specified number of machines. Every machine can run two mappers and two reducers simultaneously. To be able to observe the status of the computation after it ends, use the parameter -w sec_to_wait_after_job_completion.
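As a rough illustration of the figures above, the following sketch computes how many mappers (and equally reducers) can run at once on a cluster, and assembles the command line for a distributed run. The helper names cluster_slots and cluster_run_cmd are hypothetical and not part of the tutorial's tooling; the slot count assumes that every machine in the cluster contributes two mappers and two reducers.

```shell
# Sketch only: cluster_slots and cluster_run_cmd are hypothetical helpers,
# not part of the tutorial's tooling.

# Number of mappers (and, equally, reducers) that can run at once,
# assuming every machine contributes two of each.
cluster_slots() {
    echo $(( 2 * $1 ))
}

# Print the command line for a distributed run without executing it.
# Arguments: script, number of machines, seconds to wait, input dir, output dir.
cluster_run_cmd() {
    printf 'perl %s run -c %s -w %s %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example: a 4-machine cluster that stays up for 10 minutes after the job.
# cluster_slots 4
# cluster_run_cmd script.pl 4 600 input_directory output_directory
```

Printing the command instead of running it makes it easy to check the parameters before a job is actually submitted.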
One of the machines in the cluster is a master, or a job tracker, and it is used to identify the cluster.
In the UFAL environment, when a distributed Hadoop computation is executed, it submits a job to the SGE cluster under the name of the Perl script. The job creates 3 files in the current directory:
- script.pl.c$SGE_JOBID: high-level status of the Hadoop computation
- script.pl.o$SGE_JOBID: contains stdout and stderr of the Hadoop job
- script.pl.po$SGE_JOBID: contains stdout and stderr of the Hadoop cluster
When the computation has ended and the cluster is waiting because of the -w parameter, removing the file script.pl.c$SGE_JOBID stops the cluster. The cluster can also be stopped by removing its SGE job using qdel.
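The two ways of stopping a cluster can be combined in one small helper. This is a minimal sketch, not part of the tutorial's tooling: stop_cluster is a hypothetical name, and it assumes it is called from the directory in which the job was started, with the script name and the SGE job ID as arguments.

```shell
# Sketch only: stop_cluster is a hypothetical helper.
# Arguments: script name (e.g. script.pl) and the SGE job ID.
stop_cluster() {
    status_file="${1}.c${2}"
    if [ -e "$status_file" ]; then
        # Removing the status file stops a cluster that is waiting because of -w.
        rm -f -- "$status_file"
    else
        # No status file in the current directory: remove the SGE job instead.
        qdel "$2"
    fi
}

# Example (the job ID 123456 is made up):
# stop_cluster script.pl 123456
```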
Web interface
The cluster master provides a web interface on port 50030 (the port may change in the future). The cluster master address can be found on the first line of script.pl.c$SGE_JOBID, or by using qstat -j $SGE_JOBID (context variable hdfs_jobtracker_admin).
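Looking up the master address and turning it into a URL might be sketched as follows. The helper names master_address and jobtracker_url are hypothetical. The exact format of the first line of the status file is not described here, so jobtracker_url assumes it is given a bare host name; if the line already includes a port, use it as it is.

```shell
# Sketch only: master_address and jobtracker_url are hypothetical helpers.

# Print the first line of the status file, where the master address can be found.
# Arguments: script name and SGE job ID.
master_address() {
    head -n 1 "${1}.c${2}"
}

# Build the web interface URL from a bare host name; the port defaults to 50030.
jobtracker_url() {
    printf 'http://%s:%s/\n' "$1" "${2:-50030}"
}

# Example (the job ID 123456 is made up):
# jobtracker_url "$(master_address script.pl 123456)"
```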
The web interface provides a lot of useful information:
- running, failed and successfully completed jobs
- for a running job, the current progress and counters of the whole job and of each mapper and reducer
- for any job, the counters and outputs of all mappers and reducers
- for any job, all Hadoop settings
Example
Try running wordcount.pl using
perl wordcount.pl run -c 1 -w 600 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium some_output_directory
and explore the web interface.
If you cannot access the *.ufal.hide.ms.mff.cuni.cz network directly, you can use
ssh -N -L 50030:pandora3:50030 geri
to create a tunnel from local port 50030 to pandora3:50030. Replace pandora3 with your cluster_master, but do not change geri.
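Because only the master name changes between runs, the tunnel command can be generated by a small helper. This is a sketch under stated assumptions: tunnel_cmd is a hypothetical name, and the optional second argument, a different local port, is an addition that is useful when local port 50030 is already taken.

```shell
# Sketch only: tunnel_cmd is a hypothetical helper.
# Print the ssh command that forwards a local port to port 50030 on the master.
# Arguments: cluster master host, optional local port (default 50030).
tunnel_cmd() {
    printf 'ssh -N -L %s:%s:50030 geri\n' "${2:-50030}" "$1"
}

# Examples:
# tunnel_cmd pandora3        # same command as shown above
# tunnel_cmd pandora3 8080   # forward local port 8080 instead
```

With the second form, the web interface is reachable at http://localhost:8080/ while the tunnel is open.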