
MapReduce Tutorial : Running on cluster

Probably the most important feature of MapReduce is the ability to run computations in a distributed fashion.

So far, all our Hadoop jobs have been executed locally. However, all of them can also be executed on multiple machines; it suffices to add the parameter -c number_of_machines when running them:

perl script.pl -c number_of_machines [-w sec_to_wait_after_job_completion] input_directory output_directory

This command creates a cluster of the specified number of machines. Every machine can run two mappers and two reducers simultaneously. To be able to inspect the counters, status and error logs of the computation after it ends, the parameter -w sec_to_wait_after_job_completion can be used – when it is given, the cluster waits for the specified time after the job finishes (successfully or not) before shutting down.
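
For example, an invocation like the following (the machine count and waiting time are illustrative values) runs the job on a cluster of 4 machines and keeps the cluster alive for 10 minutes after the job finishes, so its counters and logs can still be examined:

perl script.pl -c 4 -w 600 input_directory output_directory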

One of the machines in the cluster acts as the master (the job tracker) and is used to identify the cluster.

In the UFAL environment, executing a distributed Hadoop computation submits a job to the SGE cluster, named after the Perl script. The job creates three files in the current directory; one of them is script.pl.c$SGE_JOBID, which is used below.

When the computation has ended and the cluster is waiting because of the -w parameter, removing the file script.pl.c$SGE_JOBID stops the cluster. The cluster can also be stopped by deleting its SGE job using qdel.
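
As a sketch (assuming the cluster's SGE job id is available in $SGE_JOBID), either of the following stops the cluster:

rm script.pl.c$SGE_JOBID
qdel $SGE_JOBID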

Web interface

The cluster master provides a web interface on the address printed by the hadoop-cluster script. The address is also available on the second line of script.pl.c$SGE_JOBID, or via qstat -j $SGE_JOBID as the context variable hdfs_jobtracker_admin.
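
For instance, the address can be looked up with commands along these lines (a sketch; the exact layout of the qstat output may differ):

sed -n 2p script.pl.c$SGE_JOBID
qstat -j $SGE_JOBID | grep hdfs_jobtracker_admin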

The web interface provides a lot of useful information, including the counters, status and error logs of the computation.

Example

Try running step-6-wordcount.pl using

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-6.txt' -O 'step-6-wordcount.pl'
rm -rf step-6-out; perl step-6-wordcount.pl -c 1 -w 600 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium step-6-out

and explore the web interface.

If you cannot directly access the *.ufal.hide.ms.mff.cuni.cz network, you can run

ssh -N -L 50030:pandora3:50030 geri.ms.mff.cuni.cz

on your computer to create a tunnel from local port 50030 to pandora3:50030. Replace pandora3 with your cluster master, but leave the hostname geri.ms.mff.cuni.cz unmodified. You can then access the web interface at http://localhost:50030.


Step 5: Basic reducer · Overview · Step 7: Dynamic Hadoop cluster for several computations