MapReduce Tutorial : Running on cluster

Probably the most important feature of MapReduce is the ability to run computations in a distributed fashion.

So far, all our MR jobs have been executed locally, but every one of them can also be executed on multiple machines. It suffices to add the parameter -c number_of_machines when running them:

perl script.pl run -c number_of_machines [-w sec_to_wait_after_job_completion] input_directory output_directory

This command creates a cluster of the specified number of machines. Every machine can run two mappers and two reducers simultaneously. To be able to inspect the status of the computation after it ends, use the parameter -w sec_to_wait_after_job_completion.
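
For example, the following hypothetical invocation (the input and output paths are placeholders) runs the computation on a cluster of 4 machines and keeps the cluster alive for 600 seconds after the job completes:

perl script.pl run -c 4 -w 600 input_directory output_directory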

When a distributed MR computation is executed, it submits a job to the SGE cluster under the name of the Perl script. The SGE job creates 3 files in the current directory; one of them, script.pl.c$SGE_JOBID, is referenced below.

When the computation has ended and is waiting because of the -w parameter, removing the file script.pl.c$SGE_JOBID stops the cluster. The cluster can also be stopped by deleting its SGE job.
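
As a sketch, assuming $SGE_JOBID holds the numeric ID of the submitted job, the two ways of stopping the cluster are:

rm script.pl.c$SGE_JOBID   # remove the cluster file
qdel $SGE_JOBID            # or delete the SGE job itself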

Web interface

The cluster master provides a web interface on port 50030 (the port may change in the future). The cluster master address can be found on the first line of script.pl.c$SGE_JOBID, or via qstat -j $SGE_JOBID (context variable hdfs_jobtracker_admin).
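
A minimal sketch of looking the address up from the shell, assuming $SGE_JOBID holds the job number:

head -n 1 script.pl.c$SGE_JOBID
qstat -j $SGE_JOBID | grep hdfs_jobtracker_admin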

The web interface provides a lot of useful information.

Example

Try running wordcount.pl using

perl wordcount.pl -c 1 -w 300 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium some_output_directory

and explore the web interface.

If you cannot directly access the *.ufal.hide.ms.mff.cuni.cz machines, you can use, for example,

ssh -N -L 50030:pandora3:50030 geri

to create a tunnel from the local port 50030 to the machine pandora3, port 50030. The web interface is then available at http://localhost:50030/.


[ Back to the navigation ] [ Back to the content ]