====== MapReduce Tutorial : Dynamic Hadoop cluster for several computations ======

When several Hadoop jobs are to be executed, it is better to reuse one cluster than to allocate a new one for every computation.

A cluster can be created using
  /net/projects/hadoop/bin/hadoop-cluster -c number_of_machines -w sec_to_wait_after_all_jobs_completed
The syntax is the same as in ''perl script.pl run''.

The associated SGE job is named ''HadoopCluster''. The running cluster can be stopped either by removing the ''HadoopCluster.c$SGE_JOBID'' file or by deleting the SGE job with ''qdel''.
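
The two ways of stopping the cluster can be sketched as a small shell helper. This is only an illustration: the function name ''stop_cluster'' is hypothetical, and the location of the control file is an assumption (it is passed as the second argument and defaults to the current directory). The helper only prints the ''qdel'' command instead of running it.

```shell
# Sketch: stop a cluster started by hadoop-cluster.
# Assumption: the control file HadoopCluster.c<job_id> is in the directory
# given as the second argument (default: the current directory).
stop_cluster() {
    job_id=$1
    dir=${2:-.}
    control_file="$dir/HadoopCluster.c$job_id"
    if [ -e "$control_file" ]; then
        # First way: remove the control file.
        rm -f "$control_file"
        echo "removed $control_file"
    else
        # Second way: delete the SGE job (printed here, not executed).
        echo "control file not found, run: qdel $job_id"
    fi
}
```

For example, ''stop_cluster 12345'' removes ''HadoopCluster.c12345'' from the current directory if it exists there.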

===== Using a running cluster =====
A running cluster is identified by its master. When running a Hadoop job using the Perl API, an existing cluster can be used by passing its master to the ''-jt'' option:
  perl script.pl -jt cluster_master:9001 ...
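
As a convenience, the value for ''-jt'' can be built from the master host name. The helper below is a hypothetical sketch, not part of the tutorial scripts; it assumes that the master listens on port 9001, as in the command above, and leaves the value unchanged when a port is already given.

```shell
# Sketch: build the address passed to -jt.
# Assumption: the default port is 9001, as in the command above.
jobtracker_address() {
    case $1 in
        *:*) echo "$1" ;;        # a port is already present
        *)   echo "$1:9001" ;;   # append the assumed default port
    esac
}

jobtracker_address cluster_master
```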

===== Running Hadoop jobs from now on =====

From now on, it is best to run MR jobs on a one-machine cluster: create it using ''hadoop-cluster'' with a waiting time of 3h (10800s) and run the jobs using ''-jt cluster_master:9001''. Running the scripts locally without any cluster has several disadvantages, most notably having only one reducer per job.
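
Since ''-w'' takes seconds, the conversion from hours is easy to get wrong. A minimal sketch that prints (but does not run) the corresponding ''hadoop-cluster'' command is shown below; the function names ''hours_to_seconds'' and ''cluster_command'' are hypothetical helpers introduced only for this illustration.

```shell
# Convert whole hours to the number of seconds expected by -w.
hours_to_seconds() {
    echo $(( $1 * 3600 ))
}

# Print the hadoop-cluster command for the given number of machines
# and waiting time in hours; the command is not executed.
cluster_command() {
    machines=$1
    hours=$2
    echo "/net/projects/hadoop/bin/hadoop-cluster -c $machines -w $(hours_to_seconds "$hours")"
}

cluster_command 1 3
# prints: /net/projects/hadoop/bin/hadoop-cluster -c 1 -w 10800
```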

===== Example =====

Try running the same script {{:courses:mapreduce-tutorial:step-6.txt|step-7-wordcount.pl}} as in the last step, this time by first creating a cluster and then submitting the job to it (replace ''cluster_master'' with the master of your cluster):
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-6.txt' -O 'step-7-wordcount.pl'
  /net/projects/hadoop/bin/hadoop-cluster -c 1 -w 600
  # NOW VIEW THE FILE
  # $EDITOR step-7-wordcount.pl
  rm -rf step-7-out-sol; perl step-7-wordcount.pl -jt cluster_master:9001 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium step-7-out-sol
  less step-7-out-sol/part-*
Remarks:
  * The reducers seem to start running before the mappers finish. In the web interface, the running time of reducers is divided into thirds:
    * during the first 33%, the mapper outputs are copied to the machine where the reducer runs.
    * during the second 33%, the (key, value) pairs are sorted.
    * during the last 33%, the user-defined reducer runs.
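
The thirds described above can be read off a reducer's progress value. The following sketch maps an integer percentage to the corresponding phase; the function name ''reducer_phase'' and the phase labels are hypothetical, and the boundaries are rounded to whole percent.

```shell
# Sketch: map a reducer progress percentage (integer, 0-100) to its phase.
# Boundaries approximate the thirds at 33.3% and 66.7%.
reducer_phase() {
    progress=$1
    if [ "$progress" -le 33 ]; then
        echo copy      # mapper outputs are being copied
    elif [ "$progress" -le 66 ]; then
        echo sort      # (key, value) pairs are being sorted
    else
        echo reduce    # the user-defined reducer is running
    fi
}

reducer_phase 20   # prints: copy
reducer_phase 50   # prints: sort
reducer_phase 90   # prints: reduce
```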

----

<html>
<table style="width:100%">
<tr>
<td style="text-align:left; width: 33%; "></html>[[step-6|Step 6]]: Running on cluster.<html></td>
<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
<td style="text-align:right; width: 33%; "></html>[[step-8|Step 8]]: Multiple mappers, reducers and partitioning.<html></td>
</tr>
</table>
</html>