====== MapReduce Tutorial : Dynamic Hadoop cluster for several computations ======
  
When multiple Hadoop jobs are to be executed, it is better to reuse one cluster instead of allocating a new one for every computation.
  
A cluster can be created using
  /net/projects/hadoop/bin/hadoop-cluster -c number_of_machines -w sec_to_run_the_cluster_for
The syntax is the same as in ''perl script.pl run''.
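
For instance, a cluster of 3 machines running for one hour could be requested as follows (both numbers are illustrative; choose them according to your computations):
  /net/projects/hadoop/bin/hadoop-cluster -c 3 -w 3600
The hostname of the master of the allocated cluster (written as ''cluster_master'' in the examples below) is what identifies the cluster when submitting jobs to it.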
  
  
===== Using a running cluster =====
A running cluster is identified by its master. When running a Hadoop job using the Perl API, an existing cluster can be used by
  perl script.pl -jt cluster_master:9001 ...
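
For example, if the master of your running cluster were the (hypothetical) machine ''machine01'', the job would be submitted as
  perl script.pl -jt machine01:9001 input_path output_path
where ''input_path'' and ''output_path'' stand for the job's actual input and output paths, as in the previous steps.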

===== Running Hadoop jobs from now on =====

From now on, it is best to run MR jobs using a one-machine cluster: create one using ''hadoop-cluster'' for 3 hours (10800 s) and run jobs using ''-jt cluster_master''. Running the scripts locally without any cluster has several disadvantages, most notably having only one reducer per job.

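Putting the two commands together, a typical session for the rest of the tutorial could therefore start like this (10800 seconds being the suggested 3 hours):
  /net/projects/hadoop/bin/hadoop-cluster -c 1 -w 10800
after which every job is submitted with ''-jt cluster_master:9001'', as in the example below.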

===== Example =====

Try running the same script {{:courses:mapreduce-tutorial:step-6.txt|step-7-wordcount.pl}} as in the last step, this time by creating a cluster and submitting the job to it:
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-6.txt' -O 'step-7-wordcount.pl'
  /net/projects/hadoop/bin/hadoop-cluster -c 1 -w 600
  # NOW VIEW THE FILE
  # $EDITOR step-7-wordcount.pl
  rm -rf step-7-out-sol; perl step-7-wordcount.pl -jt cluster_master:9001 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium step-7-out-sol
  less step-7-out-sol/part-*
Remarks:
  * The reducers seem to start running before the mappers finish. In the web interface, the running time of a reducer is divided into thirds:
    * during the first 33%, the mapper outputs are copied to the machine where the reducer runs.
    * during the second 33%, the (key, value) pairs are sorted.
    * during the last 33%, the user-defined reducer runs.
  * The option ''-Dmapred.max.split.size=1000000'' caps each input split at roughly 1 MB, which forces Hadoop to process the input with several mappers.

----

<html>
<table style="width:100%">
<tr>
<td style="text-align:left; width: 33%; "></html>[[step-6|Step 6]]: Running on cluster.<html></td>
<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
<td style="text-align:right; width: 33%; "></html>[[step-8|Step 8]]: Multiple mappers, reducers and partitioning.<html></td>
</tr>
</table>
</html>
  
