Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:step-7 [2013/02/08 14:36] (current)

A cluster can be created using
  /net/projects/hadoop/bin/hadoop-cluster -c number_of_machines -w sec_to_wait_after_all_jobs_completed
The syntax is the same as in ''perl script.pl run''.
  
Line 11: Line 11:
===== Using a running cluster =====
A running cluster is identified by its master. When running a Hadoop job using the Perl API, an existing cluster can be used by
  perl script.pl -jt cluster_master:9001 ...

===== Running Hadoop jobs from now on =====

From now on, it is best to run MR jobs using a one-machine cluster: create one with ''hadoop-cluster'' for 3 h (10800 s) and submit jobs using ''-jt cluster_master''. Running the scripts locally without any cluster has several disadvantages, most notably that each job gets only one reducer.
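The workflow above can be sketched as a shell session. The 3 h wait is just ''3 * 3600'' seconds; the commented-out commands show the pattern, with ''script.pl'', ''input_path'' and ''output_path'' as placeholders for your own job:

```shell
# Sketch only: the job script and paths below are placeholders.
WAIT=$((3 * 3600))   # keep the cluster alive 3 hours after the last job
echo "$WAIT"         # -> 10800
# /net/projects/hadoop/bin/hadoop-cluster -c 1 -w "$WAIT"
# perl script.pl -jt cluster_master:9001 input_path output_path
```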
  
===== Example =====
  
Try running the same script {{:courses:mapreduce-tutorial:step-6.txt|step-7-wordcount.pl}} as in the last step, this time by creating the cluster and submitting the job to it:
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-6.txt' -O 'step-7-wordcount.pl'
  /net/projects/hadoop/bin/hadoop-cluster -c 1 -w 600
  # now view the file, e.g.
  # $EDITOR step-7-wordcount.pl
  rm -rf step-7-out-sol; perl step-7-wordcount.pl -jt cluster_master:9001 -Dmapred.max.split.size=1000000 /home/straka/wiki/cs-text-medium step-7-out-sol
  less step-7-out-sol/part-*
Remarks:
  * The reducers seem to start running before the mappers finish. In the web interface, the running time of a reducer is divided into thirds:
    * During the first 33%, the mapper outputs are copied to the machine where the reducer runs.
    * During the second 33%, the (key, value) pairs are sorted.
    * During the last 33%, the user-defined reducer runs.
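The sort-then-reduce behaviour described above can be illustrated outside Hadoop with plain Unix tools: sorting the mapper's (key, value) pairs groups equal keys together, after which one pass can sum them, just like a wordcount reducer. This is only an analogy, not the tutorial's Perl API:

```shell
# Mapper output as tab-separated (word, 1) pairs; sort stands in for
# the shuffle's sort phase, awk for the user-defined reducer.
printf 'b\t1\na\t1\nb\t1\n' \
  | sort -k1,1 \
  | awk -F'\t' '{count[$1] += $2} END {for (w in count) print w, count[w]}' \
  | sort
```

For the three pairs above this prints ''a 1'' and ''b 2''.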

----

<html>
<table style="width:100%">
<tr>
<td style="text-align:left; width: 33%; "></html>[[step-6|Step 6]]: Running on cluster.<html></td>
<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
<td style="text-align:right; width: 33%; "></html>[[step-8|Step 8]]: Multiple mappers, reducers and partitioning.<html></td>
</tr>
</table>
</html>
  
