The output of a Hadoop job must be a directory that does not exist yet (Hadoop refuses to overwrite an existing output directory).

===== Running jobs =====

| ^ Command ^
^ Run Perl script ''script.pl'' | ''perl script.pl'' //options// |
^ Run Java job ''job.jar'' | ''/net/projects/hadoop/bin/hadoop job.jar'' //options// |
  
The options are the same for the Perl and Java versions:

| ^ Options ^
^ Run locally | ''input output'' |
^ Run using specified jobtracker | ''-jt jobtracker:port input output'' |
^ Run job in dedicated cluster | ''-c number_of_machines input output'' |
^ Run job in dedicated cluster and after it finishes, \\ wait for //W// seconds before stopping the cluster | ''-c number_of_machines -w W_seconds input output'' |
^ Run using //R// reducers \\ (//R// > 1 does not work when running locally) | ''-r R input output'' |
^ Run using //M// mappers | ''`/net/projects/hadoop/bin/compute-splitsize input M` input output'' |
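
For example (machine and reducer counts illustrative), ''perl script.pl -c 10 -r 5 input output'' runs ''script.pl'' on a dedicated cluster of 10 machines with 5 reducers; the same options work for Java jobs: ''/net/projects/hadoop/bin/hadoop job.jar -c 10 -r 5 input output''.
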
Since February 2012, the ''-w'' parameter makes Hadoop wait //W// seconds after the **last** task finishes. This means that you can start a cluster for one task (with ''-c number_of_machines -w W_seconds'') and reuse it (with ''-jt jobtracker:port'') for other tasks, without worrying that those tasks will be killed before they finish.
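
For example (all values illustrative), ''perl script1.pl -c 10 -w 600 input1 output1'' starts a 10-machine cluster that stays up for 600 seconds after its last task finishes; a second job can then be sent to the same cluster with ''perl script2.pl -jt machine:port input2 output2'', where ''machine:port'' is the jobtracker address of the running cluster (see the next section for one way to obtain it).
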
  
===== Running multiple jobs =====
There are several ways of running multiple jobs:
  * Java only: create multiple ''Job'' instances and call ''submit'' or ''waitForCompletion'' on each of them (see the sketch below this list)
  * Create a cluster using ''/net/projects/hadoop/bin/hadoop-cluster'', parse the jobtracker:port from the first line of its output using ''head -1'' and run the jobs using ''-jt jobtracker:port''
  * Create a shell script which runs multiple jobs using ''-jt HADOOP_JOBTRACKER''. Then run it using ''/net/projects/hadoop/bin/hadoop-cluster -c machines script.sh''.
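
The following is a minimal sketch of the "Java only" approach: it chains two jobs, the second reading the output of the first. The class name ''TwoJobs'', the job names and the reliance on the default (identity) mapper and reducer are illustrative, not part of this tutorial; a real job would also set its mapper, reducer and key/value classes.

<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // First job: reads args[0], writes the intermediate directory args[1].
    Job first = new Job(conf, "first pass");
    first.setJarByClass(TwoJobs.class);
    FileInputFormat.addInputPath(first, new Path(args[0]));
    FileOutputFormat.setOutputPath(first, new Path(args[1]));
    // waitForCompletion blocks until the job finishes.
    if (!first.waitForCompletion(true)) System.exit(1);

    // Second job: reads the intermediate directory, writes args[2].
    Job second = new Job(conf, "second pass");
    second.setJarByClass(TwoJobs.class);
    FileInputFormat.addInputPath(second, new Path(args[1]));
    FileOutputFormat.setOutputPath(second, new Path(args[2]));
    // Calling submit() here instead would return immediately,
    // so independent jobs could run in parallel.
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}
</code>

For the other two approaches, one way to capture the cluster address (assuming ''hadoop-cluster'' prints it on its first output line, as described above) is a shell variable such as ''JT=`/net/projects/hadoop/bin/hadoop-cluster -c 10 | head -1`'' (arguments illustrative), after which each job is run with ''-jt $JT''.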
