====== MapReduce Tutorial : Running jobs ======

The input of a Hadoop job is either a file or a directory; in the latter case, all files in the directory are processed. The output of a Hadoop job must be a directory which does not exist yet.

===== Running jobs =====

|                                ^ Command ^
^ Run Perl script ''script.pl''  | ''perl script.pl'' //options// |
^ Run Java job ''job.jar''       | ''/net/projects/hadoop/bin/hadoop job.jar'' //options// |

The options are the same for Perl and Java:

|                                ^ Options ^
^ Run locally                    | ''input output'' |
^ Run using specified jobtracker | ''-jt jobtracker:port input output'' |
^ Run job in dedicated cluster   | ''-c number_of_machines input output'' |
^ Run job in dedicated cluster and after it finishes, \\ wait for //W// seconds before stopping the cluster | ''-c number_of_machines -w W_seconds input output'' |
^ Run using //R// reducers \\ (//R// > 1 does not work when running locally) | ''-r R input output'' |
^ Run using //M// mappers        | ''`/net/projects/hadoop/bin/compute-splitsize input M` input output'' |

Since February 2012, the parameter ''-w'' makes Hadoop wait W seconds after the **last** task has finished. This means that you can start a cluster for one task (with ''-c N_machines -w W_seconds'') and reuse it (with ''-jt jobtracker:port'') for other tasks, without worrying that those tasks will be killed before they finish.

===== Running multiple jobs =====

There are several ways of running multiple jobs:
  * Java only: create multiple ''Job'' instances and call ''submit'' or ''waitForCompletion'' multiple times.
  * Create a cluster using ''/net/projects/hadoop/bin/hadoop-cluster'', parse the jobtracker:port from its output using ''head -1'' and run the jobs using ''-jt jobtracker:port''.
  * Create a shell script which runs multiple jobs using ''-jt HADOOP_JOBTRACKER'', then run it using ''/net/projects/hadoop/bin/hadoop-cluster -c machines script.sh'' (see the sketch below).
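
A minimal sketch of the last approach, assuming that ''hadoop-cluster'' passes the jobtracker address to the script as the ''HADOOP_JOBTRACKER'' environment variable; the script names and data paths are only illustrative:

<code bash>
#!/bin/bash
# script.sh -- runs two chained jobs inside the cluster started by hadoop-cluster.
# HADOOP_JOBTRACKER is assumed to hold jobtracker:port of that cluster.
set -e

# First job: a Perl script (hypothetical), combining -jt with -r (assumed to be combinable).
perl tokenize.pl -jt "$HADOOP_JOBTRACKER" -r 5 books tokens

# Second job: a Java job (hypothetical) consuming the first job's output.
/net/projects/hadoop/bin/hadoop wordcount.jar -jt "$HADOOP_JOBTRACKER" tokens counts
</code>

The script would then be started with ''/net/projects/hadoop/bin/hadoop-cluster -c 10 script.sh'', which creates a ten-machine cluster, runs the script against it and presumably tears the cluster down once the script finishes.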