The input of a Hadoop job is either a file or a directory; in the latter case, all files in the directory are processed.
The output of a Hadoop job must be a directory that does not exist yet.
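Because the output directory must not exist, the output of a previous run has to be removed before the job is started again. A minimal illustration (the jar name and the paths are made up):

```sh
# The input may be a single file or a directory (all files inside it are read).
# The output directory must not exist yet, so delete leftovers from a previous run.
rm -rf output
/net/projects/hadoop/bin/hadoop job.jar input output
```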
Task | Command |
---|---|
Run Perl script script.pl | perl script.pl options |
Run Java job job.jar | /net/projects/hadoop/bin/hadoop job.jar options |
The options are the same for Perl and Java:
Task | Options |
---|---|
Run locally | input output |
Run using specified jobtracker | -jt jobtracker:port input output |
Run job in dedicated cluster | -c number_of_machines input output |
Run job in a dedicated cluster and, after it finishes, wait W seconds before stopping the cluster | -c number_of_machines -w W_seconds input output |
Run using R reducers (R>1 does not work when running locally) | -r R input output |
Run using M mappers | `/net/projects/hadoop/bin/compute-splitsize input M` input output |
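As an illustration, the two invocation styles combined with some of the options above might look as follows (the script and jar names, machine count, wait time and paths are made up):

```sh
# Run a Perl job locally:
perl script.pl input output

# Run a Java job in a dedicated cluster of 5 machines with 3 reducers,
# keeping the cluster alive for 300 seconds after the last task finishes:
/net/projects/hadoop/bin/hadoop job.jar -c 5 -w 300 -r 3 input output
```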
From February 2012, the parameter -w makes Hadoop wait W seconds after the last task has finished. This means that you can start a cluster for one task (with -c N_machines -w W_seconds) and reuse it (with -jt jobtracker:port) for other tasks, without worrying that the cluster will be stopped before those tasks finish.
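A hedged illustration of this pattern (job names, machine count, wait time and paths are made up; ways of obtaining the jobtracker:port are described below):

```sh
# Start a dedicated cluster of 8 machines for the first job; the cluster
# stays up for 600 seconds after its last task finishes:
/net/projects/hadoop/bin/hadoop job1.jar -c 8 -w 600 input1 output1

# While the cluster is still up (within the 600-second window, e.g. from
# another shell), submit further jobs to the same jobtracker:
/net/projects/hadoop/bin/hadoop job2.jar -jt jobtracker:port input2 output2
```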
There are several ways of running multiple jobs:
* create multiple Job instances and call submit or waitForCompletion multiple times
* start a cluster using /net/projects/hadoop/bin/hadoop-cluster, parse the jobtracker:port from the first line of its output using head -1, and run the jobs using -jt jobtracker:port
* write a script which runs the jobs using -jt HADOOP_JOBTRACKER, and run it using /net/projects/hadoop/bin/hadoop-cluster -c machines script.sh (a sketch of such a script follows this list)
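A hedged sketch of the last approach, assuming HADOOP_JOBTRACKER is available to the script as an environment variable holding jobtracker:port (the job names, script names and paths are hypothetical):

```sh
#!/bin/sh
# script.sh - submits several jobs to the cluster it is started in.
# HADOOP_JOBTRACKER is assumed to hold jobtracker:port.

/net/projects/hadoop/bin/hadoop job1.jar -jt "$HADOOP_JOBTRACKER" input1 output1
perl script2.pl -jt "$HADOOP_JOBTRACKER" input2 output2

# Run the whole script in a dedicated cluster of, e.g., 10 machines with:
#   /net/projects/hadoop/bin/hadoop-cluster -c 10 script.sh
```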