MapReduce Tutorial : Running jobs
The input of a Hadoop job is either a file or a directory. In the latter case, all files in the directory are processed.
The output of a Hadoop job must be a directory which does not exist yet.
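In practice this means that when a job is re-run over the same data, the old output directory has to be removed first (or a different output path chosen). A minimal sketch, assuming the input is a directory of plain-text files and script.pl is the job script from the examples below:

```
rm -rf output                 # the output directory must not exist before the job starts
perl script.pl input output   # 'input' may be a single file or a directory; all its files are processed
```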
Run Perl jobs
Choosing the mode of operation:

| Mode of operation | Command |
| --- | --- |
| Run locally | perl script.pl input output |
| Run using the specified jobtracker | perl script.pl -jt jobtracker:port input output |
| Run in a dedicated cluster | perl script.pl -c number_of_machines input output |
| Run in a dedicated cluster and, after the job finishes, wait W seconds before stopping the cluster | perl script.pl -c number_of_machines -w W_seconds input output |
Specifying the number of mappers and reducers:

| Goal | Command |
| --- | --- |
| Run using R reducers (R > 1 does not work when running locally) | perl script.pl -r R input output |
| Run using M mappers | perl script.pl `/net/projects/hadoop/bin/compute-splitsize input M` input output |
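
The options from the two tables can be put together. A hypothetical combined invocation (assuming the flags may be freely combined; the concrete numbers are made up for illustration):

```
# Run script.pl on a dedicated cluster of 10 machines with 5 reducers and
# keep the cluster running for 600 seconds after the job finishes.
perl script.pl -c 10 -w 600 -r 5 input output

# Ask compute-splitsize for the options that force roughly 32 mappers
# and pass them to the job via shell command substitution.
perl script.pl `/net/projects/hadoop/bin/compute-splitsize input 32` input output
```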
Run Java jobs
Choosing the mode of operation:

| Mode of operation | Command |
| --- | --- |
| Run locally | /net/projects/hadoop/bin/hadoop job.jar input output |
| Run using the specified jobtracker | /net/projects/hadoop/bin/hadoop job.jar -jt jobtracker:port input output |
| Run in a dedicated cluster | /net/projects/hadoop/bin/hadoop job.jar -c number_of_machines input output |
| Run in a dedicated cluster and, after the job finishes, wait W seconds before stopping the cluster | /net/projects/hadoop/bin/hadoop job.jar -c number_of_machines -w W_seconds input output |
Specifying the number of mappers and reducers:

| Goal | Command |
| --- | --- |
| Run using R reducers (R > 1 does not work when running locally) | /net/projects/hadoop/bin/hadoop job.jar -r R input output |
| Run using M mappers | /net/projects/hadoop/bin/hadoop job.jar `/net/projects/hadoop/bin/compute-splitsize input M` input output |
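
The same flags apply to Java jobs run through the hadoop wrapper script. A hypothetical invocation (assuming, as above, that the flags may be combined; the numbers are for illustration only):

```
# Run job.jar on a dedicated cluster of 10 machines with 5 reducers.
/net/projects/hadoop/bin/hadoop job.jar -c 10 -r 5 input output

# Force roughly 32 mappers via compute-splitsize.
/net/projects/hadoop/bin/hadoop job.jar `/net/projects/hadoop/bin/compute-splitsize input 32` input output
```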