Institute of Formal and Applied Linguistics Wiki

MapReduce Tutorial: Running jobs

The input of a Hadoop job is either a file or a directory. In the latter case, all files in the directory are processed.

The output of a Hadoop job must be a directory that does not exist yet.
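
Because the output directory must not exist, it can help to check for it before submitting a job. The following is a minimal sketch, assuming the output path is visible to the shell as an ordinary file system path; the helper name output_is_free and the directory name wordcount-output are hypothetical and not part of the tutorial.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the given path does not exist yet.
output_is_free() {
    [ ! -e "$1" ]
}

output="wordcount-output"   # hypothetical output directory name

if output_is_free "$output"; then
    # This is the point where the job would be started,
    # e.g. with: perl script.pl input "$output"
    echo "ok: $output does not exist, the job can create it"
else
    echo "error: $output already exists, remove it or choose another name" >&2
fi
```

The check only reports the state of the path; it deliberately does not delete an existing directory, so results of an earlier run are never lost by accident.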

Running jobs

Run Perl script script.pl: perl script.pl options
Run Java job job.jar: /net/projects/hadoop/bin/hadoop job.jar options

The options are the same for Perl and Java:

Run locally: input output
Run using specified jobtracker: -jt jobtracker:port input output
Run job in dedicated cluster: -c number_of_machines input output
Run job in dedicated cluster and, after it finishes, wait for W seconds before stopping the cluster: -c number_of_machines -w W_seconds input output
Run using R reducers (R > 1 does not work when running locally): -r R input output
Run using M mappers: `/net/projects/hadoop/bin/compute-splitsize input M` input output
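
As a concrete reading of the options above, the following dry-run sketch prints a few complete command lines instead of executing them, so it can be tried on a machine without Hadoop. The helper show, the numeric values, and the names script.pl, input and output are placeholders. Combining -r with -jt is an assumption based on the note that more than one reducer does not work locally.

```shell
#!/bin/sh
# Dry run: print each command line instead of executing it.
# script.pl, input, output and all numbers are placeholder values.
show() {
    echo "$*"
}

# Run locally.
show perl script.pl input output
# Use an already running jobtracker (address left as a placeholder).
show perl script.pl -jt jobtracker:port input output
# Dedicated cluster of 10 machines.
show perl script.pl -c 10 input output
# Dedicated cluster that is kept for 600 seconds after the job finishes.
show perl script.pl -c 10 -w 600 input output
# 5 reducers; assumed to be combined with -jt, since R > 1 does not work locally.
show perl script.pl -jt jobtracker:port -r 5 input output
# 20 mappers; the backquoted part is single-quoted here so that the dry run
# prints the command substitution instead of running compute-splitsize.
show perl script.pl '`/net/projects/hadoop/bin/compute-splitsize input 20`' input output
```

To run a command for real, drop the show prefix and, in the last line, the single quotes around the backquoted part.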

Since February 2012, the -w parameter makes Hadoop wait W seconds after the last task has finished. This means that you can start a cluster for one task (with -c N_machines -w W_seconds) and reuse it (with -jt jobtracker:port) for other tasks without worrying that they will be killed before finishing.
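
The reuse pattern can be sketched as a dry run that only prints the two command lines. The script names first.pl and second.pl, the directory names, and the numbers are hypothetical; the jobtracker address is kept as the placeholder jobtracker:port because its real value is not given here.

```shell
#!/bin/sh
# Dry-run sketch of cluster reuse: the commands are printed, not executed.
machines=10                    # placeholder cluster size
wait_seconds=600               # placeholder: keep the cluster 600 s after the last task
jobtracker="jobtracker:port"   # placeholder address of the running cluster

# First job: starts a dedicated cluster and leaves it running for a while.
first_job="perl first.pl -c $machines -w $wait_seconds input first-output"
# Second job: submitted to the same cluster while it is still alive.
second_job="perl second.pl -jt $jobtracker first-output second-output"

echo "$first_job"
echo "$second_job"
```

With these placeholder values, the second job would have to be submitted within 600 seconds after the last task of the first job finishes; after that the cluster is stopped.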

Running multiple jobs

There are several ways of running multiple jobs:
