The input of a Hadoop job is either a file or a directory; in the latter case, all files in the directory are processed.
The output of a Hadoop job must be a directory that does not exist yet.
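Because the output directory must not exist, the output of a previous run has to be removed before the job is started again. A minimal illustration (the jar name and the paths are made up):

```sh
# The input may be a single file or a directory (all files inside it are read).
# The output directory must not exist yet, so delete leftovers from a previous run.
rm -rf output
/net/projects/hadoop/bin/hadoop job.jar input output
```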
Task | Command |
---|---|
Run Perl script script.pl | perl script.pl options |
Run Java job job.jar | /net/projects/hadoop/bin/hadoop job.jar options |
The options are the same for Perl and Java:
Task | Options |
---|---|
Run locally | input output |
Run using specified jobtracker | -jt jobtracker:port input output |
Run job in dedicated cluster | -c number_of_machines input output |
Run job in a dedicated cluster and, after it finishes, wait W seconds before stopping the cluster | -c number_of_machines -w W_seconds input output |
Run using R reducers (R>1 does not work when running locally) | -r R input output |
Run using M mappers | `/net/projects/hadoop/bin/compute-splitsize input M` input output |
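As an illustration, the two invocation styles combined with some of the options above might look as follows (the script and jar names, machine count, wait time and paths are made up):

```sh
# Run a Perl job locally:
perl script.pl input output

# Run a Java job in a dedicated cluster of 5 machines with 3 reducers,
# keeping the cluster alive for 300 seconds after the last task finishes:
/net/projects/hadoop/bin/hadoop job.jar -c 5 -w 300 -r 3 input output
```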
From February 2012, the parameter -w makes Hadoop wait W seconds after the last task has finished. This means that you can start a cluster for one task (with -c N_machines -w W_seconds) and reuse it (with -jt jobtracker:port) for other tasks, without worrying that the cluster will be stopped before those tasks finish.
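A hedged illustration of this pattern (job names, machine count, wait time and paths are made up; ways of obtaining the jobtracker:port are described below):

```sh
# Start a dedicated cluster of 8 machines for the first job; the cluster
# stays up for 600 seconds after its last task finishes:
/net/projects/hadoop/bin/hadoop job1.jar -c 8 -w 600 input1 output1

# While the cluster is still up (within the 600-second window, e.g. from
# another shell), submit further jobs to the same jobtracker:
/net/projects/hadoop/bin/hadoop job2.jar -jt jobtracker:port input2 output2
```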
There are several ways of running multiple jobs:
* create multiple Job instances and call submit or waitForCompletion multiple times
* start a cluster using /net/projects/hadoop/bin/hadoop-cluster, parse the jobtracker:port from the first line of its output using head -1, and run the jobs using -jt jobtracker:port
* write a script which runs the jobs using -jt HADOOP_JOBTRACKER, and run it using /net/projects/hadoop/bin/hadoop-cluster -c machines script.sh (a sketch of such a script follows this list)
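A hedged sketch of the last approach, assuming HADOOP_JOBTRACKER is available to the script as an environment variable holding jobtracker:port (the job names, script names and paths are hypothetical):

```sh
#!/bin/sh
# script.sh - submits several jobs to the cluster it is started in.
# HADOOP_JOBTRACKER is assumed to hold jobtracker:port.

/net/projects/hadoop/bin/hadoop job1.jar -jt "$HADOOP_JOBTRACKER" input1 output1
perl script2.pl -jt "$HADOOP_JOBTRACKER" input2 output2

# Run the whole script in a dedicated cluster of, e.g., 10 machines with:
#   /net/projects/hadoop/bin/hadoop-cluster -c 10 script.sh
```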