
Institute of Formal and Applied Linguistics Wiki


MapReduce Tutorial : Hadoop properties

So far we have controlled Hadoop jobs only through the Perl API, which is quite limited.

Hadoop itself uses many configuration options. Every option has a (dot-separated) name and a value, and can be set on the command line using the -Dname=value syntax:

perl script.pl run [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers] [Hadoop options] input_path output_path

Mind that the order of options matters: the -jt, -c, -w and -r options must precede any Hadoop options in order to be recognized.
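For instance, a job with two reducers and compressed map output on a newly started cluster could be submitted as follows (the script name, cluster size, wait time and paths are placeholders, not values from this tutorial):

```shell
# Perl-level options (-c, -w, -r) come first, Hadoop options (-D...) after them,
# followed by the input and output paths.
perl script.pl run -c 4 -w 600 -r 2 \
    -Dmapred.compress.map.output=true \
    /home/user/input /home/user/output
```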

Every Hadoop option has a read-only default. These defaults are overridden by cluster-specific options, and all of these are in turn overridden by job-specific options given on the command line (or set using the Java API).

A brief list of Hadoop options

Hadoop option | Default value | Description
mapred.job.tracker | ? | Cluster master
mapred.reduce.tasks | 1 | Number of reducers
mapred.min.split.size | 1 | Minimum size of a file split in bytes
mapred.max.split.size | 2^63-1 | Maximum size of a file split in bytes
mapred.map.tasks.speculative.execution | true | If true, multiple instances of some map tasks may be executed in parallel
mapred.reduce.tasks.speculative.execution | true | If true, multiple instances of some reduce tasks may be executed in parallel
mapred.compress.map.output | false | Whether the outputs of the maps should be compressed before being sent across the network; uses SequenceFile compression
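As an illustration of overriding the defaults from the table for a single job (again with a hypothetical script name and paths), speculative execution of both map and reduce tasks can be turned off on the command line:

```shell
# The defaults (true) are overridden only for this one job;
# the cluster-wide configuration is left untouched.
perl script.pl run -r 1 \
    -Dmapred.map.tasks.speculative.execution=false \
    -Dmapred.reduce.tasks.speculative.execution=false \
    /home/user/input /home/user/output
```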

A more complete (though still not exhaustive) list can be found here.

Mapping of Perl options to Hadoop

Perl options | Hadoop options
no options | (running locally)
-jt cluster_master | -Dmapred.job.tracker=cluster_master
-c cluster_machines | the configuration of the newly started cluster contains the corresponding mapred.job.tracker
-r number_of_reducers | -Dmapred.reduce.tasks=number_of_reducers
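Following this mapping, the two invocations below should be equivalent (the master address, script name and paths are placeholders):

```shell
# Using the Perl option:
perl script.pl run -jt master.example.com:9001 -r 2 input output

# Using the underlying Hadoop option directly
# (note that -r must still precede the -D option):
perl script.pl run -r 2 -Dmapred.job.tracker=master.example.com:9001 input output
```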
