MapReduce Tutorial : Hadoop properties
So far we have controlled Hadoop jobs through the Perl API, which is quite limited.
Hadoop itself uses many configuration options. These can be set on the command line using the -Dname=value syntax:
perl script.pl run [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers] [Hadoop options] input_path output_path
Mind that the order of options matters: -jt, -c, -w and -r must precede the Hadoop options to be recognized.
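The ordering rule can be made concrete with a small helper that assembles the argument list. This is only an illustrative sketch in Python: the function `build_command` and its parameter names are hypothetical and not part of the Perl API. It follows the synopsis above, so it assumes that -jt and -c are alternatives and that -w is only meaningful together with -c.

```python
def build_command(script, input_path, output_path,
                  jt=None, c=None, w=None, r=None, hadoop_options=None):
    """Return an argv list with the Perl options placed before the -D options.

    Hypothetical helper: it only mirrors the synopsis shown in the tutorial.
    """
    if jt is not None and c is not None:
        raise ValueError("-jt and -c are alternatives in the synopsis")
    if w is not None and c is None:
        raise ValueError("-w is only listed together with -c")

    argv = ["perl", script, "run"]
    # Perl-level options come first, otherwise they are not recognized.
    if jt is not None:
        argv += ["-jt", str(jt)]
    if c is not None:
        argv += ["-c", str(c)]
        if w is not None:
            argv += ["-w", str(w)]
    if r is not None:
        argv += ["-r", str(r)]
    # Hadoop options follow, each in the -Dname=value form.
    for name, value in (hadoop_options or {}).items():
        argv.append(f"-D{name}={value}")
    # Input and output paths are always last.
    argv += [input_path, output_path]
    return argv


if __name__ == "__main__":
    # The values below are made up for illustration.
    print(" ".join(build_command(
        "script.pl", "input_path", "output_path",
        c=10, r=5,
        hadoop_options={"mapred.max.split.size": 1000000},
    )))
```

Building the list in one place keeps the Perl options ahead of the -D options regardless of the order in which the caller supplies them.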
Every Hadoop option has a read-only default. These defaults are overridden by cluster-specific options. Finally, both are overridden by job-specific options given on the command line (or set using the Java API).
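The three layers of precedence can be modelled as a chain of dictionaries. The sketch below is a Python illustration only and does not show how Hadoop implements configuration loading; the function `effective_options` and all example values are hypothetical.

```python
from collections import ChainMap


def effective_options(defaults, cluster, job):
    """Merge three option layers; job beats cluster, cluster beats defaults."""
    # ChainMap searches its mappings left to right, so the leftmost one wins.
    return dict(ChainMap(job, cluster, defaults))


if __name__ == "__main__":
    # Made-up values, chosen only to show which layer wins.
    defaults = {"mapred.reduce.tasks": "1", "mapred.compress.map.output": "false"}
    cluster = {"mapred.reduce.tasks": "4", "mapred.job.tracker": "cluster_master"}
    job = {"mapred.reduce.tasks": "10"}
    for name, value in sorted(effective_options(defaults, cluster, job).items()):
        print(f"{name} = {value}")
```

In this example the job-level value of mapred.reduce.tasks wins, while the options the job does not mention fall through to the cluster layer and then to the defaults.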
A brief list of Hadoop options
Hadoop option | Default value | Description |
---|---|---|
mapred.job.tracker | ? | Cluster master. |
mapred.reduce.tasks | 1 | Number of reducers. |
mapred.min.split.size | 1 | Minimum size of file split in bytes. |
mapred.max.split.size | 2^63-1 | Maximum size of file split in bytes. |
mapred.map.tasks.speculative.execution | true | If true, then multiple instances of some map tasks may be executed in parallel. |
mapred.reduce.tasks.speculative.execution | true | If true, then multiple instances of some reduce tasks may be executed in parallel. |
mapred.compress.map.output | false | If true, the outputs of the maps are compressed before being sent across the network. Uses SequenceFile compression. |
A more complete list (but not exhaustive) can be found here.
Mapping of Perl options to Hadoop
Perl options | Hadoop options |
---|---|
no options (running locally) | -Dmapred.job.tracker=local -Dmapred.local.dir=hadoop-localrunner-tmp -Dhadoop.tmp.dir=hadoop-localrunner-tmp |
-jt cluster_master | -Dmapred.job.tracker=cluster_master |
-c cluster_size | the configuration of the new cluster contains -Dmapred.job.tracker=cluster_master |
-r number_of_reducers | -Dmapred.reduce.tasks=number_of_reducers |
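The mapping in the table can be written out as a small translation function. This is a Python sketch for illustration, not the code the Perl API actually runs, and the name `perl_to_hadoop` is hypothetical. It covers only the local and -jt rows plus -r. The -c row is left out because, according to the table, that setting lives in the configuration of the new cluster rather than on the command line. It also assumes that -r translates the same way whether or not -jt is given.

```python
LOCAL_TMP_DIR = "hadoop-localrunner-tmp"  # directory name taken from the table


def perl_to_hadoop(jt=None, r=None):
    """Translate the Perl-level options -jt and -r into -D Hadoop options."""
    options = []
    if jt is None:
        # No cluster given: run locally with a private temporary directory.
        options += [
            "-Dmapred.job.tracker=local",
            f"-Dmapred.local.dir={LOCAL_TMP_DIR}",
            f"-Dhadoop.tmp.dir={LOCAL_TMP_DIR}",
        ]
    else:
        options.append(f"-Dmapred.job.tracker={jt}")
    if r is not None:
        options.append(f"-Dmapred.reduce.tasks={r}")
    return options


if __name__ == "__main__":
    print(perl_to_hadoop())
    print(perl_to_hadoop(jt="cluster_master", r=5))
```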