====== MapReduce Tutorial : Compression and job configuration ======
===== Compression =====
The output files can be compressed using
<code java>
FileOutputFormat.setCompressOutput(job, true);
</code>
The default compression format is ''deflate'' -- raw Zlib compression. Several other compression formats can be selected:
<code java>
import org.apache.hadoop.io.compress.*;
...
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);   // .gz
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);  // .bz2
</code>
Of course, files in any of these formats are decompressed transparently when they are read.
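Putting it together, a minimal driver sketch could look as follows. The class name, output path, and the remaining job setup are illustrative assumptions; only the two compression calls come from the snippets above.
<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedOutputExample {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "compressed-output");
    job.setJarByClass(CompressedOutputExample.class);
    // ... set mapper, reducer, key/value classes and the input path as usual ...

    FileOutputFormat.setOutputPath(job, new Path("output-dir"));
    FileOutputFormat.setCompressOutput(job, true);                    // enable output compression
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // produce part-r-*.gz files

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</code>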
===== Job configuration =====
The job properties can be set:
* on the command line -- the [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/util/ToolRunner.html|ToolRunner]] parses options in the format ''-Dname=value''. See the [[.:step-24#running-the-job|syntax of the hadoop script]].
* using ''getConfiguration()'' of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html|Job]], which returns a [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/conf/Configuration.html|Configuration]] object. It provides the following methods:
  * ''String get(String name)'' -- returns the value of the ''name'' property, or ''null'' if it is not set.
  * ''String get(String name, String defaultValue)'' -- returns the value of the ''name'' property, or ''defaultValue'' if it is not set.
  * ''getBoolean'', ''getClass'', ''getFile'', ''getFloat'', ''getInt'', ''getLong'', ''getStrings'' -- return the value of the ''name'' property converted to the given type (i.e., number, file name, class name, ...).
  * ''set(String name, String value)'' -- sets the value of the ''name'' property to ''value''.
  * ''setBoolean'', ''setClass'', ''setFile'', ''setFloat'', ''setInt'', ''setLong'', ''setStrings'' -- set the value of the ''name'' property to a value of the given type (i.e., number, file name, class name, ...).
* in a mapper or a reducer, the ''context'' object also provides a ''getConfiguration()'' method, so the job properties can be accessed there as well (see the sketch after this list).
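As a sketch of the whole round trip, the following hypothetical job stores a custom property ''example.threshold'' in the driver and reads it back in the mapper; the property name, class names, and the threshold logic are purely illustrative.
<code java>
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CustomPropertyExample {
  public static class ThresholdMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private int threshold;

    @Override
    protected void setup(Context context) {
      // Read the property back; 10 is used only if it was never set.
      threshold = context.getConfiguration().getInt("example.threshold", 10);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit only lines at least 'threshold' bytes long.
      if (value.getLength() >= threshold)
        context.write(value, new LongWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(CustomPropertyExample.class);
    // Set the property in the driver; it travels with the job configuration
    // and is therefore visible in every mapper and reducer.
    job.getConfiguration().setInt("example.threshold", 42);
    // ... set mapper, input/output formats and paths as usual ...
  }
}
</code>
When the driver is run through ToolRunner and passes the parsed configuration to the job, the same property could also be supplied on the command line as ''-Dexample.threshold=42''.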
Apart from the already mentioned [[.:step-9#a-brief-list-of-hadoop-options|brief list of Hadoop properties]], there is one important Java-specific property:
* **''mapred.child.java.opts''** with default value **''-Xmx200m''**. This property sets the Java options used for every task attempt (mapper, reducer, combiner, partitioner). The default value ''-Xmx200m'' specifies the maximum size of the memory allocation pool. If your mappers and reducers need 1GB of memory, use ''-Xmx1024m''. Other Java options are described in ''man java''.
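For illustration, a driver could also set this option programmatically before submitting the job; the sketch below is hypothetical, and the same effect is achieved on the command line with ''-Dmapred.child.java.opts=-Xmx1024m''.
<code java>
import org.apache.hadoop.mapreduce.Job;

public class LargeHeapJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job();
    // Give every task attempt a 1GB heap instead of the default 200MB.
    job.getConfiguration().set("mapred.child.java.opts", "-Xmx1024m");
    // ... configure mapper, reducer, input and output as usual ...
  }
}
</code>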
----
[[step-25|Step 25]]: Reducers, combiners and partitioners. | [[.|Overview]] | [[step-27|Step 27]]: Running multiple Hadoop jobs in one source file. |