MapReduce Tutorial : Predefined formats and types
Currently there are two different Java APIs:
- org.apache.hadoop.mapred: This is the original API, which is currently deprecated.
- org.apache.hadoop.mapreduce: This is the new API, which we will be using in this tutorial. The only problem is that some library classes have not yet been converted to the new API, so we cannot use them.
When browsing through the documentation, make sure to stay in the org.apache.hadoop.mapreduce namespace.
Types
The Java API differs from the Perl API in one important aspect: the keys and values are typed.
The type of a value must implement the Writable interface, which provides methods for serializing and deserializing values.
The type of a key must implement WritableComparable, which combines the Writable and Comparable interfaces.
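To make the two contracts concrete, here is a minimal sketch in plain Java that does not depend on Hadoop. The interfaces SketchWritable and SketchWritableComparable are stand-ins written for this example: they mirror the write(DataOutput) and readFields(DataInput) methods of Hadoop's Writable, and the Cell key type is hypothetical. In a real job you would implement org.apache.hadoop.io.WritableComparable instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop.
class WritableSketch {

    // Stand-in mirroring the two methods of org.apache.hadoop.io.Writable.
    interface SketchWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Stand-in for WritableComparable: serializable and ordered.
    interface SketchWritableComparable<T> extends SketchWritable, Comparable<T> {}

    // Hypothetical key type: a cell in a 2D grid.
    static final class Cell implements SketchWritableComparable<Cell> {
        int row;
        int col;

        // No-argument constructor, so an empty instance can be filled by readFields.
        Cell() {}

        Cell(int row, int col) {
            this.row = row;
            this.col = col;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(row);
            out.writeInt(col);
        }

        // Fields must be read in the same order in which write() emitted them.
        @Override
        public void readFields(DataInput in) throws IOException {
            row = in.readInt();
            col = in.readInt();
        }

        // Keys are ordered by row first, then by column.
        @Override
        public int compareTo(Cell other) {
            int byRow = Integer.compare(row, other.row);
            return byRow != 0 ? byRow : Integer.compare(col, other.col);
        }
    }

    // Serializes a value into an in-memory byte array.
    static byte[] serialize(SketchWritable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            value.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuilds a Cell from bytes produced by serialize().
    static Cell deserializeCell(byte[] data) {
        Cell cell = new Cell();
        try {
            cell.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cell;
    }

    public static void main(String[] args) {
        Cell original = new Cell(3, 7);
        byte[] data = serialize(original);
        Cell copy = deserializeCell(data);
        System.out.println(data.length + " bytes, same key: " + (original.compareTo(copy) == 0));
    }
}
```

The serialization half is what lets a value travel between machines, and the comparison half is what lets keys be sorted before they reach the reducer, which is why keys need both.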
Here is a list of frequently used types:
- Text: UTF-8 encoded string
- BytesWritable: sequence of arbitrary bytes
- IntWritable: 32-bit integer
- LongWritable: 64-bit integer
- FloatWritable: 32-bit floating-point number
- DoubleWritable: 64-bit floating-point number
For more complicated types like variable-length encoded integers, dictionaries, bloom filters, etc., see Writable.
Input formats
The input formats are the same as in the Perl API. Every input format also specifies which key and value types it can provide.
An input format is a subclass of FileInputFormat<K,V>, where K is the type of keys and V is the type of values it can load.
Available input formats:
- TextInputFormat: The type of keys is LongWritable and the type of values is Text.
- KeyValueTextInputFormat: The type of both keys and values is Text.
- SequenceFileInputFormat: Any type of keys and values can be used.
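The difference between the two text formats is easiest to see on a small input. The following is a plain-Java imitation of the records a mapper would receive, not Hadoop code. It assumes the usual behaviour of the two formats: TextInputFormat uses the byte offset of each line as the key, and KeyValueTextInputFormat splits each line at the first tab, its default separator. The class and method names are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Hadoop.
class InputRecordSketch {

    // Shape of a TextInputFormat record: (LongWritable offset, Text line).
    static final class OffsetRecord {
        final long offset;
        final String line;

        OffsetRecord(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Shape of a KeyValueTextInputFormat record: (Text key, Text value).
    static final class KeyValueRecord {
        final String key;
        final String value;

        KeyValueRecord(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Key is the byte offset of the line start, value is the line itself.
    static List<OffsetRecord> textRecords(String content) {
        List<OffsetRecord> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(new OffsetRecord(offset, line));
            // Offsets count UTF-8 bytes, not characters; +1 for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Key is the text before the first tab, value is the rest of the line.
    static List<KeyValueRecord> keyValueRecords(String content) {
        List<KeyValueRecord> records = new ArrayList<>();
        for (String line : content.split("\n")) {
            int tab = line.indexOf('\t');
            if (tab < 0) {
                // No separator: the whole line becomes the key, the value is empty.
                records.add(new KeyValueRecord(line, ""));
            } else {
                records.add(new KeyValueRecord(line.substring(0, tab), line.substring(tab + 1)));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "apple\t3\nbanana\t5\n";
        for (OffsetRecord r : textRecords(content)) {
            System.out.println("TextInputFormat          key=" + r.offset + " value=" + r.line);
        }
        for (KeyValueRecord r : keyValueRecords(content)) {
            System.out.println("KeyValueTextInputFormat  key=" + r.key + " value=" + r.value);
        }
    }
}
```

In an actual job the format is selected on the Job object, for example with job.setInputFormatClass(KeyValueTextInputFormat.class), and the mapper's input key and value types must match what the chosen format provides.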
Output formats
An output format is a subclass of FileOutputFormat<K,V>, where K is the type of keys and V is the type of values it can store.
Available output formats:
- TextOutputFormat: The type of both keys and values is Text.
- SequenceFileOutputFormat: Any type of keys and values can be used.
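As a rough picture of what TextOutputFormat produces, the sketch below renders key-value pairs the way that format lays them out by default: one pair per line, with key and value separated by a tab. It is a plain-Java imitation, not Hadoop code, and the TextOutputSketch class and its render method are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of Hadoop.
class TextOutputSketch {

    // Renders each {key, value} pair as "key<TAB>value<NEWLINE>".
    static String render(List<String[]> pairs) {
        StringBuilder out = new StringBuilder();
        for (String[] pair : pairs) {
            out.append(pair[0]).append('\t').append(pair[1]).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"apple", "3"},
                new String[] {"banana", "5"});
        System.out.print(render(pairs));
    }
}
```

In an actual job the format is selected with job.setOutputFormatClass(TextOutputFormat.class). Because the layout is tab-separated text, such output can be read back by a later job using KeyValueTextInputFormat.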