MapReduce tutorial: Input and output formats, testing data
The MapReduce framework works with (key, value) pairs throughout. These pairs can be read from files and written to files, and several file formats are available.
Input formats
TextInputFormat – values are the lines of UTF-8 plain text files; keys are the positions of their first characters in the file.
KeyValueTextInputFormat – every line of a UTF-8 plain text file is split at the first TAB character, forming the key and the value. If there is no TAB character, the value is empty.
SequenceFileInputFormat – binary format.
Input files in any of these formats can be compressed; the MR framework decompresses them transparently.
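The way the two text input formats turn a file into (key, value) pairs can be illustrated with a short Python sketch. This is not the framework's implementation: the function names `text_input_records` and `key_value_text_input_record` are hypothetical, and the sketch assumes that lines end with `\n` and that positions are counted in bytes.

```python
def text_input_records(data: bytes):
    """Yield (position, line) pairs in the style of TextInputFormat.

    Assumption: the key is the byte offset of the line's first character.
    """
    offset = 0
    while offset < len(data):
        end = data.find(b"\n", offset)
        if end == -1:          # last line without a trailing newline
            end = len(data)
        yield offset, data[offset:end].decode("utf-8")
        offset = end + 1       # skip the newline itself


def key_value_text_input_record(line: str):
    """Split a line at the first TAB in the style of KeyValueTextInputFormat.

    Without a TAB the whole line becomes the key and the value is empty.
    """
    key, _, value = line.partition("\t")
    return key, value


if __name__ == "__main__":
    sample = "Praha\thlavní město\nBrno\n".encode("utf-8")
    for position, line in text_input_records(sample):
        print(position, key_value_text_input_record(line))
```

Note that only the first TAB separates key from value, so any further TAB characters stay inside the value.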
Output formats
TextOutputFormat – each (key, value) pair is written in UTF-8 on one line, with key and value separated by a TAB character. If the key or the value is empty, no TAB character is written.
SequenceFileOutputFormat – binary format.
Output files can be compressed on demand.
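The TextOutputFormat rule described above, including the case of an empty key or value, can be sketched in Python as follows. The function name `text_output_line` is hypothetical and the sketch only mirrors the rule as stated here, not the framework's actual code.

```python
def text_output_line(key: str, value: str) -> str:
    """Format one (key, value) pair in the style of TextOutputFormat."""
    if key and value:
        # Both parts are present: separate them with a single TAB.
        return f"{key}\t{value}"
    # At least one part is empty: write the other one without a TAB.
    return key or value


if __name__ == "__main__":
    for pair in [("word", "3"), ("word", ""), ("", "3")]:
        print(repr(text_output_line(*pair)))
```

One consequence of this rule is that a line without a TAB does not reveal whether it held a key or a value; reading it back with KeyValueTextInputFormat always treats it as a key with an empty value.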
Input data
Testing data are available in several formats and sizes:
/home/straka/wiki/cs-seq – compressed SequenceFile of Czech Wikipedia, 85MB.
/home/straka/wiki/cs-seq-medium – compressed SequenceFile of Czech Wikipedia, 8MB.
/home/straka/wiki/cs-seq-small – compressed SequenceFile of Czech Wikipedia, 35kB.
/home/straka/wiki/cs-text – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 200MB.
/home/straka/wiki/cs-text-medium – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 16MB.
/home/straka/wiki/cs-text-small – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 70kB.
/home/straka/wiki/en-seq – compressed SequenceFile of English Wikipedia, 1.9GB.
It is recommended to use the text format in the tutorial, so that both input and output files are readable.