MapReduce tutorial: Input and output formats, testing data
The MapReduce framework works with (key, value) pairs throughout. These pairs can be read from files and written to files, and several formats are available for both.
Input formats
TextInputFormat – values are the lines of UTF-8 plain text files; keys are the positions of their first characters in the file.
KeyValueTextInputFormat – every line of a UTF-8 plain text file is split at the first TAB character into a key and a value. If the line contains no TAB character, the whole line is the key and the value is empty.
SequenceFileInputFormat – binary format.
Input files can be compressed; the MR framework decompresses them transparently.
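To make the two line-based input formats concrete, the following sketch mimics in plain Python how each one turns the contents of a file into (key, value) pairs. It is only an illustration of the rules described above, not code from the framework: the function names are hypothetical, and positions are counted in bytes, which is an assumption of the sketch.

```python
def text_input_pairs(data):
    """Mimic TextInputFormat: key = position of the line start, value = the line.

    `data` is the raw content of a UTF-8 text file as bytes.
    Positions are byte offsets (an assumption of this sketch).
    """
    pairs = []
    offset = 0
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # nothing follows the final newline, so no extra record
    for raw in lines:
        pairs.append((offset, raw.decode("utf-8")))
        offset += len(raw) + 1  # +1 accounts for the newline byte
    return pairs


def key_value_text_input_pairs(data):
    """Mimic KeyValueTextInputFormat: split each line at its first TAB."""
    pairs = []
    for _offset, line in text_input_pairs(data):
        # partition() splits at the first TAB only; without a TAB the
        # whole line becomes the key and the value is the empty string.
        key, _sep, value = line.partition("\t")
        pairs.append((key, value))
    return pairs
```

For the two-line input `b"Praha\tcapital city\nno tab here\n"`, the first function yields the keys 0 and 19, while the second yields the keys `Praha` and `no tab here`, the latter with an empty value.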
Output formats
TextOutputFormat – each (key, value) pair is written in UTF-8 on one line, with the key and the value separated by a TAB character. If the key or the value is empty, no TAB character is used.
SequenceFileOutputFormat – binary format.
Output files can be compressed on demand.
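The TextOutputFormat rule can be written down as a small sketch. The function below is hypothetical and only restates the behaviour described above in Python; it is not taken from the framework.

```python
def text_output_line(key, value):
    """Format one (key, value) pair the way TextOutputFormat is described.

    Both arguments are strings. The TAB separator is written only when
    neither the key nor the value is empty.
    """
    if key == "" or value == "":
        return key + value  # one of them is empty, so no TAB is emitted
    return key + "\t" + value
```

One consequence worth keeping in mind: a pair with an empty key is written as the bare value, so a job that later reads the same line with KeyValueTextInputFormat sees that text as the key and gets an empty value.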
Input data
Testing data are available in several formats and sizes:
/home/straka/wiki/cs-seq – compressed SequenceFile of Czech Wikipedia, 85MB.
/home/straka/wiki/cs-seq-medium – compressed SequenceFile of Czech Wikipedia, 8MB.
/home/straka/wiki/cs-seq-small – compressed SequenceFile of Czech Wikipedia, 35kB.
/home/straka/wiki/cs-text – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 200MB.
/home/straka/wiki/cs-text-medium – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 16MB.
/home/straka/wiki/cs-text-small – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 70kB.
/home/straka/wiki/en-seq – compressed SequenceFile of English Wikipedia, 1.9GB.
It is recommended to use the text format in the tutorial, so that both input and output files are readable.
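Because the text variants are ordinary UTF-8 files, they can be inspected without the framework. The sketch below reads the first few (key, value) pairs from a directory of such files; the function name and the `limit` parameter are hypothetical, and it assumes the directory contains the text files directly.

```python
import os


def peek_key_value_files(directory, limit=3):
    """Return up to `limit` (key, value) pairs from the text files in `directory`."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-file entries
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Same rule as KeyValueTextInputFormat: split at the first TAB.
                key, _sep, value = line.rstrip("\n").partition("\t")
                pairs.append((key, value))
                if len(pairs) >= limit:
                    return pairs
    return pairs
```

Calling `peek_key_value_files("/home/straka/wiki/cs-text-small")` on a machine where the testing data are available should give a quick look at the keys and values before any job is run.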
