MapReduce tutorial : Input and output format, testing data.

The MapReduce framework works with (key, value) pairs throughout. These pairs can be read from files and written to files, and several file formats are available.

Input formats

TextInputFormat – values are the lines of UTF-8 plain text files, keys are the positions (byte offsets) of the first character of each line in the file.
KeyValueTextInputFormat – every line of a UTF-8 plain text file is split at its first TAB character, forming the key and the value. If the line contains no TAB character, the value is empty.
SequenceFileInputFormat – binary format.
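
The difference between the two text formats is only in how a line becomes a (key, value) pair. The following Python sketch imitates that behaviour outside the framework. The function names `text_input_records` and `key_value_text_input_records` are hypothetical and not part of any MapReduce API, and the sketch assumes that a line without a TAB becomes the key as a whole.

```python
def text_input_records(data: bytes):
    """Imitate TextInputFormat: key = byte offset of the line start, value = the line."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # advance by bytes, not by characters


def key_value_text_input_records(data: bytes):
    """Imitate KeyValueTextInputFormat: split each line at its first TAB."""
    for raw in data.splitlines():
        line = raw.decode("utf-8")
        # partition() leaves the value empty when the line has no TAB (assumed behaviour)
        key, _, value = line.partition("\t")
        yield key, value


sample = "Praha\thlavní město\nBrno\n".encode("utf-8")
text_records = list(text_input_records(sample))
kv_records = list(key_value_text_input_records(sample))
```

Here `kv_records` holds the pairs `("Praha", "hlavní město")` and `("Brno", "")`, while `text_records` keeps each whole line as the value. The key of its second record is the byte length of the first line, which is larger than its character count because of the accented letters.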

Input files can be compressed; the MR framework decompresses them transparently.
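
As a rough illustration of what "transparently" means for the reader of the data, the sketch below reads the same record from a plain file and from a gzip-compressed file through one helper. The helper `open_input`, the choice of gzip, and the selection by the `.gz` suffix are assumptions made for this example and do not describe the framework's actual implementation.

```python
import gzip
import os
import tempfile


def open_input(path: str):
    """Open a text input, decompressing it when the name ends in .gz (assumed convention)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Write the same record once as plain text and once gzip-compressed.
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "part.txt")
gz_path = os.path.join(workdir, "part.txt.gz")
with open(plain_path, "w", encoding="utf-8") as f:
    f.write("Brno\tměsto\n")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write("Brno\tměsto\n")

# The code consuming the lines does not need to know which file was compressed.
with open_input(plain_path) as f:
    plain_lines = f.readlines()
with open_input(gz_path) as f:
    gz_lines = f.readlines()
```

Both `plain_lines` and `gz_lines` contain the same single line, so code that processes the records is identical in both cases.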

Output formats

TextOutputFormat – each (key, value) pair is written in UTF-8 on one line, with the key and the value separated by a TAB character. If the key or the value is empty, no TAB character is written.
SequenceFileOutputFormat – binary format.
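
The TextOutputFormat rule can be sketched in a few lines of Python. The function `text_output_line` is a hypothetical helper written only to illustrate the rule stated above; it is not part of any MapReduce API.

```python
def text_output_line(key: str, value: str) -> str:
    """Imitate TextOutputFormat: one pair per line, TAB only when both parts are non-empty."""
    if key and value:
        return f"{key}\t{value}\n"
    # Either part is empty: write the other one alone, without a TAB.
    return f"{key or value}\n"


lines = [
    text_output_line("Praha", "42"),
    text_output_line("Brno", ""),
    text_output_line("", "42"),
]
```

A line produced from a pair with a non-empty key and value can be read back by KeyValueTextInputFormat, as long as the key itself contains no TAB character.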

Output files can be compressed on demand.

Input data

Testing data are available in several formats and sizes:

/home/straka/wiki/cs-seq – compressed SequenceFile of Czech Wikipedia, 85MB.
/home/straka/wiki/cs-seq-medium – compressed SequenceFile of Czech Wikipedia, 8MB.
/home/straka/wiki/cs-seq-small – compressed SequenceFile of Czech Wikipedia, 35kB.
/home/straka/wiki/cs-text – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 200MB.
/home/straka/wiki/cs-text-medium – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 16MB.
/home/straka/wiki/cs-text-small – uncompressed plain text files of Czech Wikipedia in the KeyValueTextInputFormat, 70kB.
/home/straka/wiki/en-seq – compressed SequenceFile of English Wikipedia, 1.9GB.

Using the text format is recommended for this tutorial, so that both input and output files are human-readable.
