====== MapReduce tutorial: Input and output formats, testing data ======

The MapReduce framework works with (key, value) pairs throughout. These pairs can be read from files and written to files, and several formats are available.

===== Input formats =====

  * ''TextInputFormat'' -- values are lines of UTF-8 plain text files; keys are the positions of their first character in the file.
  * ''KeyValueTextInputFormat'' -- every line of a UTF-8 plain text file is split at the first TAB character into a key and a value. If there is no TAB character, the whole line is the key and the value is empty.
  * ''SequenceFileInputFormat'' -- binary format.

Input files can be compressed; the MR framework decompresses them transparently.

===== Output formats =====

  * ''TextOutputFormat'' -- each (key, value) pair is written in UTF-8 on one line, with the key and value separated by a TAB character. If the key or the value is empty, no TAB character is written.
  * ''SequenceFileOutputFormat'' -- binary format.

Output files can be compressed on demand.

===== Input data =====

Testing data are available in several formats and sizes:

  * ''/home/straka/wiki/cs-seq'' -- compressed SequenceFile of Czech Wikipedia, 85MB.
  * ''/home/straka/wiki/cs-seq-medium'' -- compressed SequenceFile of Czech Wikipedia, 8MB.
  * ''/home/straka/wiki/cs-seq-small'' -- compressed SequenceFile of Czech Wikipedia, 35kB.
  * ''/home/straka/wiki/cs-text'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 200MB.
  * ''/home/straka/wiki/cs-text-medium'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 16MB.
  * ''/home/straka/wiki/cs-text-small'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 70kB.
  * ''/home/straka/wiki/en-seq'' -- compressed SequenceFile of English Wikipedia, 1.9GB.

We recommend using the text formats in this tutorial, so that both input and output files are human-readable.

----
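The line-splitting rules of the text formats described above can be sketched in plain Python. This is only an illustration of the rules, not code from the MR framework: the function names ''text_input_records'', ''key_value_records'' and ''text_output_line'' are hypothetical, the sample data is made up, and the sketch assumes that the position of a line is its byte offset in the file.

```python
def text_input_records(data: bytes):
    """Mimic TextInputFormat: yield (offset of line start, line text).

    Assumption: the position is counted in bytes, not in characters.
    """
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes):
    """Mimic KeyValueTextInputFormat: split each line at its first TAB."""
    for raw in data.splitlines():
        # partition() returns an empty value when the line has no TAB.
        key, _, value = raw.decode("utf-8").partition("\t")
        yield key, value


def text_output_line(key: str, value: str) -> str:
    """Mimic TextOutputFormat: join key and value with a TAB.

    An empty key or value is skipped, so no TAB is written for it.
    """
    return "\t".join(part for part in (key, value) if part)


if __name__ == "__main__":
    # Hypothetical two-line input; the second line has no TAB.
    sample = "Praha\tHlavní město\nBrno\n".encode("utf-8")
    for record in text_input_records(sample):
        print("TextInputFormat:", record)
    for record in key_value_records(sample):
        print("KeyValueTextInputFormat:", record)
    print(repr(text_output_line("Praha", "3")))
    print(repr(text_output_line("Praha", "")))
```

Running the sketch on a few lines of one of the ''cs-text-small'' files is a quick way to see which part of each line a mapper would receive as the key and which as the value.

----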
[[step-1|Step 1]]: Setting the environment. [[.|Overview]] [[step-3|Step 3]]: Basic mapper.