Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-2 [2012/01/24 08:56]
straka
courses:mapreduce-tutorial:step-2 [2012/01/24 19:02]
straka
Line 1: Line 1:
 ====== MapReduce tutorial : Input and output format, testing data. ======
 +
 +The MapReduce framework works with (key, value) pairs throughout. These
 +pairs can be read from a file and written to a file, and several formats are available.
 +
 +===== Input formats =====
 +  * ''TextInputFormat'' -- values are the lines of plain text files, keys are the positions of their first characters in the file
 +  * ''KeyValueTextInputFormat'' -- every line of a plain text file is split at the first TAB character, forming a key and a value. If there is no TAB character, the whole line is the key and the value is empty
 +  * ''SequenceFileInputFormat'' -- binary format
 +Input files can be compressed; the MR framework decompresses them transparently.
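The two text input formats differ only in how a line becomes a (key, value) pair. The following is a minimal Python sketch of that rule, not the framework's own code: the function names ''text_input_records'' and ''key_value_text_records'' are hypothetical, and the sketch assumes UTF-8 input and measures positions in bytes.

```python
def text_input_records(data: bytes):
    """Mimic TextInputFormat: yield (offset of line start, line text)."""
    offset = 0
    # keepends=True keeps the line terminator so offsets stay exact.
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_text_records(data: bytes):
    """Mimic KeyValueTextInputFormat: split each line at the first TAB."""
    for raw in data.splitlines():
        line = raw.decode("utf-8")
        # partition() splits at the first TAB only; without a TAB the
        # whole line becomes the key and the value is the empty string.
        key, _, value = line.partition("\t")
        yield key, value


if __name__ == "__main__":
    sample = "a\tb\tc\nno tab\n".encode("utf-8")
    for record in text_input_records(sample):
        print(record)
    for record in key_value_text_records(sample):
        print(record)
```

Note that only the first TAB separates key from value, so any later TAB characters stay inside the value.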
 +
 +===== Output formats =====
 +  * ''TextOutputFormat'' -- each (key, value) pair is printed on one line, with the key and the value separated by a TAB character. If the key or the value is empty, no TAB character is used.
 +  * ''SequenceFileOutputFormat'' -- binary format
 +Output files can be compressed on demand.
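The line layout written by ''TextOutputFormat'' can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the framework's implementation: the function name ''format_text_output'' is hypothetical, an "empty" key or value is modelled as ''None'', and the case where both are missing (not covered by the description above) is assumed to give an empty line.

```python
from typing import Optional


def format_text_output(key: Optional[str], value: Optional[str]) -> str:
    """Return one output line (without the newline) for a (key, value) pair."""
    if key is None and value is None:
        return ""  # assumption: the text above does not cover this case
    if key is None:
        return value  # no TAB when the key is missing
    if value is None:
        return key  # no TAB when the value is missing
    return key + "\t" + value


if __name__ == "__main__":
    print(format_text_output("word", "3"))
    print(format_text_output(None, "only value"))
    print(format_text_output("only key", None))
```

Because the separator is omitted when one side is missing, a reader of the output cannot tell a lone key from a lone value; that matters when the output is later read back with ''KeyValueTextInputFormat''.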
 +
 +===== Input data =====
 +Testing data are available in several formats and sizes:
 +  * ''/home/straka/wiki/cs-seq'' -- compressed SequenceFile of Czech Wikipedia, 85MB.
 +  * ''/home/straka/wiki/cs-seq-medium'' -- compressed SequenceFile of Czech Wikipedia, 8MB.
 +  * ''/home/straka/wiki/cs-seq-small'' -- compressed SequenceFile of Czech Wikipedia, 35kB.
 +  * ''/home/straka/wiki/cs-text'' -- uncompressed plain text files of Czech Wikipedia, 200MB.
 +  * ''/home/straka/wiki/cs-text-medium'' -- uncompressed plain text files of Czech Wikipedia, 16MB.
 +  * ''/home/straka/wiki/cs-text-small'' -- uncompressed plain text files of Czech Wikipedia, 70kB.
 +  * ''/home/straka/wiki/en-seq'' -- compressed SequenceFile of English Wikipedia, 1.9GB.
  
