Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-2 [2012/01/25 00:32] straka
+++ courses:mapreduce-tutorial:step-2 [2012/01/29 16:03] (current) straka
Line 1: Line 1:
 ====== MapReduce tutorial : Input and output format, testing data. ======
  
-The MapReduce framework is frequently using (key, value) pairs. These
-pairs can be read from a file and written to a file and there are several formats available.
+The MapReduce framework is frequently using (key, value) pairs. These pairs can be read from a file and written to a file and there are several formats available.
  
 ===== Input formats =====
-  * ''TextInputFormat'' -- values are lines of UTF8 plain text files, keys are the positions of their first character in the file
-  * ''KeyValueTextInputFormat'' -- every line of UTF8 plain text file is split using first TAB character, forming key and value. If there is no TAB character, value is empty
-  * ''SequenceFileInputFormat'' -- binary format
+  * ''TextInputFormat'' -- values are lines of UTF8 plain text files, keys are the positions of their first character in the file.
+  * ''KeyValueTextInputFormat'' -- every line of UTF8 plain text file is split using first TAB character, forming key and value. If there is no TAB character, the value is empty.
+  * ''SequenceFileInputFormat'' -- binary format.
 The input format can be compressed and will be decompressed transparently by the MR framework.
  
 ===== Output formats =====
-  * ''TextOutputFormat'' -- (key, value) pair is printed in UTF8 on one line separated by a TAB character. If key or value is empty, no TAB character is used.
-  * ''SequenceFileOutputFormat'' -- binary format
+  * ''TextOutputFormat'' -- (key, value) pair is printed using UTF8 on one line separated by a TAB character. If key or value is empty, no TAB character is used.
+  * ''SequenceFileOutputFormat'' -- binary format.
 The output format can be compressed on demand.
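
The text formats listed above can be illustrated with a minimal Python sketch. This is not Hadoop code and does not call any Hadoop API; the function names ''text_input_records'', ''key_value_record'' and ''text_output_line'' are hypothetical and exist only for this illustration. It assumes that the "position" used as key by ''TextInputFormat'' is counted in bytes, which the page itself does not state.

```python
from typing import Iterator, Tuple


def text_input_records(data: bytes) -> Iterator[Tuple[int, str]]:
    """Imitate TextInputFormat: key = offset of the line start, value = the line.

    Assumption: offsets are counted in bytes of the UTF-8 encoded file.
    """
    offset = 0
    for raw in data.splitlines(keepends=True):
        # Strip the line terminator from the value, but count it in the offset.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_record(line: str) -> Tuple[str, str]:
    """Imitate KeyValueTextInputFormat: split on the first TAB only.

    If the line contains no TAB, the whole line is the key and the value is empty.
    """
    key, _, value = line.partition("\t")
    return key, value


def text_output_line(key: str, value: str) -> str:
    """Imitate TextOutputFormat as described above: no TAB if key or value is empty."""
    if key == "" or value == "":
        return key + value
    return key + "\t" + value


if __name__ == "__main__":
    sample = "Praha\thlavní město\nBrno\n".encode("utf-8")
    for offset, line in text_input_records(sample):
        key, value = key_value_record(line)
        print(offset, repr(text_output_line(key, value)))
```

Only the first TAB separates key from value, so a value may itself contain further TAB characters and still be read back intact.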
  
Line 20: Line 19:
   * ''/home/straka/wiki/cs-seq-medium'' -- compressed SequenceFile of Czech Wikipedia, 8MB.
   * ''/home/straka/wiki/cs-seq-small'' -- compressed SequenceFile of Czech Wikipedia, 35kB.
-  * ''/home/straka/wiki/cs-text'' -- uncompressed plain text files of Czech Wikipedia, 200MB.
-  * ''/home/straka/wiki/cs-text-medium'' -- uncompressed plain text files of Czech Wikipedia, 16MB.
-  * ''/home/straka/wiki/cs-text-small'' -- uncompressed plain text files of Czech Wikipedia, 70kB.
+  * ''/home/straka/wiki/cs-text'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 200MB.
+  * ''/home/straka/wiki/cs-text-medium'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 16MB.
+  * ''/home/straka/wiki/cs-text-small'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 70kB.
   * ''/home/straka/wiki/en-seq'' -- compressed SequenceFile of English Wikipedia, 1.9GB.
+It is recommended to use the text format in the tutorial, so that both input and output files are readable.
+
+----
+
+<html>
+<table style="width:100%">
+<tr>
+<td style="text-align:left; width: 33%; "></html>[[step-1|Step 1]]: Setting the environment.<html></td>
+<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
+<td style="text-align:right; width: 33%; "></html>[[step-3|Step 3]]: Basic mapper.<html></td>
+</tr>
+</table>
+</html>
  
