courses:mapreduce-tutorial:step-2 [2012/01/24 19:02] straka |
courses:mapreduce-tutorial:step-2 [2012/01/29 16:03] (current) straka |
====== MapReduce tutorial : Input and output format, testing data. ======

The MapReduce framework works with (key, value) pairs. These pairs can be read from a file and written to a file, and several formats are available.

===== Input formats =====
  * ''TextInputFormat'' -- values are the lines of UTF-8 plain text files; keys are the positions of their first character in the file.
  * ''KeyValueTextInputFormat'' -- every line of a UTF-8 plain text file is split at the first TAB character, forming the key and the value. If there is no TAB character, the value is empty.
  * ''SequenceFileInputFormat'' -- binary format.
Input in any of these formats can be compressed and will be decompressed transparently by the MR framework.

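The way the two text input formats turn lines into (key, value) pairs can be illustrated with a minimal Python sketch. It is not the framework's code: the function names ''text_input_pairs'' and ''key_value_text_pairs'' are hypothetical, and measuring the position of the first character in bytes is an assumption.

```python
def text_input_pairs(data: bytes):
    """Imitate TextInputFormat: key = offset of the line start, value = line text.

    Assumption: the position is counted in bytes of the UTF-8 encoded file.
    """
    offset = 0
    for raw in data.splitlines(keepends=True):
        # Strip the line terminator, decode the rest as UTF-8.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_text_pairs(data: bytes):
    """Imitate KeyValueTextInputFormat: split every line at the first TAB."""
    for _offset, line in text_input_pairs(data):
        # partition() splits at the first TAB only; without a TAB the
        # whole line becomes the key and the value is empty.
        key, _tab, value = line.partition("\t")
        yield key, value


sample = "a\tb\tc\nno tab\n".encode("utf-8")
print(list(text_input_pairs(sample)))
print(list(key_value_text_pairs(sample)))
```

In the sketch, a line with several TAB characters keeps everything after the first TAB in the value, and a line without a TAB yields an empty value.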
===== Output formats =====
  * ''TextOutputFormat'' -- every (key, value) pair is printed in UTF-8 on one line, with the key and the value separated by a TAB character. If the key or the value is empty, no TAB character is used.
  * ''SequenceFileOutputFormat'' -- binary format.
The output can be compressed on demand.

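The line layout described for ''TextOutputFormat'' can be sketched in a few lines of Python. This only illustrates the rule stated above and is not the framework's implementation; the function name ''text_output_line'' is hypothetical, and treating an empty string as an "empty" key or value is an assumption.

```python
def text_output_line(key: str, value: str) -> str:
    """Format one (key, value) pair the way the text output rule describes.

    Assumption: an empty string stands for an empty key or value.
    """
    if key and value:
        # Both parts present: separate them with a single TAB.
        return f"{key}\t{value}"
    # One part is empty: print the other one without any TAB.
    return key or value


for pair in [("word", "3"), ("word", ""), ("", "3")]:
    print(repr(text_output_line(*pair)))
```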
  * ''/home/straka/wiki/cs-seq-medium'' -- compressed SequenceFile of Czech Wikipedia, 8MB.
  * ''/home/straka/wiki/cs-seq-small'' -- compressed SequenceFile of Czech Wikipedia, 35kB.
  * ''/home/straka/wiki/cs-text'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 200MB.
  * ''/home/straka/wiki/cs-text-medium'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 16MB.
  * ''/home/straka/wiki/cs-text-small'' -- uncompressed plain text files of Czech Wikipedia in the ''KeyValueTextInputFormat'', 70kB.
  * ''/home/straka/wiki/en-seq'' -- compressed SequenceFile of English Wikipedia, 1.9GB.
It is recommended to use the text format in the tutorial, so that both input and output files are readable.