
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-29 [2012/01/30 00:44]
straka
courses:mapreduce-tutorial:step-29 [2012/01/30 00:50]
straka
Line 117: Line 117:
  
 ===== Exercise: ParagraphTextInputFormat =====
 +
 +Implement ''ParagraphTextInputFormat''. This format reads plain text files and splits them into //paragraphs//. A paragraph consists of consecutive nonempty lines, and different paragraphs are separated by at least one empty line. The ''ParagraphTextInputFormat'' reads one paragraph at a time and returns its first line as the key and the remaining lines as the value.
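
The paragraph rule above can be illustrated without any Hadoop classes. The following is only a minimal sketch, not the exercise solution: ''ParagraphSketch'' and ''paragraphs'' are hypothetical names, the lines are assumed to be already in memory, an "empty line" is assumed to mean a zero-length line, and joining the remaining lines with a newline is an assumption about how the value is represented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or of the tutorial code.
class ParagraphSketch {
    // Returns one {key, value} pair per paragraph: the key is the first line,
    // the value is the remaining lines joined by '\n' (assumed representation;
    // the value is empty for a one-line paragraph).
    static List<String[]> paragraphs(List<String> lines) {
        List<String[]> result = new ArrayList<>();
        String key = null;                       // first line of the current paragraph
        StringBuilder value = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                // An empty line ends the current paragraph, if there is one.
                // Several consecutive empty lines act as a single separator.
                if (key != null) {
                    result.add(new String[] {key, value.toString()});
                    key = null;
                    value.setLength(0);
                }
            } else if (key == null) {
                key = line;                      // a new paragraph starts here
            } else {
                if (value.length() > 0) value.append('\n');
                value.append(line);
            }
        }
        // The last paragraph need not be followed by an empty line.
        if (key != null) result.add(new String[] {key, value.toString()});
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("title", "line 1", "line 2", "", "", "single");
        for (String[] p : paragraphs(lines)) {
            System.out.println("key=" + p[0] + " value=" + p[1].replace("\n", " | "));
        }
    }
}
```

A real record reader would apply the same rule while streaming lines from its split rather than from a list, and would additionally have to handle split boundaries.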
 +
 +The ''ParagraphTextInputFormat'' should support splitting of uncompressed files. Take care to correctly read paragraphs that lie on a split boundary. The easiest way to do so is the following:
 +  * If the offset of the split is 0, start reading at the beginning of the split. If the offset of the split is larger than 0, start reading from the offset and ignore the first paragraph found.
 +  * read all paragraphs that start 
