courses:mapreduce-tutorial:step-29 [2012/01/30 00:40] straka
We start by creating ''FileAsPathInputFormat'', which reads any file, splits it, and for each split returns exactly one input pair (file_path, start-length) with types (''Text'', ''Text''), where ''file_path'' is the path to the file and ''start-length'' is a string containing two dash-separated numbers: the start offset of the split and the length of the split.
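A mapper consuming this format has to take the ''start-length'' value apart again. A minimal self-contained sketch of that parsing (the class and method names below are our own, not part of the tutorial code):

<code java>
public class StartLengthDemo {
  // Parse a dash-separated "start-length" string into its two numbers.
  static long[] parseStartLength(String value) {
    int dash = value.indexOf('-');
    long start = Long.parseLong(value.substring(0, dash));
    long length = Long.parseLong(value.substring(dash + 1));
    return new long[]{start, length};
  }

  public static void main(String[] args) {
    // e.g. a split starting at offset 0 with length 64MB
    long[] sl = parseStartLength("0-67108864");
    System.out.println(sl[0] + " " + sl[1]);
  }
}
</code>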

When implementing a new input format, we must
  * decide whether the input files are splittable. Usually uncompressed files are splittable and compressed files are not, with the exception of ''SequenceFile'', which is always splittable.
  * implement [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<K, V>]]. The ''RecordReader'' does the real work -- it is given a file split and reads (key, value) pairs of types (K, V) for as long as there are any.

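For a splittable file, the splitting itself is just arithmetic: the file is cut into consecutive (start, length) ranges. A self-contained sketch of that computation, under simplifying assumptions (''splitSize'' is a fixed parameter here; the real Hadoop ''FileInputFormat'' also consults block locations and min/max split size settings, and represents each range as a ''FileSplit''):

<code java>
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
  // Cut a file of fileLength bytes into ranges of at most splitSize bytes.
  // Each range is returned as {start, length}.
  static List<long[]> computeSplits(long fileLength, long splitSize) {
    List<long[]> splits = new ArrayList<>();
    for (long start = 0; start < fileLength; start += splitSize) {
      splits.add(new long[]{start, Math.min(splitSize, fileLength - start)});
    }
    return splits;
  }

  public static void main(String[] args) {
    for (long[] s : computeSplits(150, 64)) {
      // printed in the same "start-length" format the input pairs use
      System.out.println(s[0] + "-" + s[1]);
    }
  }
}
</code>

Note that the last split is simply shorter than the others; nothing forces splits to align with record boundaries, which is why the ''RecordReader'' must cope with a split starting mid-record.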
Our ''FileAsPathInputFormat'' is simple -- we allow splitting of uncompressed files, and the ''RecordReader'' reads exactly one input pair.
<code java>
public class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
    private Path file;