Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-29 [2012/01/29 17:26]
straka
+++ courses:mapreduce-tutorial:step-29 [2012/01/30 00:52]
straka
@@ Line 3: / Line 3: @@
 Every custom format reading keys of type ''K'' and values of type ''V'' must subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat<K, V>]]. Usually it is easier to subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html|FileInputFormat<K, V>]] -- file listing and splitting are then handled by ''FileInputFormat'' itself.
-===== WholeFileInputFormat =====
+===== FileAsPathInputFormat =====
-We start by creating ''WholeFileInputFormat'', which reads any file and return exactly one input pair (input_path, file_content) with types (''Text'', ''BytesWritable''). The format does not allow file splitting -- each file will be processed by exactly one mapper.
+We start by creating ''FileAsPathInputFormat'', which reads any file, splits it, and for each split returns exactly one input pair (file_path, start-length) with types (''Text'', ''Text''), where ''file_path'' is the path to the file and ''start-length'' is a string containing two dash-separated numbers: the start offset of the split and the length of the split.
+When implementing a new input format, we must:
+  * decide whether the input files are splittable. Usually uncompressed files are splittable and compressed files are not, with the exception of ''SequenceFile'', which is always splittable.
+  * implement [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<K, V>]]. The ''RecordReader'' is the one doing the real work -- it is given a file split and reads (key, value) pairs of types (K, V) for as long as any remain.
+Our ''FileAsPathInputFormat'' is simple -- we allow splitting of uncompressed files and the ''RecordReader'' reads exactly one input pair.
+<code java>
+public class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
+  // Helper class, which does the actual work -- produce the (path, offset-length) input pair.
+  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
+    private Path file;
+    long start, length;
+    private Text key, value;
+    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
+      FileSplit split = (FileSplit) genericSplit;
+      file = split.getPath();
+      start = split.getStart();
+      length = split.getLength();
+      key = null;
+      value = null;
+    }
+    public boolean nextKeyValue() throws IOException {
+      if (key != null) return false;
+      key = new Text(file.toString());
+      value = new Text(String.format("%d-%d", start, length));
+      return true;
+    }
+    public Text getCurrentKey() { return key; }
+    public Text getCurrentValue() { return value; }
+    public float getProgress() { return (key == null) ? 0 : 1; }
+    public synchronized void close() throws IOException {}
+  }
+  // Use the helper class as a RecordReader in our file format.
+  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
+    return new FileAsPathRecordReader();
+  }
+  // Allow splitting uncompressed files. Hadoop spells the method isSplitable, with a single 't'.
+  @Override
+  protected boolean isSplitable(JobContext context, Path filename) {
+    CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(filename);
+    return codec == null;
+  }
+}
+</code>
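A mapper consuming this format receives the split range as text and has to turn it back into numbers. A minimal sketch of that parsing step, in plain Java without any Hadoop types; ''SplitRange'' and its methods are hypothetical names introduced here for illustration, not part of the tutorial or of Hadoop:

```java
// Hypothetical helper: decodes the "start-length" value produced by
// String.format("%d-%d", start, length) in FileAsPathRecordReader.
final class SplitRange {
  final long start;   // offset of the first byte of the split
  final long length;  // number of bytes in the split

  SplitRange(long start, long length) {
    this.start = start;
    this.length = length;
  }

  // Offset of the first byte after the split.
  long end() {
    return start + length;
  }

  // Both numbers are non-negative, so the first dash is the separator.
  static SplitRange parse(String value) {
    int dash = value.indexOf('-');
    if (dash <= 0 || dash == value.length() - 1) {
      throw new IllegalArgumentException("Expected start-length, got: " + value);
    }
    long start = Long.parseLong(value.substring(0, dash));
    long length = Long.parseLong(value.substring(dash + 1));
    return new SplitRange(start, length);
  }
}
```

In a mapper, the call would be along the lines of ''SplitRange.parse(value.toString())'', where ''value'' is the ''Text'' value of the input pair.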
+===== WholeFileInputFormat =====
-The main functionality lays in ''WholeFileRecordReader'', a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<Text, BytesWritable]].
+Next we create ''WholeFileInputFormat'', which for each file returns exactly one input pair (input_path, file_content) with types (''Text'', ''BytesWritable''). The format does not allow file splitting -- each file will be processed by exactly one mapper.
 <code java>
@@ Line 15: / Line 66: @@
     private Path file;
     int length;
-    private boolean value_read;
     private Text key;
     private BytesWritable value;
@@ Line 26: / Line 76: @@
       key = null;
       value = null;
-      value_read = false;
       FileSystem fs = file.getFileSystem(context.getConfiguration());
@@ Line 38: / Line 87: @@
     public boolean nextKeyValue() throws IOException {
-      if (value_read) return false;
+      if (key != null) return false;
       byte[] data = new byte[length];
@@ Line 45: / Line 94: @@
       key = new Text(file.toString());
       value = new BytesWritable(data);
-      value_read = true;
       return true;
@@ Line 52: / Line 100: @@
     public Text getCurrentKey() { return key; }
     public BytesWritable getCurrentValue() { return value; }
-    public float getProgress() { return value_read ? 0 : 1; }
+    public float getProgress() { return key == null ? 0 : 1; }
     public synchronized void close() throws IOException { if (in != null) { in.close(); in = null; } }
   }
@@ Line 66: / Line 114: @@
   }
 }
 </code>
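The hunks above do not show how the ''data'' buffer is filled. A single ''read'' call on a stream may return fewer bytes than requested, so the whole file has to be read in a loop or with a ''readFully''-style call. The following is a sketch using only ''java.io''; ''WholeStreamReader'' is a hypothetical name and the tutorial's actual code may do this differently:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: reads a known number of bytes from a stream.
final class WholeStreamReader {
  // Returns exactly `length` bytes read from `in`.
  // Throws java.io.EOFException if the stream ends before that.
  // The stream is left open; the caller remains responsible for closing it.
  static byte[] readExactly(InputStream in, int length) throws IOException {
    byte[] data = new byte[length];
    new DataInputStream(in).readFully(data);
    return data;
  }
}
```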
+===== Exercise: ParagraphTextInputFormat =====
+Implement ''ParagraphTextInputFormat''. This format reads plain text files and splits them into //paragraphs//. A paragraph consists of lines, all of which are nonempty, and different paragraphs are separated by at least one empty line. The ''ParagraphTextInputFormat'' reads one paragraph at a time and returns its first line as the key and the remaining lines as the value.
+The ''ParagraphTextInputFormat'' should allow splitting of uncompressed files. Be careful to correctly read paragraphs that cross a split boundary. The easiest way of doing so is the following:
+  * if the offset of the split is 0, start reading at the beginning of the split. If the offset of the split is larger than 0, start reading at the offset and ignore the first paragraph found.
+  * read all paragraphs that start before the end of the split, even if they end after it. //If a paragraph starts just after the current split (i.e., on the split boundary), read it too.//
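One way to check an implementation of these two rules is to compare it against a reference that sees the whole file at once. The sketch below is such a reference, not a ''RecordReader'': it reads the rules as "a paragraph whose first line begins at offset ''p'' belongs to the split with ''start < p <= start + length'', and the first split also owns ''p = 0''". That reading, the class name ''ParagraphOwnership'', and the assumptions of ASCII text with ''\n'' line ends (so character offsets equal byte offsets) are all choices made here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reference oracle for testing a paragraph reader.
final class ParagraphOwnership {
  // Returns the paragraphs that the split [start, start + length) should emit,
  // each paragraph as its lines joined by '\n'.
  static List<String> paragraphsForSplit(String text, long start, long length) {
    long end = start + length;
    List<String> owned = new ArrayList<>();
    StringBuilder paragraph = new StringBuilder();
    long paragraphStart = -1; // offset of the current paragraph's first line, -1 if none
    int pos = 0;
    while (pos < text.length()) {
      int newline = text.indexOf('\n', pos);
      int lineEnd = (newline < 0) ? text.length() : newline;
      String line = text.substring(pos, lineEnd);
      if (line.isEmpty()) {
        // An empty line closes the current paragraph, if there is one.
        if (paragraphStart >= 0 && owns(paragraphStart, start, end)) {
          owned.add(paragraph.toString());
        }
        paragraph.setLength(0);
        paragraphStart = -1;
      } else {
        if (paragraphStart < 0) {
          paragraphStart = pos;
        } else {
          paragraph.append('\n');
        }
        paragraph.append(line);
      }
      pos = lineEnd + 1;
    }
    // The last paragraph may end at end of file without a trailing empty line.
    if (paragraphStart >= 0 && owns(paragraphStart, start, end)) {
      owned.add(paragraph.toString());
    }
    return owned;
  }

  // The first split owns offset 0; every split owns offsets in (start, end].
  private static boolean owns(long p, long start, long end) {
    boolean afterStart = (start == 0) ? p >= 0 : p > start;
    return afterStart && p <= end;
  }
}
```

Because the ranges ''(start, end]'' of consecutive splits do not overlap and together cover the file, every paragraph is assigned to exactly one split under this reading, including when a split offset falls on an empty line between paragraphs.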


Institute of Formal and Applied Linguistics Wiki
