Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:step-29 [2012/01/29 17:26] straka
  
Every custom format reading keys of type ''K'' and values of type ''V'' must subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat<K, V>]]. Usually it is easier to subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html|FileInputFormat<K, V>]] -- the file listing and splitting is then handled by ''FileInputFormat'' itself.

===== FileAsPathInputFormat =====

We start by creating ''FileAsPathInputFormat'', which reads any file, splits it, and for each split returns exactly one input pair (file_path, start-length) of types (''Text'', ''Text''), where ''file_path'' is the path to the file and ''start-length'' is a string containing two dash-separated numbers: the start offset of the split and the length of the split.

When implementing a new input format, we must
  * decide whether the input files are splittable. Usually uncompressed files are splittable and compressed files are not, with the exception of ''SequenceFile'', which is always splittable.
  * implement [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<K, V>]]. The ''RecordReader'' does the real work -- it is given a file split and reads (key, value) pairs of types (K, V) as long as there are any left.

Our ''FileAsPathInputFormat'' is simple -- we allow splitting of uncompressed files, and the ''RecordReader'' reads exactly one input pair.
<code java>
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
    private Path file;
    private long start, length;
    private Text key, value;

    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      file = split.getPath();
      start = split.getStart();
      length = split.getLength();
      key = null;
      value = null;
    }

    // Produce the single (file_path, "start-length") pair, then report end of input.
    public boolean nextKeyValue() throws IOException {
      if (key != null) return false;

      key = new Text(file.toString());
      value = new Text(String.format("%d-%d", start, length));

      return true;
    }

    public Text getCurrentKey() { return key; }
    public Text getCurrentValue() { return value; }
    public float getProgress() { return (key == null) ? 0 : 1; }
    public synchronized void close() throws IOException {}
  }

  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new FileAsPathRecordReader();
  }

  // Note: the FileInputFormat method to override is isSplitable (with a single 't').
  protected boolean isSplitable(JobContext context, Path filename) {
    CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(filename);
    return codec == null;
  }
}
</code>
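A mapper consuming this format receives the whole split description in the value and must parse the two dash-separated numbers back. A minimal, self-contained sketch of just that parsing step (the class and method names here are illustrative, not part of the tutorial's code; only the ''%d-%d'' encoding comes from ''FileAsPathRecordReader'' above):
<code java>
// Illustrative helper: decode the "start-length" value emitted by the
// record reader above back into the two numbers.
public class SplitValueParser {
  // Returns a two-element array {start, length}.
  public static long[] parse(String value) {
    int dash = value.indexOf('-');
    long start = Long.parseLong(value.substring(0, dash));
    long length = Long.parseLong(value.substring(dash + 1));
    return new long[] { start, length };
  }

  public static void main(String[] args) {
    long[] parsed = parse("1024-4096");
    System.out.println("start=" + parsed[0] + " length=" + parsed[1]);
    // prints: start=1024 length=4096
  }
}
</code>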
  
===== WholeFileInputFormat =====
