===== FileAsPathInputFormat =====

We start by creating ''FileAsPathInputFormat'', which reads any file, splits it and for each split returns exactly one input pair (file_path, start-length) with types (''Text'', ''Text''), where ''file_path'' is the path to the file and ''start-length'' is a string containing two dash-separated numbers: the start offset of the split and the length of the split.
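For example, a 1 MiB uncompressed file ''/data/input.txt'' might be split into two halves, yielding the input pairs (/data/input.txt, 0-524288) and (/data/input.txt, 524288-524288); the path and split size here are illustrative only.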
  
When implementing a new input format, we must
  * decide whether the input files are splittable. Usually uncompressed files are splittable and compressed files are not, with the exception of ''SequenceFile'', which is always splittable.
  * implement [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<K, V>]]. The ''RecordReader'' is the one doing the real work -- it is given a file split and it reads (key, value) pairs of types (K, V) as long as there are any.

Our ''FileAsPathInputFormat'' is simple -- we allow splitting of uncompressed files and the ''RecordReader'' returns exactly one input pair.
<code java>
public class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
  // Helper class which does the actual work -- producing the (path, offset-length) input pair.
  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
    private Path file;
    // ... (remaining fields and RecordReader method implementations omitted) ...
  }

  // Use the helper class as a RecordReader in our file format.
  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new FileAsPathRecordReader();
  }

  // Allow splitting of uncompressed files only. Note that the FileInputFormat
  // method we override is spelled isSplitable, with a single 't'.
  protected boolean isSplitable(JobContext context, Path filename) {
    CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(filename);
    return codec == null;
  }
}
</code>
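
To plug the new format into a job, it is enough to set it with ''setInputFormatClass''. A minimal driver sketch (the ''FileAsPathDemo'' class and its mapper-less configuration are our own illustration, not part of the tutorial):

<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FileAsPathDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "file-as-path-demo");
    job.setJarByClass(FileAsPathDemo.class);
    // Use our new input format; a mapper would declare Mapper<Text, Text, ?, ?>
    // and parse the dash-separated offsets from the value itself.
    job.setInputFormatClass(FileAsPathInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</code>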
===== WholeFileInputFormat =====
  
Next we create ''WholeFileInputFormat'', which for each file returns exactly one input pair (input_path, file_content) with types (''Text'', ''BytesWritable''). The format does not allow file splitting -- each file will be processed by exactly one mapper.
  
<code java>
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {
  // Helper class reading the content of a whole file as a single value.
  public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
    private Path file;
    int length;
    private Text key;
    private BytesWritable value;
    private FSDataInputStream in;

    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
      // ... (obtain the file path and its length from the FileSplit) ...
      key = null;
      value = null;
  
      FileSystem fs = file.getFileSystem(context.getConfiguration());
      in = fs.open(file);
    }
  
    public boolean nextKeyValue() throws IOException {
      if (key != null) return false;
  
      byte[] data = new byte[length];
      in.readFully(data);
      key = new Text(file.toString());
      value = new BytesWritable(data);
  
      return true;
    }

    public Text getCurrentKey() { return key; }
    public BytesWritable getCurrentValue() { return value; }
    public float getProgress() { return key == null ? 0 : 1; }
    public synchronized void close() throws IOException { if (in != null) { in.close(); in = null; } }
  }
  // Do not allow splitting -- each file is read as a whole.
  protected boolean isSplitable(JobContext context, Path filename) {
    return false;
  }

  public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
    return new WholeFileRecordReader();
  }
}
</code>
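
A mapper using ''WholeFileInputFormat'' then receives every file as a single (path, content) pair, so each input file must fit into memory. A possible sketch (the ''FileSizeMapper'' name and its output types are our assumptions):

<code java>
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: for every input file, output (file path, file size in bytes).
public class FileSizeMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {
  public void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
    context.write(key, new IntWritable(value.getLength()));
  }
}
</code>

The driver would again only need to call ''job.setInputFormatClass(WholeFileInputFormat.class)''.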
  
===== Exercise: ParagraphTextInputFormat =====
