
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-29 [2012/01/29 17:26]
straka
courses:mapreduce-tutorial:step-29 [2012/01/29 17:40]
straka
Line 2:
  
 Every custom format reading keys of type ''K'' and values of type ''V'' must subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat<K, V>]]. Usually it is easier to subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html|FileInputFormat<K, V>]] -- file listing and splitting are then handled by ''FileInputFormat'' itself.
 +
 +===== FileAsPathInputFormat =====
 +
 +We start by creating ''FileAsPathInputFormat'', which reads any file, splits it, and for each split returns exactly one input pair (file_path, start-length) with types (''Text'', ''Text''), where ''file_path'' is the path to the file and ''start-length'' is a string containing two dash-separated numbers: the start offset of the split and the length of the split.
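To make the ''start-length'' value format concrete, here is a minimal sketch of how a consumer (for example a mapper) might decode it. The class ''SplitValueDemo'' and the helper ''parseStartLength'' are hypothetical names introduced only for this illustration; they are not part of the tutorial or of Hadoop. The sketch assumes both numbers are non-negative, so the first dash is always the separator.

```java
public class SplitValueDemo {
    /**
     * Hypothetical helper: parses a "start-length" value such as "1024-512"
     * into a two-element array {start, length}.
     * Assumes both numbers are non-negative, so the first '-' is the separator.
     */
    static long[] parseStartLength(String value) {
        int dash = value.indexOf('-');
        if (dash <= 0 || dash == value.length() - 1) {
            throw new IllegalArgumentException("Expected <start>-<length>, got: " + value);
        }
        long start = Long.parseLong(value.substring(0, dash));
        long length = Long.parseLong(value.substring(dash + 1));
        return new long[] { start, length };
    }

    public static void main(String[] args) {
        // Build the value the same way the format does: two numbers joined by a dash.
        String value = String.format("%d-%d", 1024L, 512L);
        long[] parsed = parseStartLength(value);
        System.out.println(value + " -> start=" + parsed[0] + ", length=" + parsed[1]);
    }
}
```

Inside a real mapper the string would come from the ''Text'' value via ''toString()''; the parsing itself would be the same.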
 +
 +When 
 +
 +<code java>
 +public static class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
 +  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
 +    private Path file;
 +    long start, length;
 +    private Text key, value;
 +    
 +    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
 +      FileSplit split = (FileSplit) genericSplit;
 +      file = split.getPath();
 +      start = split.getStart();
 +      length = split.getLength();
 +      key = null;   
 +      value = null; 
 +    }               
 +    public boolean nextKeyValue() throws IOException {
 +      if (key != null) return false;
 +                    
 +      key = new Text(file.toString());
 +      value = new Text(String.format("%d-%d", start, length));
 +                    
 +      return true;  
 +    }               
 +                    
 +    public Text getCurrentKey() { return key; }
 +    public Text getCurrentValue() { return value; }
 +    public float getProgress() { return (key == null) ? 0 : 1; }
 +    public synchronized void close() throws IOException {}
 +  }
 +      
 +  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
 +    return new FileAsPathRecordReader();
 +  }   
 +      
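 +  // Note the spelling: FileInputFormat declares isSplitable (single "t"); any other name is not an override.
 +  protected boolean isSplitable(JobContext context, Path filename) {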
 +    CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(filename);
 +    return codec == null;
 +  }
 +}
 +</code>
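The record reader above yields exactly one pair per split: the first call to ''nextKeyValue'' fills in the key and value and returns ''true'', and every later call returns ''false''. The following sketch imitates that behaviour without any Hadoop dependency, so it can be run on its own. ''OneShotReader'', ''drain'' and the example path are hypothetical stand-ins that use plain strings instead of ''Text''; the loop is only an approximation of how a framework might consume a reader, not Hadoop's actual driver code.

```java
import java.util.ArrayList;
import java.util.List;

public class OneShotReaderDemo {
    /** Hypothetical stand-in for the record reader, using String instead of Text. */
    static class OneShotReader {
        private final String file;
        private final long start, length;
        private String key, value;

        OneShotReader(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }

        /** Returns true exactly once; afterwards the key is set and we report exhaustion. */
        boolean nextKeyValue() {
            if (key != null) return false;
            key = file;
            value = String.format("%d-%d", start, length);
            return true;
        }

        String getCurrentKey() { return key; }
        String getCurrentValue() { return value; }
    }

    /** Reads pairs until the reader reports that it is exhausted. */
    static List<String> drain(OneShotReader reader) {
        List<String> pairs = new ArrayList<>();
        while (reader.nextKeyValue()) {
            pairs.add(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The path below is made up for the example.
        OneShotReader reader = new OneShotReader("hdfs://example/input.txt", 0L, 4096L);
        for (String pair : drain(reader)) {
            System.out.println(pair);
        }
    }
}
```

Using the key itself as the "already emitted" flag keeps the reader free of extra state, which is the same trick the ''key != null'' check in the tutorial code relies on.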
  
 ===== WholeFileInputFormat =====
