Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-29 [2012/01/29 17:26]
straka
courses:mapreduce-tutorial:step-29 [2012/01/29 17:40]
straka
Line 2:
  
 Every custom format reading keys of type ''K'' and values of type ''V'' must subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat<K, V>]]. Usually it is easier to subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html|FileInputFormat<K, V>]] -- file listing and splitting are then handled by ''FileInputFormat'' itself.
 +
 +===== FileAsPathInputFormat =====
 +
 +We start by creating ''FileAsPathInputFormat'', which reads any file, splits it and for each split returns exactly one input pair (''file_path'', ''start-length'') with types (''Text'', ''Text''), where ''file_path'' is the path to the file and ''start-length'' is a string containing two dash-separated numbers: the start offset of the split and the length of the split.
 +
 +<code java>
 +public static class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
 +  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
 +    private Path file;
 +    long start, length;
 +    private Text key, value;
 +    
 +    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
 +      FileSplit split = (FileSplit) genericSplit;
 +      file = split.getPath();
 +      start = split.getStart();
 +      length = split.getLength();
 +      key = null;   
 +      value = null; 
 +    }               
 +    public boolean nextKeyValue() throws IOException {
 +      if (key != null) return false;
 +                    
 +      key = new Text(file.toString());
 +      value = new Text(String.format("%d-%d", start, length));
 +                    
 +      return true;  
 +    }               
 +                    
 +    public Text getCurrentKey() { return key; }
 +    public Text getCurrentValue() { return value; }
 +    public float getProgress() { return (key == null) ? 0 : 1; }
 +    public synchronized void close() throws IOException {}
 +  }
 +      
 +  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
 +    return new FileAsPathRecordReader();
 +  }   
 +      
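 +  // Overrides FileInputFormat.isSplitable (spelled with one "t"): only uncompressed files are split.
 +  protected boolean isSplitable(JobContext context, Path filename) {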
 +    CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(filename);
 +    return codec == null;
 +  }
 +}
 +</code>
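
A consumer of these pairs (for example a mapper) has to decode the ''start-length'' value again. The following is a minimal sketch of such decoding in plain Java, without any Hadoop dependency. The class ''SplitRange'' and its methods are hypothetical helpers introduced only for illustration; they are not part of the tutorial code or of Hadoop. The sketch assumes the value was produced by ''String.format("%d-%d", start, length)'' with non-negative numbers, so the first dash is the separator.

```java
// Hypothetical helper, not part of Hadoop: decodes the "start-length" value
// emitted by FileAsPathRecordReader.
final class SplitRange {
    final long start;   // byte offset where the split begins
    final long length;  // number of bytes in the split

    SplitRange(long start, long length) {
        this.start = start;
        this.length = length;
    }

    // Parses a string of the form "<start>-<length>".
    // Assumes both numbers are non-negative, so the first '-' is the separator.
    static SplitRange parse(String value) {
        int dash = value.indexOf('-');
        if (dash <= 0 || dash == value.length() - 1) {
            throw new IllegalArgumentException("Expected <start>-<length>, got: " + value);
        }
        long start = Long.parseLong(value.substring(0, dash));
        long length = Long.parseLong(value.substring(dash + 1));
        return new SplitRange(start, length);
    }

    // Offset of the first byte after the split.
    long end() {
        return start + length;
    }

    public static void main(String[] args) {
        SplitRange range = parse("128-64");
        System.out.println(range.start + " " + range.length + " " + range.end());
    }
}
```

In a real job the string would come from ''value.toString()'' on the ''Text'' value; the path to open is the key of the same pair.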
  
 ===== WholeFileInputFormat =====
