Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-29 [2012/01/30 00:44]
straka
+++ courses:mapreduce-tutorial:step-29 [2012/02/05 18:49]
straka
@@ Line 1: / Line 1: @@
-====== MapReduce Tutorial : Custom input formats ======
+====== MapReduce Tutorial : Custom sorting and grouping comparators. ======
-Every custom format reading keys of type ''K'' and values of type ''V'' must subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat<K, V>]]. Usually it is easier to subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html|FileInputFormat<K, V>]] -- the file listing and splitting is then solved by the ''FileInputFormat'' itself.
+====== Fast sorting comparator ======
-===== FileAsPathInputFormat =====
+The keys are sorted before processed by a reducer, using a
+[[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/RawComparator.html|Raw comparator]]. The default comparator uses the [[compareTo]] method provided by the key type, which is a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/WritableComparable.html|WritableComparable]]. Consider for example the following ''IntPair'' type:
-We start by creating ''FileAsPathInputFormat'', which reads any file, splits it and for each split return exactly one input pair (file_path, start-length) with types (''Text'', ''Text''), where ''file_path'' is path to the file and ''start-length'' is a string containing two dash-separated numbers: start offset of the split and length of the split.
+<code java>
+public static class IntPair implements WritableComparable<IntPair> {
+  private int first = 0;
+  private int second = 0;
-When implementing new input format, we must
+  public void set(int left, int right) { first = left; second = right; }
-  * decide whether the input files are splittable. Usually uncompressed files are splittable and compressed files are not splittable, with the exception of ''SequenceFile'', which is always splittable.
+  public int getFirst() { return first; }
-  * implement [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<K, V>]]. The ''RecordReader'' is the one doing the real work -- it is given a file split and it reads (key, value) pairs of types (K, V), until there are any.
+  public int getSecond() { return second; }
-Our ''FileAsPathInputFormat'' is simple -- we allow splitting of uncompressed file and the ''RecordReader'' reads exactly one input pair.
+  public void readFields(DataInput in) throws IOException {
-<code java>
+    first = in.readInt();
-public class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
+    second = in.readInt();
-  // Helper class, which does the actual work -- produce the (path, offset-length) input pair.
-  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
-    private Path file;
-    long start, length;
-    private Text key, value;
-    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
-      FileSplit split = (FileSplit) genericSplit;
-      file = split.getPath();
-      start = split.getStart();
-      length = split.getLength();
-      key = null;
-      value = null;
-    }
-    public boolean nextKeyValue() throws IOException {
-      if (key != null) return false;
-      key = new Text(file.toString());
-      value = new Text(String.format("%d-%d", start, length));
-      return true;
-    }
-    public Text getCurrentKey() { return key; }
-    public Text getCurrentValue() { return value; }
-    public float getProgress() { return (key == null) ? 0 : 1; }
-    public synchronized void close() throws IOException {}
   }
+  public void write(DataOutput out) throws IOException {
-  // Use the helper class as a RecordReader in out file format.
+    out.writeInt(first);
-  public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
+    out.writeInt(second);
-    return new FileAsPathRecordReader();
+  }
-  }
+  public int compareTo(IntPair o) {
-  // Allow splitting uncompressed files.
+    if (first != o.first) return first < o.first ? -1 : 1;
-  protected boolean isSplittable(JobContext context, Path filename) {
+    else return second < o.second ? -1 : second == o.second ? 0 : 1;
-    CompressionCodec codec = new CompressionCodecFactory(context.getConfiguration()).getCodec(filename);
-    return codec == null;
   }
 }
 </code>
-===== WholeFileInputFormat =====
+If we would like in a Hadoop job to sort the ''IntPair'' using the first element only, we can provide a ''RawComparator'' and set it using [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#setSortComparatorClass(java.lang.Class)|job.setSortComparatorClass]]:
-Next we create ''WhileFileInputFormat'', which for each file return exactly one input pair (input_path, file_content) with types (''Text'', ''BytesWritable''). The format does not allow file splitting -- each file will be processed by exactly one mapper.
-<code java>
-public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {
-  // Helper class, which does the actual work -- reads the (path, content) input pair.
-  public static class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
-    private Path file;
-    int length;
-    private Text key;
-    private BytesWritable value;
-    DataInputStream in;
-    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
+====== Grouping comparator ======
-      FileSplit split = (FileSplit) genericSplit;
-      file = split.getPath();
-      length = (int) split.getLength();
-      key = null;
-      value = null;
-      FileSystem fs = file.getFileSystem(context.getConfiguration());
+In a reduce, it is guaranteed that keys are processed in ascending order. Sometimes it would be useful if the //values associated with one key// could also be processed in ascending order.
-      in = fs.open(split.getPath());
-      CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(context.getConfiguration());
+----
-      CompressionCodec codec = compressionCodecs.getCodec(file);
-      if (codec != null)
-        in = new DataInputStream(codec.createInputStream(in));
-    }
-    public boolean nextKeyValue() throws IOException {
+<html>
-      if (key != null) return false;
+<table style="width:100%">
+<tr>
-      byte[] data = new byte[length];
+<td style="text-align:left; width: 33%; "></html>[[step-28|Step 28]]: Custom data types.<html></td>
-      in.readFully(data);
+<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
+<td style="text-align:right; width: 33%; "></html>[[step-30|Step 30]]: Custom input formats.<html></td>
-      key = new Text(file.toString());
+</tr>
-      value = new BytesWritable(data);
+</table>
+</html>
-      return true;
-    }
-    public Text getCurrentKey() { return key; }
-    public BytesWritable getCurrentValue() { return value; }
-    public float getProgress() { return key == null ? 0 : 1; }
-    public synchronized void close() throws IOException { if (in != null) { in.close(); in = null; } }
-  }
-  // Use the helper class as a RecordReader in out file format.
-  public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
-    return new WholeFileRecordReader();
-  }
-  // Do not allow splitting.
-  protected boolean isSplittable(JobContext context, Path filename) {
-    return false;
-  }
-}
-</code>
-===== Exercise: ParagraphTextInputFormat =====

[ Back to the navigation ] [ Back to the content ]

Institute of Formal and Applied Linguistics Wiki

Differences