MapReduce Tutorial : Custom sorting and grouping comparators.
Custom sorting comparator
The keys are sorted before being processed by a reducer, using a RawComparator. The default comparator uses the compareTo method provided by the key type, which is a subclass of WritableComparable. Consider, for example, the following IntPair type:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public static class IntPair implements WritableComparable<IntPair> {
    private int first = 0;
    private int second = 0;

    public void set(int left, int right) { first = left; second = right; }
    public int getFirst() { return first; }
    public int getSecond() { return second; }

    // Deserialize the pair from the input stream.
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialize the pair to the output stream.
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Sort lexicographically: by the first element, then by the second.
    public int compareTo(IntPair o) {
        if (first != o.first) return first < o.first ? -1 : 1;
        else return second < o.second ? -1 : second == o.second ? 0 : 1;
    }
}
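With this compareTo, the pairs order lexicographically: first by the first element, then by the second. A small illustration (the values here are made up for the example):

IntPair a = new IntPair(); a.set(1, 5);
IntPair b = new IntPair(); b.set(1, 7);
IntPair c = new IntPair(); c.set(2, 0);
// a.compareTo(b) < 0 -- same first element, so 5 < 7 decides
// b.compareTo(c) < 0 -- 1 < 2, the second elements are ignored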
If we would like a Hadoop job to sort the IntPair keys using the first element only, we can provide a RawComparator and set it using job.setSortComparatorClass:
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;

public static class IntPair implements WritableComparable<IntPair> {
    ...

    public static class FirstOnlyComparator implements RawComparator<IntPair> {
        // Compare the serialized keys directly: the first element is
        // the first int of each serialized record.
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int first1 = WritableComparator.readInt(b1, s1);
            int first2 = WritableComparator.readInt(b2, s2);
            return first1 < first2 ? -1 : first1 == first2 ? 0 : 1;
        }

        // Compare deserialized keys, again using only the first element.
        public int compare(IntPair x, IntPair y) {
            return x.getFirst() < y.getFirst() ? -1 : x.getFirst() == y.getFirst() ? 0 : 1;
        }
    }
}
...
job.setSortComparatorClass(IntPair.FirstOnlyComparator.class);
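For completeness, here is a sketch of how the comparator fits into the job configuration; the mapper, reducer and value type names are hypothetical placeholders, not part of this tutorial:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "sort-by-first");
job.setMapperClass(SortMapper.class);      // hypothetical mapper emitting IntPair keys
job.setReducerClass(SortReducer.class);    // hypothetical reducer
job.setMapOutputKeyClass(IntPair.class);
job.setMapOutputValueClass(Text.class);    // assumed value type
job.setSortComparatorClass(IntPair.FirstOnlyComparator.class);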
Notice that we used the helper method readInt from the WritableComparator class, which provides static methods for parsing primitive data types from byte arrays.
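As a quick sanity check of what readInt does: it decodes the big-endian representation that DataOutput.writeInt produced in IntPair.write, so for example:

byte[] buf = new byte[] { 0, 0, 0, 42 };         // big-endian encoding of 42
int value = WritableComparator.readInt(buf, 0);  // value == 42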