[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revision Both sides next revision
courses:mapreduce-tutorial:step-29 [2012/02/05 18:54]
straka
courses:mapreduce-tutorial:step-29 [2012/02/05 19:13]
straka
Line 57: Line 57:
  
 In a reduce, it is guaranteed that keys are processed in ascending order. Sometimes it would be useful if the //values associated with one key// could also be processed in ascending order. In a reduce, it is guaranteed that keys are processed in ascending order. Sometimes it would be useful if the //values associated with one key// could also be processed in ascending order.
 +
 +That is possible only to some degree. The (key, value) pairs are compared //using the key only//. After the (key, value) pairs are sorted, the (key, value) pairs with the same key are grouped together. This grouping can be performed using a custom ''RawComparator'' -- it is therefore possible to group the input pairs using //only a part of the keys//. The custom grouping comparator can be specified using [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#setGroupingComparatorClass(java.lang.Class)|job.setGroupingComparatorClass]].
 +
 +As an example, consider that the input consists of (''IntWritable'', ''IntWritable'') pairs. We would like to perform a Hadoop job with these pairs, such that the values belonging to one key are sorted before processed by a reducer.
 +  - The mapper produces (''IntPair'', ''IntWritable'') pairs. Notice that the key now consists of both numbers.
 +  - These pairs are sorted by the ''IntPair'' keys -- i.e., by both numbers.
 +  - The custom grouping comparator is used, which groups the ''IntPair'' keys using the first element only (using the ''RawComparator'' from the previous section):
 +<code java>
 +public static class IntPair implements WritableComparable<IntPair> {
 +  ...
 +  public static class FirstOnlyComparator implements RawComparator<IntPair> {
 +    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
 +      int first1 = WritableComparator.readInt(b1, s1);
 +      int first2 = WritableComparator.readInt(b2, s2);
 +      return first1 < first2 ? -1 : first1 == first2 ? 0 : 1;
 +    }
 +    public int compare(IntPair x, IntPair y) {
 +      return x.getFirst() < y.getFirst() ? -1 : x.getFirst() == y.getFirst() ? 0 : 1;
 +    }
 +  }
 +}
 +
 +...
 +job.setGroupingComparatorClass(IntPair.FirstOnlyComparator.class);
 +</code>
 +
 +====== Exercise ======
 +
 +Improve the [[.:step-28#exercise-1|inverted index exercise]] from the previous step to create for each word a //sorted// list of ''DocWithOccurrences<Text>''.
 +
 +Use the same approach as with the ''IntPair'' -- create a type ''TextPair'', which stores two values of type ''Text'' and let the mapper create ''(TextPair, DocWIthOccurrences<Text>'' pairs, where the ''TextPair'' contains the word and then the document. Provide a ''FirstOnlyComparator'' which compares two ''TextPair''s using only the word (hint: use [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/Text.Comparator.html#compare(byte[],%20int,%20int,%20byte[],%20int,%20int)|Text.Comparator.compare]] when defining the byte version ''FirstOnlyComparator.compare'').
  
 ---- ----

[ Back to the navigation ] [ Back to the content ]