Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-29 [2012/01/29 17:23]
straka
courses:mapreduce-tutorial:step-29 [2012/02/05 18:49]
straka
Line 1: Line 1:
-====== MapReduce Tutorial : Custom input formats ======
+====== MapReduce Tutorial : Custom sorting and grouping comparators. ======
  
-Every custom format reading keys of type ''K'' and values of type ''V'' must subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat<K, V>]]. Usually it is easier to subclass [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html|FileInputFormat<K, V>]] -- the file listing and splitting is then solved by the ''FileInputFormat'' itself.
+====== Fast sorting comparator ======
  
-===== WholeFileInputFormat =====
+The keys are sorted before being processed by a reducer, using a
 +[[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/RawComparator.html|raw comparator]]. The default comparator uses the ''compareTo'' method provided by the key type, which implements [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/WritableComparable.html|WritableComparable]]. Consider for example the following ''IntPair'' type:
  
-We start by creating ''WholeFileInputFormat'', which reads any file and returns exactly one input pair (input_path, file_content) with types (''Text'', ''BytesWritable''). The format does not allow file splitting -- each file will be processed by exactly one mapper.
+<code java>
 +public static class IntPair implements WritableComparable<IntPair> {
 +  private int first = 0;
 +  private int second = 0;
 
-The main functionality lies in ''WholeFileRecordReader'', a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<Text, BytesWritable>]].
+  public void set(int left, int right) { first = left; second = right; }
 +  public int getFirst() { return first; }
 +  public int getSecond() { return second; }
 +
 +  // Deserialization: reads the two ints in the order they were written.
 +  public void readFields(DataInput in) throws IOException {
 +    first = in.readInt();
 +    second = in.readInt();
 +  }
 +  // Serialization: first, then second, 4 bytes each.
 +  public void write(DataOutput out) throws IOException {
 +    out.writeInt(first);
 +    out.writeInt(second);
 +  }
 +
 +  // Orders pairs by the first element, then by the second.
 +  public int compareTo(IntPair o) {
 +    if (first != o.first) return first < o.first ? -1 : 1;
 +    else return second < o.second ? -1 : second == o.second ? 0 : 1;
 +  }
 +}
 +</code>
 + 
 +If we would like to sort the ''IntPair'' keys in a Hadoop job using the first element only, we can provide a ''RawComparator'' and set it using [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#setSortComparatorClass(java.lang.Class)|job.setSortComparatorClass]]:
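The revision shown here ends before giving the comparator itself. As a minimal sketch of the idea, and not the tutorial's own code, the block below compares two serialized ''IntPair'' keys by their first element directly on the raw bytes. It assumes the layout produced by ''IntPair.write'' (two big-endian 4-byte ints) and deliberately uses no Hadoop classes so that it runs standalone; the class name ''FirstIntRawCompareSketch'' and its helper methods are hypothetical. In a real job the same logic would sit in a class implementing ''RawComparator<IntPair>'', which also requires the object-level ''compare(IntPair, IntPair)'' inherited from ''java.util.Comparator''.

```java
import java.nio.ByteBuffer;

// Standalone sketch: mirrors the parameter list of
// RawComparator.compare(byte[], int, int, byte[], int, int) without depending on Hadoop.
class FirstIntRawCompareSketch {

  // Produces the same bytes as IntPair.write: two big-endian ints
  // (ByteBuffer is big-endian by default, like DataOutput.writeInt).
  static byte[] serialize(int first, int second) {
    return ByteBuffer.allocate(8).putInt(first).putInt(second).array();
  }

  // Decodes one big-endian int starting at offset start.
  static int readInt(byte[] b, int start) {
    return ((b[start] & 0xff) << 24) | ((b[start + 1] & 0xff) << 16)
         | ((b[start + 2] & 0xff) << 8) | (b[start + 3] & 0xff);
  }

  // Compares only the first int of each serialized key; the second int
  // (bytes s+4 .. s+7) is never read, and no IntPair object is created.
  static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    return Integer.compare(readInt(b1, s1), readInt(b2, s2));
  }

  public static void main(String[] args) {
    byte[] a = serialize(1, 99);
    byte[] b = serialize(2, 0);
    byte[] c = serialize(1, -5);
    System.out.println(compare(a, 0, a.length, b, 0, b.length)); // negative: 1 < 2
    System.out.println(compare(a, 0, a.length, c, 0, c.length)); // 0: first elements are equal
  }
}
```

The comparator is "fast" because it avoids deserializing the keys: it decodes four bytes per key and ignores the rest. Note that keys with equal first elements compare as equal here, so their relative order is left unspecified.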
 + 
 + 
 + 
 +====== Grouping comparator ====== 
 + 
 +In a reducer, it is guaranteed that keys are processed in ascending order. Sometimes it would be useful if the //values associated with one key// could also be processed in ascending order.
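The revision shown here does not yet say how this is achieved. One common approach, given here as an assumption and not as the tutorial's method, is to move the value into the key (for example an ''IntPair'' of original key and value), sort by the whole pair, and group by the first element only, which in Hadoop is configured with ''job.setGroupingComparatorClass''. The following standalone sketch simulates that sort-then-group step in plain Java; ''GroupingSketch'', ''SORT'', ''GROUP'' and ''reduceGroups'' are hypothetical names, and pairs are represented as two-element ''int[]'' arrays.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Standalone simulation of sorting by a full key and grouping by part of it.
class GroupingSketch {

  // Sorting order: first element, then second (same order as IntPair.compareTo).
  static final Comparator<int[]> SORT = (a, b) ->
      a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]);

  // Grouping order: first element only, so pairs that share it form one group.
  static final Comparator<int[]> GROUP = (a, b) -> Integer.compare(a[0], b[0]);

  // Returns, for each group of pairs with the same first element,
  // the list of second elements in the order a reducer would see them.
  static List<List<Integer>> reduceGroups(List<int[]> pairs) {
    List<int[]> sorted = new ArrayList<>(pairs);
    sorted.sort(SORT);
    List<List<Integer>> groups = new ArrayList<>();
    int[] previous = null;
    for (int[] pair : sorted) {
      // A new group starts whenever the grouping comparator sees a difference.
      if (previous == null || GROUP.compare(previous, pair) != 0) {
        groups.add(new ArrayList<>());
      }
      groups.get(groups.size() - 1).add(pair[1]);
      previous = pair;
    }
    return groups;
  }

  public static void main(String[] args) {
    List<int[]> pairs = Arrays.asList(
        new int[] {2, 7}, new int[] {1, 9}, new int[] {2, 3}, new int[] {1, 4});
    System.out.println(reduceGroups(pairs)); // prints [[4, 9], [3, 7]]
  }
}
```

Because the sort uses both elements while the grouping uses only the first, each group arrives with its values already in ascending order and nothing has to be buffered or sorted inside the reducer.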
 + 
 +---- 
 + 
 +<html> 
 +<table style="width:100%"> 
 +<tr> 
 +<td style="text-align:left; width: 33%; "></html>[[step-28|Step 28]]: Custom data types.<html></td> 
 +<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td> 
 +<td style="text-align:right; width: 33%; "></html>[[step-30|Step 30]]: Custom input formats.<html></td> 
 +</tr> 
 +</table> 
 +</html>
  
