Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-28 [2012/01/31 12:40]
straka
+++ courses:mapreduce-tutorial:step-28 [2012/02/05 19:10] (current)
straka
@@ Line 1: / Line 1: @@
-====== MapReduce Tutorial : Running multiple Hadoop jobs in one class ======
+====== MapReduce Tutorial : Custom data types ======
-The Java API offers possibility to submit multiple Hadoop job in one class. A job can be submitted either using
+An important feature of the Java API is that custom data and format types can be provided. In this step we implement two custom data types.
-  * [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#waitForCompletion(boolean)|job.waitForCompletion]] -- the job is submitted and the method waits for it to finish (successfully or not).
-  * [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#submit()|job.submit]] -- the job is submitted and the method immediately returns. In this case, the state of the submitted job can be accessed using [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#isComplete()|job.isComplete]] and [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#isSuccessful()|job.isSuccessful]]
+===== BERIntWritable =====
+We want to implement ''BERIntWritable'', which is an ''int'' stored in the format produced by Perl's ''pack "w", $num'' (a BER compressed integer). Quoting the Perl documentation: //The bytes represent an unsigned integer in base 128, most significant digit first, with as few digits as possible. Bit eight (the high bit) is set on each byte except the last.//
+The new class must implement the [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/Writable.html|Writable]] interface, i.e., methods ''readFields'' and ''write'':
+<code java>
+public class BERIntWritable implements Writable {
+  private int value;
+  public void readFields(DataInput in) throws IOException {
+    value = 0;
+    byte next;
+    while (((next = in.readByte()) & 0x80) != 0) {
+      value = (value << 7) | (next & 0x7F);
+    }
+    value = (value << 7) | next;
+  }
+  public void write(DataOutput out) throws IOException {
+    int mask_shift = 28;
+    while (mask_shift > 0 && (value & (0x7F << mask_shift)) == 0) mask_shift -= 7;
+    while (mask_shift > 0) {
+      out.writeByte(0x80 | ((value >>> mask_shift) & 0x7F)); // unsigned shift, so no sign bits leak into the top digit
+      mask_shift -= 7;
+    }
+    out.writeByte(value & 0x7F);
+  }
+</code>
+Accessor methods ''get'' and ''set'' are needed in order to work with the value. We also override ''toString'', which Hadoop uses when writing to plain text files.
+<code java>
+  public int get() { return value; }
+  public void set(int value) { this.value = value; }
+  public String toString() { return String.valueOf(value); }
+}
+</code>
+Remark: If the ''BERIntWritable'' class is not declared top-level, it must be declared **''static''**.
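The byte layout described above can be checked without a Hadoop cluster. The following is a minimal sketch, assuming the same loops as in ''BERIntWritable'' but moved into plain static methods; the class name ''BerRoundTrip'' and its ''encode''/''decode'' helpers are hypothetical and not part of the tutorial. For example, 300 is 2*128 + 44, so it should be encoded as the two bytes 0x82 0x2C.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper class (not part of Hadoop or the tutorial):
// exercises the BER encoding on in-memory byte arrays.
class BerRoundTrip {
  // Encode value as base-128 digits, most significant first,
  // with the high bit set on every byte except the last.
  static byte[] encode(int value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      int shift = 28;
      // Skip leading zero digits so that as few bytes as possible are written.
      while (shift > 0 && (value & (0x7F << shift)) == 0) shift -= 7;
      while (shift > 0) {
        out.writeByte(0x80 | ((value >>> shift) & 0x7F));
        shift -= 7;
      }
      out.writeByte(value & 0x7F);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Decode a single BER compressed integer from the start of data.
  static int decode(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      int value = 0;
      byte next;
      while (((next = in.readByte()) & 0x80) != 0) {
        value = (value << 7) | (next & 0x7F);
      }
      return (value << 7) | next;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    for (int v : new int[] {0, 127, 128, 300, 16384, Integer.MAX_VALUE}) {
      byte[] data = encode(v);
      System.out.println(v + " -> " + data.length + " byte(s), decoded " + decode(data));
    }
  }
}
```

Values up to 127 fit into a single byte, which is the main attraction of this format for small counts and positions.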
+Such an implementation can be used as a type of //values//. If we want to use it as a type of //keys//, we need to implement [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/io/WritableComparable.html|WritableComparable]] instead of just ''Writable''. It is enough to add a ''compareTo'' method to the current implementation:
+<code java>
+public class BERIntWritable implements WritableComparable<BERIntWritable> {
+  ... //Same as before
+  // Order keys by their integer value.
+  public int compareTo(BERIntWritable other) {
+    int otherValue = other.get();
+    return value < otherValue ? -1 : (value == otherValue ? 0 : 1);
+  }
+}
+</code>
+===== PairWritable<A, B> =====
+As another example, we implement a type consisting of two user-defined ''Writable'' implementations:
+<code java>
+public static class PairWritable<A extends Writable, B extends Writable > implements Writable {
+  private A first;
+  private B second;
+  public void readFields(DataInput in) throws IOException {
+    first.readFields(in);
+    second.readFields(in);
+  }
+  public void write(DataOutput out) throws IOException {
+    first.write(out);
+    second.write(out);
+  }
+  public A getFirst() { return first; }
+  public B getSecond() { return second; }
+  public void setFirst(A first) { this.first = first; }
+  public void setSecond(B second) { this.second = second; }
+  public String toString() { return String.format("%s %s", first.toString(), second.toString()); }
+  public PairWritable(A first, B second) { this.first = first; this.second = second; }
+}
+</code>
+Remark: If the ''PairWritable'' class is not declared top-level, it must be declared **''static''**.
+We did not define a ''compareTo'' method. Doing so would require the types ''A'' and ''B'' to implement ''WritableComparable'', and ''PairWritable'' could then no longer be used with types that do not provide ''compareTo''. Probably the best way of solving this issue is to create a new type ''PairWritableComparable<A, B>'' which implements ''WritableComparable'':
+<code java>
+public static class PairWritableComparable<A extends WritableComparable, B extends WritableComparable > implements WritableComparable {
+  private A first;
+  private B second;
+  public void readFields(DataInput in) throws IOException {
+    first.readFields(in);
+    second.readFields(in);
+  }
+  public void write(DataOutput out) throws IOException {
+    first.write(out);
+    second.write(out);
+  }
+  public int compareTo(Object other) {
+    PairWritableComparable<A, B> otherPair = (PairWritableComparable<A, B>) other;
+    int cmpFirst = first.compareTo(otherPair.getFirst());
+    if (cmpFirst < 0) return -1;
+    if (cmpFirst > 0) return 1;
+    return second.compareTo(otherPair.getSecond());
+  }
+  public A getFirst() { return first; }
+  public B getSecond() { return second; }
+  public void setFirst(A first) { this.first = first; }
+  public void setSecond(B second) { this.second = second; }
+  public String toString() { return String.format("%s %s", first.toString(), second.toString()); }
+  public PairWritableComparable(A first, B second) { this.first = first; this.second = second; }
+}
+</code>
+Remark: If the ''PairWritableComparable'' class is not declared top-level, it must be declared **''static''**.
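The ordering implemented by ''compareTo'' above is lexicographic: the first components decide, and the second components are consulted only on a tie. A minimal sketch of the same idea using only ''java.lang.Comparable'', so that it runs without Hadoop; the class ''OrderedPair'' is hypothetical and stands in for ''PairWritableComparable'':

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for PairWritableComparable, without the
// Hadoop serialization methods: only the ordering is shown.
class OrderedPair<A extends Comparable<A>, B extends Comparable<B>>
    implements Comparable<OrderedPair<A, B>> {
  final A first;
  final B second;

  OrderedPair(A first, B second) {
    this.first = first;
    this.second = second;
  }

  @Override
  public int compareTo(OrderedPair<A, B> other) {
    // The first component decides unless both are equal.
    int cmpFirst = first.compareTo(other.first);
    if (cmpFirst != 0) return cmpFirst;
    return second.compareTo(other.second);
  }

  @Override
  public String toString() {
    return first + " " + second;
  }

  public static void main(String[] args) {
    List<OrderedPair<String, Integer>> pairs = new ArrayList<>();
    pairs.add(new OrderedPair<>("b", 1));
    pairs.add(new OrderedPair<>("a", 2));
    pairs.add(new OrderedPair<>("a", 1));
    Collections.sort(pairs);
    System.out.println(pairs); // prints [a 1, a 2, b 1]
  }
}
```

Using the type parameter of ''Comparable'' instead of ''Object'' lets the compiler reject comparisons between unrelated pair types, which the raw-typed version can only detect at run time through a failed cast.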
 ===== Exercise 1 =====
-Improve the last [[.:step-27#exercise|inverted index creation exercise]], such that
+Imagine you want to create an inverted index. In the index, for each word and document containing the word, all positions of the word in the document have to be stored.
-  - in the first job, create a list of unique document names. Number the documents using the order in this list.
-  - in the second job, create for each word sorted list of ''DocWithOccurences<IntWritable>'', where the document is identified by its number (contrary to the previous exercise, where ''Text'' was used to identify the document).
-===== Exercise 2 =====
+Create a type ''DocWithOccurrences<Doctype extends WritableComparable>'' implementing ''WritableComparable''. The type:
+  * stores a document of type ''Doctype''.
+  * stores a list of positions of occurrence. The sequence of length //N// should be stored on disk as number //N// followed by //N// numbers -- positions of occurrence. Type ''BERIntWritable'' should be used.
+  * is comparable, comparing using the ''Comparable'' interface of ''Doctype''.
+  * has methods ''getDoc'', ''setDoc'', ''getOccurrences'', ''addOccurrence'', ''toString''.
-Implement the [[.:step-15|K-means clustering exercise]] in Java. Instead of an controlling script, use the Java class itself to execute the Hadoop job as many times as necessary.
+Using this type, create an inverted index -- implement a Hadoop job that, for each word, creates a list of ''DocWithOccurrences<Text>'' holding the documents that contain this word, including the positions of its occurrences.
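The on-disk layout requested for the list of positions (the number //N// followed by //N// numbers) can be sketched as follows. This is only an illustration of the length-prefixed layout, under the assumption that plain ''writeInt''/''readInt'' are acceptable for a demonstration; the exercise itself asks for ''BERIntWritable'' instead. The class ''OccurrenceListCodec'' and its methods are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: shows only the "N followed by N numbers" layout,
// using fixed-width ints rather than the BER encoding the exercise asks for.
class OccurrenceListCodec {
  static byte[] write(List<Integer> positions) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(positions.size());            // the length N comes first
      for (int position : positions) {
        out.writeInt(position);                  // then the N positions
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static List<Integer> read(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      int n = in.readInt();                      // read N to know how much follows
      List<Integer> positions = new ArrayList<>(n);
      for (int i = 0; i < n; i++) {
        positions.add(in.readInt());
      }
      return positions;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    List<Integer> positions = new ArrayList<>();
    positions.add(3);
    positions.add(17);
    positions.add(42);
    System.out.println(read(write(positions)));
  }
}
```

Writing the length first is what allows ''readFields'' to know where the list ends, since the input stream carries no record delimiters of its own.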
+===== Exercise 2 =====
+Optional. Improve the solution to identify the documents by their ids instead of names, i.e., create for each word a sequence of ''DocWithOccurrences<IntWritable>''. Your solution should use two Hadoop jobs:
+  - in the first job, create a list of unique document names. Number the documents using the order in this list.
+  - in the second job, create for each word a list of ''DocWithOccurrences<IntWritable>'', where the document is identified by its number (contrary to the previous exercise, where ''Text'' was used to identify the document).
 ----
@@ Line 21: / Line 139: @@
 <table style="width:100%">
 <tr>
-<td style="text-align:left; width: 33%; "></html>[[step-27|Step 27]]: Custom data types.<html></td>
+<td style="text-align:left; width: 33%; "></html>[[step-27|Step 27]]: Running multiple Hadoop jobs in one source file.<html></td>
 <td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
 <td style="text-align:right; width: 33%; "></html>[[step-29|Step 29]]: Custom sorting and grouping comparators.<html></td>

Institute of Formal and Applied Linguistics Wiki