====== MapReduce Tutorial : Running multiple Hadoop jobs in one source file ======
  
The Java API offers the possibility to submit multiple Hadoop jobs from one source file. A job can be submitted using either of the following methods (a minimal driver illustrating both is sketched after the list):
  * [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#waitForCompletion(boolean)|job.waitForCompletion]] -- the job is submitted and the method waits for it to finish (successfully or not).
  * [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#submit()|job.submit]] -- the job is submitted and the method returns immediately. In this case, the state of the submitted job can be queried using [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#isComplete()|job.isComplete]] and [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Job.html#isSuccessful()|job.isSuccessful]].
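Below is a minimal driver sketch illustrating both submission styles; the class name ''MultipleJobs'', the job names, and the elided mapper/reducer configuration are illustrative, not part of the tutorial.
<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MultipleJobs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Synchronous submission: waitForCompletion blocks until the job ends.
    Job first = new Job(conf, "first-job");
    // ... set mapper, reducer, input and output paths of the first job ...
    boolean firstOk = first.waitForCompletion(true);

    // Asynchronous submission: submit() returns immediately, so we poll.
    Job second = new Job(conf, "second-job");
    // ... set mapper, reducer, input and output paths of the second job ...
    second.submit();
    while (!second.isComplete())
      Thread.sleep(1000);   // check the job state once a second

    System.exit(firstOk && second.isSuccessful() ? 0 : 1);
  }
}
</code>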
  
===== Exercise 1 =====
Improve the [[.:step-25#exercise|sorting exercise]] to handle a [[.:step-13#nonuniform-data|nonuniform distribution of keys]]. As in the [[.:step-13#nonuniform-data|Perl solution]], run two Hadoop jobs (from one Java source file) -- the first samples the input and creates the separators, the second does the real sorting. A possible skeleton for chaining the two jobs is sketched below.
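
One possible skeleton for chaining the two jobs (the class name, job names, and elided configuration details are placeholders, not the exercise solution):
<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NonuniformSort {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // First job: sample the input and compute the separators.
    Job sampling = new Job(conf, "sample-input");
    // ... configure the sampling mapper/reducer and its input/output paths ...
    if (!sampling.waitForCompletion(true)) System.exit(1);

    // Second job: the real sorting, using the separators computed above
    // (their location can be passed through the job Configuration).
    Job sorting = new Job(conf, "sort");
    // ... configure the sorting mapper/reducer and its input/output paths ...
    System.exit(sorting.waitForCompletion(true) ? 0 : 1);
  }
}
</code>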
  
===== Exercise 2 =====

Implement the [[.:step-15|K-means clustering exercise]] in Java. Instead of a controlling script, use the Java class itself to execute the Hadoop job as many times as necessary. An iterative driver skeleton is sketched below.
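
A minimal iterative driver sketch (the class name, paths, and the fixed iteration bound are illustrative; a real driver would stop once the cluster centers no longer change):
<code java>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeans {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int maxIterations = 20;   // illustrative safety bound

    for (int iteration = 0; iteration < maxIterations; iteration++) {
      Job job = new Job(conf, "kmeans-iteration-" + iteration);
      job.setJarByClass(KMeans.class);
      // ... set the mapper and reducer computing the new cluster centers ...
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1] + "-" + iteration));
      if (!job.waitForCompletion(true)) System.exit(1);

      // Here the driver would compare the old and new cluster centers and
      // break out of the loop once they stop changing.
    }
  }
}
</code>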
  
----

<html>
<table style="width:100%">
<tr>
<td style="text-align:left; width: 33%; "></html>[[step-26|Step 26]]: Compression and job configuration.<html></td>
<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
<td style="text-align:right; width: 33%; "></html>[[step-28|Step 28]]: Custom data types.<html></td>
</tr>
</table>
</html>
  