MapReduce Tutorial: Running multiple Hadoop jobs in one class
The Java API makes it possible to submit multiple Hadoop jobs from one class. A job can be submitted using either
- job.waitForCompletion – the job is submitted and the method waits for it to finish (successfully or not).
- job.submit – the job is submitted and the method returns immediately. The state of the submitted job can then be queried using job.isComplete and job.isSuccessful.
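The two submission styles can be sketched as follows. This is only an illustration, not code from the tutorial: JobHandle is a hypothetical stand-in for the four methods of org.apache.hadoop.mapreduce.Job named above, so that the sketch runs without a cluster. The real methods additionally throw checked exceptions such as IOException. The names runSequentially, runConcurrently and fakeJob are likewise assumptions made for this example.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the two submission styles; runs without Hadoop. */
class MultiJobDriver {

    /**
     * Hypothetical stand-in for the four methods of
     * org.apache.hadoop.mapreduce.Job used in this sketch.
     * The real methods also declare checked exceptions.
     */
    interface JobHandle {
        boolean waitForCompletion(boolean verbose);
        void submit();
        boolean isComplete();
        boolean isSuccessful();
    }

    /** Blocking style: run the jobs one after another, stop at the first failure. */
    static boolean runSequentially(JobHandle... jobs) {
        for (JobHandle job : jobs) {
            if (!job.waitForCompletion(true)) {
                return false;
            }
        }
        return true;
    }

    /** Non-blocking style: submit all jobs first, then poll each one until it has finished. */
    static boolean runConcurrently(long pollMillis, JobHandle... jobs) {
        for (JobHandle job : jobs) {
            job.submit();
        }
        boolean allSucceeded = true;
        for (JobHandle job : jobs) {
            while (!job.isComplete()) {
                try {
                    Thread.sleep(pollMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            allSucceeded &= job.isSuccessful();
        }
        return allSucceeded;
    }

    /** Fake job (assumption for the demo): reports completion after a fixed number of polls. */
    static JobHandle fakeJob(int pollsUntilDone, boolean succeeds) {
        AtomicInteger remaining = new AtomicInteger(pollsUntilDone);
        return new JobHandle() {
            public boolean waitForCompletion(boolean verbose) { remaining.set(0); return succeeds; }
            public void submit() { }
            public boolean isComplete() { return remaining.getAndDecrement() <= 0; }
            public boolean isSuccessful() { return succeeds; }
        };
    }

    public static void main(String[] args) {
        // Two dependent jobs, executed one after another.
        System.out.println(runSequentially(fakeJob(0, true), fakeJob(0, true)));
        // Two independent jobs, submitted together; the second one fails.
        System.out.println(runConcurrently(1, fakeJob(2, true), fakeJob(3, false)));
    }
}
```

The blocking style suits jobs that depend on each other's output, as in Exercise 1. The non-blocking style is useful only when the jobs are independent of each other.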
Exercise 1
Improve the last inverted index creation exercise so that
- the first job creates a list of unique document names. Number the documents according to their order in this list.
- the second job creates, for each word, a sorted list of DocWithOccurences<IntWritable>, where the document is identified by its number (in contrast to the previous exercise, where Text was used to identify the document).
Exercise 2
Implement the K-means clustering exercise in Java. Instead of a controlling script, use the Java class itself to run the Hadoop job as many times as necessary.
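One possible shape of such a driver loop is sketched below. It is a hint rather than a solution, and it runs without Hadoop. The Pass interface, the iterate method, the path names and the convergence test are all assumptions made for this example. In a real driver, one pass would configure a fresh Job that reads the current centers, call job.waitForCompletion, and then determine how far the centers moved (for instance by comparing the old and new center files).

```java
/** Sketch of a driver that repeats a job until convergence; runs without Hadoop. */
class IterativeDriver {

    /**
     * Hypothetical hook for one K-means pass: reads centers from centersIn,
     * writes new centers to centersOut and returns how far the centers moved.
     */
    interface Pass {
        double run(String centersIn, String centersOut);
    }

    /**
     * Repeats the pass until the centers move less than epsilon or
     * maxIterations is reached. Returns the path of the last centers written.
     */
    static String iterate(Pass pass, String initialCenters, String outputPrefix,
                          double epsilon, int maxIterations) {
        String centers = initialCenters;
        for (int i = 1; i <= maxIterations; i++) {
            // Hadoop refuses to write into an existing output directory,
            // so every iteration gets its own one.
            String next = outputPrefix + "/iter-" + i;
            double moved = pass.run(centers, next);
            centers = next;  // the output of this pass is the input of the next one
            if (moved < epsilon) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        // Fake pass (assumption for the demo): the movement halves on every call.
        double[] moved = {8.0};
        String result = iterate((in, out) -> moved[0] /= 2,
                                "centers/initial", "centers", 1.5, 10);
        System.out.println(result);  // prints "centers/iter-3"
    }
}
```

Capping the number of iterations keeps the driver from looping forever when the centers oscillate instead of converging.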