MapReduce Tutorial : Running multiple Hadoop jobs in one source file

The Java API makes it possible to submit multiple Hadoop jobs from one source file. A job can be submitted using either of the following methods:

job.waitForCompletion – the job is submitted and the method waits for it to finish (successfully or not). The return value tells whether the job succeeded.
job.submit – the job is submitted and the method returns immediately. In this case, the state of the submitted job can be queried using job.isComplete and job.isSuccessful.
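The two submission styles can be sketched as follows. This is only a minimal illustration, not the tutorial's reference solution: the class name TwoJobs, the job names and the three path arguments are made up for the example, both jobs fall back to Hadoop's default identity mapper and reducer, and Job.getInstance assumes Hadoop 2 or later (older releases construct the job with new Job(conf, name) instead).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver class: runs two jobs one after another from a single source file.
public class TwoJobs {
  // Builds a job with the default (identity) mapper and reducer.
  // A real job would also call setMapperClass, setReducerClass, etc.
  private static Job createJob(Configuration conf, String name, Path in, Path out)
      throws Exception {
    Job job = Job.getInstance(conf, name);  // assumes Hadoop 2+
    job.setJarByClass(TwoJobs.class);
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);
    Path output = new Path(args[2]);

    // Style 1: submit and block until the job finishes.
    Job first = createJob(conf, "first", input, intermediate);
    if (!first.waitForCompletion(true)) {
      System.exit(1);  // the first job failed, do not start the second one
    }

    // Style 2: submit, return immediately, and poll the job state.
    Job second = createJob(conf, "second", intermediate, output);
    second.submit();
    while (!second.isComplete()) {
      Thread.sleep(1000);  // other work could be done here instead of sleeping
    }
    System.exit(second.isSuccessful() ? 0 : 1);
  }
}
```

The second job reads the output directory of the first one, which is the usual way of chaining jobs. When nothing else has to happen while a job runs, waitForCompletion is the simpler choice.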

Exercise 1

Improve the sorting exercise to handle a nonuniform key distribution. As in the Perl solution, run two Hadoop jobs (using one Java source file): the first samples the input and creates the separators, the second does the actual sorting.
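One possible way to turn a sample into separators is sketched below in plain Java, without any Hadoop classes. It is an assumption-laden illustration rather than the Perl solution's method: the class SeparatorSketch and both of its methods are hypothetical helpers, keys are taken to be strings, and the separators are chosen at evenly spaced positions of the sorted sample.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop or of the tutorial's code.
final class SeparatorSketch {
  // Picks (parts - 1) separators at evenly spaced positions of the sorted sample.
  static List<String> chooseSeparators(List<String> sample, int parts) {
    List<String> sorted = new ArrayList<>(sample);
    Collections.sort(sorted);
    List<String> separators = new ArrayList<>();
    if (sorted.isEmpty()) {
      return separators;  // nothing sampled: everything goes to partition 0
    }
    for (int i = 1; i < parts; i++) {
      separators.add(sorted.get(i * sorted.size() / parts));
    }
    return separators;
  }

  // Returns the partition index of a key: the number of separators <= key.
  static int partitionOf(String key, List<String> separators) {
    int pos = Collections.binarySearch(separators, key);
    // Found: the key equals separators[pos] and belongs to the partition right of it.
    // Not found: binarySearch returns -(insertionPoint) - 1.
    return pos >= 0 ? pos + 1 : -(pos + 1);
  }
}
```

In a two-job solution, the first job would produce the sample, and a custom partitioner in the second job could apply the same logic as partitionOf so that each reducer receives a contiguous key range of roughly equal size.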

Exercise 2

Improve the inverted index creation exercise such that:

in the first job, create a list of unique document names. Number the documents using their order in this list.
in the second job, create for each word a sorted list of DocWithOccurences<IntWritable>, where the document is identified by its number (in contrast to the previous exercise, where Text was used to identify the document).
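The numbering step of the first job can be illustrated in plain Java. The sketch below is only a guess at one workable scheme: the class DocumentNumbering is hypothetical, the list is assumed to be in sorted order, and numbering is assumed to start at 0 (the exercise does not say).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop or of the tutorial's code.
final class DocumentNumbering {
  // Maps every distinct document name to its position in the sorted list of names.
  static Map<String, Integer> numberDocuments(List<String> names) {
    Map<String, Integer> ids = new LinkedHashMap<>();
    for (String name : new TreeSet<>(names)) {  // TreeSet removes duplicates and sorts
      ids.put(name, ids.size());                // assumption: numbers start at 0
    }
    return ids;
  }
}
```

The second job could load such a mapping once per task and use it to replace each document name by its number before emitting the pair.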

Exercise 3

Implement the K-means clustering exercise in Java. Instead of a controlling script, use the Java class itself to execute the Hadoop job as many times as necessary.
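The control flow of such a self-driving class might look like the sketch below. It is deliberately independent of Hadoop: the class IterationDriver, the epsilon threshold and the iteration limit are all assumptions made for illustration. In a real solution, the runIteration callback would create a fresh Job for iteration i, run it with job.waitForCompletion, and return how far the cluster centers moved.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical helper, not part of Hadoop or of the tutorial's code.
final class IterationDriver {
  // Calls runIteration with 1, 2, 3, ... until the returned shift is at most
  // epsilon or maxIterations is reached. Returns the number of iterations run.
  static int runUntilConverged(IntToDoubleFunction runIteration,
                               double epsilon, int maxIterations) {
    for (int i = 1; i <= maxIterations; i++) {
      double shift = runIteration.applyAsDouble(i);  // one Hadoop job per call
      if (shift <= epsilon) {
        return i;  // centers are stable enough, stop iterating
      }
    }
    return maxIterations;  // gave up without reaching the threshold
  }

  public static void main(String[] args) {
    // Stand-in for real jobs: the shift shrinks as 1/i.
    int iterations = runUntilConverged(i -> 1.0 / i, 0.25, 10);
    System.out.println("iterations: " + iterations);
  }
}
```

Capping the number of iterations guards against a run that never reaches the threshold. Each iteration needs its own Job object and its own output directory, since a Job cannot be submitted twice and Hadoop refuses to write into an existing output directory.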



Institute of Formal and Applied Linguistics Wiki
