[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Table of Contents

MapReduce Tutorial : Mappers, running Java Hadoop jobs, counters

We start by going through a simple Hadoop job with Mapper only.

A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of Mapper<Kin, Vin, Kout, Vout>. In our case, the mapper is subclass of Mapper<Text, Text, Text, Text>.

The mapper must define a map method and may provide setup and context method:

  public static class TheMapper extends Mapper<Text, Text, Text, Text>{
    public void setup(Context context) throws IOException, InterruptedException {}
 
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {}
 
    public void cleanup(Context context) throws IOException, InterruptedException {}
  }

Outputting (key, value) pairs is performed using the MapContext<Kin, Vin, Kout, Vout> object (the Context is an abbreviation for this type), with the method context.write(Kout key, Vout value).

Here is the source of the whole Hadoop job:

MapperOnlyHadoopJob.java
import java.io.IOException;
 
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
 
public class MapperOnlyHadoopJob extends Configured implements Tool {
  // Mapper
  public static class TheMapper extends Mapper<Text, Text, Text, Text>{
    public void setup(Context context) {
    }
 
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
      if (key.getLength() > 0 && Character.toUpperCase(key.charAt(0)) == 'A') {
        context.write(key, value);
      }
    }
 
    public void cleanup(Context context) {
    }
  }
 
  // Job configuration
  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.printf("Usage: %s.jar in-path out-path", this.getClass().getName());
      return 1;
    }
 
    Job job = new Job(getConf(), this.getClass().getName());    // Create class representing Hadoop job.
                                                                // Name of the job is the name of current class.
 
    job.setJarByClass(this.getClass());                         // Use jar containing current class.
    job.setMapperClass(TheMapper.class);                        // The mapper of the job.
    job.setOutputKeyClass(Text.class);                          // Type of the output keys.
    job.setOutputValueClass(Text.class);                        // Type of the output values.
 
    job.setInputFormatClass(KeyValueTextInputFormat.class);     // Input format.
                                                                // Output format is the default -- TextOutputFormat
 
    FileInputFormat.addInputPath(job, new Path(args[0]));       // Input path is on command line.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));     // Output path is on command line too.
 
    return job.waitForCompletion(true) ? 0 : 1;
  }
 
  // Main method
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new MapperOnlyHadoopJob(), args);
 
    System.exit(res);
  }
}

Remarks

Running the job

The official way of running Hadoop jobs is to use the /SGE/HADOOP/active/bin/hadoop script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore a wrapper script is provided, with similar options as the Perl API runner:

Exercise 1

Download the MapperOnlyHadoopJob.java, compile it and run it using

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_export/code/courses:mapreduce-tutorial:step-24?codeblock=1' -O 'MapperOnlyHadoopJob.java'
make -f /net/projects/hadoop/java/Makefile MapperOnlyHadoopJob.jar
rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop MapperOnlyHadoopJob.jar -r 0 /home/straka/wiki/cs-text-small step-24-out-sol
less step-24-out-sol/part-*

Mind the -r 0 switch – specifying -r 0 disable the reducer. If the switch -r 0 was not given, one reducer of default type IdentityReducer would be used. The IdentityReducer outputs every (key, value) pair it is given.

Counters

As in the Perl API, a mapper (or a reducer) can increment various counters by using context.getCounter(“Group”, “Name”).increment(value):

public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  ...
  context.getCounter("Group", "Name").increment(value);
  ...
}

The getCounter method returns a Counter object, so if a counter is incremented frequently, the getCounter method can be called only once:

public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  ...
  Counter words = context.getCounter("Mapper", "Number of words");
  for (String word : value.toString().split("\\W+")) {
    ...
    words.increment(1);
  }
}

Example 2

Run a Hadoop job on /home/straka/wiki/cs-text-small, which filters the documents so that only three-letter words remain. Also use counters to count the histogram of words lengths and to compute the percentage of three letter words in the documents. You can download the template ThreeLetterWords.java and execute it.

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-24.txt' -O 'ThreeLetterWords.java'
# NOW VIEW THE FILE
# $EDITOR ThreeLetterWords.java
make -f /net/projects/hadoop/java/Makefile ThreeLetterWords.jar
rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop ThreeLetterWords.jar -r 0 /home/straka/wiki/cs-text-small step-24-out-sol
less step-24-out-sol/part-*

Step 23: Predefined formats and types. Overview Step 25: Reducers, combiners and partitioners.


[ Back to the navigation ] [ Back to the content ]