
Institute of Formal and Applied Linguistics Wiki





MapReduce Tutorial : Mappers, running Java Hadoop jobs

We start by going through a simple Hadoop job that uses only a mapper.

A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of Mapper<Kin, Vin, Kout, Vout>. In our case, the mapper is a subclass of Mapper<Text, Text, Text, Text>.

The mapper must define a map method and may additionally provide setup and cleanup methods:

  public static class TheMapper extends Mapper<Text, Text, Text, Text>{
    public void setup(Context context) throws IOException, InterruptedException {}
 
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {}
 
    public void cleanup(Context context) throws IOException, InterruptedException {}
  }

Outputting (key, value) pairs is performed using the MapContext<Kin, Vin, Kout, Vout> object (inside a Mapper subclass, Context is a shorthand for this type), via the method context.write(Kout key, Vout value).

Here is the source of the whole Hadoop job:

MapperOnlyHadoopJob.java
import java.io.IOException;
 
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
 
public class MapperOnlyHadoopJob extends Configured implements Tool {
  // Mapper
  public static class TheMapper extends Mapper<Text, Text, Text, Text>{
    public void setup(Context context) {
    }
 
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
      if (key.getLength() > 0 && Character.toUpperCase(key.charAt(0)) == 'A') {
        context.write(key, value);
      }
    }
 
    public void cleanup(Context context) {
    }
  }
 
  // Job configuration
  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.printf("Usage: %s.jar in-path out-path%n", this.getClass().getName());
      return 1;
    }
 
    Job job = new Job(getConf(), this.getClass().getName());    // Create class representing Hadoop job.
                                                                // Name of the job is the name of current class.
 
    job.setJarByClass(this.getClass());                         // Use jar containing current class.
    job.setMapperClass(TheMapper.class);                        // The mapper of the job.
    job.setOutputKeyClass(Text.class);                          // Type of the output keys.
    job.setOutputValueClass(Text.class);                        // Type of the output values.
 
    job.setInputFormatClass(KeyValueTextInputFormat.class);     // Input format.
                                                                // Output format is the default -- TextOutputFormat
 
    FileInputFormat.addInputPath(job, new Path(args[0]));       // Input path is on command line.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));     // Output path is on command line too.
 
    return job.waitForCompletion(true) ? 0 : 1;
  }
 
  // Main method
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new MapperOnlyHadoopJob(), args);
 
    System.exit(res);
  }
}
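The mapper above passes through only those pairs whose key starts with 'a' or 'A'. The same predicate can be tried outside Hadoop on plain strings (a standalone sketch; KeyFilterDemo and keyStartsWithA are hypothetical names used only for this illustration):

```java
public class KeyFilterDemo {
    // Mirrors the test in TheMapper.map: a non-empty key whose first
    // character upper-cases to 'A' is kept, everything else is dropped.
    static boolean keyStartsWithA(String key) {
        return key.length() > 0 && Character.toUpperCase(key.charAt(0)) == 'A';
    }

    public static void main(String[] args) {
        System.out.println(keyStartsWithA("Amsterdam")); // true
        System.out.println(keyStartsWithA("amsterdam")); // true
        System.out.println(keyStartsWithA("Brno"));      // false
        System.out.println(keyStartsWithA(""));          // false
    }
}
```

Note that Text.charAt of the real mapper returns an int code point rather than a char, but the comparison against 'A' works the same way in both cases.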

Remarks

Running the job

The official way of running Hadoop jobs is to use the /SGE/HADOOP/active/bin/hadoop script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore a wrapper script is provided, with options similar to those of the Perl API runner.

Exercise

Download MapperOnlyHadoopJob.java, compile it, and run it using:

wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_export/code/courses:mapreduce-tutorial:step-24?codeblock=1' -O 'MapperOnlyHadoopJob.java'
make -f /net/projects/hadoop/java/Makefile MapperOnlyHadoopJob.jar
rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop MapperOnlyHadoopJob.jar -r 0 /home/straka/wiki/cs-text-small step-24-out-sol
less step-24-out-sol/part-*

Mind the -r 0 switch: it disables the reducer. If -r 0 were not given, one reducer of the default type IdentityReducer would be used, which outputs every (key, value) pair it receives.
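To see what such an identity reducer would do with the job's output, here is a toy, Hadoop-free simulation (an illustrative sketch, not Hadoop's actual implementation; IdentityReduceDemo is a hypothetical name): the pairs are grouped and sorted by key, as the shuffle phase would do, and the reduce step simply re-emits every (key, value) pair unchanged.

```java
import java.util.*;

public class IdentityReduceDemo {
    // Group (key, value) pairs by key, as the shuffle phase would,
    // then re-emit every pair unchanged, as an identity reducer does.
    static List<String[]> identityReduce(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs)
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet())
            for (String value : e.getValue())
                out.add(new String[]{e.getKey(), value});
        return out;
    }

    public static void main(String[] args) {
        List<String[]> in = Arrays.asList(
            new String[]{"a", "1"}, new String[]{"b", "2"}, new String[]{"a", "3"});
        // Every input pair appears in the output, sorted by key.
        for (String[] kv : identityReduce(in))
            System.out.println(kv[0] + "\t" + kv[1]);
    }
}
```

The output thus contains exactly the same pairs as the mapper emitted, only sorted by key; that is why running without -r 0 still works, it just adds a needless sort-and-copy phase.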


Step 23: Predefined formats and types · Overview · Step 25: Reducers, combiners and partitioners

