MapReduce Tutorial : Mappers, running Java Hadoop jobs

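We start by exploring a simple Hadoop job that consists of a Mapper only. The Mapper outputs only the records whose key starts with the letter A; the comparison is case-insensitive, so keys starting with a lowercase a are kept as well.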

import java.io.IOException;
 
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;
 
public class MapperOnlyHadoopJob extends Configured implements Tool {
  // Mapper: KeyValueTextInputFormat (set in run() below) supplies both the key and the value as Text.
  public static class TheMapper extends Mapper<Text, Text, Text, Text>{
    public void setup(Context context) {
    }
 
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
      if (key.getLength() > 0 && Character.toUpperCase(key.charAt(0)) == 'A') {
        context.write(key, value);
      }
    }
 
    public void cleanup(Context context) {
    }
  }
 
  // Job configuration
  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.printf("Usage: %s.jar in-path out-path%n", this.getClass().getName());
      return 1;
    }
 
    Job job = new Job(getConf(), this.getClass().getName());
 
    job.setJarByClass(this.getClass());
    job.setMapperClass(TheMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
 
    job.setInputFormatClass(KeyValueTextInputFormat.class);
 
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
 
    return job.waitForCompletion(true) ? 0 : 1;
  }
 
  // Main method
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new MapperOnlyHadoopJob(), args);
 
    System.exit(res);
  }
}
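
To try out the filtering logic without a Hadoop installation, the following is a minimal sketch in plain Java. It assumes the default behaviour of KeyValueTextInputFormat, which splits each input line at the first tab into a key and a value and treats a line without a tab as a key with an empty value. The class MapperFilterSketch and its methods splitKeyValue, keep and run are hypothetical names introduced only for this illustration; they are not part of Hadoop or of the job above, and String stands in for Text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of the job above; it does not use Hadoop.
public class MapperFilterSketch {
  // Splits a line at the first tab (assumed default separator of KeyValueTextInputFormat).
  // A line without a tab becomes the key, paired with an empty value.
  static String[] splitKeyValue(String line) {
    int tab = line.indexOf('\t');
    if (tab < 0) {
      return new String[] {line, ""};
    }
    return new String[] {line.substring(0, tab), line.substring(tab + 1)};
  }

  // Same condition as in TheMapper.map: a non-empty key whose first character is 'A' or 'a'.
  static boolean keep(String key) {
    return !key.isEmpty() && Character.toUpperCase(key.charAt(0)) == 'A';
  }

  // Applies the split and the filter to every line and returns the kept records,
  // rendered here as key, tab, value.
  static List<String> run(List<String> lines) {
    List<String> out = new ArrayList<>();
    for (String line : lines) {
      String[] kv = splitKeyValue(line);
      if (keep(kv[0])) {
        out.add(kv[0] + "\t" + kv[1]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("apple\tfruit", "Banana\tfruit", "Avocado", "\tno key");
    for (String record : run(lines)) {
      System.out.println(record);
    }
  }
}
```

With the sample lines in main, only the records with the keys apple and Avocado pass the filter; the line with an empty key is dropped by the length check.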

Running the job

Download the source and compile it.

The official way of running Hadoop jobs is to use the /SGE/HADOOP/active/bin/hadoop script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore a wrapper script is provided, with options similar to those of the Perl API runner:

net/projects/hadoop/bin/hadoop [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path: executes the given job locally in a single thread, which is useful for debugging.
net/projects/hadoop/bin/hadoop -jt cluster_master [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path: submits the job to the given cluster_master.
net/projects/hadoop/bin/hadoop -c number_of_machines [-w secs_to_wait_after_job_finishes] [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path: creates a new cluster with the specified number of machines, executes the given job on it, and then waits for the specified number of seconds before stopping the cluster.

Institute of Formal and Applied Linguistics Wiki
