====== MapReduce Tutorial : Mappers, running Java Hadoop jobs ======

We start by going through a simple Hadoop job with a Mapper only.

A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper<Kin, Vin, Kout, Vout>]]. In our case, the mapper is a subclass of ''Mapper<Text, Text, Text, Text>''.

The mapper must define a ''map'' method and may provide ''setup'' and ''cleanup'' methods:
<code java>
  public static class TheMapper extends Mapper<Text, Text, Text, Text> {
    public void setup(Context context) throws IOException, InterruptedException {}

    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {}

    public void cleanup(Context context) throws IOException, InterruptedException {}
  }
</code>

Outputting (key, value) pairs is performed using the [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/MapContext.html|MapContext<Kin, Vin, Kout, Vout>]] object (the ''Context'' is an abbreviation for this type), with the method ''context.write(Kout key, Vout value)''.
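
For example, the ''map'' method may call ''context.write'' any number of times for a single input pair. The following sketch (illustrative only, not part of the job below) emits one output (token, key) pair per whitespace-separated token of the value:
<code java>
  public static class TokenEmittingMapper extends Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
      // Any number of context.write calls per input pair is allowed -- including none.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          context.write(new Text(token), key);
        }
      }
    }
  }
</code>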

Here is the source of the whole Hadoop job:

<file java MapperOnlyHadoopJob.java>
import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class MapperOnlyHadoopJob extends Configured implements Tool {
  // Mapper
  public static class TheMapper extends Mapper<Text, Text, Text, Text> {
    public void setup(Context context) {
    }

    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
      if (key.getLength() > 0 && Character.toUpperCase(key.charAt(0)) == 'A') {
        context.write(key, value);
      }
    }

    public void cleanup(Context context) {
    }
  }

  // Job configuration
  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.printf("Usage: %s.jar in-path out-path", this.getClass().getName());
      return 1;
    }

    Job job = new Job(getConf(), this.getClass().getName());    // Create class representing Hadoop job.
                                                                // Name of the job is the name of current class.

    job.setJarByClass(this.getClass());                         // Use jar containing current class.
    job.setMapperClass(TheMapper.class);                        // The mapper of the job.
    job.setOutputKeyClass(Text.class);                          // Type of the output keys.
    job.setOutputValueClass(Text.class);                        // Type of the output values.

    job.setInputFormatClass(KeyValueTextInputFormat.class);     // Input format.
                                                                // Output format is the default -- TextOutputFormat.

    FileInputFormat.addInputPath(job, new Path(args[0]));       // Input path is on command line.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));     // Output path is on command line too.

    return job.waitForCompletion(true) ? 0 : 1;
  }

  // Main method
  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new MapperOnlyHadoopJob(), args);

    System.exit(res);
  }
}
</file>

Remarks:
  * The filename //must// be the same as the name of the class -- this is enforced by the Java compiler.
  * Multiple jobs can be submitted from a single class, either in sequence or in parallel (see the sketch below).
  * A mismatch of types is usually detected by the compiler, but sometimes only at runtime. In that case an exception is raised and the program crashes.

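As an illustration of the second remark, the ''run'' method of a ''Tool'' class like the one above may create and submit several ''Job'' objects. A minimal sketch (hypothetical job names, configuration elided) running two jobs in sequence:
<code java>
  public int run(String[] args) throws Exception {
    Job first = new Job(getConf(), "first-pass");
    // ... set mapper, input and output paths of the first job here ...
    if (!first.waitForCompletion(true)) return 1;     // block until the first job finishes

    Job second = new Job(getConf(), "second-pass");
    // ... configure the second job, typically reading the output of the first ...
    return second.waitForCompletion(true) ? 0 : 1;
  }
</code>
To run jobs in parallel, ''job.submit()'' can be used instead of ''job.waitForCompletion(true)''; it returns immediately and the job status can later be checked with ''job.isComplete()''.
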
===== Running the job =====
The official way of running Hadoop jobs is to use the ''/SGE/HADOOP/active/bin/hadoop'' script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore a wrapper script with options similar to the Perl API runner is provided (see the example below the list):
  * ''/net/projects/hadoop/bin/hadoop [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- executes the given job locally in a single thread. It is useful for debugging.
  * ''/net/projects/hadoop/bin/hadoop -jt cluster_master [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- submits the job to the given ''cluster_master''.
  * ''/net/projects/hadoop/bin/hadoop -c number_of_machines [-w secs_to_wait_after_job_finishes] [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- creates a new cluster with the specified number of machines, executes the given job on it, and then waits for the specified number of seconds before stopping the cluster.

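For illustration, assuming the job above has been packed into ''MapperOnlyHadoopJob.jar'', a hypothetical invocation (input and output paths are placeholders) that creates a one-machine cluster, runs the job with no reducers and keeps the cluster alive for another 60 seconds would be
  /net/projects/hadoop/bin/hadoop -c 1 -w 60 -r 0 MapperOnlyHadoopJob.jar input-dir output-dir
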
===== Exercise =====
Download ''MapperOnlyHadoopJob.java'', compile it and run it using
 +  /net/projects/hadoop/bin/hadoop -r 0 MapperOnlyHadoopJob
