====== MapReduce Tutorial : Mappers, running Java Hadoop jobs ======

We start by going through a simple Hadoop job with a Mapper only. The Mapper outputs only the (key, value) pairs whose key starts with ''A''.

A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper<Kin, Vin, Kout, Vout>]]. In our case, the mapper is a subclass of ''Mapper<Text, Text, Text, Text>''.

The mapper must define a ''map'' method and may also provide ''setup'' and ''cleanup'' methods, which are called once before the first and once after the last input pair is processed, respectively:
<code java>
public static class TheMapper extends Mapper<Text, Text, Text, Text> {
  public void setup(Context context) throws IOException, InterruptedException {}

  public void map(Text key, Text value, Context context) throws IOException, InterruptedException {}

  public void cleanup(Context context) throws IOException, InterruptedException {}
}
</code>

Outputting (key, value) pairs is performed using the [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/MapContext.html|MapContext<Kin, Vin, Kout, Vout>]] object (''Context'' is an abbreviation of this type), with the method ''context.write(Kout key, Vout value)''.
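
For example, a ''map'' method which uses ''context.write'' to forward only the pairs whose key starts with ''A'' could look as follows (a minimal sketch; the complete job below uses the same method):

<code java>
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  if (key.toString().startsWith("A"))   // Emit only the pairs whose key starts with 'A'.
    context.write(key, value);          // Output a (key, value) pair.
}
</code>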

Here is the source of the whole Hadoop job:
<file java MapperOnlyHadoopJob.java>
import java.io.IOException;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

public class MapperOnlyHadoopJob extends Configured implements Tool {
  // The mapper, which outputs only the pairs whose key starts with 'A'.
  public static class TheMapper extends Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
      if (key.toString().startsWith("A")) context.write(key, value);
    }
  }

  // The body of the Hadoop job.
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), this.getClass().getName()); // Create class representing Hadoop job.

    job.setJarByClass(this.getClass());                      // Use jar containing current class.
    job.setMapperClass(TheMapper.class);                     // The mapper of the job.
    job.setOutputKeyClass(Text.class);                       // Type of the output keys.
    job.setOutputValueClass(Text.class);                     // Type of the output values.

    job.setInputFormatClass(KeyValueTextInputFormat.class);  // Input format.
    // Output format is the default -- TextOutputFormat.

    FileInputFormat.addInputPath(job, new Path(args[0]));    // Input path is on command line.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output path is on command line too.

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MapperOnlyHadoopJob(), args)); // Parse generic Hadoop properties, then run.
  }
}
</file>
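
Because the job is executed through [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/util/ToolRunner.html|ToolRunner]], generic Hadoop properties given on the command line (e.g. ''-D mapred.reduce.tasks=2'') are parsed and stored in the configuration returned by ''getConf()'' before the ''run'' method is called.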
The official way of running Hadoop jobs is to use the ''/SGE/HADOOP/active/bin/hadoop'' script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore a wrapper script is provided, with options similar to those of the Perl API runner (example invocations follow the list):
  * ''/net/projects/hadoop/bin/hadoop [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- executes the given job locally in a single thread. It is useful for debugging.
  * ''/net/projects/hadoop/bin/hadoop -jt cluster_master [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- submits the job to the given ''cluster_master''.
  * ''/net/projects/hadoop/bin/hadoop -c number_of_machines [-w secs_to_wait_after_job_finishes] [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- creates a new cluster with the specified number of machines, executes the given job on it, and then waits for the specified number of seconds before stopping the cluster.
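
For instance, the job could be run like this (the input and output paths are only illustrative):

<code>
# Run the job locally in a single thread:
/net/projects/hadoop/bin/hadoop MapperOnlyHadoopJob.jar my-input-dir step-24-out

# Run the job on a newly created cluster of 2 machines:
/net/projects/hadoop/bin/hadoop -c 2 MapperOnlyHadoopJob.jar my-input-dir step-24-out
</code>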