Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-24 [2012/01/27 21:02]
straka
courses:mapreduce-tutorial:step-24 [2012/01/27 21:47]
straka
Line 1: Line 1:
 ====== MapReduce Tutorial : Mappers, running Java Hadoop jobs ====== ====== MapReduce Tutorial : Mappers, running Java Hadoop jobs ======
  
-We start by exploring a simple Hadoop job with Mapper only. The Mapper outputs only keys starting with ''A''.+We start by going through a simple Hadoop job that has only a Mapper. 
 + 
 +A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper<Kin, Vin, Kout, Vout>]]. In our case, the mapper is a subclass of ''Mapper<Text, Text, Text, Text>''.
 + 
 +The mapper must define a ''map'' method and may also provide ''setup'' and ''cleanup'' methods: 
 +<code java> 
 +  public static class TheMapper extends Mapper<Text, Text, Text, Text>{ 
 +    public void setup(Context context) throws IOException, InterruptedException {} 
 + 
 +    public void map(Text key, Text value, Context context) throws IOException, InterruptedException {} 
 + 
 +    public void cleanup(Context context) throws IOException, InterruptedException {} 
 +  } 
 +</code> 
 + 
 +(key, value) pairs are output by calling the ''write'' method of the [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/MapContext.html|MapContext<Kin, Vin, Kout, Vout>]] that is passed to each of these methods (''Context'' is an abbreviation for this type). 
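The call order of the three methods can be illustrated with a small sketch. It is a plain-Java stand-in, not Hadoop code: the ''Context'' class, the ''run'' driver and the ''calls'' log are hypothetical helpers written for this illustration, and keys and values are plain ''String'' objects instead of ''Text''. The sketch assumes that the framework calls ''setup'' once, then ''map'' once per input pair, then ''cleanup'' once, and it uses the filter from the earlier revision of this page (only keys starting with ''A'' are output).

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for illustration only; none of these types are Hadoop classes.
class MapperLifecycleSketch {
  /** Hypothetical stand-in for the Hadoop Context: it only collects written pairs. */
  static class Context {
    final List<String[]> output = new ArrayList<>();

    void write(String key, String value) {
      output.add(new String[] {key, value});
    }
  }

  /** Mirrors the three methods of the mapper skeleton above and logs each call. */
  static class TheMapper {
    final List<String> calls = new ArrayList<>();

    void setup(Context context) {
      calls.add("setup");
    }

    void map(String key, String value, Context context) {
      calls.add("map");
      // Output only the pairs whose key starts with "A".
      if (key.startsWith("A")) {
        context.write(key, value);
      }
    }

    void cleanup(Context context) {
      calls.add("cleanup");
    }
  }

  /** Assumed call order: setup once, map once per record, cleanup once. */
  static Context run(TheMapper mapper, String[][] records) {
    Context context = new Context();
    mapper.setup(context);
    for (String[] record : records) {
      mapper.map(record[0], record[1], context);
    }
    mapper.cleanup(context);
    return context;
  }

  public static void main(String[] args) {
    String[][] records = {{"Apple", "1"}, {"Banana", "2"}, {"Avocado", "3"}};
    Context context = run(new TheMapper(), records);
    for (String[] pair : context.output) {
      System.out.println(pair[0] + "\t" + pair[1]);
    }
  }
}
```

In a real Hadoop mapper the body of ''map'' would take the same shape, with ''key.toString().startsWith("A")'' as the test and ''context.write(key, value)'' as the output call.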
 <file java MapperOnlyHadoopJob.java> <file java MapperOnlyHadoopJob.java>
 import java.io.IOException; import java.io.IOException;
Line 63: Line 79:
 Download the source and compile it. Download the source and compile it.
  
-The //official// way of running Hadoop jobs is to use the ''//SGE/HADOOP/active/bin/hadoop'' script. This script has no user-friendly options and only Hadoop properties can be set. Therefore a wrapper script is provided. This script has the same options as the Perl API runner: +The official way of running Hadoop jobs is to use the ''/SGE/HADOOP/active/bin/hadoop'' script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore a wrapper script is provided, with options similar to those of the Perl API runner: 
-  * ''net/projects/hadoop/bin/hadoop job.jar input_path output_path'' executes teh given job locally in a single thread. It is useful for debugging.+  * ''net/projects/hadoop/bin/hadoop [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- executes the given job locally in a single thread. It is useful for debugging.
 +  * ''net/projects/hadoop/bin/hadoop -jt cluster_master [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- submits the job to the given ''cluster_master''.
 +  * ''net/projects/hadoop/bin/hadoop -c number_of_machines [-w secs_to_wait_after_job_finishes] [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- creates a new cluster with the specified number of machines, which executes the given job and then waits for the specified number of seconds before it stops.
  
