Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-24 [2012/01/27 21:47]
straka
+++ courses:mapreduce-tutorial:step-24 [2012/01/31 16:25] (current)
dusek
@@ Line 1: / Line 1: @@
-====== MapReduce Tutorial : Mappers, running Java Hadoop jobs ======
+====== MapReduce Tutorial : Mappers, running Java Hadoop jobs, counters ======
 We start by going through a simple Hadoop job with Mapper only.
-A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper<Kin, Vin, Kout, Vout>]]. In our case, the mapper is subclass of ''Mapper<Text, Text, Text, Text>''.
+A //mapper// which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper<Kin, Vin, Kout, Vout>]]. In our case, the mapper is a subclass of ''Mapper<Text, Text, Text, Text>''.
 The mapper must define a ''map'' method and may provide ''setup'' and ''cleanup'' methods:
@@ Line 16: / Line 16: @@
 </code>
-Outputting (key, value) pairs is performed using the [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/MapContext.html|MapContext<Kin, Vin, Kout, Vout>]] (the ''Context'' is an abbreviation for this type).
+Outputting (key, value) pairs is performed using the [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/MapContext.html|MapContext<Kin, Vin, Kout, Vout>]] object (''Context'' is an abbreviation for this type) and its method ''context.write(Kout key, Vout value)''.
+Here is the source of the whole Hadoop job:
 <file java MapperOnlyHadoopJob.java>
@@ Line 52: / Line 54: @@
     }
-    Job job = new Job(getConf(), this.getClass().getName());
+    Job job = new Job(getConf(), this.getClass().getName());    // Create class representing Hadoop job.
+                                                                // Name of the job is the name of current class.
-    job.setJarByClass(this.getClass());
+    job.setJarByClass(this.getClass());                         // Use jar containing current class.
-    job.setMapperClass(TheMapper.class);
+    job.setMapperClass(TheMapper.class);                        // The mapper of the job.
-    job.setOutputKeyClass(Text.class);
+    job.setOutputKeyClass(Text.class);                          // Type of the output keys.
-    job.setOutputValueClass(Text.class);
+    job.setOutputValueClass(Text.class);                        // Type of the output values.
-    job.setInputFormatClass(KeyValueTextInputFormat.class);
+    job.setInputFormatClass(KeyValueTextInputFormat.class);     // Input format.
+                                                                // Output format is the default -- TextOutputFormat
-    FileInputFormat.addInputPath(job, new Path(args[0]));
+    FileInputFormat.addInputPath(job, new Path(args[0]));       // Input path is on command line.
-    FileOutputFormat.setOutputPath(job, new Path(args[1]));
+    FileOutputFormat.setOutputPath(job, new Path(args[1]));     // Output path is on command line too.
     return job.waitForCompletion(true) ? 0 : 1;
@@ Line 76: / Line 80: @@
 </file>
-===== Running the job =====
+==== Remarks ====
-Download the source and compile it.
+  * The filename //must// be the same as the name of the top-level class -- this is enforced by the Java compiler. However, the top-level class can contain any number of nested classes.
+  * Multiple jobs can be submitted from one class, either in sequence or in parallel.
+  * A mismatch of types is usually detected by the compiler, but sometimes it is detected only at runtime. If that happens, an exception is raised and the program crashes. For example, the default output key class is ''LongWritable'' -- if ''Text'' were not specified, the program would crash.
+  * **VIM users**: The code completion plugin does not complete the ''context'' variable. That is because it does not understand that ''Context'' is used as an abbreviation for ''MapContext<Text, Text, Text, Text>''. If the type ''MapContext<Text, Text, Text, Text>'' is used instead of ''Context'', the code compiles and code completion on ''context'' works.
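The runtime type mismatch mentioned in the remarks can be illustrated without Hadoop. The following is only a toy sketch: ''CheckedCollector'' is a hypothetical stand-in, not a Hadoop class, and ''String'' and ''Long'' merely play the roles of ''Text'' and ''LongWritable''. It shows why the compiler cannot help: the expected key class is handed over as a ''Class'' object when the job is configured, so it can only be compared with the actual key when a pair is written.

```java
import java.util.ArrayList;
import java.util.List;

public class RuntimeTypeCheckSketch {
    /** Hypothetical stand-in for a job's output collector; not part of Hadoop. */
    static class CheckedCollector {
        private final Class<?> keyClass;                 // configured at runtime, like an output key class
        private final List<Object> written = new ArrayList<>();

        CheckedCollector(Class<?> keyClass) {
            this.keyClass = keyClass;
        }

        // The key arrives as a plain Object, so the compiler has nothing to
        // compare against keyClass; the check happens only when a pair is written.
        void write(Object key, Object value) {
            if (key.getClass() != keyClass) {
                throw new IllegalStateException("Type mismatch in key: expected "
                        + keyClass.getName() + ", received " + key.getClass().getName());
            }
            written.add(key);
        }

        int size() {
            return written.size();
        }
    }

    public static void main(String[] args) {
        // Correctly configured: the declared key class matches the emitted keys.
        CheckedCollector configured = new CheckedCollector(String.class);
        configured.write("word", "value");

        // Misconfigured: Long plays the role of a default key class nobody overrode.
        CheckedCollector misconfigured = new CheckedCollector(Long.class);
        try {
            misconfigured.write("word", "value");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Both collectors compile without complaint; only the second one fails, and only once the first pair is written.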
+===== Running the job =====
 The official way of running Hadoop jobs is to use the ''/SGE/HADOOP/active/bin/hadoop'' script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore, a wrapper script is provided, with options similar to those of the Perl API runner:
-  * ''net/projects/hadoop/bin/hadoop [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- executes the given job locally in a single thread. It is useful for debugging.
+  * ''/net/projects/hadoop/bin/hadoop job.jar [-Dname=value -Dname=value ...] input output_path'' -- executes the given job locally in a single thread. It is useful for debugging.
-  * ''net/projects/hadoop/bin/hadoop -jt cluster_master [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- submits the job to given ''cluster_master''.
+  * ''/net/projects/hadoop/bin/hadoop job.jar -jt cluster_master [-r number_of_reducers] [-Dname=value -Dname=value ...] input output_path'' -- submits the job to the given ''cluster_master''.
-  * ''net/projects/hadoop/bin/hadoop -c number_of_machines [-w secs_to_wait_after_job_finishes] [-r number_of_reducers] job.jar [generic Hadoop properties] input_path output_path'' -- creates a new cluster with specified number of machines, which executes given job, and then waits for specified number of seconds before it stops.
+  * ''/net/projects/hadoop/bin/hadoop job.jar -c number_of_machines [-w secs_to_wait_after_job_finishes] [-r number_of_reducers] [-Dname=value -Dname=value ...] input output_path'' -- creates a new cluster with the specified number of machines, which executes the given job and then waits for the specified number of seconds before it stops.
+===== Exercise 1 =====
+Download ''MapperOnlyHadoopJob.java'', compile it, and run it using:
+  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_export/code/courses:mapreduce-tutorial:step-24?codeblock=1' -O 'MapperOnlyHadoopJob.java'
+  make -f /net/projects/hadoop/java/Makefile MapperOnlyHadoopJob.jar
+  rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop MapperOnlyHadoopJob.jar -r 0 /home/straka/wiki/cs-text-small step-24-out-sol
+  less step-24-out-sol/part-*
+Mind the ''-r 0'' switch -- specifying ''-r 0'' disables the reducer. If the switch ''-r 0'' were not given, one reducer of the default type ''IdentityReducer'' would be used. The ''IdentityReducer'' outputs every (key, value) pair it is given.
+  * When using ''-r 0'', the job runs faster, as the mappers write the output directly to disk. But there are as many output files as mappers, and the (key, value) pairs are stored in no special order.
+  * When not specifying ''-r 0'' (i.e., using ''-r 1'' with ''IdentityReducer''), the job produces the same (key, value) pairs. But this time they are in one output file, sorted according to the key. Of course, the job runs slower in this case.
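The difference between the two modes can be modelled in a few lines of plain Java. This is only a toy sketch and does not use Hadoop: the class and method names are hypothetical, each inner list stands for the pairs one mapper emitted, and ''String'' ordering stands in for the ordering of ''Text'' keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.AbstractMap.SimpleEntry;

public class ReducerCountSketch {
    static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Models -r 0: one output part per mapper, pairs kept in emission order.
    static List<List<Map.Entry<String, String>>> mapOnly(
            List<List<Map.Entry<String, String>>> mapperOutputs) {
        List<List<Map.Entry<String, String>>> parts = new ArrayList<>();
        for (List<Map.Entry<String, String>> output : mapperOutputs) {
            parts.add(new ArrayList<>(output));
        }
        return parts;
    }

    // Models one identity reducer: all pairs end up in a single part, sorted by key.
    static List<Map.Entry<String, String>> singleIdentityReducer(
            List<List<Map.Entry<String, String>>> mapperOutputs) {
        List<Map.Entry<String, String>> merged = new ArrayList<>();
        for (List<Map.Entry<String, String>> output : mapperOutputs) {
            merged.addAll(output);
        }
        merged.sort(Map.Entry.comparingByKey());
        return merged;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, String>>> mapperOutputs = Arrays.asList(
                Arrays.asList(pair("b", "1"), pair("a", "2")),
                Arrays.asList(pair("c", "3"), pair("a", "4")));
        System.out.println(mapOnly(mapperOutputs));               // two parts, unsorted
        System.out.println(singleIdentityReducer(mapperOutputs)); // one part, sorted by key
    }
}
```

The sorting step is what the extra time buys: the same pairs come out, but merged into one file and ordered by key.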
+===== Counters =====
+As in the Perl API, a mapper (or a reducer) can increment various counters by using ''context.getCounter("Group", "Name").increment(value)'':
+<code java>
+public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
+  ...
+  context.getCounter("Group", "Name").increment(value);
+  ...
+}
+</code>
+The ''getCounter'' method returns a [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Counter.html|Counter]] object, so if a counter is incremented frequently, the ''getCounter'' method can be called only once:
+<code java>
+public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
+  ...
+  Counter words = context.getCounter("Mapper", "Number of words");
+  for (String word : value.toString().split("\\W+")) {
+    ...
+    words.increment(1);
+  }
+}
+</code>
+===== Exercise 2 =====
+Run a Hadoop job on /home/straka/wiki/cs-text-small that filters the documents so that only three-letter words remain. Also use counters to compute the histogram of word lengths and the percentage of three-letter words in the documents. You can download the template {{:courses:mapreduce-tutorial:step-24.txt|ThreeLetterWords.java}} and execute it.
+  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-24.txt' -O 'ThreeLetterWords.java'
+  # NOW VIEW THE FILE
+  # $EDITOR ThreeLetterWords.java
+  make -f /net/projects/hadoop/java/Makefile ThreeLetterWords.jar
+  rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop ThreeLetterWords.jar -r 0 /home/straka/wiki/cs-text-small step-24-out-sol
+  less step-24-out-sol/part-*
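Before filling in the template, the per-document logic of this task can be tried out in plain Java. The sketch below is Hadoop-free and is not the contents of the template; all names in it are hypothetical. It assumes that words are separated as in the ''split("\\W+")'' call of the Counters section, made Unicode-aware so that accented letters count as word characters, and that word length is measured in Unicode code points. In a real mapper, each histogram update would presumably become a ''context.getCounter(...).increment(1)'' call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class ThreeLetterWordsSketch {
    // Unicode-aware variant of "\\W+" (assumption: accented letters belong to words).
    private static final Pattern NON_WORD =
            Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);

    // Splits the text into words, dropping the empty token a leading separator produces.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : NON_WORD.split(text)) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Word length in Unicode code points rather than UTF-16 units.
    static int length(String word) {
        return word.codePointCount(0, word.length());
    }

    // Keeps only the three-letter words, joined by single spaces.
    static String keepThreeLetterWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String word : words(text)) {
            if (length(word) == 3) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    // Maps each word length to the number of words of that length.
    static SortedMap<Integer, Long> lengthHistogram(String text) {
        SortedMap<Integer, Long> histogram = new TreeMap<>();
        for (String word : words(text)) {
            histogram.merge(length(word), 1L, Long::sum);
        }
        return histogram;
    }

    // Percentage of three-letter words among all counted words.
    static double threeLetterPercentage(SortedMap<Integer, Long> histogram) {
        long total = 0;
        for (long count : histogram.values()) {
            total += count;
        }
        return total == 0 ? 0.0 : 100.0 * histogram.getOrDefault(3, 0L) / total;
    }

    public static void main(String[] args) {
        String document = "the quick brown fox, and a dog";
        SortedMap<Integer, Long> histogram = lengthHistogram(document);
        System.out.println(keepThreeLetterWords(document));
        System.out.println(histogram);
        System.out.println(threeLetterPercentage(histogram));
    }
}
```

Note that the percentage is computed from the histogram alone, so in a job it can be derived from the counter values after the run instead of being tracked separately.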
+----
+<html>
+<table style="width:100%">
+<tr>
+<td style="text-align:left; width: 33%; "></html>[[step-23|Step 23]]: Predefined formats and types.<html></td>
+<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
+<td style="text-align:right; width: 33%; "></html>[[step-25|Step 25]]: Reducers, combiners and partitioners.<html></td>
+</tr>
+</table>
+</html>


Institute of Formal and Applied Linguistics Wiki