Differences
This shows you the differences between two versions of the page.
| Both sides previous revision Previous revision Next revision | Previous revision | ||
| courses:mapreduce-tutorial:step-31 [2012/02/06 08:03] straka | courses:mapreduce-tutorial:step-31 [2012/02/06 14:52] (current) dusek | ||
|---|---|---|---|
| Line 26: | Line 26: | ||
| * '' | * '' | ||
| The '' | The '' | ||
| - | * REDUCE_ADD | + | * ''REDUCE_ADD'' |
| - | * REDUCE_MIN | + | * ''REDUCE_MIN'' |
| - | * REDUCE_MAX | + | * ''REDUCE_MAX'' |
| + | |||
| + | It is crucial that all the mappers run simultaneously. This can be achieved using the ''/ | ||
| + | |||
| + | When the computation finishes, only one of the mappers should print the results, as all of them have the same results. For simplicity, the '' | ||
| + | |||
| + | ===== Example ===== | ||
| + | This example reads the keys of ''/ | ||
| + | <code java Sum.java> | ||
| + | import java.io.IOException; | ||
| + | |||
| + | import org.apache.hadoop.conf.*; | ||
| + | import org.apache.hadoop.fs.Path; | ||
| + | import org.apache.hadoop.io.*; | ||
| + | import org.apache.hadoop.mapreduce.*; | ||
| + | import org.apache.hadoop.mapreduce.lib.allreduce.*; | ||
| + | import org.apache.hadoop.mapreduce.lib.input.*; | ||
| + | import org.apache.hadoop.mapreduce.lib.output.*; | ||
| + | import org.apache.hadoop.util.*; | ||
| + | |||
| + | public class Sum extends Configured implements Tool { | ||
| + | public static class TheMapper extends AllReduceMapper< | ||
| + | int[] points = new int[64]; | ||
| + | int points_num = 0; | ||
| + | |||
| + | public void map(Text key, Text value, Context context) throws IOException, | ||
| + | if (points_num == points.length) { | ||
| + | int[] new_points = new int[2*points_num]; | ||
| + | System.arraycopy(points, | ||
| + | points = new_points; | ||
| + | } | ||
| + | |||
| + | points[points_num++] = Integer.parseInt(key.toString()); | ||
| + | } | ||
| + | |||
| + | public void cooperate(Context context, boolean writeResults) throws IOException, | ||
| + | double sum = 0; | ||
| + | for (int i = 0; i < points_num; i++) sum += points[i]; | ||
| + | |||
| + | double total_sum = allReduce(context, | ||
| + | |||
| + | if (writeResults) context.write(new DoubleWritable(total_sum), | ||
| + | } | ||
| + | } | ||
| + | |||
| + | // Job configuration | ||
| + | public int run(String[] args) throws Exception { | ||
| + | if (args.length < 2) { | ||
| + | System.err.printf(" | ||
| + | return 1; | ||
| + | } | ||
| + | |||
| + | Job job = new Job(getConf(), | ||
| + | |||
| + | job.setJarByClass(this.getClass()); | ||
| + | job.setMapperClass(TheMapper.class); | ||
| + | AllReduce.init(job); | ||
| + | job.setOutputKeyClass(DoubleWritable.class); | ||
| + | job.setOutputValueClass(NullWritable.class); | ||
| + | |||
| + | job.setInputFormatClass(KeyValueTextInputFormat.class); | ||
| + | |||
| + | FileInputFormat.addInputPath(job, | ||
| + | FileOutputFormat.setOutputPath(job, | ||
| + | |||
| + | return job.waitForCompletion(true) ? 0 : 1; | ||
| + | } | ||
| + | |||
| + | // Main method | ||
| + | public static void main(String[] args) throws Exception { | ||
| + | int res = ToolRunner.run(new Sum(), args); | ||
| + | |||
| + | System.exit(res); | ||
| + | } | ||
| + | } | ||
| + | </ | ||
| + | |||
| + | You can run the example locally using: | ||
| + | wget --no-check-certificate ' | ||
| + | make -f / | ||
| + | rm -rf step-31-out; | ||
| + | less step-31-out/ | ||
| + | |||
+ | To run on a cluster using a specified number of machines: | ||
| + | rm -rf step-31-out; | ||
| + | less step-31-out/ | ||
| + | |||
| + | ===== Exercise 1 ===== | ||
| + | |||
| + | Implement an AllReduce job on ''/ | ||
| + | * number of keys | ||
| + | * mean of the keys | ||
| + | * variance of the keys | ||
| + | * minimum of the keys | ||
| + | * maximum of the keys | ||
| + | You can download the template {{: | ||
| + | wget --no-check-certificate ' | ||
| + | # NOW VIEW THE FILE | ||
| + | # $EDITOR Statistics.java | ||
| + | make -f / | ||
| + | rm -rf step-31-out; | ||
| + | less step-31-out/ | ||
| + | |||
| + | ===== Exercise 2 ===== | ||
| + | |||
| + | Implement an AllReduce job on ''/ | ||
| + | * At the beginning, set // | ||
| + | * In step //i//, do the following: | ||
| + | - Consider only input keys in range <// | ||
| + | - Compute //split// = ceiling of mean of the keys. | ||
| + | - If the // | ||
| + | - Else, if // | ||
| + | - Else, set // | ||
| + | |||
| + | You can download the template {{: | ||
| + | wget --no-check-certificate ' | ||
| + | # NOW VIEW THE FILE | ||
| + | # $EDITOR Median.java | ||
| + | make -f / | ||
| + | rm -rf step-31-out; | ||
| + | less step-31-out/ | ||
| + | |||
| + | Solution: {{: | ||
| + | |||
| + | ===== Exercise 3 ===== | ||
| + | |||
| + | Implement an AllReduce job on ''/ | ||
| + | |||
| + | You can download the template {{: | ||
| + | * '' | ||
| + | * '' | ||
| + | You can download and compile it using: | ||
| + | wget --no-check-certificate ' | ||
| + | # NOW VIEW THE FILE | ||
| + | # $EDITOR KMeans.java | ||
| + | make -f / | ||
+ | You can run it using a specified number of machines on the following input data: | ||
| + | * ''/ | ||
| + | < | ||
| + | rm -rf step-31-out; | ||
| + | * ''/ | ||
| + | < | ||
| + | rm -rf step-31-out; | ||
| + | * ''/ | ||
| + | < | ||
| + | rm -rf step-31-out; | ||
| + | |||
| + | Solution: {{: | ||
