
Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:step-31 [2012/02/06 08:50] straka
    
To run on a cluster with //C// machines using //C// mappers:
  rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Sum.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out
  less step-31-out/part-*
  
  # $EDITOR Statistics.java
  make -f /net/projects/hadoop/java/Makefile Statistics.java
  rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Statistics.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out
  less step-31-out/part-*

===== Exercise 2 =====

Implement an AllReduce job on ''/net/projects/hadoop/examples/inputs/numbers-small'' which computes the median of the input data. You can use the following iterative algorithm:
  * At the beginning, set //min<sub>1</sub>// = ''Integer.MIN_VALUE'', //max<sub>1</sub>// = ''Integer.MAX_VALUE'' and //index_to_find// = number_of_input_data / 2.
  * In step //i//, do the following:
    - Consider only the input keys in the range <//min<sub>i</sub>//, //max<sub>i</sub>//>.
    - Compute //split// = the ceiling of the mean of these keys.
    - If //index_to_find// is in the range <1 + number of keys less than //split//, number of keys less than or equal to //split//>, then //split// is the median.
    - Else, if //index_to_find// is at most the number of keys less than //split//, set //max<sub>i+1</sub>// = //split// - 1.
    - Else, set //min<sub>i+1</sub>// = //split// + 1 and subtract from //index_to_find// the number of keys less than or equal to //split//.
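The steps above can be sketched as a single-machine Java routine (a minimal sketch, assuming all keys fit in memory; ''MedianSketch'' and ''findKth'' are illustrative names, not part of the template). It searches for the key with a given 1-based rank, so the median of //n// keys is the key with rank (//n// + 1) / 2; the real AllReduce job would instead aggregate the counts and the sum of each step across all machines:

```java
public class MedianSketch {
    // Finds the key with the given 1-based rank by iteratively
    // narrowing the range <min, max>, as in the algorithm above.
    public static long findKth(int[] keys, long indexToFind) {
        long min = Integer.MIN_VALUE, max = Integer.MAX_VALUE;
        while (true) {
            // Pass 1: mean of the keys in the range <min, max>.
            long count = 0, sum = 0;
            for (int k : keys)
                if (k >= min && k <= max) { count++; sum += k; }
            long split = (long) Math.ceil(sum / (double) count);
            // Pass 2: rank of split within the range <min, max>.
            long less = 0, lessOrEqual = 0;
            for (int k : keys)
                if (k >= min && k <= max) {
                    if (k < split) less++;
                    if (k <= split) lessOrEqual++;
                }
            if (less + 1 <= indexToFind && indexToFind <= lessOrEqual)
                return split;            // split has the wanted rank
            else if (indexToFind <= less)
                max = split - 1;         // wanted key is below split
            else {
                min = split + 1;         // wanted key is above split
                indexToFind -= lessOrEqual;
            }
        }
    }
}
```

In the distributed version, only the two passes touch the data; the variables ''count'', ''sum'', ''less'' and ''lessOrEqual'' are exactly the quantities each machine would contribute to an AllReduce sum.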

You can download the template {{:courses:mapreduce-tutorial:step-31-exercise2.txt|Median.java}} and execute it using:
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-31-exercise2.txt' -O Median.java
  # NOW VIEW THE FILE
  # $EDITOR Median.java
  make -f /net/projects/hadoop/java/Makefile Median.java
  rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Median.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out
  less step-31-out/part-*

Solution: {{:courses:mapreduce-tutorial:step-31-solution2.txt|Median.java}}.

===== Exercise 3 =====

Implement an AllReduce job on ''/net/projects/hadoop/examples/inputs/points-small'' which implements the [[http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm|K-means clustering algorithm]]. See the [[.:step-15|K-means clustering exercise]] for a description of the input data.
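As a reference for what one iteration has to compute, here is a minimal single-machine sketch of the standard algorithm's update step (''KMeansSketch'' and ''step'' are illustrative names; in the AllReduce job, the per-cluster sums and counts would be aggregated across machines before the centers are updated):

```java
public class KMeansSketch {
    // One iteration of the standard K-means algorithm: assign each
    // point to its nearest center, then move every center to the
    // mean of its assigned points.
    public static double[][] step(double[][] points, double[][] centers) {
        int k = centers.length, dim = centers[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            // Find the nearest center by squared Euclidean distance.
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                double d = 0;
                for (int i = 0; i < dim; i++) {
                    double diff = p[i] - centers[c][i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) sums[best][i] += p[i];
        }
        // New centers; a cluster with no points keeps its old center.
        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int i = 0; i < dim; i++)
                updated[c][i] = counts[c] > 0 ? sums[c][i] / counts[c]
                                              : centers[c][i];
        return updated;
    }
}
```

The iteration is repeated until the centers stop moving (or a fixed number of rounds is reached); only ''sums'' and ''counts'' need to be AllReduced, since every machine can then update all centers identically.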

You can download the template {{:courses:mapreduce-tutorial:step-31-exercise3.txt|KMeans.java}}. This template uses two Hadoop properties:
  * ''clusters.num'' -- the number of clusters
  * ''clusters.file'' -- the file to read the initial clusters from
You can download and compile it using:
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-31-exercise3.txt' -O KMeans.java
  # NOW VIEW THE FILE
  # $EDITOR KMeans.java
  make -f /net/projects/hadoop/java/Makefile KMeans.java
You can run it using //C// machines on the following input data:
  * ''/net/projects/hadoop/examples/inputs/points-small'': <code>rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.jar -Dclusters.num=50 -Dclusters.file=/net/projects/hadoop/examples/inputs/points-small/points.txt -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/points-small C` /net/projects/hadoop/examples/inputs/points-small step-31-out</code>
  * ''/net/projects/hadoop/examples/inputs/points-medium'': <code>rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.jar -Dclusters.num=100 -Dclusters.file=/net/projects/hadoop/examples/inputs/points-medium/points.txt -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/points-medium C` /net/projects/hadoop/examples/inputs/points-medium step-31-out</code>
  * ''/net/projects/hadoop/examples/inputs/points-large'': <code>rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.jar -Dclusters.num=200 -Dclusters.file=/net/projects/hadoop/examples/inputs/points-large/points.txt -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/points-large C` /net/projects/hadoop/examples/inputs/points-large step-31-out</code>

Solution: {{:courses:mapreduce-tutorial:step-31-solution3.txt|KMeans.java}}.
  
