Both sides previous revision
Previous revision
Next revision
|
Previous revision
|
courses:mapreduce-tutorial:step-31 [2012/02/06 08:40] straka |
courses:mapreduce-tutorial:step-31 [2012/02/06 14:52] (current) dusek |
It is crucial that all the mappers run simultaneously. This can be achieved using the ''/net/projects/hadoop/bin/compute-splitsize'' script: for given Hadoop input and requested number of mappers, it computes the appropriate splitsize. | It is crucial that all the mappers run simultaneously. This can be achieved using the ''/net/projects/hadoop/bin/compute-splitsize'' script: for given Hadoop input and requested number of mappers, it computes the appropriate splitsize. |
| |
When the computation finishes, only one of the mappers should print the results, as all of them have the same results. For simplicity, the ''cooperate'' method has ''boolean shouldWrite'' argument, which is set in exactly one mapper. | When the computation finishes, only one of the mappers should print the results, as all of them have the same results. For simplicity, the ''cooperate'' method has ''boolean writeResults'' argument, which is set in exactly one mapper. |
| |
===== Example ===== | ===== Example ===== |
This example reads the keys of ''/net/projects/hadoop/examples/inputs/numbers-small'', computes the sum of all the keys and print it: | This example reads the keys of ''/net/projects/hadoop/examples/inputs/numbers-small'', computes the sum of all the keys and prints it: |
<code java Sum.java> | <code java Sum.java> |
| import java.io.IOException; |
| |
| import org.apache.hadoop.conf.*; |
| import org.apache.hadoop.fs.Path; |
| import org.apache.hadoop.io.*; |
import org.apache.hadoop.mapreduce.*; | import org.apache.hadoop.mapreduce.*; |
import org.apache.hadoop.mapreduce.lib.allreduce.*; | import org.apache.hadoop.mapreduce.lib.allreduce.*; |
You can run the example locally using: | You can run the example locally using: |
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_export/code/courses:mapreduce-tutorial:step-31?codeblock=0' -O Sum.java | wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_export/code/courses:mapreduce-tutorial:step-31?codeblock=0' -O Sum.java |
make -f /net/projects/hadoop/java/Makefile Sum.java | make -f /net/projects/hadoop/java/Makefile Sum.jar |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Sum.jar /net/projects/hadoop/examples/inputs/numbers-small step-31-out | rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Sum.jar /net/projects/hadoop/examples/inputs/numbers-small step-31-out |
less step-31-out/part-* | less step-31-out/part-* |
| |
To run on a cluster with //C// machines using //C// mappers: | To run on a cluster using specified number of machines: |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Sum.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out | rm -rf step-31-out; M=#of_machines; INPUT=/net/projects/hadoop/examples/inputs/numbers-small; /net/projects/hadoop/bin/hadoop Sum.jar -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out |
less step-31-out/part-* | less step-31-out/part-* |
| |
# NOW VIEW THE FILE | # NOW VIEW THE FILE |
# $EDITOR Statistics.java | # $EDITOR Statistics.java |
make -f /net/projects/hadoop/java/Makefile Statistics.java | make -f /net/projects/hadoop/java/Makefile Statistics.jar |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Statistics.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out | rm -rf step-31-out; M=#of_machines; INPUT=/net/projects/hadoop/examples/inputs/numbers-small; /net/projects/hadoop/bin/hadoop Statistics.jar -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out |
less step-31-out/part-* | less step-31-out/part-* |
| |
# NOW VIEW THE FILE | # NOW VIEW THE FILE |
# $EDITOR Median.java | # $EDITOR Median.java |
make -f /net/projects/hadoop/java/Makefile Median.java | make -f /net/projects/hadoop/java/Makefile Median.jar |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Median.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out | rm -rf step-31-out; M=#of_machines; INPUT=/net/projects/hadoop/examples/inputs/numbers-small; /net/projects/hadoop/bin/hadoop Median.jar -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out |
less step-31-out/part-* | less step-31-out/part-* |
| |
===== Exercise 3 ===== | ===== Exercise 3 ===== |
| |
Implement an AllReduce job on ''/net/projects/hadoop/examples/inputs/numbers-small'', which computes | Implement an AllReduce job on ''/net/projects/hadoop/examples/inputs/points-small'', which implements the [[http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm|K-means clustering algorithm]]. See [[.:step-15|K-means clustering exercise]] for description of input data. |
| |
You can download the template {{:courses:mapreduce-tutorial:step-31-exercise3.txt|Median.java}} and execute it using: | You can download the template {{:courses:mapreduce-tutorial:step-31-exercise3.txt|KMeans.java}}. This template uses two Hadoop properties: |
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-31-exercise3.txt' -O KMeans.java.java | * ''clusters.num'' -- number of clusters |
| * ''clusters.file'' -- file where to read the initial clusters from |
| You can download and compile it using: |
| wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-31-exercise3.txt' -O KMeans.java |
# NOW VIEW THE FILE | # NOW VIEW THE FILE |
# $EDITOR KMeans.java.java | # $EDITOR KMeans.java |
make -f /net/projects/hadoop/java/Makefile KMeans.java.java | make -f /net/projects/hadoop/java/Makefile KMeans.jar |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.java.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out | You can run it using specified number of machines on the following input data: |
less step-31-out/part-* | * ''/net/projects/hadoop/examples/inputs/points-small'': |
| <code>M=#of_machines; K=50; INPUT=/net/projects/hadoop/examples/inputs/points-small/points.txt |
| rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.jar -Dclusters.num=$K -Dclusters.file=$INPUT -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out</code> |
| * ''/net/projects/hadoop/examples/inputs/points-medium'': |
| <code>M=#of_machines; K=100; INPUT=/net/projects/hadoop/examples/inputs/points-medium/points.txt |
| rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.jar -Dclusters.num=$K -Dclusters.file=$INPUT -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out</code> |
| * ''/net/projects/hadoop/examples/inputs/points-large'': |
| <code>M=#of_machines; K=200; INPUT=/net/projects/hadoop/examples/inputs/points-large/points.txt |
| rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.jar -Dclusters.num=$K -Dclusters.file=$INPUT -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out</code> |
| |
Solution: {{:courses:mapreduce-tutorial:step-31-solution3.txt|KMeans.java}}. | Solution: {{:courses:mapreduce-tutorial:step-31-solution3.txt|KMeans.java}}. |
| |