courses:mapreduce-tutorial:step-16 [2012/02/06 09:12] straka |
courses:mapreduce-tutorial:step-16 [2012/02/06 13:29] (current) straka |
| |
To run on a cluster with //C// machines using //C// mappers: | To run on a cluster with //C// machines using //C// mappers: |
rm -rf step-16-out; perl sum.pl -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-16-out | rm -rf step-16-out; M=#of_machines; INPUT=/net/projects/hadoop/examples/inputs/numbers-small; perl sum.pl -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-16-out |
less step-16-out/part-* | less step-16-out/part-* |
| |
# NOW VIEW THE FILE | # NOW VIEW THE FILE |
# $EDITOR statistics.pl | # $EDITOR statistics.pl |
rm -rf step-16-out; perl statistics.pl -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-16-out | rm -rf step-16-out; M=#of_machines; INPUT=/net/projects/hadoop/examples/inputs/numbers-small; perl statistics.pl -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-16-out |
less step-16-out/part-* | less step-16-out/part-* |
| |
- Else, set //min<sub>i+1</sub>// = //split//+1 and subtract from //index_to_find// the number of keys less than or equal to //split//. | - Else, set //min<sub>i+1</sub>// = //split//+1 and subtract from //index_to_find// the number of keys less than or equal to //split//. |
| |
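The narrowing step above can be illustrated with a minimal single-process sketch. It is not the course template: the function name ''find_kth'', the midpoint choice of //split//, the half-open handling of the other branch, and the assumption that keys are integers within a known range are all illustrative guesses. On a cluster, each machine would count only its own keys and the counts would be summed with an allreduce; here one list stands in for all machines.

```python
def find_kth(keys, index_to_find, lo, hi):
    """Return the key at 0-based sorted position index_to_find.

    Assumes (hypothetically) that all keys are integers in [lo, hi].
    lo and hi play the role of min_i and max_i in the text.
    """
    while lo < hi:
        split = (lo + hi) // 2
        # Number of keys in the current range that are <= split.
        # In a distributed job this would be a sum over per-machine counts.
        count = sum(1 for k in keys if lo <= k <= split)
        if index_to_find < count:
            # The wanted key lies in the lower part of the range.
            hi = split
        else:
            # The wanted key lies above split: skip the counted keys.
            lo = split + 1
            index_to_find -= count
    return lo


if __name__ == "__main__":
    numbers = [5, 1, 9, 3, 7]
    # The median of an odd-sized list sits at index len // 2.
    print(find_kth(numbers, len(numbers) // 2, min(numbers), max(numbers)))
```

Each iteration halves the range, so the number of passes over the data grows with the logarithm of the key range rather than with the number of keys.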
You can download the template {{:courses:mapreduce-tutorial:step-31-exercise2.txt|Median.java}} and execute it using: | You can download the template {{:courses:mapreduce-tutorial:step-16-exercise2.txt|median.pl}} and execute it using: |
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-31-exercise2.txt' -O Median.java | wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-16-exercise2.txt' -O median.pl |
# NOW VIEW THE FILE | # NOW VIEW THE FILE |
# $EDITOR Median.java | # $EDITOR median.pl |
make -f /net/projects/hadoop/java/Makefile Median.java | rm -rf step-16-out; M=#of_machines; INPUT=/net/projects/hadoop/examples/inputs/numbers-small; perl median.pl -c $M `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-16-out |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop Median.jar -c C `/net/projects/hadoop/bin/compute-splitsize /net/projects/hadoop/examples/inputs/numbers-small C` /net/projects/hadoop/examples/inputs/numbers-small step-31-out | less step-16-out/part-* |
less step-31-out/part-* | |
| |
Solution: {{:courses:mapreduce-tutorial:step-31-solution2.txt|Median.java}}. | Solution: {{:courses:mapreduce-tutorial:step-16-solution2.txt|median.pl}}. |
| |
===== Exercise 3 ===== | ===== Exercise 3 ===== |
Implement an AllReduce job on ''/net/projects/hadoop/examples/inputs/points-small'', which implements the [[http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm|K-means clustering algorithm]]. See [[.:step-15|K-means clustering exercise]] for a description of the input data. | Implement an AllReduce job on ''/net/projects/hadoop/examples/inputs/points-small'', which implements the [[http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm|K-means clustering algorithm]]. See [[.:step-15|K-means clustering exercise]] for a description of the input data. |
| |
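As a rough illustration of how one K-means iteration maps onto an allreduce, the sketch below simulates it in a single process. Everything in it is an assumption made for illustration: the function names ''local_stats'', ''allreduce_sum'' and ''kmeans_step'' are hypothetical, points are taken to be tuples of floats, and the allreduce is simulated by summing the per-machine results in a loop. The actual template and input format may differ.

```python
def local_stats(points, centers):
    """Per-machine work: sum and count the points nearest to each center."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        # Index of the center with the smallest squared Euclidean distance.
        nearest = min(
            range(len(centers)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def allreduce_sum(stats):
    """Simulated allreduce: element-wise sum of every machine's statistics."""
    total_sums = [row[:] for row in stats[0][0]]
    total_counts = stats[0][1][:]
    for sums, counts in stats[1:]:
        for c in range(len(total_counts)):
            total_counts[c] += counts[c]
            for d in range(len(total_sums[c])):
                total_sums[c][d] += sums[c][d]
    return total_sums, total_counts


def kmeans_step(partitions, centers):
    """One iteration: local statistics, allreduce, then new centers."""
    sums, counts = allreduce_sum([local_stats(p, centers) for p in partitions])
    # A cluster that received no points keeps its previous center.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centers[c])
        for c in range(len(centers))
    ]


if __name__ == "__main__":
    # Two simulated machines, each holding part of the points.
    partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
    print(kmeans_step(partitions, [(0.0, 0.0), (10.0, 10.0)]))
```

Only the per-cluster sums and counts cross machine boundaries, so the amount of data exchanged per iteration depends on the number of clusters and dimensions, not on the number of points.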
You can download the template {{:courses:mapreduce-tutorial:step-31-exercise3.txt|KMeans.java}}. This template uses two Hadoop properties: | You can download the template {{:courses:mapreduce-tutorial:step-16-exercise3.txt|kmeans.pl}}. This template uses two environment variables: |
* ''clusters.num'' -- number of clusters | * ''CLUSTERS_NUM'' -- number of clusters |
* ''clusters.file'' -- file where to read the initial clusters from | * ''CLUSTERS_FILE'' -- file where to read the initial clusters from |
You can download and compile it using: | You can download and compile it using: |
wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-31-exercise3.txt' -O KMeans.java.java | wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-16-exercise3.txt' -O kmeans.pl |
# NOW VIEW THE FILE | # NOW VIEW THE FILE |
# $EDITOR KMeans.java.java | # $EDITOR kmeans.pl |
make -f /net/projects/hadoop/java/Makefile KMeans.java.java | |
You can run it using a specified number of machines on the following input data: | You can run it using a specified number of machines on the following input data: |
* ''/net/projects/hadoop/examples/inputs/points-small'': | * ''/net/projects/hadoop/examples/inputs/points-small'': |
<code>M=machines; K=50; INPUT=/net/projects/hadoop/examples/inputs/points-small/points.txt | <code>M=#of_machines; export CLUSTERS_NUM=50 CLUSTERS_FILE=/net/projects/hadoop/examples/inputs/points-small/points.txt |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.java.jar -Dclusters.num=$K -Dclusters.file=$INPUT [-jt jobtracker | -c $M] `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out</code> | rm -rf step-16-out; perl kmeans.pl -c $M `/net/projects/hadoop/bin/compute-splitsize $CLUSTERS_FILE $M` $CLUSTERS_FILE step-16-out</code> |
* ''/net/projects/hadoop/examples/inputs/points-medium'': | * ''/net/projects/hadoop/examples/inputs/points-medium'': |
<code>M=machines; K=100; INPUT=/net/projects/hadoop/examples/inputs/points-medium/points.txt | <code>M=#of_machines; export CLUSTERS_NUM=100 CLUSTERS_FILE=/net/projects/hadoop/examples/inputs/points-medium/points.txt |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.java.jar -Dclusters.num=$K -Dclusters.file=$INPUT [-jt jobtracker | -c $M] `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out</code> | rm -rf step-16-out; perl kmeans.pl -c $M `/net/projects/hadoop/bin/compute-splitsize $CLUSTERS_FILE $M` $CLUSTERS_FILE step-16-out</code> |
* ''/net/projects/hadoop/examples/inputs/points-large'': | * ''/net/projects/hadoop/examples/inputs/points-large'': |
<code>M=machines; K=200; INPUT=/net/projects/hadoop/examples/inputs/points-large/points.txt | <code>M=#of_machines; export CLUSTERS_NUM=200 CLUSTERS_FILE=/net/projects/hadoop/examples/inputs/points-large/points.txt |
rm -rf step-31-out; /net/projects/hadoop/bin/hadoop KMeans.java.jar -Dclusters.num=$K -Dclusters.file=$INPUT [-jt jobtracker | -c $M] `/net/projects/hadoop/bin/compute-splitsize $INPUT $M` $INPUT step-31-out</code> | rm -rf step-16-out; perl kmeans.pl -c $M `/net/projects/hadoop/bin/compute-splitsize $CLUSTERS_FILE $M` $CLUSTERS_FILE step-16-out</code> |
| |
Solution: {{:courses:mapreduce-tutorial:step-31-solution3.txt|KMeans.java}}. | Solution: {{:courses:mapreduce-tutorial:step-16-solution3.txt|kmeans.pl}}. A much faster solution, with the distance computations written in C: {{:courses:mapreduce-tutorial:step-16-solution3_c.txt|kmeans_C.pl}}. |
| |