MapReduce Tutorial : K-means clustering
Implement the K-means clustering algorithm. You can use the following data:
Path | Number of points | Number of dimensions | Number of clusters |
---|---|---|---|
/home/straka/hadoop/example-inputs/points-small | 10000 | 50 | 50 |
/home/straka/hadoop/example-inputs/points-medium | 100000 | 100 | 100 |
/home/straka/hadoop/example-inputs/points-large | 500000 | 200 | 200 |
Iterative algorithms are usually implemented with one Hadoop job per iteration. The Hadoop input_path contains the input data, and each mapper also reads the current cluster centers. The reducers aggregate the data and output the new cluster centers. A controlling script executes the Hadoop jobs and stops iterating when the algorithm converges.
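The scheme described above can be sketched in plain Python, without any Hadoop API, to show how the work might be split between mapper, reducer and controlling script. All function names (`kmeans_map`, `kmeans_reduce`, `run_iteration`, `kmeans`) are hypothetical, and so are the convergence test (no center moves more than `epsilon`), the iteration cap, and the rule that an empty cluster keeps its old center; the tutorial does not prescribe any of these.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))


def kmeans_map(point, centers):
    """Mapper: emit (cluster index, (point, 1)) for one input point."""
    yield nearest(point, centers), (point, 1)


def kmeans_reduce(cluster, values, dims):
    """Reducer: average all points assigned to one cluster."""
    total = [0.0] * dims
    count = 0
    for point, n in values:
        total = [t + p for t, p in zip(total, point)]
        count += n
    return cluster, [t / count for t in total]


def run_iteration(points, centers):
    """Simulate one job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for point in points:
        for key, value in kmeans_map(point, centers):
            groups[key].append(value)
    # Assumption: a cluster that received no points keeps its old center.
    new_centers = list(centers)
    for cluster, values in groups.items():
        _, new_centers[cluster] = kmeans_reduce(cluster, values, len(centers[0]))
    return new_centers


def kmeans(points, centers, epsilon=1e-6, max_iterations=100):
    """Controlling loop: rerun the job until no center moves more than epsilon."""
    for _ in range(max_iterations):
        new_centers = run_iteration(points, centers)
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= epsilon:
            break
    return centers
```

In a real Hadoop setting, `run_iteration` would correspond to submitting one job, with the current centers distributed to every mapper (for example through a side file), and `kmeans` would be the controlling script. Emitting a count alongside each point is a design choice that would also allow a combiner to pre-sum points per cluster before the reduce step, provided the mapper output is changed to carry partial sums.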