====== MapReduce Tutorial : K-means clustering ======

Implement the [[http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm|K-means clustering algorithm]]. You can use the following data:

^ Path ^ Number of points ^ Number of dimensions ^ Number of clusters ^
| ''/net/projects/hadoop/examples/inputs/points-small'' | 10000 | 50 | 50 |
| ''/net/projects/hadoop/examples/inputs/points-medium'' | 100000 | 100 | 100 |
| ''/net/projects/hadoop/examples/inputs/points-large'' | 500000 | 200 | 200 |

When dealing with iterative algorithms, each iteration is usually implemented as one Hadoop job. The Hadoop ''input_path'' should contain the input data and each mapper should also read the current clusters. The reducers are used to aggregate the data and output new cluster centers. A controlling script should take care of executing Hadoop jobs and stopping the iteration when the algorithm converges.

----
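The structure described above (one job per iteration, mappers that read the current clusters, reducers that output new centers, and a controlling loop) can be sketched in plain Python. This is only an illustration under stated assumptions: it uses no Hadoop API, ''run_iteration'' merely simulates a job in memory, and every function name, the ''tolerance'' threshold, and the ''max_iterations'' limit are hypothetical choices rather than part of the assignment.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))


def map_point(point, centers):
    """Mapper: emit (cluster index, (point, 1)) for one input point."""
    yield nearest(point, centers), (point, 1)


def reduce_cluster(values):
    """Reducer: sum the (vector, count) pairs of one cluster and return the mean."""
    total, count = None, 0
    for vec, n in values:
        total = list(vec) if total is None else [t + v for t, v in zip(total, vec)]
        count += n
    return [t / count for t in total]


def run_iteration(points, centers):
    """In-memory stand-in for one Hadoop job: map, group by key, reduce."""
    groups = defaultdict(list)
    for point in points:
        for key, value in map_point(point, centers):
            groups[key].append(value)
    # Assumption: a cluster that received no points keeps its previous center.
    return [reduce_cluster(groups[i]) if i in groups else list(centers[i])
            for i in range(len(centers))]


def kmeans(points, centers, tolerance=1e-6, max_iterations=100):
    """Controlling loop: rerun the job until no center moves more than tolerance."""
    iterations = 0
    while iterations < max_iterations:
        new_centers = run_iteration(points, centers)
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        iterations += 1
        if shift <= tolerance:
            break
    return centers, iterations


if __name__ == "__main__":
    data = [[0, 0], [0, 2], [10, 10], [10, 12]]
    print(kmeans(data, [[0, 0], [10, 10]]))
```

Because the reducer adds up ''(vector, count)'' pairs instead of assuming every count is 1, the same summation could also serve as a combiner that pre-aggregates partial sums on the mapper side, which reduces the data shuffled between mappers and reducers. In a real run the controlling script would launch a Hadoop job in place of ''run_iteration'' and pass the current centers to the mappers, for example through a file.

----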
[[step-14|Step 14]]: N-gram language model. [[.|Overview]]