MapReduce Tutorial: K-means clustering

Implement the K-means clustering algorithm. You can use the following data:

Path                                                  Points   Dimensions  Clusters
/net/projects/hadoop/examples/inputs/points-small     10000    50          50
/net/projects/hadoop/examples/inputs/points-medium    100000   100         100
/net/projects/hadoop/examples/inputs/points-large     500000   200         200

When dealing with iterative algorithms, each iteration is usually implemented as one Hadoop job. The job's input_path should contain the input points, and each mapper should additionally read the current cluster centers. The reducers aggregate the points assigned to each cluster and output the new cluster centers. A controlling script executes the Hadoop jobs one after another and stops iterating once the algorithm converges, for example when the cluster centers no longer move between two consecutive iterations.
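
The division of work described above can be sketched in plain Python, without any Hadoop API. This is only an illustration under several assumptions: the function names (nearest, kmeans_map, kmeans_reduce, run_iteration, kmeans), the epsilon threshold and the max_iterations limit are all hypothetical, and run_iteration merely stands in for submitting one real Hadoop job. In an actual job the mapper would load the current centers from a file rather than receive them as an argument.

```python
from collections import defaultdict


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for i, center in enumerate(centers):
        dist = sum((p - c) ** 2 for p, c in zip(point, center))
        if dist < best_dist:
            best, best_dist = i, dist
    return best


def kmeans_map(points, centers):
    """Mapper: for every input point emit (cluster index, (point, 1))."""
    for point in points:
        yield nearest(point, centers), (point, 1)


def kmeans_reduce(cluster, values):
    """Reducer: average all points assigned to one cluster."""
    total, count = None, 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return cluster, [x / count for x in total]


def run_iteration(points, centers):
    """Stand-in for one Hadoop job: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in kmeans_map(points, centers):
        groups[key].append(value)
    # Assumption: a cluster that received no points keeps its old center.
    new_centers = [list(c) for c in centers]
    for key, values in groups.items():
        _, new_centers[key] = kmeans_reduce(key, values)
    return new_centers


def kmeans(points, centers, epsilon=1e-6, max_iterations=100):
    """Controlling loop: run one job per iteration until the centers stop moving."""
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        new_centers = run_iteration(points, centers)
        # Largest distance any center moved in this iteration.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift < epsilon:
            break
    return centers, iteration
```

Calling kmeans(points, initial_centers) returns the final centers together with the number of iterations that were run. Because the mapper emits partial sums of the form (point, 1), the same summation could also serve as a combiner to reduce the data sent to the reducers; whether that pays off depends on the data set.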


