
MapReduce Tutorial : K-means clustering

Implement the K-means clustering algorithm. You can use the following data:

Path	Number of points	Number of dimensions	Number of clusters
`/net/projects/hadoop/examples/inputs/points-small`	10000	50	50
`/net/projects/hadoop/examples/inputs/points-medium`	100000	100	100
`/net/projects/hadoop/examples/inputs/points-large`	500000	200	200

Iterative algorithms are usually implemented with one Hadoop job per iteration. The Hadoop input_path contains the input data, and each mapper additionally reads the current cluster centers. The reducers aggregate the data and output the new cluster centers. A controlling script runs the Hadoop jobs and stops iterating once the algorithm converges.
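The structure described above can be sketched in plain Python. This is only an in-memory simulation under stated assumptions, not Hadoop code: the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`, `run_kmeans`), the squared Euclidean distance, the `tolerance` convergence test and the rule of keeping an empty cluster's old center are all illustrative choices that the tutorial does not prescribe. In a real job, the grouping step would be done by the Hadoop shuffle and the driver loop would be the controlling script.

```python
import math
from collections import defaultdict


def nearest_center(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, center in enumerate(centers):
        dist = sum((p - c) ** 2 for p, c in zip(point, center))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def map_point(point, centers):
    """Mapper: emit (cluster index, (point, 1)) for a single input point."""
    return nearest_center(point, centers), (point, 1)


def reduce_cluster(values):
    """Reducer: average all points emitted for one cluster index."""
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [t + p for t, p in zip(total, point)]
        count += n
    return [t / count for t in total]


def kmeans_iteration(points, centers):
    """One iteration, i.e. what a single Hadoop job would compute."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centers)
        groups[key].append(value)
    # Assumption: a cluster that received no points keeps its old center.
    return [
        reduce_cluster(groups[i]) if i in groups else list(centers[i])
        for i in range(len(centers))
    ]


def run_kmeans(points, centers, tolerance=1e-6, max_iterations=100):
    """Controlling loop: repeat iterations until no center moves more than tolerance."""
    for iteration in range(1, max_iterations + 1):
        new_centers = kmeans_iteration(points, centers)
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return centers, iteration
```

For example, `run_kmeans([[0, 0], [0, 2], [10, 10], [10, 12]], [[0, 0], [10, 10]])` moves the two centers to the means of their groups in the first iteration and detects convergence in the second. On the real data sets, emitting partial sums from the mappers (or from a combiner) instead of individual points would reduce the amount of data sent to the reducers.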


Institute of Formal and Applied Linguistics Wiki
