Institute of Formal and Applied Linguistics Wiki



MapReduce Tutorial : K-means clustering

Implement the K-means clustering algorithm. You can use the following data:

Path                                               Points   Dimensions  Clusters
/home/straka/hadoop/example-inputs/points-small    10000    50          50
/home/straka/hadoop/example-inputs/points-medium   100000   100         100
/home/straka/hadoop/example-inputs/points-large    500000   200         200

Iterative algorithms are usually implemented with one Hadoop job per iteration. For K-means, the Hadoop input_path contains the input points, and each mapper additionally reads the current cluster centers. The reducers aggregate the data and output the new cluster centers. A controlling script executes the Hadoop jobs and stops iterating once the algorithm converges.
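
The iteration scheme described above can be sketched in plain Python. This is only an in-memory simulation, not Hadoop code: the function names (kmeans_map, kmeans_reduce, run_iteration, kmeans), the tolerance and max_iterations parameters, and the decision to keep the old center of a cluster that receives no points are all assumptions made for illustration. In a real setup, run_iteration would correspond to one Hadoop job and kmeans to the controlling script.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_map(point, centers):
    """Mapper: emit (index of nearest center, (point, 1))."""
    return nearest(point, centers), (point, 1)


def kmeans_reduce(values):
    """Reducer: sum the points of one cluster and return their mean."""
    total, count = None, 0
    for vector, n in values:
        if total is None:
            total = list(vector)
        else:
            total = [t + v for t, v in zip(total, vector)]
        count += n
    return [t / count for t in total]


def run_iteration(points, centers):
    """Simulate one job: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centers)
        groups[key].append(value)
    # Assumption: a cluster that received no points keeps its old center.
    return [
        kmeans_reduce(groups[i]) if i in groups else list(centers[i])
        for i in range(len(centers))
    ]


def kmeans(points, centers, tolerance=1e-9, max_iterations=100):
    """Controlling loop: repeat iterations until no center moves more than tolerance."""
    for _ in range(max_iterations):
        new_centers = run_iteration(points, centers)
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return centers
```

Emitting (point, 1) pairs rather than bare points keeps the reducer's input in the same shape as a partial sum, so the same summation could also be applied earlier as a combiner before the final division.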

