Implement the K-means clustering algorithm. You can use the following data:
Path | Number of points | Number of dimensions | Number of clusters |
---|---|---|---|
/net/projects/hadoop/examples/inputs/points-small | 10000 | 50 | 50 |
/net/projects/hadoop/examples/inputs/points-medium | 100000 | 100 | 100 |
/net/projects/hadoop/examples/inputs/points-large | 500000 | 200 | 200 |
When dealing with iterative algorithms, each iteration is usually implemented as one Hadoop job. The Hadoop input_path should contain the input data, and each mapper should also read the current cluster centers. The reducers aggregate the data and output the new cluster centers. A controlling script should execute the Hadoop jobs and stop iterating when the algorithm converges.
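The structure described above can be sketched in plain Python as an in-memory simulation of one job per iteration. This is only an illustration under stated assumptions: it does not use the Hadoop API, all function names (`map_point`, `reduce_cluster`, `run_iteration`, `kmeans`) are hypothetical, and the convergence test (no center moves more than `tol`) and the handling of empty clusters are just possible choices.

```python
import math
from collections import defaultdict


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    best, best_dist = 0, math.inf
    for i, center in enumerate(centers):
        d = sum((p - q) ** 2 for p, q in zip(point, center))
        if d < best_dist:
            best, best_dist = i, d
    return best


def map_point(point, centers):
    """Mapper: emit (cluster index, (point, 1)) for one input point."""
    return nearest(point, centers), (point, 1)


def reduce_cluster(values):
    """Reducer: average all points assigned to one cluster."""
    total, count = None, 0
    for vec, n in values:
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
        count += n
    return [x / count for x in total]


def run_iteration(points, centers):
    """Simulate one job: map every point, group by key, reduce each group."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centers)
        groups[key].append(value)
    # Assumption: a center that received no points keeps its old position.
    return [reduce_cluster(groups[i]) if i in groups else list(c)
            for i, c in enumerate(centers)]


def kmeans(points, centers, tol=1e-6, max_iter=100):
    """Controlling loop: rerun the job until no center moves more than tol."""
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_centers = run_iteration(points, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration
```

In a real Hadoop setting, `run_iteration` would correspond to submitting one job, with the current centers shipped to every mapper as a side file, and `kmeans` would be the controlling script. Emitting `(point, 1)` pairs rather than bare points keeps the values summable, so a combiner could pre-aggregate partial sums and counts before the reducer.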