MapReduce Tutorial : Exercise - N-gram language model

For a given N, create a simple N-gram language model. You can start experimenting on the following data:

Path                             Size
/home/straka/wiki/cs-seq-medium  8MB
/home/straka/wiki/cs-seq         82MB
/home/straka/wiki/en-seq         1.9GB

Your model should contain all unigrams, bigrams, …, N-grams together with their numbers of occurrences in the given corpus.
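
A minimal sketch of the map and reduce logic, assuming a plain-text corpus with one sentence per line; N = 3, the toy corpus and the function names are illustrative choices, and the local sort only stands in for the MapReduce shuffle:

  # Sketch of the map and reduce steps for counting all 1..N grams.
  # The corpus format (one sentence per line), N = 3 and the toy data
  # below are illustrative assumptions, not part of the exercise.
  from itertools import groupby
  from operator import itemgetter

  N = 3  # order of the language model

  def map_ngrams(line):
      """Emit (ngram, 1) for every 1..N gram starting at each position."""
      words = line.split()
      for i in range(len(words)):
          for n in range(1, N + 1):
              if i + n > len(words):
                  break
              yield " ".join(words[i:i + n]), 1

  def reduce_counts(sorted_pairs):
      """Sum the counts of each distinct N-gram; input is sorted by key."""
      for ngram, group in groupby(sorted_pairs, key=itemgetter(0)):
          yield ngram, sum(count for _, count in group)

  # Local stand-in for the MapReduce shuffle: map, sort by key, reduce.
  corpus = ["a rose is a rose", "a rose is red"]
  mapped = [pair for line in corpus for pair in map_ngrams(line)]
  for ngram, count in reduce_counts(sorted(mapped)):
      print(count, ngram)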

As the size of the resulting model matters, you should represent the N-grams efficiently. You can devise your own format, or you can use the following representation:

Try creating such an index. Ideally, the sizes of the resulting data files should be as equal as possible.
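
How balanced the output files are depends on how the N-grams are split among reducers. One possible approach (an assumption, not the representation the exercise has in mind) is to partition by a hash of the whole N-gram, which spreads the keys roughly uniformly; if the files should also stay globally sorted, a sampling-based partitioner as in the preceding sorting exercise is an alternative.

  # Hypothetical partitioning helper: assign each N-gram to one of
  # num_partitions output files by hashing the whole key. This tends to
  # equalize file sizes, but the keys are no longer globally sorted.
  import zlib

  def partition(ngram, num_partitions):
      """Return the index of the reducer/output file for this N-gram."""
      # crc32 is stable across runs, unlike Python's built-in hash().
      return zlib.crc32(ngram.encode("utf-8")) % num_partitions

  print(partition("a rose is", 16))  # which of 16 output files gets this key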



1) You are free to choose a better constant :–)