MapReduce Tutorial : Exercise - N-gram language model
For a given N, create a simple N-gram language model. You can experiment with the following data:
| Path | Size |
|---|---|
| /home/straka/wiki/cs-seq-medium | 8MB |
| /home/straka/wiki/cs-seq | 82MB |
| /home/straka/wiki/en-seq | 1.9GB |
Your model should contain all the unigrams, bigrams, …, N-grams with the number of occurrences in the given corpus.
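Below is a minimal, framework-agnostic sketch of this counting step in Python, assuming the input is already tokenized into sentences of space-separated tokens; the `mapper`/`reducer` signatures and the local driver are illustrative, not the tutorial framework's actual API.

```python
N = 3  # order of the language model (an example value)

def mapper(key, value):
    """Emit every 1-gram, 2-gram, ..., N-gram of the sentence with count 1."""
    tokens = value.split()
    for n in range(1, N + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), 1

def reducer(ngram, counts):
    """Sum the partial counts of one n-gram."""
    yield ngram, sum(counts)

# Local usage example (no cluster needed): map, group by key, reduce.
if __name__ == "__main__":
    from collections import defaultdict
    sentences = ["the cat sat", "the cat ran"]
    grouped = defaultdict(list)
    for s in sentences:
        for ngram, c in mapper(None, s):
            grouped[ngram].append(c)
    for ngram in sorted(grouped):
        for ngram_out, total in reducer(ngram, grouped[ngram]):
            print(ngram_out, total)
```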
As the size of the resulting model matters, you should represent the N-grams efficiently. Try using the following representation:
- Find the unique words of the corpus and sort them according to the number of their occurrences (a sketch of this step follows below).
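
The following sketch covers only this first step: collecting the unique words and ordering them by decreasing occurrence count, so that frequent words end up with small ranks. The function and variable names are illustrative assumptions, not part of the exercise's framework.

```python
from collections import Counter

def frequency_sorted_vocabulary(sentences):
    """Return the unique words sorted by descending occurrence count (ties broken alphabetically)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return [word for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# Usage example: frequent words come first, so they receive the smallest indices.
if __name__ == "__main__":
    vocab = frequency_sorted_vocabulary(["the cat sat", "the cat ran"])
    print({word: rank for rank, word in enumerate(vocab)})
```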