MapReduce Tutorial : Exercise - N-gram language model
For a given N, create a simple N-gram language model. You can experiment with the following data:
| Path | Size |
|---|---|
| /home/straka/wiki/cs-seq-medium | 8MB |
| /home/straka/wiki/cs-seq | 82MB |
| /home/straka/wiki/en-seq | 1.9GB |
Your model should contain all the unigrams, bigrams, …, N-grams with the number of occurrences in the given corpus.
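Below is a minimal, framework-agnostic sketch of this counting step in Python, assuming the input is already tokenized into sentences of space-separated tokens; the `mapper`/`reducer` signatures and the local driver are illustrative, not the tutorial framework's actual API.

```python
N = 3  # order of the language model (an example value)

def mapper(key, value):
    """Emit every 1-gram, 2-gram, ..., N-gram of the sentence with count 1."""
    tokens = value.split()
    for n in range(1, N + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]), 1

def reducer(ngram, counts):
    """Sum the partial counts of one n-gram."""
    yield ngram, sum(counts)

# Local usage example (no cluster needed): map, group by key, reduce.
if __name__ == "__main__":
    from collections import defaultdict
    sentences = ["the cat sat", "the cat ran"]
    grouped = defaultdict(list)
    for s in sentences:
        for ngram, c in mapper(None, s):
            grouped[ngram].append(c)
    for ngram in sorted(grouped):
        for ngram_out, total in reducer(ngram, grouped[ngram]):
            print(ngram_out, total)
```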
As the size of the resulting model matters, you should represent the N-grams efficiently. Try using the following representation:
- Find the unique words of the corpus and sort them according to the number of their occurrences (a sketch of this step follows below).
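
The following sketch covers only this first step: collecting the unique words and ordering them by decreasing occurrence count, so that frequent words end up with small ranks. The function and variable names are illustrative assumptions, not part of the exercise's framework.

```python
from collections import Counter

def frequency_sorted_vocabulary(sentences):
    """Return the unique words sorted by descending occurrence count (ties broken alphabetically)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return [word for word, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# Usage example: frequent words come first, so they receive the smallest indices.
if __name__ == "__main__":
    vocab = frequency_sorted_vocabulary(["the cat sat", "the cat ran"])
    print({word: rank for rank, word in enumerate(vocab)})
```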