
Institute of Formal and Applied Linguistics Wiki





MapReduce Tutorial : Exercise - N-gram language model

For a given N, create a simple N-gram language model. You can experiment on the following data:

Path                              Size
/home/straka/wiki/cs-seq-medium   8MB
/home/straka/wiki/cs-seq          82MB
/home/straka/wiki/en-seq          1.9GB

Your model should contain all the unigrams, bigrams, …, N-grams, each with its number of occurrences in the given corpus.
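The counting step can be sketched as a map phase that emits every k-gram (k = 1..N) with a count of 1, followed by a reduce phase that sums the counts per k-gram. The sketch below is plain Python rather than the tutorial's MapReduce framework, and it rests on assumptions not stated in the exercise: the input is available as lines of text, tokens are separated by whitespace, and N-grams do not cross line boundaries. The function names (map_ngrams, reduce_counts, build_model) are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

NGram = Tuple[str, ...]


def map_ngrams(line: str, n: int) -> Iterator[Tuple[NGram, int]]:
    """Map phase: emit (k-gram, 1) for every k in 1..n at every position."""
    words = line.split()  # assumption: whitespace tokenization
    for start in range(len(words)):
        for k in range(1, n + 1):
            if start + k > len(words):
                break  # k-gram would run past the end of the line
            yield tuple(words[start:start + k]), 1


def reduce_counts(pairs: Iterable[Tuple[NGram, int]]) -> Counter:
    """Reduce phase: sum the emitted counts for each distinct k-gram."""
    counts: Counter = Counter()
    for ngram, count in pairs:
        counts[ngram] += count
    return counts


def build_model(lines: Iterable[str], n: int) -> Counter:
    """Run the map and reduce phases over all lines in a single process."""
    return reduce_counts(pair for line in lines for pair in map_ngrams(line, n))


model = build_model(["a b a b", "a b c"], 2)
```

In a real job the map and reduce functions would run on different machines and the framework would group the emitted pairs by key. A combiner that pre-sums counts on the map side would likely reduce the amount of data shuffled between them.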

Because the size of the resulting model matters, you should represent the N-grams efficiently. Try using the following representation:

Try creating such an index. Ideally, the sizes of the resulting data files should be as equal as possible.
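One possible way to get output files of similar size is to range-partition the keys using split points taken from a sample of the data. The sketch below is an assumption about how this could be done, not a method prescribed by the exercise: it assumes each N-gram is represented as a single string key, that a sample of distinct keys is available, and that every key contributes roughly the same number of bytes. The names choose_split_points and partition_of are hypothetical.

```python
import bisect
from typing import List, Sequence


def choose_split_points(sample: Sequence[str], partitions: int) -> List[str]:
    """Pick partitions-1 keys at evenly spaced quantiles of the sorted sample."""
    ordered = sorted(sample)
    return [ordered[len(ordered) * i // partitions] for i in range(1, partitions)]


def partition_of(key: str, split_points: List[str]) -> int:
    """Return the index of the output file (partition) that should hold key."""
    # Keys smaller than the first split point go to partition 0,
    # keys equal to or above the last split point go to the last partition.
    return bisect.bisect_right(split_points, key)


sample_keys = ["a", "b", "c", "d", "e", "f", "g", "h"]
splits = choose_split_points(sample_keys, 4)
assignment = [partition_of(key, splits) for key in sample_keys]
```

Because the partitions are contiguous key ranges, the concatenation of the output files stays sorted, which is useful if the result is meant to be searched as an index. How well the sizes balance depends on how representative the sample is.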

