MapReduce Tutorial : Exercise - N-gram language model
For a given N, create a simple N-gram language model. You can experiment on the following data:
| Path | Size |
| --- | --- |
| /home/straka/wiki/cs-seq-medium | 8MB |
| /home/straka/wiki/cs-seq | 82MB |
| /home/straka/wiki/en-seq | 1.9GB |
Your model should contain all the unigrams, bigrams, …, N-grams, together with their numbers of occurrences in the given corpus.
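As a point of reference, here is a minimal plain-Perl sketch of the counting itself, assuming one tokenized sentence per line on standard input; it does not use any MapReduce framework, and the order `$N` and all names are illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $N = 3;      # assumed order of the model
my %count;      # maps "w1 w2 ... wk" -> number of occurrences

while (my $line = <STDIN>) {
  chomp $line;
  my @words = split /\s+/, $line;
  # Count every 1-gram, 2-gram, ..., N-gram starting at position $i.
  for my $i (0 .. $#words) {
    for my $n (1 .. $N) {
      last if $i + $n > @words;
      $count{join ' ', @words[$i .. $i + $n - 1]}++;
    }
  }
}

printf "%s\t%d\n", $_, $count{$_} for sort keys %count;
```

In the exercise itself the counting would of course run as a MapReduce job over the corpora above; the sketch only shows what is being counted.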
As the size of the resulting model matters, you should represent the N-grams efficiently. Try using the following representation:
- Compute the unique words of the corpus, filter out the words with only one occurrence, sort the rest in decreasing order of occurrences, and number them from 1 (so that frequent words get small numbers).
- To represent an N-gram, use the N word numbers followed by 0. Store the numbers using a variable-length encoding (smaller numbers take fewer bytes) – use `pack 'w*', @word_numbers, 0`.
- One file of the resulting index should contain a sorted list of (N-gram representation, occurrences) pairs, where the N-gram representation is described above and occurrences is the variable-length encoded number of occurrences. No separators are necessary; a decoding sketch follows this list.
- Every file should also be accompanied by an index – the index contains every 1000-th N-gram representation together with its position in the file, so an N-gram can be looked up without reading the whole file.
- If the resulting index should consist of
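The suggested record format can be sanity-checked with the following self-contained Perl sketch; `encode_ngram` and `decode_ngram` are hypothetical helper names, and only the `pack 'w*'` format itself comes from the exercise:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encode one record: the word numbers terminated by 0, then the count,
# all in Perl's BER variable-length encoding (frequent words have small
# numbers and therefore take fewer bytes).
sub encode_ngram {
  my ($word_numbers, $occurrences) = @_;
  return pack 'w*', @$word_numbers, 0, $occurrences;
}

# Decode one record: unpack 'w*' restores the numbers; the 0 separates
# the N-gram from its occurrence count.
sub decode_ngram {
  my ($bytes) = @_;
  my @numbers = unpack 'w*', $bytes;
  my $occurrences = pop @numbers;
  pop @numbers;                       # drop the terminating 0
  return (\@numbers, $occurrences);
}

# Hypothetical word numbers for a trigram seen 13 times in the corpus.
my $record = encode_ngram([42, 7, 1], 13);
my ($ngram, $count) = decode_ngram($record);
print "@$ngram => $count\n";          # prints: 42 7 1 => 13
```

Because every N-gram ends with the 0 terminator and is immediately followed by its count, the concatenated records can be parsed back from a stream with no separators: read numbers until a 0 appears, then read one more number as the occurrence count.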