MapReduce Tutorial : Exercise - N-gram language model
For a given N, create a simple N-gram language model. You can experiment on the following data:
| Path | Size |
| --- | --- |
| /home/straka/wiki/cs-seq-medium | 8MB |
| /home/straka/wiki/cs-seq | 82MB |
| /home/straka/wiki/en-seq | 1.9GB |
Your model should contain all the unigrams, bigrams, …, N-grams, together with their numbers of occurrences in the given corpus.
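As a point of reference, here is a minimal plain-Perl sketch of the counting itself, assuming one tokenized sentence per line on standard input; it does not use any MapReduce framework, and the order `$N` and all names are illustrative only:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $N = 3;      # assumed order of the model
my %count;      # maps "w1 w2 ... wk" -> number of occurrences

while (my $line = <STDIN>) {
  chomp $line;
  my @words = split /\s+/, $line;
  # Count every 1-gram, 2-gram, ..., N-gram starting at position $i.
  for my $i (0 .. $#words) {
    for my $n (1 .. $N) {
      last if $i + $n > @words;
      $count{join ' ', @words[$i .. $i + $n - 1]}++;
    }
  }
}

printf "%s\t%d\n", $_, $count{$_} for sort keys %count;
```

In the exercise itself the counting would of course run as a MapReduce job over the corpora above; the sketch only shows what is being counted.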
As the size of the resulting model matters, you should represent the N-grams efficiently. Try using the following representation:
- Compute the unique words of the corpus, filter out the words with only one occurrence, sort the rest in decreasing order of occurrences, and number them from 1 (so that frequent words get small numbers).
- To represent an N-gram, use the N word numbers followed by 0. Store the numbers using a variable-length encoding (smaller numbers take fewer bytes) – use `pack 'w*', @word_numbers, 0`.
- One file of the resulting index should contain a sorted list of (N-gram representation, occurrences) pairs, where the N-gram representation is described above and occurrences is the variable-length encoded number of occurrences. No separators are necessary; a decoding sketch follows this list.
- Every file should also be accompanied by an index – the index contains every 1000-th N-gram representation together with its position in the file, so an N-gram can be looked up without reading the whole file.
- If the resulting index should consist of
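The suggested record format can be sanity-checked with the following self-contained Perl sketch; `encode_ngram` and `decode_ngram` are hypothetical helper names, and only the `pack 'w*'` format itself comes from the exercise:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encode one record: the word numbers terminated by 0, then the count,
# all in Perl's BER variable-length encoding (frequent words have small
# numbers and therefore take fewer bytes).
sub encode_ngram {
  my ($word_numbers, $occurrences) = @_;
  return pack 'w*', @$word_numbers, 0, $occurrences;
}

# Decode one record: unpack 'w*' restores the numbers; the 0 separates
# the N-gram from its occurrence count.
sub decode_ngram {
  my ($bytes) = @_;
  my @numbers = unpack 'w*', $bytes;
  my $occurrences = pop @numbers;
  pop @numbers;                       # drop the terminating 0
  return (\@numbers, $occurrences);
}

# Hypothetical word numbers for a trigram seen 13 times in the corpus.
my $record = encode_ngram([42, 7, 1], 13);
my ($ngram, $count) = decode_ngram($record);
print "@$ngram => $count\n";          # prints: 42 7 1 => 13
```

Because every N-gram ends with the 0 terminator and is immediately followed by its count, the concatenated records can be parsed back from a stream with no separators: read numbers until a 0 appears, then read one more number as the occurrence count.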