
Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:step-14 [2012/01/25 22:19] straka
====== MapReduce Tutorial : Exercise - N-gram language model ======

For a given //N//, create a simple N-gram language model. You can experiment on the following data:

^ Path ^ Size ^
| /home/straka/wiki/cs-seq-medium | 8MB |
| /home/straka/wiki/cs-seq | 82MB |
| /home/straka/wiki/en-seq | 1.9GB |

Your model should contain all the unigrams, bigrams, ..., //N//-grams with the number of their occurrences in the given corpus.
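The counting step can be sketched locally in Python (this is only an illustration of what the mappers and reducers compute, not the tutorial's Hadoop solution; the toy corpus and the function name are illustrative):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all 1-grams, 2-grams, ..., n-grams in a token sequence.

    In a MapReduce job each mapper would emit (k-gram, 1) pairs for
    every k from 1 to n, and reducers would sum the counts per k-gram;
    here a single Counter plays both roles.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for k in range(1, n + 1):
            if i + k <= len(tokens):
                counts[tuple(tokens[i:i + k])] += 1
    return counts

# toy corpus, not one of the tutorial data files
corpus = "the cat sat on the mat".split()
counts = ngram_counts(corpus, 2)
print(counts[("the",)])        # 2
print(counts[("the", "cat")])  # 1
```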

As the size of the resulting model matters, you should represent the //N//-grams efficiently. Try using the following representation:
  * Find the unique words of the corpus and sort them according to the number of their occurrences.
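The bullet above is truncated in this revision of the page, but the stated first step can be sketched as follows. A plausible motivation (an assumption, since the page does not finish the sentence) is that frequent words get small integer ids, which compress well; the function name is illustrative:

```python
from collections import Counter

def word_ids_by_frequency(tokens):
    """Assign integer ids to unique words, most frequent word first.

    Giving small ids to frequent words is an assumption about the
    intended representation: small ids can be stored in fewer bytes
    with a variable-length encoding.
    """
    counts = Counter(tokens)
    # sort by descending count; break ties alphabetically for determinism
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered)}

ids = word_ids_by_frequency("the cat sat on the mat".split())
print(ids["the"])  # 0 -- the most frequent word gets the smallest id
```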
