Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:step-14 [2012/01/25 15:46] straka (created)
courses:mapreduce-tutorial:step-14 [2012/01/25 23:15] straka
====== MapReduce Tutorial : Exercise - N-gram language model ======

For a given //N//, create a simple N-gram language model. You can experiment on the following data:
^ Path ^ Size ^
| /home/straka/wiki/cs-seq-medium | 8MB |
| /home/straka/wiki/cs-seq | 82MB |
| /home/straka/wiki/en-seq | 1.9GB |

Your model should contain all the unigrams, bigrams, ..., //N//-grams, each with its number of occurrences in the given corpus.
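
To make the required output concrete, here is a minimal single-machine sketch in Python. It is not the tutorial's MapReduce API: the function name ''count_ngrams'' and the whitespace tokenization are assumptions made only for illustration. In a MapReduce job, the inner loop would roughly correspond to a mapper emitting each //k//-gram with count 1, and the summation to a reducer.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count every k-gram for k = 1..n in a list of tokens.

    Hypothetical helper for illustration; keys are tuples of words.
    """
    counts = Counter()
    for k in range(1, n + 1):
        # Slide a window of length k over the token sequence.
        for i in range(len(tokens) - k + 1):
            counts[tuple(tokens[i:i + k])] += 1
    return counts


# Toy corpus, split on whitespace (an assumption about tokenization).
counts = count_ngrams("a b a b c".split(), 2)
```

For this toy input, ''counts'' holds three unigrams and three distinct bigrams, for example ''("a", "b")'' with count 2.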

As the size of the resulting model matters, you should represent the //N//-grams efficiently. Try using the following representation:
  * Find the unique words of the corpus, sort them according to the number of their occurrences
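
The first step of the representation can be sketched as follows, again as plain Python rather than a MapReduce job. The sort direction is not stated above; descending order (most frequent word first) is assumed here, and using the resulting rank as an integer word id is likewise only an assumption about how the sorted list might be used. The names ''words_by_frequency'' and ''word_ids'' are hypothetical.

```python
from collections import Counter


def words_by_frequency(tokens):
    """Return the unique words, most frequent first.

    Assumption: descending order; ties are broken alphabetically
    so that the result is deterministic.
    """
    counts = Counter(tokens)
    return sorted(counts, key=lambda w: (-counts[w], w))


vocab = words_by_frequency("b a b c a b".split())

# Assumption: the position in the sorted list serves as a compact word id,
# so frequent words get small numbers.
word_ids = {word: rank for rank, word in enumerate(vocab)}
```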
