====== MapReduce Tutorial : Exercise - N-gram language model ======
  
For a given //N//, create a simple N-gram language model. You can start experimenting on the following data:
^ Path ^ Size ^
| /home/straka/wiki/cs-seq-medium | 8MB |
Your model should contain all the unigrams, bigrams, ..., //N//-grams with the number of occurrences in the given corpus.
  
As the size of the resulting model matters, you should represent the //N//-grams efficiently. You can devise your own format, or you can use the following representation:
  * Compute the unique words of the corpus, filter out the words that occur only once, sort the rest by their number of occurrences, and number them from 1.
  * To represent an //N//-gram, use the //N// word numbers followed by 0. Store the numbers using a variable-length encoding (smaller numbers take fewer bytes) -- use ''pack 'w*', @word_numbers, 0''. Since the words are numbered from 1, the terminating 0 can never appear inside an //N//-gram (see the encoding sketch after this list).
  * One file of the resulting index should contain a sorted list of (//N//-gram representation, occurrences) pairs, where the //N-gram representation// is described above and //occurrences// is the variable-length encoded number of occurrences (again using ''pack 'w', $occurrences''). No separators are necessary.
  * Every data file should also be accompanied by an index file, which contains every 1000((You are free to choose a better constant :-) ))-th //N-gram representation// of the data file, together with the byte offset of that //N-gram representation// in the data file. (The motivation behind the index file is that it is read into memory, and when an N-gram is searched for, it points to the approximate position in the data file; see the lookup sketch below.)
  * As in the sorting example, the //N-gram representations// in one data file should all be smaller or larger than those in another data file, i.e., the data files should hold disjoint sorted ranges.
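To make the representation concrete, here is a minimal sketch of encoding and decoding one //N//-gram with Perl's ''pack''/''unpack''. The ''%word_number'' map and the sample words are made up for illustration; in the exercise the numbering is computed from the corpus as described in the first point of the list.

<code perl>
use strict;
use warnings;

# Hypothetical word numbering; in the exercise it is computed by
# counting occurrences and sorting (most frequent word gets number 1).
my %word_number = (the => 1, cat => 2, sat => 3);

# Encode an N-gram as its N word numbers followed by a terminating 0,
# all in BER variable-length encoding (the 'w' format of pack).
sub encode_ngram {
    my @word_numbers = map { $word_number{$_} } @_;
    return pack 'w*', @word_numbers, 0;
}

# Decoding is the inverse: unpack the numbers and drop the trailing 0.
sub decode_ngram {
    my @numbers = unpack 'w*', $_[0];
    pop @numbers;    # the terminating 0
    return @numbers;
}

# A complete record is the encoded N-gram immediately followed by the
# packed occurrence count -- no separators are needed.
my $record = encode_ngram(qw(the cat sat)) . pack 'w', 42;
</code>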
  
Try creating such an index. Ideally, the sizes of the resulting data files should be as equal as possible.
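As a starting point, the following sketch shows one possible way of using the index file at query time; it is an illustration, not the reference solution. The index is assumed to be already loaded into memory as an array of ''[encoded N-gram, byte offset]'' pairs, and ''read_record'' is a hypothetical helper that parses one (encoded //N//-gram, occurrences) record from the data file.

<code perl>
use strict;
use warnings;

# Look up an encoded N-gram ($key) in a sorted data file.
# $index holds [encoded N-gram, byte offset] pairs, one per 1000
# records of the data file; read_record (assumed to exist) parses
# one (encoded N-gram, occurrences) record from the filehandle.
sub lookup_ngram {
    my ($data_fh, $index, $key) = @_;

    # Binary search for the last sampled N-gram that is <= the key.
    my ($lo, $hi) = (0, $#$index);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($index->[$mid][0] le $key) { $lo = $mid } else { $hi = $mid - 1 }
    }

    # Jump to the sampled position and scan forward; at most about
    # 1000 records are examined before the key is found or passed.
    seek $data_fh, $index->[$lo][1], 0;
    while (my ($ngram_bytes, $occurrences) = read_record($data_fh)) {
        return $occurrences if $ngram_bytes eq $key;
        last if $ngram_bytes gt $key;    # past the key, so it is absent
    }
    return undef;    # not in this data file
}
</code>

The byte offsets can be obtained simply with ''tell'' while the data file is being written; because every record is self-delimiting (the terminating 0 of the //N//-gram and the BER encoding of the count), no lengths or separators have to be stored.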

----

<html>
<table style="width:100%">
<tr>
<td style="text-align:left; width: 33%; "></html>[[step-13|Step 13]]: Sorting.<html></td>
<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
<td style="text-align:right; width: 33%; "></html>[[step-15|Step 15]]: K-means clustering.<html></td>
</tr>
</table>
</html>
