[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
courses:mapreduce-tutorial:step-14 [2012/01/26 23:14]
straka
courses:mapreduce-tutorial:step-14 [2012/01/31 16:08] (current)
dusek
Line 1: Line 1:
 ====== MapReduce Tutorial : Exercise - N-gram language model ====== ====== MapReduce Tutorial : Exercise - N-gram language model ======
  
-For a given //N// create a simple N-gram language model. You can experimenting on the following data:+For a given //N// create a simple N-gram language model. You can start experimenting on the following data:
 ^ Path ^ Size ^ ^ Path ^ Size ^
 | /home/straka/wiki/cs-seq-medium | 8MB | | /home/straka/wiki/cs-seq-medium | 8MB |
Line 9: Line 9:
 Your model should contain all the unigrams, bigrams, ..., //N//-grams with the number of occurrences in the given corpus. Your model should contain all the unigrams, bigrams, ..., //N//-grams with the number of occurrences in the given corpus.
  
-As the size of the resulting corpus matters, you should represent the //N//-grams efficiently. Try using the following representation:+As the size of the resulting corpus matters, you should represent the //N//-grams efficiently. You can devise your own format, or you can use the following representation:
   * Compute the unique words of the corpus, filter out the words that have only one occurrence, sort them according to the number of their occurrences and number them from 1.   * Compute the unique words of the corpus, filter out the words that have only one occurrence, sort them according to the number of their occurrences and number them from 1.
   * In order to represent //N//-gram, use the //N// numbers of the words, followed by a 0. Store the numbers using variable-length encoding (smaller numbers take less bytes) -- use ''pack 'w*', @word_numbers, 0''.   * In order to represent //N//-gram, use the //N// numbers of the words, followed by a 0. Store the numbers using variable-length encoding (smaller numbers take less bytes) -- use ''pack 'w*', @word_numbers, 0''.
-  * One file of the resulting index should contain a sorted list of (N-gram representation, occurrence), where //N-gram representation// is described above and //occurrence// is a variable-length encoded number of occurrences (again using ''pack 'w', $occurences''). No separators are necessary. +  * One file of the resulting index should contain a sorted list of (N-gram representation, occurrences), where //N-gram representation// is described above and //occurrence// is a variable-length encoded number of occurrences (again using ''pack 'w', $occurrences''). No separators are necessary. 
-  * Every data file should also be accompanied by an index file, which contains every 1000-th //N-gram representation// of the data file, together with the byte offset of that //N-gram representation// in the data file. (The motivation behind the index file is that it will be read into memory and if an N-gram is searched for, it will point to the possible position in the data file.)+  * Every data file should also be accompanied by an index file, which contains every 1000((You are free to choose better constant :--) ))-th //N-gram representation// of the data file, together with the byte offset of that //N-gram representation// in the data file. (The motivation behind the index file is that it will be read into memory and if an N-gram is searched for, it will point to the possible position in the data file.)
   * As in the sorting example, the //N-gram representation// in one data file should be all smaller or larger than in another data file.   * As in the sorting example, the //N-gram representation// in one data file should be all smaller or larger than in another data file.
  
 Try creating such index. Ideally, the sizes of resulting data files should be as equal as possible. Try creating such index. Ideally, the sizes of resulting data files should be as equal as possible.
  
 +----
 +
 +<html>
 +<table style="width:100%">
 +<tr>
 +<td style="text-align:left; width: 33%; "></html>[[step-13|Step 13]]: Sorting.<html></td>
 +<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td>
 +<td style="text-align:right; width: 33%; "></html>[[step-15|Step 15]]: K-means clustering.<html></td>
 +</tr>
 +</table>
 +</html>

[ Back to the navigation ] [ Back to the content ]