Institute of Formal and Applied Linguistics Wiki


courses:mapreduce-tutorial:step-14

Your model should contain all the unigrams, bigrams, ..., //N//-grams with the number of occurrences in the given corpus.
  
As the size of the resulting corpus matters, you should represent the //N//-grams efficiently. You can devise your own format, or you can use the following representation:
  * Compute the unique words of the corpus, filter out the words that have only one occurrence, sort them in decreasing order of the number of their occurrences, and number them from 1 (so that the most frequent words get the smallest numbers).
  * To represent an //N//-gram, use the //N// numbers of its words, followed by a 0. Store the numbers using a variable-length encoding (smaller numbers take fewer bytes) -- use ''pack 'w*', @word_numbers, 0''. See the sketch below the list.
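
A minimal Perl sketch of this encoding and the matching decoding, assuming the frequency-sorted word numbering has already been computed (the concrete word numbers below are made up for illustration):

<code perl>
use strict;
use warnings;

# Hypothetical word numbers of one trigram, taken from the frequency-sorted
# word list described above (numbering starts at 1, so 0 is free to act as
# the N-gram terminator).
my @word_numbers = (12, 7, 305);

# Encode: N BER (variable-length) integers followed by a terminating 0.
# Frequent words have small numbers and therefore occupy fewer bytes.
my $encoded = pack 'w*', @word_numbers, 0;

# Decode: unpack all BER integers and drop the trailing 0.
my @decoded = unpack 'w*', $encoded;
pop @decoded;

print "decoded N-gram: @decoded\n";   # prints: decoded N-gram: 12 7 305
</code>

Such an encoded //N//-gram can then be used, for example, as the key emitted for each //N//-gram occurrence, so that occurrences of equal //N//-grams are counted together.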
