courses:mapreduce-tutorial:step-14 [2012/01/26 23:16] straka
Your model should contain all the unigrams, bigrams, ..., //N//-grams with the number of occurrences in the given corpus.
As the size of the resulting corpus matters, you should represent the //N//-grams efficiently:
  * Compute the unique words of the corpus, filter out the words that have only one occurrence, sort them according to the number of their occurrences and number them from 1.
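The word-numbering step above could be sketched as follows (a minimal illustration, not the tutorial's reference solution; in the actual assignment this would be done with MapReduce jobs over the corpus, and the function name is hypothetical):

```python
from collections import Counter

def number_words(corpus_tokens):
    """Assign ids 1, 2, ... to words by decreasing frequency.

    Words occurring only once are filtered out; id 0 stays free
    so it can later serve as the N-gram terminator.
    """
    counts = Counter(corpus_tokens)
    frequent = [w for w, c in counts.items() if c > 1]
    # Most frequent words get the smallest ids; ties broken alphabetically.
    frequent.sort(key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(frequent, start=1)}
```

Giving the most frequent words the smallest numbers is what makes the variable-length encoding in the next step pay off: the words that appear most often in //N//-grams are exactly the ones stored in the fewest bytes.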
  * In order to represent an //N//-gram, use the //N// numbers of its words, followed by a 0. Store the numbers using variable-length encoding (smaller numbers take fewer bytes) -- use ''
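One common variable-length scheme matching the description is a base-128 varint: seven payload bits per byte, with the high bit marking continuation. A minimal sketch of that idea (the function names are illustrative; the tutorial's truncated ''&#39;'...''&#39;'' reference presumably names the actual helper to use):

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint.

    Each byte carries 7 bits of payload; the high bit is set on
    every byte except the last, so small numbers take fewer bytes.
    """
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

def encode_ngram(word_ids):
    """Encode an N-gram as its word numbers followed by a terminating 0."""
    return b"".join(encode_varint(i) for i in list(word_ids) + [0])
```

Because word ids start from 1, the value 0 is unambiguous as an //N//-gram terminator, and the frequent words (small ids) fit in a single byte each.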