Differences
This shows you the differences between two versions of the page.
courses:mapreduce-tutorial:step-14 [2012/01/25 23:15] straka |
courses:mapreduce-tutorial:step-14 [2012/01/26 23:14] straka |
||
---|---|---|---|
Line 10: | Line 10: | ||
As the size of the resulting corpus matters, you should represent the //N//-grams efficiently. Try using the following representation: | As the size of the resulting corpus matters, you should represent the //N//-grams efficiently. Try using the following representation: | ||
- | * Find the unique words of the corpus, sort them according to the number of their occurrences | + | * Compute |
+ | * To represent an //N//-gram, use the //N// numbers of the words, followed by a 0. Store the numbers using variable-length encoding (smaller numbers take fewer bytes) -- use '' | ||
+ | * One file of the resulting index should contain a sorted list of (N-gram representation, | ||
+ | * Every data file should also be accompanied by an index file, which contains every 1000-th //N-gram representation// | ||
+ | * As in the sorting example, the //N-gram representation// | ||
+ | |||
+ | Try creating such an index. Ideally, the sizes of the resulting data files should be as equal as possible. | ||
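
The //N//-gram representation described above (the //N// word numbers followed by a 0, each stored in a variable-length encoding) could be sketched roughly as follows. The encoding call named in the original text is cut off, so this example assumes a LEB128-style scheme: 7 payload bits per byte, low-order group first, with the high bit marking that more bytes follow. It also assumes that word numbers start at 1, so that 0 stays free to act as the terminator. All function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def encode_varint(number: int) -> bytes:
    """Encode a non-negative integer using 7 bits per byte, low-order group first."""
    if number < 0:
        raise ValueError("only non-negative integers can be encoded")
    out = bytearray()
    while True:
        low_bits = number & 0x7F
        number >>= 7
        if number:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)  # high bit clear: last byte of this number
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one integer starting at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def encode_ngram(word_numbers: Iterable[int]) -> bytes:
    """Encode the word numbers of one N-gram, followed by a terminating 0."""
    out = bytearray()
    for number in word_numbers:
        if number < 1:
            # Assumption: words are numbered from 1, 0 is reserved as terminator.
            raise ValueError("word numbers must start at 1")
        out += encode_varint(number)
    out += encode_varint(0)
    return bytes(out)


def decode_ngram(data: bytes, pos: int = 0) -> Tuple[List[int], int]:
    """Decode one N-gram starting at pos; return (word numbers, position after it)."""
    numbers: List[int] = []
    while True:
        value, pos = decode_varint(data, pos)
        if value == 0:
            return numbers, pos
        numbers.append(value)


if __name__ == "__main__":
    encoded = encode_ngram([1, 300, 2])
    print(encoded.hex(), decode_ngram(encoded))
```

Under this scheme word numbers below 128 take a single byte, so giving the smallest numbers to the most frequent words keeps the data files small. A zero byte can only arise from the terminator, because the last byte of any non-zero number is itself non-zero and every earlier byte has its high bit set.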