Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-14 [2012/01/25 23:15] straka
+++ courses:mapreduce-tutorial:step-14 [2012/01/25 23:30] straka
@@ Line 10: / Line 10: @@
 As the size of the resulting corpus matters, you should represent the //N//-grams efficiently. Try using the following representation:
-  * Find the unique words of the corpus, sort them according to the number of their occurences
+  * Compute the unique words of the corpus, filter out the words that have only one occurrence, sort them according to the number of their occurrences and number them from 1.
+  * To represent an //N//-gram, use the //N// numbers of its words, followed by 0. Store the numbers using a variable-length encoding (smaller numbers take fewer bytes) -- use ''pack 'w*', @word_numbers, 0''.
+  * One file of the resulting index should contain a sorted list of (n-gram representation, occurrence), where //n-gram representation// is described above and //occurrence// is a variable-length encoded number of occurrences. No separators are necessary.
+  * Every file should also be accompanied by an index -- the index contains every 1000th //n-gram representation//.
+* If the resulting index should consist of
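
The representation described in the added lines can be sketched as follows. This is a minimal illustration in Python, not the tutorial's reference solution (the page itself uses Perl's ''pack 'w*'''). It assumes the byte layout of Perl's ''w'' template: base-128 digits, most significant first, with the high bit set on every byte except the last. The function names ''ber_encode'', ''encode_record'' and ''decode_records'' are hypothetical and chosen only for this example.

```python
def ber_encode(n: int) -> bytes:
    """Encode a non-negative integer the way Perl's pack 'w' does:
    base-128 digits, most significant first, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    digits = [n & 0x7F]                      # last byte: high bit clear
    n >>= 7
    while n:
        digits.append((n & 0x7F) | 0x80)     # earlier bytes: high bit set
        n >>= 7
    return bytes(reversed(digits))


def encode_record(word_numbers, occurrences: int) -> bytes:
    """One (n-gram representation, occurrence) pair: the N word numbers,
    a terminating 0, then the occurrence count, all variable-length encoded."""
    ngram = b"".join(ber_encode(w) for w in (*word_numbers, 0))
    return ngram + ber_encode(occurrences)


def decode_records(data: bytes):
    """Yield (word_numbers, occurrences) pairs from concatenated records.
    No separators are needed: word numbers start from 1, so a 0 always ends an n-gram."""
    words, value, expect_count = [], 0, False
    for byte in data:
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:                      # number continues in the next byte
            continue
        if expect_count:                     # number after the 0 is the count
            yield tuple(words), value
            words, expect_count = [], False
        elif value == 0:                     # terminator of the n-gram
            expect_count = True
        else:
            words.append(value)
        value = 0
```

Because the words are numbered from 1 after sorting by occurrence count, frequent words can be given the smallest numbers, and every number below 128 occupies a single byte. For example, ''encode_record((1, 300), 5)'' produces five bytes: one for word 1, two for word 300, one for the terminating 0 and one for the count.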

Institute of Formal and Applied Linguistics Wiki