Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-14 [2012/01/26 23:16] straka
courses:mapreduce-tutorial:step-14 [2012/01/26 23:17] straka
Line 12:
   * Compute the unique words of the corpus, filter out the words that have only one occurrence, sort them according to the number of their occurrences and number them from 1.
   * In order to represent //N//-gram, use the //N// numbers of the words, followed by a 0. Store the numbers using variable-length encoding (smaller numbers take less bytes) -- use ''pack 'w*', @word_numbers, 0''.
-  * One file of the resulting index should contain a sorted list of (N-gram representation, occurrences), where //N-gram representation// is described above and //occurrence// is a variable-length encoded number of occurrences (again using ''pack 'w', $occurences''). No separators are necessary.
+  * One file of the resulting index should contain a sorted list of (N-gram representation, occurrences), where //N-gram representation// is described above and //occurrence// is a variable-length encoded number of occurrences (again using ''pack 'w', $occurrences''). No separators are necessary.
   * Every data file should also be accompanied by an index file, which contains every 1000-th //N-gram representation// of the data file, together with the byte offset of that //N-gram representation// in the data file. (The motivation behind the index file is that it will be read into memory and if an N-gram is searched for, it will point to the possible position in the data file.)
   * As in the sorting example, the //N-gram representation// in one data file should be all smaller or larger than in another data file.
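
The file layout described in the list above can be sketched roughly as follows. This is only an illustration, not the tutorial's reference solution: it is written in Python rather than Perl, and ''ber_encode'' re-implements the BER compressed integer format that Perl's ''pack 'w''' produces. The names ''ber_encode'', ''encode_ngram'' and ''build_files'' are hypothetical. Two points the page leaves open are filled in as assumptions: records are ordered by byte-wise comparison of their encoded representation, and "every 1000-th" is taken to mean records 0, 1000, 2000 and so on.

```python
def ber_encode(n: int) -> bytes:
    """Encode a non-negative integer as a BER compressed integer.

    Base-128 digits, most significant first, with the high bit set on
    every byte except the last (the format Perl's pack 'w' emits).
    """
    if n < 0:
        raise ValueError("only non-negative integers can be encoded")
    digits = [n & 0x7F]                    # last byte: high bit clear
    n >>= 7
    while n:
        digits.append((n & 0x7F) | 0x80)   # earlier bytes: high bit set
        n >>= 7
    return bytes(reversed(digits))


def encode_ngram(word_numbers) -> bytes:
    """N word numbers followed by a terminating 0.

    Word numbers start at 1, so the 0 unambiguously ends the N-gram.
    """
    return b"".join(ber_encode(w) for w in (*word_numbers, 0))


def build_files(counts, step=1000):
    """Build one data file and its sparse index in memory.

    counts maps a tuple of word numbers to its number of occurrences.
    Returns (data, index): data is the concatenation of
    (N-gram representation, occurrences) records without separators;
    index lists (N-gram representation, byte offset) for every
    step-th record.
    """
    # Assumption: sort by byte-wise order of the encoded N-gram.
    records = sorted((encode_ngram(ng), occ) for ng, occ in counts.items())
    data = bytearray()
    index = []
    for i, (key, occ) in enumerate(records):
        if i % step == 0:                  # assumption: records 0, step, 2*step, ...
            index.append((key, len(data)))
        data += key + ber_encode(occ)
    return bytes(data), index
```

A lookup would then load the index into memory, find the last index entry whose representation is not greater than the searched one, and scan the data file forward from that byte offset.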
