Page ''courses:mapreduce-tutorial:step-14'' — created 2012/01/25 15:46 by straka; current revision 2012/01/31 16:08 by dusek.
====== MapReduce Tutorial : Exercise - N-gram language model ======
| + | |||
| + | For a given //N// create a simple N-gram language model. You can start experimenting on the following data: | ||
| + | ^ Path ^ Size ^ | ||
| + | | / | ||
| + | | / | ||
| + | | / | ||
| + | |||
| + | Your model should contain all the unigrams, bigrams, ..., //N//-grams with the number of occurrences in the given corpus. | ||
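Collecting these counts is a plain word-count-style job: a mapper emits every 1- to //N//-gram of each line, and a reducer sums the emitted ones. A minimal local Java simulation of that map/reduce pair (not part of the exercise statement; whitespace tokenization is assumed for illustration):

```java
import java.util.*;

public class NgramCounts {
    // Emit all 1..n-grams of a token sequence and sum their counts,
    // mimicking what a mapper (emit) + reducer (sum) pair would compute.
    static Map<String, Integer> countNgrams(List<String> tokens, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int len = 1; len <= n; len++) {
            for (int i = 0; i + len <= tokens.size(); i++) {
                String gram = String.join(" ", tokens.subList(i, i + len));
                counts.merge(gram, 1, Integer::sum);  // reducer side: sum of emitted 1s
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "b", "a", "b", "c");
        Map<String, Integer> c = countNgrams(tokens, 2);
        System.out.println(c.get("a b"));  // bigram "a b" occurs twice
    }
}
```

In the real Hadoop job the `counts` map is replaced by emitting key-value pairs, so the corpus never has to fit in memory.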
| + | |||
| + | As the size of the resulting corpus matters, you should represent the //N//-grams efficiently. You can devise your own format, or you can use the following representation: | ||
| + | * Compute the unique words of the corpus, filter out the words that have only one occurrence, sort them according to the number of their occurrences and number them from 1. | ||
| + | * In order to represent //N//-gram, use the //N// numbers of the words, followed by a 0. Store the numbers using variable-length encoding (smaller numbers take less bytes) -- use '' | ||
| + | * One file of the resulting index should contain a sorted list of (N-gram representation, | ||
| + | * Every data file should also be accompanied by an index file, which contains every 1000((You are free to choose better constant :--) ))-th //N-gram representation// | ||
| + | * As in the sorting example, the //N-gram representation// | ||
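The variable-length encoding can be sketched as a base-128 varint — 7 payload bits per byte with a continuation bit — which is the same idea as Hadoop's ''WritableUtils.writeVInt'', though the exact byte layout there differs. Since words are numbered from 1 by descending frequency, frequent words get short codes, and 0 is free to terminate each //N//-gram:

```java
import java.io.*;
import java.util.*;

public class VarintNgrams {
    // Base-128 varint: 7 payload bits per byte, high bit = "more bytes follow".
    // Smaller ids (more frequent words) therefore take fewer bytes.
    static void writeVarint(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // An N-gram = the ids of its words followed by the terminator 0
    // (0 never names a word, because numbering starts at 1).
    static byte[] encodeNgram(int[] wordIds) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int id : wordIds) writeVarint(buf, id);
        writeVarint(buf, 0);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // ids 1 and 2 take one byte each, 300 takes two; plus the 0 terminator
        System.out.println(encodeNgram(new int[]{1, 2, 300}).length);  // 5
    }
}
```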
| + | |||
| + | Try creating such index. Ideally, the sizes of resulting data files should be as equal as possible. | ||
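Looking an //N//-gram up in such an index works as in the sorting example: binary-search the sparse index (every 1000-th key) for the last sampled key not greater than the query, then scan at most one block of the data file. A minimal sketch over sorted keys (method names are hypothetical, for illustration only):

```java
import java.util.*;

public class SparseIndexLookup {
    // Given the sorted sampled keys (every K-th key of a data file), return
    // the block in which a query key must lie: the last sample <= query.
    static int blockFor(List<String> samples, String query) {
        int pos = Collections.binarySearch(samples, query);
        if (pos >= 0) return pos;            // exact hit on a sampled key
        int insertion = -pos - 1;            // index of first sample > query
        return Math.max(0, insertion - 1);   // scan the preceding block
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("apple", "melon", "pear");
        System.out.println(blockFor(samples, "orange"));  // block starting at "melon"
    }
}
```

The same sampled keys can also drive the partitioner, so that reducers receive key ranges of roughly equal size and the resulting data files come out balanced.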
| + | |||
| + | ---- | ||
| + | |||
| + | < | ||
| + | <table style=" | ||
| + | < | ||
| + | <td style=" | ||
| + | <td style=" | ||
| + | <td style=" | ||
| + | </ | ||
| + | </ | ||
| + | </ | ||
