====== Faster and Smaller N-Gram Language Models ======
//Presenter: Joachim Daiber\\
Reporter: Long DT\\
Date: 12-March-2012//

==== Overview ====

The talk is mainly about techniques to improve the performance of N-gram language models: how to make them run faster while using a smaller amount of memory.

==== Encoding ====

**I. Encoding the count**

In the Web1T corpus, the most frequent n-gram occurs 95 billion times, but the corpus contains only about 770,000 unique count values.
=> Maintaining a value-rank array is a good way to encode the counts.
**II. Encoding the n-gram**

**//Context encoding//**

Encode W1,W2,...,Wn as the pair (c(W1,...,Wn-1), Wn), where c is the offset function; this is the so-called context encoding.
**//Data structures//**

__Sorted Array__

  - Use n arrays for an n-gram model (the i-th array is used for i-grams)
  - Each element in the array is a pair (w, c), where w is the index of the word in the unigram array and c is the context offset pointer
  - Sort based on w

Improvement: implicitly encode w. All n-grams ending with a particular word wi are stored contiguously, so storing w in every entry is wasteful. Instead, maintain another array that saves the beginning and the end of each word's range (see the sketch below).
__Hash Table__

Use open addressing (with linear probing).
Use 40% extra space for the auxiliary part (to keep collisions rare and probe chains short).
**//Compression//**

__Variable length coding__

Idea:
Context offsets tend to be close to each other, so only save the first offset in full and encode the remaining ones as small differences from their neighbours (see the sketch below).
Question:
How are the unigrams sorted?
Martin suggested that they must be sorted by frequency.

__Block Compression__

Compress the key/value array in blocks of 128 bytes;
the underlying reason is to still support binary search (see the sketch below).
==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process.
This simple addition improves performance by 300% over a conventional implementation.
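
A minimal way to picture this (using Python's built-in memoization rather than the paper's hash-based cache):

<code python>
from functools import lru_cache

# Toy query cache: real text repeats the same n-grams constantly, so a
# memoized wrapper answers repeats without touching the main structure.

def slow_prob(ngram):
    # Stand-in for the actual language-model lookup.
    return 0.1 / len(ngram)

@lru_cache(maxsize=1 << 20)
def cached_prob(ngram):            # ngram must be hashable, e.g. a tuple
    return slow_prob(ngram)

print(cached_prob(("the", "quick", "fox")))   # computed once
print(cached_prob(("the", "quick", "fox")))   # served from the cache
</code>
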
**II. Exploiting Scrolling Queries**

We can quickly form the context encoding of the next query by pairing the new word with the offset saved from the previous query (see the sketch below).
==== Conclusion ====

In summary, it was a paper worth reading, well presented, and thoroughly discussed. The paper contains many detailed techniques that are definitely helpful for an actual implementation.
