Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Next revision Both sides next revision | ||
courses:rg:2012:longdtreport [2012/03/12 22:44] longdt |
courses:rg:2012:longdtreport [2012/03/12 22:47] longdt |
||
---|---|---|---|
Line 11: | Line 11: | ||
==== Encoding ==== | ==== Encoding ==== | ||
**I. Encoding the count** | **I. Encoding the count** | ||
+ | |||
In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count. | In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count. | ||
=> Maintain value rank array is a good way to encode count | => Maintain value rank array is a good way to encode count | ||
+ | |||
**II. Encoding the n-gram** | **II. Encoding the n-gram** | ||
Line 21: | Line 23: | ||
**// | **// | ||
- | Sorted Array | + | __Sorted Array__ |
+ Use n array for n-gram model (array i-th is used for i-gram) | + Use n array for n-gram model (array i-th is used for i-gram) | ||
+ Each element in array in pair (w,c) | + Each element in array in pair (w,c) | ||
Line 27: | Line 29: | ||
+ c : offset pointer | + c : offset pointer | ||
+ Sort base on w | + Sort base on w | ||
- | //Improvement// : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range | + | Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range |
- | Hash Table | + | __Hash Table__ |
+ | Use open addressing (with linear probling) | ||
+ | Use extra 40% space for auxiliary part (avoid collision) | ||
+ | |||
+ | **// | ||
+ | |||
+ | __Variable length coding__ | ||
| | ||
Most of the attendants apparently understood the talk and the paper well, and a | Most of the attendants apparently understood the talk and the paper well, and a |