==== Encoding ====

**I. Encoding the count**

In the Web1T corpus, the most frequent n-gram occurs 95 billion times, but there are only about 770,000 unique count values.
=> Maintaining a value rank array is a good way to encode the counts.

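As an illustration of the value rank idea (a minimal sketch with my own names, not the paper's code), the unique counts can be collected into one shared table, and each n-gram then stores only a small rank into that table:

<code python>
# Sketch of value-rank encoding of counts (illustrative, not the actual implementation).
# Raw counts are large 64-bit numbers, but only ~770,000 distinct values occur, so each
# n-gram can store a small rank that indexes a shared table of unique count values.

def build_value_rank(raw_counts):
    """Return (table of unique counts, per-n-gram ranks)."""
    table = sorted(set(raw_counts))                 # one entry per unique count value
    rank_of = {value: i for i, value in enumerate(table)}
    ranks = [rank_of[c] for c in raw_counts]        # ~20 bits each instead of 64
    return table, ranks

table, ranks = build_value_rank([95_000_000_000, 7, 7, 123])
assert table[ranks[0]] == 95_000_000_000            # decode: rank -> original count
</code>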
**II. Encoding the n-gram**
__Sorted Array__

  - Use n arrays for an n-gram model (the i-th array is used for i-grams)
  - Each element in an array is a pair (w, c)
  - Sort based on w

Improvement: implicitly encode w. All n-grams ending with a particular word w_i are stored contiguously, so storing w in every entry is wasteful; instead, maintain another array that saves the beginning and the end of each word's range (see the sketch below).
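A minimal sketch of this layout, assuming entries are sorted by (w, c) and lookups binary-search within a word's range; the class and field names are my own, not the toolkit's:

<code python>
# Sketch of a sorted array for one n-gram order (assumed layout, not the actual code).
# With the "implicit w" improvement, a ranges table stores where each word's block of
# entries begins and ends, so w itself is not stored in the entries.
import bisect

class SortedArrayOrder:
    def __init__(self, entries):
        # entries: list of (w, context_offset, count_rank), built offline
        entries = sorted(entries)
        self.contexts = [c for _, c, _ in entries]      # w is implicit via `ranges`
        self.ranks = [r for _, _, r in entries]
        self.ranges = {}                                 # w -> (start, end) of its block
        start = 0
        for i, (w, _, _) in enumerate(entries):
            if i + 1 == len(entries) or entries[i + 1][0] != w:
                self.ranges[w] = (start, i + 1)
                start = i + 1

    def lookup(self, w, context_offset):
        """Return (offset of the n-gram, count rank), or None if it is not stored."""
        lo, hi = self.ranges.get(w, (0, 0))
        i = bisect.bisect_left(self.contexts, context_offset, lo, hi)
        if i < hi and self.contexts[i] == context_offset:
            return i, self.ranks[i]
        return None

order = SortedArrayOrder([(5, 0, 2), (5, 3, 0), (9, 1, 1)])
assert order.lookup(5, 3) == (1, 0)
</code>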
__Hash Table__

Use open addressing (with linear probing).
Use about 40% extra space (empty slots) to avoid collisions.

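A minimal sketch of open addressing with linear probing and roughly 40% spare slots (my own toy illustration, not the paper's implementation):

<code python>
# Sketch of an open-addressing hash table with linear probing (illustrative only).
# The table is allocated with ~40% more slots than keys so probe chains stay short.

class LinearProbingTable:
    def __init__(self, num_keys):
        self.size = int(num_keys * 1.4) + 1     # ~40% extra space to limit collisions
        self.keys = [None] * self.size          # key = encoded n-gram, e.g. (w, c)
        self.values = [None] * self.size        # value = count rank

    def _slot(self, key):
        i = hash(key) % self.size
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.size             # linear probing: try the next slot
        return i

    def put(self, key, value):
        i = self._slot(key)
        self.keys[i], self.values[i] = key, value

    def get(self, key):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else None

table = LinearProbingTable(num_keys=3)
table.put((5, 0), 2)
assert table.get((5, 0)) == 2 and table.get((7, 7)) is None
</code>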
__Variable length coding__

Idea: context offsets tend to be close to each other, so only the first offset's address needs to be saved in full; the following offsets can be stored as small differences from it (see the sketch below).

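A hedged sketch of this idea using a generic variable-byte code over the deltas between consecutive offsets (the exact codes used in the paper may differ):

<code python>
# Sketch of delta + variable-byte coding of a sorted list of context offsets
# (a generic scheme for illustration; the paper's exact codes may differ).

def encode_offsets(offsets):
    """Keep the first offset in full, then variable-byte encode the deltas."""
    out, prev = bytearray(), None
    for off in offsets:
        delta = off if prev is None else off - prev    # small because offsets are close
        prev = off
        while True:                                    # 7 bits per byte, high bit = "more"
            byte = delta & 0x7F
            delta >>= 7
            out.append(byte | (0x80 if delta else 0))
            if not delta:
                break
    return bytes(out)

def decode_offsets(data):
    offsets, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:                            # last byte of this delta
            prev += value
            offsets.append(prev)
            value, shift = 0, 0
    return offsets

offs = [1000, 1003, 1010, 1024]
assert decode_offsets(encode_offsets(offs)) == offs
</code>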
Question: how the unigram ...
Martin suggested ...

__Block Compression__

Compress ...
The underlying reason ...

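The notes above are truncated, so the following is only a guessed sketch of the general block-compression technique: compress the sorted entries in fixed-size blocks and keep each block's first offset uncompressed as a header, so a query can binary-search the headers and decode a single block. The block size and layout are my assumptions, and the sketch reuses encode_offsets/decode_offsets from the variable length coding example above:

<code python>
# Guessed sketch of block compression over sorted offsets (block size and header
# layout are my assumptions; reuses encode_offsets/decode_offsets defined above).
import bisect

BLOCK_SIZE = 128                                    # assumed number of entries per block

def compress_blocks(offsets):
    """Split sorted offsets into blocks; keep each block's first offset as a header."""
    headers, blocks = [], []
    for i in range(0, len(offsets), BLOCK_SIZE):
        chunk = offsets[i:i + BLOCK_SIZE]
        headers.append(chunk[0])                    # stored uncompressed for binary search
        blocks.append(encode_offsets(chunk))        # compress the block body
    return headers, blocks

def contains(headers, blocks, target):
    """Binary-search the headers, then decode only one block to check for the target."""
    b = bisect.bisect_right(headers, target) - 1
    return b >= 0 and target in decode_offsets(blocks[b])

headers, blocks = compress_blocks(list(range(0, 999, 3)))
assert contains(headers, blocks, 996) and not contains(headers, blocks, 997)
</code>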
==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process.
This simple addition increases performance by 300% over the conventional implementation.

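A minimal sketch of such a cache, assuming any language model object with a score(ngram) method (the interface is an assumption, not the toolkit's API):

<code python>
# Sketch of caching repeated n-gram queries (illustrative wrapper, not the paper's code).
# Decoders ask for the same n-grams many times, so a cache in front of the language
# model answers repeated queries without touching the main data structures.

class CachedLM:
    def __init__(self, lm):
        self.lm = lm            # any object with a .score(ngram) method (assumed interface)
        self.cache = {}

    def score(self, ngram):
        ngram = tuple(ngram)
        if ngram not in self.cache:
            self.cache[ngram] = self.lm.score(ngram)   # miss: fall back to the full lookup
        return self.cache[ngram]
</code>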
**II. Exploiting Scrolling Queries**

We can quickly form the context encoding of the next query by concatenating the new word with the offset saved from the previous query.

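A sketch of the scrolling idea, assuming each lookup returns both the probability and the offset of the matched n-gram, so that offset can serve directly as the context of the next query (the lookup interface is assumed):

<code python>
# Sketch of exploiting scrolling queries (assumed interface, for illustration only).
# If lookup(word, context_offset) also returns the offset of the n-gram it matched,
# that offset is already the context encoding for the next word, so the context never
# has to be re-encoded from scratch.

def score_sentence(lm, words):
    """Score words left to right, scrolling the saved offset through the queries."""
    total, context_offset = 0.0, 0
    for w in words:
        log_prob, context_offset = lm.lookup(w, context_offset)  # reuse previous offset
        total += log_prob
    return total
</code>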
==== Conclusion ====

In summary, it was a paper worth reading and thoroughly discussed.