====== Faster and Smaller N-Gram Language Model ======

//Presenter: Joachim Daiber\\
Reporter: Long DT//\\
Date: 12 March 2012
+ | |||
+ | ==== Overview ==== | ||
+ | |||
+ | The talk is mainly about techniques to improve performance of N-gram language model. | ||
+ | How it will run faster and use smaller amount of memory. | ||

==== Encoding ====

**I. Encoding the count**

In the Web1T corpus, the most frequent n-gram occurs about 95 billion times, yet the corpus contains only about 770,000 unique count values.
=> Maintaining a value-rank array is therefore a good way to encode the counts.
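
Below is a minimal sketch of the value-rank idea (my own illustration, not the authors' code): instead of storing each raw count, which can be as large as tens of billions, we store its rank in a sorted array of the unique count values, so the stored numbers need far fewer bits.

<code python>
# Value-rank encoding of n-gram counts (illustrative sketch).
def build_value_rank(counts):
    """Return (sorted unique count values, rank of each input count)."""
    values = sorted(set(counts))                    # ~770,000 unique values in Web1T
    rank_of = {v: r for r, v in enumerate(values)}
    ranks = [rank_of[c] for c in counts]            # small integers instead of 64-bit counts
    return values, ranks

# Decoding a count is a single lookup into the value array.
values, ranks = build_value_rank([95_000_000_000, 42, 42, 7])
assert values[ranks[0]] == 95_000_000_000
</code>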

**II. Encoding the n-gram**

**//Context encoding//**

Encode the n-gram W1, W2, ..., Wn as the pair (c(W1, ..., Wn-1), Wn), where c is an offset function that maps the context (the preceding (n-1)-gram) to an integer; this is therefore called context encoding.
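
A minimal sketch of this encoding, assuming a word-id table and an offsets table that records where each (n-1)-gram sits in its array (both hypothetical names, not from the paper):

<code python>
# Context encoding: an n-gram (w1, ..., wn) is stored as the pair
# (offset of its (n-1)-gram context, id of the last word).
def context_encode(ngram, context_offsets, word_ids):
    """Encode an n-gram (n >= 2) given as a tuple of word strings.
    context_offsets maps a tuple of word ids (the first n-1 words) to the
    position of that (n-1)-gram in its sorted array."""
    context = tuple(word_ids[w] for w in ngram[:-1])
    return context_offsets[context], word_ids[ngram[-1]]

# Hypothetical tables for illustration only.
word_ids = {"the": 0, "cat": 1, "sat": 2}
bigram_offsets = {(0, 1): 17}        # "the cat" sits at offset 17 in the bigram array
print(context_encode(("the", "cat", "sat"), bigram_offsets, word_ids))   # (17, 2)
</code>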

**//Sorted Array//**

  + Use n arrays for an n-gram model (the i-th array is used for i-grams)
  + Each element in an array is a pair (w, c)
  + w : index of the last word in the unigram array
  + c : offset pointer (the context encoding of the preceding (i-1)-gram)

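A lookup in this layout can be sketched as follows (my own illustration under the assumptions above, not the paper's implementation): each array is kept sorted by its (context offset, word) pairs, so an n-gram is found by locating its context in the lower-order array and then binary-searching for the pair.

<code python>
import bisect

# arrays[i] holds the sorted (context_offset, word_id) pairs for (i+1)-grams;
# values[i] is a parallel list with the stored counts (or value ranks).
def lookup(arrays, values, word_id_seq):
    """Return the value stored for the n-gram given as a sequence of word ids,
    or None if the n-gram (or one of its prefixes) is not in the model."""
    context = 0                       # unigram entries use a dummy context of 0
    result = None
    for order, w in enumerate(word_id_seq):
        key = (context, w)
        entries = arrays[order]
        pos = bisect.bisect_left(entries, key)
        if pos == len(entries) or entries[pos] != key:
            return None
        result = values[order][pos]
        context = pos                 # this position becomes the context offset
                                      # at the next order (the offset pointer c)
    return result
</code>
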
Most of the attendants apparently understood the talk and the paper well, and a
lively discussion followed. One of our first topics of debate was the notion of
a skyline presented in the paper. The skyline was somewhat of a supervised element
-- the authors estimated initial parameters for a model from gold data and
trained it afterwards. They assumed that a model with parameters estimated from
gold data cannot be beaten by an unsupervisedly trained model. In fact, after
training the skyline model, its accuracy dropped very significantly. The reasons
for this were a point of surprise for us as well as for the paper's authors.

Complementary to the skyline, the authors presented a baseline which should
definitely be beaten by their final model. They called this baseline
"uninformed", but it was not clear to us what probabilities were actually
used in this model. We could only speculate that it was a uniform or random
probability distribution.

A point about unsupervised language modeling came up: many linguistic phenomena
are annotated in a way that is to some extent arbitrary and reflects the
linguistic theory used more than the language itself, and an unsupervised model
cannot hope to get them right. The example we discussed was which dependencies
the word "is" should take part in. The authors noticed that dependency
orientation in general was not a particularly strong point of their parser, and
so they also included an evaluation metric that ignored the dependency
orientations.

Perhaps the most crucial observation the authors made was that there is a limit
beyond which feeding more data to the model training hurts its accuracy. They
progressed from short sentences to longer ones, and identified the threshold
where it is best to start ignoring any further training data at sentences of
length 15. However, we were not 100% clear on how they computed this constant.
If the model were to be fully unsupervised, it would have to determine this
threshold on its own, because it cannot be safely assumed that it would be the
same for all languages and setups.

The writing style of the paper was also a matter of differing opinions.
Undeniably, it is written in a vocabulary-intensive fashion, bringing readers
face to face with words many of us had never seen before.

==== Conclusion ====

All in all, it was a paper worth reading, well presented, and thoroughly
discussed, bringing useful general ideas as well as interesting details.