====== Faster and Smaller N-Gram Language Model ======

//Presenter: Joachim Daiber\\
Reporter: Long DT//\\
Date: 12 March 2012
+ | |||
+ | ==== Overview ==== | ||
+ | |||
+ | The talk is mainly about techniques to improve performance of N-gram language model. | ||
+ | How it will run faster and use smaller amount of memory. | ||

==== Encoding ====

**I. Encoding the count**

In the Web1T corpus, the most frequent n-gram occurs about 95 billion times, yet the corpus contains only about 770,000 unique count values.
=> Maintaining a value-rank array is therefore a good way to encode the counts.
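
Below is a minimal sketch of the value-rank idea (my own illustration, not the authors' code): instead of storing each raw count, which can be as large as tens of billions, we store its rank in a sorted array of the unique count values, so the stored numbers need far fewer bits.

<code python>
# Value-rank encoding of n-gram counts (illustrative sketch).
def build_value_rank(counts):
    """Return (sorted unique count values, rank of each input count)."""
    values = sorted(set(counts))                    # ~770,000 unique values in Web1T
    rank_of = {v: r for r, v in enumerate(values)}
    ranks = [rank_of[c] for c in counts]            # small integers instead of 64-bit counts
    return values, ranks

# Decoding a count is a single lookup into the value array.
values, ranks = build_value_rank([95_000_000_000, 42, 42, 7])
assert values[ranks[0]] == 95_000_000_000
</code>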

**II. Encoding the n-gram**

**//Context encoding//**

Encode the n-gram W1, W2, ..., Wn as the pair (c(W1, ..., Wn-1), Wn), where c is an offset function that maps the context (the preceding (n-1)-gram) to an integer; this is therefore called context encoding.
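
A minimal sketch of this encoding, assuming a word-id table and an offsets table that records where each (n-1)-gram sits in its array (both hypothetical names, not from the paper):

<code python>
# Context encoding: an n-gram (w1, ..., wn) is stored as the pair
# (offset of its (n-1)-gram context, id of the last word).
def context_encode(ngram, context_offsets, word_ids):
    """Encode an n-gram (n >= 2) given as a tuple of word strings.
    context_offsets maps a tuple of word ids (the first n-1 words) to the
    position of that (n-1)-gram in its sorted array."""
    context = tuple(word_ids[w] for w in ngram[:-1])
    return context_offsets[context], word_ids[ngram[-1]]

# Hypothetical tables for illustration only.
word_ids = {"the": 0, "cat": 1, "sat": 2}
bigram_offsets = {(0, 1): 17}        # "the cat" sits at offset 17 in the bigram array
print(context_encode(("the", "cat", "sat"), bigram_offsets, word_ids))   # (17, 2)
</code>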

**//Sorted Array//**

  + Use n arrays for an n-gram model (the i-th array is used for i-grams)
  + Each element in an array is a pair (w, c)
  + w : index of the last word in the unigram array
  + c : offset pointer (the context encoding of the preceding (i-1)-gram)

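A lookup in this layout can be sketched as follows (my own illustration under the assumptions above, not the paper's implementation): each array is kept sorted by its (context offset, word) pairs, so an n-gram is found by locating its context in the lower-order array and then binary-searching for the pair.

<code python>
import bisect

# arrays[i] holds the sorted (context_offset, word_id) pairs for (i+1)-grams;
# values[i] is a parallel list with the stored counts (or value ranks).
def lookup(arrays, values, word_id_seq):
    """Return the value stored for the n-gram given as a sequence of word ids,
    or None if the n-gram (or one of its prefixes) is not in the model."""
    context = 0                       # unigram entries use a dummy context of 0
    result = None
    for order, w in enumerate(word_id_seq):
        key = (context, w)
        entries = arrays[order]
        pos = bisect.bisect_left(entries, key)
        if pos == len(entries) or entries[pos] != key:
            return None
        result = values[order][pos]
        context = pos                 # this position becomes the context offset
                                      # at the next order (the offset pointer c)
    return result
</code>
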
Most of the attendants apparently understood the talk and the paper well, and a
lively discussion followed. One of our first topics of debate was the notion of
a skyline presented in the paper. The skyline was somewhat of a supervised element
-- the authors estimated initial parameters for a model from gold data and
trained it afterwards. They assumed that a model with parameters estimated from
gold data cannot be beaten by an unsupervisedly trained model. In fact, after
training the skyline model, its accuracy dropped very significantly. The reasons
for this were a point of surprise for us as well as for the paper's authors.

Complementary to the skyline, the authors presented a baseline which should
definitely be beaten by their final model. They called this baseline
"uninformed", but it was not clear to us what probabilities were actually
used in this model. We could only speculate that it was a uniform or random
probability distribution.

A point about unsupervised language modeling came up: many linguistic phenomena
are annotated in a way that is to some extent arbitrary and reflects the
linguistic theory used more than the language itself, and an unsupervised model
cannot hope to get them right. The example we discussed was which dependencies
the word "is" should take part in. The authors noticed that dependency
orientation in general was not a particularly strong point of their parser, and
so they also included an evaluation metric that ignored the dependency
orientations.

Perhaps the most crucial observation the authors made was that there is a limit
beyond which feeding more data to the model training hurts its accuracy. They
progressed from short sentences to longer ones, and identified the threshold
where it is best to start ignoring any further training data at sentences of
length 15. However, we were not 100% clear on how they computed this constant.
If the model were to be fully unsupervised, it would have to determine this
threshold on its own, because it cannot be safely assumed that it would be the
same for all languages and setups.

The writing style of the paper was also a matter of differing opinions.
Undeniably, it is written in a vocabulary-intensive fashion, bringing readers
face to face with words many of us had never seen before.

==== Conclusion ====

All in all, it was a paper worth reading, well presented, and thoroughly
discussed, bringing useful general ideas as well as interesting details.