====== Faster and Smaller N-Gram Language Model ======
//Presenter: Joachim Daiber
Reporter: Long DT//
Date: 12-March-2012\\
  
==== Overview ====
  
The talk is mainly about techniques to improve the performance of N-gram language models: how to make them run faster and use a smaller amount of memory.
  
==== Encoding ====

**I. Encoding the count**
  
In the Web1T corpus, the most frequent n-gram occurs 95 billion times, but there are only about 770,000 unique count values.
=> Maintaining a value-rank array is a good way to encode the counts.
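
A minimal sketch of the value-rank idea (the names and numbers are illustrative, not the paper's actual code): the few distinct count values live once in a shared value array, and each n-gram stores only the small rank of its count in that array.

<code python>
# Sketch of value-rank encoding of n-gram counts (illustrative values).
# The distinct count values are kept once in a shared "value array";
# each n-gram then stores only the rank of its count in that array.

counts = [95_000_000_000, 13, 7, 13, 95_000_000_000, 7, 7]

unique_values = sorted(set(counts))                 # value array, e.g. [7, 13, 95000000000]
rank_of = {v: r for r, v in enumerate(unique_values)}

ranks = [rank_of[c] for c in counts]                # small ranks stored per n-gram

def count_of(i):
    """Recover the original count of the i-th n-gram from its rank."""
    return unique_values[ranks[i]]

assert count_of(0) == 95_000_000_000
</code>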
  
**II. Encoding the n-gram**
  
**//Idea//**
Encode W1,W2,...,Wn as the pair (c(W1,W2,...,W(n-1)), Wn), where c
is the offset function mapping the context to its position in the (n-1)-gram table; this is the so-called context encoding.
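
A small sketch of how such a context encoding could look (plain dictionaries stand in for the paper's packed arrays; the word ids and offsets are made up):

<code python>
# Sketch of context encoding: the n-gram (W1, ..., Wn) is stored as the pair
#   ( c(W1, ..., W(n-1)), Wn )
# where c(.) gives the offset of the context in the (n-1)-gram table.

unigram_id = {"the": 0, "cat": 1, "sat": 2}

# offsets already assigned to the stored bigrams (illustrative values)
bigram_offset = {(0, 1): 0,   # "the cat"
                 (1, 2): 1}   # "cat sat"

def encode_trigram(w1, w2, w3):
    """Return the context encoding (context offset, last word id) of a trigram."""
    context = (unigram_id[w1], unigram_id[w2])
    return (bigram_offset[context], unigram_id[w3])

print(encode_trigram("the", "cat", "sat"))   # -> (0, 2)
</code>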
  
**//Implementation//**
  
__Sorted Array__
  
- Use n arrays for an n-gram model (the i-th array is used for i-grams)
- Each element in the array is a pair (w, c)
     w : index of the last word in the unigram array
     c : context offset pointer
- Sort based on w
  
Improvement: implicitly encode w. All n-grams ending with a particular word wi are stored contiguously, so storing w explicitly is wasteful; instead, maintain another array that saves the beginning and the end of each word's range.
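
A sketch of this layout for a single order (the numbers are made up, and ''bisect'' stands in for a hand-written binary search):

<code python>
# Sketch of the sorted-array layout with the last word encoded implicitly:
# entries ending in the same word form one contiguous range, so only the
# context offsets are stored, plus a per-word (begin, end) range table.
import bisect

context_offsets = [3, 7, 9,           # entries ending in word 0 (sorted)
                   1, 4]              # entries ending in word 1 (sorted)
word_range = {0: (0, 3), 1: (3, 5)}   # word id -> (begin, end) in the array

def lookup(last_word, context_offset):
    """Return the position of the entry (context_offset, last_word), or None."""
    begin, end = word_range[last_word]
    i = bisect.bisect_left(context_offsets, context_offset, begin, end)
    if i < end and context_offsets[i] == context_offset:
        return i
    return None

print(lookup(0, 7))   # -> 1
print(lookup(1, 9))   # -> None
</code>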

__Hash Table__
Use open addressing (with linear probing).
Use about 40% extra space (empty slots) to avoid too many collisions.
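
A toy version of such a table (pure Python and illustrative only; the real structure packs keys and values into flat arrays):

<code python>
# Sketch of open addressing with linear probing, sized ~40% larger than the
# number of keys so that probe chains stay short.

class ProbingTable:
    def __init__(self, num_keys, extra=0.4):
        self.size = int(num_keys * (1 + extra)) + 1
        self.keys = [None] * self.size
        self.values = [None] * self.size

    def _slot(self, key):
        i = hash(key) % self.size
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.size          # linear probing: try the next slot
        return i

    def put(self, key, value):
        i = self._slot(key)
        self.keys[i], self.values[i] = key, value

    def get(self, key):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else None

table = ProbingTable(num_keys=3)
table.put((0, 2), 42)          # key = context encoding (context offset, word id)
print(table.get((0, 2)))       # -> 42
print(table.get((5, 1)))       # -> None
</code>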

**//Encoding Improvements//**

__Variable length coding__

Idea :
Context offsets tend to be close to each other, => so save only the first offset's address and, for the others, just the difference from it.

Question :
How are the unigrams sorted?
Martin suggested that they must be sorted by frequency.
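
A rough sketch of the idea with a base-128 ("varint") byte code; the exact code used in the paper differs, so this is only meant to show why storing differences helps:

<code python>
# Sketch of delta + variable-length encoding of context offsets: store the
# first offset in full, then only the small differences, each packed into as
# few 7-bit groups as needed.

def encode_varint(n):
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))   # high bit marks "more bytes follow"
        if not n:
            return bytes(out)

def encode_offsets(offsets):
    data = bytearray(encode_varint(offsets[0]))      # first offset in full
    for prev, cur in zip(offsets, offsets[1:]):
        data += encode_varint(cur - prev)            # then only the deltas
    return bytes(data)

offsets = [1_000_000, 1_000_003, 1_000_010, 1_000_012]
print(len(encode_offsets(offsets)))   # 6 bytes instead of 4 * 8 for raw 64-bit ints
</code>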

__Block Compression__
Compress the key/value array in blocks of 128 bytes;
the underlying reason is simply to still support binary search.
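
A very rough sketch of block-wise compression that keeps the first key of each block uncompressed so a block-level binary search remains possible (block size counted in keys rather than bytes, and ''zlib'' standing in for the paper's own coding):

<code python>
# Sketch of block compression: cut the sorted key array into fixed-size blocks,
# keep each block's first key as an uncompressed header for binary search, and
# decode only the one block that may contain the query.
import bisect, json, zlib

BLOCK_KEYS = 16                                  # illustrative block size

keys = list(range(0, 1000, 3))                   # sorted keys (stand-in data)
headers, blocks = [], []
for i in range(0, len(keys), BLOCK_KEYS):
    chunk = keys[i:i + BLOCK_KEYS]
    headers.append(chunk[0])                     # first key stays uncompressed
    blocks.append(zlib.compress(json.dumps(chunk).encode()))

def contains(key):
    b = bisect.bisect_right(headers, key) - 1    # binary search on the headers
    if b < 0:
        return False
    chunk = json.loads(zlib.decompress(blocks[b]))
    return key in chunk                          # scan inside the one block

print(contains(42), contains(43))   # -> True False
</code>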

==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process.
This simple implementation increases performance by 300% over a conventional implementation.
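
A minimal sketch of such a cache (a direct-mapped table wrapped around an arbitrary scoring function; the dummy scorer is just for the demo):

<code python>
# Sketch of the query cache: remember recently scored n-grams in a fixed-size
# direct-mapped table so that repeated queries skip the full lookup.

CACHE_SIZE = 1 << 16

class CachedLM:
    def __init__(self, full_lookup):
        self.full_lookup = full_lookup             # the expensive real lookup
        self.cache = [None] * CACHE_SIZE           # slots hold (ngram, score)

    def score(self, ngram):
        slot = hash(ngram) % CACHE_SIZE
        entry = self.cache[slot]
        if entry is not None and entry[0] == ngram:
            return entry[1]                        # cache hit: no full lookup
        value = self.full_lookup(ngram)            # cache miss: compute and store
        self.cache[slot] = (ngram, value)
        return value

lm = CachedLM(lambda ngram: -float(len(ngram)))    # dummy scorer for the demo
print(lm.score(("the", "cat", "sat")))             # miss, computed
print(lm.score(("the", "cat", "sat")))             # hit, served from the cache
</code>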

**II. Exploiting Scrolling Queries**

We can quickly form the context encoding of the next query by concatenating the new word with the offset saved from the previous query.
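
A sketch of how scrolling could be exploited (the lookup function is only a stand-in; the point is that the saved offset replaces re-encoding the whole context):

<code python>
# Sketch of scrolling queries: consecutive queries during decoding overlap, so
# the offset returned by the previous lookup already encodes the shared context
# and the next query only needs to append the new word to it.

def lookup_offset(context_offset, word_id):
    """Stand-in for the real table lookup of (context + new word)."""
    return hash((context_offset, word_id)) % 1_000_000

def scroll_through(word_ids):
    offset = word_ids[0]                   # offset of the initial unigram
    offsets = [offset]
    for w in word_ids[1:]:
        offset = lookup_offset(offset, w)  # reuse the saved offset, add one word
        offsets.append(offset)             # each offset encodes one query's n-gram
    return offsets

print(scroll_through([0, 1, 2, 3]))
</code>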

==== Conclusion ====
All in all, it was a paper worth reading, well presented, and thoroughly
discussed, bringing useful general ideas as well as interesting details.
