====== Faster and Smaller N-Gram Language Model ======
//Presenter: Joachim Daiber
Reporter: Long DT//
Date: 12-March-2012\\
  
==== Overview ====
  
The talk is mainly about techniques to improve the performance of N-gram language models:
how to make them run faster and use a smaller amount of memory.
  
==== Encoding ====
**I. Encoding the count**
  
In the Web1T corpus, the most frequent n-gram occurs about 95 billion times, but there are only about 770 000 unique count values.
=> Maintaining a value rank array (each stored count is just a rank into a sorted array of the unique values) is a good way to encode counts.
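To make the value rank trick concrete, here is a toy sketch (the function names and list-based storage are my own illustration, not the paper's code):
<code python>
# Instead of storing each raw count (up to ~95 billion, i.e. ~37 bits), store a
# small rank that indexes into a sorted array of the unique count values.

def build_value_rank(counts):
    values = sorted(set(counts))                   # ~770 000 unique values in Web1T
    rank_of = {v: r for r, v in enumerate(values)}
    ranks = [rank_of[c] for c in counts]           # each rank fits in ~20 bits
    return values, ranks

def decode_count(values, rank):
    return values[rank]

values, ranks = build_value_rank([5, 42, 5, 1000000, 42])
assert decode_count(values, ranks[3]) == 1000000
</code>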
  
**II. Encoding the n-gram**
  
**//Idea//**
Encode the n-gram W1, W2, ..., Wn as the pair (c(W1, W2, ..., W(n-1)), Wn),
where c is an offset function mapping the (n-1)-gram context to an integer; this is the so-called context encoding.
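A minimal sketch of the context encoding idea, with the offset function c realized by plain dictionaries just for illustration (the paper's implementations use the sorted arrays and hash tables described below):
<code python>
# tables[n-1] holds n-grams; the offset of an n-gram is its index in its table,
# and its key is (offset of its (n-1)-gram prefix, last word).  Toy code only.

tables = [dict() for _ in range(5)]        # up to 5-grams here

def context_encode(word_ids):
    """Insert the n-gram (and its prefixes) and return its key (context_offset, last_word)."""
    context_offset = -1                    # sentinel: a unigram has an empty context
    for n, w in enumerate(word_ids):
        key = (context_offset, w)
        table = tables[n]
        if key not in table:
            table[key] = len(table)        # next free offset for this order
        if n < len(word_ids) - 1:
            context_offset = table[key]    # becomes the context of the next word
    return key

print(context_encode([7, 3, 9]))           # -> (offset of the bigram (7, 3), 9)
</code>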
  
**//Implementation//**
  
__Sorted Array__
  
- Use n arrays for an n-gram model (the i-th array is used for i-grams).
- Each element of an array is a pair (w, c):
     w : index of the n-gram's last word in the unigram array
     c : offset pointer (the context encoding of the prefix)
- Sort by w.
  
Improvement: encode w implicitly. All n-grams ending with a particular word wi are stored contiguously, so storing wi in every entry is wasteful; instead, maintain another array that saves the beginning and the end of each word's range.
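A hedged sketch of the sorted-array lookup without the implicit-word improvement (class and field names are mine, not the paper's):
<code python>
import bisect

# Entries of one order are (last_word, context_offset) pairs kept sorted, so an
# n-gram is found by binary search; its position then serves as the context
# offset for the next higher order.

class SortedNgramArray:
    def __init__(self, entries):
        self.entries = sorted(entries)

    def offset_of(self, last_word, context_offset):
        i = bisect.bisect_left(self.entries, (last_word, context_offset))
        if i < len(self.entries) and self.entries[i] == (last_word, context_offset):
            return i                       # offset of this n-gram
        return None                        # n-gram not present

# toy bigram table; context offsets here are unigram offsets
bigrams = SortedNgramArray([(3, 7), (9, 3), (9, 7)])
print(bigrams.offset_of(9, 3))             # -> 1
</code>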
__Hash Table__
Use open addressing (with linear probing).
Reserve about 40% extra space (empty slots) to keep collisions rare.
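A small sketch of the open-addressing map with linear probing (illustrative; the 40% slack corresponds to the extra space mentioned above):
<code python>
# Open addressing with linear probing; ~40% of the slots are left empty so that
# probe sequences stay short.

class LinearProbingMap:
    def __init__(self, expected, extra=0.4):
        self.size = int(expected * (1 + extra)) + 1
        self.keys = [None] * self.size     # keys: (context_offset, word) pairs
        self.vals = [None] * self.size     # values: e.g. count ranks or offsets

    def _slot(self, key):
        i = hash(key) % self.size
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.size        # linear probing: try the next slot
        return i

    def put(self, key, value):
        i = self._slot(key)
        self.keys[i], self.vals[i] = key, value

    def get(self, key):
        i = self._slot(key)
        return self.vals[i] if self.keys[i] == key else None

m = LinearProbingMap(expected=3)
m.put((0, 9), 17)
assert m.get((0, 9)) == 17
</code>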
**//Encoding Improvement//**

__Variable length coding__

Idea:
Context offsets tend to be close to each other => only the first offset is stored in full and the remaining ones are stored as small differences, which need fewer bits.

Question:
How should the unigrams be sorted?
Martin suggested that they must be sorted by frequency.
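A toy version of the delta idea using variable-byte codes (the paper's exact variable-length coding differs; this just shows why nearby offsets compress well):
<code python>
# Store each offset as the difference from the previous one (the first relative
# to zero, i.e. in full); small differences then need only one byte.

def encode_deltas(offsets):                # offsets assumed non-decreasing
    out, prev = bytearray(), 0
    for off in offsets:
        delta, prev = off - prev, off
        while True:                        # 7 data bits per byte, high bit = "more"
            byte, delta = delta & 0x7F, delta >> 7
            out.append(byte | (0x80 if delta else 0x00))
            if not delta:
                break
    return bytes(out)

def decode_deltas(data):
    offsets, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:                # last byte of this delta
            prev += value
            offsets.append(prev)
            value, shift = 0, 0
    return offsets

offs = [1000, 1003, 1010, 1010, 1042]
assert decode_deltas(encode_deltas(offs)) == offs
</code>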
__Block Compression__
The key/value array is compressed in blocks of 128 bytes;
the underlying reason for the fixed block size is only to still support binary search.
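A rough sketch of why fixed-size blocks keep binary search possible: each block records its first key uncompressed, the search runs over these headers, and only one block is decoded and scanned linearly (zlib and pickle stand in for the paper's own coding; block size is a toy value):
<code python>
import bisect, pickle, zlib

BLOCK_KEYS = 8                             # toy block size; the paper uses 128-byte blocks

def build_blocks(sorted_items):            # sorted_items: (key, value) pairs, keys ascending
    headers, blocks = [], []
    for i in range(0, len(sorted_items), BLOCK_KEYS):
        chunk = sorted_items[i:i + BLOCK_KEYS]
        headers.append(chunk[0][0])                    # first key kept uncompressed
        blocks.append(zlib.compress(pickle.dumps(chunk)))
    return headers, blocks

def lookup(headers, blocks, key):
    b = bisect.bisect_right(headers, key) - 1          # binary search on block headers
    if b < 0:
        return None
    for k, v in pickle.loads(zlib.decompress(blocks[b])):
        if k == key:                                   # linear scan inside one block
            return v
    return None

headers, blocks = build_blocks([(i * 3, i) for i in range(40)])
assert lookup(headers, blocks, 99) == 33
</code>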
==== Decoding ====
**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process.
This simple addition increases performance by 300% over the conventional implementation.
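A minimal sketch of such a cache as a direct-mapped array keyed by a hash of the query (the wrapped score() interface is an assumption, not the paper's API):
<code python>
# Repeated n-gram queries are answered from a small fixed-size cache; on a miss
# the real model is consulted and the slot is overwritten.

class CachedLM:
    def __init__(self, lm, cache_size=1 << 16):
        self.lm = lm                           # any object with a score(ngram) method
        self.keys = [None] * cache_size
        self.vals = [None] * cache_size

    def score(self, ngram):                    # ngram: tuple of word ids
        i = hash(ngram) % len(self.keys)       # direct-mapped: one slot per hash value
        if self.keys[i] == ngram:
            return self.vals[i]                # hit: skip the full lookup
        value = self.lm.score(ngram)           # miss: ask the underlying model
        self.keys[i], self.vals[i] = ngram, value
        return value
</code>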
**II. Exploiting Scrolling Queries**

We can quickly form the context encoding of the next query by combining the new word with the offset saved from the previous query.
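A sketch of the scrolling idea under an assumed interface (a lookup() that returns both the value and the suffix offset is my simplification, not the paper's API; truncation to the model order is omitted):
<code python>
# When scoring a sentence left to right, the suffix of one query is the context
# of the next, so the saved offset lets us skip re-encoding the shared words.

def scrolling_scores(lm, word_ids):
    context_offset = lm.empty_context()        # offset of the empty context
    for w in word_ids:
        value, context_offset = lm.lookup(context_offset, w)
        yield value                            # reuse the returned offset next time
</code>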
==== Conclusion ====
In summary, it was a paper worth reading, well presented, and thoroughly
discussed. The paper contains many detailed techniques that are definitely helpful for an actual implementation.
