====== Faster and Smaller N-Gram Language Model ======
//Presenter: Joachim Daiber
Reporter: Long DT//
Date: 12-March-2012\\
  
==== Overview ====
  
The talk is mainly about techniques to improve the performance of N-gram language models: how to make them run faster and use a smaller amount of memory.
  
==== Encoding ====

**I. Encoding the count**
  
In the Web1T corpus, the most frequent n-gram occurs 95 billion times, but there are only about 770,000 unique count values.
=> Maintaining a value-rank array is a good way to encode the counts.
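
A minimal sketch of the value-rank idea (the names and numbers are illustrative, not the paper's actual code): the few distinct count values live once in a shared value array, and each n-gram stores only the small rank of its count in that array.

<code python>
# Sketch of value-rank encoding of n-gram counts (illustrative values).
# The distinct count values are kept once in a shared "value array";
# each n-gram then stores only the rank of its count in that array.

counts = [95_000_000_000, 13, 7, 13, 95_000_000_000, 7, 7]

unique_values = sorted(set(counts))                 # value array, e.g. [7, 13, 95000000000]
rank_of = {v: r for r, v in enumerate(unique_values)}

ranks = [rank_of[c] for c in counts]                # small ranks stored per n-gram

def count_of(i):
    """Recover the original count of the i-th n-gram from its rank."""
    return unique_values[ranks[i]]

assert count_of(0) == 95_000_000_000
</code>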
  
**II. Encoding the n-gram**
  
**//Idea//**
Encode W1,W2,...,Wn as the pair (c(W1,W2,...,W(n-1)), Wn), where c
is the offset function mapping the context to its position in the (n-1)-gram table; this is the so-called context encoding.
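
A small sketch of how such a context encoding could look (plain dictionaries stand in for the paper's packed arrays; the word ids and offsets are made up):

<code python>
# Sketch of context encoding: the n-gram (W1, ..., Wn) is stored as the pair
#   ( c(W1, ..., W(n-1)), Wn )
# where c(.) gives the offset of the context in the (n-1)-gram table.

unigram_id = {"the": 0, "cat": 1, "sat": 2}

# offsets already assigned to the stored bigrams (illustrative values)
bigram_offset = {(0, 1): 0,   # "the cat"
                 (1, 2): 1}   # "cat sat"

def encode_trigram(w1, w2, w3):
    """Return the context encoding (context offset, last word id) of a trigram."""
    context = (unigram_id[w1], unigram_id[w2])
    return (bigram_offset[context], unigram_id[w3])

print(encode_trigram("the", "cat", "sat"))   # -> (0, 2)
</code>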
  
**//Implementation//**
  
__Sorted Array__
  
- Use n arrays for an n-gram model (the i-th array is used for i-grams)
- Each element in the array is a pair (w, c)
     w : index of the last word in the unigram array
     c : context offset pointer
- Sort based on w
  
Improvement: implicitly encode w. All n-grams ending with a particular word wi are stored contiguously, so storing w explicitly is wasteful; instead, maintain another array that saves the beginning and the end of each word's range.
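
A sketch of this layout for a single order (the numbers are made up, and ''bisect'' stands in for a hand-written binary search):

<code python>
# Sketch of the sorted-array layout with the last word encoded implicitly:
# entries ending in the same word form one contiguous range, so only the
# context offsets are stored, plus a per-word (begin, end) range table.
import bisect

context_offsets = [3, 7, 9,           # entries ending in word 0 (sorted)
                   1, 4]              # entries ending in word 1 (sorted)
word_range = {0: (0, 3), 1: (3, 5)}   # word id -> (begin, end) in the array

def lookup(last_word, context_offset):
    """Return the position of the entry (context_offset, last_word), or None."""
    begin, end = word_range[last_word]
    i = bisect.bisect_left(context_offsets, context_offset, begin, end)
    if i < end and context_offsets[i] == context_offset:
        return i
    return None

print(lookup(0, 7))   # -> 1
print(lookup(1, 9))   # -> None
</code>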

__Hash Table__
Use open addressing (with linear probing).
Use about 40% extra space (empty slots) to avoid too many collisions.
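
A toy version of such a table (pure Python and illustrative only; the real structure packs keys and values into flat arrays):

<code python>
# Sketch of open addressing with linear probing, sized ~40% larger than the
# number of keys so that probe chains stay short.

class ProbingTable:
    def __init__(self, num_keys, extra=0.4):
        self.size = int(num_keys * (1 + extra)) + 1
        self.keys = [None] * self.size
        self.values = [None] * self.size

    def _slot(self, key):
        i = hash(key) % self.size
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.size          # linear probing: try the next slot
        return i

    def put(self, key, value):
        i = self._slot(key)
        self.keys[i], self.values[i] = key, value

    def get(self, key):
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else None

table = ProbingTable(num_keys=3)
table.put((0, 2), 42)          # key = context encoding (context offset, word id)
print(table.get((0, 2)))       # -> 42
print(table.get((5, 1)))       # -> None
</code>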

**//Encoding Improvements//**

__Variable length coding__

Idea :
Context offsets tend to be close to each other, => so save only the first offset's address and, for the others, just the difference from it.

Question :
How are the unigrams sorted?
Martin suggested that they must be sorted by frequency.
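
A rough sketch of the idea with a base-128 ("varint") byte code; the exact code used in the paper differs, so this is only meant to show why storing differences helps:

<code python>
# Sketch of delta + variable-length encoding of context offsets: store the
# first offset in full, then only the small differences, each packed into as
# few 7-bit groups as needed.

def encode_varint(n):
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))   # high bit marks "more bytes follow"
        if not n:
            return bytes(out)

def encode_offsets(offsets):
    data = bytearray(encode_varint(offsets[0]))      # first offset in full
    for prev, cur in zip(offsets, offsets[1:]):
        data += encode_varint(cur - prev)            # then only the deltas
    return bytes(data)

offsets = [1_000_000, 1_000_003, 1_000_010, 1_000_012]
print(len(encode_offsets(offsets)))   # 6 bytes instead of 4 * 8 for raw 64-bit ints
</code>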

__Block Compression__
Compress the key/value array in blocks of 128 bytes;
the underlying reason is simply to still support binary search.
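
A very rough sketch of block-wise compression that keeps the first key of each block uncompressed so a block-level binary search remains possible (block size counted in keys rather than bytes, and ''zlib'' standing in for the paper's own coding):

<code python>
# Sketch of block compression: cut the sorted key array into fixed-size blocks,
# keep each block's first key as an uncompressed header for binary search, and
# decode only the one block that may contain the query.
import bisect, json, zlib

BLOCK_KEYS = 16                                  # illustrative block size

keys = list(range(0, 1000, 3))                   # sorted keys (stand-in data)
headers, blocks = [], []
for i in range(0, len(keys), BLOCK_KEYS):
    chunk = keys[i:i + BLOCK_KEYS]
    headers.append(chunk[0])                     # first key stays uncompressed
    blocks.append(zlib.compress(json.dumps(chunk).encode()))

def contains(key):
    b = bisect.bisect_right(headers, key) - 1    # binary search on the headers
    if b < 0:
        return False
    chunk = json.loads(zlib.decompress(blocks[b]))
    return key in chunk                          # scan inside the one block

print(contains(42), contains(43))   # -> True False
</code>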

==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process.
This simple implementation increases performance by 300% over a conventional implementation.
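
A minimal sketch of such a cache (a direct-mapped table wrapped around an arbitrary scoring function; the dummy scorer is just for the demo):

<code python>
# Sketch of the query cache: remember recently scored n-grams in a fixed-size
# direct-mapped table so that repeated queries skip the full lookup.

CACHE_SIZE = 1 << 16

class CachedLM:
    def __init__(self, full_lookup):
        self.full_lookup = full_lookup             # the expensive real lookup
        self.cache = [None] * CACHE_SIZE           # slots hold (ngram, score)

    def score(self, ngram):
        slot = hash(ngram) % CACHE_SIZE
        entry = self.cache[slot]
        if entry is not None and entry[0] == ngram:
            return entry[1]                        # cache hit: no full lookup
        value = self.full_lookup(ngram)            # cache miss: compute and store
        self.cache[slot] = (ngram, value)
        return value

lm = CachedLM(lambda ngram: -float(len(ngram)))    # dummy scorer for the demo
print(lm.score(("the", "cat", "sat")))             # miss, computed
print(lm.score(("the", "cat", "sat")))             # hit, served from the cache
</code>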

**II. Exploiting Scrolling Queries**

We can quickly form the context encoding of the next query by concatenating the new word with the offset saved from the previous query.
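
A sketch of how scrolling could be exploited (the lookup function is only a stand-in; the point is that the saved offset replaces re-encoding the whole context):

<code python>
# Sketch of scrolling queries: consecutive queries during decoding overlap, so
# the offset returned by the previous lookup already encodes the shared context
# and the next query only needs to append the new word to it.

def lookup_offset(context_offset, word_id):
    """Stand-in for the real table lookup of (context + new word)."""
    return hash((context_offset, word_id)) % 1_000_000

def scroll_through(word_ids):
    offset = word_ids[0]                   # offset of the initial unigram
    offsets = [offset]
    for w in word_ids[1:]:
        offset = lookup_offset(offset, w)  # reuse the saved offset, add one word
        offsets.append(offset)             # each offset encodes one query's n-gram
    return offsets

print(scroll_through([0, 1, 2, 3]))
</code>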

==== Conclusion ====
All in all, it was a paper worth reading, well presented, and thoroughly
discussed, bringing useful general ideas as well as interesting details.
