[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
courses:rg:2012:longdtreport [2012/03/12 20:11]
longdt vytvořeno
courses:rg:2012:longdtreport [2012/03/12 22:59] (current)
longdt
Line 1: Line 1:
-Test+====== Faster and Smaller N-Gram Language Model ====== 
 +//Presenter :  Joachim Daiber 
 +Reporter: Long DT// 
 +Date : 12-March-2012\\ 
 + 
 +==== Overview ==== 
 + 
 +The talk is mainly about techniques to improve performance of N-gram language model.  
 +How it will run faster and use smaller amount of memory.  
 + 
 +==== Encoding ==== 
 +**I. Encoding the count** 
 + 
 +In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count.  
 +=> Maintain value rank array is a good way to encode count 
 + 
 +**II. Encoding the n-gram** 
 + 
 +**//Idea//** 
 +encode W1,W2....Wn = c(W1,W2...W(n-1)) Wn  
 +c is offset function, so call context encoding 
 + 
 +**//Implementation//** 
 + 
 +__Sorted Array__ 
 + 
 +- Use n array for n-gram model (array i-th is used for i-gram) 
 +- Each element in array in pair (w,c) 
 +     w : index of that word in unigram array  
 +     c : offset pointer 
 +- Sort base on w 
 + 
 +Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range 
 +            
 +__Hash Table__ 
 +Use open addressing (with linear probling) 
 +Use extra 40% space for auxiliary part (avoid collision) 
 +       
 +**//Encoding Improvement//**       
 + 
 +__Variable length coding__ 
 + 
 +Idea :  
 +Context offset tend to be close with each others, => Only save the first offsets address and the difference of others with the it 
 + 
 +Question :  
 +how the unigram be sorted ?  
 +Martin suggest that it must be sorted base on frequency 
 + 
 +__Block Compression__ 
 +compress the key/value array in the blocks of 128 bytes,  
 +the underline reason is only to support binary search 
 + 
 +==== Decoding ==== 
 +**I. Exploiting Repetitive Queries** 
 + 
 +The method use cache to speed up the process 
 +This simple implementation increase performance of 300% over conventional implementation 
 +**II. Exploiting Scrolling Queries** 
 + 
 +We can quickly form the context encoding of the next query by concatenating new words with saved offset from previous query 
 + 
 +==== Conclusion ==== 
 +In summary, it was a worth reading paper, well presented, and thoroughly 
 +discussed. Paper contain many detail techniques that definitely helpful for actual implementation.  

[ Back to the navigation ] [ Back to the content ]