====== Faster and Smaller N-Gram Language Models ======
//Presenter: Joachim Daiber\\
Reporter: Long DT\\
Date: 12-March-2012//

==== Overview ====

The talk is mainly about techniques to improve the performance of N-gram language models: how to make them run faster while using a smaller amount of memory.

==== Encoding ====

**I. Encoding the count**

In the Web1T corpus, the most frequent n-gram occurs 95 billion times, but the corpus contains only about 770,000 unique count values.
=> Maintaining a value-rank array is a good way to encode the counts.
**II. Encoding the n-gram**

**//Context encoding//**

Encode W1,W2,...,Wn as the pair (c(W1,...,Wn-1), Wn), where c is the offset function; this is the so-called context encoding.
**//Data structures//**

__Sorted Array__

  - Use n arrays for an n-gram model (the i-th array is used for i-grams)
  - Each element in the array is a pair (w, c), where w is the index of the word in the unigram array and c is the context offset pointer
  - Sort based on w

Improvement: implicitly encode w. All n-grams ending with a particular word wi are stored contiguously, so storing w in every entry is wasteful. Instead, maintain another array that saves the beginning and the end of each word's range (see the sketch below).
__Hash Table__

Use open addressing (with linear probing).
Use 40% extra space for the auxiliary part (to keep collisions rare and probe chains short).
**//Compression//**

__Variable length coding__

Idea:
Context offsets tend to be close to each other, so only save the first offset in full and encode the remaining ones as small differences from their neighbours (see the sketch below).
Question:
How are the unigrams sorted?
Martin suggested that they must be sorted by frequency.

__Block Compression__

Compress the key/value array in blocks of 128 bytes;
the underlying reason is to still support binary search (see the sketch below).
==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process.
This simple addition improves performance by 300% over a conventional implementation.
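
A minimal way to picture this (using Python's built-in memoization rather than the paper's hash-based cache):

<code python>
from functools import lru_cache

# Toy query cache: real text repeats the same n-grams constantly, so a
# memoized wrapper answers repeats without touching the main structure.

def slow_prob(ngram):
    # Stand-in for the actual language-model lookup.
    return 0.1 / len(ngram)

@lru_cache(maxsize=1 << 20)
def cached_prob(ngram):            # ngram must be hashable, e.g. a tuple
    return slow_prob(ngram)

print(cached_prob(("the", "quick", "fox")))   # computed once
print(cached_prob(("the", "quick", "fox")))   # served from the cache
</code>
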
**II. Exploiting Scrolling Queries**

We can quickly form the context encoding of the next query by pairing the new word with the offset saved from the previous query (see the sketch below).
==== Conclusion ====

In summary, it was a paper worth reading, well presented, and thoroughly discussed. The paper contains many detailed techniques that are definitely helpful for an actual implementation.
