Presenter : Joachim Daiber
Reporter: Long DT
Date : 12-March-2012
The talk is mainly about techniques to improve performance of N-gram language model.
How it will run faster and use smaller amount of memory.
I. Encoding the count
In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count.
⇒ Maintain value rank array is a good way to encode count
II. Encoding the n-gram
Idea
encode W1,W2….Wn = c(W1,W2…W(n-1)) Wn
c is offset function, so call context encoding
Implementation
Sorted Array
- Use n array for n-gram model (array i-th is used for i-gram)
- Each element in array in pair (w,c)
w : index of that word in unigram array c : offset pointer
- Sort base on w
Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored → wasteful. So, maintain another array save the beginning and the end of the range
Hash Table
Use open addressing (with linear probling)
Use extra 40% space for auxiliary part (avoid collision)
Encoding Improvement
Variable length coding
Idea :
Context offset tend to be close with each others, ⇒ Only save the first offsets address and the difference of others with the it
Question :
how the unigram be sorted ?
Martin suggest that it must be sorted base on frequency
Block Compression
compress the key/value array in the blocks of 128 bytes,
the underline reason is only to support binary search
I. Exploiting Repetitive Queries
The method use cache to speed up the process
This simple implementation increase performance of 300% over conventional implementation
II. Exploiting Scrolling Queries
We can quickly form the context encoding of the next query by concatenating new words with saved offset from previous query
In summary, it was a worth reading paper, well presented, and thoroughly
discussed. Paper contain many detail techniques that definitely helpful for actual implementation.