Table of Contents

Faster and Smaller N-Gram Language Model

Presenter : Joachim Daiber
Reporter: Long DT

Date : 12-March-2012

Overview

The talk is mainly about techniques to improve performance of N-gram language model.
How it will run faster and use smaller amount of memory.

Encoding

I. Encoding the count

In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count.
⇒ Maintain value rank array is a good way to encode count

II. Encoding the n-gram

Idea
encode W1,W2….Wn = c(W1,W2…W(n-1)) Wn
c is offset function, so call context encoding

Implementation

Sorted Array

- Use n array for n-gram model (array i-th is used for i-gram)
- Each element in array in pair (w,c)

   w : index of that word in unigram array 
   c : offset pointer

- Sort base on w

Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored → wasteful. So, maintain another array save the beginning and the end of the range

Hash Table
Use open addressing (with linear probling)
Use extra 40% space for auxiliary part (avoid collision)

Encoding Improvement

Variable length coding

Idea :
Context offset tend to be close with each others, ⇒ Only save the first offsets address and the difference of others with the it

Question :
how the unigram be sorted ?
Martin suggest that it must be sorted base on frequency

Block Compression
compress the key/value array in the blocks of 128 bytes,
the underline reason is only to support binary search

Decoding

I. Exploiting Repetitive Queries

The method use cache to speed up the process
This simple implementation increase performance of 300% over conventional implementation
II. Exploiting Scrolling Queries

We can quickly form the context encoding of the next query by concatenating new words with saved offset from previous query

Conclusion

In summary, it was a worth reading paper, well presented, and thoroughly
discussed. Paper contain many detail techniques that definitely helpful for actual implementation.