====== Faster and Smaller N-Gram Language Model ======

//Presenter: Joachim Daiber, Reporter: Long DT//\\
Date: 12-March-2012\\

==== Overview ====

The talk is mainly about techniques for improving the performance of n-gram language models: how to make them run faster and use a smaller amount of memory.

==== Encoding ====

**I. Encoding the counts**

In the Web1T corpus, the most frequent n-gram occurs 95 billion times, yet the corpus contains only about 770,000 unique count values. Maintaining a value-rank array is therefore a good way to encode the counts: store each distinct value once and keep only a small rank per n-gram (see the first sketch in the appendix below).

**II. Encoding the n-grams**

//Idea//: encode the n-gram W1, W2, ..., Wn as the pair (c(W1, W2, ..., W(n-1)), Wn), where c is an offset function; this is called context encoding.

//Implementation//

__Sorted Array__
  - Use n arrays for an n-gram model (the i-th array is used for the i-grams).
  - Each element in an array is a pair (w, c), where w is the index of the last word in the unigram array and c is the context-offset pointer.
  - Sort each array by w (a sketch of this layout appears in the appendix below).

Improvement: implicitly encode w. All n-grams ending with a particular word wi are stored contiguously, so storing w in every entry is wasteful. Instead, maintain another array that saves the beginning and the end of each word's range.

__Hash Table__
Use open addressing with linear probing, and reserve about 40% extra space as an auxiliary part to avoid too many collisions (see the appendix for a sketch).

//Encoding Improvements//

__Variable-Length Coding__
Idea: context offsets tend to be close to each other, so only save the first offset in full and store the others as differences (a delta-coding sketch is given in the appendix).

Question: how should the unigrams be sorted? Martin suggested that they must be sorted by frequency, so that frequent words get small indices and therefore short codes.

__Block Compression__
Compress the key/value arrays in blocks of 128 bytes; the underlying reason for the fixed block size is simply to still support binary search.

==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process: recently scored n-grams are kept in a small table and answered without touching the main data structure. Even this simple implementation improves performance by about 300% over the conventional implementation (see the cache sketch in the appendix).

**II. Exploiting Scrolling Queries**

Successive queries during decoding overlap, so we can quickly form the context encoding of the next query by concatenating the new word with the offset saved from the previous query (see the last sketch in the appendix).

==== Conclusion ====

In summary, it was a paper worth reading, well presented, and thoroughly discussed. The paper contains many detailed techniques that are definitely helpful for an actual implementation.
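==== Appendix: Illustrative Sketches ====

The sketches below are not from the paper or the talk; they are minimal Python illustrations of the techniques summarized above, with all names and interfaces invented for illustration.

**Value-rank count encoding.** A sketch of the count-encoding idea: keep one sorted array of the distinct count values and store only a rank per n-gram.

<code python>
from bisect import bisect_left

def build_value_rank(counts):
    """Map raw counts to small ranks into a sorted array of unique values."""
    values = sorted(set(counts))                     # e.g. ~770,000 uniques in Web1T
    ranks = [bisect_left(values, c) for c in counts]
    return values, ranks

def decode(values, rank):
    return values[rank]

counts = [95_000_000_000, 12, 7, 12, 95_000_000_000]
values, ranks = build_value_rank(counts)
assert [decode(values, r) for r in ranks] == counts
</code>

Since there are far fewer distinct values than n-grams, each rank fits in far fewer bits than a raw 64-bit count.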
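**Context encoding with sorted arrays.** A sketch of the sorted-array layout: each order stores (last word, context offset) pairs sorted lexicographically, and the row index found by binary search becomes the context offset used by the next-higher order. The function names ''build'' and ''lookup'' are hypothetical.

<code python>
from bisect import bisect_left

def build(ngrams, max_order):
    """ngrams: tuples of word ids of mixed orders; all prefixes must be present."""
    orders = [[] for _ in range(max_order)]   # orders[k]: sorted (w, ctx) pairs for (k+1)-grams
    by_order = [[] for _ in range(max_order)]
    for g in ngrams:
        by_order[len(g) - 1].append(g)
    for k in range(max_order):                # shortest first, so contexts already have offsets
        pairs = set()
        for g in by_order[k]:
            ctx = 0 if k == 0 else lookup(orders, g[:-1])
            pairs.add((g[-1], ctx))
        orders[k] = sorted(pairs)
    return orders

def lookup(orders, g):
    """Return the offset (row index) of n-gram g in its order's array."""
    ctx = 0
    for k, w in enumerate(g):
        arr = orders[k]
        i = bisect_left(arr, (w, ctx))
        if i == len(arr) or arr[i] != (w, ctx):
            raise KeyError(g)
        ctx = i                               # offset feeds the next order's key
    return ctx

orders = build([(1,), (2,), (3,), (1, 2), (2, 3), (1, 2, 3)], 3)
print(lookup(orders, (1, 2, 3)))              # 0: the trigram's row index
</code>

Sorting primarily by the last word also makes the "implicit w" improvement visible: all entries sharing a last word are adjacent, so w can be replaced by a per-word range stored in an auxiliary array.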
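**Open addressing with linear probing.** A sketch of the hash-table variant, sized with roughly 40% extra slots as mentioned above. The class name and interface are invented.

<code python>
class ProbingTable:
    def __init__(self, n_items):
        self.cap = int(n_items * 1.4) + 1     # ~40% auxiliary space keeps probes short
        self.keys = [None] * self.cap
        self.vals = [None] * self.cap

    def _slot(self, key):
        i = hash(key) % self.cap
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.cap            # linear probing: step to the next slot
        return i

    def put(self, key, val):
        i = self._slot(key)
        self.keys[i], self.vals[i] = key, val

    def get(self, key):
        i = self._slot(key)
        if self.keys[i] is None:
            raise KeyError(key)
        return self.vals[i]

t = ProbingTable(10)
t.put((3, 0), 42)                             # e.g. a context-encoded n-gram as key
print(t.get((3, 0)))                          # 42
</code>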
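**Variable-length coding of offsets.** A sketch of delta coding with a simple 7-bit varint. Here the gaps are taken between consecutive offsets, one common variant of the idea described above; since nearby offsets produce tiny gaps, most entries take a single byte instead of a full word.

<code python>
def encode_offsets(offsets):
    """Encode ascending offsets as the first value plus successive gaps, as varints."""
    out = bytearray()
    prev = 0
    for off in offsets:
        delta = off - prev
        prev = off
        while True:                           # little-endian 7-bit varint
            byte = delta & 0x7F
            delta >>= 7
            out.append(byte | (0x80 if delta else 0))
            if not delta:
                break
    return bytes(out)

def decode_offsets(data):
    offsets, cur, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:                       # continuation bit set: more bytes follow
            shift += 7
        else:
            cur += delta
            offsets.append(cur)
            delta, shift = 0, 0
    return offsets

offs = [1000, 1003, 1004, 1010]
assert decode_offsets(encode_offsets(offs)) == offs
</code>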
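**Caching repetitive queries.** A sketch of the decoding-time cache: a small direct-mapped table in front of the language model. ''lm.score'' is an assumed interface, not the paper's API.

<code python>
class CachedLM:
    def __init__(self, lm, size=1 << 16):
        self.lm = lm                          # lm.score(ngram) is an assumed interface
        self.size = size
        self.keys = [None] * size
        self.vals = [0.0] * size

    def score(self, ngram):
        i = hash(ngram) % self.size
        if self.keys[i] == ngram:             # hit: skip the main data structure
            return self.vals[i]
        v = self.lm.score(ngram)              # miss: compute and overwrite the slot
        self.keys[i], self.vals[i] = ngram, v
        return v
</code>

Repeated n-grams are common during decoding, which is why even this overwrite-on-collision cache yields the large speed-up reported in the talk.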
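**Scrolling queries.** Reusing ''orders'' and ''lookup'' from the sorted-array sketch above: once the offset of a context is saved, extending the query by one word costs a single binary search instead of a full walk from the unigrams.

<code python>
from bisect import bisect_left

ctx = lookup(orders, (1, 2))                  # offset saved from the previous query
key = (3, ctx)                                # context encoding of the next query (1, 2, 3)
arr = orders[2]
i = bisect_left(arr, key)
print(i < len(arr) and arr[i] == key)         # True
</code>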