====== Faster and Smaller N-Gram Language Model ======

//Presenter: Joachim Daiber, Reporter: Long DT//\\
Date: 12-March-2012\\

==== Overview ====

The talk is mainly about techniques for improving the performance of n-gram language models: how to make them run faster and use a smaller amount of memory.

==== Encoding ====

**I. Encoding the counts**

In the Web1T corpus, the most frequent n-gram occurs 95 billion times, yet the corpus contains only about 770,000 unique count values. Maintaining a value-rank array is therefore a good way to encode the counts: store each distinct value once and keep only a small rank per n-gram (see the first sketch in the appendix below).

**II. Encoding the n-grams**

//Idea//: encode the n-gram W1, W2, ..., Wn as the pair (c(W1, W2, ..., W(n-1)), Wn), where c is an offset function; this is called context encoding.

//Implementation//

__Sorted Array__
  - Use n arrays for an n-gram model (the i-th array is used for the i-grams).
  - Each element in an array is a pair (w, c), where w is the index of the last word in the unigram array and c is the context-offset pointer.
  - Sort each array by w (a sketch of this layout appears in the appendix below).

Improvement: implicitly encode w. All n-grams ending with a particular word wi are stored contiguously, so storing w in every entry is wasteful. Instead, maintain another array that saves the beginning and the end of each word's range.

__Hash Table__
Use open addressing with linear probing, and reserve about 40% extra space as an auxiliary part to avoid too many collisions (see the appendix for a sketch).

//Encoding Improvements//

__Variable-Length Coding__
Idea: context offsets tend to be close to each other, so only save the first offset in full and store the others as differences (a delta-coding sketch is given in the appendix).

Question: how should the unigrams be sorted? Martin suggested that they must be sorted by frequency, so that frequent words get small indices and therefore short codes.

__Block Compression__
Compress the key/value arrays in blocks of 128 bytes; the underlying reason for the fixed block size is simply to still support binary search.

==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process: recently scored n-grams are kept in a small table and answered without touching the main data structure. Even this simple implementation improves performance by about 300% over the conventional implementation (see the cache sketch in the appendix).

**II. Exploiting Scrolling Queries**

Successive queries during decoding overlap, so we can quickly form the context encoding of the next query by concatenating the new word with the offset saved from the previous query (see the last sketch in the appendix).

==== Conclusion ====

In summary, it was a paper worth reading, well presented, and thoroughly discussed. The paper contains many detailed techniques that are definitely helpful for an actual implementation.
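==== Appendix: Illustrative Sketches ====

The sketches below are not from the paper or the talk; they are minimal Python illustrations of the techniques summarized above, with all names and interfaces invented for illustration.

**Value-rank count encoding.** A sketch of the count-encoding idea: keep one sorted array of the distinct count values and store only a rank per n-gram.

<code python>
from bisect import bisect_left

def build_value_rank(counts):
    """Map raw counts to small ranks into a sorted array of unique values."""
    values = sorted(set(counts))                     # e.g. ~770,000 uniques in Web1T
    ranks = [bisect_left(values, c) for c in counts]
    return values, ranks

def decode(values, rank):
    return values[rank]

counts = [95_000_000_000, 12, 7, 12, 95_000_000_000]
values, ranks = build_value_rank(counts)
assert [decode(values, r) for r in ranks] == counts
</code>

Since there are far fewer distinct values than n-grams, each rank fits in far fewer bits than a raw 64-bit count.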
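**Context encoding with sorted arrays.** A sketch of the sorted-array layout: each order stores (last word, context offset) pairs sorted lexicographically, and the row index found by binary search becomes the context offset used by the next-higher order. The function names ''build'' and ''lookup'' are hypothetical.

<code python>
from bisect import bisect_left

def build(ngrams, max_order):
    """ngrams: tuples of word ids of mixed orders; all prefixes must be present."""
    orders = [[] for _ in range(max_order)]   # orders[k]: sorted (w, ctx) pairs for (k+1)-grams
    by_order = [[] for _ in range(max_order)]
    for g in ngrams:
        by_order[len(g) - 1].append(g)
    for k in range(max_order):                # shortest first, so contexts already have offsets
        pairs = set()
        for g in by_order[k]:
            ctx = 0 if k == 0 else lookup(orders, g[:-1])
            pairs.add((g[-1], ctx))
        orders[k] = sorted(pairs)
    return orders

def lookup(orders, g):
    """Return the offset (row index) of n-gram g in its order's array."""
    ctx = 0
    for k, w in enumerate(g):
        arr = orders[k]
        i = bisect_left(arr, (w, ctx))
        if i == len(arr) or arr[i] != (w, ctx):
            raise KeyError(g)
        ctx = i                               # offset feeds the next order's key
    return ctx

orders = build([(1,), (2,), (3,), (1, 2), (2, 3), (1, 2, 3)], 3)
print(lookup(orders, (1, 2, 3)))              # 0: the trigram's row index
</code>

Sorting primarily by the last word also makes the "implicit w" improvement visible: all entries sharing a last word are adjacent, so w can be replaced by a per-word range stored in an auxiliary array.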
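**Open addressing with linear probing.** A sketch of the hash-table variant, sized with roughly 40% extra slots as mentioned above. The class name and interface are invented.

<code python>
class ProbingTable:
    def __init__(self, n_items):
        self.cap = int(n_items * 1.4) + 1     # ~40% auxiliary space keeps probes short
        self.keys = [None] * self.cap
        self.vals = [None] * self.cap

    def _slot(self, key):
        i = hash(key) % self.cap
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.cap            # linear probing: step to the next slot
        return i

    def put(self, key, val):
        i = self._slot(key)
        self.keys[i], self.vals[i] = key, val

    def get(self, key):
        i = self._slot(key)
        if self.keys[i] is None:
            raise KeyError(key)
        return self.vals[i]

t = ProbingTable(10)
t.put((3, 0), 42)                             # e.g. a context-encoded n-gram as key
print(t.get((3, 0)))                          # 42
</code>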
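**Variable-length coding of offsets.** A sketch of delta coding with a simple 7-bit varint. Here the gaps are taken between consecutive offsets, one common variant of the idea described above; since nearby offsets produce tiny gaps, most entries take a single byte instead of a full word.

<code python>
def encode_offsets(offsets):
    """Encode ascending offsets as the first value plus successive gaps, as varints."""
    out = bytearray()
    prev = 0
    for off in offsets:
        delta = off - prev
        prev = off
        while True:                           # little-endian 7-bit varint
            byte = delta & 0x7F
            delta >>= 7
            out.append(byte | (0x80 if delta else 0))
            if not delta:
                break
    return bytes(out)

def decode_offsets(data):
    offsets, cur, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:                       # continuation bit set: more bytes follow
            shift += 7
        else:
            cur += delta
            offsets.append(cur)
            delta, shift = 0, 0
    return offsets

offs = [1000, 1003, 1004, 1010]
assert decode_offsets(encode_offsets(offs)) == offs
</code>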
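**Caching repetitive queries.** A sketch of the decoding-time cache: a small direct-mapped table in front of the language model. ''lm.score'' is an assumed interface, not the paper's API.

<code python>
class CachedLM:
    def __init__(self, lm, size=1 << 16):
        self.lm = lm                          # lm.score(ngram) is an assumed interface
        self.size = size
        self.keys = [None] * size
        self.vals = [0.0] * size

    def score(self, ngram):
        i = hash(ngram) % self.size
        if self.keys[i] == ngram:             # hit: skip the main data structure
            return self.vals[i]
        v = self.lm.score(ngram)              # miss: compute and overwrite the slot
        self.keys[i], self.vals[i] = ngram, v
        return v
</code>

Repeated n-grams are common during decoding, which is why even this overwrite-on-collision cache yields the large speed-up reported in the talk.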
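**Scrolling queries.** Reusing ''orders'' and ''lookup'' from the sorted-array sketch above: once the offset of a context is saved, extending the query by one word costs a single binary search instead of a full walk from the unigrams.

<code python>
from bisect import bisect_left

ctx = lookup(orders, (1, 2))                  # offset saved from the previous query
key = (3, ctx)                                # context encoding of the next query (1, 2, 3)
arr = orders[2]
i = bisect_left(arr, key)
print(i < len(arr) and arr[i] == key)         # True
</code>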