


Faster and Smaller N-Gram Language Models

Presenter: Joachim Daiber
Reporter: Long DT

Date: 12-March-2012

Overview

The talk is mainly about techniques for improving the performance of N-gram language models:
how to make them run faster and use less memory.

Encoding

I. Encoding the count

In the Web1T corpus, the most frequent n-gram occurs about 95 billion times, yet there are only about 770 000 unique count values.
⇒ Maintaining a value rank array is a good way to encode counts: store each distinct count once in a sorted array and replace every count by its rank in that array (a rank fits in about 20 bits).
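
A minimal sketch of the value-rank idea in Python (the function names and the toy counts are illustrative, not taken from the paper):

<code python>
# Value-rank encoding of n-gram counts: keep each distinct count once in a
# sorted array and store, per n-gram, only the rank of its count.

def build_value_rank(counts):
    """Return (unique_values, ranks) such that unique_values[ranks[i]] == counts[i]."""
    unique_values = sorted(set(counts))
    rank_of = {v: r for r, v in enumerate(unique_values)}
    ranks = [rank_of[c] for c in counts]   # ranks need far fewer bits than raw counts
    return unique_values, ranks

counts = [95_000_000_000, 12, 3, 12, 95_000_000_000]   # toy counts
values, ranks = build_value_rank(counts)
assert [values[r] for r in ranks] == counts
</code>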

II. Encoding the n-gram

Idea
Encode the n-gram w1, w2, …, wn as the pair (c(w1, w2, …, w(n-1)), wn),
where c is an offset function that maps the (n-1)-gram context to an integer; this is called context encoding.
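
The offset function can be sketched with ordinary dictionaries (this only illustrates the encoding itself, not the paper's compact data structures):

<code python>
from collections import defaultdict

# Context encoding: each order n has its own table, and an n-gram is keyed by
# (offset of its (n-1)-gram context, id of its last word). Its own offset is
# simply its insertion index in that table.

tables = defaultdict(dict)   # order -> {(context_offset, last_word_id): offset}

def offset(ngram):
    """Return the offset of an n-gram (a tuple of word ids), inserting it if new."""
    if not ngram:
        return -1                              # the empty context
    key = (offset(ngram[:-1]), ngram[-1])      # encode via the prefix's offset
    table = tables[len(ngram)]
    if key not in table:
        table[key] = len(table)
    return table[key]

# "the cat sat" with word ids 0, 1, 2: inserts the unigram, bigram and trigram
print(offset((0, 1, 2)))
</code>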

Implementation

Sorted Array

- Use n arrays for an n-gram model (the i-th array stores the i-grams)
- Each element of an array is a pair (w, c)

  1. w : index of the n-gram's last word in the unigram array
  2. c : context offset pointer (offset of the (n-1)-gram prefix)

- Sort entries by w

Improvement : implicitly encode w. All n-grams ending in a particular word wi are stored contiguously, so repeating w in every entry is wasteful. Instead, maintain another array that stores, for each word, the beginning and the end of its range. A sketch of this layout follows.
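
A sketch of this layout for a single n-gram order, assuming word ids and context offsets are already known (the variable names are mine, not the paper's):

<code python>
import bisect

# Sorted-array storage with the last word encoded implicitly: all n-grams
# ending in the same word occupy a contiguous range [begin, end), recorded in
# a per-word array; within a range only the context offsets are stored, sorted,
# and looked up by binary search.

def build(pairs, vocab_size):
    """pairs: list of (last_word_id, context_offset). Returns (ranges, offsets)."""
    pairs = sorted(pairs)                    # sort by word, then by context offset
    offsets = [c for _, c in pairs]
    ranges = [(0, 0)] * vocab_size
    i = 0
    while i < len(pairs):
        w, j = pairs[i][0], i
        while j < len(pairs) and pairs[j][0] == w:
            j += 1
        ranges[w] = (i, j)                   # all n-grams ending in word w live here
        i = j
    return ranges, offsets

def lookup(ranges, offsets, last_word, context_offset):
    """Return the position of the n-gram, or -1 if it is not stored."""
    begin, end = ranges[last_word]
    pos = bisect.bisect_left(offsets, context_offset, begin, end)
    return pos if pos < end and offsets[pos] == context_offset else -1

ranges, offsets = build([(2, 5), (2, 9), (0, 1)], vocab_size=3)
print(lookup(ranges, offsets, 2, 9))         # stored -> its position
print(lookup(ranges, offsets, 1, 4))         # not stored -> -1
</code>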

Hash Table
Use open addressing (with linear probing)
Allocate about 40% extra space, so the table stays sparse and probe chains caused by collisions stay short
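
A small sketch of such a table (the 40% figure is taken from the talk; the class and everything else is illustrative):

<code python>
# Open addressing with linear probing for context-encoded keys: the key is the
# pair (context_offset, last_word_id), and the table is allocated ~40% larger
# than the number of keys so probe sequences stay short.

class ProbingTable:
    def __init__(self, num_keys, extra=0.4):
        self.size = int(num_keys * (1 + extra)) + 1
        self.keys = [None] * self.size
        self.values = [0] * self.size

    def _probe(self, key):
        i = hash(key) % self.size
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.size          # linear probing: try the next slot
        return i

    def put(self, key, value):
        i = self._probe(key)
        self.keys[i], self.values[i] = key, value

    def get(self, key):
        i = self._probe(key)
        return self.values[i] if self.keys[i] == key else None

t = ProbingTable(num_keys=3)
t.put((0, 7), 42)                            # (context_offset, last_word_id) -> value
print(t.get((0, 7)), t.get((1, 7)))
</code>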

Encoding Improvement

Variable length coding

Idea :
Context offsets of neighbouring entries tend to be close to each other, ⇒ so only the first offset is stored in full; each following offset is stored as its difference (delta) from the previous one, using a variable-length code.

Question :
How are the unigrams sorted?
Martin suggested that they must be sorted by frequency, so that frequent words get small IDs and compress well under the variable-length code.
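
The delta idea can be sketched with a simple variable-byte code (the paper uses its own variable-length bit code; this only shows why small deltas save space):

<code python>
# Delta coding of sorted context offsets with a variable-byte code: the first
# offset is stored in full, every following one as its difference from the
# previous offset, so close-together offsets shrink to one byte each.

def encode_varbyte(n):
    """Encode a non-negative integer in 7-bit groups; the high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def encode_offsets(offsets):
    data, prev = bytearray(), 0
    for off in offsets:
        data += encode_varbyte(off - prev)   # small deltas -> few bytes
        prev = off
    return bytes(data)

print(len(encode_offsets([1000000, 1000003, 1000007, 1000012])))   # 3 + 3*1 = 6 bytes
</code>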

Block Compression
Compress the key/value arrays in blocks of 128 bytes;
the underlying reason for the blocking is simply to keep binary search possible: each block starts with an uncompressed key, so the search runs over the block headers and only one block has to be decompressed per lookup.
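
A sketch of this scheme (using zlib, JSON and a per-block header list for illustration; the paper's block format and codes differ):

<code python>
import bisect, json, zlib

# Block compression that keeps binary search possible: the sorted keys are cut
# into fixed-size blocks, each block's first key stays uncompressed in a header
# array, binary search runs over the headers, and only one block is ever
# decompressed per lookup.

BLOCK = 16                                   # keys per block (128 bytes in the paper)

def build_blocks(sorted_keys):
    headers, blocks = [], []
    for i in range(0, len(sorted_keys), BLOCK):
        chunk = sorted_keys[i:i + BLOCK]
        headers.append(chunk[0])             # first key of the block, uncompressed
        blocks.append(zlib.compress(json.dumps(chunk).encode()))
    return headers, blocks

def contains(headers, blocks, key):
    b = bisect.bisect_right(headers, key) - 1        # binary search over headers
    if b < 0:
        return False
    chunk = json.loads(zlib.decompress(blocks[b]))   # decode just this block
    return key in chunk

headers, blocks = build_blocks(list(range(0, 200, 2)))
print(contains(headers, blocks, 42), contains(headers, blocks, 43))
</code>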

Conclusion

All in all, it was a paper worth reading, well presented, and thoroughly
discussed, bringing useful general ideas as well as interesting details.

