This is an old revision of the document!
Table of Contents
Faster and Smaller N-Gram Language Model
Presenter : Joachim Daiber
Reporter: Long DT
Date : 12-March-2012
Overview
The talk is mainly about techniques to improve performance of N-gram language model.
How it will run faster and use smaller amount of memory.
Encoding
I. Encoding the count
In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count.
⇒ Maintain value rank array is a good way to encode count
II. Encoding the n-gram
Idea
encode W1,W2….Wn = c(W1,W2…W(n-1)) Wn
c is offset function, so call context encoding
Implementation
Sorted Array
+ Use n array for n-gram model (array i-th is used for i-gram) + Each element in array in pair (w,c) + w : index of that word in unigram array + c : offset pointer
Most of the attendants apparently understood the talk and the paper well, and a
lively discussion followed. One of our first topics of debate was the notion of
skyline presented in the paper. The skyline was somewhat of a supervised element
– the authors estimated initial parameters for a model from gold data and
trained it afterwards. They assumed that a model with parameters estimated from
gold data cannot be beaten by an unsupervisedly trained model. Verily, after
training the skyline model, its accuracy dropped very significantly. The reasons
of this were a point of surprise for us as well as for the paper's authors.
Complementary to the skyline, the authors presented a baseline which should
definitely be beaten by their final model. This baseline, they called
“uninformed”, but were vague about which exact probability distribution they
used in this model. We could only speculate it was a uniform or random
probability distribution.
A point about unsupervised language modeling came out: Many linguistic phenomena
are annotated in a way that is to some extent arbitrary, and reflects more the
linguistic theory used than the language itself, and an unsupervised model
cannot hope to get them right. The example we discussed was whether the word
“should” is governing the verb it's bound with, or vice versa. The authors
noticed that dependency orientation in general was not a particularly strong
point of their parser, and so they also included an evaluation metric that
ignored the dependency orientations.
Perhaps the most crucial observations the authors made was that there is a limit
where feeding more data to the model training hurts its accuracy. They
progressed from short sentences to longer, and identified the threshold, where
it's best to start ignoring any more training data, at sentences of length 15.
However, we were not 100% clear how they computed this constant.
If the model was to be fully unsupervised, it remains a question, how to setup
this threshold, because it cannot be safely assumed that it would be the same
for all languages and setups.
The writing style of the paper was also a matter of differing opinions.
Undeniably, it is written in a vocabulary-intensive fashion, bringing readers
face to face with words like “unbridled” or “jettison”, which I personally had
never seen before.
Conclusion
All in all, it was a paper worth reading, well presented, and thoroughly
discussed, bringing useful general ideas as well as interesting details.