====== Faster and Smaller N-Gram Language Model ======

//Presenter: Joachim Daiber\\
Reporter: Long DT//\\
Date: 12-March-2012\\

==== Overview ====
The talk mainly covered how the n-gram language model described in the paper can be made to run faster and to take up less memory.

==== Encoding ====
**I. Encoding the count**

In the Web1T corpus the most frequent n-gram occurs roughly 95 billion times, yet the corpus contains only about 770,000 distinct count values.
=> Maintaining a value-rank array is a good way to encode the counts: each n-gram stores only the small rank of its count, and a separate table maps ranks back to the actual values.
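For illustration, a minimal Python sketch of the value-rank idea (mine, not the authors' code; the toy counts are invented):

<code python>
# Minimal sketch of value-rank encoding of n-gram counts (illustrative only).

def build_value_rank(counts):
    """Map each distinct count value to a small integer rank."""
    values = sorted(set(counts))            # ~770k distinct values in Web1T
    rank_of = {v: r for r, v in enumerate(values)}
    ranks = [rank_of[c] for c in counts]    # one small rank per n-gram
    return values, ranks

def decode_count(values, ranks, i):
    """Recover the original count of the i-th n-gram."""
    return values[ranks[i]]

if __name__ == "__main__":
    counts = [95_000_000_000, 42, 42, 7, 95_000_000_000]
    values, ranks = build_value_rank(counts)
    assert decode_count(values, ranks, 0) == 95_000_000_000
    print(values, ranks)   # [7, 42, 95000000000] [2, 1, 1, 0, 2]
</code>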
**II. Encoding the n-grams**
**//Context Encoding//**

Encode the n-gram W1,W2,...,Wn as the pair (c(W1,W2,...,W(n-1)), Wn),
where c is an offset function that maps the context W1,...,W(n-1) to an integer; this is called context encoding.
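As a rough illustration of how such an offset function can be built level by level (my own sketch, assuming every prefix of an n-gram is also stored):

<code python>
# Sketch of context encoding: an n-gram W1..Wn is identified by
# (offset of its prefix W1..W(n-1), word id of Wn).

def context_encode(ngrams):
    """ngrams: iterable of word-id tuples, with all shorter prefixes included."""
    offset = {(): 0}                 # the empty context has offset 0
    tables = {}                      # order -> list of (context_offset, word_id)
    for gram in sorted(ngrams, key=len):
        key = (offset[gram[:-1]], gram[-1])
        table = tables.setdefault(len(gram), [])
        offset[gram] = len(table)    # position of this n-gram in its order's table
        table.append(key)
    return tables, offset

if __name__ == "__main__":
    # toy word ids: the=0, quick=1, fox=2
    grams = [(0,), (1,), (2,), (0, 1), (0, 1, 2)]
    tables, offset = context_encode(grams)
    print(tables[3])   # [(0, 2)] -> prefix "the quick" at offset 0, last word "fox"
</code>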
**//Implementation//**
__Sorted Array__

  - Use n arrays for an n-gram model (the i-th array is used for i-grams)
  - Each element in the array is a pair (w, c):
    * w: index of that word in the unigram array
    * c: offset pointer into the (n-1)-gram array (the context offset)
  - Sort based on w

Improvement: implicitly encode w. All n-grams ending with a particular word wi are stored contiguously, so storing w explicitly for each of them is wasteful. Instead, maintain another array that saves the beginning and the end of the range for each word.
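A small sketch of this layout for a single n-gram order (my own simplification, not the paper's Java code): entries ending in the same word are contiguous and store only their context offset, a per-word range table replaces the explicit w, and lookup is a binary search inside the word's range:

<code python>
import bisect

# Sorted-array layout for one n-gram order, with the word id encoded implicitly.

def build(entries):
    """entries: list of (word_id, context_offset) pairs for one order."""
    entries = sorted(entries)                     # by word, then context offset
    contexts = [c for _, c in entries]            # only the context offsets are stored
    ranges = {}                                   # word id -> (start, end) in contexts
    for i, (w, _) in enumerate(entries):
        start, _ = ranges.get(w, (i, i))
        ranges[w] = (start, i + 1)
    return contexts, ranges

def lookup(contexts, ranges, word_id, context_offset):
    """Return the position of the n-gram (its offset), or -1 if absent."""
    start, end = ranges.get(word_id, (0, 0))
    i = bisect.bisect_left(contexts, context_offset, start, end)
    return i if i < end and contexts[i] == context_offset else -1

if __name__ == "__main__":
    contexts, ranges = build([(5, 10), (5, 42), (7, 3)])
    print(lookup(contexts, ranges, 5, 42))   # 1
    print(lookup(contexts, ranges, 5, 11))   # -1
</code>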
__Hash Table__

Use open addressing (with linear probing).
Use about 40% extra space for empty slots (the auxiliary part), which keeps collisions rare.
**//Compression//**

__Variable length coding__

Idea: context offsets tend to be close to each other, so only the first offset is saved in full and the others are stored as their (small) differences from it, using a variable-length code.

Question: how are the unigrams sorted?\\
Martin suggested that they must be sorted based on frequency.
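One possible realization (a sketch using a standard base-128 varint, which is not necessarily the exact code used in the paper): the first offset is stored in full and the rest as small deltas:

<code python>
# Sketch: delta + variable-length (varint) coding of a sorted run of context offsets.

def varint_encode(n):
    """Base-128 varint: 7 payload bits per byte, high bit = 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_offsets(offsets):
    data = bytearray(varint_encode(offsets[0]))          # first offset in full
    for prev, cur in zip(offsets, offsets[1:]):
        data += varint_encode(cur - prev)                # small deltas -> few bytes
    return bytes(data)

if __name__ == "__main__":
    offsets = [1_000_000, 1_000_003, 1_000_010, 1_000_011]
    blob = encode_offsets(offsets)
    print(len(blob), "bytes instead of", 8 * len(offsets))   # 6 bytes vs 32
</code>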
| + | |||
| + | __Block Compression__ | ||
| + | compress the key/value array in the blocks of 128 bytes, | ||
| + | the underline reason is only to support binary search | ||
| + | |||
==== Decoding ====

**I. Exploiting Repetitive Queries**

The method uses a cache to speed up the process: the result of a query is stored and reused when the same n-gram is queried again.
This simple implementation increases performance by about 300% over the conventional implementation.
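For illustration, a sketch of such a cache in front of the language model lookup (the cache size and the stand-in scoring function are made up):

<code python>
# Sketch of a small direct-mapped cache in front of the (slow) LM lookup.

class CachedLM:
    def __init__(self, score_fn, cache_size=1 << 16):
        self.score_fn = score_fn                  # the real, slower lookup
        self.cache = [None] * cache_size          # slot -> (ngram, score)
        self.size = cache_size

    def score(self, ngram):
        slot = hash(ngram) % self.size
        entry = self.cache[slot]
        if entry is not None and entry[0] == ngram:
            return entry[1]                       # cache hit: skip the main structure
        value = self.score_fn(ngram)              # cache miss: do the full lookup
        self.cache[slot] = (ngram, value)
        return value

if __name__ == "__main__":
    lm = CachedLM(lambda ngram: -len(ngram) * 1.0)   # stand-in for the real model
    print(lm.score(("the", "quick", "fox")))          # miss, result gets cached
    print(lm.score(("the", "quick", "fox")))          # hit, no full lookup
</code>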
**II. Exploiting Scrolling Queries**

We can quickly form the context encoding of the next query by combining the new word with the offset saved from the previous query, so the shared context never has to be looked up again.
==== Conclusion ====

In summary, it was a paper worth reading and it was thoroughly discussed.