--- courses:rg:2012:longdtreport [2012/03/12 20:21]
longdt
+++ courses:rg:2012:longdtreport [2012/03/12 22:41]
longdt
@@ Line 6: / Line 6: @@
 ==== Overview ====
-The talk is mainly about technique to improve performance of N-gram language model.
+The talk is mainly about techniques to improve the performance of an N-gram language model.
 Specifically, how to make it run faster and use less memory.
-==== Notes ====
+==== Encoding ====
+**I. Encoding the count**
+In the Web1T corpus, the most frequent n-gram occurs 95 billion times, but the corpus contains only 770 000 unique count values.
+=> Maintaining a value rank array is a good way to encode the counts
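A minimal sketch of the value rank idea, assuming that ranks are assigned by how often each count value occurs (the notes do not say how the ranks are ordered). The names ''build_rank_table'', ''rank_to_count'' and ''count_to_rank'' are hypothetical and the counts are toy data. Each n-gram stores a small rank instead of its raw count. With about 770 000 distinct values a rank fits in 20 bits, whereas a raw count of 95 billion needs 37 bits.

```python
from collections import Counter


def build_rank_table(counts):
    """Return (rank_to_count, count_to_rank) for a list of n-gram counts.

    Assumption: the most common count values get the smallest ranks;
    ties are broken by the count value so the result is deterministic.
    """
    freq = Counter(counts)
    rank_to_count = sorted(freq, key=lambda value: (-freq[value], value))
    count_to_rank = {value: rank for rank, value in enumerate(rank_to_count)}
    return rank_to_count, count_to_rank


# Toy counts (hypothetical data, not taken from Web1T).
ngram_counts = [1, 1, 1, 2, 2, 40, 95_000_000_000]

rank_to_count, count_to_rank = build_rank_table(ngram_counts)

# Store the rank with each n-gram instead of the raw count.
encoded = [count_to_rank[c] for c in ngram_counts]

# Decoding is a single array lookup.
decoded = [rank_to_count[r] for r in encoded]
```

The rank table itself is stored once, so its cost is shared by all n-grams.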
+**II. Encoding the n-gram**
+//Idea//
+Encode W1,W2,...,Wn as the pair (c(W1,W2,...,W(n-1)), Wn)
+c is an offset function for the context W1,...,W(n-1), hence the name context encoding
+//Implementation//
+- Sorted Array
+  + Use n arrays for an n-gram model (the i-th array is used for the i-grams)
+  + Each element of an array is a pair (w,c)
+            + w : index of the last word in the unigram array
+            + c : offset pointer
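The sorted-array layout above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the talk: ''c'' is taken to be the offset of the context in the array one order below, unigrams use ''c = 0'', each array is sorted by the pair ''(w, c)'', and lookup uses binary search. ''build_arrays'', ''lookup'' and the toy data are hypothetical.

```python
from bisect import bisect_left


def lookup(arrays, vocab, ngram):
    """Return the offset of `ngram` in its array, or -1 if it is absent."""
    if not ngram or len(ngram) > len(arrays):
        return -1
    c = 0  # assumption: unigrams have no context, so they use offset 0
    for order, word in enumerate(ngram):
        w = vocab.get(word)
        if w is None:
            return -1
        key = (w, c)
        arr = arrays[order]
        i = bisect_left(arr, key)  # binary search in the sorted array
        if i == len(arr) or arr[i] != key:
            return -1
        c = i  # this offset is the context offset for the next order
    return c


def build_arrays(ngrams_by_order, vocab):
    """Build one sorted array of (w, c) pairs per n-gram order.

    ngrams_by_order[i] holds the (i+1)-grams as word tuples. The context of
    every n-gram must be present one order below.
    """
    arrays = []
    for order, ngrams in enumerate(ngrams_by_order):
        pairs = []
        for ng in ngrams:
            c = 0 if order == 0 else lookup(arrays, vocab, ng[:-1])
            if c < 0:
                raise ValueError(f"context of {ng} is missing")
            pairs.append((vocab[ng[-1]], c))
        arrays.append(sorted(pairs))
    return arrays


# Toy model (hypothetical data): word ids equal positions in the unigram array.
vocab = {"cat": 0, "sat": 1, "the": 2}
ngrams_by_order = [
    [("cat",), ("sat",), ("the",)],
    [("the", "cat"), ("cat", "sat")],
    [("the", "cat", "sat")],
]
arrays = build_arrays(ngrams_by_order, vocab)
trigram_offset = lookup(arrays, vocab, ("the", "cat", "sat"))
missing_offset = lookup(arrays, vocab, ("cat", "the"))
```

Because each element stores only a word index and one offset, its size does not grow with the n-gram order.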
 Most of the attendants apparently understood the talk and the paper well, and a
 lively discussion followed. One of our first topics of debate was the notion of

Institute of Formal and Applied Linguistics Wiki