[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
courses:rg:2012:longdtreport [2012/03/12 22:41]
longdt
courses:rg:2012:longdtreport [2012/03/12 22:47]
longdt
Line 11: Line 11:
 ==== Encoding ==== ==== Encoding ====
 **I. Encoding the count** **I. Encoding the count**
 +
 In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count.  In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count. 
 => Maintain value rank array is a good way to encode count => Maintain value rank array is a good way to encode count
 +
 **II. Encoding the n-gram** **II. Encoding the n-gram**
 +
 **//Idea//** **//Idea//**
 encode W1,W2....Wn = c(W1,W2...W(n-1)) Wn  encode W1,W2....Wn = c(W1,W2...W(n-1)) Wn 
 c is offset function, so call context encoding c is offset function, so call context encoding
-**//Implementation// + 
-** +**//Implementation//** 
-Sorted Array+ 
 +__Sorted Array__
   + Use n array for n-gram model (array i-th is used for i-gram)   + Use n array for n-gram model (array i-th is used for i-gram)
   + Each element in array in pair (w,c)   + Each element in array in pair (w,c)
             + w : index of that word in unigram array             + w : index of that word in unigram array
             + c : offset pointer             + c : offset pointer
-            +  + Sort base on w 
 +Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range 
 +            
 +__Hash Table__ 
 +Use open addressing (with linear probling) 
 +Use extra 40% space for auxiliary part (avoid collision) 
 +       
 +**//Encoding Improvement//**       
 + 
 +__Variable length coding__
      
 Most of the attendants apparently understood the talk and the paper well, and a Most of the attendants apparently understood the talk and the paper well, and a

[ Back to the navigation ] [ Back to the content ]