[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision Both sides next revision
courses:rg:2012:longdtreport [2012/03/12 22:45]
longdt
courses:rg:2012:longdtreport [2012/03/12 22:47]
longdt
Line 11: Line 11:
 ==== Encoding ==== ==== Encoding ====
 **I. Encoding the count** **I. Encoding the count**
 +
 In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count.  In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count. 
 => Maintain value rank array is a good way to encode count => Maintain value rank array is a good way to encode count
 +
 **II. Encoding the n-gram** **II. Encoding the n-gram**
  
Line 21: Line 23:
 **//Implementation//** **//Implementation//**
  
-Sorted Array+__Sorted Array__
   + Use n array for n-gram model (array i-th is used for i-gram)   + Use n array for n-gram model (array i-th is used for i-gram)
   + Each element in array in pair (w,c)   + Each element in array in pair (w,c)
Line 29: Line 31:
 Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range
                        
-Hash Table            +__Hash Table__ 
 +Use open addressing (with linear probling) 
 +Use extra 40% space for auxiliary part (avoid collision) 
 +       
 +**//Encoding Improvement//**       
 + 
 +__Variable length coding__
      
 Most of the attendants apparently understood the talk and the paper well, and a Most of the attendants apparently understood the talk and the paper well, and a

[ Back to the navigation ] [ Back to the content ]