Differences

This shows you the differences between two versions of the page.

--- courses:rg:2012:longdtreport [2012/03/12 22:44]
longdt
+++ courses:rg:2012:longdtreport [2012/03/12 22:47]
longdt
@@ Line 11: / Line 11: @@
 ==== Encoding ====
 **I. Encoding the count**
 In web1T corpus, the most frequent n-gram is 95 billion times, but contain only 770 000 unique count.
 => Maintain value rank array is a good way to encode count
 **II. Encoding the n-gram**
@@ Line 21: / Line 23: @@
 **//Implementation//**
-Sorted Array
+__Sorted Array__
   + Use n array for n-gram model (array i-th is used for i-gram)
   + Each element in array in pair (w,c)
@@ Line 27: / Line 29: @@
             + c : offset pointer
   + Sort base on w
-  //Improvement// : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range
+Improvement : Implicitly encode W (all n-gram ending with particular word wi are stored -> wasteful. So, maintain another array save the beginning and the end of the range
-Hash Table
+__Hash Table__
+Use open addressing (with linear probling)
+Use extra 40% space for auxiliary part (avoid collision)
+**//Encoding Improvement//**
+__Variable length coding__
 Most of the attendants apparently understood the talk and the paper well, and a

Institute of Formal and Applied Linguistics Wiki