
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

courses:rg:extracting-parallel-sentences-from-comparable-corpora [2011/05/22 18:33]
ivanova created
courses:rg:extracting-parallel-sentences-from-comparable-corpora [2011/05/22 18:39]
ivanova
Line 1: Line 1:
 ====== Introduction ======
- 
 The article is about parallel sentence extraction from Wikipedia. This resource can be viewed as a comparable corpus in which the document alignment is already provided by the interwiki links.
  
 ====== Training models ======
- 
 The authors train three models:
   * binary classifier model;
   * ranking model;
-  * Conditional Random Field (CRF) model
+  * conditional random field (CRF) model.
 When the binary classifier is used, there is a substantial class imbalance: O(n) positive examples and O(n²) negative examples.
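 The imbalance can be made concrete with a small count. The sketch below assumes a document pair with a known number of truly parallel sentence pairs, where every other source/target combination is a negative example; the function name and the numbers are illustrative and not taken from the article.

```python
def count_candidate_pairs(n_source, n_target, n_parallel):
    """Return (positives, negatives) among all source/target sentence pairs.

    Assumption: every one of the n_source * n_target combinations is a
    candidate, and exactly n_parallel of them are true translations.
    """
    total = n_source * n_target
    return n_parallel, total - n_parallel


# Two 100-sentence documents with 100 parallel pairs.
positives, negatives = count_candidate_pairs(100, 100, 100)
print(positives, negatives)
```

 With documents of similar length n, the positives grow linearly and the negatives quadratically, which is the O(n) versus O(n²) gap described above.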
  
Line 18: Line 16:
  
 ===== Category 1: Features derived from word alignment =====
- 
+  - log probability of the alignment;
 +  - number of aligned/unaligned words; 
 +  - longest aligned/unaligned sequence of words; 
 +  - sentence length; 
 +  - the difference in relative document position of the two sentences. 
 +The last two features are independent of word alignment. All of these features are defined on sentence pairs and are included in the binary classification and ranking models.
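 As a rough illustration of the Category 1 features, the sketch below computes them for one sentence pair. It assumes the word alignment is given as a list of (source index, target index) links and only counts on the source side; the alignment log probability would come from an alignment model and is left out. All names here are hypothetical and not from the article.

```python
def longest_run(flags, value):
    """Length of the longest contiguous run of `value` in `flags`."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag == value else 0
        best = max(best, current)
    return best


def alignment_features(src_len, tgt_len, links,
                       src_pos, src_doc_len, tgt_pos, tgt_doc_len):
    """Sentence-pair features from word-alignment links (source side only).

    links: iterable of (src_index, tgt_index) pairs.
    src_pos / tgt_pos: sentence index inside its document.
    """
    aligned_src = {i for i, _ in links}
    src_flags = [i in aligned_src for i in range(src_len)]
    n_aligned = sum(src_flags)
    return {
        "aligned_src": n_aligned,
        "unaligned_src": src_len - n_aligned,
        "longest_aligned_src": longest_run(src_flags, True),
        "longest_unaligned_src": longest_run(src_flags, False),
        "src_len": src_len,
        "tgt_len": tgt_len,
        # Difference in relative document position of the two sentences.
        "rel_pos_diff": abs(src_pos / src_doc_len - tgt_pos / tgt_doc_len),
    }


features = alignment_features(5, 4, [(0, 0), (1, 1), (4, 3)], 2, 10, 3, 10)
print(features)
```

 The same counts could be repeated for the target side; the sketch keeps one side to stay short.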
 + 
 +===== Category 2: Distortion features =====
 +One set of features bins distances between previous and current aligned sentences. Another set of features looks at the absolute difference between the expected position (one after the previous aligned sentence) and the actual position. 
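 The two distortion feature sets described above can be sketched as follows. The bin edges, the use of the absolute jump size for binning, and the function name are assumptions made for illustration; the article's actual binning is not given here.

```python
def distortion_features(prev_target, cur_target, bin_edges=(1, 2, 4, 8)):
    """Distortion features for two consecutive aligned sentences.

    prev_target / cur_target: target-side positions of the previous and
    current aligned sentences. bin_edges are hypothetical upper bounds.
    """
    distance = abs(cur_target - prev_target)

    # First feature set: which bin the jump distance falls into.
    # Distances above the last edge go into an overflow bin.
    jump_bin = len(bin_edges)
    for index, edge in enumerate(bin_edges):
        if distance <= edge:
            jump_bin = index
            break

    # Second feature set: absolute difference between the expected position
    # (one after the previous aligned sentence) and the actual position.
    expected = prev_target + 1
    return {"jump_bin": jump_bin, "abs_offset": abs(cur_target - expected)}


print(distortion_features(3, 4))
print(distortion_features(3, 9))
```

 A monotone alignment that simply advances by one sentence gives an offset of zero, so larger values mark reordering or skipped sentences.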
 + 
 +===== Category 3: Features derived from Wikipedia markup =====
  
  
  
