**Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment**
//Jason R. Smith, Chris Quirk and Kristina Toutanova//

====== Introduction ======

The article is about parallel sentence extraction from Wikipedia. This resource can be viewed as a comparable corpus in which the document alignment is already provided by the interwiki links.
  
One set of features bins distances between previous and current aligned sentences. Another set of features looks at the absolute difference between the expected position (one after the previous aligned sentence) and the actual position.
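A minimal sketch of how these position features might be computed; the bin boundaries and feature names below are assumptions for illustration, not the paper's exact choices:
<code python>
# Sketch of the alignment-position features. Bin boundaries and feature
# names are illustrative assumptions, not the paper's exact values.

def position_features(prev_aligned_pos, current_pos):
    """Features for aligning the next source sentence to the target
    sentence at current_pos, given that the previous source sentence
    was aligned to the target sentence at prev_aligned_pos."""
    features = {}

    # Bin the distance between the previous and current aligned sentences.
    distance = current_pos - prev_aligned_pos
    for low, high in [(0, 1), (2, 3), (4, 7), (8, float("inf"))]:
        if low <= distance <= high:
            features[f"dist_bin_{low}_{high}"] = 1.0
            break

    # Absolute difference between the expected position (one after the
    # previously aligned sentence) and the actual position.
    expected = prev_aligned_pos + 1
    features["abs_position_difference"] = float(abs(current_pos - expected))

    return features

print(position_features(prev_aligned_pos=4, current_pos=7))
# {'dist_bin_2_3': 1.0, 'abs_position_difference': 2.0}
</code>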
  
==== Category 3: features derived from Wikipedia markup ====
  - number of matching links in the sentence pairs;
  - image feature (if two sentences are captions of the same image);
  - bias feature (if the alignment is non-null).
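A hedged sketch of how these markup features could be computed for a candidate sentence pair; the ''Sentence'' container, the interwiki dictionary and all names are hypothetical:
<code python>
# Sketch of the Wikipedia-markup features. The Sentence container, the
# interwiki dictionary and all names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Sentence:
    links: Set[str] = field(default_factory=set)  # wiki link targets in the sentence
    image: Optional[str] = None                   # image the sentence captions, if any

def markup_features(src, tgt, interwiki):
    """interwiki maps source-language article titles to their
    target-language counterparts (the document-level alignment)."""
    features = {}

    # Number of matching links: source links whose interwiki counterpart
    # also appears among the target sentence's links.
    matches = sum(1 for link in src.links if interwiki.get(link) in tgt.links)
    features["matching_links"] = float(matches)

    # Image feature: both sentences are captions of the same image.
    if src.image is not None and src.image == tgt.image:
        features["same_image_caption"] = 1.0

    # Bias feature: fires for every non-null alignment considered.
    features["bias"] = 1.0

    return features
</code>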
  
==== Category 4: word-level induced lexicon features ====
  - translation probability;
  - position difference;
Using this model, the authors generate a new translation table, which is used to define another HMM word-alignment model for use in the sentence extraction model.
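A rough sketch of one plausible way to turn the induced lexicon's word-pair scores into such a translation table (the paper's exact normalization is not spelled out here); ''pair_scores'' and its format are hypothetical:
<code python>
# Sketch: derive a translation table p(target | source) from induced
# word-pair scores. The input format is an assumption for illustration.
from collections import defaultdict

def translation_table(pair_scores):
    """pair_scores: {(source_word, target_word): non-negative score}.
    Normalizes each source word's scores over its candidate translations."""
    totals = defaultdict(float)
    for (s, _), score in pair_scores.items():
        totals[s] += score

    table = defaultdict(dict)
    for (s, t), score in pair_scores.items():
        if totals[s] > 0.0:
            table[s][t] = score / totals[s]
    return table

scores = {("casa", "house"): 3.0, ("casa", "home"): 1.0}
print(translation_table(scores)["casa"])  # {'house': 0.75, 'home': 0.25}
</code>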
  
====== Evaluation ======
__Data for evaluation:__
20 Wikipedia article pairs for Spanish-English, Bulgarian-English and German-English. Positive examples of sentence pairs in the datasets are pairs that are mostly parallel with some missing words, and pairs that are direct translations.
  
The SMT evaluation used the BLEU score (a minimal BLEU computation sketch follows the training-data lists below). For each language they explored two training conditions:
**1) Medium**
Training set data:
  * Europarl corpus for Spanish and German;
  * JRC-Acquis corpus for Bulgarian;
  * article titles for parallel Wikipedia documents;
  * translations available from Wikipedia entries.
**2) Large**
Training set data included:
  * all Medium data;
  * a broad range of available sources (data scraped from the Web, data from the United Nations, phrase books, software documentation and more).
In each condition they explored the impact of including parallel sentences automatically extracted from Wikipedia.
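For reference, a minimal corpus-level BLEU computation with the sacrebleu package (not necessarily the tooling used in the paper); the toy hypothesis and reference below are placeholders:
<code python>
# Minimal corpus-level BLEU with sacrebleu; the data is placeholder only.
import sacrebleu

hypotheses = ["the cat sat on the mat"]        # system outputs, one per segment
references = [["the cat is on the mat"]]       # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
</code>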
  
====== Conclusions ======
  * Wikipedia is a useful resource for mining parallel data and a good resource for machine translation.
  * The ranking approach sidesteps the problematic class imbalance issue.
  * The induced word-level lexicon in combination with sentence extraction helps to achieve substantial gains.
  
====== Strong sides of the article ======

  * Novel approaches to extracting parallel sentences.
  * Evaluation on three language pairs, both intrinsic and in an end-to-end SMT setting.
  
====== Weak sides of the article ======
  * The authors use a word-alignment model for the sentence alignment task, which is not typical. They should have stressed this and explained their reasons for using such a technique.
  * The authors didn't explicitly say whether they do pruning in the ranking model (whether they choose just one sentence for alignment and prune all the others or not).
Our understanding of this feature is:
<code>
TOPIC A: EN <-> ES
         ↓      ↓
TOPIC B: EN <-> ES
</code>
where
↓ is a link
<-> is an interwiki link
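As a toy illustration of the diagram, reusing the hypothetical ''Sentence'' container and ''markup_features'' sketch from Category 3 above (all names remain assumptions):
<code python>
# Toy check of the matching-links feature for the diagram above, reusing
# the hypothetical Sentence and markup_features sketch from Category 3.
interwiki = {"Topic_B_en": "Topic_B_es"}      # interwiki link: EN title -> ES title

en_sentence = Sentence(links={"Topic_B_en"})  # EN sentence of topic A links to topic B
es_sentence = Sentence(links={"Topic_B_es"})  # ES sentence of topic A links to topic B

print(markup_features(en_sentence, es_sentence, interwiki)["matching_links"])  # 1.0
</code>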
  
  
 --- //Comments by Angelina Ivanova//
  
  
  
  
