Institute of Formal and Applied Linguistics Wiki


courses:rg:extracting-parallel-sentences-from-comparable-corpora [2011/05/22 19:23] ivanova
**Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment**
//Jason R. Smith, Chris Quirk, and Kristina Toutanova//

====== Introduction ======
 +
The article is about parallel sentence extraction from Wikipedia. This resource can be viewed as a comparable corpus in which document alignment is already provided by the interwiki links.
  
Using this model, the authors generate a new translation table, which is used to define another HMM word-alignment model for use in the sentence extraction model.
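One way to picture the intermediate step is the translation table itself. The sketch below is only our illustration, not the authors' exact pipeline: it derives a lexical table t(target | source) from word-aligned sentence pairs such as an HMM alignment model would produce, and such a table could then parameterize a further alignment model.

```python
from collections import defaultdict

def translation_table(aligned_pairs):
    """aligned_pairs: iterable of (src_tokens, tgt_tokens, links),
    where links is a list of (src_idx, tgt_idx) alignment points."""
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt, links in aligned_pairs:
        for i, j in links:
            counts[src[i]][tgt[j]] += 1.0
    # Normalize link counts into conditional probabilities t(f | e).
    table = {}
    for e, fs in counts.items():
        total = sum(fs.values())
        table[e] = {f: c / total for f, c in fs.items()}
    return table

# Toy illustration: "the" aligns once to "la" and once to "el".
pairs = [
    (["the", "house"], ["la", "casa"], [(0, 0), (1, 1)]),
    (["the", "car"], ["el", "coche"], [(0, 0), (1, 1)]),
]
table = translation_table(pairs)
print(table["the"]["la"])  # 0.5
```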
  
====== Evaluation ======
__Data for evaluation:__
20 Wikipedia article pairs for Spanish-English, Bulgarian-English and German-English. Positive examples of sentence pairs in the datasets are sentences that are mostly parallel, with some missing words, and sentences that are direct translations.
  
The SMT evaluation used the BLEU score. For each language they used two training conditions:
**1) Medium**
Training set data:
  * Europarl corpus for Spanish and German;
  * JRC-Acquis corpus for Bulgarian;
  * article titles for parallel Wikipedia documents;
  * translations available from Wikipedia entries.
**2) Large**
Training set data included:
  * all Medium data;
  * a broad range of available sources (data scraped from the Web, data from the United Nations, phrase books, software documentation and more).
In each condition they examined the impact of including parallel sentences automatically extracted from Wikipedia.
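As a reminder of what BLEU measures, here is a minimal single-reference sketch (modified n-gram precision with a brevity penalty, uniform 1..4-gram weights); real experiments would of course use a standard corpus-level implementation, not this toy.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        if overlap == 0:
            return 0.0                          # no smoothing in this sketch
        precisions.append(overlap / max(sum(cand.values()), 1))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = math.exp(min(0.0, 1.0 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))  # identical sentences score 1.0
```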
  
====== Conclusions ======
  * Wikipedia is a useful resource for mining parallel data and it is a good resource for machine translation.
  * The ranking approach sidesteps the problematic class imbalance issue.
  * The induced word-level lexicon in combination with sentence extraction helps to achieve substantial gains.
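The class-imbalance point can be made concrete. With a binary classifier over all source-target sentence pairs, almost every candidate is negative; a ranker instead scores, for each source sentence, all target sentences plus a null option and keeps only the best. The snippet below is our illustration with a toy token-overlap score, not the authors' feature set:

```python
def rank_targets(src_sentence, tgt_sentences, score, null_score=0.5):
    """Return the best-scoring target sentence, or None if the null
    option (no alignment) wins."""
    scored = [(score(src_sentence, t), t) for t in tgt_sentences]
    best_score, best_tgt = max(scored)
    return best_tgt if best_score > null_score else None

# Toy score: Jaccard overlap of token sets (stand-in for real features).
def overlap(src, tgt):
    s, t = set(src.split()), set(tgt.split())
    return len(s & t) / max(len(s | t), 1)

tgt_doc = ["el gato se sento", "the cat sat down", "unrelated sentence"]
print(rank_targets("the cat sat", tgt_doc, overlap, null_score=0.2))
# picks "the cat sat down"; a source with no good match returns None
```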
  
====== Strong sides of the article ======

  * Novel approaches to extracting parallel sentences.
  * Evaluation.
  
====== Weak sides of the article ======
  * The authors use a word-alignment model for the sentence alignment task, which is not typical. They should have stressed this and explained their reasons for using such a technique.
  * The authors didn't explicitly say if they do pruning in the ranking model (whether they choose just one sentence for alignment and prune all the others or not).
Our understanding of this feature is:
TOPIC A: EN <-> ES
          ↓      ↓
TOPIC B: EN <-> ES
  
where
↓ is a link
<-> is an interwiki link
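Under this reading, the feature fires when a wiki link in the English article on topic A (pointing to topic B) is mirrored by a link in the Spanish counterpart of A to the Spanish interwiki counterpart of B. A small sketch of that check, with hypothetical toy data (`interwiki` maps English titles to Spanish ones):

```python
def mirrored_link(en_links, es_links, linked_title, interwiki):
    """True if the English article links to linked_title AND the Spanish
    article links to its interwiki counterpart."""
    es_title = interwiki.get(linked_title)
    return es_title is not None and linked_title in en_links and es_title in es_links

interwiki = {"Cat": "Gato", "Dog": "Perro"}   # hypothetical interwiki map
en_article_links = {"Cat", "Dog"}             # links in the English article on A
es_article_links = {"Gato"}                   # links in its Spanish counterpart
print(mirrored_link(en_article_links, es_article_links, "Cat", interwiki))  # True
print(mirrored_link(en_article_links, es_article_links, "Dog", interwiki))  # False
```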
  
  
 --- //Comments by Angelina Ivanova//
  
  
  
  
