**Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment**
//Jason R. Smith, Chris Quirk and Kristina Toutanova//

====== Introduction ======

The article is about parallel sentence extraction from Wikipedia. This resource can be viewed as a comparable corpus in which document alignment is already provided by the interwiki links.
Using this model, the authors generate a new translation table, which is used to define another HMM word-alignment model.
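The summary does not reproduce how such a translation table is induced. As a rough illustration (not the authors' exact model), here is a minimal IBM Model 1 EM sketch that estimates a lexical table t(f|e) from sentence pairs; the toy bitext and all names are invented for the example:

```python
from collections import defaultdict

def train_model1(bitext, iterations=10):
    """IBM Model 1 EM: estimate t(f|e) from (source_words, target_words) pairs."""
    t = defaultdict(float)
    f_vocab = {f for _, fs in bitext for f in fs}
    uniform = 1.0 / len(f_vocab)
    # Uniform initialization over co-occurring word pairs (plus a NULL source word).
    for es, fs in bitext:
        for e in es + ["NULL"]:
            for f in fs:
                t[(f, e)] = uniform
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in bitext:
            src = es + ["NULL"]
            for f in fs:
                z = sum(t[(f, e)] for e in src)  # normalize over possible alignments
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:  # M-step: renormalize per source word
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy English-Spanish bitext, purely illustrative.
bitext = [(["the", "house"], ["la", "casa"]),
          (["the", "book"], ["el", "libro"]),
          (["a", "book"], ["un", "libro"])]
t = train_model1(bitext)
```

After a few EM iterations, repeated co-occurrence makes t("libro"|"book") dominate the other candidates for "book".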
====== Evaluation ======
__Data for evaluation:__
20 Wikipedia article pairs for Spanish-English,
The SMT evaluation used the BLEU score. For each language they exploited two training conditions:
**1) Medium**
Training set data:
  * Europarl corpus for Spanish and German;
  * JRC-Acquis corpus for Bulgarian;
  * article titles for parallel Wikipedia documents;
  * translations available from Wikipedia entries.
**2) Large**
Training set data included:
  * all Medium data;
  * a broad range of available sources (data scraped from the Web, data from the United Nations, phrase books, software documentation and more).
In each condition they explored the impact of including parallel sentences automatically extracted from Wikipedia.
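As a reminder of what the BLEU metric computes, here is a minimal single-reference, unsmoothed sentence-level BLEU sketch; it is an illustration of the metric, not the script used in the paper's evaluation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:  # no smoothing: any zero precision gives BLEU 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(log_avg)

cand = "the cat sat on the mat".split()
ref = "the cat sat on the mat".split()
print(bleu(cand, ref))  # identical sentences score 1.0
```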
====== Conclusions ======
  * Wikipedia is a useful resource for mining parallel data and a good resource for machine translation.
  * The ranking approach sidesteps the problematic class imbalance issue.
  * The induced word-level lexicon in combination with sentence extraction helps to achieve substantial gains.
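A minimal sketch of why ranking sidesteps the imbalance: a classifier must label every source-target sentence pair, of which almost all are negative, whereas a ranker just keeps the best-scoring target per source sentence. The score function and matrix below are hypothetical:

```python
def extract_by_ranking(scores, threshold=0.0):
    """For each source sentence (row), keep only the best-scoring target
    sentence (plus a minimal score cut-off), instead of classifying every pair."""
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)  # argmax over targets
        if row[j] >= threshold:
            pairs.append((i, j))
    return pairs

# Hypothetical score matrix: rows = source sentences, columns = target sentences.
scores = [[0.9, 0.1, 0.2],
          [0.1, 0.05, 0.3],
          [0.2, 0.8, 0.1]]
print(extract_by_ranking(scores, threshold=0.5))  # → [(0, 0), (2, 1)]
```

Row 1 is dropped entirely because even its best target falls below the cut-off, so no classification decision over the many negative pairs is ever needed.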
====== Strong sides of the article ======

  * Novel approaches to extracting parallel sentences.
  * Evaluation.
====== Weak sides of the article ======
  * The authors use a word-alignment model for a sentence alignment task, which is not typical. They should have stressed this and explained their reasons for using such a technique.
  * The authors didn't explicitly say whether they do pruning in the ranking model (whether they choose just one sentence for alignment and prune all the others or not).
Our understanding of this feature is:
TOPIC A: EN <-> ES
          ↓       ↓
TOPIC B: EN <-> ES
where
↓ is a link
<-> is an interwiki link
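Our reading of the diagram above can be sketched as counting EN link targets whose interwiki counterpart is also linked from the aligned ES page; all names and data here are hypothetical:

```python
def shared_link_feature(en_links, es_links, interwiki):
    """Count targets linked from the EN page whose interwiki-mapped ES
    counterpart is also linked from the aligned ES page."""
    return sum(1 for target in en_links if interwiki.get(target) in es_links)

# Hypothetical data: topic A's EN and ES pages both link to topic B.
interwiki = {"Topic_B_en": "Topic_B_es"}   # interwiki link (the <-> above)
en_links = {"Topic_B_en"}                  # links out of topic A's EN page (↓)
es_links = {"Topic_B_es"}                  # links out of topic A's ES page (↓)
print(shared_link_feature(en_links, es_links, interwiki))  # → 1
```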
--- //Comments by Angelina Ivanova//