**Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment**
//Jason R. Smith, Chris Quirk and Kristina Toutanova//

====== Introduction ======

The article is about parallel sentence extraction from Wikipedia. This resource can be viewed as a comparable corpus in which the document alignment is already provided by the interwiki links.
Using this model, the authors generate a new translation table, which is then used to define another HMM word-alignment model.
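To make the HMM word-alignment idea concrete, here is a minimal Viterbi-decoding sketch, not the paper's implementation: it assumes a translation table ''t(f|e)'' and jump-width transition probabilities, both supplied by the caller, and finds the most likely source position for each target word.

```python
import math

def hmm_viterbi_align(src, tgt, t_table, jump_prob):
    """Viterbi alignment under a simple HMM word-alignment model.

    src: source words e_1..e_I (the hidden states)
    tgt: target words f_1..f_J (the observations)
    t_table[(f, e)]: translation probability t(f|e)
    jump_prob[d]: transition probability for a jump of width d
    Returns, for each target position, the index of the aligned source word.
    """
    I, J = len(src), len(tgt)
    NEG_INF = float("-inf")

    def logp(p):
        return math.log(p) if p > 0 else NEG_INF

    # delta[j][i]: best log-prob of emitting f_1..f_j with f_j aligned to e_i
    delta = [[NEG_INF] * I for _ in range(J)]
    back = [[0] * I for _ in range(J)]

    for i in range(I):  # uniform initial state distribution
        delta[0][i] = logp(1.0 / I) + logp(t_table.get((tgt[0], src[i]), 1e-9))

    for j in range(1, J):
        for i in range(I):
            emit = logp(t_table.get((tgt[j], src[i]), 1e-9))
            best, arg = NEG_INF, 0
            for k in range(I):
                score = delta[j - 1][k] + logp(jump_prob.get(i - k, 1e-9))
                if score > best:
                    best, arg = score, k
            delta[j][i] = best + emit
            back[j][i] = arg

    # backtrace from the best final state
    a = [max(range(I), key=lambda i: delta[J - 1][i])]
    for j in range(J - 1, 0, -1):
        a.append(back[j][a[-1]])
    return list(reversed(a))
```

The smoothing constant ''1e-9'' and the uniform initial distribution are simplifications; a real system would use the trained model's parameters.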
====== Evaluation ======
__Data for evaluation:__
20 Wikipedia article pairs for Spanish-English.
The SMT evaluation used the BLEU score. For each language pair they explored two training conditions:
**1) Medium**
Training set data:
  * Europarl corpus for Spanish and German;
  * JRC-Acquis corpus for Bulgarian;
  * article titles for parallel Wikipedia documents;
  * translations available from Wikipedia entries.
**2) Large**
Training set data included:
  * all Medium data;
  * a broad range of available sources (data scraped from the Web, data from the United Nations, phrase books, software documentation and more).
In each condition they explored the impact of including parallel sentences automatically extracted from Wikipedia.
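The BLEU metric used above can be illustrated with a minimal sentence-level sketch (modified n-gram precision with a brevity penalty). Real evaluations compute corpus-level BLEU by pooling counts over all sentences; this toy version is for intuition only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform n-gram weights and brevity penalty.

    candidate, reference: lists of tokens (single reference, for brevity).
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # brevity penalty: punish candidates shorter than the reference
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Production systems use a standard implementation (e.g. sacrebleu) rather than re-deriving the metric, so that scores are comparable across papers.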
====== Conclusions ======

  * Wikipedia is a useful resource for mining parallel data and a good resource for machine translation.
  * The ranking approach sidesteps the problematic class imbalance issue.
  * The induced word-level lexicon, in combination with sentence extraction, helps to achieve substantial gains.
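The class-imbalance point can be made concrete. A binary classifier over the cross-product of sentence pairs sees overwhelmingly negative examples and must pick an absolute threshold; a ranker only orders the candidate target sentences for each source sentence. A hypothetical sketch (the scoring function here is a toy word-overlap placeholder, not the paper's feature-based model):

```python
def rank_candidates(source_sent, target_sents, score):
    """Pick the best-scoring target sentence for one source sentence.

    Ranking sidesteps class imbalance: no absolute parallel/non-parallel
    decision is needed, only a relative ordering of candidates.
    `score` is any pairwise scoring function.
    """
    best = max(target_sents, key=lambda t: score(source_sent, t))
    return best, score(source_sent, best)

def overlap_score(s, t):
    """Toy scorer: Jaccard overlap of word sets (placeholder only)."""
    sw, tw = set(s.lower().split()), set(t.lower().split())
    return len(sw & tw) / max(len(sw | tw), 1)
```

A real system would still need a "no aligned sentence" option or a score cutoff for source sentences with no parallel counterpart, which is exactly the pruning question raised under weak sides below.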
====== Strong sides of the article ======

  * Novel approaches to extracting parallel sentences.
  * Evaluation.
====== Weak sides of the article ======

  * The authors use a word-alignment model for the sentence alignment task, which is not typical. They should have stressed this and explained their reasons for using such a technique.
  * The authors did not explicitly say whether they do pruning in the ranking model (i.e. whether they choose just one sentence for alignment and prune all the others).
Our understanding of this feature is:
<code>
TOPIC A: EN <-> ES
          ↓      ↓
TOPIC B: EN <-> ES
</code>
where ↓ is a link and <-> is an interwiki link.
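Under that reading, the feature could be computed roughly as follows. This is our hypothetical sketch, not the paper's code: ''en_links'' and ''es_links'' are the topics linked from the two candidate sentences, and ''interwiki'' stands in for Wikipedia's interwiki mapping from an English article to its Spanish counterpart.

```python
def aligned_link_feature(en_links, es_links, interwiki):
    """Count links in the English sentence whose interwiki counterpart
    appears among the Spanish sentence's links (the pattern diagrammed
    above: EN and ES sentences in topic A both link to topic B).

    en_links, es_links: sets of linked topics; interwiki: dict mapping
    an English topic to its Spanish article (hypothetical representation).
    """
    return sum(1 for topic in en_links if interwiki.get(topic) in es_links)
```

A matching pair of links is evidence that the two sentences discuss the same entity, which is why such a feature is useful for ranking candidate translations.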
--- //Comments by Angelina Ivanova//