courses:rg:extracting-parallel-sentences-from-comparable-corpora (created 2011/05/22 18:33, last modified 2011/05/22 19:23 by ivanova)
**Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment**
//Jason R. Smith, Chris Quirk and Kristina Toutanova//

====== Introduction ======

====== Training models ======
The authors train three models:
  * binary classifier model;
  * ranking model;
  * conditional random field (CRF) model.
When the binary classifier is used, there is a substantial class imbalance: O(n) positive examples and O(n²) negative examples.
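To make the imbalance concrete, here is a tiny sketch (our own illustration, not code from the paper): a pairwise classifier over two comparable documents must score every cross-document sentence pair, so negatives grow quadratically while positives grow only linearly.

```python
# Illustrative only: for n source and m target sentences, the pairwise
# classifier sees n*m candidates, of which at most min(n, m) are parallel.
def candidate_counts(n_source, n_target, n_parallel):
    """Return (positive, negative) example counts for the pairwise classifier."""
    total = n_source * n_target          # every cross-document sentence pair
    positives = n_parallel               # true parallel pairs, O(n)
    negatives = total - n_parallel       # everything else, O(n^2)
    return positives, negatives

pos, neg = candidate_counts(100, 100, 80)
print(pos, neg)  # 80 positives vs 9920 negatives, roughly a 1:124 imbalance
```

The ranking model avoids this by comparing candidate target sentences against each other for a given source sentence, rather than classifying every pair independently.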
====== Features ======

==== Category 1: features derived from word alignment ====
  - log probability of the alignment;
  - number of aligned/unaligned words;
  - longest aligned/unaligned sequence of words;
  - sentence length;
  - the difference in relative document position of the two sentences.
The last two features are independent of word alignment. All these features are defined on sentence pairs and are included in the binary classification and ranking models.
+ | |||
==== Category 2: distortion features ====
One set of features bins distances between the previous and current aligned sentences. Another set looks at the absolute difference between the expected position (one after the previous aligned sentence) and the actual position.
+ | |||
==== Category 3: features derived from Wikipedia markup ====
  - number of matching links in the sentence pair;
  - image feature (whether the two sentences are captions of the same image);
  - list feature (whether the two sentences are both items in a list);
  - bias feature (whether the alignment is non-null).
+ | |||
==== Category 4: word-level induced lexicon features ====
  - translation probability;
  - position difference;
  - orthographic similarity (this is inexact and is a promising area for improvement);
  - context translation probability;
  - distributional similarity.
Using these features, the authors train the weights of a log-linear ranking model for P(wt|ws, T, S), where wt is a word in the target language, ws is a word in the source language, and T and S are linked articles in the target and source languages respectively. The model is trained on a small set of annotated Wikipedia article pairs.
Using this model, the authors generate a new translation table, which is used to define another HMM word-alignment model.
+ | |||
====== Evaluation ======

__Data for evaluation:__
20 Wikipedia article pairs for each of Spanish-English, German-English and Bulgarian-English.

__Evaluation measures:__
  * average precision;
  * recall at 90% precision;
  * recall at 80% precision.
+ | |||
In the first set of experiments the authors did not include the Wikipedia features and the lexicon features. They evaluate the binary classifier, ranking and CRF models.

In the second set of experiments they use the Wikipedia-specific features. They evaluate the ranker and the CRF. As these two models are asymmetric, they ran the models in both directions and combined their outputs by intersection.

The SMT evaluation used the BLEU score. For each language they explored two training conditions:
**1) Medium**
Training set data:
  * Europarl corpus for Spanish and German;
  * JRC-Acquis corpus for Bulgarian;
  * article titles for parallel Wikipedia documents;
  * translations available from Wikipedia entries.
**2) Large**
Training set data included:
  * all Medium data;
  * a broad range of available sources (data scraped from the Web, data from the United Nations, phrase books, software documentation and more).
In each condition they examined the impact of including parallel sentences automatically extracted from Wikipedia.

====== Conclusions ======

  * Wikipedia is a useful resource for mining parallel data and for machine translation.
  * The ranking approach sidesteps the problematic class-imbalance issue.
  * A small sample of annotated articles is sufficient to train global-level features that bring substantial improvement in the accuracy of sentence alignment.
  * The learned classifiers are portable across languages.
  * The induced word-level lexicon, in combination with sentence extraction, helps to achieve substantial gains.

====== Strong sides of the article ======

  * Novel approaches to extracting parallel sentences.
  * Thorough evaluation, both intrinsic (sentence-alignment precision and recall) and extrinsic (SMT BLEU).

====== Weak sides of the article ======

  * The authors use a word-alignment model for the sentence-alignment task, which is not typical. They should have stressed this and explained their reasons for using such a technique.
  * The authors did not explicitly say whether they do pruning in the ranking model (i.e. whether they choose just one sentence for alignment and prune all the others).
  * The feature "number of matching links in the sentence pair" is not explained clearly. Our understanding of this feature is:

  TOPIC A: EN <-> ES
            ↓     ↓
  TOPIC B: EN <-> ES

where
  ↓ is a link (within one language, from the article on topic A to the article on topic B),
  <-> is an interwiki link (between the two language versions of the same topic).

--- //Comments by Angelina Ivanova//