Institute of Formal and Applied Linguistics Wiki

Extracting Parallel Sentences from Comparable Corpora using Document Level Alignment
Jason R. Smith, Chris Quirk and Kristina Toutanova


The article is about parallel sentence extraction from Wikipedia. Wikipedia can be viewed as a comparable corpus in which the document alignment is already provided by the interwiki links.

Training models

The authors train three models: a binary classifier, a ranking model and a conditional random field (CRF).

When the binary classifier is used, there is a substantial class imbalance: O(n) positive examples and O(n²) negative examples.
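
To make the imbalance concrete, here is a minimal sketch that counts training examples when every source/target sentence pair of one document pair is a candidate. The function name and the example sizes are hypothetical, chosen only for illustration.

```python
from typing import Tuple


def count_candidate_pairs(n_source: int, n_target: int, n_parallel: int) -> Tuple[int, int]:
    """Return (positives, negatives) when all sentence pairs are classified."""
    total_pairs = n_source * n_target   # grows quadratically with document length
    positives = n_parallel              # at most one match per sentence, so linear
    negatives = total_pairs - positives
    return positives, negatives


# Hypothetical document pair: 100 sentences on each side, 80 of them parallel.
positives, negatives = count_candidate_pairs(100, 100, 80)
print(positives, negatives)
```

With these assumed sizes there are more than a hundred negative pairs for every positive one, and the ratio gets worse as the documents get longer.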

The ranking model selects, for each sentence in the source document, either a sentence in the target document or 'null'. This avoids the class imbalance problem of the binary classifier.
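
The selection step of such a ranker can be sketched as follows. This is only an illustration under stated assumptions: `rank_target` and `toy_score` are hypothetical names, and the word-overlap score with a fixed score for 'null' is a stand-in for the trained model, not the authors' scoring function.

```python
from typing import Callable, Optional, Sequence


def rank_target(
    source_sentence: str,
    target_sentences: Sequence[str],
    score: Callable[[str, Optional[str]], float],
) -> Optional[int]:
    """Return the index of the best-scoring target sentence, or None for 'null'."""
    best_index: Optional[int] = None
    best_score = score(source_sentence, None)  # 'null' competes with every sentence
    for index, target in enumerate(target_sentences):
        candidate_score = score(source_sentence, target)
        if candidate_score > best_score:
            best_index, best_score = index, candidate_score
    return best_index


def toy_score(source: str, target: Optional[str]) -> float:
    """Hypothetical scorer: word overlap, with an assumed fixed score for 'null'."""
    if target is None:
        return 0.3
    source_words = set(source.lower().split())
    target_words = set(target.lower().split())
    return len(source_words & target_words) / len(source_words | target_words)


print(rank_target("the cat sat", ["a dog ran", "the cat sat down"], toy_score))
print(rank_target("hello world", ["a dog ran"], toy_score))
```

Each source sentence yields exactly one decision, so the number of training instances is linear in the document length.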

A conditional random field is a type of discriminative undirected probabilistic graphical model. It is most often used for labeling or parsing of sequential data, such as natural language text.


Category 1: features derived from word alignment

  1. log probability of the alignment;
  2. number of aligned/unaligned words;
  3. longest aligned/unaligned sequence of words;
  4. sentence length;
  5. the difference in relative document position of the two sentences.

The last two features are independent of word alignment. All these features are defined on sentence pairs and are included in the binary classification and ranking models.
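
A rough sketch of how most of these features could be computed for one sentence pair is given below. The input representation is an assumption made for illustration: `alignment[j]` holds the index of the source word aligned to target word `j`, or `None` if that word is unaligned. The alignment log probability is left out because it comes directly from the word-alignment model.

```python
from typing import Dict, List, Optional


def longest_run(flags: List[bool], value: bool) -> int:
    """Length of the longest contiguous run of `value` in `flags`."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag == value else 0
        best = max(best, current)
    return best


def pair_features(
    alignment: List[Optional[int]],  # assumed format: one entry per target word
    source_length: int,
    source_index: int,
    source_doc_length: int,
    target_index: int,
    target_doc_length: int,
) -> Dict[str, float]:
    aligned = [a is not None for a in alignment]
    return {
        "aligned_words": sum(aligned),
        "unaligned_words": len(aligned) - sum(aligned),
        "longest_aligned_run": longest_run(aligned, True),
        "longest_unaligned_run": longest_run(aligned, False),
        "source_length": source_length,
        "target_length": len(alignment),
        # Relative position = sentence index divided by document length.
        "relative_position_diff": abs(
            source_index / source_doc_length - target_index / target_doc_length
        ),
    }


features = pair_features([0, 1, None, None, 2, None], 3, 5, 10, 10, 40)
print(features)
```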

Category 2: distortion features

One set of features bins distances between previous and current aligned sentences. Another set of features looks at the absolute difference between the expected position (one after the previous aligned sentence) and the actual position.
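
The two kinds of distortion features can be sketched like this. The bin boundaries and feature names are assumptions for illustration; the notes above do not say which bins the authors use.

```python
from typing import Dict


def jump_bin(jump: int) -> str:
    """Map the distance between aligned target positions to a bin (assumed bins)."""
    if jump < 0:
        return "backward"
    if jump == 0:
        return "same"
    if jump == 1:
        return "next"
    if jump <= 3:
        return "skip_1_2"
    return "skip_3_plus"


def distortion_features(previous_position: int, current_position: int) -> Dict[str, float]:
    """Features for aligning to `current_position` after `previous_position`."""
    jump = current_position - previous_position
    expected_position = previous_position + 1  # one after the previous aligned sentence
    return {
        "jump_bin=" + jump_bin(jump): 1.0,  # indicator feature for the bin
        "expected_position_diff": float(abs(current_position - expected_position)),
    }


print(distortion_features(4, 5))
print(distortion_features(4, 8))
```

Features of this kind depend on the previous alignment decision, which is why they fit a sequence model such as the CRF rather than a model that scores each sentence pair independently.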

Category 3: features derived from Wikipedia markup

  1. number of matching links in the sentence pairs;
  2. image feature (if two sentences are captions of the same image);
  3. list feature (if two sentences are both items in a list);
  4. bias feature (if the alignment is non-null).
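
A possible reading of these markup features is sketched below. It assumes that two links match when the articles they point to are connected by an interwiki link; the dictionary-based inputs, the function name and the example titles are all hypothetical.

```python
from typing import Dict, Optional, Set


def wiki_markup_features(
    source_links: Set[str],
    target_links: Set[str],
    interwiki: Dict[str, str],  # assumed: source article title -> target article title
    source_image: Optional[str] = None,
    target_image: Optional[str] = None,
    source_in_list: bool = False,
    target_in_list: bool = False,
    is_null_alignment: bool = False,
) -> Dict[str, int]:
    # A source link matches if its interwiki counterpart is linked in the target sentence.
    matching = sum(1 for link in source_links if interwiki.get(link) in target_links)
    return {
        "matching_links": matching,
        "same_image_caption": int(source_image is not None and source_image == target_image),
        "both_list_items": int(source_in_list and target_in_list),
        "bias": int(not is_null_alignment),
    }


features = wiki_markup_features(
    source_links={"Madrid", "España"},
    target_links={"Madrid", "Europe"},
    interwiki={"Madrid": "Madrid", "España": "Spain"},
)
print(features)
```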

Category 4: word-level induced lexicon features

  1. translation probability;
  2. position difference;
  3. orthographic similarity (this is inexact and is a promising area for improvement);
  4. context translation probability;
  5. distributional similarity.

Using these features, the authors train the weights of a log-linear ranking model for P(wt|ws, T, S), where wt is a word in the target language, ws is a word in the source language, and T and S are linked articles in the target and source languages respectively. The model is trained on a small set of annotated Wikipedia article pairs.
Using this model, the authors generate a new translation table, which is used to define another HMM word-alignment model for use in the sentence extraction model.
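
As an illustration of what a log-linear model over candidate target words looks like, the sketch below normalises weighted feature sums with a softmax. The feature values, weights and candidate words are made up, and the function name is hypothetical; the candidate set is assumed to be the words of the linked target article T.

```python
import math
from typing import Dict


def log_linear_distribution(
    candidate_features: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """P(wt | ws, T, S) proportional to exp(sum_k weight_k * feature_k(wt, ws, T, S))."""
    scores = {
        wt: sum(weights.get(name, 0.0) * value for name, value in features.items())
        for wt, features in candidate_features.items()
    }
    max_score = max(scores.values())  # subtract the maximum for numerical stability
    exponentials = {wt: math.exp(score - max_score) for wt, score in scores.items()}
    normaliser = sum(exponentials.values())
    return {wt: value / normaliser for wt, value in exponentials.items()}


# Made-up features of two candidate translations of one source word.
candidates = {
    "house": {"orthographic_similarity": 0.2, "position_difference": 0.1,
              "context_translation_probability": 0.9},
    "case": {"orthographic_similarity": 0.8, "position_difference": 0.9,
             "context_translation_probability": 0.1},
}
weights = {"orthographic_similarity": 2.0, "position_difference": -1.0,
           "context_translation_probability": 3.0}
print(log_linear_distribution(candidates, weights))
```

The resulting distribution over target words is the kind of table that could then be used as a translation table for the word-alignment model.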


Data for evaluation:
20 Wikipedia article pairs for Spanish-English, Bulgarian-English and German-English. Positive examples in the datasets are sentence pairs that are direct translations, as well as pairs that are mostly parallel with some missing words.

Evaluation measures:

In the first set of experiments they did not include the Wikipedia features and the lexicon features. They evaluate the binary classifier, ranking and CRF models.

In the second set of experiments they use the Wikipedia-specific features and evaluate the ranker and the CRF. As these two models are asymmetric, they ran the models in both directions and combined their outputs by intersection.
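
Combining the two directions by intersection can be sketched as follows. The dictionary representation (sentence index mapped to the chosen index on the other side, or `None` for 'null') and the function name are assumptions for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def intersect_alignments(
    source_to_target: Dict[int, Optional[int]],
    target_to_source: Dict[int, Optional[int]],
) -> Set[Tuple[int, int]]:
    """Keep only the sentence pairs that both directions agree on."""
    return {
        (source, target)
        for source, target in source_to_target.items()
        if target is not None and target_to_source.get(target) == source
    }


# Hypothetical outputs of the two directional runs over one article pair.
source_to_target = {0: 0, 1: 2, 2: None, 3: 3}
target_to_source = {0: 0, 1: 1, 2: 1, 3: None}
print(sorted(intersect_alignments(source_to_target, target_to_source)))
```

Intersection keeps a pair only when both directional models select it, which trades recall for precision.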

The SMT evaluation used the BLEU score. For each language they explored two training conditions:
1) Medium
Training set data:

2) Large
Training set data included:

In each condition they explored the impact of including parallel sentences automatically extracted from Wikipedia.


Strong sides of the article

Weak sides of the article

Our understanding of this feature is:

[Diagram not preserved. Legend: ↓ is a link; ↔ is an interwiki link.]

Comments by Angelina Ivanova
