
Institute of Formal and Applied Linguistics Wiki



Introduction

The article is about parallel sentence extraction from Wikipedia. Wikipedia can be viewed as a comparable corpus in which document alignment is already provided by the interwiki links.

Training models

The authors train three models: a binary classifier, a ranking model, and a conditional random field (CRF) model.

When the binary classifier is used, there is a substantial class imbalance: for a pair of documents with n sentences each, there are O(n) positive examples but O(n²) negative examples.
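
A minimal sketch of where the imbalance comes from; the function name and the representation of the gold alignment as a set of index pairs are illustrative, not taken from the paper:

<code python>
from itertools import product

def candidate_pairs(src_sents, tgt_sents, gold_alignment):
    """Enumerate every source/target sentence pair as a classifier example.

    gold_alignment is a set of (i, j) index pairs judged parallel.  With
    n sentences per document there are O(n^2) candidate pairs, but only
    O(n) of them can be positive, hence the imbalance.
    """
    positives, negatives = [], []
    for i, j in product(range(len(src_sents)), range(len(tgt_sents))):
        pair = (src_sents[i], tgt_sents[j])
        (positives if (i, j) in gold_alignment else negatives).append(pair)
    return positives, negatives
</code>

For two 100-sentence documents with 80 gold-aligned pairs this yields 80 positive and 9,920 negative examples.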

The ranking model selects either a sentence in the target document or 'null' for each sentence in the source document. This avoids the class imbalance issue of the binary classifier.
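
A sketch of the per-source-sentence decision, assuming a hypothetical scoring function score(src, tgt); because the candidates for one source sentence compete only against each other (plus a null option), each source sentence contributes exactly one decision:

<code python>
def rank_targets(src_sent, tgt_sents, score, null_score=0.0):
    """Return the index of the best-scoring target sentence, or None.

    The candidates for one source sentence are compared against each
    other and against a 'null' option, so the number of decisions grows
    linearly with the number of source sentences.
    """
    best_idx, best = None, null_score
    for j, tgt in enumerate(tgt_sents):
        s = score(src_sent, tgt)
        if s > best:
            best_idx, best = j, s
    return best_idx
</code>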

A conditional random field is a type of discriminative undirected probabilistic graphical model. It is most often used for labeling or parsing of sequential data, such as natural language text.
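
As a rough sketch of how such a model can be applied to sentence alignment, the label of each source sentence can be taken to be a target sentence index (or -1 for 'null'), and decoding finds the best label sequence. The emission and transition scores below are assumed log-potentials, not the paper's exact parameterization:

<code python>
def viterbi_align(n_src, n_tgt, emission, transition):
    """Decode the best sequence of target labels for n_src source sentences.

    emission(i, j)   -- score of aligning source sentence i to target j
                        (j == -1 means 'null').
    transition(p, j) -- score of moving from previous label p to label j
                        (this is where distortion-style features live).
    """
    labels = list(range(-1, n_tgt))          # -1 is the null label
    best = {j: emission(0, j) for j in labels}
    back = [{} for _ in range(n_src)]
    for i in range(1, n_src):
        new_best = {}
        for j in labels:
            p, s = max(((p, best[p] + transition(p, j)) for p in labels),
                       key=lambda x: x[1])
            new_best[j] = s + emission(i, j)
            back[i][j] = p
        best = new_best
    # Follow back-pointers from the best final label.
    j = max(best, key=best.get)
    path = [j]
    for i in range(n_src - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return list(reversed(path))
</code>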

Features

Category 1: Features derived from word alignment

  1. log probability of the alignment;
  2. number of aligned/unaligned words;
  3. longest aligned/unaligned sequence of words;
  4. sentence length;
  5. the difference in relative document position of the two sentences.

The last two features are independent of word alignment. All of these features are defined on sentence pairs and are included in both the binary classification and ranking models.
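
A sketch of how these per-pair features might be computed, assuming the word alignment is given as a list of (source word index, target word index) links with per-link log probabilities; the function and argument names are illustrative:

<code python>
def alignment_features(src_len, tgt_len, links, link_logprobs,
                       src_pos, tgt_pos, src_doc_len, tgt_doc_len):
    """Compute Category 1 features for one sentence pair.

    links           -- list of (i, j) word-alignment links for the pair
    link_logprobs   -- log probability of each link (assumed given)
    src_pos/tgt_pos -- sentence indices inside their documents
    """
    aligned_src = {i for i, _ in links}
    unaligned_src = {k for k in range(src_len) if k not in aligned_src}

    def longest_run(positions, length):
        # Longest consecutive run of word positions from the given set.
        best = run = 0
        for k in range(length):
            run = run + 1 if k in positions else 0
            best = max(best, run)
        return best

    return {
        "log_prob": sum(link_logprobs),                         # 1. alignment log probability
        "n_aligned": len(aligned_src),                          # 2. aligned words
        "n_unaligned": len(unaligned_src),                      #    unaligned words
        "longest_aligned": longest_run(aligned_src, src_len),   # 3. longest aligned sequence
        "longest_unaligned": longest_run(unaligned_src, src_len),  #  longest unaligned sequence
        "src_len": src_len, "tgt_len": tgt_len,                 # 4. sentence lengths
        "rel_pos_diff": abs(src_pos / src_doc_len - tgt_pos / tgt_doc_len),  # 5. relative position
    }
</code>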

Category 2: Distortion features

One set of features bins the distance between the previous and the current aligned sentences. Another set looks at the absolute difference between the expected position (one after the previous aligned sentence) and the actual position.
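
A sketch of the two distortion feature sets, assuming the target positions aligned to the previous and current source sentences are available; the bin boundaries are made up for illustration and are not the paper's:

<code python>
def distortion_features(prev_target_pos, curr_target_pos):
    """Fire distortion features relative to the previous aligned sentence."""
    jump = curr_target_pos - prev_target_pos                   # distance between aligned sentences
    deviation = abs(curr_target_pos - (prev_target_pos + 1))   # difference from the expected "next" position

    def bin_of(value, bins=(10, 5, 3, 2, 1)):
        # Coarse label for the magnitude of the jump (illustrative bins).
        for b in bins:
            if abs(value) >= b:
                return f">={b}"
        return "0"

    return {
        f"jump_bin={bin_of(jump)}": 1.0,   # first feature set: binned distance
        f"deviation={deviation}": 1.0,     # second set: absolute difference from expected position
    }
</code>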

Category 3: Features derived from Wikipedia markup

