Intersecting Parallel Corpora

The organizers of the annual Workshop on Machine Translation (WMT) prepare and distribute parallel corpora that can be used to train systems for the shared tasks. Two core corpora are News Commentary and Europarl. Both are available in several language pairs, each pairing English with another European language: cs-en, de-en, es-en and fr-en. The corpora are not multi-parallel: the language pairs come from the same source and overlap significantly, but some sentences are translated into only a subset of the languages, so the bilingual subsets do not all have the same number of sentence pairs. Such corpora cannot be used directly to train a system for, e.g., de-cs (German-Czech). However, we can use English as a pivot language. If we identify the intersection of the English sides of cs-en and de-en, we can take the non-English counterparts of the overlapping English sentences to create a de-cs parallel corpus. That is what this software does.
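
The pivoting idea can be illustrated with a minimal Python sketch. It is not the code shipped in intersect.zip (which is written in Perl); the function name pivot_intersect, the in-memory list representation and the rule of keeping only the first occurrence of a repeated English sentence are all assumptions made for illustration.

```python
def pivot_intersect(cs_en, de_en):
    """Build de-cs pairs from cs-en and de-en pairs that share an English sentence.

    Each argument is a list of (foreign, english) tuples.
    Assumption: if an English sentence occurs more than once,
    only its first occurrence on each side is used.
    """
    # Index the Czech side by its English counterpart.
    english_to_czech = {}
    for czech, english in cs_en:
        english_to_czech.setdefault(english, czech)

    # Walk the German side and keep pairs whose English sentence is shared.
    de_cs = []
    seen = set()
    for german, english in de_en:
        if english in english_to_czech and english not in seen:
            seen.add(english)
            de_cs.append((german, english_to_czech[english]))
    return de_cs


# Toy data standing in for the real corpora.
cs_en = [("Dobré ráno.", "Good morning."), ("Děkuji.", "Thank you."), ("Ano.", "Yes.")]
de_en = [("Danke.", "Thank you."), ("Nein.", "No."), ("Guten Morgen.", "Good morning.")]

for german, czech in pivot_intersect(cs_en, de_en):
    print(german, "|||", czech)
```

Only the two sentences whose English side appears in both toy corpora survive; "Ano." and "Nein." are dropped because their English counterparts have no match on the other side.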

Download

intersect.zip

Installation and Usage

An interpreter of the Perl programming language is required. Perl is freely available for most platforms. Unpack the contents of intersect.zip into a folder. Make sure that the folder is listed in your $PATH, $PERLLIB and $PERL5LIB environment variables.

The scripts overlap.pl and filter-corpus.pl work together: overlap.pl finds the numbers of the lines that occur in both corpora, and filter-corpus.pl then filters a corpus to output only the lines with the selected numbers:

overlap.pl -n text1.txt text2.txt > linenos.txt
filter-corpus.pl [-l|-r] < linenos.txt infile1 outfile1 [infile2 outfile2 [...]]
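
The two-step approach (first find matching line numbers, then filter by line number) can be sketched in Python as follows. This is only an illustration of the idea: the output format of overlap.pl -n and the meaning of the -l and -r options of filter-corpus.pl are not described on this page, so the function names, the pairs of 1-based line numbers and the first-occurrence rule for repeated lines are assumptions.

```python
def overlapping_line_numbers(lines1, lines2):
    """Return (n1, n2) pairs of 1-based line numbers whose lines are identical.

    Assumption: a repeated line is matched only once, by its first occurrence.
    """
    first_seen = {}
    for number, line in enumerate(lines2, start=1):
        first_seen.setdefault(line, number)

    pairs = []
    used = set()
    for number, line in enumerate(lines1, start=1):
        if line in first_seen and line not in used:
            used.add(line)
            pairs.append((number, first_seen[line]))
    return pairs


def filter_lines(lines, numbers):
    """Keep the lines at the given 1-based numbers, in the given order."""
    return [lines[n - 1] for n in numbers]


# English and foreign sides of two toy bilingual corpora, aligned line by line.
en_of_cs = ["Good morning.", "Thank you.", "Yes."]
cs = ["Dobré ráno.", "Děkuji.", "Ano."]
en_of_de = ["Thank you.", "No.", "Good morning."]
de = ["Danke.", "Nein.", "Guten Morgen."]

# Step 1: line numbers of English sentences present in both corpora.
pairs = overlapping_line_numbers(en_of_cs, en_of_de)

# Step 2: filter each foreign side by its own column of line numbers.
cs_out = filter_lines(cs, [left for left, _ in pairs])
de_out = filter_lines(de, [right for _, right in pairs])
print(list(zip(de_out, cs_out)))
```

Separating the two steps means the list of line numbers is computed once on the English sides and can then be reused to filter any file that is aligned line by line with one of the inputs.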

The intersect.pl script demonstrates application of the above approach to the WMT data:

# Use cs-en and de-en to compute de-cs:
intersect.pl europarl-v7 de

Authors

Daniel Zeman and Ondřej Bojar, Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL), 2012

License

Freely redistributable under the terms of the GNU General Public License version 3.0.

Acknowledgements

The creation of this software was financially supported by grant no. 7E11051 from the Czech Ministry of Education (Ministerstvo školství, mládeže a tělovýchovy České republiky).
