
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


Next revision
Previous revision
user:zeman:intersecting-parallel-corpora [2013/04/23 10:38]
zeman created
user:zeman:intersecting-parallel-corpora [2013/04/23 13:25] (current)
zeman Order of authors.
Line 4: Line 4:
  
 ===== Download =====
 +
 +{{:user:zeman:intersect.zip|intersect.zip}}
 +
 +===== Installation and Usage =====
 +
 +An interpreter of the Perl programming language is required; Perl is freely available for most platforms. Unpack the contents of ''intersect.zip'' into a folder and make sure that the folder is listed in your ''$PATH'', ''$PERLLIB'' and ''$PERL5LIB'' environment variables.
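The environment setup could look like the following minimal sketch. It assumes a POSIX-compatible shell and a hypothetical unpack location, ''$HOME/intersect'', which is not prescribed by the package; substitute the folder you actually used.

```shell
# Hypothetical install location (an assumption); adjust to wherever
# intersect.zip was unpacked.
INTERSECT_DIR="$HOME/intersect"

# Prepend the folder to PATH so the scripts can be called by name.
export PATH="$INTERSECT_DIR:$PATH"

# Let Perl search the same folder for modules. The ${VAR:+...} form
# appends the previous value only if the variable was already set,
# which avoids a stray trailing colon.
export PERL5LIB="$INTERSECT_DIR${PERL5LIB:+:$PERL5LIB}"
export PERLLIB="$INTERSECT_DIR${PERLLIB:+:$PERLLIB}"
```

Putting these lines into a shell startup file would make the setting persistent across sessions.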
 +
 +The combination of the scripts ''overlap.pl'' and ''filter-corpus.pl'' finds line numbers that occur in both corpora, then filters a corpus to output only the lines with the selected numbers:
 +
 +<code bash>overlap.pl -n text1.txt text2.txt > sentence_numbers.txt
 +filter-corpus.pl [-l|-r] < sentence_numbers.txt infile1 outfile1 [infile2 outfile2 [...]]</code>
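The idea behind the filtering step can be illustrated with standard tools. This is not the implementation of ''filter-corpus.pl'', and it does not model the ''-l'' and ''-r'' options; it is only a sketch that assumes the numbers file holds one 1-based line number per line. All file names here are made up for the illustration.

```shell
# Work in a scratch directory so that no real data is touched.
workdir=$(mktemp -d)

# Toy corpus with four lines, and a list of line numbers to keep (assumed format).
printf 'first\nsecond\nthird\nfourth\n' > "$workdir/corpus.txt"
printf '2\n4\n' > "$workdir/keep.txt"

# While reading the first file (NR==FNR), remember the wanted numbers;
# while reading the second, print only the lines whose number was remembered.
awk 'NR==FNR { keep[$1]; next } FNR in keep' \
    "$workdir/keep.txt" "$workdir/corpus.txt" > "$workdir/filtered.txt"

cat "$workdir/filtered.txt"
```

Applying the same list of numbers to both sides of a parallel corpus keeps the sentence pairs aligned.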
 +
 +The ''intersect.pl'' script demonstrates application of the above approach to the WMT data:
 +
 +<code bash># Use cs-en and de-en to compute de-cs:
 +intersect.pl europarl-v7 de</code>
  
 ===== Authors =====
  
-Ondřej Bojar and Daniel Zeman, Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL), 2012
+Daniel Zeman and Ondřej Bojar, Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (ÚFAL), 2012
  
 ===== License =====
