Preprocessing of the Swedish Acquis

name = Acquis | path = /fs/clip-corpora/Europarl/acquis | owner = zeman

The JRC-Acquis corpus is a large collection of European Union documents in 21 languages: Czech, Danish, Dutch, German, Greek, English, Estonian, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish. Its original format is TEI-compliant XML. It is a parallel corpus and automatic paragraph-level alignment is available (sentence boundaries are not marked).

The path above currently contains Danish and Swedish (the da and sv subfolders, respectively) and their alignment.

Preprocessing of the Swedish Acquis

This has been done by Dan for Parser Adaptation. First, we extract plain text from the TEI XML format. The output has one non-empty paragraph per line, 475,707 paragraphs in total.

cd /fs/clip-corpora/Europarl/acquis/sv
$PARSINGROOT/tools/tei2txt.pl < jrc-sv.xml > sv.01.txt

Tokenize the text using our English tokenizer. Then count words (tokens). There are 9,411,224 words and 163,084 word types.

$PARSINGROOT/tools/tokenizeE.pl - - < sv.01.txt > sv.02.tok.txt
$PARSINGROOT/tools/count_words.pl < sv.02.tok.txt
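
The counting step can be approximated with standard tools. The function below is only a sketch, not the actual count_words.pl: the name count_tokens_and_types is made up here, and it assumes that tokens are separated by whitespace and that types are case-sensitive.

```shell
# Hypothetical stand-in for count_words.pl (assumption: whitespace-separated
# tokens, case-sensitive types). Reads tokenized text on stdin and prints
# "<number of tokens> <number of types>".
count_tokens_and_types() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            tokens++
            if (!($i in seen)) { seen[$i] = 1; types++ }
        }
    }
    END { printf "%d %d\n", tokens + 0, types + 0 }'
}
```

Usage would be something like count_tokens_and_types < sv.02.tok.txt.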

Find sentence boundaries and output one sentence per line. There are 532,505 sentences. The average sentence has 18 words, but the longest “sentence” has 922 words!

$PARSINGROOT/tools/find_sentences.pl < sv.02.tok.txt > sv.03.sent.txt
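
Numbers like the sentence count, the average length and the maximum length can be obtained from the one-sentence-per-line file roughly as follows. This is a sketch with a made-up function name (sentence_length_stats); it assumes every line is one sentence and that words are separated by whitespace.

```shell
# Hypothetical helper: reads one tokenized sentence per line on stdin and
# prints "<sentences> <average words per sentence, rounded> <longest sentence>".
sentence_length_stats() {
    awk '
    { n++; total += NF; if (NF > max) max = NF }
    END {
        if (n > 0) printf "%d %.0f %d\n", n, total / n, max
        else print "0 0 0"
    }'
}
```

Usage would be something like sentence_length_stats < sv.03.sent.txt.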

Remove sentences with more than 40 words and sentences with too many dashes or numbers. Long sentences are probably corrupt; they take too long to parse or even make the parser fail.

~/projekty/stanford/tools/discard_long_bad_sentences.pl < sv.03.sent.txt > sv.04.clean.txt
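
The filter could look roughly like the sketch below. Only the 40-word limit comes from the description above. What counts as "too many dashes or numbers" is not stated here, so the rule used in the sketch (drop a sentence if more than half of its tokens are dashes or contain a digit) and the names discard_long_bad, max_len and max_ratio are assumptions for illustration.

```shell
# Hypothetical stand-in for discard_long_bad_sentences.pl.
# Known from the text: sentences longer than 40 words are dropped.
# Assumed: a sentence is "bad" if more than half of its tokens are
# dashes or contain a digit (the real threshold is not documented here).
discard_long_bad() {
    awk -v max_len=40 -v max_ratio=0.5 '
    NF == 0 { next }                 # skip empty lines
    NF > max_len { next }            # too long
    {
        bad = 0
        for (i = 1; i <= NF; i++)
            if ($i ~ /^-+$/ || $i ~ /[0-9]/) bad++
        if (bad / NF > max_ratio) next   # mostly dashes and numbers
        print
    }'
}
```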

The rest contains 430,808 sentences, 6,154,663 words and 137,617 word types. The average is 14 words per sentence. 55,389 sentences were removed because of their length, and 46,308 shorter sentences were removed because of their contents.
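
The counts are consistent with each other, which can be checked with a bit of arithmetic. All numbers are taken from this page; the variable names are arbitrary.

```shell
# Sentence counts reported on this page.
before=532505        # sentences found by find_sentences.pl
too_long=55389       # removed because of their length
bad_content=46308    # removed because of their contents
after=$((before - too_long - bad_content))
echo "$after"                                             # 430808

# Average sentence length before and after cleaning.
awk 'BEGIN { printf "%.1f\n", 9411224 / 532505 }'         # 17.7
awk 'BEGIN { printf "%.1f\n", 6154663 / 430808 }'         # 14.3
```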

Parsing Swedish Acquis

Parse the Swedish Acquis using a model trained on the Danish Treebank (see Parser Adaptation for why).

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl \
    -g /fs/clip-corpora/conll/danish/train.ecmj.tgz \
    < sv.04.clean.txt -o sv.05.ecmj.penn

All 430,808 sentences were parsed on the C cluster in about 30 hours.

Glossed Swedish Acquis

Translate the Swedish Acquis word by word to Danish, using glosses produced by Giza from the Danish-Swedish parallel Acquis.

$PARSINGROOT/tools/wbwtranslate.pl -g ../glossary-sv-da.txt < sv.04.clean.txt > sv.06.dagloss.txt
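
In principle, word-by-word glossing is a dictionary lookup per token. The sketch below is not wbwtranslate.pl. It assumes a glossary with one tab-separated "Swedish word, Danish word" pair per line (the real format of glossary-sv-da.txt may differ) and leaves words that are not in the glossary unchanged. The function name wbw_translate is made up.

```shell
# Hypothetical word-by-word glossing.
# $1: glossary file, assumed format "source<TAB>target", one pair per line.
# stdin: tokenized text, one sentence per line. Unknown words pass through.
# Note: the NR == FNR idiom requires a non-empty glossary file.
wbw_translate() {
    awk -F'\t' '
    NR == FNR { gloss[$1] = $2; next }    # first file: load the glossary
    {
        n = split($0, w, " ")             # " " splits on any whitespace run
        line = ""
        for (i = 1; i <= n; i++) {
            out = (w[i] in gloss) ? gloss[w[i]] : w[i]
            line = line (i > 1 ? " " : "") out
        }
        print line
    }' "$1" -
}
```

Because several Swedish words can map to the same Danish word and the token count does not change, such a lookup can only reduce the number of word types, never the number of tokens.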

The number of word types dropped from 137,617 to 106,858.

Parse the glossed Acquis using the reranking parser trained on the Danish Treebank. It takes about half a day on the C cluster.

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl \
    -g /fs/clip-rosetta/zeman/conll/danish/train.ecmj.tgz \
    < sv.06.dagloss.txt -o sv.07.dagloss.ecmj.penn

Translate the trees back to Swedish.

$PARSINGROOT/tools/restuff.pl -s sv.04.clean.txt \
    < sv.07.dagloss.ecmj.penn \
    > sv.08.restuffed.penn
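
Since glossing keeps the number of tokens, putting the Swedish words back amounts to replacing the k-th leaf of each tree with the k-th token of the corresponding original sentence. The sketch below illustrates that idea only. It is not restuff.pl, the function name restuff is made up, and it assumes one bracketed tree per line with leaves written as "(TAG word)".

```shell
# Hypothetical leaf replacement in Penn-style bracketed trees.
# $1: file with the original tokenized sentences, one per line.
# stdin: one tree per line, aligned line by line with $1.
restuff() {
    awk -v src="$1" '
    {
        if ((getline sent < src) <= 0) sent = ""
        n = split(sent, w, " ")
        rest = $0; out = ""; k = 0
        # A leaf is a space, a token without brackets, then ")".
        while (match(rest, / [^ ()]+\)/)) {
            k++
            # Fall back to the existing leaf if the sentence is too short.
            leaf = (k <= n) ? w[k] : substr(rest, RSTART + 1, RLENGTH - 2)
            out = out substr(rest, 1, RSTART) leaf ")"
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}
```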

Delexicalized Swedish Acquis

Tag the Swedish Acquis using the Hajič tagger. The tagger needs its input in CSTS format and ISO 8859-1 encoding. Conversion to CSTS comes first because the conversion script needs UTF-8 input. The encoding conversion cannot be done with iconv at this point because iconv cannot process corrupted UTF-8 input. We also encode all vertical bars as entities because the tagger cannot process input that contains them.

$PARSINGROOT/tools/tok2csts.pl -l sv < sv.04.clean.txt > sv.09.csts
perl -e 'use Encode; binmode(STDIN, ":utf8"); while(<>) { print(encode("iso-8859-1", $_)); }' \
    < sv.09.csts > sv.10.iso.csts
perl -pe 's/\|/&verbar;/g' < sv.10.iso.csts > sv.11.verbar.csts

Run the tagger.

~zeman/nastroje/taggery/hajic-sv/2006-11-08/SE061108x TG sv.11.verbar.csts sv.12.hajic.verbar.csts

Recode the tagged Acquis back to UTF-8 (iconv is now fine) and restore the vertical bars.

perl -pe 's/&verbar;/\|/g' < sv.12.hajic.verbar.csts | iconv -f iso-8859-1 -t utf8 > sv.13.hajic.csts
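
The vertical-bar escaping is meant to be a lossless round trip around the tagger. A small sed-based sketch of the two directions (the function names encode_bars and decode_bars are made up; the entity name &verbar; is the one used in the commands on this page):

```shell
# Replace every vertical bar by the entity &verbar; and back.
# In the sed replacement, "\&" is a literal ampersand.
encode_bars() { sed 's/|/\&verbar;/g'; }
decode_bars() { sed 's/&verbar;/|/g'; }
```

The round trip is only exact if the text does not already contain a literal &verbar; before encoding.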

Simplify the annotation: remove <MMl> and <MMt>, replace <MDl> and <MDt> by <l> and <t>, respectively.

perl -pe 's/\r?\n$//; s/<MM[lt][^>]*>[^<]*//g; s/<MD([lt])[^>]*>/<$1>/g; $_ = "$_\n"' \
    < sv.13.hajic.csts \
    > sv.14.hajic1.csts
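
To see what the simplification does, the same two substitutions can be applied to a single token with sed. The sample line is invented for illustration (the lemma, the tag value NCNSN and the src attribute are assumptions, not real tagger output), and the function name simplify_csts is made up.

```shell
# Same substitutions as in the Perl one-liner, written for sed -E:
# drop <MMl>/<MMt> elements with their content, rename <MDl>/<MDt>.
simplify_csts() {
    sed -E 's/<MM[lt][^>]*>[^<]*//g; s/<MD([lt])[^>]*>/<\1>/g'
}

# Invented sample token.
printf '%s\n' '<f>hus<MMl>hus<MMt>NCNSN<MDl src="a">hus<MDt src="a">NCNSN' | simplify_csts
# prints: <f>hus<l>hus<t>NCNSN
```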

Correct the tagger output. For some reason, it tags every sentence-final period as a noun.

perl -pe 's/(<f>\.+<l>\.+<t>)N[^<]+/$1FE-------/' \
    < sv.14.hajic1.csts \
    > sv.15.hajic2.csts
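
The effect of the correction on one token, using the same pattern in sed. The noun tag in the sample is invented; only the replacement tag FE------- comes from the command above. The function name fix_period_tag is made up.

```shell
# Same substitution as in the Perl one-liner, written for sed -E:
# a token consisting of periods whose tag starts with N gets the tag FE-------.
fix_period_tag() {
    sed -E 's/(<f>\.+<l>\.+<t>)N[^<]+/\1FE-------/'
}

# Invented sample: a period wrongly tagged as a noun, and a real word.
printf '%s\n' '<f>.<l>.<t>NCUSN@DS' '<f>hus<l>hus<t>NCNSN' | fix_period_tag
# prints:
# <f>.<l>.<t>FE-------
# <f>hus<l>hus<t>NCNSN
```

Tokens that are not made of periods are left untouched, so real nouns keep their tags.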
