name = Acquis | path = /fs/clip-corpora/Europarl/acquis | owner = zeman
The JRC-Acquis corpus is a large collection of European Union documents in 21 languages: Czech, Danish, Dutch, German, Greek, English, Estonian, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish. Its original format is TEI-compliant XML. It is a parallel corpus and automatic paragraph-level alignment is available (sentence boundaries are not marked).
The path above currently contains Danish and Swedish (the da and sv subfolders, respectively) and their alignment.
The processing described below was done by Dan for Parser Adaptation. First, we extract plain text from the TEI XML format. There will be one non-empty paragraph per output line, 475,707 paragraphs in total.
cd /fs/clip-corpora/Europarl/acquis/sv
$PARSINGROOT/tools/tei2txt.pl < jrc-sv.xml > sv.01.txt
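The source of tei2txt.pl is not shown here, so the following is only a rough Python stand-in for the extraction step. It assumes that paragraphs are stored in `<p>` elements (possibly namespaced) and that a paragraph counts as empty when it contains only whitespace; the function name `tei_paragraphs` is made up for this illustration.

```python
import re
import xml.etree.ElementTree as ET

def tei_paragraphs(xml_text):
    """Yield the text of each non-empty <p> element, one string per paragraph."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Drop a namespace prefix such as {http://www.tei-c.org/ns/1.0} if present.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag != "p":
            continue
        # Join nested text and collapse whitespace so each paragraph fits on one line.
        text = re.sub(r"\s+", " ", "".join(elem.itertext())).strip()
        if text:
            yield text

# Tiny made-up sample; the real corpus file is jrc-sv.xml.
sample = (
    "<TEI><text><body>"
    "<p>Första\n stycket.</p><p>  </p><p>Andra <hi>stycket</hi>.</p>"
    "</body></text></TEI>"
)
lines = list(tei_paragraphs(sample))
print("\n".join(lines))
```

For a file the size of the Acquis, a streaming parser (for example `ET.iterparse`) would probably be preferable to loading the whole document at once.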
Tokenize the text using our English tokenizer. Then count words (tokens). There are 9,411,224 words and 163,084 word types.
$PARSINGROOT/tools/tokenizeE.pl - - < sv.01.txt > sv.02.tok.txt
$PARSINGROOT/tools/count_words.pl < sv.02.tok.txt
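What the counting step computes can be sketched in a few lines of Python. This is not count_words.pl itself; it assumes that tokens are separated by whitespace in the tokenized file and that word types are distinguished case-sensitively, which may differ from the real script.

```python
from collections import Counter

def count_words(lines):
    """Return (number of tokens, number of distinct word types)."""
    counts = Counter()
    for line in lines:
        # The text is already tokenized, so splitting on whitespace is enough.
        counts.update(line.split())
    return sum(counts.values()), len(counts)

# Made-up tokenized sample lines.
tokens, types = count_words([
    "Rådets förordning ( EEG ) nr 1",
    "Rådets beslut nr 2",
])
print(tokens, "words,", types, "word types")
```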
Find sentence boundaries and output one sentence per line. There are 532,505 sentences. The average number of words per sentence is 18, but the longest “sentence” has 922 words!
$PARSINGROOT/tools/find_sentences.pl < sv.02.tok.txt > sv.03.sent.txt
Remove sentences with more than 40 words and sentences with too many dashes or numbers. Long sentences are probably corrupt; they take too long to parse or even make the parser fail.
~/projekty/stanford/tools/discard_long_bad_sentences.pl < sv.03.sent.txt > sv.04.clean.txt
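A filter of this kind might look roughly like the sketch below. Only the 40-word limit comes from the text above; the definition of "too many dashes or numbers" (more than half of the tokens) is a hypothetical threshold chosen for illustration and is not taken from discard_long_bad_sentences.pl.

```python
import re

MAX_WORDS = 40        # limit stated in the text
MAX_BAD_RATIO = 0.5   # hypothetical threshold, not from the original script

# A "bad" token is a run of dashes or a number-like token containing a digit.
BAD_TOKEN = re.compile(r"-+|[\d.,/]*\d[\d.,/]*")

def keep_sentence(sentence):
    """Return True if the tokenized sentence should stay in the corpus."""
    tokens = sentence.split()
    if not tokens or len(tokens) > MAX_WORDS:
        return False
    bad = sum(1 for t in tokens if BAD_TOKEN.fullmatch(t))
    return bad / len(tokens) <= MAX_BAD_RATIO

for s in ["Detta är en vanlig mening .", "1 - 2 - 3 - 4"]:
    print(keep_sentence(s), s)
```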
What remains contains 430,808 sentences, 6,154,663 words and 137,617 word types. The average is 14 words per sentence. 55,389 sentences were removed because of their length, and 46,308 short sentences were removed because of their contents.
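The reported figures can be checked against each other with a little arithmetic. The snippet only uses numbers quoted in this page; the variable names are ad hoc.

```python
sentences_before = 532_505
removed_long = 55_389      # removed because of their length
removed_bad = 46_308       # short sentences removed because of their contents
sentences_after = sentences_before - removed_long - removed_bad

avg_before = 9_411_224 / sentences_before   # words per sentence before cleaning
avg_after = 6_154_663 / sentences_after     # words per sentence after cleaning

print(sentences_after, round(avg_before, 2), round(avg_after, 2))
```

The difference matches the 430,808 remaining sentences, and the two averages round to the 18 and 14 words per sentence given above.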
Parse the Swedish Acquis using a model trained on the Danish Treebank (see Parser Adaptation for why).
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl \
    -g /fs/clip-corpora/conll/danish/train.ecmj.tgz \
    < sv.04.clean.txt -o sv.05.ecmj.penn
All 430,808 sentences were parsed on the C cluster in about 30 hours.
Translate the Swedish Acquis word by word to Danish, using glosses produced by Giza from the Danish-Swedish parallel Acquis.
$PARSINGROOT/tools/wbwtranslate.pl -g ../glossary-sv-da.txt < sv.04.clean.txt > sv.06.dagloss.txt
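Word-by-word glossing amounts to a dictionary lookup per token. The sketch below is not wbwtranslate.pl; it assumes a glossary file with one tab-separated `Swedish<TAB>Danish` pair per line (the real format of glossary-sv-da.txt is not documented here) and leaves unknown words unchanged.

```python
def load_glossary(lines):
    """Build a source-to-target dictionary from tab-separated lines (assumed format)."""
    glossary = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            glossary[parts[0]] = parts[1]
    return glossary

def gloss_sentence(sentence, glossary):
    """Replace each token by its gloss; tokens without a gloss are kept as they are."""
    return " ".join(glossary.get(tok, tok) for tok in sentence.split())

# Made-up glossary entries for illustration.
glossary = load_glossary(["och\tog\n", "inte\tikke\n"])
print(gloss_sentence("rådet och kommissionen är inte", glossary))
```

As long as every gloss is a single token, the output has exactly as many tokens as the input, which is what allows the trees to be mapped back to Swedish later.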
The number of word types dropped from 137,617 to 106,858.
Parse the glossed Acquis using the reranking parser trained on the Danish Treebank. It takes about half a day on the C cluster.
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl \
    -g /fs/clip-rosetta/zeman/conll/danish/train.ecmj.tgz \
    < sv.06.dagloss.txt -o sv.07.dagloss.ecmj.penn
Translate the trees back to Swedish.
$PARSINGROOT/tools/restuff.pl -s sv.04.clean.txt \
    < sv.07.dagloss.ecmj.penn \
    > sv.08.restuffed.penn
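The idea of "restuffing" can be illustrated as follows: walk over the leaves of a bracketed tree and put the original Swedish tokens back in place of the Danish glosses. This is a sketch, not restuff.pl. It assumes one Penn-style tree per line, a one-to-one token correspondence, and the usual `-LRB-`/`-RRB-` escapes for parentheses; the tags in the sample tree are made up.

```python
import re

# A leaf (preterminal) looks like "(TAG word)" with no nested brackets.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def restuff(tree, original_sentence):
    """Replace the leaves of a bracketed tree by the tokens of the original sentence."""
    tokens = original_sentence.split()
    if len(LEAF.findall(tree)) != len(tokens):
        raise ValueError("tree and sentence have different numbers of tokens")
    # Escape parentheses so that the result is still a well-formed bracketing.
    escaped = iter(t.replace("(", "-LRB-").replace(")", "-RRB-") for t in tokens)
    return LEAF.sub(lambda m: "(%s %s)" % (m.group(1), next(escaped)), tree)

glossed_tree = "(S1 (S (NP (PRP det)) (VP (VBZ er) (ADJP (JJ godt)))))"
print(restuff(glossed_tree, "det är bra"))
```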
Tag the Swedish Acquis using the Hajič tagger. The tagger needs its input in CSTS format and ISO 8859-1 encoding. (Conversion to CSTS comes first because the conversion script needs UTF-8 input. The encoding conversion cannot be done by iconv because iconv cannot process corrupted UTF-8 on input.) We also encode all vertical bars because the tagger cannot process input that contains them.
$PARSINGROOT/tools/tok2csts.pl -l sv < sv.04.clean.txt > sv.09.csts
perl -e 'use Encode; binmode(STDIN, ":utf8"); while(<>) { print(encode("iso-8859-1", $_)); }' \
    < sv.09.csts > sv.10.iso.csts
perl -pe 's/\|/&verbar;/g' < sv.10.iso.csts > sv.11.verbar.csts
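The two preparatory conversions (a tolerant recoding to ISO 8859-1 and hiding the vertical bars) could be sketched in Python as follows. This is an illustration, not the pipeline's code. It assumes that corrupt or unrepresentable characters may simply be replaced by `?`, and that vertical bars are encoded as the SGML entity `&verbar;`; that placeholder is an assumption suggested by the `verbar` file names.

```python
def to_latin1_csts(raw_bytes, placeholder="&verbar;"):
    """Recode possibly corrupt UTF-8 bytes to ISO 8859-1 and hide vertical bars."""
    # errors="replace" turns invalid byte sequences into U+FFFD instead of failing,
    # which is the step a strict converter such as iconv would reject.
    text = raw_bytes.decode("utf-8", errors="replace")
    # Assumed encoding of the vertical bar; the tagger cannot handle a literal "|".
    text = text.replace("|", placeholder)
    # Characters that do not exist in ISO 8859-1 (including U+FFFD) become "?".
    return text.encode("iso-8859-1", errors="replace")

# "ö" is valid UTF-8, the trailing 0xFF byte is deliberately corrupt.
converted = to_latin1_csts("a|ö".encode("utf-8") + b"\xff")
print(converted)
```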
Run the tagger.
~zeman/nastroje/taggery/hajic-sv/2006-11-08/SE061108x TG sv.11.verbar.csts sv.12.hajic.verbar.csts
Recode the tagged Acquis back to UTF-8 (iconv is now fine) and restore the vertical bars.
perl -pe 's/&verbar;/\|/g' < sv.12.hajic.verbar.csts | iconv -f iso-8859-1 -t utf8 > sv.13.hajic.csts
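The reverse conversion is simpler because every ISO 8859-1 byte sequence is valid, so nothing can fail while decoding. A minimal Python sketch, again assuming that `&verbar;` is the placeholder used for vertical bars (an assumption based on the file names):

```python
def from_latin1_csts(latin1_bytes, placeholder="&verbar;"):
    """Restore vertical bars and recode ISO 8859-1 bytes to UTF-8."""
    text = latin1_bytes.decode("iso-8859-1")   # cannot fail: every byte is a valid character
    text = text.replace(placeholder, "|")      # undo the assumed placeholder encoding
    return text.encode("utf-8")

restored = from_latin1_csts(b"a&verbar;\xf6")
print(restored.decode("utf-8"))
```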
Simplify the annotation: remove <MMl> and <MMt>, and replace <MDl> and <MDt> by <l> and <t>, respectively.
perl -pe 's/\r?\n$//; s/<MM[lt][^>]*>[^<]*//g; s/<MD([lt])[^>]*>/<$1>/g; $_ = "$_\n"' \
    < sv.13.hajic.csts \
    > sv.14.hajic1.csts
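To make the effect of the substitutions easier to see, here is the same transformation written out in Python and applied to one line. The sample line, including its tags and the `src` attribute, is made up for illustration and may not match real tagger output exactly.

```python
import re

def simplify(line):
    """Drop <MMl>/<MMt> elements and rename <MDl>/<MDt> to <l>/<t>."""
    line = line.rstrip("\r\n")                         # normalize the line ending
    line = re.sub(r"<MM[lt][^>]*>[^<]*", "", line)     # remove the MM tag and its content
    line = re.sub(r"<MD([lt])[^>]*>", r"<\1>", line)   # keep only l/t, drop attributes
    return line + "\n"

sample = '<f>rådet<MMl>råd<MMt>NCNSN@DS<MMt>NCNSA@DS<MDl src="a">råd<MDt src="a">NCNSN@DS\r\n'
print(simplify(sample), end="")
```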
Correct the tagger output. For some reason, it tags every sentence-final period as a noun.
perl -pe 's/(<f>\.+<l>\.+<t>)N[^<]+/$1FE-------/' \
    < sv.14.hajic1.csts \
    > sv.15.hajic2.csts
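A Python rendering of this correction is sketched below, applied to made-up input lines. One deliberate difference from the Perl command: the tag is matched with `[^<\s]+` rather than `[^<]+`, so the trailing newline is never swallowed. That change is this sketch's own choice, not something the original command does.

```python
import re

# A token whose form and lemma consist only of periods and whose tag starts with N.
PERIOD_AS_NOUN = re.compile(r"(<f>\.+<l>\.+<t>)N[^<\s]+")

def fix_period_tag(line):
    """Replace a noun tag on a period token by the punctuation tag FE-------."""
    return PERIOD_AS_NOUN.sub(r"\1FE-------", line, count=1)

for line in ["<f>.<l>.<t>NCUSN@DS\n", "<f>rådet<l>råd<t>NCNSN@DS\n"]:
    print(fix_period_tag(line), end="")
```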