{{Infobox Resource | name = Acquis | path = /fs/clip-corpora/Europarl/acquis | owner = zeman }}

The [[http://wt.jrc.it/lt/Acquis/|JRC-Acquis]] corpus is a large collection of European Union documents in 21 languages: Czech, Danish, Dutch, German, Greek, English, Estonian, Finnish, French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish. Its original format is TEI-compliant XML. It is a parallel corpus, and automatic paragraph-level alignment is available (sentence boundaries are not marked). The path given above currently contains Danish and Swedish (''da'' and ''sv'' subfolders, respectively) and their alignment.

=====Preprocessing of the Swedish Acquis=====

This has been done by Dan for

  $PARSINGROOT/tools/tok2csts.pl -l sv < sv.04.clean.txt > sv.09.csts
  perl -e 'use Encode; binmode(STDIN, ":utf8"); while(<>) { print(encode("iso-8859-1", $_)); }' \
      < sv.09.csts > sv.10.iso.csts
  perl -pe 's/\|/&verbar;/g' < sv.10.iso.csts > sv.11.verbar.csts

Run the tagger.

  ~zeman/nastroje/taggery/hajic-sv/2006-11-08/SE061108x TG sv.11.verbar.csts sv.12.hajic.verbar.csts

Recode the tagged Acquis back to UTF-8 (''iconv'' is now fine) and reinstall the vertical bars.

  perl -pe 's/&verbar;/\|/g' < sv.12.hajic.verbar.csts | iconv -f iso-8859-1 -t utf8 > sv.13.hajic.csts

Simplify the annotation: remove '''' and '''', replace '''' and '''' by '''' and '''', respectively.

  perl -pe 's/\r?\n$//; s/]*>[^<]*//g; s/]*>/<$1>/g; $_ = "$_\n"' \
      < sv.13.hajic.csts \
      > sv.14.hajic1.csts

Correct the tagger output. For some reason, it tags every sentence-final period as a noun.

  perl -pe 's/(\.+\.+)N[^<]+/$1FE-------/' \
      < sv.14.hajic1.csts \
      > sv.15.hajic2.csts