
Institute of Formal and Applied Linguistics Wiki




name = Talbanken05 | owner = zeman | path = /fs/clip-corpora/conll/swedish | version = 2006

Talbanken05 is the Swedish treebank used in the CoNLL-06 shared task. There are 11,042 training sentences (191,467 words) and 389 test sentences (5,656 words); the average training sentence is 17 words long, the average test sentence 15 words. The data come from mixed domains.
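The counts can be re-checked directly from the CoNLL files with a one-liner such as the following (a sketch: it assumes one token per line, sentences separated by blank lines, and a blank line after the last sentence; the test-file name etest.conll is an assumption, only otrain.conll appears below):

perl -ne 'if (/^\s*$/) {$s++} else {$w++} END {print "$s sentences, $w words\n"}' otrain.conll
perl -ne 'if (/^\s*$/) {$s++} else {$w++} END {print "$s sentences, $w words\n"}' etest.conll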

Parser adaptation data split

For the experiments in Parser adaptation, we save the CoNLL test data for the final evaluation and cut a development test set from the original training data: the original set is split into 10,700 training sentences and 342 development test sentences. The development test sentences are not visible to the parsers during any phase of training. So, for instance, if the reranker asks for “development data” (which is in fact held-out data used for tuning its weights, i.e. learning), that data has to be cut from the 10,700 training sentences. We can then perform two final evaluations: 1. with models trained exactly as during development; 2. with models trained on the union of the training and the development test data.
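For illustration, such a held-out set would be cut from train.penn (created below), e.g. like this; the sizes and file names are purely illustrative, not the ones actually used:

# illustrative reranker training portion and held-out portion for weight tuning
head -10000 train.penn > train.rr.penn
tail -700   train.penn > heldout.rr.penn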

Data preparation

The data is in /fs/clip-corpora/conll/swedish. This section describes the data preparation for the parser adaptation experiment. The original training data (to be split into our training and our development test sets) is called otrain, our training data is called train, the development test data is called dtest, and the final evaluation data is called etest.

Convert the treebank from the CoNLL format to [[CSTS]]

$PARSINGROOT/tools/conll2csts.pl -l sv < otrain.conll > otrain.csts

Normalize trees

Transform the treebank so that it conforms to treebanking guidelines used in other treebanks.

At the same time, convert morphological tags to the part-of-speech tagset of the Penn Treebank.

$PARSINGROOT/tools/normalize_swedish_csts_trees.pl < otrain.csts > otrain.normalized.csts

The normalization and the new tags can be viewed in TrEd, if desired. To do that, we need to convert the normalized data to the FS format (because TrEd does not accept CSTS encoded in UTF-8). This step is optional.

$PARSINGROOT/tools/cstsfs.pl < otrain.normalized.csts > otrain.normalized.fs
/fs/nlp/Programs/tred/tred otrain.normalized.fs

Convert dependencies to constituents

The flattest possible structure is created. The constituent labels (nonterminals) are derived from the part-of-speech tags of the heads and then translated to the nonterminal repertoire of the Penn Treebank.

$PARSINGROOT/tools/csts2penn.pl otrain.normalized.csts > otrain.penn
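A purely illustrative example (invented sentence; the actual label depends on the translation table): a dependency tree whose head is a verb with two dependents becomes a single flat constituent.

# dependency tree:  han <- såg -> bilen   ("såg" is the head)
# flat constituent: (S (PRP han) (VBD såg) (NN bilen))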

Split the data

First do the steps described above separately for otrain and etest. We do not split otrain into train and dtest earlier because only now do we have one sentence per line (and the splitting is much easier once we do).

head -10700 otrain.penn > train.penn
tail -342   otrain.penn > dtest.penn

Get plain text of test data

We need plain text as input to the parser.

$PARSINGROOT/tools/penn2text.pl < dtest.penn > dtest.txt
$PARSINGROOT/tools/penn2text.pl < etest.penn > etest.txt

Parsing experiments

Train and test Charniak

$PARSINGROOT/charniak-parser/scripts/train.pl < train.penn > train.ecdata.tgz
$PARSINGROOT/charniak-parser/scripts/parse.pl -g train.ecdata.tgz < dtest.txt | \
    $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ec.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.penn

P = 72.37 %, R = 73.68 %, F = 73.02 %. Evaluated 317 sentences.

Learning curve

We repeat the experiment with various training data sizes to see the relation between data size and parsing accuracy. Smaller data sets are always cut from the beginning of the training data, e.g.:

foreach i (50 100 250 500 1000 2500 5000 10700)
  head -$i train.penn > train.$i.penn
  $PARSINGROOT/charniak-parser/scripts/train.pl < train.$i.penn > train.$i.ecdata.tgz
  $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.$i.ecdata.tgz < dtest.txt > dtest.ec.$i.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.$i.penn > result.$i.txt
end

The results are summarized in the following table:

sentences   10,700 (full)   5,000   2,500   1,000   500     250     100     50
precision   72.37           69.74   67.51   63.65   59.23   54.61   44.22   42.72
recall      73.68           70.48   67.93   63.63   58.94   53.48   41.55   40.28
F           73.02           70.11   67.72   63.64   59.08   54.03   42.84   41.46
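The numbers can be pulled out of the evalb result files with something like this (a sketch relying on the standard evalb summary label “Bracketing FMeasure”; each file reports the value for all sentences first and for sentences of at most 40 tokens second):

foreach i (50 100 250 500 1000 2500 5000 10700)
  echo "=== $i training sentences ==="
  grep 'Bracketing FMeasure' result.$i.txt
end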

To see how much the F-score for the smaller data sets depends on where the data have been taken from, we also run a modified version of the experiment, with tail instead of head:

foreach i (50 100 250 500 1000 2500 5000 10700)
  tail -$i train.penn > train.tail.$i.penn
  $PARSINGROOT/charniak-parser/scripts/train.pl < train.tail.$i.penn > train.tail.$i.ecdata.tgz
  $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.tail.$i.ecdata.tgz < dtest.txt > dtest.ec.tail.$i.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.tail.$i.penn > result.tail.$i.txt
end

The results are summarized in the following table:

sentences   10,700 (full)   5,000   2,500   1,000   500     250     100     50
precision   72.37           70.36   68.62   66.48   61.24   55.38   46.64   38.82
recall      73.68           71.39   69.35   66.89   61.75   55.29   45.36   37.67
F           73.02           70.87   68.98   66.69   61.49   55.33   45.99   38.23

Brown

For a long time, we were not able to train the Brown reranking parser on Swedish because of a bug in the treebank normalization script that did not encode < as &lt; on output. The bug was fixed in March 2007 and training now works:

nohup nice $PARSINGROOT/brown-reranking-parser/scripts/train.pl -nick sv -reuse \
    < train.penn \
    > train.ecmj.tgz

Parse the Swedish test data using the newly trained train.ecmj.tgz.

nohup nice $PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g train.ecmj.tgz \
    < dtest.txt \
    > dtest.ecmj.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ecmj.penn

P = 73.22 %, R = 72.58 %, F = 72.90 %, T = 49.17 %. Evaluated 317 sentences. It is strange that the reranker did not help; I currently have no explanation.

Delexicalization

By delexicalization we mean replacing word forms by their morphological tags. We need it for the parser adaptation experiments. After delexicalization, the detailed Swedish tags are the terminals, while the preterminals still contain the simpler Penn-style tags.
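Schematically (the sentence is invented and TAG1–TAG3 are placeholders for the detailed Swedish tags):

# lexicalized:    (S (PRP han) (VBD såg) (NN bilen))
# delexicalized:  (S (PRP TAG1) (VBD TAG2) (NN TAG3))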

The situation is different from Danish in that the manually assigned Swedish tags are too coarse. That's why we first tag the corpus using the [[Hajič tagger]] and then use these tags.

The tagger needs data in the CSTS format, encoded in ISO 8859-1, and without vertical bar characters. We also need to remove the manually assigned lemmas (always “_”) and tags (the coarse Mamba set) because the tagger checks whether the manual annotation (if present) fits its own dictionary and tagset.

cd /fs/clip-corpora/conll/swedish
# Convert to ISO 8859-1 for the tagger.
iconv -f utf8 -t iso-8859-1 < otrain.csts > otrain.iso.csts
# Remove the manually assigned lemmas ("_") and Mamba tags.
perl -pe 's/<l>_<t>[^<]*//' < otrain.iso.csts > otrain.iso.nomorph.csts
# Run the Hajič tagger.
~zeman/nastroje/taggery/hajic-sv/2006-11-08/SE061108x TG \
    otrain.iso.nomorph.csts \
    otrain.hajic.iso.csts
# Convert the tagged data back to UTF-8.
iconv -f iso-8859-1 -t utf8 < otrain.hajic.iso.csts > otrain.hajic.csts
# Drop the <MMl>/<MMt> alternatives and rename the disambiguated <MDl>/<MDt>
# elements to plain <l>/<t>.
perl -pe 's/<MM[lt][^>]*>[^<]*//g; s/<MD([lt])[^>]*>/<$1>/g' \
    < otrain.hajic.csts \
    > otrain.hajic1.csts
# Re-tag dot-only tokens whose tag starts with N to the punctuation tag FE-------.
perl -pe 's/(<f>\.+<l>\.+<t>)N[^<]+/$1FE-------/' \
    < otrain.hajic1.csts \
    > otrain.hajic2.csts
# Normalize the trees and replace word forms by the new tags.
$PARSINGROOT/tools/normalize_and_delexicalize_swedish_csts_trees.pl \
    < otrain.hajic2.csts \
    > otrain.delex.csts
# Optional FS file for viewing in TrEd.
cstsfs.pl < otrain.delex.csts > otrain.delex.fs
# Convert to Penn-style brackets, split, and extract the plain text of the
# development test data, as in the lexicalized setup.
$PARSINGROOT/tools/csts2penn.pl otrain.delex.csts > otrain.delex.penn
head -10700 otrain.delex.penn > train.delex.penn
tail -342   otrain.delex.penn > dtest.delex.penn
$PARSINGROOT/tools/penn2text.pl < dtest.delex.penn > dtest.delex.txt

Parsing delexicalized treebank

Note: word forms in delexicalized nodes can be replaced by morphological tags from more than one source. The results in this section refer to experiments with the statistically assigned Hajič tags.

Train Charniak on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.

$PARSINGROOT/charniak-parser/scripts/train.pl < train.delex.penn > train.delex.ec.tgz
$PARSINGROOT/charniak-parser/scripts/parse.pl -g train.delex.ec.tgz < dtest.delex.txt \
    | ~/projekty/stanford/tools/restuff.pl -s dtest.txt \
    > dtest.delex.ec.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.penn

Evaluation (evalb, sentences of 40 or less tokens): P = 66.94 %, R = 67.77 %, F = 67.35 %, T = 35.57 %. Evaluated 317 sentences.

The probable reason why the results are significantly worse than for Danish is that we are using automatically assigned tags instead of gold-standard tags.

Brown

nohup nice $PARSINGROOT/brown-reranking-parser/scripts/train.pl -nick sv-delex -reuse \
    < train.delex.penn \
    > train.delex.ecmj.tgz
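The parsing and evaluation commands for this model are not recorded here; presumably they follow the same pattern as the lexicalized Brown run and the delexicalized Charniak run above (the output file name dtest.delex.ecmj.penn is an assumption):

nohup nice $PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g train.delex.ecmj.tgz \
    < dtest.delex.txt \
    | ~/projekty/stanford/tools/restuff.pl -s dtest.txt \
    > dtest.delex.ecmj.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ecmj.penn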

Evaluation (evalb, sentences of 40 or less tokens): P = 68.67 %, R = 68.87 %, F = 68.77 %, T = 35.57 %. Evaluated 317 sentences.

Preterminal Mamba tags, terminal Hajič tags

# Combine the two tag sources: read the Hajič-tagged file and the original
# file (with the manual Mamba tags) in parallel, and prepend the Mamba tag
# plus a tab to the Hajič tag inside each <t> element.
perl -e '
    open(H, "otrain.hajic2.csts");
    open(O, "otrain.csts");
    while(<H>)
    {
        $o = <O>;                   # corresponding line with the Mamba tag
        $o =~ m/<t>([^<]*)/;
        $ot = $1; $ot =~ s/\t.*//;  # keep only the first tag
        s/<t>/<t>$ot\t/;            # insert it before the Hajič tag
        print;
    }' > otrain.mamba+hajic2.csts
$PARSINGROOT/tools/normalize_and_delexicalize_swedish_csts_trees.pl \
    < otrain.mamba+hajic2.csts \
    > otrain.delex.csts
$PARSINGROOT/tools/csts2penn.pl otrain.delex.csts > otrain.delex.penn
head -10700 otrain.delex.penn > train.delex.penn
tail -342   otrain.delex.penn > dtest.delex.penn
$PARSINGROOT/tools/penn2text.pl < dtest.delex.penn > dtest.delex.txt
$PARSINGROOT/charniak-parser/scripts/train.pl < train.delex.penn > train.delex.ec.tgz
$PARSINGROOT/charniak-parser/scripts/parse.pl -g train.delex.ec.tgz < dtest.delex.txt \
    | ~/projekty/stanford/tools/restuff.pl -s dtest.txt \
    > dtest.delex.ec.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.penn

Evaluation (evalb, sentences of 40 or less tokens): P = 67.22 %, R = 68.48 %, F = 67.84 %, T = 45.26 %. Evaluated 276 sentences.

