We use the Danish Dependency Treebank, in the version used in the CoNLL-X (2006) shared task. It contains 5,190 training sentences (94,386 words) and 322 test sentences (5,852 words).
A few transformations have been applied to the treebank to make its design more similar to that of other treebanks. For instance, possessive pronouns should depend on the possessed noun, while in the original treebank the possessive pronoun is the head and the possessed noun depends on it.
The tags and features have been converted to the Penn Treebank tag set.
Our parsers work with constituents, not dependencies, so the dependencies in the Danish treebank have been converted to constituents using the flattest structures possible. Nonterminal labels are drawn from the Penn Treebank repertoire.
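For reference, the CoNLL format mentioned below is the tab-separated CoNLL-X shared-task format. A minimal reader, sketched in Python (read_conllx and the Danish sample sentence are illustrative only, not one of the $PARSINGROOT tools):

```python
# Sketch only: a minimal CoNLL-X reader; not part of the $PARSINGROOT tools.
def read_conllx(lines):
    """Yield sentences as lists of (id, form, cpostag, head) tuples.
    CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL ..."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                 # blank line ends a sentence
            if sent:
                yield sent
                sent = []
            continue
        cols = line.split("\t")
        sent.append((int(cols[0]), cols[1], cols[3], int(cols[6])))
    if sent:
        yield sent

# Invented two-word Danish sentence ("Han kommer") for illustration:
sample = ["1\tHan\than\tPP\tPP\t_\t2\tsubj",
          "2\tkommer\tkomme\tVB\tVB\t_\t0\tROOT",
          ""]
sents = list(read_conllx(sample))
```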
Data preparation
The data is in /fs/clip-corpora/conll/danish. This section describes data preparation for the parser adaptation experiment. The original training data (to be split into our training and development test sets) is called otrain, our training data is called train, the development test data is called dtest, and the final evaluation data is called etest.
Convert the treebank from the CoNLL format to [[CSTS]]
$PARSINGROOT/tools/conll2csts.pl -l da < otrain.conll > otrain.csts
Normalize trees
Transform the treebank so that it conforms to treebanking guidelines used in other treebanks. For instance, the original DDT annotators attached nouns as dependents of determiners, while we want the opposite: determiners governed by nouns.
At the same time, convert morphological tags to the part-of-speech tagset of the Penn Treebank.
$PARSINGROOT/tools/normalize_danish_csts_trees.pl < otrain.csts > otrain.normalized.csts
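The head-flipping step can be sketched as follows. This is a toy version of one normalization only (determiner/noun reattachment), with Penn-style placeholder tags; the actual script handles possessives and other constructions as well:

```python
def flip_det_noun(heads, pos):
    """Re-hang nouns above determiners: if a noun depends on a determiner,
    the noun takes over the determiner's governor and the determiner
    attaches to the noun. Arrays are 1-based (index 0 unused); head 0 = root.
    Sketch only, with placeholder Penn-style tags NN/DT."""
    new = list(heads)
    for i in range(1, len(heads)):
        h = heads[i]
        if h != 0 and pos[i] == "NN" and pos[h] == "DT":
            new[i] = heads[h]   # noun inherits the determiner's governor
            new[h] = i          # determiner now depends on the noun
    return new

# "den bil" (the car): DDT-style annotation had bil(2) -> den(1) -> root
pos   = [None, "DT", "NN"]
heads = [None, 0, 1]
flipped = flip_det_noun(heads, pos)
```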
The normalization and the new tags can be viewed in Tred, if desired. To do that, we need to convert the normalized data to the FS format (because Tred cannot read CSTS encoded in UTF-8). This step is optional.
$PARSINGROOT/tools/cstsfs.pl < otrain.normalized.csts > otrain.normalized.fs
/fs/nlp/Programs/tred/tred otrain.normalized.fs
Convert dependencies to constituents
The flattest possible structure is created. The constituent labels (nonterminals) are derived from the part-of-speech tags of the heads and then translated to the Penn Treebank repertoire.
$PARSINGROOT/tools/csts2penn.pl otrain.normalized.csts > otrain.penn
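The conversion idea can be sketched like this: every head that has dependents projects exactly one flat phrase, labeled via its tag. The tag-to-nonterminal map and the S1 top wrapper here are assumptions for the sketch, not the script's actual tables:

```python
def dep_to_flat_penn(tokens):
    """tokens: list of (form, tag, head) with 1-based heads, 0 = root.
    Produce the flattest bracketing: one phrase per head with dependents,
    label derived from the head's tag. Sketch with a toy tag map."""
    label = {"NN": "NP", "VB": "S"}   # toy tag->nonterminal map (assumption)
    children = {}
    for i, (_, _, h) in enumerate(tokens, 1):
        children.setdefault(h, []).append(i)

    def build(i):
        form, tag, _ = tokens[i - 1]
        leaf = "(%s %s)" % (tag, form)
        kids = children.get(i, [])
        if not kids:
            return leaf
        parts = ([build(k) for k in kids if k < i] + [leaf] +
                 [build(k) for k in kids if k > i])
        return "(%s %s)" % (label.get(tag, "X"), " ".join(parts))

    return "(S1 %s)" % build(children[0][0])

tree = dep_to_flat_penn([("Han", "PP", 2), ("kommer", "VB", 0)])
```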
Split the data
First perform the steps described above separately for otrain and etest. We do not split otrain into train and dtest earlier because until this point we did not have one sentence per line (and splitting is much easier once we do).
head -4900 otrain.penn > train.penn
tail -290 otrain.penn > dtest.penn
Get plain text of test data
We need plain text as input to the parser.
$PARSINGROOT/tools/penn2text.pl < dtest.penn > dtest.txt
$PARSINGROOT/tools/penn2text.pl < etest.penn > etest.txt
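Extracting plain text from a Penn-style bracketing amounts to collecting the leaf words. A sketch, assuming the simple "(TAG word)" leaf shape (this is not penn2text.pl itself):

```python
import re

def penn_to_text(tree):
    """Pull the terminal words out of '(TAG word)' leaves (sketch only;
    assumes no parentheses occur inside tokens)."""
    return " ".join(m.group(1)
                    for m in re.finditer(r"\(\S+ ([^()\s]+)\)", tree))

text = penn_to_text("(S1 (S (PP Han) (VB kommer)))")  # -> "Han kommer"
```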
Parsing experiments
Stanford Parser
The parser changes " tokens to `` or '', so these tokens were changed back to " in the Stanford output. Still, two sentences were reported as erroneous by evalb.
Evaluation (evalb, sentences of 40 or fewer tokens): P = 66.56 %, R = 69.12 %, F = 67.82 %, C = 1.78.
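As a sanity check on the reported scores: evalb's F-score is the harmonic mean of bracketing precision and recall, which reproduces the numbers above:

```python
def f1(p, r):
    """Harmonic mean of bracketing precision and recall, as evalb reports."""
    return 2 * p * r / (p + r)

# Stanford scores from the run above: F = 67.82 from P = 66.56, R = 69.12
stanford_f = f1(66.56, 69.12)
```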
Train Charniak
It takes 48 seconds on the C cluster.
$PARSINGROOT/charniak-parser/scripts/train.pl < train.penn > train.ecdata.tgz
Test Charniak
Parsing the test data takes about 3 minutes on the C cluster.
The parser changes " tokens to `` or '', so these tokens were changed back to " in the output. Still, two sentences were reported as erroneous by evalb.
$PARSINGROOT/charniak-parser/scripts/parse.pl -g train.ecdata.tgz < dtest.txt | \
  $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ec.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.penn
Evaluation (evalb, sentences of 40 or fewer tokens): P = 74.44 %, R = 75.54 %, F = 74.99 %. No erroneous sentences; 276 sentences evaluated.
Train Brown
By the Brown parser we mean the combination of the Charniak n-best parser and Johnson's reranker. The main point here is training the reranker, but the resulting tgzipped file contains the Charniak statistics as well.
$PARSINGROOT/brown-reranking-parser/scripts/train.pl < train.penn > train.ecmj.tgz
Test Brown
$PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g train.ecmj.tgz < dtest.txt | \
  $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ecmj.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ecmj.penn
P = 75.93 %, R = 75.70 %, F = 75.81 %. No erroneous sentences; 276 sentences evaluated.
Delexicalization
By delexicalization we mean replacing words with their morphological tags. We need it for the parser adaptation experiments. After delexicalization, the Danish tags are the terminals, while the preterminals still contain the simpler Penn-style tags.
$PARSINGROOT/tools/normalize_and_delexicalize_danish_csts_trees.pl \
  < otrain.csts \
  > otrain.delex.csts
cstsfs.pl < otrain.delex.csts > otrain.delex.fs
$PARSINGROOT/tools/csts2penn.pl otrain.delex.csts > otrain.delex.penn
head -4900 otrain.delex.penn > train.delex.penn
tail -290 otrain.delex.penn > dtest.delex.penn
$PARSINGROOT/tools/penn2text.pl < dtest.delex.penn > dtest.delex.txt
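The effect on a tree can be sketched as swapping each leaf word for its morphological tag while keeping the Penn-style preterminal. The Danish-looking tags in the example are invented for illustration; this is not the Perl script itself:

```python
import re

def delexicalize_leaves(tree, word_to_tag):
    """Swap each terminal word in '(TAG word)' leaves for its morphological
    tag, keeping the Penn-style preterminal (sketch; assumes the simple
    leaf shape used throughout these sketches)."""
    return re.sub(r"\((\S+) ([^()\s]+)\)",
                  lambda m: "(%s %s)" % (m.group(1), word_to_tag[m.group(2)]),
                  tree)

# The morphological tags below are invented examples, not real DDT tags:
delex = delexicalize_leaves("(S1 (S (PP Han) (VB kommer)))",
                            {"Han": "PP-3SG", "kommer": "V-PRES"})
```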
Parsing delexicalized treebank
Train Charniak on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.
$PARSINGROOT/charniak-parser/scripts/train.pl < train.delex.penn > train.delex.ec.tgz
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -n 1 -g train.delex.ec.tgz < dtest.delex.txt \
  -o dtest.delex.ec.penn
$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.delex.ec.penn > dtest.delex.ec.restuffed.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.restuffed.penn \
  | tee result.dtest.delex.ec.txt
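Restuffing is the inverse of delexicalization: the parse comes back with tags as leaves, and the original words are put back in left to right. A sketch of the idea behind restuff.pl -s (under the simple "(TAG token)" leaf shape assumed in these sketches):

```python
import re

def restuff(delex_tree, words):
    """Replace the leaf tokens of a parse produced on delexicalized input
    with the original words, consumed left to right (sketch only)."""
    it = iter(words)
    return re.sub(r"\((\S+) ([^()\s]+)\)",
                  lambda m: "(%s %s)" % (m.group(1), next(it)),
                  delex_tree)

restuffed = restuff("(S1 (S (PP PP-3SG) (VB V-PRES)))", ["Han", "kommer"])
```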
Evaluation (evalb, sentences of 40 or fewer tokens): P = 72.55 %, R = 73.03 %, F = 72.79 %, T = 51.21 %. Evaluated 276 sentences.
Train Brown on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.
$PARSINGROOT/brown-reranking-parser/scripts/train.pl -nick da-delex -reuse \
  < train.delex.penn \
  > train.delex.ecmj.tgz
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -n 1 -g train.delex.ecmj.tgz < dtest.delex.txt \
  -o dtest.delex.ecmj.penn
$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.delex.ecmj.penn \
  > dtest.delex.ecmj.restuffed.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ecmj.restuffed.penn \
  | tee result.delex.ecmj.txt
Evaluation (evalb, sentences of 40 or fewer tokens): P = 77.04 %, R = 76.96 %, F = 77.00 %, T = 51.21 %. Evaluated 276 sentences.