We use the Danish Dependency Treebank, in the version used in the CoNLL-X (2006) shared task. It contains 5,190 training sentences (94,386 words) and 322 test sentences (5,852 words).
A few transformations have been applied to the treebank to make its design more similar to that of other treebanks. For instance, possessive pronouns should depend on the possessed noun, while in the original treebank the possessive pronoun is the head and the possessed noun depends on it.
The tags and features have been converted to the Penn Treebank tag set.
Our parsers work with constituents, not dependencies, so the dependencies in the Danish treebank have been converted to constituents using the flattest structures possible. Nonterminal labels are drawn from the Penn Treebank repertoire.
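For reference, the CoNLL format mentioned below is the tab-separated CoNLL-X shared-task format. A minimal reader, sketched in Python (read_conllx and the Danish sample sentence are illustrative only, not one of the $PARSINGROOT tools):

```python
# Sketch only: a minimal CoNLL-X reader; not part of the $PARSINGROOT tools.
def read_conllx(lines):
    """Yield sentences as lists of (id, form, cpostag, head) tuples.
    CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL ..."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                 # blank line ends a sentence
            if sent:
                yield sent
                sent = []
            continue
        cols = line.split("\t")
        sent.append((int(cols[0]), cols[1], cols[3], int(cols[6])))
    if sent:
        yield sent

# Invented two-word Danish sentence ("Han kommer") for illustration:
sample = ["1\tHan\than\tPP\tPP\t_\t2\tsubj",
          "2\tkommer\tkomme\tVB\tVB\t_\t0\tROOT",
          ""]
sents = list(read_conllx(sample))
```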
Data preparation
The data is in /fs/clip-corpora/conll/danish. This section describes data preparation for the parser adaptation experiment. The original training data (to be split into our training and development test sets) is called otrain, our training data is called train, the development test data is called dtest, and the final evaluation data is called etest.
Convert the treebank from the CoNLL format to [[CSTS]]
$PARSINGROOT/tools/conll2csts.pl -l da < otrain.conll > otrain.csts
Normalize trees
Transform the treebank so that it conforms to treebanking guidelines used in other treebanks. For instance, the original DDT annotators attached nouns as dependents of determiners, while we want the opposite: determiners governed by nouns.
At the same time, convert morphological tags to the part-of-speech tagset of the Penn Treebank.
$PARSINGROOT/tools/normalize_danish_csts_trees.pl < otrain.csts > otrain.normalized.csts
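The head-flipping step can be sketched as follows. This is a toy version of one normalization only (determiner/noun reattachment), with Penn-style placeholder tags; the actual script handles possessives and other constructions as well:

```python
def flip_det_noun(heads, pos):
    """Re-hang nouns above determiners: if a noun depends on a determiner,
    the noun takes over the determiner's governor and the determiner
    attaches to the noun. Arrays are 1-based (index 0 unused); head 0 = root.
    Sketch only, with placeholder Penn-style tags NN/DT."""
    new = list(heads)
    for i in range(1, len(heads)):
        h = heads[i]
        if h != 0 and pos[i] == "NN" and pos[h] == "DT":
            new[i] = heads[h]   # noun inherits the determiner's governor
            new[h] = i          # determiner now depends on the noun
    return new

# "den bil" (the car): DDT-style annotation had bil(2) -> den(1) -> root
pos   = [None, "DT", "NN"]
heads = [None, 0, 1]
flipped = flip_det_noun(heads, pos)
```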
The normalization and the new tags can be viewed in Tred, if desired. To do that, we need to convert the normalized data to the FS format (because Tred cannot read CSTS encoded in UTF-8). This step is optional.
$PARSINGROOT/tools/cstsfs.pl < otrain.normalized.csts > otrain.normalized.fs
/fs/nlp/Programs/tred/tred otrain.normalized.fs
Convert dependencies to constituents
The flattest possible structure is created. The constituent labels (nonterminals) are derived from the part-of-speech tags of the heads and then translated to the Penn Treebank repertoire.
$PARSINGROOT/tools/csts2penn.pl otrain.normalized.csts > otrain.penn
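The conversion idea can be sketched like this: every head that has dependents projects exactly one flat phrase, labeled via its tag. The tag-to-nonterminal map and the S1 top wrapper here are assumptions for the sketch, not the script's actual tables:

```python
def dep_to_flat_penn(tokens):
    """tokens: list of (form, tag, head) with 1-based heads, 0 = root.
    Produce the flattest bracketing: one phrase per head with dependents,
    label derived from the head's tag. Sketch with a toy tag map."""
    label = {"NN": "NP", "VB": "S"}   # toy tag->nonterminal map (assumption)
    children = {}
    for i, (_, _, h) in enumerate(tokens, 1):
        children.setdefault(h, []).append(i)

    def build(i):
        form, tag, _ = tokens[i - 1]
        leaf = "(%s %s)" % (tag, form)
        kids = children.get(i, [])
        if not kids:
            return leaf
        parts = ([build(k) for k in kids if k < i] + [leaf] +
                 [build(k) for k in kids if k > i])
        return "(%s %s)" % (label.get(tag, "X"), " ".join(parts))

    return "(S1 %s)" % build(children[0][0])

tree = dep_to_flat_penn([("Han", "PP", 2), ("kommer", "VB", 0)])
```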
Split the data
First perform the steps described above separately for otrain and etest. We do not split otrain into train and dtest earlier because until this point we did not have one sentence per line (and splitting is much easier once we do).
head -4900 otrain.penn > train.penn
tail -290 otrain.penn > dtest.penn
Get plain text of test data
We need plain text as input to the parser.
$PARSINGROOT/tools/penn2text.pl < dtest.penn > dtest.txt
$PARSINGROOT/tools/penn2text.pl < etest.penn > etest.txt
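Extracting plain text from a Penn-style bracketing amounts to collecting the leaf words. A sketch, assuming the simple "(TAG word)" leaf shape (this is not penn2text.pl itself):

```python
import re

def penn_to_text(tree):
    """Pull the terminal words out of '(TAG word)' leaves (sketch only;
    assumes no parentheses occur inside tokens)."""
    return " ".join(m.group(1)
                    for m in re.finditer(r"\(\S+ ([^()\s]+)\)", tree))

text = penn_to_text("(S1 (S (PP Han) (VB kommer)))")  # -> "Han kommer"
```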
Parsing experiments
Stanford Parser
The parser changes " tokens to `` or '', so these tokens were changed back to " in the Stanford output. Still, two sentences were reported as erroneous by evalb.
Evaluation (evalb, sentences of 40 or fewer tokens): P = 66.56 %, R = 69.12 %, F = 67.82 %, C = 1.78.
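As a sanity check on the reported scores: evalb's F-score is the harmonic mean of bracketing precision and recall, which reproduces the numbers above:

```python
def f1(p, r):
    """Harmonic mean of bracketing precision and recall, as evalb reports."""
    return 2 * p * r / (p + r)

# Stanford scores from the run above: F = 67.82 from P = 66.56, R = 69.12
stanford_f = f1(66.56, 69.12)
```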
Train Charniak
It takes 48 seconds on the C cluster.
$PARSINGROOT/charniak-parser/scripts/train.pl < train.penn > train.ecdata.tgz
Test Charniak
Parsing the test data takes about 3 minutes on the C cluster.
The parser changes " tokens to `` or '', so these tokens were changed back to " in the output. Still, two sentences were reported as erroneous by evalb.
$PARSINGROOT/charniak-parser/scripts/parse.pl -g train.ecdata.tgz < dtest.txt | \
  $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ec.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.penn
Evaluation (evalb, sentences of 40 or fewer tokens): P = 74.44 %, R = 75.54 %, F = 74.99 %. No erroneous sentences; 276 sentences evaluated.
Train Brown
By the Brown parser we mean the combination of the Charniak n-best parser and Johnson's reranker. The main point here is training the reranker, but the resulting tgzipped file contains the Charniak statistics as well.
$PARSINGROOT/brown-reranking-parser/scripts/train.pl < train.penn > train.ecmj.tgz
Test Brown
$PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g train.ecmj.tgz < dtest.txt | \
  $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ecmj.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ecmj.penn
P = 75.93 %, R = 75.70 %, F = 75.81 %. No erroneous sentences; 276 sentences evaluated.
Delexicalization
By delexicalization we mean replacing words with their morphological tags. We need it for the parser adaptation experiments. After delexicalization, the Danish tags are the terminals, while the preterminals still contain the simpler Penn-style tags.
$PARSINGROOT/tools/normalize_and_delexicalize_danish_csts_trees.pl \
  < otrain.csts \
  > otrain.delex.csts
cstsfs.pl < otrain.delex.csts > otrain.delex.fs
$PARSINGROOT/tools/csts2penn.pl otrain.delex.csts > otrain.delex.penn
head -4900 otrain.delex.penn > train.delex.penn
tail -290 otrain.delex.penn > dtest.delex.penn
$PARSINGROOT/tools/penn2text.pl < dtest.delex.penn > dtest.delex.txt
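The effect on a tree can be sketched as swapping each leaf word for its morphological tag while keeping the Penn-style preterminal. The Danish-looking tags in the example are invented for illustration; this is not the Perl script itself:

```python
import re

def delexicalize_leaves(tree, word_to_tag):
    """Swap each terminal word in '(TAG word)' leaves for its morphological
    tag, keeping the Penn-style preterminal (sketch; assumes the simple
    leaf shape used throughout these sketches)."""
    return re.sub(r"\((\S+) ([^()\s]+)\)",
                  lambda m: "(%s %s)" % (m.group(1), word_to_tag[m.group(2)]),
                  tree)

# The morphological tags below are invented examples, not real DDT tags:
delex = delexicalize_leaves("(S1 (S (PP Han) (VB kommer)))",
                            {"Han": "PP-3SG", "kommer": "V-PRES"})
```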
Parsing delexicalized treebank
Train Charniak on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.
$PARSINGROOT/charniak-parser/scripts/train.pl < train.delex.penn > train.delex.ec.tgz
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -n 1 -g train.delex.ec.tgz < dtest.delex.txt \
  -o dtest.delex.ec.penn
$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.delex.ec.penn > dtest.delex.ec.restuffed.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.restuffed.penn \
  | tee result.dtest.delex.ec.txt
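Restuffing is the inverse of delexicalization: the parse comes back with tags as leaves, and the original words are put back in left to right. A sketch of the idea behind restuff.pl -s (under the simple "(TAG token)" leaf shape assumed in these sketches):

```python
import re

def restuff(delex_tree, words):
    """Replace the leaf tokens of a parse produced on delexicalized input
    with the original words, consumed left to right (sketch only)."""
    it = iter(words)
    return re.sub(r"\((\S+) ([^()\s]+)\)",
                  lambda m: "(%s %s)" % (m.group(1), next(it)),
                  delex_tree)

restuffed = restuff("(S1 (S (PP PP-3SG) (VB V-PRES)))", ["Han", "kommer"])
```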
Evaluation (evalb, sentences of 40 or fewer tokens): P = 72.55 %, R = 73.03 %, F = 72.79 %, T = 51.21 %. Evaluated 276 sentences.
Train Brown on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.
$PARSINGROOT/brown-reranking-parser/scripts/train.pl -nick da-delex -reuse \
  < train.delex.penn \
  > train.delex.ecmj.tgz
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -n 1 -g train.delex.ecmj.tgz < dtest.delex.txt \
  -o dtest.delex.ecmj.penn
$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.delex.ecmj.penn \
  > dtest.delex.ecmj.restuffed.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ecmj.restuffed.penn \
  | tee result.delex.ecmj.txt
Evaluation (evalb, sentences of 40 or fewer tokens): P = 77.04 %, R = 76.96 %, F = 77.00 %, T = 51.21 %. Evaluated 276 sentences.