{{Infobox Resource
| name = Danish Dependency Treebank
| owner = zeman
| path = /fs/clip-corpora/conll/danish
| version = 2006
}}

We have the Danish Dependency Treebank that was used in the CoNLL 2006 shared task. It contains 5,190 training sentences (94,386 words) and 322 test sentences (5,852 words).

A few transformations have been applied to the treebank in order to make it more similar to the design of other treebanks. For instance, possessive pronouns should depend on the possessed noun, while in the original treebank the possessive pronoun is the head and the possessed thing depends on it. The tags and features have been converted to the Penn Treebank tag set.

Our parsers work with constituents, not dependencies. The dependencies in the Danish treebank have therefore been converted to the flattest possible constituent structures, with nonterminal labels drawn from the Penn Treebank repertoire.

=====Data preparation=====

The data is in ''/fs/clip-corpora/conll/danish''. This section describes data preparation for the [[parser adaptation]] experiment. The original training data (to be split into our training and our development test) is called ''otrain'', our training data is called ''train'', the development test data is called ''dtest'', and the final evaluation data is called ''etest''.

====Convert the treebank from the CoNLL format to [[CSTS]]====

 $PARSINGROOT/tools/conll2csts.pl -l da < otrain.conll > otrain.csts

====Normalize trees====

Transform the treebank so that it conforms to the treebanking guidelines used in other treebanks. For instance, the original DDT annotators attached nouns as dependents of determiners, while we want the opposite: determiners governed by nouns. At the same time, convert the morphological tags to the part-of-speech tagset of the Penn Treebank.

 $PARSINGROOT/tools/normalize_danish_csts_trees.pl < otrain.csts > otrain.normalized.csts

The normalization and the new tags can be viewed in Tred, if desired.
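The core of this normalization is a head swap: the token that was wrongly annotated as the head (the determiner or possessive pronoun) is re-attached under the noun that depended on it, and the noun inherits the old head. The following is a hypothetical Python sketch of that transformation, not the actual logic of ''normalize_danish_csts_trees.pl''; the POS labels <tt>POSS</tt> and <tt>NOUN</tt> are placeholders, not the real DDT tagset.

```python
def normalize(tokens):
    """Swap head direction for possessive pronouns.

    tokens: list of dicts with 'id', 'pos', 'head' (0 = artificial root).
    Returns a new {id: head} mapping where the possessed noun governs
    the possessive pronoun instead of the other way round.
    """
    heads = {t['id']: t['head'] for t in tokens}
    pos = {t['id']: t['pos'] for t in tokens}
    for t in tokens:
        if pos[t['id']] == 'POSS':
            # children that currently hang under the possessive pronoun
            deps = [u['id'] for u in tokens if heads[u['id']] == t['id']]
            nouns = [d for d in deps if pos[d] == 'NOUN']
            if nouns:
                noun = nouns[0]
                # the noun takes over the pronoun's original head...
                heads[noun] = heads[t['id']]
                # ...and the pronoun (plus any siblings) now depends on it
                heads[t['id']] = noun
                for d in deps:
                    if d != noun:
                        heads[d] = noun
    return heads
```

For a two-token phrase like ''min bog'' ("my book") annotated with the pronoun as root and the noun as its dependent, the function makes the noun the root and the pronoun its dependent.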
To view them in Tred, we first need to convert the normalized data to the [[FS format]], because Tred does not read CSTS encoded in UTF-8. This step is optional.

 $PARSINGROOT/tools/cstsfs.pl < otrain.normalized.csts > otrain.normalized.fs
 /fs/nlp/Programs/tred/tred otrain.normalized.fs

====Convert dependencies to constituents====

The flattest possible structure is created. The constituent labels (nonterminals) are derived from the part-of-speech tags of the heads and then translated to the repertoire of the Penn Treebank.

 $PARSINGROOT/tools/csts2penn.pl otrain.normalized.csts > otrain.penn

====Split the data====

First perform the steps described above separately for ''otrain'' and ''etest''. We do not split ''otrain'' into ''train'' and ''dtest'' earlier because until now we did not have one sentence per line (and the splitting is much easier once we do).

 head -4900 otrain.penn > train.penn
 tail -290 otrain.penn > dtest.penn

====Get plain text of test data====

We need plain text as input to the parser.

 $PARSINGROOT/tools/penn2text.pl < dtest.penn > dtest.txt
 $PARSINGROOT/tools/penn2text.pl < etest.penn > etest.txt

=====Parsing experiments=====

====Stanford Parser====

The parser changes " tokens to <tt>``</tt> or <tt>''</tt>, so these tokens were changed back to " in the Stanford output. Still, two sentences were reported as erroneous by evalb.

Evaluation (evalb, sentences of 40 or fewer tokens): P = 66.56, R = 69.12, F = 67.82, C = 1.78.

====Train Charniak====

Training takes 48 seconds on the C cluster.

 $PARSINGROOT/charniak-parser/scripts/train.pl < train.penn > train.ecdata.tgz

====Test Charniak====

Parsing the test data takes about 3 minutes on the C cluster. The parser changes " tokens to <tt>``</tt> or <tt>''</tt>, so these tokens were changed back to " in the output. Still, two sentences were reported as erroneous by evalb.
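The quote repair performed by ''pennquotes2ascii.pl'' amounts to mapping the Penn Treebank quote tokens back to the plain ASCII character so that the parser's leaves match the gold file again. A rough sketch, under the assumption that the real tool is a simple line filter over the bracketed files:

```python
import re

def quotes_to_ascii(line):
    """Map the Penn quote tokens `` and '' back to a plain " character."""
    return re.sub(r"``|''", '"', line)
```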
 $PARSINGROOT/charniak-parser/scripts/parse.pl -g train.ecdata.tgz < dtest.txt | \
   $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ec.penn
 $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ec.penn

Evaluation (evalb, sentences of 40 or fewer tokens): P = 74.44 %, R = 75.54 %, F = 74.99 %. 0 erroneous sentences, evaluated 276 sentences.

====Train Brown====

By the Brown parser we mean the combination of the Charniak n-best parser and Johnson's reranker. The main point here is training the reranker, but the resulting .tgz file contains the Charniak statistics as well.

 $PARSINGROOT/brown-reranking-parser/scripts/train.pl < train.penn > train.ecmj.tgz

====Test Brown====

 $PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g train.ecmj.tgz < dtest.txt | \
   $PARSINGROOT/tools/pennquotes2ascii.pl > dtest.ecmj.penn
 $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.ecmj.penn

P = 75.93 %, R = 75.70 %, F = 75.81 %. No erroneous sentences, evaluated 276 sentences.

=====Delexicalization=====

By ''delexicalization'' we mean replacing words by their morphological tags. We need it for the [[parser adaptation]] experiments. After delexicalization, the Danish tags will be the terminals, while the preterminals will still contain the simpler Penn-style tags.

 $PARSINGROOT/tools/normalize_and_delexicalize_danish_csts_trees.pl \
   < otrain.csts \
   > otrain.delex.csts
 cstsfs.pl < otrain.delex.csts > otrain.delex.fs
 $PARSINGROOT/tools/csts2penn.pl otrain.delex.csts > otrain.delex.penn
 head -4900 otrain.delex.penn > train.delex.penn
 tail -290 otrain.delex.penn > dtest.delex.penn
 $PARSINGROOT/tools/penn2text.pl < dtest.delex.penn > dtest.delex.txt

====Parsing delexicalized treebank====

Train Charniak on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.
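Restuffing is the inverse of delexicalization: the parser's output trees have morphological tags as their leaves, and the original words (supplied to ''restuff.pl'' via <tt>-s dtest.txt</tt>) are put back in so that evalb can compare the leaves against the gold trees. A hypothetical sketch of the idea, assuming a simple left-to-right leaf-by-leaf replacement rather than the real script's behavior:

```python
import re

def restuff(tree, words):
    """Replace the i-th leaf of a Penn bracketing with the i-th word.

    In Penn format a leaf is the token that immediately precedes a
    closing bracket, as in (NN word); category labels never do.
    """
    it = iter(words)
    return re.sub(r'([^\s()]+)(\s*\))',
                  lambda m: next(it) + m.group(2), tree)
```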
 $PARSINGROOT/charniak-parser/scripts/train.pl < train.delex.penn > train.delex.ec.tgz
 $PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -n 1 -g train.delex.ec.tgz < dtest.delex.txt \
   -o dtest.delex.ec.penn
 $PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.delex.ec.penn > dtest.delex.ec.restuffed.penn
 $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.restuffed.penn \
   | tee result.dtest.delex.ec.txt

Evaluation (evalb, sentences of 40 or fewer tokens): P = 72.55 %, R = 73.03 %, F = 72.79 %, T = 51.21 %. Evaluated 276 sentences.

Train Brown on the delexicalized data, parse the delexicalized test data, and evaluate the restuffed trees.

 $PARSINGROOT/brown-reranking-parser/scripts/train.pl -nick da-delex -reuse \
   < train.delex.penn \
   > train.delex.ecmj.tgz
 $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -n 1 -g train.delex.ecmj.tgz < dtest.delex.txt \
   -o dtest.delex.ecmj.penn
 $PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.delex.ecmj.penn \
   > dtest.delex.ecmj.restuffed.penn
 $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ecmj.restuffed.penn \
   | tee result.delex.ecmj.txt

Evaluation (evalb, sentences of 40 or fewer tokens): P = 77.04 %, R = 76.96 %, F = 77.00 %, T = 51.21 %. Evaluated 276 sentences.
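For reference, the P/R/F figures reported throughout this page are labeled bracket scores. A minimal sketch of the computation, ignoring evalb's parameter file, length cutoff and error handling (spans are assumed to be given as non-empty lists of (label, start, end) triples):

```python
from collections import Counter

def prf(gold, test):
    """Labeled bracket precision, recall and F1.

    gold, test: non-empty lists of (label, start, end) spans.
    Matching is done on multisets, so duplicate spans count once each.
    """
    g, t = Counter(gold), Counter(test)
    match = sum((g & t).values())  # multiset intersection
    p = match / len(test)
    r = match / len(gold)
    f = 2 * p * r / (p + r)
    return p, r, f
```

As a sanity check against the numbers above: the Charniak ''dtest'' run's P = 74.44 and R = 75.54 indeed combine to F = 2·74.44·75.54 / (74.44 + 75.54) ≈ 74.99.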