===== Danish (da) =====

[[http://www.buch-kromann.dk/matthias/treebank/|Danish Dependency Treebank]] (DDT)

==== Versions ====

  * Original DDT 1.0 in the [[http://www.tei-c.org/index.xml|TEI-based]] [[http://www.buch-kromann.dk/matthias/dtag/|DTAG]] or [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|Tiger-XML]] format
  * CoNLL 2006

The original DDT is based on [[http://www.buch-kromann.dk/matthias/files/diss-Dec05.pdf|Discontinuous Grammar]]. It natively encodes dependencies and other relations such as anaphora. The CoNLL version contains only the dependency relations.

==== Obtaining and License ====

DDT is available under the [[http://www.gnu.org/licenses/gpl-2.0.html|GNU General Public License version 2]]. Download the original distribution (DTAG + TIGER-XML formats) from http://www.buch-kromann.dk/matthias/treebank/. Download the CoNLL 2006 conversion from http://ilk.uvt.nl/conll/free_data.html.

The license in short:

  * any usage, commercial or not
  * modification and redistribution under same license permitted
  * citation in publications not required (but it is common decency)

DDT was created by members of the [[http://www.cbs.dk/en/Research/Departments-Centres/Institutter/ISV|Department of International Language Studies and Computational Linguistics]], Copenhagen Business School (Handelshøjskolen i København), Dalgas Have 15, DK-2000 Frederiksberg, Denmark.

The underlying [[http://korpus.dsl.dk/e-resurser/vilkaar.php?lang=|PAROLE]] corpus (morphologically annotated) was created by the [[http://www.dsl.dk/|Society for Danish Language and Literature]] (Det Danske Sprog- og Litteraturselskab), Christians Brygge 1, DK-1219 København K, Denmark.
==== References ====

  * Website
    * http://www.buch-kromann.dk/matthias/treebank/ (the old and no longer accessible website from http://www.id.cbs.dk/~mtk/ has been moved here)
  * Data
    * //no separate citation//
  * Principal publications
    * Matthias Trautner Kromann: [[http://www.buch-kromann.dk/matthias/files/030730-tlt-norfa.pdf|The Danish Dependency Treebank and the DTAG Treebank Tool]]. In: Proceedings of Treebanks and Linguistic Theories, Växjö, Sweden, 2003.
  * Documentation
    * //see the left-hand-side links at the treebank website, e.g.://
    * [[http://www.buch-kromann.dk/matthias/treebank/theory.html|Dependency theory and list of dependency relation labels]]
    * Britt Keson: [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|Vejledning til det danske morfosyntaktisk taggede PAROLE-korpus]] (morphosyntactic tags). Det Danske Sprog- og Litteraturselskab (DSL)

==== Domain ====

Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from a wide range of randomly selected linguistically representative Danish texts from 1983-1992.”)

==== Size ====

The CoNLL 2006 version contains 100,238 tokens in 5,512 sentences, yielding 18.19 tokens per sentence on average (CoNLL 2006 data split: 94,386 tokens / 5,190 sentences training, 5,852 tokens / 322 sentences test).

==== Inside ====

The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset.

The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually.
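The column layout described above can be illustrated with a minimal Python sketch. It assumes the usual CoNLL 2006 file layout (ten tab-separated columns per token, a blank line between sentences); the names ''Token'', ''parse_feats'', ''read_conll'' and ''SAMPLE'' are made up for this illustration and are not part of any DDT or CoNLL tool. The three sample rows are the first tokens of the first training sentence.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """One token line; LEMMA, PHEAD and PDEPREL are skipped (they are '_' here)."""
    id: int
    form: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(column: str) -> Dict[str, str]:
    """Split a FEAT value such as 'gender=neuter|number=plur' into a dict."""
    if column == "_":
        return {}
    return dict(pair.split("=", 1) for pair in column.split("|"))


def read_conll(text: str) -> List[List[Token]]:
    """Read sentences from CoNLL 2006 text (assumed: tabs, blank-line separated)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        current.append(Token(int(cols[0]), cols[1], cols[3], cols[4],
                             parse_feats(cols[5]), int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


# First three tokens of the first CoNLL 2006 training sentence.
SAMPLE = "\n".join([
    "1\tSamme\t_\tA\tAN\tdegree=pos|gender=common/neuter|number=sing/plur"
    "|case=unmarked|def=def/indef|transcat=unmarked\t0\tROOT\t_\t_",
    "2\tcifre\t_\tN\tNC\tgender=neuter|number=plur|case=unmarked|def=indef"
    "\t1\tnobj\t_\t_",
    "3\t,\t_\tX\tXP\t_\t1\tpnct\t_\t_",
    "",
])

if __name__ == "__main__":
    sentences = read_conll(SAMPLE)
    n_tokens = sum(len(s) for s in sentences)
    print(n_tokens, "tokens in", len(sentences), "sentence(s)")
    print(sentences[0][1].form, sentences[0][1].feats)
    # Average sentence length from the figures in the Size section.
    print(round(100238 / 5512, 2))
```

Values such as ''common/neuter'' are kept as plain strings here; whether to split them on the slash depends on how underspecified values should be handled downstream.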
Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = last Saturday) but not named entities.

==== Sample ====

The first sentence of DDT 1.0 in the DTAG format:

Tagged sample of: 'Jeltsins skæbnetime' 158 running words PAROLE-DK
Christians Brygge 1,1., DK-1219 Copenhagen K.
1998-06-02

by agreement with distributor Jeltsins skæbnetime Nikulin, Leon Denmark Det Fri Aktuelt 1992-12-01 1992-12-01 Danish

To kendte russiske historikere Andronik Mirganjan og Igor Klamkin tror ikke , at Rusland kan udvikles uden en " jernnæve " .

The first sentence of the CoNLL 2006 training data:

| 1 | Samme | _ | A | AN | degree=pos|gender=common/neuter|number=sing/plur|case=unmarked|def=def/indef|transcat=unmarked | 0 | ROOT | _ | _ |
| 2 | cifre | _ | N | NC | gender=neuter|number=plur|case=unmarked|def=indef | 1 | nobj | _ | _ |
| 3 | , | _ | X | XP | _ | 1 | pnct | _ | _ |
| 4 | de | _ | P | PD | gender=common/neuter|number=plur|case=unmarked|register=unmarked | 7 | subj | _ | _ |
| 5 | norske | _ | A | AN | degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked | 4 | mod | _ | _ |
| 6 | piger | _ | N | NC | gender=common|number=plur|case=unmarked|def=indef | 4 | nobj | _ | _ |
| 7 | tabte | _ | V | VA | mood=indic|tense=past|voice=active | 1 | rel | _ | _ |
| 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ |
| 9 | i_lørdags | _ | RG | RG | degree=unmarked | 7 | mod | _ | _ |
| 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ |
| 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ |
| 12 | værtsnation | _ | N | NC | gender=common|number=sing|case=unmarked|def=indef | 11 | possd | _ | _ |
| 13 | . | _ | X | XP | _ | 1 | pnct | _ | _ |

The first sentence of the CoNLL 2006 test data:

| 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ |
| 2 | kendte | _ | A | AN | degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked | 1 | mod | _ | _ |
| 3 | russiske | _ | A | AN | degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked | 1 | mod | _ | _ |
| 4 | historikere | _ | N | NC | gender=common|number=plur|case=unmarked|def=indef | 1 | nobj | _ | _ |
| 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ |
| 6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ |
| 7 | og | _ | C | CC | _ | 6 | coord | _ | _ |
| 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ |
| 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ |
| 10 | tror | _ | V | VA | mood=indic|tense=present|voice=active | 0 | ROOT | _ | _ |
| 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ |
| 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ |
| 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ |
| 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ |
| 15 | kan | _ | V | VA | mood=indic|tense=present|voice=active | 13 | vobj | _ | _ |
| 16 | udvikles | _ | V | VA | mood=infin|voice=passive | 15 | vobj | _ | _ |
| 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ |
| 18 | en | _ | P | PI | gender=common|number=sing|case=unmarked|register=unmarked | 17 | nobj | _ | _ |
| 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
| 20 | jernnæve | _ | N | NC | gender=common|number=sing|case=unmarked|def=indef | 18 | nobj | _ | _ |
| 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
| 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ |

==== Parsing ====

Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).

The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]].
They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:

^ Parser (Authors) ^ LAS ^ UAS ^
| MST (McDonald et al.) | 84.79 | 90.58 |
| Malt (Nivre et al.) | 84.77 | 89.80 |
| Riedel et al. | 83.63 | 89.66 |
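The two quantities discussed in this section, the share of nonprojectively attached tokens and attachment scores that skip punctuation, can be sketched in Python as follows. This is an illustration under stated assumptions, not the shared task's scorer: a token is treated as punctuation when every character of its form is in a Unicode punctuation category, an arc counts as nonprojective when some token between head and dependent is not dominated by the head, and ''PRED'' is an invented parser output (not a real system result). ''GOLD'' is the first training sentence of the CoNLL 2006 data.

```python
import unicodedata
from typing import List, Tuple

Arc = Tuple[str, int, str]  # (form, head, deprel); head 0 is the artificial root


def is_punctuation(form: str) -> bool:
    """Assumption: punctuation = all characters in a Unicode 'P*' category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, skipping punctuation tokens."""
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punctuation(form):
            continue
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


def nonprojective_tokens(heads: List[int]) -> int:
    """Count tokens whose arc to their head is nonprojective.

    heads[i] is the head of token i + 1. The input is assumed to be a
    well-formed tree (no cycles), otherwise the loop below does not end.
    """
    def is_descendant(node: int, ancestor: int) -> bool:
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    count = 0
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            count += 1
    return count


GOLD: List[Arc] = [
    ("Samme", 0, "ROOT"), ("cifre", 1, "nobj"), (",", 1, "pnct"),
    ("de", 7, "subj"), ("norske", 4, "mod"), ("piger", 4, "nobj"),
    ("tabte", 1, "rel"), ("med", 7, "pobj"), ("i_lørdags", 7, "mod"),
    ("mod", 7, "pobj"), ("VMs", 10, "nobj"), ("værtsnation", 11, "possd"),
    (".", 1, "pnct"),
]

# Hypothetical prediction: one wrong label (token 8), one wrong head (token 9).
PRED: List[Arc] = list(GOLD)
PRED[7] = ("med", 7, "mod")
PRED[8] = ("i_lørdags", 8, "mod")

if __name__ == "__main__":
    las, uas = attachment_scores(GOLD, PRED)
    print(f"LAS {las:.2f}  UAS {uas:.2f}")
    print(nonprojective_tokens([head for _, head, _ in GOLD]))
    # Share of nonprojective tokens from the figures quoted in this section.
    print(round(100 * 988 / 100238, 2))
```

With this definition, arcs from the artificial root are always projective; other definitions of nonprojectivity exist and may give slightly different counts than the 988 quoted above.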