
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


user:zeman:treebanks:da [2011/11/20 19:43] (current)
zeman created
Line 1: Line 1:
 +===== Danish (da) =====
 +
 +[[http://​www.buch-kromann.dk/​matthias/​treebank/​|Danish Dependency Treebank]] (DDT)
 +
 +==== Versions ====
 +
 +  * Original DDT 1.0 in the [[http://​www.tei-c.org/​index.xml|TEI-based]] [[http://​www.buch-kromann.dk/​matthias/​dtag/​|DTAG]] or [[http://​www.ims.uni-stuttgart.de/​projekte/​TIGER/​TIGERSearch/​doc/​html/​TigerXML.html|Tiger-XML]] format
 +  * CoNLL 2006
 +
 +The original DDT is based on [[http://​www.buch-kromann.dk/​matthias/​files/​diss-Dec05.pdf|Discontinuous Grammar]]. It natively encodes dependencies and other relations such as anaphora. The CoNLL version contains only the dependency relations.
 +
 +==== Obtaining and License ====
 +
 +DDT is available under the [[http://​www.gnu.org/​licenses/​gpl-2.0.html|GNU General Public License version 2]]. Download the original distribution (DTAG + TIGER-XML formats) from http://​www.buch-kromann.dk/​matthias/​treebank/​. Download the CoNLL 2006 conversion from http://​ilk.uvt.nl/​conll/​free_data.html. The license in short:
 +
 +  * any usage, commercial or not
 +  * modification and redistribution under same license permitted
 +  * citation in publications not required (but it is common decency)
 +
 +DDT was created by members of the [[http://​www.cbs.dk/​en/​Research/​Departments-Centres/​Institutter/​ISV|Department of International Language Studies and Computational Linguistics]],​ Copenhagen Business School (Handelshøjskolen i København),​ Dalgas Have 15, DK-2000 Frederiksberg,​ Denmark. The underlying [[http://​korpus.dsl.dk/​e-resurser/​vilkaar.php?​lang=|PAROLE]] corpus (morphologically annotated) was created by the [[http://​www.dsl.dk/​|Society for Danish Language and Literature]] (Det Danske Sprog- og Litteraturselskab),​ Christians Brygge 1, DK-1219 København K, Denmark.
 +
 +==== References ====
 +
 +  * Website
 +    * http://www.buch-kromann.dk/matthias/treebank/ (the old website at <nowiki>http://www.id.cbs.dk/~mtk/</nowiki> is no longer accessible; its content has moved here)
 +  * Data
 +    * //no separate citation//
 +  * Principal publications
 +    * Matthias Trautner Kromann: [[http://​www.buch-kromann.dk/​matthias/​files/​030730-tlt-norfa.pdf|The Danish Dependency Treebank and the DTAG Treebank Tool]]. In: Proceedings of Treebanks and Linguistic Theories, Växjö, Sweden, 2003.
 +  * Documentation
 +    * //see the links in the left-hand menu of the treebank website, e.g.://
 +    * [[http://​www.buch-kromann.dk/​matthias/​treebank/​theory.html|Dependency theory and list of dependency relation labels]]
 +    * Britt Keson: [[http://​www.buch-kromann.dk/​matthias/​treebank/​PAROLE-manual.pdf|Vejledning til det danske morfosyntaktisk taggede PAROLE-korpus]] (morphosyntactic tags). Det Danske Sprog- og Litteraturselskab (DSL)
 +
 +==== Domain ====
 +
 +Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from a wide range of randomly selected linguistically representative Danish texts from 1983-1992.”)
 +
 +==== Size ====
 +
 +The CoNLL 2006 version contains 100,238 tokens in 5,512 sentences, i.e. 18.19 tokens per sentence on average (CoNLL 2006 data split: 94,386 tokens / 5,190 sentences for training, 5,852 tokens / 322 sentences for testing).
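The size figures can be re-derived with a few lines of Python. The sketch below assumes the usual CoNLL-X layout (one token per non-empty line, sentences separated by blank lines); the function name `count_conll` is hypothetical and not part of any distributed tool.

```python
from typing import Iterable, Tuple


def count_conll(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (tokens, sentences) for lines in CoNLL-X layout.

    Assumption: every non-empty line is one token and sentences are
    separated by one or more blank lines.
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:  # file did not end with a blank line
        sentences += 1
    return tokens, sentences


# Figures quoted in this section (training and test parts of the split).
train_tokens, train_sentences = 94386, 5190
test_tokens, test_sentences = 5852, 322

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
average_length = round(total_tokens / total_sentences, 2)

# Tiny synthetic input: two sentences with two and one token.
toy_counts = count_conll(["1\tA", "2\tB", "", "1\tC", ""])
```

To check a downloaded file, pass an open file object to `count_conll` and compare the result with the totals above.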
 +
 +==== Inside ====
 +
 +The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://​www.buch-kromann.dk/​matthias/​treebank/​PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://​quest.ms.mff.cuni.cz/​cgi-bin/​interset/​index.pl?​tagset=da::​conll|DZ Interset]] to inspect the CoNLL tagset.
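A minimal sketch of reading one token line of the CoNLL version follows. The ten column names are those of the CoNLL-X format. The function name `parse_token` is hypothetical. Treating "/" inside a feature value as a list of alternatives (as in gender=common/neuter) is an assumption inferred from the samples on this page, not a documented rule.

```python
from typing import Any, Dict, List

# Column names of the CoNLL-X format, in file order.
CONLL_X_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def parse_token(line: str) -> Dict[str, Any]:
    """Split one tab-separated CoNLL-X line into named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLL_X_FIELDS):
        raise ValueError("expected 10 tab-separated columns, got %d" % len(values))
    token: Dict[str, Any] = dict(zip(CONLL_X_FIELDS, values))
    token["id"] = int(values[0])
    token["head"] = int(values[6])

    # FEATS is "_" when empty, otherwise name=value pairs joined by "|".
    feats: Dict[str, List[str]] = {}
    raw_feats = values[5]
    if raw_feats != "_":
        for pair in raw_feats.split("|"):
            name, _, value = pair.partition("=")
            # Assumption: "/" separates alternative (underspecified) values.
            feats[name] = value.split("/")
    token["feats"] = feats
    return token


# Two tokens from the first training sentence shown in the Sample section.
samme = parse_token(
    "1\tSamme\t_\tA\tAN\tdegree=pos|gender=common/neuter|number=sing/plur|"
    "case=unmarked|def=def/indef|transcat=unmarked\t0\tROOT\t_\t_"
)
cifre = parse_token(
    "2\tcifre\t_\tN\tNC\tgender=neuter|number=plur|case=unmarked|def=indef\t1\tnobj\t_\t_"
)
```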
 +
 +The morphological annotation in the CoNLL 2006 version does not include lemmas, although the original DTAG version does. The morphosyntactic tags were probably assigned manually.
 +
 +Some multi-word expressions have been collapsed into a single token, with an underscore as the joining character. This applies to adverbially used prepositional phrases (e.g. i_lørdags = last Saturday) but not to named entities.
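Where the individual words are needed, the joined tokens can be split again. This is a small sketch with a hypothetical helper name. It assumes that an underscore inside a word form always marks a joined expression, which the data does not guarantee.

```python
from typing import List


def split_joined_token(form: str) -> List[str]:
    """Split a collapsed multi-word token such as "i_lørdags" into its words.

    A form that is only "_" (the CoNLL placeholder for an empty field)
    or that has no inner underscore is returned unchanged.
    """
    if "_" not in form.strip("_"):
        return [form]
    return [part for part in form.split("_") if part]


parts = split_joined_token("i_lørdags")
```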
 +
 +==== Sample ====
 +
 +The first sentence of DDT 1.0 in the DTAG format:
 +
 +<code xml><​tei.2>​
 +  <​teiHeader type=text>​
 +    <​fileDesc>​
 +      <​titleStmt>​
 +        <​title>​Tagged sample of: '​Jeltsins skæbnetime'</​title>​
 +      </​titleStmt>​
 +      <extent words=158>​158 running words</​extent>​
 +      <​publicationStmt>​
 +         <​distributor>​PAROLE-DK</​distributor>​
 +         <​address><​addrline>​Christians Brygge 1,1., DK-1219 Copenhagen K.</​address>​
 +         <​date>​1998-06-02</​date>​
 +         <​availability status=restricted><​p>​by agreement with distributor</​availability>​
 +      </​publicationStmt>​
 +      <​sourceDesc>​
 +        <​biblStruct>​
 +          <​analytic>​
 +            <​title>​Jeltsins skæbnetime</​title>​
 +            <author gender=m born=1925>​Nikulin,​ Leon</​author>​
 +          </​analytic>​
 +          <​monogr>​
 +            <​imprint><​pubPlace>​Denmark</​pubPlace>​
 +              <​publisher>​Det Fri Aktuelt</​publisher>​
 +              <​date>​1992-12-01</​date>​
 +            </​imprint>​
 +          </​monogr>​
 +        </​biblStruct>​
 +      </​sourceDesc>​
 +    </​fileDesc>​
 +    <​profileDesc>​
 +      <​creation>​1992-12-01</​creation>​
 +      <​langUsage><​language>​Danish</​langUsage>​
 +      <​textClass>​
 +        <catRef target="​P.M2">​
 +        <catRef target="​P.G4.8">​
 +        <catRef target="​P.T9.3">​
 +      </​textClass>​
 +    </​profileDesc>​
 +  </​teiHeader>​
 +<text id=AJK>
 +<​body>​
 +<div1 type=main>​
 +<p>
 +<s>
 +<W lemma="​to"​ msd="​AC---U=--"​ in="​9:​subj"​ out="​1:​mod|2:​mod|3:​nobj|5:​appr">​To</​W>​
 +<W lemma="​kendt"​ msd="​ANP[CN]PU=[DI]U"​ in="​-1:​mod"​ out="">​kendte</​W>​
 +<W lemma="​russisk"​ msd="​ANP[CN]PU=[DI]U"​ in="​-2:​mod"​ out="">​russiske</​W>​
 +<W lemma="​historiker"​ msd="​NCCPU==I"​ in="​-3:​nobj"​ out="">​historikere</​W>​
 +<W lemma="​Andronik"​ msd="​NP--U==-"​ in="​1:​namef"​ out="">​Andronik</​W>​
 +<W lemma="​Mirganjan"​ msd="​NP--U==-"​ in="​-5:​appr"​ out="​-1:​namef|1:​coord">​Mirganjan</​W>​
 +<W lemma="​og"​ msd="​CC"​ in="​-1:​coord"​ out="​2:​conj">​og</​W>​
 +<W lemma="​Igor"​ msd="​NP--U==-"​ in="​1:​namef"​ out="">​Igor</​W>​
 +<W lemma="​Klamkin"​ msd="​NP--U==-"​ in="​-2:​conj"​ out="​-1:​namef">​Klamkin</​W>​
 +<W lemma="​tro"​ msd="​VADR=----A-"​ in=""​ out="​-9:​subj|1:​mod|2:​pnct|3:​dobj|12:​pnct">​tror</​W>​
 +<W lemma="​ikke"​ msd="​RGU"​ in="​-1:​mod"​ out="">​ikke</​W>​
 +<W lemma=","​ msd="​XP"​ in="​-2:​pnct"​ out="">,</​W>​
 +<W lemma="​at"​ msd="​CS"​ in="​-3:​dobj"​ out="​2:​vobj">​at</​W>​
 +<W lemma="​Rusland"​ msd="​NP--U==-"​ in="​1:​subj|2:​[subj]"​ out="">​Rusland</​W>​
 +<W lemma="​kunne"​ msd="​VADR=----A-"​ in="​-2:​vobj"​ out="​-1:​subj|1:​vobj|2:​mod">​kan</​W>​
 +<W lemma="​udvikle"​ msd="​VAF-=----P-"​ in="​-1:​vobj"​ out="​-2:​[subj]">​udvikles</​W>​
 +<W lemma="​uden"​ msd="​SP"​ in="​-2:​mod"​ out="​1:​nobj">​uden</​W>​
 +<W lemma="​en"​ msd="​PI-CSU--U"​ in="​-1:​nobj"​ out="​2:​nobj">​en</​W>​
 +<W lemma="&​quot;"​ msd="​XP"​ in="​1:​pnct"​ out="">"</​W>​
 +<W lemma="​jernnæve"​ msd="​NCCSU==I"​ in="​-2:​nobj"​ out="​-1:​pnct|1:​pnct">​jernnæve</​W>​
 +<W lemma="&​quot;"​ msd="​XP"​ in="​-1:​pnct"​ out="">"</​W>​
 +<W lemma="​."​ msd="​XP"​ in="​-12:​pnct"​ out="">​.</​W>​
 +</​s></​code>​
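Comparing this sample with the CoNLL version of the same sentence suggests that each entry in the `in` attribute is a relative offset to the governing word plus a relation label, and that bracketed labels such as `[subj]` mark secondary edges that the CoNLL version drops. Both points are inferences from the samples, not documented facts. Under those assumptions, absolute head indices can be recovered as sketched below. A regular expression is used instead of an XML parser because the sample is not well-formed XML (unquoted attributes, unclosed elements). All function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple

W_TAG = re.compile(r'<W\s+([^>]*)>(.*?)</W>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def primary_edge(in_attr: str) -> Optional[Tuple[int, str]]:
    """Return (offset, label) of the first non-bracketed incoming edge."""
    for edge in in_attr.split("|"):
        if not edge:
            continue
        offset, _, label = edge.partition(":")
        if label.startswith("["):  # assumed to be a secondary edge
            continue
        return int(offset), label
    return None


def heads_from_dtag(lines: List[str]) -> List[Tuple[int, str, int, str]]:
    """Return (id, form, head, label) with 1-based ids; head 0 means root."""
    rows = []
    for line in lines:
        match = W_TAG.search(line)
        if match:
            attrs = dict(ATTR.findall(match.group(1)))
            rows.append((match.group(2), attrs.get("in", "")))
    result = []
    for index, (form, in_attr) in enumerate(rows, start=1):
        edge = primary_edge(in_attr)
        if edge is None:
            result.append((index, form, 0, "ROOT"))
        else:
            offset, label = edge
            result.append((index, form, index + offset, label))
    return result


# First four words of the sample sentence.
SAMPLE = [
    '<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W>',
    '<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W>',
    '<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W>',
    '<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W>',
]
converted = heads_from_dtag(SAMPLE)
```

For these four words the recovered heads (10, 1, 1, 1) agree with the HEAD column of the CoNLL 2006 test sentence shown later in this section.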
 +
 +The first sentence of the CoNLL 2006 training data:
 +
 +| 1 | Samme | _ | A | AN | degree=pos<​nowiki>​|</​nowiki>​gender=common/​neuter<​nowiki>​|</​nowiki>​number=sing/​plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=def/​indef<​nowiki>​|</​nowiki>​transcat=unmarked | 0 | ROOT | _ | _ |
 +| 2 | cifre | _ | N | NC | gender=neuter<​nowiki>​|</​nowiki>​number=plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=indef | 1 | nobj | _ | _ |
 +| 3 | , | _ | X | XP | _ | 1 | pnct | _ | _ |
 +| 4 | de | _ | P | PD | gender=common/​neuter<​nowiki>​|</​nowiki>​number=plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​register=unmarked | 7 | subj | _ | _ |
 +| 5 | norske | _ | A | AN | degree=pos<​nowiki>​|</​nowiki>​gender=common/​neuter<​nowiki>​|</​nowiki>​number=plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=def/​indef<​nowiki>​|</​nowiki>​transcat=unmarked | 4 | mod | _ | _ |
 +| 6 | piger | _ | N | NC | gender=common<​nowiki>​|</​nowiki>​number=plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=indef | 4 | nobj | _ | _ |
 +| 7 | tabte | _ | V | VA | mood=indic<​nowiki>​|</​nowiki>​tense=past<​nowiki>​|</​nowiki>​voice=active | 1 | rel | _ | _ |
 +| 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ |
 +| 9 | i_lørdags | _ | RG | RG | degree=unmarked | 7 | mod | _ | _ |
 +| 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ |
 +| 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ |
 +| 12 | værtsnation | _ | N | NC | gender=common<​nowiki>​|</​nowiki>​number=sing<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=indef | 11 | possd | _ | _ |
 +| 13 | . | _ | X | XP | _ | 1 | pnct | _ | _ |
 +
 +The first sentence of the CoNLL 2006 test data:
 +
 +| 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ |
 +| 2 | kendte | _ | A | AN | degree=pos<​nowiki>​|</​nowiki>​gender=common/​neuter<​nowiki>​|</​nowiki>​number=plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=def/​indef<​nowiki>​|</​nowiki>​transcat=unmarked | 1 | mod | _ | _ |
 +| 3 | russiske | _ | A | AN | degree=pos<​nowiki>​|</​nowiki>​gender=common/​neuter<​nowiki>​|</​nowiki>​number=plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=def/​indef<​nowiki>​|</​nowiki>​transcat=unmarked | 1 | mod | _ | _ |
 +| 4 | historikere | _ | N | NC | gender=common<​nowiki>​|</​nowiki>​number=plur<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=indef | 1 | nobj | _ | _ |
 +| 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ |
 +| 6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ |
 +| 7 | og | _ | C | CC | _ | 6 | coord | _ | _ |
 +| 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ |
 +| 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ |
 +| 10 | tror | _ | V | VA | mood=indic<​nowiki>​|</​nowiki>​tense=present<​nowiki>​|</​nowiki>​voice=active | 0 | ROOT | _ | _ |
 +| 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ |
 +| 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ |
 +| 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ |
 +| 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ |
 +| 15 | kan | _ | V | VA | mood=indic<​nowiki>​|</​nowiki>​tense=present<​nowiki>​|</​nowiki>​voice=active | 13 | vobj | _ | _ |
 +| 16 | udvikles | _ | V | VA | mood=infin<​nowiki>​|</​nowiki>​voice=passive | 15 | vobj | _ | _ |
 +| 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ |
 +| 18 | en | _ | P | PI | gender=common<​nowiki>​|</​nowiki>​number=sing<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​register=unmarked | 17 | nobj | _ | _ |
 +| 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
 +| 20 | jernnæve | _ | N | NC | gender=common<​nowiki>​|</​nowiki>​number=sing<​nowiki>​|</​nowiki>​case=unmarked<​nowiki>​|</​nowiki>​def=indef | 18 | nobj | _ | _ |
 +| 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
 +| 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ |
 +
 +==== Parsing ====
 +
 +Nonprojective attachments are rare in DDT. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).
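The page does not state which exact criterion was used for this count. The sketch below uses one common definition as an assumption: a token is attached nonprojectively if some word lying between it and its head is not dominated by that head. Function names are hypothetical.

```python
from typing import List


def is_dominated(heads: List[int], node: int, ancestor: int) -> bool:
    """True if `ancestor` is reached by following heads upward from `node`.

    heads[i - 1] is the head of token i; 0 is the artificial root.
    """
    steps = 0
    while node != 0 and steps <= len(heads):  # step limit guards against cycles
        if node == ancestor:
            return True
        node = heads[node - 1]
        steps += 1
    return node == ancestor


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based ids of tokens whose incoming arc is nonprojective."""
    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        for between in range(low + 1, high):
            if not is_dominated(heads, between, head):
                result.append(dependent)
                break
    return result


# HEAD column of the first CoNLL 2006 test sentence from the Sample section.
test_sentence_heads = [10, 1, 1, 1, 6, 1, 6, 9, 7, 0, 10, 10,
                       10, 15, 13, 15, 15, 17, 20, 18, 20, 10]
# A constructed four-token tree in which token 2 is attached nonprojectively.
toy_heads = [0, 4, 1, 1]

# Share of nonprojective tokens, from the figures quoted above.
nonprojective_share = round(100 * 988 / 100238, 2)
```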
 +
 +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]] and were published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard in that punctuation tokens were excluded from scoring. These are the best results for Danish:
 +
 +^ Parser (Authors) ^ LAS ^ UAS ^
 +| MST (McDonald et al.) | 84.79 | 90.58 |
 +| Malt (Nivre et al.) | 84.77 | 89.80 |
 +| Riedel et al. | 83.63 | 89.66 |
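For reference, labeled and unlabeled attachment scores (LAS, UAS) with punctuation excluded can be computed roughly as follows. This is a sketch, not the official evaluation script. Treating a token as punctuation when all of its characters fall into a Unicode punctuation category is an assumption made here, and the function names are hypothetical.

```python
import unicodedata
from typing import List, Tuple

Token = Tuple[str, int, str]  # (form, head, deprel)


def is_punctuation(form: str) -> bool:
    """Assumed criterion: every character is in a Unicode "P" category."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over the non-punctuation tokens."""
    scored = uas_hits = las_hits = 0
    for (form, gold_head, gold_rel), (_, pred_head, pred_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if gold_head == pred_head:
            uas_hits += 1
            if gold_rel == pred_rel:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


# Constructed example: the comma is skipped, leaving three scored tokens.
gold = [("tror", 0, "ROOT"), ("ikke", 1, "mod"), (",", 1, "pnct"), ("at", 1, "dobj")]
predicted = [("tror", 0, "ROOT"), ("ikke", 1, "adv"), (",", 4, "pnct"), ("at", 2, "dobj")]
las, uas = attachment_scores(gold, predicted)
```

In the constructed example, two of the three scored tokens have the correct head and one of them also has the correct label.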
  
