user:zeman:treebanks:da [2011/11/20 19:43] (current) zeman created |
| ===== Danish (da) ===== |
| |
| [[http://www.buch-kromann.dk/matthias/treebank/|Danish Dependency Treebank]] (DDT) |
| |
| ==== Versions ==== |
| |
| * Original DDT 1.0 in the [[http://www.tei-c.org/index.xml|TEI-based]] [[http://www.buch-kromann.dk/matthias/dtag/|DTAG]] or [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|Tiger-XML]] format |
| * CoNLL 2006 |
| |
| The original DDT is based on [[http://www.buch-kromann.dk/matthias/files/diss-Dec05.pdf|Discontinuous Grammar]]. It natively encodes dependencies and other relations such as anaphora. The CoNLL version contains only the dependency relations. |
| |
| ==== Obtaining and License ==== |
| |
| DDT is available under the [[http://www.gnu.org/licenses/gpl-2.0.html|GNU General Public License version 2]]. Download the original distribution (DTAG + TIGER-XML formats) from http://www.buch-kromann.dk/matthias/treebank/. Download the CoNLL 2006 conversion from http://ilk.uvt.nl/conll/free_data.html. The license in short: |
| |
| * any use, commercial or non-commercial, is permitted |
| * modification and redistribution are permitted under the same license |
| * citation in publications is not required, though it is good practice |
| |
| DDT was created by members of the [[http://www.cbs.dk/en/Research/Departments-Centres/Institutter/ISV|Department of International Language Studies and Computational Linguistics]], Copenhagen Business School (Handelshøjskolen i København), Dalgas Have 15, DK-2000 Frederiksberg, Denmark. The underlying [[http://korpus.dsl.dk/e-resurser/vilkaar.php?lang=|PAROLE]] corpus (morphologically annotated) was created by the [[http://www.dsl.dk/|Society for Danish Language and Literature]] (Det Danske Sprog- og Litteraturselskab), Christians Brygge 1, DK-1219 København K, Denmark. |
| |
| ==== References ==== |
| |
| * Website |
|   * http://www.buch-kromann.dk/matthias/treebank/ (the old website at <nowiki>http://www.id.cbs.dk/~mtk/</nowiki> is no longer accessible; its content has moved here) |
| * Data |
|   * //no separate citation// |
| * Principal publications |
|   * Matthias Trautner Kromann: [[http://www.buch-kromann.dk/matthias/files/030730-tlt-norfa.pdf|The Danish Dependency Treebank and the DTAG Treebank Tool]]. In: Proceedings of Treebanks and Linguistic Theories, Växjö, Sweden, 2003. |
| * Documentation |
|   * //see the left-hand-side links at the treebank website, e.g.:// |
|   * [[http://www.buch-kromann.dk/matthias/treebank/theory.html|Dependency theory and list of dependency relation labels]] |
|   * Britt Keson: [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|Vejledning til det danske morfosyntaktisk taggede PAROLE-korpus]] (morphosyntactic tags). Det Danske Sprog- og Litteraturselskab (DSL) |
| |
| ==== Domain ==== |
| |
| Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from a wide range of randomly selected linguistically representative Danish texts from 1983-1992.”) |
| |
| ==== Size ==== |
| |
| The CoNLL 2006 version contains 100,238 tokens in 5,512 sentences, an average of 18.19 tokens per sentence. The CoNLL 2006 data split is 94,386 tokens / 5,190 sentences for training and 5,852 tokens / 322 sentences for testing. |
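
The figures above can be checked with a few lines of Python. This is only a sketch: it assumes the usual CoNLL-X layout (one token per non-empty line, a blank line between sentences), and the function name `conll_size` is hypothetical.

```python
def conll_size(text):
    """Return (tokens, sentences) for CoNLL-X formatted text.

    Assumes one token per non-empty line and blank lines between sentences.
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return tokens, sentences


# Counts quoted in this section for the CoNLL 2006 split.
train_tokens, train_sents = 94386, 5190
test_tokens, test_sents = 5852, 322
total_tokens = train_tokens + test_tokens        # 100238
total_sents = train_sents + test_sents           # 5512
average = round(total_tokens / total_sents, 2)   # 18.19
```

Calling `conll_size` on the contents of the training or test file should give the per-split counts, provided the files follow the assumed layout.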
| |
| ==== Inside ==== |
| |
| The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset. |
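
The FEAT column packs the remaining morphological information into `name=value` pairs separated by vertical bars. A minimal sketch of unpacking it follows; the function name `parse_feats` is hypothetical, and treating slash-separated values such as `common/neuter` as a list of alternatives is an assumption based on the bracketed positions in the DDT tags (e.g. `[CN]`).

```python
def parse_feats(feats):
    """Split a CoNLL-X FEATS string into a dict; '_' means no features."""
    if feats == "_":
        return {}
    result = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        # Assumption: 'common/neuter' lists alternatives the tag leaves open.
        result[name] = value.split("/")
    return result


noun = parse_feats("gender=neuter|number=plur|case=unmarked|def=indef")
adjective = parse_feats("degree=pos|gender=common/neuter|number=sing/plur")
```

Here `noun["number"]` is `["plur"]` and `adjective["gender"]` is `["common", "neuter"]`.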
| |
| The morphological analysis in the CoNLL 2006 version does not include lemmas, although the original DTAG version does. The morphosyntactic tags were probably assigned manually. |
| |
| Some multi-word expressions have been collapsed into a single token, with an underscore as the joining character. This applies to prepositional phrases used adverbially (e.g. i_lørdags = last Saturday) but not to named entities. |
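
Because collapsed expressions are joined with underscores, the component words can be recovered by splitting the form. The sketch below uses a hypothetical helper `split_mwe`; it assumes that underscores occur in forms only as joining characters, and it leaves a bare `_` untouched because CoNLL-X also uses that character as the empty-value placeholder.

```python
def split_mwe(form):
    """Split a collapsed multi-word token on underscores.

    Forms without an internal underscore (including a bare '_') are
    returned unchanged as a one-element list.
    """
    parts = [part for part in form.split("_") if part]
    return parts if len(parts) > 1 else [form]


words = split_mwe("i_lørdags")
```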
| |
| ==== Sample ==== |
| |
| The first sentence of DDT 1.0 in the DTAG format: |
| |
| <code xml><tei.2> |
| <teiHeader type=text> |
| <fileDesc> |
| <titleStmt> |
| <title>Tagged sample of: 'Jeltsins skæbnetime'</title> |
| </titleStmt> |
| <extent words=158>158 running words</extent> |
| <publicationStmt> |
| <distributor>PAROLE-DK</distributor> |
| <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address> |
| <date>1998-06-02</date> |
| <availability status=restricted><p>by agreement with distributor</availability> |
| </publicationStmt> |
| <sourceDesc> |
| <biblStruct> |
| <analytic> |
| <title>Jeltsins skæbnetime</title> |
| <author gender=m born=1925>Nikulin, Leon</author> |
| </analytic> |
| <monogr> |
| <imprint><pubPlace>Denmark</pubPlace> |
| <publisher>Det Fri Aktuelt</publisher> |
| <date>1992-12-01</date> |
| </imprint> |
| </monogr> |
| </biblStruct> |
| </sourceDesc> |
| </fileDesc> |
| <profileDesc> |
| <creation>1992-12-01</creation> |
| <langUsage><language>Danish</langUsage> |
| <textClass> |
| <catRef target="P.M2"> |
| <catRef target="P.G4.8"> |
| <catRef target="P.T9.3"> |
| </textClass> |
| </profileDesc> |
| </teiHeader> |
| <text id=AJK> |
| <body> |
| <div1 type=main> |
| <p> |
| <s> |
| <W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W> |
| <W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W> |
| <W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W> |
| <W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W> |
| <W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W> |
| <W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W> |
| <W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W> |
| <W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W> |
| <W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W> |
| <W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W> |
| <W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W> |
| <W lemma="," msd="XP" in="-2:pnct" out="">,</W> |
| <W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W> |
| <W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W> |
| <W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W> |
| <W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W> |
| <W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W> |
| <W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W> |
| <W lemma=""" msd="XP" in="1:pnct" out="">"</W> |
| <W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W> |
| <W lemma=""" msd="XP" in="-1:pnct" out="">"</W> |
| <W lemma="." msd="XP" in="-12:pnct" out="">.</W> |
| </s></code> |
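
Comparing this sample with the CoNLL 2006 version of the same sentence suggests that each `in` attribute gives the head as an offset relative to the token's own position (for example, `9:subj` on the first token points to the tenth). The sketch below converts such offsets to absolute head indices. It is inferred from this one sentence rather than from the DTAG documentation: the function name `absolute_heads` is hypothetical, and skipping bracketed labels such as `[subj]` as secondary edges is an assumption.

```python
def absolute_heads(in_attrs):
    """Convert relative `in` values to 1-based (head, label) pairs.

    in_attrs: one `in` attribute string per token, in sentence order.
    An empty string is treated as the root (head 0).
    Bracketed labels such as '[subj]' are skipped (assumed secondary edges).
    """
    heads = []
    for position, attr in enumerate(in_attrs, start=1):
        head, label = 0, "ROOT"
        for edge in filter(None, attr.split("|")):
            offset, _, relation = edge.partition(":")
            if relation.startswith("["):
                continue
            head, label = position + int(offset), relation
            break
        heads.append((head, label))
    return heads


# The `in` values of the 22 tokens in the sample sentence above.
sample_in = [
    "9:subj", "-1:mod", "-2:mod", "-3:nobj", "1:namef", "-5:appr",
    "-1:coord", "1:namef", "-2:conj", "", "-1:mod", "-2:pnct",
    "-3:dobj", "1:subj|2:[subj]", "-2:vobj", "-1:vobj", "-2:mod",
    "-1:nobj", "1:pnct", "-2:nobj", "-1:pnct", "-12:pnct",
]
heads = [head for head, _ in absolute_heads(sample_in)]
```

For this sentence the resulting heads coincide with the HEAD column of the first sentence of the CoNLL 2006 test data.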
| |
| The first sentence of the CoNLL 2006 training data: |
| |
^ ID ^ FORM ^ LEMMA ^ CPOSTAG ^ POSTAG ^ FEATS ^ HEAD ^ DEPREL ^ PHEAD ^ PDEPREL ^
| 1 | Samme | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=sing/plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 0 | ROOT | _ | _ |
| | 2 | cifre | _ | N | NC | gender=neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | |
| | 3 | , | _ | X | XP | _ | 1 | pnct | _ | _ | |
| | 4 | de | _ | P | PD | gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 7 | subj | _ | _ | |
| | 5 | norske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 4 | mod | _ | _ | |
| | 6 | piger | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 4 | nobj | _ | _ | |
| | 7 | tabte | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=past<nowiki>|</nowiki>voice=active | 1 | rel | _ | _ | |
| | 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ | |
| | 9 | i_lørdags | _ | RG | RG | degree=unmarked | 7 | mod | _ | _ | |
| | 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ | |
| | 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ | |
| | 12 | værtsnation | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 11 | possd | _ | _ | |
| | 13 | . | _ | X | XP | _ | 1 | pnct | _ | _ | |
| |
| The first sentence of the CoNLL 2006 test data: |
| |
^ ID ^ FORM ^ LEMMA ^ CPOSTAG ^ POSTAG ^ FEATS ^ HEAD ^ DEPREL ^ PHEAD ^ PDEPREL ^
| 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ |
| | 2 | kendte | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | |
| | 3 | russiske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | |
| | 4 | historikere | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | |
| | 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ | |
| | 6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ | |
| | 7 | og | _ | C | CC | _ | 6 | coord | _ | _ | |
| | 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ | |
| | 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ | |
| | 10 | tror | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 0 | ROOT | _ | _ | |
| | 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ | |
| | 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ | |
| | 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ | |
| | 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ | |
| | 15 | kan | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 13 | vobj | _ | _ | |
| | 16 | udvikles | _ | V | VA | mood=infin<nowiki>|</nowiki>voice=passive | 15 | vobj | _ | _ | |
| | 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ | |
| | 18 | en | _ | P | PI | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 17 | nobj | _ | _ | |
| | 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ | |
| | 20 | jernnæve | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 18 | nobj | _ | _ | |
| | 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ | |
| | 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ | |
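
The tables above render the ten CoNLL-X columns. In the distributed files the columns are tab-separated, one token per line, with a blank line after each sentence. A minimal reader might look like the sketch below; the names `Token` and `read_conll` are hypothetical, and the sample holds only two rows copied from the training sentence.

```python
from collections import namedtuple

Token = namedtuple(
    "Token",
    "id form lemma cpostag postag feats head deprel phead pdeprel",
)


def read_conll(text):
    """Yield sentences as lists of Token from tab-separated CoNLL-X text."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        token = Token(*fields)
        sentence.append(token._replace(id=int(token.id), head=int(token.head)))
    if sentence:
        yield sentence


sample = "\n".join([
    "2\tcifre\t_\tN\tNC\tgender=neuter|number=plur|case=unmarked|def=indef\t1\tnobj\t_\t_",
    "3\t,\t_\tX\tXP\t_\t1\tpnct\t_\t_",
    "",
])
sentences = list(read_conll(sample))
```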
| |
| ==== Parsing ==== |
| |
| Non-projective attachments are rare in DDT: only 988 of the 100,238 tokens in the CoNLL 2006 version (0.99%) are attached non-projectively. |
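
A common way to count such tokens is to call an attachment non-projective when some token lying strictly between the dependent and its head is not a descendant of that head. The sketch below implements that definition; it is not necessarily the procedure used to obtain the figure above, the function name `nonprojective_tokens` is hypothetical, and arcs from the artificial root are skipped by assumption.

```python
def nonprojective_tokens(heads):
    """Return 1-based ids of tokens attached non-projectively.

    heads[i] is the head of token i+1, with 0 for the artificial root.
    """
    def dominated_by(node, ancestor):
        # Walk up the head chain; the step limit guards against cyclic input.
        for _ in range(len(heads) + 1):
            if node == ancestor:
                return True
            if node == 0:
                return False
            node = heads[node - 1]
        return False

    result = []
    for dependent, head in enumerate(heads, start=1):
        if head == 0:
            continue  # assumption: root attachments are not counted
        low, high = sorted((dependent, head))
        if any(not dominated_by(k, head) for k in range(low + 1, high)):
            result.append(dependent)
    return result


# HEAD column of the first CoNLL 2006 test sentence shown in the Sample section.
test_heads = [10, 1, 1, 1, 6, 1, 6, 9, 7, 0, 10, 10, 10,
              15, 13, 15, 15, 17, 20, 18, 20, 10]
projective_case = nonprojective_tokens(test_heads)

# Artificial four-token tree in which the arc from token 4 to token 2 crosses the root.
crossing_case = nonprojective_tokens([3, 4, 0, 3])

share = round(100 * 988 / 100238, 2)  # percentage quoted above
```

The test sentence contains no non-projective attachment, whereas in the artificial example token 2 is reported because token 3, which lies between it and its head, is not dominated by token 4.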
| |
| The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish: |
| |
| ^ Parser (Authors) ^ LAS ^ UAS ^ |
| | MST (McDonald et al.) | 84.79 | 90.58 | |
| | Malt (Nivre et al.) | 84.77 | 89.80 | |
| | Riedel et al. | 83.63 | 89.66 | |
| |
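
To illustrate the two metrics: UAS is the percentage of scored tokens with the correct head, and LAS additionally requires the correct dependency label. The sketch below skips punctuation tokens, as the shared task did. It is not the official evaluation script; the function names are hypothetical, and treating a token as punctuation when all of its characters fall in a Unicode punctuation category is an assumption about how the exclusion was defined.

```python
import unicodedata


def is_punctuation(form):
    """True if every character is in a Unicode punctuation category (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent, skipping punctuation tokens.

    gold, predicted: parallel lists of (form, head, deprel) triples.
    """
    scored = labeled = unlabeled = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if g_head == p_head:
            unlabeled += 1
            if g_rel == p_rel:
                labeled += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * labeled / scored, 100.0 * unlabeled / scored


# Toy example: the comma is ignored, and 'at' has the right head but a wrong label.
gold = [("tror", 0, "ROOT"), ("ikke", 1, "mod"), (",", 1, "pnct"), ("at", 1, "dobj")]
pred = [("tror", 0, "ROOT"), ("ikke", 1, "mod"), (",", 4, "pnct"), ("at", 1, "vobj")]
las, uas = attachment_scores(gold, pred)
```

In the toy example three tokens are scored: all three heads are correct and two labels match, so UAS is 100 and LAS is about 66.67.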