user:zeman:treebanks:sv [2012/01/17 12:30] zeman References. |
user:zeman:treebanks:sv [2014/04/22 16:56] zeman Updated link. |
===== Swedish (sv) ===== | ===== Swedish (sv) ===== |
| |
[[http://w3.msi.vxu.se/~nivre/research/Talbanken05.html|Talbanken05]] | [[http://stp.lingfil.uu.se/~nivre/research/Talbanken05.html|Talbanken05]] |
| |
==== Versions ==== | ==== Versions ==== |
* Jens Nilsson, Johan Hall, Joakim Nivre: conversion of Talbanken76 to Talbanken05 | * Jens Nilsson, Johan Hall, Joakim Nivre: conversion of Talbanken76 to Talbanken05 |
* Principal publications | * Principal publications |
* Matthias Trautner Kromann: [[http://www.buch-kromann.dk/matthias/files/030730-tlt-norfa.pdf|The Danish Dependency Treebank and the DTAG Treebank Tool]]. In: Proceedings of Treebanks and Linguistic Theories, Växjö, Sweden, 2003. | * Joakim Nivre, Jens Nilsson, Johan Hall: [[http://w3.msi.vxu.se/~nivre/papers/talbanken05.pdf|Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation]]. In: Proceedings of the fifth international conference on Language Resources and Evaluation (LREC2006), May 24-26, Genova, Italy, 2006. |
| * Jens Nilsson, Johan Hall, Joakim Nivre: [[http://w3.msi.vxu.se/~nivre/papers/treebank_nodalida05.pdf|MAMBA Meets TIGER: Reconstructing a Swedish Treebank from Antiquity]]. In: Proceedings of the NODALIDA Special Session on Treebanks. 2005. |
* Documentation | * Documentation |
* //see the left-hand-side links at the treebank website, e.g.:// | * Jens Nilsson, Johan Hall: [[http://w3.msi.vxu.se/~jni/papers/msi_report05067.pdf|Reconstruction of the Swedish Treebank Talbanken]]. MSI report 05067. Växjö universitet: School of Mathematics and Systems Engineering. ISSN 1650-2647. Växjö, Sweden, 2005. |
* [[http://www.buch-kromann.dk/matthias/treebank/theory.html|Dependency theory and list of dependency relation labels]] | * Daniel Zeman: [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset interpretation of the Mamba tagset]]. Praha, Czechia, 2007. |
* Britt Keson: [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|Vejledning til det danske morfosyntaktisk taggede PAROLE-korpus]] (morphosyntactic tags). Det Danske Sprog- og Litteraturselskab (DSL) | |
| |
==== Domain ==== | ==== Domain ==== |
| |
Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from a wide range of randomly selected linguistically representative Danish texts from 1983-1992.”) | Mixed: “The text material consists of two sections, the so-called professional prose section (P), with data from textbooks, brochures, newspapers, etc., and a collection of high school students' essays (G).” |
| |
==== Size ==== | ==== Size ==== |
| |
The CoNLL 2006 version contains 100,238 tokens in 5512 sentences, yielding 18.19 tokens per sentence on average (CoNLL 2006 data split: 94386 tokens / 5190 sentences training, 5852 tokens / 322 sentences test). | The CoNLL 2006 version contains 197,123 tokens in 11431 sentences, yielding 17.24 tokens per sentence on average (CoNLL 2006 data split: 191,467 tokens / 11042 sentences training, 5656 tokens / 389 sentences test). |
| |
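The size figures above can be re-derived from any CoNLL-X file. The following is a minimal sketch, not the script used for this page; it assumes the usual CoNLL-X layout (one token per line, tab-separated fields, blank line between sentences). The names ''read_conll'', ''size_stats'' and ''SAMPLE'' are made up for illustration, and the sample text is the two test sentences quoted in the Sample section.

```python
def read_conll(text):
    """Split CoNLL-X text into sentences; each token becomes a list of fields."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def size_stats(sentences):
    """Return (tokens, sentences, average tokens per sentence)."""
    tokens = sum(len(sentence) for sentence in sentences)
    return tokens, len(sentences), tokens / len(sentences)


# Illustrative input: the first two test sentences shown on this page.
SAMPLE = "\n".join([
    "1\tIndividuell\t_\tAJ\tAJ\t_\t2\tAT\t_\t_",
    "2\tbeskattning\t_\tVN\tVN\t_\t0\tROOT\t_\t_",
    "3\tav\t_\tPR\tPR\t_\t2\tET\t_\t_",
    "4\tarbetsinkomster\t_\tNN\tNN\t_\t3\tPA\t_\t_",
    "",
    "1\tGenom\t_\tPR\tPR\t_\t3\tAA\t_\t_",
    "2\tskattereformen\t_\tNN\tNN\t_\t1\tPA\t_\t_",
    "3\tinförs\t_\tVV\tVV\t_\t0\tROOT\t_\t_",
    "4\tindividuell\t_\tAJ\tAJ\t_\t5\tAT\t_\t_",
    "5\tbeskattning\t_\tVN\tVN\t_\t3\tSS\t_\t_",
    "6\t(\t_\tIR\tIR\t_\t5\tIR\t_\t_",
    "7\tsärbeskattning\t_\tVN\tVN\t_\t5\tAN\t_\t_",
    "8\t)\t_\tIR\tIR\t_\t5\tJR\t_\t_",
    "9\tav\t_\tPR\tPR\t_\t5\tET\t_\t_",
    "10\tarbetsinkomster\t_\tNN\tNN\t_\t9\tPA\t_\t_",
    "11\t.\t_\tIP\tIP\t_\t3\tIP\t_\t_",
])

print(size_stats(read_conll(SAMPLE)))   # (15, 2, 7.5)
print(round(197123 / 11431, 2))         # 17.24, the Swedish average quoted above
```

The second ''print'' only repeats the division behind the published Swedish average; the training and test counts (191,467 + 5656 tokens, 11042 + 389 sentences) add up to the same totals.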
==== Inside ==== | ==== Inside ==== |
| |
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset. | The morphological analysis in the CoNLL 2006 version does not include lemmas. The part-of-speech tags have been assigned (probably) manually. The tagset is very coarse: it has no morphological features, only the part of speech. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset]] to inspect the tagset. |
| |
The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually. | |
| |
Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = last Saturday) but not named entities. | |
| |
==== Sample ==== | ==== Sample ==== |
| |
The first sentence of DDT 1.0 in the DTAG format: | |
| |
<code xml><tei.2> | |
<teiHeader type=text> | |
<fileDesc> | |
<titleStmt> | |
<title>Tagged sample of: 'Jeltsins skæbnetime'</title> | |
</titleStmt> | |
<extent words=158>158 running words</extent> | |
<publicationStmt> | |
<distributor>PAROLE-DK</distributor> | |
<address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address> | |
<date>1998-06-02</date> | |
<availability status=restricted><p>by agreement with distributor</availability> | |
</publicationStmt> | |
<sourceDesc> | |
<biblStruct> | |
<analytic> | |
<title>Jeltsins skæbnetime</title> | |
<author gender=m born=1925>Nikulin, Leon</author> | |
</analytic> | |
<monogr> | |
<imprint><pubPlace>Denmark</pubPlace> | |
<publisher>Det Fri Aktuelt</publisher> | |
<date>1992-12-01</date> | |
</imprint> | |
</monogr> | |
</biblStruct> | |
</sourceDesc> | |
</fileDesc> | |
<profileDesc> | |
<creation>1992-12-01</creation> | |
<langUsage><language>Danish</langUsage> | |
<textClass> | |
<catRef target="P.M2"> | |
<catRef target="P.G4.8"> | |
<catRef target="P.T9.3"> | |
</textClass> | |
</profileDesc> | |
</teiHeader> | |
<text id=AJK> | |
<body> | |
<div1 type=main> | |
<p> | |
<s> | |
<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W> | |
<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W> | |
<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W> | |
<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W> | |
<W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W> | |
<W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W> | |
<W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W> | |
<W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W> | |
<W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W> | |
<W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W> | |
<W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W> | |
<W lemma="," msd="XP" in="-2:pnct" out="">,</W> | |
<W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W> | |
<W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W> | |
<W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W> | |
<W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W> | |
<W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W> | |
<W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W> | |
<W lemma=""" msd="XP" in="1:pnct" out="">"</W> | |
<W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W> | |
<W lemma=""" msd="XP" in="-1:pnct" out="">"</W> | |
<W lemma="." msd="XP" in="-12:pnct" out="">.</W> | |
</s></code> | |
| |
The first sentence of the CoNLL 2006 training data: | The first sentence of the CoNLL 2006 training data: |
| |
| 1 | Samme | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=sing/plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 0 | ROOT | _ | _ | | | 1 | Äktenskapet | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 4 | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | cifre | _ | N | NC | gender=neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | | | 2 | och | <nowiki>_</nowiki> | <nowiki>++</nowiki> | <nowiki>++</nowiki> | <nowiki>_</nowiki> | 3 | <nowiki>++</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | , | _ | X | XP | _ | 1 | pnct | _ | _ | | | 3 | familjen | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 1 | CC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | de | _ | P | PD | gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 7 | subj | _ | _ | | | 4 | är | <nowiki>_</nowiki> | AV | AV | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | norske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 4 | mod | _ | _ | | | 5 | en | <nowiki>_</nowiki> | EN | EN | <nowiki>_</nowiki> | 7 | DT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | piger | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 4 | nobj | _ | _ | | | 6 | gammal | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> | 7 | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | tabte | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=past<nowiki>|</nowiki>voice=active | 1 | rel | _ | _ | | | 7 | institution | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 4 | SP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ | | | 8 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | IK | IK | <nowiki>_</nowiki> | 7 | IK | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | i_lørdags | _ | RG | RG | degree=unmarked | 7 | mod | _ | _ | | | 9 | som | <nowiki>_</nowiki> | PO | PO | <nowiki>_</nowiki> | 10 | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ | | | 10 | funnits | <nowiki>_</nowiki> | VV | VV | <nowiki>_</nowiki> | 7 | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ | | | 11 | sedan | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 10 | TA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | værtsnation | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 11 | possd | _ | _ | | | 12 | <nowiki>1800-talet</nowiki> | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 11 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 13 | . | _ | X | XP | _ | 1 | pnct | _ | _ | | | 13 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | IP | IP | <nowiki>_</nowiki> | 4 | IP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |
The first sentence of the CoNLL 2006 test data: | The first two sentences of the CoNLL 2006 test data: |
| |
| 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ | | | 1 | Individuell | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> | 2 | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | kendte | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | | | 2 | beskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | russiske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | | | 3 | av | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 2 | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | historikere | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | | | 4 | arbetsinkomster | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 3 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ | | | |||||||||| |
| 6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ | | | 1 | Genom | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 3 | AA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | og | _ | C | CC | _ | 6 | coord | _ | _ | | | 2 | skattereformen | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 1 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ | | | 3 | införs | <nowiki>_</nowiki> | VV | VV | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ | | | 4 | individuell | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> | 5 | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | tror | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 0 | ROOT | _ | _ | | | 5 | beskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> | 3 | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ | | | 6 | <nowiki>(</nowiki> | <nowiki>_</nowiki> | IR | IR | <nowiki>_</nowiki> | 5 | IR | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ | | | 7 | särbeskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> | 5 | AN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ | | | 8 | <nowiki>)</nowiki> | <nowiki>_</nowiki> | IR | IR | <nowiki>_</nowiki> | 5 | JR | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ | | | 9 | av | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 5 | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 15 | kan | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 13 | vobj | _ | _ | | | 10 | arbetsinkomster | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 9 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 16 | udvikles | _ | V | VA | mood=infin<nowiki>|</nowiki>voice=passive | 15 | vobj | _ | _ | | | 11 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | IP | IP | <nowiki>_</nowiki> | 3 | IP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ | | |
| 18 | en | _ | P | PI | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 17 | nobj | _ | _ | | |
| 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ | | |
| 20 | jernnæve | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 18 | nobj | _ | _ | | |
| 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ | | |
| 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ | | |
| |
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%). | Nonprojectivities in Talbanken are not frequent. Only 1928 of the 197,123 tokens in the CoNLL 2006 version are attached nonprojectively (0.98%). |
| |
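How the nonprojective attachments were counted is not stated on this page. The sketch below assumes one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head. The function names are illustrative, and counts obtained this way may differ slightly from the figures above if another definition was used.

```python
def dominated_by(node, ancestor, heads):
    """True if `ancestor` is reached by following head links upward from `node`."""
    steps = 0
    while steps <= len(heads):      # the step limit guards against cyclic input
        if node == ancestor:
            return True
        if node == 0:               # reached the artificial root without a match
            return False
        node = heads[node - 1]
        steps += 1
    return False


def nonprojective_tokens(heads):
    """heads[i - 1] is the head of token i (0 = artificial root).

    Returns the 1-based ids of tokens whose incoming arc is nonprojective.
    """
    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        if any(not dominated_by(k, head, heads) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first Swedish training sentence shown in the Sample section.
training_heads = [4, 3, 1, 0, 7, 7, 4, 7, 10, 7, 10, 11, 4]
print(nonprojective_tokens(training_heads))   # []

# A made-up four-token tree with crossing arcs, for contrast.
print(nonprojective_tokens([3, 4, 0, 3]))     # [2]

print(round(100 * 1928 / 197123, 2))          # 0.98, the Swedish percentage above
```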
The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish: | The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Swedish: |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ |
| MST (McDonald et al.) | 84.79 | 90.58 | | | Microsoft (Corston-Oliver and Aue) | 79.69 | 89.54 | |
| Malt (Nivre et al.) | 84.77 | 89.80 | | | Malt (Nivre et al.) | 84.58 | 89.50 | |
| Riedel et al. | 83.63 | 89.66 | | | Illinois (Do and Chang) | 82.31 | 89.05 | |
| | MST (McDonald et al.) | 82.55 | 88.93 | |
| | Kenji Sagae | 82.00 | 88.57 | |
| | Nara (Yuchang Cheng) | 81.08 | 88.57 | |
| | Basis (John O'Neil) | 81.78 | 88.45 | |
| | Riedel et al. | 80.66 | 88.33 | |
| |
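The LAS and UAS columns can be approximated with a few lines of code. This is a rough sketch, not the official CoNLL 2006 evaluation script: it assumes that a token counts as punctuation when all of its characters fall in a Unicode punctuation category, and it skips such tokens before scoring. The gold annotation is the second test sentence from the Sample section; the predicted analysis is invented purely for illustration.

```python
import unicodedata


def is_punctuation(form):
    """Assumed criterion: every character is in a Unicode punctuation category."""
    return bool(form) and all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """Each argument is a list of (form, head, deprel). Returns (LAS, UAS) in percent."""
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue                      # punctuation tokens are not scored
        scored += 1
        if g_head == p_head:
            uas_hits += 1                 # correct head
            if g_rel == p_rel:
                las_hits += 1             # correct head and correct label
    return 100 * las_hits / scored, 100 * uas_hits / scored


gold = [
    ("Genom", 3, "AA"), ("skattereformen", 1, "PA"), ("införs", 0, "ROOT"),
    ("individuell", 5, "AT"), ("beskattning", 3, "SS"), ("(", 5, "IR"),
    ("särbeskattning", 5, "AN"), (")", 5, "JR"), ("av", 5, "ET"),
    ("arbetsinkomster", 9, "PA"), (".", 3, "IP"),
]

# Hypothetical parser output: one wrong head, one wrong label, one wrong
# punctuation attachment (the last error is ignored by the scoring).
predicted = list(gold)
predicted[0] = ("Genom", 5, "AA")
predicted[8] = ("av", 5, "AT")
predicted[10] = (".", 5, "IP")

print(attachment_scores(gold, predicted))   # (75.0, 87.5)
```

Eight of the eleven tokens are scored here; the two parentheses and the final period are excluded.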