Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:sv [2012/01/17 14:08]
zeman Size.
user:zeman:treebanks:sv [2012/01/17 14:23]
zeman Parsing.
Line 47: Line 47:
 ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset.
-
-The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually.
-
-Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = on Saturdays) but not named entities.
+The morphological analysis in the CoNLL 2006 version does not include lemmas. The part-of-speech tags have been assigned (probably) manually. The tagset is very coarse, there are no morphological features, just the part of speech. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset]] to inspect the tagset.
  
 ==== Sample ====
- 
-The first sentence of DDT 1.0 in the DTAG format: 
- 
-<code xml><tei.2> 
-  <teiHeader type=text> 
-    <fileDesc> 
-      <titleStmt> 
-        <title>Tagged sample of: 'Jeltsins skæbnetime'</title> 
-      </titleStmt> 
-      <extent words=158>158 running words</extent> 
-      <publicationStmt> 
-         <distributor>PAROLE-DK</distributor> 
-         <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address> 
-         <date>1998-06-02</date> 
-         <availability status=restricted><p>by agreement with distributor</availability> 
-      </publicationStmt> 
-      <sourceDesc> 
-        <biblStruct> 
-          <analytic> 
-            <title>Jeltsins skæbnetime</title> 
-            <author gender=m born=1925>Nikulin, Leon</author> 
-          </analytic> 
-          <monogr> 
-            <imprint><pubPlace>Denmark</pubPlace> 
-              <publisher>Det Fri Aktuelt</publisher> 
-              <date>1992-12-01</date> 
-            </imprint> 
-          </monogr> 
-        </biblStruct> 
-      </sourceDesc> 
-    </fileDesc> 
-    <profileDesc> 
-      <creation>1992-12-01</creation> 
-      <langUsage><language>Danish</langUsage> 
-      <textClass> 
-        <catRef target="P.M2"> 
-        <catRef target="P.G4.8"> 
-        <catRef target="P.T9.3"> 
-      </textClass> 
-    </profileDesc> 
-  </teiHeader> 
-<text id=AJK> 
-<body> 
-<div1 type=main> 
-<p> 
-<s> 
-<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W> 
-<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W> 
-<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W> 
-<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W> 
-<W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W> 
-<W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W> 
-<W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W> 
-<W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W> 
-<W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W> 
-<W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W> 
-<W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W> 
-<W lemma="," msd="XP" in="-2:pnct" out="">,</W> 
-<W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W> 
-<W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W> 
-<W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W> 
-<W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W> 
-<W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W> 
-<W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W> 
-<W lemma="&quot;" msd="XP" in="1:pnct" out="">"</W> 
-<W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W> 
-<W lemma="&quot;" msd="XP" in="-1:pnct" out="">"</W> 
-<W lemma="." msd="XP" in="-12:pnct" out="">.</W> 
-</s></code> 
  
 The first sentence of the CoNLL 2006 training data:
  
-| 1 | Samme | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=sing/plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 0 | ROOT | _ | _ |
-| 2 | cifre | _ | N | NC | gender=neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ |
-| 3 |  | _ | X | XP | _ | 1 | pnct | _ | _ |
-| 4 | de | _ | P | PD | gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 7 | subj | _ | _ |
-| 5 | norske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked |  | mod | _ | _ |
-| 6 | piger | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 4 | nobj | _ | _ |
-| 7 | tabte | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=past<nowiki>|</nowiki>voice=active |  | rel | _ | _ |
-| 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ |
-| 9 | i_lørdags | _ | RG | RG | degree=unmarked |  | mod | _ | _ |
-| 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ |
-| 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ |
-| 12 | værtsnation | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 11 | possd | _ | _ |
-| 13 | . | _ | X | XP | _ |  | pnct | _ | _ |
+| 1 | Äktenskapet | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 4 | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 2 | och | <nowiki>_</nowiki> | <nowiki>++</nowiki> | <nowiki>++</nowiki> | <nowiki>_</nowiki> |  | <nowiki>++</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 3 | familjen | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 1 | CC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 4 | är | <nowiki>_</nowiki> | AV | AV | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 5 | en | <nowiki>_</nowiki> | EN | EN | <nowiki>_</nowiki> | 7 | DT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 6 | gammal | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> |  | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 7 | institution | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 4 | SP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 8 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | IK | IK | <nowiki>_</nowiki> | 7 | IK | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 9 | som | <nowiki>_</nowiki> | PO | PO | <nowiki>_</nowiki> | 10 | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 10 | funnits | <nowiki>_</nowiki> | VV | VV | <nowiki>_</nowiki> | 7 | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 11 | sedan | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 10 | TA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 12 | <nowiki>1800-talet</nowiki> | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 11 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 13 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | IP | IP | <nowiki>_</nowiki> |  | IP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
  
-The first sentence of the CoNLL 2006 test data:
+The first two sentences of the CoNLL 2006 test data:
  
-| 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ |
-| 2 | kendte | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ |
-| 3 | russiske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ |
-| 4 | historikere | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ |
-| 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ |
-| 6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ |
-| 7 | og | _ | C | CC | _ | 6 | coord | _ | _ |
-| 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ |
-| 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ |
-| 10 | tror | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 0 | ROOT | _ | _ |
-| 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ |
-| 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ |
-| 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ |
-| 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ |
-| 15 | kan | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 13 | vobj | _ | _ |
-| 16 | udvikles | _ | V | VA | mood=infin<nowiki>|</nowiki>voice=passive | 15 | vobj | _ | _ |
-| 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ |
-| 18 | en | _ | P | PI | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 17 | nobj | _ | _ |
-| 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
-| 20 | jernnæve | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 18 | nobj | _ | _ |
-| 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
-| 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ |
+| 1 | Individuell | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> | 2 | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 2 | beskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 3 | av | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 2 | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 4 | arbetsinkomster | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 3 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| ||||||||||
+| 1 | Genom | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> |  | AA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 2 | skattereformen | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> |  | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 3 | införs | <nowiki>_</nowiki> | VV | VV | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 4 | individuell | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> |  | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 5 | beskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> |  | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 6 | <nowiki>(</nowiki> | <nowiki>_</nowiki> | IR | IR | <nowiki>_</nowiki> |  | IR | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 7 | särbeskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> |  | AN | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 8 | <nowiki>)</nowiki> | <nowiki>_</nowiki> | IR | IR | <nowiki>_</nowiki> |  | JR | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 9 | av | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> |  | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 10 | arbetsinkomster | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 9 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 11 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | IP | IP | <nowiki>_</nowiki> |  | IP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
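
The sample rows follow the ten-column CoNLL 2006 layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). A minimal sketch of reading such data in Python, assuming tab-separated columns and a blank line between sentences; the names ''Token'' and ''read_conll'' are illustrative and not part of any treebank tool. The embedded sample is the first test sentence shown above.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line; PHEAD and PDEPREL are ignored in this sketch."""
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int
    deprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of Token) from tab-separated CoNLL 2006 lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


SAMPLE = (
    "1\tIndividuell\t_\tAJ\tAJ\t_\t2\tAT\t_\t_\n"
    "2\tbeskattning\t_\tVN\tVN\t_\t0\tROOT\t_\t_\n"
    "3\tav\t_\tPR\tPR\t_\t2\tET\t_\t_\n"
    "4\tarbetsinkomster\t_\tNN\tNN\t_\t3\tPA\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conll(SAMPLE.splitlines()):
        for tok in sent:
            print(tok.id, tok.form, tok.pos, tok.head, tok.deprel)
```
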
  
 ==== Parsing ====
  
-Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).
+Nonprojectivities in Talbanken are not frequent. Only 1928 of the 197,123 tokens in the CoNLL 2006 version are attached nonprojectively (0.98%).
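
A count like the one above can be reproduced from the HEAD column alone. The sketch below assumes the common definition (the page does not say which one was used): a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head, with 0 as the artificial root. The function names are illustrative, and the second example tree is a made-up toy, not treebank data.

```python
from typing import List, Sequence


def dominates(ancestor: int, node: int, heads: Sequence[int]) -> bool:
    """True if `ancestor` is a proper ancestor of `node`.

    `heads[i]` is the head of token i (1-based); index 0 is the artificial root.
    The step limit guards against malformed input containing cycles.
    """
    steps = 0
    while node != 0 and steps <= len(heads):
        node = heads[node]
        if node == ancestor:
            return True
        steps += 1
    return False


def count_nonprojective(sentence_heads: Sequence[int]) -> int:
    """Count tokens whose attachment to their head is nonprojective.

    `sentence_heads[i - 1]` is the head of token i, as in the HEAD column.
    """
    heads: List[int] = [0] + list(sentence_heads)
    count = 0
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        # The arc is nonprojective if any token in the gap is not under `head`.
        if any(not dominates(head, k, heads) for k in range(lo + 1, hi)):
            count += 1
    return count


if __name__ == "__main__":
    # Heads of the first Swedish test sentence in the sample: projective.
    print(count_nonprojective([2, 0, 2, 3]))
    # Toy tree (hypothetical): token 2 attaches to token 4 across the root token 3.
    print(count_nonprojective([3, 4, 0, 3]))
    # Percentage arithmetic for the Talbanken figures quoted above.
    print(round(100 * 1928 / 197123, 2))
```
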
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:
+The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Swedish:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 84.79 | 90.58 |
-| Malt (Nivre et al.) | 84.77 | 89.80 |
-| Riedel et al. | 83.63 | 89.66 |
+| Microsoft (Corston-Oliver and Aue) | 79.69 | 89.54 |
+| Malt (Nivre et al.) | 84.58 | 89.50 |
+| Illinois (Do and Chang) | 82.31 | 89.05 |
+| MST (McDonald et al.) | 82.55 | 88.93 |
+| Kenji Sagae | 82.00 | 88.57 |
+| Nara (Yuchang Cheng) | 81.08 | 88.57 |
+| Basis (John O'Neil) | 81.78 | 88.45 |
+| Riedel et al. | 80.66 | 88.33 |
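
LAS and UAS in the table are attachment scores computed with punctuation tokens left out. A minimal sketch of that kind of scoring, assuming a token counts as punctuation when every character of its form belongs to a Unicode punctuation category; this criterion and the function names are assumptions for illustration, not the official evaluation script. The final "." token and all predicted values in the example are hypothetical.

```python
import unicodedata
from typing import List, Tuple

# (form, head, deprel) for one token
Arc = Tuple[str, int, str]


def is_punct(form: str) -> bool:
    """Assumed criterion: non-empty form made only of Unicode punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over non-punctuation tokens."""
    total = uas = las = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punct(form):
            continue  # punctuation is not scored
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * las / total, 100.0 * uas / total


if __name__ == "__main__":
    gold = [("Individuell", 2, "AT"), ("beskattning", 0, "ROOT"),
            ("av", 2, "ET"), ("arbetsinkomster", 3, "PA"),
            (".", 2, "IP")]           # hypothetical final token
    pred = [("Individuell", 2, "AT"), ("beskattning", 0, "ROOT"),
            ("av", 2, "AA"),          # right head, wrong label
            ("arbetsinkomster", 2, "PA"),  # wrong head
            (".", 4, "IP")]           # wrong head, but not scored
    print(attachment_scores(gold, pred))
```
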
  
