
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:sv [2012/01/17 14:08]
zeman Size.
user:zeman:treebanks:sv [2014/04/22 16:56]
zeman Updated link.
Line 1: Line 1:
 ===== Swedish (sv) ===== ===== Swedish (sv) =====
  
-[[http://w3.msi.vxu.se/~nivre/research/Talbanken05.html|Talbanken05]] +[[http://stp.lingfil.uu.se/~nivre/research/Talbanken05.html|Talbanken05]]
  
 ==== Versions ==== ==== Versions ====
Line 47: Line 47:
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset. +The morphological analysis in the CoNLL 2006 version does not include lemmas. The part-of-speech tags have been assigned (probably) manually. The tagset is very coarse, there are no morphological features, just the part of speech. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset]] to inspect the tagset.
- +
-The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually. +
- +
-Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = on Saturdays) but not named entities.+
  
 ==== Sample ==== ==== Sample ====
- 
-The first sentence of DDT 1.0 in the DTAG format: 
- 
-<code xml><tei.2> 
-  <teiHeader type=text> 
-    <fileDesc> 
-      <titleStmt> 
-        <title>Tagged sample of: 'Jeltsins skæbnetime'</title> 
-      </titleStmt> 
-      <extent words=158>158 running words</extent> 
-      <publicationStmt> 
-         <distributor>PAROLE-DK</distributor> 
-         <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address> 
-         <date>1998-06-02</date> 
-         <availability status=restricted><p>by agreement with distributor</availability> 
-      </publicationStmt> 
-      <sourceDesc> 
-        <biblStruct> 
-          <analytic> 
-            <title>Jeltsins skæbnetime</title> 
-            <author gender=m born=1925>Nikulin, Leon</author> 
-          </analytic> 
-          <monogr> 
-            <imprint><pubPlace>Denmark</pubPlace> 
-              <publisher>Det Fri Aktuelt</publisher> 
-              <date>1992-12-01</date> 
-            </imprint> 
-          </monogr> 
-        </biblStruct> 
-      </sourceDesc> 
-    </fileDesc> 
-    <profileDesc> 
-      <creation>1992-12-01</creation> 
-      <langUsage><language>Danish</langUsage> 
-      <textClass> 
-        <catRef target="P.M2"> 
-        <catRef target="P.G4.8"> 
-        <catRef target="P.T9.3"> 
-      </textClass> 
-    </profileDesc> 
-  </teiHeader> 
-<text id=AJK> 
-<body> 
-<div1 type=main> 
-<p> 
-<s> 
-<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W> 
-<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W> 
-<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W> 
-<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W> 
-<W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W> 
-<W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W> 
-<W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W> 
-<W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W> 
-<W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W> 
-<W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W> 
-<W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W> 
-<W lemma="," msd="XP" in="-2:pnct" out="">,</W> 
-<W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W> 
-<W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W> 
-<W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W> 
-<W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W> 
-<W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W> 
-<W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W> 
-<W lemma="&quot;" msd="XP" in="1:pnct" out="">"</W> 
-<W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W> 
-<W lemma="&quot;" msd="XP" in="-1:pnct" out="">"</W> 
-<W lemma="." msd="XP" in="-12:pnct" out="">.</W> 
-</s></code> 
  
 The first sentence of the CoNLL 2006 training data: The first sentence of the CoNLL 2006 training data:
  
-| 1 | Samme | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=sing/plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 0 | ROOT | _ | _ | +| 1 | Äktenskapet | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 4 | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 2 | cifre | _ | N | NC | gender=neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | +| 2 | och | <nowiki>_</nowiki> | <nowiki>++</nowiki> | <nowiki>++</nowiki> | <nowiki>_</nowiki> |  | <nowiki>++</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 3 |  | _ | X | XP | _ | 1 | pnct | _ | _ | +| 3 | familjen | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 1 | CC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 4 | de | _ | P | PD | gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 7 | subj | _ | _ | +| 4 | är | <nowiki>_</nowiki> | AV | AV | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 5 | norske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked |  | mod | _ | _ | +| 5 | en | <nowiki>_</nowiki> | EN | EN | <nowiki>_</nowiki> | 7 | DT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 6 | piger | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 4 | nobj | _ | _ | +| 6 | gammal | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> |  | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 7 | tabte | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=past<nowiki>|</nowiki>voice=active |  | rel | _ | _ | +| 7 | institution | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 4 | SP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ | +| 8 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | IK | IK | <nowiki>_</nowiki> | 7 | IK | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 9 | i_lørdags | _ | RG | RG | degree=unmarked |  | mod | _ | _ | +| 9 | som | <nowiki>_</nowiki> | PO | PO | <nowiki>_</nowiki> | 10 | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ | +| 10 | funnits | <nowiki>_</nowiki> | VV | VV | <nowiki>_</nowiki> | 7 | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ | +| 11 | sedan | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 10 | TA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 12 | værtsnation | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 11 | possd | _ | _ | +| 12 | <nowiki>1800-talet</nowiki> | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 11 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 13 | . | _ | X | XP | _ |  | pnct | _ | _ | +| 13 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | IP | IP | <nowiki>_</nowiki> |  | IP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
  
-The first sentence of the CoNLL 2006 test data:+The first two sentences of the CoNLL 2006 test data:
  
-| 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ | +| 1 | Individuell | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> | 2 | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 2 | kendte | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | +| 2 | beskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 3 | russiske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | +| 3 | av | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> | 2 | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 4 | historikere | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | +| 4 | arbetsinkomster | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 3 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ | +| ||||||||||
-| 6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ | +| 1 | Genom | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> |  | AA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 7 | og | _ | C | CC | _ | 6 | coord | _ | _ | +| 2 | skattereformen | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> |  | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ | +| 3 | införs | <nowiki>_</nowiki> | VV | VV | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ | +| 4 | individuell | <nowiki>_</nowiki> | AJ | AJ | <nowiki>_</nowiki> |  | AT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 10 | tror | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 0 | ROOT | _ | _ | +| 5 | beskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> |  | SS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ | +| 6 | <nowiki>(</nowiki> | <nowiki>_</nowiki> | IR | IR | <nowiki>_</nowiki> |  | IR | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ | +| 7 | särbeskattning | <nowiki>_</nowiki> | VN | VN | <nowiki>_</nowiki> |  | AN | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ | +| 8 | <nowiki>)</nowiki> | <nowiki>_</nowiki> | IR | IR | <nowiki>_</nowiki> |  | JR | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ | +| 9 | av | <nowiki>_</nowiki> | PR | PR | <nowiki>_</nowiki> |  | ET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 15 | kan | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 13 | vobj | _ | _ | +| 10 | arbetsinkomster | <nowiki>_</nowiki> | NN | NN | <nowiki>_</nowiki> | 9 | PA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 16 | udvikles | _ | V | VA | mood=infin<nowiki>|</nowiki>voice=passive | 15 | vobj | _ | _ | +| 11 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | IP | IP | <nowiki>_</nowiki> |  | IP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ | +
-| 18 | en | _ | P | PI | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 17 | nobj | _ | _ | +
-| 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ | +
-| 20 | jernnæve | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 18 | nobj | _ | _ | +
-| 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ | +
-| 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ | +
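
The sample rows above follow the ten-column CoNLL 2006 layout (ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). As a minimal sketch, assuming the data file is tab-separated with one token per line and a blank line between sentences, such rows could be read as follows. The names `Token` and `read_conll` are hypothetical helpers chosen for this illustration; the embedded sample is the first test sentence shown above.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One row of a CoNLL 2006 file (hypothetical helper type)."""
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int
    deprel: str
    phead: str
    pdeprel: str


def read_conll(text: str) -> List[List[Token]]:
    """Split tab-separated CoNLL text into sentences of Token rows."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                             cols[5], int(cols[6]), cols[7], cols[8], cols[9]))
    if current:
        sentences.append(current)
    return sentences


# First sentence of the test data, written with spaces and converted to tabs.
ROWS = [
    "1 Individuell _ AJ AJ _ 2 AT _ _",
    "2 beskattning _ VN VN _ 0 ROOT _ _",
    "3 av _ PR PR _ 2 ET _ _",
    "4 arbetsinkomster _ NN NN _ 3 PA _ _",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in ROWS) + "\n"

sentences = read_conll(SAMPLE)
for tok in sentences[0]:
    print(tok.id, tok.form, tok.pos, tok.head, tok.deprel)
```

Because the Swedish data has no lemmas and no features, the LEMMA and FEATS fields simply carry the underscore placeholder.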
  
 ==== Parsing ==== ==== Parsing ====
  
-Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%). +Nonprojectivities in Talbanken are not frequent. Only 1,928 of the 197,123 tokens in the CoNLL 2006 version are attached nonprojectively (0.98%).
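
How a nonprojective attachment can be detected is sketched below. The page does not say which definition was used for the counts above; this sketch assumes the common one: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. The function name `nonprojective_tokens` is hypothetical, and the artificial root (head 0) is assumed to sit to the left of the sentence.

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based ids of tokens whose incoming arc is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    n = len(heads)

    def is_descendant(node: int, ancestor: int) -> bool:
        # Climb from node towards the root, looking for ancestor on the way.
        steps = 0
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
            if steps > n:
                raise ValueError("cycle in dependency structure")
        # Every token descends from the artificial root.
        return ancestor == 0

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        # The arc is nonprojective if any token in the gap escapes the head.
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Heads of the first test sentence above: a fully projective tree.
projective = nonprojective_tokens([2, 0, 2, 3])

# Constructed example (not from the treebank): token 2 attaches to token 4
# across token 3, which is the root and therefore not a descendant of 4.
crossing = nonprojective_tokens([3, 4, 0, 3])

print(projective, crossing)
```

Summing the lengths of these lists over all sentences and dividing by the token count would give a percentage of the kind quoted above, under the stated definition.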
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish: +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Swedish:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^ ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 84.79 | 90.58 | +| Microsoft (Corston-Oliver and Aue) | 79.69 | 89.54 |
-| Malt (Nivre et al.) | 84.77 | 89.80 | +| Malt (Nivre et al.) | 84.58 | 89.50 |
-| Riedel et al. | 83.63 | 89.66 | +| Illinois (Do and Chang) | 82.31 | 89.05 |
 +| MST (McDonald et al.) | 82.55 | 88.93 |
 +| Kenji Sagae | 82.00 | 88.57 | 
 +| Nara (Yuchang Cheng) | 81.08 | 88.57 | 
 +| Basis (John O'Neil) | 81.78 | 88.45 | 
 +| Riedel et al. | 80.66 | 88.33 |
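
The LAS and UAS figures in the table are attachment scores computed without punctuation tokens. A minimal sketch of such a computation follows. It is not the official evaluation script; it assumes that a token counts as punctuation when every character of its form belongs to a Unicode punctuation category. The function names are hypothetical, the gold rows reuse the first test sentence above with an invented final period, and the predictions are invented for illustration.

```python
import unicodedata
from typing import List, Tuple

Row = Tuple[str, int, str]  # (form, head, deprel)


def is_punctuation(form: str) -> bool:
    """True if every character is in a Unicode punctuation category (P*)."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold: List[Row], pred: List[Row]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, skipping punctuation tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences differ in length")
    total = las = uas = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punctuation(form):
            continue  # excluded from scoring
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    if total == 0:
        raise ValueError("no scorable tokens")
    return 100.0 * las / total, 100.0 * uas / total


gold = [
    ("Individuell", 2, "AT"),
    ("beskattning", 0, "ROOT"),
    ("av", 2, "ET"),
    ("arbetsinkomster", 3, "PA"),
    (".", 2, "IP"),  # invented row, added only to show the exclusion
]
pred = [
    ("Individuell", 2, "AT"),
    ("beskattning", 0, "ROOT"),
    ("av", 2, "PA"),               # right head, wrong label
    ("arbetsinkomster", 2, "PA"),  # wrong head
    (".", 4, "IP"),                # wrong head, but not scored
]

las, uas = attachment_scores(gold, pred)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")
```

With punctuation included, the wrongly attached period in this toy example would lower both scores, which is why results computed this way are not directly comparable with scores that count every token.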
  
