Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:sl [2012/01/16 13:21]
zeman created
user:zeman:treebanks:sl [2012/01/16 14:01] (current)
zeman Parsing.
Line 32: Line 32:
     * Tomaž Erjavec, Peter Holozan, Vojko Gorjanc, Marko Stabej: [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|Morphosyntactic tagset specification for Slovene]]     * Tomaž Erjavec, Peter Holozan, Vojko Gorjanc, Marko Stabej: [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|Morphosyntactic tagset specification for Slovene]]
     * [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/anal.html|The analytical layer of the Prague Dependency Treebank]]     * [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/anal.html|The analytical layer of the Prague Dependency Treebank]]
 +    * Morphological and syntactic tags are also documented directly inside the TEI XML data file.
  
 ==== Domain ==== ==== Domain ====
Line 39: Line 40:
 ==== Size ==== ==== Size ====
  
-The CoNLL 2006 version contains 196,151 tokens in 13221 sentences, yielding 14.84 tokens per sentence on average (CoNLL 2006 data split: 190,217 tokens / 12823 sentences training, 5934 tokens / 398 sentences test).+The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test).
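The Slovene figures in the new revision can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers quoted in the paragraph above; the names `TRAIN`, `TEST`, `totals` and `tokens_per_sentence` are illustrative and do not come from any treebank tool.

```python
# Figures copied from the Size paragraph (new revision, Slovene data).
TRAIN = {"tokens": 28750, "sentences": 1534}
TEST = {"tokens": 6390, "sentences": 402}


def totals(*parts):
    """Sum token and sentence counts over the given data splits."""
    tokens = sum(part["tokens"] for part in parts)
    sentences = sum(part["sentences"] for part in parts)
    return tokens, sentences


def tokens_per_sentence(tokens, sentences):
    """Average sentence length, rounded to two decimals."""
    return round(tokens / sentences, 2)


total_tokens, total_sentences = totals(TRAIN, TEST)
average = tokens_per_sentence(total_tokens, total_sentences)
print(total_tokens, total_sentences, average)  # 35140 1936 18.15
```

The training and test splits add up to the stated totals, and the average matches the 18.15 tokens per sentence given in the text.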
  
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset.+The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sl::conll|DZ Interset]] to inspect the CoNLL tagset.
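To illustrate how the CPOS, POS and FEAT columns sit inside a CoNLL 2006 row, here is a minimal sketch that splits one row into named fields. It assumes the standard ten tab-separated CoNLL-X columns; the column names in `CONLL_COLUMNS` and the helper functions are hypothetical, and the example row is the second token (`je`) of the first training sentence.

```python
# Column names follow the usual CoNLL-X order; the identifiers are illustrative.
CONLL_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def parse_feats(feats):
    """Turn 'A=x|B=y' into a dict; '_' means the token has no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conll_row(line):
    """Split one tab-separated CoNLL row into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLL_COLUMNS):
        raise ValueError(f"expected {len(CONLL_COLUMNS)} columns, got {len(values)}")
    row = dict(zip(CONLL_COLUMNS, values))
    row["id"] = int(row["id"])
    row["head"] = int(row["head"])
    row["feats"] = parse_feats(row["feats"])
    return row


EXAMPLE = "\t".join([
    "2", "je", "biti", "Verb", "Verb-copula",
    "VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no",
    "1", "AuxV", "_", "_",
])

row = parse_conll_row(EXAMPLE)
print(row["cpostag"], row["postag"], row["feats"]["Tense"], row["head"])
```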
  
-The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually+The morphological analysis includes lemmas. The morphosyntactic tags have been assigned (probably) manually.
- +
-The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).+
  
 ==== Sample ==== ==== Sample ====
  
-The first three sentences of the CoNLL 2006 training data:+The first sentence of the treebank in the TEI-compliant XML format: 
 + 
 +<code xml>    <text id="Osl." lang="sl"> 
 +      <body> 
 +        <div type="part" id="Osl.1"> 
 +          <div type="chapter" id="Osl.1.2"> 
 +            <p id="Osl.1.2.2"> 
 +              <s id="Osl.1.2.2.1"> 
 +                <w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w> 
 +                <w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w> 
 +                <w id="s1t3" afun="Atr" parallel="Co" dep="s1t4" lemma="jasen" ana="Afpmsnn">jasen</w> 
 +                <c id="s1t4" afun="Coord" dep="s1t7">,</c> 
 +                <w id="s1t5" afun="Atr" parallel="Co" dep="s1t4" lemma="mrzel" ana="Afpmsnn">mrzel</w> 
 +                <w id="s1t6" afun="Atr" dep="s1t7" lemma="aprilski" ana="Aopmsn">aprilski</w> 
 +                <w id="s1t7" afun="Sb" dep="s1t1" lemma="dan" ana="Ncmsn">dan</w> 
 +                <w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w> 
 +                <w id="s1t9" afun="Sb" dep="s1t11" lemma="ura" ana="Ncfpn">ure</w> 
 +                <w id="s1t10" afun="AuxV" dep="s1t11" lemma="biti" ana="Vcip3p--n">so</w> 
 +                <w id="s1t11" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vmps-pfa">bile</w> 
 +                <w id="s1t12" afun="Obj" dep="s1t11" lemma="trinajst" ana="Mcnpnl">trinajst</w> 
 +                <c id="s1t13" afun="AuxK" dep="root">.</c> 
 +              </s></code> 
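The attributes in the XML sample map fairly directly onto dependency rows: `id` and `dep` give the token and head positions, `afun` the relation label, `lemma` and `ana` the morphology. The following is a minimal sketch of that mapping using Python's standard `xml.etree.ElementTree`. It works on a self-contained copy of the `<s>` element only (the sample above is truncated before its enclosing elements close), and the helper names, the lemma fallback for `<c>` elements and the `_` placeholder for a missing `ana` are assumptions made for illustration, not the conversion actually used for CoNLL 2006.

```python
import re
import xml.etree.ElementTree as ET

# The <s> element from the sample, closed so that it parses on its own.
SENTENCE = """<s id="Osl.1.2.2.1">
<w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w>
<w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w>
<w id="s1t3" afun="Atr" parallel="Co" dep="s1t4" lemma="jasen" ana="Afpmsnn">jasen</w>
<c id="s1t4" afun="Coord" dep="s1t7">,</c>
<w id="s1t5" afun="Atr" parallel="Co" dep="s1t4" lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w id="s1t6" afun="Atr" dep="s1t7" lemma="aprilski" ana="Aopmsn">aprilski</w>
<w id="s1t7" afun="Sb" dep="s1t1" lemma="dan" ana="Ncmsn">dan</w>
<w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w>
<w id="s1t9" afun="Sb" dep="s1t11" lemma="ura" ana="Ncfpn">ure</w>
<w id="s1t10" afun="AuxV" dep="s1t11" lemma="biti" ana="Vcip3p--n">so</w>
<w id="s1t11" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vmps-pfa">bile</w>
<w id="s1t12" afun="Obj" dep="s1t11" lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c id="s1t13" afun="AuxK" dep="root">.</c>
</s>"""


def token_number(token_id):
    """Map an id such as 's1t8' to 8, and the pseudo-id 'root' to 0."""
    if token_id == "root":
        return 0
    return int(re.search(r"t(\d+)$", token_id).group(1))


def sentence_to_rows(xml_text):
    """Return (id, form, lemma, ana, head, afun) tuples for one <s> element."""
    rows = []
    for element in ET.fromstring(xml_text):
        rows.append((
            token_number(element.get("id")),
            element.text,
            element.get("lemma", element.text),  # assumption: <c> has no lemma
            element.get("ana", "_"),             # assumption: '_' when absent
            token_number(element.get("dep")),
            element.get("afun"),
        ))
    return rows


rows = sentence_to_rows(SENTENCE)
for row in rows:
    print(*row, sep="\t")
```

Both `in` (token 8) and the final full stop (token 13) carry `dep="root"`, so the sketch assigns head 0 to each of them.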
 + 
 +The first sentence of the CoNLL 2006 training data:
  
-| 1 | Глава Nc ROOT ROOT | +| 1 | Bil | biti | Verb | <nowiki>Verb-copula</nowiki> | <nowiki>VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active</nowiki> | 8 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-трета | _ | Mo gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod mod +| 2 | je | biti | Verb | <nowiki>Verb-copula</nowiki> | <nowiki>VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no</nowiki> | 1 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| |||||||||| +| 3 | jasen | jasen | Adjective | <nowiki>Adjective-qualificative</nowiki> | <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no</nowiki> | 4 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-НАРОДНО | _ | An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod | 2 | mod | +| 4 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | PUNC | PUNC | <nowiki>_</nowiki> | 7 | Coord | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 2 | СЪБРАНИЕ | _ | Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT ROOT | +| 5 | mrzel | mrzel | Adjective | <nowiki>Adjective-qualificative</nowiki> | <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no</nowiki> | 4 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| |||||||||| +| 6 | aprilski | aprilski | Adjective | <nowiki>Adjective-ordinal</nowiki> | <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=nominative</nowiki> | 7 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-Народното | _ An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d mod mod +| 7 | dan | dan | Noun | <nowiki>Noun-common</nowiki> | <nowiki>Gender=masculine|Number=singular|Case=nominative</nowiki> | 1 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-събрание Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i subj subj +| 8 | in | in | Conjunction | <nowiki>Conjunction-coordinating</nowiki> | <nowiki>Formation=simple</nowiki> | 0 | Coord | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-осъществява Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 0 | ROOT | 0 | ROOT +| 9 | ure | ura | Noun | <nowiki>Noun-common</nowiki> | <nowiki>Gender=feminine|Number=plural|Case=nominative</nowiki> | 11 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-законодателната Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d mod mod +| 10 | so | biti | Verb | <nowiki>Verb-copula</nowiki> | <nowiki>VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no</nowiki> | 11 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-власт Nc obj obj | +| 11 | bile | biti | Verb | <nowiki>Verb-main</nowiki> | <nowiki>VForm=participle|Tense=past|Number=plural|Gender=feminine|Voice=active</nowiki> | 8 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-6 | и | _ | C | Cp | | 3 | conj | 3 | conj +| 12 | trinajst | trinajst | Numeral | <nowiki>Numeral-cardinal</nowiki> | <nowiki>Gender=neuter|Number=plural|Case=nominative|Form=letter</nowiki> | 11 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-упражнява Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s conjarg conjarg | +| 13 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PUNC | PUNC | <nowiki>_</nowiki> | 0 | AuxK | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-парламентарен Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 9 | mod | 9 | mod +
-контрол | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i obj obj | +
-| 10 | . | _ | Punct | Punct | | 3 | punct | 3 | punct |+
  
-The first three sentences of the CoNLL 2006 test data:+The first sentence of the CoNLL 2006 test data:
  
-| 1 | Единственото An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d mod mod +| 1 | Na na Adposition | <nowiki>Adposition-preposition</nowiki> <nowiki>Formation=simple|Case=locative</nowiki>AuxP <nowiki>_</nowiki> <nowiki>_</nowiki> 
-| 2 | решение Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT ROOT | +| 2 | hrbtu hrbet Noun <nowiki>Noun-common</nowiki> | <nowiki>Gender=masculine|Number=singular|Case=locative</nowiki> | 1 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
-| |||||||||| +je biti Verb <nowiki>Verb-copula</nowiki> <nowiki>VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no</nowiki> AuxV <nowiki>_</nowiki> <nowiki>_</nowiki> 
-Ерик Np gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=ROOT ROOT | +lahko lahko Adverb <nowiki>Adverb-general</nowiki> <nowiki>Degree=positive</nowiki> | 5 | AuxY <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-Франк | _ | Np gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +| 5 | čutil | čutiti | Verb | <nowiki>Verb-main</nowiki> | <nowiki>VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active</nowiki> Pred <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-Ръсел Hm gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=mod mod +| <nowiki>,</nowiki> <nowiki>,</nowiki> | PUNC | PUNC <nowiki>_</nowiki>AuxX <nowiki>_</nowiki> <nowiki>_</nowiki> 
-| |||||||||| +da da Conjunction <nowiki>Conjunction-subordinating</nowiki> <nowiki>Formation=simple</nowiki> | 5 | AuxC <nowiki>_</nowiki> <nowiki>_</nowiki>
-Пълен | _ | Am gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +| 8 | vsi | ves | Pronoun <nowiki>Pronoun-general</nowiki> | <nowiki>Gender=masculine|Number=plural|Case=nominative|Syntactic-Type=nominal</nowiki> 9 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-мрак Nc gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=ROOT ROOT | +upirajo upirati Verb <nowiki>Verb-main</nowiki> <nowiki>VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no</nowiki> Obj <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-и Cp | _ | conj conj | +10 | oči | oči | Noun | <nowiki>Noun-common</nowiki> <nowiki>Gender=feminine|Number=plural|Case=accusative</nowiki>Obj <nowiki>_</nowiki> <nowiki>_</nowiki> 
-пълна Af gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod | +11 Adposition <nowiki>Adposition-preposition</nowiki> <nowiki>Formation=simple|Case=accusative</nowiki> | 9 | AuxP | <nowiki>_</nowiki> <nowiki>_</nowiki>
-самота Nc conjarg 2 | conjarg +| 12 | njegov | njegov | Pronoun <nowiki>Pronoun-possessive</nowiki> | <nowiki>Person=third|Gender=masculine|Number=singular|Case=accusative|Owner-Number=singular|Owner-Gender=masculine|Syntactic-Type=adjectival|Animate=no</nowiki> 14 Atr <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-| . | Punct Punct | _ | punct punct |+13 modri moder Adjective <nowiki>Adjective-qualificative</nowiki> <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=accusative|Definiteness=yes|Animate=no</nowiki> | 14 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
 +14 kombinezon kombinezon Noun <nowiki>Noun-common</nowiki> <nowiki>Gender=masculine|Number=singular|Case=accusative|Animate=no</nowiki> 11 Adv <nowiki>_</nowiki> <nowiki>_</nowiki> 
 +15 <nowiki>.</nowiki> <nowiki>.</nowiki> PUNC PUNC <nowiki>_</nowiki> AuxK <nowiki>_</nowiki> <nowiki>_</nowiki> |
  
 ==== Parsing ==== ==== Parsing ====
  
-Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).+Nonprojectivities in SDT are not frequent. Only 675 of the 35140 tokens in the CoNLL 2006 version are attached nonprojectively (1.92%).
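The page does not spell out how nonprojective attachments were counted. A common definition, assumed here, treats a token as nonprojectively attached when some token lying strictly between it and its head is not a descendant of that head. The sketch below implements this check for a single well-formed tree (no cycles); `HEADS` holds the head indices of the first training sentence as given by the `dep` attributes in the XML sample, while `TOY` is a made-up four-token tree included only to show a positive case.

```python
def is_descendant(heads, node, ancestor):
    """True if `ancestor` is on the head chain from `node` up to the root (0)."""
    if ancestor == 0:
        return True  # every token descends from the artificial root
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads):
    """Return the ids of tokens whose arc to their head is nonprojective.

    `heads[i]` is the head of token i; index 0 is an unused placeholder.
    """
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        low, high = sorted((dep, head))
        if any(not is_descendant(heads, k, head) for k in range(low + 1, high)):
            result.append(dep)
    return result


def percentage(part, whole):
    """Share of `part` in `whole` as a percentage with two decimals."""
    return round(100 * part / whole, 2)


# Heads of the first training sentence (tokens 1..13).
HEADS = [None, 8, 1, 4, 7, 4, 7, 1, 0, 11, 11, 8, 11, 0]
# Hypothetical toy tree: token 1 hangs on token 3 across the root token 2.
TOY = [None, 3, 0, 2, 2]

print(nonprojective_tokens(HEADS))  # []
print(nonprojective_tokens(TOY))    # [1]
print(percentage(675, 35140))       # 1.92
```

Under this definition the first training sentence is fully projective, and the quoted 675 of 35140 tokens does correspond to 1.92%.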
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian:+The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Slovene:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^ ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 87.57 | 92.04 | +| MST (McDonald et al.) | 73.44 | 83.17 |
-| Malt (Nivre et al.) | 87.41 | 91.72 | +| Edinburgh (Riedel et al.) | 71.20 | 83.17 |
-| Nara (Yuchang Cheng) | 86.34 | 91.30 | +| Microsoft (Corston-Oliver and Aue) | 72.42 | 81.77 |
 +| Basis (John O'Neil) | 71.08 | 81.71 |
 +| Nara (Yuchang Cheng) | 71.42 | 81.14 |
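LAS and UAS in the table are computed over non-punctuation tokens only. The following sketch shows the general idea and is not the official CoNLL 2006 evaluation script: it assumes that a token counts as punctuation when every character of its form belongs to a Unicode punctuation category, and the predicted heads and labels in `SAMPLE` are invented for illustration (the gold values come from the first training sentence).

```python
import unicodedata


def is_punctuation(form):
    """Assumption: a token is punctuation if all its characters are Unicode P*."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(tokens):
    """Return (LAS, UAS) in percent over non-punctuation tokens.

    Each token is (form, gold_head, gold_deprel, pred_head, pred_deprel).
    """
    scored = [t for t in tokens if not is_punctuation(t[0])]
    if not scored:
        return 0.0, 0.0
    uas_hits = sum(1 for _, gh, _, ph, _ in scored if gh == ph)
    las_hits = sum(1 for _, gh, gd, ph, pd in scored if gh == ph and gd == pd)
    return 100 * las_hits / len(scored), 100 * uas_hits / len(scored)


# Gold annotation from the sample sentence; predictions are hypothetical.
SAMPLE = [
    ("Bil", 8, "Pred", 8, "Pred"),    # head and label correct
    ("je", 1, "AuxV", 1, "AuxV"),     # head and label correct
    ("jasen", 4, "Atr", 4, "Sb"),     # head correct, label wrong
    (",", 7, "Coord", 1, "Coord"),    # punctuation: not scored
    ("dan", 1, "Sb", 8, "Sb"),        # head wrong
]

las, uas = attachment_scores(SAMPLE)
print(las, uas)  # 50.0 75.0
```

Note that the comma is skipped even though it heads a coordination in the gold tree, which is one reason scores computed this way are not directly comparable with evaluations that include punctuation.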
  
