user:zeman:treebanks:sl [2012/01/16 13:21] zeman created |
user:zeman:treebanks:sl [2012/01/16 14:01] (current) zeman Parsing. |
* Tomaž Erjavec, Peter Holozan, Vojko Gorjanc, Marko Stabej: [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|Morphosyntactic tagset specification for Slovene]] | * Tomaž Erjavec, Peter Holozan, Vojko Gorjanc, Marko Stabej: [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|Morphosyntactic tagset specification for Slovene]] |
* [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/anal.html|The analytical layer of the Prague Dependency Treebank]] | * [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/anal.html|The analytical layer of the Prague Dependency Treebank]] |
| * Morphological and syntactic tags are also documented directly inside the TEI XML data file. |
| |
==== Domain ==== | ==== Domain ==== |
==== Size ==== | ==== Size ==== |
| |
The CoNLL 2006 version contains 196151 tokens in 13221 sentences, yielding 14.84 tokens per sentence on average (CoNLL 2006 data split: 190217 tokens / 12823 sentences training, 5934 tokens / 398 sentences test). | The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test). |
| |
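The size figures above can be recomputed from the data files. The following is a minimal sketch, assuming the usual CoNLL 2006 layout of one token per line with a blank line between sentences; ''corpus_size'' and the tiny ''SAMPLE'' string are hypothetical and only illustrate the counting.

```python
def corpus_size(conll_text):
    """Return (tokens, sentences) for CoNLL-style text.

    Assumes one token per non-blank line and blank lines as sentence breaks.
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:
                # First token after a break opens a new sentence.
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return tokens, sentences


# Hypothetical two-sentence sample (only the first two columns are shown).
SAMPLE = "1\tNa\n2\thrbtu\n\n1\tBil\n2\tje\n3\tjasen\n"

tokens, sentences = corpus_size(SAMPLE)
print(tokens, sentences, round(tokens / sentences, 2))
```

Applied to the counts reported above, the same division gives 35140 / 1936 ≈ 18.15 tokens per sentence.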
==== Inside ==== | ==== Inside ==== |
| |
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset. | The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sl::conll|DZ Interset]] to inspect the CoNLL tagset. |
| |
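For readers who want to inspect the FEAT column programmatically, here is a minimal sketch. It assumes the column holds ''Name=value'' pairs separated by ''|'', with ''_'' standing for an empty feature set, as in the samples on this page; ''parse_feats'' is a hypothetical helper name.

```python
def parse_feats(feats):
    """Split a CoNLL FEAT string into a dict of feature name -> value.

    Assumes "Name=value" pairs joined by "|"; "_" means no features.
    """
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


feats = parse_feats(
    "VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active"
)
print(feats["Tense"], feats["Gender"])
```

A pair without ''='' raises a ''ValueError'' in this sketch, which can help to spot malformed lines early.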
The morphological analysis does not include lemmas. The morphosyntactic tags were probably assigned manually. | The morphological analysis includes lemmas. The morphosyntactic tags were probably assigned manually. |
| |
The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels). | |
| |
==== Sample ==== | ==== Sample ==== |
| |
The first three sentences of the CoNLL 2006 training data: | The first sentence of the treebank in the TEI-compliant XML format: |
| |
| <code xml> <text id="Osl." lang="sl"> |
| <body> |
| <div type="part" id="Osl.1"> |
| <div type="chapter" id="Osl.1.2"> |
| <p id="Osl.1.2.2"> |
| <s id="Osl.1.2.2.1"> |
| <w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w> |
| <w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w> |
| <w id="s1t3" afun="Atr" parallel="Co" dep="s1t4" lemma="jasen" ana="Afpmsnn">jasen</w> |
| <c id="s1t4" afun="Coord" dep="s1t7">,</c> |
| <w id="s1t5" afun="Atr" parallel="Co" dep="s1t4" lemma="mrzel" ana="Afpmsnn">mrzel</w> |
| <w id="s1t6" afun="Atr" dep="s1t7" lemma="aprilski" ana="Aopmsn">aprilski</w> |
| <w id="s1t7" afun="Sb" dep="s1t1" lemma="dan" ana="Ncmsn">dan</w> |
| <w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w> |
| <w id="s1t9" afun="Sb" dep="s1t11" lemma="ura" ana="Ncfpn">ure</w> |
| <w id="s1t10" afun="AuxV" dep="s1t11" lemma="biti" ana="Vcip3p--n">so</w> |
| <w id="s1t11" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vmps-pfa">bile</w> |
| <w id="s1t12" afun="Obj" dep="s1t11" lemma="trinajst" ana="Mcnpnl">trinajst</w> |
| <c id="s1t13" afun="AuxK" dep="root">.</c> |
| </s></code> |
| |
| The first sentence of the CoNLL 2006 training data: |
| |
| 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT | | | 1 | Bil | biti | Verb | <nowiki>Verb-copula</nowiki> | <nowiki>VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active</nowiki> | 8 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | трета | _ | M | Mo | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod | | | 2 | je | biti | Verb | <nowiki>Verb-copula</nowiki> | <nowiki>VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no</nowiki> | 1 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | 3 | jasen | jasen | Adjective | <nowiki>Adjective-qualificative</nowiki> | <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no</nowiki> | 4 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 1 | НАРОДНО | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | | | 4 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | PUNC | PUNC | <nowiki>_</nowiki> | 7 | Coord | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | СЪБРАНИЕ | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | | 5 | mrzel | mrzel | Adjective | <nowiki>Adjective-qualificative</nowiki> | <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no</nowiki> | 4 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | 6 | aprilski | aprilski | Adjective | <nowiki>Adjective-ordinal</nowiki> | <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=nominative</nowiki> | 7 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 1 | Народното | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod | | | 7 | dan | dan | Noun | <nowiki>Noun-common</nowiki> | <nowiki>Gender=masculine|Number=singular|Case=nominative</nowiki> | 1 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | събрание | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 3 | subj | 3 | subj | | | 8 | in | in | Conjunction | <nowiki>Conjunction-coordinating</nowiki> | <nowiki>Formation=simple</nowiki> | 0 | Coord | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | осъществява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 0 | ROOT | 0 | ROOT | | | 9 | ure | ura | Noun | <nowiki>Noun-common</nowiki> | <nowiki>Gender=feminine|Number=plural|Case=nominative</nowiki> | 11 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | законодателната | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 5 | mod | 5 | mod | | | 10 | so | biti | Verb | <nowiki>Verb-copula</nowiki> | <nowiki>VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no</nowiki> | 11 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj | | | 11 | bile | biti | Verb | <nowiki>Verb-main</nowiki> | <nowiki>VForm=participle|Tense=past|Number=plural|Gender=feminine|Voice=active</nowiki> | 8 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj | | | 12 | trinajst | trinajst | Numeral | <nowiki>Numeral-cardinal</nowiki> | <nowiki>Gender=neuter|Number=plural|Case=nominative|Form=letter</nowiki> | 11 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | упражнява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 3 | conjarg | 3 | conjarg | | | 13 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PUNC | PUNC | <nowiki>_</nowiki> | 0 | AuxK | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | парламентарен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 9 | mod | 9 | mod | | |
| 9 | контрол | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 7 | obj | 7 | obj | | |
| 10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct | | |
| |
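The ''dep'' attributes in the TEI sample and the head column in the CoNLL sample encode the same tree. The sketch below shows one way to derive the numeric heads from the XML. It assumes that ''dep'' holds either the ''id'' of another token in the same sentence or the literal ''root'', and that ''<c>'' elements (punctuation) carry no ''lemma'' or ''ana''; ''sentence_rows'' is a hypothetical helper, not the converter actually used for the shared task.

```python
import xml.etree.ElementTree as ET

# The <s> element from the TEI sample, closed so that it parses on its own.
SENTENCE = """<s id="Osl.1.2.2.1">
<w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w>
<w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w>
<w id="s1t3" afun="Atr" parallel="Co" dep="s1t4" lemma="jasen" ana="Afpmsnn">jasen</w>
<c id="s1t4" afun="Coord" dep="s1t7">,</c>
<w id="s1t5" afun="Atr" parallel="Co" dep="s1t4" lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w id="s1t6" afun="Atr" dep="s1t7" lemma="aprilski" ana="Aopmsn">aprilski</w>
<w id="s1t7" afun="Sb" dep="s1t1" lemma="dan" ana="Ncmsn">dan</w>
<w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w>
<w id="s1t9" afun="Sb" dep="s1t11" lemma="ura" ana="Ncfpn">ure</w>
<w id="s1t10" afun="AuxV" dep="s1t11" lemma="biti" ana="Vcip3p--n">so</w>
<w id="s1t11" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vmps-pfa">bile</w>
<w id="s1t12" afun="Obj" dep="s1t11" lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c id="s1t13" afun="AuxK" dep="root">.</c>
</s>"""


def sentence_rows(s_elem):
    """Return (index, form, lemma, ana, head, afun) tuples for one <s> element."""
    tokens = [el for el in s_elem if el.tag in ("w", "c")]
    # Map each token id to its 1-based position in the sentence.
    position = {el.get("id"): i for i, el in enumerate(tokens, start=1)}
    rows = []
    for i, el in enumerate(tokens, start=1):
        # "root" is not a token id, so it falls back to head 0
        # (assumption: any other unknown id would also become 0).
        head = position.get(el.get("dep"), 0)
        lemma = el.get("lemma", el.text)  # punctuation: reuse the form
        rows.append((i, el.text, lemma, el.get("ana", "_"), head, el.get("afun")))
    return rows


rows = sentence_rows(ET.fromstring(SENTENCE))
for row in rows:
    print("\t".join(str(field) for field in row))
```

The ''ana'' value is passed through unchanged here; expanding it into the CPOS, POS and FEAT columns would need the tagset specification.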
The first three sentences of the CoNLL 2006 test data: | The first sentence of the CoNLL 2006 test data: |
| |
| 1 | Единственото | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod | | | 1 | Na | na | Adposition | <nowiki>Adposition-preposition</nowiki> | <nowiki>Formation=simple|Case=locative</nowiki> | 5 | AuxP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | решение | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | | 2 | hrbtu | hrbet | Noun | <nowiki>Noun-common</nowiki> | <nowiki>Gender=masculine|Number=singular|Case=locative</nowiki> | 1 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | 3 | je | biti | Verb | <nowiki>Verb-copula</nowiki> | <nowiki>VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no</nowiki> | 5 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 1 | Ерик | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | | 4 | lahko | lahko | Adverb | <nowiki>Adverb-general</nowiki> | <nowiki>Degree=positive</nowiki> | 5 | AuxY | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | Франк | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod | | | 5 | čutil | čutiti | Verb | <nowiki>Verb-main</nowiki> | <nowiki>VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active</nowiki> | 0 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | Ръсел | _ | H | Hm | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | | | 6 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | PUNC | PUNC | <nowiki>_</nowiki> | 7 | AuxX | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | 7 | da | da | Conjunction | <nowiki>Conjunction-subordinating</nowiki> | <nowiki>Formation=simple</nowiki> | 5 | AuxC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 1 | Пълен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | | | 8 | vsi | ves | Pronoun | <nowiki>Pronoun-general</nowiki> | <nowiki>Gender=masculine|Number=plural|Case=nominative|Syntactic-Type=nominal</nowiki> | 9 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | мрак | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | | 9 | upirajo | upirati | Verb | <nowiki>Verb-main</nowiki> | <nowiki>VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no</nowiki> | 7 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj | | | 10 | oči | oči | Noun | <nowiki>Noun-common</nowiki> | <nowiki>Gender=feminine|Number=plural|Case=accusative</nowiki> | 9 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | пълна | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 5 | mod | 5 | mod | | | 11 | v | v | Adposition | <nowiki>Adposition-preposition</nowiki> | <nowiki>Formation=simple|Case=accusative</nowiki> | 9 | AuxP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | самота | _ | N | Nc | _ | 2 | conjarg | 2 | conjarg | | | 12 | njegov | njegov | Pronoun | <nowiki>Pronoun-possessive</nowiki> | <nowiki>Person=third|Gender=masculine|Number=singular|Case=accusative|Owner-Number=singular|Owner-Gender=masculine|Syntactic-Type=adjectival|Animate=no</nowiki> | 14 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct | | | 13 | modri | moder | Adjective | <nowiki>Adjective-qualificative</nowiki> | <nowiki>Degree=positive|Gender=masculine|Number=singular|Case=accusative|Definiteness=yes|Animate=no</nowiki> | 14 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 14 | kombinezon | kombinezon | Noun | <nowiki>Noun-common</nowiki> | <nowiki>Gender=masculine|Number=singular|Case=accusative|Animate=no</nowiki> | 11 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 15 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PUNC | PUNC | <nowiki>_</nowiki> | 0 | AuxK | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in BTB are rare. Only 747 of the 196151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%). | Nonprojectivities in SDT are not frequent. Only 675 of the 35140 tokens in the CoNLL 2006 version are attached nonprojectively (1.92%). |
| |
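To make the notion of a nonprojective attachment concrete, here is a minimal sketch. It assumes the common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head. ''nonprojective_tokens'' is a hypothetical helper, and the exact counting procedure behind the figures above may differ in details.

```python
def nonprojective_tokens(heads):
    """Return the 1-based indices of nonprojectively attached tokens.

    heads[i] is the head of token i + 1; 0 denotes the artificial root.
    """
    n = len(heads)

    def dominated_by(node, ancestor):
        # Walk up the head chain; the step limit guards against cycles.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return node == ancestor  # True only when the ancestor is the root

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        low, high = sorted((head, dep))
        if any(not dominated_by(k, head) for k in range(low + 1, high)):
            result.append(dep)
    return result


# Heads of the first training sentence shown in the Sample section.
first_training = [8, 1, 4, 7, 4, 7, 1, 0, 11, 11, 8, 11, 0]
print(nonprojective_tokens(first_training))

# Hypothetical four-token tree in which token 2 attaches across token 3.
crossing = [3, 4, 0, 3]
print(nonprojective_tokens(crossing))

# Share of nonprojective tokens from the counts reported above.
print(round(100 * 675 / 35140, 2))
```

Under this definition the sample sentence contains no nonprojective attachment, while token 2 of the hypothetical tree is reported because token 3 lies between it and its head without being dominated by that head.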
The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian: | The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Slovene: |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ |
| MST (McDonald et al.) | 87.57 | 92.04 | | | MST (McDonald et al.) | 73.44 | 83.17 | |
| Malt (Nivre et al.) | 87.41 | 91.72 | | | Edinburgh (Riedel et al.) | 71.20 | 83.17 | |
| Nara (Yuchang Cheng) | 86.34 | 91.30 | | | Microsoft (Corston-Oliver and Aue) | 72.42 | 81.77 | |
| | Basis (John O'Neil) | 71.08 | 81.71 | |
| | Nara (Yuchang Cheng) | 71.42 | 81.14 | |
| |
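The effect of excluding punctuation from scoring can be illustrated with a small sketch. It assumes that a token counts as punctuation when every character of its form belongs to a Unicode punctuation category; the official evaluation script may draw the line differently. ''attachment_scores'' and the toy data are hypothetical.

```python
import unicodedata


def is_punctuation(form):
    """True if the form is non-empty and consists only of punctuation characters."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent, skipping punctuation tokens.

    Both arguments are parallel lists of (form, head, deprel) tuples.
    """
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # excluded from the evaluation
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100 * las_hits / scored, 100 * uas_hits / scored


# Toy example: one wrong label on a word, one wrong head on a comma.
gold = [("Na", 5, "AuxP"), ("hrbtu", 1, "Adv"), (",", 7, "AuxX"), (".", 0, "AuxK")]
pred = [("Na", 5, "AuxP"), ("hrbtu", 1, "Obj"), (",", 5, "AuxX"), (".", 0, "AuxK")]
las, uas = attachment_scores(gold, pred)
print(las, uas)
```

In the toy example the wrong head on the comma does not lower UAS because the comma is never scored, which is why results computed this way are not directly comparable with evaluations that count all tokens.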