SDT is natively dependency-based, modeled after the Prague Dependency Treebank of Czech.
SDT in all data formats is freely downloadable from http://nl.ijs.si/sdt/data/. The license in short:
SDT was created by members of the Institut “Jožef Stefan”, Jamova cesta 39, 1000 Ljubljana, Slovenia.
The domain is fiction: the text is the Slovene translation of George Orwell's “1984” from the Multext-East corpus.
The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test).
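The size figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers quoted in the text; the variable names are illustrative and not part of any SDT tooling.

```python
# Sizes of the CoNLL 2006 data split as quoted above.
train = {"tokens": 28750, "sentences": 1534}
test = {"tokens": 6390, "sentences": 402}

# The two parts should add up to the totals given for the whole treebank.
total_tokens = train["tokens"] + test["tokens"]
total_sentences = train["sentences"] + test["sentences"]

# Average sentence length over the whole treebank.
average_length = total_tokens / total_sentences

print(total_tokens, total_sentences, f"{average_length:.2f}")
# prints: 35140 1936 18.15
```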
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a one-to-one mapping between the SDT positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.
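As an illustration of how the three tag columns fit into a token line, here is a minimal Python sketch that splits one line into the ten CoNLL-X columns and turns the FEAT column into a dictionary. It assumes the distributed files are tab-separated, as in the CoNLL-X format (the samples on this page are shown with ` | ` between columns only for display); the function names `parse_feats` and `parse_conll_line` are hypothetical helpers, not part of any official tool.

```python
# Column names of the CoNLL-X (CoNLL 2006) format, in order.
CONLL_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                "feats", "head", "deprel", "phead", "pdeprel"]


def parse_feats(feats):
    """Turn 'Gender=masculine|Number=singular' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conll_line(line):
    """Parse one tab-separated token line (assumed layout) into a dict."""
    values = line.rstrip("\n").split("\t")
    token = dict(zip(CONLL_FIELDS, values))
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    token["feats"] = parse_feats(token["feats"])
    return token


# Token 7 of the first training sentence, written with tabs between columns.
line = ("7\tdan\tdan\tNoun\tNoun-common\t"
        "Gender=masculine|Number=singular|Case=nominative\t1\tSb\t_\t_")
token = parse_conll_line(line)
print(token["cpostag"], token["postag"], token["feats"]["Case"], token["head"])
```

Keeping FEAT as a dictionary makes it easy to compare individual features (for example `Case`) without relying on their order inside the column.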
The morphological analysis includes lemmas. The morphosyntactic tags have been assigned (probably) manually.
The first sentence of the treebank in the TEI-compliant XML format:
<text id="Osl." lang="sl">
<body>
<div type="part" id="Osl.1">
<div type="chapter" id="Osl.1.2">
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1">
<w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w>
<w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w>
<w id="s1t3" afun="Atr" parallel="Co" dep="s1t4" lemma="jasen" ana="Afpmsnn">jasen</w>
<c id="s1t4" afun="Coord" dep="s1t7">,</c>
<w id="s1t5" afun="Atr" parallel="Co" dep="s1t4" lemma="mrzel" ana="Afpmsnn">mrzel</w>
<w id="s1t6" afun="Atr" dep="s1t7" lemma="aprilski" ana="Aopmsn">aprilski</w>
<w id="s1t7" afun="Sb" dep="s1t1" lemma="dan" ana="Ncmsn">dan</w>
<w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w>
<w id="s1t9" afun="Sb" dep="s1t11" lemma="ura" ana="Ncfpn">ure</w>
<w id="s1t10" afun="AuxV" dep="s1t11" lemma="biti" ana="Vcip3p--n">so</w>
<w id="s1t11" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vmps-pfa">bile</w>
<w id="s1t12" afun="Obj" dep="s1t11" lemma="trinajst" ana="Mcnpnl">trinajst</w>
<c id="s1t13" afun="AuxK" dep="root">.</c>
</s>
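The XML encoding stores each dependency as a `dep` attribute pointing to the `id` of the governing token, with `root` for tokens attached to the artificial root. The following Python sketch reads a shortened, well-formed copy of the sentence above and converts these references to numeric head indices comparable to the CoNLL HEAD column. The helper names `token_index` and `read_sentence` are hypothetical, and the sketch assumes that token ids always end in `t` followed by the token's position in the sentence, as they do in this sample.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the sample sentence (only four tokens kept, for brevity).
SNIPPET = """<s id="Osl.1.2.2.1">
<w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w>
<w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w>
<w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w>
<c id="s1t13" afun="AuxK" dep="root">.</c>
</s>"""


def token_index(ref):
    """Map 's1t8' to 8 and 'root' to 0 (assumed id pattern: ...t<position>)."""
    if ref == "root":
        return 0
    return int(re.search(r"t(\d+)$", ref).group(1))


def read_sentence(xml_text):
    """Return (index, form, lemma, head, afun) tuples for the <w> and <c> children."""
    sentence = ET.fromstring(xml_text)
    rows = []
    for element in sentence:
        if element.tag not in ("w", "c"):
            continue
        form = element.text
        # Punctuation elements (<c>) carry no lemma attribute; fall back to the form.
        lemma = element.get("lemma", form)
        rows.append((token_index(element.get("id")), form, lemma,
                     token_index(element.get("dep")), element.get("afun")))
    return rows


rows = read_sentence(SNIPPET)
for row in rows:
    print(row)
```

For the first token this yields head 8 and function `Pred`, which matches the HEAD and DEPREL values in the CoNLL 2006 rendering of the same sentence.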
The first sentence of the CoNLL 2006 training data:
1 | Bil | biti | Verb | Verb-copula | VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active | 8 | Pred | _ | _ |
2 | je | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no | 1 | AuxV | _ | _ |
3 | jasen | jasen | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no | 4 | Atr | _ | _ |
4 | , | , | PUNC | PUNC | _ | 7 | Coord | _ | _ |
5 | mrzel | mrzel | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no | 4 | Atr | _ | _ |
6 | aprilski | aprilski | Adjective | Adjective-ordinal | Degree=positive|Gender=masculine|Number=singular|Case=nominative | 7 | Atr | _ | _ |
7 | dan | dan | Noun | Noun-common | Gender=masculine|Number=singular|Case=nominative | 1 | Sb | _ | _ |
8 | in | in | Conjunction | Conjunction-coordinating | Formation=simple | 0 | Coord | _ | _ |
9 | ure | ura | Noun | Noun-common | Gender=feminine|Number=plural|Case=nominative | 11 | Sb | _ | _ |
10 | so | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no | 11 | AuxV | _ | _ |
11 | bile | biti | Verb | Verb-main | VForm=participle|Tense=past|Number=plural|Gender=feminine|Voice=active | 8 | Pred | _ | _ |
12 | trinajst | trinajst | Numeral | Numeral-cardinal | Gender=neuter|Number=plural|Case=nominative|Form=letter | 11 | Obj | _ | _ |
13 | . | . | PUNC | PUNC | _ | 0 | AuxK | _ | _ |
The first sentence of the CoNLL 2006 test data:
1 | Na | na | Adposition | Adposition-preposition | Formation=simple|Case=locative | 5 | AuxP | _ | _ |
2 | hrbtu | hrbet | Noun | Noun-common | Gender=masculine|Number=singular|Case=locative | 1 | Adv | _ | _ |
3 | je | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no | 5 | AuxV | _ | _ |
4 | lahko | lahko | Adverb | Adverb-general | Degree=positive | 5 | AuxY | _ | _ |
5 | čutil | čutiti | Verb | Verb-main | VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active | 0 | Pred | _ | _ |
6 | , | , | PUNC | PUNC | _ | 7 | AuxX | _ | _ |
7 | da | da | Conjunction | Conjunction-subordinating | Formation=simple | 5 | AuxC | _ | _ |
8 | vsi | ves | Pronoun | Pronoun-general | Gender=masculine|Number=plural|Case=nominative|Syntactic-Type=nominal | 9 | Sb | _ | _ |
9 | upirajo | upirati | Verb | Verb-main | VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no | 7 | Obj | _ | _ |
10 | oči | oči | Noun | Noun-common | Gender=feminine|Number=plural|Case=accusative | 9 | Obj | _ | _ |
11 | v | v | Adposition | Adposition-preposition | Formation=simple|Case=accusative | 9 | AuxP | _ | _ |
12 | njegov | njegov | Pronoun | Pronoun-possessive | Person=third|Gender=masculine|Number=singular|Case=accusative|Owner-Number=singular|Owner-Gender=masculine|Syntactic-Type=adjectival|Animate=no | 14 | Atr | _ | _ |
13 | modri | moder | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=accusative|Definiteness=yes|Animate=no | 14 | Atr | _ | _ |
14 | kombinezon | kombinezon | Noun | Noun-common | Gender=masculine|Number=singular|Case=accusative|Animate=no | 11 | Adv | _ | _ |
15 | . | . | PUNC | PUNC | _ | 0 | AuxK | _ | _ |
Nonprojective attachments are infrequent in SDT: only 675 of the 35140 tokens in the CoNLL 2006 version are attached nonprojectively (1.92%).
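To make the notion concrete, here is a small Python sketch of one common way to count nonprojective attachments: a token counts as nonprojective if some token lying between it and its head is not dominated by that head. This is an illustrative definition and implementation, not necessarily the exact procedure used to obtain the figure above; the function names are hypothetical, and the four-token tree at the end is invented purely to show a positive case.

```python
def is_dominated(node, ancestor, heads):
    """True if `ancestor` lies on the path from `node` up to the artificial root 0."""
    while True:
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node]


def nonprojective_tokens(head_column):
    """head_column[i] is the head of token i+1; 0 marks the artificial root.

    Returns the 1-based indices of tokens whose attachment is nonprojective.
    Assumes the heads form a tree (no cycles).
    """
    heads = [0] + list(head_column)  # shift so that heads[i] is the head of token i
    result = []
    for dependent in range(1, len(heads)):
        head = heads[dependent]
        low, high = sorted((head, dependent))
        # Every token strictly between head and dependent must be under the head.
        if any(not is_dominated(between, head, heads)
               for between in range(low + 1, high)):
            result.append(dependent)
    return result


# HEAD columns of the two sample sentences shown above.
train_heads = [8, 1, 4, 7, 4, 7, 1, 0, 11, 11, 8, 11, 0]
test_heads = [5, 1, 5, 5, 0, 7, 5, 9, 7, 9, 9, 14, 14, 11, 0]
print(nonprojective_tokens(train_heads))  # prints: []
print(nonprojective_tokens(test_heads))   # prints: []

# Invented tree: token 2 depends on token 4, but token 3 in between does not.
print(nonprojective_tokens([3, 4, 0, 3]))  # prints: [2]

# Share of nonprojective tokens quoted in the text.
print(f"{100 * 675 / 35140:.2f}%")  # prints: 1.92%
```

Both sample sentences turn out to be fully projective, which is consistent with the low overall rate.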
The results of the CoNLL 2006 shared task are available online and were published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard in that punctuation tokens were excluded from scoring. These are the best results for Slovene:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 73.44 | 83.17 |
Edinburgh (Riedel et al.) | 71.20 | 83.17 |
Microsoft (Corston-Oliver and Aue) | 72.42 | 81.77 |
Basis (John O'Neil) | 71.08 | 81.71 |
Nara (Yuchang Cheng) | 71.42 | 81.14 |
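For readers unfamiliar with the two metrics, the following Python sketch shows how labeled and unlabeled attachment scores (LAS and UAS) can be computed while skipping punctuation tokens. It is not the official evaluation script: the punctuation test (every character of the form has a Unicode punctuation category) is an assumption about how such tokens might be identified, the function names are hypothetical, and the predicted analysis is invented for illustration.

```python
import unicodedata


def is_punctuation(form):
    """Assumed criterion: every character belongs to a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """gold: (form, head, deprel) per token; predicted: (head, deprel) per token.

    Returns (LAS, UAS) in percent, computed over non-punctuation tokens only.
    """
    scored = uas_hits = las_hits = 0
    for (form, gold_head, gold_rel), (pred_head, pred_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # punctuation tokens do not count towards either score
        scored += 1
        if gold_head == pred_head:
            uas_hits += 1          # correct head
            if gold_rel == pred_rel:
                las_hits += 1      # correct head and correct label
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


# Gold annotation: the first five tokens of the first training sentence.
gold = [("Bil", 8, "Pred"), ("je", 1, "AuxV"), ("jasen", 4, "Atr"),
        (",", 7, "Coord"), ("mrzel", 4, "Atr")]
# Invented parser output with one wrong head and one wrong label.
predicted = [(8, "Pred"), (1, "AuxV"), (7, "Atr"), (4, "AuxX"), (4, "Sb")]

las, uas = attachment_scores(gold, predicted)
print(las, uas)  # prints: 50.0 75.0
```

The comma is skipped, so four tokens are scored: three have the correct head (UAS 75.0) and two of those also have the correct label (LAS 50.0).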