===== Slovene (sl) =====

[[http://nl.ijs.si/sdt/|Slovene Dependency Treebank]] (SDT)

==== Versions ====

  * Original TEI-compliant XML format
  * [[:format-fs|FS format]] (readable by the TrEd tree editor and viewer)
  * [[:format-conll|CoNLL 2006 format]]

SDT is natively dependency-based, modeled after the Prague Dependency Treebank of [[cs|Czech]].

==== Obtaining and License ====

SDT in all data formats is freely downloadable from http://nl.ijs.si/sdt/data/. The [[http://nl.ijs.si/sdt/data/SDT-2006-05-17/00README.txt|license]] in short:

  * research usage
  * cite the principal publication in your publications
  * redistributability not discussed (might be permitted under the same conditions, but ask the authors first)

SDT was created by members of the [[http://www.ijs.si/|Institut “Jožef Stefan”]], Jamova cesta 39, 1000 Ljubljana, Slovenia.

==== References ====

  * Website
    * http://nl.ijs.si/sdt/
  * Data
    * //no separate citation//
  * Principal publications
    * Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, Andreja Žele: [[http://nl.ijs.si/sdt/bib/SDT-LREC06.pdf|Towards a Slovene Dependency Treebank]]. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC'06, 24-26 May 2006. Genova, Italy, 2006.
  * Documentation
    * Tomaž Erjavec, Peter Holozan, Vojko Gorjanc, Marko Stabej: [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|Morphosyntactic tagset specification for Slovene]]
    * [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/anal.html|The analytical layer of the Prague Dependency Treebank]]
    * Morphological and syntactic tags are also documented directly inside the TEI XML data file.

==== Domain ====

Fiction (Multext-East Orwell's “1984”).

==== Size ====

The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test).
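
The size figures can be recomputed from any of the CoNLL 2006 files. The following Python sketch assumes the usual CoNLL 2006 layout (one token per line, tab-separated columns, a blank line after each sentence). The function name ''corpus_stats'' and the tiny inline sample are illustrative only and are not part of the SDT distribution.

```python
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, float]:
    """Count tokens and sentences in CoNLL-style input.

    Assumes one token per non-blank line and a blank line after each
    sentence (a missing final blank line is tolerated).
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:  # input did not end with a blank line
        sentences += 1
    average = tokens / sentences if sentences else 0.0
    return tokens, sentences, average


# Illustrative two-sentence sample: word forms borrowed from the treebank,
# columns abridged to keep the example short.
sample = [
    "1\tBil\tbiti\tVerb",
    "2\tje\tbiti\tVerb",
    "",
    "1\tNa\tna\tAdposition",
    "2\thrbtu\thrbet\tNoun",
    "3\tje\tbiti\tVerb",
    "",
]
print(corpus_stats(sample))  # (5, 2, 2.5)

# Cross-check of the figures quoted in the Size paragraph.
assert 28750 + 6390 == 35140 and 1534 + 402 == 1936
print(round(35140 / 1936, 2))  # 18.15
```

To run it on a downloaded file, pass an open file handle instead of the list, for example ''corpus_stats(open(path, encoding="utf-8"))'', where ''path'' is wherever the file was saved.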
==== Inside ====

The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the original [[http://nl.ijs.si/ME/V3/msd/html/msd.html#SECTION05600000000000000000|Multext-East positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sl::conll|DZ Interset]] to inspect the CoNLL tagset.

The morphological analysis includes lemmas. The morphosyntactic tags have been assigned (probably) manually.

==== Sample ====

The first sentence of the treebank in the TEI-compliant XML format:

Bil je jasen , mrzel aprilski dan in ure so bile trinajst .

The first sentence of the CoNLL 2006 training data:

| 1 | Bil | biti | Verb | Verb-copula | VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active | 8 | Pred | _ | _ |
| 2 | je | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no | 1 | AuxV | _ | _ |
| 3 | jasen | jasen | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no | 4 | Atr | _ | _ |
| 4 | , | , | PUNC | PUNC | _ | 7 | Coord | _ | _ |
| 5 | mrzel | mrzel | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no | 4 | Atr | _ | _ |
| 6 | aprilski | aprilski | Adjective | Adjective-ordinal | Degree=positive|Gender=masculine|Number=singular|Case=nominative | 7 | Atr | _ | _ |
| 7 | dan | dan | Noun | Noun-common | Gender=masculine|Number=singular|Case=nominative | 1 | Sb | _ | _ |
| 8 | in | in | Conjunction | Conjunction-coordinating | Formation=simple | 0 | Coord | _ | _ |
| 9 | ure | ura | Noun | Noun-common | Gender=feminine|Number=plural|Case=nominative | 11 | Sb | _ | _ |
| 10 | so | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no | 11 | AuxV | _ | _ |
| 11 | bile | biti | Verb | Verb-main | VForm=participle|Tense=past|Number=plural|Gender=feminine|Voice=active | 8 | Pred | _ | _ |
| 12 | trinajst | trinajst | Numeral | Numeral-cardinal | Gender=neuter|Number=plural|Case=nominative|Form=letter | 11 | Obj | _ | _ |
| 13 | . | . | PUNC | PUNC | _ | 0 | AuxK | _ | _ |

The first sentence of the CoNLL 2006 test data:

| 1 | Na | na | Adposition | Adposition-preposition | Formation=simple|Case=locative | 5 | AuxP | _ | _ |
| 2 | hrbtu | hrbet | Noun | Noun-common | Gender=masculine|Number=singular|Case=locative | 1 | Adv | _ | _ |
| 3 | je | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no | 5 | AuxV | _ | _ |
| 4 | lahko | lahko | Adverb | Adverb-general | Degree=positive | 5 | AuxY | _ | _ |
| 5 | čutil | čutiti | Verb | Verb-main | VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active | 0 | Pred | _ | _ |
| 6 | , | , | PUNC | PUNC | _ | 7 | AuxX | _ | _ |
| 7 | da | da | Conjunction | Conjunction-subordinating | Formation=simple | 5 | AuxC | _ | _ |
| 8 | vsi | ves | Pronoun | Pronoun-general | Gender=masculine|Number=plural|Case=nominative|Syntactic-Type=nominal | 9 | Sb | _ | _ |
| 9 | upirajo | upirati | Verb | Verb-main | VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no | 7 | Obj | _ | _ |
| 10 | oči | oči | Noun | Noun-common | Gender=feminine|Number=plural|Case=accusative | 9 | Obj | _ | _ |
| 11 | v | v | Adposition | Adposition-preposition | Formation=simple|Case=accusative | 9 | AuxP | _ | _ |
| 12 | njegov | njegov | Pronoun | Pronoun-possessive | Person=third|Gender=masculine|Number=singular|Case=accusative|Owner-Number=singular|Owner-Gender=masculine|Syntactic-Type=adjectival|Animate=no | 14 | Atr | _ | _ |
| 13 | modri | moder | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=accusative|Definiteness=yes|Animate=no | 14 | Atr | _ | _ |
| 14 | kombinezon | kombinezon | Noun | Noun-common | Gender=masculine|Number=singular|Case=accusative|Animate=no | 11 | Adv | _ | _ |
| 15 | . | . | PUNC | PUNC | _ | 0 | AuxK | _ | _ |

==== Parsing ====

Nonprojectivities in SDT are not frequent.
Only 675 of the 35140 tokens in the CoNLL 2006 version are attached nonprojectively (1.92%).

The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Slovene:

^ Parser (Authors) ^ LAS ^ UAS ^
| MST (McDonald et al.) | 73.44 | 83.17 |
| Edinburgh (Riedel et al.) | 71.20 | 83.17 |
| Microsoft (Corston-Oliver and Aue) | 72.42 | 81.77 |
| Basis (John O'Neil) | 71.08 | 81.71 |
| Nara (Yuchang Cheng) | 71.42 | 81.14 |
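
LAS and UAS are the labeled and unlabeled attachment scores: the percentage of scored tokens whose predicted head and dependency label (LAS), or predicted head alone (UAS), match the gold standard. The Python sketch below illustrates how such scores can be computed when punctuation is excluded. It is not the official shared-task scorer. The function names are hypothetical, and the rule that a token counts as punctuation when all of its characters belong to a Unicode punctuation category is an assumption made for illustration.

```python
import unicodedata
from typing import List, Tuple

# One token: (form, head, deprel), i.e. the FORM, HEAD and DEPREL columns.
Token = Tuple[str, int, str]


def is_punctuation(form: str) -> bool:
    """Assumed rule: every character is in a Unicode 'P*' category."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, skipping punctuation tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    scored = head_ok = both_ok = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punctuation(form):
            continue  # punctuation does not count towards either score
        scored += 1
        if g_head == p_head:
            head_ok += 1
            if g_rel == p_rel:
                both_ok += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * both_ok / scored, 100.0 * head_ok / scored


# Gold values: tokens 9-13 of the first training sentence.
gold = [
    ("ure", 11, "Sb"),
    ("so", 11, "AuxV"),
    ("bile", 8, "Pred"),
    ("trinajst", 11, "Obj"),
    (".", 0, "AuxK"),
]
# Invented parser output, for illustration only.
pred = [
    ("ure", 11, "Sb"),        # head and label correct
    ("so", 9, "AuxV"),        # wrong head
    ("bile", 8, "Pred"),      # head and label correct
    ("trinajst", 11, "Atr"),  # correct head, wrong label
    (".", 11, "AuxK"),        # wrong head, but punctuation is ignored
]
las, uas = attachment_scores(gold, pred)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")  # LAS = 50.00, UAS = 75.00
```

Because the final period is skipped, the scores are computed over four tokens rather than five, which is why figures obtained this way are not directly comparable with evaluations that count punctuation.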