Slovak (sk)
Slovak Treebank (part of Slovak National Corpus / Slovenský národný korpus)
Versions
- PML format as in Czech PDT 2.0 (.a, .m, .w files)
ST is natively dependency-based, modeled after the Prague Dependency Treebank of Czech.
Obtaining and License
ST has not been publicly released. Contact Radovan Garabík to inquire about availability and license terms.
ST was created by members of the Ľudovít Štúr Language Institute (Jazykovedný ústav Ľudovíta Štúra), Panská 26, 81364 Bratislava, Slovakia.
References
- Website
- http://korpus.sk/ (Slovenský národný korpus); the site says little about the syntactic annotation
- Data
- no separate citation
- Principal publications
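- Mária Šimková, Radovan Garabík: Синтаксическая разметка в Словацком национальном корпусе [Syntactic annotation in the Slovak National Corpus]. In: Труды международной конференции Корпусная лингвистика – 2006 [Proceedings of the international conference Corpus Linguistics 2006]. Sankt-Petersburg: St. Petersburg University Press, 2006, pp. 389–394. ISBN 5-288-04181-4.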
- Documentation
Domain
Mixed.
Size
50,000 sentences
The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test).
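The figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers quoted in the text; the variable names are illustrative and not part of any distributed tool.

```python
# Figures quoted in the text for the CoNLL 2006 data split.
train_tokens, train_sentences = 28750, 1534
test_tokens, test_sentences = 6390, 402

# Totals and average sentence length over the whole data set.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
avg_len = total_tokens / total_sentences

print(total_tokens, total_sentences, round(avg_len, 2))  # 35140 1936 18.15
```

The training and test parts sum to the stated totals, and the average matches the 18.15 tokens per sentence given above.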
Inside
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a one-to-one mapping between the original positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.
The morphological analysis includes lemmas. The morphosyntactic tags were probably assigned manually.
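To illustrate how the CPOS, POS and FEAT columns sit in a CoNLL row, here is a minimal sketch of a line parser. It assumes the tab-separated, ten-column layout of the CoNLL-X shared task (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the names `Token`, `parse_feats` and `parse_conll_line` are hypothetical helpers, not part of any released tool. The example row reuses values from the sample on this page.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: dict
    head: int
    deprel: str


def parse_feats(feats: str) -> dict:
    """Split 'Name=value|Name=value' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    # Assumes every feature has the form Name=value, as in the sample rows.
    return dict(item.split("=", 1) for item in feats.split("|"))


def parse_conll_line(line: str) -> Token:
    """Parse one token line, assuming tab-separated CoNLL-X columns."""
    cols = line.rstrip("\n").split("\t")
    return Token(
        id=int(cols[0]),
        form=cols[1],
        lemma=cols[2],
        cpos=cols[3],
        pos=cols[4],
        feats=parse_feats(cols[5]),
        head=int(cols[6]),
        deprel=cols[7],
    )


example = "7\tdan\tdan\tNoun\tNoun-common\tGender=masculine|Number=singular|Case=nominative\t1\tSb\t_\t_"
tok = parse_conll_line(example)
print(tok.cpos, tok.pos, tok.feats["Case"], tok.head, tok.deprel)
```

Keeping the features in a dictionary makes it easy to compare individual categories (case, number, gender) against the original tag when checking the mapping.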
Sample
The first sentence of the treebank in the TEI-compliant XML format:
<text id="Osl." lang="sl">
<body>
<div type="part" id="Osl.1">
<div type="chapter" id="Osl.1.2">
<p id="Osl.1.2.2">
<s id="Osl.1.2.2.1">
  <w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w>
  <w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w>
  <w id="s1t3" afun="Atr" parallel="Co" dep="s1t4" lemma="jasen" ana="Afpmsnn">jasen</w>
  <c id="s1t4" afun="Coord" dep="s1t7">,</c>
  <w id="s1t5" afun="Atr" parallel="Co" dep="s1t4" lemma="mrzel" ana="Afpmsnn">mrzel</w>
  <w id="s1t6" afun="Atr" dep="s1t7" lemma="aprilski" ana="Aopmsn">aprilski</w>
  <w id="s1t7" afun="Sb" dep="s1t1" lemma="dan" ana="Ncmsn">dan</w>
  <w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w>
  <w id="s1t9" afun="Sb" dep="s1t11" lemma="ura" ana="Ncfpn">ure</w>
  <w id="s1t10" afun="AuxV" dep="s1t11" lemma="biti" ana="Vcip3p--n">so</w>
  <w id="s1t11" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vmps-pfa">bile</w>
  <w id="s1t12" afun="Obj" dep="s1t11" lemma="trinajst" ana="Mcnpnl">trinajst</w>
  <c id="s1t13" afun="AuxK" dep="root">.</c>
</s>
The first sentence of the CoNLL 2006 training data:
1 | Bil | biti | Verb | Verb-copula | VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active | 8 | Pred | _ | _ |
2 | je | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no | 1 | AuxV | _ | _ |
3 | jasen | jasen | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no | 4 | Atr | _ | _ |
4 | , | , | PUNC | PUNC | _ | 7 | Coord | _ | _ |
5 | mrzel | mrzel | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no | 4 | Atr | _ | _ |
6 | aprilski | aprilski | Adjective | Adjective-ordinal | Degree=positive|Gender=masculine|Number=singular|Case=nominative | 7 | Atr | _ | _ |
7 | dan | dan | Noun | Noun-common | Gender=masculine|Number=singular|Case=nominative | 1 | Sb | _ | _ |
8 | in | in | Conjunction | Conjunction-coordinating | Formation=simple | 0 | Coord | _ | _ |
9 | ure | ura | Noun | Noun-common | Gender=feminine|Number=plural|Case=nominative | 11 | Sb | _ | _ |
10 | so | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no | 11 | AuxV | _ | _ |
11 | bile | biti | Verb | Verb-main | VForm=participle|Tense=past|Number=plural|Gender=feminine|Voice=active | 8 | Pred | _ | _ |
12 | trinajst | trinajst | Numeral | Numeral-cardinal | Gender=neuter|Number=plural|Case=nominative|Form=letter | 11 | Obj | _ | _ |
13 | . | . | PUNC | PUNC | _ | 0 | AuxK | _ | _ |
The first sentence of the CoNLL 2006 test data:
1 | Na | na | Adposition | Adposition-preposition | Formation=simple|Case=locative | 5 | AuxP | _ | _ |
2 | hrbtu | hrbet | Noun | Noun-common | Gender=masculine|Number=singular|Case=locative | 1 | Adv | _ | _ |
3 | je | biti | Verb | Verb-copula | VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no | 5 | AuxV | _ | _ |
4 | lahko | lahko | Adverb | Adverb-general | Degree=positive | 5 | AuxY | _ | _ |
5 | čutil | čutiti | Verb | Verb-main | VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active | 0 | Pred | _ | _ |
6 | , | , | PUNC | PUNC | _ | 7 | AuxX | _ | _ |
7 | da | da | Conjunction | Conjunction-subordinating | Formation=simple | 5 | AuxC | _ | _ |
8 | vsi | ves | Pronoun | Pronoun-general | Gender=masculine|Number=plural|Case=nominative|Syntactic-Type=nominal | 9 | Sb | _ | _ |
9 | upirajo | upirati | Verb | Verb-main | VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no | 7 | Obj | _ | _ |
10 | oči | oči | Noun | Noun-common | Gender=feminine|Number=plural|Case=accusative | 9 | Obj | _ | _ |
11 | v | v | Adposition | Adposition-preposition | Formation=simple|Case=accusative | 9 | AuxP | _ | _ |
12 | njegov | njegov | Pronoun | Pronoun-possessive | Person=third|Gender=masculine|Number=singular|Case=accusative|Owner-Number=singular|Owner-Gender=masculine|Syntactic-Type=adjectival|Animate=no | 14 | Atr | _ | _ |
13 | modri | moder | Adjective | Adjective-qualificative | Degree=positive|Gender=masculine|Number=singular|Case=accusative|Definiteness=yes|Animate=no | 14 | Atr | _ | _ |
14 | kombinezon | kombinezon | Noun | Noun-common | Gender=masculine|Number=singular|Case=accusative|Animate=no | 11 | Adv | _ | _ |
15 | . | . | PUNC | PUNC | _ | 0 | AuxK | _ | _ |
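The HEAD column (seventh field) of the rows above encodes the dependency tree: each token points to its parent, and 0 marks the artificial root. As a rough sketch of how such a tree could be tested for projectivity, the function below treats an arc as projective when every token lying between the head and the dependent is dominated by the head. The function name `is_projective` is illustrative, and the code assumes a well-formed tree without cycles.

```python
def is_projective(heads):
    """heads[i-1] is the HEAD of token i (1-based); 0 is the artificial root.

    Assumes the heads form a well-formed tree (no cycles).
    """
    def dominated_by(node, ancestor):
        # Walk up from node until we reach the ancestor or the root (0).
        while node != 0 and node != ancestor:
            node = heads[node - 1]
        return node == ancestor

    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must descend from head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# HEAD values of the 15 tokens of the test sentence shown above.
test_heads = [5, 1, 5, 5, 0, 7, 5, 9, 7, 9, 9, 14, 14, 11, 0]
print(is_projective(test_heads))

# A small constructed counterexample: token 2 lies inside the arc 3 -> 1
# but attaches to the root, so the arc is non-projective.
crossing_heads = [3, 0, 4, 2]
print(is_projective(crossing_heads))
```

The quadratic walk is adequate for sentences of this length; a larger-scale count of non-projective arcs would need the full treebank data.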
Parsing
Nonprojectivities…
Parsing results…