Slovene (sl)
Versions
- Original TEI-compliant XML format
- FS format (readable by the Tred tree editor and viewer)
The Slovene Dependency Treebank (SDT) is natively dependency-based; its annotation is modeled after the Prague Dependency Treebank of Czech.
Obtaining and License
SDT is freely downloadable in all data formats from http://nl.ijs.si/sdt/data/. The license in short:
- for research use
- cite the principal publication in any resulting publications
- redistribution is not addressed (it might be permitted under the same conditions, but ask the authors first)
SDT was created by members of the Institut “Jožef Stefan”, Jamova cesta 39, 1000 Ljubljana, Slovenia.
References
- Website
- Data
- no separate citation
- Principal publications
- Sašo Džeroski, Tomaž Erjavec, Nina Ledinek, Petr Pajas, Zdeněk Žabokrtský, Andreja Žele: Towards a Slovene Dependency Treebank. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), 24-26 May 2006, Genova, Italy.
- Documentation
- Tomaž Erjavec, Peter Holozan, Vojko Gorjanc, Marko Stabej: Morphosyntactic tagset specification for Slovene
Domain
Fiction (Multext-East Orwell's “1984”).
Size
The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test).
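The figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts quoted in this section; the names `TRAIN`, `TEST`, `totals` and `tokens_per_sentence` are illustrative and not part of any distributed tool.

```python
# Counts quoted in the Size paragraph (CoNLL 2006 data split).
TRAIN = {"tokens": 28750, "sentences": 1534}
TEST = {"tokens": 6390, "sentences": 402}


def totals(*parts):
    """Sum token and sentence counts over the given data parts."""
    return {key: sum(part[key] for part in parts)
            for key in ("tokens", "sentences")}


def tokens_per_sentence(part):
    """Average sentence length, rounded to two decimals."""
    return round(part["tokens"] / part["sentences"], 2)


whole = totals(TRAIN, TEST)
print(whole, tokens_per_sentence(whole))
print("train:", tokens_per_sentence(TRAIN), "test:", tokens_per_sentence(TEST))
```

The training and test parts add up to the stated totals, and the overall average matches the 18.15 tokens per sentence given above.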
Inside
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a 1-1 mapping between the BTB positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.
The morphological analysis does not include lemmas. The morphosyntactic tags were probably assigned manually.
The guidelines for syntactic annotation are documented in the other technical report. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).
Sample
The first three sentences of the CoNLL 2006 training data:
1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT |
2 | трета | _ | M | Mo | gen=f|num=s|def=i | 1 | mod | 1 | mod |

1 | НАРОДНО | _ | A | An | gen=n|num=s|def=i | 2 | mod | 2 | mod |
2 | СЪБРАНИЕ | _ | N | Nc | gen=n|num=s|def=i | 0 | ROOT | 0 | ROOT |

1 | Народното | _ | A | An | gen=n|num=s|def=d | 2 | mod | 2 | mod |
2 | събрание | _ | N | Nc | gen=n|num=s|def=i | 3 | subj | 3 | subj |
3 | осъществява | _ | V | Vpi | trans=t|mood=i|tense=r|pers=3|num=s | 0 | ROOT | 0 | ROOT |
4 | законодателната | _ | A | Af | gen=f|num=s|def=d | 5 | mod | 5 | mod |
5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj |
6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj |
7 | упражнява | _ | V | Vpi | trans=t|mood=i|tense=r|pers=3|num=s | 3 | conjarg | 3 | conjarg |
8 | парламентарен | _ | A | Am | gen=m|num=s|def=i | 9 | mod | 9 | mod |
9 | контрол | _ | N | Nc | gen=m|num=s|def=i | 7 | obj | 7 | obj |
10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct |
The first three sentences of the CoNLL 2006 test data:
1 | Единственото | _ | A | An | gen=n|num=s|def=d | 2 | mod | 2 | mod |
2 | решение | _ | N | Nc | gen=n|num=s|def=i | 0 | ROOT | 0 | ROOT |

1 | Ерик | _ | N | Np | gen=m|num=s|def=i | 0 | ROOT | 0 | ROOT |
2 | Франк | _ | N | Np | gen=m|num=s|def=i | 1 | mod | 1 | mod |
3 | Ръсел | _ | H | Hm | gen=m|num=s|def=i | 2 | mod | 2 | mod |

1 | Пълен | _ | A | Am | gen=m|num=s|def=i | 2 | mod | 2 | mod |
2 | мрак | _ | N | Nc | gen=m|num=s|def=i | 0 | ROOT | 0 | ROOT |
3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj |
4 | пълна | _ | A | Af | gen=f|num=s|def=i | 5 | mod | 5 | mod |
5 | самота | _ | N | Nc | _ | 2 | conjarg | 2 | conjarg |
6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct |
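Rows like the ones shown above can be read with a short script. The sketch below assumes the usual CoNLL 2006 file layout: ten tab-separated columns per token and a blank line between sentences. On this page the columns are rendered with `|`, which is also the separator inside the FEAT column, so the tab-separated file is the safer input. The column names in `COLUMNS` and the functions `parse_feat` and `read_conll` are illustrative, not part of the distribution.

```python
from typing import Dict, Iterable, List

# Assumed names for the ten CoNLL 2006 columns, in file order.
COLUMNS = ["id", "form", "lemma", "cpos", "pos", "feat",
           "head", "deprel", "phead", "pdeprel"]


def parse_feat(feat: str) -> Dict[str, str]:
    """Turn 'gen=f|num=s|def=i' into a dict; '_' means no features."""
    if feat == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feat.split("|"))


def read_conll(lines: Iterable[str]) -> List[List[dict]]:
    """Group tab-separated token lines into sentences (blank line = boundary)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token = dict(zip(COLUMNS, line.split("\t")))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        token["feat"] = parse_feat(token["feat"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# The first two sample sentences from above, re-typed with tab separators.
SAMPLE = "\n".join([
    "1\tГлава\t_\tN\tNc\t_\t0\tROOT\t0\tROOT",
    "2\tтрета\t_\tM\tMo\tgen=f|num=s|def=i\t1\tmod\t1\tmod",
    "",
    "1\tНАРОДНО\t_\tA\tAn\tgen=n|num=s|def=i\t2\tmod\t2\tmod",
    "2\tСЪБРАНИЕ\t_\tN\tNc\tgen=n|num=s|def=i\t0\tROOT\t0\tROOT",
])

sentences = read_conll(SAMPLE.splitlines())
for sentence in sentences:
    print([(tok["id"], tok["form"], tok["head"], tok["deprel"]) for tok in sentence])
```

Keeping FEAT as a dictionary makes it easy to filter tokens by a single feature, for example all tokens with `def=d`.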
Parsing
Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
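A count like this can be reproduced from the HEAD column. The sketch below assumes one common definition: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function names are illustrative, and the second example tree is made up to show a crossing arc.

```python
def dominates(heads, ancestor, node):
    """True if `ancestor` is a proper ancestor of `node` (0 = artificial root)."""
    for _ in range(len(heads)):      # bounded walk, so a malformed cycle cannot loop forever
        node = heads[node - 1]
        if node == ancestor:
            return True
        if node == 0:
            return False
    return False


def nonprojective_tokens(heads):
    """heads[i] is the head ID of token i+1; return IDs attached nonprojectively."""
    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the ten-token training sentence shown in the Sample section.
sample_heads = [2, 3, 0, 5, 3, 3, 3, 9, 7, 3]
print(nonprojective_tokens(sample_heads))

# Hypothetical four-token tree in which the arcs 1->3 and 4->2 cross.
crossing_heads = [0, 4, 1, 1]
print(nonprojective_tokens(crossing_heads))

# Share of nonprojectively attached tokens quoted above.
share = round(100 * 747 / 196151, 2)
print(share)
```

Other definitions of nonprojectivity exist (for example, counting crossing arc pairs), so counts obtained this way may differ slightly from published figures.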
The results of the CoNLL 2006 shared task are available online and were published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard in that it excluded punctuation tokens from scoring. These are the best results for Bulgarian:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 87.57 | 92.04 |
Malt (Nivre et al.) | 87.41 | 91.72 |
Nara (Yuchang Cheng) | 86.34 | 91.30 |
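LAS (labeled attachment score) is the percentage of scored tokens with the correct head and dependency label; UAS (unlabeled attachment score) requires only the correct head. The following is a minimal sketch of scoring that skips punctuation, not the official evaluation script. It assumes a token counts as punctuation when every character of its form is in a Unicode punctuation category, and the predicted analysis in the example is made up.

```python
import unicodedata


def is_punctuation(form):
    """Assumption: a token is punctuation if all its characters are Unicode punctuation."""
    return bool(form) and all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """gold, predicted: parallel lists of (form, head, deprel). Returns (LAS, UAS) in percent."""
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue                      # punctuation tokens are not scored
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        raise ValueError("no scorable tokens")
    return 100 * las_hits / scored, 100 * uas_hits / scored


# Gold analysis: the third test sentence from the Sample section.
gold = [("Пълен", 2, "mod"), ("мрак", 0, "ROOT"), ("и", 2, "conj"),
        ("пълна", 5, "mod"), ("самота", 2, "conjarg"), (".", 2, "punct")]
# Hypothetical parser output: one wrong label, one wrong head, one wrong punctuation head.
predicted = [("Пълен", 2, "mod"), ("мрак", 0, "ROOT"), ("и", 2, "mod"),
             ("пълна", 5, "mod"), ("самота", 3, "conjarg"), (".", 5, "punct")]

las, uas = attachment_scores(gold, predicted)
print(f"LAS {las:.2f}  UAS {uas:.2f}")
```

The wrongly attached full stop does not affect either score, which is the effect of excluding punctuation described above.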