Croatian (hr)
SETimes.HR treebank
Versions
- Version 1, available on-line
- Not yet released version, obtained from Željko Agić on 2014-07-16
Obtaining and License
The corpus is available on-line for free download under the CC BY-SA 3.0 license. The license in short:
- use for whatever work you want
- redistribution permitted under the same license
- cite their paper in publications
SETimes.HR was created by Željko Agić (Universität Potsdam) and Nikola Ljubešić (Filozofski fakultet Sveučilišta u Zagrebu), Ivana Lučića 3, HR-10000 Zagreb, Croatia.
References
- Website
- Data
- no separate citation
- Principal publications
- Željko Agić, Nikola Ljubešić: The SETimes.HR Linguistically Annotated Corpus of Croatian. In: Proceedings of LREC 2014, pp. 1724–1727. Reykjavík, Iceland, 2014.
- Documentation
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Domain
Unknown (“A set of Bulgarian sentences marked-up with detailed syntactic information. These sentences are mainly extracted from authentic Bulgarian texts. They are chosen with regards two criteria. First, they cover the variety of syntactic structures of Bulgarian. Second, they show the statistical distribution of these phenomena in real texts.”) At least part of it is probably news (Novinar, Sega, Standart).
Size
The CoNLL 2006 version contains 196,151 tokens in 13,221 sentences, yielding 14.84 tokens per sentence on average (CoNLL 2006 data split: 190,217 tokens / 12,823 sentences training, 5,934 tokens / 398 sentences test).
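The totals and the average follow directly from the split figures. A minimal sketch of the arithmetic, using only the counts quoted above (the variable names are illustrative):

```python
# Counts as quoted in the text (CoNLL 2006 data split).
train_tokens, train_sents = 190_217, 12_823
test_tokens, test_sents = 5_934, 398

total_tokens = train_tokens + test_tokens
total_sents = train_sents + test_sents

# Average sentence length over training and test data together.
avg_len = total_tokens / total_sents

print(total_tokens, total_sents, round(avg_len, 2))
```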
Inside
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a 1-1 mapping between the BTB positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.
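To work with the FEAT column programmatically, its `name=value` pairs can be unpacked into a dictionary. The sketch below assumes the format seen in the sample rows on this page (pairs joined by `|`, with `_` for an empty feature set); the helper name `parse_feats` is hypothetical and not part of any distributed tool.

```python
def parse_feats(feats: str) -> dict:
    """Turn a CoNLL FEAT string such as 'gen=n|num=s|def=d' into a dict.

    An underscore marks an empty feature set and yields an empty dict.
    """
    if feats == "_":
        return {}
    # Split on the first '=' only, in case a value ever contains '='.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_feats("gen=n|num=s|def=d"))
print(parse_feats("_"))
```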
The morphological analysis does not include lemmas. The morphosyntactic tags were probably assigned manually.
The guidelines for syntactic annotation are documented in the other technical report. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).
Sample
The first three sentences of the CoNLL 2006 training data:
1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT |
2 | трета | _ | M | Mo | gen=f|num=s|def=i | 1 | mod | 1 | mod |
1 | НАРОДНО | _ | A | An | gen=n|num=s|def=i | 2 | mod | 2 | mod |
2 | СЪБРАНИЕ | _ | N | Nc | gen=n|num=s|def=i | 0 | ROOT | 0 | ROOT |
1 | Народното | _ | A | An | gen=n|num=s|def=d | 2 | mod | 2 | mod |
2 | събрание | _ | N | Nc | gen=n|num=s|def=i | 3 | subj | 3 | subj |
3 | осъществява | _ | V | Vpi | trans=t|mood=i|tense=r|pers=3|num=s | 0 | ROOT | 0 | ROOT |
4 | законодателната | _ | A | Af | gen=f|num=s|def=d | 5 | mod | 5 | mod |
5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj |
6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj |
7 | упражнява | _ | V | Vpi | trans=t|mood=i|tense=r|pers=3|num=s | 3 | conjarg | 3 | conjarg |
8 | парламентарен | _ | A | Am | gen=m|num=s|def=i | 9 | mod | 9 | mod |
9 | контрол | _ | N | Nc | gen=m|num=s|def=i | 7 | obj | 7 | obj |
10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct |
The first three sentences of the CoNLL 2006 test data:
1 | Единственото | _ | A | An | gen=n|num=s|def=d | 2 | mod | 2 | mod |
2 | решение | _ | N | Nc | gen=n|num=s|def=i | 0 | ROOT | 0 | ROOT |
1 | Ерик | _ | N | Np | gen=m|num=s|def=i | 0 | ROOT | 0 | ROOT |
2 | Франк | _ | N | Np | gen=m|num=s|def=i | 1 | mod | 1 | mod |
3 | Ръсел | _ | H | Hm | gen=m|num=s|def=i | 2 | mod | 2 | mod |
1 | Пълен | _ | A | Am | gen=m|num=s|def=i | 2 | mod | 2 | mod |
2 | мрак | _ | N | Nc | gen=m|num=s|def=i | 0 | ROOT | 0 | ROOT |
3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj |
4 | пълна | _ | A | Af | gen=f|num=s|def=i | 5 | mod | 5 | mod |
5 | самота | _ | N | Nc | _ | 2 | conjarg | 2 | conjarg |
6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct |
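The sample rows above are printed without blank lines between sentences, so sentence boundaries have to be recovered from the token ID restarting at 1. The following sketch parses rows in the layout shown on this page. It assumes the cell separator is `" | "` with surrounding spaces, which keeps the bare `|` inside the FEAT column intact, and it uses the standard CoNLL-X column order. The function names are illustrative.

```python
# The ten CoNLL-X columns, in order.
FIELDS = ("id", "form", "lemma", "cpos", "pos",
          "feats", "head", "deprel", "phead", "pdeprel")


def parse_row(row: str) -> dict:
    """Parse one pipe-separated sample row into a token dictionary."""
    # Drop the trailing cell delimiter, then split on ' | ' so that
    # the '|' characters inside the FEAT value are left alone.
    cells = row.strip().rstrip("|").strip().split(" | ")
    if len(cells) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} cells, got {len(cells)}")
    token = dict(zip(FIELDS, cells))
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    return token


def split_sentences(rows):
    """Group rows into sentences; a token ID of 1 starts a new sentence."""
    sentences = []
    for row in rows:
        token = parse_row(row)
        if token["id"] == 1 or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences


rows = [
    "1 | Единственото | _ | A | An | gen=n|num=s|def=d | 2 | mod | 2 | mod |",
    "2 | решение | _ | N | Nc | gen=n|num=s|def=i | 0 | ROOT | 0 | ROOT |",
    "1 | Ерик | _ | N | Np | gen=m|num=s|def=i | 0 | ROOT | 0 | ROOT |",
    "2 | Франк | _ | N | Np | gen=m|num=s|def=i | 1 | mod | 1 | mod |",
    "3 | Ръсел | _ | H | Hm | gen=m|num=s|def=i | 2 | mod | 2 | mod |",
]

sentences = split_sentences(rows)
print([len(s) for s in sentences])
```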
Parsing
Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
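A token counts as nonprojectively attached when some token lying between it and its head is not dominated by that head. The sketch below implements this common definition; it is not necessarily the exact procedure used to obtain the figure above. The `heads` list (head of token i at index i-1, 0 for the artificial root) and the four-token example are assumptions made for illustration.

```python
def dominates(heads, ancestor, node):
    """Return True if `ancestor` is reached by following heads upward from `node`."""
    seen = set()
    while node != 0 and node not in seen:  # `seen` guards against cycles
        seen.add(node)
        node = heads[node - 1]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads):
    """Return the 1-based IDs of tokens whose arc to their head is nonprojective."""
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        # Every token strictly between head and dependent must descend from head.
        if any(not dominates(heads, head, mid) for mid in range(lo + 1, hi)):
            result.append(dep)
    return result


# Heads of the third training sentence from the Sample section: projective.
print(nonprojective_tokens([2, 3, 0, 5, 3, 3, 3, 9, 7, 3]))

# Hypothetical four-token tree in which token 2 attaches across the root.
print(nonprojective_tokens([3, 4, 0, 3]))

# Share of nonprojective tokens, from the counts quoted above.
print(round(100 * 747 / 196_151, 2))
```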
The results of the CoNLL 2006 shared task are available online and were published in Buchholz and Marsi (2006). The evaluation procedure was non-standard in that it excluded punctuation tokens from scoring. These are the best results for Bulgarian:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 87.57 | 92.04 |
Malt (Nivre et al.) | 87.41 | 91.72 |
Nara (Yuchang Cheng) | 86.34 | 91.30 |
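For readers comparing their own parser output against these numbers, the sketch below computes labeled and unlabeled attachment scores (LAS and UAS) while skipping punctuation. It assumes that a token is punctuation when every character of its form belongs to a Unicode punctuation category; the official evaluation script may differ in detail. The predicted analysis in the example is invented purely for illustration.

```python
import unicodedata


def is_punct(form: str) -> bool:
    """True if the form is non-empty and consists only of punctuation characters."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, pred):
    """Return (LAS, UAS) in percent over non-punctuation tokens.

    `gold` and `pred` are parallel lists of (form, head, deprel) tuples.
    """
    scored = las_hits = uas_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punct(form):
            continue  # punctuation is excluded from scoring
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        raise ValueError("no scorable tokens")
    return 100 * las_hits / scored, 100 * uas_hits / scored


# Gold analysis: third test sentence from the Sample section.
gold = [("Пълен", 2, "mod"), ("мрак", 0, "ROOT"), ("и", 2, "conj"),
        ("пълна", 5, "mod"), ("самота", 2, "conjarg"), (".", 2, "punct")]
# Hypothetical parser output with one wrong head and one wrong label.
pred = [("Пълен", 2, "mod"), ("мрак", 0, "ROOT"), ("и", 2, "conj"),
        ("пълна", 2, "mod"), ("самота", 2, "mod"), (".", 5, "punct")]

las, uas = attachment_scores(gold, pred)
print(las, uas)
```

The wrongly attached full stop in the hypothetical output does not affect either score, which is the practical consequence of excluding punctuation.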