Croatian (hr)
SETimes.HR treebank
Versions
- Version 1, available online
- A version not yet released, obtained on 2014-07-16 from Željko Agić
Obtaining and License
The corpus is available on-line for free download under the CC BY-SA 3.0 license. The license in short:
- use for whatever work you want
- redistribution permitted under the same license
- cite their paper in publications
SETimes.HR was created by Željko Agić (Universität Potsdam) and Nikola Ljubešić (Filozofski fakultet Sveučilišta u Zagrebu), Ivana Lučića 3, HR-10000 Zagreb, Croatia.
References
- Website
- Data
- no separate citation
- Principal publications
- Željko Agić, Nikola Ljubešić: The SETimes.HR Linguistically Annotated Corpus of Croatian. In: Proceedings of LREC 2014, pp. 1724–1727. Reykjavík, Iceland, 2014.
- Documentation
Domain
Croatian newspaper text from Southeast European Times.
Size
Version 1 contains 178,981 tokens in 7,995 sentences, i.e. 22.39 tokens per sentence on average. Only 2,490 sentences have been annotated on the syntactic level, so the file is a mixture of trees and non-trees. The first part of the corpus (up to line number 93124) contains manually assigned lemmas and morphosyntactic descriptions (tags), while the rest contains automatic morphological annotation.
The improved pre-release version contains 83,640 tokens in 3,736 sentences, which also gives 22.39 tokens per sentence on average.
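The averages above are simply the token count divided by the sentence count. A minimal sketch of how such figures could be recomputed, assuming the usual CoNLL 2006 layout of one token per line with a blank line between sentences; `corpus_stats` is a hypothetical helper name, not part of any distributed tool:

```python
def corpus_stats(conll_text):
    """Return (tokens, sentences, tokens per sentence) for CoNLL-style text."""
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:
                # First token line after a blank line starts a new sentence.
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    average = tokens / sentences if sentences else 0.0
    return tokens, sentences, average


# Checking the figures reported in the text:
print(round(178981 / 7995, 2))  # 22.39
print(round(83640 / 3736, 2))   # 22.39
```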
Inside
All sentences in the improved pre-release version are manually annotated on the morphological and syntactic levels. The officially available Version 1 is a mixture of manual and automatic annotation; see the Size section above.
The treebank is distributed in the CoNLL 2006 file format. Multext-East morphosyntactic tags appear in both the CPOS and POS columns, while the FEAT column is empty.
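Because the full morphosyntactic tag is repeated in both CPOS and POS, a user who needs a coarse category has to derive it. In Multext-East tags the first character encodes the category, so one possible sketch is shown below. The helper names and the sample row (including the tag `Npnsn`) are illustrative assumptions, not taken from the corpus.

```python
COLUMNS = ["id", "form", "lemma", "cpos", "pos", "feat",
           "head", "deprel", "phead", "pdeprel"]


def split_row(line):
    """Split one tab-separated CoNLL 2006 row into a dict of named columns."""
    return dict(zip(COLUMNS, line.rstrip("\n").split("\t")))


def coarse_tag(msd):
    """Return the first character of a Multext-East tag (its category)."""
    return msd[0] if msd and msd != "_" else "_"


# Illustrative row, not copied from the treebank.
row = split_row("1\tKosovo\tKosovo\tNpnsn\tNpnsn\t_\t2\tSb\t_\t_")
print(coarse_tag(row["cpos"]))  # N
```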
In Version 1, a sentence that contains any token with an empty (“_”) DEPREL value has not been syntactically annotated. The HEAD column of such a sentence still contains numbers, but these are fake head links; typically they all refer to the same node.
All sentences in the improved pre-release contain dependency information; however, in a few places there are errors introduced by the annotation software that result in a cyclic graph (not a tree).
The syntactic tags (DEPREL) are simplistic but somewhat inspired by the Prague Dependency Treebank. There are only 15 of them:
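The rule above can be used to separate real trees from unannotated sentences. A minimal sketch, assuming tab-separated CoNLL 2006 rows with DEPREL as the eighth column and blank lines between sentences; `read_sentences` and `is_parsed` are hypothetical helper names:

```python
def read_sentences(conll_text):
    """Group CoNLL rows into sentences (lists of column lists)."""
    sentences, current = [], []
    for line in conll_text.splitlines():
        if line.strip():
            current.append(line.split("\t"))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def is_parsed(sentence):
    """A sentence counts as parsed only if no token has an empty DEPREL."""
    return all(row[7] != "_" for row in sentence)


def parsed_only(conll_text):
    """Keep only the syntactically annotated sentences."""
    return [s for s in read_sentences(conll_text) if is_parsed(s)]
```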
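Sentences affected by such errors can be found by following the head links from every token and checking whether the path ever returns to a node already visited. A minimal sketch, assuming the HEAD values of one sentence are given as a list of integers in token order, with 0 denoting the artificial root and all values within range; `has_cycle` is a hypothetical helper name:

```python
def has_cycle(heads):
    """heads[i] is the HEAD of token i+1; 0 is the artificial root.

    Returns True if following head links from any token revisits a node
    instead of reaching the root.
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node - 1]
    return False


print(has_cycle([2, 0, 2]))  # False: tokens 1 and 3 attach to token 2
print(has_cycle([2, 1, 0]))  # True: tokens 1 and 2 point at each other
```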
Tag | Percent | Example | Description |
---|---|---|---|
Adv | 5% | Kosovu | adverbial modifier |
Ap | 3% | Esat | appositional modifier, incl. first name attached to last name |
Atr | 26% | privatizacije | attribute modifying a noun phrase |
Atv | 2% | iskoristiti | ? |
Aux | 7% | se | ? |
Co | 3% | a | conjunction as coordination head (Prague-style coordinations) |
Elp | 0.6% | Proces | ellipsis |
Obj | 7% | privatizacije | object of a verb |
Oth | 2% | Barem | other |
Pnom | 2% | složen | nominal predicate attached to copula |
Pred | 10% | analizira | predicate (verbal) |
Prep | 10% | na | preposition |
Punc | 13% | . | punctuation |
Sb | 7% | Kosovo | subject |
Sub | 4% | da | subordinating conjunction |
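Percentages like those in the table can be recomputed from the DEPREL column of the annotated sentences. A minimal sketch, assuming the labels have already been collected into a list; `deprel_percentages` is a hypothetical helper name and the sample labels below are illustrative only:

```python
from collections import Counter


def deprel_percentages(deprels):
    """Map each dependency label to its share of all tokens, in percent."""
    counts = Counter(deprels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


print(deprel_percentages(["Atr", "Atr", "Sb", "Punc"]))
# {'Atr': 50.0, 'Sb': 25.0, 'Punc': 25.0}
```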
Sample
The first three sentences of the CoNLL 2006 training data:
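ID | FORM | LEMMA | CPOSTAG | POSTAG | FEATS | HEAD | DEPREL | PHEAD | PDEPREL |
---|---|---|---|---|---|---|---|---|---|
1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT |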
2 | трета | _ | M | Mo | gen=f|num=s|def=i | 1 | mod | 1 | mod |
1 | НАРОДНО | _ | A | An | gen=n|num=s|def=i | 2 | mod | 2 | mod |
2 | СЪБРАНИЕ | _ | N | Nc | gen=n|num=s|def=i | 0 | ROOT | 0 | ROOT |
1 | Народното | _ | A | An | gen=n|num=s|def=d | 2 | mod | 2 | mod |
2 | събрание | _ | N | Nc | gen=n|num=s|def=i | 3 | subj | 3 | subj |
3 | осъществява | _ | V | Vpi | trans=t|mood=i|tense=r|pers=3|num=s | 0 | ROOT | 0 | ROOT |
4 | законодателната | _ | A | Af | gen=f|num=s|def=d | 5 | mod | 5 | mod |
5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj |
6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj |
7 | упражнява | _ | V | Vpi | trans=t|mood=i|tense=r|pers=3|num=s | 3 | conjarg | 3 | conjarg |
8 | парламентарен | _ | A | Am | gen=m|num=s|def=i | 9 | mod | 9 | mod |
9 | контрол | _ | N | Nc | gen=m|num=s|def=i | 7 | obj | 7 | obj |
10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct |
The first three sentences of the CoNLL 2006 test data:
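ID | FORM | LEMMA | CPOSTAG | POSTAG | FEATS | HEAD | DEPREL | PHEAD | PDEPREL |
---|---|---|---|---|---|---|---|---|---|
1 | Единственото | _ | A | An | gen=n|num=s|def=d | 2 | mod | 2 | mod |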
2 | решение | _ | N | Nc | gen=n|num=s|def=i | 0 | ROOT | 0 | ROOT |
1 | Ерик | _ | N | Np | gen=m|num=s|def=i | 0 | ROOT | 0 | ROOT |
2 | Франк | _ | N | Np | gen=m|num=s|def=i | 1 | mod | 1 | mod |
3 | Ръсел | _ | H | Hm | gen=m|num=s|def=i | 2 | mod | 2 | mod |
1 | Пълен | _ | A | Am | gen=m|num=s|def=i | 2 | mod | 2 | mod |
2 | мрак | _ | N | Nc | gen=m|num=s|def=i | 0 | ROOT | 0 | ROOT |
3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj |
4 | пълна | _ | A | Af | gen=f|num=s|def=i | 5 | mod | 5 | mod |
5 | самота | _ | N | Nc | _ | 2 | conjarg | 2 | conjarg |
6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct |
Parsing
Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
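A token is usually called nonprojectively attached when some token lying between it and its head is not dominated by that head. The sketch below counts such tokens under this common definition; whether the reported figure was computed in exactly this way is an assumption, and `dominates` and `nonprojective_tokens` are hypothetical helper names. The input is the list of HEAD values of one sentence, with 0 for the artificial root, which is treated as dominating every token.

```python
def dominates(heads, ancestor, node):
    """True if `ancestor` is `node` itself or lies on its path to the root."""
    if ancestor == 0:
        return True  # the artificial root dominates everything
    steps = 0
    while node != 0 and steps <= len(heads):  # step limit guards against cycles
        if node == ancestor:
            return True
        node = heads[node - 1]
        steps += 1
    return False


def nonprojective_tokens(heads):
    """Return the IDs of tokens whose incoming arc is nonprojective."""
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


print(nonprojective_tokens([2, 0, 2]))     # []
print(nonprojective_tokens([3, 4, 0, 3]))  # [2]
print(round(100 * 747 / 196151, 2))        # 0.38
```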
The results of the CoNLL 2006 shared task are available online and were published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens from scoring. These are the best results for Bulgarian:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 87.57 | 92.04 |
Malt (Nivre et al.) | 87.41 | 91.72 |
Nara (Yuchang Cheng) | 86.34 | 91.30 |
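LAS is the percentage of scored tokens with both the correct head and the correct dependency label; UAS requires only the correct head. A minimal sketch of scoring that skips punctuation follows. It assumes a token counts as punctuation when all of its characters belong to a Unicode punctuation category, which approximates but is not guaranteed to match the official evaluation script; the function names and the sample prediction are illustrative.

```python
import unicodedata


def is_punct(form):
    """True if every character of the form is Unicode punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent over non-punctuation tokens.

    Both arguments are lists of (form, head, deprel) in the same token order.
    """
    total = uas = las = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punct(form):
            continue  # punctuation is excluded from scoring
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    if total == 0:
        return 0.0, 0.0
    return 100 * las / total, 100 * uas / total


gold = [("Пълен", 2, "mod"), ("мрак", 0, "ROOT"), (".", 2, "punct")]
pred = [("Пълен", 2, "mod"), ("мрак", 0, "mod"), (".", 1, "punct")]
print(attachment_scores(gold, pred))  # (50.0, 100.0)
```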