
Institute of Formal and Applied Linguistics Wiki



This is an old revision of the document!



Croatian (hr)

SETimes.HR treebank

Versions

Obtaining and License

The corpus is available online for free download under the CC BY-SA 3.0 license. In short, the license requires attribution and sharing of derived works under the same terms.

SETimes.HR was created by Željko Agić (Universität Potsdam) and Nikola Ljubešić (Filozofski fakultet Sveučilišta u Zagrebu), Ivana Lučića 3, HR-10000 Zagreb, Croatia.

References

Domain

Croatian newspaper text from Southeast European Times.

Size

Version 1 contains 178,981 tokens in 7995 sentences, yielding 22.39 tokens per sentence on average. The file is a mixture of trees and non-trees, as only 2490 sentences have been annotated on the syntactic level. Part of the corpus (up to line number 93124) contains manually assigned lemmas and morphosyntactic descriptions (tags), while the rest contains automatic morphological annotation.

The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average.
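The averages above follow directly from the token and sentence counts. A quick check in Python, using only the figures quoted in this section (the function name is just for illustration):

```python
# Token and sentence counts as quoted in this section.
counts = {
    "version 1": (178981, 7995),
    "improved pre-release": (83640, 3736),
}


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length, rounded to two decimals."""
    return round(tokens / sentences, 2)


for name, (tokens, sentences) in counts.items():
    print(name, tokens_per_sentence(tokens, sentences))
```

Both versions happen to round to the same value, 22.39.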

There is no official training-test division of the original data. For HamleDT, we have split the data 90:10, i.e. the first 3362 sentences (75236 tokens) for training and the remaining 374 sentences (8404 tokens) for testing.
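A minimal sketch of such a split, assuming the sentences are kept in corpus order and the training part is rounded down to a whole sentence. The rounding rule is an assumption; it reproduces the 3362/374 figures above but is not taken from the HamleDT code.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(sentences: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Split 90:10 in corpus order, rounding the training part down."""
    cut = len(sentences) * 9 // 10  # integer arithmetic avoids float rounding
    return list(sentences[:cut]), list(sentences[cut:])


# Stand-in for the 3736 sentences of the improved pre-release version.
train, test = split_train_test(range(3736))
print(len(train), len(test))  # prints: 3362 374
```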

Inside

All sentences in the improved pre-release version are manually annotated on the morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the Size section above.

The treebank is distributed in the CoNLL 2006 file format. Multext-East morphosyntactic tags appear in both the CPOS and POS columns, while the FEAT column is empty.
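A minimal reader for this layout might look as follows. It is a sketch, not an official tool: the `Token` and `read_conll` names are made up for illustration, and it assumes ten whitespace-separated columns per token (the distributed files use tabs) with blank lines between sentences.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str    # Multext-East tag
    pos: str     # the same Multext-East tag again
    feats: str   # "_" (empty) in this treebank
    head: int
    deprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        cols = line.split()
        if not cols:
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# First sentence of the sample on this page.
sample = [
    "1 Proces proces Ncmsn Ncmsn _ 0 Elp _ _",
    "2 privatizacije privatizacija Ncfsg Ncfsg _ 1 Obj _ _",
    "3 na na Sl Sl _ 1 Prep _ _",
    "4 Kosovu Kosovo Npnsl Npnsl _ 3 Adv _ _",
    "5 pod pod Si Si _ 0 Prep _ _",
    "6 povećalom povećalo Ncnsi Ncnsi _ 5 Elp _ _",
]
for sentence in read_conll(sample):
    print([(tok.form, tok.cpos, tok.head, tok.deprel) for tok in sentence])
```

The same function accepts an open file object, since iterating over a text file yields its lines.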

In Version 1, a sentence in which any token has an empty (“_”) value in the DEPREL column has not been syntactically annotated. The HEAD column of such a sentence still contains numbers, but these are fake head links; typically they refer to the same node.
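The rule above can be turned into a simple filter. The sketch below assumes each sentence is a list of column lists, with DEPREL as the eighth column; the function name is hypothetical, and the unannotated example is constructed for illustration rather than taken from the corpus.

```python
from typing import List

DEPREL = 7  # zero-based index of the DEPREL column in CoNLL 2006


def is_syntactically_annotated(sentence: List[List[str]]) -> bool:
    """True if no token in the sentence has an empty ("_") DEPREL."""
    return all(token[DEPREL] != "_" for token in sentence)


# Two tokens from the sample on this page (annotated).
annotated = [
    "1 Barem barem Rgp Rgp _ 2 Oth _ _".split(),
    "2 na na Sl Sl _ 5 Prep _ _".split(),
]
# Constructed example: empty DEPREL and fake self-referring heads.
unannotated = [
    "1 Barem barem Rgp Rgp _ 1 _ _ _".split(),
    "2 na na Sl Sl _ 2 _ _ _".split(),
]
print(is_syntactically_annotated(annotated))    # prints: True
print(is_syntactically_annotated(unannotated))  # prints: False
```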

All sentences in the improved pre-release contain dependency information; however, in a few places there are errors introduced by the annotation software that result in a cyclic graph (not a tree).
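Such sentences can be found by following the head links from every token and checking whether the walk returns to a token already on the current path. The sketch below assumes the HEAD values of one sentence are given as a list of integers in token order, with 0 for the artificial root; `has_cycle` is an illustrative name, not part of any treebank tool.

```python
from typing import List


def has_cycle(heads: List[int]) -> bool:
    """heads[i] is the HEAD of token i + 1; 0 denotes the artificial root."""
    n = len(heads)
    # 0 = not visited, 1 = on the current path, 2 = known to reach the root
    state = [0] * (n + 1)
    state[0] = 2
    for start in range(1, n + 1):
        path = []
        node = start
        while state[node] == 0:
            state[node] = 1
            path.append(node)
            node = heads[node - 1]
        if state[node] == 1:
            return True  # walked back into the current path
        for visited in path:
            state[visited] = 2
    return False


# HEAD column of the second sample sentence on this page: a proper forest.
print(has_cycle([3, 3, 0, 3, 4, 3, 6, 9, 7, 0]))  # prints: False
# Constructed example: tokens 1 and 2 point at each other.
print(has_cycle([2, 1]))                          # prints: True
```

Each token is visited once, so the check is linear in sentence length.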

The syntactic tags (DEPREL) are simplistic but somewhat inspired by the Prague Dependency Treebank. There are only 15 of them:

Tag   Percent  Example        Description
Adv   5%       Kosovu         adverbial modifier
Ap    3%       Esat           appositional modifier, incl. first name attached to last name
Atr   26%      privatizacije  attribute modifying a noun phrase
Atv   2%       iskoristiti    ?
Aux   7%       se             ?
Co    3%       a              conjunction as coordination head (Prague-style coordinations)
Elp   0.6%     Proces         ellipsis
Obj   7%       privatizacije  object of a verb
Oth   2%       Barem          other
Pnom  2%       složen         nominal predicate attached to copula
Pred  10%      analizira      predicate (verbal)
Prep  10%      na             preposition
Punc  13%      .              punctuation
Sb    7%       Kosovo         subject
Sub   4%       da             subordinating conjunction

(The sum of the percentages exceeds 100% because of rounding.)
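A distribution like the one in the table can be recomputed by counting the values of the DEPREL column. The following is a sketch under the assumption that token lines have whitespace-separated CoNLL 2006 columns; `deprel_percentages` is a hypothetical helper, and the rounding used for the table is not documented here.

```python
from collections import Counter
from typing import Dict, Iterable


def deprel_percentages(lines: Iterable[str]) -> Dict[str, float]:
    """Percentage of tokens per DEPREL tag, most frequent tag first."""
    counts = Counter(line.split()[7] for line in lines if line.strip())
    total = sum(counts.values())
    return {tag: 100.0 * n / total for tag, n in counts.most_common()}


# Third sample sentence from this page.
lines = [
    "1 Barem barem Rgp Rgp _ 2 Oth _ _",
    "2 na na Sl Sl _ 5 Prep _ _",
    "3 papiru papir Ncmsl Ncmsl _ 2 Obj _ _",
    "4 , , Z Z _ 2 Punc _ _",
    "5 izgleda izgledati Vmr3s Vmr3s _ 0 Pred _ _",
    "6 kao kao Cs Cs _ 8 Oth _ _",
    "7 odlična odličan Agpfsn Agpfsn _ 8 Atr _ _",
    "8 ideja ideja Ncfsn Ncfsn _ 5 Adv _ _",
    "9 . . Z Z _ 0 Punc _ _",
]
for tag, percent in deprel_percentages(lines).items():
    print(f"{tag}\t{percent:.1f}%")
```

When run on Version 1, tokens of unannotated sentences would show up under the tag "_" and should be excluded first.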

Sample

The first three sentences of the improved pre-release version:

1 Proces proces Ncmsn Ncmsn _ 0 Elp _ _
2 privatizacije privatizacija Ncfsg Ncfsg _ 1 Obj _ _
3 na na Sl Sl _ 1 Prep _ _
4 Kosovu Kosovo Npnsl Npnsl _ 3 Adv _ _
5 pod pod Si Si _ 0 Prep _ _
6 povećalom povećalo Ncnsi Ncnsi _ 5 Elp _ _

1 Kosovo Kosovo Npnsn Npnsn _ 3 Sb _ _
2 ozbiljno ozbiljno Rgp Rgp _ 3 Adv _ _
3 analizira analizirati Vmr3s Vmr3s _ 0 Pred _ _
4 proces proces Ncmsan Ncmsan _ 3 Obj _ _
5 privatizacije privatizacija Ncfsg Ncfsg _ 4 Atr _ _
6 u u Sl Sl _ 3 Prep _ _
7 svjetlu svjetlo Ncnsl Ncnsl _ 6 Obj _ _
8 učestalih učestao Agpfpg Agpfpg _ 9 Atr _ _
9 pritužbi pritužba Ncfpg Ncfpg _ 7 Atr _ _
10 . . Z Z _ 0 Punc _ _

1 Barem barem Rgp Rgp _ 2 Oth _ _
2 na na Sl Sl _ 5 Prep _ _
3 papiru papir Ncmsl Ncmsl _ 2 Obj _ _
4 , , Z Z _ 2 Punc _ _
5 izgleda izgledati Vmr3s Vmr3s _ 0 Pred _ _
6 kao kao Cs Cs _ 8 Oth _ _
7 odlična odličan Agpfsn Agpfsn _ 8 Atr _ _
8 ideja ideja Ncfsn Ncfsn _ 5 Adv _ _
9 . . Z Z _ 0 Punc _ _

Parsing

Nonprojectivities in SETimes.HR are rare. Only 461 of the 83640 tokens in the pre-release version are attached nonprojectively (0.55%).
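This page does not say how the count was obtained. One common definition, assumed in the sketch below, calls a token nonprojectively attached if some token lying strictly between it and its head is not dominated by that head. The function name is illustrative; the input is the HEAD column of one sentence, with 0 for the artificial root.

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return IDs of tokens whose incoming arc is nonprojective.

    heads[i] is the HEAD of token i + 1; 0 is the artificial root.
    """
    n = len(heads)

    def dominated_by(node: int, ancestor: int) -> bool:
        # Climb towards the root; the step limit guards against cycles.
        for _ in range(n + 1):
            if node == ancestor:
                return True
            if node == 0:
                return False
            node = heads[node - 1]
        return False

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the third sample sentence on this page: fully projective.
print(nonprojective_tokens([2, 5, 2, 2, 0, 8, 8, 5, 0]))  # prints: []
# Constructed example: the arc from token 4 to token 2 crosses the root token 3.
print(nonprojective_tokens([3, 4, 0, 3]))                 # prints: [2]
# Share reported above: 461 of 83640 tokens.
print(round(100 * 461 / 83640, 2))                        # prints: 0.55
```

Under this definition, tokens attached directly to the artificial root are always counted as projective.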

Are there any published parsing results on this corpus?

