Croatian (hr)

Versions

Version 1, available on-line
As yet unreleased version (referred to below as the improved pre-release version), obtained 2014-07-16 from Željko Agić

Obtaining and License

The corpus is available on-line for free download under the CC BY-SA 3.0 license. The license in short:

use for whatever work you want
redistribution permitted under the same license
cite their paper in publications

SETimes.HR was created by Željko Agić (Universität Potsdam) and Nikola Ljubešić (Filozofski fakultet Sveučilišta u Zagrebu), Ivana Lučića 3, HR-10000 Zagreb, Croatia.

References

Website
- http://nlp.ffzg.hr/resources/corpora/setimes-hr/
Data
- no separate citation
Principal publications
- Željko Agić, Nikola Ljubešić: The SETimes.HR Linguistically Annotated Corpus of Croatian. In: Proceedings of LREC 2014, pp. 1724–1727. Reykjavík, Iceland, 2014.
Documentation
- Multext-East v5 Croatian Tagset, 2013.
- A discussion of the syntactic tags is in Danijela Merkler, Željko Agić, Ana Agić: Babel Treebank of Public Messages in Croatian. In: Procedia – Social and Behavioral Sciences, vol. 95, pp. 490–497, 2013.

Domain

Croatian newspaper text from Southeast European Times.

Size

Version 1 contains 178,981 tokens in 7995 sentences, yielding 22.39 tokens per sentence on average. The file is a mixture of trees and non-trees, as only 2490 sentences have been annotated on the syntactic level. Part of the corpus (up to line number 93124) contains manually assigned lemmas and morphosyntactic descriptions (tags), while the rest contains automatic morphological annotation.

The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average.

There is no official training-test division of the original data. For HamleDT, we have split the data 90:10, i.e. the first 3362 sentences (75236 tokens) for training and the remaining 374 sentences (8404 tokens) for testing.
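The arithmetic of the 90:10 split can be checked with a few lines of Python. This is only a sketch: the function name `split_train_test` is hypothetical, and rounding the boundary down is an assumption that happens to reproduce the sentence counts given above.

```python
def split_train_test(sentences, train_share=9, total_share=10):
    """Split sentences by their order: the first part for training, the rest for testing."""
    # Integer arithmetic rounds the boundary down (assumed rounding rule).
    n_train = len(sentences) * train_share // total_share
    return sentences[:n_train], sentences[n_train:]


# Stand-in for the 3736 sentences of the pre-release version.
sentences = list(range(3736))
train, test = split_train_test(sentences)
print(len(train), len(test))  # 3362 374
```

The token counts (75236 and 8404) cannot be derived this way; they depend on the actual sentence lengths.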

Inside

All sentences in the improved pre-release version are manually annotated on morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation (see the Size section above).

The treebank is distributed in the CoNLL 2006 file format. Multext-East morphosyntactic tags appear in both the CPOS and POS columns, while the FEAT column is empty.
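For illustration, a minimal reader for this format might look as follows. It is a sketch under the assumption that every token line has exactly the ten tab-separated CoNLL 2006 columns and that sentences are separated by blank lines; the names `Token`, `read_conll` and `SAMPLE` are made up for this example, and the sample lines are a shortened excerpt of the treebank sample.

```python
from collections import namedtuple

# The ten columns of the CoNLL 2006 (CoNLL-X) format, in order.
Token = namedtuple(
    "Token", "id form lemma cpostag postag feats head deprel phead pdeprel"
)


def read_conll(lines):
    """Yield sentences (lists of Token) from an iterable of CoNLL 2006 lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        token = Token(*line.split("\t"))  # assumes exactly ten columns
        sentence.append(token._replace(id=int(token.id), head=int(token.head)))
    if sentence:
        yield sentence


# Shortened excerpt, for illustration only.
SAMPLE = [
    "1\tKosovo\tKosovo\tNpnsn\tNpnsn\t_\t3\tSb\t_\t_\n",
    "2\tozbiljno\tozbiljno\tRgp\tRgp\t_\t3\tAdv\t_\t_\n",
    "3\tanalizira\tanalizirati\tVmr3s\tVmr3s\t_\t0\tPred\t_\t_\n",
    "\n",
    "1\tBarem\tbarem\tRgp\tRgp\t_\t2\tOth\t_\t_\n",
]

sentences = list(read_conll(SAMPLE))
first = sentences[0][0]
print(first.form, first.cpostag, first.postag, first.feats)
```

The printed token shows the same Multext-East tag in both CPOS and POS and the empty FEAT column.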

In Version 1, a sentence has not been syntactically annotated if any of its tokens has an empty (“_”) value in the DEPREL column. Such sentences still have numbers in the HEAD column, but these are fake head links; typically they refer to the same node.
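The DEPREL test for Version 1 can be written down directly. The sketch below works on the list of DEPREL values of one sentence; the function name `is_syntactically_annotated` and the second example list are hypothetical.

```python
def is_syntactically_annotated(deprels):
    """Return True if no token of the sentence has the empty DEPREL value "_"."""
    return all(deprel != "_" for deprel in deprels)


# DEPREL column of the third sample sentence (annotated).
annotated = ["Oth", "Prep", "Obj", "Punc", "Pred", "Oth", "Atr", "Adv", "Punc"]
# Made-up sentence without syntactic annotation.
unannotated = ["_", "_", "_"]

print(is_syntactically_annotated(annotated))    # True
print(is_syntactically_annotated(unannotated))  # False
```

Filtering Version 1 with such a test should keep only the syntactically annotated sentences, since the HEAD column alone cannot be trusted.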

All sentences in the improved pre-release version contain dependency information; however, in a few places there are errors introduced by the annotation software that result in a cyclic graph (not a tree).

The syntactic tags (DEPREL) are simplistic but somewhat inspired by the Prague Dependency Treebank. There are only 15 of them:

Tag	Percent	Example	Description
Adv	5%	Kosovu	adverbial modifier
Ap	3%	Esat	appositional modifier, incl. first name attached to last name
Atr	26%	privatizacije	attribute modifying a noun phrase
Atv	2%	iskoristiti	?
Aux	7%	se	?
Co	3%	a	conjunction as coordination head (Prague-style coordinations)
Elp	0.6%	Proces	ellipsis
Obj	7%	privatizacije	object of a verb
Oth	2%	Barem	other
Pnom	2%	složen	nominal predicate attached to copula
Pred	10%	analizira	predicate (verbal)
Prep	10%	na	preposition
Punc	13%	.	punctuation
Sb	7%	Kosovo	subject
Sub	4%	da	subordinating conjunction

(The sum of the percentages exceeds 100% because of rounding.)
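A label distribution of this kind can be recomputed by counting DEPREL values. The following sketch does so for a single sentence; the helper name `deprel_percentages` is hypothetical, and the input is the DEPREL column of the second sample sentence, not the whole treebank.

```python
from collections import Counter


def deprel_percentages(deprels):
    """Map each DEPREL label to its share (in percent) of all tokens."""
    counts = Counter(deprels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# DEPREL column of the second sample sentence.
labels = ["Sb", "Adv", "Pred", "Obj", "Atr", "Prep", "Obj", "Atr", "Atr", "Punc"]
shares = deprel_percentages(labels)
for label, share in sorted(shares.items()):
    print(f"{label}\t{share:.0f}%")
```

Rounding each share separately, as in the table, is what lets the rounded values add up to slightly more or less than 100%.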

Cycles

Eight dependency graphs in the pre-release version contain cycles. Most of these are individual nodes attached to themselves (according to Željko, self-attachment is the default in the annotation software, so the annotator probably just forgot to attach the nodes). Five of them are punctuation nodes, and fixing their attachment should be relatively easy. The only complicated case is sentence #25 in the test file, whose dependency graph is wrong in multiple places.

train/006#247:a
Analitičari upozoravaju na kosovski trend: osnivanje novih političkih stranaka neposredno prije izbora, a od strane ljudi iz već postojećih političkih stranaka ili nekog drugog aspekta javnog života.

train/006#381:mnogo
“Ne možemo mnogo učiniti kako bismo je spriječili da ide malo šetati ili plivati.

train/006#399:,
U međuvremenu, troškovi života porasli su: najamnina za mali stan u Podgorici iznosi oko 200 eura mjesečno – što mnogima otežava spajanje kraja s krajem.

train/007#6:”
“Nije riječ o tome da imamo jednu političku opciju koja tvrdi kako piramidu ne bi trebalo uništiti, dok druga smatra da je treba uništiti.

train/007#190:.
“Moramo biti svjesni kako se kod naroda stvara strah”, izjavio je čelnik stranke ORA Veton Surroi kosovskom dnevniku Express, piše Reuters.

train/007#359:,
Ulaganja u Srbiji dosegnula su rekordnih 1,5 milijardi eura u 2005. godini, priopćila je u srijedu vlada, izražavajući očekivanja glede nastavka rasta i u sljedećoj godini.

One more Punc-CYCLE:1 occurred somewhere else.

test/001#25:toga
Rezultat je toga da je artikulacija praktičnih zajedničkih interesa postala teža, kao i definiranje konkretnih misija.
English gloss (via the original Czech translation made with the help of Google Translate):
The result of this is that the articulation of practical common interests has become harder, as has the definition of concrete missions.
This is probably the only interesting case. It is not a matter of a node attached to itself: “Rezultat” is attached to “postala”, “postala” to “da”, “da” to “toga”, and “toga” was to be attached back to “Rezultat”. The sentence also contains quite wild nonprojectivity. In my opinion the whole sentence is analyzed incorrectly (there are several other errors), and we should redo it completely during harmonization.
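Both kinds of cycle listed in this section (a node attached to itself and a longer loop such as the one in test/001#25) can be found by following HEAD links upward from each token. The sketch below is one possible check, not the procedure actually used for HamleDT. The function name `nodes_in_cycles` is hypothetical, and both example HEAD lists are schematic toy inputs rather than the real annotation.

```python
def nodes_in_cycles(heads):
    """Return the sorted ids of tokens that lie on a cycle.

    heads[i - 1] is the HEAD of token i; 0 denotes the artificial root.
    """
    on_cycle = set()
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        # Walk upward until the root or an already visited node is reached.
        while node != 0 and node not in seen:
            seen.add(node)
            node = heads[node - 1]
        if node == start:
            # The walk returned to its starting point, so start is on a cycle.
            on_cycle.add(start)
    return sorted(on_cycle)


# Toy sentence whose last token (e.g. punctuation) is attached to itself.
self_loop = [3, 3, 0, 4]
# Schematic 4-token loop: token 1 -> 4 -> 3 -> 2 -> 1 (toy indices).
long_loop = [4, 1, 2, 3]

print(nodes_in_cycles(self_loop))  # [4]
print(nodes_in_cycles(long_loop))  # [1, 2, 3, 4]
```

Tokens that merely lead into a cycle without lying on it are not reported, which keeps the output focused on the links that need re-attachment.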

OTHER:
In sentence train/006#247 above, “političkih” in “političkih stranaka” is labeled as an apposition. This error should be fixed: if an adjective labeled as an apposition is attached to the following noun and agrees with it in gender, number and case, the correct label is Atr, not Ap.

In sentence test/001#1, the root is the auxiliary verb “je”, and its DEPREL is Aux rather than Pred!
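The relabeling rule for adjectives marked Ap could be sketched as below. Several things here are assumptions: the positions of gender, number and case in the tags (adjective tags such as Agpfpg are read as category, type, degree, gender, number, case; noun tags such as Ncfpg as category, type, gender, number, case), the reading of "following noun" as any head to the right, and the names `agrees` and `fix_adjectival_ap`. The example tokens borrow the tag pair of “učestalih pritužbi” from the sample, with the label changed to Ap for illustration.

```python
def agrees(adj_tag, noun_tag):
    """Compare gender, number and case of an adjective tag and a noun tag."""
    if len(adj_tag) < 6 or len(noun_tag) < 5:
        return False
    # Assumed positions: A-type-degree-G-N-C versus N-type-G-N-C.
    return adj_tag[3:6] == noun_tag[2:5]


def fix_adjectival_ap(tokens):
    """Return the DEPREL list with Ap changed to Atr for agreeing adjectives.

    tokens is a list of (id, tag, head, deprel) tuples for one sentence.
    """
    tag_of = {tid: tag for tid, tag, _, _ in tokens}
    fixed = []
    for tid, tag, head, deprel in tokens:
        head_tag = tag_of.get(head, "")
        if (deprel == "Ap" and tag.startswith("A") and head > tid
                and head_tag.startswith("N") and agrees(tag, head_tag)):
            deprel = "Atr"
        fixed.append(deprel)
    return fixed


# Illustrative adjective + noun pair; the Ap label is set here on purpose.
tokens = [
    (1, "Agpfpg", 2, "Ap"),
    (2, "Ncfpg", 0, "Atr"),
]
print(fix_adjectival_ap(tokens))  # ['Atr', 'Atr']
```

Genuine appositions between nouns are left untouched, because the rule only fires for tags starting with A.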

Sample

The first three sentences of the improved pre-release version:

1	Proces	proces	Ncmsn	Ncmsn	_	0	Elp	_	_
2	privatizacije	privatizacija	Ncfsg	Ncfsg	_	1	Obj	_	_
3	na	na	Sl	Sl	_	1	Prep	_	_
4	Kosovu	Kosovo	Npnsl	Npnsl	_	3	Adv	_	_
5	pod	pod	Si	Si	_	0	Prep	_	_
6	povećalom	povećalo	Ncnsi	Ncnsi	_	5	Elp	_	_

1	Kosovo	Kosovo	Npnsn	Npnsn	_	3	Sb	_	_
2	ozbiljno	ozbiljno	Rgp	Rgp	_	3	Adv	_	_
3	analizira	analizirati	Vmr3s	Vmr3s	_	0	Pred	_	_
4	proces	proces	Ncmsan	Ncmsan	_	3	Obj	_	_
5	privatizacije	privatizacija	Ncfsg	Ncfsg	_	4	Atr	_	_
6	u	u	Sl	Sl	_	3	Prep	_	_
7	svjetlu	svjetlo	Ncnsl	Ncnsl	_	6	Obj	_	_
8	učestalih	učestao	Agpfpg	Agpfpg	_	9	Atr	_	_
9	pritužbi	pritužba	Ncfpg	Ncfpg	_	7	Atr	_	_
10	.	.	Z	Z	_	0	Punc	_	_

1	Barem	barem	Rgp	Rgp	_	2	Oth	_	_
2	na	na	Sl	Sl	_	5	Prep	_	_
3	papiru	papir	Ncmsl	Ncmsl	_	2	Obj	_	_
4	,	,	Z	Z	_	2	Punc	_	_
5	izgleda	izgledati	Vmr3s	Vmr3s	_	0	Pred	_	_
6	kao	kao	Cs	Cs	_	8	Oth	_	_
7	odlična	odličan	Agpfsn	Agpfsn	_	8	Atr	_	_
8	ideja	ideja	Ncfsn	Ncfsn	_	5	Adv	_	_
9	.	.	Z	Z	_	0	Punc	_	_

Parsing

Nonprojectivities in SETimes.HR are rare. Only 461 of the 83640 tokens in the pre-release version are attached nonprojectively (0.55%).
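The page does not say how the nonprojective attachments were counted. One common definition, assumed in the sketch below, calls a token nonprojectively attached if some token lying between it and its head is not dominated by that head. The function names and the second HEAD list are hypothetical; the first HEAD list is taken from the second sample sentence.

```python
def is_dominated_by(node, ancestor, heads):
    """Return True if ancestor is reached by following HEAD links up from node."""
    for _ in range(len(heads)):  # bounded walk, so cyclic input cannot loop forever
        node = heads[node - 1]
        if node == ancestor:
            return True
        if node == 0:
            return False
    return False


def nonprojective_tokens(heads):
    """Return ids of tokens attached nonprojectively (heads[i - 1] is the HEAD of token i)."""
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = min(dep, head), max(dep, head)
        # Every token strictly between dep and head must be dominated by head.
        if any(not is_dominated_by(mid, head, heads) for mid in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the second sample sentence: fully projective.
projective = [3, 3, 0, 3, 4, 3, 6, 9, 7, 0]
# Toy example: token 2 is attached to token 4 across the root token 3.
crossing = [3, 4, 0, 3]

print(nonprojective_tokens(projective))  # []
print(nonprojective_tokens(crossing))    # [2]
print(round(100 * 461 / 83640, 2))       # 0.55
```

The last line only re-checks the percentage quoted above; the count of 461 itself comes from the full pre-release data.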

Are there any published parsing results on this corpus?
