user:zeman:treebanks:eu [ufal wiki]

CoNLL 2007
Extended version (2011?)

There does not seem to be any regular distribution channel for the Basque Dependency Treebank (BDT). The CoNLL 2007 version had a restricted license that was valid only for the duration of the shared task. Republication of the CoNLL version through the LDC is planned but has not happened yet. In the meantime, one can ask Koldo Gojenola (koldo (dot) gojenola (at) ehu (dot) es) about the availability of the corpus.

BDT was created by members of the IXA Group (IXA taldea), University of the Basque Country (Euskal Herriko Unibertsitatea), 649 Posta kutxa, E-20080 Donostia, Spain.

Website
- no website dedicated to the treebank
Data
- no separate citation
Principal publications
- Itziar Aduriz, María Jesús Aranzabe, José María Arriola, Aitziber Atutxa, Arantza Díaz de Ilarraza, Aitzpea Garmendia, Maite Oronoz: Construction of a Basque Dependency Treebank. In: Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT 2003), pp. 149-160, Växjö, Sweden, 2003.
Documentation
- A description of the tags and feature values is hard to find; the doc/README file in the CoNLL 2007 data distribution is not very informative. See below for information obtained by e-mail.
- María Jesús Aranzabe, José Mari Arriola, Aitziber Atutxa, Irene Balza, Larraitz Uria: Guía para la anotación sintáctica manual de Eus3LB (corpus del euskera anotado a nivel sintáctico, semántico y pragmático). UPV/EHU/LSI/TR 13-2003, Donostia, Spain, 2003.

Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).

The CoNLL 2007 version contains 70223 tokens in 2902 sentences, yielding 24.20 tokens per sentence on average (CoNLL 2007 data split: 65419 tokens / 2705 sentences training, 4804 tokens / 197 sentences test).
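The size figures above can be cross-checked with a few lines of Python. This is a minimal sketch, assuming the usual CoNLL layout of one token per line with sentences separated by blank lines; the function name `corpus_size` is a hypothetical helper, not part of any distributed tool.

```python
def corpus_size(conll_text):
    """Return (tokens, sentences, tokens_per_sentence) for CoNLL-format text.

    Assumption: one token per non-blank line, blank lines separate sentences.
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            tokens += 1
            if not in_sentence:
                # First token after a blank line (or file start) opens a sentence.
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    average = tokens / sentences if sentences else 0.0
    return tokens, sentences, average


# Consistency check of the figures quoted in the text.
train_tokens, train_sentences = 65419, 2705
test_tokens, test_sentences = 4804, 197
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
print(total_tokens, total_sentences, round(total_tokens / total_sentences, 2))
```

The training and test counts add up to the stated totals, and the quotient rounds to the stated average.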

The syntactic annotation style and the tagset for dependency relations (analytical functions) in GDT have been modeled after the Prague Dependency Treebank.

Part-of-speech tag descriptions (obtained by e-mail from Koldo Gojenola, thanks!):

IZE = noun
- ARR = common
- IZB = proper name
- LIB = place name
- ZKI = number
ADJ = adjective
- ARR = common
- GAL = question
ADI = verb
- SIN = simple
- ADK = composed
- ADP = periphrastic
- FAK = factitive
ADB = adverb
- ARR = common
- GAL = question
DET = determiner
- ERKARR = demonstrative common
- ERKIND = demonstrative emphatic
- NOLARR = indefinite common
- NOLGAL = indefinite question
- ZNB = number
- DZH = definite
- BAN = distributive
- ORD = ordinal
- DZG = indefinite
- ORO = general
IOR = pronoun
- PERARR = personal common
- PERIND = personal emphatic
- IZGMGB = indefinite
- IZGGAL = question
- BIH = reflexive
- ELK = reciprocal
LOT = link
- LOK = connector
- JNT = conjunction
PRT = particle
ITJ = interjection
BST = other
ADL = auxiliary verb
ADT = synthetic verb
SIG = acronym
SNB = symbol
LAB = abbreviation

Main features:

KAS = case (ERG = ergative, ABS = absolutive, DAT = dative…)
ASP = aspect
ERL = relation (relative sentence, completive sentence, indirect question…)
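Because subtype codes such as ARR or GAL recur under several main categories, a lookup has to be keyed by the main tag first. The sketch below illustrates this with part of the list above. The names `COARSE`, `FINE` and `describe_tag` are hypothetical, the subtype table is deliberately partial, and how the two codes are distributed over the CPOSTAG and POSTAG columns of the data is not asserted here.

```python
# Main categories, taken from the list above.
COARSE = {
    "IZE": "noun", "ADJ": "adjective", "ADI": "verb", "ADB": "adverb",
    "DET": "determiner", "IOR": "pronoun", "LOT": "link", "PRT": "particle",
    "ITJ": "interjection", "BST": "other", "ADL": "auxiliary verb",
    "ADT": "synthetic verb", "SIG": "acronym", "SNB": "symbol",
    "LAB": "abbreviation",
}

# Subtypes are scoped by main category (ARR occurs under IZE, ADJ and ADB).
# Partial on purpose: extend with the remaining entries as needed.
FINE = {
    "IZE": {"ARR": "common", "IZB": "proper name", "LIB": "place name", "ZKI": "number"},
    "ADJ": {"ARR": "common", "GAL": "question"},
    "ADI": {"SIN": "simple", "ADK": "composed", "ADP": "periphrastic", "FAK": "factitive"},
    "ADB": {"ARR": "common", "GAL": "question"},
    "LOT": {"LOK": "connector", "JNT": "conjunction"},
}


def describe_tag(coarse, fine=None):
    """Return a readable gloss for a main tag and an optional subtype."""
    if coarse not in COARSE:
        raise KeyError(f"unknown main tag: {coarse}")
    gloss = COARSE[coarse]
    if fine is None:
        return gloss
    # Fall back to the raw code when the subtype is not in the partial table.
    return f"{gloss}, {FINE.get(coarse, {}).get(fine, fine)}"


print(describe_tag("IZE", "ARR"))
print(describe_tag("LOT", "JNT"))
print(describe_tag("PRT"))
```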

The first sentence of the CoNLL 2007 training data:

1	“	”	PUNCT	PUNCT	_	10	AuxG	_	_
2	Τα	ο	At	AtDf	Ne|Pl|Nm	3	Atr	_	_
3	αντισώματα	αντίσωμα	No	NoCm	Ne|Pl|Nm	5	Sb	_	_
4	IgG	IgG	Rg	RgFwOr	_	3	Atr	_	_
5	είναι	είμαι	Vb	VbMn	Id|Pr|03|Pl|Xx|Ip|Pv|Xx	10	Obj_Co	_	_
6	σαν	σαν	Ad	Ad	Ba	5	Adv	_	_
7	μακροπρόθεσμη	μακροπρόθεσμος	Aj	Aj	Ba|Fe|Sg|Nm	8	Atr	_	_
8	μνήμη	μνήμη	No	NoCm	Fe|Sg|Nm	6	Adv	_	_
9	,	,	PUNCT	PUNCT	_	10	AuxX	_	_
10	ενώ	ενώ	Cj	CjCo	_	26	Coord	_	_
11	το	ο	At	AtDf	Ne|Sg|Nm	12	Atr	_	_
12	IgA	IgA	Rg	RgFwOr	_	15	Sb	_	_
13	πιστεύεται	πιστεύεται	Vb	VbMn	Id|Pr|03|Sg|Xx|Ip|Pv|Xx	10	Obj_Co	_	_
14	ότι	ότι	Cj	CjSb	_	13	AuxC	_	_
15	είναι	είμαι	Vb	VbMn	Id|Pr|03|Sg|Xx|Ip|Pv|Xx	14	Sb	_	_
16	ένας	ένας	At	AtId	Ma|Sg|Nm	18	Atr	_	_
17	συγκεκριμένος	συγκεκριμένος	Aj	Aj	Ba|Ma|Sg|Nm	18	Atr	_	_
18	δείκτης	δείκτης	No	NoCm	Ma|Sg|Nm	15	Pnom	_	_
19	για	για	AsPp	AsPpSp	_	18	AuxP	_	_
20	πρόσφατες	πρόσφατος	Aj	Aj	Ba|Fe|Pl|Ac	21	Atr_Co	_	_
21	ή	ή	Cj	CjCo	_	23	Coord	_	_
22	χρόνιες	χρόνιος	Aj	Aj	Ba|Fe|Pl|Ac	21	Atr_Co	_	_
23	λοιμώξεις	λοίμωξη	No	NoCm	Fe|Pl|Ac	19	Atr	_	_
24	“	”	PUNCT	PUNCT	_	10	AuxG	_	_
25	,	,	PUNCT	PUNCT	_	10	AuxX	_	_
26	εξηγεί	εξηγώ	Vb	VbMn	Id|Pr|03|Sg|Xx|Ip|Av|Xx	0	Pred	_	_
27	η	ο	At	AtDf	Fe|Sg|Nm	28	Atr	_	_
28	Δρ	Δρ	Rg	RgFwTr	_	26	Sb	_	_
29	Αρκάρι	Αρκάρι	No	NoCm	Ne|Sg|Nm	28	Atr	_	_
30	.	.	PUNCT	PUNCT	_	0	AuxK	_	_

The first sentence of the CoNLL 2007 test data:

1	Η	ο	At	AtDf	Fe|Sg|Nm	2	Atr	_	_
2	Σίφνος	Σίφνος	No	NoPr	Fe|Sg|Nm	3	Sb	_	_
3	φημίζεται	φημίζομαι	Vb	VbMn	Id|Pr|03|Sg|Xx|Ip|Pv|Xx	0	Pred	_	_
4	και	και	Cj	CjCo	_	5	AuxY	_	_
5	για	για	AsPp	AsPpSp	_	3	AuxP	_	_
6	τα	ο	At	AtDf	Ne|Pl|Ac	8	Atr	_	_
7	καταγάλανα	καταγάλανος	Aj	Aj	Ba|Ne|Pl|Ac	8	Atr	_	_
8	νερά	νερό	No	NoCm	Ne|Pl|Ac	5	Obj	_	_
9	των	ο	At	AtDf	Fe|Pl|Ge	11	Atr	_	_
10	πανέμορφων	πανέμορφος	Aj	Aj	Ba|Fe|Pl|Ge	11	Atr	_	_
11	ακτών	ακτή	No	NoCm	Fe|Pl|Ge	8	Atr	_	_
12	της	μου	Pn	PnPo	Fe|03|Sg|Ge|Xx	11	Atr	_	_
13	.	.	PUNCT	PUNCT	_	0	AuxK	_	_
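The samples above use the ten tab-separated columns of the CoNLL-X format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with `_` for an empty value and `|` separating features. A minimal reader might look as follows; the names `Token`, `parse_token` and `read_sentences` are illustrative, not an existing library API.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: List[str]
    head: int
    deprel: str
    phead: str
    pdeprel: str


def parse_token(line: str) -> Token:
    """Split one CoNLL-X token line into its ten columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # An underscore marks an empty feature set.
    feats = [] if cols[5] == "_" else cols[5].split("|")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7], cols[8], cols[9])


def read_sentences(text: str) -> Iterator[List[Token]]:
    """Yield sentences (lists of tokens); blank lines separate sentences."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# Token 8 of the test sentence shown above.
row = "8\tνερά\tνερό\tNo\tNoCm\tNe|Pl|Ac\t5\tObj\t_\t_"
token = parse_token(row)
print(token.form, token.head, token.deprel, token.feats)
```

HEAD is kept as an integer so that 0 can be compared directly when looking for root attachments.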

Nonprojective attachments are infrequent in GDT: only 823 of the 70223 tokens in the CoNLL 2007 version (1.17%) are attached nonprojectively.
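One common way to count nonprojective attachments is sketched below. It assumes the definition that a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head, with 0 treated as an artificial root that dominates everything. Whether the figure above was computed with exactly this definition is an assumption, and the function names are hypothetical.

```python
def dominates(heads, ancestor, node):
    """True if `ancestor` is reached by following head links up from `node`.

    heads[i - 1] is the head of token i; 0 is the artificial root.
    """
    steps = 0
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node - 1]
        steps += 1
        if steps > len(heads):
            raise ValueError("cycle in dependency structure")
    # The artificial root dominates every token.
    return ancestor == 0


def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens attached nonprojectively."""
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        low, high = sorted((dep, head))
        # Every token strictly between dependent and head must be
        # a descendant of the head for the arc to be projective.
        if any(not dominates(heads, head, k) for k in range(low + 1, high)):
            result.append(dep)
    return result


# HEAD column of the test sentence shown in the sample section.
sample_heads = [2, 3, 0, 5, 3, 8, 8, 5, 11, 11, 8, 11, 0]
print(nonprojective_tokens(sample_heads))

# Share of nonprojective tokens quoted in the text.
print(round(100 * 823 / 70223, 2))
```

Applied to the HEAD column of every sentence, the number of returned ids divided by the token count gives the nonprojectivity rate.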

The results of the CoNLL 2007 shared task are available online and were published in (Nivre et al., 2007). Unlike in CoNLL 2006, the evaluation procedure included punctuation tokens. These are the best results for Greek:

Parser (Authors)	LAS	UAS
Nakagawa	76.31	84.08
Keith Hall et al.	74.21	82.04
Carreras	73.56	81.37
Malt (Nilsson et al.)	74.65	81.22
Titov et al.	73.52	81.20
Chen	74.42	81.16
Duan	74.29	80.77
Attardi et al.	73.92	80.75
Malt (J. Hall et al.)	74.21	80.66
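For reference, the two scores in the table are the labeled attachment score (LAS: percentage of tokens with the correct head and the correct dependency label) and the unlabeled attachment score (UAS: percentage of tokens with the correct head). The sketch below computes both over all tokens, punctuation included. The function name and the toy data are illustrative and this is not the official evaluation script.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent.

    Both arguments are lists of (head, deprel) pairs, one per token,
    in the same order. All tokens count, including punctuation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("no tokens to evaluate")
    correct_heads = 0
    correct_labeled = 0
    for (gold_head, gold_rel), (pred_head, pred_rel) in zip(gold, predicted):
        if gold_head == pred_head:
            correct_heads += 1
            # A labeled match requires the head to be right as well.
            if gold_rel == pred_rel:
                correct_labeled += 1
    total = len(gold)
    return 100.0 * correct_labeled / total, 100.0 * correct_heads / total


# Toy example with four tokens (made-up parser output).
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (0, "AuxK")]
pred = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "AuxK")]
las, uas = attachment_scores(gold, pred)
print(f"LAS {las:.2f}  UAS {uas:.2f}")
```

Since a labeled match presupposes a correct head, LAS can never exceed UAS, which is consistent with every row of the table.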

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.
