
Institute of Formal and Applied Linguistics Wiki






Basque (eu)

Basque Dependency Treebank (BDT)

Versions

Obtaining and License

There does not seem to be any regular distribution channel for the Basque Dependency Treebank. The CoNLL 2007 version had a license restricted to the duration of the shared task. Republication of the CoNLL version via LDC is planned but has not happened yet. In the meantime, one can ask Koldo Gojenola (koldo (dot) gojenola (at) ehu (dot) es) about the availability of the corpus.

Informally agreed-upon terms:

BDT was created by members of the IXA Group (IXA taldea), University of the Basque Country (Euskal Herriko Unibertsitatea), 649 Posta kutxa, E-20080 Donostia, Spain.

References

Domain

Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).

Size

The CoNLL 2007 dataset was officially split into training and test parts. The data split of BDT-II was provided by Koldo Gojenola and should correspond to the data split used in the parsing experiments published by the IXA Group.

Version    | Train Sentences | Train Tokens | D-test Sentences | D-test Tokens | E-test Sentences | E-test Tokens | Total Sentences | Total Tokens | Sentence Length
CoNLL 2007 | 3190            | 50526        | –                | –             | 334              | 5390          | 3524            | 55916        | 15.87
BDT-II     | 9094            | 124684       | 1010             | 12625         | 1122             | 14295         | 11226           | 151604       | 13.50
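The average sentence length in the last column is simply total tokens divided by total sentences, which can be checked against the table directly:

```python
# Consistency check of the "Sentence Length" column above:
# average sentence length = total tokens / total sentences.
averages = {}
for name, tokens, sentences in [("CoNLL 2007", 55916, 3524),
                                ("BDT-II", 151604, 11226)]:
    averages[name] = round(tokens / sentences, 2)
    print(f"{name}: {averages[name]}")  # → 15.87 and 13.5
```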

Inside

Part-of-speech tag description (obtained by e-mail from Koldo Gojenola, thanks!):

Main features:

The syntactic guidelines (structure and labels) are described in Spanish in this technical report. See Appendix 3 for some lists of tags.

Sample

The first sentence of the CoNLL 2007 training data:

1 PUNCT PUNCT _ 10 AuxG _ _
2 Τα ο At AtDf Ne|Pl|Nm 3 Atr _ _
3 αντισώματα αντίσωμα No NoCm Ne|Pl|Nm 5 Sb _ _
4 IgG IgG Rg RgFwOr _ 3 Atr _ _
5 είναι είμαι Vb VbMn Id|Pr|03|Pl|Xx|Ip|Pv|Xx 10 Obj_Co _ _
6 σαν σαν Ad Ad Ba 5 Adv _ _
7 μακροπρόθεσμη μακροπρόθεσμος Aj Aj Ba|Fe|Sg|Nm 8 Atr _ _
8 μνήμη μνήμη No NoCm Fe|Sg|Nm 6 Adv _ _
9 , , PUNCT PUNCT _ 10 AuxX _ _
10 ενώ ενώ Cj CjCo _ 26 Coord _ _
11 το ο At AtDf Ne|Sg|Nm 12 Atr _ _
12 IgA IgA Rg RgFwOr _ 15 Sb _ _
13 πιστεύεται πιστεύεται Vb VbMn Id|Pr|03|Sg|Xx|Ip|Pv|Xx 10 Obj_Co _ _
14 ότι ότι Cj CjSb _ 13 AuxC _ _
15 είναι είμαι Vb VbMn Id|Pr|03|Sg|Xx|Ip|Pv|Xx 14 Sb _ _
16 ένας ένας At AtId Ma|Sg|Nm 18 Atr _ _
17 συγκεκριμένος συγκεκριμένος Aj Aj Ba|Ma|Sg|Nm 18 Atr _ _
18 δείκτης δείκτης No NoCm Ma|Sg|Nm 15 Pnom _ _
19 για για AsPp AsPpSp _ 18 AuxP _ _
20 πρόσφατες πρόσφατος Aj Aj Ba|Fe|Pl|Ac 21 Atr_Co _ _
21 ή ή Cj CjCo _ 23 Coord _ _
22 χρόνιες χρόνιος Aj Aj Ba|Fe|Pl|Ac 21 Atr_Co _ _
23 λοιμώξεις λοίμωξη No NoCm Fe|Pl|Ac 19 Atr _ _
24 PUNCT PUNCT _ 10 AuxG _ _
25 , , PUNCT PUNCT _ 10 AuxX _ _
26 εξηγεί εξηγώ Vb VbMn Id|Pr|03|Sg|Xx|Ip|Av|Xx 0 Pred _ _
27 η ο At AtDf Fe|Sg|Nm 28 Atr _ _
28 Δρ Δρ Rg RgFwTr _ 26 Sb _ _
29 Αρκάρι Αρκάρι No NoCm Ne|Sg|Nm 28 Atr _ _
30 . . PUNCT PUNCT _ 0 AuxK _ _

The first sentence of the CoNLL 2007 test data:

1 Η ο At AtDf Fe|Sg|Nm 2 Atr _ _
2 Σίφνος Σίφνος No NoPr Fe|Sg|Nm 3 Sb _ _
3 φημίζεται φημίζομαι Vb VbMn Id|Pr|03|Sg|Xx|Ip|Pv|Xx 0 Pred _ _
4 και και Cj CjCo _ 5 AuxY _ _
5 για για AsPp AsPpSp _ 3 AuxP _ _
6 τα ο At AtDf Ne|Pl|Ac 8 Atr _ _
7 καταγάλανα καταγάλανος Aj Aj Ba|Ne|Pl|Ac 8 Atr _ _
8 νερά νερό No NoCm Ne|Pl|Ac 5 Obj _ _
9 των ο At AtDf Fe|Pl|Ge 11 Atr _ _
10 πανέμορφων πανέμορφος Aj Aj Ba|Fe|Pl|Ge 11 Atr _ _
11 ακτών ακτή No NoCm Fe|Pl|Ge 8 Atr _ _
12 της μου Pn PnPo Fe|03|Sg|Ge|Xx 11 Atr _ _
13 . . PUNCT PUNCT _ 0 AuxK _ _
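The samples above use the 10-column CoNLL-X format of the 2007 shared task (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); in the actual data files the columns are tab-separated and sentences end with a blank line. A minimal reader for this format might look as follows (the three-token sample is taken from the test sentence above):

```python
# Minimal reader for the 10-column CoNLL-X format used in CoNLL 2007.
# Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL.

def read_conllx(lines):
    """Yield one sentence at a time as a list of token dicts."""
    fields = ["id", "form", "lemma", "cpostag", "postag",
              "feats", "head", "deprel", "phead", "pdeprel"]
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        token = dict(zip(fields, line.split("\t")))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 = artificial root
        sentence.append(token)
    if sentence:                           # file may lack a final blank line
        yield sentence

sample = [
    "1\tΗ\tο\tAt\tAtDf\tFe|Sg|Nm\t2\tAtr\t_\t_",
    "2\tΣίφνος\tΣίφνος\tNo\tNoPr\tFe|Sg|Nm\t3\tSb\t_\t_",
    "3\tφημίζεται\tφημίζομαι\tVb\tVbMn\tId|Pr|03|Sg|Xx|Ip|Pv|Xx\t0\tPred\t_\t_",
    "",
]
sentences = list(read_conllx(sample))
print(len(sentences), sentences[0][2]["deprel"])  # → 1 Pred
```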

Parsing

Nonprojective attachments in GDT are not frequent: only 823 of the 70223 tokens in the CoNLL 2007 version are attached nonprojectively (1.17%).

The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Greek:

Parser (Authors)      | LAS   | UAS
Nakagawa              | 76.31 | 84.08
Keith Hall et al.     | 74.21 | 82.04
Carreras              | 73.56 | 81.37
Malt (Nilsson et al.) | 74.65 | 81.22
Titov et al.          | 73.52 | 81.20
Chen                  | 74.42 | 81.16
Duan                  | 74.29 | 80.77
Attardi et al.        | 73.92 | 80.75
Malt (J. Hall et al.) | 74.21 | 80.66
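For reference, the two scores in the table are defined as follows: UAS (unlabeled attachment score) is the percentage of tokens with the correct head, and LAS (labeled attachment score) additionally requires the correct dependency label; under the changed 2007 procedure, punctuation tokens are included in the count. A sketch with made-up gold/predicted pairs:

```python
# LAS/UAS as percentages over (head, deprel) pairs, one per token.
# The gold/pred values below are illustrative, not treebank output.

def attachment_scores(gold, pred):
    assert len(gold) == len(pred)
    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    las_hits = sum(g == p for g, p in zip(gold, pred))        # head + label
    return 100.0 * las_hits / n, 100.0 * uas_hits / n

gold = [(2, "Atr"), (3, "Sb"), (0, "Pred"), (3, "AuxK")]
pred = [(2, "Atr"), (3, "Obj"), (0, "Pred"), (0, "AuxK")]
las, uas = attachment_scores(gold, pred)
print(f"LAS={las:.2f} UAS={uas:.2f}")  # → LAS=50.00 UAS=75.00
```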

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007), and the details of the parser configuration are described here.

