Basque (eu)
Basque Dependency Treebank (BDT)
Versions
- CoNLL 2007
- BDT-II (obtained by e-mail in 2011)
Obtaining and License
There does not seem to be any regular distribution channel for the Basque Dependency Treebank. The CoNLL 2007 version had a restricted license that was valid only for the duration of the shared task. Republication of the CoNLL version through the LDC is planned but has not happened yet. In the meantime, one can ask Koldo Gojenola (koldo (dot) gojenola (at) ehu (dot) es) about the availability of the corpus.
Informally agreed upon terms:
- no redistribution
- cite the principal publication (see below) in publications
BDT was created by members of the IXA Group (IXA taldea), University of the Basque Country (Euskal Herriko Unibertsitatea), 649 Posta kutxa, E-20080 Donostia, Spain.
References
- Website
  - no website dedicated to the treebank
- Data
  - no separate citation
- Principal publications
  - Itziar Aduriz, María Jesús Aranzabe, José María Arriola, Aitziber Atutxa, Arantza Díaz de Ilarraza, Aitzpea Garmendia, Maite Oronoz: Construction of a Basque Dependency Treebank. In: Proceedings of The Second Workshop on Treebanks and Linguistic Theories (TLT 2003), pp. 149-160, Växjö, Sweden, 2003.
- Documentation
  - Description of tags and feature values is hard to find; the doc/README file in the CoNLL 2007 data distribution is not very informative. See below for information obtained by e-mail.
  - María Jesús Aranzabe, José Mari Arriola, Aitziber Atutxa, Irene Balza, Larraitz Uria: Guía para la anotación sintáctica manual de Eus3LB (corpus del euskera anotado a nivel sintáctico, semántico y pragmático). UPV/EHU/LSI/TR 13-2003, Donostia, Spain, 2003.
Domain
Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).
Size
The CoNLL 2007 dataset was officially split into a training part and a test part. The data split of BDT-II was provided by Koldo Gojenola and should correspond to the data split used in the parsing experiments published by the IXA Group.
Version | Train Sentences | Train Tokens | D-test Sentences | D-test Tokens | E-test Sentences | E-test Tokens | Total Sentences | Total Tokens | Avg. Sentence Length |
---|---|---|---|---|---|---|---|---|---|
CoNLL 2007 | 3190 | 50526 | | | 334 | 5390 | 3524 | 55916 | 15.87 |
BDT-II | 9094 | 124684 | 1010 | 12625 | 1122 | 14295 | 11226 | 151604 | 13.50 |
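The totals and average sentence lengths in the table can be rechecked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the table, the CoNLL 2007 row is read as training part, test part and total (that version has no development part), and the names `SPLITS` and `totals` are made up for this illustration.

```python
# (sentences, tokens) per data part, copied from the Size table.
SPLITS = {
    "CoNLL 2007": {"train": (3190, 50526), "e-test": (334, 5390)},
    "BDT-II": {
        "train": (9094, 124684),
        "d-test": (1010, 12625),
        "e-test": (1122, 14295),
    },
}


def totals(parts):
    """Return (total sentences, total tokens, average tokens per sentence)."""
    sentences = sum(s for s, _ in parts.values())
    tokens = sum(t for _, t in parts.values())
    return sentences, tokens, round(tokens / sentences, 2)


for version, parts in SPLITS.items():
    print(version, totals(parts))
```

Both rows come out consistent with the totals and averages given in the table.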
Inside
Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format.
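A minimal reader for this format might look like the sketch below. It assumes the standard CoNLL-X layout (one token per line, ten tab-separated columns, sentences separated by a blank line). The names `Token`, `read_conll` and `SAMPLE` are illustrative only, and padding short lines with `_` is an assumption made here so that files without HEAD/DEPREL columns can still be read.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: Optional[int]  # None when the file does not provide a head
    deprel: str
    phead: str
    pdeprel: str


def read_conll(lines):
    """Yield one list of Token objects per sentence."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        cols += ["_"] * (10 - len(cols))  # assumption: pad missing columns
        head = int(cols[6]) if cols[6].isdigit() else None
        sentence.append(Token(int(cols[0]), *cols[1:6], head, *cols[7:10]))
    if sentence:
        yield sentence


# First sentence of the BDT-II training data; the fields contain no spaces,
# so whitespace-separated rows are converted to tab-separated lines here.
ROWS = [
    "1 Estatu_Batuetako_DEAko Estatu_Batuak_DEA IZE LIB PLU:+|IZAUR:-|KAS:GEL|NUM:P|MUG:M|MW:B|ENT:Erakundea 2 ncmod _ _",
    "2 buru buru IZE ARR _ 4 ncsubj _ _",
    "3 ohiak ohi ADJ ARR IZAUR:-|KAS:ERG|NUM:S|MUG:M 2 ncmod _ _",
    "4 aztertuko aztertu ADI SIN ADM:PART|ASP:GERO 0 ROOT _ _",
    "5 du *edun ADL ADL MDN:A1|NOR:HURA|NORK:HARK 4 auxmod _ _",
    "6 RUCen RUC IZE IZB MTKAT:SIG|KAS:GEN|NUM:S|MUG:M|ENT:Erakundea 7 ncmod _ _",
    "7 erreforma erreforma IZE ARR KAS:ABS|NUM:S|MUG:M 4 ncobj _ _",
    "8 . . PUNT_MARKA PUNT_PUNT _ 7 PUNC _ _",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in ROWS)

for sentence in read_conll(SAMPLE.splitlines()):
    for token in sentence:
        print(token.id, token.form, token.head, token.deprel)
```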
Part-of-speech tag description (obtained by e-mail from Koldo Gojenola, thanks!):
- IZE = noun
  - ARR = common
  - IZB = proper name
  - LIB = place name
  - ZKI = number
- ADJ = adjective
  - ARR = common
  - GAL = question
- ADI = verb
  - SIN = simple
  - ADK = composed
  - ADP = periphrastic
  - FAK = factitive
- ADB = adverb
  - ARR = common
  - GAL = question
- DET = determiner
  - ERKARR = demonstrative common
  - ERKIND = demonstrative emphatic
  - NOLARR = indefinite common
  - NOLGAL = indefinite question
  - ZNB = number
  - DZH = definite
  - BAN = distributive
  - ORD = ordinal
  - DZG = indefinite
  - ORO = general
- IOR = pronoun
  - PERARR = personal common
  - PERIND = personal emphatic
  - IZGMGB = indefinite
  - IZGGAL = question
  - BIH = ???
  - ELK = ???
- LOT = link
  - LOK = connector
  - JNT = conjunction
- PRT = particle
- ITJ = interjection
- BST = other
- ADL = auxiliary verb
- ADT = synthetic verb
- SIG = acronym
- SNB = symbol
- LAB = abbreviation
Main features:
- KAS = case (ERG = ergative, ABS = absolutive, DAT = dative…)
- ASP = aspect
- ERL = relation (relative sentence, completive sentence, indirect question…)
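The two versions encode these features differently in the FEATS column, as the samples further down show: BDT-II uses `NAME:VALUE` pairs (e.g. `KAS:ERG`, `ASP:GERO`, `ERL:KONPL`), while the CoNLL 2007 version lists bare values (e.g. `ABS|MG`). The sketch below separates the two kinds of items. The function name `parse_feats` and the decision to treat `_` and `-` as "no features" are assumptions made for this illustration.

```python
def parse_feats(feats):
    """Split a FEATS value into (named, bare).

    named: dict of NAME -> VALUE for items written as NAME:VALUE
    bare:  list of items that carry no feature name
    """
    named, bare = {}, []
    if feats in ("", "_", "-"):  # assumption: all three mean "no features"
        return named, bare
    for item in feats.split("|"):
        name, sep, value = item.partition(":")
        if sep:
            named[name] = value
        else:
            bare.append(item)
    return named, bare


# BDT-II style (token "ohiak" in the first training sentence)
print(parse_feats("IZAUR:-|KAS:ERG|NUM:S|MUG:M"))
# CoNLL 2007 style (token "gazte" in the first training sentence)
print(parse_feats("ABS|MG"))
```

Keeping bare values in a list rather than guessing their feature names avoids inventing a mapping that the available documentation does not provide.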
The syntactic guidelines (structure and labels) are described in Spanish in the technical report listed under Documentation above (Aranzabe et al., 2003). See its Appendix 3 for some lists of tags.
Sample
The first sentence of the CoNLL 2007 training data:
1 | espainiako_poliziak | Espainia_Poliziak | IZE | IZE_LIB | PLU-|ENTI_LOC | 4 | ncsubj | _ | _ |
2 | hiru | hiru | DET | DET_DZH | NMGP | 3 | detmod | _ | _ |
3 | gazte | gazte | IZE | IZE_ARR | ABS|MG | 4 | ncobj | _ | _ |
4 | atxilotu | atxilotu | ADI | ADI_SIN | PART|BURU | 8 | lot | _ | _ |
5 | ditu | *edun | ADL | ADL | A1|NR_HAIEK|NK_HARK | 4 | auxmod | _ | _ |
6 | atarrabian | Atarrabia | IZE | IZE_LIB | PLU-|INE|NUMS|MUGM|ENTI_LOC | 4 | ncmod | _ | _ |
7 | , | , | PUNC | PUNC_KOMA | _ | 6 | PUNC | _ | _ |
8 | eta | eta | LOT | LOT_JNT | - | 0 | ROOT | _ | _ |
9 | madrilera | Madril | IZE | IZE_LIB | PLU-|ALA|NUMS|MUGM|ENTI_LOC | 10 | ncmod | _ | _ |
10 | eraman | eraman | ADI | ADI_SIN | PART|BURU | 8 | lot | _ | _ |
11 | ditu | *edun | ADL | ADL | A1|NR_HAIEK|NK_HARK | 10 | auxmod | _ | _ |
12 | . | . | PUNC | PUNC_PUNC | _ | 11 | PUNC | _ | _ |
The first sentence of the CoNLL 2007 test data:
1 | epaileek | epaile | IZE | IZE_ARR | BIZ+|ERG|NUMP|MUGM |
2 | diote | esan | ADT | ADT | PNT|A1|NR_HURA|NK_HAIEK-K |
3 | eaeko | EAE | IZE | IZE_LIB | SIG|GEL|NUMS|MUGM|ENTI_LOC |
4 | parlamentarioek | parlamentario | ADJ | ADJ_ARR | IZAUR-|ERG|NUMP|MUGM |
5 | eaetik_kanpo | EAE | SIG | SIG- | DEK|NUMS|MUGM|DEK|ABL_kanpo_ABS|ENTI_LOC|POS |
6 | eginiko | egin | ADI | ADI_SIN | PART|GEL |
7 | delituak | delitu | IZE | IZE_ARR | BIZ-|ABS|NUMP|MUGM |
8 | ikertzea | ikertu | ADI | ADI_SIN | ADIZE|KONPL|ABS |
9 | eta | eta | LOT | LOT_JNT | - |
10 | epaitzea | epaitu | ADI | ADI_SIN | ADIZE|KONPL|ABS |
11 | auzitegi_gorenari | auzitegi_gora | ADJ | ADJ_IZO | DEK|GEN|NUMP|MUGM|DEK|DAT|NUMS|MUGM|ENTI_LOC |
12 | dagokiola | egon | ADT | ADT | PNT|KONPL|A1|NR_HURA|NI_HARI |
13 | , | , | PUNC | PUNC_KOMA | _ |
14 | baina | baina | LOT | LOT_JNT | AURK |
15 | atzerrian | atzerri | IZE | IZE_ARR | INE|NUMS|MUGM |
16 | izaniko | izan | ADI | ADI_SIN | PART|GEL |
17 | kontaktu | kontaktu | IZE | IZE_ARR | _ |
18 | horiek | horiek | DET | DET_ERKARR | ABS|NUMP|MUGM |
19 | ezin_direla | ezin_izan | ADI | ADI_ADK | PNT|KONPL|A1|NR_HAIEK|MWCorrect |
20 | delitutzat | delitu | IZE | IZE_ARR | BIZ-|PRO|MG |
21 | hartu | hartu | ADI | ADI_SIN | PART |
22 | . | . | PUNC | PUNC_PUNC | _ |
The first sentence of the BDT-II training data:
1 | Estatu_Batuetako_DEAko | Estatu_Batuak_DEA | IZE | LIB | PLU:+|IZAUR:-|KAS:GEL|NUM:P|MUG:M|MW:B|ENT:Erakundea | 2 | ncmod | _ | _ |
2 | buru | buru | IZE | ARR | _ | 4 | ncsubj | _ | _ |
3 | ohiak | ohi | ADJ | ARR | IZAUR:-|KAS:ERG|NUM:S|MUG:M | 2 | ncmod | _ | _ |
4 | aztertuko | aztertu | ADI | SIN | ADM:PART|ASP:GERO | 0 | ROOT | _ | _ |
5 | du | *edun | ADL | ADL | MDN:A1|NOR:HURA|NORK:HARK | 4 | auxmod | _ | _ |
6 | RUCen | RUC | IZE | IZB | MTKAT:SIG|KAS:GEN|NUM:S|MUG:M|ENT:Erakundea | 7 | ncmod | _ | _ |
7 | erreforma | erreforma | IZE | ARR | KAS:ABS|NUM:S|MUG:M | 4 | ncobj | _ | _ |
8 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 7 | PUNC | _ | _ |
The first sentence of the BDT-II development data:
1 | Irakaskuntzan | irakaskuntza | IZE | ARR | BIZ:-|KAS:INE|NUM:S|MUG:M | 2 | ncmod | _ | _ |
2 | jardun | jardun | ADI | SIN | ADM:PART|ASP:BURU | 0 | ROOT | _ | _ |
3 | zuen | *edun | ADL | ADL | MDN:B1|NOR:HURA|NORK:HARK | 2 | auxmod | _ | _ |
4 | Miel | Miel | IZE | IZB | PLU:-|ENT:Pertsona | 5 | entios | _ | _ |
5 | Anjel_Elustondok | Anjel_Elustondo | IZE | IZB | PLU:-|KAS:ERG|NUM:S|MUG:M|ENT:Pertsona | 2 | ncsubj | _ | _ |
6 | 1980 | 1980 | IZE | ZKI | _ | 7 | ncmod | _ | _ |
7 | urtetik | urte | IZE | ARR | BIZ:-|KAS:ABL|NUM:S|MUG:M | 2 | ncmod | _ | _ |
8 | 1992ra | 1992 | IZE | ZKI | KAS:ALA|NUM:S|MUG:M | 2 | ncmod | _ | _ |
9 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 8 | PUNC | _ | _ |
10 | hauetatik | hauek | DET | ERKARR | KAS:ABL|NUM:P|MUG:M | 16 | ncmod | _ | _ |
11 | hamar | hamar | DET | DZH | NMG:P | 12 | detmod | _ | _ |
12 | urtez | urte | IZE | ARR | BIZ:-|KAS:INS|MUG:MG | 16 | lot | _ | _ |
13 | Azpeitiko | Azpeitia | IZE | LIB | PLU:-|KAS:GEL|NUM:S|MUG:M|ENT:Tokia | 14 | ncmod | _ | _ |
14 | ikastolan | ikastola | IZE | ARR | BIZ:-|KAS:INE|NUM:S|MUG:M | 16 | ncmod | _ | _ |
15 | irakasle | irakasle | IZE | ARR | KAS:ABS|MUG:MG | 16 | ncpred | _ | _ |
16 | eta | eta | LOT | JNT | ERL:EMEN | 8 | aponcmod | _ | _ |
17 | beste | beste | DET | DZG | _ | 18 | detmod | _ | _ |
18 | biak | bi | IZE | ZKI | KAS:ABS|NUM:P|MUG:M | 16 | lot | _ | _ |
19 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 18 | PUNC | _ | _ |
20 | Arabako | Araba | IZE | LIB | PLU:-|KAS:GEL|NUM:S|MUG:M|ENT:Tokia | 21 | ncmod | _ | _ |
21 | ikastolen | ikastola | IZE | ARR | BIZ:-|KAS:GEN|NUM:P|MUG:M | 22 | ncmod | _ | _ |
22 | elkartean | elkarte | IZE | ARR | BIZ:-|KAS:INE|NUM:S|MUG:M | 16 | ncmod | _ | _ |
23 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 22 | PUNC | _ | _ |
The first sentence of the BDT-II test data:
1 | Hegoaldean | hegoalde | IZE | ARR | KAS:INE|NUM:S|MUG:M | 2 | ncmod | _ | _ |
2 | iduri_zait | iduri_izan | ADI | ADK | ASP:PNT|MDN:A1|NOR:HURA|NORI:NIRI|MW:B | 0 | ROOT | _ | _ |
3 | euskararen | euskara | IZE | ARR | BIZ:-|KAS:GEN|NUM:S|MUG:M | 4 | ncmod | _ | _ |
4 | mundu | mundu | IZE | ARR | BIZ:- | 7 | ncsubj | _ | _ |
5 | hau | hau | DET | ERKARR | KAS:ABS|NUM:S|MUG:M | 4 | detmod | _ | _ |
6 | adi-adi | adi-adi | ADB | ARR | _ | 7 | ncmod | _ | _ |
7 | dagola | egon | ADT | ADT | ASP:PNT|ERL:KONPL|MDN:A3|NOR:HURA | 2 | ccomp_obj | _ | _ |
8 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 7 | PUNC | _ | _ |
9 | Euskaltzaindiak | Euskaltzaindia | IZE | LIB | PLU:-|KAS:ERG|NUM:S|MUG:M|ENT:Tokia | 11 | ncsubj | _ | _ |
10 | zer | zer | DET | NOLGAL | NMG:MG|KAS:ABS|MUG:MG | 11 | ncobj | _ | _ |
11 | erranen | erran | ADI | SIN | ADM:PART|ASP:GERO | 13 | menos | _ | _ |
12 | duen | *edun | ADL | ADL | ERL:ZHG|MDN:A1|NOR:HURA|NORK:HARK | 11 | auxmod | _ | _ |
13 | zain | zain | ADB | ARR | _ | 7 | cmod | _ | _ |
14 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 13 | PUNC | _ | _ |
15 | haren | hura | DET | ERKARR | KAS:GEN|NUM:S|MUG:M | 16 | ncmod | _ | _ |
16 | arauen | arau | IZE | ARR | KAS:ABS|MUG:MG | 18 | ncmod | _ | _ |
17 | berehala | berehala | ADB | ARR | _ | 18 | ncmod | _ | _ |
18 | betetzeko | bete | ADI | SIN | ADM:ADIZE|ERL:HELB|KAS:ABS|MUG:MG | 7 | xmod | _ | _ |
19 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 18 | PUNC | _ | _ |
Parsing
Nonprojectivities in GDT are not frequent. Only 823 of the 70223 tokens in the CoNLL 2007 version are attached nonprojectively (1.17%).
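A count of this kind can be recomputed from the HEAD column. The sketch below uses one common definition, which may differ from the one behind the figure quoted above: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name `nonprojective_tokens` is illustrative, and the artificial root (head 0) is assumed to sit to the left of the sentence and to dominate every token.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of nonprojectively attached tokens.

    heads[i] is the head id of token i+1; 0 denotes the artificial root.
    """
    n = len(heads)

    def dominated_by(node, ancestor):
        # Walk up the tree from node; stop at the root or on a cycle.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return node == ancestor  # True only if we reached root and ancestor is root

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first CoNLL 2007 training sentence shown in the Sample section.
FIRST_TRAIN_HEADS = [4, 3, 4, 8, 4, 4, 6, 0, 10, 8, 10, 11]
print(nonprojective_tokens(FIRST_TRAIN_HEADS))

# A small made-up tree in which token 2 crosses over the root token 3.
print(nonprojective_tokens([3, 4, 0, 3]))
```

Dividing the number of such tokens by the total number of tokens over a whole file gives the percentage.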
The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Greek:
Parser (Authors) | LAS | UAS |
---|---|---|
Nakagawa | 76.31 | 84.08 |
Keith Hall et al. | 74.21 | 82.04 |
Carreras | 73.56 | 81.37 |
Malt (Nilsson et al.) | 74.65 | 81.22 |
Titov et al. | 73.52 | 81.20 |
Chen | 74.42 | 81.16 |
Duan | 74.29 | 80.77 |
Attardi et al. | 73.92 | 80.75 |
Malt (J. Hall et al.) | 74.21 | 80.66 |
The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.
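For readers unfamiliar with the two columns of the results table: LAS (labeled attachment score) is the percentage of tokens that receive both the correct head and the correct dependency label, and UAS (unlabeled attachment score) is the percentage that receive the correct head. The sketch below computes both over all tokens, punctuation included, as in the evaluation described above. The function name `attachment_scores` and the "predicted" analysis are made up for this illustration; this is not the official evaluation script.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent.

    gold and predicted are equally long lists of (head, deprel) pairs,
    one pair per token; every token counts, including punctuation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    total = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * both_hits / total, 100.0 * head_hits / total


# Gold analysis: first sentence of the BDT-II training data.
GOLD = [(2, "ncmod"), (4, "ncsubj"), (2, "ncmod"), (0, "ROOT"),
        (4, "auxmod"), (7, "ncmod"), (4, "ncobj"), (7, "PUNC")]
# Hypothetical parser output: wrong label on token 2, wrong head on token 6.
PREDICTED = [(2, "ncmod"), (4, "ncobj"), (2, "ncmod"), (0, "ROOT"),
             (4, "auxmod"), (4, "ncmod"), (4, "ncobj"), (7, "PUNC")]

las, uas = attachment_scores(GOLD, PREDICTED)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")
```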