
Institute of Formal and Applied Linguistics Wiki



This is an old revision of the document!



Hungarian (hu)

Szeged Treebank (SzTB)

Versions

Obtaining and License

The Szeged Treebank is available for research free of charge, provided the user signs the license agreement first. The website uses JavaScript to manage content, which makes it difficult to link directly to the relevant sections. Click on “downloads” (letöltések) to get the list of downloadable corpora and links to their descriptions (e.g. Szeged Treebank 2.0). To obtain the treebank, complete the license form, print it, sign it, and either fax it to +36-62-546397 or mail it to Vincze Veronika, Árpád tér 2, H-6720 Szeged. You will then be given a user ID and password needed to download the data. The site links to Microsoft Word documents with the license agreement, but those links do not work for me. If they do not work for you either, ask Veronika Vincze how to proceed (vinczev (at) inf (dot) u-szeged (dot) hu).

Republication of the CoNLL 2007 version in the LDC is planned but it has not happened yet.

The CoNLL 2007 license in short:

SzTB was created by members of the Human Language Technology Group (Nyelvtechnológiai Csoport), Department of Informatics (Informatikai Tanszékcsoport), University of Szeged (Szegedi Tudományegyetem), Árpád tér 2, H-6720 Szeged, Hungary. Conversion from constituents to dependencies for the CoNLL 2007 shared task was done by Zoltán Alexin.

References

Domain

Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).

Size

The CoNLL 2007 dataset was officially split into a training part and a test part. The data split of BDT-II was provided by Koldo Gojenola and should correspond to the data split used in the parsing experiments published by the IXA Group.

| Version | Train Sentences | Train Tokens | D-test Sentences | D-test Tokens | E-test Sentences | E-test Tokens | Total Sentences | Total Tokens | Sentence Length |
|---|---|---|---|---|---|---|---|---|---|
| CoNLL 2007 | 3190 | 50526 | | | 334 | 5390 | 3524 | 55916 | 15.87 |
| BDT-II | 9094 | 124684 | 1010 | 12625 | 1122 | 14295 | 11226 | 151604 | 13.50 |
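The Sentence Length column is the average number of tokens per sentence, i.e. total tokens divided by total sentences. The following minimal Python sketch rechecks the totals and averages from the per-part counts in the table. The dictionary layout and the function names are illustrative only, and it assumes that the CoNLL 2007 test part belongs in the E-test columns.

```python
# Counts copied from the table above: (sentences, tokens) for each part.
# Assumption: the CoNLL 2007 test part is listed under E-test.
SPLITS = {
    "CoNLL 2007": {"train": (3190, 50526), "e-test": (334, 5390)},
    "BDT-II": {
        "train": (9094, 124684),
        "d-test": (1010, 12625),
        "e-test": (1122, 14295),
    },
}


def totals(parts):
    """Sum sentences and tokens over all parts of one version."""
    sentences = sum(s for s, _ in parts.values())
    tokens = sum(t for _, t in parts.values())
    return sentences, tokens


def average_sentence_length(parts):
    """Average number of tokens per sentence over all parts."""
    sentences, tokens = totals(parts)
    return tokens / sentences


for version, parts in SPLITS.items():
    sentences, tokens = totals(parts)
    average = average_sentence_length(parts)
    print(f"{version}: {sentences} sentences, {tokens} tokens, {average:.2f} tokens/sentence")
```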

Inside

Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format.
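In the CoNLL 2006/2007 format every token is one line with ten columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and sentences are separated by blank lines. The files use tabs between columns, while the samples on this page show spaces. A minimal reader sketch in Python follows. The names `Token` and `read_conll` are illustrative and not part of any official tool. The reader splits on any whitespace, which is safe only because fields in this format contain no spaces, and it pads missing trailing columns with `_`, since blind test data may lack HEAD and DEPREL.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line; field names follow the CoNLL 2006/2007 column order."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: str      # kept as a string because blind test data may lack it
    deprel: str
    phead: str
    pdeprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of Token) from CoNLL-formatted lines."""
    sentence: List[Token] = []
    for line in lines:
        fields = line.split()  # splits on tabs as well as on spaces
        if not fields:         # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        fields += ["_"] * (10 - len(fields))  # pad missing trailing columns
        sentence.append(Token(int(fields[0]), *fields[1:10]))
    if sentence:
        yield sentence


# The first sentence of the BDT-II training data, as shown in the Sample section.
sample = """\
1 Estatu_Batuetako_DEAko Estatu_Batuak_DEA IZE LIB PLU:+|IZAUR:-|KAS:GEL|NUM:P|MUG:M|MW:B|ENT:Erakundea 2 ncmod _ _
2 buru buru IZE ARR _ 4 ncsubj _ _
3 ohiak ohi ADJ ARR IZAUR:-|KAS:ERG|NUM:S|MUG:M 2 ncmod _ _
4 aztertuko aztertu ADI SIN ADM:PART|ASP:GERO 0 ROOT _ _
5 du *edun ADL ADL MDN:A1|NOR:HURA|NORK:HARK 4 auxmod _ _
6 RUCen RUC IZE IZB MTKAT:SIG|KAS:GEN|NUM:S|MUG:M|ENT:Erakundea 7 ncmod _ _
7 erreforma erreforma IZE ARR KAS:ABS|NUM:S|MUG:M 4 ncobj _ _
8 . . PUNT_MARKA PUNT_PUNT _ 7 PUNC _ _
"""

for token in next(read_conll(sample.splitlines())):
    print(token.id, token.form, token.head, token.deprel)
```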

Part of speech tag description (obtained per e-mail from Koldo Gojenola, thanks!):

Main features:

List of all 286 features found in the corpus with frequencies:

The syntactic guidelines (structure and labels) are described in Spanish in this technical report. See Appendix 3 for some lists of tags.

Multi-word expressions have been collapsed into one token, using underscore as the joining character (e.g. Espainia_Poliziak, iduri_zait).
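A collapsed multi-word token can be split back into its parts at the underscores. The sketch below is only an illustration under two assumptions: it is applied to the FORM or LEMMA column only (tags such as IZE_LIB contain underscores too), and a field consisting of `_` alone is the empty placeholder of the format and must be left untouched. The function name `mwe_parts` is hypothetical.

```python
from typing import List


def mwe_parts(form: str) -> List[str]:
    """Split a collapsed multi-word FORM or LEMMA at underscores."""
    # A bare "_" is the empty-value placeholder, not a multi-word expression.
    if form.strip("_") == "":
        return [form]
    return form.split("_")


for form in ["Espainia_Poliziak", "iduri_zait", "eta", "_"]:
    print(form, mwe_parts(form))
```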

Sample

The first sentence of the CoNLL 2007 training data:

1 espainiako_poliziak Espainia_Poliziak IZE IZE_LIB PLU-|ENTI_LOC 4 ncsubj _ _
2 hiru hiru DET DET_DZH NMGP 3 detmod _ _
3 gazte gazte IZE IZE_ARR ABS|MG 4 ncobj _ _
4 atxilotu atxilotu ADI ADI_SIN PART|BURU 8 lot _ _
5 ditu *edun ADL ADL A1|NR_HAIEK|NK_HARK 4 auxmod _ _
6 atarrabian Atarrabia IZE IZE_LIB PLU-|INE|NUMS|MUGM|ENTI_LOC 4 ncmod _ _
7 , , PUNC PUNC_KOMA _ 6 PUNC _ _
8 eta eta LOT LOT_JNT - 0 ROOT _ _
9 madrilera Madril IZE IZE_LIB PLU-|ALA|NUMS|MUGM|ENTI_LOC 10 ncmod _ _
10 eraman eraman ADI ADI_SIN PART|BURU 8 lot _ _
11 ditu *edun ADL ADL A1|NR_HAIEK|NK_HARK 10 auxmod _ _
12 . . PUNC PUNC_PUNC _ 11 PUNC _ _

The first sentence of the CoNLL 2007 test data:

1 epaileek epaile IZE IZE_ARR BIZ+|ERG|NUMP|MUGM
2 diote esan ADT ADT PNT|A1|NR_HURA|NK_HAIEK-K
3 eaeko EAE IZE IZE_LIB SIG|GEL|NUMS|MUGM|ENTI_LOC
4 parlamentarioek parlamentario ADJ ADJ_ARR IZAUR-|ERG|NUMP|MUGM
5 eaetik_kanpo EAE SIG SIG- DEK|NUMS|MUGM|DEK|ABL_kanpo_ABS|ENTI_LOC|POS
6 eginiko egin ADI ADI_SIN PART|GEL
7 delituak delitu IZE IZE_ARR BIZ-|ABS|NUMP|MUGM
8 ikertzea ikertu ADI ADI_SIN ADIZE|KONPL|ABS
9 eta eta LOT LOT_JNT -
10 epaitzea epaitu ADI ADI_SIN ADIZE|KONPL|ABS
11 auzitegi_gorenari auzitegi_gora ADJ ADJ_IZO DEK|GEN|NUMP|MUGM|DEK|DAT|NUMS|MUGM|ENTI_LOC
12 dagokiola egon ADT ADT PNT|KONPL|A1|NR_HURA|NI_HARI
13 , , PUNC PUNC_KOMA _
14 baina baina LOT LOT_JNT AURK
15 atzerrian atzerri IZE IZE_ARR INE|NUMS|MUGM
16 izaniko izan ADI ADI_SIN PART|GEL
17 kontaktu kontaktu IZE IZE_ARR _
18 horiek horiek DET DET_ERKARR ABS|NUMP|MUGM
19 ezin_direla ezin_izan ADI ADI_ADK PNT|KONPL|A1|NR_HAIEK|MWCorrect
20 delitutzat delitu IZE IZE_ARR BIZ-|PRO|MG
21 hartu hartu ADI ADI_SIN PART
22 . . PUNC PUNC_PUNC _

The first sentence of the BDT-II training data:

1 Estatu_Batuetako_DEAko Estatu_Batuak_DEA IZE LIB PLU:+|IZAUR:-|KAS:GEL|NUM:P|MUG:M|MW:B|ENT:Erakundea 2 ncmod _ _
2 buru buru IZE ARR _ 4 ncsubj _ _
3 ohiak ohi ADJ ARR IZAUR:-|KAS:ERG|NUM:S|MUG:M 2 ncmod _ _
4 aztertuko aztertu ADI SIN ADM:PART|ASP:GERO 0 ROOT _ _
5 du *edun ADL ADL MDN:A1|NOR:HURA|NORK:HARK 4 auxmod _ _
6 RUCen RUC IZE IZB MTKAT:SIG|KAS:GEN|NUM:S|MUG:M|ENT:Erakundea 7 ncmod _ _
7 erreforma erreforma IZE ARR KAS:ABS|NUM:S|MUG:M 4 ncobj _ _
8 . . PUNT_MARKA PUNT_PUNT _ 7 PUNC _ _

The first sentence of the BDT-II development data:

1 Irakaskuntzan irakaskuntza IZE ARR BIZ:-|KAS:INE|NUM:S|MUG:M 2 ncmod _ _
2 jardun jardun ADI SIN ADM:PART|ASP:BURU 0 ROOT _ _
3 zuen *edun ADL ADL MDN:B1|NOR:HURA|NORK:HARK 2 auxmod _ _
4 Miel Miel IZE IZB PLU:-|ENT:Pertsona 5 entios _ _
5 Anjel_Elustondok Anjel_Elustondo IZE IZB PLU:-|KAS:ERG|NUM:S|MUG:M|ENT:Pertsona 2 ncsubj _ _
6 1980 1980 IZE ZKI _ 7 ncmod _ _
7 urtetik urte IZE ARR BIZ:-|KAS:ABL|NUM:S|MUG:M 2 ncmod _ _
8 1992ra 1992 IZE ZKI KAS:ALA|NUM:S|MUG:M 2 ncmod _ _
9 , , PUNT_MARKA PUNT_KOMA _ 8 PUNC _ _
10 hauetatik hauek DET ERKARR KAS:ABL|NUM:P|MUG:M 16 ncmod _ _
11 hamar hamar DET DZH NMG:P 12 detmod _ _
12 urtez urte IZE ARR BIZ:-|KAS:INS|MUG:MG 16 lot _ _
13 Azpeitiko Azpeitia IZE LIB PLU:-|KAS:GEL|NUM:S|MUG:M|ENT:Tokia 14 ncmod _ _
14 ikastolan ikastola IZE ARR BIZ:-|KAS:INE|NUM:S|MUG:M 16 ncmod _ _
15 irakasle irakasle IZE ARR KAS:ABS|MUG:MG 16 ncpred _ _
16 eta eta LOT JNT ERL:EMEN 8 aponcmod _ _
17 beste beste DET DZG _ 18 detmod _ _
18 biak bi IZE ZKI KAS:ABS|NUM:P|MUG:M 16 lot _ _
19 , , PUNT_MARKA PUNT_KOMA _ 18 PUNC _ _
20 Arabako Araba IZE LIB PLU:-|KAS:GEL|NUM:S|MUG:M|ENT:Tokia 21 ncmod _ _
21 ikastolen ikastola IZE ARR BIZ:-|KAS:GEN|NUM:P|MUG:M 22 ncmod _ _
22 elkartean elkarte IZE ARR BIZ:-|KAS:INE|NUM:S|MUG:M 16 ncmod _ _
23 . . PUNT_MARKA PUNT_PUNT _ 22 PUNC _ _

The first sentence of the BDT-II test data:

1 Hegoaldean hegoalde IZE ARR KAS:INE|NUM:S|MUG:M 2 ncmod _ _
2 iduri_zait iduri_izan ADI ADK ASP:PNT|MDN:A1|NOR:HURA|NORI:NIRI|MW:B 0 ROOT _ _
3 euskararen euskara IZE ARR BIZ:-|KAS:GEN|NUM:S|MUG:M 4 ncmod _ _
4 mundu mundu IZE ARR BIZ:- 7 ncsubj _ _
5 hau hau DET ERKARR KAS:ABS|NUM:S|MUG:M 4 detmod _ _
6 adi-adi adi-adi ADB ARR _ 7 ncmod _ _
7 dagola egon ADT ADT ASP:PNT|ERL:KONPL|MDN:A3|NOR:HURA 2 ccomp_obj _ _
8 , , PUNT_MARKA PUNT_KOMA _ 7 PUNC _ _
9 Euskaltzaindiak Euskaltzaindia IZE LIB PLU:-|KAS:ERG|NUM:S|MUG:M|ENT:Tokia 11 ncsubj _ _
10 zer zer DET NOLGAL NMG:MG|KAS:ABS|MUG:MG 11 ncobj _ _
11 erranen erran ADI SIN ADM:PART|ASP:GERO 13 menos _ _
12 duen *edun ADL ADL ERL:ZHG|MDN:A1|NOR:HURA|NORK:HARK 11 auxmod _ _
13 zain zain ADB ARR _ 7 cmod _ _
14 , , PUNT_MARKA PUNT_KOMA _ 13 PUNC _ _
15 haren hura DET ERKARR KAS:GEN|NUM:S|MUG:M 16 ncmod _ _
16 arauen arau IZE ARR KAS:ABS|MUG:MG 18 ncmod _ _
17 berehala berehala ADB ARR _ 18 ncmod _ _
18 betetzeko bete ADI SIN ADM:ADIZE|ERL:HELB|KAS:ABS|MUG:MG 7 xmod _ _
19 . . PUNT_MARKA PUNT_PUNT _ 18 PUNC _ _
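The samples above show two encodings of the FEATS column: the CoNLL 2007 version lists bare values separated by vertical bars (e.g. PART|BURU), while BDT-II uses name:value pairs (e.g. KAS:ABS|NUM:S|MUG:M). The following Python sketch reads both. The function name `parse_feats` is illustrative; it returns a list of pairs rather than a dictionary because the samples contain repeated values (e.g. DEK occurs twice in one CoNLL 2007 token), and it treats only `_` as the empty value.

```python
from typing import List, Optional, Tuple


def parse_feats(feats: str) -> List[Tuple[Optional[str], str]]:
    """Turn a FEATS string into (name, value) pairs; name is None for bare values."""
    if feats == "_":  # the format's placeholder for "no features"
        return []
    pairs = []
    for item in feats.split("|"):
        name, separator, value = item.partition(":")
        if separator:
            pairs.append((name, value))   # BDT-II style, e.g. KAS:ABS
        else:
            pairs.append((None, item))    # CoNLL 2007 style, e.g. PART
    return pairs


print(parse_feats("KAS:ABS|NUM:S|MUG:M"))
print(parse_feats("PART|BURU"))
print(parse_feats("_"))
```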

Parsing

BDT is a mildly nonprojective treebank: 1925 of the 151,604 tokens in the combined BDT-II training and test sets (1.27%) are attached nonprojectively.
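How such a count can be obtained is sketched below in Python. It assumes one common definition: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. This page does not state which definition was used for the figure above, and the function names are illustrative.

```python
from typing import List


def dominates(ancestor: int, token: int, heads: List[int]) -> bool:
    """True if `ancestor` lies on the head path from `token` up to the root (0)."""
    for _ in range(len(heads) + 1):  # bounded walk, so cyclic input cannot loop forever
        if token == ancestor:
            return True
        if token == 0:
            return False
        token = heads[token - 1]
    return False


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return ids of tokens whose arc to the head is nonprojective.

    heads[i] is the HEAD of token i+1; ids are 1-based and 0 is the artificial root.
    """
    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        between = range(low + 1, high)
        if any(not dominates(head, middle, heads) for middle in between):
            result.append(dependent)
    return result


# HEAD column of the first sentence of the CoNLL 2007 training data (see Sample).
first_sentence = [4, 3, 4, 8, 4, 4, 6, 0, 10, 8, 10, 11]
print(nonprojective_tokens(first_sentence))  # [] (the sentence is projective)

# Hypothetical four-token tree: token 2 attaches to token 4 across token 3.
print(nonprojective_tokens([3, 4, 0, 3]))    # [2]

# The proportion quoted above.
print(round(100 * 1925 / 151604, 2))         # 1.27
```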

The results of the CoNLL 2007 shared task are available online and have been published in (Nivre et al., 2007). Relative to CoNLL 2006, the evaluation procedure was changed to include punctuation tokens. These are the best results for Basque:

| Parser (Authors) | LAS | UAS |
|---|---|---|
| Malt (Nilsson et al.) | 76.94 | 82.84 |
| Titov et al. | 75.49 | 81.93 |
| Sagae | 74.64 | 81.19 |
| Carreras | 75.75 | 81.11 |
| Nakagawa | 72.56 | 81.04 |
| Malt (J. Hall et al.) | 74.99 | 80.61 |
| Johansson et al. | 75.08 | 80.43 |
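LAS (labeled attachment score) is the percentage of tokens that receive both the correct head and the correct dependency label; UAS (unlabeled attachment score) requires only the correct head. The Python sketch below computes both over all tokens, punctuation included. It is not the official evaluation script, the function name is illustrative, and the gold and predicted analyses in the example are made up.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (HEAD, DEPREL) of one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, counting every token including punctuation."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_arcs / total, 100.0 * correct_heads / total


# Hypothetical four-token example: token 2 has a wrong label, token 3 a wrong head.
gold = [(2, "ncmod"), (4, "ncsubj"), (2, "ncmod"), (0, "ROOT")]
predicted = [(2, "ncmod"), (4, "ncobj"), (4, "ncmod"), (0, "ROOT")]

las, uas = attachment_scores(gold, predicted)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")
```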

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

Parsing results on BDT-II have been published in Kepa Bengoetxea, Koldo Gojenola: Application of Different Techniques to Dependency Parsing of Basque. In: Proceedings of the First Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2010), NAACL Workshop, Los Angeles, California, USA, 2010. They report only Labeled Attachment Score (LAS) and their best system achieved LAS = 78.98%.

