Hungarian (hu)
Szeged Treebank (SzTB)
Versions
- Szeged Treebank 1.0 (shallow parse)
- Szeged Treebank 2.0 (full parse)
- CoNLL 2007 (based on SzTB 2.0)
Obtaining and License
The Szeged Treebank is available for research free of charge, provided the user signs the license agreement first. The website uses JavaScript to manage content, which makes it difficult to link directly to relevant sections. Click on “downloads” (letöltések) to get the list of downloadable corpora and links to their descriptions (e.g. Szeged Treebank 2.0). To obtain the treebank, complete the license form, print it, sign it and fax it to +36-62-546397 or mail it to Vincze Veronika, Árpád tér 2, H-6720 Szeged. You will then be given a user ID and password needed to download the data. The website links to Microsoft Word documents with the license agreement, but the links did not work for me. Ask Veronika Vincze how to proceed (vinczev (at) inf (dot) u-szeged (dot) hu).
Republication of the CoNLL 2007 version in the LDC is planned but it has not happened yet.
The CoNLL 2007 license in short:
- non-profit education and research purposes
- no redistribution
- no modification
- cite the principal publication (see below) in publications
SzTB was created by members of the Human Language Technology Group (Nyelvtechnológiai Csoport), Department of Informatics (Informatikai Tanszékcsoport), University of Szeged (Szegedi Tudományegyetem), Árpád tér 2, H-6720 Szeged, Hungary. Conversion from constituents to dependencies for the CoNLL 2007 shared task was done by Zoltán Alexin.
References
- Website
- http://www.inf.u-szeged.hu/projectdirs/hlt/hu/Treebank/treebank2.html (on-line browsing using a Java applet)
- Data
- no separate citation
- Principal publications
- Dóra Csendes, János Csirik, Tibor Gyimóthy, András Kocsor: The Szeged Treebank. In: Václav Matoušek, Pavel Mautner, Tomáš Pavelka (eds.): Text, Speech and Dialogue. 8th International Conference, TSD 2005, Karlovy Vary, Czech Republic, September 12-15, 2005. Proceedings. Lecture Notes in Computer Science, vol. 3658/2005, pp. 123-131, Springer-Verlag, Berlin / Heidelberg, Germany, 2005. ISSN 0302-9743, ISBN 978-3-540-28789-6.
- Documentation
- The doc/README file in the CoNLL 2007 data distribution contains a quick guide to part of speech tags. There are also several PDF documents with detailed documentation of the annotation.
- A lot of useful information on SzTB 2.0 (original, not CoNLL version), including morphosyntax, can be found at the abovementioned website.
Domain
Mixed:
- Fiction
- Short essays by 14 to 16 year-old students
- Newspapers (Népszabadság, Népszava, Magyar Hírlap, HVG)
- Texts related to computer science
- Legal texts
- Economic and financial short news
Size
The CoNLL 2007 dataset was officially split into training and test parts. The data split of BDT-II was provided by Koldo Gojenola and should correspond to the data split used in the parsing experiments published by the IXA Group.
Version | Train Sentences | Train Tokens | D-test Sentences | D-test Tokens | E-test Sentences | E-test Tokens | Total Sentences | Total Tokens | Average Sentence Length (tokens) |
---|---|---|---|---|---|---|---|---|---|
CoNLL 2007 | 3190 | 50526 | | | 334 | 5390 | 3524 | 55916 | 15.87 |
BDT-II | 9094 | 124684 | 1010 | 12625 | 1122 | 14295 | 11226 | 151604 | 13.50 |
Inside
Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format.
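A minimal sketch of reading this format in Python may be useful. It assumes the usual CoNLL 2006/2007 conventions: one token per line, ten tab-separated columns, and a blank line after each sentence (the " | " separators in the samples below are only how this page displays the columns). The names `Token` and `read_conll` are made up for this illustration and are not part of any distribution. The sample rows are the first sentence of the BDT-II training data shown in the Sample section.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: Optional[int]    # None when the column is "_" or missing
    deprel: Optional[str]  # None when the column is "_" or missing


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from tab-separated CoNLL lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        head = int(cols[6]) if len(cols) > 6 and cols[6] != "_" else None
        deprel = cols[7] if len(cols) > 7 and cols[7] != "_" else None
        sentence.append(Token(int(cols[0]), *cols[1:6], head, deprel))
    if sentence:  # the file may lack a final blank line
        yield sentence


# First sentence of the BDT-II training data, copied from the Sample section.
ROWS = [
    ("1", "Estatu_Batuetako_DEAko", "Estatu_Batuak_DEA", "IZE", "LIB",
     "PLU:+|IZAUR:-|KAS:GEL|NUM:P|MUG:M|MW:B|ENT:Erakundea", "2", "ncmod", "_", "_"),
    ("2", "buru", "buru", "IZE", "ARR", "_", "4", "ncsubj", "_", "_"),
    ("3", "ohiak", "ohi", "ADJ", "ARR", "IZAUR:-|KAS:ERG|NUM:S|MUG:M", "2", "ncmod", "_", "_"),
    ("4", "aztertuko", "aztertu", "ADI", "SIN", "ADM:PART|ASP:GERO", "0", "ROOT", "_", "_"),
    ("5", "du", "*edun", "ADL", "ADL", "MDN:A1|NOR:HURA|NORK:HARK", "4", "auxmod", "_", "_"),
    ("6", "RUCen", "RUC", "IZE", "IZB", "MTKAT:SIG|KAS:GEN|NUM:S|MUG:M|ENT:Erakundea", "7", "ncmod", "_", "_"),
    ("7", "erreforma", "erreforma", "IZE", "ARR", "KAS:ABS|NUM:S|MUG:M", "4", "ncobj", "_", "_"),
    ("8", ".", ".", "PUNT_MARKA", "PUNT_PUNT", "_", "7", "PUNC", "_", "_"),
]
sample_lines = ["\t".join(row) for row in ROWS] + [""]

sentences = list(read_conll(sample_lines))
n_tokens = sum(len(s) for s in sentences)
print(len(sentences), n_tokens, n_tokens / len(sentences))
```

Run over a whole file (for instance `read_conll(open(path, encoding="utf-8"))`, where `path` is wherever you stored the data), the same counts should give the sentence and token figures of the Size table.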
Part of speech tag description (obtained by e-mail from Koldo Gojenola, thanks!):
- IZE = noun
- ARR = common
- IZB = proper name
- LIB = place name
- ZKI = number
- ADJ = adjective
- ARR = common
- GAL = question
- ADI = verb
- SIN = simple
- ADK = composed
- ADP = periphrastic
- FAK = factitive
- ADB = adverb
- ARR = common
- GAL = question
- DET = determiner
- ERKARR = demonstrative common
- ERKIND = demonstrative emphatic
- NOLARR = indefinite common
- NOLGAL = indefinite question
- ZNB = number
- DZH = definite
- BAN = distributive
- ORD = ordinal
- DZG = indefinite
- ORO = general
- IOR = pronoun
- PERARR = personal common
- PERIND = personal emphatic
- IZGMGB = indefinite
- IZGGAL = question
- BIH = ???
- ELK = ???
- LOT = link
- LOK = connector
- JNT = conjunction
- PRT = particle
- ITJ = interjection
- BST = other
- ADL = auxiliary verb
- ADT = synthetic verb
- SIG = acronym
- SNB = symbol
- LAB = abbreviation
Main features:
- KAS = case. Descriptions of Basque grammar differ in the number of cases they list, and it is not easy to match all BDT case tags with them. Some but not all of the tags are described in Annex 3 of the technical report on the syntactic guidelines. The following list gives all case tags occurring in BDT, with their frequencies in parentheses.
- KAS:ABL (984) = ablativo = ablative
- KAS:ABS (22805) = absolutivo = absolutive
- KAS:ABU (32) = adlativo terminal (“-raino”) = “until, as far as” = terminative
- KAS:ABZ (27) = adlativo direccional (“-rantz”) = “towards” ~ lative?
- KAS:ALA (1093) = adlativo = allative
- KAS:BNK (13) =? special case of the locative genitive (“-ko”, “-eko”)
- KAS:DAT (1451) = dativo = dative
- KAS:DES (181) = destinativo = benefactive (“-entzat”)
- KAS:DESK (223) =? descriptive locative genitive (“-ko”, “-eko”), also frequently used for counted noun after numeral
- KAS:EM (705) = multiword token with postposition (e.g. “_gabe”, “_arabera”, “_batera”, “_bezala”…)
- KAS:ERG (6059) = ergativo = ergative
- KAS:GEL (6259) = genitivo locativo = locative genitive
- KAS:GEN (4307) = genitivo de posesión = possessive genitive
- KAS:INE (7690) = inesivo = inessive
- KAS:INS (1370) = instrumental
- KAS:MOT (165) = motivativo = causative
- KAS:PAR (930) = partitivo = partitive
- KAS:PRO (89) = prolativo = essive
- KAS:SOZ (928) = asociativo = comitative
- ASP = aspect
- ERL = relation (relative sentence, completive sentence, indirect question…)
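As a rough illustration of how these features appear in the data, the sketch below splits the FEATS column into name-value pairs and counts the KAS values. It assumes the BDT-II notation visible in the samples on this page (`NAME:VALUE` items joined by `|`, with `_` for an empty column). The function name `parse_feats` is made up for this example. The feature strings are copied from the first sentence of the BDT-II training data.

```python
from collections import Counter
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Split a FEATS string such as 'KAS:ABS|NUM:S|MUG:M' into a dict."""
    if feats in ("", "_", "-"):
        return {}
    parsed: Dict[str, str] = {}
    for item in feats.split("|"):
        name, sep, value = item.partition(":")
        # Items without ':' (as in the CoNLL 2007 version) get an empty value.
        parsed[name] = value if sep else ""
    return parsed


# FEATS column of the first BDT-II training sentence.
FEATS_COLUMN = [
    "PLU:+|IZAUR:-|KAS:GEL|NUM:P|MUG:M|MW:B|ENT:Erakundea",
    "_",
    "IZAUR:-|KAS:ERG|NUM:S|MUG:M",
    "ADM:PART|ASP:GERO",
    "MDN:A1|NOR:HURA|NORK:HARK",
    "MTKAT:SIG|KAS:GEN|NUM:S|MUG:M|ENT:Erakundea",
    "KAS:ABS|NUM:S|MUG:M",
    "_",
]

case_counts = Counter(
    feats["KAS"] for feats in map(parse_feats, FEATS_COLUMN) if "KAS" in feats
)
print(case_counts)
```

Note that a dict keeps only the last occurrence of a repeated name, so this sketch is not suitable for the CoNLL 2007 version, where bare values such as DEK can occur twice in one token.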
The syntactic guidelines (structure and labels) are described in Spanish in this technical report. See Appendix 3 for some lists of tags.
Multi-word expressions have been collapsed into one token, using underscore as the joining character (e.g. Espainia_Poliziak, iduri_zait).
Sample
The first sentence of the CoNLL 2007 training data:
1 | espainiako_poliziak | Espainia_Poliziak | IZE | IZE_LIB | PLU-|ENTI_LOC | 4 | ncsubj | _ | _ |
2 | hiru | hiru | DET | DET_DZH | NMGP | 3 | detmod | _ | _ |
3 | gazte | gazte | IZE | IZE_ARR | ABS|MG | 4 | ncobj | _ | _ |
4 | atxilotu | atxilotu | ADI | ADI_SIN | PART|BURU | 8 | lot | _ | _ |
5 | ditu | *edun | ADL | ADL | A1|NR_HAIEK|NK_HARK | 4 | auxmod | _ | _ |
6 | atarrabian | Atarrabia | IZE | IZE_LIB | PLU-|INE|NUMS|MUGM|ENTI_LOC | 4 | ncmod | _ | _ |
7 | , | , | PUNC | PUNC_KOMA | _ | 6 | PUNC | _ | _ |
8 | eta | eta | LOT | LOT_JNT | - | 0 | ROOT | _ | _ |
9 | madrilera | Madril | IZE | IZE_LIB | PLU-|ALA|NUMS|MUGM|ENTI_LOC | 10 | ncmod | _ | _ |
10 | eraman | eraman | ADI | ADI_SIN | PART|BURU | 8 | lot | _ | _ |
11 | ditu | *edun | ADL | ADL | A1|NR_HAIEK|NK_HARK | 10 | auxmod | _ | _ |
12 | . | . | PUNC | PUNC_PUNC | _ | 11 | PUNC | _ | _ |
The first sentence of the CoNLL 2007 test data:
1 | epaileek | epaile | IZE | IZE_ARR | BIZ+|ERG|NUMP|MUGM |
2 | diote | esan | ADT | ADT | PNT|A1|NR_HURA|NK_HAIEK-K |
3 | eaeko | EAE | IZE | IZE_LIB | SIG|GEL|NUMS|MUGM|ENTI_LOC |
4 | parlamentarioek | parlamentario | ADJ | ADJ_ARR | IZAUR-|ERG|NUMP|MUGM |
5 | eaetik_kanpo | EAE | SIG | SIG- | DEK|NUMS|MUGM|DEK|ABL_kanpo_ABS|ENTI_LOC|POS |
6 | eginiko | egin | ADI | ADI_SIN | PART|GEL |
7 | delituak | delitu | IZE | IZE_ARR | BIZ-|ABS|NUMP|MUGM |
8 | ikertzea | ikertu | ADI | ADI_SIN | ADIZE|KONPL|ABS |
9 | eta | eta | LOT | LOT_JNT | - |
10 | epaitzea | epaitu | ADI | ADI_SIN | ADIZE|KONPL|ABS |
11 | auzitegi_gorenari | auzitegi_gora | ADJ | ADJ_IZO | DEK|GEN|NUMP|MUGM|DEK|DAT|NUMS|MUGM|ENTI_LOC |
12 | dagokiola | egon | ADT | ADT | PNT|KONPL|A1|NR_HURA|NI_HARI |
13 | , | , | PUNC | PUNC_KOMA | _ |
14 | baina | baina | LOT | LOT_JNT | AURK |
15 | atzerrian | atzerri | IZE | IZE_ARR | INE|NUMS|MUGM |
16 | izaniko | izan | ADI | ADI_SIN | PART|GEL |
17 | kontaktu | kontaktu | IZE | IZE_ARR | _ |
18 | horiek | horiek | DET | DET_ERKARR | ABS|NUMP|MUGM |
19 | ezin_direla | ezin_izan | ADI | ADI_ADK | PNT|KONPL|A1|NR_HAIEK|MWCorrect |
20 | delitutzat | delitu | IZE | IZE_ARR | BIZ-|PRO|MG |
21 | hartu | hartu | ADI | ADI_SIN | PART |
22 | . | . | PUNC | PUNC_PUNC | _ |
The first sentence of the BDT-II training data:
1 | Estatu_Batuetako_DEAko | Estatu_Batuak_DEA | IZE | LIB | PLU:+|IZAUR:-|KAS:GEL|NUM:P|MUG:M|MW:B|ENT:Erakundea | 2 | ncmod | _ | _ |
2 | buru | buru | IZE | ARR | _ | 4 | ncsubj | _ | _ |
3 | ohiak | ohi | ADJ | ARR | IZAUR:-|KAS:ERG|NUM:S|MUG:M | 2 | ncmod | _ | _ |
4 | aztertuko | aztertu | ADI | SIN | ADM:PART|ASP:GERO | 0 | ROOT | _ | _ |
5 | du | *edun | ADL | ADL | MDN:A1|NOR:HURA|NORK:HARK | 4 | auxmod | _ | _ |
6 | RUCen | RUC | IZE | IZB | MTKAT:SIG|KAS:GEN|NUM:S|MUG:M|ENT:Erakundea | 7 | ncmod | _ | _ |
7 | erreforma | erreforma | IZE | ARR | KAS:ABS|NUM:S|MUG:M | 4 | ncobj | _ | _ |
8 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 7 | PUNC | _ | _ |
The first sentence of the BDT-II development data:
1 | Irakaskuntzan | irakaskuntza | IZE | ARR | BIZ:-|KAS:INE|NUM:S|MUG:M | 2 | ncmod | _ | _ |
2 | jardun | jardun | ADI | SIN | ADM:PART|ASP:BURU | 0 | ROOT | _ | _ |
3 | zuen | *edun | ADL | ADL | MDN:B1|NOR:HURA|NORK:HARK | 2 | auxmod | _ | _ |
4 | Miel | Miel | IZE | IZB | PLU:-|ENT:Pertsona | 5 | entios | _ | _ |
5 | Anjel_Elustondok | Anjel_Elustondo | IZE | IZB | PLU:-|KAS:ERG|NUM:S|MUG:M|ENT:Pertsona | 2 | ncsubj | _ | _ |
6 | 1980 | 1980 | IZE | ZKI | _ | 7 | ncmod | _ | _ |
7 | urtetik | urte | IZE | ARR | BIZ:-|KAS:ABL|NUM:S|MUG:M | 2 | ncmod | _ | _ |
8 | 1992ra | 1992 | IZE | ZKI | KAS:ALA|NUM:S|MUG:M | 2 | ncmod | _ | _ |
9 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 8 | PUNC | _ | _ |
10 | hauetatik | hauek | DET | ERKARR | KAS:ABL|NUM:P|MUG:M | 16 | ncmod | _ | _ |
11 | hamar | hamar | DET | DZH | NMG:P | 12 | detmod | _ | _ |
12 | urtez | urte | IZE | ARR | BIZ:-|KAS:INS|MUG:MG | 16 | lot | _ | _ |
13 | Azpeitiko | Azpeitia | IZE | LIB | PLU:-|KAS:GEL|NUM:S|MUG:M|ENT:Tokia | 14 | ncmod | _ | _ |
14 | ikastolan | ikastola | IZE | ARR | BIZ:-|KAS:INE|NUM:S|MUG:M | 16 | ncmod | _ | _ |
15 | irakasle | irakasle | IZE | ARR | KAS:ABS|MUG:MG | 16 | ncpred | _ | _ |
16 | eta | eta | LOT | JNT | ERL:EMEN | 8 | aponcmod | _ | _ |
17 | beste | beste | DET | DZG | _ | 18 | detmod | _ | _ |
18 | biak | bi | IZE | ZKI | KAS:ABS|NUM:P|MUG:M | 16 | lot | _ | _ |
19 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 18 | PUNC | _ | _ |
20 | Arabako | Araba | IZE | LIB | PLU:-|KAS:GEL|NUM:S|MUG:M|ENT:Tokia | 21 | ncmod | _ | _ |
21 | ikastolen | ikastola | IZE | ARR | BIZ:-|KAS:GEN|NUM:P|MUG:M | 22 | ncmod | _ | _ |
22 | elkartean | elkarte | IZE | ARR | BIZ:-|KAS:INE|NUM:S|MUG:M | 16 | ncmod | _ | _ |
23 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 22 | PUNC | _ | _ |
The first sentence of the BDT-II test data:
1 | Hegoaldean | hegoalde | IZE | ARR | KAS:INE|NUM:S|MUG:M | 2 | ncmod | _ | _ |
2 | iduri_zait | iduri_izan | ADI | ADK | ASP:PNT|MDN:A1|NOR:HURA|NORI:NIRI|MW:B | 0 | ROOT | _ | _ |
3 | euskararen | euskara | IZE | ARR | BIZ:-|KAS:GEN|NUM:S|MUG:M | 4 | ncmod | _ | _ |
4 | mundu | mundu | IZE | ARR | BIZ:- | 7 | ncsubj | _ | _ |
5 | hau | hau | DET | ERKARR | KAS:ABS|NUM:S|MUG:M | 4 | detmod | _ | _ |
6 | adi-adi | adi-adi | ADB | ARR | _ | 7 | ncmod | _ | _ |
7 | dagola | egon | ADT | ADT | ASP:PNT|ERL:KONPL|MDN:A3|NOR:HURA | 2 | ccomp_obj | _ | _ |
8 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 7 | PUNC | _ | _ |
9 | Euskaltzaindiak | Euskaltzaindia | IZE | LIB | PLU:-|KAS:ERG|NUM:S|MUG:M|ENT:Tokia | 11 | ncsubj | _ | _ |
10 | zer | zer | DET | NOLGAL | NMG:MG|KAS:ABS|MUG:MG | 11 | ncobj | _ | _ |
11 | erranen | erran | ADI | SIN | ADM:PART|ASP:GERO | 13 | menos | _ | _ |
12 | duen | *edun | ADL | ADL | ERL:ZHG|MDN:A1|NOR:HURA|NORK:HARK | 11 | auxmod | _ | _ |
13 | zain | zain | ADB | ARR | _ | 7 | cmod | _ | _ |
14 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 13 | PUNC | _ | _ |
15 | haren | hura | DET | ERKARR | KAS:GEN|NUM:S|MUG:M | 16 | ncmod | _ | _ |
16 | arauen | arau | IZE | ARR | KAS:ABS|MUG:MG | 18 | ncmod | _ | _ |
17 | berehala | berehala | ADB | ARR | _ | 18 | ncmod | _ | _ |
18 | betetzeko | bete | ADI | SIN | ADM:ADIZE|ERL:HELB|KAS:ABS|MUG:MG | 7 | xmod | _ | _ |
19 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 18 | PUNC | _ | _ |
Parsing
BDT is a mildly nonprojective treebank: 1925 of the 151,604 tokens of the combined BDT-II training, development and test sets (1.27%) are attached nonprojectively.
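The sketch below shows one common way to count nonprojective attachments. It assumes the usual definition: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. Whether the figure above was computed with exactly this definition is not stated here, so treat the code as an illustration only. The function names are made up, and the four-token `TOY_HEADS` example is hypothetical. The first list of heads is taken from the first sentence of the CoNLL 2007 training data in the Sample section.

```python
from typing import List


def is_dominated(heads: List[int], node: int, ancestor: int) -> bool:
    """True if `ancestor` is `node` itself or lies on the path from `node` to the root (0)."""
    steps = 0
    while node != 0 and steps <= len(heads):  # the step limit guards against cycles
        if node == ancestor:
            return True
        node = heads[node - 1]
        steps += 1
    return ancestor == 0


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return 1-based ids of tokens whose attachment to their head is nonprojective."""
    result = []
    for dep, head in enumerate(heads, start=1):
        left, right = min(dep, head), max(dep, head)
        if any(not is_dominated(heads, k, head) for k in range(left + 1, right)):
            result.append(dep)
    return result


# HEAD column of the first CoNLL 2007 training sentence (tokens 1 to 12).
SAMPLE_HEADS = [4, 3, 4, 8, 4, 4, 6, 0, 10, 8, 10, 11]
# Hypothetical toy tree: token 2 attaches to token 4 across the root token 3.
TOY_HEADS = [3, 4, 0, 3]

print(nonprojective_tokens(SAMPLE_HEADS))
print(nonprojective_tokens(TOY_HEADS))

# The percentage quoted above, recomputed from the two counts.
percentage = round(100 * 1925 / 151604, 2)
print(percentage)
```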
The results of the CoNLL 2007 shared task are available online and were published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens (unlike in the CoNLL 2006 shared task). These are the best results for Basque:
Parser (Authors) | LAS | UAS |
---|---|---|
Malt (Nilsson et al.) | 76.94 | 82.84 |
Titov et al. | 75.49 | 81.93 |
Sagae | 74.64 | 81.19 |
Carreras | 75.75 | 81.11 |
Nakagawa | 72.56 | 81.04 |
Malt (J. Hall et al.) | 74.99 | 80.61 |
Johansson et al. | 75.08 | 80.43 |
The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.
Parsing results on BDT-II have been published in Kepa Bengoetxea, Koldo Gojenola: Application of Different Techniques to Dependency Parsing of Basque. In: Proceedings of the First Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2010), NAACL Workshop, Los Angeles, California, USA, 2010. They report only Labeled Attachment Score (LAS) and their best system achieved LAS = 78.98%.
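For readers unfamiliar with the two scores, here is a minimal sketch of how they are usually computed: UAS is the percentage of tokens that receive the correct head, and LAS is the percentage that receive both the correct head and the correct dependency label. Following the CoNLL 2007 convention mentioned above, every token counts, including punctuation. This is not the official evaluation script; the function name and the four-token gold and predicted analyses are hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head id, dependency label) of one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over all tokens, punctuation included."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses differ in length")
    if not gold:
        raise ValueError("no tokens to evaluate")
    correct_head = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_both = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * correct_both / total, 100.0 * correct_head / total


# Hypothetical gold and predicted analyses of a four-token sentence.
gold = [(2, "ncmod"), (4, "ncsubj"), (2, "ncmod"), (0, "ROOT")]
pred = [(2, "ncmod"), (4, "ncobj"), (4, "ncmod"), (0, "ROOT")]

las, uas = attachment_scores(gold, pred)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")
```

In this toy case three of the four heads are correct, but only two tokens also carry the correct label, which is why LAS can never exceed UAS.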