
Institute of Formal and Applied Linguistics Wiki



Hungarian (hu)

Szeged Treebank (SzTB)

Versions

Obtaining and License

The Szeged Treebank is available for research free of charge, provided the user first signs the license agreement. The website uses JavaScript to manage content, which makes it difficult to link directly to the relevant sections. Click on “downloads” (letöltések) to get the list of downloadable corpora and links to their descriptions (e.g. Szeged Treebank 2.0). To obtain the treebank, complete the license form, print it, sign it and either fax it to +36-62-546397 or mail it to Vincze Veronika, Árpád tér 2, H-6720 Szeged. You will then be given the user ID and password needed to download the data. The site links to Microsoft Word documents with the license agreement, but the links did not work for me. Ask Veronika Vincze how to proceed (vinczev (at) inf (dot) u-szeged (dot) hu).

Republication of the CoNLL 2007 version through the LDC is planned but has not happened yet.

The CoNLL 2007 license in short:

SzTB was created by members of the Human Language Technology Group (Nyelvtechnológiai Csoport), Department of Informatics (Informatikai Tanszékcsoport), University of Szeged (Szegedi Tudományegyetem), Árpád tér 2, H-6720 Szeged, Hungary. Conversion from constituents to dependencies for the CoNLL 2007 shared task was done by Zoltán Alexin.

References

Domain

Mixed:

Size

According to the project website, SzTB 2.0 contains 1.2 million words plus 250 thousand punctuation tokens in 82,000 sentences. Only a fragment was converted to dependencies in the CoNLL 2007 version: 139,143 tokens in 6,424 sentences, yielding 21.66 tokens per sentence on average (131,799 tokens / 6,034 sentences training, 7,344 tokens / 390 sentences test).
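The CoNLL 2007 figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted in the text; the variable names are made up for illustration.

```python
# Figures quoted in the text for the CoNLL 2007 version of SzTB.
train_tokens, train_sentences = 131_799, 6_034
test_tokens, test_sentences = 7_344, 390

# Training and test portions should add up to the stated totals.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens (punctuation included).
average_length = total_tokens / total_sentences

print(total_tokens, total_sentences, round(average_length, 2))
```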

Inside

The original Szeged Treebank is a phrase-structure (constituency) treebank, distributed in an XML-based, TEI-compliant format. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified) and is distributed in the CoNLL 2006/2007 format.

Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. The tagset used in SzTB seems to be the same as, or similar to, that of MULTEXT-East. In the CoNLL version, each tag was decomposed into the CPOS column, the POS column and a list of feature-value pairs in the FEAT column.
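As a minimal sketch of how the decomposed features can be read back, the following assumes that the FEAT column holds `name=value` pairs separated by `|`, with `_` standing for an empty list, as in the samples on this page. The function name `parse_feats` is hypothetical.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Turn a FEAT string such as 'n=singular|case=inessive' into a dict."""
    # An underscore marks an empty feature list in the CoNLL format.
    if feats == "_":
        return {}
    # Split into pairs on '|', then split each pair on the first '='.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# FEAT value of the token "hónapban" from the training sample.
print(parse_feats("n=singular|case=inessive|proper=no"))
```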

Personal names have been collapsed into one token, using underscore as the joining character (e.g. Torgyán_József).
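If the individual name parts are needed, the collapsed token can be split again on the underscore. This is a sketch under the assumption that an underscore inside a word form always marks such a join; the helper name is hypothetical.

```python
def split_collapsed_name(form: str) -> list[str]:
    """Split a collapsed personal name back into its parts."""
    # A bare "_" is the CoNLL placeholder for an empty field, not a joined name.
    if form == "_":
        return [form]
    return form.split("_")


print(split_collapsed_name("Torgyán_József"))
```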

Sample

The first sentence of the CoNLL 2007 training data:

1 Az az T Tf def=yes 4 DET _ _
2 elmúlt elmúlt A Af deg=positive|n=singular|case=nominative 4 ATT _ _
3 nyolc nyolc M Mc n=singular|case=nominative 4 ATT _ _
4 hónapban hónap N Nc n=singular|case=inessive|proper=no 16 INE _ _
5 , _ WPUNCT WPUNCT _ 16 PUNCT _ _
6 amelyből amely P Pr p=3rd|n=singular|case=elative 11 ELA _ _
7 összesen összesen R Rx _ 8 ADV _ _
8 hatot hat M Mc n=singular|case=accusative 11 OBJ _ _
9 kényszerűségből kényszerűség N Nc n=singular|case=elative|proper=no 11 ELA _ _
10 szabadságon szabadság N Nc n=singular|case=superessive|proper=no 11 SUP _ _
11 töltött tölt V Vm mood=indicative|t=past|p=3rd|n=singular|def=no 16 ATT _ _
12 a a T Tf def=yes 14 DET _ _
13 parlamenti parlamenti A Af deg=positive|n=singular|case=nominative 14 ATT _ _
14 ellenzék ellenzék N Nc n=singular|case=nominative|proper=no 11 SUBJ _ _
15 , _ WPUNCT WPUNCT _ 16 PUNCT _ _
16 megváltozott megváltozik V Vm mood=indicative|t=past|p=3rd|n=singular|def=no 0 ROOT _ _
17 itthon itthon R Rx _ 16 LOCY _ _
18 a a T Tf def=yes 19 DET _ _
19 hatalommegosztás hatalommegosztás N Nc n=singular|case=nominative|proper=no 22 ATT _ _
20 1990-ben 1990 M Mc n=singular|case=inessive 21 ATT _ _
21 kialakított kialakított A Af deg=positive|n=singular|case=nominative 22 ATT _ _
22 rendszere rendszer N Nc n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular 16 SUBJ _ _
23 : _ WPUNCT WPUNCT _ 16 PUNCT _ _
24 az az T Tf def=yes 26 DET _ _
25 e e P Pd p=3rd|n=singular|case=nominative 26 ATT _ _
26 héten hét N Nc n=singular|case=superessive|proper=no 28 ATT _ _
27 audienciát audiencia N Nc n=singular|case=accusative|proper=no 28 ATT _ _
28 tartó tartó A Af deg=positive|n=singular|case=nominative 29 ATT _ _
29 kormányfő kormányfő N Nc n=singular|case=nominative|proper=no 31 SUBJ _ _
30 gyakorlatilag gyakorlati A Af deg=positive|n=singular|case=essive 31 ADV _ _
31 kivonta kivon V Vm mood=indicative|t=past|p=3rd|n=singular|def=yes 16 CP _ _
32 magát maga P Px p=3rd|n=singular|case=accusative 31 OBJ _ _
33 az az T Tf def=yes 34 DET _ _
34 Országgyűlés Országgyűlés N Np n=singular|case=nominative|proper=yes 35 ATT _ _
35 ellenőrzése ellenőrzés N Nc n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular 36 ATT _ _
36 alól alól S St _ 31 PP _ _
37 . _ SPUNCT SPUNCT _ 16 PUNCT _ _

The first sentence of the CoNLL 2007 test data:

1 A a T Tf def=yes 2 DET _ _
2 bankokkal bank N Nc n=plural|case=instrumental|proper=no 4 INS _ _
3 kell kell V Vm mood=indicative|t=present|p=3rd|n=singular|def=no 0 ROOT _ _
4 egyezkedniük egyezkedik V Vm mood=infinitive|t=present|p=3rd|n=plural 3 INF _ _
5 azoknak az P Pd p=3rd|n=plural|case=dative 8 ATT _ _
6 a a T Tf def=yes 8 DET _ _
7 mezőgazdasági mezőgazdasági A Af deg=positive|n=singular|case=nominative 8 ATT _ _
8 termelőknek termelő N Nc n=plural|case=dative|proper=no 4 DAT _ _
9 , _ WPUNCT WPUNCT _ 3 PUNCT _ _
10 akik aki P Pr p=3rd|n=plural|case=nominative 21 SUBJ _ _
11 egy egy T Ti def=no 19 DET _ _
12 , _ WPUNCT WPUNCT _ 19 PUNCT _ _
13 a a T Tf def=yes 15 DET _ _
14 múlt múlt A Af deg=positive|n=singular|case=nominative 15 ATT _ _
15 héten hét N Nc n=singular|case=superessive|proper=no 16 ATT _ _
16 megjelent megjelent A Af deg=positive|n=singular|case=nominative 19 ATT _ _
17 földművelésügyi földművelésügyi A Af deg=positive|n=singular|case=nominative 18 ATT _ _
18 minisztériumi minisztériumi A Af deg=positive|n=singular|case=nominative 19 ATT _ _
19 rendelet rendelet N Nc n=singular|case=nominative|proper=no 20 ATT _ _
20 alapján alap N Nc n=singular|case=superessive|proper=no|pperson=3rd|pnumber=singular 21 SUP _ _
21 kérik kér V Vm mood=indicative|t=present|p=3rd|n=plural|def=yes 5 ATT _ _
22 ősszel ősszel R Rx _ 23 ADV _ _
23 lejáró lejáró A Af deg=positive|n=singular|case=nominative 27 ATT _ _
24 , _ WPUNCT WPUNCT _ 27 PUNCT _ _
25 éven év N Nc n=singular|case=superessive|proper=no 26 ATT _ _
26 belüli belüli A Af deg=positive|n=singular|case=nominative 27 ATT _ _
27 hiteleik hitel N Nc n=plural|case=nominative|proper=no|pperson=3rd|pnumber=plural 28 ATT _ _
28 átütemezését átütemezés N Nc n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular 21 OBJ _ _
29 . _ SPUNCT SPUNCT _ 3 PUNCT _ _
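Lines like the ones above can be read with a small parser. The sketch below assumes the ten columns of the CoNLL 2006/2007 format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and blank lines between sentences. It splits on any whitespace, so it accepts both tab-separated files and the space-separated rendering on this page. The names `Token`, `parse_token` and `read_sentences` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int
    deprel: str
    phead: str
    pdeprel: str


def parse_token(line: str) -> Token:
    """Parse one token line with exactly ten whitespace-separated columns."""
    cols = line.split()
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 cols[5], int(cols[6]), cols[7], cols[8], cols[9])


def read_sentences(lines: Iterable[str]) -> Iterator[list[Token]]:
    """Group token lines into sentences; a blank line ends a sentence."""
    sentence: list[Token] = []
    for line in lines:
        line = line.strip()
        if line:
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence


token = parse_token("4 hónapban hónap N Nc n=singular|case=inessive|proper=no 16 INE _ _")
print(token.form, token.lemma, token.head, token.deprel)
```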

Parsing

SzTB is a mildly nonprojective treebank: 4,032 of the 139,143 tokens of the CoNLL 2007 version (2.9%) are attached nonprojectively.
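The page does not state the exact criterion used for the count above. The sketch below uses one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. The function names and the four-token example are invented for illustration, and the input is assumed to be a well-formed tree.

```python
def dominates(heads: list[int], ancestor: int, node: int) -> bool:
    """True if `ancestor` is `node` itself or lies on its path to the root.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    if ancestor == 0:
        return True  # the artificial root dominates every token
    for _ in range(len(heads)):  # bounded walk, so malformed input cannot loop forever
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node - 1]
    return False


def nonprojective_tokens(heads: list[int]) -> list[int]:
    """Return the 1-based ids of tokens whose attachment is nonprojective."""
    result = []
    for dep, head in enumerate(heads, start=1):
        left, right = min(dep, head), max(dep, head)
        # Every token strictly between dependent and head must hang under the head.
        if any(not dominates(heads, head, k) for k in range(left + 1, right)):
            result.append(dep)
    return result


# Invented example: token 2 attaches to token 4 across the root token 3.
print(nonprojective_tokens([3, 4, 0, 3]))

# Share of nonprojective attachments quoted in the text.
print(round(100 * 4032 / 139143, 1))
```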

The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Hungarian:

Parser (Authors)         LAS     UAS
Malt (Nilsson et al.)    80.27   83.55
Sagae                    79.53   83.51
Nakagawa                 76.74   82.49
Titov et al.             77.94   82.18
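For readers unfamiliar with the two metrics: UAS (unlabeled attachment score) is the percentage of tokens that receive the correct head, and LAS (labeled attachment score) additionally requires the correct dependency label. The sketch below is not the official evaluation script. It only illustrates the definitions, counts every token including punctuation, and uses invented data.

```python
def attachment_scores(gold: list[tuple[int, str]],
                      predicted: list[tuple[int, str]]) -> tuple[float, float]:
    """Return (LAS, UAS) in percent for parallel lists of (head, deprel) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # UAS: the head matches; LAS: both the head and the label match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Invented four-token example: one wrong label, one wrong head.
gold = [(2, "DET"), (0, "ROOT"), (2, "OBJ"), (2, "PUNCT")]
pred = [(2, "DET"), (0, "ROOT"), (2, "ATT"), (3, "PUNCT")]
las, uas = attachment_scores(gold, pred)
print(las, uas)
```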

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

