Turkish (tr)
METU-Sabanci Turkish Treebank (ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi)
Versions
- METU-Sabanci in original XML format
- CoNLL 2007
Obtaining and License
The METU-Sabanci Turkish Treebank is available free of charge for research purposes, provided the user first signs the license agreement. To obtain the treebank, complete the license form, print it, sign it, scan it and e-mail it to corpus (at) ii (dot) metu (dot) edu (dot) tr, or fax it to +90-312-210-3745 (to the attention of Corpus Project / Işın Demirşahin).
Republication of the CoNLL 2007 version through the LDC is planned but has not happened yet.
The license in short:
- research purposes
- no redistribution
- cite the principal publications (see below) in any published work
The METU-Sabanci treebank was created by members of the Informatics Institute (Enformatik Enstitüsü), Middle-East Technical University (Orta Doğu Teknik Üniversitesi), Universiteler Mahallesi, Dumlupınar Bulvarı, No:1, TR-06800, Ankara, Turkey, and of the Faculty of Engineering and Natural Sciences (Mühendislik ve Doğa Bilimleri Fakültesi), Sabanci University (Sabancı Üniversitesi), TR-34956, Tuzla, İstanbul, Turkey.
References
- Website
- Data
- no separate citation
- Principal publications
- Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür: Building a Turkish Treebank. In: Anne Abeillé (ed.): Building and Exploiting Syntactically Annotated Corpora. Kluwer Academic Publishers, 2003.
- Nart B. Atalay, Kemal Oflazer, Bilge Say: The Annotation Process in the Turkish Treebank. In: Proceedings of the EACL Workshop on Linguistically Interpreted Corpora – LINC. Budapest, Hungary, 2003.
- Documentation
- Three PDF files are attached to the CoNLL version in the doc folder: ttbankkl.pdf (the chapter from the book edited by Anne Abeillé, which contains the list of morphological tags), turkishtreebank.pdf (the paper from the EACL workshop) and user_guide.pdf (the annotation manual for dependencies, in Turkish).
Domain
Mixed:
- Fiction
- Short essays by 14- to 16-year-old students
- Newspapers (Népszabadság, Népszava, Magyar Hírlap, HVG)
- Texts related to computer science
- Legal texts
- Economic and financial short news
Size
According to the treebank's website, SzTB 2.0 contains 1.2 million words plus 250 thousand punctuation tokens in 82,000 sentences. Only a fragment was converted to dependencies for the CoNLL 2007 version: 139,143 tokens in 6424 sentences, an average of 21.66 tokens per sentence (131,799 tokens / 6034 sentences in the training set, 7344 tokens / 390 sentences in the test set).
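The figures quoted above can be cross-checked with a few lines of Python. This is only a sanity check of the numbers given in this section; the variable names are illustrative.

```python
# Token and sentence counts quoted in this section (CoNLL 2007 version).
train_tokens, train_sentences = 131_799, 6_034
test_tokens, test_sentences = 7_344, 390

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
avg_tokens_per_sentence = total_tokens / total_sentences

# Prints the totals and the average sentence length rounded to two decimals.
print(total_tokens, total_sentences, round(avg_tokens_per_sentence, 2))
```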
Inside
The original Szeged Treebank is a phrase-based treebank, distributed in an XML-based, TEI-compliant format. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified) and is distributed in the CoNLL 2006/2007 format.
Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. The tagset used in SzTB seems to be the same as, or similar to, the Multext-East tagset. In the CoNLL version, each tag was decomposed into the CPOS column, the POS column and a list of feature-value pairs in the FEAT column.
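As an illustration of the feature-value list, the sketch below turns one FEAT value into a dictionary. It assumes the layout visible in the samples on this page: pairs joined by `|`, name and value joined by `=`, and `_` for an empty column. The helper name `parse_feats` is hypothetical.

```python
def parse_feats(feats: str) -> dict:
    """Turn a FEAT string such as 'n=singular|case=inessive' into a dict."""
    if feats == "_":  # an underscore marks an empty column
        return {}
    # Split into pairs on '|', then split each pair once on '='.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_feats("n=singular|case=inessive|proper=no"))
```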
Personal names have been collapsed into one token, using underscore as the joining character (e.g. Torgyán_József).
Sample
The first sentence of the CoNLL 2007 training data:
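ID | FORM | LEMMA | CPOSTAG | POSTAG | FEATS | HEAD | DEPREL | PHEAD | PDEPREL |
---|---|---|---|---|---|---|---|---|---|
1 | Az | az | T | Tf | def=yes | 4 | DET | _ | _ |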
2 | elmúlt | elmúlt | A | Af | deg=positive|n=singular|case=nominative | 4 | ATT | _ | _ |
3 | nyolc | nyolc | M | Mc | n=singular|case=nominative | 4 | ATT | _ | _ |
4 | hónapban | hónap | N | Nc | n=singular|case=inessive|proper=no | 16 | INE | _ | _ |
5 | , | _ | WPUNCT | WPUNCT | _ | 16 | PUNCT | _ | _ |
6 | amelyből | amely | P | Pr | p=3rd|n=singular|case=elative | 11 | ELA | _ | _ |
7 | összesen | összesen | R | Rx | _ | 8 | ADV | _ | _ |
8 | hatot | hat | M | Mc | n=singular|case=accusative | 11 | OBJ | _ | _ |
9 | kényszerűségből | kényszerűség | N | Nc | n=singular|case=elative|proper=no | 11 | ELA | _ | _ |
10 | szabadságon | szabadság | N | Nc | n=singular|case=superessive|proper=no | 11 | SUP | _ | _ |
11 | töltött | tölt | V | Vm | mood=indicative|t=past|p=3rd|n=singular|def=no | 16 | ATT | _ | _ |
12 | a | a | T | Tf | def=yes | 14 | DET | _ | _ |
13 | parlamenti | parlamenti | A | Af | deg=positive|n=singular|case=nominative | 14 | ATT | _ | _ |
14 | ellenzék | ellenzék | N | Nc | n=singular|case=nominative|proper=no | 11 | SUBJ | _ | _ |
15 | , | _ | WPUNCT | WPUNCT | _ | 16 | PUNCT | _ | _ |
16 | megváltozott | megváltozik | V | Vm | mood=indicative|t=past|p=3rd|n=singular|def=no | 0 | ROOT | _ | _ |
17 | itthon | itthon | R | Rx | _ | 16 | LOCY | _ | _ |
18 | a | a | T | Tf | def=yes | 19 | DET | _ | _ |
19 | hatalommegosztás | hatalommegosztás | N | Nc | n=singular|case=nominative|proper=no | 22 | ATT | _ | _ |
20 | 1990-ben | 1990 | M | Mc | n=singular|case=inessive | 21 | ATT | _ | _ |
21 | kialakított | kialakított | A | Af | deg=positive|n=singular|case=nominative | 22 | ATT | _ | _ |
22 | rendszere | rendszer | N | Nc | n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular | 16 | SUBJ | _ | _ |
23 | : | _ | WPUNCT | WPUNCT | _ | 16 | PUNCT | _ | _ |
24 | az | az | T | Tf | def=yes | 26 | DET | _ | _ |
25 | e | e | P | Pd | p=3rd|n=singular|case=nominative | 26 | ATT | _ | _ |
26 | héten | hét | N | Nc | n=singular|case=superessive|proper=no | 28 | ATT | _ | _ |
27 | audienciát | audiencia | N | Nc | n=singular|case=accusative|proper=no | 28 | ATT | _ | _ |
28 | tartó | tartó | A | Af | deg=positive|n=singular|case=nominative | 29 | ATT | _ | _ |
29 | kormányfő | kormányfő | N | Nc | n=singular|case=nominative|proper=no | 31 | SUBJ | _ | _ |
30 | gyakorlatilag | gyakorlati | A | Af | deg=positive|n=singular|case=essive | 31 | ADV | _ | _ |
31 | kivonta | kivon | V | Vm | mood=indicative|t=past|p=3rd|n=singular|def=yes | 16 | CP | _ | _ |
32 | magát | maga | P | Px | p=3rd|n=singular|case=accusative | 31 | OBJ | _ | _ |
33 | az | az | T | Tf | def=yes | 34 | DET | _ | _ |
34 | Országgyűlés | Országgyűlés | N | Np | n=singular|case=nominative|proper=yes | 35 | ATT | _ | _ |
35 | ellenőrzése | ellenőrzés | N | Nc | n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular | 36 | ATT | _ | _ |
36 | alól | alól | S | St | _ | 31 | PP | _ | _ |
37 | . | _ | SPUNCT | SPUNCT | _ | 16 | PUNCT | _ | _ |
The first sentence of the CoNLL 2007 test data:
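ID | FORM | LEMMA | CPOSTAG | POSTAG | FEATS | HEAD | DEPREL | PHEAD | PDEPREL |
---|---|---|---|---|---|---|---|---|---|
1 | A | a | T | Tf | def=yes | 2 | DET | _ | _ |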
2 | bankokkal | bank | N | Nc | n=plural|case=instrumental|proper=no | 4 | INS | _ | _ |
3 | kell | kell | V | Vm | mood=indicative|t=present|p=3rd|n=singular|def=no | 0 | ROOT | _ | _ |
4 | egyezkedniük | egyezkedik | V | Vm | mood=infinitive|t=present|p=3rd|n=plural | 3 | INF | _ | _ |
5 | azoknak | az | P | Pd | p=3rd|n=plural|case=dative | 8 | ATT | _ | _ |
6 | a | a | T | Tf | def=yes | 8 | DET | _ | _ |
7 | mezőgazdasági | mezőgazdasági | A | Af | deg=positive|n=singular|case=nominative | 8 | ATT | _ | _ |
8 | termelőknek | termelő | N | Nc | n=plural|case=dative|proper=no | 4 | DAT | _ | _ |
9 | , | _ | WPUNCT | WPUNCT | _ | 3 | PUNCT | _ | _ |
10 | akik | aki | P | Pr | p=3rd|n=plural|case=nominative | 21 | SUBJ | _ | _ |
11 | egy | egy | T | Ti | def=no | 19 | DET | _ | _ |
12 | , | _ | WPUNCT | WPUNCT | _ | 19 | PUNCT | _ | _ |
13 | a | a | T | Tf | def=yes | 15 | DET | _ | _ |
14 | múlt | múlt | A | Af | deg=positive|n=singular|case=nominative | 15 | ATT | _ | _ |
15 | héten | hét | N | Nc | n=singular|case=superessive|proper=no | 16 | ATT | _ | _ |
16 | megjelent | megjelent | A | Af | deg=positive|n=singular|case=nominative | 19 | ATT | _ | _ |
17 | földművelésügyi | földművelésügyi | A | Af | deg=positive|n=singular|case=nominative | 18 | ATT | _ | _ |
18 | minisztériumi | minisztériumi | A | Af | deg=positive|n=singular|case=nominative | 19 | ATT | _ | _ |
19 | rendelet | rendelet | N | Nc | n=singular|case=nominative|proper=no | 20 | ATT | _ | _ |
20 | alapján | alap | N | Nc | n=singular|case=superessive|proper=no|pperson=3rd|pnumber=singular | 21 | SUP | _ | _ |
21 | kérik | kér | V | Vm | mood=indicative|t=present|p=3rd|n=plural|def=yes | 5 | ATT | _ | _ |
22 | ősszel | ősszel | R | Rx | _ | 23 | ADV | _ | _ |
23 | lejáró | lejáró | A | Af | deg=positive|n=singular|case=nominative | 27 | ATT | _ | _ |
24 | , | _ | WPUNCT | WPUNCT | _ | 27 | PUNCT | _ | _ |
25 | éven | év | N | Nc | n=singular|case=superessive|proper=no | 26 | ATT | _ | _ |
26 | belüli | belüli | A | Af | deg=positive|n=singular|case=nominative | 27 | ATT | _ | _ |
27 | hiteleik | hitel | N | Nc | n=plural|case=nominative|proper=no|pperson=3rd|pnumber=plural | 28 | ATT | _ | _ |
28 | átütemezését | átütemezés | N | Nc | n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular | 21 | OBJ | _ | _ |
29 | . | _ | SPUNCT | SPUNCT | _ | 3 | PUNCT | _ | _ |
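The samples above are rendered as tables; the data files themselves are assumed here to follow the usual CoNLL 2006/2007 layout, with ten tab-separated columns per token and a blank line after each sentence. Under that assumption, a minimal reader could look like the sketch below. The function name `read_conll` and the lowercase column keys are illustrative, not part of any official tool.

```python
from typing import Dict, Iterable, Iterator, List

# Column names of the CoNLL 2006/2007 format, lowercased for use as dict keys.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts from tab-separated lines."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:  # the input may lack a trailing blank line
        yield sentence


# First two tokens of the test sentence above, rewritten with tabs.
sample = [
    "1\tA\ta\tT\tTf\tdef=yes\t2\tDET\t_\t_",
    "2\tbankokkal\tbank\tN\tNc\tn=plural|case=instrumental|proper=no\t4\tINS\t_\t_",
    "",
]
for sent in read_conll(sample):
    print([(tok["form"], tok["head"], tok["deprel"]) for tok in sent])
```

All values are kept as strings; a caller that needs numeric heads would convert `tok["head"]` with `int()`.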
Parsing
SzTB is a mildly nonprojective treebank. 4032 of the 139,143 tokens of the CoNLL 2007 version are attached nonprojectively (2.9%).
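To make the notion concrete, the sketch below flags nonprojectively attached tokens in a single sentence. It assumes one common definition: the arc from head h to dependent d is nonprojective if some token strictly between them is not a descendant of h, with the artificial root at position 0 dominating everything. It is not known whether the count above was produced with exactly this definition, and the function name and the four-token example are invented for illustration.

```python
def nonprojective_tokens(heads):
    """Return 1-based ids of tokens whose arc to their head is nonprojective.

    heads[i] is the head of token i+1; 0 denotes the artificial root.
    Assumes the heads form a well-formed tree (no cycles).
    """
    def dominated_by(node, ancestor):
        # Walk up the head chain from node until the root is reached.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0  # the artificial root dominates every token

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must be under head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Invented example: arcs 3->1 and 4->2 cross, so token 2 is nonprojective.
print(nonprojective_tokens([3, 4, 0, 3]))

# Share of nonprojective tokens quoted above, in percent.
print(round(100 * 4032 / 139143, 1))
```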
The results of the CoNLL 2007 shared task are available online and were published in (Nivre et al., 2007). Unlike in the CoNLL 2006 shared task, the evaluation included punctuation tokens. These are the best results for Hungarian:
Parser (Authors) | LAS | UAS |
---|---|---|
Malt (Nilsson et al.) | 80.27 | 83.55 |
Sagae | 79.53 | 83.51 |
Nakagawa | 76.74 | 82.49 |
Titov et al. | 77.94 | 82.18 |
The two Malt parser results of 2007 (Single Malt and Blended) are described in (Hall et al., 2007); details of the parser configuration are described here.
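For readers unfamiliar with the two columns of the table above: the labeled attachment score (LAS) is the percentage of tokens that receive both the correct head and the correct dependency label, and the unlabeled attachment score (UAS) is the percentage that receive the correct head. The sketch below computes both over all tokens, punctuation included, as in the 2007 evaluation. It is not the official evaluation script; the function name and the predicted values are invented.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent over all tokens, punctuation included.

    gold and predicted are equal-length lists of (head, deprel) pairs.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty, equal-length token lists")
    # UAS counts matching heads; LAS counts matching (head, label) pairs.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * las_hits / n, 100.0 * uas_hits / n


# Gold values: first four tokens of the test sentence shown in the Sample section.
gold = [(2, "DET"), (4, "INS"), (0, "ROOT"), (3, "INF")]
# Invented parser output: one wrong label (token 2), one wrong head (token 4).
pred = [(2, "DET"), (4, "DAT"), (0, "ROOT"), (4, "INF")]
print(attachment_scores(gold, pred))
```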