Institute of Formal and Applied Linguistics Wiki






Turkish (tr)

METU-Sabanci Turkish Treebank (ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi)

Versions

Obtaining and License

The METU-Sabanci Turkish Treebank is available free of charge for research purposes, provided the user first signs the license agreement. To obtain the treebank, complete the license form, print it, sign it, scan it and e-mail it to corpus (at) ii (dot) metu (dot) edu (dot) tr, or fax it to +90-312-210-3745 (to the attention of Corpus Project / Işın Demirşahin).

Republication of the CoNLL 2007 version through the LDC is planned but has not happened yet.

The license in short:

The METU-Sabanci Treebank was created by members of the Informatics Institute (Enformatik Enstitüsü), Middle East Technical University (Orta Doğu Teknik Üniversitesi), Üniversiteler Mahallesi, Dumlupınar Bulvarı, No:1, TR-06800, Ankara, Turkey, and of the Faculty of Engineering and Natural Sciences (Mühendislik ve Doğa Bilimleri Fakültesi), Sabanci University (Sabancı Üniversitesi), TR-34956, Tuzla, İstanbul, Turkey.

References

Domain

Post-1990 written Turkish, sampled from various genres.

Size

According to the treebank's website, the treebank contains 7262 sentences. The CoNLL 2007 version contains 69695 tokens in 5935 sentences, i.e. 11.74 tokens per sentence on average (training: 65182 tokens in 5635 sentences; test: 4513 tokens in 300 sentences).
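The totals and the average sentence length can be rechecked from the training and test counts quoted above. This is only an arithmetic sketch; the variable names are illustrative.

```python
# Counts quoted in the text for the CoNLL 2007 version.
train_tokens, train_sentences = 65182, 5635
test_tokens, test_sentences = 4513, 300

# Totals are the sums over the training and test portions.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens.
avg_len = total_tokens / total_sentences

print(total_tokens, total_sentences, round(avg_len, 2))
```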

Inside

The original METU-Sabanci Treebank is dependency-based and is distributed in an XML-based format. Dependencies link inflectional groups (IGs), i.e. parts of words split at derivational boundaries, rather than whole words. The CoNLL 2007 version is distributed in the CoNLL 2006/2007 format.

Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. In the CoNLL version, tags are decomposed into the CPOS column, the POS column and a list of features in the FEAT column, separated by vertical bars (e.g. A3sg|Pnon|Nom).
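As an illustration of the column layout, the following minimal sketch splits one line of the CoNLL 2006/2007 format into its fields and breaks the FEAT column into a list. The names Token and parse_conll_line are hypothetical, and the sketch assumes whitespace-separated columns with no spaces inside a field, as in the samples on this page.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: List[str]
    head: int
    deprel: str


def parse_conll_line(line: str) -> Token:
    """Parse one token line: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL."""
    cols = line.split()
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # An underscore marks an empty feature list; otherwise features are '|'-separated.
    feats = [] if cols[5] == "_" else cols[5].split("|")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


tok = parse_conll_line("3 şey şey Noun Noun A3sg|Pnon|Nom 4 OBJECT _ _")
print(tok.cpos, tok.pos, tok.feats, tok.head, tok.deprel)
```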

Personal names have been collapsed into one token, using underscore as the joining character.

Sample

The first sentence of the CoNLL 2007 training data:

1 Ama ama Conj Conj _ 8 S.MODIFIER _ _
2 hiçbir hiçbir Det Det _ 3 DETERMINER _ _
3 şey şey Noun Noun A3sg|Pnon|Nom 4 OBJECT _ _
4 söylemedim söyle Verb Verb Neg|Past|A1sg 8 SENTENCE _ _
5 ki ki Conj Conj _ 4 INTENSIFIER _ _
6 ben ben Pron PersP A1sg|Pnon|Nom 4 SUBJECT _ _
7 sizlere siz Pron PersP A2pl|Pnon|Dat 4 DATIVE.ADJUNCT _ _
8 . . Punc Punc _ 0 ROOT _ _

The first sentence of the CoNLL 2007 test data:

1 _ ötele Verb Verb Pos 2 DERIV _ _
2 Öteleme _ Noun NInf A3sg|Pnon|Nom 3 CLASSIFIER _ _
3 işleminde işlem Noun Noun A3sg|P3sg|Loc 10 LOCATIVE.ADJUNCT _ _
4 kuyrukta kuyruk Noun Noun A3sg|Pnon|Loc 5 LOCATIVE.ADJUNCT _ _
5 _ bekle Verb Verb Pos 6 DERIV _ _
6 bekleyen _ Adj APresPart _ 7 MODIFIER _ _
7 eleman eleman Noun Noun A3sg|Pnon|Nom 10 SUBJECT _ _
8 yığına yığın Noun Noun A3sg|Pnon|Dat 10 DATIVE.ADJUNCT _ _
9 _ it Verb Verb _ 10 DERIV _ _
10 itilir _ Verb Verb Pass|Pos|Aor|A3sg 11 SENTENCE _ _
11 . . Punc Punc _ 0 ROOT _ _
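In the test sentence above, some rows have an underscore in the FORM column and the DERIV relation, and the surface form appears only on the following row. The sketch below regroups such rows into surface words. It is based solely on the pattern visible in this sample, which is an assumption about the data as a whole, and the function name group_words is hypothetical.

```python
def group_words(rows):
    """rows: list of (form, lemma) pairs in sentence order.

    Rows whose form is '_' are collected and attached to the next row
    that carries a surface form.
    """
    words, pending = [], []
    for form, lemma in rows:
        if form == "_":
            pending.append(lemma)          # derivational part without a surface form
        else:
            lemmas = pending + ([lemma] if lemma != "_" else [])
            words.append((form, lemmas))
            pending = []
    return words


# FORM and LEMMA columns of the first test sentence shown above.
rows = [("_", "ötele"), ("Öteleme", "_"), ("işleminde", "işlem"),
        ("kuyrukta", "kuyruk"), ("_", "bekle"), ("bekleyen", "_"),
        ("eleman", "eleman"), ("yığına", "yığın"), ("_", "it"),
        ("itilir", "_"), (".", ".")]
words = group_words(rows)
for form, lemmas in words:
    print(form, lemmas)
```

On this sentence the 11 rows collapse into 8 surface words.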

Parsing

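The METU-Sabanci Treebank contains nonprojective dependencies. The figures given on this page so far (4032 of 139,143 tokens attached nonprojectively, 2.9%) cannot describe the Turkish data, because the CoNLL 2007 version has only 69695 tokens; they refer to SzTB (the Szeged Treebank), and the Turkish counts still need to be filled in.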
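The share of nonprojectively attached tokens can be recomputed from the HEAD column. The sketch below uses one common definition, which is an assumption here since the page does not state which definition was used: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name count_nonprojective is hypothetical.

```python
def count_nonprojective(heads):
    """heads[i - 1] is the head of token i (1-based); 0 is the artificial root."""
    n = len(heads)

    def dominates(ancestor, node):
        # Walk up from node until the ancestor is found.
        steps = 0
        while node != ancestor:
            if node == 0 or steps > n:   # reached the root or hit a cycle
                return False
            node = heads[node - 1]
            steps += 1
        return True

    count = 0
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        # Every token strictly between head and dependent must be dominated by the head.
        if any(not dominates(head, k) for k in range(lo + 1, hi)):
            count += 1
    return count


# HEAD column of the first training sentence shown in the Sample section.
train_heads = [8, 3, 4, 8, 4, 4, 4, 0]
# A made-up four-token tree in which token 2 is attached across token 3.
toy_heads = [3, 4, 0, 3]
print(count_nonprojective(train_heads), count_nonprojective(toy_heads))
```

Dividing the count summed over all sentences by the total number of tokens gives the percentage.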

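The results of the CoNLL 2007 shared task are available online and have been published in (Nivre et al., 2007). For that task, the evaluation procedure was changed to include punctuation tokens. The table below was originally labelled as the best results for Hungarian, so the scores should be checked against the published Turkish results before use: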

Parser (Authors)        LAS     UAS
Malt (Nilsson et al.)   80.27   83.55
Sagae                   79.53   83.51
Nakagawa                76.74   82.49
Titov et al.            77.94   82.18
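LAS (labeled attachment score) is the percentage of tokens that receive both the correct head and the correct dependency label; UAS (unlabeled attachment score) counts the correct head only. The sketch below computes both over all tokens, punctuation included. It is not the official evaluation script; the function name attachment_scores and the predicted analysis are made up for illustration.

```python
def attachment_scores(gold, predicted):
    """gold, predicted: lists of (head, deprel) pairs, one per token.

    Returns (LAS, UAS) in percent, counting every token including punctuation.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))   # head only
    las_hits = sum(g == p for g, p in zip(gold, predicted))         # head and label
    return 100.0 * las_hits / n, 100.0 * uas_hits / n


# Gold HEAD and DEPREL columns of the first training sentence in the Sample section.
gold = [(8, "S.MODIFIER"), (3, "DETERMINER"), (4, "OBJECT"), (8, "SENTENCE"),
        (4, "INTENSIFIER"), (4, "SUBJECT"), (4, "DATIVE.ADJUNCT"), (0, "ROOT")]
# Hypothetical parser output: wrong label on token 5, wrong head on token 6.
predicted = list(gold)
predicted[4] = (4, "MODIFIER")
predicted[5] = (7, "SUBJECT")

las, uas = attachment_scores(gold, predicted)
print(las, uas)
```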

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

