Table of Contents
Turkish (tr)
METU-Sabanci Turkish Treebank (ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi)
Versions
- METU-Sabanci in original XML format
- CoNLL 2007
Obtaining and License
The METU-Sabanci Turkish Treebank is available free of charge for research, provided the user first signs the license agreement. To obtain the treebank, complete the license form, print it, sign it, scan it and e-mail it to corpus (at) ii (dot) metu (dot) edu (dot) tr, or fax it to +90-312-210-3745 (attention: Corpus Project / Işın Demirşahin).
Republication of the CoNLL 2007 version via the LDC is planned but has not happened yet.
The license in short:
- research purposes
- no redistribution
- cite the principal publications (see below) in publications
The METU treebank was created by members of the Informatics Institute (Enformatik Enstitüsü), Middle East Technical University (Orta Doğu Teknik Üniversitesi), Universiteler Mahallesi, Dumlupınar Bulvarı, No:1, TR-06800, Ankara, Turkey, and of the Faculty of Engineering and Natural Sciences (Mühendislik ve Doğa Bilimleri Fakültesi), Sabanci University (Sabancı Üniversitesi), TR-34956, Tuzla, İstanbul, Turkey.
References
- Website
- Data
- no separate citation
- Principal publications
- Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür: Building a Turkish Treebank. In: Anne Abeillé (ed.): Building and Exploiting Syntactically Annotated Corpora. Kluwer Academic Publishers, 2003.
- Nart B. Atalay, Kemal Oflazer, Bilge Say: The Annotation Process in the Turkish Treebank. In: Proceedings of the EACL Workshop on Linguistically Interpreted Corpora – LINC. Budapest, Hungary, 2003.
- Documentation
- Three PDF files are attached to the CoNLL version in the doc folder: ttbankkl.pdf (the chapter from the Anne Abeillé volume, containing the list of morphological tags), turkishtreebank.pdf (the paper from the EACL workshop) and user_guide.pdf (the annotation manual for dependencies, in Turkish).
Domain
Post-1990 written Turkish, sampled from various genres.
Size
According to their website, the treebank contains 7262 sentences. The CoNLL 2007 version contains 69695 tokens in 5935 sentences, yielding 11.74 tokens per sentence on average (65182 tokens / 5635 sentences training, 4513 tokens / 300 sentences test).
Inside
The original METU-Sabanci Treebank is distributed in XML-based, TEI-compliant format. The CoNLL 2007 version is distributed in the CoNLL 2006/2007 format.
Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually.
There are special derivational nodes: derived words have been split into several tokens (see also the sample below). The typical pattern (possibly the only one, though I have not confirmed that) is as follows. Two nodes are connected by a dependency link. The head node corresponds to the surface word: it has the word form, part of speech and morphological features, but no lemma (the LEMMA column contains '_'). The surface word is the result of a derivational morphological process; it has been derived from another word, often of a different part of speech (e.g. a noun derived from a verb). The dependent node represents the source of the derivation: it has no word form, but it has a lemma. Its part-of-speech tag describes the source word and can thus differ from the part-of-speech tag of the head node; its FEATS column says just 'Pos'. The dependent node need not be a leaf: other nodes may depend on it instead of on the parent node. For instance, if a noun is derived from a verb (i.e. a verbal node depends on the nominal node) and a dependent fills a verbal valency slot of the derived noun, we can expect that dependent to be attached to the verbal node.
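The lemma-recovery logic implied by this pattern can be sketched in Python. This is a minimal illustration, assuming tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL); the function names are my own, not part of any official tool.

```python
# Sketch: follow DERIV dependents to recover the lemma behind a surface
# token whose LEMMA column is '_'. Assumes tab-separated CoNLL-X lines.

SAMPLE = """\
1\t_\tötele\tVerb\tVerb\tPos\t2\tDERIV
2\tÖteleme\t_\tNoun\tNInf\tA3sg|Pnon|Nom\t3\tCLASSIFIER
3\tişleminde\tişlem\tNoun\tNoun\tA3sg|P3sg|Loc\t10\tLOCATIVE.ADJUNCT"""

def parse(conll):
    """Read the columns we need from each token line."""
    tokens = []
    for line in conll.strip().splitlines():
        cols = line.split("\t")
        tokens.append({"id": int(cols[0]), "form": cols[1], "lemma": cols[2],
                       "head": int(cols[6]), "deprel": cols[7]})
    return tokens

def underlying_lemma(tok_id, tokens):
    """Descend through DERIV dependents until a node with a lemma is found."""
    by_id = {t["id"]: t for t in tokens}
    node = by_id[tok_id]
    while node["lemma"] == "_":
        node = next(t for t in tokens
                    if t["head"] == node["id"] and t["deprel"] == "DERIV")
    return node["lemma"]

print(underlying_lemma(2, parse(SAMPLE)))  # → ötele
```

Longer derivational chains (as in the three-step example below) are handled by the same loop, since each '_'-lemma node has its own DERIV dependent.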
Occasionally there are derivational chains longer than two nodes. An example occurs in sentence no. 82 of the test data:
lemma azal / Verb → _ / Verb / Caus → _ / Verb / Pass|Pos → azaltılması / Noun / NInf / A3sg|P3sg|Nom
According to Google Translate, azal means “to decrease” and azaltılması means “reduced”. TRmorph gives the following four analyses:
analyze> azaltılması
azal<v><caus><pass><vn_ma><p3s>
azal<v><caus><pass><vn_ma><p3s><3s>
azal<v><caus><pass><vn_ma><p3s><3p>
azal<v><caus><pass><cv_ma><p3s>
Sample
The first sentence of the CoNLL 2007 training data:
ID | FORM | LEMMA | CPOSTAG | POSTAG | FEATS | HEAD | DEPREL | PHEAD | PDEPREL |
---|---|---|---|---|---|---|---|---|---|
1 | Ama | ama | Conj | Conj | _ | 8 | S.MODIFIER | _ | _ |
2 | hiçbir | hiçbir | Det | Det | _ | 3 | DETERMINER | _ | _ |
3 | şey | şey | Noun | Noun | A3sg|Pnon|Nom | 4 | OBJECT | _ | _ |
4 | söylemedim | söyle | Verb | Verb | Neg|Past|A1sg | 8 | SENTENCE | _ | _ |
5 | ki | ki | Conj | Conj | _ | 4 | INTENSIFIER | _ | _ |
6 | ben | ben | Pron | PersP | A1sg|Pnon|Nom | 4 | SUBJECT | _ | _ |
7 | sizlere | siz | Pron | PersP | A2pl|Pnon|Dat | 4 | DATIVE.ADJUNCT | _ | _ |
8 | . | . | Punc | Punc | _ | 0 | ROOT | _ | _ |
The first sentence of the CoNLL 2007 test data:
ID | FORM | LEMMA | CPOSTAG | POSTAG | FEATS | HEAD | DEPREL | PHEAD | PDEPREL |
---|---|---|---|---|---|---|---|---|---|
1 | _ | ötele | Verb | Verb | Pos | 2 | DERIV | _ | _ |
2 | Öteleme | _ | Noun | NInf | A3sg|Pnon|Nom | 3 | CLASSIFIER | _ | _ |
3 | işleminde | işlem | Noun | Noun | A3sg|P3sg|Loc | 10 | LOCATIVE.ADJUNCT | _ | _ |
4 | kuyrukta | kuyruk | Noun | Noun | A3sg|Pnon|Loc | 5 | LOCATIVE.ADJUNCT | _ | _ |
5 | _ | bekle | Verb | Verb | Pos | 6 | DERIV | _ | _ |
6 | bekleyen | _ | Adj | APresPart | _ | 7 | MODIFIER | _ | _ |
7 | eleman | eleman | Noun | Noun | A3sg|Pnon|Nom | 10 | SUBJECT | _ | _ |
8 | yığına | yığın | Noun | Noun | A3sg|Pnon|Dat | 10 | DATIVE.ADJUNCT | _ | _ |
9 | _ | it | Verb | Verb | _ | 10 | DERIV | _ | _ |
10 | itilir | _ | Verb | Verb | Pass|Pos|Aor|A3sg | 11 | SENTENCE | _ | _ |
11 | . | . | Punc | Punc | _ | 0 | ROOT | _ | _ |
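Because derived words are split into several tokens, reconstructing the surface sentence means skipping the lemma-only nodes whose FORM is '_'. A minimal Python sketch of this, assuming tab-separated CoNLL-X input (the function name is mine):

```python
# Sketch: rebuild the surface sentence from CoNLL-X tokens, dropping the
# derivational nodes that have no word form (FORM = '_').

SAMPLE = """\
1\t_\tötele\tVerb\tVerb\tPos\t2\tDERIV
2\tÖteleme\t_\tNoun\tNInf\tA3sg|Pnon|Nom\t3\tCLASSIFIER
3\tişleminde\tişlem\tNoun\tNoun\tA3sg|P3sg|Loc\t10\tLOCATIVE.ADJUNCT
4\tkuyrukta\tkuyruk\tNoun\tNoun\tA3sg|Pnon|Loc\t5\tLOCATIVE.ADJUNCT
5\t_\tbekle\tVerb\tVerb\tPos\t6\tDERIV
6\tbekleyen\t_\tAdj\tAPresPart\t_\t7\tMODIFIER
7\teleman\teleman\tNoun\tNoun\tA3sg|Pnon|Nom\t10\tSUBJECT
8\tyığına\tyığın\tNoun\tNoun\tA3sg|Pnon|Dat\t10\tDATIVE.ADJUNCT
9\t_\tit\tVerb\tVerb\t_\t10\tDERIV
10\titilir\t_\tVerb\tVerb\tPass|Pos|Aor|A3sg\t11\tSENTENCE
11\t.\t.\tPunc\tPunc\t_\t0\tROOT"""

def surface_sentence(conll):
    """Join the FORM column of all tokens that carry a word form."""
    forms = (line.split("\t")[1] for line in conll.strip().splitlines())
    return " ".join(f for f in forms if f != "_")

print(surface_sentence(SAMPLE))
# → Öteleme işleminde kuyrukta bekleyen eleman yığına itilir .
```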
Parsing
The nonprojectivity rate in METU-Sabanci is relatively high: 3716 of the 69695 tokens of the CoNLL 2007 version (5.33%) are attached nonprojectively.
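Under the standard definition, the arc from head h to dependent d is projective iff every token strictly between them is a descendant of h. A minimal check along these lines (my own sketch, not the CoNLL evaluation code):

```python
# Sketch: test whether the arc entering token d is nonprojective, given a
# map from each token id to the id of its head (0 = artificial root).

def is_nonprojective(heads, d):
    h = heads[d]
    lo, hi = sorted((h, d))
    for k in range(lo + 1, hi):
        # climb from k towards the root; a projective arc requires every
        # token inside the span to pass through h on the way up
        a = k
        while a not in (0, h):
            a = heads[a]
        if a != h:
            return True
    return False

# Toy tree over tokens 1..4: root → 4, 4 → 1, 4 → 2, 1 → 3.
# The arc 1 → 3 spans token 2, which is not a descendant of 1,
# so the arcs 1 → 3 and 4 → 2 cross.
heads = {1: 4, 2: 4, 3: 1, 4: 0}
print(is_nonprojective(heads, 3))  # → True
print(is_nonprojective(heads, 2))  # → False
```

Counting, over all tokens of a treebank, how many satisfy this test yields rates such as the 5.33% quoted above.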
The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Turkish:
Parser (Authors) | LAS | UAS |
---|---|---|
Titov et al. | 79.81 | 86.22 |
Malt (Nilsson et al.) | 79.79 | 85.77 |
Nakagawa | 78.22 | 85.77 |
Keith Hall | 77.42 | 85.18 |
Malt (Johan Hall) | 79.24 | 85.04 |
The two Malt parser results from 2007 (Single Malt and Blended) are described in (Hall et al., 2007), and the details of the parser configuration are described here.