user:zeman:treebanks:tr [2012/03/22 20:57] zeman Sample. |
user:zeman:treebanks:tr [2014/04/22 17:01] (current) zeman Updated link. |
===== Turkish (tr) ===== | ===== Turkish (tr) ===== |
| |
[[http://www.ii.metu.edu.tr/content/treebank|METU-Sabanci Turkish Treebank]] (ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi) | [[http://ii.metu.edu.tr/corpus|METU-Sabanci Turkish Treebank]] (ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi) |
| |
==== Versions ==== | ==== Versions ==== |
* Principal publications | * Principal publications |
* Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür: Building a Turkish Treebank. In: Anne Abeillé (ed.): Building and Exploiting Syntactically Annotated Corpora. Kluwer Academic Publishers, 2003. | * Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, Gökhan Tür: Building a Turkish Treebank. In: Anne Abeillé (ed.): Building and Exploiting Syntactically Annotated Corpora. Kluwer Academic Publishers, 2003. |
* Nart B. Atalay, Kemal Oflazer, Bilge Say: The Annotation Process in the Turkish Treebank. In: Proceedings of the EACL Workshop on Linguistically Interpreted Corpora – LINC. Budapest, Hungary, 2003. | * Nart B. Atalay, Kemal Oflazer, Bilge Say: [[http://aclweb.org/anthology-new/W/W03/W03-2405.pdf|The Annotation Process in the Turkish Treebank]]. In: Proceedings of the EACL Workshop on Linguistically Interpreted Corpora – LINC. Budapest, Hungary, 2003. |
* Documentation | * Documentation |
* Three PDF files are attached to the CoNLL version in the ''doc'' folder: ttbankkl.pdf (the chapter from Anne Abeillé, contains list of morphological tags), turkishtreebank.pdf (the paper from the EACL workshop) and user_guide.pdf (annotation manual for dependencies, in Turkish). | * Three PDF files are attached to the CoNLL version in the ''doc'' folder: {{:user:zeman:treebanks:ttbankkl.pdf|ttbankkl.pdf}} (the chapter from Anne Abeillé, contains list of morphological tags), turkishtreebank.pdf (the paper from the EACL workshop) and user_guide.pdf (annotation manual for dependencies, in Turkish). |
| |
==== Domain ==== | ==== Domain ==== |
==== Inside ==== | ==== Inside ==== |
| |
The original Szeged Treebank is a phrase-based treebank and it is distributed in XML-based, TEI-compliant format. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified), distributed in the CoNLL 2006/2007 format. | The original METU-Sabanci Treebank is distributed in XML-based, TEI-compliant format. The CoNLL 2007 version is distributed in the [[:format-conll|CoNLL 2006/2007 format]]. |
| |
Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. The tagset used in SzTB seems to be the same as or similar to [[http://nl.ijs.si/ME/V4/msd/html/msd-hu.html|Multext-East]]. In the CoNLL version, tags were decomposed into CPOS column, POS column and the list of feature-value pairs in the FEAT column. | Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. |
| |
Personal names have been collapsed into one token, using underscore as the joining character (e.g. Torgyán_József). | There are special derivational nodes. Derived words have been split into several tokens (see also the sample below). The typical pattern (maybe the only pattern but I have not confirmed that) is as follows: There are two nodes connected with a dependency link. The head node corresponds to the surface word. It has the word form, part of speech and morphological features but it has no lemma (lemma is '_'). The surface word is the result of a derivational morphological process. It has been derived from another word, often of a different part of speech (e.g. a noun was derived from a verb). The dependent node represents the source of the derivation. It has no word form but it has a lemma. Its part-of-speech tag describes the source word and thus it can differ from the part-of-speech tag of the head node. The FEAT column says just 'Pos'. The dependent node need not be a leaf. Other nodes may depend on it, instead of depending on the parent node. If we have a noun derived from a verb, i.e. we have a verbal node depending on the nominal node, and there is a dependent filling a verbal valency slot of the derived noun, we can expect the dependent to be attached to the verbal node. |
| |
| Occasionally there are derivational chains longer than two nodes. An example is sentence No. 82 of the test data: |
| lemma azal / Verb -> _ / Verb / Caus -> _ / Verb / Pass|Pos -> azaltılması / Noun / NInf / A3sg|P3sg|Nom |
| According to Google Translate, //azal// means “to decrease” and //azaltılması// means “reduced”. TRmorph gives the following four analyses: |
| <code> |
| analyze> azaltılması |
| azal<v><caus><pass><vn_ma><p3s> |
| azal<v><caus><pass><vn_ma><p3s><3s> |
| azal<v><caus><pass><vn_ma><p3s><3p> |
| azal<v><caus><pass><cv_ma><p3s> |
| </code> |
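The derivational chain described above can be recovered from the CoNLL columns with a few lines of Python. This is only a sketch under stated assumptions: it treats any dependent whose FORM is '_' as a derivation source (the pattern described in the text, not confirmed to be the only one), and the rows in ''ROWS'' are hypothetical, built by hand from the chain shown for //azaltılması//, with placeholder values in the DEPREL column. They are not copied from the treebank.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int


def parse_conll(block: str) -> List[Token]:
    """Read one sentence. Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL ..."""
    tokens = []
    for line in block.strip().splitlines():
        c = line.split("\t")
        tokens.append(Token(int(c[0]), c[1], c[2], c[3], c[4], c[5], int(c[6])))
    return tokens


def derivation_chain(tokens: List[Token], surface_id: int) -> List[Token]:
    """Return the nodes from the derivation source up to the surface word.

    Assumption: a derivation source is a dependent whose FORM is '_'.
    """
    by_id = {t.id: t for t in tokens}
    chain = [by_id[surface_id]]
    while True:
        # Look for a form-less dependent of the current first node in the chain.
        sources = [t for t in tokens if t.head == chain[0].id and t.form == "_"]
        if not sources:
            return chain
        chain.insert(0, sources[0])


# Hypothetical rows (DEPREL is a placeholder '_'), modelled on the chain above.
ROWS = [
    ["1", "_", "azal", "Verb", "Verb", "_", "2", "_"],
    ["2", "_", "_", "Verb", "Verb", "Caus", "3", "_"],
    ["3", "_", "_", "Verb", "Verb", "Pass|Pos", "4", "_"],
    ["4", "azaltılması", "_", "Noun", "NInf", "A3sg|P3sg|Nom", "0", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

chain = derivation_chain(parse_conll(SAMPLE), surface_id=4)
for node in chain:
    print(node.id, node.form, node.lemma, node.cpos, node.feats)
```

Walking from the surface word down to the source makes it easy to reattach the lemma of the source node to the surface form, which may be useful when a parser or tagger expects one token per word.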
| |
==== Sample ==== | ==== Sample ==== |
==== Parsing ==== | ==== Parsing ==== |
| |
SzTB is a mildly nonprojective treebank. 4032 of the 139,143 tokens of the CoNLL 2007 version are attached nonprojectively (2.9%). | The nonprojectivity rate in METU-Sabanci is relatively high. 3716 of the 69695 tokens of the CoNLL 2007 version are attached nonprojectively (5.33%). |
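A count of this kind could be obtained roughly as follows. The sketch assumes the common definition (a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head) and a well-formed tree without cycles. The exact counting convention behind the figures above is not stated here, so the function ''nonprojective_tokens'' and the toy sentence are illustrative only.

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """heads[i] is the head of token i+1; 0 marks the artificial root.

    Returns the ids of tokens whose incoming arc is nonprojective.
    Assumes the heads form a tree (no cycles).
    """

    def dominated_by(node: int, ancestor: int) -> bool:
        # Climb from node towards the root and look for the ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # arcs from the artificial root are not counted here
        lo, hi = sorted((dep, head))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Toy sentence: token 2 depends on token 4, but token 3 (the root) lies between them.
heads = [3, 4, 0, 3]
crossing = nonprojective_tokens(heads)
rate = 100.0 * len(crossing) / len(heads)
print(crossing, f"{rate:.2f}%")
```

Applied to every sentence of a CoNLL file (HEAD is the seventh column), summing the counts and dividing by the total number of tokens gives a percentage comparable to the one reported above.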
| |
The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Hungarian: | The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Turkish: |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ |
| Malt (Nilsson et al.) | 80.27 | 83.55 | | | Titov et al. | 79.81 | 86.22 | |
| Sagae | 79.53 | 83.51 | | | Malt (Nilsson et al.) | 79.79 | 85.77 | |
| Nakagawa | 76.74 | 82.49 | | | Nakagawa | 78.22 | 85.77 | |
| Titov et al. | 77.94 | 82.18 | | | Keith Hall | 77.42 | 85.18 | |
| | Malt (Johan Hall) | 79.24 | 85.04 | |
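For readers unfamiliar with the two metrics in the table: UAS is the percentage of tokens that received the correct head, and LAS the percentage that received both the correct head and the correct dependency label. A minimal sketch follows, counting every token including punctuation, as in the 2007 evaluation. The function name and the labels in the example are hypothetical, and this is not the official evaluation script.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head id, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over all tokens, punctuation included."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Hypothetical four-token sentence; the labels are placeholders.
gold = [(2, "SUBJ"), (0, "ROOT"), (2, "OBJ"), (2, "PUNC")]
pred = [(2, "SUBJ"), (0, "ROOT"), (2, "MOD"), (3, "PUNC")]
las, uas = attachment_scores(gold, pred)
print(f"LAS {las:.2f}  UAS {uas:.2f}")
```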
| |
The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. | The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. |
| |