===== Tamil (ta) =====
[[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/|Tamil Dependency Treebank]] (TamilTB)
==== Versions ====
* TamilTB 0.1
==== Obtaining and License ====
TamilTB 0.1 is [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/download.html|distributed]] under the [[http://creativecommons.org/licenses/by-nc-sa/3.0/|Creative Commons by-nc-sa license]]. The license in short:
* non-commercial usage
* redistribution permitted
* attribution to Charles University in Prague, Institute of Formal and Applied Linguistics required
* cite one of the principal publications (see below) in published work using the treebank
TamilTB was created by members of the [[http://ufal.mff.cuni.cz/|Institute of Formal and Applied Linguistics]] (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia.
==== References ====
* Website
  * http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/
* Data
  * //no separate citation//
* Principal publications
  * Loganathan Ramasamy, Zdeněk Žabokrtský: [[http://www.springerlink.com/content/w18v7621070h51g1/|Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches]]. In: //Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011) – Volume Part I//, pages 82-95, Tokyo, Japan, 2011, published by Springer Berlin / Heidelberg, ISBN 978-3-642-19399-6.
  * Loganathan Ramasamy, Zdeněk Žabokrtský: Prague Dependency Style Treebank for Tamil. In: //Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)//, İstanbul, Turkey, 2012.
* Documentation
  * [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/morph_annotation.html|Morphological annotation]]
  * [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/dependency_annotation.html|Syntactic annotation]]
  * Loganathan Ramasamy, Zdeněk Žabokrtský: [[http://ufal.mff.cuni.cz/~ramasamy/papers/2011-TamilTB-TR.pdf|Tamil Dependency Treebank (TamilTB) – 0.1 Annotation Manual]]. Technical Report TR-2011-42, ÚFAL MFF UK, Praha, Czechia, 2011.
==== Domain ====
News (http://www.dinamani.com/).
==== Size ====
Version 0.1 contains 9581 tokens in 600 sentences, yielding 15.97 tokens per sentence on average. We defined the following data split: 7592 tokens / 480 sentences training, 1989 tokens / 120 sentences test.
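The figures above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the constants are copied from the paragraph above, and the names (''tokens_per_sentence'', ''train_share'') are made up for the illustration.

```python
# Figures copied from the Size paragraph; nothing here is read from the treebank itself.
TOTAL_TOKENS, TOTAL_SENTENCES = 9581, 600
TRAIN_TOKENS, TRAIN_SENTENCES = 7592, 480
TEST_TOKENS, TEST_SENTENCES = 1989, 120


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length in tokens."""
    return tokens / sentences


# The training and test parts must add up to the whole treebank.
assert TRAIN_TOKENS + TEST_TOKENS == TOTAL_TOKENS
assert TRAIN_SENTENCES + TEST_SENTENCES == TOTAL_SENTENCES

average = tokens_per_sentence(TOTAL_TOKENS, TOTAL_SENTENCES)
train_share = TRAIN_SENTENCES / TOTAL_SENTENCES

print(f"{average:.2f} tokens per sentence")           # 15.97 tokens per sentence
print(f"{train_share:.0%} of sentences in training")  # 80% of sentences in training
```

In other words, the split assigns 80% of the sentences to training and 20% to test.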
==== Inside ====
Tamil script has been [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/introduction.html#Text_preprocessing|romanized]] (the romanization is case-sensitive).
The treebank is distributed in three formats: TMT ([[http://ufal.mff.cuni.cz/tectomt/|TectoMT]] XML), [[:formát CoNLL|CoNLL]] and TnT-tagger style (only POS-tagged layer).
Morphological annotation is manual and includes lemmas, parts of speech and morphosyntactic features. Syntactic annotation follows the style of the [[cs|Prague Dependency Treebank]].
==== Sample ====
The first sentence of the CoNLL version of training data:
| 1 | cennai | cennai | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 2 | AAdjn | _ | _ |
| 2 | arukE | arukE | P | PP------- | _ | 18 | AuxP | _ | _ |
| 3 | sri | sri | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 4 | Atr | _ | _ |
| 4 | perumpuTUril | perumpuTUr | N | NEL-3SN-- | Cas=L|Per=3|Num=S|Gen=N | 18 | AAdjn | _ | _ |
| 5 | kirIn | kirIn | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 6 | Atr | _ | _ |
| 6 | pIltu | pIltu | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 11 | Atr | _ | _ |
| 7 | ( | ( | Z | Z:------- | _ | 6 | AuxG | _ | _ |
| 8 | wavIna | wavInam | J | JJ------- | _ | 6 | Atr | _ | _ |
| 9 | ) | ) | Z | Z:------- | _ | 6 | AuxG | _ | _ |
| 10 | vimAna | vimAnam | N | NO--3SN-- | Per=3|Num=S|Gen=N | 11 | Atr | _ | _ |
| 11 | wilaiyaTTukkukk | wilaiyam | N | NND-3SN-- | Cas=D|Per=3|Num=S|Gen=N | 12 | Atr | _ | _ |
| 12 | Ana | Aku | T | Tg------- | _ | 13 | Atr | _ | _ |
| 13 | wilam | wilam | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 18 | Sb | _ | _ |
| 14 | yArukkum | yAr | R | RBD-3SA-- | Cas=D|Per=3|Num=S|Gen=A | 15 | Atr | _ | _ |
| 15 | pATippu | pATippu | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 16 | Comp | _ | _ |
| 16 | illATa | il | P | PP------- | _ | 17 | AuxP | _ | _ |
| 17 | vakaiyil | vakai | N | NNL-3SN-- | Cas=L|Per=3|Num=S|Gen=N | 18 | AAdjn | _ | _ |
| 18 | etukkap | etu | V | Vu-T---AA | Ten=T|Voi=A|Neg=A | 20 | Obj | _ | _ |
| 19 | patum | patu | V | VR-F3SNPA | Ten=F|Per=3|Num=S|Gen=N|Voi=P|Neg=A | 18 | AuxV | _ | _ |
| 20 | enRu | en | T | Tt-T----A | Ten=T|Neg=A | 23 | AuxC | _ | _ |
| 21 | muTalvar | muTalvar | N | NNN-3SH-- | Cas=N|Per=3|Num=S|Gen=H | 22 | Atr | _ | _ |
| 22 | karuNAwiTi | karuNAwiTi | N | NEN-3SH-- | Cas=N|Per=3|Num=S|Gen=H | 23 | Sb | _ | _ |
| 23 | uRuTiyaLiTT | uRuTiyaLi | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 0 | Pred | _ | _ |
| 24 | uLLAr | uL | V | VR-T3SHAA | Ten=T|Per=3|Num=S|Gen=H|Voi=A|Neg=A | 23 | AuxV | _ | _ |
| 25 | . | . | Z | Z#------- | _ | 0 | AuxK | _ | _ |
The first sentence of the CoNLL version of test data:
| 1 | pikAr | pikAr | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 2 | Atr | _ | _ |
| 2 | iliruwTu | iliruwTu | P | PP------- | _ | 4 | AuxP | _ | _ |
| 3 | ErALamAna | ErALamAna | J | JJ------- | _ | 4 | Atr | _ | _ |
| 4 | iLainjarkaL | iLainjar | N | NNN-3PA-- | Cas=N|Per=3|Num=P|Gen=A | 9 | Sb | _ | _ |
| 5 | vElai | vElai | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 6 | Obj | _ | _ |
| 6 | TEti | TEtu | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 9 | AAdjn | _ | _ |
| 7 | veLi | veLi | J | JJ------- | _ | 8 | Atr | _ | _ |
| 8 | mAwilangkaLukku | mAwilam | N | NND-3PN-- | Cas=D|Per=3|Num=P|Gen=N | 9 | AAdjn | _ | _ |
| 9 | kutipeyarwTu | kutipeyar | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 0 | Pred | _ | _ |
| 10 | varukinRanar | varu | V | VR-P3PHAA | Ten=P|Per=3|Num=P|Gen=H|Voi=A|Neg=A | 9 | AuxV | _ | _ |
| 11 | . | . | Z | Z#------- | _ | 0 | AuxK | _ | _ |
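The rows above follow the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The following is a minimal reading sketch, assuming the distributed file is tab-separated with a blank line between sentences, as is usual for CoNLL-X; the names ''Token'', ''parse_feats'', ''parse_token'' and ''read_sentences'' are hypothetical helpers, not part of any TamilTB tool.

```python
from typing import Dict, Iterable, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Cas=N|Per=3' into {'Cas': 'N', 'Per': '3'}; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_token(line: str) -> Token:
    """Parse one tab-separated CoNLL-X line (assumed layout, ten columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    # PHEAD and PDEPREL (columns 9 and 10) are '_' in the sample and are ignored.
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group token lines into sentences; a blank line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        if line.strip():
            current.append(parse_token(line))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Token 4 of the first training sentence, written with tabs.
sample = "4\tperumpuTUril\tperumpuTUr\tN\tNEL-3SN--\tCas=L|Per=3|Num=S|Gen=N\t18\tAAdjn\t_\t_"
token = parse_token(sample)
print(token.lemma, token.feats["Cas"], token.head)  # perumpuTUr L 18
```

Splitting FEATS into a dictionary keeps the ''|'' inside that column from being confused with a column separator.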
==== Parsing ====
Nonprojectivities in TamilTB are very rare. Only 15 of the 9581 tokens are attached nonprojectively (0.16%).
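One common way to count nonprojective attachments is sketched below. It assumes the usual definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. Whether the figure above was computed with exactly this definition is an assumption, and ''nonprojective_tokens'' is a hypothetical helper name.

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return 1-based ids of nonprojectively attached tokens.

    heads[i - 1] is the head of token i; 0 stands for the artificial root.
    """

    def is_descendant(node: int, ancestor: int) -> bool:
        # Climb the head chain; the bounded loop guards against cyclic input.
        for _ in range(len(heads)):
            node = heads[node - 1]
            if node == ancestor:
                return True
            if node == 0:
                return False
        return False

    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        # Every token strictly between the dependent and its head must hang under the head.
        if any(not is_descendant(between, head) for between in range(low + 1, high)):
            result.append(dependent)
    return result


# HEAD column of the first test sentence shown in the Sample section.
test_heads = [2, 4, 4, 9, 6, 9, 8, 9, 0, 9, 0]
print(nonprojective_tokens(test_heads))    # []

# Toy tree (not from the treebank): token 1 attaches to 3 across token 2, the root child.
print(nonprojective_tokens([3, 0, 2, 2]))  # [1]

# The reported rate: 15 nonprojective tokens out of 9581.
print(f"{15 / 9581:.2%}")                  # 0.16%
```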
Initial parsing results were published by [[http://ufal.mff.cuni.cz/~ramasamy/papers/2011-pres-CICLing.pdf|(Ramasamy and Žabokrtský, 2011)]]. They used a smaller data set and a different training/test split than the one defined here (2008 tokens for training, 953 tokens for testing).
^ Parser (Authors) ^ LAS ^ UAS ^
| Malt (Nivre et al.) | 65.69 | 75.03 |
| MST (McDonald et al.) | 65.69 | 74.92 |
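For reference, UAS is the percentage of tokens that receive the correct head, and LAS the percentage that receive both the correct head and the correct dependency label. The sketch below computes both; the function name ''attachment_scores'' is hypothetical, the predicted arcs are invented toy data, and whether the published scores include punctuation is not stated here.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head id, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over aligned token lists."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    total = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct_arcs / total, 100.0 * correct_heads / total


# Gold arcs of tokens 1-4 of the first test sentence; the prediction is made up.
gold = [(2, "Atr"), (4, "AuxP"), (4, "Atr"), (9, "Sb")]
predicted = [(2, "Atr"), (4, "AuxP"), (9, "Atr"), (9, "Obj")]

las, uas = attachment_scores(gold, predicted)
print(f"LAS {las:.2f}  UAS {uas:.2f}")  # LAS 50.00  UAS 75.00
```

LAS can never exceed UAS, because every arc counted for LAS also has the correct head.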