===== Tamil (ta) =====

[[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/|Tamil Dependency Treebank]] (TamilTB)

==== Versions ====

  * TamilTB 0.1

==== Obtaining and License ====

TamilTB 0.1 is [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/download.html|distributed]] under the [[http://creativecommons.org/licenses/by-nc-sa/3.0/|Creative Commons by-nc-sa license]]. The license in short:

  * non-commercial usage
  * redistribution permitted
  * attribution to Charles University in Prague, Institute of Formal and Applied Linguistics required
  * cite one of the principal publications (see below) in published work using the treebank

TamilTB was created by members of the [[http://ufal.mff.cuni.cz/|Institute of Formal and Applied Linguistics]] (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia.

==== References ====

  * Website
    * http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/
  * Data
    * //no separate citation//
  * Principal publications
    * Loganathan Ramasamy, Zdeněk Žabokrtský: [[http://www.springerlink.com/content/w18v7621070h51g1/|Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches]]. In: //Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011) – Volume Part I//, pages 82-95, Tokyo, Japan, 2011, published by Springer Berlin / Heidelberg, ISBN 978-3-642-19399-6.
    * Loganathan Ramasamy, Zdeněk Žabokrtský: Prague Dependency Style Treebank for Tamil. In: //Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)//, İstanbul, Turkey, 2012
  * Documentation
    * [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/morph_annotation.html|Morphological annotation]]
    * [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/dependency_annotation.html|Syntactic annotation]]
    * Loganathan Ramasamy, Zdeněk Žabokrtský: [[http://ufal.mff.cuni.cz/~ramasamy/papers/2011-TamilTB-TR.pdf|Tamil Dependency Treebank (TamilTB) – 0.1 Annotation Manual]]. Technical Report TR-2011-42, ÚFAL MFF UK, Praha, Czechia, 2011

==== Domain ====

News (http://www.dinamani.com/).

==== Size ====

Version 0.1 contains 9581 tokens in 600 sentences, yielding 15.97 tokens per sentence on average. We defined the following data split: 7592 tokens / 480 sentences training, 1989 tokens / 120 sentences test.

==== Inside ====

Tamil script has been [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/introduction.html#Text_preprocessing|romanized]] (the romanization is case-sensitive). The treebank is distributed in three formats: TMT ([[http://ufal.mff.cuni.cz/tectomt/|TectoMT]] XML), [[:formát CoNLL|CoNLL]] and TnT-tagger style (only POS-tagged layer). Morphological annotation is manual and it includes lemmas, parts of speech and morphosyntactic features. Syntactic annotation follows the style of the [[cs|Prague Dependency Treebank]].
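
A minimal sketch of reading the CoNLL version in Python. It assumes the usual tab-separated, ten-column CoNLL-X layout with a blank line between sentences, which matches the columns shown in the Sample section; the distributed files may differ in detail. The names ''Token'', ''parse_feats'', ''parse_token'' and ''read_sentences'' are illustrative and are not part of the TamilTB distribution.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    idx: int               # 1-based position in the sentence
    form: str              # romanized word form (case-sensitive)
    lemma: str
    cpos: str              # coarse part of speech, e.g. "N"
    pos: str               # positional tag, e.g. "NEN-3SN--"
    feats: Dict[str, str]  # morphosyntactic features
    head: int              # index of the governing token, 0 = artificial root
    deprel: str            # dependency label, e.g. "Atr"


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Cas=N|Per=3' into {'Cas': 'N', 'Per': '3'}; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_token(line: str) -> Token:
    """Parse one token line (assumed to be tab-separated)."""
    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(cols)}")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens; blank lines separate sentences."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(parse_token(line))
    if sentence:
        yield sentence


# The first two tokens of the test-data sample, written with tabs.
sample = [
    "1\tpikAr\tpikAr\tN\tNEN-3SN--\tCas=N|Per=3|Num=S|Gen=N\t2\tAtr\t_\t_",
    "2\tiliruwTu\tiliruwTu\tP\tPP-------\t_\t4\tAuxP\t_\t_",
    "",
]
sentences = list(read_sentences(sample))
print(sentences[0][0].feats)  # {'Cas': 'N', 'Per': '3', 'Num': 'S', 'Gen': 'N'}
```

Because the romanization is case-sensitive, forms and lemmas are kept exactly as read, with no case folding.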
==== Sample ====

The first sentence of the CoNLL version of training data:

| 1 | cennai | cennai | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 2 | AAdjn | _ | _ |
| 2 | arukE | arukE | P | PP------- | _ | 18 | AuxP | _ | _ |
| 3 | sri | sri | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 4 | Atr | _ | _ |
| 4 | perumpuTUril | perumpuTUr | N | NEL-3SN-- | Cas=L|Per=3|Num=S|Gen=N | 18 | AAdjn | _ | _ |
| 5 | kirIn | kirIn | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 6 | Atr | _ | _ |
| 6 | pIltu | pIltu | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 11 | Atr | _ | _ |
| 7 | ( | ( | Z | Z:------- | _ | 6 | AuxG | _ | _ |
| 8 | wavIna | wavInam | J | JJ------- | _ | 6 | Atr | _ | _ |
| 9 | ) | ) | Z | Z:------- | _ | 6 | AuxG | _ | _ |
| 10 | vimAna | vimAnam | N | NO--3SN-- | Per=3|Num=S|Gen=N | 11 | Atr | _ | _ |
| 11 | wilaiyaTTukkukk | wilaiyam | N | NND-3SN-- | Cas=D|Per=3|Num=S|Gen=N | 12 | Atr | _ | _ |
| 12 | Ana | Aku | T | Tg------- | _ | 13 | Atr | _ | _ |
| 13 | wilam | wilam | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 18 | Sb | _ | _ |
| 14 | yArukkum | yAr | R | RBD-3SA-- | Cas=D|Per=3|Num=S|Gen=A | 15 | Atr | _ | _ |
| 15 | pATippu | pATippu | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 16 | Comp | _ | _ |
| 16 | illATa | il | P | PP------- | _ | 17 | AuxP | _ | _ |
| 17 | vakaiyil | vakai | N | NNL-3SN-- | Cas=L|Per=3|Num=S|Gen=N | 18 | AAdjn | _ | _ |
| 18 | etukkap | etu | V | Vu-T---AA | Ten=T|Voi=A|Neg=A | 20 | Obj | _ | _ |
| 19 | patum | patu | V | VR-F3SNPA | Ten=F|Per=3|Num=S|Gen=N|Voi=P|Neg=A | 18 | AuxV | _ | _ |
| 20 | enRu | en | T | Tt-T----A | Ten=T|Neg=A | 23 | AuxC | _ | _ |
| 21 | muTalvar | muTalvar | N | NNN-3SH-- | Cas=N|Per=3|Num=S|Gen=H | 22 | Atr | _ | _ |
| 22 | karuNAwiTi | karuNAwiTi | N | NEN-3SH-- | Cas=N|Per=3|Num=S|Gen=H | 23 | Sb | _ | _ |
| 23 | uRuTiyaLiTT | uRuTiyaLi | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 0 | Pred | _ | _ |
| 24 | uLLAr | uL | V | VR-T3SHAA | Ten=T|Per=3|Num=S|Gen=H|Voi=A|Neg=A | 23 | AuxV | _ | _ |
| 25 | . | . | Z | Z#------- | _ | 0 | AuxK | _ | _ |

The first sentence of the CoNLL version of test data:

| 1 | pikAr | pikAr | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 2 | Atr | _ | _ |
| 2 | iliruwTu | iliruwTu | P | PP------- | _ | 4 | AuxP | _ | _ |
| 3 | ErALamAna | ErALamAna | J | JJ------- | _ | 4 | Atr | _ | _ |
| 4 | iLainjarkaL | iLainjar | N | NNN-3PA-- | Cas=N|Per=3|Num=P|Gen=A | 9 | Sb | _ | _ |
| 5 | vElai | vElai | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 6 | Obj | _ | _ |
| 6 | TEti | TEtu | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 9 | AAdjn | _ | _ |
| 7 | veLi | veLi | J | JJ------- | _ | 8 | Atr | _ | _ |
| 8 | mAwilangkaLukku | mAwilam | N | NND-3PN-- | Cas=D|Per=3|Num=P|Gen=N | 9 | AAdjn | _ | _ |
| 9 | kutipeyarwTu | kutipeyar | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 0 | Pred | _ | _ |
| 10 | varukinRanar | varu | V | VR-P3PHAA | Ten=P|Per=3|Num=P|Gen=H|Voi=A|Neg=A | 9 | AuxV | _ | _ |
| 11 | . | . | Z | Z#------- | _ | 0 | AuxK | _ | _ |

==== Parsing ====

Nonprojectivities in TamilTB are very rare. Only 15 of the 9581 tokens are attached nonprojectively (0.16%).

Initial parsing results were published by [[http://ufal.mff.cuni.cz/~ramasamy/papers/2011-pres-CICLing.pdf|(Ramasamy and Žabokrtský, 2011)]]. They used less data and a different training/test split than the one defined here (2008 tokens training, 953 tokens test).

^ Parser (Authors) ^ LAS ^ UAS ^
| Malt (Nivre et al.) | 65.69 | 75.03 |
| MST (McDonald et al.) | 65.69 | 74.92 |
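
The nonprojectivity count quoted in this section can be checked with a sketch along the following lines. It assumes the common definition: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The page does not say which definition was used for the figure of 15 tokens, so results may differ slightly. ''nonprojective_tokens'' is a hypothetical helper, and the head list is copied from the HEAD column of the first test sentence in the Sample section.

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based indices of nonprojectively attached tokens.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    n = len(heads)

    def dominated_by(node: int, ancestor: int) -> bool:
        # The artificial root dominates every token.
        if ancestor == 0:
            return True
        steps = 0
        # Climb towards the root; the step limit guards against cycles.
        while node != 0 and steps < n:
            node = heads[node - 1]
            if node == ancestor:
                return True
            steps += 1
        return False

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(dep, head), max(dep, head)
        # Every token strictly between dep and head must hang under head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first test sentence (11 tokens).
test_heads = [2, 4, 4, 9, 6, 9, 8, 9, 0, 9, 0]
print(nonprojective_tokens(test_heads))    # []

# A constructed example: token 2 depends on token 4 across token 3,
# which is attached to the root rather than to token 4.
print(nonprojective_tokens([3, 4, 0, 3]))  # [2]
```

Summing the lengths of the returned lists over all sentences and dividing by the total token count gives a percentage comparable to the one reported above.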