
Institute of Formal and Applied Linguistics Wiki




Tamil (ta)

Tamil Dependency Treebank (TamilTB)

Versions

Obtaining and License

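TamilTB 0.1 is distributed under the Creative Commons BY-NC-SA license. In short, the license requires attribution (BY), prohibits commercial use (NC), and requires derived works to be shared under the same license (SA).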

TamilTB was created by members of the Institute of Formal and Applied Linguistics (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia.

References

Domain

News (http://www.dinamani.com/).

Size

Version 0.1 contains 9581 tokens in 600 sentences, yielding 15.97 tokens per sentence on average. We defined the following data split: 7592 tokens / 480 sentences training, 1989 tokens / 120 sentences test.
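The figures above can be cross-checked with a few lines of arithmetic. The following is only a sketch: the constant and function names are made up for illustration, and the numbers are copied from the paragraph above rather than computed from the data files.

```python
# Figures quoted for TamilTB 0.1 in the text (not read from the treebank files).
TOTAL_TOKENS = 9581
TOTAL_SENTENCES = 600
# Hypothetical layout: part name -> (tokens, sentences)
SPLIT = {"train": (7592, 480), "test": (1989, 120)}


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length in tokens."""
    return tokens / sentences


def split_is_consistent() -> bool:
    """True if the training and test parts add up to the whole treebank."""
    tokens = sum(t for t, _ in SPLIT.values())
    sentences = sum(s for _, s in SPLIT.values())
    return tokens == TOTAL_TOKENS and sentences == TOTAL_SENTENCES


if __name__ == "__main__":
    print(f"all: {tokens_per_sentence(TOTAL_TOKENS, TOTAL_SENTENCES):.2f}")
    for part, (tokens, sentences) in SPLIT.items():
        print(f"{part}: {tokens_per_sentence(tokens, sentences):.2f}")
    print("split consistent:", split_is_consistent())
```

The split is an 80/20 division by sentences (480 of 600), and the token counts of the two parts sum to the stated total.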

Inside

Tamil script has been romanized (the romanization is case-sensitive).

The treebank is distributed in three formats: TMT (TectoMT XML), CoNLL and TnT-tagger style (only POS-tagged layer).

Morphological annotation is manual and it includes lemmas, parts of speech and morphosyntactic features. Syntactic annotation follows the style of the Prague Dependency Treebank.

Sample

The first sentence of the CoNLL version of training data:

1 cennai cennai N NEN-3SN-- Cas=N|Per=3|Num=S|Gen=N 2 AAdjn _ _
2 arukE arukE P PP------- _ 18 AuxP _ _
3 sri sri N NEN-3SN-- Cas=N|Per=3|Num=S|Gen=N 4 Atr _ _
4 perumpuTUril perumpuTUr N NEL-3SN-- Cas=L|Per=3|Num=S|Gen=N 18 AAdjn _ _
5 kirIn kirIn N NEN-3SN-- Cas=N|Per=3|Num=S|Gen=N 6 Atr _ _
6 pIltu pIltu N NEN-3SN-- Cas=N|Per=3|Num=S|Gen=N 11 Atr _ _
7 ( ( Z Z:------- _ 6 AuxG _ _
8 wavIna wavInam J JJ------- _ 6 Atr _ _
9 ) ) Z Z:------- _ 6 AuxG _ _
10 vimAna vimAnam N NO--3SN-- Per=3|Num=S|Gen=N 11 Atr _ _
11 wilaiyaTTukkukk wilaiyam N NND-3SN-- Cas=D|Per=3|Num=S|Gen=N 12 Atr _ _
12 Ana Aku T Tg------- _ 13 Atr _ _
13 wilam wilam N NNN-3SN-- Cas=N|Per=3|Num=S|Gen=N 18 Sb _ _
14 yArukkum yAr R RBD-3SA-- Cas=D|Per=3|Num=S|Gen=A 15 Atr _ _
15 pATippu pATippu N NNN-3SN-- Cas=N|Per=3|Num=S|Gen=N 16 Comp _ _
16 illATa il P PP------- _ 17 AuxP _ _
17 vakaiyil vakai N NNL-3SN-- Cas=L|Per=3|Num=S|Gen=N 18 AAdjn _ _
18 etukkap etu V Vu-T---AA Ten=T|Voi=A|Neg=A 20 Obj _ _
19 patum patu V VR-F3SNPA Ten=F|Per=3|Num=S|Gen=N|Voi=P|Neg=A 18 AuxV _ _
20 enRu en T Tt-T----A Ten=T|Neg=A 23 AuxC _ _
21 muTalvar muTalvar N NNN-3SH-- Cas=N|Per=3|Num=S|Gen=H 22 Atr _ _
22 karuNAwiTi karuNAwiTi N NEN-3SH-- Cas=N|Per=3|Num=S|Gen=H 23 Sb _ _
23 uRuTiyaLiTT uRuTiyaLi V Vt-T---AA Ten=T|Voi=A|Neg=A 0 Pred _ _
24 uLLAr uL V VR-T3SHAA Ten=T|Per=3|Num=S|Gen=H|Voi=A|Neg=A 23 AuxV _ _
25 . . Z Z#------- _ 0 AuxK _ _
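Each line of the sample has ten whitespace-separated fields, which matches the CoNLL-X column layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). A minimal reading sketch follows; the `Token` type and function names are illustrative, and whether the distributed files separate columns with tabs or spaces is an assumption that `split()` sidesteps.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    """One token line; the two unused trailing columns are dropped."""
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Cas=N|Per=3' into {'Cas': 'N', 'Per': '3'}; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_line(line: str) -> Token:
    """Parse a single token line of the sample."""
    cols = line.split()
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


if __name__ == "__main__":
    line = "4 perumpuTUril perumpuTUr N NEL-3SN-- Cas=L|Per=3|Num=S|Gen=N 18 AAdjn _ _"
    print(parse_line(line))
```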

The first sentence of the CoNLL version of test data:

1 pikAr pikAr N NEN-3SN-- Cas=N|Per=3|Num=S|Gen=N 2 Atr _ _
2 iliruwTu iliruwTu P PP------- _ 4 AuxP _ _
3 ErALamAna ErALamAna J JJ------- _ 4 Atr _ _
4 iLainjarkaL iLainjar N NNN-3PA-- Cas=N|Per=3|Num=P|Gen=A 9 Sb _ _
5 vElai vElai N NNN-3SN-- Cas=N|Per=3|Num=S|Gen=N 6 Obj _ _
6 TEti TEtu V Vt-T---AA Ten=T|Voi=A|Neg=A 9 AAdjn _ _
7 veLi veLi J JJ------- _ 8 Atr _ _
8 mAwilangkaLukku mAwilam N NND-3PN-- Cas=D|Per=3|Num=P|Gen=N 9 AAdjn _ _
9 kutipeyarwTu kutipeyar V Vt-T---AA Ten=T|Voi=A|Neg=A 0 Pred _ _
10 varukinRanar varu V VR-P3PHAA Ten=P|Per=3|Num=P|Gen=H|Voi=A|Neg=A 9 AuxV _ _
11 . . Z Z#------- _ 0 AuxK _ _
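In both samples the nine-character tag in the fifth column appears to encode the same information as the feature string in the sixth column, one character per position. The position-to-feature mapping below is inferred from these two sentences only and is not taken from the official tagset documentation, so treat it as an assumption.

```python
from typing import Dict, Optional, Tuple

# Inferred from the samples: positions 1-2 hold the part of speech and its
# subtype, positions 3-9 hold the features. Assumption, not the official spec.
POSITIONS: Tuple[Optional[str], ...] = (
    None, None, "Cas", "Ten", "Per", "Num", "Gen", "Voi", "Neg",
)


def decode_tag(tag: str) -> Dict[str, str]:
    """Map a 9-character positional tag to its non-empty features."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters: {tag!r}")
    return {name: ch for name, ch in zip(POSITIONS, tag) if name and ch != "-"}


def feats_string(tag: str) -> str:
    """Render the decoded features in the style of the sixth column."""
    feats = decode_tag(tag)
    return "|".join(f"{k}={v}" for k, v in feats.items()) or "_"


if __name__ == "__main__":
    for tag in ("NND-3PN--", "VR-P3PHAA", "Z#-------"):
        print(tag, feats_string(tag))
```

Under this reading, every tag in the two sample sentences reproduces its feature string exactly.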

Parsing

Nonprojectivities in TamilTB are very rare. Only 15 of the 9581 tokens are attached nonprojectively (0.16%).
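One common way to count such tokens is sketched below. It treats a token as nonprojectively attached when some token lying strictly between it and its head is not a descendant of that head. This is an assumed definition; the exact criterion behind the count above is not stated here. The function names are illustrative.

```python
from typing import List


def is_descendant(node: int, ancestor: int, heads: List[int]) -> bool:
    """True if `ancestor` is on the path from `node` to the root (0).

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    steps = 0
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node - 1]
        steps += 1
        if steps > len(heads):
            raise ValueError("cycle in head list")
    return ancestor == 0


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based ids of tokens whose attachment is nonprojective."""
    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


if __name__ == "__main__":
    # HEAD column of the first test sentence shown in the Sample section.
    test_sentence = [2, 4, 4, 9, 6, 9, 8, 9, 0, 9, 0]
    print(nonprojective_tokens(test_sentence))
    print(f"{100 * 15 / 9581:.2f}%")
```

Applied to the first test sentence, the check finds no nonprojective attachment, which is consistent with how rare they are in the data.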

Initial parsing results were published by Ramasamy and Žabokrtský (2011). They used a smaller data set and a different training/test split from the one defined here (2008 tokens for training, 953 tokens for test).

Parser (Authors)         LAS     UAS
Malt (Nivre et al.)      65.69   75.03
MST (McDonald et al.)    65.69   74.92
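For reference, LAS (labeled attachment score) and UAS (unlabeled attachment score) are usually computed as the percentage of tokens with the correct head and label, and with the correct head alone, respectively. The sketch below follows that standard definition; whether the published scores count punctuation tokens is not stated here, so this version simply counts every token. The function name and the toy data are illustrative.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head id, dependency label)


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over all tokens."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted arcs must be non-empty and aligned")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100 * las_hits / n, 100 * uas_hits / n


if __name__ == "__main__":
    # Toy example, not treebank data.
    gold = [(2, "Atr"), (0, "Pred"), (2, "Obj"), (0, "AuxK")]
    pred = [(2, "Atr"), (0, "Pred"), (2, "Sb"), (2, "AuxK")]
    las, uas = attachment_scores(gold, pred)
    print(f"LAS {las:.2f}  UAS {uas:.2f}")
```

Since a correct labeled arc also requires a correct head, LAS can never exceed UAS, as in the table above.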
