Tamil Dependency Treebank (TamilTB)
TamilTB 0.1 is distributed under the Creative Commons BY-NC-SA (Attribution-NonCommercial-ShareAlike) license.
TamilTB was created by members of the Institute of Formal and Applied Linguistics (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia.
The texts come from the news domain (http://www.dinamani.com/).
Version 0.1 contains 9581 tokens in 600 sentences, yielding 15.97 tokens per sentence on average. We defined the following data split: 7592 tokens / 480 sentences training, 1989 tokens / 120 sentences test.
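The figures above can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are copied from the paragraph, and the names `SPLIT` and `tokens_per_sentence` are illustrative, not part of the treebank distribution.

```python
# Counts copied from the description above.
TOTAL_TOKENS, TOTAL_SENTENCES = 9581, 600
SPLIT = {"train": (7592, 480), "test": (1989, 120)}  # (tokens, sentences)


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length in tokens."""
    return tokens / sentences


# The two parts of the split should add up to the whole treebank.
assert sum(tokens for tokens, _ in SPLIT.values()) == TOTAL_TOKENS
assert sum(sentences for _, sentences in SPLIT.values()) == TOTAL_SENTENCES

print(f"all: {tokens_per_sentence(TOTAL_TOKENS, TOTAL_SENTENCES):.2f}")
for name, (tokens, sentences) in SPLIT.items():
    print(f"{name}: {tokens_per_sentence(tokens, sentences):.2f}")
```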
Tamil script has been romanized (the romanization is case-sensitive).
The treebank is distributed in three formats: TMT (TectoMT XML), CoNLL and TnT-tagger style (only POS-tagged layer).
Morphological annotation is manual and includes lemmas, parts of speech and morphosyntactic features. Syntactic annotation follows the style of the Prague Dependency Treebank.
The first sentence of the CoNLL version of training data:
1 | cennai | cennai | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 2 | AAdjn | _ | _ |
2 | arukE | arukE | P | PP------- | _ | 18 | AuxP | _ | _ |
3 | sri | sri | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 4 | Atr | _ | _ |
4 | perumpuTUril | perumpuTUr | N | NEL-3SN-- | Cas=L|Per=3|Num=S|Gen=N | 18 | AAdjn | _ | _ |
5 | kirIn | kirIn | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 6 | Atr | _ | _ |
6 | pIltu | pIltu | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 11 | Atr | _ | _ |
7 | ( | ( | Z | Z:------- | _ | 6 | AuxG | _ | _ |
8 | wavIna | wavInam | J | JJ------- | _ | 6 | Atr | _ | _ |
9 | ) | ) | Z | Z:------- | _ | 6 | AuxG | _ | _ |
10 | vimAna | vimAnam | N | NO--3SN-- | Per=3|Num=S|Gen=N | 11 | Atr | _ | _ |
11 | wilaiyaTTukkukk | wilaiyam | N | NND-3SN-- | Cas=D|Per=3|Num=S|Gen=N | 12 | Atr | _ | _ |
12 | Ana | Aku | T | Tg------- | _ | 13 | Atr | _ | _ |
13 | wilam | wilam | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 18 | Sb | _ | _ |
14 | yArukkum | yAr | R | RBD-3SA-- | Cas=D|Per=3|Num=S|Gen=A | 15 | Atr | _ | _ |
15 | pATippu | pATippu | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 16 | Comp | _ | _ |
16 | illATa | il | P | PP------- | _ | 17 | AuxP | _ | _ |
17 | vakaiyil | vakai | N | NNL-3SN-- | Cas=L|Per=3|Num=S|Gen=N | 18 | AAdjn | _ | _ |
18 | etukkap | etu | V | Vu-T---AA | Ten=T|Voi=A|Neg=A | 20 | Obj | _ | _ |
19 | patum | patu | V | VR-F3SNPA | Ten=F|Per=3|Num=S|Gen=N|Voi=P|Neg=A | 18 | AuxV | _ | _ |
20 | enRu | en | T | Tt-T----A | Ten=T|Neg=A | 23 | AuxC | _ | _ |
21 | muTalvar | muTalvar | N | NNN-3SH-- | Cas=N|Per=3|Num=S|Gen=H | 22 | Atr | _ | _ |
22 | karuNAwiTi | karuNAwiTi | N | NEN-3SH-- | Cas=N|Per=3|Num=S|Gen=H | 23 | Sb | _ | _ |
23 | uRuTiyaLiTT | uRuTiyaLi | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 0 | Pred | _ | _ |
24 | uLLAr | uL | V | VR-T3SHAA | Ten=T|Per=3|Num=S|Gen=H|Voi=A|Neg=A | 23 | AuxV | _ | _ |
25 | . | . | Z | Z#------- | _ | 0 | AuxK | _ | _ |
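Rows like the ones above could be read with a short script. The sketch below assumes that the distributed CoNLL files use the usual CoNLL-X layout (ten tab-separated columns, one token per line, a blank line between sentences); this page shows the columns separated by ` | ` instead, so that layout is an assumption. The names `Token`, `parse_feats`, `parse_line` and `read_sentences` are illustrative.

```python
from typing import Dict, Iterable, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Cas=N|Per=3' into {'Cas': 'N', 'Per': '3'}; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_line(line: str, sep: str = "\t") -> Token:
    """Parse one token line; only the first eight columns are used."""
    cols = line.rstrip("\r\n").split(sep)
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Group token lines into sentences, splitting on blank lines."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            current.append(parse_line(line))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


# Token 1 of the training sentence above, rebuilt with tab separators.
row = "\t".join(["1", "cennai", "cennai", "N", "NEN-3SN--",
                 "Cas=N|Per=3|Num=S|Gen=N", "2", "AAdjn", "_", "_"])
print(parse_line(row))
```

Keeping the features as a dictionary makes it easy to filter tokens by case or gender without re-splitting the string each time.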
The first sentence of the CoNLL version of test data:
1 | pikAr | pikAr | N | NEN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 2 | Atr | _ | _ |
2 | iliruwTu | iliruwTu | P | PP------- | _ | 4 | AuxP | _ | _ |
3 | ErALamAna | ErALamAna | J | JJ------- | _ | 4 | Atr | _ | _ |
4 | iLainjarkaL | iLainjar | N | NNN-3PA-- | Cas=N|Per=3|Num=P|Gen=A | 9 | Sb | _ | _ |
5 | vElai | vElai | N | NNN-3SN-- | Cas=N|Per=3|Num=S|Gen=N | 6 | Obj | _ | _ |
6 | TEti | TEtu | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 9 | AAdjn | _ | _ |
7 | veLi | veLi | J | JJ------- | _ | 8 | Atr | _ | _ |
8 | mAwilangkaLukku | mAwilam | N | NND-3PN-- | Cas=D|Per=3|Num=P|Gen=N | 9 | AAdjn | _ | _ |
9 | kutipeyarwTu | kutipeyar | V | Vt-T---AA | Ten=T|Voi=A|Neg=A | 0 | Pred | _ | _ |
10 | varukinRanar | varu | V | VR-P3PHAA | Ten=P|Per=3|Num=P|Gen=H|Voi=A|Neg=A | 9 | AuxV | _ | _ |
11 | . | . | Z | Z#------- | _ | 0 | AuxK | _ | _ |
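In both sample sentences the nine-character tag in the fifth column appears to encode the same information as the feature column, one character per slot. The sketch below decodes a tag under that reading. The slot-to-feature mapping in `SLOT_FEATURES` is inferred only from the rows shown here, not taken from the tagset documentation, so treat it as an assumption; `decode_tag` is an illustrative name.

```python
# Slot index (0-based) -> feature name, inferred from the sample rows only.
# Slots 0 and 1 hold the part of speech and its subtype.
SLOT_FEATURES = {2: "Cas", 3: "Ten", 4: "Per", 5: "Num", 6: "Gen", 7: "Voi", 8: "Neg"}


def decode_tag(tag: str) -> dict:
    """Expand a positional tag such as 'VR-F3SNPA' into a feature dictionary."""
    if len(tag) != 9:
        raise ValueError(f"expected a 9-character tag, got {tag!r}")
    # A '-' marks a slot that does not apply to this word.
    return {name: tag[i] for i, name in SLOT_FEATURES.items() if tag[i] != "-"}


print(decode_tag("VR-P3PHAA"))  # tag of token 10 (varukinRanar) above
```

A check like this could be used to confirm that the tag and the feature column agree for every token in the data.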
Nonprojectivities in TamilTB are very rare. Only 15 of the 9581 tokens are attached nonprojectively (0.16%).
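A count of this kind could be reproduced as sketched below. The sketch uses a common definition, which is an assumption about how the figure was obtained: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function names are illustrative, and the input is assumed to be a well-formed tree (no cycles).

```python
from typing import List


def dominates(ancestor: int, node: int, heads: List[int]) -> bool:
    """True if `ancestor` is `node` itself or lies on the path from `node` to the root (0)."""
    while True:
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node]


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the ids of tokens whose attachment to their head is nonprojective.

    heads[i] is the head of token i (1-based); heads[0] is an unused placeholder.
    """
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((dep, head))
        if any(not dominates(head, k, heads) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Head column of the first test sentence above (index 0 is a placeholder).
test_heads = [0, 2, 4, 4, 9, 6, 9, 8, 9, 0, 9, 0]
print(nonprojective_tokens(test_heads))

# Hypothetical toy tree: token 1 attaches to token 3 across token 2,
# which hangs directly on the root.
print(nonprojective_tokens([0, 3, 0, 0]))

# The proportion quoted in the text.
print(f"{15 / 9581:.2%}")
```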
Initial parsing results were published by Ramasamy and Žabokrtský (2011). They used a smaller data set and a different training/test split than the one defined here (2008 tokens training, 953 tokens test).
Parser (Authors) | LAS | UAS |
---|---|---|
Malt (Nivre et al.) | 65.69 | 75.03 |
MST (McDonald et al.) | 65.69 | 74.92 |
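For reference, the two columns follow the standard definitions: UAS is the percentage of tokens that receive the correct head, and LAS the percentage that receive both the correct head and the correct dependency label. The sketch below computes both. It is not the evaluation script used for the table, the predicted values are hypothetical, and whether punctuation tokens were scored is not stated here.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head id, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent for two aligned lists of arcs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * both_ok / total, 100.0 * head_ok / total


# Gold arcs of tokens 1-4 of the first test sentence; the predictions are made up.
gold = [(2, "Atr"), (4, "AuxP"), (4, "Atr"), (9, "Sb")]
predicted = [(2, "Atr"), (4, "Atr"), (3, "Atr"), (9, "Sb")]
las, uas = attachment_scores(gold, predicted)
print(f"LAS {las:.2f}  UAS {uas:.2f}")
```

Because a labeled match requires a head match, LAS can never exceed UAS, which is consistent with the table.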