Tamil (ta)
Tamil Dependency Treebank (TamilTB)
Versions
- TamilTB 0.1
Obtaining and License
TamilTB 0.1 is distributed under the Creative Commons BY-NC-SA license. In short:
- non-commercial use only
- redistribution permitted
- attribution to Charles University in Prague, Institute of Formal and Applied Linguistics required
- one of the principal publications (see below) must be cited in published work using the treebank
TamilTB was created by members of the Institute of Formal and Applied Linguistics (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia.
References
- Website
- Data
- no separate citation
- Principal publications
- Loganathan Ramasamy, Zdeněk Žabokrtský: Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches. In: Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011), Volume Part I, pp. 82–95, Tokyo, Japan, 2011. Springer Berlin/Heidelberg, ISBN 978-3-642-19399-6.
- Loganathan Ramasamy, Zdeněk Žabokrtský: Prague Dependency Style Treebank for Tamil. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), İstanbul, Turkey, 2012.
- Documentation
- Loganathan Ramasamy, Zdeněk Žabokrtský: Tamil Dependency Treebank (TamilTB) – 0.1 Annotation Manual. Technical Report TR-2011-42, ÚFAL MFF UK, Praha, Czechia, 2011.
Domain
News; the texts come from the Tamil daily Dinamani (http://www.dinamani.com/).
Size
Version 0.1 contains 9,581 tokens in 600 sentences, i.e. 15.97 tokens per sentence on average. The data are split into a training set of 7,592 tokens / 480 sentences and a test set of 1,989 tokens / 120 sentences.
Inside
The Tamil script has been romanized; the romanization is case-sensitive.
The treebank is distributed in three formats: TMT (TectoMT XML), CoNLL, and TnT-tagger style (POS-tagged layer only).
Morphological annotation is manual and includes lemmas, parts of speech, and morphosyntactic features. Syntactic annotation follows the style of the Prague Dependency Treebank.
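For illustration, here is a minimal sketch of reading the CoNLL files, assuming the standard 10-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with tab-separated fields and blank lines between sentences; the file name is a placeholder.

from typing import Iterator

# Column names follow the CoNLL-X convention (an assumption; the
# annotation manual is the authoritative description of the files).
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]

def read_conll(path: str) -> Iterator[list[dict]]:
    """Yield one sentence at a time as a list of token dicts."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:          # a blank line terminates a sentence
                if sentence:
                    yield sentence
                sentence = []
            else:
                sentence.append(dict(zip(COLUMNS, line.split("\t"))))
    if sentence:                  # file without a trailing blank line
        yield sentence

# "tamiltb-train.conll" is a hypothetical file name.
for sent in read_conll("tamiltb-train.conll"):
    for tok in sent:
        print(tok["id"], tok["form"], tok["lemma"], tok["head"], tok["deprel"])
    break                         # first sentence only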
Sample
The first sentence of the CoNLL version of training data:
1   cennai           cennai      N  NEN-3SN--  Cas=N|Per=3|Num=S|Gen=N              2   AAdjn  _  _
2   arukE            arukE       P  PP-------  _                                    18  AuxP   _  _
3   sri              sri         N  NEN-3SN--  Cas=N|Per=3|Num=S|Gen=N              4   Atr    _  _
4   perumpuTUril     perumpuTUr  N  NEL-3SN--  Cas=L|Per=3|Num=S|Gen=N              18  AAdjn  _  _
5   kirIn            kirIn       N  NEN-3SN--  Cas=N|Per=3|Num=S|Gen=N              6   Atr    _  _
6   pIltu            pIltu       N  NEN-3SN--  Cas=N|Per=3|Num=S|Gen=N              11  Atr    _  _
7   (                (           Z  Z:-------  _                                    6   AuxG   _  _
8   wavIna           wavInam     J  JJ-------  _                                    6   Atr    _  _
9   )                )           Z  Z:-------  _                                    6   AuxG   _  _
10  vimAna           vimAnam     N  NO--3SN--  Per=3|Num=S|Gen=N                    11  Atr    _  _
11  wilaiyaTTukkukk  wilaiyam    N  NND-3SN--  Cas=D|Per=3|Num=S|Gen=N              12  Atr    _  _
12  Ana              Aku         T  Tg-------  _                                    13  Atr    _  _
13  wilam            wilam       N  NNN-3SN--  Cas=N|Per=3|Num=S|Gen=N              18  Sb     _  _
14  yArukkum         yAr         R  RBD-3SA--  Cas=D|Per=3|Num=S|Gen=A              15  Atr    _  _
15  pATippu          pATippu     N  NNN-3SN--  Cas=N|Per=3|Num=S|Gen=N              16  Comp   _  _
16  illATa           il          P  PP-------  _                                    17  AuxP   _  _
17  vakaiyil         vakai       N  NNL-3SN--  Cas=L|Per=3|Num=S|Gen=N              18  AAdjn  _  _
18  etukkap          etu         V  Vu-T---AA  Ten=T|Voi=A|Neg=A                    20  Obj    _  _
19  patum            patu        V  VR-F3SNPA  Ten=F|Per=3|Num=S|Gen=N|Voi=P|Neg=A  18  AuxV   _  _
20  enRu             en          T  Tt-T----A  Ten=T|Neg=A                          23  AuxC   _  _
21  muTalvar         muTalvar    N  NNN-3SH--  Cas=N|Per=3|Num=S|Gen=H              22  Atr    _  _
22  karuNAwiTi       karuNAwiTi  N  NEN-3SH--  Cas=N|Per=3|Num=S|Gen=H              23  Sb     _  _
23  uRuTiyaLiTT      uRuTiyaLi   V  Vt-T---AA  Ten=T|Voi=A|Neg=A                    0   Pred   _  _
24  uLLAr            uL          V  VR-T3SHAA  Ten=T|Per=3|Num=S|Gen=H|Voi=A|Neg=A  23  AuxV   _  _
25  .                .           Z  Z#-------  _                                    0   AuxK   _  _
The first sentence of the CoNLL version of test data:
1   pikAr            pikAr       N  NEN-3SN--  Cas=N|Per=3|Num=S|Gen=N              2   Atr    _  _
2   iliruwTu         iliruwTu    P  PP-------  _                                    4   AuxP   _  _
3   ErALamAna        ErALamAna   J  JJ-------  _                                    4   Atr    _  _
4   iLainjarkaL      iLainjar    N  NNN-3PA--  Cas=N|Per=3|Num=P|Gen=A              9   Sb     _  _
5   vElai            vElai       N  NNN-3SN--  Cas=N|Per=3|Num=S|Gen=N              6   Obj    _  _
6   TEti             TEtu        V  Vt-T---AA  Ten=T|Voi=A|Neg=A                    9   AAdjn  _  _
7   veLi             veLi        J  JJ-------  _                                    8   Atr    _  _
8   mAwilangkaLukku  mAwilam     N  NND-3PN--  Cas=D|Per=3|Num=P|Gen=N              9   AAdjn  _  _
9   kutipeyarwTu     kutipeyar   V  Vt-T---AA  Ten=T|Voi=A|Neg=A                    0   Pred   _  _
10  varukinRanar     varu        V  VR-P3PHAA  Ten=P|Per=3|Num=P|Gen=H|Voi=A|Neg=A  9   AuxV   _  _
11  .                .           Z  Z#-------  _                                    0   AuxK   _  _
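The nine-position POSTAG column runs parallel to the FEATS column. As a hedged sketch, the mapping below is inferred purely from the sample lines above (position 1 looks like the main POS, position 2 a subtype, and positions 3-9 Case, Tense, Person, Number, Gender, Voice and Negation, with "-" marking an unused slot); the annotation manual cited under References is the authoritative source.

# Position-to-feature mapping inferred from the samples, not taken from
# the manual: tag[0] = POS, tag[1] = subtype, then seven feature slots.
FEATURE_SLOTS = [(2, "Cas"), (3, "Ten"), (4, "Per"),
                 (5, "Num"), (6, "Gen"), (7, "Voi"), (8, "Neg")]

def decode_postag(tag: str) -> dict[str, str]:
    """Expand a positional tag such as 'VR-F3SNPA' into a feature dict."""
    feats = {"pos": tag[0], "subpos": tag[1]}
    for index, name in FEATURE_SLOTS:
        if tag[index] != "-":     # "-" marks a slot that does not apply
            feats[name] = tag[index]
    return feats

print(decode_postag("VR-F3SNPA"))
# {'pos': 'V', 'subpos': 'R', 'Ten': 'F', 'Per': '3', 'Num': 'S',
#  'Gen': 'N', 'Voi': 'P', 'Neg': 'A'} -- matches the FEATS of
# training token 19 "patum"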
Parsing
Nonprojectivities in PADT (the Prague Arabic Dependency Treebank) are rare: only 431 of the 116,793 tokens in the CoNLL 2007 version (0.37%) are attached nonprojectively.
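Such counts are easy to reproduce from the HEAD column. The sketch below applies the usual definition: a token is attached nonprojectively if some token between it and its head is not dominated by that head. It assumes 1-based token IDs with 0 standing for the artificial root.

def dominates(heads: list[int], ancestor: int, node: int) -> bool:
    """True if `ancestor` lies on the path from `node` up to the root (0)."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False

def nonprojective_tokens(heads: list[int]) -> list[int]:
    """heads[i] is the head of token i; heads[0] is a dummy for the root."""
    result = []
    for tok in range(1, len(heads)):
        lo, hi = sorted((tok, heads[tok]))
        if any(not dominates(heads, heads[tok], k) for k in range(lo + 1, hi)):
            result.append(tok)
    return result

# Toy tree: token 3 hangs on token 5, but token 4 between them is a child
# of the root, so the arc 5 -> 3 crosses it and token 3 is nonprojective.
print(nonprojective_tokens([0, 4, 4, 5, 0, 4]))   # [3]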
The results of the CoNLL 2006 shared task are available online and were published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard in that it excluded punctuation tokens. These are the best results for Arabic (labeled and unlabeled attachment scores, in %):
Parser (authors)           LAS    UAS
MST (McDonald et al.)      66.91  79.34
Basis (O'Neil)             66.71  78.54
Malt (Nivre et al.)        66.71  77.52
Edinburgh (Riedel et al.)  66.65  78.62
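To make the difference between the two evaluation procedures concrete, here is a sketch of the LAS/UAS computation with an optional punctuation filter; treating every all-punctuation form as punctuation is a simplification of the actual shared-task rules, and the token tuples below are illustrative.

import string

def attachment_scores(tokens, skip_punct: bool):
    """Return (LAS, UAS) in percent over the scored tokens.

    Each token is a (form, gold_head, gold_deprel, pred_head, pred_deprel)
    tuple."""
    scored = correct_head = correct_both = 0
    for form, g_head, g_rel, p_head, p_rel in tokens:
        if skip_punct and all(ch in string.punctuation for ch in form):
            continue                      # 2006-style: skip punctuation
        scored += 1
        if p_head == g_head:
            correct_head += 1
            if p_rel == g_rel:            # LAS needs head and label correct
                correct_both += 1
    return 100.0 * correct_both / scored, 100.0 * correct_head / scored

tokens = [("vElai", 6, "Obj", 6, "Obj"),
          ("TEti", 9, "AAdjn", 9, "Atr"),
          (".", 0, "AuxK", 9, "AuxK")]
print(attachment_scores(tokens, skip_punct=True))   # (50.0, 100.0)
print(attachment_scores(tokens, skip_punct=False))  # (33.3..., 66.6...)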
The results of the CoNLL 2007 shared task are available online and were published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Arabic (in %):
Parser (authors)       LAS    UAS
Malt (Nilsson et al.)  76.52  85.81
Nakagawa               75.08  86.09
Malt (Hall et al.)     74.75  84.21
Sagae                  74.71  84.04
Chen                   74.65  83.49
Titov et al.           74.12  83.18
The two Malt parser results from 2007 (Single Malt and Blended) are described in (Hall et al., 2007), and the details of the parser configuration are described here.