
Institute of Formal and Applied Linguistics Wiki






Tamil (ta)

Tamil Dependency Treebank (TamilTB)

Versions

Obtaining and License

TamilTB 0.1 is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license.

TamilTB was created by members of the Institute of Formal and Applied Linguistics (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia.

References

Domain

News (http://www.dinamani.com/).

Size

Version 0.1 contains 9581 tokens in 600 sentences, yielding 15.97 tokens per sentence on average. We defined the following data split: 7592 tokens / 480 sentences training, 1989 tokens / 120 sentences test.
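The figures above can be cross-checked with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the paragraph, and the variable and function names are illustrative, not part of the TamilTB distribution.

```python
# Figures quoted in the Size paragraph for TamilTB 0.1.
TOTAL_TOKENS, TOTAL_SENTENCES = 9581, 600
TRAIN_TOKENS, TRAIN_SENTENCES = 7592, 480
TEST_TOKENS, TEST_SENTENCES = 1989, 120


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length in tokens."""
    return tokens / sentences


# Overall average, rounded to two decimals as in the text.
avg_all = round(tokens_per_sentence(TOTAL_TOKENS, TOTAL_SENTENCES), 2)

# The training and test parts should add up to the whole treebank.
split_is_consistent = (
    TRAIN_TOKENS + TEST_TOKENS == TOTAL_TOKENS
    and TRAIN_SENTENCES + TEST_SENTENCES == TOTAL_SENTENCES
)

# Share of sentences assigned to training.
train_share = TRAIN_SENTENCES / TOTAL_SENTENCES

print(avg_all, split_is_consistent, train_share)
```

The split is therefore an 80/20 division by sentence count, and the two parts account for every token in the release.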

Inside

Tamil script has been romanized (the romanization is case-sensitive).

The treebank is distributed in three formats: TMT (TectoMT XML), CoNLL, and TnT tagger style (POS-tagged layer only).

Morphological annotation is manual and includes lemmas, parts of speech, and morphosyntactic features. Syntactic annotation follows the style of the Prague Dependency Treebank.
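To illustrate how the CoNLL layer carries lemmas, parts of speech, features, and dependencies, here is a minimal reader sketch. It assumes the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and splits on any whitespace, since the rows on this page are space-separated. The `Token` class, the function names, and the sample rows (forms and lemmas replaced by ASCII placeholders) are hypothetical and not taken from the distribution.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List

N_COLUMNS = 10  # assumed CoNLL-X column count


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'case=1|def=R' into a dict; '_' means no features."""
    if field == "_":
        return {}
    feats = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")
        feats[key] = value
    return feats


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of tokens); blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        fields = line.split()
        if not fields:
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(fields) != N_COLUMNS:
            raise ValueError(f"expected {N_COLUMNS} columns, got {len(fields)}")
        sentence.append(Token(
            id=int(fields[0]), form=fields[1], lemma=fields[2],
            cpostag=fields[3], postag=fields[4],
            feats=parse_feats(fields[5]),
            head=int(fields[6]), deprel=fields[7],
        ))
    if sentence:
        yield sentence


# Hypothetical rows: placeholder forms and lemmas, not real treebank data.
sample = """\
1 w1 l1 N N case=1|def=R 0 ExD _ _
2 w2 l2 Z Z _ 3 Atr _ _
3 w3 l3 Z Z _ 1 Atr _ _

1 w4 l4 V VP pers=3|num=S 0 Pred _ _
"""

sentences = list(read_conll(sample.splitlines()))
```

A head value of 0 marks a token attached to the artificial root; every other head value is the ID of another token in the same sentence.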

Sample

The first two sentences of the CoNLL 2006 training data:

1 غِيابُ_giyAbu غِياب_giyAb N N case=1|def=R 0 ExD _ _
2 فُؤاد_fu&Ad فُؤاد_fu&Ad Z Z _ 3 Atr _ _
3 كَنْعان_kanoEAn كَنْعان_kanoEAn Z Z _ 1 Atr _ _

1 فُؤاد_fu&Ad فُؤاد_fu&Ad Z Z _ 2 Atr _ _
2 كَنْعان_kanoEAn كَنْعان_kanoEAn Z Z _ 9 Sb _ _
3 ،_, ،_, G G _ 2 AuxG _ _
4 رائِد_rA}id رائِد_rA}id N N _ 2 Atr _ _
5 القِصَّة_AlqiS~ap قِصَّة_qiS~ap N N gen=F|num=S|def=D 4 Atr _ _
6 القَصِيرَةِ_AlqaSiyrapi قَصِير_qaSiyr A A gen=F|num=S|case=2|def=D 5 Atr _ _
7 فِي_fiy فِي_fiy P P _ 4 AuxP _ _
8 لُبْنانِ_lubonAni لُبْنان_lubonAn Z Z case=2|def=R 7 Atr _ _
9 رَحَلَ_raHala رَحَل-َ_raHal-a V VP pers=3|gen=M|num=S 0 Pred _ _
10 مَساءَ_masA'a مَساء_masA' D D _ 9 Adv _ _
11 أَمْسِ_>amosi أَمْسِ_>amosi D D _ 10 Atr _ _
12 عَن_Ean عَن_Ean P P _ 9 AuxP _ _
13 81_81 81_81 Q Q _ 12 Adv _ _
14 عاماً_EAmAF عام_EAm N N gen=M|num=S|case=4|def=I 13 Atr _ _
15 ._. ._. G G _ 0 AuxK _ _

The first sentence of the CoNLL 2006 test data:

1 اِتِّفاقٌ_Ait~ifAqN اِتِّفاق_Ait~ifAq N N case=1|def=I 0 ExD _ _
2 بَيْنَ_bayona بَيْنَ_bayona P P _ 1 AuxP _ _
3 لُبْنانِ_lubonAni لُبْنان_lubonAn Z Z case=2|def=R 4 Atr _ _
4 وَ_wa وَ_wa C C _ 2 Coord _ _
5 سُورِيَّةٍ_suwriy~apK سُورِيا_suwriyA Z Z gen=F|num=S|case=2|def=I 4 Atr _ _
6 عَلَى_EalaY عَلَى_EalaY P P _ 1 AuxP _ _
7 رَفْعِ_rafoEi رَفْع_rafoE N N case=2|def=R 6 Atr _ _
8 مُسْتَوَى_musotawaY مُسْتَوَى_musotawaY N N _ 7 Atr _ _
9 التَبادُلِ_AltabAduli تَبادُل_tabAdul N N case=2|def=D 8 Atr _ _
10 التِجارِيِّ_AltijAriy~i تِجارِيّ_tijAriy~ A A case=2|def=D 9 Atr _ _
11 إِلَى_<ilaY إِلَى_<ilaY P P _ 7 AuxP _ _
12 500_500 500_500 Q Q _ 11 Atr _ _
13 مِلْيُونِ_miloyuwni مِلْيُون_miloyuwn N N case=2|def=R 12 Atr _ _
14 دُولارٍ_duwlArK دُولار_duwlAr N N case=2|def=I 13 Atr _ _

The first sentence of the CoNLL 2007 training data:

1 تَعْدادُ تَعْداد_1 N N- Case=1|Defin=R 7 Sb _ _
2 سُكّانِ ساكِن_1 N N- Case=2|Defin=R 1 Atr _ _
3 22 [DEFAULT] Q Q- _ 2 Atr _ _
4 دَوْلَةً دَوْلَة_1 N N- Gender=F|Number=S|Case=4|Defin=I 3 Atr _ _
5 عَرَبِيَّةً عَرَبِيّ_1 A A- Gender=F|Number=S|Case=4|Defin=I 4 Atr _ _
6 سَ سَ_FUT F F- _ 7 AuxM _ _
7 يَرْتَفِعُ اِرْتَفَع_1 V VI Mood=I|Voice=A|Person=3|Gender=M|Number=S 0 Pred _ _
8 إِلَى إِلَى_1 P P- _ 7 AuxP _ _
9 654 [DEFAULT] Q Q- _ 8 Adv _ _
10 مِلْيُونَ مِلْيُون_1 N N- Case=4|Defin=R 9 Atr _ _
11 نَسَمَةٍ نَسَمَة_1 N N- Gender=F|Number=S|Case=2|Defin=I 10 Atr _ _
12 فِي فِي_1 P P- _ 7 AuxP _ _
13 مُنْتَصَفِ مُنْتَصَف_1 N N- Case=2|Defin=R 12 Adv _ _
14 القَرْنِ قَرْن_1 N N- Case=2|Defin=D 13 Atr _ _

The first sentence of the CoNLL 2007 test data:

1 مُقاوَمَةُ مُقاوَمَة_1 N N- Gender=F|Number=S|Case=1|Defin=R 0 ExD _ _
2 زَواجِ زَواج_1 N N- Case=2|Defin=R 1 Atr _ _
3 الطُلّابِ طالِب_1 N N- Case=2|Defin=D 2 Atr _ _
4 العُرْفِيِّ عُرْفِيّ_1 A A- Case=2|Defin=D 2 Atr _ _

Parsing

Nonprojectivities in PADT are rare. Only 431 of the 116,793 tokens in the CoNLL 2007 version are attached nonprojectively (0.37%).
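For readers who want to reproduce such a count, the sketch below shows one common way to test whether a token is attached nonprojectively: the arc between a token and its head is nonprojective if some token lying strictly between them is not dominated by that head. The representation (`heads[i]` is the head of token `i`, with 0 as the artificial root and index 0 unused) and the function names are assumptions made for illustration; the shared task's own counting script may differ in detail.

```python
from typing import List


def dominates(heads: List[int], ancestor: int, node: int) -> bool:
    """True if `ancestor` is `node` itself or lies on its path to the root."""
    current = node
    # Bounded walk so that a malformed (cyclic) tree cannot loop forever.
    for _ in range(len(heads) + 1):
        if current == ancestor:
            return True
        if current == 0:
            return False
        current = heads[current]
    raise ValueError("cycle detected in head assignments")


def count_nonprojective(heads: List[int]) -> int:
    """Count tokens whose arc to their head spans a token the head does not dominate."""
    count = 0
    for dep in range(1, len(heads)):
        head = heads[dep]
        left, right = min(dep, head), max(dep, head)
        if any(not dominates(heads, head, k) for k in range(left + 1, right)):
            count += 1
    return count


# Toy trees (hypothetical): index 0 is a placeholder for the artificial root.
projective_tree = [0, 2, 0, 2]        # 1 <- 2 -> 3, token 2 is the root
nonprojective_tree = [0, 3, 0, 2, 2]  # the arc 3 -> 1 crosses token 2

# Percentage quoted in the text: 431 of 116,793 tokens.
share = round(100 * 431 / 116793, 2)
```

In the second toy tree only token 1 is counted: its head is token 3, but token 2, which lies between them, hangs on the root rather than under token 3.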

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Arabic:

^ Parser (Authors) ^ LAS ^ UAS ^
| MST (McDonald et al.) | 66.91 | 79.34 |
| Basis (O'Neil) | 66.71 | 78.54 |
| Malt (Nivre et al.) | 66.71 | 77.52 |
| Edinburgh (Riedel et al.) | 66.65 | 78.62 |

The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Arabic:

^ Parser (Authors) ^ LAS ^ UAS ^
| Malt (Nilsson et al.) | 76.52 | 85.81 |
| Nakagawa | 75.08 | 86.09 |
| Malt (Hall et al.) | 74.75 | 84.21 |
| Sagae | 74.71 | 84.04 |
| Chen | 74.65 | 83.49 |
| Titov et al. | 74.12 | 83.18 |
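The difference between the two evaluation procedures can be made concrete with a small scoring sketch. UAS is the percentage of scored tokens with the correct head; LAS additionally requires the correct dependency label. The code below is an illustration, not the official evaluation script: the function names are hypothetical, the toy sentence is invented, and treating a token as punctuation when all of its characters fall in a Unicode punctuation category is an assumption.

```python
import unicodedata
from typing import List, Tuple

# One token: (form, head, dependency label).
Tok = Tuple[str, int, str]


def is_punctuation(form: str) -> bool:
    """Assumed rule: every character has a Unicode category starting with 'P'."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold: List[Tok], pred: List[Tok],
                      skip_punctuation: bool) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, rounded to two decimals."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences differ in length")
    scored = head_ok = label_ok = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if skip_punctuation and is_punctuation(form):
            continue  # 2006-style scoring leaves punctuation out
        scored += 1
        if g_head == p_head:
            head_ok += 1
            if g_rel == p_rel:
                label_ok += 1
    if scored == 0:
        raise ValueError("no tokens left to score")
    return (round(100 * label_ok / scored, 2),
            round(100 * head_ok / scored, 2))


# Invented toy sentence: one wrong label (token 3), one wrong head (the period).
gold = [("a", 2, "Sb"), ("b", 0, "Pred"), ("c", 2, "Obj"), (".", 0, "AuxK")]
pred = [("a", 2, "Sb"), ("b", 0, "Pred"), ("c", 2, "Adv"), (".", 2, "AuxK")]

las_2006_style, uas_2006_style = attachment_scores(gold, pred, skip_punctuation=True)
las_2007_style, uas_2007_style = attachment_scores(gold, pred, skip_punctuation=False)
```

On the toy sentence the misattached period only lowers the scores when punctuation is included, which is why figures from the two shared tasks are not directly comparable.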

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

