Tamil (ta)

Versions

TamilTB 0.1

Obtaining and License

TamilTB 0.1 is distributed under the Creative Commons BY-NC-SA (Attribution-NonCommercial-ShareAlike) license. The license in short:

non-commercial usage
redistribution permitted
attribution to Charles University in Prague, Institute of Formal and Applied Linguistics required
- cite one of the principal publications (see below) in published work using the treebank

TamilTB was created by members of the Institute of Formal and Applied Linguistics (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia.

References

Website
- http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/
Data
- no separate citation
Principal publications
- Loganathan Ramasamy, Zdeněk Žabokrtský: Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches. In: Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011) – Volume Part I, pages 82-95, Tokyo, Japan, 2011, published by Springer Berlin / Heidelberg, ISBN 978-3-642-19399-6.
- Loganathan Ramasamy, Zdeněk Žabokrtský: Prague Dependency Style Treebank for Tamil. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), İstanbul, Turkey, 2012.
Documentation
- Morphological annotation
- Syntactic annotation
- Loganathan Ramasamy, Zdeněk Žabokrtský: Tamil Dependency Treebank (TamilTB) – 0.1 Annotation Manual. Technical Report TR-2011-42, ÚFAL MFF UK, Praha, Czechia, 2011.

Domain

News (http://www.dinamani.com/).

Size

Version 0.1 contains 9581 tokens in 600 sentences, yielding 15.97 tokens per sentence on average. We defined the following data split: 7592 tokens / 480 sentences training, 1989 tokens / 120 sentences test.
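
The figures above can be cross-checked with a few lines of arithmetic. The following is only a sketch that re-derives the averages from the numbers quoted in this section; the variable and function names are illustrative and not part of the treebank distribution.

```python
# Figures quoted in the Size section.
total_tokens, total_sentences = 9581, 600
train_tokens, train_sentences = 7592, 480
test_tokens, test_sentences = 1989, 120

# The two parts of the split should add up to the whole treebank.
assert train_tokens + test_tokens == total_tokens
assert train_sentences + test_sentences == total_sentences


def average_length(tokens, sentences):
    """Mean number of tokens per sentence, rounded to two decimals."""
    return round(tokens / sentences, 2)


overall_avg = average_length(total_tokens, total_sentences)
train_avg = average_length(train_tokens, train_sentences)
test_avg = average_length(test_tokens, test_sentences)

# Share of sentences held out for testing, in percent.
test_share = round(100 * test_sentences / total_sentences, 1)

print(overall_avg, train_avg, test_avg, test_share)
```

By sentence count the split is 80% training and 20% test; the test sentences are slightly longer on average than the training sentences.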

Inside

Tamil script has been romanized (the romanization is case-sensitive).

The treebank is distributed in three formats: TMT (TectoMT XML), CoNLL and TnT-tagger style (only POS-tagged layer).

Morphological annotation is manual and includes lemmas, parts of speech, and morphosyntactic features. Syntactic annotation follows the style of the Prague Dependency Treebank.

Sample

The first sentence of the CoNLL version of training data:

1	cennai	cennai	N	NEN-3SN--	Cas=N|Per=3|Num=S|Gen=N	2	AAdjn	_	_
2	arukE	arukE	P	PP-------	_	18	AuxP	_	_
3	sri	sri	N	NEN-3SN--	Cas=N|Per=3|Num=S|Gen=N	4	Atr	_	_
4	perumpuTUril	perumpuTUr	N	NEL-3SN--	Cas=L|Per=3|Num=S|Gen=N	18	AAdjn	_	_
5	kirIn	kirIn	N	NEN-3SN--	Cas=N|Per=3|Num=S|Gen=N	6	Atr	_	_
6	pIltu	pIltu	N	NEN-3SN--	Cas=N|Per=3|Num=S|Gen=N	11	Atr	_	_
7	(	(	Z	Z:-------	_	6	AuxG	_	_
8	wavIna	wavInam	J	JJ-------	_	6	Atr	_	_
9	)	)	Z	Z:-------	_	6	AuxG	_	_
10	vimAna	vimAnam	N	NO--3SN--	Per=3|Num=S|Gen=N	11	Atr	_	_
11	wilaiyaTTukkukk	wilaiyam	N	NND-3SN--	Cas=D|Per=3|Num=S|Gen=N	12	Atr	_	_
12	Ana	Aku	T	Tg-------	_	13	Atr	_	_
13	wilam	wilam	N	NNN-3SN--	Cas=N|Per=3|Num=S|Gen=N	18	Sb	_	_
14	yArukkum	yAr	R	RBD-3SA--	Cas=D|Per=3|Num=S|Gen=A	15	Atr	_	_
15	pATippu	pATippu	N	NNN-3SN--	Cas=N|Per=3|Num=S|Gen=N	16	Comp	_	_
16	illATa	il	P	PP-------	_	17	AuxP	_	_
17	vakaiyil	vakai	N	NNL-3SN--	Cas=L|Per=3|Num=S|Gen=N	18	AAdjn	_	_
18	etukkap	etu	V	Vu-T---AA	Ten=T|Voi=A|Neg=A	20	Obj	_	_
19	patum	patu	V	VR-F3SNPA	Ten=F|Per=3|Num=S|Gen=N|Voi=P|Neg=A	18	AuxV	_	_
20	enRu	en	T	Tt-T----A	Ten=T|Neg=A	23	AuxC	_	_
21	muTalvar	muTalvar	N	NNN-3SH--	Cas=N|Per=3|Num=S|Gen=H	22	Atr	_	_
22	karuNAwiTi	karuNAwiTi	N	NEN-3SH--	Cas=N|Per=3|Num=S|Gen=H	23	Sb	_	_
23	uRuTiyaLiTT	uRuTiyaLi	V	Vt-T---AA	Ten=T|Voi=A|Neg=A	0	Pred	_	_
24	uLLAr	uL	V	VR-T3SHAA	Ten=T|Per=3|Num=S|Gen=H|Voi=A|Neg=A	23	AuxV	_	_
25	.	.	Z	Z#-------	_	0	AuxK	_	_
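
The sample has ten tab-separated columns per token, which matches the CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). A minimal reader sketch follows, assuming that layout, blank lines between sentences, and a plain `|` between features as is usual in CoNLL files. The names `Token`, `parse_feats` and `read_conll` are illustrative and not part of the TamilTB distribution.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Cas=N|Per=3' into {'Cas': 'N', 'Per': '3'}; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence (a list of Token) per blank-line-separated block."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                              parse_feats(cols[5]), int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# The first two tokens of the training sample, used as inline test data.
sample = (
    "1\tcennai\tcennai\tN\tNEN-3SN--\tCas=N|Per=3|Num=S|Gen=N\t2\tAAdjn\t_\t_\n"
    "2\tarukE\tarukE\tP\tPP-------\t_\t18\tAuxP\t_\t_\n"
)
sentences = list(read_conll(sample.splitlines()))
```

The last two columns (PHEAD, PDEPREL) are `_` throughout the sample, so the sketch ignores them.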

The first sentence of the CoNLL version of test data:

1	pikAr	pikAr	N	NEN-3SN--	Cas=N|Per=3|Num=S|Gen=N	2	Atr	_	_
2	iliruwTu	iliruwTu	P	PP-------	_	4	AuxP	_	_
3	ErALamAna	ErALamAna	J	JJ-------	_	4	Atr	_	_
4	iLainjarkaL	iLainjar	N	NNN-3PA--	Cas=N|Per=3|Num=P|Gen=A	9	Sb	_	_
5	vElai	vElai	N	NNN-3SN--	Cas=N|Per=3|Num=S|Gen=N	6	Obj	_	_
6	TEti	TEtu	V	Vt-T---AA	Ten=T|Voi=A|Neg=A	9	AAdjn	_	_
7	veLi	veLi	J	JJ-------	_	8	Atr	_	_
8	mAwilangkaLukku	mAwilam	N	NND-3PN--	Cas=D|Per=3|Num=P|Gen=N	9	AAdjn	_	_
9	kutipeyarwTu	kutipeyar	V	Vt-T---AA	Ten=T|Voi=A|Neg=A	0	Pred	_	_
10	varukinRanar	varu	V	VR-P3PHAA	Ten=P|Per=3|Num=P|Gen=H|Voi=A|Neg=A	9	AuxV	_	_
11	.	.	Z	Z#-------	_	0	AuxK	_	_
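
Comparing the fifth and sixth columns of the two samples suggests that positions 3 to 9 of the nine-character tag encode case, tense, person, number, gender, voice and negation, in that order, with `-` marking an unused position. This mapping is inferred from the sample lines only; the annotation manual remains the authority. A sketch that rebuilds the features column from the tag under that assumption (`FEATURE_SLOTS` and `feats_from_tag` are illustrative names):

```python
# Feature names for tag positions 3-9 (1-based), as inferred from the samples.
# This ordering is an assumption, not taken from the annotation manual.
FEATURE_SLOTS = ["Cas", "Ten", "Per", "Num", "Gen", "Voi", "Neg"]


def feats_from_tag(tag: str) -> str:
    """Rebuild the features column from a nine-character positional tag."""
    if len(tag) != 9:
        raise ValueError(f"expected a 9-character tag, got {tag!r}")
    # Skip the first two positions (part of speech and its subtype).
    pairs = [f"{name}={value}"
             for name, value in zip(FEATURE_SLOTS, tag[2:])
             if value != "-"]
    return "|".join(pairs) if pairs else "_"


print(feats_from_tag("VR-P3PHAA"))
```

Under this reading the features column carries no information beyond the tag itself, which holds for every token in the two samples.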

Parsing

Nonprojectivities in TamilTB are very rare. Only 15 of the 9581 tokens are attached nonprojectively (0.16%).
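
The count depends on how a nonprojective attachment is defined, and this page does not spell that out. A minimal sketch under one common definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. The head list is indexed by token ID with a dummy entry at position 0 for the artificial root; the function names are illustrative.

```python
from typing import List


def dominates(ancestor: int, node: int, heads: List[int]) -> bool:
    """True if `ancestor` is on the path from `node` up to the root (ID 0)."""
    current = node
    for _ in range(len(heads)):
        current = heads[current]
        if current == ancestor:
            return True
        if current == 0:
            return False
    raise ValueError("cycle in head list")


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return IDs of tokens whose attachment to their head is nonprojective."""
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((dep, head))
        # Every token strictly between dependent and head must hang under the head.
        if any(not dominates(head, k, heads) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first test sentence (index 0 is a placeholder for the root).
test_sentence_heads = [0, 2, 4, 4, 9, 6, 9, 8, 9, 0, 9, 0]
print(nonprojective_tokens(test_sentence_heads))

# A small constructed example: token 1 attaches to token 3 across token 2,
# which hangs directly on the root.
crossing_heads = [0, 3, 0, 2]
print(nonprojective_tokens(crossing_heads))
```

Attachments to the artificial root never count as nonprojective under this definition, because every token is a descendant of the root.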

Initial parsing results were published by Ramasamy and Žabokrtský (2011). They used a smaller data set and a different training/test split than the one defined here (2008 tokens training, 953 tokens test).

Parser (Authors)	LAS	UAS
Malt (Nivre et al.)	65.69	75.03
MST (McDonald et al.)	65.69	74.92
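
UAS (unlabeled attachment score) is the percentage of tokens that receive the correct head, and LAS (labeled attachment score) the percentage that receive both the correct head and the correct dependency label. The sketch below computes both over all tokens; whether the published figures include punctuation is not stated here, so treat that as an assumption. The function name and the toy data are illustrative.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head ID, dependency label) for one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent for two aligned lists of arcs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty arc lists of equal length")
    # Head only for UAS, head and label together for LAS.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * full_hits / n, 100.0 * head_hits / n


# Toy data: token 2 has the right head but the wrong label, token 3 the wrong head.
gold = [(2, "Atr"), (4, "AuxP"), (4, "Atr"), (0, "Pred")]
predicted = [(2, "Atr"), (4, "Atr"), (3, "Atr"), (0, "Pred")]
las, uas = attachment_scores(gold, predicted)
print(f"LAS {las:.2f}  UAS {uas:.2f}")
```

LAS can never exceed UAS, since a token counted for LAS must also have the correct head.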
