user:zeman:treebanks:fa [ufal wiki]

“Preversion” 0.1 (January 2012, 12,200 annotated sentences)
Final version (expected fall 2012, 30,000 sentences)

The treebank is available free of charge under the GNU General Public License (GPL), with the additional requirement that the data be used non-commercially. To obtain the data, contact the Dadegan Research Group through their online form at http://dadegan.ir/en/contact-us. The license in short:

- non-commercial usage only
- redistribution permitted under the same license
- citation of publications not explicitly required, but a common courtesy

The Persian Dependency Treebank (PDT) was created by members of the Dadegan Research Group (دادگان, Dādegān), Computer Engineering Department, Iran University of Science and Technology (دانشگاه علم و صنعت ایران), Tehrān (تهران), Iran. The copyright is held by the Supreme Council of Information and Communication Technology (SCICT).

Website
- http://dadegan.ir/en/persiandependencytreebank
Data
- no separate citation
Principal publications
- Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.
Documentation
- Attached to the data distribution: Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide, Dadegan Research Group, Tehran, Iran, 2012. (http://dadegan.ir/sites/default/files/Persian%20Dependency%20Treebank%20Version%200.1%20Annotation%20Manual%20and%20User%20Guide.pdf)

The domain of the texts is unknown.

PDT contains 189,572 tokens in 12,455 sentences, i.e. 15.22 tokens per sentence on average. No official training/test split is defined. For our HamleDT experiments, we used the first 12,126 sentences (182,878 tokens) for training and the remaining 329 sentences (6,694 tokens) for testing.
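The split described above can be sketched in a few lines of Python. This is only an illustration, assuming the usual CoNLL convention that sentences are separated by blank lines; the function names and the file name `pdt.conll` are hypothetical and not part of the distribution.

```python
def read_sentences(lines):
    """Group CoNLL lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the file may not end with a blank line
        sentences.append(current)
    return sentences


def split_train_test(sentences, n_train=12126):
    """Take the first n_train sentences for training, the rest for testing."""
    return sentences[:n_train], sentences[n_train:]


def tokens_per_sentence(sentences):
    """Average number of token lines per sentence."""
    return sum(len(s) for s in sentences) / len(sentences)


# Hypothetical usage; "pdt.conll" is a placeholder file name.
# with open("pdt.conll", encoding="utf-8") as f:
#     train, test = split_train_test(read_sentences(f))
```

With the figures quoted above, 189,572 / 12,455 gives the stated average of 15.22 tokens per sentence.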

The data are provided in the CoNLL format. The morphosyntactic annotation includes lemmas. Morphosyntactic (part-of-speech) tags were assigned manually. The text contains no diacritical marks for short vowels, which are not normally written in Persian.

Tokenization is driven by the need to represent syntactic relations. Some orthographic words have been split into several tokens (e.g. a verb and its object). Conversely, a single tree node (token) may consist of two orthographic words, e.g. the analytical form of the subjunctive preterite “کرده باشید” (karde bāšid). Such words are not joined with an underscore, so the token contains a space.
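Because a token may contain a space, a reader for this data should split each line on tab characters only, not on arbitrary whitespace. The following is a minimal sketch, assuming the ten tab-separated columns of the CoNLL-X layout; the `Token` type, the `parse_token` helper and the example row are invented for illustration.

```python
from collections import namedtuple

# Column names follow the CoNLL-X layout (an assumption about this file).
Token = namedtuple(
    "Token", "id form lemma cpos pos feats head deprel phead pdeprel"
)


def parse_token(line):
    """Parse one token line, splitting on tabs so spaces inside FORM survive."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 cols[5], int(cols[6]), cols[7], cols[8], cols[9])


# Invented row: only the two-word FORM mirrors the example in the text.
example_row = "5\tکرده باشید\t_\tV\tACT\t_\t0\tROOT\t_\t_"
token = parse_token(example_row)
print(token.form)                # the two-word form, space preserved
print(len(example_row.split()))  # whitespace splitting yields 11 fields
```

Splitting the same row on any whitespace would produce eleven fields and shift every column after the form by one.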

The first sentence of the corpus in the CoNLL format:

1	به	به	PREP	PREP	attachment=ISO|senID=23472	26	ADV	_	_
2	گزارش	گزارش	N	IANM	attachment=ISO|number=SING|senID=23472	1	POSDEP	_	_
3	خبرنگار	خبرنگار	N	ANM	attachment=ISO|number=SING|senID=23472	2	MOZ	_	_
4	مهر	مهر	N	IANM	attachment=ISO|number=SING|senID=23472	3	MOZ	_	_
5	در	در	PREP	PREP	attachment=ISO|senID=23472	3	NPP	_	_
6	گرگان	گرگان	N	IANM	attachment=ISO|number=SING|senID=23472	5	POSDEP	_	_
7	،	،	PUNC	PUNC	attachment=ISO|senID=23472	6	PUNC	_	_
8	بر	بر	PREP	PREP	attachment=ISO|senID=23472	26	ADV	_	_
9	اساس	اساس	N	IANM	attachment=ISO|number=SING|senID=23472	8	POSDEP	_	_
10	باورهای	باور	N	IANM	attachment=ISO|number=PLUR|senID=23472	9	MOZ	_	_
11	دینی	دینی	ADJ	AJP	attachment=ISO|senID=23472	10	NPOSTMOD	_	_
12	ترکمن‌ها	ترکمن	N	ANM	attachment=ISO|number=PLUR|senID=23472	10	MOZ	_	_
13	در	در	PREP	PREP	attachment=ISO|senID=23472	26	ADV	_	_
14	این	این	PREM	DEMAJ	attachment=ISO|senID=23472	15	NPREMOD	_	_
15	روز	روز	N	IANM	attachment=ISO|number=SING|senID=23472	13	POSDEP	_	_
16	برای	برای	PREP	PREP	attachment=ISO|senID=23472	26	NPP	_	_
17	پیامبر	پیامبر	N	ANM	attachment=ISO|number=SING|senID=23472	16	VPP	_	_
18	اکرم	اکرم	ADJ	AJP	attachment=ISO|senID=23472	17	NPOSTMOD	_	_
19	(	(	PUNC	PUNC	attachment=ISO|senID=23472	20	PUNC	_	_
20	ص	ص	ADJ	AJP	attachment=ISO|senID=23472	17	APP	_	_
21	)	)	PUNC	PUNC	attachment=ISO|senID=23472	20	PUNC	_	_
22	ناراحتی	ناراحتی	N	IANM	attachment=ISO|number=SING|senID=23472	26	SBJ	_	_
23	و	و	CONJ	CONJ	attachment=ISO|senID=23472	22	NCONJ	_	_
24	بیماری	بیماری	N	IANM	attachment=ISO|number=SING|senID=23472	23	POSDEP	_	_
25	رخ	رخ	N	IANM	attachment=ISO|number=SING|senID=23472	26	NVE	_	_
26	داد	داد#ده	V	ACT	person=3|attachment=ISO|number=SING|tma=GS|senID=23472	0	ROOT	_	_
27	که	که	SUBR	SUBR	attachment=ISO|senID=23472	26	AJUCL	_	_
28	چند	چند	PREM	AMBAJ	attachment=ISO|senID=23472	29	NPREMOD	_	_
29	روز	روز	N	IANM	attachment=ISO|number=SING|senID=23472	39	ADV	_	_
30	بعد	بعد	ADJ	AJP	attachment=ISO|senID=23472	29	NPOSTMOD	_	_
31	با	با	PREP	PREP	attachment=ISO|senID=23472	39	ADV	_	_
32	رحلت	رحلت	N	IANM	attachment=ISO|number=SING|senID=23472	31	POSDEP	_	_
33	نبی	نبی	N	ANM	attachment=ISO|number=SING|senID=23472	32	MOZ	_	_
34	مکرم	مکرم	ADJ	AJP	attachment=ISO|senID=23472	33	NPOSTMOD	_	_
35	اسلام	اسلام	N	IANM	attachment=ISO|number=SING|senID=23472	33	MOZ	_	_
36	جهان	جهان	N	IANM	attachment=ISO|number=SING|senID=23472	39	SBJ	_	_
37	عزادار	عزادار	ADJ	AJP	attachment=ISO|senID=23472	39	MOS	_	_
38	ماتمش	ماتم	N	IANM	attachment=ISO|number=SING|senID=23472	37	MOZ	_	_
39	شد	کرد#کن	V	PASS	person=3|attachment=ISO|number=SING|tma=GS|senID=23472	27	PRD	_	_
40	.	.	PUNC	PUNC	attachment=ISO|senID=23472	26	PUNC	_	_
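The sixth column of the sample packs several key=value pairs (attachment, number, person, tma, senID) into one field. A small helper can unpack them into a dictionary. This is a sketch only: `parse_feats` is a hypothetical name, the pipe separator is inferred from the sample rather than from the documentation, and the code tolerates a separator escaped as `\|` in case the text was copied from a rendered page.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'number=SING|senID=23472' into a dict."""
    if feats == "_":  # CoNLL placeholder for an empty field
        return {}
    result = {}
    # Normalize an escaped separator before splitting on the pipe.
    for pair in feats.replace("\\|", "|").split("|"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        result[key] = value
    return result


# FEATS value of token 26 in the sample sentence.
feats = parse_feats("person=3|attachment=ISO|number=SING|tma=GS|senID=23472")
print(feats["number"], feats["senID"])
```

All values stay strings; converting `person` or `senID` to integers is left to the caller.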

Nonprojectivities in PDT are relatively rare. Only 3,357 of the 189,572 tokens (1.77%) are attached nonprojectively.
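A count of this kind can be reproduced with a short routine. The sketch below assumes one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head. The exact definition behind the figure above is not stated here, so results may differ slightly; the function names are hypothetical.

```python
def is_dominated_by(heads, node, ancestor):
    """True if `ancestor` lies on the path from `node` up to the root (0)."""
    if ancestor == 0:  # the artificial root dominates every token
        return True
    current = heads[node - 1]
    steps = 0
    while current != 0 and steps < len(heads):  # step limit guards against cycles
        if current == ancestor:
            return True
        current = heads[current - 1]
        steps += 1
    return False


def count_nonprojective(heads):
    """heads[i] is the head of token i+1 (1-based ids, 0 = artificial root)."""
    count = 0
    for dep, head in enumerate(heads, start=1):
        lo, hi = min(dep, head), max(dep, head)
        # Every token strictly between dependent and head must be under the head.
        if any(not is_dominated_by(heads, k, head) for k in range(lo + 1, hi)):
            count += 1
    return count


print(count_nonprojective([2, 0, 2]))     # projective toy tree
print(count_nonprojective([3, 4, 0, 3]))  # token 2 crosses the root token 3
```

Summing the result over all sentences and dividing by the total token count gives the percentage; with the figures above, 3,357 / 189,572 is 1.77%.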

I am not aware of any published results for Persian dependency parsing. Our own experiments yielded an unlabeled attachment score of 86.84% with MaltParser using the stack-lazy algorithm.
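For reference, the unlabeled attachment score is the percentage of tokens whose predicted head matches the gold-standard head. The sketch below counts every token, punctuation included; whether the reported figure excludes punctuation is not stated here, so that is an assumption. The function name is hypothetical.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_heads:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Toy example: three of four heads are correct.
print(unlabeled_attachment_score([2, 0, 2, 3], [2, 0, 2, 2]))
```

To score a whole test set, concatenate the head columns of all sentences before calling the function, so that the score is computed per token rather than averaged per sentence.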
