
Institute of Formal and Applied Linguistics Wiki






Persian (fa)

Persian Dependency Treebank (پیکرۀ وابستگی)

Versions

Obtaining and License

The treebank is available free of charge under the GNU General Public License (GPL), with the additional requirement that the data be used non-commercially. To obtain it, complete the license form; the data will then be sent to you by e-mail. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short:

The Persian Dependency Treebank (PDT) was created by members of the Dadegan Research Group (دادگان, Dādegān), Computer Engineering Department, Iran University of Science and Technology (دانشگاه علم و صنعت ایران), Tehrān (تهران), Iran. The copyright lies with the Supreme Council of Information and Communication Technology (SCICT).

References

Domain

Unknown.

Size

PDT contains 189,572 tokens in 12,455 sentences, i.e. 15.22 tokens per sentence on average. No official training/test split is defined. For our HamleDT experiments, we took the first 12,126 sentences (182,878 tokens) for training and the remaining 329 sentences (6,694 tokens) for testing.
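The split described above can be sketched as follows. This is a minimal illustration, not the actual HamleDT script: it assumes that sentences in the CoNLL file are separated by blank lines, and the helper names (`read_sentences`, `split_train_test`, `count_tokens`) are made up for this example.

```python
def read_sentences(conll_text):
    """Split CoNLL text into sentences; each sentence is a list of token lines."""
    sentences, current = [], []
    for line in conll_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_train_test(sentences, n_train):
    """Take the first n_train sentences for training, the rest for testing."""
    return sentences[:n_train], sentences[n_train:]


def count_tokens(sentences):
    """Total number of token lines in a list of sentences."""
    return sum(len(sentence) for sentence in sentences)


# Sanity checks on the figures quoted in this section.
assert 182878 + 6694 == 189572
assert 12126 + 329 == 12455
assert round(189572 / 12455, 2) == 15.22
```

With the real data, `split_train_test(read_sentences(text), 12126)` would reproduce the sentence counts given above, provided the file order is unchanged.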

Inside

The data are provided in the CoNLL format. The morphosyntactic annotation includes lemmas. Morphosyntactic / part-of-speech tags have been assigned manually. The text contains no diacritics marking short vowels (these are not normally written in Persian).

Tokenization is subordinated to the needs of representing syntactic relations. Some orthographic words have been split into several tokens (e.g. a verb and its object). Conversely, some tree nodes (tokens) consist of two orthographic words; these are not joined with an underscore, so the token contains a space. An example is the analytical form of the subjunctive preterite: “کرده باشید” (karde bāšid).
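Because a token may contain a space, a reader of the data should not split lines on arbitrary whitespace. The sketch below assumes the files use the tab-separated ten-column CoNLL-X layout; the field names and the example line are illustrative and not copied from the corpus.

```python
# Column names follow the CoNLL-X convention (an assumption about the files).
CONLL_FIELDS = ["id", "form", "lemma", "cpostag", "postag",
                "feats", "head", "deprel", "phead", "pdeprel"]


def parse_token(line):
    """Parse one token line, splitting on tabs only so that spaces
    inside the word form or lemma are preserved."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(CONLL_FIELDS):
        raise ValueError(
            "expected %d columns, got %d" % (len(CONLL_FIELDS), len(columns)))
    token = dict(zip(CONLL_FIELDS, columns))
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    return token


# Hypothetical token line with a two-word form (not taken from the corpus).
example = "1\tکرده باشید\t_\tV\t_\t_\t0\tROOT\t_\t_"
token = parse_token(example)
assert token["form"] == "کرده باشید"
# Naive whitespace splitting would produce one column too many.
assert len(example.split()) == 11
```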

Sample

The first sentence of the corpus in the CoNLL format:

1 به به PREP PREP attachment=ISO|senID=23472 26 ADV _ _
2 گزارش گزارش N IANM attachment=ISO|number=SING|senID=23472 1 POSDEP _ _
3 خبرنگار خبرنگار N ANM attachment=ISO|number=SING|senID=23472 2 MOZ _ _
4 مهر مهر N IANM attachment=ISO|number=SING|senID=23472 3 MOZ _ _
5 در در PREP PREP attachment=ISO|senID=23472 3 NPP _ _
6 گرگان گرگان N IANM attachment=ISO|number=SING|senID=23472 5 POSDEP _ _
7 ، ، PUNC PUNC attachment=ISO|senID=23472 6 PUNC _ _
8 بر بر PREP PREP attachment=ISO|senID=23472 26 ADV _ _
9 اساس اساس N IANM attachment=ISO|number=SING|senID=23472 8 POSDEP _ _
10 باورهای باور N IANM attachment=ISO|number=PLUR|senID=23472 9 MOZ _ _
11 دینی دینی ADJ AJP attachment=ISO|senID=23472 10 NPOSTMOD _ _
12 ترکمن‌ها ترکمن N ANM attachment=ISO|number=PLUR|senID=23472 10 MOZ _ _
13 در در PREP PREP attachment=ISO|senID=23472 26 ADV _ _
14 این این PREM DEMAJ attachment=ISO|senID=23472 15 NPREMOD _ _
15 روز روز N IANM attachment=ISO|number=SING|senID=23472 13 POSDEP _ _
16 برای برای PREP PREP attachment=ISO|senID=23472 26 NPP _ _
17 پیامبر پیامبر N ANM attachment=ISO|number=SING|senID=23472 16 VPP _ _
18 اکرم اکرم ADJ AJP attachment=ISO|senID=23472 17 NPOSTMOD _ _
19 ( ( PUNC PUNC attachment=ISO|senID=23472 20 PUNC _ _
20 ص ص ADJ AJP attachment=ISO|senID=23472 17 APP _ _
21 ) ) PUNC PUNC attachment=ISO|senID=23472 20 PUNC _ _
22 ناراحتی ناراحتی N IANM attachment=ISO|number=SING|senID=23472 26 SBJ _ _
23 و و CONJ CONJ attachment=ISO|senID=23472 22 NCONJ _ _
24 بیماری بیماری N IANM attachment=ISO|number=SING|senID=23472 23 POSDEP _ _
25 رخ رخ N IANM attachment=ISO|number=SING|senID=23472 26 NVE _ _
26 داد داد#ده V ACT person=3|attachment=ISO|number=SING|tma=GS|senID=23472 0 ROOT _ _
27 که که SUBR SUBR attachment=ISO|senID=23472 26 AJUCL _ _
28 چند چند PREM AMBAJ attachment=ISO|senID=23472 29 NPREMOD _ _
29 روز روز N IANM attachment=ISO|number=SING|senID=23472 39 ADV _ _
30 بعد بعد ADJ AJP attachment=ISO|senID=23472 29 NPOSTMOD _ _
31 با با PREP PREP attachment=ISO|senID=23472 39 ADV _ _
32 رحلت رحلت N IANM attachment=ISO|number=SING|senID=23472 31 POSDEP _ _
33 نبی نبی N ANM attachment=ISO|number=SING|senID=23472 32 MOZ _ _
34 مکرم مکرم ADJ AJP attachment=ISO|senID=23472 33 NPOSTMOD _ _
35 اسلام اسلام N IANM attachment=ISO|number=SING|senID=23472 33 MOZ _ _
36 جهان جهان N IANM attachment=ISO|number=SING|senID=23472 39 SBJ _ _
37 عزادار عزادار ADJ AJP attachment=ISO|senID=23472 39 MOS _ _
38 ماتمش ماتم N IANM attachment=ISO|number=SING|senID=23472 37 MOZ _ _
39 شد کرد#کن V PASS person=3|attachment=ISO|number=SING|tma=GS|senID=23472 27 PRD _ _
40 . . PUNC PUNC attachment=ISO|senID=23472 26 PUNC _ _
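The sixth column of the sample holds features as `name=value` pairs separated by `|`. A minimal sketch for turning that column into a dictionary follows; the function name `parse_feats` is made up for this example, and the treatment of `_` as "no features" follows the general CoNLL convention rather than anything stated for this corpus.

```python
def parse_feats(feats):
    """Convert a FEATS string such as 'number=SING|senID=23472' to a dict."""
    if feats == "_":
        return {}
    # Split each pair on the first '=' only, in case a value contains '='.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature string of token 26 (داد) in the sample above.
feats = parse_feats("person=3|attachment=ISO|number=SING|tma=GS|senID=23472")
assert feats["person"] == "3"
assert feats["senID"] == "23472"
```

Note that all values are kept as strings; converting `person` or `senID` to integers is left to the caller.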

Parsing

Nonprojective attachments are relatively rare in PDT: only 3,357 of the 189,572 tokens (1.77%) are attached nonprojectively.
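One common way to count such tokens is sketched below. It assumes the usual definition (a token is attached nonprojectively if some token lying between it and its head is not dominated by that head) and an artificial root at position 0; the exact procedure used to obtain the figure above is not documented here, so this is an illustration only.

```python
def is_descendant(heads, node, ancestor):
    """True if `ancestor` lies on the path from `node` up to the root (0).
    Assumes `heads` encodes a tree, i.e. contains no cycles."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads):
    """Return the ids of tokens attached nonprojectively.

    heads[i] is the head of token i (tokens are numbered from 1);
    heads[0] is an unused placeholder for the artificial root."""
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = min(dep, head), max(dep, head)
        # Every token strictly between dep and head must be dominated by head.
        if any(not is_descendant(heads, k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Toy tree: token 2 depends on token 4, but token 3 in between does not.
assert nonprojective_tokens([0, 3, 4, 0, 3]) == [2]
# Share quoted in this section.
assert round(100 * 3357 / 189572, 2) == 1.77
```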

I am not aware of any published results on Persian dependency parsing. Our own experiments gave an unlabeled attachment score of 86.84% with MaltParser, using the stack-lazy algorithm.
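For reference, the unlabeled attachment score is the percentage of tokens whose predicted head matches the gold-standard head. The sketch below assumes every token is scored, including punctuation; whether the reported figure counts punctuation is not stated here, and the function name `unlabeled_attachment_score` is made up for this example.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head equals the gold head.

    Both arguments are sequences of head indices, one per token,
    concatenated over all test sentences."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_heads:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for g, p in zip(gold_heads, predicted_heads) if g == p)
    return 100.0 * correct / len(gold_heads)


# Toy example: three of four heads are predicted correctly.
assert unlabeled_attachment_score([2, 0, 2, 3], [2, 0, 2, 2]) == 75.0
```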

