Persian Dependency Treebank (پیکرۀ وابستگی)
The treebank is available free of charge under the GNU General Public License (GPL), with the additional requirement that the data be used non-commercially. To obtain the data, contact the Dadegan Research Group through their online form at http://dadegan.ir/en/contact-us. The license in short:
PDT was created by members of the Dadegan Research Group (دادگان, Dādegān), Computer Engineering Department, Iran University of Science and Technology (دانشگاه علم و صنعت ایران), Tehrān تهران, Iran. The copyright lies with the Supreme Council of Information and Communication Technology (SCICT).
Unknown.
PDT contains 189,572 tokens in 12,455 sentences, yielding 15.22 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 182,878 tokens / 12,126 sentences for training and the remaining 6,694 tokens / 329 sentences for testing.
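The split described above is purely positional, so it can be reproduced by slicing the sentence list. The following is a minimal sketch, not the script used for HamleDT; the function names are hypothetical, and the constants are the figures quoted above.

```python
def split_corpus(sentences, n_train):
    """Split a list of sentences into (train, test) by position.

    The first n_train sentences go to training, the rest to testing.
    """
    return sentences[:n_train], sentences[n_train:]


def tokens_per_sentence(n_tokens, n_sentences):
    """Average sentence length in tokens."""
    return n_tokens / n_sentences


# Figures quoted in the text above.
TOTAL_TOKENS, TOTAL_SENTENCES = 189572, 12455
TRAIN_TOKENS, TRAIN_SENTENCES = 182878, 12126
TEST_TOKENS, TEST_SENTENCES = 6694, 329

average = round(tokens_per_sentence(TOTAL_TOKENS, TOTAL_SENTENCES), 2)
print(average)
```

With the full corpus loaded as a list of sentences, `split_corpus(sentences, TRAIN_SENTENCES)` would give the training and test portions described above.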
The data is provided in the CoNLL format. The morphosyntactic annotation contains lemmas. Morphosyntactic / part-of-speech tags have been assigned manually. The text does not contain diacritics marking short vowels, which are not normally written in Persian text.
Tokenization is subordinated to the needs of the syntactic annotation. Some orthographic words have been split into several tokens (e.g. a verb and its object). Conversely, a single tree node (token) sometimes consists of two orthographic words; these are not joined with an underscore, so the token contains a space. An example is the analytic form of the subjunctive preterite, “کرده باشید” (karde bāšid).
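Because a token may contain a space, a reader that splits lines on arbitrary whitespace will misalign the columns. The sketch below assumes the distributed file separates columns with tabs, as CoNLL-X files normally do; the sample line is made up for illustration (only the word form comes from the text, the other columns are placeholders), and `split_conll_line` is a hypothetical helper.

```python
def split_conll_line(line):
    """Split one CoNLL line into its columns on tabs only."""
    return line.rstrip("\n").split("\t")


# Hypothetical line for illustration; only the word form comes from the text.
line = "1\tکرده باشید\t_\tV\t_\t_\t0\tROOT\t_\t_\n"

fields = split_conll_line(line)  # splits on tabs: the form stays in one piece
naive = line.split()             # splits on any whitespace: the form is broken

print(len(fields), len(naive))
```

The tab-based split yields the expected ten columns, while the whitespace split yields eleven because the two-word form is cut in half.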
The first sentence of the corpus in the CoNLL format:
1 | به | به | PREP | PREP | attachment=ISO|senID=23472 | 26 | ADV | _ | _ |
2 | گزارش | گزارش | N | IANM | attachment=ISO|number=SING|senID=23472 | 1 | POSDEP | _ | _ |
3 | خبرنگار | خبرنگار | N | ANM | attachment=ISO|number=SING|senID=23472 | 2 | MOZ | _ | _ |
4 | مهر | مهر | N | IANM | attachment=ISO|number=SING|senID=23472 | 3 | MOZ | _ | _ |
5 | در | در | PREP | PREP | attachment=ISO|senID=23472 | 3 | NPP | _ | _ |
6 | گرگان | گرگان | N | IANM | attachment=ISO|number=SING|senID=23472 | 5 | POSDEP | _ | _ |
7 | ، | ، | PUNC | PUNC | attachment=ISO|senID=23472 | 6 | PUNC | _ | _ |
8 | بر | بر | PREP | PREP | attachment=ISO|senID=23472 | 26 | ADV | _ | _ |
9 | اساس | اساس | N | IANM | attachment=ISO|number=SING|senID=23472 | 8 | POSDEP | _ | _ |
10 | باورهای | باور | N | IANM | attachment=ISO|number=PLUR|senID=23472 | 9 | MOZ | _ | _ |
11 | دینی | دینی | ADJ | AJP | attachment=ISO|senID=23472 | 10 | NPOSTMOD | _ | _ |
12 | ترکمنها | ترکمن | N | ANM | attachment=ISO|number=PLUR|senID=23472 | 10 | MOZ | _ | _ |
13 | در | در | PREP | PREP | attachment=ISO|senID=23472 | 26 | ADV | _ | _ |
14 | این | این | PREM | DEMAJ | attachment=ISO|senID=23472 | 15 | NPREMOD | _ | _ |
15 | روز | روز | N | IANM | attachment=ISO|number=SING|senID=23472 | 13 | POSDEP | _ | _ |
16 | برای | برای | PREP | PREP | attachment=ISO|senID=23472 | 26 | NPP | _ | _ |
17 | پیامبر | پیامبر | N | ANM | attachment=ISO|number=SING|senID=23472 | 16 | VPP | _ | _ |
18 | اکرم | اکرم | ADJ | AJP | attachment=ISO|senID=23472 | 17 | NPOSTMOD | _ | _ |
19 | ( | ( | PUNC | PUNC | attachment=ISO|senID=23472 | 20 | PUNC | _ | _ |
20 | ص | ص | ADJ | AJP | attachment=ISO|senID=23472 | 17 | APP | _ | _ |
21 | ) | ) | PUNC | PUNC | attachment=ISO|senID=23472 | 20 | PUNC | _ | _ |
22 | ناراحتی | ناراحتی | N | IANM | attachment=ISO|number=SING|senID=23472 | 26 | SBJ | _ | _ |
23 | و | و | CONJ | CONJ | attachment=ISO|senID=23472 | 22 | NCONJ | _ | _ |
24 | بیماری | بیماری | N | IANM | attachment=ISO|number=SING|senID=23472 | 23 | POSDEP | _ | _ |
25 | رخ | رخ | N | IANM | attachment=ISO|number=SING|senID=23472 | 26 | NVE | _ | _ |
26 | داد | داد#ده | V | ACT | person=3|attachment=ISO|number=SING|tma=GS|senID=23472 | 0 | ROOT | _ | _ |
27 | که | که | SUBR | SUBR | attachment=ISO|senID=23472 | 26 | AJUCL | _ | _ |
28 | چند | چند | PREM | AMBAJ | attachment=ISO|senID=23472 | 29 | NPREMOD | _ | _ |
29 | روز | روز | N | IANM | attachment=ISO|number=SING|senID=23472 | 39 | ADV | _ | _ |
30 | بعد | بعد | ADJ | AJP | attachment=ISO|senID=23472 | 29 | NPOSTMOD | _ | _ |
31 | با | با | PREP | PREP | attachment=ISO|senID=23472 | 39 | ADV | _ | _ |
32 | رحلت | رحلت | N | IANM | attachment=ISO|number=SING|senID=23472 | 31 | POSDEP | _ | _ |
33 | نبی | نبی | N | ANM | attachment=ISO|number=SING|senID=23472 | 32 | MOZ | _ | _ |
34 | مکرم | مکرم | ADJ | AJP | attachment=ISO|senID=23472 | 33 | NPOSTMOD | _ | _ |
35 | اسلام | اسلام | N | IANM | attachment=ISO|number=SING|senID=23472 | 33 | MOZ | _ | _ |
36 | جهان | جهان | N | IANM | attachment=ISO|number=SING|senID=23472 | 39 | SBJ | _ | _ |
37 | عزادار | عزادار | ADJ | AJP | attachment=ISO|senID=23472 | 39 | MOS | _ | _ |
38 | ماتمش | ماتم | N | IANM | attachment=ISO|number=SING|senID=23472 | 37 | MOZ | _ | _ |
39 | شد | کرد#کن | V | PASS | person=3|attachment=ISO|number=SING|tma=GS|senID=23472 | 27 | PRD | _ | _ |
40 | . | . | PUNC | PUNC | attachment=ISO|senID=23472 | 26 | PUNC | _ | _ |
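The sample can be read into a simple record per token. The sketch below assumes the standard CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and tab-separated columns in the actual file; the `Token` class and `parse_token` function are hypothetical names. Note that the pipe inside the features column separates individual features and is not a column boundary.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: dict
    head: int
    deprel: str


def parse_token(line):
    """Parse one tab-separated CoNLL line into a Token (first 8 columns)."""
    cols = line.rstrip("\n").split("\t")
    feats = {}
    if cols[5] != "_":
        # Features are separated by "|" and each has the form key=value.
        for pair in cols[5].split("|"):
            key, _, value = pair.partition("=")
            feats[key] = value
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


# Token 26 of the sample sentence, with tabs as column separators.
line = ("26\tداد\tداد#ده\tV\tACT\t"
        "person=3|attachment=ISO|number=SING|tma=GS|senID=23472\t"
        "0\tROOT\t_\t_")
token = parse_token(line)
print(token.form, token.head, token.deprel, token.feats["tma"])
```

Keeping the features as a dictionary makes it easy to look up a single attribute such as `number` or `senID` without further string handling.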
Nonprojective attachments are relatively rare in PDT: only 3,357 of the 189,572 tokens (1.77%) are attached nonprojectively.
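One common way to count such tokens is to call a token nonprojectively attached when some token lying between it and its head is not a descendant of that head. The sketch below follows this definition; it is an assumption that the figure above was computed the same way. The function names are hypothetical, the tree is assumed to be well formed (no cycles), and the four-token tree is a toy example, not corpus data.

```python
def is_descendant(node, ancestor, heads):
    """True if `ancestor` is reached from `node` by following head links.

    Assumes a well-formed tree: every chain of heads ends at the root (0).
    """
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads):
    """Return the ids of nonprojectively attached tokens.

    heads[i] is the head of token i (1-based); heads[0] is a dummy entry
    standing for the artificial root.
    """
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = min(dep, head), max(dep, head)
        # Every token strictly between dep and head must be dominated by head.
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Toy tree: token 1 depends on 3, token 2 is the root, 3 and 4 depend on 2.
toy_heads = [0, 3, 0, 2, 2]
print(nonprojective_tokens(toy_heads))

# Share of nonprojective tokens, from the figures quoted above.
print(round(100 * 3357 / 189572, 2))
```

In the toy tree, token 1 is nonprojective because token 2 lies between it and its head (token 3) but is not dominated by token 3.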
I am not aware of any published results for Persian dependency parsing. Our own experiments yielded an unlabeled attachment score of 86.84% with MaltParser using the stack-lazy algorithm.
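For reference, the unlabeled attachment score is the percentage of tokens whose predicted head matches the gold head. The sketch below shows the calculation on made-up head lists; the function name is hypothetical, and whether punctuation tokens were counted in the figure above is not stated, so this version simply counts every token.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_heads:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Made-up example: four tokens, one wrongly attached.
gold = [2, 0, 2, 3]
predicted = [2, 0, 2, 2]
print(unlabeled_attachment_score(gold, predicted))
```

To score a whole test set, the head lists of all sentences would be concatenated before calling the function, so that the score is computed per token rather than per sentence.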