Persian (fa)
Persian Dependency Treebank (پیکرۀ وابستگی)
Versions
- “Preversion” 0.1 (January 2012, 12,200 annotated sentences)
- Final version (expected fall 2012, 30,000 sentences)
Obtaining and License
The treebank is available free of charge under the GNU General Public License, with the additional requirement that the data be used non-commercially. To obtain it, complete the license form; the data will then be sent to you by e-mail. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short:
- non-commercial usage
- redistribution permitted under the same license
- citation of publications is not explicitly required, but it is common courtesy
The Persian Dependency Treebank (PDT) was created by members of the Dadegan Research Group (دادگان, Dādegān), Computer Engineering Department, Iran University of Science and Technology (دانشگاه علم و صنعت ایران), Tehran (تهران), Iran. The copyright lies with the Supreme Council of Information and Communication Technology (SCICT).
References
- Website
- Data
- no separate citation
- Principal publications
- Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.
- Documentation
- Attached to the data distribution: Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide, Dadegan Research Group, Tehran, Iran, 2012. (http://dadegan.ir/sites/default/files/Persian%20Dependency%20Treebank%20Version%200.1%20Annotation%20Manual%20and%20User%20Guide.pdf)
Domain
Unknown.
Size
PDT contains 189,572 tokens in 12,455 sentences, yielding 15.22 tokens per sentence on average. No official training/test data split is defined. For our HamleDT experiments, we took the first 182,878 tokens / 12,126 sentences for training and the remaining 6,694 tokens / 329 sentences for testing.
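The split described above can be sketched in a few lines of Python. This is only an illustration, assuming the usual CoNLL convention that sentences are separated by blank lines; the function names `read_sentences`, `split_corpus` and `tokens_per_sentence` are hypothetical and not part of HamleDT or the PDT distribution.

```python
from typing import Iterable, Iterator, List, Tuple


def read_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each sentence as a list of token lines (assumes blank-line separators)."""
    sentence: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence


def split_corpus(lines: Iterable[str], n_train: int) -> Tuple[List[List[str]], List[List[str]]]:
    """Put the first n_train sentences into training, the rest into test."""
    sentences = list(read_sentences(lines))
    return sentences[:n_train], sentences[n_train:]


def tokens_per_sentence(sentences: List[List[str]]) -> float:
    """Average number of token lines per sentence."""
    return sum(len(s) for s in sentences) / len(sentences)
```

With the real data, `split_corpus(open(path, encoding="utf-8"), 12126)` would correspond to the split above (where `path` is whatever file name the distribution uses), and `189572 / 12455` rounds to the quoted 15.22 tokens per sentence.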
Inside
The data is provided in the CoNLL format. The morphosyntactic annotation includes lemmas. Part-of-speech tags have been assigned manually. The text does not contain diacritical marks for short vowels (these are not normally written in Persian).
Tokenization is subordinated to the needs of the syntactic annotation. Some orthographic words have been split into several tokens (e.g. a verb and its object). Elsewhere a single tree node (token) consists of two orthographic words; these are not joined by an underscore, so the token contains a space. An example is the analytical form of the subjunctive preterite, “کرده باشید” (korde bāšid).
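Because a token may contain a space, code that reads the data should split columns on the TAB character only, not on arbitrary whitespace. The sketch below assumes the distributed files follow the standard ten-column, TAB-separated CoNLL-X layout; the helper name `parse_token_line` is hypothetical, and the example line is made up for illustration (only the word form is taken from the text above, all other columns are placeholders).

```python
from typing import List


def parse_token_line(line: str) -> List[str]:
    """Split one CoNLL token line on TAB only, so spaces inside a field survive."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return fields


# Illustrative line, not taken from the corpus: a two-word form, placeholders elsewhere.
example = "1\tکرده باشید\t_\t_\t_\t_\t0\t_\t_\t_\n"

fields = parse_token_line(example)
naive = example.split()  # whitespace splitting breaks the two-word form apart
```

Here `fields` has ten entries with the two-word form intact in `fields[1]`, whereas `naive` has eleven entries because the form was cut at the space.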
Sample
The first sentence of the corpus in the CoNLL format:
1 | به | به | PREP | PREP | attachment=ISO|senID=23472 | 26 | ADV | _ | _ |
2 | گزارش | گزارش | N | IANM | attachment=ISO|number=SING|senID=23472 | 1 | POSDEP | _ | _ |
3 | خبرنگار | خبرنگار | N | ANM | attachment=ISO|number=SING|senID=23472 | 2 | MOZ | _ | _ |
4 | مهر | مهر | N | IANM | attachment=ISO|number=SING|senID=23472 | 3 | MOZ | _ | _ |
5 | در | در | PREP | PREP | attachment=ISO|senID=23472 | 3 | NPP | _ | _ |
6 | گرگان | گرگان | N | IANM | attachment=ISO|number=SING|senID=23472 | 5 | POSDEP | _ | _ |
7 | ، | ، | PUNC | PUNC | attachment=ISO|senID=23472 | 6 | PUNC | _ | _ |
8 | بر | بر | PREP | PREP | attachment=ISO|senID=23472 | 26 | ADV | _ | _ |
9 | اساس | اساس | N | IANM | attachment=ISO|number=SING|senID=23472 | 8 | POSDEP | _ | _ |
10 | باورهای | باور | N | IANM | attachment=ISO|number=PLUR|senID=23472 | 9 | MOZ | _ | _ |
11 | دینی | دینی | ADJ | AJP | attachment=ISO|senID=23472 | 10 | NPOSTMOD | _ | _ |
12 | ترکمنها | ترکمن | N | ANM | attachment=ISO|number=PLUR|senID=23472 | 10 | MOZ | _ | _ |
13 | در | در | PREP | PREP | attachment=ISO|senID=23472 | 26 | ADV | _ | _ |
14 | این | این | PREM | DEMAJ | attachment=ISO|senID=23472 | 15 | NPREMOD | _ | _ |
15 | روز | روز | N | IANM | attachment=ISO|number=SING|senID=23472 | 13 | POSDEP | _ | _ |
16 | برای | برای | PREP | PREP | attachment=ISO|senID=23472 | 26 | NPP | _ | _ |
17 | پیامبر | پیامبر | N | ANM | attachment=ISO|number=SING|senID=23472 | 16 | VPP | _ | _ |
18 | اکرم | اکرم | ADJ | AJP | attachment=ISO|senID=23472 | 17 | NPOSTMOD | _ | _ |
19 | ( | ( | PUNC | PUNC | attachment=ISO|senID=23472 | 20 | PUNC | _ | _ |
20 | ص | ص | ADJ | AJP | attachment=ISO|senID=23472 | 17 | APP | _ | _ |
21 | ) | ) | PUNC | PUNC | attachment=ISO|senID=23472 | 20 | PUNC | _ | _ |
22 | ناراحتی | ناراحتی | N | IANM | attachment=ISO|number=SING|senID=23472 | 26 | SBJ | _ | _ |
23 | و | و | CONJ | CONJ | attachment=ISO|senID=23472 | 22 | NCONJ | _ | _ |
24 | بیماری | بیماری | N | IANM | attachment=ISO|number=SING|senID=23472 | 23 | POSDEP | _ | _ |
25 | رخ | رخ | N | IANM | attachment=ISO|number=SING|senID=23472 | 26 | NVE | _ | _ |
26 | داد | داد#ده | V | ACT | person=3|attachment=ISO|number=SING|tma=GS|senID=23472 | 0 | ROOT | _ | _ |
27 | که | که | SUBR | SUBR | attachment=ISO|senID=23472 | 26 | AJUCL | _ | _ |
28 | چند | چند | PREM | AMBAJ | attachment=ISO|senID=23472 | 29 | NPREMOD | _ | _ |
29 | روز | روز | N | IANM | attachment=ISO|number=SING|senID=23472 | 39 | ADV | _ | _ |
30 | بعد | بعد | ADJ | AJP | attachment=ISO|senID=23472 | 29 | NPOSTMOD | _ | _ |
31 | با | با | PREP | PREP | attachment=ISO|senID=23472 | 39 | ADV | _ | _ |
32 | رحلت | رحلت | N | IANM | attachment=ISO|number=SING|senID=23472 | 31 | POSDEP | _ | _ |
33 | نبی | نبی | N | ANM | attachment=ISO|number=SING|senID=23472 | 32 | MOZ | _ | _ |
34 | مکرم | مکرم | ADJ | AJP | attachment=ISO|senID=23472 | 33 | NPOSTMOD | _ | _ |
35 | اسلام | اسلام | N | IANM | attachment=ISO|number=SING|senID=23472 | 33 | MOZ | _ | _ |
36 | جهان | جهان | N | IANM | attachment=ISO|number=SING|senID=23472 | 39 | SBJ | _ | _ |
37 | عزادار | عزادار | ADJ | AJP | attachment=ISO|senID=23472 | 39 | MOS | _ | _ |
38 | ماتمش | ماتم | N | IANM | attachment=ISO|number=SING|senID=23472 | 37 | MOZ | _ | _ |
39 | شد | کرد#کن | V | PASS | person=3|attachment=ISO|number=SING|tma=GS|senID=23472 | 27 | PRD | _ | _ |
40 | . | . | PUNC | PUNC | attachment=ISO|senID=23472 | 26 | PUNC | _ | _ |
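The sixth column of the sample packs several attribute=value pairs into one string, separated by vertical bars. A minimal sketch for unpacking it follows; the function name `parse_feats` is hypothetical, and treating a bare underscore as "no features" is an assumption based on general CoNLL practice rather than on the PDT manual.

```python
from typing import Dict


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'a=x|b=y' into {'a': 'x', 'b': 'y'}; '_' is read as an empty feature set."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


# Feature string of token 26 in the sample sentence above.
verb_feats = parse_feats("person=3|attachment=ISO|number=SING|tma=GS|senID=23472")
```

After this call, `verb_feats["tma"]` is `"GS"` and `verb_feats["senID"]` is `"23472"`; all values stay strings.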
Parsing
Nonprojective attachments in PDT are relatively rare: only 3,357 of the 189,572 tokens (1.77%) are attached nonprojectively.
I am not aware of any published results of Persian dependency parsing. Our own experiments gave an unlabeled attachment score of 86.84% with MaltParser using the stack-lazy algorithm.
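One common way to count such tokens is sketched below. It assumes the usual definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. The source does not state which exact definition produced the figure above, so this is an illustration rather than a reproduction of that count; the function names are hypothetical.

```python
from typing import List


def is_descendant(heads: List[int], node: int, ancestor: int) -> bool:
    """Follow head links upward from node; True if ancestor is reached.

    heads[i - 1] is the head of token i, and 0 denotes the artificial root.
    The step bound guards against cycles in malformed input.
    """
    steps = 0
    while node != 0 and steps <= len(heads):
        node = heads[node - 1]
        if node == ancestor:
            return True
        steps += 1
    return False


def count_nonprojective(heads: List[int]) -> int:
    """Count tokens whose attachment to their head is nonprojective."""
    count = 0
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue  # every token descends from the artificial root
        lo, hi = sorted((dep, head))
        if any(not is_descendant(heads, k, head) for k in range(lo + 1, hi)):
            count += 1
    return count
```

For example, with `heads = [3, 4, 0, 3]` token 2 depends on token 4, but token 3 lies between them and does not descend from token 4, so exactly one token is counted.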
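For reference, the unlabeled attachment score is the percentage of tokens whose predicted head matches the gold-standard head. The sketch below is a plain Python illustration, not the evaluation script used in the experiments; whether punctuation tokens were counted there is not stated, and this version counts every token. The function name is hypothetical.

```python
from typing import Sequence


def unlabeled_attachment_score(gold_heads: Sequence[int], predicted_heads: Sequence[int]) -> float:
    """Percentage of tokens whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_heads:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)
```

Calling it on the concatenated head columns of all test sentences gives a corpus-level score; for instance, three correct heads out of four yield 75.0.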