
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fa [2012/01/29 18:19]
zeman Update. I have seen the data!
user:zeman:treebanks:fa [2012/03/13 21:12]
zeman Link to the original version of the annotation manual.
Line 27:
     * Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.
   * Documentation
-    * Attached to the data distribution: {{:user:zeman:treebanks:persian-dependency-treebank-version-0.1-annotation-manual-and-user-guide.pdf|Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide}}, Dadegan Research Group, Tehran, Iran, 2012.
+    * Attached to the data distribution: {{:user:zeman:treebanks:persian-dependency-treebank-version-0.1-annotation-manual-and-user-guide.pdf|Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide}}, Dadegan Research Group, Tehran, Iran, 2012. (http://dadegan.ir/sites/default/files/Persian%20Dependency%20Treebank%20Version%200.1%20Annotation%20Manual%20and%20User%20Guide.pdf)
  
 ==== Domain ====
Line 35:
 ==== Size ====
  
-12200 annotated sentences.
+The Persian Dependency Treebank (PDT) contains 189,572 tokens in 12,455 sentences, i.e. 15.22 tokens per sentence on average. No official training/test data split is defined. For our HamleDT experiments, we took the first 12,126 sentences (182,878 tokens) for training and the remaining 329 sentences (6,694 tokens) for testing.
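
The counts quoted above can be cross-checked with a few lines of Python. This is only a sketch: the variable names are illustrative and are not part of the treebank distribution or of HamleDT.

```python
# Counts quoted in the Size section.
total_tokens, total_sentences = 189_572, 12_455
train_tokens, train_sentences = 182_878, 12_126
test_tokens, test_sentences = 6_694, 329

# The training and test portions should add up to the whole corpus.
assert train_tokens + test_tokens == total_tokens
assert train_sentences + test_sentences == total_sentences

# Average sentence length, rounded to two decimals.
avg_tokens_per_sentence = round(total_tokens / total_sentences, 2)
print(avg_tokens_per_sentence)
```

Because no official split exists, anyone reproducing the experiments needs the sentence boundary (the first 12,126 sentences) rather than a token offset.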
  
 ==== Inside ====
  
 Provided in the [[:format-conll|CoNLL data format]]. The morphosyntactic annotation contains lemmas. Morphosyntactic / part-of-speech tags have been assigned manually. The text does not contain diacritical marks to distinguish short vowels (not normally shown in Persian text).
 +
 +Tokenization is driven by the needs of the syntactic annotation. Some orthographic words have been split into several tokens (e.g. a verb and its object). Conversely, a single tree node (token) may consist of two orthographic words; these are not joined with an underscore, so the token itself contains a space (e.g. the analytical form of the subjunctive preterite: "کرده باشید" //korde bašäd//).
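
Tokens with an inner space matter for anyone reading the file programmatically: splitting a line on arbitrary whitespace would cut such a token in two and shift all following columns. The sketch below assumes that the columns are separated by TAB characters, as is usual for CoNLL files; the function name `parse_conll_line` and the example row are hypothetical and serve only as an illustration.

```python
def parse_conll_line(line):
    """Split one CoNLL token line into its ten columns.

    Only the TAB character is treated as a separator, so a FORM that
    contains a space stays in one piece.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return columns


# Hypothetical row built for illustration only; it is not copied from the corpus.
example = "1\tکرده باشید\t_\tV\t_\t_\t0\tROOT\t_\t_\n"

columns = parse_conll_line(example)
print(columns[1])            # the two-word form, inner space preserved
print(len(example.split()))  # naive whitespace splitting yields 11 fields
```

Checking the column count on every line is a cheap way to notice early if a tool in the pipeline has treated the space as a separator.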
  
 ==== Sample ====
 +
 +The first sentence of the corpus in the CoNLL format:
 +
 +| 1 | به | به | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | گزارش | گزارش | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 1 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | خبرنگار | خبرنگار | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 2 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | مهر | مهر | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 3 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 3 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | گرگان | گرگان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 5 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | <nowiki>،</nowiki> | <nowiki>،</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | بر | بر | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | اساس | اساس | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 8 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 10 | باورهای | باور | N | IANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 9 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 11 | دینی | دینی | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 10 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 12 | <nowiki>ترکمن‌ها</nowiki> | ترکمن | N | ANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 10 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 13 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 14 | این | این | PREM | DEMAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 15 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 15 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 13 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 16 | برای | برای | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 17 | پیامبر | پیامبر | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 16 | VPP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 18 | اکرم | اکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 19 | <nowiki>(</nowiki> | <nowiki>(</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 20 | ص | ص | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | APP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 21 | <nowiki>)</nowiki> | <nowiki>)</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 22 | ناراحتی | ناراحتی | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 23 | و | و | CONJ | CONJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 22 | NCONJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 24 | بیماری | بیماری | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 23 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 25 | رخ | رخ | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | NVE | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 26 | داد | <nowiki>داد#ده</nowiki> | V | ACT | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 27 | که | که | SUBR | SUBR | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | AJUCL | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 28 | چند | چند | PREM | AMBAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 29 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 30 | بعد | بعد | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 31 | با | با | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 32 | رحلت | رحلت | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 31 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 33 | نبی | نبی | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 32 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 34 | مکرم | مکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 33 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 35 | اسلام | اسلام | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 33 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 36 | جهان | جهان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 37 | عزادار | عزادار | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | MOS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 38 | ماتمش | ماتم | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 37 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 39 | شد | <nowiki>کرد#کن</nowiki> | V | PASS | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 27 | PRD | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 40 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
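
The sixth column of the sample packs several attribute-value pairs into one string, separated by vertical bars (the ''nowiki'' wrappers are wiki markup only and are assumed not to be present in the data file). A minimal sketch of unpacking that column follows; the helper name `parse_feats` is hypothetical.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'number=SING|senID=23472' into a dict."""
    if feats in ("", "_"):
        return {}
    # Split on '|' into pairs, then on the first '=' into name and value.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# FEATS value of token 26 in the sample above.
feats = parse_feats("person=3|attachment=ISO|number=SING|tma=GS|senID=23472")
print(feats["tma"], feats["senID"])
```

All values stay strings here; converting ''senID'' or ''person'' to integers is left to the caller.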
  
 ==== Parsing ====
