Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fa [2012/01/28 23:04]
zeman Some more changes.
user:zeman:treebanks:fa [2012/03/19 17:18]
zeman Nonprojectivity and parsing.
Line 10: Line 10:
 ==== Obtaining and License ==== ==== Obtaining and License ====
  
-The treebank is available for free after completing the [[http://dadegan.ir/en/content/user-agreement-persian-dependency-treebank|license form]]. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short: +The treebank is available for free under the GNU GPL (with the additional requirement that the data be used non-commercially). Complete the [[http://dadegan.ir/en/content/user-agreement-persian-dependency-treebank|license form]] and they will send you the data by e-mail. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short:
  
-  * non-commercial research usage +  * non-commercial usage 
-  * redistribution not discussed, probably not allowed +  * redistribution permitted under the same license 
-  * citation of publications not specified+  * citation of publications not explicitly required but it is common courtesy
  
-PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (دادگان, Dādegān), Computer Engineering Department, [[http://www.iust.ac.ir/|Iran University of Science and Technology]] (دانشگاه علم و صنعت ایران), Tehrān تهران, Iran.+PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (دادگان, Dādegān), Computer Engineering Department, [[http://www.iust.ac.ir/|Iran University of Science and Technology]] (دانشگاه علم و صنعت ایران), Tehrān تهران, Iran. The copyright lies with the Supreme Council of Information and Communication Technology (SCICT).
  
 ==== References ==== ==== References ====
Line 27: Line 27:
     * Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.     * Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.
   * Documentation   * Documentation
-    * //none so far//+    * Attached to the data distribution: {{:user:zeman:treebanks:persian-dependency-treebank-version-0.1-annotation-manual-and-user-guide.pdf|Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide}}, Dadegan Research Group, Tehran, Iran, 2012. (http://dadegan.ir/sites/default/files/Persian%20Dependency%20Treebank%20Version%200.1%20Annotation%20Manual%20and%20User%20Guide.pdf)
  
 ==== Domain ==== ==== Domain ====
Line 35: Line 35:
 ==== Size ==== ==== Size ====
  
-Unknown. +PDT contains 189,572 tokens in 12,455 sentences, yielding 15.22 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 182,878 tokens / 12,126 sentences for training and the remaining 6,694 tokens / 329 sentences for testing.
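A split like the one described above can be reproduced with a short script. This is only a sketch: it assumes the usual CoNLL convention of one token per line with a blank line between sentences, and the function names are illustrative, not part of any HamleDT tool.

```python
def read_conll_sentences(lines):
    """Group CoNLL lines into sentences; blank lines separate sentences."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line)      # one token per non-blank line
        elif current:
            sentences.append(current)
            current = []
    if current:                        # file may lack a trailing blank line
        sentences.append(current)
    return sentences


def split_train_test(sentences, n_train):
    """Take the first n_train sentences for training, the rest for testing."""
    return sentences[:n_train], sentences[n_train:]


def tokens_per_sentence(sentences):
    """Average number of tokens per sentence."""
    return sum(len(s) for s in sentences) / len(sentences)
```

With the figures given above, `n_train` would be 12126, leaving 329 sentences for testing.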
  
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset. +Provided in the [[:format-conll|CoNLL data format]]. The morphosyntactic annotation contains lemmas. Morphosyntactic part-of-speech tags have been assigned manually. The text does not contain diacritical marks to distinguish short vowels (not normally shown in Persian text).
  
-The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually. +Tokenization is subordinated to the need of displaying syntactic relations. Some orthographic words have been broken into several tokens (e.g. a verb and its object). Elsewhere a tree node (token) consists of two orthographic words, e.g. the analytical form of the subjunctive preterite: "کرده باشید" //korde bašäd//. Such words are not joined using the underscore character, i.e. there is a space inside the token!
- +
-The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels). +
  
 ==== Sample ==== ==== Sample ====
  
-==== Parsing ====+The first sentence of the corpus in the CoNLL format:
  
-Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).+| 1 | به | به | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 2 | گزارش | گزارش | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 1 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 3 | خبرنگار | خبرنگار | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 2 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 4 | مهر | مهر | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 3 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 5 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 3 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 6 | گرگان | گرگان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 5 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 7 | <nowiki>،</nowiki> | <nowiki>،</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 8 | بر | بر | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 9 | اساس | اساس | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 8 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 10 | باورهای | باور | N | IANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 9 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 11 | دینی | دینی | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 10 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 12 | <nowiki>ترکمن‌ها</nowiki> | ترکمن | N | ANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 10 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 13 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 14 | این | این | PREM | DEMAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 15 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 15 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 13 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 16 | برای | برای | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 17 | پیامبر | پیامبر | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 16 | VPP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 18 | اکرم | اکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 19 | <nowiki>(</nowiki> | <nowiki>(</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 20 | ص | ص | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | APP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 21 | <nowiki>)</nowiki> | <nowiki>)</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 22 | ناراحتی | ناراحتی | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 23 | و | و | CONJ | CONJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 22 | NCONJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 24 | بیماری | بیماری | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 23 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 25 | رخ | رخ | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | NVE | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 26 | داد | <nowiki>داد#ده</nowiki> | V | ACT | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 27 | که | که | SUBR | SUBR | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | AJUCL | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 28 | چند | چند | PREM | AMBAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 29 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 30 | بعد | بعد | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 31 | با | با | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 32 | رحلت | رحلت | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 31 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 33 | نبی | نبی | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 32 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 34 | مکرم | مکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 33 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 35 | اسلام | اسلام | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 33 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 36 | جهان | جهان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 37 | عزادار | عزادار | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | MOS | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 38 | ماتمش | ماتم | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 37 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 39 | شد | <nowiki>کرد#کن</nowiki> | V | PASS | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 27 | PRD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 40 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
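A row like those in the sample can be read as follows. This is a sketch assuming the ten tab-separated columns of the CoNLL 2006 format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the function name is hypothetical. It splits on tabs only, because a token in this treebank may contain a space.

```python
FIELDS = ("id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel")


def parse_conll_token(line):
    """Parse one CoNLL 2006 token line into a dict (hypothetical helper)."""
    values = line.rstrip("\n").split("\t")   # never split on spaces
    if len(values) != len(FIELDS):
        raise ValueError("expected %d columns, got %d" % (len(FIELDS), len(values)))
    token = dict(zip(FIELDS, values))
    token["id"] = int(token["id"])
    token["head"] = int(token["head"])
    feats = token["feats"]
    # FEATS is a |-separated list of name=value pairs, or "_" when empty
    token["feats"] = {} if feats == "_" else dict(
        item.split("=", 1) for item in feats.split("|"))
    return token
```

Splitting on arbitrary whitespace instead of tabs would break on tokens such as "کرده باشید", which is why the column separator is fixed here.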
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian:+==== Parsing ====
  
-^ Parser (Authors) ^ LAS ^ UAS ^ +Nonprojectivities in PDT are relatively rare. Only 3,357 of the 189,572 tokens are attached nonprojectively (1.77%).
-| MST (McDonald et al.) | 87.57 | 92.04 | +
-| Malt (Nivre et al.) | 87.41 | 91.72 | +
-| Nara (Yuchang Cheng) | 86.34 | 91.30 |+
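One common way to count nonprojective attachments is sketched below. It is an assumption that the figure above was computed with this definition: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The input is a list of head indices with a dummy entry at position 0 for the artificial root, and the tree is assumed to be well-formed (no cycles).

```python
def is_dominated(heads, node, ancestor):
    """True if `ancestor` is `node` itself or lies on its path to the root (0)."""
    while node != ancestor and node != 0:
        node = heads[node]
    return node == ancestor


def nonprojective_tokens(heads):
    """Return the indices of tokens attached nonprojectively.

    heads[i] is the head of token i; heads[0] is an unused placeholder.
    """
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = min(dep, head), max(dep, head)
        # every token strictly between dep and head must be under head
        if any(not is_dominated(heads, k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result
```

Summing `len(nonprojective_tokens(heads))` over all sentences and dividing by the token count gives a percentage comparable to the one quoted above, provided the same definition was used.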
  
 +I am not aware of any published results on Persian dependency parsing. Our own experiments with MaltParser (stack-lazy algorithm) gave an unlabeled attachment score of 86.84%.
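For reference, the unlabeled attachment score is the percentage of tokens that received the correct head. A minimal sketch follows; the function name is illustrative, and whether punctuation tokens were counted in the experiment above is not stated here, so this version simply counts every token it is given.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Percentage of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold_heads:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)
```

The two lists would hold the HEAD column of the gold test data and of the parser output, concatenated over all test sentences in the same order.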
