Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fa [2012/01/28 23:04]
zeman Some more changes.
user:zeman:treebanks:fa [2015/06/24 15:26] (current)
zeman The license form is no longer accessible at the location where it was in 2012.
Line 10: Line 10:
 ==== Obtaining and License ==== ==== Obtaining and License ====
  
-The treebank is available for free after completing the [[http://dadegan.ir/en/content/user-agreement-persian-dependency-treebank|license form]]. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short:+The treebank is available for free under the GNU GPLicense (with the additional requirement that the data be used non-commercially). Contact the Dadegan Research Group using their on-line form at http://dadegan.ir/en/contact-us and ask them for the data. The license in short:
  
-  * non-commercial research usage +  * non-commercial usage 
-  * redistribution not discussed, probably not allowed +  * redistribution permitted under the same license 
-  * citation of publications not specified+  * citation of publications not explicitly required but it is common courtesy
  
-PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (دادگان, Dādegān), Computer Engineering Department, [[http://www.iust.ac.ir/|Iran University of Science and Technology]] (دانشگاه علم و صنعت ایران), Tehrān تهران, Iran.+PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (دادگان, Dādegān), Computer Engineering Department, [[http://www.iust.ac.ir/|Iran University of Science and Technology]] (دانشگاه علم و صنعت ایران), Tehrān تهران, Iran. The copyright lies with the Supreme Council of Information and Communication Technology (SCICT).
  
 ==== References ==== ==== References ====
Line 27: Line 27:
     * Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.     * Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.
   * Documentation   * Documentation
-    * //none so far//+    * Attached to the data distribution: {{:user:zeman:treebanks:persian-dependency-treebank-version-0.1-annotation-manual-and-user-guide.pdf|Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide}}, Dadegan Research Group, Tehran, Iran, 2012. (http://dadegan.ir/sites/default/files/Persian%20Dependency%20Treebank%20Version%200.1%20Annotation%20Manual%20and%20User%20Guide.pdf)
  
 ==== Domain ==== ==== Domain ====
Line 35: Line 35:
 ==== Size ==== ==== Size ====
  
-Unknown.+PDT contains 189,572 tokens in 12,455 sentences, yielding 15.22 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 182,878 tokens / 12,126 sentences for training and the remaining 6,694 tokens / 329 sentences for testing.
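
A split like the one described above (the first 12,126 sentences for training, the rest for testing) could be sketched as follows. This is only an illustration, not the script used for HamleDT: the function names are made up, and it assumes the usual CoNLL convention that sentences are separated by blank lines.

```python
def read_sentences(lines):
    """Group CoNLL lines into sentences (assumes blank lines separate sentences)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def split_train_test(sentences, n_train=12126):
    """Return (training, test): the first n_train sentences and the remainder."""
    sentences = list(sentences)
    return sentences[:n_train], sentences[n_train:]


def count_tokens(sentences):
    """Total number of token lines in a list of sentences."""
    return sum(len(sentence) for sentence in sentences)
```

With the full treebank loaded through `read_sentences`, `count_tokens` on the two parts should give the token counts quoted above, provided the file is read in its original order.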
  
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset.+Provided in the [[:format-conll|CoNLL data format]]. The morphosyntactic annotation contains lemmas. Morphosyntactic part-of-speech tags have been assigned manually. The text does not contain diacritical marks to distinguish short vowels (not normally shown in Persian text).
  
-The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually. +Tokenization is subordinated to the need to display syntactic relations. Some orthographic words have been broken into several tokens (e.g. a verb and its object). Elsewhere a tree node (token) consists of two orthographic words that are not joined by the underscore character, i.e. there is a space inside the token (e.g. the analytical form of subjunctive preterite: "کرده باشید" //korde bašäd//).
- +
-The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).+
  
 ==== Sample ==== ==== Sample ====
  
-==== Parsing ====+The first sentence of the corpus in the CoNLL format:
  
-Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).+| 1 | به | به | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 2 | گزارش | گزارش | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 1 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 3 | خبرنگار | خبرنگار | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 2 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 4 | مهر | مهر | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 3 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 5 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 3 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 6 | گرگان | گرگان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 5 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 7 | <nowiki>،</nowiki> | <nowiki>،</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 8 | بر | بر | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 9 | اساس | اساس | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 8 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 10 | باورهای | باور | N | IANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 9 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 11 | دینی | دینی | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 10 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 12 | <nowiki>ترکمن‌ها</nowiki> | ترکمن | N | ANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> | 10 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 13 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 14 | این | این | PREM | DEMAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 15 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 15 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 13 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 16 | برای | برای | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | NPP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 17 | پیامبر | پیامبر | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 16 | VPP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 18 | اکرم | اکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 19 | <nowiki>(</nowiki> | <nowiki>(</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 20 | ص | ص | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | APP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 21 | <nowiki>)</nowiki> | <nowiki>)</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 20 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 22 | ناراحتی | ناراحتی | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 23 | و | و | CONJ | CONJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 22 | NCONJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 24 | بیماری | بیماری | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 23 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 25 | رخ | رخ | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | NVE | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 26 | داد | <nowiki>داد#ده</nowiki> | V | ACT | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 27 | که | که | SUBR | SUBR | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | AJUCL | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 28 | چند | چند | PREM | AMBAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 29 | روز | روز | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 30 | بعد | بعد | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 31 | با | با | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 32 | رحلت | رحلت | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 31 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 33 | نبی | نبی | N | ANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 32 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 34 | مکرم | مکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 33 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 35 | اسلام | اسلام | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 33 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 36 | جهان | جهان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 39 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 37 | عزادار | عزادار | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | MOS | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 38 | ماتمش | ماتم | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 37 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 39 | شد | <nowiki>کرد#کن</nowiki> | V | PASS | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 27 | PRD | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 40 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
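
Rows like those in the sample can be read with a few lines of Python. The sketch below assumes the ten columns are separated by tabs, as is usual for the CoNLL format; the names `Token`, `parse_feats` and `parse_token` are illustrative only. Splitting on tabs rather than on arbitrary whitespace matters here, because some PDT tokens contain a space.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(field):
    """Turn 'attachment=ISO|number=SING' into a dict; '_' means no features."""
    if field == "_":
        return {}
    feats = {}
    for item in field.split("|"):
        key, _, value = item.partition("=")
        feats[key] = value
    return feats


def parse_token(line):
    """Parse one CoNLL line (assumed tab-separated) into a Token."""
    cols = line.rstrip("\n").split("\t")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


# Row 25 of the sample sentence, written with tabs between columns.
row = "25\tرخ\tرخ\tN\tIANM\tattachment=ISO|number=SING|senID=23472\t26\tNVE\t_\t_"
token = parse_token(row)
print(token.form, token.head, token.deprel, token.feats["number"])
```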
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian:+==== Parsing ====
  
-^ Parser (Authors) ^ LAS ^ UAS ^ +Nonprojectivities in PDT are relatively rare. Only 3,357 of the 189,572 tokens are attached nonprojectively (1.77%).
-| MST (McDonald et al.) | 87.57 | 92.04 | +
-| Malt (Nivre et al.) | 87.41 | 91.72 | +
-| Nara (Yuchang Cheng) | 86.34 | 91.30 |+
  
 +I am not aware of any published results of Persian dependency parsing. Our own experiments gave an unlabeled attachment score of 86.84% with MaltParser using the stack-lazy algorithm.
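
The share of nonprojectively attached tokens mentioned above can be estimated with a sketch like the following. It uses one common definition, which may differ in detail from the one behind the figure above: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. The artificial root has index 0, and the function names are illustrative.

```python
def is_descendant(node, ancestor, heads):
    """True if `ancestor` is reached from `node` by following head links.

    heads[i] is the head of token i (1-based); heads[0] is a placeholder.
    The artificial root 0 counts as an ancestor of every token.
    """
    for _ in range(len(heads)):  # bounded loop guards against cycles
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node]
    return False


def count_nonprojective(heads):
    """Count tokens whose attachment to their head is nonprojective."""
    count = 0
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((dep, head))
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            count += 1
    return count


# Toy 4-token tree: token 1 depends on token 3, but token 2 lies between
# them and is not dominated by token 3.
toy_heads = [0, 3, 0, 2, 2]
print(count_nonprojective(toy_heads))
```

Summing `count_nonprojective` over all sentences and dividing by the total number of tokens gives a percentage comparable to the one quoted above, subject to the definition used.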