Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

Next revision
Previous revision
user:zeman:treebanks:fa [2012/01/28 18:45]
zeman created
user:zeman:treebanks:fa [2015/06/24 15:26] (current)
zeman The license form is no longer accessible at the location where it was in 2012.
Line 10:
 ==== Obtaining and License ====
  
-The treebank is available for free after completing the [[http://dadegan.ir/en/content/user-agreement-persian-dependency-treebank|license form]]. The license in short:
+The treebank is available for free under the GNU GPL (with the additional requirement that the data be used non-commercially). Contact the Dadegan Research Group using their on-line form at http://dadegan.ir/en/contact-us and ask them for the data. The license in short:
  
-  * non-commercial research usage
-  * redistribution not discussed, probably not allowed
-  * citation of publications not specified
+  * non-commercial usage
+  * redistribution permitted under the same license
+  * citation of publications not explicitly required but it is common courtesy
  
-PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (Секция Лингвистично моделиране), Bulgarian Academy of Sciences (Българска академия на науките), Ул. Акад. Г. Бончев, Бл. 25 А, 1113 София, Bulgaria.
+PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (دادگان, Dādegān), Computer Engineering Department, [[http://www.iust.ac.ir/|Iran University of Science and Technology]] (دانشگاه علم و صنعت ایران), Tehrān (تهران), Iran. The copyright lies with the Supreme Council of Information and Communication Technology (SCICT).
  
 ==== References ====
  
   * Website
-    * http://www.bultreebank.org/indexBTB.html
+    * http://dadegan.ir/en/persiandependencytreebank
   * Data
     * //no separate citation//
   * Principal publications
-    * Kiril Simov, Petya Osenova, Alexander Simov, Milen Kouylekov: //Design and Implementation of the Bulgarian HPSG-based Treebank.// In: Erhard Hinrichs, Kiril Simov (eds.): Journal of Research on Language and Computation, Special Issue, vol. 2, no. 4, pp. 495–522, Kluwer Academic Publishers, ISSN 1570-7075. 2004.
+    * Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.
   * Documentation
-    * Kiril Simov, Petya Osenova, Milena Slavcheva: [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB-TR03: BulTreeBank Morphosyntactic Tagset]]. Technical report, 2004.
-    * Petya Osenova, Kiril Simov: [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|BTB-TR05: BulTreeBank Stylebook]]. Technical report, 2004.
-    * http://www.bultreebank.org/dpbtb/ provides the list of dependency relation labels (s-tags) with brief description.
+    * Attached to the data distribution: {{:user:zeman:treebanks:persian-dependency-treebank-version-0.1-annotation-manual-and-user-guide.pdf|Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide}}, Dadegan Research Group, Tehran, Iran, 2012 (http://dadegan.ir/sites/default/files/Persian%20Dependency%20Treebank%20Version%200.1%20Annotation%20Manual%20and%20User%20Guide.pdf)
  
 ==== Domain ====
 
-Unknown (“A set of Bulgarian sentences marked-up with detailed syntactic information. These sentences are mainly extracted from authentic Bulgarian texts. They are chosen with regards two criteria. First, they cover the variety of syntactic structures of Bulgarian. Second, they show the statistical distribution of these phenomena in real texts.”) At least part of it is probably news (Novinar, Sega, Standart).
+Unknown.
  
 ==== Size ====
 
-The CoNLL 2006 version contains 196,151 tokens in 13221 sentences, yielding 14.84 tokens per sentence on average (CoNLL 2006 data split: 190,217 tokens / 12823 sentences training, 5934 tokens / 398 sentences test).
+PDT contains 189,572 tokens in 12455 sentences, yielding 15.22 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 182,878 tokens / 12126 sentences for training and the remaining 6694 tokens / 329 sentences for testing.
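The size figures quoted for both revisions can be re-checked with a few lines of Python. This is only a sketch: the helper names `tokens_per_sentence` and `split_first_n` are made up for illustration and are not part of the treebank distribution or of HamleDT.

```python
def tokens_per_sentence(n_tokens, n_sentences):
    """Average sentence length, rounded to two decimals as on this page."""
    return round(n_tokens / n_sentences, 2)


def split_first_n(sentences, n_train):
    """Hypothetical helper: the first n_train sentences go to training,
    the remaining ones to testing (no shuffling)."""
    return sentences[:n_train], sentences[n_train:]


# Figures quoted on this page.
pdt_average = tokens_per_sentence(189572, 12455)   # Persian (PDT)
btb_average = tokens_per_sentence(196151, 13221)   # Bulgarian (CoNLL 2006)
print(pdt_average, btb_average)

# The ad hoc split sizes must add up to the totals.
assert 182878 + 6694 == 189572
assert 12126 + 329 == 12455

# Toy illustration of the split on a list of five dummy sentences.
train, test = split_first_n(["s1", "s2", "s3", "s4", "s5"], 3)
```

Taking the first sentences for training keeps the split reproducible without a random seed, at the price of possible domain drift between the beginning and the end of the corpus.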
  
 ==== Inside ====
 
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset.
+Provided in the [[:format-conll|CoNLL data format]]. The morphosyntactic annotation contains lemmas. Morphosyntactic part-of-speech tags have been assigned manually. The text does not contain diacritical marks to distinguish short vowels (not normally shown in Persian text).
 
-The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually.
+Tokenization is subordinated to the need of displaying syntactic relations. Some orthographic words have been broken into several tokens (e.g. a verb and its object). Elsewhere a tree node (token) consists of two orthographic words (and they are not joined using the underscore character, i.e. there is a space inside the token!) (e.g. the analytical form of subjunctive preterite: "کرده باشید" //korde bašäd//).
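Because a token may contain a space, a reader of the PDT files should split each line on tab characters only, not on arbitrary whitespace. The sketch below assumes the usual ten tab-separated columns of the CoNLL-X format; the function name `parse_conll_line` and the example line are made up for illustration (only the word form with the inner space is taken from the text above).

```python
def parse_conll_line(line):
    """Split one CoNLL token line into its ten tab-separated columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 10:
        raise ValueError(f"expected 10 columns, got {len(columns)}")
    return columns


# Made-up line: the FORM column holds two orthographic words.
line = "1\tکرده باشید\t_\tV\t_\t_\t0\tROOT\t_\t_\n"

form = parse_conll_line(line)[1]
print(form)

# A plain whitespace split breaks the token and yields one field too many.
naive_field_count = len(line.split())
print(naive_field_count)
```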
-
-The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).
  
 ==== Sample ====
 
-The first three sentences of the CoNLL 2006 training data:
+The first sentence of the corpus in the CoNLL format:
  
-| 1 | Глава Nc ROOT ROOT +| 1 | به به PREP PREP <nowiki>attachment=ISO|senID=23472</nowiki> 26 ADV <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-| 2 | трета Mo gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod | +| 2 | گزارش گزارش IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 1 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
-| |||||||||| +خبرنگار خبرنگار ANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> MOZ <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-НАРОДНО An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=mod mod +مهر مهر IANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 3 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-СЪБРАНИЕ | N | Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=ROOT ROOT +| 5 | در | در | PREP | PREP | <nowiki>attachment=ISO|senID=23472</nowiki> NPP <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-| |||||||||| +گرگان گرگان | N | IANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 5 | POSDEP <nowiki>_</nowiki> <nowiki>_</nowiki>
-Народното An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=mod mod +| 7 <nowiki>،</nowiki> | <nowiki>،</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> PUNC <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-събрание Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=subj subj +بر بر PREP PREP <nowiki>attachment=ISO|senID=23472</nowiki> 26 ADV <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-осъществява Vpi trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s ROOT ROOT +اساس اساس IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 8 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-законодателната Af gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d mod mod | +| 10 | باورهای | باور | N | IANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-власт Nc | _ | obj obj | +11 دینی دینی ADJ AJP <nowiki>attachment=ISO|senID=23472</nowiki> | 10 | NPOSTMOD | <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-и Cp conj 3 | conj +| 12 | <nowiki>ترکمن‌ها</nowiki> | ترکمن | N | ANM | <nowiki>attachment=ISO|number=PLUR|senID=23472</nowiki> 10 MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-упражнява Vpi trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s conjarg conjarg +13 در در PREP PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 26 | ADV | <nowiki>_</nowiki> <nowiki>_</nowiki> 
-парламентарен Am gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=mod mod +| 14 | این | این | PREM | DEMAJ | <nowiki>attachment=ISO|senID=23472</nowiki>15 NPREMOD <nowiki>_</nowiki> <nowiki>_</nowiki> 
-контрол | N | Nc gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i obj obj | +15 روز روز IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 13 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
-10 Punct Punct | _ | punct punct | +16 برای برای PREP PREP <nowiki>attachment=ISO|senID=23472</nowiki> 26 NPP <nowiki>_</nowiki> <nowiki>_</nowiki> | 
- +17 پیامبر پیامبر ANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> 16 VPP <nowiki>_</nowiki> <nowiki>_</nowiki> 
-The first three sentences of the CoNLL 2006 test data: +18 اکرم اکرم ADJ AJP | <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | NPOSTMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
- +| 19 | <nowiki>(</nowiki> | <nowiki>(</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki>20 PUNC <nowiki>_</nowiki> <nowiki>_</nowiki> 
-Единственото An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod +20 ص ص ADJ AJP <nowiki>attachment=ISO|senID=23472</nowiki> | 17 | APP | <nowiki>_</nowiki> <nowiki>_</nowiki> 
-решение | N | Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=ROOT | 0 | ROOT | +| 21 <nowiki>)</nowiki> | <nowiki>)</nowiki> | PUNC | PUNC | <nowiki>attachment=ISO|senID=23472</nowiki> 20 PUNC <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-| |||||||||| +22 ناراحتی ناراحتی | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
-Ерик Np gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT ROOT | +23 و و CONJ CONJ <nowiki>attachment=ISO|senID=23472</nowiki> 22 NCONJ <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-Франк Np gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +24 بیماری بیماری IANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 23 | POSDEP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-Ръсел Hm gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod | +25 رخ رخ | N | IANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 26 | NVE <nowiki>_</nowiki> <nowiki>_</nowiki> | 
-| |||||||||| +| 26 | داد | <nowiki>داد#ده</nowiki> | V | ACT | <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-Пълен Am gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=mod mod +27 که که SUBR SUBR <nowiki>attachment=ISO|senID=23472</nowiki> 26 AJUCL <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-мрак | N | Nc gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=ROOT ROOT +28 چند چند PREM AMBAJ | <nowiki>attachment=ISO|senID=23472</nowiki> | 29 | NPREMOD | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
-и Cp conj conj +29 روز روز IANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
-пълна Af gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod | +| 30 | بعد | بعد | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki>29 NPOSTMOD <nowiki>_</nowiki> <nowiki>_</nowiki> 
-самота Nc conjarg conjarg +31 با با PREP PREP | <nowiki>attachment=ISO|senID=23472</nowiki> | 39 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
-| . | Punct Punct punct punct |+32 رحلت رحلت IANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> 31 POSDEP <nowiki>_</nowiki> <nowiki>_</nowiki> | 
 +33 نبی نبی ANM <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 32 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
 +| 34 | مکرم | مکرم | ADJ | AJP | <nowiki>attachment=ISO|senID=23472</nowiki> 33 NPOSTMOD <nowiki>_</nowiki> | <nowiki>_</nowiki> 
 +35 اسلام اسلام | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 33 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
 +| 36 | جهان | جهان | N | IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> 39 SBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> 
 +37 عزادار عزادار ADJ AJP <nowiki>attachment=ISO|senID=23472</nowiki> 39 MOS <nowiki>_</nowiki> | <nowiki>_</nowiki> 
 +38 ماتمش ماتم IANM | <nowiki>attachment=ISO|number=SING|senID=23472</nowiki> | 37 | MOZ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
 +39 شد <nowiki>کرد#کن</nowiki> PASS <nowiki>person=3|attachment=ISO|number=SING|tma=GS|senID=23472</nowiki> 27 PRD <nowiki>_</nowiki> <nowiki>_</nowiki> 
 +40 <nowiki>.</nowiki> <nowiki>.</nowiki> PUNC PUNC <nowiki>attachment=ISO|senID=23472</nowiki> 26 PUNC <nowiki>_</nowiki> | <nowiki>_</nowiki> |
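
The feature column of the Persian sample packs several attribute=value pairs separated by vertical bars, for instance `person=3|attachment=ISO|number=SING|tma=GS|senID=23472`. A minimal sketch of unpacking such a string into a dictionary follows; the function name `parse_feats` is hypothetical, and reading `senID` as a sentence identifier is an assumption based on the fact that it has the same value on every token of the sentence.

```python
def parse_feats(feats):
    """Turn a CoNLL feature string such as 'a=b|c=d' into a dict.
    An underscore means that the token has no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature string of the verb in the sample sentence.
verb_feats = parse_feats("person=3|attachment=ISO|number=SING|tma=GS|senID=23472")
print(verb_feats["tma"], verb_feats["senID"])
```

All values stay strings, so `senID` would have to be converted explicitly if it is needed as a number.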
  
 ==== Parsing ====
 
-Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
+Nonprojectivities in PDT are relatively rare. Only 3357 of the 189,572 tokens are attached nonprojectively (1.77%).
-
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian:
-
-^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 87.57 | 92.04 |
-| Malt (Nivre et al.) | 87.41 | 91.72 |
-| Nara (Yuchang Cheng) | 86.34 | 91.30 |
 
+I am not aware of any published results of Persian dependency parsing. Our own experiments gave 86.84% unlabeled attachment score with Malt Parser, the stack-lazy algorithm.
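
For readers who want to reproduce a nonprojectivity count of the kind quoted in this section, here is a minimal sketch. It uses one common definition (a token is attached nonprojectively if some token lying between it and its head is not dominated by that head) and skips attachments to the artificial root; this is an assumption and may differ in detail from the procedure that produced the figures on this page. The input is assumed to be a well-formed tree without cycles, and the function names are made up for illustration.

```python
def is_descendant(heads, node, ancestor):
    """True if `ancestor` lies on the head path from `node` up to the root (0)."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def count_nonprojective(heads):
    """heads[i] is the head of token i (1-based); heads[0] is unused padding."""
    count = 0
    for dep in range(1, len(heads)):
        head = heads[dep]
        if head == 0:
            continue  # attachments to the artificial root are skipped here
        lo, hi = sorted((dep, head))
        if any(not is_descendant(heads, k, head) for k in range(lo + 1, hi)):
            count += 1
    return count


# Toy trees: in the first one, token 1 depends on token 3 across the root (token 2).
crossing = count_nonprojective([0, 3, 0, 2, 2])
flat = count_nonprojective([0, 2, 0, 2, 3])
print(crossing, flat)

# Percentages as quoted on this page.
pdt_percent = round(100 * 3357 / 189572, 2)
btb_percent = round(100 * 747 / 196151, 2)
print(pdt_percent, btb_percent)
```

The check is quadratic in sentence length in the worst case, which is unproblematic for sentences of about 15 tokens on average.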
