user:zeman:treebanks:fa [2012/01/28 23:04] zeman Some more changes. |
user:zeman:treebanks:fa [2012/01/29 18:19] zeman Update. I have seen the data! |
==== Obtaining and License ==== | ==== Obtaining and License ==== |
| |
The treebank is available for free after completing the [[http://dadegan.ir/en/content/user-agreement-persian-dependency-treebank|license form]]. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short: | The treebank is available for free under the GNU General Public License (GPL), with the additional requirement that the data be used non-commercially. Complete the [[http://dadegan.ir/en/content/user-agreement-persian-dependency-treebank|license form]] and they will send you the data by e-mail. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short: |
| |
* non-commercial research usage | * non-commercial usage |
* redistribution not discussed, probably not allowed | * redistribution permitted under the same license |
* citation of publications not specified | * citation of publications not explicitly required, but it is common courtesy |
| |
PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (دادگان, Dādegān), Computer Engineering Department, [[http://www.iust.ac.ir/|Iran University of Science and Technology]] (دانشگاه علم و صنعت ایران), Tehrān تهران, Iran. | PDT was created by members of the [[http://dadegan.ir/|Dadegan Research Group]] (دادگان, Dādegān), Computer Engineering Department, [[http://www.iust.ac.ir/|Iran University of Science and Technology]] (دانشگاه علم و صنعت ایران), Tehrān تهران, Iran. The copyright lies with the Supreme Council of Information and Communication Technology (SCICT). |
| |
==== References ==== | ==== References ==== |
* Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011. | * Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: [[http://dadegan.ir/sites/default/files/A%20Syntactic%20Valency%20Lexicon%20for%20Persian%20Verbs%20The%20First%20Steps%20towards%20Persian%20Dependency%20Treebank.pdf|A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank]]. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011. |
* Documentation | * Documentation |
* //none so far// | * Attached to the data distribution: {{:user:zeman:treebanks:persian-dependency-treebank-version-0.1-annotation-manual-and-user-guide.pdf|Persian Dependency Treebank Version 0.1, Annotation Manual and User Guide}}, Dadegan Research Group, Tehran, Iran, 2012. |
| |
==== Domain ==== | ==== Domain ==== |
==== Size ==== | ==== Size ==== |
| |
Unknown. | 12200 annotated sentences. |
| |
==== Inside ==== | ==== Inside ==== |
| |
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset. | Provided in the [[:format-conll|CoNLL data format]]. The morphosyntactic annotation contains lemmas. Morphosyntactic / part-of-speech tags have been assigned manually. The text does not contain diacritical marks to distinguish short vowels (not normally shown in Persian text). |
| |
The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually. | |
| |
The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels). | |
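Since the data are provided in the CoNLL data format, a minimal reading sketch may be useful. It assumes the usual CoNLL-X layout: one token per line, tab-separated columns ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL (followed by PHEAD and PDEPREL), and a blank line between sentences. The names ''Token'', ''read_conll'' and ''SAMPLE'' are hypothetical, and the sample rows are made-up placeholders, not data or tags from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """The first eight CoNLL-X columns of one token."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of tokens); blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Made-up placeholder rows (assumption): not real treebank content.
SAMPLE = "\n".join([
    "\t".join(["1", "w1", "l1", "N", "N", "_", "2", "SBJ", "_", "_"]),
    "\t".join(["2", "w2", "l2", "V", "V", "_", "0", "ROOT", "_", "_"]),
    "",
    "\t".join(["1", "w3", "l3", "V", "V", "_", "0", "ROOT", "_", "_"]),
])

sentences = list(read_conll(SAMPLE.splitlines()))
print(len(sentences), "sentences,", sum(len(s) for s in sentences), "tokens")
```

Applied to the real file (opened with UTF-8 encoding), the same loop could be used to check the sentence count given in the Size section.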
| |
==== Sample ==== | ==== Sample ==== |
Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%). | Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%). |
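A figure like the one above can be obtained by counting nonprojectively attached tokens. The following is only a sketch, assuming the common definition that a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. It is not necessarily the procedure used to obtain the numbers above. The function names are hypothetical; ''heads[i]'' is the head index of token ''i'' (1-based, 0 is the artificial root).

```python
from typing import List


def dominates(heads: List[int], ancestor: int, node: int) -> bool:
    """True if `ancestor` is `node` itself or lies on its path to the root."""
    if ancestor == 0:
        return True  # the artificial root dominates every token
    steps = 0
    # The step limit guards against cycles in malformed input.
    while node != 0 and steps <= len(heads):
        if node == ancestor:
            return True
        node = heads[node]
        steps += 1
    return False


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the tokens attached nonprojectively; heads[0] is an unused placeholder."""
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        # Every token strictly between head and dependent must be dominated by the head.
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Hypothetical 4-token tree: 1 -> 3, 2 -> 4, 3 -> root, 4 -> 3.
# The arc from 4 to 2 crosses token 3, which token 4 does not dominate.
print(nonprojective_tokens([0, 3, 4, 0, 3]))

# Share of nonprojective tokens, using the counts quoted above.
print(round(100 * 747 / 196151, 2))
```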
| |
The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian: | I am not aware of any published results of Persian dependency parsing. |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | |
| MST (McDonald et al.) | 87.57 | 92.04 | | |
| Malt (Nivre et al.) | 87.41 | 91.72 | | |
| Nara (Yuchang Cheng) | 86.34 | 91.30 | | |