Persian (fa)
Persian Dependency Treebank (پیکرۀ وابستگی)
Versions
- “Preversion” 0.1 (January 2012, 12,200 annotated sentences)
- Final version (expected fall 2012, 30,000 sentences)
Obtaining and License
The treebank is available for free after completing the license form. (You may also contact info(at)dadegan(dot)ir or Mohammad Sadegh Rasooli.) The license in short:
- non-commercial research usage
- redistribution not discussed, probably not allowed
- citation of publications not specified
PDT was created by members of the Dadegan Research Group (دادگان, Dādegān), Computer Engineering Department, Iran University of Science and Technology (دانشگاه علم و صنعت ایران), Tehrān (تهران), Iran.
References
- Website
- Data
- no separate citation
- Principal publications
- Mohammad Sadegh Rasooli, Amirsaeid Moloodi, Manouchehr Kouhestani, Behrouz Minaei-Bidgoli: A Syntactic Valency Lexicon for Persian Verbs: The First Steps towards Persian Dependency Treebank. In: 5th Language & Technology Conference (LTC): Human Language Technologies as a Challenge for Computer Science and Linguistics, pp. 227-231, Poznań, Poland, 2011.
- Documentation
- none so far
Domain
Unknown.
Size
Unknown.
Inside
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a one-to-one mapping between the BTB positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.
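As an illustration of the three tag columns, here is a minimal Python sketch that reads one token line in the CoNLL 2006 (CoNLL-X) layout, whose tab-separated columns are ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD and PDEPREL. The names `Token`, `parse_feats` and `parse_conll_line` are hypothetical, and the sample line in the usage note is invented for illustration, not taken from the treebank.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    """One token row of a CoNLL-X file (first eight columns only)."""
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Split a FEATS field such as 'a=b|c=d' into a dict; '_' means no features."""
    if field == "_":
        return {}
    feats = {}
    for item in field.split("|"):
        key, sep, value = item.partition("=")
        # Features without '=' are kept with an empty value.
        feats[key] = value if sep else ""
    return feats


def parse_conll_line(line: str) -> Token:
    """Parse one tab-separated CoNLL-X token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    return Token(
        id=int(cols[0]),
        form=cols[1],
        lemma=cols[2],
        cpos=cols[3],
        pos=cols[4],
        feats=parse_feats(cols[5]),
        head=int(cols[6]),
        deprel=cols[7],
    )
```

For example, `parse_conll_line("1\tword\t_\tN\tNc\tnumber=s\t2\tsubj\t_\t_")` yields a token whose `cpos` is `"N"`, whose `pos` is `"Nc"` and whose `feats` is `{"number": "s"}`.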
The morphological analysis does not include lemmas. The morphosyntactic tags were probably assigned manually.
The guidelines for syntactic annotation are documented in the other technical report. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).
Sample
Parsing
Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
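A count like the one above can be reproduced with a short check over the head indices of each sentence. The sketch below assumes one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head, with the artificial root placed at position 0. The function names are hypothetical, and the definition used to obtain the published figure may differ in detail.

```python
from typing import List


def is_dominated(heads: List[int], node: int, ancestor: int) -> bool:
    """Return True if `ancestor` lies on the head chain from `node` to the root.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    steps = 0
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node - 1]
        steps += 1
        if steps > len(heads):
            raise ValueError("cycle in head chain")
    # The loop ended at the root, which dominates every token.
    return ancestor == 0


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based ids of tokens attached nonprojectively."""
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # Every token strictly between dependent and head must be under the head.
        if any(not is_dominated(heads, k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


def percentage(part: int, total: int) -> float:
    """Share of `part` in `total`, as a percentage rounded to two decimals."""
    return round(100.0 * part / total, 2)
```

With the figures quoted above, `percentage(747, 196151)` gives `0.38`. For the four-token head list `[3, 4, 0, 3]`, `nonprojective_tokens` returns `[2]`, because token 3 lies between token 2 and its head 4 but is not dominated by token 4.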
The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 87.57 | 92.04 |
Malt (Nivre et al.) | 87.41 | 91.72 |
Nara (Yuchang Cheng) | 86.34 | 91.30 |
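The two scores in the table can be sketched as follows. UAS is the percentage of scored tokens with the correct head, and LAS additionally requires the correct dependency label. To mirror the exclusion of punctuation, this sketch skips any token whose form consists only of characters in a Unicode punctuation category; that criterion is an assumption about the scorer, and the function names are hypothetical.

```python
import unicodedata
from typing import List, Tuple

# Each token is (form, head, deprel).
Row = Tuple[str, int, str]


def is_punctuation(form: str) -> bool:
    """True if every character of the form is Unicode punctuation (category P*)."""
    return len(form) > 0 and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold: List[Row], pred: List[Row]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, ignoring punctuation tokens in the gold data."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    scored = correct_head = correct_both = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punctuation(form):
            continue  # excluded from scoring
        scored += 1
        if g_head == p_head:
            correct_head += 1
            if g_rel == p_rel:
                correct_both += 1
    if scored == 0:
        raise ValueError("no scorable tokens")
    return 100.0 * correct_both / scored, 100.0 * correct_head / scored
```

In a three-token sentence where the final full stop is attached wrongly and one of the two words carries a wrong label but the right head, the function returns `(50.0, 100.0)`: the punctuation error is not counted, UAS is 100 and LAS is 50.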