
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ru [2012/01/13 21:33]
zeman Documentation.
user:zeman:treebanks:ru [2012/01/13 21:49] (current)
zeman Data split.
Line 32: Line 32:
     * David Mareček, Natalia Kljueva: [[http://aclweb.org/anthology/W/W09/W09-4005.pdf|Converting Russian Treebank SynTagRus into Praguian PDT Style]]. In: Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages, pp. 26-31, Bulgaria, 2009.
   * Documentation
-    * Description of tags and feature values is hard to find; see also the [[#Inside|Inside section below]].
     * Daniel Zeman: {{:user:zeman:treebanks:russian_dependency_treebank.pdf|Russian Dependency Treebank}} //(written based on a Russian document pamjatka_korpus.doc)//, College Park, Maryland, USA, 2006
  
Line 47: Line 46:
 The native file format of SynTagRus is the XML-based ''.tgt'' format. It uses the Windows-1251 encoding, which can be converted to UTF-8. Converting the file names is more of a challenge, depending on the file system (one typically gets a zipped archive containing files whose names use the Cyrillic alphabet, and the file system may store the names in a codepage different from Windows-1251).
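The encoding conversion described above can be sketched as follows. This is a minimal illustration, not a tested converter for the actual SynTagRus distribution: the function names are hypothetical, the presence of an XML declaration with an ''encoding'' attribute is an assumption, and ''cp866'' is only one plausible codepage for the archived file names (Python's ''zipfile'' decodes names as cp437 when the archive does not flag them as UTF-8, so the raw bytes can be recovered and decoded again).

```python
import re


def recode_cp1251_to_utf8(data: bytes) -> bytes:
    """Re-encode the contents of a Windows-1251 file as UTF-8.

    Assumption: if the file starts with an XML declaration that names
    its encoding, the declaration should be updated as well.
    """
    text = data.decode("cp1251")
    text = re.sub(
        r'(<\?xml[^>]*encoding=["\'])[^"\']+(["\'])',
        r"\1UTF-8\2",
        text,
        count=1,
    )
    return text.encode("utf-8")


def fix_zip_name(name: str, codepage: str = "cp866") -> str:
    """Undo the cp437 decoding that zipfile applies to unflagged names.

    Assumption: the archive stored the names in `codepage`; cp866 is a
    guess and may need to be replaced by e.g. cp1251.
    """
    return name.encode("cp437").decode(codepage)
```

If the recovered names still look garbled, trying another value for ''codepage'' is the first thing to check.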
  
-Part of speech tag description (obtained per e-mail from Koldo Gojenola, thanks!):
+Morphological annotation has probably been done manually and it contains lemmas (uppercased). See references for a description of morphological tags (features) and syntactic tags (dependency relation labels). Note that the tags use the Cyrillic alphabet.
  
-  * IZE = noun
+The syntactic trees do not contain punctuation. Punctuation tokens have not been removed but they do not have independent nodes in the trees.
-    * ARR = common
-    * IZB = proper name
-    * LIB = place name
-    * ZKI = number
-  * ADJ = adjective
-    * ARR = common
-    * GAL = question
-  * ADI = verb
-    * SIN = simple
-    * ADK = composed
-    * ADP = periphrastic
-    * FAK = factitive
-  * ADB = adverb
-    * ARR = common
-    * GAL = question
-  * DET = determiner
-    * ERKARR = demonstrative common
-    * ERKIND = demonstrative emphatic
-    * NOLARR = indefinite common
-    * NOLGAL = indefinite question
-    * ZNB = number
-    * DZH = definite
-    * BAN = distributive
-    * ORD = ordinal
-    * DZG = indefinite
-    * ORO = general
-  * IOR = pronoun
-    * PERARR = personal common
-    * PERIND = personal emphatic
-    * IZGMGB = indefinite
-    * IZGGAL = question
-    * BIH = ???
-    * ELK = ???
-  * LOT = link
-    * LOK = connector
-    * JNT = conjunction
-  * PRT = particle
-  * ITJ = interjection
-  * BST = other
-  * ADL = auxiliary verb
-  * ADT = synthetic verb
-  * SIG = acronym
-  * SNB = symbol
-  * LAB = abbreviation
  
 ==== Sample ====
Line 125: Line 80:
 ==== Parsing ====
  
-BDT is a mildly nonprojective treebank. 1925 of the 151,604 tokens of combined BDT-II training and test sets are attached nonprojectively (1.27%).
+Nonprojectivities in SynTagRus are not frequent. Only 4146 of the 497,465 tokens are attached nonprojectively (0.83%).
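A count like the one above can be obtained roughly as follows. This is a sketch under assumptions: the page does not say how the figure was computed, so the code uses the common definition that a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name and the input representation (a list of 1-based head indices, 0 for the artificial root, forming a proper tree) are hypothetical.

```python
def nonprojective_tokens(heads):
    """Return the 1-based indices of nonprojectively attached tokens.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    Assumes the heads form a tree (no cycles).
    """

    def dominated_by(node, ancestor):
        # Climb from node towards the root, looking for ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0  # the root dominates every token

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # Nonprojective if a token inside the span is outside head's subtree.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


def percentage(part, whole):
    """Share of `part` in `whole`, in percent, rounded to two decimals."""
    return round(100.0 * part / whole, 2)
```

With the counts quoted above, ''percentage(4146, 497465)'' gives 0.83 and ''percentage(1925, 151604)'' gives 1.27, matching the figures in the text.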
  
-The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Basque:
+Parsing results have been published by [[http://aclweb.org/anthology-new/C/C08/C08-1081.pdf|Nivre, Boguslavsky and Iomdin (2008)]] (note that they used a different training-test data split from ours):
  
-^ Parser (Authors) ^ LAS ^ UAS ^
+^ Parser ^ LAS ^ UAS ^
-| Malt (Nilsson et al.) | 76.94 | 82.84 |
+| Malt | 82.89.|
-| Titov et al. | 75.49 | 81.93 |
-| Sagae | 74.64 | 81.19 |
-| Carreras | 75.75 | 81.11 |
-| Nakagawa | 72.56 | 81.04 |
-| Malt (J. Hall et al.) | 74.99 | 80.61 |
-| Johansson et al. | 75.08 | 80.43 |
  
-The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. 
- 
-Parsing results on BDT-II have been published in Kepa Bengoetxea, Koldo Gojenola: [[http://aclweb.org/anthology-new/W/W10/W10-1404.pdf|Application of Different Techniques to Dependency Parsing of Basque]]. In: Proceedings of the First Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2010), NAACL Workshop, Los Angeles, California, USA, 2010. They report only Labeled Attachment Score (LAS) and their best system achieved LAS = 78.98%. 
