Differences

This shows you the differences between two versions of the page.

--- user:zeman:treebanks:ru [2012/01/13 18:04]
zeman Sample.
+++ user:zeman:treebanks:ru [2012/01/13 21:39]
zeman Inside.
@@ Line 32: / Line 32: @@
     * David Mareček, Natalia Kljueva: [[http://aclweb.org/anthology/W/W09/W09-4005.pdf|Converting Russian Treebank SynTagRus into Praguian PDT Style]]. In: Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages, pp. 26-31, Bulgaria, 2009.
   * Documentation
-    * Description of tags and feature values is hard to find; see also the [[#Inside|Inside section below]].
+    * Daniel Zeman: {{:user:zeman:treebanks:russian_dependency_treebank.pdf|Russian Dependency Treebank}} //(written based on a Russian document pamjatka_korpus.doc)//, College Park, Maryland, USA, 2006
 ==== Domain ====
@@ Line 40: / Line 40: @@
 ==== Size ====
-There are 497,465 tokens in 34895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split to training and test. In our HamleDT experiments, we take one file (Vyzhivshij_kamikadze, 402 sentences, 3458 tokens) as the test data, while the rest serves for training.
+There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split into training and test sets. In our HamleDT experiments, we take one file (''Выживший_камикадзе.tgt'', 402 sentences, 3458 tokens) as the test data, while the rest serves for training.
 ==== Inside ====
-We have a Treex reader for the Syntagrus native format (.tgt). Note however that Dan converted the original windows-1251 encoding to utf-8.
+The native file format of Syntagrus is the XML-based ''.tgt'' format. It uses the Windows-1251 encoding, which can be converted to UTF-8. Converting the file names is more of a challenge and depends on the file system: one typically gets a zipped archive containing files with Cyrillic names, and the file system may store those names in a code page other than Windows-1251.
-Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format.
+The morphological annotation was probably done manually and includes lemmas (uppercased). See the references for a description of the morphological tags (features) and syntactic tags (dependency relation labels). Note that the tags use the Cyrillic alphabet.
-Part of speech tag description (obtained per e-mail from Koldo Gojenola, thanks!):
+The syntactic trees do not contain punctuation. Punctuation tokens have not been removed but they do not have independent nodes in the trees.
-  * IZE = noun
-    * ARR = common
-    * IZB = proper name
-    * LIB = place name
-    * ZKI = number
-  * ADJ = adjective
-    * ARR = common
-    * GAL = question
-  * ADI = verb
-    * SIN = simple
-    * ADK = composed
-    * ADP = periphrastic
-    * FAK = factitive
-  * ADB = adverb
-    * ARR = common
-    * GAL = question
-  * DET = determiner
-    * ERKARR = demonstrative common
-    * ERKIND = demonstrative emphatic
-    * NOLARR = indefinite common
-    * NOLGAL = indefinite question
-    * ZNB = number
-    * DZH = definite
-    * BAN = distributive
-    * ORD = ordinal
-    * DZG = indefinite
-    * ORO = general
-  * IOR = pronoun
-    * PERARR = personal common
-    * PERIND = personal emphatic
-    * IZGMGB = indefinite
-    * IZGGAL = question
-    * BIH = ???
-    * ELK = ???
-  * LOT = link
-    * LOK = connector
-    * JNT = conjunction
-  * PRT = particle
-  * ITJ = interjection
-  * BST = other
-  * ADL = auxiliary verb
-  * ADT = synthetic verb
-  * SIG = acronym
-  * SNB = symbol
-  * LAB = abbreviation
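The encoding conversion described in the Inside section could be sketched roughly as follows. This is only an illustration, not the procedure actually used for the treebank. The function names are hypothetical, the XML declaration rewrite assumes the ''.tgt'' files declare their encoding in the usual XML way, and ''cp866'' is merely a guess at the code page an archiver might have used for the Cyrillic file names.

```python
import re


def tgt_bytes_to_utf8(raw: bytes) -> bytes:
    """Re-encode Windows-1251 bytes as UTF-8 (hypothetical helper)."""
    text = raw.decode("cp1251")
    # Assumption: if an XML declaration names the old encoding,
    # it should be updated so that XML parsers do not misread the file.
    text = re.sub(
        r'(<\?xml[^>]*encoding=")[^"]*(")',
        r"\g<1>utf-8\g<2>",
        text,
        count=1,
    )
    return text.encode("utf-8")


def recover_zip_name(name: str, codepage: str = "cp866") -> str:
    """Recover a Cyrillic file name that was read from a zip archive.

    Python's zipfile module decodes member names as CP437 when the
    archive does not flag them as UTF-8. Encoding back to CP437 restores
    the original bytes, which are then decoded with the code page the
    archiver really used (assumed here to be cp866; adjust as needed).
    """
    return name.encode("cp437").decode(codepage)
```

If the recovered names still look garbled, the archive probably used a different code page, and ''codepage'' has to be changed accordingly.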
 ==== Sample ====


Institute of Formal and Applied Linguistics Wiki
