
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ru [2012/01/13 17:26]
zeman created
user:zeman:treebanks:ru [2012/01/13 18:00]
zeman Domain and size.
Line 1: Line 1:
 ===== Russian (ru) =====
  
-Russian Dependency Treebank (RDT, Syntagrus)
+[[http://www.ruscorpora.ru/en/search-syntax.html|Russian Dependency Treebank]] (RDT, SynTagRus)
  
 ==== Versions ====
Line 7: Line 7:
   * 2006 (small part obtained by Dan Zeman per e-mail from Igor Boguslavsky)
   * 2009 (newer and larger version obtained by Natalia Klyueva)
 +  * The version at the site of the [[http://www.ruscorpora.ru/en/search-syntax.html|Russian National Corpus]] (searchable on-line but not available for download)
  
 ==== Obtaining and License ====
Line 26: Line 27:
     * //no separate citation//
   * Principal publications
-    * Itziar Aduriz, María Jesús Aranzabe, José María Arriola, Aitziber Atutxa, Arantza Díaz de Ilarraza, Aitzpea Garmendia, Maite Oronoz: [[http://w3.msi.vxu.se/~rics/TLT2003/doc/aduriz_et_al.pdf|Construction of a Basque Dependency Treebank]] In: Proceedings of The Second Workshop on Treebanks and Linguistic Theories (TLT 2003), pp. 149-160, Växjö, Sweden, 2003.
+    * Igor Boguslavsky, Ivan Chardin, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Iomdin, Leonid Kreidlin, Nadezhda Frid: [[http://cl.iitp.ru/bibitems/treebank_lrec.pdf|Development of a Dependency Treebank for Russian and its Possible Applications in NLP]] In: Proceedings of The Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 852-856, Las Palmas, Spain, 2002.
 +  * Other publications 
 +    * Joakim Nivre, Igor M. Boguslavsky, Leonid L. Iomdin: [[http://aclweb.org/anthology-new/C/C08/C08-1081.pdf|Parsing the SynTagRus Treebank of Russian]]. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 641-648, Manchester, UK, 2008. 
 +    * David Mareček, Natalia Kljueva: [[http://aclweb.org/anthology/W/W09/W09-4005.pdf|Converting Russian Treebank SynTagRus into Praguian PDT Style]]. In: Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages, pp. 26-31, Bulgaria, 2009.
   * Documentation
-    * Description of tags and feature values is hard to find; the ''doc/README'' file in the CoNLL 2007 data distribution is not very informative. See below for information obtained per e-mail communication.
+    * Description of tags and feature values is hard to find; see also the [[#Inside|Inside section below]].
-    * María Jesús Aranzabe, José Mari Arriola, Aitziber Atutxa, Irene Balza, Larraitz Uria: [[http://ixa.si.ehu.es/Ixa/Argitalpenak/Barne_txostenak/1068549887/publikoak/guia.pdf|Guía para la anotación sintáctica manual de Eus3LB (corpus del euskera anotado a nivel sintáctico, semántico y pragmático)]]. UPV/EHU/LSI/TR 13-2003, Donostia, Spain, 2003. +
-    * [[http://www.google.cz/url?sa=t&rct=j&q=adlativo%20direccional%20norantz&source=web&cd=1&ved=0CB0QFjAA&url=http%3A%2F%2Flenguaesp.usal.es%2Fhtml%2Fes%2Fdbfs%2Fdownload.html%3FfileId%3D1118%26_key_%3D248d9f4b64589181dfabafad22b8e483&ei=Qg3VTpKCFpDNswaarJyNDg&usg=AFQjCNEA86oRVR_7sNixk1EKvDFCoSrSsg&sig2=yTsTylb19CsOqsdu-wOtwA&cad=rja|Here]] at the University of Salamanca is a Microsoft Word document in Spanish describing the Basque morphology. It does not mention the treebank but it could help understand some of the tags. +
-    * José Ignacio Hualde, Jon Ortiz de Urbina: [[http://books.google.cz/books?id=Kss999lxKm0C&printsec=frontcover&dq=grammar+of+basque&cd=1&redir_esc=y#v=onepage&q&f=false|A Grammar of Basque]]. Mouton de Gruyter, Berlin, 2003. ISBN 3-11-017683-1.+
  
 ==== Domain ====
  
-Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).
+“Uppsala University Corpus of contemporary Russian prose (balanced fiction-journalistic, + small percentage of scientific and popular science). In addition, several hundred short texts published in 2001-2002 on various Internet news portals” (yandex.ru, rbc.ru, polit.ru, lenta.ru, strana.ru, news.ru etc.)
  
 ==== Size ====
  
-The CoNLL 2007 dataset was officially split into training and test part. The data split of BDT-II was provided by Koldo Gojenola and should correspond to data split used in parsing experiments published by the IXA Group.
+There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split into training and test parts. In our HamleDT experiments, we take one file (Vyzhivshij_kamikadze, 402 sentences, 3,458 tokens) as the test data, while the rest serves for training.
- +
-^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ +
-| CoNLL 2007 |  3190 |  50526 |  334 |  5390 |  |  |  3524 |  55916 |  15.87 | +
-| BDT-II |  9094 |  124,684 |  1010 |  12625 |  1122 |  14295 |  11226 |  151,604 |  13.50 |+
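
The figures quoted for the new revision can be checked with a few lines of arithmetic. The sketch below is only a sanity check: the variable names are ours, and the derived training-set sizes assume that the quoted totals include the held-out test file, which the page does not state explicitly.

```python
# Counts quoted in the Size section of the new revision.
total_tokens = 497465
total_sentences = 34895
test_sentences = 402   # file Vyzhivshij_kamikadze
test_tokens = 3458

# Average sentence length, rounded as on the page.
avg_len = round(total_tokens / total_sentences, 2)

# Derived training portion (assumption: totals include the test file).
train_sentences = total_sentences - test_sentences
train_tokens = total_tokens - test_tokens

print(avg_len, train_sentences, train_tokens)
```

Under that assumption, the training portion is simply everything except the one test file.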
  
 ==== Inside ====
 +
 +We have a Treex reader for the SynTagRus native format (.tgt). Note, however, that Dan converted the original Windows-1251 encoding to UTF-8.
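
For readers who start from a copy still in the original encoding, a re-encoding step along the following lines may help. This is a minimal sketch, not the procedure actually used for our copy: the function name `recode_to_utf8` and the file paths are hypothetical, and it assumes the input really is Windows-1251 (Python codec `cp1251`). If the files carry an encoding declaration in their header, that would need updating separately; the sketch does not touch it.

```python
from pathlib import Path


def recode_to_utf8(src: Path, dst: Path, src_encoding: str = "cp1251") -> None:
    """Read src in the assumed source encoding and write it to dst as UTF-8."""
    text = src.read_text(encoding=src_encoding)
    dst.write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    source = Path("original/Vyzhivshij_kamikadze.tgt")
    if source.exists():
        recode_to_utf8(source, Path("utf8/Vyzhivshij_kamikadze.tgt"))
```

Writing to a separate destination keeps the original file intact, so the conversion can be repeated if the assumed source encoding turns out to be wrong.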
  
 Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format.
