user:zeman:treebanks:ru [ufal wiki]

Russian (ru)

Russian Dependency Treebank (RDT, SynTagRus)

Versions

- 2006 (a small part, obtained by Dan Zeman by e-mail from Igor Boguslavsky)
- 2009 (a newer and larger version, obtained by Natalia Klyueva)
- The version at the site of the Russian National Corpus (searchable online but not available for download)

Obtaining and License

There does not seem to be any regular distribution channel for the Russian Dependency Treebank. One can ask Igor Boguslavsky (Игорь Богуславский, bogus (at) iitp (dot) ru) about the availability of the corpus.

Informally agreed upon terms:

- no redistribution
- cite the principal publication (see below) in publications
- let the creators know about interesting research based on the corpus

RDT was created by members of the Institute for Information Transmission Problems (Институт проблем передачи информации), Russian Academy of Sciences (Российская академия наук), RUS-127994, г.Москва, ГСП-4, Большой Каретный переулок, 19, стр.1.

References

Website
- no website dedicated to the treebank
Data
- no separate citation
Principal publications
- Igor Boguslavsky, Ivan Chardin, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Iomdin, Leonid Kreidlin, Nadezhda Frid: Development of a Dependency Treebank for Russian and its Possible Applications in NLP. In: Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), pp. 852-856, Las Palmas, Spain, 2002.
Other publications
- Joakim Nivre, Igor M. Boguslavsky, Leonid L. Iomdin: Parsing the SynTagRus Treebank of Russian. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 641-648, Manchester, UK, 2008.
- David Mareček, Natalia Kljueva: Converting Russian Treebank SynTagRus into Praguian PDT Style. In: Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages, pp. 26-31, Bulgaria, 2009.
Documentation
- Daniel Zeman: Russian Dependency Treebank (written based on a Russian document pamjatka_korpus.doc), College Park, Maryland, USA, 2006

Domain

The texts come from the Uppsala University Corpus of contemporary Russian prose (a balance of fiction and journalism, plus a small percentage of scientific and popular-science texts). In addition, there are “several hundred short texts published in 2001-2002 on various Internet news portals” (yandex.ru, rbc.ru, polit.ru, lenta.ru, strana.ru, news.ru etc.)

Size

There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split into training and test sets. In our HamleDT experiments, we take one file (Выживший_камикадзе.tgt, 402 sentences, 3,458 tokens) as the test data, while the rest serves for training.
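
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sanity check on the numbers quoted above; the variable names are chosen for illustration and the derived training-set size is simply the total minus the test file.

```python
# Figures quoted in the text above.
total_tokens, total_sentences = 497_465, 34_895
test_tokens, test_sentences = 3_458, 402

# Average sentence length over the whole treebank.
avg_len = total_tokens / total_sentences

# Everything except the single test file serves as training data.
train_tokens = total_tokens - test_tokens
train_sentences = total_sentences - test_sentences

print(f"average sentence length: {avg_len:.2f} tokens")
print(f"training part: {train_sentences} sentences, {train_tokens} tokens")
```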

Inside

The native file format of SynTagRus is the XML-based .tgt format. The files use the Windows-1251 encoding, which can be converted to UTF-8. Converting the file names is more of a challenge and depends on the file system: one typically gets a zipped archive containing files whose names use the Cyrillic alphabet, and the names may be stored in a codepage different from Windows-1251.
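
A minimal sketch of both conversions in Python, under stated assumptions. It assumes that the archive stores file names in a legacy codepage without the ZIP UTF-8 flag (in which case Python's zipfile module decodes them as cp437), and that the legacy codepage is cp866; the latter is a guess that may have to be changed to cp1251 or something else for a particular archive. The function names are hypothetical, and the sketch does not guard against unsafe paths inside the archive.

```python
import zipfile
from pathlib import Path


def fix_member_name(info: zipfile.ZipInfo, legacy_codepage: str = "cp866") -> str:
    """Recover a Cyrillic file name from a ZIP entry.

    If the UTF-8 flag (bit 11) is not set, zipfile has decoded the name
    as cp437, so we undo that step and decode the raw bytes with the
    assumed legacy codepage instead.
    """
    if info.flag_bits & 0x800:
        return info.filename  # the name is already stored as UTF-8
    return info.filename.encode("cp437").decode(legacy_codepage)


def tgt_bytes_to_utf8(raw: bytes) -> str:
    """Decode Windows-1251 content and update the XML declaration to match."""
    text = raw.decode("windows-1251")
    return text.replace('encoding="windows-1251"', 'encoding="utf-8"', 1)


def extract_as_utf8(archive: str, out_dir: str, legacy_codepage: str = "cp866") -> None:
    """Unpack the archive, fixing file names and re-encoding .tgt files."""
    out = Path(out_dir)
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = out / fix_member_name(info, legacy_codepage)
            target.parent.mkdir(parents=True, exist_ok=True)
            data = zf.read(info)
            if target.suffix.lower() == ".tgt":
                target.write_text(tgt_bytes_to_utf8(data), encoding="utf-8")
            else:
                target.write_bytes(data)
```

If the recovered names still look garbled, the archive probably uses a different codepage; trying `legacy_codepage="cp1251"` is a reasonable next step.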

The morphological annotation was probably done manually; it includes lemmas (in upper case). See the references for a description of the morphological tags (features) and the syntactic tags (dependency relation labels). Note that the tags use the Cyrillic alphabet.

The syntactic trees do not contain punctuation. Punctuation marks have not been removed from the data, but they do not have nodes of their own in the trees: in the TGT files they appear as plain text between the word elements.
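
The following Python sketch illustrates how punctuation can be recovered when reading a sentence. It uses the standard xml.etree.ElementTree module, where text following an element's closing tag is available as that element's `tail`. The fragment contains three `W` elements copied from the sample sentence of the test file; the function name `surface_tokens` is hypothetical, and a tail holding several punctuation marks is returned as one string.

```python
import xml.etree.ElementTree as ET

# Three word elements taken from the sample sentence; the rest is omitted.
FRAGMENT = (
    '<S ID="1">\n'
    '<W DOM="3" FEAT="S ЕД СРЕД ВИН НЕОД" ID="6" LEMMA="ПРАВИТЕЛЬСТВО" LINK="1-компл">правительство</W>, \n'
    '<W DOM="8" FEAT="S ЕД СРЕД ВИН" ID="7" LEMMA="КОТОРЫЙ" LINK="1-компл">которое</W> \n'
    '<W DOM="9" FEAT="S ЕД МУЖ РОД ОД" ID="10" LEMMA="КАМИКАДЗЕ" LINK="квазиагент">камикадзе</W>. \n'
    '</S>'
)


def surface_tokens(sentence):
    """Return (kind, text) pairs in surface order.

    Words come from the W elements; punctuation is whatever non-blank
    text follows a W element (its tail) or precedes the first one.
    """
    tokens = []
    if sentence.text and sentence.text.strip():
        tokens.append(("punct", sentence.text.strip()))
    for w in sentence.iter("W"):
        tokens.append(("word", (w.text or "").strip()))
        tail = (w.tail or "").strip()
        if tail:
            tokens.append(("punct", tail))
    return tokens


sentence = ET.fromstring(FRAGMENT)
tokens = surface_tokens(sentence)
for kind, text in tokens:
    print(kind, text)
```

Only the `W` elements correspond to tree nodes, so the comma and the full stop show up in the surface token list but not among the nodes.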

Sample

The first sentence of our test file in the native TGT format:

<?xml version="1.0" encoding="windows-1251"?>
<text><inf>
<author>Л, ТЕЛЕНЬ</author>
<title>Интервью "МН" Е. Гайдара</title>
<source/>
<annot>ЛИ</annot>
<editor>ИБ</editor>
<date>10.12.03</date>
<comment/>
</inf>
<body>
<S ID="1">
<W DOM="3" EXTRAFEAT="CAP" FEAT="S ЕД МУЖ ИМ ОД" ID="1" LEMMA="ЕГОР" LINK="предик">Егор</W> 
<W DOM="1" EXTRAFEAT="CAP" FEAT="S ЕД МУЖ ИМ ОД" ID="2" LEMMA="ГАЙДАР" LINK="аппоз">Гайдар</W> 
<W DOM="_root" EXTRAFEAT="ЛИЧ" FEAT="V НЕСОВ ИЗЪЯВ ПРОШ ЕД МУЖ" ID="3" LEMMA="ВОЗГЛАВЛЯТЬ">возглавлял</W> 
<W DOM="6" FEAT="A ЕД СРЕД ВИН" ID="4" LEMMA="ПЕРВЫЙ" LINK="опред">первое</W> 
<W DOM="6" FEAT="A ЕД СРЕД ВИН" ID="5" LEMMA="РОССИЙСКИЙ" LINK="опред">российское</W> 
<W DOM="3" FEAT="S ЕД СРЕД ВИН НЕОД" ID="6" LEMMA="ПРАВИТЕЛЬСТВО" LINK="1-компл">правительство</W>, 
<W DOM="8" FEAT="S ЕД СРЕД ВИН" ID="7" LEMMA="КОТОРЫЙ" LINK="1-компл">которое</W> 
<W DOM="6" EXTRAFEAT="ЛИЧ" FEAT="V НЕСОВ ИЗЪЯВ ПРОШ МН" ID="8" LEMMA="НАЗЫВАТЬ" LINK="релят">называли</W> 
<W DOM="8" FEAT="S ЕД СРЕД ТВОР НЕОД" ID="9" LEMMA="ПРАВИТЕЛЬСТВО" LINK="2-компл">правительством</W> 
<W DOM="9" FEAT="S ЕД МУЖ РОД ОД" ID="10" LEMMA="КАМИКАДЗЕ" LINK="квазиагент">камикадзе</W>. 
</S>
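
As the sample shows, each `W` element carries its own `ID`, the `ID` of its governor in `DOM` (or `_root`), the dependency label in `LINK`, the lemma in `LEMMA` and the morphological tags in `FEAT`. A minimal Python sketch of reading these attributes follows. It uses only the first three words of the sample sentence, which happen to form a self-contained subtree. Mapping `_root` to head 0 and treating the first item of `FEAT` as the part of speech are assumptions made for illustration, and the names `SAMPLE` and `sentence_rows` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The first three words of the sample sentence above.
SAMPLE = (
    '<S ID="1">\n'
    '<W DOM="3" EXTRAFEAT="CAP" FEAT="S ЕД МУЖ ИМ ОД" ID="1" LEMMA="ЕГОР" LINK="предик">Егор</W> \n'
    '<W DOM="1" EXTRAFEAT="CAP" FEAT="S ЕД МУЖ ИМ ОД" ID="2" LEMMA="ГАЙДАР" LINK="аппоз">Гайдар</W> \n'
    '<W DOM="_root" EXTRAFEAT="ЛИЧ" FEAT="V НЕСОВ ИЗЪЯВ ПРОШ ЕД МУЖ" ID="3" LEMMA="ВОЗГЛАВЛЯТЬ">возглавлял</W> \n'
    '</S>'
)


def sentence_rows(sentence):
    """Turn the W elements of one sentence into a list of dictionaries."""
    rows = []
    for w in sentence.iter("W"):
        dom = w.get("DOM")
        feat = w.get("FEAT", "").split()
        rows.append({
            "id": int(w.get("ID")),
            "form": (w.text or "").strip(),
            "lemma": w.get("LEMMA"),
            # Assumption: the first FEAT item is the part of speech.
            "pos": feat[0] if feat else None,
            "feats": feat[1:],
            # Assumption: the root is represented by head 0.
            "head": 0 if dom == "_root" else int(dom),
            # The root word has no LINK attribute, so this is None for it.
            "deprel": w.get("LINK"),
        })
    return rows


rows = sentence_rows(ET.fromstring(SAMPLE))
for row in rows:
    print(row["id"], row["form"], row["lemma"], row["pos"], row["head"], row["deprel"], sep="\t")
```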

Parsing

BDT is a mildly nonprojective treebank. 1925 of the 151,604 tokens of combined BDT-II training and test sets are attached nonprojectively (1.27%).
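
A token counts as nonprojectively attached when at least one token lying between it and its head is not a descendant of that head. The Python sketch below implements this common definition; it is an illustration, not necessarily the exact procedure used to obtain the count quoted above. The function name `nonprojective_tokens` is hypothetical, the first head list is read off the `DOM` attributes of the sample sentence in the Sample section, and the second is an invented toy tree.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of nonprojectively attached tokens.

    heads[i - 1] is the head of token i; 0 marks the artificial root.
    The input is assumed to be a well-formed tree (no cycles).
    """
    def is_descendant(node, ancestor):
        if ancestor == 0:
            return True  # every token descends from the artificial root
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # Tokens strictly between the dependent and its head must all
        # be dominated by the head; otherwise the arc is nonprojective.
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Heads of the ten words of the sample sentence (DOM attributes, _root as 0).
sample_heads = [3, 1, 0, 6, 6, 3, 8, 6, 8, 9]
# Invented toy tree: token 1 depends on token 3 across the root token 2.
toy_heads = [3, 0, 2, 2]

print(nonprojective_tokens(sample_heads))
print(nonprojective_tokens(toy_heads))

# The percentage quoted in the paragraph above.
share = 1925 / 151_604
print(f"{share:.2%}")
```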

The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Basque:

Parser (Authors)	LAS	UAS
Malt (Nilsson et al.)	76.94	82.84
Titov et al.	75.49	81.93
Sagae	74.64	81.19
Carreras	75.75	81.11
Nakagawa	72.56	81.04
Malt (J. Hall et al.)	74.99	80.61
Johansson et al.	75.08	80.43

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

Parsing results on BDT-II have been published in Kepa Bengoetxea, Koldo Gojenola: Application of Different Techniques to Dependency Parsing of Basque. In: Proceedings of the First Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2010), NAACL Workshop, Los Angeles, California, USA, 2010. They report only Labeled Attachment Score (LAS) and their best system achieved LAS = 78.98%.
