
Institute of Formal and Applied Linguistics Wiki




user:zeman:treebanks:bg [2011/11/20 21:19] (current)
zeman created
Line 1: Line 1:
 +===== Bulgarian (bg) =====
 +
 +[[http://www.bultreebank.org/indexBTB.html|BulTreeBank]] (BTB)
 +
 +==== Versions ====
 +
 +  * Original BTB in native format
 +  * CoNLL 2006 (BulTreeBank-DP)
 +
 +The original BTB is based on HPSG (head-driven phrase structure grammar). The CoNLL version contains only the dependency information encoded in the HPSG BulTreeBank.
 +
 +==== Obtaining and License ====
 +
 +Only the CoNLL version seems to be distributed, but you may ask the creators about the HPSG version. To obtain the dependency version, print the [[http://www.bultreebank.org/dpbtb/dp-btb-license.htm|license]], sign it, scan it, send it to Kiril Simov (kivs (at) bultreebank (dot) org) and wait for the data. The license in short:
 +
 +  * research usage
 +  * no redistribution
 +  * cite [[http://www.bultreebank.org/dpbtb/dp-btb-license.htm|one publication]] in publications
 +
 +BTB was created by members of the [[http://lml.bas.bg/|Linguistic Modelling Department]] (Секция Лингвистично моделиране), Bulgarian Academy of Sciences (Българска академия на науките), Ул. Акад. Г. Бончев, Бл. 25 А, 1113 София, Bulgaria.
 +
 +==== References ====
 +
 +  * Website
 +    * http://www.bultreebank.org/indexBTB.html
 +  * Data
 +    * //no separate citation//
 +  * Principal publications
 +    * Kiril Simov, Petya Osenova, Alexander Simov, Milen Kouylekov: //Design and Implementation of the Bulgarian HPSG-based Treebank.// In: Erhard Hinrichs, Kiril Simov (eds.): Journal of Research on Language and Computation, Special Issue, vol. 2, no. 4, pp. 495–522, Kluwer Academic Publishers, ISSN 1570-7075. 2004.
 +  * Documentation
 +    * Kiril Simov, Petya Osenova, Milena Slavcheva: [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB-TR03: BulTreeBank Morphosyntactic Tagset]]. Technical report, 2004.
 +    * Petya Osenova, Kiril Simov: [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|BTB-TR05: BulTreeBank Stylebook]]. Technical report, 2004.
 +    * http://www.bultreebank.org/dpbtb/ provides a list of the dependency relation labels (s-tags) with brief descriptions.
 +
 +==== Domain ====
 +
 +Unknown. The treebank is described as “A set of Bulgarian sentences marked-up with detailed syntactic information. These sentences are mainly extracted from authentic Bulgarian texts. They are chosen with regards two criteria. First, they cover the variety of syntactic structures of Bulgarian. Second, they show the statistical distribution of these phenomena in real texts.” At least part of it is probably news (Novinar, Sega, Standart).
 +
 +==== Size ====
 +
 +The CoNLL 2006 version contains 196,151 tokens in 13,221 sentences, yielding 14.84 tokens per sentence on average (CoNLL 2006 data split: 190,217 tokens / 12,823 sentences training, 5,934 tokens / 398 sentences test).
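
The totals and the average above follow directly from the training/test split. A minimal sketch of the arithmetic, using only the figures quoted in the text (the variable names are illustrative):

```python
# Token and sentence counts of the CoNLL 2006 split, as quoted in the text.
train_tokens, train_sentences = 190_217, 12_823
test_tokens, test_sentences = 5_934, 398

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
avg_tokens_per_sentence = total_tokens / total_sentences

print(total_tokens, total_sentences, round(avg_tokens_per_sentence, 2))
# prints: 196151 13221 14.84
```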
 +
 +==== Inside ====
 +
 +The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset.
 +
 +The morphological analysis does not include lemmas. The morphosyntactic tags were probably assigned manually.
 +
 +The guidelines for syntactic annotation are documented in the BulTreeBank Stylebook [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]] (BTB-TR05). The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).
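
To make the column layout concrete, here is a minimal sketch of reading one token line. It assumes the usual ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and features written as `name=value` pairs joined by `|`, as in the samples on this page. The `Token` type and the function names are illustrative, not part of any distributed tool.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str   # "_" in this treebank, which has no lemmas
    cpos: str    # coarse part of speech (CPOS column)
    pos: str     # fine-grained part of speech (POS column)
    feats: Dict[str, str]
    head: int    # 0 marks the root
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Split 'gen=f|num=s|def=i' into a dict; '_' means no features."""
    if field == "_":
        return {}
    feats = {}
    for pair in field.split("|"):
        name, _, value = pair.partition("=")
        feats[name] = value
    return feats


def parse_line(line: str) -> Token:
    """Parse one tab-separated token line (assumed to have 10 columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


# Second token of the first training sentence, typed in by hand.
sample = "2\tтрета\t_\tM\tMo\tgen=f|num=s|def=i\t1\tmod\t1\tmod"
token = parse_line(sample)
print(token.pos, token.feats, token.head, token.deprel)
```

The last two columns (PHEAD, PDEPREL) are ignored here because in the samples they merely repeat HEAD and DEPREL.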
 +
 +==== Sample ====
 +
 +The first three sentences of the CoNLL 2006 training data:
 +
 +| 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT |
 +| 2 | трета | _ | M | Mo | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod |
 +| ||||||||||
 +| 1 | НАРОДНО | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod |
 +| 2 | СЪБРАНИЕ | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
 +| ||||||||||
 +| 1 | Народното | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod |
 +| 2 | събрание | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 3 | subj | 3 | subj |
 +| 3 | осъществява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 0 | ROOT | 0 | ROOT |
 +| 4 | законодателната | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 5 | mod | 5 | mod |
 +| 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj |
 +| 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj |
 +| 7 | упражнява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 3 | conjarg | 3 | conjarg |
 +| 8 | парламентарен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 9 | mod | 9 | mod |
 +| 9 | контрол | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 7 | obj | 7 | obj |
 +| 10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct |
 +
 +The first three sentences of the CoNLL 2006 test data:
 +
 +| 1 | Единственото | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod |
 +| 2 | решение | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
 +| ||||||||||
 +| 1 | Ерик | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
 +| 2 | Франк | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod |
 +| 3 | Ръсел | _ | H | Hm | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod |
 +| ||||||||||
 +| 1 | Пълен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod |
 +| 2 | мрак | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
 +| 3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj |
 +| 4 | пълна | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 5 | mod | 5 | mod |
 +| 5 | самота | _ | N | Nc | _ | 2 | conjarg | 2 | conjarg |
 +| 6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct |
 +
 +==== Parsing ====
 +
 +Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
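
A count like this can be reproduced with a short script. The sketch below assumes one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head. The exact definition behind the figure above is not stated, so treat this as an illustration only; the function names are hypothetical.

```python
from typing import List


def is_dominated_by(node: int, ancestor: int, heads: List[int]) -> bool:
    """True if `ancestor` lies on the head path from `node` to the root (0)."""
    if ancestor == 0:
        return True  # the artificial root dominates every token
    steps = 0
    while node != 0 and steps < len(heads):  # step limit guards against cycles
        if node == ancestor:
            return True
        node = heads[node]
        steps += 1
    return False


def count_nonprojective(sentence_heads: List[int]) -> int:
    """Count nonprojectively attached tokens; heads are 1-based, 0 = root."""
    heads = [0] + list(sentence_heads)  # shift so that heads[i] belongs to token i
    count = 0
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = min(dep, head), max(dep, head)
        if any(not is_dominated_by(k, head, heads) for k in range(lo + 1, hi)):
            count += 1
    return count


# HEAD column of the third training sentence shown in the sample above.
print(count_nonprojective([2, 3, 0, 5, 3, 3, 3, 9, 7, 3]))
# A made-up crossing structure: token 2 attaches to 4 across the root token 3.
print(count_nonprojective([3, 4, 0, 3]))
# Share of nonprojective tokens quoted in the text.
print(round(100 * 747 / 196_151, 2))
```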
 +
 +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian:
 +
 +^ Parser (Authors) ^ LAS ^ UAS ^
 +| MST (McDonald et al.) | 87.57 | 92.04 |
 +| Malt (Nivre et al.) | 87.41 | 91.72 |
 +| Nara (Yuchang Cheng) | 86.34 | 91.30 |
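
For orientation, LAS (labeled attachment score) is the percentage of scored tokens with both the correct head and the correct label, and UAS (unlabeled attachment score) is the percentage with the correct head. The sketch below shows how such scores can be computed while skipping punctuation. It assumes that a token counts as punctuation when every character of its form belongs to a Unicode punctuation category; that criterion and all names are assumptions made for illustration, and this is not the shared task's evaluation script.

```python
import unicodedata
from typing import List, Tuple

Row = Tuple[str, int, str]  # (form, head, deprel)


def is_punct(form: str) -> bool:
    """Assumed criterion: all characters are Unicode punctuation (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold: List[Row], pred: List[Row]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over non-punctuation tokens."""
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punct(form):
            continue  # punctuation tokens are not scored
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100 * las_hits / scored, 100 * uas_hits / scored


# Gold annotation of the third test sentence shown in the sample above.
gold = [("Пълен", 2, "mod"), ("мрак", 0, "ROOT"), ("и", 2, "conj"),
        ("пълна", 5, "mod"), ("самота", 2, "conjarg"), (".", 2, "punct")]
# A made-up prediction: one wrong label, one wrong head, wrong punctuation head.
pred = [("Пълен", 2, "mod"), ("мрак", 0, "ROOT"), ("и", 2, "mod"),
        ("пълна", 2, "mod"), ("самота", 2, "conjarg"), (".", 5, "punct")]

las, uas = attachment_scores(gold, pred)
print(las, uas)
```

The wrong head on the final period does not affect either score, which is the practical effect of excluding punctuation.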
  
