
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:sk [2014/03/01 22:35]
zeman created
user:zeman:treebanks:sk [2014/03/03 12:00]
zeman Inside.
Line 35:
 ==== Size ====
  
-50,000 sentences
-
-The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test).
+The treebank reportedly contains about 50000 sentences. In HamleDT, we are currently experimenting with a subset that contains Annotator 1 annotations of documents that have manual morphological annotation, and of Wikipedia (for which the source of morphological annotation has not been confirmed). This subset contains 479473 tokens and 26149 sentences, yielding 18.34 tokens per sentence on average. We have not yet split the data into training and test parts.
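
The average sentence length quoted for the HamleDT subset can be rechecked with a few lines of Python. This is only a minimal sketch: the function name `tokens_per_sentence` is made up for illustration and is not part of HamleDT or any treebank tooling; the two counts are the ones given in the text above.

```python
def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length, rounded to two decimals as on this page."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return round(tokens / sentences, 2)


# Counts reported for the HamleDT subset of the Slovak treebank.
SUBSET_TOKENS = 479473
SUBSET_SENTENCES = 26149

subset_average = tokens_per_sentence(SUBSET_TOKENS, SUBSET_SENTENCES)
print(subset_average)
```

The same function could be reused for the training and test parts once the data has been split.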
  
 ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sl::conll|DZ Interset]] to inspect the CoNLL tagset.
+The syntactic annotation scheme has been taken from the analytical layer of the (Czech) Prague Dependency Treebank 2.0. The set of syntactic tags (dependency relation labels) is identical to the set of analytical functions (afuns) in PDT. The morphosyntactic tagset is that of the Slovak National Corpus. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sk::snk|DZ Interset]] to inspect the tagset.
+
+A significant part of the treebank (but not all) has been syntactically annotated in parallel by two independent annotators. (In the data we have for HamleDT, these parallel annotations have not been merged.)
  
-The morphological analysis includes lemmas. The morphosyntactic tags have been assigned (probably) manually.
+The morphological analysis includes lemmas. The morphosyntactic tags and lemmas have been assigned manually only in part of the treebank: Orwell1984, MojaPrvaLaska, Mucska, MilosFerko, MilosFerko2, Patmos, PsiaKoza “and some others.”
  
 ==== Sample ====
