===== Slovak (sk) =====

[[http://korpus.sk/|Slovak Treebank]] (part of Slovak National Corpus / Slovenský národný korpus)

==== Versions ====

  * [[:format-pml|PML format]] as in Czech PDT 2.0 (.a, .m, .w files)

The Slovak Treebank (ST) is natively dependency-based, modeled after the Prague Dependency Treebank of [[cs|Czech]].

==== Obtaining and License ====

ST has not been publicly released. Contact [[http://kassiopeia.juls.savba.sk/~garabik/|Radovan Garabík]] to inquire about availability and license terms.

ST was created by members of the [[http://juls.savba.sk/|Ľudovít Štúr Language Institute]] (Jazykovedný ústav Ľudovíta Štúra), Panská 26, 81364 Bratislava, Slovakia.

==== References ====

  * Website
    * http://korpus.sk/ (Slovenský národný korpus), not much about syntactic annotation
    * http://metashare.tilde.com/repository/browse/slovak-treebank/36e46d0a649311e292cd00163e00007874586ecb0acd48909e54babd7c5e7bc2/ (Short info at MetaShare)
    * https://metashare.korpus.sk/repository/browse/slovak-treebank/36e46d0a649311e292cd00163e00007874586ecb0acd48909e54babd7c5e7bc2/
  * Data
    * //no separate citation//
  * Principal publications
    * Mária Šimková, Radovan Garabík: [[http://kassiopeia.juls.savba.sk/~garabik/publications/18/syntax.pdf|Синтаксическая разметка в Словацком национальном корпусе]] (Syntactic annotation in the Slovak National Corpus). In: Труды международной конференции Корпусная лингвистика – 2006 (Proceedings of the international conference Corpus Linguistics – 2006). Sankt-Petersburg: St. Petersburg University Press 2006, p. 389–394. ISBN 5-288-04181-4.
  * Documentation
    * [[http://korpus.sk/morpho.html|Morphological tagset of the Slovak National Corpus]]
    * [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/anal.html|The analytical layer of the Prague Dependency Treebank]]

==== Domain ====

Mixed.

==== Size ====

The treebank reportedly contains about 50,000 sentences. In HamleDT, we are currently experimenting with a subset that contains the Annotator 1 annotations of the documents that have manual morphological annotation, and of Wikipedia (for which the source of morphological annotation has not been confirmed). This subset contains 479,473 tokens and 26,149 sentences, yielding 18.34 tokens per sentence on average. We have not yet split the data into training and test parts.

==== Inside ====

The syntactic annotation scheme has been taken from the analytical layer of the (Czech) Prague Dependency Treebank 2.0. The set of syntactic tags (dependency relation labels) is identical to the set of analytical functions (afuns) in PDT.

The morphosyntactic tagset is that of the Slovak National Corpus. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sk::snk|DZ Interset]] to inspect the tagset.

A significant part of the treebank (but not all of it) has been syntactically annotated in parallel by two independent annotators. (In the data we have for HamleDT, these parallel annotations have not been merged.)

The morphological analysis includes lemmas. The morphosyntactic tags and lemmas have been assigned manually in only part of the treebank: Orwell1984, MojaPrvaLaska, Mucska, MilosFerko, MilosFerko2, Patmos, PsiaKoza “and some others.”

==== Sample ====

Beginning of the file ''Orwell1984/bsnk01_jankal_dok.fsnew.w'' (I have selected this text because we have the same part of Orwell's 1984 as our sample of [[sl#Sample|Slovenian]]):




 
  
 
 
The header retains the values ''csts'' and ''1''; the tokens follow in file order:

: ORWELL GEORGE deväťsto Tisíc osemdesiatštyri : Preklad Vojtek Juraj I . 1 a Bol deň aprílový ale jasný , chladný odbíjali hodiny trinástu

The same sentence on the m-layer:




 
  
  
   
  
 
 
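The header retains the value ''cs''. Each record gives the annotation source, the reference to the w-layer token, the word form, the lemma and the morphosyntactic tag. The letter in the reference (b, c, d, e, f) changes at each gap between record groups in the file.

^ Source ^ w-layer reference ^ Form ^ Lemma ^ Tag ^
| manual | w#w-.b.1 | : | : | Z |
| manual | w#w-.b.2 | ORWELL | orwell | SSms1:r |
| manual | w#w-.b.3 | GEORGE | george | SSms1:r |
| manual | w#w-.b.4 | deväťsto | deväťsto | NX |
| manual | w#w-.b.5 | Tisíc | tisíc | NX |
| manual | w#w-.b.6 | osemdesiatštyri | osemdesiatštyri | NX |
| manual | w#w-.c.1 | : | : | Z |
| manual | w#w-.c.2 | Preklad | preklad | SSis1 |
| manual | w#w-.c.3 | Vojtek | vojtek | SSms1:r |
| manual | w#w-.c.4 | Juraj | juraj | SSms1:r |
| manual | w#w-.d.1 | I | i | 0 |
| manual | w#w-.d.2 | . | . | Z |
| manual | w#w-.e.1 | 1 | 1 | 0 |
| manual | w#w-.f.1 | a | a | O |
| manual | w#w-.f.2 | Bol | byť | VLesci+ |
| manual | w#w-.f.3 | deň | deň | SSis1 |
| manual | w#w-.f.4 | aprílový | aprílový | AAis1x |
| manual | w#w-.f.5 | ale | ale | O |
| manual | w#w-.f.6 | jasný | jasný | AAis1x |
| manual | w#w-.f.7 | , | , | Z |
| manual | w#w-.f.8 | chladný | chladný | AAis1x |
| manual | w#w-.f.9 | odbíjali | odbíjať | VLepcf+ |
| manual | w#w-.f.10 | hodiny | hodiny | SSfp1 |
| manual | w#w-.f.11 | trinástu | trinásty | NAfs4 |
| manual | w#w-.f.12 | . | . | Z |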

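Records of this kind can be read with a few lines of Python. The following is only a sketch under stated assumptions: it assumes that the m-layer file follows the PDT 2.0 PML conventions (sentences as ''s'' elements containing ''m'' elements with ''form'', ''lemma'' and ''tag'' children, in the namespace ''http://ufal.mff.cuni.cz/pdt/pml/''). The embedded snippet is a hypothetical reconstruction for illustration; its ''id'' values are placeholders, not taken from the treebank.

```python
import xml.etree.ElementTree as ET

# Hypothetical m-layer fragment; element names are assumed from PDT 2.0 PML,
# the id attributes are illustrative placeholders.
SAMPLE = """<mdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <s id="s-example">
  <m id="m-example-1">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.2</w.rf>
   <form>Bol</form>
   <lemma>byť</lemma>
   <tag>VLesci+</tag>
  </m>
  <m id="m-example-2">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.3</w.rf>
   <form>deň</form>
   <lemma>deň</lemma>
   <tag>SSis1</tag>
  </m>
 </s>
</mdata>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_m_layer(xml_text):
    """Return a list of sentences, each a list of (form, lemma, tag) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    for element in root.iter():
        if local_name(element.tag) != "s":
            continue
        tokens = []
        for child in element:
            if local_name(child.tag) != "m":
                continue
            fields = {local_name(f.tag): (f.text or "") for f in child}
            tokens.append((fields.get("form"), fields.get("lemma"), fields.get("tag")))
        sentences.append(tokens)
    return sentences


sentences = read_m_layer(SAMPLE)
for form, lemma, tag in sentences[0]:
    print(form, lemma, tag, sep="\t")
```

Matching on local names rather than on fully qualified names keeps the sketch usable even if the namespace of the real files differs from the one assumed here.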
The same sentence on the a-layer:

==== Parsing ====

Nonprojectivities are relatively rare: 1.73 % of nodes are attached nonprojectively.

Parsing results: we obtained an unlabeled attachment score (UAS) of 80.73 % using MaltParser with the stack-lazy algorithm. No other published parsing results are known to us.
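
For readers unfamiliar with the two measures, the sketch below shows one common way to compute them from head indices. It is an illustration only, not the code used in HamleDT or MaltParser. The function names and the toy trees are hypothetical; heads are 1-based token positions with 0 standing for the artificial root, and the input is assumed to be a well-formed tree.

```python
def is_descendant(node, ancestor, heads):
    """True if `ancestor` lies on the path from `node` up to the root (0)."""
    current = node
    for _ in range(len(heads) + 1):  # bound the walk in case of a malformed tree
        if current == ancestor:
            return True
        if current == 0:
            return False
        current = heads[current - 1]
    return False


def nonprojective_nodes(heads):
    """Return tokens whose attachment to their head is nonprojective.

    A token is attached nonprojectively if some token lying between it and
    its head in the word order is not a descendant of that head.
    """
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # everything descends from the artificial root
        low, high = sorted((dep, head))
        if any(not is_descendant(mid, head, heads) for mid in range(low + 1, high)):
            result.append(dep)
    return result


def uas(gold_heads, predicted_heads):
    """Unlabeled attachment score: share of tokens with the correct head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Toy tree: token 2 depends on token 4, but token 3 (between them) hangs on the root.
gold = [3, 4, 0, 3]
predicted = [3, 3, 0, 3]
print(nonprojective_nodes(gold))  # [2]
print(uas(gold, predicted))       # 0.75
```

Over a whole treebank, the share of nonprojectively attached nodes is the number of such tokens divided by the total number of tokens, and UAS is computed over all tokens of the test data rather than averaged per sentence.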