
Institute of Formal and Applied Linguistics Wiki






Slovak (sk)

Slovak Treebank (part of Slovak National Corpus / Slovenský národný korpus)

Versions

The Slovak Treebank (ST) is natively dependency-based and modeled after the Prague Dependency Treebank (PDT) of Czech.

Obtaining and License

ST has not been publicly released. Contact Radovan Garabík to inquire about availability and license terms.

ST was created by members of the Ľudovít Štúr Language Institute (Jazykovedný ústav Ľudovíta Štúra), Panská 26, 81364 Bratislava, Slovakia.

References

Domain

Mixed.

Size

The treebank reportedly contains about 50000 sentences. In HamleDT, we are currently experimenting with a subset that contains the annotations by Annotator 1 for two groups of documents: those that have manual morphological annotation, and Wikipedia (for which the source of the morphological annotation has not been confirmed). This subset contains 479473 tokens in 26149 sentences, which gives 18.34 tokens per sentence on average. We have not yet split the data into training and test parts.
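
The average sentence length can be re-derived from the two counts quoted above. The following is only a small arithmetic check; the variable names are ours and are not part of any HamleDT tool.

```python
# Counts quoted above for the HamleDT subset of the Slovak Treebank.
tokens = 479473
sentences = 26149

# Average sentence length in tokens.
average = tokens / sentences
print(f"{average:.2f} tokens per sentence")
```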

Inside

The syntactic annotation scheme has been taken from the analytical layer of the (Czech) Prague Dependency Treebank 2.0. The set of syntactic tags (dependency relation labels) is identical to the set of analytical functions (afuns) in PDT. The morphosyntactic tagset is that of the Slovak National Corpus. Use DZ Interset to inspect the tagset.

A significant part of the treebank (but not all) has been syntactically annotated in parallel by two independent annotators. (In the data we have for HamleDT these parallel annotations have not been merged.)

The morphological analysis includes lemmas. The morphosyntactic tags and lemmas have been assigned manually only in part of the treebank: Orwell1984, MojaPrvaLaska, Mucska, MilosFerko, MilosFerko2, Patmos, PsiaKoza “and some others.”

Sample

Beginning of the w-layer file Orwell1984/bsnk01_jankal_dok.fsnew.w (I have selected this text because we have the same part of Orwell's 1984 as our sample of Slovenian):

<?xml version="1.0" encoding="utf-8"?>
 
<wdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <head>
  <schema href="wdata_schema.xml" />
 </head>
 <meta>
  <original_format>csts</original_format>
 </meta>
 <doc continues="1">
  <docmeta></docmeta>
  <para>
   <othermarkup origin="csts/doc/p/@n">1</othermarkup>
   <w id="w-.b.1">
    <token>:</token>
   </w>
   <w id="w-.b.2">
    <token>ORWELL</token>
   </w>
   <w id="w-.b.3">
    <token>GEORGE</token>
   </w>
   <w id="w-.b.4">
    <token>deväťsto</token>
   </w>
   <w id="w-.b.5">
    <token>Tisíc</token>
   </w>
   <w id="w-.b.6">
    <token>osemdesiatštyri</token>
   </w>
   <w id="w-.c.1">
    <token>:</token>
   </w>
   <w id="w-.c.2">
    <token>Preklad</token>
   </w>
   <w id="w-.c.3">
    <token>Vojtek</token>
   </w>
   <w id="w-.c.4">
    <token>Juraj</token>
   </w>
   <w id="w-.d.1">
    <token>I</token>
   </w>
   <w id="w-.d.2">
    <token>.</token>
   </w>
   <w id="w-.e.1">
    <token>1</token>
   </w>
   <w id="w-.f.1">
    <token>a</token>
   </w>
   <w id="w-.f.2">
    <token>Bol</token>
   </w>
   <w id="w-.f.3">
    <token>deň</token>
   </w>
   <w id="w-.f.4">
    <token>aprílový</token>
   </w>
   <w id="w-.f.5">
    <token>ale</token>
   </w>
   <w id="w-.f.6">
    <token>jasný</token>
   </w>
   <w id="w-.f.7">
    <token>,</token>
   </w>
   <w id="w-.f.8">
    <token>chladný</token>
   </w>
   <w id="w-.f.9">
    <token>odbíjali</token>
   </w>
   <w id="w-.f.10">
    <token>hodiny</token>
   </w>
   <w id="w-.f.11">
    <token>trinástu</token>
   </w>

The same text on the m-layer (morphological layer):

<?xml version="1.0" encoding="utf-8"?>
 
<mdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <head>
  <schema href="mdata_schema.xml" />
  <references>
   <reffile id="w" name="wdata" href="bsnk01_jankal_dok.fsnew.w" />
  </references>
 </head>
 <meta>
  <lang>cs</lang>
 </meta>
 <s id="m-.b">
  <m id="m-.b.1">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.b.1</w.rf>
   <form>:</form>
   <lemma>:</lemma>
   <tag>Z</tag>
  </m>
  <m id="m-.b.2">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.b.2</w.rf>
   <form>ORWELL</form>
   <lemma>orwell</lemma>
   <tag>SSms1:r</tag>
  </m>
  <m id="m-.b.3">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.b.3</w.rf>
   <form>GEORGE</form>
   <lemma>george</lemma>
   <tag>SSms1:r</tag>
  </m>
  <m id="m-.b.4">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.b.4</w.rf>
   <form>deväťsto</form>
   <lemma>deväťsto</lemma>
   <tag>NX</tag>
  </m>
  <m id="m-.b.5">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.b.5</w.rf>
   <form>Tisíc</form>
   <lemma>tisíc</lemma>
   <tag>NX</tag>
  </m>
  <m id="m-.b.6">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.b.6</w.rf>
   <form>osemdesiatštyri</form>
   <lemma>osemdesiatštyri</lemma>
   <tag>NX</tag>
  </m>
 </s>
 <s id="m-.c">
  <m id="m-.c.1">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.c.1</w.rf>
   <form>:</form>
   <lemma>:</lemma>
   <tag>Z</tag>
  </m>
  <m id="m-.c.2">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.c.2</w.rf>
   <form>Preklad</form>
   <lemma>preklad</lemma>
   <tag>SSis1</tag>
  </m>
  <m id="m-.c.3">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.c.3</w.rf>
   <form>Vojtek</form>
   <lemma>vojtek</lemma>
   <tag>SSms1:r</tag>
  </m>
  <m id="m-.c.4">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.c.4</w.rf>
   <form>Juraj</form>
   <lemma>juraj</lemma>
   <tag>SSms1:r</tag>
  </m>
 </s>
 <s id="m-.d">
  <m id="m-.d.1">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.d.1</w.rf>
   <form>I</form>
   <lemma>i</lemma>
   <tag>0</tag>
  </m>
  <m id="m-.d.2">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.d.2</w.rf>
   <form>.</form>
   <lemma>.</lemma>
   <tag>Z</tag>
  </m>
 </s>
 <s id="m-.e">
  <m id="m-.e.1">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.e.1</w.rf>
   <form>1</form>
   <lemma>1</lemma>
   <tag>0</tag>
  </m>
 </s>
 <s id="m-.f">
  <m id="m-.f.1">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.1</w.rf>
   <form>a</form>
   <lemma>a</lemma>
   <tag>O</tag>
  </m>
  <m id="m-.f.2">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.2</w.rf>
   <form>Bol</form>
   <lemma>byť</lemma>
   <tag>VLesci+</tag>
  </m>
  <m id="m-.f.3">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.3</w.rf>
   <form>deň</form>
   <lemma>deň</lemma>
   <tag>SSis1</tag>
  </m>
  <m id="m-.f.4">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.4</w.rf>
   <form>aprílový</form>
   <lemma>aprílový</lemma>
   <tag>AAis1x</tag>
  </m>
  <m id="m-.f.5">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.5</w.rf>
   <form>ale</form>
   <lemma>ale</lemma>
   <tag>O</tag>
  </m>
  <m id="m-.f.6">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.6</w.rf>
   <form>jasný</form>
   <lemma>jasný</lemma>
   <tag>AAis1x</tag>
  </m>
  <m id="m-.f.7">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.7</w.rf>
   <form>,</form>
   <lemma>,</lemma>
   <tag>Z</tag>
  </m>
  <m id="m-.f.8">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.8</w.rf>
   <form>chladný</form>
   <lemma>chladný</lemma>
   <tag>AAis1x</tag>
  </m>
  <m id="m-.f.9">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.9</w.rf>
   <form>odbíjali</form>
   <lemma>odbíjať</lemma>
   <tag>VLepcf+</tag>
  </m>
  <m id="m-.f.10">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.10</w.rf>
   <form>hodiny</form>
   <lemma>hodiny</lemma>
   <tag>SSfp1</tag>
  </m>
  <m id="m-.f.11">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.11</w.rf>
   <form>trinástu</form>
   <lemma>trinásty</lemma>
   <tag>NAfs4</tag>
  </m>
  <m id="m-.f.12">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.12</w.rf>
   <form>.</form>
   <lemma>.</lemma>
   <tag>Z</tag>
  </m>
 </s>
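
The m-layer structure shown above can be read with any namespace-aware XML parser. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`. The embedded fragment is a shortened copy of two tokens from the sample, closed so that it is well-formed; the function name `read_m_layer` is hypothetical and is not part of the treebank distribution or of HamleDT.

```python
import xml.etree.ElementTree as ET

# All PML elements live in this namespace (see the xmlns attribute in the sample).
PML = "{http://ufal.mff.cuni.cz/pdt/pml/}"

# Shortened, well-formed fragment with two tokens copied from the sample above.
M_LAYER = """<mdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <s id="m-.f">
  <m id="m-.f.2">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.2</w.rf>
   <form>Bol</form>
   <lemma>byť</lemma>
   <tag>VLesci+</tag>
  </m>
  <m id="m-.f.3">
   <src.rf>manual</src.rf>
   <w.rf>w#w-.f.3</w.rf>
   <form>deň</form>
   <lemma>deň</lemma>
   <tag>SSis1</tag>
  </m>
 </s>
</mdata>"""


def read_m_layer(xml_text):
    """Return a list of (sentence id, [(form, lemma, tag), ...]) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter(PML + "s"):
        tokens = [
            (m.findtext(PML + "form"),
             m.findtext(PML + "lemma"),
             m.findtext(PML + "tag"))
            for m in s.findall(PML + "m")
        ]
        sentences.append((s.get("id"), tokens))
    return sentences


for sentence_id, tokens in read_m_layer(M_LAYER):
    for form, lemma, tag in tokens:
        print(sentence_id, form, lemma, tag, sep="\t")
```

Note that the `m` elements are stored in the order of their ids, which is not necessarily the surface word order; the surface order is given by the `ord` values on the a-layer.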

The same text on the a-layer (analytical layer):

<?xml version="1.0" encoding="utf-8"?>
 
<adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <head>
  <schema href="adata_schema.xml" />
  <references>
   <reffile id="m" name="mdata" href="bsnk01_jankal_dok.fsnew.m" />
   <reffile id="w" name="wdata" href="bsnk01_jankal_dok.fsnew.w" />
  </references>
 </head>
 <trees>
  <LM id="a-.b">
   <s.rf>m#m-.b</s.rf>
   <ord>0</ord>
   <children>
    <LM id="a-.b.1">
     <m.rf>m#m-.b.1</m.rf>
     <afun>Coord</afun>
     <ord>3</ord>
     <children>
      <LM id="a-.b.2">
       <m.rf>m#m-.b.2</m.rf>
       <afun>ExD</afun>
       <is_member>1</is_member>
       <ord>2</ord>
       <children>
        <LM id="a-.b.3">
         <m.rf>m#m-.b.3</m.rf>
         <afun>Atr</afun>
         <ord>1</ord>
        </LM>
       </children>
      </LM>
      <LM id="a-.b.4">
       <m.rf>m#m-.b.4</m.rf>
       <afun>ExD</afun>
       <is_member>1</is_member>
       <ord>5</ord>
       <children>
        <LM id="a-.b.5">
         <m.rf>m#m-.b.5</m.rf>
         <afun>Atr</afun>
         <ord>4</ord>
        </LM>
        <LM id="a-.b.6">
         <m.rf>m#m-.b.6</m.rf>
         <afun>Atr</afun>
         <ord>6</ord>
        </LM>
       </children>
      </LM>
     </children>
    </LM>
   </children>
  </LM>
  <LM id="a-.c">
   <s.rf>m#m-.c</s.rf>
   <ord>0</ord>
   <children>
    <LM id="a-.c.1">
     <m.rf>m#m-.c.1</m.rf>
     <afun>Coord</afun>
     <ord>2</ord>
     <children>
      <LM id="a-.c.2">
       <m.rf>m#m-.c.2</m.rf>
       <afun>ExD</afun>
       <is_member>1</is_member>
       <ord>1</ord>
      </LM>
      <LM id="a-.c.3">
       <m.rf>m#m-.c.3</m.rf>
       <afun>ExD</afun>
       <is_member>1</is_member>
       <ord>4</ord>
       <children>
        <LM id="a-.c.4">
         <m.rf>m#m-.c.4</m.rf>
         <afun>Atr</afun>
         <ord>3</ord>
        </LM>
       </children>
      </LM>
     </children>
    </LM>
   </children>
  </LM>
  <LM id="a-.d">
   <s.rf>m#m-.d</s.rf>
   <ord>0</ord>
   <children>
    <LM id="a-.d.1">
     <m.rf>m#m-.d.1</m.rf>
     <afun>ExD</afun>
     <ord>1</ord>
     <children>
      <LM id="a-.d.2">
       <m.rf>m#m-.d.2</m.rf>
       <afun>AuxG</afun>
       <ord>2</ord>
      </LM>
     </children>
    </LM>
   </children>
  </LM>
  <LM id="a-.e">
   <s.rf>m#m-.e</s.rf>
   <ord>0</ord>
   <children>
    <LM id="a-.e.1">
     <m.rf>m#m-.e.1</m.rf>
     <afun>ExD</afun>
     <ord>1</ord>
    </LM>
   </children>
  </LM>
  <LM id="a-.f">
   <s.rf>m#m-.f</s.rf>
   <ord>0</ord>
   <children>
    <LM id="a-.f.1">
     <m.rf>m#m-.f.1</m.rf>
     <afun>Coord</afun>
     <ord>8</ord>
     <children>
      <LM id="a-.f.2">
       <m.rf>m#m-.f.2</m.rf>
       <afun>Pred</afun>
       <is_member>1</is_member>
       <ord>1</ord>
       <children>
        <LM id="a-.f.3">
         <m.rf>m#m-.f.3</m.rf>
         <afun>Pnom</afun>
         <ord>7</ord>
         <children>
          <LM id="a-.f.4">
           <m.rf>m#m-.f.4</m.rf>
           <afun>Atr</afun>
           <ord>6</ord>
           <children>
            <LM id="a-.f.5">
             <m.rf>m#m-.f.5</m.rf>
             <afun>Coord</afun>
             <ord>4</ord>
             <children>
              <LM id="a-.f.6">
               <m.rf>m#m-.f.6</m.rf>
               <afun>Atr</afun>
               <is_member>1</is_member>
               <ord>2</ord>
              </LM>
              <LM id="a-.f.7">
               <m.rf>m#m-.f.7</m.rf>
               <afun>AuxX</afun>
               <ord>3</ord>
              </LM>
              <LM id="a-.f.8">
               <m.rf>m#m-.f.8</m.rf>
               <afun>Atr</afun>
               <is_member>1</is_member>
               <ord>5</ord>
              </LM>
             </children>
            </LM>
           </children>
          </LM>
         </children>
        </LM>
       </children>
      </LM>
      <LM id="a-.f.9">
       <m.rf>m#m-.f.9</m.rf>
       <afun>Pred</afun>
       <is_member>1</is_member>
       <ord>10</ord>
       <children>
        <LM id="a-.f.10">
         <m.rf>m#m-.f.10</m.rf>
         <afun>Sb</afun>
         <ord>9</ord>
        </LM>
        <LM id="a-.f.11">
         <m.rf>m#m-.f.11</m.rf>
         <afun>Obj</afun>
         <ord>11</ord>
        </LM>
       </children>
      </LM>
     </children>
    </LM>
    <LM id="a-.f.12">
     <m.rf>m#m-.f.12</m.rf>
     <afun>AuxK</afun>
     <ord>12</ord>
    </LM>
   </children>
  </LM>
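
In the a-layer sample above, dependencies are encoded by nesting (`children`), while the surface position of each node is given by `ord`. The following sketch flattens one tree into (ord, head ord, afun) rows and restores the surface word order. It embeds a well-formed copy of the second tree of the sample (`a-.c`); the word forms are copied by hand from the m-layer sample, and the helper names (`walk`, `read_tree`, `FORMS`) are hypothetical. It also assumes that every child is wrapped in an `LM` element, as in the sample.

```python
import xml.etree.ElementTree as ET

PML = "{http://ufal.mff.cuni.cz/pdt/pml/}"

# Well-formed copy of the tree a-.c from the sample above.
A_LAYER = """<adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <trees>
  <LM id="a-.c">
   <s.rf>m#m-.c</s.rf>
   <ord>0</ord>
   <children>
    <LM id="a-.c.1">
     <m.rf>m#m-.c.1</m.rf>
     <afun>Coord</afun>
     <ord>2</ord>
     <children>
      <LM id="a-.c.2">
       <m.rf>m#m-.c.2</m.rf>
       <afun>ExD</afun>
       <is_member>1</is_member>
       <ord>1</ord>
      </LM>
      <LM id="a-.c.3">
       <m.rf>m#m-.c.3</m.rf>
       <afun>ExD</afun>
       <is_member>1</is_member>
       <ord>4</ord>
       <children>
        <LM id="a-.c.4">
         <m.rf>m#m-.c.4</m.rf>
         <afun>Atr</afun>
         <ord>3</ord>
        </LM>
       </children>
      </LM>
     </children>
    </LM>
   </children>
  </LM>
 </trees>
</adata>"""

# Word forms copied by hand from the m-layer sample (normally read from the .m file).
FORMS = {"m#m-.c.1": ":", "m#m-.c.2": "Preklad",
         "m#m-.c.3": "Vojtek", "m#m-.c.4": "Juraj"}


def walk(node, parent_ord, rows):
    """Append (ord, head ord, afun, m.rf) for every node except the technical root."""
    ord_ = int(node.findtext(PML + "ord"))
    if ord_ != 0:
        rows.append((ord_, parent_ord,
                     node.findtext(PML + "afun"),
                     node.findtext(PML + "m.rf")))
    children = node.find(PML + "children")
    if children is not None:
        for child in children.findall(PML + "LM"):
            walk(child, ord_, rows)


def read_tree(tree_root):
    """Return the rows of one tree sorted by surface position (ord)."""
    rows = []
    walk(tree_root, None, rows)
    return sorted(rows)


root = ET.fromstring(A_LAYER)
tree = root.find(PML + "trees").find(PML + "LM")
rows = read_tree(tree)
for ord_, head, afun, m_ref in rows:
    print(ord_, FORMS[m_ref], head, afun, sep="\t")
```

Sorting by `ord` yields the surface order of the tokens, which differs from the order of the node ids in the file.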

Parsing

Nonprojectivities are relatively rare: 1.73 % of nodes are attached nonprojectively.
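
To illustrate what a nonprojective attachment is, here is a minimal sketch. It assumes the common definition: a node is attached nonprojectively if some node lying between it and its head in the surface order is not dominated by that head. We do not claim that this is exactly the procedure behind the figure above. The function names are hypothetical; `SENTENCE_F` maps each `ord` to the `ord` of its head and was read off the last tree (`a-.f`) of the a-layer sample, with 0 standing for the technical root.

```python
def is_descendant(node, ancestor, heads):
    """True if `ancestor` lies on the path from `node` up to the root (0)."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_nodes(heads):
    """Return the nodes whose attachment to their head is nonprojective."""
    result = []
    for child, head in heads.items():
        lo, hi = sorted((child, head))
        # Every node strictly between child and head must be dominated by head.
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            result.append(child)
    return sorted(result)


# ord -> ord of the head, read off the tree a-.f in the sample above.
SENTENCE_F = {1: 8, 2: 4, 3: 4, 4: 6, 5: 4, 6: 7,
              7: 1, 8: 0, 9: 10, 10: 8, 11: 10, 12: 0}

# A made-up three-node tree in which node 1 depends on node 3 across the root's child 2.
TOY = {1: 3, 2: 0, 3: 2}

print(nonprojective_nodes(SENTENCE_F))
print(nonprojective_nodes(TOY))
```

Under this definition the sample sentence is fully projective, while node 1 of the made-up tree is attached nonprojectively.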

Parsing results: we obtained an unlabeled attachment score (UAS) of 80.73 % using MaltParser with the stack-lazy algorithm. We are not aware of any other published parsing results.
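
For readers unfamiliar with the metric, UAS is the percentage of tokens that receive the correct head, regardless of the dependency label. The sketch below shows the computation on a single sentence. The gold heads are those of the tree `a-.c` from the a-layer sample; the predicted heads are invented for illustration and are not actual parser output. Whether punctuation was counted in the score reported above is not stated here, so this sketch simply counts every token.

```python
def uas(gold_heads, predicted_heads):
    """Unlabeled attachment score in percent over parallel lists of head indices."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted sequences differ in length")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return 100.0 * correct / len(gold_heads)


# Heads of the tokens "Preklad : Juraj Vojtek" in surface order (0 = root).
gold = [2, 0, 4, 2]
# Invented prediction with one wrong head (third token).
predicted = [2, 0, 2, 2]

print(f"UAS = {uas(gold, predicted):.2f} %")
```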

