Institute of Formal and Applied Linguistics Wiki

This is an old revision of the document!


Slovene (sl)

Slovene Dependency Treebank (SDT)

Versions

SDT is natively dependency-based, modeled after the Prague Dependency Treebank of Czech.

Obtaining and License

SDT in all data formats is freely downloadable from http://nl.ijs.si/sdt/data/. The license in short:

SDT was created by members of the Institut “Jožef Stefan”, Jamova cesta 39, 1000 Ljubljana, Slovenia.

References

Domain

Fiction (the Slovene translation of Orwell's “1984” from the Multext-East corpus).

Size

The CoNLL 2006 version contains 35140 tokens in 1936 sentences, yielding 18.15 tokens per sentence on average (CoNLL 2006 data split: 28750 tokens / 1534 sentences training, 6390 tokens / 402 sentences test).
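The figures above can be cross-checked with a few lines of arithmetic. The following is only a sketch that uses the counts quoted in this section; the variable names are illustrative.

```python
# Token and sentence counts as quoted in the Size section.
train_tokens, train_sentences = 28750, 1534
test_tokens, test_sentences = 6390, 402

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
avg_len = total_tokens / total_sentences

print(total_tokens, total_sentences)  # 35140 1936
print(round(avg_len, 2))              # 18.15
```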

Inside

The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a 1-1 mapping between the SDT positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.

The morphological analysis includes lemmas. The morphosyntactic tags were probably assigned manually.

Sample

The first sentence of the treebank in the TEI-compliant XML format:

    <text id="Osl." lang="sl">
      <body>
        <div type="part" id="Osl.1">
          <div type="chapter" id="Osl.1.2">
            <p id="Osl.1.2.2">
              <s id="Osl.1.2.2.1">
                <w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w>
                <w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w>
                <w id="s1t3" afun="Atr" parallel="Co" dep="s1t4" lemma="jasen" ana="Afpmsnn">jasen</w>
                <c id="s1t4" afun="Coord" dep="s1t7">,</c>
                <w id="s1t5" afun="Atr" parallel="Co" dep="s1t4" lemma="mrzel" ana="Afpmsnn">mrzel</w>
                <w id="s1t6" afun="Atr" dep="s1t7" lemma="aprilski" ana="Aopmsn">aprilski</w>
                <w id="s1t7" afun="Sb" dep="s1t1" lemma="dan" ana="Ncmsn">dan</w>
                <w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w>
                <w id="s1t9" afun="Sb" dep="s1t11" lemma="ura" ana="Ncfpn">ure</w>
                <w id="s1t10" afun="AuxV" dep="s1t11" lemma="biti" ana="Vcip3p--n">so</w>
                <w id="s1t11" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vmps-pfa">bile</w>
                <w id="s1t12" afun="Obj" dep="s1t11" lemma="trinajst" ana="Mcnpnl">trinajst</w>
                <c id="s1t13" afun="AuxK" dep="root">.</c>
              </s>
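
The dependency structure in this format can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the sentence above (four of the thirteen tokens). It assumes that token ids always end in `t` followed by the token's position in the sentence, as they do in the sample, and that `dep="root"` marks attachment to the artificial root; the helper `token_index` is hypothetical and not part of any SDT tooling.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the sample sentence (tokens 1, 2, 8 and 13 only).
FRAGMENT = """
<s id="Osl.1.2.2.1">
  <w id="s1t1" afun="Pred" parallel="Co" dep="s1t8" lemma="biti" ana="Vcps-sma">Bil</w>
  <w id="s1t2" afun="AuxV" dep="s1t1" lemma="biti" ana="Vcip3s--n">je</w>
  <w id="s1t8" afun="Coord" dep="root" lemma="in" ana="Ccs">in</w>
  <c id="s1t13" afun="AuxK" dep="root">.</c>
</s>
"""

def token_index(ref: str) -> int:
    """Map an id such as 's1t8' to 8, and 'root' to 0 (assumed id pattern)."""
    if ref == "root":
        return 0
    match = re.search(r"t(\d+)$", ref)
    if match is None:
        raise ValueError(f"unexpected token reference: {ref!r}")
    return int(match.group(1))

sentence = ET.fromstring(FRAGMENT.strip())

# One (index, form, head index, afun) tuple per <w> or <c> element.
rows = [
    (token_index(el.get("id")), el.text, token_index(el.get("dep")), el.get("afun"))
    for el in sentence
]

for row in rows:
    print(row)
```

The resulting head indices match the HEAD column of the CoNLL 2006 version of the same sentence.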

The first sentence of the CoNLL 2006 training data:

1 Bil biti Verb Verb-copula VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active 8 Pred _ _
2 je biti Verb Verb-copula VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no 1 AuxV _ _
3 jasen jasen Adjective Adjective-qualificative Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no 4 Atr _ _
4 , , PUNC PUNC _ 7 Coord _ _
5 mrzel mrzel Adjective Adjective-qualificative Degree=positive|Gender=masculine|Number=singular|Case=nominative|Definiteness=no 4 Atr _ _
6 aprilski aprilski Adjective Adjective-ordinal Degree=positive|Gender=masculine|Number=singular|Case=nominative 7 Atr _ _
7 dan dan Noun Noun-common Gender=masculine|Number=singular|Case=nominative 1 Sb _ _
8 in in Conjunction Conjunction-coordinating Formation=simple 0 Coord _ _
9 ure ura Noun Noun-common Gender=feminine|Number=plural|Case=nominative 11 Sb _ _
10 so biti Verb Verb-copula VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no 11 AuxV _ _
11 bile biti Verb Verb-main VForm=participle|Tense=past|Number=plural|Gender=feminine|Voice=active 8 Pred _ _
12 trinajst trinajst Numeral Numeral-cardinal Gender=neuter|Number=plural|Case=nominative|Form=letter 11 Obj _ _
13 . . PUNC PUNC _ 0 AuxK _ _
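
To show how the CPOS, POS and FEAT columns fit together, here is a minimal sketch that splits one line of the sample into the ten CoNLL 2006 columns and turns the feature string into a dictionary. The `Token` type and `parse_line` function are illustrative names, not part of any official tool. The code splits on any whitespace because the sample on this page is shown with spaces; the distributed files are expected to be tab-separated.

```python
from typing import Dict, NamedTuple

class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str

def parse_line(line: str) -> Token:
    """Parse one CoNLL 2006 token line; the last two columns (PHEAD, PDEPREL) are ignored."""
    cols = line.split()
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    tid, form, lemma, cpos, pos, feats, head, deprel = cols[:8]
    # "_" means that the token has no features (e.g. punctuation).
    feat_dict = {} if feats == "_" else dict(f.split("=", 1) for f in feats.split("|"))
    return Token(int(tid), form, lemma, cpos, pos, feat_dict, int(head), deprel)

first = parse_line(
    "1 Bil biti Verb Verb-copula "
    "VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active "
    "8 Pred _ _"
)
last = parse_line("13 . . PUNC PUNC _ 0 AuxK _ _")

print(first.cpos, first.pos, first.feats["Gender"], first.head)
print(last.feats, last.deprel)
```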

The first sentence of the CoNLL 2006 test data:

1 Na na Adposition Adposition-preposition Formation=simple|Case=locative 5 AuxP _ _
2 hrbtu hrbet Noun Noun-common Gender=masculine|Number=singular|Case=locative 1 Adv _ _
3 je biti Verb Verb-copula VForm=indicative|Tense=present|Person=third|Number=singular|Negative=no 5 AuxV _ _
4 lahko lahko Adverb Adverb-general Degree=positive 5 AuxY _ _
5 čutil čutiti Verb Verb-main VForm=participle|Tense=past|Number=singular|Gender=masculine|Voice=active 0 Pred _ _
6 , , PUNC PUNC _ 7 AuxX _ _
7 da da Conjunction Conjunction-subordinating Formation=simple 5 AuxC _ _
8 vsi ves Pronoun Pronoun-general Gender=masculine|Number=plural|Case=nominative|Syntactic-Type=nominal 9 Sb _ _
9 upirajo upirati Verb Verb-main VForm=indicative|Tense=present|Person=third|Number=plural|Negative=no 7 Obj _ _
10 oči oči Noun Noun-common Gender=feminine|Number=plural|Case=accusative 9 Obj _ _
11 v v Adposition Adposition-preposition Formation=simple|Case=accusative 9 AuxP _ _
12 njegov njegov Pronoun Pronoun-possessive Person=third|Gender=masculine|Number=singular|Case=accusative|Owner-Number=singular|Owner-Gender=masculine|Syntactic-Type=adjectival|Animate=no 14 Atr _ _
13 modri moder Adjective Adjective-qualificative Degree=positive|Gender=masculine|Number=singular|Case=accusative|Definiteness=yes|Animate=no 14 Atr _ _
14 kombinezon kombinezon Noun Noun-common Gender=masculine|Number=singular|Case=accusative|Animate=no 11 Adv _ _
15 . . PUNC PUNC _ 0 AuxK _ _

Parsing

Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
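
As an illustration of what "attached nonprojectively" means, the sketch below marks a token as nonprojective if some token lying strictly between it and its head is not dominated by that head. This is one common definition and is assumed here; the page does not say how the count was obtained. The function name `nonprojective_tokens` is hypothetical, and the input is assumed to be a well-formed tree (no cycles) with 0 as the artificial root.

```python
from typing import List

def nonprojective_tokens(heads: List[int]) -> List[int]:
    """heads[i - 1] is the head of token i; 0 is the artificial root.

    Returns the 1-based indices of nonprojectively attached tokens.
    """
    def dominated_by(node: int, ancestor: int) -> bool:
        # Climb from node towards the root and look for ancestor on the way.
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result

# HEAD column of the first training sentence shown in the Sample section.
train_heads = [8, 1, 4, 7, 4, 7, 1, 0, 11, 11, 8, 11, 0]
print(nonprojective_tokens(train_heads))  # []

# Constructed toy tree: token 2 depends on token 4 across the root token 3.
toy_heads = [3, 4, 0, 3]
print(nonprojective_tokens(toy_heads))    # [2]

# The percentage quoted above.
print(round(100 * 747 / 196151, 2))       # 0.38
```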

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian:

Parser (Authors) LAS UAS
MST (McDonald et al.) 87.57 92.04
Malt (Nivre et al.) 87.41 91.72
Nara (Yuchang Cheng) 86.34 91.30
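
The two metrics can be sketched as follows. LAS counts tokens whose predicted head and dependency label are both correct, and UAS counts tokens whose predicted head is correct; punctuation tokens are skipped, as described above. The punctuation test used here (every character of the form has a Unicode punctuation category) is an assumption about the exact criterion, and the predicted analysis is invented purely for illustration.

```python
import unicodedata
from typing import List, Tuple

Analysis = List[Tuple[str, int, str]]  # (form, head, deprel) per token

def is_punctuation(form: str) -> bool:
    """True if every character of the form is Unicode punctuation (assumed criterion)."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)

def attachment_scores(gold: Analysis, predicted: Analysis) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, ignoring punctuation tokens."""
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    return 100 * las_hits / scored, 100 * uas_hits / scored

# Gold analysis: first six tokens of the test sentence in the Sample section.
gold = [("Na", 5, "AuxP"), ("hrbtu", 1, "Adv"), ("je", 5, "AuxV"),
        ("lahko", 5, "AuxY"), ("čutil", 0, "Pred"), (",", 7, "AuxX")]
# Hypothetical parser output: wrong head for "hrbtu", wrong label for "lahko",
# wrong head for the comma (which is not scored).
predicted = [("Na", 5, "AuxP"), ("hrbtu", 5, "Adv"), ("je", 5, "AuxV"),
             ("lahko", 5, "Adv"), ("čutil", 0, "Pred"), (",", 5, "AuxX")]

las, uas = attachment_scores(gold, predicted)
print(las, uas)  # 60.0 80.0
```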