Institute of Formal and Applied Linguistics Wiki


Estonian (et)

Eesti keele puudepank (Estonian Treebank)

Versions

Obtaining and License

The TIGER Treebank is freely downloadable after you accept the license terms by pressing a button.

Republication of the two CoNLL versions through LDC is planned but has not happened yet.

The license in short:

The TIGER Treebank was created by members of three institutes:

References

Domain

Mostly newswire (Frankfurter Rundschau).

Size

According to the project website, the TIGER Treebank version 1 contains approximately 700,000 tokens in 40,000 sentences. Version 2.1 contains approximately 900,000 tokens in 50,000 sentences.

The CoNLL 2006 version contains 705,304 tokens in 39,573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens / 39,216 sentences training, 5,694 tokens / 357 sentences test).

The CoNLL 2009 version contains 712,332 tokens in 40,020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36,020 sentences training, 32,033 tokens / 2,000 sentences development, 31,622 tokens / 2,000 sentences test).
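The totals and averages above can be re-derived from the per-split counts. The following is a small arithmetic check, not part of any official tooling; the dictionary layout and the function name `totals` are chosen here for illustration only.

```python
# Token and sentence counts per data split, copied from the text above.
splits = {
    "CoNLL 2006": {"training": (699_610, 39_216), "test": (5_694, 357)},
    "CoNLL 2009": {
        "training": (648_677, 36_020),
        "development": (32_033, 2_000),
        "test": (31_622, 2_000),
    },
}


def totals(parts):
    """Sum (tokens, sentences) over all parts of one data split."""
    tokens = sum(t for t, _ in parts.values())
    sentences = sum(s for _, s in parts.values())
    return tokens, sentences


for name, parts in splits.items():
    tokens, sentences = totals(parts)
    print(f"{name}: {tokens} tokens, {sentences} sentences, "
          f"{tokens / sentences:.2f} tokens per sentence")
```

The per-split counts add up to the stated totals, and the ratios round to 17.82 and 17.80 respectively.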

Inside

All versions contain semi-automatically assigned part-of-speech tags (Stuttgart-Tübingen Tagset, STTS) and syntactic structure. Lemmas and morphosyntactic features are available only in the newer versions (TIGER Treebank version 2 onwards, and CoNLL 2009). The part-of-speech tags are heavily context-dependent: for example, many words can be used both substantively (as pronouns) and attributively (as determiners), and the two uses receive different POS tags.

It is not clear what "semi-automatic" annotation means here (probably automatic tagging followed by manual correction), nor whether it also applies to the morphosyntactic annotation. In addition to these annotations, the CoNLL 2009 version contains automatically disambiguated lemmas, tags and features.

The original treebank is phrase-based, so the dependencies in the CoNLL versions must have been derived using a head-selection procedure. Besides the CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format.
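To illustrate what a head-selection procedure does, here is a minimal sketch applied to the first TIGER sentence from the Sample section, encoded by hand. The head rules are assumptions made for this example (an `HD` child wins; an `NP` is headed by its last child; `VROOT` is headed by its first phrasal child; otherwise the leftmost child), as is the relabelling of `--` edges to `PUNC`. They are not the rules used by the actual CoNLL conversions. For this one sentence they happen to reproduce the CoNLL 2006 sample; the CoNLL 2009 sample attaches "Ross" to "Perot" instead, so a different rule must have been applied there.

```python
# Terminals are token positions (int); nonterminals are (category, edges),
# where edges is a list of (edge_label, child) pairs.
PN = ("PN", [("PNC", 2), ("PNC", 3)])
NP = ("NP", [("NK", 6), ("NK", 7), ("NK", 8)])
S = ("S", [("SB", PN), ("HD", 4), ("MO", 5), ("PD", NP)])
VROOT = ("VROOT", [("--", 1), ("--", S), ("--", 9)])


def head_index(node):
    """Index of the head child of a nonterminal (illustrative rules only)."""
    cat, edges = node
    for i, (label, _child) in enumerate(edges):
        if label == "HD":                 # explicit head edge
            return i
    if cat == "NP":                       # assumption: last child heads an NP
        return len(edges) - 1
    if cat == "VROOT":                    # assumption: first phrasal child
        for i, (_label, child) in enumerate(edges):
            if not isinstance(child, int):
                return i
    return 0                              # assumption: leftmost child


def lexical_head(node):
    """Follow head children down to a terminal."""
    while not isinstance(node, int):
        node = node[1][head_index(node)][1]
    return node


def to_dependencies(node, governor=0, label="ROOT", deps=None):
    """Return {token: (head, label)} for the subtree rooted at node."""
    if deps is None:
        deps = {}
    if isinstance(node, int):
        deps[node] = (governor, "PUNC" if label == "--" else label)
        return deps
    head_pos = head_index(node)
    head_token = lexical_head(node)
    for i, (edge_label, child) in enumerate(node[1]):
        if i == head_pos:
            # The head child inherits the attachment of the whole phrase.
            to_dependencies(child, governor, label, deps)
        else:
            # Every other child depends on the lexical head of the phrase.
            to_dependencies(child, head_token, edge_label, deps)
    return deps


for token, (head, label) in sorted(to_dependencies(VROOT).items()):
    print(token, head, label)
```

Real conversions need per-category head tables and must also deal with discontinuous constituents, which this sketch ignores.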

Sample

The first sentence of TIGER Treebank 2.1 in the TIGER-XML format:

<s id="s1">
  <graph root="s1_VROOT">
    <terminals>
      <t id="s1_1" word="``" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_2" word="Ross" lemma="Ross" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_3" word="Perot" lemma="Perot" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_4" word="wäre" lemma="sein" pos="VAFIN" morph="3.Sg.Past.Subj" case="--" number="Sg" gender="--" person="3" degree="--" tense="Past" mood="Subj" />
      <t id="s1_5" word="vielleicht" lemma="vielleicht" pos="ADV" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_6" word="ein" lemma="ein" pos="ART" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_7" word="prächtiger" lemma="prächtig" pos="ADJA" morph="Pos.Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="Pos" tense="--" mood="--" />
      <t id="s1_8" word="Diktator" lemma="Diktator" pos="NN" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_9" word="''" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="PN">
        <edge label="PNC" idref="s1_2" />
        <edge label="PNC" idref="s1_3" />
      </nt>
      <nt id="s1_501" cat="NP">
        <edge label="NK" idref="s1_6" />
        <edge label="NK" idref="s1_7" />
        <edge label="NK" idref="s1_8" />
      </nt>
      <nt id="s1_502" cat="S">
        <edge label="SB" idref="s1_500" />
        <edge label="HD" idref="s1_4" />
        <edge label="MO" idref="s1_5" />
        <edge label="PD" idref="s1_501" />
      </nt>
      <nt id="s1_VROOT" cat="VROOT">
        <edge label="--" idref="s1_1" />
        <edge label="--" idref="s1_502" />
        <edge label="--" idref="s1_9" />
      </nt>
    </nonterminals>
  </graph>
</s>
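
A structure like the one above can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is only an illustration: the helper names `read_sentence`, `terminal_ids` and `words_under` are invented here, and the embedded copy of the sentence omits the per-feature attributes (`case`, `number`, and so on) for brevity.

```python
import xml.etree.ElementTree as ET

# The sample sentence, with the per-feature attributes left out.
SAMPLE = """<s id="s1">
  <graph root="s1_VROOT">
    <terminals>
      <t id="s1_1" word="``" lemma="--" pos="$(" morph="--" />
      <t id="s1_2" word="Ross" lemma="Ross" pos="NE" morph="Nom.Sg.Masc" />
      <t id="s1_3" word="Perot" lemma="Perot" pos="NE" morph="Nom.Sg.Masc" />
      <t id="s1_4" word="wäre" lemma="sein" pos="VAFIN" morph="3.Sg.Past.Subj" />
      <t id="s1_5" word="vielleicht" lemma="vielleicht" pos="ADV" morph="--" />
      <t id="s1_6" word="ein" lemma="ein" pos="ART" morph="Nom.Sg.Masc" />
      <t id="s1_7" word="prächtiger" lemma="prächtig" pos="ADJA" morph="Pos.Nom.Sg.Masc" />
      <t id="s1_8" word="Diktator" lemma="Diktator" pos="NN" morph="Nom.Sg.Masc" />
      <t id="s1_9" word="''" lemma="--" pos="$(" morph="--" />
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="PN">
        <edge label="PNC" idref="s1_2" />
        <edge label="PNC" idref="s1_3" />
      </nt>
      <nt id="s1_501" cat="NP">
        <edge label="NK" idref="s1_6" />
        <edge label="NK" idref="s1_7" />
        <edge label="NK" idref="s1_8" />
      </nt>
      <nt id="s1_502" cat="S">
        <edge label="SB" idref="s1_500" />
        <edge label="HD" idref="s1_4" />
        <edge label="MO" idref="s1_5" />
        <edge label="PD" idref="s1_501" />
      </nt>
      <nt id="s1_VROOT" cat="VROOT">
        <edge label="--" idref="s1_1" />
        <edge label="--" idref="s1_502" />
        <edge label="--" idref="s1_9" />
      </nt>
    </nonterminals>
  </graph>
</s>"""


def read_sentence(xml_text):
    """Return (root id, terminals, nonterminals) for one <s> element."""
    graph = ET.fromstring(xml_text).find("graph")
    terminals = {t.get("id"): dict(t.attrib) for t in graph.find("terminals")}
    nonterminals = {
        nt.get("id"): (nt.get("cat"),
                       [(e.get("label"), e.get("idref")) for e in nt.findall("edge")])
        for nt in graph.find("nonterminals")
    }
    return graph.get("root"), terminals, nonterminals


def terminal_ids(node_id, terminals, nonterminals):
    """Collect the ids of all terminals dominated by a node."""
    if node_id in terminals:
        return [node_id]
    ids = []
    for _label, child in nonterminals[node_id][1]:
        ids.extend(terminal_ids(child, terminals, nonterminals))
    return ids


def words_under(node_id, terminals, nonterminals):
    """Words dominated by a node, sorted into surface order."""
    order = {tid: i for i, tid in enumerate(terminals)}
    ids = sorted(terminal_ids(node_id, terminals, nonterminals), key=order.get)
    return [terminals[tid]["word"] for tid in ids]


root, terminals, nonterminals = read_sentence(SAMPLE)
print(words_under("s1_501", terminals, nonterminals))
```

Sorting by terminal position matters in general, because a phrase-structure node need not cover a contiguous span of tokens.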

The first sentence of the CoNLL 2006 training data:

1 `` _ $( $( _ 4 PUNC 4 PUNC
2 Ross _ NE NE _ 4 SB 4 SB
3 Perot _ NE NE _ 2 PNC 2 PNC
4 wäre _ VAFIN VAFIN _ 0 ROOT 0 ROOT
5 vielleicht _ ADV ADV _ 4 MO 4 MO
6 ein _ ART ART _ 8 NK 8 NK
7 prächtiger _ ADJA ADJA _ 8 NK 8 NK
8 Diktator _ NN NN _ 4 PD 4 PD
9 '' _ $( $( _ 4 PUNC 4 PUNC

The first sentence of the CoNLL 2006 test data:

1 Zwei _ CARD CARD _ 2 NK 2 NK
2 Themen _ NN NN _ 14 SB 14 SB
3 , _ $, $, _ 2 PUNC 2 PUNC
4 die _ PRELS PRELS _ 8 OA 8 OA
5 Perot _ NE NE _ 8 SB 8 SB
6 immer _ ADV ADV _ 7 MO 7 MO
7 wieder _ ADV ADV _ 8 MO 8 MO
8 anspricht _ VVFIN VVFIN _ 2 RC 2 RC
9 , _ $, $, _ 2 PUNC 2 PUNC
10 Rezession _ NN NN _ 2 APP 2 APP
11 und _ KON KON _ 10 CD 10 CD
12 Bürokratie _ NN NN _ 10 CJ 10 CJ
13 , _ $, $, _ 14 PUNC 14 PUNC
14 machen _ VVFIN VVFIN _ 0 ROOT 0 ROOT
15 ihnen _ PPER PPER _ 18 DA 18 DA
16 besonders _ ADV ADV _ 18 MO 18 MO
17 zu _ PTKZU PTKZU _ 18 PM 18 PM
18 schaffen _ VVINF VVINF _ 14 OC 14 OC
19 . _ $. $. _ 14 PUNC 14 PUNC
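
The CoNLL 2006 samples above use the ten columns of the CoNLL-X format: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD and PDEPREL. A minimal reader might look like the sketch below. The names `Token`, `read_conll2006` and `dependents` are invented for this example. The distributed files separate columns with tabs; the sketch splits on any whitespace so that it also works on the samples as displayed here, which assumes that no field contains a space.

```python
from collections import namedtuple

Token = namedtuple(
    "Token", "id form lemma cpostag postag feats head deprel phead pdeprel"
)

# First sentence of the CoNLL 2006 training data, as shown above.
SAMPLE = """\
1 `` _ $( $( _ 4 PUNC 4 PUNC
2 Ross _ NE NE _ 4 SB 4 SB
3 Perot _ NE NE _ 2 PNC 2 PNC
4 wäre _ VAFIN VAFIN _ 0 ROOT 0 ROOT
5 vielleicht _ ADV ADV _ 4 MO 4 MO
6 ein _ ART ART _ 8 NK 8 NK
7 prächtiger _ ADJA ADJA _ 8 NK 8 NK
8 Diktator _ NN NN _ 4 PD 4 PD
9 '' _ $( $( _ 4 PUNC 4 PUNC
"""


def read_conll2006(text):
    """Split text into sentences (lists of Token); blank lines separate sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        if len(columns) != len(Token._fields):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(Token(*columns))
    if current:
        sentences.append(current)
    return sentences


def dependents(sentence, head_id):
    """(form, relation) pairs of all tokens attached to the given head id."""
    return [(t.form, t.deprel) for t in sentence if t.head == head_id]


sentence = read_conll2006(SAMPLE)[0]
print(dependents(sentence, "4"))
```

All fields are kept as strings; convert `id` and `head` to integers if arithmetic on positions is needed.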

The first sentence of the CoNLL 2009 training data:

1 `` _ `` $( $( _ _ 4 4 PUNC PUNC _ _
2 Ross Ross Roß NE NN Nom|Sg|Masc _ 3 3 PNC PNC _ _
3 Perot Perot Perot NE NE Nom|Sg|Masc _ 4 4 SB SB _ _
4 wäre sein sein VAFIN VAFIN 3|Sg|Past|Subj *|Sg|Past|Subj 0 0 ROOT ROOT _ _
5 vielleicht vielleicht vielleicht ADV ADV _ _ 4 4 MO MO _ _
6 ein ein ein ART ART Nom|Sg|Masc *|Sg|* 8 8 NK NK _ _
7 prächtiger prächtig prächtig ADJA ADJA Pos|Nom|Sg|Masc *|*|*|* 8 8 NK NK _ _
8 Diktator Diktator Diktator NN NN Nom|Sg|Masc *|Sg|Masc 4 4 PD PD _ _
9 '' _ '' $( $( _ _ 4 4 PUNC PUNC _ _

The first sentence of the CoNLL 2009 development data:

1 Maschinenbau Maschinenbau Maschinenbau NN NN Nom|Sg|Masc *|Sg|Masc 0 4 ROOT NK _ _
2 / _ / $( $( _ _ 0 1 PUNC PUNC _ _
3 ( _ ( $( $( _ _ 0 4 PUNC PUNC _ _
4 Zusammenfassung Zusammenfassung Zusammenfassung NN NN Nom|Sg|Fem *|Sg|Fem 0 0 ROOT ROOT _ _
5 ) _ ) $( $( _ _ 0 1 PUNC PUNC _ _

The first sentence of the CoNLL 2009 test data:

1 Gegen gegen gegen APPR APPR _ _ _ _ _ _ _
2 eine ein ein ART ART Acc|Sg|Fem *|Sg|Fem _ _ _ _ _
3 Erweiterung Erweiterung Erweiterung NN NN Acc|Sg|Fem *|Sg|Fem _ _ _ _ _
4 ihrer ihr ihr PPOSAT PPOSAT Gen|Sg|Fem *|*|* _ _ _ _ _
5 Organisation Organisation Organisation NN NN Gen|Sg|Fem *|Sg|Fem _ _ _ _ _
6 zu zu zu APPR APPR _ _ _ _ _ _ _
7 einem ein ein ART ART Dat|Sg|Neut Dat|Sg|* _ _ _ _ _
8 sicherheitspolitischen sicherheitspolitisch sicherheitspolitisch ADJA ADJA Pos|Dat|Sg|Neut Pos|*|*|* _ _ _ _ _
9 Forum Forum Forum NN NN Dat|Sg|Neut *|Sg|Neut _ _ _ _ _
10 sprachen sprechen sprechen VVFIN VVFIN 3|Pl|Past|Ind *|Pl|Past|Ind _ _ _ _ Y
11 sich sich er|es|sie|Sie PRF PRF 3|Acc|Pl *|*|* _ _ _ _ _
12 die der d ART ART Nom|Pl|Masc *|*|* _ _ _ _ _
13 meisten meister meist PIAT PIAT Nom|Pl|Masc *|*|* _ _ _ _ _
14 Staaten Staat Staat NN NN Nom|Pl|Masc *|Pl|Masc _ _ _ _ _
15 beim bei beim APPRART APPRART Dat|Sg|Neut Dat|Sg|* _ _ _ _ _
16 Gipfeltreffen Gipfeltreffen Gipfeltreffen NN NN Dat|Sg|Neut *|*|Neut _ _ _ _ _
17 für für für APPR APPR _ _ _ _ _ _ _
18 Asiatisch-Pazifische asiatisch-pazifisch Asiatisch-Pazifische ADJA NN Pos|Acc|Sg|Fem *|*|* _ _ _ _ _
19 Wirtschaftskooperation Wirtschaftskooperation Wirtschaftskooperation NN NN Acc|Sg|Fem *|Sg|Fem _ _ _ _ _
20 ( _ ( $( $( _ _ _ _ _ _ _
21 Apec Apec _ NE NE Nom|Sg|Fem _ _ _ _ _ _
22 ) _ ) $( $( _ _ _ _ _ _ _
23 in in in APPR APPR _ _ _ _ _ _ _
24 Osaka Osaka Osaka NE NE Dat|Sg|Neut *|Sg|Neut _ _ _ _ _
25 aus aus aus PTKVZ PTKVZ _ _ _ _ _ _ _
26 . _ . $. $. _ _ _ _ _ _ _
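
The CoNLL 2009 format pairs each annotation column with an automatically predicted counterpart: ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED, followed by optional argument columns. The sketch below compares the two members of such a pair on the training sentence shown above. The function names are invented for this example, and splitting on whitespace is an assumption that works for the samples as displayed here (the distributed files are tab-separated). Rows with fewer columns, such as the test sample above, simply yield fewer keys.

```python
COLUMNS = ["id", "form", "lemma", "plemma", "pos", "ppos", "feat", "pfeat",
           "head", "phead", "deprel", "pdeprel", "fillpred", "pred"]

# First sentence of the CoNLL 2009 training data, as shown above.
TRAIN_SAMPLE = """\
1 `` _ `` $( $( _ _ 4 4 PUNC PUNC _ _
2 Ross Ross Roß NE NN Nom|Sg|Masc _ 3 3 PNC PNC _ _
3 Perot Perot Perot NE NE Nom|Sg|Masc _ 4 4 SB SB _ _
4 wäre sein sein VAFIN VAFIN 3|Sg|Past|Subj *|Sg|Past|Subj 0 0 ROOT ROOT _ _
5 vielleicht vielleicht vielleicht ADV ADV _ _ 4 4 MO MO _ _
6 ein ein ein ART ART Nom|Sg|Masc *|Sg|* 8 8 NK NK _ _
7 prächtiger prächtig prächtig ADJA ADJA Pos|Nom|Sg|Masc *|*|*|* 8 8 NK NK _ _
8 Diktator Diktator Diktator NN NN Nom|Sg|Masc *|Sg|Masc 4 4 PD PD _ _
9 '' _ '' $( $( _ _ 4 4 PUNC PUNC _ _
"""


def parse_token(line):
    """Map one row to a dict; columns beyond PRED are collected under 'apreds'."""
    columns = line.split()
    token = dict(zip(COLUMNS, columns))
    token["apreds"] = columns[len(COLUMNS):]
    return token


def parse_sentence(text):
    """Parse the non-empty lines of a single sentence."""
    return [parse_token(line) for line in text.splitlines() if line.strip()]


def disagreements(tokens, gold, predicted):
    """(form, gold value, predicted value) wherever the two columns differ."""
    return [(t["form"], t[gold], t[predicted])
            for t in tokens if t[gold] != t[predicted]]


tokens = parse_sentence(TRAIN_SAMPLE)
print(disagreements(tokens, "pos", "ppos"))
print(disagreements(tokens, "lemma", "plemma"))
```

On this sentence the only tag disagreement is "Ross" (NE versus predicted NN); the lemma columns differ for "Ross" and for the two quotation marks.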

Parsing

TIGER is a mildly nonprojective treebank: 15,875 of the 680,710 tokens in the CoNLL 2009 training and development datasets (2.33%) are attached nonprojectively.
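
The text does not say how nonprojective attachments were counted. The sketch below uses one common definition as an assumption: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. The function names are invented for this example, and the input is assumed to be a well-formed tree (no cycles).

```python
def dominates(head_of, ancestor, node):
    """True if ancestor is reached when walking from node up to the root (0)."""
    while node != 0:
        node = head_of[node]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads):
    """heads[i - 1] is the head of token i (1-based); 0 marks the root.

    Returns the ids of all tokens whose attachment is nonprojective.
    """
    head_of = {i + 1: h for i, h in enumerate(heads)}
    result = []
    for dep, head in head_of.items():
        low, high = sorted((dep, head))
        # Every token strictly between dep and head must descend from head.
        if any(not dominates(head_of, head, k) for k in range(low + 1, high)):
            result.append(dep)
    return result


# Gold heads of the first CoNLL 2006 training sentence: fully projective.
print(nonprojective_tokens([4, 4, 2, 0, 4, 8, 8, 4, 4]))

# Toy tree (not from the treebank): token 2 attaches to token 4 across
# token 3, which is not a descendant of token 4.
print(nonprojective_tokens([3, 4, 0, 3]))

# The percentage quoted above.
print(round(100 * 15875 / 680710, 2))
```

Summing `len(nonprojective_tokens(...))` over all sentences and dividing by the token count gives a treebank-level percentage under this definition.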

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for German:

Parser (Authors)        LAS     UAS
MST (McDonald et al.)   87.34   90.38
Riedel et al.           86.24   89.76
Basis (O'Neil)          85.36   89.16
Malt (Nivre et al.)     85.82   88.76
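
For reference, labeled and unlabeled attachment scores (LAS, UAS) with punctuation excluded can be computed as in the sketch below. It is not the official evaluation script. It assumes that punctuation is identified by STTS tags beginning with `$` (`$,`, `$.`, `$(`), whereas the shared-task script may define punctuation differently. The example input is the CoNLL 2009 development sentence from the Sample section, whose HEAD/DEPREL and PHEAD/PDEPREL columns differ.

```python
def attachment_scores(tokens, skip_punctuation=True):
    """Return (LAS, UAS) in percent.

    tokens: list of (pos, gold_head, gold_label, pred_head, pred_label).
    """
    scored = [t for t in tokens
              if not (skip_punctuation and t[0].startswith("$"))]
    if not scored:
        raise ValueError("no tokens left to score")
    # UAS counts correct heads; LAS additionally requires the correct label.
    uas_hits = sum(1 for _, gh, _, ph, _ in scored if gh == ph)
    las_hits = sum(1 for _, gh, gl, ph, pl in scored if gh == ph and gl == pl)
    return 100.0 * las_hits / len(scored), 100.0 * uas_hits / len(scored)


# Maschinenbau / ( Zusammenfassung )
dev_sentence = [
    ("NN", 0, "ROOT", 4, "NK"),
    ("$(", 0, "PUNC", 1, "PUNC"),
    ("$(", 0, "PUNC", 4, "PUNC"),
    ("NN", 0, "ROOT", 0, "ROOT"),
    ("$(", 0, "PUNC", 1, "PUNC"),
]

print(attachment_scores(dev_sentence))                          # punctuation excluded
print(attachment_scores(dev_sentence, skip_punctuation=False))  # all tokens scored
```

On this sentence only "Zusammenfassung" is attached correctly, so the scores are 50% over the two non-punctuation tokens and 20% over all five tokens. The gap shows how much the treatment of punctuation can move the numbers.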

The results of the CoNLL 2009 shared task are available online. They have been published in (Hajič et al., 2009). Unlabeled attachment score was not published. These are the best results for German:

Parser (Authors)   LAS
Bohnet             87.48
Merlo              87.29
Chen               86.24
Che                86.19
