user:zeman:treebanks:de [ufal wiki]

TIGER Treebank 1 (2003)
TIGER Treebank 2 (2005)
TIGER Treebank 2.1 (2007) in TIGER-XML or Negra export (text) format
CoNLL 2006
CoNLL 2009

The TIGER Treebank is freely downloadable after you accept the license terms by pressing a button.

Republication of the two CoNLL versions through the LDC is planned but has not happened yet.

The license in short:

non-commercial research and evaluation usage by academic or educational institutions
no redistribution
acknowledge the use of the corpus in publications

The TIGER Treebank was created by members of three institutes:

Department of Computational Linguistics and Phonetics (Computerlinguistik, CoLi), Saarland University (Universität des Saarlandes), Postfach 151150, D-66041 Saarbrücken, Germany.
Institute for Natural Language Processing (Institut für Maschinelle Sprachverarbeitung, IMS), University of Stuttgart (Universität Stuttgart), Azenbergstraße 12, D-70174 Stuttgart, Germany.
German Department (Institut für Germanistik), Philosophische Fakultät, Universität Potsdam, Am Neuen Palais 10, Haus 05, D-14469 Potsdam, Germany.

Website
- http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
Data
- no separate citation
Principal publications
- Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, George Smith: The TIGER Treebank. In: Proceedings of the Workshop on Treebanks and Linguistic Theories (TLT), Sozopol, Bulgaria, 2002.
- List of publications
Documentation
- Stuttgart-Tübingen Tagset (part of speech)
- Berthold Crysmann, Silvia Hansen-Schirra, George Smith, Dorothea Ziegler-Eisele: TIGER Morphologie-Annotationsschema, 2005.
- Stefanie Albert, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Hagen Hirschmann, Juliane Janitzek, Carolin Kirstein, Robert Langner, Lukas Michelbacher, Oliver Plaehn, Cordula Preis, Marcus Pußel, Marco Rower, Bettina Schrader, Anne Schwartz, George Smith, Hans Uszkoreit: TIGER Annotationsschema (syntax), 2003.
- The header of the XML version of the TIGER Treebank contains lists of various sorts of tags with brief explanation.

The texts are mostly newswire, taken from the Frankfurter Rundschau newspaper.

According to their website, the TIGER Treebank version 1 contains approximately 700,000 tokens in 40,000 sentences. Version 2.1 contains approximately 900,000 tokens in 50,000 sentences.

The CoNLL 2006 version contains 705,304 tokens in 39,573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens / 39,216 sentences training, 5,694 tokens / 357 sentences test).

The CoNLL 2009 version contains 712,332 tokens in 40,020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36,020 sentences training, 32,033 tokens / 2,000 sentences development, 31,622 tokens / 2,000 sentences test).
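The totals and averages quoted above can be re-derived from the per-split counts. The following is a minimal sketch; the names `SPLITS` and `totals` are introduced here for illustration only, and the numbers are copied from the two paragraphs above.

```python
# Per-split (tokens, sentences) counts as quoted in the text above.
SPLITS = {
    "CoNLL 2006": {"training": (699610, 39216), "test": (5694, 357)},
    "CoNLL 2009": {
        "training": (648677, 36020),
        "development": (32033, 2000),
        "test": (31622, 2000),
    },
}


def totals(parts):
    """Return (tokens, sentences, average tokens per sentence rounded to 2 places)."""
    tokens = sum(t for t, _ in parts.values())
    sentences = sum(s for _, s in parts.values())
    return tokens, sentences, round(tokens / sentences, 2)


for name, parts in SPLITS.items():
    print(name, totals(parts))
```

For both versions the sums agree with the stated totals.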

All versions contain semi-automatic part of speech tags (Stuttgart-Tübingen Tagset, STTS) and syntactic structure. Lemmas and morphosyntactic features are available only for newer versions (TIGER Treebank version 2 and onwards, and CoNLL 2009). The parts of speech are heavily context-dependent, e.g. many words can be used both substantively (pronouns) and attributively (determiners), which is distinguished by different POS tags.

It is not clear what "semi-automatic" annotation means here (probably automatic tagging followed by manual correction), nor whether it also applies to the morphosyntactic annotation. The CoNLL 2009 version also contains automatically disambiguated lemmas, tags and features.

The original treebank is phrase-based. The dependencies in the CoNLL versions must therefore have been derived using a head-selection procedure. Besides the CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format.
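The procedure actually used for the CoNLL conversions is not documented on this page. The sketch below only illustrates the general idea of head selection on a toy copy of the sample sentence's phrase structure. Everything in it is an assumption: the `TREE` encoding, the `FALLBACK` table and both function names are hypothetical. The rule "prefer a child labeled HD, otherwise take the first or last child depending on the category" was chosen only because it reproduces the heads of tokens 2 to 8 in the CoNLL 2006 sample shown later on this page.

```python
# Toy phrase structure: node id -> (category, [(edge label, child id), ...]).
# Terminals are the token numbers 1..9; nonterminal ids follow the TIGER-XML sample.
TREE = {
    500: ("PN", [("PNC", 2), ("PNC", 3)]),
    501: ("NP", [("NK", 6), ("NK", 7), ("NK", 8)]),
    502: ("S", [("SB", 500), ("HD", 4), ("MO", 5), ("PD", 501)]),
}

# Hypothetical fallback when a node has no HD child: pick the first or last child.
FALLBACK = {"PN": "first", "NP": "last"}


def lexical_head(node):
    """Return the terminal that heads `node`; a terminal heads itself."""
    if node not in TREE:
        return node
    cat, edges = TREE[node]
    for label, child in edges:
        if label == "HD":
            return lexical_head(child)
    pick = edges[0] if FALLBACK.get(cat, "first") == "first" else edges[-1]
    return lexical_head(pick[1])


def dependencies(root):
    """Map each terminal under `root` to (head, label); the top head gets (0, 'ROOT')."""
    deps = {lexical_head(root): (0, "ROOT")}

    def walk(node):
        if node not in TREE:
            return
        head = lexical_head(node)
        for label, child in TREE[node][1]:
            child_head = lexical_head(child)
            if child_head != head:
                # The child's lexical head depends on the parent's lexical head.
                deps[child_head] = (head, label)
            walk(child)

    walk(root)
    return deps


print(sorted(dependencies(502).items()))
```

The punctuation tokens, which hang directly under VROOT in the original, are left out of this toy tree.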

The first sentence of TIGER Treebank 2.1 in the TIGER-XML format:

<s id="s1">
  <graph root="s1_VROOT">
    <terminals>
      <t id="s1_1" word="``" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_2" word="Ross" lemma="Ross" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_3" word="Perot" lemma="Perot" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_4" word="wäre" lemma="sein" pos="VAFIN" morph="3.Sg.Past.Subj" case="--" number="Sg" gender="--" person="3" degree="--" tense="Past" mood="Subj" />
      <t id="s1_5" word="vielleicht" lemma="vielleicht" pos="ADV" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_6" word="ein" lemma="ein" pos="ART" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_7" word="prächtiger" lemma="prächtig" pos="ADJA" morph="Pos.Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="Pos" tense="--" mood="--" />
      <t id="s1_8" word="Diktator" lemma="Diktator" pos="NN" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
      <t id="s1_9" word="''" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="PN">
        <edge label="PNC" idref="s1_2" />
        <edge label="PNC" idref="s1_3" />
      </nt>
      <nt id="s1_501" cat="NP">
        <edge label="NK" idref="s1_6" />
        <edge label="NK" idref="s1_7" />
        <edge label="NK" idref="s1_8" />
      </nt>
      <nt id="s1_502" cat="S">
        <edge label="SB" idref="s1_500" />
        <edge label="HD" idref="s1_4" />
        <edge label="MO" idref="s1_5" />
        <edge label="PD" idref="s1_501" />
      </nt>
      <nt id="s1_VROOT" cat="VROOT">
        <edge label="--" idref="s1_1" />
        <edge label="--" idref="s1_502" />
        <edge label="--" idref="s1_9" />
      </nt>
    </nonterminals>
  </graph>
</s>
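A sentence in this format can be read with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`. The function name `read_sentence` is hypothetical, and `SAMPLE` is an abbreviated excerpt of the sentence above, trimmed to two terminals and one nonterminal (so its `root` attribute was changed to point at the PN node).

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the sample above; not the full sentence.
SAMPLE = """<s id="s1">
  <graph root="s1_500">
    <terminals>
      <t id="s1_2" word="Ross" lemma="Ross" pos="NE" morph="Nom.Sg.Masc" />
      <t id="s1_3" word="Perot" lemma="Perot" pos="NE" morph="Nom.Sg.Masc" />
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="PN">
        <edge label="PNC" idref="s1_2" />
        <edge label="PNC" idref="s1_3" />
      </nt>
    </nonterminals>
  </graph>
</s>"""


def read_sentence(xml_text):
    """Return (root id, terminals, nonterminals) for one <s> element.

    terminals:    id -> (word, pos)
    nonterminals: id -> (category, [(edge label, child id), ...])
    """
    sentence = ET.fromstring(xml_text)
    graph = sentence.find("graph")
    terminals = {t.get("id"): (t.get("word"), t.get("pos")) for t in graph.iter("t")}
    nonterminals = {
        nt.get("id"): (
            nt.get("cat"),
            [(e.get("label"), e.get("idref")) for e in nt.findall("edge")],
        )
        for nt in graph.iter("nt")
    }
    return graph.get("root"), terminals, nonterminals


root, terminals, nonterminals = read_sentence(SAMPLE)
print(root, terminals, nonterminals)
```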

The first sentence of the CoNLL 2006 training data:

1	``	_	$(	$(	_	4	PUNC	4	PUNC
2	Ross	_	NE	NE	_	4	SB	4	SB
3	Perot	_	NE	NE	_	2	PNC	2	PNC
4	wäre	_	VAFIN	VAFIN	_	0	ROOT	0	ROOT
5	vielleicht	_	ADV	ADV	_	4	MO	4	MO
6	ein	_	ART	ART	_	8	NK	8	NK
7	prächtiger	_	ADJA	ADJA	_	8	NK	8	NK
8	Diktator	_	NN	NN	_	4	PD	4	PD
9	''	_	$(	$(	_	4	PUNC	4	PUNC
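Each row has the ten tab-separated columns of the CoNLL-X format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). A minimal sketch of a reader follows. The function name `parse_conllx` is hypothetical, and `SAMPLE` repeats only the first four rows of the sentence above.

```python
CONLLX_COLUMNS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)

# First four rows of the training sentence shown above.
SAMPLE = "\n".join([
    "1\t``\t_\t$(\t$(\t_\t4\tPUNC\t4\tPUNC",
    "2\tRoss\t_\tNE\tNE\t_\t4\tSB\t4\tSB",
    "3\tPerot\t_\tNE\tNE\t_\t2\tPNC\t2\tPNC",
    "4\twäre\t_\tVAFIN\tVAFIN\t_\t0\tROOT\t0\tROOT",
])


def parse_conllx(block):
    """Parse one sentence (one token per line) into a list of dicts."""
    tokens = []
    for line in block.splitlines():
        if not line.strip():
            continue  # blank lines separate sentences
        fields = line.split("\t")
        if len(fields) != len(CONLLX_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        token = dict(zip(CONLLX_COLUMNS, fields))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 marks the root
        tokens.append(token)
    return tokens


for token in parse_conllx(SAMPLE):
    print(token["id"], token["form"], token["postag"], token["head"], token["deprel"])
```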

The first sentence of the CoNLL 2006 test data:

1	Zwei	_	CARD	CARD	_	2	NK	2	NK
2	Themen	_	NN	NN	_	14	SB	14	SB
3	,	_	$,	$,	_	2	PUNC	2	PUNC
4	die	_	PRELS	PRELS	_	8	OA	8	OA
5	Perot	_	NE	NE	_	8	SB	8	SB
6	immer	_	ADV	ADV	_	7	MO	7	MO
7	wieder	_	ADV	ADV	_	8	MO	8	MO
8	anspricht	_	VVFIN	VVFIN	_	2	RC	2	RC
9	,	_	$,	$,	_	2	PUNC	2	PUNC
10	Rezession	_	NN	NN	_	2	APP	2	APP
11	und	_	KON	KON	_	10	CD	10	CD
12	Bürokratie	_	NN	NN	_	10	CJ	10	CJ
13	,	_	$,	$,	_	14	PUNC	14	PUNC
14	machen	_	VVFIN	VVFIN	_	0	ROOT	0	ROOT
15	ihnen	_	PPER	PPER	_	18	DA	18	DA
16	besonders	_	ADV	ADV	_	18	MO	18	MO
17	zu	_	PTKZU	PTKZU	_	18	PM	18	PM
18	schaffen	_	VVINF	VVINF	_	14	OC	14	OC
19	.	_	$.	$.	_	14	PUNC	14	PUNC

The first sentence of the CoNLL 2009 training data:

1	``	_	``	$(	_	4	PUNC	_
2	Ross	Roß	NE	NN	Nom|Sg|Masc	_	3	PNC	_
3	Perot	NE	Nom|Sg|Masc	_	4	SB	_
4	wäre	sein	VAFIN	3|Sg|Past|Subj	*|Sg|Past|Subj	0	ROOT	_
5	vielleicht	ADV	_	4	MO	_
6	ein	ART	Nom|Sg|Masc	*|Sg|*	8	NK	_
7	prächtiger	prächtig	ADJA	Pos|Nom|Sg|Masc	*|*|*|*	8	NK	_
8	Diktator	NN	Nom|Sg|Masc	*|Sg|Masc	4	PD	_
9	''	_	''	$(	_	4	PUNC	_

The first sentence of the CoNLL 2009 development data:

1	Maschinenbau	NN	Nom|Sg|Masc	*|Sg|Masc	0	4	ROOT	NK	_
2	/	_	/	$(	_	0	1	PUNC	_
3	(	_	(	$(	_	0	4	PUNC	_
4	Zusammenfassung	NN	Nom|Sg|Fem	*|Sg|Fem	0	ROOT	_
5	)	_	)	$(	_	0	1	PUNC	_

The first sentence of the CoNLL 2009 test data:

1	Gegen	gegen	APPR	_
2	eine	ein	ART	Acc|Sg|Fem	*|Sg|Fem	_
3	Erweiterung	NN	Acc|Sg|Fem	*|Sg|Fem	_
4	ihrer	ihr	PPOSAT	Gen|Sg|Fem	*|*|*	_
5	Organisation	NN	Gen|Sg|Fem	*|Sg|Fem	_
6	zu	APPR	_
7	einem	ein	ART	Dat|Sg|Neut	Dat|Sg|*	_
8	sicherheitspolitischen	sicherheitspolitisch	ADJA	Pos|Dat|Sg|Neut	Pos|*|*|*	_
9	Forum	NN	Dat|Sg|Neut	*|Sg|Neut	_
10	sprachen	sprechen	VVFIN	3|Pl|Past|Ind	*|Pl|Past|Ind	_	Y
11	sich	er|es|sie|Sie	PRF	3|Acc|Pl	*|*|*	_
12	die	der	d	ART	Nom|Pl|Masc	*|*|*	_
13	meisten	meister	meist	PIAT	Nom|Pl|Masc	*|*|*	_
14	Staaten	Staat	NN	Nom|Pl|Masc	*|Pl|Masc	_
15	beim	bei	beim	APPRART	Dat|Sg|Neut	Dat|Sg|*	_
16	Gipfeltreffen	NN	Dat|Sg|Neut	*|*|Neut	_
17	für	APPR	_
18	Asiatisch-Pazifische	asiatisch-pazifisch	Asiatisch-Pazifische	ADJA	NN	Pos|Acc|Sg|Fem	*|*|*	_
19	Wirtschaftskooperation	NN	Acc|Sg|Fem	*|Sg|Fem	_
20	(	_	(	$(	_
21	Apec	_	NE	Nom|Sg|Fem	_
22	)	_	)	$(	_
23	in	APPR	_
24	Osaka	NE	Dat|Sg|Neut	*|Sg|Neut	_
25	aus	PTKVZ	_
26	.	_	.	$.	_

TIGER is a mildly nonprojective treebank. 15,875 of the 680,710 tokens in the CoNLL 2009 training+development datasets are attached nonprojectively (2.33%).
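The page does not say which definition of nonprojectivity lies behind this figure. The sketch below assumes one common definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. Attachments to the artificial root 0 always count as projective under this definition. The function names and the four-token example are hypothetical. The longer example uses the heads of the CoNLL 2006 test sentence shown above.

```python
def is_descendant(node, ancestor, heads):
    """True if following head links upward from `node` reaches `ancestor`.

    `heads` maps token id -> head id, with 0 as the artificial root.
    Assumes the heads form a tree (no cycles).
    """
    current = heads[node]
    while True:
        if current == ancestor:
            return True
        if current == 0:
            return False
        current = heads[current]


def nonprojective_tokens(heads):
    """Return the sorted ids of tokens whose attachment is nonprojective."""
    result = []
    for dep, head in heads.items():
        low, high = sorted((dep, head))
        if any(not is_descendant(k, head, heads) for k in range(low + 1, high)):
            result.append(dep)
    return sorted(result)


# Heads of the CoNLL 2006 test sentence ("Zwei Themen , die Perot ...").
TEST_SENTENCE = {
    1: 2, 2: 14, 3: 2, 4: 8, 5: 8, 6: 7, 7: 8, 8: 2, 9: 2, 10: 2,
    11: 10, 12: 10, 13: 14, 14: 0, 15: 18, 16: 18, 17: 18, 18: 14, 19: 14,
}
# Hypothetical four-token tree in which the arc from token 2 to head 4 crosses token 3.
CROSSING = {1: 3, 2: 4, 3: 0, 4: 3}

print(nonprojective_tokens(TEST_SENTENCE))
print(nonprojective_tokens(CROSSING))
print(round(100 * 15875 / 680710, 2))  # share quoted in the text
```

Under this definition the sample test sentence is fully projective.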

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for German:

Parser (Authors)	LAS	UAS
MST (McDonald et al.)	87.34	90.38
Riedel et al.	86.24	89.76
Basis (O'Neil)	85.36	89.16
Malt (Nivre et al.)	85.82	88.76
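LAS (labeled attachment score) is the percentage of scored tokens that received both the correct head and the correct dependency label. UAS (unlabeled attachment score) requires only the correct head. The sketch below computes both and can skip punctuation tokens. The punctuation test used here (every character of the form is in a Unicode punctuation category) is an assumption, as this page does not spell out the exact rule of the official evaluation script. The function names and the predicted analysis are hypothetical.

```python
import unicodedata


def is_punct(form):
    """Assumed criterion: every character is in a Unicode 'P*' category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, pred, skip_punct=True):
    """Return (LAS, UAS) in percent.

    `gold` and `pred` are parallel lists of (form, head, deprel) tuples.
    """
    total = correct_head = correct_both = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if skip_punct and is_punct(form):
            continue
        total += 1
        if g_head == p_head:
            correct_head += 1
            if g_rel == p_rel:
                correct_both += 1
    if total == 0:
        return 0.0, 0.0
    return 100 * correct_both / total, 100 * correct_head / total


# Four tokens of the CoNLL 2006 test sentence with their gold heads and labels.
GOLD = [("Zwei", 2, "NK"), ("Themen", 14, "SB"), (",", 2, "PUNC"), ("machen", 0, "ROOT")]
# Hypothetical parser output for the same tokens.
PRED = [("Zwei", 2, "NK"), ("Themen", 14, "OA"), (",", 14, "PUNC"), ("machen", 2, "ROOT")]

print(attachment_scores(GOLD, PRED))                    # punctuation excluded
print(attachment_scores(GOLD, PRED, skip_punct=False))  # punctuation included
```

With this criterion the opening quotation mark written as two backticks is not skipped, because Unicode classifies the backtick as a symbol rather than as punctuation.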

The results of the CoNLL 2009 shared task are available online. They have been published in (Hajič et al., 2009). Unlabeled attachment score was not published. These are the best results for German:

Parser (Authors)	LAS
Bohnet	87.48
Merlo	87.29
Chen	86.24
Che	86.19
