
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:et [2011/11/21 13:31]
zeman
user:zeman:treebanks:et [2011/11/21 13:44]
zeman Inside, sample and parsing.
Line 37: Line 37:
 ==== Size ==== ==== Size ====
  
-According to their website, the TIGER Treebank version 1 contains approximately 700,000 tokens in 40,000 sentences. Version 2.1 contains approximately 900,000 tokens in 50,000 sentences. +All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined.
- +
-The CoNLL 2006 version contains 705,304 tokens in 39573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens / 39216 sentences training, 5694 tokens / 357 sentences test). +
- +
-The CoNLL 2009 version contains 712,332 tokens in 40020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36020 sentences training, 32033 tokens / 2000 sentences development, 31622 tokens / 2000 sentences test).+
  
 ==== Inside ==== ==== Inside ====
  
-The treebank is part of the [[http://corp.hum.sdu.dk/tgrepeye_est.html|Arborest]] project and [[http://beta.visl.sdu.dk/|VISL]] (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter).+The treebank is part of the [[http://corp.hum.sdu.dk/tgrepeye_est.html|Arborest]] project and [[http://beta.visl.sdu.dk/|VISL]] (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] format. Two of them are also available in the [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|VISL]] format.
  
-All versions contain //semi-automatic// part of speech tags ([[http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html|Stuttgart-Tübingen Tagset]], STTS) and syntactic structure. Lemmas and morphosyntactic features are available only for newer versions (TIGER Treebank version 2 and onwards, and CoNLL 2009). The parts of speech are heavily context-dependent, e.g. many words can be used both substantively (pronouns) and attributively (determiners), which is distinguished by different POS tags.+The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually.
  
-It is not clear what the //semi-automatic// annotation means (probably first auto-tagging, then manual correction?) and whether it also applies to the morphosyntactic annotation. The CoNLL 2009 version also contains automatically disambiguated lemmas, tags and features.+==== Sample ====
  
-The original treebank is phrase-based. The dependencies in the CoNLL versions must have thus been drawn using a head-selection procedure. Besides CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format.+The first sentence of the corpus in the TIGER-XML format:
  
-==== Sample ==== +<code xml><s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama"> 
- + <graph root="ratsep-13_501"> 
-The first sentence of TIGER Treebank 2.1 in the TIGER-XML format:+ <terminals> 
 + <t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/> 
 + <t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/> 
 + <t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/> 
 + <t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/> 
 + <t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/> 
 + <t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/> 
 + <t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/> 
 + </terminals>
  
-<code xml><s id="s1"> + <nonterminals> 
-  <graph root="s1_VROOT"> + <nt id="ratsep-13_501" cat="VROOT"> 
-    <terminals> + <edge label="STA" idref="ratsep-13_502"/> 
-      <t id="s1_1" word="``" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" /> + </nt> 
-      <t id="s1_2" word="Ross" lemma="Ross" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> + <nt id="ratsep-13_502" cat="fcl"> 
-      <t id="s1_3" word="Perot" lemma="Perot" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> + <edge label="S" idref="ratsep-13_1"/> 
-      <t id="s1_4" word="wäre" lemma="sein" pos="VAFIN" morph="3.Sg.Past.Subj" case="--" number="Sg" gender="--" person="3" degree="--" tense="Past" mood="Subj" /> + <edge label="P" idref="ratsep-13_2"/> 
-      <t id="s1_5" word="vielleicht" lemma="vielleicht" pos="ADV" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" /> + <edge label="A" idref="ratsep-13_503"/> 
-      <t id="s1_6" word="ein" lemma="ein" pos="ART" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> + <edge label="A" idref="ratsep-13_5"/> 
-      <t id="s1_7" word="prächtiger" lemma="prächtig" pos="ADJA" morph="Pos.Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="Pos" tense="--" mood="--" /> + <edge label="A" idref="ratsep-13_6"/> 
-      <t id="s1_8" word="Diktator" lemma="Diktator" pos="NN" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> + <edge label="FST" idref="ratsep-13_7"/> 
-      <t id="s1_9" word="''" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" /> + </nt> 
-    </terminals> + <nt id="ratsep-13_503" cat="pp"> 
-    <nonterminals> + <edge label="H" idref="ratsep-13_3"/> 
-      <nt id="s1_500" cat="PN"> + <edge label="D" idref="ratsep-13_4"/> 
-        <edge label="PNC" idref="s1_2" /> + </nt> 
-        <edge label="PNC" idref="s1_3" /> + </nonterminals> 
-      </nt> + </graph>
-      <nt id="s1_501" cat="NP"> +
-        <edge label="NK" idref="s1_6" /> +
-        <edge label="NK" idref="s1_7" /> +
-        <edge label="NK" idref="s1_8" /> +
-      </nt> +
-      <nt id="s1_502" cat="S"> +
-        <edge label="SB" idref="s1_500" /> +
-        <edge label="HD" idref="s1_4" /> +
-        <edge label="MO" idref="s1_5" /> +
-        <edge label="PD" idref="s1_501" /> +
-      </nt> +
-      <nt id="s1_VROOT" cat="VROOT"> +
-        <edge label="--" idref="s1_1" /> +
-        <edge label="--" idref="s1_502" /> +
-        <edge label="--" idref="s1_9" /> +
-      </nt> +
-    </nonterminals> +
-  </graph>+
 </s></code> </s></code>
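The sample can be read with a standard XML parser. The following is a minimal sketch in Python, not part of the treebank distribution: the function name ''read_sentence'' and the dictionary layout are assumptions made for illustration, and only the element and attribute names (''s'', ''t'', ''nt'', ''edge'', ''word'', ''lemma'', ''pos'', ''morph'', ''cat'', ''label'', ''idref'') are taken from the sample itself.

```python
import xml.etree.ElementTree as ET

# The sample sentence from above, embedded so that the sketch is self-contained.
SAMPLE = """\
<s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama">
<graph root="ratsep-13_501">
<terminals>
<t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/>
<t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/>
<t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/>
<t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/>
<t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/>
<t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/>
<t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/>
</terminals>
<nonterminals>
<nt id="ratsep-13_501" cat="VROOT">
<edge label="STA" idref="ratsep-13_502"/>
</nt>
<nt id="ratsep-13_502" cat="fcl">
<edge label="S" idref="ratsep-13_1"/>
<edge label="P" idref="ratsep-13_2"/>
<edge label="A" idref="ratsep-13_503"/>
<edge label="A" idref="ratsep-13_5"/>
<edge label="A" idref="ratsep-13_6"/>
<edge label="FST" idref="ratsep-13_7"/>
</nt>
<nt id="ratsep-13_503" cat="pp">
<edge label="H" idref="ratsep-13_3"/>
<edge label="D" idref="ratsep-13_4"/>
</nt>
</nonterminals>
</graph>
</s>"""


def read_sentence(xml_text):
    """Return (terminals, nonterminals) for one TIGER-XML <s> element.

    Hypothetical helper: terminals is a list of dicts in document order,
    nonterminals maps a node id to its category and outgoing edges.
    """
    sentence = ET.fromstring(xml_text)
    terminals = [
        {
            "id": t.get("id"),
            "word": t.get("word"),
            "lemma": t.get("lemma"),
            "pos": t.get("pos"),
            # The morph attribute is a comma-separated list of features.
            "morph": t.get("morph", "").split(","),
        }
        for t in sentence.iter("t")
    ]
    nonterminals = {
        nt.get("id"): {
            "cat": nt.get("cat"),
            "edges": [(e.get("label"), e.get("idref")) for e in nt.findall("edge")],
        }
        for nt in sentence.iter("nt")
    }
    return terminals, nonterminals


terminals, nonterminals = read_sentence(SAMPLE)
for t in terminals:
    # One token per line: word form, lemma, part of speech, features.
    print(t["word"], t["lemma"], t["pos"], ",".join(t["morph"]), sep="\t")
```

Keeping the edges as ''(label, idref)'' pairs preserves the order of children within each phrase, which is what a later conversion step (for example head selection) would most likely need.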
- 
-The first sentence of the CoNLL 2006 training data: 
- 
-| 1 | `` | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC | 
-| 2 | Ross | _ | NE | NE | _ | 4 | SB | 4 | SB | 
-| 3 | Perot | _ | NE | NE | _ | 2 | PNC | 2 | PNC | 
-| 4 | wäre | _ | VAFIN | VAFIN | _ | 0 | ROOT | 0 | ROOT | 
-| 5 | vielleicht | _ | ADV | ADV | _ | 4 | MO | 4 | MO | 
-| 6 | ein | _ | ART | ART | _ | 8 | NK | 8 | NK | 
-| 7 | prächtiger | _ | ADJA | ADJA | _ | 8 | NK | 8 | NK | 
-| 8 | Diktator | _ | NN | NN | _ | 4 | PD | 4 | PD | 
-| 9 | <nowiki>''</nowiki> | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC | 
- 
-The first sentence of the CoNLL 2006 test data: 
- 
-| 1 | Zwei | _ | CARD | CARD | _ | 2 | NK | 2 | NK | 
-| 2 | Themen | _ | NN | NN | _ | 14 | SB | 14 | SB | 
-| 3 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | 
-| 4 | die | _ | PRELS | PRELS | _ | 8 | OA | 8 | OA | 
-| 5 | Perot | _ | NE | NE | _ | 8 | SB | 8 | SB | 
-| 6 | immer | _ | ADV | ADV | _ | 7 | MO | 7 | MO | 
-| 7 | wieder | _ | ADV | ADV | _ | 8 | MO | 8 | MO | 
-| 8 | anspricht | _ | VVFIN | VVFIN | _ | 2 | RC | 2 | RC | 
-| 9 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | 
-| 10 | Rezession | _ | NN | NN | _ | 2 | APP | 2 | APP | 
-| 11 | und | _ | KON | KON | _ | 10 | CD | 10 | CD | 
-| 12 | Bürokratie | _ | NN | NN | _ | 10 | CJ | 10 | CJ | 
-| 13 | , | _ | $, | $, | _ | 14 | PUNC | 14 | PUNC | 
-| 14 | machen | _ | VVFIN | VVFIN | _ | 0 | ROOT | 0 | ROOT | 
-| 15 | ihnen | _ | PPER | PPER | _ | 18 | DA | 18 | DA | 
-| 16 | besonders | _ | ADV | ADV | _ | 18 | MO | 18 | MO | 
-| 17 | zu | _ | PTKZU | PTKZU | _ | 18 | PM | 18 | PM | 
-| 18 | schaffen | _ | VVINF | VVINF | _ | 14 | OC | 14 | OC | 
-| 19 | . | _ | $. | $. | _ | 14 | PUNC | 14 | PUNC | 
- 
-The first sentence of the CoNLL 2009 training data: 
- 
-| 1 | `` | _ | `` | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ | 
-| 2 | Ross | Ross | Roß | NE | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | _ | 3 | 3 | PNC | PNC | _ | _ | 
-| 3 | Perot | Perot | Perot | NE | NE | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | _ | 4 | 4 | SB | SB | _ | _ | 
-| 4 | wäre | sein | sein | VAFIN | VAFIN | 3<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Past<nowiki>|</nowiki>Subj | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Past<nowiki>|</nowiki>Subj | 0 | 0 | ROOT | ROOT | _ | _ | 
-| 5 | vielleicht | vielleicht | vielleicht | ADV | ADV | _ | _ | 4 | 4 | MO | MO | _ | _ | 
-| 6 | ein | ein | ein | ART | ART | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | 8 | 8 | NK | NK | _ | _ | 
-| 7 | prächtiger | prächtig | prächtig | ADJA | ADJA | Pos<nowiki>|</nowiki>Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>*<nowiki>|</nowiki>* | 8 | 8 | NK | NK | _ | _ | 
-| 8 | Diktator | Diktator | Diktator | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | 4 | 4 | PD | PD | _ | _ | 
-| 9 | <nowiki>''</nowiki> | _ | <nowiki>''</nowiki> | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ | 
- 
-The first sentence of the CoNLL 2009 development data: 
- 
-| 1 | Maschinenbau | Maschinenbau | Maschinenbau | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | 0 | 4 | ROOT | NK | _ | _ | 
-| 2 | / | _ | / | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | 
-| 3 | ( | _ | ( | $( | $( | _ | _ | 0 | 4 | PUNC | PUNC | _ | _ | 
-| 4 | Zusammenfassung | Zusammenfassung | Zusammenfassung | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | 0 | 0 | ROOT | ROOT | _ | _ | 
-| 5 | ) | _ | ) | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | 
- 
-The first sentence of the CoNLL 2009 test data: 
- 
-| 1 | Gegen | gegen | gegen | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
-| 2 | eine | ein | ein | ART | ART | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
-| 3 | Erweiterung | Erweiterung | Erweiterung | NN | NN | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
-| 4 | ihrer | ihr | ihr | PPOSAT | PPOSAT | Gen<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 5 | Organisation | Organisation | Organisation | NN | NN | Gen<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
-| 6 | zu | zu | zu | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
-| 7 | einem | ein | ein | ART | ART | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 8 | sicherheitspolitischen | sicherheitspolitisch | sicherheitspolitisch | ADJA | ADJA | Pos<nowiki>|</nowiki>Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Pos<nowiki>|</nowiki>*<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 9 | Forum | Forum | Forum | NN | NN | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ | 
-| 10 | sprachen | sprechen | sprechen | VVFIN | VVFIN | 3<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Past<nowiki>|</nowiki>Ind | *<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Past<nowiki>|</nowiki>Ind | _ | _ | _ | _ | Y | 
-| 11 | sich | sich | er<nowiki>|</nowiki>es<nowiki>|</nowiki>sie<nowiki>|</nowiki>Sie | PRF | PRF | 3<nowiki>|</nowiki>Acc<nowiki>|</nowiki>Pl | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 12 | die | der | d | ART | ART | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 13 | meisten | meister | meist | PIAT | PIAT | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 14 | Staaten | Staat | Staat | NN | NN | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | _ | _ | _ | _ | _ | 
-| 15 | beim | bei | beim | APPRART | APPRART | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 16 | Gipfeltreffen | Gipfeltreffen | Gipfeltreffen | NN | NN | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>*<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ | 
-| 17 | für | für | für | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
-| 18 | Asiatisch-Pazifische | asiatisch-pazifisch | Asiatisch-Pazifische | ADJA | NN | Pos<nowiki>|</nowiki>Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
-| 19 | Wirtschaftskooperation | Wirtschaftskooperation | Wirtschaftskooperation | NN | NN | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
-| 20 | ( | _ | ( | $( | $( | _ | _ | _ | _ | _ | _ | _ | 
-| 21 | Apec | Apec | _ | NE | NE | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | _ | 
-| 22 | ) | _ | ) | $( | $( | _ | _ | _ | _ | _ | _ | _ | 
-| 23 | in | in | in | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
-| 24 | Osaka | Osaka | Osaka | NE | NE | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ | 
-| 25 | aus | aus | aus | PTKVZ | PTKVZ | _ | _ | _ | _ | _ | _ | _ | 
-| 26 | . | _ | . | $. | $. | _ | _ | _ | _ | _ | _ | _ | 
  
 ==== Parsing ==== ==== Parsing ====
  
-TIGER is a mildly nonprojective treebank. 15875 of the 680,710 tokens in the CoNLL 2009 training+development datasets are attached nonprojectively (2.33%). +The phrase structure is projective by definition.
- +
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for German: +
- +
-^ Parser (Authors) ^ LAS ^ UAS ^ +
-| MST (McDonald et al.) | 87.34 | 90.38 | +
-| Riedel et al. | 86.24 | 89.76 | +
-| Basis (O'Neil) | 85.36 | 89.16 | +
-| Malt (Nivre et al.) | 85.82 | 88.76 | +
- +
-The results of the CoNLL 2009 shared task are [[http://ufal.mff.cuni.cz/conll2009-st/results/results.php|available online]]. They have been published in [[http://aclweb.org/anthology/W/W09/W09-1201.pdf|(Hajič et al., 2009)]]. Unlabeled attachment score was not published. These are the best results for German:+
  
-^ Parser (Authors) ^ LAS ^ +There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of parsing accuracy. However, I am not sure that the treebank described here is not just output of the parser.
-| Bohnet | 87.48 | +
-| Merlo | 87.29 | +
-| Chen | 86.24 | +
-| Che | 86.19 |+
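The claim that the phrase structure is projective can be checked mechanically: every nonterminal should cover a contiguous span of tokens. The sketch below is an illustration only. It hard-codes the structure of the sample sentence, the function names are hypothetical, and the node ''X'' is an invented counterexample that does not occur in the treebank.

```python
def yield_positions(node, children, position):
    """Collect the token positions covered by a node (a terminal covers itself)."""
    if node in position:
        return [position[node]]
    covered = []
    for child in children[node]:
        covered.extend(yield_positions(child, children, position))
    return covered


def is_contiguous(node, children, position):
    """True if the tokens under the node form an unbroken span."""
    covered = sorted(yield_positions(node, children, position))
    return covered == list(range(covered[0], covered[-1] + 1))


# Terminals of the sample sentence, numbered by word order.
position = {f"ratsep-13_{i}": i for i in range(1, 8)}

# Nonterminals of the sample sentence with their children.
children = {
    "ratsep-13_501": ["ratsep-13_502"],
    "ratsep-13_502": ["ratsep-13_1", "ratsep-13_2", "ratsep-13_503",
                      "ratsep-13_5", "ratsep-13_6", "ratsep-13_7"],
    "ratsep-13_503": ["ratsep-13_3", "ratsep-13_4"],
}

sample_is_projective = all(
    is_contiguous(node, children, position) for node in children
)
print(sample_is_projective)

# Invented counterexample: a phrase covering tokens 1 and 3 but not 2.
gapped = dict(children, X=["ratsep-13_1", "ratsep-13_3"])
print(is_contiguous("X", gapped, position))
```

Running the same check over all sentences would show whether any phrase in the data is discontinuous.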
  
