Finnish (fi)
Versions
- 23.5.2011 Downloadable from the website of the treebank
Obtaining and License
The TDT is freely downloadable from the treebank website under the Creative Commons Attribution-Share Alike license. The license in short:
- any usage, commercial or not
- modification and redistribution permitted
- publications must link to the treebank website and cite the principal publication
TDT was created by members of the Turku BioNLP Group, University of Turku (Turun yliopisto), 20014 Turku, Finland.
References
- Website
- Data
- no separate citation
- Principal publications
- Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen, Tapio Salakoski: Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank. In: Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8). Milano, Italy, 2009.
- Katri Haverinen, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, Tapio Salakoski: Treebanking Finnish. In: Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. Tartu, Estonia, 2010.
- Documentation
- The file FILE-FORMAT.txt in the distribution
- Partial list of part-of-speech tags with descriptions (POS tagging has been done by www.lingsoft.fi)
Domain
Mixed (Wikipedia, Wikinews, university web-magazine and blogs).
Size
All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Because the treebank is small and its domains are extraordinarily diverse, a good test set should sample from all four parts. This is the case for our HamleDT experimental data split, shown in the last two rows of the table.
File | Sentences | Terminals | Average t/s |
---|---|---|---|
arborest.xml | 175 | 2451 | 14.01 |
piialaused.xml | 732 | 4505 | 6.15 |
ratsepalaused.xml | 388 | 2348 | 6.05 |
sul.xml | 20 | 187 | 9.35 |
total | 1315 | 9491 | 7.22 |
training | 1184 | 8535 | 7.21 |
test | 131 | 956 | 7.30 |
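The averages in the table can be re-derived from the raw counts, and the idea of a test set that samples from every part can be sketched in a few lines of Python. This is only an illustration: `split_every_kth` is a hypothetical helper, and nothing here says how the HamleDT split was actually produced.

```python
# Sentence and token counts copied from the table above.
PARTS = {
    "arborest.xml": (175, 2451),
    "piialaused.xml": (732, 4505),
    "ratsepalaused.xml": (388, 2348),
    "sul.xml": (20, 187),
}


def average_tokens(sentences, tokens):
    """Average tokens per sentence, rounded to two decimals as in the table."""
    return round(tokens / sentences, 2)


def totals(parts):
    """Sum sentence and token counts over all parts."""
    sentences = sum(s for s, _ in parts.values())
    tokens = sum(t for _, t in parts.values())
    return sentences, tokens


def split_every_kth(items, k=10):
    """Hypothetical split: every k-th item goes to test, the rest to training.

    Applied to the concatenation of all files, this draws test sentences
    from every part instead of cutting off one contiguous block.
    """
    train, test = [], []
    for index, item in enumerate(items):
        (test if index % k == k - 1 else train).append(item)
    return train, test


if __name__ == "__main__":
    for name, (sentences, tokens) in PARTS.items():
        print(name, average_tokens(sentences, tokens))
    print("total", average_tokens(*totals(PARTS)))
```

Holding out every tenth of 1315 sentences gives 1184 training and 131 test sentences. Whether the split in the table was made this way is not stated here.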
Inside
The treebank is part of the Arborest project and VISL (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the TIGER-XML format. Two of them are also available in the VISL format.
The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually.
Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities.
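To make this concrete, here is a minimal sketch of how one could test whether a constituent covers a contiguous stretch of words. The node names and the toy graph are invented for illustration and do not come from the treebank. The only assumption is that word order is given as a position per terminal, independently of the parent-child edges.

```python
def terminal_yield(node, children, order):
    """Return the sorted word positions covered by a node."""
    if node in order:  # terminal: look up its position in the sentence
        return [order[node]]
    positions = []
    for child in children[node]:
        positions.extend(terminal_yield(child, children, order))
    return sorted(positions)


def is_contiguous(node, children, order):
    """True if the node's yield has no gaps in the word order."""
    positions = terminal_yield(node, children, order)
    return positions == list(range(positions[0], positions[-1] + 1))


# Toy graph (hypothetical): the phrase "np" covers the first and the last word.
order = {"t1": 0, "t2": 1, "t3": 2, "t4": 3}
children = {
    "root": ["np", "t2", "t3"],
    "np": ["t1", "t4"],
}

if __name__ == "__main__":
    print(is_contiguous("root", children, order))
    print(is_contiguous("np", children, order))
```

A phrase whose yield has a gap, like `np` in the toy graph, is exactly the kind of structure that becomes a nonprojective attachment after conversion to dependencies.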
Sample
The first sentence of the corpus in the TIGER-XML format:
    <s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama">
      <graph root="ratsep-13_501">
        <terminals>
          <t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/>
          <t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/>
          <t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/>
          <t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/>
          <t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/>
          <t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/>
          <t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/>
        </terminals>
        <nonterminals>
          <nt id="ratsep-13_501" cat="VROOT">
            <edge label="STA" idref="ratsep-13_502"/>
          </nt>
          <nt id="ratsep-13_502" cat="fcl">
            <edge label="S" idref="ratsep-13_1"/>
            <edge label="P" idref="ratsep-13_2"/>
            <edge label="A" idref="ratsep-13_503"/>
            <edge label="A" idref="ratsep-13_5"/>
            <edge label="A" idref="ratsep-13_6"/>
            <edge label="FST" idref="ratsep-13_7"/>
          </nt>
          <nt id="ratsep-13_503" cat="pp">
            <edge label="H" idref="ratsep-13_3"/>
            <edge label="D" idref="ratsep-13_4"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
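A sentence in this format can be read with Python's standard `xml.etree.ElementTree` module. The following is a minimal sketch, not a full TIGER-XML reader: `read_sentence` is a name chosen here, and it only extracts the attributes visible in the sample above.

```python
import xml.etree.ElementTree as ET

# The sample sentence from above, embedded so that the sketch is self-contained.
SAMPLE = """
<s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama">
  <graph root="ratsep-13_501">
    <terminals>
      <t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/>
      <t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/>
      <t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/>
      <t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/>
      <t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/>
      <t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/>
      <t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/>
    </terminals>
    <nonterminals>
      <nt id="ratsep-13_501" cat="VROOT">
        <edge label="STA" idref="ratsep-13_502"/>
      </nt>
      <nt id="ratsep-13_502" cat="fcl">
        <edge label="S" idref="ratsep-13_1"/>
        <edge label="P" idref="ratsep-13_2"/>
        <edge label="A" idref="ratsep-13_503"/>
        <edge label="A" idref="ratsep-13_5"/>
        <edge label="A" idref="ratsep-13_6"/>
        <edge label="FST" idref="ratsep-13_7"/>
      </nt>
      <nt id="ratsep-13_503" cat="pp">
        <edge label="H" idref="ratsep-13_3"/>
        <edge label="D" idref="ratsep-13_4"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def read_sentence(xml_text):
    """Return the root id, the terminals and the labelled edges of one <s>."""
    sentence = ET.fromstring(xml_text.strip())
    root_id = sentence.find("graph").get("root")
    # Terminals appear in surface word order.
    terminals = [
        (t.get("id"), t.get("word"), t.get("lemma"), t.get("pos"))
        for t in sentence.iter("t")
    ]
    # Each edge links a nonterminal (parent) to a terminal or nonterminal (child).
    edges = [
        (nt.get("id"), nt.get("cat"), edge.get("label"), edge.get("idref"))
        for nt in sentence.iter("nt")
        for edge in nt.findall("edge")
    ]
    return root_id, terminals, edges


if __name__ == "__main__":
    root_id, terminals, edges = read_sentence(SAMPLE)
    print(root_id)
    for parent, cat, label, child in edges:
        print(parent, cat, label, child)
```

The terminals give the word order, while the edges give the structure, which is the separation mentioned in the Inside section.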
Parsing
Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%).
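For reference, one common way to count nonprojectively attached tokens is sketched below. It assumes that the trees have already been converted to dependencies, with one head index per token and 0 for the artificial root. How the figure above was actually computed is not documented here, and the function names and toy trees are made up for this example.

```python
def is_dominated_by(token, ancestor, heads):
    """True if `ancestor` is `token` itself or lies on its path to the root.

    Assumes `heads` encodes a tree (no cycles); index 0 is the artificial root.
    """
    while True:
        if token == ancestor:
            return True
        if token == 0:
            return False
        token = heads[token]


def nonprojective_tokens(heads):
    """Return the tokens whose attachment to their head is nonprojective.

    heads[i] is the head of token i (1-based); heads[0] is an unused placeholder.
    An attachment is nonprojective if some token lying between the dependent
    and its head is not dominated by that head.
    """
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        low, high = sorted((head, dep))
        if any(not is_dominated_by(mid, head, heads) for mid in range(low + 1, high)):
            result.append(dep)
    return result


def percentage(count, total):
    """Share of `count` in `total` as a percentage, rounded to three decimals."""
    return round(100 * count / total, 3)


if __name__ == "__main__":
    projective = [0, 2, 0, 2]        # 1 <- 2 -> 3, token 2 is the root
    crossing = [0, 3, 0, 2, 2]       # token 1 attaches to 3 across the root 2
    print(nonprojective_tokens(projective))
    print(nonprojective_tokens(crossing))
    print(percentage(7, 9491))
```

In the second toy tree, the attachment of token 1 to token 3 spans token 2, which is not a descendant of token 3, so token 1 counts as nonprojectively attached.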
There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. I also cannot rule out that the treebank described here is simply the output of that parser.