Finnish (fi)

Versions

23.5.2011 Downloadable from the website of the treebank

Obtaining and License

The Turku Dependency Treebank (TDT) is freely downloadable from the treebank website (http://bionlp.utu.fi/fintreebank.html) under the Creative Commons Attribution-Share Alike license. The license in short:

- any usage is permitted, commercial or not
- modification and redistribution are permitted
- publications must link to the treebank website and cite the principal publication

TDT was created by members of the Turku BioNLP Group, University of Turku (Turun yliopisto), 20014 Turku, Finland.

References

Website
- http://bionlp.utu.fi/fintreebank.html
Data
- no separate citation
Principal publications
- Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen, Tapio Salakoski: Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank. In: Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8). Milano, Italy, 2009.
- Katri Haverinen, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, Tapio Salakoski: Treebanking Finnish. In: Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. Tartu, Estonia, 2010.
Documentation
- The file FILE-FORMAT.txt in the distribution
- Partial list of part-of-speech tags with descriptions (POS tagging has been done by www.lingsoft.fi)

Domain

Mixed (Wikipedia, Wikinews, university web-magazine and blogs).

Size

TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing.
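
The split described above can be sketched as follows. This is a minimal illustration, not the script used for the HamleDT experiments: the function names are hypothetical, and the rounding (the cut point is rounded up) is an assumption chosen because it reproduces the reported 3877 / 430 sentence counts.

```python
def read_conll_sentences(lines):
    """Group CoNLL lines into sentences (lists of column lists).

    Blank lines separate sentences; lines starting with '#' are comments.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def split_train_test(sentences, train_percent=90):
    """Return (first train_percent % of sentences, the rest).

    Assumption: the cut point is rounded up (ceiling division).
    """
    cut = -(-len(sentences) * train_percent // 100)
    return sentences[:cut], sentences[cut:]


# Small in-memory example: two sentences, one comment line.
demo = read_conll_sentences(["# demo\n", "1\tKävelyreitti\n", "2\tIII\n", "\n", "1\tJäällä\n"])
train, test = split_train_test(demo)
print(len(train), len(test))
```

Splitting on sentence boundaries rather than token counts keeps every sentence intact in exactly one part.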

Inside

The treebank is part of the Arborest project and VISL (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the TIGER-XML format. Two of them are also available in the VISL format.

The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually.

Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities.

Sample

The first two sentences of the corpus in its native XML format:

<treeset name="http://ranneliike.net/blogi.php?nick=Aboa Kirjoitettu: 02.02.2010, 15:41:06">
  <sentence txt="Kävelyreitti III">
    <token charOff="0-12">
      <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
    </token>
    <token charOff="13-16">
      <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
      <posreading CG="true" baseform="iii" rawtags="ABBR &lt;up&gt;" />
      <posreading CG="true" baseform="iii" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
    </token>
    <dep dep="1" gov="0" type="num" />
  </sentence>
  <sentence txt="Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin.">
    <token charOff="0-6">
      <posreading CG="true" baseform="jää" rawtags="N ADE SG &lt;up&gt;" />
    </token>
    <token charOff="7-13">
      <posreading CG="true" baseform="kävely" rawtags="DV-U N NOM SG" />
    </token>
    <token charOff="14-18">
      <posreading CG="true" baseform="avata" rawtags="V PRES ACT SG3" />
      <posreading CG="false" baseform="avata" rawtags="V PRES ACT NEG" />
      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT SG2" />
      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT NEG" />
    </token>
    <token charOff="19-23">
      <posreading CG="true" baseform="aina" rawtags="ADV" />
    </token>
    <token charOff="24-32">
      <posreading CG="true" baseform="hauska" rawtags="A POS PTV PL" />
    </token>
    <token charOff="33-35">
      <posreading CG="true" baseform="ja" rawtags="COORD C" />
    </token>
    <token charOff="36-45">
      <posreading CG="true" baseform="erikoinen" rawtags="A POS PTV PL" />
    </token>
    <token charOff="46-56">
      <posreading CG="true" baseform="näkö#kulma" rawtags="N PTV PL" />
    </token>
    <token charOff="57-67">
      <posreading CG="true" baseform="kaupunki" rawtags="N ILL SG" />
    </token>
    <token charOff="67-68">
      <posreading CG="true" baseform="." rawtags="PUNCT" />
    </token>
    <dep dep="0" gov="1" type="nommod" />
    <dep dep="1" gov="2" type="nsubj" />
    <dep dep="3" gov="2" type="advmod" />
    <dep dep="7" gov="2" type="dobj" />
    <dep dep="9" gov="2" type="punct" />
    <dep dep="5" gov="4" type="cc" />
    <dep dep="6" gov="4" type="conj" />
    <dep dep="4" gov="7" type="amod" />
    <dep dep="8" gov="7" type="nommod" />
  </sentence>
  <!-- remaining sentences omitted -->
</treeset>

The same two sentences in the CoNLL format:

# b101.d.xml/1
1	Kävelyreitti	kävely|reitti	NOM|up|SG|N	NOM|up|SG|N	_	0	ROOT	_	_
2	III	III	roman|NOM|up|SG|ABBR	roman|NOM|up|SG|ABBR	_	1	num	_	_

# b101.d.xml/2
1	Jäällä	jää	ADE|SG|up|N	ADE|SG|up|N	_	2	nommod	_	_
2	kävely	kävely	DV-U|NOM|SG|N	DV-U|NOM|SG|N	_	3	nsubj	_	_
3	avaa	avata	SG3|ACT|PRES|V	SG3|ACT|PRES|V	_	0	ROOT	_	_
4	aina	aina	ADV	ADV	_	3	advmod	_	_
5	hauskoja	hauska	A|PTV|POS|PL	A|PTV|POS|PL	_	8	amod	_	_
6	ja	ja	C|COORD	C|COORD	_	5	cc	_	_
7	erikoisia	erikoinen	A|PTV|POS|PL	A|PTV|POS|PL	_	5	conj	_	_
8	näkökulmia	näkö|kulma	PTV|PL|N	PTV|PL|N	_	3	dobj	_	_
9	kaupunkiin	kaupunki	ILL|SG|N	ILL|SG|N	_	8	nommod	_	_
10	.	.	PUNCT	PUNCT	_	3	punct	_	_
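
The correspondence between the two samples can be sketched in Python. This is an illustration only, not the conversion actually used. It assumes that the first reading marked CG="true" is the one to keep, that `#` in `baseform` becomes `|`, that `dep` and `gov` are 0-based token indices, and that each token has at most one governor. The function name `sentence_to_conll` is hypothetical, and tags are kept in source order, which differs from the order in the CoNLL sample above.

```python
import xml.etree.ElementTree as ET


def sentence_to_conll(sentence):
    """Convert one <sentence> element into a list of 10-column CoNLL rows."""
    text = sentence.get("txt")
    # Map each 0-based dependent index to (1-based head, relation type).
    heads = {int(d.get("dep")): (int(d.get("gov")) + 1, d.get("type"))
             for d in sentence.findall("dep")}
    rows = []
    for index, token in enumerate(sentence.findall("token")):
        start, end = (int(x) for x in token.get("charOff").split("-"))
        readings = token.findall("posreading")
        # Assumption: keep the first reading marked CG="true" (else the first one).
        reading = next((r for r in readings if r.get("CG") == "true"), readings[0])
        lemma = reading.get("baseform").replace("#", "|")
        tags = "|".join(t.strip("<>") for t in reading.get("rawtags").split())
        # Tokens without an incoming dependency are treated as roots.
        head, deprel = heads.get(index, (0, "ROOT"))
        rows.append([str(index + 1), text[start:end], lemma, tags, tags, "_",
                     str(head), deprel, "_", "_"])
    return rows


SAMPLE = """<treeset name="sample">
  <sentence txt="Kävelyreitti III">
    <token charOff="0-12">
      <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
    </token>
    <token charOff="13-16">
      <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
    </token>
    <dep dep="1" gov="0" type="num" />
  </sentence>
</treeset>"""

for sentence in ET.fromstring(SAMPLE).findall("sentence"):
    for row in sentence_to_conll(sentence):
        print("\t".join(row))
```

Word forms are recovered from the `txt` attribute through the `charOff` character offsets, since the tokens themselves carry no surface form.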

Parsing

Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%).
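
A count of this kind can be reproduced from the CoNLL head column. The sketch below uses one common definition, which may differ from the one behind the figure above: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. The function name and the input layout (a list of 1-based head indices, 0 for the root) are assumptions made for illustration.

```python
def count_nonprojective(heads):
    """Count nonprojectively attached tokens in one sentence.

    heads[i] is the 1-based head of token i+1; 0 marks the root.
    """
    def is_descendant(node, ancestor):
        # Follow head links upward from node until the root (0) is reached.
        steps = 0
        while node != 0 and steps <= len(heads):  # step limit guards against cycles
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    count = 0
    for dependent, head in enumerate(heads, start=1):
        if head == 0:
            continue
        lo, hi = sorted((dependent, head))
        # Nonprojective if any token strictly between is not dominated by the head.
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            count += 1
    return count


# Heads of the second sample sentence (Jäällä kävely avaa ...), all projective.
print(count_nonprojective([2, 3, 0, 3, 8, 5, 5, 3, 8, 3]))
# A constructed example with crossing edges 1-3 and 2-4.
print(count_nonprojective([3, 4, 0, 3]))
```

Summing the result over all sentences and dividing by the token count gives a percentage comparable to the one quoted above.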

There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. However, I cannot rule out that the treebank described here is simply the output of that parser.
