Estonian (et)

Eesti keele puudepank (EKP), the Estonian Treebank

Versions

Downloadable online; part of the Arborest project (puudepank).
8.12.2010: arborest.xml is downloadable from the same site (same size, improved markup).
http://vvv.cs.ut.ee/~kaili/Korpus/pindmine/

Obtaining and License

The EKP is freely downloadable from the project website (see References) in the VISL or TIGER-XML format. The licensing terms are unknown.

EKP was created or coordinated (the exact role is unclear) by Kaili Müürisep, Institute of Computer Science (Arvutiteaduse instituut), University of Tartu (Tartu Ülikool), Liivi 2, 50409 Tartu, Estonia.

References

Website
- http://vvv.cs.ut.ee/~kaili/Korpus/puud/
Data
- no separate citation
Principal publications
- Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit Roosmaa, Heli Uibo: A New Language for Constraint Grammar: Estonian. In: International Conference Recent Advances in Natural Language Processing. Proceedings, pp. 304-310, Borovets, Bulgaria, 2003.
Documentation
- File formats
- The header of the TIGER-XML version of the treebank contains lists of various sorts of tags with brief explanation.

Domain

Mixed:

388 tailored sentences with movement verbs
732 sentences with movement verbs from the Estonian FrameNet corpus
175 sentences from the Arborest corpus
20 sentences of spoken language

Size

All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined.
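The totals above can be cross-checked against the per-part sentence counts listed under Domain. The following is a small sketch of that arithmetic; the dictionary keys are shorthand labels chosen here for illustration, not official names of the parts.

```python
# Sentence counts per part, as listed in the Domain section.
# The keys are informal labels (an assumption of this sketch).
parts = {
    "tailored movement-verb sentences": 388,
    "FrameNet movement-verb sentences": 732,
    "Arborest corpus sentences": 175,
    "spoken language sentences": 20,
}

total_tokens = 9491  # token count reported in the Size section
total_sentences = sum(parts.values())
tokens_per_sentence = total_tokens / total_sentences

print(total_sentences)
print(round(tokens_per_sentence, 2))
```

The four parts add up to the reported 1315 sentences, and the ratio rounds to the reported 7.22 tokens per sentence.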

Inside

The treebank is part of the Arborest project and VISL (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the TIGER-XML format. Two of them are also available in the VISL format.

The annotation contains lemmas, part-of-speech tags, morphosyntactic features, nonterminal labels, and phrase structure. It is not clear whether, or to what degree, the annotation was performed or checked manually.

Sample

The first sentence of the corpus in the TIGER-XML format:

<s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama">
	<graph root="ratsep-13_501">
		<terminals>
			<t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/>
			<t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/>
			<t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/>
			<t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/>
			<t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/>
			<t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/>
			<t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/>
		</terminals>
 
		<nonterminals>
			<nt id="ratsep-13_501" cat="VROOT">
				<edge label="STA" idref="ratsep-13_502"/>
			</nt>
			<nt id="ratsep-13_502" cat="fcl">
				<edge label="S" idref="ratsep-13_1"/>
				<edge label="P" idref="ratsep-13_2"/>
				<edge label="A" idref="ratsep-13_503"/>
				<edge label="A" idref="ratsep-13_5"/>
				<edge label="A" idref="ratsep-13_6"/>
				<edge label="FST" idref="ratsep-13_7"/>
			</nt>
			<nt id="ratsep-13_503" cat="pp">
				<edge label="H" idref="ratsep-13_3"/>
				<edge label="D" idref="ratsep-13_4"/>
			</nt>
		</nonterminals>
	</graph>
</s>
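
To illustrate how the terminals and nonterminals fit together, the sketch below reads the sample sentence with Python's standard `xml.etree.ElementTree` module and prints it as a labelled bracketing. It is only an illustration: the sample is abbreviated to the attributes actually used, and the helper name `bracket` and the output notation are choices made here, not part of the treebank or its tools.

```python
import xml.etree.ElementTree as ET

# The sample sentence, abbreviated to the attributes used in this sketch
# (lemma and morph are dropped).
SAMPLE = """\
<s id="ratsep-13" text="Peeter aerutas üle väina saarele puhkama">
  <graph root="ratsep-13_501">
    <terminals>
      <t id="ratsep-13_1" word="Peeter" pos="prop"/>
      <t id="ratsep-13_2" word="aerutas" pos="v-fin"/>
      <t id="ratsep-13_3" word="üle" pos="prp"/>
      <t id="ratsep-13_4" word="väina" pos="n"/>
      <t id="ratsep-13_5" word="saarele" pos="n"/>
      <t id="ratsep-13_6" word="puhkama" pos="v-inf"/>
      <t id="ratsep-13_7" word="." pos="punc"/>
    </terminals>
    <nonterminals>
      <nt id="ratsep-13_501" cat="VROOT">
        <edge label="STA" idref="ratsep-13_502"/>
      </nt>
      <nt id="ratsep-13_502" cat="fcl">
        <edge label="S" idref="ratsep-13_1"/>
        <edge label="P" idref="ratsep-13_2"/>
        <edge label="A" idref="ratsep-13_503"/>
        <edge label="A" idref="ratsep-13_5"/>
        <edge label="A" idref="ratsep-13_6"/>
        <edge label="FST" idref="ratsep-13_7"/>
      </nt>
      <nt id="ratsep-13_503" cat="pp">
        <edge label="H" idref="ratsep-13_3"/>
        <edge label="D" idref="ratsep-13_4"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def bracket(node_id, words, nonterminals):
    """Render the subtree rooted at node_id as a labelled bracketing."""
    if node_id in words:  # terminal: just the word form
        return words[node_id]
    cat, edges = nonterminals[node_id]
    inner = " ".join(
        f"{label}:{bracket(ref, words, nonterminals)}" for label, ref in edges
    )
    return f"[{cat} {inner}]"


sentence = ET.fromstring(SAMPLE)
graph = sentence.find("graph")

# Map terminal ids to word forms.
words = {t.get("id"): t.get("word") for t in graph.iter("t")}

# Map nonterminal ids to (category, [(edge label, child id), ...]).
nonterminals = {
    nt.get("id"): (
        nt.get("cat"),
        [(e.get("label"), e.get("idref")) for e in nt.findall("edge")],
    )
    for nt in graph.iter("nt")
}

tree = bracket(graph.get("root"), words, nonterminals)
print(tree)
```

For the sample this prints `[VROOT STA:[fcl S:Peeter P:aerutas A:[pp H:üle D:väina] A:saarele A:puhkama FST:.]]`, which makes the flat clause structure with one embedded prepositional phrase easy to see.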

Parsing

The phrase structure is projective by definition.
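
One way to read this claim is that every nonterminal covers a contiguous span of terminals. The sketch below checks that property on the sample sentence only; it says nothing about the rest of the treebank. The edges are copied from the Sample section with ids shortened to their numeric suffix, and the sketch assumes that suffixes below 500 are terminals numbered in surface order.

```python
# Edges of the sample sentence (ids shortened to their numeric suffix).
# Assumption: numbers below 500 are terminals in surface order.
CHILDREN = {
    501: [502],
    502: [1, 2, 503, 5, 6, 7],
    503: [3, 4],
}


def terminal_yield(node, children):
    """Return the sorted terminal positions dominated by node."""
    if node not in children:  # a terminal dominates only itself
        return [node]
    positions = []
    for child in children[node]:
        positions.extend(terminal_yield(child, children))
    return sorted(positions)


def is_contiguous(positions):
    """True if the sorted positions form an unbroken range."""
    return positions == list(range(positions[0], positions[-1] + 1))


all_contiguous = all(
    is_contiguous(terminal_yield(node, CHILDREN)) for node in CHILDREN
)
print(all_contiguous)
```

A constituent with a gap in its yield (for example, children 1 and 3 but not 2) would make the check fail, so the same function could be run over a whole file to look for discontinuous phrases.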

There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. Moreover, I cannot rule out that the treebank described here is simply the output of that parser.
