Institute of Formal and Applied Linguistics Wiki

This shows you the differences between two versions of the page.

user:zeman:treebanks:fi [2011/12/05 14:46]
zeman Sample.
user:zeman:treebanks:fi [2011/12/05 15:37]
zeman Inside and parsing.
Line 40: Line 40:
 ==== Inside ==== ==== Inside ====
-The treebank is part of the [[http://corp.hum.sdu.dk/tgrepeye_est.html|Arborest]] project and [[http://beta.visl.sdu.dk/|VISL]] (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] format. Two of them are also available in the [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|VISL]] format.
-
-The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually.
-
-Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities.
+The native file format of the treebank is based on XML. Besides that, TDT is also distributed in the [[:format-conll|CoNLL-format]]. The part-of-speech tag AND the morphosyntactic features are joined in one feature string, which is copied to both the CPOS and the POS columns of the CoNLL format. The FEAT column is empty (i.e. it contains the underscore character). Lemmas are available, too. Morphological annotation and disambiguation are automatic; the result is not a gold standard. The native XML format shows all morphological readings of every word based on the lexicon, and the disambiguation is left to the user.
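The column layout described for the new revision (a joined tag string in both CPOS and POS, an underscore in FEAT) can be illustrated with a minimal sketch. It assumes the standard ten tab-separated CoNLL-X columns; the delimiter `|` inside the joined tag string, the function name `parse_token` and the example token line are assumptions made for illustration and are not taken from the treebank.

```python
# CoNLL-X column names, in order (tab-separated, one token per line).
COLUMNS = ["id", "form", "lemma", "cpostag", "postag", "feats",
           "head", "deprel", "phead", "pdeprel"]


def parse_token(line, sep="|"):
    """Parse one CoNLL-X token line and split a joined tag string.

    `sep` is an ASSUMED delimiter between the part of speech and the
    features; check the actual data before relying on it.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    token = dict(zip(COLUMNS, fields))
    if token["feats"] == "_":
        # FEATS is empty, so recover the features from the joined POS string.
        pos, _, feats = token["postag"].partition(sep)
        token["pos_only"] = pos
        token["feats_from_tag"] = feats.split(sep) if feats else []
    else:
        # FEATS is filled in; use it directly.
        token["pos_only"] = token["postag"]
        token["feats_from_tag"] = token["feats"].split("|")
    return token


# Invented example line, NOT copied from the treebank.
example = "1\ttalo\ttalo\tN|Sg|Nom\tN|Sg|Nom\t_\t0\tROOT\t_\t_"
tok = parse_token(example)
```

If the real data uses a different delimiter, only the `sep` argument should need to change.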
 ==== Sample ==== ==== Sample ====
Line 127: Line 123:
 ==== Parsing ==== ==== Parsing ====
-Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%).
+Nonprojectivities in TDT are rare. Only 299 out of the 58576 tokens are attached nonprojectively (0.51%).
-There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of parsing accuracy. However, I am not sure that the treebank described here is not just output of the parser.
+I am not aware of any published evaluation of Finnish parsing accuracy.
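For readers who want to reproduce a figure such as the nonprojectivity rate quoted for TDT, the following is a minimal sketch of one common way to count nonprojectively attached tokens. The page does not say which definition or tool was used, so this is an assumption: a token counts as nonprojective when some token lying between it and its head is not a descendant of that head, and attachments to the artificial root (head 0) are skipped. The function name `count_nonprojective` and the toy sentences are illustrative only.

```python
def count_nonprojective(heads):
    """Count tokens whose attachment to their head is nonprojective.

    heads[i] is the head id of token i+1 (1-based ids, 0 = artificial root).
    ASSUMED definition: an attachment is nonprojective if any token strictly
    between the dependent and its head is not a descendant of the head.
    """
    n = len(heads)

    def is_descendant(node, ancestor):
        # Walk up the tree from node; the step limit guards against cycles.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    count = 0
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # attachments to the artificial root are not checked
        lo, hi = sorted((dep, head))
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            count += 1
    return count


# Toy sentences (invented): the first is projective, the second is not.
projective = count_nonprojective([2, 0, 2])
nonprojective = count_nonprojective([3, 0, 2, 2])

# Percentage from the TDT counts given on this page.
tdt_rate = round(100 * 299 / 58576, 2)
```

In the second toy sentence, token 1 depends on token 3 across token 2, which is the root and therefore not a descendant of token 3, so exactly one attachment is counted.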
