Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:fi [2011/12/05 14:46] zeman Sample. |
user:zeman:treebanks:fi [2011/12/05 15:37] (current) zeman Inside and parsing. |
==== Inside ==== | ==== Inside ==== |
| |
The treebank is part of the [[http://corp.hum.sdu.dk/tgrepeye_est.html|Arborest]] project and [[http://beta.visl.sdu.dk/|VISL]] (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] format. Two of them are also available in the [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|VISL]] format. | The native file format of the treebank is based on XML. Besides that, TDT is also distributed in the [[:format-conll|CoNLL-X format]]. The part-of-speech tag and the morphosyntactic features are joined into a single feature string, which is copied into both the CPOS and the POS columns of the CoNLL format. The FEAT column is empty (i.e. it contains the underscore character). Lemmas are available, too. Morphological annotation and disambiguation are automatic; they are not gold standard. The native XML format shows all morphological readings of every word based on the lexicon, and the disambiguation is left to the user. |
| |
The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually. | |
| |
Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities. | |
| |
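To make the column layout described for the CoNLL-X distribution of TDT concrete, here is a minimal sketch that splits one token line into the standard CoNLL-X columns and checks the two properties stated above: CPOS and POS hold the same joined string, and FEAT is an underscore. The helper names (`Token`, `parse_conll_line`, `follows_described_layout`) and the sample line, including the `N+Sg+Nom` tag string and its `+` separator, are hypothetical illustrations and are not taken from the treebank.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """First eight columns of a CoNLL-X token line."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str


def parse_conll_line(line: str) -> Token:
    """Split one tab-separated CoNLL-X token line into named columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 columns, got {len(cols)}")
    return Token(int(cols[0]), cols[1], cols[2], cols[3],
                 cols[4], cols[5], int(cols[6]), cols[7])


def follows_described_layout(tok: Token) -> bool:
    """True if CPOS and POS are identical and FEAT is the empty marker '_'."""
    return tok.cpostag == tok.postag and tok.feats == "_"


# Hypothetical sample line; the tag string format is an assumption.
sample = "1\ttalo\ttalo\tN+Sg+Nom\tN+Sg+Nom\t_\t0\tROOT\t_\t_"
token = parse_conll_line(sample)
print(token.lemma, follows_described_layout(token))
```

A check like this can be run over a whole file to confirm that the distribution at hand matches the description before deciding whether to split the joined string into separate tag and feature fields.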
==== Sample ==== | ==== Sample ==== |
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%). | Nonprojectivities in TDT are rare. Only 299 out of the 58576 tokens are attached nonprojectively (0.51%). |
| |
There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of parsing accuracy. However, I cannot rule out that the treebank described here is just the output of that parser. | I am not aware of any published evaluation of Finnish parsing accuracy. |
| |
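The nonprojectivity figures in the Parsing section can be reproduced along the following lines. This is only a sketch under one common definition, which may differ from the one used for the counts above: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. The function names are hypothetical, heads are given as a list where `heads[i]` is the head of token `i + 1` and `0` is the artificial root, and each sentence is assumed to be a well-formed tree (no cycles).

```python
def is_descendant(node: int, ancestor: int, heads: list[int]) -> bool:
    """Walk up from node; True if ancestor is reached (0 is the artificial root)."""
    while node != ancestor:
        if node == 0:
            return False
        node = heads[node - 1]
    return True


def count_nonprojective(heads: list[int]) -> int:
    """Count tokens whose attachment to their head is nonprojective."""
    count = 0
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # Every token strictly between dep and head must be dominated by head.
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            count += 1
    return count


def share_percent(nonprojective: int, total: int) -> float:
    """Share of nonprojectively attached tokens, in percent."""
    return 100.0 * nonprojective / total


# Toy sentences (not from the treebank): the second has one crossing attachment.
print(count_nonprojective([2, 0, 2]))
print(count_nonprojective([3, 4, 0, 3]))
# Percentages for the counts quoted in the Parsing section.
print(round(share_percent(7, 9491), 3), round(share_percent(299, 58576), 2))
```

Summing `count_nonprojective` over all sentences and dividing by the total token count gives the corpus-level share; the last line only recomputes the percentages from the counts quoted above.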