Differences

This shows you the differences between two versions of the page.

--- user:zeman:treebanks:fa [2012/01/29 20:35]
zeman Sample.
+++ user:zeman:treebanks:fa [2012/03/10 11:58]
zeman Tokenization.
@@ Line 35: / Line 35: @@
 ==== Size ====
-annotated sentences.
+PDT contains 189,572 tokens in 12,455 sentences, yielding 15.22 tokens per sentence on average. No official training/test data split is defined. For our HamleDT experiments, we took the first 182,878 tokens / 12,126 sentences for training and the remaining 6,694 tokens / 329 sentences for testing.
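The figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers quoted in this section; the variable names are illustrative and nothing here reads the treebank itself.

```python
# Figures quoted in the Size section (variable names are illustrative).
total_tokens, total_sentences = 189_572, 12_455
train_tokens, train_sentences = 182_878, 12_126
test_tokens, test_sentences = 6_694, 329

# The training and test portions should add up to the whole treebank.
assert train_tokens + test_tokens == total_tokens
assert train_sentences + test_sentences == total_sentences

# Average sentence length, rounded to two decimals as in the text.
avg_tokens_per_sentence = round(total_tokens / total_sentences, 2)
print(avg_tokens_per_sentence)
```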
 ==== Inside ====
 Provided in the [[:format-conll|CoNLL data format]]. The morphosyntactic annotation contains lemmas. Morphosyntactic / part-of-speech tags have been assigned manually. The text does not contain diacritical marks to distinguish short vowels (not normally shown in Persian text).
+Tokenization is subordinate to the needs of displaying syntactic relations. Some orthographic words have been split into several tokens (e.g. a verb and its object). Elsewhere, a single tree node (token) consists of two orthographic words, e.g. the analytical form of the subjunctive preterite: "کرده باشید" //korde bašäd//. Such words are not joined with the underscore character, i.e. there is a space inside the token!
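Because a token may contain a space, a reader for the data should split columns on tabs only, not on arbitrary whitespace. The following is a minimal sketch, assuming the usual tab-separated, ten-column CoNLL layout with blank lines between sentences; `read_conll` is a hypothetical helper and the sample sentence is made up for illustration, not taken from the treebank.

```python
def read_conll(lines):
    """Yield sentences as lists of column lists from CoNLL-style lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Split on tabs only: a FORM such as "کرده باشید" contains a space.
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


# Made-up two-token sample; all columns except ID and FORM are placeholders.
sample = [
    "1\tشما\t_\t_\t_\t_\t_\t_\t_\t_\n",
    "2\tکرده باشید\t_\t_\t_\t_\t_\t_\t_\t_\n",
    "\n",
]

sentences = list(read_conll(sample))
forms = [cols[1] for cols in sentences[0]]
print(forms)
```

Splitting the second sample line with a bare `split()` would yield eleven fields instead of ten and shift every later column by one, which is why the sketch passes an explicit `"\t"` separator.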
 ==== Sample ====

Institute of Formal and Applied Linguistics Wiki