Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:fa [2012/01/29 20:35] zeman Sample. |
user:zeman:treebanks:fa [2012/03/10 11:58] zeman Tokenization. |
||
---|---|---|---|
Line 35: | Line 35: | ||
==== Size ==== | ==== Size ==== | ||
- | 12200 annotated | + | PDT contains 189,572 tokens in 12,455 sentences, an average of 15.22 tokens per sentence. No official training-test data split is defined. For our HamleDT experiments, |
==== Inside ==== | ==== Inside ==== | ||
Provided in the [[: | Provided in the [[: | ||
+ | |||
+ | Tokenization is subordinate to the need to display syntactic relations. Some orthographic words are split into several tokens (e.g. a verb and its object). Conversely, a single tree node (token) may consist of two orthographic words; these are not joined with an underscore, so the token itself contains a space (e.g. the analytical form of the subjunctive preterite: " | ||
==== Sample ==== | ==== Sample ==== |
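
A practical consequence of the tokenization note above is that a reader splitting on arbitrary whitespace would cut a space-containing token in two and inflate the token count reported under Size. The following is a minimal sketch, not the HamleDT reader itself. It assumes a tab-separated, one-token-per-line layout with the word form in the second column and a blank line between sentences. The names `read_sentences` and `tokens_per_sentence` and the placeholder forms `word1 word2`, `word3`, `word4` are hypothetical.

```python
from typing import Iterable, List


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group tab-separated token lines into sentences (blank line = boundary)."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on TAB only: a token such as "word1 word2" contains a space
        # and would be broken in two by a bare str.split().
        columns = line.split("\t")
        current.append(columns[1])  # assumed layout: ID, FORM, ...
    if current:
        sentences.append(current)
    return sentences


def tokens_per_sentence(sentences: List[List[str]]) -> float:
    """Average number of tokens per sentence."""
    if not sentences:
        raise ValueError("no sentences to average over")
    return sum(len(s) for s in sentences) / len(sentences)


# Hypothetical two-sentence sample; the first token contains a space.
SAMPLE = "1\tword1 word2\tX\n2\tword3\tX\n\n1\tword4\tX\n"
sentences = read_sentences(SAMPLE.splitlines())
average = tokens_per_sentence(sentences)

# The same arithmetic applied to the counts given under Size.
reported_average = round(189572 / 12455, 2)
```

With a whitespace split, the first sample sentence would appear to have three tokens instead of two, so the tab-only split is what keeps the count in line with the tree nodes.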