Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fi [2011/12/05 14:11]
zeman References.
user:zeman:treebanks:fi [2011/12/05 14:37]
zeman Size.
Line 32:
 ==== Domain ====
  
-Mixed
+Mixed (Wikipedia, Wikinews, university web-magazine and blogs).
-  * 388 tailored sentences with movement verbs
-  * 732 sentences with movement verbs from the Estonian FrameNet corpus
-  * 175 sentences from the Arborest corpus
-  * 20 sentences of spoken language
  
 ==== Size ====
  
-All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. This is the case of our HamleDT experimental data split, shown in the last two rows of the table.
+TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing.
-
-^ File ^ Sentences ^ Terminals ^ Average t/s ^
-| arborest.xml |  175 |  2451 |  14.01 |
-| piialaused.xml |  732 |  4505 |  6.15 |
-| ratsepalaused.xml |  388 |  2348 |  6.05 |
-| sul.xml |  20 |  187 |  9.35 |
-| **total** |  **1315** |  **9491** |  **7.22** |
-| training |  1184 |  8535 |  7.21 |
-| test |  131 |  956 |  7.30 |
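
The 90 % / 10 % split described in the new revision can be checked with a short sketch. The function names below are hypothetical, and rounding the cut point up is an assumption: the page does not say how the boundary was chosen, only that it falls at 3877 sentences, which rounding up happens to reproduce.

```python
import math


def split_first_fraction(sentences, train_fraction=0.9):
    """Return (train, test): the leading part of the list and the remainder.

    Assumption: the cut index is rounded up. This is not stated in the source.
    """
    cut = math.ceil(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


def tokens_per_sentence(n_tokens, n_sentences):
    """Average number of tokens per sentence, rounded to two decimals."""
    return round(n_tokens / n_sentences, 2)


# Stand-in for the 4307 TDT sentences; real data would be lists of tokens.
sentences = list(range(4307))
train, test = split_first_fraction(sentences)
print(len(train), len(test))
print(tokens_per_sentence(58576, 4307))
```

Because the split takes the first sentences in file order instead of sampling, the test part covers only whatever material comes last in the treebank.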
  
 ==== Inside ====