Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fi [2011/12/05 14:27]
zeman Domain.
user:zeman:treebanks:fi [2011/12/05 14:37]
zeman Size.
Line 36:
 ==== Size ====
 
-All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. This is the case of our HamleDT experimental data split shown in the last two rows of the table.
-
-^ File ^ Sentences ^ Terminals ^ Average t/s ^
-| arborest.xml |  175 |  2451 |  14.01 |
-| piialaused.xml |  732 |  4505 |  6.15 |
-| ratsepalaused.xml |  388 |  2348 |  6.05 |
-| sul.xml |  20 |  187 |  9.35 |
-| **total** |  **1315** |  **9491** |  **7.22** |
-| training |  1184 |  8535 |  7.21 |
-| test |  131 |  956 |  7.30 |
+TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing.
 
 ==== Inside ====
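
The sequential 90 % / 10 % split described in the new revision of the Size section can be sketched as follows. This is a minimal illustration, not the actual HamleDT code: the function names `split_sequential` and `tokens_per_sentence` are hypothetical, and the sketch assumes that the split is made by sentence count with the training share rounded up, which is consistent with the reported 3877 / 430 sentences but is not stated on the page.

```python
def split_sequential(sentences, train_percent=90):
    """Return (train, test): the first train_percent % of sentences, then the rest.

    Assumption: the cut is made on sentence count and rounded up.
    """
    n = len(sentences)
    # Integer ceiling division avoids floating-point rounding surprises.
    train_size = (n * train_percent + 99) // 100
    return sentences[:train_size], sentences[train_size:]


def tokens_per_sentence(sentences):
    """Average number of tokens per sentence (each sentence is a list of tokens)."""
    return sum(len(sentence) for sentence in sentences) / len(sentences)


# Placeholder corpus: 4307 one-token sentences standing in for the real treebank.
corpus = [["tok"] for _ in range(4307)]
train, test = split_sequential(corpus)
print(len(train), len(test))  # 3877 430
```

With the real data, `tokens_per_sentence` applied to the whole corpus would give the average reported above (58576 / 4307, about 13.60).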