Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision | |||
user:zeman:treebanks:et [2011/11/28 09:48] zeman Size and nonprojectivity. |
user:zeman:treebanks:et [2011/11/28 17:10] (current) zeman New training/test data split. |
||
---|---|---|---|
Line 37: | Line 37: | ||
==== Size ==== | ==== Size ==== | ||
- | All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. | + | All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. Our experimental HamleDT data split does this; it is shown in the last two rows of the table. |
^ File ^ Sentences ^ Terminals ^ Average t/s ^ | ^ File ^ Sentences ^ Terminals ^ Average t/s ^ | ||
Line 44: | Line 44: | ||
| ratsepalaused.xml | 388 | 2348 | 6.05 | | | ratsepalaused.xml | 388 | 2348 | 6.05 | | ||
| sul.xml | 20 | 187 | 9.35 | | | sul.xml | 20 | 187 | 9.35 | | ||
- | | **total** | 1315 | 9491 | 7.22 | | + | | **total** | 1315 | 9491 | 7.22 | |
+ | | training | 1184 | 8535 | 7.21 | | ||
+ | | test | 131 | 956 | 7.30 | | ||
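The figures in the total, training and test rows can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are copied from the table, while the names `rows` and `tokens_per_sentence` are hypothetical and are not part of the treebank or of HamleDT.

```python
# Sentence and token counts copied from the Size table.
# The dictionary layout and all names here are illustrative assumptions.
rows = {
    "total": (1315, 9491),
    "training": (1184, 8535),
    "test": (131, 956),
}


def tokens_per_sentence(sentences, tokens):
    """Average number of tokens per sentence, rounded to two decimals."""
    return round(tokens / sentences, 2)


# Recompute the "Average t/s" column for the three rows.
averages = {name: tokens_per_sentence(s, t) for name, (s, t) in rows.items()}

# The training and test parts should add up to the whole treebank.
split_sentences = rows["training"][0] + rows["test"][0]
split_tokens = rows["training"][1] + rows["test"][1]

# Share of sentences held out for testing.
test_share = rows["test"][0] / rows["total"][0]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} tokens per sentence")
print(f"test share of sentences: {test_share:.1%}")
```

With these counts the training and test rows sum exactly to the totals, and the test part holds roughly one tenth of the sentences.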
==== Inside ==== | ==== Inside ==== |