Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:te [2012/03/22 11:34] zeman Training data size (both sentences and words) was identical in ICON 2009 and 2010. |
user:zeman:treebanks:te [2012/03/22 17:06] zeman Inside. |
||
---|---|---|---|
Line 43: | Line 43: | ||
==== Size ==== | ==== Size ==== | ||
- | HyDT-Telugu shows dependencies between chunks, not words. The node/tree ratio is thus much lower than in other treebanks. The ICON 2009 version came with a data split into three parts: training, development and test; the same data was also distributed for ICON 2010: | + | HyDT-Telugu shows dependencies between chunks, not words. The node/tree ratio is thus much lower than in other treebanks. The ICON 2009 version came with a data split into three parts: training, development and test: |
- | ^ Part ^ Sentences ^ Chunks | + | ^ Part ^ Sentences ^ Chunks ^ Ratio ^ |
- | | Training | 1400 | ? | ? | 7602 | 5.43 | | + | | Training |
- | | Development | 150 | ? | ? | 839 | 5.59 | | + | | Development | |
- | | Test | 150 | ? | ? | 836 | 5.57 | | + | | Test | |
- | | TOTAL | 1700 | ? | ? | 9277 | 5.46 | | + | | TOTAL |
- | We drew our training and test data from the ICON 2010 datasets but we have fewer sentences | + | As for ICON 2010, the data description in [[http:// |
- | ^ Part ^ Sentences ^ Chunks ^ Ratio ^ | + | ^ Part ^ Sentences ^ Chunks |
- | | Training | 1300 | 5125 | 3.94 | | + | | Training |
- | | Test | 150 | 597 | 3.98 | | + | | Development |
- | | TOTAL | | + | | Test | 150 | 599 | 3.99 | 150 | 836 | 5.57 | |
+ | | TOTAL | ||
==== Inside ==== | ==== Inside ==== | ||
- | The text uses the [[http:// | + | The text uses the [[http:// |
The CoNLL format contains only the chunk heads. The native SSF format shows the other words in the chunk, too, but it does not capture intra-chunk dependency relations. This is an example of a multi-word chunk: | The CoNLL format contains only the chunk heads. The native SSF format shows the other words in the chunk, too, but it does not capture intra-chunk dependency relations. This is an example of a multi-word chunk: | ||
- | < | + | < |
- | 3.1 ekatA QC <fs af='eka, | + | 3.1 932 QC <fs af='932, |
- | 3.2 ledisa | + | 3.2 maMxi |
- | 3.3 rumAla | + | 3.3 AdavAlYle |
))</ | ))</ | ||
Line 78: | Line 79: | ||
According to [[http:// | According to [[http:// | ||
- | Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Bengali | + | Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Telugu |
==== Sample ==== | ==== Sample ==== |
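The Ratio columns in the Size tables above appear to be a count divided by the number of sentences, rounded to two decimals. The following is a minimal Python sketch that rechecks this against the figures that survive in this diff. The names `REPORTED`, `ratio` and `mismatches` are hypothetical, and so are the row labels in parentheses. The truncated table headers do not make clear whether a given count refers to chunks or to words, so the code treats it as a generic unit count.

```python
import math

# Each entry: (row label, sentences, unit count, ratio as printed in the diff).
# Assumption: "unit count" is whatever column the printed ratio was derived
# from; the truncated headers do not say whether it counts chunks or words.
REPORTED = [
    ("Training", 1400, 7602, 5.43),
    ("Development", 150, 839, 5.59),
    ("Test", 150, 836, 5.57),
    ("TOTAL", 1700, 9277, 5.46),
    ("Training (subset on the removed side)", 1300, 5125, 3.94),
    ("Test (subset on the removed side)", 150, 597, 3.98),
    ("Test (added side)", 150, 599, 3.99),
]


def ratio(units, sentences):
    """Return units per sentence, rounded to two decimals."""
    return round(units / sentences, 2)


def mismatches(rows):
    """Return the rows whose printed ratio differs from the recomputed one."""
    bad = []
    for label, sentences, units, printed in rows:
        computed = ratio(units, sentences)
        if not math.isclose(computed, printed, abs_tol=1e-9):
            bad.append((label, computed, printed))
    return bad


if __name__ == "__main__":
    for label, sentences, units, printed in REPORTED:
        print(f"{label}: {units} / {sentences} = {ratio(units, sentences)}")
    print("mismatches:", mismatches(REPORTED))
```

The same helper can be applied to the cells that are truncated in this diff once their values are known.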