Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:hi [2011/12/08 08:38] zeman Alignment of numbers in tables. |
user:zeman:treebanks:hi [2012/10/02 16:13] zeman HTB 0.5 for the 2012 shared task. |
||
---|---|---|---|
Line 9: | Line 9: | ||
* CoNLL format | * CoNLL format | ||
* ICON 2010 | * ICON 2010 | ||
+ | * Shakti Standard Format (SSF; native) | ||
+ | * CoNLL format | ||
+ | * HPST 2012 (MTPIL workshop at COLING 2012; this version is also called HTB (Hindi Treebank) 0.5) | ||
* Shakti Standard Format (SSF; native) | * Shakti Standard Format (SSF; native) | ||
* CoNLL format | * CoNLL format | ||
- | There has been no official release of the treebank yet. There have been two as-is sample releases for the purposes of the NLP tools contests in parsing Indian languages, attached to the [[http:// | + | There has been no official release of the treebank yet. There have been three as-is sample releases for the purposes of the NLP tools contests in parsing Indian languages, attached to the [[http:// |
==== Obtaining and License ==== | ==== Obtaining and License ==== | ||
- | There is no standard distribution channel for the treebank after the ICON 2010 evaluation period. Inquire at the LTRC (ltrc (at) iiit (dot) ac (dot) in) about the possibility of getting the data. The ICON 2010 license in short: | + | There is no standard distribution channel for the treebank after the shared task evaluation period. Inquire at the LTRC (ltrc (at) iiit (dot) ac (dot) in) about the possibility of getting the data. The ICON 2010 and HPST 2012 license in short: |
* non-commercial research usage | * non-commercial research usage | ||
Line 39: | Line 42: | ||
==== Domain ==== | ==== Domain ==== | ||
- | Unknown. | + | News domain corpus from ISI Kolkata. |
==== Size ==== | ==== Size ==== | ||
Line 60: | Line 63: | ||
I have counted the sentences and tokens (words) on the '' | I have counted the sentences and tokens (words) on the '' | ||
+ | |||
+ | The HTB 0.5 (2012) version came with a data split into three parts: training, development and test. Intra-chunk dependencies have been added. The sizes of the parts are: | ||
+ | |||
+ | ^ Part ^ Sentences ^ Chunks ^ Chunks per sentence ^ Words ^ Words per sentence ^ | ||
+ | | Training | 12041 | | | 268093 | 22.27 | | ||
+ | | Development | 1233 | | | 26416 | 21.42 | | ||
+ | | Test | | | | | | | ||
+ | | TOTAL | | | | | | | ||
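The ratio after the Words column appears to be the average number of words per sentence. A minimal sketch that recomputes it from the counts in the table, assuming that interpretation; the dictionary layout and the function name `words_per_sentence` are illustrative only, and the Test and TOTAL rows are left out because the table gives no figures for them:

```python
# Counts copied from the table above. The Test and TOTAL rows are blank
# in the source, so they are not included here.
counts = {
    "Training": {"sentences": 12041, "words": 268093},
    "Development": {"sentences": 1233, "words": 26416},
}


def words_per_sentence(sentences: int, words: int) -> float:
    """Average sentence length in words, rounded to two decimals."""
    return round(words / sentences, 2)


# Assumption: the ratio is words divided by sentences.
ratios = {
    part: words_per_sentence(c["sentences"], c["words"])
    for part, c in counts.items()
}

for part, ratio in ratios.items():
    print(f"{part}: {ratio:.2f}")
```

Under that assumption the recomputed values agree with the 22.27 and 21.42 given in the table.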
==== Inside ==== | ==== Inside ==== | ||
+ | |||
+ | HTB 0.5 is distributed in Devanagari UTF-8 and in the WX encoding (see below), both in SSF and CoNLL formats, each with gold-standard and automatic morphology. | ||
+ | |||
+ | //The rest of this section applies to the ICON datasets. It may or may not still be valid for HTB 0.5.// | ||
The text uses the [[http:// | The text uses the [[http:// |