Differences

This shows you the differences between two versions of the page.

--- user:zeman:treebanks:hi [2011/12/08 08:38]
zeman Aligned numbers in tables.
+++ user:zeman:treebanks:hi [2012/10/02 16:13]
zeman HTB 0.5 for the 2012 shared task.
@@ Line 9: / Line 9: @@
     * CoNLL format
   * ICON 2010
+    * Shakti Standard Format (SSF; native)
+    * CoNLL format
+  * HPST 2012 (MTPIL workshop COLING 2012; this version is also called HTB (Hindi Treebank) 0.5)
     * Shakti Standard Format (SSF; native)
     * CoNLL format
-There has been no official release of the treebank yet. There have been two as-is sample releases for the purposes of the NLP tools contests in parsing Indian languages, attached to the [[http://ltrc.iiit.ac.in/nlptools2009/|ICON 2009]] and [[http://ltrc.iiit.ac.in/nlptools2010/|2010]] conferences.
+There has been no official release of the treebank yet. There have been three as-is sample releases for the purposes of the NLP tools contests in parsing Indian languages, attached to the [[http://ltrc.iiit.ac.in/nlptools2009/|ICON 2009]] and [[http://ltrc.iiit.ac.in/nlptools2010/|2010]] conferences and the [[http://ltrc.iiit.ac.in/mtpil2012/|MTPIL]] workshop of [[http://www.coling2012-iitb.org/|COLING 2012]].
 ==== Obtaining and License ====
-There is no standard distribution channel for the treebank after the ICON 2010 evaluation period. Inquire at the LTRC (ltrc (at) iiit (dot) ac (dot) in) about the possibility of getting the data. The ICON 2010 license in short:
+There is no standard distribution channel for the treebank after the shared task evaluation period. Inquire at the LTRC (ltrc (at) iiit (dot) ac (dot) in) about the possibility of getting the data. The ICON 2010 and HPST 2012 license in short:
   * non-commercial research usage
@@ Line 39: / Line 42: @@
 ==== Domain ====
-Unknown.
+News domain corpus from ISI Kolkata.
 ==== Size ====
@@ Line 60: / Line 63: @@
 I have counted the sentences and tokens (words) on the ''.conll'' files; there are slight differences from the statistics presented in (Husain et al., 2010).
+The HTB 0.5 (2012) version came with a data split into three parts: training, development and test. The intra-chunk dependencies have been added:
+^ Part ^ Sentences ^ Chunks ^ Ratio ^ Words ^ Ratio ^
+| Training |   12041 | | | 268093 |  22.27 |
+| Development | 1233 | | |  26416 |  21.42 |
+| Test |             | | |        |        |
+| TOTAL |            | | |        |        |
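
The counts and ratios in the table above can be reproduced with a few lines of Python. This is only a minimal sketch, not the script actually used for the page: it assumes the usual CoNLL layout (one token per non-blank line, blank lines between sentences), and the function names `count_conll` and `words_per_sentence` are made up for illustration.

```python
def count_conll(lines):
    """Count sentences and tokens in CoNLL-style lines.

    Assumption: every non-blank line is one token and
    sentences are separated by one or more blank lines.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            if not in_sentence:
                # First token after a blank line opens a new sentence.
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens


def words_per_sentence(sentences, tokens):
    """Average sentence length, rounded to two decimals as in the table."""
    return round(tokens / sentences, 2)


if __name__ == "__main__":
    # Figures taken from the Training and Development rows above.
    print(words_per_sentence(12041, 268093))
    print(words_per_sentence(1233, 26416))
```

To count a real file, pass an open file object, for example `count_conll(open(path, encoding="utf-8"))`, where `path` points to one of the ''.conll'' files.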
 ==== Inside ====
+HTB 0.5 is distributed in Devanagari UTF-8 and in the WX encoding (see below), both in SSF and CoNLL formats, each with gold-standard and automatic morphology.
+//The rest of this section applies to the ICON datasets. It may or may not still be valid for HTB 0.5.//
 The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Devanagari in this case) we can map the WX encoding to the original characters in UTF-8. WX uses Latin letters, so any embedded English (or other string written in Latin letters) will probably be garbled by the conversion. Note that broken characters (''\x{FFFD} REPLACEMENT CHARACTER'') are not infrequent in the WX encoding and the correct characters cannot be recovered automatically.
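
The WX-to-Devanagari mapping described above can be sketched as follows. This is an illustrative, partial converter and not the tool used to prepare the data: the letter table covers only the basic consonants, vowels and three signs as commonly given for WX (consult the linked map.pdf for the authoritative table), and the names `wx_to_devanagari` and `count_broken` are hypothetical.

```python
# Partial WX table (assumption: the commonly documented WX values).
CONSONANTS = {
    "k": "क", "K": "ख", "g": "ग", "G": "घ", "f": "ङ",
    "c": "च", "C": "छ", "j": "ज", "J": "झ", "F": "ञ",
    "t": "ट", "T": "ठ", "d": "ड", "D": "ढ", "N": "ण",
    "w": "त", "W": "थ", "x": "द", "X": "ध", "n": "न",
    "p": "प", "P": "फ", "b": "ब", "B": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "S": "श", "R": "ष", "s": "स", "h": "ह",
}
# WX vowel -> (independent letter, dependent vowel sign after a consonant).
VOWELS = {
    "a": ("अ", ""),        # inherent vowel: no sign needed
    "A": ("आ", "\u093e"), "i": ("इ", "\u093f"), "I": ("ई", "\u0940"),
    "u": ("उ", "\u0941"), "U": ("ऊ", "\u0942"), "q": ("ऋ", "\u0943"),
    "e": ("ए", "\u0947"), "E": ("ऐ", "\u0948"),
    "o": ("ओ", "\u094b"), "O": ("औ", "\u094c"),
}
# Anusvara, visarga, chandrabindu.
SIGNS = {"M": "\u0902", "H": "\u0903", "z": "\u0901"}
VIRAMA = "\u094d"


def wx_to_devanagari(text):
    """Convert a WX string to Devanagari using the partial table above."""
    out = []
    pending = False  # True right after a consonant that has no vowel yet
    for ch in text:
        if ch in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster: suppress inherent vowel
            out.append(CONSONANTS[ch])
            pending = True
        elif ch in VOWELS:
            independent, sign = VOWELS[ch]
            out.append(sign if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)  # bare consonant before a non-letter
            out.append(SIGNS.get(ch, ch))  # unknown characters pass through
            pending = False
    if pending:
        out.append(VIRAMA)
    return "".join(out)


def count_broken(text):
    """Number of U+FFFD REPLACEMENT CHARACTERs, which cannot be repaired."""
    return text.count("\ufffd")
```

Because every Latin letter is interpreted as WX, an embedded English word such as `the` is converted to Devanagari as well, which is the loss mentioned above. Running `count_broken` on each token before conversion at least makes it possible to report how many tokens are affected by broken characters.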


Institute of Formal and Applied Linguistics Wiki
