Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:hi [2011/12/08 08:38]
zeman Aligned the numbers in tables.
user:zeman:treebanks:hi [2012/12/15 13:15] (current)
zeman
Line 11: Line 11:
     * Shakti Standard Format (SSF; native)
     * CoNLL format
 +  * HPST 2012 (MTPIL workshop COLING 2012; this version is also called HTB (Hindi Treebank) 0.5)
 +    * Shakti Standard Format (SSF; native)
 +    * CoNLL format
 +    * "Hyderabad DT" is what they call the old treebank with little data. This one is the Hindi treebank from a large NSF-sponsored project.
  
-There has been no official release of the treebank yet. There have been two as-is sample releases for the purposes of the NLP tools contests in parsing Indian languages, attached to the [[http://ltrc.iiit.ac.in/nlptools2009/|ICON 2009]] and [[http://ltrc.iiit.ac.in/nlptools2010/|2010]] conferences.
 +There has been no official release of the treebank yet. There have been three as-is sample releases for the purposes of the NLP tools contests in parsing Indian languages, attached to the [[http://ltrc.iiit.ac.in/nlptools2009/|ICON 2009]] and [[http://ltrc.iiit.ac.in/nlptools2010/|2010]] conferences and the [[http://ltrc.iiit.ac.in/mtpil2012/|MTPIL]] workshop of [[http://www.coling2012-iitb.org/|COLING 2012]].
  
 ==== Obtaining and License ====
  
-There is no standard distribution channel for the treebank after the ICON 2010 evaluation period. Inquire at the LTRC (ltrc (at) iiit (dot) ac (dot) in) about the possibility of getting the data. The ICON 2010 license in short:+There is no standard distribution channel for the treebank after the shared task evaluation period. Inquire at the LTRC (ltrc (at) iiit (dot) ac (dot) in) about the possibility of getting the data. The ICON 2010 and HPST 2012 license in short:
  
   * non-commercial research usage
Line 39: Line 44:
 ==== Domain ====
  
-Unknown.+News domain corpus from ISI Kolkata.
  
 ==== Size ====
Line 60: Line 65:
  
 I counted the sentences and tokens (words) in the ''.conll'' files; the counts differ slightly from the statistics presented in (Husain et al., 2010).
 +
 +The HTB 0.5 (2012) version comes split into three parts: training, development and test. Intra-chunk dependencies have been added. The size of each part:
 +
 +^ Part ^ Sentences ^ Chunks ^ Chunks per sentence ^ Words ^ Words per sentence ^
 +| Training |    12041 | | | 268093 |  22.27 |
 +| Development |  1233 | | |  26416 |  21.42 |
 +| Test |              | | |        |        |
 +| TOTAL |             | | |        |        |
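
The counting behind these figures can be approximated with a short Python sketch. It assumes that sentences in the ''.conll'' files are separated by blank lines and that every non-blank line is one token; the function names ''conll_stats'' and ''words_per_sentence'' are illustrative and not part of any treebank tool. The last two calls only recompute the words-to-sentences ratio from the counts given in the table.

```python
def conll_stats(lines):
    """Count sentences and tokens in CoNLL-style lines.

    Assumption: a blank line ends a sentence and every
    non-blank line holds exactly one token.
    """
    sentences = 0
    tokens = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            if not in_sentence:
                # First token after a blank line opens a new sentence.
                sentences += 1
                in_sentence = True
        else:
            in_sentence = False
    return sentences, tokens


def words_per_sentence(words, sentences):
    """Average sentence length, rounded to two decimals as in the table."""
    return round(words / sentences, 2)


# Tiny made-up sample: two sentences, three tokens.
sample = ["1\ta\t_", "2\tb\t_", "", "1\tc\t_", ""]
print(conll_stats(sample))

# Recompute the ratios from the training and development counts.
print(words_per_sentence(268093, 12041))
print(words_per_sentence(26416, 1233))
```

To run it on a real file, pass an open file handle (opened with ''encoding="utf-8"'') as ''lines''.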
  
 ==== Inside ====
 +
 +HTB 0.5 is distributed in Devanagari UTF-8 and in the WX encoding (see below), both in SSF and CoNLL formats, each with gold-standard and automatic morphology.
 +
 +//The rest of this section applies to the ICON datasets. It may or may not still be valid for HTB 0.5.//
  
 The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Devanagari in this case), we can map the WX encoding to the original characters in UTF-8. WX uses Latin letters, so any embedded English (or other string written in Latin letters) will probably get lost during the conversion. Note that broken characters (''\x{FFFD} REPLACEMENT CHARACTER'') are not infrequent in the WX encoding, and the correct characters cannot be recovered automatically.
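
Before converting WX text it may help to find the tokens that contain the replacement character, since those cannot be repaired by the mapping. The following Python sketch shows one way to do that; the function name ''broken_tokens'' and the damaged sample token are made up for illustration.

```python
# U+FFFD REPLACEMENT CHARACTER marks bytes that could not be decoded.
REPLACEMENT = "\ufffd"


def broken_tokens(tokens):
    """Return the tokens that contain at least one U+FFFD character."""
    return [tok for tok in tokens if REPLACEMENT in tok]


# 'kareMge' is a WX form from the treebank; the second token is a
# hypothetical damaged variant used only to exercise the check.
tokens = ["kareMge", "kare\ufffdge", "."]
damaged = broken_tokens(tokens)
print(len(damaged), "of", len(tokens), "tokens are damaged")
```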
Line 584: Line 601:
 | <nowiki>36</nowiki> | <nowiki>करेंगे</nowiki> | <nowiki>कर</nowiki> | <nowiki>VM</nowiki> | <nowiki>v</nowiki> | <nowiki>lex-kara|cat-v|gend-m|num-pl|pers-3|case-|vib-gA|tam-gA|posn-360|name-kareMge|chunkId-VGF|chunkType-head:VGF</nowiki> | <nowiki>0</nowiki> | <nowiki>main</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 | <nowiki>37</nowiki> | <nowiki>.</nowiki> | <nowiki>.</nowiki> | <nowiki>SYM</nowiki> | <nowiki>punc</nowiki> | <nowiki>lex-.|cat-punc|gend-|num-|pers-|case-|vib-|tam-|posn-370|chunkType-child:VGF|name-.</nowiki> | <nowiki>36</nowiki> | <nowiki>rsym</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +
 +The first sentence of the HPST 2012 training data in the UTF-8 SSF format with gold-standard morphology:
 +
 +<code xml><Sentence id='1'>
 +1 गुजरात NNP <fs af='गुजरात,n,m,sg,3,o,0_का,0' name='गुजरात' posn='10' chunkId='NP' drel='r6:मुख्यमंत्री' vpos='vib_2' chunkType='head:NP'>
 +2 के PSP <fs af='का,psp,m,sg,,o,,' name='के' posn='20' drel='lwg__psp:गुजरात' chunkType='child:NP'>
 +3 मुख्यमंत्री NNP <fs af='मुख्यमंत्री,n,m,sg,3,o,0,0' name='मुख्यमंत्री' posn='30' chunkId='NP2' drel='nmod:मोदी' chunkType='head:NP2'>
 +4 नरेंद्र NNPC <fs af='नरेंद्र,n,m,sg,3,d,0,0' name='नरेंद्र' posn='40' drel='pof__cn:मोदी' chunkType='child:NP3'>
 +5 मोदी NNP <fs af='मोदी,n,m,sg,3,o,0_ने,0' name='मोदी' posn='50' chunkId='NP3' drel='k1:किया' vpos='vib_3' chunkType='head:NP3'>
 +6 ने PSP <fs af='ने,psp,,,,,,' name='ने' posn='60' drel='lwg__psp:मोदी' chunkType='child:NP3'>
 +7 मंगलवार NNP <fs af='मंगलवार,n,m,sg,3,o,0_को,0' name='मंगलवार' posn='70' chunkId='NP4' drel='k7t:किया' vpos='vib_2' chunkType='head:NP4'>
 +8 को PSP <fs af='को,psp,,,,,,' name='को' posn='80' drel='lwg__psp:मंगलवार' chunkType='child:NP4'>
 +9 गृह NNPC <fs af='गृह,n,m,sg,3,d,0,0' name='गृह' posn='90' drel='pof__cn:मंत्री' chunkType='child:NP5'>
 +10 मंत्री NNP <fs af='मंत्री,n,m,sg,3,d,0,0' name='मंत्री' posn='100' drel='nmod__adj:पाटिल' chunkType='child:NP5'>
 +11 शिवराज NNPC <fs af='शिवराज,n,m,sg,3,d,0,0' name='शिवराज' posn='110' drel='pof__cn:पाटिल' chunkType='child:NP5'>
 +12 पाटिल NNP <fs af='पाटिल,n,m,sg,3,o,0_से,0' name='पाटिल' posn='120' chunkId='NP5' drel='k4:किया' vpos='vib_vib_5' chunkType='head:NP5'>
 +13 से PSP <fs af='से,psp,,,,,,' name='से' posn='130' drel='lwg__psp:पाटिल' chunkType='child:NP5'>
 +14 मुलाकात NN <fs af='मुलाकात,n,f,sg,3,d,0,0' name='मुलाकात' posn='140' chunkId='NP6' drel='pof:कर' chunkType='head:NP6'>
 +15 कर VM <fs af='कर,v,any,any,any,,0,0' name='कर' posn='150' chunkId='VGNF' drel='vmod:किया' chunkType='head:VGNF'>
 +16 आईएएस NNP <fs af='आईएएस,n,m,sg,3,o,0,0' name='आईएएस' posn='160' chunkId='NP7' drel='ccof:और' chunkType='head:NP7'>
 +17 और CC <fs af='और,avy,,,,,,' name='और' posn='170' chunkId='CCP' drel='r6:तर्ज' chunkType='head:CCP'>
 +18 आईपीएस NNP <fs af='आईपीएस,n,m,sg,3,o,0_का,0' name='आईपीएस' posn='180' chunkId='NP8' drel='ccof:और' vpos='vib_2' chunkType='head:NP8'>
 +19 की PSP <fs af='का,psp,f,sg,,o,,' name='की' posn='190' drel='lwg__psp:आईपीएस' chunkType='child:NP8'>
 +20 तर्ज NN <fs af='तर्ज,n,f,sg,3,o,0_पर,0' name='तर्ज' posn='200' chunkId='NP9' drel='k7:किया' vpos='vib_2' chunkType='head:NP9'>
 +21 पर PSP <fs af='पर,psp,,,,,,' name='पर' posn='210' drel='lwg__psp:तर्ज' chunkType='child:NP9'>
 +22 राष्ट्रीय JJ <fs af='राष्ट्रीय,adj,any,any,,o,,' name='राष्ट्रीय' posn='220' drel='nmod__adj:स्तर' chunkType='child:NP10'>
 +23 स्तर NN <fs af='स्तर,n,m,sg,3,o,0_पर,0' name='स्तर' posn='230' chunkId='NP10' drel='k7:किया' vpos='vib_3' chunkType='head:NP10'>
 +24 पर PSP <fs af='पर,psp,,,,,,' name='पर2' posn='240' drel='lwg__psp:स्तर' chunkType='child:NP10'>
 +25 एक QC <fs af='एक,num,any,any,,any,,' name='एक' posn='250' drel='nmod__adj:सेवा' chunkType='child:NP11'>
 +26 खुफिया JJ <fs af='खुफिया,adj,any,any,,d,,' name='खुफिया' posn='260' drel='nmod__adj:सेवा' chunkType='child:NP11'>
 +27 सेवा NN <fs af='सेवा,n,f,sg,3,d,0,0' name='सेवा' posn='270' chunkId='NP11' drel='k2:करने' chunkType='head:NP11'>
 +28 शुरू NN <fs af='शुरू,n,m,sg,3,d,0,0' name='शुरू' posn='280' chunkId='NP12' drel='pof:करने' chunkType='head:NP12'>
 +29 करने VM <fs af='कर,v,any,sg,any,o,ना_का,nA' name='करने' posn='290' chunkId='VGNN' drel='r6-k2:अनुरोध' vpos='tam_2' chunkType='head:VGNN'>
 +30 का PSP <fs af='का,psp,m,sg,,d,,' name='का' posn='300' drel='lwg__psp:करने' chunkType='child:VGNN'>
 +31 अनुरोध NN <fs af='अनुरोध,n,m,sg,3,d,0,0' name='अनुरोध' posn='310' chunkId='NP13' drel='pof:किया' chunkType='head:NP13'>
 +32 किया VM <fs af='कर,v,m,sg,any,,या,yA' name='किया' posn='320' chunkId='VGF' chunkType='head:VGF' voicetype='active' stype='declarative'>
 +33 । SYM <fs af='।,punc,,,,,,' name='।' posn='330' chunkId='BLK' drel='rsym:किया' chunkType='head:BLK'>
 +</Sentence></code>
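
A token line of this shape can be taken apart with a few lines of Python. This is only a sketch based on the sample above, not an official SSF reader: it assumes whitespace-separated index, form and tag followed by one ''<fs ...>'' structure with single-quoted attribute values. The names given to the eight ''af'' fields are borrowed from the CoNLL feature names (''lex'', ''cat'', ''gen'', ...), a correspondence inferred from comparing the two samples on this page.

```python
import re

# Matches name='value' pairs inside the <fs ...> feature structure.
ATTR = re.compile(r"(\w+)='([^']*)'")

# Assumed labels for the eight comma-separated fields of 'af'.
AF_FIELDS = ["lex", "cat", "gen", "num", "pers", "case", "vib", "tam"]


def parse_ssf_token(line):
    """Split one SSF token line into index, form, tag and attributes."""
    index, form, tag, fs = line.split(None, 3)
    attrs = dict(ATTR.findall(fs))
    # Unpack 'af' into named morphological fields (empty if 'af' is absent).
    af_values = attrs["af"].split(",") if "af" in attrs else []
    morph = dict(zip(AF_FIELDS, af_values))
    # 'drel' holds relation and head form separated by a colon.
    relation, _, head = attrs.get("drel", "").partition(":")
    return {
        "index": int(index),
        "form": form,
        "tag": tag,
        "attrs": attrs,
        "morph": morph,
        "relation": relation,
        "head": head,
    }


line = ("1 गुजरात NNP <fs af='गुजरात,n,m,sg,3,o,0_का,0' name='गुजरात' "
        "posn='10' chunkId='NP' drel='r6:मुख्यमंत्री' vpos='vib_2' "
        "chunkType='head:NP'>")
token = parse_ssf_token(line)
print(token["form"], token["tag"], token["relation"], token["head"])
print(token["morph"])
```

Note that ''drel'' points to the head by its ''name'' value, not by token index, so resolving heads needs a second pass over the sentence.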
 +
 +And the same in CoNLL format:
 +
 +| 1 | <nowiki>गुजरात</nowiki> | <nowiki>गुजरात</nowiki> | NNP | n | <nowiki>lex-गुजरात|cat-n|gen-m|num-sg|pers-3|case-o|vib-0_का|tam-0|chunkId-NP|chunkType-head|stype-|voicetype-</nowiki> | 3 | r6 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | <nowiki>के</nowiki> | <nowiki>का</nowiki> | PSP | psp | <nowiki>lex-का|cat-psp|gen-m|num-sg|pers-|case-o|vib-|tam-|chunkId-NP|chunkType-child|stype-|voicetype-</nowiki> | 1 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | <nowiki>मुख्यमंत्री</nowiki> | <nowiki>मुख्यमंत्री</nowiki> | NNP | n | <nowiki>lex-मुख्यमंत्री|cat-n|gen-m|num-sg|pers-3|case-o|vib-0|tam-0|chunkId-NP2|chunkType-head|stype-|voicetype-</nowiki> | 5 | nmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | <nowiki>नरेंद्र</nowiki> | <nowiki>नरेंद्र</nowiki> | NNPC | n | <nowiki>lex-नरेंद्र|cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP3|chunkType-child|stype-|voicetype-</nowiki> | 5 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | <nowiki>मोदी</nowiki> | <nowiki>मोदी</nowiki> | NNP | n | <nowiki>lex-मोदी|cat-n|gen-m|num-sg|pers-3|case-o|vib-0_ने|tam-0|chunkId-NP3|chunkType-head|stype-|voicetype-</nowiki> | 32 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | <nowiki>ने</nowiki> | <nowiki>ने</nowiki> | PSP | psp | <nowiki>lex-ने|cat-psp|gen-|num-|pers-|case-|vib-|tam-|chunkId-NP3|chunkType-child|stype-|voicetype-</nowiki> | 5 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | <nowiki>मंगलवार</nowiki> | <nowiki>मंगलवार</nowiki> | NNP | n | <nowiki>lex-मंगलवार|cat-n|gen-m|num-sg|pers-3|case-o|vib-0_को|tam-0|chunkId-NP4|chunkType-head|stype-|voicetype-</nowiki> | 32 | k7t | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | <nowiki>को</nowiki> | <nowiki>को</nowiki> | PSP | psp | <nowiki>lex-को|cat-psp|gen-|num-|pers-|case-|vib-|tam-|chunkId-NP4|chunkType-child|stype-|voicetype-</nowiki> | 7 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | <nowiki>गृह</nowiki> | <nowiki>गृह</nowiki> | NNPC | n | <nowiki>lex-गृह|cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP5|chunkType-child|stype-|voicetype-</nowiki> | 10 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 10 | <nowiki>मंत्री</nowiki> | <nowiki>मंत्री</nowiki> | NNP | n | <nowiki>lex-मंत्री|cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP5|chunkType-child|stype-|voicetype-</nowiki> | 12 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 11 | <nowiki>शिवराज</nowiki> | <nowiki>शिवराज</nowiki> | NNPC | n | <nowiki>lex-शिवराज|cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP5|chunkType-child|stype-|voicetype-</nowiki> | 12 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 12 | <nowiki>पाटिल</nowiki> | <nowiki>पाटिल</nowiki> | NNP | n | <nowiki>lex-पाटिल|cat-n|gen-m|num-sg|pers-3|case-o|vib-0_से|tam-0|chunkId-NP5|chunkType-head|stype-|voicetype-</nowiki> | 32 | k4 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 13 | <nowiki>से</nowiki> | <nowiki>से</nowiki> | PSP | psp | <nowiki>lex-से|cat-psp|gen-|num-|pers-|case-|vib-|tam-|chunkId-NP5|chunkType-child|stype-|voicetype-</nowiki> | 12 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 14 | <nowiki>मुलाकात</nowiki> | <nowiki>मुलाकात</nowiki> | NN | n | <nowiki>lex-मुलाकात|cat-n|gen-f|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP6|chunkType-head|stype-|voicetype-</nowiki> | 15 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 15 | कर | कर | VM | v | <nowiki>lex-कर|cat-v|gen-any|num-any|pers-any|case-|vib-0|tam-0|chunkId-VGNF|chunkType-head|stype-|voicetype-</nowiki> | 32 | vmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 16 | आईएएस | आईएएस | NNP | n | <nowiki>lex-आईएएस|cat-n|gen-m|num-sg|pers-3|case-o|vib-0|tam-0|chunkId-NP7|chunkType-head|stype-|voicetype-</nowiki> | 17 | ccof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 17 | और | और | CC | avy | <nowiki>lex-और|cat-avy|gen-|num-|pers-|case-|vib-|tam-|chunkId-CCP|chunkType-head|stype-|voicetype-</nowiki> | 20 | r6 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 18 | <nowiki>आईपीएस</nowiki> | <nowiki>आईपीएस</nowiki> | NNP | n | <nowiki>lex-आईपीएस|cat-n|gen-m|num-sg|pers-3|case-o|vib-0_का|tam-0|chunkId-NP8|chunkType-head|stype-|voicetype-</nowiki> | 17 | ccof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 19 | <nowiki>की</nowiki> | <nowiki>का</nowiki> | PSP | psp | <nowiki>lex-का|cat-psp|gen-f|num-sg|pers-|case-o|vib-|tam-|chunkId-NP8|chunkType-child|stype-|voicetype-</nowiki> | 18 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 20 | <nowiki>तर्ज</nowiki> | <nowiki>तर्ज</nowiki> | NN | n | <nowiki>lex-तर्ज|cat-n|gen-f|num-sg|pers-3|case-o|vib-0_पर|tam-0|chunkId-NP9|chunkType-head|stype-|voicetype-</nowiki> | 32 | k7 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 21 | पर | पर | PSP | psp | <nowiki>lex-पर|cat-psp|gen-|num-|pers-|case-|vib-|tam-|chunkId-NP9|chunkType-child|stype-|voicetype-</nowiki> | 20 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 22 | <nowiki>राष्ट्रीय</nowiki> | <nowiki>राष्ट्रीय</nowiki> | JJ | adj | <nowiki>lex-राष्ट्रीय|cat-adj|gen-any|num-any|pers-|case-o|vib-|tam-|chunkId-NP10|chunkType-child|stype-|voicetype-</nowiki> | 23 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 23 | <nowiki>स्तर</nowiki> | <nowiki>स्तर</nowiki> | NN | n | <nowiki>lex-स्तर|cat-n|gen-m|num-sg|pers-3|case-o|vib-0_पर|tam-0|chunkId-NP10|chunkType-head|stype-|voicetype-</nowiki> | 32 | k7 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 24 | पर | पर | PSP | psp | <nowiki>lex-पर|cat-psp|gen-|num-|pers-|case-|vib-|tam-|chunkId-NP10|chunkType-child|stype-|voicetype-</nowiki> | 23 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 25 | एक | एक | QC | num | <nowiki>lex-एक|cat-num|gen-any|num-any|pers-|case-any|vib-|tam-|chunkId-NP11|chunkType-child|stype-|voicetype-</nowiki> | 27 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 26 | <nowiki>खुफिया</nowiki> | <nowiki>खुफिया</nowiki> | JJ | adj | <nowiki>lex-खुफिया|cat-adj|gen-any|num-any|pers-|case-d|vib-|tam-|chunkId-NP11|chunkType-child|stype-|voicetype-</nowiki> | 27 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 27 | <nowiki>सेवा</nowiki> | <nowiki>सेवा</nowiki> | NN | n | <nowiki>lex-सेवा|cat-n|gen-f|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP11|chunkType-head|stype-|voicetype-</nowiki> | 29 | k2 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 28 | <nowiki>शुरू</nowiki> | <nowiki>शुरू</nowiki> | NN | n | <nowiki>lex-शुरू|cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP12|chunkType-head|stype-|voicetype-</nowiki> | 29 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 29 | <nowiki>करने</nowiki> | कर | VM | v | <nowiki>lex-कर|cat-v|gen-any|num-sg|pers-any|case-o|vib-ना_का|tam-nA|chunkId-VGNN|chunkType-head|stype-|voicetype-</nowiki> | 31 | <nowiki>r6-k2</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 30 | <nowiki>का</nowiki> | <nowiki>का</nowiki> | PSP | psp | <nowiki>lex-का|cat-psp|gen-m|num-sg|pers-|case-d|vib-|tam-|chunkId-VGNN|chunkType-child|stype-|voicetype-</nowiki> | 29 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 31 | <nowiki>अनुरोध</nowiki> | <nowiki>अनुरोध</nowiki> | NN | n | <nowiki>lex-अनुरोध|cat-n|gen-m|num-sg|pers-3|case-d|vib-0|tam-0|chunkId-NP13|chunkType-head|stype-|voicetype-</nowiki> | 32 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 32 | <nowiki>किया</nowiki> | कर | VM | v | <nowiki>lex-कर|cat-v|gen-m|num-sg|pers-any|case-|vib-या|tam-yA|chunkId-VGF|chunkType-head|stype-declarative'>|voicetype-active</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 33 | <nowiki>।</nowiki> | <nowiki>।</nowiki> | SYM | punc | <nowiki>lex-।|cat-punc|gen-|num-|pers-|case-|vib-|tam-|chunkId-BLK|chunkType-head|stype-|voicetype-</nowiki> | 32 | rsym | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
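
The sixth column packs the morphological features as ''name-value'' pairs joined by ''|''. A minimal Python sketch for unpacking it follows; the function name ''parse_feats'' is illustrative. It splits each item on its first hyphen only, on the assumption that feature names never contain a hyphen while values may be empty or contain other characters.

```python
def parse_feats(feats):
    """Turn 'lex-X|cat-n|...' into a dict of feature name to value."""
    result = {}
    for item in feats.split("|"):
        # Split on the first hyphen only; the value may be empty.
        name, _, value = item.partition("-")
        result[name] = value
    return result


# Feature column of token 2 from the sample sentence above.
feats = ("lex-का|cat-psp|gen-m|num-sg|pers-|case-o|vib-|tam-|"
         "chunkId-NP|chunkType-child|stype-|voicetype-")
parsed = parse_feats(feats)
print(parsed["lex"], parsed["cat"], parsed["chunkType"])
```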
 +
 +The same sentence with “automatically tagged” morphology. Apparently this means no morphology at all: the lemma, coarse tag and feature columns contain only ''_'', so contestants are presumably expected to run their own taggers.
 +
 +| 1 | <nowiki>गुजरात</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 3 | r6 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | <nowiki>के</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 1 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | <nowiki>मुख्यमंत्री</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 5 | nmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | <nowiki>नरेंद्र</nowiki> | <nowiki>_</nowiki> | NNPC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 5 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | <nowiki>मोदी</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | <nowiki>ने</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 5 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | <nowiki>मंगलवार</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k7t | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | <nowiki>को</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 7 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | <nowiki>गृह</nowiki> | <nowiki>_</nowiki> | NNPC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 10 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 10 | <nowiki>मंत्री</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 12 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 11 | <nowiki>शिवराज</nowiki> | <nowiki>_</nowiki> | NNPC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 12 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 12 | <nowiki>पाटिल</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k4 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 13 | <nowiki>से</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 12 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 14 | <nowiki>मुलाकात</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 15 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 15 | कर | <nowiki>_</nowiki> | VM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | vmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 16 | आईएएस | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 17 | ccof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 17 | और | <nowiki>_</nowiki> | CC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 20 | r6 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 18 | <nowiki>आईपीएस</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 17 | ccof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 19 | <nowiki>की</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 18 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 20 | <nowiki>तर्ज</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k7 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 21 | पर | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 20 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 22 | <nowiki>राष्ट्रीय</nowiki> | <nowiki>_</nowiki> | JJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 23 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 23 | <nowiki>स्तर</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k7 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 24 | पर | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 23 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 25 | एक | <nowiki>_</nowiki> | QC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 27 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 26 | <nowiki>खुफिया</nowiki> | <nowiki>_</nowiki> | NNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 27 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 27 | <nowiki>सेवा</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 29 | k2 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 28 | <nowiki>शुरू</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 29 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 29 | <nowiki>करने</nowiki> | <nowiki>_</nowiki> | VM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 31 | <nowiki>r6-k2</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 30 | <nowiki>का</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 29 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 31 | <nowiki>अनुरोध</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 32 | <nowiki>किया</nowiki> | <nowiki>_</nowiki> | VM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 33 | <nowiki>।</nowiki> | <nowiki>_</nowiki> | SYM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | rsym | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +
 +The first sentence of the development data in the UTF-8 SSF format with gold-standard morphology:
 +
 +<code xml><Sentence id='1'>
 +1 भाजपा NNP <fs af='भाजपा,n,f,sg,3,o,0_ने,0' name='भाजपा' posn='10' chunkId='NP' drel='k1:लगाया' vpos='vib_2' chunkType='head:NP'>
 +2 ने PSP <fs af='ने,psp,,,,,,' name='ने' posn='20' drel='lwg__psp:भाजपा' chunkType='child:NP'>
 +3 केंद्र NNPC <fs name='केंद्र' chunkId='FRAGP' chunkType='head:'FRAGP' drel='ccof:और'>
 +4 और CC <fs af='और,avy,,,,,,' name='और' posn='40' chunkId='CCP' drel='nmod:सरकार' chunkType='head:CCP'>
 +5 केरल NNPC <fs name='केरल' chunkId='FRAGP2' chunkType='head:'FRAGP2' drel='ccof:और'>
 +6 सरकार NNP <fs af='सरकार,n,f,sg,3,o,0_पर,0' name='सरकार' posn='60' chunkId='NP2' drel='k7:लगाया' vpos='vib_2' chunkType='head:NP2'>
 +7 पर PSP <fs af='पर,psp,,,,,,' name='पर' posn='70' drel='lwg__psp:सरकार' chunkType='child:NP2'>
 +8 भारतीय JJ <fs af='भारतीय,adj,any,any,,o,,' name='भारतीय' posn='80' drel='nmod__adj:ड्राइवर' chunkType='child:NP3'>
 +9 ड्राइवर NN <fs af='ड्राइवर,n,m,sg,3,o,0,0' name='ड्राइवर' posn='90' chunkId='NP3' drel='nmod:कुट्टी' chunkType='head:NP3'>
 +10 एम. NNPC <fs af='एम.,n,m,sg,3,d,0,0' name='एम.' posn='100' drel='pof__cn:कुट्टी' chunkType='child:NP4'>
 +11 आर. NNPC <fs af='आर.,n,m,sg,3,d,0,0' name='आर.' posn='110' drel='pof__cn:कुट्टी' chunkType='child:NP4'>
 +12 कुट्टी NNP <fs af='कुट्टी,n,m,sg,3,o,0_का,0' name='कुट्टी' posn='120' chunkId='NP4' drel='r6:हत्या' vpos='vib_4' chunkType='head:NP4'>
 +13 की PSP <fs af='का,psp,f,sg,,o,,' name='की' posn='130' drel='lwg__psp:कुट्टी' chunkType='child:NP4'>
 +14 हत्या NN <fs af='हत्या,n,f,sg,3,o,0_के_लिए,0' name='हत्या' posn='140' chunkId='NP5' drel='jjmod:जिम्मेदार' vpos='vib_2_3' chunkType='head:NP5'>
 +15 के PSP <fs af='के,psp,,,,,,' name='के' posn='150' drel='lwg__psp:हत्या' chunkType='child:NP5'>
 +16 लिए PSP <fs af='लिए,psp,,,,,,' name='लिए' posn='160' drel='lwg__cont:हत्या' chunkType='child:NP5'>
 +17 जिम्मेदार JJ <fs af='जिम्मेदार,adj,any,any,,o,,' name='जिम्मेदार' posn='170' chunkId='JJP' drel='nmod:तालिबान' chunkType='head:JJP'>
 +18 तालिबान NNP <fs af='तालिबान,n,m,sg,3,o,0_के_साथ,0' name='तालिबान' posn='180' chunkId='NP6' drel='ras-k1:लगाया' vpos='vib_2_3' chunkType='head:NP6'>
 +19 के PSP <fs af='के,psp,,,,,,' name='के2' posn='190' drel='lwg__psp:तालिबान' chunkType='child:NP6'>
 +20 साथ NST <fs af='साथ,nst,m,sg,3,d,,' name='साथ' posn='200' drel='lwg__cont:तालिबान' chunkType='child:NP6'>
 +21 निपटने VM <fs af='निपट,v,any,any,any,o,ना_में,nA' name='निपटने' posn='210' chunkId='VGNN' drel='k7:लगाया' vpos='tam_2' chunkType='head:VGNN'>
 +22 में PSP <fs af='में,psp,,,,,,' name='में' posn='220' drel='lwg__psp:निपटने' chunkType='child:VGNN'>
 +23 ढिलाई NN <fs af='ढिलाई,n,f,sg,3,d,0,0' name='ढिलाई' posn='230' chunkId='NP7' drel='k2:बरतने' chunkType='head:NP7'>
 +24 बरतने VM <fs af='बरत,v,any,sg,any,o,ना_का,nA' name='बरतने' posn='240' chunkId='VGNN2' drel='r6:आरोप' vpos='tam_2' chunkType='head:VGNN2'>
 +25 का PSP <fs af='का,psp,m,sg,,d,,' name='का' posn='250' drel='lwg__psp:बरतने' chunkType='child:VGNN2'>
 +26 आरोप NN <fs af='आरोप,n,m,sg,3,d,0,0' name='आरोप' posn='260' chunkId='NP8' drel='k2:लगाया' chunkType='head:NP8'>
 +27 लगाया VM <fs af='लगा,v,m,sg,3,,या_है,yA' name='लगाया' posn='270' chunkId='VGF' chunkType='head:VGF' voicetype='active' vpos='tam_2' stype='declarative'>
 +28 है VAUX <fs af='है,v,any,sg,3,,है,hE' name='है' posn='280' drel='lwg__vaux:लगाया' chunkType='child:VGF'>
 +29 । SYM <fs af='।,punc,,,,,,' name='।' posn='290' chunkId='BLK' drel='rsym:लगाया' chunkType='head:BLK'>
 +</Sentence></code>
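
Tokens 3 and 5 of this sentence carry a stray quote in ''chunkType'' (''head:'FRAGP''), so a strict attribute parser may misread them. One cheap sanity check, sketched here in Python under the assumption that attribute values never legitimately contain a single quote, is to flag feature structures with an odd number of quotes; the function name is made up for illustration.

```python
def has_unbalanced_quotes(fs):
    """True if the feature structure has an odd number of single quotes."""
    return fs.count("'") % 2 == 1


# Feature structures of tokens 2 and 3 from the sentence above.
good = "<fs af='ने,psp,,,,,,' name='ने' posn='20' drel='lwg__psp:भाजपा' chunkType='child:NP'>"
bad = "<fs name='केंद्र' chunkId='FRAGP' chunkType='head:'FRAGP' drel='ccof:और'>"

print(has_unbalanced_quotes(good))
print(has_unbalanced_quotes(bad))
```

Lines flagged this way are better inspected by hand than repaired automatically.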
  
 ==== Parsing ====
