Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:hi [2012/10/02 16:33]
zeman HPST 2012 sample.
user:zeman:treebanks:hi [2012/12/15 13:15] (current)
zeman
Line 14: Line 14:
     * Shakti Standard Format (SSF; native)
     * CoNLL format
 +    * "Hyderabad DT" is what they call the older treebank with little data. This one is the Hindi treebank from the large NSF-sponsored project.
 +
  
 There has been no official release of the treebank yet. There have been three as-is sample releases for the purposes of the NLP tools contests in parsing Indian languages, attached to the [[http://ltrc.iiit.ac.in/nlptools2009/|ICON 2009]] and [[http://ltrc.iiit.ac.in/nlptools2010/|2010]] conferences and the [[http://ltrc.iiit.ac.in/mtpil2012/|MTPIL]] workshop of [[http://www.coling2012-iitb.org/|COLING 2012]].
Line 67: Line 69:
  
 ^ Part ^ Sentences ^ Chunks ^ Ratio ^ Words ^ Ratio ^
-| Training |   12041 | | | 268093 |  22.27 |
-| Development | 1233 | | |  26416 |  21.42 |
-| Test |             | | |        |        |
-| TOTAL |            | | |        |        |
+| Training |    12041 | | | 268093 |  22.27 |
+| Development |  1233 | | |  26416 |  21.42 |
+| Test |              | | |        |        |
+| TOTAL |             | | |        |        |
  
 ==== Inside ====
Line 673: Line 675:
 | 32 | <nowiki>किया</nowiki> | कर | VM | v | <nowiki>lex-कर|cat-v|gen-m|num-sg|pers-any|case-|vib-या|tam-yA|chunkId-VGF|chunkType-head|stype-declarative'>|voicetype-active</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 | 33 | <nowiki>।</nowiki> | <nowiki>।</nowiki> | SYM | punc | <nowiki>lex-।|cat-punc|gen-|num-|pers-|case-|vib-|tam-|chunkId-BLK|chunkType-head|stype-|voicetype-</nowiki> | 32 | rsym | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
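
The feature column in the rows above packs attribute-value pairs as ''name-value'' items joined by ''|''. A minimal Python sketch of unpacking such a string follows; the function name ''parse_feats'' is hypothetical and not part of any treebank tool. It splits each item on the first hyphen only, so empty values (''case-'') are kept as empty strings and malformed values such as the stray ''%%'>%%'' after ''declarative'' in row 32 are passed through untouched rather than repaired.

```python
def parse_feats(feats):
    """Split a FEATS string such as 'lex-कर|cat-v|gen-m' into a dict.

    Hypothetical helper for illustration only.
    """
    result = {}
    for item in feats.split("|"):
        if not item:
            continue
        # Split on the first hyphen only; the value may be empty ('case-')
        # or may itself contain further hyphens.
        name, _, value = item.partition("-")
        result[name] = value
    return result


# Feature string of token 32 (किया), copied from the table above.
feats = parse_feats(
    "lex-कर|cat-v|gen-m|num-sg|pers-any|case-|vib-या|tam-yA|"
    "chunkId-VGF|chunkType-head|stype-declarative'>|voicetype-active"
)
for name, value in feats.items():
    print(name, repr(value))
```
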
 +
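 +The same sentence in the “automatically tagged” morphology version. Apparently this means no morphology at all: the lemma, fine-grained tag and feature columns are empty and only the coarse POS tag remains, so the contestants are presumably expected to run their own taggers.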
 +
 +| 1 | <nowiki>गुजरात</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 3 | r6 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | <nowiki>के</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 1 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | <nowiki>मुख्यमंत्री</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 5 | nmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | <nowiki>नरेंद्र</nowiki> | <nowiki>_</nowiki> | NNPC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 5 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | <nowiki>मोदी</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | <nowiki>ने</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 5 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | <nowiki>मंगलवार</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k7t | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | <nowiki>को</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 7 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | <nowiki>गृह</nowiki> | <nowiki>_</nowiki> | NNPC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 10 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 10 | <nowiki>मंत्री</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 12 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 11 | <nowiki>शिवराज</nowiki> | <nowiki>_</nowiki> | NNPC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 12 | <nowiki>pof__cn</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 12 | <nowiki>पाटिल</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k4 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 13 | <nowiki>से</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 12 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 14 | <nowiki>मुलाकात</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 15 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 15 | कर | <nowiki>_</nowiki> | VM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | vmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 16 | आईएएस | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 17 | ccof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 17 | और | <nowiki>_</nowiki> | CC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 20 | r6 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 18 | <nowiki>आईपीएस</nowiki> | <nowiki>_</nowiki> | NNP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 17 | ccof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 19 | <nowiki>की</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 18 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 20 | <nowiki>तर्ज</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k7 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 21 | पर | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 20 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 22 | <nowiki>राष्ट्रीय</nowiki> | <nowiki>_</nowiki> | JJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 23 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 23 | <nowiki>स्तर</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | k7 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 24 | पर | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 23 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 25 | एक | <nowiki>_</nowiki> | QC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 27 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 26 | <nowiki>खुफिया</nowiki> | <nowiki>_</nowiki> | NNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 27 | <nowiki>nmod__adj</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 27 | <nowiki>सेवा</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 29 | k2 | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 28 | <nowiki>शुरू</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 29 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 29 | <nowiki>करने</nowiki> | <nowiki>_</nowiki> | VM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 31 | <nowiki>r6-k2</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 30 | <nowiki>का</nowiki> | <nowiki>_</nowiki> | PSP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 29 | <nowiki>lwg__psp</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 31 | <nowiki>अनुरोध</nowiki> | <nowiki>_</nowiki> | NN | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 32 | <nowiki>किया</nowiki> | <nowiki>_</nowiki> | VM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 33 | <nowiki>।</nowiki> | <nowiki>_</nowiki> | SYM | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 32 | rsym | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
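
Whether a given file carries any morphology can be checked mechanically. The following is a minimal Python sketch, assuming the data files use the usual tab-separated ten-column CoNLL-X layout (the wiki tables above only render those columns). The column labels and the function names ''parse_conll_line'' and ''has_morphology'' are illustrative and not part of the release.

```python
# Column labels follow the CoNLL-X convention; this is an assumption
# about the file layout, the wiki tables do not name the columns.
CONLL_COLUMNS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def parse_conll_line(line):
    """Turn one tab-separated token line into a dict keyed by column label."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONLL_COLUMNS):
        raise ValueError("expected 10 columns, got %d" % len(fields))
    return dict(zip(CONLL_COLUMNS, fields))


def has_morphology(tokens):
    """True if any token has a non-empty lemma or feature column."""
    return any(t["lemma"] != "_" or t["feats"] != "_" for t in tokens)


# First two tokens of the "automatically tagged" sentence shown above.
sample = (
    "1\tगुजरात\t_\tNNP\t_\t_\t3\tr6\t_\t_\n"
    "2\tके\t_\tPSP\t_\t_\t1\tlwg__psp\t_\t_\n"
)
tokens = [parse_conll_line(line) for line in sample.splitlines()]
print(has_morphology(tokens))
```
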
 +
 +The first sentence of the development data in the UTF-8 SSF format with gold-standard morphology:
 +
 +<code xml><Sentence id='1'>
 +1 भाजपा NNP <fs af='भाजपा,n,f,sg,3,o,0_ने,0' name='भाजपा' posn='10' chunkId='NP' drel='k1:लगाया' vpos='vib_2' chunkType='head:NP'>
 +2 ने PSP <fs af='ने,psp,,,,,,' name='ने' posn='20' drel='lwg__psp:भाजपा' chunkType='child:NP'>
 +3 केंद्र NNPC <fs name='केंद्र' chunkId='FRAGP' chunkType='head:'FRAGP' drel='ccof:और'>
 +4 और CC <fs af='और,avy,,,,,,' name='और' posn='40' chunkId='CCP' drel='nmod:सरकार' chunkType='head:CCP'>
 +5 केरल NNPC <fs name='केरल' chunkId='FRAGP2' chunkType='head:'FRAGP2' drel='ccof:और'>
 +6 सरकार NNP <fs af='सरकार,n,f,sg,3,o,0_पर,0' name='सरकार' posn='60' chunkId='NP2' drel='k7:लगाया' vpos='vib_2' chunkType='head:NP2'>
 +7 पर PSP <fs af='पर,psp,,,,,,' name='पर' posn='70' drel='lwg__psp:सरकार' chunkType='child:NP2'>
 +8 भारतीय JJ <fs af='भारतीय,adj,any,any,,o,,' name='भारतीय' posn='80' drel='nmod__adj:ड्राइवर' chunkType='child:NP3'>
 +9 ड्राइवर NN <fs af='ड्राइवर,n,m,sg,3,o,0,0' name='ड्राइवर' posn='90' chunkId='NP3' drel='nmod:कुट्टी' chunkType='head:NP3'>
 +10 एम. NNPC <fs af='एम.,n,m,sg,3,d,0,0' name='एम.' posn='100' drel='pof__cn:कुट्टी' chunkType='child:NP4'>
 +11 आर. NNPC <fs af='आर.,n,m,sg,3,d,0,0' name='आर.' posn='110' drel='pof__cn:कुट्टी' chunkType='child:NP4'>
 +12 कुट्टी NNP <fs af='कुट्टी,n,m,sg,3,o,0_का,0' name='कुट्टी' posn='120' chunkId='NP4' drel='r6:हत्या' vpos='vib_4' chunkType='head:NP4'>
 +13 की PSP <fs af='का,psp,f,sg,,o,,' name='की' posn='130' drel='lwg__psp:कुट्टी' chunkType='child:NP4'>
 +14 हत्या NN <fs af='हत्या,n,f,sg,3,o,0_के_लिए,0' name='हत्या' posn='140' chunkId='NP5' drel='jjmod:जिम्मेदार' vpos='vib_2_3' chunkType='head:NP5'>
 +15 के PSP <fs af='के,psp,,,,,,' name='के' posn='150' drel='lwg__psp:हत्या' chunkType='child:NP5'>
 +16 लिए PSP <fs af='लिए,psp,,,,,,' name='लिए' posn='160' drel='lwg__cont:हत्या' chunkType='child:NP5'>
 +17 जिम्मेदार JJ <fs af='जिम्मेदार,adj,any,any,,o,,' name='जिम्मेदार' posn='170' chunkId='JJP' drel='nmod:तालिबान' chunkType='head:JJP'>
 +18 तालिबान NNP <fs af='तालिबान,n,m,sg,3,o,0_के_साथ,0' name='तालिबान' posn='180' chunkId='NP6' drel='ras-k1:लगाया' vpos='vib_2_3' chunkType='head:NP6'>
 +19 के PSP <fs af='के,psp,,,,,,' name='के2' posn='190' drel='lwg__psp:तालिबान' chunkType='child:NP6'>
 +20 साथ NST <fs af='साथ,nst,m,sg,3,d,,' name='साथ' posn='200' drel='lwg__cont:तालिबान' chunkType='child:NP6'>
 +21 निपटने VM <fs af='निपट,v,any,any,any,o,ना_में,nA' name='निपटने' posn='210' chunkId='VGNN' drel='k7:लगाया' vpos='tam_2' chunkType='head:VGNN'>
 +22 में PSP <fs af='में,psp,,,,,,' name='में' posn='220' drel='lwg__psp:निपटने' chunkType='child:VGNN'>
 +23 ढिलाई NN <fs af='ढिलाई,n,f,sg,3,d,0,0' name='ढिलाई' posn='230' chunkId='NP7' drel='k2:बरतने' chunkType='head:NP7'>
 +24 बरतने VM <fs af='बरत,v,any,sg,any,o,ना_का,nA' name='बरतने' posn='240' chunkId='VGNN2' drel='r6:आरोप' vpos='tam_2' chunkType='head:VGNN2'>
 +25 का PSP <fs af='का,psp,m,sg,,d,,' name='का' posn='250' drel='lwg__psp:बरतने' chunkType='child:VGNN2'>
 +26 आरोप NN <fs af='आरोप,n,m,sg,3,d,0,0' name='आरोप' posn='260' chunkId='NP8' drel='k2:लगाया' chunkType='head:NP8'>
 +27 लगाया VM <fs af='लगा,v,m,sg,3,,या_है,yA' name='लगाया' posn='270' chunkId='VGF' chunkType='head:VGF' voicetype='active' vpos='tam_2' stype='declarative'>
 +28 है VAUX <fs af='है,v,any,sg,3,,है,hE' name='है' posn='280' drel='lwg__vaux:लगाया' chunkType='child:VGF'>
 +29 । SYM <fs af='।,punc,,,,,,' name='।' posn='290' chunkId='BLK' drel='rsym:लगाया' chunkType='head:BLK'>
 +</Sentence></code>
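
Each token line of the SSF sample above consists of an index, the word form, the POS tag and an ''<fs ...>'' feature structure with single-quoted attributes. A rough Python sketch of reading one such line follows. The helper name ''parse_ssf_token'' is hypothetical, and the labels given to the eight comma-separated ''af'' fields are an assumption inferred from the feature names in the CoNLL version (''lex'', ''cat'', ''gen'', ''num'', ''pers'', ''case'', ''vib'', ''tam''), not taken from the SSF documentation.

```python
import re

# attribute='value' pairs inside <fs ...>; values are single-quoted.
ATTR_RE = re.compile(r"(\w+)='([^']*)'")

# Assumed labels for the comma-separated fields of the af attribute.
AF_FIELDS = ("lex", "cat", "gen", "num", "pers", "case", "vib", "tam")


def parse_ssf_token(line):
    """Parse one SSF token line into index, form, tag and attributes."""
    index, form, tag, fs = line.strip().split(None, 3)
    attrs = dict(ATTR_RE.findall(fs))

    af = None
    if "af" in attrs:
        af = dict(zip(AF_FIELDS, attrs["af"].split(",")))

    # drel looks like 'relation:name-of-head'; the root token has none.
    drel = None
    if "drel" in attrs:
        relation, _, head_name = attrs["drel"].partition(":")
        drel = (relation, head_name)

    return {"index": index, "form": form, "tag": tag,
            "attrs": attrs, "af": af, "drel": drel}


# Token 1 of the sentence above.
token = parse_ssf_token(
    "1 भाजपा NNP <fs af='भाजपा,n,f,sg,3,o,0_ने,0' name='भाजपा' posn='10' "
    "chunkId='NP' drel='k1:लगाया' vpos='vib_2' chunkType='head:NP'>"
)
print(token["tag"], token["drel"], token["af"])
```

The pattern is deliberately simple. On the two tokens whose ''chunkType'' value contains a stray quote (tokens 3 and 5 above) it would return the truncated value ''head:'', so real data would need more defensive handling.
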
  
 ==== Parsing ====
