user:zeman:treebanks:te [2012/03/22 16:47] zeman ICON 2010 Telugu data size. |
user:zeman:treebanks:te [2012/03/22 18:18] (current) zeman Parsing. |
==== Inside ==== | ==== Inside ==== |
| |
The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Bengali in this case), we can map the WX encoding to the original characters in UTF-8. WX uses English letters, so any embedded English (or other string using Latin letters) will probably get lost during the conversion. | The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Telugu in this case), we can map the WX encoding to the original characters in UTF-8. WX uses English letters, so any embedded English (or other string using Latin letters) will probably get lost during the conversion. |
| |
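The mapping from WX to the Telugu script can be sketched roughly as follows. This is only an illustration, not the converter that produced the tables on this page: the function name ''wx_to_telugu'' is hypothetical, the letter table is our reading of the WX scheme (with ''eV''/''oV'' as the short vowels and ''lY'' as the retroflex l), and digits, punctuation and embedded Latin strings get no special treatment.

```python
VIRAMA = "\u0c4d"  # suppresses the inherent "a" of a consonant

# WX consonant letters paired with consecutive Telugu code points (assumed mapping).
CONSONANTS = dict(zip("k K g G f c C j J F t T d D N w W x X n".split(),
                      map(chr, range(0x0C15, 0x0C29))))
CONSONANTS.update(zip("p P b B m y r".split(), map(chr, range(0x0C2A, 0x0C31))))
CONSONANTS.update({"l": "\u0c32", "lY": "\u0c33", "v": "\u0c35",
                   "S": "\u0c36", "R": "\u0c37", "s": "\u0c38", "h": "\u0c39"})

# WX vowel -> (independent letter, dependent vowel sign); "a" is inherent.
VOWELS = {
    "a": ("\u0c05", ""), "A": ("\u0c06", "\u0c3e"),
    "i": ("\u0c07", "\u0c3f"), "I": ("\u0c08", "\u0c40"),
    "u": ("\u0c09", "\u0c41"), "U": ("\u0c0a", "\u0c42"),
    "q": ("\u0c0b", "\u0c43"),
    "eV": ("\u0c0e", "\u0c46"), "e": ("\u0c0f", "\u0c47"), "E": ("\u0c10", "\u0c48"),
    "oV": ("\u0c12", "\u0c4a"), "o": ("\u0c13", "\u0c4b"), "O": ("\u0c14", "\u0c4c"),
}
MODIFIERS = {"M": "\u0c02", "H": "\u0c03"}  # anusvara, visarga
TWO_CHAR = {"eV", "oV", "lY"}


def wx_to_telugu(word):
    """Convert one WX-encoded word to Telugu script (sketch, see assumptions above)."""
    out = []
    pending = False  # True while the last consonant has not received a vowel yet
    i = 0
    while i < len(word):
        tok = word[i:i + 2] if word[i:i + 2] in TWO_CHAR else word[i]
        i += len(tok)
        if tok in VOWELS:
            independent, sign = VOWELS[tok]
            out.append(sign if pending else independent)
            pending = False
            continue
        if pending:
            out.append(VIRAMA)  # consonant cluster or vowelless consonant
        if tok in CONSONANTS:
            out.append(CONSONANTS[tok])
            pending = True
        else:
            out.append(MODIFIERS.get(tok, tok))  # unknown characters pass through
            pending = False
    if pending:
        out.append(VIRAMA)
    return "".join(out)


print(wx_to_telugu("saMgawi"), wx_to_telugu("mIku"), wx_to_telugu("peVttukoVnexi"))
```

Note that this sketch maps a plain ''o'' to the long vowel, whereas the converted tables on this page show a short vowel sign for it (for example in the lemma of ''peVttukoVnexi''), so the two will disagree on such words.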
The CoNLL format contains only the chunk heads. The native SSF format shows the other words in the chunk, too, but it does not capture intra-chunk dependency relations. This is an example of a multi-word chunk: | The CoNLL format contains only the chunk heads. The native SSF format shows the other words in the chunk, too, but it does not capture intra-chunk dependency relations. This is an example of a multi-word chunk: |
| |
<code>3 (( NP <fs af='rumAla,n,,sg,,d,0,0' head="rumAla" drel=k2:VGF name=NP3> | <code>3 (( NP <fs af='AdavAlYlu,n,,sg,,,0,0_e' head='AdavAlYle' pbank='ARG3' name='NP3'> |
3.1 ekatA QC <fs af='eka,num,,,,,,'> | 3.1 932 QC <fs af='932,num,,,,,,'> |
3.2 ledisa JJ <fs af='ledisa,unk,,,,,,'> | 3.2 maMxi CL <fs af='maMxi,n,,pl,,d,0,0'> |
3.3 rumAla NN <fs af='rumAla,n,,sg,,d,0,0' name="rumAla"> | 3.3 AdavAlYle NN <fs af='AdavAlYlu,n,,sg,,,0,0_e' name='AdavAlYle'> |
))</code> | ))</code> |
| |
According to [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/toolscontest10-workshoppaper-final.pdf|(Husain et al., 2010)]], in the ICON 2010 version, the chunk tags, POS tags and inter-chunk dependencies (topology + tags) were annotated manually. The rest (lemma, morphosyntactic features, headword of chunk) was marked automatically. | According to [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/toolscontest10-workshoppaper-final.pdf|(Husain et al., 2010)]], in the ICON 2010 version, the chunk tags, POS tags and inter-chunk dependencies (topology + tags) were annotated manually. The rest (lemma, morphosyntactic features, headword of chunk) was marked automatically. |
| |
Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Bengali part. | Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Telugu part. |
| |
==== Sample ==== | ==== Sample ==== |
The first sentence of the ICON 2010 training data (with fine-grained syntactic tags) in the Shakti format: | The first sentence of the ICON 2010 training data (with fine-grained syntactic tags) in the Shakti format: |
| |
<code xml><document id=""> | <code xml><document id=""> |
<head> | <head> |
<annotated-resource name="HyDT-Bangla" version="0.5" type="dep-interchunk-only" layers="morph,pos,chunk,dep-interchunk-only" language="ben" date-of-release="20100831"> | <annotated-resource name="HyDT-Telugu" version="0.5" type="dep-interchunk-only" layers="morph,pos,chunk,dep-interchunk-only" language="tel" date-of-release="20100831"> |
<annotation-standard> | <annotation-standard> |
<morph-standard name="Anncorra-morph" version="1.31" date="20080920" /> | <morph-standard name="Anncorra-morph" version="1.31" date="20080920" /> |
<dependency-standard name="Anncorra-dep" version="2.0" date="" dep-tagset-granularity="6" /> | <dependency-standard name="Anncorra-dep" version="2.0" date="" dep-tagset-granularity="6" /> |
</annotation-standard> | </annotation-standard> |
</annotated-resource> | </annotated-resource> |
</head> | </head> |
<Sentence id="1"> | <Sentence id="1"> |
1 (( NP <fs af='Age,adv,,,,,,' head="Agei" drel=k7t:VGF name=NP> | 1 (( NP <fs af='saMgawi,n,,sg,,d,0,0' head='saMgawi' drel='k1:VGF'> |
1.1 mudZira NN <fs af='mudZi,n,,sg,,o,era,era'> | 1.1 maro QF <fs af='maro,avy,,,,,,'> |
1.2 Agei NST <fs af='Age,adv,,,,,,' name="Agei"> | 1.2 saMgawi NN <fs af='saMgawi,n,,sg,,d,0,0' name='saMgawi'> |
)) | )) |
2 (( NP <fs af='cA,n,,sg,,d,0,0' head="cA" drel=k1:VGF name=NP2> | 2 (( NP <fs af='mIru,pn,any,pl,2,,ki,ki' head='mIku' drel='k4:VGF' name='NP2'> |
2.1 praWama QO <fs af='praWama,num,,,,,,'> | 2.1 mIku PRP <fs af='mIru,pn,any,pl,2,,ki,ki' name='mIku'> |
2.2 kApa NN <fs af='kApa,unk,,,,,,'> | )) |
2.3 cA NN <fs af='cA,n,,sg,,d,0,0' name="cA"> | 3 (( VGF <fs af='weVlusA,avy,,,,,0,0_avy' head='weVlusA' name='VGF'> |
)) | 3.1 weVlusA VM <fs af='weVlusA,avy,,,,,0,0_avy' name='weVlusA'> |
3 (( VGF <fs af='As,v,,,5,,A_yA+Ce,A' head="ese" name=VGF> | 3.2 ? SYM <fs af='?,punc,,,,,,'> |
3.1 ese VM <fs af='As,v,,,7,,A,A' name="ese"> | )) |
3.2 . SYM <fs af='.,punc,,,,,,'> | |
)) | |
</Sentence></code> | </Sentence></code> |
| |
And in the CoNLL format: | And in the CoNLL format: |
| |
| 1 | Agei | Age | NP | NST | lex-Age<nowiki>|</nowiki>cat-adv<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-Agei<nowiki>|</nowiki>name-NP | 3 | k7t | _ | _ | | | 1 | saMgawi | saMgawi | NP | NN | <nowiki>lex-saMgawi|cat-n|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-saMgawi</nowiki> | 3 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | cA | cA | NP | NN | lex-cA<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-cA<nowiki>|</nowiki>name-NP2 | 3 | k1 | _ | _ | | | 2 | mIku | mIru | NP | PRP | <nowiki>lex-mIru|cat-pn|gend-any|num-pl|pers-2|case-|vib-ki|tam-ki|head-mIku|name-NP2</nowiki> | 3 | k4 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | ese | As | VGF | VM | lex-As<nowiki>|</nowiki>cat-v<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-A_yA+Ce<nowiki>|</nowiki>tam-A<nowiki>|</nowiki>head-ese<nowiki>|</nowiki>name-VGF | 0 | main | _ | _ | | | 3 | weVlusA | weVlusA | VGF | VM | <nowiki>lex-weVlusA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-weVlusA|name-VGF</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |
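Judging from the samples on this page, the features column of the CoNLL rows seems to be assembled from the attributes of the chunk-level ''fs'' structure: the eight comma-separated values of ''af'' are labeled lex, cat, gend, num, pers, case, vib and tam, and the remaining attributes (except ''drel'', which goes to the head and relation columns) are appended. A minimal sketch of that step, with the hypothetical helper name ''chunk_feats'' and assuming the attributes have already been extracted from the SSF line:

```python
# Labels as they appear in the CoNLL rows on this page.
AF_FIELDS = ("lex", "cat", "gend", "num", "pers", "case", "vib", "tam")


def chunk_feats(af=None, **other_attrs):
    """Build the CoNLL features string from an af value and other fs attributes."""
    pairs = []
    if af is not None:
        values = af.split(",")
        # Pad with empty strings if fewer than eight values are present;
        # zip() silently ignores any values beyond the eighth.
        values += [""] * (len(AF_FIELDS) - len(values))
        pairs.extend(zip(AF_FIELDS, values))
    # Remaining attributes (head, name, ...) keep the order in which they are given.
    pairs.extend(other_attrs.items())
    return "|".join(f"{key}-{value}" for key, value in pairs)


print(chunk_feats("mIru,pn,any,pl,2,,ki,ki", head="mIku", name="NP2"))
```

A chunk without an ''af'' attribute, such as an inserted NULL verb group, would yield only the remaining attributes, e.g. ''chunk_feats(name="NULL_VGF")''.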
And after conversion of the WX encoding to the Bengali script in UTF-8: | And after conversion of the WX encoding to the Telugu script in UTF-8: |
| |
| 1 | আগেই | আগে | NP | NST | lex-Age<nowiki>|</nowiki>cat-adv<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-Agei<nowiki>|</nowiki>name-NP | 3 | k7t | _ | _ | | | 1 | <nowiki>సంగతి</nowiki> | <nowiki>సంగతి</nowiki> | NP | NN | <nowiki>lex-saMgawi|cat-n|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-saMgawi</nowiki> | 3 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | চা | চা | NP | NN | lex-cA<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-cA<nowiki>|</nowiki>name-NP2 | 3 | k1 | _ | _ | | | 2 | <nowiki>మీకు</nowiki> | <nowiki>మీరు</nowiki> | NP | PRP | <nowiki>lex-mIru|cat-pn|gend-any|num-pl|pers-2|case-|vib-ki|tam-ki|head-mIku|name-NP2</nowiki> | 3 | k4 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | এসে | আস্ | VGF | VM | lex-As<nowiki>|</nowiki>cat-v<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-A_yA+Ce<nowiki>|</nowiki>tam-A<nowiki>|</nowiki>head-ese<nowiki>|</nowiki>name-VGF | 0 | main | _ | _ | | | 3 | <nowiki>తెలుసా</nowiki> | <nowiki>తెలుసా</nowiki> | VGF | VM | <nowiki>lex-weVlusA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-weVlusA|name-VGF</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |
The first sentence of the ICON 2010 development data (with fine-grained syntactic tags) in the Shakti format: | The first sentence of the ICON 2010 development data (with fine-grained syntactic tags) in the Shakti format: |
<code xml><document id=""> | <code xml><document id=""> |
<head> | <head> |
<annotated-resource name="HyDT-Bangla" version="0.5" type="dep-interchunk-only" layers="morph,pos,chunk,dep-interchunk-only" language="ben" date-of-release="20100831"> | <annotated-resource name="HyDT-Telugu" version="0.5" type="dep-interchunk-only" layers="morph,pos,chunk,dep-interchunk-only" language="tel" date-of-release="20100831"> |
<annotation-standard> | <annotation-standard> |
<morph-standard name="Anncorra-morph" version="1.31" date="20080920" /> | <morph-standard name="Anncorra-morph" version="1.31" date="20080920" /> |
</head> | </head> |
<Sentence id="1"> | <Sentence id="1"> |
1 (( NP <fs af='parabarwIkAle,adv,,,,,,' head="parabarwIkAle" drel=k7t:VGF name=NP> | 1 (( RBP <fs af='eVMwa,pn,,sg,,d,0,0' head='eVMwa' drel='adv:NP'> |
1.1 parabarwIkAle NN <fs af='parabarwIkAle,adv,,,,,,' name="parabarwIkAle"> | 1.1 eVMwa WQ <fs af='eVMwa,pn,,sg,,d,0,0' name='eVMwa'> |
)) | )) |
2 (( NP <fs af='aPisa-biyArAraxera,unk,,,,,,' head="aPisa-biyArAraxera" drel=r6:NP3 name=NP2> | 2 (( NP <fs af='bAXEnA,unk,,,,,,' head='bAXEnA' drel='k2s:VGNF' name='NP'> |
2.1 aPisa-biyArAraxera NN <fs af='aPisa-biyArAraxera,unk,,,,,,' name="aPisa-biyArAraxera"> | 2.1 bAXEnA NN <fs af='bAXEnA,unk,,,,,,' name='bAXEnA'> |
)) | )) |
3 (( NP <fs af='nAma,n,,sg,,d,0,0' head="nAma" drel=k2:VGNN name=NP3> | 3 (( NP <fs af='ixi,pn,fn,sg,3,o,ti,ti' head='xIni' drel='k2:VGNF' name='NP2'> |
3.1 nAma NN <fs af='nAma,n,,sg,,d,0,0' name="nAma"> | 3.1 xIni PRP <fs af='ixi,pn,fn,sg,3,o,ti,ti' name='xIni'> |
)) | )) |
4 (( NP <fs af='GoRaNA,unk,,,,,,' head="GoRaNA" drel=pof:VGNN name=NP4> | 4 (( RBP <fs af='eVlA,avy,,,,,0,0_avy' head='eVlA' drel='adv:VGNF' name='RBP2'> |
4.1 GoRaNA NN <fs af='GoRaNA,unk,,,,,,' name="GoRaNA"> | 4.1 eVlA WQ <fs af='eVlA,avy,,,,,0,0_avy' name='eVlA'> |
)) | )) |
5 (( VGNN <fs af='kar,n,,,any,,,' head="karAra" drel=r6:NP5 name=VGNN> | 5 (( NP <fs af='bayata,n,,sg,,d,0,0' head='bayata' drel='pof:VGNF' name='NP3'> |
5.1 karAra VM <fs af='kar,n,,,any,,,' name="karAra"> | 5.1 bayata NST <fs af='bayata,n,,sg,,d,0,0' name='bayata'> |
)) | )) |
6 (( NP <fs af='samay,unk,,,,,,' head="samay" drel=k7t:VGF name=NP5> | 6 (( VGNF <fs af='peVttuko,pn,,sg,,,e_axi,e_axi_0' head='peVttukoVnexi' drel='k1s:VGNN' name='VGNF'> |
6.1 samay NN <fs af='samay,unk,,,,,,' name="samay"> | 6.1 peVttukoVnexi VM <fs af='peVttuko,pn,,sg,,,e_axi,e_axi_0' name='peVttukoVnexi'> |
)) | )) |
7 (( NP <fs af='animeRake,unk,,,,,,' head="animeRake" drel=k2:VGF name=NP6> | 7 (( RBP <fs af='sarigA,avy,,,,,0,0_avy' head='sarigA' drel='adv:VGNN' name='RBP3'> |
7.1 animeRake NNP <fs af='animeRake,unk,,,,,,' name="animeRake"> | 7.1 sarigA RB <fs af='sarigA,avy,,,,,0,0_avy' name='sarigA'> |
)) | )) |
8 (( VGF <fs af='sariye,unk,,,5,,0_rAKA+ka_ha+la,' head="sariye" name=VGF> | 8 (( NP <fs af='viRayaM,n,,sg,,d,0,0' head='viRayaM' drel='k1:VGNN' name='NP4'> |
8.1 sariye VM <fs af='sariye,unk,,,,,,' name="sariye"> | 8.1 viRayaM NN <fs af='viRayaM,n,,sg,,d,0,0' name='viRayaM'> |
8.2 . SYM <fs af='.,punc,,,,,,'> | )) |
)) | 9 (( VGNN <fs af='weVliyu,v,any,any,any,,aka_po_adaM,aka_po_adaM' head='weVliyakapovadaM' name='VGNN'> |
| 9.1 weVliyakapovadaM VM <fs af='weVliyu,v,any,any,any,,aka_po_adaM,aka_po_adaM' name='weVliyakapovadaM'> |
| 9.2 . SYM <fs af='.,punc,,,,,,'> |
| )) |
</Sentence></code> | </Sentence></code> |
| |
And in the CoNLL format: | And in the CoNLL format: |
| |
| 1 | parabarwIkAle | parabarwIkAle | NP | NN | lex-parabarwIkAle<nowiki>|</nowiki>cat-adv<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-parabarwIkAle<nowiki>|</nowiki>name-NP | 8 | k7t | _ | _ | | | 1 | eVMwa | eVMwa | RBP | WQ | <nowiki>lex-eVMwa|cat-pn|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-eVMwa</nowiki> | 2 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | aPisa-biyArAraxera | aPisa-biyArAraxera | NP | NN | lex-aPisa-biyArAraxera<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-aPisa-biyArAraxera<nowiki>|</nowiki>name-NP2 | 3 | r6 | _ | _ | | | 2 | bAXEnA | bAXEnA | NP | NN | <nowiki>lex-bAXEnA|cat-unk|gend-|num-|pers-|case-|vib-|tam-|head-bAXEnA|name-NP</nowiki> | 6 | k2s | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | nAma | nAma | NP | NN | lex-nAma<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-nAma<nowiki>|</nowiki>name-NP3 | 5 | k2 | _ | _ | | | 3 | xIni | ixi | NP | PRP | <nowiki>lex-ixi|cat-pn|gend-fn|num-sg|pers-3|case-o|vib-ti|tam-ti|head-xIni|name-NP2</nowiki> | 6 | k2 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | GoRaNA | GoRaNA | NP | NN | lex-GoRaNA<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-GoRaNA<nowiki>|</nowiki>name-NP4 | 5 | pof | _ | _ | | | 4 | eVlA | eVlA | RBP | WQ | <nowiki>lex-eVlA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-eVlA|name-RBP2</nowiki> | 6 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | karAra | kar | VGNN | VM | lex-kar<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-any<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-karAra<nowiki>|</nowiki>name-VGNN | 6 | r6 | _ | _ | | | 5 | bayata | bayata | NP | NST | <nowiki>lex-bayata|cat-n|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-bayata|name-NP3</nowiki> | 6 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | samay | samay | NP | NN | lex-samay<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-samay<nowiki>|</nowiki>name-NP5 | 8 | k7t | _ | _ | | | 6 | peVttukoVnexi | peVttuko | VGNF | VM | <nowiki>lex-peVttuko|cat-pn|gend-|num-sg|pers-|case-|vib-e_axi|tam-e_axi_0|head-peVttukoVnexi|name-VGNF</nowiki> | 9 | k1s | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | animeRake | animeRake | NP | NNP | lex-animeRake<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-animeRake<nowiki>|</nowiki>name-NP6 | 8 | k2 | _ | _ | | | 7 | sarigA | sarigA | RBP | RB | <nowiki>lex-sarigA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-sarigA|name-RBP3</nowiki> | 9 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | sariye | sariye | VGF | VM | lex-sariye<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-0_rAKA+ka_ha+la<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-sariye<nowiki>|</nowiki>name-VGF | 0 | main | _ | _ | | | 8 | viRayaM | viRayaM | NP | NN | <nowiki>lex-viRayaM|cat-n|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-viRayaM|name-NP4</nowiki> | 9 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 9 | weVliyakapovadaM | weVliyu | VGNN | VM | <nowiki>lex-weVliyu|cat-v|gend-any|num-any|pers-any|case-|vib-aka_po_adaM|tam-aka_po_adaM|head-weVliyakapovadaM|name-VGNN</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |
And after conversion of the WX encoding to the Bengali script in UTF-8: | And after conversion of the WX encoding to the Telugu script in UTF-8: |
| |
| 1 | পরবর্তীকালে | পরবর্তীকালে | NP | NN | lex-parabarwIkAle<nowiki>|</nowiki>cat-adv<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-parabarwIkAle<nowiki>|</nowiki>name-NP | 8 | k7t | _ | _ | | | 1 | <nowiki>ఎంత</nowiki> | <nowiki>ఎంత</nowiki> | RBP | WQ | <nowiki>lex-eVMwa|cat-pn|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-eVMwa</nowiki> | 2 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | অফিস-বিযারারদের | অফিস-বিযারারদের | NP | NN | lex-aPisa-biyArAraxera<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-aPisa-biyArAraxera<nowiki>|</nowiki>name-NP2 | 3 | r6 | _ | _ | | | 2 | <nowiki>బాధైనా</nowiki> | <nowiki>బాధైనా</nowiki> | NP | NN | <nowiki>lex-bAXEnA|cat-unk|gend-|num-|pers-|case-|vib-|tam-|head-bAXEnA|name-NP</nowiki> | 6 | k2s | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | নাম | নাম | NP | NN | lex-nAma<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-nAma<nowiki>|</nowiki>name-NP3 | 5 | k2 | _ | _ | | | 3 | <nowiki>దీని</nowiki> | <nowiki>ఇది</nowiki> | NP | PRP | <nowiki>lex-ixi|cat-pn|gend-fn|num-sg|pers-3|case-o|vib-ti|tam-ti|head-xIni|name-NP2</nowiki> | 6 | k2 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | ঘোষণা | ঘোষণা | NP | NN | lex-GoRaNA<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-GoRaNA<nowiki>|</nowiki>name-NP4 | 5 | pof | _ | _ | | | 4 | <nowiki>ఎలా</nowiki> | <nowiki>ఎలా</nowiki> | RBP | WQ | <nowiki>lex-eVlA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-eVlA|name-RBP2</nowiki> | 6 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | করার | কর্ | VGNN | VM | lex-kar<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-any<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-karAra<nowiki>|</nowiki>name-VGNN | 6 | r6 | _ | _ | | | 5 | బయట | బయట | NP | NST | <nowiki>lex-bayata|cat-n|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-bayata|name-NP3</nowiki> | 6 | pof | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | সময্ | সময্ | NP | NN | lex-samay<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-samay<nowiki>|</nowiki>name-NP5 | 8 | k7t | _ | _ | | | 6 | <nowiki>పెట్టుకొనేది</nowiki> | <nowiki>పెట్టుకొ</nowiki> | VGNF | VM | <nowiki>lex-peVttuko|cat-pn|gend-|num-sg|pers-|case-|vib-e_axi|tam-e_axi_0|head-peVttukoVnexi|name-VGNF</nowiki> | 9 | k1s | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | অনিমেষকে | অনিমেষকে | NP | NNP | lex-animeRake<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-animeRake<nowiki>|</nowiki>name-NP6 | 8 | k2 | _ | _ | | | 7 | <nowiki>సరిగా</nowiki> | <nowiki>సరిగా</nowiki> | RBP | RB | <nowiki>lex-sarigA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-sarigA|name-RBP3</nowiki> | 9 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | সরিযে | সরিযে | VGF | VM | lex-sariye<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-0_rAKA+ka_ha+la<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-sariye<nowiki>|</nowiki>name-VGF | 0 | main | _ | _ | | | 8 | <nowiki>విషయం</nowiki> | <nowiki>విషయం</nowiki> | NP | NN | <nowiki>lex-viRayaM|cat-n|gend-|num-sg|pers-|case-d|vib-0|tam-0|head-viRayaM|name-NP4</nowiki> | 9 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 9 | <nowiki>తెలియకపొవడం</nowiki> | <nowiki>తెలియు</nowiki> | VGNN | VM | <nowiki>lex-weVliyu|cat-v|gend-any|num-any|pers-any|case-|vib-aka_po_adaM|tam-aka_po_adaM|head-weVliyakapovadaM|name-VGNN</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |
The first sentence of the ICON 2010 test data (with fine-grained syntactic tags) in the Shakti format: | The first sentence of the ICON 2010 test data (with fine-grained syntactic tags) in the Shakti format: |
<code xml><document id=""> | <code xml><document id=""> |
<head> | <head> |
<annotated-resource name="HyDT-Bangla" version="0.5" type="dep-interchunk-only" layers="morph,pos,chunk,dep-interchunk-only" language="ben" date-of-release="20101013"> | <annotated-resource name="HyDT-Telugu" version="0.5" type="dep-interchunk-only" layers="morph,pos,chunk,dep-interchunk-only" language="tel" date-of-release="20101013"> |
<annotation-standard> | <annotation-standard> |
<morph-standard name="Anncorra-morph" version="1.31" date="20080920" /> | <morph-standard name="Anncorra-morph" version="1.31" date="20080920" /> |
<pos-standard name="Anncorra-pos" version="" date="20061215" /> | <pos-standard name="Anncorra-pos" version="" date="20061215" /> |
<chunk-standard name="Anncorra-chunk" version="" date="20061215" /> | <chunk-standard name="Anncorra-chunk" version="" date="20061215" /> |
<dependency-standard name="Anncorra-dep" version="2.0" date="" dep-tagset-granularity="6" /> | <dependency-standard name="Anncorra-dep" version="2.0" date="" dep-tagset-granularity="6" /> |
</annotation-standard> | </annotation-standard> |
<annotated-resource> | </annotated-resource> |
</head> | </head> |
<Sentence id="1"> | <Sentence id="29"> |
1 (( NP <fs af='mAXabIlawA,n,,sg,,d,0,0' head="mAXabIlawA" drel=k1:VGF name=NP> | 1 (( NP <fs af='iMkA,avy,,,,,0,0_avy' head="iMkA" drel=vmod:NULL_VGF name=NP poslcat="NM"> |
1.1 mAXabIlawA NNP <fs af='mAXabIlawA,n,,sg,,d,0,0' name="mAXabIlawA"> | 1.1 iMkA PRP <fs af='iMkA,avy,,,,,0,0_avy' poslcat="NM" name="iMkA"> |
)) | )) |
2 (( NP <fs af='waKana,pn,,,,d,0,0' head="waKana" drel=k7t:VGF name=NP2> | 2 (( RBP <fs af='warawarAlugA,avy,,,,,0,0_avy' head="warawarAlugA" drel=adv:VGNF name=RBP poslcat="NM"> |
2.1 waKana PRP <fs af='waKana,pn,,,,d,0,0' name="waKana"> | 2.1 warawarAlugA RB <fs af='warawarAlugA,avy,,,,,0,0_avy' poslcat="NM" name="warawarAlugA"> |
)) | )) |
3 (( NP <fs af='hAwa,n,,sg,,o,era,era' head="hAwera" drel=r6:NP4 name=NP3> | 3 (( VGNF <fs af='nAtuko,v,any,any,any,,i_po_ina,i_po_ina' head="nAtukupoyina" drel=nmod:NP2 name=VGNF> |
3.1 hAwera NN <fs af='hAwa,n,,sg,,o,era,era' name="hAwera"> | 3.1 nAtukupoyina VM <fs af='nAtuko,v,any,any,any,,i_po_ina,i_po_ina' name="nAtukupoyina"> |
)) | )) |
4 (( NP <fs af='GadZi,unk,,,,,,' head="GadZi" drel=k2:VGNF name=NP4> | 4 (( NP <fs af='aBiprAyaM,n,,pl,,d,0,0' head="aBiprAyAlu" drel=k1:NULL_VGF name=NP2> |
4.1 GadZi NN <fs af='GadZi,unk,,,,,,' name="GadZi"> | 4.1 aBiprAyAlu NN <fs af='aBiprAyaM,n,,pl,,d,0,0' name="aBiprAyAlu"> |
)) | )) |
5 (( VGNF <fs af='Kul,v,,,5,,ne,ne' head="Kule" drel=vmod:VGF name=VGNF> | 5 (( NULL_VGF <fs name='NULL_VGF'> |
5.1 Kule VM <fs af='Kul,v,,,5,,ne,ne' name="Kule"> | 5.1 NULL VM <fs af='NULL,unk,,,,,,' poslcat="NM"> |
)) | 5.2 . SYM <fs af='.,punc,,,,,,' poslcat="NM"> |
6 (( NP <fs af='tebila,n,,sg,,d,me,me' head="tebile" drel=k7p:VGF name=NP5> | )) |
6.1 tebile NN <fs af='tebila,n,,sg,,d,me,me' name="tebile"> | |
)) | |
7 (( VGF <fs af='rAK,v,,,5,,Cila,Cila' head="rAKaCila" name=VGF> | |
7.1 rAKaCila VM <fs af='rAK,v,,,5,,Cila,Cila' name="rAKaCila"> | |
7.2 । SYM | |
)) | |
</Sentence></code> | </Sentence></code> |
| |
And in the CoNLL format: | And in the CoNLL format: |
| |
| 1 | mAXabIlawA | mAXabIlawA | NP | NNP | lex-mAXabIlawA<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-mAXabIlawA<nowiki>|</nowiki>name-NP | 7 | k1 | _ | _ | | | 1 | iMkA | iMkA | NP | PRP | <nowiki>lex-iMkA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-iMkA|name-NP|poslcat-NM</nowiki> | 5 | vmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | waKana | waKana | NP | PRP | lex-waKana<nowiki>|</nowiki>cat-pn<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-waKana<nowiki>|</nowiki>name-NP2 | 7 | k7t | _ | _ | | | 2 | warawarAlugA | warawarAlugA | RBP | RB | <nowiki>lex-warawarAlugA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-warawarAlugA|name-RBP|poslcat-NM</nowiki> | 3 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | hAwera | hAwa | NP | NN | lex-hAwa<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-o<nowiki>|</nowiki>vib-era<nowiki>|</nowiki>tam-era<nowiki>|</nowiki>head-hAwera<nowiki>|</nowiki>name-NP3 | 4 | r6 | _ | _ | | | 3 | nAtukupoyina | nAtuko | VGNF | VM | <nowiki>lex-nAtuko|cat-v|gend-any|num-any|pers-any|case-|vib-i_po_ina|tam-i_po_ina|head-nAtukupoyina|name-VGNF</nowiki> | 4 | nmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | GadZi | GadZi | NP | NN | lex-GadZi<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-GadZi<nowiki>|</nowiki>name-NP4 | 5 | k2 | _ | _ | | | 4 | aBiprAyAlu | aBiprAyaM | NP | NN | <nowiki>lex-aBiprAyaM|cat-n|gend-|num-pl|pers-|case-d|vib-0|tam-0|head-aBiprAyAlu|name-NP2</nowiki> | 5 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | Kule | Kul | VGNF | VM | lex-Kul<nowiki>|</nowiki>cat-v<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-ne<nowiki>|</nowiki>tam-ne<nowiki>|</nowiki>head-Kule<nowiki>|</nowiki>name-VGNF | 7 | vmod | _ | _ | | | 5 | NULL | NULL | <nowiki>NULL_VGF</nowiki> | VM | <nowiki>name-NULL_VGF</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | tebile | tebila | NP | NN | lex-tebila<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-me<nowiki>|</nowiki>tam-me<nowiki>|</nowiki>head-tebile<nowiki>|</nowiki>name-NP5 | 7 | k7p | _ | _ | | |
| 7 | rAKaCila | rAK | VGF | VM | lex-rAK<nowiki>|</nowiki>cat-v<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-Cila<nowiki>|</nowiki>tam-Cila<nowiki>|</nowiki>head-rAKaCila<nowiki>|</nowiki>name-VGF | 0 | main | _ | _ | | |
| |
And after conversion of the WX encoding to the Bengali script in UTF-8: | And after conversion of the WX encoding to the Telugu script in UTF-8: |
| |
| 1 | মাধবীলতা | মাধবীলতা | NP | NNP | lex-mAXabIlawA<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-mAXabIlawA<nowiki>|</nowiki>name-NP | 7 | k1 | _ | _ | | | 1 | <nowiki>ఇంకా</nowiki> | <nowiki>ఇంకా</nowiki> | NP | PRP | <nowiki>lex-iMkA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-iMkA|name-NP|poslcat-NM</nowiki> | 5 | vmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | তখন | তখন | NP | PRP | lex-waKana<nowiki>|</nowiki>cat-pn<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-0<nowiki>|</nowiki>tam-0<nowiki>|</nowiki>head-waKana<nowiki>|</nowiki>name-NP2 | 7 | k7t | _ | _ | | | 2 | <nowiki>తరతరాలుగా</nowiki> | <nowiki>తరతరాలుగా</nowiki> | RBP | RB | <nowiki>lex-warawarAlugA|cat-avy|gend-|num-|pers-|case-|vib-0|tam-0_avy|head-warawarAlugA|name-RBP|poslcat-NM</nowiki> | 3 | adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | হাতের | হাত | NP | NN | lex-hAwa<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-o<nowiki>|</nowiki>vib-era<nowiki>|</nowiki>tam-era<nowiki>|</nowiki>head-hAwera<nowiki>|</nowiki>name-NP3 | 4 | r6 | _ | _ | | | 3 | <nowiki>నాటుకుపొయిన</nowiki> | <nowiki>నాటుకొ</nowiki> | VGNF | VM | <nowiki>lex-nAtuko|cat-v|gend-any|num-any|pers-any|case-|vib-i_po_ina|tam-i_po_ina|head-nAtukupoyina|name-VGNF</nowiki> | 4 | nmod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | ঘড়ি | ঘড়ি | NP | NN | lex-GadZi<nowiki>|</nowiki>cat-unk<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-<nowiki>|</nowiki>tam-<nowiki>|</nowiki>head-GadZi<nowiki>|</nowiki>name-NP4 | 5 | k2 | _ | _ | | | 4 | <nowiki>అభిప్రాయాలు</nowiki> | <nowiki>అభిప్రాయం</nowiki> | NP | NN | <nowiki>lex-aBiprAyaM|cat-n|gend-|num-pl|pers-|case-d|vib-0|tam-0|head-aBiprAyAlu|name-NP2</nowiki> | 5 | k1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | খুলে | খুল্ | VGNF | VM | lex-Kul<nowiki>|</nowiki>cat-v<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-ne<nowiki>|</nowiki>tam-ne<nowiki>|</nowiki>head-Kule<nowiki>|</nowiki>name-VGNF | 7 | vmod | _ | _ | | | 5 | NULL | NULL | <nowiki>NULL_VGF</nowiki> | VM | <nowiki>name-NULL_VGF</nowiki> | 0 | main | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | টেবিলে | টেবিল | NP | NN | lex-tebila<nowiki>|</nowiki>cat-n<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-sg<nowiki>|</nowiki>pers-<nowiki>|</nowiki>case-d<nowiki>|</nowiki>vib-me<nowiki>|</nowiki>tam-me<nowiki>|</nowiki>head-tebile<nowiki>|</nowiki>name-NP5 | 7 | k7p | _ | _ | | |
| 7 | রাখছিল | রাখ্ | VGF | VM | lex-rAK<nowiki>|</nowiki>cat-v<nowiki>|</nowiki>gend-<nowiki>|</nowiki>num-<nowiki>|</nowiki>pers-5<nowiki>|</nowiki>case-<nowiki>|</nowiki>vib-Cila<nowiki>|</nowiki>tam-Cila<nowiki>|</nowiki>head-rAKaCila<nowiki>|</nowiki>name-VGF | 0 | main | _ | _ | | |
| |
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in HyDT-Bangla are not frequent. Only 78 of the 7252 chunks in the training+development ICON 2010 version are attached nonprojectively (1.08%). | Nonprojectivities in HyDT-Telugu are very rare. Only 13 of the 5722 chunks in the training+development ICON 2010 version are attached nonprojectively (0.23%). |
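Counts like these can be obtained with a check along the following lines. It is a sketch under our own assumptions, not the script that was actually used: a chunk counts as nonprojectively attached if some chunk lying between it and its head is not dominated by that head, attachments to the artificial root (head 0) are skipped, and the function name ''nonprojective_nodes'' is hypothetical.

```python
def nonprojective_nodes(heads):
    """Return the 1-based ids of nodes attached nonprojectively.

    heads[i - 1] is the head of node i; 0 denotes the artificial root.
    """
    def dominated_by(node, target):
        # Walk up from node towards the root; the seen set guards against cycles.
        seen = set()
        while node != 0 and node not in seen:
            seen.add(node)
            node = heads[node - 1]
            if node == target:
                return True
        return False

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # assumption: root attachments are not counted
        lo, hi = sorted((dep, head))
        # The arc is nonprojective if any node in its span is outside the head's subtree.
        if any(not dominated_by(mid, head) for mid in range(lo + 1, hi)):
            result.append(dep)
    return result


# Head column of the first Telugu development sentence shown above.
print(nonprojective_nodes([2, 6, 6, 6, 6, 9, 9, 9, 0]))
# Share of nonprojectively attached chunks, as reported in the text.
print(round(100 * 13 / 5722, 2), round(100 * 78 / 7252, 2))
```

Summing the lengths of the returned lists over all sentences and dividing by the total number of chunks gives the percentage quoted above.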
| |
The results of the ICON 2009 NLP tools contest have been published in [[http://ltrc.iiit.ac.in/nlptools2009/CR/intro-husain.pdf|(Husain, 2009)]]. There were two evaluation rounds, the first with the coarse-grained syntactic tags, the second with the fine-grained syntactic tags. To reward language independence, only systems that parsed all three languages were officially ranked. The following table presents the Bengali/coarse-grained results of the four officially ranked systems, and the best Bengali-only* system. | The results of the ICON 2009 NLP tools contest have been published in [[http://ltrc.iiit.ac.in/nlptools2009/CR/intro-husain.pdf|(Husain, 2009)]]. There were two evaluation rounds, the first with the coarse-grained syntactic tags, the second with the fine-grained syntactic tags. To reward language independence, only systems that parsed all three languages were officially ranked. The following table presents the Telugu/coarse-grained results of the four officially ranked systems. |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ |
| Kolkata (De et al.)* | 84.29 | 90.32 | | | Malt (Nivre) | 62.44 | 86.28 | |
| Hyderabad (Ambati et al.) | 78.25 | 90.22 | | | Mannem | 65.01 | 85.76 | |
| Malt (Nivre) | 76.07 | 88.97 | | | Hyderabad (Ambati et al.) | 65.01 | 85.25 | |
| Malt+MST (Zeman) | 71.49 | 86.89 | | | Malt+MST (Zeman) | 56.43 | 81.30 | |
| Mannem | 70.34 | 83.56 | | |
| |
The results of the ICON 2010 NLP tools contest have been published in [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/toolscontest10-workshoppaper-final.pdf|(Husain et al., 2010)]], page 6. These are the best results for Bengali with fine-grained syntactic tags: | The results of the ICON 2010 NLP tools contest have been published in [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/toolscontest10-workshoppaper-final.pdf|(Husain et al., 2010)]], page 6. These are the best results for Telugu with fine-grained syntactic tags: |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ |
| Attardi et al. | 70.66 | 87.41 | | | Kosaraju et al. | 70.12 | 91.82 | |
| Kosaraju et al. | 70.55 | 86.16 | | | Attardi et al. | 65.61 | 90.48 | |
| Kolachina et al. | 70.14 | 87.10 | | | Kolachina et al. | 68.11 | 90.15 | |
| |