user:zeman:treebanks:te [2012/03/22 16:47] zeman ICON 2010 Telugu data size. |
user:zeman:treebanks:te [2012/03/22 17:06] zeman Inside. |
==== Inside ==== | ==== Inside ==== |
| |
The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Bengali in this case) we can map the WX encoding to the original characters in UTF-8. WX uses English letters so if there was embedded English (or other string using Latin letters) it will probably get lost during the conversion. | The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Telugu in this case) we can map the WX encoding to the original characters in UTF-8. WX uses English letters so if there was embedded English (or other string using Latin letters) it will probably get lost during the conversion. |
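The WX-to-UTF-8 mapping described above can be sketched as follows. This is a minimal illustration, not the conversion actually used for the treebank. The tables cover only a subset of letters and follow the common WX convention as an assumption (including `eV`/`oV` for the short vowels); the linked WX map is the authoritative source. The function name `wx_to_telugu` is hypothetical.

```python
# Partial WX -> Telugu tables (assumed subset of the WX convention).
CONSONANTS = {
    "k": "క", "g": "గ", "c": "చ", "j": "జ", "t": "ట", "d": "డ", "N": "ణ",
    "w": "త", "x": "ద", "n": "న", "p": "ప", "b": "బ", "m": "మ", "y": "య",
    "r": "ర", "l": "ల", "v": "వ", "s": "స", "h": "హ", "lY": "ళ",
}
# WX vowel -> (independent letter, dependent vowel sign).
# The inherent vowel "a" has no dependent sign.
VOWELS = {
    "a": ("అ", ""), "A": ("ఆ", "ా"), "i": ("ఇ", "ి"), "I": ("ఈ", "ీ"),
    "u": ("ఉ", "ు"), "U": ("ఊ", "ూ"), "eV": ("ఎ", "ె"), "e": ("ఏ", "ే"),
    "oV": ("ఒ", "ొ"), "o": ("ఓ", "ో"),
}
SIGNS = {"M": "ం", "H": "ః"}  # anusvara, visarga
VIRAMA = "్"                  # suppresses the inherent vowel

# Try longer WX tokens first so that "lY" wins over "l" and "eV" over "e".
KEYS = sorted(list(CONSONANTS) + list(VOWELS) + list(SIGNS), key=len, reverse=True)


def wx_to_telugu(word):
    """Convert a WX-encoded word to Telugu script (partial mapping)."""
    out = []
    pending = False  # True while the last consonant still awaits its vowel
    i = 0
    while i < len(word):
        tok = next((k for k in KEYS if word.startswith(k, i)), None)
        if tok is None:
            # Unknown character (digit, punctuation): copy it unchanged.
            if pending:
                out.append(VIRAMA)
                pending = False
            out.append(word[i])
            i += 1
            continue
        i += len(tok)
        if tok in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster
            out.append(CONSONANTS[tok])
            pending = True
        elif tok in VOWELS:
            independent, sign = VOWELS[tok]
            out.append(sign if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)
                pending = False
            out.append(SIGNS[tok])
    if pending:
        out.append(VIRAMA)  # word-final consonant without a vowel
    return "".join(out)


print(wx_to_telugu("maMxi"))
```

Because every Latin letter is read as WX, an embedded English word such as `the` would be converted letter by letter into Telugu characters as well, which is the loss mentioned above. Digits such as `932` pass through unchanged in this sketch.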
| |
The CoNLL format contains only the chunk heads. The native SSF format shows the other words in the chunk, too, but it does not capture intra-chunk dependency relations. This is an example of a multi-word chunk: | The CoNLL format contains only the chunk heads. The native SSF format shows the other words in the chunk, too, but it does not capture intra-chunk dependency relations. This is an example of a multi-word chunk: |
| |
<code>3 (( NP <fs af='rumAla,n,,sg,,d,0,0' head="rumAla" drel=k2:VGF name=NP3> | <code>3 (( NP <fs af='AdavAlYlu,n,,sg,,,0,0_e' head='AdavAlYle' pbank='ARG3' name='NP3'> |
3.1 ekatA QC <fs af='eka,num,,,,,,'> | 3.1 932 QC <fs af='932,num,,,,,,'> |
3.2 ledisa JJ <fs af='ledisa,unk,,,,,,'> | 3.2 maMxi CL <fs af='maMxi,n,,pl,,d,0,0'> |
3.3 rumAla NN <fs af='rumAla,n,,sg,,d,0,0' name="rumAla"> | 3.3 AdavAlYle NN <fs af='AdavAlYlu,n,,sg,,,0,0_e' name='AdavAlYle'> |
))</code> | ))</code> |
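To read such chunks programmatically, the `<fs ...>` feature structure can be split into attributes and the comma-separated `af` value into its eight positions. The following is a sketch only: the function name `parse_fs` is hypothetical, and the labels given to the `af` positions are an assumption about the usual SSF ordering, not something stated on this page. Attribute values in the examples appear single-quoted, double-quoted and unquoted, so all three forms are accepted.

```python
import re

# Assumed labels for the eight comma-separated positions of the af attribute.
AF_FIELDS = ("root", "category", "gender", "number",
             "person", "case", "vibhakti", "suffix")

# name='value', name="value" or bare name=value (up to whitespace or '>').
ATTR_RE = re.compile(r"""(\w+)=(?:'([^']*)'|"([^"]*)"|([^\s>]+))""")


def parse_fs(fs):
    """Parse an SSF <fs ...> feature structure into a dictionary."""
    attrs = {}
    for name, single, double, bare in ATTR_RE.findall(fs):
        # Exactly one of the three groups matched; the others are empty.
        attrs[name] = single or double or bare
    if "af" in attrs:
        # Expand the af value into a nested dict keyed by the assumed labels.
        attrs["af"] = dict(zip(AF_FIELDS, attrs["af"].split(",")))
    return attrs


chunk = parse_fs("<fs af='AdavAlYlu,n,,sg,,,0,0_e' head='AdavAlYle' "
                 "pbank='ARG3' name='NP3'>")
print(chunk["head"], chunk["af"]["root"], chunk["af"]["number"])
```

Empty positions in `af` are kept as empty strings, so the position of each value is preserved even when a feature is unspecified.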
| |
According to [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/toolscontest10-workshoppaper-final.pdf|(Husain et al., 2010)]], in the ICON 2010 version, the chunk tags, POS tags and inter-chunk dependencies (topology + tags) were annotated manually. The rest (lemma, morphosyntactic features, headword of chunk) was marked automatically. | According to [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/toolscontest10-workshoppaper-final.pdf|(Husain et al., 2010)]], in the ICON 2010 version, the chunk tags, POS tags and inter-chunk dependencies (topology + tags) were annotated manually. The rest (lemma, morphosyntactic features, headword of chunk) was marked automatically. |
| |
Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Bengali part. | Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Telugu part. |
| |
==== Sample ==== | ==== Sample ==== |