
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:te [2012/03/22 16:47]
zeman ICON 2010 Telugu data size.
user:zeman:treebanks:te [2012/03/22 17:06]
zeman Inside.
Line 61:
 ==== Inside ====
  
-The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Bengali in this case) we can map the WX encoding to the original characters in UTF-8. WX uses English letters so if there was embedded English (or other string using Latin letters) it will probably get lost during the conversion.
+The text uses the [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/map.pdf|WX encoding]] of Indian letters. If we know what the original script is (Telugu in this case) we can map the WX encoding to the original characters in UTF-8. WX uses English letters so if there was embedded English (or other string using Latin letters) it will probably get lost during the conversion.
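The mapping from WX to the original script can be sketched as follows. This is only an illustration under stated assumptions: the table is a small hand-picked subset (just the letters needed for two words from the sample chunk), the letter values follow the commonly used WX convention and should be checked against the linked map.pdf, and the function name ''wx_to_telugu'' is hypothetical. Characters outside the table are passed through unchanged, which is also where embedded Latin-script text would be mishandled by a full converter.

```python
# Partial WX -> Telugu table (assumption: common WX convention; verify against map.pdf).
VIRAMA = "\u0c4d"
CONSONANTS = {
    "d": "\u0c21",   # retroflex DDA
    "x": "\u0c26",   # dental DA
    "m": "\u0c2e",   # MA
    "l": "\u0c32",   # LA
    "lY": "\u0c33",  # LLA (two-letter WX token)
    "v": "\u0c35",   # VA
}
INDEPENDENT_VOWELS = {"a": "\u0c05", "A": "\u0c06", "i": "\u0c07", "u": "\u0c09"}
VOWEL_SIGNS = {"a": "", "A": "\u0c3e", "i": "\u0c3f", "u": "\u0c41"}  # "a" is inherent
OTHER = {"M": "\u0c02"}  # anusvara


def wx_to_telugu(word):
    """Convert one WX-encoded word using the partial table above."""
    out = []
    pending = False  # True while the last consonant has not received a vowel yet
    i = 0
    while i < len(word):
        # Prefer the two-letter consonant token (e.g. "lY") over a single letter.
        tok = word[i:i + 2] if word[i:i + 2] in CONSONANTS else word[i]
        i += len(tok)
        if tok in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster: suppress the inherent vowel
            out.append(CONSONANTS[tok])
            pending = True
        elif tok in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[tok] if pending else INDEPENDENT_VOWELS[tok])
            pending = False
        else:
            if pending:
                out.append(VIRAMA)
            out.append(OTHER.get(tok, tok))  # unknown characters pass through
            pending = False
    if pending:
        out.append(VIRAMA)  # word-final bare consonant
    return "".join(out)


if __name__ == "__main__":
    for wx in ("maMxi", "AdavAlYlu"):
        print(wx, wx_to_telugu(wx))
```

A real converter would need the complete table for the script at hand and some way to recognize tokens that were never WX in the first place (numbers, English words).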
  
 The CoNLL format contains only the chunk heads. The native SSF format shows the other words in the chunk, too, but it does not capture intra-chunk dependency relations. This is an example of a multi-word chunk:
  
-<code>      ((      NP      <fs af='rumAla,n,,sg,,d,0,0' head="rumAla" drel=k2:VGF name=NP3>
-3.1     ekatA   QC      <fs af='eka,num,,,,,,'>
-3.2     ledisa  JJ      <fs af='ledisa,unk,,,,,,'>
-3.3     rumAla  NN      <fs af='rumAla,n,,sg,,d,0,0' name="rumAla">
+<code>      ((      NP      <fs af='AdavAlYlu,n,,sg,,,0,0_e' head='AdavAlYle' pbank='ARG3' name='NP3'
+3.1     932     QC      <fs af='932,num,,,,,,'>
+3.2     maMxi   CL      <fs af='maMxi,n,,pl,,d,0,0'>
+3.3     AdavAlYle       NN      <fs af='AdavAlYlu,n,,sg,,,0,0_e' name='AdavAlYle'>
         ))</code>
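To see how the chunk head relates to the words inside the chunk, the SSF fragment can be read with a few lines of code. The following is a rough sketch, not a full SSF reader: the function names are hypothetical, the columns are assumed to be whitespace-separated, and the first field of ''af'' is assumed to be the lemma (root). The sample string is adapted from the Telugu chunk in this revision.

```python
import re

# attribute=value pairs inside <fs ...>; values may be single-quoted,
# double-quoted, or bare (as in drel=k2:VGF).
ATTR = re.compile(r"""(\w+)=('[^']*'|"[^"]*"|[^\s>]+)""")


def parse_fs(fs):
    """Return the attributes of one <fs ...> feature structure as a dict."""
    return {key: value.strip("'\"") for key, value in ATTR.findall(fs)}


def parse_chunk(text):
    """Parse one SSF chunk into (chunk attributes, list of word dicts)."""
    chunk, words = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "))":
            continue
        left, _, fs = line.partition("<fs")
        cols = left.split()
        feats = parse_fs(fs)
        if cols[0] == "((":
            chunk = {"tag": cols[1], **feats}
        else:
            words.append({
                "id": cols[0],
                "form": cols[1],
                "pos": cols[2],
                # Assumption: the first comma-separated field of af is the lemma.
                "lemma": feats.get("af", "").split(",")[0],
                "name": feats.get("name"),
            })
    return chunk, words


SAMPLE = """\
((      NP      <fs af='AdavAlYlu,n,,sg,,,0,0_e' head='AdavAlYle' pbank='ARG3' name='NP3'>
3.1     932     QC      <fs af='932,num,,,,,,'>
3.2     maMxi   CL      <fs af='maMxi,n,,pl,,d,0,0'>
3.3     AdavAlYle       NN      <fs af='AdavAlYlu,n,,sg,,,0,0_e' name='AdavAlYle'>
        ))"""

if __name__ == "__main__":
    chunk, words = parse_chunk(SAMPLE)
    # The chunk's head attribute points at the word whose name attribute matches it.
    head = next(w for w in words if w["name"] == chunk["head"])
    print(chunk["tag"], "head:", head["form"], "lemma:", head["lemma"])
```

Only the word matched by ''head'' would survive in the CoNLL version; the other two words are visible in SSF but carry no dependency relation of their own.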
  
Line 79:
 According to [[http://ltrc.iiit.ac.in/nlptools2010/files/documents/toolscontest10-workshoppaper-final.pdf|(Husain et al., 2010)]], in the ICON 2010 version, the chunk tags, POS tags and inter-chunk dependencies (topology + tags) were annotated manually. The rest (lemma, morphosyntactic features, headword of chunk) was marked automatically.
  
-Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Bengali part.
+Note: There have been cycles in the Hindi part of HyDT but no such problem occurs in the Telugu part.
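A check for cycles of this kind is easy to run on any dependency file. The sketch below is an assumption about how one might do it, not the procedure actually used for HyDT: it expects one sentence as a mapping from node id to head id (as in the CoNLL HEAD column, with 0 standing for the artificial root), and the function name ''has_cycle'' is hypothetical.

```python
def has_cycle(heads):
    """Return True if following head links from some node never reaches the root.

    heads maps each node id to the id of its head; 0 denotes the root.
    """
    for start in heads:
        seen = set()
        node = start
        # Walk upwards until we reach the root or leave the known nodes.
        while node != 0 and node in heads:
            if node in seen:
                return True  # we came back to a node already on this path
            seen.add(node)
            node = heads[node]
    return False


if __name__ == "__main__":
    print(has_cycle({1: 2, 2: 0, 3: 2}))  # well-formed tree
    print(has_cycle({1: 2, 2: 3, 3: 1}))  # 1 -> 2 -> 3 -> 1
```

The walk restarts from every node, so it is quadratic in sentence length in the worst case, which is harmless for treebank-sized sentences.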
  
 ==== Sample ====
