
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


user:zeman:interset:drivers [2008/04/03 23:02]
zeman Portuguese.
user:zeman:interset:drivers [2009/02/20 15:10]
zeman
Line 1: Line 1:
 ====== Tag Set Drivers ======
  
-This is an overview of existing tag set drivers. Tag-set or language specific issues are described here.
+This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort.
  
 ===== Arabic (ar) =====
Line 58: Line 58:
  
 More than half of the time was consumed during testing for tuning tags containing the Sem feature.
 +
 +==== Multext ====
 +
 +The tagset of the MULTEXT-EAST project and corpora. The file ''mte-lex/wfl-cs.tbl'' contains 1428 unique tags (which is not to say that other tags are not possible). The corpora are stored in a TEI-compliant SGML format. It is easily readable except that non-ASCII characters are encoded using SGML entities.
 +
 +Work started: 16.2.2009
 +Work finished: 18.2.2009
 +Total work time: 16:36 h
 +
 +Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals, represented using the ''other'' feature, led to wrong feature-value combinations in conversions to/from other Czech sets. DZ Interset had to be slightly modified in response to this tagset, and further changes prompted by it will be made later.
  
 ===== Danish (da) =====
Line 98: Line 108:
 Work finished: 31.3.2008
 Total work time: 10 min
- 
- 
- 
- 
  
 ===== Portuguese (pt) =====
Line 109: Line 115:
 http://visl.sdu.dk/visl/pt/info/symbolset-floresta.html
 http://en.wikipedia.org/wiki/Portuguese_grammar
 +
 +Work started: 2.4.2008
 +Work finished: 24.4.2008
 +Total work time: 28:18 h
 +
 +The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex, with many features, some of them strangely overlapping and some undocumented; the annotation also contained a terrible proportion of noise, typos and other introduced errors.
  
 | **Feature** | **Explanation** | **Examples** |
 | _ | no features | prepositions, punctuation etc. |
-| 1 | 1st person | | 
 | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira |
 | 1S | 1st person singular | tenho, tinha, usei, vivo, vou |
Line 215: Line 226:
 | > | noise; should be ignored | |
 | 0/1/3S | noise; should probably be 1/3S | |
 +| 1 | noise; should be 1S | aproveitaria, saiba, tinha, vivia |
 | 1S> | noise; should be 1S | meu, meus, minha, minhas |
 | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos |
Line 248: Line 260:
 | <prop>M | noise; should be two features | |
 | <prparg> | noise; should be <co-prparg> | |
-| R | noise | 2 occurrences |
+| R | noise; should be PR | 2 occurrences |
 | recohidas> | noise; should be <ALT> | recolhidas |
 | <rel><ks> | noise; should be two features | |
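
The noise rows listed above can be read as a small repair table applied to the feature column before decoding. The following is only a minimal sketch in Python, not the actual DZ Interset driver code: the names ''NOISE_FIXES'' and ''normalize_features'' are hypothetical, the features are assumed to be separated by ''|'' as in the CoNLL format, and the way the fused features are split (''<prop>'' plus ''M'', ''<rel>'' plus ''<ks>'') is an assumption based on the note "should be two features".

```python
from typing import Dict, List

# Hypothetical repair table built from the "noise" rows of the feature table.
# Each noisy feature maps to the list of features that should replace it.
NOISE_FIXES: Dict[str, List[str]] = {
    ">": [],                         # noise; should be ignored
    "0/1/3S": ["1/3S"],              # should probably be 1/3S
    "1": ["1S"],
    "1S>": ["1S"],
    "1P>": ["1P"],
    "<prparg>": ["<co-prparg>"],
    "R": ["PR"],
    "recohidas>": ["<ALT>"],
    "<prop>M": ["<prop>", "M"],      # assumed split into two features
    "<rel><ks>": ["<rel>", "<ks>"],  # assumed split into two features
}


def normalize_features(feats: str) -> List[str]:
    """Split a '|'-separated feature string and repair known noise."""
    if feats == "_":
        # "_" means the token has no features at all.
        return []
    cleaned: List[str] = []
    for feature in feats.split("|"):
        # Unknown features pass through unchanged.
        cleaned.extend(NOISE_FIXES.get(feature, [feature]))
    return cleaned
```

For example, ''normalize_features("1S>|M|S")'' yields the three features ''1S'', ''M'' and ''S''. Keeping the repairs in one table, separate from the decoding logic, makes it easy to add further noisy values as they are found in the corpus.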
