Differences

This shows you the differences between two versions of the page.

--- user:zeman:interset:drivers [2008/04/03 23:02]
zeman Portuguese.
+++ user:zeman:interset:drivers [2009/02/20 15:10]
zeman
@@ Line 1: / Line 1: @@
 ====== Tag Set Drivers ======
-This is an overview of existing tag set drivers. Tag-set-specific or language-specific issues are described here.
+This is an overview of existing tag set drivers. Tag-set-specific or language-specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort.
 ===== Arabic (ar) =====
@@ Line 58: / Line 58: @@
 More than half of the time was consumed during testing for tuning tags containing the Sem feature.
+==== Multext ====
+The tagset of the MULTEXT-EAST project and corpora. The file ''mte-lex/wfl-cs.tbl'' contains 1428 unique tags (which is not to say that other tags are not possible). The corpora are stored in a TEI-compliant SGML format. It is easily readable except that non-ASCII characters are encoded using SGML entities.
+Work started: 16.2.2009
+Work finished: 18.2.2009
+Total work time: 16:36 h
+Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals, represented using the ''other'' feature, led to wrong feature-value combinations in conversions to/from other Czech sets. DZ Interset had to be slightly modified in response to this tagset, and more changes that were initiated here will be made later.
 ===== Danish (da) =====
@@ Line 98: / Line 108: @@
 Work finished: 31.3.2008
 Total work time: 10 min
 ===== Portuguese (pt) =====
@@ Line 109: / Line 115: @@
 http://visl.sdu.dk/visl/pt/info/symbolset-floresta.html
 http://en.wikipedia.org/wiki/Portuguese_grammar
+Work started: 2.4.2008
+Work finished: 24.4.2008
+Total work time: 28:18 h
+The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex, with many features, some of them strangely overlapping and some of them undocumented; there was also a terrible proportion of noise, typos or otherwise introduced errors in the annotation.
 | **Feature** | **Explanation** | **Examples** |
 | _ | no features | prepositions, punctuation etc. |
-| 1 | 1st person | |
 | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira |
 | 1S | 1st person singular | tenho, tinha, usei, vivo, vou |
@@ Line 215: / Line 226: @@
 | > | noise; should be ignored | |
 | 0/1/3S | noise; should probably be 1/3S | |
+| 1 | noise; should be 1S | aproveitaria, saiba, tinha, vivia |
 | 1S> | noise; should be 1S | meu, meus, minha, minhas |
 | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos |
@@ Line 248: / Line 260: @@
 | <prop>M | noise; should be two features | |
 | <prparg> | noise; should be <co-prparg> | |
-| R | noise | 2 occurrences |
+| R | noise; should be PR | 2 occurrences |
 | recohidas> | noise; should be <ALT> | recolhidas |
 | <rel><ks> | noise; should be two features | |
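
The noise rows in the table above amount to a small correction mapping that a driver could apply before decoding features. The following is only a minimal sketch, not the actual DZ Interset driver code: the names ''NOISE_FIXES'' and ''clean_features'' are hypothetical, the ''|'' separator is assumed from the usual CoNLL feature column convention, and only the one-to-one corrections visible in this diff are listed (rows such as ''<prop>M'' that must be split into two features are left out).

```python
# Noise features and their intended readings, copied from the table rows
# shown in this diff. None means the feature should simply be ignored.
# (Hypothetical helper; not part of the real driver.)
NOISE_FIXES = {
    ">": None,                   # noise; should be ignored
    "0/1/3S": "1/3S",            # noise; should probably be 1/3S
    "1": "1S",                   # noise; should be 1S
    "1S>": "1S",                 # noise; should be 1S
    "1P>": "1P",                 # noise; should be 1P
    "<prparg>": "<co-prparg>",   # noise; should be <co-prparg>
    "R": "PR",                   # noise; should be PR
    "recohidas>": "<ALT>",       # noise; should be <ALT>
}


def clean_features(feature_string, separator="|"):
    """Return the list of features with known noise values corrected.

    The separator is an assumption (CoNLL-style feature column).
    "_" stands for "no features", as in the table.
    """
    if feature_string == "_":
        return []
    cleaned = []
    for feature in feature_string.split(separator):
        # Unknown features pass through unchanged.
        fixed = NOISE_FIXES.get(feature, feature)
        if fixed is not None:
            cleaned.append(fixed)
    return cleaned
```

Keeping the corrections in a plain dictionary means that a new noise value found during testing becomes one extra line rather than a new branch in the decoding logic.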

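The Multext entry notes that non-ASCII characters in the corpora are encoded as SGML entities. As a rough illustration of reading such text, the sketch below decodes entities with Python's standard ''html.unescape''. It assumes that the entity names used in the corpus coincide with the HTML5 named entities (for example ''&ccaron;'' for č), which has not been checked against the corpus DTD; the function name ''decode_entities'' is hypothetical.

```python
import html


def decode_entities(text):
    """Replace named and numeric character entities with Unicode characters.

    Assumption: the SGML entity names in the corpus match HTML5 entity
    names; any entity unknown to html.unescape is left untouched.
    """
    return html.unescape(text)


if __name__ == "__main__":
    # Hypothetical sample string, not taken from the corpus.
    print(decode_entities("&ccaron;e&scaron;tina"))
```

If the corpus defines entities outside the HTML5 set, they would need a separate lookup table built from the entity declarations that ship with the corpus.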

Institute of Formal and Applied Linguistics Wiki
