Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2009/02/16 15:57] zeman Czech Multext. |
user:zeman:interset:drivers [2009/03/25 21:52] zeman en::conll2009 |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Tag Set Drivers ====== | ====== Tag Set Drivers ====== | ||
- | This is an overview of existing tag set drivers. Tag-set-specific or language-specific issues are described here. | + | This is an overview of existing tag set drivers. Tag-set-specific or language-specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort. |
===== Arabic (ar) ===== | ===== Arabic (ar) ===== | ||
Line 47: | Line 47: | ||
Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product while I was working on this): about 2 days, so let us say 18 hours. I spent a further 11:09 hours, | Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product while I was working on this): about 2 days, so let us say 18 hours. I spent a further 11:09 hours, | ||
- | ==== CoNLL (derived from PDT) ==== | + | ==== CoNLL 2006 ==== |
The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, | The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, | ||
Line 58: | Line 58: | ||
More than half of the time was consumed during testing for tuning tags containing the Sem feature. | More than half of the time was consumed during testing for tuning tags containing the Sem feature. | ||
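The decomposition described above for the CoNLL 2006 data (coarse-grained part of speech, detailed part of speech, and a set of feature values) can be illustrated with a minimal Python sketch. It is not the Interset driver itself. It assumes the usual CoNLL-X column conventions, where features are written as `Name=Value` pairs joined by `|` and an underscore marks an empty feature set. The names `ConllTag`, `parse_conll_tag` and `format_conll_tag` are hypothetical, and the sample tag is only illustrative.

```python
from typing import Dict, NamedTuple, Tuple


class ConllTag(NamedTuple):
    """One decomposed tag: coarse POS, detailed POS, feature values."""
    cpos: str
    pos: str
    feats: Dict[str, str]


def parse_conll_tag(cpos: str, pos: str, feats: str) -> ConllTag:
    """Split the FEATS string into a dictionary (assumed Name=Value|... syntax)."""
    parsed: Dict[str, str] = {}
    if feats != "_":  # "_" is the CoNLL-X marker for "no features"
        for item in feats.split("|"):
            name, sep, value = item.partition("=")
            # A feature without "=" is stored with an empty value.
            parsed[name] = value if sep else ""
    return ConllTag(cpos, pos, parsed)


def format_conll_tag(tag: ConllTag) -> Tuple[str, str, str]:
    """Rebuild the three column strings from a decomposed tag."""
    items = [f"{k}={v}" if v else k for k, v in tag.feats.items()]
    return tag.cpos, tag.pos, "|".join(items) if items else "_"


# Illustrative values only: a noun with gender, number, case and negation.
tag = parse_conll_tag("N", "N", "Gen=F|Num=S|Cas=1|Neg=A")
print(tag.feats["Cas"])
print(format_conll_tag(tag))
```

Keeping the features in a dictionary makes it easy to inspect or drop a single feature (for instance one named `Sem`) before mapping the rest to another tag set.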
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | The [[: | ||
+ | |||
+ | The '' | ||
+ | |||
+ | Work started: 24.3.2009 | ||
+ | Work finished: 24.3.2009 | ||
+ | Total work time: 1:10 h | ||
==== Multext ==== | ==== Multext ==== | ||
Line 64: | Line 74: | ||
Work started: 16.2.2009 | Work started: 16.2.2009 | ||
+ | Work finished: 18.2.2009 | ||
+ | Total work time: 16:36 h | ||
+ | |||
+ | Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the '' | ||
===== Danish (da) ===== | ===== Danish (da) ===== | ||
Line 79: | Line 93: | ||
Total work time: about 3 hours | Total work time: about 3 hours | ||
- | ==== CoNLL Tagset | + | ==== CoNLL Tagset ==== |
The driver is just an envelope around the '' | The driver is just an envelope around the '' | ||
Total work time: 48 minutes | Total work time: 48 minutes | ||
+ | |||
+ | ==== CoNLL 2009 Tagset ==== | ||
+ | |||
+ | Another envelope around the '' | ||
+ | |||
+ | Work started: 25.3.2009 | ||
+ | Work finished: 25.3.2009 | ||
+ | Total work time: 2:57 h | ||
===== German (de) ===== | ===== German (de) ===== |