Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2008/04/03 23:02] zeman Portuguese. |
user:zeman:interset:drivers [2014/07/17 16:32] (current) zeman hr::multext |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Tag Set Drivers ====== | ====== Tag Set Drivers ====== | ||
- | This is an overview of existing tag set drivers. Tag-set or language-specific issues are described here. | + | This is an overview of existing tag set drivers. Tag-set or language-specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort. |
===== Arabic (ar) ===== | ===== Arabic (ar) ===== | ||
+ | |||
+ | ==== CoNLL 2006 ==== | ||
The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | ||
Line 9: | Line 11: | ||
Created in 2006-2007. | Created in 2006-2007. | ||
Total work time: 13 hours | Total work time: 13 hours | ||
+ | |||
+ | ==== CoNLL 2007 ==== | ||
+ | |||
+ | The Arabic tags in CoNLL 2007 differed slightly from those of 2006. There are also new tags. The driver '' | ||
+ | |||
+ | Created: 23.6.2011 | ||
+ | Total work time: 2 hours | ||
===== Bulgarian (bg) ===== | ===== Bulgarian (bg) ===== | ||
Line 36: | Line 45: | ||
Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | ||
+ | |||
+ | ===== Croatian (hr) ===== | ||
+ | |||
+ | ==== Multext ==== | ||
+ | |||
+ | The tagset of the MULTEXT-EAST project as used in the SETimes.HR corpus. The documentation lists 1291 tags; we removed one wrong tag and kept 1290. | ||
+ | |||
+ | Work started: 16.7.2014 | ||
+ | Work finished: 17.7.2014 | ||
+ | Total work time: 5:45 h | ||
+ | |||
+ | This is the second Multext-East tagset covered by DZ Interset. Adding it was not too difficult because much of the previous effort on '' | ||
===== Czech (cs) ===== | ===== Czech (cs) ===== | ||
Line 47: | Line 68: | ||
The Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was working on this) took about 2 days, so say 18 hours. I spent another 11:09 hours, | The Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was working on this) took about 2 days, so say 18 hours. I spent another 11:09 hours, | ||
- | ==== CoNLL (derived from PDT) ==== | + | ==== CoNLL 2006 ==== |
The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, | The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, | ||
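As a rough illustration of the decomposition described above, the following sketch splits the three tag-related CoNLL columns (CPOSTAG, POSTAG, FEATS) into a coarse part of speech, a detailed part of speech and a dictionary of feature values. The function name, the tab-joined input and the sample tag are assumptions made for this example; this is not the actual driver code.

```python
def split_conll_tag(tag):
    """Split 'CPOS<TAB>POS<TAB>FEATS' into its three parts.

    Assumption: features are 'Name=Value' pairs separated by '|',
    and '_' stands for an empty feature set.
    """
    cpos, pos, feats = tag.split("\t")
    features = {}
    if feats != "_":
        for pair in feats.split("|"):
            name, _, value = pair.partition("=")
            features[name] = value
    return cpos, pos, features


# Hypothetical sample tag, for illustration only.
sample = "N\tN\tGen=M|Num=S|Cas=1"
print(split_conll_tag(sample))
```

Keeping the features in a dictionary makes it easy to see which values survive the conversion and which would have to be recovered some other way when mapping back to a PDT tag.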
Line 58: | Line 79: | ||
More than half of the time was consumed during testing for tuning tags containing the Sem feature. | More than half of the time was consumed during testing for tuning tags containing the Sem feature. | ||
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | The [[: | ||
+ | |||
+ | The '' | ||
+ | |||
+ | Work started: 24.3.2009 | ||
+ | Work finished: 24.3.2009 | ||
+ | Total work time: 1:10 h | ||
+ | |||
+ | ==== Multext ==== | ||
+ | |||
+ | The tagset of the MULTEXT-EAST project and corpora. The file '' | ||
+ | |||
+ | Work started: 16.2.2009 | ||
+ | Work finished: 18.2.2009 | ||
+ | Total work time: 16:36 h | ||
+ | |||
+ | Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the '' | ||
+ | |||
+ | ==== Prague Spoken Corpus ==== | ||
+ | |||
+ | The Prague Spoken Corpus (Pražský mluvený korpus, PMK) is distributed together with the frequency dictionary of spoken Czech (book). It uses very unusual tags, and a great many of them (over 10000!). An extremely high proportion of the tags has to rely on the '' | ||
+ | |||
+ | Work started: 26.11.2009 | ||
+ | Work finished: 4.10.2010 | ||
+ | Total work time: 57 hours | ||
===== Danish (da) ===== | ===== Danish (da) ===== | ||
Line 73: | Line 122: | ||
Total work time: about 3 hours | Total work time: about 3 hours | ||
- | ==== CoNLL Tagset (derived from Penn tags) ==== | + | ==== CoNLL 2006 ==== |
The driver is just an envelope around the '' | The driver is just an envelope around the '' | ||
Total work time: 48 minutes | Total work time: 48 minutes | ||
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | Another envelope around the '' | ||
+ | |||
+ | Work started: 25.3.2009 | ||
+ | Work finished: 25.3.2009 | ||
+ | Total work time: 2:57 h | ||
===== German (de) ===== | ===== German (de) ===== | ||
Line 91: | Line 148: | ||
Total work time: 4:00 h | Total work time: 4:00 h | ||
- | ==== CoNLL (derived from STTS) ==== | + | ==== CoNLL 2006 ==== |
Only a simple envelope around the STTS driver was needed. | Only a simple envelope around the STTS driver was needed. | ||
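The idea of an envelope driver can be sketched as follows: the outer decoder only picks the STTS tag out of the CoNLL columns and hands it to the inner decoder. Everything here is an assumption for illustration (the function names, the toy four-tag table, the tagset identifiers and the sample tag); the real STTS driver covers the whole tagset and fills in many more features.

```python
# Toy stand-in for an STTS decoder; only four tags, for illustration.
STTS_TO_POS = {
    "NN": "noun",     # common noun
    "NE": "noun",     # proper noun
    "ADJA": "adj",    # attributive adjective
    "VVFIN": "verb",  # finite full verb
}


def decode_stts(tag):
    """Decode a bare STTS tag into a (hypothetical) feature dictionary."""
    return {"pos": STTS_TO_POS.get(tag, "unknown"), "tagset": "de::stts"}


def decode_conll(tag):
    """Envelope: take the POSTAG column and delegate to the STTS decoder."""
    _cpos, pos, _feats = tag.split("\t")
    features = decode_stts(pos)
    features["tagset"] = "de::conll"  # remember which tagset the tag came from
    return features


# Hypothetical sample tag: CPOSTAG, POSTAG and an empty FEATS column.
print(decode_conll("NN\tNN\t_"))
```

Because all linguistic knowledge stays in the inner decoder, the envelope itself needs almost no maintenance when the STTS mapping is improved.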
Line 100: | Line 157: | ||
+ | ==== CoNLL 2009 ==== | ||
+ | This tagset is derived from the STTS, too. Unlike CoNLL 2006, there are also morphological features this time, which required additional processing effort. | ||
+ | Work started: 5.4.2009 | ||
+ | Work finished: 6.4.2009 | ||
+ | Total work time: 9:39 h | ||
+ | |||
+ | ===== Polish (pl) ===== | ||
+ | |||
+ | Based on the [[http:// | ||
+ | |||
+ | Work started: 4.9.2009 | ||
+ | Work finished: 8.9.2009 | ||
+ | Total work time: 9:54 h | ||
===== Portuguese (pt) ===== | ===== Portuguese (pt) ===== | ||
Line 109: | Line 179: | ||
http:// | http:// | ||
http:// | http:// | ||
+ | |||
+ | Work started: 2.4.2008 | ||
+ | Work finished: 24.4.2008 | ||
+ | Total work time: 28:18 h | ||
+ | |||
+ | The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex with many features, some of them strangely overlapping, | ||
| **Feature** | **Explanation** | **Examples** | | | **Feature** | **Explanation** | **Examples** | | ||
| _ | no features | prepositions, | | _ | no features | prepositions, | ||
- | | 1 | 1st person | | | ||
| 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira | | | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira | | ||
| 1S | 1st person singular | tenho, tinha, usei, vivo, vou | | | 1S | 1st person singular | tenho, tinha, usei, vivo, vou | | ||
Line 215: | Line 290: | ||
| > | noise; should be ignored | | | | > | noise; should be ignored | | | ||
| 0/1/3S | noise; should probably be 1/3S | | | | 0/1/3S | noise; should probably be 1/3S | | | ||
+ | | 1 | noise; should be 1S | aproveitaria, | ||
| 1S> | noise; should be 1S | meu, meus, minha, minhas | | | 1S> | noise; should be 1S | meu, meus, minha, minhas | | ||
| 1P> | noise; should be 1P | nossa, nossas, nosso, nossos | | | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos | | ||
Line 248: | Line 324: | ||
| < | | < | ||
| < | | < | ||
- | | R | noise | 2 occurrences | | + | | R | noise; should be PR | 2 occurrences | |
| recohidas> | | recohidas> | ||
| < | | < | ||
Line 260: | Line 336: | ||
| < | | < | ||
| VFIN | noise | há from haver | | | VFIN | noise | há from haver | | ||
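The noise rows of the table could be handled by a small normalization step before decoding. The sketch below uses only the corrections named in the rows shown here; the function name, the `|` separator and the sample feature strings are assumptions made for the example, and noise values without a stated replacement (such as `VFIN`) are deliberately left untouched.

```python
# Corrections taken from the table rows that name a replacement.
# "0/1/3S" is only "probably" 1/3S according to the table.
NOISE_FIXES = {
    "0/1/3S": "1/3S",
    "1": "1S",
    "1S>": "1S",
    "1P>": "1P",
    "R": "PR",
}
# Noise that the table says should be ignored.
NOISE_DROP = {">"}


def clean_features(feats):
    """Return the list of features with known noise dropped or corrected."""
    if feats == "_":  # "_" means no features
        return []
    cleaned = []
    for feature in feats.split("|"):
        if feature in NOISE_DROP:
            continue
        cleaned.append(NOISE_FIXES.get(feature, feature))
    return cleaned


# Hypothetical feature strings, for illustration only.
print(clean_features("1S>|M|S"))
print(clean_features(">|R"))
```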
+ | |||
+ | ===== Slovak (sk) ===== | ||
+ | |||
+ | ==== Slovenský národný korpus (SNK) ==== | ||
+ | |||
+ | 1457 structured tags. | ||
+ | |||
+ | Total work time: 5:32 hours. | ||
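Since the work times on this page are written in several ways ("13 hours", "5:45 h", "48 minutes", "about 3 hours"), adding them up needs a small amount of parsing. The following is a minimal sketch under the assumption that those are the only formats; the function name and the sample list are made up for the example and the result is not a total for the whole page.

```python
import re


def work_time_minutes(text):
    """Convert a work-time note such as '5:45 h' or '48 minutes' to minutes."""
    # Format H:MM (optionally followed by 'h' or 'hours').
    match = re.search(r"(\d+):(\d{2})", text)
    if match:
        return int(match.group(1)) * 60 + int(match.group(2))
    # Format 'N hours', 'N h' or 'N minutes'.
    match = re.search(r"(\d+)\s*(hours?|h|minutes?)", text)
    if match is None:
        raise ValueError(f"unrecognized work time: {text!r}")
    amount = int(match.group(1))
    return amount if match.group(2).startswith("m") else amount * 60


# A few notes in the formats used on this page.
notes = ["5:32 hours.", "48 minutes", "about 3 hours"]
total = sum(work_time_minutes(note) for note in notes)
print(f"{total // 60}:{total % 60:02d} h")
```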
===== Swedish (sv) ===== | ===== Swedish (sv) ===== |