Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2008/03/06 15:51] zeman Time requirements moved to Drivers. |
user:zeman:interset:drivers [2008/03/25 14:10] zeman Lemma features. |
||
---|---|---|---|
Line 3: | Line 3: | ||
This is an overview of existing tag set drivers. Issues specific to a tag set or a language are described here. | This is an overview of existing tag set drivers. Issues specific to a tag set or a language are described here. | ||
- | ===== Chinese ===== | + | ===== Chinese |
The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none is supplied with the CoNLL data, and only a little can be found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset. | The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none is supplied with the CoNLL data, and only a little can be found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset. | ||
Line 16: | Line 16: | ||
Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | ||
+ | |||
+ | ===== Czech (cs) ===== | ||
+ | |||
+ | ==== Prague Dependency Treebank (PDT) ==== | ||
+ | |||
+ | When I worked on this driver, I did not yet have the smart functions for ensuring that only permitted tags are produced. | ||
+ | |||
+ | This is the largest tag set I have encountered so far. It contains 4288 tags. | ||
+ | |||
+ | Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product of this work): about 2 days, so let's say 18 hours. I spent another 11:09 hours, | ||
+ | |||
+ | |||
+ | ==== CoNLL (derived from PDT) ==== | ||
+ | |||
+ | The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into a coarse-grained part of speech, a detailed part of speech, and a set of feature values. There should be a one-to-one mapping between the original PDT and the CoNLL tagsets; however, because of the features, the driver cannot be a simple envelope around the driver of the original tagset (as is the case for e.g. Penn Treebank tags). | ||
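The decomposition described above can be illustrated with a minimal sketch. It assumes the 15-position PDT tag layout and CoNLL-style `Name=value` features joined by `|`. The feature names (`Gen`, `Num`, `Cas`, ...) and the function names `decompose_pdt_tag` and `compose_pdt_tag` are assumptions made for this example, not taken from the driver.

```python
# Hypothetical sketch: split a 15-position PDT tag into CoNLL-style columns.
# Feature names and positions are assumptions made for illustration only.

# 1-based tag positions that carry a feature (positions 13 and 14 are skipped).
FEATURE_POSITIONS = [
    (3, "Gen"), (4, "Num"), (5, "Cas"), (6, "PGe"), (7, "PNu"),
    (8, "Per"), (9, "Ten"), (10, "Gra"), (11, "Neg"), (12, "Voi"),
    (15, "Var"),
]
POSITION_BY_NAME = {name: pos for pos, name in FEATURE_POSITIONS}


def decompose_pdt_tag(tag):
    """Return (coarse POS, detailed POS, feature string) for a PDT tag."""
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character tag, got {tag!r}")
    feats = [
        f"{name}={tag[pos - 1]}"
        for pos, name in FEATURE_POSITIONS
        if tag[pos - 1] != "-"  # "-" marks an unset position
    ]
    # CoNLL uses "_" for an empty feature column.
    return tag[0], tag[1], "|".join(feats) if feats else "_"


def compose_pdt_tag(cpos, pos, feats):
    """Rebuild the positional tag from the three CoNLL-style columns."""
    chars = ["-"] * 15
    chars[0], chars[1] = cpos, pos
    if feats != "_":
        for pair in feats.split("|"):
            name, value = pair.split("=", 1)
            chars[POSITION_BY_NAME[name] - 1] = value
    return "".join(chars)


if __name__ == "__main__":
    print(decompose_pdt_tag("NNFS1-----A----"))
```

The round trip through `compose_pdt_tag` only holds for the positions listed in this sketch; it says nothing about information that the real CoNLL data stores elsewhere.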
+ | |||
+ | Update: the mapping to the original PDT tags is not one-to-one. Some information, | ||
+ | |||
+ | Work started: 25.3.2008 | ||
===== Time needed for tag set conversion ===== | ===== Time needed for tag set conversion ===== | ||
Line 26: | Line 45: | ||
Arabic tags (both Ota's and Buckwalter's, | Arabic tags (both Ota's and Buckwalter's, | ||
4:45+1+1:40 = 7:25 | 4:45+1+1:40 = 7:25 | ||
- | |||
- | Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product of this work) | ||
- | about 2 days, so let's say 18 hours | ||
Danish DDT/Parole tags (144 tags with an elaborate description) | Danish DDT/Parole tags (144 tags with an elaborate description) | ||
Line 44: | Line 60: | ||
Arabic CoNLL tags | Arabic CoNLL tags | ||
4: | 4: | ||
- | |||
- | Czech PDT tags (CoNLL version? Or are these just the fixes made when I started testing the drivers?) | ||
- | 1: | ||
Bulgarian CoNLL tags | Bulgarian CoNLL tags | ||