Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:drivers [2008/03/06 15:51]
zeman Time requirements moved to Drivers.
user:zeman:interset:drivers [2008/03/26 08:56]
zeman cs::conll finished.
Line 3: Line 3:
 This is an overview of existing tag set drivers. Tag-set or language-specific issues are described here.
  
-===== Chinese =====
+===== Chinese (zh) =====
  
 The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none is supplied with the CoNLL data, and only a little can be found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset.
Line 16: Line 16:
  
 Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes.
 +
 +===== Czech (cs) =====
 +
 +==== Prague Dependency Treebank (PDT) ====
 +
 +When I worked on this driver, I did not yet have the smart functions for enforcing permitted tags at my disposal.
 +
 +This is the largest tag set I have encountered so far. It contains 4288 tags.
 +
 +Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was working on this): about 2 days, so let's say 18 hours. I spent another 11:09 hours when I started testing the drivers and had to fix this one. Again, part of the time went into debugging the test script, which was only just being created at that time.
 +
 +
 +
 +==== CoNLL (derived from PDT) ====
 +
 +The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, encoded in lemmas in the PDT, has been encoded as a new feature called ''Sem'' in CoNLL data. The README refers to the following documentation: [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tag|part of speech and most features]] | [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#sem-info|lemma features]]
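
The decomposition described above can be illustrated with a minimal Python sketch. It assumes that a tag is passed around as the three CoNLL columns (coarse part of speech, detailed part of speech, features) joined by tabs, that features are ''Name=value'' pairs separated by ''|'', and that ''_'' stands for an empty feature set. The function name ''parse_conll_tag'' and the sample values are hypothetical and are not taken from the driver itself.

```python
def parse_conll_tag(tag):
    """Split 'CPOS<TAB>POS<TAB>FEATS' into its parts and a feature dictionary.

    Assumption: the three CoNLL columns are joined by tabs and the feature
    column uses 'Name=value' pairs separated by '|', with '_' meaning "none".
    """
    cpos, pos, feats = tag.split("\t")
    features = {}
    if feats != "_":
        for pair in feats.split("|"):
            name, value = pair.split("=", 1)
            features[name] = value
    return cpos, pos, features


# Illustrative tag only; the feature values are not copied from the treebank.
sample = "N\tN\tGen=M|Num=S|Cas=1|Sem=G"
cpos, pos, features = parse_conll_tag(sample)
print(cpos, pos, features)
```

Keeping the features in a dictionary makes it easy to check whether ''Sem'' is present before mapping the rest of the tag.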
 +
 +The list of tags of this tagset contains equivalents of all original PDT tags. In addition, it contains those tags with the ''Sem'' feature set that occur in the CoNLL data, and a few more. The ''Sem'' values are currently stored in the ''other'' feature of Interset. At the same time, ''subpos = "prop"'' is set if ''Sem'' is set and ''subpos'' would otherwise be empty. (The original PDT tags cannot distinguish proper from common nouns.) If the encoder encounters ''subpos = "prop"'', it uses the default value ''Sem=m''. The "few more" tags were added to the list whenever there was a tag ''Foo=bar|Sem=something'' but no default counterpart ''Foo=bar|Sem=m''.
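
The handling of ''Sem'' described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the driver's actual code: the function names are hypothetical, the feature structure is reduced to the two fields ''subpos'' and ''other'', ''Sem'' is assumed to be the last feature of a tag, and the encoder is assumed to prefer a ''Sem'' value stored in ''other'' over the default.

```python
DEFAULT_SEM = "m"  # default Sem value named in the text above


def decode_sem(features, subpos=""):
    """Map the Sem feature to the 'other' and 'subpos' fields.

    `features` is a dict of CoNLL features; `subpos` is the value the rest
    of the tag would produce on its own (empty if none).
    """
    fs = {"subpos": subpos, "other": ""}
    sem = features.get("Sem")
    if sem is not None:
        fs["other"] = sem            # Sem is stored in 'other'
        if not fs["subpos"]:
            fs["subpos"] = "prop"    # only when subpos would otherwise be empty
    return fs


def encode_sem(fs):
    """Return the Sem value to emit, or None if the tag gets no Sem."""
    if fs.get("other"):              # assumption: a stored Sem value wins
        return fs["other"]
    if fs.get("subpos") == "prop":
        return DEFAULT_SEM
    return None


def add_default_sem_tags(tags):
    """For every 'X|Sem=something', make sure 'X|Sem=m' is in the list too."""
    known = set(tags)
    extra = []
    for tag in tags:
        base, sep, _sem = tag.rpartition("|Sem=")
        if sep:                      # the tag carries a Sem feature
            default = base + "|Sem=" + DEFAULT_SEM
            if default not in known:
                known.add(default)
                extra.append(default)
    return list(tags) + extra
```

With these helpers, a proper noun that arrives from another tag set with only ''subpos = "prop"'' is encoded with ''Sem=m'', which is why each such default tag has to be present in the list of permitted tags.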
 +
 +Work started: 25.3.2008
 +Work finished: 25.3.2008
 +Total work time: 6:02 h
 +
 +More than half of the time was spent during testing, tuning the tags that contain the ''Sem'' feature.
  
 ===== Time needed for tag set conversion =====
Line 26: Line 50:
 Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
 4:45+1+1:40 = 7:25
- 
-Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was working on this)
-about 2 days, so let's say 18 hours
  
 Danish DDT/Parole tags (144 tags with an elaborate description)
Line 44: Line 65:
 Arabic CoNLL tags
 4:33+5:19+3:16 = 13:08
- 
-Czech PDT tags (CoNLL version? Or are these just the fixes made when I started testing the drivers?)
-1:44+3:20+6:05 = 11:09 
  
 Bulgarian CoNLL tags
