Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2008/03/06 15:47] zeman created |
user:zeman:interset:drivers [2008/04/03 14:27] zeman Restructuralization. |
||
---|---|---|---|
Line 3: | Line 3: | ||
This is an overview of existing tag set drivers. Issues specific to a tag set or a language are described here. | This is an overview of existing tag set drivers. Issues specific to a tag set or a language are described here. | ||
- | ===== Chinese ===== | + | ===== Arabic (ar) ===== |
+ | |||
+ | The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | ||
+ | |||
+ | Created in 2006-2007. | ||
+ | Total work time: 13 hours | ||
+ | |||
+ | ===== Bulgarian (bg) ===== | ||
+ | |||
+ | The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. A special feature: the sophisticated system of pronouns also includes interrogative adverbs and numerals. | ||
+ | |||
+ | Created in 2007. | ||
+ | Total work time: 35 hours | ||
+ | |||
+ | The main reasons why the implementation took so long: | ||
+ | * Necessity to re-work the system of main word classes, especially pronouns. | ||
+ | * Necessity to separate morphological and lexical definiteness (there are indefinite pronouns that are morphologically definite, and vice versa). | ||
+ | * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense). | ||
+ | * Driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent, | ||
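The round-trip requirement encode(decode(x))=x from the last item can be illustrated with a minimal sketch. It is not the actual driver tester: the function `round_trip_failures`, the toy tag tables and the feature name `pos` are all hypothetical and serve only to show the shape of the check.

```python
def round_trip_failures(tags, decode, encode):
    """Return (tag, restored) pairs for which encode(decode(tag)) != tag."""
    failures = []
    for tag in tags:
        restored = encode(decode(tag))
        if restored != tag:
            failures.append((tag, restored))
    return failures


# Hypothetical toy driver: "NNX" stands in for an inconsistent variant of "NN"
# that decodes to the same features and therefore cannot be restored.
TOY_DECODE = {"NN": {"pos": "noun"}, "VB": {"pos": "verb"}, "NNX": {"pos": "noun"}}
TOY_ENCODE = {"noun": "NN", "verb": "VB"}


def toy_decode(tag):
    # Copy the feature structure so that callers cannot modify the table.
    return dict(TOY_DECODE[tag])


def toy_encode(features):
    return TOY_ENCODE[features["pos"]]


failures = round_trip_failures(["NN", "VB", "NNX"], toy_decode, toy_encode)
print(failures)
```

A tag that shares its decoded features with another tag shows up in the list of failures, which is the kind of situation an inconsistent tag inventory produces.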
+ | |||
+ | ===== Chinese | ||
The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none supplied with CoNLL data, and only a little found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset. | The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none supplied with CoNLL data, and only a little found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset. | ||
Line 16: | Line 36: | ||
Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | ||
+ | |||
+ | ===== Czech (cs) ===== | ||
+ | |||
+ | ==== Prague Dependency Treebank (PDT) ==== | ||
+ | |||
+ | When I was working on this driver, I did not yet have the smart functions for enforcing permitted tags at my disposal. | ||
+ | |||
+ | This is the largest tag set I have encountered so far. It contains 4288 tags. | ||
+ | |||
+ | Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was working on this): about 2 days, so roughly 18 hours. I spent a further 11:09 hours, | ||
+ | |||
+ | ==== CoNLL (derived from PDT) ==== | ||
+ | |||
+ | The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, | ||
+ | |||
+ | The list of tags of this tagset contains equivalents of all original PDT tags. In addition, it contains those tags with the '' | ||
+ | |||
+ | Work started: 25.3.2008 | ||
+ | Work finished: 25.3.2008 | ||
+ | Total work time: 6:02 h | ||
+ | |||
+ | More than half of the time was spent on testing and on tuning the tags that contain the Sem feature. | ||
+ | |||
+ | ===== Danish (da) ===== | ||
+ | |||
+ | Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish. | ||
+ | |||
+ | Total work time: about 7 hours | ||
+ | |||
+ | ===== English (en) ===== | ||
+ | |||
+ | ==== Penn Treebank Tagset ==== | ||
+ | |||
+ | Penn Treebank (45 atomic tags). Detailed classification of punctuation. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== CoNLL Tagset (derived from Penn tags) ==== | ||
+ | |||
+ | The driver is just an envelope around the '' | ||
+ | |||
+ | Total work time: 48 minutes | ||
+ | |||
+ | ===== German (de) ===== | ||
+ | |||
+ | ==== Stuttgart-Tübingen Tagset (STTS) ==== | ||
+ | |||
+ | This is the tagset used in the Tiger treebank. It is quite syntax-oriented, | ||
+ | |||
+ | The tags omit inflectional information (number and case of pronouns and articles, degree of comparison of adjectives, tense (Präteritum, | ||
+ | |||
+ | Work started: 29.3.2008 | ||
+ | Work finished: 29.3.2008 | ||
+ | Total work time: 4:00 h | ||
+ | |||
+ | ==== CoNLL (derived from STTS) ==== | ||
+ | |||
+ | Only a simple envelope around the STTS driver was needed. | ||
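What an envelope driver amounts to can be sketched as follows. This is an illustration under stated assumptions, not the actual driver code: it assumes that a CoNLL tag is the tab-separated triple of coarse part of speech, detailed part of speech and features, that the STTS tag sits in the second field, and that the features field is `_`. The names `stts_decode`, `stts_encode`, `conll_decode`, `conll_encode` and the two-entry tag table with its feature names are hypothetical.

```python
# Hypothetical stand-in for the underlying STTS driver (two tags only).
STTS_TABLE = {
    "NN": {"pos": "noun"},
    "VVFIN": {"pos": "verb", "verbform": "fin"},
}


def stts_decode(tag):
    # Copy the feature structure so that callers cannot modify the table.
    return dict(STTS_TABLE[tag])


def stts_encode(features):
    for tag, known in STTS_TABLE.items():
        if known == features:
            return tag
    raise ValueError(f"no STTS tag for {features!r}")


def conll_decode(conll_tag):
    """Envelope: pick the STTS tag out of the CoNLL triple and delegate."""
    _cpos, pos, _feats = conll_tag.split("\t")
    return stts_decode(pos)


def conll_encode(features):
    """Envelope: delegate to STTS, then wrap the result as a CoNLL triple."""
    tag = stts_encode(features)
    # Assumption: both part-of-speech fields repeat the STTS tag, features are empty.
    return f"{tag}\t{tag}\t_"


print(conll_decode("NN\tNN\t_"))
print(repr(conll_encode({"pos": "verb", "verbform": "fin"})))
```

Because all tag logic stays in the wrapped driver, the envelope needs only the splitting and joining of fields, which is consistent with the very short work time reported.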
+ | |||
+ | Work started: 31.3.2008 | ||
+ | Work finished: 31.3.2008 | ||
+ | Total work time: 10 min | ||
+ | |||
+ | ===== Swedish (sv) ===== | ||
+ | |||
+ | ==== Mamba and CoNLL ==== | ||
+ | |||
+ | Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but a detailed classification of auxiliary and modal verbs and of punctuation. The CoNLL driver is just an envelope around the Mamba driver. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== Tags of Hajič' | ||
+ | |||
+ | Based on the Swedish PAROLE tagset, but some characters differ (@ => W), and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature). | ||
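The two surface changes described above, character substitution and padding with dashes, can be sketched as follows. Only the substitution @ => W is taken from the description; any other differing characters are not covered. The function name `to_fixed_width` and the sample input string are assumptions made for illustration.

```python
TAG_LENGTH = 9
# Only the substitution named in the text; other differences are unknown here.
CHAR_MAP = str.maketrans({"@": "W"})


def to_fixed_width(parole_tag):
    """Replace '@' with 'W' and pad the tag with dashes to 9 characters."""
    converted = parole_tag.translate(CHAR_MAP)
    if len(converted) > TAG_LENGTH:
        raise ValueError(f"tag longer than {TAG_LENGTH} characters: {parole_tag!r}")
    return converted.ljust(TAG_LENGTH, "-")


print(to_fixed_width("NCUSN@IS"))
```

Since the i-th position does not always encode the same feature, a uniform length alone does not make the tags positional; decoding still has to start from the word class.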
+ | |||
+ | No reliable statistics of work time; estimated 8 hours | ||
+ | |||
+ | ===== Time needed for tag set conversion ===== | ||
+ | |||
+ | Some records of the time spent on targeted tag set conversion for specific tag set pairs, done in early 2006: | ||
+ | |||
+ | Russian treebank (not only the tags, but the format conversion as a whole): | ||
+ | 12:36 | ||
+ | |||
+ | Arabic tags (both Ota's and Buckwalter's, | ||
+ | 4:45+1+1:40 = 7:25 | ||