Differences

This shows you the differences between two versions of the page.

--- user:zeman:interset:drivers [2008/03/06 15:51]
zeman Time requirements moved to Drivers.
+++ user:zeman:interset:drivers [2008/04/03 14:27]
zeman Restructuralization.
@@ Line 3: / Line 3: @@
 This is an overview of existing tag set drivers. Tag-set or language specific issues are described here.
-===== Chinese =====
+===== Arabic (ar) =====
+The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
+Created in 2006-2007.
+Total work time: 13 hours
+===== Bulgarian (bg) =====
+The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that includes interrogative adverbs and numerals.
+Created in 2007.
+Total work time: 35 hours
+The main reasons why the implementation took so long:
+  * Necessity to re-work the system of main word classes, especially pronouns.
+  * Necessity to separate morphological and lexical definiteness (some indefinite pronouns are morphologically definite, and vice versa).
+  * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense).
+  * The driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent in the order and form in which it presented features.
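The round-trip requirement in the last bullet can be illustrated with a minimal Python sketch. It is not the actual Interset tester: the functions `decode`, `encode` and `round_trip_failures`, and the sample feature strings, are hypothetical and only show why inconsistent feature order breaks encode(decode(x))=x when the encoder emits one canonical order.

```python
# Hypothetical toy driver; none of these names come from Interset itself.

def decode(tag: str) -> dict:
    """Parse a 'name=value|name=value' feature string into a dict."""
    if tag == "_":
        return {}
    return dict(pair.split("=", 1) for pair in tag.split("|"))


def encode(features: dict) -> str:
    """Serialize features in one canonical (alphabetical) order."""
    if not features:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(features.items()))


def round_trip_failures(tags):
    """Return the tags for which encode(decode(tag)) differs from tag."""
    return [tag for tag in tags if encode(decode(tag)) != tag]


# Invented sample tags: the second lists the same features in another order,
# so a canonical encoder cannot reproduce it character for character.
sample_tags = ["gen=m|num=s", "num=s|gen=m", "_"]
failures = round_trip_failures(sample_tags)
print(failures)
```

A tester of this kind only passes if the list of known tags is itself consistent, or if the encoder reproduces every irregularity of the source data.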
+===== Chinese (zh) =====
 The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none is supplied with the CoNLL data, and only a little can be found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset.
@@ Line 17: / Line 37: @@
 Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes.
-===== Time needed for tag set conversion =====
+===== Czech (cs) =====
-I am noting how much time each driver took me so that I can publish it. Comparing the time needed for a driver with the time needed for an ordinary conversion is interesting, even though I know that in reality I only save time when a driver is reused.
+==== Prague Dependency Treebank (PDT) ====
-Russian treebank (not only the tags but the format conversion as a whole):
+When I was working on this driver, I did not yet have the smart functions for enforcing permitted tags at my disposal.
-:36
-Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
+This is the largest tag set I have encountered so far. It contains 4288 tags.
-4:45+1+1:40 = 7:25
-Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was doing this)
+Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was doing this): about 2 days, so say 18 hours. I spent another 11:09 hours when I started testing the drivers and had to fix this one. Again, part of that time went into debugging the test script, which was only just being written at the time.
-about 2 days, so say 18 hours
-Danish DDT/Parole tags (144 tags with an elaborate description)
+==== CoNLL (derived from PDT) ====
-about 7 hours
-Swedish Mamba tags (48 tags)
+The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, encoded in lemmas in the PDT, has been encoded as a new feature called ''Sem'' in CoNLL data. The README refers to the following documentation: [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tag|part of speech and most features]] | [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#sem-info|lemma features]]
-about 3 hours
-Penn Treebank (36 tags)
+The list of tags of this tagset contains equivalents of all original PDT tags. In addition, it contains those tags with the ''Sem'' feature set that occur in the CoNLL data, and a few more. The ''Sem'' values are currently stored in the ''other'' feature of Interset. At the same time, ''subpos = "prop"'' is set if ''Sem'' is set and ''subpos'' would otherwise be empty. (The original PDT tags cannot distinguish proper from common nouns.) If the encoder encounters ''subpos = "prop"'', it uses the default value "Sem=m". The "few more" tags were added to the list whenever there was a tag ''Foo=bar|Sem=something'' but no corresponding default ''Foo=bar|Sem=m''.
-about 3 hours, but I was not yet measuring at this point, so this is only a rough retrospective estimate
-Hajič's Swedish tags
+Work started: 25.3.2008
-:32 - the full statistics are evidently missing here
+Work finished: 25.3.2008
+Total work time: 6:02 h
-Arabic CoNLL tags
+More than half of the time was spent in testing, on tuning the tags that contain the ''Sem'' feature.
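The handling of ''Sem'' described for this driver can be sketched roughly as follows in Python. This is not the driver's code: `decode_sem` and `encode_sem` are hypothetical names, the sample feature string is illustrative, and the sketch assumes that the encoder prefers a ''Sem'' value stored in ''other'' and falls back to "Sem=m" only when ''subpos'' is "prop" and nothing is stored.

```python
# Hypothetical sketch of the Sem <-> other/subpos logic; not Interset code.

def decode_sem(conll_feats: str, subpos: str = "") -> dict:
    """Store Sem in 'other'; mark proper nouns if subpos is still empty."""
    feats = dict(p.split("=", 1) for p in conll_feats.split("|") if "=" in p)
    interset = {"subpos": subpos, "other": ""}
    sem = feats.get("Sem")
    if sem is not None:
        interset["other"] = sem
        if not interset["subpos"]:
            interset["subpos"] = "prop"
    return interset


def encode_sem(interset: dict) -> str:
    """Emit the stored Sem value, or the default Sem=m for bare 'prop'."""
    if interset.get("other"):
        return "Sem=" + interset["other"]  # assumption: stored value wins
    if interset.get("subpos") == "prop":
        return "Sem=m"  # default named in the text above
    return ""


# Illustrative feature string, not taken from the CoNLL data.
decoded = decode_sem("Gen=F|Sem=G")
print(decoded)
print(encode_sem(decoded))
print(encode_sem({"subpos": "prop", "other": ""}))
```

The fallback is why the tag list needs a ''Sem=m'' variant for every tag that has some other ''Sem'' value: a feature structure coming from another tagset carries only ''subpos = "prop"'', and the encoder must still produce a tag that is on the list.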
-4:33+5:19+3:16 = 13:08
-Czech PDT tags (CoNLL version? Or are these just the fixes from when I started testing the drivers?)
+===== Danish (da) =====
-1:44+3:20+6:05 = 11:09
-Bulgarian CoNLL tags
+Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish.
-0:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44 = 35:19
-(but with Bulgarian I struggled quite a bit with phenomena that Interset had not covered until then)
-English CoNLL tags
+Total work time: about 7 hours
-0:48 - the statistics may be incomplete here, but maybe not, since it was enough to adapt the existing Penn Treebank driver, wasn't it?
-None of the conversions listed above (that is, everything written before October 2007) had the smart functions for replacing disallowed values at their disposal yet.
+===== English (en) =====
+==== Penn Treebank Tagset ====
+Penn Treebank (45 atomic tags). Detailed classification of punctuation.
+Total work time: about 3 hours
+==== CoNLL Tagset (derived from Penn tags) ====
+The driver is just an envelope around the ''en::penn'' driver.
+Total work time: 48 minutes
+===== German (de) =====
+==== Stuttgart-Tübingen Tagset (STTS) ====
+This is the tagset used in the Tiger treebank. It is quite syntax-oriented: the same word can often be tagged in a couple of different ways according to its function in a particular sentence. Pronouns are systematically categorized as substitutive (occurring instead of an NP), attributive (occurring inside an NP) and adverbial.
+The tags omit inflectional information (number and case of pronouns and articles, degree of comparison of adjectives, tense (Präteritum, Konjunktiv), person and number of verbs).
+Work started: 29.3.2008
+Work finished: 29.3.2008
+Total work time: 4:00 h
+==== CoNLL (derived from STTS) ====
+Only a simple envelope around the STTS driver was needed.
+Work started: 31.3.2008
+Work finished: 31.3.2008
+Total work time: 10 min
+===== Swedish (sv) =====
+==== Mamba and CoNLL ====
+Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories, but a detailed classification of auxiliary and modal verbs and punctuation. The CoNLL driver is just an envelope around Mamba.
+Total work time: about 3 hours
+==== Tags of Hajič's Swedish tagger ====
+Based on the PAROLE Swedish tagset, but some characters differ (@ => W) and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature).
+No reliable statistics of work time; estimated 8 hours
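The relation between the two tag forms can be sketched in a few lines of Python. The function name `parole_to_hajic` is hypothetical, the sample tag is only illustrative, and the sketch assumes that the dashes are appended on the right; the text above states the character substitution and the length of 9, not the side on which the padding goes.

```python
# Hypothetical helper; assumes right-padding, which the description does not confirm.

def parole_to_hajic(tag: str, width: int = 9) -> str:
    """Replace '@' with 'W' and pad the tag with dashes to a fixed width."""
    if len(tag) > width:
        raise ValueError(f"tag {tag!r} is longer than {width} characters")
    return tag.replace("@", "W").ljust(width, "-")


# Illustrative PAROLE-style tag with six positions.
print(parole_to_hajic("V@IPAS"))
```

Because the i-th position does not always encode the same feature, a driver cannot simply decode such a padded tag position by position; the padding only makes the strings uniform in length.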
+===== Time needed for tag set conversion =====
+Some records about targeted tagset conversion for given tagset pairs, done in early 2006:
+Russian treebank (not only the tags but the format conversion as a whole):
+:36
+Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
+4:45+1+1:40 = 7:25
