Differences

This shows you the differences between two versions of the page.

--- user:zeman:interset:drivers [2008/03/26 08:56]
zeman cs::conll finished.
+++ user:zeman:interset:drivers [2008/04/25 09:01]
zeman Portuguese work time summary.
@@ Line 2: / Line 2: @@
 This is an overview of existing tag set drivers. Tag-set or language specific issues are described here.
+===== Arabic (ar) =====
+The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
+Created in 2006-2007.
+Total work time: 13 hours
+===== Bulgarian (bg) =====
+The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that includes interrogative adverbs and numerals.
+Created in 2007.
+Total work time: 35 hours
+The main reasons why the implementation took so long:
+  * Necessity to re-work the system of main word classes, especially pronouns.
+  * Necessity to separate morphological and lexical definiteness (there are indefinite pronouns that are morphologically definite, and vice versa).
+  * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense).
+  * The driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent in the order and form in which it presented features.
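The round-trip requirement from the last item can be illustrated with a minimal sketch. The toy driver below is hypothetical: the `name=value|name=value` tag syntax, the feature names, and the canonical order are assumptions made for illustration, not the actual Interset driver interface. It shows why a tag whose features come in a non-canonical order fails the check.

```python
# Hypothetical toy driver, not the real Interset API.
# Assumed tag syntax: 'name=value|name=value', with '_' meaning no features.
CANONICAL_ORDER = ["pos", "gender", "number"]  # assumed feature names


def decode(tag: str) -> dict:
    """Parse a tag into a feature dict, accepting features in any order."""
    if tag == "_":
        return {}
    return dict(pair.split("=", 1) for pair in tag.split("|"))


def encode(features: dict) -> str:
    """Serialize features in one canonical order."""
    parts = [f"{name}={features[name]}"
             for name in CANONICAL_ORDER if name in features]
    return "|".join(parts) if parts else "_"


def round_trip_failures(tags):
    """Return the tags for which encode(decode(tag)) != tag."""
    return [tag for tag in tags if encode(decode(tag)) != tag]


tags = [
    "pos=N|gender=f|number=s",   # canonical order: survives the round trip
    "gender=f|pos=N|number=s",   # same features, different order: fails
    "_",
]
failures = round_trip_failures(tags)
print(failures)
```

A driver facing such input can either reproduce every ordering found in the data or leave the inconsistent variants off its list of known tags so that the tester never sees them.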
 ===== Chinese (zh) =====
@@ Line 26: / Line 46: @@
 Czech PDT tags (over 4000 tags; the core of Interset arose as a by-product while I was working on this): about 2 days, so say 18 hours. I spent another 11:09 hours when I started testing the drivers and had to fix this one. Again, part of that time went into debugging the test script, which was only just being written at that point.
 ==== CoNLL (derived from PDT) ====
@@ Line 41: / Line 59: @@
 More than half of the time was spent during testing, on tuning the tags that contain the Sem feature.
-===== Time needed for tag set conversion =====
+===== Danish (da) =====
-I am noting how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for an ordinary conversion is interesting, even though I know that I will actually save time only when a driver is reused.
+Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish.
-Russian treebank (not just the tags but the whole format conversion):
+Total work time: about 7 hours
-:36
-Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
+===== English (en) =====
-4:45+1+1:40 = 7:25
-Danish DDT/Parole tags (144 tags with elaborate documentation)
+==== Penn Treebank Tagset ====
-about 7 hours
-Swedish Mamba tags (48 tags)
+Penn Treebank (45 atomic tags). Detailed classification of punctuation.
-about 3 hours
-Penn Treebank (36 tags)
+Total work time: about 3 hours
-about 3 hours, but I was not yet timing the work here, so this is only a rough retrospective estimate
-Hajič's Swedish tags
+==== CoNLL Tagset (derived from Penn tags) ====
-:32 - the complete statistics are evidently missing here
-Arabic CoNLL tags
+The driver is just an envelope around the ''en::penn'' driver.
-4:33+5:19+3:16 = 13:08
-Bulgarian CoNLL tags
+Total work time: 48 minutes
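What "an envelope around the ''en::penn'' driver" can look like is sketched below. Everything in it is an assumption for illustration: the tiny Penn table, the feature names, the tab-separated CPOS/POS/FEATS layout of the CoNLL tag, and the rule that CPOS is the first two characters of the Penn tag. The point is only that the wrapper adds no linguistic logic of its own.

```python
# Toy stand-in for the en::penn driver (assumed: three tags, invented feature names).
PENN_TABLE = {
    "NN": {"pos": "noun", "number": "sing"},
    "NNS": {"pos": "noun", "number": "plur"},
    "VBD": {"pos": "verb", "tense": "past"},
}


def penn_decode(tag: str) -> dict:
    """Look up the features of a Penn tag."""
    return dict(PENN_TABLE[tag])


def penn_encode(features: dict) -> str:
    """Find the Penn tag whose features match exactly."""
    for tag, known in PENN_TABLE.items():
        if known == features:
            return tag
    raise ValueError(f"no Penn tag for {features}")


def conll_decode(tag: str) -> dict:
    """Envelope: unpack the assumed 'CPOS<TAB>POS<TAB>FEATS' triple and delegate POS."""
    cpos, pos, feats = tag.split("\t")
    return penn_decode(pos)


def conll_encode(features: dict) -> str:
    """Envelope: delegate to the Penn encoder, then rebuild the triple."""
    pos = penn_encode(features)
    return "\t".join([pos[:2], pos, "_"])  # assumed CPOS rule: first two characters


tag = "NN\tNNS\t_"
print(conll_encode(conll_decode(tag)) == tag)
```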
-0:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44 = 35:19
-(but with Bulgarian I struggled quite a lot with phenomena that Interset had not covered until then)
-English CoNLL tags
+===== German (de) =====
-0:48 - statistics may be missing here, or maybe not, since it was enough to adapt the existing Penn Treebank driver, right?
-None of the conversions listed above (i.e. everything written before October 2007) had access yet to the smart functions for replacing disallowed values.
+==== Stuttgart-Tübingen Tagset (STTS) ====
+This is the tagset used in the Tiger treebank. It is quite syntax-oriented: the same word can often be tagged in a couple of different ways, according to its function in a particular sentence. Pronouns are systematically categorized as substitutive (occur instead of an NP), attributive (occur inside an NP) and adverbial.
+The tags omit inflectional information (number and case of pronouns and articles, degree of comparison of adjectives, tense (Präteritum, Konjunktiv), person and number of verbs).
+Work started: 29.3.2008
+Work finished: 29.3.2008
+Total work time: 4:00 h
+==== CoNLL (derived from STTS) ====
+Only a simple envelope around the STTS driver was needed.
+Work started: 31.3.2008
+Work finished: 31.3.2008
+Total work time: 10 min
+===== Portuguese (pt) =====
+The Portuguese CoNLL treebank contains tags with 149 different features. A large part of them is noise, probably introduced by the conversion procedure from the original Floresta format to the CoNLL format. The driver is designed so that it accepts all incorrect tags on decoding but encodes only corrected tags. Incorrect tags are not on the list of possible tags, so the driver tester will not complain.
+http://visl.sdu.dk/visl/pt/info/symbolset-floresta.html
+http://en.wikipedia.org/wiki/Portuguese_grammar
+Work started: 2.4.2008
+Work finished: 24.4.2008
+Total work time: 28:18 h
+The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex, with many features, some of them strangely overlapping and some undocumented; there was also a terrible proportion of noise, typos and other errors introduced in the annotation.
+| **Feature** | **Explanation** | **Examples** |
+| _ | no features | prepositions, punctuation etc. |
+| 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira |
+| 1S | 1st person singular | tenho, tinha, usei, vivo, vou |
+| 1P | 1st person plural | tomámos, vamos, vemos, víamos |
+| 2S | 2nd person singular | compreendeste, queres, te, ti, veja, vives |
+| 2P | 2nd person plural | chamais, vós |
+| 3S | 3rd person singular | viu, viva |
+| 3S/P | 3rd person singular or plural | se, si |
+| 3P | 3rd person plural | vivem |
+| ACC | pronoun as direct accusative object | se, te, vos |
+| ACC/DAT | pronouns in accusative or dative | nos, se |
+| COND | verb in conditional mood | precisariam, seriam, tentaria, venderia, viriam |
+| DAT | pronoun as dative object | lhe, lhes, me, no, nos, se, vos |
+| F | feminine | |
+| F/M | feminine or masculine | |
+| FUT | future tense of verbs | tenderão, tomará, usará |
+| IMP | imperative mood of verbs | chega, move, olha, sê |
+| IMPF | imperfect tense of verbs | abandonasse, abandonava, abria |
+| IND | indicative mood of verbs | abafaram, abandonam, abate, abateu |
+| M | masculine | açúcar, adepto, adiantado |
+| M/F | masculine or feminine | Abidjan, cada, Chaves, especial |
+| MQP | pluperfect past tense of verbs | acabara, defendera, existira, foram, quisera, viram |
+| NOM | personal pronoun in nominative | ela, elas, ele, eles, eu, nós, vocês, você, vós |
+| NOM/PIV | personal pronoun in nominative or prepositional object | ela, elas, ele, eles, nós, você |
+| P | plural | 0,92, 14h00, africanos, águas, Amigos_da_Ilha_de_Santos |
+| PIV | pronoun in prepositional object | ela, elas, ele, eles, mim, nós, si, ti, vós |
+| PR | present tense of verbs | abandonam, abate, abonam, abordo, abra |
+| PR/PS | present or past tense of verbs | conhecemos, conseguimos, decidimos |
+| PS | perfect past tense of verbs | abalou, abandonaram, abandonou, abateu |
+| PS/MQP | perfect or pluperfect past tense of verbs | abafaram, abriram, acabaram, aceitaram |
+| S | singular | 1992, adicional, aditamento, aduaneira |
+| S/P | singular or plural | capaz, Chaves, mais |
+| SUBJ | subjunctive mood of verbs | abandonasse, abra, abram |
+| <ALT> | indicates typo in word | |
+| <DERP> | derivation by prefixation | hidroginástica, interactivo, supercomputação |
+| <DERS> | derivation by suffixation | neo-comunista, pessedebismo, tropologia |
+| <KOMP> | comparative hook determiner or adverb | assim, inferior, maior, mais, melhor, mesma, outra, piores, tanto |
+| <NUM-ord> | ordinal number, subclass of adjectives | 10º, 113ª, 1., primeiro, terços, última, XIV |
+| <SUP> | superlative of adjectives and adverbs | inferior, máximo, melhor, mínimo, ótimo, péssimo, pior |
+| <artd> | definite article or determiner pronoun | a, as, o, os |
+| <arti> | indefinite article or determiner pronoun | uma, um |
+| <card> | cardinal number | um, uma, dois, três, quatro, cinco |
+| <co-acc> | coordination of direct accusative objects | |
+| <co-advl> | coordination of adjunct adverbials | |
+| <co-advo> | coordination of argument adverbials, object related | |
+| <co-advs> | coordination of argument adverbials, subject related | |
+| <co-app> | coordination of adnominal appositions | |
+| <co-fmc> | coordination of main clauses | |
+| <co-ger> | coordination of gerunds | |
+| <co-inf> | coordination of infinitives | |
+| <co-oc> | coordination of object complements | |
+| <co-pass> | coordination of passive adjuncts | |
+| <co-pcv> | coordination of predicative participles | |
+| <co-piv> | coordination of prepositional objects | |
+| <co-postad> | coordination of postpositioned dependents in ap or advp | |
+| <co-postnom> | coordination of postpositioned dependents in np | |
+| <co-pred> | coordination of adjunct predicatives | |
+| <co-prenom> | coordination of prepositioned dependents in np | |
+| <co-prparg> | coordination of preposition arguments | |
+| <co-sc> | coordination of subject complements | |
+| <co-subj> | coordination of subjects | |
+| <co-vfin> | coordination of finite verbs | |
+| <coll> | collective reflexive pronoun | se (reunir-se, associar-se) |
+| <dem> | demonstrative pronoun or adverb | este, isso, isto, o, os, tais, tal, tão |
+| <det> | determiner usage / inflection of adverb | algo, meio, nada, quase, todo, um_tanto |
+| <diff> | differentiator | mesmo, outro, semelhante, tal |
+| <error> | probably processing error, not typo | |
+| <fmc> | verb heading finite main clause | |
+| <foc> | focus marker, adverb or pronoun | é_que, foi, fomos, que, são, será |
+| <hyfen> | separated hyphenated prefix, usually of verbs | tinha-, unia-, verifica- |
+| <ident> | identifier pronoun | mesmo, próprio |
+| <interr> | interrogative pronoun or adverb | como, onde, porque, quais, qual, quando, quanto, quem, que |
+| <kc> | conjunctional adverb | agora, aí, bem_como, como, ora, tal_como, todavia |
+| <ks> | adverb or preposition used like a subordinating conjunction | como, enquanto, onde, quando, segundo |
+| <n> | other word class used as noun, typically as head of noun phrase | anglo-americano, claro, feliz |
+| <poss | possessive determiner pronoun | meu, meus, minha, minhas, nossa, nossas, nosso, nossos, seu, seus, sua |
+| <prop> | other word class used as proper noun | Abril, Administração, Aeronáutica |
+| <prp> | other word class used as preposition | como, conforme, consoante, embora, segundo |
+| <quant> | indefinite quantifier adverb or pronoun | algo, ambas, bastante, bem, cada, certos, diversas, mais, menos |
+| <reci> | reciprocal reflexive | se (amar-se) |
+| <refl> | reflexive pronoun | se, me, te, nos, vos, si |
+| <rel> | relative pronoun or adverb | à_medida_que, como, cuja, donde, enquanto, quando, quão |
+| <-sam> | 2nd part in contracted word (nisto --> isto) | |
+| <sam-> | 1st part in contracted word (nisto --> em) | abaixo_de, a_cargo_de, ao_largo_de, apesar_de, em_face_de |
+| <si> | reflexive usage of 3rd person possessive | seu, seus, sua, suas |
+| <eg> | undocumented feature | 2 occurrences with cardinal numbers |
+| <Eg> | undocumented feature | occurs with numbers, adjectives and pronouns |
+| <Em> | undocumented feature | 6 occurrences with adjectives |
+| <Es> | undocumented feature | 3 occurrences with adverbs and prepositions |
+| <ink> | undocumented feature of finite verbs | está, havia, pode, tentou |
+| <mente> | undocumented feature; feminine adjective that can serve as base for derivation using the "-mente" suffix | directa, pura, rápida |
+| <meta> | undocumented feature of adverbs | afinal, só |
+| N | undocumented feature of nouns and articles | 15 occurrences |
+| <new> | undocumented feature | |
+| <nil> | undocumented feature | |
+| <obj> | undocumented feature | se |
+| <p> | undocumented feature | 1 occurrence |
+| <parkc-1> | undocumented feature of conjunctions and adverbs | assim, nem, ou, tanto |
+| <parkc-2> | undocumented feature of conjunctions, adverbs and prepositions | como, como_também, e, nem, ou, tampouco |
+| <postmod> | undocumented feature | 3 occurrences |
+| <premod> | undocumented feature of adverbs | |
+| > | noise; should be ignored | |
+| 0/1/3S | noise; should probably be 1/3S | |
+| 1 | noise; should be 1S | aproveitaria, saiba, tinha, vivia |
+| 1S> | noise; should be 1S | meu, meus, minha, minhas |
+| 1P> | noise; should be 1P | nossa, nossas, nosso, nossos |
+| 2S> | noise; should be 2S | seu, teu |
+| 2P> | noise; should be 2P | vossa, vosso |
+| 3S> | noise; should be 3S | seu, seus, sua, suas |
+| 3S/P> | noise; should be 3S/P | seu, seus, sua |
+| 3P> | noise; should be 3P | seu, seus, sua |
+| <adv> | noise? | fundo |
+| <advl> | noise; should be <co-advl> | e |
+| <co-adv> | noise; should be <co-advl> | |
+| >co-fmc> | noise; should be <co-fmc> | |
+| <co-fmv> | noise; should be <co-fmc> | |
+| convidado-> | noise; should be <ALT> | |
+| <co-postnom | noise; should be <co-postnom> | |
+| <co-prparg | noise; should be <co-prparg> | |
+| <corr | noise; should be <ALT> | |
+| <co-vfin><co-fmc> | noise; should be two features | |
+| <Eg>F | noise; should be two features | |
+| <Eg>M | noise; should be two features | |
+| <F | noise; should be F | |
+| GER | noise; redundant gerund marker | 1 occurrence with v-ger |
+| <hyphen> | noise; should be <hyfen> | sofrê- |
+| INF | noise; redundant infinitive marker | 2 occurrences with <hyfen> |
+| 'Maio | noise | Maio |
+| MVF | noise; should be MV and F | motivada |
+| NUM | noise; redundant numeral marker | 1994 |
+| pasando> | noise; should be <ALT> | passando |
+| PCP | noise; redundant participle marker | 2 occurrences |
+| <postmod>F | noise; should be two features | |
+| <postnom> | noise; should be <co-postnom> | |
+| PROP | noise | 2 occurrences |
+| <prop>M | noise; should be two features | |
+| <prparg> | noise; should be <co-prparg> | |
+| R | noise; should be PR | 2 occurrences |
+| recohidas> | noise; should be <ALT> | recolhidas |
+| <rel><ks> | noise; should be two features | |
+| s | noise; should be S | |
+| saiem> | noise; should be <ALT> | saem |
+| <-sam><arti> | noise; should be two features | |
+| <-sam><dem> | noise; should be two features | |
+| <sc> | noise; should be <co-sc> | |
+| subordinanda> | noise; should be <ALT> | subordinada |
+| V | noise; redundant verb marker | |
+| <vfin> | noise; should be <co-vfin> | |
+| VFIN | noise | há from haver |
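The noise rows of the table suggest a normalization step applied while decoding. The sketch below is an illustration under assumptions, not the driver's actual code: it assumes features are separated by `|` in the FEATS column, it covers only a handful of the replacements listed above, and the function names are invented. Truncated features such as `<poss` are passed through unchanged.

```python
import re

# A few direct replacements copied from the noise rows of the table above.
REPLACEMENTS = {
    "1": "1S", "1S>": "1S", "1P>": "1P", "2S>": "2S", "2P>": "2P",
    "3S>": "3S", "3S/P>": "3S/P", "3P>": "3P",
    "<hyphen>": "<hyfen>", "<co-fmv>": "<co-fmc>", ">co-fmc>": "<co-fmc>",
    "<F": "F", "s": "S", "R": "PR",
}
IGNORED = {">"}  # noise that should simply be dropped


def split_fused(feature: str) -> list:
    """Split fused features such as '<Eg>F' or '<rel><ks>' into their parts."""
    parts = re.findall(r"<[^<>]+>|[^<>]+", feature)
    # Split only if the parts cover the whole string and there is more than one.
    if len(parts) > 1 and "".join(parts) == feature:
        return parts
    return [feature]


def normalize(feats: str) -> str:
    """Return a cleaned FEATS string; '_' means no features."""
    cleaned = []
    for raw in feats.split("|"):
        if raw in IGNORED:
            continue
        for part in split_fused(REPLACEMENTS.get(raw, raw)):
            cleaned.append(REPLACEMENTS.get(part, part))
    return "|".join(cleaned) if cleaned else "_"


# Hypothetical input strings, built from feature values listed in the table.
print(normalize("<poss|1S>|M|S"))
print(normalize("<Eg>F|s"))
print(normalize("<co-vfin><co-fmc>|3S"))
```

Keeping the corrections in a lookup table separate from the list of encodable tags matches the design described above: decoding tolerates the noisy forms, while only the corrected forms are ever produced on encoding.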
+===== Swedish (sv) =====
+==== Mamba and CoNLL ====
+Mamba tagset of Talbanken05. 48 tags; no morphosyntactic categories, but a detailed classification of auxiliary and modal verbs and of punctuation. The CoNLL driver is just an envelope around Mamba.
+Total work time: about 3 hours
+==== Tags of Hajič's Swedish tagger ====
+Based on the Swedish PAROLE tagset, but some characters differ (@ => W) and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature).
+No reliable statistics of work time; estimated 8 hours
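The relation between a PAROLE tag and the tagger's variant can be sketched as follows. This is a guess at the surface transformation only: the function name is invented, padding on the right is an assumption (the description does not say where the dashes go), and the input tags are used purely as illustrative strings.

```python
TAG_LENGTH = 9  # uniform length stated in the description above


def parole_to_tagger_form(tag: str) -> str:
    """Replace '@' by 'W' and pad with dashes to 9 characters (right padding assumed)."""
    if len(tag) > TAG_LENGTH:
        raise ValueError(f"tag longer than {TAG_LENGTH} characters: {tag}")
    return tag.replace("@", "W").ljust(TAG_LENGTH, "-")


# Illustrative inputs in the style of PAROLE tags.
print(parole_to_tagger_form("V@IPAS"))
print(parole_to_tagger_form("NCUSN@IS"))
```

Because the padded positions do not line up with fixed features, a driver cannot decode such tags position by position without first looking at the word class.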
+===== Time needed for tag set conversion =====
+Some records about targeted tagset conversion for given tagset pairs, done in early 2006:
+Russian treebank (not just the tags but the whole format conversion):
+:36
+Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
+4:45+1+1:40 = 7:25
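The time records are sums of work sessions written as h:mm, with a bare number meaning whole hours. A small sketch for checking such a sum, taking the first term of the Arabic record as 4:45 (the value consistent with the stated total of 7:25); the helper names are invented for illustration:

```python
def to_minutes(term: str) -> int:
    """Convert 'h:mm' or a bare hour count such as '1' to minutes."""
    hours, _, minutes = term.partition(":")
    return int(hours) * 60 + int(minutes or 0)


def format_hmm(total_minutes: int) -> str:
    """Format minutes as h:mm."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


def sum_record(record: str) -> str:
    """Sum a '+'-separated list of sessions and return the total as h:mm."""
    return format_hmm(sum(to_minutes(term) for term in record.split("+")))


total = sum_record("4:45+1+1:40")
print(total)
```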


Institute of Formal and Applied Linguistics Wiki