[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
user:zeman:interset:drivers [2008/04/03 18:05]
zeman Restructuralization.
user:zeman:interset:drivers [2014/07/17 16:32] (current)
zeman hr::multext
Line 1: Line 1:
 ====== Tag Set Drivers ====== ====== Tag Set Drivers ======
  
-This is an overview of existing tag set drivers. Tag-set or language specific issues are described here.+This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort.
  
 ===== Arabic (ar) ===== ===== Arabic (ar) =====
 +
 +==== CoNLL 2006 ====
  
 The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
Line 9: Line 11:
 Created in 2006-2007. Created in 2006-2007.
 Total work time: 13 hours Total work time: 13 hours
 +
 +==== CoNLL 2007 ====
 +
 +The Arabic tags in CoNLL 2007 slightly differed from 2006. There are also new tags. The driver ''ar::conll2007'' was cloned from ''ar::conll'' and modified.
 +
 +Created: 23.6.2011
 +Total work time: 2 hours
  
 ===== Bulgarian (bg) ===== ===== Bulgarian (bg) =====
Line 36: Line 45:
  
 Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes.
 +
 +===== Croatian (hr) =====
 +
 +==== Multext ====
 +
 +The tagset of the MULTEXT-EAST project as used in the SETimes.HR corpus. Documentation lists 1291 tags, we removed one wrong tag and kept 1290.
 +
 +Work started: 16.7.2014
 +Work finished: 17.7.2014
 +Total work time: 5:45 h
 +
 +This is the second Multext-East tagset covered by DZ Interset. Adding it was not too difficult because much of the previous effort on ''cs::multext'' could be reused.
  
 ===== Czech (cs) ===== ===== Czech (cs) =====
Line 47: Line 68:
 České značky PDT (přes 4000 značek; jádro Intersetu vzniklo jako vedlejší produkt, když jsem dělal tohle) asi 2 dny, tedy dejme tomu 18 hodin. Dalších 11:09 hodin jsem spotřeboval, když jsem začal ovladače testovat a musel jsem tenhle opravovat. Opět platí, že část času zabralo ladění testovacího skriptu, který v té době teprve vznikal. České značky PDT (přes 4000 značek; jádro Intersetu vzniklo jako vedlejší produkt, když jsem dělal tohle) asi 2 dny, tedy dejme tomu 18 hodin. Dalších 11:09 hodin jsem spotřeboval, když jsem začal ovladače testovat a musel jsem tenhle opravovat. Opět platí, že část času zabralo ladění testovacího skriptu, který v té době teprve vznikal.
  
-==== CoNLL (derived from PDT) ====+==== CoNLL 2006 ====
  
 The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, encoded in lemmas in the PDT, has been encoded as a new feature called ''Sem'' in CoNLL data. README refers the following documentation: [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tag|part of speech and most features]] | [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#sem-info|lemma features]] The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, encoded in lemmas in the PDT, has been encoded as a new feature called ''Sem'' in CoNLL data. README refers the following documentation: [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tag|part of speech and most features]] | [[http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#sem-info|lemma features]]
Line 58: Line 79:
  
 More than half of the time was consumed during testing for tuning tags containing the Sem feature. More than half of the time was consumed during testing for tuning tags containing the Sem feature.
 +
 +==== CoNLL 2009 ====
 +
 +The [[:format-conll|CoNLL data format]] has changed. Formerly (2006 and 2007) there were three relevant columns (coarse-grained part of speech, fine-grained part of speech, and features) that we combined (using tabs) into one tag string. As of 2009, there are only two columns left, namely part of speech and features. For the Czech tags this further means that there is a new feature in the 'features' column. It is called ''SubPOS'', it is present in all tags and its value is one character, copied from the second position of the standard PDT tag. Otherwise, the tags should be identical to those of CoNLL 2007, including the ''Sem'' feature.
 +
 +The ''Sem'' feature can have more values than previously. This is caused by the extension of the [[http://ufal.mff.cuni.cz/~zeman/publikace/2005-01/mmanual.html#lemma-term|term value set]] in the Prague Dependency Treebank 2.0 (in contrast to 1.0), so the change actually applied already to CoNLL 2007 data. However, CoNLL 2007 uses the older ''cs::conll'' driver.
 +
 +Work started: 24.3.2009
 +Work finished: 24.3.2009
 +Total work time: 1:10 h
 +
 +==== Multext ====
 +
 +The tagset of the MULTEXT-EAST project and corpora. The file ''mte-lex/wfl-cs.tbl'' contains 1428 unique tags (which is not to say that other tags are not possible). The corpora are stored in a TEI-compliant SGML format. It is easily readable except that non-ASCII characters are encoded using SGML entities.
 +
 +Work started: 16.2.2009
 +Work finished: 18.2.2009
 +Total work time: 16:36 h
 +
 +Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the ''other'' feature led to wrong feature-value combinations in conversions to/from other Czech sets. DZ Interset had to be slightly modified in response to this tagset, and more changes that initiated here will be done later.
 +
 +==== Prague Spoken Corpus ====
 +
 +The Prague Spoken Corpus (Pražský mluvený korpus, PMK) is distributed together with the frequency dictionary of spoken Czech (book). It uses very strange tags and very many of them (over 10000!) Extremely high portion of the tags has to rely on the ''other'' feature. There are two types of tags: long and short.
 +
 +Work started: 26.11.2009
 +Work finished: 4.10.2010
 +Total work time: 57 hours
  
 ===== Danish (da) ===== ===== Danish (da) =====
Line 73: Line 122:
 Total work time: about 3 hours Total work time: about 3 hours
  
-==== CoNLL Tagset (derived from Penn tags) ====+==== CoNLL 2006 ====
  
 The driver is just an envelope around the ''en::penn'' driver. The driver is just an envelope around the ''en::penn'' driver.
  
 Total work time: 48 minutes Total work time: 48 minutes
 +
 +==== CoNLL 2009 ====
 +
 +Another envelope around the ''en::penn'' driver. However, three new tags required changes even in the older drivers: ''HYPH'', ''AFX'' (''PRF'') and ''NIL''.
 +
 +Work started: 25.3.2009
 +Work finished: 25.3.2009
 +Total work time: 2:57 h
  
 ===== German (de) ===== ===== German (de) =====
Line 91: Line 148:
 Total work time: 4:00 h Total work time: 4:00 h
  
-==== CoNLL (derived from STTS) ====+==== CoNLL 2006 ====
  
 Only simple envelope around the STTS driver needed. Only simple envelope around the STTS driver needed.
Line 100: Line 157:
  
  
 +==== CoNLL 2009 ====
  
 +This tagset is derived from the STTS, too. Unlike CoNLL 2006, there are also morphological features this time, which required additional processing effort.
 +
 +Work started: 5.4.2009
 +Work finished: 6.4.2009
 +Total work time: 9:39 h
 +
 +===== Polish (pl) =====
 +
 +Based on the [[http://korpus.pl/index.php|Korpus Języka Polskiego IPI PAN]]. (Saša tyhle značky potřebuje zpracovat v Intercorpu.) Moderate amount of new stuff but it is one of the fairly complex Slavic tagsets. And it contributed to [[how-to-write-a-driver#replacing-and-the-other-feature|new treatment of o-tags]] (those setting the ''other'' feature) when learning permitted feature-value combinations.
 +
 +Work started: 4.9.2009
 +Work finished: 8.9.2009
 +Total work time: 9:54 h
  
 ===== Portuguese (pt) ===== ===== Portuguese (pt) =====
  
 The Portuguese CoNLL treebank contains tags with 149 different features. Big part of them are noise, probably introduced by the conversion procedure from the original Floresta format to the CoNLL format. The driver is designed so that it accepts all incorrect tags on decoding but encodes only corrected tags. Incorrect tags are not on the list of possible tags so the driver tester will not complain. The Portuguese CoNLL treebank contains tags with 149 different features. Big part of them are noise, probably introduced by the conversion procedure from the original Floresta format to the CoNLL format. The driver is designed so that it accepts all incorrect tags on decoding but encodes only corrected tags. Incorrect tags are not on the list of possible tags so the driver tester will not complain.
 +
 +http://visl.sdu.dk/visl/pt/info/symbolset-floresta.html
 +http://en.wikipedia.org/wiki/Portuguese_grammar
 +
 +Work started: 2.4.2008
 +Work finished: 24.4.2008
 +Total work time: 28:18 h
 +
 +The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex with many features, some of them strangely overlapping, some of them undocumented. There was also a terrible proportion of noise, typos or otherwise introduced errors in annotation.
  
 | **Feature** | **Explanation** | **Examples** | | **Feature** | **Explanation** | **Examples** |
 | _ | no features | prepositions, punctuation etc. | | _ | no features | prepositions, punctuation etc. |
-| 1 | 1st person | | 
 | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira | | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira |
 | 1S | 1st person singular | tenho, tinha, usei, vivo, vou | | 1S | 1st person singular | tenho, tinha, usei, vivo, vou |
Line 121: Line 200:
 | COND | verb in conditional mood | precisariam, seriam, tentaria, venderia, viriam | | COND | verb in conditional mood | precisariam, seriam, tentaria, venderia, viriam |
 | DAT | pronoun as dative object | lhe, lhes, me, no, nos, se, vos | | DAT | pronoun as dative object | lhe, lhes, me, no, nos, se, vos |
 +| F | feminine | |
 +| F/M | feminine or masculine | |
 +| FUT | future tense of verbs | tenderão, tomará, usará |
 +| IMP | imperative mood of verbs | chega, move, olha, sê |
 +| IMPF | imperfect tense of verbs | abandonasse, abandonava, abria |
 +| IND | indicative mood of verbs | abafaram, abandonam, abate, abateu |
 +| M | masculine | açúcar, adepto, adiantado |
 +| M/F | masculine or feminine | Abidjan, cada, Chaves, especial |
 +| MQP | pluperfect past tense of verbs | acabara, defendera, existira, foram, quisera, viram |
 +| NOM | personal pronoun in nominative | ela, elas, ele, eles, eu, nós, vocês, você, vós |
 +| NOM/PIV | personal pronoun in nominative or prepositional object | ela, elas, ele, eles, nós, você |
 +| P | plural | 0,92, 14h00, africanos, águas, Amigos_da_Ilha_de_Santos |
 +| PIV | pronoun in prepositional object | ela, elas, ele, eles, mim, nós, si, ti, vós |
 +| PR | present tense of verbs | abandonam, abate, abonam, abordo, abra |
 +| PR/PS | present or past tense of verbs | conhecemos, conseguimos, decidimos |
 +| PS | perfect past tense of verbs | abalou, abandonaram, abandonou, abateu |
 +| PS/MQP | perfect or pluperfect past tense of verbs | abafaram, abriram, acabaram, aceitaram |
 +| S | singular | 1992, adicional, aditamento, aduaneira |
 +| S/P | singular or plural | capaz, Chaves, mais |
 +| SUBJ | subjunctive mood of verbs | abandonasse, abra, abram |
 | <ALT> | indicates typo in word | | | <ALT> | indicates typo in word | |
 +| <DERP> | derivation by prefixation | hidroginástica, interactivo, supercomputação |
 +| <DERS> | derivation by suffixation | neo-comunista, pessedebismo, tropologia |
 +| <KOMP> | comparative hook determiner or adverb | assim, inferior, maior, mais, melhor, mesma, outra, piores, tanto |
 +| <NUM-ord> | ordinal number, subclass of adjectives | 10º, 113ª, 1., primeiro, terços, última, XIV |
 +| <SUP> | superlative of adjectives and adverbs | inferior, máximo, melhor, mínimo, ótimo, péssimo, pior |
 | <artd> | definite article or determiner pronoun | a, as, o, os | | <artd> | definite article or determiner pronoun | a, as, o, os |
 | <arti> | indefinite article or determiner pronoun | uma, um | | <arti> | indefinite article or determiner pronoun | uma, um |
Line 146: Line 250:
 | <co-vfin> | coordination of finite verbs | | | <co-vfin> | coordination of finite verbs | |
 | <coll> | collective reflexive pronoun | se (reunir-se, associar-se) | | <coll> | collective reflexive pronoun | se (reunir-se, associar-se) |
 +| <dem> | demonstrative pronoun or adverb | este, isso, isto, o, os, tais, tal, tão |
 +| <det> | determiner usage / inflection of adverb | algo, meio, nada, quase, todo, um_tanto |
 +| <diff> | differentiator | mesmo, outro, semelhante, tal |
 +| <error> | probably processing error, not typo | |
 +| <fmc> | verb heading finite main clause | |
 +| <foc> | focus marker, adverb or pronoun | é_que, foi, fomos, que, são, será |
 +| <hyfen> | separated hyphenated prefix, usually of verbs | tinha-, unia-, verifica- |
 +| <ident> | identifier pronoun | mesmo, próprio |
 +| <interr> | interrogative pronoun or adverb | como, onde, porque, quais, qual, quando, quanto, quem, que |
 +| <kc> | conjunctional adverb | agora, aí, bem_como, como, ora, tal_como, todavia |
 +| <ks> | adverb or preposition used like a subordinating conjunction | como, enquanto, onde, quando, segundo |
 +| <n> | other word class used as noun, typically as head of noun phrase | anglo-americano, claro, feliz |
 +| <poss | possessive determiner pronoun | meu, meus, minha, minhas, nossa, nossas, nosso, nossos, seu, seus, sua |
 +| <prop> | other word class used as proper noun | Abril, Administração, Aeronáutica |
 +| <prp> | other word class used as preposition | como, conforme, consoante, embora, segundo |
 +| <quant> | indefinite quantifier adverb or pronoun | algo, ambas, bastante, bem, cada, certos, diversas, mais, menos |
 +| <reci> | reciprocal reflexive | se (amar-se) |
 +| <refl> | reflexive pronoun | se, me, te, nos, vos, si |
 +| <rel> | relative pronoun or adverb | à_medida_que, como, cuja, donde, enquanto, quando, quão |
 +| <-sam> | 2nd part in contracted word (nisto --> isto) | |
 +| <sam-> | 1st part in contracted word (nisto --> em) | abaixo_de, a_cargo_de, ao_largo_de, apesar_de, em_face_de |
 +| <si> | reflexive usage of 3rd person possessive | seu, seus, sua, suas |
 +| <eg> | undocumented feature | 2 occurrences with cardinal numbers |
 +| <Eg> | undocumented feature | occurs with numbers, adjectives and pronouns |
 +| <Em> | undocumented feature | 6 occurrences with adjectives |
 +| <Es> | undocumented feature | 3 occurrences with adverbs and prepositions |
 +| <ink> | undocumented feature of finite verbs | está, havia, pode, tentou |
 +| <mente> | undocumented feature; feminine adjective that can serve as base for derivation using the "-mente" suffix | directa, pura, rápida |
 +| <meta> | undocumented feature of adverbs | afinal, só |
 +| N | undocumented feature of nouns and articles | 15 occurrences |
 +| <new> | undocumented feature | |
 +| <nil> | undocumented feature | |
 +| <obj> | undocumented feature | se |
 +| <p> | undocumented feature | 1 occurrence |
 +| <parkc-1> | undocumented feature of conjunctions and adverbs | assim, nem, ou, tanto |
 +| <parkc-2> | undocumented feature of conjunctions, adverbs and prepositions | como, como_também, e, nem, ou, tampouco |
 +| <postmod> | undocumented feature | 3 occurrences |
 +| <premod> | undocumented feature of adverbs | |
 | > | noise; should be ignored | | | > | noise; should be ignored | |
 | 0/1/3S | noise; should probably be 1/3S | | | 0/1/3S | noise; should probably be 1/3S | |
 +| 1 | noise; should be 1S | aproveitaria, saiba, tinha, vivia |
 | 1S> | noise; should be 1S | meu, meus, minha, minhas | | 1S> | noise; should be 1S | meu, meus, minha, minhas |
 | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos | | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos |
Line 165: Line 308:
 | <corr | noise; should be <ALT> | | | <corr | noise; should be <ALT> | |
 | <co-vfin><co-fmc> | noise; should be two features | | | <co-vfin><co-fmc> | noise; should be two features | |
 +| <Eg>F | noise; should be two features | |
 +| <Eg>M | noise; should be two features | |
 +| <F | noise; should be F | |
 +| GER | noise; redundant gerund marker | 1 occurrence with v-ger |
 +| <hyphen> | noise; should be <hyfen> | sofrê- |
 +| INF | noise; redundant infinitive marker | 2 occurrences with <hyfen> |
 +| 'Maio | noise | Maio |
 +| MVF | noise; should be MV and F | motivada |
 +| NUM | noise; redundant numeral marker | 1994 |
 +| pasando> | noise; should be <ALT> | passando |
 +| PCP | noise; redundant participle marker | 2 occurrences |
 +| <postmod>F | noise; should be two features | |
 +| <postnom> | noise; should be <co-postnom> | |
 +| PROP | noise | 2 occurrences |
 +| <prop>M | noise; should be two features | |
 +| <prparg> | noise; should be <co-prparg> | |
 +| R | noise; should be PR | 2 occurrences |
 +| recohidas> | noise; should be <ALT> | recolhidas |
 +| <rel><ks> | noise; should be two features | |
 +| s | noise; should be S | |
 +| saiem> | noise; should be <ALT> | saem |
 +| <-sam><arti> | noise; should be two features | |
 +| <-sam><dem> | noise; should be two features | |
 +| <sc> | noise; should be <co-sc> | |
 +| subordinanda> | noise; should be <ALT> | subordinada |
 +| V | noise; redundant verb marker | |
 +| <vfin> | noise; should be <co-vfin> | |
 +| VFIN | noise | há od haver |
 +
 +===== Slovak (sk) =====
 +
 +==== Slovenský národný korpus (SNK) ====
 +
 +1457 structured tags.
 +
 +Total work time: 5:32 hours.
  
 ===== Swedish (sv) ===== ===== Swedish (sv) =====

[ Back to the navigation ] [ Back to the content ]