Tag Set Drivers

This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort.

Arabic (ar)

CoNLL 2006

The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.

Created in 2006-2007.
Total work time: 13 hours

CoNLL 2007

The Arabic tags in CoNLL 2007 slightly differed from 2006. There are also new tags. The driver ar::conll2007 was cloned from ar::conll and modified.

Created: 23.6.2011
Total work time: 2 hours

Bulgarian (bg)

The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: sophisticated system of pronouns includes interrogative adverbs and numerals.

Created in 2007.
Total work time: 35 hours

The main reasons why the implementation took so long:

Necessity to re-work the system of main word classes, especially pronouns.
Necessity to separate morphological and lexical definiteness (there are indefinite pronouns morphologically definite, and vice versa).
Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense).
Driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent, in the order and form in which it presented features.

Chinese (zh)

The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost zero supplied with CoNLL data, and only a little found in the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset.

Pronouns are special cases of nouns. Numerals are special cases of determiners.

There are many sorts of particles, some of which have special tags (DE).

Work started: 21.10.2007
Work finished: 5.3.2008
Total work time: 21:30 h

Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes.

Croatian (hr)

Multext

The tagset of the MULTEXT-EAST project as used in the SETimes.HR corpus. Documentation lists 1291 tags, we removed one wrong tag and kept 1290.

Work started: 16.7.2014
Work finished: 17.7.2014
Total work time: 5:45 h

This is the second Multext-East tagset covered by DZ Interset. Adding it was not too difficult because much of the previous effort on cs::multext could be reused.

Czech (cs)

Prague Dependency Treebank (PDT)

Při práci na tomto ovladači jsem ještě neměl k dispozici chytré funkce pro zajištění povolených značek.

Jde zatím o nejrozsáhlejší sadu značek, se kterou jsem se setkal. Obsahuje 4288 značek.

České značky PDT (přes 4000 značek; jádro Intersetu vzniklo jako vedlejší produkt, když jsem dělal tohle) asi 2 dny, tedy dejme tomu 18 hodin. Dalších 11:09 hodin jsem spotřeboval, když jsem začal ovladače testovat a musel jsem tenhle opravovat. Opět platí, že část času zabralo ladění testovacího skriptu, který v té době teprve vznikal.

CoNLL 2006

The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, encoded in lemmas in the PDT, has been encoded as a new feature called Sem in CoNLL data. README refers the following documentation: part of speech and most features | lemma features

The list of tags of this tagset contains equivalents of all original PDT tags. In addition, it contains those tags with the Sem feature set, that occur in CoNLL data, and a few more. The Sem values are currently stored in the other feature of Interset. At the same time, subpos = “prop” is set if Sem is set and subpos would otherwise be empty. (The original PDT tags cannot distinguish proper from common nouns.) If the encoder encounters subpos = “prop”, it uses the default value “Sem=m”. The “few more” tags were added to the list whenever there was a tag Foo=bar|Sem=something and there was not the default Foo=bar|Sem=m.

Work started: 25.3.2008
Work finished: 25.3.2008
Total work time: 6:02 h

More than half of the time was consumed during testing for tuning tags containing the Sem feature.

CoNLL 2009

The CoNLL data format has changed. Formerly (2006 and 2007) there were three relevant columns (coarse-grained part of speech, fine-grained part of speech, and features) that we combined (using tabs) into one tag string. As of 2009, there are only two columns left, namely part of speech and features. For the Czech tags this further means that there is a new feature in the 'features' column. It is called SubPOS, it is present in all tags and its value is one character, copied from the second position of the standard PDT tag. Otherwise, the tags should be identical to those of CoNLL 2007, including the Sem feature.

The Sem feature can have more values than previously. This is caused by the extension of the term value set in the Prague Dependency Treebank 2.0 (in contrast to 1.0), so the change actually applied already to CoNLL 2007 data. However, CoNLL 2007 uses the older cs::conll driver.

Work started: 24.3.2009
Work finished: 24.3.2009
Total work time: 1:10 h

Multext

The tagset of the MULTEXT-EAST project and corpora. The file mte-lex/wfl-cs.tbl contains 1428 unique tags (which is not to say that other tags are not possible). The corpora are stored in a TEI-compliant SGML format. It is easily readable except that non-ASCII characters are encoded using SGML entities.

Work started: 16.2.2009
Work finished: 18.2.2009
Total work time: 16:36 h

Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the other feature led to wrong feature-value combinations in conversions to/from other Czech sets. DZ Interset had to be slightly modified in response to this tagset, and more changes that initiated here will be done later.

Prague Spoken Corpus

The Prague Spoken Corpus (Pražský mluvený korpus, PMK) is distributed together with the frequency dictionary of spoken Czech (book). It uses very strange tags and very many of them (over 10000!) Extremely high portion of the tags has to rely on the other feature. There are two types of tags: long and short.

Work started: 26.11.2009
Work finished: 4.10.2010
Total work time: 57 hours

Danish (da)

Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish.

Total work time: about 7 hours

English (en)

Penn Treebank Tagset

Penn Treebank (45 atomic tags). Detailed classification of punctuation.

Total work time: about 3 hours

CoNLL 2006

The driver is just an envelope around the en::penn driver.

Total work time: 48 minutes

CoNLL 2009

Another envelope around the en::penn driver. However, three new tags required changes even in the older drivers: HYPH, AFX (PRF) and NIL.

Work started: 25.3.2009
Work finished: 25.3.2009
Total work time: 2:57 h

German (de)

Stuttgart-Tübingen Tagset (STTS)

This is the tagset used in the Tiger treebank. It is quite syntax-oriented, often the same word can be tagged in couple different ways according to its function in a particular sentence. Pronouns are systematically categorized as substitutive (occur instead of an NP), attributive (occur inside an NP) and adverbial.

The tags omit inflectional information (number and case of pronouns and articles, degree of comparison of adjectives, tense (Präteritum, Konjunktiv), person and number of verbs).

Work started: 29.3.2008
Work finished: 29.3.2008
Total work time: 4:00 h

CoNLL 2006

Only simple envelope around the STTS driver needed.

Work started: 31.3.2008
Work finished: 31.3.2008
Total work time: 10 min

CoNLL 2009

This tagset is derived from the STTS, too. Unlike CoNLL 2006, there are also morphological features this time, which required additional processing effort.

Work started: 5.4.2009
Work finished: 6.4.2009
Total work time: 9:39 h

Polish (pl)

Based on the Korpus Języka Polskiego IPI PAN. (Saša tyhle značky potřebuje zpracovat v Intercorpu.) Moderate amount of new stuff but it is one of the fairly complex Slavic tagsets. And it contributed to new treatment of o-tags (those setting the other feature) when learning permitted feature-value combinations.

Work started: 4.9.2009
Work finished: 8.9.2009
Total work time: 9:54 h

Portuguese (pt)

The Portuguese CoNLL treebank contains tags with 149 different features. Big part of them are noise, probably introduced by the conversion procedure from the original Floresta format to the CoNLL format. The driver is designed so that it accepts all incorrect tags on decoding but encodes only corrected tags. Incorrect tags are not on the list of possible tags so the driver tester will not complain.

http://visl.sdu.dk/visl/pt/info/symbolset-floresta.html
http://en.wikipedia.org/wiki/Portuguese_grammar

Work started: 2.4.2008
Work finished: 24.4.2008
Total work time: 28:18 h

The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex with many features, some of them strangely overlapping, some of them undocumented. There was also a terrible proportion of noise, typos or otherwise introduced errors in annotation.

Feature	Explanation	Examples
_	no features	prepositions, punctuation etc.
1/3S	1st person or 3rd person singular	leia, disse, seria, prefira
1S	1st person singular	tenho, tinha, usei, vivo, vou
1P	1st person plural	tomámos, vamos, vemos, víamos
2S	2nd person singular	compreendeste, queres, te, ti, veja, vives
2P	2nd person plural	chamais, vós
3S	3rd person singular	viu, viva
3S/P	3rd person singular or plural	se, si
3P	3rd person plural	vivem
ACC	pronoun as direct accusative object	se, te, vos
ACC/DAT	pronouns in accusative or dative	nos, se
COND	verb in conditional mood	precisariam, seriam, tentaria, venderia, viriam
DAT	pronoun as dative object	lhe, lhes, me, no, nos, se, vos
F	feminine
F/M	feminine or masculine
FUT	future tense of verbs	tenderão, tomará, usará
IMP	imperative mood of verbs	chega, move, olha, sê
IMPF	imperfect tense of verbs	abandonasse, abandonava, abria
IND	indicative mood of verbs	abafaram, abandonam, abate, abateu
M	masculine	açúcar, adepto, adiantado
M/F	masculine or feminine	Abidjan, cada, Chaves, especial
MQP	pluperfect past tense of verbs	acabara, defendera, existira, foram, quisera, viram
NOM	personal pronoun in nominative	ela, elas, ele, eles, eu, nós, vocês, você, vós
NOM/PIV	personal pronoun in nominative or prepositional object	ela, elas, ele, eles, nós, você
P	plural	0,92, 14h00, africanos, águas, Amigos_da_Ilha_de_Santos
PIV	pronoun in prepositional object	ela, elas, ele, eles, mim, nós, si, ti, vós
PR	present tense of verbs	abandonam, abate, abonam, abordo, abra
PR/PS	present or past tense of verbs	conhecemos, conseguimos, decidimos
PS	perfect past tense of verbs	abalou, abandonaram, abandonou, abateu
PS/MQP	perfect or pluperfect past tense of verbs	abafaram, abriram, acabaram, aceitaram
S	singular	1992, adicional, aditamento, aduaneira
S/P	singular or plural	capaz, Chaves, mais
SUBJ	subjunctive mood of verbs	abandonasse, abra, abram
<ALT>	indicates typo in word
<DERP>	derivation by prefixation	hidroginástica, interactivo, supercomputação
<DERS>	derivation by suffixation	neo-comunista, pessedebismo, tropologia
<KOMP>	comparative hook determiner or adverb	assim, inferior, maior, mais, melhor, mesma, outra, piores, tanto
<NUM-ord>	ordinal number, subclass of adjectives	10º, 113ª, 1., primeiro, terços, última, XIV
<SUP>	superlative of adjectives and adverbs	inferior, máximo, melhor, mínimo, ótimo, péssimo, pior
<artd>	definite article or determiner pronoun	a, as, o, os
<arti>	indefinite article or determiner pronoun	uma, um
<card>	cardinal number	um, uma, dois, três, quatro, cinco
<co-acc>	coordination of direct accusative objects
<co-advl>	coordination of adjunct adverbials
<co-advo>	coordination of argument adverbials, object related
<co-advs>	coordination of argument adverbials, subject related
<co-app>	coordination of adnominal appositions
<co-fmc>	coordination of main clauses
<co-ger>	coordination of gerunds
<co-inf>	coordination of infinitives
<co-oc>	coordination of object complements
<co-pass>	coordination of passive adjuncts
<co-pcv>	coordination of predicative participles
<co-piv>	coordination of prepositional objects
<co-postad>	coordination of postpositioned dependents in ap or advp
<co-postnom>	coordination of postpositioned dependents in np
<co-pred>	coordination of adjunct predicatives
<co-prenom>	coordination of prepositioned dependents in np
<co-prparg>	coordination of preposition arguments
<co-sc>	coordination of subject complements
<co-subj>	coordination of subjects
<co-vfin>	coordination of finite verbs
<coll>	collective reflexive pronoun	se (reunir-se, associar-se)
<dem>	demonstrative pronoun or adverb	este, isso, isto, o, os, tais, tal, tão
<det>	determiner usage / inflection of adverb	algo, meio, nada, quase, todo, um_tanto
<diff>	differentiator	mesmo, outro, semelhante, tal
<error>	probably processing error, not typo
<fmc>	verb heading finite main clause
<foc>	focus marker, adverb or pronoun	é_que, foi, fomos, que, são, será
<hyfen>	separated hyphenated prefix, usually of verbs	tinha-, unia-, verifica-
<ident>	identifier pronoun	mesmo, próprio
<interr>	interrogative pronoun or adverb	como, onde, porque, quais, qual, quando, quanto, quem, que
<kc>	conjunctional adverb	agora, aí, bem_como, como, ora, tal_como, todavia
<ks>	adverb or preposition used like a subordinating conjunction	como, enquanto, onde, quando, segundo
<n>	other word class used as noun, typically as head of noun phrase	anglo-americano, claro, feliz
<poss	possessive determiner pronoun	meu, meus, minha, minhas, nossa, nossas, nosso, nossos, seu, seus, sua
<prop>	other word class used as proper noun	Abril, Administração, Aeronáutica
<prp>	other word class used as preposition	como, conforme, consoante, embora, segundo
<quant>	indefinite quantifier adverb or pronoun	algo, ambas, bastante, bem, cada, certos, diversas, mais, menos
<reci>	reciprocal reflexive	se (amar-se)
<refl>	reflexive pronoun	se, me, te, nos, vos, si
<rel>	relative pronoun or adverb	à_medida_que, como, cuja, donde, enquanto, quando, quão
←sam>	2nd part in contracted word (nisto –> isto)
<sam→	1st part in contracted word (nisto –> em)	abaixo_de, a_cargo_de, ao_largo_de, apesar_de, em_face_de
<si>	reflexive usage of 3rd person possessive	seu, seus, sua, suas
<eg>	undocumented feature	2 occurrences with cardinal numbers
<Eg>	undocumented feature	occurs with numbers, adjectives and pronouns
<Em>	undocumented feature	6 occurrences with adjectives
<Es>	undocumented feature	3 occurrences with adverbs and prepositions
<ink>	undocumented feature of finite verbs	está, havia, pode, tentou
<mente>	undocumented feature; feminine adjective that can serve as base for derivation using the “-mente” suffix	directa, pura, rápida
<meta>	undocumented feature of adverbs	afinal, só
N	undocumented feature of nouns and articles	15 occurrences
<new>	undocumented feature
<nil>	undocumented feature
<obj>	undocumented feature	se
<p>	undocumented feature	1 occurrence
<parkc-1>	undocumented feature of conjunctions and adverbs	assim, nem, ou, tanto
<parkc-2>	undocumented feature of conjunctions, adverbs and prepositions	como, como_também, e, nem, ou, tampouco
<postmod>	undocumented feature	3 occurrences
<premod>	undocumented feature of adverbs
>	noise; should be ignored
0/1/3S	noise; should probably be 1/3S
1	noise; should be 1S	aproveitaria, saiba, tinha, vivia
1S>	noise; should be 1S	meu, meus, minha, minhas
1P>	noise; should be 1P	nossa, nossas, nosso, nossos
2S>	noise; should be 2S	seu, teu
2P>	noise; should be 2P	vossa, vosso
3S>	noise; should be 3S	seu, seus, sua, suas
3S/P>	noise; should be 3S/P	seu, seus, sua
3P>	noise; should be 3P	seu, seus, sua
<adv>	noise?	fundo
<advl>	noise; should be <co-advl>	e
<co-adv>	noise; should be <co-advl>
>co-fmc>	noise; should be <co-fmc>
<co-fmv>	noise; should be <co-fmc>
convidado→	noise; should be <ALT>
<co-postnom	noise; should be <co-postnom>
<co-prparg	noise; should be <co-prparg
<corr	noise; should be <ALT>
<co-vfin><co-fmc>	noise; should be two features
<Eg>F	noise; should be two features
<Eg>M	noise; should be two features
<F	noise; should be F
GER	noise; redundant gerund marker	1 occurrence with v-ger
<hyphen>	noise; should be <hyfen>	sofrê-
INF	noise; redundant infinitive marker	2 occurrences with <hyfen>
'Maio	noise	Maio
MVF	noise; should be MV and F	motivada
NUM	noise; redundant numeral marker	1994
pasando>	noise; should be <ALT>	passando
PCP	noise; redundant participle marker	2 occurrences
<postmod>F	noise; should be two features
<postnom>	noise; should be <co-postnom>
PROP	noise	2 occurrences
<prop>M	noise; should be two features
<prparg>	noise; should be <co-prparg>
R	noise; should be PR	2 occurrences
recohidas>	noise; should be <ALT>	recolhidas
<rel><ks>	noise; should be two features
s	noise; should be S
saiem>	noise; should be <ALT>	saem
←sam><arti>	noise; should be two features
←sam><dem>	noise; should be two features
<sc>	noise; should be <co-sc>
subordinanda>	noise; should be <ALT>	subordinada
V	noise; redundant verb marker
<vfin>	noise; should be <co-vfin>
VFIN	noise	há od haver

Slovak (sk)

Slovenský národný korpus (SNK)

1457 structured tags.

Total work time: 5:32 hours.

Swedish (sv)

Mamba and CoNLL

Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but detailed classification of auxiliary and modal verbs and punctuation. CoNLL driver is just an envelope around Mamba.

Total work time: about 3 hours

Tags of Hajič's Swedish tagger

Based on PAROLE Swedish tagset but some characters different (@ ⇒ W), and filled by dashes to uniform length of 9 characters (although i-th position does not always encode the same feature).

No reliable statistics of work time; estimated 8 hours

Time needed for tag set conversion

Some records about targeted tagset conversion for given tagset pairs, done in early 2006:

Ruský treebank (nejen značky, ale vůbec převod formátu):
12:36

Arabské značky (Otovy i Buckwalterovy, ještě bez Intersetu, 22.3.2006):
4:45+1+1:40 = 7:25

Table of Contents

Tag Set Drivers

Arabic (ar)

CoNLL 2006

CoNLL 2007

Bulgarian (bg)

Chinese (zh)

Croatian (hr)

Multext

Czech (cs)

Prague Dependency Treebank (PDT)

CoNLL 2006

CoNLL 2009

Multext

Prague Spoken Corpus

Danish (da)

English (en)

Penn Treebank Tagset

CoNLL 2006

CoNLL 2009

German (de)

Stuttgart-Tübingen Tagset (STTS)

CoNLL 2006

CoNLL 2009

Polish (pl)

Portuguese (pt)

Slovak (sk)

Slovenský národný korpus (SNK)

Swedish (sv)

Mamba and CoNLL

Tags of Hajič's Swedish tagger

Time needed for tag set conversion