
Institute of Formal and Applied Linguistics Wiki






Tag Set Drivers

This page gives an overview of the existing tag set drivers and describes issues specific to individual tag sets or languages.

Arabic (ar)

The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.

Created in 2006-2007.
Total work time: 13 hours

Bulgarian (bg)

The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. A special feature is the elaborate system of pronouns, which also includes interrogative adverbs and numerals.

Created in 2007.
Total work time: 35 hours

The main reasons why the implementation took so long:

Chinese (zh)

The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none is supplied with the CoNLL data, and only a little can be found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset.

Pronouns are special cases of nouns. Numerals are special cases of determiners.

There are many sorts of particles, some of which have special tags (DE).

Work started: 21.10.2007
Work finished: 5.3.2008
Total work time: 21:30 h

Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes.

Czech (cs)

Prague Dependency Treebank (PDT)

When I worked on this driver, I did not yet have the smart functions for enforcing permitted tags at my disposal.

This is so far the largest tag set I have encountered. It contains 4288 tags.

The Czech PDT tags (over 4000 tags; the core of Interset came into being as a by-product while I was doing this) took about 2 days, so let us say 18 hours. I spent another 11:09 hours when I started testing the drivers and had to fix this one. Again, part of the time went into debugging the test script, which was only just being written at the time.

CoNLL (derived from PDT)

The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information that the PDT encodes in lemmas has been encoded as a new feature called Sem in the CoNLL data. The README refers to the following documentation: part of speech and most features | lemma features

The list of tags of this tagset contains equivalents of all original PDT tags. In addition, it contains those tags with the Sem feature set that occur in the CoNLL data, and a few more. The Sem values are currently stored in the other feature of Interset. At the same time, subpos = “prop” is set if Sem is set and subpos would otherwise be empty. (The original PDT tags cannot distinguish proper from common nouns.) If the encoder encounters subpos = “prop”, it uses the default value “Sem=m”. The “few more” tags were added to the list whenever there was a tag Foo=bar|Sem=something but no tag with the default Foo=bar|Sem=m.
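The handling of the Sem feature described above can be sketched roughly as follows. This is only an illustration in Python, not the driver's actual code: the function names, the dictionary layout of the feature structure, and the rule that a stored Sem value takes precedence over the default are all assumptions.

```python
def decode_feats(feats: str) -> dict:
    """Split a CoNLL FEATS string such as 'Gen=M|Num=S|Sem=G' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def decode(feats: str, subpos: str = "") -> dict:
    """Build a simplified, hypothetical Interset-like feature structure."""
    values = decode_feats(feats)
    fs = {"subpos": subpos, "other": ""}
    sem = values.get("Sem")
    if sem is not None:
        fs["other"] = sem            # the Sem value is kept in 'other'
        if not fs["subpos"]:
            fs["subpos"] = "prop"    # only if subpos would otherwise be empty
    return fs


def encode_sem(fs: dict) -> str:
    """Return the Sem part of the FEATS string for a feature structure."""
    if fs.get("other"):
        return "Sem=" + fs["other"]  # assumption: a stored value wins
    if fs.get("subpos") == "prop":
        return "Sem=m"               # default for proper nouns
    return ""
```

With this sketch, a structure that arrives from another tag set with subpos = “prop” but without any stored Sem value is encoded with the default Sem=m.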

Work started: 25.3.2008
Work finished: 25.3.2008
Total work time: 6:02 h

More than half of the time was consumed during testing for tuning tags containing the Sem feature.

Danish (da)

Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish.

Total work time: about 7 hours

English (en)

Penn Treebank Tagset

Penn Treebank (45 atomic tags). Detailed classification of punctuation.

Total work time: about 3 hours

CoNLL Tagset (derived from Penn tags)

The driver is just an envelope around the en::penn driver.
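A minimal sketch of what such an envelope might look like, written in Python for illustration. It assumes that a CoNLL tag is passed around as three tab-separated fields (coarse tag, Penn tag, features) and that the coarse tag is the first two characters of the Penn tag; both are assumptions, and the tiny Penn table is a hypothetical stand-in for the real en::penn driver.

```python
# Hypothetical stand-in for the Penn driver: three tags only.
PENN_TABLE = {
    "NN": {"pos": "noun", "number": "sing"},
    "NNS": {"pos": "noun", "number": "plu"},
    "JJ": {"pos": "adj"},
}


def penn_decode(tag: str) -> dict:
    """Decode a Penn tag into a simplified feature structure."""
    return dict(PENN_TABLE[tag])


def penn_encode(fs: dict) -> str:
    """Find the Penn tag whose feature structure equals fs."""
    for tag, values in PENN_TABLE.items():
        if values == fs:
            return tag
    raise ValueError(f"no Penn tag for {fs!r}")


def conll_decode(tag: str) -> dict:
    """Envelope: pick the Penn tag out of the CoNLL triple and delegate."""
    _cpos, pos, _feats = tag.split("\t")
    return penn_decode(pos)


def conll_encode(fs: dict) -> str:
    """Envelope: let the Penn driver encode, then wrap the result."""
    pos = penn_encode(fs)
    cpos = pos[:2]  # assumption about how the coarse tag is derived
    return f"{cpos}\t{pos}\t_"
```

All the linguistic knowledge stays in the wrapped driver; the envelope only unpacks and repacks the columns, which is consistent with the short work time reported below.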

Total work time: 48 minutes

German (de)

Stuttgart-Tübingen Tagset (STTS)

This is the tagset used in the Tiger treebank. It is quite syntax-oriented: the same word can often be tagged in several different ways, depending on its function in a particular sentence. Pronouns are systematically categorized as substitutive (occurring instead of an NP), attributive (occurring inside an NP) and adverbial.

The tags omit inflectional information (number and case of pronouns and articles, degree of comparison of adjectives, tense (Präteritum, Konjunktiv), person and number of verbs).

Work started: 29.3.2008
Work finished: 29.3.2008
Total work time: 4:00 h

CoNLL (derived from STTS)

Only a simple envelope around the STTS driver was needed.

Work started: 31.3.2008
Work finished: 31.3.2008
Total work time: 10 min

Portuguese (pt)

The Portuguese CoNLL treebank contains tags with 149 different features. A large part of them is noise, probably introduced by the conversion procedure from the original Floresta format to the CoNLL format. The driver is designed so that it accepts all incorrect tags on decoding but encodes only corrected tags. Incorrect tags are not on the list of possible tags, so the driver tester will not complain.

Feature | Explanation | Examples
_ | no features | prepositions, punctuation etc.
1 | 1st person |
1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira
1S | 1st person singular | tenho, tinha, usei, vivo, vou
1P | 1st person plural | tomámos, vamos, vemos, víamos
2S | 2nd person singular | compreendeste, queres, te, ti, veja, vives
2P | 2nd person plural | chamais, vós
3S | 3rd person singular | viu, viva
3S/P | 3rd person singular or plural | se, si
3P | 3rd person plural | vivem
> | noise; should be ignored |
0/1/3S | noise; should probably be 1/3S |
1S> | noise; should be 1S | meu, meus, minha, minhas
1P> | noise; should be 1P | nossa, nossas, nosso, nossos
2S> | noise; should be 2S | seu, teu
2P> | noise; should be 2P | vossa, vosso
3S> | noise; should be 3S | seu, seus, sua, suas
3S/P> | noise; should be 3S/P | seu, seus, sua
3P> | noise; should be 3P | seu, seus, sua
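The corrections listed in the table can be sketched as a small normalization step applied while decoding. The Python below is only an illustration: the function names are hypothetical, it covers just the person and number features shown above, and it assumes that the features of a tag are separated by vertical bars.

```python
# Clean person/number values taken from the table above.
PERSON_NUMBER = {"1", "1/3S", "1S", "1P", "2S", "2P", "3S", "3S/P", "3P"}


def normalize_feature(feature: str):
    """Return the corrected feature, or None if it is pure noise."""
    if feature == ">":
        return None                      # noise; should be ignored
    if feature == "0/1/3S":
        return "1/3S"                    # noise; should probably be 1/3S
    if feature.endswith(">") and feature[:-1] in PERSON_NUMBER:
        return feature[:-1]              # e.g. '1S>' becomes '1S'
    return feature                       # anything else is left untouched


def normalize_feats(feats: str) -> str:
    """Normalize a whole feature string; '_' means no features."""
    if feats == "_":
        return feats
    kept = [f for f in map(normalize_feature, feats.split("|")) if f is not None]
    return "|".join(kept) or "_"
```

A decoder built this way accepts both the noisy and the clean spelling, while an encoder that only ever emits values from PERSON_NUMBER produces corrected tags only.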

Swedish (sv)

Mamba and CoNLL

Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but a detailed classification of auxiliary and modal verbs and of punctuation. The CoNLL driver is just an envelope around the Mamba driver.

Total work time: about 3 hours

Tags of Hajič's Swedish tagger

Based on the Swedish PAROLE tagset, but some characters differ (@ ⇒ W) and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature).
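The relation to the PAROLE tags can be illustrated with a short Python sketch. The function name and the sample tags are illustrative only, and the sketch applies just the one substitution named above (@ ⇒ W); other character differences may exist and are not covered.

```python
TAG_LENGTH = 9  # uniform tag length mentioned in the text


def parole_to_hajic(tag: str) -> str:
    """Replace '@' with 'W' and pad with dashes to nine characters."""
    return tag.replace("@", "W").ljust(TAG_LENGTH, "-")


if __name__ == "__main__":
    for sample in ("NCUSN@IS", "V@IPAS"):
        print(sample, "->", parole_to_hajic(sample))
```

Because the padding only fills the tag up from the right, a given position can hold different features for different parts of speech, which matches the caveat about the i-th position.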

No reliable statistics of work time; estimated 8 hours

Time needed for tag set conversion

Some records of targeted tagset conversions between specific tagset pairs, done in early 2006:

Russian treebank (not only the tags but the whole format conversion):
12:36

Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
4:45+1+1:40 = 7:25

