Tag Set Drivers

This is an overview of the existing tag set drivers. Issues specific to a tag set or a language are described here.

Chinese (zh)

The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost none is supplied with the CoNLL data, and only a little can be found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset.

Pronouns are special cases of nouns. Numerals are special cases of determiners.

There are many sorts of particles, some of which have special tags (DE).

Work started: 21.10.2007
Work finished: 5.3.2008
Total work time: 21:30 h

Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes.

Czech (cs)

Prague Dependency Treebank (PDT)

When I worked on this driver, I did not yet have the smart functions for enforcing permitted tags.

This is the largest tag set I have encountered so far. It contains 4288 tags.

Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product of this work): about 2 days, so let's say 18 hours. I spent another 11:09 hours when I started testing the drivers and had to fix this one. Again, part of that time went into debugging the test script, which was only just being written at the time.

CoNLL (derived from PDT)

The CoNLL 2006 and 2007 Czech treebanks are PDT data converted to the CoNLL format. The PDT morphological tags have been decomposed into a coarse-grained part of speech, a detailed part of speech, and a set of feature values. There should be a one-to-one mapping between the original PDT and the CoNLL tag sets. However, because of the features, the driver cannot be a simple envelope around the driver of the original tag set (as is the case for e.g. the Penn Treebank tags).

Update: the mapping to the original PDT tags is not one-to-one. Some information that the PDT encodes in lemmas has been encoded as features in the CoNLL data. The README refers to the following documentation: part of speech and most features | lemma features
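The decomposition described above can be sketched roughly as follows. This is only an illustration, not the actual driver code: it assumes the 15-position PDT positional tag, and the short feature names (Gen, Num, Cas, ...) as well as the function name and the lemma_feats parameter are assumptions introduced for the example. The optional lemma_feats argument stands for the information that comes from the lemma rather than from the tag, which is why the mapping is not one-to-one.

```python
# Assumed names for the 15 positions of a PDT positional tag.
# The short feature labels are an assumption made for this sketch.
PDT_POSITIONS = [
    "POS", "SubPOS", "Gen", "Num", "Cas", "PGe", "PNu", "Per",
    "Ten", "Gra", "Neg", "Voi", "Res1", "Res2", "Var",
]


def decompose_pdt_tag(tag, lemma_feats=None):
    """Split a positional tag into (coarse POS, detailed POS, feature string).

    lemma_feats is a hypothetical list of extra "Name=value" strings that
    originate from the lemma, not from the tag itself.
    """
    if len(tag) != len(PDT_POSITIONS):
        raise ValueError(f"expected {len(PDT_POSITIONS)} positions, got {len(tag)}")
    cpos, pos = tag[0], tag[1]
    feats = [
        f"{name}={value}"
        for name, value in zip(PDT_POSITIONS[2:], tag[2:])
        # "-" means "not applicable"; the reserved positions carry no feature.
        if value != "-" and not name.startswith("Res")
    ]
    feats.extend(lemma_feats or [])
    # CoNLL uses "_" for an empty feature column.
    return cpos, pos, "|".join(feats) if feats else "_"


print(decompose_pdt_tag("NNMS1-----A----"))
```

Going back from the three CoNLL columns to the PDT tag would need the inverse of this split plus a rule for dropping the lemma-derived features, which is more than a thin wrapper around the PDT driver.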

Work started: 25.3.2008

Time needed for tag set conversion

I am noting how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for an ordinary conversion is interesting, even though I know that I will really save time only when a driver is reused.

Russian treebank (not only the tags but the whole format conversion):
12:36

Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
4:45+1+1:40 = 7:25

Danish DDT/Parole tags (144 tags with an elaborate description)
about 7 hours

Swedish Mamba tags (48 tags)
about 3 hours

Penn Treebank (36 tags)
about 3 hours, but I was not yet measuring the time here, so this is only a rough retrospective estimate

Hajič's Swedish tags
0:32 - the full statistics are evidently missing here

Arabic CoNLL tags
4:33+5:19+3:16 = 13:08

Bulgarian CoNLL tags
0:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44 = 35:19
(but with Bulgarian I struggled a lot with phenomena that Interset had not covered until then)

English CoNLL tags
0:48 - some statistics may be missing here, but maybe not, because it was enough to adapt the existing Penn Treebank driver, wasn't it?

None of the conversions listed above (i.e. everything written before October 2007) had the smart functions for replacing disallowed values available yet.
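The totals above are sums of work sessions written as hours:minutes, with a bare number meaning whole hours. A small sketch for rechecking them follows; the function names are made up for this example.

```python
def to_minutes(text):
    """Convert "h:mm" (or a bare number of hours) to minutes."""
    hours, _, minutes = text.strip().partition(":")
    return int(hours) * 60 + int(minutes or 0)


def sum_durations(expression):
    """Add up sessions joined by "+" and return the total as "h:mm"."""
    total = sum(to_minutes(part) for part in expression.split("+"))
    return f"{total // 60}:{total % 60:02d}"


sessions = {
    "Arabic tags": "4:45+1+1:40",
    "Arabic CoNLL tags": "4:33+5:19+3:16",
    "Bulgarian CoNLL tags": "0:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44",
}
for name, expression in sessions.items():
    print(name, sum_durations(expression))
```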


Institute of Formal and Applied Linguistics Wiki
