Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
user:zeman:interset:drivers [2008/03/06 15:51] zeman Time requirements moved to Drivers. |
user:zeman:interset:drivers [2014/07/17 16:32] (current) zeman hr::multext |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Tag Set Drivers ====== | ====== Tag Set Drivers ====== | ||
- | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. | + | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort. |
- | ===== Chinese ===== | + | ===== Arabic (ar) ===== |
+ | |||
+ | ==== CoNLL 2006 ==== | ||
+ | |||
+ | The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | ||
+ | |||
+ | Created in 2006-2007. | ||
+ | Total work time: 13 hours | ||
+ | |||
+ | ==== CoNLL 2007 ==== | ||
+ | |||
+ | The Arabic tags in CoNLL 2007 slightly differed from 2006. There are also new tags. The driver '' | ||
+ | |||
+ | Created: 23.6.2011 | ||
+ | Total work time: 2 hours | ||
+ | |||
+ | ===== Bulgarian (bg) ===== | ||
+ | |||
+ | The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that includes interrogative adverbs and numerals. | ||
+ | |||
+ | Created in 2007. | ||
+ | Total work time: 35 hours | ||
+ | |||
+ | The main reasons why the implementation took so long: | ||
+ | * Necessity to re-work the system of main word classes, especially pronouns. | ||
+ | * Necessity to separate morphological and lexical definiteness (there are indefinite pronouns that are morphologically definite, and vice versa). | ||
+ | * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense). | ||
+ | * Driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent, | ||
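The round-trip requirement encode(decode(x))=x from the last item can be illustrated with a minimal sketch. The tags, features and function names below are made up for the example and are not taken from the Bulgarian driver; the point is only that two tags carrying identical features cannot both survive the round trip.

```python
def decode(tag, table):
    """Map a tag to a feature dictionary (toy lookup, not a real driver)."""
    return dict(table[tag])


def encode(features, table):
    """Map features back to the first tag in the table that carries exactly them."""
    for tag, feats in table.items():
        if feats == features:
            return tag
    raise KeyError(features)


def roundtrip_failures(table):
    """Return every tag x for which encode(decode(x)) != x."""
    return [tag for tag in table if encode(decode(tag, table), table) != tag]


# Hypothetical tag list: "Vx" is an inconsistent duplicate of "Vp".
TOY_TABLE = {
    "Nc": {"pos": "noun"},
    "Vp": {"pos": "verb"},
    "Vx": {"pos": "verb"},
}
FAILURES = roundtrip_failures(TOY_TABLE)
```

With this toy table, `FAILURES` contains only `"Vx"`, because the encoder can return just one tag per feature set.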
+ | |||
+ | ===== Chinese | ||
The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost zero supplied with CoNLL data, and only a little found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset. | The only corpus covered so far is the Sinica Treebank, converted to the CoNLL format. The tag set lacks comprehensive documentation (almost zero supplied with CoNLL data, and only a little found on the web). The tags do not encode any morphological features. Instead, there is a comprehensive (but undocumented) hierarchy of word classes and subclasses. Most of the information encoded in the tags cannot be mapped to Interset. | ||
Line 17: | Line 46: | ||
Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. | ||
- | ===== Time needed for tag set conversion | + | ===== Croatian (hr) ===== |
- | I keep notes on how much time each driver took me so that I can publish it. Comparing that time with the time needed for an ordinary conversion is interesting, even though I know that I will actually save time only when a driver is reused. | + | ==== Multext ==== |
- | Russian treebank (not only the tags but the whole format conversion): | + | The tagset of the MULTEXT-EAST project as used in the SETimes.HR corpus. Documentation lists 1291 tags; we removed one wrong tag and kept 1290. |
- | 12:36 | + | |
- | Arabic tags (both Ota's and Buckwalter's, | + | Work started: 16.7.2014 |
- | 4:45+1+1:40 = 7:25 | + | Work finished: 17.7.2014 |
+ | Total work time: 5:45 h | ||
- | Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product while I was doing this) | + | This is the second Multext-East tagset covered by DZ Interset. Adding it was not too difficult because much of the previous effort on '' |
- | about 2 days, so let's say 18 hours | + | |
- | Danish tags DDT/ | + | ===== Czech (cs) ===== |
- | about 7 hours | + | |
- | Swedish Mamba tags (48 tags) | + | ==== Prague Dependency Treebank |
- | about 3 hours | + | |
- | Penn Treebank (36 tags) | + | When I worked on this driver, I did not yet have the smart functions for ensuring permitted tags at my disposal. |
- | about 3 hours, but I was not yet measuring the time here, so this is only a rough retrospective estimate | + | |
- | Hajič's | + | This is so far the largest tag set I have encountered. It contains 4288 tags. |
- | 0:32 - complete statistics are evidently missing here | + | |
- | Arabic | + | Czech |
- | 4:33+5: | + | |
- | Czech PDT tags (CoNLL version? Or are these just fixes made when I started testing the drivers? | + | ==== CoNLL 2006 ==== |
- | 1: | + | |
- | Bulgarian tags | + | The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, |
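The decomposition of a PDT tag into coarse-grained part of speech, detailed part of speech and feature values might look roughly like the sketch below. It assumes the 15-character positional PDT tag; the feature abbreviations (`Gen`, `Num`, `Cas` and so on) and the function name are assumptions made for illustration and should be checked against the CoNLL data before being relied on.

```python
# Assumed abbreviations for positions 3-15 of the positional tag.
# None marks the two reserved positions, which carry no feature.
FEATURE_NAMES = [
    "Gen", "Num", "Cas", "PGe", "PNu", "Per", "Ten",
    "Gra", "Neg", "Voi", None, None, "Var",
]


def pdt_to_conll(tag):
    """Split a positional tag into (CPOS, POS, FEATS) in the CoNLL style."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}: {tag!r}")
    cpos, pos = tag[0], tag[1]
    feats = [
        f"{name}={value}"
        for name, value in zip(FEATURE_NAMES, tag[2:])
        if name is not None and value != "-"
    ]
    # CoNLL uses "_" for an empty feature column.
    return cpos, pos, "|".join(feats) or "_"


EXAMPLE = pdt_to_conll("NNMS1-----A----")
```

Under these assumptions the example tag decomposes into `("N", "N", "Gen=M|Num=S|Cas=1|Neg=A")`.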
- | 0:20+1:00+0:26+5: | + | |
- | (but with Bulgarian I struggled quite a bit with phenomena that Interset had not covered until then) | + | |
- | English tags | + | The list of tags of this tagset contains equivalents of all original PDT tags. In addition, it contains those tags with the '' |
- | 0:48 - statistics may be missing here, but maybe not, because it was enough to adapt the existing Penn Treebank driver, wasn't it? | + | |
- | None of the above conversions (i.e. everything written | + | Work started: 25.3.2008 |
+ | Work finished: 25.3.2008 | ||
+ | Total work time: 6:02 h | ||
+ | |||
+ | More than half of the time was spent during testing, tuning the tags that contain the Sem feature. | ||
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | The [[: | ||
+ | |||
+ | The '' | ||
+ | |||
+ | Work started: 24.3.2009 | ||
+ | Work finished: 24.3.2009 | ||
+ | Total work time: 1:10 h | ||
+ | |||
+ | ==== Multext ==== | ||
+ | |||
+ | The tagset of the MULTEXT-EAST project and corpora. The file '' | ||
+ | |||
+ | Work started: 16.2.2009 | ||
+ | Work finished: 18.2.2009 | ||
+ | Total work time: 16:36 h | ||
+ | |||
+ | Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the '' | ||
+ | |||
+ | ==== Prague Spoken Corpus ==== | ||
+ | |||
+ | The Prague Spoken Corpus (Pražský mluvený korpus, PMK) is distributed together with the frequency dictionary of spoken Czech (book). It uses very strange tags and very many of them (over 10000!). An extremely high portion of the tags has to rely on the '' | ||
+ | |||
+ | Work started: 26.11.2009 | ||
+ | Work finished: 4.10.2010 | ||
+ | Total work time: 57 hours | ||
+ | |||
+ | ===== Danish (da) ===== | ||
+ | |||
+ | Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish. | ||
+ | |||
+ | Total work time: about 7 hours | ||
+ | |||
+ | ===== English (en) ===== | ||
+ | |||
+ | ==== Penn Treebank Tagset ==== | ||
+ | |||
+ | Penn Treebank (45 atomic tags). Detailed classification of punctuation. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== CoNLL 2006 ==== | ||
+ | |||
+ | The driver is just an envelope around the '' | ||
+ | |||
+ | Total work time: 48 minutes | ||
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | Another envelope around the '' | ||
+ | |||
+ | Work started: 25.3.2009 | ||
+ | Work finished: 25.3.2009 | ||
+ | Total work time: 2:57 h | ||
+ | |||
+ | ===== German (de) ===== | ||
+ | |||
+ | ==== Stuttgart-Tübingen Tagset (STTS) ==== | ||
+ | |||
+ | This is the tagset used in the Tiger treebank. It is quite syntax-oriented, | ||
+ | |||
+ | The tags omit inflectional information (number and case of pronouns and articles, degree of comparison of adjectives, tense (Präteritum, | ||
+ | |||
+ | Work started: 29.3.2008 | ||
+ | Work finished: 29.3.2008 | ||
+ | Total work time: 4:00 h | ||
+ | |||
+ | ==== CoNLL 2006 ==== | ||
+ | |||
+ | Only a simple envelope around the STTS driver was needed. | ||
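An envelope driver of this kind can be sketched as follows. The sketch assumes that a CoNLL tag is passed as three tab-separated fields (CPOS, POS, FEATS) and that both part-of-speech fields hold the plain STTS tag while the feature field is empty; the toy STTS table and all function names are hypothetical stand-ins, not the actual driver code.

```python
# Toy stand-in for the STTS driver; the two entries are illustrative only.
STTS_TABLE = {"NN": {"pos": "noun"}, "ADJA": {"pos": "adj"}}


def stts_decode(tag):
    """Decode a plain STTS tag into features (toy lookup)."""
    return dict(STTS_TABLE[tag])


def stts_encode(features):
    """Encode features as a plain STTS tag (toy reverse lookup)."""
    for tag, feats in STTS_TABLE.items():
        if feats == features:
            return tag
    raise KeyError(features)


def conll_decode(conll_tag):
    """Envelope: pick the POS field and delegate to the wrapped driver."""
    _cpos, pos, _feats = conll_tag.split("\t")
    return stts_decode(pos)


def conll_encode(features):
    """Envelope: encode with the wrapped driver, then rebuild the three fields."""
    tag = stts_encode(features)
    return f"{tag}\t{tag}\t_"
```

All tagset knowledge stays in the wrapped driver, which is consistent with the envelope taking only minutes to write.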
+ | |||
+ | Work started: 31.3.2008 | ||
+ | Work finished: 31.3.2008 | ||
+ | Total work time: 10 min | ||
+ | |||
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | This tagset is derived from the STTS, too. Unlike CoNLL 2006, there are also morphological features this time, which required additional processing effort. | ||
+ | |||
+ | Work started: 5.4.2009 | ||
+ | Work finished: 6.4.2009 | ||
+ | Total work time: 9:39 h | ||
+ | |||
+ | ===== Polish (pl) ===== | ||
+ | |||
+ | Based on the [[http:// | ||
+ | |||
+ | Work started: 4.9.2009 | ||
+ | Work finished: 8.9.2009 | ||
+ | Total work time: 9:54 h | ||
+ | |||
+ | ===== Portuguese (pt) ===== | ||
+ | |||
+ | The Portuguese CoNLL treebank contains tags with 149 different features. A large part of them is noise, probably introduced by the conversion procedure from the original Floresta format to the CoNLL format. The driver is designed so that it accepts all incorrect tags on decoding but encodes only corrected tags. Incorrect tags are not on the list of possible tags, so the driver tester will not complain. | ||
+ | |||
+ | http:// | ||
+ | http:// | ||
+ | |||
+ | Work started: 2.4.2008 | ||
+ | Work finished: 24.4.2008 | ||
+ | Total work time: 28:18 h | ||
+ | |||
+ | The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex with many features, some of them strangely overlapping, | ||
+ | |||
+ | | **Feature** | **Explanation** | **Examples** | | ||
+ | | _ | no features | prepositions, | ||
+ | | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira | | ||
+ | | 1S | 1st person singular | tenho, tinha, usei, vivo, vou | | ||
+ | | 1P | 1st person plural | tomámos, vamos, vemos, víamos | | ||
+ | | 2S | 2nd person singular | compreendeste, | ||
+ | | 2P | 2nd person plural | chamais, vós | | ||
+ | | 3S | 3rd person singular | viu, viva | | ||
+ | | 3S/P | 3rd person singular or plural | se, si | | ||
+ | | 3P | 3rd person plural | vivem | | ||
+ | | ACC | pronoun as direct accusative object | se, te, vos | | ||
+ | | ACC/DAT | pronouns in accusative or dative | nos, se | | ||
+ | | COND | verb in conditional mood | precisariam, | ||
+ | | DAT | pronoun as dative object | lhe, lhes, me, no, nos, se, vos | | ||
+ | | F | feminine | | | ||
+ | | F/M | feminine or masculine | | | ||
+ | | FUT | future tense of verbs | tenderão, tomará, usará | | ||
+ | | IMP | imperative mood of verbs | chega, move, olha, sê | | ||
+ | | IMPF | imperfect tense of verbs | abandonasse, | ||
+ | | IND | indicative mood of verbs | abafaram, abandonam, abate, abateu | | ||
+ | | M | masculine | açúcar, adepto, adiantado | | ||
+ | | M/F | masculine or feminine | Abidjan, cada, Chaves, especial | | ||
+ | | MQP | pluperfect past tense of verbs | acabara, defendera, existira, foram, quisera, viram | | ||
+ | | NOM | personal pronoun in nominative | ela, elas, ele, eles, eu, nós, vocês, você, vós | | ||
+ | | NOM/PIV | personal pronoun in nominative or prepositional object | ela, elas, ele, eles, nós, você | | ||
+ | | P | plural | 0,92, 14h00, africanos, águas, Amigos_da_Ilha_de_Santos | | ||
+ | | PIV | pronoun in prepositional object | ela, elas, ele, eles, mim, nós, si, ti, vós | | ||
+ | | PR | present tense of verbs | abandonam, abate, abonam, abordo, abra | | ||
+ | | PR/PS | present or past tense of verbs | conhecemos, conseguimos, | ||
+ | | PS | perfect past tense of verbs | abalou, abandonaram, | ||
+ | | PS/MQP | perfect or pluperfect past tense of verbs | abafaram, abriram, acabaram, aceitaram | | ||
+ | | S | singular | 1992, adicional, aditamento, aduaneira | | ||
+ | | S/P | singular or plural | capaz, Chaves, mais | | ||
+ | | SUBJ | subjunctive mood of verbs | abandonasse, | ||
+ | | <ALT> | indicates typo in word | | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <SUP> | superlative of adjectives and adverbs | inferior, máximo, melhor, mínimo, ótimo, péssimo, pior | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <dem> | demonstrative pronoun or adverb | este, isso, isto, o, os, tais, tal, tão | | ||
+ | | <det> | determiner usage / inflection of adverb | algo, meio, nada, quase, todo, um_tanto | | ||
+ | | < | ||
+ | | < | ||
+ | | <fmc> | verb heading finite main clause | | | ||
+ | | <foc> | focus marker, adverb or pronoun | é_que, foi, fomos, que, são, será | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <kc> | conjunctional adverb | agora, aí, bem_como, como, ora, tal_como, todavia | | ||
+ | | <ks> | adverb or preposition used like a subordinating conjunction | como, enquanto, onde, quando, segundo | | ||
+ | | <n> | other word class used as noun, typically as head of noun phrase | anglo-americano, | ||
+ | | <poss | possessive determiner pronoun | meu, meus, minha, minhas, nossa, nossas, nosso, nossos, seu, seus, sua | | ||
+ | | < | ||
+ | | <prp> | other word class used as preposition | como, conforme, consoante, embora, segundo | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <rel> | relative pronoun or adverb | à_medida_que, | ||
+ | | < | ||
+ | | < | ||
+ | | <si> | reflexive usage of 3rd person possessive | seu, seus, sua, suas | | ||
+ | | <eg> | undocumented feature | 2 occurrences with cardinal numbers | | ||
+ | | <Eg> | undocumented feature | occurs with numbers, adjectives and pronouns | | ||
+ | | <Em> | undocumented feature | 6 occurrences with adjectives | | ||
+ | | <Es> | undocumented feature | 3 occurrences with adverbs and prepositions | | ||
+ | | <ink> | undocumented feature of finite verbs | está, havia, pode, tentou | | ||
+ | | < | ||
+ | | < | ||
+ | | N | undocumented feature of nouns and articles | 15 occurrences | | ||
+ | | <new> | undocumented feature | | | ||
+ | | <nil> | undocumented feature | | | ||
+ | | <obj> | undocumented feature | se | | ||
+ | | <p> | undocumented feature | 1 occurrence | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | > | noise; should be ignored | | | ||
+ | | 0/1/3S | noise; should probably be 1/3S | | | ||
+ | | 1 | noise; should be 1S | aproveitaria, | ||
+ | | 1S> | noise; should be 1S | meu, meus, minha, minhas | | ||
+ | | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos | | ||
+ | | 2S> | noise; should be 2S | seu, teu | | ||
+ | | 2P> | noise; should be 2P | vossa, vosso | | ||
+ | | 3S> | noise; should be 3S | seu, seus, sua, suas | | ||
+ | | 3S/P> | noise; should be 3S/P | seu, seus, sua | | ||
+ | | 3P> | noise; should be 3P | seu, seus, sua | | ||
+ | | <adv> | noise? | fundo | | ||
+ | | < | ||
+ | | < | ||
+ | | > | ||
+ | | < | ||
+ | | convidado-> | ||
+ | | < | ||
+ | | < | ||
+ | | <corr | noise; should be <ALT> | | | ||
+ | | < | ||
+ | | <Eg>F | noise; should be two features | | | ||
+ | | <Eg>M | noise; should be two features | | | ||
+ | | <F | noise; should be F | | | ||
+ | | GER | noise; redundant gerund marker | 1 occurrence with v-ger | | ||
+ | | < | ||
+ | | INF | noise; redundant infinitive marker | 2 occurrences with < | ||
+ | | 'Maio | noise | Maio | | ||
+ | | MVF | noise; should be MV and F | motivada | | ||
+ | | NUM | noise; redundant numeral marker | 1994 | | ||
+ | | pasando> | noise; should be <ALT> | passando | | ||
+ | | PCP | noise; redundant participle marker | 2 occurrences | | ||
+ | | < | ||
+ | | < | ||
+ | | PROP | noise | 2 occurrences | | ||
+ | | < | ||
+ | | < | ||
+ | | R | noise; should be PR | 2 occurrences | | ||
+ | | recohidas> | ||
+ | | < | ||
+ | | s | noise; should be S | | | ||
+ | | saiem> | noise; should be <ALT> | saem | | ||
+ | | < | ||
+ | | < | ||
+ | | <sc> | noise; should be < | ||
+ | | subordinanda> | ||
+ | | V | noise; redundant verb marker | | | ||
+ | | < | ||
+ | | VFIN | noise | há from haver |
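The "accept noise on decoding, emit only corrected features" policy can be sketched as a correction map applied to the feature column. The corrections below are copied from a few of the noise rows in the table; the function name, the `|` separator and the choice to keep the original feature order are assumptions for illustration, not a description of the actual driver.

```python
# A few corrections taken from the noise rows of the table above.
# Each noisy feature maps to zero, one or two corrected features.
CORRECTIONS = {
    ">": [],             # should be ignored
    "1": ["1S"],
    "1S>": ["1S"],
    "1P>": ["1P"],
    "<F": ["F"],
    "MVF": ["MV", "F"],  # should be two features
    "R": ["PR"],
    "s": ["S"],
}


def clean_features(feats):
    """Replace known noisy features; unknown features pass through unchanged."""
    if feats == "_":
        return "_"
    cleaned = []
    for feature in feats.split("|"):
        for fixed in CORRECTIONS.get(feature, [feature]):
            if fixed not in cleaned:  # avoid duplicates after correction
                cleaned.append(fixed)
    return "|".join(cleaned) or "_"
```

For example, `clean_features("MVF|s")` yields `"MV|F|S"`, and a feature column consisting only of `>` collapses to `_`.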
+ | |||
+ | ===== Slovak (sk) ===== | ||
+ | |||
+ | ==== Slovenský národný korpus (SNK) ==== | ||
+ | |||
+ | 1457 structured tags. | ||
+ | |||
+ | Total work time: 5:32 hours. | ||
+ | |||
+ | ===== Swedish (sv) ===== | ||
+ | |||
+ | ==== Mamba and CoNLL ==== | ||
+ | |||
+ | Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but a detailed classification of auxiliary and modal verbs and punctuation. The CoNLL driver is just an envelope around Mamba. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== Tags of Hajič' | ||
+ | |||
+ | Based on the PAROLE Swedish tagset, but some characters differ (@ => W) and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature). | ||
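The two surface differences from PAROLE can be sketched in a few lines. This assumes that the dashes are appended on the right, which the description above does not state explicitly; the function name and the sample tag are made up for the example.

```python
def parole_to_padded(tag, width=9):
    """Replace '@' with 'W' and pad with dashes to a uniform width."""
    converted = tag.replace("@", "W")
    if len(converted) > width:
        raise ValueError(f"tag longer than {width} characters: {tag!r}")
    # Assumption: padding goes on the right-hand side.
    return converted.ljust(width, "-")


# "AB@" is a made-up tag, used only to show the transformation.
PADDED = parole_to_padded("AB@")
```

Here `PADDED` is `"ABW------"`. Because the i-th position does not always encode the same feature, padding alone does not make the tags positional.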
+ | |||
+ | No reliable statistics of work time; estimated 8 hours | ||
+ | |||
+ | ===== Time needed for tag set conversion ===== | ||
+ | |||
+ | Some records about targeted tagset conversion for given tagset pairs, done in early 2006: | ||
+ | |||
+ | Russian treebank (not only the tags but the whole | ||
+ | 12:36 | ||
+ | |||
+ | Arabic tags (both Ota's and Buckwalter's, | ||
+ | 4:45+1+1:40 = 7:25 | ||
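Sums of work-time records such as the one above can be checked mechanically. The sketch below assumes that a bare number like `1` means whole hours and that every other item is written as `h:mm`; the function names are illustrative.

```python
def to_minutes(text):
    """Convert 'h:mm' or a bare hour count 'h' to minutes."""
    hours, _, minutes = text.strip().partition(":")
    return int(hours) * 60 + int(minutes or 0)


def sum_durations(expression):
    """Add up '+'-separated durations and format the total as h:mm."""
    total = sum(to_minutes(part) for part in expression.split("+"))
    return f"{total // 60}:{total % 60:02d}"


TOTAL = sum_durations("4:45+1+1:40")
```

In minutes this is 285 + 60 + 100 = 445, so `TOTAL` is `"7:25"`, which agrees with the recorded sum.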