Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2008/03/31 22:14] zeman de::conll |
user:zeman:interset:drivers [2009/02/20 15:10] zeman |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Tag Set Drivers ====== | ====== Tag Set Drivers ====== | ||
- | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. | + | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort. |
+ | |||
+ | ===== Arabic (ar) ===== | ||
+ | |||
+ | The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | ||
+ | |||
+ | Created in 2006-2007. | ||
+ | Total work time: 13 hours | ||
+ | |||
+ | ===== Bulgarian (bg) ===== | ||
+ | |||
+ | The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. A special feature of this tagset is its sophisticated system of pronouns, which includes interrogative adverbs and numerals. | ||
+ | |||
+ | Created in 2007. | ||
+ | Total work time: 35 hours | ||
+ | |||
+ | The main reasons why the implementation took so long: | ||
+ | * Necessity to re-work the system of main word classes, especially pronouns. | ||
+ | * Necessity to separate morphological and lexical definiteness (some indefinite pronouns are morphologically definite, and vice versa). | ||
+ | * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense). | ||
+ | * Driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent, | ||
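The round-trip requirement encode(decode(x))=x from the list above can be illustrated with a minimal sketch. It is written in Python for illustration only; the function names, the dictionary-based feature structure and the toy driver are hypothetical and are not the DZ Interset API.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical stand-in for a decoded feature structure.
FeatureStructure = Dict[str, str]


def find_round_trip_failures(
    tags: Iterable[str],
    decode: Callable[[str], FeatureStructure],
    encode: Callable[[FeatureStructure], str],
) -> List[Tuple[str, str]]:
    """Return (original, re-encoded) pairs for which encode(decode(x)) != x."""
    failures = []
    for tag in tags:
        result = encode(decode(tag))
        if result != tag:
            failures.append((tag, result))
    return failures


# Toy driver (an assumption for this example): tags look like "N-SG".
def toy_decode(tag: str) -> FeatureStructure:
    pos, _, number = tag.partition("-")
    return {"pos": pos.lower(), "number": number.lower()}


def toy_encode(fs: FeatureStructure) -> str:
    return f"{fs['pos'].upper()}-{fs['number'].upper()}"


# "n-sg" stands for an inconsistently written tag in the source data:
# it decodes without trouble but is re-encoded in the canonical spelling.
failures = find_round_trip_failures(["N-SG", "N-PL", "n-sg"], toy_decode, toy_encode)
print(failures)
```

A tester of this kind reports inconsistently written tags as failures even when the driver itself is correct, which is one way inconsistent input tags can make debugging slow.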
===== Chinese (zh) ===== | ===== Chinese (zh) ===== | ||
Line 38: | Line 58: | ||
More than half of the time was consumed during testing for tuning tags containing the Sem feature. | More than half of the time was consumed during testing for tuning tags containing the Sem feature. | ||
+ | |||
+ | ==== Multext ==== | ||
+ | |||
+ | The tagset of the MULTEXT-EAST project and corpora. The file '' | ||
+ | |||
+ | Work started: 16.2.2009 | ||
+ | Work finished: 18.2.2009 | ||
+ | Total work time: 16:36 h | ||
+ | |||
+ | Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the '' | ||
+ | |||
+ | ===== Danish (da) ===== | ||
+ | |||
+ | Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish. | ||
+ | |||
+ | Total work time: about 7 hours | ||
+ | |||
+ | ===== English (en) ===== | ||
+ | |||
+ | ==== Penn Treebank Tagset ==== | ||
+ | |||
+ | Penn Treebank (45 atomic tags). Detailed classification of punctuation. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== CoNLL Tagset (derived from Penn tags) ==== | ||
+ | |||
+ | The driver is just an envelope around the '' | ||
+ | |||
+ | Total work time: 48 minutes | ||
===== German (de) ===== | ===== German (de) ===== | ||
Line 59: | Line 109: | ||
Total work time: 10 min | Total work time: 10 min | ||
- | ===== Time needed for tag set conversion | + | ===== Portuguese (pt) ===== |
- | I keep a note of how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for an ordinary conversion is interesting, although I know that I will actually save time only when a driver is reused. | + | The Portuguese CoNLL treebank contains tags with 149 different features. A large part of them is noise, probably introduced by the conversion procedure from the original Floresta format |
- | Russian treebank (not only the tags, but the format conversion as a whole): | + | http:// |
- | 12:36 | + | http:// |
- | Arabic tags (both Ota's and Buckwalter's, | + | Work started: 2.4.2008 |
- | 4:45+1+1:40 = 7:25 | + | Work finished: 24.4.2008 |
+ | Total work time: 28:18 h | ||
- | Danish DDT/Parole tags (144 tags with an elaborate description) | + | The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex with many features, some of them strangely overlapping, |
- | about 7 hours | + | |
- | Swedish Mamba tags (48 tags) | + | | **Feature** | **Explanation** | **Examples** | |
- | about 3 hours | + | | _ | no features | prepositions, |
+ | | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira | | ||
+ | | 1S | 1st person singular | tenho, tinha, usei, vivo, vou | | ||
+ | | 1P | 1st person plural | tomámos, vamos, vemos, víamos | | ||
+ | | 2S | 2nd person singular | compreendeste, | ||
+ | | 2P | 2nd person plural | chamais, vós | | ||
+ | | 3S | 3rd person singular | viu, viva | | ||
+ | | 3S/P | 3rd person singular or plural | se, si | | ||
+ | | 3P | 3rd person plural | vivem | | ||
+ | | ACC | pronoun as direct accusative object | se, te, vos | | ||
+ | | ACC/DAT | pronouns in accusative or dative | nos, se | | ||
+ | | COND | verb in conditional mood | precisariam, | ||
+ | | DAT | pronoun as dative object | lhe, lhes, me, no, nos, se, vos | | ||
+ | | F | feminine | | | ||
+ | | F/M | feminine or masculine | | | ||
+ | | FUT | future tense of verbs | tenderão, tomará, usará | | ||
+ | | IMP | imperative mood of verbs | chega, move, olha, sê | | ||
+ | | IMPF | imperfect tense of verbs | abandonasse, | ||
+ | | IND | indicative mood of verbs | abafaram, abandonam, abate, abateu | | ||
+ | | M | masculine | açúcar, adepto, adiantado | | ||
+ | | M/F | masculine or feminine | Abidjan, cada, Chaves, especial | | ||
+ | | MQP | pluperfect past tense of verbs | acabara, defendera, existira, foram, quisera, viram | | ||
+ | | NOM | personal pronoun in nominative | ela, elas, ele, eles, eu, nós, vocês, você, vós | | ||
+ | | NOM/PIV | personal pronoun in nominative or prepositional object | ela, elas, ele, eles, nós, você | | ||
+ | | P | plural | 0,92, 14h00, africanos, águas, Amigos_da_Ilha_de_Santos | | ||
+ | | PIV | pronoun in prepositional object | ela, elas, ele, eles, mim, nós, si, ti, vós | | ||
+ | | PR | present tense of verbs | abandonam, abate, abonam, abordo, abra | | ||
+ | | PR/PS | present or past tense of verbs | conhecemos, conseguimos, | ||
+ | | PS | perfect past tense of verbs | abalou, abandonaram, | ||
+ | | PS/MQP | perfect or pluperfect past tense of verbs | abafaram, abriram, acabaram, aceitaram | | ||
+ | | S | singular | 1992, adicional, aditamento, aduaneira | | ||
+ | | S/P | singular or plural | capaz, Chaves, mais | | ||
+ | | SUBJ | subjunctive mood of verbs | abandonasse, | ||
+ | | <ALT> | indicates typo in word | | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <SUP> | superlative of adjectives and adverbs | inferior, máximo, melhor, mínimo, ótimo, péssimo, pior | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <dem> | demonstrative pronoun or adverb | este, isso, isto, o, os, tais, tal, tão | | ||
+ | | <det> | determiner usage / inflection of adverb | algo, meio, nada, quase, todo, um_tanto | | ||
+ | | < | ||
+ | | < | ||
+ | | <fmc> | verb heading finite main clause | | | ||
+ | | <foc> | focus marker, adverb or pronoun | é_que, foi, fomos, que, são, será | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <kc> | conjunctional adverb | agora, aí, bem_como, como, ora, tal_como, todavia | | ||
+ | | <ks> | adverb or preposition used like a subordinating conjunction | como, enquanto, onde, quando, segundo | | ||
+ | | <n> | other word class used as noun, typically as head of noun phrase | anglo-americano, | ||
+ | | <poss | possessive determiner pronoun | meu, meus, minha, minhas, nossa, nossas, nosso, nossos, seu, seus, sua | | ||
+ | | < | ||
+ | | <prp> | other word class used as preposition | como, conforme, consoante, embora, segundo | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <rel> | relative pronoun or adverb | à_medida_que, | ||
+ | | < | ||
+ | | < | ||
+ | | <si> | reflexive usage of 3rd person possessive | seu, seus, sua, suas | | ||
+ | | <eg> | undocumented feature | 2 occurrences with cardinal numbers | | ||
+ | | <Eg> | undocumented feature | occurs with numbers, adjectives and pronouns | | ||
+ | | <Em> | undocumented feature | 6 occurrences with adjectives | | ||
+ | | <Es> | undocumented feature | 3 occurrences with adverbs and prepositions | | ||
+ | | <ink> | undocumented feature of finite verbs | está, havia, pode, tentou | | ||
+ | | < | ||
+ | | < | ||
+ | | N | undocumented feature of nouns and articles | 15 occurrences | | ||
+ | | <new> | undocumented feature | | | ||
+ | | <nil> | undocumented feature | | | ||
+ | | <obj> | undocumented feature | se | | ||
+ | | <p> | undocumented feature | 1 occurrence | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | > | noise; should be ignored | | | ||
+ | | 0/1/3S | noise; should probably be 1/3S | | | ||
+ | | 1 | noise; should be 1S | aproveitaria, | ||
+ | | 1S> | noise; should be 1S | meu, meus, minha, minhas | | ||
+ | | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos | | ||
+ | | 2S> | noise; should be 2S | seu, teu | | ||
+ | | 2P> | noise; should be 2P | vossa, vosso | | ||
+ | | 3S> | noise; should be 3S | seu, seus, sua, suas | | ||
+ | | 3S/P> | noise; should be 3S/P | seu, seus, sua | | ||
+ | | 3P> | noise; should be 3P | seu, seus, sua | | ||
+ | | <adv> | noise? | fundo | | ||
+ | | < | ||
+ | | < | ||
+ | | > | ||
+ | | < | ||
+ | | convidado-> | ||
+ | | < | ||
+ | | < | ||
+ | | <corr | noise; should be <ALT> | | | ||
+ | | < | ||
+ | | <Eg>F | noise; should be two features | | | ||
+ | | <Eg>M | noise; should be two features | | | ||
+ | | <F | noise; should be F | | | ||
+ | | GER | noise; redundant gerund marker | 1 occurrence with v-ger | | ||
+ | | < | ||
+ | | INF | noise; redundant infinitive marker | 2 occurrences with < | ||
+ | | 'Maio | noise | Maio | | ||
+ | | MVF | noise; should be MV and F | motivada | | ||
+ | | NUM | noise; redundant numeral marker | 1994 | | ||
+ | | pasando> | noise; should be <ALT> | passando | | ||
+ | | PCP | noise; redundant participle marker | 2 occurrences | | ||
+ | | < | ||
+ | | < | ||
+ | | PROP | noise | 2 occurrences | | ||
+ | | < | ||
+ | | < | ||
+ | | R | noise; should be PR | 2 occurrences | | ||
+ | | recohidas> | ||
+ | | < | ||
+ | | s | noise; should be S | | | ||
+ | | saiem> | noise; should be <ALT> | saem | | ||
+ | | < | ||
+ | | < | ||
+ | | <sc> | noise; should be < | ||
+ | | subordinanda> | ||
+ | | V | noise; redundant verb marker | | | ||
+ | | < | ||
+ | | VFIN | noise | há from haver | ||
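The noise rows in the table above suggest a clean-up step before decoding. The following is a minimal Python sketch of such a step, not the actual driver code. It assumes the features arrive as a single string separated by vertical bars, with `_` meaning no features; the mapping covers only a handful of rows from the table, and the names `NOISE_FIXES` and `normalize_features` are hypothetical.

```python
from typing import Dict, List

# Repairs taken from the "noise" rows of the table; an empty list means
# that the feature is dropped (ignored or redundant).
NOISE_FIXES: Dict[str, List[str]] = {
    "1": ["1S"],
    "1S>": ["1S"],
    "1P>": ["1P"],
    "2S>": ["2S"],
    "2P>": ["2P"],
    "3S>": ["3S"],
    "3S/P>": ["3S/P"],
    "3P>": ["3P"],
    "0/1/3S": ["1/3S"],   # the table says "probably"
    "<F": ["F"],
    "s": ["S"],
    "R": ["PR"],
    "MVF": ["MV", "F"],
    "<Eg>F": ["<Eg>", "F"],
    "<Eg>M": ["<Eg>", "M"],
    "<corr": ["<ALT>"],
    ">": [],
    "GER": [],
    "INF": [],
    "NUM": [],
    "PCP": [],
    "V": [],
}


def normalize_features(feats: str) -> List[str]:
    """Split a feature string and repair or drop the known noisy values."""
    if feats == "_":
        return []
    cleaned: List[str] = []
    for feature in feats.split("|"):
        # Unknown features pass through unchanged.
        for fixed in NOISE_FIXES.get(feature, [feature]):
            if fixed not in cleaned:
                cleaned.append(fixed)
    return cleaned


print(normalize_features("MVF|3S>|IND"))
print(normalize_features("V|PR|3S|IND"))
```

Keeping the repairs in one lookup table separates the noise handling from the decoding logic, so a newly discovered noisy value only needs one more entry.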
- | Penn Treebank | + | ===== Swedish |
- | about 3 hours, but I was not yet measuring the time here, so this is only a rough retrospective estimate | + | |
- | Hajič's Swedish tags | + | ==== Mamba and CoNLL ==== |
- | 0:32 - the statistics here are evidently incomplete | + | |
- | Arabic tags | + | Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but detailed classification of auxiliary and modal verbs and punctuation. |
- | 4: | + | |
- | Bulgarian CoNLL tags | + | Total work time: about 3 hours |
- | 0:20+1: | + | |
- | (but with Bulgarian I struggled quite a lot with phenomena that Interset had not covered until then) | + | |
- | English CoNLL tags | + | ==== Tags of Hajič's Swedish tagger ==== |
- | 0:48 - the statistics may be incomplete here, but perhaps not, because it was enough to adapt the existing Penn Treebank driver, wasn't it? | + | |
- | None of the conversions listed above | + | Based on PAROLE Swedish tagset but some characters different |
+ | |||
+ | No reliable statistics of work time; estimated 8 hours | ||
+ | |||
+ | ===== Time needed for tag set conversion ===== | ||
+ | |||
+ | Some records about targeted tagset conversion for given tagset pairs, done in early 2006: | ||
+ | |||
+ | Russian treebank (not only the tags, but the format conversion as a whole): | ||
+ | 12:36 | ||
+ | |||
+ | Arabic tags (both Ota's and Buckwalter's, | ||
+ | 4:45+1+1:40 = 7:25 | ||