Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:interset:drivers [2008/03/31 22:14]
zeman de::conll
user:zeman:interset:drivers [2008/04/03 14:49]
zeman
Line 2: Line 2:
  
 This is an overview of existing tag set drivers. Tag-set or language specific issues are described here.
 +
 +===== Arabic (ar) =====
 +
 +The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
 +
 +Created in 2006-2007.
 +Total work time: 13 hours
 +
 +===== Bulgarian (bg) =====
 +
 +The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that includes interrogative adverbs and numerals.
 +
 +Created in 2007.
 +Total work time: 35 hours
 +
 +The main reasons why the implementation took so long:
 +  * Necessity to re-work the system of main word classes, especially pronouns.
 +  * Necessity to separate morphological and lexical definiteness (there are indefinite pronouns that are morphologically definite, and vice versa).
 +  * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense).
 +  * The driver tester required that encode(decode(x)) = x. However, the CoNLL incarnation of the tags was inconsistent in the order and form in which it presented features.
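
The round-trip requirement in the last item above can be sketched as follows. This is a minimal Python illustration, not the actual driver tester: the functions `decode` and `encode`, the tag layout, the feature names, and the sample tags are all hypothetical. It shows why a tag that lists its features in a non-canonical order fails the check even though it decodes to the same feature set.

```python
# Hypothetical toy driver: a tag looks like "POS|feat=val|feat=val".
CANONICAL_ORDER = ["gender", "number", "def"]  # assumed canonical feature order


def decode(tag):
    """Turn a tag string into a feature dictionary."""
    pos, *feats = tag.split("|")
    fs = {"pos": pos}
    for feat in feats:
        name, value = feat.split("=")
        fs[name] = value
    return fs


def encode(fs):
    """Turn a feature dictionary back into a tag, always in canonical order."""
    parts = [fs["pos"]]
    parts += [f"{name}={fs[name]}" for name in CANONICAL_ORDER if name in fs]
    return "|".join(parts)


def round_trip_failures(tags):
    """Return the tags for which encode(decode(x)) != x."""
    return [t for t in tags if encode(decode(t)) != t]


consistent = "N|gender=f|number=s"
inconsistent = "N|number=s|gender=f"  # same features, different order
failures = round_trip_failures([consistent, inconsistent])
```

Under these assumptions, only the second tag ends up in `failures`, although both tags decode to the same dictionary.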
  
 ===== Chinese (zh) =====
Line 38: Line 58:
  
 More than half of the time was consumed during testing for tuning tags containing the Sem feature.
 +
 +===== Danish (da) =====
 +
 +Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish.
 +
 +Total work time: about 7 hours
 +
 +===== English (en) =====
 +
 +==== Penn Treebank Tagset ====
 +
 +Penn Treebank (45 atomic tags). Detailed classification of punctuation.
 +
 +Total work time: about 3 hours
 +
 +==== CoNLL Tagset (derived from Penn tags) ====
 +
 +The driver is just an envelope around the ''en::penn'' driver.
 +
 +Total work time: 48 minutes
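
The "envelope" idea above can be sketched in Python. Everything here is an assumption for illustration: the three-entry `PENN_TABLE`, the feature names, and the `CPOS\tPOS\tFEATS` layout in which the POS column carries the Penn tag unchanged. It is not the code of the ''en::penn'' or ''en::conll'' drivers.

```python
# Hypothetical inner driver: maps a few Penn tags to feature dictionaries.
PENN_TABLE = {
    "NN": {"pos": "noun", "number": "sing"},
    "NNS": {"pos": "noun", "number": "plur"},
    "VBD": {"pos": "verb", "tense": "past"},
}


def penn_decode(tag):
    """Look up the features of a Penn tag (returns a copy)."""
    return dict(PENN_TABLE[tag])


def penn_encode(fs):
    """Find the Penn tag whose features match exactly."""
    for tag, features in PENN_TABLE.items():
        if features == fs:
            return tag
    raise ValueError(f"no Penn tag for {fs!r}")


# Envelope: unwrap the assumed CoNLL columns, then delegate to the inner driver.
def conll_decode(tag):
    _cpos, pos, _feats = tag.split("\t")
    return penn_decode(pos)


def conll_encode(fs):
    penn = penn_encode(fs)
    return f"{penn}\t{penn}\t_"  # assumed column layout
```

The wrapper only packs and unpacks columns, and all linguistic knowledge stays in the inner driver. That is consistent with the short work time reported above.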
  
 ===== German (de) =====
Line 59: Line 99:
 Total work time: 10 min
  
-===== Time needed for tag set conversion =====
 
-I am noting how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for an ordinary conversion is interesting, although I know that in reality I save time only when a driver is reused.
 +===== Portuguese (pt) =====
 
-Russian treebank (not only the tags but the whole format conversion):
-12:36
 +The Portuguese CoNLL treebank contains tags with 149 different features. A large part of them are noise, probably introduced by the conversion procedure from the original Floresta format to the CoNLL format. The driver is designed so that it accepts all incorrect tags on decoding but encodes only corrected tags. Incorrect tags are not on the list of possible tags, so the driver tester will not complain.
 +
 
-Arabic tags (both Ota's and Buckwalter's, still without Interset, 22 March 2006):
-4:45+1+1:40 = 7:25
 +| **Feature** | **Explanation** | **Examples** |
 +| _ | no features | prepositions, punctuation etc. |
 +| 1st person | |
 +| 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira |
 +| 1S | 1st person singular | tenho, tinha, usei, vivo, vou |
 +| 1P | 1st person plural | tomámos, vamos, vemos, víamos |
 +| 2S | 2nd person singular | compreendeste, queres, te, ti, veja, vives |
 +| 2P | 2nd person plural | chamais, vós |
 +| 3S | 3rd person singular | viu, viva |
 +| 3S/P | 3rd person singular or plural | se, si |
 +| 3P | 3rd person plural | vivem |
 +| ACC | pronoun as direct accusative object | se, te, vos |
 +| ACC/DAT | pronouns in accusative or dative | nos, se |
 +| > | noise; should be ignored | |
 +| 0/1/3S | noise; should probably be 1/3S | |
 +| 1S> | noise; should be 1S | meu, meus, minha, minhas |
 +| 1P> | noise; should be 1P | nossa, nossas, nosso, nossos |
 +| 2S> | noise; should be 2S | seu, teu |
 +| 2P> | noise; should be 2P | vossa, vosso |
 +| 3S> | noise; should be 3S | seu, seus, sua, suas |
 +| 3S/P> | noise; should be 3S/P | seu, seus, sua |
 +| 3P> | noise; should be 3P | seu, seus, sua |
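
The "accept noise on decoding, emit only corrected tags on encoding" design can be sketched in Python using the corrections listed in the table above. The function names, the `|` separator between features, and the list representation are assumptions made for this illustration, not the driver's actual code.

```python
# Corrections copied from the table; "0/1/3S" is only a probable fix there.
NOISE_FIXES = {
    ">": None,  # noise that should be ignored
    "0/1/3S": "1/3S",
    "1S>": "1S",
    "1P>": "1P",
    "2S>": "2S",
    "2P>": "2P",
    "3S>": "3S",
    "3S/P>": "3S/P",
    "3P>": "3P",
}


def decode(feats):
    """Split a feature string on '|' (assumed separator), fixing or dropping noise."""
    if feats == "_":
        return []
    cleaned = []
    for feature in feats.split("|"):
        fixed = NOISE_FIXES.get(feature, feature)
        if fixed is not None:
            cleaned.append(fixed)
    return cleaned


def encode(features):
    """Join corrected features; an empty list becomes '_' (no features)."""
    return "|".join(features) if features else "_"
```

Because `encode` never produces a noisy feature, a tester that checks only the list of corrected tags would still see encode(decode(x)) = x for every tag on that list.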
  
-Danish tags DDT/Parole (144 tags with an elaborate description):
-about 7 hours
 +===== Swedish (sv) =====
 +
 
-Swedish Mamba tags (48 tags):
-about 3 hours
 +==== Mamba and CoNLL ====
 +
 
-Penn Treebank (36 tags):
-about 3 hours, but I was not yet measuring the time here, so this is only a rough retrospective estimate
 +Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but a detailed classification of auxiliary and modal verbs and punctuation. The CoNLL driver is just an envelope around Mamba.
 +
 
-Hajič's Swedish tags:
-0:32 - the full statistics are evidently missing here
 +Total work time: about 3 hours
 +
 
-Arabic CoNLL tags:
-4:33+5:19+3:16 = 13:08
 +==== Tags of Hajič's Swedish tagger ====
 +
 
-Bulgarian CoNLL tags:
-0:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44 = 35:19
-(but with Bulgarian I struggled a lot with phenomena that Interset had not covered until then)
 +Based on the PAROLE Swedish tagset, but some characters are different (@ => W), and tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature).
 +
 
-English CoNLL tags:
-0:48 - statistics may be missing here, or maybe not, because it was enough to adapt the existing Penn Treebank driver, right?
 +No reliable statistics of work time; estimated 8 hours
 +
 
-None of the conversions listed above (i.e. everything written before October 2007) had the smart functions for replacing disallowed values at their disposal yet.
 +===== Time needed for tag set conversion =====
 + 
 +Some records of targeted tagset conversions for given tagset pairs, done in early 2006:
 +
 +Russian treebank (not only the tags but the whole format conversion):
 +12:36
 +
 +Arabic tags (both Ota's and Buckwalter's, still without Interset, 22 March 2006):
 +4:45+1+1:40 = 7:25
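
The totals in this section are sums of hour:minute entries. A small Python helper (hypothetical, written only for this illustration) can recheck them. It assumes that a bare number such as `1` means whole hours, which is the reading that matches the stated total.

```python
def to_minutes(entry):
    """Convert 'H:MM', or a bare hour count 'H', to minutes."""
    hours, _, minutes = entry.partition(":")
    return int(hours) * 60 + int(minutes or 0)


def total(expression):
    """Sum a '+'-separated list of durations and format the result as H:MM."""
    minutes = sum(to_minutes(part) for part in expression.split("+"))
    return f"{minutes // 60}:{minutes % 60:02d}"


arabic_2006 = total("4:45+1+1:40")
```

With that reading, `arabic_2006` is `"7:25"`, the figure recorded above.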
  
