Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:drivers [2008/04/03 14:16]
zeman Restructuralization.
user:zeman:interset:drivers [2008/04/03 14:27]
zeman Restructuralization.
Line 2: Line 2:
  
 This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. This is an overview of existing tag set drivers. Tag-set or language specific issues are described here.
 +
 +===== Arabic (ar) =====
 +
 +The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
 +
 +Created in 2006-2007.
 +Total work time: 13 hours
 +
 +===== Bulgarian (bg) =====
 +
 +The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that also includes interrogative adverbs and numerals.
 +
 +Created in 2007.
 +Total work time: 35 hours
 +
 +The main reasons why the implementation took so long:
 +  * The system of main word classes, especially pronouns, had to be reworked.
 +  * Morphological and lexical definiteness had to be separated (there are indefinite pronouns that are morphologically definite, and vice versa).
 +  * Morphological and lexical aspect had to be separated (aorist vs. imperfect tense; there are perfective verbs that can occur in the imperfect tense).
 +  * The driver tester required that encode(decode(x))=x, but the CoNLL incarnation of the tags was inconsistent in the order and form in which it presented features.
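The round-trip requirement from the last bullet can be illustrated with a minimal sketch. The feature-string format (`name=value` pairs joined by `|`, `_` for no features) and the functions `decode`, `encode` and `roundtrip_failures` are assumptions made for this example only; they are not the actual Interset driver API. The sketch shows why inconsistent feature order in the source tags breaks encode(decode(x))=x even when decoding itself is correct.

```python
# Hypothetical tag format and function names; not the real Interset drivers.

def decode(tag):
    """Turn 'a=1|b=2' into {'a': '1', 'b': '2'}; '_' means no features."""
    if tag == "_":
        return {}
    return dict(item.split("=", 1) for item in tag.split("|"))


def encode(features):
    """Emit the features in one canonical (alphabetical) order."""
    if not features:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(features.items()))


def roundtrip_failures(tags):
    """Return the tags for which encode(decode(x)) != x."""
    return [tag for tag in tags if encode(decode(tag)) != tag]


# The second tag carries the same features as the first, in a different order.
tags = ["gen=m|num=s", "num=s|gen=m", "_"]
failures = roundtrip_failures(tags)
print(failures)
```

A tag that fails here is not necessarily decoded wrongly; it may only be written in a non-canonical order, so either the tester has to compare decoded feature sets or the input tags have to be normalized first.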
  
 ===== Chinese (zh) ===== ===== Chinese (zh) =====
Line 81: Line 101:
 ===== Swedish (sv) ===== ===== Swedish (sv) =====
  
-Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but detailed classification of auxiliary and modal verbs and punctuation.
+==== Mamba and CoNLL ====
 + 
 +Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but a detailed classification of auxiliary and modal verbs and of punctuation. The CoNLL driver is just a thin wrapper around the Mamba driver.
  
 Total work time: about 3 hours Total work time: about 3 hours
 +
 +==== Tags of Hajič's Swedish tagger ====
 +
 +Based on the Swedish PAROLE tagset, but some characters differ (@ => W) and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature).
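The two surface differences described above (the character substitution and the dash padding) can be sketched as follows. The function name `parole_to_padded` and the sample tags are assumptions for illustration; only the `@ => W` substitution and the length of 9 come from the description, and the tagger may differ from PAROLE in other ways not covered here.

```python
# Hypothetical helper; only "@ => W" and the length 9 are taken from the text.
TAG_LENGTH = 9


def parole_to_padded(tag):
    """Replace '@' with 'W' and pad the tag with dashes to 9 characters."""
    if len(tag) > TAG_LENGTH:
        raise ValueError(f"tag longer than {TAG_LENGTH} characters: {tag!r}")
    return tag.replace("@", "W").ljust(TAG_LENGTH, "-")


# Illustrative PAROLE-style tags (assumed examples, not taken from the page).
print(parole_to_padded("NCUSN@IS"))
print(parole_to_padded("V@IPAS"))
```

Because the padding is purely positional, a decoder still has to interpret each position relative to the part of speech in the first character, which is what the remark about the i-th position warns about.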
 +
 +No reliable statistics of work time; estimated 8 hours
  
 ===== Time needed for tag set conversion ===== ===== Time needed for tag set conversion =====
  
-I am noting how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for an ordinary conversion is interesting, although I know that I will actually save time only when a driver is reused.
+Some records about targeted tagset conversion for given tagset pairs, done in early 2006:
  
 Russian treebank (not just the tags, but the format conversion as a whole): Russian treebank (not just the tags, but the format conversion as a whole):
Line 94: Line 122:
 Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006): Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
 4:45+1+1:40 = 7:25 4:45+1+1:40 = 7:25
- 
-Hajič's Swedish tags
-0:32 - complete statistics are obviously missing here
- 
-Arabic CoNLL tags
-4:33+5:19+3:16 = 13:08
- 
-Bulgarian CoNLL tags
-0:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44 = 35:19
-(but with Bulgarian I struggled a lot with phenomena that Interset had not covered until then)
- 
-None of the conversions listed above (i.e. everything written before October 2007) had the smart functions for replacing disallowed values available yet.
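The h:mm sums recorded in this section can be re-checked with a few lines of code. This is a minimal sketch; the helper names `to_minutes` and `sum_times` are made up for this example, and it assumes that an entry without a colon (the bare `1` in the Arabic record) means whole hours, which is the reading consistent with the stated total of 7:25.

```python
# Hypothetical helpers for re-adding the work-time records on this page.

def to_minutes(entry):
    """Convert 'h:mm' (or a bare 'h', assumed to be whole hours) to minutes."""
    hours, _, minutes = entry.partition(":")
    return int(hours) * 60 + int(minutes or 0)


def sum_times(expression):
    """Add up entries joined by '+' and format the total as h:mm."""
    total = sum(to_minutes(part) for part in expression.split("+"))
    return f"{total // 60}:{total % 60:02d}"


print(sum_times("4:45+1+1:40"))
print(sum_times("4:33+5:19+3:16"))
print(sum_times("0:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44"))
```

The three totals agree with the ones given in the records (7:25, 13:08 and 35:19), which also match the rounded figures of 13 and 35 hours quoted for the Arabic and Bulgarian drivers.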
  
