
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:interset:drivers [2009/09/08 18:18]
zeman Summary of pl::ipipan.
user:zeman:interset:drivers [2014/03/01 12:40]
zeman Slovenský národný korpus.
Line 4: Line 4:
  
 ===== Arabic (ar) =====
 +
 +==== CoNLL 2006 ====
  
 The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
Line 9: Line 11:
 Created in 2006-2007.
 Total work time: 13 hours
 +
 +==== CoNLL 2007 ====
 +
 +The Arabic tags in CoNLL 2007 differ slightly from those of CoNLL 2006, and there are also new tags. The driver ''ar::conll2007'' was cloned from ''ar::conll'' and modified.
 +
 +Created: 23.6.2011
 +Total work time: 2 hours
  
 ===== Bulgarian (bg) =====
Line 78: Line 87:
  
 Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the ''other'' feature led to wrong feature-value combinations in conversions to/from other Czech sets. DZ Interset had to be slightly modified in response to this tagset, and further changes initiated here will be made later.
 +
 +==== Prague Spoken Corpus ====
 +
 +The Prague Spoken Corpus (Pražský mluvený korpus, PMK) is distributed together with the frequency dictionary of spoken Czech (a book). It uses very unusual tags, and a great many of them (over 10,000). An extremely high proportion of the tags has to rely on the ''other'' feature. There are two types of tags: long and short.
 +
 +Work started: 26.11.2009
 +Work finished: 4.10.2010
 +Total work time: 57 hours
  
 ===== Danish (da) =====
Line 135: Line 152:
 Work finished: 6.4.2009
 Total work time: 9:39 h
- 
- 
- 
  
 ===== Polish (pl) =====
Line 310: Line 324:
 | <vfin> | noise; should be <co-vfin> | |
 | VFIN | noise | há, from haver |
 +
 +===== Slovak (sk) =====
 +
 +==== Slovenský národný korpus (SNK) ====
 +
 +The tagset comprises 1457 structured tags.
 +
 +Total work time: 5:32 hours.
  
 ===== Swedish (sv) =====
