Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:drivers [2009/03/24 11:23]
zeman CoNLL 2009.
user:zeman:interset:drivers [2014/07/17 16:32] (current)
zeman hr::multext
Line 4: Line 4:
  
 ===== Arabic (ar) ===== ===== Arabic (ar) =====
 +
 +==== CoNLL 2006 ====
  
 The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
Line 9: Line 11:
 Created in 2006-2007. Created in 2006-2007.
 Total work time: 13 hours Total work time: 13 hours
 +
 +==== CoNLL 2007 ====
 +
 +The Arabic tags in CoNLL 2007 differ slightly from those of 2006, and there are also new tags. The driver ''ar::conll2007'' was cloned from ''ar::conll'' and modified.
 +
 +Created: 23.6.2011
 +Total work time: 2 hours
  
 ===== Bulgarian (bg) ===== ===== Bulgarian (bg) =====
Line 36: Line 45:
  
 Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes. Most of the time was dedicated to extracting, transcribing and translating examples in an effort to understand the tag classes.
 +
 +===== Croatian (hr) =====
 +
 +==== Multext ====
 +
 +The tagset of the MULTEXT-EAST project as used in the SETimes.HR corpus. The documentation lists 1291 tags; we removed one wrong tag and kept 1290.
 +
 +Work started: 16.7.2014
 +Work finished: 17.7.2014
 +Total work time: 5:45 h
 +
 +This is the second Multext-East tagset covered by DZ Interset. Adding it was not too difficult because much of the previous effort on ''cs::multext'' could be reused.
  
 ===== Czech (cs) ===== ===== Czech (cs) =====
Line 62: Line 83:
  
 The [[:format-conll|CoNLL data format]] has changed. Formerly (2006 and 2007) there were three relevant columns (coarse-grained part of speech, fine-grained part of speech, and features) that we combined (using tabs) into one tag string. As of 2009, there are only two columns left, namely part of speech and features. For the Czech tags this further means that there is a new feature in the 'features' column. It is called ''SubPOS'', it is present in all tags and its value is one character, copied from the second position of the standard PDT tag. Otherwise, the tags should be identical to those of CoNLL 2007, including the ''Sem'' feature. The [[:format-conll|CoNLL data format]] has changed. Formerly (2006 and 2007) there were three relevant columns (coarse-grained part of speech, fine-grained part of speech, and features) that we combined (using tabs) into one tag string. As of 2009, there are only two columns left, namely part of speech and features. For the Czech tags this further means that there is a new feature in the 'features' column. It is called ''SubPOS'', it is present in all tags and its value is one character, copied from the second position of the standard PDT tag. Otherwise, the tags should be identical to those of CoNLL 2007, including the ''Sem'' feature.
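
The column handling described above can be sketched as follows. This is only an illustration in Python, not the code of the actual drivers: the function names ''combine_columns'' and ''with_subpos'', the sample tag and feature values, the placement of ''SubPOS'' at the start of the feature string, and the use of a tab to join the two 2009 columns are all assumptions made for the example.

```python
def combine_columns(*columns):
    """Join the relevant CoNLL columns with tabs into one tag string."""
    return "\t".join(columns)


def with_subpos(pdt_tag, features):
    """Return the feature string extended with SubPOS.

    The SubPOS value is the second character of the positional PDT tag.
    Putting SubPOS first is an assumption of this sketch.
    """
    if len(pdt_tag) < 2:
        raise ValueError("PDT tag is too short to have a second position")
    subpos = "SubPOS=" + pdt_tag[1]
    # "_" is the usual CoNLL placeholder for an empty column.
    if features in ("", "_"):
        return subpos
    return subpos + "|" + features


# Illustrative values only (hypothetical example token).
pdt_tag = "NNMS1-----A----"
features = "Gen=M|Num=S|Cas=1|Neg=A"

# 2006/2007 style: coarse POS, fine POS and features in one string.
tag_2006 = combine_columns("N", "N", features)

# 2009 style: POS and features only, with SubPOS added to the features.
tag_2009 = combine_columns("N", with_subpos(pdt_tag, features))

print(repr(tag_2006))
print(repr(tag_2009))
```

The only point of the sketch is that the information formerly carried by the fine-grained part-of-speech column survives in 2009 as the one-character ''SubPOS'' feature.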
 +
 +The ''Sem'' feature can have more values than previously. This is caused by the extension of the [[http://ufal.mff.cuni.cz/~zeman/publikace/2005-01/mmanual.html#lemma-term|term value set]] in the Prague Dependency Treebank 2.0 (in contrast to 1.0), so the change in fact already applied to the CoNLL 2007 data. However, CoNLL 2007 uses the older ''cs::conll'' driver.
  
 Work started: 24.3.2009 Work started: 24.3.2009
-Work finished: +Work finished: 24.3.2009 
-Total work time:+Total work time: 1:10 h
  
 ==== Multext ==== ==== Multext ====
Line 76: Line 99:
  
 Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the ''other'' feature led to wrong feature-value combinations in conversions to/from other Czech sets. DZ Interset had to be slightly modified in response to this tagset, and more changes that were initiated here will be made later. Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the ''other'' feature led to wrong feature-value combinations in conversions to/from other Czech sets. DZ Interset had to be slightly modified in response to this tagset, and more changes that were initiated here will be made later.
 +
 +==== Prague Spoken Corpus ====
 +
 +The Prague Spoken Corpus (Pražský mluvený korpus, PMK) is distributed together with the frequency dictionary of spoken Czech (book). It uses very strange tags, and very many of them (over 10000!). An extremely high proportion of the tags has to rely on the ''other'' feature. There are two types of tags: long and short.
 +
 +Work started: 26.11.2009
 +Work finished: 4.10.2010
 +Total work time: 57 hours
  
 ===== Danish (da) ===== ===== Danish (da) =====
Line 91: Line 122:
 Total work time: about 3 hours Total work time: about 3 hours
  
-==== CoNLL Tagset (derived from Penn tags) ====+==== CoNLL 2006 ====
  
 The driver is just an envelope around the ''en::penn'' driver. The driver is just an envelope around the ''en::penn'' driver.
  
 Total work time: 48 minutes Total work time: 48 minutes
 +
 +==== CoNLL 2009 ====
 +
 +Another envelope around the ''en::penn'' driver. However, three new tags required changes even in the older drivers: ''HYPH'', ''AFX'' (''PRF'') and ''NIL''.
 +
 +Work started: 25.3.2009
 +Work finished: 25.3.2009
 +Total work time: 2:57 h
  
 ===== German (de) ===== ===== German (de) =====
Line 109: Line 148:
 Total work time: 4:00 h Total work time: 4:00 h
  
-==== CoNLL (derived from STTS) ====+==== CoNLL 2006 ====
  
 Only a simple envelope around the STTS driver was needed. Only a simple envelope around the STTS driver was needed.
Line 116: Line 155:
 Work finished: 31.3.2008 Work finished: 31.3.2008
 Total work time: 10 min Total work time: 10 min
 +
 +
 +==== CoNLL 2009 ====
 +
 +This tagset is derived from the STTS, too. Unlike CoNLL 2006, there are also morphological features this time, which required additional processing effort.
 +
 +Work started: 5.4.2009
 +Work finished: 6.4.2009
 +Total work time: 9:39 h
 +
 +===== Polish (pl) =====
 +
 +Based on the [[http://korpus.pl/index.php|Korpus Języka Polskiego IPI PAN]]. (Saša needs to process these tags in Intercorp.) There was a moderate amount of new material, but this is one of the fairly complex Slavic tagsets. It also contributed to the [[how-to-write-a-driver#replacing-and-the-other-feature|new treatment of o-tags]] (those setting the ''other'' feature) when learning permitted feature-value combinations.
 +
 +Work started: 4.9.2009
 +Work finished: 8.9.2009
 +Total work time: 9:54 h
  
 ===== Portuguese (pt) ===== ===== Portuguese (pt) =====
Line 280: Line 336:
 | <vfin> | noise; should be <co-vfin> | | | <vfin> | noise; should be <co-vfin> | |
 | VFIN | noise | há from haver | | VFIN | noise | há from haver |
 +
 +===== Slovak (sk) =====
 +
 +==== Slovenský národný korpus (SNK) ====
 +
 +1457 structured tags.
 +
 +Total work time: 5:32 hours.
  
 ===== Swedish (sv) ===== ===== Swedish (sv) =====
