Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-use [2009/02/20 15:00]
zeman Level of headings.
user:zeman:interset:how-to-use [2009/09/04 14:05]
zeman CoNLL 2009.
Line 1: Line 1:
-====== Manual ======
+====== DZ Interset Manual ======

 ===== Installation =====
Line 14: Line 14:
   - tagset::bg::conll - Bulgarian CoNLL treebank
   - tagset::cs::conll - Czech CoNLL treebank, based on the Prague Dependency Treebank
+  - tagset::cs::conll2009 - Czech CoNLL 2009 treebank (new formatting)
   - tagset::cs::multext - Czech subset of the tagset from the Multext East project
   - tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank
   - tagset::da::conll - Danish CoNLL treebank
   - tagset::de::conll - German CoNLL treebank (one-to-one mapping to de::stts)
+  - tagset::de::conll2009 - German CoNLL 2009 treebank (STTS POS tags + additional morphological features)
   - tagset::de::stts - German: Stuttgart-Tübingen Tagset (Tiger treebank)
   - tagset::en::conll - English CoNLL treebank (one-to-one mapping to en::penn)
+  - tagset::en::conll2009 - English CoNLL 2009 treebank (new formatting, two new tags)
   - tagset::en::penn - English Penn Treebank
   - tagset::pt::conll - Portuguese CoNLL treebank (based on the Floresta treebank)
Line 39: Line 42:
  
 There is also the driver testing script, ''bin/driver-test.pl''. The distribution may contain some sample conversion scripts as well; however, these depend much more on the file format than on the tagset drivers, and thus you'll probably need to write your own anyway.
+

 ===== How to use the Interset =====
Line 44: Line 48:
 You can write your own tag conversion Perl script, and use the Interset driver library. You have to tell Perl where to find the drivers (the following commands work in ''csh''; you have to use different syntax under ''bash'' or in Windows command line):
  
-<code>setenv PERLLIB /home/zeman/projekty/interset/lib:$PERLLIB
-setenv PATH /home/zeman/projekty/interset/bin:$PATH</code>
+<code>setenv PERLLIB /home/zeman/interset/lib:$PERLLIB
+setenv PATH /home/zeman/interset/bin:$PATH</code>
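
For ''bash'' users, the same setup might look roughly like the sketch below. It is only an illustration, not part of the distribution: ''INTERSET_HOME'' is a hypothetical helper variable introduced here, and its default of ''$HOME/interset'' is an assumed install location that you should replace with the directory where you actually unpacked Interset.

```shell
# Hypothetical helper variable (not defined by Interset itself):
# point it at the directory where Interset was unpacked.
INTERSET_HOME="${INTERSET_HOME:-$HOME/interset}"

# Tell Perl where to find the tagset drivers.
# ${PERLLIB:+:$PERLLIB} appends the previous value only if PERLLIB was
# already set and non-empty, so no stray trailing colon is produced.
export PERLLIB="$INTERSET_HOME/lib${PERLLIB:+:$PERLLIB}"

# Make the scripts in bin/ (e.g. driver-test.pl) reachable from the shell.
export PATH="$INTERSET_HOME/bin:$PATH"
```

Putting these lines into ''~/.bashrc'' would make the settings persistent across sessions; whether that is desirable depends on your setup.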
  
 Once the variables are set, writing a conversion script is very easy. For instance, my ''csts-cs-pdt-en-penn.pl'' script (meaning "read and write [[:Formát CSTS|CSTS format]], read Czech PDT tags, write English Penn tags") essentially looks like this:
