
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-use [2008/03/26 08:59]
zeman cs::conll finished.
user:zeman:interset:how-to-use [2009/09/04 14:05]
zeman CoNLL 2009.
Line 1: Line 1:
-===== Manual =====
+====== DZ Interset Manual ======
 
+===== Installation =====
 
+If you are on the ÚFAL network, you can use Dan's version directly. Otherwise, you need to [[download]] a zipped package of the currently existing drivers. Unzip it to a convenient place; below, we assume it is in ''/home/zeman/interset''.
-==== Installation ====
-
-If you are on the ÚFAL network, you can use Dan's version directly. Otherwise, you need to [[mailto:zeman@ufal.mff.cuni.cz|ask Dan]] for a zipped package of the currently existing drivers. (I intend to maintain it here for download some time later.) Unzip it to a convenient place; below, we assume it is in ''/home/zeman/interset''.
  
 **Contributions welcome!** If you write your own driver, please share it with others! If you send it to me, I will add it to the package for download here.
  
-=== Existing drivers ===
+==== Existing drivers ====
  
 Note: This list may not be up-to-date. To see what drivers are currently available on your system, call ''driver-test.pl'' without arguments.
Line 16: Line 14:
   - tagset::bg::conll - Bulgarian CoNLL treebank
   - tagset::cs::conll - Czech CoNLL treebank, based on the Prague Dependency Treebank
+  - tagset::cs::conll2009 - Czech CoNLL 2009 treebank (new formatting)
+  - tagset::cs::multext - Czech subset of the tagset from the Multext East project
   - tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank
   - tagset::da::conll - Danish CoNLL treebank
+  - tagset::de::conll - German CoNLL treebank (one-to-one mapping to de::stts)
+  - tagset::de::conll2009 - German CoNLL 2009 treebank (STTS POS tags + additional morphological features)
+  - tagset::de::stts - German: Stuttgart-Tübingen Tagset (Tiger treebank)
   - tagset::en::conll - English CoNLL treebank (one-to-one mapping to en::penn)
+  - tagset::en::conll2009 - English CoNLL 2009 treebank (new formatting, two new tags)
   - tagset::en::penn - English Penn Treebank
+  - tagset::pt::conll - Portuguese CoNLL treebank (based on the Floresta treebank)
   - tagset::sv::conll - Swedish CoNLL treebank (one-to-one mapping to sv::mamba)
   - tagset::sv::hajic - Tags output by Swedish tagger by Jan Hajič
Line 26: Line 31:
   - tagset::zh::conll - Chinese CoNLL treebank
  
-=== Directory Structure ===
+==== Directory Structure ====
  
 The drivers are Perl modules and must be somewhere under ''$PERLLIB'' (''@INC''). Their root folder is ''tagset'' (this is what separates the tag set drivers from other Perl libraries). Subfolders of ''tagset'' are two-letter codes of languages ([[http://en.wikipedia.org/wiki/ISO_639-1|ISO 639-1]]). Some tagsets may be designed for more than one language but most are language-specific. PM files in language folders are drivers. Drivers are called xxx.pm, where xxx is the code name of the tagset. The driver xxx.pm for language ll should be accessible from Perl via
Line 39: Line 44:
  
  
-
-
-
-
-==== How to use the Interset ====
+===== How to use the Interset =====
  
 You can write your own tag conversion Perl script, and use the Interset driver library. You have to tell Perl where to find the drivers (the following commands work in ''csh''; you have to use different syntax under ''bash'' or in Windows command line):
  
-<code>setenv PERLLIB /home/zeman/projekty/interset/lib:$PERLLIB
-setenv PATH /home/zeman/projekty/interset/bin:$PATH</code>
+<code>setenv PERLLIB /home/zeman/interset/lib:$PERLLIB
+setenv PATH /home/zeman/interset/bin:$PATH</code>
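
For ''bash'', the ''csh'' commands above could be translated roughly as follows. This is only a sketch: it assumes the package was unzipped to ''/home/zeman/interset'', as elsewhere on this page, and the ''${VAR:+...}'' expansion is an addition that merely avoids a stray trailing colon when the variable was previously empty.

```shell
# bash equivalent of the csh setenv commands (sketch).
# Assumption: the Interset package lives in /home/zeman/interset.

# Prepend the driver library folder so that Perl can find the tagset:: modules.
export PERLLIB="/home/zeman/interset/lib${PERLLIB:+:$PERLLIB}"

# Prepend the script folder so that tools such as driver-test.pl are on the PATH.
export PATH="/home/zeman/interset/bin${PATH:+:$PATH}"
```

To make the setting persistent across sessions, the two lines can be placed in ''~/.bashrc''.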
  
 Once the variables are set, writing a conversion script is very easy. For instance, my ''csts-cs-pdt-en-penn.pl'' script (meaning "read and write [[:Formát CSTS|CSTS format]], read Czech PDT tags, write English Penn tags") essentially looks like this:
