Institute of Formal and Applied Linguistics Wiki



DZ Interset Manual

Installation

If you are on the ÚFAL network, you can use Dan's version directly here. Otherwise, you need to download a zipped package of the currently existing drivers. Unzip it to a convenient place; below, we assume it is in /home/zeman/interset.

Contributions welcome! If you write your own driver, please share it with others! If you send it to me, I will add it to the package for download here.

Existing drivers

Note: This list may not be up-to-date. To see what drivers are currently available on your system, call driver-test.pl without arguments.

  1. tagset::ar::conll - Arabic CoNLL treebank (coarse, fine and feat fields in one string, delimited by tabs)
  2. tagset::bg::conll - Bulgarian CoNLL treebank
  3. tagset::cs::conll - Czech CoNLL treebank, based on the Prague Dependency Treebank
  4. tagset::cs::conll2009 - Czech CoNLL 2009 treebank (new formatting)
  5. tagset::cs::multext - Czech subset of the tagset from the Multext East project
  6. tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank
  7. tagset::da::conll - Danish CoNLL treebank
  8. tagset::de::conll - German CoNLL treebank (one-to-one mapping to de::stts)
  9. tagset::de::conll2009 - German CoNLL 2009 treebank (STTS POS tags + additional morphological features)
  10. tagset::de::stts - German: Stuttgart-Tübingen Tagset (Tiger treebank)
  11. tagset::en::conll - English CoNLL treebank (one-to-one mapping to en::penn)
  12. tagset::en::conll2009 - English CoNLL 2009 treebank (new formatting, two new tags)
  13. tagset::en::penn - English Penn Treebank
  14. tagset::pt::conll - Portuguese CoNLL treebank (based on the Floresta treebank)
  15. tagset::sv::conll - Swedish CoNLL treebank (one-to-one mapping to sv::mamba)
  16. tagset::sv::hajic - Tags output by Swedish tagger by Jan Hajič
  17. tagset::sv::mamba - Swedish Mamba tags from Talbanken05 (CoNLL treebank)
  18. tagset::sv::svdahybrid - Dan's tagset, aiming at making distribution of tags from sv::hajic and da::conll as close as possible
  19. tagset::zh::conll - Chinese CoNLL treebank

Directory Structure

The drivers are Perl modules and must be located in a folder where Perl looks for modules (one of the folders in @INC, which you can extend via $PERLLIB). Their root folder is tagset (this is what separates the tag set drivers from other Perl libraries). Subfolders of tagset are named by two-letter language codes (ISO 639-1). Some tagsets may be designed for more than one language but most are language-specific. The .pm files in the language folders are the drivers. A driver is called xxx.pm, where xxx is the code name of the tagset. The driver xxx.pm for language ll should be accessible from Perl via

use tagset::ll::xxx;
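
The naming scheme above can be sketched in a few lines of Perl. This is only an illustration of how a package name maps to a file under one of the @INC folders; the helpers driver_path and find_driver are made up for this example and are not part of the Interset.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a package name into the relative path
# that Perl looks for, e.g. tagset::cs::pdt => tagset/cs/pdt.pm.
sub driver_path
{
    my $package = shift;
    (my $path = $package) =~ s{::}{/}g;
    return "$path.pm";
}

# Hypothetical helper: return the full path of the first matching file
# found in the @INC folders, or undef if the driver is not installed.
sub find_driver
{
    my $package = shift;
    my $relative = driver_path($package);
    foreach my $dir (@INC)
    {
        # @INC may also contain code references (hooks); skip those.
        next if(ref($dir));
        return "$dir/$relative" if(-f "$dir/$relative");
    }
    return;
}

print(driver_path('tagset::cs::pdt'), "\n");
my $found = find_driver('tagset::cs::pdt');
print(defined($found) ? "found: $found\n" : "not found in \@INC\n");
```

If the second line says the driver was not found, the folder that contains tagset is probably missing from $PERLLIB.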

Besides drivers, there is a library of useful functions that can be called from within drivers: tagset/common.pm.

There is also the driver testing script, bin/driver-test.pl. The distribution may contain some sample conversion scripts as well; however, these depend much more on the file format than on the tagset drivers, and thus you'll probably need to write your own anyway.

How to use the Interset

You can write your own tag conversion script in Perl and use the Interset driver library in it. You have to tell Perl where to find the drivers (the following commands work in csh; you have to use different syntax under bash or in the Windows command line):

setenv PERLLIB /home/zeman/interset/lib:$PERLLIB
setenv PATH /home/zeman/interset/bin:$PATH
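
If you would rather not depend on the shell environment, Perl's standard lib pragma can add the folder from within the script itself. The following is a minimal sketch, assuming the package was unzipped to /home/zeman/interset as above. It only affects where Perl looks for modules; it does not put the bin folder on your PATH.

```perl
use strict;
use warnings;

# Prepend the Interset library folder to @INC at compile time.
# The path is the one assumed in this manual; adjust it to your installation.
use lib '/home/zeman/interset/lib';

# From here on, a line such as "use tagset::cs::pdt;" is looked up
# in that folder before the standard library folders.
print(join("\n", @INC), "\n");
```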

Once the variables are set, writing a conversion script is easy. For instance, my csts-cs-pdt-en-penn.pl script (meaning “read and write CSTS format, read Czech PDT tags, write English Penn tags”) essentially looks like this:

use tagset::cs::pdt;
use tagset::en::penn;
 
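while(<>)
{
    # Cut the tag out of the line and leave a placeholder in its place.
    if(s/<t>([^<]+)/<_tag_to_convert_>/)
    {
        my $tag0 = $1;
        # Decode the PDT tag into features, then encode the features as a Penn tag.
        my $features = tagset::cs::pdt::decode($tag0);
        my $tag1 = tagset::en::penn::encode($features);
        # Put the converted tag where the placeholder is.
        s/<_tag_to_convert_>/<t>$tag1/;
    }
    print;
}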

Note the two-step replacement of the original tag. I do not dare to use the original tag in a regular expression because the tag could contain characters that have a special meaning there.
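
What goes wrong with special characters can be seen in a small self-contained sketch. The convert function is a stand-in invented for this illustration (it replaces the real decode and encode calls and always returns the dummy tag XX), and the input line is made up; only the PDT tag J^------------- with its ^ character is taken from the real tagset. The sketch also shows Perl's \Q...\E quoting as a possible alternative to the placeholder.

```perl
use strict;
use warnings;

# Hypothetical stand-in for decode() followed by encode().
# It always returns a dummy tag; it is not a real tag mapping.
sub convert
{
    my $tag = shift;
    return 'XX';
}

# A made-up CSTS-like line with a PDT tag that contains "^".
my $line = '<f>a<l>a<t>J^-------------';
my ($tag0) = $line =~ m/<t>([^<]+)/;
my $tag1 = convert($tag0);

# Naive: the tag is interpolated as a pattern, so "^" acts as an anchor
# and the pattern cannot match in the middle of the line.
(my $naive = $line) =~ s/<t>$tag0/<t>$tag1/;

# Quoted: \Q...\E escapes the metacharacters in the interpolated tag.
(my $quoted = $line) =~ s/<t>\Q$tag0\E/<t>$tag1/;

print("naive:  $naive\n");
print("quoted: $quoted\n");
```

The naive substitution leaves the line unchanged, while the quoted one replaces the tag. The placeholder approach used above avoids the problem altogether, because the original tag never appears in a pattern.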

Some operations performed by the drivers (especially when encoding) are not trivial. While you may not observe long processing times for toy runs, it might matter once you start converting millions of tags in a big corpus. Then you may want to take advantage of the fact that a tagset has only tens to thousands of distinct tags, and cache their translations as in the following example:

use tagset::cs::pdt;
use tagset::en::penn;

# Maps every source tag seen so far to its converted counterpart.
my %cache;
while(<>)
{
    if(s/<t>([^<]+)/<_tag_to_convert_>/)
    {
        my $tag0 = $1;
        my $tag1;
        if(exists($cache{$tag0}))
        {
            $tag1 = $cache{$tag0};
        }
        else
        {
            my $features = tagset::cs::pdt::decode($tag0);
            $tag1 = tagset::en::penn::encode($features);
            $cache{$tag0} = $tag1;
        }
        s/<_tag_to_convert_>/<t>$tag1/;
    }
    print;
}
