Institute of Formal and Applied Linguistics Wiki



DZ Interset Manual

Installation

If you work on the ÚFAL network and use Perl from PerlBrew, you probably already have Lingua::Interset available (depending on which Perl version you take from PerlBrew). Alternatively, you can point your `PERL5LIB` directly to Dan's version in /home/zeman/projekty/interset/lib. Otherwise, install Lingua::Interset from CPAN, e.g. using

cpanm Lingua::Interset
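
To confirm that the installation (or the `PERL5LIB` setting) worked, you can ask Perl whether it is able to load the module. The following is a minimal sketch; the helper name `module_available` is made up for illustration and is not part of Interset.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Interset): returns 1 if the named
# module can be loaded from the current @INC, 0 otherwise.
sub module_available
{
    my $module = shift;
    # Turn Lingua::Interset into Lingua/Interset.pm, the file name form
    # that require() accepts when given a string.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

if(module_available('Lingua::Interset'))
{
    print("Lingua::Interset is available.\n");
}
else
{
    print("Lingua::Interset was not found; check PERL5LIB or install it from CPAN.\n");
}
```

The check only tells you that some copy of the module is reachable, not which version or which drivers it contains.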

Contributions welcome! If you write your own driver, please share it with others! If you send it to me, I will add it to the package on CPAN.

Existing drivers

Use the tool bin/driver-test.pl from the package (call it without arguments) to list the tagsets/drivers currently available on your system.

Directory Structure

The drivers are Perl modules and must be located under one of the directories that Perl searches for modules (@INC, which can be extended via $PERLLIB or $PERL5LIB). Their root folder is Lingua/Interset/Tagset. Its subfolders are named after two-letter language codes (ISO 639-1), uppercased because of the convention that Perl module names start with an uppercase letter. Some tagsets are designed for more than one language (such as mul::uposf, used further below, which sits under the code mul), but most are language-specific. The .pm files in the language folders are the drivers. A driver is called Xxx.pm, where xxx is the code name of the tagset. The driver Xxx.pm for language ll should be accessible from Perl via

use Lingua::Interset::Tagset::LL::Xxx;

but usually it is more convenient to just call the main module and then refer to the tagset using the lowercased identifier:

use Lingua::Interset qw(decode);
my $fs = decode('ll::xxx', $tag);
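
The naming convention described above can be sketched in plain Perl. The two helpers below, `tagset_to_module` and `find_driver_files`, are hypothetical names introduced only for this illustration; they are not part of Interset and simply apply the folder layout as described here (uppercased language folder, capitalized driver file name). bin/driver-test.pl remains the supported way to list drivers.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: 'cs::pdt' -> 'Lingua::Interset::Tagset::CS::Pdt'.
sub tagset_to_module
{
    my $tagset = shift;
    my ($lang, $name) = split(/::/, $tagset, 2);
    return 'Lingua::Interset::Tagset::'.uc($lang).'::'.ucfirst(lc($name));
}

# Hypothetical helper: scans the given library directories for files
# matching Lingua/Interset/Tagset/LL/Xxx.pm and returns a hash reference
# that maps the lowercased identifier ('ll::xxx') to the file path.
sub find_driver_files
{
    my @libdirs = @_;
    my %found;
    foreach my $dir (@libdirs)
    {
        next if(ref($dir)); # @INC may contain code references (hooks)
        my $root = File::Spec->catdir($dir, 'Lingua', 'Interset', 'Tagset');
        opendir(my $dh, $root) or next;
        my @langs = grep {m/^[A-Z]+$/ && -d File::Spec->catdir($root, $_)} (readdir($dh));
        closedir($dh);
        foreach my $lang (@langs)
        {
            my $langdir = File::Spec->catdir($root, $lang);
            opendir(my $ldh, $langdir) or next;
            foreach my $file (readdir($ldh))
            {
                next unless($file =~ m/^([A-Z][A-Za-z0-9]*)\.pm$/);
                my $id = lc($lang).'::'.lc($1);
                # Keep the first hit, which is the one Perl itself would load.
                $found{$id} = File::Spec->catfile($langdir, $file) unless(exists($found{$id}));
            }
            closedir($ldh);
        }
    }
    return \%found;
}

# List whatever drivers are visible through the current @INC.
my $drivers = find_driver_files(@INC);
foreach my $tagset (sort(keys(%{$drivers})))
{
    print($tagset, "\t", tagset_to_module($tagset), "\n");
}
```

If no copy of Interset is reachable through @INC, the loop prints nothing.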

The main object in Interset is of the class Lingua::Interset::FeatureStructure, which provides various useful access methods. For details, see the documentation at https://metacpan.org/pod/Lingua::Interset
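
As a small illustration of the access methods, the sketch below decodes a Penn Treebank tag and reads two features. It assumes that Lingua::Interset and its en::penn driver are installed; the wrapper `describe_tag` is a hypothetical name used only here, and the demonstration is skipped if the module cannot be loaded. Please check the linked documentation for the authoritative list of methods.

```perl
use strict;
use warnings;

# Hypothetical wrapper: decodes a tag and returns two of its features
# as a plain hash reference. Assumes the requested driver is installed.
sub describe_tag
{
    my ($tagset, $tag) = @_;
    require Lingua::Interset;
    # decode() returns a Lingua::Interset::FeatureStructure object.
    my $fs = Lingua::Interset::decode($tagset, $tag);
    return {'pos' => $fs->pos(), 'number' => $fs->number()};
}

# Run the demonstration only if Interset can be loaded.
if(eval {require Lingua::Interset; 1})
{
    my $d = describe_tag('en::penn', 'NNS');
    print("pos=$d->{pos} number=$d->{number}\n");
}
```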

There is also the driver testing script, bin/driver-test.pl. The distribution may contain some sample conversion scripts as well; however, these depend much more on the file format than on the tagset drivers, and thus you'll probably need to write your own anyway.

How to use Interset

You can write your own tag conversion script in Perl on top of the Interset driver library. You may have to tell Perl where to find Interset. Perl consults PERLLIB only when PERL5LIB is not set, so extend PERL5LIB instead if your environment already defines it. The following commands work in csh; bash and the Windows command line use a different syntax:

setenv PERLLIB /home/zeman/projekty/interset/lib:$PERLLIB
setenv PATH /home/zeman/projekty/interset/bin:$PATH
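
If you would rather not depend on the shell environment, the standard `lib` pragma adds a directory to @INC from inside the script itself. A minimal sketch, reusing the library path from the commands above:

```perl
use strict;
use warnings;
# Prepend the Interset library directory to @INC at compile time.
use lib '/home/zeman/projekty/interset/lib';

# Show the first directory Perl will now search for modules.
print("$INC[0]\n");
```

This only affects where Perl looks for modules. To call bin/driver-test.pl without typing its full path, you still need the PATH setting.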

Once the variables are set, writing a conversion script is very easy. Here is an example (note that in CoNLL-X files we often merge the contents of the CPOS, POS and FEATS columns to create one long string that will be seen by Interset as one “tag”):

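use strict;
use warnings;
use Lingua::Interset::Converter;

# Identifier of the source tagset. 'cs::conll' is only an example value;
# use the driver that matches your data.
my $tagset1 = 'cs::conll';
my $c = Lingua::Interset::Converter->new('from' => $tagset1, 'to' => 'mul::uposf');

# Read the CoNLL-X file from STDIN or from files given as arguments.
while(<>)
{
    # Empty lines separate sentences and are passed through unchanged.
    unless(m/^\s*$/)
    {
        chomp();
        my @f = split(/\t/, $_);
        # Merge CPOS, POS and FEATS into the single string the driver sees as one tag.
        my $tag = "$f[3]\t$f[4]\t$f[5]";
        # The converted tag consists of the universal POS tag and the features, separated by a tab.
        my $utag = $c->convert($tag);
        my ($upos, $ufeat) = split(/\t/, $utag);
        $f[3] = $upos;
        $f[5] = $ufeat;
        $_ = join("\t", @f)."\n";
    }
    # Write the modified line to the standard output.
    print();
}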
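
The column handling in the script above can be tried out without Interset by isolating it in a function that receives the converter as a code reference. This is only a sketch: `convert_conllx_line` is a hypothetical helper, the stub converter is NOT an Interset driver, and the sample line is made up for illustration. With the real converter you would pass `sub { $c->convert($_[0]) }` instead of the stub.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites the CPOS and FEATS columns of one CoNLL-X
# token line. $convert is a code reference that maps the merged
# "CPOS\tPOS\tFEATS" string to "UPOS\tUFEATS".
sub convert_conllx_line
{
    my ($line, $convert) = @_;
    # Sentence-separating blank lines pass through unchanged.
    return $line if($line =~ m/^\s*$/);
    chomp($line);
    # Limit -1 keeps trailing empty columns.
    my @f = split(/\t/, $line, -1);
    my ($upos, $ufeat) = split(/\t/, $convert->(join("\t", @f[3..5])));
    @f[3, 5] = ($upos, $ufeat);
    return join("\t", @f)."\n";
}

# Stub converter for demonstration only: tags starting with N become NOUN,
# everything else becomes X, and no features are produced.
my $stub = sub
{
    my $tag = shift;
    return $tag =~ m/^N/ ? "NOUN\t_" : "X\t_";
};

# Made-up sample token line with the ten CoNLL-X columns.
my $in = "1\tdogs\tdog\tN\tNNS\tnum=pl\t2\tnsubj\t_\t_\n";
print(convert_conllx_line($in, $stub));
```

Keeping the converter behind a code reference also makes it easy to swap in a different target tagset without touching the file handling.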
