
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-use [2008/03/31 21:23]
zeman tagset::de::stts
user:zeman:interset:how-to-use [2017/01/16 13:06] (current)
zeman
Line 1: Line 1:
-===== Manual =====+====== DZ Interset Manual ======
  
 +===== Installation =====
  
 +If you are on the ÚFAL network and use Perl from PerlBrew, you probably already have ''Lingua::Interset'' available (depending on which Perl version you take from PerlBrew). Alternatively, you can point your ''PERL5LIB'' directly to Dan's version in ''/home/zeman/projekty/interset/lib''. Otherwise you can install ''Lingua::Interset'' from CPAN, e.g. using
  
-==== Installation ====+<code bash>cpanm Lingua::Interset</code>
  
-If you are on the ÚFAL network, you can use Dan's version here directly. Otherwise, you need to [[mailto:zeman@ufal.mff.cuni.cz|ask Dan]] for a zipped package of the currently existing drivers. (I intend to maintain it here for download some time later.) Unzip it to a convenient place; below, we assume it is in ''/home/zeman/interset''.+**Contributions welcome!** If you write your own driver, please share it with others! If you send it to me, I will add it to the package on CPAN.
  
-**Contributions welcome!** If you write your own driver, please share it with others! If you send it to me, I will add it to the package for download here.+==== Existing drivers ====
  
-=== Existing drivers ===+Use the tool ''bin/driver-test.pl'' from the package (call it without arguments) to list the tagsets/drivers currently available on your system.
  
-Note: This list may not be up-to-date. To see what drivers are currently available on your system, call ''driver-test.pl'' without arguments.+==== Directory Structure ====
  
-  - tagset::ar::conll - Arabic CoNLL treebank (coarse, fine and feat fields in one string, delimited by tabs) +The drivers are Perl modules and must be somewhere under ''$PERLLIB'' (''@INC''). Their root folder is ''Lingua/Interset/Tagset''. Subfolders of ''Tagset'' are two-letter codes of languages ([[http://en.wikipedia.org/wiki/ISO_639-1|ISO 639-1]]), uppercased (because of the convention that Perl modules start with an uppercase letter). Some tagsets may be designed for more than one language but most are language-specific. PM files in language folders are drivers. Drivers are called Xxx.pm, where xxx is the code name of the tagset. The driver Xxx.pm for language ll should be accessible from Perl via
-  - tagset::bg::conll - Bulgarian CoNLL treebank +
-  - tagset::cs::conll - Czech CoNLL treebank, based on the Prague Dependency Treebank +
-  - tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank +
-  - tagset::da::conll - Danish CoNLL treebank +
-  - tagset::de::stts - German: Stuttgart-Tübingen Tagset (Tiger treebank) +
-  - tagset::en::conll - English CoNLL treebank (one-to-one mapping to en::penn) +
-  - tagset::en::penn - English Penn Treebank +
-  - tagset::sv::conll - Swedish CoNLL treebank (one-to-one mapping to sv::mamba) +
-  - tagset::sv::hajic - Tags output by Swedish tagger by Jan Hajič +
-  - tagset::sv::mamba - Swedish Mamba tags from Talbanken05 (CoNLL treebank) +
-  - tagset::sv::svdahybrid - Dan's tagset, aiming at making distribution of tags from sv::hajic and da::conll as close as possible +
-  - tagset::zh::conll - Chinese CoNLL treebank+
  
-=== Directory Structure ===+<code perl> 
 +use Lingua::Interset::Tagset::LL::Xxx; 
 +</code>
  
-The drivers are Perl modules and must be somewhere under ''$PERLLIB'' (''@INC''). Their root folder is ''tagset'' (this is what separates the tag set drivers from other Perl libraries). Subfolders of ''tagset'' are two-letter codes of languages ([[http://en.wikipedia.org/wiki/ISO_639-1|ISO 639-1]]). Some tagsets may be designed for more than one language but most are language-specific. PM files in language folders are drivers. Drivers are called xxx.pm, where xxx is the code name of the tagset. The driver xxx.pm for language ll should be accessible from Perl via+but usually it is more convenient to just call the main module and then refer to the tagset using the lowercased identifier:
  
 <code perl>
-use tagset::ll::xxx;+use Lingua::Interset qw(decode);
 +my $fs = decode('ll::xxx', $tag);
 </code>
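
To illustrate the naming convention, here is a minimal sketch that maps a lowercased identifier to the module name described above. The helper ''driver_module_name'' is hypothetical and not part of ''Lingua::Interset''; it assumes that the language code is uppercased and that only the first letter of the tagset name is capitalized, as in ''LL::Xxx''.

<code perl>
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Interset): turn a lowercased
# tagset identifier such as 'cs::pdt' into the expected driver module name.
sub driver_module_name
{
    my $id = shift;
    my ($lang, $name) = split(/::/, $id);
    die("Expected an identifier of the form 'll::xxx', got '$id'")
        unless(defined($lang) && defined($name));
    # Assumption: language code in uppercase, tagset name with initial capital.
    return 'Lingua::Interset::Tagset::'.uc($lang).'::'.ucfirst(lc($name));
}

# Print the module name for a few identifiers.
foreach my $id ('cs::pdt', 'en::penn', 'mul::uposf')
{
    print("$id => ", driver_module_name($id), "\n");
}
</code>

In normal use you do not need such a mapping yourself, because the main module accepts the lowercased identifier directly.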
  
-Besides drivers, there is a library of useful functions that can be called from within drivers: ''tagset/common.pm''.+The main object in Interset is of the class ''Lingua::Interset::FeatureStructure'', which provides various useful access methods. For details, see the documentation at https://metacpan.org/pod/Lingua::Interset
  
 There is also the driver testing script, ''bin/driver-test.pl''. The distribution may contain some sample conversion scripts as well; however, these depend much more on the file format than on the tagset drivers, and thus you'll probably need to write your own anyway.
  
  
 +===== How to use the Interset =====
  
- +You can write your own Perl script to convert tags, and use the Interset driver library. You may have to tell Perl where to find Interset (the following commands work in ''csh''; you have to use different syntax under ''bash'' or in Windows command line):
- +
- +
-==== How to use the Interset ==== +
- +
-You can write your own tag conversion Perl script, and use the Interset driver library. You have to tell Perl where to find the drivers (the following commands work in ''csh''; you have to use different syntax under ''bash'' or in Windows command line):+
  
 <code>setenv PERLLIB /home/zeman/projekty/interset/lib:$PERLLIB
 setenv PATH /home/zeman/projekty/interset/bin:$PATH</code>
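
As an alternative to setting ''PERLLIB'' in the shell, a script can add the library folder to ''@INC'' itself with the standard ''lib'' pragma. The sketch below assumes the same installation path as the commands above. It does not affect ''PATH'', so ''driver-test.pl'' still has to be called by its full path or with ''PATH'' set.

<code perl>
use strict;
use warnings;

# Prepend the Interset library folder to @INC at compile time.
# Assumption: the path is the one used in the setenv commands above;
# adjust it to your own installation.
use lib '/home/zeman/projekty/interset/lib';

# List the folders where Perl will now look for modules.
print(join("\n", @INC), "\n");
</code>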
  
-Once the variable is set, writing a conversion script is very easy. For instance, my ''csts-cs-pdt-en-penn.pl'' script (meaning "read and write [[:Formát CSTS|CSTS format]], read Czech PDT tags, write English Penn tags") essentially looks like this:+Once the variable is set, writing a conversion script is very easy. Here is an example (note that in CoNLL-X files we often merge the contents of the CPOS, POS and FEATS columns to create one long string that will be seen by Interset as one “tag”):
  
 <code perl>
-use tagset::cs::pdt; +use Lingua::Interset::Converter;
-use tagset::en::penn;
 
-while(<>)+my $c = new Lingua::Interset::Converter ('from' => $tagset1, 'to' => 'mul::uposf');
-{+
-    if(s/<t>([^<]+)/<_tag_to_convert_>/)+
-    { +
-        my $tag0 = $1; +
-        my $features = tagset::cs::pdt::decode($tag0); +
-        my $tag1 = tagset::en::penn::encode($features); +
-        s/<_tag_to_convert_>/<t>$tag1/; +
-    } +
-    print; +
-}+
-</code> +
- +
-Note the two-step replacement of the original tag. I do not dare to use the original tag in a regular expression because there could be special characters in the tag. +
- +
-Some operations performed by the drivers (especially when encoding) are not trivial. While you may not observe long processing times for toy runs, it might matter once you start converting millions of tags in a big corpus. Then you may want to exploit the fact that there are only tens to thousands of distinct tags, and cache their translations like in the following example: +
- +
-<code perl> +
-use tagset::cs::pdt; +
-use tagset::en::penn;+
 
 +# Read the CoNLL-X file from STDIN or from files given as arguments.
 while(<>)
 {
-    if(s/<t>([^<]+)/<_tag_to_convert_>/)+    unless(m/^\s*$/)
     {
-        my $tag0 = $1;+        chomp();
-        my $tag1; +        my @f = split(/\t/, $_);
-        if(exists($cache{$tag0}))+        my $tag = "$f[3]\t$f[4]\t$f[5]";
-        {+        my $utag = $c->convert($tag);
-            $tag1 = $cache{$tag0};+        my ($upos, $ufeat) = split(/\t/, $utag);
-        }+        $f[3] = $upos;
-        else +        $f[5] = $ufeat;
-        { +        $_ = join("\t", @f)."\n";
-            my $features = tagset::cs::pdt::decode($tag0); +
-            $tag1 = tagset::en::penn::encode($features); +
-            $cache{$tag0} = $tag1;+
-        } +
-        s/<_tag_to_convert_>/<t>$tag1/;+
     }
-    print;+    # Write the modified line to the standard output.
 +    print();
 }
 </code>
  
