Differences
This shows you the differences between two versions of the page.
user:zeman:interset:how-to-use [2008/03/14 09:59] zeman Note CSH. |
user:zeman:interset:how-to-use [2017/01/16 13:06] (current) zeman |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ===== Manual ===== | + | ====== DZ Interset |
+ | ===== Installation ===== | ||
- | ==== Installation ==== | + | If you are on the ÚFAL network and use Perl from PerlBrew, you probably already have '' |
- | If you are on the ÚFAL network, you can use Dan's version here directly. Otherwise, you need to [[mailto:zeman@ufal.mff.cuni.cz|ask Dan]] for a zipped package of the existing drivers. (I intend to make it available here for download later.) Unzip it to a convenient place; below, we assume it is in ''/ | + | <code bash> |
- | **Contributions welcome!** If you write your own driver, please share it with others! If you send it to me, I will add it to the package | + | **Contributions welcome!** If you write your own driver, please share it with others! If you send it to me, I will add it to the package |
- | === Existing drivers === | + | ==== Existing drivers |
- | Note: This list may not be up-to-date. To see what drivers are currently available on your system, call '' | + | Use the tool '' |
- | * tagset:: | + | ==== Directory Structure ==== |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | * tagset:: | + | |
- | === Directory Structure === | + | The drivers are Perl modules and must be somewhere under '' |
- | The drivers are Perl modules and must be somewhere under '' | + | <code perl> |
+ | use Lingua:: | ||
+ | </code> | ||
+ | |||
+ | but usually it is more convenient to just call the main module and then refer to the tagset | ||
<code perl> | <code perl> | ||
- | use tagset::ll::xxx; | + | use Lingua::Interset qw(decode); |
+ | my $fs = decode(' | ||
</ | </ | ||
- | Besides drivers, there is a library | + | The main object in Interset |
There is also the driver testing script, '' | There is also the driver testing script, '' | ||
+ | ===== How to use the Interset ===== | ||
- | + | You can write your own Perl script | |
- | + | ||
- | + | ||
- | ==== How to use the Interset ==== | + | |
- | + | ||
- | You can write your own tag conversion | + | |
< | < | ||
setenv PATH / | setenv PATH / | ||
- | Once the variable is set, writing a conversion script is very easy. For instance, my '' | + | Once the variable is set, writing a conversion script is very easy. Here is an example (note that in CoNLL-X files we often merge the contents of the CPOS, POS and FEATS columns to create one long string that will be seen by Interset as one “tag”): |
<code perl> | <code perl> | ||
- | use tagset::cs::pdt; | + | use Lingua::Interset::Converter; |
- | use tagset:: | + | |
- | while(<> | + | my $c = new Lingua::Interset::Converter |
- | { | + | |
- | if(s/< | + | |
- | { | + | |
- | | + | |
- | my $features = tagset::cs::pdt::decode($tag0); | + | |
- | my $tag1 = tagset:: | + | |
- | s/< | + | |
- | } | + | |
- | print; | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | Note the two-step replacement of the original tag. I do not dare to use the original tag in a regular expression because there could be special characters in the tag. | + | |
- | + | ||
- | Some operations performed by the drivers (especially when encoding) are not trivial. You may not notice long processing times in toy runs, but they can matter once you start converting millions of tags in a big corpus. A tagset typically contains only tens to thousands of distinct tags, so you can cache their translations, as in the following example: | + |
- | + | ||
- | <code perl> | + | |
- | use tagset::cs::pdt; | + | |
- | use tagset:: | + | |
+ | # Read the CoNLL-X file from STDIN or from files given as arguments. | ||
while(<> | while(<> | ||
{ | { | ||
- | | + | |
{ | { | ||
- | | + | |
- | my $tag1; | + | my @f = split(/\t/, $_); |
- | if(exists($cache{$tag0})) | + | |
- | | + | my $utag = $c-> |
- | | + | my ($upos, $ufeat) |
- | | + | $f[3] = $upos; |
- | else | + | |
- | { | + | |
- | | + | |
- | $tag1 = tagset:: | + | |
- | $cache{$tag0} | + | |
- | | + | |
- | | + | |
} | } | ||
- | print; | + | |
+ | | ||
} | } | ||
</ | </ | ||
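
The two ideas discussed above, merging the CPOS, POS and FEATS columns of a CoNLL-X line into one "tag" string and caching the translation of each distinct tag, can be sketched without any Interset call. This is only an illustration under stated assumptions: ''convert_tag'' is a hypothetical stand-in for a real conversion (it just uppercases its input), the helper names ''merged_tag'' and ''convert_cached'' are invented here, and the tab used to join the three columns is an assumption, not something this page confirms.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real tag conversion; NOT part of Interset.
# It counts its calls so that the effect of the cache is visible.
my $calls = 0;
sub convert_tag
{
    my $tag = shift;
    $calls++;
    return uc($tag);
}

# Each distinct tag is converted once; repeated tags are served from the hash.
my %cache;
sub convert_cached
{
    my $tag = shift;
    $cache{$tag} = convert_tag($tag) unless(exists($cache{$tag}));
    return $cache{$tag};
}

# Merge CPOS, POS and FEATS (fields 3 to 5, counting from 0) of one CoNLL-X
# line into a single string. The tab separator is an assumption.
sub merged_tag
{
    my $line = shift;
    $line =~ s/\r?\n$//;
    my @f = split(/\t/, $line);
    return join("\t", @f[3..5]);
}

# Three made-up token lines; the first and the third carry the same tag.
my @lines =
(
    "1\tpes\tpes\tN\tN\tNum=S|Cas=1\t2\tSb\t_\t_\n",
    "2\tspal\tspat\tV\tp\tNum=S\t0\tPred\t_\t_\n",
    "3\tkocour\tkocour\tN\tN\tNum=S|Cas=1\t2\tSb\t_\t_\n"
);
foreach my $line (@lines)
{
    my $converted = convert_cached(merged_tag($line));
    print("$converted\n");
}
print("conversions computed: $calls\n");
```

In a real script, the body of ''convert_tag'' would be replaced by the actual conversion; the cache only pays off because a corpus repeats the same few tags many times.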