
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-use [2008/03/10 14:25]
zeman How to get the list of drivers.
user:zeman:interset:how-to-use [2008/03/14 09:53]
zeman New drivers.
Line 3: Line 3:
 ==== Installation ====
 
-If you are on the ÚFAL network, you can use Dan's version here directly. Otherwise, you need to [[mailto:zeman@ufal.mff.cuni.cz|ask Dan]] for a zipped package of the existing drivers. (I intend to make it available for download here later.) Unzip it to a convenient place; below, we assume it is in ''/home/zeman/lib/perl''
+If you are on the ÚFAL network, you can use Dan's version here directly. Otherwise, you need to [[mailto:zeman@ufal.mff.cuni.cz|ask Dan]] for a zipped package of the existing drivers. (I intend to make it available for download here later.) Unzip it to a convenient place; below, we assume it is in ''/home/zeman/interset''.
-
-**Note:** I decided to put the whole thing under version control. At the same time, I moved it from my lib folder to my project folder. So the current ÚFAL location is ''/home/zeman/projekty/interset/lib''.
 
 **Contributions welcome!** If you write your own driver, please share it with others! If you send it to me, I will add it to the package for download here.
Line 17: Line 15:
   * tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank
   * tagset::da::conll - Danish CoNLL treebank
-  * tagset::en::conll - English CoNLL treebank
+  * tagset::en::conll - English CoNLL treebank (one-to-one mapping to en::penn)
   * tagset::en::penn - English Penn Treebank
 +  * tagset::sv::conll - Swedish CoNLL treebank (one-to-one mapping to sv::mamba)
   * tagset::sv::hajic - Tags output by Swedish tagger by Jan Hajič
   * tagset::sv::mamba - Swedish Mamba tags from Talbanken05 (CoNLL treebank)
   * tagset::sv::svdahybrid - Dan's tagset, aiming at making distribution of tags from sv::hajic and da::conll as close as possible
 +  * tagset::zh::conll - Chinese CoNLL treebank
 
 === Directory Structure ===
Line 34: Line 34:
  
 There is also the driver testing script, ''bin/driver-test.pl''. The distribution may contain some sample conversion scripts as well; however, these depend much more on the file format than on the tagset drivers, and thus you'll probably need to write your own anyway.
 +
 +
  
  
Line 64: Line 66:
  
 Note the two-step replacement of the original tag. I do not dare to use the original tag in a regular expression because there could be special characters in the tag.
 +
 +Some operations performed by the drivers (especially encoding) are not trivial. You may not notice long processing times on toy runs, but they can matter once you start converting millions of tags in a big corpus. You can then take advantage of the fact that a tagset contains only tens to thousands of distinct tags, and cache their translations, as in the following example:
 +
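 +<code perl>
 +use tagset::cs::pdt;
 +use tagset::en::penn;
 +
 +# Cache of already converted tags: source tag => target tag.
 +my %cache;
 +
 +while(<>)
 +{
 +    # Step 1: cut the original tag out and leave a placeholder in its place.
 +    if(s/<t>([^<]+)/<_tag_to_convert_>/)
 +    {
 +        my $tag0 = $1;
 +        my $tag1;
 +        if(exists($cache{$tag0}))
 +        {
 +            # Seen before: reuse the stored translation.
 +            $tag1 = $cache{$tag0};
 +        }
 +        else
 +        {
 +            # First occurrence: decode, encode, and remember the result.
 +            my $features = tagset::cs::pdt::decode($tag0);
 +            $tag1 = tagset::en::penn::encode($features);
 +            $cache{$tag0} = $tag1;
 +        }
 +        # Step 2: put the converted tag in place of the placeholder.
 +        s/<_tag_to_convert_>/<t>$tag1/;
 +    }
 +    print;
 +}
 +</code>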
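
The caching idea can also be kept separate from the file-reading loop. The following is only a sketch under stated assumptions: ''make_cached_converter'', ''decode_stub'' and ''encode_stub'' are hypothetical names that are not part of the driver package. The two stubs merely stand in for a driver's ''decode'' and ''encode'' so that the sketch runs without the package installed.

<code perl>
use strict;
use warnings;

# Hypothetical stand-ins for a pair of drivers (NOT part of the package).
# The "decoder" keeps only the first character of the tag as a feature.
my $calls = 0;
sub decode_stub
{
    my $tag = shift;
    $calls++; # count how often the expensive step really runs
    return { pos => lc(substr($tag, 0, 1)) };
}
sub encode_stub
{
    my $features = shift;
    return uc($features->{pos});
}

# Returns a closure that converts one tag and remembers every result,
# so each distinct tag is decoded and encoded only once.
sub make_cached_converter
{
    my ($decode, $encode) = @_;
    my %cache;
    return sub
    {
        my $tag = shift;
        unless(exists($cache{$tag}))
        {
            $cache{$tag} = $encode->($decode->($tag));
        }
        return $cache{$tag};
    };
}

my $convert = make_cached_converter(\&decode_stub, \&encode_stub);
my @converted = map { $convert->($_) } qw(NNMS1 VB-S--- NNMS1 NNMS1);
print join(' ', @converted), "\n";   # prints "N V N N"
print "decode called $calls times\n"; # prints "decode called 2 times"
</code>

With the real drivers loaded, one would presumably pass ''\&tagset::cs::pdt::decode'' and ''\&tagset::en::penn::encode'' instead of the stubs. The cache lives inside the closure, so several converters (for different driver pairs) can coexist without sharing a global hash.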
  
