user:zeman:interset:how-to-use [2009/02/20 15:00] zeman Level of headings. |
user:zeman:interset:how-to-use [2009/09/04 14:05] zeman CoNLL 2009. |
====== Manual ====== | ====== DZ Interset Manual ====== |
| |
===== Installation ===== | ===== Installation ===== |
- tagset::bg::conll - Bulgarian CoNLL treebank | - tagset::bg::conll - Bulgarian CoNLL treebank |
- tagset::cs::conll - Czech CoNLL treebank, based on the Prague Dependency Treebank | - tagset::cs::conll - Czech CoNLL treebank, based on the Prague Dependency Treebank |
| - tagset::cs::conll2009 - Czech CoNLL 2009 treebank (new formatting) |
- tagset::cs::multext - Czech subset of the tagset from the Multext East project | - tagset::cs::multext - Czech subset of the tagset from the Multext East project |
- tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank | - tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank |
- tagset::da::conll - Danish CoNLL treebank | - tagset::da::conll - Danish CoNLL treebank |
- tagset::de::conll - German CoNLL treebank (one-to-one mapping to de::stts) | - tagset::de::conll - German CoNLL treebank (one-to-one mapping to de::stts) |
| - tagset::de::conll2009 - German CoNLL 2009 treebank (STTS POS tags + additional morphological features) |
- tagset::de::stts - German: Stuttgart-Tübingen Tagset (Tiger treebank) | - tagset::de::stts - German: Stuttgart-Tübingen Tagset (Tiger treebank) |
- tagset::en::conll - English CoNLL treebank (one-to-one mapping to en::penn) | - tagset::en::conll - English CoNLL treebank (one-to-one mapping to en::penn) |
| - tagset::en::conll2009 - English CoNLL 2009 treebank (new formatting, two new tags) |
- tagset::en::penn - English Penn Treebank | - tagset::en::penn - English Penn Treebank |
- tagset::pt::conll - Portuguese CoNLL treebank (based on the Floresta treebank) | - tagset::pt::conll - Portuguese CoNLL treebank (based on the Floresta treebank) |
| |
There is also the driver testing script, ''bin/driver-test.pl''. The distribution may contain some sample conversion scripts as well; however, these depend much more on the file format than on the tagset drivers, and thus you'll probably need to write your own anyway. | There is also the driver testing script, ''bin/driver-test.pl''. The distribution may contain some sample conversion scripts as well; however, these depend much more on the file format than on the tagset drivers, and thus you'll probably need to write your own anyway. |
| |
| |
===== How to use the Interset ===== | ===== How to use the Interset ===== |
You can write your own tag conversion script in Perl using the Interset driver library. First tell Perl where to find the drivers (the following commands work in ''csh''; ''bash'' and the Windows command line need different syntax): | You can write your own tag conversion script in Perl using the Interset driver library. First tell Perl where to find the drivers (the following commands work in ''csh''; ''bash'' and the Windows command line need different syntax): |
| |
<code>setenv PERLLIB /home/zeman/projekty/interset/lib:$PERLLIB | <code>setenv PERLLIB /home/zeman/interset/lib:$PERLLIB |
setenv PATH /home/zeman/projekty/interset/bin:$PATH</code> | setenv PATH /home/zeman/interset/bin:$PATH</code> |
| |
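For ''bash'' users, the ''csh'' commands above could be written roughly as follows. This is a minimal sketch, not part of the distribution: it assumes the install path shown in the newer revision of the example, and ''INTERSET_HOME'' is a hypothetical helper variable introduced only for illustration.

```shell
# Hypothetical helper variable; the path is assumed from the csh example. Adjust to your own checkout.
INTERSET_HOME=/home/zeman/interset

# Prepend the driver library to Perl's module search path.
# The ${PERLLIB:+...} form adds the separator only if PERLLIB was already set.
export PERLLIB="$INTERSET_HOME/lib${PERLLIB:+:$PERLLIB}"

# Prepend the scripts directory so that driver-test.pl can be found on the PATH.
export PATH="$INTERSET_HOME/bin:$PATH"
```

Putting these lines into ''~/.bashrc'' would make the setting persistent across sessions.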
Once the variables are set, writing a conversion script is very easy. For instance, my ''csts-cs-pdt-en-penn.pl'' script (meaning "read and write [[:Formát CSTS|CSTS format]], read Czech PDT tags, write English Penn tags") essentially looks like this: | Once the variables are set, writing a conversion script is very easy. For instance, my ''csts-cs-pdt-en-penn.pl'' script (meaning "read and write [[:Formát CSTS|CSTS format]], read Czech PDT tags, write English Penn tags") essentially looks like this: |