Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-write-a-driver [2008/03/14 10:06]
zeman %f was not defined here.
user:zeman:interset:how-to-write-a-driver [2008/03/14 10:21]
zeman Conversion testing.
Line 35: Line 35:
  
 The list is not necessary for the driver to work. However, it can be useful for [[#Test your driver|testing]] the driver. If no list is distributed along with the tagset description, you may still be able to acquire a partial list from a corpus.
 +
  
 ===== Alternative values =====
Line 65: Line 66:
 Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to ''grep'' your values rather than trying exact match, because you do not know what is going to come from other drivers, and the ordering or additional values may not be what matters.
  
-If the arrays turns out to be incompatible with what you expect, you should pick one value (we suggest you  take the first one) and proceed with default single-value processing.
+If the array turns out to be incompatible with what you expect, you should pick one value (we suggest you  take the first one) and proceed with default single-value processing.
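
The array handling described above could be sketched roughly as follows. This is only an illustration: the helper names ''has_value'' and ''single_value'' are hypothetical and are not part of ''tagset::common'', and a feature value is assumed to be either a plain scalar or an array reference.

<code perl>
use strict;
use warnings;

# Hypothetical helper (not part of tagset::common):
# true if $value, a scalar or an array reference, contains $wanted.
sub has_value
{
    my ($value, $wanted) = @_;
    return 0 unless defined($value);
    if(ref($value) eq 'ARRAY')
    {
        # Ordering and additional values do not matter; one match is enough.
        return scalar(grep { $_ eq $wanted } @{$value}) ? 1 : 0;
    }
    return $value eq $wanted ? 1 : 0;
}

# Hypothetical helper: fall back to single-value processing
# by taking the first element of an array value.
sub single_value
{
    my $value = shift;
    return ref($value) eq 'ARRAY' ? $value->[0] : $value;
}
</code>

With these helpers, ''has_value(['F', 'N'], 'N')'' is true regardless of the order of the values, and ''single_value(['F', 'N'])'' yields ''F''.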
  
 <code perl>
Line 94: Line 95:
  
 **Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, and if ''NNQW'' means a logical disjunction of the tags ''NNFS'' and ''NNNP'', then you cannot encode the situation in DZ Interset precisely. If you do not want to discard either ''NNFS'' or ''NNNP'' (by storing the other only), you can say that gender = ''F'' or ''N'' and number = ''S'' or ''P'' but by that you have also introduced ''NNFP'' and ''NNNS'' as possibilities. The approach may be revised in future.
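
The overgeneration described in the note can be reproduced with a small sketch. The function ''expand_alternatives'' is a hypothetical helper written for this illustration only; it is not part of DZ Interset. It combines the alternative values stored for each tag position independently, which is exactly why the unintended tags appear.

<code perl>
use strict;
use warnings;

# Hypothetical helper (not part of DZ Interset): combine the alternative
# values of each position independently and return all resulting tags.
sub expand_alternatives
{
    my ($prefix, @positions) = @_;
    my @tags = ($prefix);
    foreach my $alternatives (@positions)
    {
        my @extended;
        foreach my $tag (@tags)
        {
            push(@extended, map { $tag . $_ } @{$alternatives});
        }
        @tags = @extended;
    }
    return @tags;
}

# NNQW was meant to stand for NNFS or NNNP only.
my %intended = map { $_ => 1 } ('NNFS', 'NNNP');
my @tags = expand_alternatives('NN', ['F', 'N'], ['S', 'P']);
my @spurious = grep { !$intended{$_} } @tags;
print "All:      @tags\n";      # All:      NNFS NNFP NNNS NNNP
print "Spurious: @spurious\n";  # Spurious: NNFP NNNS
</code>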
 +
  
  
Line 108: Line 110:
  
 Building the list of permitted values is expensive (all tags must be decoded!) and you should do it only once when your driver initializes. In your ''BEGIN'' block, you should call ''tagset::common::get_permitted_values()'' and store the hash reference it returns. The hash (of arrays) will contain a list of permitted values for every feature. When you later need to check a value and replace it if necessary, you pass the hash reference back to ''tagset::common'':
 +
 +(Note that the ''list()'' function must be defined before the ''BEGIN'' block that uses it.)
  
 <code perl>
 use tagset::common;
 +...
 +sub list { ... }
 +...
 BEGIN
 {
Line 133: Line 140:
  
 See [[user:zeman:interset:Common Problems]] for a list of suggestions for phenomena difficult to match between tagsets and the Interset.
 +
  
  
Line 145: Line 153:
  
 <code>driver-test.pl ar::conll
-driver-test.pl -a</code>
+driver-test.pl -a
 +driver-test.pl bg::conll cs::pdt 
 +driver-test.pl -A</code>
  
-Running ''driver-test.pl'' without arguments will list the drivers available on the system. Running it with the ''-a'' option will test all the drivers.
+Running ''driver-test.pl'' without arguments will list the drivers available on the system. Running it with the ''-a'' option will test all the drivers. Two arguments test both drivers separately and then conversions from driver A to driver B and vice versa. The ''-A'' option tests all conversions between all pairs of drivers.
  
 Note that only drivers implementing the ''list()'' function can be tested. Most testing involves generating the list of all possible tags and testing the driver on each tag separately.
  
-The following tests will be performed:
+The following tests will be performed for a single driver:
  
   * Decode each tag and check that only known features and values are set. In addition to a built-in list, every feature can have an empty value, and the features "tagset" and "other" can have any value.
   * Check for each tag that ''encode(decode($tag)) eq $tag''. While sometimes it can be annoying to try to preserve some obscure information hidden in the tags, this test can also reveal many unwanted bugs. Besides, you should preserve information of your own tagset because people may want to use your driver merely to //access// the tags, instead of //converting// them.
 +
 +The following tests will be performed for a pair of drivers:
 +
 +  * Decode every tag of the first driver, encode it using the second driver and check whether the result is a known tag in the second tagset.
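
The round-trip test and the pairwise conversion test listed above can be sketched as follows. This is not the code of ''driver-test.pl''. The two toy drivers and the functions ''roundtrip_failures'' and ''conversion_failures'' are hypothetical, and a driver is assumed here to be a hash holding ''list'', ''decode'' and ''encode'' code references.

<code perl>
use strict;
use warnings;

# Two toy drivers (hypothetical, for illustration only).
my %toy_a =
(
    'list'   => sub { return ['N', 'V']; },
    'decode' => sub { return { 'pos' => $_[0] eq 'N' ? 'noun' : 'verb' }; },
    'encode' => sub { return $_[0]{'pos'} eq 'noun' ? 'N' : 'V'; },
);
my %toy_b =
(
    'list'   => sub { return ['NOUN', 'VERB']; },
    'decode' => sub { return { 'pos' => lc($_[0]) }; },
    'encode' => sub { return uc($_[0]{'pos'}); },
);

# Tags of one driver that do not survive encode(decode($tag)).
sub roundtrip_failures
{
    my $driver = shift;
    return grep { $driver->{'encode'}->($driver->{'decode'}->($_)) ne $_ }
        @{$driver->{'list'}->()};
}

# Tags of $source whose conversion is not a known tag of $target.
sub conversion_failures
{
    my ($source, $target) = @_;
    my %known = map { $_ => 1 } @{$target->{'list'}->()};
    return grep { !$known{$target->{'encode'}->($source->{'decode'}->($_))} }
        @{$source->{'list'}->()};
}

my @failures =
(
    roundtrip_failures(\%toy_a), roundtrip_failures(\%toy_b),
    conversion_failures(\%toy_a, \%toy_b), conversion_failures(\%toy_b, \%toy_a),
);
print(@failures ? "FAILED: @failures\n" : "All toy tests passed.\n");
</code>

Both functions return the offending tags rather than a yes/no answer, so a failing driver can be debugged tag by tag.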
  
