Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-write-a-driver [2008/03/08 11:06]
zeman Testing all drivers.
user:zeman:interset:how-to-write-a-driver [2008/03/14 10:09]
zeman Typo.
Line 7: Line 7:
 The input/output tag can be any string. If the information is stored in several kinds of tags, they can be passed in one string, using some unique delimiters. We recommend "\t" (horizontal tab, ASCII 9) as delimiter. If desirable, the input/output tag can be even a multi-line XML!
  
-Empty feature value means "unknown". It is even not known, whether this feature would be relevant. If we know that a feature is irrelevant, we can set it to "n/a" (not applicable; although not mentioned explicitly, this value is allowed for all features). However, I am not sure whether this should be used at all. While something can be irrelevant in one tagset, we can hardly say that it is not relevant in any tagset. So, since we are setting a value in a universal "tagset", we probably better leave the value empty or even set it to an appropriate default.
+Empty feature value means "unknown". It is even not known, whether this feature would be relevant. Some tagsets distinguish between unknown values and irrelevant features. This is not the case of Interset. While something can be irrelevant in one tagset, we can hardly say that it is not relevant in any tagset. So, since we are setting a value in a universal "tagset", we probably better leave the value empty or even set it to an appropriate default.
  
 ===== decode() =====
Line 16: Line 16:
  
 If the tagset encodes features separately (e.g., each character is a value of a particular feature): The decoder should be tolerant to unexpected combinations of features (or should be able to be tolerant if asked for it).
 +
  
  
Line 27: Line 28:
 Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared to arrays (more precisely: array references) anywhere, or call ''tagset::single_values()'' to get rid of the arrays (or some of them). See [[#Alternative values]] for more details.
  
-**WARNING:** Before modifying the contents of ''%f'', you should make a //deep// copy of it. You cannot assume that the user of the driver will not need the values in ''%f'' after encoding. If you have called ''single_values()'', it made the copy for you.
+**WARNING:** Before modifying the contents of the feature structure, you should make a //deep// copy of it. You cannot assume that the user of the driver will not need the original values of the features after encoding. If you have called ''single_values()'', it made the copy for you.
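
The deep-copy warning above can be illustrated with a minimal sketch. It assumes the core ''Storable'' module is available; the function name ''encode_sketch'', the feature names and the ''name=value'' output format are hypothetical and only serve the illustration, they are not part of any real driver.

<code perl>
use strict;
use warnings;
use Storable qw(dclone);

# Hypothetical encoder: receives a reference to the caller's feature hash.
sub encode_sketch
{
    my $f0 = shift;
    # Deep copy: nested array references are duplicated as well,
    # so nothing below can change the caller's data.
    my $f = dclone($f0);
    # Now it is safe to modify the copy, e.g. to reduce an array of
    # alternative values to its first element.
    if(ref($f->{gender}) eq 'ARRAY')
    {
        $f->{gender} = $f->{gender}[0];
    }
    # Illustrative output format only: tab-delimited name=value pairs.
    return join("\t", map { "$_=$f->{$_}" } sort keys %{$f});
}

my %features = (pos => 'noun', gender => ['masc', 'fem']);
my $tag = encode_sketch(\%features);
# %features still holds the two-element array for gender.
</code>

A shallow copy such as ''my %copy = %{$f0};'' would not be enough here, because the copied hash would still share the nested array references with the original.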
  
 ===== list() =====
Line 34: Line 35:
  
 The list is not necessary for the driver to work. However, it can be useful for [[#Test your driver|testing]] the driver. If no list is distributed along with the tagset description, you may still be able to acquire a partial list from a corpus.
 +
  
 ===== Alternative values =====
Line 64: Line 66:
 Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to ''grep'' your values rather than trying exact match, because you do not know what is going to come from other drivers, and the ordering or additional values may not be what matters.
  
-If the arrays turns out to be incompatible with what you expect, you should pick one value (we suggest you  take the first one) and proceed with default single-value processing.
+If the array turns out to be incompatible with what you expect, you should pick one value (we suggest you  take the first one) and proceed with default single-value processing.
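
As a rough sketch of the two steps just described (''grep'' on arrays, then falling back to the first value), one might write something like the following. The helpers ''has_value'', ''first_value'' and ''encode_gender'', the feature values and the output codes ''MF'', ''M'', ''F'', ''N'' and ''-'' are all assumptions made up for this illustration; a real driver would use the values of its own tagset.

<code perl>
use strict;
use warnings;

# Hypothetical helper: true if $value (a scalar or an array reference)
# contains $wanted, regardless of ordering or additional values.
sub has_value
{
    my ($value, $wanted) = @_;
    return 0 unless defined $value;
    my @values = ref($value) eq 'ARRAY' ? @{$value} : ($value);
    return scalar(grep { $_ eq $wanted } @values) ? 1 : 0;
}

# Hypothetical helper: reduce an array reference to its first element,
# pass a plain scalar through unchanged.
sub first_value
{
    my $value = shift;
    return ref($value) eq 'ARRAY' ? $value->[0] : $value;
}

# Encoder for a made-up tagset that has a joint code for masculine+feminine.
sub encode_gender
{
    my $gender = shift;
    # An array we know how to encode: grep for the members we need.
    if(ref($gender) eq 'ARRAY' && has_value($gender, 'masc') && has_value($gender, 'fem'))
    {
        return 'MF';
    }
    # Incompatible array or single value: default single-value processing.
    my %map = (masc => 'M', fem => 'F', neut => 'N');
    my $single = first_value($gender);
    return defined($single) && exists($map{$single}) ? $map{$single} : '-';
}
</code>

With these definitions, ''encode_gender(['fem', 'masc', 'neut'])'' takes the ''grep'' branch even though the order differs and an extra value is present, while ''encode_gender(['neut', 'com'])'' falls back to the first element.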
  
 <code perl> <code perl>
