Table of Contents
Common Problems
Abbreviations, hyphenated compounds, foreign words
Some features are logically orthogonal to the part of speech, yet many tagsets encode them as special parts of speech. Examples include foreign words, hyphenated parts of compounds, abbreviations. Encode() should first check these special cases, then translate the normal parts of speech.
Counted nouns in Slavic languages
In Russian, plural counted nouns take the form of genitive singular if their number is 2, 3 or 4: три часа instead of *три часы or *три часов.
The same holds for some Bulgarian masculine non-person nouns, however, the case system has almost vanished in Bulgarian, so the correspondence with the genitive case is less visible there. The tagset of the BulTreeBank thus defines a special value of number
, called “count form”. Example: normal plural: 240 народни представители vs. count form: в следващите шест месеца.
Both cases should probably be encoded the same way in Interset. However, I am currently unsure, what is better.
Solution 1: number = “sing”, case = “gen”
Drawbacks:
- A parser will learn that a genitive noun can be modified by a non-genitive adjective.
- A parser will learn that singular genitive, if modified by a numeral, forms a plural noun phrase (not necessarily in genitive).
Solution 2: number = “counted”
Drawbacks:
- New value of
number
would be unused in most languages and the drivers may not react to it. (The same holds, to a lesser extent, fornumber = “dual”
; however, in that case it is at least commonly accepted that dual is a number. In contrast, our “counted form” may be regarded a special boolean feature rather than a value ofnumber
.) - If used in Russian, a parser will learn that counted-form nouns are modified by genitive adjectives.
Determiners
Many languages do not have or distinguish determiners. Remember that when writing your encode(). It is usually better to encode determiners as adjectives or pronouns than to let them fall back to a residual class.
Infinitive markers
Many languages do not have special words to mark an infinitive (e.g. English to). Tagsets for languages where such words exist take various approaches to encode them: particles, subjunctions, special parts of speech… The TO
tag of the Penn Treebank is another example, since it is a special part of speech but it is shared with the prepositional occurrences of to. If a tag contains information that the tagged word is or can be an infinitive mark, decode() can use the inf
part of speech. However, all implementations of encode() should be aware of that possibility! If the target tagset does not encode infinitive marks, encode() probably should fall back to a particle or even a residual class.
Participles
Participles are non-finite verb forms whose usage lies somewhere between a verb and an adjective (e.g. English doing, done). Because of that, some tagsets may opt for assigning participles to adjectives. Although you could combine pos = adj
and verbform = part
in Interset, for the sake of compatibility, all drivers should decode participles as verbs, whenever possible.
Pronouns
The class of pronouns is recognized in most tagsets. However, there are many different opinions on what belongs into the class. Personal and possibly possessive pronouns are in the intersection of most approaches.
Bulgarian tags of BulTreeBank classify all wh-words and their corresponding indefinite, collective and negative words as pronouns. Other tagsets prefer to separate some of them as determiners, numerals and adverbs, respectively.
Wh-words
Many languages do not distinguish between relative and interrogative pronouns (or adverbs or other words). For such languages, your encode() should contain a statement like this:
if($f{definiteness} =~ m/^(wh|int|rel)$/)
Even if you know that your own decode() always sets $f{definiteness} = “wh”
, you cannot be sure that the feature values were not set by a driver, which uses “int”
or “rel”
in different contexts.
Chinese particles
Chinese 的 (de) has a part of speech of its own in the Sinica treebank, DE
. The easiest approach is to decode it as a particle and remember its special nature using a new subpos
value, or just storing it in the other
feature. However, de's usage could be compared to that of conjunctions. (It's not a coordinative conjunction, though. It connects two elements with different roles: often a possessor and the possessed object, e.g. 我的腦海.)
Combinations of values in one feature structure
The current version allows for storing arrays of values in one feature. For instance, we can say that a word is either in nominative or in accusative by assigning
$f{case} = ["nom", "acc"];
However, we cannot define complex combinations of values of different features. For instance, if we assign
$f{gender} = ["fem", "neut"]; $f{number} = ["sing", "plu"];
all four combinations of the gender and number values are permitted. We cannot properly decode a tag that applies to either fem+sing
or neut+plu
but not fem+plu
, nor neut+sing
(real example taken from cs::pdt
). The only way to encode this is to exit our one-tag-at-a-time scope and create two parallel feature structures as the result of decoding. That would complicate using the feature structure(s) by the user, and also subsequent encoding into a physical tagset. Even the arrays that are already implemented make the system quite complex.
The inability to describe value combinations also plays a role in the situation where one feature value of the physical tagset has to be decomposed into values of multiple features in Interset, and the decomposed value should be one of multiple values in an array. For instance, the cs::pdt
gender I
is decoded as gender = “masc”, animateness = “anim”
. Now how shall we decode the physical gender T
, meaning disjunction of physical genders I
and F
(masculine inanimate or feminine)? gender = [“masc”, “fem”], animateness = [“inan”, “”]
is not exactly the description of what's going on here.
Similar situation is in pt::conll
with the feature NOM/PIV
, meaning “this is either nominative, or the prepositional form of accusative”. We can decode it as case = [“nom”, “acc”], prepcase = [“”, “pre”]
, but it is not precise representation of the original information. Nor is case = “nom”, prepcase = “pre”
.
The correct solution would be to decode such tag into multiple parallel feature structures. Every structure would only contain single values, no arrays. This would remove one level of complexity inside the structures but add another level around the structure. We can consider making this change in a future version of Interset. There could be two interfaces to the decoding function: one that would output an array of (references to) feature structures, and the other that would output (reference to) just one feature structure but there would be an additional feature with reference to the next feature structure. The encoder would select the structure that requires the least modification to fit the target tagset. If the user can deal with more than one target tag, they would ask for encoding each of them separately. If the target tagset could accommodate alternate values in some features, the encoder could look at multiple structures at a time; it is unclear how this would be done.
Multiple alternate feature structures can also be stored in a packed form. There is only one structure. It has an additional feature called _clusters_
. Its value is an array of substructures (clusters) C1..Cn. If a feature is set in a cluster, it must be empty in the main structure (or there can be the reference to the clusters or another special value). If there are two disjunct independent subsets of features (there are dependencies between the features in one subset but not between the subsets), we must write all combinations of possible values from all affected subsets (as if there were dependencies between the subsets). Adding a layer to separate the independent subsets would only complicate processing.
%f = ( 'pos' => "adj", 'gender' => \$clusters, 'animateness' => \$clusters, 'number' => "plu", 'case' => "nom", 'degree' => "pos", 'negativeness' => "pos", '_clusters_' => [ {'gender' => "masc", 'animateness' => "inan"}, {'gender' => "fem"} ] );
Or the alternating features would directly reference the array of substructures containing their alternating values. No additional technical feature would be needed. There could be more than one independent arrays of substructures if needed. The main structure would look the same as today, including the possibility that a feature value is an array reference. However, the referenced array would not contain directly the values of the referencing feature. Instead, it would contain (references to) substructures (hashes), each of them defining a single value for this and possibly other features.
@alt1 = ( {'gender' => "masc", 'animateness' => "inan"}, {'gender' => "fem"} ); @alt2 = ( {'case' => "nom"}, {'case' => "acc"}, {'case' => "voc"} ); %f = ( 'pos' => "adj", 'gender' => \@alt1, 'animateness' => \@alt1, 'number' => "plu", 'case' => \@alt2, 'degree' => "pos", 'negativeness' => "pos", );