Common Problems

Abbreviations, hyphenated compounds, foreign words

Some features are logically orthogonal to the part of speech, yet many tagsets encode them as special parts of speech. Examples include foreign words, hyphenated parts of compounds, and abbreviations. encode() should first check these special cases, then translate the normal parts of speech.
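
The ordering can be illustrated with a minimal sketch. The feature names foreign, abbr and hyph, their values, and every tag string below are assumptions made for illustration; they are not taken from any particular driver.

```perl
use strict;
use warnings;

# Hypothetical encode() for an imaginary tagset.
# Feature names (foreign, abbr, hyph) and all tags are assumptions.
sub encode
{
    my %f = %{ shift() };
    # Special cases first: they override the ordinary part of speech.
    return 'FW'   if(($f{foreign} // '') eq 'foreign');
    return 'ABBR' if(($f{abbr}    // '') eq 'abbr');
    return 'HYPH' if(($f{hyph}    // '') eq 'hyph');
    # Only then translate the normal parts of speech.
    my %pos2tag = (noun => 'N', adj => 'A', verb => 'V');
    return $pos2tag{ $f{pos} // '' } // 'X';
}

print encode({ pos => 'noun', foreign => 'foreign' }), "\n";  # FW
print encode({ pos => 'noun' }), "\n";                        # N
```

Checking the special cases first means a foreign noun is not silently tagged as an ordinary noun when the target tagset has a dedicated tag for foreign words.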

Counted nouns in Slavic languages

In Russian, plural counted nouns take the form of genitive singular if their number is 2, 3 or 4: три часа instead of *три часы or *три часов.

The same holds for some Bulgarian masculine non-person nouns; however, the case system has almost vanished in Bulgarian, so the correspondence with the genitive case is less visible there. The tagset of the BulTreeBank thus defines a special value of number, called “count form”. Example: normal plural: 240 народни представители vs. count form: в следващите шест месеца.

Both cases should probably be encoded the same way in Interset. However, I am currently unsure which of the following solutions is better.

Solution 1: number = “sing”, case = “gen”
Drawbacks:

A parser will learn that a genitive noun can be modified by a non-genitive adjective.
A parser will learn that singular genitive, if modified by a numeral, forms a plural noun phrase (not necessarily in genitive).

Solution 2: number = “counted”
Drawbacks:

A new value of number would be unused in most languages, and the drivers may not react to it. (The same holds, to a lesser extent, for number = “dual”; however, in that case it is at least commonly accepted that dual is a number. In contrast, our “counted form” may be regarded as a special boolean feature rather than a value of number.)
If used in Russian, a parser will learn that counted-form nouns are modified by genitive adjectives.
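
If Solution 2 were adopted, drivers for tagsets without a count form would need some reaction to the new value. The sketch below shows one possible fallback, mapping “counted” to plural. This choice, the value “plu” and the one-letter tags are assumptions, not something this page prescribes.

```perl
use strict;
use warnings;

# Hypothetical number encoding for a tagset that knows only S and P.
# Treating "counted" as plural is an assumed fallback, not a settled rule.
sub encode_number
{
    my %f = %{ shift() };
    my $number = $f{number} // '';
    return 'P' if($number eq 'plu' || $number eq 'counted');
    return 'S' if($number eq 'sing');
    return '-';   # number not specified or not representable
}

print encode_number({ number => 'counted' }), "\n";
```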

Determiners

Many languages do not have or distinguish determiners. Remember that when writing your encode(). It is usually better to encode determiners as adjectives or pronouns than to let them fall back to a residual class.
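
A minimal sketch of such a fallback, assuming a pos value “det” for determiners and an imaginary target tagset with no determiner tag (both are assumptions):

```perl
use strict;
use warnings;

# Hypothetical pos encoding for a tagset without determiners.
# The value "det" and all tags are assumptions for illustration.
sub encode_pos
{
    my %f = %{ shift() };
    my $pos = $f{pos} // '';
    # Prefer a meaningful neighbour class over the residual tag.
    $pos = 'adj' if($pos eq 'det');
    my %pos2tag = (noun => 'N', adj => 'A', pron => 'P', verb => 'V');
    return $pos2tag{$pos} // 'X';   # X = residual class
}

print encode_pos({ pos => 'det' }), "\n";
```

Whether adjective or pronoun is the better substitute depends on how the target tagset treats words such as demonstratives.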

Infinitive markers

Many languages have no special word that marks an infinitive (such as English to). Tagsets for languages where such words exist take various approaches to encode them: particles, subjunctions, special parts of speech… The TO tag of the Penn Treebank is another example, since it is a special part of speech but it is shared with the prepositional occurrences of to. If a tag contains information that the tagged word is or can be an infinitive marker, decode() can use the inf part of speech. However, all implementations of encode() should be aware of that possibility! If the target tagset does not encode infinitive markers, encode() should probably fall back to a particle or even a residual class.
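
Both sides can be sketched as follows. The function names, the pos value “part” for particles and the target tags are assumptions; only the TO tag and the inf part of speech come from the text above.

```perl
use strict;
use warnings;

# Decoding side: a Penn-style TO tag can be an infinitive marker.
sub decode_to
{
    my $tag = shift;
    my %f;
    $f{pos} = 'inf' if($tag eq 'TO');
    return \%f;
}

# Encoding side for a hypothetical tagset with no infinitive-marker tag.
# The value "part" and the tags PART, PREP and X are assumptions.
sub encode_without_inf
{
    my %f = %{ shift() };
    my $pos = $f{pos} // '';
    $pos = 'part' if($pos eq 'inf');   # fall back to a particle
    my %pos2tag = (part => 'PART', prep => 'PREP');
    return $pos2tag{$pos} // 'X';      # X = residual class
}

print encode_without_inf(decode_to('TO')), "\n";
```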

Participles

Participles are non-finite verb forms whose usage lies somewhere between a verb and an adjective (e.g. English doing, done). Because of that, some tagsets may opt for assigning participles to adjectives. Although you could combine pos = adj and verbform = part in Interset, for the sake of compatibility, all drivers should decode participles as verbs, whenever possible.
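
As a rough sketch, a driver for a tagset that files participles under adjectives might decode them like this. The tags APART and A are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical decode() fragment. APART stands for an imaginary
# "adjectival participle" tag, A for an ordinary adjective.
sub decode_tag
{
    my $tag = shift;
    my %f;
    if($tag eq 'APART')
    {
        # Decode as a verb even though the tagset calls it an adjective.
        $f{pos} = 'verb';
        $f{verbform} = 'part';
    }
    elsif($tag eq 'A')
    {
        $f{pos} = 'adj';
    }
    return \%f;
}

my $f = decode_tag('APART');
print "$f->{pos} $f->{verbform}\n";
```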

Pronouns

The class of pronouns is recognized in most tagsets. However, there are many different opinions on what belongs in the class. Personal and possibly possessive pronouns are in the intersection of most approaches.

The Bulgarian tags of BulTreeBank classify all wh-words and their corresponding indefinite, collective and negative words as pronouns. Other tagsets prefer to classify some of them as determiners, numerals or adverbs instead.

Wh-words

Many languages do not distinguish between relative and interrogative pronouns (or adverbs or other words). For such languages, your encode() should contain a statement like this:

if($f{definiteness} =~ m/^(wh|int|rel)$/)

Even if you know that your own decode() always sets $f{definiteness} = “wh”, you cannot be sure that the feature values were not set by another driver that uses “int” or “rel” in different contexts.
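
A self-contained sketch of such an encode() fragment follows. The function name and the tags PW and PP are hypothetical; the condition is the one shown above.

```perl
use strict;
use warnings;

# Hypothetical pronoun encoding for a tagset with a single tag (PW)
# covering both interrogative and relative pronouns.
sub encode_pronoun
{
    my %f = %{ shift() };
    my $def = $f{definiteness} // '';
    # Accept all three values, whichever driver set them.
    if($def =~ m/^(wh|int|rel)$/)
    {
        return 'PW';
    }
    return 'PP';   # any other pronoun
}

print encode_pronoun({ definiteness => $_ }), "\n" for qw(wh int rel);
```

All three values yield the same tag, so features decoded by a driver that does distinguish interrogative from relative words are still encoded sensibly.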
