Treebank Engineering

This page collects notes on a project in which we experiment with various dependency constructions found in treebanks and with transformations between them. Feel free to edit and add new stuff!

The project could eventually lead to a journal article. The SVN storage for the article and related materials is at http://svn.ms.mff.cuni.cz/projects/publications/browser/papers/2011_cl_tree_conventions.

Current participants: David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).

Our basic strategy is as follows:

Normalize the treebank, i.e. make its structure and afuns as close as possible to PDT.
Apply selected transformations, e.g. coordination from Prague style to that of Mel'čuk.
Evaluate impact on parsing using Malt and MST.
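
To make the second step concrete, here is a minimal sketch of the coordination transformation on a toy tree. It is not the Treex code: the Node class, its fields and the function name are made up for illustration. It assumes that in the Prague style the conjunction (afun Coord) heads the conjuncts (marked is_member) and the shared modifiers, and that in the Mel'čuk-like target everything hangs on the first conjunct. Afuns are left untouched and nested coordination gets no special treatment.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int                  # 1-based word position
    form: str
    head: int                # id of the parent; 0 is the artificial root
    afun: str
    is_member: bool = False  # marks conjuncts in the Prague style


def prague_to_melcuk(nodes):
    """Re-head each Prague-style coordination on its first member (in place)."""
    for coord in [n for n in nodes if n.afun == "Coord"]:
        children = [n for n in nodes if n.head == coord.id]
        members = sorted((n for n in children if n.is_member), key=lambda n: n.id)
        if not members:
            continue  # a Coord node without conjuncts is left alone
        first = members[0]
        # The first conjunct takes over the position of the conjunction.
        first.head = coord.head
        first.is_member = coord.is_member
        coord.head = first.id
        coord.is_member = False
        # Remaining conjuncts, delimiters and shared modifiers hang on it.
        for child in children:
            if child is not first:
                child.head = first.id
                child.is_member = False
    return nodes


def toy_sentence():
    """'John bought fresh apples and pears' with Prague-style coordination."""
    return [
        Node(1, "John", 2, "Sb"),
        Node(2, "bought", 0, "Pred"),
        Node(3, "fresh", 5, "Atr"),          # shared modifier of both conjuncts
        Node(4, "apples", 5, "Obj", True),
        Node(5, "and", 2, "Coord"),
        Node(6, "pears", 5, "Obj", True),
    ]


if __name__ == "__main__":
    for n in prague_to_melcuk(toy_sentence()):
        print(n.id, n.form, n.head, n.afun)
```

After the transformation "fresh" hangs on "apples", so a reader of the tree can no longer tell whether it modifies both conjuncts or only the first one. The transformation loses that information.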

The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely the group is defined).

Data

We want to have a collection of dependency treebanks as diverse as possible. Morphologically rich languages are more important (while some others may be picked for contrast). Genuine dependency treebanks are probably better than those converted from constituency trees. There are the following sources, among others:

CoNLL shared task data from 2006, 2007, 2008 and 2009. We ignore the semantic annotation in the 2008 and 2009 data. Some languages may have licenses that will prevent us from using them.
ICON shared task data from 2009 and 2010 (newer version of the same): Hindi, Bangla, Telugu.
Tamil treebank created by Loganathan.
Latin Dependency Treebank (LDT) and Ancient Greek Dependency Treebank (AGDT).
Russian Dependency Treebank (Dan has a version from 2006 and David + Natalia have something newer).

Note that several treebanks have been modeled after PDT, thus their structure is very close to PDT and they probably require very little adjustment: PADT (Arabic), SDT (Slovene), Tamil, LDT (Latin) and possibly also AGDT.

Tentatively Selected Treebanks

Czech: Prague Dependency Treebank. Our home treebank and the model for “default annotation style”. Morphologically very rich.
Tamil: Loganathan's main contribution. Under development, i.e. we can adjust its guidelines to the findings of this project. Morphologically very rich, agglutinative.
Bulgarian: BulTreeBank is originally an HPSG treebank, converted to dependencies for CoNLL 2006. Dan is transforming it to the PDT style. Medium morphological richness (nouns have no case but verbal morphology is rich).

Initial Normalization

The purpose of the initial normalization is to make the treebank look as close to PDT as possible. Normalization involves dependency structure, syntactic tags (afuns), and, if possible, morphological tags (using DZ Interset). The transformations applied during this process are an important source of insight into what various treebanks do differently and what we may want to experiment with later.

Unless specified otherwise, normalization is done using Treex (TectoMT). See $TMT_ROOT/applications/norm_treebank and $TMT_ROOT/treex/lib/Treex/Block/A2A/$LANGUAGE/*2PDTStyle.pm.

Bulgarian

The BulTreeBank (BTB) morphological tagset has been decoded to Interset features, and, for convenience, also converted to PDT tags (some information lost). There are no lemmas in BTB (but Mirek Týnovský has a rule-based tool to guess them!)

There is a description of the deprel tags in BTB. For a detailed description of what's going on, see the source of $TMT_ROOT/treex/lib/Treex/Block/A2A/BG/CoNLL2PDTStyle.pm. Here is a short list of BulTreeBank features that differ from PDT:

Coordination is Mel'čuk-like, i.e. the first member is the head, and all other members, delimiters and shared modifiers are attached to it. Note that this style cannot distinguish a shared modifier from a private modifier of the first member.
Preposition governs its noun phrase (so far same as PDT). However, rhematizers are attached to the preposition, not the noun phrase. Advantage of the Bulgarian approach: the result is projective. Advantage of the Prague approach: the preposition has one child, which is more intuitive than two.
Final punctuation is attached to the main verb or other “real ROOT” node (not our artificial empty root).
There are special auxiliary particles “да” (da) and “ще” (šte). Da is a sort of infinitival marker (Bulgarian verbs do not have the infinitive form). Šte marks the future tense. In BTB the particles govern the verb form and all dependents of the verb are also attached to the particle. Our solutions:
- Da: tagged as AuxC (similar to subordinating conjunctions), real afun and all dependents moved down to the verb.
- Šte: tagged as AuxV (similar to auxiliary verbs), attached to the verb, real afun and all dependents moved to the verb.
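
The two particle rules can be sketched on a toy tree as follows. This is only an illustration, not the code in CoNLL2PDTStyle.pm. The Node class, the pos values and the afun "placeholder" (standing in for whatever label the verb carries under the particle) are assumptions, and the governed verb is assumed to be the verb child closest to the particle.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int     # 1-based word position
    form: str
    head: int   # id of the parent; 0 is the artificial root
    afun: str
    pos: str    # coarse part of speech; "V" = verb (hypothetical tagset)


def lower_particles(nodes):
    """Move the afun and the dependents of 'да' / 'ще' to the verb (in place)."""
    for part in nodes:
        form = part.form.lower()
        if form not in ("да", "ще"):
            continue
        children = [n for n in nodes if n.head == part.id]
        verbs = [n for n in children if n.pos == "V"]
        if not verbs:
            continue  # no governed verb found, leave the particle alone
        verb = min(verbs, key=lambda n: abs(n.id - part.id))
        verb.afun = part.afun          # the real afun moves to the verb
        if form == "да":
            part.afun = "AuxC"         # da keeps governing the verb
        else:
            part.afun = "AuxV"         # šte becomes a child of the verb
            verb.head = part.head
            part.head = verb.id
        for child in children:         # all other dependents go to the verb
            if child is not verb:
                child.head = verb.id
    return nodes


def toy_shte():
    """'Той ще дойде утре' with the particle on top, as described for BTB."""
    return [
        Node(1, "Той", 2, "Sb", "P"),
        Node(2, "ще", 0, "Pred", "T"),
        Node(3, "дойде", 2, "placeholder", "V"),
        Node(4, "утре", 2, "Adv", "D"),
    ]


def toy_da():
    """'Искам да дойда утре' with 'да' governing the embedded verb."""
    return [
        Node(1, "Искам", 0, "Pred", "V"),
        Node(2, "да", 1, "Obj", "T"),
        Node(3, "дойда", 2, "placeholder", "V"),
        Node(4, "утре", 2, "Adv", "D"),
    ]


if __name__ == "__main__":
    for sentence in (toy_shte(), toy_da()):
        for n in lower_particles(sentence):
            print(n.id, n.form, n.head, n.afun)
        print()
```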

TO DO:

Get and apply the lemmatizer by Mirek Týnovský?
Sentence 39: should we try to detect private modifiers of the first member, e.g. when the modifier is “ще се” and the second member has one of its own? Similarly in sentence 54, the subject belongs only to the first clause because the second clause has its own subject.
Sports scores (“2 : 1”): in BTB, “:” is a preposition (!) and “1” is its complement. In PDT this is coordination of two numbers.
Explore the set of possible complex verb forms and compare their annotation to the Czech ones in PDT. So far we transform only a fraction of them (if a participle is governed by “би”, we swap the two). Further examples: sentence 102: “, кой е бил сътрудник”; sentence 104: “, че не са били”. Compare Czech “já jsem byl spolupracovník”, where “jsem” should depend on “byl”, not vice versa.
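
For reference, the one rule that is handled so far (a participle governed by “би”) might look roughly like this on a toy tree. The Node class and the is_participle flag are hypothetical, only the literal form “би” is matched, and only the two heads are touched. What should happen to afuns and to other dependents of “би” is left open here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    id: int                      # 1-based word position
    form: str
    head: int                    # id of the parent; 0 is the artificial root
    is_participle: bool = False  # hypothetical flag derived from the tag


def swap_bi_and_participle(nodes):
    """Make the participle govern the 'би' that governed it (in place)."""
    by_id = {n.id: n for n in nodes}
    for part in nodes:
        gov = by_id.get(part.head)
        if part.is_participle and gov is not None and gov.form.lower() == "би":
            part.head = gov.head   # participle takes over the position of 'би'
            gov.head = part.id     # 'би' now depends on the participle
    return nodes


if __name__ == "__main__":
    toy = [Node(1, "той", 3), Node(2, "би", 0), Node(3, "дошъл", 2, True)]
    for n in swap_bi_and_participle(toy):
        print(n.id, n.form, n.head)
```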


Institute of Formal and Applied Linguistics Wiki
