====== Treebank Engineering ======

This page collects notes on a project in which we experiment with various dependency constructions found in treebanks and with their transformations. **Feel free to edit and to add new stuff!**

The project could eventually lead to a journal article. The SVN storage for the article and related materials is at [[http://svn.ms.mff.cuni.cz/projects/publications/browser/papers/2011_cl_tree_conventions]].

Current participants: Nathan Green (treebank normalization), David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development, treebank normalization), Rudolf Rosa (MST parser reimplementation in Perl), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).

Our basic strategy is as follows:
  * Normalize the treebank, i.e. make its structure and afuns as close as possible to PDT.
  * Apply selected transformations, e.g. coordination from Prague style to that of Mel'čuk.
  * Evaluate impact on parsing using Malt and MST.
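
To make the second step concrete, here is a minimal sketch of one such transformation: from Prague-style coordination (the conjunction is the head) to the Mel'čuk-like style in which the first member is the head and everything else hangs on it. It is not the Treex implementation. The tree representation (a plain parent map keyed by word order, with ''0'' as the artificial root), the function name and its parameters are assumptions made for illustration, and nested coordinations are not handled.

```python
def prague_to_first_member_head(parent, is_member, coord_heads):
    """Return a new parent map with first-member-headed coordinations.

    parent      -- dict: node ord -> parent ord (0 = artificial root)
    is_member   -- set of node ords flagged as coordination members
    coord_heads -- node ords that head a Prague-style coordination
    Assumption: coordinations are not nested inside each other.
    """
    new_parent = dict(parent)
    for coord in coord_heads:
        children = sorted(n for n, p in parent.items() if p == coord)
        members = [n for n in children if n in is_member]
        if not members:
            continue  # malformed coordination: leave it untouched
        head = members[0]
        new_parent[head] = parent[coord]  # first member takes over the position
        new_parent[coord] = head          # the conjunction hangs on it
        for child in children:
            if child != head:
                # other members, delimiters and shared modifiers
                new_parent[child] = head
    return new_parent


# "čeští studenti a učitelé přišli", with "čeští" as a shared modifier:
# 1 čeští, 2 studenti, 3 a, 4 učitelé, 5 přišli
prague = {1: 3, 2: 3, 3: 5, 4: 3, 5: 0}
melcuk_like = prague_to_first_member_head(prague, is_member={2, 4}, coord_heads=[3])
print(melcuk_like)
```

In the result the shared modifier //čeští// is attached to //studenti// exactly as a private modifier of the first member would be, so this direction of the transformation loses information.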

The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely the group is defined).

===== Some Unsorted References =====

  * Dan's old PBML article about inconsistent annotation rules in PDT 1.0 ("How to Decrease Performance of a Statistical Parser")
  * All references required by the providers of the respective treebanks.
  * Interset (the LREC paper is better?)

===== Data =====

We want a collection of dependency treebanks that is as diverse as possible. Morphologically rich languages are more important (while some others may be picked for contrast). Genuine dependency treebanks are probably better than those converted from constituency trees. Possible sources include:

  * CoNLL shared task data from 2006, 2007, 2008, 2009. Ignore semantics in 2008-9. Some languages may have licenses that will prevent us from using them.
  * ICON shared task data from 2009 and 2010 (newer version of the same): Hindi, Bangla, Telugu.
  * Tamil treebank created by Loganathan.
  * Latin Dependency Treebank (LDT) and Ancient Greek Dependency Treebank (AGDT).
  * Russian Dependency Treebank (Dan has a version from 2006 and David + Natalia have something newer).

Note that several treebanks have been modeled after PDT, thus their structure is very close to PDT and they probably require very little adjustment: PADT (Arabic), SDT (Slovene), Tamil, LDT (Latin) and possibly also AGDT.

==== Tentatively Selected Treebanks ====

  * Czech: Prague Dependency Treebank. Our home treebank and the model for “default annotation style”. Morphologically very rich.
  * Tamil: Loganathan's main contribution. Under development, i.e. we can adjust its guidelines to the findings of this project. Morphologically very rich, agglutinative.
  * Bulgarian: BulTreeBank is originally an HPSG treebank, converted to dependencies for CoNLL 2006. Dan is transforming it to the PDT style. Medium morphological richness (no noun cases but rich verbal morphology).

===== Initial Normalization =====

The purpose of the initial normalization is to make the treebank look as close to PDT as possible. Normalization involves dependency structure, syntactic tags (afuns), and, if possible, morphological tags (using [[interset|DZ Interset]]). The transformations applied during this process are an important source of insight into what the various treebanks do differently and what we may want to experiment with later.

Unless specified otherwise, normalization is done using Treex ([[internal:tectomt|TectoMT]]). See ''$TMT_ROOT/applications/norm_treebank'' and ''$TMT_ROOT/treex/lib/Treex/Block/A2A/$LANGUAGE/*2PDTStyle.pm''.

==== Bulgarian ====

The BulTreeBank (BTB) morphological tagset has been decoded to Interset features and, for convenience, also converted to PDT tags (with some information lost). There are no lemmas in BTB (but Mirek Týnovský has a rule-based tool to guess them!).

To test our normalization of BTB, go to ''$TMT_ROOT/applications/norm_treebank'' and call ''make''. It will read our copy of the Bulgarian test file from CoNLL 2006 (398 sentences), transform it and create a file ''bg.treex''. View it by calling ''ttred bg.treex &'' (Treex must be initialized for the ''ttred'' command to be available).

There is a [[http://www.bultreebank.org/dpbtb/|description of the deprel tags in BTB]]. For a detailed description of what is going on, see the source of ''$TMT_ROOT/treex/lib/Treex/Block/A2A/BG/CoNLL2PDTStyle.pm''. Here is a short list of BulTreeBank features that differ from PDT:

  * Coordination is Mel'čuk-like, i.e. the first member is the head and all other members, delimiters and shared modifiers are attached to it. Note that this style cannot distinguish a shared modifier //(**čeští** studenti a vysokoškolští učitelé,// "Czech students and university teachers") from a private modifier of the first member //(**čeští** studenti a němečtí učitelé,// "Czech students and German teachers"). Most of the time we cannot guess the correct attachment of the modifiers when transforming to the Prague annotation style.
  * Sentence-initial coordinating conjunction (such as in //But he believed that...//) is attached to the verb. In the Prague style this is coordination with a single member: the clause. Thus the conjunction is attached to the root and the verb is attached to the conjunction.
  * The preposition governs its noun phrase (so far the same as in PDT). However, rhematizers are attached to the preposition, not to the noun phrase. Advantage of the Bulgarian approach: the result is projective. Advantage of the Prague approach: the preposition has one child, which is more intuitive than two.
  * Final punctuation is attached to the main verb or other “real ROOT” node (not our artificial empty root).
  * There are special auxiliary particles “да” (da) and “ще” (šte). //Da// is a sort of infinitival marker (Bulgarian verbs do not have the infinitive form). //Šte// marks the future tense. In BTB the particles govern the verb form and all dependents of the verb are also attached to the particle. Our solutions:
    * **Da:** tagged as ''AuxC'' (similar to subordinating conjunctions), real afun and all dependents moved down to the verb.
    * **Šte:** tagged as ''AuxV'' (similar to auxiliary verbs), attached to the verb, real afun and all dependents moved to the verb.
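
The two particle solutions above can be sketched as follows. This is only an illustration, not the code in ''CoNLL2PDTStyle.pm''. The node representation (a dict of ''form'', ''parent'', ''afun'' keyed by word order), the function name, and the assumption that the caller already knows which child of the particle is the verb are all hypothetical. The example sentence and its afuns are made up for the demonstration.

```python
def rehang_particle(nodes, particle, verb):
    """Normalize a BTB-style "да"/"ще" construction in place.

    nodes    -- dict: ord -> {"form": str, "parent": int, "afun": str}
    particle -- ord of the particle, which governs the verb in BTB
    verb     -- ord of the verb form (assumed to be identified by the caller)
    """
    part = nodes[particle]
    form = part["form"].lower()
    if form not in ("да", "ще"):
        raise ValueError("expected the particle 'да' or 'ще'")

    # All other dependents of the particle move down to the verb.
    for ord_, node in nodes.items():
        if node["parent"] == particle and ord_ != verb:
            node["parent"] = verb

    # The verb receives the real afun that the particle carried.
    nodes[verb]["afun"] = part["afun"]

    if form == "да":
        # Like a subordinating conjunction: keeps governing the verb.
        part["afun"] = "AuxC"
    else:
        # Like an auxiliary verb: attached to the verb, which takes its place.
        nodes[verb]["parent"] = part["parent"]
        part["parent"] = verb
        part["afun"] = "AuxV"
    return nodes


# Made-up example: "Той ще дойде утре" with "ще" as the BTB-style head.
nodes = {
    1: {"form": "Той", "parent": 2, "afun": "Sb"},
    2: {"form": "ще", "parent": 0, "afun": "Pred"},
    3: {"form": "дойде", "parent": 2, "afun": ""},
    4: {"form": "утре", "parent": 2, "afun": "Adv"},
}
rehang_particle(nodes, particle=2, verb=3)
for ord_, node in sorted(nodes.items()):
    print(ord_, node["form"], node["parent"], node["afun"])
```

The two branches differ only in who ends up as the head: //da// stays above the verb, //šte// goes below it.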

**TO DO:**
  * Get and apply the lemmatizer by Mirek Týnovský?
  * Sentence 39: should we try to detect private modifiers of the first member, e.g. when the modifier is “ще се” and the second member has one of its own? Similarly in sentence 54: the subject belongs to the first clause because the second clause has its own subject.
  * Sports scores (“2 : 1”): in BTB, ":" is a preposition (!) and "1" is its complement. In PDT this is coordination of two numbers.
  * Explore the set of possible complex verb forms and compare their annotation to the Czech ones in PDT. So far we transform only a fraction of them (if a participle is governed by “би”, we swap the two). Further examples: sentence 102: “, кой е бил сътрудник”; sentence 104: “, че не са били”. Compare Czech “já jsem byl spolupracovník” ("I was a collaborator"), where "jsem" should depend on "byl", not vice versa.
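
For the sports-score item above, a possible transformation could look like the sketch below. Nothing like this is implemented yet. The sketch assumes that in BTB the colon hangs on the first number and the second number hangs on the colon (the source only says that ":" is a preposition and "1" its complement), and the function name and the parent-map representation are hypothetical.

```python
def score_to_coordination(parent, first, colon, second):
    """Turn an assumed chain first -> colon -> second into a coordination.

    parent -- dict: node ord -> parent ord (0 = artificial root)
    Returns the new parent map and the set of coordination members.
    """
    if parent[colon] != first or parent[second] != colon:
        raise ValueError("not the assumed preposition-style score pattern")
    new_parent = dict(parent)
    new_parent[colon] = parent[first]  # the colon becomes the coordination head
    new_parent[first] = colon          # both numbers become its members
    new_parent[second] = colon
    return new_parent, {first, second}


# Hypothetical fragment: node 1 governs the score "2 : 1" (nodes 2, 3, 4).
btb_like = {1: 0, 2: 1, 3: 2, 4: 3}
pdt_like, members = score_to_coordination(btb_like, first=2, colon=3, second=4)
print(pdt_like, sorted(members))
```

Afuns are left out here: in the Prague style the colon would get ''Coord'' and both numbers would carry the afun of the whole expression.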