
Institute of Formal and Applied Linguistics Wiki






Treebank Engineering

This page collects notes on the project in which we experiment with the various dependency constructions encountered in treebanks and with their transformations. Feel free to edit and add new material!

The project could eventually lead to a journal article. The SVN storage for the article and related materials is at http://svn.ms.mff.cuni.cz/projects/publications/browser/papers/2011_cl_tree_conventions.

Current participants: David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).

Our basic strategy is as follows:

The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely the group is defined).

Data

We want a collection of dependency treebanks that is as diverse as possible. Morphologically rich languages have priority, although some others may be included for contrast. Genuine dependency treebanks are probably preferable to those converted from constituency trees. Sources include the following, among others:

Note that several treebanks have been modeled after PDT, so their structure is very close to that of PDT and they probably require very little adjustment: PADT (Arabic), SDT (Slovene), Tamil, LDT (Latin) and possibly also AGDT.

Tentatively Selected Treebanks

Initial Normalization

The purpose of the initial normalization is to make the treebank look as close to PDT as possible. Normalization involves dependency structure, syntactic tags (afuns), and, if possible, morphological tags (using DZ Interset). The transformations applied during this process are an important source of insight into what the various treebanks do differently and what we may want to experiment with later.

Unless specified otherwise, normalization is done using Treex (TectoMT). See $TMT_ROOT/applications/norm_treebank and $TMT_ROOT/treex/lib/Treex/Block/A2A/$LANGUAGE/*2PDTStyle.pm.
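
To check which normalization blocks exist for a given language, the path pattern above can be expanded with a small script. The following is only a sketch: the function name find_normalization_blocks is hypothetical, and it assumes that the language directories under A2A are named by upper-case codes such as BG, as in the Bulgarian path mentioned further down this page.

```python
import glob
import os


def find_normalization_blocks(tmt_root, language):
    """Return sorted paths matching A2A/<language>/*2PDTStyle.pm under tmt_root.

    The directory layout is taken from the path quoted on this page;
    the function itself is a hypothetical helper, not part of Treex.
    """
    pattern = os.path.join(
        tmt_root, "treex", "lib", "Treex", "Block", "A2A",
        language, "*2PDTStyle.pm",
    )
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    root = os.environ.get("TMT_ROOT")
    if root is None:
        print("TMT_ROOT is not set; initialize Treex first.")
    else:
        # "BG" is used here only as an example language code.
        for path in find_normalization_blocks(root, "BG"):
            print(path)
```

An empty result means either that the language has no *2PDTStyle.pm block yet or that the assumed directory name does not match.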

Bulgarian

The BulTreeBank (BTB) morphological tagset has been decoded to Interset features and, for convenience, also converted to PDT tags (with some loss of information). There are no lemmas in BTB (but Mirek Týnovský has a rule-based tool to guess them!).

To test our normalization of BTB, go to $TMT_ROOT/applications/norm_treebank and run make. It will read our copy of the Bulgarian test file from CoNLL 2006 (398 sentences), transform it and create a file bg.treex. View it by calling ttred bg.treex & (Treex must be initialized for the ttred command to be available).
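
Before running the normalization, it can be useful to sanity-check the input file. The sketch below assumes the standard CoNLL-X layout used in the CoNLL 2006 shared task (ten tab-separated columns per token, sentences separated by blank lines); the function names are hypothetical and the sample tokens are toy placeholders, not BTB data. It counts sentences and tallies the values in the DEPREL column.

```python
from collections import Counter

# Column names of the CoNLL-X format (CoNLL 2006 shared task).
CONLLX_FIELDS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def read_conllx(lines):
    """Yield sentences as lists of token dicts from CoNLL-X formatted lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLX_FIELDS):
            raise ValueError("expected 10 columns, got %d" % len(columns))
        sentence.append(dict(zip(CONLLX_FIELDS, columns)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def summarize(lines):
    """Return (number of sentences, Counter of deprel labels)."""
    sentence_count = 0
    deprels = Counter()
    for sentence in read_conllx(lines):
        sentence_count += 1
        deprels.update(token["deprel"] for token in sentence)
    return sentence_count, deprels


# Toy input with placeholder tokens, only to show the expected layout.
sample = (
    "1\tw1\t_\tN\tN\t_\t2\tsubj\t_\t_\n"
    "2\tw2\t_\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tw3\t_\tV\tV\t_\t0\tROOT\t_\t_\n"
)
count, labels = summarize(sample.splitlines())
print(count, dict(labels))
```

Applied to the Bulgarian test file (for example via summarize(open(path, encoding="utf-8")), where path is wherever the copy is stored), the sentence count should come out as 398 if the file matches the one described above.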

There is a description of the deprel tags in BTB. For a detailed description of what is going on, see the source of $TMT_ROOT/treex/lib/Treex/Block/A2A/BG/CoNLL2PDTStyle.pm. Here is a short list of BulTreeBank features that differ from PDT:

TO DO:

