Treebank Engineering
This page collects notes on the project in which we experiment with various dependency constructions encountered in treebanks and with their transformations. Feel free to edit and add new material!
The project could eventually lead to a journal article. The SVN storage for the article and related materials is at http://svn.ms.mff.cuni.cz/projects/publications/browser/papers/2011_cl_tree_conventions.
Current participants: David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).
Our basic strategy is as follows:
- Normalize the treebank, i.e. make its structure and analytical functions (afuns) as close as possible to those of PDT.
- Apply selected transformations, e.g. convert coordination from the Prague style to that of Mel'čuk.
- Evaluate the impact on parsing using the Malt and MST parsers.
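To make the second step concrete, here is a minimal sketch of one such transformation on a toy tree. It assumes the Prague style marks the conjunction as the head of the coordination (afun Coord) with the conjuncts as its member children, and it reads the Mel'čuk style as a left-to-right chain headed by the first conjunct. The Node class, the prague_to_chain function and the example sentence are hypothetical names made up for illustration, not project code. Nested coordination is not handled, and shared modifiers are left attached to the conjunction.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Node:
    id: int                  # 1-based word position in the sentence
    form: str
    head: int                # id of the parent; 0 is the artificial root
    afun: str
    is_member: bool = False  # True for conjuncts in the Prague style


def prague_to_chain(nodes: List[Node]) -> List[Node]:
    """Return a copy in which each (non-nested) Prague-style coordination
    is rewritten as a chain headed by its first conjunct."""
    out = [replace(n) for n in nodes]  # copy so the input stays untouched
    for coord in [n for n in out if n.afun == "Coord"]:
        members = sorted(
            (n for n in out if n.head == coord.id and n.is_member),
            key=lambda n: n.id,
        )
        if not members:
            continue
        first = members[0]
        # The first conjunct takes over the attachment of the whole coordination.
        first.head = coord.head
        # The conjunction and the remaining conjuncts each hang on the
        # preceding element of the chain, in word order.
        previous = first
        for node in sorted(members[1:] + [coord], key=lambda n: n.id):
            node.head = previous.id
            previous = node
        for member in members:
            member.is_member = False
    return out


# Toy example (assumed annotation): "John and Mary sleep"
sentence = [
    Node(1, "John", 2, "Sb", is_member=True),
    Node(2, "and", 4, "Coord"),
    Node(3, "Mary", 2, "Sb", is_member=True),
    Node(4, "sleep", 0, "Pred"),
]
converted = prague_to_chain(sentence)
for n in converted:
    print(n.id, n.form, n.head, n.afun)
```

In the converted tree "John" attaches to the verb, "and" to "John", and "Mary" to "and". A real implementation would also have to decide what to do with afuns inside the chain and with modifiers shared by all conjuncts.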
The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely it is defined).
Data
We want to have a collection of dependency treebanks that is as diverse as possible. Morphologically rich languages take priority, though some others may be picked for contrast. Genuine dependency treebanks are probably better than those converted from constituency trees. Possible sources include:
- CoNLL shared task data from 2006, 2007, 2008 and 2009. Ignore the semantic annotation in the 2008 and 2009 data. Some treebanks may come with licenses that prevent us from using them.
- ICON shared task data from 2009 and 2010 (the latter being a newer version of the same data): Hindi, Bangla, Telugu.
- Tamil treebank created by Loganathan.
- Latin Dependency Treebank (LDT) and Ancient Greek Dependency Treebank (AGDT).
- Russian Dependency Treebank (Dan has a version from 2006; David and Natalia have something newer).
Note that several treebanks have been modeled after PDT, thus their structure is very close to PDT and they probably require very little adjustment: PADT (Arabic), SDT (Slovene), Tamil, LDT (Latin) and possibly also AGDT.
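For the CoNLL 2006 and 2007 data, a small reader is enough to get at the trees. The sketch below assumes the ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) with a blank line between sentences, and it assumes the HEAD column is filled in. The 2008 and 2009 data use a different column layout and would need their own reader. The names Token and read_conllx and the sample sentences are made up for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # 0 means the token is attached to the artificial root
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # PHEAD and PDEPREL (columns 9 and 10) are ignored in this sketch.
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# Made-up sample data with two sentences.
sample = (
    "1\tJohn\tjohn\tN\tNNP\t_\t2\tSb\t_\t_\n"
    "2\tsleeps\tsleep\tV\tVBZ\t_\t0\tPred\t_\t_\n"
    "\n"
    "1\tYes\tyes\tI\tUH\t_\t0\tExD\t_\t_\n"
)
sentences = list(read_conllx(sample.splitlines()))
print(len(sentences), [t.form for t in sentences[0]])
```

To read a file instead of a string, pass the open file object to read_conllx, since it only needs an iterable of lines.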
Tentatively Selected Treebanks
- Czech: Prague Dependency Treebank. Our home treebank and the model for “default annotation style”. Morphologically very rich.
- Tamil: Loganathan's main contribution. Under development, i.e. we can adjust its guidelines to the findings of this project. Morphologically very rich, agglutinative.
- Bulgarian: originally an HPSG treebank, converted to dependencies for CoNLL 2006. Dan is transforming it to the PDT style. Medium morphological richness (nouns have no case inflection, but verbal morphology is rich).
Initial Normalization
The purpose of the initial normalization is to make the treebank look as close to PDT as possible. Normalization involves dependency structure, syntactic tags (afuns), and, if possible, morphological tags (using DZ Interset). The transformations applied during this process are an important source of insight into what various treebanks do differently and what we may want to experiment with later.
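The afun part of the normalization can be pictured as a label mapping plus a report of whatever the mapping does not cover. The sketch below is only an illustration: the source labels (SUBJ, OBJ and so on), the EXAMPLE_MAP table, the "???" placeholder and the function normalize_afuns are all hypothetical, and only the target labels (Sb, Obj, Adv, Atr, Pred) are PDT afuns. Real treebanks will rarely allow a one-to-one mapping, since many labels also require structural changes.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from one treebank's dependency labels to PDT afuns.
EXAMPLE_MAP: Dict[str, str] = {
    "SUBJ": "Sb",
    "OBJ": "Obj",
    "ADV": "Adv",
    "ATT": "Atr",
    "ROOT": "Pred",  # simplification: not every root is a predicate
}


def normalize_afuns(
    deprels: Iterable[str],
    mapping: Dict[str, str],
    unknown: str = "???",  # placeholder for labels the mapping does not cover
) -> Tuple[List[str], Counter]:
    """Map source labels to afuns and count the labels left unmapped."""
    afuns: List[str] = []
    unmapped: Counter = Counter()
    for label in deprels:
        if label in mapping:
            afuns.append(mapping[label])
        else:
            afuns.append(unknown)
            unmapped[label] += 1
    return afuns, unmapped


afuns, unmapped = normalize_afuns(
    ["SUBJ", "ROOT", "XCOMP", "OBJ", "XCOMP"], EXAMPLE_MAP
)
print(afuns)
print(unmapped.most_common())
```

The counter of unmapped labels gives a quick view of which constructions still need a dedicated rule for a given treebank.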