
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


user:zeman:treebank-engineering [2011/06/21 16:50]
zeman created
user:zeman:treebank-engineering [2011/07/01 12:08] (current)
zeman References.
Line 1: Line 1:
 ====== Treebank Engineering ======
  
-This page is a place for notes on the project where we experiment with various dependency constructions and their transformations encountered in treebanks. **Feel free to edit and add new stuff!**+This page is a place for notes on the project where we experiment with various dependency constructions and their transformations encountered in treebanks. **Feel free to edit and to add new stuff!**
  
 The project could eventually lead to a journal article. The SVN storage for the article and related materials is at [[http://svn.ms.mff.cuni.cz/projects/publications/browser/papers/2011_cl_tree_conventions]].
  
-Current participants: David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).+Current participants: Nathan Green (treebank normalization), David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development, treebank normalization), Rudolf Rosa (MST parser reimplementation in Perl), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).
  
 Our basic strategy is as follows:
Line 13: Line 13:
  
 The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely it is defined).
 +
 +===== Some Unsorted References =====
 +
 +  * Dan's old PBML article about inconsistent annotation rules in PDT 1.0 ("How to Decrease Performance of a Statistical Parser")
 +  * All references required by the providers of the respective treebanks.
 +  * Interset (the LREC paper is better?)
  
 ===== Data =====
Line 30: Line 36:
   * Czech: Prague Dependency Treebank. Our home treebank and the model for “default annotation style”. Morphologically very rich.
   * Tamil: Loganathan's main contribution. Under development, i.e. we can adjust its guidelines to the findings of this project. Morphologically very rich, agglutinative.
-  * Bulgarian: originally an HPSG treebank, converted to dependencies for CoNLL 2006. Dan is transforming it to the PDT style. Medium morphological richness (no cases of nouns but rich verbal morphology).+  * Bulgarian: BulTreeBank is originally an HPSG treebank, converted to dependencies for CoNLL 2006. Dan is transforming it to the PDT style. Medium morphological richness (no cases of nouns but rich verbal morphology).
  
 ===== Initial Normalization =====
  
 The purpose of the initial normalization is to make the treebank look as close to PDT as possible. Normalization involves dependency structure, syntactic tags (afuns), and, if possible, morphological tags (using [[interset|DZ Interset]]). The transformations applied during this process are an important source of insight into what various treebanks do differently and what we may want to experiment with later.
 +
 +Unless specified otherwise, normalization is done using Treex ([[internal:tectomt|TectoMT]]). See ''$TMT_ROOT/applications/norm_treebank'' and ''$TMT_ROOT/treex/lib/Treex/Block/A2A/$LANGUAGE/*2PDTStyle.pm''.
 +
 +==== Bulgarian ====
 +
 +The BulTreeBank (BTB) morphological tagset has been decoded to Interset features and, for convenience, also converted to PDT tags (with some information lost). There are no lemmas in BTB (but Mirek Týnovský has a rule-based tool to guess them!).
 +
 +To test our normalization of BTB, go to ''$TMT_ROOT/applications/norm_treebank'' and call ''make''. It will read our copy of the Bulgarian test file from CoNLL 2006 (398 sentences), transform it and create a file ''bg.treex''. View it by calling ''ttred bg.treex &'' (Treex must be initialized for the command ''ttred'' to be available).
 +
 +There is a [[http://www.bultreebank.org/dpbtb/|description of the deprel tags in BTB]]. For a detailed description of what is going on, see the source of ''$TMT_ROOT/treex/lib/Treex/Block/A2A/BG/CoNLL2PDTStyle.pm''. Here is a short list of BulTreeBank features that differ from PDT:
 +
 +  * Coordination is Mel'čuk-like, i.e. the first member is the head, and all other members, delimiters and shared modifiers are attached to it. Note that this style does not provide for the distinction between a shared modifier //(**čeští** studenti a vysokoškolští učitelé,// "Czech students and university teachers"//)// and a private modifier of the first member //(**čeští** studenti a němečtí učitelé,// "Czech students and German teachers"//).// Most of the time we cannot guess the correct attachment of the modifiers when transforming to the Prague annotation style.
 +  * A sentence-initial coordinating conjunction (such as in //But he believed that...//) is attached to the verb. In the Prague style this is coordination with a single member: the clause. Thus the conjunction is attached to the root and the verb is attached to the conjunction.
 +  * A preposition governs its noun phrase (so far the same as in PDT). However, rhematizers are attached to the preposition, not to the noun phrase. Advantage of the Bulgarian approach: the result is projective. Advantage of the Prague approach: the preposition has one child, which is more intuitive than two.
 +  * Final punctuation is attached to the main verb or another “real ROOT” node (not our artificial empty root).
 +  * There are special auxiliary particles “да” (da) and “ще” (šte). //Da// is a sort of infinitival marker (Bulgarian verbs do not have the infinitive form). //Šte// marks the future tense. In BTB the particles govern the verb form and all dependents of the verb are also attached to the particle. Our solutions:
 +    * **Da:** tagged as ''AuxC'' (similar to subordinating conjunctions), real afun and all dependents moved down to the verb.
 +    * **Šte:** tagged as ''AuxV'' (similar to auxiliary verbs), attached to the verb, real afun and all dependents moved to the verb.
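
The coordination item in the list above can be illustrated with a minimal sketch in Python. It is not the Treex implementation (which is Perl, in ''CoNLL2PDTStyle.pm''); the ''Node'' class, the role labels ("conjunct", "conj", "punct", "mod") and the choice of the last conjunction as the Prague-style head are assumptions made only for this example.

```python
class Node:
    """A minimal dependency-tree node (hypothetical; not the Treex node API)."""

    def __init__(self, form, role="", parent=None):
        self.form = form
        self.role = role          # illustrative labels: "conjunct", "conj", "punct", "mod"
        self.is_member = False    # Prague-style flag marking coordination members
        self.parent = None
        self.children = []
        if parent is not None:
            self.attach(parent)

    def attach(self, new_parent):
        """Detach from the current parent (if any) and hang under new_parent."""
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = new_parent
        if new_parent is not None:
            new_parent.children.append(self)


def melcuk_to_prague(first):
    """Re-head a coordination whose first conjunct governs everything.

    Assumption: the last conjunction becomes the head. Modifiers stay on the
    first conjunct, because shared and private modifiers cannot be told apart.
    """
    kids = list(first.children)
    conjunctions = [c for c in kids if c.role == "conj"]
    if not conjunctions:
        return first              # comma-only coordination is not handled here
    head = conjunctions[-1]
    head.attach(first.parent)     # the conjunction takes over the position of the first conjunct
    first.attach(head)
    first.is_member = True
    for child in kids:
        if child is head:
            continue
        if child.role == "conjunct":
            child.attach(head)
            child.is_member = True
        elif child.role in ("conj", "punct"):
            child.attach(head)    # remaining delimiters hang on the new head
    return head


# "čeští studenti a němečtí učitelé" in the Mel'čuk-like style
root = Node("<root>")
studenti = Node("studenti", "conjunct", root)
cesti = Node("čeští", "mod", studenti)
a = Node("a", "conj", studenti)
ucitele = Node("učitelé", "conjunct", studenti)
nemecti = Node("němečtí", "mod", ucitele)

head = melcuk_to_prague(studenti)
print(head.form, [c.form for c in head.children])
```

In this sketch //čeští// remains attached to //studenti//, which happens to be right for this phrase but would be wrong for a shared modifier; that is exactly the ambiguity noted in the list.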
 +
 +**TO DO:**
 +  * Get and apply the lemmatizer by Mirek Týnovský?
 +  * Sentence 39: should we try to detect private modifiers of the first member, e.g. when the modifier is “ще се” and the second member has the same modifier of its own? Similarly in sentence 54: the subject belongs to the first clause because the second clause has its own.
 +  * Sports scores (“2 : 1”): in BTB, ":" is a preposition (!) and "1" is its complement. In PDT this is coordination of two numbers.
 +  * Explore the set of possible complex verb forms and compare their annotation to the Czech ones in PDT. So far we transform only a fraction of them (if a participle is governed by “би”, swap them). Further examples: sentence 102: “, кой е бил сътрудник”; sentence 104: “, че не са били”. As in Czech “já jsem byl spolupracovník” ("I was a collaborator"), i.e. "jsem" should depend on "byl", not vice versa.
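
The swap mentioned in the last item (a participle governed by “би”) might look roughly like the following Python sketch. It is only an illustration, not the code in ''CoNLL2PDTStyle.pm'': the ''Node'' class and the tag value "participle" are hypothetical, and two details are assumptions modelled on the treatment of //šte// above, namely that the auxiliary receives ''AuxV'' and that its afun and remaining dependents move to the participle.

```python
class Node:
    """A minimal dependency-tree node (hypothetical; not the Treex node API)."""

    def __init__(self, form, afun="", tag="", parent=None):
        self.form = form
        self.afun = afun
        self.tag = tag            # illustrative coarse tag, e.g. "participle"
        self.parent = None
        self.children = []
        if parent is not None:
            self.attach(parent)

    def attach(self, new_parent):
        """Detach from the current parent (if any) and hang under new_parent."""
        if self.parent is not None:
            self.parent.children.remove(self)
        self.parent = new_parent
        if new_parent is not None:
            new_parent.children.append(self)


def swap_aux_and_participle(aux):
    """If "би" governs exactly one participle, make the participle the head."""
    participles = [c for c in aux.children if c.tag == "participle"]
    if aux.form != "би" or len(participles) != 1:
        return aux                        # leave anything else untouched
    verb = participles[0]
    verb.attach(aux.parent)               # the participle takes the auxiliary's position
    verb.afun, aux.afun = aux.afun, "AuxV"  # assumption: same labelling as for "ще"
    for dep in list(aux.children):
        dep.attach(verb)                  # assumption: remaining dependents move to the verb
    aux.attach(verb)
    return verb


# "той би дошъл" ("he would come") with the auxiliary as the head
root = Node("<root>")
aux = Node("би", afun="Pred", parent=root)
subj = Node("той", afun="Sb", parent=aux)
part = Node("дошъл", tag="participle", parent=aux)

verb = swap_aux_and_participle(aux)
print(verb.form, verb.afun, [(c.form, c.afun) for c in verb.children])
```

The other complex verb forms listed in the item (e.g. those with “е бил”) would need their own conditions; the sketch covers only the “би” case that the notes say is already transformed.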
