Differences
This shows you the differences between two versions of the page.
user:zeman:treebank-engineering [2011/06/21 16:50] zeman created |
user:zeman:treebank-engineering [2011/07/01 12:08] (current) zeman References. |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Treebank Engineering ====== | ====== Treebank Engineering ====== | ||
- | This page is a place for notes on the project where we experiment with various dependency constructions and their transformations encountered in treebanks. **Feel free to edit and add new stuff!** | + | This page is a place for notes on the project where we experiment with various dependency constructions and their transformations encountered in treebanks. **Feel free to edit and to add new stuff!** |
The project could eventually lead to a journal article. The SVN storage for the article and related materials is at [[http:// | The project could eventually lead to a journal article. The SVN storage for the article and related materials is at [[http:// | ||
- | Current participants: | + | Current participants: |
Our basic strategy is as follows: | Our basic strategy is as follows: | ||
Line 13: | Line 13: | ||
The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely they are defined). | The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely they are defined). | ||
+ | |||
+ | ===== Some Unsorted References ===== | ||
+ | |||
+ | * Dan's old PBML article about inconsistent annotation rules in PDT 1.0 ("How to Decrease Performance of a Statistical Parser" | ||
+ | * All references required by the providers of the respective treebanks. | ||
+ | * Interset (the LREC paper is better?) | ||
===== Data ===== | ===== Data ===== | ||
Line 30: | Line 36: | ||
* Czech: Prague Dependency Treebank. Our home treebank and the model for “default annotation style”. Morphologically very rich. | * Czech: Prague Dependency Treebank. Our home treebank and the model for “default annotation style”. Morphologically very rich. | ||
* Tamil: Loganathan' | * Tamil: Loganathan' | ||
- | * Bulgarian: originally a HPSG treebank, converted to dependencies for CoNLL 2006. Dan is transforming it to the PDT style. Medium morphological richness (no cases of nouns but rich verbal morphology). | + | * Bulgarian: |
===== Initial Normalization ===== | ===== Initial Normalization ===== | ||
The purpose of the initial normalization is to make the treebank look as close to PDT as possible. Normalization involves dependency structure, syntactic tags (afuns), and, if possible, morphological tags (using [[interset|DZ Interset]]). The transformations applied during this process are an important source of insight into what various treebanks do differently and what we may want to experiment with later. | The purpose of the initial normalization is to make the treebank look as close to PDT as possible. Normalization involves dependency structure, syntactic tags (afuns), and, if possible, morphological tags (using [[interset|DZ Interset]]). The transformations applied during this process are an important source of insight into what various treebanks do differently and what we may want to experiment with later. | ||
+ | |||
+ | Unless specified otherwise, normalization is done using Treex ([[internal: | ||
+ | |||
+ | ==== Bulgarian ==== | ||
+ | |||
+ | The BulTreeBank (BTB) morphological tagset has been decoded to Interset features, and, for convenience, | ||
+ | |||
+ | To test our normalization of BTB, go to '' | ||
+ | |||
+ | There is a [[http:// | ||
+ | |||
+ | * Coordination is Mel' | ||
+ | * A sentence-initial coordinating conjunction (as in //But he believed that...//) is attached to the verb. In the Prague style this is a coordination with a single member, the clause. Thus the conjunction is attached to the root and the verb is attached to the conjunction. | ||
+ | * Preposition governs its noun phrase (so far same as PDT). However, rhematizers are attached to the preposition, | ||
+ | * Final punctuation is attached to the main verb or other “real ROOT” node (not our artificial empty root). | ||
+ | * There are special auxiliary particles “да” (da) and “ще” (šte). //Da// is a sort of infinitival marker (Bulgarian verbs have no infinitive form). //Šte// marks the future tense. In BTB the particles govern the verb form, and all dependents of the verb are also attached to the particle. Our solutions: | ||
+ | * **Da:** tagged as '' | ||
+ | * **Šte:** tagged as '' | ||
+ | |||
+ | **TO DO:** | ||
+ | * Get and apply the lemmatizer by Mirek Týnovský? | ||
+ | * Sentence 39: should we try to detect modifiers that are private to the first member, e.g. when the modifier is “ще се” and the second member has one of its own? Similarly in sentence 54: the subject belongs to the first clause only, because the second clause has its own. | ||
+ | * Sports scores (“2 : 1”): in BTB, ":" | ||
+ | * Explore the set of possible complex verb forms and compare their annotation to the Czech ones in PDT. So far we transform a fraction (if a participle is governed by “би”, swap them). Further examples: sentence 102: “, кой е бил сътрудник”; |
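The Prague-style treatment of a sentence-initial coordinating conjunction described in the Bulgarian notes above can be illustrated with a small sketch. This is not the Treex code used in the project. The ''Node'' class, its fields, and the function name are hypothetical and exist only for this example; it also assumes that "attached to the root" means the conjunction takes over the verb's former parent, which is the artificial root (0) for a main verb.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical minimal dependency node (not a Treex class)."""
    ord: int     # 1-based word position; 0 is reserved for the artificial root
    form: str    # word form
    parent: int  # ord of the governing node


def reattach_initial_conjunction(nodes: List[Node], conj_ord: int, verb_ord: int) -> None:
    """Turn 'conjunction depends on verb' into 'verb depends on conjunction'.

    Assumption: the conjunction inherits the verb's former parent
    (the artificial root for a main verb). Other nodes are left untouched.
    """
    by_ord = {node.ord: node for node in nodes}
    conj, verb = by_ord[conj_ord], by_ord[verb_ord]
    if conj.parent != verb.ord:
        raise ValueError("conjunction is not attached to the given verb")
    conj.parent = verb.parent  # conjunction moves up to the verb's old parent
    verb.parent = conj.ord     # the clause becomes the single coordination member


# Source-style input for "But he believed": both "But" and "he" depend on the verb.
sentence = [Node(1, "But", 3), Node(2, "he", 3), Node(3, "believed", 0)]
reattach_initial_conjunction(sentence, conj_ord=1, verb_ord=3)
print([(node.form, node.parent) for node in sentence])
# [('But', 0), ('he', 3), ('believed', 1)]
```

Only the two nodes involved change their parent; the other dependents of the verb stay where they are. Syntactic tags (afuns) are deliberately left out of the sketch.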