
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebank-engineering [2011/06/21 17:33]
zeman Link.
user:zeman:treebank-engineering [2011/07/01 12:08] (current)
zeman References.
Line 1: Line 1:
 ====== Treebank Engineering ======
  
-This page is a place for notes on the project where we experiment with various dependency constructions and their transformations encountered in treebanks. **Feel free to edit and add new stuff!**
+This page is a place for notes on the project where we experiment with various dependency constructions and their transformations encountered in treebanks. **Feel free to edit and to add new stuff!**
  
 The project could eventually lead to a journal article. The SVN storage for the article and related materials is at [[http://svn.ms.mff.cuni.cz/projects/publications/browser/papers/2011_cl_tree_conventions]].
  
-Current participants: David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).
+Current participants: Nathan Green (treebank normalization), David Mareček (parsing experiments), Martin Popel (transformations), Loganathan Ramasamy (Tamil development, treebank normalization), Rudolf Rosa (MST parser reimplementation in Perl), Daniel Zeman (treebank normalization), Zdeněk Žabokrtský (transformations) and Jan Hajič (proofreading).
  
 Our basic strategy is as follows:
Line 13: Line 13:
  
 The special CL issue is on parsing “morphologically rich” languages, so we will have to devote some effort to arguing how our observations relate to that group of languages (however vaguely they are defined).
 +
 +===== Some Unsorted References =====
 +
 +  * Dan's old PBML article about inconsistent annotation rules in PDT 1.0 ("How to Decrease Performance of a Statistical Parser")
 +  * All references required by the providers of the respective treebanks.
 +  * Interset (the LREC paper is better?)
  
 ===== Data =====
Line 46: Line 52:
 There is a [[http://www.bultreebank.org/dpbtb/|description of the deprel tags in BTB]]. For detailed description of what's going on see the source of ''$TMT_ROOT/treex/lib/Treex/Block/A2A/BG/CoNLL2PDTStyle.pm''. Here is a short list of BulTreeBank features different from PDT:
  
-  * Coordination is Mel'čuk-like, i.e. first member is the head, all other members, delimiters and shared modifiers are attached to it. Note that this style does not provide for the distinction between a shared modifier and a private modifier of the first member.
+  * Coordination is Mel'čuk-like, i.e. first member is the head, all other members, delimiters and shared modifiers are attached to it. Note that this style does not provide for the distinction between a shared modifier //(**čeští** studenti a vysokoškolští učitelé)// and a private modifier of the first member //(**čeští** studenti a němečtí učitelé).// Most of the time we cannot guess the correct attachment of the modifiers when transforming to the Prague annotation style.
 +  * Sentence-initial coordinating conjunction (such as in //But he believed that...//) is attached to the verb. In the Prague style this is coordination with a single member: the clause. Thus the conjunction is attached to the root and the verb is attached to the conjunction.
   * Preposition governs its noun phrase (so far same as PDT). However, rhematizers are attached to the preposition, not the noun phrase. Advantage of the Bulgarian approach: the result is projective. Advantage of the Prague approach: the preposition has one child, which is more intuitive than two.
   * Final punctuation is attached to the main verb or other “real ROOT” node (not our artificial empty root).
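
To make the coordination item in the list above concrete, here is a minimal Python sketch of rehanging a Mel'čuk-style coordination under its conjunction, Prague-style. It is not the code from ''CoNLL2PDTStyle.pm''. The ''Node'' structure and the input labels ''MEMBER'' and ''CONJ'' are hypothetical, and the output conventions (''Coord'' on the conjunction, ''_Co'' appended to member labels) are an assumption about how the Prague style is encoded. Nested coordinations are not handled. Modifiers of the first member are left where they are, because the source structure does not say whether they are shared or private.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical token record; not the Treex node API."""
    idx: int      # 1-based position in the sentence
    form: str     # word form
    head: int     # idx of the parent; 0 is the artificial root
    deprel: str   # dependency label


def melcuk_to_prague(nodes: List[Node]) -> None:
    """Rehang Mel'cuk-style coordinations under their conjunction, in place.

    Assumed input: the first member heads the coordination; the other
    members carry the (hypothetical) label MEMBER and the conjunctions and
    other delimiters carry the (hypothetical) label CONJ.
    """
    by_idx = {n.idx: n for n in nodes}
    # Collect the first members up front, before any head is rewritten.
    first_members = sorted(
        {n.head for n in nodes if n.deprel == "MEMBER" and n.head in by_idx}
    )
    for first_idx in first_members:
        first = by_idx[first_idx]
        members = [n for n in nodes
                   if n.head == first_idx and n.deprel == "MEMBER"]
        delims = [n for n in nodes
                  if n.head == first_idx and n.deprel == "CONJ"]
        if not delims:
            continue  # nothing to promote; leave this structure untouched
        # The rightmost delimiter becomes the head of the coordination.
        coord = max(delims, key=lambda n: n.idx)
        member_label = first.deprel + "_Co"
        coord.head, coord.deprel = first.head, "Coord"
        for m in [first] + members:
            m.head, m.deprel = coord.idx, member_label
        for d in delims:
            if d is not coord:
                d.head = coord.idx
        # Modifiers of the first member are deliberately not moved: we
        # cannot tell a shared modifier from a private one.


# The private-modifier example from the list above.
sentence = [
    Node(1, "čeští", 2, "Atr"),
    Node(2, "studenti", 0, "Sb"),
    Node(3, "a", 2, "CONJ"),
    Node(4, "němečtí", 5, "Atr"),
    Node(5, "učitelé", 2, "MEMBER"),
]
melcuk_to_prague(sentence)
for n in sentence:
    print(n.idx, n.form, n.head, n.deprel)
```

In this sketch //čeští// stays attached to //studenti//, which is right for the private-modifier example but would be wrong for the shared-modifier one; telling the two apart needs information that the Mel'čuk-style tree does not carry.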
