Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

external:tectomt:tutorial [2009/01/21 12:09]
kravalova
external:tectomt:tutorial [2009/01/22 11:50]
kravalova
Line 1: Line 1:
 ====== TectoMT Tutorial ======
 
-Welcome to the TectoMT Tutorial. This tutorial should take about hours.
+Welcome to the TectoMT Tutorial. This tutorial should take about hours.
  
  
Line 7: Line 8:
 ===== What is TectoMT =====
 
-TectoMT is a highly modular NLP (Natural Language Processing) software system implemented in the Perl programming language under Linux. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project. At the same time, it is also hoped to significantly facilitate and accelerate development of software solutions for many other NLP tasks, especially due to re-usability of the numerous integrated processing modules (called blocks), which are equipped with uniform object-oriented interfaces.
+TectoMT is a highly modular NLP (Natural Language Processing) software system implemented in the Perl programming language under Linux. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project. At the same time, it is also hoped to facilitate and significantly accelerate development of software solutions for many other NLP tasks, especially due to re-usability of the numerous integrated processing modules (called blocks), which are equipped with uniform object-oriented interfaces.
  
  
Line 20: Line 21:
   * Your shell is bash
   * You have basic experience with bash and can read basic Perl
+
+
+
  
  
Line 33: Line 37:
 ==== Installation and setup ====
 
-  * Check out the SVN repository. If you are running this installation in the computer lab in Prague, you have to check out the repository into the directory ''/home/BIG'' (because data quotas don't apply here):
+  * Check out the SVN repository. If you are running this installation in the computer lab in Prague, you have to check out the repository into the directory ''/home/BIG'' (because a bigger disk quota applies here):
 
 <code bash>
     cd ~/BIG
-    svn --username <username> co https://svn.ms.mff.cuni.cz/svn/tectomt_devel/trunk tectomt
+    svn --username mtm co https://svn.ms.mff.cuni.cz/svn/tectomt_devel/trunk tectomt
 </code>
  
Line 52: Line 56:
     source ~/BIG/tectomt/config/init_devel_environ.sh
 </code>
+
+  * In your ''.bash_profile'' file, add the line
+
+<code bash>
+    source .bashrc
+</code>
+
+
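The relative path in ''source .bashrc'' resolves only when the shell starts in the home directory, which is the usual case for a login shell. A slightly more defensive variant, given here as a sketch and not as part of the TectoMT setup, takes an explicit path and skips the file if it is missing. The helper name ''load_rc'' is made up for this illustration.

```shell
# Hypothetical helper (not part of TectoMT):
# source an rc file only if it actually exists.
load_rc() {
    if [ -f "$1" ]; then
        . "$1"
    fi
}
```

In ''.bash_profile'' the call would then be ''load_rc "$HOME/.bashrc"''.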
  
  
Line 86: Line 98:
  
 This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''.
+
  
  
Line 101: Line 114:
 {{ external:tectomt:pyramid.gif?300x190|MT pyramid in terms of PDT layers}}
 
-The notion of 'layer' has a combinatorial nature in TectoMT. It corresponds not only the layer of language description as used e.g. in the Prague Dependency Treebank, but it is also specific for a given language (e.g., possible values of morphological tags are typically different for different languages) and even for how the data on the given layer were created (whether by analysis from the lower layer or by synthesis/transfer).
+The notion of 'layer' has a combinatorial nature in TectoMT. It corresponds not only to the layer of language description as used e.g. in the Prague Dependency Treebank, but it is also specific for a given language (e.g., possible values of morphological tags are typically different for different languages) and even for how the data on the given layer were created (whether by analysis from the lower layer or by synthesis/transfer).
 
 Thus, the set of TectoMT layers is a Cartesian product {S,T} x {English,Czech,...} x {W,M,P,A,T}, in which:
Line 113: Line 126:
 //Example//: The block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''.
 
-There are also directories for blocks with other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial''.
+There are also directories for blocks with other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial/''.
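The layer names that appear in block directories such as ''SCzechW_to_SCzechM'' are concatenations of one element from each set of the Cartesian product described above. The following shell sketch enumerates them. The function name ''layer_names'' is an assumption made for this example, and the language set is cut down to the two languages named in the text.

```shell
# List every layer name in {S,T} x {English,Czech} x {W,M,P,A,T}.
# layer_names is an illustrative helper, not a TectoMT command.
layer_names() {
    for side in S T; do
        for lang in English Czech; do
            for level in W M P A T; do
                echo "${side}${lang}${level}"
            done
        done
    done
}

layer_names
```

With two languages this yields 2 x 2 x 5 = 20 names, among them ''SEnglishM'' and ''SEnglishA'', the two layers used in this tutorial.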
 + 
  
  
Line 140: Line 155:
   * A document consists of a sequence of bundles (''<bundle>''), mirroring a sequence of natural language sentences originating from the text. So, for one sentence we have one ''<bundle>''.
   * Each bundle contains tree-shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have a morphological tree (''SEnglishM'') in each bundle. Later on, an analytical layer (''SEnglishA'') will also appear in each bundle as we proceed with our analysis.
-  * Trees are formed by nodes and edges. Attributes can be attached only to nodes. An edge's attributes must be equivalently stored as the lower node's attributes. A tree's attributes must be stored as attributes of the root node.
+  * Trees are formed by nodes and edges. Attributes can be attached only to nodes. An edge's attributes must be stored as the lower node's attributes. A tree's attributes must be stored as attributes of the root node.
 + 
 + 
 + 
 + 
 + 
  
  
Line 171: Line 192:
                 SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
                 SEnglishW_to_SEnglishM::Penn_style_tokenization \
-                SEnglishW_to_SEnglishM::TagTnT \
+                SEnglishW_to_SEnglishM::TagMxPost \
                 SEnglishW_to_SEnglishM::Lemmatize_mtree \
         -- sample.tmt
Line 183: Line 204:
                 SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
                 SEnglishW_to_SEnglishM::Penn_style_tokenization \
-                SEnglishW_to_SEnglishM::TagTnT \
+                SEnglishW_to_SEnglishM::TagMxPost \
                 SEnglishW_to_SEnglishM::Lemmatize_mtree \
                 SEnglishM_to_SEnglishA::McD_parser_local \
Line 200: Line 221:
  
 we can examine our ''sample.tmt'' again. Indeed, an analytical layer ''SEnglishA'' describing a dependency tree with analytical functions (''<afun>'') has been added to each bundle.
+
+Blocks can also be parametrized. For the syntactic parser, we might want to use a smaller but faster model. To achieve this, replace the line
+
+<code bash>
+SEnglishM_to_SEnglishA::McD_parser_local \
+</code>
+
+with
+
+<code bash>
+SEnglishM_to_SEnglishA::McD_parser_local MT_PARAM_MCD_EN_MODEL=conll_mcd_order2_0.1.model \
+</code>
 
 You can view the trees in ''sample.tmt'' with TrEd by typing
Line 206: Line 239:
 tmttred sample.tmt
 </code>
+
+Try to click on some nodes to see their attributes (tag, lemma, form, analytical function etc.).
+
+//Note//: For more information about the tree editor TrEd, see [[http://ufal.mff.cuni.cz/~pajas/tred/ar01-toc.html|TrEd User's Manual]].
  
  
Line 256: Line 293:
  
   * ''my @bundles = $document<nowiki>-></nowiki>get_bundles()'' - an array of bundles contained in the document
-  * ''my $root_node = $bundle<nowiki>-></nowiki>get_tree($layer_name);'' - the root node of the tree of the given type in the given bundle
-  * ''my @children = $node<nowiki>-></nowiki>get_children();'' - array of the node's children
-  * ''my @descendants = $node<nowiki>-></nowiki>get_descendants();'' - array of the node's children and their children and children of their children ...
-  * ''my $parent = $node<nowiki>-></nowiki>get_parent();'' - parent node of the given node, or undef for root
-  * ''my $root_node = $node<nowiki>-></nowiki>get_root();'' - the root node of the tree into which the node belongs
+  * ''my $root_node = $bundle<nowiki>-></nowiki>get_tree($layer_name)'' - the root node of the tree of the given type in the given bundle
+  * ''my @children = $node<nowiki>-></nowiki>get_children()'' - array of the node's children
+  * ''my @descendants = $node<nowiki>-></nowiki>get_descendants()'' - array of the node's children and their children and children of their children ...
+  * ''my $parent = $node<nowiki>-></nowiki>get_parent()'' - parent node of the given node, or undef for root
+  * ''my $root_node = $node<nowiki>-></nowiki>get_root()'' - the root node of the tree into which the node belongs
 
 Attributes of documents, bundles or nodes can be accessed by attribute getters and setters, for example:
-  * ''$node<nowiki>-></nowiki>get_attr($attr_name);''
-  * ''$node<nowiki>-></nowiki>set_attr($attr_name, $attr_value);''
+
+  * ''$node<nowiki>-></nowiki>get_attr($attr_name)''
+  * ''$node<nowiki>-></nowiki>set_attr($attr_name, $attr_value)''
+
+Some interesting attributes on the morphological layer are ''form'', ''lemma'' and ''tag''. Some interesting attributes on the analytical layer are ''afun'' (analytical function) and ''ord'' (surface word order). To reach ''form'', ''lemma'' or ''tag'' from the analytical layer, that is, when reading these attributes on an ''a-node'', use ''$a_node<nowiki>-></nowiki>get_attr('m/form')'', and likewise for ''lemma'' and ''tag''. The easiest way to see the node attributes is to click on the node in TrEd:
+
+<code bash>
+tmttred sample.tmt
+</code>
 
 Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario, e.g. as a new Makefile target:
Line 279: Line 323:
 </code>
 
-//Advanced version//: Try to change the block so that it prints out the information only for verbs. (You need to look at the attribute ''tag'' on the ''m'' level.) The tagset used is the Penn Treebank tagset.
+Try to change the block so that it prints out the information only for verbs. (You need to look at the attribute ''tag'' on the ''m'' level.) The tagset used is the Penn Treebank tagset.
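In the Penn Treebank tagset all verb tags start with ''VB'' (''VB'', ''VBD'', ''VBG'', ''VBN'', ''VBP'', ''VBZ''), while modal verbs carry the separate tag ''MD''. The test itself is therefore a prefix check. The sketch below shows it in shell for illustration only; inside the block the same check would be written in Perl against the value of the ''tag'' attribute. The function name ''is_verb_tag'' is invented here, and whether modals should count as verbs is left open.

```shell
# Succeeds (exit status 0) if the Penn Treebank tag denotes a verb.
# is_verb_tag is a made-up name; MD (modal) is deliberately not matched.
is_verb_tag() {
    case "$1" in
        VB*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Try the check on a few sample tags.
for tag in NN VBZ VBN IN MD; do
    if is_verb_tag "$tag"; then
        echo "$tag: verb"
    else
        echo "$tag: not a verb"
    fi
done
```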
  
  
Line 303: Line 347:
 ==== Task ====
 A block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with a boolean attribute ''is_clause_head'' which is set to ''1'' if the ''a-node'' corresponds to a finite verb, and to ''0'' otherwise.
 +
 +
 +
 +
 +
  
  
Line 349: Line 398:
   * ''my @eff_children = $node<nowiki>-></nowiki>get_eff_children()''
 
-//Note//: ''get_children()'' returns topological node children in a tree, while ''get_eff_children()'' returns node children in a linguistic sense. Mostly, these do not differ.
+//Note//: ''get_children()'' returns topological node children in a tree, while ''get_eff_children()'' returns node children in a linguistic sense. Mostly, these do not differ. If interested, see Figure 1 in the [[http://ufal.mff.cuni.cz/pdt2.0/doc/tools/tred/bn-tutorial.html|btred tutorial]].
 
+//Hint//: Finite clauses in English usually require a grammatical subject to be present.
 
+==== Advanced version ====
 
-//Advanced version//: The output of our block might still be incorrect in special cases - we don't handle coordination (see the second sentence in sample.txt) or subordinating conjunctions.
+The output of our block might still be incorrect in special cases - we don't handle coordination (see the second sentence in sample.txt) or subordinating conjunctions.
 
 
 
 ===== Your turn: more tasks =====
 +
 +
 +
 +
  
  
Line 377: Line 432:
 **Instructions**:
 
+  * You can use the block template in ''libs/blocks/BlockTemplate.pm''
   * To find the object of a verb, look for objects among the effective children of the verb (''$child<nowiki>-></nowiki>get_attr('afun') eq 'Obj' ''). That implies working on the analytical layer.
   * For debugging, a method returning the surface word order of a node is useful: ''$node<nowiki>-></nowiki>get_attr('ord')''. It can be used to print out nodes sorted by the attribute ''ord''.
   * Once you have node ''$object'' and node ''$verb'', use the method ''$object<nowiki>-></nowiki>shift_before_node($verb)''. This method takes the whole subtree under node ''$object'' and re-counts the attributes ''ord'' (surface word order) so that all nodes in the subtree under ''$object'' have a smaller ''ord'' than ''$verb''. That is, the method rearranges the surface word order from VO to OV.
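Printing nodes in surface order amounts to a numeric sort on ''ord''. Assume, purely for illustration, that a debugging block prints one node per line as ''ord'' and ''form'' separated by a space. This output format is an assumption, not something the tutorial blocks are known to produce, and ''surface_order'' is a made-up name. The sentence can then be restored in the shell like this:

```shell
# Restore surface word order from "ord form" lines (assumed format):
# sort numerically on ord, drop the ord column, join forms with spaces.
surface_order() {
    sort -n -k1,1 | cut -d' ' -f2- | paste -s -d' ' -
}

printf '%s\n' '4 urged' '2 Brown' '5 MPs' '1 Mr.' '3 has' '6 .' | surface_order
# prints: Mr. Brown has urged MPs .
```

Running such a listing before and after ''shift_before_node'' makes the change in word order easy to see.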
 +
+**Advanced version**: This solution shifts the object (or several objects) of a verb just in front of that verb node. For example, //Mr. Brown has urged MPs.// changes to //Mr. Brown has MPs urged.// You can try to change this solution so that the final sentence is //Mr. Brown MPs has urged.// You may need the method ''$node<nowiki>-></nowiki>shift_after_subtree($root_of_that_subtree)''. Subjects should have the attribute ''afun'' set to ''Sb''.
 +
 +
 +
 +
  
  
Line 427: Line 489:
 //Hint//:
   * On the analytical layer, you can use this test to recognize prepositions: ''$node<nowiki>-></nowiki>get_attr('afun') eq 'AuxP' ''
-  * You can use the block template in ''libs/blocks/BlockTemplate.pm''
   * To see the results, you can again use TrEd (''tmttred sample.tmt'')
 
+**Advanced version**: What happens in case of multiword prepositions? For example, ''because of'', ''instead of''. Can you handle it?
 
-//Advanced version//: What happens in case of multiword prepositions? For example, ''because of'', ''instead of''. Can you handle it?
 
 
-===== Further information =====
-  * [[http://ufallab2.ms.mff.cuni.cz/~bojar/cruise_control_tmt/last_doc/generated/guide/guidelines.html|TectoMT Developer's Guide]] - obsolete
-  * Questions? Ask ''kravalova'' at ''ufal.mff.cuni.cz''
-  * Solutions to the tasks in this tutorial are in ''libs/blocks/Tutorial/*solution.pm''.
-  * [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] - tree editor
 
 
+===== Further information =====
+  * [[http://ufallab2.ms.mff.cuni.cz/~bojar/cruise_control_tmt/last_doc/generated/guide/guidelines.html|TectoMT Developer's Guide]]
+  * Questions? Ask ''kravalova'' at ''ufal.mff.cuni.cz''
+  * Solutions to the tasks in this tutorial are in ''libs/blocks/Tutorial/*solution*.pm''.
+  * [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]], [[http://ufal.mff.cuni.cz/~pajas/tred/ar01-toc.html|TrEd User's Manual]] - tree editor
  
  
