
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

external:tectomt:tutorial [2009/01/21 11:35]
kravalova
external:tectomt:tutorial [2009/01/22 10:34]
kravalova
Line 86: Line 86:
  
 This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''. This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''.
 +
 +
 +
 +
 +
  
  
Line 97: Line 102:
 {{ external:tectomt:pyramid.gif?300x190|MT pyramid in terms of PDT layers}} {{ external:tectomt:pyramid.gif?300x190|MT pyramid in terms of PDT layers}}
  
-TectoMT blocks repository is saved in ''libs/blocks/''. In correspondence with ..., the blocks are located in directories describing their purpose. +The notion of 'layer' has a combinatorial nature in TectoMT. It corresponds not only to the layer of language description as used e.g. in the Prague Dependency Treebank, but it is also specific for a given language (e.g., possible values of morphological tags are typically different for different languages) and even for how the data on the given layer were created (whether by analysis from the lower layer or by synthesis/transfer).
  
 Thus, the set of TectoMT layers is a Cartesian product {S,T} x {English,Czech,...} x {W,M,P,A,T}, in which: Thus, the set of TectoMT layers is a Cartesian product {S,T} x {English,Czech,...} x {W,M,P,A,T}, in which:
Line 104: Line 109:
   * {English,Czech...} represents the language in question   * {English,Czech...} represents the language in question
   * {W,M,P,A,T...} represents the layer of description in terms of PDT 2.0 (W - word layer, M - morphological layer, A - analytical layer, T - tectogrammatical layer) or extensions (P - phrase-structure layer).   * {W,M,P,A,T...} represents the layer of description in terms of PDT 2.0 (W - word layer, M - morphological layer, A - analytical layer, T - tectogrammatical layer) or extensions (P - phrase-structure layer).
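
As a rough illustration of this Cartesian product, the following shell sketch enumerates layer names by concatenating the three components, following the pattern of names such as ''SEnglishM'' used in this tutorial. The function name ''list_layers'' and the two-language list are illustrative assumptions, not part of TectoMT, and not every combination necessarily occurs in practice.

<code bash>
# Print every layer name in {S,T} x {languages} x {W,M,P,A,T}.
# The languages are passed as arguments; "English Czech" below is
# only an illustrative subset.
list_layers() {
    for origin in S T; do
        for lang in "$@"; do
            for layer in W M P A T; do
                printf '%s%s%s\n' "$origin" "$lang" "$layer"
            done
        done
    done
}

list_layers English Czech
</code>

With two languages the sketch prints 2 x 2 x 5 = 20 names, from ''SEnglishW'' to ''TCzechT''.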
 +
 +Blocks in the block repository ''libs/blocks'' are located in directories indicating their purpose in machine translation.
  
 //Example//: Block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''. //Example//: Block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''.
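
The scenario commands in this tutorial refer to blocks in the form ''Directory::Block_name'' (e.g. ''SEnglishW_to_SEnglishM::Penn_style_tokenization''). The sketch below converts a block file path into that form, assuming that the name is simply the path below ''libs/blocks/'' without the ''.pm'' extension. The helper ''block_name'' is hypothetical and not a TectoMT tool.

<code bash>
# Convert a block file path into the Directory::Block_name form.
# Assumption: the block name mirrors the path below libs/blocks/.
block_name() {
    path=${1#libs/blocks/}   # drop the repository prefix
    path=${path%.pm}         # drop the Perl module extension
    printf '%s\n' "$path" | sed 's|/|::|g'
}

block_name libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm
</code>

For the example path above, the sketch prints ''SCzechW_to_SCzechM::Simple_tagger''.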
  
-There are also directories for blocks with other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial''.+There are also directories for blocks with other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial/''. 
 + 
  
  
Line 127: Line 136:
 </code> </code>
  
-Our plain text data ''sample.txt'' has been transformed into ''tmt'', an internal TectoMT format, and saved into ''sample.tmt''. Then, all four blocks have been loaded and our data has been processed. We can now examine ''sample.tmt'' using a regular text editor. We'll now stop and describe the data structure in TectoMT.+Our plain text data ''sample.txt'' has been transformed into ''tmt'', an internal TectoMT format, and saved into ''sample.tmt''. Then, all four blocks have been loaded and our data has been processed. We can now examine ''sample.tmt'' with a text editor (vi, emacs, etc.).
  
   * One physical ''tmt'' file corresponds to one document.   * One physical ''tmt'' file corresponds to one document.
Line 133: Line 142:
   * Each bundle contains tree-shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have a morphological tree (''SEnglishM'') in each bundle. Later on, an analytical layer (''SEnglishA'') will also appear in each bundle as we proceed with our analysis.    * Each bundle contains tree-shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have a morphological tree (''SEnglishM'') in each bundle. Later on, an analytical layer (''SEnglishA'') will also appear in each bundle as we proceed with our analysis. 
   * Trees are formed by nodes and edges. Attributes can be attached only to nodes. Edge's attributes must be equivalently stored as the lower node's attributes. Tree's attributes must be stored as attributes of the root node.   * Trees are formed by nodes and edges. Attributes can be attached only to nodes. Edge's attributes must be equivalently stored as the lower node's attributes. Tree's attributes must be stored as attributes of the root node.
 +
 +
 +
  
  
Line 155: Line 167:
 ===== Changing the scenario ===== ===== Changing the scenario =====
  
-We'll now add a syntax analysis (dependency parsing) to our scenario by adding four more blocks. Instead of +We'll now add a syntax analysis (dependency parsing) to our scenario by adding three more blocks. Instead of 
  
 <code bash> <code bash>
Line 162: Line 174:
                 SEnglishW_to_SEnglishM::Sentence_segmentation_simple \                 SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
                 SEnglishW_to_SEnglishM::Penn_style_tokenization \                 SEnglishW_to_SEnglishM::Penn_style_tokenization \
-                SEnglishW_to_SEnglishM::TagTnT \+                SEnglishW_to_SEnglishM::TagMxPost \
                 SEnglishW_to_SEnglishM::Lemmatize_mtree \                 SEnglishW_to_SEnglishM::Lemmatize_mtree \
         -- sample.tmt         -- sample.tmt
Line 174: Line 186:
                 SEnglishW_to_SEnglishM::Sentence_segmentation_simple \                 SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
                 SEnglishW_to_SEnglishM::Penn_style_tokenization \                 SEnglishW_to_SEnglishM::Penn_style_tokenization \
-                SEnglishW_to_SEnglishM::TagTnT \+                SEnglishW_to_SEnglishM::TagMxPost \
                 SEnglishW_to_SEnglishM::Lemmatize_mtree \                 SEnglishW_to_SEnglishM::Lemmatize_mtree \
                 SEnglishM_to_SEnglishA::McD_parser_local \                 SEnglishM_to_SEnglishA::McD_parser_local \
Line 197: Line 209:
 tmttred sample.tmt tmttred sample.tmt
 </code> </code>
 +
 +
  
  
Line 246: Line 260:
  
   * ''my @bundles = $document<nowiki>-></nowiki>get_bundles()'' - an array of bundles contained in the document   * ''my @bundles = $document<nowiki>-></nowiki>get_bundles()'' - an array of bundles contained in the document
-  * ''my $root_node = $bundle<nowiki>-></nowiki>get_tree($layer_name);'' - the root node of the tree of the given type in the given bundle +  * ''my $root_node = $bundle<nowiki>-></nowiki>get_tree($layer_name)'' - the root node of the tree of the given type in the given bundle 
-  * ''my @children = $node<nowiki>-></nowiki>get_children();'' - array of the node's children +  * ''my @children = $node<nowiki>-></nowiki>get_children()'' - array of the node's children 
-  * ''my @descendants = $node<nowiki>-></nowiki>get_descendants();'' - array of the node's children and their children and children of their children ... +  * ''my @descendants = $node<nowiki>-></nowiki>get_descendants()'' - array of the node's children and their children and children of their children ... 
-  * ''my $parent = $node<nowiki>-></nowiki>get_parent();'' - parent node of the given node, or undef for root +  * ''my $parent = $node<nowiki>-></nowiki>get_parent()'' - parent node of the given node, or undef for root 
-  * ''my $root_node = $node<nowiki>-></nowiki>get_root();'' - the root node of the tree into which the node belongs+  * ''my $root_node = $node<nowiki>-></nowiki>get_root()'' - the root node of the tree into which the node belongs
  
 Attributes of documents, bundles or nodes can be accessed by attribute getters and setters, for example:  Attributes of documents, bundles or nodes can be accessed by attribute getters and setters, for example: 
-  * ''$node<nowiki>-></nowiki>get_attr($attr_name);'' +  * ''$node<nowiki>-></nowiki>get_attr($attr_name)'' 
-  * ''$node<nowiki>-></nowiki>set_attr($attr_name, $attr_value);''+  * ''$node<nowiki>-></nowiki>set_attr($attr_name, $attr_value)''
  
 Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario, e.g. as a new Makefile target: Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario, e.g. as a new Makefile target:
Line 293: Line 307:
 ==== Task ==== ==== Task ====
 A block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with boolean attribute ''is_clause_head'' which is set to ''1'' if the ''a-node'' corresponds to a finite verb, and to ''0'' otherwise. A block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with boolean attribute ''is_clause_head'' which is set to ''1'' if the ''a-node'' corresponds to a finite verb, and to ''0'' otherwise.
 +
  
  
Line 348: Line 363:
  
 ===== Your turn: more tasks ===== ===== Your turn: more tasks =====
 +
 +
 +
 +
  
  
Line 359: Line 378:
 ==== SVO to SOV ==== ==== SVO to SOV ====
  
-**Motivation**: During translation from an SVO based language (English) to an SOV based language (Korean) we might need to change the word order from SVO to SOV. +**Motivation**: During translation from an SVO based language (e.g. English) to an SOV based language (e.g. Korean) we might need to change the word order from SVO to SOV. 
  
 **Task**: Change the word order from SVO to SOV. **Task**: Change the word order from SVO to SOV.
Line 367: Line 386:
   * To find an object to a verb, look for objects among effective children of a verb (''$child<nowiki>-></nowiki>get_attr('afun') eq 'Obj' ''). That implies working on analytical layer.   * To find an object to a verb, look for objects among effective children of a verb (''$child<nowiki>-></nowiki>get_attr('afun') eq 'Obj' ''). That implies working on analytical layer.
   * For debugging, a method returning surface word order of a node is useful: ''$node<nowiki>-></nowiki>get_attr('ord')''. It can be used to print out nodes sorted by attribute ''ord''.   * For debugging, a method returning surface word order of a node is useful: ''$node<nowiki>-></nowiki>get_attr('ord')''. It can be used to print out nodes sorted by attribute ''ord''.
-  * Once you have node ''$object'' and node ''$verb'', use method ''$object<nowiki>-></nowiki>shift_before_node($verb)''. This method takes the whole subtree under node ''$object'' and counts the attributes ''ord'' (surface word order) so that all nodes in subtree under ''$object'' have smaller ''ord'' than ''$verb''. That is, the method rearranges the surface word order from VO to OV.+  * Once you have node ''$object'' and node ''$verb'', use method ''$object<nowiki>-></nowiki>shift_before_node($verb)''. This method takes the whole subtree under node ''$object'' and re-counts the attributes ''ord'' (surface word order) so that all nodes in subtree under ''$object'' have smaller ''ord'' than ''$verb''. That is, the method rearranges the surface word order from VO to OV. 
 + 
 +**Advanced version**: This solution shifts the object (or several objects) of a verb directly in front of that verb node. For example, //Mr. Brown has urged MPs.// changes to //Mr. Brown has MPs urged.// Try to change the solution so that the final sentence is //Mr. Brown MPs has urged.// You may need the method ''$node<nowiki>-></nowiki>shift_after_subtree($root_of_that_subtree)''.
 +Subjects have the attribute ''afun'' set to ''Sb'' (''$child<nowiki>-></nowiki>get_attr('afun') eq 'Sb' '').
 + 
 + 
  
  
Line 418: Line 443:
   * To see the results, you can again use TrEd (''tmttred sample.tmt'')   * To see the results, you can again use TrEd (''tmttred sample.tmt'')
  
 +**Advanced version**: What happens in case of multiword prepositions? For example, ''because of'', ''instead of''. Can you handle it?
  
-//Advanced version//: What happens in case of multiword prepositions? For example, ''because of'', ''instead of''. Can you handle it? 
  
  
Line 425: Line 450:
   * [[http://ufallab2.ms.mff.cuni.cz/~bojar/cruise_control_tmt/last_doc/generated/guide/guidelines.html|TectoMT Developer's Guide]] - obsolete   * [[http://ufallab2.ms.mff.cuni.cz/~bojar/cruise_control_tmt/last_doc/generated/guide/guidelines.html|TectoMT Developer's Guide]] - obsolete
   * Questions? Ask ''kravalova'' at ''ufal.mff.cuni.cz''   * Questions? Ask ''kravalova'' at ''ufal.mff.cuni.cz''
-  * Solutions to this tutorial's tasks are in ''libs/blocks/Tutorial/*solution.pm''.+  * Solutions to this tutorial's tasks are in ''libs/blocks/Tutorial/*solution*.pm''.
   * [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] - tree editor   * [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]] - tree editor
  
