[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
external:tectomt:tutorial [2009/01/22 11:49]
kravalova
external:tectomt:tutorial [2009/03/31 23:07]
ptacek promoting open access to tectomt
Line 1: Line 1:
 ====== TectoMT Tutorial ====== ====== TectoMT Tutorial ======
  
-Welcome at TectoMT Tutorial. This tutorial should take about 3 hours.+Welcome to the TectoMT Tutorial. This tutorial should take about 3 hours.
  
  
Line 21: Line 21:
   * Your shell is bash   * Your shell is bash
   * You have basic experience with bash and can read basic Perl   * You have basic experience with bash and can read basic Perl
 +
 +
 +
  
  
Line 36: Line 39:
 ==== Installation and setup ==== ==== Installation and setup ====
  
-  * Checkout SVN repository. If you are running this installation in computer lab in Prague, you have to checkout the repository into directory ''/home/BIG'' (because bigger disk quota applies here):+  * Checkout SVN repository. If you are running this installation in computer lab in Prague, you have to checkout the repository into directory ''~/BIG'' (because bigger disk quota applies here):
  
 <code bash> <code bash>
     cd ~/BIG     cd ~/BIG
-    svn --username <username> co https://svn.ms.mff.cuni.cz/svn/tectomt_devel/trunk tectomt+    svn co https://svn.ms.mff.cuni.cz/svn/tectomt_devel/trunk tectomt
 </code> </code>
  
Line 50: Line 53:
 </code> </code>
  
-  * In your ''.bashrc'' file, add line (or source this file every time before experimenting with TectoMT):+  * In your ''.bashrc'' file, add line (or source the specified file every time before experimenting with TectoMT):
  
 <code bash> <code bash>
Line 84: Line 87:
  
 ===== TectoMT Architecture ===== ===== TectoMT Architecture =====
 +
  
  
Line 93: Line 97:
  
   * The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Technically, blocks are Perl classes inherited from ''TectoMT::Block'', each saved in a separate file. The blocks repository is in ''libs/blocks/''.   * The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Technically, blocks are Perl classes inherited from ''TectoMT::Block'', each saved in a separate file. The blocks repository is in ''libs/blocks/''.
-  * To solve a more complex task, selected blocks can be chained into a block sequence, called also a scenario. Technically, scenarios are instances of ''TectoMT::Scenario'' class, but in some situations (e.g. on the command line) it is sufficient to specify the scenario simply by listing block names separated with spaces.+  * To solve a more complex task, selected blocks can be chained into a block sequence, called also a scenario. Technically, scenarios are instances of ''TectoMT::Scenario'' class, but in some situations (e.g. on the command line) it is sufficient to specify the scenario simply by listing block names separated by spaces.
   * The highest unit is called application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue the components existing in TectoMT. Some demo applications can be found in ''applications''.   * The highest unit is called application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue the components existing in TectoMT. Some demo applications can be found in ''applications''.
  
 This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''. This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''.
 +
  
  
Line 123: Line 128:
 Blocks in block repository ''libs/blocks'' are located in directories indicating their purpose in machine translation. Blocks in block repository ''libs/blocks'' are located in directories indicating their purpose in machine translation.
  
-//Example//: Block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''.+//Example//: A block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''.
  
 There are also other directories for other purpose blocks, for example blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial/''. There are also other directories for other purpose blocks, for example blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial/''.
 +
  
  
Line 140: Line 146:
 Once you have TectoMT installed on your machine, you can find this tutorial in ''applications/tutorial/''. After you ''cd'' into this directory, you can see our plain text sample data in ''sample.txt'' Once you have TectoMT installed on your machine, you can find this tutorial in ''applications/tutorial/''. After you ''cd'' into this directory, you can see our plain text sample data in ''sample.txt''
  
-Most applications are defined in Makefiles, which describe sequence of blocks to be applied on our data. In our particular ''Makefile'', four blocks are going to be applied on our sample text: sentence segmentation, tokenization, tagging and lemmatization. Since we have our input text in plain text format, the file is going to be converted into ''tmt'' format beforehand (the ''in'' target in the Makefile).+Most applications are defined in ''Makefiles'', which describe sequence of blocks to be applied on our data. In our particular ''Makefile'', four blocks are going to be applied on our sample text: sentence segmentation, tokenization, tagging and lemmatization. Since we have our input text in plain text format, the file is going to be converted into ''tmt'' format beforehand (the ''in'' target in the ''Makefile'').
  
 We can run the application: We can run the application:
Line 153: Line 159:
   * A document consists of a sequence of bundles (''<bundle>''), mirroring a sequence of natural language sentences originating from the text. So, for one sentence we have one ''<bundle>''.   * A document consists of a sequence of bundles (''<bundle>''), mirroring a sequence of natural language sentences originating from the text. So, for one sentence we have one ''<bundle>''.
   * Each bundle contains tree shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have morphological tree (''SEnglishM'') in each bundle. Later on, also an analytical layer (''SEnglishA'') will appear in each bundle as we proceed with our analysis.    * Each bundle contains tree shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have morphological tree (''SEnglishM'') in each bundle. Later on, also an analytical layer (''SEnglishA'') will appear in each bundle as we proceed with our analysis. 
-  * Trees are formed by nodes and edges. Attributes can be attached only to nodes. Edge's attributes must be equivalently stored as the lower node's attributes. Tree's attributes must be stored as attributes of the root node.+  * Trees are formed by nodes and edges. Attributes can be attached only to nodes. Edge's attributes must be stored as the lower node's attributes. Tree's attributes must be stored as attributes of the root node. 
 + 
 + 
 + 
 + 
  
  
Line 210: Line 221:
 </code> </code>
  
-//Note//: Makefiles use tabulators to mark command lines. Make sure your lines start with tabulator (or two tabulators) and not, for example, with 4 spaces.+//Note//: ''Makefiles'' use tabulators to mark command lines. Make sure your lines start with tabulator (or two tabulators) and not, for example, with 4 spaces.
  
 After running After running
Line 229: Line 240:
  
 <code bash> <code bash>
-SEnglishM_to_SEnglishA::McD_parser_local MT_PARAM_MCD_EN_MODEL=conll_mcd_order2_0.1.model \+SEnglishM_to_SEnglishA::McD_parser_local TMT_PARAM_MCD_EN_MODEL=conll_mcd_order2_0.1.model \
 </code> </code>
  
Line 242: Line 253:
 //Note//: For more information about tree editor TrEd, see [[http://ufal.mff.cuni.cz/~pajas/tred/ar01-toc.html|TrEd User's Manual]]. //Note//: For more information about tree editor TrEd, see [[http://ufal.mff.cuni.cz/~pajas/tred/ar01-toc.html|TrEd User's Manual]].
  
 +If you are not familiar with ''Makefile'' syntax, another way of running a scenario in TectoMT is using ''.scen'' file (see ''applications/tutorial.scen''). This file lists the blocks to be run - one block on a single line. 
  
 +<code bash>
 +$TMT_ROOT/tools/format_convertors/plaintext_to_tmt/plaintext_to_tmt.pl English sample.txt
 +brunblocks -S -o --scen tutorial.scen -- sample.tmt
 +</code>
  
 +Finally, yet another way is to use a simple ''bash'' script (see ''applications/tutorial/run_all.sh''):
 +
 +<code bash>
 +./run_all.sh
 +</code>
  
  
Line 302: Line 323:
   * ''$node<nowiki>-></nowiki>set_attr($attr_name, $attr_value)''   * ''$node<nowiki>-></nowiki>set_attr($attr_name, $attr_value)''
  
-Some interesting attributes on morphologic layer are ''form'', ''lemma'' and ''tag''. Some interesting attributes on analytical layer are ''afun'' (analytical function) and ''ord'' (surface word order). To reach ''form'', ''lemma'' or ''tag'' from analytical layer, that is, when calling this attribute on an ''a-node'', you use ''$a_node->get_attr('m/form')'' and the same way for ''lemma'' and ''tag''. The easiest way to see the node attributes is to click on the node in TrEd:+Some interesting attributes on morphologic layer are ''form'', ''lemma'' and ''tag''. Some interesting attributes on analytical layer are ''afun'' (analytical function) and ''ord'' (surface word order). To reach ''form'', ''lemma'' or ''tag'' from analytical layer, that is, when calling this attribute on an ''a-node'', you use ''$a_node<nowiki>-></nowiki>get_attr('m/form')'' and the same way for ''lemma'' and ''tag''. The easiest way to see the node attributes is to click on the node in TrEd:
  
 <code bash> <code bash>
Line 308: Line 329:
 </code> </code>
  
-Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario, e.g. as a new Makefile target:+Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario, e.g. as a new ''Makefile'' target:
  
 <code bash> <code bash>
Line 329: Line 350:
  
 ===== Advanced block: finite clauses ===== ===== Advanced block: finite clauses =====
 +
  
  
Line 337: Line 359:
 ==== Motivation ==== ==== Motivation ====
  
-It is assumed that finite clauses can be translated independently, which would reduce computational complexity or make parallel translation possible. We could even use hybrid translation - each finite clause could be translated by the most self-confident translation system. In this task, we are going to split the sentence into finite clauses.+It is assumed that finite clauses can be translated independently, which would reduce combinatorial complexity or make parallel translation possible. We could even use hybrid translation - each finite clause could be translated by the most self-confident translation system. In this task, we are going to split the sentence into finite clauses.
  
  
Line 345: Line 367:
 ==== Task ==== ==== Task ====
 A block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with boolean attribute ''is_clause_head'' which is set to ''1'' if the ''a-node'' corresponds to a finite verb, and to ''0'' otherwise. A block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with boolean attribute ''is_clause_head'' which is set to ''1'' if the ''a-node'' corresponds to a finite verb, and to ''0'' otherwise.
 +
 +
 +
  
  
Line 396: Line 421:
   * ''my @eff_children = $node<nowiki>-></nowiki>get_eff_children()''   * ''my @eff_children = $node<nowiki>-></nowiki>get_eff_children()''
  
-//Note//: ''get_children()'' returns topological node children in a tree, while ''get_eff_children()'' returns node children in a linguistic sense. Mostly, these do not differ. If interested, see Figure 1 in [[http://ufal.mff.cuni.cz/pdt2.0/doc/tools/tred/bn-tutorial.html|btred tutorial]].+//Note//: ''get_children()'' returns topological node children in a tree, while ''get_eff_children()'' returns node children in a linguistic sense. Mostly, these do not differ. If interested, see Figure 1 in [[http://ufal.mff.cuni.cz/pdt2.0/doc/tools/tred/bn-tutorial.html#i-effective|btred tutorial]].
  
 //Hint//: Finite clauses in English usually require grammatical subject to be present. //Hint//: Finite clauses in English usually require grammatical subject to be present.
Line 407: Line 432:
  
 ===== Your turn: more tasks ===== ===== Your turn: more tasks =====
 +
 +
  
  
Line 424: Line 451:
 ==== SVO to SOV ==== ==== SVO to SOV ====
  
-**Motivation**: During translation from an SVO based language (e.g. English) to an SOV based language (e.g. Korean) we might need to change the word order from SVO to SOV. +**Motivation**: During translation from an SVO based language (e.g. English) to an SOV based language (e.g. Korean)we might need to change the word order from SVO to SOV. 
  
 **Task**: Change the word order from SVO to SOV. **Task**: Change the word order from SVO to SOV.
Line 431: Line 458:
  
   * You can use block template in ''libs/blocks/BlockTemplate.pm''   * You can use block template in ''libs/blocks/BlockTemplate.pm''
-  * To find an object to a verb, look for objects among effective children of a verb (''$child<nowiki>-></nowiki>get_attr('afun') eq 'Obj' ''). That implies working on analytical layer.+  * To find an object of a verb, look for objects among effective children of a verb (''$child<nowiki>-></nowiki>get_attr('afun') eq 'Obj' ''). That implies working on analytical layer.
   * For debugging, a method returning surface word order of a node is useful: ''$node<nowiki>-></nowiki>get_attr('ord')''. It can be used to print out nodes sorted by attribute ''ord''.   * For debugging, a method returning surface word order of a node is useful: ''$node<nowiki>-></nowiki>get_attr('ord')''. It can be used to print out nodes sorted by attribute ''ord''.
-  * Once you have node ''$object'' and node ''$verb'', use method ''$object<nowiki>-></nowiki>shift_before_node($verb)''. This method takes the whole subtree under node ''$object'' and re-counts the attributes ''ord'' (surface word order) so that all nodes in subtree under ''$object'' have smaller ''ord'' than ''$verb''. That is, the method rearranges the surface word order from VO to OV.+  * Once you have the node ''$object'' and the node ''$verb'', use the method ''$object<nowiki>-></nowiki>shift_before_node($verb)''. This method takes the whole subtree under the node ''$object'' and recalculates the attributes ''ord'' (surface word order) so that all the nodes in the subtree under ''$object'' have smaller ''ord'' than ''$verb''. That is, the method rearranges the surface word order from VO to OV
 + 
 +**Advanced version**: This solution shifts object (or more objects) of a verb just in front of that verb node. So f.e.: //Mr. Brown has urged MPs.// changes to: //Mr. Brown has MPs urged.// You can try to change this solution, so the final sentence would be: //Mr. Brown MPs has urged.// You may need a method ''$node<nowiki>-></nowiki>shift_after_subtree($root_of_that_subtree)''. Subjects should have attribute '''afun' eq 'Sb'''.
  
-**Advanced version**: This solution shifts object (or more objects) of a verb just in front of that verb node. So f.e.: //Mr. Brown has urged MPs.// changes to: //Mr. Brown has MPs urged.// You can try to change this solution, so the final sentence would be: //Mr. Brown MPs has urged.// You may need a method ''$node->shift_after_subtree($root_of_that_subtree)''. Subjects should have attribute '''afun' eq 'Sb'''. 
  
  
Line 474: Line 502:
 {{ external:tectomt:preps.png?200x80|Prepositions example}} {{ external:tectomt:preps.png?200x80|Prepositions example}}
  
-**Motivation**: In dependency approach question "where to hang prepositions" arises. In praguian style (PDT), prepositions are heads of the subtree and the noun/pronoun is dependent on the preposition. However, another ordering might be preferable: The noun/pronoun might be the head of subtree, while the preposition would take the role of a modifier.+**Motivation**: In dependency approach the question "where to hang prepositions" arises. In the praguian style (PDT), prepositions are heads of the subtree and the noun/pronoun is dependent on the preposition. However, another ordering might be preferable: The noun/pronoun might be the head of subtree, while the preposition would take the role of a modifier.
  
 **Task**: The task is to rehang all prepositions as indicated at the picture. You may assume that prepositions have at most 1 child. **Task**: The task is to rehang all prepositions as indicated at the picture. You may assume that prepositions have at most 1 child.

[ Back to the navigation ] [ Back to the content ]