
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

external:tectomt:tutorial [2009/01/20 11:30]
kravalova
external:tectomt:tutorial [2009/01/20 16:59]
popel
Line 8: Line 8:
  
 TectoMT is a highly modular NLP (Natural Language Processing) software system implemented in the Perl programming language under Linux. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project. At the same time, it is also hoped to significantly facilitate and accelerate the development of software solutions for many other NLP tasks, especially thanks to the re-usability of its numerous integrated processing modules (called blocks), which are equipped with uniform object-oriented interfaces.
 +
  
 ===== Prerequisites =====
 +
 +In this tutorial, we assume that:
 +
 +  * Your system is Linux
 +  * Your shell is bash
 +  * You have basic experience with bash and you can read Perl
 +
 +
 +
 +
 +
 +
  
  
Line 15: Line 28:
 ==== Installation and setup ====
  
-TODO describe the installation+  * Check out the SVN repository. If you are running this installation in a computer lab in Prague, you have to check out the repository into the directory ''/home/BIG'' (because data quotas don't apply there):
  
-Before running any experiments with TectoMT, you must set up your environment by running+<code bash> 
 +    cd ~/BIG 
 +    svn --username <username> co https://svn.ms.mff.cuni.cz/svn/tectomt_devel/trunk tectomt 
 +</code> 
 + 
 +  * In ''tectomt/install/'', run ''./install.sh'':
  
 <code bash>
-source config/init_devel_environ.sh
 +    cd tectomt/install
 +    ./install.sh
 </code>
 +
 +  * In your ''.bashrc'' file, add the following line (or source this file manually every time before experimenting with TectoMT):
 +
 +<code bash>
 +    source ~/BIG/tectomt/config/init_devel_environ.sh
 +</code>
 +
 +
 +
 +
  
  
  
  
-==== Theoretical background ==== 
  
-TODO picture
  
  
Line 35: Line 62:
  
  
-==== TrEd ==== 
  
-TODO a little about TrEd and a picture
  
  
Line 43: Line 68:
  
 ===== TectoMT Architecture =====
 +
  
  
Line 50: Line 76:
 In TectoMT, there is the following hierarchy of processing units (software components that process data):
  
-  * The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Blocks are not parametrizable. Technically, blocks are Perl classes inherited from ''TectoMT::Block''. The blocks repository is in ''libs/blocks/''.+  * The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Technically, blocks are Perl classes inherited from ''TectoMT::Block'', each saved in a separate file. The blocks repository is in ''libs/blocks/''.
   * To solve a more complex task, selected blocks can be chained into a block sequence, also called a scenario. Technically, scenarios are instances of the ''TectoMT::Scenario'' class, but in some situations (e.g. on the command line) it is sufficient to specify the scenario simply by listing block names separated by spaces.
-  * The highest unit is called application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue the components existing in TectoMT. Some demo applications can be found in ''applications''+  * The highest unit is called application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue the components existing in TectoMT. Some demo applications can be found in ''applications''.
  
 This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''.
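
The relationship between blocks and scenarios can be illustrated with a small self-contained Perl sketch. It does not use the real ''TectoMT::Block'' or ''TectoMT::Scenario'' classes: the packages ''Demo::Tokenize'' and ''Demo::Lowercase'', the ''process'' method and the ''run_scenario'' helper are all hypothetical stand-ins, chosen only to show how a scenario written as a space-separated list of block names might be applied step by step.

<code perl>
use strict;
use warnings;

# Hypothetical stand-in blocks (not TectoMT classes): each does one small job.
package Demo::Tokenize;
sub new     { my ($class) = @_; return bless {}, $class; }
sub process { my ($self, $doc) = @_; $doc->{tokens} = [ split ' ', $doc->{text} ]; return; }

package Demo::Lowercase;
sub new     { my ($class) = @_; return bless {}, $class; }
sub process { my ($self, $doc) = @_; $doc->{tokens} = [ map { lc $_ } @{ $doc->{tokens} } ]; return; }

package main;

# A scenario is given as block names separated by spaces;
# the blocks are applied to the document in the listed order.
sub run_scenario {
    my ($scenario, $doc) = @_;
    for my $block_name (split ' ', $scenario) {
        $block_name->new->process($doc);
    }
    return $doc;
}

my $doc = run_scenario('Demo::Tokenize Demo::Lowercase', { text => 'TectoMT Is Modular' });
print join('|', @{ $doc->{tokens} }), "\n";
</code>

Because each stand-in block only reads and writes the shared document, the blocks can be reordered or reused in other scenarios, which is the property the hierarchy above relies on.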
 +
 +
 +
  
  
Line 60: Line 89:
  
 ==== Layers of Linguistic Structures ====
 +
 +{{ external:tectomt:pyramid.gif?300x190|MT pyramid in terms of PDT layers}}
  
 The TectoMT blocks repository is stored in ''libs/blocks/''. In correspondence with ..., the blocks are located in directories describing their purpose.
  
-Thus, the set of TectoMT layers is the Cartesian product {S,T} x {English,Czech} x {W,M,P,A,T}, in which:+Thus, the set of TectoMT layers is the Cartesian product {S,T} x {English,Czech,...} x {W,M,P,A,T}, in which:
  
   * {S,T} distinguishes whether the data was created by analysis or transfer/synthesis (mnemonics: S and T correspond to (S)ource and (T)arget in MT perspective).
Line 69: Line 100:
   * {W,M,P,A,T...} represents the layer of description in terms of PDT 2.0 (W - word layer, M - morphological layer, A - analytical layer, T - tectogrammatical layer) or extensions (P - phrase-structure layer).
  
-//Example//: Block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_toSCzechM/Simple_tagger.pm''.+//Example//: Block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''.
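
The naming scheme can be made concrete with a short Perl sketch. It enumerates the Cartesian product for the two languages named in this tutorial and builds a block directory name of the form seen in the example path. The helpers ''layer_names'' and ''block_dir'' are illustrative names introduced here, not part of TectoMT.

<code perl>
use strict;
use warnings;

# Illustrative only: these lists mirror the sets named in the text.
my @directions = qw(S T);          # (S)ource, (T)arget
my @languages  = qw(English Czech);
my @levels     = qw(W M P A T);    # word, morphological, phrase, analytical, tectogrammatical

# Build every layer name, e.g. "SEnglishA" or "TCzechT".
sub layer_names {
    my @names;
    for my $dir (@directions) {
        for my $lang (@languages) {
            for my $level (@levels) {
                push @names, $dir . $lang . $level;
            }
        }
    }
    return @names;
}

# Directory name for blocks that convert one layer into another,
# following the pattern of the example path in the text.
sub block_dir {
    my ($from, $to) = @_;
    return "libs/blocks/${from}_to_${to}";
}

my @layers = layer_names();
print scalar(@layers), " layers, first: $layers[0]\n";
print block_dir('SCzechW', 'SCzechM'), "\n";
</code>

With two directions, two languages and five levels, the sketch yields 20 layer names; adding a language to ''@languages'' extends the product accordingly.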
  
-There are also directories for blocks serving other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/Tutorial''.+There are also directories for blocks serving other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial''.
  
  
Line 245: Line 276:
  
 It is assumed that finite clauses can be translated independently, which would reduce computational complexity or make parallel translation possible. We could even use hybrid translation - each finite clause could be translated by the most self-confident translation system. In this task, we are going to split the sentence into finite clauses.
 +
  
  
  
 ==== Task ==== ==== Task ====
-A block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with boolean attribute ''is_head'' which is set to ''1'' if the ''a-node'' corresponds to a finite verb.+A block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with boolean attribute ''is_head'' which is set to ''1'' if the ''a-node'' corresponds to a finite verb, and to ''0'' otherwise. 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
  
  
Line 278: Line 319:
 You are going to need these methods:
  
-  * ''my $root = $bundle->get_tree('tree_name')'' - returns root node of the tree ''tree_name'' +  * ''my $root = $bundle->get_tree('tree_name')'' 
-  * ''my $attr = $node->get_attr('attr_name')'' - returns value of an attribute ''attr_name''+  * ''my $attr = $node->get_attr('attr_name')''
   * ''$node->set_attr('attr_name',$attr_value)''
-  * ''my ($eff_children= $node->get_eff_children()'' - returns an array of effective children or 1 effective child+  * ''my @eff_children = $node->get_eff_children()''
  
 +//Note//: ''get_children'' returns topological node children in a tree, while ''get_eff_children'' returns node children in a linguistic sense. Mostly, these do not differ.
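
A minimal sketch of the core logic follows. It runs without TectoMT: ''MockNode'' is a hypothetical stand-in that offers only ''get_attr'' and ''set_attr'' as listed above, plus the helpers ''add_child'' and ''get_descendants'' invented for this example. Two further assumptions are made: that the tag of an a-node is stored in an attribute called ''m/tag'', and that finite verbs carry one of the Penn Treebank tags ''VBD'', ''VBP'', ''VBZ'' or ''MD''. Check both against your data before relying on them.

<code perl>
use strict;
use warnings;

# Hypothetical stand-in for a TectoMT a-node, so the sketch runs on its own.
package MockNode;
sub new {
    my ($class, %attrs) = @_;
    return bless { attrs => {%attrs}, children => [] }, $class;
}
sub get_attr  { my ($self, $name) = @_; return $self->{attrs}{$name}; }
sub set_attr  { my ($self, $name, $value) = @_; $self->{attrs}{$name} = $value; return; }
sub add_child { my ($self, $child) = @_; push @{ $self->{children} }, $child; return $child; }
sub get_descendants {
    my ($self) = @_;
    my @found;
    for my $child (@{ $self->{children} }) {
        push @found, $child, $child->get_descendants;
    }
    return @found;
}

package main;

# Assumption: these Penn Treebank tags mark finite verb forms.
my %FINITE_TAG = map { $_ => 1 } qw(VBD VBP VBZ MD);

# Set is_head to 1 on finite verbs and to 0 on every other node.
sub mark_clause_heads {
    my ($root) = @_;
    for my $node ($root->get_descendants) {
        my $tag = $node->get_attr('m/tag');
        my $is_finite = defined $tag && $FINITE_TAG{$tag};
        $node->set_attr('is_head', $is_finite ? 1 : 0);
    }
    return;
}

# "John loves Mary": the finite verb governs both nouns.
my $root  = MockNode->new();
my $loves = $root->add_child(MockNode->new('m/form' => 'loves', 'm/tag' => 'VBZ'));
my $john  = $loves->add_child(MockNode->new('m/form' => 'John', 'm/tag' => 'NNP'));
my $mary  = $loves->add_child(MockNode->new('m/form' => 'Mary', 'm/tag' => 'NNP'));

mark_clause_heads($root);
for my $node ($loves, $john, $mary) {
    printf "%s is_head=%d\n", $node->get_attr('m/form'), $node->get_attr('is_head');
}
</code>

In a real block, the root would come from ''$bundle->get_tree('SEnglishA')'' rather than from hand-built nodes.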
  
  
  
-==== Is it perfect? ==== +//Advanced version//: The output of our block might still be incorrect in special cases - we don't solve coordination and subordinate conjunctions.
- +
-The output of our block is not ideal - we don't solve coordination and subordinate conjunctions. But for the time being, we can move on to some other tasks.+
  
  
Line 295: Line 335:
  
  
-==== SVO typology ==== 
  
-TODO+ 
 + 
 + 
 + 
 + 
 +==== SVO to SOV ==== 
 + 
 +**Motivation**: During translation from an SVO based language (English) to an SOV based language (Korean) we might need to change the word order from SVO to SOV.  
 + 
 +**Task**: Change the word order from SVO to SOV. 
 + 
 +**Instructions**:  
 + 
 +  * To find the object of a verb, look among the verb's effective children for nodes whose ''afun'' is ''Obj'' (''$child->get_attr('afun') eq 'Obj' ''). This implies working on the analytical layer. 
 +  * Once you have node ''$object'' and node ''$verb'', use method TODO  
 +  * For debugging, it is useful to read a node's word-order position: ''$node->get_attr('ord')''. It can be used to print out nodes sorted by the attribute ''ord''.
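
The first and the last instruction can be sketched as follows. The example is self-contained: ''MockNode'' is a hypothetical stand-in for an a-node that provides only ''get_attr'' and ''get_eff_children'' (plus an invented ''add_child'' helper), and the attribute name ''m/form'' for the word form is an assumption. The reordering step itself is deliberately left out.

<code perl>
use strict;
use warnings;

# Hypothetical stand-in for a TectoMT a-node, so the sketch runs on its own.
package MockNode;
sub new {
    my ($class, %attrs) = @_;
    return bless { attrs => {%attrs}, children => [] }, $class;
}
sub get_attr         { my ($self, $name) = @_; return $self->{attrs}{$name}; }
sub add_child        { my ($self, $child) = @_; push @{ $self->{children} }, $child; return $child; }
sub get_eff_children { my ($self) = @_; return @{ $self->{children} }; }

package main;

# Return the effective children of $verb whose afun is 'Obj'.
sub find_objects {
    my ($verb) = @_;
    my @objects;
    for my $child ($verb->get_eff_children) {
        my $afun = $child->get_attr('afun');
        push @objects, $child if defined $afun && $afun eq 'Obj';
    }
    return @objects;
}

# Debugging helper: word forms of the given nodes, sorted by 'ord'.
sub forms_by_ord {
    my @nodes = @_;
    return map { $_->get_attr('m/form') }
           sort { $a->get_attr('ord') <=> $b->get_attr('ord') } @nodes;
}

# "John loves Mary" in SVO order.
my $loves = MockNode->new('m/form' => 'loves', afun => 'Pred', ord => 2);
my $john  = $loves->add_child(MockNode->new('m/form' => 'John', afun => 'Sb',  ord => 1));
my $mary  = $loves->add_child(MockNode->new('m/form' => 'Mary', afun => 'Obj', ord => 3));

my @objects = find_objects($loves);
print "object: ", $objects[0]->get_attr('m/form'), "\n";
print join(' ', forms_by_ord($mary, $loves, $john)), "\n";
</code>

Printing the sorted forms before and after your block runs is a quick way to confirm that the object really ended up in front of the verb.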
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
 + 
  
  
Line 306: Line 374:
 ==== Prepositions ====
  
-In the dependency approach, the question of "where to hang prepositions" arises. In the Prague style (PDT), prepositions are heads of the subtree and the noun/pronoun/... is dependent on the preposition. However, another ordering might be preferable: the noun/pronoun/... might be the head of the subtree, while the preposition would have the role of modifier.+**Motivation**: In the dependency approach, the question of "where to hang prepositions" arises. In the Prague style (PDT), prepositions are heads of the subtree and the noun/pronoun is dependent on the preposition. However, another ordering might be preferable: the noun/pronoun might be the head of the subtree, while the preposition would take the role of modifier.
  
 TODO picture
  
-The task is to rehang all prepositions as indicated in the picture. You may assume that prepositions have at most 1 child.+**Task**: The task is to rehang all prepositions as indicated in the picture. You may assume that prepositions have at most 1 child.
  
-You are going to need these methods: +**Instructions**: 
-  * ''+ 
 +You are going to need these new methods: 
 +  * ''my @children = $node->get_children'' 
 +  * ''my $parent = $node->get_parent'' 
 +  * ''$node->set_parent($parent)'' 
 + 
 +//Hint//:  
 +  * On the analytical layer, you can use this test to recognize prepositions: ''$node->get_attr('afun') eq 'AuxP' ''  
 +  * You can use the block template in ''libs/blocks/BlockTemplate.pm''. To see the results, you can again use TrEd (''tmttred sample.tmt'')
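
The rehanging step for the basic case (a preposition with at most one child) can be sketched as below. The code runs without TectoMT: ''MockNode'' is a hypothetical stand-in that mimics only ''get_attr'', ''get_children'', ''get_parent'' and ''set_parent'', and the attribute name ''m/form'' is an assumption. The behaviour of the real ''set_parent'' may differ in details, so treat this as an illustration of the order of operations rather than a finished block.

<code perl>
use strict;
use warnings;

# Hypothetical stand-in for a TectoMT a-node, so the sketch runs on its own.
package MockNode;
sub new {
    my ($class, %attrs) = @_;
    return bless { attrs => {%attrs}, parent => undef, children => [] }, $class;
}
sub get_attr     { my ($self, $name) = @_; return $self->{attrs}{$name}; }
sub get_parent   { my ($self) = @_; return $self->{parent}; }
sub get_children { my ($self) = @_; return @{ $self->{children} }; }
sub set_parent {
    my ($self, $new_parent) = @_;
    if (my $old = $self->{parent}) {
        # Detach from the old parent first.
        @{ $old->{children} } = grep { $_ != $self } @{ $old->{children} };
    }
    $self->{parent} = $new_parent;
    push @{ $new_parent->{children} }, $self;
    return;
}

package main;

# Make the preposition's only child the head and hang the preposition under it.
sub rehang_preposition {
    my ($prep) = @_;
    my ($noun) = $prep->get_children;        # at most one child is assumed
    return if !defined $noun;
    $noun->set_parent($prep->get_parent);    # the noun takes the preposition's place
    $prep->set_parent($noun);                # the preposition becomes its modifier
    return;
}

# "sits on chair": sits -> on (AuxP) -> chair
my $root  = MockNode->new();
my $sits  = MockNode->new('m/form' => 'sits',  afun => 'Pred');
my $on    = MockNode->new('m/form' => 'on',    afun => 'AuxP');
my $chair = MockNode->new('m/form' => 'chair', afun => 'Adv');
$sits->set_parent($root);
$on->set_parent($sits);
$chair->set_parent($on);

# Collect the nodes first, then modify the tree.
for my $node ($sits, $on, $chair) {
    my $afun = $node->get_attr('afun');
    rehang_preposition($node) if defined $afun && $afun eq 'AuxP';
}

printf "%s hangs under %s\n", $chair->get_attr('m/form'), $chair->get_parent->get_attr('m/form');
printf "%s hangs under %s\n", $on->get_attr('m/form'),    $on->get_parent->get_attr('m/form');
</code>

The noun is moved before the preposition so that the tree never contains a cycle. Collecting the list of nodes before changing any parent avoids modifying the structure while traversing it.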
  
-You can use block template in ''devel/libs/blocks/BlockTemplate.pm''. To see the results, you can again use TrEd (''tmttred sample.tmt'') 
  
 +//Advanced version//: What happens in case of multiword prepositions? For example, ''because of'', ''instead of''. Can you handle it?
  
  
