====== TectoMT Tutorial ======
  
Welcome to the TectoMT Tutorial. This tutorial should take a few hours.
  
  
===== What is TectoMT =====
  
TectoMT is a highly modular NLP (Natural Language Processing) software system implemented in the Perl programming language under Linux. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project. At the same time, it is also expected to facilitate and significantly accelerate the development of software solutions for many other NLP tasks, especially thanks to the re-usability of the numerous integrated processing modules (called blocks), which are equipped with uniform object-oriented interfaces.
  
===== Prerequisites =====
  
In this tutorial, we assume that:
  
  * Your system is Linux
  * Your shell is bash
  * You have basic experience with bash and can read basic Perl


==== Installation and setup ====

  * Check out the SVN repository. If you are running this installation in the computer lab in Prague, you have to check out the repository into the directory ''~/BIG'' (because a bigger disk quota applies there):
  
<code bash>
    cd ~/BIG
    svn --username public co https://svn.ms.mff.cuni.cz/svn/tectomt_devel/trunk tectomt
</code>
  
  * Accept the certificate and provide the password, which is the same as the username, i.e. //public//.

  * In ''tectomt/install/'', run ''./install.sh'':

<code bash>
    cd tectomt/install
    ./install.sh
</code>
  
  * In your ''.bashrc'' file, add the following line (or source ''init_devel_environ.sh'' every time before experimenting with TectoMT):

<code bash>
    source ~/BIG/tectomt/config/init_devel_environ.sh
</code>

  * In your ''.bash_profile'' file, add the line:

<code bash>
    source .bashrc
</code>
  
  
===== TectoMT Architecture =====

==== Blocks, scenarios and applications ====

In TectoMT, there is the following hierarchy of processing units (software components that process data):

  * The basic units are **blocks**. Each block serves a very limited, well-defined, and often linguistically interpretable task (e.g., tokenization, tagging, parsing). Technically, blocks are Perl classes inherited from ''TectoMT::Block'', each saved in a separate file. The block repository is in ''libs/blocks/''.
  * To solve a more complex task, selected blocks can be chained into a block sequence, called a **scenario**. Scenarios are stored in ''*.scen'' files (alternatively, the block names separated by spaces can simply be listed on the command line), and at runtime scenarios are represented by instances of the ''TectoMT::Scenario'' class.
  * The highest unit is called an **application**. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation) or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue together the components existing in TectoMT. Some demo applications can be found in ''applications/''.
  
This tutorial has its blocks in ''libs/blocks/Tutorial/'' and its application in ''applications/tutorial/''.


==== Layers of Linguistic Structures ====

{{ external:tectomt:pyramid.gif?300x190|MT pyramid in terms of PDT layers}}

The notion of 'layer' has a combinatorial nature in TectoMT. It corresponds not only to the layer of language description as used e.g. in the Prague Dependency Treebank, but it is also specific to a given language (e.g., possible values of morphological tags typically differ between languages) and even to how the data on the given layer were created (whether by analysis from the lower layer or by synthesis/transfer).
  
Thus, the set of TectoMT layers is the Cartesian product {S,T} x {English,Czech,...} x {W,M,P,A,T}, in which:

  * {S,T} distinguishes whether the data was created by analysis or by transfer/synthesis (mnemonics: S and T correspond to (S)ource and (T)arget in the MT perspective).
  * {English,Czech,...} represents the language in question.
  * {W,M,P,A,T,...} represents the layer of description in terms of PDT 2.0 (W - word layer, M - morphological layer, A - analytical layer, T - tectogrammatical layer) or its extensions (P - phrase-structure layer).
  
Blocks in the block repository ''libs/blocks/'' are organized into directories indicating their purpose in machine translation.

//Example//: A block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''.

There are also directories for other kinds of blocks; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial/''.
  
  
===== First application =====
  
Once you have TectoMT installed on your machine, you can find this tutorial in ''applications/tutorial/''. After you ''cd'' into this directory, you can see our plain text sample data in ''sample.txt''.
  
Most applications are defined in ''Makefiles'' and ''*.scen'' files, which describe the sequence of blocks to be applied to our data. In our case, ''tutorial.scen'' lists four blocks to be applied to our sample text: sentence segmentation, tokenization, part-of-speech tagging and lemmatization. Since our input is in plain text format, the file is first converted into the ''tmt'' format (the ''in'' target in the ''Makefile'').
  
We can run the application using the prepared ''Makefile'' target (or simply by running the ''run_all.sh'' script described below).
  
Our plain text data ''sample.txt'' have been transformed into ''tmt'', an internal TectoMT format, and saved into ''sample.tmt''. Then, all four blocks have been loaded and our data has been processed. We can now examine ''sample.tmt'' with a text editor (vi, emacs, etc.). Let us briefly describe the data structure:

  * One physical ''tmt'' file corresponds to one document.
  * A document consists of a sequence of bundles (element ''<bundles>''), mirroring the sequence of natural language sentences in the text. So, for each sentence we have one bundle.
  * Each bundle contains tree-shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have a morphological tree (''SEnglishM'') in each bundle (actually, it is a flat tree: one technical root whose children are the tokens). Later on, an analytical layer (''SEnglishA'') will also appear in each bundle as we proceed with our analysis.
  * Trees are formed by nodes and edges. Attributes can be attached only to nodes. An edge's attributes must be stored as the lower node's attributes. A tree's attributes must be stored as attributes of the root node.
  
  
===== Changing the scenario =====
  
We'll now add syntax analysis (dependency parsing) to our scenario by adding six more blocks to ''tutorial.scen''. Instead of
  
<code>
SEnglishW_to_SEnglishM::Sentence_segmentation_simple
SEnglishW_to_SEnglishM::Tokenization
SEnglishW_to_SEnglishM::TagMxPost
SEnglishW_to_SEnglishM::Lemmatize_mtree
</code>
  
we will use

<code>
SEnglishW_to_SEnglishM::Sentence_segmentation_simple
SEnglishW_to_SEnglishM::Tokenization
SEnglishW_to_SEnglishM::TagMxPost
SEnglishW_to_SEnglishM::Lemmatize_mtree
SEnglishM_to_SEnglishA::Clone_MTree
SEnglishM_to_SEnglishA::McD_parser
SEnglishM_to_SEnglishA::Fill_is_member_from_deprel
SEnglishM_to_SEnglishA::Fix_McD_topology
SEnglishM_to_SEnglishA::Fill_afun_AuxCP_Coord
SEnglishM_to_SEnglishA::Fill_afun
</code>
  
After running the application again, we can examine ''sample.tmt'' once more: indeed, an analytical layer (''SEnglishA''), i.e. a dependency tree with analytical functions (''<afun>''), has been added to each bundle.
  
Blocks can also be parametrized. For the syntax parser, we might want to use a smaller but faster model. To achieve this, replace the line
  
<code bash>
SEnglishM_to_SEnglishA::McD_parser
</code>

with

<code bash>
SEnglishM_to_SEnglishA::McD_parser TMT_PARAM_MCD_EN_MODEL=conll_mcd_order2_0.1.model
</code>
  
You can view the trees in ''sample.tmt'' with TrEd by typing

<code bash>
tmttred sample.tmt
</code>

Try clicking on some nodes to see their attributes (tag, lemma, form, analytical function, etc.).

//Note//: For more information about the tree editor TrEd, see the [[http://ufal.mff.cuni.cz/~pajas/tred/ar01-toc.html|TrEd User's Manual]].
  
If you are not familiar with ''Makefile'' syntax, you can run the scenario with a simple ''bash'' script (see ''applications/tutorial/run_all.sh''):

<code bash>
./run_all.sh
</code>
  
  
===== Writing a new block =====

The units of the TectoMT document structure are represented by Perl classes:

  * document - ''TectoMT::Document''
  * bundle - ''TectoMT::Bundle''
  * node - ''TectoMT::Node''
  
You can have TectoMT automatically execute your block code on each document or bundle by defining the block's main entry point:
  
  * ''sub process_document'' - run this procedure on each document
  * ''sub process_bundle'' - run this procedure on each bundle (sentence)

Each block must have exactly one entry point.
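Putting the pieces together, a new block is then just a Perl class derived from ''TectoMT::Block'' with one entry point. The following is only a minimal sketch under the conventions described above (the package name ''Tutorial::Print_hello'' and the use of ''use base'' for the inheritance are illustrative assumptions, not code taken from the repository):

<code perl>
package Tutorial::Print_hello;   # hypothetical block, saved as Print_hello.pm
use strict;
use warnings;
use base qw(TectoMT::Block);     # blocks inherit from TectoMT::Block

# The single entry point: called once for every bundle (sentence).
sub process_bundle {
    my ( $self, $bundle ) = @_;
    print "Processing one sentence...\n";
    return;
}

1;
</code>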

We'll now examine an example of a new block in the file ''libs/blocks/Tutorial/Print_node_info.pm''.

This block illustrates some of the most common methods for accessing objects (a combined sketch follows the list):

  * ''my @bundles = $document<nowiki>-></nowiki>get_bundles()'' - an array of bundles contained in the document
  * ''my $root_node = $bundle<nowiki>-></nowiki>get_tree($layer_name)'' - the root node of the tree of the given type in the given bundle
  * ''my @children = $node<nowiki>-></nowiki>get_children()'' - an array of the node's children
  * ''my @descendants = $node<nowiki>-></nowiki>get_descendants()'' - an array of the node's children, their children, and so on recursively
  * ''my $parent = $node<nowiki>-></nowiki>get_parent()'' - the parent node of the given node, or ''undef'' for the root
  * ''my $root_node = $node<nowiki>-></nowiki>get_root()'' - the root node of the tree to which the node belongs
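As a small illustration of how these methods combine, the following sketch (an assumption-based example, not the actual contents of ''Print_node_info.pm'') counts the tokens of each sentence by walking from the document through its bundles to the ''SEnglishM'' tree:

<code perl>
# Illustrative entry point: count the tokens in every sentence of the document.
sub process_document {
    my ( $self, $document ) = @_;
    foreach my $bundle ( $document->get_bundles() ) {
        # the flat morphological tree built by the analysis above
        my $m_root = $bundle->get_tree('SEnglishM');
        my @tokens = $m_root->get_descendants();
        print 'Sentence has ' . scalar(@tokens) . " tokens\n";
    }
    return;
}
</code>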
  
Attributes of documents, bundles or nodes can be accessed by attribute getters and setters, for example:
  
  * ''$node<nowiki>-></nowiki>get_attr($attr_name)''
  * ''$node<nowiki>-></nowiki>set_attr($attr_name, $attr_value)''

Some interesting attributes on the morphological layer are ''form'', ''lemma'' and ''tag''. Some interesting attributes on the analytical layer are ''afun'' (analytical function) and ''ord'' (surface word order). To reach ''form'', ''lemma'' or ''tag'' from the analytical layer, that is, when asking for such an attribute on an ''a-node'', use ''$a_node<nowiki>-></nowiki>get_attr('m/form')'', and likewise for ''lemma'' and ''tag''. The easiest way to see the node attributes is to click on the node in TrEd:
  
<code bash>
tmttred sample.tmt
</code>
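The same attributes can also be read programmatically. Here is a minimal sketch using only the getters listed above (the loop body and variable names are illustrative, and it assumes the extended scenario has already built the ''SEnglishA'' trees):

<code perl>
# Print form, lemma, tag and afun for every a-node of a bundle.
sub process_bundle {
    my ( $self, $bundle ) = @_;
    my $a_root = $bundle->get_tree('SEnglishA');
    foreach my $a_node ( $a_root->get_descendants() ) {
        printf "%s\t%s\t%s\t%s\n",
            $a_node->get_attr('m/form'),
            $a_node->get_attr('m/lemma'),
            $a_node->get_attr('m/tag'),
            $a_node->get_attr('afun');
    }
    return;
}
</code>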
  
Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario, e.g. as a new ''Makefile'' target:
  
<code bash>
print_info:
        brunblocks -o Tutorial::Print_node_info -- sample.tmt
</code>
  
We can observe our new block's behaviour:
  
<code bash>
make print_info
</code>
  
Try to change the block so that it prints out the information only for verbs. (You need to look at the attribute ''tag'' on the ''m'' layer.) The tagset used is the Penn Treebank tagset.
  
  
===== Advanced block: finite clauses =====
  
==== Motivation ====
  
It is assumed that finite clauses can be translated independently, which would reduce combinatorial complexity or make parallel translation possible. We could even use hybrid translation: each finite clause could be translated by the translation system that is most confident about it. In this task, we are going to split the sentence into finite clauses.
  
==== Task ====

Write a block which, given an analytical tree (''SEnglishA''), fills each ''a-node'' with a boolean attribute ''is_clause_head'', set to ''1'' if the ''a-node'' corresponds to a finite verb and to ''0'' otherwise.
  
==== Instructions ====

There is a block template with hints in ''libs/blocks/Tutorial/Mark_heads.pm''. You should edit the block so that its output is the same a-tree, with the attribute ''is_clause_head'' additionally attached to each ''a-node''. There is also a printing block ''libs/blocks/Print_finite_clauses.pm'' which will print out the ''a-nodes'' grouped by clauses:
  
<code bash>
finite_clauses:
        brunblocks -S -o Tutorial::Mark_heads Tutorial::Print_finite_clauses -- sample.tmt
</code>

You are going to need these methods:

  * ''my $root = $bundle<nowiki>-></nowiki>get_tree('tree_name')''
  * ''my $attr = $node<nowiki>-></nowiki>get_attr('attr_name')''
  * ''$node<nowiki>-></nowiki>set_attr('attr_name', $attr_value)''
  * ''my @eff_children = $node<nowiki>-></nowiki>get_eff_children()''

//Note//: ''get_children()'' returns the topological children of a node in the tree, while ''get_eff_children()'' returns the node's children in the linguistic sense. Mostly, these do not differ. If interested, see Figure 1 in the [[http://ufal.mff.cuni.cz/pdt2.0/doc/tools/tred/bn-tutorial.html#i-effective|btred tutorial]].

//Hint//: Finite clauses in English usually require a grammatical subject to be present.
 +
==== Advanced version ====

The output of our block might still be incorrect in special cases: we do not handle coordination (see the second sentence in ''sample.txt'') or subordinating conjunctions.

===== Your turn: more tasks =====

==== SVO to SOV ====

**Motivation**: During translation from an SVO-based language (e.g. English) to an SOV-based language (e.g. Korean), we might need to change the word order from SVO to SOV.

**Task**: Change the word order from SVO to SOV.

**Instructions**:

  * You can use the block template in ''libs/blocks/BlockTemplate.pm''.
  * To find the object of a verb, look for objects among the effective children of the verb (''$child<nowiki>-></nowiki>get_attr('afun') eq 'Obj' ''). That implies working on the analytical layer.
  * For debugging, a method returning the surface word order of a node is useful: ''$node<nowiki>-></nowiki>get_attr('ord')''. It can be used to print out the nodes sorted by the attribute ''ord''.
  * Once you have the node ''$object'' and the node ''$verb'', use the method ''$object<nowiki>-></nowiki>shift_before_node($verb)''. This method takes the whole subtree under the node ''$object'' and recalculates the ''ord'' attributes (surface word order) so that all the nodes in the subtree under ''$object'' have a smaller ''ord'' than ''$verb''. That is, the method rearranges the surface word order from VO to OV.

**Advanced version**: This solution shifts the object (or objects) of a verb just in front of that verb node. For example, //Mr. Brown has urged MPs.// changes to //Mr. Brown has MPs urged.// You can try to change the solution so that the final sentence is //Mr. Brown MPs has urged.// You may need the method ''$node<nowiki>-></nowiki>shift_after_subtree($root_of_that_subtree)''. Subjects have the attribute ''afun'' set to ''Sb''.

==== Prepositions ====

{{ external:tectomt:preps.png?200x80|Prepositions example}}

**Motivation**: In the dependency approach, the question of where to hang prepositions arises. In the Praguian style (PDT), the preposition is the head of the subtree and the noun/pronoun depends on it. However, another convention might be preferable: the noun/pronoun might be the head of the subtree, while the preposition takes the role of a modifier.

**Task**: Rehang all prepositions as indicated in the picture. You may assume that a preposition has at most one child.

**Instructions**:

You are going to need these new methods:

  * ''my @children = $node<nowiki>-></nowiki>get_children()''
  * ''my $parent = $node<nowiki>-></nowiki>get_parent()''
  * ''$node<nowiki>-></nowiki>set_parent($parent)''

//Hint//:

  * On the analytical layer, you can use this test to recognize prepositions: ''$node<nowiki>-></nowiki>get_attr('afun') eq 'AuxP' ''
  * To see the results, you can again use TrEd (''tmttred sample.tmt'').

**Advanced version**: What happens in the case of multiword prepositions, for example ''because of'' or ''instead of''? Can you handle them?

===== Further information =====
  * [[http://ufal.mff.cuni.cz/tectomt|TectoMT Homepage]]
  * Questions? Ask ''kravalova'' at ''ufal.mff.cuni.cz''
  * Solutions to the tutorial tasks are in ''libs/blocks/Tutorial/*solution*.pm''.
  * [[http://ufal.mff.cuni.cz/~pajas/tred/|TrEd]], [[http://ufal.mff.cuni.cz/~pajas/tred/ar01-toc.html|TrEd User's Manual]] - tree editor

If you are missing some files from //share//, you can download them from [[http://ufallab.ms.mff.cuni.cz/tectomt/share/]].
