Differences

This shows you the differences between two versions of the page.

--- external:tectomt:tutorial [2009/01/20 15:25]
kravalova
+++ external:tectomt:tutorial [2009/01/20 17:57]
popel
@@ Line 17: / Line 17: @@
   * Your shell is bash
   * You have basic experience with bash and you can read Perl
@@ Line 27: / Line 28: @@
 ==== Installation and setup ====
-  * Checkout SVN repository. If you are running this installation in computer lab in Prague, you have checkout the repository into directory /home/BIG (because data quotas don't apply here):
+  * Check out the SVN repository. If you are running this installation in a computer lab in Prague, you have to check out the repository into the directory ''/home/BIG'' (because data quotas don't apply there):
 <code bash>
@@ Line 67: / Line 68: @@
 ===== TectoMT Architecture =====
@@ Line 74: / Line 76: @@
 In TectoMT, there is the following hierarchy of processing units (software components that process data):
-  * The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Blocks are not parametrizable. Technically, blocks are Perl classes inherited from ''TectoMT::Block''. The blocks repository is in ''libs/blocks/''.
+  * The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Technically, blocks are Perl classes inherited from ''TectoMT::Block'', each saved in a separate file. The blocks repository is in ''libs/blocks/''.
   * To solve a more complex task, selected blocks can be chained into a block sequence, also called a scenario. Technically, scenarios are instances of the ''TectoMT::Scenario'' class, but in some situations (e.g. on the command line) it is sufficient to specify the scenario simply by listing block names separated by spaces.
-  * The highest unit is called application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue the components existing in TectoMT. Some demo applications can be found in ''applications''
+  * The highest unit is called an application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue together the components existing in TectoMT. Some demo applications can be found in ''applications''.
 This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''.
@@ Line 90: / Line 94: @@
 TectoMT blocks repository is saved in ''libs/blocks/''. In correspondence with ..., the blocks are located in directories describing their purpose.
-Thus, the set of TectoMT layers is Cartesian product {S,T} x {English,Czech} x {W,M,P,A,T}, in which:
+Thus, the set of TectoMT layers is a Cartesian product {S,T} x {English,Czech,...} x {W,M,P,A,T}, in which:
   * {S,T} distinguishes whether the data was created by analysis or transfer/synthesis (mnemonics: S and T correspond to (S)ource and (T)arget in MT perspective).
@@ Line 96: / Line 100: @@
   * {W,M,P,A,T...} represents the layer of description in terms of PDT 2.0 (W - word layer, M - morphological layer, A - analytical layer, T - tectogrammatical layer) or extensions (P - phrase-structure layer).
-//Example//: Block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_toSCzechM/Simple_tagger.pm''.
+//Example//: Block adding Czech morphological tags (pos, case, gender, etc.) can be found in ''libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm''.
+There are also directories for blocks serving other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial''.
-There are also other directories for other purpose blocks, for example blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/Tutorial''.
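The layer naming scheme described above can be enumerated mechanically. The following is a minimal Perl sketch, not part of TectoMT: the helpers ''layer_names'' and ''block_dir'' are hypothetical names introduced here for illustration, and only the two languages named in the text are listed.

```perl
use strict;
use warnings;

# Hypothetical helper: build every layer name in the product
# {S,T} x {languages} x {W,M,P,A,T}.
sub layer_names {
    my @languages = @_;
    my @names;
    for my $direction (qw(S T)) {
        for my $language (@languages) {
            for my $layer (qw(W M P A T)) {
                push @names, $direction . $language . $layer;
            }
        }
    }
    return @names;
}

# Hypothetical helper: directory of blocks converting one layer into another,
# following the pattern of libs/blocks/SCzechW_to_SCzechM.
sub block_dir {
    my ($from, $to) = @_;
    return "libs/blocks/${from}_to_${to}";
}

my @names = layer_names(qw(English Czech));
print scalar(@names), " layers, e.g. $names[0]\n";
print block_dir('SCzechW', 'SCzechM'), "\n";
```

With two languages the product has 2 x 2 x 5 = 20 members; adding a language to the list extends the set without any other change.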
@@ Line 107: / Line 112: @@
 ===== First application =====
-Once you have TectoMT installed on your machine, you can find this tutorial in ''applications/tutorial/''. After you cd in to this directory, you can see our plain text sample data in ''sample.txt''.
+Once you have TectoMT installed on your machine, you can find this tutorial in ''applications/tutorial/''. After you ''cd'' into this directory, you can see our plain text sample data in ''sample.txt''.
-Most applications are defined in Makefiles, which describe sequence of blocks to be applied on our data. In our particular ''Makefile'', four blocks are going to be applied on our sample text: sentence segmentation, tokenization, tagging and lemmatization. Since we have our input text in plain text format, the file is going to be converted into ''tmt'' format beforehand (the ''in'' section).
+Most applications are defined in Makefiles, which describe the sequence of blocks to be applied to our data. In our particular ''Makefile'', four blocks are going to be applied to our sample text: sentence segmentation, tokenization, tagging and lemmatization. Since our input text is in plain text format, the file is going to be converted into ''tmt'' format beforehand (the ''in'' target in the Makefile).
 We can run the application:
@@ Line 117: / Line 122: @@
 </code>
-Our plain text data ''sample.txt'' have been transformed into ''tmt'', internal TectoMT format, and saved into ''sample.tmt''. Then, all four blocks have been loaded and our data has been processed. We can now examine ''sample.tmt'' using a regular text editor. We'll now stop and describe data structure in TectoMT.
+Our plain text data in ''sample.txt'' has been transformed into ''tmt'', an internal TectoMT format, and saved as ''sample.tmt''. Then all four blocks were loaded and our data was processed. We can now examine ''sample.tmt'' using a regular text editor. Before going on, we'll stop and describe the data structure in TectoMT.
-  * One physical file corresponds to one document.
+  * One physical ''tmt'' file corresponds to one document.
   * A document consists of a sequence of bundles (''<bundle>''), mirroring a sequence of natural language sentences originating from the text. So, for one sentence we have one ''<bundle>''.
   * Each bundle contains tree-shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have a morphological tree (''SEnglishM'') in each bundle. Later on, an analytical layer (''SEnglishA'') will also appear in each bundle as we proceed with our analysis.
   * Trees are formed by nodes and edges. Attributes can be attached only to nodes. Edge's attributes must be equivalently stored as the lower node's attributes. Tree's attributes must be stored as attributes of the root node.
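The containment hierarchy listed above (document, bundles, trees, nodes) can be pictured with plain Perl data. This is only a sketch of the idea, not the real ''tmt'' format or the TectoMT object model: the attribute names ''sentence'' and ''form'', the flat shape of the morphological tree, and the helper ''count_descendants'' are assumptions made for the example.

```perl
use strict;
use warnings;

# Plain-data model of the structure described above:
# document -> bundles -> trees keyed by layer name -> nodes with attributes.
my $document = {
    bundles => [
        {   # one bundle per sentence
            trees => {
                SEnglishM => {
                    # tree-level information is stored on the root node
                    attr     => { sentence => 'John sleeps.' },
                    children => [
                        { attr => { form => 'John',   tag => 'NNP' }, children => [] },
                        { attr => { form => 'sleeps', tag => 'VBZ' }, children => [] },
                        { attr => { form => '.',      tag => '.'   }, children => [] },
                    ],
                },
            },
        },
    ],
};

# Count all nodes below the given node, recursively.
sub count_descendants {
    my ($node) = @_;
    my $count = 0;
    for my $child (@{ $node->{children} }) {
        $count += 1 + count_descendants($child);
    }
    return $count;
}

for my $bundle (@{ $document->{bundles} }) {
    for my $layer (sort keys %{ $bundle->{trees} }) {
        my $root = $bundle->{trees}{$layer};
        print "$layer: ", count_descendants($root), " nodes below the root\n";
    }
}
```

Note that nothing is attached to edges: anything that describes an edge would go into the ''attr'' hash of the lower node.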
@@ Line 144: / Line 150: @@
 ===== Changing the scenario =====
-We'll now add syntax analysis to our scenario by adding four more blocks. Instead of
+We'll now add syntactic analysis (dependency parsing) to our scenario by adding four more blocks. Instead of
 <code bash>
@@ Line 152: / Line 158: @@
                 SEnglishW_to_SEnglishM::Penn_style_tokenization \
                 SEnglishW_to_SEnglishM::TagTnT \
-                SEnglishW_to_SEnglishM::Lemmatize_mtree -- sample.tmt
+                SEnglishW_to_SEnglishM::Lemmatize_mtree \
+        -- sample.tmt
 </code>
@@ Line 163: / Line 170: @@
                 SEnglishW_to_SEnglishM::Penn_style_tokenization \
                 SEnglishW_to_SEnglishM::TagTnT \
-                SEnglishW_to_SEnglishM::Lemmatize_mtree  \
+                SEnglishW_to_SEnglishM::Lemmatize_mtree \
                 SEnglishM_to_SEnglishA::McD_parser_local \
                 SEnglishM_to_SEnglishA::Fix_McD_Tree \
-                SEnglishM_to_SEnglishA::Fill_afun_after_McD -- sample.tmt
+                SEnglishM_to_SEnglishA::Fill_afun_after_McD \
+        -- sample.tmt
 </code>
@@ Line 184: / Line 192: @@
 tmttred sample.tmt
 </code>
@@ Line 231: / Line 240: @@
 This block illustrates some of the most common methods for accessing objects:
-  * ''my @bundles = $document->get_bundles'' - an array of bundles contained in the document
+  * ''my @bundles = $document->get_bundles()'' - an array of bundles contained in the document
   * ''my $root_node = $bundle->get_tree($layer_name);'' - the root node of the tree of the given type in the given bundle
-  * ''my @children = $node->get_children;'' - array of the node's children
+  * ''my @children = $node->get_children();'' - array of the node's children
-  * ''my @descendants = $node->get_descendants;'' - array of the node's children and their children and children of their children ...
+  * ''my @descendants = $node->get_descendants();'' - array of the node's children and their children and children of their children ...
-  * ''my $parent = $node->get_parent;'' - parent node of the given node, or undef for root
+  * ''my $parent = $node->get_parent();'' - parent node of the given node, or undef for root
-  * ''my $root_node = $node->get_root;'' - the root node of the tree into which the node belongs
+  * ''my $root_node = $node->get_root();'' - the root node of the tree into which the node belongs
 Attributes of documents, bundles or nodes can be accessed by attribute getters and setters, for example:
@@ Line 242: / Line 251: @@
   * ''$node->set_attr($attr_name, $attr_value);''
-Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario:
+Our tutorial block ''Print_node_info.pm'' is ready to use. You only need to add this block to our scenario, e.g. as a new Makefile target:
 <code bash>
@@ Line 255: / Line 264: @@
 </code>
-Try to change the block so that it prints out the information only for verbs. (You need to look at attribute ''tag'' at the ''m'' level). The tagset used is Penn Treebank Tagset.
+Try to change the block so that it prints out the information only for verbs. (You need to look at the attribute ''tag'' at the ''m'' level.) The tagset used is the Penn Treebank tagset.
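One possible shape of the verb test is sketched below. It runs on a stand-in class, ''MockNode'', which only mimics the accessor names listed in this tutorial and is not the TectoMT implementation; ''add_child'', ''is_verb'' and the attribute ''form'' are hypothetical, and the real path to the ''tag'' attribute at the ''m'' level may differ from the plain ''tag'' key used here.

```perl
use strict;
use warnings;

# Stand-in node class; it only mimics the accessor names used in this tutorial.
package MockNode;

sub new {
    my ($class, %attr) = @_;
    return bless { attr => {%attr}, parent => undef, children => [] }, $class;
}
sub get_attr     { my ($self, $name) = @_; return $self->{attr}{$name}; }
sub set_attr     { my ($self, $name, $value) = @_; $self->{attr}{$name} = $value; return; }
sub get_parent   { return $_[0]{parent}; }
sub get_children { return @{ $_[0]{children} }; }

sub add_child {    # hypothetical helper for building the demo tree
    my ($self, $child) = @_;
    $child->{parent} = $self;
    push @{ $self->{children} }, $child;
    return $child;
}

sub get_descendants {    # children, their children, and so on
    my ($self) = @_;
    return map { ($_, $_->get_descendants()) } $self->get_children();
}

sub get_root {    # climb until a node without a parent is reached
    my ($self) = @_;
    my $node = $self;
    $node = $node->get_parent() while defined $node->get_parent();
    return $node;
}

package main;

# Penn Treebank verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
sub is_verb {
    my ($node) = @_;
    my $tag = $node->get_attr('tag');
    return (defined $tag && $tag =~ /^VB/) ? 1 : 0;
}

my $root = MockNode->new();
my $verb = $root->add_child(MockNode->new(form => 'sleeps', tag => 'VBZ'));
$verb->add_child(MockNode->new(form => 'John', tag => 'NNP'));

my @verbs = grep { is_verb($_) } $root->get_descendants();
print join(' ', map { $_->get_attr('form') } @verbs), "\n";
```

The ''defined'' check matters because the technical root of a tree carries no tag.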
@@ Line 332: / Line 341: @@
-==== SVO typology ====
-**Motivation**:
+==== SVO to SOV ====
+**Motivation**: During translation from an SVO based language (English) to an SOV based language (Korean) we might need to change the word order from SVO to SOV.
+**Task**: Change the word order from SVO to SOV.
+**Instructions**:
+  * To find the object of a verb, look among the effective children of the verb (''$child->get_attr('afun') eq 'Obj' ''). That implies working on the analytical layer.
+  * Once you have node ''$object'' and node ''$verb'', use method TODO
+  * For debugging, a method returning word order of a node is useful: ''$node->get_attr('ord')''. It can be used to print out nodes sorted by attribute ''ord''.
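Since the method for moving nodes is still marked TODO above, the sketch below only illustrates the reordering itself on plain hashes. The helper ''move_before'' is hypothetical, the hash fields stand in for ''$node->get_attr('ord')'' and ''$node->get_attr('afun')'', the ''afun'' values ''Sb'' and ''Pred'' are assumptions used to build the demo sentence, and the object is assumed to be a single node without dependents.

```perl
use strict;
use warnings;

# Each node is a plain hash here; in a real block these values would come
# from $node->get_attr('ord') and $node->get_attr('afun').
my @nodes = (
    { form => 'John',  afun => 'Sb',   ord => 1 },
    { form => 'loves', afun => 'Pred', ord => 2 },
    { form => 'Mary',  afun => 'Obj',  ord => 3 },
);

# Hypothetical helper: place $object immediately before $verb and renumber ord.
# It assumes the object is a single node without dependents of its own.
sub move_before {
    my ($object, $verb, @all) = @_;
    my @ordered;
    for my $node (sort { $a->{ord} <=> $b->{ord} } @all) {
        next if $node == $object;                   # skip the old position
        push @ordered, $object if $node == $verb;   # object goes right before the verb
        push @ordered, $node;
    }
    my $ord = 1;
    $_->{ord} = $ord++ for @ordered;
    return @ordered;
}

my ($verb)   = grep { $_->{afun} eq 'Pred' } @nodes;
my ($object) = grep { $_->{afun} eq 'Obj' } @nodes;

my @sov = move_before($object, $verb, @nodes);

# Debugging aid: print the nodes sorted by ord.
print join(' ', map { "$_->{form}/$_->{ord}" } @sov), "\n";
```

An object with its own dependents would need its whole subtree moved as one unit, which this sketch does not attempt.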
@@ Line 352: / Line 379: @@
 ==== Prepositions ====
-In dependency approach a question "where to hang prepositions" arises. In praguian style (PDT), prepositions are heads of the subtree and the noun/pronoun is dependent on the preposition. However, another ordering might be preferable: The noun/pronoun might be the head of subtree, while the preposition would take the role of a modifier.
+**Motivation**: In the dependency approach, the question "where to hang prepositions" arises. In the Praguian style (PDT), prepositions are heads of the subtree and the noun/pronoun depends on the preposition. However, another arrangement might be preferable: the noun/pronoun might be the head of the subtree, while the preposition would take the role of a modifier.
 TODO picture
-The task is to rehang all prepositions as indicated at the picture. You may assume that prepositions have at most 1 child.
+**Task**: Rehang all prepositions as indicated in the picture. You may assume that prepositions have at most one child.
+**Instructions**:
 You are going to need these new methods:
@@ Line 364: / Line 393: @@
 //Hint//:
-  * On analytical layer, you can use this test to recognize prepositions: ''$afun eq 'AuxP' ''
+  * On the analytical layer, you can use this test to recognize prepositions: ''$node->get_attr('afun') eq 'AuxP' ''
   * You can use the block template in ''libs/blocks/BlockTemplate.pm''. To see the results, you can again use TrEd (''tmttred sample.tmt'').
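The rehanging step can be tried out on plain hashes before writing the real block. The sketch below is not the TectoMT API: ''set_parent'', ''new_node'' and ''rehang_preposition'' are hypothetical helpers defined here, and the ''afun'' values ''Pred'' and ''Adv'' are used only to build the demo tree. Only the ''AuxP'' test comes from the hint above.

```perl
use strict;
use warnings;

# Hypothetical helper on plain-hash nodes: detach $node from its current
# parent (if any) and attach it under $new_parent.
sub set_parent {
    my ($node, $new_parent) = @_;
    if (my $old = $node->{parent}) {
        @{ $old->{children} } = grep { $_ != $node } @{ $old->{children} };
    }
    $node->{parent} = $new_parent;
    push @{ $new_parent->{children} }, $node;
    return;
}

# Hypothetical constructor for demo nodes.
sub new_node {
    my ($form, $afun, $parent) = @_;
    my $node = { form => $form, afun => $afun, parent => undef, children => [] };
    set_parent($node, $parent) if $parent;
    return $node;
}

# Swap a preposition with its only child: the child moves up to the
# preposition's former parent, and the preposition becomes its dependent.
sub rehang_preposition {
    my ($prep) = @_;
    return if $prep->{afun} ne 'AuxP';
    my $grandparent = $prep->{parent} or return;    # a root cannot be rehung
    my ($child) = @{ $prep->{children} };
    return if !$child;                              # nothing to rehang
    set_parent($child, $grandparent);
    set_parent($prep, $child);
    return;
}

# "sits on chair": sits -> on (AuxP) -> chair
my $verb = new_node('sits',  'Pred');
my $prep = new_node('on',    'AuxP', $verb);
my $noun = new_node('chair', 'Adv',  $prep);

rehang_preposition($prep);
print "$noun->{form} <- $noun->{parent}{form}\n";
print "$prep->{form} <- $prep->{parent}{form}\n";
```

The order of the two ''set_parent'' calls matters: the child has to leave the preposition's subtree first, otherwise attaching the preposition under its own descendant would create a cycle.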


Institute of Formal and Applied Linguistics Wiki
