TectoMT Tutorial
Welcome to the TectoMT Tutorial. This tutorial should take about 2 hours.
What is TectoMT
TectoMT is a highly modular NLP (Natural Language Processing) software system implemented in the Perl programming language under Linux. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project. At the same time, it is also hoped to significantly facilitate and accelerate the development of software solutions for many other NLP tasks, especially thanks to the re-usability of the numerous integrated processing modules (called blocks), which are equipped with uniform object-oriented interfaces.
Prerequisites
Installation and setup
TODO describe the installation
Before running any experiments with TectoMT, you must set up your environment by running
source config/init_devel_environ.sh
Theoretical background
TODO picture
TrEd
TODO a little about TrEd and a picture
TectoMT Architecture
Blocks, scenarios and applications
In TectoMT, there is the following hierarchy of processing units (software components that process data):
- The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Blocks are not parametrizable. Technically, blocks are Perl classes inherited from TectoMT::Block. The blocks repository is in libs/blocks/.
- To solve a more complex task, selected blocks can be chained into a block sequence, also called a scenario. Technically, scenarios are instances of the TectoMT::Scenario class, but in some situations (e.g. on the command line) it is sufficient to specify the scenario simply by listing block names separated by spaces.
- The highest unit is called an application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as Makefiles, which only glue together the components existing in TectoMT. Some demo applications can be found in applications.
This tutorial itself has its blocks in libs/blocks/Tutorial and the application in applications/tutorial.
Layers of Linguistic Structures
The TectoMT blocks repository is stored in libs/blocks/. In correspondence with …, the blocks are located in directories describing their purpose.
Thus, the set of TectoMT layers is the Cartesian product {S,T} x {English,Czech} x {W,M,P,A,T}, in which:
- {S,T} distinguishes whether the data was created by analysis or transfer/synthesis (mnemonics: S and T correspond to (S)ource and (T)arget in MT perspective).
- {English,Czech…} represents the language in question
- {W,M,P,A,T…} represents the layer of description in terms of PDT 2.0 (W - word layer, M - morphological layer, A - analytical layer, T - tectogrammatical layer) or extensions (P - phrase-structure layer).
Example: A block adding Czech morphological tags (pos, case, gender, etc.) can be found in libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm.
There are also directories for blocks with other purposes. For example, blocks which only print out some information go to libs/blocks/Print. Our tutorial blocks are in libs/blocks/Tutorial.
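The naming scheme above can be sketched in a few lines of Perl. This is only an illustration: the helper names layer_names and block_directory are hypothetical, and the `_to_` directory pattern is inferred from the block names used in this tutorial rather than taken from TectoMT itself.

```perl
use strict;
use warnings;

# The three sets whose Cartesian product gives the layer names.
my @directions = qw(S T);             # (S)ource = analysis, (T)arget = transfer/synthesis
my @languages  = qw(English Czech);   # only the languages named in the text
my @levels     = qw(W M P A T);       # word, morphological, phrase-structure, analytical, tectogrammatical

# Hypothetical helper: list every layer name in the product.
sub layer_names {
    my @names;
    for my $dir (@directions) {
        for my $lang (@languages) {
            for my $level (@levels) {
                push @names, $dir . $lang . $level;
            }
        }
    }
    return @names;
}

# Hypothetical helper: directory name for blocks leading from one layer to another
# (pattern assumed from names such as SEnglishW_to_SEnglishM).
sub block_directory {
    my ($from, $to) = @_;
    return $from . '_to_' . $to;
}

my @layers = layer_names();
print scalar(@layers), " layer names, for example $layers[1]\n";
print block_directory('SEnglishW', 'SEnglishM'), "\n";
```

With two directions, two languages and five levels, the product contains 20 layer names.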
First application
Once you have TectoMT installed on your machine, you can find this tutorial in devel/applications/tutorial/. After you cd into this directory, you can see our plain text sample data in sample.txt.
Most applications are defined in Makefiles, which describe the sequence of blocks to be applied to our data. In our particular Makefile, four blocks are applied to our sample text: sentence segmentation, tokenization, tagging and lemmatization. Since our input is in plain text format, the file is first converted into the tmt format (the in section).
We can run the application:
make all
Our plain text data sample.txt has been transformed into tmt, the internal TectoMT format, and saved into sample.tmt. Then, all four blocks were loaded and our data was processed. We can now examine sample.tmt using a regular text editor. Before going on, let us describe the data structure in TectoMT.
- One physical file corresponds to one document.
- A document consists of a sequence of bundles (<bundle>), mirroring the sequence of natural language sentences in the text. So, for each sentence we have one <bundle>.
- Each bundle contains tree-shaped sentence representations on various linguistic layers. In our example sample.tmt we have a morphological tree (SEnglishM) in each bundle. Later on, an analytical layer (SEnglishA) will also appear in each bundle as we proceed with our analysis.
- Trees are formed by nodes and edges. Attributes can be attached only to nodes. An edge's attributes must be stored as attributes of its lower node. A tree's attributes must be stored as attributes of the root node.
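The shape of this data can be mimicked with plain Perl hashes and arrays. The sketch below is not the tmt format and does not use the TectoMT API; the keys bundles, trees, attr and children, the sentence attribute, and the helper collect_attr are all made up for illustration. It shows an edge label (afun) kept on the lower node and a tree-level attribute kept on the root.

```perl
use strict;
use warnings;

# A tiny dependency tree for "Dogs bark", built bottom-up.
# The edge label afun is stored on the lower (dependent) node.
my $dogs = { attr => { form => 'Dogs', afun => 'Sb' },   children => [] };
my $bark = { attr => { form => 'bark', afun => 'Pred' }, children => [$dogs] };

# A tree-level attribute (hypothetical name) is stored on the root node.
my $root = { attr => { sentence => 'Dogs bark' }, children => [$bark] };

# One document, one bundle per sentence, one tree per layer in each bundle.
my $document = {
    bundles => [
        { trees => { SEnglishA => $root } },
    ],
};

# Hypothetical helper: collect one attribute from a node and everything below it,
# in depth-first order.
sub collect_attr {
    my ($node, $name) = @_;
    my @values = exists($node->{attr}{$name}) ? ($node->{attr}{$name}) : ();
    push @values, collect_attr($_, $name) for @{ $node->{children} };
    return @values;
}

for my $bundle (@{ $document->{bundles} }) {
    for my $layer (sort keys %{ $bundle->{trees} }) {
        my @afuns = collect_attr($bundle->{trees}{$layer}, 'afun');
        print "$layer: @afuns\n";
    }
}
```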
Changing the scenario
We'll now add syntactic analysis to our scenario by adding three more blocks. Instead of
analyze:
	brunblocks -S -o \
		SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
		SEnglishW_to_SEnglishM::Penn_style_tokenization \
		SEnglishW_to_SEnglishM::TagTnT \
		SEnglishW_to_SEnglishM::Lemmatize_mtree -- sample.tmt
we'll have:
analyze:
	brunblocks -S -o \
		SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
		SEnglishW_to_SEnglishM::Penn_style_tokenization \
		SEnglishW_to_SEnglishM::TagTnT \
		SEnglishW_to_SEnglishM::Lemmatize_mtree \
		SEnglishM_to_SEnglishA::McD_parser_local \
		SEnglishM_to_SEnglishA::Fix_McD_Tree \
		SEnglishM_to_SEnglishA::Fill_afun_after_McD -- sample.tmt
Note: Makefiles use tab characters to mark command lines. Make sure your command lines start with a tab (or two tabs) and not, for example, with 4 spaces.
After running
make all
we can examine sample.tmt again. Indeed, an analytical layer SEnglishA describing a dependency tree with analytical functions (<afun>) has been added to each bundle.
You can view the trees in sample.tmt with TrEd by typing
tmttred sample.tmt
Adding a new block
The linguistic structures in TectoMT are represented using the following object-oriented interface/types:
- document - TectoMT::Document
- bundle - TectoMT::Bundle
- node - TectoMT::Node
You can have TectoMT automatically execute your block code on each document or bundle by defining the main block entry point:
- sub process_document - run this procedure on each document
- sub process_bundle - run this procedure on each bundle (sentence)
Each block must have exactly one entry point.
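The two entry points can be illustrated with a self-contained mock. Nothing below is TectoMT code: the Demo:: packages, the run_block dispatcher and the hash-based document are assumptions made only to show the idea, and real blocks inherit from TectoMT::Block and are run by the framework. The argument order ($self first, then the document or bundle) is also assumed.

```perl
use strict;
use warnings;

# Hypothetical block with a document-level entry point.
package Demo::Count_bundles;
sub new { my ($class) = @_; return bless { count => 0 }, $class }
sub process_document {
    my ($self, $document) = @_;
    $self->{count} = scalar @{ $document->{bundles} };
    return;
}

# Hypothetical block with a bundle-level entry point.
package Demo::Mark_bundles;
sub new { my ($class) = @_; return bless {}, $class }
sub process_bundle {
    my ($self, $bundle) = @_;
    $bundle->{seen} = 1;    # leave a mark on every bundle we were given
    return;
}

package main;

# Stand-in for the framework: call whichever entry point the block defines.
sub run_block {
    my ($block, $document) = @_;
    if ($block->can('process_document')) {
        $block->process_document($document);
    }
    elsif ($block->can('process_bundle')) {
        $block->process_bundle($_) for @{ $document->{bundles} };
    }
    else {
        die ref($block) . " defines no entry point\n";
    }
    return;
}

my $document = { bundles => [ {}, {} ] };    # two empty stand-in bundles
my $counter  = Demo::Count_bundles->new;
my $marker   = Demo::Mark_bundles->new;
run_block($_, $document) for ($counter, $marker);

print "bundles counted: $counter->{count}\n";
```

A block that works sentence by sentence is usually simpler to write with process_bundle, while process_document suits blocks that need to see the whole document at once.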
We'll now examine an example of a new block in the file libs/blocks/Tutorial/Print_node_info.pm.
This block illustrates some of the most common methods for accessing objects:
- my @bundles = $document->get_bundles; - an array of bundles contained in the document
- my $root_node = $bundle->get_tree($layer_name); - the root node of the tree of the given type in the given bundle
- my @children = $node->get_children; - array of the node's children
- my @descendants = $node->get_descendants; - array of the node's children, their children, the children of their children, and so on
- my $parent = $node->get_parent; - parent node of the given node, or undef for the root
- my $root_node = $node->get_root; - the root node of the tree to which the node belongs
Attributes of documents, bundles or nodes can be accessed by attribute getters and setters, for example:
$node->get_attr($attr_name);
$node->set_attr($attr_name, $attr_value);
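To try these calls without a TectoMT installation, the following sketch defines a small stand-in class. Demo::Node only mimics the method names listed above; its internals, the add_child helper and the form attribute are assumptions for illustration and say nothing about how TectoMT::Node is implemented.

```perl
use strict;
use warnings;

# Stand-in for TectoMT::Node that only mimics the method names from this section.
package Demo::Node;

sub new {
    my ($class, %attr) = @_;
    return bless { attr => {%attr}, parent => undef, children => [] }, $class;
}

# Hypothetical helper for building the demo tree.
sub add_child {
    my ($self, $child) = @_;
    $child->{parent} = $self;
    push @{ $self->{children} }, $child;
    return $child;
}

sub get_attr     { my ($self, $name) = @_; return $self->{attr}{$name} }
sub set_attr     { my ($self, $name, $value) = @_; $self->{attr}{$name} = $value; return }
sub get_children { my ($self) = @_; return @{ $self->{children} } }
sub get_parent   { my ($self) = @_; return $self->{parent} }

# Children, their children, and so on, in depth-first order.
sub get_descendants {
    my ($self) = @_;
    return map { ($_, $_->get_descendants) } $self->get_children;
}

# Climb parent links until a node without a parent is reached.
sub get_root {
    my ($self) = @_;
    my $node = $self;
    $node = $node->get_parent while defined $node->get_parent;
    return $node;
}

package main;

my $root = Demo::Node->new(form => '<root>');
my $verb = $root->add_child(Demo::Node->new(form => 'bark'));
my $noun = $verb->add_child(Demo::Node->new(form => 'Dogs'));

$noun->set_attr('afun', 'Sb');
print join(' ', map { $_->get_attr('form') } $root->get_descendants), "\n";
print 'parent of Dogs: ', $noun->get_parent->get_attr('form'), "\n";
```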
Our tutorial block Print_node_info.pm is ready to use:
- Copy the block to the right place in the blocks repository, devel/libs/blocks/Print (because it is a printing block):
cp libs/blocks/Tutorial/Print_node_info.pm devel/libs/blocks/Print/Print_node_info.pm
- In the copied file Print/Print_node_info.pm, change the package declaration from package Tutorial::Print_node_info; to package Print::Print_node_info;
- Add this block to our scenario:
print_afun:
	brunblocks -S -o Print::Print_node_info -- sample.tmt
We can observe the behaviour of our new block:
make print_afun
Advanced block: finite clauses
Motivation
Task
A block which, given an analytical tree (SEnglishA), fills each a-node with a numerical attribute <clause> so that nodes in the same finite clause are marked with the same clause number.
Algorithm
Instructions
There is a block template with hints in devel/libs/blocks/Tutorial/Tutorial_fill_finite_clauses.pm. Copy the file to SEnglishA_to_SEnglishT and edit it using the hints inside. Also, don't forget to change the package name (to SEnglishA_to_SEnglishT::Tutorial_fill_finite_clauses). The output of this block should be the same a-tree with a numerical <clause> value attached to each a-node. There is also a printing block, devel/libs/blocks/Tutorial_print_finite_clauses.pm, which prints the a-nodes grouped by clauses:
finite_clauses:
	brunblocks -S -o \
		SEnglishA_to_SEnglishT::Tutorial_fill_finite_clauses \
		Print::Tutorial_print_finite_clauses -- sample.tmt
You are going to need these methods:
$bundle->get_tree($tree_name)
$node->get_attr($attr_name)
$node->set_attr('attr_name',$attr_value)
$node->get_eff_children()
$node->get_children()
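One possible way to number clauses is sketched below on a mock tree. It is a simplification and not the tutorial's solution: it assumes that a node whose Penn-style tag is VBD, VBP, VBZ or MD heads a new finite clause, that all other nodes belong to the clause of their parent, and it ignores coordination (which is where get_eff_children would come in). Demo::Node, add_child, fill_clauses and the tag attribute name are hypothetical stand-ins, so the hints in the block template take precedence.

```perl
use strict;
use warnings;

# Minimal stand-in for an a-node; it only mimics the method names listed above.
package Demo::Node;
sub new {
    my ($class, %attr) = @_;
    return bless { attr => {%attr}, children => [] }, $class;
}
sub add_child    { my ($self, $child) = @_; push @{ $self->{children} }, $child; return $child }
sub get_children { my ($self) = @_; return @{ $self->{children} } }
sub get_attr     { my ($self, $name) = @_; return $self->{attr}{$name} }
sub set_attr     { my ($self, $name, $value) = @_; $self->{attr}{$name} = $value; return }

package main;

# Assumption: these tags mark finite verb forms.
my %FINITE_TAG = map { $_ => 1 } qw(VBD VBP VBZ MD);

# Walk the tree top-down. A finite verb opens a new clause number,
# every other node inherits the clause number of its parent.
sub fill_clauses {
    my ($node, $current, $counter_ref) = @_;
    for my $child ($node->get_children) {
        my $tag    = $child->get_attr('tag');
        my $clause = (defined($tag) && $FINITE_TAG{$tag}) ? ++$$counter_ref : $current;
        $child->set_attr('clause', $clause);
        fill_clauses($child, $clause, $counter_ref);
    }
    return;
}

# Simplified mock tree for "John said that Mary sleeps".
my $root   = Demo::Node->new(form => '<root>');
my $said   = $root->add_child(Demo::Node->new(form => 'said', tag => 'VBD'));
my $john   = $said->add_child(Demo::Node->new(form => 'John', tag => 'NNP'));
my $sleeps = $said->add_child(Demo::Node->new(form => 'sleeps', tag => 'VBZ'));
my $that   = $sleeps->add_child(Demo::Node->new(form => 'that', tag => 'IN'));
my $mary   = $sleeps->add_child(Demo::Node->new(form => 'Mary', tag => 'NNP'));

my $counter = 0;
fill_clauses($root, 0, \$counter);

for my $node ($said, $john, $sleeps, $that, $mary) {
    print $node->get_attr('form'), "\t", $node->get_attr('clause'), "\n";
}
```

In this mock tree, "said" and "John" end up in clause 1 and "sleeps", "that" and "Mary" in clause 2.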
Your turn: more tasks
Coordination
This time …
You can use the block template in devel/libs/blocks/BlockTemplate.pm. To see the results, you can again use TrEd (tmttred sample.tmt).
Further information
- TectoMT Developer's Guide - obsolete
- Questions? Ask kravalova at ufal.mff.cuni.cz
- Solutions to