
Institute of Formal and Applied Linguistics Wiki



This is an old revision of the document!



TectoMT Tutorial

Welcome to the TectoMT Tutorial. This tutorial should take about 2 hours.

What is TectoMT

TectoMT is a highly modular NLP (Natural Language Processing) software system implemented in the Perl programming language under Linux. It is primarily aimed at Machine Translation, making use of the ideas and technology created during the Prague Dependency Treebank project. It is also intended to facilitate and accelerate the development of software for many other NLP tasks, mainly through the re-usability of its numerous integrated processing modules (called blocks), which share uniform object-oriented interfaces.

Prerequisites

Installation and setup

TODO: describe the installation

Before running any experiments with TectoMT, you must set up your environment by running

source config/init_devel_environ.sh

Theoretical background

TODO: figure

TrEd

TODO: a little about TrEd and a figure

TectoMT Architecture

Blocks, scenarios and applications

In TectoMT, there is the following hierarchy of processing units (software components that process data):

This tutorial itself has its blocks in libs/blocks/Tutorial and the application in applications/tutorial.

Layers of Linguistic Structures

The TectoMT block repository is stored in libs/blocks/. In correspondence with …, the blocks are located in directories named after their purpose.

Thus, the set of TectoMT layers is Cartesian product {S,T} x {English,Czech} x {W,M,P,A,T}, in which:

Example: The block that adds Czech morphological tags (part of speech, case, gender, etc.) can be found in libs/blocks/SCzechW_to_SCzechM/Simple_tagger.pm.
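The Cartesian product of layers can be enumerated mechanically. The following Python sketch is for illustration only (TectoMT itself is written in Perl). It lists all layer names and composes a block directory name in the style of SEnglishW_to_SEnglishM, as used in the scenarios of this tutorial. The helper names layer_names and block_directory are hypothetical and not part of TectoMT, and the "<from>_to_<to>" naming rule is an assumption drawn from the examples on this page.

```python
from itertools import product

def layer_names():
    """Return all layer names from {S,T} x {English,Czech} x {W,M,P,A,T}."""
    sides = ["S", "T"]
    languages = ["English", "Czech"]
    levels = ["W", "M", "P", "A", "T"]
    return ["".join(parts) for parts in product(sides, languages, levels)]

def block_directory(from_layer, to_layer):
    # Assumption: block directories are named "<from>_to_<to>",
    # as in SEnglishW_to_SEnglishM from the tutorial Makefile.
    return f"{from_layer}_to_{to_layer}"

layers = layer_names()
print(len(layers))        # 2 * 2 * 5 combinations
print(layers[:5])         # the five S/English layers
print(block_directory("SEnglishW", "SEnglishM"))
```

Not every combination need exist as an actual directory; the sketch only shows how the names are composed.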

There are also directories for blocks with other purposes. For example, blocks which only print out some information go to libs/blocks/Print. Our tutorial blocks are in libs/blocks/Tutorial.

First application

Once you have TectoMT installed on your machine, you can find this tutorial in applications/tutorial/. After you cd into this directory, you can see our plain text sample data in sample.txt.

Most applications are defined in Makefiles, which describe the sequence of blocks to be applied to the data. In our particular Makefile, four blocks are applied to the sample text: sentence segmentation, tokenization, tagging and lemmatization. Since our input is plain text, the file is first converted into the tmt format (the in section).

We can run the application:

make all

Our plain text data in sample.txt has been transformed into tmt, the internal TectoMT format, and saved as sample.tmt. Then all four blocks were loaded and the data was processed. We can now examine sample.tmt using a regular text editor. We'll now stop and describe the data structure in TectoMT.

Changing the scenario

We'll now add syntactic analysis to our scenario by adding three more blocks. Instead of

analyze:
        brunblocks -S -o \
                SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
                SEnglishW_to_SEnglishM::Penn_style_tokenization \
                SEnglishW_to_SEnglishM::TagTnT \
                SEnglishW_to_SEnglishM::Lemmatize_mtree -- sample.tmt

we'll have:

analyze:
        brunblocks -S -o \
                SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
                SEnglishW_to_SEnglishM::Penn_style_tokenization \
                SEnglishW_to_SEnglishM::TagTnT \
                SEnglishW_to_SEnglishM::Lemmatize_mtree  \
                SEnglishM_to_SEnglishA::McD_parser_local \
                SEnglishM_to_SEnglishA::Fix_McD_Tree \
                SEnglishM_to_SEnglishA::Fill_afun_after_McD -- sample.tmt

Note: Makefiles use tab characters to mark command lines. Make sure your command lines start with a tab (or two tabs) and not, for example, with 4 spaces.
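When a command line starts with spaces, GNU make typically stops with a "missing separator" error. The following Python sketch is a simple heuristic for finding such lines before running make. The function name space_indented_lines is hypothetical, and the check is deliberately rough: it flags every non-empty line that starts with a space, unless it continues a previous line ending in a backslash.

```python
def space_indented_lines(makefile_text):
    """Return 1-based numbers of lines that start with a space
    and are not continuations of a backslash-terminated line."""
    suspicious = []
    continued = False
    for number, line in enumerate(makefile_text.splitlines(), start=1):
        if line.startswith(" ") and line.strip() and not continued:
            suspicious.append(number)
        # The next line is a continuation if this one ends with a backslash.
        continued = line.rstrip().endswith("\\")
    return suspicious

good = "analyze:\n\tbrunblocks -S -o \\\n\t\tSomeBlock -- sample.tmt\n"
bad = "analyze:\n    brunblocks -S -o SomeBlock -- sample.tmt\n"

print(space_indented_lines(good))
print(space_indented_lines(bad))
```

A flagged line is only a candidate for inspection, since a Makefile may legitimately contain space-indented lines outside of command lines.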

After running

make all

we can examine our sample.tmt again. Indeed, an analytical layer SEnglishA describing a dependency tree with analytical functions (<afun>) has been added to each bundle.

You can view the trees in sample.tmt with TrEd by typing

tmttred sample.tmt

Adding a new block

The linguistic structures in TectoMT are represented using the following object-oriented interface/types:

You can have TectoMT automatically execute your block code on each document or bundle by defining the main block entry point:

Each block must have exactly one entry point.

We'll now examine an example of a new block in file libs/blocks/Tutorial/Print_node_info.pm.

This block illustrates some of the most common methods for accessing objects:

Attributes of documents, bundles or nodes can be accessed by attribute getters and setters, for example:

Our tutorial block Print_node_info.pm is ready to use. You only need to add this block to our scenario:

print_info:
        brunblocks -S -o Tutorial::Print_node_info -- sample.tmt

We can observe our new block's behaviour:

make print_info

Try to change the block so that it prints out the information only for verbs. (You need to look at the attribute tag at the m level.) The tagset is the Penn Treebank tagset.
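In the Penn Treebank tagset, the verb tags are VB, VBD, VBG, VBN, VBP and VBZ, so a verb can be recognised by the prefix "VB". Modal verbs carry the separate tag MD and are not matched by this prefix. The following Python sketch only illustrates the filtering logic (the actual block is written in Perl). The function name is_verb_tag and the sample token list are hypothetical.

```python
def is_verb_tag(tag):
    """True for Penn Treebank verb tags: VB, VBD, VBG, VBN, VBP, VBZ."""
    return tag.startswith("VB")

# Hypothetical (form, tag) pairs standing in for m-layer nodes.
tokens = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]

verbs = [form for form, tag in tokens if is_verb_tag(tag)]
print(verbs)
```

The same test can be written in the Perl block as a prefix match on the value of the tag attribute.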

Advanced block: finite clauses

Motivation

TODO: translate the motivation

Task

Write a block which, given an analytical tree (SEnglishA), fills each a-node with a numerical attribute <clause> so that nodes in the same finite clause are marked with the same clause number.
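To make the task concrete, here is one possible approach sketched in Python. It is not the tutorial's reference algorithm and does not use the TectoMT API. The assumptions are: a node is a finite verb if its Penn Treebank tag is VBD, VBP or VBZ; every finite verb (and the tree root) opens a new clause; every other node inherits the clause number of its nearest ancestor that opens one. The Node class, FINITE_TAGS and mark_clauses are hypothetical names. Auxiliaries, modals and coordination are ignored.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: these Penn Treebank tags mark finite verb forms.
FINITE_TAGS = {"VBD", "VBP", "VBZ"}

@dataclass
class Node:
    form: str
    tag: str
    parent: Optional[int]          # index of the parent node, None for the root
    clause: Optional[int] = None   # filled in by mark_clauses

def mark_clauses(nodes: List[Node]) -> List[Node]:
    """Assign a clause number to every node of a dependency tree."""
    next_number = 0

    def clause_of(index: int) -> int:
        nonlocal next_number
        node = nodes[index]
        if node.clause is None:
            if node.tag in FINITE_TAGS or node.parent is None:
                next_number += 1               # this node opens a new clause
                node.clause = next_number
            else:
                node.clause = clause_of(node.parent)  # inherit from ancestor
        return node.clause

    for index in range(len(nodes)):
        clause_of(index)
    return nodes

# "John said that Mary sleeps ." with a hand-made dependency tree.
sentence = [
    Node("John", "NNP", 1),
    Node("said", "VBD", None),
    Node("that", "IN", 4),
    Node("Mary", "NNP", 4),
    Node("sleeps", "VBZ", 1),
    Node(".", ".", 1),
]
clauses = [node.clause for node in mark_clauses(sentence)]
print(clauses)
```

In this example the main clause and the embedded clause receive different numbers, and the conjunction "that" ends up in the embedded clause because it hangs under "sleeps" in the hand-made tree.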

Algorithm

TODO: figure and explanation of the algorithm

Instructions

There is a block template with hints in libs/blocks/Tutorial/Mark_finite_clauses.pm. Edit the block so that its output is the same a-tree with the attribute <clause> additionally attached to each a-node. There is also a printing block, libs/blocks/Tutorial/Print_finite_clauses.pm, which prints out the a-nodes grouped by clause:

finite_clauses:
        brunblocks -S -o \
                Tutorial::Mark_finite_clauses \
                Tutorial::Print_finite_clauses -- sample.tmt
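The grouping performed by such a printing block can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the code of Print_finite_clauses: the nodes are plain (form, clause number) pairs, the function name group_by_clause is hypothetical, and the output format is invented for this example.

```python
from collections import defaultdict

def group_by_clause(nodes):
    """Group word forms by clause number, keeping the original word order."""
    groups = defaultdict(list)
    for form, clause in nodes:
        groups[clause].append(form)
    return dict(sorted(groups.items()))

# Hypothetical a-nodes of "John said that Mary sleeps ." with clause numbers.
nodes = [("John", 1), ("said", 1), ("that", 2),
         ("Mary", 2), ("sleeps", 2), (".", 1)]

grouped = group_by_clause(nodes)
for clause, forms in grouped.items():
    print(clause, " ".join(forms))
```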

You are going to need these methods:

Is it perfect?

The output of our block is not ideal: we do not handle coordination or subordinating conjunctions. But for the time being, we can move on to some other tasks.

Your turn: more tasks

SVO typology

TODO

Coordination

This time (TODO: explanation and figure)

You can use the block template in devel/libs/blocks/BlockTemplate.pm. To see the results, you can again use TrEd (tmttred sample.tmt).

Further information

