
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

external:tectomt:tutorial [2009/04/01 10:46] ptacek
external:tectomt:tutorial [2010/04/14 01:45] popel (new blocks)
Line 91: Line 91:
  
 ===== TectoMT Architecture =====
 +
  
  
Line 100: Line 101:
 In TectoMT, there is the following hierarchy of processing units (software components that process data):
  
-  * The basic units are blocks. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Technically, blocks are Perl classes inherited from ''TectoMT::Block'', each saved in a separate file. The blocks repository is in ''libs/blocks/''.
-  * To solve a more complex task, selected blocks can be chained into a block sequence, also called a scenario. Technically, scenarios are instances of the ''TectoMT::Scenario'' class, but in some situations (e.g. on the command line) it is sufficient to specify the scenario simply by listing block names separated by spaces.
-  * The highest unit is called an application. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue the components existing in TectoMT. Some demo applications can be found in ''applications''.
+  * The basic units are **blocks**. They serve for some very limited, well defined, and often linguistically interpretable tasks (e.g., tokenization, tagging, parsing). Technically, blocks are Perl classes inherited from ''TectoMT::Block'', each saved in a separate file. The blocks repository is in ''libs/blocks/''.
+  * To solve a more complex task, selected blocks can be chained into a block sequence, also called a **scenario**. Scenarios are stored in ''*.scen'' files (alternatively, the block names separated by spaces can simply be listed on the command line) and at runtime the scenarios are represented by instances of the ''TectoMT::Scenario'' class.
+  * The highest unit is called an **application**. Applications correspond to end-to-end tasks, be they real end-user applications (such as machine translation), or 'only' NLP-related experiments. Technically, applications are often implemented as ''Makefiles'', which only glue the components existing in TectoMT. Some demo applications can be found in ''applications''.
  
 This tutorial itself has its blocks in ''libs/blocks/Tutorial'' and the application in ''applications/tutorial''.
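
//Note//: To make the block concept more concrete, the following is a minimal sketch of such a Perl class. It is not one of the tutorial blocks; the package name is invented and the ''process_bundle'' method follows the convention used by TectoMT blocks, so treat the exact names as assumptions rather than guaranteed API.

<code perl>
package Tutorial::My_first_block;    # hypothetical block name, for illustration only
use strict;
use warnings;
use base qw(TectoMT::Block);         # every block inherits from TectoMT::Block

# The engine calls this method once for every bundle (i.e. once per sentence)
# of the processed document.
sub process_bundle {
    my ( $self, $bundle ) = @_;
    # ... perform one small, well defined task on $bundle here ...
    return;
}

1;
</code>

A block like this would be saved as a separate file under ''libs/blocks/'' and listed in a scenario by its package name, just like the blocks used later in this tutorial.
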
Line 135: Line 136:
  
 There are also other directories for blocks serving other purposes; for example, blocks which only print out some information go to ''libs/Print''. Our tutorial blocks are in ''libs/blocks/Tutorial/''.
 +
  
  
Line 150: Line 152:
 Once you have TectoMT installed on your machine, you can find this tutorial in ''applications/tutorial/''. After you ''cd'' into this directory, you can see our plain text sample data in ''sample.txt''.
  
-Most applications are defined in ''Makefiles'', which describe the sequence of blocks to be applied on our data. In our particular ''Makefile'', four blocks are going to be applied on our sample text: sentence segmentation, tokenization, tagging and lemmatization. Since we have our input text in plain text format, the file is going to be converted into the ''tmt'' format beforehand (the ''in'' target in the ''Makefile'').
+Most applications are defined in ''Makefiles'' and ''*.scen'' files, which describe the sequence of blocks to be applied on our data. In our case, ''tutorial.scen'' lists four blocks to be applied on our sample text: sentence segmentation, tokenization, part-of-speech tagging and lemmatization. Since we have our input text in plain text format, the file is going to be converted into the ''tmt'' format beforehand (the ''in'' target in the ''Makefile'').
  
 We can run the application:
Line 161: Line 163:
  
   * One physical ''tmt'' file corresponds to one document.
-  * A document consists of a sequence of bundles (''<bundle>''), mirroring a sequence of natural language sentences originating from the text. So, for one sentence we have one ''<bundle>''.
-  * Each bundle contains tree shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have a morphological tree (''SEnglishM'') in each bundle. Later on, an analytical layer (''SEnglishA'') will also appear in each bundle as we proceed with our analysis.
+  * A document consists of a sequence of bundles (element ''<bundles>''), mirroring a sequence of natural language sentences originating from the text. So, for each sentence we have one bundle.
+  * Each bundle contains tree shaped sentence representations on various linguistic layers. In our example ''sample.tmt'' we have a morphological tree (''SEnglishM'') in each bundle (actually, it is a flat tree: one technical root whose children are the tokens). Later on, an analytical layer (''SEnglishA'') will also appear in each bundle as we proceed with our analysis.
   * Trees are formed by nodes and edges. Attributes can be attached only to nodes. An edge's attributes must be stored as the lower node's attributes. A tree's attributes must be stored as attributes of the root node.
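
//Note//: The following sketch shows how a block could walk the flat ''SEnglishM'' tree of one bundle and print its tokens. It is not one of the tutorial blocks; the ''get_tree'', ''get_descendants'' and ''get_attr('form')'' calls are assumptions based on the conventions described above, not verified API documentation.

<code perl>
package Tutorial::Print_forms;       # hypothetical block name, for illustration only
use strict;
use warnings;
use base qw(TectoMT::Block);

sub process_bundle {
    my ( $self, $bundle ) = @_;

    # One bundle corresponds to one sentence; each linguistic layer is a tree
    # stored in the bundle, here the English morphological layer.
    my $m_root = $bundle->get_tree('SEnglishM');

    # The m-tree is flat: the technical root's children are the tokens.
    foreach my $m_node ( $m_root->get_descendants ) {
        # Attributes are attached to nodes; 'form' is assumed to hold the token text.
        print $m_node->get_attr('form'), "\n";
    }
    return;
}

1;
</code>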
 +
  
  
Line 198: Line 201:
 ===== Changing the scenario =====
  
-We'll now add a syntax analysis (dependency parsing) to our scenario by adding three more blocks. Instead of
+We'll now add a syntax analysis (dependency parsing) to our scenario by adding five more blocks to ''tutorial.scen''. Instead of
  
-<code bash>
-analyze:
-        brunblocks -S -o \
-                SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
-                SEnglishW_to_SEnglishM::Penn_style_tokenization \
-                SEnglishW_to_SEnglishM::TagMxPost \
-                SEnglishW_to_SEnglishM::Lemmatize_mtree \
-        -- sample.tmt
-</code>
+<code>
+SEnglishW_to_SEnglishM::Sentence_segmentation_simple
+SEnglishW_to_SEnglishM::Tokenization
+SEnglishW_to_SEnglishM::TagMxPost
+SEnglishW_to_SEnglishM::Lemmatize_mtree
+</code>
  
Line 213: Line 213:
  
 <code bash>
-analyze:
-        brunblocks -S -o \
-                SEnglishW_to_SEnglishM::Sentence_segmentation_simple \
-                SEnglishW_to_SEnglishM::Penn_style_tokenization \
-                SEnglishW_to_SEnglishM::TagMxPost \
-                SEnglishW_to_SEnglishM::Lemmatize_mtree \
-                SEnglishM_to_SEnglishA::McD_parser_local \
-                SEnglishM_to_SEnglishA::Fix_McD_Tree \
-                SEnglishM_to_SEnglishA::Fill_afun_after_McD \
-        -- sample.tmt
+SEnglishW_to_SEnglishM::Sentence_segmentation_simple
+SEnglishW_to_SEnglishM::Tokenization
+SEnglishW_to_SEnglishM::TagMxPost
+SEnglishW_to_SEnglishM::Lemmatize_mtree
+SEnglishM_to_SEnglishA::McD_parser
+SEnglishM_to_SEnglishA::Fill_is_member_from_deprel
+SEnglishM_to_SEnglishA::Fix_McD_topology
+SEnglishM_to_SEnglishA::Fill_afun_AuxCP_Coord
+SEnglishM_to_SEnglishA::Fill_afun
 </code>
- 
-//Note//: ''Makefiles'' use tabulators to mark command lines. Make sure your lines start with a tabulator (or two tabulators) and not, for example, with 4 spaces. 
  
 After running
Line 238: Line 235:
  
 <code bash>
-SEnglishM_to_SEnglishA::McD_parser_local \
+SEnglishM_to_SEnglishA::McD_parser
 </code>
  
Line 244: Line 241:
  
 <code bash>
-SEnglishM_to_SEnglishA::McD_parser_local TMT_PARAM_MCD_EN_MODEL=conll_mcd_order2_0.1.model \
+SEnglishM_to_SEnglishA::McD_parser TMT_PARAM_MCD_EN_MODEL=conll_mcd_order2_0.1.model
 </code>
  
Line 257: Line 254:
 //Note//: For more information about the tree editor TrEd, see [[http://ufal.mff.cuni.cz/~pajas/tred/ar01-toc.html|TrEd User's Manual]].
  
-If you are not familiar with ''Makefile'' syntax, another way of running a scenario in TectoMT is using a ''.scen'' file (see ''applications/tutorial.scen''). This file lists the blocks to be run, one block per line.
-
-<code bash>
-$TMT_ROOT/tools/format_convertors/plaintext_to_tmt/plaintext_to_tmt.pl English sample.txt
-brunblocks -S -o --scen tutorial.scen -- sample.tmt
-</code>
-
-Finally, yet another way is to use a simple ''bash'' script (see ''applications/tutorial/run_all.sh''):
+If you are not familiar with ''Makefile'' syntax, you can run the scenario with a simple ''bash'' script (see ''applications/tutorial/run_all.sh''):
  
 <code bash>
