Differences

This shows you the differences between two versions of the page.

--- user:zeman:addicter [2010/11/05 16:41]
zeman Acknowledgements.
+++ user:zeman:addicter [2011/07/16 17:49]
zeman Franz's Giza++ website is gone.
@@ Line 3: / Line 3: @@
 //Addicter// stands for Automatic Detection and DIsplay of Common Translation ERrors. It will be a set of tools (mostly scripts written in Perl) that help with error analysis for machine translation.
-Work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https://failfinder.googlecode.com/svn/trunk/. The nucleus that existed just after the MT Marathon (4 Feb 2010) is Addicter version 0.1, to reflect that this is by no means deemed a final product. Anyway, it can already do a useful job.
+Work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https://failfinder.googlecode.com/svn/trunk/. The nucleus that existed just after the MT Marathon (4 Feb 2010) is Addicter version 0.1, to reflect that this was by no means deemed a final product.
-Currently, Addicter can view and browse aligned corpora, look for example words in context and summarize known alignments of a given word. The viewing and browsing is performed using a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). The obvious drawback is that access to a web server is needed.
+In 2011, the viewer was joined by an automatic error recognizer and classifier, thanks to Mark Fishel. Development has moved to the ÚFAL StatMT SVN repository (i.e. ''failfinder.googlecode.com'' is currently not maintained).
+Currently, Addicter can do the following:
+  * Find erroneous tokens and classify the errors in a way similar to Vilar's taxonomy.
+  * Browse the test data, sentence by sentence, and show the aligned source sentence, reference translation and system hypothesis.
+  * Browse the aligned training corpus and look for example words in context.
+  * Show lines of the phrase table that contain a given word.
+  * Summarize alignments of a given word. This feature can also serve as a primitive corpus-based dictionary.
+  * In the near future, we also plan to add searching and grouping of words sharing the same lemma. That way morphological errors will be highlighted.
+Viewing and browsing is performed using a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). Words in sentences are clickable so that the user can quickly navigate to examples and summaries of words other than the current one. The obvious drawback is that access to a web server is needed. A small subset, the test data browser, can also be generated as static HTML files and viewed without a web server.
+There is another subpage for Addicter in this wiki that lies in the external name space, thus it can be used for [[external:addicter|external collaboration]].
 ===== Installation =====
+  * Addicter is written in Perl and you need a Perl interpreter to run Addicter. This is usually no problem on Unix-like systems but you may need to install Perl version ≥ 5.8 if you are working on Windows. Options include [[http://www.activestate.com/activeperl|Active Perl]] and [[http://strawberryperl.com/|Strawberry Perl]].
   * Install a web server, unless you already have access to one (local or remote). For instance, the Apache web server is available for at least Linux and MS Windows, and it's free. Configure your web server to work with CGI scripts written in Perl.
-  * To be able to generate alignments that will be displayed by Addicter, you need Giza++ or equivalent. The first few training steps of the Moses suite will do.
+  * To be able to generate alignments that will be displayed by Addicter, you need [[http://code.google.com/p/giza-pp/|Giza++]] or equivalent. The first few training steps of the [[http://www.statmt.org/moses/|Moses]] suite will do.
-  * Check out Addicter code from the Failfinder SVN repository.
+  * Check out Addicter code from the ÚFAL SVN repository (see below how).
 ==== How to install and configure Apache ====
@@ Line 45: / Line 58: @@
 We use ''$CGI'' to refer to the path you registered with Apache as containing CGI scripts (using the ''ScriptAlias'' directive).
-  * Check out the current version of Addicter from the SVN repository. In Linux, the following command will do that: <code>svn checkout https://failfinder.googlecode.com/svn/trunk addicter</code> In Windows, you can use [[http://tortoisesvn.tigris.org/|TortoiseSVN]] to access the repository.
+  * Addicter uses some general-purpose Perl libraries that are maintained in a separate repository. Download these first, using username ''public'' and password ''public''. Then make sure that Perl finds these libraries. In Linux/bash, the following commands will do that: <code bash>svn --username public checkout https://svn.ms.mff.cuni.cz/svn/dzlib ~/lib
-  * All you need at this moment is in the folder ''dan''. There are two subfolders, ''prepare'' and ''cgi''. Copy the contents of the ''cgi'' folder to ''$CGI''.
+export PERL5LIB=~/lib:$PERL5LIB</code> In Windows, you can use [[http://tortoisesvn.tigris.org/|TortoiseSVN]] to access the repository.
+  * Check out the current version of Addicter from the StatMT SVN repository, again using username ''public'' and password ''public'': <code bash>svn --username public checkout https://svn.ms.mff.cuni.cz/svn/statmt/trunk/addicter addicter</code>
+  * There are two subfolders, ''prepare'' and ''cgi''. Copy the contents of the ''cgi'' folder to ''$CGI''.
+  * For every experiment whose data shall be explored by Addicter, create a subfolder in ''$CGI'', e.g. ''$CGI/fr-en-exp01''.
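The last two installation steps (copying the ''cgi'' folder and creating an experiment subfolder) can be scripted. The following is only a minimal sketch under stated assumptions, not part of Addicter: the function name ''install_addicter_cgi'' and its three arguments are hypothetical, and it assumes that the checkout directory contains the ''cgi'' subfolder described above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Addicter): copy the CGI scripts from an
# Addicter checkout into the CGI folder and create one experiment subfolder.
# Usage: install_addicter_cgi CHECKOUT_DIR CGI_DIR EXPERIMENT_NAME
install_addicter_cgi() {
    local checkout="$1" cgi="$2" experiment="$3"

    # The checkout is expected to contain the 'cgi' subfolder.
    if [ ! -d "$checkout/cgi" ]; then
        echo "No cgi subfolder found in $checkout" >&2
        return 1
    fi

    mkdir -p "$cgi" || return 1

    # Copy the contents of the cgi folder (not the folder itself) to $CGI.
    cp -R "$checkout/cgi/." "$cgi/" || return 1

    # One subfolder per experiment, e.g. $CGI/fr-en-exp01.
    mkdir -p "$cgi/$experiment"
}
```

With a checkout in ''~/addicter'', a call might look like ''install_addicter_cgi ~/addicter "$CGI" fr-en-exp01'', where ''$CGI'' is whatever path you registered with ''ScriptAlias''.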
 ===== Alignment viewer =====
@@ Line 61: / Line 77: @@
   * ''test.system.ali'' ... alignment of the source and the system output for test data
-The ''prepare'' folder contains some sample corpora in ''sample_data.zip''.
+<!--The ''prepare'' folder contains some sample corpora in ''sample_data.zip''.-->
-The indexer splits the output index into multiple files to reduce the size of any individual file. All index files must be stored in the same folder as the viewing CGI scripts.
+The indexer splits the output index into multiple files to reduce the size of any individual file. All index files must be stored in the experiment subfolder of ''$CGI'' so that the CGI scripts can find them.
 ==== How to prepare a corpus for viewing ====
@@ Line 84: / Line 100: @@
 The indexer will copy the input files and output all index files into the ''$CGI'' folder where the CGI scripts will find them.
+==== How to invoke the error classifier ====
+The error classifier currently uses its own monolingual word alignment of the reference translation and the hypothesis. It is invoked as follows:
+<code bash>${ADDICTER}/testchamber/align-hmm.pl ref.txt hyp.txt > tcali.txt
+${ADDICTER}/testchamber/finderrs.pl src.txt hyp.txt ref.txt tcali.txt > tcerr.txt
+${ADDICTER}/testchamber/errsummary.pl tcerr.txt</code>
+Place the files ''tcali.txt'' and ''tcerr.txt'' in the experiment subfolder of ''$CGI'' and the error classes will be displayed during test data browsing in the viewer.
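The three commands and the final copying step can be combined so that ''tcali.txt'' and ''tcerr.txt'' are written straight into the experiment subfolder. This is a sketch only: the wrapper name ''classify_errors'' and its argument order are hypothetical, and it assumes the three ''testchamber'' scripts are executable and take their arguments exactly as shown above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Addicter): run the error classifier and
# leave tcali.txt and tcerr.txt in the experiment subfolder of $CGI.
# Usage: classify_errors ADDICTER_DIR SRC HYP REF EXPERIMENT_DIR
classify_errors() {
    local addicter="$1" src="$2" hyp="$3" ref="$4" expdir="$5"
    local tc="$addicter/testchamber"

    mkdir -p "$expdir" || return 1

    # Step 1: monolingual alignment of the reference and the hypothesis.
    "$tc/align-hmm.pl" "$ref" "$hyp" > "$expdir/tcali.txt" || return 1

    # Step 2: find erroneous tokens and classify the errors.
    "$tc/finderrs.pl" "$src" "$hyp" "$ref" "$expdir/tcali.txt" \
        > "$expdir/tcerr.txt" || return 1

    # Step 3: print a summary of the error classes to standard output.
    "$tc/errsummary.pl" "$expdir/tcerr.txt"
}
```

For example, ''classify_errors "$ADDICTER" src.txt hyp.txt ref.txt "$CGI/fr-en-exp01"'' would leave both files where the viewer looks for them.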
 ==== How to use the viewer ====
@@ Line 91: / Line 117: @@
 ===== Acknowledgements =====
-This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838.
+This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838 (2010), by the grants of the Czech Science Foundation no. P406/11/1499 and P406/10/P259 and the Estonian Science Foundation target financed theme SF0180078s08 (2011).


Institute of Formal and Applied Linguistics Wiki
