Addicter stands for Automatic Detection and DIsplay of Common Translation ERrors. It will be a set of tools (mostly scripts written in Perl) that help with error analysis for machine translation.
Work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https://failfinder.googlecode.com/svn/trunk/. The nucleus that existed just after the MT Marathon (4 Feb 2010) is Addicter version 0.1; the low version number reflects that this was by no means deemed a final product.
In 2011, the viewer was complemented by an automatic error recognizer and classifier, thanks to Mark Fishel. Development has moved to the ÚFAL StatMT SVN repository (i.e., failfinder.googlecode.com is currently not maintained). In September 2011, at the Sixth MT Marathon in Trento, Addicter was further developed and thoroughly compared with another error analysis tool, Hjerson. See the project wiki. For further developments, see also the Terra website.
Currently, Addicter can do the following:
Viewing and browsing are performed using a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). Words in sentences are clickable so that the user can quickly navigate to examples and summaries of words other than the current one. If you have access to a web server, you may use Addicter with it; otherwise you can use Addicter's own lightweight server. A small subset can also be generated as static HTML files and viewed without a web server: the test data browser.
There is another subpage for Addicter in this wiki that lies in the external name space, thus it can be used for external collaboration.
NOTE: Since September 2011, it is not necessary to install a local web server. Addicter now comes with a script called server.pl that works as an HTTP daemon and serves Addicter content (but nothing else) to your browser. This section is thus optional; skip it if you do not want to run your own server.
This tutorial currently focuses on installing Apache HTTP Server on Microsoft Windows. If you are an experienced user of another operating system and wish to share advice, please feel free to contact me.
(.msi). Install it by double-clicking on the installation file. I suggest installing Apache as a system service. That way, it will automatically start on startup of your computer.
C:\Program Files\Apache Software Foundation\Apache2.2\conf\httpd.conf. Alternatively, you can access it via your Start Menu: Apache → Apache HTTP Server 2.2 → Configure Apache Server → Edit the Apache httpd.conf Configuration File.
ScriptAlias directive. It tells the server: 1. what path on the hard disk contains scripts that can generate dynamic HTML content on the fly, and 2. how the path will be represented in the URL (web address). For example, ScriptAlias /cgi/ "C:/Documents and Settings/Dan/Documents/Web/cgi/" says that the URL http://localhost/cgi/anyscript.pl leads to your script C:\Documents and Settings\Dan\Documents\Web\cgi\anyscript.pl, and that it's a script (i.e., the server shall invoke it and send its output, instead of sending the script itself).
ScriptInterpreterSource registry
It tells the server that the Windows registry shall be used to figure out how to run a script (e.g., that the C:\Perl\Perl.exe binary must be run to interpret a .pl script).
PERLLIB variable and thus not find the libraries unless we specifically instruct Apache to pass the variable to the CGI environment:
PassEnv PERLLIB PERL5LIB
Install the Apache HTTP server package. After successful installation, there should be a file /etc/apache2/sites-enabled/000-default. Edit it (you need root permissions). There should be a section similar to the following:
ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
<Directory "/usr/lib/cgi-bin">
    AllowOverride None
    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
    Order allow,deny
    Allow from all
</Directory>
Either create a copy of the section with a new alias and path (e.g. ScriptAlias /addicter-cgi/ /home/user/addicter/cgi/) or use /usr/lib/cgi-bin (or whatever folder you see by default) for your Addicter CGI scripts and data (see below).
We use $CGI to refer to the path you registered with Apache as containing CGI scripts (using the ScriptAlias directive). NOTE: If you are using Addicter's own web server, or if Addicter content is the only thing you intend the server to serve, probably the easiest thing to do is to set Addicter's cgi folder as your $CGI. NOTE 2: There are a couple of files with static (non-CGI) web content that the CGI scripts need. These files (currently tabs.gif and activatables.js) are in the parent folder of $CGI ($CGI/..). With Addicter's own web server, this is just fine. If you are using another web server, however, you must copy these files to the appropriate location in your static content directory structure so that the server finds them. They should not be directly in the $CGI folder because they are not scripts and should not be treated as scripts by the server.
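Copying the static files for a separate web server can be scripted. The following is only a sketch: the helper name copy_static_files is hypothetical, it assumes that $CGI is still Addicter's own cgi folder (so that the two files sit in its parent folder), and the destination folder is whatever your server uses for static content, which this page does not specify.

```shell
# Hypothetical helper, not part of Addicter.
# $1 = the $CGI folder (assumed to be Addicter's own cgi folder)
# $2 = static content folder of your web server (an assumption; depends on your setup)
copy_static_files() {
    local cgi_dir="$1" static_dir="$2" f
    # Create the destination if it does not exist yet.
    mkdir -p "$static_dir" || return 1
    # The two static files currently needed by the CGI scripts live in $CGI/..
    for f in tabs.gif activatables.js; do
        cp "$cgi_dir/../$f" "$static_dir/" || return 1
    done
}
```

A call might look like copy_static_files "$CGI" /var/www/addicter, where /var/www/addicter is only an example path. The function stops at the first file it cannot copy and returns a non-zero status.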
public and password public. Then make sure that Perl finds these libraries. In Linux/bash, the following commands will do that:
svn --username public checkout https://svn.ms.mff.cuni.cz/svn/dzlib ~/lib
export PERL5LIB=~/lib:$PERL5LIB
In Windows, you can use TortoiseSVN to access the repository.
public and password public:
svn --username public checkout https://svn.ms.mff.cuni.cz/svn/statmt/trunk/addicter addicter
testchamber, prepare and cgi. Copy the contents of the cgi folder to $CGI.
$CGI, e.g. $CGI/fr-en-exp01.
Before invoking the viewer, you need to run an indexing script over your aligned corpus. It will create a bunch of index files that will later tell the viewer where to look for examples of a particular word. The indexer needs the following input files:
train.src … source side of training corpus
train.tgt … target side of training corpus
train.ali … alignment of training corpus
test.src … source side of test data
test.tgt … reference translation of test data
test.ali … alignment of the source and reference translation of test data
test.system.tgt … system output for test data
test.system.ali … alignment of the source and the system output for test data
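Before running the indexer, it may help to confirm that all eight inputs are in place. The sketch below assumes that your files use exactly the names listed above and sit in one folder; the function name list_missing_inputs is hypothetical and not part of Addicter.

```shell
# Hypothetical helper, not part of Addicter.
# $1 = folder that is expected to hold the eight indexer input files.
# Prints the name of every file that is missing or empty;
# returns 0 if all files are present and non-empty, 1 otherwise.
list_missing_inputs() {
    local dir="$1" f status=0
    for f in train.src train.tgt train.ali \
             test.src test.tgt test.ali \
             test.system.tgt test.system.ali; do
        # -s is true only for an existing file with size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}
```

For instance, list_missing_inputs . checks the current folder. If your files are named differently, adjust the list, since the indexer receives the file names as command-line options anyway.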
The indexer splits the output index into multiple files in order to reduce the size of any individual file. All index files must be stored in the experiment subfolder of $CGI so that the CGI scripts can find them.
We assume that your corpus is already sentence-aligned and tokenized. I.e., source and target files have the same number of lines (sentences, segments), and tokens (words, punctuation) on each line are space-separated. If you are using Addicter to perform analysis of errors made by a machine translation system, you probably already have such a corpus. You may also want to use a lowercased version of your corpus. Unless stated otherwise, all files are supposed to be plain text files in the UTF-8 encoding.
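The requirement that both sides have the same number of lines is easy to verify before indexing. The following is a minimal sketch; the function name check_parallel is hypothetical and not part of Addicter.

```shell
# Hypothetical helper, not part of Addicter.
# Compares the number of lines in two corpus files.
# Returns 0 if they match, 1 if they differ, 2 if a file cannot be read.
check_parallel() {
    local src="$1" tgt="$2" n_src n_tgt
    [ -r "$src" ] && [ -r "$tgt" ] || return 2
    # tr removes the padding that some wc implementations put around the number
    n_src=$(wc -l < "$src" | tr -d ' ')
    n_tgt=$(wc -l < "$tgt" | tr -d ' ')
    if [ "$n_src" -ne "$n_tgt" ]; then
        echo "Line count mismatch: $src has $n_src lines, $tgt has $n_tgt" >&2
        return 1
    fi
    return 0
}
```

For example, check_parallel test.src test.tgt compares the test source with the reference, and check_parallel test.src test.system.tgt compares it with the system output. Note that wc -l counts newline characters, so a file whose last line lacks a trailing newline is reported as one line shorter.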
You will also need some alignment files that define bi-directional word alignments. If you have trained a statistical MT system such as Moses, chances are that you already have such files for the training data. They result from the first three steps of the Moses training pipeline, namely from two runs of Giza++ and an alignment symmetrization algorithm. In order to get alignments for test data, too, you can do the following:
Once all the input files are ready, the indexer is invoked as follows:
addictindex.pl \
    -trs train.en -trt train.hi -tra train.ali \
    -s test.en -r test.hi -h test.system.hi -ra test.ali -ha test.system.ali \
    -o $CGI
The indexer will copy the input files and output all index files into the $CGI folder where the CGI scripts will find them.
The error classifier currently uses its own monolingual word alignment between the reference translation and the hypothesis. It is invoked as follows:
${ADDICTER}/prepare/detecter.pl -s srcfile -r reffile -h hypfile [-a alignment] -w workdir
and it creates the files workdir/tcali.txt and workdir/tcerr.txt. The input files (src, ref and hyp) can also be gzipped. Custom alignment between hypothesis and reference can be supplied. If it is not supplied, then the default aligner (${ADDICTER}/testchamber/align-greedy.pl) is invoked.
Place the files tcali.txt and tcerr.txt in the experiment subfolder of $CGI and the error classes will be displayed during test data browsing in the viewer. The viewer can work with several alternative alignments of the same data (perhaps produced by different alignment algorithms). For each of those alignments, you have to run detecter.pl separately.
First make sure that your web server is running and configured properly and that your index and data files have been prepared in the correct place. If you do not use your own web server, invoke the script server.pl in the main Addicter folder. It will say something like
Please contact me at: <URL:http://localhost:2588/cgi/index.pl>
which is the URL you should point your browser to. The server uses a randomly picked port number unless you specify one as a command-line parameter: server.pl 8080.
In the browser, you will see a list of experiments (all subfolders of $CGI). Start browsing your data by clicking on an experiment.
This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838 (2010), by the grants of the Czech Science Foundation no. P406/11/1499 and P406/10/P259, the Estonian Science Foundation target financed theme SF0180078s08 (2011) and by the project EuroMatrixPlus (FP7-ICT-2007-3-231720 of the EU and 7E09003+7E11051 of the Ministry of Education, Youth and Sports of the Czech Republic; 2011-2012).