Addicter
Addicter stands for Automatic Detection and DIsplay of Common Translation ERrors. It will be a set of tools (mostly scripts written in Perl) that help with error analysis for machine translation.
Work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https://failfinder.googlecode.com/svn/trunk/. The nucleus that existed just after the MT Marathon (4 Feb 2010) is labeled Addicter version 0.1, to reflect that it is by no means a final product. Still, it can already do a useful job.
Currently, Addicter can view and browse aligned corpora, look for example words in context and summarize known alignments of a given word. The viewing and browsing is performed using a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). The obvious drawback is that access to a web server is needed.
Installation
- Install a web server, unless you already have access to one (local or remote). For instance, the Apache web server is available for at least Linux and MS Windows, and it's free. Configure your web server to work with CGI scripts written in Perl.
- To be able to generate alignments that will be displayed by Addicter, you need Giza++ or equivalent. The first few training steps of the Moses suite will do.
- Check out Addicter code from the Failfinder SVN repository.
How to install and configure Apache
This tutorial currently focuses on installing Apache HTTP Server on Microsoft Windows. If you are an experienced user of another operating system and wish to share advice, please feel free to contact me.
- Download the Apache HTTP Server from http://httpd.apache.org/download.cgi. For MS Windows, you can download a package for the Microsoft Installer (.msi). Install it by double-clicking on the installation file. I suggest installing Apache as a system service. That way, it will automatically start on startup of your computer.
- Configure the server. This essentially means editing a configuration file and restarting the server. Depending on your system settings, Apache version etc., the configuration file will reside in a path similar to this: C:\Program Files\Apache Software Foundation\Apache2.2\conf\httpd.conf. Alternatively, you can access it via your Start Menu: Apache → Apache HTTP Server 2.2 → Configure Apache Server → Edit the Apache httpd.conf Configuration File.
  - Look for a ScriptAlias directive. It tells the server: 1. what path on the hard disk contains scripts that can generate dynamic HTML content on the fly, and 2. how the path will be represented in the URL (web address). For example,
    ScriptAlias /cgi/ "C:/Documents and Settings/Dan/Documents/Web/cgi/"
    says that the URL http://localhost/cgi/anyscript.pl leads to your script C:\Documents and Settings\Dan\Documents\Web\cgi\anyscript.pl, and that it's a script (i.e., the server shall invoke it and send its output, instead of sending the script itself).
  - Under Windows, you will also want to set
    ScriptInterpreterSource registry
    It tells the server that the Windows registry shall be used to figure out how to run a script (e.g., that the C:\Perl\Perl.exe binary must be run to interpret a .pl script).
- Restart the server. On the main Windows panel, there is (typically in the lower right corner) a set of icons, including a new one for Apache. Right-click on it, select Open Apache Monitor, then Restart.
How to install Addicter
We use $CGI to refer to the path you registered with Apache as containing CGI scripts (using the ScriptAlias directive).
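If you are not sure which folder your server treats as $CGI, a small script can read it out of httpd.conf. The following is only a sketch: the function name find_script_aliases is made up for this illustration, and the pattern only covers the plain one-line form of ScriptAlias shown above, not every variant Apache accepts.

```python
import re

# Matches lines such as: ScriptAlias /cgi/ "C:/path/to/cgi/"
# Assumption: the directive sits on one line; quotes around the folder are optional.
SCRIPT_ALIAS = re.compile(
    r'^\s*ScriptAlias\s+(\S+)\s+(?:"([^"]+)"|(\S+))\s*$',
    re.IGNORECASE,
)


def find_script_aliases(conf_text):
    """Return a list of (url_prefix, folder) pairs found in httpd.conf text."""
    aliases = []
    for line in conf_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip commented-out directives
        match = SCRIPT_ALIAS.match(line)
        if match:
            url_prefix = match.group(1)
            folder = match.group(2) or match.group(3)
            aliases.append((url_prefix, folder))
    return aliases


if __name__ == "__main__":
    sample = 'ScriptAlias /cgi/ "C:/Documents and Settings/Dan/Documents/Web/cgi/"\n'
    for url_prefix, folder in find_script_aliases(sample):
        print(url_prefix, "->", folder)
```

The second element of each pair is the folder to use as $CGI in the rest of this page.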
- Check out the current version of Addicter from the SVN repository. In Linux, the following command will do that:
svn checkout https://failfinder.googlecode.com/svn/trunk addicter
In Windows, you can use TortoiseSVN to access the repository.
- All you need at this moment is in the folder dan. There are two subfolders, prepare and cgi. Copy the contents of the cgi folder to $CGI.
Alignment viewer
Before invoking the viewer, you need to run an indexing script over your aligned corpus. It will create a bunch of index files that will later tell the viewer where to look for examples of a particular word. The indexer needs the following input files:
- train.src … source side of training corpus
- train.tgt … target side of training corpus
- train.ali … alignment of training corpus
- test.src … source side of test data
- test.tgt … reference translation of test data
- test.ali … alignment of the source and reference translation of test data
- test.system.tgt … system output for test data
- test.system.ali … alignment of the source and the system output for test data
The prepare folder contains some sample corpora in sample_data.zip.
The indexer splits the output index into multiple files in order to reduce the size of any individual file. All index files must be stored in the same folder as the viewing CGI scripts.
How to prepare a corpus for viewing
We assume that your corpus is already sentence-aligned and tokenized. I.e., source and target files have the same number of lines (sentences, segments), and tokens (words, punctuation) on each line are space-separated. If you are using Addicter to perform analysis of errors made by a machine translation system, you probably already have such a corpus. You may also want to use a lowercased version of your corpus. Unless stated otherwise, all files are supposed to be plain text files in the UTF-8 encoding.
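A quick sanity check before aligning or indexing is to confirm that both sides really have the same number of lines. The sketch below is not part of Addicter; the function names count_lines and check_parallel are invented for this illustration, and the files are assumed to be UTF-8 plain text as described above.

```python
def count_lines(path):
    """Count the lines of a UTF-8 plain text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, target_path):
    """Return the shared line count, or raise ValueError if the sides differ."""
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(
            f"{source_path} has {n_source} lines but "
            f"{target_path} has {n_target} lines"
        )
    return n_source
```

For example, check_parallel("train.src", "train.tgt") returns the number of sentence pairs when the two files match and raises an error otherwise.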
You will also need some alignment files that define bi-directional word alignments. If you have trained a statistical MT system such as Moses, chances are that you already have such files for the training data. They result from the first three steps of the Moses training pipeline, namely from two runs of Giza++ and an alignment symmetrization algorithm. In order to get alignments for test data, too, you can do the following:
- Join the source training file with the source test file. Similarly, join the target sides of the two data sets.
- Re-run Giza++ over the joint corpus.
- The resulting alignment file has the same number of lines as the source and the target side of the joint corpus. By cutting off the last N lines, where N is the number of test sentences, you easily separate the training and test alignments from each other.
- Alternatively, you can use MGiza (presented at MT Marathon 2010 in Dublin) to align the test data. After the aligned training corpus is initially read, alignments for new sentences just fly out, so you do not need to wait several hours for Giza++ again. Also, unlike the above method, your original training alignment is now guaranteed to remain untouched by the new evidence.
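The join-and-cut procedure from the first three steps can be sketched as follows. This is only an illustration under stated assumptions: the function names join_files and split_alignment are hypothetical, the alignment file is assumed to hold one line per sentence pair, and running Giza++ between the two calls is left to you. The whole alignment file is read into memory, which is acceptable for a sketch but may not suit very large corpora.

```python
def join_files(first_path, second_path, joint_path):
    """Write the lines of first_path followed by those of second_path."""
    with open(joint_path, "w", encoding="utf-8") as out:
        for path in (first_path, second_path):
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    # A missing final newline must not fuse two sentences.
                    out.write(line if line.endswith("\n") else line + "\n")


def split_alignment(joint_ali_path, n_test, train_ali_path, test_ali_path):
    """Cut the last n_test lines off the joint alignment.

    Returns a pair (number of training lines, number of test lines).
    """
    with open(joint_ali_path, encoding="utf-8") as handle:
        lines = handle.readlines()
    if not 0 <= n_test <= len(lines):
        raise ValueError("n_test must be between 0 and the number of lines")
    cut = len(lines) - n_test
    with open(train_ali_path, "w", encoding="utf-8") as out:
        out.writelines(lines[:cut])
    with open(test_ali_path, "w", encoding="utf-8") as out:
        out.writelines(lines[cut:])
    return cut, n_test
```

You would call join_files once for the source side and once for the target side, align the joint corpus, and then call split_alignment with N equal to the number of test sentences.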
Once all the input files are ready, the indexer is invoked as follows:
addictindex.pl \
    -trs train.en -trt train.hi -tra train.ali \
    -s test.en -r test.hi -h test.joshua.hi -ra test.ali -ha test.joshua.ali \
    -o $CGI
The indexer will copy the input files and output all index files into the $CGI folder where the CGI scripts will find them.
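If you re-index often, it may be convenient to assemble the call in a script. The sketch below only mirrors the options that appear in the example command above; what each option means is inferred from that example rather than from the indexer's documentation, the function names build_indexer_command and run_indexer are hypothetical, and calling the script through the perl executable is an assumption about your setup.

```python
import subprocess


def build_indexer_command(train_src, train_tgt, train_ali,
                          test_src, reference, system_output,
                          reference_ali, system_ali, output_dir,
                          script="addictindex.pl"):
    """Return the indexer call as an argument list (nothing is executed)."""
    return [
        "perl", script,           # assumption: perl is on the PATH
        "-trs", train_src,        # source side of training corpus
        "-trt", train_tgt,        # target side of training corpus
        "-tra", train_ali,        # alignment of training corpus
        "-s", test_src,           # source side of test data
        "-r", reference,          # reference translation of test data
        "-h", system_output,      # system output for test data
        "-ra", reference_ali,     # alignment of source and reference
        "-ha", system_ali,        # alignment of source and system output
        "-o", output_dir,         # the $CGI folder
    ]


def run_indexer(command):
    """Run the indexer and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)
```

Passing an argument list instead of a single string avoids quoting problems when $CGI contains spaces, as the Windows path in the Apache example does.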
How to use the viewer
Now if your web server is running and configured properly and your index and data files have been prepared in the correct place, launch your web browser and point it to http://localhost/cgi/index.pl.