
Institute of Formal and Applied Linguistics Wiki






Addicter

Addicter stands for Automatic Detection and DIsplay of Common Translation ERrors. It will be a set of tools (mostly scripts written in Perl) that help with error analysis for machine translation.

Work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https://failfinder.googlecode.com/svn/trunk/. The nucleus that existed just after the MT Marathon (4 Feb 2010) is Addicter version 0.1; the low number reflects that it was by no means deemed a final product.

In 2011, the viewer was joined by an automatic error recognizer and classifier, thanks to Mark Fishel. Development has moved to the ÚFAL StatMT SVN repository (i.e. failfinder.googlecode.com is no longer maintained).

Currently, Addicter can do the following:

Viewing and browsing is done through a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). Words in sentences are clickable, so the user can quickly navigate to examples and summaries for words other than the current one. The obvious drawback is that access to a web server is needed. A small subset, the test data browser, can also be generated as static HTML files and viewed without a web server.

There is another subpage for Addicter in this wiki that lies in the external namespace, so it can be used for external collaboration.

Installation

How to install and configure Apache

Microsoft Windows

This tutorial currently focuses on installing Apache HTTP Server on Microsoft Windows. If you are an experienced user of another operating system and wish to share advice, please feel free to contact me.

Ubuntu Linux

Install the Apache HTTP server package. After successful installation, there should be a file /etc/apache2/sites-enabled/000-default. Edit it (you need root permissions). There should be a section similar to the following:

	ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
	<Directory "/usr/lib/cgi-bin">
		AllowOverride None
		Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
		Order allow,deny
		Allow from all
	</Directory>

Either create a copy of the section with a new alias and path (e.g. ScriptAlias /addicter-cgi/ /home/user/addicter/cgi/) or use /usr/lib/cgi-bin (or whatever folder you see by default) for your Addicter CGI scripts and data (see below).
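As a rough sketch, a copied section for the example alias could look like the fragment below. The alias /addicter-cgi/ and the path /home/user/addicter/cgi are just the illustrative values from the paragraph above, and the remaining directives are copied unchanged from the default section. On Apache 2.4 and later, the Order/Allow pair is normally replaced by a single "Require all granted" line.

```
# Example values only: adjust the alias and the path to your own setup.
ScriptAlias /addicter-cgi/ /home/user/addicter/cgi/
<Directory "/home/user/addicter/cgi">
	AllowOverride None
	Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
	# Apache 2.2 syntax, as in the default section; on 2.4+ use: Require all granted
	Order allow,deny
	Allow from all
</Directory>
```

Apache has to be restarted or reloaded before a new alias takes effect.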

How to install Addicter

We use $CGI to refer to the path you registered with Apache as containing CGI scripts (using the ScriptAlias directive).

Alignment viewer

Before invoking the viewer, you need to run an indexing script over your aligned corpus. It will create a bunch of index files that will later tell the viewer where to look for examples of a particular word. The indexer needs the following input files:


The indexer splits the output index into multiple files in order to reduce the size of any individual file. All index files must be stored in the experiment subfolder of $CGI so that the CGI scripts can find them.
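To make the idea of a word index concrete, here is a minimal Python sketch of an inverted index that maps each word to the sentences containing it, and then groups the entries into several smaller buckets. It only illustrates the concept: it is not the format that addictindex.pl writes, and the function names and the "split by first character" scheme are assumptions made for this example.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each token to the list of 0-based sentence numbers that contain it."""
    index = defaultdict(list)
    for number, sentence in enumerate(sentences):
        # set() ensures a sentence is listed once even if the word repeats in it.
        for token in set(sentence.split()):
            index[token].append(number)
    return dict(index)


def shard_index(index):
    """Group index entries by the first character of the word.

    Hypothetical splitting scheme, used here only to show how one large
    index can be broken into several smaller ones.
    """
    shards = defaultdict(dict)
    for word, numbers in index.items():
        shards[word[0]][word] = numbers
    return dict(shards)
```

A viewer built on such an index would open only the bucket for the clicked word instead of scanning the whole corpus.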

How to prepare a corpus for viewing

We assume that your corpus is already sentence-aligned and tokenized, i.e. the source and target files have the same number of lines (sentences, segments), and tokens (words, punctuation) on each line are space-separated. If you are using Addicter to analyze errors made by a machine translation system, you probably already have such a corpus. You may also want to use a lowercased version of your corpus. Unless stated otherwise, all files are supposed to be plain text files in the UTF-8 encoding.
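These assumptions are easy to check before indexing. The Python sketch below compares the line counts of a set of files and writes a lowercased copy of a file. It is not part of Addicter, and the function names are made up for this example.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def is_parallel(paths):
    """Return True if all files have the same number of lines."""
    return len({count_lines(path) for path in paths}) <= 1


def lowercase_file(src_path, dst_path):
    """Write a lowercased copy of src_path to dst_path, line by line."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.lower())
```

For instance, is_parallel(["test.en", "test.hi", "test.system.hi"]) should be true for the test files passed to the indexer.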

You will also need some alignment files that define bi-directional word alignments. If you have trained a statistical MT system such as Moses, chances are that you already have such files for the training data. They result from the first three steps of the Moses training pipeline, namely from two runs of Giza++ and an alignment symmetrization algorithm. In order to get alignments for test data, too, you can do the following:

Once all the input files are ready, the indexer is invoked as follows:

addictindex.pl \
    -trs train.en -trt train.hi -tra train.ali \
    -s test.en -r test.hi -h test.system.hi -ra test.ali -ha test.system.ali \
    -o $CGI

The indexer will copy the input files and output all index files into the $CGI folder where the CGI scripts will find them.
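Before running the indexer, it can be worth checking that the alignment files match the corpus. The Python sketch below assumes Moses-style alignment lines, i.e. space-separated pairs such as "0-0 1-2" with zero-based source and target token positions. This page does not state which alignment format Addicter expects, so treat the format and the function names as assumptions.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs (assumed Moses-style) into (source, target) index tuples."""
    pairs = []
    for item in line.split():
        source, _, target = item.partition("-")
        pairs.append((int(source), int(target)))
    return pairs


def alignment_errors(source_line, target_line, alignment_line):
    """Return the alignment points that fall outside either sentence."""
    n_source = len(source_line.split())
    n_target = len(target_line.split())
    return [
        (i, j)
        for i, j in parse_alignment(alignment_line)
        if not (0 <= i < n_source and 0 <= j < n_target)
    ]
```

An empty result for every line means that all alignment points refer to existing tokens. A non-empty result usually indicates that the alignment was computed on a differently tokenized version of the corpus.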

How to invoke the error classifier

The error classifier currently uses its own monolingual word alignment of the reference translation and the hypothesis. It is invoked as follows:

${ADDICTER}/testchamber/align-hmm.pl ref.txt hyp.txt > tcali.txt
${ADDICTER}/testchamber/finderrs.pl src.txt hyp.txt ref.txt tcali.txt > tcerr.txt
${ADDICTER}/testchamber/errsummary.pl tcerr.txt

Place the files tcali.txt and tcerr.txt in the experiment subfolder of $CGI and the error classes will be displayed during test data browsing in the viewer.
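The three steps can also be chained from a small wrapper that writes tcali.txt and tcerr.txt straight into the experiment folder. The Python sketch below only reuses the script names and argument order shown above. The wrapper itself and its function names are not part of Addicter, and it assumes the Perl scripts are executable.

```python
import subprocess
from pathlib import Path


def classifier_commands(addicter_dir, src, hyp, ref, experiment_dir):
    """Build the three commands from the text plus the file each one writes to."""
    chamber = Path(addicter_dir) / "testchamber"
    ali = Path(experiment_dir) / "tcali.txt"
    err = Path(experiment_dir) / "tcerr.txt"
    return [
        ([str(chamber / "align-hmm.pl"), ref, hyp], ali),
        ([str(chamber / "finderrs.pl"), src, hyp, ref, str(ali)], err),
        ([str(chamber / "errsummary.pl"), str(err)], None),  # prints its summary
    ]


def run_classifier(addicter_dir, src, hyp, ref, experiment_dir):
    """Run the steps in order, stopping at the first one that fails."""
    commands = classifier_commands(addicter_dir, src, hyp, ref, experiment_dir)
    for argv, output in commands:
        if output is None:
            subprocess.run(argv, check=True)
        else:
            # Redirect standard output to the file, like '>' in the shell.
            with open(output, "w", encoding="utf-8") as handle:
                subprocess.run(argv, check=True, stdout=handle)
```

Keeping the command construction separate from the execution makes it easy to print and inspect the commands before anything is run.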

How to use the viewer

Now if your web server is running and configured properly and your index and data files have been prepared in the correct place, launch your web browser and point it to http://localhost/cgi/index.pl.

Acknowledgements

This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838 (2010), by the grants of the Czech Science Foundation no. P406/11/1499 and P406/10/P259 and the Estonian Science Foundation target financed theme SF0180078s08 (2011).

