Addicter
Addicter stands for Automatic Detection and DIsplay of Common Translation ERrors. It will be a set of tools (mostly scripts written in Perl) that help with error analysis for machine translation.
Work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https://failfinder.googlecode.com/svn/trunk/. The nucleus that existed just after the MT Marathon (4 Feb 2010) is labeled Addicter version 0.1 to reflect that it is by no means a final product. Even so, it can already do a useful job.
Currently, Addicter can view and browse aligned corpora, look up example words in context, and summarize the known alignments of a given word. Viewing and browsing are handled by a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). The obvious drawback is that access to a web server is needed.
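To illustrate what "summarizing the known alignments of a given word" can mean in practice, here is a minimal Python sketch. It is not Addicter's own code (which is written in Perl). It assumes that each alignment line holds space-separated `i-j` pairs of 0-based source and target token positions, as produced by the Moses training pipeline; the function name `summarize_alignments` and the toy data are hypothetical.

```python
from collections import Counter


def summarize_alignments(src_lines, tgt_lines, ali_lines, word):
    """Count the target-side counterparts of `word` in an aligned corpus."""
    counts = Counter()
    for src, tgt, ali in zip(src_lines, tgt_lines, ali_lines):
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        # Assumed format: "i-j" pairs, source position i linked to target position j.
        links = {}
        for pair in ali.split():
            i, j = pair.split("-")
            links.setdefault(int(i), []).append(int(j))
        for i, token in enumerate(src_tokens):
            if token != word:
                continue
            targets = sorted(links.get(i, []))
            # An unaligned occurrence is counted under the empty string.
            counts[" ".join(tgt_tokens[j] for j in targets)] += 1
    return counts


# Toy corpus, for illustration only.
src = ["the house is small", "the small house"]
tgt = ["das Haus ist klein", "das kleine Haus"]
ali = ["0-0 1-1 2-2 3-3", "0-0 1-1 2-2"]
summary = summarize_alignments(src, tgt, ali, "small")
```

On the toy corpus, `summary` records one occurrence of "small" aligned to "klein" and one aligned to "kleine".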
Installation
- Install a web server, unless you already have access to one (local or remote). For instance, the Apache web server is available for at least Linux and MS Windows, and it's free. Configure your web server to work with CGI scripts written in Perl.
- To be able to generate the alignments that Addicter displays, you need Giza++ or an equivalent tool. The first few training steps of the Moses suite will do.
- Check out the Addicter code from the Failfinder SVN repository.
Alignment viewer
Before invoking the viewer, you need to run an indexing script over your aligned corpus. It will create a bunch of index files that will later tell the viewer where to look for examples of a particular word. The indexer needs the following input files:
- train.src … source side of training corpus
- train.tgt … target side of training corpus
- train.ali … alignment of training corpus
- test.src … source side of test data
- test.tgt … reference translation of test data
- test.ali … alignment of the source and reference translation of test data
- test.system.tgt … system output for test data
- test.system.ali … alignment of the source and the system output for test data
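The idea of an index that tells the viewer where to look for examples of a particular word can be sketched as a simple inverted index. The following Python snippet is an illustration only and not the actual Addicter indexer. It assumes one whitespace-tokenized sentence per line, and the function name `build_word_index` is hypothetical.

```python
from collections import defaultdict


def build_word_index(lines):
    """Map every token to the 0-based numbers of the sentences that contain it."""
    index = defaultdict(list)
    for sentence_no, line in enumerate(lines):
        # set() ensures a word repeated within one sentence is recorded once.
        for token in set(line.split()):
            index[token].append(sentence_no)
    return dict(index)


# Toy stand-in for the contents of train.src.
train_src = ["the house is small", "the small house", "a big house"]
index = build_word_index(train_src)
```

With a real corpus, an open file handle can be passed instead of the list, for example `build_word_index(open("train.src", encoding="utf-8"))`.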
The indexer splits the output index into multiple files in order to reduce the size of any individual file. All index files must be stored in the same folder as the viewing CGI scripts.
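One possible way to split an index across several files, so that a lookup only has to open one small file, is to assign each word to a shard by a stable hash. The sketch below is an assumption about how such a scheme could work. The real Addicter file names and index format are not described here, so the `index%03d.json` naming, the JSON format and the functions `shard_name`, `write_shards` and `lookup` are all hypothetical.

```python
import json
import os
import tempfile
import zlib


def shard_name(word, n_shards):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return "index%03d.json" % (zlib.crc32(word.encode("utf-8")) % n_shards)


def write_shards(index, folder, n_shards=4):
    """Distribute the word index over at most n_shards files in `folder`."""
    shards = {}
    for word, sentence_nos in index.items():
        shards.setdefault(shard_name(word, n_shards), {})[word] = sentence_nos
    for name, content in shards.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            json.dump(content, f, ensure_ascii=False)
    return sorted(shards)


def lookup(word, folder, n_shards=4):
    """Open only the shard that can contain `word` and return its sentence numbers."""
    path = os.path.join(folder, shard_name(word, n_shards))
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as f:
        return json.load(f).get(word, [])


# Toy index written to a temporary folder, for illustration only.
demo_index = {"house": [0, 1, 2], "small": [0, 1], "big": [2]}
demo_folder = tempfile.mkdtemp()
written = write_shards(demo_index, demo_folder)
```

The same `n_shards` value must be used for writing and for lookup, since it determines which file a word lands in.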