Addicter

Addicter stands for Automatic Detection and DIsplay of Common Translation ERrors. It will be a set of tools (mostly scripts written in Perl) that help with error analysis for machine translation.

Work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https://failfinder.googlecode.com/svn/trunk/. The nucleus that existed just after the MT Marathon (4 Feb 2010) is labeled Addicter version 0.1, to reflect that it is by no means a final product. Even so, it can already do a useful job.

Currently, Addicter can view and browse aligned corpora, look up example occurrences of a word in context, and summarize the known alignments of a given word. Viewing and browsing are handled by a web server that generates web pages dynamically, which avoids pre-generating millions of static HTML documents. The obvious drawback is that access to a web server is needed.
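To make "summarizing the known alignments of a given word" concrete, here is a minimal sketch in Python. It is not Addicter's own code (Addicter is written in Perl). It assumes that each line of the alignment file holds space-separated `i-j` pairs of zero-based token positions, as produced by the Moses training pipeline. The function name `summarize_alignments` and the toy sentences are hypothetical.

```python
from collections import Counter

def summarize_alignments(word, src_lines, tgt_lines, ali_lines):
    """Count the target-side tokens aligned to `word` across a corpus.

    Assumption: each alignment line contains space-separated "i-j" pairs,
    where i is a zero-based source position and j a zero-based target position.
    """
    counts = Counter()
    for src, tgt, ali in zip(src_lines, tgt_lines, ali_lines):
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        for pair in ali.split():
            i, j = (int(x) for x in pair.split("-"))
            if src_tokens[i] == word:
                counts[tgt_tokens[j]] += 1
    return counts

# Toy data, invented for illustration only.
src = ["das Haus ist klein", "das ist ein Haus"]
tgt = ["the house is small", "this is a house"]
ali = ["0-0 1-1 2-2 3-3", "0-0 1-1 2-2 3-3"]

summary = summarize_alignments("das", src, tgt, ali)
print(summary.most_common())
```

In this toy corpus, "das" is aligned once to "the" and once to "this", so the summary lists both with a count of 1.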

Installation

Install a web server, unless you already have access to one (local or remote). For instance, the Apache web server is available for at least Linux and MS Windows, and it's free. Configure your web server to work with CGI scripts written in Perl.
To be able to generate alignments that will be displayed by Addicter, you need Giza++ or an equivalent aligner. The first few training steps of the Moses suite will do.
Check out the Addicter code from the Failfinder SVN repository.

Alignment viewer

Before invoking the viewer, you need to run an indexing script over your aligned corpus. It creates a set of index files that later tell the viewer where to look for examples of a particular word. The indexer needs the following input files:

train.src … source side of training corpus
train.tgt … target side of training corpus
train.ali … alignment of training corpus
test.src … source side of test data
test.tgt … reference translation of test data
test.ali … alignment of the source and reference translation of test data
test.system.tgt … system output for test data
test.system.ali … alignment of the source and the system output for test data

The indexer splits the output index into multiple files in order to keep each individual file small. All index files must be stored in the same folder as the viewing CGI scripts.
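The idea of a word index split across several files can be sketched as follows. This is only an illustration in Python, not the actual Addicter indexer, whose file names and index format are not described here. The shard count, the `index<N>.json` file names, the JSON format and all function names are assumptions made for the example.

```python
import json
import os
import tempfile
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical; how the real indexer splits its output is not documented here

def shard_of(word):
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(word.encode("utf-8")) % NUM_SHARDS

def build_index(lines):
    """Map each word to the list of sentence numbers in which it occurs."""
    index = defaultdict(list)
    for sent_no, line in enumerate(lines):
        for word in set(line.split()):
            index[word].append(sent_no)
    return index

def write_shards(index, folder):
    """Write the index as NUM_SHARDS smaller files instead of one large file."""
    shards = [dict() for _ in range(NUM_SHARDS)]
    for word, sent_nos in index.items():
        shards[shard_of(word)][word] = sent_nos
    for n, shard in enumerate(shards):
        path = os.path.join(folder, f"index{n}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(shard, f, ensure_ascii=False)

def lookup(word, folder):
    """Open only the one shard that can contain `word`."""
    path = os.path.join(folder, f"index{shard_of(word)}.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f).get(word, [])

# Toy corpus, invented for illustration only.
corpus = ["das Haus ist klein", "das ist ein Haus", "ein Hund"]
folder = tempfile.mkdtemp()
write_shards(build_index(corpus), folder)
print(lookup("Haus", folder))
```

With this layout a lookup reads a single small file rather than the whole index, which is the kind of benefit the split is meant to bring to a viewer that generates pages on demand.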


Institute of Formal and Applied Linguistics Wiki
