Differences
This shows you the differences between two versions of the page.
user:zeman:addicter [2010/02/22 11:56] zeman created |
user:zeman:addicter [2012/05/16 13:35] zeman Publications. |
||
---|---|---|---|
Line 3: | Line 3: | ||
// | // | ||
- | The work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https:// | + | The work on Addicter started at the MT Marathon 2010 in Dublin, within a broader 5-day project called Failfinder (Dan Zeman, Ondřej Bojar, Martin Popel, David Mareček, Jon Clark, Ken Heafield, Qin Gao, Loïc Barrault). The code that resulted from the project can be freely downloaded from https:// |
+ | |||
+ | In 2011, the viewer was supplemented with an automatic error recognizer and classifier, thanks to Mark Fishel. Development has moved to the ÚFAL StatMT SVN repository (i.e. '' | ||
+ | |||
+ | Currently, Addicter can do the following: | ||
+ | * Find erroneous tokens and classify the errors in a way similar to Vilar' | ||
+ | * Browse the test data, sentence by sentence, and show aligned source sentence, reference translation and system hypothesis. | ||
+ | * Browse aligned training corpus, look for example words in context. | ||
+ | * Show lines of the phrase table that contain a given word. | ||
+ | * Summarize alignments of a given word. This feature can also serve as a primitive corpus-based dictionary. | ||
+ | * In the near future, we also plan to add searching and grouping of words sharing the same lemma. That way morphological errors will be highlighted. | ||
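The alignment summary mentioned in the list above can be pictured with a small shell sketch. It is not Addicter's own code: the file names (`src.txt`, `tgt.txt`, `ali.txt`) are hypothetical, and the alignment format (space-separated, zero-based `i-j` pairs, as commonly produced by Moses-style symmetrization) is an assumption, not something this page specifies.

```shell
#!/bin/sh
# Sketch only: counts which target words are aligned to one source word.
# File names and the "i-j" alignment format are assumptions for illustration.
set -eu
work=$(mktemp -d)
cd "$work"

# Toy parallel corpus with one alignment line per sentence pair.
printf 'the house\nthe small house\n' > src.txt
printf 'das Haus\ndas kleine Haus\n' > tgt.txt
printf '0-0 1-1\n0-0 1-1 2-2\n' > ali.txt

word=house

# Read the three files side by side and tally aligned target words.
paste src.txt tgt.txt ali.txt | awk -F '\t' -v w="$word" '
{
    split($1, s, " ")          # source tokens
    split($2, t, " ")          # target tokens
    na = split($3, a, " ")     # alignment links
    for (k = 1; k <= na; k++) {
        split(a[k], ij, "-")
        i = ij[1] + 1          # awk arrays are one-based
        j = ij[2] + 1
        if (s[i] == w) count[t[j]]++
    }
}
END { for (x in count) print count[x], x }
' | sort -k1,1nr -k2,2 > summary.txt

cat summary.txt
```

Sorting by frequency puts the most common counterpart first, which is what makes such a summary usable as a rough dictionary.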
+ | |||
+ | Viewing and browsing are handled by a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). Words in sentences are clickable, so the user can quickly navigate to examples and summaries of words other than the current one. If you have access to a web server, you may use Addicter with it; otherwise you can use Addicter' | ||
+ | |||
+ | There is another subpage for Addicter in this wiki that lies in the external name space, thus it can be used for [[external: | ||
+ | |||
+ | ===== Installation ===== | ||
+ | |||
+ | * Addicter is written in Perl, so you need a Perl interpreter to run it. This is usually no problem on Unix-like systems, but you may need to install Perl version ≥ 5.8 if you are working on Windows. Options include [[http:// | ||
+ | * There is now no need to install or have access to a web server; nevertheless, | ||
+ | * To be able to generate alignments that will be displayed by Addicter, you need [[http:// | ||
+ | * Check out Addicter code from the ÚFAL SVN repository (see below how). | ||
+ | |||
+ | ==== How to install and configure Apache ==== | ||
+ | |||
+ | **NOTE:** Since September 2011, it is not necessary to install a local web server, so skip this section if you do not want it. Addicter now comes with a script called '' | ||
+ | |||
+ | === Microsoft Windows === | ||
+ | |||
+ | This tutorial currently focuses on installing Apache HTTP Server on Microsoft Windows. If you are an experienced user of another operating system and wish to share advice, please feel free to [[mailto: | ||
+ | |||
+ | * Download the Apache HTTP Server from http:// | ||
+ | * Configure the server. This essentially means editing a configuration file and restarting the server. Depending on your system settings, Apache version etc., the configuration file will reside in a path similar to this: '' | ||
+ | * Look for a '' | ||
+ | * Under Windows, you will also want to set < | ||
+ | * CGI scripts will not run under the same environment as a user command line. They will not see the '' | ||
+ | * Restart the server. On the main Windows panel, there is (typically in the lower right corner) a set of icons, including a new one for Apache. Right-click on it, select Open Apache Monitor, then Restart. | ||
+ | |||
+ | === Ubuntu Linux === | ||
+ | |||
+ | Install the Apache HTTP server package. After successful installation, | ||
+ | |||
+ | < | ||
+ | ScriptAlias /cgi-bin/ / | ||
+ | < | ||
+ | AllowOverride None | ||
+ | Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch | ||
+ | Order allow, | ||
+ | Allow from all | ||
+ | </ | ||
+ | </ | ||
+ | |||
+ | Either create a copy of the section with a new alias and path (e.g. '' | ||
+ | |||
+ | ==== How to install Addicter ==== | ||
+ | |||
+ | We use '' | ||
+ | |||
+ | * Addicter uses some general-purpose Perl libraries that are maintained in a separate repository. Download these first, using username '' | ||
+ | export PERL5LIB=~/ | ||
+ | * Check out the current version of Addicter from the StatMT SVN repository, again using username '' | ||
+ | * There are three subfolders, '' | ||
+ | * For every experiment whose data is to be explored by Addicter, create a subfolder in '' | ||
+ | |||
+ | ===== Alignment viewer ===== | ||
+ | |||
+ | Before invoking the viewer, you need to run an indexing script over your aligned corpus. It creates a set of index files that later tell the viewer where to look for examples of a particular word. The indexer needs the following input files: | ||
+ | |||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | * '' | ||
+ | |||
+ | <!--The '' | ||
+ | |||
+ | The indexer splits the output index into multiple files in order to reduce the size of any individual file. All index files must be stored in the experiment subfolder of '' | ||
+ | |||
+ | ==== How to prepare a corpus for viewing ==== | ||
+ | |||
+ | We assume that your corpus is already | ||
+ | |||
+ | You will also need some alignment files that define bi-directional word alignments. If you have trained a statistical MT system such as Moses, chances are that you already have such files for the training data. They result from the first three steps of the Moses training pipeline, namely from two runs of Giza++ and an alignment symmetrization algorithm. In order to get alignments for test data, too, you can do the following: | ||
+ | |||
+ | * Join the source training file with the source test file. Similarly, join the target sides of the two data sets. | ||
+ | * Re-run Giza++ over the joint corpus. | ||
+ | * The resulting alignment file has the same number of lines as the source and the target side of the corpus. By cutting off the last N lines, you easily separate the training and test alignments from each other. | ||
+ | * Alternatively, | ||
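The join-and-split steps above can be sketched in shell. The file names follow the indexer example on this page where possible (`train.en`, `test.hi`, and so on); `joint.en`, `joint.hi` and `joint.ali` are hypothetical names. The aligner run itself is replaced by a stand-in that writes one dummy line per sentence pair, so the sketch runs on its own; a real run would call Giza++ and the symmetrization step at that point.

```shell
#!/bin/sh
# Sketch only: joint.* names are illustrative, and the aligner is simulated.
set -eu
work=$(mktemp -d)
cd "$work"

# Toy data standing in for a real corpus (one sentence per line).
printf 'a b\nc d\ne f\n' > train.en
printf 'x y\nz w\nu v\n' > train.hi
printf 'g h\n' > test.en
printf 's t\n' > test.hi

# 1. Join training and test data, training first, on both sides.
cat train.en test.en > joint.en
cat train.hi test.hi > joint.hi

# 2. Stand-in for the aligner: one output line per sentence pair.
awk '{ print "alignment for sentence " NR }' joint.en > joint.ali

# 3. Split: the last N lines belong to the test set.
n_test=$(( $(wc -l < test.en) ))
n_all=$(( $(wc -l < joint.ali) ))
n_train=$(( n_all - n_test ))
head -n "$n_train" joint.ali > train.ali
tail -n "$n_test" joint.ali > test.ali
```

Keeping the training data first in both joined files is what makes the final `head`/`tail` split correct, so the order must be the same on the source and target side.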
+ | |||
+ | Once all the input files are ready, the indexer is invoked as follows: | ||
+ | |||
+ | < | ||
+ | -trs train.en -trt train.hi -tra train.ali \ | ||
+ | -s test.en -r test.hi -h test.system.hi -ra test.ali -ha test.system.ali \ | ||
+ | -o $CGI</ | ||
+ | |||
+ | The indexer will copy the input files and output all index files into the '' | ||
+ | |||
+ | ==== How to invoke the error classifier ==== | ||
+ | |||
+ | The error classifier currently uses its own monolingual word alignment between the reference translation and the hypothesis. It is invoked as follows: | ||
+ | |||
+ | <code bash> | ||
+ | |||
+ | and it creates the files '' | ||
+ | |||
+ | Place the files '' | ||
+ | |||
+ | ==== How to use the viewer ==== | ||
+ | |||
+ | First make sure that your web server is running and configured properly and that your index and data files have been prepared in the correct place. If you do not use your own web server, invoke the script '' | ||
+ | |||
+ | < | ||
+ | |||
+ | which is the URL you should point your browser to. The server uses a randomly picked port number unless you specify one as a command-line parameter: '' | ||
+ | |||
+ | In the browser, you will see a list of experiments (all subfolders of '' | ||
+ | |||
+ | ===== Acknowledgements ===== | ||
+ | |||
+ | This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838 (2010), by the grants of the Czech Science Foundation no. P406/ | ||
+ | |||
+ | ===== Publications ===== | ||
+ | |||
+ | * Mark Fishel, Ondřej Bojar, Daniel Zeman, Jan Berka: // | ||
+ | * Daniel Zeman, Mark Fishel, Jan Berka, Ondřej Bojar: // | ||
+ | * Jan Berka, Ondřej Bojar, Mark Fishel, Maja Popović, Daniel Zeman: // | ||
- | Currently, Addicter can view and browse aligned corpora, look for example words in context and summarize known alignments of a given word. The viewing and browsing is performed using a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). The obvious drawback is that access to a web server is needed. |