Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:addicter [2010/02/22 13:26]
zeman How to install and configure Apache.
user:zeman:addicter [2011/05/09 13:41]
zeman Link to future external page on Addicter.
Line 6: Line 6:
  
 Currently, Addicter can view and browse aligned corpora, look for example words in context and summarize known alignments of a given word. The viewing and browsing is performed using a web server that generates web pages dynamically (to avoid pre-generating millions of static HTML documents). The obvious drawback is that access to a web server is needed.
 +
 +There is another subpage for Addicter in this wiki that lies in the external namespace, so it can be used for [[external:addicter|external collaboration]].
  
 ===== Installation =====
Line 14: Line 16:
  
 ==== How to install and configure Apache ====
 +
 +=== Microsoft Windows ===
  
 This tutorial currently focuses on installing Apache HTTP Server on Microsoft Windows. If you are an experienced user of another operating system and wish to share advice, please feel free to [[mailto:zeman@ufal.mff.cuni.cz|contact me]].
Line 22: Line 26:
     * Under Windows, you will also want to set <code>ScriptInterpreterSource registry</code>. It tells the server that the Windows registry shall be used to figure out how to run a script (e.g., that the ''C:\Perl\Perl.exe'' binary must be run to interpret a ''.pl'' script).
   * Restart the server. On the main Windows panel, there is (typically in the lower right corner) a set of icons, including a new one for Apache. Right-click on it, select Open Apache Monitor, then Restart.
 +
 +=== Ubuntu Linux ===
 +
 +Install the Apache HTTP server package. After successful installation, there should be a file ''/etc/apache2/sites-enabled/000-default''. Edit it (you need root permissions). There should be a section similar to the following:
 +
 +<code>
 + ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
 + <Directory "/usr/lib/cgi-bin">
 + AllowOverride None
 + Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
 + Order allow,deny
 + Allow from all
 + </Directory>
 +</code>
 +
 +Either create a copy of the section with a new alias and path (e.g., ''ScriptAlias /addicter-cgi/ /home/user/addicter/cgi/'') or use ''/usr/lib/cgi-bin'' (or whatever folder you see by default) for your Addicter CGI scripts and data (see below).
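
The first option can be sketched as a small shell helper that prints a copy of the section for a new alias and path. The function name ''print_cgi_section'' is hypothetical; the directives are simply copied from the default section shown above, so adjust them if your default section looks different. The output is meant to be pasted into ''/etc/apache2/sites-enabled/000-default'' by hand.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of the default CGI section
# for a new alias and directory.
# $1 = URL alias (e.g. /addicter-cgi/)
# $2 = directory that will hold the CGI scripts
print_cgi_section() {
    alias_path=$1
    cgi_dir=${2%/}   # drop a trailing slash for the <Directory> line
    cat <<EOF
ScriptAlias $alias_path $cgi_dir/
<Directory "$cgi_dir">
    AllowOverride None
    Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
    Order allow,deny
    Allow from all
</Directory>
EOF
}

# Example usage, with the alias and path from the text above:
# print_cgi_section /addicter-cgi/ /home/user/addicter/cgi/
```

The server has to be restarted after the configuration file is edited, as in the Windows instructions.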
 +
 +==== How to install Addicter ====
 +
 +We use ''$CGI'' to refer to the path you registered with Apache as containing CGI scripts (using the ''ScriptAlias'' directive).
 +
 +  * Check out the current version of Addicter from the SVN repository. In Linux, the following command will do that: <code>svn checkout https://failfinder.googlecode.com/svn/trunk addicter</code> In Windows, you can use [[http://tortoisesvn.tigris.org/|TortoiseSVN]] to access the repository.
 +  * All you need at this moment is in the folder ''dan''. There are two subfolders, ''prepare'' and ''cgi''. Copy the contents of the ''cgi'' folder to ''$CGI''.
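
On Linux, the copying step might look like the sketch below. The function name ''install_cgi'' is hypothetical, the checkout is assumed to live in a folder called ''addicter'' (as in the ''svn checkout'' command above), and the scripts are assumed to end in ''.pl''. Writing to a system folder such as ''/usr/lib/cgi-bin'' will probably require root permissions.

```shell
#!/bin/sh
# Hypothetical helper: copy the contents of dan/cgi from an Addicter
# checkout into the CGI folder registered with Apache.
# $1 = path to the checkout (e.g. addicter)
# $2 = target CGI folder (the $CGI of this page)
install_cgi() {
    checkout_dir=$1
    cgi_dir=$2
    mkdir -p "$cgi_dir" || return 1
    # "dir/." copies the contents of the folder, not the folder itself
    cp -R "$checkout_dir/dan/cgi/." "$cgi_dir/" || return 1
    # Apache on Linux only runs CGI scripts that are executable
    for script in "$cgi_dir"/*.pl; do
        if [ -f "$script" ]; then
            chmod +x "$script"
        fi
    done
}

# Example usage:
# install_cgi addicter "$CGI"
```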
  
 ===== Alignment viewer =====
Line 35: Line 62:
   * ''test.system.tgt'' ... system output for test data
   * ''test.system.ali'' ... alignment of the source and the system output for test data
 +
 +The ''prepare'' folder contains some sample corpora in ''sample_data.zip''.
  
 The indexer splits the output index into multiple files in order to reduce the size of any individual file. All index files must be stored in the same folder as the viewing CGI scripts.
 +
 +==== How to prepare a corpus for viewing ====
 +
 +We assume that your corpus is already sentence-aligned and tokenized. I.e., source and target files have the same number of lines (sentences, segments), and tokens (words, punctuation) on each line are space-separated. If you are using Addicter to perform analysis of errors made by a machine translation system, you probably already have such a corpus. You may also want to use a lowercased version of your corpus. Unless stated otherwise, all files are supposed to be plain text files in the UTF-8 encoding.
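
The requirement that both sides have the same number of lines is easy to check before indexing. The following is a minimal sketch; the function name ''same_line_count'' is hypothetical, and the file names in the usage comment follow the indexer call shown later on this page.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if two files have the same number of lines.
# $1 = source side of the corpus, $2 = target side of the corpus
same_line_count() {
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    [ "$src_lines" -eq "$tgt_lines" ]
}

# Example usage:
# same_line_count train.en train.hi || echo "train.en and train.hi differ in length" >&2
```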
 +
 +You will also need some alignment files that define bi-directional word alignments. If you have trained a statistical MT system such as Moses, chances are that you already have such files for the training data. They result from the first three steps of the Moses training pipeline, namely from two runs of Giza++ and an alignment symmetrization algorithm. In order to get alignments for test data, too, you can do the following:
 +
 +  * Join the source training file with the source test file. Similarly, join the target sides of the two data sets.
 +  * Re-run Giza++ over the joint corpus.
 +  * The resulting alignment file has the same number of lines as the source and the target side of the corpus. By cutting off the last N lines, where N is the number of test sentences, you can easily separate the training and test alignments from each other.
 +  * Alternatively, you can use MGiza (presented at MT Marathon 2010 in Dublin) to align the test data. After the //aligned// training corpus is initially read, alignments for new sentences just fly out, so you do not need to wait several hours for Giza++ again. Also, unlike the above method, your original training alignment is now guaranteed to remain untouched by the new evidence.
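
For the first method (re-running Giza++ over the joint corpus), the final split can be sketched as follows. The function name ''split_alignment'' and the file name ''joint.ali'' are hypothetical. The sketch assumes that the test data were appended after the training data and that the alignment file has one line per sentence, as described above.

```shell
#!/bin/sh
# Hypothetical helper: split a joint alignment file into training and test parts.
# $1 = joint alignment file
# $2 = number of test sentences (N)
# $3 = output file for the training alignments
# $4 = output file for the test alignments
split_alignment() {
    joint=$1
    n_test=$2
    train_out=$3
    test_out=$4
    total=$(wc -l < "$joint" | tr -d ' ')
    n_train=$((total - n_test))
    head -n "$n_train" "$joint" > "$train_out"   # everything but the last N lines
    tail -n "$n_test" "$joint" > "$test_out"     # the last N lines
}

# Example usage, after joining the corpora with
#   cat train.en test.en > joint.en
#   cat train.hi test.hi > joint.hi
# and aligning the joint corpus:
# split_alignment joint.ali "$(wc -l < test.en)" train.ali test.ali
```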
 +
 +Once all the input files are ready, the indexer is invoked as follows:
 +
 +<code>addictindex.pl \
 +    -trs train.en -trt train.hi -tra train.ali \
 +    -s test.en -r test.hi -h test.system.hi -ra test.ali -ha test.system.ali \
 +    -o $CGI</code>
 +
 +The indexer will copy the input files and output all index files into the ''$CGI'' folder where the CGI scripts will find them.
 +
 +==== How to use the viewer ====
 +
 +Now if your web server is running and configured properly and your index and data files have been prepared in the correct place, launch your web browser and point it to http://localhost/cgi/index.pl.
 +
 +===== Acknowledgements =====
 +
 +This research has been supported by the grant of the Czech Ministry of Education no. MSM0021620838.
 +