====== Transductive learning for statistical machine translation ======
=== Nicola Ueffing and Gholamreza Haffari and Anoop Sarkar ===

===== Introduction =====

The paper is about the use of transductive semi-supervised methods that exploit monolingual source-language data in order to improve translation quality. "Transductive" means that the authors repeatedly translate sentences from the development or test set and use the generated translations to improve the SMT system. Transductive learning is also a means of adapting the SMT system to a new type of text.

The authors mention two SMT modelling problems which call for different learning strategies to improve translation quality:

  - SMT systems face a data sparseness issue even when a large bitext is available for a language pair.
  - For many language pairs the amount of available bilingual text is very limited.

The authors' hypothesis is that adding information derived from the source-language data can help in both settings.

===== Comments =====

  * The paper describes the transductive learning algorithm, **Algorithm 1**, very well; it is inspired by the Yarowsky algorithm [1][2]. A minimal sketch of the loop is given after this list.
  * In Algorithm 1, the translation model is first estimated from the sentence pairs in the bilingual data L. A set of source-language sentences U is then translated with the current model. In each iteration i, a subset of good translations together with their source sentences, T_i, is selected and added to the training data. These sentence pairs are replaced in every iteration; only the original data L stays fixed throughout the algorithm.
  * Algorithm 1 is built on three functions: **Estimate**, **Score** and **Select**.
  * The Estimate function estimates the model parameters, i.e. it trains the system. The authors use three variants: **Full Re-training**, **Additional Phrase Table** and **Mixture Model** (see the mixture-model sketch below).
  * The scoring function assigns a score to each translation t. The scoring functions used in the paper are **Length-normalized Score** and **Confidence Estimation** (see the scoring sketch below).
  * The selection function creates the additional training data T_i, which **Estimate** uses in iteration i+1 to augment the original bilingual data. The selection functions used in the paper are **Importance Sampling**, **Selection using a Threshold** and **Keep All** (see the selection sketch below).
  * Data filtering is applied to both the bilingual and the monolingual data so that only the part of the data relevant to the test data is kept.
  * Three evaluation metrics are used to assess translation quality: **BLEU**, **mWER** (multi-reference word error rate) and **mPER** (multi-reference position-independent word error rate).
  * Experiments are performed on the EuroParl and NIST corpora. On EuroParl, selection and scoring were carried out with importance sampling over length-normalized scores; the three experiments on this corpus did not produce a significant improvement in translation quality. The NIST experiments use three different test sets, and on each of them the best scores were obtained when threshold-based selection was combined with confidence estimation as the scoring method.
  * The main issue with the paper is that the number of iterations used to train the model is not stated. According to Figure 1, the curves reach their maxima at iteration 16 and iteration 18 on the train100k and train150k corpora, respectively. Our concern is that the authors may have known the BLEU scores on the test set beforehand and stopped training, or truncated the plot, at the iteration where BLEU is optimal according to those already computed scores.
  * Adding examples of sentences on which the improvement is achieved would have made the impact of this training scheme much easier to understand.
  * The terminology used for the feasibility setting in Table 3 is ambiguous: '**' is defined as marking experiments that produced a minimal improvement over the baseline, but this does not imply that the experiments marked with '*' achieved a significant improvement over the baseline.
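To make the structure of **Algorithm 1** concrete, the following is a minimal sketch of the transductive loop, assuming the Estimate, Score and Select steps are passed in as callables. The function and parameter names (''estimate'', ''decode'', ''score'', ''select'', the fixed iteration count) are our own placeholders for illustration, not the authors' implementation.

<code python>
# Minimal sketch of the transductive loop (Algorithm 1).
# bitext_L: list of (source, target) sentence pairs; monolingual_U: list of source sentences.

def transductive_train(bitext_L, monolingual_U, estimate, decode, score, select, iterations=20):
    """Repeatedly translate U, keep the best pairs, and re-estimate the model."""
    model = estimate(bitext_L)                       # Estimate on the original bitext L
    for _ in range(iterations):
        # Translate every monolingual sentence with the current model and score the result.
        scored = []
        for src in monolingual_U:
            hyp = decode(model, src)                 # current best translation of src
            scored.append((src, hyp, score(model, src, hyp)))
        T_i = select(scored)                         # Select: sampling, threshold, or keep-all
        # T_i replaces the pairs selected in the previous iteration;
        # only the original bitext L stays fixed across iterations.
        model = estimate(bitext_L + T_i)             # Estimate on L plus the selected pairs
    return model
</code>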
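For the **Mixture Model** variant of Estimate, the phrase table trained on L is interpolated with a phrase table trained on the selected pairs T_i. The sketch below assumes phrase tables represented as plain dictionaries from phrase pairs to probabilities and a hand-picked interpolation weight; both are simplifications for illustration, not the authors' setup.

<code python>
def mixture_phrase_table(table_L, table_T, weight_L=0.9):
    """Linearly interpolate two phrase tables (dicts mapping phrase pairs to probabilities)."""
    merged = {}
    for pair in set(table_L) | set(table_T):
        merged[pair] = (weight_L * table_L.get(pair, 0.0)
                        + (1.0 - weight_L) * table_T.get(pair, 0.0))
    return merged
</code>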
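The **Length-normalized Score** divides the log-probability that the decoder assigns to a hypothesis by the hypothesis length, so that translations of different lengths become comparable. One possible reading, assuming the decoder exposes a sentence-level log-probability, is sketched below; the confidence-estimation scoring is not reproduced here.

<code python>
def length_normalized_score(log_prob, hypothesis):
    """Divide the model's log-probability of the hypothesis by its length in tokens."""
    tokens = hypothesis.split()
    return log_prob / max(len(tokens), 1)
</code>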
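Two of the three selection rules can be sketched directly: **Selection using a Threshold** keeps every pair whose score exceeds a fixed threshold, and **Importance Sampling** draws pairs with probability proportional to their (exponentiated) score. The sketch below is simplified, sampling from the whole candidate pool rather than from per-sentence n-best lists, and the threshold and sample size are illustrative values, not those used in the paper.

<code python>
import math
import random

def select_by_threshold(scored, threshold=-2.0):
    """Keep the sentence pairs whose score exceeds a fixed threshold."""
    return [(src, hyp) for src, hyp, s in scored if s > threshold]

def select_by_importance_sampling(scored, sample_size=1000, seed=0):
    """Sample pairs with replacement, with probability proportional to exp(score)."""
    rng = random.Random(seed)
    weights = [math.exp(s) for _, _, s in scored]
    picked = rng.choices(scored, weights=weights, k=sample_size)
    return [(src, hyp) for src, hyp, _ in picked]
</code>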
===== Suggested Additional Reading =====

  * [1] D. Yarowsky. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proc. ACL.
  * [2] S. Abney. 2004. Understanding the Yarowsky Algorithm. Computational Linguistics, 30(3).

===== What do we like about the paper =====

  * Even a reader without any background knowledge of transductive learning will get a clear idea from this paper of what it is, how it works in the domain of SMT, and which functions have to be defined to train such a system. The paper is in fact a well-written piece of work.

===== What do we dislike about the paper =====

  * The semi-supervised learning scheme presented in this paper uses the source side of the test data during the training process, which limits the range of applications of the technique. For instance, this learning mechanism cannot be applied in an online translation system.

Comments by Bushra Jawaid