Statistical Post-Editing for a Statistical MT System
Hanna Béchara, Yanjun Ma, Josef van Genabith
MT Summit 2011
Presented by Rudolf Rosa
Report by Jindřich Helcl
This article is about statistical post-editing of the output of a statistical machine translation system. Its most interesting claim is that the authors achieved an improvement of about 2 BLEU points by pipelining two statistical MT systems, an approach that had until then been considered useless.
The paper frequently cites an earlier article by Simard et al. (2007), which was also briefly introduced at the beginning of the presentation and is available online.
A brief outline of the paper follows. The introduction surveys previous work and states that earlier attempts at this method yielded either no improvement or improvements that were not statistically significant.
- Data: The data for the experiment came from an English-French translation memory from Symantec. It comprised about 55k sentences (0.8M words) in each language. The paper calls the English training data E and the French data F.
- Architecture: The authors wanted to use the same kind of system for both translation and post-editing. To avoid training the post-editing system on output for sentences the first system had already seen, they built a third dataset F' using a 10-fold cross-“validation” approach (strictly speaking, it is not validation but part of training) applied to the output of the first translation system trained on E and F. They then trained the second system on F' and F to teach it to “translate” from the (French) output of the first system into “real-world” French.
- Enhancements: However, the basic architecture did not produce any notable improvement: there was a drop of 0.15 BLEU points against the baseline without post-editing in English-to-French translation, and an increase of only 0.65 BLEU points in French-to-English. The authors therefore introduced the following enhancements:
  - Contextual SPE: each translated word is concatenated with its English source word, separated by a hash sign, into a single token. The resulting dataset is called F'#E in the paper. With this enhancement, post-editing of the translated text can take the original text into account.
  - Next, they stripped the #-postfixes off untranslated (OOV) words.
  - Then, they aligned the source text with the translation and used the contextual enhancement only where the alignment weight exceeded a certain threshold.
With the last enhancement, they achieved an improvement of 2 BLEU points in French-English translation.
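The token construction behind Contextual SPE can be sketched as follows. This is only an illustration, not the authors' implementation: the function names, the (target, source, score) triple format, and the threshold value are assumptions made here, and the paper does not specify how the alignment weights are obtained.

```python
from typing import List, Optional, Set, Tuple

# One first-stage output word: (target_word, aligned_source_word or None, alignment_score).
# This triple format is a hypothetical convenience, not something defined in the paper.
AlignedToken = Tuple[str, Optional[str], float]


def contextual_tokens(aligned: List[AlignedToken], threshold: float = 0.0) -> List[str]:
    """Build F'#E-style tokens: 'target#source' where the alignment is trusted."""
    out = []
    for target, source, score in aligned:
        if source is not None and score >= threshold:
            out.append(f"{target}#{source}")
        else:
            # Unaligned or weakly aligned: keep the plain first-stage word.
            out.append(target)
    return out


def strip_postfix_if_oov(tokens: List[str], vocabulary: Set[str]) -> List[str]:
    """Drop the '#source' postfix from tokens the post-editing system has never seen."""
    return [t if t in vocabulary else t.split("#", 1)[0] for t in tokens]


aligned = [("renseigne", "populate", 0.9), ("la", "the", 0.2), ("table", None, 0.0)]
tokens = contextual_tokens(aligned, threshold=0.5)
print(tokens)  # ['renseigne#populate', 'la', 'table']
print(strip_postfix_if_oov(tokens, vocabulary={"la", "table"}))  # ['renseigne', 'la', 'table']
```

With `threshold=0.0` every aligned word receives its source postfix, which corresponds to plain Contextual SPE; raising the threshold falls back to the bare French word wherever the alignment is weak.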
The following topics concerning the article were discussed at the RG meeting:
- The size of the data (only 55k sentences) was seen as the main possible flaw of the experiment. On the other hand, the translation-memory data were said to be clean and free of duplicates. However, the authors do not explain why they used such a small dataset when other options are readily available. One possible explanation is that their translation system was built for the domain of the Symantec data, but this is not stated explicitly in the article.
- The paper states that a 10-fold cross-validation approach is used to build the new dataset. Many of us were confused by this statement and found it unclear what exactly the authors meant. We finally agreed that the new dataset is created fold by fold: the SMT system is trained on nine folds of E and F and then run on the source-language side of the remaining tenth fold.
- We found it pointless for the authors to present explicit results of Contextual SPE without removing the #-postfixes, as removing them right away is the obvious thing to do. This simple objection led us to the idea of removing the #-postfixes even before the OOV word is passed to the language model, since this could bring some improvement.
- When describing Contextual SPE with thresholding, the authors did not clarify how exactly they obtain the alignment after the first-stage translation from E to F'#E, to which the thresholding is then applied. They only mention “we do this using GIZA++ word-alignments”. We discussed two possibilities:
  1. Moses can also output the word alignment together with the translations. Although this alignment originates from GIZA++ (it is the alignment that was used to build the phrase table), Ondřej Bojar says it is not usual to describe this approach as “we do this using GIZA++ word-alignments”.
  2. A new GIZA++ model can be trained on (and applied to) the 55k sentence pairs (E, F').
  Method #1 should be more accurate, but it seems the authors used method #2.
- How could OOVs arise in post-editing? Suppose renseigne#populate is an OOV, i.e. it was not seen in the F'#E data. With method #1, this would imply that populate was never translated as renseigne in the first stage during the 10-fold cross-“validation”. How, then, could the first-stage SMT system translate populate as renseigne in the test data?
  - Either they used method #2, which misaligned populate with some other word (although the first stage actually translated it as renseigne).
  - Or the first-stage system trained on a particular 9/10 of the training data cannot translate populate as renseigne, but succeeds when trained on the full (10/10) training data.
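The fold-by-fold construction of F' that we agreed on, and the last explanation for OOVs, can be illustrated with a small sketch. Everything here is an assumption made for illustration: `build_f_prime`, the interleaved fold assignment, and the one-word dictionary "system" `toy_train` are hypothetical stand-ins, not the authors' setup or a real SMT toolkit.

```python
from typing import Callable, List, Sequence, Tuple

Translator = Callable[[str], str]
Trainer = Callable[[Sequence[Tuple[str, str]]], Translator]


def build_f_prime(pairs: Sequence[Tuple[str, str]], train: Trainer, k: int = 10) -> List[str]:
    """Translate each source sentence with a system trained on the other k-1 folds."""
    f_prime = [""] * len(pairs)
    for fold in range(k):
        # Assumed fold assignment: every k-th sentence, offset by the fold index.
        held_out = range(fold, len(pairs), k)
        held_out_set = set(held_out)
        training = [p for i, p in enumerate(pairs) if i not in held_out_set]
        translate = train(training)
        for i in held_out:
            f_prime[i] = translate(pairs[i][0])
    return f_prime


def toy_train(training: Sequence[Tuple[str, str]]) -> Translator:
    """Hypothetical stand-in for SMT training: memorise whole-sentence translations."""
    table = dict(training)
    # Unknown input is passed through untranslated, like an OOV word.
    return lambda sentence: table.get(sentence, sentence)


pairs = [("populate", "renseigne"), ("table", "tableau"), ("file", "fichier")]
print(build_f_prime(pairs, toy_train, k=3))  # ['populate', 'table', 'file']
print(toy_train(pairs)("populate"))          # renseigne
```

In this toy case populate occurs only once, so the system trained without its fold leaves it untranslated and renseigne#populate never appears in F'#E, while the system trained on all the data does produce renseigne at test time.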
Although the structure of the paper was often criticized and possible flaws were found, the article was considered readable and simple enough to serve as the opening article for this semester's reading group.