courses:rg:2012:spe-for-smt: revised 2012/10/12 13:11 by jindra.helcl; current revision 2012/10/12 14:17 by popel (my notes)
MT Summit 2011
[[http://
Presented by Rudolf Rosa
A brief outline of the paper follows. In the introduction,
  * **Data:** The data for the experiment came from an English-French translation memory from Symantec, about 55k sentences (0.8M words) in each language. In the paper, they call the English training data **E** and the French data **F**.
  * **Architecture:**
  * **Enhancements:**
    * Contextual SPE, which means that each translated word is created by concatenating the first-stage translation and the English source word, separated by a hash sign, into one resulting token. This new dataset is called **F'#E** in the paper. With this enhancement,
    * Next, they stripped off the #-postfixes of non-translated (OOV) words.
    * Then they computed an alignment between the source text and the translation and used the contextual enhancement only where the alignment weight was over some threshold.
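The three steps above can be sketched in a few lines. This is only an illustration of the token format, assuming per-sentence lists of source words, first-stage translations, and alignment weights; the function name, the OOV convention, and the 0.5 threshold are my assumptions, not values from the paper.

```python
# Sketch of the Contextual SPE (F'#E) token construction:
#   translation#source per word, with OOV #-postfixes stripped and the
#   context attached only for sufficiently confident alignments.
# All names and the threshold are illustrative assumptions.

def contextual_tokens(english, translated, weights, threshold=0.5):
    out = []
    for eng, trans, w in zip(english, translated, weights):
        if trans is None:          # OOV word: keep only the English form
            out.append(eng)
        elif w > threshold:        # confident alignment: attach context
            out.append(f"{trans}#{eng}")
        else:                      # weak alignment: plain translation
            out.append(trans)
    return out

tokens = contextual_tokens(
    english=["the", "file", "foobar"],
    translated=["le", "fichier", None],  # "foobar" left untranslated (OOV)
    weights=[0.9, 0.8, 0.0],
)
print(tokens)  # ['le#the', 'fichier#file', 'foobar']
```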
  * In the paper, they state that they use a 10-fold cross-validation approach to build a new dataset. Many of us were confused by this statement and found it unclear what exactly the authors meant. We finally agreed that the new dataset is created fold by fold, by training the SMT system on the other nine folds of **E** and **F** and then running it on the tenth fold of the source language.
  * We found it pointless for the authors to present explicit results of Contextual SPE without removing the #-postfixes.
  * When the authors wrote about Contextual SPE with thresholding,
    - Moses can also output the word alignment together with the translations. Although this alignment originates from GIZA++ (it is //the// alignment that was used to build the phrase table), Ondřej Bojar says it is not usual to describe this approach as //"we do this using GIZA++ word-alignments"//.
    - (New) GIZA++ can be trained (and applied) on the 55k sentence pairs (**E**, **F'**).
    Method #1 should be more accurate, but it seems the authors used method #2.
  * How could OOVs arise in post-editing?
    - Either because they used method #2, which mis-aligned //
    - Or because the first-stage system trained on a particular 9/10 of the training data cannot translate //
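The fold-by-fold construction we agreed on can be sketched as follows. Here `train_smt` and `translate` are placeholders standing in for the whole SMT training and decoding pipeline (not real API calls), and the round-robin fold split is my assumption; the point is only that every sentence of **E** is translated by a system that never saw it in training.

```python
# Sketch of the 10-fold construction of the first-stage output F':
# each fold of E is decoded by a system trained on the other 9 folds.
# train_smt/translate are placeholders for the real SMT pipeline.

def build_first_stage_output(E, F, train_smt, translate, k=10):
    folds = [list(range(i, len(E), k)) for i in range(k)]  # round-robin split
    F_prime = [None] * len(E)
    for held_out in folds:
        held = set(held_out)
        train_pairs = [(E[j], F[j]) for j in range(len(E)) if j not in held]
        system = train_smt(train_pairs)           # train on the other 9 folds
        for j in held_out:
            F_prime[j] = translate(system, E[j])  # decode the held-out fold
    return F_prime

# toy check with a fake "SMT" that just uppercases its input:
E = [f"sent{i}" for i in range(20)]
F = [f"phrase{i}" for i in range(20)]
Fp = build_first_stage_output(
    E, F,
    train_smt=lambda pairs: None,
    translate=lambda system, s: s.upper(),
)
print(Fp[0], Fp[19])  # SENT0 SENT19
```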
+ | |||
===== Conclusion =====
Although the structure of the paper was often criticized and possible flaws were found, the article was considered well-readable and simple enough to be the opening article for this semester's reading group.