Differences
This shows you the differences between two versions of the page.
courses:rg:2012:meant [2012/11/12 23:22] rosa sec 3 |
courses:rg:2012:meant [2012/11/13 00:02] rosa final section |
||
---|---|---|---|
Line 19: | Line 19: | ||
Matěj Korvas then pointed to an important difference between MEANT and HTER: MEANT uses reference translations, | Matěj Korvas then pointed to an important difference between MEANT and HTER: MEANT uses reference translations, | ||
- | Section | + | **Section |
===== 3 MEANT: SRL for MT evaluation ===== | ===== 3 MEANT: SRL for MT evaluation ===== | ||
Line 51: | Line 51: | ||
===== 4 Meta-evaluation methodology ===== | ===== 4 Meta-evaluation methodology ===== | ||
+ | Here, we recalled the difference between Kendall' | ||
+ | Martin also remarks that they use sentence-level BLEU to compute the correlation; | ||
+ | ===== 6 Experiment: Monolinguals vs. bilinguals ===== | ||
+ | |||
+ | Petr notes that, although it might seem surprising that monolinguals perform better in the evaluation than bilinguals, this is probably because bilinguals try to guess what the source was, which monolinguals cannot do. | ||
+ | |||
+ | **All other sections were basically skipped.** | ||
+ | |||
+ | ===== Final Objections ===== | ||
+ | |||
+ | For the rest of the session, Martin took the lead and raised further objections to the paper. The group agreed with them and added a few more. | ||
+ | |||
+ | Table 3 seems to represent the main results of the paper. | ||
+ | It is shocking that the authors used **only 40 sentences**; | ||
+ | The grid search they use to tune the parameters amounts to "try everything and find the best-correlating parameters" | ||
+ | They ran the grid search optimization on the 40 sentences they have, but then they evaluated HMEANT on the same data. | ||
+ | The group agreed that such an evaluation is completely flawed, and it is not clear why it was performed and included in the paper. | ||
+ | Karel Bílek also notes that it is quite ridiculous to report results to 4 decimal places when only 40 sentences are used. | ||
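The objection to tuning and evaluating on the same 40 sentences can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' setup: the features and human scores are pure random noise, random search stands in for the grid search, Kendall's tau-a is used as the correlation, and all names (`kendall_tau`, `tune`, `N_TRIALS`) are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def score(features, weights):
    """Weighted sum of the features of each sentence."""
    return [sum(w * f for w, f in zip(weights, row)) for row in features]


def tune(features, human, n_trials, rng):
    """Random search (a stand-in for grid search): keep the best-correlating weights."""
    best_weights, best_tau = None, -2.0
    for _ in range(n_trials):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(len(features[0]))]
        tau = kendall_tau(score(features, weights), human)
        if tau > best_tau:
            best_weights, best_tau = weights, tau
    return best_weights, best_tau


def noise_data(n_sentences, n_params, rng):
    """Features and 'human' scores that are unrelated by construction."""
    features = [[rng.random() for _ in range(n_params)] for _ in range(n_sentences)]
    human = [rng.random() for _ in range(n_sentences)]
    return features, human


rng = random.Random(0)
N_SENT, N_PARAMS, N_TRIALS = 40, 12, 500  # 40 sentences, 12 tunable weights

train_x, train_y = noise_data(N_SENT, N_PARAMS, rng)
fresh_x, fresh_y = noise_data(N_SENT, N_PARAMS, rng)

weights, tau_in = tune(train_x, train_y, N_TRIALS, rng)
tau_out = kendall_tau(score(fresh_x, weights), fresh_y)

print(f"tau on the tuning data: {tau_in:.3f}")
print(f"tau on fresh data:      {tau_out:.3f}")
```

Because the data contain no signal, any positive correlation on the tuning sentences comes from the search itself, which is why a score measured on the tuning data says little about the metric.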
+ | |||
+ | In Table 4, the authors probably try to compensate for this flaw by performing cross-validation. However, please note that there are only 10 sentences in one fold. Petr thinks that the table is meant to show that the parameter weights are stable. However, Martin thinks that with only 40 sentences, it is probably easy to find 12 parameter values that achieve good performance. Moreover, Aleš Tamchyna suspects that even the formulas used might be fitted to those 40 sentences. | ||
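To get a rough feeling for how noisy a correlation computed on 10 or 40 sentences is, one can look at the standard deviation of Kendall's tau when the two rankings are in fact independent (no ties), which is sqrt(2(2n+5) / (9n(n-1))). This is a sketch that assumes the reported figures are Kendall's tau correlations; the sampling error under a real, non-zero correlation differs, and `tau_null_std` is a hypothetical helper name.

```python
import math


def tau_null_std(n):
    """Std. dev. of Kendall's tau for n untied items under independent rankings."""
    return math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))


for n in (10, 40):
    print(f"n = {n:2d}: std of tau under independence = {tau_null_std(n):.3f}")
# n = 10: std of tau under independence = 0.248
# n = 40: std of tau under independence = 0.110
```

With chance fluctuations of this size, differences in the third or fourth decimal place carry no information, and a single fold of 10 sentences is noisier still.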
+ | |||
+ | Martin then informed the group that Dekai Wu has still not given us the data from the annotations done at ÚFAL (which was already several months ago), which makes us even more suspicious about whether the experiments were fair. | ||
+ | |||
+ | Martin also notes that the authors claim that all other existing evaluation metrics require lexical matches to consider a translation correct, which is not true, as the Meteor metric can also use paraphrases. | ||
+ | |||
+ | The group generally agreed that, although the ideas behind HMEANT seem reasonable, the paper itself is misleading and is not to be believed much (or probably at all). The proposed metric possibly correlates better with human judgement than automatic metrics, but it does not really seem to reach HTER. |