courses:rg:2012:meant (rosa, 2012/11/12)
The paper was widely discussed throughout the whole session. The report
===== 1 Introduction =====

The paper proposes a semi-automatic translation evaluation metric that is claimed to be both well correlated with human judgement (especially in comparison to BLEU) and less labour-intensive than HTER (which is claimed to be much more expensive).
==== Question 1: Which translation is considered "a good one" by (H)MEANT? ====
MEANT assumes that a good translation is one where the reader correctly understands "who did what to whom, when, where and why" - which, as Martin noted, is adequacy rather than fluency, and therefore a comparison with BLEU, which is more fluency-oriented, may not be entirely fair.
Matěj Korvas then pointed to an important difference between MEANT and HTER: MEANT uses reference translations, while HTER does not (it is based on post-editing the MT output instead).

**Section 2 was skipped.**

===== 3 MEANT: SRL for MT evaluation =====
Here we look at how the evaluation is actually done. It consists of three steps, all done by humans in HMEANT. In MEANT, the first step is done automatically.

==== Question 2: What phases of annotation are there? ====

  - SRL (semantic role labelling) of both the reference and the MT output; the labels are based on PropBank (but have nicer names)
  - aligning the frames - first, predicates are aligned, and then, for each matching pair of predicates, their arguments are aligned as well
  - ternary judging - deciding whether each matched role is translated correctly, incorrectly, or only partially correctly
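The three phases can be sketched with toy data structures (hypothetical code for illustration only, not the authors' tooling; the `Frame` class and the `judge` stand-in are our own assumptions):

```python
from dataclasses import dataclass

@dataclass
class Frame:
    predicate: str
    args: dict  # PropBank-style role label -> span text

# Phase 1: SRL of both sides (done by humans in HMEANT, automatically in MEANT)
ref = Frame("loves", {"Agent": "John", "Experiencer": "Mary"})
mt = Frame("loves", {"Agent": "John", "Experiencer": "Jack"})

# Phase 2: align predicates first, then the arguments of matched predicate pairs
aligned = []
if ref.predicate == mt.predicate:  # in practice a human judges the match
    aligned = [(role, ref.args[role], mt.args[role])
               for role in ref.args if role in mt.args]

# Phase 3: ternary judging of each aligned role
def judge(ref_text, mt_text):
    # placeholder for the human correct / partial / incorrect decision
    return "correct" if ref_text == mt_text else "incorrect"

judgements = {role: judge(r, m) for role, r, m in aligned}
```

With the toy sentences above, the Agent ends up judged correct and the Experiencer incorrect.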

The group discussed whether HMEANT evaluations are really faster than HTER annotations, as the paper claims.

==== Question 3: What does the set J contain in the formulas? ====
The answer is that it contains the arguments of the predicate. It actually contains all //arguments//, whether matched or not.

We further tried to compute the score for the following set of sentences:
  * Reference: //John loves Mary.//
  * MT1: //Stupid John loves Mary.//
  * MT2: //John loves Jack.//
  * MT3: //John hates Mary.//
We supposed that the semantic roles are the same in all cases, i.e. Agent for //John// or //Stupid John//, Predicate for //loves// or //hates//, and Experiencer for //Mary//. It was explained by Martin that //Stupid John// has no inner structure in HMEANT, as there is no predicate in the phrase - HMEANT semantic annotation is shallow in that respect.

For MT1, the HMEANT score is equal to 1, because, according to the paper, extra information is not penalized, and the translation is therefore regarded as being completely correct.

For MT2, //Jack// is judged an incorrect translation of //Mary//, so only the predicate and the Agent count as correct and the score drops accordingly.

For MT3, the predicates do not match, and therefore no arguments are taken into account. Martin and Ruda agreed that most probably not even a partial match of predicates can be annotated, as there is no support for such annotation in the formulas, which Martin suggested might be a flaw of the method.
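The computations above can be sketched as a simplified HMEANT-style F-score (an illustration only: it ignores the per-role weights the paper tunes, and the `hmeant_f1` helper is our own naming):

```python
def hmeant_f1(n_ref_roles, n_mt_roles, judgements):
    """Simplified HMEANT-style F-score.

    judgements maps each aligned role to 'correct', 'partial', or
    'incorrect'; correct roles count 1, partial ones 0.5.
    """
    matched = sum(1.0 if j == "correct" else 0.5 if j == "partial" else 0.0
                  for j in judgements.values())
    if n_ref_roles == 0 or n_mt_roles == 0:
        return 0.0
    precision = matched / n_mt_roles
    recall = matched / n_ref_roles
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# MT1: all three roles judged correct (extra "Stupid" adds no role) -> 1.0
mt1 = hmeant_f1(3, 3, {"Agent": "correct", "Predicate": "correct",
                       "Experiencer": "correct"})
# MT2: Experiencer judged incorrect -> 2/3
mt2 = hmeant_f1(3, 3, {"Agent": "correct", "Predicate": "correct",
                       "Experiencer": "incorrect"})
# MT3: predicates do not match, so no roles are aligned -> 0.0
mt3 = hmeant_f1(3, 3, {})
```

Under these simplifying assumptions the scores come out as 1, 2/3, and 0 for MT1, MT2, and MT3 respectively.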

Karel Bílek also noted that it is hard to annotate semantics on incorrect sentences, which is not mentioned in the paper.

===== 4 Meta-evaluation methodology =====
Here, we recalled the difference between Kendall's tau and other correlation coefficients.
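As a reminder, Kendall's tau compares how often two score lists order a pair of sentences the same way; a minimal sketch (our own illustration, not the paper's implementation, and ignoring ties):

```python
from itertools import combinations

def kendall_tau(metric_scores, human_scores):
    """Kendall's tau: (concordant - discordant) / total pairs.

    A pair of sentences is concordant when both lists rank it in the
    same order, discordant when they disagree (ties are ignored here).
    """
    pairs = list(combinations(range(len(metric_scores)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        sign = ((metric_scores[i] - metric_scores[j])
                * (human_scores[i] - human_scores[j]))
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)
```

For example, `kendall_tau([0.2, 0.5, 0.9], [1, 2, 3])` gives 1.0 (perfect rank agreement) and flipping one list gives -1.0.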

Martin also remarks that they use sentence-level BLEU to compute the correlation; BLEU was, however, designed as a corpus-level metric, so this may put it at a disadvantage.

===== 6 Experiment: Monolinguals vs. bilinguals =====

Petr notes that, although it might seem surprising that monolinguals perform better in the evaluation than bilinguals, it is probably a consequence of the fact that bilinguals try to guess what the source was, while the monolinguals cannot do that.

**All other sections were basically skipped.**

===== Final Objections =====

For the rest of the session, Martin took the lead and expressed some more objections to the paper. The group agreed with the objections, and even added some more.

Table 3 seems to represent the main results of the paper.
It is shocking that the authors used **only 40 sentences**; with such a small test set, the results can hardly be reliable.
The grid search they use to tune the parameters amounts to "try everything and find the best-correlating parameters".
They ran the grid search optimization on the 40 sentences they have, but then they evaluated HMEANT on the same data.
The group agreed that such an evaluation is completely flawed and it is not clear why it was performed and included in the paper.
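Why this is flawed can be illustrated with a toy sketch (entirely hypothetical data and weights, not the paper's setup): picking the weight that correlates best on all 40 sentences and then reporting the correlation on those same 40 sentences yields an optimistically biased number.

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# 40 toy "sentences": human scores plus two noisy metric components
human = [random.random() for _ in range(40)]
f1 = [h + random.gauss(0, 0.3) for h in human]   # informative but noisy
f2 = [random.gauss(0, 1) for _ in human]         # pure noise

# grid search: try every weight, keep the one correlating best...
grid = [i / 10 for i in range(11)]
def combined(w):
    return [w * a + (1 - w) * b for a, b in zip(f1, f2)]
best_w = max(grid, key=lambda w: pearson(combined(w), human))

# ...and then "evaluate" on the very same 40 sentences: the reported
# correlation is biased upward, which is the flaw the group objected to
reported = pearson(combined(best_w), human)
```

A fair evaluation would tune `best_w` on one set of sentences and measure the correlation on a held-out set.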

Table 4

Martin then informed the group that Dekai Wu has still not given us the data from the annotations done at ÚFAL (which were completed several months ago), which raises even more suspicion about whether the experiments were fair.

Martin also notes that the authors claim that all other existing evaluation metrics require lexical matches to consider a translation correct - which is not true, as the Meteor metric can also use paraphrases.

Karel Bílek also pointed out that the paper reports precision to 4 decimal digits despite using only 40 sentences.

The group generally agreed that, although the ideas behind HMEANT seem reasonable, the paper itself is misleading and should not be trusted much (or perhaps at all).