Report by Rudolf Rosa
The paper was widely discussed throughout the whole session. The report tries to divide the points discussed in correspondence with the sections of the paper.

===== 1 Introduction =====
The paper proposes a semi-automatic translation evaluation metric that is claimed to be both well correlated with human judgement (especially in comparison to BLEU) and less labour-intensive than HTER (which is claimed to be much more expensive).
==== Question 1: Which translation is considered as "a good one" by (H)MEANT? ====
MEANT assumes that a good translation is one where the reader understands correctly "Who did what to whom, when, where and why" - which, as Martin noted, is rather adequacy than fluency, and therefore a comparison with BLEU, which is more fluency-oriented, may not be entirely fair.
+ | |||
+ | Matin further explained that HTER is a metric where the humans post-edit the MT output to transform it into a correct translation, | ||
+ | Matěj Korvas then pointed to an important difference between MEANT and HTER: MEANT uses reference translations, | ||
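To make that contrast concrete, here is a minimal sketch of an HTER-style computation, assuming a plain word-level edit distance (the real metric also counts block shifts); the function names and the example sentences are only illustrative, not taken from the paper:
<code python>
# A minimal sketch of an HTER-style computation (word-level Levenshtein
# distance only; the real metric also allows block shifts).

def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions."""
    hyp, ref = hyp.split(), ref.split()
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete a hypothesis word
                          d[i][j - 1] + 1,          # insert a reference word
                          d[i - 1][j - 1] + cost)   # substitute / keep
    return d[len(hyp)][len(ref)]

def hter(mt_output, post_edited):
    """Edits needed to turn the MT output into its human post-edited version,
    normalized by the length of the post-edited sentence."""
    return word_edit_distance(mt_output, post_edited) / len(post_edited.split())

# The post-edited sentence is produced by a human annotator; MEANT, in
# contrast, compares the MT output against an independent reference.
print(hter("John love Mary madly", "John loves Mary"))  # 2 edits / 3 words ~ 0.67
</code>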
Section **2 Related work** was skipped.
===== 3 MEANT: SRL for MT evaluation =====
Here we look at how the evaluation is actually done. It consists of three steps, all done by humans in HMEANT. In MEANT, the first step is done automatically.

==== Question 2: Which phases of annotation are there? ====

  - SRL (semantic role labelling) of both the reference and the MT output; the labels are based on PropBank (but have nicer names)
  - aligning the frames - first, predicates are aligned, and then, for each matching pair of predicates, their arguments are aligned as well
  - ternary judging - deciding whether each matched role is translated correctly, incorrectly or only partially correctly
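A minimal sketch of the data involved in the three phases might look as follows; the class and role names are hypothetical, and string equality merely stands in for decisions that are made by humans in HMEANT:
<code python>
# A hypothetical data model for the three (H)MEANT annotation phases.
from dataclasses import dataclass, field

@dataclass
class Frame:
    predicate: str
    args: dict = field(default_factory=dict)   # role label -> filler span

# Phase 1: semantic role labelling of both sides
# (done by humans in HMEANT, automatically in MEANT)
ref_frame = Frame("loves", {"Agent": "John", "Experiencer": "Mary"})
mt_frame  = Frame("loves", {"Agent": "John", "Experiencer": "Mary"})

# Phase 2: align the frames - predicates first, then the arguments
# of each matched pair of predicates
aligned = [(ref_frame, mt_frame)] if ref_frame.predicate == mt_frame.predicate else []

# Phase 3: ternary judging of every aligned role:
# "correct" / "partial" / "incorrect" (here faked by string equality)
judgements = {}
for ref_f, mt_f in aligned:
    judgements["Predicate"] = "correct"
    for role, ref_filler in ref_f.args.items():
        if role in mt_f.args:
            judgements[role] = "correct" if mt_f.args[role] == ref_filler else "incorrect"

print(judgements)   # {'Predicate': 'correct', 'Agent': 'correct', 'Experiencer': 'correct'}
</code>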
+ | |||
+ | The group discussed whether HMEANT evaluations are really faster than HTER annotations, | ||
+ | |||
+ | ==== Question 3: What does the set J contain in the // | ||
The answer is that it contains the arguments of the predicate. It actually contains all //argument roles// filled in the frame.
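For orientation, a schematic precision formula of roughly this kind (a simplified sketch, not copied from the paper, with //i// ranging over matched frames and //J// over the argument roles of frame //i//) can be written in LaTeX as:
<code latex>
\mathrm{precision} \approx
  \frac{\sum_i \bigl( w_{\mathrm{pred}} \, m_{i,\mathrm{pred}}
        + \sum_{j \in J} w_j \, m_{i,j} \bigr)}
       {\sum_i \bigl( w_{\mathrm{pred}}
        + \sum_{j \in J} w_j \, n^{\mathrm{MT}}_{i,j} \bigr)}
</code>
Here //m// is 1 for a correctly translated role, 0.5 for a partially correct one and 0 otherwise, //n// counts the fillers of role //j// in the MT output, and the //w// are role weights; recall is defined analogously over the reference side, and the two are combined into an f-score.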
+ | |||
+ | We further tried to compute the score for the following set of sentences: | ||
+ | * Reference: //John loves Mary.// | ||
+ | * MT1: //Stupid John loves Mary.// | ||
+ | * MT2: //John loves Jack.// | ||
+ | * MT3: //John hates Mary.// | ||
We supposed that the semantic roles are the same in all cases, i.e. Agent for //John// or //Stupid John//, Predicate for //loves// or //hates//, and Experiencer for //Mary//. It was explained by Martin that //Stupid John// has no inner structure in HMEANT, as there is no predicate in the phrase - HMEANT semantic annotation is shallow in that respect.
+ | |||
+ | For MT1, the HMEANT score is equal to 1, because, according to the paper, extra information is not penalized, and the translation is therefore regarded as being completely correct. | ||
+ | |||
+ | For MT2, // | ||
+ | |||
+ | For MT3, the predicates do not match, and therefore no arguments are taken into account. Martin and Ruda agreed that most probably not even a partial match of predicates can be annotated, as there is no support for such annotation in the formulas, which Martin suggested to be a possible flaw of the method. | ||
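To make the three cases concrete, here is a toy scorer - a simplified sketch assuming equal role weights, 0.5 credit for partial matches and an f-measure of precision and recall, which is not the paper's exact formula - that reproduces the scores discussed above; the ternary judgements are supplied by hand, as a human annotator would provide them:
<code python>
# Toy (H)MEANT-like scorer for a single frame; names and weights are
# illustrative, not taken from the paper.

def toy_hmeant(ref_roles, mt_roles, judgements):
    """ref_roles / mt_roles: role labels present on each side;
    judgements: human ternary judgement for every aligned role."""
    credit = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}
    matched = sum(credit[j] for j in judgements.values())
    if matched == 0:
        return 0.0
    precision = matched / len(mt_roles)
    recall = matched / len(ref_roles)
    return 2 * precision * recall / (precision + recall)

roles = ["Predicate", "Agent", "Experiencer"]   # John loves Mary.

# MT1 "Stupid John loves Mary.": the extra material inside the Agent filler
# is not penalized, so all three roles are judged correct -> 1.0
print(toy_hmeant(roles, roles,
                 {"Predicate": "correct", "Agent": "correct", "Experiencer": "correct"}))

# MT2 "John loves Jack.": the Experiencer is judged incorrect -> about 0.67
print(toy_hmeant(roles, roles,
                 {"Predicate": "correct", "Agent": "correct", "Experiencer": "incorrect"}))

# MT3 "John hates Mary.": the predicates do not align, so no roles are
# judged at all -> 0.0
print(toy_hmeant(roles, roles, {}))
</code>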
+ | |||
+ | Karel Bílek also noted that it is hard to annotate semantics on incorrect sentences, which is not mentioned in the paper. | ||
+ | |||
+ | ===== 4 Meta-evaluation methodology ===== | ||