The paper was widely discussed throughout the whole session. The report tries to divide the points discussed in correspondence with the sections of the paper.
  
===== 1 Introduction =====
The paper proposes a semi-automatic translation evaluation metric that is claimed to be both well correlated with human judgement (especially in comparison to BLEU) and less labour-intensive than HTER (which is claimed to be much more expensive).
  
==== Question 1: Which translation is considered "a good one" by (H)MEANT? ====
MEANT assumes that a good translation is one where the reader understands correctly "Who did what to whom, when, where and why" - which, as Martin noted, is adequacy rather than fluency, and therefore a comparison with BLEU, which is more fluency-oriented, is not completely fair. Moreover, good systems usually make more errors in adequacy than in fluency, which makes BLEU an even worse metric these days.
  
Martin further explained that HTER is a metric where humans post-edit the MT output to transform it into a correct translation, and then TER, which is essentially a word-based Levenshtein distance, is computed as the score.
Matěj Korvas then pointed to an important difference between MEANT and HTER: MEANT uses reference translations, whereas HTER uses post-edits. Surprisingly, this is not noted in the paper.
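
To make the HTER computation concrete, here is a minimal Python sketch of its edit-distance core, assuming plain word-level Levenshtein operations (real TER additionally allows block shifts, which are not modelled here); the function names are ours, chosen just for illustration.

<code python>
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn hyp into ref (word-level Levenshtein distance)."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i                               # delete all remaining hyp words
    for j in range(len(r) + 1):
        d[0][j] = j                               # insert all remaining ref words
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(h)][len(r)]

def hter_like_score(mt_output, post_edit):
    """Edit distance normalised by the length of the human post-edit,
    e.g. hter_like_score("John hates Mary .", "John loves Mary .") == 0.25"""
    return word_edit_distance(mt_output, post_edit) / len(post_edit.split())
</code>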
  
Section **2 Related work** was skipped.
  
===== 3 MEANT: SRL for MT evaluation =====
Here we look at how the evaluation is actually done. It consists of three steps, all done by humans in HMEANT. In MEANT, the first step is done automatically.

==== Question 2: Which phases of annotation are there? ====

  - SRL (semantic role labelling) of both the reference and the MT output; the labels are based on PropBank (but have nicer names)
  - aligning the frames - first, predicates are aligned, and then, for each matching pair of predicates, their arguments are aligned as well
  - ternary judging - deciding whether each matched role is translated correctly, incorrectly or only partially correctly

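To make the three phases more concrete, here is a toy Python sketch of what each phase could produce for a made-up sentence pair; the data layout and role names are our own illustration, not the annotation format used in the paper.

<code python>
# Phase 1: SRL of the reference and of the MT output -- one frame per predicate,
# with simplified PropBank-style role labels.
reference_frames = [
    {"predicate": "bought",
     "roles": {"Agent": "the boy", "Patient": "a book", "Temporal": "yesterday"}},
]
mt_frames = [
    {"predicate": "purchased",
     "roles": {"Agent": "the boy", "Patient": "a book"}},
]

# Phase 2: alignment -- predicates are aligned first, then the arguments
# of each aligned predicate pair.
frame_alignment = [(0, 0)]   # reference frame 0 <-> MT frame 0
role_alignment = {(0, 0): [("Agent", "Agent"), ("Patient", "Patient")]}

# Phase 3: ternary judging of the aligned predicates and roles:
# "correct", "partial", or "incorrect".
judgements = {
    (0, 0): {"predicate": "correct", "Agent": "correct", "Patient": "partial"},
}
</code>
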
The group discussed whether HMEANT evaluations are really faster than HTER annotations, as some of the readers had participated in an HMEANT evaluation. Some readers agreed that about 5 minutes per sentence is quite accurate, while others stated that 5 minutes is at best a lower bound. However, it is not completely clear whether all three phases of annotation are claimed to be done in 5 minutes. (Probably yes, but that makes the readers agree even less with the indicated times.)

==== Question 3: What does the set J contain in the //C_precision// formula? ====
The answer is that it contains the arguments of the predicate. It actually contains all //possible// roles, where the non-present ones only add a zero to the sum and therefore do not influence the score.
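
As a tiny illustration of this point, the following sketch sums over a made-up inventory of all possible roles with uniform weights; the absent roles contribute zero, so the result equals the sum over the present arguments only (the role inventory and match values are invented for the example).

<code python>
ALL_ROLES = ["Agent", "Experiencer", "Patient", "Locative", "Temporal"]
w = {role: 0.1 for role in ALL_ROLES}        # uniform weights, as in Section 3
match = {"Agent": 1.0, "Experiencer": 1.0}   # only the roles filled in the frame

over_all_possible = sum(w[j] * match.get(j, 0.0) for j in ALL_ROLES)
over_present_only = sum(w[j] * m for j, m in match.items())
assert abs(over_all_possible - over_present_only) < 1e-12   # both equal 0.2
</code>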
We further tried to compute the score for the following set of sentences:
  * Reference: //John loves Mary.//
  * MT1: //Stupid John loves Mary.//
  * MT2: //John loves Jack.//
  * MT3: //John hates Mary.//
We supposed that the semantic roles are the same in all cases, i.e. Agent for //John// or //Stupid John//, Predicate for //loves// or //hates//, and Experiencer for //Mary//. It was explained by Martin that //Stupid John// has no inner structure in HMEANT as there is no predicate in the phrase - HMEANT semantic annotation is shallow in that respect. Furthermore, we assumed (following the paper's Section 3) that the weights are uniform, i.e. //w_pred// = //w_j// = 0.1 and //w_partial// = 0.5.

For MT1, the HMEANT score is equal to 1, because, according to the paper, extra information is not penalized, and the translation is therefore regarded as being completely correct.

For MT2, //C_precision// is 1, but //C_recall// is only 2/3, and the HMEANT score, which is the F-score, is therefore 4/5.

For MT3, the predicates do not match, and therefore no arguments are taken into account. Martin and Ruda agreed that most probably not even a partial match of predicates can be annotated, as there is no support for such annotation in the formulas, which Martin suggested to be a possible flaw of the method.
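
The following toy sketch reproduces these numbers under the simplifying assumptions used in the discussion: unaligned MT roles are ignored ("extra information is not penalized"), a correct match counts 1, a partial match counts //w_partial//, and the final score is the F-score of //C_precision// and //C_recall//. This is our reading of the session, not the paper's exact formulas; with uniform weights, //w_pred// and //w_j// cancel out of the ratios, which is why they do not appear explicitly.

<code python>
W_PARTIAL = 0.5
CREDIT = {"correct": 1.0, "partial": W_PARTIAL, "incorrect": 0.0}

def hmeant_toy(matched_judgements, n_mt_aligned, n_ref_roles):
    """matched_judgements: ternary judgements of the aligned roles (incl. the predicate);
    n_mt_aligned: number of aligned roles on the MT side;
    n_ref_roles: number of roles (incl. the predicate) in the reference."""
    matched = sum(CREDIT[j] for j in matched_judgements)
    precision = matched / n_mt_aligned if n_mt_aligned else 0.0
    recall = matched / n_ref_roles if n_ref_roles else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Reference "John loves Mary": predicate + Agent + Experiencer = 3 roles.
print(hmeant_toy(["correct"] * 3, 3, 3))         # MT1 -> 1.0
print(hmeant_toy(["correct", "correct"], 2, 3))  # MT2 -> 0.8 (//Jack// not aligned to //Mary//)
print(hmeant_toy([], 0, 3))                      # MT3 -> 0.0 (no predicates aligned)
</code>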

Karel Bílek also noted that it is hard to annotate the semantics of incorrect sentences, which is not mentioned in the paper.

===== 4 Meta-evaluation methodology =====
  
  
