
Institute of Formal and Applied Linguistics Wiki





MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames

Chi-kiu Lo and Dekai Wu
ACL 2011
http://www.aclweb.org/anthology/P11-1023

Presented by Petr Jankovský
Report by Rudolf Rosa

The paper was widely discussed throughout the whole session. This report groups the points discussed according to the sections of the paper.

1 Introduction

The paper proposes a semi-automatic translation evaluation metric that is claimed to be both well correlated with human judgement (especially in comparison to BLEU) and less labour-intensive than HTER (which is claimed to be much more expensive).

Question 1: Which translation is considered as "a good one" by (H)MEANT?

MEANT assumes that a good translation is one from which the reader correctly understands “who did what to whom, when, where and why”. As Martin noted, this measures adequacy rather than fluency, so a comparison with BLEU, which is more fluency-oriented, is not entirely fair. Moreover, good systems these days tend to make more errors in adequacy than in fluency, which makes BLEU an even less suitable metric.

Martin further explained that HTER is a metric where humans post-edit the MT output to turn it into a correct translation, and then TER, which is essentially a word-based Levenshtein distance, is computed between the MT output and its post-edited version as the score.
Matěj Korvas then pointed out an important difference between MEANT and HTER: MEANT uses independent reference translations, whereas HTER uses post-edited versions of the MT output itself. Surprisingly, this is not noted in the paper.
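The word-based Levenshtein distance underlying TER can be sketched as follows. This is a simplification: real TER also allows block shifts of phrases, which this sketch omits.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two sentences.

    Counts the minimum number of word insertions, deletions and
    substitutions needed to turn hyp into ref. (Full TER additionally
    allows block shifts, not modelled here.)
    """
    h, r = hyp.split(), ref.split()
    # dp[i][j] = distance between the first i words of h and first j of r
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1]


def ter(hyp, ref):
    """Edit distance normalized by reference length (TER-style score)."""
    return word_edit_distance(hyp, ref) / len(ref.split())
```

In HTER the `ref` argument would be the post-edited version of the MT output rather than an independent reference translation.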

Section 2 Related work was skipped.

3 MEANT: SRL for MT evaluation

Here we look at how the evaluation is actually done. It consists of three steps, all done by humans in HMEANT. In MEANT, the first step is done automatically.

Question 2: Which phases of annotations are there?

  1. SRL (semantic role labelling) of both the reference and the MT output; the labels are based on PropBank (but have nicer names)
  2. aligning the frames - first, predicates are aligned, and then, for each matching pair of predicates, their arguments are aligned as well
  3. ternary judging - deciding whether each matched role is translated correctly, incorrectly or only partially correctly

The group discussed whether HMEANT evaluations are really faster than HTER annotations, as some of the readers had participated in an HMEANT evaluation. Some readers agreed that about 5 minutes per sentence is quite accurate, while others stated that 5 minutes is at best a lower bound. However, it is not completely clear whether all three phases of annotation are claimed to fit within the 5 minutes. (Probably yes, but in that case the readers agree even less with the indicated times.)

Question 3: What does the set J contain in the C_precision formula?

The answer is that it contains the arguments of the predicate. More precisely, it contains all possible semantic roles; roles not present in the frame only add a zero to the sum and therefore do not influence the score.
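Under the uniform weights assumed later in the discussion (w_pred = w_j = 0.1), the per-frame precision counting might be sketched as follows. The role inventory and function names here are illustrative assumptions, the exact HMEANT formula aggregates over all frames, and the sketch assumes the frame's predicate itself has been matched (otherwise the frame contributes nothing, as discussed for MT3 below).

```python
# Illustrative subset of the role inventory; assumed, not from the paper.
ROLES = ["Agent", "Experiencer"]
W_PRED, W_ARG = 0.1, 0.1  # uniform weights, as assumed in the discussion


def frame_precision(mt_roles, matched_roles):
    """Weighted fraction of the MT frame's filled slots that are matched.

    mt_roles:      set of role labels filled in the MT output's frame
    matched_roles: the subset of mt_roles aligned to the reference frame
    Roles absent from the MT frame contribute zero to both sums, so the
    "all possible roles" reading of J does not change the score.
    """
    num = W_PRED + sum(W_ARG for j in ROLES if j in matched_roles)
    den = W_PRED + sum(W_ARG for j in ROLES if j in mt_roles)
    return num / den
```

For a frame where both filled roles are matched this gives 1; if only one of two filled roles is matched, it gives 0.2/0.3 = 2/3.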

We further tried to compute the score for an example consisting of a reference sentence and three MT outputs (MT1, MT2 and MT3).

We supposed that the semantic roles are the same in all cases, i.e. Agent for John or Stupid John, Predicate for loves or hates, and Experiencer for Mary. Martin explained that Stupid John has no inner structure in HMEANT, as there is no predicate in the phrase; HMEANT semantic annotation is shallow in this respect. Furthermore, we assumed (following Section 3 of the paper) that the weights are uniform, i.e. w_pred = w_j = 0.1 and w_partial = 0.5.

For MT1, the HMEANT score is equal to 1, because, according to the paper, extra information is not penalized, and the translation is therefore regarded as being completely correct.

For MT2, C_precision is 1, but C_recall is only 2/3, and the HMEANT score, which is the F-score, is therefore 4/5.
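As a quick check of the arithmetic, the F-score is the harmonic mean of precision and recall:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# MT2: precision 1, recall 2/3 gives F = 4/5
mt2 = f_score(1.0, 2 / 3)
```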

For MT3, the predicates do not match, and therefore no arguments are taken into account at all. Martin and Ruda agreed that most probably not even a partial match of predicates can be annotated, as the formulas provide no support for such an annotation, which Martin suggested might be a flaw of the method.

Karel Bílek also noted that it is hard to annotate semantics on incorrect sentences, which is not mentioned in the paper.

4 Meta-evaluation methodology

