
Institute of Formal and Applied Linguistics Wiki


courses:rg:2012:meant [2012/11/13 00:07] rosa
  
  
The paper was widely discussed throughout the whole session. This report divides the points discussed according to the sections of the paper.
  
===== 1 Introduction =====
The paper proposes a semi-automatic translation evaluation metric that is claimed to be both well correlated with human judgment (especially in comparison to BLEU) and less labour-intensive than HTER (which is claimed to be much more expensive).
  
==== Question 1: Which translation is considered as "a good one" by (H)MEANT? ====
MEANT assumes that a good translation is one where the reader understands correctly "Who did what to whom, when, where and why" - which, as Martin noted, is adequacy rather than fluency, and therefore a comparison with BLEU, which is more fluency-oriented, is not completely fair. Moreover, good systems usually make more errors in adequacy than in fluency, which makes BLEU an even worse metric these days.
  
Martin further explained that HTER is a metric where humans post-edit the MT output to transform it into a correct translation, and then TER, which is actually a word-based Levenshtein distance, is computed as the score.
Matěj Korvas then pointed out an important difference between MEANT and HTER: MEANT uses reference translations, whereas HTER uses post-edited outputs. Surprisingly, this is not noted in the paper.
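Martin's description of TER can be sketched in a few lines. This is only an illustration of the word-based Levenshtein idea; real TER additionally allows block shifts of word sequences, which plain edit distance does not capture:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two sentences."""
    h, r = hyp.split(), ref.split()
    # dp[i][j] = edits needed to turn the first i hyp words into the first j ref words
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(h)][len(r)]

def ter(hyp, ref):
    """Edit distance normalized by reference length (lower is better)."""
    return word_edit_distance(hyp, ref) / len(ref.split())
```

In HTER the reference is the post-edited MT output itself, so the score counts the edits the human actually had to make.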
  
  
==== Question 2: Which phases of annotation are there? ====
  - SRL (semantic role labelling) of both the reference and the MT output; the labels are based on PropBank (but have nicer names)
  - aligning the frames - first, predicates are aligned, and then, for each matching pair of predicates, their arguments are aligned as well
  - ternary judging - deciding whether each matched role is translated correctly, incorrectly or only partially correctly
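The ternary judgments from the last phase feed into a precision/recall computation over the matched roles. The sketch below only illustrates that idea, not the paper's exact formula - the real (H)MEANT weights predicates and individual role types with tuned parameters, and the 0.5 credit for partial matches is an assumption here:

```python
def hmeant_like_score(judgments, n_roles_mt, n_roles_ref, partial_weight=0.5):
    """Simplified (H)MEANT-style F-score.

    judgments: one of "correct" / "partial" / "incorrect" per matched role.
    n_roles_mt / n_roles_ref: roles filled in the MT output / the reference.
    """
    credit = sum(1.0 if j == "correct" else
                 partial_weight if j == "partial" else 0.0
                 for j in judgments)
    precision = credit / n_roles_mt
    recall = credit / n_roles_ref
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```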
  
The group discussed whether HMEANT evaluations are really faster than HTER annotations, as some of the readers had participated in an HMEANT evaluation. Some readers agree that about 5 minutes per sentence is quite accurate, while others state that 5 minutes is at best a lower bound. However, it is not completely clear whether all three phases of annotation are claimed to be done in 5 minutes. (Probably yes, but in that case the readers agree even less with the indicated times.)
  
==== Question 3: What does the set J contain in the //C_precision// formula? ====
  
===== 6 Experiment: Monolinguals vs. bilinguals =====
Petr notes that, although it might seem surprising that monolinguals perform better in the evaluation than bilinguals, it is probably a consequence of the fact that bilinguals try to guess what the source was, while the monolinguals cannot do that.
  
  
===== Final Objections =====
For the rest of the session, Martin took the lead to express some more objections to the paper. The group agreed with the objections, and even added some more.
  
Table 3 seems to represent the main results of the paper.
It is shocking that the authors used **only 40 sentences**; moreover, they used them as **both the training set and the test set**.
The grid search they use to tune the parameters means "try everything and find the best-correlating parameter values" - in this case, there are 12 parameters.
They ran the grid search optimization on the 40 sentences they have, but then they evaluated HMEANT on the same data.
The group agreed that such an evaluation is completely flawed and it is not clear why it was performed and included in the paper.
Karel Bílek also notes that it is quite ridiculous to state the precision to 4 decimal digits when only 40 sentences are used.
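To make the objection concrete, here is a minimal sketch of the kind of grid search described - try every weight combination and keep the one that correlates best with the human scores. The feature set, grid values, and the use of Pearson correlation are assumptions for illustration (the paper tunes 12 weights; two are used here for brevity). When the same sentences are used for tuning and evaluation, the reported correlation is an optimistic upper bound:

```python
import itertools

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def grid_search(feature_rows, human_scores, grid=(0.0, 0.5, 1.0)):
    """Exhaustively try every weight combination from the grid and return
    (best correlation, best weights). Tuning and evaluating on the same
    sentences, as in the paper, overfits the weights to those sentences."""
    n_params = len(feature_rows[0])
    best = (-2.0, None)
    for weights in itertools.product(grid, repeat=n_params):
        metric = [sum(w * f for w, f in zip(weights, row)) for row in feature_rows]
        best = max(best, (pearson(metric, human_scores), weights))
    return best
```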
  
In Table 4, the authors probably try to compensate for this flaw by performing cross-validation. However, note that there are only 10 sentences in one fold. Petr thinks that the table should show that the parameter weights are stable. However, Martin thinks that with only 40 sentences, it is probably easy to find 12 parameter values that achieve good performance. Moreover, Aleš Tamchyna assumes that even the formulas used might have been fitted to those 40 sentences.
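For concreteness, 40 sentences with 10 per held-out fold implies 4-fold cross-validation (the fold count is inferred from these numbers, not stated here):

```python
def folds(items, k):
    """Split items into k equal folds (assumes len(items) is divisible by k)."""
    size = len(items) // k
    return [items[i * size:(i + 1) * size] for i in range(k)]

# 40 sentences, 10 per fold => 4 folds
sentence_ids = list(range(40))
fold_sizes = [len(f) for f in folds(sentence_ids, 4)]
```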
  
Martin then informed the group that Dekai Wu has still not given us the data from the annotations done at ÚFAL (which were completed several months ago), which makes us even more suspicious about whether the experiments were fair.
  
Martin also notes that the authors claim that all other existing evaluation metrics require lexical matches to consider a translation to be correct - which is not true, as the Meteor metric can also use paraphrases.
  
The group generally agreed that, although the ideas behind HMEANT seem reasonable, the paper itself is misleading and is not to be believed much (or probably at all). The proposed metric possibly correlates better with human judgment than automatic metrics, but it does not really seem to reach HTER.
