Differences

This shows you the differences between two versions of the page.

--- courses:rg:2012:meant [2012/11/13 00:07]
rosa spellcheck
+++ courses:rg:2012:meant [2012/11/13 16:25] (current)
popel
@@ Line 31: / Line 31: @@
 The group discussed whether HMEANT evaluations are really faster than HTER annotations, as some of the readers had participated in HMEANT evaluation. Some readers agreed that about 5 minutes per sentence is quite accurate, while others stated that 5 minutes is at best a lower bound. However, it is not completely clear whether all three phases of annotation are claimed to be done in 5 minutes. (Probably yes, but in that case the readers agree even less with the indicated times.)
-==== Question 3: What does the set J contain in the //C_precision// formula? ====
+==== Question 3: What does the set J contain in the C_precision formula? ====
 The answer is that it contains the arguments of the predicate. It actually contains all //possible// roles, where the non-present ones only add a zero to the sum and therefore do not influence the score.
@@ Line 45: / Line 45: @@
 For MT2, //C_precision// is 1, but //C_recall// is only 2/3, and the HMEANT score, which is the F-score, is therefore 4/5.
-For MT3, the predicates do not match, and therefore no arguments are taken into account. Martin and Ruda agreed that most probably not even a partial match of predicates can be annotated, as there is no support for such annotation in the formulas, which Martin suggested to be a possible flaw of the method.
+For MT3, the predicates do not match, so no arguments are taken into account and the score is 0. Martin and Ruda agreed that most probably not even a partial match of predicates can be annotated, as the formulas offer no support for such annotation; Martin suggested that this may be a flaw of the method.
 Karel Bílek also noted that it is hard to annotate semantics on incorrect sentences, which is not mentioned in the paper.
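The MT2 and MT3 scores above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the F-score is the balanced harmonic mean of //C_precision// and //C_recall//, and the helper name ''f_score'' is hypothetical, not taken from the paper.

```python
from fractions import Fraction


def f_score(precision, recall):
    """Balanced F-score (harmonic mean); defined as 0 when both inputs are 0."""
    if precision + recall == 0:
        return Fraction(0)
    return 2 * precision * recall / (precision + recall)


# MT2: C_precision = 1, C_recall = 2/3
mt2 = f_score(Fraction(1), Fraction(2, 3))
print(mt2)  # 4/5

# MT3: no predicate matches, so both precision and recall are 0
mt3 = f_score(Fraction(0), Fraction(0))
print(mt3)  # 0
```

Exact fractions are used here so that the result reads as 4/5 rather than a rounded float.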
 ===== 4 Meta-evaluation methodology =====
-Here, we reminded the difference between Kendall's τ and Spearman's τ. Kendall's τ only takes the ranks into account, disregarding the actual scores, while Spearman's τ takes the scores into account. The formula for Kendall's τ is τ = (#same rank - #different rank) / #pairs.
+Here, we recalled the difference between Kendall's τ and Spearman's ρ. Both are rank correlation coefficients that disregard the actual scores: Kendall's τ only counts whether each pair of items is ordered the same way in both rankings, while Spearman's ρ also takes the size of the rank differences into account. The formula for [[http://en.wikipedia.org/wiki/Kendall%27s_tau|Kendall's τ]] is τ = (#same_ordered_pairs - #opposite_ordered_pairs) / #all_pairs.
-Martin also remarks that they use sentence-level BLEU to compute the correlation; however, BLEU was designed for whole documents, not for individual sentences, and therefore should preferably not be used on sentence level.
+Martin also remarked that the authors use sentence-level BLEU to compute the correlation; however, BLEU was designed for whole documents, not for individual sentences, and therefore should preferably not be used at the sentence level.
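The Kendall's τ formula above can be sketched directly in code. This is a minimal illustration, assuming that tied pairs count as neither same-ordered nor opposite-ordered (the simplest variant, without tie correction); the function name and the example scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """tau = (#same_ordered_pairs - #opposite_ordered_pairs) / #all_pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    same = opposite = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        total += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # the pair is ordered the same way in both lists
            same += 1
        elif product < 0:    # the pair is ordered in opposite ways
            opposite += 1
        # product == 0 means a tie; it counts only towards the total
    return (same - opposite) / total


# Hypothetical scores for four sentences: one pair (the 2nd and 3rd) is swapped.
human_scores = [1, 2, 3, 4]
metric_scores = [1, 3, 2, 4]
tau = kendall_tau(human_scores, metric_scores)
```

With six pairs in total, five same-ordered and one opposite-ordered, the example gives τ = (5 - 1) / 6 = 2/3.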
 ===== 6 Experiment: Monolinguals vs. bilinguals =====
@@ Line 62: / Line 62: @@
 For the rest of the session, Martin took the lead in raising further objections to the paper. The group agreed with the objections, and even added some more.
-Table 3 seems to represent the main results of the paper.
+Table 3 seems to represent the main results of the paper. It is shocking that the authors used **only 40 sentences**; moreover, they used them as **both the training set and the test set**.
-It is shocking that the authors used **only 40 sentences**; moreover, they used it as **both the training set and the test set**.
+The grid search they use to tune the parameters amounts to "try everything and find the best-correlating parameters"; in this case there are 12 parameters. They ran the grid search optimization on the 40 sentences they have, and then evaluated HMEANT on the same data. The group agreed that such an evaluation is completely flawed, and it is not clear why it was performed and included in the paper.
-The grid search they use to tune the parameters means to "try everything and find the best-correlating parameters" - in this case this is 12 parameters.
-They ran the grid search optimization on the 40 sentences they have, but then they evaluated HMEANT on the same data.
-The group agreed that such evaluation is completely flawed and it is not clear why it was performed and included in the paper.
 Karel Bílek also noted that it is quite ridiculous to report values to 4 decimal places when only 40 sentences are used.
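The objection to tuning and evaluating on the same 40 sentences can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the data are pure random noise, there are 3 weights instead of the paper's 12 (to keep the grid small), and the names ''grid_search'' and ''weighted_score'' are hypothetical. It is not a reconstruction of the authors' setup.

```python
import itertools
import random
from itertools import combinations


def kendall_tau(xs, ys):
    """(#same_ordered_pairs - #opposite_ordered_pairs) / #all_pairs, ties ignored."""
    same = opposite = total = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        total += 1
        product = (x1 - x2) * (y1 - y2)
        same += product > 0
        opposite += product < 0
    return (same - opposite) / total


def weighted_score(features, weights):
    """Toy metric: weighted sum of per-sentence features."""
    return sum(w * f for w, f in zip(weights, features))


def grid_search(sentences, human, grid, n_weights):
    """Try every weight combination; keep the one that correlates best."""
    best_weights, best_tau = None, float("-inf")
    for weights in itertools.product(grid, repeat=n_weights):
        metric = [weighted_score(f, weights) for f in sentences]
        tau = kendall_tau(metric, human)
        if tau > best_tau:
            best_weights, best_tau = weights, tau
    return best_weights, best_tau


rng = random.Random(0)  # fixed seed so the run is repeatable
# 80 random "sentences": 40 for tuning, 40 held out. No real signal exists.
features = [[rng.random() for _ in range(3)] for _ in range(80)]
human = [rng.random() for _ in range(80)]
tune_x, test_x = features[:40], features[40:]
tune_y, test_y = human[:40], human[40:]

grid = [0.0, 0.25, 0.5, 0.75, 1.0]
weights, tau_on_tuning_data = grid_search(tune_x, tune_y, grid, 3)
tau_on_held_out = kendall_tau(
    [weighted_score(f, weights) for f in test_x], test_y
)
print("tuned on / evaluated on the same 40:", tau_on_tuning_data)
print("evaluated on 40 held-out sentences: ", tau_on_held_out)
```

By construction, the correlation reported on the tuning data is the maximum over the whole grid, so it is an optimistic estimate even when the features carry no information; only the held-out figure says anything about how the tuned weights generalize.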


Institute of Formal and Applied Linguistics Wiki
