====== From Human to Automatic Error Classification for Machine Translation Output ======

**by Maja Popovic and Aljoscha Burchardt** (German Research Center for Artificial Intelligence - Language Technology Group, Berlin)

talk by Petra Galuščáková
report by Jindřich Libovický

===== Introduction =====

A method for classification and evaluation of errors in MT is introduced in this paper. Petra presented it on Monday, 12th December 2011.

===== Outline of the paper =====
  - Introduction
     * discusses what kind of evaluation developers need in order to get relevant feedback
     * references to an article by [[http://www-i6.informatik.rwth-aachen.de/~xujia/publications/EA.pdf|Villar et al.]] about a scheme for human evaluation of SMT
  - Error classification
  - Experimental results
     * both human and automatic error classification were performed
     * high correlation between the human and automatic classification reported
     * some problems beyond the scope of the algorithm
  - Conclusion

===== Notes =====
  * The tool based on this algorithm is called [[http://ufal.mff.cuni.cz/pbml/96/art-popovic.pdf|Hjerson]]. It is written in Python and can be [[http://www.dfki.de/~mapo02/hjerson/|downloaded]] from Maja Popovic's website.
  * Looking at Table 2, it may seem strange that, e.g., the numbers in the "lexical" column differ considerably, yet the reported correlation is almost one. This is because [[http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient|Spearman's rank correlation coefficient]] is used. This measure does not capture the correlation between the values themselves, but between their ranks.
  * A very high correlation is reported, but the article does not mention a baseline (random ordering), which would also have a relatively high correlation with the human evaluation.
  * Table 3 may be a little confusing because parts a) and b) in fact represent the same absolute numbers. The values shown in the tables differ because precision and recall are relative measures.
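
To illustrate the note on Spearman's coefficient, here is a minimal sketch in plain Python. The numbers are made up for illustration and are not taken from Table 2; the function names ''average_ranks'', ''pearson'' and ''spearman'' are hypothetical helpers, not part of Hjerson. The sketch computes the coefficient as the Pearson correlation of the ranks, which is the standard definition.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

def spearman(x, y):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up error rates (in %) for four hypothetical systems.
human = [10.2, 14.8, 12.1, 19.5]
automatic = [21.0, 30.3, 25.7, 41.9]

# The automatic values are roughly twice the human ones, but both lists
# order the systems identically, so the coefficient is (numerically) one.
print(round(spearman(human, automatic), 6))
```

The absolute values differ a lot, yet the coefficient is one because only the ordering of the systems enters the computation.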

===== Conclusion =====

The paper was well presented and, although it is one of the easier ones this semester, it brought some interesting ideas, mostly for those who (like me) had not thought much about this problem before.