
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

courses:rg:2012:sigtest-mt-zilka [2012/11/14 17:19]
zilka created
courses:rg:2012:sigtest-mt-zilka [2012/11/14 17:45]
zilka
Line 32: Line 32:
 Can you reformulate Section 4 using this view? What is the observed test statistic and what is the null hypothesis?
  
 +====== Presentation ======
 +  * We answered:
 +    * Question 1: the BLEU scores are 1.0 for case 1, 0.0 (or some smoothed value) for case 2, and 0.2 for case 3
 +    * Question 2: broad sampling, i.e. each subset takes samples that lie far apart in the data -> {data_1, data_101, data_201, ...}
  
 +===== Section 3 =====
 +  * motivation: we don't usually have 30k sentences for testing, so we need an approximate method to obtain reliable scores
 +  * method: divide the test set into 100 smaller test sets (300 sentences each)
 +    * consecutive samples: the BLEU score of the individual sets varies within ±8 %
 +    * non-consecutive samples (spread far apart): the BLEU score of the individual sets varies much less, within ±1.5 %
 +  * they assume that comparing the outputs of 2 different MT systems is no different from comparing the outputs of 1 MT system trained on different data
 +    * Lukas Zilka objected to this assumption: they should have run experiments to support the claim, as nothing suggests that we can generalize like that
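
The two ways of cutting the test set into 100 subsets can be sketched as follows. This is only an illustration: the function names `consecutive_split` and `broad_split` are hypothetical, and integer sentence ids stand in for the real sentences.

```python
def consecutive_split(items, n_sets):
    """Split items into n_sets contiguous blocks of equal size."""
    size = len(items) // n_sets
    return [items[i * size:(i + 1) * size] for i in range(n_sets)]


def broad_split(items, n_sets):
    """Split items into n_sets interleaved subsets.

    Subset k takes every n_sets-th item starting at position k,
    so its members are spread over the whole test set.
    """
    usable = (len(items) // n_sets) * n_sets  # drop any remainder
    return [items[k:usable:n_sets] for k in range(n_sets)]


# Stand-in data (assumption): ids 1..30000 instead of real sentences.
sentences = list(range(1, 30001))
consecutive = consecutive_split(sentences, 100)
broad = broad_split(sentences, 100)

print(consecutive[0][:3])  # first ids of the first contiguous block
print(broad[0][:3])        # first ids of the first interleaved subset
```

With 100 subsets, the first interleaved subset starts with ids 1, 101, 201, which matches the {data_1, data_101, data_201, ...} pattern from Question 2.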
 +
 +===== Section 4, 5 =====
 +  * we cannot use Student's t-distribution to estimate a confidence interval for BLEU, because BLEU cannot be written as a sum of per-sentence terms, which is what we would need to obtain a mean and a variance
 +  * so we estimate the confidence intervals with randomized test set generation: e.g. we build 1000 new test sets of 300 sentences each out of our small test set of 300 sentences (i.e. we draw samples with replacement from the small test set, so we should get 1000 different test sets)
 +  * answer to Question 3: they do not assume any particular distribution of the BLEU scores of the 1000 test sets (i.e. their method would work regardless of whether the distribution is normal, uniform or any other), but it is perhaps normally distributed
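
A minimal sketch of the resampling procedure described above, under stated assumptions: `corpus_precision` is a toy corpus-level score (pooled matched n-grams over pooled total n-grams) standing in for BLEU, the per-sentence counts are randomly generated, and the interval is read off as percentiles of the sorted resampled scores. None of these names or numbers come from the paper.

```python
import random


def corpus_precision(stats):
    """Toy stand-in for BLEU: pooled matched / pooled total n-gram counts.

    Like BLEU, it is computed over the whole test set and is not a mean
    of per-sentence scores.
    """
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0


def bootstrap_interval(stats, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval from resampling with replacement."""
    rng = random.Random(seed)
    n = len(stats)
    scores = sorted(
        metric([stats[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    k = int(n_resamples * alpha / 2)  # number of scores cut off at each tail
    return scores[k], scores[n_resamples - 1 - k]


# Synthetic (matched, total) counts for 300 sentences (assumption).
data_rng = random.Random(42)
test_set = []
for _ in range(300):
    total = data_rng.randint(5, 30)
    test_set.append((data_rng.randint(0, total), total))

point = corpus_precision(test_set)
low, high = bootstrap_interval(test_set, corpus_precision)
print(f"score = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

No distributional assumption enters the computation: the interval comes directly from the empirical spread of the 1000 resampled scores.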
 +
 +===== Section 6 =====
 +
 +
 +
 +
 +  * **Section 3** describes the data
