Joseph P. Simmons, Leif D. Nelson, Uri Simonsohn: False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, Psychological Science, 2011.
Most of the described issues are relevant for NLP as well (cf. hyper-parameters, unreported technical details, tokenization,…).
When human evaluation is involved in NLP, it shares many methodological properties/problems with Psychology.
However, in many NLP tasks we have (only) automatic evaluation (based on human-annotated gold data).
Psy: “find evidence that an effect exists”
NLP: “our method for solving xy is better”
NLP: easier to replicate experiments (“code and Makefiles” may be published with the paper)
PSY: replication costs money and time, you need different people (so they are not influenced)
The last point is quite important. You may suggest the same person should listen to both Beatles and Kalimba (after some time), but there is a risk of long-term effect of the first experiment influencing (skewing) the second one.
P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (see Wikipedia P-value)
Using a formula for one-tailed test: p-value = P( X≥x | H0)
where X is the statistics we are measuring (e.g. difference between average age of Kalimba-listeners and average age of Beatles-listeners),
x is the value of X we have actually measured (e.g. 1.4 years),
and H0 is the null hypothesis (no effect, no difference between the two groups, i.e. X=0, the difference has normal distribution with mean 0).
If you set the traditional significance level to 0.05, you get a false positive case when p<0.05, but the null hypothesis holds.
false-positive-rate =
= P(p-value < 0.05 & H0)
= P( P( X≥x | H0) < 0.05 & H0) != p-value
Alpha was originally defined by Neyman & Pearson as Type I error rate, but this is incompatible with p-value and Fisher's theory of significance testing.
Alpha is also used as a name for the “significance level” - a threshold for p-value, which is traditionally set to 0.05.
When multiple experiments are tested, we should decrease this threshold (but how?).
See XKCD.