False-Positive Psychology

Joseph P. Simmons, Leif D. Nelson, Uri Simonsohn: False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, Psychological Science, 2011.

Questions

  1. Are the described issues (researcher degrees of freedom etc.) also relevant for NLP research (papers)? Can you name some similarities and differences (concerning the described issues) between NLP and psychological research?
  2. What does “p ≤ 0.05” mean? I.e., what is the definition of the p-value?
  3. What does “Correcting alpha levels” mean? Give an example.

Answers

1.

Most of the described issues are relevant to NLP as well (cf. hyper-parameters, unreported technical details, tokenization, …).
When human evaluation is involved in NLP, it shares many methodological properties and problems with psychology.
However, many NLP tasks have only automatic evaluation (based on human-annotated gold data).

PSY: “find evidence that an effect exists”
NLP: “our method for solving xy is better”

NLP: easier to replicate experiments (“code and Makefiles” may be published with the paper)
PSY: replication costs money and time, and you need different participants (so they are not influenced)

The last point is quite important. One might suggest that the same person should listen to both the Beatles and Kalimba (with some time in between), but there is a risk that the first experiment has a long-term effect that influences (skews) the second one.

2.

The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (see Wikipedia: P-value)

As a formula for a one-tailed test: p-value = P( X≥x | H0)
where X is the statistic we are measuring (e.g. the difference between the average age of Kalimba-listeners and the average age of Beatles-listeners),
x is the value of X we have actually measured (e.g. 1.4 years),
and H0 is the null hypothesis (no effect, no difference between the two groups, i.e. the true difference is 0, so X has a normal distribution with mean 0).
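
The formula above can be evaluated directly once the spread of X under H0 is known. The following is a minimal sketch, assuming X is normally distributed with mean 0 under H0 (as stated above) and a known standard deviation; the value sigma = 0.8 years and the function name one_tailed_p_value are illustrative assumptions, not taken from the paper.

```python
import math


def one_tailed_p_value(x, sigma):
    """Return P(X >= x | H0), assuming X ~ Normal(0, sigma^2) under H0."""
    z = x / sigma  # observed value in units of standard deviations
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # x = 1.4 years is the example value from the text;
    # sigma = 0.8 is a made-up standard deviation for illustration only.
    p = one_tailed_p_value(1.4, sigma=0.8)
    print(f"one-tailed p-value: {p:.4f}")
```

In practice the standard deviation is estimated from the data and a t-test is used instead; the sketch only illustrates what P( X≥x | H0) refers to.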

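If you set the significance level to the traditional 0.05, a false positive is a case where p<0.05 although the null hypothesis holds.
The false-positive rate is therefore a property of the whole testing procedure, whereas the p-value is computed from a single observed result:
false-positive-rate =
= P(p-value < 0.05 & H0)
= P( P( X≥x | H0) < 0.05 & H0) ≠ p-value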
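
The difference between a single p-value and the rate of false positives can be illustrated by simulation. The sketch below simulates only experiments in which H0 is true, so the fraction it reports is the rate of false positives among true-null experiments (the joint probability above would additionally be multiplied by the share of experiments in which H0 holds). It assumes the statistic is standard normal under H0; the function names and the number of simulated experiments are illustrative assumptions.

```python
import math
import random


def one_tailed_p(x, sigma=1.0):
    """P(X >= x | H0), assuming X ~ Normal(0, sigma^2) under H0."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2)))


def simulate_false_positive_rate(n_experiments=100_000, alpha=0.05, seed=0):
    """Fraction of true-null experiments that end up with p < alpha."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        x = rng.gauss(0.0, 1.0)  # H0 holds: the statistic has mean 0
        if one_tailed_p(x) < alpha:
            false_positives += 1
    return false_positives / n_experiments


if __name__ == "__main__":
    rate = simulate_false_positive_rate()
    print(f"false positives among true-null experiments: {rate:.3f}")
```

Each simulated experiment has its own p-value, while the reported fraction should stay close to the chosen threshold as long as the analysis is fixed in advance.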

3.

Alpha was originally defined by Neyman & Pearson as the Type I error rate, but this is incompatible with the p-value and Fisher's theory of significance testing.
Alpha is also used as a name for the “significance level” - a threshold for the p-value, which is traditionally set to 0.05.
When multiple hypotheses are tested, we should lower this threshold (but how?).
See XKCD.
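
One common answer to the "but how?" above is the Bonferroni correction, which divides alpha by the number of tests; it is only one of several possible corrections and is not necessarily the one the paper has in mind. The sketch below assumes m = 20 independent tests (an illustrative number, in the spirit of testing many variants of the same hypothesis) and compares the chance of at least one false positive with and without the correction. The function names are hypothetical.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test threshold under the Bonferroni correction."""
    return alpha / m


if __name__ == "__main__":
    alpha, m = 0.05, 20  # m = 20 is an assumed number of tests
    corrected = bonferroni_alpha(alpha, m)
    print(f"uncorrected chance of a false positive: {familywise_error_rate(alpha, m):.3f}")
    print(f"corrected per-test threshold: {corrected:.4f}")
    print(f"corrected chance of a false positive: {familywise_error_rate(corrected, m):.3f}")
```

The correction keeps the overall chance of a false positive at or below the original alpha, at the cost of making each individual test harder to pass.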