Diederik Stapel represents everything that has gone wrong in experimental social psychology. Until 2011, he was seen as a successful scientist who made important contributions to the literature on social priming. In the article “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations” he presented 8 studies that showed that social comparisons can occur in response to stimuli that were presented without awareness (subliminally). The results were published in the top social psychology journal of the American Psychological Association (APA), and APA issued a press release for the general public about this work.

In 2011, an investigation into Diederik Stapel’s research practices revealed scientific fraud, which resulted in over 50 retractions (Retraction Watch), including the article on unconscious social comparisons (Retraction Notice). In a book, Diederik Stapel told his story about his motives and practices, but the book is not detailed enough to explain how particular datasets were fabricated. All we know is that he used a number of different methods that range from making up datasets to the use of questionable research practices that increase the chance of producing a significant result. These practices are widely used and are not considered scientific fraud, although the end result is the same: published results no longer provide credible empirical evidence for the claims made in a published article.

I had two hypotheses. First, the data could be entirely made up. When researchers make up fake data they are likely to overestimate the real effect sizes and produce data that show the predicted pattern much more clearly than real data would. In this case, bias tests would not show a problem with the data. The only evidence that the data are fake would be that the evidence is stronger than in other studies that relied on real data.

In contrast, a researcher who starts with real data and then uses questionable practices is likely to use as few of these practices as possible because this makes it easier to justify the questionable decisions. For example, removing 10% of the data may seem justified, especially if some rationale for exclusion can be found. However, removing 60% of the data cannot be justified. The researcher only needs to use these practices until they produce the desired outcome, namely a p-value below .05 (or at least very close to .05). Because further use of questionable practices is unnecessary and harder to justify, the researcher will stop at that point rather than go on to produce stronger evidence. As a result, we would expect a large number of just significant results.

There are two bias tests that detect the latter form of fabricating significant results by means of questionable statistical methods: the Replicability-Index (R-Index) and the Test of Insufficient Variance (TIVA). If Stapel used questionable statistical practices to produce just significant results, R-Index and TIVA would show evidence of bias.

The article reported 8 studies. The table shows the key finding of each study.

Study | Statistic | p | z | OP (observed power) |
1 | F(1,28)=4.47 | 0.044 | 2.02 | 0.52 |
2A | F(1,38)=4.51 | 0.040 | 2.05 | 0.54 |
2B | F(1,32)=4.20 | 0.049 | 1.97 | 0.50 |
2C | F(1,38)=4.13 | 0.049 | 1.97 | 0.50 |
3 | F(1,42)=4.46 | 0.041 | 2.05 | 0.53 |
4 | F(2,49)=3.61 | 0.034 | 2.11 | 0.56 |
5 | F(1,29)=7.04 | 0.013 | 2.49 | 0.70 |
6 | F(1,55)=3.90 | 0.053 | 1.93 | 0.49 |

All results were interpreted as evidence for an effect and the p-value for Study 6 was reported as p = .05.

All p-values are at or below .053 but greater than .01. This is an unlikely outcome because sampling error should produce more variability in p-values. TIVA examines whether there is insufficient variability. First, p-values are converted into z-scores. The variance of z-scores due to sampling error alone is expected to be approximately 1. However, the observed variance is only Var(z) = 0.032. A chi-square test shows that this observed variance is unlikely to occur by chance alone, p = .00035. We would expect such an extremely small variability or even less variability in only 1 out of 2857 sets of studies by chance alone.
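The TIVA steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the code that produced the reported numbers. It assumes the rounded z-scores from the table as input, compares Var(z) times (k - 1) against a chi-square distribution with k - 1 degrees of freedom, and uses a hand-written series for the chi-square left tail so that only the standard library is needed. The function names `p_to_z`, `chi2_cdf`, and `tiva` are made up for this example.

```python
import math
from statistics import NormalDist, variance

# z-scores copied from the table above (rounded to two decimals).
Z_SCORES = [2.02, 2.05, 1.97, 1.97, 2.05, 2.11, 2.49, 1.93]


def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


def chi2_cdf(x: float, df: int, terms: int = 200) -> float:
    """Left-tail chi-square probability, computed with the power series
    for the regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    total = 0.0
    for n in range(terms):
        # Term n of the series is t^(a+n) * exp(-t) / Gamma(a+n+1),
        # evaluated in log space to avoid overflow.
        total += math.exp((a + n) * math.log(t) - t - math.lgamma(a + n + 1.0))
    return total


def tiva(z_scores):
    """Return (variance of z, left-tail p-value) for a set of z-scores."""
    var_z = variance(z_scores)  # sample variance, denominator k - 1
    df = len(z_scores) - 1
    # With an expected variance of 1, var_z * df is referred to chi-square(df).
    return var_z, chi2_cdf(var_z * df, df)


if __name__ == "__main__":
    var_z, p_left = tiva(Z_SCORES)
    print(f"Var(z) = {var_z:.3f}")
    print(f"left-tail p = {p_left:.6f}")
```

The variance reproduces the Var(z) = 0.032 reported above. The left-tail probability printed here is based on the rounded table values and on the degrees-of-freedom convention assumed in this sketch, so it should be read as a rough check and may not match the reported p-value exactly.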

The last column transforms z-scores into a measure of observed power. Observed power is an estimate of the probability of obtaining a significant result under the assumption that the observed effect size matches the population effect size. These estimates are influenced by sampling error. To get a more reliable estimate of the probability of a successful outcome, the R-Index uses the median. The median is 53%. It is unlikely that a set of 8 studies with a 53% chance of obtaining a significant result produced significant results in all studies. This finding shows that the reported success rate is not credible. To make matters worse, the estimate of the probability of obtaining a significant result is itself inflated when a set of studies contains too many significant results. To correct for this bias, the R-Index computes the inflation rate, defined as the success rate minus the median observed power. With 53% probability of success and 100% success rate, the inflation rate is 47%. To correct for inflation, the inflation rate is subtracted from the median observed power, which yields an R-Index of 53% - 47% = 6%. Based on this value, it is extremely unlikely that a researcher would obtain a significant result if they actually replicated the original studies exactly. The published results show that Stapel could not have produced these results without the help of questionable methods, which also means nobody else can reproduce these results.
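The R-Index arithmetic above can be sketched as follows. This is an illustration under stated assumptions rather than the original computation: observed power is approximated as the probability that a z-test with true mean equal to the observed z exceeds the two-sided critical value of about 1.96 (ignoring the negligible opposite tail), the inputs are the rounded z-scores from the table, and the success rate is set to 100% because all 8 results were interpreted as significant. The function names are invented for this example.

```python
from statistics import NormalDist, median

# z-scores copied from the table above (rounded to two decimals).
Z_SCORES = [2.02, 2.05, 1.97, 1.97, 2.05, 2.11, 2.49, 1.93]


def observed_power(z: float, alpha: float = 0.05) -> float:
    """Approximate observed power for an absolute z-score, assuming the
    observed effect equals the population effect (two-sided alpha)."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return 1.0 - NormalDist().cdf(crit - z)


def r_index(z_scores, success_rate: float = 1.0):
    """Return (median observed power, inflation rate, R-Index)."""
    med_power = median([observed_power(z) for z in z_scores])
    inflation = success_rate - med_power
    return med_power, inflation, med_power - inflation


if __name__ == "__main__":
    med_power, inflation, r = r_index(Z_SCORES)
    print(f"median observed power = {med_power:.2f}")
    print(f"inflation rate        = {inflation:.2f}")
    print(f"R-Index               = {r:.2f}")
    # Chance that 8 independent studies all succeed at the median power.
    print(f"P(8 of 8 significant) = {med_power ** len(Z_SCORES):.4f}")
```

The last line makes the "unlikely" claim concrete: if each study really had only about a 53% chance of success, eight successes in a row would be expected in well under 1% of such sets of studies.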

In conclusion, bias tests suggest that Stapel actually collected data and failed to find supporting evidence for his hypotheses. He then used questionable practices until the results were statistically significant. It seems unlikely that he outright faked these data and intentionally produced a p-value of .053 and reported it as p = .05. However, statistical analysis can only provide suggestive evidence and only Stapel knows what he did to get these results.