Estimating Reproducibility of Psychology (No. 136): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to explain why a particular result could or could not be replicated.  The main reason is probably that examining every study in detail is a daunting task. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The authors of this article are prominent figures in the replication crisis of social psychology.  Kathleen Vohs was co-author of a highly criticized article that suggested will-power depends on blood-glucose levels. The evidence supporting this claim has been challenged for methodological and statistical reasons (Kurzban, 2010; Schimmack, 2012).  She also co-authored numerous articles on ego-depletion that are difficult to replicate (Schimmack, 2016; Inzlicht, 2016).  In a response with Roy Baumeister titled “A misguided effort with elusive implications” she dismissed these problems, but her own replication project produced very similar results (SPSP, 2018). Some of her social priming studies also failed to replicate (Vadillo, Hardwicke, & Shanks, 2016).


The z-curve plot for Vohs shows clear evidence that her articles contain too many significant results (75% including marginally significant ones, with only 55% average power).  The average probability of successfully replicating a randomly drawn finding from Vohs’ articles is 55%. However, this average is obtained with substantial heterogeneity.  Just significant z-scores (2 to 2.5) have only an average estimated replicability of 33%. Even z-scores in the range from 2.5 to 3 have only an average replicability of 40%.   This suggests that p-values in the range between .05 and .005 are unlikely to replicate in exact replication studies.
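To make the link between these z-score ranges and p-values concrete, here is a minimal Python sketch that converts absolute z-scores into two-sided p-values under a standard normal distribution. It is only an illustration of the conversion, not part of the z-curve analysis itself, and the function name two_sided_p is made up for this example.

```python
from statistics import NormalDist


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Boundaries of the z-score ranges discussed in the text.
for z in (2.0, 2.5, 3.0):
    print(f"z = {z:.1f} -> two-sided p = {two_sided_p(z):.4f}")
```

Under this conversion, z = 2 corresponds to a p-value of about .046 and z = 3 to a p-value of about .003, so the two ranges together cover roughly the p-values between .05 and .003.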

In a controversial article with the title “The Truth is Wearing Off”, Jonathan Schooler even predicted that replication studies might often fail.  The article was controversial because Schooler suggested that all effects diminish over time (I wish this were true for the effect of eating chocolate on weight, but so far it hasn’t happened).  Schooler is also known for an influential article about “verbal overshadowing” in eyewitness identifications.  Francis (2012) demonstrated that the published results were too good to be true and the first Registered Replication Report failed to replicate one of the five studies and replicated another one only with a much smaller effect size.


The z-curve plot for Schooler looks very different. The average estimated power is higher.  However, there is a drop at z = 2.6 that is difficult to explain with a normal sampling distribution.

Based on this context information, predictions about replicability depend on the p-values of the actual studies.  Just significant p-values are unlikely to replicate, but smaller p-values (larger z-scores) might replicate.

Summary of Original Article

The article examines moral behavior.  The main hypothesis is that beliefs about free will vs. determinism influence cheating.  Whereas belief in free will encourages moral behavior,  beliefs that behavior is determined make it easier to cheat.

Study 1

30 students were randomly assigned to one of two conditions (n = 15 per condition).

Participants in the anti-free-will condition read a passage written by Francis Crick, a Nobel laureate, suggesting that free will is an illusion.  In the control condition, they read about consciousness.

Then participants were asked to work on math problems on a computer. They were given a cover story that the computer program had a glitch and would present the correct answers, but they could fix this problem by pressing the space bar as soon as the question appeared.  They were asked to do so and to try to solve the problems on their own.

The article does not mention whether participants were probed for suspicion, and data from all participants were included in the analysis.

The main finding was that participants cheated more in the “no-free-will” condition than in the control condition, t(28) = 3.04, p = .005.
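The reported p-value can be checked from the t-statistic and its degrees of freedom. The following is a minimal sketch using only the Python standard library; it integrates the density of the t distribution numerically with Simpson's rule. The function names t_pdf and two_sided_t_p are made up for this example, and a statistics package would normally be used instead.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_t_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p-value: 1 minus the area between -|t| and |t|.

    The area from 0 to |t| is computed with Simpson's rule
    (steps must be even) and doubled by symmetry.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3
    return 1.0 - 2.0 * half_area


# Reported result of Study 1: t(28) = 3.04
print(round(two_sided_t_p(3.04, 28), 4))
```

The result should land close to the reported p = .005, which means the reported test statistic and p-value are consistent with each other.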

Study 2

Study 2 addressed several limitations of Study 1. Although the sample size was larger, the design included 5 conditions (n = 24/25 per condition).

The main dependent variable was the number of correct answers on 15 reading comprehension, mathematical, and logic problems that were used by Vohs in a previous study (Schmeichel, Vohs, & Baumeister, 2003).  For each correct answer, participants received $1.

Two conditions manipulated free will beliefs, but participants could not cheat. The comparison of these two conditions shows whether the manipulation influences actual performance, but there was no major difference (based on the figure, $7.50 in the control condition vs. $7 in the no-free-will condition).

In the cheating conditions, the experimenter received a fake phone call, told participants that the experimenter had to leave, and asked them to continue, score their own answers, and pay themselves.  Surprisingly, neither the free-will nor the neutral condition showed any signs of cheating ($7.20 and $7.30, respectively).  However, the determinism condition increased the average pay-out to $10.50.

One problem for the statistical analysis is that the researchers “did not have participants’ answer sheets in the three self-paid conditions; therefore, we divided the number of $1 coins taken by each group by the number of group members to arrive at an average self-payment” (p. 52).

The authors then report a significant ANOVA result, F(4, 114) = 5.68, p = .0003.
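As with the t-test in Study 1, the reported p-value can be checked against the F distribution with 4 and 114 degrees of freedom. The sketch below uses only the Python standard library and integrates the F density numerically; the function names f_pdf and f_upper_tail_p are made up for this example. It only checks that the p-value matches the reported F-value and degrees of freedom; it says nothing about how the F-value itself was obtained.

```python
import math


def f_pdf(x: float, d1: int, d2: int) -> float:
    """Density of the F distribution with (d1, d2) degrees of freedom."""
    if x <= 0.0:
        return 0.0  # correct at x = 0 only when d1 > 2, as is the case here
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = (0.5 * d1 * math.log(d1 / d2)
               + (0.5 * d1 - 1) * math.log(x)
               - 0.5 * (d1 + d2) * math.log1p(d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)


def f_upper_tail_p(f: float, d1: int, d2: int, steps: int = 4000) -> float:
    """P(F >= f), computed as 1 minus the Simpson's-rule area from 0 to f."""
    h = f / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1.0 - total * h / 3


# Reported result of Study 2: F(4, 114) = 5.68
print(round(f_upper_tail_p(5.68, 4, 114), 5))
```

The result should round to the reported p = .0003.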

However, without information about the standard deviation in each cell, it is not possible to compute an Analysis of Variance.  This part of the analysis is not explained in the article.


The replication team also had some problems with Study 2.

“We originally intended to carry out Study 2, following the Reproducibility Project’s system of working from the back of an article. However, on corresponding with the authors we discovered that it had arisen in post-publication correspondence about analytic methods that the actual effect size found was smaller than reported, although the overall conclusion remained the same.”

As a result, they decided to replicate Study 1.

The sample size of the replication study was nearly twice as large as the sample of the original study (N = 58 vs. 30).

The results did not replicate the significant result of the original study, t(56) = 0.77, p = .44.
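To compare the two results on a common scale, the t-values can be converted into standardized effect sizes (Cohen's d). The following is a minimal sketch; the helper name cohens_d_from_t is made up for this example, and the equal split of the replication sample into two groups of 29 is an assumption, because only the total N = 58 is given here.

```python
import math


def cohens_d_from_t(t: float, n1: int, n2: int) -> float:
    """Convert an independent-samples t statistic to Cohen's d."""
    return t * math.sqrt(1 / n1 + 1 / n2)


# Original Study 1: t(28) = 3.04 with n = 15 per condition (as reported).
d_original = cohens_d_from_t(3.04, 15, 15)

# Replication: t(56) = 0.77. ASSUMPTION: N = 58 split evenly, 29 per group.
d_replication = cohens_d_from_t(0.77, 29, 29)

print(f"original d = {d_original:.2f}, replication d = {d_replication:.2f}")
```

Under these assumptions, the original effect size is about d = 1.1, whereas the replication effect size is about d = 0.2.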


Study 1 was underpowered.  Even nearly doubling the sample size was not sufficient to obtain significance in the replication study.   Study 2 was superior, but it was reported so poorly that the replication team could not replicate the study.
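The point about power can be illustrated with a rough calculation. The sketch below uses a normal approximation to the two-sample t-test, so the numbers are slightly optimistic for small samples. The effect sizes d = 0.2, 0.5, and 0.8 are conventional benchmarks used here as hypothetical true effects, not estimates from either study, and the equal split of the replication sample into 29 per group is an assumption. The function name approx_power is made up for this example.

```python
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test for effect size d.

    Uses the normal approximation: the test statistic is treated as
    normal with mean d * sqrt(n / 2) and standard deviation 1.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)


# Hypothetical true effect sizes (benchmarks, NOT estimates from the studies).
for d in (0.2, 0.5, 0.8):
    original = approx_power(d, 15)     # original study: n = 15 per condition
    replication = approx_power(d, 29)  # ASSUMPTION: 29 per group in replication
    print(f"d = {d:.1f}: original {original:.2f}, replication {replication:.2f}")
```

For a hypothetical medium effect of d = 0.5, this approximation gives a power of roughly 28% for the original design and roughly 48% for the replication design, so even the larger sample would miss such an effect about half of the time.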





