Estimating Reproducibility of Psychology (No. 111): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, few detailed examinations of individual studies have tried to determine why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to examine all studies in detail. Another reason is that replication failures can have several causes, and each study may have a different explanation. This makes it important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

The article examines anchoring effects.  When individuals are uncertain about a quantity (the price of a house, the height of Mount Everest), their estimates can be influenced by some arbitrary prior number.  The anchoring effect is a robust phenomenon that has been replicated in one of the first large multi-lab replication projects (Klein et al., 2014).

This article, titled “Precision of the anchor influences the amount of adjustment,” tested in six studies the hypothesis that the anchoring effect is larger (i.e., the adjustment away from the anchor is smaller) when the anchor is precise than when it is rounded.


Study 1

43 students participated in this study, which manipulated the type of anchor between subjects (rounded, precise over, precise under; about 14 per cell).

The main effect of the manipulation was significant, F(2,40) = 10.94.

Study 2

85 students participated in this study. It manipulated the anchor (rounded vs. precise under) and the range of plausible values (narrow vs. broad).

The study replicated a main effect of anchor, F(1,81) = 22.23.

Study 3

45 students participated in this study.

Study 3 added a condition with information that made the rounded anchor more credible.

The results were significant, F(2,42) = 23.07. A follow-up test showed that participants continued to be influenced more by a precise anchor than by a rounded anchor even with additional information that the rounded anchor was credible, F(1, 42) = 20.80.

Study 4a 

This study was picked for the replication attempt.

As the motivation to adjust increases and the number of units of adjustment increases correspondingly, the amount of adjustment on the coarse-resolution scale should increase at a faster rate than the amount of adjustment on the fine-resolution scale (i.e., motivation to adjust and scale resolution should interact).

The high-motivation-to-adjust condition was created by removing information from the scenarios used in Experiment 2 (the scenarios from Experiment 2 were used without alteration in the low-motivation-to-adjust condition). For example, sentences in the plasma-TV scenario that encouraged a slight adjustment (‘‘items are priced very close to their actual cost . . . actual cost would be only slightly less than $5,000’’) were replaced with a sentence that encouraged more adjustment (‘‘What is your estimate of the TV’s actual cost?’’).

As in Study 1, the width of the scale unit was manipulated with the precision of the anchor (i.e., rounded anchor for broad width and precise anchor for narrow width).

Study 4a had 59 participants.

Study 4a showed an interaction between the motivation-to-adjust manipulation and the scale-width manipulation, F(1, 55) = 6.88.

Study 4b

Study 4b had 149 participants and also showed a significant result, F(1,145) = 4.01, p = .047.

Study 5 

Study 5 used home-sales data for 12,581 home sales.  The study found a significant effect of list-price precision on the sales price, F(1, 12577) = 23.88, with list price as a covariate.
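To put the reported F-tests on a common scale, they can be converted into p-values. The sketch below does this with the Python standard library only. The function names (f_pvalue, _betai, _betacf) are illustrative, and the incomplete-beta routine is a textbook continued-fraction implementation, not something taken from the original article or the replication report.

```python
import math


def _betacf(a, b, x, max_iter=1000, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betai(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(ln_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_pvalue(f, df1, df2):
    """Upper-tail probability of an F(df1, df2) statistic."""
    x = df2 / (df2 + df1 * f)
    return _betai(df2 / 2.0, df1 / 2.0, x)


# F-tests as reported in the summary of the original article.
reported = [
    ("Study 1", 10.94, 2, 40),
    ("Study 2", 22.23, 1, 81),
    ("Study 3", 23.07, 2, 42),
    ("Study 4a", 6.88, 1, 55),
    ("Study 4b", 4.01, 1, 145),
    ("Study 5", 23.88, 1, 12577),
]

for label, f, df1, df2 in reported:
    print(f"{label}: F({df1}, {df2}) = {f:.2f}, p = {f_pvalue(f, df1, df2):.2g}")
```

For Study 4b the function should return a value close to the reported p = .047, which serves as a check on the sketch.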


In conclusion, all of the results showed strong statistical evidence against the null hypothesis except for the pair of Studies 4a and 4b. It is remarkable that Study 4b, a close replication of Study 4a, produced a just-significant result with a much larger sample size than Study 4a (149 vs. 59). This pattern of results suggests that the sample size is not independent of the result and that the evidence for this effect could be exaggerated by the use of optional stopping (collecting more data until p < .05).
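The way optional stopping inflates the false-positive rate can be illustrated with a small simulation. The sketch below is a toy example under loud assumptions: it uses a one-sample z-test with known standard deviation and a true effect of zero, peeks at the data after every 10 added participants, and borrows the sample sizes 59 and 149 only as convenient starting and maximum values. It does not model the original 2 x 2 design and says nothing about how the original data were actually collected.

```python
import random
from statistics import NormalDist


def p_value_z(sample):
    """Two-sided p-value of a one-sample z-test (known SD = 1, null mean = 0)."""
    n = len(sample)
    z = sum(sample) / n * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_study(rng, n_start, n_max, batch, alpha=0.05):
    """Simulate one study under the null; return True if it ends with p < alpha.

    The data are tested at n_start and again after every added batch,
    stopping at the first significant result or once n_max is reached.
    """
    data = [rng.gauss(0, 1) for _ in range(n_start)]
    while True:
        if p_value_z(data) < alpha:
            return True
        if len(data) >= n_max:
            return False
        data.extend(rng.gauss(0, 1) for _ in range(batch))


def false_positive_rate(n_start, n_max, batch, n_sims=5000, seed=1):
    """Share of simulated null studies that end with a significant result."""
    rng = random.Random(seed)
    hits = sum(run_study(rng, n_start, n_max, batch) for _ in range(n_sims))
    return hits / n_sims


# Hypothetical settings: a single test at n = 149 versus peeking from n = 59 onward.
fixed_n = false_positive_rate(n_start=149, n_max=149, batch=10)
peeking = false_positive_rate(n_start=59, n_max=149, batch=10)
print(f"fixed n:           {fixed_n:.3f}")
print(f"optional stopping: {peeking:.3f}")
```

With a single test at a fixed sample size, the rate should stay close to the nominal 5%. With repeated peeking it rises above that level, which is why a just-significant result obtained with a larger sample raises the question of optional stopping.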

Replication Study

The replication study did not use Study 5, which was not an experiment. Study 4a was chosen over 4b because it used a more direct manipulation of motivation.

The replication report states the goal of the replication study as replicating two main effects and the interaction.

The results show a main effect of the motivation manipulation, F(1,116) = 71.06, a main effect of anchor precision, F(1,116) = 6.28, but no significant interaction, F < 1.

The data form shows the interaction effect as the main result in the original study, F(1, 55) = 6.88, but the effect is miscoded as a main effect.  The replication result is entered as the main effect for the anchor precision manipulation, F(1, 116) = 6.28 and this significant result is scored as a successful replication of the original study.

However, the key finding in the original article was the interaction effect.  No statistical tests of main effects are reported.

In Experiment 4a, there was a Motivation to Adjust x Scale Unit Width interaction, F(1, 55) = 6.88, prep = .947, omega2 = .02. The difference in the amount of adjustment between the rounded and precise-anchor conditions increased as the motivation to adjust went from low (Mprecise = -0.76, Mrounded = -0.23, Mdifference = 0.53), F(1, 55) = 15.76, prep = .994, omega2 = .06, to high (Mprecise = -0.04, Mrounded = 0.98, Mdifference = 1.02), F(1, 55) = 60.55, prep = .996, omega2 = .25.

This leads me to the conclusion that the classification of this study as a successful replication is a coding mistake. The critical interaction was not replicated.






