Estimating Reproducibility of Psychology (No. 165): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to determine why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article 

The article “The Value Heuristic in Judgments of Relative Frequency” was published as a Short Report in Psychological Science.   The article has been cited only 25 times overall and was not cited at all in 2017.


The authors suggest that they have identified a new process by which people judge the relative frequency of objects.

Estimating the relative frequency of a class of objects or events is fundamental in subjective probability assessments and decision making (Estes, 1976), and research has long shown that people rely on heuristics for making these judgments (Gilovich, Griffin, & Kahneman, 2002). In this report, we identify a novel heuristic for making these judgments, the value heuristic: People judge the frequency of a class of objects on the basis of the subjective value of the objects.

As my dissertation was about frequency judgments of emotions, I am familiar with the frequency estimation literature, especially the estimation of valued objects like positive and negative emotions. My reading of the literature suggests that this hypothesis is inconsistent with prior research because frequency judgments are often made on the basis of a fast, automatic, parallel search of episodic memory (e.g., Hintzman, 1988). Thus, value might only indirectly influence frequency estimates if it influences the accessibility of exemplars.

The authors present a single experiment to support their hypothesis.


68 students participated in this study.  5 were excluded for a final N of 63 students.

During the learning phase of the study, participants were exposed to 57 pictures of birds and 57 pictures of flowers.

Participants were then told that they would receive 2 cents for each picture from one of the two categories. The experimental manipulation was whether participants would be rewarded for bird or flower pictures.

The dependent variables were frequency estimates of the number of pictures in each category. Specifically, the analysis examined whether participants gave a lower, equal, or higher estimate for the rewarded category than for the other category.

When flowers were rewarded, 12 participants had higher estimates for flowers and 15 had higher estimates for birds.

When birds were rewarded, 21 participants had higher estimates for birds and 8 had higher estimates for flowers.

A chi-square test showed a just significant effect that was driven by the condition that rewarded birds, presumably because there was also a main effect of birds vs. flowers (birds are more memorable and accessible).

Chi2(1, N = 56) = 4.51,  p = .037.
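To illustrate what "just significant" implies, the sketch below converts the reported chi-square into an approximate observed power. This is an illustration under stated assumptions, not a calculation taken from the original article: it treats the square root of a 1-df chi-square as a z-score and assumes the true effect equals the observed effect. The function name `observed_power_from_chi2` is hypothetical.

```python
import math
from statistics import NormalDist


def observed_power_from_chi2(chi2, alpha=0.05):
    """Approximate observed power for a 1-df chi-square test.

    Assumption: sqrt(chi2) is treated as the observed z-score, and the
    true effect is taken to be equal to the observed effect.
    """
    std_normal = NormalDist()
    z_obs = math.sqrt(chi2)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided criterion
    # Probability that a new z-score falls beyond either critical value
    upper = 1 - std_normal.cdf(z_crit - z_obs)
    lower = std_normal.cdf(-z_crit - z_obs)
    return upper + lower


power = observed_power_from_chi2(4.51)
print(f"observed power = {power:.2f}")
```

Under these assumptions the observed power is only a little above 50%, so a failed replication with a similar sample size would not be unusual even if the observed effect size were accurate.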


81 students participated in the replication study.  After exclusion of 4 participants the final sample size was N = 77.

When flowers were rewarded, 16 participants had higher estimates for flowers and 11 had higher estimates for birds.

When birds were rewarded, 10 participants had higher estimates for birds and 14 had higher estimates for flowers.

The remaining participants were tied.

The chi-square test was not significant.

Chi2(1, N = 51) = 1.57,  p = .21.
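The two chi-square tests can be checked from the reported counts. The sketch below is a minimal recomputation, assuming a Pearson chi-square without continuity correction and assuming that the two counts for each condition are entered in the order in which they are listed above (one row per reward condition). The helper `chi_square_2x2` is a hypothetical name; the p-value uses the identity that holds for one degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 count table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Rows: flowers rewarded, birds rewarded; counts in the order reported above
original = [[12, 15], [21, 8]]
replication = [[16, 11], [10, 14]]

for label, table in (("original", original), ("replication", replication)):
    chi2, p = chi_square_2x2(table)
    n = sum(sum(row) for row in table)
    print(f"{label}: chi2(1, N = {n}) = {chi2:.2f}, p = {p:.3f}")
```

Under these assumptions the statistics should land close to the reported values. Small discrepancies in the second decimal or in the p-value could reflect rounding or a different variant of the test.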


The original article tested a novel and controversial hypothesis that was not grounded in the large cognitive literature on frequency estimation. The article relied on a just significant result in a single study as evidence. It is not surprising that a replication study failed to replicate the finding. The article had very little impact. In hindsight, this study does not meet the high bar for acceptance into a high-impact journal like Psychological Science. However, hindsight is 20/20, and it is well known that traditional peer review is an imperfect predictor of replicability and relevance.






