Estimating Reproducibility of Psychology (No. 68): An Open Post-Publication Peer-Review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.  One key finding was that only 36% of 97 attempts to reproduce a significant result succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies that ask why a particular result could or could not be replicated.  The main reason is probably that it is a daunting task to examine all studies in detail. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article 

The article “Why People Are Reluctant to Tempt Fate” by Risen and Gilovich examined magical thinking in six experiments.  The evidence suggests that individuals are reluctant to tempt fate because it increases the accessibility of thoughts about negative outcomes. The article has been cited 58 times so far and it was cited 10 times in 2017, although the key finding failed to replicate in the OSC (Science, 2015) replication study.


Study 1

Study 1 demonstrated the basic phenomenon.  62 students read a scenario about a male student who applied to a prestigious university.  His mother sent him a t-shirt with the logo of the university. In one condition, he decided to wear the t-shirt. In the other condition, he stuffed it in the bottom drawer.  Participants rated how likely it was that the student would be accepted.  Participants thought the student was more likely to be accepted if he did not wear the t-shirt (wearing it M = 5.19, SD = 1.35; stuffed away M = 6.13, SD = 1.02), t(60) = 3.01, p = .004, d = 0.78.
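The reported test statistic and effect size can be checked roughly against the means and standard deviations. The sketch below computes a pooled-variance t and Cohen's d from summary statistics. The function name is hypothetical, and the cell size of 31 per group is an assumption (62 students split evenly); the article's actual cell sizes are not given here.

```python
import math

def pooled_t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Student's t and Cohen's d for two independent groups from summary statistics."""
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    d = (m1 - m2) / math.sqrt(pooled_var)
    # t is d divided by the standard error expressed in pooled-SD units.
    t = d / math.sqrt(1 / n1 + 1 / n2)
    return t, d, df

# Study 1 likelihood ratings: shirt stuffed away vs. shirt worn.
# n = 31 per group is an assumption, not a value reported above.
t, d, df = pooled_t_and_d(6.13, 1.02, 31, 5.19, 1.35, 31)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

Under the equal-cell assumption the result lands close to, but not exactly on, the reported t(60) = 3.01 and d = 0.78. Small gaps of this kind are expected when cell sizes are unequal or the reported values are rounded.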

Study 2

120 students participated in Study 2 (n = 30 per cell). Study 2 manipulated whether participants imagined themselves or somebody else in a scenario. The scenario was about the probability of a professor picking a student to answer a question.  The experimental factor was whether students had done the reading or not. Not having done the reading was considered tempting fate.

The ANOVA results showed a significant main effect for tempting fate (not prepared M = 3.43, SD = 2.34; prepared M = 2.53, SD = 2.24), F(1, 116) = 4.60, p = .034, d = 0.39.

Study 3

Study 3 examined whether tempting fate increases the accessibility of thoughts about negative outcomes with 211 students.  Accessibility was measured with reaction times to two scenarios matching those from Studies 1 and 2.  Participants had to indicate as quickly as possible whether the ending of a story matched the beginning of a story.

Analyses were carried out separately for each story.  Participants were faster to judge that not getting into a prestigious university was a reasonable ending after reading that a student tempted fate by wearing a t-shirt with the university logo (wearing t-shirt M = 2,671 ms, SD = 1,113) than those who read that he stuffed the shirt in the drawer (M = 3,176 ms, SD = 1,573), F(1, 171) = 11.01, p = .001, d = 0.53.

The same result was obtained for judgments of tempting fate by not doing the readings for a class (not prepared M = 2,879 ms, SD = 1,149; prepared M = 3,112 ms, SD = 1,226), F(1, 184) = 7.50, p = .007, d = 0.26.

Study 4 

Study 4 aimed to test the mediation hypothesis. Notably the sample size is much smaller than in Study 3 (N = 96 vs. N = 211).

The study used the university application scenario. For half the participants the decision was acceptance and for the other half it was rejection.

The reaction time ANOVA showed a significant interaction, F(1, 87) = 15.43.

As in Study 3, participants were faster to respond to a rejection after wearing the shirt than after not wearing it (wearing M = 3,196 ms, SD = 1,348; not wearing M = 4,324 ms, SD = 2,194), F(1, 41) = 9.13, p = .004, d = 0.93.  Surprisingly, the effect size was nearly twice as large as in Study 3.

The novel finding was that participants were faster to respond to an acceptance decision after not wearing the shirt than after wearing it (not wearing M = 2,995 ms, SD = 1,175;  wearing M = 3,551 ms, SD = 1,432),  F(1, 45) = 6.07, p = .018, d = 0.73.

Likelihood results also showed a significant interaction, F(1, 92) = 10.49, p = .002.

As in Study 2, in the rejection condition participants believed that a rejection was more likely after wearing the shirt than after putting it away (M = 5.79, SD = 1.53; M = 4.79, SD = 1.56), t(46) = 2.24, p = .030, d = 0.66.  In the new acceptance condition, participants thought that an acceptance was less likely after wearing the shirt than after putting it away (wore shirt M = 5.88, SD = 1.51;  did not wear shirt M = 6.83, SD = 1.31), t(46) = 2.35, p = .023, d = 0.69.  [The two p-values are surprisingly similar]

The mediation hypothesis was tested separately for the rejection and acceptance conditions.  For the rejection condition, the Sobel test was significant, z = 1.96, p = .05. For the acceptance condition, the result was considered to be “supported by a marginally significant Sobel (1982) test”, z = 1.91, p = .057.  [It is unlikely that two independent statistical tests produce p-values of .05 and .057]
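The Sobel test statistic is referred to a standard normal distribution, so the p-values can be recovered from the z-scores directly. A minimal sketch using only the Python standard library (the helper name is hypothetical):

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Sobel z-scores reported for the rejection and acceptance conditions of Study 4.
for z in (1.96, 1.91):
    print(f"z = {z:.2f} -> p = {two_sided_p(z):.3f}")
```

Both values sit within a hair of the .05 criterion. Because the z-scores are rounded to two decimals, the recovered p-values can differ from the reported ones in the third decimal.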

Study 5

Study 5 is the icing on the cake. It aimed to manipulate accessibility by means of a subliminal priming manipulation.  [This was 2008 when subliminal priming was considered a plausible procedure]

Participants were 111 students.

The main story was about a woman who did or did not (tempt fate) bring an umbrella when the forecast predicted rain.  The ending of the story was that it started to rain hard.

For the reaction times, the interaction between subliminal priming and the manipulation of tempting fate (the protagonist brought an umbrella or not) was significant, F(1, 85) = 5.89.

In the control condition with a nonsense prime, participants were faster to respond to the ending that it would rain if the protagonist did not bring an umbrella than when she did (no umbrella M = 2,694 ms, SD = 876; umbrella M = 3,957 ms, SD = 2,112), F(1, 43) = 15.45, p = .0003, d = 1.19.  This finding conceptually replicated Studies 3 and 4.

In the priming condition, no significant effect of tempting fate was observed (no umbrella M = 2,749 ms, SD = 971; umbrella M = 2,770 ms, SD = 1,032).

For the likelihood judgments, the interaction was only marginally significant, F(1, 86) = 3.62, p = .06.

However, in the control condition with nonsense primes, the typical tempt fate effect was significant (no umbrella M = 6.96, SD = 1.31; umbrella M = 6.15, SD = 1.46), t(44) = 2.00, p = .052 (reported as p = .05), d = 0.58.

The tempt fate effect was not observed in the priming condition when participants were subliminally primed with rain (no umbrella M = 7.11, SD = 1.56; umbrella M = 7.16, SD = 1.41).

As in Study 4, “the mediated relation was supported by a marginally significant Sobel (1982) test”, z = 1.88, p = .06.  It is unlikely to get p = .05, p = .06 and p = .06 in three independent mediation tests.
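One rough way to quantify how unusual such similar results are is to ask how little the three Sobel z-scores vary. The sketch below assumes that the three tests are independent and that each z-score carries unit sampling variance, so the scaled sample variance can be compared with a chi-square distribution. This variance check is not an analysis reported in the original article, the function name is hypothetical, and the closed-form chi-square CDF used here only covers exactly three values (2 degrees of freedom).

```python
import math
from statistics import variance

def insufficient_variance_p(z_scores):
    """Left-tail probability of a variance this small among three independent z-scores.

    Assumes unit sampling variance per z-score, so (k - 1) * variance
    follows a chi-square distribution with k - 1 degrees of freedom.
    """
    k = len(z_scores)
    if k != 3:
        raise ValueError("this closed form only handles exactly three z-scores (df = 2)")
    chi2 = (k - 1) * variance(z_scores)
    # Chi-square CDF with 2 degrees of freedom: 1 - exp(-x / 2).
    return -math.expm1(-chi2 / 2)

# Sobel z-scores from Study 4 (rejection, acceptance) and Study 5.
p = insufficient_variance_p([1.96, 1.91, 1.88])
print(f"P(variance this small or smaller) = {p:.4f}")
```

Under these assumptions the probability comes out well below 1%, which is consistent with the concern that the three mediation tests are too similar to be independent, unselected results.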

Study 6

Study 6 is the last study and the study that was chosen for the replication attempt.

122 students participated.  Study 6 used the scenario of being called on by a professor either prepared or not prepared (tempting fate).  The novel feature was a cognitive load manipulation.

The interaction between load manipulation and tempting fate manipulation was significant, F(1, 116) = 4.15, p = .044.

The no-load condition was a replication of Study 2 and replicated a significant effect of tempting fate (not prepared M = 2.93, SD = 2.16; prepared M = 1.90, SD = 1.42), t(58) = 2.19, p = .033, d = 0.58.

Under the load condition, the effect was even more pronounced (not prepared M = 5.27, SD = 2.36; prepared M = 2.70, SD = 2.17), t(58) = 4.38, p = .00005, d = 1.15.
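The effect sizes for the two simple-effect tests in Study 6 can be approximated from the t-values alone. The sketch below uses one common conversion, d = 2t / sqrt(df), which assumes two independent groups of equal size; the function name is hypothetical.

```python
import math

def d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t (equal group sizes assumed)."""
    return 2 * t / math.sqrt(df)

# No-load and load conditions of Study 6, both tested with t(58).
for label, t in (("no load", 2.19), ("load", 4.38)):
    print(f"{label}: d = {d_from_t(t, 58):.2f}")
```

With these inputs the conversion agrees with the effect sizes given for the two conditions. Other statistics in this review may have been computed from means and standard deviations instead, so the same formula need not match everywhere.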

A comparison of participants in the tempting fate condition showed a significant difference between the load and the no-load condition, t(58) = 3.99, p = .0002, d = 0.98.

Overall the results suggest that some questionable research practices were used (e.g., mediation tests p = .05, .06, .06).  The interaction effect in Study 6 with the load condition was also just significant and may not replicate.  However, the main effect of the tempting fate manipulation on likelihood judgments was obtained in all studies and might replicate.

Replication Study 

The replication study used an MTurk sample. The sample size was larger than in the original study (N = 226 vs. 122).

The load manipulation led to higher likelihood estimates of being called on, suggesting that the load manipulation was effective even with MTurk participants, F(1, 122) = 10.28.

However, the study did not replicate the interaction effect, F(1, 122) = 0.002.  More surprisingly, it also failed to show a main effect for the tempting-fate manipulation, F(1,122) = 0.50, p = .480.

One possible reason for the failure to replicate the tempting fate effect in this study could be the use of a school/university scenario (being called on by a professor) with MTurk participants, who are older.

However, the results for the same scenario in the original article are not very strong.

In Study 2, the p-value was p = .034 and in the no-load condition in Study 6 the p-value was p = .033.  Thus, neither the interaction with load nor the main effect of the tempting fate manipulation is strongly supported in the original article.
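To give a sense of how weak p-values of this size are, they can be converted to z-scores and then to observed power, the probability of a significant result in an exact replication if the true effect equaled the observed one. This is a sketch under that strong assumption, using a normal approximation and a two-sided alpha of .05; the function name is hypothetical, and observed power from a single study is a noisy estimate.

```python
from statistics import NormalDist

def observed_power(p, alpha=0.05):
    """Observed power implied by a two-sided p-value (normal approximation)."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p / 2)        # z-score matching the reported p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)   # critical value, about 1.96 for alpha = .05
    # Probability of landing beyond either critical value when the true mean is z_obs.
    return (1 - nd.cdf(z_crit - z_obs)) + nd.cdf(-z_crit - z_obs)

# p-values for the professor scenario in Study 2 and the no-load condition of Study 6.
for p in (0.034, 0.033):
    print(f"p = {p} -> observed power = {observed_power(p):.2f}")
```

Both results imply observed power only a little above 50%, so even without any selection for significance an exact replication would be expected to fail fairly often.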


It is never possible to show definitively that QRPs were used. It is possible that the use of QRPs in the original article explains the replication failure, although other explanations are also possible.  The most plausible alternative explanation would be the use of an MTurk sample.  A replication study in a student sample or a replication study of one of the other scenarios would be desirable.









