Category Archives: Publication Bias

Did Social Psychologists Change their Research Practice: A Case Study (Adam D. Galinsky)

December 28, 2019Adam D. Galinsky, Experimental Social Psychology, Publication Bias, Questionable Research Practices, Replicability, Response to Replication Crisis, Social PsychologyUlrich Schimmack

Social psychologists have responded differently to the replication crisis. Some eminent social psychologists were at the end of their careers when the crisis started in 2011. Their research output in the 2010s is too small for quantitative investigations. Thus, it makes sense to look at the younger generation of future leaders in the field.

By quantitative measures one of the leading social psychologists with an active lab in the 2010s is Adam D. Galinsky. Web of Science shows that he is on track to become a social psychologists with an H-Index of 100. He currently has 213 articles with 14,004 citations and an H-Index of 62.

Several of Adam D. Galinsky’s Psychological Science articles published between 2009-2012 were examined by Greg Francis and showed signs of questionable research practices. This is to be expected because the use of QRPs was the norm in social psychology. The more interesting question is how a productive and influential social psychologists like Adam D. Galinsky responded to the replication crisis. Given his large number of articles, it is possible to examine this quantiatively by z-curving the automatically extracted test-statistics of the articles. Although automatic extraction has the problem that it does not distinguish between focal and non-focal tests, it has the advantage that it is 100% objective and can reveal changes in research practices over time.

The good news is that results have become more replicable. The average replicability for all tests was 48% (95%CI = 42%-57%) before 2012 and 61% (95%CI = 54%-69%) since then. Zooming in on p-values between .05 and .01, replicability increased from 23% to 38%.

The observed discovery rate has not changed (71% vs. 69%). Thus, articles do not report more non-significant results, although it is not clear whether articles report more non-significant focal (and fewer non-significant non-focal tests). This observed discovery rates are significantly higher than the estimated discovery rates before 2012, 26% (9%-36%) and after 2012, 40% (18%-59%). Thus, there is evidence of selection bias; that is published results are selected for significance. The extend of selection bias can be seen visually by comparing the histogram of observed non-significant results to the predicted densities shown by the grey line. This ‘file-drawer’ has decreased but is still clearly visible after 2012.

Conclusion

Social psychology is a large field and the response to the replication crisis has been mixed. Whereas some social psychologists are leaders in open science practices and have changed their research practices considerably, others have not. At present, journals still reward significant results and researchers who continue to use questionable research practices continue to have an advantage. The good news is that it is now possible to examine and quantify the use of questionable research practices and to take this information into account. The 2020s will show whether the field will finally take information about replicability into account and reward slow and solid results more than fast and wobbly results.

Statistics Wars: Don’t change alpha. Change the null-hypothesis!

December 18, 2019NHST, Nil-Hypothesis, Publication Bias, Questionable Research Practices, StatisticsEffect SizeUlrich Schimmack

The statistics wars go back all the way to Fisher, Pearson, and Neyman-Pearson(Jr), and there is no end in sight. I have no illusion that I will be able to end these debates, but at least I can offer a fresh perspective. Lately, statisticians and empirical researchers like me who dabble in statistics have been debating whether p-values should be banned and if they are not banned outright whether they should be compared to a criterion value of .05 or .005 or be chosen on an individual basis. Others have advocated the use of Bayes-Factors.

However, most of these proposals have focused on the traditional approach to test the null-hypothesis that the effect size is zero. Cohen (1994) called this the nil-hypothesis to emphasize that this is only one of many ways to specify the hypothesis that is to be rejected in order to provide evidence for a hypothesis.

For example, a nil-hypothesis is that the difference in the average height of men and women is exactly zero). Many statisticians have pointed out that a precise null-hypothesis is often wrong a priori and that little information is provided by rejecting it. The only way to make nil-hypothesis testing meaningful is to think about the nil-hypothesis as a boundary value that distinguishes two opposing hypothesis. One hypothesis is that men are taller than women and the other is that women are taller than men. When data allow rejecting the nil-hypothesis, the direction of the mean difference in the sample makes it possible to reject one of the two directional hypotheses. That is, if the sample mean height of men is higher than the sample mean height of women, the hypothesis that women are taller than men can be rejected.

However, the use of the nil-hypothesis as a boundary value does not solve another problem of nil-hypothesis testing. Namely, specifying the null-hypothesis as a point value makes it impossible to find evidence for it. That is, we could never show that men and women have the same height or the same intelligence or the same life-satisfaction. The reason is that the population difference will always be different from zero, even if this difference is too small to be practically meaningful. A related problem is that rejecting the nil-hypothesis provides no information about effect sizes. A significant result can be obtained with a large effect size and with a small effect size.

In conclusion, nil-hypothesis testing has a number of problems, and many criticism of null-hypothesis testing are really criticism of nil-hypothesis testing. A simple solution to the problem of nil-hypothesis testing is to change the null-hypothesis by specifying a minimal effect size that makes a finding theoretically or practically useful. Although this effect size can vary from research question to research question, Cohen’s criteria for standardized effect sizes can give some guidance about reasonable values for a minimal effect size. Using the example of mean differences, Cohen considered an effect size of d = .2 small, but meaningful. So, it makes sense to set a criterion for a minimum effect size somewhere between 0 and .2, and d = .1 seems a reasonable value.

We can even apply this criterion retrospectively to published studies with some interesting implications for the interpretation of published results. Shifting the null-hypothesis from d = 0 to d < abs(.1), we are essentially raising the criterion value that a test statistic has to meet in order to be significant. Let me illustrate this first with a simple one-sample t-test with N = 100.

Conveniently, the sampling error for N = 100 is 1/sqrt(100) = .1. To achieve significance with alpha = .05 (two-tailed) and H0:d = 0, the test statistic has to be greater than t.crit = 1.98. However, if we change H0 to d > abs(.1), the t-distribution is now centered at the t-value that is expected for an effect size of d = .1. The criterion value to get significance is now t.crit = 3.01. Thus, some published results that were able to reject the nil-hypothesis would be non-significant when the null-hypothesis specifies a range of values between d = -.1 to .1.

If the null-hypothesis is specified in terms of standardized effect sizes, the critical values vary as a function of sample size. For example, with N = 10 the critical t-value is 2.67, with N = 100 it is 3.01, and with N = 1,000 it is 5.14. An alternative approach is to specify H0 in terms of a fixed test statistic which implies different effect sizes for the boundary value. For example, with t = 2.5, the effect sizes would be d = .06 with N = 10, d = .05 with N = 100, and d = .02 with N = 1000. This makes sense because researchers should use larger samples to test weaker effects. The example also shows that a t-value of 2.5 specifies a very narrow range of values around zero. However, the example was based on one-sample t-tests. For the typical comparison of two groups, a criterion value of 2.5 corresponds to an effect size of d = .1 with N = 100. So, while t = 2.5 is arbitrary, it is a meaningful value to test for statistical significance. With N = 100, t(98) = 2.5 corresponds to an alpha criterion of .014, which is a bit more stringent than .05, but not as strict as a criterion value of .005. With N = 100, alpha = .005 corresponds to a criterion value of t.crit = 2.87, which implies a boundary value of d = .17.

In conclusion, statistical significance depends on the specification of the null-hypothesis. While it is common to specify the null-hypothesis as an effect size of zero, this is neither necessary, nor ideal. An alternative approach is to (re)specify the null-hypothesis in terms of a minimum effect size that makes a finding theoretically interesting or practically important. If the population effect size is below this value, the results could also be used to show that a hypothesis is false. Examination of various effect sizes shows that criterion values in the range between 2 and 3 provide can be used to define reasonable boundary values that vary around a value of d = .1

The problem with t-distributions is that they differ as a function of the degrees of freedom. To create a common metric it is possible to convert t-values into p-values and then to convert the p-values into z-scores. A z-score of 2.5 corresponds to a p-value of .01 (exact .0124) and an effect size of d = .13 with N = 100 in a between-subject design. This seems to be a reasonable criterion value to evaluate statistical significance when the null-hypothesis is defined as a range of smallish values around zero and alpha is .05.

Shifting the significance criterion in this way can dramatically change the evaluation of published results, especially results that are just significant, p < .05 & p > .01. There have been concerns that many of these results have been obtained with questionable research practices that were used to reject the nil-hypothesis. However, these results would not be strong enough to reject the modified hypothesis that the population effect size exceeds a minimum value of theoretical or practical significance. Thus, no debates about the use of questionable research practices are needed. There is also no need to reduce the type-I error rate at the expense of increasing the type-II error rate. It can be simply noted that the evidence is insufficient to reject the hypothesis that the effect size is greater than zero but too small to be important. This would shift any debates towards discussion about effect sizes and proponents of theories would have to make clear which effect sizes they consider to be theoretically important. I believe that this would be more productive than quibbling over alpha levels.

To demonstrate the implications of redefining the null-hypothesis, I use the results of the replicability project (Open Science Collaboration, 2015). The first z-curve shows the traditional analysis for the nil-hypothesis and alpha = .05, which has z = 1.96 as the criterion value for statistical significance (red vertical line).

Figure 1 shows that 86 out of 90 studies reported a test-statistic that exceeded the criterion value of 1.96 for H0:d = 0, alpha = .05 (two-tailed). The other four studies met the criterion for marginal significance (alpha = .10, two-tailed or .05 one-tailed). The figure also shows that the distribution of observed z-scores is not consistent with sampling error. The steep drop at z = 1.96 is inconsistent with random sampling error. A comparison of the observed discovery rate (86/90, 96%) and the expected discovery rate 43% shows evidence that the published results are selected from a larger set of studies/tests with non-significant results. Even the upper limit of the confidence interval around this estimate (71%) is well below the observed discovery rate, showing evidence of publication bias. Z-curve estimates that only 60% of the published results would reproduce a significant result in an actual replication attempt. The actual success rate for these studies was 39%.

Results look different when the null-hypothesis is changed to correspond to a range of effect sizes around zero that correspond to a criterion value of z = 2.5. Along with shifting the significance criterion, z-curve is also only fitted to studies that produced z-scores greater than 2.5. As questionable research practices have a particularly strong effect on the distribution of just significant results, the new estimates are less influenced by these practices.

Figure 2 shows the results. Most important, the observed discovery rate dropped from 96% to 61%, indicating that many of the original results provided just enough evidence to reject the nil-hypothesis, but not enough evidence to rule out even small effect sizes. The observed discovery rate is also more in line with the expected discovery rate. Thus, some of the missing non-significant results may have been published as just significant results. This is also implied by the greater frequency of results with z-scores between 2 and 2.5 than the model predicts (grey curve). However, the expected replication rate of 63% is still much higher than the actual replication rate with a criterion value of 2.5 (33%). Thus, other factors may contribute to the low success rate in the actual replication studies of the replicability project.

Conclusion

In conclusion, statisticians have been arguing about p-values, significance levels, and Bayes-Factors. Proponents of Bayes-Factors have argued that their approach is supreme because Bayes-Factors can provide evidence for the null-hypothesis. I argue that this is wrong because it is theoretically impossible to demonstrate that a population effect size is exactly zero or any other specific value. A better solution is to specify the null-hypothesis as a range of values that are too small to be meaningful. This makes it theoretically possible to demonstrate that a population effect size is above or below the boundary value. This approach can also be applied retrospectively to published studies. I illustrate this by defining the null-hypothesis as the region of effect sizes that is defined by the effect size that corresponds to a z-score of 2.5. While a z-score of 2.5 corresponds to p = .01 (two-tailed) for the nil-hypothesis, I use this criterion value to maintain an error rate of 5% and to change the null-hypothesis to a range of values around zero that becomes smaller as sample sizes increase.

As p-hacking is often used to just reject the nil-hypothesis, changing the null-hypothesis to a range of values around zero makes many ‘significant’ results non-significant. That is, the evidence is too weak to exclude even trivial effect sizes. This does not mean that the hypothesis is wrong or that original authors did p-hack their data. However, it does mean that they can no longer point to their original results as empirical evidence. Rather they have to conduct new studies to demonstrate with larger samples that they can reject the new null-hypothesis that the predicted effect meets some minimal standard of practical or theoretical significance. With a clear criterion value for significance, authors also risk to obtain evidence that positively contradicts their predictions. Thus, the biggest improvement that arises form rethinking null-hypothesis testing is that authors have to specify effect sizes a priori and that that studies can provide evidence for and against a zero. Thus, changing the nil-hypothesis to a null-hypothesis with a non-null value makes it possible to provide evidence for or against a theory. In contrast, computing Bayes-Factors in favor of the nil-hypothesis fails to achieve this goal because the nil-hypothesis is always wrong, the real question is only how wrong.

Where Do Non-Significant Results in Meta-Analysis Come From?

June 4, 2019Meta-Analysis, Publication BiasUlrich Schimmack

It is well known that focal hypothesis tests in psychology journals nearly always reject the null-hypothesis (Sterling, 1959; Sterling et al., 1995). However, meta-analyses often contain a fairly large number of non-significant results. To my knowledge, the emergence of non-significant results in meta-analysis has not been examined systematically (happy to be proven wrong). Here I used the extremely well-done meta-analysis of money priming studies to explore this issue (Lodder, Ong, Grasman, & Wicherts, 2019).

I downloaded their data and computed z-scores by (1) dividing Cohen’s d by sampling errror (2/sqrt(N)) to compute t-values, (2) convert the absolute t-values into two-sided p-values, and (3) converting the p-values into absolute z-scores. The z-scores were submitted to a z-curve analysis (Brunner & Schimmack, 2019).

The first figure shows the z-curve for all test-statistics. Out of 282 tests, only 116 (41%) are significant. This finding is surprising, given the typical discovery rates over 90% in psychology journals. The figure also shows that the observed discovery rate of 41% is higher than the expected discovery rate of 29%, although the difference is relatively small and the confidence intervals overlap. This might suggest that publication bias in the money priming literature is not a serious problem. On the other hand, meta-analysis may mask the presence of publication bias in the published literature for a number of reasons.

Published vs. Unpublished Studies

Publication bias implies that studies with non-significant results end up in the proverbial file-drawer. Meta-analysts try to correct for publication bias by soliciting unpublished studies. The money-priming meta-analysis included 113 unpublished studies.

Figure 2 shows the z-curve for these studies. The observed discovery rate is slightly lower than for the full set of studies, 29%, and more consistent with the expected discovery rate, 25%. Thus, there this set of studies appears to be unbiased.

The complementary finding for published studies (Figure 3) is that the observed discovery rate increases, 49%, while the expected discovery rate remains low, 31%. Thus, published articles report a higher percentage of significant results without more statistical power to produce significant results.

A New Type of Publications: Independent Replication Studies

In response to concerns about publication bias and questionable research practices, psychology journals have become more willing to publish null-results. An emerging format are pre-registered replication studies with the explicit aim of probing the credibility of published results. The money priming meta-analysis included 47 independent replication studies.

Figure 4 shows that independent replication studies had a very low observed discovery rate, 4%, that is matched by a very low expected discovery rate, 5%. It is remarkable that the discovery rate for replication studies is lower than the discovery rate for unpublished studies. One reason for this discrepancy is that significance alone is not sufficient to get published and authors may be selective in the sharing of unpublished results.

Removing independent replication studies from the set of published studies further increases the observed discovery rate, 66%. Given the low power of replication studies, the expected discovery rate also increases somewhat, but it is notably lower than the observed discovery rate, 35%. The difference is now large enough to be statistically significant, despite the rather wide confidence interval around the expected discovery rate estimate.

Coding of Interaction Effects

After a (true or false) effect has been established in the literature, follow up studies often examine boundary conditions and moderators of an effect. Evidence for moderation is typically demonstrated with interaction effects that are sometimes followed by contrast analysis for different groups. One way to code these studies would be to focus on the main effect and to ignore the moderator analysis. However, meta-analysts often split the sample and treat different subgroups as independent samples. This can produce a large number of non-significant results because a moderator analysis allows for the fact that the effect emerged only in one group. The resulting non-significant results may provide false evidence of honest reporting of results because bias tests rely on the focal moderator effect to examine publication bias.

The next figure is based on studies that involved an interaction hypothesis. The observed discovery rate, 42%, is slightly higher than the expected discovery rate, 25%, but bias is relatively mild and interaction effects contribute 34 non-significant results to the meta-analysis.

The analysis of the published main effect shows a dramatically different pattern. The observed discovery rate increased to 56/67 = 84%, while the expected discovery rate remained low with 27%. The 95%CI do not overlap, demonstrating that the large file-drawer of missing studies is not just a chance finding.

I also examined more closely the 7 non-significant results in this set of studies.

Gino and Mogliner (2014) reported results of a money priming study with cheating as the dependent variable. There were 98 participants in 3 conditions. Results were analyzed with percentage of cheating participants and extent of cheating. The percentage of cheating participants produced a significant contrast of the money priming and control condition, chi2(1, N = 65) = 3.97. However, the meta-analysis used the extent of cheating dependent variable, which should only a marginally significant effect with a one-tailed p-value of .07. “Simple contrasts revealed that participants cheated more in the money condition (M = 4.41, SD = 4.25) than in both the control condition (M = 2.76, SD = 3.96; p = .07) and the time condition (M = 1.55, SD = 2.41; p = .002).” Thus, this non-significant results was presented as supporting evidence in the original article.
Jin, Z., Shiomura, K., & Jiang, L. (2015) conducted a priming studies with reaction times as dependent variables. This design is different from social priming studies in the meta-analysis. Moreover, money priming effects were examined within-participants, and the study produced several significant complex interaction effects. Thus, this study also does not count as a published failure to replicate money priming effects.
Mukherjee, S., Nargundkar, M., & Manjaly, J. A. (2014) examined the influence of money primes on various satisfaction judgments. Study 1 used a small sample of N = 48 participants with three dependent variables. Two achieved significance, but the meta-analysis aggregated across DVs, which resulted in a non-significant outcome. Study 2 used a larger sample and replicated significance for two outcomes. It was not included in the meta-analysis. In this case, aggregation of DVs explains a non-significant result in the meta-analysis, while the original article reported significant results.
I was unable to retrieve this article, but the abstract suggests that the article reports a significant interaction. ” We found that although money-primed reactance in control trials in which the majority provided correct responses, this effect vanished in critical trials in which the majority provided incorrect answers.”
[https://www.sbp-journal.com/index.php/sbp/article/view/3227]
Wierzbicki, J., & Zawadzka, A. (2014) published two studies. Study 1 reported a significant result. Study 2 added a non-significant result to the meta-analysis. Although the effect for money priming was not significant, this study reported a significant effect for credit-card priming and a money priming x morality interaction effect. Thus, the article also did not report a money-priming failure as the key finding.
Gasiorowska, A. (2013) is an article in Polish.
is a duplication of article 5

In conclusion, none of the 7 studies with non-significant results in the meta-analysis that were published in a journal reported that money priming had no effect on a dependent variable. All articles reported some significant results as the key finding. This further confirms how dramatically publication bias distorts the evidence reported in psychology journals.

Conclusion

In this blog post, I examined the discrepancy between null-results in journal articles and in meta-analysis, using a meta-analysis of money priming. While the meta-analysis suggested that publication bias is relatively modest, published articles showed clear evidence of publication bias with an observed discovery rate of 89%, while the expected discovery rate was only 27%.

Three factors contributed to this discrepancy: (a) the inclusion of unpublished studies, (b) independent replication studies, and (c) the coding of interaction effects as separate effects for subgroups rather than coding the main effect.

After correcting for publication bias, expected discovery rates are consistently low with estimates around 30%. The main exception are the independent replication studies that found no evidence at all. Overall, these results confirm that published money priming studies and other social priming studies cannot be trusted because the published studies overestimate replicability and effect sizes.

It is not the aim of this blog post to examine whether some money priming paradigms can produce replicable effects. The main goal was to explain why publication bias in meta-analysis is often small, when publication bias in the published literature is large. The results show that several factors contribute to this discrepancy and that the inclusion of unpublished studies, independent replication studies, and coding of effects explain most of these discrepancies.

Social psychology textbook audit: Something smells fishy

April 7, 2019Experimental Social Psychology, Publication Bias, Questionable Research Practices, Social Psychology, Social Psychology TextbookUlrich Schimmack

Social psychology textbook like colorful laboratory experiments that illustrate a theoretical point. As famous social psychologist Daryl Bem stated, he considered his experiments more illustrations of what could happen than empirical tests of what actually happens. Unfortunately, social psychology textbooks make it less obvious that the results of highlighted studies should not be generalized to real life.

Myers and Twenge (2019) tell the story of fishy smells.

In a laboratory experiment, exposure to a fishy smell caused people to be suspicious of each other and cooperate less—priming notions of a shady deal as “fishy” (Lee & Schwarz, 2012). All these effects occurred without the participants’ conscious awareness of the scent and its influence.

They don’t even mention some other fun facts about this study. To make sure that the effect is not just a mood effect induced by bad odors in general, fishy smells were contrasted with fart smells, and the effect seemed to be limited to fishy smells.

The article was published in the top journal for experimental social psychology (JPSP:ASC) and is relatively highly cited.

However, the studies reported in this article smell a bit fishy and should be consumed with a grain of salt and a lot of lemon. The problem is that all of the results are significant, which is highly unlikely unless studies have very high statistical power (Schimmack, 2012).

And it even works the other way around.

And making people think about suspicion, also makes them think about fish, in theory.

Suspicion also makes you be more sensitive to fishy smells.

Undergraduate students may not realize what the problem with these studies is. After all, they all worked out; that is they produced a p-value less than .05, which is supposed to ensure that no more than 1 out of 20 studies are a false positive result. As all of these studies are significant, it is extremely unlikely that all of them are false positives. So, we would have to infer that suspicion is related to fishy smells in our minds.

However, since 2012 it is clear that we have to draw another conclusion. The reason is that results in social psychology articles like this one smell fishy and suggest that the authors are telling us a fun story, but they are not telling us what really happened in their lab. It is extremely unlikely that the authors reported all of their studies and data analyses that they conducted. Instead they may have used a variety of so-called questionable research practices that increase the chances of reporting a significant result. Questionable research practices are also known as fishing for significance. These questionable research practices have the undesirable effect that they increase the type-I error rate. Thus, while the reported p-values are below .05, the risk of a false positive result is not and could be as high as 100%.

To demonstrate that researchers used questionable research practices, we can conduct a bias test. The most powerful bias test for small sets of studies is the Test of Insufficient Variance. When most p-values are just significant , p < .05 and p > .005, but always significant the results are not trustworthy because sampling error should produce more variability than we see.

The table lists the test statistics, converts the two-tailed p-values into z-scores and computes the variance of the z-scores. The variance is expected to be 1, but the actual variance is only 0.14. A chi-square test shows that this deviation is significant with p = .01. Thus, we have scientific evidence to claim that these results smell a bit fishy.

Study	test	value	df	p	z
1	t	2.22	42	0.032	2.15
2	t	2.01	79	0.048	1.98
3a	chisq	4.27	1	0.039	2.07
3b	chisq	6.28	1	0.012	2.51
3c	chisq	7.77	1	0.005	2.79
5	F	8.24	116	0.005	2.82
6	F	3.93	1614	0.048	1.98
VAR(z)	0.14
TIVA	0.01

Unfortunately, these results are not the only fishy results in social psychology textbooks. Thus, students of social psychology should read textbook claims with a healthy dose of skepticism. They should also ask their professors to provide information about the replicability of textbook findings. Has this study been replicated in a preregistered replication attempt? Would you think you could replicate this result in your own lab? It is time to get rid of the fishy smell and let the fresh wind of open science clean up social psychology.

We can only hope that sooner than later, articles like this will sleep with the fishes.

Most Between-Subject Experimental Social Psychology (BS-ESP) Results Could Be False: A Z-Curve Analysis of Motyl et al.’s BS-ESP results

October 11, 2018Before You Know It, Ethics, Experimental Social Psychology, False Positives, John A. Bargh, Kahneman, Nisbett, Psychology, Publication Bias, Questionable Research Practices, Replicability, Replication Crisis, Social Psychology, Statistical Power, TIVA, Z-Curve, ZcurveUlrich Schimmack

Introduction

Richard Nisbett has been an influential experimental social psychologist. His co-authored book on faulty human information processing (Nisbett & Ross, 1980) provided the foundation of experimental studies of social cognition (Fiske & Taylor, 1984). Experiments became the dominant paradigm in social psychology with success stories like Daniel Kahneman’s Noble Price for Economics and embarrassments like Diederik Staple’s numerous retractions because he fabricated data for articles published in experimental social psychology (ESP) journals.

The Stapel Debacle raised questions about the scientific standards of experimental social psychology. The reputation of Experimental Social Psychology (ESP) also took a hit when the top journal of ESP research published an article by Daryl Bem that claimed to provide evidence for extra-sensory perceptions. For example, in one study extraverts seemed to be able to foresee the location of pornographic images before a computer program determined the location. Subsequent analyses of his results and data revealed that Daryl Bem did not use scientific methods properly and that the results provide no credible empirical evidence for his claims (Francis, 2012; Schimmack, 2012; Schimmack, 2018).

More detrimental for the field of experimental social psychology was that Bem’s carefree use of scientific methods is common in experimental social psychology; in part because Bem wrote a chapter that instructed generations of experimental social psychologists how they could produce seemingly perfect results. The use of these questionable research practices explains why over 90% of published results in social psychology journals support authors’ hypotheses (Sterling, 1959; Sterling et al., 1995).

Since 2011, some psychologists have started to put the practices and results of experimental social psychologists under the microscope. The most impressive evidence comes from a project that tried to replicate a representative sample of psychological studies (Open Science Collaboration, 2015). Only a quarter of social psychology experiments could be replicated successfully.

The response by eminent social psychologists to these findings has been a classic case of motivated reasoning and denial. For example, in an interview for the Chronicle of Higher Education, Nisbett dismissed these results by attributing them to problems of the replication studies.

Nisbett has been calculating effect sizes since before most of those in the replication movement were born. And he’s a skeptic of this new generation of skeptics. For starters, Nisbett doesn’t think direct replications are efficient or sensible; instead he favors so-called conceptual replication, which is more or less taking someone else’s interesting result and putting your own spin on it. Too much navel-gazing, according to Nisbett, hampers professional development. “I’m alarmed at younger people wasting time and their careers,” he says. He thinks that Nosek’s ballyhooed finding that most psychology experiments didn’t replicate did enormous damage to the reputation of the field, and that its leaders were themselves guilty of methodological problems. And he’s annoyed that it’s led to the belief that social psychology is riddled with errors. “How do they know that?”, Nisbett asks, dropping in an expletive for emphasis.

In contrast to Nisbett’s defensive response, Noble Laureate Daniel Kahneman has expressed concerns about the replicability of BS-ESP results that he reported in his popular book “Thinking: Fast and Slow” He also wrote a letter to experimental social psychologists suggesting that they should replicate their findings. It is telling that several years later, eminent experimental social psychologists have not published self-replications of their classic findings.

Nisbett also ignores that Nosek’s findings are consistent with statistical analyses that show clear evidence of questionable research practices and evidence that published results are too good to be true (Francis, 2014). Below I present new evidence about the credibility of experimental social psychology based on a representative sample of published studies in social psychology.

How Replicable are Between-Subject Social Psychology Experiments (BS-ESP)?

Motyl and colleagues (2017) coded hundreds of articles and over one-thousand published studies in social psychology journals. They recorded information about the type of study (experimental or correlational), the design of the study (within subject vs. between-subject) and about the strength of an effect (as reflected in test statistics or p-values). After crunching the numbers, their results showed clear evidence of publication bias, but also evidence that social psychologists published some robust and replicable results.

In a separate blog-post, I agreed with this general conclusion. However, Motyl et al.’s assessment was based on a broad range of studies, including correlational studies with large samples. Few people doubt that these results would replicate, but experimental social psychologist tend to dismiss these findings because they are correlational (Nisbett, 2016).

The replicability of BS-ESP results is more doubtful because these studies often used between-subject designs with small samples, which makes it difficult to obtain statistically significant results. For example, John Bargh used only 30 participants (15 per condition) for his famous elderly priming study that failed to replicate.

I conducted a replicability analysis of BS-ESP results based on a subset of studies in Motly’s dataset. I selected only studies with between-subject experiments where participants were randomly assigned to different conditions with one degree of freedom. The studies could be comparisons of two groups or a 2 x 2 design that is also often used by experimental social psychologists to demonstrate interaction (moderator) effects. I also excluded studies with fewer than 20 participants per condition because these studies should not have been published because parametric tests require a minimum of 20 participants to be robust (Cohen, 1994).

There were k = 314 studies that fulfilled these criteria. Two-hundred-seventy-eight of these studies (89%) were statistically significant at the standard criterion of p < .05. Including marginally significant and one-sided tests, 95% were statistically significant. This success rate is consistent with Sterling’s results in the 1950s and 1990s and the results of the OSC project. For the replicability analysis, I focused on the 278 results that met the standard criterion of statistical significance.

First, I compared the mean effect size without correcting for publication bias with the bias-corrected effect size estimate using the latest version of puniform (vanAert, 2018). The mean effect size of the k = 278 studies was d = .64, while the bias-corrected puniform estimate was more than 50% lower, d = .30. The finding that actual effect sizes are approximately half of the published effect sizes is consistent with the results based on actual replication studies (OSC, 2015).

Next, I used z-curve (Brunner & Schimmack, 2018) to estimate mean power of the published studies based on the test statistics reported in the original articles. Mean power predicts the success rate if the 278 studies were exactly replicated. The advantage of using a statistical approach is that it avoids problems of carrying out exact replication studies, which is often difficult and sometimes impossible.

Figure 1. Histogram of the strength of evidence against the null-hypothesis in a representative sample of (k = 314) between-subject experiments in social psychology.

The Nominal Test of Insufficient Variance (Schimmack, 2015) showed 61% of results within the range from z = 1.96 and z = 2.80, when only a maximum of 32% is expected from a representative sample of independent tests. The z-statistic of 10.34 makes it clear that this is not a chance finding. Visual inspection shows a sharp drop at a z-score of 1.96, which corresponds to the significance criterion, p < .05, two-tailed. Taken together, these results provide clear evidence that published results are not representative of all studies that are conducted by experimental psychologists. Rather, published results are selected to provide evidence in favor of authors’ hypotheses.

The mean power of statistically significant results is 32% with a 95%CI ranging from 23% to 39%. This means that many of the studies that were published with a significant result would not reproduce a significant result in an actual replication attempt. With an estimate of 32%, the success rate is not reliably higher than the success rate for actual replication studies in the Open Science Reproducibility Project (OSC, 2015). Thus, it is clear that the replication failures are the result of shoddy research practices in the original studies rather than problems of exact replication studies.

The estimate of 32% is also consistent with my analysis of social psychology experiments in Bargh’s book “Before you know it” that draws heavily on BS-ESP results. Thus, the present results replicate previous analyses based on a set of studies that were selected by an eminent experimental social psychologist. Thus, replication failures in experimental social psychology are highly replicable.

A new statistic under development is the maximum false discovery rate; that is, the percentage of significant results that could be false positives. It is based on the fit of z-curves with different proportions of false positives (z = 0). The maximum false discovery rate is 70% with a 95%CI ranging from 50% to 85%. This means, the data are so weak that it is impossible to rule out the possibility that most BS-ESP results are false.

Conclusion

Nisbett’s questioned how critics know that ESP is riddled with errors. I answered his call for evidence by presenting a z-curve analysis of a representative set of BS-ESP results. The results are consistent with findings from actual replication studies. There is clear evidence of selection bias and consistent evidence that the majority of published BS-ESP results cannot be replicated in exact replication studies. Nisbett dismisses this evidence and attributes replication failures to problems with the replication studies. This attribution is a classic example of a self-serving attribution error; that is, the tendency to blame others for negative outcomes.

The low replicability of BS-ESP results is not surprising, given several statements by experimental social psychologists about their research practices. For example, Bem’s (2001) advised students that it is better to “err on the side of discovery” (translation: a fun false finding is better than no finding). He also shrugged off replication failures of his ESP studies with a comment that he doesn’t care whether his results replicate or not.

“I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)

A similar attitude is revealed in Baumeister’s belief that personality psychology has lost appeal because it developed its scientific method and a “gain in rigor was accomplished by a loss in interest.” I agree that fiction can be interesting, but science without rigor is science fiction.

Another social psychologist (I forgot the name) once bragged openly that he was able to produce significant results in 30% of his studies and compared this to a high batting averages in baseball. In baseball it is indeed impressive to hit a fast small ball with a bat 1 out of 3 times. However, I prefer to compare success rates of BS-ESP researchers to the performance of my students on an exam, where a 30% success rate earns them a straight F. And why would anybody watch a movie that earned a 32% average rating on rottentomatoes.com, unless they are watching it because watching bad movies can be fun (e.g., “The Room”).

The problems of BS-ESP research are by no means new. Tversky and Kahneman (1971) tried to tell psychologists decades ago that studies with low power should not be conducted. Despite decades of warnings by methodologists (Cohen, 1962, 1994), social psychologists have blissfully ignored these warnings and continue to publish meaningless statistically significant results and hiding non-significant ones. In doing so, they conducted the ultimate attribution error. They attributed the results of their studies to the behavior of their participants, while the results actually depended on their own biases that determined which studies they selected for publication.

Many experimental social psychologists prefer to ignore evidence that their research practices are flawed and published results are not credible. For example, Bargh did not mention actual replication failures of his work in his book, nor did he mention that Noble Laureate Daniel Kahneman wrote him a letter in which he described Bargh’s work as “the poster child for doubts about the integrity of psychological research.” Several years later, it is fair to say that evidence is accumulating that experimental social psychology lacks scientific integrity. It is often said that science is self-correcting. Given the lack of self-correction by experimental social psychologists, it logically follows that it is not a science; at least it doesn’t behave like one.

I doubt that members of the Society for Experimental Social Psychology (SESP) will respond to this new information any differently from the way they responded to criticism of the field in the past seven years; that is, with denial, name calling (“Shameless Little Bullies”, “Method Terrorists”, “Human Scum”), or threats of legal actions. In my opinion, the biggest failure of SESP is not the way its members conducted research in the past, but their response to valid scientific criticism of their work. As Karl Popper pointed out “True ignorance is not the absence of knowledge, but the refusal to acquire it.” Ironically, the unwillingness of experimental social psychologists to acquire inconvenient self-knowledge provides some of the strongest evidence for biases and motivated reasoning in human information processing. If only these biases could be studied in BS experiments with experimental social psychologists as participants.

Caution

The abysmal results for experimental social psychology should not be generalized to all areas of psychology. The OSC (2015) report examined the replicability of psychology, and found that cognitive studies replicated much better than experimental social psychology results. Motyl et al. (2017) found evidence that correlational results in social and personality psychology are more replicable than BS-ESP results.

It is also not fair to treat all experimental social psychologists alike. Some experimental social psychologists may have used the scientific method correctly and published credible results. The problem is to know which results are credible and which results are not. Fortunately, studies with stronger evidence (lower p-values or higher z-score) are more likely to be true. In actual replication attempts, studies with z-scores greater than 4 had an 80% chance to be successfully replicated (OSC, 2015). I provided a brief description of results that met this criterion in Motyl et al.’s dataset in the Appendix. However, it is impossible to distinguish honest results with weak evidence from results that were manipulated to show significance. Thus, over 50 years of experimental social psychology have produced many interesting ideas without empirical evidence for most of them. Sadly, even today articles are published that are no more credible than those published 10 years ago. If there can be failed sciences, experimental social psychology is one of them. Maybe it is time to create a new society for social psychologists who respect the scientific method. I suggest calling it Society of Ethical Social Psychologists (SESP), and that it adopts the ethics code of the American Physical Society (APS).

Fabrication of data or selective reporting of data with the intent to mislead or deceive is an egregious departure from the expected norms of scientific conduct, as is the theft of data or research results from others.

APPENDIX

Journal of Experimental Social Psychology

Klein, W. M. (2003). Effects of objective feedback and “single other” or “average other” social comparison feedback on performance judgments and helping behavior. Personality and Social Psychology Bulletin, 29(3), 418-429.

Df1	Df2	N	F-value
1	44	48	21.71

In this study, participants have a choice to give easy or difficult hints to a confederate after performing on a different task. The strong result shows an interaction effect between performance feedback and the way participants are rewarded for their performance. When their reward is contingent on the performance of the other students, participants gave easier hints after they received positive feedback and harder hints after they received negative feedback.

Phillips, K. W. (2003). The effects of categorically based expectations on minority influence: The importance of congruence. Personality and Social Psychology Bulletin, 29(1), 3-13.

155

158

97.02

This strong effect shows that participants were surprised when an in-group member disagreed with their opinion in a hypothetical scenario in which they made decisions with an in-group and an out-group member, z = 8.67.

Seta, J. J., Seta, C. E., & McElroy, T. (2003). Attributional biases in the service of stereotype maintenance: A schema-maintenance through compensation analysis. Personality and Social Psychology Bulletin, 29(2), 151-163.

101

112

39.80

The strong effect in Study 1 reflects different attributions of a minister’s willingness to volunteer for a charitable event. Participants assumed that the motives were more selfish and different from motives of other ministers if they were told that the minister molested a young boy and sold heroin to a teenager. These effects were qualified by a Target Identity × Inconsistency interaction, F(1, 101) = 39.80, p < .001. This interaction was interpreted via planned comparisons. As expected, participants who read about the aberrant behaviors of the minister attributed his generosity in volunteering to the dimension that was more inconsistent with the dispositional attribution of ministers—impressing others (M = 2.26)—in contrast to the same target control participants (M= 4.62), F(1, 101) = 34.06, p < .01.

Trope, Y., Gervey, B., & Bolger, N. (2003). The role of perceived control in overcoming defensive self-evaluation. Journal of Experimental Social Psychology, 39(5), 407-419.

176

190

25.61

Study 2 manipulated perceptions of changeability of attributes and valence of feedback. A third factor were self-reported abilities. The 2 way interaction showed that participants were more interested in feedback about weaknesses when attributes were perceived as changeable, z = 4.74. However, the critical test was the three-way interaction with self-perceived abilities, which was weaker and not a fully experimental design, F(1, 176) = 6.34, z = 2.24.

Brambilla, M., Sacchi, S., Pagliaro, S., & Ellemers, N. (2013). Morality and intergroup relations: Threats to safety and group image predict the desire to interact with outgroup and ingroup members. Journal of Experimental Social Psychology, 49(5), 811-821.

89.02	1	78	83	89.02
42.76	1	99	108	42.76
134.67	1	156	165	134.67

Three strong results come from this study of morality (zs > 5). In hypothetical scenarios, participants were presented with moral and immoral targets and asked about their behavioral intentions how they would interact with them. All studies showed that participants were less willing to engage with immoral targets. Other characteristics that were manipulated had no effect.

Mason, M. F., Lee, A. J., Wiley, E. A., & Ames, D. R. (2013). Precise offers are potent anchors: Conciliatory counteroffers and attributions of knowledge in negotiations. Journal of Experimental Social Psychology, 49(4), 759-763.

244

247

19.29

This study showed that recipients of a rounded offer make larger adjustments to the offer than recipients of more precise offers, z = 4.15. This effect was demonstrated in several studies. This is the strongest evidence, in part, because the sample size was the largest. So, if you put your house up for sale, you may suggest a sales price of $491,307 rather than $500,000 to get a higher counteroffer.

Pica, G., Pierro, A., Bélanger, J. J., & Kruglanski, A. W. (2013). The Motivational Dynamics of Retrieval-Induced Forgetting A Test of Cognitive Energetics Theory. Personality and Social Psychology Bulletin, 39(11), 1530-1541.

208.16

The strong effect for this analysis is a within-subject main effect. The critical effect was a mixed design three-way interaction. This effect was weaker. “.05). Of greatest importance, the three-way interaction between retrieval-practice repetition, need for closure, and OSPAN was significant, β = −.24, t = −2.25, p < .05.”

Preston, J. L., & Ritter, R. S. (2013). Different effects of religion and God on prosociality with the ingroup and outgroup. Personality and Social Psychology Bulletin, ###.

113

127

23.22

This strong effect, z = 4.59, showed that participants thought a religious leader would want them to help a family that belongs to their religious group, whereas God would want them to help a family that does not belong to the religious group (cf. These values were analyzed by one-way ANOVA on Condition (God/Leader), F(1, 113) = 23.22, p < .001, partial η2= .17. People expected the religious leader would want them to help the religious ingroup family (M = 6.71, SD = 2.67), whereas they expected God would want them to help the outgroup family (M = 4.39, SD = 2.48)). I find the dissociation between God and religious leaders interesting. The strength of the effect makes me belief that this is a replicable finding.

Sinaceur, M., Adam, H., Van Kleef, G. A., & Galinsky, A. D. (2013). The advantages of being unpredictable: How emotional inconsistency extracts concessions in negotiation. Journal of Experimental Social Psychology, 49(3), 498-508.

151

152

25.93

Study 2 produced a notable effect of manipulating emotional inconsistency on self-ratings of “sense of unpredictability” (z = 4.88). However, the key dependent variable was concession making. The effect on concession making was not as strong, F(1, 151) = 7.29, z = 2.66.

Newheiser, A. K., & Barreto, M. (2014). Hidden costs of hiding stigma: Ironic interpersonal consequences of concealing a stigmatized identity in social interactions. Journal of Experimental Social Psychology, 52, 58-70.

26.9361

Participants in this study were either told to reveal their major of study or to falsely report that they are medical students. The strong effect shows that participants who were told to lie reported feeling less authentic, z = 4.51. The effect on a second dependent variable, “belonging” (I feel accepted) was weaker, t(54) = 2.54, z = 2.20.

PERSONALITY AND SOCIAL PSYCHOLOGY BULLETIN

Simon, B., & Stürmer, S. (2003). Respect for group members: Intragroup determinants of collective identification and group-serving behavior. Personality and Social Psychology Bulletin, 29(2), 183-193.

159

163

48.75

The strong effect in this study shows a main effect of respectful vs. disrespectful feedback from a group-member on collective self-esteem; that is, feeling good about being part of the group. As predicted, a 2 x2 ANOVA revealed that collective identification (averaged over all 12 items; Cronbach’s = .84) was stronger in the respectful-treatment condition than in the disrespectful-treatment condition,M(RESP) = 3.54,M(DISRESP) = 2.59, F(1, 159) = 48.75, p < .001.

Craig, M. A., & Richeson, J. A. (2014). More diverse yet less tolerant? How the increasingly diverse racial landscape affects white Americans’ racial attitudes. Personality and Social Psychology Bulletin, 40(6) 750–761.

1	13	30	41.60
1	13	15	36.00

Two strong effects are based on studies that aimed to manipulate responses to the race IAT with stories about shifting demographics in the United States. However, the test statistics are based on the comparison of IAT scores against a value of zero and not the comparison of the experimental group and the control group. The relevant results are, t(26) = 2.07, p = .048, d = 0.84 in Study 2a and t(23) = 2.80, p = .01, d = 1.13 in Study 2b. These results are highly questionable because it is unlikely to obtain just significant results in a pair of studies. In addition, the key finding in Study 1 is also just significant, t(84) = 2.29, p = .025, as is the finding in Study 3, F(1,366) = 5.94, p = .015.

Hung, I. W., & Wyer, R. S. (2014). Effects of self-relevant perspective-taking on the impact of persuasive appeals. Personality and Social Psychology Bulletin, 40(3), 402-414.

17.42

288

300

17.42

Participants viewed a donation appeal from a charity called Pangaea. The one-page appeal described the problem of child trafficking and was either self-referential or impersonal. The strong effect was that participants in the self-referential condition were more likely to imagine themselves in the situation of the child. Participants were more likely to imagine themselves being trafficked when the appeal was self-referential than when it was impersonal (M = 4.78, SD =2.95 vs. M = 3.26, SD = 2.96 respectively), F(1, 288) = 17.42, p < .01, ω2 = .041, and this difference did not depend on the victims’ ethnicity (F < 1). Thus, taking the victims’ perspective influenced participants’ tendency to imagine themselves being trafficked without thinking about their actual similarity to the victims that were portrayed. The effect on self-reported likelihood of helping was weaker. Participants reported greater urge to help when the appeal encouraged them to take the protagonists’ perspective than when it did not (M = 5.83, SD = 2.04 vs. M = 5.18, SD = 2.31), F(1, 288) = 5.68, p < .02, ω2 = .013

Lick, D. J., & Johnson, K. L. (2014).” You Can’tTell Just by Looking!” Beliefs in the Diagnosticity of Visual Cues Explain Response Biases in Social Categorization. Personality and Social Psychology Bulletin,

164

166

47.3

The main effect of social category dimension was significant, F(1, 164) = 47.30, p < .001, indicating that participants made more stigmatizing categorizations in the sex condition (M = 15.94, SD = 0.45) relative to the religion condition (M = 12.42, SD = 4.64). This result merely shows that participants were more likely to indicate that a woman is a woman than that an atheist is an atheist based on a photograph of a person. This finding would be expected based on the greater visibility of gender than religion.

Bastian, B., Jetten, J., Chen, H., Radke, H. R., Harding, J. F., & Fasoli, F. (2013). Losing our humanity the self-dehumanizing consequences of social ostracism. Personality and Social Psychology Bulletin, 39(2), 156-169.

39.77

The strong effect in this study reveals that participants rated ostracizing somebody more immoral than a typical everyday interaction. An ANOVA, with condition as the between-subjects variable, revealed that condition had an effect on perceived immorality, F(1, 51) = 39.77, p < .001, η2 = .44, indicating that participants felt the act of ostracizing another person was more immoral (M = 3.83, SD = 1.80) compared with having an everyday interaction (M = 1.37, SD = 0.81).

PSYCHOLOGICAL SCIENCE

Kifer, Y., Heller, D., Perunovic, W. Q. E., & Galinsky, A. D. (2013). The good life of the powerful the experience of power and authenticity enhances subjective well-being. Psychological science, 24(3), 280-288.

130

132

245.5489

This strong effect is a manipulation check. The focal test provides much weaker evidence for the claim that authenticity increases wellbeing. The manipulation was successful. Participants in the high-authenticity condition (M = 4.57, SD = 0.62) reported feeling more authentic than those in the low-authenticity condition (M = 2.70, SD = 0.74), t(130) = 15.67, p < .01, d = 2.73. As predicted, participants in the high-authenticity condition (M = 0.38, SD = 1.99) reported higher levels of state SWB than those in the low-authenticity condition (M = −0.46, SD = 2.12), t(130) = 2.35, p < .05, d = 0.40.

Lerner, J. S., Li, Y., & Weber, E. U. (2012). The financial costs of sadness. Psychological science, 24(1) 72–79.

202

45.1584

Again, the strong effect is a manipulation check. The emotion-induction procedure was effective in both magnitude and specificity. Participants in the sad-state condition reported feeling more sadness (M = 3.72) than neutrality (M = 1.66), t(78) = 6.72, p < .0001. The critical test that sadness leads to financial losses produced a just significant result. Sad participants were more impatient (mean = .21, median = .04) than neutral participants (mean = .28, median = .19; Mann- Whitney z = 2.04, p = .04).

Tang, S., Shepherd, S., & Kay, A. C. (2014). Do Difficult Decisions Motivate Belief in Fate? A Test in the Context of the 2012 US Presidential Election. 25(4), 1046-1048.

180

200

102.8196

A manipulation check confirmed that participants in the similar-candidates condition saw the candidates as more similar (M = 4.41, SD = 0.80) than did participants in the different-candidates condition (M = 3.24, SD = 0.76), t(180) = 10.14, p < .001. The critical test was not statistically significant. As predicted, participants in the similar-candidates condition reported greater belief in fate (M = 3.45, SD = 1.46) than did those in the different-candidates condition (M = 3.04, SD = 1.44), t(180) = 1.92, p = .057

Caruso, E. M., Van Boven, L., Chin, M., & Ward, A. (2013). The temporal Doppler effect when the future feels closer than the past. Psychological science, 24(4) 530–536.

The strong effect revealed that participants view an event (Valentine’s Day) in the future closer to the present than an event in the past. Valentine’s Day was perceived to be closer 1 week before it happened than 1 week after it happened, t(321) = 4.56, p < .0001, d = 0.51 (Table 1). The effect met the criterion of z > 4 because the sample size was large, N = 323, indicating that experimental social psychology could benefit from larger samples to produce more credible results.

Galinsky, A. D., Wang, C. S., Whitson, J. A., Anicich, E. M., Hugenberg, K., & Bodenhausen, G. V. (2013). The reappropriation of stigmatizing labels the reciprocal relationship between power and self-labeling. Psychological science, 24(10)

The strong effect showed that participants rated a stigmatized group as having more power of a stigmatized label if they used the label themselves than when it was used by others. The stigmatized out-group was seen as possessing greater power over the label in the self-label condition (M = 5.14, SD = 1.52) than in the other-label condition (M = 3.42, SD = 1.76), t(233) = 8.04, p < .001, d = 1.05. The effect on evaluations of the label was weaker. The label was also seen as less negative in the self-label condition (M = 5.61, SD = 1.37) than in the other-label condition (M = 6.03, SD = 1.19), t(233) = 2.46, p = .01, d = 0.33. And the weakest evidence was provided for a mediation effect. We tested whether perceptions of the stigmatized group’s power mediated the link between self-labeling and stigma attenuation. The bootstrap analysis was significant, 95% bias-corrected CI = [−0.41, −0.01]. A value of 0 rather than -0.01 would render this finding non-significant. The t-value for this analysis can be approximated by dividing the mean of the confidence boundaries (-.42/2 = -.21), by an estimate of sampling error (-.21- (-0.01) = -.20 / 2 = -.10). The ratio is an estimate of the signal to noise ratio (-.21 / -.10 = 2.1). With N = 235 this t-value is similar to the z-score and the effect can be considered just significant. This result is consistent with the weak evidence in the other 7 studies in this article.

JOURNAL OF PERSONALITY AND SOCIAL PSYCHOLOGY-Attitudes and Social Cognition

[This journal published Bem’s (2011) alleged evidence in support of extrasensory perceptions]

Ruder, M., & Bless, H. (2003). Mood and the reliance on the ease of retrieval heuristic. Journal of Personality and Social Psychology, 85(1), 20.

23.46

The strong effect in this study is based on a contrast analysis for the happy condition. Supporting the first central hypothesis, happy participants responded faster than sad participants (M = 9.81 s vs. M = 14.17 s), F(1, 59) = 23.46, p < .01. The results of the 2 x 2 ANOVA analysis are reported subsequently. The differential impact of number of arguments on happy versus sad participants is reflected in a marginally significant interaction, F(1, 59) = 3.59, p = .06.

Fazio, R. H., Eiser, J. R., & Shook, N. J. (2004). Attitude formation through exploration: valence asymmetries. Journal of personality and social psychology, 87(3), 293.

18.65

The strong effect in this study reveals that participants approached stimuli that they were told were rewarding. When they learned by experience that this was not the case, they stopped approaching them. However, when they were told that stimuli were bad, they avoided them and were not able to learn that the initial information was wrong. This resulted in an interaction effect for prior (true or false) and actual information. More important, the predicted interaction was highly significant as well, F(1, 71) =18.65, p< .001.

Pierro, A., Mannetti, L., Kruglanski, A. W., & Sleeth-Keppler, D. (2004). Relevance override: on the reduced impact of” cues” under high-motivation conditions of persuasion studies. Journal of Personality and Social Psychology, 86(2), 251.

19.85

180

19.85

The strong effect in this study is based on a contrast effect following an interaction effect. Consistent with our hypothesis, the first informational set exerted a stronger attitudinal impact in the low accountability condition, simple F(1, 42) = 19.85, p < .001, M = (positive) 1.79 versus M (negative) = -.15. The pertinent two-way interaction effect was not as strong. The interaction between accountability and valence of first informational set was significant, F(1, 84) = 5.79, p = .018. For study 2, an effect of F(1,43) = 18.66 was used, but the authors emphasized the importance of the four-way interaction. Of greater theoretical interest was the four-way interaction between our independent variables, F(1, 180) = 3.922, p = .049. For Study 3, an effect of F(1,48) = 18.55 was recorded, but the authors emphasize the importance of the four-way interaction, which was not statistically significant. Of greater theoretical interest is the four-way interaction between our independent variables, F(1, 176) = 3.261, p = .073.

Clark, C. J., Luguri, J. B., Ditto, P. H., Knobe, J., Shariff, A. F., & Baumeister, R. F. (2014). Free to punish: A motivated account of free will belief. Journal of personality and social psychology, 106(4), 501.

192.3769

The strong effect in this study by my prolific friend Roy Baumeister is due to the finding that participants want to punish a robber more than an aluminum can forager. Participants also wanted to punish the robber (M _ 4.98, SD =1.07) more than the aluminum can forager (M = 1.96, SD = 1.05), t(93) = 13.87, p = .001. Less impressive is the evidence that beliefs about free will are influenced by reading about a robber or an aluminum can forager. Participants believed significantly more in free will after reading about the robber (M = 3.68, SD = 0.70) than the aluminum can forager (M _ 3.38, SD _ 0.62), t(90) = 2.23, p = .029, d = 0.47.

Yan, D. (2014). Future events are far away: Exploring the distance-on-distance effect. Journal of Personality and Social Psychology, 106(4), 514.

118

122

23.7

This strong effect reflects an effect of a construal level manipulation (thinking in abstract or concrete terms) on temporal distance judgments. The results of a 2 x 2 ANOVA on this index revealed a significant main effect of the construal level manipulation only, F(1, 118) = 23.70, p < .001. Consistent with the present prediction, participants in the superordinate condition (M = 1.81) indicated a higher temporal distance than those in the subordinate condition (M = 1.58).

Estimating Reproducibility of Psychology (No. 140): An Open Post-Publication Peer-Review

April 17, 2018Experimental Social Psychology, John A. Bargh, Kahneman, Priming, Publication Bias, Social Psychology, Thinking Fast and Slow, TIVAUlrich Schimmack

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology. One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors. Each study may have a different explanation. This means it is important to take an ideographic (study-centered) perspective.

The conclusions of these ideographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The senior author of Article 140 was John A. Bargh. Other studies by John Bargh have failed to replicate and questions about his research were raised in an open letter by Noble Laureate Daniel Kahneman, who prominently features Bargh’s work in his popular book “Thinking Fast and Slow.” John Bargh’s response to this criticisms can be described as stoic defiance. In 2017, he published a book on his work that did not mention replication failures or concerns about replicability of social priming research in general. A quantitative review of the book showed that much of the cited evidence was weak.

Summary of Original Article

The article “Keeping One’s Distance: The Influence of Spatial Distance Cues on Affect and
Evaluation” by Lawrence E. Williams and John A. Bargh was published in Psychological Science. The article has been cited 155 times overall and 17 times in 2017.

The main hypothesis is that physical distance influences social judgments without reference to the self. To provide evidence for this hypothesis the authors reported four priming studies. Physical distance was primed by plotting points in a way that suggested physical closeness or distance. Mean differences indicated greater enjoyment of media depicting embarrassment, less emotional distress from violent media, lower estimates of the number of calories in unhealthy food, and weaker reports of emotional attachments to family members and hometowns in the distant prime condition than in the close prime condition.

Study 1

73 undergraduate students were assigned to three priming conditions (n = 24 per cell; 41 female, 32 male).

After the priming manipulation, participants were given an excerpt from the book “Good in Bed.” and rated their enjoyment of an article titled “Loving a larger woman.”

An ANOVA showed a significant difference between the three groups, F(2, 67) = 3.14, p = .0497 (reported as p-rep = .88). The authors wisely did not conduct post-hoc comparisons of the close or distant condition with the control condition. Instead, they reported a significant linear trend, t(67) = 2.41, p = .019.

Although men and women may respond differently to reading an article about making love to a larger woman, gender differences were not examined or reported.

Study 2

Study 2 reduced the sample size form 73 participants to 42 participants (n = 14 per cell).

The dependent variable was liking of a violent and shocking story.

Despite the loss in statistical power, the ANOVA result was significant, F(2, 39) = 4.37, p = .019. The linear contrast was also significant, t(39) = 2.62, p = .012.

Study 3

59 community members participated in Study 3.

This time, participants rated caloric content (but not liking?) of healthy and unhealthy foods .

The mixed-model ANOVA showed a significant interaction result, F(2,56) = 3.36, p = .042. The interaction was due to significant differences between the priming conditions for unhealthy foods, F(2,56) = 5.62, p = .006. There were no significant differences for healthy foods. No linear contrasts were reported. The Figure shows that the significant effect was mainly due to a lower estimate of calories in the distant priming condition.

Study 4

84 students participated in Study 4.

The dependent variable was the average of closeness ratings to siblings, parents, and hometown.

The ANOVA result was significant, F(2,81) = 4.97, p = .009. The linear contrast also showed a significant trend for lower closeness ratings after distance priming, t(81) = 2.86, p = .005.

Replicability Assessment

All four studies showed significant results with p-values (.019, .012, .006, and .005). These p-values correspond to z-scores of z = 2.35, 2.51, 2.75, and 2.81. Median observed power is 75%, while the success rate is 100%. The Replicability Index corrects for inflation in median observed power when the success rate is higher than median observed power. An R-Index of 50% (75% – 25% Inflation) is not terrible, but also not very reassuring. In the grading scheme of the R-Index it is a D- (in comparison, many chapters in Bargh’s book have an Index below 50% and an F).

More concerning is that the variance of the four z-scores is only var(z) = 0.046, when sampling error of independent studies should be 1. The Test of Insufficient Variance (TIVA) shows that such a small variance or even less variance would occur in four studies with a probability of p = .013. This suggests that some non-random factors contributed to the observed results.

These results suggest that it is difficult to replicate the reported results because the reported effect sizes may be inflated by the use of questionable research practices.

Replication Study

Consistent with the OSC project guidelines, the authors replicated the last study in the article.

The sample size was moderately larger than in the original study (N = 133 vs. 84).

The simple procedure was reproduced exactly.

The ANOVA result was not significant, F(2,122) = 0.24, p = .79. The linear contrast was also not significant, t(122) = -0.59, p = .56.

The pattern of means showed a slightly higher closeness (to family) rating after priming closeness (M = 5.44, SD = 0.83) than in the control condition (M = 5.31, SD = 1.07). The mean for the distance priming condition was identical to the control condition (M = 5.31, SD = 1.15).

In conclusion, the replication study failed to replicate the original finding. Given the simplicity of the study, there are no obvious differences between the studies that could explain the replication failure. The most plausible explanation is that the original article reported inflated effect sizes.

Other Replication Failures

A previous replication attempt failed to replicate the results of Studies 3 and 4 (Harold Pashler, Noriko Coburn, Christine R. Harris, 2012).

The replication of Study 3 had 92 participants (vs. 59 in the original study). The study failed to replicate the interaction between distance priming and healthiness of food, F(2,85) = 0.45, p = .64. The replication of Study 4 also had 92 participants, although some were excluded because they did not have siblings or parents or could not perform the plotting task. The final sample size was N = 71 (vs. 84 in the original study). It also failed to replicate the original result, F(2,68) = 0.31.

Pashler et al. (2012) found no plausible explanation for their replication failure, but they did not consider QRPs. In contrast, the bias analyses presented here suggest that QRPs were used to report only supportive evidence for distance priming. If QRPs were used, it is not surprising that unbiased replication attempts fail to produce biased results.

Conclusion

Williams and Bargh (2008) proposed that a simple geometric task could prime distance or closeness and alter judgments of liking (close = like). Although they presented four studies with significant results, the evidence is not conclusive because bias tests suggest that the results are too good to be true. This impression was confirmed by a replication failure in a replication study that had slightly more power to detect the predicted effect due to a larger sample.

Although the replication failure was reported in 2015, the original article continues to be cited as if no replication failure occurred or the replication failure can be dismissed. A careful bias analysis suggests that the original results do not provide credible evidence for distance priming and the article should not be cited as evidence for it. Unless future studies with larger samples provide credible evidence for the effect, it remains doubtful that a simple geometric drawing task can alter evaluations of closeness to family members.

This replication failure has to be considered in the context of other replication failures of studies by John Bargh and social priming studies in general. As noted by Daniel Kahneman in his 2012 letter to John Bargh.

“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it.”

Bargh’s response to this letter may be described as willful attentional blindness. Not addressing concerns about replicability may be one way to maintain confidence in oneself in the face of criticism and doubt by others, but it is not good science. Experimental social psychologists who still believe in social priming effects like distance priming need to demonstrate replicability with high powered studies. So far, the results have not been encouraging (see failure to replicate professor priming).

P.S If you liked this blog post, the reason is that I primed you with a closeness prime (featured image). In reality, this blog post is terrible. If you didn’t like the blog post, the priming manipulation didn’t work (ironic isn’t it).

A critique of Stroebe and Strack’s Article “The Alleged Crisis and the Illusion of Exact Replication”

February 11, 2018John A. Bargh, Power, Priming, Psychological Science, Psychology, Publication Bias, Questionable Research Practices, Replication, Replication Crisis, Social Psychology, Stapel, Statistical Power, StatisticsUlrich Schimmack

The article by Stroebe and Strack (2014) [henceforth S&S] illustrates how experimental social psychologists responded to replication failures in the beginning of the replicability revolution. The response is a classic example of repressive coping: Houston, we do not have a problem. Even in 2014, problems with the way experimental social psychologists had conducted research for decades were obvious (Bem, 2011; Wagenmakers et al., 2011; John et al., 2012; Francis, 2012; Schimmack, 2012; Hasher & Wagenmakers, 2012). S&S article is an attempt to dismiss these concerns as misunderstandings and empirically unsupported criticism.

“In contrast to the prevalent sentiment, we will argue that the claim of a replicability crisis is greatly exaggerated” (p. 59).

Although the article was well received by prominent experimental social psychologists (see citations in appendix), future events proved S&S wrong and vindicated critics of research methods in experimental social psychology. Only a year later, the Open Science Collaboration (2015) reported that only 25% of studies in social psychology could be replicated successfully. A statistical analysis of focal hypothesis tests in social psychology suggests that roughly 50% of original studies could be replicated successfully if these studies were replicated exactly (Motyl et al., 2017). Ironically, one of S&S’s point is that exact replication studies are impossible. As a result, the 50% estimate is an optimistic estimate of the success rate for actual replication studies, suggesting that the actual replicability of published results in social psychology is less than 50%.

Thus, even if S&S had reasons to be skeptical about the extent of the replicability crisis in experimental social psychology, it is now clear that experimental social psychology has a serious replication problem. Many published findings in social psychology textbooks may not replicate and many theoretical claims in social psychology rest on shaky empirical foundations.

What explains the replication problem in experimental social psychology? The main reason for replication failures is that social psychology journals mostly published significant results. The selective publishing of significant results is called publication bias. Sterling pointed out that publication bias in psychology is rampant. He found that psychology journals publish over 90% significant results (Sterling, 1959; Sterling et al., 1995). Given new estimates that the actual success rate of studies in experimental social psychology is less than 50%, only publication bias can explain why journals publish over 90% results that confirm theoretical predictions.

It is not difficult to see that reporting only studies that confirm predictions undermines the purpose of empirical tests of theoretical predictions. If studies that do not confirm predictions are hidden, it is impossible to obtain empirical evidence that a theory is wrong. In short, for decades experimental social psychologists have engaged in a charade that pretends that theories are empirically tested, but publication bias ensured that theories would never fail. This is rather similar to Volkswagen’s emission tests that were rigged to pass because emissions were never subjected to a real test.

In 2014, there were ample warning signs that publication bias and other dubious practices inflated the success rate in social psychology journals. However, S&S claim that (a) there is no evidence for the use of questionable research practices and (b) that it is unclear which practices are questionable or not.

“Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology. In fact, the discipline still needs to reach an agreement about the conditions under which these practices are unacceptable” (p. 60).

Scientists like to hedge their statements so that they are immune to criticism. S&S may argue that the evidence in 2014 was not “solid” and surely there was and still is no agreement about good research practices. However, this is irrelevant. What is important is that success rates in social psychology journals were and still are inflated by suppressing disconfirming evidence and biasing empirical tests of theories in favor of positive outcomes.

Although S&S’s main claims are not based on empirical evidence, it is instructive to examine how they tried to shield published results and established theories from the harsh light of open replication studies that report results without selection for significance and subject social psychological theories to real empirical tests for the first time.

Failed Replication of Between-Subject Priming Studies

S&S discuss failed replications of two famous priming studies in social psychology: Bargh’s elderly priming study and Dijksterhuis’s professor priming studies. Both seminal articles reported several successful tests of the prediction that a subtle priming manipulation would influence behavior without participants even noticing the priming effect. In 2012, Doyen et al., failed to replicate elderly priming. Schanks et al. (2013) failed to replicate professor priming effects and more recently a large registered replication report also provided no evidence for professor priming. For naïve readers it is surprising that original studies had a 100% success rate and replication studies had a 0% success rate. However, S&S are not surprised at all.

“as in most sciences, empirical findings cannot always be replicated” (p. 60).

Apparently, S&S knows something that naïve readers do not know. The difference between naïve readers and experts in the field is that experts have access to unpublished information about failed replications in their own labs and in the labs of their colleagues. Only they know how hard it sometimes was to get the successful outcomes that were published. With the added advantage of insider knowledge, it makes perfect sense to expect replication failures, although may be not 0%.

The problem is that S&S give the impression that replication failures are too be expected, but that this expectation cannot be based on the objective scientific record that hardly ever reports results that contradict theoretical predictions. Replication failures occur all the time, but they remained unpublished. Doyen et al. and Schanks et al.’s articles only violated the code to publish only supportive evidence.

Kahneman’s Train Wreck Letter

S&S also comment on Kahneman’s letter to Bargh that compared priming research to a train wreck. In response S&S claim that

“priming is an entirely undisputed method that is widely used to test hypotheses about associative memory (e.g., Higgins, Rholes, & Jones, 1977; Meyer & Schvaneveldt, 1971; Tulving & Schacter, 1990).” (p. 60).

This argument does not stand the test of time. Since S&S published their article researchers have distinguished more clearly between highly replicable priming effects in cognitive psychology with repeated measures and within-subject designs and difficult to replicate between-subject social priming studies with subtle priming manipulations and a single outcome measure (BS social priming). With regards to BS social priming, it is unclear which of these effects can be replicated and leading social psychologists have been reluctant to demonstrate replicability of their famous studies by conducting self-replications as they were encouraged to do in Kahneman’s letter.

S&S also point to empirical evidence for robust priming effects.

“A meta-analysis of studies that investigated how trait primes influence impression formation identified 47 articles based on 6,833 participants and found overall effects to be statistically highly significant (DeCoster & Claypool, 2004).” (p. 60).

The problem with this evidence is that this meta-analysis did not take publication bias into account; in fact, it does not even mention publication bias as a possible problem. A meta-analysis of studies that were selected for significance produces is also biased by selection for significance.

Several years after Kahneman’s letter, it is widely agreed that past research on social priming is a train wreck. Kahneman published a popular book that celebrated social priming effects as a major scientific discovery in psychology. Nowadays, he agrees with critiques that the existing evidence is not credible. It is also noteworthy that none of the researchers in this area have followed Kahneman’s advice to replicate their own findings to show the world that these effects are real.

It is all a big misunderstanding

S&S suggest that “the claim of a replicability crisis in psychology is based on a major misunderstanding.” (p. 60).

Apparently, lay people, trained psychologists, and a Noble laureate are mistaken in their interpretation of replication failures. S&S suggest that failed replications are unimportant.

“the myopic focus on “exact” replications neglects basic epistemological principles” (p. 60).

To make their argument, they introduce the notion of exact replications and suggest that exact replication studies are uninformative.

“a finding may be eminently reproducible and yet constitute a poor test of a theory.” (p. 60).

The problem with this line of argument is that we are supposed to assume that a finding is eminently reproducible, which probably means it has been successfully replicate many times. It seems sensible that further studies of gender differences in height are unnecessary to convince us that there is a gender difference in height. However, results in social psychology are not like gender differences in height. According to S&S own accord earlier, “empirical findings cannot always be replicated” (p. 60). And if journals only publish significant results, it remains unknown which results are eminently reproducible and which results are not. S&S ignore publication bias and pretend that the published record suggests that all findings in social psychology are eminently reproducible. Apparently, they would suggest that even Bem’s findings that people have supernatural abilities is eminently reproducible. These days, few social psychologists are willing to endorse this naïve interpretation of the scientific record as a credible body of empirical facts.

Exact Replication Studies are Meaningful if they are Successful

Ironically, S&S next suggest that exact replication studies can be useful.

Exact replications are also important when studies produce findings that are unexpected and only loosely connected to a theoretical framework. Thus, the fact that priming individuals with the stereotype of the elderly resulted in a reduction of walking speed was a finding that was unexpected. Furthermore, even though it was consistent with existing theoretical knowledge, there was no consensus about the processes that mediate the impact of the prime on walking speed. It was therefore important that Bargh et al. (1996) published an exact replication of their experiment in the same paper.

Similarly, Dijksterhuis and van Knippenberg (1998) conducted four studies in which they replicated the priming effects. Three of these studies contained conditions that were exact replications.

Because it is standard practice in publications of new effects, especially of effects that are surprising, to publish one or two exact replications, it is clearly more conducive to the advancement of psychological knowledge to conduct conceptual replications rather than attempting further duplications of the original study.

Given these citations it is problematic that S&S article is often cited to claim that exact replications are impossible or unnecessary. The argument that S&S are making here is rather different. They are suggesting that original articles already provide sufficient evidence that results in social psychology are eminently reproducible because original articles report multiple studies and some of these studies are often exact replication studies. At face value, S&S have a point. An honest series of statistically significant results makes it practically impossible that an effect is a false positive result (Schimmack, 2012). The problem is that multiple study articles are not honest reports of all replication attempts. Francis (2014) found that at least 80% of multiple study articles showed statistical evidence of questionable research practices. Given the pervasive influence of selection for significance, exact replication studies in original articles provide no information about the replicability of these results.

What made the failed replications by Doyen et al. and Shank et al. so powerful was that these studies were the first real empirical tests of BS social priming effects because the authors were willing to report successes or failures. The problem for social psychology is that many textbook findings that were obtained with selection for significance cannot be reproduced in honest empirical tests of the predicted effects. This means that the original effects were either dramatically inflated or may not exist at all.

Replication Studies are a Waste of Resources

S&S want readers to believe that replication studies are a waste of resources.

Given that both research time and money are scarce resources, the large scale attempts at duplicating previous studies seem to us misguided” (p. 61).

This statement sounds a bit like a plea to spare social psychology from the embarrassment of actual empirical tests that reveal the true replicability of textbook findings. After all, according to S&S it is impossible to duplicate original studies (i.e., conduct exact replication studies) because replication studies differ in some way from original studies and may not reproduce the original results. So, none of the failed replication studies is an exact replication. Doyen et al. replicate Bargh’s study that was conducted in New York city in Belgium and Shanks et al. replicated Dijksterhuis’s studies from the Netherlands in the United States. The finding that the original results could not be replicate the original results does not imply that the original findings were false positives, but they do imply that these findings may be unique to some unspecified specifics of the original studies. This is noteworthy when original results are used in textbook as evidence for general theories and not as historical accounts of what happened in one specific socio-cultural context during a specific historic period. As social situations and human behavior are never exact replications of the past, social psychological results need to be permanently replicated and doing so is not a waste of resources. Suggesting that replications is a waste of resources is like suggesting that measuring GDP or unemployment every year is a waste of resources because we can just use last-year’s numbers.

As S&S ignore publication bias and selection for significance, they are also ignoring that publication bias leads to a massive waste of resources. First, running empirical tests of theories that are not reported is a waste of resources. Second, publishing only significant results is also a waste of resources because researchers design new studies based on the published record. When the published record is biased, many new studies will fail, just like airplanes who are designed based on flawed science would drop from the sky. Thus, a biased literature creates a massive waste of resources.

Ultimately, a science that publishes only significant result wastes all resources because the outcome of the published studies is a foregone conclusion: the prediction was supported, p < .05. Social psychologists might as well publish purely theoretical article, just like philosophers in the old days used “thought experiments” to support their claims. An empirical science is only a real science if theoretical predictions are subjected to tests that can fail. By this simple criterion, experimental social psychology is not (yet) a science.

Should Psychologists Conduct Exact Replications or Conceptual Replications?

Strobe and Strack’s next cite Pashler and Harris (2012) to claim that critiques of experimental social psychology have dismissed the value of so-called conceptual replications and generalize.

The main criticism of conceptual replications is that they are less informative than exact replications (e.g., Pashler & Harris, 2012).”

Before I examine S&S’s counterargument, it is important to realize that S&S misrepresented, and maybe misunderstood, Pashler and Harris’s main point. Here is the relevant quote from Pashler and Harris’s article.

We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling “pathological science” cases that embarrassed the natural sciences at several points in the 20th century (e.g., Polywater; Rousseau & Porto, 1970; and cold fusion; Taubes, 1993).

The problem for S&S is that they cannot address the problem of publication bias and therefore carefully avoid talking about it. As a result, they misrepresent Pashler and Harris’s critique of conceptual replications in combination with publication bias as a criticism of conceptual replication studies, which is absurd and not what Pashler and Harris’s intended to say or actually said. The following quote from their article makes this crystal clear.

However, what kept faith in cold fusion alive for some time (at least in the eyes of some onlookers) was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications). This suggests that one important hint that a controversial finding is pathological may arise when defenders of a controversial effect disavow the initial methods used to obtain an effect and rest their case entirely upon later studies conducted using other methods. Of course, productive research into real phenomena often yields more refined and better ways of producing effects. But what should inspire doubt is any situation where defenders present a phenomenon as a “moving target” in terms of where and how it is elicited (cf. Langmuir, 1953/1989). When this happens, it would seem sensible to ask, “If the finding is real and yet the methods used by the original investigators are not reproducible, then how were these investigators able to uncover a valid phenomenon with methods that do not work?” Again, the unavoidable conclusion is that a sound assessment of a controversial phenomenon should focus first and foremost on direct replications of the original reports and not on novel variations, each of which may introduce independent ambiguities.

I am confident that unbiased readers will recognize that Pashler and Harris did not suggest that conceptual replication studies are bad. Their main point is that a few successful conceptual replication studies can be used to keep theories alive in the face of a string of many replication failures. The problem is not that researchers conduct successful conceptual replication studies. The problem is dismissing or outright hiding of disconfirming evidence in replication studies. S&S misconstrue Pashler and Harris’s claim to avoid addressing this real problem of ignoring and suppressing failed studies to support an attractive but false theory.

The illusion of exact replications.

S&S next argument is that replication studies are never exact.

If one accepts that the true purpose of replications is a (repeated) test of a theoretical hypothesis rather than an assessment of the reliability of a particular experimental procedure, a major problem of exact replications becomes apparent: Repeating a specific operationalization of a theoretical construct at a different point in time and/or with a different population of participants might not reflect the same theoretical construct that the same procedure operationalized in the original study.

The most important word in this quote is “might.” Ebbinghaus’s memory curve MIGHT not replicate today because he was his own subject. Bargh’s elderly priming study MIGHT not work today because Florida is no longer associated with the elderly, and Disjterhuis’s priming study MIGHT no longer works because students no longer think that professors are smart or that Hooligans are dumb.

Just because there is no certainty in inductive inferences doesn’t mean we can just dismiss replication failures because something MIGHT have changed. It is also possible that the published results MIGHT be false positives because significant results were obtained by chance, with QRPs, or outright fraud. Most people think that outright fraud is unlikely, but the Stapel debacle showed that we cannot rule it out. So, we can argue forever about hypothetical reasons why a particular study was successful or a failure. These arguments are futile and have nothing to do with scientific arguments and objective evaluation of facts.

This means that every study, whether it is a groundbreaking success or a replication failure needs to be evaluate in terms of the objective scientific facts. There is no blanket immunity for seminal studies that protects them from disconfirming evidence. No study is an exact replication of another study. That is a truism and S&S article is often cited for this simple fact. It is as true as it is irrelevant to understand the replication crisis in social psychology.

Exact Replications Are Often Uninformative

S&S contradict themselves in the use of the term exact replication. First it is impossible to do exact replications, but then they are uninformative. I agree with S&S that exact replication studies are impossible. So, we can simply drop the term “exact” and examine why S&S believe that some replication studies are uninformative.

First they give an elaborate, long and hypothetical explanation for Doyen et al.’s failure to replicate Bargh’s pair of elderly priming studies. After considering some possible explanations, they conclude

It is therefore possible that the priming procedure used in the Doyen et al. (2012) study failed in this respect, even though Doyen et al. faithfully replicated the priming procedure of Bargh et al. (1996).

Once more the realm of hypothetical conjectures has to rescue seminal findings. Just as it is possible that S&S are right it is also possible that Bargh faked his data. To be sure, I do not believe that he faked his data and I apologized for a Facebook comment that gave the wrong impression that I did. I am only raising this possibility here to make the point that everything is possible. Maybe Bargh just got lucky. The probability of this is 1 out of 1,600 attempts (the probability to get the predicted effect with .05 two-tailed (!) twice is .025^2). Not very likely, but also not impossible.

No matter what the reason for the discrepancy between Bargh and Doyen’s findings is, the example does not support S&S’s claim that replication studies are uninformative. The failed replication raised concerns about the robustness of BS social priming studies and stimulated further investigation of the robustness of social priming effects. In the short span of six years, the scientific consensus about these effects has shifted dramatically, and the first publication of a failed replication is an important event in the history of social psychology.

S&S’s critique of Shank et al.’s replication studies is even weaker. First, they have to admit that professor probably still primes intelligence more than soccer hooligans. To rescue the original finding S&S propose

“the priming manipulation might have failed to increase the cognitive representation of the concept “intelligence.”

S&S also think that

another LIKELY reason for their failure could be their selection of knowledge items.

Meanwhile a registered replication report with a design that was approved by Dijksterhuis failed to replicate the effect. Although it is possible to come up with more possible reasons for these failures, real scientific creativity is revealed in creating experimental paradigms that produce replicable results, not in coming up with many post-hoc explanations for replication failures.

Ironically, S&S even agree with my criticism of their argument.

“To be sure, these possibilities are speculative” (p. 62).

In contrast, S&S fail to consider the possibility that published significant results are false positives, even though there is actual evidence for publication bias. The strong bias against published failures may be rooted in a long history of dismissing unpublished failures that social psychologists routinely encounter in their own laboratory. To avoid the self-awareness that hiding disconfirming evidence is unscientific, social psychologists made themselves believe that minute changes in experimental procedures can ruin a study (Stapel). Unfortunately, a science that dismisses replication failures as procedural hiccups is fated to fail because it removed the mechanism that makes science self-correcting.

Failed Replications are Uninformative

S&S next suggest that “nonreplications are uninformative unless one can demonstrate that the theoretically relevant conditions were met” (p. 62).

This reverses the burden of proof. Original researchers pride themselves on innovative ideas and groundbreaking discoveries. Like famous rock stars, they are often not the best musicians, nor is it impossible for other musicians to play their songs. They get rewarded because they came up with something original. Take the Implicit Association Test as an example. The idea to use cognitive switching tasks to measure attitudes was original and Greenwald deserves recognition for inventing this task. The IAT did not revolutionize attitude research because only Tony Greenwald could get the effects. It did so because everybody, including my undergraduate students, could replicate the basic IAT effect.

However, let’s assume that the IAT effect could not have been replicated. Is it really the job of researchers who merely duplicated a study to figure out why it did not work and develop a theory under which circumstances an effect may occur or not? I do not think so. Failed replications are informative even if there is no immediate explanation why the replication failed. As Pashler and Harris’s cold fusion example shows there may not even be a satisfactory explanation after decades of research. Most probably, cold fusion never really worked and the successful outcome of the original study was a fluke or a problem of the experimental design. Nevertheless, it was important to demonstrate that the original cold fusion study could not be replicated. To ask for an explanation why replication studies fail is simply a way to make replication studies unattractive and to dismiss the results of studies that fail to produce the desired outcome.

Finally, S&S ignore that there is a simple explanation for replication failures in experimental social psychology: publication bias. If original studies have low statistical power (e.g., Bargh’s studies with N = 30) to detect small effects, only vastly inflated effect sizes reach significance. An open replication study without inflated effect sizes is unlikely to produce a successful outcome. Statistical analysis of original studies show that this explanation accounts for a large proportion of replication failures. Thus, publication bias provides one explanation for replication failures.

Conceptual Replication Studies are Informative

S&S cite Schmidt (2009) to argue that conceptual replication studies are informative.

With every difference that is introduced the confirmatory power of the replication increases, because we have shown that the phenomenon does not hinge on a particular operationalization but “generalizes to a larger area of application” (p. 93).

S&S continue

“An even more effective strategy to increase our trust in a theory is to test it using completely different manipulations.”

This is of course true as long as conceptual replication studies are successful. However, it is not clear why conceptual replication studies that for the first time try a completely different manipulation should be successful. As I pointed out in my 2012 article, reading multiple-study articles with only successful conceptual replication studies is a bit like watching a magic show.

Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions. Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005). The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect.

I don’t know whether S&S are impressed by Bem’s article with 9 conceptual replication studies that successfully demonstrated supernatural abilities. According to their line of arguments, they should be. However, even most social psychologists found it impossible to accept that time-reversed subliminal priming works. Unfortunately, this also means that successful conceptual replication studies are meaningless if only successful results are published. Once more, S&S cannot address this problem because they ignore the simple fact that selection for significance undermines the purpose of empirical research to test theoretical predictions.

Exact Replications Contribute Little to Scientific Knowledge

Without providing much evidence for their claims, S&S conclude

one reason why exact replications are not very interesting is that they contribute little to scientific knowledge.

Ironically, one year later Science published 100 replication studies with the only goal of estimating the replicability of psychology, with a focus on social psychology. The article has already been cited 640 times, while S&S’s criticism of replication studies has been cited (only) 114 times.

Although the article did nothing else then to report the outcome of replication studies, it made a tremendous empirical contribution to psychology because it reported results of studies without the filter of publication bias. Suddenly the success rate plummeted from over 90% to 37% and for social psychology to 25%. While S&S could claim in 2014 that “Thus far, however, no solid data exist on the prevalence of such [questionable] research practices in either social or any other area of psychology,” the reproducibility project revealed that these practices dramatically inflated the percentage of successful studies reported in psychology journals.

The article has been celebrated by scientists in many disciplines as a heroic effort and a sign that psychologists are trying to improve their research practices. S&S may disagree, but I consider the reproducibility project a big contribution to scientific knowledge.

Why null findings are not always that informative

To fully appreciate the absurdity of S&S’s argument, I let them speak for themselves.

One reason is that not all null findings are interesting. For example, just before his downfall, Stapel published an article on how disordered contexts promote stereotyping and discrimination. In this publication, Stapel and Lindenberg (2011) reported findings showing that litter or a broken-up sidewalk and an abandoned bicycle can increase social discrimination. These findings, which were later retracted, were judged to be sufficiently important and interesting to be published in the highly prestigious journal Science. Let us assume that Stapel had actually conducted the research described in this paper and failed to support his hypothesis. Such a null finding would have hardly merited publication in the Journal of Articles in Support of the Null Hypothesis. It would have been uninteresting for the same reason that made the positive result interesting, namely, that (a) nobody expected a relationship between disordered environments and prejudice and (b) there was no previous empirical evidence for such a relationship. Similarly, if Bargh et al. (1996) had found that priming participants with the stereotype of the elderly did not influence walking speed or if Dijksterhuis and van Knippenberg (1998) had reported that priming participants with “professor” did not improve their performance on a task of trivial pursuit, nobody would have been interested in their findings.

Notably, all of the examples are null-findings in original studies. Thus, they have absolutely no relevance for the importance of replication studies. As noted by Strack and Stroebe earlier

Thus, null findings are interesting only if they contradict a central hypothesis derived from an established theory and/or are discrepant with a series of earlier studies.” (p. 65).

Bem (2011) reported 9 significant results to support unbelievable claims about supernatural abilities. However, several failed replication studies allowed psychologists to dismiss these findings and to ignore claims about time-reversed priming effects. So, while not all null-results are important, null-results in replication studies are important because they can correct false positive results in original articles. Without this correction mechanism, science looses its ability to correct itself.

Failed Replications Do Not Falsify Theories

S&S state that failed replications do not falsify theories

The nonreplications published by Shanks and colleagues (2013) cannot be taken as a falsification of that theory, because their study does not explain why previous research was successful in replicating the original findings of Dijksterhuis and van Knippenberg (1998).” (p. 64).

I am unaware of any theory in psychology that has been falsified. The reason for this is not that failed replication studies are not informative. The reason is that theories have been protected by hiding failed replication studies until recently. Only in recent years have social psychologists started to contemplate the possibility that some theories in social psychology might be false. The most prominent example is ego-depletion theory, which has been one of the first prominent theories that has been put under the microscope of open science without the protection of questionable research practices in recent years. While ego-depletion theory is not entirely dead, few people still believe in the simple theory that 20 Stroop trials deplete individuals’ will power. Falsification is hard, but falsification without disconfirming evidence is impossible.

Inconsistent Evidence

S&S argue that replication failures have to be evaluated in the context of replication successes.

Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis.

Earlier S&S wrote

in social psychology, as in most sciences, empirical findings cannot always be replicated (this was one of the reasons for the development of meta-analytic methods).

Indeed. Unless studies have very high statistical power, inconsistent results are inevitable; which is one reason why publishing only significant results is a sign of low credibility (Schimmack, 2012). Meta-analysis is the only way to make sense of these inconsistent findings. However, it is well known that publication bias makes meta-analytic results meaningless (e.g., meta-analysis show very strong evidence for supernatural abilities). Thus, it is important that all tests of a theoretical prediction are reported to produce meaningful meta-analyses. If social psychologists would take S&S seriously and continue to suppress non-significant results because they are uninformative, meta-analysis would continue to provide biased results that support even false theories.

Failed Replications are Uninformative II

Sorry that this is getting really long. But S&S keep on making the same arguments and the editor of this article didn’t tell them to shorten the article. Here they repeat the argument that failed replications are uninformative.

One reason why null findings are not very interesting is because they tell us only that a finding could not be replicated but not why this was the case. This conflict can be resolved only if researchers develop a theory that could explain the inconsistency in findings.

A related claim is that failed replications never demonstrate that original findings were false because the inconsistency is always due to some third variable; a hidden moderator.

Methodologically, however, nonreplications must be understood as interaction effects in that they suggest that the effect of the crucial influence depends on the idiosyncratic conditions under which the original experiment was conducted” (p. 64).

These statements reveal a fundamental misunderstanding of statistical inferences. A significant result never proofs that the null-hypothesis is false. The inference that a real effect rather than sampling error caused the observed result can be a mistake. This mistake is called a false positive or a type-I error. S&S seems to believe that type-I errors do not exist. Accordingly, Bem’s significant results show real supernatural abilities. If this were the case, it would be meaningless to report statistical significance tests. The only possible error that could be made would be false negatives or type-II error; the theory makes the correct prediction, but a study failed to produce a significant result. And if theoretical predictions are always correct, it is also not necessary to subject theories to empirical tests, because these tests either correctly show that a prediction was confirmed or falsely fail to confirm a prediction.

S&S’s belief in published results has a religious quality. Apparently we know nothing about the world, but once a significant result is published in a social psychology journal, ideally JPSP, it becomes a holy truth that defies any evidence that non-believers may produce under the misguided assumption that further inquiry is necessary. Elderly priming is real, amen.

More Confusing Nonsense

At some point, I was no longer surprised by S&S’s claims, but I did start to wonder about the reviewers and editors who allowed this manuscript to be published apparently with light or no editing. Why would a self-respecting journal publish a sentence like this?

As a consequence, the mere coexistence of exact replications that are both successful and unsuccessful is likely to leave researchers helpless about what to conclude from such a pattern of outcomes.

Didn’t S&S claim that exact replication studies do not exist? Didn’t they tell readers that every inconsistent finding has to be interpreted as an interaction effect? And where do they see inconsistent results if journals never publish non-significant results?

Aside from these inconsistencies, inconsistent results do not lead to a state of helpless paralysis. As S&S suggested themselves, they conduct a meta-analysis. Are S&S suggesting that we need to spare researchers from inconsistent results to protect them from a state of helpless confusion? Is this their justification for publishing only significant results?

Even Massive Replication Failures in Registered Replication Reports are Uninformative

In response to the replication crisis, some psychologists started to invest time and resources in major replication studies called many lab studies or registered replication studies. A single study was replicated in many labs. The total sample size of many labs gives these studies high precision in estimating the average effect size and makes it even possible to demonstrate that an effect size is close to zero, which suggests that the null-hypothesis may be true. These studies have failed to find evidence for classic social psychology findings, including Strack’s facial feedback studies. S&S suggest that even these results are uninformative.

Conducting exact replications in a registered and coordinated fashion by different laboratories does not remove the described shortcomings. This is also the case if exact replications are proposed as a means to estimate the “true size” of an effect. As the size of an experimental effect always depends on the specific error variance that is generated by the context, exact replications can assess only the efficiency of an intervention in a given situation but not the generalized strength of a causal influence.

Their argument does not make any sense to me. First, it is not clear what S&S mean by “the size of an experimental effect always depends on the specific error variance.” Neither unstandardized nor standardized effect sizes depend on the error variance. This is simple to see because error variance depends on the sample size and effect sizes do not depend on sample size. So, it makes no sense to claim that effect sizes depend on error variance.

Second, it is not clear what S&S mean by specific error variance that is generated by the context. I simply cannot address this argument because the notion of context generated specific error variance is not a statistical construct and S&S do not explain what they are talking about.

Finally, it is not clear why meta-analysis of replication studies cannot be used to estimate the generalized strength of a causal influence, which I believe to mean “an effect size”? Earlier S&S alluded to meta-analysis as a way to resolve inconsistencies in the literature, but now they seem to suggest that meta-analysis cannot be used.

If S&S really want to imply that meta-analyses are useless, it is unclear how they would make sense of inconsistent findings. The only viable solution seems to be to avoid inconsistencies by suppressing non-significant results in order to give the impression that every theory in social psychology is correct because theoretical predictions are always confirmed. Although this sounds absurd, it is the inevitable logical consequence of S&S’s claim that non-significant results are uninformative, even if over 20 labs independently and in combination failed to provide evidence for a theoretical predicted effect.

The Great History of Social Psychological Theories

S&S next present Über-social psychologist, Leon Festinger, as an example why theories are good and failed studies are bad. The argument is that good theories make correct predictions, even if bad studies fail to show the effect.

“Although their theoretical analysis was valid, it took a decade before researchers were able to reliably replicate the findings reported by Festinger and Carlsmith (1959).”

As a former student, I was surprised by this statement because I had learned that Festinger’s theory was challenged by Bem’s theory and that social psychologists had been unable to resolve which of the two theories was correct. Couldn’t some of these replication failures be explained by the fact that Festinger’s theory sometimes made the wrong prediction?

It is also not surprising that researchers had a hard time replicating Festinger and Carlsmith original findings. The reason is that the original study had low statistical power and replication failures are expected even if the theory is correct. Finally, I have been around social psychologists long enough to have heard some rumors about Festinger and Carlsmith’s original studies. Accordingly, some of Festinger’s graduate students also tried and failed to get the effect. Carlsmith was the ‘lucky’ one who got the effect, in one study p < .05, and he became the co-author of one of the most cited articles in the history of social psychology. Naturally, Festinger did not publish the failed studies of his other graduate students because surely they must have done something wrong. As I said, that is a rumor. Even if the rumor is not true, and Carlsmith got lucky on the first try, luck played a factor and nobody should expect that a study replicates simply because a single published study reported a p-value less than .05.

Failed Replications Did Not Influence Social Psychological Theories

Argument quality reaches a new low with the next argument against replication studies.

“If we look at the history of social psychology, theories have rarely been abandoned because of failed replications.”

This is true, but it reveals the lack of progress in theory development in social psychology rather than the futility of replication studies. From an evolutionary perspective, theory development requires selection pressure, but publication bias protects bad theories from failure.

The short history of open science shows how weak social psychological theories are and that even the most basic predictions cannot be confirmed in open replication studies that do not selectively report significant results. So, even if it is true that failed replications have played a minor role in the past of social psychology, they are going to play a much bigger role in the future of social psychology.

The Red Herring: Fraud

S&S imply that Roediger suggested to use replication studies as a fraud detection tool.

if others had tried to replicate his [Stapel’s] work soon after its publication, his misdeeds might have been uncovered much more quickly

S&S dismiss this idea in part on the basis of Stroebe’s research on fraud detection.

To their own surprise, Stroebe and colleagues found that replications hardly played any role in the discovery of these fraud cases.

Now this is actually not surprising because failed replications were hardly ever published. And if there is no variance in a predictor variable (significance), we cannot see a correlation between the predictor variable and an outcome (fraud). Although failed replication studies may help to detect fraud in the future, this is neither their primary purpose, nor necessary to make replication studies valuable. Replication studies also do not bring world peace or bring an end to global warming.

For some inexplicable reason S&S continue to focus on fraud. For example, they also argue that meta-analyses are poor fraud detectors, which is as true as it is irrelevant.

They conclude their discussion with an observation by Stapel, who famously faked 50+ articles in social psychology journals.

As Stapel wrote in his autobiography, he was always pleased when his invented findings were replicated: “What seemed logical and was fantasized became true” (Stapel, 2012). Thus, neither can failures to replicate a research finding be used as indicators of fraud, nor can successful replications be invoked as indication that the original study was honestly conducted.

I am not sure why S&S spend so much time talking about fraud, but it is the only questionable research practice that they openly address. In contrast, they do not discuss other questionable research practices, including suppressing failed studies, that are much more prevalent and much more important for the understanding of the replication crisis in social psychology than fraud. The term “publication bias” is not mentioned once in the article. Sometimes what is hidden is more significant than what is being published.

Conclusion

The conclusion section correctly predicts that the results of the reproducibility project will make social psychology look bad and that social psychology will look worse than other areas of psychology.

But whereas it will certainly be useful to be informed about studies that are difficult to replicate, we are less confident about whether the investment of time and effort of the volunteers of the Open Science Collaboration is well spent on replicating studies published in three psychology journals. The result will be a reproducibility coefficient that will not be greatly informative, because of justified doubts about whether the “exact” replications succeeded in replicating the theoretical conditions realized in the original research.

As social psychologists, we are particularly concerned that one of the outcomes of this effort will be that results from our field will be perceived to be less “reproducible” than research in other areas of psychology. This is to be expected because for the reasons discussed earlier, attempts at “direct” replications of social psychological studies are less likely than exact replications of experiments in psychophysics to replicate the theoretical conditions that were established in the original study.

Although psychologists should not be complacent, there seem to be no reasons to panic the field into another crisis. Crises in psychology are not caused by methodological flaws but by the way people talk about them (Kruglanski & Stroebe, 2012).

S&S attribute the foreseen (how did they know?) bad outcome in the reproducibility project to the difficulty of replicating social psychological studies, but they fail to explain why social psychology journals publish as many successes as other disciplines.

The results of the reproducibility project provide an answer to this question. Social psychologists use designs with less statistical power that have a lower chance of producing a significant result. Selection for significance ensures that the success rate is equally high in all areas of psychology, but lower power makes these successes less replicable.

To avoid further embarrassments in an increasingly open science, social psychologists must improve the statistical power of their studies. Which social psychological theories will survive actual empirical tests in the new world of open science is unclear. In this regard, I think it makes more sense to compare social psychology to a ship wreck than a train wreck. Somewhere down on the floor of the ocean is some gold. But it will take some deep diving and many failed attempts to find it. Good luck!

Appendix

S&S’s article was published in a “prestigious” psychology journal and has already garnered 114 citations. It ranks #21 in my importance rankings of articles in meta-psychology. So, I was curious why the article gets cited. The appendix lists 51 citing articles with the relevant citation and the reason for citing S&S’s article. The table shows the reasons for citations in decreasing order of frequency.

S&S are most frequently cited for the claim that exact replications are impossible, followed by the reason for this claim that effects in psychological research are sensitive to the unique context in which a study is conducted. The next two reasons for citing the article are that only conceptual replications (CR) test theories, whereas the results of exact replications (ER) are uninformative. The problem is that every study is a conceptual replication because exact replications are impossible. So, even if exact replications were uninformative this claim has no practical relevance because there are no exact replications. Some articles cite S&S with no specific claim attached to the citation. Only two articles cite them for the claim that there is no replication crisis and only 1 citation cites S&S for the claim that there is no evidence about the prevalence of QRPs. In short, the article is mostly cited for the uncontroversial and inconsequential claim that exact replications are impossible and that effect sizes in psychological studies can vary as a function of unique features of a particular sample or study. This observation is inconsequential because it is unclear how unknown unique characteristics of studies influence results. The main implication of this observation is that study results will be more variable than we would expect from a set of exact replication studies. For this reason, meta-analysts often use random-effects model because fixed-effects meta-analysis assumes that all studies are exact replications.

ER impossible	11
Contextual Sensitivity	8
CR test theory	8
ER uninformative	7
Mention	6
ER/CR Distinction	2
No replication crisis	2
Disagreement	1
CR Definition	1
ER informative	1
ER useful for applied research	1
ER cannot detect fraud	1
No evidence about prevalence of QRP	1
Contextual sensitivity greater in social psychology	1

the most influential citing articles and the relevant citation. I haven’t had time to do a content analysis, but the article is mostly cited to say (a) exact replications are impossible, and (b) conceptual replications are valuable, and (c) social psychological findings are harder to replicate. Few articles cite to article to claim that the replication crisis is overblown or that failed replications are uninformative. Thus, even though the article is cited a lot, it is not cited for the main points S&S tried to make. The high number of citation therefore does not mean that S&S’s claims have been widely accepted.

(Disagreement)
The value of replication studies.
Simmons, DJ.
“In this commentary, I challenge these claims.”

(ER/CR Distinction)
Bilingualism and cognition.
Valian, V.
“A host of methodological issues should be resolved. One is whether the field should undertake exact replications, conceptual replications, or both, in order to determine the conditions under which effects are reliably obtained (Paap, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(Contextual Sensitivity)
Is Psychology Suffering From a Replication Crisis? What Does “Failure to Replicate” Really Mean?“
Maxwell et al. (2015)
A particular replication may fail to confirm the results of an original study for a variety of reasons, some of which may include intentional differences in procedures, measures, or samples as in a conceptual replication (Cesario, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(ER impossible)
The Chicago face database: A free stimulus set of faces and norming data
Debbie S. Ma, Joshua Correll, & Bernd Wittenbrink.
The CFD will also make it easier to conduct exact replications, because researchers can use the same stimuli employed by other researchers (but see Stroebe & Strack, 2014).”

(Contextual Sensitivity)
“Contextual sensitivity in scientific reproducibility”
vanBavel et al. (2015)
“Many scientists have also argued that the failure to reproduce results might reflect contextual differences—often termed “hidden moderators”—between the original research and the replication attempt”

(Contextual Sensitivity)
Editorial Psychological Science
Linday,
As Nosek and his coauthors made clear, even ideal replications of ideal studies are expected to fail some of the time (Francis, 2012), and failure to replicate a previously observed effect can arise from differences between the original and replication studies and hence do not necessarily indicate flaws in the original study (Maxwell, Lau, & Howard, 2015; Stroebe & Strack, 2014). Still, it seems likely that psychology journals have too often reported spurious effects arising from Type I errors (e.g., Francis, 2014).

(ER impossible)
Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science
Finkel et al. (2015).
“Nevertheless, many scholars believe that direct replications are impossible in the human sciences—S&S (2014) call them “an illusion”— because certain factors, such as a moment in historical time or the precise conditions under which a sample was obtained and tested, that may have contributed to a result can never be reproduced identically.”

Conceptualizing and evaluating the replication of research results
Fabrigar and Wegener (2016)
(CR test theory)
“Traditionally, the primary presumed strength of conceptual replications has been their ability to address issues of construct validity (e.g., Brewer & Crano, 2014; Schmidt, 2009; Stroebe & Strack, 2014). “

(ER impossible)
“First, it should be recognized that an exact replication in the strictest sense of the term can never be achieved as it will always be impossible to fully recreate the contextual factors and participant characteristics present in the original experiment (see Schmidt (2009); S&S (2014).”

(Contextual Sensitivity)
“S&S (2014) have argued that there is good reason to expect that many traditional and contemporary experimental manipulations in social psychology would have different psychological properties and effects if used in contexts or populations different from the original experiments for which they were developed. For example, classic dissonance manipulations and fear manipulations or more contemporary priming procedures might work very differently if used in new contexts and/or populations. One could generate many additional examples beyond those mentioned by S&S.”

(ER impossible)
“Another important point illustrated by the above example is that the distinction between exact and conceptual replications is much more nebulous than many discussions of replication would suggest. Indeed, some critics of the exact/conceptual replication distinction have gone so far as to argue that the concept of exact replication is an “illusion” (Stroebe & Strack, 2014). Though we see some utility in the exact/conceptual distinction (especially regarding the goal of the researcher in the work), we agree with the sentiments expressed by S&S. Classifying studies on the basis of the exact/conceptual distinction is more difficult than is often appreciated, and the presumed strengths and weaknesses of the approaches are less straightforward than is often asserted or assumed.”

(Contextual Sensitivity)
“Furthermore, assuming that these failed replication experiments have used the same operationalizations of the independent and dependent variables, the most common inference drawn from such failures is that confidence in the existence of the originally demonstrated effect should be substantially undermined (e.g., see Francis (2012); Schimmack (2012)). Alternatively, a more optimistic interpretation of such failed replication experiments could be that the failed versus successful experiments differ as a function of one or more unknown moderators that regulate the emergence of the effect (e.g., Cesario, 2014; Stroebe & Strack, 2014).”

Replicating Studies in Which Samples of Participants Respond to Samples of Stimuli.
(CR Definition)
Westfall et al. (2015).
Nevertheless, the original finding is considered to be conceptually replicated if it can be convincingly argued that the same theoretical constructs thought to account for the results of the original study also account for the results of the replication study (Stroebe & Strack, 2014). Conceptual replications are thus “replications” in the sense that they establish the reproducibility of theoretical interpretations.”

(Mention)
“Although establishing the generalizability of research findings is undoubtedly important work, it is not the focus of this article (for opposing viewpoints on the value of conceptual replications, see Pashler & Harris, 2012; Stroebe & Strack, 2014).“

Introduction to the Special Section on Advancing Our Methods and Practices
(Mention)
Ledgerwood, A.
We can and surely should debate which problems are most pressing and which solutions most suitable (e.g., Cesario, 2014; Fiedler, Kutzner, & Krueger, 2012; Murayama, Pekrun, & Fiedler, 2013; Stroebe & Strack, 2014). But at this point, most can agree that there are some real problems with the status quo.

***Theory Building, Replication, and Behavioral Priming: Where Do We Need to Go From Here?
Locke, EA
(ER impossible)
As can be inferred from Table 1, I believe that the now popular push toward “exact” replication (e.g., see Simons, 2014) is not the best way to go. Everyone agrees that literal replication is impossible (e.g., Stroebe & Strack, 2014), but let us assume it is as close as one can get. What has been achieved?

The War on Prevention: Bellicose Cancer: Metaphors Hurt (Some) Prevention Intentions”
(CR test theory)
David J. Hauser1 and Norbert Schwarz
“As noted in recent discussions (Stroebe & Strack, 2014), consistent effects of multiple operationalizations of a conceptual variable across diverse content domains are a crucial criterion for the robustness of a theoretical approach.”

ON THE OTHER SIDE OF THE MIRROR: PRIMING IN COGNITIVE AND SOCIAL PSYCHOLOGY
Doyen et al. “
(CR test theory)
In contrast, social psychologists assume that the primes activate culturally and situationally contextualized representations (e.g., stereotypes, social norms), meaning that they can vary over time and culture and across individuals. Hence, social psychologists have advocated the use of “conceptual replications” that reproduce an experiment by relying on different operationalizations of the concepts under investigation (Stroebe & Strack, 2014). For example, in a society in which old age is associated not with slowness but with, say, talkativeness, the outcome variable could be the number of words uttered by the subject at the end of the experiment rather than walking speed.”

***Welcome back Theory
Ap Dijksterhuis
(ER uninformative)
“it is unavoidable, and indeed, this commentary is also about replication—it is done against the background of something we had almost forgotten: theory! S&S (2014, this issue) argue that focusing on the replication of a phenomenon without any reference to underlying theoretical mechanisms is uninformative”

On the scientific superiority of conceptual replications for scientific progress
Christian S. Crandall, Jeffrey W. Sherman
(ER impossible)
But in matters of social psychology, one can never step in the same river twice—our phenomena rely on culture, language, socially primed knowledge and ideas, political events, the meaning of questions and phrases, and an ever-shifting experience of participant populations (Ramscar, 2015). At a certain level, then, all replications are “conceptual” (Stroebe & Strack, 2014), and the distinction between direct and conceptual replication is continuous rather than categorical (McGrath, 1981). Indeed, many direct replications turn out, in fact, to be conceptual replications. At the same time, it is clear that direct replications are based on an attempt to be as exact as possible, whereas conceptual replications are not.

***Are most published social psychological findings false?
Stroebe, W.
(ER uninformative)
This near doubling of replication success after combining original and replication effects is puzzling. Because these replications were already highly powered, the increase is unlikely to be due to the greater power of a meta-analytic synthesis. The two most likely explanations are quality problems with the replications or publication bias in the original studies or. An evaluation of the quality of the replications is beyond the scope of this review and should be left to the original authors of the replicated studies. However, the fact that all replications were exact rather than conceptual replications of the original studies is likely to account to some extent for the lower replication rate of social psychological studies (Stroebe & Strack, 2014). There is no evidence either to support or to reject the second explanation.”

(ER impossible)
“All four projects relied on exact replications, often using the material used in the original studies. However, as I argued earlier (Stroebe & Strack, 2014), even if an experimental manipulation exactly replicates the one used in the original study, it may not reflect the same theoretical variable.”

(CR test theory)
“Gergen’s argument has important implications for decisions about the appropriateness of conceptual compared to exact replication. The more a phenomenon is susceptible to historical change, the more conceptual replication rather than exact replication becomes appropriate (Stroebe & Strack, 2014).”

(CR test theory)
“Moonesinghe et al. (2007) argued that any true replication should be an exact replication, “a precise processwhere the exact same finding is reexamined in the same way”. However, conceptual replications are often more informative than exact replications, at least in studies that are testing theoretical predictions (Stroebe & Strack, 2014). Because conceptual replications operationalize independent and/or dependent variables in a different way, successful conceptual replications increase our trust in the predictive validity of our theory.”

There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance”
Anderson & Maxwell
(Mention)
“It is important to note some caveats regarding direct (exact) versus conceptual replications. While direct replications were once avoided for lack of originality, authors have recently urged the field to take note of the benefits and importance of direct replication. According to Simons (2014), this type of replication is “the only way to verify the reliability of an effect” (p. 76). With respect to this recent emphasis, the current article will assume direct replication. However, despite the push toward direct replication, some have still touted the benefits of conceptual replication (Stroebe & Strack, 2014). Importantly, many of the points and analyses suggested in this paper may translate well to conceptual replication.”

Reconceptualizing replication as a sequence of different studies: A replication typology
Joachim Hüffmeier, Jens Mazei, Thomas Schultze
(ER impossible)
The first type of replication study in our typology encompasses exact replication studies conducted by the author(s) of an original finding. Whereas we must acknowledge that replications can never be “exact” in a literal sense in psychology (Cesario, 2014; Stroebe & Strack, 2014), exact replications are studies that aspire to be comparable to the original study in all aspects (Schmidt, 2009). Exact replications—at least those that are not based on questionable research practices such as the arbitrary exclusion of critical outliers, sampling or reporting biases (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)—serve the function of protecting against false positive effects (Type I errors) right from the start.

(ER informative)
Thus, this replication constitutes a valuable contribution to the research process. In fact, already some time ago, Lykken (1968; see also Mummendey, 2012) recommended that all experiments should be replicated before publication. From our perspective, this recommendation applies in particular to new findings (i.e., previously uninvestigated theoretical relations), and there seems to be some consensus that new findings should be replicated at least once, especially when they were unexpected, surprising, or only loosely connected to existing theoretical models (Stroebe & Strack, 2014; see also Giner-Sorolla, 2012; Murayama et al., 2014).”

(Mention)
Although there is currently some debate about the epistemological value of close replication studies (e.g., Cesario, 2014; LeBel & Peters, 2011; Pashler & Harris, 2012; Simons, 2014; Stroebe & Strack, 2014), the possibility that each original finding can—in principal—be replicated by the scientific community represents a cornerstone of science (Kuhn, 1962; Popper, 1992).”

(CR test theory)
So far, we have presented “only” the conventional rationale used to stress the importance of close replications. Notably, however, we will now add another—and as we believe, logically necessary—point originally introduced by S&S (2014). This point protects close replications from being criticized (cf. Cesario, 2014; Stroebe & Strack, 2014; see also LeBel & Peters, 2011). Close replications can be informative only as long as they ensure that the theoretical processes investigated or at least invoked by the original study are shown to also operate in the replication study.

(CR test theory)
The question of how to conduct a close replication that is maximally informative entails a number of methodological choices. It is important to both adhere to the original study proceedings (Brandt et al., 2014; Schmidt, 2009) and focus on and meticulously measure the underlying theoretical mechanisms that were shown or at least proposed in the original studies (Stroebe & Strack, 2014). In fact, replication attempts are most informative when they clearly demonstrate either that the theoretical processes have unfolded as expected or at which point in the process the expected results could no longer be observed (e.g., a process ranging from a treatment check to a manipulation check and [consecutive] mediator variables to the dependent variable). Taking these measures is crucial to rule out that a null finding is simply due to unsuccessful manipulations or changes in a manipulation’s meaning and impact over time (cf. Stroebe & Strack, 2014). “

(CR test theory)
Conceptual replications in laboratory settings are the fourth type of replication study in our typology. In these replications, comparability to the original study is aspired to only in the aspects that are deemed theoretically relevant (Schmidt, 2009; Stroebe & Strack, 2014). In fact, most if not all aspects may differ as long as the theoretical processes that have been studied or at least invoked in the original study are also covered in a conceptual replication study in the laboratory.”

(ER useful for applied research)
For instance, conceptual replications may be less important for applied disciplines that focus on clinical phenomena and interventions. Here, it is important to ensure that there is an impact of a specific intervention and that the related procedure does not hurt the members of the target population (e.g., Larzelere et al., 2015; Stroebe & Strack, 2014).”

From intrapsychic to ecological theories in social psychology: Outlines of a functional theory approach
Klaus Fiedler
(ER uninformative)
Replicating an ill-understood finding is like repeating a complex sentence in an unknown language. Such a “replication” in the absence of deep understanding may appear funny, ridiculous, and embarrassing to a native speaker, who has full control over the foreign language. By analogy, blindly replicating or running new experiments on an ill-understood finding will rarely create real progress (cf. Stroebe & Strack, 2014). “

Into the wild: Field research can increase both replicability and real-world impact
Jon K. Maner
(CR test theory)
Although studies relying on homogeneous samples of laboratory or online participants might be highly replicable when conducted again in a similar homogeneous sample of laboratory or online participants, this is not the key criterion (or at least not the only criterion) on which we should judge replicability (Westfall, Judd & Kenny, 2015; see also Brandt et al., 2014; Stroebe & Strack, 2014). Just as important is whether studies replicate in samples that include participants who reflect the larger and more diverse population.”

Romance, Risk, and Replication: Can Consumer Choices and Risk-Taking Be Primed by Mating Motives?
Shanks et al.
(ER impossible)
There is no such thing as an “exact” replication (Stroebe & Strack, 2014) and hence it must be acknowledged that the published studies (notwithstanding the evidence for p-hacking and/or publication bias) may have obtained genuine effects and that undetected moderator variables explain why the present studies failed to obtain priming. Some of the experiments reported here differed in important ways from those on which they were modeled (although others were closer replications and even these failed to obtain evidence of reliable romantic priming).

(CR test theory)
As S&S (2014) point out, what is crucial is not so much exact surface replication but rather identical operationalization of the theoretically relevant variables. In the present case, the crucial factors are the activation of romantic motives and the appropriate assessment of consumption, risk-taking, and other measures.”

A Duty to Describe: Better the Devil You Know Than the Devil You Don’t
Brown, Sacha D et al.
(Mention)
Ioannidis (2005) has been at the forefront of researchers identifying factors interfering with self-correction. He has claimed that journal editors selectively publish positive findings and discriminate against study replications, permitting errors in data and theory to enjoy a long half-life (see also Ferguson & Brannick, 2012; Ioannidis, 2008, 2012; Shadish, Doherty, & Montgomery, 1989; Stroebe & Strack, 2014). We contend there are other equally important, yet relatively unexplored, problems.

A Room with a Viewpoint Revisited: Descriptive Norms and Hotel Guests’ Towel Reuse Behavior
(Contextual Sensitivity)
Bohner, Gerd; Schlueter, Lena E.
On the other hand, our pilot participants’ estimates of towel reuse rates were generally well below 75%, so we may assume that the guests participating in our experiments did not perceive the normative messages as presenting a surprisingly low figure. In a more general sense, the issue of greatly diverging baselines points to conceptual issues in trying to devise a ‘‘direct’’ replication: Identical operationalizations simply may take on different meanings for people in different cultures.

***The empirical benefits of conceptual rigor: Systematic articulation of conceptual hypotheses can reduce the risk of non-replicable results (and facilitate novel discoveries too)
Mark Schaller
(Contextual Sensitivity)
Unless these subsequent studies employ methods that exactly replicate the idiosyncratic context in which the effect was originally detected, these studies are unlikely to replicate the effect. Indeed, because many psychologically important contextual variables may lie outside the awareness of researchers, even ostensibly “exact” replications may fail to create the conditions necessary for a fragile effect to emerge (Stroebe & Strack, 2014)

A Concise Set of Core Recommendations to Improve the Dependability of Psychological Research
David A. Lishner
(CR test theory)
The claim that direct replication produces more dependable findings across replicated studies than does conceptual replication seems contrary to conventional wisdom that conceptual replication is preferable to direct replication (Dijksterhuis, 2014; Neulip & Crandall, 1990, 1993a, 1993b; Stroebe & Strack, 2014).
(CR test theory)
However, most arguments advocating conceptual replication over direct replication are attempting to promote the advancement or refinement of theoretical understanding (see Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014). The argument is that successful conceptual replication demonstrates a hypothesis (and by extension the theory from which it derives) is able to make successful predictions even when one alters the sampled population, setting, operations, or data analytic approach. Such an outcome not only suggests the presence of an organizing principle, but also the quality of the constructs linked by the organizing principle (their theoretical meanings). Of course this argument assumes that the consistency across the replicated findings is not an artifact of data acquisition or data analytic approaches that differ among studies. The advantage of direct replication is that regardless of how flexible or creative one is in data acquisition or analysis, the approach is highly similar across replication studies. This duplication ensures that any false finding based on using a flexible approach is unlikely to be repeated multiple times.

(CR test theory)
Does this mean conceptual replication should be abandoned in favor of direct replication? No, absolutely not. Conceptual replication is essential for the theoretical advancement of psychological science (Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014), but only if dependability in findings via direct replication is first established (Cesario, 2014; Simons, 2014). Interestingly, in instances where one is able to conduct multiple studies for inclusion in a research report, one approach that can produce confidence in both dependability of findings and theoretical generalizability is to employ nested replications.

(ER cannot detect fraud)
A second advantage of direct replications is that they can protect against fraudulent findings (Schmidt, 2009), particularly when different research groups conduct direct replication studies of each other’s research. S&S (2014) make a compelling argument that direct replication is unlikely to prove useful in detection of fraudulent research. However, even if a fraudulent study remains unknown or undetected, its impact on the literature would be lessened when aggregated with nonfraudulent direct replication studies conducted by honest researchers.

***Does cleanliness influence moral judgments? Response effort moderates the effect of cleanliness priming on moral judgments.
Huang
(ER uninformative)
Indeed, behavioral priming effects in general have been the subject of increased scrutiny (see Cesario, 2014), and researchers have suggested different causes for failed replication, such as measurement and sampling errors (Stanley and Spence,2014), variation in subject populations (Cesario, 2014), discrepancy in operationalizations (S&S, 2014), and unidentified moderators (Dijksterhuis,2014).

UNDERSTANDING PRIMING EFFECTS IN SOCIAL PSYCHOLOGY: AN OVERVIEW AND INTEGRATION
Daniel C. Molden
(ER uninformative)
Therefore, some greater emphasis on direct replication in addition to conceptual replication is likely necessary to maximize what can be learned from further research on priming (but see Stroebe and Strack, 2014, for costs of overemphasizing direct replication as well).

On the automatic link between affect and tendencies to approach and avoid: Chen and Bargh (1999) revisited
Mark Rotteveel et al.
(no replication crisis)
Although opinions differ with regard to the extent of this “replication crisis” (e.g., Pashler and Harris, 2012; S&S, 2014), the scientific community seems to be shifting its focus more toward direct replication.

(ER uninformative)
Direct replications not only affect one’s confidence about the veracity of the phenomenon under study, but they also increase our knowledge about effect size (see also Simons, 2014; but see also S&S, 2014).

Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability
McShane and Bockenholt
(ER impossible)
The purpose of meta-analysis is to synthesize a set of studies of a common phenomenon. This task is complicated in behavioral research by the fact that behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999).

(ER impossible)
Further, because behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999), our SPM methodology estimates and accounts for heterogeneity, which has been shown to be important in a wide variety of behavioral research settings (Hedges and Pigott 2001; Klein et al. 2014; Pigott 2012).

A Closer Look at Social Psychologists’ Silver Bullet: Inevitable and Evitable Side Effects of the Experimental Approach
Herbert Bless and Axel M. Burger
(ER/CR Distinction)
Given the above perspective, it becomes obvious that in the long run, conceptual replications can provide very fruitful answers because they address the question of whether the initially observed effects are potentially caused by some perhaps unknown aspects of the experimental procedure (for a discussion of conceptual versus direct replications, see e.g., Stroebe & Strack, 2014; see also Brandt et al., 2014; Cesario, 2014; Lykken, 1968; Schwarz & Strack, 2014). Whereas conceptual replications are adequate solutions for broadening the sample of situations (for examples, see Stroebe & Strack, 2014), the present perspective, in addition, emphasizes that it is important that the different conceptual replications do not share too much overlap in general aspects of the experiment (see also Schwartz, 2015, advocating for conceptual replications)

Men in red: A reexamination of the red-attractiveness effect
Vera M. Hesslinger, Lisa Goldbach, & Claus-Christian Carbon
(ER impossible)
As Brandt et al. (2014) pointed out, a replication in psychological research will never be absolutely exact or direct (see also, Stroebe & Strack, 2014), which is, of course, also the case in the present research.

***On the challenges of drawing conclusions from p-values just below 0.05
Daniel Lakens
(no evidence about QRP)
In recent years, researchers have become more aware of how flexibility during the data-analysis can increase false positive results (e.g., Simmons, Nelson & Simonsohn, 2011). If the true Type 1 error rate is substantially inflated, for example because researchers analyze their data until a p-value smaller than 0.05 is observed, the robustness of scientific knowledge can substantially decrease. However, as Stroebe & Strack (2014, p. 60) have pointed out: ‘Thus far, however, no solid data exist on the prevalence of such research practices.’

***Does Merely Going Through the Same Moves Make for a ‘‘Direct’’ Replication? Concepts, Contexts, and Operationalizations
Norbert Schwarz and Fritz Strack
(Contextual Sensitivity)
In general, meaningful replications need to realize the psychological conditions of the original study. The easier option of merely running through technically identical procedures implies the assumption that psychological processes are context insensitive and independent of social, cultural, and historical differences (Cesario, 2014; Stroebe & Strack, 2014). Few social (let alone cross-cultural) psychologists would be willing to endorse this assumption with a straight face. If so, mere procedural equivalence is an insufficient criterion for assessing the quality of a replication.

The Replication Paradox: Combining Studies can Decrease Accuracy of Effect Size Estimates
(ER uninformative)
Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, and Jelte M. Wicherts
Replications with nonsignificant results are easily dismissed with the argument that the replication might contain a confound that caused the null finding (Stroebe & Strack, 2014).

Retro-priming, priming, and double testing: psi and replication in a test-retest design
Rabeyron, T
(Mention)
Bem’s paper spawned numerous attempts to replicate it (see e.g., Galak et al., 2012; Bem et al., submitted) and reflections on the difficulty of direct replications in psychology (Ritchie et al., 2012). This aspect has been associated more generally with debates concerning the “decline effect” in science (Schooler, 2011) and a potential “replication crisis” (S&S, 2014) especially in the fields of psychology and medical sciences (De Winter and Happee, 2013).

Do p Values Lose Their Meaning in Exploratory Analyses? It Depends How You Define the Familywise Error Rate
Mark Rubin
(ER impossible)
Consequently, the Type I error rate remains constant if researchers simply repeat the same test over and over again using different samples that have been randomly drawn from the exact same population. However, this first situation is somewhat hypothetical and may even be regarded as impossible in the social sciences because populations of people change over time and location (e.g., Gergen, 1973; Iso-Ahola, 2017; Schneider, 2015; Serlin, 1987; Stroebe & Strack, 2014). Yesterday’s population of psychology undergraduate students from the University of Newcastle, Australia, will be a different population to today’s population of psychology undergraduate students from the University of Newcastle, Australia.

***Learning and the replicability of priming effects
Michael Ramscar
(ER uninformative)
In the limit, this means that in the absence of a means for objectively determining what the information that produces a priming effect is, and for determining that the same information is available to the population in a replication, all learned priming effects are scientifically unfalsifiable. (Which also means that in the absence of an account of what the relevant information is in a set of primes, and how it produces a specific effect, reports of a specific priming result — or failures to replicate it — are scientifically uninformative; see also [Stroebe & Strack, 2014.)

***Evaluating Psychological Research Requires More Than Attention to the N: A Comment on Simonsohn’s (2015) “Small Telescopes”
Norbert Schwarz and Gerald L. Clore
(CR test theory)
Simonsohn’s decision to equate a conceptual variable (mood) with its manipulation (weather) is compatible with the logic of clinical trials, but not with the logic of theory testing. In clinical trials, which have inspired much of the replicability debate and its statistical focus, the operationalization (e.g., 10 mg of a drug) is itself the variable of interest; in theory testing, any given operationalization is merely one, usually imperfect, way to realize the conceptual variable. For this reason, theory tests are more compelling when the results of different operationalizations converge (Stroebe & Strack, 2014), thus ensuring, in the case in point, that it is not “the weather” but indeed participants’ (sometimes weather-induced) mood that drives the observed effect.

Internal conceptual replications do not increase independent replication success
Kunert, R
(Contextual Sensitivity)
According to the unknown moderator account of independent replication failure, successful internal replications should correlate with independent replication success. This account suggests that replication failure is due to the fact that psychological phenomena are highly context-dependent, and replicating seemingly irrelevant contexts (i.e. unknown moderators) is rare (e.g., Barrett, 2015; DGPS, 2015; Fleming Crim, 2015; see also Stroebe & Strack, 2014; for a critique, see Simons, 2014). For example, some psychological phenomenon may unknowingly be dependent on time of day.

(Contextual Sensitivity greater in social psychology)
When the chances of unknown moderator influences are greater and replicability is achieved (internal, conceptual replications), then the same should be true when chances are smaller (independent, direct replications). Second, the unknown moderator account is usually invoked for social psychological effects (e.g. Cesario, 2014; Stroebe & Strack, 2014). However, the lack of influence of internal replications on independent replication success is not limited to social psychology. Even for cognitive psychology a similar pattern appears to hold.

On Klatzky and Creswell (2014): Saving Social Priming Effects But Losing Science as We Know It?
Barry Schwartz
(ER uninformative)
The recent controversy over what counts as “replication” illustrates the power of this presumption. Does “conceptual replication” count? In one respect, conceptual replication is a real advance, as conceptual replication extends the generality of the phenomena that were initially discovered. But what if it fails? Is it because the phenomena are unreliable, because the conceptual equivalency that justified the new study was logically flawed, or because the conceptual replication has permitted the intrusion of extraneous variables that obscure the original phenomenon? This ambiguity has led some to argue that there is no substitute for strict replication (see Pashler & Harris, 2012; Simons, 2014, and Stroebe & Strack, 2014, for recent manifestations of this controversy). A significant reason for this view, however, is less a critique of the logic of conceptual replication than it is a comment on the sociology (or politics, or economics) of science. As Pashler and Harris (2012) point out, publication bias virtually guarantees that successful conceptual replications will be published whereas failed conceptual replications will live out their lives in a file drawer. I think Pashler and Harris’ surmise is probably correct, but it is not an argument for strict replication so much as it is an argument for publication of failed conceptual replication.

Commentary and Rejoinder on Lynott et al. (2014)
Lawrence E. Williams
(CR test theory)
On the basis of their investigations, Lynott and colleagues (2014) conclude ‘‘there is no evidence that brief exposure to warm therapeutic packs induces greater prosocial responding than exposure to cold therapeutic packs’’ (p. 219). This conclusion, however, does not take into account other related data speaking to the connection between physical warmth and prosociality. There is a fuller body of evidence to be considered, in which both direct and conceptual replications are instructive. The former are useful if researchers particularly care about the validity of a specific phenomenon; the latter are useful if researchers particularly care about theory testing (Stroebe & Strack, 2014).

The State of Social and Personality Science: Rotten to the Core, Not So Bad, Getting Better, or Getting Worse?
(no replication crisis)
Motyl et al. (2017) “The claim of a replicability crisis is greatly exaggerated.” Wolfgang Stroebe and Fritz Strack, 2014

Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology
Harry T. Reis, Karisa Y. Lee
(ER impossible)
Much of the current debate, however, is focused narrowly on direct or exact replications—whether the findings of a given study, carried out in a particular way with certain specific operations, would be repeated. Although exact replications are surely desirable, the papers by Fabrigar and by Crandall and Sherman remind us that in an absolute sense they are fundamentally impossible in social–personality psychology (see also S&S, 2014).

Show me the money
(Contextual Sensitivity)
Of course, it is possible that additional factors, which varied or could have varied among our studies and previously published studies (e.g., participants’ attitudes toward money) or among the online studies and laboratory study in this article (e.g., participants’ level of distraction), might account for these apparent inconsistencies. We did not aim to conduct a direct replication of any specific past study, and therefore we encourage special care when using our findings to evaluate existing ones (Doyen, Klein, Simons, & Cleeremans, 2014; Stroebe & Strack, 2014).

***From Data to Truth in Psychological Science. A Personal Perspective.
Strack
(ER uninformative)
In their introduction to the 2016 volume of the Annual Review of Psychology, Susan Fiske, Dan Schacter, and Shelley Taylor point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects. To allow non-replications to regain this constructive role, they must come with conclusions that enter and stimulate a critical debate. It is even better if replication studies are endowed with a hypothesis that relates to the state of the scientific discourse. To show that an effect occurs only under one but not under another condition is more informative than simply demonstrating noneffects (S&S, 2014). But this may require expertise and effort.

Replicability 101: How to interpret the results of replication studies

February 8, 2018Bem, False Positives, Maxwell, Neyman-Pearson, Power, Publication Bias, Questionable Research Practices, Replicability, Replication, Replication Crisis, Research Integrity, Statistical PowerUlrich Schimmack

Even statistically sophisticated psychologists struggle with the interpretation of replication studies (Maxwell et al., 2015). This article gives a basic introduction to the interpretation of statistical results within the Neyman Pearson approach to statistical inferences.

I make two important points and correct some potential misunderstandings in Maxwell et al.’s discussion of replication failures. First, there is a difference between providing sufficient evidence for the null-hypothesis (evidence of absence) and providing insufficient evidence against the null-hypothesis (absence of evidence). Replication studies are useful even if they simply produce absence of evidence without evidence that an effect is absent. Second, I point out that publication bias undermines the credibility of significant results in original studies. When publication bias is present, open replication studies are valuable because they provide an unbiased test of the null-hypothesis, while original studies are rigged to reject the null-hypothesis.

DEFINITION OF REPLICATING A STATISTICAL RESULT

Replicating something means to get the same result. If I make the first free throw, replicating this outcome means to also make the second free throw. When we talk about replication studies in psychology we borrow from the common meaning of the term “to replicate.”

If we conduct psychological studies, we can control many factors, but some factors are not under our control. Participants in two independent studies differ from each other and the variation in the dependent variable across samples introduces sampling error. Hence, it is practically impossible to get identical results, even if the two studies are exact copies of each other. It is therefore more complicated to compare the results of two studies than to compare the outcome of two free throws.

To determine whether the results of two studies are identical or not, we need to focus on the outcome of a study. The most common outcome in psychological studies is a significant or non-significant result. The goal of a study is to produce a significant result and for this reason a significant result is often called a success. A successful replication study is a study that also produces a significant result. Obtaining two significant results is akin to making two free throws. This is one of the few agreements between Maxwell and me.

“Generally speaking, a published original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect.” (p. 488)

The more interesting and controversial scenario is a replication failure. That is, the original study produced a significant result (success) and the replication study produced a non-significant result (failure).

I propose that a lot of confusion arises from the distinction between original and replication studies. If a replication study is an exact copy of the first study, the outcome probabilities of original and replication studies are identical. Otherwise, the replication study is not really a replication study.

There are only three possible outcomes in a set of two studies: (a) both studies are successful, (b) one study is a success and one is a failure, or (c) both studies are failures. The probability of these outcomes depends on whether the significance criterion (the type-I error probability) when the null-hypothesis is true and the statistical power of a study when the null-hypothesis is false.

Table 1 shows the probability of the outcomes in two studies. The uncontroversial scenario of two significant results is very unlikely, if the null-hypothesis is true. With conventional alpha = .05, the probability is .0025 or 1 out of 400 attempts. This shows the value of replication studies. False positives are unlikely to repeat themselves and a series of replication studies with significant results is unlikely to occur by chance alone.

	2 sig, 0 ns	1 sig, 1 ns	0 sig, 2 ns
H0 is True	alpha^2	2alpha(1-alpha)	(1-alpha^2)
H1 is True	(1-beta)^2	2(1-beta)beta	beta^2

The probability of a successful replication of a true effect is a function of statistical power (1 – type-II error probability). High power is needed to get significant results in a pair of studies (an original study and a replication study). For example, if power is only 50%, the chance of this outcome is only 25% (Schimmack, 2012). Even with conventionally acceptable power of 80%, only 2/3 (64%) of replication attempts would produce this outcome. However, studies in psychology do not have 80% power and estimates of power can be as low as 37% (OSC, 2015). With 40% power, a pair of studies would produce significant results in no more than 16 out of 100 attempts. Although successful replications of true effects with low power are unlikely, they are still much more likely then significant results when the null-hypothesis is true (16/100 vs. 1/400 = 64:1). It is therefore reasonable to infer from two significant results that the null-hypothesis is false.

If the null-hypothesis is true, it is extremely likely that both studies produce a non-significant result (.95^2 = 90.25%). In contrast, it is unlikely that even a study with modest power would produce two non-significant results. For example, if power is 50%, there is a 75% chance that at least one of the two studies produces a significant result. If power is 80%, the probability of obtaining two non-significant results is only 4%. This means, it is much more likely (22.5 : 1) that the null-hypothesis is true than that the alternative hypothesis is true. This does not mean that the null-hypothesis is true in an absolute sense because power depends on the effect size. For example, if 80% power were obtained with a standardized effect size of Cohen’s d = .5, two non-significant results would suggest that the effect size is smaller than .5, but it does not warrant the conclusion that H0 is true and the effect size is exactly 0. Once more, it is important to distinguish between the absence of evidence for an effect and the evidence of absence of an effect.

The most controversial scenario assumes that the two studies produced inconsistent outcomes. Although theoretically there is no difference between the first and the second study, it is common to focus on a successful outcome followed by a replication failure (Maxwell et al., 2015). When the null-hypothesis is true, the probability of this outcome is low; .05 * (1-.05) = .0425. The same probability exists for the reverse pattern that a non-significant result is followed by a significant one. A probability of 4.25% shows that it is unlikely to observe a significant result followed by a non-significant result when the null-hypothesis is true. However, the low probability is mostly due to the low probability of obtaining a significant result in the first study, while the replication failure is extremely likely.

Although inconsistent results are unlikely when the null-hypothesis is true, they can also be unlikely when the null-hypothesis is false. The probability of this outcome depends on statistical power. A pair of studies with very high power (95%) is very unlikely to produce an inconsistent outcome because both studies are expected to produce a significant result. The probability of this rare event can be as low, or lower, than the probability with a true null effect; .95 * (1-.95) = .0425. Thus, an inconsistent result provides little information about the probability of a type-I or type-II error and is difficult to interpret.

In conclusion, a pair of significance tests can produce three outcomes. All three outcomes can occur when the null-hypothesis is true and when it is false. Inconsistent outcomes are likely unless the null-hypothesis is true or the null-hypothesis is false and power is very high. When two studies produce inconsistent results, statistical significance provides no basis for statistical inferences.

Meta-Analysis

The counting of successes and failures is an old way to integrate information from multiple studies. This approach has low power and is no longer used. A more powerful approach is effect size meta-analysis. Effect size meta-analysis was one way to interpret replication results in the Open Science Collaboration (2015) reproducibility project. Surprisingly, Maxwell et al. (2015) do not consider this approach to the interpretation of failed replication studies. To be clear, Maxwell et al. (2015) mention meta-analysis, but they are talking about meta-analyzing a larger set of replication studies, rather than meta-analyzing the results of an original and a replication study.

“This raises a question about how to analyze the data obtained from multiple studies. The natural answer is to use meta-analysis.” (p. 495)

I am going to show that effect-size meta-analysis solves the problem of interpreting inconsistent results in pairs of studies. Importantly, effect size meta-analysis does not care about significance in individual studies. A meta-analysis of a pair of studies with inconsistent results is no different from a meta-analysis of a pair of studies with consistent results.

Maxwell et al.’s (2015) introduced an example of a between-subject (BS) design with n = 40 per group (total N = 80) and a standardized effect size of Cohen’s d = .5 (a medium effect size). This study has 59% power to obtain a significant result. Thus, it is quite likely that a pair of studies produces inconsistent results (48.38%). However, a pair of studies with N = 80 has the power of a total sample size of N = 160, which means a fixed-effects meta-analysis will produce a significant result in 88% of all attempts. Thus, it is not difficult at all to interpret the results of pairs of studies with inconsistent results if the studies have acceptable power (> 50%). Even if the results are inconsistent, a meta-analysis will provide the correct answer that there is an effect most of the time.

A more interesting scenario are inconsistent results when the null-hypothesis is true. I turned to simulations to examine this scenario more closely. The simulation showed that a meta-analysis of inconsistent studies produced a significant result in 34% of all cases. The percentage slightly varies as a function of sample size. With a small sample of N = 40, the percentage is 35%. With a large sample of 1,000 participants it is 33%. This finding shows that in two-thirds of attempts, a failed replication reverses the inference about the null-hypothesis based on a significant original study. Thus, if an original study produced a false-positive results, a failed replication study corrects this error in 2 out of 3 cases. Importantly, this finding does not warrant the conclusion that the null-hypothesis is true. It merely reverses the result of the original study that falsely rejected the null-hypothesis.

In conclusion, meta-analysis of effect sizes is a powerful tool to interpret the results of replication studies, especially failed replication studies. If the null-hypothesis is true, failed replication studies can reduce false positives by 66%.

DIFFERENCES IN SAMPLE SIZES

We can all agree that, everything else being equal, larger samples are better than smaller samples (Cohen, 1990). This rule applies equally to original and replication studies. Sometimes it is recommended that replication studies should use much larger samples than original studies, but it is not clear to me why researchers who conduct replication studies should have to invest more resources than original researchers. If original researchers conducted studies with adequate power, an exact replication study with the same sample size would also have adequate power. If the original study was a type-I error, the replication study is unlikely to replicate the result no matter what the sample size. As demonstrated above, even a replication study with the same sample size as the original study can be effective in reversing false rejections of the null-hypothesis.

From a meta-analytic perspective, it does not matter whether a replication study had a larger or smaller sample size. Studies with larger sample sizes are given more weight than studies with smaller samples. Thus, researchers who invest more resources are rewarded by giving their studies more weight. Large original studies require large replication studies to reverse false inferences, whereas small original studies require only small replication studies to do the same. Nevertheless, failed replications with larger samples are more likely to reverse false rejections of the null-hypothesis, but there is no magical number about the size of a replication study to be useful.

I simulated a scenario with a sample size of N = 80 in the original study and a sample size of N = 200 in the replication study (a factor of 2.5). In this simulation, only 21% of meta-analyses produced a significant result. This is 13 percentage points lower than in the simulation with equal sample sizes (34%). If the sample size of the replication study is 10 times larger (N = 80 and N = 800), the percentage of remaining false positive results in the meta-analysis shrinks to 10%.

The main conclusion is that even replication studies with the same sample size as the original study have value and can help to reverse false positive findings. Larger sample sizes simply give replication studies more weight than original studies, but it is by no means necessary to increase sample sizes of replication studies to make replication failures meaningful. Given unlimited resources, larger replications are better, but these analysis show that large replication studies are not necessary. A replication study with the same sample size as the original study is more valuable than no replication study at all.

CONFUSING ABSENCE OF EVIDENCE WITH EVIDENCE OF ABSENCE

One problem in Maxwell et al’s (2015) article is to conflate two possible goals of replication studies. One goal is to probe the robustness of the evidence against the null-hypothesis. If the original result was a false positive result, an unsuccessful replication study can reverse the initial inference and produce a non-significant result in a meta-analysis. This finding would mean that evidence for an effect is absent. The status of a hypothesis (e.g., humans have supernatural abilities; Bem, 2011) is back to where it was before the original study found a significant result and the burden of proof is shifted back to proponents of the hypothesis to provide unbiased credible evidence for it.

Another goal of replication studies can be to provide conclusive evidence that an original study reported a false positive result (i..e, humans do not have supernatural abilities). Throughout their article, Maxwell et al. assume that the goal of replication studies is to prove the absence of an effect. They make many correct observations about the difficulties of achieving this goal, but it is not clear why replication studies have to be conclusive when original studies are not held to the same standard.

This makes it easy to produce (potentially false) positive results and very hard to remove false positive results from the literature. It also creates a perverse incentive to conduct underpowered original studies and to claim victory when a large replication study finds a significant result with an effect size that is 90% smaller than the effect size in an original study. The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported. To avoid this problem that replication researchers have to invest large amount of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence for their original ideas in original articles. If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.

THE DIRTY BIG SECRET

The main problem of Maxwell et al.’s (2015) article is that the authors blissfully ignore the problem of publication bias. They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results (Schimmack, 2012; Sterling; 1959; Sterling et al., 1995).

It is hard to believe that Maxwell is unaware of this problem, if only because Maxwell was action editor of my article that demonstrated how publication bias undermines the credibility of replication studies that are selected for significance (Schimmack, 2012).

I used Bem’s infamous article on supernatural abilities as an example, which appeared to show 8 successful replications of supernatural abilities. Ironically, Maxwell et al. (2015) also cites Bem’s article to argue that failed replication studies can be misinterpreted as evidence of absence of an effect.

“Similarly, Ritchie, Wiseman, and French (2012) state that their failure to obtain significant results in attempting to replicate Bem (2011) “leads us to favor the ‘experimental artifacts’ explanation for Bem’s original result” (p. 4)”

This quote is not only an insult to Ritchie et al.; it also ignores the concerns that have been raised about Bem’s research practices. First, Ritchie et al. do not claim that they have provided conclusive evidence against ESP. They merely express their own opinion that they “favor the ‘experimental artifacts’ explanation. There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities.

More important, Maxwell et al. ignore the broader context of these studies. Schimmack (2012) discussed many questionable practices in Bem’s original studies and I presented statistical evidence that the significant results in Bem’s article were obtained with the help of questionable research practices. Given this wider context, it is entirely reasonable to favor the experimental artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.

It is not clear why Maxwell et al. (2015) picked Bem’s article to discuss problems with failed replication studies and ignores that questionable research practices undermine the credibility of significant results in original research articles. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.

Maxwell et al. (2015) were not aware that in the same year, the OSC (2015) reproducibilty project would replicate only 37% of statistically significant results in top psychology journals, while the apparent success rate in these journals is over 90%. The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provided strong evidence that psychology is suffering from a replication crisis. This does not mean that all failed replications are false positives, but it does mean that it is not clear which findings are false positives and which findings are not. Whether this makes things better is a matter of opinion.

Publication bias also undermines the usefulness of meta-analysis for hypothesis testing. In the OSC reproducibility project, a meta-analysis of original and replication studies produced 68% significant results. This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence and the large number of replication failures means that more replication studies with larger samples are needed to see which hypothesis predict real effects with practical significance.

DOES PSYCHOLOGY HAVE A REPLICATION CRISIS?

Maxwell et al.’s (2015) answer to this question is captured in this sentence. “Despite raising doubts about the extent to which apparent failures to replicate necessarily reveal that psychology is in crisis,we do not intend to dismiss concerns about documented methodological flaws in the field.” (p. 496). The most important part of this quote is “raising doubt,” the rest is Orwellian double-talk.

The whole point of Maxwell et al.’s article is to assure fellow psychologists that psychology is not in crisis and that failed replication studies should not be a major concern. As I have pointed out, this conclusion is based on some misconceptions about the purpose of replication studies and by blissful ignorance about publication bias and questionable research practices that made it possible to publish successful replications of supernatural phenomena, while discrediting authors who spend time and resources on demonstrating that unbiased replication studies fail.

The real answer to Maxwell et al.’s question was provided by the OSC (2015) finding that only 37% of published significant results could be replicated. In my opinion that is not only a crisis, but a scandal because psychologists routinely apply for funding with power analyses that claim 80% power. The reproducibilty project shows that the true power to obtain significant results in original and replication studies is much lower than this and that the 90% success rate is no more meaningful than 90% votes for a candidate in communist elections.

In the end, Maxwell et al. draw the misleading conclusion that “the proper design and interpretation of replication studies is less straightforward than conventional practice would suggest.” They suggest that “most importantly, the mere fact that a replication study yields a nonsignificant statistical result should not by itself lead to a conclusion that the corresponding original study was somehow deficient and should no longer be trusted.”

As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence (e.g., p = .04), the original study was published in a journal that only publishes significant results, (d) the replication study had a larger sample, (e) the replication study would have been published independent of outcome, and (f) the replication study was preregistered.

We can only speculate why the American Psychologists published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail. Fortunately, APA can no longer control what is published because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticize peer-reviewed articles in open post-publication peer review on social media.

Long life the replicability revolution. !!!

REFERENCES

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.

http://dx.doi.org/10.1037/0003-066X.45.12.1304

Maxwell, S.E, Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does ‘failure to replicate’ really mean? American Psychologist, 70, 487-498. http://dx.doi.org/10.1037/a0039400.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem

January 5, 2018Bem, Publication Bias, Questionable Research Practices, TIVAUlrich Schimmack

Added February 06, 2023

A collaboration between Bem and other believers in ESP and skeptics produced no evidence for a real effect and indirectly confirms that Bem’s incredible results were produced with questionable practices described here. “The failure to replicate previous positive findings with this strict methodology indicates that it is likely that the overall positive effect in the literature might be the result of recognized methodological biases rather than ESP” (https://doi.org/10.1098/rsos.191375)

Added January 30, 2018: A formal letter to the editor of JPSP, calling for a retraction of the article (Letter).

Added January 1, 2020: Response from JPSP Editor Shinobu Kitayama (Response).

—————————————————————————————————————————

“I’m all for rigor, but I prefer other people do it. I see its importance—it’s fun for some people—but I don’t have the patience for it. If you looked at all my past experiments, they were always rhetorical devices. I gathered data to show how my point would be made. I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)

In 2011, the Journal of Personality and Social Psychology published a highly controversial article that claimed to provide evidence for time-reversed causality. Time reversed causality implies that future events have a causal effect on past events. These effects are considered to be anomalous and outside current scientific explanations of human behavior because they contradict fundamental principles of our current understanding of reality.

The article reports 9 experiments with 10 tests of time-reversed causal influences on human behavior with stunning results. “The mean effect size (d) in psi performance across all 9 experiments was 0.22, and all but one of the experiments yielded statistically significant results. ” (Bem, 2011, p. 407).

The publication of this article rocked psychology and triggered a credibility crisis in psychological science. Unforeseen by Bem, the article did not sway psychologists to believe in time-reversed causality. Rather, it made them doubt other published findings in psychology.

In response to the credibility crisis, psychologists started to take replications more seriously, including replications of Bem’s studies. If Bem’s findings were real, other scientists should be able to replicate them using the same methodology in their labs. After all, independent verification by other scientists is the ultimate test of all empirical sciences.

The first replication studies were published by Ritchie, Wiseman, and French (2012). They conducted three studies with a total sample size of N = 150 and did not obtain a significant effect. Although this finding casts doubt about Bem’s reported results, the sample size is too small to challenge the evidence reported by Bem which was based on over 1,000 participants. A more informative replication attempt was made by Galek et al. (2012). A set of seven studies with a total of N = 3,289 participants produced an average effect size of d = 0.04, which was not significantly different from zero. This massive replication failure raised questions about potential moderators (i.e., variables that can explain inconsistent findings). The authors found “the only moderator that yields significantly different results is whether the experiment was conducted by Bem or not.” (p. 941).

Galek et al. (2012) also speculate about the nature of the moderating factor that explains Bem’s high success rate. One possible explanation is that Bem’s published results do not represent reality. Published results can only be interpreted at face value, if the reported data and analyses were not influenced by the result. If, however, data or analyzes were selected because they produced evidence for time-reversed causality, and data and analyses that failed to provide evidence for it were not reported, the results cannot be considered empirical evidence for an effect. After all, random numbers can provide evidence for any hypothesis, if they are selected for significance (Rosenthal, 1979; Sterling, 1959). It is irrelevant whether this selection occurred involuntarily (self-deception) or voluntary (other-deception). Both, self-deception and other-deception introduce bias in the scientific record.

Replication studies cannot provide evidence about bias in original studies. A replication study only tells us that other scientists were unable to replicate original findings, but they do not explain how the scientist who conducted the original studies obtained significant results. Seven years after Bem’s stunning results were published, it remains unknown how he obtained significant results in 9 out of 10 studies.

I obtained Bem’s original data (email on February 25, 2015) to examine this question more closely. Before I present the results of my analysis, I consider several possible explanations for Bem’s surprisingly high success rate.

1. Luck

The simplest and most parsimonious explanation for a stunning original result that cannot be replicate is luck. The outcome of empirical studies is partially determined by factors outside an experimenter’s control. Sometimes these random factors will produce a statistically significant result by chance alone. The probability of this outcome is determined by the criterion for statistical significance. Bem used the standard criterion of 5%. If time-reversed causality does not exist, 1 out of 20 attempts to demonstrate the phenomenon would provide positive evidence for it.

If Bem or other scientists would encounter one successful attempt and 19 unsuccessful attempts, they would not consider the one significant result evidence for the effect. Rather, the evidence would strongly suggest that the phenomenon does not exist. However, if the significant result emerged in the first attempt, Bem could not know (unless he can see into the future) that the next 19 studies will not replicate the effect.

Attributing Bem’s results to luck would be possible, if Bem had reported a significant result in a single study. However, the probability of getting lucky decreases with the number of attempts. Nobody gets lucky every time they try. The luck hypothesis assumes that Bem got lucky 9 out of 10 times with a probability of 5% on each attempt.
The probability of this event is very small. To be exact, it is 0.000000000019 or 1 out of 53,612,565,445.

Given this small probability, it is safe to reject the hypothesis that Bem’s results were merely the outcome of pure chance. If we assume that time-reversed causality does not exist, we are forced to believe that Bem’s published results are biased by involuntarily or voluntarily presenting misleading evidence; that is evidence that strengthens beliefs in a phenomenon that actually does not exist.

2. Questionable Research Practices

The most plausible explanation for Bem’s incredible results is the use of questionable research practices (John et al., 2012). Questionable research practices increase the probability of presenting only supportive evidence for a phenomenon at the risk of providing evidence for a phenomenon that does not exist. Francis (2012) and Schimmack (2012) independently found that Bem reported more significant results than one would expect based on the statistical power of the studies. This finding suggests that questionable research practices were used, but they do not provide information about the actual research practices that were used. John et al. listed a number of questionable research practices that might explain Bem’s findings.

2.1. Multiple Dependent Variables

One practice is to collect multiple dependent variables and to report only dependent variables that produced a significant result. The nature of Bem’s studies reduces the opportunity to collect many dependent variables. Thus, the inclusion of multiple dependent variables cannot explain Bem’s results.

2.2. Failure to report all conditions

This practice applies to studies with multiple conditions. Only Study 1 examined precognition for multiple types of stimuli and found a significant result for only one of them. However, Bem reported the results for all conditions and it was transparent that the significant result was only obtained in one condition, namely with erotic pictures. This weakens the evidence in Study 1, but it does not explain significant results in the other studies that had only one condition or two conditions that both produced significant results.

2.3 Generous Rounding

Sometimes a study may produce a p-value that is close to the threshold value of .05. Strictly speaking a p-value of .054 is not significant. However, researchers may report the p-value rounded to the second digit and claim significance. It is easy to spot this questionable research practice by computing exact p-values for the reported test-statistics or by redoing the statistical analysis from original data. Bem reported his p-values with three digits. Moreover, it is very unlikely that a p-value falls into the range between .05 and .055 and that this could happen in 9 out of 10 studies. Thus, this practice also does not explain Bem’s results.

2.4 HARKing

Hypothesizing after results are known (Kerr, 1998) can be used to make significant results more credible. The reason is that it is easy to find significant results in a series of exploratory analyses. A priori predictions limit the number of tests that are carried out and the risk of capitalizing on chance. Bem’s studies didn’t leave much room for HARKing, except Study 1. The studies build on a meta-analysis of prior studies and nobody has questioned the paradigms used by Bem to test time-reversed causality. Bem did include an individual difference measure and found that it moderated the effect, but even if this moderator effect was HARKed, the main effect remains to be explained. Thus, HARKing can also not explain Bem’s findings.

2.5 Excluding of Data

Sometimes non-significant results are caused by an an inconvenient outlier in the control group. Selective exclusion of these outliers based on p-values is another questionable research practice. There are some exclusions in Bem’s studies. The method section of Study 3 states that 100 participants were tested and three participants were excluded due to a high error rate in responses. The inclusion of these three participants is unlikely to turn a significant result with t(96) = 2.55, p = .006 (one-tailed), into a non-significant result. In Study 4, one participant out of 100 participants was excluded. The exclusion of a single participant is unlikely to change a significant result with t(98) = 2.03, p = .023 into a non-significant result. Across all studies, only 4 participants out of 1075 participants were excluded. Thus, exclusion of data cannot explain Bem’s robust evidence for time-reversed causality that other researchers cannot replicate.

2.6 Stopping Data Collection Early

Bem aimed for a minimum sample size of N = 100 to achieve 80% power in each study. All studies except Study 9 met this criterion before excluding participants (Ns = 100, 150, 97, 99, 100, 150, 200, 125, 50). Bem does not provide a justification for the use of a smaller sample size in Study 9 that reduced power from 80% to 54%. The article mentions that Study 9 was a modified replication of Study 8 and yielded a larger observed effect size, but the results of Studies 8 and 9 are not significantly different. Thus, the smaller sample size is not justified by an expectation of a larger effect size to maintain 80% power.

In a personal communication, Bem also mentioned that the study was terminated early because it was the end of the semester and the time stamp in the data file shows that the last participant was run on December 6, 2009. Thus, it seems that Study 9 was terminated early, but Bem simply got lucky that results were significant at the end of the semester. Even if Study 9 is excluded for this reason, it remains unclear how the other 8 studies could have produced significant results without a real effect.

2.7 Optional Stopping/Snooping

Collecting more data, if the collected data already show a significant effect can be wasteful. Therefore, researchers may conduct statistical significance tests throughout a study and terminate data collection when a significant result is obtained. The problem with this approach is that repeated checking (snooping) increases the risk of a false positive result (Strube, 2006). The increase in the risk of a false positive results depends on how frequently and how often researchers check results. If researchers use optional stopping, sample sizes are expected to vary because sampling error will sometimes produce a significant result quickly and sometimes after a long time. Second, sample size would be negatively correlated with observed effect sizes. The reason is that larger samples are needed to achieve significance with smaller observed effect sizes. If chance produces large effect sizes early on, significance is achieved quickly and the study is terminated with a small sample size and a large effect size. Finally, optional stopping will produce p-values close to the significance criterion because data collection is terminated as soon as p-values reach the criterion value.

The reported statistics in Bem’s article are consistent with optional stopping. First, sample sizes vary from N = 50 to N = 200. Second, sample sizes are strongly correlated with effect sizes, r = -.91 (Alcock, 2011). Third, p-values are bunched up close to the criterion value, which suggests studies may have been stopped as soon as significance was achieved (Schimmack, 2015).

Despite these warning signs, optional stopping cannot explain Bem’s results, if time-reversed causality does not exist. The reason is that the sample sizes are too small for a set of 9 studies to produce significant results. In a simulation study, with a minimum of 50 participants and a maximum of 200 participants, only 30% of attempts produced a significant result. Even 1,000 participants are not enough to guarantee a significant result by simply collecting more data.

2.8 Selective Reporting

The last questionable practice is to report only successful studies that produce a significant result. This practice is widespread and contributes to the presence of publication bias in scientific journals (Fraonco et al., 2014).

Selective reporting assumes that researchers conduct a series of studies and report only studies that produced a significant result. This may be a viable strategy for sets of studies with a real effect, but it does not seem to be a viable strategy, if there is no effect. Without a real effect, a significant result with p < .05 emerges in 1 out of 20 attempts. To obtain 9 significant results, Bem would have had to conduct approximately 9*20 = 180 studies. With a modal sample size of N = 100, this would imply a total sample size of 18,000 participants.

Engber (2017) reports that Bem conducted his studies over a period of 10 years. This may be enough time to collect data from 18,000 participants. However, Bem also paid participants $5 out of his own pocket because (fortunately) this research was not supported by research grants. This would imply that Bem paid $90,000 out of pocket.

As a strong believer in ESP, Bem may have paid $90,000 dollars to fund his studies, but any researcher of Bem’s status should realize that obtaining 9 significant results in 180 attempts does not provide evidence for time-reversed causality. Not disclosing that there were over 100 failed studies, would be a breach of scientific standards. Indeed, Bem (2010) warned graduate students in social psychology.

“The integrity of the scientific enterprise requires the reporting of disconfirming results.”

2.9 Conclusion

In conclusion, none of the questionable research practices that have been identified by John et al. seem to be plausible explanations for Bem’s results.

3. The Decline Effect and a New Questionable Research Practice

When I examined Bem’s original data, I discovered an interesting pattern. Most studies seemed to produce strong effect sizes at the beginning of a study, but then effect sizes decreased. This pattern is similar to the decline effect that has been observed across replication studies of paranormal phenomena (Schooler, 2011).

Figure 1 provides a visual representation of the decline effect in Bem’s studies. The x-axis is the sample size and the y-axis is the cumulative effect size. As sample sizes increase, the cumulative effect size approaches the population effect size. The grey area represents the results of simulation studies with a population effect size of d = .20. As sampling error is random, the grey area is a symmetrical funnel around the population effect size. The blue dotted lines show the cumulative effect sizes for Bem’s studies. The solid blue line shows the average cumulative effect size. The figure shows how the cumulative effect size decreases by more than 50% from the first 5 participants to a sample size of 100 participants.

The selection effect is so strong that Bem could have stopped 9 of the 10 studies after collecting a maximum of 15 participants with a significant result. The average sample size for these 9 studies would have been only 7.75 participants.

Table 1 shows the one-sided p-values for Bem’s datasets separately for the first 50 participants and for participants 51 to 100. For the first 50 participants, 8 out of 10 tests are statistically significant. For the following 50 participants none of the 10 tests is statistically significant. A meta-analysis across the 10 studies does show a significant effect for participants 51 to 100, but the Test of Insufficient Variance also shows insufficient variance, Var(z) = 0.22, p = .013, suggesting that even these trials are biased by selection for significance (Schimmack, 2015).
Table 1. P-values for Bem’s 10 datasets based on analyses of the first group of 50 participants and the second group of 50 participants.

EXPERIMENT	S 1-50	S 51-100
EXP1	p = .004	p = .194
EXP2	p = .096	p = .170
EXP3	p = .039	p = .100
EXP4	p = .033	p = .067
EXP5	p = .013	p = .069
EXP6a	p = .412	p = .126
EXP5b	p = .023	p = .410
EXP7	p = .020	p = .338
EXP8	p = .010	p = .318
EXP9	p = .003	NA

There are two interpretations of the decrease in effect sizes over the course of an experiment. One explanation is that we are seeing a subset of attempts that showed promising results after peeking at the data. Unlike optional stopping, however, a researcher continuous to collect more data to see whether the effect is real. Although the effect size decreases, the strong effect during the initial trials that motivated a researcher to collect more data is sufficient to maintain statistical significance because sampling error also decreases as more participants are added. These results cannot be replicated because they capitalized on chance during the first trials, but this remains unnoticed because the next study does not replicate the first study exactly. Instead, the researcher makes a small change to the experimental procedure and when he or she peeks at the data of the next study, the study is abandoned and the failure is attributed to the change in the experimental procedure (without checking that the successful finding can be replicated).

In this scenario, researchers are deceiving themselves that slight experimental manipulations apparently have huge effects on their dependent variable because sampling error in small samples is very large. Observed effect sizes in small samples can range from 1 to -1 (see grey area in Figure 1), giving the illusion that each experiment is different, but a random number generator would produce the same stunning differences in effect sizes. Bem (2011), and reviewers of his article, seem to share the believe that “the success of replications in psychological research often depends on subtle and unknown factors.” (p. 422). How could Bem reconcile this believe with the reporting of 9 out of 10 successes? The most plausible explanation is that successes are a selected set of findings out of many attempts that were not reported.

There are other hints that Bem peeked at the data to decide whether to collect more data or terminate data collection. In his 2011 article, he addressed concerns about a file drawer stuffed with failed studies.

“Like most social-psychological experiments, the experiments reported here required extensive pilot testing. As all research psychologists know, many procedures are tried and discarded during this process. This raises the question of how much of this pilot exploration should be reported to avoid the file-drawer problem, the selective suppression of negative or null results.”

Bem does not answer his own question, but the correct answer is clear: all of the so-called pilot studies need to be included if promising pilot studies were included in the actual studies. If Bem had clearly distinguished between promising pilot studies and actual studies, actual studies would be unbiased. However, it appears that he continued collecting data after peeking at the results after a few trials and that the significant results are largely driven by inflated effect sizes in promising pilot studies. This biased the results and can explain how Bem obtained evidence for time-reversed causality that others could not replicate when they did not peek at the data and terminated studies when the results were not promising.

Additional hints come from an interview with Engber (2017).

“I would start one [experiment], and if it just wasn’t going anywhere, I would abandon it and restart it with changes,” Bem told me recently. Some of these changes were reported in the article; others weren’t. “I didn’t keep very close track of which ones I had discarded and which ones I hadn’t,” he said. Given that the studies spanned a decade, Bem can’t remember all the details of the early work. “I was probably very sloppy at the beginning,” he said.

In sum, a plausible explanation of Bem’s successes that others could not replicate is that he stopped studies early when they did not show a promising result, then changed the procedure slightly. He also continued data collection when results looked promising after a few trials. As this research practices capitalizes on chance to produce large effect sizes at the beginning of a study, the results are not replicable.

Although this may appear to be the only hypothesis that is consistent with all of the evidence (evidence of selection bias in Bem’s studies, decline effect over the course of Bem’s studies, failed replications), it may not be the only one. Schooler (2011) proposed that something more intriguing may cause decline effects.

“Less likely, but not inconceivable, is an effect stemming from some unconventional process. Perhaps, just as the act of observation has been suggested to affect quantum measurements, scientific observation could subtly change some scientific effects. Although the laws of reality are usually understood to be immutable, some physicists, including Paul Davies, director of the BEYOND: Center for Fundamental Concepts in Science at Arizona State University in Tempe, have observed that this should be considered an assumption, not a foregone conclusion.”

Researchers who are willing to believe in time-reversed causality are probably also open to the idea that the process of detecting these processes is subject to quantum effects that lead to a decline in the effect size after attempts to measure it. They may consider the present findings of decline effects within Bem’s experiment a plausible explanation for replication failures. If a researcher collects too many data, the weak effects in the later trials wash out the strong effects during the initial trials. Moreover, quantum effect may not be observable all the time. Thus, sometimes initial trials will also not show the effect.

I have little hope that my analyses of Bem’s data will convince Bem or other parapsychologists to doubt supernatural phenomena. However, the analysis provides skeptics with rational and scientific arguments to dismiss Bem’s findings as empirical evidence that requires a supernatural explanation. Bad research practices are sufficient to explain why Bem obtained statistically significant results that could not be replicated in honest and unbiased replication attempts.

Discussion

Bem’s 2011 article “Feeling the Future” has had a profound effect on social psychology. Rather than revealing a supernatural phenomenon, the article demonstrated fundamental flaws in the way social psychologists conducted and reported empirical studies. Seven years later, awareness of bad research practices is widespread and new journal editors are implementing reforms in the evaluation of manuscripts. New statistical tools have been developed to detect practices that produce significant results by capitalizing on chance. It is unlikely that Bem’s article would be accepted for publication these days.

The past seven years have also revealed that Bem’s article is not an exception. The only difference is that the results contradicted researchers’ a priori beliefs, whereas other studies with even more questionable evidence were not scrutinized because the claims were consistent with researchers a priori beliefs (e.g., the glucose theory of will-power; cf. Schimmack, 2012).

The ability to analyze the original data of Bem’s studies offered a unique opportunity to examine how social psychologists deceived themselves and others into believing that they tested theories of human behavior when they were merely confirming their own beliefs, even if these beliefs defied basic principles of causality. The main problem appears to be a practice to peek at results in small samples with different procedures and to attribute differences in results to the experimental procedures, while ignoring the influence of sampling error.

Conceptual Replications and Hidden Moderators

In response to the crisis of confidence about social psychology, social psychologists have introduced the distinction between conceptual and exact replications and the hidden moderator hypothesis. The distinction between conceptual and exact replications is important because exact replications make a clear prediction about the outcome. If a theory is correct and an original study produced a result that is predicted by the theory, then an exact replication of the original study should also produce a significant result. At least, exact replications should be successful more often than fail (Tversky and Kahneman, 1971).

Social psychologists also realize that not reporting the outcome of failed exact replications distorts the evidence and that this practice violates research ethics (Bem, 2000).

The concept of a conceptual replication provides the opportunity to dismiss studies that fail to support a prediction by attributing the failure to a change in the experimental procedure, even if it is not clear, why a small change in the experimental procedure would produce a different result. These unexplained factors that seemingly produced a success in one study and a failure in the other studies are called hidden moderator.

Social psychologists have convinced themselves that many of the phenomena that they study are sensitive to minute changes in experimental protocols (Bem, 2011). This belief sustains beliefs in a theory despite many failures to obtain evidence for a predicted effect and justifies not reporting disconfirming evidence.

The sensitivity of social psychological effects to small changes in experimental procedures also justifies that it is necessary to conduct many studies that are expected to fail, just like medieval alchemists expected many failures in their attempts to make gold. These failures are not important. They are simply needed to find the conditions that produce the desired outcome; a significant result that supports researchers’ predictions.

The attribution of failures to hidden moderators is the ultimate attribution error of social psychologists. It makes them conduct study after study in the search for a predicted outcome without realizing that a few successes among many failures are expected simply due to chance alone. To avoid realizing the fragility of these successes, they never repeat the same study twice. The ultimate attribution error has enabled social psychologist to deceive themselves and others for decades.

Since Bem’s 2011 article was published, it has become apparent that many social psychological articles report results that fail to provide credible evidence for theoretical claims because they do not report results from an unknown number of failed attempts. The consequences of this inconvenient realization are difficult to exaggerate. Entire textbooks covering decades of research will have to be rewritten.

P-Hacking

Another important article for the replication crisis in psychology examined the probability that questionable research practices can produce false positive results (Simmons, Nelson, & Simonsohn, 2011). The article presents simulation studies that examine the actual risk of a type-I error when questionable research practices are used. They find that a single questionable practice can increase the chances of obtaining a false positive result from the nominal 5% to 12.6%. A combination of four questionable research practices increased the risk to 60.7%. The massive use of questionable research practices is called p-hacking. P-hacking may work for a single study, if a researcher is lucky. But it is very unlikely that a researcher can p-hack a series of 9 studies to produce 9 false positive results, (p = .6⁹= 1%).

The analysis of Bem’s data suggest that a perfect multiple-study article requires omitting failed studies from the record, and hiding disconfirming evidence violates basic standards of research ethics. If there is a known moderator, the non-significant results provide important information about boundary conditions (time-reversed causality works with erotic pictures, but not with pictures of puppies). If the moderator is not known, it is still important to report this finding to plan future studies. There is simply no justification for excluding non-significant results from a series of studies that are reported in a single article.

To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results. This overall effect is functionally equivalent to the test of the hypothesis in a single study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results (Schimmack, 2012, p. 563).

Thus, a simple way to improve the credibility of psychological science is to demand that researchers submit all studies that tested relevant hypotheses for publication and to consider selection of significant results scientific misconduct. Ironically, publishing failed studies will provide stronger evidence than seemingly flawless results that were obtained by omitting nonsignificant results. Moreover, allowing for the publication of non-significant results reduces the pressure to use p-hacking, which only serves the goal to obtain significant results in all studies.

Should the Journal of Personality and Social Psychology Retract Bem’s Article?

Journals have a high threshold for retractions. Typically, articles are retracted only if there are doubts about the integrity of the published data. If data were manipulated by fabricating them entirely or by swapping participants from one condition to another to exaggerate mean differences, articles are retracted. In contrast, if researchers collected data and selectively reported only successful studies, articles are not retracted. The selective publishing of significant results is so widespread that it seems inconceivable to retract every article that used this questionable research practice. Francis (2014) estimated that at least 80% of articles published in the flagship journal Psychological Science would have to be retracted (Francis, 2014). This seems excessive.

However, Bem’s article is unique in many ways, and the new analyses of original data presented here suggest that bad research practices, inadvertently or not, produced Bem’s results. Moreover, the results could not be replicated in other studies. Retracting the article would send a clear signal to the scientific community and other stakeholders in psychological science that psychologists are serious about learning from mistakes by flagging the results reported in Bem as erroneous. Unless the article is retracted, uniformed researchers will continue to cite the article as evidence for supernatural phenomena like time-reversed causality.

“Experimentally, such precognitive effects have manifested themselves in a variety of ways. … as well as precognitive priming, where behaviour can be influenced by primes that are shown after the target stimulus has been seen (e.g. Bem, 2011; Vernon, 2015).” (Vernon, 2017, p. 217).

Vernon (2017) does cite failed replication studies, but interprets these failures as evidence for some hidden moderator that could explain inconsistent findings that require further investigation. A retraction would make it clear that there are no inconsistent findings because Bem’s findings do not provide credible evidence for the effect. Thus, it is unnecessary and maybe unethical to recruit human participants to further replication studies of Bem’s paradigms.

This does not mean that future research on paranormal phenomena should be banned. However, future studies cannot be based on Bem’s paradigms or results to plan future studies. For example, Vernon (2017) studied a small sample of 107 participants, which would be sufficient based on Bem’s effect sizes, but these effect sizes are not trustworthy and cannot be used to plan future studies.

A main objection to retractions is that Bem’s study made an inadvertent important contribution to the history of social psychology that triggered a method revolution and changes in the way social psychologist conduct research. Such an important article needs to remain part of the scientific record and needs to be cited in meta-psychological articles that reflect on research practices. However, a retraction does not eradicate a published article. Retracted articles remain available and can be cited (RetractionWatch, 2018). Thus, it is possible to retract an article without removing it from the scientific record. A retraction would signal clearly that the article should not be cited as evidence for time-reversed causality and that the studies should not be included in meta-analyses because the bias in Bem’s studies also biases all meta-analytic findings that include Bem’s studies (Bem, Ressoldi, Rabeyron, & Duggan (2015).

[edited January, 8, 2018]

It is not clear how Bem (2011) thinks about his article these days, but one quote in Enbger’s article suggests that Bem realizes now that he provided false evidence for a phenomenon that does not exist.

When Bem started investigating ESP, he realized the details of his research methods would be scrutinized with far more care than they had been before. In the years since his work was published, those higher standards have increasingly applied to a broad range of research, not just studies of the paranormal. “I get more credit for having started the revolution in questioning mainstream psychological methods than I deserve,” Bem told me. “I was in the right place at the right time. The groundwork was already pre-prepared, and I just made it all startlingly clear.”

If Bem wants credit for making it startlingly clear that his evidence was obtained with questionable research practices that can mislead researchers and readers, he should make it startlingly clear that this was the case by retracting the article.

REFERENCES

Alcock, J. E. (2011). Back from the future: Parapsychology and the Bem affair. Skeptical Inquirer, 35(2). Retrieved from http://www.csicop.org/specialarticles/show/back_from_the_future

Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524

Bem, D.J., Tressoldi, P., Rabeyron, T. & Duggan, M. (2015) Feeling the future: A meta-analysis of 90 experiments on the anomalous anticipation of random future events, F1000 Research, 4, 1–33.

Engber, D. (2017). Daryl Bem proved ESP Is real: Which means science is broken. https://slate.com/health-and-science/2017/06/daryl-bem-proved-esp-is-real-showed-science-is-broken.html

Francis, G. (2012). Too good to be true: Publication bias in two prominent
studies from experimental psychology. Psychonomic Bulletin & Review,
19, 151–156. doi:10.3758/s13423-012-0227-9

Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187.

Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345, Issue 6203, 502-1505, DOI: 10.1126/science.1255484

Galak, J., Leboeuf, R.A., Nelson, L. D., & Simmons, J.P. (2012). Journal of Personality and Social Psychology, 103, 933-948, doi: 10.1037/a0029709.

John, L. K. Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532. DOI: 10.1177/0956797611430953

RetractionWatch (2018). Ask retraction watch: Is it OK to cite a retracted paper? http://retractionwatch.com/2018/01/05/ask-retraction-watch-ok-cite-retracted-paper/

Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 2012, 17, 551–566.

Schimmack, U. (2015). The Test of Insufficient Variance: A New Tool for the Detection of Questionable Research Practices. https://replicationindex.com/2014/12/30/the-test-of-insufficient-variance-tiva-a-new-tool-for-the-detection-of-questionable-research-practices/

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. doi:10.1177/0956797611417632

Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences
of premature and repeated null hypothesis testing. Behavior
Research Methods, 38, 24–27. doi:10.3758/BF03192746

Schooler, J. (2011). Unpublished results hide the decline effect. Nature, 470, 437.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/2282137

Tversky, A. & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105-110.

Hidden Figures: Replication Failures in the Stereotype Threat Literature

April 7, 2017Publication Bias, r-index, Replicability, Replication, Research Integrity, Statistical Power, TIVAStereotype-Threat, Test of Insufficient Variance, TIVAUlrich Schimmack

In the past five years, it has become apparent that many classic and important findings in social psychology fail to replicate (Schimmack, 2016). The replication crisis is often considered a new phenomenon, but failed replications are not entirely new. Sometimes these studies have simply been ignored. These studies deserve more attention and need to be reevaluated in the context of the replication crisis in social psychology.

In the past, failed replications were often dismissed because seminal articles were assumed to provide robust empirical support for a phenomenon, especially if an article presented multiple studies. The chance of reporting a false positive results in a multiple study article is low because the risk of a false positive decreases exponentially (Schimmack, 2012). However, the low risk of a false positive is illusory if authors only publish studies that worked. In this case, even false positives can be supported by significant results in multiple studies, as demonstrated in the infamous ESP study by Bem (2011). As a result, publication bias undermines the reporting of statistical significance as diagnostic information about the risk of false positives (Sterling, 1959) and many important theories in social psychology rest on shaky empirical foundations that need to be reexamined.

Research on stereotype threat and women’s performance on math tests is one example where publication bias undermines the findings in a seminal study that produced a large literature of studies on gender differences in math performance. After correcting for publication bias, this literature shows very little evidence that stereotype threat has a notable and practically significant effect on women’s math performance (Flore & Wicherts, 2014).

Another important line of research has examined the contribution of stereotype threat to differences between racial groups on academic performance tests. This blog post examines the strength of the empirical evidence for stereotype threat effects in the seminal article by Steele and Aronson (1995). This article is currently the 12th most cited article in the top journal for social psychology, Journal of Personality and Social Psychology (2,278 citations so far).

According to the abstract, “stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one’s group.” Studies 1 and 2 showed that “reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled).” “Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it.” “Study 4 showed that mere salience of the stereotype could impair Blacks’ performance even when the test was not
ability diagnostic.”

The results of Study 4 motivated Stricker and colleagues to examine the influence of stereotype-treat on test performance in a real-world testing situation. These studies had large samples and were not limited to students at Stanford. One study was reported in a College Board Report (Stricker and Ward, 1998). Another two studies were published in the Journal of Applied Social Psychology (Stricker & Ward, 2004). This article received only 52 citations, although it reported two studies with an experimental manipulation of stereotype threat in a real assessment context. One group of participants were asked about their gender or ethnicity before the text, the other group did not receive these questions. As noted in the abstract, neither the inquiry about race, nor about gender, had a significant effect on test performance. In short, this study failed to replicate Study 4 of the classic and widely cited article by Steele and Aronson.

Stricker and Ward’s Abstract
Steele and Aronson (1995) found that the performance of Black research participants on
ability test items portrayed as a problem-solving task, in laboratory experiments, was affected adversely when they were asked about their ethnicity. This outcome was attributed to stereotype threat: Performance was disrupted by participants’ concerns about fulfilling the negative stereotype concerning Black people’s intellectual ability. The present field experiments extended that research to other ethnic groups and to males and females taking operational tests. The experiments evaluated the effects of inquiring about ethnicity and gender on the performance of students taking 2 standardized tests-the Advanced Placement Calculus AB Examination, and the Computerized Placement Tests-in actual test administrations. This inquiry did not have any effects on the test performance of Black, female, or other subgroups of students that were both statistically and practically significant.

The article also mentions a personal communication with Steele, in which Steele mentions an unpublished study that also failed to demonstrate the effect under similar conditions.

“In fact, Steele found in an unpublished pilot study that inquiring about ethnicity did not affect Black participants’ performance when the task was described as diagnostic of their ability (C. M. Steele, personal communication, May 2 1, 1997), in contrast to the
substantial effect of inquiring when the task was described as nondiagnostic.”

A substantive interpretation of this finding is that inquires about race or gender do not produce stereotype threat effects when a test is diagnostic because a diagnostic test already activates stereotype threat. However, if this were a real moderator, it would be important to document this fact and it is not clear why this finding obtained in an earlier study by Steele remained unpublished. Moreover, it is premature to interpret the significant result in the published study with a non-diagnostic task and the non-significant result in an unpublished study with a diagnostic task as evidence that diagnosticity moderates the effect of the stereotype-threat manipulation. A proper test of this moderator hypothesis would require the demonstration of a three-way interaction between race, inquiry about race, and diagnosticity. Absent this evidence, it remains possible that diagnosticity is not a moderator and that the published result is a false positive (or a positive result with an inflated effect size estimate). In contrast, there appears to be consistent evidence that inquiries about race or gender before a real assessment of academic performance does not influence performance. This finding is not widely publicized, but is important for a better understanding of performance differences in real world settings.

The best way to examine the replicability of Steele and Aronson’s seminal finding with non-diagnostic tasks would be to conduct an exact replication study. However, exact replication studies are difficult and costly. An alternative is to examine the robustness of the published results by taking a closer look at the strength of the statistical results reported by Steele and Aronson, using modern statistical tests of publication bias and statistical power like the R-Index (Schimmack, 2014) and the Test of Insufficient Variance (TIVA, Schimmack, 2014).

Replicability Analysis of Steele and Aronson’s four studies

Study 1. The first study had a relatively large sample of N = 114 participants, but it is not clear how many of the participants were White or Black. The study also had a 2 x 3 design, which leaves less than 20 participants per condition. The study produced a significant main effect of condition, F(2, 107) = 4.74, and race, F(1,107) = 5.22, but the critical condition x race interaction was not significant (reported as p > .19). However, a specific contrast showed significant differences between Black participants in the diagnostic condition and the non-diagnostic condition, t(107) = 2.88, p = .005, z = 2.82. The authors concluded “in sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance” (p. 800). In other words, Study 1 provided only weak support for the stereotype threat hypothesis.

Study 2. Study 2 eliminated one of the three experimental conditions. Participants were 20 Black and 20 White participants. This means there were only 10 participants in each condition of a 2 x 2 design. The degrees of freedom further indicate that the actual sample size was only 38 participants. Given the weak evidence in Study 1, there is no justification for a reduction in the number of participants per cell, although the difficulty of recruiting Black participants at Stanford may explain this inadequate sample size. Nevertheless, the study showed a significant interaction between race and test description, F(1,35) = 8.07, p = .007. The study also replicated the contrast from Study 1 that Black participants in the diagnostic condition performed significantly worse than Black participants in the non-diagnostic group, t(35) = 2.38, p = .023, z = 2.28.

Studies 1 and 2 are close replications of each other. The consistent finding across the two studies that supports stereotype-treat theory is the finding that merely changing the description of an assessment task changes Black participants performance, as revealed by significant differences between the diagnostic and non-diagnostic condition in both studies. The problem is that both studies had small numbers of Black participants and that small samples have low power to produce significant results. As a result, it is unlikely that a pair of studies would produce significant results in both studies.

Observed power in the two studies is .81 and .62 with median observed power of .71. Thus, the actual success rate of 100% (2 out of 2 significant results) is 29 percentage points higher than the expected success rate. Moreover, when inflation is evident, median observed power is also inflated. To correct for this inflation, the Replicability-Index (R-Index) subtracts inflation from median observed power, which yields an R-Index of 42. Any value below 50 is considered unacceptably low and I give it a letter grade F, just like students at American Universities receive an F for exams with less than 50% correct answers. This does not mean that stereotype threat is not a valid theory or that there was no real effect in this pair of studies. It simply means that the evidence in this highly cited article is insufficient to make strong claims about the causes of Black’s performance on academic tests.

The Test of Insufficient Variance (TIVA) provides another way to examine published results. Test statistics like t-values vary considerably from study to study even if the exact same study is conducted twice (or if one larger sample is randomly split into two sub-samples). When test-statistics are converted into z-scores, sampling error (the random variability from sample to sample) follows approximately a standard normal distribution with a variance of 1. If the variance is considerably smaller than 1, it suggests that the reported results represent a selected sample. Often the selection is a result of publication bias. Applying TIVA to the pair of studies, yields a variance of Var(z) = 0.15. As there are only two studies, it is possible that this outcome occurred by chance, p = .300, and it does not imply intentional selection for significance or other questionable research practices. Nevertheless, it suggests that future replication studies will be more variable and produce some non-significant results.

In conclusion, the evidence presented in the first two studies is weaker than we might assume if we focused only on the fact that both studies produced significant contrasts. Given publication bias, the fact that both studies reported significant results provides no empirical evidence because virtually all published studies report significant results. The R-Index quantifies the strength of evidence for an effect while taking the influence of publication bias into account and it shows that the two studies with small samples provide only weak evidence for an effect.

Study 3. This study did not examine performance. The aim was to demonstrate activation of stereotype threat with a sentence completion task. The sample size of 68 participants (35 Black, 33 White) implied that only 11 or 12 participants were assigned to one of the six cells in a 2 (race) by 3 (task description) design. The study produced main effects for race and condition, but most importantly it produced a significant interaction effect, F(2,61) = 3.30, p = .044. In addition, Black participants in the diagnostic condition had more stereotype-related associations than Black participants in the non-diagnostic condition, t(61) = 3.53,

Study 4. This study used inquiry about race to induce stereotype-threat. Importantly, the task was described as non-diagnostic (as noted earlier, a similar study produced no significant results when the task was described as diagnostic). The design was a 2 x 2 design with 47 participants, which means only 11 or 12 participants were allocated to the four conditions. The degrees of freedom indicated that cell frequencies were even lower. The study produced a significant interaction effect, F(1,39) = 7.82, p = .008. The study also produced a significant contrast between Blacks in the race-prime condition and the no-prime condition, t(39) = 2.43, p = .020.

The contrast effect in Study 3 is strong, but it is not a performance measure. If stereotype threat mediates the effect of task characteristics and performance, we would expect a stronger effect on the measure of the mediator than on the actual outcome of interest, task performance. The key aim of stereotype threat theory is to explain differences in performance. With a focus on performance outcomes, it is possible to examine the R-Index and TIVA of Studies 1, 2, and 4. All three studies reported significant contrasts between Black students randomly assigned to two groups that were expected to show performance differences (Table 1).

Table 1

Study	Test Statistic	p-value	z-score	obs.pow
Study 1	t(107) = 2.88	0.005	2.82	0.81
Study 2	t(35)=2.38	0.023	2.28	0.62
Study 4	t(39) = 2.43	0.020	2.33	0.64

Median observed power is 64 and the R-Index is well below 50, 64 – 36 = 28 (F). The variance in z-scores is Var(z) = 0.09, p = .086. These results cast doubt about the replicability of the performance effects reported in Steele and Aronson’s seminal stereotype threat article.

Conclusion

Racial stereotypes and racial disparities are an important social issue. Social psychology aims and promises to contribute to the understanding of this issue by conducting objective, scientific studies that can inform our understanding of these issues. In order to live up to these expectations, social psychology has to follow the rules of science and listen to the data. Just like it is important to get the numbers right to send men and women into space (and bring them back), it is important to get the numbers right when we use science to understand women and men on earth. Unfortunately, social psychologists have not followed the examples of astronomers and the numbers do not add up.

The three African American women, features in this years movie “Hidden Figures”***, Katherine Johnson, Dorothy Vaughan, and Mary Jackson might not approve of the casual way social psychologists use numbers in their research, especially the wide-spread practice of hiding numbers that do not match expectations. No science that wants to make a real-world contribution can condone this practice. It is also not acceptable to simply ignore published results from well-conducted studies with large samples that challenge a prominent theory.

Surely, the movie Hidden Figures dramatized some of the experiences of Black women at NASA, but there is little doubt that Katherine Johnson, Dorothy Vaughan, and Mary Jackson encountered many obstacles that might be considered stereotype threatening situations. Yet, they prevailed and they paved the way for future generations of stereotyped groups. Understanding racial and gender bias and performance differences remains an important issue and that is the reason why it is important to shed a light on hidden numbers and put simplistic theories under the microscope. Stereotype threat is too often used as a simple explanation that avoids tackling deeper and more difficult issues that cannot be easily studied in a quick laboratory experiment with undergraduate students at top research universities. It is time for social psychologists to live up to its promises by tackling real world issues with research designs that have real world significance that produce real evidence using open and transparent research practices.

————————————————————————————————————————————

*** If you haven’t seen the movie, I highly recommend it.