Estimating the Replicability of Results in 'Journal of Experimental Psychology: Learning, Memory, and Cognition'

The “Journal of Experimental Psychology” is the oldest journal of the American Psychological Association (Benjamin, 2019). As psychology grew, the journal was split into distinct journals for different areas of experimental psychology. In 2019, the “Journal of Experimental Psychology: Learning, Memory, and Cognition” (JEP-LMC) published its 45th volume. In the same year, Aaron S. Benjamin took over as editor of JEP-LMC.

His editorial promises changes in publication practices in response to the so-called replication crisis in psychology. Concerns about the replicability of psychological findings were raised by the Open Science Collaboration (OSC, 2015), which replicated 100 studies from three journals, including JEP-LMC. For JEP-LMC, only 50% of the published results produced a significant result in the replication attempts.

Benjamin (2019) hopes that changes in publication policies will raise this replication rate. He is also hopeful that this can be achieved with minor changes, suggesting that research practices in cognitive psychology are not as questionable as those in social psychology, where the replication rate was only 25%.

Aside from the OSC investigation, relatively little is known about the research practices of cognitive psychologists and the replicability of their findings. The reason is that systematic examinations of replicability are difficult to carry out. It would take a major effort to repeat the replication project to see whether the replicability of cognitive psychology has already changed or will change in response to Benjamin’s initiatives. Without fast and credible indicators, editors are practically flying blind and can only hope for the best.

My colleagues and I developed a statistical method, called z-curve, to provide fast and representative information about research practices and replicability (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Z-curve uses the actual test-statistics (t-values, F-values) of significant results to estimate the expected replication rate (ERR) and the expected discovery rate (EDR) of published results. The replication rate focuses on published significant results. It estimates how many of these results would be significant again if the original studies were replicated exactly with the same sample sizes. The discovery rate is the rate of significant results among all statistical tests that researchers conducted to produce their publications. Without publication bias, this rate would simply be the percentage of significant results reported in articles. With publication bias, however, the observed discovery rate (ODR) is inflated. Z-curve estimates the actual discovery rate on the basis of the distribution of the significant results alone. A comparison of the ODR and the EDR therefore provides information about the presence of publication bias or selection for significance.

To provide this valuable information for JEP-LMC, I downloaded all articles from 2000 to 2019 and automatically extracted all test-statistics (t-values, F-values). These test-statistics are first converted into two-sided p-values, which are then converted into absolute z-scores. Higher z-scores provide stronger evidence against the null-hypothesis. Figure 1 shows the results for the 53,975 test-statistics published from 2000 to 2019.
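For readers who want to see the conversion step concretely, the following R sketch shows how t-values and F-values can be turned into two-sided p-values and then into absolute z-scores. The example inputs and the commented call to the CRAN zcurve package are illustrative assumptions, not the exact code used for the analyses reported here.

```r
# A minimal sketch of the conversion step (hypothetical inputs; not the original extraction code).
t_to_z <- function(t, df) {
  p <- 2 * pt(abs(t), df = df, lower.tail = FALSE)  # two-sided p-value
  qnorm(p / 2, lower.tail = FALSE)                  # absolute z-score
}
f_to_z <- function(f, df1, df2) {
  p <- pf(f, df1 = df1, df2 = df2, lower.tail = FALSE)  # p-value of the F-test
  qnorm(p / 2, lower.tail = FALSE)
}

z <- c(t_to_z(2.10, df = 28), f_to_z(4.50, df1 = 1, df2 = 40))
z
# If the CRAN 'zcurve' package is installed, a call along the lines of
# zcurve::zcurve(z) should fit the model to the significant z-scores
# (this call is an assumption about the package interface, not code used for the figures).
```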

Visual inspection shows a cliff at z = 1.96, which corresponds to a p-value of .05 (two-sided). This finding suggests that non-significant results are missing. A formal test of publication bias compares the observed discovery rate (ODR) of 68%, 95%CI = 67% to 68%, with the expected discovery rate (EDR) of 44%, 95%CI = 33% to 60%. The confidence intervals do not overlap, indicating that the discrepancy is not just a chance finding. Thus, there is clear evidence that questionable practices inflate the percentage of significant results published in JEP-LMC.

The expected replication rate (ERR) is high at 79%, 95%CI = 75% to 82%. This estimate is considerably higher than the actual success rate of 50% in the replication studies (OSC, 2015). There are several reasons for this discrepancy. Automatic extraction does not distinguish focal and non-focal hypothesis tests, and focal hypothesis tests are riskier and tend to produce weaker evidence. Estimates for the replicability of results with p-values between .05 and .01 (~ z = 2 to 2.5) show a replicability of only 55% (see the figures below the x-axis). Another reason for the discrepancy is that replication studies are rarely exact, even in cognitive psychology. When unknown moderating factors produce heterogeneity, the ERR overestimates actual replicability, and in the worst-case scenario the success rate matches the EDR. The 95%CI of the EDR does include 50%. Thus, editors are well advised to focus on the EDR as an indicator of improvement.

Z-curve also provides information about the risk that JEP-LMC publishes mostly false positive results (Benjamin, 2019). Although it is impossible to quantify the rate of true null-hypotheses, it is possible to use the EDR to estimate the maximum rate of false discoveries (Bartos & Schimmack, 2020; Soric, 1989). The Soric false discovery rate (FDR) is only 7%, and even the upper limit of the 95%CI is only 11%. Thus, the results provide no evidence for the claim that most published results are false positives. Power estimates for z-scores between 1 and 2 suggest instead that many non-significant results are false negatives due to low statistical power. This has important implications for the interpretation of interaction effects. Interaction effects rarely show that an effect is present in one condition and absent in another condition. Most often they merely show that effects are stronger in one condition than in another, even if the weaker effect is not statistically significant.
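To make this calculation transparent, here is a small R sketch of Soric's (1989) bound, which converts a discovery rate into the maximum false discovery rate under the assumption that true effects are detected with 100% power. The inputs are the EDR point estimate and its lower confidence limit reported above.

```r
# Soric's (1989) upper bound on the false discovery rate:
# FDR_max = (1 / DR - 1) * alpha / (1 - alpha)
soric_fdr <- function(discovery_rate, alpha = 0.05) {
  (1 / discovery_rate - 1) * alpha / (1 - alpha)
}
soric_fdr(0.44)  # EDR point estimate  -> ~0.07 (7%)
soric_fdr(0.33)  # lower limit of EDR  -> ~0.11 (11%)
```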

The presence of questionable practices that inflate the discovery rate affects mostly just-significant results. One way to deal with this problem is to require stronger evidence before claiming statistical significance. Just as corrections for multiple comparisons are needed when many tests are reported, it is necessary to control for unreported tests that inflate the type-I error risk. Following other recommendations, I suggest using p = .005 as a more stringent criterion for rejecting the null-hypothesis in order to contain the false positive risk at 5%. Figure 2 shows the results when only results that meet this criterion (z > 2.8) are fitted to the model.
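The z-value cutoff follows directly from the p-value criterion; a one-line check in R (a worked illustration, not code from the analysis):

```r
qnorm(1 - 0.005 / 2)  # a two-sided p-value of .005 corresponds to |z| of about 2.81
```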

The ERR increases to 91% and the EDR increases to 86%. Even for z-scores from 3 to 3.5, the ERR is 82%. Thus, most of these results are expected to replicate. For the future, this means authors should either demonstrate that they did not use questionable research practices (QRPs) by preregistering and following a design and data-analysis plan, or they should use the more conservative criterion of p = .005 to claim significance with alpha = .05.

Published results with p-values between .05 and .005 should be considered questionable evidence. If multiple studies are available, meta-analyses that take publication bias into account can be used to examine whether these results are robust. If these results are very important, they should be subjected to a rigorous replication attempt in studies with larger samples that increase power.

The next figure examines whether research practices and replicability have changed over time. For this purpose, I computed the ERR (solid lines) and the EDR (dotted lines) for results significant at .05 (black) and results significant at .005 (grey) separately for each year.

Figure 3 shows high ERR and EDR estimates with p < .005 as the significance criterion (grey). The slight negative trends are not statistically significant, ERR: b = -.003, se = .0017; EDR: b = -.005, se = .0027. With p < .05 as the criterion, the ERR is also high, but significantly decreasing, b = -.0025, se = .001. The EDR estimates are much more variable because they depend on the number of test-statistics that are just significant. The trend over time is negative, but not statistically significant, b = -.005, se = .006. Overall, these results do not show any changes in response to the replication crisis. Hopefully, the initiatives of the new editor will reduce the use of questionable practices and increase power. Raising the EDR for all results with p < .05 to 80% can be achieved with relatively little extra effort. Ironically, a simple way to do so is to publish fewer studies in a single article (Schimmack, 2012). Rather than reporting 8 studies with 25 participants each, results are more credible if they replicate across 4 studies with 50 participants each. Thus, without additional resources, it is possible to make results in JEP-LMC more credible and to reduce the need to use questionable practices to move p-values below .05.
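The trend tests above are simple linear regressions of the yearly estimates on publication year. A minimal R sketch, assuming the yearly z-curve estimates are stored in a data frame with hypothetical columns year, err, and edr (the values below are placeholders, not the actual estimates):

```r
# Hypothetical yearly z-curve estimates (proportions between 0 and 1).
yearly <- data.frame(
  year = 2000:2019,
  err  = runif(20, 0.70, 0.85),  # placeholder values for illustration
  edr  = runif(20, 0.30, 0.60)
)
# Slope (b) and standard error (se) of the linear time trend, as reported above.
summary(lm(err ~ year, data = yearly))$coefficients["year", c("Estimate", "Std. Error")]
summary(lm(edr ~ year, data = yearly))$coefficients["year", c("Estimate", "Std. Error")]
```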

In conclusion, this blog post provides much-needed objective, quantitative (meta-scientific) evidence about research practices and the replicability of results in JEP-LMC. The results provide no evidence of a replication crisis in JEP-LMC, and fears that most published results are false positives are not based on empirical facts. Most results published in JEP-LMC are likely to be true, and many of the replication failures in the OSC replication attempts were probably false negatives due to low power in both the original and the replication studies. Nevertheless, low power and false negatives are a problem because inconsistent results produce confusion. Given their powerful within-subject designs, cognitive researchers can easily increase power by increasing sample sizes from the typical N = 20 to N = 40. They can do so by reducing the number of internal replication studies within an article or by investing more resources in meaningful conceptual replication studies. Better planning and justification of sample sizes is one of the initiatives in Benjamin’s editorial. Z-curve makes it possible to examine whether this initiative finally increases the power of studies in psychology, which has not been the case since Cohen (1962) warned about low power in psychological experiments. Maybe 60 years later, in 2022, we will see an increase in power in JEP-LMC.

Estimating the Replicability of Results in 'European Journal of Social Psychology'

Over the past decade, questions have been raised about research practices in psychology and the replicability of published results. The focus has been mostly on research practices in social psychology. A major replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). This finding has produced conflicting responses, ranging from blaming the replication project for the low success rate to claims that most results in social psychology are false positives.

Social psychology journals have responded to concerns about the replicability of social psychology with promises to improve the reporting of results. The European Journal of Social Psychology (EJSP) is no exception. In 2015, the incoming editors Radmila Prislin and Vivian L. Vignoles wrote:

“we believe that scientific progress requires careful adherence to the highest standards of integrity and methodological rigour. In this regard, we welcome recent initiatives to improve the trustworthiness of research in social and personality psychology”

In 2018, the new editorial team, Roland Imhoff, Joanne Smith, and Martijn van Zomeren, addressed concerns about questionable research practices more directly:

“opening up also implies being considerate of empirical imperfections that would otherwise remain hidden from view. This means that we require authors to provide a transparent description of the research process in their articles (e.g., report all measures, manipulations, and exclusions from samples, if any; e.g., Simmons, Nelson, & Simonsohn, 2011). We thus encourage authors to accurately report about the inclusion of failed studies and imperfect patterns (e.g., p-values not meeting the .05 threshold), but this also has to mean that disclosing such imperfections, all else being equal, should not affect the likelihood of acceptance.”

This blog post uses the test-statistics published in EJSP to examine whether the research practices of authors who publish in EJSP have changed in response to the low replicability of results in social psychology. To do so, I downloaded articles from 2000 to 2019 and automatically extracted test-statistics (t-values, F-values). I then converted these test-statistics into two-sided p-values and then into absolute z-scores. Higher z-scores provide stronger evidence against the null-hypothesis. These z-scores are then analyzed using z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Figure 1 shows the z-curve plot for all 27,223 test statistics.

Visual inspection shows a cliff at z = 1.96, which corresponds to a p-value of .05, two-sided. The grey curve shows the expected distribution based on the published significant results. The z-curve model predicts many more non-significant results than are actually reported, especially below a value of 1.65, which represents the implicit criterion for marginal significance, p = .05, one-sided.

A formal test of selective reporting of significant results compares the observed discovery rate with the expected discovery rate. The observed discovery rate (ODR) is the percentage of reported results that are significant. The expected discovery rate (EDR) is the percentage of significant results that is expected given the z-curve model. The ODR is 72%, 95%CI = 72% to 73%. This is much higher than the EDR of 26%, 95%CI = 19% to 40%. Thus, there is clear evidence of selective reporting of significant results.

Z-curve also provides an estimate of the expected replication rate; that is, if the studies were replicated exactly, how many of the significant original results would be significant again in the exact replication studies. The estimate is 70%, 95%CI = 65% to 73%. This is not a bad replication rate, but the problem is that it requires exact replications, which are difficult if not impossible to do in social psychology. Bartos and Schimmack (2020) found that the EDR is a better predictor of the outcomes of conceptual replication studies. The estimate of 26% is consistent with the low replication rate in the replication project (Open Science Collaboration, 2015).

Fortunately, it is not necessary to dismiss all published results in EJSP. Questionable practices are more likely to produce just-significant results. It is therefore possible to focus on more credible results with a p-value less than .005, which corresponds to a z-score of 2.8. Figure 2 shows the results.

Based on the distribution of z-scores greater than 2.8, the model predicts far fewer just-significant results than are reported. This also suggests that questionable practices were used to produce significant results. Excluding these values boosts the EDR to a satisfactory level of 77%. Thus, even if replication studies are not exact, the model predicts that most replication studies would produce a significant result with alpha = .05 (that is, the significance criterion is not adjusted to the more stringent level of .005).

The following analysis examines whether EJSP editors were successful in increasing the credibility of results published in their journal. For this purpose, I computed the ERR (solid) and the EDR (dotted) using all significant results (black) and excluding questionable results (grey) for each year and plotted the results as a function of year.

The results show no statistically significant trend for any of the four indicators over time. The most important indicator of the use of questionable practices is the EDR for all significant results (black dotted line). The low rates in the last three years show that there have been no major improvements in the publishing culture of EJSP. It is surely easier to write lofty editorials than to actually improve scientific practices. Readers who care about social psychology are advised to ignore p-values greater than .005 because these results may have been produced with questionable practices and are unlikely to replicate. The current editorial team may take these results as a baseline for initiatives to improve the credibility of EJSP in the following years.

Estimating the Replicability of Results in 'Journal of Cross-Cultural Psychology'

I published my first article in the Journal of Cross-Cultural Psychology (JCCP) and have continued to be interested in cross-cultural psychology. My most highly cited article is based on a cross-cultural study with junior psychologists collecting data from the US, Mexico, Japan, Ghana, and Germany. I still remember how hard it was to collect cross-cultural data. Although this has become easier with the advent of the Internet, cross-cultural research is still challenging. For example, demonstrating that an experimental effect is moderated by culture requires large samples to have adequate power.

Over the past decade, questions have been raised about research practices in psychology and the replicability of published results. The focus has been mostly on social and cognitive psychological research in the United States and other Western countries. A major replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). Given the additional challenges in cross-cultural research, it is possible that the replication rate of results published in JCCP is even lower. Milfont and Klein (2018) discuss the implications of the replication crisis in social psychology for cross-cultural psychology, but their article focuses on the challenges of conducting replication studies in cross-cultural psychology. The aim of this blog post is to examine the replicability of original results published in JCCP. A replicability analysis of original results is useful and necessary because it is impossible to replicate most original studies. Thus, cross-cultural psychology benefits more from ensuring that original results are trustworthy and replicable than from mistrusting all original results until they have been replicated.

To examine the credibility of original results, my colleagues and I have developed a statistical tool, z-curve, that makes it possible to estimate the replication rate and the discovery rate in a set of original studies on the basis of the published test-statistics (t-values, F-values) (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020).

To apply z-curve to results published in JCCP, I downloaded all articles from 2000 to 2019 and automatically extracted all test-statistics (t-values, F-values) that were reported in the articles. Figure 1 shows the results for 10,210 test-statistics.

Figure 1 shows a histogram of the test-statistics after they are converted into two-sided p-values and then into absolute z-scores. Higher z-scores show stronger evidence against the null-hypothesis that there is no effect. Visual inspection shows a steep increase in reported test-statistics around a z-score of 1.96, which corresponds to a p-value of .05, two-sided. This finding reflects the pervasive tendency in psychology to omit non-significant results from publications (Sterling, 1959; Sterling et al., 1995).

Quantitative evidence of selection for significance is provided by a comparison of the observed discovery rate (how many reported results are significant) and the expected discovery rate (how many significant results are expected based on the z-curve analysis; grey curve). The observed discovery rate of 75%, 95%CI = 75% to 76%, is significantly higher than the expected discovery rate of 34%, 95%CI = 23% to 52% (the confidence intervals do not overlap).

Z-curve also produces an estimate of the expected replication rate. That is, if studies were replicated exactly with the same sample sizes, 77% of the significant results would be expected to be significant again in the replication attempt. This estimate is reassuring, but there is a caveat: it assumes that studies can be replicated exactly. This is more likely to be the case for simple studies (e.g., a Stroop task with American undergraduates) than for cross-cultural research. When studies are conceptual replications with uncertainty about the essential features that influence effect sizes (e.g., recruiting participants from different areas of a country), the discovery rate is a better predictor of actual replication outcomes (Bartos & Schimmack, 2020). The estimate of 34% is not very reassuring and is closer to the actual replication rate of social psychology studies.

This does not mean that cross-cultural psychologists need to distrust all published results and replicate all previous studies. It is possible to use z-curve to identify results that are more likely to replicate and provide a solid foundation for future research. The reason is that replicability increases with the strength of evidence; larger z-scores are more replicable. Figure 2 excludes just-significant results that may have been obtained with questionable research practices from the z-curve analysis. Although the criterion value is arbitrary, a value of 2.8 corresponds to a p-value of .005, which has been advocated as a better criterion for significance. I believe this is sensible when questionable research practices were used.

The z-curve model now predicts fewer just-significant results than are actually reported. This suggests that questionable practices were used to report significant results. Based on the model, about a third of these just-significant results are questionable, and the percentage for results with p-values of .04 and .03 (z = 2 to 2.2) is 50%. Given the evidence that questionable practices were used, readers should ignore these results unless other studies show stronger evidence for an effect. Replicability for results with z-scores greater than 2.8 is 91%, so these results are likely to replicate. A simple way to address concerns about a replication crisis in cross-cultural psychology is therefore to adjust the significance criterion retroactively and to focus on p-values less than .005.

It is also important to examine how often articles in cross-cultural psychology report false positive results. The maximum rate of false positive results can be estimated from the discovery rate (Soric, 1989). In Figure 2, this estimate is close to zero. Even in Figure 1, where questionable results lower the discovery rate, the estimate is only 20%. Thus, there is no evidence that JCCP published an abundance of false positive results. Rather, the problem is that most hypotheses in cross-cultural research appear to be true hypotheses and that non-significant results are false negatives. This makes sense, as it is unlikely that culture has absolutely no effect, which makes the nil-hypothesis a priori implausible. Thus, cross-cultural researchers need to make riskier predictions about effect sizes and they need to conduct studies with higher power to avoid false negative results.

Figure 3 examines time-trends in cross-cultural research by computing the expected replication rate (ERR, solid) and the expected discovery rate (EDR, dotted) using all significant results (black) and excluding z-scores below 2.8 (grey). Time trends would reveal changes in the research culture of cross-cultural psychologists.

Simple linear regressions showed no significant time-trends for any of the four measures. ERR estimates are high, while EDR estimates for all significant results are low. There is no indication that research practices changed in response to concerns about a replication crisis. Thus, readers should continue to be concerned about just-significant results. Editors and reviewers could improve the trustworthiness of results published in JCCP by asking for pre-registration and by allowing the publication of non-significant results when studies have sufficient power to test a meaningful hypothesis. Results should always be reported with effect sizes and sampling error so that it is possible to examine the range of plausible effect sizes. Significance should not only be evaluated against the nil-hypothesis but also against criteria for a meaningful effect size.

In conclusion, there is no evidence that most published results in JCCP are false positives or that most results published in the journal cannot be replicated. There is, however, evidence that questionable practices are used to publish too many significant results and that non-significant results are often obtained because studies had insufficient power. Concerns about low power are decades old (Cohen, 1962) and haven’t changed research practices in psychology. A trend analysis showed that even the replication crisis in social psychology has not changed research practices in cross-cultural psychology. It is time for cross-cultural psychologists to increase the power of their studies and to report all of their results honestly, even if they do not confirm theoretical predictions. Publishing only results that confirm predictions renders empirical data meaningless.

Estimating the Replicability of Results in 'Infancy'

A common belief is that the first two years of life are the most important years of development (Cohn, 2011). This makes research on infants very important. At the same time, studying infants is difficult. One problem is that it is hard to recruit infants for research. This makes it difficult to reduce sampling error and to obtain replicable results. Noisy data also make it more likely that questionable research practices are used to inflate effect sizes in order to publish, because journals hardly ever publish non-significant results (Peterson, 2016). Even disciplines that are able to recruit larger samples of undergraduate students, like social psychology, have encountered replication failures, and a major replication effort suggested that only a quarter of published results can be replicated (Open Science Collaboration, 2015). This raises concerns about the replicability of results published in Infancy.

Despite much talk about a replication crisis in psychology, infancy researchers seem to be unaware of problems with research practices in psychology. Editorials by Bell (2009), Colombo (2014), and Bremner (2019) celebrate quantitative indicators like submission rates and impact factors, but do not comment on the practices that are used to produce significant results. In a special editorial, Colombo (2017) introduces registered reports that accept study ideas before data are collected and publish results independent of the outcome. However, he doesn’t mention why such an initiative would be necessary (e.g., standard articles use QRPs and studies are only submitted if they show a significant result). Bremner (2019) makes an interesting observation that “it is really rather easy to fail to obtain an effect with infants.” If this is the case and results are reported without selection for significance, articles in Infancy should report many non-significant results. This seems unlikely, given the general bias against non-significant results in psychology.

To examine the replicability of results published in Infancy, I conducted a z-curve analysis (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Z-curve uses the test-statistics (t-values, F-values) in articles to examine how replicable significant results are, how often researchers obtain significant results, how many false positive results are reported, and whether researchers use questionable research practices to inflate the percentage of significant results that are being reported.

I downloaded articles from 2000 to 2019 and used an R program to automatically extract test-statistics. Figure 1 shows the z-curve of the 9,109 test statistics.

First, visual inspection shows a steep cliff around z = 1.96, which corresponds to a p-value of .05 (two-tailed). The fact that there are many more just-significant results than just non-significant results reveals that questionable practices inflate the percentage of significant results. This impression is confirmed by comparing the observed discovery rate of 64%, 95%CI = 63% to 65%, with the expected discovery rate of 37%, 95%CI = 19% to 46%. The confidence intervals clearly do not overlap, indicating that questionable practices inflate the observed discovery rate.

The expected replication rate is 60%, 95%CI = 55% to 65%. This finding implies that exact replications of studies with significant results would produce 60% significant results. This is not a terrible success rate, but this estimate comes with several caveats. First, the estimate is an average of all reported statistical tests. Some of these tests are manipulation checks that are expected to have strong effects. Other tests are novel predictions that may have weaker effects. The replicability estimate for studies with just-significant results (z = 2 to 2.5) is only 35% (see values below x-axis).

The results are similar to estimates for social psychology, which has witnessed a string of replication failures in actual replication attempts. Based on the present results, I predict similar replication failures in infancy research when studies are actually replicated.

Given the questionable status of just-significant results, it is possible to exclude them from the z-curve analysis. Figure 2 shows the results when z-curve is fitted to z-values greater than 2.8, which corresponds to a p-value of .005.

Questionable research practices are now revealed by the fact that there are more just-significant results than the model predicts. Given the uncertainty about these results, readers may focus on p-values less than .005 to ensure that results are replicable.

The next figure shows results when the ERR (black) and EDR (grey) are estimated for all significant results (solid) and only for z-scores greater than 2.8 for each year.

As the number of tests per year is relatively small, estimates are fairly noisy. Tests of time trends did not reveal any significant changes over time. Thus, there is no evidence that infancy researchers have changed their research practices in response to concerns about a replication crisis in psychology.

The results in Figures 1 and 2 also suggest that infancy research produces many false negative results; that is, the hypothesis is true, but the study had insufficient power to produce a significant result. This is consistent with concerns about low power in psychology (Cohen, 1962) and Bremner’s (2019) observation that non-significant results are common, even when they are not reported. False negative results are a problem because they are sometimes falsely interpreted as evidence against a hypothesis. For infancy research to gain credibility, researchers need to change their research practices. First, they need to improve power by increasing the reliability of measures, using within-subject designs whenever possible, or collaborating across labs to increase sample sizes. Second, they need to report all results honestly, not only when studies are pre-registered or published as registered reports. Honest reporting of results is a fundamental aspect of science, and evidence of questionable research practices undermines the credibility of infancy research.

Estimating the Replicability of Results in 'Personality and Social Psychology Bulletin'

Abstract

It has been suggested that social psychologists have a unique opportunity to learn from their mistakes and to improve scientific practices (Bloom, 2016). So far, the editors and the editorial board responsible for PSPB have failed to seize this opportunity. The collective activities of social psychologists that lead to the publication of statistical results in this journal have changed rather little in response to concerns that most of the published results in social psychology are not replicable.

Introduction

There is a replication crisis in social psychology (see Schimmack, 2020, for a review). This crisis is sometimes unfairly generalized to all disciplines in psychology, even though some areas do not have a replication crisis (Open Science Collaboration, 2015; Schimmack, 2020). The crisis is also sometimes presented as an opportunity.

“A public discussion about how scientists make mistakes and how they can work to correct them will help advance scientific understanding more generally. Psychology can lead the way here.” (Paul Bloom, 2016, The Atlantic).

However, the response to the replication crisis by social psychologists has been mixed (Schimmack, 2020). Older social psychologists, in particular, have mostly denied that there is a crisis. In contrast, younger social psychologists have created a new organization to improve (social) psychology. It is unclear whether the comments by older psychologists reflect their behaviors. On the one hand, it is possible that they continue to conduct research as before. On the other hand, it is possible that they are mainly trying to preserve a positive image while quietly changing their behaviors.

This blog post sheds some light on this question by examining the replicability of results published in the journal Personality and Social Psychology Bulletin (PSPB). The editors during the decade of replication failures were Shinobu Kitayama, Duane T. Wegener & Lee Fabrigar, and Christian S. Crandall.

One year before the replication crisis, Kitayama (2010) was optimistic about the quality of research in PSPB. “Now everyone in our field would agree that PSPB is one of our very best journals.” He also described 2010 as an exciting time, not knowing how exciting the next decade would be. I could not find an editorial by Wegener and Fabrigar. Their views on the replication crisis are reflected in their article “Conceptualizing and evaluating the replication of research results” (Fabrigar & Wegener, 2016).

“Another theme that readers might draw from our discussion is that concerns about a “replication crisis” in psychology are exaggerated. In a number of respects, one might conclude that much of what we have said is reassuring for the field.” (p. 12).

Chris Crandall is well-known as an outspoken defender of the status quo on social media. This view is echoed in his editorial (Crandall, Leach, Robinson, & West, 2018).

“PSPB has always been a place for newer, creative ideas, and we will continue to seek papers that showcase creativity, progress, and innovation. We will continue the practice of seeking the highest quality” (p. 287).

However, the authors also express their intention to make some improvements.

“We encourage people to be transparent in the analysis and reporting of a priori power, consistent with the goals of transparency and clarity in reporting all statistical analyses.”

Despite this statement, PSPB did not implement open-science badges that reward researchers for sharing data or pre-registering studies. I asked Chris Crandall why he did not adopt badges for PSPB, but he declined to answer.

It is therefore an empirical question how much the credibility of results published in PSPB has improved in response to the replication crisis. This blog post examines this question by conducting a z-curve analysis of PSPB (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Articles published from 2000 to 2019 were downloaded and test statistics (F-values, t-values) were automatically extracted. Figure 1 shows a z-curve plot of the 50,013 test statistics that were converted into two-sided p-values and then converted into absolute z-scores.

Visual inspection of Figure 1 shows that researchers used questionable research practices. This is indicated by the cliff around a value of z = 1.96, which corresponds to a p-value of .05 (two-tailed). As can be seen, there are far fewer just non-significant results (just below 1.96) than just-significant results (just above 1.96). Moreover, results between 1.65 and 1.96 are often reported as marginally significant support for a hypothesis. Thus, only values below 1.65 reflect results that are presented as truly non-significant.

Z-curve quantifies the use of QRPs by comparing the expected discovery rate to the observed discovery rate. The observed discovery rate is the percentage of reported results that are significant. The expected discovery rate is the percentage of significant results that is expected for all conducted tests, estimated from the distribution of the significant results. The grey curve shows the expected distribution. The observed discovery rate of 71%, 95%CI = 71% to 72%, is much higher than the expected discovery rate of 34%, 95%CI = 22% to 41%. The confidence intervals are clearly not overlapping, indicating that this is not just a chance finding. Thus, questionable practices were used to inflate the percentage of reported significant results. For example, a simple QRP is not to report results from studies that failed to produce significant results. Although this may seem dishonest and unethical, it is a widely used practice.

Z-curve also provides an estimate of the expected replication rate (ERR). The ERR is the percentage of significant results that is expected if studies with significant results were replicated exactly, including the same sample size. The estimate is 64%, which is not a terribly low ERR. However, there are two caveats. First, the estimate is an average, and replicability is lower for just-significant results, as indicated by the estimate of 31% for z-scores between 2 and 2.5. This means that just-significant results are unlikely to replicate. Moreover, it has been pointed out that studies in social psychology are difficult to replicate exactly, so exact replications are often impossible. Using data from actual replications, Bartos and Schimmack (2020) found that the expected discovery rate is a better predictor of success rates in actual replication studies, which are about 25% (Open Science Collaboration, 2015). For PSPB, the EDR estimate of 34% is closer to this actual success rate than the ERR of 64%.

Given the questionable nature of just-significant results, it is possible to exclude these values from the z-curve model. I typically use 2.4 as a criterion, but given the extent of questionable practices, I chose a value of 2.8, which corresponds to a p-value of .005. Figure 2 shows the results. Questionable research practices are now reflected in the high proportion of just-significant results that exceeds the proportion predicted by z-curve (grey curve). The replication rate increases to 82%, and the EDR increases to 76%. Thus, readers of PSPB may use p = .005 to evaluate statistical significance because these results are more credible than just-significant results that may have been obtained with questionable practices.

Figure 3 examines time trends for the ERR (black) and EDR (grey), computed for all significant results (solid) and for selected z-scores greater than 2.8 (dotted). The time trend for the ERR with all significant results is significant, t(19) = 2.35, p = .03, but all other time trends are not significant. In terms of effect sizes, the ERR with all significant results increased from 63% in 2000-2014 to 71% in 2015-2019. The EDR for all significant results also increased from 29% to 41%. Thus, it is possible that slight improvements did occur, although there is too much uncertainty in the estimates to be sure. Although the positive trend in the last couple of years is encouraging, the results do not show a notable response to the replication crisis in social psychology. The absence of a strong response is particularly troublesome if success rates of actual replication studies are better predicted by the EDR than by the ERR.

The next figures make different assumptions about the use of questionable research practices in the more recent years from 2015 to 2019. Figure 4 shows the model that assumes researchers simply report significant results and tend not to report non-significant results.

The large discrepancy between the EDR and ODR suggests that questionable research practices continue to be used by social psychologists who publish in PSPB. Figure 5 excludes questionable results that are just significant. This figure implies that researchers are using QRPs that inflate effect sizes to produce significant results. As these practices tend to produce weak evidence, they produce an unexplained pile of just-significant results.

The last figure fits z-curve to all results, including non-significant ones. Without questionable practices the model should fit well across the entire range of z-values.

The results suggest that questionable research practices are used to turn promising results (z > 1 & z < 2) into just-significant results (z > 2 & z < 2.4) because there are too few promising results and too many just-significant results.

The results also suggest that non-significant results are mostly false negatives (i.e., there is an effect, but the result is not significant). The estimated maximum false discovery rate is only 3%, and the estimated minimum rate of true hypotheses that are being tested is 62%. Thus, there is no evidence that most significant results in PSPB are false positives or that replication failures can be attributed to testing riskier hypotheses than cognitive psychologists. The results are rather consistent with criticisms that were raised decades ago that social psychologists mostly test true hypotheses with low statistical power (Cohen, 1962). Thus, neither significant results nor non-significant results provide much empirical evidence that can be used to evaluate theories. To advance social psychology, social psychologists need to test riskier predictions (e.g., make predictions about ranges of effect sizes) and they need to do so with high power so that prediction failures can falsify theoretical predictions and spur theory development. Social psychology will make no progress if it continues to focus on rejecting nil-hypotheses that manipulations have absolutely no effect, which are a priori implausible.

Given the present results, it seems unlikely that PSPB will make the necessary changes to increase the credibility and importance of social psychology; at least until a new editor is appointed. Readers are advised to be skeptical about just significant p-values and focus on p-values less than .005 (heuristic: ~ t > 2.8, F > 8).

Conclusion

In conclusion, it has been suggested that social psychologists have a unique opportunity to learn from their mistakes and to improve scientific practices (Bloom, 2016). So far, the editors and the editorial board responsible for PSPB have failed to seize this opportunity. The collective activities of social psychologists that lead to the publication of statistical results in this journal have changed rather little in response to concerns that most of the published results in social psychology are not replicable.

Estimating the Replicability of Results in 'Psychonomic Bulletin and Review'

The journal “Psychonomic Bulletin and Review” is considered the flagship journal of the Psychonomic Society. The Psychonomic Society is a professional organization like APS and APA, but it focuses mostly on cognitive psychology.

The journal was started in 1994 with Henry L. Roediger III as editor. The society already had some journals, but this journal aimed to publish more theory and review articles. However, it also published Brief Reports. An editorial in 2007 noted that submissions were skyrocketing, suggesting that it would be harder to publish in the journal. Despite much talk about a replication crisis or credibility crisis in the 2010s, I could not find any editorials published during this period. The incoming editor Brockmole published an editorial in 2020. It doesn’t mention any concerns about publication bias, questionable research practices, or low power. One reason for the lack of concerns could be that cognitive psychology is more robust than other areas of psychology. However, another possibility is that cognitive psychologists have not tested the replicability of results in cognitive psychology.

The aim of this blog post is to shed some light on the credibility of results published in Psychonomic Bulletin and Review. Over the past decade, my colleagues and I have developed a statistical tool, z-curve, that makes it possible to estimate the replication rate and the discovery rate based on published test-statistics (t-values, F-values) (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). The discovery rate can also be used to estimate the maximum false positive rate and the rate of true hypotheses that are being tested. The analyses are based on an automatic extraction of test-statistics (t-values, F-values) from downloaded articles covering the years 2000 to 2019.

Figure 1 shows the z-curve for all 25,248 test statistics. Visual inspection shows a cliff for z-scores around 1.96, which corresponds to the .05 (two-tailed) criterion of significance. This shows preferential publishing of significant results. This is also indicated by a comparison of the number of non-significant results predicted by the model (grey curve) and the actual number of reported non-significant results (blue histogram). The discrepancy is statistically significant, as indicated by a comparison of the 95% confidence intervals of the observed discovery rate, 68% to 70%, and the expected discovery rate, 26% to 53%. The expected discovery rate implies that for every reported significant result, there should be 1 to 3 non-significant results (File Drawer Ratio = 0.89 to 2.85).
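The file drawer ratio follows directly from the discovery rate: for every significant result, (1 - DR) / DR non-significant results are expected. A one-line check in R using the confidence limits reported above (illustration only):

```r
# Expected number of unreported non-significant results per reported significant result.
file_drawer_ratio <- function(discovery_rate) (1 - discovery_rate) / discovery_rate
file_drawer_ratio(0.53)  # upper limit of the EDR -> ~0.89
file_drawer_ratio(0.26)  # lower limit of the EDR -> ~2.85
```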

With a discovery rate of 36%, the maximum False Positive Rate is 9% (Soric, 1989). Thus, published significant results are unlikely to be false positives in the strict sense that the effect size is zero. This estimate is much lower than the alarming estimate of 40% that was reported by Gronau et al. (2017).

The most important information is the replicability estimate of published significant results. The expected replication rate is 76%, 95%CI = 72% to 80%. Although reassuring, there are some caveats. First, the estimate is an average that includes tests of manipulation checks or unimportant main effects. This might explain why the actual replication rate for cognitive psychology is estimated to be only 50% (Open Science Collaboration, 2015). Local estimates for just significant results (z = 2 to 2.4) have only an expected replication rate of 54% to 65%. Another caveat is that the expected replication rate assumes that experiments can be replicated exactly. When this is not the case, additional selection effects lower the rate and the discovery rate becomes a better estimate of the actual replication rate. A rate of 36% would not be satisfactory.

Figure 1 assumes a simple selection model: significant results are published and non-significant results are not. However, researchers also use questionable practices to increase their chances of obtaining a significant result (e.g., selective removal of outliers). These practices make just-significant results questionable. To address this concern, it is possible to exclude just-significant z-scores (z < 2.4) from the model. The results of this approach are shown in Figure 2.

The effect on the ERR is small. However, the effect on the expected discovery rate is large. The EDR is now even higher than the observed discovery rate. Thus, there is no evidence that cognitive psychologists hide non-significant results. However, there is evidence that some just-significant results were obtained with questionable practices because there are more just-significant results than the model predicts. Thus, some questionable research practices are used, but it is not clear which practices are used and how much they influence the discovery rate.

To examine time-trends, I computed the ERR (black) and EDR (grey) using all significant values (solid) and excluding just-significant ones (dotted) and plotted them as a function of publication year.

The ERR and the EDR excluding just-significant results showed high estimates that remained constant over time (no significant trends). For the EDR using all significant results, an interesting pattern emerged: EDR estimates dropped for the period from 2006 to 2014. During this period, Psychonomic Bulletin and Review published an excessive number of just-significant p-values (Figure 4).

Since 2015, the percentage of just-significant p-values is more in line with model predictions. One possible explanation for this is that concerns about replicability changed researchers’ behaviors or reviewers’ evaluation of just-significant p-values.

The fact that there are now more non-significant results than the model predicts is explained by the difficulty of estimating the distribution of non-significant results when only z-scores greater than 2.4 are available. Given the weak evidence of questionable research practices since 2015, it is possible to fit z-curve to non-significant and significant results.

The results show only a slight possibility that promising results (z > 1) are missing because questionable research practices were used to turn them into just-significant results. However, for the most part the model fits the data well, indicating that the reported results are credible. There are very few false positive results, and a minimum of 62% of hypotheses are true hypotheses. Moreover, most non-significant results are type-II errors, suggesting that replication studies with larger samples would produce significant results. These results provide no evidence for a replication crisis in cognitive psychology. It is therefore unwarranted to generalize from attention-grabbing replication failures in social psychology to cognitive psychology. A bigger concern is modest power that produces a fairly large number of false negative results. Thus, replication efforts should focus on important non-significant results, especially when these results were falsely interpreted as evidence for the null-hypothesis. There is no justification to invest resources in massive replication efforts of significant results. Another concern could be the high rate of true hypotheses. Theory development benefits from evidence that shows when predictions are not confirmed. Given the low rate of false hypotheses, cognitive psychologists might benefit from subjecting their theories to riskier predictions that may fail. All of these conclusions are limited to results published in Psychonomic Bulletin and Review. Z-curve analyses of other journals are needed. It will be particularly interesting to examine whether the period of increased questionable practices from 2007 to 2014 is a general trend or reflects editorial decisions in this journal. The current editorial team is well-advised to request pre-registered replication studies when the original submission contains several just-significant results for focal hypothesis tests.

How Often Do Researchers Test True Hypotheses?

Psychologists conduct empirical studies with the goal of demonstrating an effect by rejecting the null-hypothesis that there is no effect, p < .05. The utility of doing so depends on the a priori probability that the null-hypothesis is false. One of the oldest criticisms of null-hypothesis significance testing (NHST) is that the null-hypothesis is virtually always false (Lykken, 1968). As a result, demonstrating a significant result adds little information, while failing to do so because studies have low power creates false information and confusion.

Since 2011, an alternative view is that many published results are false positives (Simmons, Nelson, & Simonsohn, 2011). The narrative is that researchers use questionable research practices that make it extremely likely that true null-hypotheses produce significant results. This argument assumes that psychologists often conduct studies in which the effect size is zero; that is, that Bem’s (2011) crazy studies of extrasensory perception are representative of most studies in psychology. This seems unlikely for all areas of psychology, especially for correlational studies that examine how naturally occurring phenomena covary.

Debates like this are difficult to settle because the null-hypothesis is an elusive phenomenon. As all empirical studies have some sampling error (and there is some uncertainty about small effect sizes that may be artifacts), even large samples or meta-analyses can never affirm the null-hypothesis. However, it is possible to estimate the minimum percentage of true hypotheses (i.e., hypotheses for which the null-hypothesis is false) that are being tested on the basis of the discovery rate, that is, the percentage of hypothesis tests that produced a significant result (Soric, 1989). The logic of this approach is illustrated in Tables 1 and 2.

          NS    SIG   Total
TRUE       0     60      60
FALSE    760     40     800
Total    760    100     860
Table 1

Table 1 illustrates an example in which only 7% (60/860) of the hypotheses that are being tested are true (there is an effect). These hypotheses are tested with 100% power. As a result, there are no true hypotheses with non-significant results. With alpha = .05, only 1 out of 20 false hypotheses produces a significant result. Thus, there are 19 times more false hypotheses with non-significant results (40 * 19 = 760) than false positives (40). In this example the discovery rate is 100/860 = 11.6%.

The assumption that power is 100% is unrealistic, but it makes it possible to estimate the minimum rate of true hypotheses. As power decreases, the implied rate of true hypotheses increases. This is illustrated by lowering mean power to 20% while keeping the discovery rate the same.

          NS    SIG   Total
TRUE     304     76     380
FALSE    456     24     480
Total    760    100     860
Table 2

Once more the discovery rate is 100/860 = 11.6%. However, now many more of the hypotheses are true hypotheses (380/860 = 44%). As there are many non-significant true hypotheses, there are also far fewer non-significant false hypotheses. If we lower power further, the rate of true hypotheses increases and reaches 100% in the limit when power approaches alpha. That is, all hypotheses may have very small effect sizes that are different from zero, so that not a single hypothesis tests an effect that is exactly zero.
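Tables 1 and 2 can be reproduced from three inputs: the rate of true hypotheses, mean power, and alpha. The following R sketch does this bookkeeping (the function and argument names are my own, chosen for illustration):

```r
# Build a Soric-style table of hypothesis tests.
# n_total: number of tested hypotheses; true_rate: proportion of true hypotheses;
# power: mean power for true hypotheses; alpha: significance criterion for false hypotheses.
soric_table <- function(n_total, true_rate, power, alpha = 0.05) {
  n_true  <- n_total * true_rate
  n_false <- n_total - n_true
  sig_true  <- n_true * power    # true positives
  sig_false <- n_false * alpha   # false positives
  tab <- rbind(
    TRUE_HYP  = c(NS = n_true - sig_true,   SIG = sig_true),
    FALSE_HYP = c(NS = n_false - sig_false, SIG = sig_false)
  )
  rbind(tab, Total = colSums(tab))
}
soric_table(860, true_rate = 60 / 860,  power = 1.00)  # reproduces Table 1
soric_table(860, true_rate = 380 / 860, power = 0.20)  # reproduces Table 2
```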

Soric’s (1989) insight makes it possible to examine empirically whether a literature tests many true hypotheses. However, a problem for the application of this approach to actual data is that the true discovery rate is unknown because psychology journals tend to publish almost exclusively significant results, while non-significant results remain hidden in the proverbial file-drawer (Rosenthal, 1979). Recently, Bartos and Schimmack (2020) developed a statistical model that solves this problem. The model, called z-curve, makes it possible to estimate the discovery rate on the basis of the distribution of published significant results. The method is called z-curve because published test statistics are converted into absolute z-scores.

To demonstrate z-curve estimates of the true hypothesis rate (THR), I use test-statistics from the journal Psychonomic Bulletin and Review. The choice of this journal is motivated by prior meta-psychological investigations of results published in this journal. Gronau, Duizer, Bakker, and Wagenmakers (2017) used a Bayesian Mixture Model (BMM) to estimate that about 40% of results published in this journal are false positive results. Table 3 shows that a 40% False Positive Rate (72/172 = .42) corresponds roughly to an estimate that cognitive psychologists test only 10% true hypotheses. This is a surprisingly low estimate, but it matches estimates that psychologists in general test only 9% true hypotheses (Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, & Johannesson, 2015).

          NS    SIG   Total
TRUE       0    100     100
FALSE   1368     72    1440
Total   1368    172    1540
Table 3

Given these estimates, it is not surprising that some psychologists attribute replication failures to a high rate of false positive results that are published in psychology journals. The problem, however, is that the BMM is fundamentally flawed and uses a dogmatic prior to produce inflated estimates of the false positive rate (Schimmack & Brunner, 2019).

To examine this issue with z-curve, I used automatically extracted test-statistics that were published in Psychonomic Bulletin and Review from 2000 to 2019. The program extracted 25,248 test statistics, which allows for much more robust estimates than the 855 t-tests used by Gronau et al. Figure 1 shows clear evidence of selection bias, which makes the observed discovery rate of 70%, 95%CI = 69% to 70%, unusable. Z-curve suggests that the true discovery rate is only 36%. Despite the large sample size, there is considerable uncertainty about this estimate, 95%CI = 26% to 54%. These results suggest that for every significant result, researchers obtain 1 to 3 non-significant results (File-Drawer Ratio = 1.75; 95%CI = 0.85 to 2.85).

The expected discovery rate of 36% implies a minimum rate of true hypotheses (THR) of 33%. This estimate is considerably higher than the estimates of around 10% that were based on much smaller datasets and questionable statistical models (Dreber et al., 2015; Gronau et al., 2017).
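Soric's logic gives a closed-form expression for this minimum: with 100% power, the discovery rate is DR = THR + alpha * (1 - THR), so THR_min = (DR - alpha) / (1 - alpha). A one-line check in R (illustration only):

```r
# Minimum true hypothesis rate implied by a discovery rate, assuming 100% power.
min_thr <- function(discovery_rate, alpha = 0.05) (discovery_rate - alpha) / (1 - alpha)
min_thr(0.36)  # EDR of 36% -> ~0.33 (33%)
```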

The estimate of the minimum THR assumes that power is 100%, which is unrealistic. Z-curve can provide a more realistic estimate based on the expected replication rate of 76% and the estimated maximum false positive rate of 9%. With a false positive rate of 9%, power for the remaining 91% of true positives has to be 83% to produce the ERR of 76% (9% * .05 + 91% * .83 = 76%). With fewer false positives, the implied power would decrease, down to 76% when there are no false positives. Thus, power cannot be higher than 83%. Using a value of 83% power yields a true hypothesis rate of 38% (i.e., 62% of the tested hypotheses are false). This scenario is illustrated in Table 4.

          NS    SIG   Total
TRUE      65    316     381
FALSE    588     31     619
Total    653    347    1000
Table 4

Power is 316/381 = 83%, the False Positive Rate is 31/347 = 9%, and the True Hypothesis Rate is 381/1000 = 38%. This is still a conservative estimate because power for the non-significant results is lower than power for the significant results, but it is more realistic than simply assuming that power is 100%. In this case, the ERR is fairly high and the THR estimate changes only slightly, from 33% to 38%.
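The 83% figure can be recovered by solving ERR = FPR * alpha + (1 - FPR) * power for power. A short R check using the estimates reported above (illustration only):

```r
# Power of true positives implied by the ERR and the false positive rate.
implied_power <- function(err, fpr, alpha = 0.05) (err - fpr * alpha) / (1 - fpr)
implied_power(err = 0.76, fpr = 0.09)  # -> ~0.83 (83%)
```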

Conclusion

Z-curve makes it possible to provide an empirical answer to the question of how often researchers test true versus false hypotheses. For the journal Psychonomic Bulletin and Review, the estimate suggests that at least a third of the tested hypotheses are true hypotheses. This is a minimum estimate, and it is more plausible than the estimate that only 10% of hypotheses are true (Gronau et al., 2017). At the same time, there is no evidence that psychologists only test true hypotheses, which has been a concern for decades (Lykken, 1968; Cohen, 1994). This claim was based on the idea that effect sizes are never exactly zero. However, effect sizes can be very close to zero, would require extremely large sample sizes to be detected reliably, and would be theoretically meaningless. Figure 1 suggests that cognitive psychologists are sometimes testing effects like this. The problem is that this disconfirming evidence is rarely reported, which impedes theory development. Thus, psychologists must report all theoretically important results, even when they fail to support their predictions. Failing to do so is unscientific.