
Estimating the Replicability of Results in 'European Journal of Social Psychology'

Over the past decade, questions have been raised about research practices in psychology and the replicability of published results. The focus has been mostly on research practices in social psychology. A major replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). This finding has produced conflicting responses, ranging from blaming the replication project for the low success rate to claims that most results in social psychology are false positives.

Social psychology journals have responded to concerns about the replicability of social psychology with promises to improve the reporting of results. The European Journal of Social Psychology (EJSP) is no exception. In 2015, the incoming editors Radmila Prislin and Vivian L. Vignoles wrote:

“we believe that scientific progress requires careful adherence to the highest standards of integrity and methodological rigour. In this regard, we welcome recent initiatives to improve the trustworthiness of research in social and personality psychology”

In 2018, the new editorial team of Roland Imhoff, Joanne Smith, and Martijn van Zomeren addressed concerns about questionable research practices more directly:

“opening up also implies being considerate of empirical imperfections that would otherwise remain hidden from view. This means that we require authors to provide a transparent description of the research process in their articles (e.g., report all measures, manipulations, and exclusions from samples, if any; e.g., Simmons, Nelson, & Simonsohn, 2011). We thus encourage authors to accurately report about the inclusion of failed studies and imperfect patterns (e.g., p-values not meeting the .05 threshold), but this also has to mean that disclosing such imperfections, all else being equal, should not affect the likelihood of acceptance.”

This blog post uses the test-statistics published in EJSP to examine whether research practices of authors who publish in EJSP have changed in response to the low replicability of results in social psychology. To do so, I downloaded articles from 2000 to 2019 and automatically extracted test-statistics (t-values, F-values). I then converted these test-statistics into two-sided p-values and then into absolute z-scores. Higher z-scores provide stronger evidence against the null-hypothesis. These z-scores are then analyzed using z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Figure 1 shows the z-curve plot for all 27,223 test statistics.
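For readers who want to try this conversion themselves, it takes only a few lines of R. The sketch below uses made-up test statistics and degrees of freedom; only the conversion logic matters here.

```r
# Sketch of the conversion described above: test statistics -> two-sided
# p-values -> absolute z-scores. The t- and F-values are made-up examples.
t_vals <- c(2.10, 1.45)   # hypothetical t-values
t_dfs  <- c(28, 40)       # their degrees of freedom
F_vals <- c(5.60)         # hypothetical F-value with df1 = 1
F_df2  <- c(52)

p_from_t <- 2 * pt(abs(t_vals), df = t_dfs, lower.tail = FALSE)   # two-sided p
p_from_F <- pf(F_vals, df1 = 1, df2 = F_df2, lower.tail = FALSE)  # already two-sided

p <- c(p_from_t, p_from_F)
z <- qnorm(p / 2, lower.tail = FALSE)   # absolute z-scores; z > 1.96 means p < .05
round(data.frame(p = p, z = z), 3)
```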

Visual inspection shows a cliff at z = 1.96, which corresponds to a p-value of .05, two-sided. The grey curve shows the expected distribution based on the published significant results. The z-curve predicts many more non-significant results than are actually reported, especially below a value of 1.65 that represents the implicit criterion for marginal significance, p = .05, one-sided.

A formal test of selective reporting of significant results compares the observed discovery rate and the expected discovery rate. The observed discovery rate (ODR) is the percentage of reported results that are significant. The expected discovery rate (EDR) is the percentage of significant results that is expected based on the z-curve model. The ODR is 72%, 95%CI = 72% to 73%. This is much higher than the EDR of 26%, 95%CI = 19% to 40%. Thus, there is clear evidence of selective reporting of significant results.

Z-curve also provides an estimate of the expected replication rate. That is, if the studies were replicated exactly, how many of the significant results in the original studies would be significant again in the exact replication studies. The estimate is 70%, 95%CI = 65% to 73%. This is not a bad replication rate, but the problem is that it requires exact replications that are difficult if not impossible to do in social psychology. Bartos and Schimmack (2020) found that the EDR is a better predictor of results for conceptual replication studies. The estimate of 26% is consistent with the low replication rate in the replication project (Open Science Collaboration, 2015).

Fortunately, it is not necessary to dismiss all published results in EJSP. Questionable practices are more likely to produce just-significant results. It is therefore possible to focus on more credible results with a p-value less than .005, which corresponds to a z-score of 2.8. Figure 2 shows the results.

Based on the distribution of z-scores greater than 2.8, the model predicts far fewer just-significant results than are reported. This again suggests that questionable practices were used to produce significant results. Excluding these just-significant results boosts the EDR to a satisfactory level of 77%. Thus, even if replication studies are not exact, the model predicts that most replication studies would produce a significant result with alpha = .05 (that is, without adjusting the significance criterion to the more stringent level of .005).

The following analysis examines whether EJSP editors were successful in increasing the credibility of results published in their journal. For this purpose, I computed the ERR (solid) and the EDR (dotted) using all significant results (black) and excluding questionable results (grey) for each year and plotted the results as a function of year.

The results show no statistically significant trend for any of the four indicators over time. The most important indicator that reflects the use of questionable practices is the EDR for all significant results (black dotted line). The low rates in the last three years show that there have been no major improvements in the publishing culture of EJSP. It is surely easier to write lofty editorials than to actually improve scientific practices. Readers who care about social psychology are advised to ignore p-values greater than .005 because these results may have been produced with questionable practices and are unlikely to replicate. The current editorial team may take these results as a baseline for initiatives to improve the credibility of EJSP in the following years.

Estimating the Replicability of Results in 'Journal of Cross-Cultural Psychology'

I published my first article in the Journal of Cross-Cultural Psychology (JCCP) and I have continued to be interested in cross-cultural psychology. My most highly cited article is based on a cross-cultural study with junior psychologists collecting data from the US, Mexico, Japan, Ghana, and Germany. I still remember how hard it was to collect cross-cultural data. Although this has become easier with the invention of the Internet, cross-cultural research is still challenging. For example, demonstrating that an experimental effect is moderated by culture requires large samples to have adequate power.

Over the past decades, questions have been raised about research practices in psychology and the replicability of published results. The focus has been mostly on social and cognitive psychological research in the United States and other Western countries. A major replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). Given the additional challenges in cross-cultural research, it is possible that the replication rate of results published in JCCP is even lower. Milfont and Klein (2018) discuss the implications of the replication crisis in social psychology for cross-cultural psychology, but their article focuses on the challenges of conducting replication studies in cross-cultural psychology. The aim of this blog post is to examine the replicability of original results published in JCCP. A replicability analysis of original results is useful and necessary because it is impossible to replicate most original studies. Thus, cross-cultural psychology benefits more from ensuring that original results are trustworthy and replicable than from mistrusting all original results until they have been replicated.

To examine the credibility of original results, my colleagues and I have developed a statistical tool, z-curve, that makes it possible to estimate the replication rate and the discovery rate in a set of original studies on the basis of the published test-statistics (t-values, F-values) (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020).
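For readers who want to run such an analysis on their own data, a minimal sketch is shown below. It assumes the zcurve package on CRAN and its zcurve()/summary() interface, and it uses simulated z-scores rather than the actual JCCP data.

```r
# Sketch of a z-curve analysis. It assumes the 'zcurve' package on CRAN and
# its zcurve()/summary() interface; the z-scores are simulated, not real data.
# install.packages("zcurve")
library(zcurve)

set.seed(1)
z <- abs(rnorm(1000, mean = 2, sd = 1))   # stand-in for extracted |z| values

fit <- zcurve(z)   # fits the model to the significant results (z > 1.96)
summary(fit)       # reports the expected replication rate (ERR) and
                   # expected discovery rate (EDR) with bootstrapped CIs
plot(fit)          # produces a z-curve plot like the figures in this post
```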

To apply z-curve to results published in JCCP, I downloaded all articles from 2000 to 2019 and automatically extracted all test-statistics (t-values, F-values) that were reported in the articles. Figure 1 shows the results for 10,210 test-statistics.

Figure 1 shows a histogram of the test-statistics after they are converted into two-sided p-values and then into absolute z-scores. Higher z-scores show stronger evidence against the null-hypothesis that there is no effect. Visual inspection shows a steep increase in reported test-statistics around a z-score of 1.96, which corresponds to a p-value of .05, two-sided. This finding reflects the pervasive tendency in psychology to omit non-significant results from publications (Sterling, 1959; Sterling et al., 1995).

Quantitative evidence of selection for significance is provided by a comparison of the observed discovery rate (how many reported results are significant) and the expected discovery rate (how many significant results are expected based on the z-curve analysis; grey curve). The observed discovery rate of 75%, 95%CI = 75% to 76%, is significantly higher than the expected discovery rate of 34%, 95%CI = 23% to 52% (the confidence intervals do not overlap).

Z-curve also produces an estimate of the expected replication rate. That is, if studies were replicated exactly with the same sample sizes, 77% of the significant results are expected to be significant again in the replication attempt. This estimate is reassuring, but there is a caveat. The estimate assumes that studies can be replicated exactly. This is more likely to be the case for simple studies (e.g., a Stroop task with American undergraduates) than for cross-cultural research. When studies are conceptual replications with uncertainty about the essential features that influence effect sizes (e.g., recruiting participants from different areas in a country), the discovery rate is a better predictor of actual replication outcomes (Bartos & Schimmack, 2020). The estimate of 34% is not very reassuring and is closer to the actual replication rate of social psychology studies.

This does not mean that cross-cultural psychologists need to distrust all published results and replicate all previous studies. It is possible to use z-curve to identify results that are more likely to replicate and provide a solid foundation for future research. The reason is that replicability increases with the strength of evidence; larger z-scores are more replicable. Figure 2 excludes just-significant results that may have been obtained with questionable research practices from the z-curve analysis. Although the criterion value is arbitrary, a value of 2.8 corresponds to a p-value of .005, which has been advocated as a better criterion for significance. I believe this is sensible when questionable research practices were used.

The z-curve model now predicts fewer just-significant results than are actually reported. This suggests that questionable practices were used to report significant results. Based on the model, about a third of these just-significant results are questionable, and the percentage for results with p-values of .04 and .03 (z = 2 to 2.2) is 50%. Given the evidence that questionable practices were used, readers should ignore these results unless other studies show stronger evidence for an effect. Replicability for results with z-scores greater than 2.8 is 91%, so these results are likely to replicate. Thus, a simple way to address concerns about a replication crisis in cross-cultural psychology is to adjust the significance criterion retroactively and to focus on p-values less than .005.

It is also important to examine how often articles in cross-cultural psychology report false positive results. The maximum number of false positive results can be estimated from the discovery rate (Soric, 1989). In Figure 2, this estimate is close to zero. Even in Figure 1, where questionable results lower the discovery rate, the estimate is only 20%. Thus, there is no evidence that JCCP published an abundance of false positive results. Rather, the problem is that most hypotheses in cross-cultural research appear to be true hypotheses and that non-significant results are false negatives. This makes sense as it is unlikely that culture has absolutely no effect, which makes the nil-hypothesis a priori implausible. Thus, cross-cultural researchers need to make riskier predictions about effect sizes and they need to conduct studies with higher power to avoid false negative results.

Figure 3 examines time-trends in cross-cultural research by computing the expected replication rate (ERR, solid) and the expected discovery rate (EDR, dotted) using all significant results (black) and excluding z-scores below 2.8 (grey). Time trends would reveal changes in the research culture of cross-cultural psychologists.

Simple linear regressions showed no significant time-trends for any of the four measures. ERR estimates are high, while EDR estimates for all significant results are low. There is no indication that research practices changed in response to concerns about a replication crisis. Thus, readers should continue to be concerned about just-significant results. Editors and reviewers could improve the trustworthiness of results published in JCCP by asking for pre-registration and by allowing publication of non-significant results if studies have sufficient power to test a meaningful hypothesis. Results should always be reported with effect sizes and sampling error so that it is possible to examine the range of plausible effect sizes. Significance should not always be evaluated against the nil-hypothesis but also against criteria for a meaningful effect size.

In conclusion, there is no evidence that most published results in JCCP are false positives or that most results published in the journal cannot be replicated. There is, however, evidence that questionable practices are used to publish too many significant results and that non-significant results are often obtained because studies had insufficient power. Concerns about low power are decades old (Cohen, 1962) and haven’t changed research practices in psychology. A trend analysis showed that even the replication crisis in social psychology has not changed research practices in cross-cultural psychology. It is time for cross-cultural psychologists to increase the power of their studies and to report all of the results honestly even if they do not confirm theoretical predictions. Publishing only results that confirm predictions renders empirical data meaningless.

Estimating the Replicability of Results in 'Infancy'

A common belief is that the first two years of life are the most important years of development (Cohn, 2011). This makes research on infants very important. At the same time, studying infants is difficult. One problem is that it is hard to recruit infants for research. This makes it difficult to reduce sampling error and to obtain replicable results. Noisy data also make it more likely that questionable research practices are used to inflate effect sizes, because journals hardly ever publish non-significant results (Peterson, 2016). Even disciplines that are able to recruit larger samples of undergraduate students, like social psychology, have encountered replication failures, and a major replication effort suggested that only a quarter of published results can be replicated (Open Science Collaboration, 2015). This raises concerns about the replicability of results published in Infancy.

Despite much talk about a replication crisis in psychology, infancy researchers seem to be unaware of problems with research practices in psychology. Editorials by Bell (2009), Colombo (2014), and Bremner (2019) celebrate quantitative indicators like submission rates and impact factors, but do not comment on the practices that are used to produce significant results. In a special editorial, Colombo (2017) introduces registered reports that accept study ideas before data are collected and publish results independent of the outcome. However, he doesn’t mention why such an initiative would be necessary (e.g., standard articles use QRPs and studies are only submitted if they show a significant result). Bremner (2019) makes an interesting observation that “it is really rather easy to fail to obtain an effect with infants.” If this is the case and results are reported without selection for significance, articles in Infancy should report many non-significant results. This seems unlikely, given the general bias against non-significant results in psychology.

To examine the replicability of results published in Infancy, I conducted a z-curve analysis (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Z-curve uses the test-statistics (t-values, F-values) in articles to examine how replicable significant results are, how often researchers obtain significant results, how many false positive results are reported, and whether researchers use questionable research practices to inflate the percentage of significant results that are being reported.

I downloaded articles from 2000 to 2019 and used an R program to automatically extract test-statistics. Figure 1 shows the z-curve of the 9,109 test statistics.

First, visual inspection shows a steep cliff around z = 1.96, which corresponds to a p-value of .05 (two-tailed). The fact that there are many more just-significant results than just non-significant results reveals that questionable practices inflate the percentage of significant results. This impression is confirmed by comparing the observed discovery rate of 64%, 95%CI = 63% to 65%, to the expected discovery rate of 37%, 95%CI = 19% to 46%. The confidence intervals clearly do not overlap, indicating that questionable practices inflate the observed discovery rate.

The expected replication rate is 60%, 95%CI = 55% to 65%. This finding implies that exact replications of studies with significant results would produce 60% significant results. This is not a terrible success rate, but this estimate comes with several caveats. First, the estimate is an average of all reported statistical tests. Some of these tests are manipulation checks that are expected to have strong effects. Other tests are novel predictions that may have weaker effects. The replicability estimate for studies with just-significant results (z = 2 to 2.5) is only 35% (see values below x-axis).

The results are similar to estimates for social psychology, which has witnessed a string of replication failures in actual replication attempts. Based on the present results, I predict similar replication failures in infancy research when studies are actually replicated.

Given the questionable status of just-significant results, it is possible to exclude them from the z-curve analysis. Figure 2 shows the results when z-curve is fitted to z-values greater than 2.8, which corresponds to a p-value of .005.

Questionable research practices are now revealed by the greater proportion of just-significant results than the model predicts. Given the uncertainty about these results, readers may focus on p-values less than .005 to ensure that results are replicable.

The next figure shows results when the ERR (black) and EDR (grey) are estimated for all significant results (solid) and only for z-scores greater than 2.8 for each year.

As the number of tests per year is relatively small, estimates are fairly noisy. Tests of time trends did not reveal any significant changes over time. Thus, there is no evidence that infancy researchers have changed their research practices in response to concerns about a replication crisis in psychology.

The results in Figures 1 and 2 also suggest that Infancy research produces many false negative results; that is, the hypothesis is true, but studies have insufficient power to produce a significant result. This is consistent with concerns about low power in psychology (Cohen, 1962) and Bremner’s (2019) observation that non-significant results are common, even when they are not reported. False negative results are a problem because they are sometimes falsely interpreted as evidence against a hypothesis. For infancy research to gain credibility, researchers need to change their research practices. First, they need to improve power by increasing the reliability of measures, using within-subject designs whenever possible, or collaborating across labs to increase sample sizes. Second, they need to report all results honestly, not only when studies are pre-registered or published as registered reports. Honest reporting of results is a fundamental aspect of science, and evidence of questionable research practices undermines the credibility of infancy research.

Estimating the Replicability of Results in 'Personality and Social Psychology Bulletin'

Abstract

It has been suggested that social psychologists have a unique opportunity to learn from their mistakes and to improve scientific practices (Bloom, 2016). So far, the editors and the editorial board responsible for PSPB have failed to seize this opportunity. The collective activities of social psychologists that lead to the publication of statistical results in this journal have changed rather little in response to concerns that most of the published results in social psychology are not replicable.

Introduction

There is a replication crisis in social psychology (see Schimmack, 2020, for a review). This crisis is sometimes unfairly generalized to all disciplines in psychology, even though some areas do not have a replication crisis (Open Science Collaboration, 2015; Schimmack, 2020). The crisis is also sometimes presented as an opportunity.

“A public discussion about how scientists make mistakes and how they can work to correct them will help advance scientific understanding more generally. Psychology can lead the way here.” (Paul Bloom, 2016, The Atlantic).

However, the response to the replication crisis by social psychologists has been mixed (Schimmack, 2020). Older social psychologists, especially, have mostly denied that there is a crisis. In contrast, younger social psychologists have created a new organization to improve (social) psychology. It is unclear whether the comments by older psychologists reflect their behaviors. On the one hand, it is possible that they continue to conduct research as before. On the other hand, it is possible that older social psychologists are mainly trying to preserve a positive image, while they are quietly changing their behaviors.

This blog post sheds some light on this question by examining the replicability of results published in the journal Personality and Social Psychology Bulletin (PSPB). The editors during the decade of replication failures were Shinobu Kitayama, Duane T. Wegener & Lee Fabrigar, and Christian S. Crandall.

One year before the replication crisis, Kitayama (2010) was optimistic about the quality of research in PSPB. “Now everyone in our field would agree that PSPB is one of our very best journals.” He also described 2010 as an exciting time, not knowing how exciting the next decade would be. I could not find an editorial by Wegener and Fabrigar. Their views on the replication crisis are reflected in their article “Conceptualizing and evaluating the replication of research results” (Fabrigar & Wegener, 2016).

“Another theme that readers might draw from our discussion is that concerns about a ‘replication crisis’ in psychology are exaggerated. In a number of respects, one might conclude that much of what we have said is reassuring for the field.” (p. 12).

Chris Crandall is well-known as an outspoken defender of the status quo on social media. This view is echoed in his editorial (Crandall, Leach, Robinson, & West, 2018).

“PSPB has always been a place for newer, creative ideas, and we will continue to seek papers that showcase creativity, progress, and innovation. We will continue the practice of seeking the highest quality” (p. 287).

However, the authors also express their intention to make some improvements.

“We encourage people to be transparent in the analysis and reporting of a priori power, consistent with the goals of transparency and clarity in reporting all statistical analyses.”

Despite this statement, PSPB did not implement open-science badges that reward researchers for sharing data or pre-registering studies. I asked Chris Crandall why he did not adopt badges for PSPB, but he declined to answer.

It is therefore an empirical question how much the credibility of results published in PSPB has improved in response to the replication crisis. This blog post examines this question by conducting a z-curve analysis of PSPB (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Articles published from 2000 to 2019 were downloaded and test statistics (F-values, t-values) were automatically extracted. Figure 1 shows a z-curve plot of the 50,013 test statistics that were converted into two-sided p-values and then converted into absolute z-scores.

Visual inspection of Figure 1 shows that researchers used questionable research practices. This is indicated by the cliff around a value of z = 1.96 that corresponds to a p-value of .05 (two-tailed). As can be seen, there are far fewer non-significant results just below 1.96 than just-significant results just above 1.96. Moreover, results between 1.65 and 1.96 are often reported as marginally significant support for a hypothesis. Thus, only values below 1.65 reflect results that are presented as truly non-significant.

Z-Curve quantifies the use of QRPs by comparing the expected discovery rate to the observed discovery rate. The observed discovery rate is the percentage of reported results that are significant. The expected discovery rate is the percentage of significant results that is expected given the distribution of significant results. The grey curve shows the expected distribution. The observed discovery rate of 71%, 95%CI = 71% to 72%, is much higher than the expected discovery rate of 34%, 95%CI = 22% to 41%. The confidence intervals are clearly not overlapping, indicating that this is not just a chance finding. Thus, questionable practices were used to inflate the percentage of reported significant results. For example, a common QRP is simply not reporting results from studies that failed to produce significant results. Although this may seem dishonest and unethical, it is a widely used practice.

Z-curve also provides an estimate of the expected replication rate (ERR). The ERR is the percentage of significant results that is expected if studies with significant results were replicated exactly, including the same sample size. The estimate is 64%, which is not a terribly low ERR. However, there are two caveats. First, the estimate is an average, and replicability is lower for just-significant results, as indicated by the estimate of 31% for z-scores between 2 and 2.5. This means that just-significant results are unlikely to replicate. Moreover, it has been pointed out that studies in social psychology are difficult to replicate exactly, so exact replications are often impossible. Using data from actual replications, Bartos and Schimmack (2020) found that the expected discovery rate is a better predictor of success rates in actual replication studies, which is about 25% (Open Science Collaboration, 2015). For PSPB, the EDR estimate of 34% is closer to the actual success rate than the ERR of 64%.

Given the questionable nature of just-significant results, it is possible to exclude these values from the z-curve model. I typically use 2.4 as a criterion but given the extent of questionable practices, I chose a value of 2.8, which corresponds to a p-value of .005. Figure 2 shows the results. Questionable research practices are now reflected in the high proportion of just-significant results that exceeds the proportion predicted by z-curve (grey curve). The replication rate increases to 82%, and the EDR increases to 76%. Thus, readers of PSPB may use p = .005 to evaluate statistical significance because these results are more credible than just significant results that were obtained with questionable practices.

Figure 3 examines time trends for ERR (black) and EDR (grey) computed for all significant results (solid) and for selected z-scores greater than 2.8 (dotted). The time trend for the ERR with all significant results is significant, t(19) = 2.35, p = .03, but all other time trends are not significant. In terms of effect sizes the ERR with all significant results increased from 63% in 2000-2014, to 71% in 2015-2019. The EDR for all significant results also increased from 29% to 41%. Thus, it is possible that slight improvements did occur, although there is too much uncertainty in the estimates to be sure. Although the positive trend in the last couple of years is encouraging, the results do not show a notable response to the replication crisis in social psychology. The absence of a strong response is particularly troublesome if success rates of actual replication studies are better predicted by the EDR than the ERR.

The next figures make different assumptions about the use of questionable research practices in the more recent years from 2015 to 2019. Figure 4 shows the model that assumes researchers simply report significant results and tend to not report non-significant results.

The large discrepancy between the EDR and ODR suggests that questionable research practices continue to be used by social psychologists who publish in PSPB. Figure 5 excludes questionable results that are just significant. This figure implies that researchers are using QRPs that inflate effect sizes to produce significant results. As these practices tend to produce weak evidence, they produce an unexplained pile of just-significant results.

The last figure fits z-curve to all results, including non-significant ones. Without questionable practices the model should fit well across the entire range of z-values.

The results suggest that questionable research practices are used to turn promising results (z > 1 & z < 2) into just-significant results (z > 2 & z < 2.4) because there are too few promising results and too many just-significant results.

The results also suggest that non-significant results are mostly false negatives (i.e., there is an effect, but the result is not significant). The estimated maximum false discovery rate is only 3%, and the estimate of the minimum number of true hypotheses that are being tested is 62%. Thus, there is no evidence that most significant results in PSPB are false positives or that replication failures can be attributed to testing riskier hypotheses than cognitive psychologists. The results are rather consistent with criticisms that were raised decades ago that social psychologists mostly test true hypotheses with low statistical power (Cohen, 1962). Thus, neither significant results nor non-significant results provide much empirical evidence that can be used to evaluate theories. To advance social psychology, social psychologists need to test riskier predictions (e.g., make predictions about ranges of effect sizes) and they need to do so with high power so that prediction failures can falsify theoretical predictions and spur theory development. Social psychology will make no progress if it continues to focus on rejecting a priori implausible nil-hypotheses that manipulations have absolutely no effect.

Given the present results, it seems unlikely that PSPB will make the necessary changes to increase the credibility and importance of social psychology; at least until a new editor is appointed. Readers are advised to be skeptical about just significant p-values and focus on p-values less than .005 (heuristic: ~ t > 2.8, F > 8).

Conclusion

In conclusion, it has been suggested that social psychologists have a unique opportunity to learn from their mistakes and to improve scientific practices (Bloom, 2016). So far, the editors and the editorial board responsible for PSPB have failed to seize this opportunity. The collective activities of social psychologists that lead to the publication of statistical results in this journal have changed rather little in response to concerns that most of the published results in social psychology are not replicable.

Estimating the Replicability of Results in 'Psychonomic Bulletin and Review'

The journal “Psychonomic Bulletin and Review” is considered the flagship journal of the Psychonomic Society. The Psychonomic Society is a professional organization like APS and APA, but it focuses mostly on cognitive psychology.

The journal was started in 1994 with Henry L. Roediger III as editor. The society already had some journals, but this journal aimed to publish more theory and review articles. However, it also published Brief Reports. An editorial in 2007 noted that submissions were skyrocketing, suggesting that it would be harder to publish in the journal. Despite much talk about a replication crisis or credibility crisis in the 2010s, I could not find any editorials published during this period. The incoming editor Brockmole published an editorial in 2020. It doesn’t mention any concerns about publication bias, questionable research practices, or low power. One reason for the lack of concerns could be that cognitive psychology is more robust than other areas of psychology. However, another possibility is that cognitive psychologists have not tested the replicability of results in cognitive psychology.

The aim of this blog post is to shed some light on the credibility of results published in Psychonomic Bulletin and Review. Over the past decade, my colleagues and I have developed a statistical tool, z-curve, that makes it possible to estimate the replication rate and the discovery rate based on published test-statistics (t-values, F-values) (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). The discovery rate can also be used to estimate the maximum false positive rate and the rate of true hypotheses that are being tested. The analyses are based on an automatic extraction of test-statistics (t-values, F-values) from downloaded articles covering the years 2000 to 2019.

Figure 1 shows the z-curve for all 25,248 test statistics. Visual inspection shows a cliff for z-scores around 1.96, which corresponds to the .05 (two-tailed) criterion of significance. This shows preferential publishing of significant results. This is also indicated by a comparison of the number of non-significant results predicted by the model (grey curve) and the actual number of reported non-significant results (blue histogram). The discrepancy is statistically significant, as indicated by a comparison of the 95% confidence intervals of the observed discovery rate, 68% to 70%, and the estimated discovery rate, 26% to 53%. The estimated discovery rate implies that for every reported significant result, there should be 1 to 3 unreported non-significant results (File Drawer Ratio = 0.89 to 2.85).

With a discovery rate of 36%, the maximum False Positive Rate is 9% (Soric, 1989). Thus, published significant results are unlikely to be false positives in the strict sense that the effect size is zero. This estimate is much lower than the alarming estimate of 40% that was reported by Gronau et al. (2017).
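Both the file-drawer ratio and Soric's bound follow directly from the expected discovery rate. The following sketch simply re-computes the numbers reported above.

```r
# File-drawer ratio and Soric's (1989) maximum false discovery rate, computed
# from the expected discovery rate (EDR) estimates reported above.
alpha <- .05
edr   <- c(est = .36, lower = .26, upper = .53)

file_drawer <- (1 - edr) / edr                      # unreported NS results per significant result
soric_fdr   <- (1 / edr - 1) * alpha / (1 - alpha)  # maximum share of false positives

round(rbind(file_drawer, soric_fdr), 2)
# at EDR = .36: file-drawer ratio ~ 1.8, maximum false positive rate ~ .09
```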

The most important information is the replicability estimate of published significant results. The expected replication rate is 76%, 95%CI = 72% to 80%. Although reassuring, there are some caveats. First, the estimate is an average that includes tests of manipulation checks or unimportant main effects. This might explain why the actual replication rate for cognitive psychology is estimated to be only 50% (Open Science Collaboration, 2015). Local estimates for just significant results (z = 2 to 2.4) have only an expected replication rate of 54% to 65%. Another caveat is that the expected replication rate assumes that experiments can be replicated exactly. When this is not the case, additional selection effects lower the rate and the discovery rate becomes a better estimate of the actual replication rate. A rate of 36% would not be satisfactory.

Figure 1 assumes a simple selection model; significant results are published and non-significant results are not published. However, researchers also use questionable practices to increase their chances of obtaining a significant result (e.g., selective removal of outliers). These practices make just-significant results questionable. To address this concern, it is possible to exclude just-significant z-scores (z < 2.4) from the model. The results of this approach are shown in Figure 2.

The effect on the ERR is small. However, the effect on the expected discovery rate is large. The EDR is now even higher than the observed discovery rate. Thus, there is no evidence that cognitive psychologists hide non-significant results. However, there is evidence that some just-significant results were obtained with questionable practices because there are more just-significant results than the model predicts. Thus, some questionable research practices are used, but it is not clear which practices are used and how much they influence the discovery rate.

To examine time-trends, I computed the ERR (black) and EDR (grey) using all significant values (solid) and excluding just-significant ones (dotted) and plotted them as a function of publication year.

The ERR and the EDR excluding just-significant results showed high estimates that remained constant over time (no significant trends). For the EDR using all significant results, an interesting pattern emerged. EDR estimates dropped for a period from 2006 to 2014. During this period, Psychonomic Bulletin and Review published an excessive number of just-significant p-values (Figure 4).

Since 2015, the percentage of just-significant p-values is more in line with model predictions. One possible explanation is that concerns about replicability changed researchers’ behaviors or reviewers’ evaluation of just-significant p-values.

The fact that there are now more non-significant results than the model predicts is explained by the difficulty of estimating the distribution of non-significant results when only z-scores greater than 2.4 are available. Given the weak evidence of questionable research practices since 2015, it is possible to fit z-curve to non-significant and significant results.

The results show only a slight possibility that promising results (z > 1) are missing because questionable research practices were used to turn them into just-significant results. However, for the most part the model fits the data well, indicating that the reported results are credible. There are very few false positive results and a minimum of 62% of hypotheses are true hypotheses. Moreover, most non-significant results are type-II errors, suggesting that replication studies with larger samples would produce significant results. These results provide no evidence for a replication crisis in cognitive psychology. It is therefore unwarranted to generalize from attention-grabbing replication failures in social psychology to cognitive psychology. A bigger concern is modest power that produces a fairly large number of false negative results. Thus, replication efforts should focus on important non-significant results, especially when these results were falsely interpreted as evidence for the null-hypothesis. There is no justification to invest resources in massive replication efforts of significant results. Another concern could be the high rate of true hypotheses. Theory development benefits from evidence that shows when predictions are not confirmed. Given the low rate of false hypotheses, cognitive psychologists might benefit from subjecting their theories to riskier predictions that may fail. All of these conclusions are limited to results published in Psychonomic Bulletin and Review. Z-curve analyses of other journals are needed. It will be particularly interesting to examine whether the period of increased questionable practices from 2007 to 2014 is a general trend or reflects editorial decisions in this journal. The current editorial team is well-advised to request pre-registered replication studies when an original submission contains several just-significant results for focal hypothesis tests.

How Often Do Researchers Test True Hypotheses?

Psychologists conduct empirical studies with the goal of demonstrating an effect by rejecting the null-hypothesis that there is no effect, p < .05. The utility of doing so depends on the a priori probability that the null-hypothesis is false. One of the oldest criticisms of null-hypothesis significance testing (NHST) is that the null-hypothesis is virtually always false (Lykken, 1968). As a result, demonstrating a significant result adds little information, while failing to do so because studies have low power creates false information and confusion.

Since 2011, an alternative view is that many published results are false positives (Simmons, Nelson, & Simonsohn, 2011). The narrative is that researchers use questionable research practices that make it extremely likely that true null-hypotheses produce significant results. This argument assumes that psychologists often conduct studies where the effect size is zero; that is, Bem’s (2011) crazy studies of extrasensory perception are representative of most studies in psychology. This seems a bit unlikely for all areas of psychology, especially correlational studies that examine how naturally occurring phenomena covary.

Debates like this are difficult to settle because the null-hypothesis is an elusive phenomenon. As all empirical studies have some sampling error (and there is some uncertainty about small effect sizes that may be artifacts), even large samples or meta-analyses can never affirm the null-hypothesis. However, it is possible to estimate the minimum percentage of null-hypotheses that are being tested on the basis of the discovery rate; that is, the percentage of hypothesis tests that produced a significant result (Soric, 1989). The logic of this approach is illustrated in Tables 1 and 2.

        NS    SIG   Total
TRUE     0     60      60
FALSE  760     40     800
Total  760    100     860
Table 1

Table 1 illustrates an example where only 7% (60/860) of the hypotheses that are being tested are true (there is an effect). These hypotheses are tested with 100% power. As a result, there are no true hypotheses with non-significant results. With alpha = .05, only 1 out of 20 false hypotheses produces a significant result. Thus, there are 19 times more false hypotheses with non-significant results (40 * 19 = 760) than false positives (40). In this example, the discovery rate is 100/860 = 11.6%.

The assumption that power is 100% is unrealistic but it makes it possible to estimate the minimum rate of true hypotheses. As power decreases, the rate of true hypotheses increases. This is illustrated by lowering mean power to 20%, while keeping the discovery rate the same.

        NS    SIG   Total
TRUE   304     76     380
FALSE  456     24     480
Total  760    100     860
Table 2

Once more the discovery rate is 100/860 = 11.6%. However, now a lot more of the hypotheses are true hypotheses (380/860 = 44%). As there are many non-significant true hypotheses, there are also a lot fewer non-significant false hypotheses. If we lower power further, the number of true hypotheses increases and reaches 100% in the limit when power approaches alpha. That is, all hypotheses may have very small effect sizes that are different from zero. Thus, not a single hypothesis tests an effect that is exactly zero.
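Readers can reproduce these tables with a few lines of R. The function name soric_table below is only an illustrative label for this sketch, not an existing function.

```r
# Rebuilds tables like Table 1 and Table 2 from the number of tests, the rate
# of true hypotheses, mean power, and alpha. 'soric_table' is only an
# illustrative name for this sketch.
soric_table <- function(n, true_rate, power, alpha = .05) {
  n_true  <- n * true_rate
  n_false <- n - n_true
  sig <- c(n_true * power, n_false * alpha)   # significant results per row
  ns  <- c(n_true, n_false) - sig             # non-significant results per row
  tab <- cbind(NS = ns, SIG = sig, Total = ns + sig)
  rownames(tab) <- c("TRUE", "FALSE")
  rbind(tab, Total = colSums(tab))            # adds the column totals
}

soric_table(860, true_rate = 60 / 860, power = 1.0)   # reproduces Table 1
soric_table(860, true_rate = 380 / 860, power = 0.2)  # reproduces Table 2
```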

Soric’s (1989) insight makes it possible to examine empirically whether a literature tests many true hypotheses. However, a problem for the application of the approach to actual data is that the true discovery rate is unknown because psychology journals tend to publish nearly exclusively significant results, while non-significant results remain hidden in the proverbial file-drawer (Rosenthal, 1979). Recently, Bartos and Schimmack (2020) developed a statistical model that solves this problem. The model, called z-curve, makes it possible to estimate the discovery rate on the basis of the distribution of published significant results. The method is called z-curve because published test statistics are converted into absolute z-scores.

To demonstrate z-curve estimates of the true hypothesis rate (THR), I use test-statistics from the journal Psychonomic Bulletin and Review. The choice of this journal is motivated by prior meta-psychological investigations of results published in this journal. Gronau, Duizer, Bakker, and Wagenmakers (2017) used a Bayesian Mixture Model to estimate that about 40% of results published in this journal are false positive results. Table 3 shows that a 40% False Positive Rate (72/172 = .42) corresponds roughly to an estimate that cognitive psychologists test only 10% true hypotheses. This is a surprisingly low estimate, but it matches estimates that psychologists in general test only 9% true hypotheses (Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, & Johannesson, 2015).

        NS    SIG   Total
TRUE     0    100     100
FALSE  828     72     900
Total  828    172    1000
Table 3

Given these estimates, it is not surprising that some psychologists attribute replication failures to a high rate of false positive results that are published in psychology journals. The problem, however, is that the BMM is fundamentally flawed and uses a dogmatic prior to produce inflated estimates of the false positive rate (Schimmack & Brunner, 2019).

To examine this issue with z-curve, I used automatically extracted test-statistics that were published in Psychonomic Bulletin and Review from 2000 to 2019. The program extracted 25,248 test statistics, which allows for much more robust estimates than the 855 t-tests used by Gronau et al. Figure 1 shows clear evidence of selection bias, which makes the observed discovery rate of 70%, 95%CI = 69% to 70%, unusable. Z-curve suggests that the true discovery rate is only 36%. Despite the large sample size, there is considerable uncertainty about this estimate, 95%CI = 26% to 54%. These results suggest that for every significant result, researchers obtain 1 to 3 non-significant results (File-Drawer Ratio = 1.75; 95%CI = 0.85, 2.85).

The expected discovery rate of 36% implies a minimum rate of True Hypotheses of 33%. This estimate is considerably higher than the estimates of 10% based on much smaller datasets and questionable statistical models (###, Gronau et al.).
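The minimum rate of true hypotheses follows from the discovery rate with a one-line formula, as the following sketch shows using the estimates reported above.

```r
# Soric's minimum rate of true hypotheses, assuming 100% power for true
# hypotheses: DR = THR + (1 - THR) * alpha, so THR_min = (DR - alpha) / (1 - alpha).
min_thr <- function(dr, alpha = .05) (dr - alpha) / (1 - alpha)

min_thr(.36)          # ~0.33, the 33% reported above
min_thr(c(.26, .54))  # bounds implied by the EDR confidence interval
```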

The estimate of the minimum THR assumes that power is 100%. This is an unrealistic assumption. Z-curve is able to provide a more realistic value based on the expected replication rate of 76% and the estimated maximum false positive rate of 9%. With a false positive rate of 9%, power for the remaining 91% of true positives has to be 83% to produce the ERR of 76% (9% * .05 + 91% * .83 = 76%). With fewer false positives, power would decrease to a minimum of 76% when there are no false positives. Thus, power cannot be higher than 83%. Using a value of 83% power yields a true hypothesis rate of 38%. This scenario is illustrated in Table 4.

        NS    SIG   Total
TRUE    65    316     381
FALSE  588     31     619
Total  653    347    1000
Table 4

Power is 316/381 = 83%, the False Positive Rate is 31/347 = 9%, and the True Hypothesis Rate is 381/1000 = 38%. This is still a conservative estimate because power for the non-significant results is lower than power for the significant results, but it is more realistic than simply assuming that power is 100%. In this case, the ERR is fairly high and the max THR estimate changes only slightly from 33% to 38%.
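The arithmetic behind Table 4 can be checked in a few lines of R, using the cell counts from the table.

```r
# Checks the Table 4 scenario: power, false positive rate, true hypothesis
# rate, and the implied expected replication rate, using the cell counts above.
alpha <- .05
true_sig  <- 316; true_ns  <- 65    # true hypotheses
false_sig <- 31;  false_ns <- 588   # false hypotheses
sig <- true_sig + false_sig         # 347 significant results

power <- true_sig / (true_sig + true_ns)                # 316 / 381  = .83
fpr   <- false_sig / sig                                # 31 / 347   = .09
thr   <- (true_sig + true_ns) / 1000                    # 381 / 1000 = .38
err   <- (false_sig * alpha + true_sig * power) / sig   # ~ .76, the reported ERR

round(c(power = power, fpr = fpr, thr = thr, err = err), 2)
```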

Conclusion

Z-curve makes it possible to provide an empirical answer to the question of how often researchers test true versus false hypotheses. For the journal Psychonomic Bulletin and Review, the estimate suggests that at least a third of the hypotheses are true hypotheses. This is a minimum estimate. It is a more plausible estimate than the claim that only 10% of hypotheses are true (Gronau et al., 2017). At the same time, there is no evidence that psychologists only test true hypotheses, which was a concern for decades (Lykken, 1968; Cohen, 1994). This claim was based on the idea that effect sizes are never exactly zero. However, effect sizes can be very close to zero and would require extremely large sample sizes to have sufficient power to detect them reliably. Moreover, these effects would be theoretically meaningless. Figure 1 suggests that cognitive psychologists are sometimes testing effects like this. The problem is that this disconfirming evidence is rarely reported, which impedes theory development. Thus, psychologists must report all theoretically important results even when they fail to support their predictions. Failing to do so is unscientific.

Null-Hypothesis Testing with Confidence Intervals

Statistics is a mess. Statistics education is a mess. Not surprisingly, the understanding of statistics by applied research workers is a mess. This was less of a problem when there was only one way to conduct statistical analyses. Nobody knew what they were doing, but at least everybody was doing the same thing. Now we have a multiverse of statistical approaches, and applied research workers are happy to mix and match statistics to fit their needs. This is making the reporting of results worse and leads to logical contradictions.

For example, the authors of an article in a recent issue of Psychological Science that shall remain anonymous claimed (a) that a Bayesian hypothesis test provided evidence for the nil-hypothesis (an effect size of zero) and (b) that their preregistered replication study had high statistical power. This makes no sense, because power is defined as the probability of correctly rejecting the null-hypothesis, which assumes an a priori effect size greater than zero. Power is simply not defined when the hypothesis is that the population effect size is zero.

Errors like this are avoidable if we realize that Neyman introduced confidence intervals to make hypothesis testing easier and more informative. Here is a brief introduction to thinking clearly about hypothesis testing that should help applied research workers understand what they are doing.

Effect Size and Sampling Error

The most important information that applied research workers should report are (a) an estimate of the effect size and (b) an estimate of sampling error. Every statistics course should start with introducing these two concepts because all other statistics like p-values or Bayes-Factors or confidence intervals are based on effect size and sampling error. They are also the most useful information for meta-analysis.

Information about effect sizes and sampling error can be in the form of unstandardized values (e.g., 5 cm difference in height with SE = 2 cm) or in standardized form (d = .5, SE = .2). This is not relevant for hypothesis testing and I will use standardized effect sizes for my example.

Specifying the Null-Hypothesis

The null-hypothesis is the hypothesis that a researcher believes to be untrue. It is the hypothesis that they want to reject or NULLify. The biggest mistake in statistics is the assumption that this hypothesis is always that there is no effect (effect size of zero). Cohen (1994) called this hypothesis the nil-hypothesis to distinguish it from other null-hypotheses.

For example, in a directional test that studying harder leads to higher grades, the null-hypothesis specifies all non-positive values (zero and all negative values). When this null-hypothesis is rejected, it automatically implies that the alternative hypothesis is true (given a specific error criterion and a bunch of assumptions, not in a mathematically proven sense). Normally, we go through various steps to reject the null-hypothesis, to then affirm the alternative. However, with confidence intervals we can directly affirm the alternative.

Calculating Confidence Intervals

A confidence interval requires three pieces of information.

1. An ESTIMATE of the effect size. This estimate is provided by the mean difference in a sample. For example, the height difference of 5cm or d = .5 are estimates of the population mean difference in height.

2. An ESTIMATE of sampling error. In simple designs, sampling error is a function of sample size, but even then we are making assumptions that can be violated and are difficult or impossible to test in small samples. In more complex designs, sampling error depends on other statistics that are sample dependent. Thus, sampling error is also just an estimate. The main job of statisticians is to find plausible estimates of sampling error for applied research workers. Applied researchers simply use the information that is provided by statistics programs. In our example, sampling error was estimated to be d = .2.

3. The third piece of information is how confident we want to be in our inferences. All data-based inferences are inductions that can be wrong, but we can specify the probability of being wrong. This quantity is known as the type-I error with the Greek symbol alpha. A common value is alpha = .05. This implies that we have a long-run error rate of no more than 5%. If we obtain 100 confidence intervals, the long-run error rate is limited to no more than 5% false inferences in favor of the alternative hypothesis. With alpha = .05, sampling error has to be multiplied by approximately 2 to compute a confidence interval.

To use our example, with d = .5 and SE = .2, we can create a confidence interval that ranges from d = .5 – .2*2 = .1 to d = .5 + .2*2 = .9. We can now state, WITHOUT ANY OTHER INFORMATION that may be relevant (e.g., we already know the alternative is true based on a much larger trustworthy prior study and our study is only a classroom demonstration), that the data support our hypothesis that there is a positive effect, because the confidence interval fits into the predicted interval; that is, the values from .1 to .9 fit into the set of values from 0 to infinity.
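In R, the whole example takes three lines. The sketch below uses the values from the example (d = .5, SE = .2).

```r
# The example above: d = .5, SE = .2, alpha = .05.
d  <- .5
se <- .2
ci <- d + c(-1, 1) * qnorm(.975) * se   # qnorm(.975) is the "approximately 2"
ci                                      # roughly .1 to .9

# Does the confidence interval fall entirely inside the predicted region
# (all positive values)?
all(ci > 0)   # TRUE: the data support the hypothesis of a positive effect
```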

A more common way to express this finding is to state that the confidence interval does not include the largest value of the null-hypothesis, which is zero. However, this leads to the impression that we tested the nil-hypothesis, and rejected it. But that is not true. We also rejected all the values less than 0. Thus, we did not test or reject the nil-hypothesis. We tested and rejected the null-hypothesis of effect sizes ranging from -infinity to 0. But it is also not necessary to state that we rejected this null-hypothesis because this statement is redundant with the statement we actually want to make. We found evidence for our hypothesis that the effect size is positive (i.e., in the range from 0 to infinity excluding 0).

I hope this example makes it clear how hypothesis testing with confidence intervals works. We first specify a range of values that we think are plausible (e.g., all positive values). We then compute a confidence interval of values that are consistent with our data. We then examine whether the confidence interval falls into the hypothesized range of values. When this is the case, we infer that the data support our hypothesis.

Different Outcomes

When we divide the range of possible effect sizes into two mutually exclusive regions, we can distinguish three possible outcomes.

One possible outcome is that the confidence interval falls into a predicted region. In this case, the data provide support for the prediction.

A second possible outcome is that the confidence interval overlaps with the predicted range of values but also falls outside the range of predicted values. For example, the data could have produced an effect size estimate of d = .1 and a confidence interval ranging from -.3 to .5. In this case, the data are inconclusive. It is possible that the population effect size is, as predicted, positive, but it could also be negative.

Another possible outcome is that the confidence interval falls entirely outside the predicted range of values (e.g., d = -.5, confidence interval -.9 to -.1). In this case, the data disconfirm the prediction of a positive effect. It follows that it is not even necessary to make a prediction one way or the other. We can simply see whether the confidence interval fits into one or the other region and infer that the population effect size is in the region that contains the confidence interval.

Do We Need A Priori Hypotheses?

Let’s assume that we predicted a positive effect and our hypothesis covers all effect sizes greater than zero and the confidence interval includes values from d = .1 to .9. We said that this finding allows us to accept our hypothesis that the effect size is positive; that is, it is within an interval ranging from 0 to infinity without zero. However, the confidence interval provides a much smaller range of values. A confidence interval ranging from .1 to .9 not only excludes negative values or a value of zero, it also excludes values of 1 or 2. Thus, we are not using all of the information that our data are providing when we simply infer from the data that the effect size is positive, which includes trivial values of 0.0000001 and implausible values of 999.9. The advantage of reporting results with confidence intervals is that we can specify a narrow range of values that are consistent with the data. This is particularly helpful when the confidence interval is close to zero. For example, a confidence interval that ranges from d = 0.001 to d = .801 can be used to claim that the effect size is positive, but it cannot be used to claim that the effect size is theoretically meaningful, unless d = .001 is theoretically meaningful.

Specifying A Minimum Effect Size

To make progress, psychology has to start taking effect sizes more seriously, and this is best achieved by reporting confidence intervals. Confidence intervals ranging from d = .8 to d = 1.2 and from d = .01 to d = .41 are both consistent with the prediction that there is a positive effect, p < .05. However, the two confidence intervals also specify very different ranges of possible effect sizes. Whereas the first confidence interval rejects the hypothesis that effect sizes are small or moderate, the second confidence interval rejects large effect sizes. Traditional hypothesis testing with p-values hides this distinction and makes it look as if these two studies produced identical results. However, the lowest value of the first interval (d = .8) is higher than the highest value of the second interval (d = .41), which actually implies that the results are significantly different from each other. Thus, these two studies produced conflicting results when we consider effect sizes, while giving the same answer about the direction of an effect.

If predictions were made in terms of a minimum effect size that is theoretically or practically relevant, the distinction between the two results would also be visible. For example, a standard criterion for a minimum effect size could be a small effect size of d = .2. Using this criterion, the first study confirms the prediction (i.e., the confidence interval from .8 to 1.2 falls into the region from .2 to infinity), but the second study does not; the interval from d = .01 to .41 falls partially outside the region from .2 to infinity. In this case, the data are inconclusive.

If the population effect size is zero (e.g., effect of random future events on behavior), confidence intervals will cluster around zero. This makes it hard to fit confidence intervals within a region that is below a minimum effect size (e.g., d = -.2 to d = .2). This is the reason why it is empirically difficult to provide evidence for the absence of an effect. Reducing the minimum effect size makes it even harder and eventually impossible. However, logically there is nothing special about providing evidence for the absence of an effect. We are again dividing the range of plausible effects into two regions: (a) values below the minimum effect size and (b) values above the minimum effect size. We then decide in favor of the interval that fully contains the confidence interval. Of course, we can do this also without an a priori range of effect sizes. For example, if we find a confidence interval ranging from -.15 to +.18, we can infer from this finding that the population effect size is small (less than .2).
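The same interval logic covers all three cases discussed above. The helper function ci_in_region below is only an illustrative name for this sketch.

```r
# A generic version of the interval test used above: does the confidence
# interval fit entirely inside a predicted region? 'ci_in_region' is only an
# illustrative name for this sketch.
ci_in_region <- function(ci, lower, upper) ci[1] >= lower && ci[2] <= upper

ci_in_region(c(.80, 1.20), lower =  .2, upper = Inf)  # TRUE: supports at least a small effect
ci_in_region(c(.01,  .41), lower =  .2, upper = Inf)  # FALSE: inconclusive
ci_in_region(c(-.15, .18), lower = -.2, upper =  .2)  # TRUE: effect is at most trivial/small
```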

But What about Bayesian Statistics?

Bayesian statistics also uses information about effect sizes and sampling error. The main difference is that Bayesians assume that we have prior knowledge that can inform our interpretation of results. For example, if one-hundred studies already tested the same hypothesis, we can use the information of these studies. In this case, it would also be possible to conduct a meta-analysis and to draw inferences on evidence from all 101 studies, rather than just a single study. Bayesians also sometimes incorporate information that is harder to quantify. However, the main logic of hypothesis testing with confidence intervals or Bayesian credibility intervals does not change. Ironically, Bayesians also tend to use alpha = .05 when they report 95% credibility intervals. The only difference is that information that is external to the data (prior distributions) is used, whereas confidence intervals rely exclusively on information from the data.

Conclusion

I hope that this blog post helps researchers to better understand what they are doing. Empirical studies provide estimates of two important statistics: an effect size estimate and a sampling error estimate. This information can be used to create intervals that specify a range of values that are likely to contain the population effect size. Hypothesis testing divides the range of possible values into regions and decides in favor of hypotheses that fully contain the confidence interval. However, hypothesis testing is redundant and less informative, because we can simply decide in favor of the values that are inside the confidence interval, which is smaller than the range of values specified by a theoretical prediction. The use of confidence intervals makes it possible to identify weak evidence (the confidence interval excludes zero, but not very small values that are not theoretically interesting) and also makes it possible to provide evidence for the absence of an effect (the confidence interval only includes trivial values).

A common criticism of hypothesis testing is that it is difficult to understand and not intuitive. The use of confidence intervals solves this problem. Seeing whether a small object fits into a larger object is probably achieved at some early developmental stage in Piaget’s model, and most applied research workers should be able to carry out these comparisons. Standardized effect sizes also help with evaluating the size of objects. Thus, confidence intervals provide all of the information that applied research workers need to carry out empirical studies and to draw inferences from these studies. The main statistical challenge is to obtain estimates of sampling error in complex designs; that is the job of statisticians. The main job of empirical research workers is to collect theoretically or practically important data with small sampling error.