Magical Moderated-Multiple Regression

A naive model of science is that scientists conduct studies and then report the results. At least for psychological science, this model does not describe the actual research practices. It has been documented repeatedly that psychological scientists pick and choose the results that they report. This explains how psychology journals publish mostly significant results (p < .05) although most studies have only a small chance to produce a significant result. One study found that social psychology journals publish nearly 100% significant results, when the actual chance to do so is only 25% (Open Science Collaboration, 2015). The discrepancy is explained by questionable research practices. Just like magic, questionable research practices can produce stunning results that never happened (Bem, 2011). I therefore compared articles that used QRPs to a magic show (Schimmack, 2012).

Over the past decades, several methods have been developed to distinguish real findings from magical ones. Applications of these methods have revealed the use of QRPs, especially in experimental social psychology. So far, the focus has been on simple statistical analyses, in which an independent variable (e.g., an experimental manipulation) is used to predict variation in a dependent variable. A recent article focused on a more complex statistical analysis, called moderated multiple regression (O'Boyle, Banks, Carter, Walter & Yuan, 2019).

There are two reasons to suspect that moderated-multiple regression results are magical. First, moderated regression requires large sample sizes to have sufficient power to detect small effects (Murphy & Russell, 2016). Second, interaction terms in regression models are optional. Researchers can focus on the main results to publish and add interaction terms only when they produce a significant result. Thus, outcome reporting bias (O’Boyle et al., 2019) is an easy and seemingly harmless QRP that may produce a large file-drawer of studies where moderated-regression was tried, but failed to produce significant results. This is not the only possible QRP. It is also possible to try multiple interaction terms, until a specific combination of variables produces a significant result.

O’Boyle et al. hand-coded results from 343 articles in six management and applied psychology journals that were published between 1995 and 2014. Evidence for the use of QRPs was provided by examining the prevalence of just-significant p-values (right figure). There is an unexplained peak just below .05 (.045 to .05).

P-value distributions are less informative about the presence of QRPs than distributions of the corresponding z-scores. O’Boyle et al. shared their data with me, and I conducted a z-curve analysis of moderated regression results in applied psychology. The dataset contained information about 449 results that could be used to compute exact p-values. The z-curve plot shows clear evidence of QRPs.
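Concretely, z-curve converts each two-sided p-value into an absolute z-score via the inverse normal distribution. A minimal sketch of this conversion:

```python
from statistics import NormalDist

def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

# The significance criterion p = .05 corresponds to z = 1.96,
# and p = .005 corresponds to z = 2.81
print(round(p_to_z(0.05), 2))   # 1.96
print(round(p_to_z(0.005), 2))  # 2.81
```

This is why the .05 significance criterion shows up as the value 1.96 on the z-curve plots discussed below.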

Visual inspection shows a cliff around z = 1.96, which corresponds to a p-value of .05 (two-tailed). This indicates that more non-significant results were obtained than are reported. Z-curve also estimates how many non-significant results there should be given the distribution of significant results (grey curve). The plot shows that a much larger number of non-significant results are expected than are actually reported. Z-curve quantifies the use of QRPs by comparing the observed discovery rate (ODR; the percentage of reported results that are significant) to the expected discovery rate (EDR; the area under the grey curve for significant results). The ODR is 52%, the EDR is only 12%, and the confidence intervals do not overlap. The 95%CI for the EDR ranges from 5% to 32%. A value of 5% implies that discoveries are at chance level. Thus, based on these results, it is impossible to reject the nil-hypothesis that all significant results are false positives. This does not mean that all of the results are false positives. Soric's maximum false discovery rate is estimated to be 39%, but the 95%CI is very wide and ranges from 11% to 100%. Thus, we simply have insufficient evidence to draw strong conclusions from the data.
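Soric's maximum false discovery rate follows directly from the discovery rate and the significance criterion. A small sketch of the computation, plugging in the point estimates above:

```python
def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate for a given
    (expected) discovery rate and significance criterion alpha."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_max_fdr(0.12), 2))  # 0.39 with the EDR point estimate of 12%
print(round(soric_max_fdr(0.05), 2))  # 1.0 at the lower CI bound of 5%
```

An EDR at the significance criterion itself (5%) pushes the bound to 100%, which is why the wide confidence interval for the EDR precludes strong conclusions here.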

Z-curve also computes the expected replication rate (ERR). The ERR is the percentage of analyses with significant results that are expected to produce a significant result again if studies were replicated exactly with the same sample sizes. The ERR is only 40%. One caveat is that it is difficult or impossible to replicate studies in psychology exactly. Bartos and Schimmack (2020) found that the EDR is a better predictor of actual replication outcomes, which suggests only 12% of results would replicate again.

In conclusion, these results confirm suspicions that moderated regression results are magical. Readers should be cautious about or entirely ignore these results unless a study has a large sample size and the statistical evidence is strong (p < .001). Magic is fun, but it has no place in scientific journals. For the future, researchers should clearly state that their analyses are exploratory, report outcomes independent of results, or pre-register their data-analysis plan and follow it exactly.


Murphy, K. R., & Russell, C. J. (2016). Mend it or end it: Redirecting the search for interactions in the organizational sciences. Organizational Research Methods, 1094428115625322.

O’Boyle, E., Banks, G.C., Carter, K., Walter, S., & Yuan, Z. (2019). A 20-year review of outcome reporting bias in moderated multiple regression. Journal of Business and Psychology, 34, 19–37.

Estimating the Replicability of Results in the “Journal of Experimental Social Psychology”

Picture Credit: Wolfgang Viechtbauer


Social psychology, or to be more precise, experimental social psychology, has a replication problem. Although articles mostly report successful attempts to reject the null-hypothesis, these results are obtained with questionable research practices that select for significance. This renders reports of statistically significant results meaningless (Sterling, 1959). Since 2011, some social psychologists have been actively trying to improve the credibility of published results. A z-curve analysis of results in JESP shows that these reforms have had a mild positive effect, but that studies are still underpowered and that non-significant results are still suspiciously absent from published articles. Even pre-registration has been unable to ensure that results are reported honestly. The problem is that there are no clear norms that outlaw practices that undermine the credibility of a field. As a result, some bad actors continue to engage in questionable practices that advance their careers at the expense of their colleagues and the reputation of the field. They may not be as culpable as Stapel, who simply made up data, but their use of questionable practices also hurts the reputation of experimental social psychology. Given the strong incentives to cheat, it is wildly optimistic to assume that self-control and nudges are enough to curb bad practices. Strict rules and punishment are unpopular among liberal-leaning social psychologists (Fiske, 2016), but they may be the most effective way to curb these practices. Clear guidelines about research ethics would not affect the practices of most researchers, who are honest and motivated by truth, but they would make it possible to take action against those who abuse the system for personal gain.


There is a replication crisis in social psychology (see Schimmack, 2020, for a review). Based on actual replication studies, it is estimated that only 25% of significant results in social psychology journals can be replicated (Open Science Collaboration, 2015). The response to the replication crisis by social psychologists has been mixed (Schimmack, 2020).

The “Journal of Experimental Social Psychology” provides an opportunity to examine the effectiveness of new initiatives to improve the credibility of social psychology because the current editor, Roger Giner-Sorolla, has introduced several initiatives to improve the quality of the journal.

Giner-Sorolla (2016) correctly points out that selective reporting of statistically significant results is the key problem of the replication crisis. Given modest power, it is unlikely that multiple hypothesis tests within an article are all significant (Schimmack, 2012). Thus, the requirement to report only supporting evidence leads to dishonest reporting of results.

“A group of five true statements and one lie is more dishonest than a group of six true ones; but a group of five significant results and one nonsignificant is more to be expected than a group of six significant results, when sampling at 80% statistical power.” (Giner-Sorolla, 2016, p. 2).

There are three solutions to this problem. First, researchers could reduce the number of hypothesis tests that are conducted. For example, a typical article in JESP reports three studies, which implies a minimum of three hypothesis tests, although often more than one hypothesis is tested within a study. The number of tests could be reduced to a third if researchers conducted one high-powered study rather than three moderately powered studies (Schimmack, 2012). However, the editorial did not encourage publication of single-study articles, and there is no evidence that the number of studies in JESP articles has decreased.

Another possibility is to increase power to ensure that nearly all tests can produce significant results. To examine whether researchers increased power accordingly, it is necessary to examine the actual power of hypothesis tests reported in JESP. In this blog post, I am using z-curve to estimate power.

Finally, researchers may report more non-significant results. If studies are powered at 80%, and most hypotheses are true, one would expect that about 20% (1 out of 5) hypothesis tests produce a non-significant result. A simple count of significant results in JESP can answer this question. Sterling (1959) found that social psychology journals nearly exclusively report confirmation of predictions with p < .05. Motyl et al. (2017) replicated this finding for results from 2003 to 2014. The interesting question is whether new editorial policies have reduced this rate since 2016.

JESP has also adopted open-science badges that reward researchers for sharing materials, sharing data, or pre-registering hypotheses. Of these badges, pre-registration is most interesting because it aims to curb the use of questionable research practices (QRPs, John et al., 2012) that are used to produce significant results with low power. Currently, there are relatively few articles where all studies are preregistered. However, JESP is interesting because editors sometimes request a final study that is preregistered following some studies that were not preregistered. Thus, JESP has published 58 articles with at least one preregistered study. This makes it possible to examine the effectiveness of preregistration to ensure more honest reporting of results.

Automated Extraction of Test Statistics

The first analyses are based on automatically extracted test statistics. The main drawback of automatic extraction is that it does not distinguish between manipulation checks and focal hypothesis tests. Thus, the absolute estimates do not reveal how replicable focal hypothesis tests are. The advantage of automatically extracted test statistics is that they cover all test statistics that are reported in text (t-values, F-values), which makes it possible to examine trends over time. If the power of studies increases, test statistics for focal and non-focal hypotheses will increase.

To examine time-trends in JESP, I downloaded articles from ZZZZ to 2019, extracted test-statistics, converted them into absolute z-scores, and analyzed the results with z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). To illustrate z-curve, I present the z-curve for all 45,792 test-statistics.

Visual inspection of the z-curve plot shows clear evidence that questionable research practices contributed to significant results in JESP. The distribution of significant z-scores peaks at z = 1.96, which corresponds to p = .05 (two-sided). At this point, there is a steep drop in reported results. Based on the distribution of significant results, z-curve also estimates the expected distribution of non-significant results (grey curve). There is a clear discrepancy between the observed and expected frequencies of non-significant results. This discrepancy is quantified by comparing discovery rates; that is, the percentage of significant results. The observed discovery rate is 70%. The expected discovery rate is only 35%, and the 95%CI ranges from 21% to 44%. Thus, the observed discovery rate is much higher than we would expect if there were no selection for significance in the reporting of results.

Z-curve also provides an estimate of the expected replication rate (ERR). This is the percentage of significant results that would be significant again if the studies could be replicated exactly with the same sample size. The ERR is 63% with a 95%CI ranging from 60% to 68%. Although this is lower than the recommended level of 80% power, it does not seem to justify the claim of a replication crisis. However, there are two caveats. First, the estimate includes manipulation checks. Although we cannot take replication of manipulation checks for granted, they are not the main concern. The main concern is that theoretically important, novel results do not replicate. The replicability of these results will be lower than 63%. Another problem is that the ERR is based on the assumption that studies in social psychology can be replicated exactly. This is not possible, nor is it informative. It is also important that results generalize across similar conditions and populations. To estimate the outcome of actual replication studies that are only similar to the original studies, the EDR is a better estimate (Bartos & Schimmack, 2020), and the estimate of 35% is more in line with the result that only 25% of results in social psychology journals can be replicated (Open Science Collaboration, 2015).

Questionable research practices are more likely to produce just-significant results with p-values between .05 and .005 than p-values below .005. Thus, one solution to the problem of low credibility is to focus on p-values below .005 (z = 2.8). Figure 2 shows the results when z-curve is limited to these test statistics.

The influence of QRPs now shows up as a pile of just-significant results that are not consistent with the z-curve model. For the more trustworthy results, the ERR increased to 85%, but more importantly, the EDR increased to 75%. Thus, readers of JESP should treat p-values above .005 as questionable results, while p-values below .005 are more likely to replicate. It is of course unclear how many of these trustworthy results are manipulation checks or interesting findings.

Figures 1 and 2 helped to understand ERR and EDR. The next figure shows time trends in the ERR (solid) and EDR (dotted) using results that are significant at .05 (black) and those significant at .005.

Visual inspection suggests no change or even a decrease in the years leading up to the beginning of the replication crisis in 2011. The ERR and the EDR for p-values below .005 show an increasing trend in the years since 2011. This is confirmed by linear regression analyses for the years 2012 to 2019, t(6)s > 4.62. However, the EDR for all significant results does not show a clear trend, suggesting that QRPs are still being used, t(6) = 0.84.

Figure 3 shows the z-curve plot for the years 2017 to 2019 to get a stable estimate for current results.

The main difference to Figure 1 is that there are more highly significant results, which is reflected in the higher ERR of 86%, 95%CI = 82% to 91%. However, the EDR of 36%, 95%CI = 24% to 57%, is still low and significantly lower than the observed discovery rate of 66%. Thus, there is still evidence that QRPs are being used. However, EDR estimates are highly sensitive to the percentage of just-significant results. Even excluding only results with z-scores between 2 and 2.2 leads to a very different picture.

Most important, the EDR jumps from 36% to 73%, which is even higher than the ODR. Thus, one interpretation of the results is that a few bad actors continue to use QRPs that produce p-values between .05 and .025, while most other results are reported honestly.

In sum, the results based on automated extraction of test statistics show a clear improvement in recent years, especially for p-values below .005. This is consistent with observations that sample sizes have increased in social psychology (reference). The main drawback of these analyses is that estimates based on automated extraction do not reveal the robustness of focal hypothesis tests. This requires hand-coding of test statistics. These results are examined next.

Motyl et al.’s Hand-Coding of JESP (2003, 2004, 2013, 2014)

The results for hand-coded focal tests justify the claim of a replication crisis in experimental social psychology. Even if experiments could be replicated exactly, the expected replication rate is only 39%, 95%CI = 26% to 49%. Given that they cannot be replicated exactly, the EDR suggests that as few as 12% of replications would be successful, and the 95%CI includes 5%, which would mean that all significant results are false positives, 95%CI = 5% to 33%. The comparison to the observed discovery rate of 86% shows the massive use of QRPs to produce mostly significant results with low power. The time-trend analysis suggests that these numbers are representative of results in experimental social psychology until very recently (see also Cohen, 1962).

Focusing only on p-values below .005 may be a solution, but the figure shows that few focal tests reach this criterion. Thus, for the most part, articles in JESP do not provide empirical evidence for social psychological theories of human behavior. Only trustworthy replication studies can provide this information.

Hand-Coding of Articles in 2017

To examine improvement, I personally hand-coded articles published in 2017.

The ERR increased from 39% to 55%, 95%CI = 46% to 66%, and the confidence intervals barely overlap. However, the ERR did not show a positive trend in the automated analysis, and even a value of 55% is still low. The EDR also improved from 12% to 35%, but the confidence intervals are much wider, which makes it hard to conclude from these results that this is a real trend. More importantly, an EDR of 35% is still not good. Finally, the results continue to show the influence of questionable research practices. The comparison of the ODR and EDR shows that many non-significant results that are obtained are not reported. Thus, despite some signs of improvement, these results do not show the radical shift in research practices that is needed to make social psychology more trustworthy.


A lot of reformers pin their hope on pre-registration as a way to curb the use of questionable research practices. An analysis of registered reports suggests that this can be the case. Registered reports are studies that are accepted for publication before data are collected. Researchers then collect the data and report the results. This publishing model makes it unnecessary to use QRPs to produce significant results in order to get a publication. Preregistration in JESP is different. Here, authors voluntarily post a data-analysis plan before they collect data and then follow the preregistered plan in their analysis. To the extent that they follow their plan exactly, the results are also not selected to be significant. However, there are still ways in which selection for significance may occur. For example, researchers may choose not to publish a preregistered study that produced a non-significant result, or editors may not accept these studies for publication. It is therefore necessary to test the effectiveness of pre-registration empirically. For this purpose, I coded 210 studies in 58 articles that included at least one pre-registered study. There were 3 studies in 2016, 15 in 2017, 92 in 2018, and 100 in 2019. Five studies were not coded because they did not test a focal hypothesis or used sequential testing.

On a positive note, the ERR and EDR are higher than the comparison data for all articles in 2017. However, it is not clear how much of this difference is due to a simple improvement over time or preregistration. Not so good is the finding that the observed discovery rate is still high (86%) and this does not even count marginally significant results that are also used to claim a discovery. This high discovery rate is not justified by an increase in power. The EDR suggests that only 62% of results should be significant and the 95%CI does not include 86%, 95%CI = 23% to 77%. Thus, there is still evidence that QRPs are being used even in articles that receive a pre-registration badge.

One possible explanation is that articles can receive a pre-registration badge if at least one of the studies was pre-registered. Often this is the last study that has been requested by the editor to ensure that non-preregistered results are credible. I therefore also z-curved only studies that were pre-registered. There were 134 pre-registered studies.

The results are very similar to the previous results with ERR of 72% vs. 70% and EDR of 66% vs. 64%. Thus, there is no evidence that pre-registered studies are qualitatively better and stronger. Moreover, there is also no evidence that pre-registration leads to more honest reporting of non-significant results. The observed discovery rate is 84% and rises to 90% when marginally significant results are included.


Social psychology, or to be more precise, experimental social psychology, has a replication problem. Although articles mostly report successful attempts to reject the null-hypothesis, these results are obtained with questionable research practices that select for significance. This renders reports of statistically significant results meaningless (Sterling, 1959). Since 2011, some social psychologists have been actively trying to improve the credibility of published results. A z-curve analysis of results in JESP shows that these reforms have had a mild positive effect, but that studies are still underpowered and that non-significant results are still suspiciously absent from published articles. Even pre-registration has been unable to ensure that results are reported honestly. The problem is that there are no clear norms that outlaw practices that undermine the credibility of a field. As a result, some bad actors continue to engage in questionable practices that advance their careers at the expense of their colleagues and the reputation of the field. They may not be as culpable as Stapel, who simply made up data, but their use of questionable practices also hurts the reputation of experimental social psychology. Given the strong incentives to cheat, it is wildly optimistic to assume that self-control and nudges are enough to curb bad practices. Strict rules and punishment are unpopular among liberal-leaning social psychologists (Fiske, 2016). The problem is that QRPs hurt social psychology, even if it is just a few bad actors who engage in these practices. Implementing clear standards with consequences would not affect the practices of most researchers, who are honest and motivated by truth, but it would make it possible to take action against those who abuse the system for personal gain.

A Recipe to Improve Psychological Science

Raw! First Draft! Manuscript in Preparation for Meta-Psychology
Open Comments are welcome.

The f/utility of psychological research has been debated since psychology became an established discipline after the Second World War (Cohen, 1962, 1994; Lykken, 1968; Sterling, 1959; lots of Meehl). There have also been many proposals to improve psychological science. However, most articles published today follow the same old recipe that was established decades ago; a procedure that Gigerenzer (2018) called the significance-testing ritual.

Step 1 is to assign participants to experimental conditions.

Step 2 is to expose groups to different stimuli or interventions.

Step 3 is to examine whether the differences between means of the groups are statistically significant.

Step 4a: If Step 3 produces a p-value below .05, write up the results and submit to a journal.

Step 4b: If Step 3 produces a p-value above .05, forget about the study, and go back to Step 1.
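The consequences of this recipe can be illustrated with a small simulation (hypothetical numbers; a normal approximation to the two-sample t-test): many underpowered studies are run, but only the significant ones reach the journal.

```python
import random
from statistics import mean

random.seed(1)
n_per_cell, true_d = 20, 0.3          # a small-to-medium true effect
se = (4 / (2 * n_per_cell)) ** 0.5    # se(d) ~ 2 / sqrt(N) for two equal cells
published = []

for _ in range(10_000):
    d_obs = random.gauss(true_d, se)  # Steps 1-3: run a study, observe d
    if abs(d_obs / se) > 1.96:        # Step 4a: significant -> publish
        published.append(d_obs)
    # Step 4b: non-significant -> file drawer

print(f"published: {len(published)} of 10000 studies")
print(f"true d = {true_d}, mean published d = {mean(published):.2f}")
```

With these hypothetical numbers, only about one in six studies gets published, every published result is significant by construction, and the average published effect size is more than twice the true effect size — a literature of impressive-looking magic tricks.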

This recipe produces a literature in which the empirical content of journal articles consists only of significant results that suggest the manipulation had an effect. As Sterling (1959) pointed out, this selective publishing of significant results essentially renders significance testing meaningless. The problem with this recipe became apparent when Bem (2011) published 9 successful demonstrations of a phenomenon that does not exist: mental time travel, where feelings about random future events seemed to cause behavior. If only successes are reported, significant results only show how motivated researchers are to collect data that support their beliefs.

I argue that the key problem in psychology is the specification of the null-hypothesis. The most common approach is to specify the null-hypothesis as the absence of an effect. Cohen called this the nil-hypothesis. The effect size is zero. Even after an original study rejects the nil-hypothesis, follow-up studies (direct or conceptual replication studies) again specify the nil-hypothesis as the hypothesis that has to be rejected, although the original study already rejected it. I propose to abandon nil-hypothesis testing and to replace it with null-hypothesis testing where the null-hypothesis specifies effect sizes. Contrary to the common practice to start with rejecting the nil-hypothesis, I argue that original studies should start with testing large effect sizes. Subsequent studies should use information from the earlier studies to modify the null-hypothesis. This recipe can be considered a stepwise process of parameter estimation. The advantage of a step-wise approach is that parameter estimation requires large samples that are often impossible to obtain during the early stages of a research program. Moreover, parameter estimation may be wasteful when the ultimate conclusion is that an effect size is too small to be meaningful. I illustrate the approach with a simple between-subject design that compares two groups. For mean differences, the most common effect size is the standardized mean difference (the mean difference when the dependent variable is standardized) and Cohen suggested values of d = .2, .5, and .8 as values for small, medium, and large effect sizes, respectively.

The first test of a novel hypothesis (e.g., taking Daniel Lakens’ course on statistics improves understanding of statistics), starts with the assumption that the effect size is large (H0: |d| = .8).

The next step is to specify what value should be considered a meaningful deviation from this effect size. A reasonable value would be d = .5, which is only a medium effect size. Another reasonable approach is to halve the starting effect size, d = .8/2 = .4. I use d = .4.

The third step is to conduct a power analysis for a mean difference of d = .4. This power analysis is not identical to a typical power analysis with H0: d = 0 and an effect size of d = .4 because the t-distribution is no longer symmetrical when it is centered over values other than zero (this may be a statistical reason for the choice of the nil-hypothesis). However, conceptually the power analysis does not differ. We are postulating a null-hypothesis of d = .8 and are willing to reject it when the population effect size is a meaningfully smaller effect size of d = .4 or less. With some trial and error, we find a sample size of N = 68 (n = 34 per cell). With this sample size, d-values below .4 occur only 5% of the time. Thus, we can reject the null-hypothesis of d = .8, if the study produces an effect size below .4.
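Under the approximation se(d) ≈ 2/√N for two equal cells, the required sample size can be sketched as follows (my reconstruction of the calculation, not the exact noncentral-t computation):

```python
from math import ceil
from statistics import NormalDist

z05 = NormalDist().inv_cdf(0.95)      # one-sided 5% criterion, ~1.645

def n_total(d_null: float, d_cut: float) -> int:
    """Smallest even total N (two equal cells) such that observed d-values
    below d_cut occur only 5% of the time when the true effect is d_null.
    Uses the approximation se(d) ~ 2 / sqrt(N)."""
    se_needed = (d_null - d_cut) / z05
    n = ceil(4 / se_needed ** 2)
    return n + (n % 2)

print(n_total(0.8, 0.4))  # 68, i.e., n = 34 per cell
```

The sampling error needed is (.8 − .4)/1.645 ≈ .24, and solving 2/√N = .24 gives the N = 68 found by trial and error above.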

The next step depends on the outcome of the first study. If the first study produced a result with an effect size estimate greater than .4, the null-hypothesis lives another day. Thus, the replication study is conducted with the same sample size as the original study (N = 68). The rationale is that we have good reason to believe that the effect size is large, and it would be wasteful to conduct replication studies with much larger samples (e.g., 2.5 times larger than the original study, N = 170). It is also not necessary to use much larger samples to demonstrate that the original finding was obtained with questionable research practices. An honest replication study has high power to reject the null-hypothesis of d = .8 if the true effect size is only d = .2 or even closer to zero. This makes it easier to reveal the use of questionable research practices with actual replication studies. These benefits are obtained because the original study makes a strong claim that the effect size is large, rather than merely claiming that the effect size is positive or negative without specifying an effect size.
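The claim that an honest replication has high power to reject H0: d = .8 when the true effect is small can be checked with the same normal approximation (a sketch, not the exact t-based power analysis):

```python
from statistics import NormalDist

se = 2 / 68 ** 0.5   # se(d) ~ .24 for N = 68
for true_d in (0.8, 0.4, 0.2, 0.0):
    # Probability of observing d < .4, i.e., of rejecting H0: d = .8
    p_reject = NormalDist(true_d, se).cdf(0.4)
    print(f"true d = {true_d}: P(reject H0: d = .8) = {p_reject:.0%}")
```

With N = 68, the rejection probability is 5% when d = .8 is in fact true, but rises to roughly 80% for a true effect of d = .2 and about 95% for d = 0. This is the sense in which the replication is well powered to expose an inflated original claim.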

If the original study produces a significant result with an effect size less than d = .4, the null-hypothesis is rejected. The new null-hypothesis is the point estimate of the study. Given a significant result, we know that this value is somewhere between 0 and .4. Let's assume it is d = .25. This estimate comes with a two-sided 95% confidence interval ranging from d = -.23 to d = .74. The wide confidence interval shows that we can reject d = .8, but not a medium effect size of d = .5 or even a small effect in the opposite direction, d = -.2. Thus, we need to increase the sample size in the next study to provide a meaningful test of the new null-hypothesis that the effect size is positive, but small (d = .25). We want to ensure that the effect size is indeed positive, d > 0, but weaker than a medium effect size, d = .5. Thus, we need to power the study to be able to reject the null-hypothesis (H0: d = .25) in both directions. This is achieved with a sample size of N = 256 (n = 128 per cell) and a sampling error of .125. The 95% confidence interval centered over d = .25 ranges from 0 to .5. Thus, any observed d-value greater than .25 rejects the hypothesis that there is no effect, and any value below .25 rejects the hypothesis of a medium effect size, d = .5.
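The numbers in this step can be verified with the same approximation (a sketch assuming se(d) ≈ 2/√N for two equal cells):

```python
from statistics import NormalDist

def ci95(d: float, n_total: int):
    """Approximate two-sided 95% confidence interval for d
    with two equal cells, using se(d) ~ 2 / sqrt(N)."""
    se = 2 / n_total ** 0.5
    z = NormalDist().inv_cdf(0.975)
    return (d - z * se, d + z * se)

lo, hi = ci95(0.25, 256)                # se = 2 / 16 = .125
print(f"95% CI: {lo:.2f} to {hi:.2f}")  # roughly 0 to .5
```

With N = 256 the sampling error is exactly .125, and the interval around d = .25 spans approximately 0 to .5, as stated above.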

The next step depends again on the outcome of the study. If the observed effect size is d = .35 with a 95% confidence interval ranging from d = .11 to d = .60, the new question is whether the effect size is at least small, d = .2, or whether it is even moderate. We could meta-analyze the results of both studies, but as the second study is larger, it will have a stronger influence on the weighted average. In this case, the weighted average of d = .33 is very close to the estimate of the larger second study. Thus, I am using the estimate of Study 2 for the planning of the next study. With the null-hypothesis of d = .35, a sample size of N = 484 (n = 242 per cell) is required to have 95% power to find a significant result if the population effect size is d = .2 or less, 90% confidence interval, d = .20 to d = .50. Thus, if an effect size less than d = .2 is observed, it is possible to reject the hypothesis that there is at least a statistically small effect size of d = .2. In this case, researchers have to decide whether they want to invest in a much larger study to see whether there is a positive effect at all or whether they would rather abandon this line of research because the effect size is too small to be theoretically or practically meaningful. The estimation of the effect size makes it at least clear that any further studies with small samples are meaningless because they have insufficient power to demonstrate that a small effect exists. This can be a meaningful result in itself because researchers currently waste resources on studies that test small effects with small samples.
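The meta-analytic average and the next sample size can likewise be sketched under the se(d) ≈ 2/√N approximation (the exact noncentral-t calculation gives the N = 484 used above):

```python
from math import ceil
from statistics import NormalDist

# Inverse-variance weights are proportional to N when se(d) ~ 2 / sqrt(N)
studies = [(0.25, 68), (0.35, 256)]   # (observed d, total N) for Studies 1 and 2
pooled = sum(d * n for d, n in studies) / sum(n for _, n in studies)
print(round(pooled, 2))               # 0.33

# Total N so the one-sided 5% rejection bound of H0: d = .35 sits at d = .20
z05 = NormalDist().inv_cdf(0.95)
se_needed = (0.35 - 0.20) / z05
n = 2 * ceil(2 / se_needed ** 2)
print(n)                              # 482 under this approximation (~484 exact)
```

The dominance of the larger study in the weighted average (d = .33) is exactly what motivates using the Study 2 estimate for planning.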

If the effect size in Study 2 is less than d = .25, researchers know (with a 5% error probability) that the effect size is less than d = .5. However, it is not clear whether there is a positive effect or not. Say, the observed effect size was d = .10 with a 95%CI ranging from d = -.08 to d = .28. This leaves open the possibility of no effect, but also a statistically small effect of d = .2. Researchers may find it worthwhile to pursue this research in the hope that the effect size is at least greater than d = .10, assuming a population effect size of d = .2. Adequate power is achieved with a sample size of N = 1,100 (n = 550 per cell). In this case, the 90% confidence interval around d = .2 ranges from d = .10 to d = .30. Thus, any value less than d = .10 rejects the hypothesis that the effect size is statistically small, d = .2, while any value greater than d = .30 would confirm that the effect size is at least a small effect size of d = .2.
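The sample sizes in these examples can be reproduced by asking how large N must be for the 90% confidence interval around the assumed effect size to have a given half-width. A sketch (the function name is mine, using the same se ≈ 2/√N approximation):

```python
import math

def n_for_halfwidth(halfwidth, z=1.645):
    """Total N (two equal groups) so that the z*se half-width of the
    interval around the assumed d equals `halfwidth`, se ~ 2/sqrt(N)."""
    return math.ceil((2 * z / halfwidth) ** 2)

# A 90% CI of d = .10 to d = .30 around d = .20 needs a half-width of
# .10, which requires roughly N = 1,083 -- the N = 1,100 in the text.
print(n_for_halfwidth(0.10))
```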

This new way of thinking about null-hypothesis testing requires some mental effort (it is still difficult for me). To illustrate it further, I used open data from the Many Labs project (Klein et al., 2014). I start with a paradigm with a strong and well-replicated effect: anchoring.


The first sample in the ML dataset is from Penn State U – Abington (‘abington’) with N = 84. Thus, the sample has good power to test the first hypothesis that d > .4, assuming an effect size of d = .8. The statistical test of the first anchoring effect (distance from New York to LA with a 1,500-mile vs. 6,000-mile anchor) produced a standardized effect size of d = .98 with a 95%CI ranging from d = .52 to d = 1.44. The confidence interval includes a value of d = .8. Therefore, the null-hypothesis cannot be rejected. Contrary to nil-hypothesis testing, however, this finding is highly informative: it suggests that anchoring is a strong effect.

As Study 1 was consistent with the null-hypothesis of a strong effect, Study 2 replicates the effect with the same sample size. To make this a conceptual replication study, I used the second anchoring question (anchoring2, population of Chicago with 200,000 vs. 6 million as anchor). The sample from Charles University, Prague, Czech Republic provided an equal sample size of N = 84. The study replicated the finding of Study 1, that the 95%CI includes a value of d = .8, 95%CI = .72 to 1.41.

To further examine the robustness of the effect, Study 3 used a different anchoring problem (height of Mt. Everest with 2,000 vs. 45,500 feet as anchors). To keep sample sizes similar, I used the UVA sample (N = 81). This time, the null-hypothesis was rejected with an even larger effect size, d = 1.47, 95%CI = 1.19 to 1.76.

Although additional replication studies can further examine the generalizability of the main effect, the three studies alone are sufficient to provide robust evidence for anchoring effects, even with a modest total sample size of N = 249 participants. Researchers could therefore examine replicability and generalizability in the context of new research questions that explore boundary conditions, mediators, or moderators. More replication studies or replication studies with larger samples would be unnecessary.

Flag Priming

To maintain good comparability, I start again with the Penn State U – Abington sample (N = 84). The effect size estimate for the flag prime is close to zero, d = .05. More importantly, the 95% confidence interval does not include d = .8, 95%CI = -.28 to .39. Thus, the null-hypothesis that flag priming is a strong effect is rejected. The results are so disappointing that even a moderate effect size is not included in the confidence interval. Thus, the only question is whether there could be a small effect size. If this is theoretically interesting, the study would have to be sufficiently powered to distinguish a small effect size from zero. Thus, the study could be powered to examine whether the effect size is at least d = .1, assuming an effect size of d = .2. The previous power analysis suggested that a sample of N = 1,100 participants is needed to test this hypothesis. I combined the MTurk sample (N = 1,000) and the OSU sample (N = 107) to reach this sample size.

The results showed a positive effect size of d = .12. Using traditional NHST, this finding rejects the nil-hypothesis, but allows for extremely small effect sizes close to zero, 95%CI = .0003 to .25. More importantly, the results do not reject the actual null-hypothesis of a small effect size, d = .2, but also do not ensure that the effect size is greater than d = .10. Thus, the results remain inconclusive.

To make use of the large sample of Study 2, it is not necessary to increase the sample size again. Rather, a third study can be conducted with the same sample size, and the results of the two studies can be combined to test whether the effect size is at least d = .10. I used the Project Implicit sample, although it is a bit bigger (N = 1,329).

Study 3 alone produced an effect size of d = .03, 95%CI = -.09 to d = .14. An analysis that combines data from all three samples, produces an estimate of d = .02, 95%CI = -.06 to .10. These results clearly reject the null-hypothesis that d = .2, and they even suggest that d = .10 is unlikely. At this point, it seems reasonable to stop further study of this phenomenon, at least using the same paradigm. Although this program required over 2,000 participants, the results are conclusive and publishable with the conclusion that flag priming has negligible effects on ratings of political values. The ability to provide meaningful results arises from the specification of the null-hypothesis with an effect size rather than the nil-hypothesis that can only test direction of effects without making claims about effect sizes.

The comparison of the two examples shows why it is important to think about effect sizes, even when these effect sizes do not generalize to the real world. Effect sizes are needed to calibrate sample sizes so that resources are not wasted on overpowered studies (studying anchoring with N = 1,000) or on underpowered studies (studying flag priming with N = 100). Using a simple recipe that starts with the assumption that effect sizes are large, it is possible to use few resources first and then increase sample sizes as needed, if effect sizes turn out to be small.

Low vs. High Category Scales

To illustrate the recipe with a small-to-medium effect size, I picked Schwarz et al.’s (1985) manipulation of high versus low frequencies as labels for a response category. I started again with the Penn State U – Abington sample (N = 84). The effect size was d = .33, but the 95% confidence interval ranged from d = -.17 to d = .84. Although the interval does not exclude d = .8, it seems unlikely that the effect size is large, but it is not unreasonable to assume that the effect size could be moderate rather than small. Thus, the next study used d = .5 as the null-hypothesis and examined whether the effect size is at least d = .2. A power analysis shows that N = 120 (n = 60 per cell) participants are needed. I picked the sample from Brasilia (N = 120) for this purpose. The results showed a strong effect size, d = .88. The 95% confidence interval, d = .51 to d = 1.23, even excluded a medium effect size, but given the results of Study 1, it is reasonable to conclude that the effect size is not small, but could be medium or even large. A sample size of N = 120 seems reasonable for replication studies that examine the generalizability of results across populations (or conceptual replication studies, but they were not available in this dataset).

To further examine generalizability, I picked the sample from Istanbul (N = 113). Surprisingly, the 95% confidence interval, d = -.31 to d = .14, did not include d = .5. The confidence interval also does not overlap with the confidence interval in Study 2. Thus, there is some uncertainty about the effect and under what conditions it can be produced. However, a meta-analysis across all three studies shows a 95%CI that includes a medium effect size, 95%CI = .21 to .65.

Thus, it seems reasonable to examine replicability in other samples with the same sample size. The next sample with a similar sample size is Laurier (N = 112). The results show an effect size of d = .43, and the 95%CI, d = .17 to d = .69, includes d = .5. The meta-analytic confidence interval, 95%CI = .27 to .61, excludes small effect sizes of d = .2 and large effect sizes of d = .8.

Thus, a research program with four samples and a total sample size of N = 429 participants helped to establish a medium effect size for the effect of low versus high scale labels on ratings. The effect size estimate based on the full ML dataset is d = .48.

At this point, it may seem as if I cherry-picked samples to make the recipe look good. I didn’t, but I don’t have a preregistered analysis plan to show that I did not. I suggest others try it out with other open data where we have a credible estimate of the real effect based on a large sample and then try to approach this effect size using the recipe I proposed here.

The main original contribution of this blog post is to move away from nil-hypothesis significance testing. I am not aware of any other suggestions that are similar to the proposed recipe, but the ideas are firmly based on Neyman-Pearson’s approach to significance testing and Cohen’s recommendation to think about effect sizes in the planning of studies. The use of confidence intervals makes the proposal similar to Cumming’s suggestion to focus more on estimation than hypothesis testing. However, I am not aware of a recipe for the systematic planning of sample sizes that vary as a function of effect sizes. Too often confidence intervals are presented as if the main goal is to provide precise effect size estimates, although the meaning of these precise effect sizes in psychological research is unclear. What a medium effect size for category labels means in practice is not clear, but knowing that it is medium allows researchers to plan studies with adequate power. Finally, the proposal is akin to sequential testing, where researchers look at their data to avoid collecting too many data. However, sequential testing still suffers from the problem that it tests the nil-hypothesis and that a non-significant result is inconclusive. In contrast, this recipe provides valuable information even if the first study produces a non-significant result. If the first study fails to produce a significant result, it suggests that the effect size could be large. This is valuable and publishable information. Significant results are also meaningful because they suggest that the effect size is not large. Thus, results are informative whether they are significant or not, removing the asymmetry of nil-hypothesis testing where non-significant results are uninformative. The only studies that are not informative are studies where confidence intervals are too wide to be meaningful or replication studies that are underpowered. The recipe helps researchers to avoid these mistakes.

The proposal also addresses the main reason why researchers do not use power analysis to plan sample sizes: the mistaken belief that it is necessary to guess the population effect size. Here I showed that this is not necessary. Rather, researchers can start with the most optimistic assumptions and test the hypothesis that their effect is large. More often than not, the result will be disappointing, but not useless. The results of the first study provide valuable information for the planning of future studies.

I would be foolish to believe that my proposal can actually change research practices in psychology. Yet, I cannot help thinking that it is a novel proposal that may appeal to some researchers who are struggling in the planning of sample sizes for their studies. The present proposal allows them to shoot for the moon and fail, as long as they document this failure and then replicate with a larger sample. It may not solve all problems, but it is better than p-rep or Bayes-Factors and several other proposals that failed to fix psychological science.

Fiske and the Permanent Crisis in Social Psychology

“Remedies include tracking one’s own questionable research practices” (Susan T. Fiske)

In 1959, Sterling observed that the results sections of psychological articles provide no information. The reason is that studies nearly always reject the null-hypothesis. As a result, it is not necessary to read the results section. It is sufficient to read the predictions that are being made in the introduction because the outcome of the empirical test is a foregone conclusion.

In 1962, Cohen found that studies published in the Journal of Abnormal and Social Psychology (now separated into the Journal of Abnormal Psychology and the Journal of Personality and Social Psychology) have modest power to produce significant results. Three decades later, Sedlmeier and Gigerenzer (1989) replicated this finding.

Thus, for nearly 60 years social psychologists have been publishing many more significant results than they actually obtain in their laboratories, making the empirical results in their articles essentially meaningless. Every claim in social psychology, including crazy findings that nobody believes (Bem, 2011), is significant.

Over the past decades, some social psychologists have rebelled against the status quo in social psychology. To show that significant results do not provide empirical evidence, they have conducted replication studies and reported the results even when they did not show a significant result. Suddenly, the success rate of nearly 100% dropped to 25%. Faced with this dismal result, which reveals the extent of questionable practices in the field, some social psychologists have tried to downplay the significance of replication failures. The leader of this disinformation movement is Susan Fiske, who was invited to comment on a special issue on the replication crisis in the Journal of Experimental Social Psychology (Fiske, 2016). Her article “How to publish rigorous experiments in the 21st century” is an interesting example of deceptive publishing that avoids dealing with the real issue.

First, it is always important to examine the reference list for bias in the literature review. For example, Fiske does not mention Bem’s embarrassing article that started the crisis, John et al.’s article on the use of questionable research practices, or Francis and Schimmack’s work on bias detection, although these articles are mentioned in several of the articles she comments on. For example, Hales (2016) writes:

“In fact, in some cases failed replications have been foreshadowed by analyses showing that the evidence reported in support of a finding can be implausibly positive. For example, multiple analyses have questioned whether findings in support of precognition (Bem, 2011) are too good to be obtained without using questionable research practices (Francis, 2012; Schimmack, 2012). In line with these analyses, researchers who have replicated Bem’s procedures have not replicated his results (Galak, LeBoeuf, Nelson, & Simmons, 2012; Ritchie, Wiseman, & French, 2012; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012).”

Fiske concludes that the replication crisis is an opportunity to improve research practices. She writes: “Constructive advice for 21st century publication standards includes appropriate theory, internal validity, and external validity.” Again, it is interesting what she is not saying. If theory, internal validity, and external validity are advice for social psychologists in the 21st century, it implies that 20th century social psychologists did not have good theories and that their studies lacked internal and external validity. After all, we do not give advice when things are going well.

Fiske (2016) discusses the replication crisis under the heading of internal validity.

“Hales (2016) points out that, in the effort to report effects that are both significant and interesting, researchers may go beyond what the data allow. Over-claiming takes forms beyond the familiar Type I (false positive) and Type II (false negative) errors. A proposed Type III error describes reaching an accurate conclusion but by flawed methods (e.g., confirmation bias, hypothesizing after results are known, discarding data). A proposed Type IV error describes reaching an accurate conclusion based on faulty evidence (insufficient power, invalid measures). Remedies include tracking one’s own questionable research practices (e.g., ad hoc stopping, non-disclosure of failed replications, exploration reported as confirmation) or calculating the plausibility of one’s data (e.g., checking for experimenter bias during analysis). Pre-registration and transparency are encouraged.”

This is as close as Fiske comes to talking about the fundamental problem in social psychology, but Type-III errors are not just a hypothetical possibility; they are the norm in social psychology. Type-III errors explain how social psychologists can be successful most of the time when their studies have a low probability of success.

Fiske’s recommendations for improvement are obscure. What does it mean for researchers to “track their own questionable practices”? Is there an acceptable quota for using these practices? What should researchers do when they find that they are using them? How would researchers calculate the plausibility of their data, and why is pre-registration useful? Fiske does not elaborate on this because she is not really interested in improving practices. At least, she makes it very clear what she does not want to happen: she opposes a clear code of research ethics that specifies which practices violate research integrity.

“Norms about acceptable research methods change by social influence, not by regulation. As social psychology tells us, people internalize change when they trust and respect the source. A punishing, feared source elicits at best compliance and at worst reactance, not to mention the source’s own reputational damage.”

This naive claim ignores that many human behaviors are regulated by social norms that are enforced with laws. Even scientists have social norms about fraud and Stapel was fired for fabricating data. Clearly, academic freedom has limits. If fabricating data is unethical, it is not clear why hiding disconfirming evidence should be a personal choice.

Fiske also expresses her dislike of blog posts and so-called vigilantes.

“For the most part, the proposals in this special issue are persuasive communications, not threats. And all are peer-reviewed, not mere blog posts. And they are mostly reasoned advisory proposals, not targeted bullying. As such, they appropriately treat other researchers as colleagues, not miscreants. This respectful discourse moves the field forward better than vigilantism.”

Maybe as a social psychologist, she should be aware that disobedience and protest have always been a part of social change, especially when powerful leaders opposed social change. Arguments that results sections in social psychology are meaningless have been made by eminent researchers in peer-reviewed publications (e.g., Cohen, 1994; Schimmack, 2012) and on blog posts (e.g., R-Index blog). The validity of the argument does not depend on the medium or peer-review, but on the internal and external validity of the evidence, and the evidence for sixty years has shown that social psychologists inflate their success rate.

There is also no evidence that social psychologists follow Fiske’s advice to track their own questionable research practices or avoid the use of these practices. This is not surprising. There is no real incentive to change, and behavior does not change when the reinforcement schedule does not change. As long as p < .05 is rewarded and p > .05 is punished, psychologists will continue to publish meaningless p-values (Sterling, 1959). History has shown again and again that powerful elites do not change for the greater good. Real change will come from public pressure (e.g., undergraduate students, funders) demanding honest reporting of results.

Expressing Uncertainty about Analysis Plans with Conservative Confidence Intervals

Unless researchers specify an analysis plan and follow it exactly, it is possible to analyze the same data in several ways. If all analyses lead to the same conclusion, this is not a problem. However, what should we do when the analyses lead to different conclusions? The problem generally arises when one analysis shows a p-value less than .05 and another plausible analysis shows a p-value greater than .05. The inconsistency introduces uncertainty about the proper conclusion. Traditionally, researchers selectively picked the more favorable analysis, which is known as a questionable research practice because it undermines the purpose of significance testing, namely to control the long-run error rate. However, what do we do if researchers honestly present both results, p = .02 and p = .08? As many statisticians have pointed out, the difference between these two results is itself not significant and negligible.

A simple solution to the problem is to switch from hypothesis testing with p-values to hypothesis testing with confidence intervals (Schimmack, 2020). With p = .02 and p = .08, the corresponding confidence intervals could be d = .05 to .40 and d = -.05 to .30, respectively. It is simple to present the uncertainty about the proper inference by picking the lower value for the lower limit and the higher value for the upper limit to create a conservative confidence interval, d = -.05 to .40. This confidence interval captures uncertainty about the proper analysis as well as uncertainty due to sampling error. Inferences can then be drawn based on this confidence interval. In this case, there is insufficient information to reject the null-hypothesis. Yet, the data still provide evidence that the effect size is unlikely to be moderate. If this is theoretically meaningful or contradicts previous studies (e.g., studies that used QRPs to inflate effect sizes), the results are still important and publishable.
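Computing a conservative confidence interval from a set of alternative analyses is trivial; a sketch (the function name is mine):

```python
def conservative_ci(cis):
    """Conservative confidence interval across alternative analyses:
    the lowest lower limit and the highest upper limit."""
    return min(lo for lo, _ in cis), max(hi for _, hi in cis)

# Two plausible analyses of the same data, as in the example above.
print(conservative_ci([(0.05, 0.40), (-0.05, 0.30)]))  # (-0.05, 0.4)
```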

One problem arises when there are many ways to analyze the data. A recent suggestion has been to do a multiverse analysis, that is, to run all possible analyses and see what you get. The problem is that this may create extremely divergent results, and it is not clear how results from a multiverse analysis should be integrated. Conservative confidence intervals provide an easy way to do so, but they may be extremely wide if a multiverse analysis is not limited to a small range of reasonable analyses. It is therefore crucial that researchers think carefully about reasonable alternative ways to analyze the data without trying all possible ways, which would make the results uninformative.

Estimating the Replicability of Results in “Journal of Experimental Psychology: Learning, Memory, and Cognition”

The “Journal of Experimental Psychology” is the oldest journal of the American Psychological Association (Benjamin, 2019). When psychology grew, it was split into distinct journals for different areas of experimental psychology. In 2019, the “Journal of Experimental Psychology: Learning, Memory, and Cognition” (JEP-LMC) published its 45th volume. In the same year, Aaron S. Benjamin took over as editor of JEP-LMC.

The editorial promises changes in publication practices in response to the so-called replication crisis in psychology. Concerns about the replicability of psychological findings were raised by the Open Science Collaboration (OSC, 2015). They replicated 100 studies from three journals, including JEP-LMC. For JEP-LMC they found that only 50% of published results produced a significant result in the replication attempts.

Benjamin (2019) hopes that changes in publication policies will raise this replication rate. He is also hopeful that this can be achieved with minor changes, suggesting that research practices in cognitive psychology are not as questionable as those in social psychology, where the replication rate was only 25%.

Aside from the OSC investigation, relatively little is known about the research practices of cognitive psychologists and the replicability of their findings. The reason is that systematic examinations of replicability are difficult to do. It would take a major effort to repeat the replication project to see whether the replicability of cognitive psychology has already changed or will change in response to Benjamin’s initiatives. Without fast and credible indicators, editors are practically flying blind and can only hope for the best.

My colleagues and I developed a statistical method, called z-curve, to provide fast and representative information about research practices and replicability (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Z-curve uses the actual test-statistics (t-values, F-values) of significant results to estimate the expected replication rate (ERR) and expected discovery rate (EDR) of published results. The replication rate focuses on published significant results. It estimates how many of these results would be significant again if the original studies were replicated exactly with the same sample sizes. The discovery rate is the rate of significant results for all statistical tests that researchers conducted to produce their publications. Without selective reporting, this rate would simply be the percentage of significant results that are reported in articles. With publication bias, however, the observed discovery rate (ODR) is inflated. Z-curve provides an estimate of the actual discovery rate on the basis of the distribution of the significant results alone. A comparison of the ODR and EDR provides information about the presence of publication bias or selection for significance.

To provide this valuable information for JEP-LMC, I downloaded all articles from 2000 to 2019 and automatically extracted all test-statistics (t-values, F-values). These test-statistics are first converted into two-sided p-values, which are then converted into absolute z-scores. Higher z-scores provide stronger evidence against the null-hypothesis. Figure 1 shows the results for the 53,975 test-statistics published from 2000 to 2019.
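The p-to-z conversion that z-curve works with can be sketched with Python's standard library (the helper name is mine):

```python
from statistics import NormalDist

def p_to_z(p_two_sided):
    """Absolute z-score corresponding to a two-sided p-value:
    z = Phi^-1(1 - p/2), where Phi is the standard normal CDF."""
    return NormalDist().inv_cdf(1 - p_two_sided / 2)

print(round(p_to_z(0.05), 2))   # 1.96 -- the cliff in the z-curve plot
print(round(p_to_z(0.005), 2))  # 2.81 -- the stricter criterion used later
```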

Visual inspection shows a cliff at z = 1.96, which corresponds to a p-value of .05 (two-sided). This finding suggests that non-significant results are missing. A formal test of publication bias is provided by the comparison of the observed discovery rate of 68%, 95%CI = 67% to 68% and the expected discovery rate (EDR) of 44%, 95%CI = 33% to 60%. The confidence intervals do not overlap, indicating that this is not just a random finding. Thus, there is clear evidence that questionable practices inflate the percentage of significant results published in JEP-LMC.

The expected replication rate is high at 79%, 95%CI = 75% to 82%. This estimate is considerably higher than the actual success rate of replication studies of 50% (OSC, 2015). There are several reasons for this. Automatic extraction does not distinguish focal and non-focal hypothesis tests. Focal hypothesis tests are riskier and tend to produce weaker evidence. Estimates for the replicability of results with p-values between .05 and .01 (~ z = 2 to 2.5) show a replicability of only 55% (see the figures below the x-axis). Another reason for the discrepancy is that replication studies are rarely exact, even in cognitive psychology. When unknown moderating factors produce heterogeneity, the ERR overestimates actual replicability, and the worst-case scenario is that the success rate matches the EDR. The 95%CI of the EDR does include 50%. Thus, editors are well advised to focus on the EDR as an indicator for improvement.

Z-curve also provides information about the risk that JEP-LMC publishes mostly false positive results (Benjamin, 2019). Although it is impossible to quantify the rate of true null-hypotheses, it is possible to use the EDR to estimate the maximum rate of false discoveries (Bartos & Schimmack, 2020; Soric, 1989). The Soric FDR is only 7%, and even the upper limit of the 95%CI is only 11%. Thus, the results provide no evidence for the claim that most published results are false positives. Power estimates for z-scores between 1 and 2 rather suggest that many non-significant results are false negatives due to low statistical power. This has important implications for the interpretation of interaction effects. Interaction effects rarely show that effects are present in one condition and absent in another. Most often they merely show that effects are stronger in one condition than in another, even if the weaker effect is not statistically significant.
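Soric's bound follows directly from the discovery rate. A sketch, assuming the standard formula FDRmax = (1/EDR − 1) · α/(1 − α):

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's upper bound on the false discovery rate, given an
    estimated discovery rate (EDR) and significance criterion alpha."""
    return (1 / edr - 1) * alpha / (1 - alpha)

# EDR = 44% gives a maximum FDR of ~7%; the EDR's lower CI limit of
# 33% gives ~11%, matching the values reported in the text.
print(f"{soric_fdr(0.44):.2f}, {soric_fdr(0.33):.2f}")
```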

The presence of questionable practices that inflate the discovery rate affects mostly just-significant results. One way to deal with this problem is to require stronger evidence to claim statistical significance. Just like corrections for multiple comparisons, it is necessary to control for unreported tests that inflate the type-I error risk. Following other recommendations, I suggest using p = .005 as a more stringent criterion to reject the null-hypothesis in order to contain the false positive risk at 5%. Figure 2 shows the results when only results that meet this criterion (z > 2.8) are fitted to the model.

The ERR increases to 91% and the EDR increases to 86%. Even for z-scores from 3 to 3.5, the ERR is 82%. Thus, most of these results are expected to replicate. For the future, this means authors should demonstrate that they did not use QRPs by preregistering and following a design and data analysis plan or they should use the more conservative criterion value of p = .005 to claim significance with alpha = .05.

The published results with p-values between .05 and .005 should be considered as questionable evidence. If multiple studies are available, meta-analyses that take publication bias into account can be used to examine whether these results are robust. If these results are very important, they should be subjected to a rigorous replication attempt in studies with larger samples that increase power.

The next figure examines whether research practices and replicability have changed over time. For this purpose, I computed the ERR (solid) and the EDR (dotted) for results significant at .05 (black) and those significant at .005 (grey) for each year.

Figure 3 shows high ERR and EDR with p < .005 as significance criterion (grey). The slight negative trends are not statistically significant, ERR: b = -.003, se = .0017; EDR: b = -.005, se = .0027. With p < .05 as criterion, the ERR is also high, but significantly decreasing, b = -.0025, se = .001. The EDR estimates are much more variable because they depend on the number of test-statistics that are just significant. The trend over time is negative, but not statistically significant, b = -.005, se = .006. Overall, these results do not show any changes in response to the replication crisis. Hopefully, the initiatives of the editor will reduce the use of questionable practices and increase power. Raising the EDR for all results with p < .05 to 80% can be achieved with less effort. Ironically, a simple way to do so is to publish fewer studies in a single article (Schimmack, 2012). Rather than reporting 8 studies with 25 participants each, results are more credible if they replicate across 4 studies with 50 participants in each study. Thus, without additional resources, it is possible to make results in JEP-LMC more credible and reduce the need to use questionable practices to move p-values below .05.

In conclusion, this blog post provides much needed objective, quantitative (meta-scientific) evidence about research practices and replicability of results in JEP-LMC. The results provide no evidence of a replication crisis in JEP-LMC, and fears that most published results are false positives are not based on empirical facts. Most results published in JEP-LMC are likely to be true and many of the replication failures in the OSC replication attempts were probably false negatives due to low power in both the original and the replication studies. Nevertheless, low power and false negatives are a problem because inconsistent results produce confusion. For powerful within-subject designs cognitive researchers can easily increase power by increasing sample sizes from the typical N = 20 to N = 40. They can do so by reducing the number of internal replication studies within an article or by investing more resources in meaningful conceptual replication studies. Better planning and justification of sample sizes is one of the initiatives in Benjamin’s editorial. Z-curve makes it possible to examine whether this initiative finally increases the power of studies in psychology, which has not been the case since Cohen (1962) warned about low power in psychological experiments. Maybe 60 years later, in 2022, we will see an increase in power in JEP-LMC.

Estimating the Replicability of Results in the "European Journal of Social Psychology"

Over the past decade, questions have been raised about research practices in psychology and the replicability of published results. The focus has been mostly on research practices in social psychology. A major replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). This finding has produced many conflicting responses, ranging from blaming the replication project for the low success rate to claims that most results in social psychology are false positives.

Social psychology journals have responded to concerns about the replicability of social psychology with promises to improve the reporting of results. The European Journal of Social Psychology (EJSP) is no exception. In 2015, the incoming editors, Radmila Prislin and Vivian L. Vignoles, wrote:

"we believe that scientific progress requires careful adherence to the highest standards of integrity and methodological rigour. In this regard, we welcome recent initiatives to improve the trustworthiness of research in social and personality psychology"

In 2018, the new editorial team, Roland Imhoff, Joanne Smith, and Martijn van Zomeren, addressed concerns about questionable research practices more directly:

"opening up also implies being considerate of empirical imperfections that would otherwise remain hidden from view. This means that we require authors to provide a transparent description of the research process in their articles (e.g., report all measures, manipulations, and exclusions from samples, if any; e.g., Simmons, Nelson, & Simonsohn, 2011). We thus encourage authors to accurately report about the inclusion of failed studies and imperfect patterns (e.g., p-values not meeting the .05 threshold), but this also has to mean that disclosing such imperfections, all else being equal, should not affect the likelihood of acceptance."

This blog post uses the test statistics published in EJSP to examine whether the research practices of authors who publish in EJSP have changed in response to the low replicability of results in social psychology. To do so, I downloaded articles from 2000 to 2019 and automatically extracted test statistics (t-values, F-values). I then converted these test statistics into two-sided p-values and into absolute z-scores. Higher z-scores provide stronger evidence against the null hypothesis. These z-scores are then analyzed with z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2019). Figure 1 shows the z-curve plot for all 27,223 test statistics.
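The conversion between p-values and z-scores can be sketched with Python's standard library. The helper names below are hypothetical; converting the raw t- and F-values into p-values would additionally require the t and F distributions (e.g., from scipy.stats), which are omitted here to keep the sketch self-contained.

```python
from statistics import NormalDist

_nd = NormalDist()  # standard normal distribution

def p_to_z(p):
    """Convert a two-sided p-value into an absolute z-score."""
    return _nd.inv_cdf(1 - p / 2)

def z_to_p(z):
    """Inverse: two-sided p-value for an absolute z-score."""
    return 2 * (1 - _nd.cdf(abs(z)))

# The two significance criteria used throughout this post:
print(round(p_to_z(0.05), 2))   # 1.96 (the "cliff" visible in Figure 1)
print(round(p_to_z(0.005), 2))  # 2.81 (the stricter p < .005 criterion)
```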

Visual inspection shows a cliff at z = 1.96, which corresponds to a two-sided p-value of .05. The grey curve shows the distribution expected on the basis of the published significant results. The z-curve predicts many more non-significant results than are actually reported, especially below z = 1.65, the implicit criterion for marginal significance (p = .05, one-sided).

A formal test of selective reporting compares the observed discovery rate with the expected discovery rate. The observed discovery rate (ODR) is the percentage of reported results that are significant. The expected discovery rate (EDR) is the percentage of results that are expected to be significant given the z-curve model. The ODR is 72%, 95%CI = 72% to 73%. This is much higher than the EDR of 26%, 95%CI = 19% to 40%. Thus, there is clear evidence of selective reporting of significant results.
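The ODR itself is simple to compute; a minimal sketch with hypothetical p-values (not the actual EJSP data):

```python
def observed_discovery_rate(p_values, alpha=0.05):
    """ODR: the share of reported results that are significant at alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Eight hypothetical reported p-values, six of them below .05.
reported = [0.001, 0.02, 0.04, 0.049, 0.03, 0.21, 0.008, 0.60]
print(observed_discovery_rate(reported))  # 0.75
```

The EDR, in contrast, cannot be read off the reported results; it has to be estimated from the model fitted to the distribution of significant z-scores.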

Z-curve also provides an estimate of the expected replication rate (ERR): if the studies were replicated exactly, how many of the significant results in the original studies would be significant again? The estimate is 70%, 95%CI = 65% to 73%. This is not a bad replication rate, but it assumes exact replications, which are difficult if not impossible to do in social psychology. Bartos and Schimmack (2020) found that the EDR is a better predictor of the outcomes of conceptual replication studies. The estimate of 26% is consistent with the low replication rate in the replication project (Open Science Collaboration, 2015).
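To illustrate the idea behind the ERR: for a single study, the probability that an exact replication is significant again is the study's power. The naive sketch below treats the observed z-score as the true mean of the sampling distribution; z-curve's actual estimate is more sophisticated because it corrects for selection for significance and averages over the whole distribution of studies.

```python
from statistics import NormalDist

_nd = NormalDist()

def replication_probability(z_obs, alpha=0.05):
    """Naive power estimate: probability that an exact replication of a
    study with observed z-score z_obs is significant again at alpha.
    Treats z_obs as the true mean and ignores the (tiny) chance of
    significance in the opposite direction."""
    z_crit = _nd.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    return 1 - _nd.cdf(z_crit - z_obs)

print(round(replication_probability(1.96), 2))  # 0.5: a just-significant result
print(round(replication_probability(2.80), 2))  # 0.8: a p < .005 result
```

This makes concrete why just-significant results are so fragile: even taken at face value, a result at z = 1.96 has only a coin-flip chance of replicating.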

Fortunately, it is not necessary to dismiss all published results in EJSP. Questionable practices are more likely to produce just-significant results. It is therefore possible to focus on more credible results with a p-value less than .005, which corresponds to a z-score of 2.8. Figure 2 shows the results.

Based on the distribution of z-scores greater than 2.8, the model predicts far fewer just-significant results than are actually reported. This again suggests that questionable practices were used to produce significant results. Excluding the just-significant results boosts the EDR to a satisfactory level of 77%. Thus, even if replication studies are not exact, the model predicts that most replication studies would produce a significant result with alpha = .05 (that is, the significance criterion is not adjusted to the more stringent level of .005).

The following analysis examines whether EJSP editors were successful in increasing the credibility of results published in their journal. For this purpose, I computed the ERR (solid) and the EDR (dotted) using all significant results (black) and excluding questionable, just-significant results (grey) for each year and plotted the estimates over time.

The results show no statistically significant trend for any of the four indicators over time. The most important indicator of the use of questionable practices is the EDR for all significant results (black dotted line). The low rates in the last three years show that there have been no major improvements in the publishing culture of EJSP. It is surely easier to write lofty editorials than to actually improve scientific practices. Readers who care about social psychology are advised to ignore p-values greater than .005 because these results may have been produced with questionable practices and are unlikely to replicate. The current editorial team may take these results as a baseline for initiatives to improve the credibility of EJSP in the coming years.