Z-Curve: An even better p-curve

Abstract

P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I showed that p-curve has serious limitations and provides misleading evidence about the strength of evidence against the null-hypothesis.

I showed that all of the information that is provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also show how alpha levels can be adjusted to separate significant results with weak and strong evidence to select credible findings even when selection for significance is present.

As z-curve does everything that p-curve does and more, the rational choice is to choose z-curve for the meta-analysis of p-values.

Introduction

In 2011, it dawned on psychologists that something was wrong with their science. Daryl Bem had just published an article with nine studies that showed an incredible finding. Participants’ responses were influenced by random events that had not yet occurred. Since then, the flaws in research practices have become clear and it has been shown that they are not limited to mental time travel (Schimmack, 2020). For decades, psychologists assumed that statistically significant results reveal true effects and reported only statistically significant results (Sterling, 1959). However, selective reporting of significant results undermines the purpose of significance testing to distinguish true and false hypotheses. If only significant results are reported, most published results could be false positive results like those reported by Bem (2011).

Selective reporting of significant results also undermines the credibility of meta-analyses (Rosenthal, 1979), which explains why meta-analyses also suggest humans possess psychic abilities (Bem & Honorton, 1994). This sad state of affairs stimulated renewed interest in methods that detect selection for significance (Schimmack, 2012) and methods that correct for publication bias in meta-analyses. Here I focus on a comparison of p-curve (Simonsohn et al., 2014a, 2014b) and z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).

P-Curve

P-curve is a name for a family of statistical tests that have been combined into the p-curve app that researchers can use to conduct p-curve analyses, henceforth called p-curve. The latest version of p-curve is version 4.06, which was last updated on November 30, 2017 (p-curve.com).

The first part of a p-curve analysis is a p-curve plot. A p-curve plot is a histogram of all significant p-values where p-values are placed into five bins, namely p-values ranging from 0 to .01, .01 to .02, .02 to .03, .03 to .04, and .04 to .05. If the set of studies contains mostly studies with true effects that have been tested with moderate to high power, the plot shows decreasing frequencies as p-values increase (more p-values between 0 and .01 than between .04 and .05). This pattern has been called a right-skewed distribution by the p-curve authors. If the distribution is flat or reversed (more p-values between .04 and .05 than between 0 and .01), most p-values may be false positive results.
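The binning behind a p-curve plot can be sketched in a few lines. This is only an illustration of the five-bin histogram described above, not the code of the p-curve app; the function name `p_curve_bins` and the example p-values are made up for this sketch.

```python
def p_curve_bins(p_values, alpha=0.05, n_bins=5):
    """Return the percentage of significant p-values in each equal-width bin.

    With the defaults, the bins are 0-.01, .01-.02, .02-.03, .03-.04, .04-.05.
    Non-significant p-values are dropped, as in a p-curve plot.
    """
    significant = [p for p in p_values if 0 < p < alpha]
    if not significant:
        raise ValueError("no significant p-values to plot")
    width = alpha / n_bins
    counts = [0] * n_bins
    for p in significant:
        # min() guards against floating-point results that land on the top edge
        index = min(int(p / width), n_bins - 1)
        counts[index] += 1
    return [100 * count / len(significant) for count in counts]


# Hypothetical p-values, not data from any analysis discussed in this post.
example = [0.001, 0.003, 0.004, 0.008, 0.012, 0.021, 0.034, 0.046, 0.20, 0.62]
print(p_curve_bins(example))
```

A right-skewed p-curve shows up as a larger percentage in the first bin than in the last bin; a flat or left-skewed curve shows the opposite pattern.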

The main limitation of p-curve plots is that it is difficult to evaluate ambiguous cases. To aid in the interpretation of p-curve plots, p-curve also provides statistical tests of evidential value. One test is a significance test against the null-hypothesis that all significant p-values are false positive results. If this null-hypothesis can be rejected with the traditional alpha criterion of .05, it is possible to conclude that at least some of the significant results are not false positives.

The main problem with significance tests is that they do not provide information about effect sizes. A right-skewed p-curve with a significant p-value may be due to weak evidence with many false positive results or strong evidence with few false positives.

To address this concern, the p-curve app also provides an estimate of statistical power. This estimate assumes that the studies in the meta-analysis are homogeneous because power is a conditional probability under the assumption that an effect is present. Thus, power does not apply to a meta-analysis of studies that contain true positive and false positive results because power is not defined for false positive results.

To illustrate the interpretation of p-curve analysis, I conducted a meta-analysis of all studies published by Leif D. Nelson, one of the co-authors of p-curve analysis. I found 119 studies with codable data and coded the most focal hypothesis for each of these studies. I then submitted the data to the online p-curve app. Figure 1 shows the output.

Visual inspection of the p-curve plot shows a right-skewed distribution with 57% of the p-values between 0 and .01 and only 6% of p-values between .04 and .05. The statistical test against the null-hypothesis that all of the significant p-values are false positives is highly significant. Thus, at least some of the p-values are likely to be true positives. Finally, the power estimate is very high, 97%, with a tight confidence interval ranging from 96% to 98%. Somewhat redundant with this information, the p-curve app also provides a significance test for the hypothesis that power is less than 33%. This test is not significant, which is not surprising given the estimated power of 97%.

The next part of a p-curve output provides more details about the significance tests, but does not add more information.

The next part provides users with an interpretation of the results.

The interpretation informs readers that this set of p-values provides evidential value. Somewhat surprisingly, this automated interpretation does not mention the power estimate to quantify the strength of evidence. The focus on p-values is problematic because p-values are influenced by the number of tests. The p-value could be lower with 100 studies with 40% power than with 10 studies with 99% power. As significance tests are redundant with confidence intervals, it is sufficient to focus on the confidence interval of the power estimate. With a 90% confidence interval ranging from 96% to 98%, we would be justified to conclude that this set of p-values provides strong support for the hypotheses tested in Nelson’s articles.

Z-Curve

Like p-curve, z-curve analyses also start with a plot of the p-values. The main difference is that p-values are converted into z-scores using the formula for the inverse normal distribution; z = qnorm(1-p/2). The second difference is that significant and non-significant p-values are plotted. The third difference is that z-curve plots have a much finer resolution than p-curve plots. Whereas p-curve bins all z-scores from 2.58 to infinity into one bin (p < .01), z-curve uses the information about the distribution of z-scores all the way up to z = 6 (p = .000000002; 1/500,000,000).
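The conversion from two-sided p-values to absolute z-scores can be reproduced outside of R. The following is a minimal sketch using Python's standard library; the function names `p_to_z` and `z_to_p` are chosen for this illustration and are not part of any z-curve software.

```python
from math import erfc, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def p_to_z(p):
    """Convert a two-sided p-value into an absolute z-score."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    # -inv_cdf(p / 2) equals qnorm(1 - p / 2) by symmetry of the normal
    # distribution and is numerically more stable for very small p-values.
    return -_STD_NORMAL.inv_cdf(p / 2)


def z_to_p(z):
    """Convert a z-score back into a two-sided p-value."""
    # erfc(|z| / sqrt(2)) is the two-tailed area beyond |z|.
    return erfc(abs(z) / sqrt(2))


for p in (0.05, 0.01, 0.005):
    print(p, round(p_to_z(p), 2))
print(z_to_p(6))
```

The printed z-scores correspond to the familiar thresholds (about 1.96 for p = .05 and 2.58 for p = .01), and z = 6 maps to a p-value of roughly 2 in a billion.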

Visual inspection of the z-curve plot reveals something that the p-curve plot does not show, namely that there is clear evidence for the presence of selection bias. Whereas p-curve suggests that “highly” significant results (0 to .01) are much more common than “just” significant results (.04 to .05), z-curve shows that just significant results (.05 to .005) are much more frequent than highly significant (p < .005) results. The difference is due to the implicit definition of high and low in the two plots. The high frequency of highly significant (p < .01) results in the p-curve plots is due to the wide range of values that are lumped together into this bin. Once it is clear that many p-values are clustered just below .05 (z > 1.96, the vertical red line), it is immediately notable that there are too few just non-significant (z < 1.96) values. This steep drop is not consistent with random sampling error. To summarize, z-curve plots provide more information than p-curve plots. Whereas z-curve plots make the presence of selection for significance visible, p-curve plots provide no means to evaluate selection bias. Even worse, right-skewed distributions are often falsely interpreted as evidence that there is no selection for significance. This example shows that notable right-skewed distributions can be found even when selection bias is present.

The second part of a z-curve analysis uses a finite mixture model to estimate two statistical parameters of the data. These parameters are called the estimated discovery rate and the estimated replication rate (Bartos & Schimmack, 2021). Another term for these parameters is mean power before selection and mean power after selection for significance (Brunner & Schimmack, 2020). The meaning of these terms is best understood with a simple example where a researcher tests 100 false hypotheses and 100 true hypotheses with 100% power. The outcome of this study produces significant and non-significant p-values. The expected value for the frequency of significant p-values is 100 for the 100 true hypotheses tested with 100% power and 5 for the 100 false hypotheses, because 5% of them are expected to produce significant results when alpha is set to 5%. Thus, we are expecting 105 significant results and 95 non-significant results. Although we know the percentages of true and false hypotheses, this information is not available with real data. Thus, any estimate of average power changes the meaning of power. It now includes false hypotheses with a power equal to alpha. We call this unconditional power to distinguish it from the typical meaning of power conditioned on a true hypothesis.

It is now possible to compute mean unconditional power for two populations of studies. The first population consists of all studies that were conducted. In this example, this population consists of all 200 studies (100 true, 100 false hypotheses). The average power for these 200 studies is easy to compute as (.05*100 + 1*100)/200 = 52.5%. The second population consists only of the significant studies. After selecting only significant studies, mean unconditional power is (.05*5 + 1*100)/105 = 95.5%. The reason why power is so much higher after selection for significance is that the significance filter keeps most false hypotheses out of the population of studies with a significant result (95 of the 100 studies to be exact). Thus, power is mostly determined by the true hypotheses that were tested with perfect power. Of course, real data are not as clean as this simple example, but the same logic applies to all sets of studies with a diverse range of power values for individual studies (Brunner & Schimmack, 2020).
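The arithmetic of this example generalizes to any mix of studies. The sketch below computes mean unconditional power before and after selection for significance from known power values; it is not the z-curve estimation method, which has to infer these quantities from observed z-scores. The function names and the `(number of studies, power)` input format are assumptions made for this illustration.

```python
def mean_power_before_selection(groups):
    """Mean unconditional power across all studies that were conducted.

    groups: list of (number_of_studies, power) pairs, where false
    hypotheses enter with power equal to alpha.
    """
    total = sum(n for n, _ in groups)
    return sum(n * power for n, power in groups) / total


def mean_power_after_selection(groups):
    """Mean unconditional power among studies expected to be significant.

    Each group contributes n * power expected significant results, so the
    group's power is weighted by n * power instead of n.
    """
    expected_significant = sum(n * power for n, power in groups)
    return sum(n * power * power for n, power in groups) / expected_significant


# 100 false hypotheses (power = alpha = .05) and 100 true hypotheses (power = 1)
example = [(100, 0.05), (100, 1.00)]
print(round(100 * mean_power_before_selection(example), 1))
print(round(100 * mean_power_after_selection(example), 1))
```

With these inputs the two functions return 52.5% and about 95.5%, matching the values in the text.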

Mean power before selection of significance determines the percentage of significant results for a number of tests. With 50% mean power before selection, 100 tests are expected to produce 50 significant results (Brunner & Schimmack, 2020). It is common to refer to statistically significant results as discoveries (Soric, 1989). Importantly, discoveries could be true or false, just like a significant result could be a true effect or a type-I error. In our example, there were 105 discoveries. Normally we would not know that 100 of these discoveries are true discoveries. All we know is the percentage of significant results. I use the term estimated discovery rate (EDR) to refer to mean unconditional power before selection, which is a mouthful. In short, EDR is an estimate of the percentage of significant results in a series of statistical tests.

Mean power after selection for significance is relevant because power of significant results determines the probability that a significant result can be successfully replicated in a direct replication study with the same sample size (Brunner & Schimmack, 2020). Using the EDR would be misleading. In the present example, the EDR of 52.5% would dramatically underestimate replicability of significant results, which is actually 95.5%. Using the EDR would punish researchers who conduct high-powered tests of true and false hypotheses. To assess the replicability of these researchers’ results, it is necessary to compute power only for the studies that produced significant results. The problem with traditional meta-analyses is that selection for significance leads to inflated effect size estimates even if the researcher reported all non-significant results. To estimate the replicability of the significant results, the data are conditioned on significance, which inflates replicability estimates. Z-curve models this selection process and corrects for regression to the mean in the estimation of mean unconditional power after selection for significance. I call this statistic the estimated replication rate (ERR). The reason is that mean unconditional power after selection for significance determines the percentage of significant results that is expected in direct replication studies of studies with a significant result. In short, the ERR is the probability that a direct replication study with the same sample size produces a significant result.

I start discussion of the z-curve results for Nelson’s data with the estimated replication rate because this estimate is conceptually similar to the power estimate in the p-curve analysis. Both estimates focus on the population of studies with significant results and correct for selection for significance. Thus, one would expect similar results. However, the p-curve estimate of 97%, 95%CI = 96% to 98%, is very different from the z-curve estimate of 52%, 95%CI = 40% to 68%. The confidence intervals do not overlap, showing that the difference between these estimates is statistically significant itself.

The explanation for this discrepancy is that p-curve estimates are inflated estimates of the ERR when power is heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). This is true even if effect sizes are homogeneous and studies vary only in sample sizes (Brunner, 2018). The p-curve authors have been aware of this problem since 2018 (Datacolada), and have not updated the p-curve app in response to this criticism of their app. The present example shows that using the p-curve app can lead to extremely misleading conclusions. Whereas p-curve suggests that nearly every study by Nelson would produce a significant result again in a direct replication attempt, the correct z-curve estimate suggests that only every other result would replicate successfully. This difference is not only statistically significant, but also practically significant in the evaluation of Nelson’s work.

In sum, p-curve is not only redundant with z-curve. It also produces false information about the strength of evidence in a set of p-values.

Unlike p-curve, z-curve.2.0 also estimates the discovery rate based on the distribution of the significant p-values. The results are shown in Figure 2 as the grey curve in the range of non-significant results. As can be seen, while z-curve predicts a large number of non-significant results, the actual studies reported very few non-significant results. This suggests selection for significance. To quantify the amount of selection bias, it is possible to compare the observed discovery rate (i.e., the actual percentage of significant results), 87%, to the estimated discovery rate, EDR = 27%. The 95% confidence interval around the EDR can be used for a significance test. As 87% is well outside the 95%CI of the EDR, 5% to 51%, the results provide strong evidence that the reported results were selected from a larger set of tests with non-significant results that were not reported. In this specific case, this inference is consistent with the authors’ admission that questionable research practices were used (Simmons, Nelson, & Simonsohn, 2011).

“Our best guess was that so many published findings were false because researchers were conducting many analyses on the same data set and just reporting those that were statistically significant, a behavior that we later labeled “p-hacking” (Simonsohn, Nelson, & Simmons, 2014). We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants to achieve significance.” (Simmons, Nelson, & Simonsohn, 2018, p. 255).

The p-curve authors also popularized the idea that selection for significance may have produced many false positive results (Simmons et al., 2011). However, p-curve does not provide an estimate of the false positive risk. In contrast, z-curve provides information about the false discovery risk because the false discovery risk is a direct function of the discovery rate. Using the EDR with Soric’s formula, shows that the false discovery risk for Nelson studies is 14%, but due to the small number of tests, the 95%CI around this estimate ranges from 5% to 100%. Thus, even though the ERR suggests that half of the studies can be replicated, it is possible that the other half of the studies contain a fairly large number of false positive results. Without the identification of moderator variables, it would be impossible to say whether a result is a true or a false discovery.
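Soric's (1989) formula gives the maximum false discovery rate that is compatible with a given discovery rate and alpha level. A minimal sketch follows; the function name `soric_max_fdr` is chosen for this illustration, and capping the result at 100% is a convenience added here to absorb floating-point error.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's upper bound on the false discovery rate.

    FDR_max = (1 / discovery_rate - 1) * alpha / (1 - alpha)
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    fdr = (1 / discovery_rate - 1) * alpha / (1 - alpha)
    # The bound cannot exceed 100%; min() also absorbs rounding error.
    return min(1.0, fdr)


# EDR point estimate and the limits of its 95% confidence interval
for edr in (0.27, 0.51, 0.05):
    print(edr, round(100 * soric_max_fdr(edr)))
```

Entering the EDR of 27% yields a false discovery risk of 14%. Entering the confidence limits of the EDR in reverse order gives the interval of the risk: an EDR of 51% yields 5%, and an EDR of 5% yields 100%.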

The ability to estimate the false positive risk makes it possible to identify a subset of studies with a low false positive risk by lowering alpha. Lowering alpha reduces the false positive risk for two reasons. First, it follows logically that a lower alpha produces a lower false positive risk. For example, in the prior example with 100 true and 100 false hypotheses, an alpha of 5% produced 105 significant results that included 5 false positive results and the false positive rate was 5/105 = 4.76%. Lowering alpha to 1% produces only 101 significant results and the false positive rate is 1/101 = 0.99%. Second, questionable research practices are much more likely to produce false positive results with alpha = .05 than with alpha = .01.
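The first reason can be checked with the numbers from the earlier example. The sketch below assumes, as in that example, that true hypotheses are tested with 100% power, so lowering alpha does not remove any true positives; with realistic power, a lower alpha would also reduce the number of true positives. The function name is hypothetical.

```python
def false_positive_share(n_true, n_false, alpha, power=1.0):
    """Expected share of false positives among all significant results."""
    true_positives = n_true * power    # true hypotheses that reach significance
    false_positives = n_false * alpha  # false hypotheses significant by chance
    return false_positives / (true_positives + false_positives)


for alpha in (0.05, 0.01):
    share = false_positive_share(n_true=100, n_false=100, alpha=alpha)
    print(alpha, round(100 * share, 2))
```

With alpha = .05 the share is 5 out of 105 significant results, and with alpha = .01 it drops to 1 out of 101, or just under 1%.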

In a z-curve analysis, alpha can be set to different values to examine the false positive rate. A reasonable criterion is to aim for a false discovery rate of 5%, which many psychologists falsely assume is the goal of setting alpha to 5%. For Nelson’s 119 studies, alpha can be lowered to .01 to achieve a false discovery risk of 5%.

With alpha = .01, there are still 60 out of 119 (50%) significant results. It is therefore not necessary to dismiss all of the published results because some results were obtained with questionable research practices.

For Nelson’s studies, a plausible moderator is timing. As Nelson and colleagues reported, he used QRPs before he himself drew attention to the problems with these practices. In response, he may have changed his research practices. To test this hypothesis, it is possible to fit a z-curve analysis to articles published before and after 2012 (due to publication lag, articles in 2012 are likely to still contain QRPs).

Consistent with the hypothesis, the EDR for 2012 and before is only 11%, 95%CI = 5% to 31%, and the false discovery risk increases to 42%, 95%CI = 12% to 100%. Even with alpha = .01, the FDR is still 11%, and with alpha = .005 it is still 10%. With alpha = .001, it is reduced to 2% and 18 results remain significant. Thus, most of the published results lack credible evidence against the null-hypothesis.

Results look very different after 2012. The EDR is 83% and not different from the ODR, suggesting no evidence that selection for significance occurred. The high EDR implies a low false discovery risk even with the conventional alpha criterion of 5%. Thus, all 40 results with p < .05 provide credible evidence against the null-hypothesis.

To see how misleading p-curves can be, I also conducted a p-curve analysis for the studies published in the years up to 2012. The p-curve analysis shows merely that the studies have evidential value and provides a dramatically inflated estimate of power (84% vs. 35%). It does not show evidence that p-values are selected for significance and it does not provide information to distinguish p-hacked studies from studies with evidential value.

Conclusion

P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I showed that p-curve has serious limitations and provides misleading evidence about the strength of evidence against the null-hypothesis.

I showed that all of the information that is provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also show how alpha levels can be adjusted to separate significant results with weak and strong evidence to select credible findings even when selection for significance is present.

As z-curve does everything that p-curve does and more, the rational choice is to choose z-curve for the meta-analysis of p-values.

Replicability Audit of Ap Dijksterhuis

Abstract

This blog post reports a replicability audit of Ap Dijksterhuis’s 48 most highly cited articles that provide the basis for his H-Index of 48 (WebofScience, 4/23/2021). The z-curve analysis shows lack of evidential value and a high false positive risk. Rather than dismissing all findings, it is possible to salvage 10 findings by setting alpha to .001 to maintain a false positive risk below 5%. The main article that contains evidential value was published in 2016. Based on these results, I argue that 47 of the 48 articles do not contain credible empirical information that supports the claims in these articles. These articles should not be cited as if they contain empirical evidence.

INTRODUCTION

“Trust is good, but control is better”  

Since 2011, it has become clear that social psychologists misused the scientific method. It was falsely assumed that a statistically significant result ensures that a finding is not a statistical fluke. This assumption is false for two reasons. First, even if the scientific method is used correctly, statistical significance can occur without a real effect in 5% of all studies. This is a low risk if most studies test true hypotheses with high statistical power, which produces a high discovery rate. However, if many false hypotheses are tested and true hypotheses are tested with low power, the discovery rate is low and the false discovery risk is high. Second, the true discovery rate is not known because social psychologists only published significant results. This selective reporting of significant results renders statistical significance insignificant. In theory, all published results could be false positive results.

The question is what we, the consumers of social psychological research, should do with thousands of studies that provide only questionable evidence. One solution is to “burn everything to the ground” and start fresh. Another solution is to correct the mistake in the application of the scientific method. I compare this correction to the repair of the Hubble telescope (https://www.nasa.gov/content/hubbles-mirror-flaw). Only after the Hubble telescope was launched into space was it discovered that a mistake was made in the creation of the mirror. Replacing the mirror in space was impractical. As a result, a correction was made to take the discrepancy in the data into account.

The same can be done with significance testing. To correct for the misuse of the scientific method, the criterion for statistical significance can be lowered to ensure an acceptably low risk of false positive results. One solution is to apply this correction to articles on a specific topic or to articles in a particular journal. Here, I focus on authors for two reasons. First, authors are likely to use a specific approach to research that depends on their training and the field of study. Elsewhere I demonstrated that researchers differ considerably in their research practices (Schimmack, 2021). Second, and more controversially, I also think that authors are accountable for their research practices. If they realize that they made mistakes, they could help the research community by admitting to their mistakes and retracting articles or at least expressing their loss of confidence in some of their work (Rohrer et al., 2020).

Ap Dijksterhuis

Ap Dijksterhuis is a well-known social psychologist. His main focus has been on unconscious processes. Starting in the 1990s, social psychologists became fascinated by unconscious and implicit processes. This triggered what some call an implicit revolution (Greenwald & Banaji, 1995). Dijksterhuis has been prolific and his work is highly cited, which earned him an H-Index of 48 in WebOfScience.

However, after 2011 it became apparent that many findings in this literature are difficult to replicate (Kahneman, 2012). A large replication project also failed to replicate one of Dijksterhuis’s results (O’Donnell et al., 2018). It is therefore interesting and important to examine the credibility of Dijksterhuis’s studies.

Data

I used WebofScience to identify the most cited articles by Dijksterhuis  (datafile).  I then coded empirical articles until the number of coded articles matched the number of citations. The 48 articles reported 105 studies with a codable focal hypothesis test.

The total number of participants was 7,470 with a median sample size of N = 57 participants. For each focal test, I first computed the exact two-sided p-value and then computed a z-score for the p-value divided by two. Consistent with practices in social psychology, all reported studies supported predictions, even when the results were not strictly significant. The success rate for p < .05 (two-tailed) was 100/105 = 95%, which has been typical for social psychology for decades (Sterling, 1959).

The z-scores were submitted to a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). The first part of a z-curve analysis is the z-curve plot (Figure 1).

The vertical red line at z = 1.96 represents the significance criterion with alpha = .05 (two-tailed). The figure shows that most p-values are just significant with z-scores just above 1.96. The distribution of z-scores is abnormal in the sense that random sampling error alone cannot produce the steep drop on the left side of the significance criterion. This provides visual evidence of selection for significance.

The second part of a z-curve analysis is to fit a finite mixture model to the distribution of the significant z-scores (z > 1.96). The model tries to match the distribution as closely as possible. The best fitting curve is shown with the grey/black checkered line. It is notable that the actual data decrease a bit more steeply than the grey curve. This shows that the model has difficulty fitting the data. This suggests that significance was obtained with massive p-hacking, which produces an abundance of just significant results. This is confirmed with a p-curve analysis that shows more p-values between .04 and .05 than p-values between 0 and .01; 24% vs. 19%, respectively (Simonsohn et al., 2014).

The main implication of a left-skewed p-curve is that most significant results do not provide evidence against the null-hypothesis. This is confirmed by the z-curve analysis. A z-curve analysis projects the model based on significant results into the range of non-significant results. This makes it possible to estimate how many tests were conducted to produce the observed significant results (assuming a simple selection model). The results for these data suggest that the reported significant results are only 5% of all statistical tests, which is what would be expected if only false hypotheses were tested. As a result, the false positive risk is 100%. Z-curve also computes bootstrapped confidence intervals around these estimates. The upper bound for the estimated discovery rate is 12%. Thus, most of the studies had a very low chance of producing a significant result (low statistical power), even if they did not test a false hypothesis. With a low discovery rate of 12%, the risk that a significant result is a false positive result is still 39%. This is unacceptably high.

The estimated replication rate of 7% is slightly higher than the estimated discovery rate of 5%. This suggests some heterogeneity across the studies which leads to higher power for studies that produced significant results. However, even 7% replicability is very low. Thus, most studies are expected to produce a non-significant result in a replication attempt.

Based on these results, it would be reasonable to burn everything to the ground and to dismiss the claims made in these 48 articles as empirically unfounded. However, it is also possible to reduce the false positive risk by increasing the significance threshold. With alpha = .01 the FDR is 19%, with alpha = .005 it is 10%, and with alpha = .001 it is 2%. So, to keep the false positive risk below 5%, it is possible to set alpha to .001. This renders most findings non-significant, but 10 findings remain significant.

One finding is evidence that liking of one’s initials has retest reliability. A more interesting finding is that 4 significant (p < .001) results were obtained in the most recent (2016) article, which also included pre-registered studies. This suggests that Dijksterhuis changed research practices in the wake of the replicability crisis. Thus, new articles that have not garnered a lot of citations may be more credible, but the pre-2011 articles lack credible empirical evidence for most of the claims made in these articles.

DISCLAIMER 

It is nearly certain that I made some mistakes in the coding of Ap Dijksterhuis’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit.  The data are openly available and the z-curve code is also openly available.  Thus, this replicability audit is fully transparent and open to revision.

Moreover, the results are broadly consistent with the z-curve results based on automated extraction of test statistics (Schimmack, 2021). Based on automated coding, Dijksterhuis has an EDR of 17%, with a rank of 312 out of 357 social psychologists. The reason for the higher EDR is that automated coding does not distinguish focal and non-focal tests, and focal tests tend to have lower power and a higher risk of being false positives.

If you found this audit interesting, you might also be interested in other replicability audits (Replicability Audits).

Smart P-Hackers Have File-Drawers and Are Not Detected by Left-Skewed P-Curves

Abstract

In the early 2010s, two articles suggested that (a) p-hacking is common, (b) false positives are prevalent, and (c) left-skewed p-curves reveal p-hacking to produce false positive results (Simmons et al., 2011; Simonsohn et al., 2014a). However, empirical applications of p-curve have produced few left-skewed p-curves. This raises questions about the absence of left-skewed p-curves. One explanation is that some p-hacking strategies do not produce notable left skew and that these strategies may be used more often because they require fewer resources. Another explanation could be that file-drawering is much more common than p-hacking. Finally, it could be that most of the time p-hacking is used to inflate true effect sizes rather than to chase false positive results. P-curve plots do not allow researchers to distinguish these alternative hypotheses. Thus, p-curve should be replaced by more powerful tools that detect publication bias or p-hacking and estimate the amount of evidence against the null-hypothesis. Fortunately, there is an app for this (zcurve package).

Introduction

Simonsohn, Nelson, and Simmons (2014) coined the term p-hacking for a set of questionable research practices that increase the chances of obtaining a statistically significant result. In the worst case scenario, p-hacking can produce significant results without a real effect. In this case, the statistically significant result is entirely explained by p-hacking.

Simonsohn et al. (2014) make a clear distinction between p-hacking and publication bias. Publication bias is unlikely to produce a large number of false positive results because it requires 20 attempts to produce a single significant result in either direction or 40 attempts to get a significant result with a predicted direction. In contrast, “p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011)” (p. 535).
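The 20 and 40 attempts follow from the significance criterion alone. A minimal sketch of the arithmetic, assuming independent studies of a true null effect and alpha = .05 (two-sided); the function name is only illustrative:

```python
# Expected number of independent studies of a true null effect that have to be
# run, on average, before one of them is significant (mean of a geometric
# distribution with success probability p_success).
def expected_attempts(p_success: float) -> float:
    return 1.0 / p_success

ALPHA = 0.05  # two-sided significance criterion

either_direction = expected_attempts(ALPHA)         # significant in either direction
predicted_direction = expected_attempts(ALPHA / 2)  # significant in the predicted direction

print(either_direction, predicted_direction)
```

These are averages; a lucky researcher needs fewer attempts and an unlucky one needs more.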

There have been surprisingly few investigations of the best way to p-hack studies. Some p-hacking strategies may work in simulation studies that do not impose limits on resources, but they may not be practical in real applications of p-hacking. I postulate that the main goal of p-hacking is to get significant results with minimal resources rather than with a minimum number of studies and that p-hacking is more efficient with a file drawer of studies that are abandoned.

Simmons et al. (2011) and Simonsohn et al. (2014) suggest one especially dumb p-hacking strategy, namely simply collecting more data until a significant result emerges.

“For example, consider a researcher who p-hacks by analyzing data after every five per-condition participants and ceases upon obtaining significance.” (Simonsohn et al., 2014).

This strategy is known to produce more p-values close to .04 than .01.

The main problem with this strategy is that sample sizes can get very large before the significant result emerges. I limited the maximum sample size before a researcher would give up to N = 200. A limit of 200 makes sense because N = 200 would allow a researcher to run 20 studies with the starting sample size of N = 10 to get a significant result. The p-curve plot shows a similar distribution as the simulation in the p-curve article.

The success rate was 25%. This means that 75% of studies with N = 200 produced a non-significant result that had to be put in the file-drawer. Figure 2 shows the distribution of sample sizes for the significant results.

The key finding is that the chances of a significant result drop drastically after the first attempt. The reason is that the most favorable samples already produce a significant result in the first trial. As a result, the samples that remain non-significant are less favorable. It would be better to start a new study because the chances of getting a significant result are higher than after adding participants to an unsuccessful attempt. In short, just adding participants to get significance is a dumb p-hacking method.
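The peeking strategy can be sketched in a few lines. This is not the code behind the figures; it is a simplified illustration that assumes normally distributed scores with a known standard deviation of 1, a two-sided z-test instead of a t-test, and no true effect. The function names and defaults are my own, so the exact percentages need not match the ones reported here.

```python
import math
import random
from statistics import NormalDist


def optional_stopping(max_n_per_cell=100, step=5, alpha=0.05, rng=None):
    """Simulate one two-group study with no true effect.

    The data are tested after every `step` participants per cell and data
    collection stops at the first significant result or at max_n_per_cell.
    Returns (significant, n_per_cell_at_stop).
    """
    rng = rng or random.Random()
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    sum_a = sum_b = 0.0
    n = 0
    while n < max_n_per_cell:
        for _ in range(step):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
        n += step
        # z-test of the mean difference; the known SD of 1 is an assumption.
        z = (sum_a / n - sum_b / n) / math.sqrt(2 / n)
        if abs(z) > z_crit:
            return True, n
    return False, n


def success_rate(n_sims=2000, seed=1, **kwargs):
    """Share of simulated studies that end with a significant result."""
    rng = random.Random(seed)
    hits = sum(optional_stopping(rng=rng, **kwargs)[0] for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    print("peeking up to N = 200:", success_rate())
    print("single test at N = 10:", success_rate(max_n_per_cell=5))
```

Collecting the second return value for the significant runs gives the distribution of sample sizes at which researchers would stop.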

Simonsohn et al. (2014) do not disclose the stopping rule, but they do show that they got only 5.6% significant results compared to the 25% with N = 200. This means they stopped much earlier. Simulations suggest that they stopped when N = 30 (n = 15 per cell) did not produce a significant result (1 million simulations, success rate = 5.547%). The success rates for N = 10, 20, and 30 were 2.5%, 1.8%, and 1.3%, respectively. These probabilities can be compared to a probability of 2.5% for each test with N = 10. It is clear that trying three studies is a more efficient strategy than to add participants until N reaches 30. Moreover, neither strategy avoids producing a file drawer. To avoid a file-drawer, researchers would need to combine several questionable research practices (Simmons et al., 2011).
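The comparison can be made explicit with a little arithmetic. The sketch below uses only the percentages quoted above and assumes that the three fresh studies are independent; the helper name is illustrative.

```python
def p_at_least_one(p_single: float, k: int) -> float:
    """Probability of at least one significant result in k independent tests."""
    return 1 - (1 - p_single) ** k


# Three fresh studies with N = 10 each (at most 30 participants in total).
three_fresh_studies = p_at_least_one(0.025, 3)

# Adding participants to the same study: success rates at N = 10, 20, and 30.
peeking_until_30 = 0.025 + 0.018 + 0.013

print(f"three fresh studies: {three_fresh_studies:.4f}")
print(f"peeking until N = 30: {peeking_until_30:.4f}")
```

With the same maximum number of participants, starting over gives a success rate of about 7.3% compared to 5.6% for adding participants.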

Simmons et al. (2011) proposed that researchers can add covariates to increase the number of statistical tests and to increase the chances of producing a significant result. Another option is to include several dependent variables. To simplify the simulation, I am assuming that dependent variables and covariates are independent of each other. Sample size has no influence on these results. To make the simulation consistent with typical results in actual studies, I used n = 20 per cell. Adding covariates or additional dependent variables requires the same amount of resources. For example, participants make additional ratings for one more item and this item is either used as a covariate or as a dependent variable. Following Simmons et al. (2011), I first simulated a scenario with 10 covariates.

The p-curve plot is similar to the repeated peeking plot and is called left-skewed. The success rate, however, is disappointing. Only 4.48% of results were statistically significant. This suggests that collecting data to be used as covariates is another dumb p-hacking strategy.

Adding dependent variables is much more efficient. In the simple scenario, with independent DVs, the probability of obtaining a significant result equals 1-(1-.025)^11 = 24.31%. A simulation with 100,000 trials produced a percentage of 24.55%. More important, the p-curve is flat.
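A minimal sketch of this scenario is shown below. It is not the original simulation code. It assumes that each of the 11 independent DVs yields a standard normal test statistic under the null-hypothesis, that only results in the predicted direction count, and that the researcher reports the first DV that is significant; all names are illustrative.

```python
import random
from statistics import NormalDist


def simulate_multiple_dvs(n_studies=20000, n_dvs=11, alpha=0.05, seed=1):
    """P-hack by testing several independent DVs when there is no true effect.

    Returns the success rate and the shares of significant p-values
    that fall below .01 and between .04 and .05.
    """
    rng = random.Random(seed)
    norm = NormalDist()
    significant_p = []
    for _ in range(n_studies):
        for _ in range(n_dvs):
            z = rng.gauss(0, 1)              # test statistic under the null
            p = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
            if z > 0 and p < alpha:          # significant in predicted direction
                significant_p.append(p)
                break                        # report this DV and stop looking
    success_rate = len(significant_p) / n_studies
    share_below_01 = sum(p < 0.01 for p in significant_p) / len(significant_p)
    share_04_to_05 = sum(p > 0.04 for p in significant_p) / len(significant_p)
    return success_rate, share_below_01, share_04_to_05


if __name__ == "__main__":
    rate, low, high = simulate_multiple_dvs()
    print(f"success rate: {rate:.3f}; p < .01: {low:.3f}; .04 < p < .05: {high:.3f}")
```

Under these assumptions the two shares should both be close to 20%, which is what a flat p-curve looks like when the significant range is split into five bins.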

Correlation among the dependent variables produces a slight left-skewed distribution, but not as much as the other p-hacking methods. With a population correlation of r = .3, the percentages are 17% for p < .01 and 22% for p between .04 and .05.

These results provide three insights into p-hacking that have been overlooked. First, some p-hacking methods are more effective than others. Second, the amount of left-skewness varies across p-hacking methods. Third, efficient p-hacking produces a fairly large file-drawer of studies with non-significant results because it is inefficient to add participants to data that failed to produce a significant result.

Implications

False P-curve Citations

The p-curve authors made it fairly clear what p-curve does and what it does not do. The main point of a p-curve analysis is to examine whether a set of significant results was obtained at least partially with some true effects. That is, at least in a subset of the studies the null-hypothesis was false. The authors call this evidential value. A right-skewed p-curve suggests that a set of significant results have evidential value. This is the only valid inference that can be drawn from p-curve plots.

“We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole [italics added] explanation of those findings” (p. 535).

The emphasis on selective reporting as the sole explanation is important. A p-curve that shows evidential value can still be biased by p-hacking and publication bias, which can lead to inflated effect size estimates.

To make sure that I interpret the article correctly, I asked one of the authors on Twitter and the reply confirmed that p-curve is not a bias test, but strictly a test that some real effects contributed to a right-skewed p-curve. The answer also explains why the p-curve authors did not care about testing for bias. They assume that bias is almost always present, which makes it unnecessary to test for it.

Although the authors stated the purpose of p-curve plots clearly, many meta-analysts have misunderstood the meaning of a p-curve analysis and have drawn false conclusions about right-skewed p-curves. For example, Rivers (2017) writes that a right-skewed p-curve suggests “that the WIT effect is a) likely to exist, and b) unlikely biased by extensive p-hacking.” The first inference is correct. The second one is incorrect because p-curve is not a bias detection method. A right-skewed p-curve could be a mixture of real effects and bias due to selective reporting.

Rivers also makes a misleading claim that a flat p-curve shows the lack of evidential value, whereas “a significantly left-skewed distribution indicates that the effect under consideration may be biased by p-hacking.” These statements are wrong because a flat p-curve can also be produced by p-hacking, especially when a real effect is also present.

Rivers is by no means the only one who misinterpreted p-curve results. Using the 10 most highly cited articles that applied p-curve analysis, we can see the same mistake in several articles. A tutorial for biologists claims “p-curve can, however, be used to identify p-hacking, by only considering significant findings” (Head, 2015, p. 3). Another tutorial for biologists repeats this false interpretation of p-curves. “One proposed method for identifying P-hacking is ‘P-curve’ analysis” (Parker et al., 2016, p. 714). A similar false claim is made by Polanin et al. (2016). “The p-curve is another method that attempts to uncover selective reporting, or “p-hacking,” in primary reports (Simonsohn, Nelson, Leif, & Simmons, 2014)” (p. 211). The authors of a meta-analysis of personality traits claim that they conduct p-curve analyses “to check whether this field suffers from publication bias” (Muris et al., 2017, p. 186). Another meta-analysis on coping also claims “p-curve analysis (Simonsohn, Nelson, & Simmons, 2014) allows the detection of selective reporting by researchers who “file-drawer” certain parts of their studies to reach statistical significance” (Cheng et al., 2014, p. 1594).

Shariff et al.’s (2016) article on religious priming effects provides a better explanation of p-curve, but their final conclusion is still misleading. “These results suggest that the body of studies reflects a true effect of religious priming, and not an artifact of publication bias and p-hacking.” (p. 38). The first part is correct, but the second part is misleading. The correct claim would be “not solely the result of publication bias and p-hacking”, but it is possible that publication bias and p-hacking inflate effect size estimates in this literature. The skew of p-curves simply does not tell us about this. The same mistake is made by Weingarten et al. (2016). “When we included all studies (published or unpublished) with clear hypotheses for behavioral measures (as outlined in our p-curve disclosure table), we found no evidence of p-hacking (no left-skew), but dual evidence of a right-skew and flatter than 33% power.” (p. 482). While a left-skewed p-curve does reveal p-hacking, the absence of left-skew does not ensure that p-hacking was absent. The same mistake is made by Steffens et al. (2017), who interpret a right-skewed p-curve as evidence “that the set of studies contains evidential value and that there is no evidence of p-hacking or ambitious p-hacking” (p. 303).

Although some articles correctly limit the interpretation of the p-curve to the claim that the data contain evidential value (Combs et al., 2015; Rand, 2016; Siks et al., 2018), the majority of applied p-curve articles falsely assume that p-curve can reveal the presence or absence of p-hacking or publication bias. This is incorrect. A left-skewed p-curve does provide evidence of p-hacking, but the absence of left-skew does not imply that p-hacking is absent.

How prevalent are left-skewed p-curves?

After 2011, psychologists were worried that many published results might be false positive results that were obtained with p-hacking (Simmons et al., 2011). As the combination of p-hacking in the absence of a real effect does produce left-skewed p-curves, one might expect that a large percentage of p-curve analyses revealed left-skewed distributions. However, empirical examples of left-skewed p-curves are extremely rare. Take power-posing as an example. It is widely assumed these days that original evidence for power-posing was obtained with p-hacking and that the real effect size of power-posing is negligible. Thus, power-posing would be expected to show a left-skewed p-curve.

Simmons and Simonsohn (2017) conducted a p-curve analysis of the power-posing literature. They did not observe a left-skewed p-curve. Instead, the p-curve was flat, which justifies the conclusion that the studies contain no evidential value (i.e., we cannot reject the null-hypothesis that all studies tested a true null-hypothesis). The interpretation of this finding is misleading.

“In this Commentary, we rely on p-curve analysis to answer the following question: Does the literature reviewed by Carney et al. (2015) suggest the existence of an effect once one accounts for selective reporting? We conclude that it does not. The distribution of p values from those 33 studies is indistinguishable from what would be expected if (a) the average effect size were zero and (b) selective reporting (of studies or analyses) were solely responsible for the significant effects that were published”

The interpretation only focuses on selective reporting (or testing of independent DVs) as a possible explanation for lack of evidential value. However, usually the authors emphasize p-hacking as the most likely explanation for significant results without evidential value. Ignoring p-hacking is deceptive because a flat p-curve can occur as a combination of p-hacking and a real effect, as the authors showed themselves (Simonsohn et al., 2014).

Another problem is that significance testing is also one-sided. A right-skewed p-curve can be used to reject the null-hypothesis that all studies are false positives, but the absence of significant right skew cannot be used to infer the lack of evidential value. Thus, p-curve cannot be used to establish that there is no evidential value in a set of studies.

There are two explanations for the surprising lack of left-skewed p-curves in actual studies. First, p-hacking may be much less prevalent than is commonly assumed and the bigger problem is publication bias which does not produce a left-skewed distribution. Alternatively, false positive results are much rarer than has been assumed in the wake of the replication crisis. The main reason for replication failures could be that published studies report inflated effect sizes and that replication studies with unbiased effect size estimates are underpowered and produce false negative results.

How useful are Right-skewed p-curves?

In theory, left-skew is diagnostic of p-hacking, but in practice left-skew is rarely observed. This leaves right-skew as the only diagnostic information of p-curve plots. Right skew can be used to reject the null-hypothesis that all of the significant results tested a true null-hypothesis. The problem with this information is shared by all significance tests. It does not provide evidence about the effect size. In this case, it does not provide evidence about the percentage of significant results that are false positives (the false positive risk), nor does it quantify the strength of evidence.

This problem has been addressed by other methods that quantify how strong the evidence against the null-hypothesis is. Confusingly, the p-curve authors used the term p-curve for a method that estimates the strength of evidence in terms of the unconditional power of the set of studies (Simonsohn et al., 2014b). The problem with these power estimates is that they are biased when studies are heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). Simulation studies show that z-curve is a superior method to quantify the strength of evidence against the null-hypothesis. In addition, z-curve.2.0 provides additional information about the false positive risk; that is the maximum number of significant results that may be false positives.
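To illustrate how a maximum false positive risk can be derived from a discovery rate, here is a minimal sketch of Soric's (1989) upper bound, which, as far as I know, is the formula that z-curve.2.0 applies to the estimated discovery rate (EDR). The function name and the example value are illustrative and this is not the zcurve package code.

```python
def max_false_discovery_rate(discovery_rate: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the share of significant results that are
    false positives, given the discovery rate and the significance criterion."""
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    bound = (1 / discovery_rate - 1) * (alpha / (1 - alpha))
    return min(1.0, bound)  # a proportion cannot exceed 100%


if __name__ == "__main__":
    # A hypothetical discovery rate of 19% implies that up to about 22%
    # of the significant results could be false positives.
    print(round(max_false_discovery_rate(0.19), 3))
```

When the discovery rate equals alpha, the bound is 100%, which means that all significant results could be false positives. When the discovery rate is 100%, the bound is 0%.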

In conclusion, p-curve plots no longer produce meaningful information. Left-skew can be detected in z-curve plots as well as in p-curve plots and is extremely rare. Right skew is diagnostic of evidential value, but does not quantify the strength of evidence. Finally, p-curve plots are not diagnostic when data contain evidential value and bias due to p-hacking or publication bias.

The Power-Corrected H-Index

I was going to write this blog post eventually, but the online first publication of Radosic and Diener’s (2021) article “Citation Metrics in Psychological Science” provided a good opportunity to do so now.

Radosic and Diener’s (2021) article’s main purpose was to “provide norms to help evaluate the citation counts of psychological scientists” (p. 1). The authors also specify the purpose of these evaluations. “Citation metrics are one source of information that can be used in hiring, promotion, awards, and funding, and our goal is to help these evaluations” (p. 1).

The authors caution readers that they are agnostic about the validity of citation counts as a measure of good science. “The merits and demerits of citation counts are beyond the scope of the current article” (p. 8). Yet, they suggest that “there is much to recommend citation numbers in evaluating scholarly records” (p. 11).

At the same time, they list some potential limitations of using citation metrics to evaluate researchers.

1. Articles that developed a scale can have high citation counts. For example, Ed Diener has over 71,000 citations. His most cited article is the 1985 article with his Satisfaction with Life Scale. With 12,000 citations, it accounts for 17% of his citations. The fact that articles that published a measure have such high citation counts reflects a problem in psychological science. Researchers continue to use the first measure that was developed for a new construct (e.g., Rosenberg’s 1965 self-esteem scale) instead of improving measurement which would lead to citations of newer articles. So, the high citation counts of articles with scales is a problem, but it is only a problem if citation counts are used as a metric. A better metric is the H-Index that takes number of publications and citations into account. Ed Diener also has a very high H-Index of 108 publications with 108 or more citations. His scale article is only one of these articles. Thus, scale development articles are not a major problem.

2. Review articles are cited more heavily than original research articles. Once more, Ed Diener is a good example. His second and third most cited articles are the 1984 and the co-authored 1999 Psychological Bulletin review articles on subjective well-being that together account for another 9,000 citations (13%). However, even review articles are not a problem. First, they also are unlikely to have an undue influence on the H-Index and second it is possible to exclude review articles and to compute metrics only for empirical articles. Web of Science makes this very easy. In WebofScience 361 out of Diener’s 469 publications are listed as articles. The others are listed as reviews, book chapters, or meeting abstracts. With a click of a button, we can produce the citation metrics only for the 361 articles. The H-Index drops from 108 to 102. Careful hand-selection of articles is unlikely to change this.

3. Finally, Radosic and Diener (2021) mention large-scale collaborations as a problem. For example, one of the most important research projects in psychological science in the last decade was the Reproducibility Project that examined the replicability of psychological science with 100 replication studies (Open Science Collaboration, 2015). This project required a major effort by many researchers. Participation earned researchers over 2,000 citations in just five years and the article is likely to be the most cited article for many of the collaborators. I do not see this as a problem because large-scale collaborations are important and can produce results that no single lab can produce. Thus, high citation counts provide a good incentive to engage in these collaborations.
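The H-Index that is used in the points above is easy to compute. The sketch below uses made-up citation counts to show why a single blockbuster article has little influence on it; the function name is illustrative.

```python
def h_index(citations):
    """Largest h such that h publications have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


if __name__ == "__main__":
    # Hypothetical citation counts, not data from any real researcher.
    print(h_index([10, 8, 5, 4, 3]))
    # Replacing the top article with a 12,000-citation scale article
    # leaves the index unchanged.
    print(h_index([4, 3, 2, 1]), h_index([12000, 3, 2, 1]))
```

Because each article counts only once, a scale article or a review article with thousands of citations adds at most one point to the index.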

To conclude, Radosic and Diener’s article provides norms for citation counts that can and will be used to evaluate psychological scientists. However, the article sidesteps the main question about the use of citation metrics, namely (a) what criteria should be used to evaluate scientists and (b) are citation metrics valid indicators of these criteria. In short, the article is just another example that psychologists develop and promote measures without examining their construct validity (Schimmack, 2021).

What is a good scientist?

I didn’t do an online study to examine the ideal prototype of a scientist, so I have to rely on my own image of a good scientist. A key criterion is to search for some objectively verifiable information that can inform our understanding of the world, or in psychology ourselves; that is, human affect, behavior, and cognition – the ABC of psychology. The second criterion elaborates the term objective. Scientists use methods that produce the same results independent of the user of the methods. That is, studies should be reproducible and results should be replicable within the margins of error. Third, the research question should have some significance beyond the personal interests of a scientist. This is of course a tricky criterion, but research that solves major problems like finding a vaccine for Covid-19 is more valuable and more likely to receive citations than research on the liking of cats versus dogs (I know, this is the most controversial statement I am making; go cats!). The problem is that not everybody can do research that is equally important to a large number of people. Once more Ed Diener is a good example. In the 1980s, he decided to study human happiness, which was not a major topic in psychology. Ed Diener’s high H-Index reflects his choice of a topic that is of interest to pretty much everybody. In contrast, research on stigma of minority groups is not of interest to a large group of people and unlikely to attract the same amount of attention. Thus, a blind focus on citation metrics is likely to lead to research on general topics and avoid research that applies research to specific problems. The problem is clearly visible in research on prejudice, where the past 20 years have produced hundreds of studies with button-press tasks by White researchers with White participants that gobbled up funding that could have been used for BIPOC researchers to study the actual issues in BIPOC populations.
In short, relevance and significance of research is very difficult to evaluate, but it is unlikely to be reflected in citation metrics. Thus, a danger is that metrics are being used because they are easy to measure and relevance is not being used because it is harder to measure.

Do Citation Metrics Reward Good or Bad Research?

The main justification for the use of citation metrics is the hypothesis that the wisdom of crowds will lead to more citations of high quality work.

“The argument in favor of personal judgments overlooks the fact that citation counts are also based on judgments by scholars. In the case of citation counts, however, those judgments are broadly derived from the whole scholarly community and are weighted by the scholars who are publishing about the topic of the cited publications. Thus, there is much to recommend citation numbers in evaluating scholarly records.” (Radosic & Diener, 2021, p. 8).

This statement is out of touch with discussions about psychological science over the past decade in the wake of the replication crisis (see Schimmack, 2020, for a review; I have to cite myself to get up my citation metrics. LOL). In order to get published and cited, researchers of original research articles in psychological science need statistically significant p-values. The problem is that it can be difficult to find significant results when novel hypotheses are false or effect sizes are small. Given the pressure to publish in order to rise in the H-Index rankings, psychologists have learned to use a number of statistical tricks to get significant results in the absence of strong evidence in the data. These tricks are known as questionable research practices, but most researchers think they are acceptable (John et al., 2012). However, these practices undermine the value of significance testing and published results may be false positives or difficult to replicate, and do not add to the progress of science. Thus, citation metrics may have the negative consequence to pressure scientists into using bad practices and to reward scientists who publish more false results just because they publish more.

Meta-psychologists have produced strong evidence that the use of these practices was widespread and accounts for the majority of replication failures that occurred over the past decade.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Motyl et al. (2017) collected focal test statistics from a representative sample of articles in social psychology. I analyzed their data using z-curve.2.0 (Brunner & Schimmack, 2020; Bartos & Schimmack, 2021). Figure 1 shows the distribution of the test-statistics after converting them into absolute z-scores, where higher values show a higher signal/noise (effect size / sampling error) ratio. A z-score of 1.96 is needed to claim a discovery with p < .05 (two-sided). Consistent with publication practices since the 1960s, most focal hypothesis tests confirm predictions (Sterling, 1959). The observed discovery rate is 90% and even higher if marginally significant results are included (z > 1.65). This high success rate is not something to celebrate. Even I could win all marathons if I use a short-cut and run only 5km. The problem with this high success rate is clearly visible when we fit a model to the distribution of the significant z-scores and extrapolate the distribution of z-scores that are not significant (the blue curve in the figure). Based on this distribution, the significant results are only 19% of all tests, indicating that many more non-significant results are expected than observed. The discrepancy between the observed and estimated discovery rate provides some indication of the use of questionable research practices. Moreover, the estimated discovery rate shows how much statistical power studies have to produce significant results without questionable research practices. The results confirm suspicions that power in social psychology is abysmally low (Cohen, 1961; Tversky & Kahneman, 1971).
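The first steps of such an analysis, converting results into absolute z-scores and computing the observed discovery rate, can be sketched as follows. The p-values are made up for illustration and the function names are my own; the model fitting that produces the estimated discovery rate is done by the zcurve package and is not shown here.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p_two_sided / 2)


def observed_discovery_rate(z_scores, z_crit=1.96):
    """Share of reported tests that are significant at p < .05 (two-sided)."""
    return sum(z > z_crit for z in z_scores) / len(z_scores)


if __name__ == "__main__":
    # Hypothetical p-values of focal tests, not the Motyl et al. data.
    p_values = [0.001, 0.02, 0.04, 0.049, 0.20]
    z_scores = [p_to_abs_z(p) for p in p_values]
    print([round(z, 2) for z in z_scores])
    print(observed_discovery_rate(z_scores))
```

Working with p-values has the advantage that t-tests, F-tests, and chi-square tests with different degrees of freedom end up on the same scale.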

The use of questionable practices makes it possible that citation metrics may be invalid. When everybody in a research field uses p < .05 as a criterion to evaluate manuscripts and these p-values are obtained with questionable research practices, the system will reward researchers who use the most questionable methods to produce more questionable results than their peers. In other words, citation metrics are no longer a valid criterion of research quality. Instead, bad research is selected and rewarded (Smaldino & McElreath, 2016). However, it is also possible that implicit knowledge helps researchers to focus on robust results and that questionable research practices are not rewarded. For example, prediction markets suggest that it is fairly easy to spot shoddy research and to predict replication failures (Dreber et al., 2015). Thus, we cannot assume that citation metrics are valid or invalid. Instead, citation metrics – like all measures – require a program of construct validation.

Do Citation Metrics Take Statistical Power Into Account?

A few days ago, I published the first results of an ongoing research project that examines the relationship between researchers’ citation metrics and estimates of the average power of their studies based on z-curve analyses like the one shown in Figure 1 (see Schimmack, 2021, for details). The key finding is that there is no statistically or practically significant relationship between researchers’ H-Index and the average power of their studies. Thus, researchers who invest a lot of resources in their studies to produce results with a low false positive risk and high replicability are not cited more than researchers who flood journals with low powered studies that produce questionable results that are difficult to replicate.

These results show a major problem of citation metrics. Although methodologists have warned against underpowered studies, researchers have continued to use underpowered studies because they can use questionable practices to produce the desired outcome. This strategy is beneficial for scientists and their career, but hurts the larger goal of science to produce a credible body of knowledge. This does not mean that we need to abandon citation metrics altogether, but they must be complemented with other information that reflects the quality of researchers’ data.

The Power-Corrected H-Index

In my 2020 review article, I proposed to weight the H-Index by estimates of researchers’ replicability. For my illustration, I used the estimated replication rate (ERR), which is the average power of significant tests, p < .05 (Brunner & Schimmack, 2020). One advantage of the ERR is that it is highly reliable. The reliability of the ERRs for 300 social psychologists is .90. However, the ERR has some limitations. First, it predicts replication outcomes under the unrealistic assumption that psychological studies can be replicated exactly. However, it has been pointed out that this is often impossible, especially in social psychology (Stroebe & Strack, 2014). As a result, ERR predictions are overly optimistic and overestimate the success rate of actual replication studies (Bartos & Schimmack, 2021). In contrast, EDR estimates are much more in line with actual replication outcomes because effect sizes in replication studies can regress towards the mean. For example, Figure 1 shows an EDR of 19% for social psychology and the actual success rate (if we can call it that) for social psychology was 25% in the reproducibility project (Open Science Collaboration, 2015). Another advantage of the EDR is that it is sensitive to questionable research practices that tend to produce an abundance of p-values that are just significant. Thus, the EDR more strongly punishes researchers for using these undesirable practices. The main limitation of the EDR is that it is less reliable than the ERR. The reliability for 300 social psychologists was only .5. Of course, it is not necessary to choose between ERR and EDR. Just like there are many citation metrics, it is possible to evaluate the pattern of power-corrected metrics using ERR and EDR. I am presenting both values here, but the rankings are sorted by EDR weighted H-Indices.

The H-Index is an absolute number that can range from 0 to infinity. In contrast, power is limited to a range from 5% (with alpha = .05) to 100%. Thus, it makes sense to use power as a weight and to weight the H-index by a researcher’s EDR. A researcher who published only studies with 100% power has a power-corrected H-Index that is equivalent to the actual H-Index. The average EDR of social psychologists, however, is 35%. Thus, the average H-index is reduced to a third of the unadjusted value.

To illustrate this approach, I am using two researchers with a large H-Index, but different EDRs. One researcher is James J. Gross with an H-Index of 99 in WebofScience. His z-curve plot shows some evidence that questionable research practices were used to report 72% significant results with 50% power. However, the 95%CI around the EDR ranges from 23% to 78% and includes the observed discovery rate. Thus, the evidence for QRPs is weak and not statistically significant. More important, the EDR-corrected H-Index is 99 * .50 = 49.5.

A different example is provided by Shelly E. Taylor with a similarly high H-Index of 84, but her z-curve plot shows clear evidence that the observed discovery rate is inflated by questionable research practices. Her low EDR reduces the H-Index considerably and results in a PC-H-Index of only 12.6.

Weighting the two researchers’ H-Index by their respective ERRs, 77% vs. 54%, has similar, but less extreme effects in absolute terms, with ERR-adjusted H-Indices of 76 vs. 45.
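The weighting itself is a simple multiplication. The sketch below uses the H-Index values and ERRs quoted above. Taylor's EDR is not stated in the text; the value of 15% is inferred here from her PC-H-Index of 12.6 divided by her H-Index of 84 and should be treated as an assumption. The function name is illustrative.

```python
def power_corrected_h(h_index: float, power: float) -> float:
    """Weight an H-Index by an estimate of power (EDR or ERR as a proportion)."""
    if not 0 <= power <= 1:
        raise ValueError("power must be a proportion between 0 and 1")
    return h_index * power


researchers = {
    # EDR for Taylor is inferred (12.6 / 84), not reported in the text.
    "Gross": {"h": 99, "edr": 0.50, "err": 0.77},
    "Taylor": {"h": 84, "edr": 0.15, "err": 0.54},
}

if __name__ == "__main__":
    for name, r in researchers.items():
        edr_h = power_corrected_h(r["h"], r["edr"])
        err_h = power_corrected_h(r["h"], r["err"])
        print(f"{name}: H = {r['h']}, EDR-weighted = {edr_h:.1f}, ERR-weighted = {err_h:.1f}")
```

A different formula, for example one that weights power more or less heavily, would only require changing the body of this one function.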

In the sample of 300 social psychologists, the H-Index (r = .74) and the EDR (r = .65) contribute about equal amounts of variance to the power-corrected H-Index. Of course, a different formula could be used to weigh power more or less.

Discussion

Ed Diener is best known for his efforts to measure well-being and to point out that traditional economic indicators of well-being are imperfect. While wealth of countries is a strong predictor of citizens’ average well-being, r ~ .8, income is a poor predictor of individuals’ well-being within countries. However, economists continue to rely on income and GDP because they are more easily quantified and counted than subjective life-evaluations. Ironically, Diener advocates the opposite approach when it comes to measuring research quality. Counting articles and citations is relatively easy and objective, but it may not measure what we really want to measure, namely how much somebody is contributing to the advancement of knowledge. The construct of scientific advancement is probably as difficult to define as well-being, but producing replicable results with reproducible studies is one important criterion of good science. At present, citation metrics fail to track this indicator of research quality. Z-curve analyses of published results make it possible to measure this aspect of good science and I recommend taking it into account when researchers are being evaluated.

However, I do not recommend the use of quantitative information for hiring and promotion decisions. The reward system in science is too biased in favor of privileged upper-class, White, US Americans (see APS rising stars lists). That being said, a close examination of published articles can be used to detect and eliminate researchers who severely p-hacked to get their significant results. Open science criteria can also be used to evaluate researchers who are just starting their career.

In conclusion, Radosic and Diener’s (2021) article disappointed me because it sidesteps the fundamental questions about the validity of citation metrics as a criterion for scientific excellence.

Conflict of Interest Statement: At the beginning of my career I was motivated to succeed in psychological science by publishing as many JPSP articles as possible and I made the unhealthy mistake of trying to compete with Ed Diener. That didn’t work out for me. Maybe I am just biased against citation metrics because my work is not cited as much as I would like. Alternatively, my disillusionment with the system reflects some real problems with the reward structure in psychological science and helped me to see the light. The goal of science cannot be to have the most articles or the most citations if these metrics do not really reflect scientific contributions. Chasing indicators is a trap, just like chasing happiness is a trap. Most scientists can hope to make maybe one lasting contribution to the advancement of knowledge. You need to please others to stay in the game, but beyond those minimum requirements to get tenure, personal criteria of success are better than social comparisons for the well-being of science and scientists. The only criterion that is healthy is to maximize statistical power. As Cohen said, less is more, and by this criterion psychology is not doing well as more and more research is published with little concern about quality.

Name  EDR.H.Index  ERR.H.Index  H-Index  EDR  ERR
James J. Gross5076995077
John T. Cacioppo48701024769
Richard M. Ryan4661895269
Robert A. Emmons3940468588
Edward L. Deci3643695263
Richard W. Robins3440576070
Jean M. Twenge3335595659
William B. Swann Jr.3244555980
Matthew D. Lieberman3154674780
Roy F. Baumeister31531013152
David Matsumoto3133397985
Carol D. Ryff3136486476
Dacher Keltner3144684564
Michael E. McCullough3034446978
Kipling D. Williams3034446977
Thomas N Bradbury3033486369
Richard J. Davidson30551082851
Phoebe C. Ellsworth3033466572
Mario Mikulincer3045714264
Richard E. Petty3047744064
Paul Rozin2949585084
Lisa Feldman Barrett2948694270
Constantine Sedikides2844634570
Alice H. Eagly2843614671
Susan T. Fiske2849664274
Jim Sidanius2730426572
Samuel D. Gosling2733535162
S. Alexander Haslam2740624364
Carol S. Dweck2642663963
Mahzarin R. Banaji2553683778
Brian A. Nosek2546574481
John F. Dovidio2541663862
Daniel M. Wegner2434524765
Benjamin R. Karney2427376573
Linda J. Skitka2426327582
Jerry Suls2443633868
Steven J. Heine2328376377
Klaus Fiedler2328386174
Jamil Zaki2327356676
Charles M. Judd2336534368
Jonathan B. Freeman2324307581
Shinobu Kitayama2332455071
Norbert Schwarz2235564063
Antony S. R. Manstead2237593762
Patricia G. Devine2125375867
David P. Schmitt2123307177
Craig A. Anderson2132593655
Jeff Greenberg2139732954
Kevin N. Ochsner2140573770
Jens B. Asendorpf2128415169
David M. Amodio2123336370
Bertram Gawronski2133434876
Fritz Strack2031553756
Virgil Zeigler-Hill2022277481
Nalini Ambady2032573556
John A. Bargh2035633155
Arthur Aron2036653056
Mark Snyder1938603263
Adam D. Galinsky1933682849
Tom Pyszczynski1933613154
Barbara L. Fredrickson1932523661
Hazel Rose Markus1944642968
Mark Schaller1826434361
Philip E. Tetlock1833454173
Anthony G. Greenwald1851613083
Ed Diener18691011868
Cameron Anderson1820276774
Michael Inzlicht1828444163
Barbara A. Mellers1825325678
Margaret S. Clark1823305977
Ethan Kross1823345267
Nyla R. Branscombe1832493665
Jason P. Mitchell1830414373
Ursula Hess1828404471
R. Chris Fraley1828394572
Emily A. Impett1819257076
B. Keith Payne1723305876
Eddie Harmon-Jones1743622870
Wendy Wood1727434062
John T. Jost1730493561
C. Nathan DeWall1728453863
Thomas Gilovich1735503469
Elaine Fox1721276278
Brent W. Roberts1745592877
Harry T. Reis1632433874
Robert B. Cialdini1629513256
Phillip R. Shaver1646652571
Daphna Oyserman1625463554
Russell H. Fazio1631503261
Jordan B. Peterson1631394179
Bernadette Park1624384264
Paul A. M. Van Lange1624384263
Jeffry A. Simpson1631572855
Russell Spears1529522955
A. Janet Tomiyama1517236576
Jan De Houwer1540552772
Samuel L. Gaertner1526423561
Michael Harris Bond1535423584
Agneta H. Fischer1521314769
Delroy L. Paulhus1539473182
Marcel Zeelenberg1429373979
Eli J. Finkel1426453257
Jennifer Crocker1432483067
Steven W. Gangestad1420483041
Michael D. Robinson1427413566
Nicholas Epley1419265572
David M. Buss1452652280
Naomi I. Eisenberger1440512879
Andrew J. Elliot1448712067
Steven J. Sherman1437592462
Christian S. Crandall1421363959
Kathleen D. Vohs1423453151
Jamie Arndt1423453150
John M. Zelenski1415206976
Jessica L. Tracy1423324371
Gordon B. Moskowitz1427472957
Klaus R. Scherer1441522678
Ayelet Fishbach1321363759
Jennifer A. Richeson1321403352
Charles S. Carver1352811664
Leaf van Boven1318274767
Shelley E. Taylor1244841452
Lee Jussim1217245271
Edward R. Hirt1217264865
Shigehiro Oishi1232522461
Richard E. Nisbett1230432969
Kurt Gray1215186981
Stacey Sinclair1217304157
Niall Bolger1220343658
Paula M. Niedenthal1222363461
Eliot R. Smith1231422973
Tobias Greitemeyer1221313967
Rainer Reisenzein1214215769
Rainer Banse1219264672
Galen V. Bodenhausen1228462661
Ozlem Ayduk1221353459
E. Tory. Higgins1238701754
D. S. Moskowitz1221333663
Dale T. Miller1225393064
Jeanne L. Tsai1217254667
Roger Giner-Sorolla1118225180
Edward P. Lemay1115195981
Ulrich Schimmack1122353263
E. Ashby Plant1118363151
Ximena B. Arriaga1113195869
Janice R. Kelly1115225070
Frank D. Fincham1135601859
David Dunning1130432570
Boris Egloff1121372958
Karl Christoph Klauer1125392765
Caryl E. Rusbult1019362954
Tessa V. West1012205159
Jennifer S. Lerner1013224661
Wendi L. Gardner1015244263
Mark P. Zanna1030621648
Michael Ross1028452262
Jonathan Haidt1031432373
Sonja Lyubomirsky1022382659
Sander L. Koole1018352852
Duane T. Wegener1016273660
Marilynn B. Brewer1027442262
Christopher K. Hsee1020313163
Sheena S. Iyengar1015195080
Laurie A. Rudman1026382568
Joanne V. Wood916263660
Thomas Mussweiler917392443
Shelly L. Gable917332850
Felicia Pratto930402375
Wiebke Bleidorn920273474
Jeff T. Larsen917253667
Nicholas O. Rule923303075
Dirk Wentura920312964
Klaus Rothermund930392376
Joris Lammers911165669
Stephanie A. Fryberg913194766
Robert S. Wyer930471963
Mina Cikara914184980
Tiffany A. Ito914224064
Joel Cooper914352539
Joshua Correll914233862
Peter M. Gollwitzer927461958
Brad J. Bushman932511762
Kennon M. Sheldon932481866
Malte Friese915263357
Dieter Frey923392258
Lorne Campbell914233761
Monica Biernat817292957
Aaron C. Kay814283051
Yaacov Schul815233664
Joseph P. Forgas823392159
Guido H. E. Gendolla814302747
Claude M. Steele813312642
Igor Grossmann815233566
Paul K. Piff810165063
Joshua Aronson813282846
William G. Graziano820302666
Azim F. Sharif815223568
Juliane Degner89126471
Margo J. Monteith818243277
Timothy D. Wilson828451763
Kerry Kawakami813233356
Hilary B. Bergsieker78116874
Gerald L. Clore718391945
Phillip Atiba Goff711184162
Elizabeth W. Dunn717262864
Bernard A. Nijstad716312352
Mark J. Landau713282545
Christopher R. Agnew716213376
Brandon J. Schmeichel714302345
Arie W. Kruglanski728491458
Eric D. Knowles712183864
Yaacov Trope732571257
Wendy Berry Mendes714312244
Jennifer S. Beer714252754
Nira Liberman729451565
Penelope Lockwood710144870
Jeffrey W Sherman721292371
Geoff MacDonald712183767
Eva Walther713193566
Daniel T. Gilbert727411665
Grainne M. Fitzsimons611232849
Elizabeth Page-Gould611164066
Mark J. Brandt612173770
Ap Dijksterhuis620371754
James K. McNulty621331965
Dolores Albarracin618331956
Maya Tamir619292164
Jon K. Maner622431452
Alison L. Chasteen617252469
Jay J. van Bavel621302071
William A. Cunningham619302064
Glenn Adams612173573
Wilhelm Hofmann622331866
Ludwin E. Molina67124961
Lee Ross626421463
Andrea L. Meltzer69134572
Jason E. Plaks610153967
Ara Norenzayan621341761
Batja Mesquita617232573
Tanya L. Chartrand69282033
Toni Schmader518301861
Abigail A. Scholer59143862
C. Miguel Brendl510153568
Emily Balcetis510153568
Diana I. Tamir59153562
Nir Halevy513182972
Alison Ledgerwood58153454
Yoav Bar-Anan514182876
Paul W. Eastwick517242169
Geoffrey L. Cohen513252050
Yuen J. Huo513163180
Benoit Monin516291756
Gabriele Oettingen517351449
Roland Imhoff515212373
Mark W. Baldwin58202441
Ronald S. Friedman58192544
Shelly Chaiken522431152
Kristin Laurin59182651
David A. Pizarro516232069
Michel Tuan Pham518271768
Amy J. C. Cuddy517241972
Gun R. Semin519301564
Laura A. King419281668
Yoel Inbar414202271
Nilanjana Dasgupta412231952
Kerri L. Johnson413172576
Roland Neumann410152867
Richard P. Eibach410221947
Roland Deutsch416231871
Michael W. Kraus413241755
Steven J. Spencer415341244
Gregory M. Walton413291444
Ana Guinote49202047
Sandra L. Murray414251655
Leif D. Nelson416251664
Heejung S. Kim414251655
Elizabeth Levy Paluck410192155
Jennifer L. Eberhardt411172362
Carey K. Morewedge415231765
Lauren J. Human49133070
Chen-Bo Zhong410211849
Ziva Kunda415271456
Geoffrey J. Leonardelli46132848
Danu Anthony Stinson46113354
Kentaro Fujita411182062
Leandre R. Fabrigar414211767
Melissa J. Ferguson415221669
Nathaniel M Lambert314231559
Matthew Feinberg38122869
Sean M. McCrea38152254
David A. Lishner38132563
William von Hippel313271248
Joseph Cesario39191745
Martie G. Haselton316291154
Daniel M. Oppenheimer316261260
Oscar Ybarra313241255
Simone Schnall35161731
Travis Proulx39141962
Spike W. S. Lee38122264
Dov Cohen311241144
Ian McGregor310241140
Dana R. Carney39171553
Mark Muraven310231144
Deborah A. Prentice312211257
Michael A. Olson211181363
Susan M. Andersen210211148
Sarah E. Hill29171352
Michael A. Zarate24141331
Lisa K. Libby25101854
Hans Ijzerman2818946
James M. Tyler1681874
Fiona Lee16101358

References

Open Science Collaboration (OSC). (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. http://dx.doi.org/10.1126/science.aac4716

Radosic, N., & Diener, E. (2021). Citation Metrics in Psychological Science. Perspectives on Psychological Science. https://doi.org/10.1177/1745691620964128

Schimmack, U. (2021). The validation crisis. Meta-Psychology, in press.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Replicability Rankings 2010-2020

Welcome to the replicability rankings for 120 psychology journals. More information about the statistical method that is used to create the replicability rankings can be found elsewhere (Z-Curve; Video Tutorial; Talk; Examples). The rankings are based on automated extraction of test statistics from all articles published in these 120 journals from 2010 to 2020 (data). The results can be reproduced with the R-package zcurve.

To give a brief explanation of the method, I use the journal with the highest ranking and the journal with the lowest ranking as examples. Figure 1 shows the z-curve plot for the 2nd highest ranking journal for the year 2020 (the Journal of Organizational Psychology is ranked #1, but it has very few test statistics). Plots for all journals that include additional information and information about test statistics are available by clicking on the journal name. Plots for previous years can be found on the site for the 2010-2019 rankings (previous rankings).

To create the z-curve plot in Figure 1, the 361 test statistics were first transformed into exact p-values that were then transformed into absolute z-scores. Thus, each value represents the deviation from zero for a standard normal distribution. A value of 1.96 (solid red line) corresponds to the standard criterion for significance, p = .05 (two-tailed). The dashed line represents the threshold for marginal significance, p = .10 (two-tailed). A z-curve analysis fits a finite mixture model to the distribution of the significant z-scores (the blue density distribution on the right side of the solid red line). The distribution provides information about the average power of studies that produced a significant result. As power determines the success rate in future studies, power after selection for significance is used to estimate replicability. For the present data, the z-curve estimate of the replication rate is 84%. The bootstrapped 95% confidence interval around this estimate ranges from 75% to 92%. Thus, we would expect the majority of these significant results to replicate.

However, the graph also shows some evidence that questionable research practices produce too many significant results. The observed discovery rate (i.e., the percentage of p-values below .05) is 82%. This is outside of the 95%CI of the estimated discovery rate, which is represented by the grey line in the range of non-significant results; EDR = 31%, 95%CI = 18% to 81%. We see that fewer non-significant results are reported than z-curve predicts. This finding casts doubt on the replicability of the just significant p-values. The replicability rankings ignore this problem, which means that the predicted success rates are overly optimistic. A more pessimistic predictor of the actual success rate is the EDR. However, the ERR still provides useful information to compare power of studies across journals and over time.

Figure 2 shows a journal with a low ERR in 2020.

The estimated replication rate is 64%, with a 95%CI ranging from 55% to 73%. The 95%CI does not overlap with the 95%CI for the Journal of Sex Research, indicating that this is a significant difference in replicability. Visual inspection also shows clear evidence for the use of questionable research practices with a lot more results that are just significant than results that are not significant. The observed discovery rate of 75% is inflated and outside the 95%CI of the EDR that ranges from 10% to 56%.

To examine time trends, I regressed the ERR of each year on the year and computed the predicted values and 95%CI. Figure 3 shows the results for the journal Social Psychological and Personality Science as an example (x = 0 is 2010, x = 1 is 2020). The upper bound of the 95%CI for 2010, 62%, is lower than the lower bound of the 95%CI for 2020, 74%.

This shows a significant difference with alpha = .01. I use alpha = .01 so that only 1.2 out of the 120 journals are expected to show a significant change in either direction by chance alone. There are 22 journals with a significant increase in the ERR and no journals with a significant decrease. This shows that about 20% of these journals have responded to the crisis of confidence by publishing studies with higher power that are more likely to replicate.
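
Why non-overlapping 95% confidence intervals amount to a criterion close to alpha = .01 can be checked with a rough calculation. The sketch below assumes two independent estimates with equal standard errors and normal sampling distributions, which the regression-based predictions do not guarantee. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def alpha_for_nonoverlap(ci_level: float = 0.95) -> float:
    """Two-sided alpha implied by two confidence intervals that just touch,
    assuming independent estimates with equal standard errors."""
    z_ci = NormalDist().inv_cdf(1 - (1 - ci_level) / 2)  # 1.96 for 95% CIs
    # The intervals touch when the estimates differ by 2 * z_ci * SE,
    # and the standard error of the difference is sqrt(2) * SE.
    z_diff = 2 * z_ci / sqrt(2)
    return 2 * (1 - NormalDist().cdf(z_diff))

print(round(alpha_for_nonoverlap(), 4))
```

Under these assumptions the implied alpha is somewhat below .01, so the non-overlap rule is, if anything, slightly conservative relative to the stated criterion.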

Rank  Journal  Observed 2020  Predicted 2020  Predicted 2010
1Journal of Organizational Psychology88 [69 ; 99]84 [75 ; 93]73 [64 ; 81]
2Journal of Sex Research84 [75 ; 92]84 [74 ; 93]75 [65 ; 84]
3Evolution & Human Behavior84 [74 ; 93]83 [77 ; 90]62 [56 ; 68]
4Judgment and Decision Making81 [74 ; 88]83 [77 ; 89]68 [62 ; 75]
5Personality and Individual Differences81 [76 ; 86]81 [78 ; 83]68 [65 ; 71]
6Addictive Behaviors82 [75 ; 89]81 [77 ; 86]71 [67 ; 75]
7Depression & Anxiety84 [76 ; 91]81 [77 ; 85]67 [63 ; 71]
8Cognitive Psychology83 [75 ; 90]81 [76 ; 87]71 [65 ; 76]
9Social Psychological and Personality Science85 [78 ; 92]81 [74 ; 89]54 [46 ; 62]
10Journal of Experimental Psychology – General80 [75 ; 85]80 [79 ; 81]67 [66 ; 69]
11J. of Exp. Psychology – Learning, Memory & Cognition81 [75 ; 87]80 [77 ; 84]73 [70 ; 77]
12Journal of Memory and Language79 [73 ; 86]80 [76 ; 83]73 [69 ; 77]
13Cognitive Development81 [75 ; 88]80 [75 ; 85]67 [62 ; 72]
14Sex Roles81 [74 ; 88]80 [75 ; 85]72 [67 ; 77]
15Developmental Psychology74 [67 ; 81]80 [75 ; 84]67 [63 ; 72]
16Canadian Journal of Experimental Psychology77 [65 ; 90]80 [73 ; 86]74 [68 ; 81]
17Journal of Nonverbal Behavior73 [59 ; 84]80 [68 ; 91]65 [53 ; 77]
18Memory and Cognition81 [73 ; 87]79 [77 ; 81]75 [73 ; 77]
19Cognition79 [74 ; 84]79 [76 ; 82]70 [68 ; 73]
20Psychology and Aging81 [74 ; 87]79 [75 ; 84]74 [69 ; 79]
21Journal of Cross-Cultural Psychology83 [76 ; 91]79 [75 ; 83]75 [71 ; 79]
22Psychonomic Bulletin and Review79 [72 ; 86]79 [75 ; 83]71 [67 ; 75]
23Journal of Experimental Social Psychology78 [73 ; 84]79 [75 ; 82]52 [48 ; 55]
24JPSP-Attitudes & Social Cognition82 [75 ; 88]79 [69 ; 89]55 [45 ; 65]
25European Journal of Developmental Psychology75 [64 ; 86]79 [68 ; 91]74 [62 ; 85]
26Journal of Business and Psychology82 [71 ; 91]79 [68 ; 90]74 [63 ; 85]
27Psychology of Religion and Spirituality79 [71 ; 88]79 [66 ; 92]72 [59 ; 85]
28J. of Exp. Psychology – Human Perception and Performance79 [73 ; 84]78 [77 ; 80]75 [73 ; 77]
29Attention, Perception and Psychophysics77 [72 ; 82]78 [75 ; 82]73 [70 ; 76]
30Psychophysiology79 [74 ; 84]78 [75 ; 82]66 [62 ; 70]
31Psychological Science77 [72 ; 84]78 [75 ; 82]57 [54 ; 61]
32Quarterly Journal of Experimental Psychology81 [75 ; 86]78 [75 ; 81]72 [69 ; 74]
33Journal of Child and Family Studies80 [73 ; 87]78 [74 ; 82]67 [63 ; 70]
34JPSP-Interpersonal Relationships and Group Processes81 [74 ; 88]78 [73 ; 82]53 [49 ; 58]
35Journal of Behavioral Decision Making77 [70 ; 86]78 [72 ; 84]66 [60 ; 72]
36Appetite78 [73 ; 84]78 [72 ; 83]72 [67 ; 78]
37Journal of Comparative Psychology79 [65 ; 91]78 [71 ; 85]68 [61 ; 75]
38Journal of Religion and Health77 [57 ; 94]78 [70 ; 87]75 [67 ; 84]
39Aggressive Behaviours82 [74 ; 90]78 [70 ; 86]70 [62 ; 78]
40Journal of Health Psychology74 [64 ; 82]78 [70 ; 86]72 [64 ; 80]
41Journal of Social Psychology78 [70 ; 87]78 [70 ; 86]69 [60 ; 77]
42Law and Human Behavior81 [71 ; 90]78 [69 ; 87]70 [61 ; 78]
43Psychological Medicine76 [68 ; 85]78 [66 ; 89]74 [63 ; 86]
44Political Psychology73 [59 ; 85]78 [65 ; 92]59 [46 ; 73]
45Acta Psychologica81 [75 ; 88]77 [74 ; 81]73 [70 ; 76]
46Experimental Psychology73 [62 ; 83]77 [73 ; 82]73 [68 ; 77]
47Archives of Sexual Behavior77 [69 ; 83]77 [73 ; 81]78 [74 ; 82]
48British Journal of Psychology73 [65 ; 81]77 [72 ; 82]74 [68 ; 79]
49Journal of Cognitive Psychology77 [69 ; 84]77 [72 ; 82]74 [69 ; 78]
50Journal of Experimental Psychology – Applied82 [75 ; 88]77 [72 ; 82]70 [65 ; 76]
51Asian Journal of Social Psychology79 [66 ; 89]77 [70 ; 84]70 [63 ; 77]
52Journal of Youth and Adolescence80 [71 ; 89]77 [70 ; 84]72 [66 ; 79]
53Memory77 [71 ; 84]77 [70 ; 83]71 [65 ; 77]
54European Journal of Social Psychology82 [75 ; 89]77 [69 ; 84]61 [53 ; 69]
55Social Psychology81 [73 ; 90]77 [67 ; 86]73 [63 ; 82]
56Perception82 [74 ; 88]76 [72 ; 81]78 [74 ; 83]
57Journal of Anxiety Disorders80 [71 ; 89]76 [72 ; 80]71 [67 ; 75]
58Personal Relationships65 [54 ; 76]76 [68 ; 84]62 [54 ; 70]
59Evolutionary Psychology63 [51 ; 75]76 [67 ; 85]77 [68 ; 86]
60Journal of Research in Personality63 [46 ; 77]76 [67 ; 84]70 [61 ; 79]
61Cognitive Behaviour Therapy88 [73 ; 99]76 [66 ; 86]68 [58 ; 79]
62Emotion79 [73 ; 85]75 [72 ; 79]67 [64 ; 71]
63Animal Behavior79 [72 ; 87]75 [71 ; 80]68 [64 ; 73]
64Group Processes & Intergroup Relations80 [73 ; 87]75 [71 ; 80]60 [56 ; 65]
65JPSP-Personality Processes and Individual Differences78 [70 ; 86]75 [70 ; 79]64 [59 ; 69]
66Psychology of Men and Masculinity88 [77 ; 96]75 [64 ; 87]78 [67 ; 89]
67Consciousness and Cognition74 [67 ; 80]74 [69 ; 80]67 [62 ; 73]
68Personality and Social Psychology Bulletin78 [72 ; 84]74 [69 ; 79]57 [52 ; 62]
69Journal of Cognition and Development70 [60 ; 80]74 [67 ; 81]65 [59 ; 72]
70Journal of Applied Psychology69 [59 ; 78]74 [67 ; 80]73 [66 ; 79]
71European Journal of Personality80 [67 ; 92]74 [65 ; 83]70 [61 ; 79]
72Journal of Positive Psychology75 [65 ; 86]74 [65 ; 83]66 [57 ; 75]
73Journal of Research on Adolescence83 [74 ; 92]74 [62 ; 87]67 [55 ; 79]
74Psychopharmacology75 [69 ; 80]73 [71 ; 75]67 [65 ; 69]
75Frontiers in Psychology75 [70 ; 79]73 [70 ; 76]72 [69 ; 75]
76Cognitive Therapy and Research73 [66 ; 81]73 [68 ; 79]67 [62 ; 73]
77Behaviour Research and Therapy70 [63 ; 77]73 [67 ; 79]70 [64 ; 76]
78Journal of Educational Psychology82 [73 ; 89]73 [67 ; 79]76 [70 ; 82]
79British Journal of Social Psychology74 [65 ; 83]73 [66 ; 81]61 [54 ; 69]
80Organizational Behavior and Human Decision Processes70 [65 ; 77]72 [69 ; 75]67 [63 ; 70]
81Cognition and Emotion75 [68 ; 81]72 [68 ; 76]72 [68 ; 76]
82Journal of Affective Disorders75 [69 ; 83]72 [68 ; 76]74 [71 ; 78]
83Behavioural Brain Research76 [71 ; 80]72 [67 ; 76]70 [66 ; 74]
84Child Development81 [75 ; 88]72 [66 ; 78]68 [62 ; 74]
85Journal of Abnormal Psychology71 [60 ; 82]72 [66 ; 77]65 [60 ; 71]
86Journal of Vocational Behavior70 [59 ; 82]72 [65 ; 79]84 [77 ; 91]
87Journal of Experimental Child Psychology72 [66 ; 78]71 [69 ; 74]72 [69 ; 75]
88Journal of Consulting and Clinical Psychology81 [73 ; 88]71 [64 ; 78]62 [55 ; 69]
89Psychology of Music78 [67 ; 86]71 [64 ; 78]79 [72 ; 86]
90Behavior Therapy78 [69 ; 86]71 [63 ; 78]70 [63 ; 78]
91Journal of Occupational and Organizational Psychology66 [51 ; 79]71 [62 ; 80]87 [79 ; 96]
92Journal of Happiness Studies75 [65 ; 83]71 [61 ; 81]79 [70 ; 89]
93Journal of Occupational Health Psychology77 [65 ; 90]71 [58 ; 83]65 [52 ; 77]
94Journal of Individual Differences77 [62 ; 92]71 [51 ; 90]74 [55 ; 94]
95Frontiers in Behavioral Neuroscience70 [63 ; 76]70 [66 ; 75]66 [62 ; 71]
96Journal of Applied Social Psychology76 [67 ; 84]70 [63 ; 76]70 [64 ; 77]
97British Journal of Developmental Psychology72 [62 ; 81]70 [62 ; 79]76 [67 ; 85]
98Journal of Social and Personal Relationships73 [63 ; 81]70 [60 ; 79]69 [60 ; 79]
99Behavioral Neuroscience65 [57 ; 73]69 [64 ; 75]69 [63 ; 75]
100Psychology and Marketing71 [64 ; 77]69 [64 ; 74]67 [63 ; 72]
101Journal of Family Psychology71 [59 ; 81]69 [63 ; 75]62 [56 ; 68]
102Journal of Personality71 [57 ; 85]69 [62 ; 77]64 [57 ; 72]
103Journal of Consumer Behaviour70 [60 ; 81]69 [59 ; 79]73 [63 ; 83]
104Motivation and Emotion78 [70 ; 86]69 [59 ; 78]66 [57 ; 76]
105Developmental Science67 [60 ; 74]68 [65 ; 71]65 [63 ; 68]
106International Journal of Psychophysiology67 [61 ; 73]68 [64 ; 73]64 [60 ; 69]
107Self and Identity80 [72 ; 87]68 [60 ; 76]70 [62 ; 78]
108Journal of Counseling Psychology57 [41 ; 71]68 [55 ; 81]79 [66 ; 92]
109Health Psychology63 [50 ; 73]67 [62 ; 72]67 [61 ; 72]
110Hormones and Behavior67 [58 ; 73]66 [63 ; 70]66 [62 ; 70]
111Frontiers in Human Neuroscience68 [62 ; 75]66 [62 ; 70]76 [72 ; 80]
112Annals of Behavioral Medicine63 [53 ; 75]66 [60 ; 71]71 [65 ; 76]
113Journal of Child Psychology and Psychiatry and Allied Disciplines58 [45 ; 69]66 [55 ; 76]63 [53 ; 73]
114Infancy77 [69 ; 85]65 [56 ; 73]58 [50 ; 67]
115Biological Psychology64 [58 ; 70]64 [61 ; 67]66 [63 ; 69]
116Social Development63 [54 ; 73]64 [56 ; 72]74 [66 ; 82]
117Developmental Psychobiology62 [53 ; 70]63 [58 ; 68]67 [62 ; 72]
118Journal of Consumer Research59 [53 ; 67]63 [55 ; 71]58 [50 ; 66]
119Psychoneuroendocrinology63 [53 ; 72]62 [58 ; 66]61 [57 ; 65]
120Journal of Consumer Psychology64 [55 ; 73]62 [57 ; 67]60 [55 ; 65]

Men are created equal, p-values are not.

Is there still something new to say about p-values? Yes, there is. Most discussions of p-values focus on a scenario where a researcher tests a new hypothesis, computes a p-value, and now has to interpret the result. The status quo follows Fisher’s 100-year-old approach of comparing the p-value to a value of .05. If the p-value is below .05 (two-sided), the inference is that the population effect size deviates from zero in the same direction as the observed effect in the sample. If the p-value is greater than .05, the results are deemed inconclusive.

This approach to the interpretation of the data assumes that we have no other information about our hypothesis or that we do not trust this information sufficiently to incorporate it in our inference about the population effect size. Over the past decade, Bayesian psychologists have argued that we should replace p-values with Bayes-Factors. The advantage of Bayes-Factors is that they can incorporate prior information to draw inferences from data. However, if no prior information is available, the use of Bayesian statistics may cause more harm than good. To use priors without prior information, Bayes-Factors are computed with generic, default priors that are not based on any information about a research question. Along with other problems of Bayes-Factors, this is not an appealing solution to the problem of p-values.

Here I introduce a new approach to the interpretation of p-values that has been called empirical Bayesian and has been successfully applied in genomics to control the field-wise false positive rate. That is, prior information does not rest on theoretical assumptions or default values, but rather on prior empirical information. The information that is used to interpret a new p-value is the distribution of prior p-values.

P-value distributions

Every study is a new study because it relies on a new sample of participants that produces sampling error that is independent of the previous studies. However, studies are not independent in other characteristics. A researcher who conducted a study with N = 40 participants is likely to have used similar sample sizes in previous studies. And a researcher who used N = 200 is also likely to have used larger sample sizes in previous studies. Researchers are also likely to use similar designs. Social psychologists, for example, prefer between-subject designs to better deceive their participants. Cognitive psychologists care less about deception and study simple behaviors that can be repeated hundreds of times within an hour. Thus, researchers who used a between-subject design are likely to have used a between-subject design in previous studies and researchers who used a within-subject design are likely to have used a within-subject design before. Researchers may also be chasing different effect sizes. Finally, researchers can differ in their willingness to take risks. Some may only test hypotheses that are derived from prior theories that have a high probability of being correct, whereas others may be willing to shoot for the moon. All of these consistent differences between researchers (i.e., sample size, effect size, research design) influence the unconditional statistical power of their studies, which is defined as the long-run probability of obtaining significant results, p < .05.

Over the past decade, in the wake of the replication crisis, interest in the distribution of p-values has increased dramatically. For example, one approach uses the distribution of significant p-values, which is known as p-curve analysis (Simonsohn et al., 2014). If p-values were obtained with questionable research practices when the null-hypothesis is true (p-hacking), the distribution of significant p-values is flat. Thus, if the distribution is monotonically decreasing from 0 to .05, the data have evidential value. Although p-curve analysis has been extended to estimate statistical power, simulation studies show that the p-curve algorithm is systematically biased when power varies across studies (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020).

As shown in simulation studies, a better way to estimate power is z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Here I show how z-curve analyses of prior p-values can be used to demonstrate that p-values from one researcher are not equal to p-values of other researchers when we take their prior research practices into account. By using this prior information, we can adjust the alpha level of individual researchers to take their research practices into account. To illustrate this use of z-curve, I start with an illustration of how different research practices influence p-value distributions.

Scenario 1: P-hacking

In the first scenario, we assume that a researcher only tests false hypotheses (i.e., the null-hypothesis is always true; Bem, 2011; Simonsohn et al., 2011). In theory, it would be easy to spot false positives because replication studies would produce 19 non-significant results for every significant one and significant ones would have different signs. However, questionable research practices lead to a pattern of results where only significant results in one direction are reported, which is the norm in psychology (Sterling, 1959; Sterling et al., 1995; Schimmack, 2012).

In a z-curve analysis, p-values are first converted into z-scores, z = -qnorm(p/2) with qnorm being the inverse normal function and p being a two-sided p-value. A z-curve plot shows the histogram of all z-scores, including non-significant ones (Figure 1).
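
The conversion can be sketched in a few lines. The following Python snippet mirrors the R expression above, using the standard library's inverse normal function in place of qnorm. The function name is hypothetical and the snippet is only an illustration of the transformation, not part of the zcurve package.

```python
from statistics import NormalDist

def p_to_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    if not 0 < p_two_sided <= 1:
        raise ValueError("p must be in (0, 1]")
    # Same computation as z = -qnorm(p/2) in R
    return -NormalDist().inv_cdf(p_two_sided / 2)

for p in (0.10, 0.05, 0.005):
    print(p, round(p_to_z(p), 2))
```

A p-value of .05 maps onto the significance criterion of z = 1.96, and smaller p-values map onto larger z-scores.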

Visual inspection of the z-curve plot shows that all 200 p-values are significant (on the right side of the criterion value z = 1.96). It also shows that the mode of the distribution is at the significance criterion. Most important, visual inspection shows a steep drop from the mode to the range of non-significant values. That is, while z = 1.96 is the most common value, z = 1.95 is never observed. This drop provides direct visual information that questionable research practices were used because normal sampling error cannot produce such dramatic changes in the distribution.

I am skipping the technical details of how the z-curve model is fitted to the distribution of z-scores (Bartos & Schimmack, 2020). It is sufficient to know that the model is fitted to the distribution of significant z-scores with a limited number of model parameters that are equally spaced over the range of z-scores from 0 to 6 (7 parameters, z = 0, z = 1, z = 2, …, z = 6). The model gives different weights to these parameters to match the observed distribution. Based on these estimates, z-curve.2.0 computes several statistics that can be used to interpret single p-values that have been published or future p-values by the same researcher, assuming that the same research practices are used.

The most important statistic is the expected discovery rate (EDR), which corresponds to the average power of all studies that were conducted by a researcher. Importantly, the EDR is an estimate that is based on only the significant results, but makes predictions about the number of non-significant results. In this example with 200 p-values, the EDR is 7%. Of course, we know that it really is only 5% because the expected discovery rate for true null-hypotheses that are tested with alpha = .05 is 5%. However, sampling error can introduce biases in our estimates. Nevertheless, even with only 200 observations, the estimate of 7% is relatively close to 5%. Thus, z-curve tells us something important about the way these p-values were obtained. They were obtained in studies with very low power that is close to the criterion value for a false positive result.

Z-curve uses bootstrap to compute confidence intervals around the point estimate of the EDR. The 95%CI ranges from 5% to 18%. As the interval includes 5%, we cannot reject the hypothesis that all tests were false positives (which in this scenario is also the correct conclusion). At the upper end we can see that mean power is low, even if some true hypotheses are being tested.

The EDR can be used for two purposes. First, it can be used to examine the extent of selection for significance by comparing the EDR to the observed discovery rate (ODR; Schimmack, 2012). The ODR is simply the percentage of significant results that was observed in the sample of p-values. In this case, this is 200 out of 200 or 100%. The discrepancy between the EDR of 7% and 100% is large and 100% is clearly outside the 95%CI of the EDR. Thus, we have strong evidence that questionable research practices were used, which we know to be true in this simulation because the 200 tests were selected from a much larger sample of 4,000 tests.

Most important for the use of z-curve to interpret p-values is the ability to estimate the maximum False Discovery Rate (Soric, 1989). The false discovery rate is the percentage of significant results that are false positives or type-I errors. The false discovery rate is often confused with alpha, the long-run probability of making a type-I error. The significance criterion ensures that no more than 5% of significant and non-significant results are false positives. When we test 4,000 false hypotheses (i.e., the null-hypothesis is true) we are not going to have more than 5% (4,000 * .05 = 200) false positive results. This is true in general and it is true in this example. However, when only significant results are published, it is easy to make the mistake of assuming that no more than 5% of the published 200 results are false positives. This would be wrong because the 200 were selected to be significant and they are all false positives.

The false discovery rate is the percentage of significant results that are false positives. It no longer matters whether non-significant results are published or not. We are only concerned with the population of p-values that are below .05 (z > 1.96). In our example, the question is how many of the 200 significant results could be false positives. Soric (1989) demonstrated that the EDR limits the number of false positive discoveries. The more discoveries there are, the lower is the risk that discoveries are false. Using a simple formula, we can compute the maximum false discovery rate from the EDR.

FDR = (1/EDR – 1) * (.05/.95), with alpha = .05

With an EDR of 7%, we obtained a maximum FDR of 68%. We know that the true FDR is 100%, thus, the estimate is too low. However, the reason is that sampling error can have dramatic effects on the FDR estimates when the EDR is low. With an EDR of 6%, the FDR estimate goes up to 82% and with an EDR estimate of 5% it is 100%. To take account of this uncertainty, we can use the 95%CI of the EDR to compute a 95%CI for the FDR estimate, 24% to 100%. Now we see that we cannot rule out that the FDR is 100%.
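The mapping from EDR to maximum FDR can be checked with a few lines of Python. This is a minimal sketch of Soric's bound as used above; the function name `soric_max_fdr` and the cap at 100% are my own choices for illustration, not part of the z-curve package. Note that plugging in the rounded EDR of 7% gives about 70% rather than 68%, which presumably reflects that the reported figure was computed from the unrounded estimate.

```python
def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Upper bound on the false discovery rate implied by a discovery rate (Soric, 1989)."""
    if not 0 < edr <= 1:
        raise ValueError("edr must be in (0, 1]")
    bound = (1 / edr - 1) * (alpha / (1 - alpha))
    # Cap at 100%: an EDR at or below alpha would otherwise give a value above 1.
    return min(bound, 1.0)


# Point estimates discussed in the text, plus the upper CI limit of the EDR (18%).
for edr in (0.05, 0.06, 0.07, 0.18):
    print(f"EDR = {edr:.0%} -> maximum FDR = {soric_max_fdr(edr):.0%}")
```

Applying the function to the two limits of the EDR confidence interval (5% and 18%) reproduces the FDR interval of 24% to 100%.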

In short, scenario 1 introduced the use of p-value distributions to provide useful information about the risk that the published results are false discoveries. In this extreme example, we can dismiss the published p-values as inconclusive or as lacking in evidential value.

Scenario 2: The Typical Social Psychologist

It is difficult to estimate the typical effect size in a literature. However, a meta-analysis of meta-analyses suggested that the average effect size in social psychology is Cohen’s d = .4 (Richard et al., 2003). A smaller set of replication studies that did not select for significance estimated an effect size of d = .3 for social psychology (d = .2 for JPSP, d = .4 for Psych Science; Open Science Collaboration, 2015). The latter estimate may include an unknown number of hypotheses where the null-hypothesis is true and the true effect size is zero. Thus, I used d = .4 as a reasonable effect size for true hypotheses in social psychology (see also LeBel, Campbell, & Loving, 2017).

It is also known that a rule of thumb in experimental social psychology was to allocate n = 20 participants to a condition, resulting in a sample size of N = 40 in studies with two groups. In a 2 x 2 design, the main effect would be tested with N = 80. However, to keep this scenario simple, I used d = .4 and N = 40 for true effects. This affords 23% power to obtain a significant result.
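The power figure can be approximated without special software. The sketch below uses a normal approximation to the two-group t-test, so it is an assumption-laden shortcut rather than the exact calculation; the function name `approx_power_two_groups` is hypothetical. With small samples the approximation runs slightly above the t-test-based value, so it lands near, not exactly on, 23%.

```python
from statistics import NormalDist


def approx_power_two_groups(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Two-sided power for comparing two equal-sized groups (normal approximation)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = .05
    ncp = d * (n_per_group / 2) ** 0.5        # expected z-score of the test
    # Probability of landing beyond either critical value.
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


print(f"d = .4, N = 40:  {approx_power_two_groups(0.4, 20):.2f}")
print(f"d = .4, N = 200: {approx_power_two_groups(0.4, 100):.2f}")
```

The second call shows how much power the same effect size affords with n = 100 per group.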

Finkel, Eastwick, and Reis (2017) argued that power of 25% is optimal if 75% of the hypotheses that are being tested are true. However, the assumption that 75% of hypotheses are true may be on the optimistic side. Wilson and Wixted (2018) suggested that the false discovery risk is closer to 50%. With 23% power for true hypotheses, this implies a false discovery rate of Given uncertainty about the actual false discovery rate in social psychology, I used a scenario with 50% true and 50% false hypotheses.

I kept the number of significant results at 200. To obtain 200 significant results with an equal number of true and false hypotheses, we need 1,428 tests. The 714 true hypotheses contribute 714*.23 = 164 true positives and the 714 false hypotheses produce 714*.05 = 36 false positive results; 164 + 36 = 200. This implies a false discovery rate of 36/200 = 18%. The true EDR is (714*.23+714*.05)/(714+714) = 14%.
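The arithmetic of this scenario can be reproduced by working backwards from the target of 200 significant results. This is a small sketch; the function name `selection_scenario` and its parameters are hypothetical labels for the quantities in the paragraph above. The unrounded number of tests is about 1,428.6, which the text rounds to 1,428 (714 + 714).

```python
def selection_scenario(n_significant: float, power: float,
                       share_true: float = 0.5, alpha: float = 0.05) -> dict:
    """Work backwards from a target number of significant results."""
    # Discovery rate across all tests, true and false hypotheses combined.
    edr = share_true * power + (1 - share_true) * alpha
    n_tests = n_significant / edr
    true_pos = n_tests * share_true * power
    false_pos = n_tests * (1 - share_true) * alpha
    return {
        "tests": n_tests,
        "true_positives": true_pos,
        "false_positives": false_pos,
        "fdr": false_pos / n_significant,
        "edr": edr,
    }


result = selection_scenario(200, power=0.23)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```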

The z-curve plot looks very similar to the previous plot, but they are not identical. Although the EDR estimate is higher, its 95%CI still includes 5%. The maximum FDR is well above the actual FDR of 18%, but the 95%CI includes the actual value of 18%.

A notable difference between Figure 1 and Figure 2 is the expected replication rate (ERR), which corresponds to the average power of significant p-values. It is called the expected replication rate because it predicts the percentage of significant results if the studies that were selected for significance were replicated exactly (Brunner & Schimmack, 2020). When power is heterogeneous, power of the studies with significant results is higher than power of studies with non-significant results (Brunner & Schimmack, 2020). In this case, with only two power values, the reason is that false positives have a much lower chance to be significant (5%) than true positives (23%). As a result, the average power of significant studies is higher than the average power of all studies. In this simulation, the true average power of significant studies is the weighted average of true and false positives with significant results, (164*.23 + 36*.05)/(164 + 36) = 20%. Z-curve perfectly estimated this value.
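The relation between the two averages can be made concrete with the known composition of the 200 significant results. The sketch below is plain algebra on the simulation's true values, not the fitting procedure of the z-curve package; `err_and_edr` is a hypothetical helper. The ERR is the weighted mean power of the significant results. For the EDR, each significant result from a study with power p stands for 1/p studies that were run, so the EDR is the number of significant results divided by the implied number of attempts.

```python
def err_and_edr(counts, powers):
    """counts[i] significant results come from studies with true power powers[i]."""
    total = sum(counts)
    weights = [c / total for c in counts]
    # Average power after selection for significance.
    err = sum(w * p for w, p in zip(weights, powers))
    # Average power before selection: significant results / implied attempts.
    edr = 1 / sum(w / p for w, p in zip(weights, powers))
    return err, edr


err, edr = err_and_edr(counts=[164, 36], powers=[0.23, 0.05])
print(f"ERR = {err:.0%}, EDR = {edr:.0%}")
```

With 164 true positives at 23% power and 36 false positives at 5%, this recovers the ERR of 20% and the EDR of 14% given in the text.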

Importantly, the 95% CI of the ERR, 11% to 34%, does not include 5%. Thus, we can reject the null-hypothesis that all of the significant results are false positives based on the ERR. In other words, the significant results have evidential value. However, we do not know the composition of this average. It could be a large percentage of false positives and a few true hypotheses with high power or it could be many true positives with low power. We also do not know which of the 200 significant results is a true positive or a false positive. Thus, we would need to conduct replication studies to distinguish between true and false hypotheses. And given the low power, we would only have a 23% chance of successfully replicating a true positive result. This is exactly what happened with the reproducibility project. And the inconsistent results led to debates and require further replications. Thus, we have real-world evidence of how uninformative p-values are when they are obtained this way.

Social psychologists might argue that the use of small samples is justified because most hypotheses in psychology are true. Thus, we can use prior information to assume that significant results are true positives. However, this logic fails when social psychologists test false hypotheses. In this case, the observed distribution of p-values (Figure 1) is not that different from the distribution that is observed when most significant results are true positives that were obtained with low power (Figure 2). Thus, it is doubtful that this is really an optimal use of resources (Finkel et al., 2015). However, until recently this was the way experimental social psychologists conducted their research.

Scenario 3: Cohen’s Way

In 1962 (!), Cohen conducted a meta-analysis of statistical power in social psychology. The main finding was that studies had only a 50% chance to get significant results with a median effect size of d = .5. Cohen (1988) also recommended that researchers should plan studies to have 80% power. However, this recommendation was ignored.

To achieve 80% power with d = .4, researchers need N = 200 participants. Thus, the number of studies is reduced from 5 studies with N = 40 to one study with N = 200. As Finkel et al. (2017) point out, we can make more discoveries with many small studies than a few large ones. However, this ignores that the results of the small studies are difficult to replicate. This was not a concern when social psychologists did not bother to test whether their discoveries are false discoveries or whether they can be replicated. The replication crisis shows the problems of this approach. Now we have results from decades of research that produced significant p-values without providing any information whether these significant results are true or false discoveries.

Scenario 3 examines what social psychology would look like today, if social psychologists had listened to Cohen. The scenario is the same as in the second scenario, including publication bias. There are 50% false hypotheses and 50% true hypotheses with an effect size of d = .4. The only difference is that researchers used N = 200 to test their hypotheses to achieve 80% power.

With 80% power, we need 470 tests (compared to 1,428 in Scenario 2) to produce 200 significant results, 235*.80 + 235*.05 = 188 + 12 = 200. Thus, the EDR is 200/470 = 43%. The true false discovery rate is 12/200 = 6%. The expected replication rate is (188*.80 + 12*.05)/200 = 76%. Thus, we see that higher power increases replicability from 20% to 76% and lowers the false discovery rate from 18% to 6%.

Figure 3 shows the z-curve plot. Visual inspection shows that Figure 3 looks very different from Figures 1 and 2. The estimates are also different. In this example, sampling error inflated the EDR to be 58%, but the 95%CI includes the true value of 43%. The 95%CI does not include the ODR. Thus, there is evidence for publication bias, which is also visible by the steep drop in the distribution at 1.96.

Even with a low EDR of 20%, the maximum FDR is only 21%. Thus, we can conclude with confidence that at least 79% of the significant results are true positives. Remember, in the previous scenario, we could not rule out that most results are false positives. Moreover, the estimated replication rate is 73%, which underestimates the true replication rate of 76%, but the 95%CI includes the true value, 95%CI = 61% – 84%. Thus, if these studies were replicated, we would have a high success rate for actual replication studies.

Just imagine for a moment what social psychology might look like in a parallel universe where social psychologists followed Cohen’s advice. Why didn’t they? The reason is that they did not have z-curve. All they had was p < .05, and using p < .05, all three scenarios are identical. All three scenarios produced 200 significant results. Moreover, as Finkel et al. (2015) pointed out, smaller samples produce 200 significant results quicker than large samples. An additional advantage of small samples is that they inflate point estimates of the population effect size. Thus, the social psychologists with the smallest samples could brag about the biggest (illusory) effect sizes as long as nobody was able to publish replication studies with larger samples that deflated effect sizes of d = .8 to d = .08 (Joy-Gaba & Nosek, 2010).

This game is over, but social psychology – and other social sciences – have published thousands of significant p-values, and nobody knows whether they were obtained using scenario 1, 2, or 3, or probably a combination of these. This is where z-curve can make a difference. P-values are no longer equal when they are considered as a data point from a p-value distribution. In scenario 1, a p-value of .01 and even a p-value of .001 has no meaning. In contrast, in scenario 3 even a p-value of .02 is meaningful and more likely to reflect a true positive than a false positive result. This means that we can use z-curve analyses of published p-values to distinguish between probably false and probably true positives.

I illustrate this with three concrete examples from a project that examined the p-value distributions of over 200 social psychologists (Schimmack, in preparation). The first example has the lowest EDR in the sample. The EDR is 11% and because there are only 210 tests, the 95%CI is wide and includes 5%.

The maximum FDR estimate is high with 41% and the 95%CI includes 100%. This suggests that we cannot rule out the hypothesis that most significant results are false positives. However, the replication rate is 57% and the 95%CI, 45% to 69%, does not include 5%. Thus, some tests tested true hypotheses, but we do not know which ones.

Visual inspection of the plot shows a different distribution than Figure 2. There are more just significant p-values, z = 2.0 to 2.2 and more large z-scores (z > 4). This shows more heterogeneity in power. A comparison of the ODR with the EDR shows that the ODR falls outside the 95%CI of the EDR. This is evidence of publication bias or the use of questionable research practices. One solution to the presence of publication bias is to lower the criterion for statistical significance. As a result, the large number of just significant results is no longer significant and the ODR decreases. This is a post-hoc correction for publication bias. For example, we can lower alpha to .005.

As expected, the ODR decreases considerably from 70% to 39%. In contrast, the EDR increases. The reason is that many questionable research practices produce a pile of just significant p-values. As these values are no longer used to fit the z-curve, it predicts far fewer non-significant p-values. The model now underestimates z-scores between 2 and 2.2. However, these values do not seem to come from a sampling distribution. Rather they stick out like a tower. By excluding them, the p-values that are still significant with alpha = .005 look more credible. Thus, we can correct for the use of QRPs by lowering alpha and by examining whether these p-values produced interesting discoveries. At the same time, we can ignore the p-values between .05 and .005 and await replication studies to provide empirical evidence whether these hypotheses receive empirical support.

The second example was picked because it was close to the median EDR (33%) and ERR (66%) in the sample of 200 social psychologists.

The larger sample of tests (k = 1,529) helps to obtain more precise estimates. A comparison of the ODR, 76%, and the 95%CI of the EDR, 12% to 48%, shows that publication bias is present. However, with an EDR of 33%, the maximum FDR is only 11% and the upper limit of the 95%CI is 39%. Thus, we can conclude with confidence that fewer than 50% of the significant results are false positives. However, numerous findings might still be false positives. Only replication studies can provide this information.

In this example, lowering alpha to .005 did not align the ODR and the EDR. This suggests that these values come from a sampling distribution where non-significant results were not published. Thus, there is no simple fix by adjusting the significance criterion. In this situation, we can conclude that the published p-values are unlikely to be false positives, but that replication studies are needed to ensure that published significant results are not false positives.

The third example is the social psychologist with the highest EDR. In this case, the EDR is actually a little bit lower than the ODR, suggesting that there is no publication bias. The high EDR also means that the maximum FDR is very small and even the upper limit of the 95%CI is only 7%.

Another advantage of data without publication bias is that it is not necessary to exclude non-significant results from the analysis. Fitting the model to all p-values produces much tighter estimates of the EDR and the maximum FDR.

The upper limit of the 95%CI for the FDR is now 4%. Thus, we conclude that no more than 5% of the p-values less than .05 are false positives. Even p = .02 is unlikely to be a false positive. Finally, the estimated replication rate is 84% with a tight confidence interval ranging from 78% to 90%. Thus, most of the published p-values are expected to replicate in an exact replication study.

I hope these examples make it clear how useful it can be to evaluate single p-values with prior information about the p-value distribution of a lab. As labs differ in their research practices, significant p-values are also different. Only if we ignore the research context and focus on a single result, p = .02 equals p = .02. But once we see the broader distribution, p-values of .02 can provide stronger evidence against the null-hypothesis than p-values of .002.

Implications

Cohen tried and failed to change the research culture of social psychologists. Meta-psychological articles have puzzled over why meta-analyses of power failed to increase power (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989). Finkel et al. (2015) provided an explanation. In a game where the winner publishes as many significant results as possible, the optimal strategy is to conduct as many studies as possible with low power. This strategy continues to be rewarded in psychology, where jobs, promotions, grants, and pay raises are based on the number of publications. Cohen (1990) said less is more, but that is not true in a science that does not self-correct and treats every p-value less than .05 as a discovery.

To improve psychology as a science, we need to change the incentive structure and author-wise z-curve analyses can do this. Rather than using p < .05 (or p < .005) as a general rule to claim discoveries, claims of discoveries can be adjusted to the research practices of a researcher. As demonstrated here, this will reward researchers who follow Cohen’s rules and punish those who use questionable practices to produce p-values less than .05 (or Bayes-Factors > 3) without evidential value. And maybe, there is a badge for credible p-values one day.

(incomplete) References

Richard, F. D., Bond, C. F., Jr., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363. http://dx.doi.org/10.1037/1089-2680.7.4.331

“Psychological Science” in 2020

Psychological Science is the flagship journal of the Association for Psychological Science (APS). In response to the replication crisis, D. Stephen Lindsay worked hard to increase the credibility of results published in this journal as editor from 2014-2019 (Schimmack, 2020). This work paid off and meta-scientific evidence shows that publication bias decreased and replicability increased (Schimmack, 2020). In the replicability rankings, Psychological Science is one of a few journals that show reliable improvement over the past decade (Schimmack, 2020).

This year, Patricia J. Bauer took over as editor. Some meta-psychologists were concerned that replicability might be less of a priority because she did not embrace initiatives like preregistration (New Psychological Science Editor Plans to Further Expand the Journal’s Reach).

The good news is that these concerns were unfounded. The meta-scientific criteria of credibility did not change notably from 2019 to 2020.

The observed discovery rates were 64% in 2019 and 66% in 2020. The estimated discovery rates were 58% in 2019 and 59% in 2020. Visual inspection of the z-curves and the slightly higher ODR than EDR suggest that there is still some selection for significant results. That is, researchers use so-called questionable research practices to produce statistically significant results. However, the magnitude of these questionable research practices is small and much lower than in 2010 (ODR = 77%, EDR = 38%).

Based on the EDR, it is possible to estimate the maximum false discovery rate (i.e., the percentage of significant results where the null-hypothesis is true). This rate is low with 4% in both years. Even the upper limit of the 95%CI is only 12%. This contradicts the widespread concern that most published (significant) results are false (Ioannidis, 2005).

The expected replication rate is slightly, but not significantly (i.e., it could be just sampling error) lower in 2020 (76% vs. 83%). Given the small risk of a false positive result, this means that on average significant results were obtained with the recommended power of 80% (Cohen, 1988).

Overall, these results suggest that published results in Psychological Science are credible and replicable. However, this positive evaluation comes with a few caveats.

First, null-hypothesis significance testing can only provide information that there is an effect and the direction of the effect. It cannot provide information about the effect size. Moreover, it is not possible to use the point estimates of effect sizes in small samples to draw inferences about the actual population effect size. Often the 95% confidence interval will include small effect sizes that may have no practical significance. Readers should clearly evaluate the lower limit of the 95%CI to examine whether a practically significant effect was demonstrated.

Second, the replicability estimate of 80% is an average. The average power of results that are just significant is lower. The local power estimates below the x-axis suggest that results with z-scores between 2 and 3 (p < .05 & p > .005) have only 50% power. It is recommended to increase sample sizes for follow-up studies.

Third, the local power estimates also show that most non-significant results are false negatives (type-II errors). Z-scores between 1 and 2 are estimated to have 40% average power. It is unclear how often articles falsely infer that an effect does not exist or can be ignored because the test was not significant. Often sampling error alone is sufficient to explain differences between test statistics in the range from 1 to 2 and from 2 to 3.

Finally, 80% power is sufficient for a single focal test. However, with 80% power, multiple focal tests are likely to produce at least one non-significant result. If all focal tests are significant, there is a concern that questionable research practices were used (Schimmack, 2012).
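A quick calculation shows how fast the chance of an all-significant set of results shrinks. This is a sketch that assumes independent focal tests with the same power, which real articles will only approximate; `prob_all_significant` is a hypothetical helper.

```python
def prob_all_significant(power: float, k: int) -> float:
    """Probability that k independent tests with equal power are all significant."""
    return power ** k


for k in (1, 2, 3, 5, 10):
    p_all = prob_all_significant(0.80, k)
    print(f"k = {k:2d}: all significant = {p_all:.2f}, "
          f"at least one non-significant = {1 - p_all:.2f}")
```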

Readers should also carefully examine the results of individual articles. The present results are based on automatic extraction of all statistical tests. If focal tests have only p-values in the range between .05 and .005, the results are less credible than if at least some p-values are below .005 (Schimmack, 2020).

In conclusion, Psychological Science has responded to concerns about a high rate of false positive results by increasing statistical power and reducing publication bias. This positive trend continued in 2020 under the leadership of the new editor Patricia Bauer.

A Meta-Scientific Perspective on “Thinking: Fast and Slow”

2011 was an important year in the history of psychology, especially social psychology. First, it became apparent that one social psychologist had faked results for dozens of publications (https://en.wikipedia.org/wiki/Diederik_Stapel). Second, a highly respected journal published an article with the incredible claim that humans can foresee random events in the future, if they are presented without awareness (https://replicationindex.com/2018/01/05/bem-retraction/). Third, Nobel Laureate Daniel Kahneman published a popular book that reviewed his own work, but also many findings from social psychology (https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow).

It is likely that Kahneman’s book, or at least some of his chapters, would be very different from the actual book, if it had been written just a few years later. However, in 2011 most psychologists believed that most published results in their journals can be trusted. This changed when Bem (2011) was able to provide seemingly credible scientific evidence for paranormal phenomena nobody was willing to believe. It became apparent that even articles with several significant statistical results could not be trusted.

Kahneman also started to wonder whether some of the results that he used in his book were real. A major concern was that implicit priming results might not be replicable. Implicit priming assumes that stimuli that are presented outside of awareness can still influence behavior (e.g., you may have heard the fake story that a movie theater owner flashed a picture of a Coke bottle on the screen and that everybody rushed to the concession stand to buy a Coke without knowing why they suddenly wanted one). In 2012, Kahneman wrote a letter to the leading researcher of implicit priming studies, expressing his doubts about priming results, that attracted a lot of attention (Young, 2012).

Several years later, it has become clear that the implicit priming literature is not trustworthy and that many of the claims in Kahneman’s Chapter 4 are not based on solid empirical foundations (Schimmack, Heene, & Kesavan, 2017). Kahneman acknowledged this in a comment on our work (Kahneman, 2017).

We initially planned to present our findings for all chapters in more detail, but we got busy with other things. However, once in a while I am getting inquiries about the other chapters (Engber). So, I am using some free time over the holidays to give a brief overview of the results for all chapters.

The Replicability Index (R-Index) is based on two statistics (Schimmack, 2016). One statistic is simply the percentage of significant results. In a popular book that discusses discoveries, this value is essentially 100%. The problem with selecting significant results from a broader literature is that significance alone, p < .05, does not provide sufficient information about true versus false discoveries. It also does not tell us how replicable a result is. Information about replicability can be obtained by converting the exact p-value into an estimate of statistical power. For example, p = .05 implies 50% power and p = .005 implies 80% power with alpha = .05. This is a simple mathematical transformation. As power determines the probability of a significant result, it also predicts the probability of a successful replication. A study with p = .005 is more likely to replicate than a study with p = .05.
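The transformation from a p-value to a power estimate can be written in a few lines. This is a minimal sketch that assumes two-sided p-values from a z-test and treats the observed z-score as the true non-centrality; the function name `observed_power` is my own label.

```python
from statistics import NormalDist


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Power implied by taking the observed z-score as the true effect."""
    norm = NormalDist()
    z_obs = norm.inv_cdf(1 - p_value / 2)   # two-sided p-value -> absolute z-score
    z_crit = norm.inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = .05
    # Chance that a new z-score drawn around z_obs exceeds the criterion
    # (the negligible probability in the opposite tail is ignored).
    return 1 - norm.cdf(z_crit - z_obs)


for p in (0.05, 0.005):
    print(f"p = {p}: observed power = {observed_power(p):.2f}")
```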

There are two problems with point-estimates of power. One problem is that p-values are highly variable, which also produces high variability / uncertainty in power estimates. With a single p-value, the actual power could range pretty much from the minimum of .05 to the maximum of 1 for most power estimates. This problem is reduced in a meta-analysis of p-values. As more values become available, the average power estimate is closer to the actual average power.

The second problem is that selection of significant results (e.g., to write a book about discoveries) inflates power estimates. This problem can be addressed by comparing the success rate or discovery rate (i.e., the percentage of significant results) with the average power. Without publication bias, the discovery rate should match average power (Brunner & Schimmack, 2020). When publication bias is present, the discovery rate exceeds average power (Schimmack, 2012). Thus, the difference between the discovery rate (in this case 100%) and the average power estimates provides information about the extent of publication bias. The R-Index is a simple correction for the inflation that is introduced by selecting significant results. To correct for inflation the difference between the discovery rate and the average power estimate is subtracted from the mean power estimate. For example, if all studies are significant and the mean power estimate is 80%, the discrepancy is 20%, and the R-Index is 60%. If all studies are significant and the mean power estimate is only 60%, the R-Index is 20%.

When I first developed the R-Index, I assumed that it would be better to use the median (e.g., power estimates of .50, .80, .90 would produce a median value of .80 and an R-Index of 60). However, the long-run success rate is determined by the mean. For example, .50, .80, .90 would produce a mean of .73, and an R-Index of 47. Thus, the median overestimates success rates in this scenario and it is more appropriate to use the mean. As a result, the R-Index results presented here differ somewhat from those shared publicly in an article by Engber.
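The R-Index arithmetic described above can be sketched as follows. The function `r_index` and its `average` parameter are hypothetical names used only for illustration; the default success rate of 100% reflects the case of a book that reports only significant results.

```python
from statistics import mean, median


def r_index(power_estimates, success_rate: float = 1.0, average=mean) -> float:
    """Average power estimate minus the inflation (success rate - average power)."""
    avg_power = average(power_estimates)
    inflation = success_rate - avg_power
    return avg_power - inflation


powers = [0.50, 0.80, 0.90]
print("mean-based R-Index:  ", round(100 * r_index(powers)))
print("median-based R-Index:", round(100 * r_index(powers, average=median)))
```

With power estimates of .50, .80, and .90, the mean-based version gives 47 and the median-based version gives 60, matching the values in the text.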

Table 1 shows the number of results that were available and the R-Index for chapters that mentioned empirical results. The chapters vary dramatically in terms of the number of studies that are presented (Table 1). The number of results ranges from 2 for chapters 14 and 16 to 55 for Chapter 5. For small sets of studies, the R-Index may not be very reliable, but it is all we have unless we do a careful analysis of each effect and replication studies.

Chapter 4 is the priming chapter that we carefully analyzed (Schimmack, Heene, & Kesavan, 2017). Table 1 shows that Chapter 4 is the worst chapter with an R-Index of 19. An R-Index below 50 implies that there is a less than 50% chance that a result will replicate. Tversky and Kahneman (1971) themselves warned against studies that provide so little evidence for a hypothesis. A 50% probability of answering multiple choice questions correctly is also used to fail students. So, we decided to give chapters with an R-Index below 50 a failing grade. Other chapters with failing grades are Chapter 3, 6, 711, 14, 16. Chapter 24 has the highest score (80, which is an A- in the Canadian grading scheme), but there are only 8 results.

Chapter 24 is called “The Engine of Capitalism”

A main theme of this chapter is that optimism is a blessing and that individuals who are more optimistic are fortunate. It also makes the claim that optimism is “largely inherited” (typical estimates of heritability are about 40-50%), and that optimism contributes to higher well-being (a claim that has been controversial since it has been made, Taylor & Brown, 1988; Block & Colvin, 1994). Most of the research is based on self-ratings, which may inflate positive correlations between measures of optimism and well-being (cf. Schimmack & Kim, 2020). Of course, depressed individuals have lower well-being and tend to be pessimistic, but whether optimism is really preferable over realism remains an open question. Many other claims about optimists are made without citing actual studies.

Even some of the studies with a high R-Index seem questionable with the hindsight of 2020. For example, Fox et al.’s (2009) study of attentional biases and variation in the serotonin transporter gene is questionable because single-genetic variant research is largely considered unreliable today. Moreover, attentional-bias paradigms also have low reliability. Taken together, this implies that correlations between genetic markers and attentional bias measures are dramatically inflated by chance and unlikely to replicate.

Another problem with narrative reviews of single studies is that effect sizes are often omitted. For example, Puri and Robinson’s finding that optimism (estimates of how long you are going to live) and economic risk-taking are correlated is based on a large sample. This makes it possible to infer that there is a relationship with high confidence. A large sample also allows fairly precise estimates of the size of the relationship, which is a correlation of r = .09. A simple way to understand what this correlation means is to think about the increase in our ability to predict risk-taking. Without any predictor, we have a 50% chance for somebody to be above or below the average (median) in risk-taking. With a predictor that is correlated r = .09, our ability to predict risk-taking increases from 50% to 55%.

Even more problematic, the next article that is cited for a different claim shows a correlation of r = -.04 between a measure of over-confidence and risk-taking (Busenitz & Barney, 1997). In this study with a small sample (N = 124 entrepreneurs, N = 95 managers), over-confidence was a predictor of being an entrepreneur, z = 2.89, R-Index = .64.
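For a single significant result, the R-Index can be sketched as observed power minus inflation, where inflation is the success rate (100% for one significant result) minus observed power. The code below is a minimal sketch under that assumption; the function names are illustrative and are not taken from any package.

```python
from statistics import NormalDist


def observed_power(z: float, alpha: float = 0.05) -> float:
    """Two-tailed power if the true effect equals the observed z-score."""
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return (1 - norm.cdf(crit - z)) + norm.cdf(-crit - z)


def r_index_single(z: float, alpha: float = 0.05) -> float:
    """R-Index for one significant result: power minus inflation."""
    power = observed_power(abs(z), alpha)
    success_rate = 1.0  # the single reported result was significant
    inflation = success_rate - power
    return power - inflation


print(round(r_index_single(2.89), 2))
```

With z = 2.89 the observed power is about .82 and the R-Index comes out at roughly .65, close to the reported .64; the small gap is presumably due to rounding of the intermediate power value.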

The study by Cassar and Craig (2009) provides strong evidence for hindsight bias, R-Index = 1. Entrepreneurs who were unable to turn a start-up into an operating business underestimated how optimistic they were about their venture (actual: 80%, retrospective: 60%).

Sometimes claims are only loosely related to a cited article (Hmieleski & Baron, 2009). The statement “this reasoning leads to a hypothesis: the people who have the greatest influence on the lives of others are likely to be optimistic and overconfident, and to take more risks than they realize” is linked to a study that used optimism to predict revenue growth and employment growth. Optimism was a negative predictor, although the main claim was that the effect of optimism also depends on experience and dynamism.

A very robust effect was used for the claim that most people see themselves as above average on positive traits (e.g., overestimate their intelligence) (Williams & Gilovich, 2008), R-Index = 1. However, the meaning of this finding is still controversial. For example, the above-average effect disappears when individuals are asked to rate themselves and familiar others (e.g., friends). In this case, ratings of others are more favorable than ratings of self (Kim et al., 2019).

Kahneman then does mention the alternative explanation for better-than-average effects (Windschitl et al., 2008). Namely, rather than actually thinking that they are better than average, respondents simply respond positively to questions about qualities that they think they have, without considering others or the average person. For example, most drivers have not had a major accident, and that may be sufficient for them to say that they are good drivers. They then also rate themselves as better than the average driver without considering that most other drivers also did not have a major accident. R-Index = .92.

So, are most people really overconfident and does optimism really have benefits and increase happiness? We don’t really know, even 10 years after Kahneman wrote his book.

Meanwhile, the statistical analysis of published results has also made some progress. I analyzed all test statistics with the latest version of z-curve (Bartos & Schimmack, 2020). All test-statistics are converted into absolute z-scores that reflect the strength of evidence against the null-hypothesis that there is no effect.
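The conversion to absolute z-scores can be illustrated for results that are already expressed as two-tailed p-values. This is a sketch that uses only the Python standard library; the function name is hypothetical, and converting t- or F-values would first require their p-values from the corresponding distributions, which is not shown here.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_tailed: float) -> float:
    """Convert a two-tailed p-value into an absolute z-score."""
    if not 0 < p_two_tailed < 1:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # Use the lower tail so that very small p-values stay numerically stable.
    return -NormalDist().inv_cdf(p_two_tailed / 2)


for p in (0.05, 0.01, 0.001):
    print(p, round(p_to_abs_z(p), 2))
```

A p-value of .05 maps onto z = 1.96, the significance criterion used in the z-curve plots.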

The figure shows the distribution of z-scores. As the book focussed on discoveries, most test-statistics are significant with p < .05 (two-tailed), which corresponds to z = 1.96. The distribution of z-scores shows that these significant results were selected from a larger set of tests that produced non-significant results. The z-curve estimate is that the significant results are only 12% of all tests that were conducted. This is a problem.

Evidently, these results are selected from a larger set of studies that produced non-significant results, which may never have been published (publication bias). To estimate how replicable the significant results are, z-curve estimates the mean power of the significant results. This is similar to the R-Index, but the R-Index is only an approximate correction for inflation, whereas z-curve properly corrects for selection for significance. The mean power is 46%, which implies that only about half of the results would be replicated in exact replication studies. The success rate in actual replication studies is often lower and may be as low as the estimated discovery rate (Bartos & Schimmack, 2020). So, replicability is somewhere between 12% and 46%. Even if half of the results are replicable, we do not know which results are replicable and which ones are not. The chapter-based analyses provide some clues about which findings may be less trustworthy (implicit priming) and which ones may be more trustworthy (overconfidence), but the main conclusion is that the empirical basis for claims in “Thinking, Fast and Slow” is shaky.

Conclusion

In conclusion, Daniel Kahneman is a distinguished psychologist who has made valuable contributions to the study of human decision making. His work with Amos Tversky was recognized with a Nobel Memorial Prize in Economics (APA). It is surely interesting to read what he has to say about psychological topics that range from cognition to well-being. However, his thoughts are based on a scientific literature with shaky foundations. Like everybody else in 2011, Kahneman trusted individual studies to be robust and replicable because they presented a statistically significant result. In hindsight it is clear that this is not the case. Narrative literature reviews of individual studies reflect scientists’ intuitions (Fast Thinking, System 1) as much as or more than empirical findings. Readers of “Thinking, Fast and Slow” should read the book as a subjective account by an eminent psychologist, rather than an objective summary of scientific evidence. Moreover, ten years have passed, and if Kahneman wrote a second edition, it would be very different from the first one. Chapters 3 and 4 would probably just be scrubbed from the book. But that is science. It does make progress, even if progress is often painfully slow in the softer sciences.

A New Look at the Implicit Revolution

Psychology is not a unified paradigmatic science. That is, it lacks an overarching theory like the theory of evolution in biology. In a science without an empirically grounded paradigm, progress is made much as evolution proceeds, through a process of trial and error. Some ideas may thrive for a moment, but if they are not fruitful, they are discarded. The emergence of a new idea is often characterized as a revolution, and psychology has seen its fair share of revolutions. Behaviorism replaced introspectionism and the cognitive revolution replaced behaviorism. For better or worse, cognitivism is dominating psychology at the moment. The cognitive revolution also had a strong influence on social psychology with the rise of social cognition research.

In the early days, social psychologists focussed on higher cognitive processes like attributions. However, in the 1980s, the implicit revolution shifted focus towards lower cognitive processes that may occur without awareness. This was not the first time that unconscious processes became popular. A special issue of the American Psychologist in 1992 called it the New Look 3 (Greenwald, 1992).

The first look was Freud’s exploration of conscious and unconscious processes. A major hurdle for this first look was conceptual confusion and a lack of empirical support. Puritan academics may also have shied away from the sexual content in Freudian theories (e.g., sexual desire directed at the mother).

However, the second look did try to study many of Freud’s ideas with empirical methods. For example, Silverman and Weinberger (1985) presented the phrase “Mommy and I are one” on a computer screen so quickly that participants were unable to say what they saw. This method is called subliminal priming. The idea was that the unconscious has a longing to be loved by mommy and that presenting this phrase would gratify the unconscious. Numerous studies used the “Mommy and I are one” priming method to see effects on behavior.

Greenwald (1992) reviewed this evidence.

Can subliminal presentations result in cognitive analyses of multiword strings? There have been reports of such effects, especially in association with tests of psychoanalytic hypotheses. The best known of these findings (described as subliminal psychodynamic activation [SPA], using “Mommy and I are One” as the text of a subliminal stimulus; Silverman & Weinberger, 1985) has been identified, on the basis of meta-analysis, as a reproducible phenomenon (Hardaway, 1990; Weinberger & Hardaway, 1990).

Despite this strong evidence, many researchers remain skeptical about the SPA result (see, e.g., the survey reported in Appendix B). Such skepticism is almost certainly due to the lack of widespread enthusiasm for the SPA result’s proposed psychodynamic interpretation (Silverman & Weinberger, 1985).

Because of the positive affective values of words in the critical stimulus (especially Mommy and I), it is possible that observed effects might be explained by cognitive analysis limited to the level of single words. Some support for that interpretation is afforded by Hardaway’s demonstration (1990, p. 183, Table 3) that other affectively positive strings that include Mommy or One also produce significant effects. However, these other effects are weaker than the effect of the specific string, “Mommy and I are One.”

In summary of evidence from studies of subliminal activation, it is now well established that analysis occurs for stimuli presented at exposure conditions in a region between objective and subjective thresholds; this analysis can extract at least some semantic content of single words.

The New Look 3, however, was less interested in Freudian theory. Most of the influential subliminal priming studies used ordinary stimuli to study common topics in social psychology, including prejudice.

For example, Greenwald (1992) cites Devine’s (1989) highly influential subliminal priming studies with racial stimuli as evidence that “experiments using stimulus conditions that are clearly above objective thresholds (but presumably below subjective thresholds) have obtained semantic activation findings with apparent relative ease” (p. 769).

25 years later, in their Implicit Revolution article, Greenwald and Banaji feature Devine’s influential article.

“Patricia Devine’s (1989) dissertation research extended the previously mentioned subliminal priming methods of Bargh and Pietromonaco (1982) to automatic stereotypes. Devine’s article brought attention to the possibility of dissociation between automatic stereotype activation and controlled inhibition of stereotype expression” (p. 865).

In short, subliminal priming has played an important role in the implicit revolution. However, subliminal priming is still rare. Most studies use clearly visible stimuli. This is surprising, given the clear advantages of subliminal priming for studying unconscious processes. A major concern with stimuli that are presented with awareness is that participants can control their behavior. In contrast, if they are not even aware that a racial stimulus was presented, they have no ability to suppress a prejudiced response.

Another revolution explains why subliminal studies remain rare despite their obvious advantages. This revolution has been called the credibility revolution, replication revolution, or open science revolution. The credibility revolution started in 2011, after a leading social cognition journal published a controversial article that showed time-reversed subliminal priming effects (Bem, 2011). This article revealed a fundamental problem in the way social psychologists conducted their research. Rather than using experiments to see whether effects exist, they used experiments to accumulate evidence in favor of effects. Studies that failed to show the expected effects were hidden. In the 2010s, it became apparent that this flawed use of the scientific method has produced large literatures with results that cannot be replicated. A major replication project found that less than 25% of results in social psychological experiments could be replicated (OSC, 2015). Given these results, it is unclear which results provided credible evidence.

Despite these troubling findings, social psychologists continue to cite old studies like Devine’s (1989) study (it was just one study!) as if it provided conclusive evidence for subliminal priming of prejudice. If we need any evidence for Freud’s theory of repression, social psychologists would be a prime example. Through various defense mechanisms they maintain the belief that old findings that were obtained with bad scientific practices provided credible evidence that can inform our understanding of the unconscious.

Here I show that this is wishful thinking. To do so, I conducted a modern meta-analysis of subliminal priming studies. Unlike traditional meta-analyses that do not take publication bias into account, this new method provides a strong test of publication bias and corrects for its effect on the results. While there are several new methods, z-curve has been shown to be superior to other methods (Brunner & Schimmack, 2020).

The figure shows the results. The red line at z = 1.96 corresponds to the significance criterion of .05. It is easy to see that this criterion acts like a censor. Results with z-scores greater than 1.96 (i.e., p < .05) are made public and can enter researchers’ awareness. Results that are not significant, z < 1.96, are repressed and may linger only in the unconscious of researchers who prefer not to think about their failures.

Statistical evidence of repression is provided by a comparison of the observed discovery rate (i.e., the percentage of published results that are significant) of 90% and the expected discovery rate based on the z-curve model (i.e., the grey curve in the figure) of 13%. Evidently, published results are selected from a much larger number of analyses that failed to support subliminal priming. This clear evidence of selection for significance undermines the credibility of individual studies in the subliminal priming literature.
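To get a feel for what the gap between the two discovery rates means, the sketch below computes an observed discovery rate from a list of z-scores and the number of non-significant results implied per significant result by a given expected discovery rate. The function names and the example z-scores are made up for illustration; the expected discovery rate itself has to come from a fitted z-curve model, which is not reproduced here.

```python
def observed_discovery_rate(z_scores, crit=1.96):
    """Share of reported results whose absolute z-score exceeds the criterion."""
    significant = sum(1 for z in z_scores if abs(z) > crit)
    return significant / len(z_scores)


def file_drawer_ratio(edr):
    """Non-significant results expected for every significant result."""
    return (1 - edr) / edr


# Hypothetical z-scores, used only to show how the function is called.
example_z = [2.5, 1.0, 3.0, 2.1]
print(observed_discovery_rate(example_z))

# Expected discovery rate reported in the text for subliminal priming.
print(round(file_drawer_ratio(0.13), 1))
```

An expected discovery rate of 13% implies close to seven non-significant results for every significant one, whereas an observed discovery rate of 90% means that only about one in ten published results is non-significant.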

However, there is some evidence of heterogeneity across studies. This is seen in the increasing numbers below the x-axis. Whereas studies with z-scores below 4 have low average power, studies with z-scores above 4 have a mean power greater than 80%. This suggests that replications of these studies could produce significant results. This information could be used to salvage a few solid findings from a pile of junk findings. Closer examination of these studies is beyond the purpose of this blog post, and Devine’s study is not one of them.

The main point of this analysis is that there is strong scientific evidence to support the claim that subliminal priming researchers did not use the scientific method properly. By selecting only results that support the existence of subliminal priming, they created only illusory evidence in support of subliminal priming. Thirty years after Devine’s (1989) subliminal prejudice study was published, we have no scientific evidence in support of the claim that racial stimuli can bypass consciousness and directly influence behavior.

However, Greenwald and other social psychologists who made a career out of these findings repress the well-known fact that published results in experimental social psychology are not credible and cite them as if they are credible evidence (Greenwald & Banaji, 2017).

Social psychologists are of course very familiar with deception. First, they became famous for deceiving participants (Milgram studies). In 2011, it became apparent that they were deceiving themselves. Now, it seems they are willing to deceive others to avoid facing the inconvenient truth that decades of research have produced no scientific results.

The inability to face ego-threatening information is of course not new to psychologists. Freud studied defense mechanisms and social psychologists studied cognitive biases and motivated reasoning. Right now, this trait is on display in Donald Trump’s and his supporters’ inability to face the fact that he lost an election. It is ironic that social psychologists have the same inability when their own egos are on the line.