All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Wilson and Wixted’s False False-Positive Rates


Wilson BM, Wixted JT. The Prior Odds of Testing a True Effect in Cognitive and Social Psychology. Advances in Methods and Practices in Psychological Science. 2018;1(2):186-197. doi:10.1177/2515245918767122

Abstract

Wilson and Wixted had a cool idea, but it turns out to be wrong. They proposed that sign errors in replication studies can be used to estimate false positive rates. Here I show that their approach makes a false assumption and does not work.

Introduction

Two influential articles shifted attitudes about false positives in psychology from complacency to fear (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). Until then, psychologists assumed that false rejections of the null hypothesis (no effect) are rare because the null hypothesis is rarely true: effects were either positive or negative, but never exactly zero. In addition, meta-analyses typically found evidence for effects, even after taking biased reporting of studies into account (Rosenthal, 1979).

Simmons et al. (2011) demonstrated, however, that questionable, but widely used statistical practices can increase the risk of publishing significant results without real effects from the nominal 5% level (p < .05) to levels that may exceed 50% in some scenarios. When only 25% of significant results in social psychology could be replicated, it seemed possible that a large number of the replication failures were false positives (Open Science Collaboration, 2015).

Wilson and Wixted (2018) used the reproducibility results to estimate how often social psychologists test true null hypotheses. Their approach relied on the rate of sign reversals between original and replication estimates. If the null hypothesis is true, sampling error will produce an equal number of estimates in both directions. Thus, a high rate of sign reversals could be interpreted as evidence that many original findings reflect sampling error around a true null. Moreover, for every sign reversal produced by a true null there is, on average, one same-sign result that also reflects a true null; Wilson and Wixted treated the remaining same-sign results as reflecting tests of true hypotheses that reliably produce the correct sign.

Let P(SR) be the observed proportion of sign reversals between originals and replications (not conditional on significance). If true effects always reproduce the same sign and null effects produce sign reversals 50% of the time, then the observed P(SR) provides an estimate of the proportion of true null hypotheses that were tested, P(True-H0).

P(True-H0) = 2*P(SR)

Wilson and Wixted further interpreted this quantity as informative about the fraction of statistically significant original results that might be false positives. Wilson and Wixted (2018) found approximately 25% sign reversals in replications of social psychological studies. Under their simplifying assumptions, this implies 50% true null hypotheses in the underlying set of hypotheses being tested, and they used this inference, together with assumptions about significance and power, to argue that false positives could be common in social psychology.

Like others, I thought this was a clever way to make use of sign reversals. The article has been cited only 31 times (WoS, January 6, 2026), and none of the citing articles critically examined Wilson and Wixted’s use of sign errors to estimate false positive rates.

However, other evidence suggested that false positives are rare (Schimmack, 2026). To resolve the conflict between Wilson and Wixted’s conclusions and these other findings, I reexamined their logic, and ChatGPT pointed out that Wilson and Wixted’s (2018) formula rests on assumptions that need not hold.

The main problem is the assumption that tests of true hypotheses do not produce sign errors. This assumption is false because studies that test false null hypotheses with low power can still produce sign reversals (Gelman & Carlin, 2014). Moreover, sign reversals can be generated even when the false-positive rate is essentially zero, if original studies are selected for statistical significance and the underlying studies have low power. In fact, it is possible to predict the percentage of sign reversals from the non-centrality of the test statistic under the assumption that all studies have the same power. To obtain 25% sign reversals, all studies could test a false null hypothesis with about 10% power. In that scenario, many replications would reverse sign because estimates are highly noisy, while the original literature could still contain few or no literal false positives if the true effects are nonzero.
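To make this concrete, the following sketch (my own illustration in Python, not code from Wilson and Wixted or the z-curve package) solves for the noncentrality of a two-sided z-test with 10% power and computes the implied sign-reversal rate in unselected exact replications. It shows that 25% sign reversals can arise without a single true null hypothesis, in which case the 2*P(SR) formula would wrongly suggest that half of all tested hypotheses are true nulls.

```python
from scipy.stats import norm
from scipy.optimize import brentq

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # 1.96

def power(ncp):
    # Two-sided power of a z-test with noncentrality ncp (true effect / SE).
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

# Noncentrality at which two-sided power equals 10%.
ncp_10 = brentq(lambda x: power(x) - 0.10, 0.001, 5)

# Probability that an unselected exact replication lands on the wrong side of zero.
p_sign_reversal = norm.cdf(-ncp_10)

print(f"noncentrality for 10% power:    {ncp_10:.2f}")                # ~0.66
print(f"sign reversals in replications: {p_sign_reversal:.0%}")       # ~25%
print(f"implied 'true nulls' (2 x SR):  {2 * p_sign_reversal:.0%}")   # ~50%, yet no null is true
```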


Empirical Examination with Many Labs 5

I used the results from ManyLabs5 (Ebersole et al., 2020) to evaluate what different methods imply about the false discovery risk of social psychological studies in the Reproducibility Project, first applying Wilson and Wixted’s sign-reversal approach and then using z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020).

ManyLabs5 conducted additional replications of 10 social psychological studies that failed to replicate in the Reproducibility Project (Open Science Collaboration, 2015). The replication effort included both the original Reproducibility Project protocols and revised protocols developed in collaboration with the original authors. There were 7 sign reversals in total across the 30 replication estimates. Under Wilson and Wixted’s sign-reversal framework, this rate of 7 out of 30 (23%) would be interpreted as evidence that approximately 46% of the underlying population effects in this set are exactly zero (i.e., that H0 is true for about 46% of the effects).

To compare these results more directly to Wilson and Wixted’s analysis, it is necessary to condition on non-significant replication outcomes, because ManyLabs5 selected studies based on replication failure rather than original significance alone. Among the non-significant replication results, 25 sign reversals occurred out of 75 estimates, corresponding to a rate of 33%, which would imply a false-positive rate of approximately 66% under Wilson and Wixted’s framework. Although this estimate is somewhat higher, both analyses would be interpreted as implying a large fraction of false positives—on the order of one-half—among the original significant findings within that framework.

To conduct a z-curve analysis, I transformed the effect sizes (r) in ManyLabs5 (Table 3) into d-values and used the reported confidence intervals to compute standard errors, SE = (d upper − d lower)/3.92, and corresponding z-values, z = d/SE. I fitted a z-curve model that allows for selection on statistical significance (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020) to the 10 significant original results. I fitted a second z-curve model to the 30 replication results, treating this set as unselected (i.e., without modeling selection on significance).
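For readers who want to follow the conversion, here is a minimal sketch with made-up input values (the actual inputs come from Table 3 of ManyLabs5); it assumes the confidence interval endpoints are transformed to d before the standard error is computed.

```python
import numpy as np

def r_to_d(r):
    # Convert a correlation to Cohen's d (equal-group approximation).
    return 2 * r / np.sqrt(1 - r**2)

# Hypothetical effect size and 95% confidence interval in r units.
r, r_lower, r_upper = 0.10, -0.05, 0.25

d = r_to_d(r)
d_lower, d_upper = r_to_d(r_lower), r_to_d(r_upper)

se = (d_upper - d_lower) / 3.92   # width of a 95% CI equals 2 * 1.96 * SE
z = d / se                        # z-value used as input for the z-curve analysis

print(f"d = {d:.3f}, SE = {se:.3f}, z = {z:.2f}")
```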

The z-curve for the 10 original results shows evidence consistent with strong selection on statistical significance, despite the small set of studies. Although all original results are statistically significant, the estimated expected discovery rate is only 8%, and the upper limit of the 95% confidence interval is 61%, well below 100%. Visual inspection of the z-curve plot also shows a concentration of results just above the significance threshold (z = 1.96) and none just below it, even though sampling variation does not create a discontinuity between results with p = .04 and p = .06.

The expected replication rate (ERR) is a model-based estimate of the average probability that an exact replication would yield a statistically significant result in the same direction. For the 10 original studies, ERR is 32%, but the confidence interval is wide (3% to 70%). The lower bound near 3% is close to the directional false-alarm rate under a two-sided test when the true effect is zero (α/2 = 2.5%), meaning that the data are compatible with the extreme-null scenario in which all underlying effects are zero and the original significant results reflect selection. This does not constitute an estimate of the false-positive rate; rather, it indicates that the data are too limited to rule out that worst-case possibility. At the same time, the same results are also compatible with an alternative scenario in which all underlying effects are non-zero but power is low across studies.

For the 30 replication results, the z-curve model provides a reasonable fit to the observed distribution, which supports the use of a model that does not assume selection on statistical significance. In this context, the key quantity is the expected discovery rate (EDR), which can be interpreted as a model-based estimate of the average true power of the 30 replication studies. The estimated EDR is 17%. This value is lower than the corresponding estimate based on the original studies, despite increases in sample sizes and statistical power in the replication attempts. This pattern illustrates that ERR estimates derived from biased original studies tend to be overly optimistic predictors of actual replication outcomes (Bartos & Schimmack, 2022). In contrast, the average power of the replication studies can be estimated more directly because the model does not need to correct for selection bias.

A key implication is that the observed rate of sign reversals (23%) could have been generated by a set of studies in which all null hypotheses are false but average power is low (around 17%). However, the z-curve analysis also shows that even a sample of 30 studies is insufficient to draw precise conclusions about false positive rates in social psychology. Following Sorić (1989), the EDR can be used to derive an upper bound on the false discovery rate (FDR), that is, the maximum proportion of false positives consistent with the observed discovery rate. Based on this approach, the FDR ranges from 11% to 100%. To rule out high false positive risks, studies would need higher power, narrower confidence intervals, or more stringent significance thresholds.
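The Sorić bound mentioned above has a simple closed form; the following sketch (my own illustration) implements it. The point estimate of 17% corresponds to a maximum FDR of about 26%, and an EDR at the theoretical minimum of 5% (the significance criterion) allows all discoveries to be false; the reported range of 11% to 100% comes from applying the same formula to the confidence limits of the EDR estimate.

```python
def soric_fdr_max(edr, alpha=0.05):
    # Soric's (1989) upper bound on the false discovery rate for a given
    # (expected) discovery rate, assuming true effects are detected with power 1.
    return min(1.0, (1 / edr - 1) * alpha / (1 - alpha))

print(f"EDR = 17% -> max FDR = {soric_fdr_max(0.17):.0%}")   # about 26%
print(f"EDR =  5% -> max FDR = {soric_fdr_max(0.05):.0%}")   # 100%
```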

Conclusion

This blog post compared Wilson and Wixted’s use of sign reversals to estimate false discovery rates with z-curve estimates of false discovery risk. I showed that Wilson and Wixted’s approach rests on implausible assumptions. Most importantly, it assumes that sign reversals occur only when the true effect is exactly zero. It does not allow for sign reversals under nonzero effects, which can occur when all null hypotheses are false but tests of these hypotheses have low power.

The z-curve analysis of 30 replication estimates in the ML5 project shows that low average power is a plausible explanation for sign reversals even without invoking a high false-positive rate. Even with the larger samples used in ML5, the data are not precise enough to draw firm conclusions about false positives in social psychology. A key problem remains the fundamental asymmetry of NHST: it makes it possible to reject null hypotheses, but it does not allow researchers to demonstrate that an effect is (practically) zero without very high precision.

The solution is to define the null hypothesis as a region of effect sizes that are so small that they are practically meaningless. The actual level may vary across domains, but a reasonable default is Cohen’s criterion for a small effect size, r = .1 or d = .2. By this criterion, only two of the replication studies in ML5 had sample sizes that were large enough to produce results that ruled out effect sizes of at least r = .1 with adequate precision. Other replications still lacked precision to do so. Interestingly, five of the ten original statistically significant results also failed to rule out effect sizes of at least r = .1, because their confidence intervals included r = .10. Thus, these studies at best provided suggestive evidence about the sign of an effect, but no evidence that the effect size is practically meaningful.

The broader lesson is that any serious discussion of false positives in social psychology requires (a) a specification of what counts as an “absence of an effect” in practice, using minimum effect sizes of interest that can be empirically tested, (b) large sample sizes that allow precise estimation of effect sizes, and (c) unbiased reporting of results. A few registered replication reports come close to this ideal, but even these results have failed to resolve controversies because effect sizes close to zero in the predicted direction remain ambiguous without a clearly specified threshold for practical importance. To avoid endless controversies and futile replication studies, it is necessary to specify minimum effect sizes of interest before data are collected.

In practice, this means designing studies so that the confidence interval can exclude effects larger than the minimum effect size of interest, rather than merely achieving p < .05 against a point null of zero. Conceptually, this is closely related to specifying the null hypothesis as a minimum effect size and using a directional test, rather than using a two-sided test against a nil null of exactly zero. Put differently, the problem is not null hypothesis testing per se, but nil hypothesis testing (Cohen, 1994).

What a Decade of P-Curve Tells Us About False-Positive Psychology

The File-Drawer Problem

A single study rarely provides sufficient evidence for a theoretically derived hypothesis. To make sense of inconsistent results across multiple studies, psychologists began to conduct meta-analyses. The key contribution of meta-analyses is that pooling evidence from multiple studies reduces sampling error and allows for more precise estimation of effect sizes.

The key problem of meta-analysis is the assumption that individual studies are an unbiased sample of all studies that were conducted. Selective publishing of statistically significant results in favor of a prediction leads to inflated effect size estimates. This problem has been dubbed the file drawer problem. Whereas significant results are submitted for publication, non-significant results are put into a file drawer.

Rosenthal (1979) pointed out that a literature consisting entirely of statistically significant findings could, in principle, reflect no true effects. However, such a scenario was often considered unlikely under the assumption of honest testing with a fixed Type I error rate, because studies without real effects produce a significant result only about 1 out of 20 times when the error rate is controlled with alpha = .05. In addition, Rosenthal proposed a statistic, fail-safe N, to evaluate this risk, and meta-analyses often found that fail-safe N was large enough to infer a real effect.
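For readers unfamiliar with fail-safe N, here is a minimal sketch of the Stouffer-based version as it is commonly presented (with made-up z-values, not Rosenthal’s original computations): it returns the number of unpublished studies averaging a null result that would be needed to render the combined evidence non-significant.

```python
import numpy as np

def fail_safe_n(z_values, z_crit=1.645):
    # Number of file-drawer studies with an average z of 0 needed to push the
    # Stouffer-combined z-score below the one-tailed .05 criterion.
    z = np.asarray(z_values, dtype=float)
    return z.sum() ** 2 / z_crit ** 2 - len(z)

print(f"fail-safe N: {fail_safe_n([2.1, 1.8, 2.5, 1.7, 2.0]):.0f}")  # about 33
```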

The assessment of the published literature in psychology shifted dramatically in the early 2010s. Critically, Simmons, Nelson, and Simonsohn (2011) showed with simulations that a few statistical tricks could increase the probability of significant results without real effects from 1:20 to nearly 1:2. Even consistent statistically significant results in several studies were no longer unlikely. Presenting 9 significant results would not require 180 studies, but only 18 (Bem, 2011), or even fewer with more extreme use of questionable statistical practices that later became known as p-hacking. This highly cited article created a crisis of confidence in single studies, and by extension also in meta-analytic findings.
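Simmons et al. combined several flexibilities to push the false-positive rate above 50%. The sketch below (my own simulation with made-up parameters) illustrates just one of them, testing two correlated dependent variables and claiming success if either reaches significance, which already inflates the nominal 5% error rate noticeably.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, n_per_group, r = 20_000, 20, 0.5
cov = [[1, r], [r, 1]]

false_positives = 0
for _ in range(n_sim):
    a = rng.multivariate_normal([0, 0], cov, size=n_per_group)  # control group
    b = rng.multivariate_normal([0, 0], cov, size=n_per_group)  # "treatment" group, no true effect
    p1 = stats.ttest_ind(a[:, 0], b[:, 0]).pvalue
    p2 = stats.ttest_ind(a[:, 1], b[:, 1]).pvalue
    false_positives += (p1 < 0.05) or (p2 < 0.05)

print(f"false-positive rate with two DVs: {false_positives / n_sim:.1%}")  # well above 5%
```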


The False Positive Detection Problem

A few years later, Simonsohn, Nelson, and Simmons (2014) published a statistical method, p-curve, to probe the credibility of sets of statistically significant results, often drawn from meta-analyses. They referred to this method as a “key to the file drawer.” The analogy is potentially confusing. P-curve does not test whether publication bias exists, nor does it determine whether questionable statistical practices were used. It also does not estimate how many studies are in the proverbial file drawer. In fact, p-hacking implies that file drawers can be small, even if significant results are false positives. When p-hacking is present, the size of the file drawer is therefore no longer informative.

What p-curve does, and what it does better than fail-safe N, is to assess whether the observed set of statistically significant results is inconsistent with the hypothesis that all tested effects are exactly zero. Simonsohn et al. (2014) call this property evidential value. Formally, significance tests are applied to the distribution of significant p-values to evaluate the null hypothesis implied by this scenario. When this hypothesis can be rejected using conventional significance tests, the data are said to have evidential value. Later versions of p-curve also include stronger tests, but the underlying logic remains the same.
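A minimal sketch of the simplest version of this logic is a binomial test on the share of significant p-values below .025: if all tested effects were exactly zero, significant p-values would be uniformly distributed between 0 and .05, so only about half should fall below .025. The p-values below are made up for illustration, and the published p-curve method adds further tests (e.g., Stouffer-type combinations).

```python
from scipy.stats import binomtest

# Made-up set of statistically significant p-values from a hypothetical literature.
significant_p = [0.001, 0.004, 0.011, 0.018, 0.021, 0.032, 0.041]

below_025 = sum(p < 0.025 for p in significant_p)  # count of "very small" p-values
result = binomtest(below_025, n=len(significant_p), p=0.5, alternative='greater')

print(f"{below_025} of {len(significant_p)} significant p-values are below .025, "
      f"binomial p = {result.pvalue:.3f}")
```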

Equipped with a new diagnostic tool, psychologists had a different way to evaluate published studies. While p-curve still shares a limitation of significance testing—namely that it cannot provide affirmative evidence for the null hypothesis, such as the claim that all published significant results are false positives—it can nevertheless show that a set of studies fails to provide evidence against this extreme null hypothesis. Lack of evidence is still valuable information, especially when a set of statistically significant results appears to suggest strong support for a hypothesis, but this support is potentially driven by selective reporting or p-hacking rather than true effects.

P-Curve Coding

P-curve made it possible to evaluate the hypothesis that many, if not most (Ioannidis, 2005), published results are false positives. If this were the case, many p-curve meta-analyses of a specific hypothesis should be flat or left-skewed. In contrast, true hypotheses should produce right-skewed p-curves. Surprisingly, this simple approach to examine the false positive rate has not been systematically applied.

I conducted a review of p-curve articles to see what we have learned about false positives from a decade of p-curve analyses. The article introducing p-curve has been cited more than 1,000 times (WoS, December 30, 2025). I reviewed two samples of articles. First, I sampled the most highly cited articles. These articles are of interest because they may introduce many readers to p-curve, including readers who are not experts in meta-analysis. These articles are also more likely to report high-visibility results. The second set of articles focused on the most recent articles. The rationale is that more recent articles reflect current practice in how p-curve is used and how p-curve results are interpreted.

P-curve results were coded first in terms of evidential value (right-skewed vs. not right-skewed). The second classification concerned the proper interpretation of right-skewed p-curves. Correct interpretations were limited to claims about evidential value. However, some articles misinterpreted p-curve as a bias test and falsely inferred a low risk of bias from a p-curve with evidential value.

The coding scheme had three categories. First, articles that did not report a p-curve analysis were coded as irrelevant. Second, articles that reported a p-curve analysis and correctly limited discussion to evidential value were coded as correct. Third, some articles reported a p-curve analysis but made invalid claims about bias, selection bias, or p-hacking. Namely, these articles interpreted results showing evidential value to conclude that publication bias or p-hacking were not a concern. This conclusion is invalid because data can show evidential value while biases still inflate effect size estimates. These articles were coded as incorrect.

Articles were found using WebOfScience. Articles classified as editorial material were excluded. The list of coded articles and their coding is available on the Open Science Framework (OSF) project.

P-Curve Results

I coded 142 articles. A large number of them (k = 95; 67%) cited the p-curve article but did not report a p-curve analysis. An additional two articles stated that a p-curve analysis had been conducted but did not provide a clear description of the results. All 45 articles that reported a p-curve analysis found evidential value. Some of these articles showed flat p-curves for specific subsets of studies, but this pattern was attributed to theoretically meaningful moderators (e.g., Tucker-Drob & Bates, 2015). Importantly, none of the reviewed p-curve analyses suggested that all reported results were false positives.

To further probe this issue, I conducted an automated search of all 990 abstracts retrieved from Web of Science for references to p-curve results indicating no evidential value or flat p-curves. This search did not identify a single abstract reporting such a result.

In terms of interpretations, the results are also notable. More articles misrepresented p-curve as a bias test (k = 28) than correctly presented p-curve as a test of evidential value. Because p-curves were almost always right-skewed, these misinterpretations frequently led authors to infer a low risk of bias, which is not a valid inference from a right-skewed p-curve.

Once, p-curve was even used to discount evidence of bias obtained with other methods. For example, “Funnel plot analysis showed evidence of publication bias, but p-curve analysis suggested that our results could not be caused by selective reporting” (Goubran et al., 2025).

Discussion

Two influential theoretical articles raised concerns that many published rejections of null hypotheses could be false positive results (Ioannidis, 2005; Simmons et al., 2011). P-curve provides an opportunity to evaluate this prediction empirically, but the evidence obtained in p-curve meta-analyses has not been systematically examined. I found that p-curve results always showed evidential value. This finding is in stark contrast to scenarios that suggest the majority of statistically significant results are false (Ioannidis, 2005).

At the same time, p-curve is often misunderstood and misinterpreted as a bias test. This interpretation may lead to a false sense of the credibility of published results. Just as replication failures do not justify the inference that the original result was a false positive (Maxwell et al., 2006), evidential value does not imply that results can be replicated.

There are several possible explanations for the failure to find evidence of false positive results in meta-analyses. One explanation is that false positives are more likely to arise in individual studies than in meta-analyses, which require multiple studies testing the same hypothesis. Sustaining a literature of false positives would therefore require repeated and consistent use of extremely questionable research practices. Few researchers may be motivated to use extreme p-hacking repeatedly to force significant results in the absence of a real effect. Bem (2011) may represent an unusual case in that he appeared to be highly motivated to convince skeptical scientists of the existence of extrasensory perception and to present evidence that met prevailing methodological standards in experimental social psychology. More commonly, researchers may advance claims based on selective or suggestive evidence without attempting to build a cumulative evidential record.

Another explanation is that the statistical null-hypothesis is unlikely to be true (Cohen, 1994). What are the chances that an experimental manipulation has no effect whatsoever on behavior? Subliminal stimuli are often cited as candidates, but even in this literature concerns have been raised that effects may be driven by partial stimulus detection. In correlational research, it is even less likely that two variables have a true correlation of exactly zero. As a result, p-hacking may often inflate effect sizes rather than generate false positive results in the strict sense of rejecting a false null hypothesis.

The problem is when rejection of the nil-hypothesis is confused with credible evidence for a meaningful effect. For example, a p-curve analysis of ego depletion shows evidential value (Carter et al., 2019), but even the original authors were unable to replicate the effect (Vohs et al., 2019). This example illustrates that evidential value is a necessary but not sufficient condition for a credible science. Even if effect sizes are not exactly zero, they can be dramatically inflated. As p-curve is limited to the assessment of evidential value, other methods are required to (a) assess whether published results are biased by selection or p-hacking, (b) estimate population effect sizes while correcting for bias, and (c) estimate the false positive risk in heterogeneous meta-analyses, where a subset of statistically significant results may be false positives.

However, it is also possible that p-curve results are biased and provide spurious evidence of evidential value, that is, evidential value itself may constitute a meta-level false positive. In this case, p-curve would falsely reject the null hypothesis that all statistically significant results are false positives. One possible source of bias is that studies with stronger (but false) evidence may be more likely to be included in meta-analyses than studies with weaker (false) evidence. For example, some researchers may p-hack to more stringent thresholds (e.g., α = .01) or apply Bonferroni corrections, while standard meta-analytic coding practices may mask these selection processes. However, p-hacking of this kind would be expected to produce left-skewed or flat p-curves, such that explaining the near-absence of flat p-curves would require the additional assumption that extreme p-hacking is rare. At present, this possibility cannot be ruled out, but it appears unlikely to account for the overwhelming predominance of right-skewed p-curves.

A more plausible explanation is selective reporting of p-curve results. Because reporting p-curve analyses is optional, meta-analysts may be more likely to include p-curve results when they show evidential value and omit them when p-curves are flat or left-skewed. Evaluating this form of meta-analytic selection bias requires auditing meta-analyses that did not report p-curve results and applying the method retrospectively.


Conclusion

The most important finding is that concerns about many false positive results in psychology journals are not based on empirical evidence. False positives in single studies are not important because no single study can serve as an empirical foundation for a theory. There is no evidence that entire literatures are just a collection of false positive results. This does not mean that published results are credible. Publication bias, inflation of effect sizes, low replicability, method factors in correlational studies, and lack of construct validation remain serious obstacles that have sometimes been overshadowed by concerns about false positive results. These issues deserve more attention in the future.


Frequently Asked Questions about Z-Curve

under development

Can’t find what you are looking for?
1. Ask an AI to search replicationindex.com to find answers that are not here.
2. Send me an email and I will answer your question and add it to the FAQ list.

Does z-curve offer options for small sample (small-N) literatures like animal research?

Short answer:
Yes — z-curve 3.0 adds new transformation methods and a t-curve option that make the method more appropriate for analyses involving small samples (e.g., N < 30). These options are designed to mitigate biases that arise when you convert small-sample test statistics to z-scores using standard normal approximations. Z-curve 3.0 also allows researchers to use t-distributions (t-curve) with a fixed df that is more similar to the distributions of test statistics from small samples than the standard normal distribution.

Details:

  • The z-curve 3.0 tutorial (Chapter 8) explains that instead of only converting p-values to z-scores, you can now:
    • Try alternative transformations of t-values that better reflect their sampling distribution, and
    • Use a direct t-curve model that fits t-distributions with specified degrees of freedom instead of forcing a normal approximation. This “t-curve” option is recommended when studies have similar and genuinely small degrees of freedom (like many animal experiments). (replicationindex.com)
  • These improvements help reduce bias introduced by naïve normal transformations, though they don’t completely eliminate all small-sample challenges, and performance can still be unstable when degrees of freedom vary widely or are extremely small. (replicationindex.com)

Link to the tutorial:
🔗 Z-Curve 3.0 Tutorial (Introduction and links to all chapters): https://replicationindex.com/2025/07/08/z-curve-tutorial-introduction/ (replicationindex.com)
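To illustrate why tiny degrees of freedom are a problem for the standard normal transformation, the following sketch (my own simulation, not the z-curve 3.0 code) simulates one-sample t-tests with n = 3 (df = 2) and a nonzero true effect, converts the two-sided p-values to absolute z-scores, and compares their mean and spread to what the standard normal model behind z-curve assumes (mean equal to the noncentrality, SD equal to 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, d, n_sim = 3, 1.0, 50_000

# Simulate one-sample t-tests with a true standardized effect of d.
samples = rng.normal(loc=d, scale=1.0, size=(n_sim, n))
t = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))
p = 2 * stats.t.sf(np.abs(t), df=n - 1)

# Standard conversion of two-sided p-values to absolute z-scores.
z = stats.norm.isf(p / 2)

print(f"noncentrality implied by the design: {d * np.sqrt(n):.2f}")
print(f"mean of converted z-scores:          {z.mean():.2f}")
print(f"SD of converted z-scores:            {z.std():.2f}  (the normal model assumes 1)")
```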


A Z-Curve of Epidemiology

Concerns about credibility are widespread, but they often do not clearly distinguish between different sciences. One problem is that it is difficult to compare sciences quantitatively. One way to do so is to examine the strength of empirical evidence. We cannot compare effect sizes across sciences, but we can compare how precise effect size estimates are and how often rejections of null hypotheses may be false.

The first “science-wide” study was based on empirical results in medicine (Jager & Leek, 2013). Their methods and results were challenged, and it took some time before alternative methods became available. A better method is z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). Z-curve has mostly been used in psychology. A comparison with medicine showed that clinical trials have less power but report results more honestly than psychology, where focal tests confirm predictions with success rates over 90% (Schimmack, 2020; Schimmack & Bartos, 2023).

A recent article extracted confidence intervals from four epidemiology journals.

Ackley, S. F., Andrews, R. M., Seaman, C., Flanders, M., Chen, R., Wang, J., Lopes, G., Sims, K. D., Buto, P., Ferguson, E., Allen, I. E., & Glymour, M. M. (2025). Trends in the distribution of P values in epidemiology journals: A statistical, P-curve, and simulation study. American Journal of Epidemiology, 194(12), 3630–3639. https://doi.org/10.1093/aje/kwaf184

The authors were not aware of z-curve, but I was able to analyze their shared data. I examined trends over time and found no evidence that z-curve parameters correlated with publication year. Thus, the results are representative of the literature. An initial analysis with standard z-curve showed no evidence of selection bias. Therefore, I fitted the model to the full data, including non-significant results.

The key findings are: (a) the observed and expected discovery rates are both 91%. This means that 91% of the results are significant, but the reason is not selection bias, as is often the case in psychology, but high power to reject false null hypotheses; (b) the false positive risk is very low even with alpha = .05, and the probability that an exact replication study with a new sample from the same population would produce a significant result again is high. However, there is some evidence of p-hacking. That is, there are more just-significant results (z = 2 to 2.4) than the model predicts (Excessive Just Significance Test, p = .0008). Visual inspection of the plot shows, however, that the effect size is small (observed 0.25%, expected 0.24%) and that the statistical significance mainly reflects the large sample size.

In short, this is a healthy literature that instills confidence in epidemiological research. The reason is that epidemiological studies typically have large sample sizes and aim for precise effect size estimation. With this goal in mind, studies have high power to reject the null hypothesis of no effect, even though rejecting it is, by itself, not very informative. In contrast, psychologists often use small samples that are sometimes not even large enough to test a null hypothesis properly. Comparing epidemiology with psychology is therefore a bit like comparing apples and oranges, but this false comparison is often made when science as a whole is evaluated. In reality, different sciences face different problems. Low power and low replicability are problems for sciences that invest few resources in many studies, such as psychology. We cannot generalize from replication failures in these sciences to other sciences.

For psychology, effect size estimation mostly remains a scientific utopia, but the z-curve of epidemiology shows what that utopia looks like.

Willful Incompetence: Repeating False Claims Does Not Make them True

“Windmills are evil” (Don Quixote cited by Trump)

“Zcurve is made by the devil” (Pek et al., 2024)

Preamble

Ideal conceptions of science have a set of rules that help to distinguish beliefs from knowledge. Actual science is a game with few rules. Anything goes (Feyerabend), if you can sell it to an editor of a peer-reviewed journal. US American psychologists also conflate the meaning of freedom in “freedom of speech” and “academic freedom” to assume that there are no standards for truth in science, just like there are none in American politics. The game is to get more publications, citations, views, and clicks, and truth is decided by the winner of popularity contests. Well, not to be outdone in this war, I am posting yet another blog post about Pek’s quixotic attacks on z-curve.

For context, Pek has already received an F from statistics professor Jerry Brunner for her nonsensical attacks on a statistical method (Brunner, 2024), but even criticism by a professor of statistics has not deterred her from repeating misinformation about z-curve. I call this willful incompetence: the inability to listen to feedback and to wonder whether somebody else might have more expertise than oneself. This is not to be confused with the Dunning-Kruger effect, where people have no feedback about their failures. Here, failures are repeated again and again, despite strong feedback that errors are being made.

Context

One of the editors of Cognition and Emotion, Sander Koole, has been following our work and encouraged us to submit our work on the credibility of emotion research as an article to Cognition & Emotion. We were happy to do so. The manuscript was handled by the other editor, Klaus Rothermund. In the first round of reviews, we received a factually incorrect and hostile review by an anonymous reviewer. We were able to address these false criticisms of z-curve and resubmitted the manuscript. In a new round of reviews, the hostile reviewer came up with simulation studies that showed z-curve failing. We showed that this is indeed the case in simulations that used studies with N = 3 and 2 degrees of freedom. The problem here is not z-curve, but the transformation of t-values into z-values. When degrees of freedom are like those in the published literature we examined, this is not a problem. The article was finally accepted, but the hostile reviewer was allowed to write a commentary. At least it was now clear that the hostile reviewer was Pek.

I found out that the commentary had apparently been accepted for publication when somebody sent me the link to it on ResearchGate along with a friendly offer to help with a rebuttal. However, I could not wait and drafted a rebuttal with the help of ChatGPT. Importantly, I used ChatGPT to fact-check claims and control my emotions, not to write for me. Below you can find a clear, point-by-point response to all the factually incorrect claims about z-curve made by Pek et al. that passed whatever counts as human peer review at Cognition and Emotion.

Rebuttal

Abstract

1. What is the Expected Discovery Rate?

“EDR also lacks a clear interpretation in relation to credibility because it reflects both the average pre-data power of tests and the estimated average population effect size for studied effects.”

This sentence is unclear and introduces several poorly defined or conflated concepts. In particular, it confuses the meaning of the Expected Discovery Rate (EDR) and misrepresents what z-curve is designed to estimate.

A clear and correct definition of the Expected Discovery Rate (EDR) is that it is an estimate of the average true power of a set of studies. Each empirical study has an unknown population effect size and is subject to sampling error. The observed effect size is therefore a function of these two components. In standard null-hypothesis significance testing, the observed effect size is converted into a test statistic and a p-value, and the null hypothesis is rejected when the p-value falls below a prespecified criterion, typically α = .05.

Hypothetically, if the population effect size were known, one could specify the sampling distribution of the test statistic and compute the probability that the study would yield a statistically significant result—that is, its power (Cohen, 1988). The difficulty, of course, is that the true population effect size is unknown. However, when one considers a large set of studies, the distribution of observed p-values (or equivalently, z-values) provides information about the average true power of those studies. This is the quantity that z-curve seeks to estimate.

Average true power predicts the proportion of statistically significant results that should be observed in an actual body of studies (Brunner & Schimmack, 2020), in much the same way that the probability of heads predicts the proportion of heads in a long series of coin flips. The realized outcome will deviate from this expectation due to sampling error—for example, flipping a fair coin 100 times will rarely yield exactly 50 heads—but large deviations from the expected proportion would indicate that the assumed probability is incorrect. Analogously, if a set of studies has an average true power of 80%, the observed discovery rate should be close to 80%. Substantially lower rates imply that the true power of the studies is lower than assumed.
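A minimal simulation sketch (my own illustration, not the z-curve estimation code) of this point: with heterogeneous true effects and honest reporting, the share of significant results tracks the average true power of the studies, which is the quantity the EDR is meant to estimate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
k = 10_000
ncp = rng.uniform(0, 3, size=k)  # heterogeneous true noncentralities (effect / SE)

# True two-sided power of each study at alpha = .05.
true_power = norm.sf(1.96 - ncp) + norm.cdf(-1.96 - ncp)

# Observed z-values are the noncentralities plus standard normal sampling error.
z_obs = rng.normal(loc=ncp, scale=1.0)
discovery_rate = np.mean(np.abs(z_obs) > 1.96)

print(f"average true power:      {true_power.mean():.3f}")
print(f"observed discovery rate: {discovery_rate:.3f}")  # close to the average power
```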

Crucially, true power has nothing to do with pre-study (or pre-data) power, contrary to the claim made by Pek et al. Pre-study power is a hypothetical quantity based on researchers’ assumptions—often optimistic or wishful—about population effect sizes. These beliefs can influence study design decisions, such as planned sample size, but they cannot influence the outcome of a study. Study outcomes are determined by the actual population effect size and sampling variability, not by researchers’ expectations.

Pek et al. therefore conflate hypothetical pre-study power with true power in their description of EDR. This conflation is a fundamental conceptual error. Hypothetical power is irrelevant for interpreting observed results or evaluating their credibility. What matters for assessing the credibility of a body of empirical findings is the true power of the studies to produce statistically significant results, and EDR is explicitly designed to estimate that quantity.

Pek et al.’s misunderstanding of the z-curve estimands (i.e., the parameters the method is designed to estimate) undermines their more specific criticisms. If a critique misidentifies the target quantity, then objections about bias, consistency, or interpretability are no longer diagnostics of the method as defined; they are diagnostics of a different construct.

The situation is analogous to Bayesian critiques of NHST that proceed from an incorrect description of what p-values or Type I error rates mean. In that case, the criticism may sound principled, but it does not actually engage the inferential object used in NHST. Likewise here, Pek et al.’s argument rests on a category error about “power,” conflating hypothetical pre-study power (a design-stage quantity based on assumed effect sizes) with true power (the long-run success probability implied by the actual population effects and the study designs). Because z-curve’s EDR is an estimand tied to the latter, not the former, their critique is anchored in conceptual rather than empirical disagreement.

2. Z-Curve Does Not Follow the Law of Large Numbers

“simulation results further demonstrate that z-curve estimators can often be biased and inconsistent (i.e., they fail to follow the Law of Large Numbers), leading to potentially misleading conclusions.”

This statement is scientifically improper as written, for three reasons.

First, it generalizes from a limited set of simulation conditions to z-curve as a method in general. A simulation can establish that an estimator performs poorly under the specific data-generating process that was simulated, but it cannot justify a blanket claim about “z-curve estimators” across applications unless the simulated conditions represent the method’s intended model and cover the relevant range of plausible selection mechanisms. Pek et al. do not make those limitations explicit in the abstract, where readers typically take broad claims at face value.

Second, the statement is presented as if Pek et al.’s simulations settle the question, while omitting that z-curve has already been evaluated in extensive prior simulation work. That omission is not neutral: it creates the impression that the authors’ results are uniquely diagnostic, rather than one contribution within an existing validation literature. Because this point has been raised previously, continuing to omit it is not a minor oversight; it materially misleads readers about the evidentiary base for the method.

Third, and most importantly, their claim that z-curve estimates “fail to follow the Law of Large Numbers” is incorrect. Z-curve estimates are subject to ordinary sampling error, just like any other estimator based on finite data. A simple analogy is coin flipping: flipping a fair coin 10 times can, by chance, produce 10 heads, but flipping it 10,000 times will not produce 10,000 heads by chance. The same logic applies to z-curve. With a small number of studies, the estimated EDR can deviate substantially from its population value due to sampling variability; as the number of studies increases, those random deviations shrink. This is exactly why z-curve confidence intervals narrow as the number of included studies grows: sampling error decreases as the amount of information increases. Nothing about z-curve exempts it from this basic statistical principle. Suggesting otherwise implies that z-curve is somehow unique in how sampling error operates, when in fact it is a standard statistical model that estimates population parameters from observed data and, accordingly, becomes more precise as the sample size increases.

3. Sweeping Conclusion Not Supported by Evidence

“Accordingly, we do not recommend using 𝑍-curve to evaluate research findings.”

Based on these misrepresentations of z-curve, Pek et al. make a sweeping recommendation that z-curve estimates provide no useful information for evaluating published research and should be ignored. This recommendation is not only disproportionate to the issues they raise; it is also misaligned with the practical needs of emotion researchers. Researchers in this area have a legitimate interest in whether their literature resembles domains with comparatively strong replication performance or domains where replication has been markedly weaker. For example, a reasonable applied question is whether the published record in emotion research looks more like areas of cognitive psychology, where about 50% of results replicate, or more like social psychology, where about 25% replicate (Open Science Collaboration, 2015).

Z-curve is not a crystal ball capable of predicting the outcome of any particular future replication with certainty. Rather, the appropriate claim is more modest and more useful: z-curve provides model-based estimates that can help distinguish bodies of evidence that are broadly consistent with high average evidential strength from those that are more consistent with low average evidential strength and substantial selection. Used in that way, z-curve can assist emotion researchers in critically appraising decades of published results without requiring the field to replicate every study individually.

4. Ignoring the Replication Crisis That Led to The Development of Z-curve

"We advocate for traditional meta-analytic methods, which have a well-established history of producing appropriate and reliable statistical conclusions regarding focal research findings"

This statement ignores that traditional meta-analyses do not account for publication bias and have produced dramatically inflated effect size estimates. The authors ignore the need to take biases into account to separate true findings from false ones.

Article

5. False definition of EDR (again)

“EDR (cf. statistical power) is described as ‘the long-run success rate in a series of exact replication studies’ (Brunner & Schimmack, 2020, p. 1).”

This quotation describes statistical power in Brunner and Schimmack (2020), not the Expected Discovery Rate (EDR). The EDR was introduced later in Bartoš and Schimmack (2022) as part of z-curve 2.0, and, as described above, the EDR is an estimate of average true power. While the power of a single study can be defined in terms of the expected long-run frequency of significant results (Cohen, 1988), it can also be defined as the probability of obtaining a significant result in a single study. This is the typical use of power in a priori power calculations to plan a specific study. More importantly, the EDR is defined as the average true power of a set of unique studies and does not assume that these studies are exact replications.

Thus, the error is not merely a misplaced citation, but a substantive misrepresentation of what EDR is intended to estimate. Pek et al. import language used to motivate the concept of power in Brunner and Schimmack (2020) and incorrectly present it as a defining interpretation of EDR. This move obscures the fact that EDR is a summary parameter of a heterogeneous literature, not a prediction about repeated replications of a single experiment.

6. Confusing Observed Data with Unobserved Population Parameters (Ontological Error)

“Because z-curve analysis infers EDR from observed p-values, EDR can be understood as a measure of average observed power.”

This statement is incorrect. To clarify the issue without delving into technical statistical terminology, consider a simple coin-toss example. Suppose we flip a coin that, unknown to us, is biased and produces heads 60% of the time, and we toss it 100 times. We observe 55 heads. In this situation, we have an observed outcome (55 heads), an unknown population parameter (the true probability of heads, 60%), and an unknown expected value (60 heads in 100 tosses). Based on the observed data, we attempt to estimate the true probability of heads or to test the hypothesis that the coin is fair (i.e., that the expected number of heads is 50 out of 100). Importantly, we do not confuse the observed outcome with the true probability; rather, we use the observed outcome as noisy information about an underlying parameter. That is, we treat 55 as a reasonable estimate of the true probability of heads and use a confidence interval to see whether it includes 50%. If it does not, we can reject the hypothesis that the coin is fair.

Estimating average true power works in exactly the same way. If 100 honestly reported studies yield 36 statistically significant results, the best estimate of the average true power of these studies is 36%, and we would expect a similar discovery rate if the same 100 studies were repeated under identical conditions (Open Science Collaboration, 2015). Of course, we recognize that the observed rate of 36% is influenced by sampling error and that a replication might yield, for example, 35 or 37 significant results. The observed outcome is therefore treated as an estimate of an unknown parameter, not as the parameter itself. The true average power is probably not 36%, but it is somewhere around this estimate and not 80%.
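The same logic in a few lines (a sketch using the numbers from the example above and a simple normal-approximation confidence interval): the observed discovery rate of 36% is an estimate with uncertainty, and its confidence interval makes clear that an average true power of 80% is not compatible with the data.

```python
import numpy as np

k, n = 36, 100                      # significant results out of honestly reported tests
p_hat = k / n                       # point estimate of average true power
se = np.sqrt(p_hat * (1 - p_hat) / n)
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimated average power: {p_hat:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")  # roughly .27 to .45; excludes .80
```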

The problem with so-called “observed power” calculations arises precisely when this distinction is ignored—when estimates derived from noisy data are mistaken for true underlying parameters. This is the issue discussed by Hoenig and Heisey (2001). There is nothing inherently wrong with computing power using effect-size estimates from a study (see, e.g., Yuan & Maxwell, 200x); the problem arises when sampling error is ignored and estimated quantities are treated as if they were known population values. In a single study, the observed power could be 36% while the true power is 80%, but in a reasonably large set of studies such a large discrepancy will not happen.

Z-curve explicitly treats average true power as an unknown population parameter and uses the distribution of observed p-values to estimate it. Moreover, z-curve quantifies the uncertainty of this estimate by providing confidence intervals, and correct interpretations of z-curve results explicitly take this uncertainty into account. Thus, the alleged ontological error attributed to z-curve reflects a misunderstanding of basic statistical inference rather than a flaw in the method itself.

7. Modeling Sampling Error of Z-Values

“z-curve analysis assumes independence among the K analyzed p-values, making the inclusion criteria for p-values critical to defining the population of inference…. Including multiple 𝑝-values from the same sampling unit (e.g., an article) violates the independence assumption, as 𝑝-values within a sampling unit are often correlated. Such dependence can introduce bias, especially because the 𝑍-curve does not account for unequal numbers of 𝑝-values across sampling units or within-unit correlations.”

It is true that z-curve assumes that sampling error for a specific result converted into a z-value follows the standard normal distribution with a variance of 1. Correlations among results can lead to violations of this assumption. However, this does not imply that z-curve “fails” in the presence of any dependence, nor does it justify treating this point as a decisive objection to our application. Rather, it means that analysts should take reasonable steps to limit dependence or to use inference procedures that are robust to clustering of results within studies or articles.

A conservative way to meet the independence assumption is to select only one test per study, or one test per article in multiple-study articles where the origin of results is not clear. It is also possible to use more than one result per study by computing bootstrap confidence intervals that resample studies with replacement and draw one result at random from each sampled study, so that different results can enter different bootstrap samples. This is closely related to standard practices in meta-analysis for handling multiple dependent effects per study, where uncertainty is estimated with resampling or hierarchical approaches rather than by treating every effect size as independent. The practical impact of dependence also depends on the extent of clustering. In z-curve applications with large sets of articles (e.g., all articles in Cognition and Emotion), the influence of modest dependence is typically limited, and in our application we obtain similar estimates whether we treat results as independent or use clustered bootstrapping to compute uncertainty. Thus, even if Pek et al.’s point is granted in principle, it does not materially change the interpretation of our empirical findings about the emotion literature. Although we pointed this out in our previous review, the authors continue to misrepresent how our z-curve analyses addressed non-independence among p-values (e.g., by using clustered bootstrapping and/or one-test-per-study rules).
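A minimal sketch of this resampling idea (my own illustration, not the implementation in the z-curve package): each bootstrap sample draws studies with replacement and then picks one z-value at random from each sampled study, so dependent results from the same study never enter a bootstrap sample together.

```python
import numpy as np

rng = np.random.default_rng(7)

def cluster_bootstrap_ci(results_by_study, stat=np.mean, n_boot=2000):
    # results_by_study: list of arrays, one array of z-values per study.
    n_studies = len(results_by_study)
    boot_stats = []
    for _ in range(n_boot):
        study_idx = rng.integers(0, n_studies, size=n_studies)         # resample studies
        sample = [rng.choice(results_by_study[i]) for i in study_idx]  # one z per study
        boot_stats.append(stat(np.array(sample)))
    return np.percentile(boot_stats, [2.5, 97.5])

# Made-up example: three studies, each contributing several (dependent) z-values.
studies = [np.array([2.1, 2.3]), np.array([1.7, 2.0, 2.2]), np.array([3.0])]
print(cluster_bootstrap_ci(studies))
```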

8. Automatic Extraction of Test Statistics

“Unsurprisingly, automated text mining methods for extracting test statistics has been criticized for its inability to reliably identify 𝑝-values suitable for forensic meta-analysis, such as 𝑍-curve analysis.”

This statement fails to take into account the advantages and disadvantages of automatically extracting results from articles. The advantages are that we have nearly population-level data for research in the top two emotion journals and that this makes it possible to examine time trends (did power increase? did selection bias decrease?). The main drawback is that automatic extraction does not, by itself, distinguish between focal tests (i.e., tests that bear directly on an article’s key claim) and non-focal tests. We are explicit about this limitation and also included analyses of hand-coded focal tests to supplement the results based on automatically extracted test statistics. Importantly, our conclusion that z-curve estimates are similar across these coding approaches is consistent with an often-overlooked feature of Cohen’s (1962) classic assessment of statistical power: Cohen explicitly distinguished between focal and non-focal tests and reported that this distinction did not materially change his inferences about typical power. In this respect, our hand-coded focal analyses suggest that the inclusion of non-focal tests in large-scale automated extraction is not necessarily a fatal limitation for estimating average evidential strength at the level of a literature, although it remains essential to be transparent about what is being sampled and to supplement automated extraction with focal coding when possible.

Pek et al. accurately describe our automated extraction procedure as relying on reported test statistics (e.g., t, F), which are then converted into z-values for z-curve analysis. However, their subsequent criticism shifts to objections that apply specifically to analyses based on scraped p-values, such as concerns about rounded or imprecise information about p-values (e.g., p < .05) and their suitability for forensic meta-analysis. This criticism is valid, but it is also the reason why we do not use p-values for z-curve analysis when better information is available.

9. Pek et al.’s Simulation Study: What it really shows

Pek et al.’s description of their simulation study is confusing. They call one condition “no bias” and the other “bias.” The problem here is that “no bias” refers to a simulation in which selection bias is present; it merely assumes that α = .05 serves as the only selection mechanism. That is, studies are selected based on statistical significance, but there is no additional selection among statistically significant results. Most importantly, it is assumed that there is no further selection based on effect sizes.

Pek et al.’s simulation of “bias” instead implies that researchers would not publish a result if d = .2, but would publish it if d = .5, consistent with a selection mechanism that favors larger observed effects among statistically significant results. Importantly, their simulation does not generalize to other violations of the assumptions underlying z-curve. In particular, it represents only one specific form of within-significance selection and does not address alternative selection mechanisms that have been widely discussed in the literature.
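To make the two selection mechanisms concrete, here is a minimal sketch with made-up parameters (my own illustration, not a reproduction of Pek et al.’s simulation): selecting significant results that also have observed effects above d = .5 shifts the distribution of significant z-values upward relative to selection on significance alone, which is the pattern that leads z-curve to overestimate the EDR in their "bias" condition.

```python
import numpy as np

rng = np.random.default_rng(3)
n_per_group, true_d, n_sim = 50, 0.3, 100_000

se_d = np.sqrt(2 / n_per_group)               # approximate standard error of d
d_obs = rng.normal(true_d, se_d, size=n_sim)  # observed effect sizes
z_obs = d_obs / se_d

sig = z_obs > 1.96                            # selection on significance only
sig_large = sig & (d_obs > 0.5)               # additional selection on observed effect size

print(f"mean significant z (significance only):     {z_obs[sig].mean():.2f}")
print(f"mean significant z (plus d > .5 selection): {z_obs[sig_large].mean():.2f}")
```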

One such alternative mechanism, and a major concern about the credibility of psychological research, is p-hacking, where researchers use flexibility in data analysis to obtain statistically significant results from studies with low power. P-hacking has the opposite effect of Pek et al.’s simulated bias. Rather than boosting the representation of studies with high power, studies with low power are over-represented among the statistically significant results.

Pek et al. are correct that z-curve estimates depend on assumptions about the selection mechanism, but this is not a fundamental problem. All selection models necessarily rely on assumptions about how studies enter the published literature, and different models make different assumptions (e.g., selection on significance thresholds, on p-value intervals, or on effect sizes). Because the specific practices that generate bias in published results are unknown, no selection model can avoid such assumptions, and z-curve’s assumptions are neither unique nor unusually restrictive.

Pek et al.’s simulations are also confusing because they include scenarios in which all p-values are reported and analyzed. These conditions are not relevant for standard applications of z-curve that assume and usually find evidence of bias. Accordingly, we focus on the simulations that match the usual publication environment, in which z-curve is fitted to the distribution of statistically significant z-values.

Pek et al.’s figures are also easy to misinterpret because the y-axis is restricted to a very narrow range of values. Although EDR estimates can in principle range from alpha (5%) to 100%, the y-axis in Figure 1a spans only approximately 60% to 85%. This makes estimation errors look large visually, even though they are numerically relatively small.

In the relevant condition, the true EDR is 72.5%. For small sets of studies (e.g., K = 100), the estimated EDR falls roughly 10 percentage points below this value, a deviation that is visually exaggerated by the truncated y-axis. As the number of studies increases, the point estimate approaches the true value. In short, Pek et al.’s simulation reproduces Bartoš and Schimmack’s finding that z-curve estimates are fairly accurate when bias is simply selection for significance.

The simulation based on selection by strength of evidence leads to an overestimation of the EDR. Here, smaller samples appear more accurate because they underestimate the EDR and the two biases cancel out. More relevant is that, with large samples, z-curve overestimates true average power (the EDR) by about 10 percentage points. This value is specific to one simulated form of bias and could be larger or smaller in practice. The main point of this simulation is to show that z-curve estimates depend on the type of selection bias in a set of studies. The simulation does not tell us the nature of actual selection biases or how much bias violations of the selection assumption introduce into z-curve estimates.

From a practical point of view, an overestimation by 10 percentage points is not fatal. If the EDR estimate is 80% and the true average power is only 70%, the literature is still credible. The problem is bigger for literatures that already have low EDRs, like experimental social psychology. With an EDR of 21%, a 10-percentage-point correction would reduce the EDR to 11%, and the lower bound of the CI would include 5% (Schimmack, 2020), implying that all significant results could be false positives. Thus, Pek et al.’s simulation suggests that z-curve estimates may be overly optimistic. In fact, z-curve overestimates replicability compared to actual replication outcomes in the reproducibility project (Open Science Collaboration, 2015). Pek et al.’s simulations suggest that selection for effect sizes could be one reason, but other reasons cannot be ruled out.

Simulation results for the False Discovery Risk and for bias (Observed Discovery Rate minus Expected Discovery Rate) show the same pattern because both quantities are direct functions of the EDR. The Expected Replication Rate (ERR), the average true power of the significant results, is a different parameter, but it shows the same pattern.
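For readers unfamiliar with this link, the sketch below shows the arithmetic of Soric’s (1989) formula that z-curve uses to convert an EDR into a maximum False Discovery Risk; the EDR values are the ones discussed above plus illustrative extremes.

```python
# Minimal sketch of Soric's (1989) formula, which z-curve uses to turn an EDR
# into a maximum False Discovery Risk. The EDR values are illustrative.
def soric_fdr(edr, alpha=0.05):
    """Upper bound on the false discovery rate implied by a discovery rate."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

for edr in (0.725, 0.27, 0.21, 0.11, 0.05):
    print(f"EDR = {edr:>5.1%}  ->  maximum FDR = {soric_fdr(edr):.0%}")
# An EDR at the level of alpha (5%) implies a maximum FDR of 100%: all
# significant results could be false positives.
```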

In short, Pek et al.’s simulations show that z-curve estimates depend on the actual selection processes that are unknown, but that does not invalidate z-curve estimates. Especially important is that z-curve evaluations of credibility are asymmetrical (Schimmack, 2012). Low values raise concerns about a literature, but high values do not ensure credibility (Soto & Schimmack, 2024).

Specific Criticism of the Z-Curve Results in the Emotion Literature

10. Automatic Extraction (Again)

“Based on our discussion on the importance of determining independent sampling units, formulating a well-defined research question, establishing rigorous inclusion and exclusion criteria for p-values, and conducting thorough quality checks on selected p-values, we have strong reservations about the methods used in SS2024.” (Pek et al.)

As already mentioned, the population of all statistical hypothesis tests reported in a literature is meaningful for researchers in this area. Concerns about low replicability and high false positive rates have undermined the credibility of the empirical foundations of psychological research. We examined this question empirically using all available statistical test results. This defines a clearly specified population of reported results and a well-defined research question. The key limitation remains that automatic extraction does not distinguish focal and non-focal results. We believe that information for all tests is still important. After all, why are they reported if they are entirely useless? Does it not matter whether a manipulation check worked or whether a predicted result was moderated by gender? Moreover, it is well known that focality is often determined only after results are known in order to construct a compelling narrative (Kerr, 1998). A prominent illustration is provided by Cesario, Plaks, and Higgins (2006), where a failure to replicate the original main effect was nonetheless presented as a successful conceptual replication based on a significant moderator effect.

Pek et al. further argue that analyzing all reported tests violates the independence assumption. However, our inference relied on bootstrapping with articles as the clustering unit, which is the appropriate approach when multiple test statistics are nested within articles and directly addresses the dependence they emphasize. In addition, SS2024 reports z-curve analyses based on hand-coded focal tests that are not subject to these objections; these results are not discussed in Pek et al.’s critique.
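To make the clustering argument concrete, here is a minimal sketch of an article-level (cluster) bootstrap; the data are simulated and the statistic is a simple stand-in, not the actual z-curve estimator.

```python
# Minimal sketch of a cluster bootstrap with articles as the resampling unit,
# so that dependent test statistics from the same article stay together.
import numpy as np

rng = np.random.default_rng(3)

def cluster_bootstrap_ci(values, article_ids, stat, n_boot=2000, level=0.95):
    """Percentile CI for stat(values) based on resampling whole articles."""
    articles = np.unique(article_ids)
    by_article = {a: values[article_ids == a] for a in articles}
    boot = []
    for _ in range(n_boot):
        sample = rng.choice(articles, size=len(articles), replace=True)
        boot.append(stat(np.concatenate([by_article[a] for a in sample])))
    lo, hi = np.percentile(boot, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return lo, hi

# Toy data: 200 articles contributing 1-10 significant z-values each.
article_ids = np.repeat(np.arange(200), rng.integers(1, 11, 200))
z_values = rng.normal(2.5, 1.0, size=article_ids.size)
print(cluster_bootstrap_ci(z_values, article_ids, stat=np.mean))
```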

11. No Bias in Psychology

“Even if the Z-curve estimates and their CIs are unbiased and exhibit proper coverage, SS2024’s claim of selection bias in emotional research – based on observing that EDR for both journals were not contained within their respective 95% CIs for ODR – is dubious.” (Pek et al.)

It is striking that Pek et al. question z-curve evidence of publication bias. Even setting aside z-curve entirely, it is difficult to defend the assumption of honest and unbiased reporting in psychology. Sterling (1959) already noted that success rates approaching those observed in the literature are implausible under unbiased reporting, and subsequent surveys have repeatedly documented overwhelmingly high rates of statistically significant findings (Sterling et al., 1995).
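A back-of-the-envelope calculation illustrates Sterling’s point; the assumed average power and success rate below are illustrative values, not estimates.

```python
# Minimal sketch: probability of observing a near-universal success rate if all
# tests were reported without selection, given an assumed average power.
from scipy import stats

avg_power, n_focal_tests, observed_rate = 0.50, 100, 0.95
k = int(observed_rate * n_focal_tests)                # 95 significant out of 100
p = stats.binom.sf(k - 1, n_focal_tests, avg_power)   # P(at least k successes)
print(f"P(>= {k}/{n_focal_tests} significant | power = {avg_power:.0%}) = {p:.1e}")
# Even with 50% average power, 95 significant results out of 100 honestly
# reported tests would be an astronomically unlikely outcome.
```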

To dismiss z-curve evidence of selection bias as “dubious” would therefore require assuming that average true power in psychology is extraordinarily high. This assumption is inconsistent with longstanding evidence that psychological studies are typically underpowered to detect even moderate effect sizes, with average power estimates far below conventional benchmarks (Cohen, 1988). None of these well-established considerations appear to inform Pek et al.’s evaluation of z-curve, which treats its results in isolation from the broader empirical literature on publication bias and research credibility. In this broader context, the combination of extremely high observed discovery rates for focal tests and low EDR estimates—such as the EDR of 27% reported in SS2024—is neither surprising nor dubious, but aligns with conclusions drawn from independent approaches, including large-scale replication efforts (Open Science Collaboration, 2015).

12. Misunderstanding of Estimation

“Inference using these estimators in the presence of bias would be misleading because the estimators converge onto an incorrect value.” (Pek et al.)

This statement repeats the fallacy of drawing general conclusions about the interpretability of z-curve from a specific, stylized simulation. In addition, Pek et al.’s argument effectively treats point estimates as the sole inferential output of z-curve analyses while disregarding uncertainty. Point estimates are never exact representations of unknown population parameters. If this standard were applied consistently, virtually all empirical research would have to be dismissed on the grounds that estimates are imperfect. Instead, estimates must be interpreted in light of their associated uncertainty and reasonable assumptions about error.

For the 227 significant hand-coded focal tests, the point estimate of the EDR was 27%, with a confidence interval ranging from 10% to 67%. Even if one were to assume an overestimation of 10 percentage points, as suggested by Pek et al.’s most pessimistic simulation scenario, the adjusted estimate would be 17%, and the lower bound of the confidence interval would include 5%. Under such conditions, it cannot be ruled out that a substantial proportion—or even all—statistically significant focal results in this literature are false positives. Rather than undermining our conclusions, Pek et al.’s simulation therefore reinforces the concern that many focal findings in the emotion literature may lack evidential value. At the same time, the width of the confidence interval also allows for more optimistic scenarios. The appropriate response to this uncertainty is to code and analyze additional studies, not to dismiss z-curve results simply because they do not yield perfect estimates of unknown population parameters.

13. Conclusion Does Not Follow From the Arguments

“Z-curve as a tool to index credibility faces fundamental challenges – both at the definitional and interpretational levels as well as in the statistical performance of its estimators.” (Pek et al.)

This conclusion does not follow from Pek et al.’s analyses. Their critique rests on selective simulations, treats point estimates as decisive while disregarding uncertainty, and evaluates z-curve in isolation from the broader literature on publication bias, statistical power, and replication. Rather than engaging with z-curve’s assumptions, scope, and documented performance under realistic conditions, their argument relies on narrow counterexamples that are then generalized to broad claims about invalidity.

More broadly, the article exemplifies a familiar pattern in which methodological tools are evaluated against unrealistic standards of perfection rather than by their ability to provide informative, uncertainty-qualified evidence under real-world conditions. Such standards would invalidate not only z-curve, but most statistical methods used in empirical science. When competing conclusions are presented about the credibility of a research literature, the appropriate response is not to dismiss imperfect tools, but to weigh the totality of evidence, assumptions, and robustness checks supporting each position.

We can debate whether the average true power of studies in the emotion literature is closer to 5% or 50%, but there is no plausible scenario under which average true power would justify success rates exceeding 90%. We can also debate the appropriate trade-off between false positives and false negatives, but it is equally clear that the standard significance criterion does not warrant the conclusion that no more than 5% of statistically significant results are false positives, especially in the presence of selection bias and low power. One may choose to dismiss z-curve results, but what cannot be justified is a return to uncorrected effect-size meta-analyses that assume unbiased reporting. Such approaches systematically inflate effect-size estimates and can even produce compelling meta-analytic evidence for effects that do not exist, as vividly illustrated by Bem’s (2011) meta-analysis of extrasensory perception findings.

Postscript

Ideally, the Schimmack-Pek controversy will attract attention from human third parties with sufficient statistical expertise to understand the issues and weigh in. As Pek et al. point out, a statistical tool that can distinguish between credible and unbelievable research is needed. Effect-size meta-analyses are also increasingly recognizing the need to correct for bias, and new methods show promise. Z-curve is one tool among others. Rather than dismissing these attempts, we need to improve them, because we cannot go back to the time when psychologists were advised to err on the side of discovery (Bem, 2000).

2025: A Quantitative Review

First, a sigh of relief. We made it through 2025, despite the rise of evil in the world. Let’s all hope or pray that 2026 will be better. In our little and mostly harmless world of psychology, however, things were rather calm. Here are some stats from my blog to review 2025.

Clicks and Engagement

Many of these stats are bots and 1-second visits, but some are real humans who are interested in scientific psychology and how to make it better. As long as I see these levels of engagement, I feel encouraged to keep the blog going. Likes and comments are rare, but very much appreciated. As the saying goes, one comment is worth more than 1,000 visits.

Content

The posts about “Thinking, Fast and Slow” are still the most viewed pages. “Reconstruction of a Train Wreck” is home to Kahneman’s humble response to the z-curve analysis of the priming chapter, but the “Meta-Scientific Perspective” post is a review of all chapters.

The surprising addition is my farewell to personality psychology. It was an emotional rant expressing my frustration with bad measurement practices and the field’s unwillingness to improve them.

Personality Psychology: Bye Bye, Au revoir, Auf Wiedersehen – Replicability-Index

I was ready to give up on it entirely, but ironically, I am still teaching an undergraduate course on personality, which I define as the study of personality differences. Fortunately, there is enough research and open data that I can analyze myself, so I can teach the course from a coherent scientific perspective. The textbook is now available on the blog.

Personality Science: The Science of Human Diversity – Replicability-Index

Another surprise is that the new post on terror management made the Top 10 list. I thank the author for including z-curve in their meta-analysis. The article shows that an entire literature with more than 800 studies can be made up of studies with low replicability and a high false positive risk. It also shows how z-curve is superior to p-curve, which can only reject the null hypothesis that all (100%) of the studies are false positives, but not the hypothesis that 95% are.

R.I.P Terror Management: A Z-Curve Analysis – Replicability-Index

Below the Top 10 are still some noteworthy posts from 2025.

Why Uri Simonsohn is a Jerk – Replicability-Index (967 views)
I found an old email that referred to Uri as a jerk. The complaint was that the datacolada team are all about open science and criticism of others, but do not even allow people who are criticized to post an open response on their blog. No comments, please! I say that is not open science. That is just like legacy journals that do not publish comments they do not like (yes, I mean you, Psychological Methods).

Review of “On the Poor Statistical Properties of the P-Curve Meta-Analytic Procedure” – Replicability-Index (657 views)

A related post features a critique of p-curve that later triggered a “no comments allowed” response by Uri on the datacolada blog. After 10 years of p-curve, it is fair to say that it has produced only one notable finding. Most of the time, p-curve shows evidential value; that is, we can reject the null hypothesis that all results are false positives. So, concerns about massive p-hacking in the false-positive article are not empirically supported. The era of false-positive paranoia is coming to an end. More on this in a forthcoming post.

The Ideology versus the Science of Evolved Sex Differences – Replicability-Index (649 views)

My personality textbook includes a scientific review of the research on sex differences. It debunks many stereotypes that are often dressed up as pseudo-scientific evolutionary wet dreams of sexist pseudo-scientists like Roy F. Baumeister, who is still treated by some psychologists as an eminent scholar. To be a science, psychology has to hold psychologists accountable to scientific standards. Otherwise, it is just a cult.

Psychologists really confuse academic freedom with “you can say the stupidest things,” as in one article that compared the use of AI to dishonest research practices.

Is Using AI in Science Really “Research Misconduct”? A Response to Guest & van Rooij – Replicability-Index (649 views)

Many of these articles could be published as blog posts, saving us (taxpayers) money.

Traffic

Most of the traffic still comes from search engines and over 90% of this traffic comes from Google. Next is Facebook, where I maintain the Psychological Methods Discussion Group.
The group was very active during the replication crisis, but little discussion occurs these days. I would love to move it somewhere else, but I have not found an easy and cheap alternative. Interestingly, Open Science advocates also do not seem to see value in hosting an open forum for discussion. I tried to post on APS, but you have to pay to correct their misinformation. So, there you have it. Psychology lacks an open discussion forum (OSF); talk about scientific utopia. I left fascist X a long time ago, but still get traffic from there. I am now posting on Bluesky with little direct engagement, but apparently some people notice the posts and visit.

Most interesting are the visits from ChatGPT. I am fully aware of the ecological problems, but AI will fix many of the problems that psychology faces, and the equally environmentally problematic conferences serve more as taxpayer-paid vacations than as a way to advance science. Nothing wrong with perks for underpaid academics, but let’s not pretend it advances science when talks are just sales pitches for the latest pseudo-innovation. Anyhow, blogging is now more important than publishing in costly peer-reviewed journals because AI does not care about peer review and is blocked from content behind paywalls.

Some academics rail against AI, but apparently they have never used one. ChatGPT constantly finds errors in my drafts and helps me to correct them before I post them. It also finds plenty of mistakes in published articles, and we often have a chuckle that this stuff passed human peer review. On that note, human peer review sucks and can easily be replaced with AI reviews. At least you don’t have to wait months to get incoherent, nonsensical reviews that only reveal the reviewers’ lack of expertise.

The traffic from ChatGPT also underestimates the importance of AI. Few people actually click on links to check information, but ChatGPT’s answer is still influenced by my content. Ultimately, the real criterion for impact will be how much our work influences AI answers.

Location

The United States continues to dominate psychology. Europe has more people, but only a few European countries invest (waste?) money on psychology. A clear divide between the North-West and the rest is visible. The formerly communist East is just poor, but the problem in the South is religion.

One of the highlights of 2025 was my visit to the winter school in Padua, one of the oldest universities. I learned that it was relatively free and that Galileo made important discoveries there, but he ran into “a little problem” when he moved to Florence and clashed with the Catholic Church. The lesson is that the broader culture influences science, and currently the USA is showing that its values are inconsistent with science. Religious fundamentalism in the Confederate states is incompatible with science, especially a social one. China is on the rise, and it seems more likely that it will be the next home of psychology, unless Europe gets its act together. China is a totalitarian regime, but a communist dictatorship seems better than religious ones for science and for the future of the planet.

Forecast 2026

The main prediction is that traffic from ChatGPT and China will increase. Possibly, traffic from other AIs like Gemini will also emerge. Other developments are harder to predict, and that is the fun of blogging. I don’t have to invest months of my limited remaining life span fighting stupid reviewers to be allowed to pay $3,000 or more of Canadian taxpayers’ money to share my work with the world. I can do so for free and just see whether somebody finds the work interesting. Fortunately, I am paid well enough that I do not have to worry about the incentive structure in academia that everybody knows leads to fast, cheap, and bad science, but nobody is able to change. For those of you who are still a cog in this mindless machine, I can say there is hope. The tyranny of publication cartels is weakening, and maybe sometimes it is better to just post a preprint than to try the 10th journal for the stamp of approval in a peer-reviewed (predatory) journal. Academia is a long game. The rewards come at the end when you can do what you want without concerns about approval. Humanist psychologists call this self-actualization. I call it the freedom of not giving a fuck anymore.

See you in 2026,
Ulrich Schimmack

Happy New Year

Beyond Open Science: A Meta-Paradigmatic Perspective

What Is Science?

What is science? According to ChatGPT, the most basic concept of science lacks a clear definition. There is not one science, but many sciences that share overlapping features, creating a family resemblance rather than a set of necessary and sufficient conditions. As Laudan (1983) argued, “the search for a demarcation criterion between science and non-science is a pseudo-problem.”

Attempts to define science in terms of verification, falsifiability, empirical content, prediction, method, progress, or realism have all faced objections. Nevertheless, these concepts remain relevant for distinguishing science from other belief systems.

Even in the absence of strict definitions, concepts can be characterized by prototypes. Psychology distinguishes between descriptive prototypes, which capture typical features (e.g., feathers and flight for birds), and ideal prototypes, which represent standards rather than averages. Nobody is perfectly healthy or happy, but comparison to an ideal allows meaningful evaluation. I argue that science functions in the same way: not as a fixed set of practices, but as an ideal prototype against which actual scientific activity can be evaluated.

An old-fashioned ideal prototype of science describes it as a collective effort to test beliefs and to revise or replace them when new information reveals inconsistencies (Hume, Popper, Peirce, Dewey). A defining feature of this ideal is openness—openness to new evidence and openness to changing beliefs.

The value science places on discovery reflects this openness. Novelty matters because scientific inquiry is oriented toward progressive improvement in understanding. The history of science shows that expanding, revising, and sometimes challenging existing belief systems is a central driver of progress, even when that progress is slow and non-linear.


Paradigms and Confined Openness

The definition of science as an ideal prototype differs from descriptions of actual scientific practice because reality rarely matches ideals. Scientists are human agents embedded in social and institutional contexts, and their behavior is shaped by incentives that can conflict with norms of openness and belief revision. Kuhn’s analysis of paradigms and paradigm shifts illustrates these tensions between epistemic ideals and community dynamics.

A scientific paradigm functions much like a culture: it has foundational beliefs, socialization practices, initiation rituals, and a collective goal of preservation and expansion. Within paradigms, researchers may revise beliefs and pursue novelty, but foundational assumptions are typically treated as off-limits. For example, a foundational assumption in mainstream social psychology is that experiments are the primary source of valuable knowledge, privileging laboratory studies over field research.

This produces what I call confined openness: openness to criticism, replication, and revision within a paradigm, combined with resistance to challenges that target its foundations. A visitor to a scientific conference would see many hallmarks of science on display, yet might not notice that certain questions are never asked.


The Replication Crisis in Social Psychology

In the early 2010s, it became evident that common research practices in experimental social psychology deviated from the ideal of open inquiry in which evidence can genuinely threaten beliefs. A key flashpoint was Bem (2011), which reported evidence for a phenomenon incompatible with established physical and psychological assumptions. The lesson was not that Bem committed fraud, but that ordinary analytic flexibility combined with selective publication can make implausible claims appear empirically supported (Schimmack, 2012).

The core problem was that researchers could accumulate confirmatory evidence without reporting nonconfirmatory outcomes. When null or contradictory results are systematically underreported, the published literature ceases to constrain belief revision. Psychology has long exhibited unusually high rates of statistically significant findings; Sterling (1959) reported that roughly 97% of tests rejected the null, and later work confirmed this excess of positive results (Motyl et al., 2017).

In addition, many journals are organized around specific paradigms and explicitly aim to promote them. Such journals are structurally unlikely to publish work that challenges the paradigm’s core assumptions.


Open Science and Its Limits

In response, researchers advocated reforms under the banner of open science, typically operationalized as procedural transparency and reproducibility: sharing data, materials, and code; preregistration; replication; and safeguards against selective reporting. These reforms improve error detection and accountability within paradigms by making claims easier to audit and by reducing reporting flexibility.

The replication crisis also socialized a new generation of researchers to view credibility as a methodological and institutional problem rather than a matter of personal integrity. However, the open science movement’s focus on single studies and single findings risks deflecting attention from deeper structural sources of closedness. These include incentive systems that reward publishable success, norms that delimit legitimate questions, and paradigm-level assumptions treated as nonnegotiable.

The most fundamental constraint is the trap of paradigmatic research described by Kuhn. Paradigms restrict openness by confining criticism to questions that can be addressed within an accepted framework. In mature sciences, stable theoretical foundations allow paradigmatic research to produce cumulative progress. Psychology, by contrast, lacks a unifying paradigm and is fragmented into numerous micro-paradigms sustained as much by social and institutional commitments as by decisive empirical support. Debates such as the personality–situation controversy illustrate how paradigm boundaries can become sites of identity and norm enforcement rather than objects of open-ended inquiry.

Current incentive structures exacerbate this problem. Science operates as a reputational marketplace in which publications, grants, and visibility are assumed to signal quality. Yet producers and evaluators largely overlap. Reviewers, editors, and panelists are drawn from within paradigms, creating selection pressures favoring work that extends existing frameworks. These dynamics propagate into citation counts, funding decisions, and career advancement, reinforcing paradigmatic stability.

Open science rightly targets misaligned incentives for transparency and reproducibility (Nosek et al., 2015), but it remains focused on improving paradigmatic research rather than evaluating paradigms themselves. As a result, research programs can become highly replicable without producing theoretical progress. For example, it is highly replicable that self-report measures and Implicit Association Test scores correlate weakly. Yet this regularity alone does not resolve whether the discrepancy reflects measurement error or a substantive distinction between conscious and unconscious processes (Schimmack, 2021).

After more than a decade of reform, these deeper concerns remain largely unaddressed. Researchers can adopt open practices while leaving foundational assumptions intact. While implausible claims are now harder to publish, the incentive structure still rewards work that stabilizes paradigms rather than subjects them to serious challenge.


Meta-Analysis, Meta-Science, and Meta-Paradigmatic Critique

The term meta has a long history in psychology. Meta-analysis emerged in the 1970s to integrate evidence across studies, and today meta-analyses are highly cited because they summarize large literatures. However, meta-analyses typically aggregate results produced within paradigms rather than evaluating the paradigms themselves. Theoretical questions and under-studied alternatives fall outside their scope, and conclusions are shaped by publication bias and researcher allegiance.

Addressing these limitations requires moving beyond meta-analysis to meta-science. Meta-science evaluates research programs rather than producing new findings. Meta-scientists function as knowledgeable consumers who assess whether bodies of research adhere to best practices and whether paradigms remain epistemically productive.

Yet most meta-science operates at a high level of abstraction, focusing on general properties of science rather than sustained critique of specific paradigms. What is needed instead is meta-paradigmatic evaluation: paradigm-specific critique conducted by domain experts who are institutionally independent of the paradigms they evaluate.


Toward Open Meta-Paradigmatic Science

Systematic paradigm evaluation is likely to encounter resistance, just as open science did. Meta-paradigmatic critique may be framed as a threat to academic freedom. But academic freedom has never been absolute. Researchers accept ethics review and other forms of oversight when fundamental norms are at stake. Critical evaluation does not restrict inquiry; it is a constitutive feature of science.

Unlike open science reforms, meta-paradigmatic evaluation requires institutional change. It must be recognized as a legitimate scholarly activity with its own funding, review panels, positions, and journals. While meta-science is itself imperfect and subject to capture, it offers a cost-effective means of preventing the long-term stagnation of research programs.

Existing outlets provide only partial solutions. Journals with low rejection rates reduce gatekeeping but carry high costs and low prestige. Specialized journals such as Meta-Psychology welcome critical evaluation but remain marginal. Journals devoted to meta-science typically operate at an abstract level and do not engage deeply with specific paradigms.

What is missing are field-specific, meta-paradigmatic journals insulated from paradigm capture. Meta-paradigmatic critique requires deep disciplinary expertise combined with institutional independence—an uncommon combination given current training and reward structures.


Conclusion: Science as Utopia

The ideal prototype of science is a utopia: a state that cannot be fully realized but that serves as a regulative aspiration. Open and honest reporting of results should not be a utopia; it is a minimal requirement of scientific practice, and reward structures must support it.

A more demanding utopia requires something further: openness to sustained critical examination of fundamental beliefs. Such beliefs can carry emotional and identity-laden significance for scientists, comparable to religious beliefs for believers. Because humans naturally resist scrutiny of core commitments, openness at this level cannot be left to individual virtue. It must be institutionalized.

Open and independent challenges to scientific paradigms—especially at the level of foundational assumptions—should therefore be understood not as threats to science, but as necessary conditions for its long-term epistemic vitality.



Core References

Laudan, L. (1983). The demise of the demarcation problem. In R. S. Cohen & L. Laudan (Eds.), Physics, philosophy and psychoanalysis: Essays in honor of Adolf Grünbaum (pp. 111–127). D. Reidel.

Kuhn, T. S. (1970). The structure of scientific revolutions (2nd ed.). University of Chicago Press.

Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. https://doi.org/10.1177/1745691612459058

Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., et al. (2015). Promoting an open research culture. Science, 348(6242), 1422–1425. https://doi.org/10.1126/science.aab2374

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Schimmack, U. (2021). Invalid claims about the validity of implicit association tests by prisoners of the implicit social-cognition paradigm. Perspectives on Psychological Science, 16(2), 435–442. https://doi.org/10.1177/1745691621991860

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54, 30–34.

Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C. A., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. https://doi.org/10.1037/pspa0000084


A Scientific Response to The Right-Wing War on Science

At its best, science is amazing. It produces discoveries that change our understanding of the world—and the world itself. Human lives have been transformed by scientific knowledge and technology, often for the better. It has certainly made my life better than that of my ancestors.

Yet science continues to be under attack. Historically, religious dogma sometimes clashed with scientific progress. It took the Catholic Church more than three centuries before Pope John Paul II formally acknowledged that Galileo was right to claim that the Earth moves around the Sun.

A more recent and devastating example is Nazi Germany, where science was subordinated to ideological pseudo-science in order to justify mass murder. The regime also drove out many Jewish scientists, some of whom later contributed to the Allied war effort. Later still, scientific progress in the Eastern Bloc was hampered by putting party loyalty above scientific merit. These episodes illustrate a recurring lesson: science requires ethical guardrails, but it does not survive political domination.

Today, science is advancing rapidly in parts of the world, including China, for example through major investments in green energy. At the same time, the United States has increasingly undermined scientific consensus on issues such as vaccines and climate change and has placed growing pressure on scientific institutions. A number of observers warn that these developments threaten academic freedom and risk slowing scientific progress. One prominent justification for attacks on universities is the claim—advanced by some conservative academics, including Jonathan Haidt and Jordan Peterson—that universities are ideological “cesspools” in which naïve students are indoctrinated by hard-left professors.

This image of universities is both inaccurate and unscientific. For example, modern genetics has shown that humans are one species with a single, shared gene pool, not distinct biological races that can be ranked by skin color. This is not “woke ideology”; it is a straightforward empirical fact that only conflicts with racist belief systems.

Critics often argue that universities are repeating historical mistakes by ignoring science in order to impose liberal or radical-left values on campus. But what, concretely, are these alleged policies? Following the murder of George Floyd, many North American universities examined whether systemic racism contributes to a hostile climate for Black students or whether hiring practices unfairly favor applicants from privileged backgrounds. For example, universities may prefer a White applicant from Harvard whose parents also attended Harvard over a Black applicant from Michigan State University—despite comparable or superior qualifications.

Whether such policies reduce inequality or create new inequalities is an important and difficult empirical question. However, the underlying goal of diversity, equity, and inclusion programs—to promote fairness and equal protection—is grounded in the 14th Amendment of the U.S. Constitution. Efforts to bring social outcomes more in line with these principles are not radical; they are consistent with constitutional ideals and basic human rights. Opposition often aligns with existing power and status hierarchies rather than with empirical evidence.

It is understandable that politically conservative professors may feel out of place in departments where most colleagues are liberal. But the same is true for female police officers or Black lawyers in elite law firms. Ironically, DEI initiatives could also benefit politically conservative academics by ensuring that universities foster inclusive environments and avoid discrimination based on political orientation. In practice, this is rarely a major problem. Most professors interact with colleagues infrequently outside formal meetings, and promotions depend far more on student evaluations, publications, and grant funding than on political views.

Concerns about ideological repression are often fueled by highly visible but rare cases. Data from the Foundation for Individual Rights and Expression (FIRE, 2023) show that sanction campaigns against scholars originate from both the political left (about 52%) and the political right (about 41%), and that most cases do not result in formal discipline. When sanctions do occur, universities typically cite violations of institutional policies or professional standards. Since early 2025, however, campus politics have become more volatile. In the aftermath of the killing of conservative activist Charlie Kirk, several universities removed or suspended faculty and staff over controversial social media posts (Inside Higher Education, September 19, 2025). Similar controversies have been reported in Canada as well (RIndex, 2025).

Debates about universities and politics also ignore a crucial body of scientific evidence concerning political orientation itself. Research in behavioral genetics and personality psychology shows that political orientation is surprisingly trait-like—closer to being an extravert or introvert than to preferring Pepsi over Coke (Hatemi, 2010). Like personality traits, political orientation has a heritable component and shows substantial stability across adulthood. This stability helps explain why political campaigns spend billions of dollars targeting a small number of swing voters while most citizens vote consistently over time.

Another widespread misconception is that parents exert a strong and lasting influence on their adult children’s political views. Parents do influence political attitudes during childhood and adolescence, but this influence declines sharply in early adulthood (Hatemi, 2009). By adulthood, similarity between parents and their children is explained largely by genetic similarity rather than by parental socialization (Hatemi, 2010). This helps explain why political disagreements within families are common—and why Thanksgiving dinner conversations so often avoid politics.

The most important conclusion from this research is that adolescents are not blank hard drives waiting to be programmed by parents or professors. Adolescence and early adulthood are periods of exploration in which individuals actively gravitate toward ideas that fit their underlying dispositions. Students may encounter certain arguments or perspectives for the first time at universities, but they choose how to interpret and integrate them. Exposure is not indoctrination.

Longitudinal studies of university students support this conclusion. There is little evidence that conservative students enter university and reliably graduate as “flaming liberals” (Mariani & Hewitt, 2006). Where changes in political attitudes do occur, they are typically modest and better explained by self-selection, maturation, and peer sorting than by classroom instruction.

So why does the belief in widespread university indoctrination persist? One explanation lies in a common cognitive error: people often infer causation from temporal coincidence. When parents observe that their child goes to university and later adopts different political views, it is tempting to assume that university caused the change. Yet similar changes would often have occurred anyway, regardless of whether the student attended a secular university, a religious institution, or none at all.

In conclusion, universities create and transmit scientific knowledge. Societies that invest in science and higher education tend to produce citizens who are healthier and live longer lives. Scientific inquiry can challenge traditional beliefs that are not grounded in evidence, and this tension is unavoidable in knowledge-based societies. The solution is not to vilify universities, but to recognize that diversity of viewpoints is inevitable—and valuable. Creating learning environments that benefit all students while tolerating disagreement is central to the mission of universities. Anyone who genuinely cares about students’ learning and wellbeing should support efforts to promote diversity, equity, and inclusion. This includes tolerating different political viewpoints—but tolerance cannot extend to intolerance, racism, sexism, or ideologies that deny equal rights or basic human dignity.

Personality Science 2025: About the Author

Science is often described as objective. Given the same evidence, anyone should reach the same conclusion. In reality, things are more complicated. Even in the most rigorous sciences, researchers’ perspectives influence how they interpret evidence. This influence is even stronger in the social sciences. Psychologists, for example, cannot fully set aside their personal views when designing studies, interpreting findings, or writing textbooks. That is why it may help for you to know a little about the author of this book. 

This textbook explores fundamental questions about human nature:

  • How much are people alike, and how much do they differ?
  • To what extent is behavior influenced by situations (social norms, conformity) versus personality (values, dispositions)?
  • How much of personality is shaped by nature (genes) and how much by nurture (culture, socialization, parenting)?

Psychologists disagree about the answers to these questions. Biologically oriented psychologists emphasize evolution and genetics. Developmental psychologists highlight parenting. Social psychologists stress the power of situations. These perspectives are sometimes called paradigms. A paradigm is like a research culture with its own fundamental beliefs and research practices. Each perspective adds valuable insights, but paradigms also create blind spots and biases.

Behaviorism is a good example. Behaviorism denied the existence of personality traits. Everybody was just the product of a different reinforcement schedule. It also ruled out the study of emotions and forbade self-reports. For this reason, research on personality and emotions with self-report measures only emerged in the 1980s, when the behavioristic paradigm lost its influence. I would not be a psychologist if behaviorism had lasted another couple of decades. Instead, I attended a conference in 1990, where Skinner gave his last speech to a large audience and only a handful of psychologists clapped when he criticized cognitivism. The behavioristic paradigm was dead. At another conference, an older psychologist described himself as a prisoner of the behavioristic paradigm. That phrase stuck with me. I did not want to look back at my career and realize that I had been a prisoner. This does not mean I am without biases, but it does mean that I am not trying to sell you the personality paradigm, which has many limitations that card-carrying personality psychologists like to ignore.

The Origin of My Perspective

My journey began in 1966, in a small town in northern (West) Germany. Too young for the student revolutions of the late 1960s, I nevertheless grew up in their aftermath, surrounded by cultural shifts that reshaped much of the Western world. I was raised in a comfortable middle-class family, with a natural affinity for math and a growing interest in social issues. Once I discovered that psychology was a science—not just speculation about dreams—I knew it was the right field for me.

In 1988, I moved to West Berlin to study psychology, just one year before the fall of the Berlin Wall—an event that profoundly shaped my worldview and my appreciation of free societies. My early academic interests were in emotion research. I studied with Professor Rainer Reisenzein, who introduced me to theories of emotion, and with Professor Hubert Feger, who focused on measurement and group processes. At that stage, personality psychology did not appeal to me. The field was dominated by grand theories, such as Freud’s, that seemed disconnected from evidence. Other approaches emphasized genetics and biology in ways that, to me, echoed the dark history of Nazi eugenics. As a young student, I rejected this line of thought. 

In 1996, I began my dissertation research on how people recall their emotions: How do you know how happy you were last month, and how accurate is that judgment? That same year, I received a scholarship to study with Ed Diener at the University of Illinois, one of the leading figures in happiness research. Working with him and his students was an extraordinary experience. After defending my dissertation in 1997, I was fortunate to secure a two-year fellowship from the German Science Foundation (DFG), which allowed me to continue working with Ed Diener in Illinois. My focus shifted from emotions to personality and well-being: Why do some people consistently experience more positive and fewer negative emotions than others? Why are some people happier? Over time, my perspective expanded. Feeling good is important, but it is not the whole story. A full picture of well-being requires asking people how satisfied they are with their lives overall. Life satisfaction became the central theme of my research, and Chapter 14 of this book summarizes some key findings in this area. 

Since 2000, I have been a faculty member at the University of Toronto Mississauga, a unique campus that reflects the cultural diversity of Toronto. Most of my research focused on happiness (subjective well-being), but since 2011, I have been examining the research practices of psychologists. This work was motivated by increasing awareness that many results in psychology journals that end up in textbooks are not replicable. This scientific study of scientists’ behavior is called meta-science or meta-psychology. With Rickard Carlsson in Sweden, I co-founded a journal with the title “Meta-Psychology.” My awareness of the replication crisis helped me to select only credible results for this textbook. Another benefit for students is that this makes the book a lot shorter, because some research areas have few replicable findings. For example, we still know very little about the neurological differences between people that shape their personalities.

Writing this textbook as an active researcher comes with both strengths and weaknesses. On the one hand, I can bring you closer to the science itself—critiquing studies, highlighting controversies, and even sharing my own analyses. On the other hand, professional textbook writers are often more skilled at producing polished narratives. The problem with polished narratives, however, is that they often gloss over controversies and discourage critical thinking. They present findings as if they were unshakable facts. In reality, personality psychology is an emerging science, barely 50 years old, and many findings rest on shaky foundations. The aim of this book goes deeper. It introduces students to scientific thinking, critical evaluation of empirical findings, and quantitative reasoning about personality. That is why the word science appears in the title. I will make a clear distinction between empirical facts (e.g., monozygotic twins are more similar than dizygotic twins for most traits) and inferences or implications (e.g., genetic differences cause personality differences). Facts should not be denied. Inferences can and should be questioned.

As I said before, I did not want to believe in genetic differences, but the evidence became impossible to ignore. Rather than resisting it, I learned to see it differently. Genetic differences do not mean that some people are born with better genes. They mean people are different—and good societies allow everyone to be who they are. Genetic variation is a strength. This principle is true in human evolution and in human societies. Understanding differences, and understanding people who differ from us, is essential for modern life.

The scientific study of personality can also help people avoid chasing unrealistic goals rooted in social norms of perfection. Instead, we can learn to accept ourselves and become our best unique selves. This non-judgmental approach aligns with science’s aim to be objective. Whether there are truly bad, evil, or pathological personalities is a difficult question, but psychology’s history shows how dangerous it can be to label some variations as pathological. Only 50 years ago, homosexuality was considered a disorder. Today, it is accepted as a normal variation in human sexuality.

Finally, I must mention political orientation. Like sexual orientation, it has some genetic roots. Some people are drawn to familiar, traditional values; others to different cultures and new ways of living. Universities are often criticized as leftist and “woke,” accused of indoctrinating students. In reality, students’ political beliefs are largely established before they enter the classroom, and professors have little power to change them. Moreover, many conservative critiques ignore the fact that some conservative ideas are directly opposed to science. At the University of Padua, where Galileo taught, it took the Catholic Church more than three centuries to accept that the Earth revolves around the Sun.

The conflict between traditional values and science is especially sharp in psychology. Psychological science is still concentrated in a handful of mostly secular countries in Western Europe, North America, and East Asia. In the United States, science is currently under attack by right-wing conservatives. Learning about psychology as a science will expose students to progressive ideas that challenge traditional beliefs about human nature, sexuality, gender, and race. At the same time, most topics in psychology are not political, and personality psychology is less politically charged than social psychology. As you will see in Chapter 1, however, personality psychology does have its own dark history—one that is important to confront as we move forward.

A Multiverse Analysis of Regional Implicit Bias: Implicit 1 : 13 Explicit

 

Snyder, J. S., & Henry, P. J. (2023). Regional Measures of Sexual-Orientation Bias Predict Where Same-Gender Couples Live. Psychological Science, 34(7), 794–808. https://doi.org/10.1177/09567976231173903

Multiverse Analysis
OSF | A Multiverse Analysis of Snyder and Henry (2023) “Regional Measures of Sexual-Orientation Bias”

Summary

Snyder and Henry (2023) argue that county-level aggregation of IAT scores yields a reliable regional measure of anti-LGB bias that predicts where same-gender couples live, and they highlight a key adjusted regression (Table 3, Column 3) in which the implicit measure appears to outperform a single-item explicit measure. While aggregation can reduce random error, it does not by itself establish that IAT scores capture a distinct implicit construct; aggregation also stabilizes systematic method variance and sampling artifacts, and regional differences in self-presentation could affect explicit reports.

A reanalysis using a multiverse framework shows that the “implicit > explicit” contrast is highly model-dependent. In simple associations, implicit and explicit measures show similar relationships with the outcome. Across 42 reasonable specifications that vary outcome handling (raw, log-transformed, count model), weighting (with/without), and covariate inclusion (none, single covariates, full published set), only the published specification yields a statistically significant advantage for the IAT, while multiple alternatives yield either no difference or a statistically significant advantage for the explicit measure. The main conclusion is that the paper’s headline inference—implicit bias is a stronger predictor than explicit bias—is not robust to reasonable analytic choices and should be interpreted more cautiously.

Full Article

 

 

This article asks whether regional measures of sexual-orientation bias predict where same-gender couples live. The central claim is that county-level implicit bias provides predictive value beyond explicit measures, and that this pattern remains when adjusting for a set of county-level covariates and region indicators.

The key evidence is a regression framework in which the outcome is a county-level measure of same-gender couple prevalence (and/or counts), with implicit and explicit bias entered jointly, and then a “full” specification that adds a covariate set (policy environment, religion, education, income, political orientation, rurality, and census region). They interpret the implicit coefficient as the stronger (or uniquely informative) predictor in the adjusted model.

They motivate covariates as adjustments for structural and cultural factors that could correlate with both attitudes and residential patterns of same-gender couples. They treat the adjusted model as closer to the causal quantity of interest: the association between bias and couple locations net of these background county characteristics.

What do IATs Measure?

Researchers disagree about what IAT scores mean. The early interpretation was that IATs capture evaluative associations that are at least partly outside conscious awareness. Low correlations between IAT scores and self-reported attitudes were often taken as support for this view. This interpretation remains common, but a growing literature challenges it.

At the individual level, IAT scores correlate only modestly with other indirect measures and with behavior, suggesting that a substantial share of variance reflects random noise and systematic method variance rather than unique, construct-valid “implicit” content. One alternative view is that IATs are indirect, error-prone measures of largely the same evaluative attitudes people can report in questionnaires, with differences between methods driven in part by measurement artifacts rather than distinct underlying constructs.

Snyder and Henry (2023) adopt a related but distinct argument at the regional level. They propose that aggregation of individual IAT scores to the county level reduces random error and yields a more reliable measure of the local “implicit climate,” which can then predict county-level outcomes. This logic is reasonable as far as reliability is concerned. However, improved reliability is not the same as improved discriminant validity.

Aggregation reduces random noise, but it also stabilizes systematic components of the measure that may vary across counties (e.g., platform- and sampling-related artifacts, regional differences in who takes the test, and other method-specific influences). The same concern applies to self-reports. Social desirability and self-presentation may differ across regions, which could attenuate the implicit–explicit correlation even if the two methods track a single underlying construct.

In the present data, the county-level correlation between the IAT measure and a single-item explicit measure is reported as r = .60. This is substantial shared variance, but it still leaves considerable unique variance in each measure. With only two methods, it is difficult to draw strong conclusions about what that unique variance represents. In particular, it is possible for two imperfect measures of the same construct to show different “unique” predictive power in regression models when systematic measurement error and correlated predictors are present. Conversely, if one measure fails to predict the outcome across reasonable model specifications, that would cast doubt on claims that it contains unique valid information about the criterion.
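Another toy example (again hypothetical, not the paper’s data) illustrates how two imperfect measures of a single construct can show very different “unique” regression weights. The outcome depends only on the shared latent variable, yet the less noisy proxy absorbs most of the weight when both are entered jointly.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 3000
latent = rng.normal(size=n)
measure_a = latent + rng.normal(scale=0.5, size=n)  # less measurement error
measure_b = latent + rng.normal(scale=1.0, size=n)  # more measurement error
outcome = latent + rng.normal(size=n)               # driven only by the shared construct

X = sm.add_constant(np.column_stack([measure_a, measure_b]))
fit = sm.OLS(outcome, X).fit()
print(fit.params)  # measure_a gets a much larger "unique" weight than measure_b
```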

A further question is why one measure would be a stronger predictor than the other. One possibility is validity: the more valid measure should predict relevant outcomes more strongly. Another possibility is model dependence: when two predictors are highly correlated and both contain measurement error, small analytic choices (model form, covariate set, weighting) can shift the apparent “winner” without reflecting a stable underlying difference.

It is sometimes argued that indirect measures should outperform self-report on socially sensitive topics because self-reports are suppressed by social desirability. Yet evidence for this claim is mixed; in many anonymous contexts, people report socially sensitive attitudes with little apparent inhibition, and self-reports often predict behavior at least as well as IAT scores at the individual level. The key point for the present analysis is that differential predictive power does not, by itself, establish that IAT variance is uniquely “implicit.”

The paper’s central result for the “implicit beats explicit” claim appears in Table 3, Column 3. In that specification, the implicit measure shows a stronger negative association with the outcome than the explicit measure when both are entered together and additional county covariates are included. The authors interpret this as evidence that the aggregated IAT captures something distinct and more predictive than the explicit measure.

However, the corresponding zero-order correlations in Table 1 are comparatively balanced: the implicit and explicit measures show similar correlations with the outcome. This suggests that the divergence in Table 3 is driven by the particular multivariable specification—especially the inclusion of several covariates that are themselves strongly related to both attitudes and the outcome (e.g., political conservatism, rurality, and religiosity).

One way to address sensitivity to analytic choices is to provide a strong theoretical rationale for a specific model and, ideally, preregister it. Another is to examine robustness across a transparent set of reasonable alternatives. To that end, I conducted a multiverse analysis (MVA) that focuses on the robustness of the “implicit vs explicit” contrast.

The multiverse takes into account that the outcome is highly skewed and that the authors weighted counties by the number of IAT observations. Accordingly, models were estimated using (a) the raw outcome (as in the paper), (b) a log-transformed outcome, and (c) a count-model approach. Each model was estimated with and without weights. Finally, models were estimated with no covariates, with covariates entered individually, and with the full covariate set used in Table 3. Crossing these choices produced 42 specifications. For each specification, I computed and tested the difference between the implicit and explicit coefficients, as sketched below.
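The sketch below outlines the logic of this loop in Python with statsmodels. It is a simplified stand-in, not my exact implementation: the column names are hypothetical, only three covariate options are shown as placeholders (crossing the full set of options yields the 42 specifications described above), and the count model is a simple Poisson substitute.

```python
# Simplified sketch of the multiverse loop; df is the county-level data frame
# (e.g., the hypothetical frame from the earlier sketch).
import itertools
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

df["n_iat"] = np.random.default_rng(3).integers(20, 2000, size=len(df))  # hypothetical weight column

covariate_sets = {
    "none": [],
    "conservatism": ["conservatism"],
    "full": ["policy", "religiosity", "education", "income",
             "conservatism", "rural", "C(region)"],
}

results = []
for outcome, covs, weighted in itertools.product(
        ["raw", "log", "count"], covariate_sets, [False, True]):
    # implicit and explicit are assumed to be on comparable (standardized) scales
    rhs = " + ".join(["implicit", "explicit"] + covariate_sets[covs])
    w = df["n_iat"] if weighted else np.ones(len(df))
    if outcome == "raw":
        fit = smf.wls(f"couples ~ {rhs}", data=df, weights=w).fit()
    elif outcome == "log":
        fit = smf.wls(f"np.log(couples + 1) ~ {rhs}", data=df, weights=w).fit()
    else:  # simple Poisson stand-in for the count-model approach
        fit = smf.glm(f"couples ~ {rhs}", data=df,
                      family=sm.families.Poisson(), freq_weights=w).fit()
    test = fit.t_test("implicit - explicit = 0")  # Wald test of the coefficient difference
    results.append((outcome, covs, weighted,
                    float(np.squeeze(test.effect)), float(np.squeeze(test.pvalue))))
```

With the actual data in place of the hypothetical columns, a simple tally over results (how many specifications yield p < .05 in each direction) gives the summary reported in the next paragraphs.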

The results indicate substantial model dependence. Only one specification produced a statistically significant “implicit > explicit” contrast—namely the specification corresponding to Table 3, Column 3. In contrast, 13 specifications produced a statistically significant advantage for the explicit measure over the implicit measure, and the remaining specifications were non-significant. In other words, the published pattern is atypical in the multiverse: modest changes to modeling decisions (e.g., outcome transformation or omitting weights) eliminate the reported contrast, and in a nontrivial subset of specifications the sign of the contrast reverses.

These findings illustrate the value of robustness checks in complex observational analyses. The Open Data badge made it possible to evaluate the sensitivity of the headline claim to reasonable analytic choices. The key conclusion from the MVA is not that the focal association is absent, but that the specific inference that “implicit bias is a stronger predictor than explicit bias” is not robust to alternative, defensible specifications.

At minimum, the results warrant a narrower interpretation: the data show that both regional measures relate to the outcome, but the direction and significance of the implicit–explicit contrast depend strongly on modeling decisions. A cautious reading is therefore that the evidence does not uniquely support the claim that the IAT measures a distinct “implicit” construct that outperforms explicit self-report at the county level.