Dr. Ulrich Schimmack Blogs about Replicability

"For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication" (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the first study using the same sample size and significance criterion (Schimmack, 2017).

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or "incredibility" of published research, my colleagues and I developed several statistical tools such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science. 

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020).  An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017).  The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021).  I also started providing information about the replicability of individual researchers and provide guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021).  That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in my story how I ended up becoming a meta-critic of psychological science, you can read it here (my journey). 

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1-22
https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566
http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. 
https://doi.org/10.1037/cap0000246


The Time for AI-Generated Peer-Reviews is Now

I should have done this before I submitted a ms. for publication, but better late than never. After waiting for a couple of months, I received a rejection for a ms. on the relationship between personality and life-satisfaction.

The rejection decision was based on a form letter from the action editor, without any acknowledgement that they had read the paper or any comments on the main point of the ms., and two reviews.


One review was by the author whose data I reanalyzed, showing that the published results are not supported by my reanalyses. You can guess how favorable that review was, or you can read it below. The other review made no substantial comments, probably because the reviewer is not familiar with structural equation modeling and unable to read the MPLUS output files that were shared on OSF. Anyhow, these reviews provided no helpful feedback, but they served the purpose of allowing the editor to reject the ms.

Unhappy, or rather frustrated, I decided to ask ChatGPT for a CRITICAL review of the manuscript and just pasted the manuscript in the dialogue box. Less than a minute later, I had an objective review that showed understanding of the issue, acknowledged strengths, and pointed out several limitations that can be used to strengthen the manuscript. Wow. This is a game changer. Not only can editors save time and get valuable reviews; authors can get the feedback before they submit a ms. and improve it. We all knew peer review is shit because it is driven by nepotism, defensiveness, motivated bias, incompetence, and lack of time and motivation to do a good job. I have written my fair share of terse reviews without much care for the feelings of the authors. ChatGPT doesn’t have a dog in the fight. It may not be ready to replace peer review, but it surely can elevate it.

ChatGPT Review

Critical Peer Review of “Personality and Life-Satisfaction: A Hierarchical, Multi-Rater Perspective”

Title and Abstract

The manuscript addresses a critical and enduring question in psychology: the relationship between personality traits and life satisfaction. The title effectively conveys the scope of the research, and the abstract provides a concise overview of the methodology, results, and implications. However, the abstract could better highlight the novelty of the hierarchical factor analysis approach used, as this seems to be a significant contribution of the study.

Strengths of the Manuscript

  1. Methodological Rigor: The use of hierarchical confirmatory factor analysis to separate variance components represents a significant methodological strength. This approach allows the study to address measurement errors and identify unique contributions of specific traits.
  2. Multi-Rater Datasets: Incorporating multiple datasets with self-ratings and informant ratings enhances the reliability and generalizability of the findings. The inclusion of domain satisfaction measures in four of the datasets is also commendable.
  3. Findings on Neuroticism: The consistent identification of the depressiveness facet of Neuroticism as a primary predictor of life satisfaction is a valuable contribution to the literature. The mediation effect of domain satisfaction provides novel insights into the mechanisms underlying this relationship.
  4. Contextual Moderation: The manuscript explores variations across populations (e.g., students vs. parents) and cultural contexts, which adds depth to the findings.

Weaknesses and Areas for Improvement

  1. Theoretical Framing:
    • While the manuscript builds on Costa and McCrae’s hierarchical model, the discussion of how this model advances or challenges existing theories of personality and subjective well-being (SWB) could be expanded. For example, the implications of the findings for alternative models, such as dynamic equilibrium theory, are not sufficiently explored.
    • The manuscript heavily focuses on Neuroticism and Extraversion. It briefly mentions other Big Five traits (e.g., Openness) but does not delve into their potential roles or reasons for their weaker effects. This omission may leave readers with an incomplete picture.
  2. Facet-Level Analysis:
    • While the study underscores the importance of the depressiveness facet of Neuroticism, the results for Extraversion and Cheerfulness are inconsistent. The manuscript could include a more thorough discussion of why these inconsistencies arise and what they suggest about the role of positive affect in life satisfaction.
    • There is little exploration of why specific traits (beyond Depressiveness and Cheerfulness) fail to consistently predict life satisfaction. Providing more nuanced interpretations could enrich the discussion.
  3. Domain Satisfaction and Top-Down Effects:
    • The findings on domain satisfaction are compelling but underexplored. The “top-down” influence of personality traits on domain evaluations is mentioned, but the mechanisms underlying this influence remain vague. For instance, how does Depressiveness lead to negative evaluations of specific domains? This area warrants further elaboration.
    • The manuscript could provide more detail on the “bottom-up” integration of domain satisfaction into global life satisfaction judgments. Are certain domains (e.g., relationships, financial stability) consistently more influential across samples?
  4. Methodological Concerns:
    • Measurement of Life Satisfaction: The use of a single-item measure of life satisfaction in some datasets raises validity concerns, as acknowledged by the authors. A more detailed justification for this choice and a discussion of its potential limitations are needed.
    • Sample Representativeness: The reliance on student samples in some studies limits generalizability. While this limitation is noted, it could be addressed more explicitly in the discussion.
    • Reliance on Multi-Rater Data: The manuscript assumes that variance shared among raters reflects valid variance. However, it does not adequately address potential biases introduced by informant ratings, such as social desirability effects or shared environmental influences.
  5. Presentation and Accessibility:
    • The manuscript’s extensive use of technical language and complex statistical concepts may limit accessibility for non-specialist readers. Including simplified explanations or visual aids (e.g., diagrams of hierarchical models) would enhance readability.
    • The tables summarizing results are dense and lack accompanying narrative summaries that highlight key patterns. For example, Table 1 and Table 2 could benefit from clearer labels and textual integration into the results section.

Suggestions for Future Research

  • The manuscript highlights the need to identify additional predictors of life satisfaction beyond Depressiveness and Cheerfulness. Future studies could explore traits or facets not captured by traditional Big Five measures, such as self-compassion or grit.
  • Longitudinal studies that disentangle the causal relationships between personality, domain satisfaction, and life satisfaction would address limitations of the current cross-sectional datasets.
  • Cross-cultural studies that examine the moderating effects of societal factors (e.g., individualism vs. collectivism) on the personality-life satisfaction relationship could provide valuable insights.

Conclusion

The manuscript makes a meaningful contribution to the literature on personality and subjective well-being by quantifying the effects of personality traits and exploring their mediation through domain satisfaction. However, the study’s potential impact could be enhanced by addressing the theoretical, methodological, and presentation-related issues outlined above. Overall, the research is robust, and its findings are valuable, but greater clarity and depth in certain areas would strengthen its contribution to the field.

Overall Recommendation: Revise and resubmit, addressing the identified weaknesses and incorporating the suggested improvements to enhance the manuscript’s clarity, accessibility, and theoretical depth.

Questionable Reviewer Practices: Dishonest Simulations

In an ideal world, humans would curb self-interest for the greater good. In reality, psychology has shown that human information processing is riddled with self-serving biases. A great achievement of human cultural evolution is the development of tools that can reduce these biases such as logical thinking and objective empirical observations. Since the invention of computers, it has also become easier to use simulations to put intuition to a test. However, motivated self-interest can bias simulations to produce desirable outcomes. For example, Uri Simonsohn made unreasonable assumptions to claim that p-curve performs well even with heterogeneity in power (Schimmack, 2018). It does not (Brunner,

An anonymous reviewer, henceforth known as Reviewer A (which may stand for anonymous or another word starting with A), recently accused us of also using biased simulations to support the (in their view, false) claim that z-curve provides useful estimates of power with good coverage of confidence intervals (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022; Schimmack & Bartos, 2023). The same reviewer previously made numerous false claims about z-curve and estimation of true power that we addressed elsewhere (Brunner, 2024). In our response to Reviewer A’s earlier comments, we also challenged them to provide a simulation that shows when z-curve breaks down.

Reviewer A was able to do so. The question is how they did it and whether Reviewer A’s results challenge our simulation results. Here is Reviewer A’s simulation.

To show that z-curve breaks down without hodgepodge heterogeneity, let us consider a situation of unconditional power of value .25. We have a one-sample Cohen’s d population value ranging from 0.2 to 2.2 (by increments of .05 to end up with 40 values), that is accompanied by sample size ranging from 167 to 2 that is calculated to be associated with a power of .25.

I generated data from each combination of Cohen’s d and sample size and fit a paired-samples t-test to obtain 40 p-values. These p-values are associated with a (“unconditional”) power value of .25. The expected discovery rate should be .25 (which is the power associated with the design of these observed results). The output I obtain from the zcurve package for the 40 estimated p-values is 0.05 – very far from the true value of .25.

To be clear, an estimate of 5% power (no evidence against the null-hypothesis) when the true power is 25% is horrible. So, we need to examine the conditions that lead to this horrible outcome. Here is Reviewer A’s code.

R code
library(pwr)    # added: required for pwr.t.test
library(zcurve) # added: required for zcurve

d <- seq(.2, 2.2, by = 0.05) # range of population Cohen's d values
ssize <- matrix(0, ncol = 40, nrow = 1) # placeholder for sample sizes
pow <- .25 # change this to change level of true power
for (i in 1:40) { # obtain sample size associated with Cohen's d value for chosen level of power
  ssize[i] <- pwr.t.test(n = NULL, d = d[i], sig.level = .05, power = pow,
                         type = "one.sample", alternative = "two.sided")$n
}
ssize_ <- round(ssize, digits = 0) # round the values
pp <- matrix(0, ncol = 40, nrow = 1) # placeholder for the p-values
# let's generate data and collect p-values
for (i in 1:40) {
  dat <- rnorm(n = ssize_[i], mean = d[i], sd = 1)
  pp[i] <- t.test(dat, paired = FALSE, alternative = "two.sided")$p.value
}
# let's use zcurve to see whether EDR reproduces the specified power value
zcurve(p = as.vector(pp))

Let’s first address an annoying side issue in this simulation. The simulation is based on 40 simulated studies or test results. With power of 25% only about 10 of those are expected to be significant and useful for a z-curve analysis to estimate power. We have warned that z-curve estimates with small k (k = 10; 10 p-values below .05) are too variable to be meaningful. They also have wide confidence intervals that Reviewer A does not bother to report. However, this is a side-issue. We can simply increase the number of tests from 40 to 100,000 and see the large-sample bias in z-curve estimates. This confirms that z-curve severely underestimates power in this simulation. So, let’s take a closer look at the scenario that is being simulated.
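To make this concrete, here is a rough sketch (my own code, not Reviewer A's) of the same scenario with roughly 100,000 simulated tests, so that the z-curve estimate reflects systematic bias rather than sampling error. It simply replicates Reviewer A's grid of effect sizes many times.

library(pwr)
library(zcurve)

set.seed(1)
d_grid <- seq(0.2, 2.2, by = 0.05)   # Reviewer A's grid of effect sizes
n_grid <- sapply(d_grid, function(d) round(pwr.t.test(d = d, power = .25, sig.level = .05,
                                                      type = "one.sample")$n))
reps <- 2500                         # about 100,000 studies in total
p <- unlist(lapply(seq_along(d_grid), function(i) {
  replicate(reps, t.test(rnorm(n_grid[i], mean = d_grid[i], sd = 1))$p.value)
}))
mean(p < .05)                        # empirical discovery rate, close to .25
# fitting can take a while with this many p-values; fewer bootstrap samples keep it manageable
summary(zcurve(p = p, bootstrap = 100))   # compare the EDR estimate to .25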

The simulation starts with effect sizes ranging from small (d = .2) to effect sizes that are very large and rarely observed in real studies (d = 2.2). It is well known that power increases with larger effect sizes. Thus, to maintain low power of 25%, we have to reduce sample sizes. The sample sizes implied by this simulation are as follows.

N    frequency   percentage
2        11        26.8%
3        13        31.7%
4         4         9.8%
5         3         7.3%
6         1         2.4%
7         2         4.9%
9         1         2.4%
10        1         2.4%
12        1         2.4%
15        1         2.4%
20        1         2.4%
28        1         2.4%
43        1         2.4%

These results show that only 10% of studies had sample sizes of 15 or more participants and over half of the simulated studies had 2 or 3 participants. As the simulation focused on a one-sample t-test, a study with N = 2 has 1 degree of freedom. This important information was hidden from the editor who is supposed to make decisions based on peer-reviewers’ comments.

Reviewer A could have simply simulated studies with sample sizes of 2 or 3 participants to show that z-curve does not work well with these sample sizes, but maybe the editor would have noticed that this is not a reasonable assumption because most studies have more than 3 participants. In fact, most studies have more than 20 participants. So, the only plausible reason to simulate effect sizes, when sample size is the driving factor, is to hide explicit information about sample sizes from readers who do not understand t-distributions very well.

That being said, it is interesting to examine whether small sample sizes that are actually found in research articles still bias z-curve estimates. For example, John Bargh’s infamous elderly priming study that could not be replicated had only n = 15 participants in the control and experimental group for a total of N = 30 and 28 degrees of freedom. Would z-curve estimates underestimate power with these small sample sizes? Before I can present the results, it is important to point out that z-curve provides two estimates of power. One is the estimated power (including power of 5% for studies where the null-hypothesis is true) of all studies that were conducted, whether they produced significant or non-significant results. This is called the expected discovery rate. The second estimate is the power of the subset of studies that produced a significant result (including false positive results with power of 5%, if alpha is set to .05). This is called the expected replication rate because it predicts how many significant results would be obtained if only the studies with significant results were replicated exactly, including the original sample sizes. When power is fixed, the true EDR and ERR are the same, but estimates and biases can differ because estimating the EDR is more difficult.
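A toy simulation may help to keep the two quantities apart. This is my own illustration, not code from the z-curve papers; it only shows that when power varies across studies, the significant subset has higher average true power than the full set of studies, so the true ERR exceeds the true EDR.

set.seed(2)
power_all <- runif(1e6, .05, .95)               # true power of every conducted study
significant <- rbinom(1e6, 1, power_all) == 1   # which studies produced a significant result
c(EDR_true = mean(power_all),                   # mean power of all studies
  ERR_true = mean(power_all[significant]))      # mean power of the significant studies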

We can use Reviewer A’s approach to determine the effect size that is needed in a simple between-group study with n = 15 per group (Bargh also had a covariate, which increases power, but that is not relevant here). I simulated a power of 50%.

# Note: pwr uses the n of a single group; with n = 15 and type = "two.sample", the total N = 30 and df = 28
library(pwr)
d <- pwr.t.test(n = 15, d = NULL, sig.level = .05, power = .50,
                type = "two.sample", alternative = "two.sided")$d

With N = 30 total sample size we need a large effect size of d = .74 to have 50% power.
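Here is a minimal sketch (my reconstruction, not the code used for the reported results) of this scenario: two independent groups of n = 15 and a true effect of d = .74, which should produce significant results in about half of the simulated studies.

library(pwr)

set.seed(3)
d <- pwr.t.test(n = 15, power = .50, sig.level = .05, type = "two.sample")$d  # about .74
k <- 100000
tvals <- replicate(k, t.test(rnorm(15, mean = d), rnorm(15), var.equal = TRUE)$statistic)  # df = 28
pvals <- 2 * pt(abs(tvals), df = 28, lower.tail = FALSE)
mean(pvals < .05)   # empirical rejection rate, close to .50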

Figure 1 shows the t-distribution from which test statistics of individual studies are sampled (green). It also shows the standard normal distribution that is implied by 50% power (black). Visual inspection shows that the two distributions are similar but not identical.

The approximation of the asymmetrical non-central t-distribution with the standard normal distribution introduces some bias in z-curve estimates. For this example, the true power of 50% is underestimated by 4 percentage points. More problematic is that the EDR is underestimated by 13 percentage points. This finding suggests that estimates of the EDR and the FDR, which is simply a transformation of the EDR, are biased in sets of studies with small sample sizes (N < 30), even if we disregard silly sample sizes of N = 2.


Is there a solution to this problem? Indeed there is one, and maybe we should have thought about it before, but as they say, “better late than never.” There is an alternative approach to ‘convert’ t-values into z-scores (or F-values with df = 1, where t = √F; that is, t is the square root of the F-value and F-values are simply squared t-values). The alternative is to simply use the t-value as an estimate of the z-score.


ChatGPT: How to convert t-values into z-scores?

Find the cumulative probability of the t-value: Use a t-distribution table or statistical software to find the p-value associated with the t-value given your specific degrees of freedom. This p-value represents the area under the curve to the left of your t-value (for a one-sided test) or half the area in a two-tailed test. Once you have the p-value, use the inverse of the standard normal distribution (Φ⁻¹) to find the corresponding z-score.

Alternative Formula (if df > 30): For degrees of freedom over 30, the t-distribution closely resembles the normal distribution, so you can directly approximate the z-score by using the t-value.
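To make the two options concrete, here is a small R illustration (mine, not code from the z-curve package) for a single t-value with 28 degrees of freedom.

t_obs <- 2.5
df <- 28

# (a) transformation approach: t -> two-sided p -> z via the inverse normal CDF
p <- 2 * pt(abs(t_obs), df = df, lower.tail = FALSE)
z_from_p <- qnorm(1 - p / 2)

# (b) direct approach: use the t-value itself as the z-score
z_direct <- abs(t_obs)

round(c(p = p, z_from_p = z_from_p, z_direct = z_direct), 3)
# the transformed z (about 2.35) is smaller than the raw t of 2.50 because the
# t-distribution with 28 df has heavier tails than the standard normal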

It is clear that this approach will lead to overestimation of power (EDR, ERR) because uncorrected t-values with small degrees of freedom are always larger than the corresponding z-scores. The question is how big this bias is when we use this approach to conduct z-curve analyses. Here are the results for the same data, but the input are the uncorrected t-values.

The results for the ERR are as expected: the true power is overestimated. Interestingly, the bias is as strong as for the transformation approach, but in the opposite direction. This suggests a possible way to quantify the amount of bias: use both approaches and take half of the difference between the estimates as an estimate of the amount of bias. The results for the EDR are a bit more surprising. There is no bias in the estimate. The reason is that the bias is introduced by the wide tail of the t-distribution, and this tail has a weaker effect on the estimate of power for all tests. This is a very encouraging finding and suggests that it is preferable to use this approach when submitting t-values to z-curve.

Large sample bias is often hard to detect when the set of studies is small. Moreover, confidence intervals of z-curve estimates are adjusted to allow for small systematic biases. It is therefore interesting to compare the coverage of confidence intervals for both approaches. To do so, I split the 200,000 t-values into 1,000 sets of 200 observations and ran z-curve with confidence intervals. I then checked the percentage of confidence intervals that included the true parameter.

For the ERR, 94.7% of confidence intervals included the true parameter. This is just 0.3 percentage points below the nominal 95%, which implies good coverage for sample sizes of N = 30 or more. For the EDR, 97.7% of confidence intervals included the true power, indicating good (if slightly conservative) coverage for the 95% confidence interval.

We will follow up on these preliminary results with more extensive simulations, but the results suggest that it is preferable to use t-values as estimates of z-values rather than using the transformation by means of p-values. We also suggest limiting z-curve analyses to studies with at least N = 30 participants for now.

Conclusions about Z-Curve

Z-curve was developed as a statistical tool that estimates the average power of a set of studies. It does so for two populations of studies. One population is all studies that were conducted independent of the result. The other population is the subset of studies that produced a statistically significant result. Alternative methods exist, but z-curve is the only method that can be used when studies differ in power (heterogeneity) without having to make assumptions about the distribution of true power (Brunner & Schimmack, 2020).

Like all methods that rely on samples to make claims about populations, z-curve cannot reveal the true average power. It can only provide estimates of the average true power in a population of studies. There are two sources of uncertainty in these estimates. One is ordinary sampling error. The other is systematic bias that can be introduced by approximating test statistics from different designs with z-values. Z-curve provides confidence intervals that take both sources of error into account. Simulation studies suggest that 95% confidence intervals contain the true parameter at least 95% of the time in many realistic scenarios.

Reviewer A’s simulations showed that this is no longer the case when sample sizes are small. A simple solution to this problem is not to include studies with very small sample sizes in z-curve analyses. At present, I would suggest excluding studies with N < 30 or at least being mindful when studies with smaller sample sizes are part of the set of studies. I also recommend a new approach for including t-values and F-values with one degree of freedom: instead of converting them to p-values, t-values should be used directly as estimates of z-values.

Conclusions about Peer Review, Scientific Integrity, and Reviewer A

After presenting misleading claims about statistical power that we have carefully examined and shown to be misguided (Brunner, 2024), Reviewer A uses the results of their dishonest simulation study to claim that z-curve is unable to estimate the true power of a set of studies.

In sum, I do not think that the z-curve delivers estimates of the expected discovery rate (and its sister concept of expected replication rate) on a conceptual basis. The arguments for using estimates of “unconditional power” seem not to reasonably justify making a claim on the discovery rate of a set of publications (why not just count up the p-values if p-values signal discovery?). Even if my conceptual points are swept under the rug (again), perhaps the simulated illustration showing that z-curve does not provide an estimate near to the true value of the “expected discovery rate” would be convincing. Why does it break? Well, that goes back to the conceptual issues I have pointed out about consistency and efficiency of observed power, and the hodgepodge problem of combining lots of things together and hoping all the bad stuff averages out.  One might choose to argue that z-curve can run but cannot walk (that is, it performs well enough in a complex case but fails miserably in simple cases). I would not be convinced of such an argument that ignores first principles.

To be fair, Reviewer A states “I do not think,” which suggests that they are open to the idea that they may be wrong. However, it is unclear why Reviewer A continues to ignore the evidence presented in peer-reviewed articles and the opinions of several reviewers who did not see the fundamental problems that they see.  When there are differences of opinion about factual statements, it is important to examine the underlying thought processes and evidence. Reviewer A failed to do so (“I do not think”). Rather, Reviewer A has set up a biased simulation that confirms their suspicion and accuses us of doing the very same thing they were doing. (If this sounds familiar, you know that it can be an effective strategy to deceive people.)

Why does z-curve break in the simulations? As I have shown in this blog post, it breaks down in Reviewer A’s simulations when we use it with sample sizes of N = 2 or 3, it is still somewhat biased with sample sizes of N = 30, and it does well with large sample sizes. Reviewer A hides the missing moderator (a.k.a. the hidden moderator), sample size, which they are well aware of because they knew how to break z-curve with sample sizes of N < 10. However, they falsely generalize from this unrealistic scenario to all applications with larger sample sizes and ignore that many simulations with larger sample sizes have shown that z-curve performs well.

This deception is no different from the questionable research practices that original researchers sometimes use to present statistically significant results without any real effects (Bem, 2011). They know more than they are telling their audience in order to present misleading scientific evidence for false claims. Just like Reviewer A’s dishonest behavior, questionable practices in peer reviews are not rare exceptions, but common occurrences that are rooted in humans’ struggle to overcome their own biases. This would not be a problem if there were an open exchange of arguments between authors and reviewers that works out the causes of disagreement. Here it is easy to show that the original simulations and Reviewer A’s simulations are both correct and that sample size is the moderator. Once this is clarified, the editor can decide whether simulations based on N = 2 or N = 100 are more relevant. However, editors are often quick to reject articles based on the expert opinions of reviewers, especially in flashy journals that pride themselves on high rejection rates. It is not surprising that the quality of articles in these journals is not better than in other journals, because experts will use their power to favor articles that agree with their opinions and be hypercritical of articles that do not. It is well known that pre-publication peer review is very subjective and reviewers often disagree even about ratings of the quality of a literature review.

How can we improve peer-review? The answer is simple. Make them open! Open science requires transparency of all steps of the production of a scientific article, which includes peer-review. Some innovative journals have implemented open peer-reviews. We are proud that two z-curve articles have been published in the leading journal Meta-Psychology (conflict of interest declaration: I am co-founder of this journal with Rickard Carlsson who has been main editor since its inception). Reviewer A’s opinions do not just clash with our own opinions. They are also inconsistent with reviewers of z-curve who put their name next to their reviews. In contrast, reviewers in legacy journals hide their identity from the authors and the public. Just like I challenged Reviewer A to present a simulation to break z-curve, I challenge them here to an open exchange about the ability of z-curve to estimate the true power of a set of studies. Open exchange of arguments in real time (like in a chess game) in front of an audience needs to be added to the open science practices. Let’s make a badge for that and I will be happy to earn a few of those.

Junk Studies Pollute the Literature: Money Priming in 2024

In meta-psychology, social priming has become the poster child of junk science. Researchers conducted many cheap studies with small samples and published the results only when they supported their predictions. After Nobel Laureate Daniel Kahneman published some of these results in his bestselling book “Thinking, Fast and Slow,” he became concerned about the robustness of these results. He sent an open letter to the leading social priming researcher Bargh, asking for replication studies. Bargh and other prominent social priming researchers declined. However, many younger researchers answered the call and reported replication failures. Anybody familiar with the replication crisis in social psychology is well aware of these problems and would not cite social priming studies as scientific evidence unless the studies were preregistered and conducted with reasonable sample sizes to detect small real effects.

However, many psychologists and other social scientists seem to be unaware of the replication crisis or willfully ignore the fact that articles by leading priming researchers provide no credible evidence for the claims made in these articles. As a result, a decade of replication failures has failed to correct the scientific literature. This blog post uses money priming as an example.

The simple idea behind money priming is that some manipulation that makes people think about money will change their attitudes and behaviors in ways that make people more materialistic, selfish, and less altruistic. The original article by Vohs et al. (2006) published 9 studies to provide evidence for this claim. Ironically, an article with 9 successful studies should not make us believe in the effect because even well-designed studies with real effects will occasionally be unsuccessful; that is, produce a p-value above .05 (Schimmack, 2012). Thus, an article that features only successful studies tells us nothing about the actual effect because it is unclear how many attempts were made to get the 9 significant results (Sterling, 1995).

In response to the replication crisis, my colleagues and I have worked on statistical methods to detect the selective publishing of confirmatory evidence, which is sometimes called a questionable research practice, but clearly violates the spirit of scientific integrity and undermines the credibility of science and the scientific community. I am focusing on money priming here because I have used Vohs et al.’s (2006) article to train students in bias detection (video). Even 9 studies are sufficient to show that the evidence in this article omits studies that failed to support the money-priming effect.

I was curious to see whether criticism of priming research and concerns about money priming in general influenced citations of the article. A search in WebOfScience suggests that citations are decreasing. However, citations of psychology articles have been decreasing in 2023 in general and the article is still cited about 20 times a year, which is rather high for psychology. Clearly money priming is not dead yet, and criticizing the work is not akin to flogging a dead horse.

I then looked at articles to see how they cited Vohs et al. (2006). The article “Effects of money priming on sustainable consumption attitudes” caught my attention because the title suggested that it reports results of new money priming research. Indeed, the article reports the results of two successful studies. I will focus on those results a bit later, but first I want to focus on Table 1 in the article, which carefully lists the results of 19 money priming studies with sample sizes and test statistics (t-tests, F-tests). This information is sufficient to examine the presence of selection bias in the broader money-priming literature.

I was even able to copy and paste the table directly into Excel. I just needed to enter the test results in a format that Excel could use and add a few formulas.

The top row shows the important information. 95% of the results were significant if “marginally significant” results (p < .10) are counted as successes. This is consistent with success rates in psychology journals since 1959 (Sterling, 1959). Could this just be due to the fact that money priming is real and that the studies had high statistical power; that is, a high probability of getting p < .05? The answer to this question can be found by computing the exact p-values for the various test statistics and converting them into z-scores. These z-scores can be used to compute the observed power of each study based on the effect size estimate in that study. It is well known that this information is not useful for a single study, but it becomes useful when we have sets of studies and can compute the average power. The average observed power is 66.9%. Thus, we should have gotten about 67% significant results, not 95%. However, observed power is just an estimate of power. Maybe power is really higher? The problem with this argument is that observed power is inflated when selection bias is present. So, the 67% estimate is an overly optimistic estimate and the true average power of the studies is likely to be lower. How much lower? A simple way to estimate the true average power is to subtract the difference between the success rate and the average observed power, because the inflation of the estimate increases the more selection bias there is. With an inflation of 94.7% – 66.9% = 27.8 percentage points, the estimated true power is only about 39%. Whether this is sufficient evidence to wonder about the number of studies that failed to show the effect and were not reported is of course a subjective judgment. As the saying goes, a sucker is born every minute.
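For readers who want to check this logic themselves, here is a minimal R sketch of the calculation described above. The test statistics are made up for illustration; they are not the values from Table 1 of the article.

# hypothetical test statistics (NOT the values from Table 1)
tests <- data.frame(t = c(2.10, 2.45, 2.20, 2.61), df = c(58, 42, 120, 35))

p <- 2 * pt(abs(tests$t), df = tests$df, lower.tail = FALSE)   # exact two-sided p-values
z <- qnorm(1 - p / 2)                                          # implied z-scores
obs_power <- pnorm(z - qnorm(.975)) + pnorm(-z - qnorm(.975))  # observed power per study

success_rate   <- mean(p < .05)          # proportion of significant results
mean_obs_power <- mean(obs_power)        # average observed power
inflation      <- success_rate - mean_obs_power
mean_obs_power - inflation               # crude estimate of true average power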

My colleagues and I have developed a more powerful method to examine these kinds of data called z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). The method fits a model to the distribution of the z-values implied by the test statistics. Visual inspection of the distribution also provides clear evidence that the distribution of z-scores could not have been produced by random sampling error. It is just not possible to get so many results that are just significant (z = 2 implies p = .05) and no non-significant results.
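For completeness, this is roughly how such an analysis can be run with the zcurve R package. The p-values below are placeholders standing in for the exact p-values implied by the test statistics in Table 1, not the actual data.

library(zcurve)

# placeholder p-values for 19 studies (not the real Table 1 values)
p <- c(.032, .049, .001, .021, .044, .080, .003, .015, .038, .047,
       .012, .029, .041, .007, .036, .048, .019, .025, .060)
fit <- zcurve(p = p)
summary(fit)   # reports the ERR and EDR with confidence intervals
plot(fit)      # distribution of significant z-values and the fitted curve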

Based on z-curve, we can estimate a number of statistics. The estimated replication rate (ERR) predicts how many of the studies with significant results (95% of the studies) would produce a significant result again if these studies were repeated exactly with the same sample sizes. It is like asking Bargh to do his study again and show us that he can get the same results again. When asked to do so, he said nobody would believe him anyway if he reported significant results again. The ERR estimate of 35% is less important than the 95% confidence interval around this estimate. The predicted success rate in a set of exact replication studies could be as low as 10%. It could also be as high as 64%, but we just don’t know how replicable these results are. Thus, the 18 (out of 19) significant results tell us practically nothing about the ability to replicate these results.

The other statistic is the expected discovery rate. The expected discovery rate is an estimate of the percentage of significant results that we would find if researchers had kept all of their results and we could find the missing results and see how many significant results we get. The point estimate is very low. We would expect 5% just by chance alone. To get 7% is hardly better than chance. Again, we need to consider the uncertainty in this point estimate. It could be up to 17%, which would imply that researchers get about 1 significant result in 5 attempts (20%). Would you trust a researcher who hides 80% of their results? Moreover, it is also possible that the EDR is 5%, which is chance level. This would mean that all significant results were obtained by chance alone without any real money priming effect.

To be clear: these are not my data. I didn’t select or code the studies. The data come from true believers in money priming who used them to motivate their own studies.

Despite the wealth of adverse effects documented in the literature on money priming in relation to specific behaviors (as outlined in Table 1), there is a noticeable gap in research regarding the impact of money priming on sustainable consumption. (Koruk & Mandrik, 2024, p. 309).

I only plugged the data into statistical programs that can look beyond the evidence we see and test whether we can trust it. The answer is that these 19 studies provide no convincing evidence that money priming caused the significant results in these 18 studies. It could just have been chance and selective reporting of significant results.

Now take a moment and predict the outcome of the new studies in this article. Money priming was manipulated with a scrambled sentence task that made participants rearrange words that were either related to money (experimental group) or not (control group) (Experiment 1), or with a paragraph writing task and a picture of money (Experiment 2).

The outcome variable was the average rating of the following three items.

(1) I am concerned about wasting the resources of our planet.
(2) I will make an effort to use products that do not harm the environment.
(3) It is important to change my consumption patterns (use less or avoid buying products) in order to protect the environment.

Scroll down to see the results.

Results

Experiment 1 Effect size d = 1.44, z = 9.73
Experiment 2 Effect size d = 1.55, z = 9.18

Let’s just say that these results are very surprising. The effect sizes are very large for results in psychological research in general and the priming literature specifically. A difference of 1.5 standard deviations is as big as the difference in height between men and women.

Due to these large effect sizes, the test statistics are off the charts. Z-scores of 9 have a probability of 1 out of a gazillion of occurring by chance. These are not chance findings. These results are also not consistent with the evidence in Table 1, which showed much lower z-scores for most studies.

There are many possible explanations for these surprising results, including computational errors and demand effects, among others (wink). I don’t really care about these results because priming studies are problematic even if they show real effects. First, the manipulation is artificial and may not correspond to real-world situations in which we think about money. Second, ratings on a scale do not imply that people would really change their actual behaviors. However, it would be an easy study to replicate to see whether the results hold up. Even smaller samples would be sufficient to produce these results again, given the strong effect sizes reported in this article.

The main point of this blog post is that we need to look at results in published articles differently. We cannot just count how many significant results appear in journals. We already know the answer to this question. The published literature tends to have over 90% significant results. This is not an empirical finding that can be used to evaluate evidence. The real question is always how many non-significant results are missing. Bias tests can be useful to answer this question. Thus, if you want to be a scientist and make scientific claims, you need to examine the amount of bias in the studies that you are using. “Studies show…” is not a scientific claim. Studies also show that extraverts can sense pornography before it is even presented (Bem, 2011). The real question is how many studies really show an effect.

Guest Post by Jerry Brunner: Response to an Anonymous Reviewer

Introduction

Jerry Brunner is a recent emeritus from the Department of Statistics at the University of Toronto Mississauga. Jerry first started in psychology, but was frustrated by the unscientific practices he observed in graduate school. He went on to become a professor of statistics. Thus, he is not only an expert in statistics; he also understands the methodological problems in psychology.

Sometime in the wake of the replication crisis, around 2014/15, I went to his office to talk to him about power and bias detection. Working with Jerry was educational and motivational. Without him, z-curve would not exist. We spent years trying different methods and thinking about the underlying statistical assumptions. Simulations often shattered our intuitions. The Brunner and Schimmack (2020) article summarizes all of this work.

A few years later, the method is being used to examine the credibility of published articles across different research areas. However, not everybody is happy about a tool that can reveal publication bias, the use of questionable research practices, and a high risk of false positive results. An anonymous reviewer dismissed z-curve results based on a long list of criticisms (Post: Dear Anonymous Reviewer). It was funny to see how ChatGPT responds to these criticisms (Comment). However, the quality of ChatGPT responses is difficult to evaluate. Therefore, I am pleased to share Jerry’s response to the reviewer’s comments here. Let’s just say that the reviewer was wise to make their comments anonymously. Posting the review and the response in public also shows why we need open reviews like the ones published in Meta-Psychology by the reviewers of our z-curve article. Hidden and biased reviews are just one more reason why progress in psychology is so slow.

Jerry Brunner’s Response

This is Jerry Brunner, the “Professor of Statistics” mentioned in the post. I am also co-author of Brunner and Schimmack (2020). Since the review Uli posted is mostly an attack on our joint paper (Brunner and Schimmack, 2020), I thought I’d respond.

First of all, z-curve is sort of a moving target. The method described by Brunner and Schimmack is strictly a way of estimating population mean power based on a random sample of tests that have been selected for statistical significance. I’ll call it z-curve 1.0. The algorithm has evolved over time, and the current z-curve R package (available at https://cran.r-project.org/web/packages/zcurve/index.html) implements a variety of diagnostics based on a sample of p-values. The reviewer’s comments apply to z-curve 1.0, and so do my responses. This is good from my perspective, because I was in on the development of z-curve 1.0, and I believe I understand it pretty well. When I refer to z-curve in the material that follows, I mean z-curve 1.0. I do believe z-curve 1.0 has some limitations, but they do not overlap with the ones suggested by the reviewer.

Here are some quotes from the review, followed by my answers.

(1) “… z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis …”

This is not accurate. Post-hoc power analysis is indeed fatally flawed; z-curve is something quite different. For later reference, in the “observed” power method, sample effect size is used to estimate population effect size for a single study. Estimated effect size is combined with observed sample size to produce an estimated non-centrality parameter for the non-central distribution of the test statistic, and estimated power is calculated from that, as an area under the curve of the non-central distribution. So, the observed power method produces an estimated power for an individual study. These estimates have been found to be too noisy for practical use.

The confusion of z-curve with observed power comes up frequently in the reviewer’s comments. To be clear, z-curve does not estimate effect sizes, nor does it produce power estimates for individual studies.

(2) “It should be noted that power is not a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should always be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.”

The reviewer appears to believe that once the results of a study are in, the study no longer has a power. To clear up this misconception, I will describe the model on which z-curve is based.

There is a population of studies, each with its own subject population. One designated significance test will be carried out on the data for each study. Given the subject population, the procedure and design of the study (including sample size), significance level and the statistical test employed, there is a probability of rejecting the null hypothesis. This probability has the usual frequentist interpretation; it’s the long-term relative frequency of rejection based on (hypothetical) repeated sampling from the particular subject population. I will use the term “power” for the probability of rejecting the null hypothesis, whether or not the null hypothesis is exactly true.

Note that the power of the test — again, a member of a population of tests — is a function of the design and procedure of the study, and also of the true state of affairs in the subject population (say, as captured by effect size).

So, every study in the population of studies has a power. It’s the same before any data are collected, and after the data are collected. If the study were replicated exactly with a fresh sample from the same population, the probability of observing significant results would be exactly the power of the study — the true power.

This takes care of the reviewer’s objection, but let me continue describing our model, because the details will be useful later.

For each study in the population of studies, a random sample is drawn from the subject population, and the null hypothesis is tested. The results are either significant, or not. If the results are not significant, they are rejected for publication, or more likely never submitted. They go into the mythical “file drawer,” and are no longer available. The studies that do obtain significant results form a sub-population of the original population of studies. Naturally, each of these studies has a true power value. What z-curve is trying to estimate is the population mean power of the studies with significant results.

So, we draw a random sample from the population of studies with significant results, and use the reported results to estimate population mean power — not of the original population of studies, but only of the subset that obtained significant results. To us, this roughly corresponds to the mean power in a population of published results in a particular field or sub-field.

Note that there are two sources of randomness in the model just described. One arises from the random sampling of studies, and the other from random sampling of subjects within studies. In an appendix containing the theorems, Brunner and Schimmack liken designing a study (and choosing a test) to the manufacture of a biased coin with probability of heads equal to the power. All the coins are tossed, corresponding to running the subjects, collecting the data and carrying out the tests. Then the coins showing tails are discarded. We seek to estimate the mean P(Head) for all the remaining coins.

(3) “In Brunner and Schimmack (2020), there is a problem with ‘Theorem 1 states that success rate and mean power are equivalent …’ Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size.”

There is no problem with Theorem 1. The theorem says that in the coin tossing experiment just described, suppose you (1) randomly select a coin from the population, and (2) toss it — so there are two stages of randomness. Then the probability of observing a head is exactly equal to the mean P(Heads) for the entire set of coins. This is pretty cool if you think about it. The theorem makes no use of the concept of effect size. In fact, it’s not directly about estimation at all; it’s actually a well-known result in pure probability, slightly specialized for this setting. The reviewer says “Focusing on observed power …” But why would he or she focus on observed power? We are talking about true power here.
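A quick simulation makes the theorem tangible. This is my own sketch of the coin-tossing setup: draw a coin (study) at random, toss it once, and the probability of a head (a significant result) equals the mean P(head) across the whole population of coins.

set.seed(4)
k <- 1e6
true_power <- runif(k, .05, .95)          # each coin's P(head) = each study's true power
heads <- rbinom(k, 1, true_power) == 1    # one toss per coin
c(mean_power = mean(true_power), p_head = mean(heads))   # essentially identical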

(4) “Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution).”

They said it couldn’t be done. Actually, deriving the distribution of the p-value under the alternative hypothesis is a reasonable homework problem for a master’s student in statistics. I could give some hints …
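For readers who want to try the homework problem, here is a numerical check (my example) for a one-sided z-test with true mean mu: if Z ~ N(mu, 1) and the p-value is P = 1 - Φ(Z), then P(P ≤ p) = 1 - Φ(Φ⁻¹(1 - p) - mu), which reduces to the uniform distribution only when mu = 0.

set.seed(5)
mu <- 1.5
p_sim  <- 1 - pnorm(rnorm(1e5, mean = mu))        # simulated p-values under the alternative
p_grid <- seq(.01, .99, by = .01)
cdf_theory <- 1 - pnorm(qnorm(1 - p_grid) - mu)   # closed-form CDF of the p-value
max(abs(ecdf(p_sim)(p_grid) - cdf_theory))        # tiny discrepancy between simulation and formula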

(5) “Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).”

Indeed, z-curve uses only p-values, from which useful estimates of effect size cannot be recovered. As previously stated, z-curve does not estimate power for individual studies. However, the reviewer is aware that p-values have a probability distribution. Intuitively, shouldn’t the distribution of p-values and the distribution of power values be connected in some way? For example, if all the null hypotheses in a population of tests were true so that all power values were equal to 0.05, then the distribution of p-values would be uniform on the interval from zero to one. When the null hypothesis of a test is false, the distribution of the p-value is right skewed and strictly decreasing (except in pathological artificial cases), with more of the probability piling up near zero. If average power were very high, one might expect a distribution with a lot of very small p-values. The point of this is just that the distribution of p-values surely contains some information about the distribution of power values. What z-curve does is to massage a sample of significant p-values to produce an estimate, not of the entire distribution of power after selection, but just of its population mean. It’s not an unreasonable enterprise, in spite of what the reviewer thinks. Also, it works well for large samples of studies. This is confirmed in the simulation studies reported by Brunner and Schimmack.

(6) “The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic.”

Okay, no problem. As described above, every study in the population of studies has its own test statistic, its own true (not estimated) effect size, its own sample size — and therefore its own true power. The relative frequency histogram of these numbers is the true population distribution of power.

(7) “There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively).”

Right. These were examples, illustrating the distribution of power before versus after selection for significance — as given in Theorem 2. Theorem 2 applies to any distribution of true power values.

(8) “If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution …”

The reviewer is objecting to Step 1 of constructing a z-curve estimate, given on page 6 of Brunner and Schimmack (2020). We start with a sample of significant p-values, arising from a variety of statistical tests, various F-tests, chi-squared tests, whatever — all with different sample sizes. Then we pretend that all the tests were actually two-sided z-tests with the results in the predicted direction, equivalent to one-sided z-tests with significance level 0.025. Then we transform the p-values to obtain the z statistics that would have generated them, had they actually been z-tests. Then we do some other stuff to the z statistics.
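In code, Step 1 amounts to the following (a sketch of the idea, not the package's internal implementation), using hypothetical significant two-sided p-values.

p_sig <- c(.049, .021, .003, .0004)   # hypothetical significant p-values
z <- qnorm(1 - p_sig / 2)             # the z statistics that would have produced them, had they been z-tests
round(z, 2)                           # 1.97 2.31 2.97 3.54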

But as the reviewer notes, most of the tests probably are not z-tests. The distributions of their p-values, which depend on the non-central distributions of their test statistics, are different from one another, and also different from the distribution for genuine z-tests. Our paper describes it as an approximation, but why should it be a good approximation? I honestly don’t know, and I have given it a lot of thought. I certainly would not have come up with this idea myself, and when Uli proposed it, I did not think it would work. We both came up with a lot of estimation methods that did not work when we tested them out. But when we tested this one, it was successful. Call it a brilliant leap of intuition on Uli’s part. That’s how I think of it.

Uli’s comment.
It helps to know your history. Well before psychologists focused on effect sizes for meta-analysis, Fisher already had a method to meta-analyze p-values. P-curve is just a meta-analysis of p-values with a selection model. However, p-values have ugly distributions, and Stouffer proposed the transformation of p-values into z-scores to conduct meta-analyses. This method was used by Rosenthal to compute the fail-safe N, one of the earliest methods to evaluate the credibility of published results (Fail-Safe-N). Ironically, even the p-curve app started using this transformation (p-curve changes). Thus, p-curve is really a version of z-curve. The problem with p-curve is that it has only one parameter and cannot model heterogeneity in true power. This is the key advantage of z-curve 1.0 over p-curve (Brunner & Schimmack, 2020). P-curve is even biased when all studies have the same population effect size, but different sample sizes, which leads to heterogeneity in power (Brunner, 2018).

Such things are fairly common in statistics. An idea is proposed, and it seems to work. There’s a “proof,” or at least an argument for the method, but the proof does not hold up. Later on, somebody figures out how to fill in the missing technical details. A good example is Cox’s proportional hazards regression model in survival analysis. It worked great in a large number of simulation studies, and was widely used in practice. Cox’s mathematical justification was weak. The justification starts out being intuitively reasonable but not quite rigorous, and then deteriorates. I have taught this material, and it’s not a pleasant experience. People used the method anyway. Then decades after it was proposed by Cox, somebody else (Aalen and others) proved everything using a very different and advanced set of mathematical tools. The clean justification was too advanced for my students.

Another example (from mathematics) is Fermat’s last theorem, which took over 300 years to prove. I’m not saying that z-curve is in the same league as Fermat’s last theorem, just that statistical methods can be successful and essentially correct before anyone has been able to provide a rigorous justification.

Still, this is one place where the reviewer is not completely mixed up.

Another Uli comment
Undergraduate students are often taught different test statistics and distributions as if they were totally different. However, most tests in psychology are practically z-tests. Just look at a t-distribution with N = 40 (df = 38) and try to see the difference from a standard normal distribution. The difference is tiny and becomes invisible as sample sizes increase beyond 40. And F-tests? F-values with one numerator degree of freedom are just squared t-values, so their square root is practically a z-statistic. But what about chi-square? Well, with 1 df, chi-square is just a squared z-score, so we can take the square root and have a z-score. But what if we don’t have two groups, but compute correlations or regressions? Well, the statistical significance test uses the t-distribution and sample sizes are often well above 40. So, t and z are practically identical. It is therefore not surprising to me that empirical results from different test statistics can be approximated with the standard normal distribution. We could make teaching statistics so much easier, instead of confusing students with F-distributions. The only exceptions are complex designs with 3 x 4 x 5 ANOVAs, but they don’t really test anything and are just used to p-hack. Rant over. Back to Jerry.
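Uli’s point is easy to check numerically. A small, hypothetical R illustration (not part of any published analysis) comparing critical values of the t, F, chi-square, and standard normal distributions:

qt(0.975, df = 38)                 # about 2.02, versus the normal critical value of 1.96
sqrt(qf(0.95, df1 = 1, df2 = 38))  # same ~2.02: F with 1 numerator df is a squared t
sqrt(qchisq(0.95, df = 1))         # 1.96: chi-square with 1 df is a squared z
qnorm(0.975)                       # 1.96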

(9) “It is unclear how Theorem 2 is related to the z-curve procedure.”

Theorem 2 is about how selection for significance affects the probability distribution of true power values. Z-curve estimates are based only on studies that have achieved significant results; the others are hidden, by a process that can be called publication bias. There is a fundamental distinction between the original population of power values and the sub-population belonging to studies that produce significant results. The theorems in the appendix are intended to clarify that distinction. The reviewer believes that once significance has been observed, the studies in question no longer even have true power values. So, clarification would seem to be necessary.

(10) “In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant?”

If there were no selection for significance, this is what a reasonable person would do. But the point of the paper, and what makes the estimation problem challenging, is that all we can observe are statistics from studies with p < 0.05. Publication bias is real, and z-curve is designed to allow for it.

(11) “To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error …”

Wrong. Power is a feature of the design of a study, the significance test, and the subject population. All of these features still exist after data have been collected and the test is carried out.

Uli and Jerry comment:
Whenever a psychologist uses the word “ontological,” be very skeptical. Most psychologists who use the word understand philosophy as well as this reviewer understands statistics.

(12) “The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice.”

Observed power (a bad estimate of true power) is not the probability of significance upon exact replication. True power is the probability of significance upon exact replication. It’s based on true effect size, not estimated effect size. We were talking about true power, and we mistakenly thought that was obvious.

(13) “The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.”

This criticism would have been somewhat justified if the simulations had used p-values from a bunch of z-tests. However, they did not. The simulations reported in the paper are all F-tests with one numerator degree of freedom, and denominator degrees of freedom depending on the sample size. This covers all the tests of individual regression coefficients in multiple regression, as well as comparisons of two means using two-sample (and even matched) t-tests. Brunner and Schimmack say (p. 8):

“Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.”

So I was going to refer the reader (and the anonymous reviewer, who is probably not reading this post anyway) to the supplementary materials. Fortunately I checked first, and found that the supplementary materials include a bunch of OSF stuff like the letter submitting the article for publication, and the reviewers’ comments and so on — but not the full set of simulations. Oops.

All the code and the full set of simulation results is posted at

https://www.utstat.utoronto.ca/brunner/zcurve2018

You can download all the material in a single file at

https://www.utstat.utoronto.ca/brunner/zcurve2018.zip

After expanding, just open index.html in a browser.

Actually we did a lot more simulation studies than this, but you have to draw the line somewhere. The point is that z-curve performs well for large numbers of studies with chi-squared test statistics as well as F statistics — all with varying degrees of freedom.

(14) “The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.”

Now the reviewer is talking about the paper that was actually under review. The mistake is natural, because of our (my) error in not making sure that the full set of simulations was included in the supplementary materials. The conditions in question are not unexplored; they are thoroughly explored, and the accuracy of z-curve for large samples is confirmed.

(15+) There are some more comments by the reviewer, but these are strictly about the paper under review, and not about Brunner and Schimmack (2020). So, I will leave any further response to others.

Replicability Report for the Journal ‘Evolutionary Psychology’

Authors: Maria Soto and Ulrich Schimmack

Citation: Soto, M., & Schimmack, U. (2024, June 24). 2024 Replicability Report for the Journal 'Evolutionary Psychology'. Replicability Index.
https://replicationindex.com/2024/06/24/rr24-evopsy/

Introduction

In the 2010s, it became apparent that empirical psychology had a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in the original studies could be replicated (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability Reports aim to improve the credibility of psychological science by examining the amount of publication bias and the strength of evidence for empirical claims in psychology journals.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behaviour and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without concern about these findings’ replicability.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability Reports (RR) use z-curve to provide information about the research and publication practices of psychology journals. This information can help authors choose journals to publish in, provide feedback to journal editors who influence the selection bias and replicability of published results, and, most importantly, inform readers of these journals.

Evolutionary Psychology

Evolutionary Psychology was founded in 2003. The journal focuses on publishing empirical, theoretical, and review articles investigating human behaviour from an evolutionary perspective. On average, Evolutionary Psychology publishes about 35 articles in 4 annual issues.

As a whole, evolutionary psychology has produced both highly robust and questionable results. Robust results have been found for sex differences in behaviors and attitudes related to sexuality. Questionable results have been reported for changes in women’s attitudes and behaviors as a function of hormonal changes throughout their menstrual cycle.

According to Web of Science, the impact factor of Evolutionary Psychology ranks 88th in the Experimental Psychology category (Clarivate, 2024). The journal has a 48 H-Index (i.e., 48 articles have received 48 or more citations).

In its lifetime, Evolutionary Psychology has published over 800 articles. The average citation rate in this journal is 13.76 citations per article. So far, the journal’s most cited article has been cited 210 times. The article was published in 2008 and investigated the influence of women’s mate value on standards for a long-term mate (Buss & Shackelford, 2008).

The current Editor-in-Chief is Professor Todd K. Shackelford. Additionally, the journal has four other co-editors: Dr. Bernhard Fink, Professor Mhairi Gibson, Professor Rose McDermott, and Professor David A. Puts.

Extraction Method

Replicability Reports are based on automatically extracted test statistics such as F-tests, t-tests, z-tests, and chi2-tests. Additionally, we extracted 95% confidence intervals of odds ratios and regression coefficients. The test statistics were extracted from collected PDF files using custom R code. The code relies on the pdftools R package (Ooms, 2024) to render all textboxes from a PDF file into character strings. Once converted, the code systematically extracts the test statistics of interest (Soto & Schimmack, 2024). PDF files identified as editorials, review papers, and meta-analyses were excluded. Meta-analyses were excluded to avoid the inclusion of test statistics that were not originally published in Evolutionary Psychology. Following extraction, the test statistics are converted into absolute z-scores.
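As a rough illustration of this extraction step, a minimal R sketch is shown below. This is not the actual Soto and Schimmack code; the function name and the regular expression are made up for the example, which only handles t-tests reported in the common “t(df) = value” format.

library(pdftools)

extract_t_tests <- function(pdf_file) {
  text <- paste(pdf_text(pdf_file), collapse = " ")   # full text of one article
  # match results reported as, e.g., "t(38) = 2.10"
  hits <- regmatches(text, gregexpr("t\\(\\s*\\d+\\s*\\)\\s*=\\s*-?\\d+\\.\\d+", text))[[1]]
  df   <- as.numeric(sub("t\\(\\s*(\\d+).*", "\\1", hits))
  tval <- as.numeric(sub(".*=\\s*", "", hits))
  p    <- 2 * pt(abs(tval), df, lower.tail = FALSE)   # two-sided p-value
  data.frame(df = df, t = tval, z = qnorm(1 - p / 2)) # absolute z-score
}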

Results For All Years

Figure 1 shows a z-curve plot for all articles from 2003 to 2023 (see Schimmack, 2023, for a detailed description of z-curve plots). However, the number of test statistics available for 2003, 2004, and 2005 was too low for these years to be analyzed individually. Therefore, these years were combined to ensure that each plotted year had enough test statistics. The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero). A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.96) as a vertical red dotted line.

Figure 1

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics but are omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6. Using the expectation maximization (EM) algorithm, z-curve estimates the optimal weights for seven components located at z-values of 0, 1, ..., 6 to fit the observed statistically significant z-scores. The predicted distribution is shown as a blue curve. Importantly, the model is fitted to the significant z-scores, but it also predicts the distribution of non-significant results. This makes it possible to examine publication bias (i.e., the selective publishing of significant results). Using the estimated distribution of non-significant and significant results, z-curve provides an estimate of the expected discovery rate (EDR); that is, the percentage of significant results that would have been obtained without selection for significance. Using Soric’s (1989) formula, the EDR is used to estimate the false discovery risk; that is, the maximum percentage of significant results that are false positives (i.e., the null hypothesis is true).
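Readers who want to run this kind of analysis themselves can use the zcurve R package that accompanies this work. The sketch below assumes the package’s current interface, with a vector of absolute z-scores as input; the z-scores here are simulated and purely illustrative.

library(zcurve)

z   <- abs(rnorm(1000, mean = 2, sd = 1))  # hypothetical absolute z-scores from a journal
fit <- zcurve(z, method = "EM")            # fit the mixture model to significant z-scores
summary(fit)                               # reports ERR and EDR with confidence intervals
plot(fit)                                  # produces a z-curve plot like Figure 1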

Selection for Significance

The extent of selection bias in a journal can be quantified by comparing the Observed Discovery Rate (ODR) of 68%, 95%CI = 67% to 70%, with the Expected Discovery Rate (EDR) of 49%, 95%CI = 26% to 63%. The ODR is higher than the upper limit of the confidence interval for the EDR, suggesting the presence of selection for publication. Even though the distance between the ODR and the EDR estimate is narrower than that commonly seen in other journals, the present results may underestimate the severity of the problem. This is because the analysis is based on all statistical results. Selection bias is even more problematic for focal hypothesis tests, and the ODR for focal tests in psychology journals is often close to 90%.

Expected Replication Rate

The Expected Replication Rate (ERR) estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcomes of actual replication studies shows that the ERR is higher than the actual replication rate (Schimmack, 2020). Several factors can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if selection for significance does not favour studies with higher power (larger effects and smaller sampling error), because statistical tricks make it just as likely that studies with low power are published. We suggest using the EDR and ERR in combination to estimate the actual replication rate.

The ERR estimate of 72%, 95%CI = 67% to 77%, suggests that the majority of results should produce a statistically significant, p < .05, result again in exact replication studies. However, the EDR of 49% implies that there is some uncertainty about the actual replication rate for studies in this journal and that the success rate can be anywhere between 49% and 72%.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). Using Soric’s formula (1989), the maximum false discovery rate can be calculated based on the EDR.

The EDR of 49% implies a False Discovery Risk (FDR) of 6%, 95%CI = 3% to 15%, but the 95%CI of the FDR allows for up to 15% false positive results. This estimate contradicts claims that most published results are false (Ioannidis, 2005).
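For reference, Soric’s bound can be computed directly from the EDR. A minimal sketch of the calculation, using the point estimate reported above:

soric_fdr <- function(edr, alpha = 0.05) {
  ((1 / edr) - 1) * (alpha / (1 - alpha))   # maximum false discovery rate
}
round(soric_fdr(0.49), 3)   # ~0.055, i.e., close to the 6% reported above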

Changes Over Time

One advantage of automatically extracted test-statistics is that the large number of test statistics makes it possible to examine changes in publication practices over time. We were particularly interested in changes in response to awareness about the replication crisis in recent years.

Z-curve plots were computed for every publication year to examine time trends through regression analysis. Additionally, the degrees of freedom used in F-tests and t-tests served as a metric of sample size to examine whether sample sizes changed over time. Both linear and quadratic trends were considered. The quadratic term was included to detect changes that occurred in response to the replication crisis; that is, there may have been no changes in the earlier years, but increases in the EDR and ERR after 2015.
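These trend analyses amount to simple regressions of the annual estimates on publication year. A hypothetical sketch (the data frame `annual` and its columns are made up for illustration):

# annual: one row per publication year with a column for the estimate of interest (e.g., edr)
annual$year_c <- annual$year - 2015                    # center the predictor near the replication crisis
fit <- lm(edr ~ year_c + I(year_c^2), data = annual)   # linear and quadratic trends
summary(fit)$coefficients                              # b, SE, and p for each term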

Degrees of Freedom

Figure 2 shows the median and mean degrees of freedom used in F-tests and t-tests reported in Evolutionary Psychology. The mean results are highly variable due to a few studies with extremely large sample sizes. Thus, we focus on the median to examine time trends. The median degrees of freedom over time was 121.54, ranging from 75 to 373. Regression analyses of the median showed a significant linear increase of about 6 degrees of freedom per year, b = 6.08, SE = 2.57, p = .031. However, there was no evidence that the replication crisis produced an additional increase in sample sizes, as indicated by the lack of a significant non-linear trend and a small regression coefficient, b = 0.46, SE = 0.53, p = .400.

Figure 2

Observed and Expected Discovery Rates

Figure 3 shows the changes in the ODR and EDR estimates over time. There were no significant linear, b = -0.52 (SE = 0.26, p = .063), or non-linear, b = -0.02 (SE = 0.05, p = .765), trends in the ODR estimate. The regression results for the EDR estimate also showed no significant linear, b = -0.66 (SE = 0.64, p = .317), or non-linear, b = 0.03 (SE = 0.13, p = .847), changes over time. These findings indicate that the journal has not increased its publication of non-significant results and continues to report more significant results than one would predict based on the mean power of its studies.

Expected Replicability Rates and False Discovery Risks

Figure 4 depicts the false discovery risk (FDR) and the Expected Replication Rate (ERR). It also shows the Expected Replication Failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in the original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

The ERR estimate did not show a significant linear increase over time, b = 0.36, SE = 0.24, p = .165. Additionally, no significant non-linear trend was observed, b = -0.03, SE = 0.05, p = .523. These findings suggest that the increase in sample sizes did not translate into a statistically significant increase in the power of the published results. They also suggest that the replicability of results in this journal has not increased over time and that the results in Figure 1 apply to all years.

Figure 4

Visual inspection of Figure 4 shows an EFR between 30% and 40% and an FDR between 0% and 10%. This suggests that more than half of replication failures are likely to be false negatives in replication studies with the same sample sizes rather than false positive results in the original studies. Studies with large sample sizes and small confidence intervals are needed to distinguish between these two explanations for replication failures.

Adjusting Alpha

A simple solution to a crisis of confidence in published results is to adjust the criterion for rejecting the null hypothesis. For example, some researchers have proposed setting alpha to .005 to avoid too many false positive results. With z-curve, we can calibrate alpha to keep the false discovery risk at an acceptable level without discarding too many true positive results. To do so, we set alpha to .05, .01, .005, and .001 and examined the false discovery risk at each criterion.

Figure 5

Figure 5 shows that the conventional criterion of p < .05 produces false discovery risks above 5%. The high variability in annual estimates also makes it difficult to provide precise estimates of the FDR. However, adjusting alpha to .01 is sufficient to produce an FDR with tight confidence intervals below 5%. The benefits of reducing alpha further to .005 or .001 are minimal.

Figure 6

Figure 6 shows the impact of lowering the significance criterion, alpha, on the discovery rate (a lower alpha implies fewer significant results). In Evolutionary Psychology, lowering alpha to .01 reduces the observed discovery rate by about 20 percentage points. This implies that about 20% of results reported p-values between .05 and .01. These results often have low success rates in actual replication studies (OSC, 2015). Thus, our recommendation is to set alpha to .01 to reduce the false positive risk to 5% and to disregard studies with weak evidence against the null hypothesis. These studies require actual successful replications with larger samples to provide credible evidence for an evolutionary hypothesis.

There are relatively few studies with p-values between .01 and .005. Thus, more conservative researchers can use alpha = .005 without losing too many additional results.

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported but do not test focal hypotheses (e.g., testing the statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

To examine the influence of automatic extraction on our results, we can compare them to hand-coded results of over 4,000 focal hypothesis tests in over 40 journals around 2010 and 2020. The ODR for focal tests was 90% around 2010 and 88% around 2020. Thus, the tendency to report significant results for focal hypothesis tests is even higher than the ODR for all results, and there is no indication that this bias has decreased notably over time. The ERR increased a bit from 61% to 67%, but these values are somewhat lower than those reported here. Thus, it is possible that focal tests also have lower average power than other tests, but this difference seems to be small. The main finding is that publishing non-significant results for focal tests remains an exception in psychology journals, and probably also in this journal.

One concern about the publication of our results is that it merely creates a new criterion to game publications. Rather than trying to get p-values below .05, researchers may use tricks to get p-values below .01. However, this argument ignores that it becomes increasingly harder to produce lower p-values with tricks (Simmons et al., 2011). Moreover, z-curve analysis makes it easy to see selection bias for different levels of significance. Thus, a more plausible response to these results is that researchers will increase sample sizes or use other methods to reduce sampling error to increase power.

Conclusion

The replicability report shows that the average power to obtain a significant result (i.e., a discovery) ranges from 49% to 72% in Evolutionary Psychology. These estimates are higher than previous estimates for evolutionary psychology journals. However, the confidence intervals are wide and suggest that many published studies remain underpowered. The report did not find any significant changes over time in power and replicability, as indexed by the EDR and ERR estimates. The false positive risk is modest and can be controlled by setting alpha to .01. Replication attempts of original findings with p-values above .01 should increase sample sizes to produce more conclusive evidence. Lastly, the journal shows clear evidence of selection bias.

There are several ways the current or future editors of this journal can improve the credibility of the results it publishes. First, results with weak evidence (p-values between .05 and .01) should only be reported as suggestive results that require replication, or editors may even request a replication before publication. Second, editors should try to reduce publication bias by prioritizing research questions over results. A well-conducted study of an important question should be published even if the results are not statistically significant. Pre-registration and registered reports can help to reduce publication bias. Editors may also ask for higher-powered follow-up studies when a result is not significant.

Publication bias also implies that point estimates of effect sizes are inflated. It is therefore important to take the uncertainty in these estimates into account. Small samples with large sampling errors are usually unable to provide meaningful information about effect sizes, and conclusions should be limited to the direction of an effect.

The present results serve as a benchmark for future years, making it possible to track progress in this journal and to build trust in research by evolutionary psychologists.

How Useful are Statistical Power calculations?

One of the greatest meta-psychologists was Jacob Cohen. He was concerned about the risk that psychologists might waste resources on studies that had a low probability of providing evidence for a true hypothesis. Following Neyman and Pearson, this error is called a type-II error. It can also be called a false negative result.

Psychologists typically rely on null-hypothesis testing to provide evidence for their predictions. They set the criterion for a statistically significant result to 5%. This means that there is only a 5% probability of obtaining a significant result when there is no real effect. This is called a type-I error or a false positive result. In this approach, a type-II error occurs when a prediction is true (e.g., a treatment is effective), but the p-value is above .05.

Cohen (1962) warned psychologists that many studies have a high risk of producing false negative results, especially when effect sizes are small. Even for effects around the average effect size in psychological studies, the risk of a false negative result was about 50%. Follow-up studies showed that this situation did not change in the following decades (Sedlmeier & Gigerenzer, 1989).

One might assume that psychologists simply have little control over the false negative risk in their studies. However, that is not true. A simple way to decrease the false negative risk is to increase sample sizes. Thus, one has to wonder why psychologists did not increase sample sizes in response to evidence that they were conducting studies with a high risk of false negative results.

Imagine a gambler who can play two slot machines. One has a 50% chance of winning, the other has an 80% chance of winning. Which machine would you pick? The answer is obvious. The situation for a researcher is a bit different. First, they have to pay more (invest more resources in larger samples) to play at the higher odds of winning (i.e., avoiding a false negative result). Second, they do not know the actual odds of winning. They merely know that the odds of winning are higher when they invest more resources. Cohen (1988) tried to help researchers make decisions that reduce the false negative risk without paying too much for larger sample sizes. It took about 50 years, but power analyses have finally become more popular in psychology over the past decade.
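For readers unfamiliar with Cohen’s approach, base R can do these calculations. A minimal example for a two-group comparison of a medium effect (d = .5 in Cohen’s terms), showing the price of moving from a coin-flip chance of success to 80% power:

power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.50)$n  # about 32 per group
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)$n  # about 64 per group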

While better control of false negative results may seem desirable to all, a recent peer-reviewed article by Pek, Pitt, and Wegener (2024) suggests that power analyses are useless. They claim in the title that “uncertainty limits the use of power analysis.” In the article, they ask “Isn’t use of power better than not using power at all?” and their answer is not a simple “yes” (p. 11). They say it is also not a simple “no,” but they provide no examples where power analysis is better than drawing a random number from a hat to determine the sample size of a study. In fact, they go on to state that “we recommend that researchers place limited confidence when using power to design experiments, or not use it at all as a direct justification for determining N” (p. 11). If that does not mean “power analysis is useless,” they do a very good job of hiding the benefits of power analysis.

It is remarkable that such a harsh criticism of Cohen can be published in a leading psychology journal without even mentioning Cohen’s work. It is also remarkable that they never mention false negative results or type-II errors, although power is defined as the probability of avoiding a type-II error (beta = type-II error probability, power = 1 – beta). So, we do not know what Pek et al. (2024) suggest researchers should do when they get a non-significant result. Maybe somebody should write to them: “Hey, I just did a study with a randomly generated sample size and got a non-significant result. What now?”

Two other giants in the history of psychology wrote an article in 1971 about the problem with small samples that often have a high false negative risk (Kahneman & Tversky, 1971). They wrote, “We refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110).

What Pek et al. do not tell their readers is the real reason why psychologists ignored false negative results and continued to use small samples. The reason is not that they rarely have false negative results. The reason is that they invest relatively few resources in their studies so that they can conduct many studies, or many tests within a study, to get at least one significant result. The non-significant results are simply discarded. This is known as using questionable research practices because researchers do not disclose all results. It increases the risk that the published results are false positives. If a researcher tests 20 false hypotheses, they can expect to get one p-value below .05. If they do not disclose that they ran 20 tests, readers cannot see that the one significant result was expected by chance alone.

Pek et al. also do not tell readers why power analysis has become more popular in the past decade. The reason is that a high rate of false negative results makes significant results less informative. Imagine that researchers test 100 true hypotheses and 100 false hypotheses. The 100 false hypotheses are expected to produce 5 significant results. This is implied by the use of the 5% criterion. If the 100 true hypotheses are tested with only 10% power, we get 10 true findings and 90 false negative results. Now we publish only the significant results, which means there are 15 results: 5 are false findings (a medicine does not work) and 10 are true findings (a medicine works). This means one third of the published findings are false. Cohen recommended planning studies with 80% power. This would mean we get 80 true findings and 5 false findings. As a result, only 6% of the published results are false. Would you still believe Pek et al. (2024) that power analyses are useless, or would you rather wonder whether the average power in psychology is closer to 10% or 80%?
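The arithmetic in this example is easy to check. A short sketch:

alpha      <- 0.05
false_hits <- 100 * alpha                     # expected false positives from 100 false hypotheses
prop_false <- function(power) {
  true_hits <- 100 * power                    # expected true positives from 100 true hypotheses
  false_hits / (false_hits + true_hits)       # share of published (significant) results that are false
}
prop_false(0.10)   # 5 / 15 = 0.33, one third false
prop_false(0.80)   # 5 / 85 = 0.059, about 6% false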

The key argument in Pek et al.’s article is that power is a hypothetical construct because we do not know whether the predicted effect is small, medium, or large. First of all, this is not true of all sciences. Some sciences have theories that make quantitative predictions. Even psychologists may have some idea whether they are testing a weak, moderate, or strong effect. However, we do not even need to know what the true effect size is. We can conduct a power analysis based on an effect size that is theoretically interesting. For example, the question whether money buys happiness or not is a silly question. A more interesting question is how much happiness money can buy. Let’s say that money is only important for a theory of happiness if the correlation between money and happiness is at least r = .1, what Cohen calls a weak effect. Power analysis not only helps us to determine a reasonable sample size to look for this correlation, it can also help to make non-significant results informative. For example, if we power a study to have a 95% chance to detect a correlation of r = .1, and we obtain a non-significant result, the chance that this result is a false negative is less than 5%. We may therefore be willing to accept the hypothesis that the correlation is less than .1 and conclude that money has a negligible influence on happiness (by the way, the true correlations tend to be between .1 and .3). This is valuable information that can only be obtained by considering the risk of a false negative finding. Finding a non-significant result in a study with N = 20 people does not warrant the conclusion that money does not matter much for happiness, because the false negative risk is too high. Pek et al. (2024) ignore all of this useful information that power analyses can provide, even when there is great uncertainty about the true power of a study. Moreover, researchers can easily track the power of their studies by keeping track of their success record.
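To see what such a power analysis looks like in practice, here is a rough base-R calculation using the Fisher z approximation for a correlation test; the numbers are illustrative, matching the r = .1 example above:

r <- 0.10
n <- ((qnorm(0.975) + qnorm(0.95)) / atanh(r))^2 + 3   # 95% power, two-sided alpha = .05
ceiling(n)   # roughly 1,300 participants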

If a researcher conducts 20 statistical tests and finds only 4 significant results, the average power is about 20%. According to Kahneman and Tversky (1971), any serious researcher would have to wonder whether they are testing a lot of false hypotheses or producing a lot of false negative results. No serious researcher should just continue doing what they are doing, publish the 4 significant results, and call it a day. However, that is what social psychologists like Pek’s co-author Duane Wegener have been doing for decades, while ignoring power analyses. This has led to the replication crisis in social psychology, which has uncovered many false findings. At least Nobel Laureate Daniel Kahneman had the humility to recognize his mistake: “What the blog gets absolutely right is that I placed too much faith in underpowered studies” (Kahneman, 2017).

Kahneman (2017) also points out that we need science to make new discoveries and to correct false beliefs, but that science can only serve this function when all relevant results are published. That was not the case in social psychology. Non-significant results were ignored, and only significant results that confirmed even the most implausible predictions were published. This bias is evident in the high percentage of significant results in psychology journals (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). With success rates of 90%, honest reporting would imply that psychologists only test true hypotheses with high power. Ironically, this would mean that psychologists do not need power analysis because they miraculously never obtain false negative results. The real reason for the 90% success rate is rather different. A replication project found only 25% significant results in replication studies of social psychology (Open Science Collaboration, 2015), suggesting that most studies are well below the 50% criterion for serious researchers (Kahneman & Tversky, 1971). These are well-known facts that Pek et al. (2024) and the editor who published their article simply ignore and hide from readers who are not familiar with the history of power analysis.

Finally, Pek et al.’s (2024) concern about the uncertainty of the true power of a study is irrelevant for the usefulness of power analyses. The true power of a study is less important than the truthful reporting of results. Uncertainty about true power implies that even researchers who conduct power analyses will sometimes run studies that produce false negative results. First, Cohen’s recommendation to aim for 80% power implies that 20% of tests of a true hypothesis will produce false negative results. Second, power analyses can overestimate true power, and the false negative risk can then be even greater than 20%. This is not a problem if the results are published and combined with other evidence that can correct false negative results. This is what researchers in medicine do. There, studies have only about 30% power on average, but non-significant results are reported, and meta-analyses can reduce the risk of false conclusions. Thus, the biggest threat to psychology as a science is uncertainty about the honest reporting of results, not uncertainty about true power. The advantage of conducting power analyses is that honest and credible evidence is more likely when researchers conduct a few studies with high power than when they conduct many studies with low power. This is what Cohen meant when he said “less is more, except for sample size.”

In conclusion, if you are new to statistical power and its role in psychological science, I recommend reading Cohen (1988, 1992) and ignoring Pek et al.’s (2024) useless article. A simple truth about power is that the percentage of significant results in a set of studies is an estimate of the mean power of those studies. If you see a set of studies with over 90% significant results, you have to ask yourself: did these studies really test only true hypotheses with high power, or did researchers not report studies that failed to support their claims (Schimmack, 2012)? I trust you to come to the right conclusion, but you can also use power calculations to test for the presence of selection bias. That is a story for another day.
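As a final illustration of that simple truth, consider how unlikely a near-perfect success rate is when studies are only moderately powered; a quick, hypothetical calculation:

dbinom(10, size = 10, prob = 0.60)                       # 10 of 10 significant with 60% power: ~0.006
pbinom(8, size = 10, prob = 0.60, lower.tail = FALSE)    # 9 or more of 10 significant: ~0.046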

Dear Anonymous Reviewer…

Peer review is the foundation of science. Peer reviewers work hard to evaluate manuscripts to see whether they are worthy of being published, especially in old-fashioned journals with strict page limitations. Their hard work often goes unnoticed because peer reviews remain unpublished. This is a shame. A few journals have recognized that science might benefit from publishing reviews. Not all reviews are worthy of publication, but when a reviewer spends hours, if not days, writing a long and detailed comment, it seems only fair to share the fruits of their labor in public. Unfortunately, I am not able to give credit to Reviewer 1, who was too modest or shy to share their name. This does not diminish the value they created, and I hope the reviewer may find the courage to take credit for their work.

Reviewer 1 was asked to review a paper that used z-curve to evaluate the credibility of research published in the leading emotion journals. Yet, going beyond the assigned task, Reviewer 1 gave a detailed and thorough review of the z-curve method that revealed the deep flaws of this statistical method, which had been missed by reviewers of the articles that promoted this dangerous and misleading tool. After a theoretical deep-dive into the ontology of z-curve, Reviewer 1 points out that simulation studies seem to have validated the method. But Reviewer 1 was quick to notice that the simulations were a sham, designed to show that z-curve works rather than to see it fail in applications to more realistic data. Deeply embarrassed, my co-authors, including a Professor of Statistics, are now contacting journals to retract our flawed articles.

Please find the damaging review of z-curve below.

P.S. We are also offering a $200 reward for credible simulation studies that demonstrate that z-curve is crap.

P.P.S. Some readers seem to have missed the sarcasm and taken the criticism by Reviewer 1 seriously. The problem is a lack of expertise to evaluate the conflicting claims. To make it easy, I share an independent paper that validated z-curve with actual replication outcomes. I am not sure how Reviewer 1 would explain the positive outcome. Maybe we hacked the replication studies, too?

Röseler, Lukas, 2023. “Predicting Replication Rates with Z-Curve: A Brief Exploratory Validation Study Using the Replication Database,” MetaArXiv ewb2t, Center for Open Science.

ANONYMOUS, July 17, 2024

Referee: 1

Comments to the Author
The manuscript “Credibility of results in emotion science: A z-curve analysis of results in the journals Cognition & Emotion and Emotion” (CEM-DA.24) presents results from a z-curve analysis on reported statistics (t-tests, F-tests, and chi-square tests with df < 6 and 95% confidence intervals) for empirical studies (excluding meta-analysis) published in Cognition & Emotion from 1987 to 2023 and Emotion from 2001 to 2023. The purposes of reporting results from a z-curve analysis are to (a) estimate selection bias in emotion research and (b) predict a success rate in replication studies.

I have strong reservations about the conclusions drawn by the authors that do not seem to be strongly supported by their reported results. Specifically, I am not confident that conclusions from z-curve results justify the statements made in the paper under review. Below, I outline the main concerns that center on the z-curve methodology that unfortunately focuses on providing a review on Brunner and Schimmack (2020) and not so much on the current paper.

VAGUE METHODOLOGY. The authors make strong claims about what conclusions can be drawn from z-curve analyses. Their presentation of z-curve analysis in the present paper is declarative and does not provide the necessary information to describe the assumptions made by the method, how it works, when it fails, etc. The authors cite previous publications on z-curve (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022; Schimmack & Bartos, 2023). Furthermore, this work ignores recent criticism in the literature about such statistical forensics. One example questioning the validity of conclusions by tests of credibility/replicability (e.g., p-curve, Francis’s [2013] consistency test) is in a talk by Richard Morey titled “Statistical games: Flawed thinking of popular methods for assessing reproducibility” (https://www.youtube.com/watch?v=c0G98qp1cf4). The talk was based on Morey (2013). Other authors who have written on this topic include McShane, Böckenholt, and Hansen (2020) and Pek, Hoisington-Shaw, & Wegener (2022).

==
Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180-183. https://doi.org/10.1016/j.jmp.2013.03.004

McShane, B. B., Böckenholt, U., & Hansen, K. T. (2020). Average Power: A Cautionary Note. Advances in Methods and Practices in Psychological Science, 3(2), 185–199. https://doi.org/10.1177/2515245920902370

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153-169. https://doi.org/10.1016/j.jmp.2013.02.003

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding Questionable Research Practices Surrounding Statistical Power Analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding Questionable Practices in Applied Psychology (pp. 243–267). Springer. https://doi.org/10.1007/978-3-031-04968-2_11
==

In reading Brunner and Schimmack (2020), z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis (whether it be for a single study or for a set of studies; see Pek, Hoisington-Shaw, & Wegener, in press for a treatment of this misconception).

It should be noted that power is *not* a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should *always* be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

O’Keefe, D. J. (2007). Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analyses. Communication Methods and Measures, 1(4), 291–299. https://doi.org/10.1080/19312450701641375
==

In Brunner and Schimmack (2020), there is a problem with “Theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins.” Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size. Methodological papers that deal with power analysis making use of estimated effect size show that the uncertainty due to sampling variability is extremely high (e.g., see Anderson et al., 2017; McShane & Böckenholt, 2016); it is worse when effects are random (cf. random effects meta-analysis; see McShane, Böckenholt, & Hansen, 2020; Pek, Pitt, & Wegener, 2024). Accepting that effects are random seems to be more consistent with what we observe in empirical results of the same topic. The extent of uncertainty in power estimates (based on observed effects) is so high that much cannot be concluded with such imprecise calculations.

Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution). However, because p-values have sampling variability (and an unknown sampling distribution), one cannot take a significant p-value to deterministically indicate a tally on power (which assumes that an unknown specific effect size is true). Stated differently, a significant p-value can be consistent with a Type I error. Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).

There seems to be some conceptual slippage on the meaning of power here because what the authors call power does not seem to have the defining features of power.

The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic. There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively). If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores. P-values are statistics and follow a sampling distribution; the variance of the sampling distribution is a function of sample size. So, it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution. If the further argument is using a mixture of z-distributions to estimate the distribution of the z-scores, then these z-scores are not technically z-scores in that they are nor distributed following the z-distribution. We might estimate the standard error of the mixture of z-distributions to rescale the distribution again to a z-distribution… but to what end? Again, there is some conceptual slippage in what is meant by a z-score. If the distribution of p-values that have been transformed to a z-score is not a z-distribution and then the mixture distribution is then shaped back into a z-distribution (with truncations that seem arbitrary) so that the critical value of 1.96 can be used – I’m not sure what the resulting distribution is of, anymore. A related point is that we do not yet know whether p-values are transformation invariant (in distribution) under a z-score transformation. Furthermore, the distribution for power invoked in Theorem 1 is not a function of sample size, effect size, or statistical procedure, suggesting that the assumed distribution does not align well with the features that we know influence power. It is unclear how Theorem 2 is related to the z-curve procedure. Again, there seems to be some conceptual slippage involved with p-values being transformed into z-scores that somehow give us an estimate of power (without stating the effect size, sample size, or procedure).

In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant? After all, p-values can be transformed to z-scores and vice-versa in that they carry the same information. But then, there is a problem of p-values having sampling variability and might be consistent with Type I error. A transformation from p to z will not fix sampling variability.

To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error about the meaning of frequentist probability. A thought experiment might help. Suppose I completed a study, and the p-value is .50. I convert this p-value to a z-score for a two-tailed test and get 0.67. Let’s say I collect a bunch of studies and do this and get a distribution of z-scores (that don’t end up being distributed z). I do a bunch of things to make this distribution become a z-distribution. Then, I define power as the proportion of z-scores above the cutoff of 1.96. We are now calling power a collection of z-scores above 1.96 (without controlling for sample size, effect size, and procedure). This newly defined “power” based on the z-distribution does not reflect the original definition of power (area under the curve for a specific effect size, a specific procedure, and a specific sample size, assuming the Type I error is .05). This conceptual slippage is akin to burning a piece of wood, putting the ashes into a mold that looks like wood, and calling the molded ashes wood.

The authors state that (observed) power is the probability of a successful exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996) state that “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that this requires the assumption that the estimated effect size equals the unknown population effect size, an assumption they admit cannot be met in practice. Furthermore, recall that power is a property of a procedure and not a property of completed data (cf. the ontological error above); using observed power to quantify replicability therefore presents replicability as a property of a procedure rather than a statement about the robustness of an observed effect. Again, there seems to be conceptual slippage between what is meant by replication and what is used to quantify it (which should not be observed power).

The basis of support for the z-curve procedure is a simulation study. This approach merely confirms what is assumed in the simulation and does not allow the procedure to be refuted in any way (cf. Popper’s idea that refutability is the basis of science). In a simulation study, one assumes that the underlying process generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation matches that of empirical data. Stated differently, models of phenomena are fallible, so we seek evidence to refute and corroborate them. The simulation in support of z-curve does not put z-curve to the test; it uses a model consistent with z-curve (absent empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results because, on data simulated from model A, we get the best results.

Further, the evidence that z-curve performs well is specific to the assumptions of the simulation study. If p-values were generated in a different way, reflecting a competing data-generating process, the performance of z-curve would differ. The simulation study examined the performance of z-curve under constrained scenarios, including F-tests with df = 1, and not for the combination of t-tests and chi-square tests used in the current study. I am not sure what to make of z-curve’s performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.

==
Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175–183. https://doi.org/10.1111/j.1469-8986.1996.tb02121.x

McShane, B. B., & Böckenholt, U. (2016). Planning sample sizes when effect sizes are uncertain: The power-calibrated effect size approach. Psychological Methods, 21(1), 47–60. https://doi.org/10.1037/met0000036

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (in press). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. https://www.researchgate.net/publication/368358276_Uses_of_uncertain_statistical_power_Designing_future_studies_not_evaluating_completed_studies

Pek, J., Pitt, M. A., & Wegener, D. T. (2024). Uncertainty limits the use of power analysis. Journal of Experimental Psychology: General, 153(4), 1139–1151. https://doi.org/10.1037/xge0001273
==

INPUT DATA. The authors made use of statistics reported in empirical research published in Cognition & Emotion and Emotion. Articles often report several studies, studies contain several models, and models contain several tests. Thus, there is a nested structure: tests nested within models, models nested within studies, and studies nested within articles. This nesting does not seem to be taken into account when estimating selection bias and the expected replication rate. Thus, the estimates provided cannot be deemed unbiased (e.g., estimates would be weighted toward articles that report many statistics compared to others).

As the authors admit, there is no separation of statistical tests used for manipulation checks, preliminary analyses, or tests of competing and alternative hypotheses. Given that the sampling of statistics might not be representative of key findings in emotion research, little confidence can be placed in the accuracy of the estimates reported and the strong claims being made from them (about emotion research in general).
Finally, the authors excluded chi-square tests with more than 6 degrees of freedom. This means that chi-square tests from larger designs were excluded, for example tests of independence with (rows − 1) × (columns − 1) > 6 or goodness-of-fit tests with eight or more categories. In general, the authors need to be careful about the conditions to which their conclusions apply.

UNSUBSTANTIATED CONCLUSIONS. The key conclusions made by the authors are that there is selection bias in emotion research and that replication studies would have a success rate of 70%. These conclusions are based on a z-curve analysis whose validity I question. My concerns about the z-curve procedure have to do with ontological errors about the probability attached to the concept of power, the rationale for z-transformations of p-values (along with the strange distributional gymnastics given little justification in the original paper), and the equating of power with replication.

Even if z-curve is valid, its performance should be evaluated under conditions that match the data used in the current study. Furthermore, attending to data quality in the z-curve analysis, in terms of selection criteria (e.g., excluding tests of manipulation checks) and modeling the nested structure inherent in reported results, would go a long way toward ensuring that the estimates are as unbiased as possible.

Finally, it seems odd to conclude that there is selection bias based on data with selection bias; there may be a tautology within the argument. An analogy to missing data might help. Given a set of data that we assume has undergone selection (i.e., part of the distribution is missing), how can we know from the data what is missing? The only way to talk about the missing part of the distribution is to assume a distribution for the “full” data that subsumes the observed data. But who can say that the assumed distribution is the one that would have generated the full data? The selected data do not contain the features that would let us infer what the full distribution should be. How can we know that what we observe has undergone selection without knowledge of the selection process (cf. the distribution of the full data), unless some implicit assumption is made? We are not given that assumption and therefore cannot evaluate whether it is valid. I cannot tell what assumptions z-curve makes about selection.

Replicability Report 2024: Acta Psychologica

Authors: Maria Soto and Ulrich Schimmack

Citation: Soto, M., & Schimmack, U. (2024, July 4). 2024 Replicability Report for the Journal 'Acta Psychologica'. Replicability Index.
https://replicationindex.com/2024/07/04/rr24-actapsy/

Introduction

In the 2010s, it became apparent that empirical psychology had a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability Reports aim to improve the credibility of psychological science by examining the amount of publication bias and the strength of evidence for empirical claims in psychology journals.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed statistical tools that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain. Replicability Reports (RR) analyze the statistical results reported in a journal with z-curve to estimate the replicability of published results, the amount of publication bias, and the risk that significant results are false positives (i.e., the sign of a mean difference or correlation in a significant result does not match the sign in the population).

Acta Psychologica

Acta Psychologica is an old psychological journal that was founded in 1936. The journal publishes articles from various areas of psychology, but cognitive psychological research appears to be the most common area. Since 2021, the journal has been a Gold Open Access journal that charges authors a $2,000 publication fee.

On average, Acta Psychologica publishes about 150 articles a year in 9 annual issues.

According to Web of Science, the impact factor of Acta Psychologica ranks 44th in the Experimental Psychology category (Clarivate, 2024). The journal has an H-Index of 140 (i.e., 140 articles have received 140 or more citations).

In its lifetime, Acta Psychologica has published over 6,000 articles with an average citation rate of 21.5 citations. So far, the journal has published 5 articles with more than 1,000 citations. However, most of these articles were published in the 1960s and 1970s.  The most highly cited article published in the 2000s examined the influence of response categories on the psychometric properties of survey items (Preston & Colman, 2000; 1055 citations).

Psychology literature has faced difficult realizations in the last decade. Acta Psychologica is a broad-scope journal that offers us the possibility to observe changes in the robustness of psychological research practices and results. The current report serves as a glimpse into overall trends in psychology literature as it considers research from multiple subfields.

Given the multidisciplinary nature of the journal, it has a team of editors. The current editors are Dr. Muhammad Abbas, Dr. Mohamed Alansari, Dr. Colin Cooper, Dr. Valerie De Cristofaro, Dr. Nerelie Freeman, Professor Alessandro Gabbiadini, Professor Matthieu Guitton, Dr. Nhung T. Hendy, Dr. Amanpreet Kaur, Dr. Shengjie Lin, Dr. Hui Jing Lu, Professor Robrecht Van Der Wel, and Dr. Oliver Weigelt.

Extraction Method

Replicability Reports are based on automatically extracted test statistics such as F-tests, t-tests, z-tests, and chi2-tests. Additionally, we extracted 95% confidence intervals of odds ratios and regression coefficients. The test statistics were extracted from collected PDF files using custom R code. The code relies on the pdftools R package (Ooms, 2024) to render all text boxes in a PDF file into character strings. Once converted, the code systematically extracts the test statistics of interest (Soto & Schimmack, 2024). PDF files identified as editorials, review papers, and meta-analyses were excluded. Meta-analyses were excluded to avoid the inclusion of test statistics that were not originally published in Acta Psychologica. Following extraction, the test statistics are converted into absolute z-scores.
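The following is only a minimal sketch of this extraction logic, not the actual code; the file name and regular expression are illustrative, and the real pipeline handles many more reporting formats (see Soto & Schimmack, 2024).

```r
library(pdftools)

# Render the PDF into text and collapse the pages into one string
txt <- paste(pdf_text("article.pdf"), collapse = " ")

# Pull out t-tests reported as, e.g., "t(48) = 2.31"
m  <- regmatches(txt, gregexpr("t\\((\\d+)\\)\\s*=\\s*-?\\d+\\.\\d+", txt))[[1]]
df <- as.numeric(sub("t\\((\\d+)\\).*", "\\1", m))
tv <- as.numeric(sub(".*=\\s*", "", m))

# Convert each test statistic to a two-tailed p-value and then to an absolute z-score
p <- 2 * pt(abs(tv), df, lower.tail = FALSE)
z <- qnorm(1 - p / 2)
```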

Results For All Years

Figure 1 shows a z-curve plot for all articles from 2000-2023 (see Schimmack, 2022a, 2022b, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero). A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.96) as a vertical red dotted line.

Figure 1

Z-curve plots are limited to values below z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used in the computation of z-curve statistics but are omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6. Using the expectation maximization (EM) algorithm, z-curve estimates the optimal weights for seven components located at z-values of 0, 1, …, 6 to fit the observed statistically significant z-scores. The predicted distribution is shown as a blue curve. Importantly, the model is fitted to the significant z-scores, but it also predicts the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results). Using the estimated distribution of non-significant and significant results, z-curve provides an estimate of the expected discovery rate (EDR), that is, the percentage of significant results that would be obtained without selection for significance. Using Soric’s (1989) formula, the EDR is used to estimate the false discovery risk, that is, the maximum proportion of significant results that are false positives (i.e., the null hypothesis is true).
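To illustrate the form of this mixture model, the sketch below evaluates a weighted mixture of folded normal densities with unit variance centred at 0 to 6. The weights are hypothetical, and the sketch omits the truncation at the significance criterion and the estimation step; in z-curve the weights are estimated with the EM algorithm.

```r
means   <- 0:6
weights <- c(.05, .15, .30, .25, .15, .07, .03)  # hypothetical weights, sum to 1

# Density of an absolute z-score under the mixture of folded normal components
zcurve_density <- function(z) {
  sapply(z, function(zi) sum(weights * (dnorm(zi, means, 1) + dnorm(-zi, means, 1))))
}

curve(zcurve_density(x), from = 0, to = 6,
      xlab = "absolute z-score", ylab = "density")
```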

Selection for Significance

The extent of selection bias in a journal can be quantified by comparing the Observed Discovery Rate (ODR) of 70%, 95%CI = 70% to 71% with the Expected Discovery Rate (EDR) of 38%, 95%CI = 27%-54%. The ODR is notably higher than the upper limit of the confidence interval for the EDR, indicating statistically significant publication bias. It is noteworthy that the present results may underestimate the severity of the problem because the analysis is based on all statistical results. Selection bias is even more problematic for focal hypothesis tests and the ODR for focal tests in psychology journals is often higher than the ODR for all tests. Thus, the current results are a conservative estimate of bias for critical hypothesis tests.

Expected Replication Rate

The Expected Replication Rate (ERR) estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. Comparisons of the ERR with the outcomes of actual replication studies show that the ERR is higher than the actual replication rate (Schimmack, 2020). Several factors can explain this discrepancy, including the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favour studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published. We suggest using the EDR and ERR in combination to bracket the actual replication rate.

The ERR estimate of 73%, 95%CI = 69% to 77%, suggests that the majority of results should produce a statistically significant, p < .05, result again in exact replication studies. However, the EDR of 38% implies that there is considerable uncertainty about the actual replication rate for studies in this journal and that the success rate can be anywhere between 27% and 77%.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero or in the opposite direction). The high rate of replication failures, however, may simply reflect low power to produce significant results for true positives and does not tell us how many published results are false positives. We can provide some information about the false positive risk based on the EDR. Using Soric’s formula (1989), the EDR can be used to calculate the maximum false discovery rate.

The EDR of 38% for Acta Psychologica implies a False Discovery Risk (FDR) of 9%, 95%CI = 5% to 15%, but the 95%CI of the FDR allows for up to 15% false positive results. This estimate contradicts claims that most published results are false (Ioannidis, 2005), but is probably a bit higher than many readers of this journal would like.
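For readers who want to check the arithmetic, Soric's formula can be applied directly to the EDR point estimate; the confidence interval reported above comes from the z-curve bootstrap, so plugging in the CI endpoints only approximately reproduces it.

```r
# Soric's (1989) upper bound on the false discovery rate
soric_fdr <- function(edr, alpha = .05) {
  ((1 - edr) / edr) * (alpha / (1 - alpha))
}

soric_fdr(.38)           # about 0.086, i.e., roughly the 9% reported above
soric_fdr(c(.54, .27))   # roughly 0.04 and 0.14, close to the reported 5%-15% interval
```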

Time Trends

One advantage of automatically extracted test-statistics is that the large number of test statistics makes it possible to examine changes in publication practices over time. We were particularly interested in changes in response to awareness about the replication crisis in recent years.

Z-curve plots for every publication year were computed to examine time trends through regression analysis. Additionally, the degrees of freedom of F-tests and t-tests were used as a proxy for sample size to see whether sample sizes changed over time. Both linear and quadratic trends were considered. The quadratic term was included to test whether changes occurred in response to the replication crisis; that is, there may have been no changes from 2000 to 2015 but increases in the EDR and ERR after 2015.
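The regressions take roughly the following form; the data frame below is simulated for illustration and does not contain the annual estimates reported here.

```r
set.seed(1)
dat <- data.frame(year = 2000:2023)
dat$median_df <- 30 + 0.5 * (dat$year - 2000) + rnorm(24, sd = 4)  # simulated outcome

dat$year_c <- dat$year - mean(dat$year)                  # centre the year variable
fit <- lm(median_df ~ year_c + I(year_c^2), data = dat)  # linear + quadratic trend
summary(fit)  # linear term = average change per year; quadratic term tests for an
              # acceleration, e.g., after the replication crisis
```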

Degrees of Freedom

Figure 2 shows the median and mean degrees of freedom of F-tests and t-tests reported in Acta Psychologica. The mean is highly variable due to a few studies with extremely large sample sizes. Thus, we focus on the median to examine time trends. The median degrees of freedom across years was 38, ranging from 22 to 74. Regression analyses of the median showed a significant linear increase of about 1.4 degrees of freedom per year, b = 1.39, SE = 3.00, p < 0.0001. Furthermore, the significant non-linear trend, b = 0.09, SE = 0.03, p = 0.007, suggests that sample sizes increased in response to the replication crisis.

Figure 2

Observed and Expected Discovery Rates

Figure 3 shows the changes in the ODR and EDR estimates over time. The ODR estimate showed a significant linear decrease of about b = -0.42 (SE = 0.10, p = 0.001) percentage points per year. There was no significant non-linear trend in the ODR estimate, b = -0.10 (SE = 0.02, p = 0.563). The regression results for the EDR estimate showed no significant trends, linear, b = 0.04, SE = 0.37, p = 0.903, or non-linear, b = 0.01, SE = 0.06, p = 0.906.

These findings indicate the journal has increased the publication of non-significant results. However, there is no evidence that this change occurred in response to the replicability crisis. Even with this change, the ODR and EDR estimates do not overlap, indicating that selection bias is still present. Furthermore, the lack of changes to the EDR suggests that many studies continue to be statistically underpowered to detect true effects.

Figure 3

Expected Replicability Rates and False Discovery Risks

Figure 4 depicts the false discovery risk (FDR) and the Expected Replication Rate (ERR). It also shows the Expected Replication Failure Rate (EFR = 1 – ERR). A comparison of the EFR with the FDR aids the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in the original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

There were no significant linear, b = 0.13, SE = 0.10, p = 0.204, or non-linear, b = 0.01, SE = 0.16, p = 0.392, trends in the ERR estimate. This finding is inconsistent with the observed increase in sample sizes, because reductions in sampling error usually increase the likelihood that an effect will replicate. One possible explanation is that the type of studies has changed. If a journal publishes more studies from disciplines with large samples and small effect sizes, sample sizes go up without increasing power.

Given the lack of change in the EDR and ERR estimate over time, many published significant results are based on underpowered studies that are difficult to replicate.

Figure 4

Visual inspection of Figure 4 shows the EFR consistently around 30% and the FDR around 10%, suggesting that up to about a third of replication failures could be due to false positive results.

Adjusting Alpha

A simple solution to a crisis of confidence in published results is to adjust the criterion to reject the null-hypothesis. For example, some researchers have proposed to set alpha to .005 to avoid too many false positive results. With z-curve, we can calibrate alpha to keep the false discovery risk at an acceptable level without discarding too many true positive results. To do so, we set alpha to .05, .01, .005, and .001 and examined the false discovery risk.
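Conceptually, this calibration applies Soric's bound at stricter alpha levels. The sketch below holds the discovery rate fixed for simplicity; the actual z-curve calibration re-estimates the discovery rate at each alpha level.

```r
alphas <- c(.05, .01, .005, .001)
edr    <- .38                                              # point estimate from above
sapply(alphas, function(a) ((1 - edr) / edr) * (a / (1 - a)))
# roughly 0.086, 0.016, 0.008, 0.002: lowering alpha quickly pushes the
# maximum false discovery risk below 5%
```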

Figure 5

Figure 5 shows that the conventional criterion of p < .05 produces false discovery risks above 5%. The high variability in annual estimates also makes it difficult to provide precise estimates of the FDR. However, adjusting alpha to .01 is sufficient to produce an FDR with tight confidence intervals below 5%. The benefits of reducing alpha further to .005 or .001 are minimal.

Figure 6

Figure 6 shows the impact of lowering the significance criterion, alpha, on the discovery rate (a lower alpha implies fewer significant results). In Acta Psychologica, lowering alpha to .01 reduces the observed discovery rate by about 20 percentage points. This implies that about 20% of results reported p-values between .05 and .01. Such results often have low success rates in actual replication studies (OSC, 2015). Thus, our recommendation is to set alpha to .01 to reduce the false positive risk to 5% and to disregard studies with weak evidence against the null hypothesis. These studies require actual successful replications with larger samples to provide credible evidence for a hypothesis. There are relatively few studies with p-values between .01 and .005. Thus, more conservative researchers can use alpha = .005 without losing many additional results.

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported but do not test focal hypotheses (e.g., testing the statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

Hand-coding of 81 studies in 2010 and 112 studies from 2020 showed ODRs of 98%, 95%CI = 94%-100% and 91%, 95%CI = 86%-96%, suggesting a slight increase in reporting of non-significant focal tests. However, ODRs over 90% suggest that publication bias is still present in this journal. ERR estimates were similar and the small sample size made it impossible to obtain reliable estimates of the EDR and FDR.

One concern about the publication of our results is that it merely creates a new criterion to game publications. Rather than trying to get p-values below .05, researchers may use tricks to get p-values below .01. However, this argument ignores that it becomes increasingly harder to produce lower p-values with tricks (Simmons et al., 2011). Moreover, z-curve analysis makes it easy to see selection bias for different levels of significance. Thus, a more plausible response to these results is that researchers will increase sample sizes or use other methods to reduce sampling error to increase power.

Conclusion

The replicability report for Acta Psychologica shows clear evidence of selection bias, although there is a trend toward less selection bias due to the reporting of more non-significant results, though not necessarily focal ones. The power of published studies to produce a significant result (i.e., a discovery) ranges from 38% (EDR) to 73% (ERR). This finding suggests that many studies are underpowered and require luck to obtain a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Replication attempts of original findings with p-values above .01 should increase sample sizes to produce more conclusive evidence.

There are several ways in which the current or future editors of this journal could improve the credibility of the results it publishes. First, results with weak evidence (p-values between .05 and .01) should be reported only as suggestive findings that require replication, or editors could request a replication before publication. Second, editors should try to reduce publication bias by prioritizing research questions over results. A well-conducted study of an important question should be published even if the results are not statistically significant. Pre-registration and registered reports can help to reduce publication bias. Editors may also ask for higher-powered follow-up studies when a key result is not significant.

Publication bias also implies that point estimates of effect sizes are inflated. It is therefore important to take uncertainty in these estimates into account. Small samples with large sampling errors are usually unable to provide meaningful information about effect sizes and conclusions should be limited to the direction of an effect.

We hope that these results provide readers with useful information to evaluate the credibility of results reported in this journal. The results also provide a benchmark to evaluate the influence of reforms on the credibility of psychological science. We hope that reform initiatives will increase power and decrease publication bias and false positive risks.

Nuisances in personality and wellbeing research

Abstract

A recent article suggested that personality explains most, ~80%, of the variance in life-satisfaction. This estimate is much higher than previous studies and meta-analyses suggested. I examine three explanations for this surprising finding. First, the use of multiple raters removed measurement error and boosted effect size estimates. Second, the use of single items as predictors added incremental predictive validity. Third, the use of ad-hoc life-satisfaction items administered within the block of personality items altered the interpretation of the life-satisfaction items. My analysis suggests that nuances added only 2% of incremental variance, the multi-rater approach added about 15%, and the survey methodology added about 30% of additional variance. These results suggest that the 80% estimate is misleading and that personality explains about half of the variance in life-satisfaction. This estimate is more in line with behavioral-genetic studies of heritability and longitudinal studies of the stability of life-satisfaction. The key personality predictors are dispositions to be in a cheerful mood and not to be in a depressive mood. This finding is consistent with hedonistic theories of wellbeing and previous evidence that people’s life evaluations are related to their affective experiences.

Introduction

The empirical study of personality and wellbeing is nearly 100 years old. Hartmann (1936) showed that Neuroticism predicts self-ratings and peer-ratings of happiness. This finding has been replicated consistently in studies with self-ratings or multi-rater designs (Anglim et al., 2020; Schimmack & Kim, 2022; Schimmack, Oishi, Furr, & Funder, 2004). Forty years ago, Costa and McCrae showed that Extraversion is an additional predictor of life-satisfaction. While their own multi-rater study failed to replicate this finding (McCrae & Costa, 1991), other studies have confirmed this relationship with self-ratings and informant ratings (Anglim et al., 2020; Schimmack & Kim, 2022; Schimmack et al., 2004). Extraversion and Neuroticism are assumed to be the strongest predictors of life-satisfaction because they have main effects on people’s affective experiences. Neuroticism predicts higher levels of Negative Affect (NA) and Extraversion predicts higher levels of Positive Affect (PA). In turn, PA and NA predict life-satisfaction and often fully mediate the relationship between these personality traits and life-satisfaction (Schimmack, Diener, & Oishi, 2002; Schimmack & Kim, 2022; Schimmack, Schupp, & Wagner, 2008).

Thirty years ago, McCrae and Costa (1991) found that Agreeableness and Conscientiousness were additional personality predictors of life-satisfaction. Evidence for these additional predictors is more mixed. While meta-analyses confirm this finding (Anglim et al., 2020), individual studies sometimes fail to find effects for Agreeableness, Conscientiousness, or both (Kim, Schimmack, Oishi, & Tsutsui, 2018; Schimmack, Schupp, & Wagner, 2008).

While the pattern of relationships is fairly consistent across studies, it has been more difficult to estimate the effect sizes of these relationships. One problem is that effect sizes in studies that use self-ratings are inflated by shared method bias. The main problem is desirability bias. Happiness and life-satisfaction are desirable, and items used to measure personality also vary in desirability (e.g., lazy, rude, creative, sociable). A few studies have shown that controlling for this bias reduces effect size estimates (e.g., Kim et al., 20##). However, single-method correlations are also attenuated by random measurement error and systematic rater biases. To control for these biases, it is necessary to model multi-rater data with a measurement model that corrects for random and systematic measurement error.

Schimmack and Kim (2020) published the first and, to my knowledge, only study [let me know about other studies!] that corrects for random and systematic measurement error in the relationship between personality and life-satisfaction using a latent variable model of multiple raters. The study used a family round-robin design in which students and their biological parents rated each other on the Big Five personality traits and life-satisfaction. The results are reproduced here in Table 1.

The results are fairly consistent with the meta-analytic results of single-method studies. The reason is that the inflation due to desirability bias and the attenuation due to unique rater biases are of similar magnitude and cancel each other out. In this model, the Big Five are independent, so the explained variances can be summed to compute the total amount of explained variance, R2 = .33, and the multiple correlation, R = .57. The results suggest that the Big Five explain about one-third of the variance in life-satisfaction.

Although many studies rely on the Big Five model to measure personality, most personality researchers agree that the Big Five factors do not capture all personality differences. A prominent model of personality recognizes 30 more specific personality traits called facets (Costa & McCrae, 1992). Some studies have related life-satisfaction to these specific traits. A consistent finding is that most facets do not add to the prediction of life-satisfaction. This means that people can have high life-satisfaction with different personality traits. For example, being orderly (a facet of Conscientiousness) or altruistic (a facet of Agreeableness) does not predict life-satisfaction, suggesting that there are different ways to pursue happiness. The main predictors of life-satisfaction are the Depression facet of Neuroticism and the Cheerfulness facet of Extraversion (Anglim et al., 2020; Røysamb et al., 2018; Schimmack et al., 2004). These facets are stronger predictors of life-satisfaction than their corresponding Big Five factors, Neuroticism and Extraversion (Anglim et al., 2020; Schimmack et al., 2004). Røysamb et al. (2018) reported an increase in explained variance from 24% for the Big Five to 32% for facets. Anglim et al. (2020) reported simple correlations of r = -.48 for Depression and r = .34 for Cheerfulness. Assuming independence, these correlations imply a multiple correlation of R = .59 and 35% explained variance (see the calculation sketched below). Anglim and Grant (2016) found that explained variance increased from 40% to 52%. Thus, mono-method studies suggest that about one-third to one-half of the variance in life-satisfaction is explained by personality traits, mainly the dispositions to experience cheerfulness and few periods of depressive affect. The use of a multi-method approach to control for measurement error might lead to even higher estimates.
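The calculation referenced above is straightforward in R: under the independence assumption, R-squared is the sum of the squared simple correlations.

```r
r_depression   <- -.48   # simple correlation reported by Anglim et al. (2020)
r_cheerfulness <-  .34

R2 <- r_depression^2 + r_cheerfulness^2   # 0.346, about 35% explained variance
R  <- sqrt(R2)                            # about .59
```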

Nuisances

Mõttus et al. (2019) introduced the notion of personality-item nuances. Nuances represent valid unique variance in personality items that is not explained by the Big Five factors or facets. This variance could predict additional variance in life-satisfaction beyond the Depression and Cheerfulness facets. This hypothesis can be tested by showing that nuances have incremental predictive validity over facets, just as previous studies showed that facets have incremental predictive validity over the Big Five traits (Schimmack et al., 2004).

Mõttus, Realo, Allik, Ausmees, and Henry (2004) reported three multi-rater studies of personality and life-satisfaction. Here, I focus on the English sample, which had open data. Results for the other two datasets will be reported after I have access to them. Mõttus et al.’s (2004) key result was that a few personality items explained about 80% of the variance in life-satisfaction ratings after correcting for random and systematic measurement error. This is a surprising finding given the preceding literature review, which suggested that even comprehensive facet measures predict only about one-third of the variance in life-satisfaction judgments. There are several explanations for the different results. First, it is possible that the personality items in Mõttus et al.’s (2004) item pool have incremental predictive validity over personality facets, especially Depressiveness and Cheerfulness. Unfortunately, the authors did not examine how much variance facets explain in their datasets. Second, it is possible that the use of a multi-method approach dramatically increases the amount of explained variance. Third, it is possible that Mõttus et al.’s (2004) assessment of life-satisfaction produced different results. The main difference from previous studies is that Mõttus et al. (2004) did not use a standard measure of life-satisfaction and that their life-satisfaction items were included in the block of personality items. It is possible that participants interpreted these items as personality items rather than as questions about life-satisfaction. I conducted the following analyses to examine why Mõttus et al. (2004) reported much stronger relationships between personality and life-satisfaction than previous studies.

Results

Single Rater Results

Mõttus et al.’s item pool included enough items to model the facet structure of the Big Five, with the exception of a couple of facets (O6: Progressive Values, C1: Self-Efficacy/Confidence). In a single-rater model (OSF link: https://osf.io/cev49/), I replicated the finding that self-ratings of life-satisfaction were predicted by Depressiveness, b = -.64, SE = .04, and Cheerfulness, b = .31, SE = .05, as the only predictors. The parallel mono-method results for informant ratings were similar, Depressiveness, b = -.64, SE = .04, Cheerfulness, b = .31, SE = .04. The amount of explained variance in life-satisfaction was 67%, R = .82, for self-ratings and 60%, R = .77, for informant ratings. These effect sizes can be compared to the multiple correlation of R = .90 in Mõttus et al.’s regression model with single-item predictors. The difference is due to the correction for measurement error, the incremental validity of nuances, or both. To answer this question, I fitted models that used self-ratings and informant ratings as indicators of life-satisfaction and personality. To simplify these analyses, I used only the Depression and Cheerfulness facets, because the Big Five and the other facets did not predict life-satisfaction.

Life-Satisfaction Measurement Model

Mõttus et al. did not use a standard life-satisfaction item or scale. Instead, they used four items from the personality item pool as indicators of life-satisfaction, namely (a) “Am happy with my life”, (b) “Feel that my life lacks direction”, (c) “Have a dark outlook on the future”, and (d) “Have been richly blessed in my life.” A multi-rater measurement model had good fit, chi2(5) = 20.02, CFI = .990, RMSEA = .071. One modification index accounted for most of the misfit, chi2(1) = 18.80. The model showed low loadings (&lt; .4) of self-ratings and informant ratings on the item “Have been richly blessed in my life”, which was originally intended to measure gratitude. Dropping this item improved the fit of the measurement model, chi2(1) = 0.13, CFI = 1.00, RMSEA = .000.
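A simplified lavaan sketch of such a multi-rater measurement model (after dropping the gratitude item) is shown below. The variable names and the data frame are hypothetical, and the model reported above includes additional specifications; the actual code is available at the OSF link.

```r
library(lavaan)

model <- '
  # life-satisfaction trait factor with self- and informant-rated indicators
  ls   =~ happy_self + direction_self + outlook_self +
          happy_inf  + direction_inf  + outlook_inf

  # rater-specific method factors to absorb systematic rater bias
  self =~ happy_self + direction_self + outlook_self
  inf  =~ happy_inf  + direction_inf  + outlook_inf
'

# dat: hypothetical data frame with one column per rated item;
# orthogonal = TRUE keeps the trait and method factors independent
fit <- cfa(model, data = dat, std.lv = TRUE, orthogonal = TRUE)
fitMeasures(fit, c("chisq", "df", "cfi", "rmsea"))
```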

Life-Satisfaction with Depression and Cheerfulness Facets

The next model added the Depressiveness and Cheerfulness items. The model had good fit to the data, chi2(41) = 64.31, CFI = .995, RMSEA = .031. The main finding was that life-satisfaction was predicted by Depression, b = -.69, SE = .04, and Cheerfulness, b = .35, SE = .05. The effect sizes were stronger than in the mono-method models, and the amount of explained variance increased to 80%, R = .91. Importantly, this effect size is as high as the one reported for Mõttus et al.’s set of single-item predictors.

Life-Satisfaction with Mõttus et al.’s Strongest Item Predictors

In their Figure 1, Mõttus et al. showed the items with the strongest correlations with life-satisfaction. I created a multi-rater latent variable model that used the top 5 positive and top 5 negative items as predictors of life-satisfaction. The multiple regression model showed problems with multicollinearity. I therefore opted to enter variables based on their simple correlations. The variables with the strongest simple correlations were “Often feel that others misunderstand me”, r = -.75, and “Find that nothing excites me”, r = -.84. A regression model with both items as predictors showed that both contributed to the prediction of life-satisfaction, b = -.51 for each item, and accounted for 82% of the variance in life-satisfaction, R = .90. This is equivalent to the effect size for the Depression and Cheerfulness facets.

Incremental Predictive Validity

The final model combined the two previous models. The most important question is how the two items from Mõttus et al.’s item pool relate to the personality constructs in the hierarchical model of personality. Item content suggests that they are related to Neuroticism and Extraversion and may also share variance with the Depression and Cheerfulness facets. Fit of the model was good, chi2(70) = 87.53, CFI = .997, RMSEA = .020. Variance in the item “Find that nothing excites me” was strongly related to Cheerfulness, b = .72, SE = .08, and also related to Depression, b = -.40, SE = .08. The residual variance was only 8%, SE = .11, and not statistically significantly different from zero, 95%CI = .00 to .29. Thus, there is no evidence of nuance variance in this item. The item “Often feel that others misunderstand me” was strongly related to Depression, b = .55, and negatively related to Cheerfulness, b = -.30, SE = .08, but 46% of the variance was not predicted by these two facets, 95%CI = .30 to .62. This item also added to the prediction of life-satisfaction, b = -.21, SE = .10, but the evidence is weak, p = .03, and requires replication. The amount of explained variance increased from 81.2% to 83.3%, R = .90 vs. .91.

Discussion

Decades of personality research on wellbeing have established Depressiveness and Cheerfulness as the main predictors of life-satisfaction, but effect size estimates in mono-method studies are biased and may have underestimated the true effect size. Mõttus et al.’s UK data made it possible to obtain effect size estimates using a multi-rater latent variable model. The results replicated Depressiveness and Cheerfulness as the main predictors of life-satisfaction. In addition, the latent variable model suggested that 80% of the variance in life-satisfaction is predicted by these two personality facets after controlling for random and systematic measurement error. Slightly more variance might be explained by adding the nuance variance of the item “often feel misunderstood” to the prediction model. These new results provide important context for Mõttus et al.’s discussion of their results.

Mõttus et al. were surprised by their finding that a few personality items can predict most of the variance in life-satisfaction. Their preferred interpretation is that measurement error attenuated correlations in previous studies and that the use of single items adds incremental predictive validity because items contain valuable information beyond their relationship to the Big Five. The present results provide little support for these explanations. Even a mono-method analysis of their data showed that facets explained two-thirds of the variance in life-satisfaction ratings, about double the amount of variance in previous studies. Correcting for measurement error boosted the effect size by another 10-20 percentage points, but the relationship between personality and life-satisfaction remains much stronger in this dataset than in previous studies.

Nuances also did not explain the higher amount of explained variance. The nuance variance of only one item was a significant predictor, and it added only 2% of explained variance. Most of the 80% explained variance could be predicted by the facet predictors Depression and Cheerfulness. Thus, the dataset provides further evidence that these two facets are the main personality predictors of life-satisfaction. A plausible explanation for this finding is that experiences of positive versus negative affect are an important aspect of life-satisfaction (Schimmack et al., 2002) and that biological dispositions influence how people feel. These feelings color life evaluations with the proverbial rose-colored or cloudy glasses.

The main reason for Mõttus et al.’s large amount of explained variance appears to be unique to this dataset. The most salient difference between this study and previous studies is the way life-satisfaction was measured. Whereas previous studies asked about life-satisfaction in a separate block with validated items, Mõttus et al. used ad-hoc items that were mixed with the personality items in a single block. It is possible that participants interpret items differently when they are presented in different contexts. Future research needs to examine in more detail the cognitive processes underlying personality and life-satisfaction ratings. Valid life-satisfaction ratings are expected to be more strongly related to evaluations of specific life domains because life as a whole is made up of these domains (Payne & Schimmack, 2020).

The main limitation of this study is that life-satisfaction was measured with an unusual set of items that were embedded in the personality questionnaire. This raises the concern that responses to these items were based on the same type of information that was used to make personality ratings. It is therefore necessary to replicate the results in a study in which life-satisfaction judgments are made independently of, and ideally before, the personality questions. Another limitation was the use of a single informant. This makes it difficult to model self-informant agreement because it is unclear whether self-ratings or informant ratings are more valid. To model the data, it is necessary to assume that self-ratings and informant ratings are equally valid. While this is a reasonable assumption (Schimmack & Kim, 2022), the results need to be replicated with multiple informants, which requires fewer assumptions to model measurement error.

Finally, it is noteworthy that self-informant agreement in this dataset is above the usual levels observed for personality traits (Connelly et al., ###) and life-satisfaction (Schneider & Schimmack, 2009; Zou et al., 2013). Typically, self-informant agreement for life-satisfaction is between .3 and .4, even after correcting for unreliability. In Mõttus et al.’s UK data, the correlation is r = .74. This suggests that the ratings were not made independently or that some reports were filled out by the same person.

In conclusion, Mõttus et al. aimed to estimate the relationship between life-satisfaction and personality while controlling for random and systematic measurement error. They found that personality explained about 80% of the variance in life-satisfaction ratings and attributed this strong relationship to a few personality items. I showed that most of this relationship is accounted for by the Depression facet of Neuroticism and the Cheerfulness facet of Extraversion. Only the nuance variance of one item might add another 2% to the prediction. The finding that Depression and Cheerfulness are the strongest personality predictors of life-satisfaction is consistent with previous studies and highlights the importance of affective experiences in life evaluations (Costa & McCrae, 1980; Schimmack et al., 2002; Schimmack & Kim, 2020). I also found that the mono-method relationships between life-satisfaction and personality in this dataset were much stronger than in previous studies. The most likely reason for this difference is that the life-satisfaction items were administered together with the personality items, which may have altered the way participants interpreted these items. Future research needs to manipulate item context to test this hypothesis.

The results also provide further evidence that many personality traits are not related to life-satisfaction. This means that there are different ways to pursue happiness, and people with different personalities can achieve high wellbeing by creating lives that match their personalities. Humanistic psychologists have called this self-actualization. The main obstacle to high wellbeing appears to be a melancholic personality disposition. Interestingly, not all negative affects are detrimental, and more research is needed on the distinctions among them.

References

Anglim, J., Horwood, S., Smillie, L. D., Marrero, R. J., & Wood, J. K. (2020). Predicting psychological and subjective well-being from personality: A meta-analysis. Psychological Bulletin, 146, 279–323. https://doi.org/10.1037/bul0000226

Costa, P. T., Jr., & McCrae, R. R. (1980). Influence of Extraversion and Neuroticism on subjective wellbeing: Happy and unhappy people. Journal of Personality and Social Psychology, 38, 668-678.

Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Psychological Assessment Resources.

McCrae, R. R., & Costa, P. T. (1991). Adding Liebe und Arbeit: The Full Five-Factor Model and Well-Being. Personality and Social Psychology Bulletin, 17, 227–232. https://doi.org/10.1177/014616729101700217

Payne, J. W., & Schimmack, U. (2020). Construct validity of global life-satisfaction judgments: A look into the black box of self–informant agreement. Journal of Research in Personality, 89, 104041. https://doi.org/10.1016/j.jrp.2020.104041

Røysamb, E., Nes, R. B., Czajkowski, N. O., & Vassend, O. (2018). Genetics, personality and wellbeing: A twin study of traits, facets and life satisfaction. Scientific Reports, 8(1), 12298. https://doi.org/10.1038/s41598-018-29881-x

Schimmack, U. (2010). What multi-method data tell us about construct validity. European Journal of Personality, 24, 241–257. https://doi.org/10.1002/per.771

Schimmack, U., & Kim, H. (2020). An integrated model of social psychological and personality psychological perspectives on personality and wellbeing. Journal of Research in Personality, 84, Article 103888. https://doi.org/10.1016/j.jrp.2019.103888

Schimmack, U., Oishi, S., Furr, R. M., & Funder, D. C. (2004). Personality and life satisfaction: A facet-level analysis. Personality and Social Psychology Bulletin, 30, 1062–1075. https://doi.org/10.1177/0146167204264292

Schimmack, U., Schupp, J., & Wagner, G. G. (2008). The influence of environment and personality on the affective and cognitive component of subjective well-being. Social Indicators Research, 89(1), 41–60. https://doi.org/10.1007/s11205-007-9230-3

Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363. https://doi.org/10.1007/s11205-009-9440-y