All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Citation Counts in Psychological Science After the Bubble Burst

For years, psychologists interested in their ‘impact’ could check their citation counts in Web of Science and see an increase. The main reason for this increase was the creation of more journals, including online journals without page limitations. The positive illusion of increasing importance came to a halt in 2021.

Figure 1. Citation count for the journal Psychological Science

The reason is probably that WebOfScience became more selective and delisted some low quality and possibly predatory journals. The end of exponential increases in citations is an interesting phenomenon in itself, but it is not the key focus of this blog post. The main question is whether the popping of the citation bubble also reflects concerns about the credibility of published articles in the wake of the replication crisis.

A useful analogy is a stock market bubble, like the valuation of new internet companies in the late 1990s. A bubble is created because investors buy stocks independent of the true value of a stock. Famous value-investor Warren Buffett quipped “Only when the tide goes out do you discover who’s been swimming naked” (not that there is anything wrong with it).

After the crash of the dot-com bubble, companies like Pets.com lost all of their value. In contrast, (for better or worse) companies like Amazon.com became very valuable because they actually generated revenues and profits.

This blog post examines whether the citation crash in 2021 affected all researchers equally or whether it discriminated between credible researchers who advanced science and researchers who produced articles with incredible results that are difficult to replicate and do not advance psychological science. The list of researchers is not meant to be representative in any way, but is based on their significance during the replication crisis. My personal focus is on personality and social psychology. Readers interested in other research areas with access to WebOfScience can compare results for researchers they are interested in on their own.

Citations of Journals

To create a benchmark, I picked a small set of journals that represent social and personality research. For each journal, I recorded the peak (highest number of citations in a year), and the citations in 2023 and 2024. 2024 citations will still increase a bit over the next couple of months. So, a difference between 2023 and 2024 cannot yet be interpreted as a further decrease. (I may update the numbers in a couple of months.)

Note. JESP = Journal of Experimental Social Psychology, EJSP = European Journal of Social Psychology, EJP = European Journal of Personality, JRP = Journal of Research in Personality, PsySci = Psychological Science

The results are fairly consistent across journals. So far, the citation rates in 2024 are 90% or more of those in 2023. Thus, there is no evidence that citations are decreasing notably. The comparison of citations in 2024 with the peak shows that citations are now 20% to 30% below the peak (bear market territory in stock market language). The most notable drop is observed for Social Cognition, which may reflect the loss of trust or interest in social cognition research, which was hit hard by the replication crisis. The comparison of peak and citations in 2023 shows a similar pattern with a higher average because 2024 citations are not yet complete. Again, Social Cognition has the biggest drop. Psychological Science, which also published a lot of questionable research, shows the second biggest drop, but there is no notable difference between social and personality journals in general. The averages provide a benchmark to examine trends for specific researchers. A loss in trust (or value in stock market language) would be revealed by bigger drops from the peak to recent years than the average drop for personality and social psychology journals.
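The three benchmark ratios used throughout this post (2023/Peak, 2024/Peak, and 2024/2023) are simple to compute from a yearly citation series. Here is a minimal sketch in R. The yearly counts are made-up numbers for illustration only, not values from WebOfScience, and the function name `benchmark_ratios` is my own.

```r
# Hypothetical yearly citation counts for one journal
# (made-up numbers for illustration, NOT taken from WebOfScience).
citations <- c("2019" = 900, "2020" = 1000, "2021" = 1200,
               "2022" = 1100, "2023" = 1000, "2024" = 924)

# Peak year and the three ratios used as benchmarks in this post.
benchmark_ratios <- function(counts) {
  peak <- max(counts)
  c(
    peak_year     = as.numeric(names(counts)[which.max(counts)]),
    r23_over_peak = unname(counts["2023"] / peak),
    r24_over_peak = unname(counts["2024"] / peak),
    r24_over_23   = unname(counts["2024"] / counts["2023"])
  )
}

res <- benchmark_ratios(citations)
round(res, 2)
```

With these invented counts, the 2024/Peak ratio is 0.77, which happens to match the journal average used as the benchmark below. The same function can be applied to the yearly counts of a single researcher.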

Citation Counts of Prominent Social and Personality Psychologists

To provide context, I am presenting the results in order of the 24/Peak values that reflect the biggest drop in confidence.

1. Diederik Stapel

Diederik Stapel contributed to the loss of confidence in (social) psychology in 2011, when it emerged that he had faked data for numerous articles. After an investigation, 58 of his articles were retracted. It is remarkable that Stapel’s peak citation count is a meager 367, a number that pales in comparison to some of the citation counts below.

Citations have decreased to 27% of the peak in 2024 and 32% in 2023. The comparison of 24 and 23 suggests that citations are still decreasing (85% vs. 94% for the journal average). Evidently, awareness of fraud and retractions has produced a strong signal to neglect Stapel’s articles. However, this decrease is unrelated to the bursting of the bubble in 2021. The correction appeared rather quickly after the discovery of fraud in 2011.

2. John A Bargh

John A Bargh is best known for his articles on priming without awareness, like his famous claim that reading a few words related to the elderly makes young students walk slower. This work was featured in the popular book of Nobel Laureate Daniel Kahneman along with other social priming studies. After a failed replication study, Kahneman wrote a letter to Bargh expressing his concerns about the replicability of social priming results. In the following years, many of these results could not be replicated, confirming the suspicion that these results were obtained with questionable research practices, including hiding of studies that did not show the predicted results (cf. Schimmack et al., 2017).

Citations for Bargh are now only 49% of the peak citations in 2014. Thus, it does not take fraud and retractions to correct citation bubbles based on questionable research. Bargh did not make up data like Stapel. Rather, he used unscientific ways to analyze data that increase the chances of obtaining a statistically significant and publishable result, or simply conducted many studies and reported only those that ‘worked.’

The bubble burst in 2014. While there was a slight uptick in citations in 2019, citations decreased notably in the following years. With over 800 citations a year, the absolute number of citations is still high for psychology and suggests that many researchers still cite this work without awareness that social priming results are questionable.

On a positive note, the decrease in citations is notably bigger than the average decrease from peak to 2024, 49% vs. 77%.

3. Fritz Strack

Fritz Strack is another social cognition researcher with a notable presence during the replication crisis in social psychology. For example, he wrote a meta-scientific article in defense of social psychology in the face of replicability problems (Stroebe & Strack, 2014). Moreover, he is the author of an article that used a deceptive manipulation of facial muscles to provide evidence for the facial feedback theory of emotions. In 2019, he was awarded the Ig Nobel Prize for this work. This study failed to replicate in a large replication attempt, but he attributed this to problems with the replication study. However, another replication attempt in a project that he was actively involved in also failed. Another line of work suggested that item-order influences the correlation between life-satisfaction and domain satisfaction (e.g., marital satisfaction) judgments. This result was featured in Daniel Kahneman’s book and also failed to replicate. It is now generally accepted that life-satisfaction judgments are rather stable and based on chronically accessible information about important life domains. He is also a co-author on an article about an experimental manipulation of ease of retrieval that also failed to replicate (Schwarz et al., 1991).

The citation count peaked in 2021 and then decreased notably.

Importantly, the decrease is more pronounced than the average decrease of articles in social and personality psychology.

Given the recency of the peak, it is possible that citations continue to fall faster than the average in the coming years.

4. Dan Ariely

Dan Ariely’s research practices were scrutinized after it was discovered that data in one study of an article with him as author were manipulated. The data were provided by him, but he maintains that the manipulation was carried out by somebody else. Dan Ariely also co-authored articles with another researcher, Francesca Gino, who is under investigation for data manipulation.

Ariely may not care much about citation counts, as he is one of those psychologists who have broken out of academia into pop culture. His popular book “Predictably Irrational” about his life and research was turned into a TV show that premiered in 2023 on NBC.

Citations peaked in 2021, when the citation bubble burst. Since then, citations have decreased somewhat faster than the average (62% vs. 77%).

Given the recency of the peak and the ongoing investigation of articles co-authored with Gino, it is possible that citation counts will decrease more than the average article in the coming years.

5. Anthony Greenwald*

* tied with Dana Carney, who was involved in power-posing research, but distanced herself from this work. Her other work has not been a topic in the replication crisis and her overall citation count is low compared to the famous authors listed here.

Anthony Greenwald is famous for the invention of the Implicit Association Test (IAT). Millions of people have used IATs to obtain feedback about their unconscious/implicit/automatic associations. The most famous IAT is the race IAT that has been used to claim that the vast majority are more prejudiced against African Americans than they would admit in self-ratings. The test has also been used to suggest that African Americans often hold more favorable attitudes of White people than Black people even if they would state the opposite. A famous example is Malcolm Gladwell.

Numerous authors, including myself, have criticized the interpretation of IAT scores and pointed out that they lack evidence of validity (they may not measure racial attitudes) and are poor predictors of actual behavior. For example, the race IAT failed to predict racial bias in the 2008 US presidential election (Obama vs. McCain). That is, there was racial bias and self-report measures showed it, but IAT scores did not. Even proponents of implicit biases are now critical of the use of the IAT to study these biases. Even Greenwald has walked back unsupported claims in his popular book “Blindspot” and suggests IAT scores should simply be interpreted as scores produced by a method that does not rely on self-reports without specific claims about the meaning of these scores (i.e., it is a method without a clear construct).


Citations peaked in 2021, when the citation bubble burst. The decline in citations is somewhat steeper than the journal average (66% vs. 77%), but the number of citations in 2024 is still likely to break the 2,000 mark. The hype about the IAT is far from over and Project Implicit continues to hide the fact that the test is often invalid and provides false feedback.

6. Francesca Gino

The discovery of data anomalies (lawyer speak for funky data) in an article by Francesca Gino was big news on social media in 2023 (Datacolada, 2023). The accusation of fraud led to an investigation by Gino’s employer (Harvard) and a lawsuit by Gino against Harvard and the Datacolada team. The lawsuit against the investigators was recently dropped (Science, 2024) and the four flagged articles have been retracted. The investigation and lawsuit have drawn a lot of attention to Gino’s work. Collaborators have tried to separate credible studies from studies tainted by data from Gino’s lab, but this effort has not resulted in a clear separation of articles with and without tainted data. This may have affected citation of articles co-authored by Gino.

Note. Only articles from 2007 or later are used because WebOfScience only uses full first names in those years and using the initial would retrieve too many articles by other authors.

Citation counts peaked in 2021. The decrease from 2021 to 2024 is steeper than the average decrease for social psychology journals (67% vs. 77%). If more articles are retracted, the decrease is likely to accelerate in the coming years.

7. Tom Pyszczynski

It was easier to search for citations for Tom Pyszczynski than for his collaborators Jeff Greenberg and Sheldon Solomon, who invented Terror Management Theory (TMT), because his name is less common. Terror management theory postulates that reminders of death alter people’s priorities and behaviors. Replicability varies depending on the methods that were used to study TMT. Many influential studies used priming without awareness. These results are difficult to replicate, as are other priming studies (see Bargh above). TMT studies were highly influential in the early 2000s, but the topic attracted less attention recently, although TMT studies were not a major focus of discussion during the replication crisis.

Pyszczynski’s citation count peaked in 2021 when the citation bubble burst.

Citations decreased a little bit more than the journal average. However, with nearly 800 citations in 2024, TMT is likely to outlive its creators.

8. Kathleen D. Vohs

Kathleen Vohs has two lines of research related to the replication crisis. Her work on money priming is just a variation of priming without awareness studies with money as primes. These studies lack credibility and there is good evidence that published results were obtained with questionable research practices (Schimmack, 2024). The more important work was conducted in collaboration with Roy F. Baumeister on ego-depletion. Ego-depletion research has become another poster child of replication failures. After a large replication project failed to show the effect, Vohs and colleagues conducted another large replication project to demonstrate that the first attempt was flawed. However, even she could not replicate the finding with over 2,000 participants. Meta-analyses of published studies also find clear evidence of bias and little evidence of a real effect after controlling for bias. The general consensus among meta-psychologists is that the large ego-depletion literature has produced little insight into willpower and self-regulation failures (e.g., publishing only significant results although this violates scientific principles). Only a few people still insist that ego-depletion effects are real. It is not clear where Vohs stands on this issue.

Vohs’ citation count peaked in 2021, the year the citation bubble burst. The decrease is a bit steeper than the journal average, down by 29 percentage points from the peak (71% vs. 77%). However, with over 1,700 citations in 2024, there is still a lot of room for a correction that reflects the value of her money-priming and ego-depletion articles.

9. Dan Gilbert

Dan Gilbert made a contribution to the replication debate with some snarky online comments (e.g., calling replication researchers “shameless little bullies”) and an attempt to discredit the embarrassing finding of the reproducibility project that only 25% of social psychology studies could be replicated. The key argument was that replication studies were underpowered, but this argument does not explain how similarly powered studies could have a success rate of over 90% in the published articles.

His own research is often co-authored by Timothy Wilson, who is equally known for disparaging comments about replication researchers (Bartlett, 2014). It was easier to look up citations for Gilbert than for Wilson. Their work covers many topics, but they are probably best known for their studies on affective forecasting. The idea is that we overestimate the effects of negative events on our moods. The effect is replicable, but there is disagreement about the interpretation of the finding (Levine et al., 2012).

The peak citation count was in 2020, a year before the bubble burst. Citation counts decreased a bit faster than the journal average, down to 71% versus 77%.

10. Roy F. Baumeister

And last, but not least, is Roy F. Baumeister, the inventor of ego-depletion (with Kathleen Vohs). At some point, Baumeister proposed that will-power is related to blood glucose levels and this claim was of course supported by statistically significant results, no fewer than 9 to be exact. When I wrote an article that pointed out that it is unlikely to get so many significant results without a single non-significant result, even if an effect is real, Baumeister was a reviewer of the manuscript and simply noted that they of course had more studies and didn’t publish the weaker ones. That is just how the field works. At this point, ego-depletion was still accepted as a phenomenon, even if it didn’t rely on glucose levels. Now, even ego-depletion has been debunked and hundreds of significant results provide no insight into self-regulation. Nevertheless, Roy Baumeister continues to write articles that claim ego-depletion is real and all the critics are wrong. Roy F. Baumeister has also written on many other topics with influential review articles. For example, he reviewed credible evidence that men have a stronger sex drive than women.
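The point about nine out of nine significant results can be illustrated with simple arithmetic. If studies are independent and each has the same true power, the probability that all of them are significant is power raised to the number of studies. The sketch below is only an illustration: the power values are assumptions chosen for the example, not estimates for the glucose studies, and the function name `prob_all_significant` is made up.

```r
# Probability that all k independent studies produce a significant result
# when every study has the same true power (a simplifying assumption).
prob_all_significant <- function(power, k) power^k

# Assumed power levels, for illustration only.
power <- c(0.50, 0.80, 0.90)

round(prob_all_significant(power, k = 9), 3)
# 0.002 0.134 0.387
```

Even with 80% power in every study, a clean sweep of nine significant results would be expected only about 13% of the time under these assumptions.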

Roy F. Baumeister has an impressive citation record with a peak citation rate of 7192 citations in 2021. He may be relieved that the decline in the following years is partially due to a general decline.

However, the decline is a bit steeper than the decline for the average journal (73% vs. 77%). To examine whether the results would be different, I also examined only articles that mention ego-depletion as a search term in any field. The key finding was hardly different (71% vs 73%). With nearly 900 ego-depletion related citations, psychology still has a long time to find out that Baumeister was swimming naked.

And here you have it. The most interesting result is that Bargh is the most notable victim of the replication crisis, whereas equally questionable research like the work on ego-depletion has been affected much less by demonstration of bias and major replication failures. While the citation bubble has burst, researchers are not as savvy as value investors to spot junk stocks. The following years will show whether awareness about replication failures spreads.
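To make the comparison with the benchmark explicit, the 24/Peak percentages reported above can be lined up against the journal average of 77%. This is a minimal sketch in R using only the percentages stated in this post. Strack and Pyszczynski are left out because their exact values are not given in the text, and the variable names are my own.

```r
# 2024 citations as a percentage of peak citations, as reported in this post.
pct_of_peak <- c(Stapel = 27, Bargh = 49, Ariely = 62, Greenwald = 66,
                 Gino = 67, Vohs = 71, Gilbert = 71, Baumeister = 73)

# Average 24/Peak value for the social and personality journals.
journal_benchmark <- 77

# Gap in percentage points; negative values mean a steeper drop than the journals.
gap <- pct_of_peak - journal_benchmark
sort(gap)
```

Sorting the gaps puts Stapel and Bargh far ahead of everybody else, while the ego-depletion researchers sit only a few percentage points below the benchmark.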

Some People Can be Better than Average

Some notable researchers have trends above the journal average. Here are three examples.

1. Susan Fiske

Susan Fiske became famous during the replication crisis when she called investigators of research bias “method terrorists.” Her own research has not been the focus of method terrorists. This may explain why her citation count decreased less than the journal average (83% vs. 77%).

2. Ed Diener

Ed Diener is one of the most influential psychologists of his generation, akin to Roy F. Baumeister. His main research has been on life-satisfaction and subjective well-being. This research has flourished since Diener published his seminal review paper in 1984. His Satisfaction with Life Scale (Diener et al., 1985) is still the most widely used multi-item measure of SWB. Many results have been replicated with large, nationally representative samples.

Citation counts peaked once again in 2021, the year the citation bubble burst. However, the decrease is relatively small and less than the journal average (87% vs. 77%).

A Positive Response to the Replication Crisis

My UofT colleague Michael Inzlicht also played a part in the replication crisis. In contrast to many social psychologists, he acknowledged that some of his early work was based on shoddy practices that produced results that cannot be replicated. Just recently, he commented on the demise of stereotype threat, another popular literature that turned out to be junk science (Inzlicht, 2024). After being unable to replicate his earlier findings at UofT, he started work on ego-depletion. After he joined a large replication project that failed to show the effect, he gave up on ego-depletion as well (Inzlicht, 2020). Since then, he started over and has already gathered a citation record that stands on its own without the previous publications.

Citation counts for publications (k = 80) from 2000 to 2014 peaked in 2019 and then decreased sharply. The decrease matches the decrease for the third place (Strack).

However, over the past 10 years, Inzlicht has already published over 100 new articles that have a higher peak than his old publications.

Moreover, the decrease is shallow and above the journal average (91% vs. 77%).

There you have it. If you invested in a bubble, it is best to sell and invest again in some new enterprise rather than sticking to your guns and hoping things will turn around.

Yours Truly

Of course, I also looked up myself. In fact, the post is based on my investigation of the disappointing decrease in my own citations. It is comforting that the decrease is general and not specific to my publications.

Rather, the decrease is just slightly below the journal average (75% vs. 77%).

Conclusion

Citation counts matter even if they are only a poor reflection of research quality. They show what people are studying or reading and they are used to allocate resources for future research. The availability of citation counts makes it easy to conduct some meta-scientific studies on the popularity of topics and researchers. Whether the average number of citations goes up or down is a matter of journal space and the tracking of articles in an index. In WebOfScience, citation counts have decreased. Researchers who look up themselves should be aware that the decrease is not a reflection of their popularity. More important is the relative change compared to some benchmark or other researchers. Here we see some interesting differences. Most notably, citation counts of people who used fraud (Stapel) or questionable research practices (Bargh) to produce results that cannot be replicated have decreased notably. This is probably more important than the absolute number of citations or the H-Index of a researcher. The H-Index does not track whether a researcher made a positive contribution to science. It merely shows that a researcher was popular at some point in time. The hallmark of science is that it is self-correcting. This means that bubbles will be deflated eventually and popular but false claims will stop being cited. It is good to see some signs that psychology acts like a science, but there is still a lot of room for further corrections in the future. So, do not be disappointed if your citation count pales in comparison to some of the giants in the field or decreased this year. No need for imposter syndrome. Not everybody can be better than average and some of the giants did more harm than good. Better to be just average.

Happy New Year
Ulrich Schimmack

The Time for AI-Generated Peer-Reviews is Now

I should have done this before I submitted a ms. for publication, but later is better than never. After waiting for a couple of months, I received a rejection for a ms. on the relationship between personality and life-satisfaction.

The rejection decision was based on a form letter by the action editor without any acknowledgement that they read the paper or comments on the main point of the ms. and two reviews.

One review was by the author whose data I reanalyzed; my reanalyses showed that the published results are not supported. You can guess how favorable that review was or you can read it below. The other review made no substantial comments, probably because the reviewer is not familiar with structural equation modeling and unable to read the MPLUS output files that were shared on OSF. Anyhow, these reviews provide no helpful feedback, but they served the purpose of allowing the editor to reject the ms.

Unhappy, or rather frustrated, I decided to ask ChatGPT for a CRITICAL review of the manuscript and just pasted the manuscript in the dialogue box. Less than a minute later, I had an objective review that showed understanding of the issue, acknowledged strengths, and pointed out several limitations that can be used to strengthen the manuscript. Wow. This is a gamechanger. Not only can editors save time and get valuable reviews. Authors can get the feedback before they submit a ms. and improve it. We all knew peer-review is shit because it is driven by nepotism, defensiveness, motivated bias, incompetence, and lack of time and motivation to do a good job. I have written my fair share of terse reviews without much care for the feelings of the authors. ChatGPT doesn’t have a bone in the fight. It may not be ready to replace peer-review, but it surely can elevate it.

ChatGPT Review

Critical Peer Review of “Personality and Life-Satisfaction: A Hierarchical, Multi-Rater Perspective”

Title and Abstract

The manuscript addresses a critical and enduring question in psychology: the relationship between personality traits and life satisfaction. The title effectively conveys the scope of the research, and the abstract provides a concise overview of the methodology, results, and implications. However, the abstract could better highlight the novelty of the hierarchical factor analysis approach used, as this seems to be a significant contribution of the study.

Strengths of the Manuscript

  1. Methodological Rigor: The use of hierarchical confirmatory factor analysis to separate variance components represents a significant methodological strength. This approach allows the study to address measurement errors and identify unique contributions of specific traits.
  2. Multi-Rater Datasets: Incorporating multiple datasets with self-ratings and informant ratings enhances the reliability and generalizability of the findings. The inclusion of domain satisfaction measures in four of the datasets is also commendable.
  3. Findings on Neuroticism: The consistent identification of the depressiveness facet of Neuroticism as a primary predictor of life satisfaction is a valuable contribution to the literature. The mediation effect of domain satisfaction provides novel insights into the mechanisms underlying this relationship.
  4. Contextual Moderation: The manuscript explores variations across populations (e.g., students vs. parents) and cultural contexts, which adds depth to the findings.

Weaknesses and Areas for Improvement

  1. Theoretical Framing:
    • While the manuscript builds on Costa and McCrae’s hierarchical model, the discussion of how this model advances or challenges existing theories of personality and subjective well-being (SWB) could be expanded. For example, the implications of the findings for alternative models, such as dynamic equilibrium theory, are not sufficiently explored.
    • The manuscript heavily focuses on Neuroticism and Extraversion. It briefly mentions other Big Five traits (e.g., Openness) but does not delve into their potential roles or reasons for their weaker effects. This omission may leave readers with an incomplete picture.
  2. Facet-Level Analysis:
    • While the study underscores the importance of the depressiveness facet of Neuroticism, the results for Extraversion and Cheerfulness are inconsistent. The manuscript could include a more thorough discussion of why these inconsistencies arise and what they suggest about the role of positive affect in life satisfaction.
    • There is little exploration of why specific traits (beyond Depressiveness and Cheerfulness) fail to consistently predict life satisfaction. Providing more nuanced interpretations could enrich the discussion.
  3. Domain Satisfaction and Top-Down Effects:
    • The findings on domain satisfaction are compelling but underexplored. The “top-down” influence of personality traits on domain evaluations is mentioned, but the mechanisms underlying this influence remain vague. For instance, how does Depressiveness lead to negative evaluations of specific domains? This area warrants further elaboration.
    • The manuscript could provide more detail on the “bottom-up” integration of domain satisfaction into global life satisfaction judgments. Are certain domains (e.g., relationships, financial stability) consistently more influential across samples?
  4. Methodological Concerns:
    • Measurement of Life Satisfaction: The use of a single-item measure of life satisfaction in some datasets raises validity concerns, as acknowledged by the authors. A more detailed justification for this choice and a discussion of its potential limitations are needed.
    • Sample Representativeness: The reliance on student samples in some studies limits generalizability. While this limitation is noted, it could be addressed more explicitly in the discussion.
    • Reliance on Multi-Rater Data: The manuscript assumes that variance shared among raters reflects valid variance. However, it does not adequately address potential biases introduced by informant ratings, such as social desirability effects or shared environmental influences.
  5. Presentation and Accessibility:
    • The manuscript’s extensive use of technical language and complex statistical concepts may limit accessibility for non-specialist readers. Including simplified explanations or visual aids (e.g., diagrams of hierarchical models) would enhance readability.
    • The tables summarizing results are dense and lack accompanying narrative summaries that highlight key patterns. For example, Table 1 and Table 2 could benefit from clearer labels and textual integration into the results section.

Suggestions for Future Research

  • The manuscript highlights the need to identify additional predictors of life satisfaction beyond Depressiveness and Cheerfulness. Future studies could explore traits or facets not captured by traditional Big Five measures, such as self-compassion or grit.
  • Longitudinal studies that disentangle the causal relationships between personality, domain satisfaction, and life satisfaction would address limitations of the current cross-sectional datasets.
  • Cross-cultural studies that examine the moderating effects of societal factors (e.g., individualism vs. collectivism) on the personality-life satisfaction relationship could provide valuable insights.

Conclusion

The manuscript makes a meaningful contribution to the literature on personality and subjective well-being by quantifying the effects of personality traits and exploring their mediation through domain satisfaction. However, the study’s potential impact could be enhanced by addressing the theoretical, methodological, and presentation-related issues outlined above. Overall, the research is robust, and its findings are valuable, but greater clarity and depth in certain areas would strengthen its contribution to the field.

Overall Recommendation: Revise and resubmit, addressing the identified weaknesses and incorporating the suggested improvements to enhance the manuscript’s clarity, accessibility, and theoretical depth.

Questionable Reviewer Practices: Dishonest Simulations

In an ideal world, humans would curb self-interest for the greater good. In reality, psychology has shown that human information processing is riddled with self-serving biases. A great achievement of human cultural evolution is the development of tools that can reduce these biases such as logical thinking and objective empirical observations. Since the invention of computers, it has also become easier to use simulations to put intuition to a test. However, motivated self-interest can bias simulations to produce desirable outcomes. For example, Uri Simonsohn made unreasonable assumptions to claim that p-curve performs well even with heterogeneity in power (Schimmack, 2018). It does not (Brunner,

An anonymous reviewer, henceforth known as Reviewer A (which may stand for anonymous or another word starting with A), recently accused us of also using biased simulations to make the false claim that z-curve provides useful estimates of power with good coverage of confidence intervals (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022; Schimmack & Bartos, 2023). The same reviewer previously made numerous false claims about z-curve and the estimation of true power that we addressed elsewhere (Brunner, 2024). In our response to Reviewer A’s earlier comments, we also challenged them to provide a simulation that shows when z-curve breaks down.

Reviewer A was able to do so. The question is how they did it and whether Reviewer A’s results challenge our simulation results. Here is Reviewer A’s simulation.

To show that z-curve breaks down without hodgepodge heterogeneity, let us consider a situation of unconditional power of value .25. We have a one-sample Cohen’s d population value ranging from 0.2 to 2.2 (by increments of .05 to end up with 40 values), that is accompanied by sample size ranging from 167 to 2 that is calculated to be associated with a power of .25.

I generated data from each combination of Cohen’s d and sample size and fit a paired-samples t-test to obtain 40 p-values. These p-values are associated with a (“unconditional”) power value of .25. The expected discovery rate should be .25 (which is the power associated with the design of these observed results). The output I obtain from the zcurve package for the 40 estimated p-values is 0.05 – very far from the true value of .25.

To be clear, an estimate of 5% power (no evidence against the null-hypothesis) when the true power is 25% is horrible. So, we need to examine the conditions that lead to this horrible outcome. Here is Reviewer A’s code.

R code
library(pwr)     # provides pwr.t.test()
library(zcurve)  # provides zcurve()
d <- seq(.2, 2.2, by = 0.05) # range of population Cohen's d values (41 values; the loops below use the first 40)
ssize <- matrix(0, ncol = 40, nrow = 1) # placeholder for sample sizes
pow <- .25 # change this to change the level of true power
for (i in 1:40) { # obtain the sample size associated with each Cohen's d value for the chosen level of power
  ssize[i] <- pwr.t.test(n = NULL, d = d[i], sig.level = .05, power = pow,
                         type = "one.sample", alternative = "two.sided")$n
}
ssize_ <- round(ssize, digits = 0) # round to the nearest whole number
pp <- matrix(0, ncol = 40, nrow = 1) # placeholder for estimated p-values
# let's generate data and collect p-values
for (i in 1:40) {
  dat <- rnorm(n = ssize_[i], mean = d[i], sd = 1)
  pp[i] <- t.test(dat, paired = FALSE, alternative = "two.sided")$p.value
}
# let's use zcurve to see whether EDR reproduces the specified power value
zcurve(p = as.vector(pp))

Let’s first address an annoying side issue in this simulation. The simulation is based on 40 simulated studies or test results. With power of 25% only about 10 of those are expected to be significant and useful for a z-curve analysis to estimate power. We have warned that z-curve estimates with small k (k = 10; 10 p-values below .05) are too variable to be meaningful. They also have wide confidence intervals that Reviewer A does not bother to report. However, this is a side-issue. We can simply increase the number of tests from 40 to 100,000 and see the large-sample bias in z-curve estimates. This confirms that z-curve severely underestimates power in this simulation. So, let’s take a closer look at the scenario that is being simulated.
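The arithmetic behind "about 10 significant results" can be checked with the binomial distribution. This is only a minimal sketch in base R; it assumes that the 40 tests are independent and that each has exactly 25% power, and the object names are made up for illustration.

```r
k     <- 40     # number of simulated tests
power <- 0.25   # assumed true power of every test

expected_sig <- k * power                      # expected number of significant results
sd_sig       <- sqrt(k * power * (1 - power))  # binomial standard deviation

# Range that contains the number of significant results in roughly 95% of runs
range_95 <- qbinom(c(0.025, 0.975), size = k, prob = power)

cat("expected:", expected_sig, " SD:", round(sd_sig, 2),
    " central 95% range:", range_95, "\n")
```

With an expected count of 10 and a standard deviation of about 2.7, a single run of this simulation can easily leave z-curve with far fewer than 10 significant p-values to work with.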

The simulation starts with effect sizes ranging from small (d = .2) to effect sizes that are very large and rarely observed in real studies (d = 2.2). It is well known that power increases with larger effect sizes. Thus, to maintain low power of 25%, we have to reduce sample sizes. The sample sizes implied by this simulation are as follows.

N     Frequency   Percent
2     11          26.8%
3     13          31.7%
4     4           9.8%
5     3           7.3%
6     1           2.4%
7     2           4.9%
9     1           2.4%
10    1           2.4%
12    1           2.4%
15    1           2.4%
20    1           2.4%
28    1           2.4%
43    1           2.4%

These results show that only 10% of studies had sample sizes of 15 or more participants and over half of the simulated studies had 2 or 3 participants. As the simulation focused on a one-sample t-test, a study with N = 2 has 1 degree of freedom. This important information was hidden from the editor who is supposed to make decisions based on peer-reviewers’ comments.
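To get a feel for what one degree of freedom means, the two-sided critical t-value can be compared with the critical z-value. The sketch below uses only base R; the sample sizes are picked from the table for illustration and the code is not part of Reviewer A's simulation.

```r
# Two-sided 5% critical values of the t-distribution for some of the
# sample sizes in the table (one-sample test, so df = N - 1)
n  <- c(2, 3, 5, 10, 15, 30, 43)
df <- n - 1

crit_t <- qt(0.975, df)   # critical t-value for each sample size
crit_z <- qnorm(0.975)    # critical z-value (about 1.96)

print(data.frame(N = n, df = df,
                 crit_t = round(crit_t, 2),
                 ratio_to_z = round(crit_t / crit_z, 2)))
```

With N = 2, a t-value has to exceed about 12.7 to be significant, more than six times the critical z-value, which shows how far the t-distribution is from a normal distribution in these simulated studies.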

Reviewer A could have just simulated studies with sample sizes of 2 or 3 participants to show that z-curve does not work well with these sample sizes, but maybe the editor would have noticed that this is not a reasonable assumption because most studies have more than 3 participants. In fact, most studies have more than 20 participants. So, the only plausible reason to simulate effect sizes when sample size is the driving factor is to hide explicit information about sample sizes from readers who do not understand t-distributions very well.

That being said, it is interesting to examine whether small sample sizes that are actually found in research articles still bias z-curve estimates. For example, John Bargh’s infamous elderly priming study that could not be replicated had only n = 15 participants in each of the control and experimental groups, for a total of N = 30 and 28 degrees of freedom. Would z-curve underestimate power with these small sample sizes? Before I can present the results, it is important to point out that z-curve provides two estimates of power. One is the estimated power (including power of 5% for studies where the null-hypothesis is true) of all studies that were conducted and produced significant and non-significant results. This is called the expected discovery rate (EDR). The second estimate is the power of the subset of studies that produced a significant result (including false positive results with power of 5%, if alpha is set to .05). This is called the expected replication rate (ERR) because it predicts how many significant results would be obtained if only the studies with significant results were replicated exactly, including the original sample sizes. When power is fixed, the true EDR and ERR are the same, but estimates and biases can differ because estimating the EDR is more difficult.

We can use Reviewer A’s code to determine the effect size that is needed to get significance in a simple between-group study with n = 15 per group (Bargh also had a covariate which increases power, but that is not relevant here). I simulated power of 50%.

# Note: pwr uses the n of a single group. With n = 15 and type = "two.sample", the total N = 30 and df = 28.
d <- pwr.t.test(n = 15, d = NULL, sig.level = .05, power = .50,
                type = "two.sample", alternative = "two.sided")$d

With N = 30 total sample size we need a large effect size of d = .74 to have 50% power.

Figure 1 shows the t-distribution from which test statistics of individual studies are sampled (green). It also shows the standard normal distribution that is implied by 50% power (black). Visual inspection shows that the two distributions are similar but not identical.

The approximation of the asymmetrical non-central t-distribution with the standard normal distribution introduces some bias in z-curve estimates. For this example, the true power of 50% is underestimated by 4 percentage points. More problematic is that the EDR is underestimated by 13 percentage points. This finding suggests that estimates of the EDR and the FDR, which is simply a transformation of the EDR, are biased in sets of studies with small sample sizes (N < 30), even if we disregard silly sample sizes of N = 2.
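The difference between the two distributions can also be put into numbers. The following base-R sketch assumes the design described above (n = 15 per group, d = .74) and compares quantiles of the non-central t-distribution with those of a normal distribution that has a standard deviation of 1 and 50% power. It illustrates the shape difference only; it does not reproduce the z-curve estimates.

```r
n_per_group <- 15
df  <- 2 * n_per_group - 2        # 28 degrees of freedom
d   <- 0.74                       # effect size from the power analysis above
ncp <- d * sqrt(n_per_group / 2)  # non-centrality parameter for two equal groups

# Power of the two-sided t-test, computed from the non-central t-distribution
crit_t  <- qt(0.975, df)
power_t <- pt(crit_t, df, ncp = ncp, lower.tail = FALSE) +
           pt(-crit_t, df, ncp = ncp)

# Quantiles of the non-central t-distribution and of a normal distribution
# with SD = 1 that is centered on the critical z-value (50% power)
probs <- c(0.05, 0.50, 0.95)
q_t   <- qt(probs, df, ncp = ncp)
q_z   <- qnorm(probs, mean = qnorm(0.975), sd = 1)

print(round(power_t, 3))
print(round(rbind(t = q_t, z = q_z), 2))
```

The median and the upper quantile of the t-distribution lie to the right of the corresponding normal quantiles, which is the asymmetry that the conversion to z-scores has to absorb.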


Is there a solution to this problem? Indeed there is one, and maybe we should have thought about it before, but as they say, “better late than never.” There is an alternative approach to ‘convert’ t-values into z-scores (or F-values with one numerator degree of freedom, where t = √F; that is, t is the square root of the F-value and F-values are simply squared t-values). The alternative is to simply use the t-value as an estimate of the z-score.


ChatGPT: How to convert t-values into z-scores?

Find the cumulative probability of the t-value: Use a t-distribution table or statistical software to find the p-value associated with the t-value given your specific degrees of freedom. This p-value represents the area under the curve to the left of your t-value (for a one-sided test) or half the area in a two-tailed test. Once you have the p-value, use the inverse of the standard normal distribution (Φ⁻¹) to find the corresponding z-score.

Alternative Formula (if df > 30): For degrees of freedom over 30, the t-distribution closely resembles the normal distribution, so you can directly approximate the z-score by using the t-value.
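The two approaches can be compared directly. This is a minimal sketch in base R; the function name `t_to_z` and the example t-value are made up for illustration, and the conversion is the generic p-value route described above, not code taken from the zcurve package.

```r
# Convert a t-value into a z-score via its two-sided p-value
t_to_z <- function(t, df) {
  p <- 2 * pt(-abs(t), df)                     # two-sided p-value of the t-test
  sign(t) * qnorm(p / 2, lower.tail = FALSE)   # z-score with the same p-value
}

t_obs <- 2.5                   # example t-value (arbitrary)
dfs   <- c(5, 10, 28, 100, 1000)

comparison <- data.frame(
  df          = dfs,
  z_converted = round(t_to_z(t_obs, dfs), 3),  # transformation approach
  z_direct    = t_obs                          # t-value used directly as z-score
)
print(comparison)
```

The converted z-score is always smaller in absolute value than the t-value, and the gap shrinks as the degrees of freedom increase.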

It is clear that this approach will lead to overestimation of power (EDR, ERR) because uncorrected t-values with small degrees of freedom are always larger in absolute value than the corresponding z-scores. The question is how big this bias is when we use this approach to conduct z-curve analyses. Here are the results for the same data, but the inputs are the uncorrected t-values.

The results for the ERR are as expected: the true power is overestimated. Interestingly, the bias is as strong as for the transformation approach, but in the opposite direction. This suggests a possible way to quantify the amount of bias: use both approaches and take half of the difference between the two estimates as an estimate of the amount of bias. The results for the EDR are a bit more surprising. There is no bias in the estimate. The reason is that bias is introduced by the wide tail of the t-distribution and this tail has a weaker effect on power for all tests. This is a very encouraging finding and suggests that it is preferable to use this approach to submit t-values to z-curve.

Large sample bias is often hard to detect when the set of studies is small. Moreover, confidence intervals of z-curve estimates are adjusted to allow for small systematic biases. It is therefore interesting to compare the coverage of confidence intervals for both approaches. To do so, I split the 200,000 t-values into 1,000 sets of 200 observations and ran z-curve with confidence intervals. I then checked the percentage of confidence intervals that included the true parameter.

For the ERR, 94.7% of confidence intervals included the true parameter. This is just 0.3 percentage points less than we would expect from a 95% confidence interval. For sample sizes greater than 30, this would imply good coverage. For the EDR, 97.7% of confidence intervals included the true power, indicating good coverage for the 95% confidence interval.
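Whether these coverage rates differ from the nominal 95% by more than simulation error can be checked with a simple binomial calculation. The sketch below uses base R and assumes that the 1,000 sets are independent; the object names are illustrative.

```r
n_sets  <- 1000   # number of simulated sets of 200 t-values
nominal <- 0.95   # nominal coverage of the confidence intervals

# Standard error of an observed coverage rate if the true coverage were 95%
se_cov <- sqrt(nominal * (1 - nominal) / n_sets)

observed <- c(ERR = 0.947, EDR = 0.977)   # coverage rates reported above
z_dev    <- (observed - nominal) / se_cov # deviation from nominal in SE units

print(round(se_cov, 4))
print(round(z_dev, 2))
```

The standard error is about 0.7 percentage points, so the ERR coverage of 94.7% is well within simulation error of 95%, whereas the EDR coverage of 97.7% is higher than nominal, which means the interval errs on the conservative side.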

We will follow up on these preliminary results with more extensive simulations, but the results suggest that it is preferable to use t-values as estimates of z-values rather than using the transformation by means of p-values. We also suggest limiting z-curve analyses to studies with at least N = 30 participants for now.

Conclusions about Z-Curve

Z-curve was developed as a statistical tool that estimates the average power of a set of studies. It does so for two populations of studies. One population is all studies that were conducted independent of the result. The other population is the subset of studies that produced a statistically significant result. Alternative methods exist, but z-curve is the only method that can be used when studies differ in power (heterogeneity) without having to make assumptions about the distribution of true power (Brunner & Schimmack, 2020).

Like all methods that rely on samples to make claims about populations, z-curve cannot reveal the true average power. It can only provide estimates of average true power in a population of studies. There are two sources of uncertainty in these estimates. One is ordinary sampling error. The other is systematic bias that can be introduced by approximating test-statistics from different designs with z-values. Z-curve provides confidence intervals that take both sources of error into account. Simulation studies suggest that 95% confidence intervals contain the true parameter at least 95% of the time in many realistic scenarios.

Reviewer A’s simulations showed that this is no longer the case when sample sizes are small. A simple solution to this problem is not to include studies with very small sample sizes in z-curve analyses. At present, I would suggest excluding studies with N < 30 or being mindful if studies with smaller sample sizes are part of the set of studies. I also recommend a new approach to include t-values and F-values with one degree of freedom. Instead of converting them to p-values, t-values should be directly used as estimates of z-values.

Conclusions about Peer Review, Scientific Integrity, and Reviewer A

After presenting misleading claims about statistical power that we have carefully examined and shown to be misguided (Brunner, 2024), Reviewer A uses the results of their dishonest simulation study to claim that z-curve is unable to estimate the true power of a set of studies.

In sum, I do not think that the z-curve delivers estimates of the expected discovery rate (and its sister concept of expected replication rate) on a conceptual basis. The arguments for using estimates of “unconditional power” seem not to reasonably justify making a claim on the discovery rate of a set of publications (why not just count up the p-values if p-values signal discovery?). Even if my conceptual points are swept under the rug (again), perhaps the simulated illustration showing that z-curve does not provide an estimate near to the true value of the “expected discovery rate” would be convincing. Why does it break? Well, that goes back to the conceptual issues I have pointed out about consistency and efficiency of observed power, and the hodgepodge problem of combining lots of things together and hoping all the bad stuff averages out.  One might choose to argue that z-curve can run but cannot walk (that is, it performs well enough in a complex case but fails miserably in simple cases). I would not be convinced of such an argument that ignores first principles.

To be fair, Reviewer A states “I do not think,” which suggests that they are open to the idea that they may be wrong. However, it is unclear why Reviewer A continues to ignore the evidence presented in peer-reviewed articles and the opinions of several reviewers who did not see the fundamental problems that they see. When there are differences in opinions about factual statements, it is important to examine the underlying thought processes and evidence. Reviewer A failed to do so (“I do not think”). Rather, Reviewer A has set up a biased simulation that confirms their suspicion and accuses us of doing the very same thing they were doing. (If this sounds familiar, you know that it can be an effective strategy to deceive people).

Why does z-curve break in the simulations? As I have shown in this blog post, it breaks down in the simulations by Reviewer A when we use it with sample sizes of N = 2 or 3, it is still biased with sample sizes of N = 30, and it does well with large sample sizes. Reviewer A hides the missing moderator (a.k.a., hidden moderator), sample size, which they are well aware of because they knew how to break z-curve with sample sizes of N < 10. However, they falsely generalize from this unrealistic scenario to all applications with larger sample sizes and ignore that many simulations with larger sample sizes have shown that z-curve performs well.

This deception is not different from the questionable research practices that original researchers sometimes use to present statistically significant results without any real effects (Bem, 2011). They know more than they are telling their audience to present misleading scientific evidence for false claims. Just like Reviewer A’s dishonest behavior, questionable practices in peer-reviews are not rare exceptions, but common occurrences that are based on humans’ struggle to overcome their own biases. This would not be a problem if there were an open exchange of arguments between authors and reviewers that works out the causes of disagreement. Here it is easy to show that the original simulation and Reviewer A’s simulations are both correct and that sample size is the moderator. Once this is clarified, the editor can decide whether simulations based on N = 2 or N = 100 are more relevant. However, often editors are quick to reject articles based on the expert opinions of reviewers, especially in flashy journals that pride themselves on high rejection rates. It is not surprising that the quality of articles in these journals is not better than in other journals because experts will use their power to favor articles that agree with their opinions and be hypercritical about articles that do not. It is well known that pre-publication peer-review is very subjective and reviewers often disagree even about ratings of the quality of a literature review.

How can we improve peer-review? The answer is simple. Make them open! Open science requires transparency of all steps of the production of a scientific article, which includes peer-review. Some innovative journals have implemented open peer-reviews. We are proud that two z-curve articles have been published in the leading journal Meta-Psychology (conflict of interest declaration: I am co-founder of this journal with Rickard Carlsson who has been main editor since its inception). Reviewer A’s opinions do not just clash with our own opinions. They are also inconsistent with reviewers of z-curve who put their name next to their reviews. In contrast, reviewers in legacy journals hide their identity from the authors and the public. Just like I challenged Reviewer A to present a simulation to break z-curve, I challenge them here to an open exchange about the ability of z-curve to estimate the true power of a set of studies. Open exchange of arguments in real time (like in a chess game) in front of an audience needs to be added to the open science practices. Let’s make a badge for that and I will be happy to earn a few of those.

Junk Studies Pollute the Literature: Money Priming in 2024

In meta-psychology, social priming has become the poster child of junk science. Researchers conducted many cheap studies with small samples and published the results only when they supported their predictions. After Nobel Laureate Daniel Kahneman published some of these results in his bestselling book “Thinking, Fast and Slow,” he became concerned about the robustness of these results. He sent an open letter to the leading social priming researcher John Bargh, asking for replication studies. Bargh and other prominent social priming researchers declined. However, many younger researchers answered the call and reported replication failures. Anybody familiar with the replication crisis in social psychology is well aware of these problems and would not cite social priming studies as scientific evidence unless the studies were preregistered and conducted with reasonable sample sizes to detect small real effects.

However, many psychologists and other social scientists seem to be unaware of the replication crisis or willfully ignore the fact that articles by leading priming researchers provide no credible evidence for the claims made in these articles. As a result, a decade of replication failures has failed to correct the scientific literature. This blog post uses money priming as an example.

The simple idea behind money priming is that some manipulation that makes people think about money will change their attitudes and behaviors in ways that make people more materialistic, selfish, and less altruistic. The original article by Vohs et al. (2006) published 9 studies to provide evidence for this claim. Ironically, an article with 9 successful studies should not make us believe in the effect because even well-designed studies with real effects will occasionally be unsuccessful; that is, produce a p-value above .05 (Schimmack, 2012). Thus, an article that features only successful studies tells us nothing about the actual effect because it is unclear how many attempts were made to get the 9 significant results (Sterling, 1995).

In response to the replication crisis, my colleagues and I have worked on statistical methods to detect selective publishing of confirmatory evidence, which is sometimes called a questionable research practice, but clearly violates the spirit of scientific integrity and undermines the credibility of science and the scientific community. I am focussing on money priming here because I have used Vohs et al.’s (2006) article to train students in bias detection (video). Even 9 studies are sufficient to show that the evidence in this article omits studies that failed to support the money-priming effect.

I was curious to see whether criticism of priming research and concerns about money priming in general influenced citations of the article. A search in WebOfScience suggests that citations are decreasing. However, citations of psychology articles have been decreasing in 2023 in general and the article is still cited about 20 times a year, which is rather high for psychology. Clearly money priming is not dead yet, and criticizing the work is not akin to flogging a dead horse.

I then looked at articles to see how they cited Vohs et al. (2006). The article “Effects of money priming on sustainable consumption attitudes” caught my attention because the title suggested that it reports results of new money priming research. Indeed, the article reports the results of two successful studies. I will focus on those results a bit later, but first I want to focus on Table 1 in the article that carefully listed results of 19 money priming studies with sample sizes and test statistics (t-tests, F-tests). This information is sufficient to examine the presence of selection bias in the broader money-priming literature.

I was even able to copy and paste the table directly into Excel. I just needed to add the information about test results in a way that Excel could use and add a few formulas.

The top row shows the important information. 95% of the results were significant if “marginally significant” results (p < .10) are counted as successes. This is consistent with success rates in psychology journals since 1959 (Sterling, 1959). Could this just be due to the fact that money priming is real and that the studies had high statistical power; that is, a high probability to get p < .05? The answer to this question can be found by computing the exact p-values for the various test statistics and converting them into z-scores. These z-scores can be used to compute the observed power of each study based on the effect size estimate in a single study. It is well known that this information is not useful for a single study, but it becomes useful when we have sets of studies and can compute the average power. The average observed power is 66.9%. Thus, we should have gotten about 67% significant results, not 95%. However, observed power is just an estimate of power. Maybe power is really higher? The problem with this argument is that observed power is inflated when selection bias is present. So, the 67% estimate is an overly optimistic estimate and the true average power of the studies is likely to be lower. How much lower? A simple way to estimate the true average power is to subtract the inflation, that is, the difference between the success rate and the average observed power, from the average observed power, because the inflation of the estimate increases the more selection bias there is. With an inflation of 94.7% – 66.9% = 27.8 percentage points, the estimated true power is only about 39%. Whether this is sufficient evidence to wonder about the number of studies that failed to show the effect and were not reported is of course a subjective judgment. As the saying goes, a sucker is born every minute.
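The correction is simple enough to write down in a few lines. The sketch below uses base R and only the two summary numbers reported here (18 of 19 significant results and an average observed power of 66.9%); the helper `observed_power()` and the other names are illustrative and assume two-sided tests with alpha = .05. It is not the Excel sheet itself.

```r
# Observed power of a single study, computed from its two-sided p-value
observed_power <- function(p, alpha = 0.05) {
  z    <- qnorm(p / 2, lower.tail = FALSE)  # z-score implied by the p-value
  crit <- qnorm(1 - alpha / 2)              # critical z-value (about 1.96)
  pnorm(z - crit) + pnorm(-z - crit)        # power if the true z equals the observed z
}

success_rate   <- 18 / 19   # share of significant results in Table 1
mean_obs_power <- 0.669     # average observed power reported in the text

inflation      <- success_rate - mean_obs_power  # excess of successes over power
adjusted_power <- mean_obs_power - inflation     # subtract the inflation once more

print(round(c(inflation = inflation, adjusted_power = adjusted_power), 3))
print(round(observed_power(c(0.05, 0.01, 0.001)), 3))
```

The first line of output gives an inflation of 0.278 and an adjusted power of 0.391. The helper also shows why just-significant results are worrying: a p-value of exactly .05 corresponds to an observed power of only 50%.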

My colleagues and I have developed a more powerful method to examine these kinds of data called z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). The method fits a model to the distribution of the z-values implied by test statistics. Visual inspection of the distribution also provides clear evidence that the distribution of z-scores could not have been produced by random sampling error. It is just not possible to get so many results that are just significant (z ≈ 2 implies p ≈ .05) and no non-significant results.

Based on z-curve, we can estimate a number of statistics. The expected replication rate (ERR) predicts how many of the 95% of studies with significant results would produce a significant result again if these studies were repeated exactly with the same sample sizes. It is like asking Bargh to do his study again and show us that he can get the same results again. When asked to do so, he said nobody would believe him anyway if he reported significant results again. The ERR estimate of 35% is less important than the 95% confidence interval around this estimate. The predicted success rate in a set of exact replication studies could be as low as 10%. It could also be as high as 64%, but we just don’t know how replicable these results are. Thus, the 18 (out of 19) significant results tell us practically nothing about the ability to replicate these results.

The other statistic is the expected discovery rate (EDR). The expected discovery rate is an estimate of the percentage of significant results that we would find if researchers had kept all of their results and we could find the missing results and see how many significant results we get. The point estimate of 7% is very low. We would expect 5% by chance alone, so 7% is hardly better than chance. Again, we need to consider the uncertainty in this point estimate. It could be up to 17%, which would imply that researchers get about 1 significant result in 5 attempts (20%). Would you trust a researcher who hides 80% of their results? Moreover, it is also possible that the EDR is 5%, which is chance level. This means that all significant results were obtained by chance alone without any real money priming effect.
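The link between a low EDR and false positives can be made explicit. The sketch below assumes that the transformation of the EDR into a false discovery risk is Sorić's upper bound, which is my reading of how z-curve defines the FDR; the function name `soric_fdr` is made up for illustration, and the three EDR values are the lower limit, point estimate, and upper limit mentioned in the text.

```r
# Maximum false discovery rate implied by a discovery rate (Soric's bound)
soric_fdr <- function(edr, alpha = 0.05) {
  (1 / edr - 1) * (alpha / (1 - alpha))
}

edr <- c(lower = 0.05, estimate = 0.07, upper = 0.17)
print(round(soric_fdr(edr), 2))
```

Under this assumption, an EDR of 5% implies that up to 100% of the significant results could be false positives, the point estimate of 7% implies up to 70%, and even the upper limit of 17% still allows for up to 26%.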

To be clear, these are not my data. I didn’t select studies or code studies. The data come from true believers in money priming who used these data to motivate their own studies.

Despite the wealth of adverse effects documented in the literature on money priming in relation to specific behaviors (as outlined in Table 1), there is a noticeable gap in research regarding the impact of money priming on sustainable consumption. (Koruk & Mandrik, 2024, p. 309).

I only plugged the data into statistical programs that look beyond the visible evidence to check whether we can trust it. The answer is that these 19 studies provide no convincing evidence that money priming caused the 18 significant results. They could just have been produced by chance and selective reporting of significant results.

Now take a moment and predict the outcome of the new studies in this article. Money priming was manipulated with a scrambled sentence task that made participants rearrange words that were either related to money (experimental group) or not (control group) in Experiment 1, and with a paragraph writing task and a picture of money in Experiment 2.

The outcome variable was the average rating of the following three items.

(1) I am concerned about wasting the resources of our planet.
(2) I will make an effort to use products that do not harm the environment.
(3) It is important to change my consumption patterns (use less or avoid buying products) in order to protect the environment.

Scroll down to see the results.

Results

Experiment 1 Effect size d = 1.44, z = 9.73
Experiment 2 Effect size d = 1.55, z = 9.18

Let’s just say that these results are very surprising. The effect sizes are very large for results in psychological research in general and the priming literature specifically. A difference of 1.5 standard deviations is as big as the difference in height between men and women.

Due to these large effect sizes, the test statistic is off the chart. Z-scores of 9 have a probability of 1 out of a gazillion to occur by chance. These are not chance findings. These results are also not consistent with the evidence in Table 1, which showed much lower z-scores for most studies.

There are many possible explanations for these surprising results, including computational errors and demand effects, among others (wink). I don’t really care about these results because priming studies are problematic even if they show real effects. First, the manipulation is artificial and may not correspond to real world situations in which we think about money. Second, ratings on a scale do not imply that people would really change their actual behaviors. However, it might still be an easy study to replicate and see whether the results are replicable. Even smaller samples would be sufficient to produce these results again, given the strong effect sizes reported in this article.

The main point of this blog post is that we need to look at results in published articles differently. We cannot just see how many significant results we see in journals. We already know the answer to this question. The published literature tends to have over 90% significant results. This is not an empirical finding that can be used to evaluate evidence. The real question is always how many non-significant results are missing. Bias tests can be useful to answer this question. Thus, if you want to be a scientist and make scientific claims, you need to examine the amount of bias in the studies that you are using. “Studies show…” is not a scientific claim. Studies also show that extraverts can sense pornography before it is even presented (Bem, 2011). The real question is how many studies really show an effect.

Guest Post by Jerry Brunner: Response to an Anonymous Reviewer

Introduction

Jerry Brunner is a recent emeritus from the Department of Statistics at the University of Toronto Mississauga. Jerry first started in psychology, but was frustrated by the unscientific practices he observed in graduate school. He went on to become a professor in statistics. Thus, he is not only an expert in statistics. He also understands the methodological problems in psychology.

Sometime in the wake of the replication crisis around 2014/15, I went to his office to talk to him about power and bias detection. Working with Jerry was educational and motivational. Without him z-curve would not exist. We spent years trying different methods and thinking about the underlying statistical assumptions. Simulations often shattered our intuitions. The Brunner and Schimmack (2020) article summarizes all of this work.

A few years later, the method is being used to examine the credibility of published articles across different research areas. However, not everybody is happy about a tool that can reveal publication bias, the use of questionable research practices, and a high risk of false positive results. An anonymous reviewer dismissed z-curve results based on a long list of criticisms (Post: Dear Anonymous Reviewer). It was funny to see how ChatGPT responds to these criticisms (Comment). However, the quality of ChatGPT responses is difficult to evaluate. Therefore, I am pleased to share Jerry’s response to the reviewer’s comments here. Let’s just say that the reviewer was wise to make their comments anonymously. Posting the review and the response in public also shows why we need open reviews like the ones published in Meta-Psychology by the reviewers of our z-curve article. Hidden and biased reviews are just one more reason why progress in psychology is so slow.

Jerry Brunner’s Response

This is Jerry Brunner, the “Professor of Statistics” mentioned in the post. I am also co-author of Brunner and Schimmack (2020). Since the review Uli posted is mostly an attack on our joint paper (Brunner and Schimmack, 2020), I thought I’d respond.

First of all, z-curve is sort of a moving target. The method described by Brunner and Schimmack is strictly a way of estimating population mean power based on a random sample of tests that have been selected for statistical significance. I’ll call it z-curve 1.0. The algorithm has evolved over time, and the current z-curve R package (available at https://cran.r-project.org/web/packages/zcurve/index.html) implements a variety of diagnostics based on a sample of p-values. The reviewer’s comments apply to z-curve 1.0, and so do my responses. This is good from my perspective, because I was in on the development of z-curve 1.0, and I believe I understand it pretty well. When I refer to z-curve in the material that follows, I mean z-curve 1.0. I do believe z-curve 1.0 has some limitations, but they do not overlap with the ones suggested by the reviewer.

Here are some quotes from the review, followed by my answers.

(1) “… z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis …”

This is not accurate. Post-hoc power analysis is indeed fatally flawed; z-curve is something quite different. For later reference, in the “observed” power method, sample effect size is used to estimate population effect size for a single study. Estimated effect size is combined with observed sample size to produce an estimated non-centrality parameter for the non-central distribution of the test statistic, and estimated power is calculated from that, as an area under the curve of the non-central distribution. So, the observed power method produces an estimated power for an individual study. These estimates have been found to be too noisy for practical use.
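To make the contrast concrete, here is a minimal sketch of the observed power calculation for the simplest case, a two-sided z-test at alpha = .05. The function name and the choice of a z-test are assumptions made for illustration; they are not taken from the paper or from the review.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def observed_power(z_obs: float, alpha: float = 0.05) -> float:
    """Plug the observed z in as if it were the true non-centrality parameter."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    ncp = abs(z_obs)
    # Area of the shifted (non-central) normal beyond either critical value.
    upper = 1 - STD_NORMAL.cdf(crit - ncp)
    lower = STD_NORMAL.cdf(-crit - ncp)
    return upper + lower


for z in (1.0, 1.96, 2.8, 4.0):
    print(f"observed z = {z:4.2f} -> observed power = {observed_power(z):.3f}")
```

A just-significant result (z = 1.96) maps to an observed power of about 50%, and ordinary sampling error in z moves the estimate a long way, which is one way to see why single-study estimates are so noisy.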

The confusion of z-curve with observed power comes up frequently in the reviewer’s comments. To be clear, z-curve does not estimate effect sizes, nor does it produce power estimates for individual studies.

(2) “It should be noted that power is not a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should always be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.”

The reviewer appears to believe that once the results of a study are in, the study no longer has a power. To clear up this misconception, I will describe the model on which z-curve is based.

There is a population of studies, each with its own subject population. One designated significance test will be carried out on the data for each study. Given the subject population, the procedure and design of the study (including sample size), significance level and the statistical test employed, there is a probability of rejecting the null hypothesis. This probability has the usual frequentist interpretation; it’s the long-term relative frequency of rejection based on (hypothetical) repeated sampling from the particular subject population. I will use the term “power” for the probability of rejecting the null hypothesis, whether or not the null hypothesis is exactly true.

Note that the power of the test — again, a member of a population of tests — is a function of the design and procedure of the study, and also of the true state of affairs in the subject population (say, as captured by effect size).

So, every study in the population of studies has a power. It’s the same before any data are collected, and after the data are collected. If the study were replicated exactly with a fresh sample from the same population, the probability of observing significant results would be exactly the power of the study — the true power.

This takes care of the reviewer’s objection, but let me continue describing our model, because the details will be useful later.

For each study in the population of studies, a random sample is drawn from the subject population, and the null hypothesis is tested. The results are either significant, or not. If the results are not significant, they are rejected for publication, or more likely never submitted. They go into the mythical “file drawer,” and are no longer available. The studies that do obtain significant results form a sub-population of the original population of studies. Naturally, each of these studies has a true power value. What z-curve is trying to estimate is the population mean power of the studies with significant results.

So, we draw a random sample from the population of studies with significant results, and use the reported results to estimate population mean power — not of the original population of studies, but only of the subset that obtained significant results. To us, this roughly corresponds to the mean power in a population of published results in a particular field or sub-field.

Note that there are two sources of randomness in the model just described. One arises from the random sampling of studies, and the other from random sampling of subjects within studies. In an appendix containing the theorems, Brunner and Schimmack liken designing a study (and choosing a test) to the manufacture of a biased coin with probability of heads equal to the power. All the coins are tossed, corresponding to running the subjects, collecting the data and carrying out the tests. Then the coins showing tails are discarded. We seek to estimate the mean P(Head) for all the remaining coins.

(3) “In Brunner and Schimmack (2020), there is a problem with ‘Theorem 1 states that success rate and mean power are equivalent …’ Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size.”

There is no problem with Theorem 1. The theorem says that in the coin tossing experiment just described, suppose you (1) randomly select a coin from the population, and (2) toss it — so there are two stages of randomness. Then the probability of observing a head is exactly equal to the mean P(Heads) for the entire set of coins. This is pretty cool if you think about it. The theorem makes no use of the concept of effect size. In fact, it’s not directly about estimation at all; it’s actually a well-known result in pure probability, slightly specialized for this setting. The reviewer says “Focusing on observed power …” But why would he or she focus on observed power? We are talking about true power here.
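The two-stage coin experiment is easy to check by simulation. The sketch below is only an illustration: the population of power values (uniform between 0.05 and 0.95), the seed, and the number of draws are arbitrary assumptions, not values from the paper.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# Hypothetical population of coins: each value is one study's true power.
powers = [random.uniform(0.05, 0.95) for _ in range(1_000)]
mean_power = sum(powers) / len(powers)

# Two stages of randomness: pick a coin at random, then toss it once.
n_draws = 200_000
heads = 0
for _ in range(n_draws):
    p = random.choice(powers)
    if random.random() < p:
        heads += 1
success_rate = heads / n_draws

print(f"mean P(Heads) over all coins:  {mean_power:.3f}")
print(f"simulated P(Heads), two-stage: {success_rate:.3f}")
```

The two printed numbers agree up to simulation error, and no effect size appears anywhere in the calculation.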

(4) “Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution).”

They said it couldn’t be done. Actually, deriving the distribution of the p-value under the alternative hypothesis is a reasonable homework problem for a masters student in statistics. I could give some hints …
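As one hint, here is a sketch for the simplest case: a one-sided z-test whose test statistic is normal with mean delta and standard deviation 1. The function names and the symbol delta are assumptions for this illustration. The event "p-value at most x" is the same as "Z at least z_x", where z_x is the upper-x quantile of the standard normal, so P(p <= x) = 1 - Phi(z_x - delta), and differentiating gives the density phi(z_x - delta) / phi(z_x).

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_value_cdf(x: float, delta: float) -> float:
    """P(p-value <= x) for a one-sided z-test with statistic ~ N(delta, 1)."""
    z_x = STD_NORMAL.inv_cdf(1 - x)  # critical value with upper-tail area x
    return 1 - STD_NORMAL.cdf(z_x - delta)


def p_value_density(x: float, delta: float) -> float:
    """Density of the p-value at x: shifted normal density over the unshifted one."""
    z_x = STD_NORMAL.inv_cdf(1 - x)
    return NormalDist(delta, 1).pdf(z_x) / STD_NORMAL.pdf(z_x)


for delta in (0.0, 1.0, 2.0, 3.0):
    print(f"delta = {delta:.1f}: P(p <= .025) = {p_value_cdf(0.025, delta):.3f}")
```

With delta = 0 the density is 1 everywhere (the uniform case); with delta > 0 it is strictly decreasing in x, so probability piles up near zero.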

(5) “Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).”

Indeed, z-curve uses only p-values, from which useful estimates of effect size cannot be recovered. As previously stated, z-curve does not estimate power for individual studies. However, the reviewer is aware that p-values have a probability distribution. Intuitively, shouldn’t the distribution of p-values and the distribution of power values be connected in some way? For example, if all the null hypotheses in a population of tests were true so that all power values were equal to 0.05, then the distribution of p-values would be uniform on the interval from zero to one. When the null hypothesis of a test is false, the distribution of the p-value is right skewed and strictly decreasing (except in pathological artificial cases), with more of the probability piling up near zero. If average power were very high, one might expect a distribution with a lot of very small p-values. The point of this is just that the distribution of p-values surely contains some information about the distribution of power values. What z-curve does is to massage a sample of significant p-values to produce an estimate, not of the entire distribution of power after selection, but just of its population mean. It’s not an unreasonable enterprise, in spite of what the reviewer thinks. Also, it works well for large samples of studies. This is confirmed in the simulation studies reported by Brunner and Schimmack.

(6) “The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic.”

Okay, no problem. As described above, every study in the population of studies has its own test statistic, its own true (not estimated) effect size, its own sample size — and therefore its own true power. The relative frequency histogram of these numbers is the true population distribution of power.

(7) “There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively).”

Right. These were examples, illustrating the distribution of power before versus after selection for significance — as given in Theorem 2. Theorem 2 applies to any distribution of true power values.

(8) “If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution …”

The reviewer is objecting to Step 1 of constructing a z-curve estimate, given on page 6 of Brunner and Schimmack (2020). We start with a sample of significant p-values, arising from a variety of statistical tests, various F-tests, chi-squared tests, whatever — all with different sample sizes. Then we pretend that all the tests were actually two-sided z-tests with the results in the predicted direction, equivalent to one-sided z-tests with significance level 0.025. Then we transform the p-values to obtain the z statistics that would have generated them, had they actually been z-tests. Then we do some other stuff to the z statistics.
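The transformation in this step can be sketched in a few lines. This is an illustration under the stated pretence that every test was a two-sided z-test; the function name and the example p-values are made up, and the code is not taken from the z-curve package.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_to_z(p_two_sided: float) -> float:
    """Absolute z-score a two-sided z-test would need to produce this p-value."""
    return STD_NORMAL.inv_cdf(1 - p_two_sided / 2)


# Hypothetical significant p-values, as if reported for t-, F-, or chi-squared tests.
p_values = [0.049, 0.02, 0.004, 0.0003]
z_scores = [p_to_z(p) for p in p_values]

for p, z in zip(p_values, z_scores):
    print(f"p = {p:<7} -> z = {z:.2f}")
```

Every significant p-value lands above 1.96 on the z scale, whatever test originally produced it.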

But as the reviewer notes, most of the tests probably are not z-tests. The distributions of their p-values, which depend on the non-central distributions of their test statistics, are different from one another, and also different from the distribution for genuine z-tests. Our paper describes it as an approximation, but why should it be a good approximation? I honestly don’t know, and I have given it a lot of thought. I certainly would not have come up with this idea myself, and when Uli proposed it, I did not think it would work. We both came up with a lot of estimation methods that did not work when we tested them out. But when we tested this one, it was successful. Call it a brilliant leap of intuition on Uli’s part. That’s how I think of it.

Uli’s comment.
It helps to know your history. Well before psychologists focused on effect sizes for meta-analysis, Fisher already had a method to meta-analyze p-values. P-Curve is just a meta-analysis of p-values with a selection model. However, p-values have ugly distributions and Stouffer proposed the transformation of p-values into z-scores to conduct meta-analyses. This method was used by Rosenthal to compute the fail-safe-N, one of the earliest methods to evaluate the credibility of published results (Fail-Safe-N). Ironically, even the p-curve app started using this transformation (p-curve changes). Thus, p-curve is really a version of z-curve. The problem with p-curve is that it has only one parameter and cannot model heterogeneity in true power. This is the key advantage of z-curve 1.0 over p-curve (Brunner & Schimmack, 2020). P-curve is even biased when all studies have the same population effect size, but different sample sizes, which leads to heterogeneity in power (Brunner, 2018).

Such things are fairly common in statistics. An idea is proposed, and it seems to work. There’s a “proof,” or at least an argument for the method, but the proof does not hold up. Later on, somebody figures out how to fill in the missing technical details. A good example is Cox’s proportional hazards regression model in survival analysis. It worked great in a large number of simulation studies, and was widely used in practice. Cox’s mathematical justification was weak. The justification starts out being intuitively reasonable but not quite rigorous, and then deteriorates. I have taught this material, and it’s not a pleasant experience. People used the method anyway. Then decades after it was proposed by Cox, somebody else (Aalen and others) proved everything using a very different and advanced set of mathematical tools. The clean justification was too advanced for my students.

Another example (from mathematics) is Fermat’s last theorem, which took over 300 years to prove. I’m not saying that z-curve is in the same league as Fermat’s last theorem, just that statistical methods can be successful and essentially correct before anyone has been able to provide a rigorous justification.

Still, this is one place where the reviewer is not completely mixed up.

Another Uli comment
Undergraduate students are often taught different test statistics and distributions as if they are totally different. However, most tests in psychology are practically z-tests. Just look at a t-distribution with N = 40 (df = 38) and try to see the difference to a standard normal distribution. The difference is tiny and becomes invisible when you increase sample sizes above 40! And F-tests? F-values with 1 numerator degree of freedom are just squared t-values, so the square root of these is practically a z-test. But what about chi-square? Well, with 1 df, chi-square is just a squared z-score, so we can use the square root and have a z-score. But what if we don’t have two groups, but compute correlations or regressions? Well, the statistical significance test uses the t-distribution and sample sizes are often well above 40. So, t and z are practically identical. It is therefore not surprising to me that empirical results based on different test statistics can be approximated with the standard normal distribution. We could make teaching statistics so much easier, instead of confusing students with F-distributions. The only exceptions are complex designs with 3 x 4 x 5 ANOVAs, but they don’t really test anything and are just used to p-hack. Rant over. Back to Jerry.
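The chi-square part of this argument can be checked numerically with nothing but the standard library. The sketch below uses the closed form for the upper tail of a chi-squared distribution with 1 df; the function names are assumptions for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def chi2_1df_p(x: float) -> float:
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


def two_sided_z_p(z: float) -> float:
    """Two-sided p-value of a z-test."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


for chi2 in (3.841, 6.635, 10.828):
    z = math.sqrt(chi2)  # square root of chi-square(1) is an absolute z-score
    print(f"chi2 = {chi2:6.3f}  z = {z:.3f}  "
          f"p(chi2) = {chi2_1df_p(chi2):.4f}  p(z) = {two_sided_z_p(z):.4f}")
```

The two p-value columns are identical, because a chi-squared variable with 1 df is exactly a squared standard normal.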

(9) “It is unclear how Theorem 2 is related to the z-curve procedure.”

Theorem 2 is about how selection for significance affects the probability distribution of true power values. Z-curve estimates are based only on studies that have achieved significant results; the others are hidden, by a process that can be called publication bias. There is a fundamental distinction between the original population of power values and the sub-population belonging to studies that produce significant results. The theorems in the appendix are intended to clarify that distinction. The reviewer believes that once significance has been observed, the studies in question no longer even have true power values. So, clarification would seem to be necessary.
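A small numerical example shows how large the shift can be. The five power values below are invented, and the reweighting is simply what the coin model implies (each study survives selection with probability equal to its own power); it is meant as an illustration of the distinction, not as a restatement of the theorem in the paper.

```python
# Hypothetical population of true power values before selection.
powers = [0.05, 0.20, 0.50, 0.80, 0.95]

# Each study is significant with probability equal to its own power, so after
# selection a study's share of the sub-population is proportional to its power.
total = sum(powers)
weights_after = [p / total for p in powers]

mean_before = sum(powers) / len(powers)
mean_after = sum(w * p for w, p in zip(weights_after, powers))

print(f"mean power before selection: {mean_before:.3f}")
print(f"mean power after selection:  {mean_after:.3f}")
```

In this toy population, mean power is 0.50 before selection and about 0.73 among the studies that reach significance, which is why the two populations have to be kept apart.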

(10) “In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant?”

If there were no selection for significance, this is what a reasonable person would do. But the point of the paper, and what makes the estimation problem challenging, is that all we can observe are statistics from studies with p < 0.05. Publication bias is real, and z-curve is designed to allow for it.

(11) “To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error …”

Wrong. Power is a feature of the design of a study, the significance test, and the subject population. All of these features still exist after data have been collected and the test is carried out.

Uli and Jerry comment:
Whenever a psychologist uses the word “ontological,” be very skeptical. Most psychologists who use the word understand philosophy as well as this reviewer understands statistics.

(12) “The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice.”

Observed power (a bad estimate of true power) is not the probability of significance upon exact replication. True power is the probability of significance upon exact replication. It’s based on true effect size, not estimated effect size. We were talking about true power, and we mistakenly thought that was obvious.

(13) “The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.”

This criticism would have been somewhat justified if the simulations had used p-values from a bunch of z-tests. However, they did not. The simulations reported in the paper are all F-tests with one numerator degree of freedom, and denominator degrees of freedom depending on the sample size. This covers all the tests of individual regression coefficients in multiple regression, as well as comparisons of two means using two-sample (and even matched) t-tests. Brunner and Schimmack say (p. 8):

“Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.”

So I was going to refer the reader (and the anonymous reviewer, who is probably not reading this post anyway) to the supplementary materials. Fortunately I checked first, and found that the supplementary materials include a bunch of OSF stuff like the letter submitting the article for publication, and the reviewers’ comments and so on — but not the full set of simulations. Oops.

All the code and the full set of simulation results are posted at

https://www.utstat.utoronto.ca/brunner/zcurve2018

You can download all the material in a single file at

https://www.utstat.utoronto.ca/brunner/zcurve2018.zip

After expanding, just open index.html in a browser.

Actually we did a lot more simulation studies than this, but you have to draw the line somewhere. The point is that z-curve performs well for large numbers of studies with chi-squared test statistics as well as F statistics — all with varying degrees of freedom.

(14) “The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.”

Now the reviewer is talking about the paper that was actually under review. The mistake is natural, because of our (my) error in not making sure that the full set of simulations was included in the supplementary materials. The conditions in question are not unexplored; they are thoroughly explored, and the accuracy of z-curve for large samples is confirmed.

(15+) There are some more comments by the reviewer, but these are strictly about the paper under review, and not about Brunner and Schimmack (2020). So, I will leave any further response to others.

Replicability Report for the Journal ‘Evolutionary Psychology’

Authors: Maria Soto and Ulrich Schimmack

Citation: Soto, M. & Schimmack, U. (2024, June 24). 2024 Replicability Report for the Journal 'Evolutionary Psychology'. Replicability Index.
https://replicationindex.com/2024/06/24/rr24-evopsy/

Introduction

In the 2010s, it became apparent that empirical psychology had a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability Reports aim to improve the credibility of psychological science by examining the amount of publication bias and the strength of evidence for empirical claims in psychology journals.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behaviour and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without concern about these findings’ replicability.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors choose journals they want to publish in, provide feedback to journal editors who influence selection bias and the replicability of published results, and, most importantly, inform readers of these journals.

Evolutionary Psychology

Evolutionary Psychology was founded in 2003. The journal focuses on publishing empirical, theoretical, and review articles investigating human behaviour from an evolutionary perspective. On average, Evolutionary Psychology publishes about 35 articles in 4 annual issues.

As a whole, evolutionary psychology has produced both highly robust and questionable results. Robust results have been found for sex differences in behaviors and attitudes related to sexuality. Questionable results have been reported for changes in women’s attitudes and behaviors as a function of hormonal changes throughout their menstrual cycle.

According to Web of Science, the impact factor of Evolutionary Psychology ranks 88th in the Experimental Psychology category (Clarivate, 2024). The journal has an H-Index of 48 (i.e., 48 articles have received 48 or more citations).

In its lifetime, Evolutionary Psychology has published over 800 articles. The average citation rate in this journal is 13.76 citations per article. So far, the journal’s most cited article has been cited 210 times. The article was published in 2008 and investigated the influence of women’s mate value on standards for a long-term mate (Buss & Shackelford, 2008).

The current Editor-in-Chief is Professor Todd K. Shackelford. Additionally, the journal has four other co-editors: Dr. Bernhard Fink, Professor Mhairi Gibson, Professor Rose McDermott, and Professor David A. Puts.

Extraction Method

Replicability Reports are based on automatically extracted test statistics such as F-tests, t-tests, z-tests, and chi2-tests. Additionally, we extracted 95% confidence intervals of odds ratios and regression coefficients. The test statistics were extracted from collected PDF files using custom R code. The code relies on the pdftools R package (Ooms, 2024) to render all textboxes from a PDF file into character strings. Once the text is converted, the code systematically extracts the test statistics of interest (Soto & Schimmack, 2024). PDF files identified as editorials, review papers and meta-analyses were excluded. Meta-analyses were excluded to avoid the inclusion of test statistics that were not originally published in Evolutionary Psychology. Following extraction, the test statistics are converted into absolute z-scores.

Results For All Years

Figure 1 shows a z-curve plot for all articles from 2003-2023 (see Schimmack, 2023, for a detailed description of z-curve plots). However, the number of test statistics available for 2003, 2004 and 2005 was too low for these years to be analyzed individually. Therefore, these years were combined to ensure there were enough test statistics for each time point. The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero). A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.96) as a vertical red dotted line.

Figure 1

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6. Using the expectation maximization (EM) algorithm, z-curve estimates the optimal weights for seven components located at z-values of 0, 1, …, 6 to fit the observed statistically significant z-scores. The predicted distribution is shown as a blue curve. Importantly, the model is fitted to the significant z-scores, but the model predicts the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results). Using the estimated distribution of non-significant and significant results, z-curve provides an estimate of the expected discovery rate (EDR); that is, the percentage of significant results that were actually obtained without selection for significance. Using Soric’s (1989) formula, the EDR is used to estimate the false discovery risk; that is, the maximum proportion of significant results that are false positives (i.e., the null-hypothesis is true).

Selection for Significance

The extent of selection bias in a journal can be quantified by comparing the Observed Discovery Rate (ODR) of 68%, 95%CI = 67% to 70%, with the Expected Discovery Rate (EDR) of 49%, 95%CI = 26% to 63%. The ODR is higher than the upper limit of the confidence interval for the EDR, suggesting the presence of selection for publication. Even though the gap between the ODR and the EDR estimate is smaller than the gaps commonly seen in other journals, the present results may underestimate the severity of the problem. This is because the analysis is based on all statistical results. Selection bias is even more problematic for focal hypothesis tests and the ODR for focal tests in psychology journals is often close to 90%.

Expected Replication Rate

The Expected Replication Rate (ERR) estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate (Schimmack, 2020). Several factors can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favour studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published. We suggest using the EDR and ERR in combination to estimate the actual replication rate.

The ERR estimate of 72%, 95%CI = 67% to 77%, suggests that the majority of results should produce a statistically significant, p < .05, result again in exact replication studies. However, the EDR of 49% implies that there is some uncertainty about the actual replication rate for studies in this journal and that the success rate can be anywhere between 49% and 72%.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). Using Soric’s formula (1989), the maximum false discovery rate can be calculated based on the EDR.

The EDR of 49% implies a False Discovery Risk (FDR) of 6%, 95%CI = 3% to 15%, but the 95%CI of the FDR allows for up to 15% false positive results. This estimate contradicts claims that most published results are false (Ioannidis, 2005).
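For readers who want to check the arithmetic, Soric's bound is a one-line formula: the maximum false discovery rate equals (1/EDR - 1) × alpha / (1 - alpha). The sketch below applies it to the rounded EDR values reported above; the function name is an assumption, and the code is not taken from the z-curve package.

```python
def soric_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate for a given discovery rate."""
    return (1 / edr - 1) * (alpha / (1 - alpha))


# Point estimate and confidence limits of the EDR, as rounded in the text.
for label, edr in (("EDR estimate", 0.49), ("lower limit", 0.26), ("upper limit", 0.63)):
    print(f"{label:12s} EDR = {edr:.2f} -> maximum FDR = {soric_fdr(edr):.3f}")
```

With the rounded EDR of 49% the bound comes out between 5% and 6%, and the limits of the EDR interval (26% and 63%) give roughly 15% and 3%, in line with the interval reported above. Small differences from the reported point estimate presumably reflect rounding of the EDR.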

Changes Over Time

One advantage of automatically extracted test-statistics is that the large number of test statistics makes it possible to examine changes in publication practices over time. We were particularly interested in changes in response to awareness about the replication crisis in recent years.

Z-curve plots for every publication year were calculated to examine time trends through regression analysis. Additionally, the degrees of freedom used in F-tests and t-tests were used as a metric of sample size to observe if these changed over time. Both linear and quadratic trends were considered. The quadratic term was included to observe if any changes occurred in response to the replication crisis. That is, there may have been no changes from 2000 to 2015, but increases in EDR and ERR after 2015.

Degrees of Freedom

Figure 2 shows the median and mean degrees of freedom used in F-tests and t-tests reported in Evolutionary Psychology. The mean results are highly variable due to a few studies with extremely large sample sizes. Thus, we focus on the median to examine time trends. The median degrees of freedom over time was 121.54, ranging from 75 to 373. Regression analyses of the median showed a significant linear increase by 6 degrees of freedom per year, b = 6.08, SE = 2.57, p = 0.031. However, there was no evidence that the replication crisis accelerated the increase in sample sizes, as shown by the small and non-significant non-linear trend, b = 0.46, SE = 0.53, p = 0.400.

Figure 2

Observed and Expected Discovery Rates

Figure 3 shows the changes in the ODR and EDR estimates over time. There were no significant linear, b = -0.52 (SE = 0.26, p = 0.063), or non-linear, b = -0.02 (SE = 0.05, p = 0.765), trends observed in the ODR estimate. The regression results for the EDR estimate showed no significant linear, b = -0.66 (SE = 0.64, p = 0.317), or non-linear, b = 0.03 (SE = 0.13, p = 0.847), changes over time. These findings indicate the journal has not increased its publication of non-significant results and continues to report more significant results than one would predict based on the mean power of studies.

Expected Replicability Rates and False Discovery Risks

Figure 4 depicts the false discovery risk (FDR) and the Expected Replication Rate (ERR). It also shows the Expected Replication Failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures will likely be false negative results in underpowered replication studies.

The ERR estimate did not show a significant linear increase over time, b = 0.36, SE = 0.24, p = 0.165. Additionally, no significant non-linear trend was observed, b = -0.03, SE = 0.05, p = 0.523. These findings suggest the increase in sample sizes did not contribute to a statistically significant increase in the power of the published results. They also suggest that the replicability of results in this journal has not increased over time and that the results in Figure 1 can be applied to all years.

Figure 4

Visual inspection of Figure 4 shows an EFR between 30% and 40% and an FDR between 0% and 10%. This suggests that more than half of replication failures are likely to be false negatives in replication studies with the same sample sizes rather than false positive results in the original studies. Studies with large sample sizes and small confidence intervals are needed to distinguish between these two alternative explanations for replication failures.

Adjusting Alpha

A simple solution to a crisis of confidence in published results is to adjust the criterion to reject the null-hypothesis. For example, some researchers have proposed to set alpha to .005 to avoid too many false positive results. With z-curve we can calibrate alpha to keep the false discovery risk at an acceptable level without discarding too many true positive results. To do so, we set alpha to .05, .01, .005, and .001 and examined the false discovery risk.

Figure 5

Figure 5 shows that the conventional criterion of p < .05 produces false discovery risks above 5%. The high variability in annual estimates also makes it difficult to provide precise estimates of the FDR. However, adjusting alpha to .01 is sufficient to produce an FDR with tight confidence intervals below 5%. The benefits of reducing alpha further to .005 or .001 are minimal.

Figure 6

Figure 6 shows the impact of lowering the significance criterion, alpha, on the discovery rate (lower alpha implies fewer significant results). In Evolutionary Psychology, lowering alpha to .01 reduces the observed discovery rate by about 10 to 20 percentage points. This implies that roughly 10% to 20% of results reported p-values between .05 and .01. These results often have low success rates in actual replication studies (OSC, 2015). Thus, our recommendation is to set alpha to .01 to reduce the false positive risk to 5% and to disregard studies with weak evidence against the null-hypothesis. These studies require actual successful replications with larger samples to provide credible evidence for an evolutionary hypothesis.

There are relatively few studies with p-values between .01 and .005. Thus, more conservative researchers can use alpha = .005 without losing too many additional results.

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported but do not test focal hypotheses (e.g., testing the statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

To examine the influence of automatic extraction on our results, we can compare the results to those for over 4,000 hand-coded focal hypothesis tests in over 40 journals in 2010 and 2020. The ODR was 90% around 2010 and 88% around 2020. Thus, the tendency to report significant results for focal hypothesis tests is even higher than the ODR for all results and there is no indication that this bias has decreased notably over time. The ERR increased a bit from 61% to 67%, but these values are a bit lower than those reported here. Thus, it is possible that focal tests also have lower average power than other tests, but this difference seems to be small. The main finding is that the publishing of non-significant results for focal tests remains an exception in psychology journals and probably also in this journal.

One concern about the publication of our results is that it merely creates a new criterion to game publications. Rather than trying to get p-values below .05, researchers may use tricks to get p-values below .01. However, this argument ignores that it becomes increasingly harder to produce lower p-values with tricks (Simmons et al., 2011). Moreover, z-curve analysis makes it easy to see selection bias for different levels of significance. Thus, a more plausible response to these results is that researchers will increase sample sizes or use other methods to reduce sampling error to increase power.

Conclusion

The replicability report shows that the average power to report a significant result (i.e., a discovery) ranges from 49% to 72% in Evolutionary Psychology. This finding is higher than previous estimates observed in evolutionary psychology journals. However, the confidence intervals are wide and suggest that many published studies remain underpowered. The report did not capture any significant changes over time in the power and replicability as captured by the EDR and the ERR estimates. The false positive risk is modest and can be controlled by setting alpha to .01. Replication attempts of original findings with p-values above .01 should increase sample sizes to produce more conclusive evidence. Lastly, the journal shows clear evidence of selection bias.

There are several ways the current or future editors of this journal can improve the credibility of results published in this journal. First, results with weak evidence (p-values between .05 and .01) should only be reported as suggestive results that require replication, or editors could even request a replication before publication. Second, editors should try to reduce publication bias by prioritizing research questions over results. A well-conducted study with an important question should be published even if the results are not statistically significant. Pre-registration and registered reports can help to reduce publication bias. Editors may also ask for follow-up studies with higher power to follow up on a non-significant result.

Publication bias also implies that point estimates of effect sizes are inflated. It is therefore important to take uncertainty in these estimates into account. Small samples with large sampling errors are usually unable to provide meaningful information about effect sizes and conclusions should be limited to the direction of an effect.

The present results serve as a benchmark for future years to track progress in this journal to ensure trust in research by evolutionary psychologists.

How Useful are Statistical Power calculations?

One of the greatest meta-psychologists was Jacob Cohen. He was concerned about the risk that psychologists might waste resources on studies that had a low probability to provide evidence for a true hypothesis. Following Neyman and Pearson this error is called a type-II error. It can also be called a false negative result.

Psychologists typically rely on null-hypothesis testing to provide evidence for their predictions. They set the criterion value for a statistically significant result to 5%. This means that there is only a 5% probability to get a significant result without a real effect. This is called a type-I error or a false positive result. In this approach, a type-II error occurs when a prediction is true (a treatment is effective), but the p-value is above .05.

Cohen (1962) warned psychologists that many studies have a high risk of producing false negative results, especially when effect sizes are statistically small. Even for effect sizes around the average in psychological studies, the risk of a false negative result was about 50%. Follow-up studies showed that this situation had not changed in the following decades (Sedlmeier & Gigerenzer, 1989).

One might assume that psychologists simply have little control over the false negative risk in their studies. However, that is not true. A simple way to decrease the false negative risk is to increase sample sizes. Thus, one has to wonder why psychologists did not increase sample sizes in response to evidence that they are conducting studies with a high risk of a false negative result.

Imagine a gambler who can play two slot machines. One has a 50% chance of winning, the other one has an 80% chance of winning. Which machine would you pick? The answer is obvious. The situation for a researcher is a bit different. First, they have to pay more (invest more resources in larger samples) to play the higher odds of winning (i.e., avoiding a false negative result). Second, they do not know the actual odds of winning. They merely know that the odds of winning are higher when they invest more resources. Cohen (1988) tried to help researchers to make decisions that reduce the false negative risk without paying too much for larger sample sizes. It took about 50 years after his first warning for power analyses to become more popular in psychology, which happened only in the past decade.

While better control of false negative results may seem desirable to all, a recent peer-reviewed article by Pek, Pitt, and Wegener (2024) suggests that power analyses are useless. They claim in the title “Uncertainty limits the use of power analysis.” In the article, they wonder “Isn’t use of power better than not using power at all?” and their answer is not a simple “yes” (p. 11). They say it is also not a simple “No”, but they provide no examples where power analysis is better than drawing a random number from a hat to determine the sample size of a study. In fact, they go on to state that “we recommend that researchers place limited confidence when using power to design experiments, or not use it at all as a direct justification for determining N” (p. 11). If that does not mean “power analysis is useless”, they do a very good job at hiding the benefits of power analysis.

It is remarkable that such a harsh criticism of Cohen can be published in a leading psychology journal without even mentioning Cohen’s work. It is also remarkable that they never mention false negative results / type-II errors, although power is defined as the probability of avoiding a type-II error (beta = type-II error, power = 1 – beta). So, we do not know what Pek et al.’s (2024) suggestion for researchers is when they get a non-significant result. Maybe somebody should write to them. “Hey I just did a study with a randomly generated sample size and got a non-significant result. What now?”

Two other giants in the history of psychology wrote an article in 1971 about the problem with small samples that often have a high false negative risk (Kahneman & Tversky, 1971). They wrote “We refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.” (p. 110).

What Pek et al. do not tell their readers is the real reason why psychologists ignored false negative results and continued to use small samples. The reason is not that they rarely have false negative results. The reason is that they invest relatively few resources in their studies so that they can conduct many studies or many tests within a study to get at least one significant result. The non-significant results are simply discarded. This is known as using questionable research practices because researchers are not disclosing all results. This increases the risk that the published results are false positives. If a researcher tests 20 false hypotheses, they can expect to get one p-value below .05. If they do not disclose that they ran 20 tests, readers cannot see that the 1 significant result was expected by chance alone.

Pek et al. also do not tell readers why power analysis has become more popular in the past decade. The reason is that a high rate of false negative results makes significant results less informative. Imagine that researchers test 100 true hypotheses and 100 false hypotheses. The 100 false hypotheses are expected to produce 5 significant results. This is implied by the use of the 5% criterion. If the 100 true hypotheses are tested with only 10% power, we have 10 true findings and 90 false negative results. Now we publish only the significant results, which means there are 15 results: 5 are false findings (a medicine does not work) and 10 are true findings (a medicine works). This means one third of published findings are false. Cohen recommended planning studies with 80% power. This would mean we get 80 true findings and 5 false findings. As a result, only 6% of the published results are false. Would you still believe Pek et al. (2024) that power analyses are useless or would you rather wonder whether the average power in psychology is closer to 10% or 80%?
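The arithmetic in this example is easy to check. The following sketch is purely illustrative (the function name is made up for this post) and simply counts expected true and false significant results when only significant results are published.

```python
def false_share_of_published(n_true: int, n_false: int,
                             power: float, alpha: float = 0.05) -> float:
    """Expected share of false findings among significant (published) results.

    n_true  : number of tested hypotheses that are true (real effect)
    n_false : number of tested hypotheses that are false (no effect)
    power   : probability of a significant result when the effect is real
    alpha   : probability of a significant result when there is no effect
    """
    true_findings = n_true * power    # expected significant results with a real effect
    false_findings = n_false * alpha  # expected significant results without an effect
    return false_findings / (true_findings + false_findings)


if __name__ == "__main__":
    for power in (0.10, 0.80):
        share = false_share_of_published(100, 100, power)
        print(f"power = {power:.0%}: {share:.1%} of published findings are false")
```

With 10% power the share is 5 out of 15, and with 80% power it is 5 out of 85, which matches the one third and 6% figures above.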

The key argument in Pek et al.’s article is that power is a hypothetical construct because we do not know whether the predicted effect is small, medium, or large. First of all, this is not true of all science. Some sciences have theories that make quantitative predictions. Even psychologists may have some idea whether they are testing a weak, moderate, or strong effect. However, we do not even need to know what the true effect size is. We can conduct a power analysis based on an effect size that is theoretically interesting. For example, the question whether money buys happiness or not is a silly question. A more interesting question is how much happiness money can buy. Let’s say that money is only important for a theory of happiness if the correlation between money and happiness is at least r = .1, what Cohen calls a weak effect. Power analysis not only helps us to determine a reasonable sample size to look for this correlation, it can also help to make non-significant results informative. For example, if we power a study to have a 95% chance to be significant with a correlation of r = .1, and we obtain a non-significant result, the chance that this result is a false negative result is less than 5%. We may therefore be willing to accept the hypothesis that the correlation is less than .1 and conclude that money has a negligible influence on happiness (BTW the true correlations tend to be between .1 and .3). This is valuable information that can only be obtained by considering the risk of a false negative finding. Finding a non-significant result in a study with N = 20 people does not warrant the conclusion that money does not matter much for happiness because the false negative risk is too high. Pek et al. (2024) ignore all of this useful information that power analyses can provide even if there is great uncertainty about the true power of a study. Researchers can also estimate the average power of their studies after the fact by keeping track of their success record.
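To make the money and happiness example concrete, here is a rough sketch of the sample size needed to detect r = .1 with 95% power. It uses the Fisher z approximation for a two-sided test of a correlation, which is my choice for illustration; dedicated power software may give slightly different numbers. The function names are hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def n_for_correlation(r: float, power: float = 0.95, alpha: float = 0.05) -> int:
    """Approximate N needed to detect a correlation r (two-sided test).

    Uses the Fisher z transformation, where atanh(r) has a standard
    error of about 1 / sqrt(N - 3).
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


def power_for_correlation(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test of a correlation r with N = n."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    expected_z = atanh(r) * sqrt(n - 3)
    # Probability of landing in either rejection region.
    upper = 1.0 - STD_NORMAL.cdf(z_alpha - expected_z)
    lower = STD_NORMAL.cdf(-z_alpha - expected_z)
    return upper + lower


if __name__ == "__main__":
    n = n_for_correlation(0.1)
    print("N for 95% power at r = .1:", n)
    print("Power at r = .1 with N = 20:", round(power_for_correlation(0.1, 20), 3))
```

Under this approximation, a study needs roughly 1,300 participants for a non-significant result to be informative about r = .1, whereas a study with N = 20 has less than a 10% chance of detecting such a correlation.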

If a researcher conducts 20 statistical tests and finds only 4 significant results, the average power is about 20%. According to Kahneman and Tversky (1971) any serious researcher would have to wonder whether they are just testing a lot of false hypotheses or whether they produce a lot of false negative results. No serious researcher should just continue doing what they are doing and just publish the 4 significant results and call it a day. However, that is what social psychologists like Pek’s co-author Duane Wegener have been doing for decades, while ignoring power analyses. This has led to the replication crisis in social psychology that has uncovered many false findings. At least Nobel Laureate Daniel Kahneman had the humility to recognize his mistake. “What the blog gets absolutely right is that I placed too much faith in underpowered studies” (Kahneman, 2017).

Kahneman (2017) also points out that we need science to make new discoveries and to correct false beliefs, but that science can only serve this function when all relevant results are published. And that was not the case in social psychology. Non-significant results were ignored and only significant results that confirmed even the most implausible predictions were published. This bias is evident in the high percentage of significant results in psychology journals (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). With success rates of 90%, honest reporting would imply that psychologists only test true hypotheses with high power. Ironically, this would mean that psychologists do not need power analysis because they miraculously never obtain false negative results. The real reason for the 90% success rate is rather different. A replication project found only 25% significant results in replication studies of social psychologists (Open Science Collaboration, 2015), suggesting that most studies are well below the 50% criterion for serious researchers (Kahneman & Tversky, 1971). These are well known facts that Pek et al. (2024) and the editor who published this article simply ignore and are hiding from readers who are not familiar with the history of power analysis.

Finally, Pek et al.’s (2024) concern about the uncertainty about the true power of a study is irrelevant for the usefulness of power analyses. The true power of a study is less important than the truthful reporting of results. Uncertainty about true power implies that even researchers who conduct power analyses will sometimes conduct studies that produce false negative results. First, Cohen’s recommendation to aim for 80% power implies that 20% of tests of a true hypothesis will produce false negative results. Second, power analyses can overestimate true power and the false negative risk can be even greater than 20%. This is not a problem if the results are published and combined with other evidence that can correct false negative results. This is what researchers in medicine do. Here, studies have only about 30% power on average, but non-significant results are reported and meta-analyses can reduce the risk of false conclusions. Thus, the biggest threat to psychology as a science is uncertainty about the honest reporting of results and not uncertainty about true power. The advantage of conducting power analyses is that it is more likely that we have honest and credible evidence when researchers conduct a few studies with high power than many studies with low power. This is what Cohen meant when he said “Less is more, except for sample size.”

In conclusion, if you are new to statistical power and its role in psychological science, I recommend reading Cohen (1988, 1992) and ignoring Pek et al.’s (2024) useless article. A simple truth about power is that the percentage of significant results in a set of studies is an estimate of the mean power of studies. If you see a set of studies with over 90% significant results, you have to ask yourself: did these studies really test only true hypotheses with high power, or did researchers not report studies that failed to support their claims (Schimmack, 2012)? I trust you to come to the right conclusion, but you can also use power calculations to test for the presence of selection bias. But that is a story for another day.
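As a small preview of that story, the sketch below shows one simple way to ask whether a success record is credible. It assumes that you already have some value for the mean power of a set of studies (here a made-up input) and computes the binomial probability of observing at least as many significant results as were reported. This is only an illustration of the logic, not the exact procedure used in any published test of selection bias, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(k_significant: int, n_studies: int, mean_power: float) -> float:
    """Probability of at least k significant results in n studies.

    Treats every study as an independent trial that succeeds with
    probability mean_power (a simplifying assumption).
    """
    if not 0 <= k_significant <= n_studies:
        raise ValueError("k_significant must be between 0 and n_studies")
    return sum(
        comb(n_studies, i) * mean_power ** i * (1.0 - mean_power) ** (n_studies - i)
        for i in range(k_significant, n_studies + 1)
    )


if __name__ == "__main__":
    # Hypothetical example: 18 of 20 reported tests are significant,
    # and the assumed mean power is 50%.
    p = prob_at_least(18, 20, 0.50)
    print(f"P(at least 18 of 20 significant | mean power 50%) = {p:.6f}")
```

If this probability is very small, the reported success rate is hard to reconcile with the assumed power, and unreported non-significant results become the more plausible explanation.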

Dear Anonymous Reviewer…

Peer-review is the foundation of science. Peer-reviewers work hard to evaluate manuscripts to see whether they are worthy of being published, especially in old-fashioned journals with strict page limitations. Their hard work often goes unnoticed because peer-reviews remain unpublished. This is a shame. A few journals have recognized that science might benefit from publishing reviews. Not all reviews are worthy of publication, but when a reviewer spends hours, if not days, to write a long and detailed comment, it seems only fair to share the fruits of their labor in public. Unfortunately, I am not able to give credit to Reviewer 1 who was too modest or shy to share their name. This does not undermine the value they created and I hope the reviewer may find the courage to take credit for their work.

Reviewer 1 was asked to review a paper that used z-curve to evaluate the credibility of research published in the leading emotion journals. Yet, going beyond the assigned task, Reviewer 1 gave a detailed and thorough review of the z-curve method that showed the deep flaws of this statistical method that had been missed by reviewers of articles that promoted this dangerous and misleading tool. After a theoretical deep-dive into the ontology of z-curve, Reviewer 1 points out that simulation studies seem to have validated the method. Yet, Reviewer 1 was quick to notice that the simulations were a sham and designed to show that z-curve works rather than to see it fail in applications to more realistic data. Deeply embarrassed, my co-authors, including a Professor of Statistics, are now contacting journals to retract our flawed articles.

Please find the damaging review of z-curve below.

P.S. We are also offering a $200 reward for credible simulation studies that demonstrate that z-curve is crap.

P.P.S. Some readers seem to have missed the sarcasm and taken the criticism by Reviewer 1 seriously. The problem is a lack of expertise to evaluate the conflicting claims. To make it easy, I share an independent paper that validated z-curve with actual replication outcomes. Not sure how Reviewer 1 would explain the positive outcome. Maybe we hacked the replication studies, too?

Röseler, Lukas, 2023. “Predicting Replication Rates with Z-Curve: A Brief Exploratory Validation Study Using the Replication Database,” MetaArXiv ewb2t, Center for Open Science.

ANONYMOUS, July 17, 2024

Referee: 1

Comments to the Author
The manuscript “Credibility of results in emotion science: A z-curve analysis of results in the journals Cognition & Emotion and Emotion” (CEM-DA.24) presents results from a z-curve analysis on reported statistics (t-tests, F-tests, and chi-square tests with df < 6 and 95% confidence intervals) for empirical studies (excluding meta-analysis) published in Cognition & Emotion from 1987 to 2023 and Emotion from 2001 to 2023. The purposes of reporting results from a z-curve analysis are to (a) estimate selection bias in emotion research and (b) predict a success rate in replication studies.

I have strong reservations about the conclusions drawn by the authors that do not seem to be strongly supported by their reported results. Specifically, I am not confident that conclusions from z-curve results justify the statements made in the paper under review. Below, I outline the main concerns that center on the z-curve methodology that unfortunately focuses on providing a review on Brunner and Schimmack (2020) and not so much on the current paper.

VAGUE METHODOLOGY. The authors make strong claims about what conclusions can be drawn from z-curve analyses. Their presentation of z-curve analysis in the present paper is declarative and does not provide the necessary information to describe the assumptions made by the method, how it works, when it fails, etc. The authors cite previous publications on z-curve (Brunner & Schimmack, 2020; Bartos & Schimmack, 2022; Schimmack & Bartos, 2023). Furthermore, this work ignores recent criticism in the literature about such statistical forensics. One example questioning the validity of conclusions by tests of credibility/replicability (e.g., p-curve, Francis’s [2013] consistency test) is in a talk by Richard Morey titled “Statistical games: Flawed thinking of popular methods for assessing reproducibility” (https://www.youtube.com/watch?v=c0G98qp1cf4). The talk was based on Morey (2013). Other authors who have written on this topic include McShane, Böckenholt, and Hansen (2020) and Pek, Hoisington-Shaw, & Wegener (2022).

==
Morey, R. D. (2013). The consistency test does not–and cannot–deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180-183. https://doi.org/10.1016/j.jmp.2013.03.004

McShane, B. B., Böckenholt, U., & Hansen, K. T. (2020). Average Power: A Cautionary Note. Advances in Methods and Practices in Psychological Science, 3(2), 185–199. https://doi.org/10.1177/2515245920902370

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57(5), 153-169. https://doi.org/10.1016/j.jmp.2013.02.003

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding Questionable Research Practices Surrounding Statistical Power Analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding Questionable Practices in Applied Psychology (pp. 243–267). Springer. https://doi.org/10.1007/978-3-031-04968-2_11
==

In reading Brunner and Schimmack (2020), z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis (whether it be for a single study or for a set of studies; see Pek, Hoisington-Shaw, & Wegener, in press for a treatment of this misconception).

It should be noted that power is *not* a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should *always* be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

O’Keefe, D. J. (2007). Post hoc power, observed power, a priori power, retrospective power, prospective power, achieved power: Sorting out appropriate uses of statistical power analyses. Communication Methods and Measures, 1(4), 291–299. https://doi.org/10.1080/19312450701641375
==

In Brunner and Schimmack (2020), there is a problem with “Theorem 1 states that success rate and mean power are equivalent even if the set of coins is a subset of all coins.” Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size. Methodological papers that deal with power analysis making use of estimated effect size show that the uncertainty due to sampling variability is extremely high (e.g., see Anderson et al., 2017; McShane & Böckenholt, 2016); it is worse when effects are random (cf. random effects meta-analysis; see McShane, Böckenholt, & Hansen, 2020; Pek, Pitt, & Wegener, 2024). Accepting that effects are random seems to be more consistent with what we observe in empirical results of the same topic. The extent of uncertainty in power estimates (based on observed effects) is so high that much cannot be concluded with such imprecise calculations.

Coming back to p-values, these statistics have their own distribution (which cannot be derived unless the effect size is null, in which case the p-value follows a uniform distribution). Because p-values have sampling variability (and an unknown sampling distribution), one cannot take a significant p-value to deterministically indicate a tally on power (which assumes that an unknown specific effect size is true). Stated differently, a significant p-value can be consistent with a Type I error. Now, if the counter argument is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).
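
The textbook definition referred to here can be written out as a minimal sketch. It assumes a two-sided, two-sample z-test with equal group sizes and a standardized mean difference d; the function name and the example values are hypothetical. Every ingredient listed above (sample size, effect size, Type I error rate, sampling distribution) appears as an explicit input or assumption.

```python
from statistics import NormalDist

def power_two_sample_z(d, n_per_group, alpha=0.05):
    """Power of a two-sided z-test for a standardized mean difference d,
    assuming equal group sizes and a normal sampling distribution."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)      # set by the Type I error rate
    expected_z = d * (n_per_group / 2) ** 0.5  # set by effect size and sample size
    # Area of the sampling distribution that falls in the rejection regions.
    return (1 - norm.cdf(z_crit - expected_z)) + norm.cdf(-z_crit - expected_z)

# With a null effect, "power" reduces to the Type I error rate.
print(round(power_two_sample_z(d=0.0, n_per_group=50), 3))
# An assumed medium effect with 64 participants per group.
print(round(power_two_sample_z(d=0.5, n_per_group=64), 3))
```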

There seems to be some conceptual slippage on the meaning of power here because what the authors call power does not seem to have the defining features of power.

The problem with Theorem 2 in Brunner and Schimmack (2020) is that it assumes some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic. There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figures 1 and 2 in Brunner and Schimmack, 2020, respectively). If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores. P-values are statistics and follow a sampling distribution; the variance of the sampling distribution is a function of sample size. So, it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution. If the further argument is using a mixture of z-distributions to estimate the distribution of the z-scores, then these z-scores are not technically z-scores in that they are not distributed following the z-distribution. We might estimate the standard error of the mixture of z-distributions to rescale the distribution again to a z-distribution… but to what end? Again, there is some conceptual slippage in what is meant by a z-score. If the distribution of p-values that have been transformed to z-scores is not a z-distribution and the mixture distribution is then shaped back into a z-distribution (with truncations that seem arbitrary) so that the critical value of 1.96 can be used – I’m not sure what the resulting distribution is of, anymore.
A related point is that we do not yet know whether p-values are transformation invariant (in distribution) under a z-score transformation. Furthermore, the distribution for power invoked in Theorem 1 is not a function of sample size, effect size, or statistical procedure, suggesting that the assumed distribution does not align well with the features that we know influence power. It is unclear how Theorem 2 is related to the z-curve procedure. Again, there seems to be some conceptual slippage involved with p-values being transformed into z-scores that somehow give us an estimate of power (without stating the effect size, sample size, or procedure).

In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion that are significant? After all, p-values can be transformed to z-scores and vice versa in that they carry the same information. But then there is the problem that p-values have sampling variability and might be consistent with a Type I error. A transformation from p to z will not fix sampling variability.
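
The simple count suggested here can be sketched in a few lines. The function name and the example p-values are hypothetical; the sketch only shows the proportion of significant results and is not part of the z-curve procedure.

```python
def success_rate(p_values, alpha=0.05):
    """Proportion of reported p-values that fall below the significance criterion."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    return sum(p < alpha for p in p_values) / len(p_values)

# Made-up p-values, for illustration only.
reported = [0.001, 0.03, 0.20, 0.049]
print(success_rate(reported))
```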

To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error about the meaning of frequentist probability. A thought experiment might help. Suppose I completed a study, and the p-value is .50. I convert this p-value to a z-score for a two-tailed test and get 0.67. Let’s say I collect a bunch of studies and do this and get a distribution of z-scores (that don’t end up being distributed z). I do a bunch of things to make this distribution become a z-distribution. Then, I define power as the proportion of z-scores above the cutoff of 1.96. We are now calling power a collection of z-scores above 1.96 (without controlling for sample size, effect size, and procedure). This newly defined “power” based on the z-distribution does not reflect the original definition of power (area under the curve for a specific effect size, a specific procedure, and a specific sample size, assuming the Type I error is .05). This conceptual slippage is akin to burning a piece of wood, putting the ashes into a mold that looks like wood, and calling the molded ashes wood.
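
The conversion in this thought experiment can be checked with a minimal sketch that assumes a two-tailed test and the standard normal distribution. The function names are hypothetical.

```python
from statistics import NormalDist

def p_to_z(p):
    """Absolute z-score for a two-tailed p-value, with 0 < p <= 1."""
    return NormalDist().inv_cdf(1 - p / 2)

def z_to_p(z):
    """Two-tailed p-value for a z-score."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p_to_z(0.50), 2))  # the p = .50 example from the text
print(round(p_to_z(0.05), 2))  # the conventional significance cutoff
```

The round trip `z_to_p(p_to_z(p))` returns the original p-value, which is the sense in which the two statistics carry the same information.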

The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice. Furthermore, recall that power is a property of a procedure and is not a property of completed data (cf. ontological error), thus using observed power to quantify replicability presents replicability as a property of a procedure and not about the robustness of an observed effect. Again, there seems to be some conceptual slippage occurring here on what is meant by replication versus what is quantifying replication (which should not be observed power).

The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.

Further, the evidence that z-curve performs well is specific to the assumptions within the simulation study. If p-values were generated in a different way to reflect a competing tentative process, the performance of z-curve would be different. The simulation study evaluated z-curve in constrained scenarios, including F-tests with df = 1, and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.

==
Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33(2), 175–183. https://doi.org/10.1111/j.1469-8986.1996.tb02121.x

McShane, B. B., & Böckenholt, U. (2016). Planning Sample Sizes When Effect Sizes Are Uncertain: The Power-Calibrated Effect Size Approach. Psychological Methods, 21(1), 47–60. https://doi.org/10.1037/met0000036

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (in press). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. https://www.researchgate.net/publication/368358276_Uses_of_uncertain_statistical_power_Designing_future_studies_not_evaluating_completed_studies

Pek, J., Pitt, M. A., & Wegener, D. T. (2024). Uncertainty limits the use of power analysis. Journal of Experimental Psychology: General, 153(4), 1139–1151. https://doi.org/10.1037/xge0001273
==

INPUT DATA. The authors made use of statistics reported in empirical research published in Cognition & Emotion and Emotion. Often, articles might report several studies, and studies would have several models, and models would contain several tests. Thus, there might be a nested structure of tests nested within models, models nested within studies, and studies nested within articles. It does not seem that this nesting is taken into account to provide a good estimate of selection bias and the expected replication rate. Thus, the estimates provided cannot be deemed unbiased (e.g., estimates would be biased toward articles that tend to report a lot of statistics compared to others).
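
The weighting concern can be illustrated with a minimal sketch that compares a pooled, test-level proportion of significant results with an article-level average. The function name, the article labels, and the p-values are hypothetical. This is neither the authors' analysis nor a full multilevel model; it only shows how articles that report many tests can dominate a pooled estimate.

```python
from collections import defaultdict

def pooled_and_article_level_rates(tests, alpha=0.05):
    """`tests` is an iterable of (article_id, p_value) pairs.

    Returns (pooled_rate, article_level_rate): the first weights every test
    equally, the second weights every article equally.
    """
    by_article = defaultdict(list)
    for article_id, p in tests:
        by_article[article_id].append(p < alpha)
    if not by_article:
        raise ValueError("at least one test is required")
    n_tests = sum(len(flags) for flags in by_article.values())
    pooled = sum(sum(flags) for flags in by_article.values()) / n_tests
    per_article = [sum(flags) / len(flags) for flags in by_article.values()]
    return pooled, sum(per_article) / len(per_article)

# Made-up example: article A reports 8 significant tests, article B reports 2 nonsignificant ones.
tests = [("A", 0.01)] * 8 + [("B", 0.50)] * 2
pooled, article_level = pooled_and_article_level_rates(tests)
print(pooled, article_level)
```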

As the authors admit, there is no separation of statistical tests used for manipulation checks, preliminary analyses, or tests of competing and alternative hypotheses. Given that the sampling of the statistics might not be representative of key findings in emotion research, little confidence can be placed in the accuracy of the estimates reported and the strong claims being made using them (about emotion research in general).

Finally, the authors excluded chi-square tests with degrees of freedom larger than 6. This would mean that tests of independence with designs larger than a 2×2 contingency table would be excluded (or tests of independence with 6 categories). In general, the authors need to be careful about which conditions their conclusions apply to.

UNSUBSTANTIATED CONCLUSIONS. The key conclusions made by the authors are that there is selection bias in emotion research, and there is a success rate of 70% in replication studies. These conclusions are drawn from z-curve analysis, whose validity I question. My concerns about the z-curve procedure have to do with ontological errors made about the probability attached to the concept of power, the rationale for z-transformations on p-values (along with strange distributional gymnastics with little justification provided in the original paper), and equating power with replication.

Even if the z-curve is valid, its performance should be better evaluated to show that it applies to the conditions of the data used in the current study. Furthermore, improving the quality of the data used in the z-curve analysis, in terms of selection criteria (e.g., excluding tests for manipulation checks, etc.) and modeling the nested structure inherent in reported results, would go a long way toward ensuring that the estimate provided is as unbiased as can be.

Finally, it seems odd to conclude selection bias based on data with selection bias. There might be some tautology going on within the argument. An analogy about missing data might help. Given a set of data that we assume has undergone selection (i.e., part of the distribution is missing), how can we know from the data what is missing? The only way to talk about the missing part of the distribution is to assume a distribution for the “full” data that subsumes the observed data distribution. But who can say that the assumed distribution is the correct one that would have generated the full data? Our selected data does not have the features to let us infer what the full distribution should be. How can we know that what we observe has undergone selection bias without knowledge of the selection process (cf. distribution of the full data) unless some implicit assumption is made? We are not given the assumption and therefore cannot evaluate whether the assumption is valid. I cannot tell what assumptions z-curve makes about selection.

Replicability Reports of Psychology Journals – 2024

Citation: Soto, M. & Schimmack, U. (2024, July 5). Replicability Report of Psychology Journals – 2024. Replicability Index.
https://replicationindex.com/2024/07/05/replicability-reports-of-psychology-journals/

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability reports use z-curve to provide information about psychological journals. Authors can use this information to choose the journals they want to publish in. It also provides feedback to journal editors, who influence the selection bias and replicability of results published in their journals, and, most importantly, it informs readers of these journals.

List of Journals with Replicability Reports for 2024

Acta Psychologica (2000-2023)

Evolution and Human Behavior (2000-2023)

Evolutionary Psychology (2000-2023)

Journal of Experimental Social Psychology (2000-2023)