Category Archives: Credibility Crisis

Review of “With Low Power Comes Low Credibility?”

Target Article (pun intended, LOL):
Lengersdorff LL, Lamm C. With Low Power Comes Low Credibility? Toward a Principled Critique of Results From Underpowered Tests. Advances in Methods and Practices in Psychological Science. 2025;8(1). doi:10.1177/25152459241296397

🔄 Re-evaluated Score: 4/10

I asked ChatGPT to review the paper. Then I wrote my own review and shared it with ChatGPT. It then compared the reviews. The final score is a 4/10.

Summary

Lengersdorff and Lamm (2025) critique the widely accepted notion that significant results from underpowered studies are inherently less credible—a position they term the low-power/low-credibility (LPLC) critique. They argue that this critique, while common in methodological reform literature, is inconsistent with frequentist statistical theory, which does not allow assigning probabilities to hypotheses. Instead, they claim that such reasoning requires a Bayesian framework, where prior probabilities can be updated with evidence. Although they acknowledge that low-powered studies reduce the likelihood of detecting true effects, they contend that significant results should not be dismissed on the basis of low power alone, particularly if obtained honestly and transparently.

Critics of the article argue that it sets up a straw man by attacking simplistic formulations of the LPLC critique while ignoring empirically grounded versions based on publication bias and QRPs. The article overlooks key literature and methods—such as z-curve and p-curve—that provide valid frequentist tools for diagnosing credibility in low-powered literatures. Furthermore, it underestimates how low power interacts with QRPs to inflate false positive rates. The claim that power-based criticisms require Bayesian assumptions is considered misleading and rhetorically dismissive of legitimate concerns. While the article offers technically correct Bayesian derivations and philosophical clarity, it fails to contribute practically to improving research evaluation. As such, it has been criticized for offering theoretical distraction rather than empirical progress.

Full Transcript

Here is ChatGPT’s review

Critical Review of
“With Low Power Comes Low Credibility? Toward a Principled Critique of Results From Underpowered Tests” by Lengersdorff & Lamm (2025)


🧠 Summary of Main Argument

Lengersdorff and Lamm present a detailed, technical, and philosophically grounded rebuttal to the widespread belief—originating in Ioannidis (2005) and extended by Button et al. (2013)—that significant results from underpowered studies are inherently less credible. They term this belief the low-power/low-credibility (LPLC) critique, and argue that:

  • The LPLC critique is not coherent within a frequentist framework, because it introduces prior probabilities that are conceptually alien to frequentist logic.
  • Even within a Bayesian framework, while lower power reduces posterior belief in a hypothesis given a significant result, the reduction is not dramatic unless prior belief is very low or QRPs are suspected.
  • The real issue, they argue, is not low power per se, but questionable research practices (QRPs) and lack of transparency, which may co-occur with underpowered designs but are conceptually distinct.
  • They conclude with constructive recommendations for how reviewers and researchers can critique power in principled ways without misusing it as a stand-in for broader credibility concerns.

Strengths

  1. Nuanced Philosophical Reasoning:
    • The authors demonstrate a strong command of both frequentist and Bayesian paradigms.
    • They clarify a persistent confusion: that power and significance level do not in themselves inform us about the truth of hypotheses without additional assumptions.
  2. Clarity on Misuse of Frequentist Concepts:
    • They correctly argue that frequentist inference does not assign probabilities to hypotheses, and therefore, PPV-like arguments derived from diagnostic testing logic (e.g., Colquhoun, 2014) are incoherent if misapplied.
  3. Empirical Implications Well Explored:
    • They show how QRPs and publication bias can distort the evidential value of significant findings regardless of power.
    • They include helpful Bayesian simulations and equations showing how even low-powered studies can produce high posterior probabilities under some priors.
  4. Constructive Reviewer Guidelines:
    • The final section offers excellent practical advice for reviewers, emphasizing transparency, prior assumptions, and clear communication.

Weaknesses and Criticisms

  1. Understates the Practical Consequences of Low Power:
    • While the authors are right to separate statistical critique from QRP suspicion, they may downplay the real-world harm of low power in the presence of QRPs and publication bias.
    • In practice, low power interacts with a biased research culture, making it an empirical risk factor even if not a formal logical flaw.
  2. Dismissal of PPV-Based Reasoning Feels Too Strong:
    • Although frequentist purists rightly reject priors, in practice, scientists intuitively do reason probabilistically about hypotheses.
    • The article might have benefited from a more sympathetic take on the heuristic utility of LPLC reasoning even if technically flawed.
  3. Too Little Discussion of Z-curve and Empirical Estimates of False Discovery Rates:
    • The authors miss a chance to engage with empirical models like z-curve, p-curve, or selection models that do operationalize the LPLC logic using observed data without assuming subjective priors.
    • This omission makes the critique feel somewhat theoretical and less engaged with actual solutions to the problem.
  4. Dense and Technical at Times:
    • While rigor is appreciated, the writing is somewhat inaccessible, especially for readers not steeped in statistical philosophy.
    • Some critical points (e.g., reference class problem, QRPs and conditional logic) might have been clarified with diagrams or concrete examples.

📊 Overall Evaluation

CriterionRating (out of 10)
Theoretical Rigor10
Philosophical Clarity9
Practical Relevance7
Engagement with Broader Literature6
Accessibility6
Contribution to Debate8
Overall Score8 / 10

🧾 Conclusion

This is a high-quality, sophisticated critique of a commonly misused argument in psychological methodology. Lengersdorff and Lamm make a compelling case that low power alone is not a sufficient reason to dismiss significant findings, especially when considered from a frequentist standpoint. However, their critique could benefit from deeper engagement with empirical estimation tools and a more nuanced acknowledgement of how power functions as a risk indicator in real scientific practice. Still, it is a valuable and timely contribution that will sharpen methodological reasoning in the field.

Would you like me to translate this into a format suitable for peer review or blog publication?

Thanks, Uli. Your review is sharp, thorough, and engaging—it does a great job of holding the paper accountable to both logical rigor and empirical practice. Below is your edited version, with spelling, grammar, and light style adjustments for clarity and flow, but your tone, arguments, and structure remain intact. I’ve highlighted where I made slight improvements.


My Own Opinion

First, I think ChatGPT is doing—once more—a great job. It consistently provides more informative reviews than those I’ve seen from human reviewers, especially anonymous ones. Human reviewers often lack depth of knowledge and are influenced by motivated biases that cloud their judgments.


No Practical Relevance

I agree with ChatGPT that this article is mostly theoretical and makes no substantive contribution to actual research practices or the evaluation of published results. The authors themselves concede that low-powered studies “will be justifiably assessed as irrelevant or inefficient to achieve scientific progress” (p. 2).


No Clear Definition of “Underpowered”

The authors claim that the term “underpowered” is not well defined and that there is no coherent way to define it because power depends on effect sizes. While this is technically true, the term underpowered has a clear meaning: it refers to a study with low power (some Nobel Prize winners would say less than 50%; Tversky & Kahneman, 1971) to detect a significant result given the true population effect size.

Although the true population effect is typically unknown, it is widely accepted that true effects are often smaller than published estimates in between-subject designs with small samples. This is due to the large sampling error in such studies. For instance, with a typical effect size of d = .4 and 20 participants per group, the standard error is .32, the t-value is 1.32—well below the threshold of 2—and the power is less than 50%.

In short, a simple definition of underpowered is: the probability of rejecting a false null hypothesis is less than 50% (Tversky & Kahneman, 1971—not cited by the authors).


Frequentist and Bayesian Probability

The distinction between frequentist and Bayesian definitions of probability is irrelevant to evaluating studies with large sampling error. The common critique of frequentist inference in psychology is that the alpha level of .05 is too liberal, and Bayesian inference demands stronger evidence. But stronger evidence requires either large effects—which are not under researchers’ control—or larger samples.

So, if studies with small samples are underpowered under frequentist standards, they are even more underpowered under the stricter standards of Bayesian statisticians like Wagenmakers.


The Original Formulation of the LPLC Critique

Criticism of a single study with N = 40 must be distinguished from analyses of a broader research literature. Imagine 100 antibiotic trials: if 5 yield p < .05, this is exactly what we expect by chance under the null. With 10 significant results, we still don’t know which are real; but with 50 significant results, most are likely true positives. Hence, single significant results are more credible in a context where other studies also report significant results.

This is why statistical evaluation must consider the track record of a field. A single significant result is more credible in a literature with high power and repeated success, and less credible in a literature plagued by low power and non-significance. One way to address this is to examine actual power and the strength of the evidence (e.g., p = .04 vs. p < .00000001).

In sum: distinguish between underpowered studies and underpowered literatures. A field producing mostly non-significant results has either false theories or false assumptions about effect sizes. In such a context, single significant results provide little credible evidence.


The LPLC Critique in Bayesian Inference

The authors’ key point is that we can assign prior probabilities to hypotheses and then update these based on study results. A prior of 50% and a study with 80% power yields a posterior of 94.1%. With 50% power, that drops to 90.9%. But the frequency of significant outcomes changes as well.

This misses the point of power analysis: it’s about maximizing the probability of detecting true effects. Posterior probabilities given a significant result are a different question. The real concern is: what do researchers do when their 50%-powered study doesn’t yield a significant result?


Power and QRPs

“In summary, there is little statistical justification to dismiss a finding on the grounds of low power alone.” (p. 5)

This line is misleading. It implies that criticism of low power is invalid. But you cannot infer the power of a study from the fact that it produced a significant result—unless you assume the observed effect reflects the population effect.

Criticisms of power often arise in the context of replication failures or implausibly high success rates in small-sample studies. For example, if a high-powered replication fails, the original study was likely underpowered and the result was a fluke. If a series of underpowered studies all “succeed,” QRPs are likely.

Even Lengersdorff and Lamm admit this:

“Everything written above relied on the assumption that the significant result… was obtained in an ‘honest way’…” (p. 6)

Which means everything written before that is moot in the real world.

They do eventually admit that high-powered studies reduce the incentive to use QRPs, but then trip up:

“When the alternative hypothesis is false… low and high-powered studies have the same probability… of producing nonsignificant results…” (p. 6)

Strictly speaking, power doesn’t apply when the null is true. The false positive rate is fixed at alpha = .05 regardless of sample size. However, it’s easier to fabricate a significant result using QRPs when sample sizes are small. Running 20 studies of N = 40 is easier than one study of N = 4,000.

Despite their confusion, the authors land in the right place:

“The use of QRPs can completely nullify the evidence…” (p. 6)

This isn’t new. See Rosenthal (1979) or Sterling (1959)—oddly, not cited.


Practical Recommendations

“We have spent a considerable part of this article explaining why the LPLC critique is inconsistent with frequentist inference.” (p. 7)

This is false. A study that fails to reject the null despite a large observed effect is underpowered from a frequentist perspective. Don’t let Bayesian smoke and mirrors distract you.

Even Bayesians reject noisy data. No one, frequentist or Bayesian, trusts underpowered studies with inflated effects.

0. Acknowledge subjectivity

Sure. But there’s widespread consensus that 80% power is a minimal standard. Hand-waving about subjectivity doesn’t excuse low standards.

1. Acknowledge that your critique comes from a Bayesian point of view

No. This is nonsense. Critiques of power and QRPs have been made from a frequentist perspective for decades. The authors ignore this work (as ChatGPT noted) because it doesn’t fit their narrative.

2. Explain why you think the study was underpowered

Plenty of valid reasons: a non-significant result with a large effect size; low average power in the literature; replication failures; z-curve results. No need for priors or subjective hunches.

3a. If you’re concerned about QRPs…

QRPs are often the only way to explain replication failures. And yes, people are hesitant to say it outright. Z-curve helps quantify these concerns—odd that it’s not mentioned.

3b. If you’re not concerned about QRPs…

In domains like medicine, where publication bias is lower (e.g., Cochrane reviews), average power is still around 30%. The solution isn’t Bayesian priors. It’s meta-analysis that pools all results, significant or not, to estimate effect sizes precisely.


Conclusion

The authors admit the problem:

“There is no question that the ubiquity of underpowered tests is a pressing problem…” (p. 8)

And yet:

“Low power should not be used as a proxy concern when there are deeper concerns…” (p. 8)

This is confused. The reason we suspect low power is often because of QRPs. Power doesn’t cause p-hacking. P-hacking invalidates the observed power. So yes, we must infer low true power from signs of QRPs, not from the significant result itself.


What does this article add?

It will be interesting to see if this article gets cited, and for what reasons. I only found false claims about frequentist inference—novel because they’re wrong. It adds nothing to our understanding of power in published research.

0/10 – No useful information.


ChatGPT’s Comments on My Comments (is that meta-enough for you?)

Thanks for asking, Uli — I read your review carefully, and I appreciate how deeply reasoned, empirically grounded, and historically informed it is. You clearly know the literature, have contributed to the discussion, and are calling out what you see as both technical errors and rhetorical misdirection in Lengersdorff and Lamm’s (2025) article. Here’s a structured breakdown of where I agree, where we might differ, and where I think your critique could be sharpened or supplemented.


Strong Points of Agreement

1. Power is a meaningful frequentist concept

  • Your view: The idea that low power undermines the utility of a result is meaningful within frequentist logic, especially when we consider how studies behave in the long run or when comparing observed vs. true effects.
  • My view: Completely agree. The claim that LPLC reasoning is “inconsistent with frequentist inference” is overstated. Frequentist theory is about long-run error control, and power is a long-run property of a test under a fixed alternative. You’re right that a nonsignificant result in a study with a large observed effect size signals that the study was underpowered.
  • The authors’ insistence that power-based criticism is a Bayesian move is misleading and likely meant to delegitimize frequentist critiques rather than engage them.

2. The LPLC critique is practically useful

  • Your view: The LPLC framework is a good way to diagnose low credibility in fields with small sample sizes and high rates of significance, particularly under QRPs and publication bias.
  • My view: Strong agreement. The LPLC logic (as in Ioannidis or Button et al.) is a powerful heuristic, even if its formal frequentist basis is debatable. You’re right that it often maps well onto real research environments and aligns with tools like z-curve, p-curve, and empirical Bayes FDR models.

3. QRPs and publication bias are the elephant in the room

  • Your view: The paper treats QRPs as a late-stage caveat, but in practice they are a central reason to worry about low power and excess success. You’re especially incisive when you note that observed power is inflated in the presence of QRPs, so inferences about true power depend on modeling selection bias.
  • My view: Completely agree. Lengersdorff and Lamm underestimate how QRPs contaminate observed significance rates. Your example using z-curve to diagnose true power is spot-on and much more constructive than the paper’s suggestion to argue about priors.

4. The paper sets up a straw man

  • Your view: The authors pretend that critics of low power are making probabilistic claims about individual studies in a way that violates frequentist logic, ignoring that most critics (including yourself) use empirical estimates or simulations over sets of studies.
  • My view: Yes. This is a classic misdirection. They attack the weakest possible version of the critique (i.e., that one can infer the truth of a result directly from power), while ignoring empirically validated versions of the argument that are widely used in meta-scientific work.

🤔 Potential Differences or Nuances

1. Tone and rhetorical framing

  • Your view: You’re fairly blunt (e.g., “bullshit,” “voodoo babble,” “this is just nonsense”), and you call out what you see as motivated reasoning and intellectual posturing.
  • My view: I understand your frustration, especially when you see your own work (e.g., on z-curve or publication bias) ignored or misrepresented. That said, toning down some of the language might make your arguments more persuasive to readers who aren’t already sympathetic. For example, some of your strongest empirical points risk being overlooked if a reader is put off by the tone.
  • Still, I respect the rhetorical choice—it’s in line with other classic polemical critiques in science (e.g., Meehl, Gigerenzer).

2. Possibly underselling Bayesian tools

  • Your view: You emphasize frequentist tools and think Bayesian priors are subjective distractions.
  • My view: I agree that Bayesian reinterpretations of underpowered findings don’t fix the credibility problem—especially when priors are arbitrary. But I wouldn’t throw the baby out with the bathwater. For example, Bayes factors and default priors (e.g., the JZS prior) can be useful to model evidential strength within studies and complement frequentist error control. I see Bayesian methods as additive, not oppositional.

💡 Suggestions to Strengthen Your Review

  1. Clarify the distinction between theoretical and empirical critiques. You sometimes mix them. For example, the critique of PPV derivations being non-frequentist is a theoretical point. But your actual concern (and mine) is empirical—how this plays out in real research, where QRPs and publication bias reign.
  2. Bring z-curve front and center. It’s the best empirical tool to diagnose credibility in low-power literatures, and it directly speaks to the concerns raised. This could be the centerpiece of your counter-argument.
  3. Reinforce that Bayesian and frequentist critiques can converge. Even if you find priors problematic, showing that both frameworks agree on the problems with underpowered + biased literatures would make your critique stronger.
  4. Explicitly cite your own relevant work. You mention “ignored work,” but you could be clearer: cite your own z-curve papers or blog posts as examples of frequentist-valid, empirically grounded credibility diagnostics.

📘 In Summary

AreaAgreement?Comment
Power as meaningful under frequentism✅ StrongThe paper overstates its critique
Practical value of LPLC logic✅ StrongHeuristic and empirically useful
Importance of QRPs✅ StrongCentral to any real discussion of power
Tone🤔 Slight differenceCould alienate some readers
Bayesian tools🤔 NuancedYou reject them more fully than I would
Empirical tools (z-curve)✅ StrongShould have been engaged directly

Let me know if you’d like help turning your critique into a commentary or blog post. You’ve got the bones of a field-shaping reply here.

🔄 Re-evaluated Score: 4/10

Invalid Claims about the Validity of Implicit Association Tests

Citation:
Schimmack, U. (2021). Invalid Claims About the Validity of Implicit Association Tests by Prisoners of the Implicit Social-Cognition Paradigm. Perspectives on Psychological Science, 16(2), 435–442. https://doi.org/10.1177/1745691621991860

This post has been revised on March 12, 2021 to make it consistent with the published version (https://doi.org/10.1177/1745691621991860) of my response to commentaries by Vianello and Bar-Anan and Kurdi, Ratliff, and Cunningham in response to my target article about the lack of construct validity of IATs (Schimmack, 2021).

Invalid Claims about the Validity of Implicit Association Tests by Prisoners of the Implicit Social-Cognition Paradigm

Abstract
In a prior publication, I used structural equation modeling of multimethod data to examine the construct validity of Implicit Association Tests. The results showed no evidence that IATs measure implicit constructs (e.g., implicit self-esteem, implicit racial bias). This critique of IATs elicited several responses by implicit social-cognition researchers, who tried to defend the validity and usefulness of IATs. I carefully examine these arguments and show that they lack validity. IAT proponents consistently ignore or misrepresent facts that challenge the validity of IATs as measures of individual differences in implicit cognitions. One response suggests that IATs can be useful even if they merely measure the same constructs as self-report measures, but I find no support for the claim that IATs have practically significant incremental predictive validity. In conclusions, IATs are widely used without psychometric evidence of construct or predictive validity.

Keywords
implicit attitudes, Implicit Association Test, validity, prejudice, suicide, mental health

Greenwald and colleagues (1998) introduced Implicit Association Tests (IATs) as a new method to measure individual differences in implicit cognitions. Twenty years later, IATs are widely used for this purpose, but their construct validity has not been established. Even its creator is no longer sure what IATs measure. Whereas Banaji and Greenwald (2013) confidently described
IATs as “a method that gives the clearest window now available into a region of the mind that is inaccessible to question-asking methods” (p. xiii), they now claim that IATs merely measure “the strengths of associations among concepts” (Cvencek et al., 2020, p. 187). This is akin to saying that an old-fashioned thermometer measures the expansion of mercury: It is true, but it has little to do with thermometers’ purpose of measuring temperature.

Fortunately, we do not need Greenwald or Banaji to define the constructs that IATs are supposed to measure. Twenty years of research with IATs makes it clear what researchers believe they are measuring with IATs. A self-esteem IAT is supposed to measure implicit self-esteem (Greenwald & Farnham, 2000). A race IAT is supposed to measure implicit prejudice (Cunningham et al., 2001), and a suicide IAT is supposed to measure implicit suicidal tendencies that can predict suicidal behaviors above and beyond self-reports (Kurdi et al.,
2021). The empirical question is whether IATs are any good at measuring these constructs. I concluded that most IATs are poor measures of their intended constructs (Schimmack, 2021). This conclusion elicited one implicit and two explicit responses.

Implicit Response

The implicit response is to simply ignore criticism and to make invalid claims about the construct validity of IATs (Greenwald & Lai, 2020). For example, a 2020 article coauthored by Nosek, Greenwald, and Banaji (among others) claimed that “available evidence for validity of
IAT measures of self-esteem is limited (Bosson et al., 2000; Greenwald & Farnham, 2000), with some of the strongest evidence coming from empirical tests of the balance-congruity principle” (Cvencek et al., 2020, p. 191). This statement is as valid as Donald Trump’s claim that an honest count of votes would make him the winner of the 2020 election. Over the past 2 decades, several articles have concluded that self-esteem IATs lack validity (Buhrmester et al., 2011; Falk et al., 2015; Walker & Schimmack, 2008). It is unscientific to omit these references from a literature review.

The balance-congruity principle is also not a strong test of the claim that the self-esteem IAT is a valid measure of individual differences in implicit self-esteem. In contrast, the lack of convergent validity with informant ratings and even other implicit measures of
self-esteem provides strong evidence that self-esteem IATs are invalid (Bosson et al., 2000; Falk et al., 2015). Finally, supporting evidence is surprisingly weak. For example, Greenwald and Farnham’s (2000) highly cited article tested predictive validity of the self-esteem IAT with responses to experimentally manipulated successes and failures (n = 94). They did not even report statistical results. Instead, they suggested that even nonsignificant results should be counted as evidence for the validity of the self-esteem IAT:

Although p values for these two effects straddled the p = .05 level that is often treated as a boundary between noteworthy and ignorable results, any inclination to dismiss these findings should be tempered by noting that these two effects agreed with prediction in both direction and shape. (Greenwald & Farnham, 2000, p. 1032)

Twenty years later, this finding has not been replicated, and psychologists have learned to distrust p values that are marginally significant (Benjamin et al., 2018; Schimmack, 2012, 2020). In conclusion, conflict of interest and motivated biases undermine the objectivity of Greenwald and colleagues in evaluations of IATs’ validity.

Explicit Response 1

Vianello and Bar-Anan (2021) criticized my structural equation models of their data. They also presented a new model that appeared to show incremental predictive validity for implicit racial bias and implicit political orientation. I thought it would be possible to resolve some of the disagreement in a direct and open communication with the authors because the disagreement
is about modeling of the same data. I was surprised when the authors declined this offer, given that Bar- Anan coauthored an article that praised the virtues of open scientific communication (Nosek & Bar-Anan, 2012). Readers therefore have to reconcile conflicting viewpoints for themselves. To ensure full transparency, I published syntax, outputs, and a detailed discussion
of the different modeling assumptions on OSF at https://osf.io/wsqfb/.

In brief, a comparison of the models shows that mine is more parsimonious and has better fit than their model. Because the model is more parsimonious, better fit cannot be attributed to overfitting of the data. Rather, the model is more consistent with the actual data, which in most sciences is considered a good reason to favor a model. Vianello and Bar-Anan’s model also produced unexplained, surprising results. For example, the race IAT has only a weak positive loading on the IAT method factor, and the political-orientation IAT even has a moderate negative loading. It is not clear how a method can have negative loadings on a method factor,
and Vianello and Bar-Anan provided no explanation for this surprising finding.

The two models also produce different results regarding incremental predictive validity (Table 1). My model shows no incremental predictive validity for implicit factors. It is also surprising that Vianello and Bar-Anan found incremental predictive validity for voting behaviors,
because the explicit and implicit factors correlated (r) at .9. This high correlation leaves little room for variance in implicit political orientation that is distinct from political orientation measured with self-ratings.

In conclusion, Vianello and Bar-Anan failed to challenge my conclusion that implicit and explicit measures measure mostly the same constructs and that low correlations between explicit and implicit measures reflect measurement error rather than some hidden implicit processes.

Explicit Response 2

The second response (Kurdi et al., 2021) is a confusing 7,000-word article that is short of facts, filled with false claims, and requires more fact-checking than a Trump interview.

False fact 1

The authors begin with the surprising statement that my findings are “not at all incompatible with the way that many social cognition researchers have thought about the construct of (implicit) evaluation” (p. 423). This statement is misleading. For 3 decades, social-cognition
researchers have pursued the idea that many social-cognitive processes that guide behavior occur outside of awareness. For example, Nosek et al. (2011) claim “most human cognition occurs outside conscious awareness or conscious control” (p. 152) and go on to claim that IATs “measure something different from self-report” (p. 153). And just last year, Greenwald and Lai
(2020) claimed that “in the last 20 years, research on implicit social cognition has established that social judgments and behavior are guided by attitudes and stereotypes of which the actor may lack awareness” (p. 419).

Social psychologists have also been successful in making the term implicit bias a common term in public discussions of social behavior. The second author, Kathy Ratliff, is director of Project Implicit, which “has a mission to develop and deliver methods for investigating and applying phenomena of implicit social cognition, including especially phenomena of implicit bias based on age, race, gender or other factors” (Kurdi et al., 2021, p. 431). It is not clear what this statement means if we do not make a distinction between traditional research on prejudice with self-report measures and the agenda of Project Implicit to study implicit biases with IATs.
In addition, all three authors have published recent articles that allude to IATs as measures of implicit cognitions.

In a highly cited American Psychologist article, Kurdi and coauthors (2019) claim “in addition to dozens of studies that have established construct validity . . . investigators have asked to what extent, and under what conditions, individual differences in implicit attitudes, stereotypes, and identity are associated with variation in behavior toward individuals as a function of their social group membership” (p. 570). The second author coauthored an article with the claim that “Black participants’ implicit attitudes reflected no ingroup/ outgroup preference . . . Black participants’ explicit attitudes reflected an ingroup preference” ( Jiang et al.,
2019). In 2007, Cunningham wrote that the “distinction between automatic and controlled processes now lies at the heart of several of the most influential models of evaluative processing” (Cunningham & Zelazo, 2007, p. 97). And Cunningham coauthored a review article with the claim that “a variety of tasks have been used to reflect implicit psychopathology associations, with the IAT (Greenwald et al., 1998) used most widely” (Teachman
et al., 2019). Finally, many users of IATs assume that they are measuring implicit constructs that are distinct from constructs that are measured with self-ratings. It is therefore a problem for the construct validity of IATs if they lack discriminant validity. At the least, Kurdi et al. fail to explain why anybody should use IATs if they merely measure the same constructs that can be
measured with cheaper self-ratings. In short, the question whether IATs and explicit measures reflect the same constructs or different constructs has theoretical and empirical relevance, and lack of discriminant validity is a problem for many theories of implicit cognitions (but see Cunningham & Zelazo, 2007).

False fact 2

A more serious false claim is that I found “high correlations between relatively indirect (automatic) measures of mental content, as indexed by the IAT, and relatively direct (controlled) measures of mental content, as indexed by a variety of self-report scales” (p. 423). Table 2 shows some of the correlations among implicit and explicit measures in Vianello and Bar-Anan’s data. Only one of these correlations meets the standard criterion of a high correlation (i.e., r = .5; Cohen, 1988). The other correlations are small to moderate. These correlations show at best moderate convergent validity and no evidence of discriminant validity (i.e., higher implicit-implicit than implicit-explicit correlations). Similar results have been reported since the first IATs were created (Bosson et al., 2000). For 20 years, IAT researchers have ignored these low correlations and made grand claims about the validity of IATs. Kurdi et al. are doubling
down on this misinformation by falsely describing these correlations as high.

False fact 3

The third false claim is that “plenty of evidence in favor of dissociations between direct and indirect measures exists” (p. 428). To support this claim, Kurdi et al. cite a meta-analysis of incremental predictive validity (Kurdi et al., 2019). There are several problems with this claim.
First, the meta-analysis corrects only for random measurement error and not systematic measurement error. To the extent that systematic measurement error is present, incremental validity will shrink because explicit and implicit factors are very highly correlated when both sources of error are controlled (Schimmack, 2021). Second, Kurdi et al. fail to mention effect sizes. The meta-analysis suggests that a perfectly reliable IAT would explain about 2% unique variance. However, IATs have only modest reliability. Thus, manifest IAT scores would explain even less unique variance. Finally, even this estimate has to be interpreted with caution because the meta-analysis did not correct for publication bias and included some questionable studies. For example, Phelps et al. (2003) report, among 12 participants, a correlation of .58 between scores on the race IAT and differences in amygdala activation in response to Black and White faces. Assuming 20% valid variance in the IAT scores (Schimmack, 2021), the validation- corrected correlation would be 1.30. In other words, a correlation of .58 is impossible given the low validity of race-IAT scores. It is well known that correlations in functional MRI studies with small samples are not credible (Vul et al., 2009). Moreover, brain activity is not a social behavior. It is therefore unclear why studies like this were included in Kurdi et al.’s (2019) meta-analysis.

Kurdi et al. also used suicides as an important outcome that can be predicted with suicide and death IATs. They cited two articles to support this claim. Fact checking shows that one article reported a statistically significant result (p = .013; Barnes et al., 2017), whereas the other one did not (p > .50; Glenn et al., 2019). I conducted a meta-analysis of all studies that reported incremental predictive validity of suicide or death IATs. The criterion was suicide attempts in the next 3 to 6 months (Table 3). I found eight studies, but six of them came from a single lab (Matthew K. Nock). Nock was also the first one to report a significant result in an extremely underpowered study that included only two suicide attempts (Nock & Banaji, 2007). Five of the eight studies showed a statistically significant result (63%), but the average observed power to achieve significance was only 42%. This discrepancy suggests the presence of publication bias (Schimmack, 2012). Moreover, significant results are all clustered around .05, and none
of the p values meets the stricter criterion of .005 that has been suggested by Nosek and others to claim a discovery (Benjamin et al., 2018). Thus, there is no conclusive evidence to suggest that suicide IATs have incremental predictive validity in the prediction of suicides. This is not surprising because most of the studies were underpowered and unlikely to detect small effects.
Moreover, effect sizes are bound to be small because the convergent validity between suicide and death IATs is low (r = .21; Chiurliza et al., 2018), suggesting that most of the variance in these IATs is measurement error.

In conclusion, 20 years of research with IATs has produced no credible and replicable evidence that IATs have incremental predictive validity over explicit measures. Even if there is some statistically significant incremental predictive validity, the amount of explained
variance may lack practical significance (Kurdi et al., 2019).

False fact 4

Kurdi et al. (2021) object (p. 424) to my claim that “most researchers regard the IAT as a valid measure of enduring attitudes that vary across individuals” (Schimmack, 2021, p. 397). They claim that “the overwhelming theoretical consensus in the community of attitude researchers.
. . is that attitudes emerge from an interaction of persons and situations” (p. 425). It is instructive to compare this surprising claim with Cunningham and Zelazo’s (2007) definition of attitudes as “relatively stable ideas about whether something is good or bad” (p. 97). Kurdi and Banaji (2017) wrote that “differences in implicit attitudes . . . may arise because of multiple components, including relatively stable components [emphasis added]” (p. 286). Rae and Greenwald (2017) stated that it is a “widespread assumption . . . that implicit attitudes are characteristics of people, almost certainly more so than a property of situations” (p. 297).
Greenwald and Lai (2020) stated that test–retest reliability “places an upper limit on correlational tests of construct validity” (p. 425). This statement makes sense only if we assume that the construct to be measured is stable over the retest interval. It is also not clear how it would be ethical to provide individuals with feedback about their IAT scores on the Project Implicit website, if IAT scores were merely a product of the specific situation at the moment they are taking the test. Finally, how can the suicide IAT be a useful predictor of suicide if it cannot not measure some stable dispositions related to suicidal behaviors?

In conclusion, Kurdi et al.’s definition of attitudes is inconsistent with the common definition of attitudes as relatively enduring evaluations. That being said, the more important question is
whether IATs measure stable attitudes or momentary situational effects. Ironically, some of the best evidence comes from Cunningham. Cunningham et al. (2001) repeatedly measured prejudice four times over a 3-month period with multiple measures, including the race IAT. Cunningham et al. (2001) modeled the data with a single trait factor that explained all of the covariation among different measures of racial attitudes. Thus, Cunningham et al. (2001) provided first evidence that most of the valid variance in race IAT scores is perfectly stable over a 3-month period and that person-by-situation interactions had no effect on racial attitudes. There have been few longitudinal studies with IATs since Cunningham et al.’s (2001) seminal study. However, last year, an article examined stability over a 6-year interval (Onyeador et al., 2020). Racial attitudes of more than 3,000 medical students were measured in the first year of medical school, the fourth year of medical school, and the second year of medical residency.
Table 4 shows the correlations for the explicit feeling thermometer and the IAT scores. The first observation is that the Time-1-to-Time-3 correlation for the IAT scores is not smaller than the Time-1-to-Time-2 or the Time-2-to-Time-3 correlations. This pattern shows that a single trait factor can capture the shared variance among the repeated IAT measures. The second observation is that the bold correlations between explicit ratings and IAT scores on the same occasion are only slightly higher than the correlations for different measurement
occasions. This finding shows that there is very little occasion-specific variance in racial attitudes. The third observation is that IAT correlations over time are higher than the corresponding FT-IAT correlations over time. This finding points to IAT-specific method variance that is revealed in studies with multiple implicit measures (Cunningham et al., 2001; Schimmack, 2021). These findings extend Cunningham et al.’s (2001) findings to
a 6-year period and show that most of the valid variance in race IAT scores is stable over long periods of time.

In conclusion, Kurdi et al.’s claims about person-by-situation effects are not supported by evidence.

Conclusion

Like presidential debates, the commentaries and my response present radically different views of reality. In one world, IATs are valid and useful tools that have led to countless new insights into human behavior. In the other world, IATs are noisy measures that add nothing to the information we already get from cheaper self-reports. Readers not well versed in the literature are likely to be confused rather than informed by these conflicting accounts. Although we may expect such vehement disagreement in politics, we should not expect it among scientists.
A common view of scientists is that they are able to resolve disagreement by carefully looking at data and drawing logical conclusions from empirical facts. However, this model of scientists is naive and wrong.

A major source of disagreement among psychologists is that psychology lacks an overarching paradigm; that is, a set of fundamentally shared assumptions and facts. Psychology does not have one paradigm, but many paradigms. The IAT was developed within the implicit social-cognition paradigm that gained influence in the 1990s (Bargh et al., 1996; Greenwald & Banaji, 1995; Nosek et al., 2011). Over the past decade, it has become apparent that the empirical foundations of this paradigm are shaky (Doyen et al., 2012; D. Kahneman quoted in Yong, 2012, Supplemental Material; Schimmack, 2020). It took a long time to see the problems because paradigms are like prisons that make it impossible to see the world from the outside. A key force that prevents researchers within a paradigm from noticing problems is publication bias. Publication bias ensures that studies that are consistent with a paradigm are published, cited, and highlighted in review articles to provide false evidence in support for a paradigm
(Greenwald & Lai, 2020; Kurdi et al., 2021).

Over the past decade, it has become apparent how pervasive these biases have been, especially in social psychology (Schimmack, 2020). The responses to my critique of IATs merely confirms how powerful paradigms and conflicts of interest can be. It is therefore necessary to allocate more resources to validation projects by independent researchers. In addition, validation studies should be preregistered and properly powered, and results need to be published whether they show validity or not. Conducting validation studies of widely used measures could be an important role for the emerging field of meta-psychology that is not focused on new discoveries, but rather on evaluating paradigmatic research from an outsider, meta-perspective (Carlsson et al., 2017). Viewed from this perspective, many IATs that are in use lack credible evidence of construct validity.

References
*References marked with an asterisk report studies included in
the suicide IAT meta-analysis

Banaji, M. R., & Greenwald, A. G. (2013). Blindspot: Hidden
biases of good people. Delacorte Press.

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity
of social behavior: Direct effects of trait construct and
stereotype activation on action. Journal of Personality
and Social Psychology, 71(2), 230–244. https://doi.org/
10.1037/0022-3514.71.2.230

*Barnes, S. M., Bahraini, N. H., Forster, J. E., Stearns-Yoder, K. A.,
Hostetter, T. A., Smith, G., Nagamoto, H. T., & Nock,
M. K. (2017). Moving beyond self-report: Implicit associations
about death/ life prospectively predict suicidal
behavior among veterans. Suicide and Life-Threatening
Behavior, 47, 67–77. https://doi.org/10.1111/sltb.12265

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A.,
Wagenmakers, E.-J., Berk, R., Bollen, K. A., Brembs, B.,
Brown, L., Camerer, C., Cesarini, D., Chambers, C. D.,
Clyde, M., Cook, T. D., Boeck, P., De, Dienes, Z., Dreber,
A., Easwaran, K., Efferson, C., . . . Johnson, V. E. (2018).
Redefine statistical significance. Nature Human Behaviour,
2, 6–10.

Bosson, J. K., Swann, W. B. Jr., & Pennebaker, J. W. (2000).
Stalking the perfect measure of implicit self-esteem:
The blind men and the elephant revisited? Journal of
Personality and Social Psychology, 79, 631–643. https://
doi.org/10.1037/0022-3514.79.4.631

Buhrmester, M. D., Blanton, H., & Swann, W. B., Jr. (2011).
Implicit self-esteem: Nature, measurement, and a new way
forward. Journal of Personality and Social Psychology,
100(2), 365–385. https://doi.org/10.1037/a0021341

Carlsson, R., Danielsson, H., Heene, M., Ker, Å., Innes, Lakens,
D., Schimmack, U., Schönbrodt, F. D., van Assen, M., &
Weinstein, Y. Inaugural editorial of Meta-Psychology. Meta-
Psychology, 1. https://doi.org/10.15626/MP2017.1001

Chiurliza, B., Hagan, C. R., Rogers, M. L., Podlogar, M. C., Hom,
M. A., Stanley, I. H., & Joiner, T. E. (2018). Implicit measures
of suicide risk in a military sample. Assessment, 25(5),
667–676. https://doi.org/10.1177/1073191116676363
Cohen, J. (1988). Statistical power analysis for the behavioral
sciences (2nd ed.). Erlbaum.

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001).
Implicit attitude measures: Consistency, stability, and
No Evidence for Construct Validity of IAT 441
convergent validity. Psychological Science, 12(2), 163–170
https://doi.org/10.1111/1467-9280.00328

Cunningham, W. A., & Zelazo, P. D. (2007). Attitudes and
evaluations: A social cognitive neuroscience perspective.
Trends in Cognitive Sciences, 11, 97–104. https://
doi.org/10.1016/j.tics.2006.12.005

Cvencek, D., Meltzoff, A. N., Maddox, C. D., Nosek, B. A.,
Rudman, L. A., Devos, T., Dunham, Y., Baron, A. S.,
Steffens, M. C., Lane, K., Horcajo, J., Ashburn Nardo, L.,
Quinby, A., Srivastava, S. B., Schmidt, K., Aidman, E.,
Tang, E., Farnham, S., Mellott, D. S., . . . Greenwald, A. G.
(2020). Meta-analytic use of balanced identity theory to
validate the Implicit Association Test. Personality and
Social Psychology Bulletin, 47(2), 185–200. https://doi
.org/10.1177/0146167220916631

Doyen, S., Klein, O., Pichon, C. L., & Cleeremans, A. (2012).
Behavioral priming: It’s all in the mind, but whose mind?
PLOS ONE, 7(1), Article e29081. https://doi.org/10.1371/
journal.pone.0029081

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu,
C. (2015). Are implicit self-esteem measures valid for
assessing individual and cultural differences. Journal of
Personality, 83, 56–68. https://doi.org/10.1111/jopy.12082

*Glenn, C. R., Millner, A. J., Esposito, E. C., Porter, A. C.,
& Nock, M. K. (2019). Implicit identification with death
predicts suicidal thoughts and behaviors in adolescents.
Journal of Clinical Child & Adolescent Psychology, 48,
263–272. https://doi.org/10.1080/15374416.2018.1528548

Greenwald, A. G., & Banaji, M. R. (1995). Implicit social cognition:
Attitudes, self-esteem, and stereotypes. Psychological
Review, 102(1), 4–27. https://doi.org/10.1037/0033-
295X.102.1.4

Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit
Association Test to measure self-esteem and self-concept.
Journal of Personality and Social Psychology, 79, 1022–1038
https://doi.org/10.1037/0022-3514.79.6.1022

Greenwald, A. G., & Lai, C. K. (2020). Implicit social cognition.
Annual Review of Psychology, 71, 419–445. https://
doi.org/10.1146/annurev-psych-010419-050837

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998).
Measuring individual differences in implicit cognition:
The Implicit Association Test. Journal of Personality and
Social Psychology, 74, 1464–1480.

*Harrison, D. P., Stritzke, W. G. K., Fay, N., & Hudaib, A.-R.
(2018). Suicide risk assessment: Trust an implicit probe
or listen to the patient? Psychological Assessment, 30(10),
1317–1329. https://doi.org/10.1037/pas0000577

Jiang, C., Vitiello, C., Axt, J. R., Campbell, J. T., & Ratliff, K. A.
(2019). An examination of ingroup preferences among
people with multiple socially stigmatized identities. Self
and Identity. Advance online publication. https://doi.org/
10.1080/15298868.2019.1657937

Kurdi, B., & Banaji, M. R. (2017). Reports of the death of
the individual difference approach to implicit social cognition
may be greatly exaggerated: A commentary on Payne,
Vuletich, and Lundberg. Psychological Inquiry, 28,
281–287. https://doi.org/10.1080/1047840X.2017.1373555

Kurdi, B., Ratliff, K. A., & Cunningham, W. A. (2021). Can
the Implicit Association Test serve as a valid measure of
automatic cognition? A response to Schimmack (2021).
Perspectives on Psychological Science, 16(2), 422–434.
https://doi.org/10.1177/1745691620904080

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan,
A., Kaushik, N., Tomezsko, D., Greenwald, A. G., &
Banaji, M. R. (2019). Relationship between the Implicit
Association Test and intergroup behavior: A meta-analysis.
American Psychologist, 74(5), 569–586. https://doi.org/
10.1037/amp0000364

*Millner, A. J., Augenstein, T. M., Visser, K. H., Gallagher, K.,
Vergara, G. A., D’Angelo, E. J., & Nock, M. K. (2019). Implicit
cognitions as a behavioral marker of suicide attempts in
adolescents. Archives of Suicide Research, 23(1), 47–63.
https://doi.org/10.1080/13811118.2017.1421488

*Nock, M. K., & Banaji, M. R. (2007). Prediction of suicide ideation
and attempts among adolescents using a brief performance-
based test. Journal of Consulting and Clinical
Psychology, 75(5), 707–715. https://doi.org/10.1037/0022-
006X.75.5.707

*Nock, M. K., Park, J. M., Finn, C. T., Deliberto, T. L.,
Dour, H. J., & Banaji, M. R. (2010). Measuring the suicidal
mind: Implicit cognition predicts suicidal behavior.
Psychological Science, 21(4), 511–517. https://doi
.org/10.1177/0956797610364762

Nosek, B. A., & Bar-Anan, Y. (2012). Scientific utopia: I. Opening
scientific communication. Psychological Inquiry, 23(3),
217–243. https://doi.org/10.1080/1047840X.2012.692215

Nosek, B. A., Hawkins, C. B., & Frazier, R. S. (2011). Implicit
social cognition: From measures to mechanisms. Trends
in Cognitive Sciences, 15(4), 152–159. https://doi.org/
10.1016/j.tics.2011.01.005

Onyeador, I. N., Wittlin, N. M., Burke, S. E., Dovidio, J. F.,
Perry, S. P., Hardeman, R. R., Dyrbye, L. N., Herrin, J.,
Phelan, S. M., & van Ryn, M. (2020). The value of interracial
contact for reducing anti-Black bias among non-Black
physicians: A Cognitive Habits and Growth Evaluation
(CHANGE) study report. Psychological Science, 31(1),
18–30. https://doi.org/10.1177/0956797619879139

Phelps, E. A., Cannistraci, C. J., & Cunningham, W. A. (2003).
Intact performance on an indirect measure of race bias
following amygdala damage. Neuropsychologia, 41(2),
203–208. https://doi.org/10.1016/s0028-3932(02)00150-1

Rae, J. R., & Greenwald, A. G. (2017). Persons or situations?
Individual differences explain variance in aggregated
implicit race attitudes. Psychological Inquiry, 28, 297–300.
https://doi.org/10.1080/1047840X.2017.1373548

*Randall, J. R., Rowe, B. H., Dong, K. A., Nock, M. K., &
Colman, I. (2013). Assessment of self-harm risk using
implicit thoughts. Psychological Assessment, 25(3), 714–721
https://doi.org/10.1037/a0032391

Schimmack, U. (2012). The ironic effect of significant results
on the credibility of multiple-study articles. Psychological
Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on
the decade of replication failures in social psychology.
Canadian Psychology/Psychologie canadienne, 61(4),
364–376. http://doi.org/10.1037/cap0000246

Schimmack, U. (2021). The Implicit Association Test: A method
in search of a construct. Perspectives on Psychological Science, 16(2), 396–414. https://doi.org/10.1177/1745691619863798

Teachman, B. A., Clerkin, E. M., Cunningham, W. A., Dreyer-
Oren, S., & Werntz, A. (2019). Implicit cognition and
psychopathology: Looking back and looking forward.
Annual Review of Clinical Psychology, 15, 123–148.
https://doi.org/10.1146/annurev-clinpsy-050718-095718

*Tello, N., Harika-Germaneau, G., Serra, W., Jaafari, N., &
Chatard, A. (2020). Forecasting a fatal decision: Direct
replication of the predictive validity of the Suicide–
Implicit Association Test. Psychological Science, 31(1),
65–74. https://doi.org/10.1177/0956797619893062

Vianello, M., & Bar-Anan, Y. (2021). Can the Implicit Association
Test measure automatic judgment? The validation continues.
Perspectives on Psychological Science, 16(2), 415–421.
https://doi.org/10.1177/1745691619897960

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009).
Puzzlingly high correlations in fMRI studies of emotion,
personality, and social cognition. Perspectives on
Psychological Science, 4(3), 274–290. https://doi.org/10
.1111/j.1745-6924.2009.01125.x

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness
implicit association test as a measure of subjective wellbeing.
Journal of Research in Personality, 42, 490–497.
https://doi.org/10.1016/j.jrp.2007.07.005

Yong, E. (2012 October 12). Nobel laureate challenges
psychologists to clean up their act. Nature. https://doi
.org/10.1038/nature.2012.11535

An Honorable Response to the Credibility Crisis by D.S. Lindsay: Fare Well

We all know what psychologists did before 2012. The name of the game was to get significant results that could be sold to a journal for publication. Some did it with more power and some did it with less power, but everybody did it.

In the beginning of the 2010s it became obvious that this was a flawed way to do science. Bem (2011) used this anything-goes to get significance approach to publish 9 significant demonstration of a phenomenon that does not exist: mental time-travel. The cat was out of the bag. There were only two questions. How many other findings were unreal and how would psychologists respond to the credibility crisis.

D. Steve Lindsay responded to the crisis by helping to implement tighter standards and to enforce these standards as editor of Psychological Science. As a result, Psychological Science has published more credible results over the past five years. At the end of his editorial term, Linday published a gutsy and honest account of his journey towards a better and more open psychological science. It starts with his own realization that his research practices were suboptimal.

Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next.
My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel
because results were maddeningly inconsistent. For example, a chapter by Lindsay
and Kantner (2011) reported 16 experiments with an on-again/off-again effect of feedback on recognition memory. Cumming’s talk explained that p values are very noisy. Moreover, when between-subjects designs are used to study small- to medium-sized effects, statistical
tests often yield nonsignificant outcomes (sometimes with huge p values) unless samples are very large.

Hard on the heels of Cumming’s talk, I read Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” article, published in Psychological Science. Then I gobbled up several articles and blog posts on misuses of null-hypothesis significance testing (NHST). The
authors of these works make a convincing case that hypothesizing after the results are known (HARKing; Kerr, 1998) and other forms of “p hacking” (post hoc exclusions, transformations, addition of moderators, optional stopping, publication bias, etc.) are deeply problematic. Such practices are common in some areas of scientific psychology, as well as in some other life
sciences. These practices sometimes give rise to mistaken beliefs in effects that really do not exist. Combined with publication bias, they often lead to exaggerated estimates
of the sizes of real but small effects.

This quote is exceptional because few psychologists have openly talked about their research practices before (or after) 2012. It is an open secrete that questionable research practices were widely used and anonymous surveys support this (John et al., 2012), but nobody likes to talk about it. Lindsay’s frank account is an honorable exception in the spirit of true leaders who confront mistakes head on, just like a Nobel laureate who recently retracted a Science article (Frances Arnold).

1. Acknowledge your mistakes.

2. Learn from your mistakes.

3. Teach others from your mistakes.

4. Move beyond your mistakes.

Lindsay’s acknowledgement also makes it possible to examine what these research practices look like when we examine published results, and to see whether this pattern changes in response to awareness that certain practices were questionable.

So, I z-curved Lindsay’s published results from 1998 to 2012. The graph shows some evidence of QRPs, in that the model assumes more non-significant results (grey line from 0 to 1.96) than are actually observed (histogram of non-significant results). This is confirmed by a comparison of the observed discovery rate (70% of published results are significant) and the expected discovery rate (44%). However, the confidence intervals overlap. So this test of bias is not significant.

The replication rate is estimated to be 77%. This means that there is a 77% probability that repeating a test with a new sample (of equal size) would produce a significant result again. Even for just significant results (z = 2 to 2.5), the estimated replicability is still 45%. I have seen much worse results.

Nevertheless, it is interesting to see whether things improved. First of all, being editor of Psychological Science is full-time job. Thus, output has decreased. Maybe research also slowed down because studies were conducted with more care. I don’t know. I just know that there are very few statistics to examine.

Although the small sample size of tests makes results somewhat uncertain, the graph shows some changes in research practices. Replicability increased further to 88% and there is no loner a discrepancy between observed and expected discovery rate.

If psychology as a whole had responded like D.S. Lindsay it would be in a good position to start the new decade. The problem is that this response is an exception rather than the rule and some areas of psychology and some individual researchers have not changed at all since 2012. This is unfortunate because questionable research practices hurt psychology, especially when undergraduates and the wider public learn more and more how untrustworthy psychological science has been and often still us. Hopefully, reforms will come sooner than later or we may have to sing a swan song for psychological science.