
“Valid Replications Require Valid Methods—And Originals Don’t?”

Harmon-Jones, E., Harmon-Jones, C., Amodio, D. M., Gable, P. A., & Schmeichel, B. J. (2025). Valid replications require valid methods: Recommendations for best methodological practices with lab experiments. Motivation Science, 11(3), 235–245.

“Far from over.” (Frank Wang, tennis buddy when he is down 2:5)

The replication crisis shook social psychology in the 2010s. Heated debates—often on social media—divided critics, reformers, and defenders of the published record. The heat has cooled, but the crisis is far from over. The central empirical problems remain: unusually high rates of statistically significant results in journals, implausible success rates given typical power, and repeated failures to reproduce headline findings under rigorous conditions.

A striking pattern in parts of the methodological commentary that followed is explanatory asymmetry. Replication failures are readily attributed to contextual factors, subtle procedural differences, or “messy methods,” while the same standards are not applied with equal force to original studies. If minor contextual differences can wipe out an effect, then original results should also be unstable—yet the published record historically looks unnaturally successful. Any account that explains failure must also explain success.

There is also an ironic subtext: some of the strongest defenses of fragile effects come from researchers who study motivation and bias, yet methodological narratives can display their own motivated reasoning—favoring interpretations that protect prior conclusions. None of this requires imputing bad faith. It is enough to recognize that professional stakes and identity can shape what kinds of explanations feel plausible.

To guard against my own biases, I asked ChatGPT to evaluate bias in the Harmon-Jones et al. article. More importantly, ChatGPT also explained the basis for its rating.

Bias Evaluation

Harmon-Jones et al. (2025) argue that many replication failures in motivation and emotion research arise not from invalid theories or false positives, but from “messy methods.” They provide extensive practical recommendations regarding laboratory setup, experimenter behavior, manipulation strength, measurement sensitivity, replication design, data management, and statistical interpretation. The article is methodologically rich and offers useful guidance for improving internal validity in lab experiments.

However, when situated within the broader replication debate, the paper exhibits a consistent asymmetry in explanatory framing. On a scale from –10 (strongly defensive of existing literature) to +10 (strongly skeptical that most results are true), this article falls around –4 to –5: moderately biased in defense of established findings.

The basis for this rating is outlined below.


  1. Core Contribution: Internal Validity Matters

The article’s strongest contribution is its detailed emphasis on internal validity. The authors correctly note that laboratory experiments are sensitive systems in which:

  • Subtle environmental cues may influence participant motivation.
  • Experimenter demeanor and appearance can affect outcomes.
  • Manipulations must be strong and construct-valid.
  • Dependent variables must be sensitive and properly timed.
  • Multilab projects introduce coordination risk.
  • Data handling errors can contaminate results.

These are real methodological concerns. The paper provides concrete, experience-based guidance that would likely improve experimental rigor if widely adopted. It is especially valuable as a practical resource for researchers conducting lab-based motivation studies.


  2. Asymmetry in Causal Attribution

The principal concern is not methodological advice but explanatory direction.

Replication failures are repeatedly attributed to:

  • Context sensitivity
  • Weak or improperly implemented manipulations
  • Insensitive measures
  • Experimenter variability
  • Procedural deviations in multilab collaborations
  • Data management errors

These are legitimate explanations in some cases. However, the article does not apply equivalent scrutiny to original studies.

There is little engagement with:

  • Publication bias
  • Inflated effect sizes
  • Researcher degrees of freedom
  • Selective reporting
  • Power deficiencies in original work
  • Theory elasticity

The explanatory burden for null replications is placed largely on replication implementation rather than on possible inflation or fragility in the original literature.

This directional asymmetry is what produces the defensive tilt.


  3. Context Sensitivity as a Buffer

The authors cite contextual sensitivity as a key explanation for replication variability. Conceptually, psychological effects can depend on time, culture, and population. However, the article invokes contextual sensitivity as a ready explanation for replication failures without addressing the debate over how empirically robust this claim is.

More importantly, the paper does not quantify how strong contextual sensitivity would need to be to account for large-scale null findings in well-powered, preregistered, multilab studies. If minor environmental differences are sufficient to eliminate effects, then those effects are fragile by definition. That implication is not confronted directly.
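To see what such a quantification might look like, here is a minimal sketch in Python, assuming a simple mixture model with purely illustrative numbers of my own (an effect of d = 0.40 where the moderating context is present and zero where it is absent, and a pooled multilab estimate near d = 0.04); it is not an analysis of any specific study.

```python
# Minimal sketch under illustrative assumptions: in a simple mixture model,
# how many replication sites would need to provide the "right" context for a
# real effect to average out to the near-zero estimates seen in multilab nulls?

D_PRESENT = 0.40  # assumed effect size (Cohen's d) in favorable contexts
D_POOLED = 0.04   # assumed pooled multilab estimate, essentially zero

# pooled effect = D_PRESENT * share_favorable + 0 * (1 - share_favorable)
share_favorable = D_POOLED / D_PRESENT
print(f"Share of sites with favorable contexts: {share_favorable:.0%}")
# -> 10%: nine out of ten sites would have to be "wrong" contexts, which is
# just another way of saying the effect is fragile.
```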


  4. Treatment of Ego Depletion

The article references the large preregistered multisite ego-depletion test (Vohs et al., 2021), which included proponents of the theory. Rather than interpreting the null results as evidence about the true effect size, the authors emphasize coordination errors and deviations across labs.

While procedural complications can occur, the study was high-powered and preregistered. A pattern of near-zero effects across many sites cannot be explained solely by minor procedural noise without implying extreme fragility.
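A hedged power sketch illustrates why; the sample size below is an illustrative assumption of mine, not the actual N of Vohs et al. (2021).

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # probability of exceeding either critical value under the noncentral t
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

# Illustrative assumption: ~1,500 participants per condition pooled across sites.
for d in (0.10, 0.20):
    print(f"d = {d:.2f}: power ~ {two_sample_power(d, 1500):.2f}")
# -> roughly 0.78 for d = 0.10 and >0.99 for d = 0.20. At that scale, a pooled
# estimate near zero is hard to square with a true average effect of even
# modest size unless procedural noise is assumed to be extreme.
```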

The possibility that the true effect is very small or nonexistent is not seriously engaged. This reinforces the asymmetry in explanatory weighting.


  5. The “Psychological Sledgehammer” Standard

The recommendation that manipulations should function like a “psychological sledgehammer” raises an additional issue. If only very strong manipulations count as valid tests, then many real-world operationalizations will be deemed insufficient. This narrows the acceptable domain of theory testing and increases the probability that null findings are attributed to weakness of implementation rather than limitations of theory.

That standard shifts the evidentiary burden in a way that implicitly protects established effects.


  6. The Excess Success Gap

A major omission concerns the historically high rate of statistically significant findings in psychology journals—often described as exceeding 90%.

If effects are highly context-sensitive and fragile, then original studies should also frequently fail. Minor variations in lab setup, experimenter behavior, and measurement sensitivity would generate many null outcomes. Yet the published literature overwhelmingly reports positive results.

These two claims cannot comfortably coexist without additional explanation.
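A back-of-the-envelope simulation makes the tension concrete. The parameters below are illustrative assumptions of mine (a true effect of d = 0.40, 50 participants per cell, and a file drawer that shelves every nonsignificant result), not figures taken from Harmon-Jones et al.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2025)

# Illustrative assumptions: 10,000 two-group experiments, n = 50 per cell,
# true effect d = 0.40 -> statistical power of roughly 50%.
n_studies, n_per_cell, true_d = 10_000, 50, 0.40
control = rng.normal(0.0, 1.0, size=(n_studies, n_per_cell))
treatment = rng.normal(true_d, 1.0, size=(n_studies, n_per_cell))

p = stats.ttest_ind(treatment, control, axis=1).pvalue
pooled_sd = np.sqrt((treatment.var(axis=1, ddof=1) + control.var(axis=1, ddof=1)) / 2)
observed_d = (treatment.mean(axis=1) - control.mean(axis=1)) / pooled_sd

sig = p < 0.05
print(f"Studies reaching p < .05 among all studies run: {sig.mean():.0%}")  # ~50%
print("Success rate in journals if only those are published: 100% by construction")
print(f"True d = {true_d:.2f}; mean d among 'published' studies = {observed_d[sig].mean():.2f}")
# With power around 50%, a journal success rate above 90% is only possible if
# nonsignificant studies vanish into the file drawer, and that same selection
# inflates published effect sizes well above the true value (here by roughly 40%).
```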

There are only a few ways to reconcile fragile effects with excess success:

  1. Many effects are actually robust and high-powered.
  2. Publication bias and selective reporting filter out null results.
  3. Researchers iteratively tune operationalizations and analyses until significance is obtained.
  4. Journals selectively publish successful implementations.

The article does not engage this macro-level constraint. It does not integrate publication bias or excess-success analysis into its explanatory framework. As a result, replication failures are treated as methodologically suspect, while the structural inflation of original literatures remains largely unaddressed.

This omission materially strengthens the case for a defensive bias rating.


  7. What Prevents a More Extreme Rating

Despite these concerns, the article does not:

  • Deny p-hacking or questionable practices.
  • Reject replication as essential.
  • Claim that all replication failures are invalid.
  • Dismiss preregistration.
  • Attack statistical reform movements.

It offers constructive methodological advice and acknowledges complexity in statistical inference. The tilt is moderate, not extreme.


Overall Assessment

Harmon-Jones et al. provide valuable, concrete guidance on improving internal validity in laboratory research. Their emphasis on methodological nuance is important and often neglected in replication debates.

However, the paper consistently places greater explanatory weight on replication imperfections than on possible inflation or fragility in original findings. It does not reconcile its fragility narrative with the excess success of published psychology, nor does it engage deeply with quantitative evidence regarding effect size shrinkage and false discovery rates.

For these reasons, the article can be fairly characterized as moderately biased in defense of existing literature — approximately –4 to –5 on a –10 to +10 scale.

P.S. Why my own bias rating would be more extreme than ChatGPT's

The most important cue for bias is that success rates over 90% in psychology journals have been documented repeatedly since Sterling (1959). An article that avoids talking about this implausible result, which undermines the meaning of statistical significance, is likely to be biased and to downplay the amount of selection bias in psychology. Insiders know that, in practice, only significant results get published. This unscientific incentive structure, not contextual sensitivity, is the root cause of the replication crisis. Failure to mention Sterling is a red flag.

The second red flag is the citation of Van Bavel et al. (2016) as a reference for contextual sensitivity. Van Bavel and colleagues claimed to have shown that contextual sensitivity explains the lower replication rate in social psychology. However, Inbar (2016) showed that they did not report the critical interaction test and that this test was not significant. There is no evidence that contextual sensitivity contributes to low replication rates in social psychology. Rather, social psychologists never ran direct replications and used contextual sensitivity as a way to protect their theories from disconfirming evidence. Change something trivial and get significance again? Great, the effect is robust. If not, clearly the effect was real before, just not in this context. Then publish only the significant results and claim that the theory is universally true across time, place, and populations. This is how it was done, and it was wrong. Sadly, some social psychologists cannot just say: sorry, we messed up, now let's move on.