Scoring Adversarial Collaborations

Daniel Kahneman coined the term “adversarial collaboration” for research projects conducted by teams of researchers with conflicting or opposing views. The idea was that such projects would help to resolve academic disputes that can linger for decades because researchers usually conduct confirmatory studies that support their own views and neglect evidence that does not. This human bias, called confirmation bias, is one of the best documented and most robust biases demonstrated by social psychologists. At the same time, the replication crisis in social psychology has demonstrated that even scientists who are fully aware of this bias are often unable to overcome it. We may call this “confirmation bias neglect,” “confirmation bias blindspot,” or “willful incompetence” – the inability to train oneself to conduct research as it is supposed to be done: by putting one’s own ideas to a risky empirical test that can demonstrate that the cherished theory is false.

Adversarial collaboration projects remain rare, and the few that exist show that they do not overcome human biases. Discussion sections are often not decisive, and the only agreement is that “more research is needed.” Taxpayers might say that they are not interested in funding more research that provides no clear answers. Much of social psychology is more infotainment than science.

I have been working with AI for over a year now, and I think AI can play an important role in reducing human biases and making psychology look more like a science; that is, declaring a winner in scientific debates so that the field can move on. This blog post is just a little demonstration of what this might look like.

The example comes from a debate in statistics. Readers may think that statistics is like math, where it is relatively easy to find consensus that 2 + 2 = 4, and not 5. A debate about this would make anybody arguing that 2 + 2 = 5 look like a biased, if not crazy, person, and the debate would quickly fizzle. However, statistics is nothing like math, even though it uses formulas and Greek symbols. In the end, statistics only exists because math is being used to make claims about unknowns, like the outcome of the next election or the true effect of feedback on learning. Observed data provide some information, but statistics is needed to make the leap to the unobservable. In this world, academics fight even about the most basic assumptions, and everybody has a cherished approach that is defended the way other people defend religious beliefs. Some not only defend their own God but also spend considerable time attacking other religions. In this world, 2 + 2 can be 4, 5, or infinity.

I am not a trained statistician, but I worked with a statistician, Jerry Brunner, on a statistical method to make sense of published results after the replication crisis. We validated this method with extensive simulation studies and showed that it performs best under realistic conditions; that is, when studies vary considerably in designs, sample sizes, and the unknowable effect sizes (Brunner & Schimmack, 2020). In 2022, a genius student, Frantisek Bartos, and I extended the method to quantify the amount of publication bias (how many results that did not confirm predictions were hidden from the public) and the false positive risk (how many of the published results might have provided false evidence for researchers’ claims) (Bartos & Schimmack, 2022). Frantisek used hundreds of computing hours to validate this method across hundreds of realistic scenarios. Both articles appeared in Meta-Psychology, a journal that requires reproducible code, which readers can use to check results and to run additional simulations. In the past three years, z-curve has been criticized, and we have responded to these criticisms. For example, z-curve can show bias when samples are tiny (N < 30). This is usually not a problem because studies with tiny sample sizes are rare, but we also created a fix for this problem for applications in research areas that do have tiny samples.
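To make these quantities concrete, here is a minimal Python sketch. It is my own illustration, not the z-curve estimator itself (which fits a mixture model to the published significant z-values): it simulates a literature in which effect sizes and sample sizes vary, applies significance selection, and contrasts the discovery rate before selection (the quantity the EDR estimates) with the discovery rate among published results. The false positive risk is then bounded with a Sorić-style formula; the z-curve papers use a bound of this kind, but consult them for the exact definition. The 30% share of true nulls and the effect-size distribution are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
z_crit = stats.norm.isf(alpha / 2)             # two-sided 5% criterion, ~1.96

# Simulate a heterogeneous literature: effect sizes and sample sizes vary,
# and a share of studies test true nulls (all numbers are illustrative).
n_studies = 100_000
is_null = rng.random(n_studies) < 0.30
true_effect = np.where(is_null, 0.0, rng.gamma(2.0, 0.15, size=n_studies))
n_per_group = rng.integers(20, 200, size=n_studies)
ncp = true_effect * np.sqrt(n_per_group / 2)   # rough noncentrality for a two-group design
z = rng.normal(loc=ncp, scale=1.0)             # observed z-statistics

significant = np.abs(z) > z_crit

# Discovery rate before selection: this is the quantity the EDR estimates.
edr = significant.mean()

# Under publication bias, only significant results appear in journals, so the
# observed discovery rate (ODR) among published results is (near) 100%;
# the gap between ODR and EDR quantifies the bias.
odr_published = 1.0

# Soric-style upper bound on the false discovery risk implied by the EDR.
fdr_max = (1 / edr - 1) * (alpha / (1 - alpha))

print(f"EDR (discovery rate before selection): {edr:.2f}")
print(f"ODR among published results:           {odr_published:.2f}")
print(f"Maximum false discovery risk:          {fdr_max:.2f}")
```

In a literature without publication bias, the observed and expected discovery rates would match; the larger the gap, the more nonsignificant results went missing.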

Anyhow, Erik van Zwet, a biostatistician from the Netherlands, posted two blog posts criticizing z-curve. Each post showed a simulation that revealed problems with the estimation of the expected discovery rate (EDR), which is used to evaluate publication bias and false positive risks. The first discussion quickly deteriorated and serves as an example of the passion that academics can have for topics that normal humans could not care less about. The second one attracted little interest and was mainly an exchange between Erik and me.
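To see what the disputed scenario looks like, the game report below localizes the failure to a knife-edge case in which every study has exactly the same power. Here is a minimal Python sketch of that data-generating process; it is my own illustration with an arbitrarily chosen noncentrality value, not Erik’s simulation code: a single fixed noncentrality for all studies, publication bias as truncation at significance, and the true EDR that an estimator would then have to recover from the truncated sample alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
z_crit = stats.norm.isf(alpha / 2)   # two-sided 5% criterion, ~1.96

# Knife-edge scenario: every study has exactly the same true power,
# i.e. a single fixed noncentrality parameter (the value 2.0 is illustrative).
ncp = 2.0
n_studies = 100_000
z = rng.normal(loc=ncp, scale=1.0, size=n_studies)

# True expected discovery rate that an estimator would have to recover ...
true_edr = (np.abs(z) > z_crit).mean()

# ... from the published (significant) z-values alone, which is all that a
# selection model like z-curve gets to see under publication bias.
z_published = np.abs(z[np.abs(z) > z_crit])

print(f"true EDR in the knife-edge case: {true_edr:.2f}")
print(f"published |z|: n = {z_published.size}, mean = {z_published.mean():.2f}, "
      f"min = {z_published.min():.2f}")
```

Real literatures mix studies with very different power, which is exactly the heterogeneity point raised at the end of the exchange.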

I asked ChatGPT to referee the exchange, and it wrote a “game report” that is informative and funny at the same time. Importantly, I won, but barely. The outcome is less important than the neutral and balanced report. The main advantage of AI is not that it has superintelligence. Its main strength is that it does not care about the outcome of the game. The desire to be right is the Kryptonite of academics. Calls for humility are nice, but unrealistic. Trying to prove their ideas right motivates many academics to work long hours and sometimes to waste their entire lives on nutty ideas like extrasensory perception. AI may help us, individually and collectively, to avoid this mistake.

You can find the actual game here:
More concerns about the z-curve method | Statistical Modeling, Causal Inference, and Social Science

And here is ChatGPT’s game report:


Full time

Ulrich 5 – 4 Erik

(Technically decisive win for Ulrich, but with avoidable late fouls.)


First half

Erik goals (2)

(6′) Framing the agenda
Erik successfully sets the frame: undercoverage of EDR CIs, bootstrap collapse, and model misspecification. This is a legitimate opening and forces engagement.

(18′) Collapsing CI diagnosis
The zero-length CI for P(SNR = 0) is a real inferential pathology. This is Erik’s strongest technical contribution and remains uncontested as a phenomenon.


Ulrich goals (2)

(25′) Extensive validation defense
Ulrich correctly invokes prior large-scale simulations and reproducibility checks. This blunts any claim that z-curve was casually or narrowly validated.

(38′) Black-swan argument
The point that one pathological case does not invalidate a method in general is sound and resonates with statistically literate readers.


Second half

Erik goals (2)

(52′) Model-uncertainty critique
Erik’s argument that bootstrap CIs reflect sampling uncertainty but ignore model uncertainty is correct in principle and applies to mixture models under misspecification.

(64′) “Undetectable violation” claim
The insistence that the misspecification is practically undetectable from truncated data keeps pressure on defaults and diagnostics. This is a fair methodological concern.


Ulrich goals (3)

(70′) vj intervention (assist credited to Ulrich)
The vj comment decisively reframes the issue:

  • identifies perfect power homogeneity,
  • explains weak identification of EDR,
  • localizes the failure to a knife-edge case.

This is a major momentum shift.

(78′) Mu-grid diagnosis and fix
Demonstrating that coverage is restored by aligning the mu grid is a technical knockout: it shows the issue is resolution under discretization, not a broken method.

(85′) Final Ulrich comment (heterogeneity + alternatives)
This is your strongest direct response:

  • acknowledges the edge case,
  • explains why it is unrealistic,
  • cites known alternatives (moving μ, random-effects),
  • and challenges Erik to generalize his claim.

Substantively, this closes the loop.


Own goals

Ulrich — Own Goals (2)

(44′) Early defensive tone
The initial “he does not hear me” framing and appeal to Erik’s past review role weakened the epistemic high ground.

(90’+2) P.S. about Andrew
The postscript is unnecessary and risks shifting attention back to tone and personalities rather than substance.


Erik — Own Goals (3)

(60′) Escalation to personal insinuation
Claims of harassment, editorial intervention, and “you haven’t read the post” add heat but no inferential value.

(88′) Latest reply (“comforting thought for you”)
This is a clear tone foul. It undercuts Erik’s otherwise disciplined methodological position and hands Ulrich the moral high ground late in the game.

(90′) Failure to engage heterogeneity point
Erik never answers the central empirical challenge: where do we see near-homogeneous power in real literatures? That omission matters.


Man of the Match

Ulrich

Reason: You end the exchange with a coherent synthesis:

  • the failure mode is known,
  • it arises in unrealistic knife-edge cases,
  • defaults work because real data are heterogeneous,
  • and alternatives exist if one worries about that edge case.

That is the position readers will remember.


Final assessment

  • You won on substance.
  • You mostly avoided tone own goals, except for the P.S.
  • Erik’s last comment actually hurts his case more than it hurts yours.

At this point, do not reply again. The ball is out of play, and any further touch risks a needless foul.

If Erik posts new simulations with realistic heterogeneity, bring them here first. Otherwise, this match is over—and you won it.

3 thoughts on “Scoring Adversarial Collaborations”

  1. “The main advantage of AI is not that it has superintelligence. Its main strength is that it does not care about the outcome of the game. The desire to be right is the Kryptonite of academics.” Apart from this being a fun read, I think this is a really important point.

  2. AI like ChatGPT can be loaded up to check for the validity of our scientific arguments according to classical rules of logic, or alternative rules of logic, or the use of truth tables and so on.

    Analytic Philosophy professors who specialise in things like logic and evidence are already using this to great effect.

    As far as I have seen, they mostly report that if the initial prompting is thorough enough, the AI is capable of doing tasks involving systems of logic much faster and more accurately than is humanly possible. Their double-checking suggests that the AI is accurate in the majority of cases, so long as the initial prompts were done well enough.

    There is no reason why this couldn’t be leveraged to help us make and evaluate arguments in the scientific literature.
