P-Hacking in 2024 Like it was 2004

For over a decade, psychologists have been faced with a replication crisis. Laboratory studies with small undergraduate samples and low-powered between-subjects designs often produced non-significant results, but with the help of statistical tricks that were acceptable to peers (but not to naive observers who trusted psychologists to be real scientists) and selective publishing of statistically significant results, journals published 90% or more results that supported researchers’ beliefs (Sterling, 1959; Sterling et al., 1995). Many psychologists have willfully ignored the incredibly high success rates in their journals without realizing that selection for significance means everything can be significant, even totally ridiculous claims that “extraverts can foresee sexual opportunities that do not even exist yet” (Bem, 2011).

After a decade of embarrassing replication failures, some reforms have been implemented to make research findings more credible, although success rates are still around 90%. Moreover, some anti-science psychologists are fighting back against reforms and against evidence that their own work did not really advance our understanding of the human mind and behavior. A leading anti-science campaigner is Duane T. Wegener at Ohio State University, a former student of Richard Petty, also at Ohio State University. As they say, the apple doesn’t fall far from the tree.

A well-known unscientific practice in academia is to selectively cite work that helps to publish one’s own work and to ignore citations that are not helpful. For experts, this makes it easy to find articles that are likely to have a specific bias. I used this approach to examine Wegener, Fabrigar, Pek, and Hoisington-Shaw’s article “Evaluating Research in Personality and Social Psychology: Considerations of Statistical Power and Concerns About False Findings” (2022), published in the journal Probably Significant Publication Bias (PSPB). The article talks about using power to evaluate published results, but ignores my article that used power for exactly this purpose (Schimmack, 2012). This does not mean that they do not know the article. Fabrigar actually invited me to give a talk at Queen’s University, where I used five dice to illustrate the low probability of getting 5 significant results in a row even with high power. Rolling a one counted as a non-significant result (probability 1/6 ≈ .17); any other number counted as a significant result (probability 5/6 ≈ .83). The probability of obtaining 5 significant results in a row without cheating is .83^5 ≈ 40%. This is pretty high, but it does not justify the nearly 100% success rates of multiple-study articles in journals like PSPB. So, many multiple-study articles probably suffered from significant publication bias.
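The dice arithmetic can be checked in a few lines of code. This is a sketch of the multiplication rule for independent studies; the 80% figure is the conventional power benchmark, not a value taken from any specific article.

```python
# Probability that k independent studies ALL produce a significant result,
# given the per-study probability of significance (the multiplication rule).
def all_significant(per_study_prob: float, k: int) -> float:
    return per_study_prob ** k

# Dice illustration: any face but "one" counts as significant, probability 5/6.
dice = all_significant(5 / 6, 5)    # about 0.40
# At the conventional 80% power benchmark, the odds are even worse.
power80 = all_significant(0.80, 5)  # about 0.33

print(f"5 of 5 significant with p = 5/6 per study: {dice:.2f}")
print(f"5 of 5 significant with 80% power:         {power80:.2f}")
```

Even with five dice stacked heavily in the researcher's favor, an all-significant set of five studies happens only about 40% of the time, which is the point of the demonstration.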

Wegener and Fabrigar became editors of PSPB after the replication crisis, but didn’t really implement reforms. The journal didn’t even implement badges for good practices. So, what do Wegener et al. (2022) have to say about using power to evaluate published articles? Let’s hear from researchers who cited them in a six-study article in, checks notes, the Journal of Research in Personality (sad face emoji) with the title “Personal relative deprivation and moral self-judgments: The moderating role of sense of control” (Zhang, Wei, Wang, & Zhang, 2024):

“Although the power to detect the interaction effect in each single study may fall below the conventional criterion of 0.80, with a set of four studies in a row showing consistent results, it is suggested that moderate levels of power may not perform much worse than high power in reducing false finding rates, and that the results did provide strong evidence for a true effect over null effect (with odds > 150 when the prior probability of the null hypothesis was set to be 0.50, Wegener et al., 2022)”

Thankfully, I am taking my medication and do not have any false expectations about psychology as a science. My biggest regret is going into psychology, and I warn everybody not to repeat this mistake. See the progress biology has made since 1988 when I had to decide on my major. At least I can use what I learned to warn people about academics who masquerade as scientists.

So, we do 4 studies in a row with modest power, and this will help us establish that there is a true effect? Sorry if this is too much math for psychologists, but with 50% power the probability of this lucky outcome is just .5^4 = 6%, like betting on red in roulette and hoping to double your life savings four times in a row. Now, academics do not bet their life savings. They bet taxpayers’ money on studies and win when they get a publication out of it. Wouldn’t you like to gamble with other people’s money?
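The same arithmetic at a few power levels makes the point plain. This sketch assumes independent studies and treats power as the per-study probability of a significant result; the specific power levels are chosen for illustration.

```python
# Chance that four independent studies in a row all come out significant,
# as a function of per-study power (multiplication rule for independent events).
for power in (0.50, 0.60, 0.80, 0.95):
    print(f"power = {power:.2f}: P(4 of 4 significant) = {power ** 4:.4f}")
```

At 50% power the all-significant outcome has a probability of .0625, about 6%; only very high power (well above the conventional 80%) makes a string of four significant results unsurprising.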

So, what do the results look like?

Study 1, main effect in a scenario study, t(146) = 2.97, p = .003 (not personality)
Study 2, interaction in a scenario study, t(398) = 2.22, p = .027 (not personality)
Study 3, interaction in a scenario study, t(331) = 2.45, p = .015 (not personality)
Study 4a, sense of control as moderator 1, t(348) = 2.20, p = .028 (personality)
Study 4b, sense of control as moderator 2, t(348) = 2.95, p = .003 (personality)
Study 5a, sense of control as moderator 1, t(160) = 2.20, p = .029 (personality)
Study 5b, sense of control as moderator 2, t(157) = 0.17, p = .16 (personality)
Study 6a, sense of control as moderator 1, t(268) = 2.36, p = .019 (personality)
Study 6b, sense of control as moderator 2, not tested
Study 6c, sense of control as mediator, t(264) = 2.07 (personality)

Let’s ignore the moderated mediation:
“Moderated mediation analyses revealed that the interaction between sense of control and self-esteem on moral self-judgments approached significance, β = -0.12, t(268) = 1.88, p = 0.061, 95% CI [-0.243, 0.006]”

Only three findings are relevant for the claim in the title, namely 4a, 5a, and 6a. All three are significant, but do they support the claim in the title? Welcome to the Test of Insufficient Variance (TIVA). T-values in larger samples (N > 100) are approximately z-values, and z-values have a standard deviation of 1. Thus, z-values from independent tests should vary according to the sampling error in z-values. If the observed variance is smaller, it suggests human influence that reduced sampling error, especially if all results are just barely statistically significant (z > 2 & z < 2.8).

The t-values for the three critical tests are 2.20, 2.20, and 2.36. The observed standard deviation is about 0.09 rather than 1.00.
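A minimal sketch of the TIVA computation for the three critical t-values from Studies 4a, 5a, and 6a, treating them as approximate z-values since all samples exceed N = 100. For 2 degrees of freedom the chi-square CDF has the closed form 1 − exp(−x/2), so no statistics package is needed.

```python
import math
import statistics

# The three critical test statistics (Studies 4a, 5a, 6a), treated as z-values.
z = [2.20, 2.20, 2.36]

# Under honest reporting, independent z-values have variance 1.
var = statistics.variance(z)   # sample variance, df = len(z) - 1
df = len(z) - 1
chi2_stat = df * var / 1.0     # scaled variance, chi-square distributed under H0

# Left-tail chi-square probability; for df = 2, CDF(x) = 1 - exp(-x / 2).
p_tiva = 1 - math.exp(-chi2_stat / 2)

print(f"observed SD  = {math.sqrt(var):.2f}")  # about 0.09, far below 1.00
print(f"TIVA p-value = {p_tiva:.3f}")          # about .008
```

Variance this far below 1 would occur by chance less than 1% of the time, which is why TIVA flags these three results as too consistent to be a set of honest independent tests.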

And there you have it: the first empirical article I found that cited Wegener et al. (2022) was p-hacked and used the Wegener et al. (2022) citation to claim that many significant results in studies with modest power provide strong evidence for researchers’ claims. This claim ignores that it is very unlikely to get only significant results with modest power (Schimmack, 2012).

Let this blog post be a warning for students and other consumers of psychology. Psychology is not trustworthy, and 10 years of discussion about bad practices have not led to reforms that ban unscientific practices. The Anti-Psychological Science (APS) society has embraced badges (I am sharing my materials, LOL) as a sign that psychology is now more trustworthy, but they have not promoted real reforms that call out unethical scientific practices. They also show no interest in serious investigations of past research with incredible success rates of 90% that have been ridiculed by statisticians since Sterling (1959).

Why do I care? Why do I attack individual researchers like Wegener, which some falsely call an ad-hominem attack? The reason is that there are psychologists who want to do good science, but they cannot win in a game that rewards cheaters and liars, what Fiske and Taylor called charlatans. Fighting bad actors means standing up for the victims who did not get a tenured faculty position or grants after they invested 10 years of their lives getting a degree in a failed science. Fortunately, the bad actors will go to hell (well, Wegener already lives in Ohio, so there is that).

If you are an academic, tell your friends and colleagues about this blog post. Warn them that they should not cite Wegener, especially if they are p-hacking. Doing so is like parking your car illegally in a handicap spot and turning on the hazard lights. You are just signaling to the parking attendant to give you a ticket (and yes, that is a threat).
