Why you should not trust Jolynn Pek and Duane T. Wegener

When I started as an undergraduate student in 1988, a big problem was finding relevant articles or books. I became interested in emotion research and by 1993, I had read pretty much every relevant piece of academic research published at that time. I even called a researcher at the University of Hawaii, who sent me an unpublished manuscript.

Fast forward to 2024 and we live in a world where we are flooded with too many articles on every possible topic. The new problem is that most of these published articles are useless because they (a) ignore other articles because the authors did not have time to read everything in their area, (b) misrepresent facts, or (c) present well-known findings as if they were new. In this world of information overload, the challenge is to find the articles that are actually useful. This is especially difficult for novices, who are unable to evaluate the quality of published articles.

What we urgently need is the equivalent of a consumer report that provides consumers of scientific information with credible information about the quality of the product they are about to consume. This requires funding researchers to act as independent evaluators of research. Maybe there could be a meta-journal that republishes articles that have passed expert review rather than peer review.

Unfortunately, I am not an independent expert. Rather, my colleagues and I have worked for over a decade to create statistical tools that can reveal the questionable research practices that turn non-significant results into significant, publishable ones. The basic idea of these statistical tools is simple and can be traced to Sterling et al.'s (1995) article that examined the success rate in psychology journals; that is, how often psychology articles report a statistically significant result that supports the researchers' prediction about the direction of an effect (e.g., that priming people with words related to old age, without their awareness, makes them walk more slowly).

Sterling et al. (1995) replicated Sterling's (1959) earlier finding that psychology journals have a success rate over 90%. They pointed out that this is implausible for two reasons. First, such a high success rate would require testing only true hypotheses, because a false hypothesis has only a 5% probability of producing a significant result. Second, even when a true hypothesis is tested, sampling error can produce non-significant results, especially in the between-subject designs with small samples favored by experimental social psychologists before 2010. Thus, the high rate of significant results suggests that articles with significant results are published and articles with non-significant results are not (Rosenthal, 1979). The problem is that this selection bias undermines the evidential value of a significant result, and in theory most or even all published results could be false (Ioannidis, 2005).
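To make Sterling's point concrete, here is a back-of-the-envelope calculation of my own (the 80% power figure is an assumption, and a generous one for this literature): the expected success rate is a weighted average of power for true hypotheses and the alpha level for false hypotheses, so even a field that tests only true hypotheses cannot exceed its average power.

```python
# Expected success rate = share of true hypotheses * power + share of false ones * alpha
alpha = 0.05   # a false hypothesis yields a significant result 5% of the time
power = 0.80   # assumed power; small between-subject studies typically have much less
for p_true in (1.0, 0.8, 0.5):
    expected = p_true * power + (1 - p_true) * alpha
    print(f"{p_true:.0%} true hypotheses -> expected success rate = {expected:.0%}")
# Even the best case (only true hypotheses) yields 80%, short of the observed 90%+
```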

Sterling's theoretical insight led to several empirical attempts to detect selection bias (Francis, 2012; Ioannidis & Trikalinos, 2007; Schimmack, 2012). These tests rely on a comparison of the success rate and mean observed power. Without selection bias, the success rate cannot be higher than mean observed power (Brunner & Schimmack, 2020). Numerous articles have shown that social psychology articles often report success rates of 90% or more alongside much lower estimates of mean power. For example, Schimmack (2020) analyzed 678 statistical tests from social psychology journals that were hand-coded by Motyl et al. (2017).

The results showed a success rate of 90%, which replicates Sterling's findings from 1959 and 1995. The estimated mean power, called the expected discovery rate, was 19%. Even if we take uncertainty in the power estimate into account and use the upper limit of a conservative confidence interval, mean power is only 36%. These results also explain why actual replications of a representative sample of studies produced only 25% significant results (Open Science Collaboration, 2015).
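To get a sense of how incredible this combination of numbers is, here is a crude binomial check of my own (a simplification of the logic behind these tests, not the published z-curve analysis): if each of the 678 tests had the generous upper-bound power of 36%, the chance of getting 90% significant results is essentially zero.

```python
from scipy.stats import binom

n = 678                 # statistical tests hand-coded by Motyl et al. (2017)
k = round(0.90 * n)     # roughly 90% of them were reported as significant
power = 0.36            # generous upper bound on mean power from the z-curve analysis

# Probability of at least k significant results if every test had 36% power
print(binom.sf(k - 1, n, power))   # effectively zero without selection for significance
```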

In short, estimating the mean power of published studies provides a powerful tool to examine the credibility of the success rates of authors, journals, or scientific disciplines. The near-perfect success rate in experimental social psychology is inconsistent with the power of its studies to produce significant results. This is a key problem that led to the replication crisis, or crisis of confidence, in social psychology and to initiatives that encourage more honest reporting of non-significant results (badges for data sharing and for pre-registration of designs and analysis plans).

You Shall Not Compute Observed Power

Not everybody likes a powerful tool that reveals shady practices, least of all people who have engaged or are engaging in them. For example, many people who got away with murder were not pleased when it became possible to identify them with DNA analyses decades after the crime. In sports, it suddenly became possible to reveal doping when more powerful tests were used to analyze frozen urine samples decades later. In this regard, observed power is a doping test for scientists, especially prolific ones who have published hundreds of statistical results. Their large number of statistical results makes it easy to show that their success rates are incredible.

The first criticism against the use of bias tests was that it is obvious that 90% success rates are not realistic and that a bias test only reveals what everybody already knew: results were being selected for significance. For this reason, the developers of p-curve did not even bother to test for publication bias and simply assumed that it is present. The response to this criticism is that bias tests can also show the absence of bias and can quantify the amount of publication bias. For example, psychologists might be surprised that meta-analyses of clinical trials show no signs of publication bias (van Zwet et al., 2023).

Gaslighting about the Use of Bias Tests

A team of 11 authors, led by social psychologist Roger Giner-Sorolla, wrote a long review article about statistical power for social psychologists (Giner-Sorolla et al., 2023). They agree that "publication bias can be assumed in most topics of psychology" (p. 22). They also mention our attempt to estimate the amount of publication bias using statistical analysis of published results: "Numerous published analyses have attempted to test the credibility of published multi-study articles or even literatures, using one of the specific methods developed for these purposes: for example, p-curve (Simonsohn et al., 2014) or Z-curve (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020)." [To clarify, p-curve was explicitly not designed to assess publication bias because it assumes that publication bias is present.]

However, the authors caution readers that "the application of these methods has itself come under criticism, most strongly by Pek et al. (2022). Apart from definitional objections to calculating power post-hoc, they also bring up cautions about the theoretical assumptions of power analysis that are likely to be violated by properties of aggregated actual studies. These assumptions render diagnostic analyses based on observed statistical power imprecise, and 'at best', exploratory" (p. 261).

I was not aware of Pek et al.'s criticism of our method, and I was never asked to review any of their articles. So let's examine their arguments in open, post-publication review.

The Argument Against Post-Hoc Power Estimation

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding questionable research practices surrounding statistical power analysis. In W. O'Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding questionable research practices in applied psychology (pp. 243-267). Springer International Publishing.

It is notable that Pek et al. published their criticism of a method that can reveal questionable research practices in a book on questionable research practices, with a title that implies our method is itself a questionable practice. That is, of course, convenient when you write a book chapter and do not have to face a rebuttal from the people you are attacking. However, it should not come as a surprise that the criticized authors feel motivated to respond and are not particularly motivated to be kind and polite in their response (politeness is also not my personality).

One way to evaluate the scientific value of an article is to examine which references are cited and which are not. Notable omissions in Pek et al.'s chapter are Sterling's observations of incredibly high success rates in psychology and his theoretical insight that success rates can be compared to the power of studies. They cite Simmons et al.'s famous "False-Positive Psychology" article, but they do not mention that it led to widespread concerns that many published results in social psychology might be false positives. They do not mention that a celebrated replication project found that only 25% of replication attempts in social psychology produced a significant result, suggesting massive publication bias in the original articles. Finally, the chapter does not discuss Motyl et al.'s coding of hundreds of research findings in social psychology journals or my z-curve analyses of their results (Schimmack, 2020). In short, a chapter in a book on questionable research practices does not mention any of the empirical evidence that publication bias is a serious concern in experimental social psychology. Even if our bias-detection method were flawed, its conclusions would not be, because there is ample convergent evidence that experimental social psychology has produced many findings that do not hold up in actual replication attempts.

Pek et al. also (have to) admit that our method has been validated in extensive simulation studies, which is typically used as evidence that a statistical method works. So, what is Pek et al.'s criticism of our method to reveal publication bias and inflated success rates? Their only argument is that these simulations are not informative because simulated data are different from real data. I am not kidding you. Here is the quote.

“Simulation studies validate the performance of these methods on hypothetical data, but the separation between theoretical and empirical data raise questions about the valid performance of these methods on collected data” (p. 28).

There are two problems with this argument. First, stating the obvious that simulated data are different from real data does not explain why real data would not produce the same results as simulated data. If we simulate data with 50% power and only select the significant results, we are going to see a discrepancy between the success rate (100%) and observed power (75% without a correction for selection bias, Schimmack, 2012; 50% with a correction model, Bartoš & Schimmack, 2022). Pek et al. do not explain how real data can produce a success rate of 100% when our model estimates only 50% power. They would have to argue that our method underestimates power, but they do not provide any reasonable explanation why our method would produce correct estimates with simulated data and underestimate power with real data.
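Here is a minimal simulation sketch of exactly this scenario (my own toy code, not the published z-curve implementation): studies run with 50% true power, "published" only when significant, produce a 100% success rate and a mean observed power of about 75%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
crit = norm.ppf(0.975)               # critical z for a two-sided test at alpha = .05
z = rng.normal(crit, 1.0, 200_000)   # noncentrality = crit gives exactly 50% true power

sig = z > crit                       # significant results (the lower tail is negligible here)
print(f"all studies      : success rate = {sig.mean():.2f}")   # ~0.50, matches true power

obs_power = norm.cdf(z[sig] - crit)  # naive observed power of the "published" studies
print(f"significant only : success rate = 1.00, mean observed power = {obs_power.mean():.2f}")
# ~0.75: inflated by selection, yet still far below the 100% success rate it has to explain
```

Recovering the true 50% from nothing but the significant results is exactly what the correction model mentioned above is designed to do.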

The second problem is that they do not explain why actual replication studies so often fail. If real data were free of publication bias and our method were flawed, we would expect high success rates in replication studies. However, if our method is correct and the power of real studies is much lower than success rates suggest, we would expect low replication rates, and that is exactly what we see in replications of social psychology studies.

In short, Pek et al.'s chapter lacks any scientifically valid criticism of a method that is based on a simple mathematical necessity: if you conduct studies with 50% power, you can expect only 50% significant results, not 100% (Brunner & Schimmack, 2020; Sterling, 1959).

Despite the lack of any valid arguments, Pek et al. conclude that "results from such tests for QRPs cannot be definitive and remain, at best, exploratory," which is then quoted verbatim by Giner-Sorolla et al. to discredit our method.

The attempt by (some) social psychologists to deflect criticism from their discipline and maintain an appearance of integrity is comical, and few people may care, because most of the results produced by social psychologists are harmless and void of real-world significance. However, every year thousands of students are eager to learn about social behavior and pay for courses to find out what scientists have discovered about it. Universities and faculty are not interested in telling them the truth that decades have been wasted on bogus research. My students often hear about replication problems for the first time from me, and they learn how to protect themselves from bad research. They are shocked to find that only 25% of results can be replicated, and they welcome a statistical tool that can tell them whether results are honestly reported or not. Pek et al. have nothing to offer as an alternative, and they are not interested in one. Their game is to discredit researchers who reveal shady practices. I know how I feel about academics who engage in questionable criticism of methods that can reveal questionable practices, but I don't think I have to spell it out for you.

In conclusion, I am not an unbiased observer. I can only tell you what even Pek et al. concede to their readers: our method works well in a broad range of simulations (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022). Their objection is that results based on simulated data differ from results with actual data in mysterious ways that would allow real data to produce 90% significant results without the use of questionable practices. It is a free world and you are allowed to believe what you want to believe. Unfortunately, these beliefs have consequences for the world we live in. So choose your sources wisely. If you just read bullshit, you will believe bullshit.
