Who Holds Meta-Scientists Accountable?

“Instead of a scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (Fiske & Taylor, 1984, p. 88).

A critical examination of Miller and Ulrich’s article “Optimizing Research Output: How Can Psychological Research Methods Be Improved?” (https://doi.org/10.1146/annurev-psych-020821-094927)

Introduction

Meta-science has become a big business over the past decade because science has become a big business. With the increase in scientific production and the minting of Ph.D. students, competition has grown and researchers are pressured to produce ever more publications to compete with each other. At the same time, academia still pretends that it plays by the rules of English lords, with “peer”-review and a code of honor. Even outright fraud is often treated like jaywalking.

The field of meta-science puts researchers’ behaviors under the microscope and often reveals shady practices and shoddy results. However, meta-scientists are subject to the same pressures as the scientists they examine. They get paid, promoted, and funded based on the quantity of their publications and citations. It is therefore reasonable to ask whether meta-scientists are any more trustworthy than other scientists. Sadly, that is not the case. Maybe this is not surprising because they are human like everybody else. Maybe the solution to human biases will be artificial intelligence programs. For now, the only way to reduce human biases is to call them out whenever you see them. Meta-scientists do not need meta-meta-scientists to hold them accountable, just like meta-scientists are not needed to hold scientists accountable. In the end, scientists hold each other accountable by voicing scientific criticism and responding to these criticisms. The key problem is that an open exchange of arguments and critical discourse is often lacking because insiders use peer review and other hidden power structures to silence criticism.

Here I want to use the article “Optimizing Research Output: How Can Psychological Research Methods Be Improved?” by Jeff Miller and Rolf Ulrich as an example of biased and unscientific meta-science. The article was published in the Annual Review of Psychology, a series that publishes invited review articles. One of the editors is Susan Fiske, a social psychologist who once called critical meta-scientists like me “method terrorists” because they make her field look bad. So far, this series has published several articles on the replication crisis in psychology with titles like “Psychology’s Renaissance.” I was never asked to write or review any of these articles, although I have been invited to review articles on this topic by several editors of other journals. However, Miller and Ulrich did cite some of my work and I was curious to see how they cited it.

Consistent with the purpose of the series, Miller and Ulrich claim that their article provides “a (mostly) nontechnical overview of this ongoing metascientific work” (p. 692). They start with a discussion of possible reasons for low replicability.

2. WHY IS REPLICABILITY SO POOR?

They state that “there is growing consensus that the main reason for low replication rates is that many original published findings are spurious” (p. 693).

To support this claim, they point out that psychology journals mostly publish statistically significant results (Sterling, 1959; Sterling et al., 1995), and then conclude that “current evidence of low replication rates tends to suggest that many published findings are FPs [false positives] rather than TPs [true positives].” This claim is simply wrong, because replication failures cannot distinguish false positives from true positives that had very low power to produce a significant result. They do not mention attempts to estimate the false positive rate (Jager & Leek, 2014; Gronau et al., 2016; Schimmack & Bartos, 2021). These methods typically show low to moderate estimates of the false positive rate and do not justify the claim that most replication failures occur when an article reported a false positive result.
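To see why, here is a back-of-the-envelope calculation (the 30% power figure is my own assumption for illustration, not a number from the article). A false positive and a true but underpowered finding produce replication failures at similar rates, so a replication failure by itself cannot tell us which one we are looking at.

```python
# Replication failure rates for a false positive vs. a true but underpowered finding.
alpha = 0.05        # conventional significance criterion
power_low = 0.30    # assumed power of an underpowered study of a true effect

p_fail_if_false_positive = 1 - alpha      # false positive: 95% of exact replications fail
p_fail_if_true_low_power = 1 - power_low  # true effect, low power: 70% of exact replications fail

print(f"Replication failure rate for a false positive:         {p_fail_if_false_positive:.0%}")
print(f"Replication failure rate for a true effect, 30% power: {p_fail_if_true_low_power:.0%}")
```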

Miller and Ulrich now have to explain how false positive results can enter the literature in large numbers when the alpha criterion of .05 is supposed to keep most of these results out of publications. They propose that many “FPs [false positives] may reflect honest research errors at many points during the research process” (p. 694). This argument ignores the fact that concerns about shady research practices first emerged when Bem (2011) published a nine-study article that seemed to provide evidence for pre-cognition. According to Miller and Ulrich, we have to believe that Bem made nine honest errors in a row that miraculously produced evidence for his cherished hypothesis that pre-cognition is real. If you believe this is possible, you do not have to read further and I wish you a good life. However, if you share my skepticism, you might feel relieved that there is actually meta-scientific evidence that Bem used shady practices to produce his evidence (Schimmack, 2018).

3. STATISTICAL CAUSES OF FALSE POSITIVES

Honest mistakes alone cannot explain a high percentage of false positive results in psychology journals. Another contributing factor has to be that psychologists test a lot more false hypotheses than true hypotheses. Miller and Ulrich suggest that only 1 out of 10 hypotheses tested by social psychologists is true. Research programs with such a high rate of false hypotheses are called high-risk. However, this description does not fit the format of typical social psychology articles, which have lengthy theory sections and often state “as predicted” in the results section, often repeatedly for similar studies. Thus, there is a paradox. Either social psychology is risky and results are surprising, or it is theory-driven and results are predicted. It cannot be both.
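For readers who want to see what a 1-in-10 base rate implies, here is a minimal sketch (alpha = .05 and 50% power are my assumptions for illustration): with honest testing under these conditions, almost half of all significant results would be false positives.

```python
# False discovery rate implied by a 10% base rate of true hypotheses.
base_rate = 0.10   # proportion of tested hypotheses that are true
alpha = 0.05       # significance criterion
power = 0.50       # assumed power when the hypothesis is true

true_positives = base_rate * power          # 0.05
false_positives = (1 - base_rate) * alpha   # 0.045

fdr = false_positives / (true_positives + false_positives)
print(f"Share of significant results that are false positives: {fdr:.0%}")  # ~47%
```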

Miller and Ulrich ignore the power of replication studies to reveal false positive results. This is not only true in articles with multiple replication studies, but across different articles that publish conceptual replication studies of the same theoretical hypothesis. How is it possible that all of these conceptual replication studies produced significant results, when the hypothesis is false? The answer is that researchers simply ignored replication studies that failed to produce the desired results. This selection bias, also called publication bias, is well-known and never called an honest mistake.
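A quick calculation shows how implausible such a flawless record is without selective reporting (the four-study example and the 50% power value are my own illustrative assumptions, not numbers from the article).

```python
# Probability that every study in a set of conceptual replications is significant.
alpha, power, m = 0.05, 0.50, 4   # m published studies of the same hypothesis

p_all_sig_if_false = alpha ** m   # hypothesis false: about 6 in a million
p_all_sig_if_true = power ** m    # hypothesis true, 50% power: about 6 in a hundred

print(f"P(all {m} studies significant | false hypothesis):           {p_all_sig_if_false:.6f}")
print(f"P(all {m} studies significant | true hypothesis, 50% power): {p_all_sig_if_true:.3f}")
```

Even when the hypothesis is true, an unbroken string of significant results is unlikely unless studies have very high power or failed studies are hidden.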

All of this gaslighting serves the purpose of presenting social psychologists as honest and competent researchers. High false positive rates and low replication rates happen “for purely statistical reasons, even if researchers use only the most appropriate scientific methods.” This is bullshit. Competent researchers would not hide non-significant results and continue to repeatedly test false hypotheses while writing articles that claim all of the evidence supports their theories. Replication failures are not an inevitable statistical phenomenon. They are man-made, in the service of self-preservation during early career stages and ego-preservation during later ones.

4. SUGGESTIONS FOR REDUCING FALSE POSITIVES

Conflating false positives and replication failures, Miller and Ulrich review suggestions to improve replication rates.

4.1. Reduce the α Level

One solution to reducing false positive results is to lower the significance threshold. An influential article called for alpha to be set to .005 (so that at most 1 out of 200 tests of a true null hypothesis produces a false positive result). However, Miller and Ulrich falsely cite my 2012 article in support of this suggestion. This ignores that my article made a rather different recommendation, namely to conduct fewer studies with a higher probability of providing evidence for a true hypothesis. This would also reduce the false positive rate without having to lower the alpha criterion. Apparently, they didn’t really read or understand my article.

4.2. Eliminate Questionable Research Practices

A naive reader might think that eliminating shady research practices should help to increase replication rates and to reduce false positive rates. For example, if all results had to be published, researchers would think twice about the probability of obtaining a significant result. Which sane researcher would test their cherished hypothesis twice with 50% power, that is, a 50% probability of finding evidence for it in each study? Just like flipping a coin twice, the chance of getting at least one embarrassing non-significant result would be 75%. Moreover, if they had to publish all of their results, it would be easy to detect hypotheses with low replication rates and either give up on them or increase sample sizes to detect small effect sizes. Not surprisingly, consumers of scientific research (e.g., undergraduate students) assume that results are reported honestly, and scientific integrity statements often imply that this is the norm.

However, Miller and Ulrich try to spin this topic in a way that suggests shady practices are not a problem. They argue that shady practices are not as harmful as some researchers have suggested, citing my 2020 article, because “QRPs also increase power by making it easier to reject null hypotheses that are false as well as those that are true (e.g., Ulrich & Miller 2020).” Let’s unpack this nonsense in more detail.

Yes, questionable research practices increase the chances of obtaining a significant result independent of the truth of the hypothesis. However, if researchers test only 1 true hypothesis for every 9 false hypotheses, QRPs have a much more severe effect on the rate of significant results for the false hypotheses (i.e., when the null hypothesis is true). A false hypothesis starts with a low probability of a significant result when researchers are honest, namely 5% with the standard criterion of significance. In contrast, a true hypothesis can have anywhere between 5% and 100% power, limiting the room for shady practices to inflate the rate of significant results when the hypothesis is true. In short, the effects of shady practices are not equal, and false hypotheses benefit more from shady practices than true hypotheses.
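A minimal sketch makes this asymmetry concrete. Suppose a questionable practice gives a researcher three independent chances to obtain p < .05 (for example, by testing several outcomes and reporting only the one that worked); the number of attempts, their independence, and the 50% power value are simplifications of my own, not numbers from the article.

```python
# Inflation of the rate of significant results from one stylized QRP:
# k independent chances to reach significance, reporting only the best.
alpha = 0.05
power = 0.50   # assumed honest power for a true hypothesis
k = 3          # number of hidden attempts

p_sig_false_honest = alpha
p_sig_false_qrp = 1 - (1 - alpha) ** k   # ~14%
p_sig_true_honest = power
p_sig_true_qrp = 1 - (1 - power) ** k    # 87.5%

print(f"False hypothesis: {p_sig_false_honest:.1%} -> {p_sig_false_qrp:.1%} "
      f"({p_sig_false_qrp / p_sig_false_honest:.1f}x inflation)")
print(f"True hypothesis:  {p_sig_true_honest:.1%} -> {p_sig_true_qrp:.1%} "
      f"({p_sig_true_qrp / p_sig_true_honest:.1f}x inflation)")
```

The false hypothesis nearly triples its chance of producing a “discovery,” while the true hypothesis gains much less because it starts closer to the ceiling.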

The second problem is that Miller and Ulrich conflate false positives and replication failures. Shady practices in original studies will also produce replication failures when the hypothesis is true. The reason is that shady practices lead to inflated effect size estimates, while the outcome of the honest replication study is based on the true population effect size. As this is often 50% smaller than the inflated estimates in published articles, replication studies with similar sample sizes are bound to produce non-significant results (Open Science Collaboration, 2015). Again, this is true even if the hypothesis is true (i.e., the effect size is not zero).
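A small simulation (my own, for illustration; the true effect of d = 0.2 and n = 20 per group are assumptions, not values from the article) shows both pieces of the problem: selecting for significance inflates the published effect size, and a same-sized replication of the inflated finding usually fails even though the effect is real.

```python
# Effect size inflation from selecting significant results, and its consequence
# for same-sized replications of a real but small effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d, n, n_sim = 0.2, 20, 20_000

observed_d, significant = [], []
for _ in range(n_sim):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    observed_d.append((treatment.mean() - control.mean()) / pooled_sd)
    significant.append(p < 0.05)

observed_d, significant = np.array(observed_d), np.array(significant)

print(f"Power of the original design (also of an exact replication): {significant.mean():.0%}")
print(f"Mean d among the 'published' (significant) studies:          {observed_d[significant].mean():.2f}")
print(f"True d that an honest replication is actually chasing:       {true_d}")
```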

4.3. Increase Power

As Miller and Ulrich point out, increasing power has been a recommendation to improve psychological science (or a recommendation for psychology to become a science) for a long time (Cohen, 1962). However, they argue that this recommendation is not very practical because “it is very difficult to say what sample sizes are needed to attain specific target power levels, because true effect sizes are unknown” (p. 698). This argument against proper planning of sample sizes is flawed for several reasons.

First, I advocated for higher power in the context of multi-study papers. Rather than conducting 5 studies with 20% power, researchers should use their resources to conduct one study with 80% power. The main reason researchers do not do this is that the single study might still not produce a significant result, and they are allowed to hide underpowered studies that failed to produce a significant result. Thus, an incentive structure that rewards publication of significant results favors researchers who conduct many underpowered studies and only report those that worked. Of course, Miller and Ulrich avoid discussing this reason for the lack of proper power analysis to maintain the image that psychologists are honest researchers with the best intentions.
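The arithmetic behind this incentive is simple (a sketch using the power values mentioned above): the shotgun strategy produces something publishable almost as often as the single well-powered study, but only if the failed studies are allowed to disappear into the file drawer.

```python
# Five underpowered studies vs. one properly powered study with the same resources.
power_small, k_small = 0.20, 5   # five studies at 20% power
power_large = 0.80               # one study at 80% power

p_at_least_one_hit = 1 - (1 - power_small) ** k_small   # ~67%
print(f"Chance the five-small-studies strategy yields a publishable 'success': {p_at_least_one_hit:.0%}")
print(f"Chance the single well-powered study succeeds:                         {power_large:.0%}")
```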

Second, researchers do not need to know the population effect size to plan sample sizes. One way to plan future studies is to base the sample size on previous studies. This is of course what researchers have been doing only to find out that results do not replicate because the original studies used shady practices to produce significant results. Many graduate students who left academia spent years of their Ph.D. trying to replicate published findings and failed to do so. However, all of these failures remain hidden so that power analyses based on published effect sizes lead to more underpowered studies that do not work. Thus, the main reason why it is difficult to plan sample sizes is that the published literature reports inflated effect sizes that imply small samples are sufficient to have adequate power.
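Here is a sketch of how this backfires in practice, using a standard normal approximation for the power of a two-sample comparison (the published d = 0.5 and true d = 0.25 are assumptions of my own that mimic the roughly 50% inflation mentioned above):

```python
# Planning a sample size on a published (inflated) effect size and then running
# the study against the smaller true effect.
import math
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)

def n_per_group_for_power(d, power):
    """Approximate per-group n for a two-sided, two-sample comparison."""
    return math.ceil(2 * ((z_crit + norm.ppf(power)) / d) ** 2)

def approx_power(d, n_per_group):
    """Approximate power of a two-sided, two-sample comparison."""
    noncentrality = d * math.sqrt(n_per_group / 2)
    return norm.cdf(noncentrality - z_crit)

published_d, true_d = 0.5, 0.25
n_planned = n_per_group_for_power(published_d, 0.80)   # about 63 per group
actual_power = approx_power(true_d, n_planned)         # about 29%

print(f"Planned n per group (based on published d = {published_d}): {n_planned}")
print(f"Actual power if the true d is only {true_d}:               {actual_power:.0%}")
```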

Finally, it is possible to plan studies around the minimal effect size of interest. These studies are useful because a non-significant result implies that the hypothesis is not important even if the strict nil-hypothesis is false: the effect size is just so small that it doesn’t really matter, and studying it would require extremely large samples. Nobody would be interested in investing large resources in studies of such irrelevant effects. However, to know that the population (true) effect size is too small to matter, it is important to conduct studies that are able to estimate small effect sizes precisely. In contrast, Miller and Ulrich warn that sample sizes could be too large because large samples “provide high power to detect effects that are too small to be of practical interest” (p. 698). This argument is rooted in the old statistical approach of ignoring effect sizes and being satisfied with the conclusion that the effect size is not zero, p < .05, which is what Cohen called nil-hypothesis testing and others have called a statistical ritual. Sample sizes are never too large, because larger samples provide more precision in the estimation of effect sizes, which is the only way to establish that a true effect size is too small to be important. A study that defines the minimum effect size of interest and uses this effect size as the null hypothesis can determine whether the effect is relevant or not.
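Here is a minimal sketch of what precision planning around a minimum effect size of interest could look like (the MESI of d = 0.2 and the target confidence-interval half-width of 0.1 are my assumptions for illustration):

```python
# Precision planning: choose n so the estimate of d is precise enough to be
# compared against a minimum effect size of interest (MESI).
import math
from scipy.stats import norm

mesi = 0.20        # smallest effect that would still matter
half_width = 0.10  # desired half-width of the 95% confidence interval for d
z = norm.ppf(0.975)

# For small d, the standard error of Cohen's d is roughly sqrt(2 / n) with n per group,
# so we solve z * sqrt(2 / n) <= half_width for n.
n_per_group = math.ceil(2 * (z / half_width) ** 2)   # about 769 per group
print(f"n per group for a 95% CI half-width of {half_width}: {n_per_group}")
# With this precision, an estimate near zero rules out effects as large as the MESI,
# which is a conclusion that small samples can never license.
```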

4.4. Increase the Base Rate

Increasing the base rate means testing more true hypotheses. Of course, researchers do not know a priori which hypotheses are true or not. Otherwise, the study would not be necessary (actually, many studies in psychology test hypotheses where the null hypothesis is false a priori, but that is a different issue). However, hypotheses can be more or less likely to be true based on existing knowledge. For example, exercise is likely to reduce weight, but counting backwards from 100 to 1 every morning is not likely to reduce weight. Many psychological studies are at least presented as tests of theoretically derived hypotheses. The better the theory, the more often a hypothesis is true and the more often a properly powered study will produce a true positive result. Thus, theoretical progress should increase the percentage of true hypotheses that are tested. Moreover, good theories would even make quantitative predictions about effect sizes that can be used to plan sample sizes (see previous section).

Yet, Miller and Ulrich conclude that “researchers have little direct control over their base rates” (p. 698). This statement is not only inconsistent with the role of theory in the scientific process, it is also inconsistent with the nearly 100% success rate in published articles that always show the predicted results, if only because the prediction was made after the results were observed rather than derived from an a priori theory (Kerr, 1998).

In conclusion, Miller and Ulrich’s review of recommendations is abysmal. It only serves to exonerate psychologists from the justified accusation that they are playing a game that looks like science but is not science: researchers are rewarded for publishing significant results that fail to provide evidence for hypotheses, because even false hypotheses produce significant results with the shady practices that psychologists use.

5. OBJECTIONS TO PROPOSED CHANGES

Miller and Ulrich start this section with the statement that “although the above suggestions for reducing FPs all seem sensible, there are several reasonable objections to them” (p. 698). Remember that one of the proposed changes was to curb the use of shady practices. According to Miller and Ulrich, there is a reasonable objection to this recommendation. However, what would be a reasonable objection to the request that researchers should publish all of their data, even data that do not support their cherished theory? Every undergraduate student immediately recognizes that selective reporting of results undermines the essential purpose of science. Yet, Miller and Ulrich want readers to believe that there are reasonable objections to everything.

“Although to our knowledge there have been no published objections to the idea that QRPs should be eliminated to hold the actual Type 1 error rate at the nominal α level, even this suggestion comes with a potential cost. QRPs increase power by providing multiple opportunities to reject false null hypotheses as well as true ones” (p. 699).

Apparently, academic integrity only applies to students, but not to their professors when they go into the lab. Dropping participants; removing conditions, dependent variables, or entire studies; or presenting exploratory results as if they were predicted a priori are all okay, because these practices can help to produce a significant result even when the nil-hypothesis is false (i.e., there is an effect).

This absurd objection has several flaws. First, it is based on the old and outdated assumption that the only goal of studies is to decide whether there is an effect or not. However, even Miller and Ulrich earlier acknowledged that effect sizes are important. Sometimes effect sizes are too small to be practically important. What they do not tell their readers is that shady practices produce significant results by inflating effect sizes, which can lead to the false impression that the true effect size is large when it is actually tiny. For example, the effect size of an intervention to reduce implicit bias on the Implicit Association Test was d = .8 in a sample of 30 participants and shrank to d = .08 in a sample of 3,000 participants (cf. Schimmack, 2012). What looked like a promising intervention when shady practices were used turned out to be a negligible effect in an honest attempt to investigate the effect size.

The other problem is, of course, that shady practices can produce significant results when a hypothesis is true and when a hypothesis is false. If all studies are statistically significant, statistical significance no longer distinguishes between true and false hypotheses (Sterling, 1959). It is therefore absurd to suggest that shady practices can be beneficial because they can produce true positive results. The problem with shady practices is the same as the problem with a liar: liars sometimes say something true and sometimes they lie, but you don’t know when they are being honest and when they are lying.

9. CONCLUSIONS

The conclusion merely solidifies Miller and Ulrich’s main point that there are no simple recommendations to improve psychological science. Even the value of replications can be debated.

“In a research scenario with a 20% base rate of small effects (i.e., d = 0.2), for example, a researcher would have the choice between either running a certain number of large studies with α = 0.005 and 80% power, obtaining results that are 97.5% replicable, or running six times as many small studies with α = 0.05 and 40% power, obtaining results that are 67% replicable. It is debatable whether choosing the option producing higher replicability would necessarily result in the fastest scientific progress.”
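For readers who wonder where these numbers come from: they can be reproduced if “replicable” is read as the positive predictive value, that is, the share of significant results that reflect true effects (this reading is my reconstruction, not something the quote spells out).

```python
# Reproducing the quoted 97.5% and 67% as positive predictive values.
def ppv(base_rate, alpha, power):
    """Share of significant results that are true positives."""
    true_positives = base_rate * power
    false_positives = (1 - base_rate) * alpha
    return true_positives / (true_positives + false_positives)

print(f"Large studies, alpha = .005, 80% power: {ppv(0.20, 0.005, 0.80):.1%}")  # ~97.6%
print(f"Small studies, alpha = .05,  40% power: {ppv(0.20, 0.05, 0.40):.1%}")   # ~66.7%
```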

Fortunately, we have a real example of scientific progress to counter Miller and Ulrich’s claim that fast science leads to faster scientific progress. The lesson comes from molecular genetics research. When it became possible to measure variability in the human genome, researchers were quick to link variations in one specific gene to variation in phenotypes. This candidate gene research produced many significant results. However, unlike psychology journals, journals in this area of research also published replication failures, and it became clear that discoveries could often not be replicated. This entire approach has been replaced by collaborative projects that rely on very large data sets and many genetic predictors to find relationships. Most important, they reduced the criterion for significance from .05 to .00000005 (5 × 10⁻⁸) to increase the ratio of true positives to false positives. The need for large samples slows down this research, but at least this approach has produced some solid findings.

In conclusion, Miller and Ulrich pretend to engage in a scientific investigation of scientific practices and a reasonable discussion of their advantages and disadvantages. However, in reality they are gaslighting their readers and fail to point out a simple truth about science. Science is built on trust, and trust requires honest and trustworthy behavior. The replication crisis in psychology has revealed that psychological science is not trustworthy because researchers use shady practices to support their cherished theories. While they pretend to subject their theories to empirical tests, the tests are a sham, rigged in their favor. The researchers always win because they have control over the results that are published. As long as these shady practices persist, psychology is not a science. Miller and Ulrich disguise this fact in a seemingly scientific discussion of trade-offs, but there is no trade-off between honesty and lying in science. Only scientists who report all of their data and analysis decisions can be trusted. This may seem obvious to most consumers of science, but apparently it is not. Psychological scientists who are fed up with the dishonest reporting of results in psychology journals created the term Open Science to call for transparent reporting and open sharing of data, but these aspects of science are integral to the scientific method. There is no such thing as closed science, where researchers go into their lab and then present a gold nugget and claim to have created it there. Without open and transparent sharing of the method, nobody should believe them. The same is true for contemporary psychology. Given the widespread use of shady practices, it is necessary to be skeptical and to demand evidence that shady practices were not used.

It is also important to question the claims of meta-psychologists. Do you really think it is ok to use shady practices because they can produce significant results when the nil-hypothesis is false? This is what Miller and Ulrich want you to believe. If you see a problem with this claim, you may wonder what other claims are questionable and not in the best interest of science and consumers of psychological research. In my opinion, there is no trade-off between honest and dishonest reporting of results. One is science, the other is pseudo-science. But hey, that is just my opinion and the way the real sciences work. Maybe psychological science is special.
