All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Why you should not trust IAT researchers

An outdated idealistic concept of science is that scientists are trying hard to test their theories in empirical studies and revise false theories when studies do not confirm their predictions. In reality, scientists are human and act in accordance with social psychologists’ description of human information processing.

“Instead of a naïve scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (Fiske & Taylor, 1984).

So, a logical conclusion is that IAT researchers are charlatans because they are humans and humans are charlatans. More direct evidence for their untrustworthiness can be found in their publications (Schimmack, 2021). IAT researchers continue to conflate performance on an Implicit Association Test with measurement of implicit biases, although the wider community has rejected this view (Sherman & Klein, 2020). Even Greenwald and Banaji (2017) have walked back the original claim that the IAT probes implicit attitudes.

IAT researchers also continue to ignore valid criticism of their work. I feel compelled to write this blog post to highlight this blatant disregard of scientific criticism in the service of promoting a questionable computer task as an important tool to fight racism. A key claim is that changing scores on the race IAT is an important goal because these changes reflect changes in people’s attitudes that influence their behavior. This claim legitimizes simple and quick online studies that can be run with large samples, but may have few practical consequences for the understanding of race relations and intergroup behavior.

The latest propaganda piece by IAT researchers is Kurdi, Sanchez, Dasgupta, and Banaji’s (2023) article “(When) Do Counterattitudinal Exemplars Shift Implicit Racial Evaluations? Replications and Extensions of Dasgupta and Greenwald (2001).”

I was actually a reviewer of a manuscript of this paper and made several critical comments that the authors blissfully ignored. Instead, they provide a misleading description of the history of studies that aim to change scores on the race IAT and omit relevant articles that do not fit their uplifting message that IAT research is thriving and making theoretical progress in the understanding of racism.

The Bullshit Story

The authors start with a highly cited article by Dasgupta and Greenwald (2001) that suggested that showing some counter-attitudinal exemplars can change IAT scores and that these changes even last for a day.

They claim that “the article by Dasgupta and Greenwald (2001) set into motion what was soon to become a fundamental shift in our understanding of the nature of racial attitudes and, specifically, implicit racial evaluations.”

They claim that over the past 20 years, “a firm theoretical understanding has emerged that implicit evaluations, including implicit racial evaluations, can exhibit sizable temporary shifts toward neutrality in response to a wide range of interventions”

They cite Lai et al. (2014), who demonstrated that “implicit racial evaluations [a.k.a., IAT scores] were found to shift in response to a broad range of experimental manipulations.”

The authors note that it hardly seems necessary to conduct a replication study of Dasgupta and Greenwald’s study if there is robust evidence that IAT scores can be moved around. Their argument for doing so is based on the fact that the original article has been cited over 1,800 times on Google Scholar. However, a better reason is that Dasgupta and Greenwald’s study was the target of a large replication attempt well before large replication studies became fashionable in social psychology.

“Second, an independent replication today is timely given that a previous replication attempt published by Joy-Gaba and Nosek (2010) over a decade ago replicated the Dasgupta and Greenwald (2001) result with large samples but with considerably smaller effect sizes (Cohen’s ds = 0.17 and 0.14).” [The authors do not mention that this set of replication studies had over N = 3,000 participants compared to the N < 50 in the original study.]

The authors then spend a lot of time suggesting that population effect sizes may have decreased over time due to societal changes. They leave out the well-known fact that large effect sizes in small samples are often followed by small effect sizes in large samples because researchers with small samples require luck or flair to get significance with low statistical power.

Study 1 is a simple online study with N = 1,533 participants. Despite the large sample size, the study failed to replicate Dasgupta and Greenwald’s results and produced an effect size of d = 0.02. In other words, nada. Nothing to see here, a result very much consistent with Joy-Gaba and Nosek’s (2010) results in large online studies a decade earlier. 

Study 2 was also a bust, d = -0.04, and provided further evidence that effect sizes in Dasgupta and Greenwald’s underpowered study were inflated and that the true effect size is much smaller or zero (Joy-Gaba & Nosek, 2010).

Studies 3 and 4 once more confirmed that Dasgupta and Greenwald’s results could not be replicated, d = 0.15.

Although these replication failures are largely consistent with Joy-Gaba and Nosek’s results, the authors ignore this consistency.

“The failure to replicate the shifts in implicit racial evaluations observed by Dasgupta and Greenwald (2001) is puzzling for several reasons, the primary one being that other procedures, far less potent, have been shown to create malleability in implicit evaluations.” 

Studies 5-7 use different procedures to shift IAT scores. Although these results are interesting on their own, they do not explain why Dasgupta and Greenwald’s (2001) results could not be replicated.

In the General Discussion, the authors discuss the replication failures.

“In three high-powered (total N > 1,800) and close-to-exact replications, we failed to obtain the effect originally reported by Dasgupta and Greenwald (2001). That is, we found no reduction in pro-White/anti-Black implicit evaluations after exposure to positive Black and negative White exemplars (Experiments 1–3). Given the substantial amount of time that has elapsed since the original results were published, we can only make informed guesses about the reasons for the lack of replication.”

“At a first approximation, it is conceivable that the original result was a false positive, in which case one should expect replication attempts to yield null results. Contrary to this possibility, some of the experiments conducted as part of the only known previous independent replication attempt by Joy-Gaba and Nosek (2010) produced statistically significant results.”

The shift in focus to statistical significance is problematic because the effect sizes in Joy-Gaba and Nosek’s studies were small and much closer to the zero effect sizes in this study than to the large effect sizes in the original study. Consistent with this finding, Joy-Gaba and Nosek’s article was titled “The surprisingly limited malleability of implicit racial evaluations.” Kurdi et al.’s results only show that malleability is even more limited than in Joy-Gaba and Nosek’s studies, but the fact remains that Dasgupta and Greenwald’s results do not provide a solid empirical foundation for interventions that aim to reduce racial biases.

The authors then try to sell the hypothesis that Dasgupta and Greenwald’s study in a small sample miraculously produced a precise estimate of the population effect size and that population effect sizes have really decreased over time.

“As such, we believe that it is more likely that the effect originally obtained in the late 1990s decreased in size over time, both between 2001 and 2010 and between 2010 and 2023.”

If you believe this, you probably also believe in Santa Claus and the Immaculate Conception.

They then suggest that the difference between lab and online studies could contribute to the different findings. They may not have read Joy-Gaba and Nosek’s article or simply failed to mention the fact that Joy-Gaba and Nosek tested this hypothesis.

“Experiment 3 was a direct replication of Experiments 2a and 2b across different settings (Internet or laboratory) and samples (undergraduate participant pool or heterogeneous volunteers). Students in the participant pool at the University of Virginia completed the study either online or in the laboratory….As shown in Figure 2, a 2 (Condition) × 3 (Sample) ANOVA revealed no main effect of Condition, F(1, 1178) = .76, p = .74, d = .05, and no interaction between Condition and Sample, F(2, 1177) = .04, p = .96, d = .01.” (Joy-Gaba & Nosek, 2010, Study 3).

To put a cherry on top, the authors totally ignore that one of Dasgupta and Greenwald’s amazing findings was that the effect of the manipulation seemed to last a full day.

“Results revealed that exposure to admired Black and disliked White exemplars significantly weakened automatic pro-White attitudes for 24 hr beyond the treatment” (Dasgupta & Greenwald, 2001, abstract).

The authors did not even attempt to replicate this important finding. They also fail to mention that Lai et al. (2016) found that none of the manipulations that had immediate effects on IAT scores produced changes several days later. Nor did they examine whether their successful manipulations in Studies 5-7 produced lasting effects.

In sum, this article makes no scientific contribution to the understanding of racism and ways to reduce it. Instead, it is a glaring piece of evidence for why you shouldn’t trust IAT researchers. Of course, whether you trust them or me is up to you. It is a free world and there are no ethical guidelines that regulate publications. There is the illusion that peer-review corrects mistakes, but authors can get away with bullshit if editors let them.

In conclusion, producing lasting changes on IAT scores is hard and there is no solid evidence that it is possible. It is also not clear why this would be important because scores on the race IAT are messy measures of consciously accessible racial biases that do not predict behavior (Schimmack, 2021). The notion of implicit bias is unscientific, lacks empirical support, and implicit bias training has been ineffective or even harmful. It is time to fund research that studies real behaviors of discrimination and to stop wasting time on reaction times in online studies (Baumeister, Vohs, & Funder, 2007).

Meta-Science vs. Meta-Physics: How many false discoveries are there?

The goal of empirical sciences is to rely on observations to test theories. As it is typically not possible to observe all relevant phenomena, empirical sciences often rely on samples to make claims that are assumed to generalize to all observations (i.e., the population). The generalization of results from samples to populations works pretty well when the phenomenon is clear (e.g., it is easier to see things during the day than at night, older people are more likely to die than younger people, people who have sex are more likely to have children, etc., etc.). It becomes more difficult to draw the correct conclusions about populations when samples are small and the relationship between two variables is less obvious (e.g., does eating more vegetables lead to a longer life, does a subliminal presentation of a coke bottle make people drink more coke, does solving crosswords reduce the risk of dementia, etc., etc.). When empirical data are used to answer non-obvious questions, it is possible that a single study will produce the wrong answer. There are many names for these false answers, such as Type 1 errors (Neyman & Pearson, 1928), False Positives (Simmons, Nelson, & Simonsohn, 2011), False Relationships (Ioannidis, 2005), or False Discoveries (Soric, 1989). I will use the term False Discoveries because it is used most commonly in the literature that discusses the risk of false discoveries (Bartos & Schimmack, 2022).

Whether a study produced a false discovery depends on many factors. Most important, it depends on the specification of the hypotheses that are being tested. For example, a study might find that exercising more than 1 hour a week extends life expectancy by 216 days. The researchers conclude that the population effect size exactly matches their estimate in the sample. This claim is unlikely to be true because sampling error will produce different estimates of the population effect size. Thus, the risk that the discovery is false is very high and practically 100%. It is well known among statisticians that point-estimates of point-hypotheses are virtually always false. The solution to this problem is to propose a range of values. The wider the range of values, the more likely it is that the discovery is true. For example, based on their finding of a difference of 216 days, researchers might simply claim that exercise has a positive effect on life-expectancy without saying anything about the magnitude of the effect. It could be 1 day or several years. This conclusion is more likely to be true than the conclusion that the effect is exactly 216 days. The problem with this conclusion is that it does not tell us how big the benefits are. Few people might push themselves to exercise regularly, if the benefit is 1 day. More people might be willing to do so, if the effect is 1 year or more. While precise estimates of effect sizes are desirable, it is often not possible to get more precise estimates because sampling error is too large.

Continuing with the example of exercise, a difference of 216 days is a very small difference compared to the large variability in mortality. Assume that the average life-expectancy is 80 years and that the standard deviation is 10 years. Accordingly, 95% of people die within an age range of 60 to 100 (the real distribution is skewed, but that is not relevant for this example). Compared to the natural variability in life expectancy, 216 days is a very small difference. 216 days are 0.59 years, and 0.59 years are about 0.06 standard deviations. Effects of this magnitude are statistically small, and it would require large samples to provide evidence that there is a positive effect of exercise. A sample size of N ~ 4,500 participants is needed to reduce sampling error to .03, which yields a statistically significant result with the standard criterion for statistical significance, z = .06/.03 = 2.00, p = .046.
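
To make the arithmetic concrete, here is a small R sketch of the back-of-the-envelope calculation above. It assumes the two-group design and the 10-year standard deviation stated in the example; the numbers are rounded.

# Standardized effect size and required sample size for the exercise example.
diff_days <- 216
d  <- round((diff_days / 365) / 10, 2)  # ~0.06 standard deviations (SD = 10 years)
se <- 0.03                              # targeted sampling error of d
N  <- ceiling((2 / se)^2)               # SE(d) is roughly 2/sqrt(N), so N ~ 4,445
z  <- d / se                            # 0.06 / 0.03 = 2.00
p  <- 2 * pnorm(-z)                     # ~.046, just below the .05 criterion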

In conclusion, while we would like to know the direction and magnitude of effects in populations, empirical studies are often unable to provide this information. Therefore, scientists settle for answers that their studies can provide. At a minimum, scientists try to determine the sign of a relationship. For example, researchers have studied whether having children INCREASES or DECREASES happiness, or whether there are sex differences in hundreds of traits. While the magnitude of differences is important, the first question is often whether there is a relationship between two variables at all, and the most common conclusion drawn in empirical articles is that there is a positive or negative relationship. Thus, the most common false discoveries are claims where the results of a study lead to the conclusion of a positive relationship when the population relationship is not positive, or to the conclusion of a negative relationship when the population relationship is not negative.

There is a lot of confusion among statisticians and users of statistics about the hypotheses that are being tested in empirical studies. The confusion arises from the fact that most studies test two hypotheses simultaneously with a single test (Hodges & Lehmann, 1954). Take having children and happiness as an example. There are many reasons why children could increase or decrease happiness, and the overall effect in the population is unclear. To test the hypothesis that children increase happiness, we can test the hypothesis that the difference (happiness of people with children minus happiness of people without children) is positive, d > 0. To test the hypothesis that children make people less happy, we can test the hypothesis that the difference score is negative, d < 0. To test both hypotheses simultaneously, we can test the hypothesis that there is no difference, d = 0. A statistically significant result allows us to reject the d = 0 hypothesis in favor of d > 0 if we observe a positive difference, and in favor of d < 0 if we observe a negative difference.

It is a misunderstanding that rejecting the hypothesis d = 0 only tells us that there is an effect and does not tell us anything about the direction of the effect. When we accept d > 0, we are not only rejecting d = 0, but also rejecting d < 0. And when we accept d < 0, we are not only rejecting d = 0, but also d > 0. 
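
One way to see this is to run both one-sided tests and the two-sided test on the same data. The sketch below uses simulated happiness ratings (made-up numbers chosen only for illustration); the one-sided p-value in the direction of the observed difference is exactly half of the two-sided p-value, so a significant two-sided result always licenses a directional conclusion.

set.seed(123)
parents    <- rnorm(200, mean = 7.2, sd = 2)   # hypothetical happiness ratings
nonparents <- rnorm(200, mean = 7.0, sd = 2)

two_sided <- t.test(parents, nonparents, alternative = "two.sided")$p.value
d_gt_0    <- t.test(parents, nonparents, alternative = "greater")$p.value  # test of d > 0
d_lt_0    <- t.test(parents, nonparents, alternative = "less")$p.value     # test of d < 0

# The one-sided test in the direction of the observed difference has
# p = two_sided / 2; the test in the opposite direction has p = 1 - two_sided / 2.
c(two_sided, d_gt_0, d_lt_0)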

Any decision in favor of a hypothesis implies that there is a risk that the decision was wrong. We may conclude that children increase happiness, when in fact there is no effect or a negative effect or we may conclude that children decrease happiness when having children has no effect on happiness or increases happiness.

In conclusion, empirical studies aim to provide information about populations based on information in samples. Sampling error can distort the results in samples and lead to false conclusions. Conclusions about the magnitude of effects require large samples. Empirical studies, especially of new questions, often settle for conclusions about the direction of an effect. These conclusions can be false when a significant result in a sample suggests a relationship in one direction, but there is no relationship or a relationship in the opposite direction in the population.

What is statistical significance?

Without going into the details of significance testing, it is necessary to point out that scientists have control over the risk of a false discovery. As noted above, it is riskier to make precise predictions about an effect size (e.g., exercise increases life-expectancy by 100 to 300 days), than to draw conclusions about the direction of an effect (exercise increases longevity by 1 day or more).

Another way to control the risk of a false discovery is to reduce sampling error. It is sometimes argued that scientists could also increase effect sizes, but that is not always possible. How would we increase the effect size of having children on happiness? The easiest and sometimes only way to decrease sampling error is to increase sample sizes. As noted above, 4,500 people are needed to get a sampling error of .03 and produce a statistically significant result when the effect size is .06 standard deviations and we require a p-value below .05 to conclude that exercise has a positive effect on longevity. What does this 5% criterion for statistical significance mean? It means that if we repeat the study over and over again with 4,500 new people, we would get p-values below .05 no more than 5% of the time when exercise has no effect. When the effect size is exactly 0, we would expect half of the significant results to be negative (and give us the false information that exercise decreases life-expectancy) and half of the results to be positive (false discoveries of a positive effect in the population).
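
A small simulation illustrates what the 5% criterion means in the long run when the true effect is exactly zero (a sketch with the N = 4,500 from the example above; the number of simulated studies is arbitrary).

set.seed(1)
n_studies <- 10000
N  <- 4500
se <- 2 / sqrt(N)                            # sampling error of d, ~0.03
d_obs <- rnorm(n_studies, mean = 0, sd = se) # true effect is exactly zero
z   <- d_obs / se
sig <- abs(z) > qnorm(.975)
mean(sig)            # ~5% significant results overall
mean(sig & z > 0)    # ~2.5% falsely suggest a positive effect
mean(sig & z < 0)    # ~2.5% falsely suggest a negative effect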

Some statisticians have argued that the 5% criterion is too liberal these days, when so many scientists are testing many hypotheses and have computers to run many statistical tests. This might lead to many false discoveries. A simple solution to this problem is to lower the criterion value, which is done in some fields. For example, particle physics uses p < 0.0000006 as a criterion to avoid false discoveries (e.g., the discovery of the Higgs boson). Molecular geneticists also use a similarly low criterion when they search for genetic variants that predict risks for depression or other mental illnesses. However, there is a cost to lowering the risk of a false discovery. A lower criterion value requires even larger samples to produce significant results. For example, N ~ 9,000 participants would be needed for an effect size of d = 0.06 to reach significance with the criterion of p < .005 that some statisticians have suggested (Benjamin et al., 2017).
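
Using the same normal approximation as before, the required sample size can be written as a function of the significance criterion (a rough sketch of my own; exact numbers depend on the design and on whether a z- or t-test is used).

required_N <- function(d, alpha) {
  z_crit <- qnorm(1 - alpha / 2)   # critical value of the two-sided test
  ceiling((2 * z_crit / d)^2)      # SE(d) ~ 2/sqrt(N); solve d/SE = z_crit
}
required_N(d = .06, alpha = .05)   # ~4,300 (about 4,500 with the rounding used above)
required_N(d = .06, alpha = .005)  # ~8,800, roughly the 9,000 mentioned above
required_N(d = .06, alpha = 6e-7)  # ~28,000 with the particle-physics criterion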

In sum, scientists can control the risk of false discoveries by demanding more or less empirical evidence for a claim. While demanding stronger evidence reduces the risk of false discoveries, it also limits the ability to make true discoveries with limited resources when effect sizes are small. Ideally, researchers would set their criterion for statistical evidence to maintain a low risk of false discoveries while allowing for as many true discoveries as possible. To do so, they need information about the risk of false discoveries.

How Many False Discoveries Do Scientists Make?

As explained before, determining whether a discovery is true or false requires a comparison of the results in a sample to the effect in the population, and population effects are by definition not observable. Not surprisingly, discussions of false positive rates resemble debates among Medieval philosophers about how many angels can dance on the head of a pin.

A popular claim is that the nil-hypothesis is practically never true because effect sizes are never exactly zero. Thus, rejecting the hypothesis that the effect size is exactly zero is never wrong and there are no false discoveries. This argument ignores that researchers do not just reject the null-hypothesis, but draw inferences about the sign of the effect. Thus, sign errors are possible. Moreover, many effect sizes can be so small that they are practically zero. For example, if exercise extends life expectancy by 1 day, the standardized effect size is about 0.0003 standard deviations (SD = 10 years). While this value is not zero, it is practically impossible to demonstrate that it is different from zero. Thus, the argument that false discoveries never occur is logically invalid.

The argument also sounds implausible in the face of concerns that many published results might be false discoveries because researchers use statistical tricks to produce significant results. Simmons et al. (2011) showed with simulations that these tricks could produce false discoveries in over 50% of studies. A large number of replication failures in some scientific disciplines (e.g., experimental social psychology) has also led to widespread concerns that false discoveries are common (Open Science Collaboration, 2015). Finally, Ioannidis (2005) speculated that some areas of medicine have false discovery rates over 50% because they test thousands of false hypotheses.

I call these speculations metaphysical because they rely on unproven assumptions, and the wide variation in conclusions can be traced to the variation in assumptions. These speculations are not very helpful for applied researchers who want to reduce the risk of false discoveries while still being able to make discoveries with limited resources.

It is not possible to estimate the false discovery risk for a single study, but it is possible to do so for sets of studies from the same field. There have been numerous attempts to estimate the false discovery risk empirically, but simulation studies have shown problems with the underlying models. The main problem with these models is that they try to estimate the actual rate of false discoveries. This is impossible because small effect sizes cannot be distinguished from each other, especially in small samples. Nobody knows whether the population effect size is d = -.05, 0, or .05. Once more, metaphysical assumptions are needed to make these models work, but the assumptions influence the conclusions. To avoid this problem, Bartos and Schimmack (2022) relied on Soric’s (1989) approach to estimate the maximum false discovery rate. The maximum false discovery rate is a function of the discovery rate, and the discovery rate is the percentage of significant results in a set of studies.
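
Soric’s bound is simple enough to write as a one-line function (my own restatement in R, not code from the cited articles):

# Maximum false discovery rate implied by a discovery rate (Soric, 1989),
# under the worst-case assumption that true hypotheses are tested with 100% power.
soric_fdr <- function(discovery_rate, alpha = .05) {
  (1 / discovery_rate - 1) * alpha / (1 - alpha)
}
soric_fdr(.44)   # ~0.07: the 7% maximum reported below for the Cochrane data
soric_fdr(.29)   # ~0.13: roughly the 14% risk for clinical trials discussed later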

Schimmack and Bartos (2023) applied their model to abstracts of medical journals that reported the results of clinical trials. They found that 70% of abstracts reported a significant result. However, they also found evidence of selection bias. After correcting for selection bias, they estimated a discovery rate of 30% and a false discovery risk of 14%.

Another dataset that can be used to examine false discoveries in clinical trials is Cochrane Reviews. Cochrane reviews are meta-analyses of clinical trials. There are many types of clinical trials, and some meta-analyses are based on fewer than 10 studies. However, there are sufficient Cochrane Reviews that have 10 or more studies and examine the effectiveness of a treatment against a placebo (194 reviews with 4,394 studies). All studies were scored so that positive values indicate the same sign as the meta-analytic z-score and negative values indicate a sign error. There were 14.5% sign errors, but only 0.75% significant sign errors (z < -1.96). Thus, we have a very low estimate of the false discovery rate in clinical trials with a placebo control condition.

Figure 1 shows a histogram of the z-scores and the fit of the z-curve model to the data. Positive and negative values are shown in the figure, but only positive values are used to fit the model.

The model fits the positive values well, but it underestimates sign errors. The most plausible explanation for this is that there is heterogeneity in effect sizes. However, the difference is relatively small. Due to the larger number of sign reversals, the Observed Discovery Rate (39%; i.e., the percentage of significant results in the right direction) is lower than the Expected Discovery Rate (44%). Thus, there is no evidence of selection bias that produces more significant results than predicted by the model.

According to Soric’s formula, an EDR of 44% implies a maximum false discovery rate of 7%. This is much higher than the observed false discovery rate of 0.75%. The reason is that Soric’s formula assumes that true hypotheses are tested with 100% power. If average power is lower, the false discovery risk decreases. It is possible to obtain a better estimate of the false discovery risk by taking the expected replication rate into account. The expected replication rate is 62%; it is a weighted average of false discoveries, which produce significant results in the same (positive) direction with a probability of only 2.5%, and true discoveries, which replicate with a probability equal to their power. Assuming that 7% of significant results are false discoveries implies an average power to detect true positive effects of 67%. With this power estimate, the false discovery risk decreases to 4.5%. This is still considerably higher than the actual false discovery rate of 0.75%. Thus, z-curve estimates of the false discovery risk are conservative and likely to overestimate the actual rate of false discoveries.

Z-curve also makes it possible to estimate the proportion of true and false hypotheses that were tested. Once more, these estimates rely on the observed discovery rate and assumptions about the average power of studies that test true hypotheses. Figure 2 shows the percentage of true and false hypotheses for different levels of power. With 80% power, the ratio of true and false hypotheses is 1:1. With the estimated power of 67%, the ratio is about 3:2. Thus, there is no evidence that clinical trials test many more false than true hypotheses.

Conclusion

Concerns about high false discovery rates in science have produced many articles that ask for a new statistical approach to analyze data. Many of these articles are based on a misguided interpretation of significance testing and false assumptions about the number of false hypotheses that are being tested.

I showed that significance testing can be used to draw inferences about the direction of an effect in the population from a significant result in a sample. These directional inferences can be justified by interpreting the two-sided test of the null-hypothesis as two one-sided tests of the null-hypotheses that the effect is less than or equal to zero (<= 0) or greater than or equal to zero (>= 0). While significance criteria control the rate of false discoveries in all tests, it remains unclear how many of the significant results are false discoveries. However, it is possible to estimate the false positive risk in sets of studies based on the percentage of significant results in all tests (i.e., the discovery rate). When no publication bias is present, the discovery rate can be directly observed. When selection bias is present, the discovery rate can be estimated using a model that takes selection bias into account (Bartos & Schimmack, 2022). I demonstrated that this approach produces conservative estimates of the false discovery risk and that the false discovery risk in clinical trials that test effectiveness against a placebo is very low. Thus, concerns about the credibility of clinical trials are based on wild speculations that are, ironically, not backed up by evidence. However, these results cannot be generalized to other disciplines. Claims about false discoveries in these areas need to be supported by empirical evidence, and z-curve provides a valuable tool to provide this information, especially when selection bias is severe (e.g., experimental social psychology; Schimmack, 2020).

A Comparison of False Discovery Rates and Type-S Error Rates

Applied researchers who have substantive research questions (e.g., is there a wage gap for men and women) rely on statistical methods that were invented over 100 years ago. This statistical approach is known as Null-Hypothesis Significance Testing (NHST), and every year thousands of new students in the social sciences are introduced to NHST to understand published articles and to conduct their own research. Thus, NHST is as fundamental to quantitative research as microscopes are for biologists and telescopes are for astronomers. However, unlike builders of microscopes and telescopes, many statisticians believe that NHST is flawed and needs to be replaced, ideally with their own statistical approach. After all, statisticians are human and are influenced by the same incentives that motivate other scientists. The holy grail is to replace NHST and to become the father (most of these driven statisticians are male) of the new statistics.

The religious fervor is especially notable among statisticians who call themselves Bayesians. Like atheists, who only have the belief that there is no God in common, Bayesians have only their hatred of NHST in common, which explains why they have failed to provide a coherent alternative to NHST. A prominent example of the religious zeal of Bayesians is Andrew Gelman’s blog. For example, a recent blog post was titled “Bayesians moving from defense to offense.”

Criticism of NHST is as old as NHST itself and is often cited. However, there have also been articles in support of NHST that are less well known and often neglected, especially by Bayesian critics of NHST. One article that made a lot of sense to me was Tukey’s (1991) defense of NHST that I featured on my blog (Schimmack, 2019).

The article makes it clear that there is a fundamental misunderstanding of the hypotheses that are typically tested in NHST. Take our example of the wage gap. NHST would test this question by postulating the null-hypothesis, which is typically the assumption that there is no wage gap. This hypothesis is typically a strawman, and nobody believes it to be true. The only reason to postulate the null-hypothesis that there is no wage gap is to use empirical data to falsify or reject it when the data provide sufficient evidence that the hypothesis is false. This happens when the observed difference in wages is about twice the size of the sampling error, where sampling error reflects the amount of variability in the wage differences that are obtained in a random sample. This estimate of the real wage gap in the population will vary from sample to sample. However, sampling error alone is unlikely to produce differences that are twice the size of the sampling error. In fact, we can state that a wage gap that is twice the sampling error or more will occur in only 5% of all attempts if there is no wage gap in the population. And when this happens, p-values below .05 allow researchers to reject the null-hypothesis that the wage gap is zero and conclude that there is a wage gap.

This version of NHST can be easily criticized. For starters, what is the point of collecting data just to reject a hypothesis that nobody believed anyway? The best defense of this practice would be that we still need empirical data to be sure. After all, there is a 1/1000000 probability that the null-hypothesis might be true. Fair enough, but that only leads to the next problem. We really do not care about the conclusion that there is a wage gap because the immediate next question is about the direction of the difference. Do men earn more than women or do women earn more than men? Strictly speaking, the rejection of the null-hypothesis that wages are exactly the same does not allow us to make claims about the direction of the effect. Does that mean we need to do another study to test this hypothesis, and if so, how would we analyze the data if NHST doesn’t allow us to draw inferences about the direction of an effect?

To use NHST to draw inferences about the direction of an effect, we need to test a directional hypothesis. In our example, we might want to specify H0 as the presumably false hypothesis that women earn more than or the same amount as men and use the statistical significance criterion to reject this hypothesis when our sample shows that men earn more than women and the difference is significant, p < .05. When we test this directional hypothesis, we only need an effect size that is 1.65 times larger than the sampling error to reject the null-hypothesis that women earn at least as much as men. The use of NHST with directional hypotheses is known as one-tailed or one-sided testing.
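
The two cut-offs mentioned here (about twice the sampling error for the two-sided test and 1.65 times the sampling error for the one-sided test) are simply critical values of the standard normal distribution, which can be checked in R:

qnorm(.95)         # ~1.65: critical value of a one-sided test with alpha = .05
qnorm(.975)        # ~1.96: critical value of a two-sided test with alpha = .05
# The two-sided cutoff is just the one-sided cutoff with alpha cut in half:
qnorm(1 - .05/2) == qnorm(.975)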

With one-sided tests, p < .05 quantifies the risk of drawing a false conclusion about the sign of an effect. That is, the results of a study might show that men earn more than women, with a p-value below .05, but in the population there is no wage gap (zero difference) or women actually earn more than men. NHST does not differentiate between an outcome where the difference is exactly zero (equal pay) and an outcome where the difference is in the opposite direction (women earn more than men). Both outcomes are errors because the study suggested that men earn more when this is not the case.

The risk of drawing a false conclusion in NHST is called the Type-1 error. The statistical test produced the wrong result because sampling error produced a very unlikely outcome (e.g., the sample included many unemployed men). The 5% criterion is a conventional criterion to ensure that no more than 5% of statistical tests show significance when the null-hypothesis is true. This would be the same as using rapid Covid tests that have a 5% probability of showing that you are positive when you are actually negative. Statisticians have also debated the 5% criterion, but NHST allows researchers to use other criterion values. Thus, for the discussion of NHST and criticisms of NHST, it is not important which Type-1 error rate we find acceptable, and I will continue to use the typical 5% value.

Now you may wonder what you can do when you are not sure about the sign of an effect (e.g., are men or women more extraverted)? It would be silly to make up a directional hypothesis, just to falsify it, and then conclude that the opposite is true. Moreover, you might find a significant result in the correct direction if you specified the null-hypothesis correctly, but you would not find significance if you picked the wrong hypothesis. That makes the procedure rather arbitrary and useless. Fortunately, there is a solution to this problem. You just do two one-sided tests. First, you try to reject the hypothesis that “men are as extraverted or more extraverted than women” and then you try to reject the hypothesis that “women are as extraverted or more extraverted than men.” If you obtain significance for one of these tests, you can infer that the alternative hypothesis is true. That is, if your sample shows that men are more extraverted than women and p < .05, you are allowed to infer that men are more extraverted than women in the population. If your sample shows that women are more extraverted than men and p < .05, you are allowed to infer that women are more extraverted than men in the population. So, you can generalize the sign of the effect in your sample to the population.

However, there is a catch. You tested two hypotheses, and the Type-I error risk increases each time you test a new hypothesis. A 5% error rate implies that, in the long run, 1 in 20 tests of a false hypothesis will produce a Type-I error. So, if you conduct two one-sided tests with the traditional 5% risk of a Type-1 error, your risk is actually 10%. Fortunately, there is a simple solution to this. You can lower the Type-I error risk of the directional tests. If you cut the risk in half, you have a 2.5% risk of making a Type-1 error in one direction and a 2.5% risk of a Type-1 error in the opposite direction, and your combined risk of making an error in either direction is 5%.

Maybe you already realized it, but conducting two one-sided tests with 2.5% Type-I error rates is identical to conducting a two-sided test with a 5% Type-I error rate. That is the essence of Tukey’s defense of NHST. While it looks as if we are testing an implausible null-hypothesis that there is no wage gap, we are really testing whether men earn more than women or women earn more than men, and we are allowed to infer from a significant result (p < .05) with higher pay for men in a sample that men earn more than women in the population under investigation. Rather than refuting a silly nil-hypothesis (Cohen, 1994), NHST is a statistical tool to draw inferences about the direction of an effect in the population from the direction of an effect in a sample.

Unfortunately, Tukey’s (1991) insight that a two-tailed test is really a convenient way to conduct two one-sided tests for significance in both directions is often ignored by critics of NHST. Gelman introduced sign errors as an alternative to dumb NHST under the assumption that NHST is only used to reject the hypothesis that an effect size is zero, but is never used to test the direction of an effect. This is a misrepresentation of the way NHST is used, especially in clinical trials. Nobody would argue that a treatment is beneficial if the p-value is below .05 but the results show more benefits in the placebo condition. The problem with Gelman’s criticism of NHST is that it is a criticism of dumb NHST and not of NHST as it is used in practice. In reality, researchers follow Tukey’s (1991) logic and use NHST to test two directional hypotheses simultaneously. Thus, Type-1 errors include sign errors. Type-1 errors occur when the sign of a significant result differs from the sign of the population effect size, including population effect sizes of zero.

The problem with Gelman’s Type-S error is that Gelman ignores the possibility that the population effect size can be zero. Gelman’s Type-S error is not defined when the population effect size is zero (personal communication, Gelman, 2024). The omission of the classic Type-1 error (rejecting the point-null or nil-hypothesis) is difficult to justify. The main justification for this assumption is that most population effect sizes are unlikely to be exactly zero. For example, the wage gap between men and women is unlikely to be less than 1/100000000 dollars. However, what about time-reversed causality and extrasensory perception (Bem, 2011)? Gelman clearly does not believe in ESP, even with effect sizes of 0.00000001 standard deviations. He is also often critical of other findings like ovulation effects on preferences for Obama. He might argue that these effect sizes are much smaller than inflated estimates in small samples, but not zero. However, why couldn’t some effects be zero or so close to zero that we don’t care about the effect size?

Assuming that there are no zero effects can easily explain why vanZwet et al. (2023) ended up with an estimate that only 2% of significant results in clinical trials are false, whereas Schimmack and Bartos (2023) estimated that up to 14% of significant results could be false. The difference might be due to the fact that estimates of the false discovery rate count significant results for a population effect size of zero as errors, whereas the Type-S error rate assumes that such errors do not exist.

Unfortunately, we cannot quantify the proportion of population effect sizes that are zero because sampling error will always result in imprecise estimates of the population effect size. A solution to this problem is rounding. At some point, we simply do not care about a difference from 0. For example, a wage gap of $0.00001 is practically the same as a wage gap of $0. Even a difference of $0.01 (1 cent) is meaningless. Rounding has obvious implications for the estimation of Type-S error rates and FDRs. Let’s say a set of 1,000 clinical trials (some would argue homeopathy trials would fit the bill) have effect sizes that are close to zero, less than 1/1000th of a standard deviation, but not exactly zero. Rounding would turn these minuscule effects into zero effects, and significant findings in either direction would be considered false positives. However, Type-S errors are cut in half because significant results with the same sign are not considered errors even if the effect size is a fraction of a cent.

In conclusion, the interpretation of NHST as two one-sided tests with alpha/2 implies that sign errors are less likely than Type-1 errors and that the percentage of sign errors among significant results (Type-S error rate) is always lower than the false discovery rate (i.e., the percentage of false positives among significant results).

The real question is whether this difference is large enough to explain the difference between vanZwet et al.’s (2023) results and other studies that estimated the FDR (Jager & Leek, 2014; Schimmack, 2023). I had to resort to simulation studies to find the answer and I am happy to share the results of this simulation (r-code on OSF).

The simulation is based on Ioannidis’s scenario for underpowered, but well-performed clinical trials with 1 true hypothesis for every 5 false hypotheses and low power (20%). To simplify the simulation, it did not include bias. The false hypotheses were simulated with a small standard deviation of population effect sizes (SD = .01). This ensures that none of the false hypotheses are strictly zero, but effect sizes are close to zero (d = -.05 to .05).

Figure 1 shows the distribution of effect sizes for a between-subject design with N = 100 (50 per group).

This simulation produces identical Type-S and FDR estimates of 24%. The reason these estimates are the same is that the conditions "<" and "<=" are equivalent when no population effect sizes are exactly zero.

Figure 2 shows the same simulation, but effect sizes are rounded to one decimal. As a result, the small effect sizes around 0 all become zero.

In this scenario, the FDR is 53% and the Type-S error rate is only 0.4%. 
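
For readers who want to check the logic, the following R sketch re-implements the simulation as described above (1 true hypothesis per 5 false ones, 20% power, false effect sizes drawn from N(0, .01), N = 100). It is my reconstruction of the described design, not the original OSF script, so the exact percentages will differ somewhat from the figures reported above.

set.seed(2024)
k   <- 200000
n   <- 50                                 # per group, N = 100
se  <- sqrt(2 / n)                        # sampling error of d, ~0.20
d20 <- (qnorm(.975) + qnorm(.20)) * se    # effect size with 20% power, ~0.22

true_h <- runif(k) < 1/6                          # 1 true per 5 false hypotheses
d_pop  <- ifelse(true_h, d20, rnorm(k, 0, .01))   # false hypotheses: tiny effects
d_obs  <- rnorm(k, d_pop, se)
sig    <- abs(d_obs / se) > qnorm(.975)

# Without rounding, no population effect is exactly zero, so every false
# discovery is a sign error: the FDR and the Type-S rate coincide (~24%).
mean(sign(d_obs[sig]) != sign(d_pop[sig]))

# Rounding to one decimal turns the tiny effects into zeros: significant results
# for these studies count as false discoveries, but not as sign errors.
d_round <- round(d_pop, 1)
mean(d_round[sig] == 0 | sign(d_obs[sig]) != sign(d_round[sig]))  # FDR: roughly half
mean(d_round[sig] != 0 & sign(d_obs[sig]) != sign(d_round[sig]))  # Type-S: well below 1%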

In sum, the criticism of dumb NHST is that it cannot produce sign errors because it only tests whether the effect size is exactly zero, which it rarely is. Second, the smart version of NHST (Tukey, 1991) tests directional hypotheses and treats both sign errors and false rejections of the point-null hypothesis as errors. Third, Gelman’s Type-S error counts only sign errors and underestimates error rates when many effect sizes are practically zero. Thus, it is a mistake to exclude zero effect sizes from the computation of false discovery rates. Type-S error rates underestimate error rates when some studies have an effect size that is practically zero.

The second simulation used vanZwet et al.’s (2023) model parameters to compute the FDR and to compare it to their estimate of a Type-S error rate of 2%. The model assumes that the distribution of effect sizes in the Cochrane data is a mixture of four normal distributions with mean 0 and standard deviations 0.61, 1.42, 2.16, and 5.64. The first three components are weighted about equally, .32, .31, and .30, and the last component is weighted less, .07.

Figure 3 shows the distribution of the implied effect sizes for N = 100.

Although the model fixes the mode of effect sizes at zero, and effect sizes close to zero are the most frequent effect sizes, the probability of an effect size of exactly zero is zero. Thus, the model assumes that errors can only be sign errors. The simulation reproduced vanZwet et al.’s estimate that the Type-S error rate is 2%. As there are no exact zero values, the FDR is also 2%.

I then repeated the simulation with effect sizes rounded to one decimal. The distribution of effect sizes is not notably different (Figure 4).

However, 17% of the effect sizes are now zero. This increases the FDR to 4%, while the Type-S error rate decreases to 0.8% because zero effect sizes do not produce sign errors in Gelman’s formula. In short, these results show that effect sizes close to zero can produce a difference between Type-S and FDR calculations, but this difference is relatively small and does not explain the difference between a Type-S error rate of 2% and an FDR of 14%.
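
The same kind of sketch can be written for this second simulation (again my reconstruction from the published mixture parameters rather than the original code; the mixture describes the distribution of the true z-values, i.e., the signal-to-noise ratios).

set.seed(2024)
k    <- 200000
sds  <- c(0.61, 1.42, 2.16, 5.64)
wts  <- c(0.32, 0.31, 0.30, 0.07)
comp <- sample(1:4, k, replace = TRUE, prob = wts)
mu   <- rnorm(k, 0, sds[comp])     # true z-value (signal-to-noise ratio)
z    <- rnorm(k, mu, 1)            # observed z-value
sig  <- abs(z) > qnorm(.975)

mean(sign(z[sig]) != sign(mu[sig]))   # Type-S rate, ~2% (no exact zeros, so FDR = Type-S)

# Implied effect sizes for N = 100 (SE of d ~ 0.2); rounding to one decimal
# sets roughly 17% of them to zero, which raises the FDR but lowers the Type-S rate.
d_pop   <- mu * 0.2
d_round <- round(d_pop, 1)
mean(d_round[sig] == 0 | sign(z[sig]) != sign(d_round[sig]))   # FDR, ~4%
mean(d_round[sig] != 0 & sign(z[sig]) != sign(d_round[sig]))   # Type-S, below 1%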

The real reason for the different results is the different model assumptions. The same density distribution can be fitted with different models that make different assumptions about the underlying components. The key difference between vanZwet et al.’s (2023) model and other models is that vanZwet et al.’s model assumes that there is no large cluster of effect sizes at zero. In contrast, other models allow for a large proportion of effect sizes to be zero. Figure 5 shows the fit of vanZwet et al.’s model and z-curve (Bartos & Schimmack, 2022) to the Cochrane data. Z-curve was slightly modified to have only four components with means of 0, 2, 4, and 6. This modification models low z-scores as a mixture of studies with Type-1 errors (z = 0) and modest power (50%), but does not allow for studies with low power (5% to 50%). This is of course an arbitrary assumption, but it is no more arbitrary than vanZwet et al.’s assumption that there are no effect sizes of zero (p(z = 0) = 0).

Both models fit these data equally well. However, the z-curve model allocated a weight of 57% to the component with a mean of zero. This implies a false discovery rate of 10%. Thus, the same data are compatible with a Type-S error rate of 2% and an FDR of 10%.

This brings up the question of which of these estimates is closer to the truth. The honest answer is that we do not know. At least, the distribution of z-scores alone does not provide this answer. I have tried for a year to find a way to estimate the true FDR, but simulation studies showed that it is just not possible to do so.

To avoid the problem of overly precise estimates that are based on unproven and untestable assumptions, Bartos and Schimmack (2022) suggested focusing on the false discovery risk. The false discovery risk is the maximum rate of false discoveries that is consistent with the data. Z-curve2.0 relies on the discovery rate to determine the false discovery risk using a formula developed by Soric (1989).

Soric’s formula relies on the fact that the maximum false discovery rate for a given discovery rate occurs when all true hypotheses are tested with 100% power. With 29% significant results and no evidence of publication bias, the assumption is that the 29% significant results are produced by testing 25% true hypotheses with 100% power (25% significant results) and 75% false hypotheses (true null-hypotheses) with a 5% probability of producing a significant result (3.75% significant results).

We realize that it is implausible to assume that true hypotheses in clinical trials are tested with 100% power. However, we do not know the average power of tests of true hypotheses. Thus, we do not know the real FDR. The benefit of estimating the false discovery risk is that we can say that there are no more than 14% false positive results. In contrast, vanZwet et al.’s (2023) estimate of 2% sign errors is only correct when we assume that there are no effect sizes that are exactly zero and that there is only a small percentage of effect sizes close to zero. Thus, the problem with this result is that it depends on assumptions that may be false, whereas the FDR estimate of 14% is a worst-case scenario. In this regard, the FDR is similar to the Type-1 error rate, which is based on the worst-case scenario that all false hypotheses have an effect size of zero. The risk of sign errors decreases the more effect sizes differ from zero. Not everybody might be happy with risk assessments that are conservative and based on worst-case scenarios, but we think that this approach is useful to ensure the credibility of scientific results.

We also agree with Goodman (2014) that it is less interesting to know the FDR with the conventional criterion of .05 to reject the null-hypothesis. A more important question is how the alpha criterion can be adjusted to ensure an acceptable maximum percentage of false discoveries. With 29% discoveries, alpha can be set to .01 to produce a false discovery risk below 5%. It is often argued that many researchers confuse alpha with the FDR and assume that alpha = .05 ensures a false discovery risk of 5% or less. By setting alpha to .01, they can actually claim that the false discovery risk is below .05. Of course, any error rate allows for errors, and a single significant result with alpha = .05 or alpha = .01 requires replication and honest reporting of replication failures.

In conclusion, statistics is needed to make sense of data, especially when data are noisy and effect sizes are small. However, statistics can only produce useful results for applied researchers if the assumptions underlying statistical models are consistent with reality, and when assumptions cannot be tested, it is important to conduct sensitivity analyses or to consider worst-case scenarios. This blog post examined why vanZwet et al. (2023) estimated that only 2% of clinical trials produce a significant result with a sign error, whereas Schimmack and Bartos (2023) found a false discovery risk of 14%. These differences were not explained by the analysis of different data sets. After correcting for selection bias, Schimmack and Bartos found a bias-corrected (expected) discovery rate that was identical to the discovery rate in vanZwet et al.’s (2023) data. Thus, both datasets implied a false discovery risk of 14%, using Soric’s (1989) formula. The distinction between sign errors (Type-S error rates) and false discovery rates also did not explain the differences. The key factor was the specification of the mixture model. vanZwet et al.’s model assumes that effect sizes follow a mixture of normal distributions centered at zero. Thus, the model does not allow for a cluster of effect sizes at zero. Other models allow for a large number of effect sizes at or close to zero, and these models can fit the data equally well. Thus, it is impossible to determine the false discovery rate in the Cochrane data. However, the discovery rate of 29% does not allow for more than 14% false discoveries with an effect size of zero or an effect size in the wrong direction.

The 14% FDR is clearly inconsistent with Ioannidis’s (2005) scenario in which clinical trials test only 1 true hypothesis for every 5 false hypotheses with only 20% power, which leads to the prediction that most significant results are false positives. The results are more consistent with Ioannidis’s scenario 1 with adequately powered clinical trials that have only a small amount of bias and test 1 true hypothesis for every false hypothesis. For this scenario, Ioannidis (2005) predicted 15% false discoveries. Thus, our results suggest that Cochrane reviews and abstracts in leading medical journals match this scenario. We hope that z-curve analyses of other types of studies can provide empirical tests of Ioannidis’s predictions for those studies. We are pleased to provide researchers interested in the credibility of science with a tool that can provide sound empirical evidence about the false discovery risk.

A Comparison of the New Look and the Old Look at Z-Values from Clinical Trials

Authors: Z-Curve Development Team

Note 1/5/24.
This is Draft2.0. It was revised after further consultation with Erik van Zwet about his approach. In this communication it became clear that Erik’s method can be fitted to absolute z-scores. So, the comparison of the two methods with positive and negative z-scores is no longer relevant. The use of abs(z) also produced better estimates with the original dataset, but continues to perform worse than z-curve when the set of studies includes more powerful studies than those in the Cochrane Review. It remains the case that z-curve performs as well as or better than vanZwet’s new approach because it does not require unrealistic assumptions about the distribution of power in sets of studies with varying effect sizes and sample sizes (Brunner & Schimmack, 2021).

Abstract

Two recent studies that extended Jager and Leek’s (2014) analysis of p-values in medical research are reviewed. A comparison of the statistical models shows that z-curve (Schimmack & Bartos, 2023) is superior to the new method proposed by vanZwet et al. (2023). Both studies show that clinical trials have low power (~30%), but also low false positive rates (~14%) that can be reduced to less than 5% by setting alpha to .01. Whereas abstracts in clinical journals show clear evidence of selection bias (70% significant results), Cochrane reviews show no evidence of selection bias (~30% significant results). The results show that clinical trials are more credible than Ioannidis (2005) suggested. In the absence of publication bias, Cochrane reviews produce unbiased estimates of population effect sizes and can be used to guide practical decisions.

Introduction

The replication crisis is a term for the assumption that empirical science is less credible than it pretends to be and that most published results may be false (Ioannidis, 2005). While there are many factors that can produce misleading evidence for scientific claims, the key factor is the use of questionable research practices that produce more statistically significant results than the power of empirical studies warrants.

A seminal study examined the replicability of social psychological experiments and found that only 25% of replication studies obtained a statistically significant result again (OSC, 2015). This low success rate undermines the credibility of original articles published in social psychology journals, which report over 90% significant results (Schimmack, 2020).

Ioannidis (2005) famously suggested that many clinical trials also inflate evidence in medicine and that at least 50% of statistically significant results are false positives. Jager and Leek (2014) provided the first empirical test of this prediction. They extracted the statistical results of clinical trials from journal abstracts and modeled the distribution of p-values with a mixture model of true and false hypotheses (false hypotheses = no effect, H0 is true). They estimated a false positive risk of 13% for results that reached significance with alpha = .05. Jager and Leek’s seminal attempt to provide empirical evidence about the credibility of evidence in medicine had relatively little impact on discussions about the replication crisis in medicine, presumably because it was harshly criticized by several prominent statisticians.

Ioannidis (2014) declared in the title of his comment that Jager and Leek’s “estimate of the science-wise false discovery rate and application to the top medical literature is false” (p. 28). To justify this conclusion, he claimed that excluding 470 p-values (of 5,792; 8%) “markedly affects the overall FDR estimates, and totally invalidates FDR estimates comparing journals and years” (p. 32). Yet, even if all of these 470 results were false positives, it would increase the estimated FDR only from 14% to 21% (.14 × .92 + .08 × 1 = .21). Ioannidis (2014) also claimed that “much of the data is either wrong or makes little sense” (p. 32). If this were true, it would imply that medical abstracts are useless. His final conclusion was that Jager and Leek’s article serves as a warning of “how badly things can go when automated scripts are combined with wrong methods and unreliable data” (p. 34). However, he never produced better empirical estimates of the false discovery risk in medical trials.

Gelman and O'Rourke (2014) simply found "their claims unbelievable" (p. 19). One of their key concerns was that "there is just too much selection going on" (p. 20) to obtain reasonable estimates, even though Jager and Leek's model took selection bias into account. Ultimately, they expressed skepticism about the value of modeling distributions of p-values: "To us, an empirical estimate would involve looking at some number of papers with p-values and then follow up and see if the claims were replicated" (p. 22). This did not stop Gelman from being a co-author of a recent article that used p-values to make recommendations about the interpretation of results in clinical trials (vanZwet et al., 2023).

Goodman (2014) criticized the use of multiple p-values from a single study because these results are not independent and it is difficult to model such data without information about the amount of dependency. Another criticism was the choice of top journals; presumably, higher false positive rates could be found in clinical trials published in less prominent journals or when pilot studies are included. A third concern was that results reported in abstracts may be misleading because abstracts feature significant results. Goodman (2014) also demanded that "any estimate of the reliability of the medical literature must incorporate, as a primary result and not just as a sensitivity analysis, some estimate of the effect of biases" (p. 26). Yet, Goodman was a co-author of a recent article that used p-values to examine evidence in clinical trials without mentioning selection bias at all (vanZwet et al., 2023).

It would take nearly another decade before researchers tried again to examine the credibility of clinical trials based on distributions of p-values in two independent studies (Schimmack & Bartos, 2023; vanZwet et al., 2023). Both articles used z-scores rather than p-values, but this is a minor technical detail because p-values can be converted into z-scores and vice versa. The advantage of z-scores is that they have more interpretable distributions (see Figures below).
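For readers who want to verify this equivalence, the conversion between two-sided p-values and absolute z-scores uses only the standard normal distribution:

# two-sided p-value <-> absolute z-score
p_to_z <- function(p) qnorm(1 - p/2)
z_to_p <- function(z) 2 * (1 - pnorm(abs(z)))
p_to_z(.05)   # ~1.96
z_to_p(1.96)  # ~.05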

Schimmack and Bartos (2023) used a model that was first developed to estimate the mean power of studies that are selected for significance (Brunner & Schimmack, 2021). The model is called z-curve because it fits a mixture model to the observed distribution of z-scores. The aim of z-curve 1.0 was to predict the outcome of replication studies when original studies are selected for significance. This model was applied to studies in social psychology, where selection bias leads to success rates over 90% (Schimmack, 2020; Sterling, 1995). Z-curve 2.0 extended the model to estimate the mean power of all studies, including non-significant studies that are not reported (Bartos & Schimmack, 2022). An estimate of the discovery rate without selection bias makes it possible to estimate the amount of selection bias (Sterling et al., 1995). Selection bias is simply the difference between the observed discovery rate (ODR; i.e., the percentage of significant results) and the estimated discovery rate (EDR; i.e., mean power to produce significant results). Without bias, the observed discovery rate matches the estimated discovery rate because the percentage of significant results is a direct function of mean power (Brunner & Schimmack, 2021). The EDR also provides an estimate of the maximum false discovery rate (Soric, 1989). The main advantage of this approach is that it does not require a priori assumptions about the proportions of false and true hypotheses that are being tested. Thus, the results do not depend on assumptions that make claims about false discovery rates speculative.
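The following sketch (my own summary of the logic, not the z-curve software) shows how these quantities relate to each other. The numbers anticipate the estimates reported below; Soric's bound gives roughly 12%, close to the 13-14% z-curve estimates.

# Soric's (1989) maximum false discovery rate implied by the EDR
soric_fdr <- function(edr, alpha = .05) ((1 / edr) - 1) * (alpha / (1 - alpha))
odr <- .70      # observed discovery rate (share of significant results)
edr <- .30      # estimated discovery rate (mean power of all studies)
odr - edr       # 0.40 = amount of selection bias
soric_fdr(edr)  # ~0.12 = maximum share of false positives among discoveries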

Schimmack and Bartos (2023) applied z-curve 2.0 to results extracted from abstracts in medical journals. They also compared z-curve 2.0 to Jager and Leek's model. The key finding was that z-curve 2.0 performed better than Jager and Leek's model in simulation studies. When the model was applied to 19,751 p-values from medical abstracts using an improved extraction method, it closely reproduced Jager and Leek's results. The point estimate of the false positive risk was 13%, with a 95% confidence interval ranging from 8% to 21%. Thus, the results confirmed that the false positive risk in clinical trials is well below 50%. The article also introduced several new ways to assess clinical trials. Most important, there was clear evidence of selection bias in abstracts of journal articles. The observed discovery rate (i.e., the percentage of abstracts that reported a significant result) was 70%, whereas the model predicted only 30% significant results. Thus, the discovery rate in abstracts of published articles is more than double the estimated discovery rate. Finally, z-curve estimated that the mean power of studies that produced a significant result was 65% (95% CI = 61% to 69%), which implies that about two thirds of clinical trials with a significant result are expected to produce a significant result again in an exact replication study with the same sample size. Overall, these results suggest that selection bias makes it difficult to interpret the point estimate of the population effect size in a single clinical trial, but that clinical trials in general produce solid empirical evidence, especially when the evidence is pooled in meta-analyses.

vanZwet et al. (2023) took a different approach. First, they created their own model of z-value distributions. Second, they relied on Cochrane reviews to obtain information about effect sizes and sampling errors in clinical trials and used this information to compute z-values, z = ES/SE (ES = effect size, SE = sampling error). I will first discuss their model and then the results based on Cochrane reviews.

vanZwet et al. do not provide evidence for the validity of their model, nor do they compare it to previous models (Jager & Leek, 2014; Bartos & Schimmack, 2022). There are three main differences between their model and z-curve: (a) their model assumes full normal distributions rather than truncated, folded normal distributions; (b) their model allows for standard deviations greater than one to capture variation in population parameters (i.e., differences in sample sizes and population effect sizes) in addition to sampling error, whereas z-curve uses truncated, folded standard normal distributions that model only sampling error (SD = 1); and (c) their model assumes a mean of zero, whereas z-curve allows for variation in means to model variability in population parameters. The key assumptions that distinguish vanZwet's model from z-curve are the use of full normal distributions centered at zero. Therefore, I call it the symmetrical-full-normal model (SFN-curve).

vanZwet et al. provide no theoretical justification for their assumption that distributions of z-values from clinical trials or other studies should fit the SFN model. In personal email communication, vanZwet justified the model by stating that they "analyzed a specific dataset in an entirely appropriate way" (December 2023). However, the article did not provide any information about model fit. Therefore, I conducted my own comparison of model fit by fitting the data from vanZwet et al.'s article with SFN-curve and with z-curve. While z-curve is usually fitted to absolute z-scores because the sign is irrelevant, the software allows specifying negative means for the mixture components. Using this approach makes it possible to directly compare the fit of the two models in a plot of the predicted density distributions.
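A simple simulation (my own illustration, not the authors' code) shows why the zero-centered assumption matters: once z-scores come from studies with non-zero population effects, the observed distribution has components with shifted means. For illustration I plot a single zero-centered normal; vanZwet et al.'s actual model is a mixture of zero-centered normals, which is likewise symmetric around zero and faces the same problem with a shifted bump.

# Illustrative data: half null results, half true effects with mean z = 3
set.seed(123)
z <- c(rnorm(5000, 0, 1), rnorm(5000, 3, 1))
hist(z, breaks = 100, freq = FALSE, main = "Shifted mixture vs. zero-centered normal")
curve(dnorm(x, 0, sd(z)), add = TRUE, col = "red")                           # zero-centered fit
curve(.5*dnorm(x, 0, 1) + .5*dnorm(x, 3, 1), add = TRUE, col = "darkgreen")  # true mixture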

Figure 1 shows a histogram of the k = 23,557 z-values used by vanZwet et al., along with the density distributions estimated by kernel density (blue), z-curve (dark green line), and SFN-curve (red line). Visual inspection is sufficient to see that both models fit the data relatively well, but that z-curve fits better than SFN-curve.

However, the relatively good fit of the SFN model is incidental. Figure 2 shows the histogram of a subset of studies from vanZwet et al. (2023) that compared treatments to placebo. The mean of the distribution shifts towards negative values because more studies tested outcomes where negative values imply clinical benefits of a treatment. Z-curve fits these data well, but SFN-curve does not because it makes the unreasonable assumption that the observed data have a mode at zero.

Figure 2 makes it clear that vanZwet et al.’s new SFN-model is suboptimal. There is no reason to use their approach to fit distributions of z-scores because z-curve fits the data as well or better.

Examining Cochrane Clinical Trials with Z-Curve

The previous analyses used positive and negative z-scores to allow for a direct comparison of z-curve and SFN-curve. However, the sign of z-scores in vanZwet's dataset has no meaning, and it is easier to use z-curve with absolute z-scores, which only show how strong the evidence against the null hypothesis is (larger z-scores are less likely to occur when the null hypothesis is true). Figure 3 shows the z-curve with absolute z-values as data.

One interesting result of z-curving the Cochrane data is that the observed discovery rate (i.e., the percentage of significant results, p < .05) is practically identical to the estimated discovery rate (i.e., mean power), ODR = 29%, EDR = 27%. This is a noteworthy empirical finding that vanZwet et al. (2023) fail to mention. The finding is even more remarkable because Schimmack and Bartos (2023) found evidence of publication bias in abstracts of leading medical journals. An interesting question for future research is how Cochrane reviews debias results from original articles. For now, it is important that z-curve is a simple tool that can be used to assess publication bias in sets of statistical results and that effect size estimates in Cochrane reviews are not inflated by selection for significance.

A second noteworthy finding is that the EDR in Cochrane reviews (29% including all studies used by vanZwet et al., see Figure 1) is similar to the EDR estimate based on abstracts in medical journals (30%). While this may be a coincidence, it suggests that z-curve provides reasonable estimates of the EDR even when selection bias is present. Future research should compare z-curve estimates based on the Cochrane database with matching original studies.

In his critique of Jager and Leek's article, Goodman (2014) pointed out that it is more important to find the alpha level that limits the false discovery risk at a reasonably low level (say, 5%) than to estimate the false discovery risk at the traditional alpha level of .05. Schimmack and Bartos found that the false positive risk is 14%. Figure 3 replicates this finding with the Cochrane data because the FDR is based on the EDR and the EDRs are the same. Thus, there is converging evidence that the percentage of false positive results is 14% or less. A simple change of alpha shows that an EDR of 30% produces a false positive risk below 5% if alpha is set to .01. Thus, to answer Goodman's question, we recommend alpha = .01 to maintain a reasonably low false positive risk.
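The calculation behind this recommendation is straightforward. The following rough sketch plugs the EDR into Soric's bound for both alpha levels (ignoring, for simplicity, that the EDR itself shifts somewhat when alpha changes):

soric_fdr <- function(edr, alpha) ((1 / edr) - 1) * (alpha / (1 - alpha))
soric_fdr(edr = .30, alpha = .05)  # ~0.12
soric_fdr(edr = .30, alpha = .01)  # ~0.02, well below the 5% target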

vanZwet et al. (2023) do not provide estimates of the false positive risk, but they do estimate the risk of sign errors, which is closely related to the false positive risk. The risk of sign errors depends on the magnitude of the population effect sizes, but in the limit, when all effects are positive or negative yet their magnitudes are close to zero, the risk of sign errors approaches the false positive risk. In other words, the false positive risk is the worst-case scenario in which observed effect sizes have a different sign than the population effect size (the effect size estimate is positive when the population effect size is zero or negative, and vice versa). Interestingly, vanZwet et al. estimate that significant results have only a 2% risk of being sign errors, which also implies a false positive risk around 2%. This is much lower than the 14% estimate obtained with z-curve.

The reason for this difference is the conservative nature of FDR estimates with z-curve. Soric's approach assumes that the expected discovery rate is a mixture of false hypotheses and true hypotheses tested with 100% power. Assuming lower power reduces the false positive risk. For example, if power is only 50%, twice as many true hypotheses need to be tested to get the same number of true positive results. As a result, fewer false hypotheses are tested and there are fewer false positives. With 50% power, the false discovery risk is only 6%, which is still higher than vanZwet et al.'s estimate of 2%. A false discovery rate of 2% can be obtained with a scenario in which researchers test 80% true hypotheses with 35% power. This scenario is strikingly different from Ioannidis's (2005) scenarios, which often assume that researchers test more false hypotheses than true hypotheses. In short, although vanZwet et al. (2023) do not comment on Ioannidis's famous prediction, their estimate of only 2% false positive results in Cochrane clinical trials is noteworthy and further challenges Ioannidis's assumptions about the credibility of clinical trials.
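The mixture logic in this paragraph can be written out explicitly. The sketch below solves for the share of false hypotheses that is consistent with a given EDR and power, and then computes the implied FDR; the numbers are my own rough approximations and do not exactly reproduce the published estimates.

# share of false hypotheses f solves: f*alpha + (1 - f)*power = EDR
fdr_given_power <- function(edr, power, alpha = .05) {
  f <- (power - edr) / (power - alpha)  # implied share of false hypotheses
  f * alpha / edr                       # share of false positives among discoveries
}
fdr_given_power(.30, power = 1.0)   # ~0.12, Soric's worst case
fdr_given_power(.30, power = 0.5)   # ~0.07
fdr_given_power(.29, power = 0.35)  # ~0.03, with f = .20 (80% true hypotheses)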

Ignoring Heterogeneity

The main purpose of vanZwet et al.'s article was to propose interpreting results from new clinical trials in the context of the power of previous clinical trials. The main problem with this recommendation is that their database is a heterogeneous set of clinical trials that examined radically different treatments. We suggest that researchers conduct z-curve analyses of specific studies that more closely match their research topic and study designs. We recommend z-curve simply because the SFN model makes assumptions that are typically violated in these subsets of studies (see Figure 2). To illustrate the use of z-curve for specific meta-analyses, we picked the Cochrane review #CD001886, which examined "Anti‐fibrinolytic use for minimizing perioperative allogeneic blood transfusion." This review was chosen because the distribution of z-scores with positive and negative signs was clearly not symmetrical (90% negative, 10% positive, Median = -1.97, Mean = -2.24) and because there were many efficacy outcomes (k = 719).

Figure 4 shows the z-curve results. First, once more there is no evidence of publication bias (ODR = 51%, EDR = 50%). Second, the higher discovery rate implies a lower false positive risk: no more than 5% of statistically significant results can be false positives. The distribution of z-scores and the implied parameters differ from those for vanZwet et al.'s (2023) full dataset. Evidently, it would be a mistake to use vanZwet et al.'s results to interpret results in these clinical trials, and the same would be true for other subsets of data. We recommend that researchers conduct their own z-curve analyses of theoretically relevant studies rather than relying on vanZwet et al.'s results. The key problem with their results is that the "inclusion of a trial in the Cochrane Database largely depends on whether someone happens to be interested in a particular treatment or intervention, so the database is not a random sample from the population of all trials" (vanZwet et al., 2023, p. 6).

We also believe that it is unreasonable to adjust effect size estimates based on their results. While it is true that selection for significance inflates point estimates of effect sizes, it is a fallacy to interpret point estimates of effect sizes in the first place, especially in small samples. Fortunately, medical researchers routinely report results with confidence intervals that provide information about the uncertainty in these point estimates. Even conditioned on significance, confidence intervals will often include the true population effect size. Moreover, the most notable finding was that there is no evidence of selection bias in the Cochrane database, which means that effect size estimates in Cochrane reviews are not inflated by selection for significance. Accordingly, effect size estimates in these meta-analyses do not need to be corrected with vanZwet et al.'s or other correction methods; such corrections are more likely to lead to underestimation of treatment effects. The most useful contribution of z-curve analyses is therefore the examination of publication bias in the studies at hand.

Conclusion

One decade after Jager and Leek's seminal study of p-values in medical research, two articles built on their work using two different datasets and two different methods. The following conclusions can be drawn from these new studies.

First, the z-curve method is superior and more broadly applicable to different datasets than the SFN model, which fits only one dataset, and only by chance. As both methods have the same goal of fitting a density distribution of z-scores, we recommend z-curve as the statistical tool of choice.

Second, the two datasets produce surprisingly similar estimates of the mean power to produce significant results (i.e., the expected discovery rate). About one third of clinical trials are expected to produce p-values below .05. Thus, many clinical trials have low statistical power. It is therefore important to avoid the fallacy of confusing the absence of evidence (a non-significant result for a treatment effect) with evidence for the absence of an effect.

Third, an expected discovery rate of 30% implies a relatively low false positive risk between 10% and 15%. Even this modest estimate is based on a worst-case scenario, and the actual false positive rate is likely to be much lower. Thus, a statistically significant result is much more likely to reveal a true effect than to be a false positive. The results from the Cochrane database also avoid many of the criticisms that Ioannidis raised against results reported in abstracts. Thus, evidence is accumulating that his estimate of 50% or more false positives is based on unrealistic scenarios and false assumptions (Schimmack & Bartos, 2023). This is an important finding in a time of widespread science skepticism. Moreover, lowering alpha to .01 can reduce the false positive risk to less than 5%.

Fourth, the biggest statistical threat to the validity of meta-analyses is selection bias. Thus, it is essential to ensure that studies are not selected for significance or to correct for selection bias when it is present. Z-curve provides a simple way not only to detect selection bias but also to quantify it. The interesting finding is that abstracts in journal articles show large selection bias, whereas Cochrane reviews show no evidence of selection bias at all. Future research needs to examine how selection bias is reduced when original studies are entered into Cochrane reviews.

Overall, the results from both studies provide converging evidence that clinical trials produce robust and credible evidence about the effectiveness of treatments that can guide medical practice. The problem of low power is mitigated by the publication of many non-significant results that help to produce unbiased effect size estimates in Cochrane reviews. As researchers who became interested in the replication crisis through psychology, we look at these results with envy. The enforcement of preregistration and the high quality of Cochrane reviews can serve as an example for psychological science, which is only starting to encourage preregistration and does not have rigorous standards to evaluate biases in meta-analyses. While medical research is clearly not without problems, the overly negative image created by Ioannidis's (2005) influential article is not supported by empirical data (Jager & Leek, 2014; Schimmack & Bartos, 2023; vanZwet et al., 2023).

Maslow’s Hierarchy of Needs: An Empirical Perspective

Maslow’s Hierarchy of Needs is a staple of Introductory Psychology textbooks and popular psychology (Johns, 2023).

The popularity of this model of human motivation is inversely related to its empirical support. Its main appeal seems to be the simple visual presentation of a ranking in the form of a pyramid.

The lack of empirical support may stem from the lack of a dedicated discipline that studies needs and motives. Animal psychologists can only study basic needs rather than self-esteem or self-actualization. Cognitive psychologists are not concerned with motives and experimental social psychologists focus on situations rather than internal causes of behavior. Finally, personality psychologists are interested in variation of internal causes across individuals, but Maslow’s hierarchy implies a universal law that does not leave room for variation across individuals or cultures.

Despite these problems, empirical researchers have tried to test Maslow’s theory, but the history of this work is largely forgotten. For example, the popular personality textbook “The personality puzzle” by David Funder does not include a single reference to a study that tested Maslow’s theory.

The most influential article on empirical tests of Maslow's theory was published nearly 50 years ago (Wahba & Bridwell, 1976). The article reviewed factor analytic and ranking studies, neither of which provided much support for Maslow's theory. The problem with factor analytic studies is obvious: factor analysis relies on correlations across individuals, while Maslow's theory makes predictions about the ordering of means and, in its strong form, assumes that any variation across individuals is just measurement error. Ranking studies, on the other hand, provide a straightforward test of Maslow's theory. However, none of the ranking studies produced the predicted order from most to least important need:

1. Physiological needs (most important)
2. Security needs
3. Relationship needs
4. Self-esteem needs
5. Self-actualization (least important)

The disappointing results may have discouraged other researchers from further empirical tests. A notable exception is a recent study of an online convenience sample (N = 943) in which participants were asked to rank Maslow's needs. The average rankings of four needs were consistent with Maslow's model, but relationship needs were ranked first, before physiological and safety needs. The problem might be that participants were asked to rank needs according to how important the fulfillment of each need is for them. It is possible that participants fail to consider fulfilled needs as important and therefore did not rank the fulfillment of physiological needs highly.

In sum, empirical studies often fail to support Maslow’s hierarchy of needs, but this lack of empirical support is often ignored in textbooks and pop psychology.

Results From a Simple Engagement Exercise

E-textbooks and engagement tools for classrooms (iClicker, TopHat) make it possible to increase engagement with class materials through simple tasks. These exercises provide empirical data that can be shared with students. The results can be useful to demonstrate the replicability and generalizability of textbook findings that are based on older studies and different populations. My textbook, Personality Science: The Science of Human Diversity (Schimmack, 2020), contains a ranking of Maslow's needs to make students think about the theory in relation to their own values. TopHat makes it easy to present a ranking task (see Figure 1).

The results have been consistent over the past three years and support Maslow’s theory in terms of the average importance of the five needs. Figure 2 shows the results for N = 129 students at the University of Toronto, Mississauga for 2023.

Physiological needs are ranked as most important by over 40% of the sample and second by over 20% of students. Safety needs are ranked second by over 40% of students and first by another 20%. These two needs are clearly ranked as more important than the other three. Relationship needs are ranked third by over 30% of students, followed by self-esteem and self-actualization. Self-esteem is ranked fourth by nearly 40% of the students, while over 40% rank self-actualization as the least important need.

However, the results also show that students differ in their rank orders. For some students self-esteem is more important than relationships, and vice versa. Some students even rank self-actualization higher than physiological needs. While it is not clear whether these differences reflect true personality differences, the results suggest that the hierarchy is not universal and can vary across individuals, time periods, and cultures. I then move on to research on human values using Schwartz's model of 10 values, which has received a lot more attention and explicitly allows for diversity in human values, needs, motives, life goals, etc.

Conclusion

Maslow's hierarchy of needs is well known inside and outside of psychology, despite a lack of empirical support. In fact, most empirical tests failed to support it. A simple ranking task allows students to reflect on the importance of needs in their lives and to examine the plausibility of Maslow's theory. Ironically, this simple class exercise provides the best empirical evidence for Maslow's theory so far. In this regard, the results do not merely replicate existing evidence; they provide the strongest support yet for Maslow's claim that needs differ in their importance, while also demonstrating that the hierarchy is probabilistic rather than deterministic and universal.

Loken and Gelman’s Simulation Is Not a Fair Comparison

“What I’d like to say is that it is OK to criticize a paper, even [if, typo in original] it isn’t horrible.” (Gelman, 2023)

In this spirit, I would like to criticize Loken and Gelman's confusing article about the interpretation of effect sizes in studies with small samples and selection for significance. They compare random measurement error to a backpack and the outcome of a study to running speed. Common sense suggests that the same individual under identical conditions would run faster without a backpack than with one. The same conclusion follows from psychometric theory: random measurement error attenuates population effect sizes, which makes it harder to demonstrate significance and produces, on average, weaker effect sizes.

The key point of Loken and Gelman's article is to suggest that this intuition fails under some conditions: "Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise? We caution against the fallacy."

To support their claim that common sense is a fallacy under certain conditions, they present the results of a simple simulation study. After some concerns about their conclusions were raised, Loken and Gelman shared the actual code of their simulation study. In this blog post, I share the code with annotations and reproduce their results. I also show that their results are based on selecting for significance only for the measure with random measurement error (with a backpack) and not for the measure without random measurement error (without a backpack). Reversing the selection shows that selection for significance without measurement error produces stronger effect sizes even more often than selection for significance with a backpack. Thus, it is not a fallacy to assume that we would all run faster without a backpack, holding all other factors equal. However, a runner with a heavy backpack and tailwinds might run faster than a runner without a backpack facing strong headwinds. While this is true, the influence of wind on performance makes it difficult to see the influence of the backpack. Under identical conditions, backpacks slow people down and random measurement error attenuates effects.

Loken and Gelman's presentation of the results may explain why some readers, including us, misinterpreted their results to imply that selection bias and random measurement error may interact in some complex way to produce even more inflated estimates of the true correlation. We added some lines of code to their simulation to compute the average correlations after selection for significance separately for the measure without error and the measure with error. This way, both measures benefit equally from selection bias. The plot also provides more direct evidence about the amount of bias that is introduced by selection bias and random measurement error. In addition, the plot shows the average 95% confidence intervals around the estimated correlation coefficients.


The plot shows that for large samples (N > 1,000), the measure without error always produces the expected true correlation of r = .15, whereas the measure with error always produces the expected attenuated correlation of r = .15 * .80 = .12. As sample sizes get smaller, the effect of selection bias becomes apparent. For the measure without error, the observed effect sizes are now inflated. For the measure with error, the attenuation partially offsets this inflation, so the two biases cancel each other out and produce more accurate estimates of the true effect size than the measure without error. For sample sizes below N = 400, however, both measures produce inflated estimates, and in really small samples the attenuation effect due to unreliability is overwhelmed by selection bias. Yet, while the difference due to unreliability becomes negligible and approaches zero, random measurement error combined with selection bias never produces stronger estimates, on average, than the measure without error. Thus, it remains true that we should expect a measure without random measurement error to produce stronger correlations than a measure with random error. This fundamental principle of psychometrics, however, does not license the conclusion that an observed statistically significant correlation in a small sample underestimates the true correlation, because the observed correlation may have been inflated by selection for significance.

The plot also shows how researchers can avoid misinterpreting inflated effect size estimates in small samples. In small samples, confidence intervals are wide. Figure 2 shows that the confidence interval around inflated effect size estimates in small samples is so wide that it includes the true correlation of r = .15. The width of the confidence interval in small samples makes it clear that the study provides no meaningful information about the size of an effect. This does not mean the results are useless; after all, they correctly show that the relationship between the variables is positive rather than negative. For the purpose of effect size estimation, it is necessary to conduct meta-analyses that include studies with significant and non-significant results. Furthermore, meta-analyses need to test for the presence of selection bias and correct for it when it is present.

P.S. If somebody claims that they ran a marathon in 2 hours with a heavy backpack, they may not be lying. They may just not be telling you all of the information. We often fill in the blanks, and that is where things can go wrong. If the backpack were a jet pack and the person used it to fly for part of the race, we would no longer be surprised by the amazing feat. Similarly, if somebody tells you that they got a correlation of r = .8 in a sample of N = 8 with a measure that has only 20% reliable variance, you should not be surprised if they also tell you that they got this result after picking 1 out of 20 studies, because selection for significance will produce strong correlations in small samples even if there is no correlation at all. Once they tell you that they tried many times to get the one significant result, it is obvious that the next study is unlikely to replicate it.

Sometimes You Can Be Faster With a Heavy Backpack

Annotated Original Code

 
### This is the final code used for the simulation studies posted by Andrew Gelman on his blog
 
### Comments are highlighted with my initials #US#
 
# First just the original two plots, high power N = 3000, low power N = 50, true slope = .15
 
r <- .15
sims<-array(0,c(1000,4))
xerror <- 0.5
yerror<-0.5
 
for (i in 1:1000) {
x <- rnorm(50,0,1)
y <- r*x + rnorm(50,0,1) 
 
#US# this is a sloppy way to simulate a correlation of r = .15
#US# The proper code is r*x + rnorm(50,0,1)*sqrt(1-r^2)
#US# However, with the specific value of r = .15, the difference is trivial
#US# However, however, it raises some concerns about expertise
 
xx<-lm(y~x)
sims[i,1]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(50,0,xerror)
y<-y + rnorm(50,0,yerror)
xx<-lm(y~x)
sims[i,2]<-summary(xx)$coefficients[2,1]
 
x <- rnorm(3000,0,1)
y <- r*x + rnorm(3000,0,1)
xx<-lm(y~x)
sims[i,3]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(3000,0,xerror)
y<-y + rnorm(3000,0,yerror)
xx<-lm(y~x)
sims[i,4]<-summary(xx)$coefficients[2,1]
 
}
 
plot(sims[,2] ~ sims[,1],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")
 
plot(sims[,4] ~ sims[,3],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")
 
#US# There is no major issue with graphs 1 and 2. 
#US# They merely show that high sampling error produces large uncertainty in the estimates.
#US# The small attenuation effect of r = .15 vs. r = .12 is overwhelmed by sampling error
#US# The real issue is the simulation of selection for significance in the third graph
 
# third graph
 
# run 2000 regressions at points between N = 50 and N = 3050 
 
r <- .15
 
propor <-numeric(31)
powers<-seq(50,3050,100)
 
#US# These lines of code are added to illustrate the biased selection for significance
propor.reversed.selection <-numeric(31) 
mean.sig.cor.without.error <- numeric(31) # mean correlation for the measure without error when t > 2
mean.sig.cor.with.error <- numeric(31) # mean correlation for the measure with error when t > 2
 
#US# It is sloppy to refer to sample sizes as powers. 
#US# In between subject studies, the power to produce a true positive result
#US# is a function of the population correlation and the sample size
#US# With population correlations fixed at r = .15 or r = .12, sample size is the
#US# only variable that influences power
#US# However, power varies from alpha to 1 and it would be interesting to compare the 
#US# power of studies with r = .15 and r = .12 to produce a significant result.
#US# The claim that "one would always run faster without a backpack"
#US# could be interpreted as a claim that it is always easier to obtain a 
#US# significant result without measurement error, r = .15, than with measurement error, r = .12
#US# This claim can be tested with Loken and Gelman’s simulation by computing 
#US# the percentage of significant results obtained without and with measurement error
#US# Loken and Gelman do not show this comparison of power.
#US# The reason might be the confusion of sample size with power. 
#US# While sample sizes are held constant, power varies as a function of the population correlations
#US# without, r = .15, and with, r = .12, measurement error. 
 
xerror<-0.5
yerror<-0.5
 
j = 1
i = 1
 
for (j in 1:31)  {
 
sims<-array(0,c(1000,4))
for (i in 1:1000) {
x <- rnorm(powers[j],0,1)
y <- r*x + rnorm(powers[j],0,1)
#US# the same sloppy simulation of population correlations as before
xx<-lm(y~x)
sims[i,1:2]<-summary(xx)$coefficients[2,1:2]
x<-x + rnorm(powers[j],0,xerror)
y<-y + rnorm(powers[j],0,yerror)
xx<-lm(y~x)
sims[i,3:4]<-summary(xx)$coefficients[2,1:2]
}
 
#US# The code is the same as before, it just adds variation in sample sizes
#US# The crucial aspect to understand figure 3 is the following code that 
#US# compares the results for the paired outcomes without and with measurement error
 
#US# Carlos Ungil (https://ch.linkedin.com/in/ungil) pointed out on Gelman's blog
#US# that there is another sloppy mistake in the simulation code that does not alter the results.
#US# The code compares absolute t-values (coefficient/sampling error), while the article
#US# talks about inflated effect size estimates. However, while the sampling error variation
#US# creates some variability, the pattern remains the same.
#US# For the sake of reproducibility I kept the comparison of t-values.
 
# find significant observations (t test > 2) and then check proportion
temp<-sims[abs(sims[,3]/sims[,4])> 2,]
 
#US# the use of t > 2 is sloppy and unnecessary.
#US# summary(lm) gives the exact p-values that could be used to select for significance
#US# summary(xx)$coefficients[2,4] < .05
#US# However, this does not make a substantial difference 
 
#US# The crucial part of this code is that it uses the outcomes of the simulation 
#US# with random measurement error to select for significance
#US# As outcomes are paired, this means that the code sometimes selects outcomes
#US# in which sampling error produces significance with random measurement error 
#US# but not without measurement error. 
 
propor[j] <- table((abs(temp[,3]/temp[,4])> abs(temp[,1]/temp[,2])))[2]/length(temp[,1])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# when measurement error is present.
mean.sig.cor.with.error[j] = mean(temp[,3])
 
#US# Conditioning on significance for one of the two measures is a strange way
#US# to compare outcomes with and without measurement error.
#US# Obviously, the opposite selection bias would favor the measure without error.
#US# This can be shown by computing the same proportion after selecting for significance
#US# for the measure without error.
 
temp<-sims[abs(sims[,1]/sims[,2])> 2,]
propor.reversed.selection[j] <- table((abs(temp[,1]/temp[,2])> abs(temp[,3]/temp[,4])))[2]/length(temp[,4])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# without measurement error. 
mean.sig.cor.without.error[j] = mean(temp[,1])
 
print(j)
 
#US# one could also add comparisons that are more meaningful;
#US# the mean correlations after selection are plotted below
 
}
 
 
#US# the plot code had to be modified slightly to have matching y-axes
#US# I also added a title
title = "Original Loken and Gelman Code"
 
plot(powers,propor,type="l",
ylim=c(0,1),main=title,  ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="blue")
 
#US# text that explains what the plot displays, not shown
#US# text(200,.8,"How often is the correlation higher for the measure with error",pos=4)
#US# text(200,.75,"when pairs of outcomes are selected based on significance of",pos=4)
#US# text(200,.70,"the measure with error?",pos=4)
 
#US# We can now plot the two outcomes in the same figure
#US# The original color was blue. I used red for the reversed selection
par(new=TRUE)
plot(powers,propor.reversed.selection,type="l",
ylim=c(0,1), ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="firebrick2")
 
#US# adding a legend
legend(1500,.9,legend=c("with backpack only sig. \n shown in article \n ",
"without backpack only sig. \n added by me"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at 50%
abline(h=.5,lty=2)
 
 
#US# The following code plots the mean correlations after selection for significance
#US# for the measure with error (blue) and the measure without error (red)
 
title = "Comparison of Correlations after Selection for Significance"
 
plot(powers,mean.sig.cor.with.error,type="l",ylim=c(.1,.4),main=title,
xlab="Sample Size",ylab="Mean Observed Correlation",col="blue")
 
par(new=TRUE)
 
plot(powers,mean.sig.cor.without.error,type="l",ylim=c(.1,.4),main="",
xlab="Sample Size",ylab="Mean Observed Correlation",col="firebrick2")
 
#US# adding a legend
legend(2000,.4,legend=c("with error",
"without error"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at the true correlation of .15
abline(h=.15,lty=2)
 

Loken and Gelman are still wrong

Abstract
Loken and Gelman published the results of a study that aimed to simulate the influence of random measurement error on effect size estimates in studies with low power (small samples, small correlations). Their Figure 3 suggested that "of statistically significant effects observed after error, a majority could be greater than in the 'ideal' setting when N is small." I show with a simple simulation that this result rests on a mistake in their simulation that conflates sampling error and random measurement error. Holding random measurement error constant across simulations reaffirms Hausman's iron law of econometrics: random measurement error is likely to produce attenuated effect size estimates. The article concludes with the iron law of meta-science: the original authors of a novel discovery are the least likely people to find an error in their own work.

Introduction

Loken and Gelman published a brief essay on "Measurement error and the replication crisis" in the magazine Science. As it turns out, the claims in the article are ambiguous because key terms like measurement error are not properly defined. However, the article does contain the results of simulation studies that are presented in figures. The key figure is Figure 3.

This figure is used to support the claim that "of statistically significant effects observed after error, a majority could be greater than in the 'ideal' setting when N is small."

Some points about this claim are important. It is not a claim about a single study. In a single study, measurement error, sampling error, and other factors CAN produce a stronger result with a less reliable measure, just like some people can win the lottery, even though it is a very unlikely event. The claim is clearly about the outcome in the long run after many repeated trials. That is also implied by a figure that is based on simulations of many repeated trials of the same study. What does the figure imply? It implies that measurement error attenuates observed correlations (or regression coefficients with measurement error in the predictor variable, x) in large samples. The reason is simply that random measurement error adds variance to a variable that is unrelated to the outcome measure. As a result, the observed correlation is a mixture of the true relationship and a correlation of zero, and the mixture depends on the amount of random measurement error in the predictor variable.
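This attenuation is easy to verify in a simulation that mirrors Loken and Gelman's setup, where error with SD = 0.5 is added to both variables (reliability = 1/1.25 = .80 each), so the observed correlation shrinks by a factor of .80:

# attenuation by random measurement error added to x and y
set.seed(42)
n <- 1e6
x <- rnorm(n)
y <- .15 * x + rnorm(n) * sqrt(1 - .15^2)  # true correlation r = .15
x.obs <- x + rnorm(n, 0, .5)               # reliability = 1/1.25 = .80
y.obs <- y + rnorm(n, 0, .5)
cor(x, y)          # ~ .15
cor(x.obs, y.obs)  # ~ .15 * .80 = .12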

Selection for significance, on the other hand, has the opposite effect. To obtain significance, the observed correlation has to exceed a minimum value such that it is approximately twice as large as its sampling error (t ~ 2 corresponds to p < .05, two-tailed). In large samples, sampling error is small and correlations of r = .15 are significant in most cases (i.e., the study has high power). When 99% of all studies are significant, selecting for significance to get a success rate of 100% is irrelevant. However, in small samples with N = 50, a small correlation of r = .15 is not enough to reach significance. Thus, all significant correlations are inflated. Measurement error attenuates correlations and makes it even harder to get significant results. With a reliability of .8 and a correlation of r = .15, the expected correlation is only .15 * .8 = .12, and even more inflation is needed to reach significance.
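The minimum correlation needed for significance makes this inflation explicit. A small helper function (my own addition) computes the smallest correlation that reaches p < .05 (two-tailed) for a given sample size:

# smallest |r| that is significant at p < .05 (two-tailed)
r_crit <- function(n) {
  t <- qt(.975, df = n - 2)
  sqrt(t^2 / (t^2 + n - 2))
}
r_crit(50)    # ~ .28: every significant r at N = 50 is far above .15
r_crit(1000)  # ~ .06: r = .15 reaches significance without inflation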

Figure 3 in Loken and Gelman's article suggests that selection for significance with unreliable measures produces even more inflated effect size estimates than selection for significance without measurement error. This is implied by the result that a majority (over 50%) of simulated outcomes showed a higher (and more inflated) effect size estimate when random measurement error was added than in the ideal setting without it. Loken and Gelman's claim that "of statistically significant effects observed after error, a majority could be greater than in the 'ideal' setting when N is small" is based on this result: with N = 50, r = .15, and a reliability of .8, a majority of comparisons showed a stronger effect size estimate for the simulation with random error than for the simulation without it.

I believe that this outcome is based on an error in their simulation: the code does not clearly distinguish between sampling error and random measurement error. I have tried to make this point repeatedly on Gelman's blog, but this discussion and my previous blog post (which Gelman probably did not read) failed to resolve the controversy. However, it helped me to understand the source of the disagreement more clearly. I maintain that Gelman does not make a clear distinction between sampling error (i.e., even with perfectly reliable measures, results will vary from sample to sample and this variability is larger in small samples; STATS 101) and random measurement error (i.e., two measures of the same construct are not perfectly correlated with each other; NOT A TOPIC OF STATISTICS, which typically assumes perfect measures). Based on this insight, I wrote a new R script that clearly distinguishes between sampling error and random measurement error. I ran the script 10,000 times. Here are the key results.

The simulation ensured that reliability in each run is exactly 80%.


The expected effect sizes are r = .15 for the true relationship and r = .12 for the measure with 80% reliability. The average effect sizes across the 10,000 simulations match these expected values. We also see that sampling error produces huge variability in specific runs. However, even extreme deviations are attenuated by random measurement error. Thus, random measurement error makes values less extreme.

What about sign errors? We don't really know the sign of the true correlation, and two-tailed testing allows researchers to reject H0 with the wrong sign. To allow for this possibility, we can compute the absolute correlations.

This does not matter. The results for the measure with random error are still lower and less extreme.

Now we can examine how conditioning on significance influences the results.

Once more the effect size estimates for the true correlation are stronger and more extreme than those for the measure with random measurement error. This is also true for absolute effect size estimates.

Loken and Gelman's Figure 3 requires a direct comparison of two outcomes in the same run after selection for significance. This creates a problem because sometimes one result is significant and the other is not. As a result, the comparison is biased because it compares estimates after selection for significance with estimates without selection for significance. However, even with this bias in favor of the unreliable measure, random measurement error produced weaker effect size estimates in the majority of all cases.

Conclusion

In short, these results confirm Hausman's iron law of econometrics: random measurement error typically attenuates effect size estimates. Typically, of course, does not mean always. However, Loken and Gelman claimed that they identified a situation in which the iron law does not apply and can lead to false inferences. They claimed that (a) in small samples and (b) after selection for significance, random measurement error will produce stronger effect size estimates not once or twice but IN A MAJORITY of studies. This claim was implied by the results of their simulations displayed in Figure 3. Here I showed that their simulation fails to isolate the influence of random measurement error. Holding reliability constant at 80% produces the expected outcome: random measurement error is more likely to attenuate effect size estimates than to inflate them, even in small samples and after selection for significance. Thus, researchers are justified in claiming that they could have obtained stronger correlations with a more reliable measure, or in using latent variable models to correct for unreliability. What they cannot do is claim that the true population correlation is stronger than their observed correlation, because this claim ignores the influence of selection for significance, which inevitably inflates observed correlations in small samples with small effect sizes. It is also not correct to assume that two wrongs (selecting for significance with unreliable measures) make one right. Robust and replicable results require good data. Effect sizes should only be interpreted if measures have demonstrated good reliability (and validity, but that is another topic) and when sampling error is small enough to produce a meaningful range of plausible values.

New Simulation of Reliability

N = 50

REL = .80
n.sim = 10000

res = c()

for (i in 1:n.sim) {

# true score with variance = REL (80% of the total variance of each measure)
SV = scale(rnorm(N))*sqrt(REL)

# error component of measure 1: made orthogonal to the true score, variance = 1-REL
x1 = rnorm(N)
x1 = residuals(lm(x1 ~ SV))
x1 = scale(x1)
x1 = x1*sqrt(1-REL)

# error component of measure 2: orthogonal to the true score and to the first error
x2 = rnorm(N)
x2 = residuals(lm(x2 ~ x1 + SV))
x2 = scale(x2)
x2 = x2*sqrt(1-REL)

# observed measures = true score + error; reliability is exactly REL in every sample
x1 = x1 + SV
x2 = x2 + SV

# criterion with a true slope of .15 on the true score
y = .15 * SV + rnorm(N)*sqrt(1-.15^2)

# store variances, the reliability check (correlation of the two parallel measures),
# and the regression results for the true score and the two fallible measures
r = c(var(x1),var(x2),cor(x1,x2),
summary(lm(y ~ SV))$coef[2,],
summary(lm(y ~ x1))$coef[2,],
summary(lm(y ~ x2))$coef[2,])

res = rbind(res,r)

} # End of sim

summary(res)

Open Science Practices and Replicability

Summary

A recent article in the flashy journal Nature Human Behaviour, which charges authors or their universities $6,000, published the claim that "high replicability of newly discovered social-behavioural findings is achievable" (Protzko et al., 2023). This is good news for social scientists and consumers of social psychology after a decade of replication failures caused by questionable research practices, including fraud.

So, what is the magic formula to produce replicable and credible findings in the social sciences?

The paper attributes this success to the implementation of four rigour-enhancing practices, namely confirmatory tests, large sample sizes, preregistration, and methodological transparency. The problem with this multi-pronged approach is that it is not possible to say which of these features are necessary or sufficient to produce replicable results.

I analyze the results of this article with the R-Index. Based on these results, I conclude that none of the four rigour-enhancing practices are necessary to produce highly replicable results. The key ingredients for high replicability are honesty and high power. It is wrong to equate large samples (N = 1,500) with high power. As shown below, sometimes N = 1,500 yields low power and sometimes much smaller samples are sufficient for high power.

Introduction

The article reports 16 studies. Each study was proposed by one lab, and that lab reported the results of a confirmatory test; these tests produced significant results in 15 of the 16 studies. The replication studies by the other three labs produced significant results in 79% of cases.

I predicted these replication outcomes with the Replicability-Index (R-Index). The R-Index is a simple method to estimate replicability for a small set of studies. The key insight behind the R-Index is that the outcome of unbiased replication studies is a function of the mean power of the original studies (I once assumed the median would be better, but this was wrong; Brunner & Schimmack, 2021). Unfortunately, it can be difficult to estimate true mean power from original studies because original studies are often selected for significance, and selection for significance leads to inflated estimates of observed power. The R-Index adjusts for this inflation by comparing the success rate (percentage of significant results) to the mean observed power. If the success rate is higher than the mean observed power, selection bias is present and the mean observed power is inflated. A simple heuristic to correct for this inflation is to subtract the inflation from the mean observed power.
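Written out, the heuristic takes only a few lines. The sketch below uses the rounded values reported in the following paragraphs (94% success rate, 87% mean observed power), which is why the result differs trivially from the reported .81:

# R-Index heuristic: mean observed power minus inflation
r_index <- function(success_rate, mean_obs_power) {
  inflation <- success_rate - mean_obs_power
  mean_obs_power - inflation
}
r_index(success_rate = .94, mean_obs_power = .87)  # ~ .80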

The article reported the outcomes of “original” (blue = self-replication) and replication studies (green = independent replications by other labs) in Figure 1.

To obtain estimates of observed power, I used the point estimates of the original studies and the lower limits of their 95% confidence intervals. I converted these statistics into z-scores using the formula z = ES/((ES - LL.CI)/2). The z-scores were converted into p-values, and p-values below .05 were considered significant. Visual inspection of Figure 1 shows that one original study (blue) did not have a statistically significant result (i.e., the 95% CI includes zero). Thus, the actual success rate was 15/16 = 94%.
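For transparency, this is the conversion applied to each study; the values in the example are hypothetical and only illustrate the formula:

# z-score from an effect size and the lower limit of its 95% CI
ci_to_z <- function(es, ll) es / ((es - ll) / 2)  # (es - ll)/2 approximates the SE
ci_to_z(es = .40, ll = .20)  # hypothetical values: z = 4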

Table 1 shows that the mean observed power is 87%. Thus, there is evidence of a small amount of selection for significance, and the predicted success rate of replication studies is .87 - .06 = .81. The actual success rate was computed as the percentage of replication studies (k = 3) that produced a significant result. The overall success rate of replication studies was 79%, which is close to the R-Index estimate of 81%. Finally, it is evident that power varies across studies. Nine studies had z-scores greater than 5 (the 5-sigma rule of particle physics), and all nine had a replication success rate of 100%. The only reasons for replication failures of studies with z-scores greater than 5 are fraud or problems in the implementation of the replication study. In contrast, studies with z-scores below 4 have insufficient power to produce consistently significant results. The correlation between observed power and replication success rates is r = .93. This finding demonstrates empirically that power determines the outcome of unbiased replication studies.

Discussion

Honest reporting of results is necessary to trust published results. Open science practices may help to ensure that results are reported honestly. This is particularly valuable for the evaluation of a single study. However, statistical tools like the R-Index can be used to examine whether a set of studies is biased or unbiased. In the present set of 16 original studies, it detected a small bias that explains the difference in success rates between the original studies (blue, 94%) and the replication studies (green, 79%).

More importantly, the investigation of power shows that some of the studies were underpowered to reject the nil-hypothesis even with N = 1,500 because the real effect sizes were too close to zero. This shows how difficult it is to provide evidence for the absence of an important effect.

At the same time, other studies had large effect sizes and were dramatically overpowered to demonstrate an effect. As shown, z-scores of 5 are sufficient to provide conclusive evidence against a nil-hypothesis, and this criterion is used in particle physics for strong hypothesis tests. Using N = 1,500 for an effect size of d = .6 is overkill. This means that researchers who cannot easily collect data from large samples can still produce credible results. There are also other ways to reduce sampling error and increase power than increasing sample sizes: within-subject designs with many repeated trials can produce credible and replicable results with sample sizes as small as N = 8. Sample size should not be used as a criterion to evaluate studies, and large samples should not be used as a criterion for good science.

To evaluate the credibility of results in single studies, it is useful to examine confidence intervals and to see which effect sizes are excluded by the lower limit of the confidence interval. Confidence intervals that exclude zero, but not values close to zero suggest that a study was underpowered and that the true population effect size may be so close to zero that it is practically zero. In addition, p-values or z-scores provide valuable information about replicability. Results with z-scores greater than 5 are extremely likely to replicate in an exact replication study and replication failures suggest a significant moderating factor.

Finally, the present results suggest that other aspects of open science like pre-registration are not necessary to produce highly replicable results. Even exploratory results that produced strong evidence (z > 5) are likely to replicate. The reason is that luck or extreme p-hacking does not produce such extreme evidence against the null-hypothesis. A better understanding of the strength of evidence may help to produce credible results without wasting precious resources on unnecessarily large samples.

Random Measurement Error and the Replication Crisis

The code for all simulations is available on OSF (https://osf.io/pyhmr).

P.S. I have been arguing with Andrew Gelman on his blog about his confusing and misleading article with Loken. Things have escalated, and I just want to share his latest response.

Andrew said:

Ulrich:
I’m not dragging McShane into anything. You’re the one who showed up in the comment thread, mischaracterized his paper in two different ways, and called him an “asshole” for doing something that he didn’t do. You say now that you don’t care that he cited you; earlier you called him an asshole for not citing you, even though you did.

Also, your summaries of the McShane et al. and Gelman and Loken papers are inaccurate, as are your comments about confidence intervals, as are a few zillion other things you’ve been saying in this thread.

Please just stop it with the comments here. You’re spreading false statements and wasting our time; also I don’t think this is doing you any good either. I don’t really mind the insults if you had anything useful to say (here’s an example of where I received some critical feedback that was kinda rude but still very helpful), but whatever useful you have to say, you’ve already said, and at this point you’re just going around in circles. So, no more.

The main reason to share this is that I already showed that confidence intervals are often accurate even after selection for significance and that this is even more true when studies use unreliable measures, because the attenuation due to random measurement error compensates for the inflation due to selection for significance. I am not saying that this makes it ok to suppress non-significant results, but it does show that Gelman is not interested in telling researchers how to avoid misinterpretation of biased point estimates. He likes to point out mistakes in other people's work, but he is not very good at noticing mistakes in his own work. I have repeatedly asked for feedback on my simulation results, and if there are mistakes I am going to correct them. Gelman hasn't done so, and so far nobody else has. Of course, I cannot find a mistake in my own simulations. Ergo, I maintain that confidence intervals are useful to avoid misinterpretation of pointless point estimates. The real reason why confidence intervals are rarely interpreted (other than saying CI = .01 to 1.00 excludes zero, therefore the nil-hypothesis can be rejected, which is just silly nil-hypothesis testing; Cohen, 1994) is that confidence intervals in between-study designs with small samples are so wide that they do not allow strong conclusions about population effect sizes.

Introduction

A few years ago, Loken and Gelman (2017) published an article in the magazine Science. A key novel claim in this article was that random measurement error can inflate effect size estimates.

“In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance”

Language is famously ambiguous and open to interpretation. However, the article also presented a figure that seemed to support this counterintuitive conclusion.

The figure seems to suggest that with selection for significance, overestimation of effect sizes is more common in studies that use an unreliable measure than in studies that use a reliable measure. At some point, the proportion of studies where the effect size estimate is greater with error than without error seems to exceed 50%.

Paradoxical findings are interesting and attracted our attention (Schimmack & Carlson, 2017). We believed that this conclusion was based on a mistake in the simulation code. We also tried to explain the combined effects of sampling error and random measurement error on effect sizes in a short commentary that remained unpublished. We never published our extensive simulation results.

Recently, a Ph.D. student also questioned the simulation code, and Andrew Gelman posted the student's concerns on his blog (Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?). The blog post also included the simulation code.

The simulation is simple. It generates two variables with SD = 1 and a correlation of r = .15. It then adds 25% random measurement error to both variables, so that the two observed variables measure the original variables with 4/5 = 80% reliability. This attenuates the true correlation slightly to .15 * .8 = .12. The crucial condition is when this simulation is run with a small sample size of N = 50.

N = 50 is a small sample size to study an effect size of .15 or .12. So, we are expecting mostly non-significant results. The crucial question is what happens when researchers get lucky and obtain a statistically significant result. Would selection for significance produce a stronger effect size estimate for the perfect measure or the unreliable measure?

It is not easy to answer this question because selection for significance requires conditioning on an outcome, and Loken and Gelman's simulation has two outcomes. The outcomes for the perfect measure are paired with the outcomes for the unreliable measure. So, which outcome should be used to select for significance? Using either measure will of course benefit the measure that was used to select for significance. To avoid this problem, I simply examined all four possible outcomes: neither measure was significant, the perfect measure was significant and the unreliable one was not, the unreliable measure was significant and the perfect one was not, or both were significant. To obtain stable cell frequencies, I ran 10,000 simulations.
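The following is a minimal re-implementation of this setup (in Python rather than the original simulation code, using the parameters described above: a latent correlation of r = .15, 80% reliability for both observed measures, N = 50, and 10,000 runs). Exact counts will vary with the random seed.

```python
# Sketch of the four-outcome tabulation: latent X and Y correlate at r = .15,
# each observed measure adds error variance of .25 (reliability = .80), N = 50.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n, r_true, n_sim = 50, 0.15, 10_000
counts = {k: [0, 0] for k in ("neither", "both", "only_perfect", "only_noisy")}

for _ in range(n_sim):
    x = rng.normal(size=n)
    y = r_true * x + np.sqrt(1 - r_true**2) * rng.normal(size=n)
    x_obs = x + np.sqrt(0.25) * rng.normal(size=n)    # 1 / 1.25 = 80% reliability
    y_obs = y + np.sqrt(0.25) * rng.normal(size=n)

    r_perfect, p_perfect = pearsonr(x, y)             # perfect measures
    r_noisy, p_noisy = pearsonr(x_obs, y_obs)         # unreliable measures

    if p_perfect < .05 and p_noisy < .05:
        cell = "both"
    elif p_perfect < .05:
        cell = "only_perfect"
    elif p_noisy < .05:
        cell = "only_noisy"
    else:
        cell = "neither"
    counts[cell][0 if r_perfect > r_noisy else 1] += 1

for cell, (perfect_wins, noisy_wins) in counts.items():
    print(f"{cell:13s} perfect higher: {perfect_wins:5d}   noisy higher: {noisy_wins:5d}")
```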

Here are the results.

1. Neither measure produced a significant result

4870 times the perfect measure had a higher estimate than the unreliable measure (58%)
3629 times the unreliable measure had a higher estimate than the perfect measure (42%)

2. Both measure produced a significant result

579 times the perfect measure had a higher estimate than the unreliable measure (61%)
377 times the unreliable measure had a higher estimate than the perfect measure (39%)

3. The reliable measure is significant and the unreliable measure is not significant

981 times the perfect measure had a higher estimate than the unreliable measure (100%)
0 times the unreliable measure had a higher estimate than the perfect measure (0%)

4. The unreliable measure is significant and the reliable measure is not significant

0 times the perfect measure had a higher estimate than the unreliable measure (0%)
464 times the unreliable measure had a higher estimate than the perfect measure (100%)

The main point of these results is that selection for significance will always favor the measure that is used for conditioning on significance. By definition, the effect size of a significant result will be larger than the effect size of a non-significant result, given equal sample sizes. However, it is also clear that the unreliable measure produces fewer significant results because random measurement error attenuates the effect size and reduces power; that is, the probability of obtaining a significant result.

Based on these results, we can reproduce Loken and Gelman's finding that effect size estimates are more often larger with the unreliable measure. To produce this result, they conditioned on significance for the measure with random error, but not for the measure without random measurement error. That is, they combined conditions 2 (both measures produced significant results) and 4 (only the unreliable measure produced a significant result).

5. (2 + 4) The unreliable measure is significant, the reliable measure can be significant or not significant.

When we simply select for significance on the unreliable measure, we see that the unreliable measure has the stronger effect size over 50% of the time.

579 times the perfect measure had a higher estimate than the unreliable measure (41%)
377+464 = 841 times the unreliable measure had a higher estimate than the perfect measure (59%)

However, this is not a fair comparison of the two measures. Selection for significance is applied to one of them and not the other. The illusion of a reversal is produced by selection bias in the simulation, not by anything that would happen in a real-world scenario where researchers use one measure or the other. This is easy to see when we condition on the reliable measure.

6. (2 + 3) The reliable measure is significant, significance on the other measure does not matter.

579+981 = 1560 times the perfect measure had a higher estimate than the unreliable measure (81%)
377 times the unreliable measure had a higher estimate than the perfect measure (19%)

Now, we overestimate the advantage of the reliable measure. Conditioning on significance selectively for one variable and not the other produces biased simulation results that falsely suggest that an unreliable measure produces stronger correlations despite the fact that random measurement error attenuates correlation coefficients and other standardized effect size estimates.

Extension 1: A Fair Comparison

I posted polls on Twitter (X) and in a Facebook group that discusses methodological issues in psychological science. I created a scenario in which two labs are studying the same effect, say r = .15 in the population between two variables. Both labs use the same small sample size of N = 50. Both labs only publish when they get a statistically significant result, p < .05 (two-tailed). Both labs use a perfectly reliable predictor variable (e.g., height). The only difference between the two labs is that LAB A uses a more reliable outcome measure (Cronbach's alpha = .8) than LAB B (Cronbach's alpha = .5). The question is simply which lab is more likely to publish a bigger (more biased) effect size estimate. The key difference between this scenario and Loken and Gelman's simulation is that both outcomes are conditioned on significance. Thus, we can actually see the influence of random measurement error on correlation coefficients when results are selected for significance. The polls showed that the correct answer is not widely known. In both polls, two-thirds of respondents sided with Loken and Gelman and answered that LAB B is more likely to report larger correlations (Twitter/X 65%, Facebook group 68%). They may have been influenced by Loken and Gelman's article, which claimed that "in noisy research settings, poor measurement can contribute to exaggerated estimates of effect size."

ChatGPT on Bing thinks LAB A will report stronger effect sizes, but its reasoning does not take conditioning on significance into account: "According to the results, the effect size is a measure of the magnitude of the experimental effect, and it is influenced by the reliability of the measures used. The higher the reliability, the more accurate the effect size estimate will be. Therefore, LAB A, which uses a more reliable measure (Cronbach's alpha = .8) than LAB B (Cronbach's alpha = .5), is more likely to publish a bigger (more biased) effect size estimate. LAB B, which uses a less reliable measure, will have more measurement error and lower statistical power, making it harder to detect the true effect size in the population."

To obtain the correct answer, I made only a small change to Loken and Gelman's simulation. First, I did not add measurement error to the predictor variable, X. Second, I added different amounts of random measurement error to two outcome variables: Y1 with 80% reliable variance for LAB A, and Y2 with 50% reliable variance for LAB B. I ran 10,000 simulations to have a reasonably large number of cases after selection for significance. LAB A had more significant results because its attenuated population correlation is larger (.15 × √.8 ≈ .13) than the one for LAB B (.15 × √.5 ≈ .11), and studies with larger effect sizes and the same sample size have more statistical power, that is, a greater chance to produce a significant result. In the simulation, LAB A had 1,435 significant results (~14% power) and LAB B had 1,106 significant results (~11% power). I then compared the first 1,106 significant results from LAB A to the 1,106 significant results from LAB B and computed how often LAB A had a higher effect size estimate than LAB B.
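A rough sketch of this modified simulation (Python; the parameters are those described above, and exact counts depend on the random seed):

```python
# Two labs study the same true correlation (r = .15, N = 50) with a perfectly
# reliable predictor. LAB A's outcome has 80% reliable variance, LAB B's 50%.
# Both labs condition on significance; paired significant results are compared.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n, r_true, n_sim = 50, 0.15, 10_000
sig_a, sig_b = [], []                          # significant estimates per lab

for _ in range(n_sim):
    x = rng.normal(size=n)                                    # perfect predictor
    y = r_true * x + np.sqrt(1 - r_true**2) * rng.normal(size=n)
    y_a = y + np.sqrt(0.25) * rng.normal(size=n)              # reliability .80
    y_b = y + np.sqrt(1.00) * rng.normal(size=n)              # reliability .50

    r_a, p_a = pearsonr(x, y_a)
    r_b, p_b = pearsonr(x, y_b)
    if p_a < .05:
        sig_a.append(r_a)
    if p_b < .05:
        sig_b.append(r_b)

k = min(len(sig_a), len(sig_b))                # pair up equal numbers of results
a_wins = sum(a > b for a, b in zip(sig_a[:k], sig_b[:k]))
print(f"LAB A significant: {len(sig_a)}, LAB B significant: {len(sig_b)}")
print(f"LAB A has the higher estimate in {a_wins / k:.0%} of {k} paired comparisons")
```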

Results: LAB A had a higher effect size estimate in 569 cases (51%) and LAB B had a higher effect size estimate in 537 cases (49%). Thus, there is no reversal in which less reliable measures produce stronger (more biased) correlations in studies with low power and after selection for significance. Loken and Gelman's false conclusion is based on an error in their simulations, which conditioned on significance for the unreliable measure but not for the measure without random measurement error.

Would a more extreme scenario produce a reversal? Power is already low, and nobody should compute correlation coefficients in samples with N = 20, but fMRI researchers famously reported dramatic correlations between brain and behavior in studies with N = 8 ("voodoo correlations"; Vul et al., 2012). So, I repeated the simulation with N = 20, pitting a measure with 100% reliability against a measure with 20% reliability. Given the low power, I ran 100,000 simulations to get stable results.

Results:

LAB A obtained 9,620 significant results (Power ~ 10%). LAB B obtained 6,030 (Power ~ 6%, close to chance, 5% with alpha = .05).

The comparison of the first 6,030 significant results from LAB A with the 6,030 significant results from LAB B showed that LAB A reported a stronger effect size 3,227 times (54%) and LAB B reported a stronger effect size 2,803 times (46%). Thus, more reliable measures not only produce voodoo correlations more often, they also produce higher correlations. Clearly, using less reliable measures does not contribute to the replication crisis in the way Loken and Gelman claimed. Their claim is based on a mistake in their simulations that conditioned yoked outcomes on significance of the unreliable measure.

Extension 2: Simulation with two equally unreliable measures

The next simulation is a new simulation that has two purposes. First, it drives home the message that Gelman’s simulation study unfairly biased the results in favor of the unreliable measure by conditioning on significance for this measure. Second, it provides a new insight into the contribution of unreliable measures to the replication crisis. The simulation assumes that researchers really use two dependent variables (or more) and are going to report results if at least one of the measures has a significant result. Evidently, this doesn’t really work with two perfect measures because they are perfectly correlated, r = 1. As a result, they will always show the same correlation with the independent variable. However, unreliable measures are not perfectly correlated with each other and produce different correlations. This provides room for capitalizing on chance and getting significance with low power. The lower the reliability of the measures the better. I picked a reliability of .5 for both dependent measures (Y1, Y2) and assumed that the independent variable has perfect reliability (e.g., an experimental manipulation).
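A minimal sketch of this scenario (Python; a perfectly reliable predictor, two outcome measures with 50% reliable variance each, r = .15, N = 50):

```python
# Two parallel outcome measures (Y1, Y2), each with 50% reliable variance,
# share the same latent outcome. Reporting whichever is significant increases
# the chance of a "successful" study.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
n, r_true, n_sim = 50, 0.15, 10_000
sig_both = sig_y1_only = sig_y2_only = 0

for _ in range(n_sim):
    x = rng.normal(size=n)                                    # perfect predictor
    y = r_true * x + np.sqrt(1 - r_true**2) * rng.normal(size=n)
    y1 = y + rng.normal(size=n)                               # reliability .50
    y2 = y + rng.normal(size=n)                               # reliability .50

    p1 = pearsonr(x, y1)[1] < .05
    p2 = pearsonr(x, y2)[1] < .05
    sig_both += p1 and p2
    sig_y1_only += p1 and not p2
    sig_y2_only += p2 and not p1

either = sig_both + sig_y1_only + sig_y2_only
print(f"Y1 only: {sig_y1_only}, Y2 only: {sig_y2_only}, both: {sig_both}")
print(f"At least one measure significant in {either / n_sim:.1%} of studies")
```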

1. Neither measure produced a significant result

4011 times Y1 had a higher estimate than Y2 (49%)
4109 times Y2 had a higher estimate than Y1 (51%).

2. Both measures produced a significant result.

212 times Y1 had a higher estimate than Y2 (57%)
162 times Y2 had a higher estimate than Y1 (43%).

3. Y1 is significant and Y2 is not significant

743 times Y1 had a higher estimate than Y2 (100%)
0 times Y2 had a higher estimate than Y1 (0%).

4. Y2 is significant and Y1 is not significant

0 times Y1 had a higher estimate than Y2 (0%)
763 times Y2 had a higher estimate than Y1 (100%).

The results show that using two measures with 50% reliability increases the chances of obtaining a significant result by about 750 out of 10,000 tries (7.5 percentage points). Thus, unreliable measures can contribute to the replication crisis if researchers use multiple unreliable measures and selectively publish results for the significant one. However, using a single unreliable measure instead of a single reliable measure is not beneficial, because an unreliable measure makes it less likely to obtain a significant result. Gelman's reversal is an artifact of conditioning on one outcome. This can be easily seen by comparing the results after conditioning on significance for Y1 or Y2.

5. (2 + 3) Y1 is significant, significance of Y2 does not matter

212+743 = 955 times Y1 had a higher estimate than Y2 (85%)
162 times Y2 had a higher estimate than Y1 (15%).

6. (2 + 4) Y2 is significant, significance of Y1 does not matter

212 times Y1 had a higher estimate than Y2 (19%)
162+763 = 925 times Y2 had a higher estimate than Y1 (81%).

When we condition on significance for Y1, Y1 more often produces the higher estimate. When we condition on Y2, Y2 more often produces the higher estimate. This has nothing to do with the reliability of the measures because they have the same reliability. The difference is illusory because selection for significance in the simulation produces biased results.

Another interesting observation

While working on this issue with Rickard, we also discovered an important distinction between standardized and unstandardized effect sizes. Loken and Gelman simulated standardized effect sizes because they correlated two variables. Random measurement error lowers standardized effect sizes because the unstandardized effect size is divided by the standard deviation, and random measurement error adds to the naturally occurring variance in a variable. However, unstandardized effect sizes like the covariance or the mean difference between two groups are not attenuated by random measurement error. For this reason, it would be wrong to claim that unreliability of a measure attenuates unstandardized effect sizes or that they should be corrected for unreliability of a measure.

Random measurement error will, however, increase the standard error and make it more difficult to get a significant result. As a result, selection for significance will inflate the unstandardized effect size more for an unreliable measure. The following simulation demonstrates this point. To keep things similar, I kept the effect size of b = .15, but used the unstandardized regression coefficient as the effect size.
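A small illustrative sketch of this regression version (Python; it assumes, consistent with the description above, that the predictor is error-free and that measurement error is added only to the outcome, here with 80% reliable variance):

```python
# Unstandardized regression slope with b = .15: error in the outcome does not
# attenuate the slope, but it widens the sampling distribution of the estimate.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(4)
n, b_true, n_sim = 50, 0.15, 10_000
slopes_clean, slopes_noisy = [], []

for _ in range(n_sim):
    x = rng.normal(size=n)
    y = b_true * x + np.sqrt(1 - b_true**2) * rng.normal(size=n)
    y_noisy = y + np.sqrt(0.25) * rng.normal(size=n)          # reliability .80

    slopes_clean.append(linregress(x, y).slope)
    slopes_noisy.append(linregress(x, y_noisy).slope)

print(f"no error:   mean b = {np.mean(slopes_clean):.3f}, SD = {np.std(slopes_clean):.3f}")
print(f"with error: mean b = {np.mean(slopes_noisy):.3f}, SD = {np.std(slopes_noisy):.3f}")
```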

First, I show the distribution of the effect size estimates. Both distributions are centered over the simulated effect size of b = .15. However, the measure with random error produces a wider distribution which often results in more extreme effect size estimates.

1. Neither measure produced a significant result

3170 times the perfect measure had a higher estimate than the unreliable measure (41%)
4587 times the unreliable measure had a higher estimate than the perfect measure (59%)

This scenario shows the surprising reversal that Loken and Gelman wanted to demonstrate: the less reliable measure shows the stronger absolute effect size estimate more often, more than 50% of the time. However, their simulation used standardized effect size estimates, which do not produce this reversal; only unstandardized effect size estimates show it.

2. Both measures produced a significant result

73 times the perfect measure had a higher estimate than the unreliable measure (10%)
659 times the unreliable measure had a higher estimate than the perfect measure (90%)

When both effect size estimates are significant, the one based on the unreliable measure is much more likely to be stronger. The reason is simple: the standard error is larger, so it takes a larger effect size estimate to produce the t-value needed for a significant result.

3. The reliable measure is significant and the unreliable measure is not significant

790 times the perfect measure had a higher estimate than the unreliable measure (73%)
299 times the unreliable measure had a higher estimate than the perfect measure (27%)

With standardized effect sizes, selection for significance favored the conditioning variable 100% of the time. With unstandardized coefficients, the unreliable measure is higher 27% of the time even though only the reliable measure reached significance. However, the conditioning effect is notable because conditioning on significance for the perfect measure reverses the usual pattern that the unreliable measure produces stronger effect size estimates.

4. The unreliable measure is significant and the reliable measure is not significant

0 times the perfect measure had a higher estimate than the unreliable measure (0%)
422 times the unreliable measure had a higher estimate than the perfect measure (100%)

Conditioning on significance for the unreliable measure produces a 100% rate of stronger effect sizes because the effect size estimates are already biased in favor of the unreliable measure.

The interesting observation is that Loken and Gelman were right that effect size estimates can be inflated with unreliable measures, but they failed to demonstrate this reversal because they used standardized effect size estimates. Inflation occurs with unstandardized effect sizes. Moreover, it does not require selection for significance. Even non-significant effect size estimates tend to be more extreme because there is more sampling error.

The Fallacy of Interpreting Point Estimates of Effect Sizes

Loken and Gelman's article is framed as a warning to practitioners to avoid misinterpretation of effect size estimates. They are concerned that researchers "assume that the observed effect size would have been even larger if not for the burden of measurement error," that "when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger," and that "our concern is that researchers are sometimes tempted to use the 'iron law' reasoning to defend or justify surprisingly large statistically significant effects from small studies."

They missed an opportunity to point out that there is a simple solution to avoid misinterpretation of effect size estimates, one that has been recommended by psychological methodologists since the 1990s (I highly recommend Cohen, 1994; also Cumming, 2013). The solution is to consider the uncertainty in effect size estimates by means of confidence intervals. Confidence intervals provide a simple solution to many fallacies of traditional null-hypothesis tests, p < .05. A confidence interval can be used to test not only the nil-hypothesis, but also hypotheses about specific effect sizes. A confidence interval may exclude zero, but it might include other values of theoretical interest, especially if sampling error is large. To claim an effect size larger than the true population effect size of b = .15, the confidence interval has to exclude the value b = .15. Otherwise, it is a fallacy to claim that the effect size in the population is larger than .15.

As demonstrated before, random measurement error inflates unstandardized effect size estimates, but it also increases sampling error, resulting in wider confidence intervals. Thus, it is an important question whether unreliable measures really allow researchers to claim effect sizes that are significantly larger than the simulated true effect size of b = .15.

A final simulation examined how often the 95%CI excluded the true value of b = .15 for the perfect measure and the unreliable measure. To produce more precise estimates, I ran 100,000 simulations.
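A minimal sketch of this final check (Python; same assumed regression setup, counting confidence intervals for the slope that fall entirely below or entirely above the true value b = .15):

```python
# Coverage check: how often does the 95% CI for the slope miss b = .15
# on either side, with and without measurement error in the outcome?
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, b_true, n_sim = 50, 0.15, 100_000         # reduce n_sim for a quicker run
t_crit = stats.t.ppf(0.975, df=n - 2)

def ci_misses(error_sd):
    under = over = 0
    for _ in range(n_sim):
        x = rng.normal(size=n)
        y = b_true * x + np.sqrt(1 - b_true**2) * rng.normal(size=n)
        y = y + error_sd * rng.normal(size=n)          # add measurement error
        fit = stats.linregress(x, y)
        lo = fit.slope - t_crit * fit.stderr
        hi = fit.slope + t_crit * fit.stderr
        under += hi < b_true                           # CI entirely below truth
        over += lo > b_true                            # CI entirely above truth
    return under, over

for label, sd in (("no error", 0.0), ("reliability .80", np.sqrt(0.25))):
    u, o = ci_misses(sd)
    print(f"{label:16s} underestimates: {u} ({u/n_sim:.1%})  overestimates: {o} ({o/n_sim:.1%})")
```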

1. Measure without error

2809 Significant underestimations (2.8%)
2732 Significant overestimations (2.7%)
5541 Errors

2. Measure with error

2761 Significant underestimations (2.8%)
2750 Significant overestimations (2.7%)
5511 Errors

The results should not come as a surprise. 95% confidence intervals are designed to have a 5% error rate and to split these errors evenly between the two sides. The addition of random measurement error does not affect this property of confidence intervals. Most important, there is no reversal in the probability of overestimation. The measure without error produces confidence intervals that overestimate the true effect size about as often as the measure with error. However, the effect of random measurement error is noticeable in the amount of bias.

For the measure without error, the lower bound of the 95%CI ranges from .15 to .55, M = .21.
For the measure with error, the lower bound of the 95%CI ranges from .15 to .65, M = .22.

These differences are small and have no practical consequences. Thus, the use of confidence intervals provides a simple solution to false interpretation of effect size estimates. Although selection for significance in small samples inflates the point estimate of effect sizes, the confidence interval often includes the smaller true effect size.

The Replication Crisis

Loken and Gelman’s article aimed to relate random measurement error to the replication crisis. They write “If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong. Measurement error and selection bias thus can combine to exacerbate the replication crisis.”

The previous results show that this statement ignores the key influence of random measurement error on statistical significance. Random measurement error increases the standard deviation, and the standard deviation is in the denominator of the t-statistic. Thus, t-values are biased downwards, and it is harder to get statistical significance with unreliable measures. The key point is that studies in small samples with unreliable measures have low statistical power. It is therefore misleading to claim that random measurement error inflates t-values; it actually attenuates them. Selection for significance inflates point estimates of effect sizes, but these are meaningless, and the confidence interval around the estimate often includes the true population parameter.
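A quick illustrative check of this point (Python; hypothetical parameters, with a 50% reliable outcome measure) shows that t-values and power go down, not up, when random measurement error is added:

```python
# Adding error to the outcome lowers the average t-value and the power
# of the test, because the standard error of the slope increases.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, b_true, n_sim = 50, 0.15, 10_000
t_clean, t_noisy = [], []

for _ in range(n_sim):
    x = rng.normal(size=n)
    y = b_true * x + np.sqrt(1 - b_true**2) * rng.normal(size=n)
    y_noisy = y + rng.normal(size=n)                          # reliability .50

    fit = stats.linregress(x, y)
    fit_noisy = stats.linregress(x, y_noisy)
    t_clean.append(fit.slope / fit.stderr)
    t_noisy.append(fit_noisy.slope / fit_noisy.stderr)

crit = stats.t.ppf(0.975, df=n - 2)
print(f"no error:  mean t = {np.mean(t_clean):.2f}, power = {np.mean(np.abs(t_clean) > crit):.1%}")
print(f"50% rel.:  mean t = {np.mean(t_noisy):.2f}, power = {np.mean(np.abs(t_noisy) > crit):.1%}")
```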

More important, it is not clear what Loken and Gelman mean by the replication crisis. Let’s assume a researcher conducts a study with N = 50, a measure with 50% reliability, and an effect size of r = .15. Luck, the winner’s curse, gives them a statistically significant result with an effect size estimate of r = .6 and a 95% confidence interval ranging from .42 to .86. They get a major publication out of this finding. Another researcher conducts a replication study and gets a non-significant result with r = .11 and a 95%CI ranging from -.11 to .33. This outcome is often called a replication failure because significant results are considered successes and non-significant results are considered failures. However, findings like this do not signal a crisis. Replication failures are normal and to be expected because significance testing allows for error and replication failures.

The replication crisis in psychology is caused by the selective omission of replication failures from the literature or even from a set of studies within a single article (Schimmack, 2012). The problem is not that a single significant result is followed by a non-significant result. The problem is that non-significant results are not published. The success rate in psychology journals is over 90% (Sterling, 1959; Sterling et al., 1995). Thus, the replication crisis refers to the fact that psychologists never published failed replication studies. When publication of replication failures became more acceptable in the past decade, we simply saw how much selection bias had inflated the success rate. Given the typical power of studies in psychology, replication failures are to be expected. This has nothing to do with random measurement error. The main contribution of random measurement error is to reduce power and increase the percentage of studies with non-significant results.

Forgetting about False Negatives

Over the past decade, a few influential articles have created a fear of false positive results (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). The real problem, however, is that selection for significance makes it impossible to know whether an effect exists or not. Whereas real effects in reasonably powered studies would often produce significant results, false positives would be followed by many replication failures. Without credible replication studies that are published independent of the outcome, statistical significance has no meaning. This led to concerns that over 50% of published results could be false positives. However, empirical studies of the false positive risk often find much lower plausible values (Bartos & Schimmack, 2023). Arguably, the bigger problem in studies with small samples and unreliable measures is that these studies will often produce a false negative result. Take Loken and Gelman's simulation as an example. The key outcome of studies that look for a small effect size of r = .15 or r = .12 with a noisy measure is a non-significant result. This is a false negative result because we know a priori that there is a non-zero correlation between the two variables with a theoretically important effect size. For example, the correlation between income and a noisy measure of happiness is around r = .15. Looking for this small relationship in a small sample will often suggest that money does not buy happiness, while large samples consistently show this small relationship. One might even argue that the few studies that produce a significant result with an inflated point estimate but a confidence interval that includes r = .15 avoid the false negative result without providing inflated estimates of the effect size, given the wide range of plausible values. Only the 2.5% of studies that produce confidence intervals that do not include r = .15 are misleading, but a single replication study is likely to correct this inflated estimate.

This line of reasoning does not justify selective publishing of significant results. Rather, it draws attention back to the concerns of methodologists in the 1990s that low power is wasteful because many studies produce inconclusive results. To address this problem, researchers need to think carefully about the plausible range of effect sizes and plan studies that can produce significant results for real effects. Researchers also need to be able and willing to publish results when the results are not significant. No statistical method can produce valid results when the data are biased. In comparison, the problem of inflated point estimates of effect sizes in a single small sample is trivial. Confidence intervals make it clear that the true effect size can be much smaller, and rare cases of extreme inflation will be corrected quickly by failed replication studies.

In short, as much as Gelman likes to think that there is something fundamentally wrong with the statistical methods that psychologists use, the real problems are practical. Resource constraints often limit researchers' ability to collect large samples, and the preference for novel significant results over replication failures of old findings gives researchers an incentive to selectively report their "successes." To do so, they may even use multiple unreliable measures in order to capitalize on chance. The best way to address these problems is to establish a clear code of research practices and to hold researchers accountable if they violate this code. Editors should also enforce the already existing guidelines to report meaningful effect sizes with confidence intervals. In this utopian world, researchers would benefit from using reliable measures because they increase power and the probability of publishing a true positive result.

Abandon Gelman

I pointed out the mistake in Loken and Gelman’s article on Gelman’s blog post. He is unable to see that his claim of a reversal in effect size estimates due to random measurement error is a mistake. Instead he tries to explain my vehement insistence as a personality flaw.

His overconfidence makes it impossible for him to consider the possibility that he made a mistake. This arrogant response to criticism is by no means unique. I have seen it many times from Greenwald, Bargh, Baumeister, and others. However, it is ironic when meta-scientists like Ioannidis, Gelman, or Simonsohn, who are known for harsh criticism of others, are unable to admit when they made a mistake. A notable exception was Kahneman's response to my criticism of his book "Thinking, Fast and Slow."

Gelman has criticized psychologists without offering any advice on how they could improve their credibility. His main advice is to "abandon statistical significance" without any guidelines for how we should distinguish real findings from false positives or avoid interpreting inflated effect size estimates. Here I showed how the use of confidence intervals provides a simple solution to avoid many of the problems that Gelman likes to point out. To learn about statistics, I suggest reading less Gelman and more Cohen.

Cohen's work shaped my understanding of methodology and statistics, and he actually cared about psychology and tried to improve it. Without him, I might not have learned about statistical power or contemplated the silly practice of refuting nil-hypotheses. I also think his work was influential in changing the way results are reported in psychology journals, which enabled me to detect biases and estimate false positive rates in our field. He also tried to tell psychologists about the importance of replication studies.

"For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication" (Cohen, 1994).

If psychologists had listened to Cohen, they could have avoided the replication crisis in the 2010s. However, his work can still help psychologists to learn from the replication crisis and to build a credible science that rests on true positive results and avoids false negative results. The lessons are simple.
1. Plan studies with a reasonable chance to get a significant result. Try to maximize power by thinking about all possible ways to reduce sampling error, including using more reliable measures.
2. Publish studies independent of outcome, especially replication failures that can correct false positives.
3. Focus on effect sizes, but ignore the point estimates. Instead, use confidence intervals to avoid interpreting effect size estimates that are inflated by selection for significance.

Gino-Colada – 2: The line between fraud and other QRPs

“It wasn’t fraud. It was other QRPs”

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.” (Gino, 2023)

Experimental social scientists have considered themselves superior to other social scientists because experiments provide strong evidence about causality that correlational studies cannot provide. Their experimental studies often produced surprising results, but because they were obtained using the experimental method and published in respected, peer-reviewed journals, they seemed to provide profound novel insights into human behavior.

In his popular book "Thinking, Fast and Slow," Nobel Laureate Daniel Kahneman told readers "disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true." He probably regrets writing these words, because he no longer believes these findings (Kahneman, 2017).

What happened between 2011 and 2017? Social scientists started to distrust their own findings (or at least those of their colleagues) because it became clear that they did not use the scientific method properly. The key problem is that they only published results when they provided evidence for their theories, hypotheses, and predictions, but did not report when their studies did not work. As one prominent experimental social psychologist put it:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister)

Researchers not only selectively published studies with favorable results. They also used a variety of statistical tricks to increase the chances of obtaining evidence for their claims. John et al. (2012) called these tricks questionable research practices (QRPs) and compared them to doping in sport. The difference is that doping is banned in sports, but the use of many QRPs is not banned or punished by social scientific organizations.

The use of QRPs explains why scientific journals that report the results of experiments with human participants report over 90% of the time that the results confirmed researchers' predictions. For statistical reasons, this high success rate is implausible even if all predictions were true (Sterling et al., 1995). The selective publishing of studies that worked renders the evidence meaningless (Sterling, 1959). Even clearly false hypotheses like "learning after an exam can increase exam performance" can receive empirical support when QRPs are being used (Bem, 2011). The use of QRPs also explains why results of experimental social scientists often fail to replicate (Schimmack, 2020).

John et al. (2012) used the term questionable research practices broadly. However, it is necessary to distinguish three types of QRPs that have different implications for the credibility of results.

One QRP is selective publishing of significant results. In this case, the results are what they are and the data are credible. The problem is mainly that these results are likely to be inflated by selection bias. This bias would disappear if all studies were published and the results were averaged. However, if non-significant results are not published, the average remains inflated.

The second type of QRPs comprises various statistical tricks that can be used to "massage" the data to produce a more favorable result. These practices are now often called p-hacking. Presumably, these practices are used mainly after an initial analysis did not produce the desired result but showed a trend in the expected direction. P-hacking alters the data, and it is no longer clear how strong the actual evidence was. While lay people may consider these practices fraud or a type of doping, professional organizations tolerate them, and even evidence of their use would not lead to disciplinary actions against a researcher.

The third QRP is fraud. Like p-hacking, fraud implies data manipulation with the goal of getting a desirable result, but the difference is …. well, it is hard to say what the difference to p-hacking is, except that fraud is not tolerated by professional organizations. Outright fabrication, in which a whole data set is made up (as with some datasets by the disgraced Diederik Stapel), is a clear case of fraud. However, it is harder to distinguish between fraud and p-hacking when a researcher selectively deletes outliers from two groups to get significance (p-hacking) or switches extreme cases from one group to another (fraud) (GinoColada1). In both cases, the data are meaningless, but only fraud leads to reputation damage and public outrage, while p-hackers can continue to present their claims as scientific truths.

The distinction between different types of QRPs is important to understand Gino's latest defense against accusations that she committed fraud, which have been widely publicized in newspaper articles and a long article in the New Yorker. In her response, she cites from Harvard's investigative report to make the point that she is not a data fabricator.

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.”

The argument is clear. Why would I have so many failed studies if I could just make up fake data that support my claims? Indeed, Stapel claims that he started faking studies outright because it was clear that p-hacking is a lot of work and making up data is the most efficient QRP ("Why not just make the data up. Same results with less effort"). Gino makes it clear that she did not just fabricate data, because she clearly collected a lot of data and has many failed studies that were not p-hacked or manipulated to get significance. She only did what everybody else did: hiding the studies that did not work, and lots of them.

Whether she sometimes engaged in practices that cross the line from p-hacking to fraud is currently being investigated and not my concern. What I find interesting is the frank admission in her defense that 80% of her studies failed to provide evidence for her hypotheses. However, if somebody looked up her published work, they would see mainly the results of studies that worked. And she has no problem telling us that these published results are just the tip of an iceberg of studies, many more of which did not work. She thinks this is totally ok, because she has been trained / brainwashed to believe that this is how science works. Significance testing is like a gold pan.

Get a lot of datasets, look for p < .05, keep the significant ones (gold) and throw away the rest. The more studies you run, the more gold you find, and the richer you are. Unfortunately for her and the other experimental social scientists who think every p-value below .05 is a discovery, this is not how science works, as Sterling (1959) pointed out many, many years ago, but nobody wants to listen to people who tell you that something is hard work.

Let’s for the moment assume that Gino really runs 100 studies to get 20 significant results (80% do not work, p < .10). Using a formula from Soric (1989), we can compute the risk that one of her 20 significant results is a false positive result (i.e., the significant result is a fluke without a real effect), even if she did not use p-hacking or other QRPs, which would further increase the risk of false claims.

FDR = ((1/.20) – 1)*(.05/.95) = 21%
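In code, Soric's bound is a one-liner (a sketch; the 20% discovery rate is taken from the quote above, and alpha = .05 is assumed):

```python
# Soric's (1989) upper bound on the false discovery rate, given the
# observed rate of significant results and the alpha level.
def soric_fdr(discovery_rate, alpha=0.05):
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

print(f"FDR <= {soric_fdr(0.20):.0%}")   # -> 21%
```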

Based on Gino's own claim that 80% of her studies fail to produce significant results, we can infer that up to 21% of her published significant results could be false positive results. Moreover, selective publishing also inflates effect sizes, and even if a result is not a false positive, the effect size may be in the same direction but too small to be practically important. In other words, Gino's empirical findings are meaningless without independent replications, even if she didn't use p-hacking or manipulate any data. The question whether she committed fraud is only relevant for her personal future. It has no relevance for the credibility of her published findings or those of others in her field like Dan Air-Heady. The whole field is a train wreck. In 2012, Kahneman asked researchers in the field to clean up their act, but nobody listened, and Kahneman has lost faith in their findings. Maybe it is time to stop nudging social scientists with badges and use some operant conditioning to shape their behavior. But until this happens, if it ever happens, we can just ignore this pseudo-science, no matter what happens in the Gino versus Harvard/DataColada case. As interesting as scandals are, they have no practical importance for the evaluation of the work that has been produced by experimental social scientists.

P.S. Of course, there are also researchers who have made real contributions, but unless we find ways to distinguish between credible work that was obtained without QRPs and incredible findings that were obtained with scientific doping, we don’t know which results we can trust. Maybe we need a doping test for scientists to find out.