Category Archives: Fisher

Gelman’s Type-S Error: A Misunderstanding of Hypothesis Testing

Andrew Gelman is well known for strong opinions about psychological science, including its methods and research culture (Fiske, 2017). For the most part, he writes as if psychologists are still following a statistical ritual that cannot produce meaningful results. This criticism is not new. It was already made by influential psychologists and methodologists, including Cohen (1990, 1994) and Gigerenzer (2004). The problem with Gelman’s critique is that it is outdated and largely ignores the discussion of null-hypothesis significance testing that took place in psychology during the 1990s. As evidence for this claim, one can simply inspect the reference list of Gelman and Carlin (2014). This article, published in Perspectives on Psychological Science, does not cite Cohen (1990, 1994), Gigerenzer (2004), or Tukey’s directional reformulation of significance testing (Tukey, 1991; Jones & Tukey, 2000). Although an outsider perspective can be useful for challenging untested assumptions, a commentary that ignores key insights produced by eminent statisticians and methodologists within psychology is unlikely to do so.

The Null-Hypothesis Significance Testing Strawman

As Gigerenzer (2004) pointed out, statistics is often taught as a ritual to be followed rather than as a principled approach to drawing conclusions from data. Rituals are not necessarily bad, but in science it is usually better to understand the rationale and assumptions underlying routine practices.

Null-hypothesis significance testing (NHST) has been described and criticized for decades (Tukey, 1991; Cohen, 1994). Most students of psychology will recognize the following brief description of it. First, researchers collect data that relate one variable to another. Ideally, this is an experiment in which one variable is experimentally manipulated (the independent variable) and the other is observed (the dependent variable). In experiments, a relationship between the independent and dependent variable may justify causal claims, but NHST itself is indifferent to causality. It can be applied to both experimental and correlational data. The main information produced by statistical analyses is the p-value. P-values below a conventional threshold are called statistically significant; those above the threshold are treated as not significant (ns). Significant results are easier to publish. As a result, data analysis often becomes a series of statistical tests searching for statistically significant results (Bem, 2010).

This approach to data analysis has been criticized for several reasons. First, statistical significance by itself does not provide information about effect size. For this reason, psychologists have increasingly reported effect-size estimates in addition to tests of statistical significance, in large part due to Cohen’s (1990) emphasis on effect sizes. Second, NHST has been criticized for its focus on statistically significant findings. Psychology journals have long reported rates of over 90% statistically significant results (Sterling, 1959; Sterling et al., 1995). Publication bias in favor of significant results then leads to inflated effect-size estimates (Rosenthal, 1979).

Most importantly, NHST has been criticized because it appears to reject a null hypothesis that is known to be false before any data are collected. Cohen (1994) called this the nil hypothesis. The nil hypothesis assumes that the population effect size is exactly zero. Statistical significance is then taken to imply that this hypothesis is unlikely to be true and can be rejected. The problem is that rejecting one specific possible effect size tells us very little about the data. It would be equally uninformative to test the hypothesis that the effect size equals any other single value, such as Cohen’s d = .20. So what if the effect size can be said not to be 0 or .20? It could still be 0.01 or 1.99. In short, hypothesis testing with a single point as the null hypothesis is meaningless. Yet that is exactly what psychological articles seem to be reporting when they state p < .05.

What Psychological Scientists Are Implicitly Doing

In reality, however, psychological scientists are doing something different. It may look as if they are testing the nil hypothesis, but in practice they are often testing two directional hypotheses at the same time (Kaiser, 1960; Lakens et al., 2025; Tukey, 1991; Jones & Tukey, 2000). When the nil hypothesis is rejected, researchers do not merely conclude that there is a difference. They also inspect the sign of the effect-size estimate and infer that the experimental manipulation increased or decreased behavior.

Some authors have argued that drawing directional conclusions from a two-sided test is conceptually problematic (e.g., Rubin, 2020). However, Jones and Tukey explain the rationale for doing so. The easiest way to see this is to reinterpret the standard nil-hypothesis test as two directional tests with two complementary null hypotheses. One null hypothesis states that the effect size is zero or negative. The other states that the effect size is zero or positive. Rejecting the first leads to the inference that the effect is probably positive. Rejecting the second leads to the inference that the effect is probably negative. Viewed this way, zero is simply the boundary between two rejection regions.

Because NHST can be understood in this way as involving two directional possibilities, alpha must be allocated across both tails to maintain the long-run error rate. No psychology student would be surprised to see a t distribution with 2.5% of the area in each tail. Each tail represents the error rate for one directional rejection, and together they produce the familiar two-sided alpha level of 5%.
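This equivalence is easy to check numerically. Here is a minimal sketch (in Python with scipy; the simulated data are purely illustrative) showing that a two-sided t-test rejects at alpha exactly when one of the two directional tests rejects at alpha/2:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=40)  # illustrative one-sample data

alpha = 0.05
p_two_sided = stats.ttest_1samp(x, 0.0).pvalue
p_greater = stats.ttest_1samp(x, 0.0, alternative='greater').pvalue
p_less = stats.ttest_1samp(x, 0.0, alternative='less').pvalue

# The two-sided p-value is twice the smaller directional p-value...
assert abs(p_two_sided - 2 * min(p_greater, p_less)) < 1e-12
# ...so rejecting at alpha (two-sided) is the same decision as rejecting
# one of the two directional null hypotheses at alpha/2.
assert (p_two_sided < alpha) == (p_greater < alpha / 2 or p_less < alpha / 2)
```

In other words, the familiar two-sided test already allocates 2.5% to each directional error.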

Most psychology students are not taught that they are implicitly conducting directional tests when they interpret significant p values, but their actual practice shows that this is what they are doing. They routinely draw directional inferences from NHST, and this is a legitimate use of the procedure. It also makes NHST more meaningful than the strawman version in which researchers merely reject an exact value of zero that is often known in advance to be false.

Using NHST to infer the direction of population effects is meaningful because researchers often do not know that direction before data are collected. Empirical data can therefore provide genuinely new information. This is not a full defense of NHST, because effect size and practical importance can still be ignored, but it does show that psychologists have not spent decades and millions of dollars merely to establish that effect sizes are not exactly zero.

Gelman’s Type-S Error

Gelman and Tuerlinckx (2000) criticized NHST because “the significance of comparisons … is calibrated using the Type 1 error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications.” To replace this framework, they proposed focusing on Type S error, where S stands for sign. A Type S error occurs when a researcher makes a confident directional claim even though the true effect has the opposite sign.

The label Type S error is potentially confusing because it suggests a replacement for the Type I error framework rather than a refinement of it. The Type I error rate is the unconditional long-run probability of falsely rejecting a null hypothesis across all tests that are conducted. For example, suppose a researcher conducts 100 tests with a significance criterion (alpha) of 5%. This criterion ensures that in the long run no more than 5% of all tests will be false positives. Testing at least some real effects will reduce the probability of a false positive. For example, if all studies have high power to detect a true effect, the probability of a false positive is zero (Sorić, 1989). Thus, alpha does not fix the relative frequency of false positives; it only bounds it between 0 and alpha.

This unconditional probability must be distinguished from the conditional probability of error among the subset of studies that produced statistically significant results. In the previous example, if only 5 results were significant, it is likely that all 5 rejections were errors and that the conditional probability of a false positive given a significant result is 5 / 5 = 100% (Sorić, 1989). The proportion of false rejections among statistically significant results is called the false discovery rate (FDR), and the estimation and control of FDRs has become a large literature in statistics (Benjamini & Hochberg, 1995).
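The difference between the unconditional error rate and the false discovery rate is easy to demonstrate with a small simulation. In this sketch (illustrative parameters: 1,000 tests, 10% of them true effects of d = .8), the unconditional false-positive rate stays below alpha while the FDR among significant results is several times higher:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n = 1000, 20
d = np.zeros(n_tests)
d[:100] = 0.8  # 10% of hypotheses have a real effect (Cohen's d = .8)

x = rng.normal(d[:, None], 1.0, size=(n_tests, n))
p = stats.ttest_1samp(x, 0.0, axis=1).pvalue

sig = p < .05
false_pos = sig & (d == 0)

# Unconditional rate: false positives among ALL tests (bounded by alpha)
unconditional = false_pos.sum() / n_tests
# Conditional rate (FDR): false positives among SIGNIFICANT tests only
fdr = false_pos.sum() / sig.sum()
print(unconditional, fdr)
```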

Applying Jones and Tukey’s interpretation of NHST to false discovery rates, a false discovery occurs not only when the true effect size is zero but also when it is in the opposite direction of the significant result. Gelman’s Type S error rate, also called the false sign rate (Stephens, 2017), assumes that effect sizes are never zero and counts only false rejections with the opposite sign. False sign rates are necessarily smaller than false discovery rates because wrong-sign rejections are only a subset of all false rejections. Exact-zero effects can produce significant results in either direction, whereas nonzero effects make correct-sign rejections more likely and wrong-sign rejections less likely.
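A short simulation illustrates why wrong-sign rejections are only a subset of all false rejections (a sketch; half the effects are exactly zero and half are small positive effects, both illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_tests, n = 2000, 20
d = np.where(np.arange(n_tests) < 1000, 0.0, 0.2)  # true effects: 0 or .2

x = rng.normal(d[:, None], 1.0, size=(n_tests, n))
res = stats.ttest_1samp(x, 0.0, axis=1)
sig = res.pvalue < .05
est_sign = np.sign(x.mean(axis=1))

# Jones-Tukey false discovery: significant, but the true effect is zero
# or the estimated sign is wrong
false_disc = sig & ((d == 0) | ((d != 0) & (est_sign != np.sign(d))))
# Type S error (false sign): significant with the wrong sign only
sign_err = sig & (d != 0) & (est_sign != np.sign(d))

print(sign_err.sum(), false_disc.sum())  # sign errors are the smaller count
```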

The key source of confusion is that Gelman’s criticism of NHST and FDR estimation rests on a misunderstanding of NHST (Gelman, 2021). He maintains that FDR estimates are limited to the unlikely scenario that an effect is exactly zero and ignores sign errors. However, as Jones and Tukey (2000) pointed out, psychological researchers routinely use NHST as a directional sign test. Once NHST is understood in this way, Type S errors are no longer a fundamentally new kind of inferential problem and are already included in conditional and unconditional error rates. Moreover, NHST provides researchers with concrete statistical tools to estimate and control error rates, whereas Gelman’s Type S error is not something that can be estimated and was introduced as a rhetorical tool without practical use (Gelman, 2025; Lakens et al., 2025). In contrast, estimation of false discovery rates and false sign rates is an active area of research in statistics that builds on the foundations of NHST (Benjamini & Hochberg, 1995; Stephens, 2017) and has been largely ignored in psychology.

Statistical Power

So far, the distinction between Type I and (unconditional) Type S errors is mostly harmless. It may even help clarify that NHST is really used as a test of the sign of the population effect size rather than as a literal test of the nil hypothesis (Jones & Tukey, 2000). However, the wheels come off when Gelman and Carlin (2014) extend this critique from Type I error to Type II error and statistical power.

The distinction between Type I and Type II errors was introduced by Neyman and Pearson. The Type II error rate is the probability of failing to reject a false null hypothesis. Neyman and Pearson were cautious and avoided framing results as inferences about a true effect or as acceptance of a true hypothesis. In practice, however, failure to reject a false hypothesis means that either the population effect is positive and the study failed to produce a statistically significant result with a positive sign, or the population effect is negative and the study failed to produce a statistically significant result with a negative sign.

Statistical power is simply the complementary probability of obtaining a statistically significant result with the correct sign. Unlike the discussion of Type I errors, there is no important distinction here between a point null and an opposite-sign error. Power calculations are inherently directional. Researchers assume either a positive or a negative effect and then choose a design and sample size that reduce sampling error while controlling the Type I error rate. For example, a comparison of two groups with n = 50 per group, a population effect size of half a standard deviation (Cohen’s d = .50), and alpha = .05 has about a 70% probability of producing a statistically significant result with the correct sign.
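This number is easy to verify with a quick Monte Carlo sketch (20,000 simulated two-group studies with the parameters from the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n, d = 20000, 50, 0.5

control = rng.normal(0.0, 1.0, size=(n_sims, n))
treatment = rng.normal(d, 1.0, size=(n_sims, n))
res = stats.ttest_ind(treatment, control, axis=1)

# Power: significant AND in the predicted (positive) direction
power = np.mean((res.pvalue < .05) & (res.statistic > 0))
print(round(power, 2))  # close to .70
```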

By definition, then, power already concerns rejections with the correct sign. At this point, there is no meaningful difference between standard NHST and Gelman’s Type S framework (Stephens, 2017). The only minor difference arises in hypothetical scenarios with extremely low power. For two-sided (non-directional) power calculations, low power can produce significant results with sign errors. To use NHST as a sign test in Jones and Tukey’s framework of two simultaneous one-sided tests, power should be estimated for one-sided directional tests with alpha/2. However, in practice, this distinction is irrelevant because Gelman and Carlin already showed that even modest power of 50% renders sign errors practically impossible.

Thus, the main concern about Gelman and Carlin’s (2014) article is the false implication that power calculations ignore sign errors and that researchers must move “beyond power” to control them. Grounding NHST in Jones and Tukey’s (2000) framework of two simultaneous directional tests shows that power calculations are not flawed. High power prevents both false negatives and sign errors. Gelman’s critique rests on a false premise: the assumption that NHST is nil-hypothesis testing. Under that assumption, power appears disconnected from sign errors. But once NHST is understood as directional inference, the criticism is invalid. Power analysis is not only useful but essential for controlling sign errors and the false sign rate.

Implications


Gelman positions the Type S error as a new concept that requires moving “beyond power” because “power analysis is flawed” (p. 641). On closer inspection, power analysis is necessary and sufficient to control Type S error rates. Studies with high power ensure that most significant results have the correct sign, and high power also ensures a high discovery rate, which limits the proportion of false discoveries (Sorić, 1989). Power delivers everything needed to make significant results credible. It is paradoxical to criticize psychology for relying on small samples while also criticizing the tool that tells researchers how to avoid them. Cohen’s lasting contribution was precisely this: demonstrating that many studies lack power to detect plausible but small effect sizes and providing the tools to do better (Cohen, 1962).

Gelman and Carlin’s (2014) framing of power as flawed may have added to misunderstandings about the role of power in ensuring credible results. NHST and power analysis are not flawed. They are statistical tools for drawing conclusions about the direction of population effect sizes (Maxwell, Kelley, & Rausch, 2008). It would be desirable to conduct all studies with enough precision to provide informative effect size estimates, but limited resources often make this impossible. Meta-analysis of smaller studies can yield precise estimates, provided results are reported without selection bias. Reporting outcomes regardless of statistical significance is the most effective way to address selection bias, which remains the biggest threat to the credibility of NHST in practice (Sterling, 1959).

The real problem of NHST is not solved by a focus on Type S errors. The real problem is that non-significant results are inconclusive because failure to provide evidence for a positive or negative effect does not allow inferring the absence of an effect (Altman & Bland, 1995). The solution is to distinguish three hypotheses (Rice & Krakauer, 2023): (a) the effect is positive and larger than a smallest effect size of interest, (b) the effect is negative and larger in magnitude than a smallest effect size of interest, and (c) the effect falls within a region of practical equivalence around zero. Evidence for absence is established if the confidence interval falls entirely within the middle region. Replacing the point nil hypothesis with a range of practically equivalent values is an important addition to statistics for psychologists (Lakens, 2017; Lakens, Scheel, & Isager, 2018). It helps distinguish between statistical and practical significance, and it can turn non-significant results into significant evidence for the absence of a meaningful effect. However, providing evidence for absence often requires large samples because precise confidence intervals are needed to fit within a narrow region around zero. Power analysis remains essential for planning studies with this goal.
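A minimal sketch of such an equivalence check, using a hypothetical helper `within_rope` (the function name and the margin values are my own illustrative choices, not an established API):

```python
import numpy as np
from scipy import stats

def within_rope(x, y, margin):
    """Hypothetical helper: True if the 90% CI for the mean difference
    lies entirely inside (-margin, +margin). This corresponds to two
    one-sided tests (TOST) at alpha = .05."""
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    df = len(x) + len(y) - 2  # simple pooled-df approximation
    half_width = stats.t.ppf(0.95, df) * se
    return bool(diff - half_width > -margin and diff + half_width < margin)

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 400)  # two groups with no true difference
y = rng.normal(0.0, 1.0, 400)
print(within_rope(x, y, margin=0.5))   # wide region: equivalence shown
print(within_rope(x, y, margin=0.01))  # very narrow region: inconclusive
```

Note how the narrower region demands far more precision, which is why evidence for absence often requires large samples.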

Conclusion

Continued controversy about NHST shows that better education about its underlying logic is needed. Jones and Tukey (2000) provided a clear explanation that deserves to be foundational for the teaching of NHST. Understanding NHST as two simultaneous directional tests avoids the confusion created by decades of criticism directed at a strawman version of the procedure. NHST has persisted for nearly a century despite harsh criticism because it provides a minimal but useful inference: determining the likely sign of a population effect size. Students need to learn about the real limitations of NHST and how they can be addressed. Changing statistical methods does not solve the problem that researchers need to publish and that precise effect size estimates are often out of reach. Even power to infer the sign of an effect is often low. Honest reporting of a single well-powered study is more important than reporting multiple underpowered studies that are p-hacked or selected for significance (Schimmack, 2012). With good data, different statistical approaches lead to the same conclusion. Open science reforms that improve the quality of data are more important than new statistical methods. The main reason NHST continues to attract criticism is that criticism is easy, but finding a better solution is harder. Real progress requires a real analysis of the problem. NHST has many problems, but ignoring sign errors is not one of them.

References

Fiske, S. T. (2017). Going in many right directions, all at once. Perspectives on Psychological Science, 12, 652–655. https://doi.org/10.1177/1745691617706506

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.

Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5(4), 411–414.

Before we can balance false positives and false negatives, we have to publish false negatives.

Ten years ago, a stunning article by Bem (2011) triggered a crisis of confidence about psychology as a science. The article presented nine studies that seemed to show time-reversed causal effects of subliminal stimuli on human behavior. Hardly anybody believed the findings, but everybody wondered how Bem was able to produce significant results for effects that do not exist. This triggered a debate about research practices in social psychology.

Over the past decade, most articles on the replication crisis in social psychology pointed out problems with existing practices, but some articles tried to defend the status quo (cf. Schimmack, 2020).

Finkel, Eastwick, and Reis (2015) contributed to the debate with a plea to balance false positives and false negatives.

Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science

I argue that the main argument in this article is deceptive, but before I do so it is important to elaborate a bit on the use of the word deceptive. Psychologists make a distinction between self-deception and other-deception. Other-deception is easy to explain. For example, a politician may spread a lie for self-gain knowing full well that it is a lie. The meaning of self-deception is also relatively clear. Here individuals are spreading false information because they are unaware that the information is false. The main problem for psychologists is to distinguish between self-deception and other-deception. For example, it is unclear whether Donald Trump’s and his followers’ defence mechanisms are so strong that they really believe the election was stolen without any evidence to support this belief or whether they are merely using a lie for political gains. Similarly, it is also unclear whether Finkel et al. were deceiving themselves when they characterized the research practices of relationship researchers as an error-balanced approach, but the distinction between self-deception and other-deception is irrelevant. Self-deception also leads to the spreading of misinformation that needs to be corrected.

In short, my main thesis is that Finkel et al. misrepresent research practices in psychology and that they draw false conclusions about the status quo and the need for change based on a false premise.

Common Research Practices in Psychology

Psychological research practices follow a number of simple steps.

1. Researchers formulate a hypothesis that two variables are related (e.g., height is related to weight; dieting leads to weight loss).

2. They find ways to measure or manipulate a potential causal factor (height, dieting) and find a way to measure the effect (weight).

3. They recruit a sample of participants (e.g., N = 40).

4. They compute a statistic that reflects the strength of the relationship between the two variables (e.g., height and weight correlate r = .5).

5. They determine the amount of sampling error given their sample size.

6. They compute a test-statistic (t-value, F-value, z-score) that reflects the ratio of the effect size over sampling error (e.g., r(40) = .5, t(38) = 3.56).

7. They use the test-statistic to decide whether the relationship in the sample (e.g., r = .5) is strong enough to reject the nil-hypothesis that the relationship in the population is zero (p = .001).
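The arithmetic in steps 4–7 can be checked directly. A sketch using the standard conversion of a correlation into a t statistic, t = r·√(df) / √(1 − r²):

```python
import math
from scipy import stats

r, n = 0.5, 40
df = n - 2
# Test statistic for a correlation coefficient
t = r * math.sqrt(df) / math.sqrt(1 - r**2)
# Two-sided p-value under the nil hypothesis (rho = 0)
p = 2 * stats.t.sf(t, df)
print(round(t, 2), round(p, 3))  # 3.56 0.001, matching steps 6 and 7
```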

The important question is what researchers do after they compute a p-value. Here critics of the status quo (the evidential value movement) and Finkel et al. make divergent assumptions.

The Evidential Value Movement

The main assumption of the EVM is that psychologists, including relationship researchers, have interpreted p-values incorrectly. For the most part, the use of p-values in psychology follows Fisher’s original suggestion to use a fixed criterion value of .05 to decide whether a result is statistically significant. In our example of a correlation of r = .5 with N = 40 participants, a p-value of .001 is below .05 and therefore it is sufficiently unlikely that the correlation could have emerged by chance if the real correlation between height and weight was zero. We therefore can reject the nil-hypothesis and infer that there is indeed a positive correlation.

However, if a correlation is not significant (e.g., r = .2, p > .05), the results are inconclusive because we cannot infer from a non-significant result that the nil-hypothesis is true. This creates an asymmetry in the value of significant results. Significant results can be used to claim a discovery (a diet produces weight loss), but non-significant results cannot be used to claim that there is no relationship (a diet has no effect on weight).

This asymmetry explains why most published results in psychology are statistically significant (Sterling, 1959; Sterling et al., 1995). As significant results are more conclusive, journals found it more interesting to publish studies with significant results.


As Sterling (1959) pointed out, if only significant results are published, statistical significance no longer provides valuable information, and as Rosenthal (1979) warned, in theory journals could be filled with significant results even if most results are false positives (i.e., the nil-hypothesis is actually true).

Importantly, Fisher did not prescribe conducting studies only once and publishing only significant results. Fisher clearly stated that results should only be considered credible if replication studies confirm the original results most of the time (say, 8 out of 10 replication studies also produce p < .05). However, this important criterion of credibility was ignored by social psychologists, especially in resource-intensive areas like relationship research.

To conclude, the main concern among critics of research practices in psychology is that selective publishing of significant results produces results that have a high risk of being false positives (cf. Schimmack, 2020).

The Error Balanced Approach

Although Finkel et al. (2015) do not mention Neyman and Pearson, their error-balanced approach is rooted in Neyman and Pearson’s approach to the interpretation of p-values. This approach differs considerably from Fisher’s approach, and it is well documented that Fisher and Neyman-Pearson were in a bitter fight over this issue. Neyman and Pearson introduced the distinction between Type I errors (also called false positives) and Type II errors (also called false negatives).


The type-I error is the same error that one could make in Fisher’s approach, namely that a significant result, p < .05, is falsely interpreted as evidence for a relationship when there is no relationship between two variables in the population and the observed relationship was produced by sampling error alone.

So, what is a type-II error? It only occurred to me yesterday that most explanations of type-II errors are based on a misunderstanding of Neyman-Pearson’s approach. A simplistic explanation of a type-II error is the inference that there is no relationship, when a relationship actually exists. Consider a pregnancy test: a type-II error occurs when the test suggests that a pregnant woman is not pregnant.

This explains conceptually what a type-II error is, but it does not explain how psychologists could ever make a type-II error. To actually make type-II errors, researchers would have to approach research entirely differently than psychologists actually do. Most importantly, they would need to specify a theoretically expected effect size. For example, researchers could test the nil-hypothesis that a relationship between height and weight is r = 0 against the alternative hypothesis that the relationship is r = .4. They would then need to compute the probability of obtaining a non-significant result under the assumption that the correlation is r = .4. This probability is known as the type-II error probability (beta). Only then can a non-significant result be used to reject the alternative hypothesis that the effect size is .4 or larger with a pre-determined error rate beta. If this suddenly sounds very unfamiliar, the reason is that neither training nor published articles follow this approach. Thus, psychologists never make type-II errors, because they never specify a priori effect sizes and never use p-values greater than .05 to infer that population effect sizes are smaller than a specified effect size.
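What such a type-II error calculation would actually look like can be sketched by simulation (illustrative throughout: the alternative rho = .4 and N = 40 are taken from the example above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, rho, n_sims = 40, 0.4, 10000
cov = [[1.0, rho], [rho, 1.0]]

misses = 0
for _ in range(n_sims):
    xy = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    _, p = stats.pearsonr(xy[:, 0], xy[:, 1])
    misses += p >= .05  # non-significant despite a true rho of .4

beta = misses / n_sims  # the type-II error probability
print(round(beta, 2))
```

Only with such a pre-specified alternative can a non-significant result be interpreted as rejecting the alternative hypothesis with error rate beta.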

However, psychologists often seem to believe that they are following Neyman-Pearson because statistics is often taught as a convoluted, incoherent mishmash of the two approaches (Gigerenzer, 1993). It also seems that Finkel et al. (2015) falsely assumed that psychologists follow Neyman-Pearson’s approach and carefully weigh the risks of type-I and type-II errors. For example, they write:

Psychological scientists typically set alpha (the theoretical possibility of a false positive) at .05, and, following Cohen (1988), they frequently set beta (the theoretical possibility of a false negative) at .20.

It is easy to show that this is not the case. To set the probability of a type-II error at 20%, psychologists would need to specify an effect size that gives them an 80% probability (power) to reject the nil-hypothesis, and they would then report the results with the conclusion that the population effect size is less than their a priori specified effect size. I have read more than 1,000 research articles in psychology and I have never seen an article that followed this approach. Moreover, it has been noted repeatedly that sample sizes are determined on an ad hoc basis with little concerns about low statistical power (Cohen, 1962; Sedlmeier & Gigerenzer, 1989; Schimmack, 2012; Sterling et al., 1995). Thus, the claim that psychologists are concerned about beta (type-II errors) is delusional, even if many psychologists believe it.

Finkel et al. (2015) suggest that an optimal approach to research would balance the risk of false positive results against the risk of false negative results. However, once more they ignore that false negatives can only be defined with clearly specified effect sizes.

Estimates of false positive and false negative rates in situations like these would go a long way toward helping scholars who work with large datasets to refine their confirmatory and exploratory hypothesis testing practices to optimize the balance between false-positive and false-negative error rates.

Moreover, they are blissfully unaware that false positive rates are abstract entities because it is practically impossible to verify that the relationship between two variables in a population is exactly zero. Thus, neither false positives nor false negatives are clearly defined and therefore cannot be counted to compute rates of their occurrences.

Without any information about the actual rate of false positives and false negatives, it is of course difficult to say whether current practices produce too many false positives or false negatives. A simple recommendation would be to increase sample sizes, because higher statistical power reduces the risk of false negatives and, by raising the discovery rate, also reduces the proportion of false positives among significant results. So, it might seem like a win-win. However, this is not what Finkel et al. considered to be best practices.

As discussed previously, many policy changes oriented toward reducing false-positive rates will exacerbate false-negative rates

This statement is blatantly false and ignores recommendations to test fewer hypotheses in larger samples (Cohen, 1990; Schimmack, 2012).

They further make unsupported claims about the difficulty of correcting false positive results and false negative results. The evidential value critics have pointed out that current research practices in psychology make it practically impossible to correct a false positive result. Classic findings that failed to replicate are often cited and replications are ignored. The reason is that p < .05 is treated as strong evidence, whereas p > .05 is treated as inconclusive, following Fisher’s approach. If p > .05 was considered evidence against a plausible hypothesis, there would be no reason not to publish it (e.g., a diet does not decrease weight by more than .3 standard deviations in a study with 95% power, p < .05).

We are especially concerned about the evidentiary value movement’s relative neglect of false negatives because, for at least two major reasons, false negatives are much less likely to be the subject of replication attempts. First, researchers typically lose interest in unsuccessful ideas, preferring to use their resources on more “productive” lines of research (i.e., those that yield evidence for an effect rather than lack of evidence for an effect). Second, others in the field are unlikely to learn about these failures because null results are rarely published (Greenwald, 1975). As a result, false negatives are unlikely to be corrected by the normal processes of reconsideration and replication. In contrast, false positives appear in the published literature, which means that, under almost all circumstances, they receive more attention than false negatives. Correcting false positive errors is unquestionably desirable, but the consequences of increasingly favoring the detection of false positives relative to the detection of false negatives are more ambiguous.

This passage makes no sense. As the authors themselves acknowledge, the key problem with existing research practices is that non-significant results are rarely published (“because null results are rarely published”). In combination with low statistical power to detect small effect sizes, this selection implies that researchers will often obtain non-significant results that are not published. It also means that published significant results often overestimate the effect size, because the true population effect size alone is too weak to produce a significant result. Only with the help of sampling error is the observed relationship strong enough to reach significance. As a result, many correlations with a true value of r = .2 will be published as correlations of r = .5. Publication bias also reduces the risk of false negatives. Because researchers do not know that a hypothesis was already tested and produced a non-significant result, they will try again. Eventually, a study will produce a significant result (green jelly beans cause acne, p < .05), and the effect size estimate will be dramatically inflated. When follow-up studies fail to replicate this finding, these replication results are again not published because non-significant results are considered inconclusive. This means that current research practices in psychology never produce type-II errors, only type-I errors, and type-I errors are not corrected. This fundamentally flawed approach to science has created the replication crisis.
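
To make the inflation mechanism concrete, here is a minimal simulation sketch (the true correlation of .2, the per-study sample size of 50, and the rounded critical value are illustrative assumptions, not values taken from Finkel’s data):

```python
import math
import random

random.seed(1)

TRUE_R = 0.20   # hypothetical true population correlation
N = 50          # hypothetical per-study sample size
R_CRIT = 0.279  # approximate |r| needed for p < .05 (two-tailed) with df = 48

def simulate_study(rho, n):
    """Draw one bivariate-normal sample and return its observed correlation."""
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rs = [simulate_study(TRUE_R, N) for _ in range(5000)]
# Selection for significance: only significant positive results get published.
published = [r for r in rs if r > R_CRIT]

mean_published = sum(published) / len(published)
print(round(mean_published, 2))  # noticeably larger than the true r of .20
```

Because every published correlation had to clear the significance threshold, the average published effect size is inflated even though every single study was an unbiased draw from the same population.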

In short, while evidential value critics and Finkel agree that statistical significance is widely used to make editorial decisions, they draw fundamentally different conclusions from this practice. Finkel et al. falsely label non-significant results in small samples false negatives, but they are not false negatives in Neyman and Pearson’s approach to significance testing. They are, however, inconclusive results, and the best way to avoid inconclusive results is to increase statistical power and to specify type-II error probabilities for reasonable effect sizes.

Finkel et al. (2015) are less concerned about calls for higher statistical power. They are more concerned with the introduction of badges for materials sharing, data sharing, and preregistration as a “quick-and-dirty indicator of which studies, and which scholars, have strong research integrity” (p. 292).

Finkel et al. (2015) might therefore welcome the cleaner and more direct indicators of research integrity that my colleagues and I have developed over the past decade, which address some of their key concerns about false negative and false positive results (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020; Schimmack, 2012; Schimmack, 2020). To illustrate this approach, I use Eli J. Finkel’s published results.

I first downloaded published articles from major social and personality journals (Schimmack, 2020). I then converted these pdf files into text files and used R code to extract the statistical results reported in the text. A separate R script searched these articles for the name “Eli J. Finkel,” excluding thank-you notes, and I selected the subset of test statistics that appeared in his publications. The extracted test statistics are available in the form of an Excel file (data). The file contains 1,638 usable test statistics (z-scores between 0 and 100).

A z-curve analysis first converts all published test statistics into p-values. The p-values are then converted into z-scores on the standard normal distribution. Because the sign of an effect does not matter, all z-scores are positive. The higher a z-score, the stronger the evidence against the null-hypothesis. Z-scores greater than 1.96 (red line in the plot) are significant with the standard criterion of p < .05 (two-tailed). Figure 1 shows a histogram of the z-scores between 0 and 6; 143 z-scores exceed the upper value. They are included in the calculations, but not shown.
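
The conversion from a reported two-tailed p-value to an absolute z-score can be sketched in a few lines (a generic illustration of the transformation, not the z-curve package itself):

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-tailed p-value into an absolute z-score
    on the standard normal distribution (the metric z-curve uses)."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))   # 1.96, the standard significance criterion
print(round(p_to_z(0.005), 2))  # 2.81, the stricter criterion discussed below
```

Because the sign is dropped, stronger evidence always maps to a larger z-score, which is what makes the one-sided histogram in Figure 1 possible.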

The first notable observation in Figure 1 is that the peak (mode) of the distribution is just to the right side of the significance criterion. It is also visible that there are more results just to the right (p < .05) than to the left (p > .05) around the peak. This pattern is common and reflects the well-known tendency for journals to favor significant results.

The advantage of a z-curve analysis is that it makes it possible to quantify the amount of publication bias. To do so, we can compare the observed discovery rate with the expected discovery rate. The observed discovery rate is simply the percentage of published results that are significant. Finkel published 1,031 significant results out of 1,638 test statistics, an observed discovery rate of 63%.
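
The observed discovery rate is simple arithmetic on the counts just given:

```python
significant = 1031  # significant results in Finkel's articles (from the text)
total = 1638        # all usable extracted test statistics

observed_discovery_rate = significant / total
print(f"{observed_discovery_rate:.0%}")  # 63%
```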

The expected discovery rate is based on a statistical model. The statistical model is fitted to the distribution of significant results. To produce the distribution of significant results in Figure 1, we assume that they were selected from a larger set of tests that produced significant and non-significant results. Based on the mean power of these tests, we can estimate the full distribution before selection for significance. Simulation studies show that these estimates match simulated true values reasonably well (Bartos & Schimmack, 2020).

The expected discovery rate is 26%. This estimate implies that the average power of statistical tests conducted by Finkel is low. With over 1,000 significant test statistics, it is possible to obtain a fairly tight confidence interval around this estimate, 95%CI = 11% to 44%. The confidence interval does not include 50%, showing that average power is below 50%, which is often considered a minimum for good science (Tversky & Kahneman, 1971). The 95% confidence interval also excludes the observed discovery rate of 63%, which shows the presence of publication bias. These results are by no means unique to Finkel. I was displeased to see that a z-curve analysis of my own articles produced similar results (ODR = 74%, EDR = 25%).

The EDR estimate is not only useful for examining publication bias. It can also be used to estimate the maximum false discovery rate (Soric, 1989). That is, although it is impossible to specify how many published results are false positives, it is possible to quantify the worst-case scenario. Finkel’s EDR estimate of 26% implies a maximum false discovery rate of 15%. Once again, this is an estimate, and it is useful to compute a confidence interval around it. The 95%CI ranges from 7% to 43%. On the one hand, this makes it possible to reject Ioannidis’ claim that most published results are false. On the other hand, we cannot rule out that some of Finkel’s significant results were false positives. Moreover, given the evidence of publication bias, we cannot rule out the possibility that non-significant results that failed to replicate a significant result are missing from the published record.
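
Soric’s bound is simple enough to compute directly. A sketch (the function name is mine; the formula is Soric’s, 1989):

```python
def soric_max_fdr(edr, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate,
    given an expected discovery rate (EDR) and a significance level."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

# With Finkel's estimated EDR of 26% and alpha = .05:
print(round(soric_max_fdr(0.26), 2))  # 0.15, i.e., at most ~15% false discoveries
```

Plugging the EDR confidence limits of 11% and 44% into the same formula reproduces the 7% to 43% interval reported above.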

A major problem for psychologists is the reliance on p-values to evaluate research findings. Some psychologists even falsely assume that p < .05 implies that 95% of significant results are true positives. As we see here, the risk of false positives can be much higher, but significance does not tell us which p-values below .05 are credible. One solution to this problem is to focus on the false discovery rate as a criterion. This approach has been used in genomics to reduce the risk of false positive discoveries. The same approach can also be used to control the risk of false positives in other scientific disciplines (Jager & Leek, 2014).

To reduce the false discovery rate, we need to lower the criterion for declaring a finding a discovery. A team of researchers suggested lowering alpha from .05 to .005 (Benjamin et al., 2017). Figure 2 shows the results when this criterion is applied to Finkel’s published results. The number of significant results drops to 579, but that is still a lot of discoveries. The observed discovery rate decreases to 35%, because the many just-significant results with p-values between .05 and .005 are no longer counted as significant. We also see that the expected discovery rate increased! This requires some explanation. Figure 2 shows that there is an excess of significant results between .05 and .005. The model is not fitted to these results. The justification is that such results are likely to have been obtained with questionable research practices. By disregarding them, the remaining significant results below .005 are more credible, and the observed discovery rate is in line with the expected discovery rate.

The results look different if we do not assume that questionable practices were used. In this case, the model can be fitted to all p-values below .05.

If we assume that p-values are simply selected for significance, the steep drop in p-values between .05 and .005 implies that there is a large file drawer of non-significant results, and the expected discovery rate with alpha = .005 is only 11%. This translates into a high maximum false discovery rate of 44%, but the 95%CI is wide and ranges from 14% to 100%. In other words, the published significant results provide no credible evidence for the discoveries that were made. It is therefore charitable to attribute the peak of just-significant results to questionable research practices, so that p-values below .005 provide some empirical support for the claims in Finkel’s articles.

Discussion

Ultimately, science relies on trust. For too long, psychologists have falsely assumed that most if not all significant results are discoveries. Bem’s (2011) article made many psychologists realize that this is not the case, but this awareness created a crisis of confidence. Which significant results are credible and which ones are false positives? Are most published results false positives? During times of uncertainty, cognitive biases can have a strong effect. Some evidential value warriors saw false positive results everywhere. Others wanted to believe that most published results are credible. Neither extreme position is supported by evidence. The reproducibility project showed that some results replicate and others do not (Open Science Collaboration, 2015). To learn from the mistakes of the past, we need solid facts. Z-curve analyses can provide these facts. They can also help to separate more credible p-values from less credible ones. Here, I showed that about half of Finkel’s discoveries can be salvaged from the wreckage of the replication crisis in social psychology by using p < .005 as the criterion for a discovery.

However, researchers may also have different risk preferences. Maybe some are more willing than others to build on a questionable but intriguing finding. Z-curve analysis can accommodate personalized risk preferences as well. I shared the data here, and an R-package is available to fit z-curve with different alpha levels and selection thresholds.

Aside from these practical implications, this blog post also makes a theoretical observation. The term type-II error, or false negative, is often used loosely and incorrectly. Until yesterday, I also made this mistake. Finkel et al. (2015) use the term false negative to refer to all non-significant results where the nil-hypothesis is false. They then worry that there is a high risk of false negatives that needs to be counterbalanced against the risk of false positives. However, not every trivial deviation from zero is meaningful. For example, a diet that reduces weight by 0.1 pounds is not worth studying. A real type-II error is made when researchers specify a meaningful effect size, conduct a high-powered study to detect it, and then falsely conclude that an effect of this magnitude does not exist. To make a type-II error, it is necessary to conduct studies with high power. Otherwise, beta is so high that it makes no sense to draw any conclusion from a non-significant result. As average power in psychology in general and in Finkel’s studies in particular is low, it is clear that they did not make any type-II errors. Thus, I recommend increasing power to finally achieve a balance between type-I and type-II errors, which requires making some type-II errors some of the time.

References

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum, Inc.