Estimating Replicability of Psychological Science: 35% or 50%

Examining the Basis of Bakker, van Dijk, and Wicherts’s 35% Estimate

Marjan Bakker, Annette van Dijk, and Jelte M. Wicherts (2012). The Rules of the Game Called Psychological Science. Perspectives on Psychological Science, 7, 543.

The article by Bakker, van Dijk, and Wicherts (BDW) starts with the observation that psychological journals publish mostly significant results, but that most studies lack the statistical power to produce so many significant results (Sterling, 1959; Sterling et al., 1995). The heading for the paragraph that makes this claim is “Authors Are Lucky!”

“Sterling (1959) and Sterling, Rosenbaum, and Weinkam (1995) showed that in 97% (in 1958) and 96% (in 1986–1987) of psychological studies involving the use of NHST, H0 was rejected at α = .05.” (p. 543).

“The abundance of positive outcomes is striking because effect sizes (ESs) in psychology are typically not large enough to be detected by the relatively small samples used in most studies (i.e., studies are often underpowered; Cohen, 1990).” (p. 543).

It is true that power is an important determinant of the rate of significant results that a series of experiments will produce. However, power is defined as the probability of obtaining a significant result when an effect is present; it is not defined when the null-hypothesis is true. As a result, average power equals the expected rate of significant results only when the null-hypothesis is always false, and it is the maximum rate that can be expected otherwise (Sterling et al., 1995).

Although it has been demonstrated that publication bias exists and that publication bias contributes to the high success rate in psychology journals, it has been more difficult to estimate the actual rate of significant results that one would expect without publication bias.

BDW provide an estimate, and the point of this blog post is to examine their method of obtaining an unbiased estimate of statistical power, which sets an upper limit for the success rate of studies published in psychology journals.

BDW begin with the observation that statistical power is a function of (a) the criterion for statistical significance (alpha), which is typically p < .05 (two-tailed), (b) sampling error, which decreases with increasing sample size, and (c) the population effect size.

The nominal significance level and sample size are known parameters. BDW suggest that the typical sample size in psychology is N = 40.

“According to Marszalek, Barber, Kohlhart, and Holmes (2011), the median total sample size in four representative psychological journals (Journal of Abnormal Psychology, Journal of Applied Psychology, Journal of Experimental Psychology: Human Perception and Performance, and Developmental Psychology) was 40. This finding is corroborated by Wetzels et al. (2011), who found a median cell size of 24 in both between- and within-subjects designs in their large sample of t tests from Psychonomic Bulletin & Review and Journal of Experimental Psychology: Learning, Memory and Cognition.”

The N = 40 estimate has two problems. First, it is not based on a representative sample of studies across all areas of psychology. Sample sizes are often smaller than N = 40 in animal studies and larger in personality psychology (Fraley & Vazire, 2014). Second, the research design also influences sampling error. In a one-sample t-test, N = 40 implies a sampling error of 1/sqrt(40) = .16, and an effect size of d = .33 would be significant (t(39) = .33/.158 = 2.09, p = .043). In contrast, sampling error in a between-subject design with two groups of 20 is 2/sqrt(40) = .32, and an effect size of d = .65 is needed to obtain a significant result, t(38) = .65/.316 = 2.06, p = .047. Thus, power calculations have to take into account what research design was used. N = 40 can be adequate to study moderate effect sizes (d = .5) with a one-sample design, but not with a between-subject design.
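The arithmetic behind this comparison can be checked with a short sketch. It assumes a standardized effect size (standard deviation of 1), equal group sizes in the between-subject case, and two-tailed critical t-values copied from a standard t-table; the function names are illustrative and not taken from BDW.

```python
import math


def sampling_error(total_n, design):
    """Standard error of a standardized mean or mean difference (SD = 1)."""
    if design == "one-sample":
        return 1 / math.sqrt(total_n)
    if design == "between-subject":
        # Two equal groups of total_n / 2 each: sqrt(2 / (N / 2)) = 2 / sqrt(N)
        return 2 / math.sqrt(total_n)
    raise ValueError(f"unknown design: {design}")


def t_ratio(d, total_n, design):
    """Observed effect size divided by its sampling error."""
    return d / sampling_error(total_n, design)


# Assumed table values for alpha = .05, two-tailed (df = 39 and df = 38).
T_CRIT = {"one-sample": 2.023, "between-subject": 2.024}

for d, design in [(0.33, "one-sample"), (0.65, "between-subject"),
                  (0.50, "between-subject")]:
    t = t_ratio(d, 40, design)
    print(design, d, round(t, 2), "significant:", t > T_CRIT[design])
```

The first two ratios land just above the critical value, whereas d = .50 in the between-subject design gives a ratio of about 1.58, well below it.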

The major problem for power estimation is that the population effect size is unknown. BDW rely on meta-analyses to obtain an estimate of the typical effect size in psychological research. There are two problems with this approach. First, meta-analyses often fail to correct for publication bias. As a result, meta-analytic estimates can be inflated. Second, meta-analyses may focus on research questions with small effect sizes because large effects are so obvious that they do not require a meta-analysis to examine whether they are real. With these caveats in mind, meta-analyses are likely to provide some valid information about the typical population effect size in psychology. BDW arrive at an estimate of d = .50, which Cohen considered a medium effect size.

“The average ES found in meta-analyses in psychology is around d = 0.50 (Anderson, Lindsay, & Bushman, 1999; Hall, 1998; Lipsey & Wilson, 1993; Meyer et al., 2001; Richard, Bond, & Stokes-Zoota, 2003; Tett, Meyer, & Roese, 1994).”

Based on a sample size of N = 40 and a typical effect size of d = .50, the authors arrive at an estimate of 35% power. That is, a typical psychological study has a 35% probability of producing a significant result, and a published significant result has a 35% probability of being significant again in an exact replication study (with the same sample size and power as the original study). The problem with this estimate is that BDW assume that all studies use the low-power, between-subject (BS) design.

“The typical power in our field will average around 0.35 in a two independent samples comparison, if we assume an ES of d = 0.50 and a total sample size of 40” (p. 544).
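To see how much the choice of design matters for this number, here is a minimal sketch that uses a normal approximation to the t-test rather than the exact noncentral t-distribution, so the values are approximate. The function name is illustrative, and the one-sample case assumes that d = .50 is expressed on the scale of the one-sample (or difference-score) test.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, se, alpha=0.05):
    """Two-tailed power under a normal approximation to the t-test."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = d / se  # expected test statistic if the effect is real
    # Probability of landing beyond either critical value
    return z.cdf(ncp - z_crit) + z.cdf(-ncp - z_crit)


n = 40
power_between = approx_power(0.5, 2 / sqrt(n))  # two groups of 20
power_one = approx_power(0.5, 1 / sqrt(n))      # one sample of 40
print(f"between-subject: {power_between:.2f}, one-sample: {power_one:.2f}")
```

Under these assumptions the between-subject value comes out near the 35% that BDW report, while the one-sample value is well above 80% for the same N and d.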

The authors then generalize from the between-subject scenario to all areas of research.

“This low power in common psychological research raises the possibility of a file drawer (Rosenthal, 1979) containing studies with negative or inconclusive results.” (p. 544).

Unfortunately, the authors ignore important work that contradicts their conclusions. Most importantly, Cohen (1962) provided the first estimate of statistical power in psychological research. He did not conduct an explicit meta-analysis of psychological research, but he suggested that an effect size of half a standard deviation is a moderate effect size. This standardized effect size was named after him: Cohen’s d. As it turns out, the effect size used by BDW, Cohen’s d = .50, is the same effect size that Cohen used for his power analysis (he also proposed similar criteria for other effect size measures). Cohen (1962) arrived at a median power estimate of 50% to detect a moderate effect size. This estimate was replicated by Sedlmeier and Gigerenzer (1989), who also conducted a meta-analysis of power estimates and found that power in some other research areas was higher, with an average of 60% power to detect a moderate effect size.

One major factor that contributes to the discrepancy between BDW’s estimate of 35% power and other power estimates in the range from 50 to 60% power is that BDW estimated sample sizes on the basis of journals that use within-subject designs, but conducted the power analysis with a between-subject design. In contrast, Cohen and others used the actual designs of studies to estimate power. This approach is more labor-intensive, but provides more accurate estimates than an approach that assumes that all studies use between-subject designs.


In conclusion, the 35% estimate underestimates the typical power in psychological studies. Given that BDW and Cohen made the same assumption about the median population effect size, Cohen’s method is more accurate and estimates based on his method should be used. These estimates are closer to 50% power.

However, even the 50% estimate is just an estimate that requires further validation research. One limitation is that the accuracy of the meta-analytic estimation method is unknown. Another problem is that power assumes that an effect is present, but in some studies the null-hypothesis is true. Thus, even if the typical power of studies were 50%, the actual success rate would be lower.
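The last point can be made concrete with a small sketch. It assumes that studies of true null-hypotheses produce significant results at the rate alpha, and the 80% share of studies with a real effect is a purely hypothetical number chosen for illustration.

```python
def expected_success_rate(power, prop_true_effects, alpha=0.05):
    """Expected share of significant results in a mix of real and null effects."""
    real = prop_true_effects * power        # real effects detected
    null = (1 - prop_true_effects) * alpha  # false positives
    return real + null


# Hypothetical mix: 80% of studies examine a real effect with 50% power.
print(expected_success_rate(power=0.5, prop_true_effects=0.8))
```

With these illustrative inputs the expected success rate is about 41%, below the 50% power of the studies that examine a real effect.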

Unless better estimates become available, it is reasonable to assume that at best 50% of published significant results will replicate in an exact replication study. With success rates close to 100%, this means that researchers routinely obtain non-significant results in studies that would be published if they had produced significant results. This large file-drawer of unreported studies inflates reported effect sizes, increases the risk of false-positive results, and wastes resources.
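A rough sense of the size of this file drawer follows from simple bookkeeping. The sketch below assumes that every significant result is published and that the true and published success rates are known; the function name and the inputs are illustrative, not estimates from BDW.

```python
def file_drawer_per_significant(true_rate, published_rate):
    """Unreported non-significant studies per published significant study.

    Assumes all significant results are published.
    """
    # Non-significant studies conducted per significant study
    conducted = (1 - true_rate) / true_rate
    # Non-significant studies published per significant study
    published = (1 - published_rate) / published_rate
    return conducted - published


# True success rate of 50% versus a published success rate of 96%.
print(round(file_drawer_per_significant(0.50, 0.96), 2))
```

Under these assumptions there is close to one unreported non-significant study for every published significant one.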
