UPDATE 5/13/2019 Our manuscript on the z-curve method for estimation of mean power after selection for significance has been accepted for publication in Meta-Psychology. As estimation of actual power is an important tool for meta-psychologists, we are happy that z-curve found its home in Meta-Psychology. We also enjoyed the open and constructive review process at Meta-Psychology. Definitely will try Meta-Psychology again for future work (look out for z-curve.2.0 with many new features).

Since 2015, Jerry Brunner and I have been working on a statistical tool that can estimate mean (statistical) power for a set of studies with heterogeneous sample sizes and effect sizes (heterogeneity in non-centrality parameters and true power). This method corrects for the inflation in mean observed power that is introduced by the selection for statistical significance. Knowledge about mean power makes it possible to predict the success rate of exact replication studies. For example, if a set of studies with mean power of 60% were replicated exactly (including sample sizes), we would expect that 60% of the replication studies produce a significant result again.
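The link between mean power and replication success can be checked with a small simulation. This is only a sketch and not part of z-curve: the number of studies, the uniform distribution of non-centrality parameters, and all variable names are arbitrary assumptions chosen for illustration.

```r
# Hypothetical simulation: does mean true power after selection for
# significance predict the success rate of exact replications?
set.seed(1)
k      <- 200000           # number of simulated original studies (assumption)
z.crit <- qnorm(.975)      # alpha = .05, two-tailed
ncp    <- runif(k, 0, 4)   # heterogeneous non-centrality (arbitrary choice)

# True two-tailed power of every study
power <- pnorm(ncp - z.crit) + pnorm(-ncp - z.crit)

# Original studies, followed by selection for significance
z.orig <- rnorm(k, mean = ncp)
sig    <- abs(z.orig) > z.crit

# Exact replications of the significant studies only
z.rep <- rnorm(sum(sig), mean = ncp[sig])

mean.power.sig <- mean(power[sig])            # mean true power after selection
rep.rate       <- mean(abs(z.rep) > z.crit)   # share of significant replications
round(c(mean.power.sig = mean.power.sig, rep.rate = rep.rate), 3)
```

Under these assumptions the two numbers should agree up to simulation error, and both should be higher than the mean power of all studies before selection.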

Our latest manuscript is a revision of an earlier manuscript that received a revise-and-resubmit decision from the free, open-peer-review journal Meta-Psychology. We consider it the most authoritative introduction to z-curve; it should be used to learn about z-curve, to critique z-curve, or as a citation for studies that use z-curve.

Feel free to ask questions, provide comments, and critique our manuscript in the comments section. We are proud to be an open science lab, and consider criticism an opportunity to improve z-curve and our understanding of power estimation.

R-CODE
Latest R-Code to run Z.Curve (Z.Curve.Public.18.10.28).
[updated 18/11/17] [35 lines of code]
Call the function with: mean.power = zcurve(pvalues,Plot=FALSE,alpha=.05,bw=.05)[1]

Z-Curve related Talks
Presentation on Z-curve and application to BS Experimental Social Psychology and (Mostly) WS-Cognitive Psychology at U Waterloo (November 2, 2018)
[Powerpoint Slides]

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis.” Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes. We think it is helpful to recognize the key role of statistical power in significance testing. If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated. Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).
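The 62.5% figure can be reproduced with a few lines of R. This is a sketch under a simplifying assumption that is not spelled out in the commentary: test statistics are treated as z-scores with a standard deviation of 1, and only the upper tail is considered. The function name prob.overestimate is made up for this illustration.

```r
z.crit <- qnorm(.975)   # alpha = .05, two-tailed

# Probability that a significant result overestimates the population effect,
# that is, P(z > ncp | z > z.crit) for z ~ N(ncp, 1)
prob.overestimate <- function(true.power) {
  ncp   <- z.crit + qnorm(true.power)      # non-centrality implied by true power
  p.sig <- pnorm(ncp - z.crit)             # probability of significance
  p.both <- pnorm(ncp - pmax(ncp, z.crit)) # significant AND above the true value
  p.both / p.sig
}

prob.overestimate(c(.30, .50, .80))
```

With less than 50% power every significant result is an overestimate, and with 80% power half of all results exceed the true value while 80% are significant, which gives .50/.80 = .625.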

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584). We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than perfectly reliable measures. We show that this interpretation of their sentence would be incorrect and that random measurement error always leads to an underestimation of observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated. We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the ratio of these two parameters (effect size divided by sampling error), it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is inversely related to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significance is a monotonic function of true power. It is straightforward to transform inflated median observed power into median observed effect sizes. We applied this approach to Loken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from their 50 to 3050 to a range of 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.
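The steps described above can be sketched in R. This is not the code from the Supplement. It rests on assumptions made only for this illustration: X and Y have the same reliability (so the population correlation is attenuated by the factor rel), the Fisher z-transformation is used to obtain a z-score, and only the upper tail is considered. The function name median.sig.r is hypothetical.

```r
z.crit <- qnorm(.975)   # alpha = .05, two-tailed

# Median observed correlation among significant results (illustrative sketch)
median.sig.r <- function(N, r.true, rel) {
  r.att <- r.true * rel                 # attenuation, equal reliability for X and Y
  ncp   <- atanh(r.att) * sqrt(N - 3)   # Fisher-z non-centrality
  power <- pnorm(ncp - z.crit)          # true power (upper tail only)
  med.z <- ncp + qnorm(1 - power / 2)   # median z-score of significant results
  tanh(med.z / sqrt(N - 3))             # back-transform to a correlation
}

N   <- c(25, 50, 100, 250, 1000)
rel <- c(.5, .8, 1)
res <- sapply(rel, function(x) median.sig.r(N, r.true = .15, rel = x))
dimnames(res) <- list(paste0("N=", N), paste0("rel=", rel))
round(res, 3)
```

In this sketch every row increases from left to right, so lower reliability never produces a larger median observed effect size, while the differences between columns become very small in the smallest samples.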

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis. Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result. The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance. Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science,

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

title = "Effect of Selection for Significance on Observed Effect Size"

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type="l",xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab="Sample Size",ylab="Median Observed Effect Size After Selection for Significance",lwd=3,main=title)

}

Two years ago, I posted an Excel spreadsheet to help people to understand the concept of true power, observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no mathematical formula to correct observed power for inflation to solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power. This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance. To use this method for real data with observed median power of only significant results, one can simply generate a range of true power values, generate the predicted median observed power and then pick the true power value with the smallest discrepancy between median observed power and simulated inflated power estimates. This approach is essentially the same as the approach used by pcurve and puniform, which only differ in the criterion that is being minimized.

Here is the r-code for the conversion of true.power into the predicted observed power after selection for significance.
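The original code block did not survive on this page, so what follows is a reconstruction rather than the original code. It uses the same logic as the simulation code at the end of this post: z-scores are normal with a standard deviation of 1, and the significant results are the upper fraction of the distribution that equals true power. The function name inflated.power is an assumption.

```r
z.crit <- qnorm(.975)   # alpha = .05, two-tailed

inflated.power <- function(true.power) {
  # non-centrality that yields the given true power
  ncp <- qnorm(true.power, z.crit)
  # significant results are the top true.power fraction of the distribution,
  # so their median sits at the quantile 1 - true.power / 2
  med.z <- qnorm(1 - true.power / 2, ncp)
  # observed power implied by the median significant z-score
  pnorm(med.z, z.crit)
}

true.power <- seq(.01, .99, .01)
obs.power  <- inflated.power(true.power)
round(inflated.power(c(.50, .80)), 2)   # 0.75 0.86
```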

And here is a pretty picture of the relationship between true power and inflated observed power. As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%. With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%. Thus, an observed median power of 61% for only significant results supports the null-hypothesis. With true power of 50%, observed power is inflated to 75%. For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value. These p-values can then be converted into observed power estimates using the standard normal distribution.
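A minimal sketch of this conversion, assuming that the probability of a significant result in the opposite direction can be ignored. The function name p.to.obs.power is made up for this example.

```r
# Convert exact two-tailed p-values into observed power estimates
p.to.obs.power <- function(p, alpha = .05) {
  z      <- qnorm(1 - p / 2)       # absolute z-score implied by the p-value
  z.crit <- qnorm(1 - alpha / 2)   # critical z-score for the criterion
  pnorm(z - z.crit)                # observed power (upper tail only)
}

p.to.obs.power(c(.05, .01, .005, .001))
```

A p-value that equals the significance criterion maps onto an observed power of exactly 50%, and smaller p-values map onto higher observed power.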

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.
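Because the formula cannot be solved for true power, the conversion has to be done by searching over a grid of true power values, as described earlier in this post. The following sketch shows one way to do this; the grid resolution and the function names inflate and deflate are assumptions for illustration.

```r
# Predicted median observed power after selection, given true power
# (the critical value cancels out of the formula)
inflate <- function(true.power) {
  pnorm(qnorm(true.power) + qnorm(1 - true.power / 2))
}

# Pick the true power value whose predicted inflated power is closest
# to the observed median power of the significant results
deflate <- function(obs.power, grid = seq(.001, .999, .001)) {
  grid[which.min(abs(inflate(grid) - obs.power))]
}

deflate(.75)
deflate(inflate(.80))
```

Observed median power values close to or below the lower bound of the inflated power curve simply map onto the lowest grid values, which is consistent with very low true power.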

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power. This method does not work for heterogeneous sets of studies.

P.S. It is possible to prove the formula that transforms true power into median observed power. Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

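n.sim = 100000                  # simulated studies per true power value
z.crit = qnorm(.975)            # critical z-score for alpha = .05 (two-tailed)
true.power = seq(.01,.99,.01)
obs.pow.sim = numeric(length(true.power))
for (i in 1:length(true.power)) {
# z-scores centered on the non-centrality that yields the given true power
z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
# median z-score of the significant results only
med.z.sig = median(z.sim[z.sim > z.crit])
# observed power implied by the median significant z-score
obs.pow.sim[i] = pnorm(med.z.sig,z.crit)
}
obs.pow.sim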

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools such as the Incredibility Test (Schimmack, 2012); the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0; Brunner & Schimmack, 2020; Version 2.0, Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science.

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers and provide guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in the story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1-22 https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566 http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Replicability rankings of psychology journals differ from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics in the results sections of empirical articles to estimate the average power of statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability to produce a significant result in an exact replication study and a lower probability of being false-positive results.

The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be used to interpret a result as evidence for an effect and against the null-hypothesis. Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.

The average power across the 105 psychology journals used for this ranking is 70%. This means that a representative sample of significant results in exact replication studies is expected to produce 70% significant results. The rankings for 2015 show variability across journals with average power estimates ranging from 84% to 54%. A factor analysis of annual estimates for 2010-2015 showed that random year-to-year variability accounts for 2/3 of the variance and that 1/3 is explained by stable differences across journals.

The Journal Names are linked to figures that show the powergraphs of a journal for the years 2010-2014 and 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and power estimates that estimate power including non-significant results even if these are not reported (the file-drawer).

In this post, I provide a brief summary of Hoenig and Heisey’s argument. The summary shows that Hoenig and Heisey were concerned with the practice of assessing the statistical power of a single test based on the observed effect size for this effect. I agree that it is often not informative to do so (unless the result is power = .999). However, the article is often cited to suggest that the use of observed effect sizes in power calculations is fundamentally flawed. I show that this statement is false.

The abstract of the article makes it clear that Hoenig and Heisey focused on the estimation of power for a single statistical test. “There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result” (page 1). The abstract informs readers that this practice is fundamentally flawed. “This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic” (p. 1).

Given that method articles can be difficult to read, it is possible that the misinterpretation of Hoenig and Heisey is the result of relying on the term “fundamentally flawed” in the abstract. However, some passages in the article are also ambiguous. In the Introduction Hoenig and Heisey write “we describe the flaws in trying to use power calculations for data-analytic purposes” (p. 1). It is not clear what purposes are left for power calculations if they cannot be used for data-analytic purposes. Later on, they write more forcefully “A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed.” (p. 2). So readers cannot be blamed entirely if they believed that calculations of observed power are fundamentally flawed. This conclusion is often implied in Hoenig and Heisey’s writing, which is influenced by their broader dislike of hypothesis testing in general.

The main valid argument that Hoenig and Heisey make is that power analysis is based on the unknown population effect size and that effect sizes in a particular sample are contaminated with sampling error. As p-values and power estimates depend on the observed effect size, they are also influenced by random sampling error.

In a special case, when true power is 50%, the p-value matches the significance criterion. If sampling error leads to an underestimation of the true effect size, the p-value will be non-significant and the power estimate will be less than 50%. When sampling error inflates the observed effect size, p-values will be significant and power will be above 50%.

It is therefore impossible to find scenarios where observed power is high (80%) and a result is not significant, p > .05, or where observed power is low (20%) and a result is significant, p < .05. As a result, it is not possible to use observed power to decide whether a non-significant result was obtained because power was low or because power was high but the effect does not exist.

In fact, a simple mathematical formula can be used to transform p-values into observed power and vice versa (I actually got the idea of using p-values to estimate power from Hoenig and Heisey’s article). Given this perfect dependence between the two statistics, observed power cannot add additional information to the interpretation of a p-value.

This central argument is valid and it does mean that it is inappropriate to use the observed effect size of a statistical test to draw inferences about the statistical power of a significance test for the same effect (N = 1). Similarly, one would not rely on a single data point to draw inferences about the mean of a population.

However, it is common practice to aggregate original data points or to aggregate effect sizes of multiple studies to obtain more precise estimates of the mean in a population or the mean effect size, respectively. Thus, the interesting question is whether Hoenig and Heisey’s (2001) article contains any arguments that would undermine the aggregation of power estimates to obtain an estimate of the typical power for a set of studies. The answer is no. Hoenig and Heisey do not consider a meta-analysis of observed power in their discussion and their discussion of observed power does not contain arguments that would undermine the validity of a meta-analysis of post-hoc power estimates.

A meta-analysis of observed power can be extremely useful to check whether researchers’ a priori power analyses provide reasonable estimates of the actual power of their studies.

Assume that researchers in a particular field have to demonstrate that their studies have 80% power to produce significant results when an important effect is present because conducting studies with less power would be a waste of resources (although some granting agencies require power analyses, these power analyses are rarely taken seriously, so I consider this a hypothetical example).

Assume that researchers comply and submit a priori power analyses with effect sizes that are considered to be sufficiently meaningful. For example, an effect of half a standard deviation (Cohen’s d = .50) might look reasonably large to be meaningful. Researchers submit their grant applications with an a priori power analysis that produces 80% power with an effect size of d = .50. Based on the power analysis, researchers request funding for 128 participants per study. A researcher plans four studies and needs $50 for each participant. The total budget is $25,600.
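The sample size and budget in this example can be reproduced with power.t.test() from base R. The sketch assumes a two-group between-subject design with a two-tailed test at alpha = .05, which the example does not state explicitly but which matches the 128 participants.

```r
# A priori power analysis for d = .50 with 80% power (two independent groups)
res <- power.t.test(delta = 0.50, sd = 1, sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")

n.per.group <- ceiling(res$n)        # round up to whole participants
n.total     <- 2 * n.per.group       # participants per study
budget      <- 4 * n.total * 50      # four studies at $50 per participant

c(n.per.group = n.per.group, n.total = n.total, budget = budget)
```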

When the research project is completed, all four studies produced non-significant results. The observed standardized effect sizes were 0, .20, .25, and .15. Is it really impossible to estimate the realized power in these studies based on the observed effect sizes? No. It is common practice to conduct a meta-analysis of observed effect sizes to get a better estimate of the (average) population effect size. In this example, the average effect size across the four studies is d = .15. It is also possible to show that the average effect size in these four studies is significantly different from the effect size that was used for the a priori power calculation (M1 = .15, M2 = .50, Mdiff = .35; assuming two equal groups, SE = 2/sqrt(512) = .088, t = .35 / .088 = 3.96, p < .001). Using the more realistic effect size estimate that is based on actual empirical data rather than wishful thinking, the post-hoc power analysis yields a power estimate of 13%. The probability of obtaining non-significant results in all four studies is 57%. Thus, it is not surprising that the studies produced non-significant results. In this example, a post-hoc power analysis with observed effect sizes provides valuable information about the planning of future studies in this line of research. Either effect sizes of this magnitude are not important enough and research should be abandoned or effect sizes of this magnitude still have important practical implications and future studies should be planned on the basis of a priori power analysis with more realistic effect sizes.
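The post-hoc power estimate and the probability of four non-significant results can be checked in R. As before, the sketch assumes two independent groups of 64 participants per study and a two-tailed test at alpha = .05; it also assumes that all four studies have the same power.

```r
d.obs  <- c(0, .20, .25, .15)   # observed effect sizes of the four studies
d.mean <- mean(d.obs)           # meta-analytic average (equal sample sizes)

# Post-hoc power of a single study, based on the average observed effect size
post.hoc <- power.t.test(n = 64, delta = d.mean, sd = 1, sig.level = 0.05,
                         type = "two.sample", alternative = "two.sided")$power

# Probability that all four studies are non-significant
p.all.nonsig <- (1 - post.hoc)^4

round(c(d.mean = d.mean, power = post.hoc, p.all.nonsig = p.all.nonsig), 2)
```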

Another valuable application of observed power analysis is the detection of publication bias and questionable research practices (Ioannidis & Trikalinos, 2007; Schimmack, 2012) and the estimation of the replicability of statistical results published in scientific journals (Schimmack, 2015).

In conclusion, the article by Hoenig and Heisey is often used as a reference to argue that observed effect sizes should not be used for power analysis. This post clarifies that this practice is not meaningful for a single statistical test, but that it can be done for larger samples of studies.

The article with the witty title “Do Studies of Statistical Power Have an Effect on the Power of Studies?” builds on Cohen’s (1962) seminal power analysis of psychological research.

The main point of the article can be summarized in one word: No. Statistical power has not increased after Cohen published his finding that statistical power is low.

One important contribution of the article was a meta-analysis of power analyses that applied Cohen’s method to a variety of different journals. The table below shows that power estimates vary by journal assuming that the effect size was medium according to Cohen’s criteria of small, medium, and large effect sizes. The studies are sorted by power estimates from the highest to the lowest value, which provides a power ranking of journals based on Cohen’s method. I also included the results of Sedlmeier and Gigerenzer’s power analysis of the 1984 volume of the Journal of Abnormal Psychology (the Journal of Abnormal and Social Psychology was split into Journal of Abnormal Psychology and Journal of Personality and Social Psychology). I used the mean power (50%) rather than median power (44%) because the mean power is consistent with the predicted success rate in the limit. In contrast, the median will underestimate the success rate in a set of studies with heterogeneous effect sizes.

| Journal Title | Year | Power (%) |
|---|---|---|
| Journal of Marketing Research | 1981 | 89 |
| American Sociological Review | 1974 | 84 |
| Journalism Quarterly, The Journal of Broadcasting | 1976 | 76 |
| American Journal of Educational Psychology | 1972 | 72 |
| Journal of Research in Teaching | 1972 | 71 |
| Journal of Applied Psychology | 1976 | 67 |
| Journal of Communication | 1973 | 56 |
| The Research Quarterly | 1972 | 52 |
| Journal of Abnormal Psychology | 1984 | 50 |
| Journal of Abnormal and Social Psychology | 1962 | 48 |
| American Speech and Hearing Research & Journal of Communication Disorders | 1975 | 44 |
| Counselor Education and Supervision | 1973 | 37 |

The table shows that there is tremendous variability in power estimates for different journals ranging from as high as 89% (9 out of 10 studies will produce a significant result when an effect is present) to the lowest estimate of 37% power (only 1 out of 3 studies will produce a significant result when an effect is present).

The table also shows that the Journal of Abnormal and Social Psychology and its successor the Journal of Abnormal Psychology yielded nearly identical power estimates. This finding is the key finding that provides empirical support for the claim that power in the Journal of Abnormal Psychology has not increased over time.

The average power estimate for all journals in the table is 62% (median 61%). The list of journals is not a representative set of journals and few journals are core psychology journals. Thus, the average power may be different if a representative set of journals had been used.

The average for the three core psychology journals (JASP & JAbnPsy, JAP, AJEduPsy) is 63% (median = 67%), which is slightly higher. The latter estimate is likely to be closer to the typical power in psychology in general rather than the prominently featured estimates based on the Journal of Abnormal Psychology. Power could be lower in this journal because it is more difficult to recruit patients with a specific disorder than participants from undergraduate classes. However, only more rigorous studies of power for a broader range of journals and more years can provide more conclusive answers about the typical power of a single statistical test in a psychology journal.
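The summary statistics can be recomputed from the table. For the core psychology journals, the sketch treats the Journal of Abnormal and Social Psychology and the Journal of Abnormal Psychology as one entry by averaging their two estimates; this pooling is an assumption about how "three" journals were counted.

```r
# Power estimates (%) from the table, highest to lowest
power <- c(89, 84, 76, 72, 71, 67, 56, 52, 50, 48, 44, 37)
c(mean = mean(power), median = median(power))

# Core psychology journals: JASP & JAbnPsy pooled, JAP, AJEduPsy
core <- c(mean(c(48, 50)), 67, 72)
c(mean = mean(core), median = median(core))
```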

The article also contains some important theoretical discussions about the importance of power in psychological research. One important issue concerns the treatment of multiple comparisons. For example, a multi-factorial design produces a rapidly growing number of statistical comparisons (k(k - 1)/2 pairwise comparisons for k conditions). With two conditions, there is only one comparison. With three conditions, there are three comparisons (C1 vs. C2, C1 vs. C3, and C2 vs. C3). With 5 conditions, there are 10 comparisons. Standard statistical methods often correct for these multiple comparisons. One consequence of this correction for multiple comparisons is that the power of each statistical test decreases. An effect that would be significant in a simple comparison of two conditions would not be significant if this test is part of a series of tests.
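A short R sketch illustrates both points. The number of pairwise comparisons follows from choose(k, 2). The loss of power is shown for a hypothetical effect that has exactly 50% power at alpha = .05, using a Bonferroni correction and a normal approximation; both choices are assumptions for this illustration and are not taken from the article.

```r
# Number of pairwise comparisons for 2, 3, and 5 conditions
k <- c(2, 3, 5)
n.comparisons <- choose(k, 2)
n.comparisons   # 1 3 10

# Power of a z-test for an effect with 50% power at alpha = .05 (two-tailed)
ncp <- qnorm(.975)
power.at <- function(alpha) pnorm(ncp - qnorm(1 - alpha / 2))

power.at(.05)        # unadjusted criterion
power.at(.05 / 10)   # Bonferroni-adjusted criterion for 10 comparisons
```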

Sedlmeier and Gigerenzer used the standard criterion of p < .05 (two-tailed) for their main power analysis and for the comparison with Cohen’s results. However, many articles presented results using a more stringent criterion of significance. If the criterion used by authors had been used for the power analysis, power would have decreased further. About 50% of all articles used an adjusted criterion value, and when the adjusted criterion value was used, power was only 37%.

Sedlmeier and Gigerenzer also found another remarkable difference between articles in 1960 and in 1984. Most articles in 1960 reported the results of a single study. In 1984 many articles reported results from two or more studies. Sedlmeier and Gigerenzer do not discuss the statistical implications of this change in publication practices. Schimmack (2012) introduced the concept of total power to highlight the problem of publishing articles that contain multiple studies with modest power. If studies are used to provide empirical support for an effect, studies have to show a significant effect. For example, Study 1 shows an effect with female participants. Study 2 examines whether the effect can also be demonstrated with male participants. If Study 2 produces a non-significant result, it is not clear how this finding should be interpreted. It may show that the effect does not exist for men. It may show that the first result was just a fluke finding due to sampling error. Or it may show that the effect exists equally for men and women but studies had only 50% power to produce a significant result. In this case, it is expected that one study will produce a significant result and one will produce a non-significant result, but in the long-run significant results are equally likely with male or female participants. Given the difficulty of interpreting a non-significant result, it would be important to examine gender differences in a more powerful study with more female and male participants. However, this is not what researchers do. Rather, multiple study articles contain only the studies that produced significant results. The rate of successful studies in psychology journals is over 90% (Sterling et al., 1995). However, this outcome is extremely unlikely in multiple studies where studies have only 50% power to get a significant result in a single attempt. For each additional attempt, the probability of obtaining only significant results decreases exponentially (1 study: 50%; 2 studies: 25%; 3 studies: 12.5%; 4 studies: 6.25%).
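The arithmetic behind total power can be checked in a few lines. The sketch below is written in Python for illustration (it is not part of the z-curve R code), the function name is made up, and it assumes that the studies are independent and share the same power. With 50% power, four studies are all significant with probability 0.5^4 = 6.25%.

```python
def prob_all_significant(power: float, k: int) -> float:
    """Probability that k independent studies with the same power are all significant."""
    return power ** k


# Total power for articles with 1 to 4 studies that each have 50% power.
for k in range(1, 5):
    print(f"{k} studies: {prob_all_significant(0.5, k):.2%}")
```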

The fact that researchers only publish studies that worked is well-known in the research community. Many researchers believe that this is an acceptable scientific practice. However, consumers of scientific research may have a different opinion about this practice. Publishing only studies that produced the desired outcome is akin to a fund manager that only publishes the return rate of funds that gained money and excludes funds with losses. Would you trust this manager to take care of your retirement? It is also akin to a gambler that only remembers winnings. Would you marry a gambler who believes that gambling is ok because you can earn money that way?

I personally do not trust obviously biased information. So, when researchers present 5 studies with significant results, I wonder whether they really had the statistical power to produce these results or whether they simply did not publish results that failed to confirm their claims. To answer this question it is essential to estimate the actual power of individual studies to produce significant results; that is, it is necessary to estimate the typical power in this field, of this researcher, or in the journal that published the results.

In conclusion, Sedlmeier and Gigerenzer made an important contribution to the literature by providing the first power-ranking of scientific journals and the first temporal analyses of time trends in power. Although they probably hoped that their scientific study of power would lead to an increase in statistical power, the general consensus is that their article failed to change scientific practices in psychology. In fact, some journals required more and more studies as evidence for an effect (some articles contain 9 studies) without any indication that researchers increased power to ensure that their studies could actually provide significant results for their hypotheses. Moreover, the topic of statistical power remained neglected in the training of future psychologists.

I recommend Sedlmeier and Gigerenzer’s article as essential reading for anybody interested in improving the credibility of psychology as a rigorous empirical science.

As always, comments (positive or negative) are welcome.

Citation: Dr. R (2015). Meta-analysis of observed power. R-Index Bulletin, Vol(1), A2.

In a previous blog post, I presented an introduction to the concept of observed power. Observed power is an estimate of the true power on the basis of observed effect size, sampling error, and significance criterion of a study. Yuan and Maxwell (2005) concluded that observed power is a useless construct when it is applied to a single study, mainly because sampling error in a single study is too large to obtain useful estimates of true power. However, sampling error decreases as the number of studies increases and observed power in a set of studies can provide useful information about the true power in a set of studies.

This blog post introduces various methods that can be used to estimate power on the basis of a set of studies (meta-analysis). I then present simulation studies that compare the various estimation methods in terms of their ability to estimate true power under a variety of conditions. In this blog post, I examine only unbiased sets of studies. That is, the sample of studies in a meta-analysis is a representative sample from the population of studies with specific characteristics. The first simulation assumes that samples are drawn from a population of studies with fixed effect size and fixed sampling error. As a result, all studies have the same true power (homogeneous). The second simulation assumes that all studies have a fixed effect size, but that sampling error varies across studies. As power is a function of effect size and sampling error, this simulation models heterogeneity in true power. The next simulations assume heterogeneity in population effect sizes. One simulation uses a normal distribution of effect sizes. Importantly, a normal distribution has no influence on the mean because effect sizes are symmetrically distributed around the mean effect size. The next simulations use skewed normal distributions. This simulation provides a realistic scenario for meta-analysis of heterogeneous sets of studies such as a meta-analysis of articles in a specific journal or articles on different topics published by the same author.

Observed Power Estimation Method 1: The Percentage of Significant Results

The simplest method to determine observed power is to compute the percentage of significant results. As power is defined as the long-range percentage of significant results, the percentage of significant results in a set of studies is an unbiased estimate of the long-term percentage. The main limitation of this method is that the dichotomous measure (significant versus non-significant) is likely to be imprecise when the number of studies is small. For example, two studies can only show observed power values of 0%, 50%, or 100%, even if true power were 75%. However, the percentage of significant results plays an important role in bias tests that examine whether a set of studies is representative. When researchers hide non-significant results or use questionable research methods to produce significant results, the percentage of significant results will be higher than the percentage of significant results that could have been obtained on the basis of the actual power to produce significant results.

Observed Power Estimation Method 2: The Median

Schimmack (2012) proposed to average observed power of individual studies to estimate observed power. Yuan and Maxwell (2005) demonstrated that the average of observed power is a biased estimator of true power. It overestimates power when power is less than 50% and it underestimates true power when power is above 50%. Although the bias is not large (no more than 10 percentage points), Yuan and Maxwell (2005) proposed a method that produces an unbiased estimate of power in a meta-analysis of studies with the same true power (exact replication studies). Unlike the average that is sensitive to skewed distributions, the median provides an unbiased estimate of true power because sampling error is equally likely (50:50 probability) to inflate or deflate the observed power estimate. To avoid the bias of averaging observed power, Schimmack (2014) used median observed power to estimate the replicability of a set of studies.
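As a minimal sketch of the difference between the two summaries, the Python snippet below converts z-scores into observed power with a one-sided normal approximation and then compares the mean and the median. The z-scores are hypothetical values chosen for illustration, and the helper name is an assumption, not a function from any published package.

```python
from statistics import NormalDist, mean, median

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # about 1.96, i.e. p < .05 (two-tailed)


def observed_power(z: float) -> float:
    """Observed power of one study, treating its z-score as the non-centrality parameter."""
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - z)


# Hypothetical z-scores from five exact replication studies (illustration only).
z_scores = [0.8, 1.7, 2.1, 2.6, 3.4]
powers = [observed_power(z) for z in z_scores]

print("mean observed power:  ", round(mean(powers), 3))
print("median observed power:", round(median(powers), 3))
```

Because observed power is a monotone function of the z-score, the median observed power is simply the observed power of the median z-score, which is why the median is not distorted by the skew in the distribution of observed power.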

Observed Power Estimation Method 3: P-Curve’s KS Test

Another method is implemented in Simonsohn’s (2014) pcurve. Pcurve was developed to obtain an unbiased estimate of a population effect size from a biased sample of studies. To achieve this goal, it is necessary to determine the power of studies because bias is a function of power. The pcurve estimation uses an iterative approach that tries out different values of true power. For each potential value of true power, it computes the location (quantile) of observed test statistics relative to a potential non-centrality parameter. The best fitting non-centrality parameter is located in the middle of the observed test statistics. Once a non-central distribution has been found, it is possible to assign each observed test-value a cumulative percentile of the non-central distribution. For the actual non-centrality parameter, these percentiles have a uniform distribution. To find the best fitting non-centrality parameter from a set of possible parameters, pcurve tests whether the distribution of observed percentiles follows a uniform distribution using the Kolmogorov-Smirnov test. The non-centrality parameter with the smallest test statistics is then used to estimate true power.
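The grid search described above can be sketched as follows. This is not the p-curve code itself: it follows the simplified description for an unbiased set of z-scores, so percentiles are taken from an unconditional normal distribution with unit variance, and all function names, the grid step, and the upper limit of the grid are assumptions made for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)


def ks_distance_from_uniform(values):
    """Kolmogorov-Smirnov D statistic of a sample against Uniform(0, 1)."""
    u = sorted(values)
    n = len(u)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))


def ks_power_estimate(z_scores, step=0.01, max_ncp=6.0):
    """Grid search for the non-centrality parameter whose percentiles look most uniform."""
    best_ncp, best_d = 0.0, float("inf")
    for j in range(round(max_ncp / step) + 1):
        ncp = j * step
        # Percentile of each observed z-score under a normal distribution centered at ncp.
        percentiles = [STD_NORMAL.cdf(z - ncp) for z in z_scores]
        d = ks_distance_from_uniform(percentiles)
        if d < best_d:
            best_ncp, best_d = ncp, d
    # Convert the best fitting non-centrality parameter into a power estimate.
    return best_ncp, 1.0 - STD_NORMAL.cdf(Z_CRIT - best_ncp)


# Hypothetical z-scores, symmetric around 2.0.
ncp_hat, power_hat = ks_power_estimate([1.0, 1.5, 2.0, 2.5, 3.0])
print(round(ncp_hat, 2), round(power_hat, 3))
```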

Observed Power Estimation Method 4: P-Uniform

van Assen, van Aert, and Wicherts (2014) developed another method to estimate observed power. Their method is based on the use of the gamma distribution. Like the pcurve method, this method relies on the fact that observed test-statistics should follow a uniform distribution when a potential non-centrality parameter matches the true non-centrality parameter. P-uniform transforms the probabilities given a potential non-centrality parameter with a negative log-function (-log[x]). These values are summed. When probabilities form a uniform distribution, the sum of the log-transformed probabilities matches the number of studies. Thus, the value with the smallest absolute discrepancy between the sum of negative log-transformed percentages and the number of studies provides the estimate of observed power.
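The logic of matching the sum of negative log probabilities to the number of studies can be sketched in the same way. The snippet below is an illustration under simplifying assumptions, not the p-uniform implementation: it uses unconditional upper-tail probabilities of z-scores (no selection for significance), a simple grid search, and hypothetical function names. It relies on the fact that the expected value of -log(U) is 1 when U is uniform on (0, 1).

```python
from math import log
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)


def neg_log_sum(z_scores, ncp):
    """Sum of -log(upper-tail probability) of each z-score under N(ncp, 1).

    Assumes z-scores are capped (e.g. at 5) so the tail probability never underflows to 0.
    """
    return sum(-log(1.0 - STD_NORMAL.cdf(z - ncp)) for z in z_scores)


def log_sum_power_estimate(z_scores, step=0.01, max_ncp=6.0):
    """Pick the grid value whose negative log sum is closest to the number of studies."""
    k = len(z_scores)
    grid = [j * step for j in range(round(max_ncp / step) + 1)]
    best_ncp = min(grid, key=lambda ncp: abs(neg_log_sum(z_scores, ncp) - k))
    return best_ncp, 1.0 - STD_NORMAL.cdf(Z_CRIT - best_ncp)


# Hypothetical z-scores (illustration only).
example_z = [1.0, 1.5, 2.0, 2.5, 3.0]
ncp_hat, power_hat = log_sum_power_estimate(example_z)
print(round(ncp_hat, 2), round(power_hat, 3))
```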

Observed Power Estimation Method 5: Averaging Standard Normal Non-Centrality Parameter

In addition to these existing methods, I introduce two novel estimation methods. The first new method converts observed test statistics into one-sided p-values. These p-values are then transformed into z-scores. This approach has a long tradition in meta-analysis that was developed by Stouffer et al. (1949). It was popularized by Rosenthal during the early days of meta-analysis (Rosenthal, 1979). Transformation of probabilities into z-scores makes it easy to aggregate probabilities because z-scores follow a symmetrical distribution. The average of these z-scores can be used as an estimate of the actual non-centrality parameter. The average z-score can then be used to estimate true power. This approach avoids the problem that arises when power estimates are averaged directly, namely that observed power has a skewed distribution. Thus, it should provide an unbiased estimate of true power when power is homogeneous across studies.
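A minimal sketch of the average z-score approach is shown below. It assumes one-sided p-values as input, a criterion of p < .025 (one-tailed), and a cap of 5 on z-scores; the function name and the example p-values are hypothetical.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # p < .025 (one-tailed)


def power_from_mean_z(one_sided_p_values, cap=5.0):
    """Convert one-sided p-values to z-scores, average them, and turn the mean into power."""
    z_scores = [min(STD_NORMAL.inv_cdf(1.0 - p), cap) for p in one_sided_p_values]
    mean_z = mean(z_scores)  # estimate of the non-centrality parameter
    return mean_z, 1.0 - STD_NORMAL.cdf(Z_CRIT - mean_z)


# Hypothetical one-sided p-values from four studies (illustration only).
mean_z, power_hat = power_from_mean_z([0.20, 0.04, 0.01, 0.003])
print(round(mean_z, 2), round(power_hat, 3))
```

When every p-value sits exactly at the criterion (p = .025, one-tailed), the mean z-score equals the critical value and the estimate is 50% power.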

Observed Power Estimation Method 6: Yuan-Maxwell Correction of Average Observed Power

Yuan and Maxwell (2005) demonstrated that a simple average of observed power is systematically biased. However, a simple average avoids the problems of transforming the data and can produce tighter estimates than the median method. Therefore I explored whether it is possible to apply a correction to the simple average. The correction is based on Yuan and Maxwell’s (2005) mathematically derived formula for systematic bias. After averaging observed power, Yuan and Maxwell’s formula for bias is used to correct the estimate for systematic bias. The only problem with this approach is that bias is a function of true power. However, as observed power becomes an increasingly good estimator of true power in the long run, the bias correction will also become increasingly better at correcting the right amount of bias.

The Yuan-Maxwell correction approach is particularly promising for meta-analysis of heterogeneous sets of studies such as sets of diverse studies in a journal. The main advantage of this method is that averaging of power makes no assumptions about the distribution of power across different studies (Schimmack, 2012). The main limitation of averaging power was the systematic bias, but Yuan and Maxwell’s formula makes it possible to reduce this systematic bias, while maintaining the advantage of having a method that can be applied to heterogeneous sets of studies.
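The text does not spell out the correction formula, so the following is only a sketch of how such a correction could work under a stated assumption: a one-sided z-test with known standard deviation. In that case observed power is Φ(Z − z_crit) with Z ~ N(δ, 1), its expected value is Φ((δ − z_crit)/√2), and true power is Φ(δ − z_crit). Inverting the expected value gives a corrected estimate. The function names are hypothetical, and the correction actually used for the results may differ.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_observed_power(true_power):
    """Expected observed power for a one-sided z-test with the given true power."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(true_power) / sqrt(2))


def corrected_power(mean_observed_power):
    """Invert the expectation above; the input must lie strictly between 0 and 1."""
    return STD_NORMAL.cdf(sqrt(2) * STD_NORMAL.inv_cdf(mean_observed_power))


# Bias pulls the average toward 50%: upward below 50% power, downward above it.
for p in (0.2, 0.5, 0.8):
    print(p, round(expected_observed_power(p), 3), round(corrected_power(expected_observed_power(p)), 3))
```

Under this assumption the bias stays below 10 percentage points across the range of true power, in line with the magnitude mentioned earlier.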

RESULTS

Homogeneous Effect Sizes and Sample Sizes

The first simulation used 100 effect sizes ranging from .01 to 1.00 and 50 sample sizes ranging from 11 to 60 participants per condition (Ns = 22 to 120), yielding 5000 different populations of studies. The true power of these studies was determined on the basis of the effect size, sample size, and the criterion p < .025 (one-tailed), which is equivalent to .05 (two-tailed). Sample sizes were chosen so that average power across the 5,000 studies was 50%. The simulation drew 10 random samples from each of the 5,000 populations of studies. Each sample of a study simulated a between-subject design with the given population effect size and sample size. The results were stored as one-tailed p-values. For the meta-analysis p-values were converted into z-scores. To avoid biases due to extreme outliers, z-scores greater than 5 were set to 5 (observed power = .999).
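The design of this simulation can be sketched as follows. The snippet is an illustration in Python, not the code that produced the results: it replaces the simulated between-subject t-tests with a normal approximation (z-scores drawn around the non-centrality parameter d·√(n/2)), and the function names are assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # p < .025 (one-tailed)


def true_power(d, n_per_group):
    """Approximate power of a two-group comparison via the normal approximation."""
    ncp = d * sqrt(n_per_group / 2)
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - ncp)


def simulate_z_scores(d, n_per_group, k=10, rng=None, cap=5.0):
    """Draw k z-scores for one population of studies; extreme values are set to the cap."""
    rng = rng or random.Random()
    ncp = d * sqrt(n_per_group / 2)
    return [min(rng.gauss(ncp, 1.0), cap) for _ in range(k)]


effect_sizes = [i / 100 for i in range(1, 101)]  # d = .01 to 1.00
sample_sizes = range(11, 61)                     # n = 11 to 60 per condition
populations = [(d, n) for d in effect_sizes for n in sample_sizes]
print(len(populations), "populations of studies")

# One sample of 10 studies from a single population (seeded for reproducibility).
print(simulate_z_scores(0.5, 40, rng=random.Random(1)))
```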

The six estimation methods were then used to compute observed power on the basis of samples of 10 studies. The following figures show observed power as a function of true power. The green lines show the 95% confidence interval for different levels of true power. The figure also includes red dashed lines for a value of 50% power. Studies with more than 50% observed power would be significant. Studies with less than 50% observed power would be non-significant. The figures also include a blue line for 80% true power. Cohen (1988) recommended that researchers should aim for a minimum of 80% power. It is instructive how accurate estimation methods are in evaluating whether a set of studies met this criterion.

The histogram shows the distribution of true power across the 5,000 populations of studies.

The histogram shows that the simulation covers the full range of power. It also shows that high-powered studies are overrepresented because moderate to large effect sizes can achieve high power for a wide range of sample sizes. The distribution is not important for the evaluation of different estimation methods and benefits all estimation methods equally because observed power is a good estimator of true power when true power is close to the maximum (Yuan & Maxwell, 2005).

The next figure shows scatterplots of observed power as a function of true power. Values above the diagonal indicate that observed power overestimates true power. Values below the diagonal show that observed power underestimates true power.

Visual inspection of the plots suggests that all methods provide unbiased estimates of true power. Another observation is that the count of significant results provides the least accurate estimates of true power. The reason is simply that aggregation of dichotomous variables requires a large number of observations to approximate true power. The third observation is that visual inspection provides little information about the relative accuracy of the other methods. Finally, the plots show how accurate observed power estimates are in meta-analysis of 10 studies. When true power is 50%, estimates very rarely exceed 80%. Similarly, when true power is above 80%, observed power is never below 50%. Thus, observed power can be used to examine whether a set of studies met Cohen’s recommended guidelines to conduct studies with a minimum of 80% power. If observed power is 50%, it is nearly certain that the studies did not have the recommended 80% power.

To examine the relative accuracy of different estimation methods quantitatively, I computed bias scores (observed power – true power). As bias can overestimate and underestimate true power, the standard deviation of these bias scores can be used to quantify the precision of various estimation methods. In addition, I present the mean to examine whether a method has large sample accuracy (i.e. the bias approaches zero as the number of simulations increases). I also present the percentage of studies with no more than 20% points bias. Although 20% bias may seem large, it is not important to estimate power with very high precision. When observed power is below 50%, it suggests that a set of studies was underpowered even if the observed power estimate is an underestimation.
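The three summary statistics can be computed as in the sketch below. The input values are hypothetical, the function name is made up, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def summarize_bias(observed, true, tolerance=0.20):
    """Mean bias, SD of bias, and percentage of estimates within the tolerance."""
    bias = [o - t for o, t in zip(observed, true)]
    within = sum(abs(b) <= tolerance for b in bias) / len(bias)
    return {"mean_bias": mean(bias), "sd_bias": stdev(bias), "pct_within": 100 * within}


# Hypothetical observed and true power for four sets of studies.
observed = [0.55, 0.40, 0.90, 0.30]
true = [0.50, 0.50, 0.80, 0.55]
print(summarize_bias(observed, true))
```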

The quantitative analysis also shows no meaningful differences among the estimation methods. The more interesting question is how these methods perform under more challenging conditions when the set of studies are no longer exact replication studies with fixed power.

The next simulation simulated variation in sample sizes. For each population of studies, sample sizes were varied by multiplying a particular sample size by factors of 1 to 5.5 (1.0, 1.5, 2.0, …, 5.5). Thus, a base sample size of 40 created a range of sample sizes from 40 to 220. A base sample size of 100 created a range of sample sizes from 100 to 550. As variation in sample sizes increases the average sample size, the range of effect sizes was limited to a range from .004 to .4 and effect sizes were increased in steps of d = .004. The histogram shows the distribution of power in the 5,000 populations of studies.

The simulation covers the full range of true power, although studies with low and very high power are overrepresented.

The results are visually not distinguishable from those in the previous simulation.

The quantitative comparison of the estimation methods also shows very similar results.

In sum, all methods perform well even when true power varies as a function of variation in sample sizes. This conclusion may not generalize to more extreme simulations of variation in sample sizes, but more extreme variations in sample sizes would further increase the average power of a set of studies because the average sample size would increase as well. Thus, variation in effect sizes poses a more realistic challenge for the different estimation methods.

Heterogeneous, Normally Distributed Effect Sizes

The next simulation used a random normal distribution of true effect sizes. Effect sizes were simulated to have a reasonable but large variation. Starting effect sizes ranged from .208 to 1.000 and increased in increments of .008. Sample sizes ranged from 10 to 60 and increased in increments of 2 to create 5,000 populations of studies. For each population of studies, effect sizes were sampled randomly from a normal distribution with a standard deviation of SD = .2. Extreme effect sizes below d = -.05 were set to -.05 and extreme effect sizes above d = 1.20 were set to 1.20. The first histogram of effect sizes shows the 50,000 population effect sizes. The histogram on the right shows the distribution of true power for the 5,000 sets of 10 studies.
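Sampling heterogeneous effect sizes with truncation at the stated limits can be sketched as follows. The function name and the seed are assumptions made for illustration.

```python
import random


def sample_effect_sizes(mean_d, k=10, sd=0.2, low=-0.05, high=1.20, rng=None):
    """Draw k population effect sizes from a normal distribution and clamp extreme values."""
    rng = rng or random.Random()
    return [min(max(rng.gauss(mean_d, sd), low), high) for _ in range(k)]


# Ten population effect sizes around a starting effect size of d = .5.
print([round(d, 2) for d in sample_effect_sizes(0.5, rng=random.Random(0))])
```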

The plots of observed and true power show that the estimation methods continue to perform rather well even when population effect sizes are heterogeneous and normally distributed.

The quantitative comparison suggests that puniform has some problems with heterogeneity. More detailed studies are needed to examine whether this is a persistent problem for puniform, but given the good performance of the other methods it seems easier to use these methods.

Heterogeneous, Skewed Normal Effect Sizes

The next simulation puts the estimation methods to a stronger challenge by introducing skewed distributions of population effect sizes. For example, a set of studies may contain mostly small to moderate effect sizes, but a few studies examined large effect sizes. To simulate skewed effect size distributions, I used the rsnorm function of the fGarch package. The function creates a random distribution with a specified mean, standard deviation, and skew. I set the mean to d = .2, the standard deviation to SD = .2, and skew to 2. The histograms show the distribution of effect sizes and the distribution of true power for the 5,000 sets of studies (k = 10).

This time the results show differences in the ability of the estimation methods to deal with skewed heterogeneity. The percentage of significant results is unbiased, but is imprecise due to the problem of averaging dichotomous variables. The other methods show systematic deviations from the 95% confidence interval around the true parameter. Visual inspection suggests that the Yuan-Maxwell correction method has the best fit.

This impression is confirmed in quantitative analyses of bias. The quantitative comparison confirms major problems with the puniform estimation method. It also shows that the median, p-curve, and the average z-score method have the same slight positive bias. Only the Yuan-Maxwell corrected average power shows little systematic bias.

To examine biases in more detail, the following graphs plot bias as a function of true power. These plots can reveal that a method may have little average bias, but has different types of bias for different levels of power. The results show little evidence of systematic bias for the Yuan-Maxwell corrected average of power.

The following analyses examined bias separately for simulation with less or more than 50% true power. The results confirm that all methods except the Yuan-Maxwell correction underestimate power when true power is below 50%. In contrast, most estimation methods overestimate true power when true power is above 50%. The exception is puniform which still underestimated true power. More research needs to be done to understand the strange performance of puniform in this simulation. However, even if p-uniform could perform better, it is likely to be biased with skewed distributions of effect sizes because it assumes a fixed population effect size.

Conclusion

This investigation introduced and compared different methods to estimate true power for a set of studies. All estimation methods performed well when a set of studies had the same true power (exact replication studies), when effect sizes were homogeneous and sample sizes varied, and when effect sizes were normally distributed and sample sizes were fixed. However, most estimation methods were systematically biased when the distribution of effect sizes was skewed. In this situation, most methods run into problems because the percentage of significant results is a function of the power of individual studies rather than the average power.
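A small numerical illustration of why skew matters: the expected percentage of significant results is the mean of the individual studies' power, which is not the same as the power implied by the average non-centrality parameter. The values below are hypothetical (eight weak studies and two strong ones), and the helper name is an assumption.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)


def power_from_ncp(ncp):
    """One-sided power for a z-test with the given non-centrality parameter."""
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - ncp)


# Hypothetical skewed set: eight studies with ncp = 1 and two studies with ncp = 4.
ncps = [1.0] * 8 + [4.0] * 2

mean_of_powers = mean(power_from_ncp(ncp) for ncp in ncps)  # expected success rate
power_at_mean_ncp = power_from_ncp(mean(ncps))              # what a single-parameter method targets

print(round(mean_of_powers, 3), round(power_at_mean_ncp, 3))
```

In this example a method that fits a single non-centrality parameter near the average would overestimate the expected success rate by about three percentage points.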

The results of these analyses suggest that the R-Index (Schimmack, 2014) can be improved by simply averaging power and then applying the Yuan-Maxwell correction. However, it is important to realize that the median method tends to overestimate power when power is greater than 50%. This makes it even more difficult for the R-Index to produce an estimate of low power when power is actually high. The next step in the investigation of observed power is to examine how different methods perform in unrepresentative (biased) sets of studies. In this case, the percentage of significant results is highly misleading. For example, Sterling et al. (1995) found success rates of about 95%, which would suggest that studies had 95% power. However, publication bias and questionable research practices create a bias in the sample of studies that are being published in journals. The question is whether other observed power estimates can reveal bias and can produce accurate estimates of the true power in a set of studies.

Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 141–167

This blog post provides an accessible introduction to the concept of observed power. Most of the statistical points are based on Yuan and Maxwell’s (2005) excellent but highly technical article about post-hoc power. This blog post tries to explain statistical concepts in more detail and uses simulation studies to illustrate important points.

What is Power?

Power is defined as the long-run probability of obtaining significant results in a series of exact replication studies. For example, 50% power means that a set of 100 studies is expected to produce 50 significant results and 50 non-significant results. The exact numbers in an actual set of studies will vary as a function of random sampling error, just like 100 coin flips are not always going to produce a 50:50 split of heads and tails. However, as the number of studies increases, the percentage of significant results will be ever closer to the power of a specific study.
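The coin-flip analogy can be made concrete with the binomial distribution. The sketch below (hypothetical function names) computes the expected number of significant results and its standard deviation, and shows how the standard deviation of the percentage shrinks as the number of studies grows.

```python
from math import sqrt


def significant_count_summary(power, k):
    """Expected number of significant results in k studies and its binomial SD."""
    return k * power, sqrt(k * power * (1 - power))


def sd_of_percentage(power, k):
    """Standard deviation of the percentage of significant results across k studies."""
    return 100 * sqrt(power * (1 - power) / k)


print(significant_count_summary(0.5, 100))  # 50 expected, give or take 5
for k in (10, 100, 1000, 10000):
    print(k, round(sd_of_percentage(0.5, k), 2))
```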

A priori power

Power analysis can be useful for the planning of sample sizes before a study is being conducted. A power analysis that is being conducted before a study is called a priori power analysis (before = a priori). Power is a function of three parameters: the actual effect size, sampling error, and the criterion value that needs to be exceeded to claim statistical significance. In between-subject designs, sampling error is determined by sample size alone. In this special case, power is a function of the true effect size, the significance criterion and sample size.

The problem for researchers is that power depends on the effect size in the population (e.g., the true correlation between height and weight amongst Canadians in 2015). The population effect size is sometimes called the true effect size. Imagine that somebody would actually obtain data from everybody in a population. In this case, there is no sampling error and the correlation is the true correlation in the population. However, typically researchers use much smaller samples and the goal is to estimate the correlation in the population on the basis of a smaller sample. Unfortunately, power depends on the correlation in the population, which is unknown to a researcher planning a study. Therefore, researchers have to estimate the true effect size to compute an a priori power analysis.

Cohen (1988) developed general guidelines for the estimation of effect sizes. For example, in studies that compare the means of two groups, a standardized difference of half a standard deviation (e.g., 7.5 IQ points on an IQ scale with a standard deviation of 15) is considered a moderate effect. Researchers who assume that their predicted effect has a moderate effect size can use d = .5 for an a priori power analysis. Assuming that they want to claim significance with the standard criterion of p < .05 (two-tailed), they would need N = 210 (n = 105 per group) to have a 95% chance to obtain a significant result (G*Power). I do not discuss a priori power analysis further because this blog post is about observed power. I merely introduced a priori power analysis to highlight the difference between a priori power analysis and a posteriori power analysis, which is the main topic of Yuan and Maxwell’s (2005) article.
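The sample size calculation can be approximated with the standard normal formula n = 2·((z_(1−α/2) + z_power)/d)² per group. The sketch below uses this approximation (function name hypothetical); it yields 104 per group, one less than the 105 per group from the exact calculation based on the non-central t-distribution.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.95, alpha=0.05):
    """Approximate per-group sample size for a two-tailed, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_power = z.inv_cdf(power)          # quantile that corresponds to the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


print(n_per_group(0.5))  # normal approximation for d = .5, 95% power
```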

A Posteriori Power Analysis: Observed Power

Observed power computes power after a study or several studies have been conducted. The key difference between a priori and a posteriori power analysis is that a posteriori power analysis uses the observed effect size in a study as an estimate of the population effect size. For example, assume a researcher found a correlation of r = .54 in a sample of N = 200 Canadians. Instead of guessing the effect size, the researcher uses the correlation observed in this sample as an estimate of the correlation in the population. There are several reasons why it might be interesting to conduct a power analysis after a study. First, the power analysis might be used to plan a follow up or replication study. Second, the power analysis might be used to examine whether a non-significant result might be the result of insufficient power. Third, observed power is used to examine whether a researcher used questionable research practices to produce significant results in studies that had insufficient power to produce significant results.

In sum, observed power is an estimate of the power of a study based on the observed effect size in a study. It is therefore not power that is being observed, but the effect size that is being observed. However, because the other parameters that are needed to compute power are known (sample size, significance criterion), the observed effect size is the only parameter that needs to be observed to estimate power. However, it is important to realize that observed power does not mean that power was actually observed. Observed power is still an estimate based on an observed effect size because power depends on the effect size in the population (which remains unobserved) and the observed effect size in a sample is just an estimate of the population effect size.

A Posteriori Power Analysis after a Single Study

Yuan and Maxwell (2005) examined the statistical properties of observed power. The main question was whether it is meaningful to compute observed power based on the observed effect size in a single study.

The first statistical analysis of an observed mean difference is to examine whether the study produced a significant result. For example, the study may have examined whether music lessons produce an increase in children’s IQ. The study had 95% power to produce a significant difference with N = 210 participants and a moderate effect size (d = .5; 7.5 IQ points).

One possibility is that the study actually produced a significant result. For example, the observed IQ difference was 5 IQ points. This is less than the expected difference of 7.5 points and corresponds to a standardized effect size of about d = .3. Yet, the t-test shows a significant difference between the two groups, t(208) = 2.17, p = .03. The p-value shows that random sampling error alone would produce differences of this magnitude or more in only about 3 out of 100 studies. Importantly, the p-value only makes it likely that the intervention contributed to the mean difference, but it does not provide information about the size of the effect. The true effect size may be closer to the expected effect size of 7.5 or it may be closer to 0. The true effect size remains unknown even after the mean difference between the two groups is observed. Yet, the study provides some useful information about the effect size. Whereas the a priori power analysis relied exclusively on guess-work, observed power uses the effect size that was observed in a reasonably large sample of 210 participants. Everything else being equal, effect size estimates based on 210 participants are more likely to match the true effect size than those based on 0 participants.

The observed effect size can be entered into a power analysis to compute observed power. In this example, observed power with an effect size of d = .3 and N = 210 (n = 105 per group) is 58%. One question examined by Yuan and Maxwell (2005) is whether it can be useful to compute observed power after a study produced a significant result.

The other question is whether it can be useful to compute observed power when a study produced a non-significant result. For example, assume that the estimate of d = .5 is overly optimistic and that the true effect size of music lessons on IQ is a more modest 1.5 IQ points (d = .10, one-tenth of a standard deviation). The actual mean difference that is observed after the study happens to match the true effect size exactly. The difference between the two groups is not statistically significant, t(208) = .72, p = .47. A non-significant result is difficult to interpret. On the one hand, the means trend in the right direction. On the other hand, the mean difference is not statistically significant. The p-value suggests that a mean difference of this magnitude would occur in every second study by chance alone even if music intervention had no effect on IQ at all (i.e., the true effect size is d = 0, the null-hypothesis is true). Statistically, the correct conclusion is that the study provided insufficient information regarding the influence of music lessons on IQ. In other words, assuming that the true effect size is closer to the observed effect size in a sample (d = .1) than to the effect size that was used to plan the study (d = .5), the sample size was insufficient to produce a statistically significant result. Computing observed power merely provides some quantitative information to reinforce this correct conclusion. An a posteriori power analysis with d = .1 and N = 210 yields an observed power of 11%. This suggests that the study had insufficient power to produce a significant result, if the effect size in the sample matches the true effect size.
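The two observed power values in this example (58% for d = .3 and 11% for d = .1, both with n = 105 per group) can be reproduced approximately with a normal approximation to the two-tailed t-test. The function name is hypothetical, and an exact calculation with the non-central t-distribution would differ slightly in the decimals.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def observed_power_two_groups(d, n_per_group, alpha=0.05):
    """Two-tailed observed power, using the observed d as if it were the true effect size."""
    ncp = d * sqrt(n_per_group / 2)           # expected test statistic
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper = 1.0 - STD_NORMAL.cdf(z_crit - ncp)  # significant in the predicted direction
    lower = STD_NORMAL.cdf(-z_crit - ncp)       # significant in the opposite direction
    return upper + lower


for d in (0.3, 0.1):
    print(d, f"{observed_power_two_groups(d, 105):.0%}")
```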

Yuan and Maxwell (2005) discuss false interpretations of observed power. One false interpretation is that a significant result implies that a study had sufficient power. Power is a function of the true effect size and observed power relies on effect sizes in a sample. 50% of the time, effect sizes in a sample overestimate the true effect size and observed power is inflated. It is therefore possible that observed power is considerably higher than the actual power of a study.

Another false interpretation is that low power in a study with a non-significant result means that the hypothesis is correct, but that the study had insufficient power to demonstrate it. The problem with this interpretation is that there are two potential reasons for a non-significant result. One is that a study had insufficient power to show a significant result when an effect is actually present (this is called the type-II error). The second possible explanation is that the null-hypothesis is actually true (there is no effect). A non-significant result cannot distinguish between these two explanations. Yet, it remains true that the study had insufficient power to test these hypotheses against each other. Even if a study had 95% power to show an effect if the true effect size is d = .5, it can have insufficient power if the true effect size is smaller. In the example, power decreased from 95% assuming d = .5, to 11% assuming d = .1.

Yuan and Maxwell’s Demonstration of Systematic Bias in Observed Power

Yuan and Maxwell focus on a design in which a sample mean is compared against a population mean and the standard deviation is known. To modify the original example, a researcher could recruit a random sample of children, do a music lesson intervention and test the IQ after the intervention against the population mean of 100 with the population standard deviation of 15, rather than relying on the standard deviation in a sample as an estimate of the standard deviation. This scenario has some advantages for mathematical treatments because it uses the standard normal distribution. However, all conclusions can be generalized to more complex designs. Thus, although Yuan and Maxwell focus on an unusual design, their conclusions hold for more typical designs such as the comparison of two groups that use sample variances (standard deviations) to estimate the variance in a population (i.e., pooling observed variances in both groups to estimate the population variance).

Yuan and Maxwell (2005) also focus on one-tailed tests, although the default criterion in actual studies is a two-tailed test. Once again, this is not a problem for their conclusions because the two-tailed criterion value for p = .05 is equivalent to the one-tailed criterion value for p = .025 (.05 / 2). For the standard normal distribution, the value is z = 1.96. This means that an observed z-score has to exceed a value of 1.96 to be considered significant.

To illustrate this with an example, assume that the IQ of 100 children after a music intervention is 103. After subtracting the population mean of 100 and dividing by the standard deviation of 15, the effect size is d = 3/15 = .2. Sampling error is defined by 1 / sqrt (n). With a sample size of n = 100, sampling error is .10. The test-statistic (z) is the ratio of the effect size and sampling error (.2 / .1) = 2. A z-score of 2 is just above the critical value of 1.96, and would produce a significant result, z = 2, p = .023 (one-tailed; remember criterion is .025 one-tailed to match .05 two-tailed). Based on this result, a researcher would be justified to reject the null-hypothesis (there is no effect of the intervention) and to claim support for the hypothesis that music lessons lead to an increase in IQ. Importantly, this hypothesis makes no claim about the true effect size. It merely states that the effect is greater than zero. The observed effect size in the sample (d = .2) provides an estimate of the actual effect size but the true effect size can be smaller or larger than the effect size in the sample. The significance test merely rejects the possibility that the effect size is 0 or less (i.e., music lessons lower IQ).
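The arithmetic of this example can be retraced in a short Python sketch. The function name and its return layout are only illustrative; the numbers are the ones from the example.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """One-tailed z-test of a sample mean against a known population mean and SD."""
    d = (sample_mean - pop_mean) / pop_sd     # standardized effect size
    sampling_error = 1 / sqrt(n)              # sampling error of d when the SD is known
    z = d / sampling_error                    # test statistic
    p_one_tailed = 1 - NormalDist().cdf(z)    # area to the right of z under the null
    return d, sampling_error, z, p_one_tailed

d, sampling_error, z, p = one_sample_z_test(sample_mean=103, pop_mean=100, pop_sd=15, n=100)
print(round(d, 2), round(sampling_error, 2), round(z, 2), round(p, 3))
```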

Entering a non-centrality parameter of 3 for a generic z-test in G*Power yields the following illustration of a non-central distribution.

Illustration of non-central distribution using G*Power output

The red curve shows the standard normal distribution for the null-hypothesis. With d = 0, the non-centrality parameter is also 0 and the standard normal distribution is centered over zero.

The blue curve shows the non-central distribution. It is the same standard normal distribution, but now it is centered over z = 3. The distribution shows how z-scores would be distributed for a set of exact replication studies, where exact replication studies are defined as studies with the same true effect size and sampling error.

The figure also illustrates power by showing the critical z-score of 1.96 with a green line. On the left side are studies where sampling error reduced the observed effect size so much that the z-score was below 1.96 and produced a non-significant result (p > .025 one-tailed, p > .05, two-tailed). On the right side are studies with significant results. The area under the curve on the left side is called the type-II error (or beta-error). The area under the curve on the right side is called power (1 – type-II error). The output shows that beta error probability is 15% and Power is 85%.
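The two areas in the figure can be reproduced without G*Power. The following is a small sketch in Python; the function name is hypothetical, and the critical value is fixed at 1.96 as in the text.

```python
from statistics import NormalDist

def power_one_tailed(ncp, crit=1.96):
    """Area to the right of crit under a normal curve centered over ncp (sd = 1)."""
    return 1 - NormalDist(mu=ncp, sigma=1).cdf(crit)

power = power_one_tailed(ncp=3)  # area right of the green line
beta = 1 - power                 # area left of the green line (type-II error)
print(round(power, 2), round(beta, 2))
```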

In sum, the formula

states that power for a given true effect size is the area under the curve to the right side of a critical z-score for a standard normal distribution that is centered over the non-centrality parameter that is defined by the ratio of the true effect size over sampling error.

[personal comment: I find it odd that sampling error is used on the right side of the formula but not on the left side of the formula. Power is a function of the non-centrality parameter and not just the effect size. Thus I would have included sqrt (n) also on the left side of the formula].

Because the formula relies on the true effect size, it specifies true power given the (unknown) population effect size. To use it for observed power, power has to be estimated based on the observed effect size in a sample.
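As a rough illustration of this substitution, the sketch below plugs an observed effect size into the same area-under-the-curve computation. The function name is made up, and the values d = .2 and n = 100 are taken from the music lesson example.

```python
from math import sqrt
from statistics import NormalDist

def observed_power(d_observed, n, crit=1.96):
    """One-tailed observed power: the observed d stands in for the unknown true d."""
    z_observed = d_observed * sqrt(n)  # effect size divided by sampling error 1 / sqrt(n)
    return 1 - NormalDist().cdf(crit - z_observed)

obs_power = observed_power(d_observed=0.2, n=100)
print(round(obs_power, 3))
```

Because the observed z-score of 2 is only slightly above the criterion of 1.96, observed power is only slightly above 50%.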

The important novel contribution of Yuan and Maxwell (2005) was to develop a mathematical formula that relates observed power to true power and to find a mathematical formula for the bias in observed power.

The formula implies that the amount of bias is a function of the unknown population effect size. Yuan and Maxwell make several additional observations about bias. First, bias is zero when true power is 50%. The second important observation is that systematic bias is never greater than 9 percentage points. The third observation is that power is overestimated when true power is less than 50% and underestimated when true power is above 50%. The last observation has important implications for the interpretation of observed power.
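These three observations can be checked numerically. The sketch below is not Yuan and Maxwell's formula. It assumes a one-tailed z-test with known standard deviation and uses the standard identity that the expected value of Φ(a + Z) for a standard normal Z equals Φ(a / sqrt(2)); all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()

def expected_observed_power(true_power):
    """Mean of observed power across exact replications (one-tailed z-test assumed)."""
    shift = N.inv_cdf(true_power)   # non-centrality parameter minus critical value
    return N.cdf(shift / sqrt(2))   # E[Phi(shift + Z)] for a standard normal Z

def bias(true_power):
    """Expected observed power minus true power."""
    return expected_observed_power(true_power) - true_power

grid = [i / 1000 for i in range(1, 1000)]        # true power from .001 to .999
max_abs_bias = max(abs(bias(p)) for p in grid)   # largest systematic bias on the grid
print(round(bias(0.2), 3), round(bias(0.5), 3), round(bias(0.8), 3), round(max_abs_bias, 3))
```

Under these assumptions the bias is positive below 50% true power, zero at 50%, negative above 50%, and stays under 9 percentage points.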

Observed power of 50% implies that the test statistic matches the criterion value. For example, if the criterion is p < .05 (two-tailed), 50% observed power is equivalent to p = .05. If observed power is less than 50%, a study produced a non-significant result. A posteriori power analysis might suggest that observed power is only 40%. This finding suggests that the study was underpowered and that a more powerful study might produce a significant result. Systematic bias implies that the estimate of 40% is more likely to be an overestimation than an underestimation. As a result, bias does not undermine the conclusion. Rather, observed power is conservative because the actual power is likely to be even less than 40%.

The alternative scenario is that observed power is greater than 50%, which implies a significant result. In this case, the significant result might be used to argue that a study had sufficient power. A posteriori power analysis might show, however, that observed power is only 60%. This would indicate that there was a relatively high chance to end up with a non-significant result. However, systematic bias implies that observed power is more likely to underestimate true power than to overestimate it. Thus, true power is likely to be higher. Again, observed power is conservative when it comes to the interpretation of power for studies with significant results. This would suggest that systematic bias is not a serious problem for the use of observed power. Moreover, the systematic bias is never more than 9 percentage points. Thus, when observed power is 60%, correcting for systematic bias cannot raise the estimate of true power above 70%.

In sum, Yuan and Maxwell (2005) provided a valuable analysis of observed power and demonstrated analytically the properties of observed power.

Practical Implications of Yuan and Maxwell’s Findings

Based on their analyses, Yuan and Maxwell (2005) draw the following conclusions in the abstract of their article.

Using analytical, numerical, and Monte Carlo approaches, our results show that the estimated power does not provide useful information when the true power is small. It is almost always a biased estimator of the true power. The bias can be negative or positive. Large sample size alone does not guarantee the post hoc power to be a good estimator of the true power.

Unfortunately, other scientists often only read the abstract, especially when the article contains mathematical formulas that applied scientists find difficult to follow. As a result, Yuan and Maxwell’s (2005) article has been cited mostly as evidence that observed power is a useless concept. I think this conclusion is justified based on Yuan and Maxwell’s abstract, but it does not follow from Yuan and Maxwell’s formula of bias. To make this point, I conducted a simulation study that paired 25 sample sizes (n = 10 to n = 250) and 20 effect sizes (d = .05 to d = 1) to create 500 non-centrality parameters. Observed effect sizes were randomly generated for a between-subject design with two groups (df = n*2 – 2). For each non-centrality parameter, two simulations were conducted for a total of 1000 studies with heterogeneous effect sizes and sample sizes (standard errors). The results are presented in a scatterplot with true power on the x-axis and observed power on the y-axis. The blue line shows the prediction of observed power from true power. The red curve shows the biased prediction based on Yuan and Maxwell’s bias formula.
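A simulation of this kind can be sketched in Python as follows. This is not the original simulation code. It assumes evenly spaced sample sizes (steps of 10 per group) and effect sizes (steps of .05), and it replaces the t-test with a normal approximation; all names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

N = NormalDist()
CRIT = N.inv_cdf(0.975)  # two-tailed criterion for alpha = .05

def two_tailed_power(ncp):
    """Two-tailed power of a z-test with the given non-centrality parameter."""
    return N.cdf(ncp - CRIT) + N.cdf(-ncp - CRIT)

def simulate_studies(reps=2, seed=1):
    """Return (true power, observed power) pairs for every design and replication."""
    rng = random.Random(seed)
    sample_sizes = [10 * i for i in range(1, 26)]     # assumed: n = 10, 20, ..., 250 per group
    effect_sizes = [0.05 * j for j in range(1, 21)]   # assumed: d = .05, .10, ..., 1
    studies = []
    for n in sample_sizes:
        for d in effect_sizes:
            ncp = d * sqrt(n / 2)                     # two groups with n observations each
            for _ in range(reps):
                z_observed = rng.gauss(ncp, 1)        # add random sampling error
                studies.append((two_tailed_power(ncp), two_tailed_power(z_observed)))
    return studies

studies = simulate_studies()
print(len(studies))
```

Plotting the first element of each pair against the second gives a scatterplot of true power against observed power.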

The most important observation is that observed power varies widely as a function of random sampling error in the observed effect sizes. In comparison, the systematic bias is relatively small. Moreover, observed power at the extremes clearly distinguishes between low powered (< 25%) and high powered (> 80%) studies. Observed power is particularly informative when it is close to the maximum value of 100%. Thus, observed power of 99% or more strongly suggests that a study had high power. The main problem for a posteriori power analysis is that observed effect sizes are imprecise estimates of the true effect size, especially in small samples. The next section examines the consequences of random sampling error in more detail.

Standard Deviation of Observed Power

Awareness has been increasing that point estimates of statistical parameters can be misleading. For example, an effect size of d = .8 suggests a strong effect, but if this effect size was observed in a small sample, the effect size is strongly influenced by sampling error. One solution to this problem is to report a confidence interval around the observed effect size. A common choice is the 95% confidence interval, which extends 1.96 (approximately 2) times the sampling error on either side of the observed effect size. With sampling error of .4, the confidence interval could range all the way from 0 to 1.6. As a result, it would be misleading to claim that an effect size of d = .8 in a small sample suggests that the true effect size is strong. A 95% confidence interval means that there is a 95% probability that the population effect size is contained in the 95% confidence interval around the (biased) effect size in a sample.

To illustrate the use of confidence intervals, I computed the confidence interval for the example of music training and IQ in children. The example assumes that the IQ of 100 children after a music intervention is 103. After subtracting the population mean of 100 and dividing by the standard deviation of 15, the effect size is d = 3/15 = .2. Sampling error is defined by 1 / sqrt (n). With a sample size of n = 100, sampling error is .10. To compute a 95% confidence interval, sampling error is multiplied by the z-score that captures 95% of a standard normal distribution, which is 1.96. As sampling error is .10, the values are -.196 and .196. Given an observed effect size of d = .2, the 95% confidence interval ranges from .2 – .196 = .004 to .2 + .196 = .396.
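The same computation in Python, as a minimal sketch (the function name is only illustrative, and the standard deviation is assumed to be known as in the example):

```python
from math import sqrt
from statistics import NormalDist

def effect_size_ci(d, n, level=0.95):
    """Confidence interval around an observed effect size when the SD is known."""
    sampling_error = 1 / sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% interval
    return d - z * sampling_error, d + z * sampling_error

ci_lower, ci_upper = effect_size_ci(d=0.2, n=100)
print(round(ci_lower, 3), round(ci_upper, 3))
```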

A confidence interval can be used for significance testing by examining whether the confidence interval includes 0. If the 95% confidence interval does not include zero, it is possible to reject the hypothesis that the effect size in the population is 0, which is equivalent to rejecting the null-hypothesis. In the example, the confidence interval ends at d = .004, which implies that the null-hypothesis can be rejected. At the upper end, the confidence interval ends at d = .396. This implies that the empirical results also would reject hypotheses that the population effect size is moderate (d = .5) or strong (d = .8).

Confidence intervals around effect sizes are also useful for a posteriori power analysis. Yuan and Maxwell (2005) demonstrated that the confidence interval of observed power is defined by the observed power of the effect sizes that define the confidence interval of effect sizes.

The figure below illustrates the observed power for the lower bound of the confidence interval in the example of music lessons and IQ (d = .004).

The figure shows that the non-central distribution (blue) and the central distribution (red) nearly perfectly overlap. The reason is that the effect size at the lower bound (d = .004) is just slightly above the effect size of the central distribution (d = .000). When the null-hypothesis is true, power equals the type-I error rate (2.5%) because 2.5% of studies will produce a significant result by chance alone and chance is the only factor that produces significant results. When the true effect size is d = .004, power increases to 2.74 percent.

Remember that this power estimate is based on the lower limit of a 95% confidence interval around the observed power estimate of 50%. Thus, this result means that there is a 95% probability that the true power of the study is at least about 2.5% when observed power is 50%.

The next figure illustrates power for the upper limit of the 95% confidence interval.

In this case, the non-central distribution and the central distribution overlap very little. Only 2.5% of the non-central distribution is on the left side of the criterion value, and power is 97.5%. This finding means that there is a 95% probability that true power is not greater than 97.5% when observed power is 50%.

Taken together, these results show that the 95% confidence interval around an observed power estimate of 50% ranges from 2.5% to 97.5%. As this interval covers pretty much the full range of possible values, it follows that observed power of 50% in a single study provides virtually no information about the true power of a study. True power can be anywhere between 2.5% and 97.5%.
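The interval can be retraced with a short Python sketch. It assumes the one-tailed z-test with a criterion of 1.96 used throughout this section and computes power at the lower and upper bounds of the effect size confidence interval; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()
CRIT = N.inv_cdf(0.975)  # about 1.96

def observed_power_interval(d, n):
    """Observed power at the point estimate and at both 95% CI bounds of d."""
    z_observed = d * sqrt(n)                           # effect size over sampling error
    point = 1 - N.cdf(CRIT - z_observed)
    lower = 1 - N.cdf(CRIT - (z_observed - CRIT))      # power at the lower bound of d
    upper = 1 - N.cdf(CRIT - (z_observed + CRIT))      # power at the upper bound of d
    return lower, point, upper

power_lower, power_point, power_upper = observed_power_interval(d=0.2, n=100)
print(round(power_lower, 3), round(power_point, 3), round(power_upper, 3))
```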

The next figure illustrates confidence intervals for different levels of power.

The data are based on the simulation study described above. The green line is based on computation of observed power for the d-values that correspond to the 95% confidence interval around the observed (simulated) d-values.

The figure shows that confidence intervals for most observed power values are very wide. An accurate estimate of true power can only be achieved when power is high (upper right corner). But even 80% true power still has a wide confidence interval where the lower bound is below 20% observed power. Firm conclusions can only be drawn when observed power is high.

For example, when observed power is 95%, a one-sided 95% confidence interval (guarding only against underestimation) has a lower bound of 50% power. This finding would imply that observing power of 95% justifies the conclusion that the study had at least 50% power with an error rate of 5% (i.e., in 5% of the studies the true power is less than 50%).
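The 50% lower bound follows from shifting the observed z-score down by the one-sided 95% z-value (about 1.645). A minimal sketch, assuming a one-tailed z-test with a criterion of 1.96 and a hypothetical function name:

```python
from statistics import NormalDist

N = NormalDist()
CRIT = N.inv_cdf(0.975)  # about 1.96

def one_sided_lower_bound(observed_power, level=0.95):
    """Lower bound of a one-sided confidence interval for power."""
    z_observed = N.inv_cdf(observed_power) + CRIT   # z-score implied by observed power
    z_lower = z_observed - N.inv_cdf(level)         # shift down by about 1.645
    return N.cdf(z_lower - CRIT)                    # power at the lower bound

lower_bound_95 = one_sided_lower_bound(0.95)
print(round(lower_bound_95, 2))
```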

The implication is that observed power is useless unless observed power is 95% or higher.

In conclusion, consideration of the effect of random sampling error on effect size estimates provides justification for Yuan and Maxwell’s (2005) conclusion that computation of observed power provides relatively little value. However, the reason is not that observed power is a problematic concept. The reason is that observed effect sizes in underpowered studies provide insufficient information to estimate true power with any useful degree of accuracy. The same holds for the reporting of observed effect sizes that are routinely reported in research reports and for point estimates of effect sizes that are interpreted as evidence for small, moderate, or large effects. None of these statements are warranted when the confidence interval around these point estimates is taken into account. A study with d = .80 and a confidence interval of d = .01 to 1.59 does not justify the conclusion that a manipulation had a strong effect because the observed effect size is largely influenced by sampling error.

More generally, studies with large sampling error (small sample sizes) are at best able to determine the sign of a relationship. Significant positive effects are likely to be positive and significant negative effects are likely to be negative. However, the effect sizes in these studies are too strongly influenced by sampling error to provide information about the population effect size, and therefore about parameters such as power that depend on accurate estimation of population effect sizes.

Meta-Analysis of Observed Power

One solution to the problem of insufficient information in a single underpowered study is to combine the results of several underpowered studies in a meta-analysis. A meta-analysis reduces sampling error because sampling error creates random variation in effect size estimates across studies and aggregation reduces the influence of random factors. If a meta-analysis of effect sizes can produce more accurate estimates of the population effect size, it would make sense that meta-analysis can also increase the accuracy of observed power estimation.

Yuan and Maxwell (2005) discuss meta-analysis of observed power only briefly.

A problem in a meta-analysis of observed power is that observed power is not only subject to random sampling error, but also systematically biased. As a result, the average of observed power across a set of studies would also be systematically biased. However, the reason for the systematic bias is the non-symmetrical distribution of observed power when power is not 50%. To avoid this systematic bias, it is possible to compute the median. The median is unbiased because 50% of the non-central distribution is on the left side of the non-centrality parameter and 50% is on the right side of the non-centrality parameter. Thus, the median provides an unbiased estimate of the non-centrality parameter and the estimate becomes increasingly accurate as the number of studies in a meta-analysis increases.
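The difference between the mean and the median can be illustrated with a small simulation. This sketch assumes a one-tailed z-test, a true power of 20%, and 20,000 exact replications; the function name and the seed are arbitrary choices.

```python
import random
from statistics import NormalDist, mean, median

N = NormalDist()
CRIT = N.inv_cdf(0.975)  # about 1.96

def simulate_observed_power(true_power, k, seed=1):
    """Observed power in k exact replications of a study with the given true power."""
    rng = random.Random(seed)
    ncp = N.inv_cdf(true_power) + CRIT   # non-centrality parameter implied by true power
    return [N.cdf(rng.gauss(ncp, 1) - CRIT) for _ in range(k)]

draws = simulate_observed_power(true_power=0.20, k=20000)
mean_power = mean(draws)      # pulled upward by the skewed distribution
median_power = median(draws)  # stays close to the true power of .20
print(round(mean_power, 2), round(median_power, 2))
```

The median works because observed power is a monotonic transformation of the z-score, so the median of observed power is the observed power of the median z-score.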

The next figure shows the results of a simulation with the same 500 studies (25 sample sizes and 20 effect sizes) that were simulated earlier, but this time each study was simulated to be replicated 1,000 times and observed power was estimated by computing the average or the median power across the 1,000 exact replication studies.

Purple = average observed power; Orange = median observed power

The simulation shows that Yuan and Maxwell’s (2005) bias formula predicts the relationship between true power and the average of observed power. It also confirms that the median is an unbiased estimator of true power and that observed power is a good estimate of true power when the median is based on a large set of studies. However, the question remains whether observed power can estimate true power when the number of studies is smaller.

The next figure shows the results for a simulation where estimated power is based on the median observed power in 50 studies. The maximum discrepancy in this simulation was 15 percentage points. This is clearly sufficient to distinguish low powered studies (<50% power) from high powered studies (>80%).

To obtain confidence intervals for median observed power estimates, the power estimate can be converted into the corresponding non-centrality parameter of a standard normal distribution. The standard error of this estimate is the standard deviation divided by the square root of the number of studies (k), and the 95% confidence interval extends 1.96 standard errors on either side. The standard deviation of a standard normal distribution equals 1. Hence, the 95% confidence interval for a set of k studies is defined by

Lower Limit = Normal(InverseNormal(power) – 1.96 / sqrt(k))

Upper Limit = Normal(InverseNormal(power) + 1.96 / sqrt(k))
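In Python, the two limits can be computed with the standard library, where Normal corresponds to the cumulative distribution function and InverseNormal to its inverse. The function name is only illustrative.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()

def median_power_ci(power, k, z=1.96):
    """95% confidence limits for a median observed power estimate from k studies."""
    center = N.inv_cdf(power)     # InverseNormal(power)
    half_width = z / sqrt(k)      # 1.96 / sqrt(k)
    return N.cdf(center - half_width), N.cdf(center + half_width)

lower_50, upper_50 = median_power_ci(power=0.50, k=50)
print(round(lower_50, 2), round(upper_50, 2))
```

With k = 50 studies and an estimate of 50%, the limits are roughly 39% and 61%.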

Interestingly, the number of observations in a study is irrelevant. The reason is that larger samples produce smaller confidence intervals around an effect size estimate and increase power at the same time. To hold power constant, the effect size has to decrease and power decreases exponentially as effect sizes decrease. As a result, observed power estimates do not become more precise when sample sizes increase and effect sizes decrease proportionally.

The next figure shows simulated data for 1000 studies with 20 effect sizes (0.05 to 1) and 25 sample sizes (n = 10 to 250). Each study was repeated 50 times and the median value was used to estimate true power. The green lines are the 95% confidence interval around the true power value. In real data, the confidence interval would be drawn around observed power, but observed power does not provide a clear mathematical function. The 95% confidence interval around the true power values is still useful because it predicts how much observed power estimates can deviate from true power. 95% of observed power values are expected to be within the area that is defined by the lower and upper bound of the confidence interval. The figure shows that most values are within the area. This confirms that sampling error in a meta-analysis of observed power is a function of the number of studies. The figure also shows that sampling error is greatest when power is 50%. In the tails of the distribution, range restriction produces more precise estimates more quickly.

With 50 studies, the maximum absolute discrepancy is 15 percentage points. This level of precision is sufficient to draw broad conclusions about the power of a set of studies. For example, any median observed power estimate below 65% is sufficient to reveal that a set of studies had less power than Cohen’s recommended level of 80% power. A value of 35% would strongly suggest that a set of studies was severely underpowered.

Conclusion

Yuan and Maxwell (2005) provided a detailed statistical examination of observed power. They concluded that observed power typically provides little to no useful information about the true power of a single study. The main reason for this conclusion was that sampling error in studies with low power is too large to estimate true power with sufficient precision. The only precise estimate of power can be obtained when sampling error is small and effect sizes are large. In this case, power is near the maximum value of 1 and observed power correctly estimates true power as being close to 1. Thus, observed power can be useful when it suggests that a study had high power.

Yuan and Maxwell (2005) also showed that observed power is systematically biased unless true power is 50%. The amount of bias is relatively small and even without this systematic bias, the amount of random error is so large that observed power estimates based on a single study cannot be trusted.

Unfortunately, Yuan and Maxwell’s (2005) article has been misinterpreted as evidence that observed power calculations are inherently biased and useless. However, observed power can provide useful and unbiased information in a meta-analysis of several studies. First, a meta-analysis can provide unbiased estimates of power because the median value is an unbiased estimator of power. Second, aggregation across studies reduces random sampling error, just like aggregation across studies reduces sampling error in meta-analyses of effect sizes.

Implications

The demonstration that median observed power provides useful information about true power is important because observed power has become a valuable tool in the detection of publication bias and other biases that lead to inflated estimates of effect sizes. Starting with Sterling, Rosenbaum, and Weinkam’s (1995) seminal article, observed power has been used by Ioannidis and Trikalinos (2007), Schimmack (2012), Francis (2012), Simonsohn (2014), and van Assen, van Aert, and Wicherts (2014) to draw inferences about a set of studies with the help of a posteriori power analysis. The methods differ in the way observed data are used to estimate power, but they all rely on the assumption that observed data provide useful information about the true power of a set of studies. This blog post shows that Yuan and Maxwell’s (2005) critical examination of observed power does not undermine the validity of statistical approaches that rely on observed data to estimate power.

Future Directions

This blog post focussed on meta-analysis of exact replication studies that have the same population effect size and the same sample size (sampling error). It also assumed that the set of studies is a representative set of studies. An important challenge for future research is to examine the statistical properties of observed power when power varies across studies (heterogeneity) and when publication bias and other biases are present. A major limitation of existing methods is that these methods assume a fixed population effect size (Ioannidis and Trikalinos, 2007; Francis, 2012; Simonsohn, 2014; van Assen, van Aert, and Wicherts, 2014). At present, the Incredibility index (Schimmack, 2012) and the R-Index (Schimmack, 2014) have been proposed as methods for sets of studies that are biased and heterogeneous. An important goal for future research is to evaluate these methods in simulation studies with heterogeneous and biased sets of data.