Original: December 5, 2014

Revised: December 28, 2020

### Power Failure in Neuroscience

An article in *Nature Reviews Neuroscience* suggested that the median power in neuroscience studies is just 21% (Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò, 2013).

The authors of this article examined meta-analyses of primary studies in neuroscience that were published in 2011. They analyzed 49 meta-analyses that were based on a total of 730 original studies (on average, 15 studies per meta-analysis, range 2 to 57).

For each primary study, the authors computed observed power based on the sample size and the estimated effect size in the meta-analysis.
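Concretely, this procedure amounts to a standard a priori power calculation. A minimal sketch using a normal approximation for a two-group comparison (the effect size and sample size below are hypothetical, not values from the meta-analyses):

```python
from math import sqrt
from scipy.stats import norm

def approx_power_two_sample(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sample comparison with
    standardized effect size d and n participants per group."""
    z_crit = norm.ppf(1 - alpha / 2)      # critical z, ~1.96 for alpha = .05
    ncp = d * sqrt(n_per_group / 2)       # noncentrality parameter
    return 1 - norm.cdf(z_crit - ncp)     # negligible lower tail ignored

# hypothetical example: meta-analytic d = 0.5 with 20 participants per group
print(round(approx_power_two_sample(0.5, 20), 2))  # -> 0.35
```

The exact noncentral-t calculation gives a slightly lower value, but the normal approximation is close enough for the argument here.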

Based on their analyses, the authors concluded that the median power in neuroscience is 21%.

There is a major problem with this estimate that the authors overlooked. A median power estimate of 21% is implausibly low because it corresponds to a median p-value of p = .25. If median power were 21%, over 50% of the original studies in the meta-analyses would have reported a non-significant result (p > .05). This seems rather unlikely because journals tend to publish mostly significant results.
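The correspondence between a median observed power of 21% and a median p-value of .25 can be verified directly. A short sketch, assuming a two-sided z-test at alpha = .05 (observed power is computed from the upper rejection tail only, as is conventional):

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # ~1.96

# An observed power of 21% implies the observed z-score was:
z_obs = z_crit + norm.ppf(0.21)    # ~1.15

# ...which corresponds to a two-sided p-value of about .25:
p = 2 * (1 - norm.cdf(z_obs))
print(round(p, 2))                 # -> 0.25
```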

The estimate is even less plausible because it is based on meta-analytic averages without any correction for bias. These effect sizes are likely to be inflated, which means that the median power estimate is inflated as well. Thus, true power would be even lower than 21%, and even more results would be non-significant.

What could explain this implausible result?

- A meta-analysis includes published and unpublished studies. It is possible that the published studies reported significant results with observed power greater than 50% (p < .05) and the unpublished studies reported non-significant results with power less than 50%. However, this would imply that meta-analysts were able to retrieve as many unpublished studies as published studies. The authors did not report whether power of published and unpublished studies differed.
- A second possibility is that the power analyses produced false results. The authors relied on Ioannidis and Trikalinos’s (2007) approach to the estimation of power. This approach assumes that studies in a meta-analysis have the same true effect size and that the meta-analytic average (weighted mean) provides the best estimate of the true effect size. This estimate of the true effect size is then used to estimate power in individual studies based on the sample size of the study. As already noted by Ioannidis and Trikalinos (2007), this approach can produce biased results when effect sizes in a meta-analysis are heterogeneous.
- Estimating power simply on the basis of effect size and sample size can be misleading when the design is not a simple comparison of two groups. Between-subject designs are common in animal studies in neuroscience. However, many fMRI studies use within-subject designs that achieve high statistical power with a few participants because participants serve as their own controls.

Schimmack (2012) proposed an alternative procedure that does not have this limitation. Power is estimated individually for each study based on the observed effect size in this study. This approach makes it possible to estimate median power for heterogeneous sets of studies with different effect sizes. Moreover, this approach makes it possible to compute power when power is not simply a function of sample size and effect size (e.g., within-subject designs).
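In code, this procedure reduces to computing observed power from each study's own test statistic and then taking the median. A minimal sketch (the z-scores below are made up for illustration; they are not from any of the meta-analyses):

```python
from statistics import median
from scipy.stats import norm

def observed_power(z, alpha=0.05):
    """Power of a two-sided z-test, treating the observed z-score
    as the true noncentrality (lower tail ignored)."""
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - z)

# hypothetical z-scores from a heterogeneous set of studies
z_scores = [1.8, 2.3, 2.9, 2.1, 4.2]
median_obs_power = median(observed_power(z) for z in z_scores)
```

Because observed power is a monotone function of z, the median observed power equals the observed power of the median z-score, which makes the estimate robust to heterogeneity in effect sizes.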

### R-Index of Nature Neuroscience: Analysis

To examine the replicability of research published in Nature Neuroscience, I retrieved the most cited articles in this journal until I had a sample of 20 studies. I needed 14 articles to meet this goal. The number of studies per article ranged from 1 to 7.

The success rate for focal significance tests was 97%, which implies that the vast majority of significance tests reported a significant result. The median observed power was 84%. The inflation rate is 13% (97% – 84% = 13%), and the R-Index is 71% (84% – 13% = 71%). Based on these numbers, the R-Index predicts that the majority of studies in Nature Neuroscience would replicate in an exact replication study.
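The R-Index arithmetic for these numbers can be written out as a short sketch:

```python
success_rate = 0.97      # share of focal significance tests that were significant
median_obs_power = 0.84  # median observed power across the 20 studies

inflation = success_rate - median_obs_power  # 0.13
r_index = median_obs_power - inflation       # 0.84 - 0.13 = 0.71
print(round(r_index, 2))                     # -> 0.71
```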

This conclusion differs dramatically from Button et al.’s (2013) conclusion. I therefore examined some of the articles that were used for Button et al.’s analyses.

A study by Davidson et al. (2003) examined treatment effects in 12 depressed patients and compared them to 5 healthy controls. The main findings in this article were three significant interactions between time of treatment and group, with z-scores of 3.84, 4.60, and 4.08. Observed power for these values with p = .05 is over 95%. Even with a more conservative significance level of p = .001, power is still over 70%. However, the meta-analysis focused on the correlation between brain activity at baseline and changes in depression over time. This correlation is shown in a scatterplot without reporting the actual correlation or testing it for significance. The text further states that a similar correlation was observed for an alternative depression measure, with r = .46, and notes correctly that this correlation is not significant, t(10) = 1.64, p = .13, d = .95, obs. power = 32%. The meta-analysis found a mean effect size of .92, and a power analysis with d = .92 and N = 12 yields a power estimate of 30%. Presumably, this is the value that Button et al. used to estimate power for the Davidson et al. (2003) article. However, the meta-analysis did not include the more powerful analyses that compared patients and controls over time.
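The 32% observed-power figure for the r = .46 result can be approximately reproduced from the reported p-value alone. A sketch using the normal approximation (the exact value based on the t-distribution with 10 degrees of freedom is slightly lower):

```python
from scipy.stats import norm

p_reported = 0.13                     # two-sided p-value reported for r = .46
z_obs = norm.ppf(1 - p_reported / 2)  # ~1.51, the implied observed z-score
z_crit = norm.ppf(0.975)              # ~1.96

# observed power, treating z_obs as the true noncentrality
obs_power = 1 - norm.cdf(z_crit - z_obs)
```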

### Conclusion

In the current replication crisis, there is a lot of confusion about the replicability of published findings. Button et al. (2013) aimed to provide some objective information about the replicability of neuroscience research. They concluded that replicability is very low with a median estimate of 21%. In this post, I point out some problems with their statistical approach and the focus on meta-analyses as a way to make inferences about replicability of published studies. My own analysis shows a relatively high R-Index of 71%. To make sense of this index it is instructive to compare it to the following R-Indices.

In a replication project of psychological studies, I found an R-Index of 43% and 28% of studies were successfully replicated.

In the many-labs replication project, 10 out of 12 studies were successfully replicated, a replication rate of 83% and the R-Index was 72%.

### Caveat

Neuroscience studies may have high observed power and still not replicate very well in exact replications. The reason is that measuring brain activity is difficult and requires many steps to convert and reduce observed data into measures of brain activity in specific regions. Actual replication studies are needed to examine the replicability of published results.

I like the R-index. It follows from the rules of inference most researchers in empirical social science use to validate their explanatory claims.

In your opinion, what are the assumptions required for the R-index to yield correct predictions (e.g., the ergodic theorems)? And how would we be able to detect that assumptions were violated, like the p-curve limitations you describe in the manuscript? (These questions can and should, of course, be asked of any recently proposed index.)

All the best,

Fred

By the way, the data of the ManyLabs project (Klein et al. 2014) were published in the Journal of Open Psychology Data (refs are here http://fredhasselman.com/?page_id=19)

Thank you for your comments. I tried to look up ergodic theorems, but I have to admit that I didn’t really understand the concept and how it applies to the R-Index. Predictions about replicability are better the higher the R-Index is because there is less bias. Predictions are less precise when the R-Index is low because the distribution of bias is unknown. In this case, the distribution of true power and the type of biases matters. The aim of the R-Index is to reward research with high statistical power. Jacob Cohen suggested 50 years ago that studies should have 80% power, but his contribution has not affected power in psychological research. I hope that the R-Index makes this shortcoming salient and nudges researchers to increase power in their studies. The goal should be an R-Index of 80%. Of course, more is better.

I was referring to the assumption that the ‘true effect’ is a valid description of the invariant structure in the data and that a psychological measurement can be regarded as a classical physical measurement.

In short, strong ergodicity: throw N=1000 different dice at O=1 occasion and you should get a similar distribution as throwing N=1 die O=1000 occasions

If a True effect exists, why can’t we evidence it in one subject with N=1000 repeated measurements?

Any explanation as to why we cannot do such a thing has to do with violating strong ergodicity or property attribution as a classical physical measurement.

First, as you point out, let’s get the current rules of inference right.

Do you have an explanation for the overestimation in Manylabs1 of some effects of which the original study was powerful enough to detect both the original and replicated effect, had they been the true effect?

The R-Index does not assume a fixed effect at the level of single individuals or studies. Statistical inferences hold for groups of individuals (samples) or sets of studies.

For a single study, an effect size reflects the average effect size in a sample. A significant result specifies the probability that an EXACT replication study would produce an effect size of the same sign.

Observed power is an estimate of the probability that an EXACT replication study would produce a significant result IF THE EFFECT SIZE IN THE FIRST STUDY IS IDENTICAL TO THE AVERAGE EFFECT SIZE IN THE POPULATION. The problem with predicting outcomes of empirical replications is that the replication study may not be an EXACT replication and that the observed effect size is practically never an unbiased estimate of the true effect size.

The R-Index is based on the insight that sampling error is unbiased and will cancel out in estimates of the true power in a set of studies. Thus, median observed power across a set of studies provides a reasonable estimate of median true power in a set of studies. However, the estimate can be biased when questionable research practices are present and inflate estimates of observed power.
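This insight can be illustrated with a small simulation: observed power is a monotone function of the observed z-score, and sampling error in z is symmetric, so the median of the observed-power estimates converges on true power. A sketch with a hypothetical true effect (not connected to any actual data):

```python
import random
from statistics import median
from scipy.stats import norm

random.seed(42)
z_crit = norm.ppf(0.975)

true_z = 2.5                                # hypothetical true noncentrality
true_power = 1 - norm.cdf(z_crit - true_z)  # ~0.71

# each study's observed z is the true z plus standard-normal sampling error
obs_z = [random.gauss(true_z, 1) for _ in range(10_000)]
obs_powers = [1 - norm.cdf(z_crit - z) for z in obs_z]

print(round(median(obs_powers), 2))         # close to true_power
```

Note that the mean of the observed-power estimates would be biased (power is a nonlinear function of z); the median is what cancels the sampling error out.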

Because questionable research practices can influence observed power in many different ways, their effect cannot be modeled with a simple statistical model. The R-Index solves this problem with a simple adjustment: it subtracts the bias (success rate – median observed power) from the estimate of observed power.

Importantly, there is no claim that this Index is an estimate of true median power. The claim is simply that the Index is able to predict success rates in empirical replication studies because the Index correlates with true power.

The R-Index cannot explain why in the many-labs project some studies were replicated with a stronger effect size. The most likely explanation is that the replication studies were not EXACT replications. This could be tested by trying to find moderators of the effect. Which factors moderated an effect is an empirical question that the R-Index cannot answer.

The R-Index is more useful in revealing that failures of replication are NOT due to failures to conduct a successful exact replication study. If the R-Index of the original study or the authors of an original study is low, it suggests that the effect size in the original study was inflated because questionable research practices contributed to the published results.

I hope this answers your question.

Thanks again for your comments.