A main message of the Lord of the Rings novels is that power is dangerous and corrupts. The main exception is statistical power. High statistical power is desirable because it reduces the risk of false negative results and thereby increases the rate of true discoveries. A high rate of true discoveries is desirable because it reduces the risk that significant results are false positives. For example, a researcher who conducts many low-powered studies to produce many significant results, but also tests many false hypotheses, will have a high rate of false positive discoveries (Finkel, 2018). In contrast, a researcher who invests more resources in any single study will have fewer significant results, but a lower risk of false positives. Another advantage of high power is that true discoveries are more replicable. A true positive that was obtained with 80% power has an 80% chance of producing a successful replication. In contrast, a true discovery that was obtained with 20% power has an 80% chance of ending with a failure to replicate, which then requires additional replication studies to determine whether the original result was a false positive.
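The link between power and the rate of false positive discoveries can be made concrete with a small sketch. The numbers below (alpha = .05, 30% true hypotheses) are purely illustrative assumptions, not estimates from the article:

```python
# Illustrative sketch: how statistical power affects the false discovery rate.
# Assumed inputs: alpha = .05 and 30% true hypotheses (made-up numbers).

def false_discovery_rate(power, alpha=0.05, prop_true=0.30):
    """Share of significant results that are false positives."""
    true_pos = power * prop_true          # true hypotheses that reach significance
    false_pos = alpha * (1 - prop_true)   # false hypotheses that reach significance
    return false_pos / (true_pos + false_pos)

for power in (0.20, 0.50, 0.80):
    print(f"power = {power:.2f} -> FDR = {false_discovery_rate(power):.3f}")
# power = 0.20 -> FDR = 0.368
# power = 0.50 -> FDR = 0.189
# power = 0.80 -> FDR = 0.127
```

With these assumed inputs, raising power from 20% to 80% cuts the false discovery rate by roughly two-thirds, which is the point of the paragraph above.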
Although most researchers agree that high power is desirable – and specify in their grant proposals that they are planning studies with 80% power – they no longer care about power once the study is completed and a significant result was obtained. The fallacy is to assume that a significant result was obtained because the hypothesis was true and the study had good power. Until recently, there was also no statistical method to estimate researchers’ actual power. The main problem was that questionable research practices inflate post-hoc estimates of statistical power. Selection for significance ensures that post-hoc power is at least 50%. This problem has been solved with selection models that correct for selection for significance, namely p-curve and z-curve. A comparison of these methods with simulation studies shows that p-curve estimates can be dramatically inflated when studies are heterogeneous in power (Brunner & Schimmack, 2020). Z-curve is also the only method that estimates power for all studies that were conducted, not just the subset of studies that produced a significant result. A comparison with actual success rates of replication studies shows that these estimates predict actual replication outcomes (Bartos & Schimmack, 2021).
The ability to estimate researchers’ actual power offers new opportunities for meta-psychologists. One interesting question is how statistical power is related to traditional indicators of scientific success or eminence. There are three possible outcomes.
One possibility is that power could be positively correlated with success, especially for older researchers. The reason is that low power should produce many replication failures for other researchers who are trying to build on the work of this researcher. Faced with replication failures, they are likely to abandon this line of research, and work on this topic will cease after a while. Accordingly, low-powered studies are unlikely to produce a large body of research. In contrast, high-powered studies replicate, and many other researchers will successfully build on these findings, leading to many citations and a large H-Index.
A second possibility is that there is no relationship between power and success. The reason would be that power is determined by many other factors such as the effect sizes in a research area and the type of design that is used to examine these effects. Some research areas will have robust findings that replicate often. Other areas will have low power, but everybody in this area accepts that studies do not always work. In this scenario, success is determined by other factors that vary within research areas and not by power, which varies mostly across research areas.
Another reason for the lack of a correlation could be a floor effect. In a system that does not value credibility and replicability, researchers who use questionable practices to publish quickly may win out, and the only way to survive is to do bad research (Smaldino & McElreath, 2016).
A third possibility is that power is negatively correlated with success. Although there is no evidence for a negative relationship, concerns have been raised that some researchers are gaming the system by conducting many studies with low power to produce as many significant results as possible. The costs of replication failures are passed on to other researchers who try to build on these findings, whereas the innovator moves on to produce more significant results on new questions.
Given the lack of data and the existence of plausible arguments for each type of relationship, it is not possible to make an a priori prediction. Thus, the theoretical implications can only be examined after we look at the data.
Success was measured with the H-Index in Web of Science. Information about the statistical power of over 300 social/personality psychologists was obtained using z-curve analyses of automatically extracted test statistics (Schimmack, 2021). A sample size of N = 300 provides reasonably tight confidence intervals to evaluate whether there is a substantial relationship between the H-Index and power. I log-transformed the H-Index and computed the correlation with the estimated discovery rate (EDR), which corresponds to the average power before selection for significance (Brunner & Schimmack, 2020). The results show a weak positive relationship that is not significantly different from zero, r(N = 304) = .07, 95% CI = -.04 to .18. Thus, the results are most consistent with theories that predict no relationship between success and research practices. Figure 1 shows the scatterplot, and there is no indication that the weak correlation is due to a floor effect. There is considerable variation in the estimated discovery rate across researchers.
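The reported interval can be reproduced with the standard Fisher z-transformation for correlation confidence intervals. This is a sketch of the standard method, not necessarily the exact code used for the analysis:

```python
# Sketch: 95% CI for a correlation via the Fisher z-transformation.
import math

def r_confidence_interval(r, n):
    """Two-sided 95% confidence interval for a Pearson correlation."""
    z = math.atanh(r)                    # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)            # standard error of z
    crit = 1.959963985                   # 97.5th percentile of the normal
    lo, hi = z - crit * se, z + crit * se
    return math.tanh(lo), math.tanh(hi)  # back-transform to the r scale

lo, hi = r_confidence_interval(0.07, 304)
print(f"r = .07, N = 304, 95% CI = {lo:.2f} to {hi:.2f}")
# r = .07, N = 304, 95% CI = -0.04 to 0.18
```

The result matches the interval reported above, -.04 to .18.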
One concern could be that the EDR is just a very noisy and unreliable measure of statistical power. To examine this, I split the z-values of each researcher in half, computed separate z-curves, and then computed the split-half correlation, which was adjusted to estimate the reliability (alpha) for the full set of z-scores. The reliability of the EDR was alpha = .5. To increase reliability, I used extreme groups for the EDR and excluded values between 25 and 45. However, the correlation with the H-Index did not increase, r = .08, 95% CI = -.08 to .23.
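The step-up from a split-half correlation to full-length reliability is conventionally done with the Spearman-Brown formula. A minimal sketch, assuming this is the adjustment used:

```python
# Sketch: stepping up a split-half correlation to full-length reliability
# with the Spearman-Brown formula (assumed adjustment procedure).

def spearman_brown(split_half_r):
    """Reliability of the full measure given the correlation between halves."""
    return 2 * split_half_r / (1 + split_half_r)

# A split-half correlation of about .33 corresponds to the reported alpha of .5:
print(round(spearman_brown(1 / 3), 2))  # 0.5
```

This shows why a modest correlation between the two halves still implies an alpha of about .5 for the full set of z-scores.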
I also correlated the H-Index with the more reliable estimated replication rate (reliability = .9), which is the average power after selection for significance. This correlation was also not significant, r = .08, 95% CI = -.04 to .19.
In conclusion, we can reject the hypothesis that higher success is related to conducting many small studies with low power and selectively reporting only significant results (r > -.1, p < .05). There may be a small positive correlation (r < .2, p < .05), but a larger sample would be needed to reject the hypothesis that there is no relationship between success and statistical power.
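How much larger a sample would be needed can be roughed out with the Fisher z approximation for correlation power analysis. A sketch, assuming two-sided alpha = .05 and 80% power:

```python
# Rough sketch: sample size needed to detect a small correlation with 80% power
# (two-sided alpha = .05), using the Fisher z approximation.
import math

def n_for_correlation(r, z_alpha=1.959963985, z_beta=0.841621234):
    """Approximate N to detect correlation r with 80% power at alpha = .05."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

for r in (0.07, 0.10, 0.20):
    print(f"r = {r:.2f} -> N ~ {n_for_correlation(r)}")
```

Detecting a correlation as small as r = .1 requires roughly 800 researchers, more than twice the present sample, which is why the small positive correlation remains inconclusive.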
Low replication rates and major replication failures of some findings in social psychology created a crisis of confidence. Some articles suggest that most published results are false and were obtained with questionable research practices. The present results suggest that these fears are unfounded and that it would be wrong to generalize from a few researchers to the whole group of social psychologists.
The present results also suggest that it is not necessary to burn social psychology to the ground. Instead, social psychologists should carefully examine which important findings are credible and replicable and which ones are not. Although this work has begun, it is moving slowly. The present results show that researchers’ success, which is measured in terms of citations by peers, is not tied to the credibility of their findings. Personalized information about power may help to change this in the future.
A famous quote in management is “If You Can’t Measure It, You Can’t Improve It.” This might explain why statistical power remained low despite early warnings about low power (Cohen, 1962; Tversky & Kahneman, 1971). Z-curve analysis is a game changer because it makes it possible to measure power, and with modern computers it is possible to do so quickly and on a large scale. If we agree that power is important and that it can be measured, it is time to improve it. Every researcher can do so, and the present results suggest that increasing power is not a career-ending move. So, I hope this post empowers researchers to invest more resources in high-powered studies.