Category Archives: Post-Hoc Power

The Replicability Index Is the Most Powerful Tool to Detect Publication Bias in Meta-Analyses

Abstract

Methods for the detection of publication bias in meta-analyses were first introduced in the 1980s (Light & Pillemer, 1984). However, existing methods tend to have low statistical power to detect bias, especially when population effect sizes are heterogeneous (Renkewitz & Keiner, 2019). Here I show that the Replicability Index (RI) is a powerful method to detect selection for significance while controlling the type-I error risk better than the Test of Excessive Significance (TES). Unlike funnel plots and other regression methods, RI can be used without variation in sampling error across studies. Thus, it should be a default method to examine whether effect size estimates in a meta-analysis are inflated by selection for significance. However, the RI should not be used to correct effect size estimates. A significant results merely indicates that traditional effect size estimates are inflated by selection for significance or other questionable research practices that inflate the percentage of significant results.

Evaluating the Power and Type-I Error Rate of Bias Detection Methods

Just before the end of the year, and decade, Frank Renkewitz and Melanie Keiner published an important article that evaluated the performance of six bias detection methods in meta-analyses (Renkewitz & Keiner, 2019).

The article makes several important points.

1. Bias can distort effect size estimates in meta-analyses, but the amount of bias is sometimes trivial. Thus, bias detection is most important in conditions where effect sizes are inflated to a notable degree (say more than one-tenth of a standard deviation, e.g., from d = .2 to d = .3).

2. Several bias detection tools work well when studies are homogeneous (i.e. ,the population effect sizes are very similar). However, bias detection is more difficult when effect sizes are heterogeneous.

3. The most promising tool for heterogeneous data was the Test of Excessive Significance (Francis, 2013; Ioannidis, & Trikalinos, 2013). However, simulations without bias showed that the higher power of TES was achieved by a higher false-positive rate that exceeded the nominal level. The reason is that TES relies on the assumption that all studies have the same population effect size and this assumption is violated when population effect sizes are heterogeneous.

This blog post examines two new methods to detect publication bias and compares them to the TES and the Test of Insufficient Variance (TIVA) that performed well when effect sizes were homogeneous (Renkewitz & Keiner , 2019). These methods are not entirely new. One method is the Incredibility Index, which is similar to TES (Schimmack, 2012). The second method is the Replicability Index, which corrects estimates of observed power for inflation when bias is present.

The Basic Logic of Power-Based Bias Tests

The mathematical foundations for bias tests based on statistical power were introduced by Sterling et al. (1995). Statistical power is defined as the conditional probability of obtaining a significant result when the null-hypothesis is false. When the null-hypothesis is true, the probability of obtaining a significant result is set by the criterion for a type-I error, alpha. To simplify, we can treat cases where the null-hypothesis is true as the boundary value for power (Brunner & Schimmack, 2019). I call this unconditional power. Sterling et al. (1995) pointed out that for studies with heterogeneity in sample sizes, effect sizes or both, the discoery rate; that is the percentage of significant results, is predicted by the mean unconditional power of studies. This insight makes it possible to detect bias by comparing the observed discovery rate (the percentage of significant results) to the expected discovery rate based on the unconditional power of studies. The empirical challenge is to obtain useful estimates of unconditional mean power, which depends on the unknown population effect sizes.

Ioannidis and Trialinos (2007) were the first to propose a bias test that relied on a comparison of expected and observed discovery rates. The method is called Test of Excessive Significance (TES). They proposed a conventional meta-analysis of effect sizes to obtain an estimate of the population effect size, and then to use this effect size and information about sample sizes to compute power of individual studies. The final step was to compare the expected discovery rate (e.g., 5 out of 10 studies) with the observed discovery rate (8 out of 10 studies) with a chi-square test and to test the null-hypothesis of no bias with alpha = .10. They did point out that TES is biased when effect sizes are heterogeneous (see Renkewitz & Keiner, 2019, for a detailed discussion).

Schimmack (2012) proposed an alternative approach that does not assume a fixed effect sizes across studies, called the incredibility index. The first step is to compute observed-power for each study. The second step is to compute the average of these observed power estimates. This average effect size is then used as an estimate of the mean unconditional power. The final step is to compute the binomial probability of obtaining as many or more significant results that were observed for the estimated unconditional power. Schimmack (2012) showed that this approach avoids some of the problems of TES when effect sizes are heterogeneous. Thus, it is likely that the Incredibility Index produces fewer false positives than TES.

Like TES, the incredibility index has low power to detect bias because bias inflates observed power. Thus, the expected discovery rate is inflated, which makes it a conservative test of bias. Schimmack (2016) proposed a solution to this problem. As the inflation in the expected discovery rate is correlated with the amount of bias, the discrepancy between the observed and expected discovery rate indexes inflation. Thus, it is possible to correct the estimated discovery rate by the amount of observed inflation. For example, if the expected discovery rate is 70% and the observed discovery rate is 90%, the inflation is 20 percentage points. This inflation can be deducted from the expected discovery rate to get a less biased estimate of the unconditional mean power. In this example, this would be 70% – 20% = 50%. This inflation-adjusted estimate is called the Replicability Index. Although the Replicability Index risks a higher type-I error rate than the Incredibility Index, it may be more powerful and have a better type-I error control than TES.

To test these hypotheses, I conducted some simulation studies that compared the performance of four bias detection methods. The Test of Insufficient Variance (TIVA; Schimmack, 2015) was included because it has good power with homogeneous data (Renkewitz & Keiner, 2019). The other three tests were TES, ICI, and RI.

Selection bias was simulated with probabilities of 0, .1, .2, and 1. A selection probability of 0 implies that non-significant results are never published. A selection probability of .1 implies that there is a 10% chance that a non-significant result is published when it is observed. Finally, a selection probability of 1 implies that there is no bias and all non-significant results are published.

Effect sizes varied from 0 to .6. Heterogeneity was simulated with a normal distribution with SDs ranging from 0 to .6. Sample sizes were simulated by drawing from a uniform distribution with values between 20 and 40, 100, and 200 as maximum. The number of studies in a meta-analysis were 5, 10, 20, and 30. The focus was on small sets of studies because power to detect bias increases with the number of studies and power was often close to 100% with k = 30.

Each condition was simulated 100 times and the percentage of significant results with alpha = .10 (one-tailed) was used to compute power and type-I error rates.

RESULTS

Bias

Figure 1 shows a plot of the mean observed d-scores as a function of the mean population d-scores. In situations without heterogeneity, mean population d-scores corresponded to the simulated values of d = 0 to d = .6. However, with heterogeneity, mean population d-scores varied due to sampling from the normal distribution of population effect sizes.


The figure shows that bias could be negative or positive, but that overestimation is much more common than underestimation.  Underestimation was most likely when the population effect size was 0, there was no variability (SD = 0), and there was no selection for significance.  With complete selection for significance, bias always overestimated population effect sizes, because selection was simulated to be one-sided. The reason is that meta-analysis rarely show many significant results in both directions.  

An Analysis of Variance (ANOVA) with number of studies (k), mean population effect size (mpd), heterogeneity of population effect sizes (SD), range of sample sizes (Nmax) and selection bias (sel.bias) showed a four-way interaction, t = 3.70.   This four-way interaction qualified main effects that showed bias decreases with effect sizes (d), heterogeneity (SD), range of sample sizes (N), and increased with severity of selection bias (sel.bias).  

The effect of selection bias is obvious in that effect size estimates are unbiased when there is no selection bias and increases with severity of selection bias.  Figure 2 illustrates the three way interaction for the remaining factors with the most extreme selection bias; that is, all non-significant results are suppressed. 

The most dramatic inflation of effect sizes occurs when sample sizes are small (N = 20-40), the mean population effect size is zero, and there is no heterogeneity (light blue bars). This condition simulates a meta-analysis where the null-hypothesis is true. Inflation is reduced, but still considerable (d = .42), when the population effect is large (d = .6). Heterogeneity reduces bias because it increases the mean population effect size. However, even with d = .6 and heterogeneity, small samples continue to produce inflated estimates by d = .25 (dark red). Increasing sample sizes (N = 20 to 200) reduces inflation considerably. With d = 0 and SD = 0, inflation is still considerable, d = .52, but all other conditions have negligible amounts of inflation, d < .10.

As sample sizes are known, they provide some valuable information about the presence of bias in a meta-analysis. If studies with large samples are available, it is reasonable to limit a meta-analysis to the larger and more trustworthy studies (Stanley, Jarrell, & Doucouliagos, 2010).

Discovery Rates

If all results are published, there is no selection bias and effect size estimates are unbiased. When studies are selected for significance, the amount of bias is a function of the amount of studies with non-significant results that are suppressed. When all non-significant results are suppressed, the amount of selection bias depends on the mean power of the studies before selection for significance which is reflected in the discovery rate (i.e., the percentage of studies with significant results). Figure 3 shows the discovery rates for the same conditions that were used in Figure 2. The lowest discovery rate exists when the null-hypothesis is true. In this case, only 2.5% of studies produce significant results that are published. The percentage is 2.5% and not 5% because selection also takes the direction of the effect into account. Smaller sample sizes (left side) have lower discovery rates than larger sample sizes (right side) because larger samples have more power to produce significant results. In addition, studies with larger effect sizes have higher discovery rates than studies with small effect sizes because larger effect sizes increase power. In addition, more variability in effect sizes increases power because variability increases the mean population effect sizes, which also increases power.

In conclusion, the amount of selection bias and the amount of inflation of effect sizes varies across conditions as a function of effect sizes, sample sizes, heterogeneity, and the severity of selection bias. The factorial design covers a wide range of conditions. A good bias detection method should have high power to detect bias across all conditions with selection bias and low type-I error rates across conditions without selection bias.

Overall Performance of Bias Detection Methods

Figure 4 shows the overall results for 235,200 simulations across a wide range of conditions. The results replicate Renkewitz and Keiner’s finding that TES produces more type-I errors than the other methods, although the average rate of type-I errors is below the nominal level of alpha = .10. The error rate of the incredibility index is practically zero, indicating that it is much more conservative than TES. The improvement for type-I errors does not come at the cost of lower power. TES and ICI have the same level of power. This finding shows that computing observed power for each individual study is superior than assuming a fixed effect size across studies. More important, the best performing method is the Replicability Index (RI), which has considerably more power because it corrects for inflation in observed power that is introduced by selection for significance. This is a promising results because one of the limitation of the bias tests examined by Renkewitz and Keiner was the low power to detect selection bias across a wide range of realistic scenarios.

Logistic regression analyses for power showed significant five-way interactions for TES, IC, and RI. For TIVA, two four-way interactions were significant. For type-I error rates no four-way interactions were significant, but at least one three-way interaction was significant. These results show that results systematic vary in a rather complex manner across the simulated conditions. The following results show the performance of the four methods in specific conditions.

Number of Studies (k)

Detection of bias is a function of the amount of bias and the number of studies. With small sets of studies (k = 5), it is difficult to detect power. In addition, low power can suppress false-positive rates because significant results without selection bias are even less likely than significant results with selection bias. Thus, it is important to examine the influence of the number of studies on power and false positive rates.

Figure 5 shows the results for power. TIVA does not gain much power with increasing sample sizes. The other three methods clearly become more powerful as sample sizes increase. However, only the R-Index shows good power with twenty studies and still acceptable studies with just 10 studies. The R-Index with 10 studies is as powerful as TES and ICI with 10 studies.

Figure 6 shows the results for the type-I error rates. Most important, the high power of the R-Index is not achieved by inflating type-I error rates, which are still well-below the nominal level of .10. A comparison of TES and ICI shows that ICI controls type-I error much better than TES. TES even exceeds the nominal level of .10 with 30 studies and this problem is going to increase as the number of studies gets larger.

Selection Rate

Renkewitz and Keiner noticed that power decreases when there is a small probability that non-significant results are published. To simplify the results for the amount of selection bias, I focused on the condition with n = 30 studies, which gives all methods the maximum power to detect selection bias. Figure 7 confirms that power to detect bias deteriorates when non-significant results are published. However, the influence of selection rate varies across methods. TIVA is only useful when only significant results are selected, but even TES and ICI have only modest power even if the probability of a non-significant result to be published is only 10%. Only the R-Index still has good power, and power is still higher with a 20% chance to select a non-significant result than with a 10% selection rate for TES and ICI.

Population Mean Effect Size

With complete selection bias (no significant results), power had ceiling effects. Thus, I used k = 10 to illustrate the effect of population effect sizes on power and type-I error rates. (Figure 8)

In general, power decreased as the population mean effect sizes increased. The reason is that there is less selection because the discovery rates are higher. Power decreased quickly to unacceptable levels (< 50%) for all methods except the R-Index. The R-Index maintained good power even with the maximum effect size of d = .6.

Figure 9 shows that the good power of the R-Index is not achieved by inflating type-I error rates. The type-I error rate is well below the nominal level of .10. In contrast, TES exceeds the nominal level with d = .6.

Variability in Population Effect Sizes

I next examined the influence of heterogeneity in population effect sizes on power and type-I error rates. The results in Figure 10 show that hetergeneity decreases power for all methods. However, the effect is much less sever for the RI than for the other methods. Even with maximum heterogeneity, it has good power to detect publication bias.

Figure 11 shows that the high power of RI is not achieved by inflating type-I error rates. The only method with a high error-rate is TES with high heterogeneity.

Variability in Sample Sizes

With a wider range of sample sizes, average power increases. And with higher power, the discovery rate increases and there is less selection for significance. This reduces power to detect selection for significance. This trend is visible in Figure 12. Even with sample sizes ranging from 20 to 100, TIVA, TES, and IC have modest power to detect bias. However, RI maintains good levels of power even when sample sizes range from 20 to 200.

Once more, only TES shows problems with the type-I error rate when heterogeneity is high (Figure 13). Thus, the high power of RI is not achieved by inflating type-I error rates.

Stress Test

The following analyses examined RI’s performance more closely. The effect of selection bias is self-evident. As more non-significant results are available, power to detect bias decreases. However, bias also decreases. Thus, I focus on the unfortunately still realistic scenario that only significant results are published. I focus on the scenario with the most heterogeneity in sample sizes (N = 20 to 200) because it has the lowest power to detect bias. I picked the lowest and highest levels of population effect sizes and variability to illustrate the effect of these factors on power and type-I error rates. I present results for all four set sizes.

The results for power show that with only 5 studies, bias can only be detected with good power if the null-hypothesis is true. Heterogeneity or large effect sizes produce unacceptably low power. This means that the use of bias tests for small sets of studies is lopsided. Positive results strongly indicate severe bias, but negative results are inconclusive. With 10 studies, power is acceptable for homogeneous and high effect sizes as well as for heterogeneous and low effect sizes, but not for high effect sizes and high heterogeneity. With 20 or more studies, power is good for all scenarios.

The results for the type-I error rates reveal one scenario with dramatically inflated type-I error rates, namely meta-analysis with a large population effect size and no heterogeneity in population effect sizes.

Solutions

The high type-I error rate is limited to cases with high power. In this case, the inflation correction over-corrects. A solution to this problem is found by considering the fact that inflation is a non-linear function of power. With unconditional power of .05, selection for significance inflates observed power to .50, a 10 fold increase. However, power of .50 is inflated to .75, which is only a 50% increase. Thus, I modified the R-Index formula and made inflation contingent on the observed discovery rate.

RI2 = Mean.Observed.Power – (Observed Discovery Rate – Mean.Observed.Power)*(1-Observed.Discovery.Rate). This version of the R-Index reduces power, although power is still superior to the IC.

It also fixed the type-I error problem at least with sample sizes up to N = 30.

Example 1: Bem (2011)

Bem’s (2011) sensational and deeply flawed article triggered the replication crisis and the search for bias-detection tools (Francis, 2012; Schimmack, 2012). Table 1 shows that all tests indicate that Bem used questionable research practices to produce significant results in 9 out of 10 tests. This is confirmed by examination of his original data (Schimmack, 2018). For example, for one study, Bem combined results from four smaller samples with non-significant results into one sample with a significant result. The results also show that both versions of the Replicability Index are more powerful than the other tests.

Testp1/p
TIVA0.008125
TES0.01856
IC0.03132
RI0.0000245754
RI20.000137255

Example 2: Francis (2014) Audit of Psychological Science

Francis audited multiple-study articles in the journal Psychological Science from 2009-2012. The main problem with the focus on single articles is that they often contain relatively few studies and the simulation studies showed that bias tests tend to have low power if 5 or fewer studies are available (Renkewitz & Keiner, 2019). Nevertheless, Francis found that 82% of the investigated articles showed signs of bias, p < .10. This finding seems very high given the low power of TES in the simulation studies. It would mean that selection bias in these articles was very high and power of the studies was extremely low and homogeneous, which provides the ideal conditions to detect bias. However, the high type-I error rates of TES under some conditions may have produced more false positive results than the nominal level of .10 suggests. Moreover, Francis (2014) modified TES in ways that may have further increased the risk of false positives. Thus, it is interesting to reexamine the 44 studies with other bias tests. Unlike Francis, I coded one focal hypothesis test per study.

I then applied the bias detection methods. Table 2 shows the p-values.

YearAuthorFrancisTIVATESICRI1RI2
2012Anderson, Kraus, Galinsky, & Keltner0.1670.3880.1220.3870.1110.307
2012Bauer, Wilkie, Kim, & Bodenhausen0.0620.0040.0220.0880.0000.013
2012Birtel & Crisp0.1330.0700.0760.1930.0040.064
2012Converse & Fishbach0.1100.1300.1610.3190.0490.199
2012Converse, Risen, & Carter Karmic0.0430.0000.0220.0650.0000.010
2012Keysar, Hayakawa, &0.0910.1150.0670.1190.0030.043
2012Leung et al.0.0760.0470.0630.1190.0030.043
2012Rounding, Lee, Jacobson, & Ji0.0360.1580.0750.1520.0040.054
2012Savani & Rattan0.0640.0030.0280.0670.0000.017
2012van Boxtel & Koch0.0710.4960.7180.4980.2000.421
2011Evans, Horowitz, & Wolfe0.4260.9380.9860.6280.3790.606
2011Inesi, Botti, Dubois, Rucker, & Galinsky0.0260.0430.0610.1220.0030.045
2011Nordgren, Morris McDonnell, & Loewenstein0.0900.0260.1140.1960.0120.094
2011Savani, Stephens, & Markus0.0630.0270.0300.0800.0000.018
2011Todd, Hanko, Galinsky, & Mussweiler0.0430.0000.0240.0510.0000.005
2011Tuk, Trampe, & Warlop0.0920.0000.0280.0970.0000.017
2010Balcetis & Dunning0.0760.1130.0920.1260.0030.048
2010Bowles & Gelfand0.0570.5940.2080.2810.0430.183
2010Damisch, Stoberock, & Mussweiler0.0570.0000.0170.0730.0000.007
2010de Hevia & Spelke0.0700.3510.2100.3410.0620.224
2010Ersner-Hershfield, Galinsky, Kray, & King0.0730.0040.0050.0890.0000.013
2010Gao, McCarthy, & Scholl0.1150.1410.1890.3610.0410.195
2010Lammers, Stapel, & Galinsky0.0240.0220.1130.0610.0010.021
2010Li, Wei, & Soman0.0790.0300.1370.2310.0220.129
2010Maddux et al.0.0140.3440.1000.1890.0100.087
2010McGraw & Warren0.0810.9930.3020.1480.0060.066
2010Sackett, Meyvis, Nelson, Converse, & Sackett0.0330.0020.0250.0480.0000.011
2010Savani, Markus, Naidu, Kumar, & Berlia0.0580.0110.0090.0620.0000.014
2010Senay, Albarracín, & Noguchi0.0900.0000.0170.0810.0000.010
2010West, Anderson, Bedwell, & Pratt0.1570.2230.2260.2870.0320.160
2009Alter & Oppenheimer0.0710.0000.0410.0530.0000.006
2009Ashton-James, Maddux, Galinsky, & Chartrand0.0350.1750.1330.2700.0250.142
2009Fast & Chen0.0720.0060.0360.0730.0000.014
2009Fast, Gruenfeld, Sivanathan, & Galinsky0.0690.0080.0420.1180.0010.030
2009Garcia & Tor0.0891.0000.4220.1900.0190.117
2009González & McLennan0.1390.0800.1940.3030.0550.208
2009Hahn, Close, & Graf0.3480.0680.2860.4740.1750.390
2009Hart & Albarracín0.0350.0010.0480.0930.0000.015
2009Janssen & Caramazza0.0830.0510.3100.3920.1150.313
2009Jostmann, Lakens, & Schubert0.0900.0000.0260.0980.0000.018
2009Labroo, Lambotte, & Zhang0.0080.0540.0710.1480.0030.051
2009Nordgren, van Harreveld, & van der Pligt0.1000.0140.0510.1350.0020.041
2009Wakslak & Trope0.0610.0080.0290.0650.0000.010
2009Zhou, Vohs, & Baumeister0.0410.0090.0430.0970.0020.036

The Figure shows the percentage of significant results for the various methods. The results confirm that despite the small number of studies, the majority of multiple-study articles show significant evidence of bias. Although statistical significance does not speak directly to effect sizes, the fact that these tests were significant with a small set of studies implies that the amount of bias is large. This is also confirmed by a z-curve analysis that provides an estimate of the average bias across all studies (Schimmack, 2019).

A comparison of the methods shows with real data that the R-Index (RI1) is the most powerful method and even more powerful than Francis’s method that used multiple studies from a single study. The good performance of TIVA shows that population effect sizes are rather homogeneous as TIVA has low power with heterogeneous data. The Incredibility Index has the worst performance because it has an ultra-conservative type-I error rate. The most important finding is that the R-Index can be used with small sets of studies to demonstrate moderate to large bias.

Discussion

In 2012, I introduced the Incredibility Index as a statistical tool to reveal selection bias; that is, the published results were selected for significance from a larger number of results. I compared the IC with TES and pointed out some advantages of averaging power rather than effect sizes. However, I did not present extensive simulation studies to compare the performance of the two tests. In 2014, I introduced the replicability index to predict the outcome of replication studies. The replicability index corrects for the inflation of observed power when selection for significance is present. I did not think about RI as a bias test. However, Renkewitz and Keiner (2019) demonstrated that TES has low power and inflated type-I error rates. Here I examined whether IC performed better than TES and I found it did. Most important, it has much more conservative type-I error rates even with extreme heterogeneity. The reason is that selection for significance inflates observed power which is used to compute the expected percentage of significant results. This led me to see whether the bias correction that is used to compute the Replicability Index can boost power, while maintaining acceptable type-I error rates. The present results shows that this is the case for a wide range of scenarios. The only exception are meta-analysis of studies with a high population effect size and low heterogeneity in effect sizes. To avoid this problem, I created an alternative R-Index that reduces the inflation adjustment as a function of the percentage of non-significant results that are reported. I showed that the R-Index is a powerful tool that detects bias in Bem’s (2011) article and in a large number of multiple-study articles published in Psychological Science. In conclusion, the replicability index is the most powerful test for the presence of selection bias and it should be routinely used in meta-analyses to ensure that effect sizes estimates are not inflated by selective publishing of significant results. As the use of questionable practices is no longer acceptable, the R-Index can be used by editors to triage manuscripts with questionable results or to ask for a new, pre-registered, well-powered additional study. The R-Index can also be used in tenure and promotion evaluations to reward researchers that publish credible results that are likely to replicate.

References

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57, 153–169. https://doi.org/10.1016/j.jmp.2013.02.003

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials: Journal of the Society for Clinical Trials, 4, 245–253. https://doi.org/10.1177/1740774507079441

 R. J. Light; D. B. Pillemer (1984). Summing up: The Science of Reviewing Research. Cambridge, Massachusetts: Harvard University Press.

Renkewitz, F., & Keiner, M. (2019). How to Detect Publication Bias in Psychological Research
A Comparative Evaluation of Six Statistical Methods. Zeitschrift für Psychologie, 227, 261-279. https://doi.org/10.1027/2151-2604/a000386.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. doi:10.1037/a0029487

Schimmack, U. (2014, December 30). The test of insufficient variance (TIVA): A new tool for the detection of questionable research practices [Blog Post]. Retrieved from https://replicationindex.wordpress.com/2014/12/30/the-test-ofinsufficient-
variance-tiva-a-new-tool-for-the-detection-ofquestionable-
research-practices/

Schimmack, U. (2016). A revised introduction to the R-Index. Retrieved
from https://replicationindex.wordpress.com/2016/01/31/a-revisedintroduction-
to-the-r-index/

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

An Introduction to Z-Curve: A method for estimating mean power after selection for significance (replicability)

UPDATE 5/13/2019   Our manuscript on the z-curve method for estimation of mean power after selection for significance has been accepted for publication in Meta-Psychology. As estimation of actual power is an important tool for meta-psychologists, we are happy that z-curve found its home in Meta-Psychology.  We also enjoyed the open and constructive review process at Meta-Psychology.  Definitely will try Meta-Psychology again for future work (look out for z-curve.2.0 with many new features).

Z.Curve.1.0.Meta.Psychology.In.Press

Since 2015, Jerry Brunner and I have been working on a statistical tool that can estimate mean (statitical) power for a set of studies with heterogeneous sample sizes and effect sizes (heterogeneity in non-centrality parameters and true power).   This method corrects for the inflation in mean observed power that is introduced by the selection for statistical significance.   Knowledge about mean power makes it possible to predict the success rate of exact replication studies.   For example, if a set of studies with mean power of 60% were replicated exactly (including sample sizes), we would expect that 60% of the replication studies produce a significant result again.

Our latest manuscript is a revision of an earlier manuscript that received a revise and resubmit decision from the free, open-peer-review journal Meta-Psychology.  We consider it the most authoritative introduction to z-curve that should be used to learn about z-curve, critic z-curve, or as a citation for studies that use z-curve.

Cite as “submitted for publication”.

Final.Revision.874-Manuscript in PDF-2236-1-4-20180425 mva final (002)

Feel free to ask questions, provide comments, and critic our manuscript in the comments section.  We are proud to be an open science lab, and consider criticism an opportunity to improve z-curve and our understanding of power estimation.

R-CODE
Latest R-Code to run Z.Curve (Z.Curve.Public.18.10.28).
[updated 18/11/17]   [35 lines of code]
call function  mean.power = zcurve(pvalues,Plot=FALSE,alpha=.05,bw=.05)[1]

Z-Curve related Talks
Presentation on Z-curve and application to BS Experimental Social Psychology and (Mostly) WS-Cognitive Psychology at U Waterloo (November 2, 2018)
[Powerpoint Slides]

Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis. Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584).  We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always leads to an underestimation of observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated.  We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the proportion of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is proportional to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.

inflated-mop

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significant is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Locken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50 to 3050 to 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.

inflated-effect-sizes

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science,

355 (6325), 584-585. [doi: 10.1126/science.aal3618]

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N – 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 -2) – sqrt(N – 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

ylab = “Inflated Observed Effect Size”

title = “Effect of Selection for Significance on Observed Effect Size”

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type=”l”,xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab=”Sample Size”,ylab=”Median Observed Effect Size After Selection for Significance”,lwd=3,main=title)

segments(x0 = 600,y0 = y.max-.05-i*.02, x1 = 650,col=col[i], lwd=5)

text(730,y.max-.05-i*.02,paste0(“Rel = “,format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people to understand the concept of true power, observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no mathematical formula to correct observed power for inflation to solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power.  This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance.  To use this method for real data with observed median power of only significant results, one can simply generate a range of true power values, generate the predicted median observed power and then pick the true power value with the smallest discrepancy between median observed power and simulated inflated power estimates. This approach is essentially the same as the approach used by pcurve and puniform, which only
differ in the criterion that is being minimized.

Here is the r-code for the conversion of true.power into the predicted observed power after selection for significance.

true.power = seq(.01,.99,.01)
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null-hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

inflated-mop

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

z.crit = qnorm(.975)
Obs.power = pnorm(qnorm(1-p/2),z.crit)

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.

P.S.  It is possible to proof the formula that transforms true power into median observed power.  Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000
z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
med.z.sig = median(z.sim[z.sim > z.crit])
obs.pow.sim = c(obs.pow.sim,pnorm(med.z.sig,z.crit))
}
obs.pow.sim

obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

 

 

A comparison of The Test of Excessive Significance and the Incredibility Index

A comparison of The Test of Excessive Significance and the Incredibility Index

It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias.  The most prominent tests are funnel plots (Light & Pillemer, 1984) and Eggert regression (Eggert et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes.  As a result, observed effect sizes in a representative set of studies should also be independent of sample size.  However, publication bias will introduce a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result.  The main problem with these bias tests is that other factors may produce heterogeneity in population effect sizes that can also produce variation in observed effect sizes and the variation in population effect sizes may be related to sample sizes.  In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes.  A power analysis would suggest that researchers use larger samples to study smaller effects and smaller samples to study large effects.  This makes it problematic to draw strong inferences from negative correlations between effect sizes and sample sizes about the presence of publication bias.

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation.  The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies.  If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50 studies.  Publication bias will lead to an inflation in the percentage of significant results. If only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power to produce significant results.  Sterling et al. (1995) found that several journals reported over 90% of significant results. Based on some conservative estimates of power, he concluded that this high success rate can only be explained with publication bias.  Sterling et al. (1995), however, did not develop a method that would make it possible to estimate power.

Ioannidis and Trikalonis (2007) proposed the first test for publication bias based on power analysis.  They call it “An exploratory test for an excess of significant results.” (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis to examine publication bias.  The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated.  ETSESR uses a meta-analysis of effect sizes for this purpose.

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalonis (2007).  The test works well “If it can be safely assumed that the effect is the same in all studies on the same question” (p. 246). In other words, the test may not work well when effect sizes are heterogeneous.  Again, the authors are careful to point out this limitation of ETSER. “In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised” (p. 246).

The authors repeat this limitation at the end of the article. “Caution is warranted when there is genuine between-study heterogeneity. Test of publication bias generally yield spurious results in this setting.” (p. 252).   Given these limitations, it would be desirable to develop a test that that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibilty Index (Schimmack, 2012).  The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces a non-significant result as the number of studies increases.  For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip.  Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again.  Probability theory shows that this outcome becomes very unlikely even after just a few coin tosses as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.1.25% and so on.  Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to be suspicious that the coin is fair, especially if it always falls on the side that benefits the person who is throwing the coin. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers’ hypothesis at least 90% of the time.  Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05).  It therefore very easy to show that publication bias contributes to the incredible success rate in journals, but it is also possible to do so for smaller sets of studies.

To avoid the requirement of a fixed effect size, the incredibility index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2006).  However, as always, the estimate of power becomes more precise when power estimates of individual studies are combined.  The original incredibility indices used the mean to estimate averaged power, but Yuan and Maxwell (2006) demonstrated that the mean of observed power is a biased estimate of average (true) power.  In further developments of my method, I changed the method and I am now using median observed power (Schimmack, 2016).  The median of observed power is an unbiased estimator of power (Schimmack, 2015).

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect.  ETESR is designed for meta-analysis of highly similar studies with a fixed population effect size.  When this condition is met, ETESR can be used to examine publication bias.  However, when this condition is violated and effect sizes are heterogeneous, the incredibility index is a superior method to examine publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, J., Pillemer, D. B.  (1984). Summing up: The Science of Reviewing Research. Cambridge, Massachusetts.: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test”. BMJ 315 (7109): 629–634. doi:10.1136/bmj.315.7109.629.

Ioannidis and Trikalinos (2007).  An exploratory test for an excess of significant findings. Clinical Trials, 4 245-253.

Schimmack (2012). The Ironic effect of significant results on the credibility of multiple study articles. Psychological Methods, 17, 551-566.

Schimmack, U. (2016). A revised introduction o the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Stering, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa, The American Statistician, 49, 108-112.

Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 141–167

Dr. Ulrich Schimmack’s Blog about Replicability

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITYIn empirical studies with sampling error, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).

BLOGS BY YEAR:  20192018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 

 

TOP TEN BLOGS

RR.Logo

  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2018). 

Golden2.  Introduction to Z-Curve with R-Code

This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.

Say-No-to-Doping-Test-Image

3. An Introduction to the R-Index

 

The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.

Featured Image -- 203

4.  The Test of Insufficient Variance (TIVA)

 

The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

train-wreck-15.  MOST VIEWED POST (with comment by Noble Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking fast and slow.”   The results suggest that many of the cited findings are difficult to replicate.

http://schoolsnapshots.org/blog/2014/09/30/math-prize-for-girls-at-m-i-t/6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

GPower7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.

snake-oil

8.  The Problem with Bayesian Null-Hypothesis Testing

 

Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

hidden9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

20170620_14554410. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

The Abuse of Hoenig and Heisey: A Justification of Power Calculations with Observed Effect Sizes

In 2001, Hoenig and Heisey wrote an influential article, titled “The Abuse of Power: The Persuasive Fallacy of Power Calculations For Data Analysis.”  The article has been cited over 500 times and it is commonly cited as a reference to claim that it is a fallacy to use observed effect sizes to compute statistical power.

In this post, I provide a brief summary of Hoenig and Heisey’s argument. The summary shows that Hoenig and Heisey were concerned with the practice of assessing the statistical power of a single test based on the observed effect size for this effect. I agree that it is often not informative to do so (unless the result is power = .999). However, the article is often cited to suggest that the use of observed effect sizes in power calculations is fundamentally flawed. I show that this statement is false.

The abstract of the article makes it clear that Hoenig and Heisey focused on the estimation of power for a single statistical test. “There is also a large literature advocating that power calculations be made whenever one performs a statistical test of a hypothesis and one obtains a statistically nonsignificant result” (page 1). The abstract informs readers that this practice is fundamentally flawed. “This approach, which appears in various forms, is fundamentally flawed. We document that the problem is extensive and present arguments to demonstrate the flaw in the logic” (p. 1).

Given that method articles can be difficult to read, it is possible that the misinterpretation of Hoenig and Heisey is the result of relying on the term “fundamentally flawed” in the abstract. However, some passages in the article are also ambiguous. In the Introduction Hoenig and Heisey write “we describe the flaws in trying to use power calculations for data-analytic purposes” (p. 1). It is not clear what purposes are left for power calculations if they cannot be used for data-analytic purposes. Later on, they write more forcefully “A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed.” (p. 2). So readers cannot be blamed entirely if they believed that calculations of observed power are fundamentally flawed. This conclusion is often implied in Hoenig and Heisey’s writing, which is influenced by their broader dislike of hypothesis testing  in general.

The main valid argument that Hoenig and Heisey make is that power analysis is based on the unknown population effect size and that effect sizes in a particular sample are contaminated with sampling error.  As p-values and power estimates depend on the observed effect size, they are also influenced by random sampling error.

In a special case, when true power is 50%, the p-value matches the significance criterion. If sampling error leads to an underestimation of the true effect size, the p-value will be non-significant and the power estimate will be less than 50%. When sampling error inflates the observed effect size, p-values will be significant and power will be above 50%.

It is therefore impossible to find scenarios where observed power is high (80%) and a result is not significant, p > .05, or where observed power is low (20%) and a result is significant, p < .05.  As a result, it is not possible to use observed power to decide whether a non-significant result was obtained because power was low or because power was high but the effect does not exist.

In fact, a simple mathematical formula can be used to transform p-values into observed power and vice versa (I actually got the idea of using p-values to estimate power from Hoenig and Heisey’s article).  Given this perfect dependence between the two statistics, observed power cannot add additional information to the interpretation of a p-value.

This central argument is valid and it does mean that it is inappropriate to use the observed effect size of a statistical test to draw inferences about the statistical power of a significance test for the same effect (N = 1). Similarly, one would not rely on a single data point to draw inferences about the mean of a population.

However, it is common practice to aggregate original data points or to aggregated effect sizes of multiple studies to obtain more precise estimates of the mean in a population or the mean effect size, respectively. Thus, the interesting question is whether Hoenig and Heisey’s (2001) article contains any arguments that would undermine the aggregation of power estimates to obtain an estimate of the typical power for a set of studies. The answer is no. Hoenig and Heisey do not consider a meta-analysis of observed power in their discussion and their discussion of observed power does not contain arguments that would undermine the validity of a meta-analysis of post-hoc power estimates.

A meta-analysis of observed power can be extremely useful to check whether researchers’ a priori power analysis provide reasonable estimates of the actual power of their studies.

Assume that researchers in a particular field have to demonstrate that their studies have 80% power to produce significant results when an important effect is present because conducting studies with less power would be a waste of resources (although some granting agencies require power analyses, these power analyses are rarely taken seriously, so I consider this a hypothetical example).

Assume that researchers comply and submit a priori power analysis with effect sizes that are considered to be sufficiently meaningful. For example, an effect of half-a-standard deviation (Cohen’s d = .50) might look reasonable large to be meaningful. Researchers submit their grant applications with a prior power analysis that produce 80% power with an effect size of d = .50. Based on the power analysis, researchers request funding for 128 participants. A researcher plans four studies and needs $50 for each participant. The total budget is $25,600.

When the research project is completed, all four studies produced non-significant results. The observed standardized effect sizes were 0, .20, .25, and .15. Is it really impossible to estimate the realized power in these studies based on the observed effect sizes? No. It is common practice to conduct a meta-analysis of observed effect sizes to get a better estimate of the (average) population effect size. In this example, the average effect size across the four studies is d = .15. It is also possible to show that the average effect size in these four studies is significantly different from the effect size that was used for the a priori power calculation (M1 = .15, M2 = .50, Mdiff = .35, SE = 1/sqrt(512) = .044, t = .35 / .044 = 7.92, p < 1e-13). Using the more realistic effect size estimate that is based on actual empirical data rather than wishful thinking, the post-hoc power analysis yields a power estimate of 13%. The probability of obtaining non-significant results in all four studies is 57%. Thus, it is not surprising that the studies produced non-significant results.  In this example, a post-hoc power analysis with observed effect sizes provides valuable information about the planning of future studies in this line of research. Either effect sizes of this magnitude are not important enough and research should be abandoned or effect sizes of this magnitude still have important practical implications and future studies should be planned on the basis of a priori power analysis with more realistic effect sizes.

Another valuable application of observed power analysis is the detection of publication bias and questionable research practices (Ioannidis and Trikalinos; 2007), Schimmack, 2012) and for estimating the replicability of statistical results published in scientific journals (Schimmack, 2015).

In conclusion, the article by Hoenig and Heisey is often used as a reference to argue that observed effect sizes should not be used for power analysis.  This post clarifies that this practice is not meaningful for a single statistical test, but that it can be done for larger samples of studies.

“Do Studies of Statistical Power Have an Effect on the Power of Studies?” by Peter Sedlmeier and Gerg Giegerenzer

The article with the witty title “Do Studies of Statistical Power Have an Effect on the Power of Studies?” builds on Cohen’s (1962) seminal power analysis of psychological research.

The main point of the article can be summarized in one word: No. Statistical power has not increased after Cohen published his finding that statistical power is low.

One important contribution of the article was a meta-analysis of power analyses that applied Cohen’s method to a variety of different journals. The table below shows that power estimates vary by journal assuming that the effect size was medium according to Cohen’s criteria of small, medium, and large effect sizes. The studies are sorted by power estimates from the highest to the lowest value, which provides a power ranking of journals based on Cohen’s method. I also included the results of Sedlmeier and Giegerenzer’s power analysis of the 1984 volume of the Journal of Abnormal Psychology (the Journal of Social and Abnormal Psychology was split into Journal of Abnormal Psychology and Journal of Personality and Social Psychology). I used the mean power (50%) rather than median power (44%) because the mean power is consistent with the predicted success rate in the limit. In contrast, the median will underestimate the success rate in a set of studies with heterogeneous effect sizes.

JOURNAL TITLE YEAR Power%
Journal of Marketing Research 1981 89
American Sociological Review 1974 84
Journalism Quarterly, The Journal of Broadcasting 1976 76
American Journal of Educational Psychology 1972 72
Journal of Research in Teaching 1972 71
Journal of Applied Psychology 1976 67
Journal of Communication 1973 56
The Research Quarterly 1972 52
Journal of Abnormal Psychology 1984 50
Journal of Abnormal and Social Psychology 1962 48
American Speech and Hearing Research & Journal of Communication Disorders 1975 44
Counseler Education and Supervision 1973 37

 

The table shows that there is tremendous variability in power estimates for different journals ranging from as high as 89% (9 out of 10 studies will produce a significant result when an effect is present) to the lowest estimate of  37% power (only 1 out of 3 studies will produce a significant result when an effect is present).

The table also shows that the Journal of Abnormal and Social Psychology and its successor the Journal of Abnormal Psychology yielded nearly identical power estimates. This finding is the key finding that provides empirical support for the claim that power in the Journal of Abnormal Psychology has not increased over time.

The average power estimate for all journals in the table is 62% (median 61%).  The list of journals is not a representative set of journals and few journals are core psychology journals. Thus, the average power may be different if a representative set of journals had been used.

The average for the three core psychology journals (JASP & JAbnPsy,  JAP, AJEduPsy) is 67% (median = 63%) is slightly higher. The latter estimate is likely to be closer to the typical power in psychology in general rather than the prominently featured estimates based on the Journal of Abnormal Psychology. Power could be lower in this journal because it is more difficult to recruit patients with a specific disorder than participants from undergraduate classes. However, only more rigorous studies of power for a broader range of journals and more years can provide more conclusive answers about the typical power of a single statistical test in a psychology journal.

The article also contains some important theoretical discussions about the importance of power in psychological research. One important issue concerns the treatment of multiple comparisons. For example, a multi-factorial design produces an exponential number of statistical comparisons. With two conditions, there is only one comparison. With three conditions, there are three comparisons (C1 vs. C2, C1 vs. C3, and C2 vs. C3). With 5 conditions, there are 10 comparisons. Standard statistical methods often correct for these multiple comparisons. One consequence of this correction for multiple comparisons is that the power of each statistical test decreases. An effect that would be significant in a simple comparison of two conditions would not be significant if this test is part of a series of tests.

Sedlmeier and Giegerenzer used the standard criterion of p < .05 (two-tailed) for their main power analysis and for the comparison with Cohen’s results. However, many articles presented results using a more stringent criterion of significance. If the criterion used by authors would have been used for the power analysis, power decreased further. About 50% of all articles used an adjusted criterion value and if the adjusted criterion value was used power was only 37%.

Sedlmeier and Giegerenzer also found another remarkable difference between articles in 1960 and in 1984. Most articles in 1960 reported the results of a single study. In 1984 many articles reported results from two or more studies. Sedlmeier and Giegerenzer do not discuss the statistical implications of this change in publication practices. Schimmack (2012) introduced the concept of total power to highlight the problem of publishing articles that contain multiple studies with modest power. If studies are used to provide empirical support for an effect, studies have to show a significant effect. For example, Study 1 shows an effect with female participants. Study 2 examines whether the effect can also be demonstrated with male participants. If Study 2 produces a non-significant result, it is not clear how this finding should be interpreted. It may show that the effect does not exist for men. It may show that the first result was just a fluke finding due to sampling error. Or it may show that the effect exists equally for men and women but studies had only 50% power to produce a significant result. In this case, it is expected that one study will produce a significant result and one will produce a non-significant result, but in the long-run significant results are equally likely with male or female participants. Given the difficulty of interpreting a non-significant result, it would be important to conduct a more powerful study that examines gender differences in a more powerful study with more female and male participants. However, this is not what researchers do. Rather, multiple study articles contain only the studies that produced significant results. The rate of successful studies in psychology journals is over 90% (Sterling et al., 1995). However, this outcome is extremely likely in multiple studies where studies have only 50% power to get a significant result in a single attempt. For each additional attempt, the probability to obtain only significant results decreases exponentially (1 Study, 50%, 2 Studies 25%, 3 Studies 12.5%, 4 Studies 6.75%).

The fact that researchers only publish studies that worked is well-known in the research community. Many researchers believe that this is an acceptable scientific practice. However, consumers of scientific research may have a different opinion about this practice. Publishing only studies that produced the desired outcome is akin to a fund manager that only publishes the return rate of funds that gained money and excludes funds with losses. Would you trust this manager to take care of your retirement? It is also akin to a gambler that only remembers winnings. Would you marry a gambler who believes that gambling is ok because you can earn money that way?

I personally do not trust obviously biased information. So, when researchers present 5 studies with significant results, I wonder whether they really had the statistical power to produce these results or whether they simply did not publish results that failed to confirm their claims. To answer this question it is essential to estimate the actual power of individual studies to produce significant results; that is, it is necessary to estimate the typical power in this field, of this researcher, or in the journal that published the results.

In conclusion, Sedlmeier and Gigerenzer made an important contribution to the literature by providing the first power-ranking of scientific journals and the first temporal analyses of time trends in power. Although they probably hoped that their scientific study of power would lead to an increase in statistical power, the general consensus is that their article failed to change scientific practices in psychology. In fact, some journals required more and more studies as evidence for an effect (some articles contain 9 studies) without any indication that researchers increased power to ensure that their studies could actually provide significant results for their hypotheses. Moreover, the topic of statistical power remained neglected in the training of future psychologists.

I recommend Sedlmeier and Gigerenzer’s article as essential reading for anybody interested in improving the credibility of psychology as a rigorous empirical science.

As always, comments (positive or negative) are always welcome.

Distinguishing Questionable Research Practices from Publication Bias

It is well-known that scientific journals favor statistically significant results (Sterling, 1959). This phenomenon is known as publication bias. Publication bias can be easily detected by comparing the observed statistical power of studies with the success rate in journals. Success rates of 90% or more would only be expected if most theoretical predictions are true and empirical studies have over 90% statistical power to produce significant results. Estimates of statistical power range from 20% to 50% (Button et al., 2015, Cohen, 1962). It follows that for every published significant result an unknown number of non-significant results has occurred that remained unpublished. These results linger in researchers proverbial file-drawer or more literally in unpublished data sets on researchers’ computers.

The selection of significant results also creates an incentive for researchers to produce significant results. In rare cases, researchers simply fabricate data to produce significant results. However, scientific fraud is rare. A more serious threat to the integrity of science is the use of questionable research practices. Questionable research practices are all research activities that create a systematic bias in empirical results. Although systematic bias can produce too many or too few significant results, the incentive to publish significant results suggests that questionable research practices are typically used to produce significant results.

In sum, publication bias and questionable research practices contribute to an inflated success rate in scientific journals. So far, it has been difficult to examine the prevalence of questionable research practices in science. One reason is that publication bias and questionable research practices are conceptually overlapping. For example, a research article may report the results of a 2 x 2 x 2 ANOVA or a regression analysis with 5 predictor variables. The article may only report the significant results and omit detailed reporting of the non-significant results. For example, researchers may state that none of the gender effects were significant and not report the results for main effects or interaction with gender. I classify these cases as publication bias because each result tests a different hypothesis., even if the statistical tests are not independent.

Questionable research practices are practices that change the probability of obtaining a specific significant result. An example would be a study with multiple outcome measures that would support the same theoretical hypothesis. For example, a clinical trial of an anti-depressant might include several depression measures. In this case, a researcher can increase the chances of a significant result by conducting tests for each measure. Other questionable research practices would be optional stopping once a significant result is obtained, selective deletion of cases based on the results after deletion. A common consequence of these questionable practices is that they will produce results that meet the significance criterion, but deviate from the distribution that is expected simply on the basis of random sampling error.

A number of articles have tried to examine the prevalence of questionable research practices by comparing the frequency of p-values above and below the typical criterion of statistical significance, namely a p-value less than .05. The logic is that random error would produce a nearly equal amount of p-values just above .05 (e.g., p = .06) and below .05 (e.g., p = .04). According to this logic, questionable research practices are present, if there are more p-values just below the criterion than p-values just above the criterion (Masicampo & Lalande, 2012).

Daniel Lakens has pointed out some problems with this approach. The most crucial problem is that publication bias alone is sufficient to predict a lower frequency of p-values below the significance criterion. After all, these p-values imply a non-significant result and non-significant results are subject to publication bias. The only reason why p-values of .06 are reported with higher frequency than p-values of .11 is that p-values between .05 and .10 are sometimes reported as marginally significant evidence for a hypothesis. Another problem is that many p-values of .04 are not reported as p = .04, but are reported as p < .05. Thus, the distribution of p-values close to the criterion value provides unreliable information about the prevalence of questionable research practices.

In this blog post, I introduce an alternative approach to the detection of questionable research practices that produce just significant results. Questionable research practices and publication bias have different effects on the distribution of p-values (or corresponding measures of strength of evidence). Whereas publication bias will produce a distribution that is consistent with the average power of studies, questionable research practice will produce an abnormal distribution with a peak just below the significance criterion. In other words, questionable research practices produce a distribution with too few non-significant results and too few highly significant results.

I illustrate this test of questionable research practices with post-hoc-power analysis of three journals. One journal shows neither signs of publication bias, nor significant signs of questionable research practices. The second journal shows clear evidence of publication bias, but no evidence of questionable research practices. The third journal illustrates the influence of publication bias and questionable research practices.

Example 1: A Relatively Unbiased Z-Curve

The first example is based on results published during the years 2010-2014 in the Journal of Experimental Psychology: Learning, Memory, and Cognition. A text-mining program searched all articles for publications of F-tests, t-tests, correlation coefficients, regression coefficients, odds-ratios, confidence intervals, and z-tests. Due to the inconsistent and imprecise reporting of p-values (p = .02 or p < .05), p-values were not used. All statistical tests were converted into absolute z-scores.

The program found 14,800 tests. 8,423 tests were in the critical interval between z = 2 and z = 6 that is used for estimation of 4 non-centrality parameters and 4 weights that are used to model the distribution of z-values between 2 and 6 and to estimate the distribution in the range from 0 to 2. Z-values greater than 6 are not used because they correspond to Power close to 1. 11% of all tests fall into this region of z-scores that are not shown.

PHP-Curve JEP-LMCThe histogram and the blue density distribution show the observed data. The green curve shows the predicted distribution based on the post-hoc power analysis. Post-hoc power analysis suggests that the average power of significant results is 67%. Power for all statistical tests is estimated to be 58% (including 11% of z-scores greater than 6, power is .58*.89 + .11 = 63%). More important is the predicted distribution of z-scores. The predicted distribution on the left side of the criterion value matches the observed distribution rather well. This shows that there are not a lot of missing non-significant results. In other words, there does not appear to be a file-drawer of studies with non-significant results. There is also only a very small blip in the observed data just at the level of statistical significance. The close match between the observed and predicted distributions suggests that results in this journal are relatively free of systematic bias due to publication bias or questionable research practices.

Example 2: A Z-Curve with Publication Bias

The second example is based on results published in the Attitudes & Social Cognition Section of the Journal of Personality and Social Psychology. The text-mining program retrieved 5,919 tests from articles published between 2010 and 2014. 3,584 tests provided z-scores in the range from 2 to 6 that is being used for model fitting.

PHP-Curve JPSP-ASC

The average power of significant results in JPSP-ASC is 55%. This is significantly less than the average power in JEP-LMC, which was used for the first example. The estimated power for all statistical tests, including those in the estimated file drawer, is 35%. More important is the estimated distribution of z-values. On the right side of the significance criterion the estimated curve shows relatively close fit to the observed distribution. This finding shows that random sampling error alone is sufficient to explain the observed distribution. However, on the left side of the distribution, the observed z-scores drop off steeply. This drop is consistent with the effect of publication bias that researchers do not report all non-significant results. There is only a slight hint that questionable research practices are also present because observed z-scores just above the criterion value are a bit more frequent than the model predicts. However, this discrepancy is not conclusive because the model could increase the file drawer, which would produce a steeper slope. The most important characteristic of this z-curve is the steep cliff on the left side of the criterion value and the gentle slope on the right side of the criterion value.

Example 3: A Z-Curve with Questionable Research Practices.

Example 3 uses results published in the journal Aggressive Behavior during the years 2010 to 2014. The text mining program found 1,429 results and 863 z-scores in the range from 2 to 6 that were used for the post-hoc-power analysis.

PHP-Curve for AggressiveBeh 2010-14

 

The average power for significant results in the range from 2 to 6 is 73%, which is similar to the power estimate in the first example. The power estimate that includes non-significant results is 68%. The power estimate is similar because there is no evidence of a file drawer with many underpowered studies. In fact, there are more observed non-significant results than predicted non-significant results, especially for z-scores close to zero. This outcome shows some problems of estimating the frequency of non-significant results based on the distribution of significant results. More important, the graph shows a cluster of z-scores just above and below the significance criterion. The step cliff to the left of the criterion might suggest publication bias, but the whole distribution does not show evidence of publication bias. Moreover, the steep cliff on the right side of the cluster cannot be explained with publication bias. Only questionable research practices can produce this cliff because publication bias relies on random sampling error which leads to a gentle slope of z-scores as shown in the second example.

Prevalence of Questionable Research Practices

The examples suggest that the distribution of z-scores can be used to distinguish publication bias and questionable research practices. Based on this approach, the prevalence of questionable research practices would be rare. The journal Aggressive Behavior is exceptional. Most journals show a pattern similar to Example 2, with varying sizes of the file drawer. However, this does not mean that questionable research practices are rare because it is most likely that the pattern observed in Example 2 is a combination of questionable research practices and publication bias. As shown in Example 2, the typical power of statistical tests that produce a significant result is about 60%. However, researchers do not know which experiments will produce significant results. Slight modifications in experimental procedures, so-called hidden moderators, can easily change an experiment with 60% power into an experiment with 30% power. Thus, the probability of obtaining a significant result in a replication study is less than the nominal power of 60% that is implied by post-hoc-power analysis. With only 30% to 60% power, researchers will frequently encounter results that fail to produce an expected significant result. In this case, researchers have two choices to avoid reporting a non-significant result. They can put the study in the file-drawer or they can try to salvage the study with the help of questionable research practices. It is likely that researchers will do both and that the course of action depends on the results. If the data show a trend in the right direction, questionable research practices seem an attractive alternative. If the data show a trend in the opposite direction, it is more likely that the study will be terminated and the results remain unreported.

Simons et al. (2011) conducted some simulation studies and found that even extreme use of multiple questionable research practices (p-hacking) will produce a significant result in at most 60% of cases, when the null-hypothesis is true. If such extreme use of questionable research practices were widespread, z-curve would produce corrected power estimates well-below 50%. There is no evidence that extreme use of questionable research practices is prevalent. In contrast, there is strong evidence that researchers conduct many more studies than they actually report and that many of these studies have a low probability of success.

Implications of File-Drawers for Science

First, it is clear that researchers could be more effective if they would use existing resources more effectively. An fMRI study with 20 participants costs about $10,000. Conducting a study that costs $10,000 that has only a 50% probability of producing a significant result is wasteful and should not be funded by taxpayers. Just publishing the non-significant result does not fix this problem because a non-significant result in a study with 50% power is inconclusive. Even if the predicted effect exists, one would expect a non-significant result in ever second study.   Instead of wasting $10,000 on studies with 50% power, researchers should invest $20,000 in studies with higher power (unfortunately, power does not increase proportional to resources). With the same research budget, more money would contribute to results that are being published. Thus, without spending more money, science could progress faster.

Second, higher powered studies make non-significant results more relevant. If a study had 80% power, there is only a 20% chance to get a non-significant result if an effect is present. If a study had 95% power, the chance of a non-significant result would be just as low as the chance of a false positive result. In this case, it is noteworthy that a theoretical prediction was not confirmed. In a set of high-powered studies, a post-hoc power analysis would show a bimodal distribution with clusters of z-scores around 0 for true null-hypothesis and a cluster of z-scores of 3 or higher for clear effects. Type-I and Type-II errors would be rare.

Third, Example 3 shows that the use of questionable research practices becomes detectable in the absence of a file drawer and that it would be harder to publish results that were obtained with questionable research practices.

Finally, the ability to estimate the size of file-drawers may encourage researchers to plan studies more carefully and to invest more resources into studies to keep their file drawers small because a large file-drawer may harm reputation or decrease funding.

In conclusion, post-hoc power analysis of large sets of data can be used to estimate the size of the file drawer based on the distribution of z-scores on the right side of a significance criterion. As file-drawers harm science, this tool can be used as an incentive to conduct studies that produce credible results and thus reducing the need for dishonest research practices. In this regard, the use of post-hoc power analysis complements other efforts towards open science such as preregistration and data sharing.