**A comparison of the Test of Excessive Significance and the Incredibility Index**

It has been known for decades that published research articles report too many significant results (Sterling, 1959). This phenomenon is called publication bias. Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias. The most prominent tests are funnel plots (Light & Pillemer, 1984) and Egger regression (Egger et al., 1997). Both tests rely on the fact that population effect sizes are statistically independent of sample sizes. As a result, observed effect sizes in a representative set of studies should also be independent of sample size. Publication bias, however, introduces a negative correlation between observed effect sizes and sample sizes, because larger effects are needed in smaller studies to produce a significant result. The main problem with these bias tests is that other factors can produce heterogeneity in population effect sizes, which also produces variation in observed effect sizes, and this variation in population effect sizes may itself be related to sample sizes. In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes: a power analysis would lead researchers to use larger samples to study smaller effects and smaller samples to study larger effects. This makes it problematic to draw strong inferences about the presence of publication bias from negative correlations between effect sizes and sample sizes.

Sterling et al. (1995) proposed a test for publication bias that does not have this limitation. The test is based on the fact that power is defined as the relative frequency of significant results that one would expect from a series of exact replication studies. If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50. Publication bias will inflate the percentage of significant results. If only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power to produce significant results. Sterling et al. (1995) found that several journals reported more than 90% significant results. Based on conservative estimates of power, they concluded that this high success rate can only be explained by publication bias. Sterling et al. (1995), however, did not develop a method for estimating power.
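The connection between power and the rate of significant results, and the way selective publication inflates that rate, can be illustrated with a small simulation (a hypothetical sketch in Python; the numbers are not from Sterling et al., 1995):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
crit = stats.norm.ppf(0.975)  # two-sided 5% critical value, ~1.96

# A study with 50% power: the test statistic is N(crit, 1), so it
# exceeds the critical value exactly half the time.
z = rng.normal(loc=crit, scale=1.0, size=100_000)
significant = z > crit
print(significant.mean())  # close to 0.50

# Publication bias: only significant results are published, so the
# success rate in journals is 100% even though true power is 50%.
published = z[significant]
print((published > crit).mean())  # 1.0
```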

Ioannidis and Trikalinos (2007) proposed the first test for publication bias based on power analysis. They call it “an exploratory test for an excess of significant findings” (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis for examining publication bias. The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated. ETESR uses a meta-analysis of effect sizes for this purpose.
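The logic of such a test can be sketched as follows (an illustrative simplification with made-up power values, not the authors' exact procedure; the expected number of significant results is derived from the per-study power estimates and compared to the observed number):

```python
from scipy import stats

# Hypothetical power estimates for 10 studies, of the kind a
# meta-analytic effect size estimate would yield (values made up).
powers = [0.45, 0.50, 0.40, 0.55, 0.48, 0.52, 0.60, 0.42, 0.50, 0.58]
observed = 9                 # 9 of the 10 studies reported significance

n = len(powers)
expected = sum(powers)       # expected number of significant results

# Probability of at least 9 significant results if the power estimates
# were correct (binomial approximation using the average power).
p = stats.binom.sf(observed - 1, n, expected / n)
print(round(expected, 2), round(p, 4))  # 5.0 0.0107
```

A small p-value indicates more significant results than the studies' power can plausibly explain, which is the signature of publication bias (or other selection effects).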

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalinos (2007). The test works well “If it can be safely assumed that the effect is the same in all studies on the same question” (p. 246). In other words, the test may not work well when effect sizes are heterogeneous. Again, the authors are careful to point out this limitation of ETESR: “In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised” (p. 246).

The authors repeat this limitation at the end of the article: “Caution is warranted when there is genuine between-study heterogeneity. Tests of publication bias generally yield spurious results in this setting” (p. 252). Given these limitations, it would be desirable to develop a test that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibility Index (Schimmack, 2012). The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces a non-significant result as the number of studies increases. For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip. Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again and again. Probability theory shows that this outcome becomes very unlikely after just a few coin tosses, as the cumulative probability decreases exponentially from 50% to 25% to 12.5% to 6.25% to 3.125%, and so on. Given a standard criterion of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to raise suspicion that the coin is not fair, especially if it always lands on the side that benefits the person who is tossing it. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers’ hypotheses at least 90% of the time. Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05). It is therefore very easy to show that publication bias contributes to the incredible success rate in journals, and it is also possible to do so for smaller sets of studies.
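The coin-flip arithmetic can be checked directly (a sketch; “a success rate of 90%” in 8 studies is read here as 7 or more significant results, which is one plausible interpretation of the claim):

```python
from scipy import stats

# Cumulative probability that a fair coin (a study with 50% power)
# produces only significant results in k consecutive studies: 0.5 ** k.
for k in range(1, 6):
    print(k, 0.5 ** k)  # 0.5, 0.25, 0.125, 0.0625, 0.03125

# With 8 studies and 50% power, a success rate of roughly 90%
# (7 or more significant results) is already improbable at p < .05.
p = stats.binom.sf(6, 8, 0.5)  # P(X >= 7) = 9/256
print(round(p, 4))  # 0.0352
```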

To avoid the assumption of a fixed effect size, the Incredibility Index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that the observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2006). However, as always, the estimate of power becomes more precise when the power estimates of individual studies are combined. The original Incredibility Index used the mean to estimate average power, but Yuan and Maxwell (2006) demonstrated that the mean of observed power is a biased estimate of average (true) power. In further developments of the method, I therefore switched to median observed power (Schimmack, 2016). The median of observed power is an unbiased estimator of power (Schimmack, 2015).
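Median observed power can be computed directly from reported test statistics. A minimal sketch for z-tests (the z-values below are hypothetical; the conversion is the standard post-hoc power calculation, treating the observed z-value as the noncentrality parameter and ignoring the negligible lower rejection tail):

```python
import numpy as np
from scipy import stats

def observed_power(z, alpha=0.05):
    """Post-hoc power of a two-sided z-test, treating the observed
    z-value as if it were the true noncentrality parameter."""
    crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(crit - z)  # lower rejection tail ignored

# Hypothetical z-values from a set of published studies.
zs = np.array([2.10, 2.45, 1.99, 3.20, 2.05])
median_power = np.median(observed_power(zs))
print(round(float(median_power), 3))  # 0.556
```

Because observed power is a monotone function of z, the median of the observed-power values corresponds to the median z-value, which is what makes the median robust to the skew that biases the mean.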

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Findings are similar tests, but they differ in one important respect. ETESR is designed for meta-analyses of highly similar studies with a fixed population effect size. When this condition is met, ETESR can be used to examine publication bias. However, when this condition is violated and effect sizes are heterogeneous, the Incredibility Index is the superior method for examining publication bias. At present, the Incredibility Index is the only test for publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

**References**

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2016). A revised introduction to the R-Index.
