ANNIVERSARY POST. Slightly edited version of the first R-Index blog post, published on December 1, 2014.

In a now infamous article, Bem (2011) produced 9 (out of 10) statistically significant results that appeared to show time-reversed causality. Not surprisingly, subsequent studies failed to replicate this finding. Although Bem never admitted it, it is likely that he used questionable research practices to produce his results. That is, he did not just run 10 studies and find 9 significant results. He may have dropped failed studies, deleted outliers, etc. It is well known among scientists (but not lay people) that researchers routinely use these questionable practices to produce results that advance their careers. Think of it as doping for scientists.

I have developed a statistical index that tracks whether published results were obtained by conducting a series of studies with a good chance of producing a positive result (high statistical power) or whether researchers used questionable research practices. The R-Index is a function of the observed power in a set of studies. More power means that results are likely to replicate in a replication attempt. The second component of the R-index is the discrepancy between observed power and the rate of significant results. 100 studies with 80% power should produce, on average, 80% significant results. If observed power is 80% and the success rate is 100%, questionable research practices were used to obtain more significant results than the data justify. In this case, the actual power is less than 80% because questionable research practices inflate observed power. The R-index subtracts the discrepancy (in this case 20% too many significant results) from observed power to adjust for the inflation. For example, if observed power is 80% and success rate is 100%, the discrepancy is 20% and the R-index is 60%.
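The arithmetic in the example above can be written out as a minimal sketch. The function name `r_index` and its two arguments are illustrative choices, not part of any published software. The sketch assumes that observed power and the success rate have already been computed for the set of studies and are given as proportions.

```python
def r_index(observed_power: float, success_rate: float) -> float:
    """R-Index as described in the text: observed power minus the inflation.

    Both arguments are proportions between 0 and 1 (hypothetical interface).
    """
    # Discrepancy between the share of significant results and observed power
    inflation = success_rate - observed_power
    # Subtract the discrepancy from observed power to adjust for the inflation
    return observed_power - inflation


# Example from the text: observed power 80%, success rate 100%
print(round(r_index(0.80, 1.00), 2))
```

With an observed power of 80% and a success rate of 100%, the inflation is 20 percentage points and the function returns 0.6, matching the 60% in the text.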

In a paper, I show that the R-index predicts success in empirical replication studies.

The R-index also sheds light on the recent controversy about failed replications in psychology (repligate) between replicators and “replihaters.” Replicators sometimes imply that failed replications are to be expected because original studies used small samples with surprisingly large effects, possibly due to the use of questionable research practices. Replihaters counter that replicators are incompetent researchers who are motivated to produce failed studies. The R-Index makes it possible to evaluate these claims objectively and scientifically. It shows that the rampant use of questionable research practices in original studies makes it extremely likely that replication studies will fail. Replihaters should take note that questionable research practices can be detected and that many failed replications are predicted by low statistical power in original articles.

“It shows that the rampant use of questionable research practices in original studies makes it extremely likely that replication studies will fail.”

Couldn't studies also fail to replicate because of low statistical power in the original studies, or am I mistaken? I mean, a low-powered study that produced a significant result could also be just a random fluke, and not necessarily the result of QRPs.

Dear Wonder,

a between-subject design with N = 40 has a sampling error of d = .32 (2 / sqrt[40]).

A small true effect would not be significant, d = .20, t(38) = .20 / .32 = .63, p = .53 (two-tailed).

To get a significant result with N = 40 and a true effect size of d = .20, the observed effect size has to be inflated to a minimum of d = .64, t(38) = .64/.32 = 2, p = .05.
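These numbers can be reproduced with a few lines of Python. This is only a sketch of the arithmetic above: it assumes two equal groups, uses the approximation 2 / sqrt(N) for the sampling error of d, and takes the rounded critical value of 2 from the text rather than the exact t quantile. The variable names are illustrative.

```python
from math import sqrt

N = 40          # total sample size (two equal groups assumed)
TRUE_D = 0.20   # small true effect size
T_CRIT = 2.0    # rounded critical value used in the text (assumption)

se = 2 / sqrt(N)        # approximate sampling error of d
t_true = TRUE_D / se    # t-value if the observed effect equals the true effect
min_d = T_CRIT * se     # smallest observed d that reaches t = 2

print(f"se = {se:.2f}, t = {t_true:.2f}, minimum d = {min_d:.2f}")
```

Without intermediate rounding, the minimum observed effect size comes out at about .63 rather than .64. The difference is only due to rounding the sampling error to .32 before multiplying.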

Power analysis (e.g., with G*Power3) shows that this will happen in only about 10% of all studies. It is therefore unlikely to happen even in a single study.

In a set of studies, it becomes even less likely. The probability of getting two significant results in a row with a true d = .2 and N = 40 is only 1% (10% * 10%).
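For readers without G*Power, the power figure can be checked roughly with the Python standard library. The sketch below uses a normal approximation to the two-tailed t-test, not the noncentral t distribution that G*Power uses, so the result is close to but not identical with the G*Power value. The function name `approx_power` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_total: int, alpha: float = 0.05) -> float:
    """Approximate two-tailed power for a two-group design (normal approximation)."""
    norm = NormalDist()
    se = 2 / sqrt(n_total)                 # approximate sampling error of d
    z_crit = norm.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    ncp = d / se                           # expected test statistic under the true effect
    # Probability of a significant result in either tail
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)


power = approx_power(0.20, 40)
two_in_a_row = power ** 2   # two independent studies, both significant

print(f"power = {power:.3f}, two in a row = {two_in_a_row:.3f}")
```

Under this approximation, power is just under 10% and the probability of two significant results in a row is just under 1%, in line with the rounded figures above.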

Thus, high success rates of 90% that are common in published articles are unlikely to be just fluke findings.

Sincerely, Dr. R