This post was submitted as a comment to the R-Index Bulletin, but I think posting in the comment section of a blog reduces visibility. Therefore, I am reposting this contribution as a post. It is a good demonstration that article-based metrics can predict replication failures. Please consider submitting similar analyses to the R-Index Bulletin, or send me an email and I will post your findings anonymously or with author credit.
=================================================================
Too good to be true: A reanalysis of Damisch, Stoberock, and Mussweiler (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. Psychological Science, 21(7), 1014-1020.
Preliminary note:
Test statistics of the t-tests on p. 1016 (t(48) = 2.0, p < .05 and t(48) = 2.36, p < .03) were excluded from the following analyses because they served only as manipulation checks. The t-test reported on p. 1017 (t(39) = 3.07, p < .01) was also excluded because the mean differences in self-efficacy represent only an exploratory analysis.
One statistical test was reported as significant with F(2, 48) = 3.16, p < .05. However, computing the exact p-value in R gives p = .051, which is above the conventional criterion of .05. For this analysis, the critical p-value was therefore set to p = .055 to be consistent with the authors' interpretation of the test as significant evidence in favor of their hypothesis.
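This can be checked directly in R from the reported F-value and degrees of freedom:

pf(3.16, df1 = 2, df2 = 48, lower.tail = FALSE)
# returns approximately 0.051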
R-Index analysis:
Success rate = 1
Mean observed power = 0.5659
Median observed power = 0.537
Inflation rate = 0.4341
R-Index = 0.1319
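For readers who want to retrace the arithmetic, the following R sketch reproduces the R-Index from the reported summary values (the per-study observed powers are obtained from each test's z-score, e.g., pnorm(z - qnorm(.975)) for a two-sided alpha of .05, and are not repeated here):

success_rate <- 1                          # all focal tests were reported as significant
mean_obs_power <- 0.5659                   # reported mean observed power
inflation <- success_rate - mean_obs_power # 0.4341
r_index <- mean_obs_power - inflation      # about 0.132 (reported as 0.1319)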
Note that, according to http://www.r-index.org/uploads/3/5/6/7/3567479/r-index_manual.pdf (p.7):
“An R-Index of 22% is consistent with a set of studies in which the null-hypothesis is true and a researcher reported only significant results”.
Furthermore, the Test of Insufficient Variance (TIVA) was conducted.
Note that the z-values of independent studies should have a variance of at least 1, so a variance below 1 suggests bias. The chi-square test evaluates the null hypothesis that the variance equals 1.
Results:
Variance = 0.1562
Chi^2(7) = 1.094; p = .007
Thus, the insufficient variance of the z-scores (0.156) suggests that the reported results very likely overestimate the population effect and the replicability of the reported studies.
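The TIVA statistic can be retraced from the reported variance alone; the degrees of freedom imply k = 8 focal tests. Under the null hypothesis that the z-values have a variance of 1, (k - 1) times the sample variance follows a chi-square distribution with k - 1 degrees of freedom, and bias is indicated by a small left-tail p-value. A minimal sketch in R:

k <- 8                               # implied by df = 7
var_z <- 0.1562                      # reported variance of the z-values
chi_sq <- (k - 1) * var_z            # about 1.09 (reported as 1.094)
p_tiva <- pchisq(chi_sq, df = k - 1) # left-tail p-value, about .007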
It should be noted that the present analysis is consistent with earlier claims that these results are too good to be true based on Francis’s Test of Excessive Significance (Francis et al., 2014; http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114255).
Finally, the study results were analyzed using p-curve (http://p-curve.com/):
Statistical Inference on p-curve:
Studies contain evidential value:
chisq(16) = 10.745; p = .825
Note that a significant p-value indicates that the p-curve is right-skewed, which would indicate evidential value.
Studies lack evidential value:
chisq(16) = 36.16; p = .003
Note that a significant p-value indicates that the p-curve is flatter than one would expect if the studies were powered at 33%, which would indicate that the results lack evidential value.
Studies lack evidential value and were intensely p-hacked:
chisq(16) = 26.811; p = .044
Note that a significant p-value indicates that the p-curve is left-skewed, which would indicate p-hacking/selective reporting.
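For illustration, the continuous p-curve tests are essentially applications of Fisher's method to conditional p-values (pp-values). The sketch below uses hypothetical placeholder p-values rather than the article's actual results, and covers only the right-skew (evidential value) and left-skew (p-hacking) tests; the 33%-power test additionally requires noncentral distributions for each individual test statistic:

p_vals <- c(0.012, 0.024, 0.031, 0.038, 0.041, 0.044, 0.046, 0.049) # hypothetical placeholders, not the article's p-values
pp_right <- p_vals / 0.05            # uniform under the null of no evidential value
pp_left <- 1 - p_vals / 0.05         # conditional p-values for the left-skew test
chi_right <- -2 * sum(log(pp_right)) # Fisher's chi-square, df = 2k
chi_left <- -2 * sum(log(pp_left))
p_right_skew <- pchisq(chi_right, df = 2 * length(p_vals), lower.tail = FALSE)
p_left_skew <- pchisq(chi_left, df = 2 * length(p_vals), lower.tail = FALSE)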
All bias tests suggest that the reported results are biased. Consistent with these statistical results, a replication study failed to reproduce the original findings (see https://osf.io/fsadm/).
Because all studies were conducted by the same team of researchers, the bias cannot be attributed to publication bias. It therefore appears probable that questionable research practices were used to produce the observed significant results. One possible explanation is that the authors ran additional studies and reported only those that produced significant results.
In conclusion, researchers should be suspicious about the power of superstition or at least keep their fingers crossed when they attempt to replicate the reported findings.
My analysis of this article is actually at http://link.springer.com/article/10.3758/s13423-014-0601-x; Uli linked to a different article.