“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication that uses the same sample size and significance criterion (Schimmack, 2017).
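To illustrate the definition, here is a minimal sketch that assumes the true power of each original study is known. Studies with higher power are more likely to yield a significant result, so they are overrepresented among significant results, and the expected success rate of exact replications is a power-weighted mean of power. The function name and the example values are hypothetical and are not taken from any of the posts.

```python
def expected_replication_rate(powers):
    """Expected share of significant results in exact replications of the
    significant original studies, given each study's true power.

    A study with power p ends up among the significant originals with
    probability p and then replicates with probability p again, so the
    rate is sum(p^2) / sum(p).
    """
    return sum(p * p for p in powers) / sum(powers)


# Hypothetical mix: half the studies have 20% power, half have 80% power.
powers = [0.2, 0.8]
rate = expected_replication_rate(powers)
print(f"mean power before selection: {sum(powers) / len(powers):.2f}")
print(f"expected replication rate:   {rate:.2f}")
```

Under these assumptions the mean power of all studies is .50, but the expected replication rate of the significant ones is .68, because selection for significance favors the high-powered studies.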
Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package)
TOP TEN BLOGS
1. Rankings of 117 Psychology Journals according to the average replicability of a published significant result. The post also includes a detailed analysis of time trends in replicability from 2010 to 2018.
2. This post presents the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal. It describes the new method for estimating observed power from the distribution of test statistics converted into absolute z-scores. The method has since been developed further to estimate power for a wider range of z-scores with a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.
3. The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
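The R-Index formula can be sketched as follows. This is a minimal illustration, not the author's code. It assumes that results are available as absolute z-scores, that observed power is computed for a two-tailed test, and that inflation is the difference between the share of significant results and median observed power. That definition of inflation is an assumption that the blurb above does not spell out, and the function names and z-scores are hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()  # standard normal distribution


def observed_power(z, alpha=0.05):
    """Power of a two-tailed z-test if the true effect equaled the observed z."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z = abs(z)
    return (1 - STD_NORMAL.cdf(crit - z)) + STD_NORMAL.cdf(-crit - z)


def r_index(z_scores, alpha=0.05):
    """R-Index = median observed power - inflation.

    Assumption: inflation = success rate - median observed power.
    """
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    median_power = median(observed_power(z, alpha) for z in z_scores)
    success_rate = sum(abs(z) > crit for z in z_scores) / len(z_scores)
    inflation = success_rate - median_power
    return median_power - inflation


# Hypothetical z-scores from five studies that were all reported as significant.
z_scores = [2.1, 2.3, 2.0, 2.6, 2.2]
print(f"R-Index: {r_index(z_scores):.2f}")
```

When every reported result is significant but the z-scores sit just above the significance criterion, median observed power is far below the 100% success rate, so inflation is large and the R-Index is low.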
4. The Test of Insufficient Variance (TIVA) is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After test results are converted into z-scores, the z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
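The test described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names and z-scores are hypothetical, and the chi-square distribution function is computed with a standard series expansion so that the example needs only the Python standard library.

```python
import math
from statistics import variance


def chi2_cdf(x, df):
    """Left-tail probability of a chi-square variable with df degrees of freedom.

    Uses the series for the regularized lower incomplete gamma function
    P(df/2, x/2).
    """
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= t / (a + n)
        total += term
    return min(1.0, total * math.exp(-t + a * math.log(t) - math.lgamma(a)))


def tiva(z_scores):
    """Return the observed variance of the z-scores and the left-tailed p-value.

    Under the null hypothesis of no selection, the variance is 1 and
    (k - 1) * variance follows a chi-square distribution with k - 1 df.
    """
    k = len(z_scores)
    observed_var = variance(z_scores)  # sample variance (k - 1 denominator)
    statistic = (k - 1) * observed_var / 1.0  # expected variance is 1
    return observed_var, chi2_cdf(statistic, k - 1)


# Hypothetical z-scores that cluster just above the 1.96 criterion.
var, p = tiva([2.00, 2.10, 2.20, 2.05, 2.15])
print(f"variance = {var:.4f}, left-tailed p = {p:.5f}")
```

A small p-value indicates that the z-scores vary less than sampling error alone would produce, which is the pattern expected when non-significant results are missing.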
5. MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.” The results suggest that many of the cited findings are difficult to replicate.
6. Stereotype threat has been used by social psychologists to explain gender differences in math performance. According to this theory, the stereotype that men are better at math than women is threatening to women, and this threat leads to lower performance. The theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.
7. An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words. Null-hypothesis significance testing (NHST) is old, widely used, and confusing. Many false claims have been made to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (and type-II errors) using a picture from the free software G*Power.
8. Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect). They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist. This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1). As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2). A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
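The dependence of the Bayes-Factor on the alternative hypothesis can be illustrated with a simplified case. This sketch is not the analysis from the post: it assumes a one-sample design with known standard deviation and a normal prior on the effect size under the alternative, chosen only because it gives a closed-form result. The function name, the prior scales, and the data are hypothetical.

```python
from statistics import NormalDist


def bf01(z, n, prior_sd):
    """Bayes-Factor in favor of the null for an observed z-score.

    Assumptions: H0 says the effect size is exactly 0; H1 says the effect
    size follows a normal distribution with mean 0 and SD prior_sd.
    With z = d * sqrt(n), z is N(0, 1) under H0 and
    N(0, 1 + n * prior_sd^2) under H1.
    """
    null_density = NormalDist(0.0, 1.0).pdf(z)
    alt_sd = (1.0 + n * prior_sd ** 2) ** 0.5
    alt_density = NormalDist(0.0, alt_sd).pdf(z)
    return null_density / alt_density


# Hypothetical data: observed d = .2 with n = 100, so z = 2.0.
z, n = 2.0, 100
print(f"wide prior   (SD = 1.0): BF01 = {bf01(z, n, 1.0):.2f}")
print(f"narrow prior (SD = 0.2): BF01 = {bf01(z, n, 0.2):.2f}")
```

With the same data, the wide prior yields a Bayes-Factor above 1 (about 1.4, leaning toward the null), whereas the narrow prior yields a Bayes-Factor below 1 (about 0.45, leaning toward an effect). The only thing that changed is how much probability the alternative hypothesis assigns to large effect sizes.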
9. Hidden figures: Replication failures in the stereotype threat literature. A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and of claims about the importance of stereotype threat in explaining gender differences in mathematics.
10. My journey towards estimation of replicability. In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.