A few years ago, Motyl et al. (2017) published the article “The State of Social and Personality Science: Rotten to the Core, Not So Bad, Getting Better, or Getting Worse?” The article provided the first assessment of the credibility and replicability of social psychology based on a representative sample of over 1,000 hand-coded test statistics in original research articles. Given the amount of work involved, the authors may be a bit disappointed that their article has been largely ignored by social psychologists and meta-psychologists alike. So far, it has received only 23 citations in Web of Science. In comparison, the reproducibility project that replicated a quasi-representative sample of 55 studies has received over 2,700 citations, including 580 citations in 2020 alone.
In my opinion, this difference is not proportional to the contributions of the two projects. Neither actual replications nor coding of original research findings are flawless methods to estimate the replicability of social psychology. Actual replication studies have the problem that they may fail to reproduce the original conditions, especially when research is conducted with different populations. In contrast, the coding of original test statistics is 100% objective and is only biased by misreporting of statistics in original articles. The advantage of actual replications is that they more directly answer the question of interest: Can we reproduce a significant result if we conduct the same study again? As many authors from Fisher to Cohen have pointed out, actual replication is the foundation of empirical sciences. In contrast, statistical analysis of published test statistics can only estimate the outcome of actual replication studies based on a number of assumptions that are difficult or impossible to verify. In short, both approaches have their merits and shortcomings and they are best used in tandem to produce convergent evidence with divergent methods.
A key problem with Motyl et al.’s (2017) article was that they did not provide a clearly interpretable result that is akin to the shocking finding in the reproducibility project that only 14 out of the 55 (25%) replication attempts were successful, despite increased sample sizes and power for some of the replication studies. This may explain why Motyl et al. (2017) did not conclude that social psychology is rotten to the core, which would be an apt description of a failure rate of 75%.
Motyl et al. (2017) used a variety of statistical methods that were just being developed. They also converted all test statistics into z-scores and showed z-curves for studies in 2003/04 and 2013/14. Yet, they did not analyze these z-curve plots with the z-curve analysis to estimate power. Moreover, the new version of z-curve.2.0 was not yet developed.
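The conversion of test results into z-scores can be illustrated with a minimal sketch. It assumes that each test has already been reduced to a two-sided p-value; the function name `p_to_z` and the example p-values are hypothetical and are not taken from Motyl et al.'s data or code.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_to_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # Half of the p-value sits in each tail, so take the upper-tail quantile.
    return STD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)


# Hypothetical p-values for illustration only.
example_p_values = [0.05, 0.01, 0.20]
example_z_scores = [p_to_z(p) for p in example_p_values]
```

A p-value of .05 maps onto the significance criterion of z = 1.96, so a histogram of such z-scores shows directly how many results fall just above or just below that criterion.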
The authors clearly point out that the steep drop of values below the significance criterion of z = 1.96 (p = .05, two-sided) provides evidence of publication bias. “There is clear evidence of publication bias (i.e., a sharp rise of the distribution near 1.96)” (p. 49). In contrast, the Open Science Collaboration article provided no explanation for the drop in success rates from 97% in the original articles to 25% in the replication studies. This may be justified given the small sample of studies. Thus, Motyl et al.’s (2017) article should be cited because it provides clear visual evidence of publication bias in the social psychological literature. However, the only people interested in social psychology are social psychologists and they are not motivated to cite research that makes their science look bad.
A bigger limitation of Motyl et al.’s (2017) article is the discussion of power and replicability. First, the authors examine post-hoc power, which is dramatically inflated when publication bias selects significant results.
“Although post hoc observed power estimates are extremely upwardly biased and should be interpreted with great caution, our median values were very near Cohen’s .80 threshold for both time periods, a conclusion more consistent with an interpretation of it’s not so bad than it’s rotten to the core.”
To avoid these misleading conclusions, it is important to adjust power estimates for the effect of selection for significance. Motyl et al. (2017) actually report results for the R-Index, which corrects for the effect of inflation. To correct for inflation by publication bias, the R-Index first computes the discrepancy between the observed discovery rate (i.e., the percentage of z-scores greater than 1.96 in Figure 1) and observed power, and then subtracts this discrepancy from observed power. The idea is that we cannot get 95% significant results if power is only 80%. The lower the observed power is, the more the success rate is inflated by questionable research practices. The R-Index is called an index because the correction method provides biased estimates of power. So, values should be used as a heuristic, but not as proper estimates of power. However, values around 50% are relatively unbiased. Thus, the R-Index results provide some initial information about the average power of studies.
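The logic of the R-Index can be sketched in a few lines. This is a simplified illustration, not the code used by Motyl et al.: it assumes that observed power is computed from absolute z-scores with a two-sided criterion of 1.96 and that the median is used as the summary. The function names and the example z-scores are hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()
Z_CRIT = 1.96  # two-sided alpha = .05


def observed_power(z: float, z_crit: float = Z_CRIT) -> float:
    """Post-hoc power, treating the observed |z| as the true noncentrality."""
    z = abs(z)
    return (1.0 - STD_NORMAL.cdf(z_crit - z)) + STD_NORMAL.cdf(-z_crit - z)


def r_index(z_scores, z_crit: float = Z_CRIT) -> float:
    """R-Index = median observed power minus inflation."""
    z_abs = [abs(z) for z in z_scores]
    # Observed discovery rate: share of results that are significant.
    odr = sum(z > z_crit for z in z_abs) / len(z_abs)
    # Median observed power across all reported results.
    mop = median(observed_power(z, z_crit) for z in z_abs)
    # Inflation: how much the success rate exceeds observed power.
    inflation = odr - mop
    return mop - inflation


# Hypothetical z-scores for illustration only.
example_z = [2.1, 2.3, 2.0, 2.6, 1.7, 2.2, 2.4, 2.05, 2.8, 2.15]
example_r_index = r_index(example_z)
```

With the numbers from the example in the text, a success rate of 95% and observed power of 80% imply an inflation of 15 percentage points and an R-Index of 80% - 15% = 65%.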
“The R-index decreased numerically, but not statistically over time, from .62 [95% CI = .54, .68] in 2003–2004 to .52 [95% CI = .47, .56] in 2013–2014”
This result could be used as a rough estimate of the statistically predicted replication rate for social psychology that can be directly compared to the replication rate in the Open Science Collaboration project. This leads to two different conclusions about the published studies in social psychology from 1900 to 2014. Based on the Open Science Reproducibility project, the field is rotten. With a 75% failure rate, it is not clear which results can be trusted. The best approach forward would be to burn everything to the ground and start from scratch to build a science of social behavior. With a 50% replication rate, we might be more willing to call the glass half empty or half full and search for some robust findings in the rubble of the replication crisis. So, in 2021 we have no clear assessment of the credibility of social psychology. We have clear evidence of publication bias and inflation of success rates, but we do not have clear evidence about the replicability of social psychology. It would seem imprudent to ignore all published evidence based on actual replication outcomes of just 55 studies.
In a recent publication, I analyzed Motyl et al.’s data using the latest version of z-curve (Brunner & Schimmack, 2020; Bartos & Schimmack, 2021). The advantage of z-curve over the R-Index is that it does provide estimates of power that have been validated in simulation studies. I focussed on t-tests and F-tests with one degree of freedom because these tests most directly test predictions about group differences. As there were no significant differences between 2003/04 and 2013/14, only one model was fitted to all years.
Figure 2 shows the results. The first finding is that the expected replication rate (ERR) is estimated to be slightly lower than the R-Index results in Motyl et al. (2017) suggested, 43%, 95% CI = 36% to 52%. This estimate is closer to the success rate for actual replication studies (25%), but there is still a gap. One reason for this gap is that the ERR assumes exact replications. However, to the extent that replication studies are not exact, regression to the mean will lower replication rates, and in the worst case scenario, the success rate of replication studies is no different from the expected discovery rate (EDR; Bartos & Schimmack, 2020). That is, researchers are essentially doing a new study whenever they do a conceptual replication study, and the outcome of these studies is based on the average power of studies that are being conducted. The EDR estimate is 19% and the 95% CI ranges from 6% to 36%, which includes 25%. Thus, the EDR estimate for Motyl et al.’s data is consistent with the replication rate in actual replication studies.
The main purpose of this post (pre-print) is to replicate and extend the z-curve analysis of Motyl et al.’s data. There are several good reasons for doing so. First, replication is a good practice for all sciences, including meta-science. Second, a blog post by Leif Nelson and colleagues questioned the coding of test statistics and implied that the results were too good (Nelson et al., 2017). Accordingly, the actual power of studies in social psychology would be even lower than 19%, but selection for significance might boost the expected replication rate to 25%. However, direct replications are often not as informative as replication studies with an extension that addresses a new question. For this reason, this replication project did not use a random sample of studies. Instead, the focus was on the most cited articles by the most eminent social psychologists. There are several advantages of focusing on this set of studies. First, there have been concerns that studies by junior authors and studies with low citation counts are of lower quality. The wisdom of crowds might help to pick well-conducted studies with high replicability. Accordingly, this study should produce a higher ERR and EDR than Motyl et al.’s random sample of studies. Second, the replicability of highly cited articles is more important for the field than the replicability of studies with low citation counts that had no influence on the field of psychology.
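Why the ERR is higher than the EDR can be shown with a small, simplified sketch. It assumes a hypothetical mixture of studies with known true power and ignores the sign of effects; it is not the z-curve estimation procedure, which has to infer these quantities from the distribution of significant z-scores. The function name and the mixture values are made up for illustration.

```python
def edr_and_err(powers, weights):
    """Return (EDR, ERR) for a mixture of studies with known true power.

    powers  : true power of each type of study
    weights : share of all conducted studies belonging to each type
    """
    total = sum(weights)
    # EDR: share of all conducted studies expected to be significant.
    discoveries = sum(w * p for p, w in zip(powers, weights))
    edr = discoveries / total
    # ERR: mean power among significant results only. Studies with higher
    # power are more likely to be selected, so each type is weighted by
    # its power before averaging.
    err = sum(w * p * p for p, w in zip(powers, weights)) / discoveries
    return edr, err


# Hypothetical mixture: 80% of studies with 10% power, 20% with 80% power.
example_edr, example_err = edr_and_err(powers=[0.10, 0.80], weights=[0.8, 0.2])
```

In this made-up mixture, the EDR is 24% while the ERR is about 57%, because selection for significance over-represents the high-powered studies among published results. If replication studies behave like new draws from all conducted studies rather than exact copies of the published ones, their success rate moves from the ERR towards the EDR.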
A paid undergraduate student, who prefers to remain anonymous, and I coded the most highly cited articles of eminent social psychologists (an H-Index of 35 or higher in 2018). The goal was to code enough articles to have at least 20 studies per researcher.
For the most part, the results replicate the z-curve analysis of Motyl et al.’s data. The observed discovery rate is 89% compared to 90% for Motyl et al. Importantly, these values do not include marginally significant results. Including marginally significant results, the ODR is consistent with Sterling’s finding that over 90% of published focal tests in psychology are significant (Sterling, 1959; Sterling et al., 1995).
Z-curve provides the first estimates of the actual power to produce significant results. The EDR estimate for the replication study, 26%, is slightly higher than the estimate for Motyl et al., but the confidence intervals overlap considerably, showing that the differences are not statistically significant. The new confidence interval of 10% to 36% also includes the actual replication rate of 25%.
The ERR for the replication study, 49%, is a bit higher than the ERR of Motyl et al.’s study, 43%, but the confidence intervals overlap. Both confidence intervals exclude the actual replication rate of 25%, showing that the ERR of Motyl et al.’s study was not inflated by bad coding. Instead, the results provide further evidence that the ERR overestimates actual replication outcomes.
Social psychology lacks credibility
The foundations of an empirical science are objectively verified facts. In the social sciences, these building blocks are based on statistical inferences that come with the risk of false positive results. Only convergent evidence across multiple studies can provide solid foundations for theories of social behavior. However, selective publishing of studies that confirm theoretical predictions renders the published record inconclusive. The impressive success rates of close to 100% in psychology journals are a mirage and merely show psychologists’ aversion to disconfirming evidence (Sterling, 1959). The present study provides converging evidence that the actual discovery rate in social psychological laboratories is much lower and likely to be well below 50%. While statisticians are still debating the usefulness of statistical significance testing, they do agree that selecting significant results renders statistical significance useless. If only significant results are published, even false positive results like Bem’s embarrassing results of time-reversed priming get published (Bem, 2011). Nobody outside of social psychology needs to take claims based on these questionable results seriously. A science that does not publish disconfirming evidence is not a science. Period.
It is of course not easy to face the bitter truth that decades of research were wasted on pseudo-scientific publications and that the thousands of articles with discoveries may be filled with false discoveries (“Let’s err on the side of discovery” Bem, 2000). Not surprisingly, social psychologists have reacted in ways that are all too familiar to psychoanalysts. Ten years after concerns about the trustworthiness of social psychology triggered a crisis of confidence, not much has been done to correct the scientific record. Citation counts show that claims based on questionable practices are still treated as if they are based on solid empirical foundations. Textbooks continue to pretend that social psychological theories are empirically supported, even if replication failures cast doubt on these theories. However, science is like the stock market. We know it will correct eventually; we just don’t know when. Meanwhile, social psychology is losing credibility because social psychologists are unable or unwilling to even acknowledge the mistakes of the past.
Social psychology needs to improve statistical power
Criticisms of low power in social psychology are nearly as old as empirical social psychology itself (Cohen, 1962). However, despite repeated calls for increased power, power did not increase from 1960 to 2010 (I have produced the first evidence that power increased afterwards, Schimmack, 2016, 2017, 2021). The main problem of low power is that studies are likely to produce non-significant results even if a study tested a true hypothesis. However, low power also influences the false discovery risk. If only a small portion of studies produces a significant outcome, the risk of a false positive result relative to a true positive result increases (Soric, 1989). In theory, this is not a problem if replication studies can be used to separate true and false discoveries, but if replication studies are not credible, it remains unclear how many discoveries are false discoveries.
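The link between a low discovery rate and the false discovery risk can be made concrete with Soric’s (1989) upper bound. The sketch below assumes a significance criterion of alpha = .05 and that all true hypotheses are tested with perfect power, which is why the result is a maximum rather than an estimate. The function name is hypothetical.

```python
def soric_max_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate.

    discovery_rate : share of all conducted tests that are significant
    alpha          : significance criterion used for each test
    """
    if not 0.0 < discovery_rate <= 1.0:
        raise ValueError("discovery_rate must be in (0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    bound = (1.0 / discovery_rate - 1.0) * (alpha / (1.0 - alpha))
    # A rate cannot exceed 100%, so cap the bound at 1.
    return min(1.0, bound)


# Discovery rates of 19% and 26% are the EDR estimates discussed in the text.
max_fdr_19 = soric_max_fdr(0.19)
max_fdr_26 = soric_max_fdr(0.26)
```

Under these assumptions, a discovery rate of 19% implies that up to about 22% of significant results could be false positives, and a discovery rate of 26% implies up to about 15%. When the discovery rate falls to the level of alpha itself, the bound reaches 100%.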
Social psychology needs to invest more resources in original studies.
Before the major replication crisis in the 2010s, social psychologists were concerned about questionable practices in the 1990s (Kerr, 1998). In response to these concerns, demands increased to demonstrate robustness of findings in multi-study articles (cf. Schimmack, 2012). Surprisingly, social psychologists were able to present significant results again and again in these multiple-study articles, creating the illusion of replicability. Even Bem (2011) demonstrated time-reversed causality in nine studies. This is practically impossible by chance alone. However, these seemingly robust results did not show that social psychological results were credible. Instead, they showed that social psychologists had found ways to produce many significant results with questionable practices. The demand for multiple studies is no longer needed when original studies are credible because they used large samples and pre-registered dependent variables and other design features. However, social psychologists continue to expect multiple studies within a single article. To do so, social psychologists have moved online and conduct short, cheap studies that take only a few minutes. These studies are not intrinsically bad, but they crowd out important research on actual social behavior or intervention studies that can actually reduce prejudice or change other social behaviors. Cohen famously said, less is more. By this he did not mean to lower standards of external validity. Instead, he was trying to push back against a research culture that prizes quantitative indicators of success like the number of significant results, articles, and citations. This research culture has produced no reliable interventions to reduce prejudice in 60 years of research. It is time to change this and to reward carefully planned, expensive, and difficult studies that can make a real contribution. This may require collaboration rather than competition among labs.
Social psychology needs a Hubble telescope, a CERN collider, or a large household panel study to tackle big questions. The genius scientist with a sample of 40 undergraduate students like Festinger was the wrong role model for social psychology for far too long. The Open Science Collaboration project showed how collaboration across many labs can have a big impact that no single replication study could have had. This should also be the model for original social psychology.
Evidence is accumulating that social psychology has made a lot of mistakes in the past. The evidence that has accumulated in social psychological journals has little evidential value. It will take time to separate what is credible and what is not. New researchers need to be careful to avoid investing resources in research lines that are mirages and to look for oases in the desert. A reasonable heuristic is to distrust all published findings with a p-value greater than .005 and to carefully check the research practices of individual researchers (Schimmack, 2021). Of course, it is not viable to retract all bad articles that have been published or to issue expressions of concern for entire volumes. However, consumers of social psychology need to be aware that the entire literature comes with a big warning label: “Readers are advised to proceed with caution.”