[corrected 9.2.2019, see erratum at the end]
Trust is good. Trust with threat of audits is better. (Ira Rose Smith)
Why Psychological Science Is Not an Empirical Science
Science is built on trust. This is especially true for psychology where researchers conduct studies in their labs and report their findings in peer-reviewed journals. Peer-review does not ensure that data were analyzed correctly or that all relevant data are included in a submitted manuscript. In fact, it is well-known that relevant data are often omitted from submitted manuscripts. This selective reporting of evidence that confirms authors’ predictions is known as publication bias. Thus, it is reasonable to be skeptical that published articles report all of the relevant information to evaluate a theory.
Publication bias is the single most important reason for replication failures. Imagine that a researcher conducted 20 independent statistical analyses and found one significant result. This result is featured in a manuscript that is accepted for publication. If it were known that 19 other statistical tests had been conducted, the single significant result would be considered a statistical fluke, because we expect one significant result for every 20 (independent) tests by chance alone if the significance criterion is .05 (1/20). However, if the result is presented as if only one statistical test was conducted, as the only test of an interesting theoretical prediction, the evidence looks much stronger. Thus, selective publication of significant results gives an inflated estimate of the robustness of published results in peer-reviewed journals (Sterling, 1959; Rosenthal, 1979).
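The 1-in-20 arithmetic can be made concrete with a short simulation. This is a minimal sketch, not anyone's actual analysis pipeline; the function names and the number of simulated "papers" are my own choices, and the scenario (20 tests of true null hypotheses, only significance reported) comes from the text above.

```python
import random

# Illustrative simulation: a researcher runs 20 independent tests of
# hypotheses that are all false (true null), using alpha = .05, and a
# "paper" gets written whenever at least one test comes out significant.
random.seed(1)

def one_paper(n_tests=20, alpha=0.05):
    """True if at least one of n_tests null tests is 'significant'.

    Under the null hypothesis, p-values are uniform on [0, 1], so each
    test is significant with probability alpha.
    """
    return any(random.random() < alpha for _ in range(n_tests))

n_papers = 100_000
rate = sum(one_paper() for _ in range(n_papers)) / n_papers
# Analytically, the chance of at least one false positive in 20 tests
# is 1 - 0.95**20, about 0.64.
print(round(rate, 2))
```

Roughly two out of three such "researchers" would have a publishable significant result even though every hypothesis was false, which is why knowing how many tests were run matters so much.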
As the number of tests that were actually conducted is unknown, publication bias makes it impossible to say whether a significant result provides strong or weak evidence for a hypothesis. Thus, published results in psychology journals provide no information about the strength of empirical support for theoretical claims. It was only in 2011 that psychologists started to realize that their approach to science fails to test theoretical predictions, after Bem (2011) was able to present seemingly strong evidence for mental time-travel, which does not really exist. Bem's (2011) absurd findings were obtained with questionable research practices (Schimmack, 2018). If his practices had been known, his article would not have been accepted for publication.
The problem for psychology as a science is that Bem’s article was no exception. Rather, selective presentation of confirmatory evidence was, and still is, the norm in psychology (John et al., 2012), despite some efforts to reform psychology by means of open and transparent reporting of practices. Thus, thousands of published articles report statistically significant results that provide no empirical evidence, and without empirical evidence a science is not an empirical science. This sad state of affairs is known as the replication crisis, crisis of confidence, or credibility crisis in psychology (Scientific American).
A Retroactive Fix of the Credibility Crisis in Psychology
There are two approaches to fixing the credibility crisis in psychology. One approach is to redo studies without publication bias to see whether published results replicate. An added bonus of this approach is that psychological results can be influenced by culture and could change over time. Thus, even if a result was true in 1980, it may no longer be true in 2019. For example, sex differences in the reported frequency of masturbation have been decreasing over time. The downside of this approach is that it is extremely costly and takes away resources from conducting new studies. It is simply impossible to redo thousands of studies to separate credible results from incredible ones.
I have been working on a second approach to examine the credibility of published results. This approach takes advantage of the fact that psychologists not only report whether a result is significant (p < .05), but also provide information about the strength of the evidence against the null-hypothesis in the form of exact test statistics (t-values, F-values). Although the strength of evidence in a single study is strongly influenced by chance (sampling error), a meta-analysis of the strength of evidence for sets of studies provides important information about the credibility of significant results (Schimmack, 2012). Brunner and Schimmack (2018) developed a statistical tool that makes it possible to predict the outcome of replication studies based on published test statistics that have been selected to be significant. The method is called z-curve because it uses z-scores as the common metric to measure the strength of evidence against the null-hypothesis. Without going into the statistical details here, z-curve provides information about the amount of publication bias and the chances that a finding can be replicated under the same conditions as the original study. That is, it cannot predict replication failures that are due to changes in cultural factors or other differences between studies (e.g., different populations).
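The common-metric step can be illustrated with a few lines of code. This is only a sketch of the conversion from a two-sided p-value to the corresponding absolute z-score, not the z-curve estimation procedure itself; the helper name and example p-values are mine.

```python
from statistics import NormalDist

# Convert a two-sided p-value into the absolute z-score that would
# produce it under a standard normal reference distribution. This puts
# results reported as t-values, F-values, or p-values on one scale.
def p_to_z(p):
    """Absolute z-score corresponding to a two-sided p-value."""
    return NormalDist().inv_cdf(1 - p / 2)

for p in (0.05, 0.01, 0.001):
    print(p, round(p_to_z(p), 2))
# 0.05  -> 1.96
# 0.01  -> 2.58
# 0.001 -> 3.29
```

These familiar cutoffs explain the z-score bands discussed below: a result that is "just significant" (p just under .05) sits at a z-score barely above 2.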
The main advantage of using z-curve for published studies is that it is possible to separate credible and incredible findings in the published literature without having to do actual replication studies. I call a z-curve analysis of published studies an audit because it serves the same function as a tax audit. Just like science, taxation relies on trust. Citizens report their own income, and their tax forms are used to determine the taxes they owe. Just like scientists, citizens are tempted to report their taxes in a way that is biased in their favor. To reduce this bias, revenue agencies conduct tax audits to ensure that citizens report their taxes honestly. A replicability audit does the same for science. It shows how trustworthy a scientist’s self-reported results are.
Replicability Audit of Social Psychology
The credibility crisis in psychology originated in social psychology and there is evidence that social psychologists have used questionable research practices more than other psychologists, especially cognitive psychologists (OSC, 2015; Schimmack, 2018). Social psychology has also been the target of actual replication studies in the form of Registered Replication Reports and statistical examinations of replicability (Motyl et al., 2017; Schimmack, 2018).
Motyl et al. (2017) used a representative sample of articles in several social psychology journals to examine the credibility of social psychology. A z-curve analysis of their data produced an average estimate of 44% replicability. However, there was also evidence of heterogeneity across studies. Studies with weak evidence, p-values between .05 and .01, had only a replicability of 22%, and even studies with p-values between .01 and .001 averaged only 28% replicability.
Scientists as Moderators
Although science is often characterized as a collective and objective pursuit of truth, science requires scientists, and scientists’ behavior is influenced by self-interest and subjectivity. Thus, the personality of scientists also influences science, at least in the short term. Z-curve makes it possible to examine whether scientists differ in their research practices in ways that produce differences in credibility. For example, some scientists may use larger samples or within-subject designs, which reduces sampling error and increases the chances that their findings replicate. Others may use small samples for highly exploratory research and publish findings that are less likely to replicate. Thus, it is possible that replicability varies across individuals and that individual differences in research practices contribute to the heterogeneity in replicability in social psychology. To test this hypothesis, I started to conduct replicability audits of individual researchers.
The audits of eminent social psychologists provide an opportunity to cross-validate the results obtained with Motyl et al.’s representative sample. Both sampling strategies have limitations. Motyl et al. used only four years and four journals. In contrast, the audit of the most highly cited articles by eminent social psychologists neglects work by many other psychologists. Thus, the populations of studies differ, and different results could be obtained for these two populations. However, Figure 1 shows that this is not the case. The replicability estimate for high-profile articles is 37% with a 95%CI ranging from 27% to 41%. [Note that this number will change as more audits are added to this analysis]. This estimate is close to the 44% estimate (95%CI 37-49%) for Motyl’s sample.
There is also evidence for heterogeneity, as shown by the local power estimates below the x-axis. Z-scores that are just significant (z = 2 to 2.5) have very low replicability (22%), and even z-scores between 2.5 and 3 have low replicability (28%). To have at least a 50% chance that a result will replicate requires z-scores above 3.5, and 80% replicability requires z-scores greater than 4, which is consistent with results for actual replication studies (OSC, 2015). As shown, there are very few studies that meet this criterion. Thus, most published results by eminent social psychologists lack credibility without successful replications in credible replication studies.
However, it would be unfair to stereotype social psychologists and treat all social psychologists alike. The danger of stereotyping is probably one of the most important topics in social psychology. Thus, social psychologists should be evaluated based on their own research practices rather than the typical research practices in the field. For this reason, I also provide individual audits for each of the eminent social psychologists who have been audited so far (see links below).
In conclusion, psychology has a credibility crisis because statistical significance is no longer a valid measure of credibility when publication bias is present (Sterling, 1959). Z-curve makes it possible to assess the credibility of published results on the basis of the strength of evidence provided in original articles. Audits of social psychology show that average replicability in social psychology is low (< 50%) and especially low for studies with p-values between .05 and .01, which are the majority of published findings. Thus, social psychology has a replication crisis in which many published results will not replicate. However, replicability varies across researchers, and authorship is a valid cue of credibility. Current researchers should ensure that they conduct open studies with high statistical power so that audits of their work produce favorable results.
List of Social Psychologists with a Replicability Audit:
Roy F. Baumeister (20%)
John A. Bargh (29%)
Fritz Strack (38%)
Norbert Schwarz (39%)
Timothy D. Wilson (41%)
Susan T. Fiske (59%)
Steven Heine (75%)
P.S. For a larger list of replicability analyses based on automated extraction of focal and non-focal tests, see the Replicability Rankings.
Alexander Rubenbauer spotted a mistake in an earlier version of this blog post. I wrote “Audits of social psychology show that average replicability in social psychology is low (< 50%) and especially low for studies with p-values below .001, which are the majority of published findings.” The correct statement is “Audits of social psychology show that average replicability in social psychology is low (< 50%) and especially low for studies with p-values between .05 and .01, which are the majority of published findings.”