
Open Science Reveals Most Published Results are Not False

CORRECTION: Open science also means that our mistakes are open and transparent. Shortly after I posted this blog, Spencer Greenberg pointed out that I made a mistake when I used the discovery rate in the OSC data to estimate the discovery rate in psychological science. I am glad he caught my mistake quickly, and I can warn readers that my conclusions do not hold. A 50% success rate for replications in cognitive psychology suggests that most results in cognitive psychology are not false positives, but the low replication rate of 25% for social psychology does allow for a much higher false discovery rate than I estimated in this blog post.

===========================================================================

Money does not make the world go round and it cannot buy love, but it pays for pretty much everything else. Money is behind most scientific discoveries. Just like investments in stock markets, investments in science are unpredictable. Some of these investments pay off (e.g., Covid-19 vaccines), but others do not.

Most scientists, like myself, rely on government funding that is distributed in a peer-reviewed process by scientists to scientists. It is difficult to see how scientists would fund research that aims to show that most of their work is useless, if not fraudulent. This is where private money comes in.

The Arnold Foundation handed out two big grants to reform science (Arnold Foundation Awards $6 Million to Boost Quality of Research; The Center for Open Science receives $7.5 million in additional funding from the Laura and John Arnold Foundation).

One grant was given to Ioannidis, who was famous for declaring that “most published results are false” (Ioannidis, 2005). The other grant was given to Nosek to establish the Center for Open Science.

Ioannidis and Nosek also worked together as co-authors (Button et al., 2013). In terms of traditional metrics of impact, the Arnold Foundation’s investment paid off. Ioannidis’s (2005) article has been cited over 4,000 times. Button et al.’s article has been cited over 2,000 times. And an influential article by Nosek and many others that replicated 100 studies from psychology has been cited over 2,000 times.

These articles are go-to citations for authors who claim that science is in a replication crisis, that most published results are false, and that major reforms to scientific practices are needed. It is no secret that many authors who cite these articles have not read them. This may explain why, among thousands of citing articles, not a single one points out that the Open Science Collaboration findings contradict Ioannidis’s claim that most published results are false.

The Claim

Ioannidis (2005) used hypothetical examples to speculate that most published results are false. The main assumption underlying these scenarios was that researchers are much more likely to test false hypotheses (a vaccine has no effect) than true hypotheses (a vaccine has an effect). The second assumption was that even when researchers test true hypotheses, they do so with a low probability of obtaining sufficient evidence (p < .05) that an effect occurred; that is, with low statistical power.

Under these assumptions, most empirical tests of hypotheses produce non-significant results (p > .05) and among those that are significant, the majority come from the large number of tests that tested a false hypothesis (false positives).

In theory, it would be easy to verify Ioannidis’s predictions because he predicts that most results are not significant, p > .05. Thus, a simple count of significant and non-significant results would reveal whether many published results are false. The problem is that not all hypothesis tests are published and that significant results are more likely to be published than non-significant results. This bias in the selection of results is known as publication bias. Ioannidis (2005) called it researcher bias. As the amount of researcher bias is unknown, there is ample room to suggest that it is large enough to fit Ioannidis’s prediction that most published significant results are false positives.

The Missing Piece

Fifteen years after Ioannidis claimed that most published results are false, there have been few attempts to test this hypothesis empirically. One attempt was made by Jager and Leek (2014). This article made two important contributions. First, Jager and Leek created a program to harvest statistical results from abstracts in medical journals. Second, they developed a model to analyze the harvested p-values and estimate the percentage of false positive results in the medical literature. They ended up with an estimate of 14%, which is well below Ioannidis’s claim that over 50% of published results are false.

Ioannidis’s reply made it clear that a multi-million-dollar investment in his idea made it impossible for him to look at this evidence objectively. Clearly, his speculations based on no data must be right, and an actual empirical test must be wrong if it did not confirm his prediction. In science this is known as confirmation bias. Ironically, confirmation bias is one of the main obstacles that prevent science from making progress and correcting false beliefs.

Ioannidis (2014), p. 34

Fortunately, there is a much easier way to test Ioannidis’s claim than Jager and Leek’s model, which may have underestimated the false discovery risk. All we need to estimate the false discovery rate under the worst-case scenario is a credible estimate of the discovery rate (i.e., the percentage of significant results). Once we know how many tests produced a positive result, we can compute the maximum false discovery rate using a simple formula developed by Soric (1989).

Maximum False Discovery Rate = (1/Discovery Rate - 1) * (.05/.95)
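To make the formula concrete, here is a minimal Python sketch of Soric's bound; the function name is my own and the discovery rate is assumed to be given as a proportion:

```python
def max_false_discovery_rate(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate.

    Worst-case bound that assumes all tests of true hypotheses have
    100% power; discovery_rate is the proportion of significant results.
    """
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# Example: a discovery rate of 36% (the replication rate reported below)
print(round(max_false_discovery_rate(0.36), 3))  # 0.094, i.e., roughly 9%
```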

The only challenge is to find a discovery rate that is not inflated by publication bias. And that is where Nosek and the Center for Open Science come in.

The Reproducibility Project

It has been known for decades that psychology has a publication bias problem. Sterling (1959) observed that over 90% of published results report a statistically significant result. This finding was replicated in 1995 (Sterling et al., 1995) and again in 2015, when a large team of psychologists replicated 100 studies; 97% of the original studies had reported a statistically significant result (Open Science Collaboration, 2015).

Using Soric’s formula, this would imply a false discovery rate of essentially 0. However, the replication studies showed that this high discovery rate is inflated by publication bias. More importantly, the replication studies provide an unbiased estimate of the actual discovery rate in psychology. Thus, these results can be used to estimate the maximum false discovery rate in psychology, using Soric’s formula.

The headline finding of this article was that 36% (35/97) of the replication studies reproduced a significant result.

Using Soric’s formula, this implies a maximum (!) false discovery rate of 9%, which is well below the 50% predicted by Ioannidis. The difference is so large that no statistical test is needed to infer that Nosek’s results falsify Ioannidis’s claim.

Table 1 also shows the discovery rates for specific journals or research areas. The discovery rate for cognitive psychology in the journal Psychological Science is 53%, which implies a maximum FDR of 5%. For cognitive psychology published in the Journal of Experimental Psychology: Learning, Memory, and Cognition the DR of 48% implies a maximum FDR of 6%.

Things look worse for social psychology, which has also seen a string of major replication failures (Schimmack, 2020). However, even here we do not get false discovery rates over 50%. For social psychology published in Psychological Science, the discovery rate of 29% implies a maximum false discovery rate of 13%, and social psychology published in JPSP has a discovery rate of 23% and a maximum false discovery rate of 18%.

These results do not imply that everything is going well in social psychology, but they do show how unrealistic Ioannidis’s scenarios were that produced false discovery rates over 50%.

Conclusion

The Arnold Foundation has funded major attempts to improve science. This is a laudable goal, and I have spent the past 10 years working towards the same goal. Here I simply point out that one big successful initiative, the reproducibility project (Open Science Collaboration, 2015), produced valuable data that can be used to test a fundamental assumption in the open science movement, namely the fear that most published results are false. Using the empirical data from the Open Science Collaboration, we find no empirical support for this claim. Rather, the results are in line with Jager and Leek’s (2014) finding that strictly false results, where the null-hypothesis is true, are the exception rather than the norm.

This does not mean that everything is going well in science because rejecting the null-hypothesis is only a first step towards testing a theory. However, it is also not helpful to spread false claims about science that may undermine trust in science. “Most published results are false” is an eye-catching claim, but it lacks empirical support. In fact, it has been falsified in every empirical test that has been conducted. Ironically, the strongest empirical evidence based on actual replication studies comes from a project that used open science practices that would not have happened without Ioannidis’s alarmist claim. This shows the advantages of open science practices and implementing these practices remains a valuable goal even if most published results are not strictly false positives.

Empirical Standards for Statistical Significance

Many sciences, including psychology, rely on statistical significance to draw inferences from data. A widely accepted practice is to consider results with a p-value less than .05 as evidence that an effect occurred.

Hundreds of articles have discussed the problems of this approach, but few have offered attractive alternatives. As a result, very little has changed in the way results are interpreted and published in 2020.

Even if this would suddenly change, researchers still have to decide what to do with the results that have been published so far. At present there are only two options: either trust all results and hope for the best, or assume that most published results are false and start from scratch. Trust everything or trust nothing are not very attractive options. Ideally, we would want a method that can separate more credible findings from less credible ones.

One solution to this problem comes from molecular genetics. When it became possible to measure genetic variation across individuals, geneticists started correlating single variants with phenotypes (e.g., the serotonin transporter gene variation and neuroticism). These studies used the standard approach of declaring results with p-values below .05 a discovery. Actual replication studies showed that many of these results could not be replicated. In response to these replication failures, the field moved towards genome-wide association studies that tested many genetic variants simultaneously. This further increased the risk of false discoveries. To avoid this problem, geneticists lowered the criterion for a significant finding. This criterion was not picked arbitrarily. Rather, it was determined by estimating the false discovery rate or false discovery risk. The classic article that recommended this approach has been cited over 40,000 times (Benjamini & Hochberg, 1995).
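For readers unfamiliar with this approach, here is a minimal sketch of the Benjamini-Hochberg step-up procedure; it is a generic textbook implementation with made-up example p-values, not code from any particular genomics pipeline:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns the indices of hypotheses that can be rejected while
    controlling the false discovery rate at level q.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0  # largest rank k with p_(k) <= (k/m) * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Only the two smallest p-values survive FDR control at q = .05
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.60]))  # [0, 1]
```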

In genetics, a single study produces thousands of p-values that require a correction for multiple comparisons. Studies in other disciplines usually produce a much smaller number of p-values (typically fewer than 100). However, an entire scientific field also generates thousands of p-values. This makes it necessary to control for multiple comparisons and to lower the significance threshold from the nominal value of .05 to maintain a reasonably low false discovery rate.

The main difference between original studies in genomics and meta-analysis of studies in other fields is that publication bias can inflate the percentage of significant results. This leads to biased estimates of the actual false discovery rate (Schimmack, 2020).

One solution to this problem is selection models that take publication bias into account. Jager and Leek (2014) used this approach to estimate the false discovery rate in medical journals for statistically significant results, p < .05. In response to this article, Goodman (2014) suggested asking a different question.

What significance criterion would ensure a false discovery rate of 5%?

Although this is a useful question, selection models have not been used to answer it. Instead, recommendations for adjusting alpha have been based on ad-hoc assumptions about the number of true hypotheses that are being tested and power of studies.

For example, the false positive rate is greater than 33% with prior odds of 1:10 and a P value threshold of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005 would reduce this minimum false positive rate to 5% (D. J. Benjamin et al., 2017, p. 7).

Rather than relying on assumptions, it is possible to estimate the maximum false discovery rate based on the distribution of statistically significant p-values (Bartos & Schimmack, 2020).

Here, I illustrate this approach with p-values from 120 psychology journals for articles published between 2010 and 2019. An automated extraction of test-statistics found 670,055 useable test-statistics. All test-statistics were converted into absolute z-scores that reflect the amount of evidence against the null-hypothesis.

Figure 1 shows the distribution of the absolute z-scores. The first notable observation is the drop (from right to left) in the distribution right at the standard level for statistical significance, p < .05 (two-tailed), which corresponds to a z-score of 1.96. This drop reveals publication bias. The amount of bias is reflected in a comparison of the observed discovery rate and the estimated discovery rate. The observed discovery rate of 67% is simply the percentage of p-values below .05. The estimated discovery rate is the percentage of significant results implied by the z-curve model that is fitted to the significant results (grey curve). The estimated discovery rate is only 38%, and the 95% confidence interval around this estimate, 32% to 49%, does not include the observed discovery rate. This shows that significant results are more likely to be reported and that non-significant results are missing from published articles.

If we used the observed discovery rate of 67%, we would underestimate the risk of false positive results. Using Soric’s (1989) formula,

FDR = (1/DR - 1) * (.05/.95)

a discovery rate of 67% implies a maximum false discovery rate of 3%. Thus, no adjustment to the significance criterion would be needed to maintain a false discovery rate below 5%.

However, publication bias is present and inflates the discovery rate. To adjust for this, we can use the estimated discovery rate of 38% and get a maximum false discovery rate of 9%. As this value exceeds the desired rate of false discoveries, we need to lower alpha to reduce the false discovery rate.
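Using the Soric sketch from earlier, the contrast between the two discovery rates reported here can be reproduced in a few lines:

```python
# max_false_discovery_rate() as defined in the earlier sketch
for label, dr in [("observed DR", 0.67), ("estimated DR", 0.38)]:
    print(label, round(max_false_discovery_rate(dr), 3))
# observed DR 0.026  -> ~3%, no alpha adjustment would seem necessary
# estimated DR 0.086 -> ~9%, exceeds the 5% target, so alpha must be lowered
```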

Figure 2 shows the results when alpha is set to .005 (z = 2.80), as recommended by Benjamin et al. (2017). The model is only fitted to data that are significant with this new criterion. We now see that the observed discovery rate (44%) is even lower than the estimated discovery rate (49%), although the difference is not significant. Thus, there is no evidence of publication bias with this new criterion for significance. The reason is that many questionable practices that are used to report significant results produce just-significant results. This is seen in the excess of just-significant results between z = 2 and z = 2.8. These results no longer inflate the discovery rate because they are no longer counted as discoveries. We also see that the estimated discovery rate produces a maximum false discovery rate of 6%, which may be close enough to the desired level of 5%.

Another piece of useful information is the estimated replication rate (ERR). This is the average power of results that are significant with p < .005 as criterion. Although lowering the alpha level decreases power, the average power of 66% suggests that many results should replicate successfully in exact replication studies with the same sample size. Increasing sample sizes could help to achieve 80% power.

In conclusion, we can use the distribution of p-values in the psychological literature to evaluate published findings. Based on the present results, readers of published articles could use p < .005 (rule of thumb: z > 2.8, t > 3, or chi-square > 9, F > 9) to evaluate statistical evidence.

The empirical approach to justify alpha with FDRs has the advantage that it can be adjusted for different literatures. This is illustrated with the Attitudes and Social Cognition section of JPSP. Social cognition research has experienced a replication crisis due to massive use of questionable research practices. It is possible that even alpha = .005 is too liberal for this research area.

Figure 3 shows the results for test statistics published in JPSP-ASC from 2000 to 2020.

There is clear evidence of publication bias (ODR = 71%, EDR = 31%). Based on the EDR of 31%, the maximum false discovery rate is 11%, well above the desired level of 5%. Even the 95%CI around the FDR does not include 5%. Thus, it is necessary to lower the alpha criterion.

Using p = .005 as the criterion improves things, but not fully. First, a comparison of the ODR and EDR suggests that publication bias was not fully removed, 43% vs. 35%. Second, the EDR of 35% still implies a maximum FDR of 10%; the 95%CI now touches 5%, but it also has 35% as its upper limit. Thus, even with p = .005, the social cognition literature is not credible.

Lowering the criterion further does not solve this problem. The reason is that there are now so few significant results that the discovery rate remains low. This is shown in the next figure, where the criterion is set to p < .0005 (z = 3.5). The model cannot be fitted to z-scores this extreme because there is insufficient information about lower-powered studies. Thus, the model was fitted to z-scores greater than 2.8 (p < .005). In this scenario, the expected discovery rate is 27%, which implies a maximum false discovery rate of 14%, and the 95%CI still does not include 5%.

These results illustrate the problem of conducting many studies with low power. The false discovery risk remains high because there are only a few test statistics with extreme values, and a few extreme test statistics are expected by chance.

In short, setting alpha to .005 is still too liberal for this research area. Given the ample replication failures in social cognition research, most results cannot be trusted. This conclusion is also consistent with the actual replication rate in the Open Science Collaboration (2015) project, which could only replicate 7/31 (23%) of results. With a discovery rate of 23%, the maximum false discovery rate is 18%. This is still way below Ioannidis’s claim that most published results are false positives, but it is also well above 5%.

Different results are expected for the Journal of Experimental Psychology, Learning, Memory, and Cognition (JEP-LMC). Here the OSC project was able to replicate 13/47 (48%) results. A discovery rate of 48% implies a maximum false discovery rate of 6%. Thus, no adjustment to the alpha level may be needed for this journal.

Figure 6 shows the results for the z-curve analysis of test statistics published from 2000 to 2020. There is evidence of publication bias: the ODR of 67% falls outside the 95%CI around the EDR of 45%. However, with an EDR of 45%, the maximum FDR is 7%. This is close to the estimate based on the OSC results and close to the desired level of 5%.

For this journal it was sufficient to set the alpha criterion to p < .03. This produced a fairly close match between the ODR (61%) and EDR (58%) and a maximum FDR of 4%.

Conclusion

Significance testing was introduced by Fisher about 100 years ago, and he would still recognize the way scientists analyze their data because not much has changed. Over the past 100 years, many statisticians and practitioners have pointed out problems with this approach, but no practical alternatives have been offered. Adjusting the significance criterion depending on the research question is one reasonable modification, but it often requires more a priori knowledge than researchers have (Lakens et al., 2018). Lowering alpha makes sense when there is a concern about too many false positive results, but it can be a costly mistake when false positive results are fewer than feared (Benjamin et al., 2017). Here I presented a solution to this problem. It is possible to use the maximum false discovery rate to pick alpha so that the percentage of false discoveries is kept at a reasonable minimum.

Even if this recommendation does not influence the behavior of scientists or the practices of journals, it can be helpful to compute alpha values that ensure a low false discovery rate. At present, consumers of scientific research (mostly other scientists) are used to treating all significant results with p-values less than .05 as discoveries. Literature reviews mention studies with p = .04 as if they have the same status as studies with p = .000001. Once a p-value crosses the magic .05 level, it becomes a solid fact. This is wrong because statistical significance alone does not ensure that a finding is a true positive. To avoid this fallacy, consumers of research can make their own adjustment to the alpha level. Readers of JEP:LMC may use .05 or .03 because this alpha level is sufficient. Readers of JPSP-ASC may lower alpha to .001.

Once readers demand stronger evidence from journals that publish weak evidence, researchers may actually change their practices. As long as consumers buy every p-value less than .05, there is little incentive for producers of p-values to try harder to produce stronger evidence, but when consumers demand p-values below .005, supply will follow. Unfortunately, consumers have been gullible, and it was easy to sell them results that do not replicate with a p < .05 warranty because they had no rational way to decide which p-values they should trust. Maintaining a reasonably low false discovery rate has proved useful in genomics; it may also prove useful for other sciences.

Ioannidis is Wrong Again

In 2005, Ioannidis wrote an influential article with the title “Why most published research findings are false.” This article has been widely cited by scientists and in the popular media as evidence that we cannot trust scientific results (The Atlantic).

It is often overlooked that Ioannidis’s big claim was not supported by empirical evidence. It rested entirely on hypothetical examples. The problem with big claims that are based on intuition rather than empirical observations is that they can induce confirmation bias. Just like original researchers with their pet theories, Ioannidis was no longer an objective meta-scientist who could explore how often science is wrong. He had to go out and find evidence to support his claim. And that is what he did.

In 2017, Denes Szucs and John P. A. Ioannidis published an article that examined the risk of false positive results in cognitive neuroscience and psychology. The abstract suggests that the empirical results support Ioannidis’s claim that most published results are false positives.

“We conclude that more than 50% of published findings deemed to be statistically significant are likely to be false.”

The authors shared their data, which made it possible for me to verify this conclusion using my own statistical method that can be used to assess the maximum false positive rate (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). I first used the information about t-values and their degrees of freedom to compute absolute z-scores. Z-scores have the advantage that they all have the same sampling distribution, so the values provide standardized information about the strength of evidence against the null-hypothesis. The distribution of the absolute z-scores was then analyzed using z-curve.2.0 (Bartos & Schimmack, 2020).
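The conversion from t-values to absolute z-scores goes through the two-tailed p-value. A minimal sketch of this standard transformation (not the z-curve code itself) might look like this:

```python
from scipy import stats

def t_to_abs_z(t, df):
    """Convert a t-statistic into an absolute z-score.

    The two-tailed p-value of the t-test is mapped onto the z-score
    that yields the same p-value under the standard normal distribution.
    """
    p_two_tailed = 2 * stats.t.sf(abs(t), df)
    return stats.norm.isf(p_two_tailed / 2)

print(round(t_to_abs_z(2.5, 20), 2))  # t = 2.5 with 20 df corresponds to z ~ 2.30
```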

Figure 1 shows the results under the assumption that there is no publication bias. As a result, both non-significant and significant results are fitted. Visual inspection shows some evidence that there are too many significant results, especially those that just reached significance (z > 1.96 corresponds to p = .05, two-tailed). There are also too few results that just missed significance and are sometimes considered marginally significant (p < .10, z > 1.65). This pattern suggests that researchers used questionable research practices to present marginally significant results as significant. However, in the big picture of all tests, this bias is relatively small. The observed discovery rate of 64% is only slightly higher than the expected discovery rate of 60%. This is a small amount of inflation, and even with this large sample size, the deviation is not statistically significant (i.e., 64% is within the 95%CI of the EDR from 55% to 66%).

Szucs and Ioannidis also created a scenario without researcher bias and still concluded that most published results are false.

For example, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias.

Figure 1 shows that this assumption is totally incompatible with the data. A model that assumes no bias has a discovery rate of 60%, and a discovery rate of 60% implies that no more than 3% of significant results can be false positives (Soric, 1989). Even the upper limit of the 95% CI is only 4% false discoveries. Thus, the empirical data clearly falsify Szucs and Ioannidis’s wild guess that psychologists test only 7% true hypotheses. Even actual replication studies have produced 37% significant results, which puts the rate of true hypotheses at a minimum of 37% (OSC, 2015). Thus, the conclusion in the abstract is based on false assumptions and not on an unbiased examination of the data.

Despite the small amount of bias in Figure 1, it is likely that some researcher bias is present. It is therefore reasonable to see what happens when a model allows for researcher bias. To do so, z-curve can be fitted only to the distribution of significant results to correct for selection for significance. These results are shown in Figure 2.

This model shows clearer evidence of selection for significance. The expected discovery rate is 42%, and the 95%CI, 24% to 52%, does not include the observed discovery rate of 64%. It is therefore safe to assume that publication bias inflates the observed discovery rate. However, even with a discovery rate of 42%, the maximum false discovery rate is only 7%, and even if we use the lower bound of the 95%CI of the EDR, 24%, the false discovery rate is only 17%, which is still well below the 50% level needed to support Ioannidis’s famous claim that most published results are false.

In short, an objective assessment of Ioannidis’s own data falsifies his claim that most published results are false positives. So, how did he end up concluding that the data support his claim?

To make any claims about the false discovery rate, the authors had to make several assumptions because their model did not estimate the actual power of studies and did not measure the actual amount of bias. Thus, all Ioannidis had to do was to adjust the assumptions to fit the data. As in 2005, Ioannidis then presents these speculations as if they are empirical facts.

Non-scientists may be surprised that somebody can get away with such big claims that are not supported by evidence. After all, scientific articles are peer-reviewed. However, insiders are well aware that peer-review is an imperfect method of quality control. Still, it is amazing that Ioannidis has been getting away with his bold claim that undermines trust in science for so long. Science is not perfect, and Ioannidis is a perfect example of the staying power of false claims, but science is still the best way to search for truth and solutions. Fortunately, Ioannidis was wrong about science. Science needs improvement, but it has produced many important and robust findings, such as the discovery of highly effective vaccines against Covid-19. We should not blindly trust science. Instead, we need to examine the data and the assumptions underlying scientific claims, including meta-scientific ones. When we do this, it turns out that Ioannidis’s fight against researcher bias is based on a biased assessment of bias.

Ioannidis is Wrong Most of the Time

John P. A. Ioannidis is a rock star in the world of science (wikipedia).

By traditional standards of science, he is one of the most prolific and influential scientists alive. He has published over 1,000 articles that have been cited over 100,000 times.

He is best known for the title of his article “Why most published research findings are false” that has been cited nearly 5,000 times. The irony of this title is that it may also apply to Ioannidis, especially because there is a trade-off between quality and quantity in publishing.

Fact Checking Ioannidis

The title of Ioannidis’s article implies a factual statement: “Most published results ARE false.” However, the actual article does not contain empirical data to support this claim. Rather, Ioannidis presents some hypothetical scenarios that show under what conditions published results MAY BE false.

To produce mostly false findings, a literature has to meet two conditions.

First, it has to test mostly false hypotheses.
Second, it has to test hypotheses in studies with low statistical power, that is a low probability of producing true positive results.

To give a simple example, imagine a field that tests only 10% true hypotheses with just 20% power. As power determines the percentage of true discoveries, only 2 out of the 10 true hypotheses (per 100 tests) will be significant. Meanwhile, the alpha criterion of 5% implies that 5% of the false hypotheses will also produce a significant result. Thus, about 4.5 of the 90 false hypotheses will produce a significant result. As a result, there will be roughly twice as many false positives (4.5 per 100 tests) as true positives (2 per 100 tests).
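The arithmetic of this example can be written out explicitly; the sketch below just restates the numbers above in code form, assuming 100 hypothesis tests:

```python
true_rate, power, alpha, n_tests = 0.10, 0.20, 0.05, 100

true_hypotheses = true_rate * n_tests             # 10
false_hypotheses = (1 - true_rate) * n_tests      # 90
true_positives = power * true_hypotheses          # 2.0
false_positives = alpha * false_hypotheses        # 4.5

false_discovery_rate = false_positives / (true_positives + false_positives)
print(round(false_discovery_rate, 2))  # 0.69: most significant results are false positives
```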

These relatively simple calculations were well known by 2005 (Soric, 1989). Why, then, did Ioannidis’s article have such a big impact? The answer is that Ioannidis convinced many people that his hypothetical examples are realistic and describe most areas of science.

2020 has shown that Ioannidis’s claim does not apply to all areas of science. With amazing speed, bio-tech companies were able to make not just one but several highly effective vaccines. Clearly some sciences are making real progress. On the other hand, other areas of science suggest that Ioannidis’s claims were accurate. For example, the whole literature on single-gene variations as predictors of human behavior has produced mostly false claims. Social psychology has a replication crisis in which only 25% of published results could be replicated (OSC, 2015).

Aside from this sporadic and anecdotal evidence, it remains unclear how many false results are published in science as a whole. The reason is that it is impossible to quantify the number of false positive results in science. Fortunately, it is not necessary to know the actual rate of false positives to test Ioannidis’s prediction that most published results are false positives. All we need to know is the discovery rate of a field (Soric, 1989). The discovery rate makes it possible to quantify the maximum percentage of false positive discoveries. If the maximum false discovery rate is well below 50%, we can reject Ioannidis’s hypothesis that most published results are false.

The empirical problem is that the observed discovery rate in a field may be inflated by publication bias. It is therefore necessary to estimate the amount of publication bias and, if it is present, to correct the discovery rate accordingly.

In 2005, Ioannidis and Trikalinos (2005) developed their own test for publication bias, but this test had a number of shortcomings. First, it could be biased in heterogeneous literatures. Second, it required effect sizes to compute power. Third, it only provided information about the presence of publication bias and did not quantify it. Fourth, it did not provide bias-corrected estimates of the true discovery rate.

When the replication crisis became apparent in psychology, I started to develop new bias tests that address these limitations (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020; Schimmack, 2012). The newest tool, called z-curve.2.0 (and yes, there is an app for that), overcomes all of the limitations of Ioannidis’s approach. Most importantly, it makes it possible to compute a bias-corrected discovery rate, called the expected discovery rate. The expected discovery rate can be used to examine and quantify publication bias by comparing it to the observed discovery rate. Moreover, the expected discovery rate can be used to compute the maximum false discovery rate.

The Data

The data were compiled by Simon Schwab from the Cochrane database (https://www.cochrane.org/) that covers results from thousands of clinical trials. The data are publicly available (https://osf.io/xjv9g/) under a CC-By Attribution 4.0 International license (“Re-estimating 400,000 treatment effects from intervention studies in the Cochrane Database of Systematic Reviews”; (see also van Zwet, Schwab, & Senn, 2020).

Studies often report results for several outcomes. I selected only results for the primary outcome. It is often suggested that researchers switch outcomes to produce significant results. Thus, primary outcomes are the most likely to show evidence of publication bias, while secondary outcomes might even be biased to show more negative results for the same reason. The choice of primary outcomes also ensures that the test statistics are statistically independent because they are based on independent samples.

Results

I first fitted the default model to the data. The default model assumes that publication bias is present and only uses statistically significant results to fit the model. Z-curve.2.0 uses a finite mixture model to approximate the observed distribution of z-scores with a limited number of non-centrality parameters. After finding optimal weights for the components, power can be computed as the weighted average of the implied power of the components (Bartos & Schimmack, 2020). Bootstrapping is used to compute 95% confidence intervals that have been shown to have good coverage in simulation studies (Bartos & Schimmack, 2020).
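The last step of that computation is easy to illustrate. The sketch below uses made-up component means and weights purely for illustration (z-curve estimates the weights from the observed z-scores) and shows how average power falls out as a weighted average:

```python
from scipy import stats

# Hypothetical non-centrality parameters (component means) and weights;
# in z-curve the weights are estimated from the observed z-scores.
means = [0, 1, 2, 3, 4, 5]
weights = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]

# Power of a two-sided test at alpha = .05 for each component
alpha_z = stats.norm.isf(0.025)  # 1.96
power = [stats.norm.sf(alpha_z - m) + stats.norm.cdf(-alpha_z - m) for m in means]

mean_power = sum(w * p for w, p in zip(weights, power))
print(round(mean_power, 2))  # weighted average power of the mixture
```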

The main finding with the default model is that the model (grey curve) fits the observed distribution of z-scores very well in the range of significant results. However, z-curve has problems extrapolating from significant results to the distribution of non-significant results. In this case, the model (grey curve) underestimates the number of non-significant results. Thus, there is no evidence of publication bias. This is seen in a comparison of the observed and expected discovery rates. The observed discovery rate of 26% is lower than the expected discovery rate of 38%.

When there is no evidence of publication bias, there is no reason to fit the model only to the significant results. Rather, the model can be fitted to the full distribution of all test statistics. The results are shown in Figure 2.

The key finding for this blog post is that the estimated discovery rate of 27% closely matches the observed discovery rate of 26%. Thus, there is no evidence of publication bias. In this case, simply counting the percentage of significant results provides a valid estimate of the discovery rate in clinical trials. Roughly one-quarter of trials end up with a positive result. The new question is how many of these results might be false positives.

To maximize the rate of false positives, we have to assume that true positives were obtained with maximum power (Soric, 1989). In this scenario, we could get as many as 14% (4 over 27) false positive results.

Even if we use the upper limit of the 95% confidence interval, we only get 19% false positives. Moreover, it is clear that Soric’s (1989) scenario overestimates the false discovery rate because it is unlikely that all tests of true hypotheses have 100% power.

In short, an empirical test of Ioannidis’s hypothesis that most published results in science are false shows that this claim is at best a wild overgeneralization. It is not true for clinical trials in medicine. In fact, the real problem is that many clinical trials may be underpowered to detect clinically relevant effects. This can be seen in the estimated replication rate of 61%, which is the mean power of studies with significant results. This estimate of power includes false positives with 5% power. If we assume that 14% of the significant results are false positives, the conditional power based on a true discovery is estimated to be 70% (.14 * .05 + .86 * .70 ≈ .61).

With information about power, we can modify Soric’s worst-case scenario and change power from 100% to 70%. This has only a small influence on the false discovery rate, which decreases to 11% (3 over 27). However, the rate of false negatives increases from 0 to 14% (10 over 73). This also means that there are now three times as many false negatives as false positives (10 vs. 3).

Even this scenario overestimates the power of studies that produced false negative results, because the power of studies with significant results is higher than the power of studies with non-significant results when power is heterogeneous (Brunner & Schimmack, 2020). In the worst-case scenario, the null-hypothesis may rarely be true and the power of studies with non-significant results could be as low as 14.5%. To explain: if we redo all of the studies, we expect that 61% of the significant studies will produce a significant result again, yielding 16.5% significant results. We also expect that the discovery rate will be 27% again. Thus, the remaining 73% of studies have to make up the difference between 27% and 16.5%, which is 10.5%. For 73 studies to produce 10.5 significant results, the studies have to have 14.5% power: 27 = 27 * .61 + 73 * .145.
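The decomposition in the last step can be verified with a few lines of arithmetic, using the estimates reported above:

```python
discovery_rate = 0.27   # proportion of significant results (EDR)
err = 0.61              # mean power of the significant results (ERR)

# If every study were redone, significant studies replicate at rate err;
# the remaining, non-significant studies must supply the rest of the
# discoveries to keep the overall discovery rate at 27%.
from_significant = discovery_rate * err                  # ~0.165
remaining = discovery_rate - from_significant            # ~0.105
power_nonsignificant = remaining / (1 - discovery_rate)  # ~0.144
print(round(power_nonsignificant, 3))  # roughly the 14.5% derived above
```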

In short, while Ioannidis predicted that most published results are false positives, it is much more likely that most published results are false negatives. This problem is of course not new. To make conclusions about effectiveness of treatments, medical researchers usually do not rely on a single clinical trial. Rather results of several studies are combined in a meta-analysis. As long as there is no publication bias, meta-analyses of original studies can boost power and reduce the risk of false negative results. It is therefore encouraging that the present results suggest that there is relatively little publication bias in these studies. Additional analyses for subgroups of studies can be conducted, but are beyond the main point of this blog post.

Conclusion

Ioannidis wrote an influential article that used hypothetical scenarios to make the prediction that most published results are false positives. Although this article is often cited as if it contained evidence to support this claim, the article contained no empirical evidence. Surprisingly, there also have been few attempts to test Ioannidis’s claim empirically. Probably the main reason is that nobody knew how to test it. Here I showed a way to test Ioannidis’s claim and I presented clear empirical evidence that contradicts this claim in Ioannidis’s own field of science, namely medicine.

The main feature that distinguishes science and fiction is not that science is always right. Rather, science is superior because proper use of the scientific method allows for science to correct itself, when better data become available. In 2005, Ioannidis had no data and no statistical method to prove his claim. Fifteen years later, we have good data and a scientific method to test his claim. It is time for science to correct itself and to stop making unfounded claims that science is more often wrong than right.

The danger of not trusting science has been on display this year, when millions of Americans ignored good scientific evidence, leading to many unnecessary deaths. So far, 330,000 US Americans are estimated to have died of Covid-19. In a comparable country, Canada, 14,000 people have died so far. To adjust for population, we can compare the number of deaths per million, which is roughly 1,000 in the USA and 400 in Canada. The unscientific approach to the pandemic in the US may explain some of this discrepancy. Along with the development of vaccines, it is clear that science is not always wrong and can save lives. Ioannidis (2005) made the unfounded claim that success stories are the exception rather than the norm. At least in medicine, intervention studies show real successes more often than false ones.

The Covid-19 pandemic also provides another example where Ioannidis used off-the-cuff calculations to make big claims without any evidence. In a popular article titled “A fiasco in the making” he speculated that the Covid-19 virus might be less deadly than the flu and suggested that policies to curb the spread of the virus were irrational.

As the evidence accumulated, it became clear that the Covid-19 virus is claiming many more lives than the flu, despite policies that Ioannidis considered to be irrational. Scientific estimates suggest that Covid-19 is 5 to 10 times more deadly than the flu (BNN), not less deadly as Ioannidis implied. Once more, Ioannidis’s quick, unempirical claims were contradicted by hard evidence. It is not clear how many of his other 1,000-plus articles are equally questionable.

To conclude, Ioannidis should be the last one to be surprised that several of his claims are wrong. Why should he be better than other scientists? The question is only how he deals with this information. For science, however, it is not important whether individual scientists correct themselves; science corrects itself by replacing old, false information with better information. The remaining question is what science does with false and misleading information that is highly cited.

If YouTube can remove a video with Ioannidis’s false claims about Covid-19 (WP), maybe PLOS Medicine can retract an article with the false claim that “most published results in science are false”.


The attention-grabbing title is simply misleading because nothing in the article supports the claim. Moreover, actual empirical data contradict the claim at least in some domains. Most claims in science are not false and in a world with growing science skepticism spreading false claims about science may be just as deadly as spreading false claims about Covid-19.

If we learned anything from 2020, it is that science and democracy are not perfect, but a lot better than superstition and demagogy.

I wish you all a happier 2021.

Which Social Psychologists Can you Trust in 2020?

Social psychologists, among others, have misused the scientific method. Rather than using it to separate false from true hypotheses, they used statistical tests to find and report statistically significant results. The main problem with the search for significance is that significant results are not automatically true discoveries. The probability that a selected significant result is a true discovery also depends on the power of statistical tests to detect a true effect. However, social psychologists have ignored power and often selected significant results from studies with low power. In this case, significance is often due more to chance than to a real effect, and the results are difficult to replicate. A shocking finding revealed that less than 25% of results in social psychology could be replicated (OSC, 2015). This finding has been widely cited outside of social psychology, but social psychologists have preferred to ignore the implication that most of their published results may be false (Schimmack, 2020).

Some social psychologists have responded to this replication crisis by increasing power and reporting non-significant results as evidence that effects are small and negligible (e.g., Lai et al., 2014, 2016). However, others continue to use the same old practices. This creates a problem. While the average credibility of social psychology has increased, readers do not know whether they are reading an article that used the scientific method properly or improperly.

One solution to this problem is to examine the strength of the reported statistical results. Strong statistical results are more credible than weak statistical results. Thus, the average strength of the statistical results provides useful information about the credibility of individual articles. I demonstrate this approach with two articles from 2020 in the Attitudes and Social Cognition section of the Journal of Personality and Social Psychology (JPSP-ASC).

Before I examine individual articles, I am presenting results for the entire journal based on automatic extraction of test-statistics for the years 2010 (pre-crisis) and 2020 (post-crisis).

Figure 1 shows the results for 2010. All test-statistics are first converted into p-values and then transformed into absolute z-scores. The higher the z-score, the stronger is the evidence against the null-hypothesis. The figure shows the mode of the distribution of z-scores at a value of 2, which coincides with the criterion for statistical significance (p = .05, two-tailed, z = 1.96). Fitting a model to the distribution of the significant z-scores, we would expect an even higher mode in the region of non-significant results. However, the actual distribution shows a sharp drop in reported z-scores. This pattern shows the influence of selection for significance.

The amount of publication bias is quantified by a comparison of the observed discovery rate (i.e., the percentage of reported tests with significant results) and the expected discovery rate (i.e., the area of the grey curve for z-scores greater than 1.96). The ODR of 73% is much higher than the EDR of 15%. The fact that the confidence intervals for these two estimates do not overlap shows clear evidence of selection for significance in JPSP-ASC in 2010.

An EDR of 15% also implies that most statistical tests are extremely underpowered. Thus, even if there is an effect, it is unlikely to be significant. More relevant is the replication rate, which is the average power of results that were significant. As power determines the outcome of exact replication studies, the replication rate of 60% implies that 60% of published results are expected to be replicable in exact replication studies. However, observed effect sizes are expected to shrink and it is unclear whether the actual effect sizes are practically meaningful or would exceed the typical level of a small effect size (i.e., 0.2 standard deviations or 1% explained variance).

In short, Figure 1 visualizes incorrect use of the scientific method that capitalizes more on chance than on actual effects.

The good news is that research practices in social psychology have changed, as seen in Figure 2.

First, reporting of results is much less deceptive. The observed discovery rate of 73% is close to the estimated discovery rate of 72%. However, visual inspection of the two curves shows a small dip for results that are marginally significant (z = 1.5 to 2) and a slight excess for just significant results (z = 2 to 2.2). Thus, some selection may still happen in some articles.

Another sign of improvement is that the EDR of 72% in 2020 is much higher than the EDR of 15% in 2010. This shows that social psychologists have dramatically improved the power of their studies. This is largely due to the move from small undergraduate samples to larger online samples.

The replication rate of 85% implies that most published results in 2020 are replicable. Even if exact replications are difficult, the EDR of 72% still suggests rather high replicability (see Bartos & Schimmack, 2020, for a discussion of EDR vs. ERR to predict actual replication results).

Despite this positive trend, it is possible that individual articles are less credible than the average results suggest. This is illustrated with the article by Leander et al. (2020).

This article was not picked at random. There are several cues that suggested the results of this article may be less credible than other results. First, Wolfgang Stroebe has been an outspoken defender of the old unscientific practices in social psychology (Stroebe & Strack, 2014). Thus, it was interesting to see whether somebody who so clearly defended bad practices would have changed. This is of course a possibility because it is not clear how much influence Stroebe had on the actual studies. Another reason to be skeptical about this article is that it used priming as an experimental manipulation, although priming has been identified as a literature with many replication failures. The authors cite old priming studies as if there is no problem with these manipulations. Thus, it was interesting to see how credible these new priming results would be. Finally, the article reported many studies and it was interesting to see how the authors addressed the problem that the risk of a non-significant result increases with each additional study (Schimmack, 2012).

I first used the automatically extracted test-statistics for this article. The program found 51 test-statistics. The results are different from the z-curve for all articles in 2020.

Visual inspection shows a peak of p-values that are just significant. The comparison of the ODR of 65% and the EDR of 14% suggests selection for significance. However, even if we just focus on the significant results, the replication rate is low with just 38%, compared to the 85% average for 2020.

I also entered all test-statistics by hand. There were more test-statistics because I was able to use exact p-values and confidence intervals, which are not used by the automated procedure.

The results are very similar showing that automatically extracted values are useful if an article reports results mostly in terms of F and t-values in the text.

The low power of significant results creates a problem for focal hypothesis tests in a series of studies. This article included 7 studies (1a, 1b, 1c, 2, 3, 4, 5) and reported significant results for all of them, ps = 0.004, 0.007, 0.014, 0.020, 0.041, 0.033, and 0.002. This 100% success rate is higher than the average observed power of these studies, 70%. Average power overestimates real power when results are selected for significance. A simple correction is to subtract the inflation rate (100% - 70% = 30%) from the mean observed power. This index is called the Replication Index (R-Index), and an R-Index of 40% shows that the studies were underpowered and that a replication study with the same sample size is more likely to produce a non-significant result than a significant one.
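For transparency, here is a minimal sketch of the R-Index computation for these seven focal tests, approximating observed power from each two-tailed p-value in the usual way:

```python
from scipy import stats

p_values = [0.004, 0.007, 0.014, 0.020, 0.041, 0.033, 0.002]

# Observed power: probability of obtaining p < .05 again, given the
# z-score implied by each reported two-tailed p-value.
z_scores = [stats.norm.isf(p / 2) for p in p_values]
observed_power = [stats.norm.sf(1.96 - z) for z in z_scores]

mean_power = sum(observed_power) / len(observed_power)   # ~0.70
success_rate = 1.0                                       # 7 of 7 tests significant
inflation = success_rate - mean_power                    # ~0.30
r_index = mean_power - inflation                         # ~0.40
print(round(mean_power, 2), round(r_index, 2))
```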

A z-curve analysis produces a similar estimate, but it also shows that these estimates are very unstable and that replicability could be as low as 5%, which would mean there is no effect. Thus, after taking selection for significance into account, the 7 significant p-values in Leander et al.’s (2020) article provide as much evidence for their claims as Bem’s (2011) 9 significant p-values did for the claim that priming effects can work when the prime FOLLOWS the behavior.

Judd and Gawronski (2011) argued that they had to accept Bem’s crazy article because (a) it passed critical peer-review and (b) they had to trust the author that results were not selected for significance. Nothing has changed in JPSP-ASC. The only criteria for acceptance are peer-review and trust. Bias tests that evaluate whether results are actually credible are not used by peer-reviewers or editors. Thus, readers have to carry out these tests themselves to protect themselves from fake science like Leander et al.’s (2020) priming studies. Readers can still not trust social psychology journals to reject junk science like Bem’s (2011) article.

The second example shows how these tools can also provide evidence that published results are credible, using an article by Zlatev et al. (2020).

The automated method retrieved only 12 test statistics. This is a good sign because hypothesis tests are used sparingly to test only important effects, but it makes it more difficult to get precise estimates for a single article. Thus, article-based information should only be used as a heuristic, especially if no other information is available. Nevertheless, the limited information suggests that the results are credible. The observed discovery rate is even slightly below the estimated discovery rate, and both the EDR and ERR are very high, 99%. Five of the 12 test statistics exceed a z-value of 6 (6 sigma), which is even stricter than the 5-sigma rule used in particle physics.

The hand-coding retrieved 22 test statistics. The main reason for the difference is that the automated method does not include chi-square tests to avoid including results from structural equation modeling. However, the results are similar. The ODR of 86% is only slightly higher than the EDR of 74% and the replication rate is estimated to be 95%.

There were six focal tests with four p-values below .001. The other two p-values were .001 and .002. The mean observed power was 96%, which means that a success rate of 100% was justified and that there is very little inflation in the success rate, resulting in an R-Index of 93%.

Conclusion

Psychology, especially social psychology, has a history of publishing significant results that are selected from a larger set of tests with low statistical power. This renders published results difficult to replicate. Despite a reform movement, published articles still rely on three criteria to be published: (a) p-values below .05 for focal tests, (b) peer-review, and (c) trust that researchers did not use questionable practices to inflate effect sizes and type-I error risks. These criteria do not help to distinguish credible and incredible articles.

This blog post shows how post-hoc power analysis can be used to distinguish questionable evidence from credible evidence. Although post-hoc power analysis has been criticized when it is applied to a single test statistic, meta-analyses of observed power can show whether researchers actually had good power or not. It can also be used to provide information about the presence and amount of selection for significance. This can be helpful for readers to focus on articles that published credible and replicable results.

The reason why psychology has been slow in improving is that readers have treated all significant results as equal. This encouraged researchers to p-hack their results just enough to get significance. If readers become more discerning in reading method sections and no longer treat all p-values below .05 as equal, articles with more credible evidence will gain more attention and citations. For example, this R-Index analysis suggests that readers can ignore Leander et al.’s article and focus on the credible evidence in Zlatev et al.’s article. Of course, solid empirical results are only a first step in assessing an article. Other questions about ecological validity remain, but there is no point in paying attention to p-hacked results, even if they are published in the most prestigious journal.

P.S. I ran a z-curve analysis on all articles with 10 or more z-scores between 2 and 6 published from 2000 to 2010. The excel file contains the DOI, the observed discovery rate, the expected discovery rate, and the expected replication rate. It can be fun to plug a DOI into a search engine and see what article pops up. I know nobody is going to believe me, but I did not know in advance which article would have the lowest EDR (5%) and ERR (9%); the result is not surprising. I call it predictive validity of the R-Index.

https://pubmed.ncbi.nlm.nih.gov/16834479/

Social Psychologists’ (Non) Response to Bem (2011)

A science that cannot face its history has no future. (Anonymous).

ABSTRACT

Bem (2011) presented incredible results that seemed to provide strong empirical evidence (p < .05^9). The article was published because it passed “a rigorous review process, involving a large set of extremely thorough reviews by distinguished experts in social cognition” and “editors can only take the author at his word that his data are in fact genuine.” Here I show that social psychologists have avoided discussing the broader implications of the method crisis in social psychology. The same standards that were used for Bem’s article are still used for most articles published today: a few significant p-values, peer-review, and trust that researchers are honest are supposed to ensure that results are robust and replicable. However, the replication crisis has shown that this is not the case. Consumers of social psychology need to be aware that even 10 years after Bem’s infamous article, evidence for social psychological theories is no more credible than Bem’s evidence for extra-sensory perception.

Introduction

Daryl Bem was a highly respected social psychologist, until he published his “Feeling the Future” article in 2011.

The article became a poster child of everything that is wrong with research methods in social psychology and has been cited over 300 times.

The article was also accompanied by an editorial that justified the publication of an article that seemed to provide evidence for the incredible claim that humans, or at least extraverts, can feel events that haven’t happened yet.

The editorial suggests that Bem’s article will “stimulate further thoughts about appropriate methods in research on social cognition and attitudes” (p. 406). Ten years later, we can see whether publishing Bem’s article had the intended effect.

The high citation count shows that the article did indeed generate lots of discussion about research practices in social cognition research and social psychology more broadly. However, an inspection of these citations shows that most of this discussion occurred outside of social psychology, by meta-psychologists who reflected on research practices by social psychologists. In stark contrast, critical self-reflection by social psychologists is insignificant.

Here I provide a bibliography of these citations. An examination of these citations shows that social psychologists have carefully avoided asking themselves the most important question that follows from Bem’s (2011) article.

If we cannot trust Bem’s article that reported nine statistically significant results, which article in social psychology can we trust?

Etienne P. LeBel & Kurt R. Peters (2011) Review of General Psychology

This article clearly spells out the problems of QRPs and uses Bem’s article to raise questions about other research findings. The first author was trained as a graduate student by Gawronski, but is no longer in social psychology.

Charles M. Judd, Jacob Westfall, & David A. Kenny (2012) JPSP

This article implies that the problem was inappropriate treatment of variation across stimuli. It does not mention the use of QRPs in social psychology, nor does it mention evidence that Bem (2011) used QRPs (Francis, 2012; Schimmack, 2012).

Sander L. Koole and Daniël Lakens (2012) PoPS

Jeff Galak, Robyn A. LeBoeuf, & Leif D. Nelson (2012). JPSP

suggest that QRPs were used, but do not cite John et al. (2012); also do not cite Schimmack (2012) or Francis (2012) as evidence that QRPs were used.

Jens B. Asendorpf, Mark Conner, Filip De Fruyt, Jan de Houwer, Jaap J. A. Denissen, Klaus Fiedler, Susann Fiedler, David C. Funder, Reinhold Kliegel, Brian A. Nosek, Marco Perugini, Brent W. Roberts, Manfred Schmitt, Marcel A. G. Van Aken, Hannelore Weber, & Jelte M. Wicherts (2013). EJPers

cite John et al. (2012) and Schimmack (2012) but do not cite Francis or Schimmack as evidence that Bem used QRPs.

Andrew J. Vonasch and Roy F. Baumeister (2013) British JSP

Does not mention John et al. (2012) and does not cite Francis or Schimmack as evidence that Bem used QRPs, even though the second author was a reviewer of Schimmack (2012).

Joachim I. Krueger, David Freestone, Mika L.M. MacInnis (2013) New Ideas

does not mention QRPs (John et al., 2012) or that Bem used QRPs (Francis, 2012; Schimmack, 2012)

David C. Funder, John M. Levine, Diane M. Mackie, Carolyn C. Morf, Carol Sansone, Simine Vazire, and Stephen G. West (2014), PSPR

mention John et al. (2012) and imply use of QRPs, cite Schimmack (2012), but do not cite Schimmack or Francis as evidence that Bem used QRPs.

Jonathan W. Schooler (2015) PoPS

Does not mention that Francis or Schimmack showed Bem (2011) used QRPs, although he was a reviewer of the Schimmack (2012) article.

Eli J. Finkel, Paul W. Eastwick, and Harry T. Reis

Cite John et al., Schimmack, and Francis, but not to point out that Bem used QRPs to get his results.

Brian D. Earp & David Trafimow (2015) Frontiers in Psychology

cite John et al. (2012), but do not cite Francis or Schimmack as evidence that Bem used QRPs.

Klaus Fiedler (2016) SOCIAL PSYCHOLOGY OF MORALITY

cites John et al. (2012) only to criticize it; does not cite Francis or Schimmack as evidence that Bem used QRPs.

Joachim Hüffmeier, Jens Mazei, & Thomas Schultze (2016) JESP

do cite John et al. (2012) for QRPs, do not cite Francis; cite Schimmack (2012), but not for evidence that Bem (2011) used QRPs.

Mark Schaller (2016) JESP

does not cite John et al. (2012) for QRPs, does not cite Francis or Schimmack (2012) for presence of QRPs in Bem’s (2011) article.

Matt Motyl, Alexander P. Demos, Timothy S. Carsel, Brittany E. Hanson, Zachary J. Melton, Allison B. Mueller, J. P. Prims, Jiaqing Sun, Anthony N. Washburn, Kendal M. Wong, Caitlyn Yantis, and Linda J. Skitka

mention QRPs (John et al., 2012) and describe Bem’s results as questionable, but do not mention that Bem used QRPs.

Mark Rubin (2017) Review of General Psychology

mentions QRPs (John et al., 2012) and labels Bem’s research as questionable, but does not cite Francis or Schimmack (2012) as evidence that Bem used QRPs.

Roger Giner-Sorolla (2019) European Review of Social Psychology

does not cite John et al., but does cite Schimmack (2012) as evidence that Bem “almost certainly” used QRPs.

Lee Jussim, Jon A. Krosnick, Sean T. Stevens & Stephanie M. Anglin (2019) Psy Belgica

cite John et al. (2012), but do not cite Francis or Schimmack (2012) as evidence that Bem used QRPs. However, they cite Schimmack (2017) as evidence that social psychology has “dramatically improved.”

Jonathan Baron & John T. Jost (2019) PoPS

do not cite John et al. (2012) and do not cite Francis or Schimmack (2012) as evidence that Bem used QRPs.

Gandalf Nicolas, Xuechunzi Bai, & Susan T. Fiske (2019). PoPS

do not cite John et al. (2012) and do not cite Francis or Schimmack (2012) for evidence that Bem used QRPs.

Zachary G. Baker, Ersie-Anastasia Gentzis, Emily M. Watlington, Sabrina Castejon, Whitney E. Petit, Maggie Britton, Sana Haddad, Angelo M. DiBello, Lindsey M. Rodriguez, Jaye L. Derrick, C. Raymond Knee (2020). Personal Relationships

do not cite John et al. (2012) and do not cite Francis or Schimmack (2012) as evidence that Bem used QRPs.

Conclusion

Bem (2011) presented incredible results that seemed to provide strong empirical evidence (p < .05^9). The article was published because it passed “a rigorous review process, involving a large set of extremely thorough reviews by distinguished experts in social cognition” and “we can only take the author at his word that his data are in fact genuine.” The same is true for all other articles published in social psychology. A few significant p-values, peer-review, and trust are supposed to ensure that results are robust and replicable. However, the replication crisis has shown that this is not the case. As research practices have not dramatically changed, consumers of social psychology need to be warned that even 10 years after Bem’s article, published results in social psychology are no more trustworthy than Bem’s claims of extra-sensory perception.

A Quick Fix for the Replication Crisis in Psychology

Ten years after Bem’s (2011) demonstration of extrasensory perception with the standard statistical practices in psychology it is clear that these practices were unable to distinguish true findings from false findings. In the following decade, replication studies revealed that many textbook findings, especially in social psychology, were false findings, including extrasensory perception (Świątkowski & Benoît, 2017; Schimmack, 2020).

Although a few changes have been made, especially in social psychology, research practices in psychology are mostly unchanged one decade after the method crisis in psychology became apparent. Most articles continue to diligently report results that are statistically significant, p < .05, and do not report when critical hypotheses were not confirmed. This selective publishing of significant results has characterized psychology as an abnormal science for decades (Sterling, 1959).

Some remedies are unlikely to change this. Preregistration is only useful, if good studies are preregistered. Nobody would benefit from publishing all badly designed preregistered studies with null-results. Open sharing of data is also not useful if the data are meaningless. Finally, sharing of materials that helps with replication is not useful if the original studies were meaningless. What psychology needs is a revolution of research practices that leads to the publication of studies that address meaningful questions.

The fundamental problem in psychology is the asymmetry of statistical tests that focus on the nil-hypothesis that the effect size is zero (Cohen, 1994). The problem with this hypothesis is that it is impossible to demonstrate an effect size of zero. The only way would be to study the entire population. However, often the population is not clearly defined and it is unlikely that the effect size is exactly zero in the population. This asymmetry has led to the belief that non-significant results, p > .05, are inconclusive. There is always the possibility that a non-zero effect exists, so one is not allowed to draw conclusions that effects do not exist (even time-reversed pre-cognition always remains a possibility).

This problem was recognized in the 1990s, but APA came up with an even worse solution to fix this problem. Instead of just reporting statistical significance, researchers were also asked to report effect sizes. In theory, reporting of effect sizes would help researchers to evaluate whether an effect size is large enough to be meaningful or not. For example, if a researcher reported a result with p < .05, but an extremely small effect size of d = .03, it might be considered so small, that it is practically irrelevant.

So why did reporting effect sizes not improve the quality and credibility of psychological science? The reason is that studies still had to pass the significance filter, and to do so effect size estimates in a sample have to exceed a threshold value. The perverse incentive was that studies with small samples and large sampling error required larger effect size estimates to reach significance than qualitatively better studies with large samples that provide more precise estimates of effect sizes. Thus, researchers who invested few resources in small studies were able to brag about their large effect sizes. Sloppy language disguised the fact that these large effects were merely estimates of the actual population effect sizes.

Many researchers relied on Cohen’s guidelines for a priori power analysis to label their effect size estimates as small, moderate, or large. The problem with this is that selection for significance inflates effect size estimates, and the inflation is inversely related to sample size. The smaller the sample, the bigger the inflation, and the larger the effect size that is reported.
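
To see how this works, here is a small simulation sketch in R; the true effect size (d = .3) and the sample sizes are arbitrary values chosen for illustration, not estimates from any study discussed here:

# Simulation sketch: selection for significance inflates effect size estimates,
# and the inflation shrinks as sample size increases.
set.seed(123)
true_d <- 0.3
sim_inflation <- function(n_per_group, n_sims = 10000) {
  d_est <- replicate(n_sims, {
    g1 <- rnorm(n_per_group, mean = true_d)  # "treatment" group
    g2 <- rnorm(n_per_group, mean = 0)       # "control" group
    (mean(g1) - mean(g2)) / sqrt((var(g1) + var(g2)) / 2)
  })
  se  <- sqrt(2 / n_per_group)               # approximate sampling error of d
  sig <- abs(d_est / se) > 1.96              # keep only "significant" studies
  c(mean_all = mean(d_est), mean_significant = mean(d_est[sig]))
}
round(sapply(c(20, 50, 200), sim_inflation), 2)
# With n = 20 per group the significant estimates average far above .3;
# with n = 200 per group they are close to the true value.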

This inflation only becomes apparent when replication studies with larger samples are available. For example, Joy-Gaba and Nosek (2010) obtained a standardized effect size estimate of d = .08 with N = 3,000 participants, whereas the original study with N = 48 participants had reported an effect size estimate of d = .82.

The title of the article “The Surprisingly Limited Malleability of Implicit Racial Evaluations” draws attention to the comparison of the two effect size estimates, as does the discussion section.

“Further, while DG reported a large effect of exposure on implicit racial (and age) preferences (d = .82), the effect sizes in our studies were considerably smaller. None exceeded d = .20, and a weighted average by sample size suggests an average effect size of d = .08…” (p. 143).

The problem is the sloppy usage of the term effect size for effect size estimates. Dasgupta and Greenwald did not report a large effect because their small sample had so much sampling error that it was impossible to provide any precise information about the population effect size. This becomes evident when we compare the results in terms of confidence intervals (frequentist or objective Bayesian, it doesn’t matter).

The sampling error for a study with N = 33 participants is 2/sqrt(33) = .35. To create a 95% confidence interval, we multiply the sampling error by qt(.975,31) = 2. So, the 95% CI around the effect size estimate of .80 ranges from .80 – .70 = .10 to .80 + .70 = 1.50. In short, small samples produce extremely noisy estimates of effect sizes. It is a mistake to interpret the point estimates of these studies as reasonable estimates of the population effect size.
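
The same numbers can be reproduced in a couple of lines of R:

# 95% CI for an effect size estimate of d = .80 with N = 33,
# using the 2/sqrt(N) approximation of the sampling error from the text.
N  <- 33
d  <- 0.80
se <- 2 / sqrt(N)               # ~ .35
crit <- qt(.975, df = N - 2)    # ~ 2
round(c(lower = d - crit * se, upper = d + crit * se), 2)
# roughly .10 to 1.50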

Moreover, when results are selected for significance, these noisy estimates are truncated at high values. To get a significant result in their extremely small sample, Dasgupta and Greenwald required a minimum effect size estimate of d = .70. In this case, the effect size estimate is two times the sampling error, which produces a p-value of .05.

This example is not an isolated incident. It is symptomatic of the reporting of results in psychology. Only recently has a new initiative asked for the reporting of confidence intervals, but it is not clear whether psychologists fully grasp the importance of this information. Maybe point estimates should not be reported unless confidence intervals are reasonably narrow.

In any case, the reporting of effect sizes did not change research practices and reporting of confidence intervals will also fail to do so because they do not change the asymmetry of nil-hypothesis testing. This is illustrated with another example.

Using a large online sample (N = 93,239), a study of implicit anti-fat attitudes (Ravary, Baldwin, & Bartz, 2019) produced an effect size estimate of d = .02 (d = .0177 in the article). This effect size is reported with a 95% confidence interval from .004 to .03.
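
As a rough back-of-the-envelope check (using the same 2/sqrt(N) approximation as above, which ignores the study’s actual time-series design), an interval of about this width is what such a large sample implies:

# Approximate 95% CI for d = .0177 with roughly 93,000 participants.
N  <- 93239
d  <- 0.0177
se <- 2 / sqrt(N)               # ~ .0066
round(c(lower = d - 1.96 * se, upper = d + 1.96 * se), 3)
# roughly .005 to .031 -- even tiny effects become "significant" in huge samples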

Using the standard logic of nil-hypothesis testing, this finding is used to reject the nil-hypothesis and to support the conclusion (in the abstract) that “as predicted, fat-shaming led to a spike in women’s (N=93,239) implicit anti-fat attitudes, with events of greater notoriety producing greater spikes” (p. 1580).

We can now ask ourselves a counterfactual question: What finding would have led the authors to conclude that there was no effect? What if a sample of 1.5 million participants had shown an effect size of d = .002 with a 95% CI from .001 to .003? Would that have been sufficiently small to conclude that nothing notable happened and move on? Or would it still have been interpreted as evidence against the nil-hypothesis?

The main lesson from this Gedankenexperiment is that psychologists lack a procedure to weed out effects that are so small that chasing them would be a waste of time and resources.

I am by no means the first one to make this observation. In fact, some psychologists like Rouder and Wagenmakers have criticized nil-hypothesis testing for the very same reason and proposed a solution to the problem. Their solution is to test two competing hypotheses and allow the empirical data to favor either one. One of these hypotheses specifies an actual effect size, but because we do not know what the effect size is, this hypothesis is specified as a distribution of effect sizes. The other hypothesis is the nil-hypothesis that there is absolutely no effect. The consistency of the data with these two hypotheses is evaluated in terms of Bayes-Factors.
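
A minimal sketch of this approach with the BayesFactor R package; the simulated data and the default Cauchy prior (rscale = "medium") are my illustrative assumptions, not the specifics of any study discussed here:

# Default Bayes factor t-test: H1 (effect sizes distributed around zero with a
# Cauchy prior) against H0 (effect size exactly zero). With no true effect in
# the simulated data, the Bayes factor will typically favor H0.
library(BayesFactor)
set.seed(42)
g1 <- rnorm(200)   # two groups with no true difference
g2 <- rnorm(200)
ttestBF(x = g1, y = g2, rscale = "medium")
# A Bayes factor below 1 counts as evidence for the point null, which is the
# kind of conclusion ordinary significance testing does not allow.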

The advantage of this method is that it is possible for researchers to decide in favor of the absence of an effect. The disadvantage is that the absence of a relevant effect is still specified as absolutely no effect. This makes it possible to end up with the wrong inference that there is absolutely no effect when a small but practically significant population effect exists.

A better way to solve the problem is to specify two hypotheses that are distinguished by the minimum relevant population effect size. Lakens, Scheel, and Isager (2018) give a detailed tutorial on equivalence testing. I diverge from their approach in two ways. First, they leave it to researchers’ expertise to define the smallest effect size of interest (SESOI) or minimum effect size (MES). This is a problem because psychologists have shown a cunning ability to game any methodological innovation to avoid changing their research practices. For example, nothing would change if the MES were set to d = .001. Rejecting d = .001 is not very different from rejecting d = .000, and it would require about 10 million participants to establish the absence of an effect that small.

In fact, when psychologists obtain small effect sizes, they are quick to argue that these effects still have huge practical implications (Greenwald, Banaji, & Nosek, 2015). The danger is that these arguments are made in the discussion section, but that the results section reports effect sizes that are inflated by publication bias, d = .5, 95%CI = .1 to .9.

To solve this problem, the MES should correspond to sampling error. Studies with small samples and large sampling error need to specify a high MES, which increases the risk that researchers end up with a result that falsifies their predictions (for example, that the race IAT does not predict voting against Obama, p < .05).

I therefore suggest an MES of d = .2 or r = .1 as a default criterion. This is consistent with Cohen’s (1988) criteria for a small effect. In terms of significance testing, not much changes. For example, for a t-test, we simply subtract .2 from the standardized mean difference. This has implications for a priori power analysis. To have 80% power to detect a moderate effect size of d = .5 against the nil-hypothesis, a total of 128 participants is needed (n = 64 per cell).

To compute power with an MES of .2, I wrote a little R-script. It shows that N = 356 participants are needed to achieve 80% power with a population effect size of d = .5. The program also computes the power for the hypothesis that the population effect size is below the MES. Once more, it is necessary to assume a fixed population effect size. A plausible value is zero, but if there is a small but negligible effect, power would be lower. The figure shows that power is only 47%. Power of less than 50% implies that the effect size estimate has to be negative to produce a significant result.
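
The original script is not reproduced here, but a minimal sketch using the normal approximation (sampling error of d = 2/sqrt(N)) recovers the same numbers:

# Power for a minimum-effect-size (MES) test with MES = .2 and alpha = .05.
# "Above" means claiming the population effect exceeds the MES (the estimate
# must exceed .2 + 1.96*SE); "below" means claiming it falls short of the MES
# (the estimate must fall below .2 - 1.96*SE).
mes   <- 0.2
zcrit <- qnorm(.975)
power_above_mes <- function(N, d) pnorm((d - mes) * sqrt(N) / 2 - zcrit)
power_below_mes <- function(N, d) pnorm((mes - d) * sqrt(N) / 2 - zcrit)
power_above_mes(356, d = 0.5)       # ~ .80: show d > .2 when the true d = .5
power_below_mes(356, d = 0.0)       # ~ .47: show d < .2 when the true d = 0
# For comparison, the classic nil-hypothesis test reaches 80% power for d = .5
# with N = 128:
pnorm(0.5 * sqrt(128) / 2 - zcrit)  # ~ .80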

Of course, the values can be changed to make other assumptions. The main point is to demonstrate the advantage of specifying a minimum effect size. It is now possible to falsify predictions. For example, with a sample of 93,239 participants, we have 100% power to determine whether an effect is larger or smaller than .2, even if the population effect size is d = .1. Thus, we can falsify the prediction that media events about fat-shaming move scores on a fat-IAT with a statistically small effect size.

The obvious downside of this approach is that it makes it harder to get statistically significant results. For many research areas in psychology, a sample size of N = 356 is very large. Animal studies or fMRI studies often have much smaller sample sizes. One solution to this problem is to increase the number of observations with repeated measurements, but this is not always possible and not necessarily cheaper. Limited resources are the main reason why psychologists often conduct underpowered studies. This is not going to change overnight.

Fortunately, thinking about the minimum effect size is still helpful for consumers of research articles because they can retroactively apply these criteria to published research findings. For example, take Dasgupta and Greenwald’s seminal study that aimed to change race IAT scores with an experimental manipulation. If we apply an MES of d = .2 to a study with N = 33 participants, we easily see that this study provided no valuable information about effect sizes, because an effect size estimate of d = -.5 is needed to claim that the population effect size is below d = .2, and an estimate of d = .9 is needed to claim that it is greater than .2. If we assume that the population effect size is d = .5, the study has only 13% power to produce a significant result. Given the selection for significance, it is clear that published significant results are dramatically inflated by sampling error.
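
The same normal approximation used above shows how uninformative the N = 33 study is under an MES of d = .2:

# Decision thresholds and power for N = 33 with MES = .2.
N  <- 33
se <- 2 / sqrt(N)                          # ~ .35
c(claim_below_MES = 0.2 - 1.96 * se,       # estimate must be below ~ -.5
  claim_above_MES = 0.2 + 1.96 * se)       # estimate must be above ~  .9
pnorm((0.5 - 0.2) * sqrt(N) / 2 - 1.96)    # ~ .13 power if the true d = .5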

In conclusion, the biggest obstacle to improving psychological science is the asymmetry in nil-hypothesis significance testing. Whereas significant results that are obtained with “luck” can be published, non-significant results are often not published or considered inconclusive. Bayes-Factors have been suggested as a solution to this problem, but they do not take effect sizes into account and can also reject the nil-hypothesis with practically meaningless effect sizes. To get rid of the asymmetry, it is necessary to specify non-null effect sizes (Lakens et al., 2018). This will not happen any time soon because it requires an actual change in research practices and more resources. If we have learned anything from the history of psychology, it is that sample sizes have not changed. To protect themselves from “lucky” false-positive results, consumers of scientific articles can specify their own minimum effect sizes and check whether results remain significant. With the typical p-values between .05 and .005, this will often not be the case. These results should be treated as interesting suggestions that need to be followed up with larger samples, and readers can skip the policy-implication discussion sections. Maybe if readers get more demanding, researchers will work harder to convince them of their pet theories. Amen.

Consider the source: Can you trust implicit social cognition researchers?

One popular topic in social psychology is persuasion: how can we make people believe something and change their attitudes? A number of variables influence how persuasive a message is. One of them is source credibility. When Donald Trump claims that he won the election, we may want to check what others say. If even Fox News calls the election for Biden, we may not trust Donald Trump and ignore his claim. Similarly, scientists respond to the reputation of other scientists. For example, in 2011 it was revealed that Diederik Stapel faked the data for many of his articles. The figure below shows that other scientists responded by citing these articles less.

In 2011, it also became apparent that social psychologists used other practices to publish results that cannot be replicated (OSC, 2015). These practices are known as questionable research practices, but unlike fraud they are not banned and articles that reported these results have not been retracted. As a result, social psychologists continue to cite articles with false evidence that present misleading information.

One literature that has lost credibility is research on implicit cognitions. Early on, replication failures of implicit priming undermined the assumption that social behavior is often elicited by stimuli without awareness (Kahneman, 2012). Now, research with implicit bias measures has come under scrutiny. A main problem with implicit bias measures is that they have low convergent validity with each other (Schimmack, 2019). As most of the variance in these measures is measurement error, one can only expect small effects of experimental manipulations. This means that studies with implicit measures often have low statistical power to produce replicable effects. Nevertheless, articles that use implicit measures report mostly successful results. This can only be explained with questionable research practices. Therefore it is currently unclear whether there are any robust and replicable results in the literature with implicit bias measures.

This is even true, when an article reports several replication studies (Schimmack, 2012). Although multiple replications give the impression that a result is credible, questionable research practices undermine the trustworthiness of results. Fortunately, it is possible to examine the credibility of published results with forensic statistical tools that can reveal the use of questionable practices. Here I use these tools to examine the credibility of a five-study article that claimed implicit measures are influenced by the credibility of a source.

Consider the Source: Persuasion of Implicit Evaluations Is Moderated by Source Credibility

The article by Colin Tucker Smith, Jan De Houwer, and Brian A. Nosek reports five experimental manipulations of attitudes towards a consumer product.

Study 1

Supporting the central hypothesis of the study, source expertise significantly affected implicit preferences; participants showed a stronger implicit preference for Soltate when that information was presented by an individual “High” in expertise (M = 0.54, SD = 0.36) than “Low” in expertise (M = 0.42, SD = 0.41), t(195) = 2.24, p = .026, d = .32

Study 2

Participants indicated a stronger implicit preference for Soltate when that information was presented by an individual “High” in expertise (M = 0.60, SD = 0.33) than “Low” in expertise (M = 0.48, SD = 0.40), t(194) = 2.18, p = .031, d = 0.31

Study 3

Implicit preferences were significantly affected by manipulating the level of source trustworthiness; participants indicated a stronger implicit preference for Soltate when that information was presented by an individual “High” in trustworthiness (M = 0.52, SD = 0.34) than “Low” in trustworthiness (M = 0.42, SD = 0.39), t(280) = 2.40, p = .017, d = 0.29.

Study 4

Replicating Study 3, manipulating the level of source trustworthiness significantly affected implicit preferences as measured by the IAT. Participants implicitly preferred Soltate more when presented by an individual “High” in trustworthiness (M = 0.51, SD = 0.35) than “Low” in trustworthiness (M = 0.43, SD = 0.43), t(419) = 2.17, p = .031, d = 0.21.

Study 5

There was a main effect of credibility, F(1, 549) = 4.43, p = .036, such that participants implicitly preferred Soltate more when presented by a source high in credibility (M = 0.62, SD = 0.36) than low in credibility (M = 0.55, SD = 0.39).

Forensic Analysis

For naive readers of social psychology results, it may look impressive that the key finding was replicated in five studies. After all, the chance of a false positive result decreases exponentially with each significant result. While a single p-value less than .05 can occur by chance in 1 out of 20 studies, five significant results can only happen by chance in 1 out of 20*20*20*20*20 = 3,200,000 attempts. So, it would seem reasonable to believe that implicit attitude measures are influenced by the credibility of the source. However, when researchers use questionable practices, a p-value less than .05 is ensured and even 9 significant results do not mean that an effect is real (Bem, 2011). So, the question is whether the researchers used questionable practices to produce their results.

To answer this question, we can examine the probability of obtaining five very similar p-values in a row, p = .026, p = .031, p = .017, p = .031, p = .036. The Test of Insufficient Variance (TIVA) converts the p-values into z-scores and compares the variance against the sampling error of z-scores, which is 1. The variance is just 0.012. The probability of this happening by chance is 1/3287. In other words, it is extremely unlikely that five independent studies would produce such a small variance in p-values by chance.
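
This is my reading of the test as described here, not the original script, but a few lines of R reproduce the reported numbers:

# Test of Insufficient Variance (TIVA): convert two-sided p-values to z-scores
# and compare their variance to the expected sampling variance of 1 with a
# chi-square test for insufficient variance.
p <- c(.026, .031, .017, .031, .036)
z <- qnorm(1 - p / 2)
v <- var(z)                                          # ~ 0.012
pchisq((length(z) - 1) * v / 1, df = length(z) - 1)  # ~ .0003, about 1 in 3,300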

Another test is to compute the average observed power of the studies (Schimmack, 2012), which is 60%. We can now ask how probable it is to get five significant results in a row with a probability of 60%, which is .6*.6*.6*.6*.6 = .08. This probability is also low, but not as low as the one for the previous test. The reason is that QRPs also inflate observed power. Thus, the 60% estimate is an overestimation. A rough index of inflation is simply the difference between the 100% success rate and the inflated power estimate of 60%, which is 40%. Subtracting the inflation from the observed power estimate gives a replicability index of 20%. A value of 20% is also what is obtained in simulation studies where all studies are false positives (i.e., there is no real effect).
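
The observed power and R-Index calculations can be sketched the same way (again, my reconstruction rather than the original code):

# Average observed power, chance of five significant results in a row,
# and the R-Index (observed power minus inflation).
p <- c(.026, .031, .017, .031, .036)
z <- qnorm(1 - p / 2)
obs_power <- pnorm(z - qnorm(.975))   # observed power of each study
mean(obs_power)                       # ~ .60
mean(obs_power)^5                     # ~ .08
inflation <- 1 - mean(obs_power)      # success rate (100%) minus observed power
mean(obs_power) - inflation           # R-Index ~ .20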

So, does source credibility influence implicitly measured attitudes? We do not know. At least these five studies provide no evidence for it. However, these results do provide further evidence that consumers of IAT research should consider the source. IAT researchers have a vested interest in making you believe that implicit measures can reveal something important about you that exists outside of your awareness. This gives them power to make big claims about social behavior that benefit their careers.

However, you also need to consider the source of this blog post. I have a vested interest in showing that social psychologists are full of shit. After all, who cares about bias analyses that always show there is no bias. So, who should you believe? The answer is that you should believe the data. Is it possible to get five p-values between .05 and .005 in a row? If you disregard probability theory, you can ignore this post. If you trust probability theory, you might wonder what other results in the IAT literature you can trust. In science we don’t trust people. We trust the evidence, but only after we make sure that we are presented with credible evidence. Unfortunately, this is often not the case in psychological science, even in 2020.

Research Opportunity Program 2020: Preliminary Results

Every year, some of our best undergraduate students apply to work with professors on their research projects for one year. For several years, I have worked with students to examine the credibility of psychological science. After an intensive crash course in statistics, students code published articles. The biggest challenge for them and everybody else is to find the critical statistical test that supports the main conclusion of the article. Moreover, results are often not reported sufficiently (e.g., effect sizes without sampling error or exact p-values). For students it is a good opportunity to see why good understanding of statistics is helpful in reading original research articles.

One advantage of my ROP is that it is based on secondary data. Thus, the Covid-19 pandemic didn’t impede the project. In fact, it probably helped me to get a larger number of students. In addition, zoom made it easy to meet with students to discuss critical articles one on one.

The 2020 ROP team has 13 members:
Sara Al-Omani
Samanvita Bajpai
Nidal Chaudhry
Yeshoda Harry-Paul
Nouran Hashem
Memoona Maah
Andrew Sedrak
Dellannia Segreti
Yashika Shroff
Brook Tan
Ze Yearwood
Maria Zainab
Xinyu Zhu

The main aim of the project is to get a sense of the credibility of psychological research across the diverse areas of psychology. The reason is that actual replication initiatives have focussed mostly on social and cognitive psychology, where recruitment of participants is easy and studies are easy to do (Open Science Collaboration, 2015). Despite concerns about other areas, actual replication projects are lacking due to the huge costs involved. A statistical approach has the advantage that credibility can also be assessed by simply examining the strength of evidence (the signal/noise ratio) in published articles.

The team started with coding articles from 2010, the year just before the replication crisis started. The journals represent a broad range of areas in psychology with an emphasis on clinical psychology because research in clinical psychology has the most direct practical implications.

Addictive Behaviors
Cognitive Therapy and Research
Journal of Anxiety Disorders
Journal of Consulting and Clinical Psychology
Journal of Counseling Psychology
Journal of Applied Psychology
Behavioural Neuroscience
Child Development
Social Development

The test statistics are converted into z-scores as a common metric to reflect the strength of evidence against the null-hypothesis. These z-scores are then analyzed with z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020).
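
For readers who want to try this themselves, a minimal sketch with the zcurve R package (Bartos & Schimmack) looks roughly like this; the p-values are made-up placeholders rather than the coded ROP data, and the exact function arguments may differ across package versions:

# Convert two-sided p-values of focal tests to z-scores and fit a z-curve to
# estimate the expected discovery rate (EDR) and expected replication rate (ERR).
# install.packages("zcurve")
library(zcurve)
p <- c(.001, .004, .012, .020, .030, .038, .044, .049, .003, .025,
       .015, .041, .008, .033, .046, .002, .027, .019, .036, .011)
z <- qnorm(1 - p / 2)
fit <- zcurve(z)
summary(fit)   # reports ERR and EDR with confidence intervals
plot(fit)      # the histogram-plus-fitted-curve figure shown in these posts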

The figure and statistical results are similar to results in social psychology (Schimmack, 2020). First, the graph shows the well-known practice in psychology of publishing mostly successful studies; that is, statistically significant results with p < .05 (z > 1.96) (Sterling, 1959). Here, the observed discovery rate is 88%, but the actual discovery rate is even higher because marginally significant results (p < .10, z > 1.65) are also often interpreted as sufficient evidence to reject the null-hypothesis.

In comparison, the estimated discovery rate is much lower at 33%. The discrepancy between the observed and expected discovery rates provides clear evidence that questionable research practices were used (QRPs; John et al., 2012; Schimmack, 2014). QRPs are research practices that increase the chances of reporting a statistically significant result, including selective reporting of significant results or highlighting significant results as discoveries (Kerr et al., 1998). The presence of QRPs in psychological research in 2010 is expected, but information about the extent of QRPs is lacking. Z-curve suggests that there is massive use of QRPs to boost an actual success rate of 33% to a nearly perfect success rate in published articles. This has important implications for replication attempts. If reported results are selected to be significant from results with low power, replication studies have a low probability of being significant again.

However, the chance of replicating a significant result from the original studies depends on the mean power of the studies with significant results, and selection for significance also increases the actual power of these studies (Brunner & Schimmack, 2020). The reason is that studies with higher power have a higher chance of producing significant results even without QRPs. The z-curve estimate of the expected replication rate is 52%. This would suggest that every second study could be successfully replicated. The problem with this estimate is that it assumes that exact replications are possible. However, psychological studies are difficult or impossible to replicate exactly. This may explain why the expected replication rate is higher than the success rate in actual replication studies (cf. Bartos & Schimmack, 2020). For actual replication studies, the expected discovery rate seems to be a better predictor.

In conclusion, the results for clinical psychology and other areas of psychology are similar to those for social psychology (Schimmack, 2020). This is consistent with a comparison of disciplines based on automatic extraction of all test statistics rather than hand-coding of focal hypothesis tests (Schimmack, 2020).

In the upcoming semester (aptly called the winter semester in Canada), the team will code articles from 2019 to see whether a decade of soul searching about research practices in psychology has produced notable changes. There are two possibilities. On the one hand, journals could have become more accepting of non-significant results leading to more publications of non-significant results (i.e., a decrease in the observed discovery rate). On the other hand, journals may have asked for a priori power analysis and bigger sample sizes to reduce sampling error to produce stronger evidence against the null-hypothesis (i.e., an increase in the expected discovery rate).

Stay tuned and check in again in May.

The Unbearable Bullshit of Social Psychologists

Until 2011, social psychologists were able to believe that they were actually doing science. They conducted studies, often rigorous experiments with random assignment, analyzed the data and reported results only when they achieved statistical significance, p < .05. This is how they were trained to do science and most of them believed that this is how science works.

However, in 2011 an article by a well-respected social psychologist changed all this. Daryl Bem published an article that claimed to show time-reversed causal processes. Seemingly, people were able to feel the future (Bem, 2011). This article shook the foundations of social psychology because most social psychologists did not believe in paranormal phenomena. Yet, Bem presented evidence for his crazy claim in 8 out of 9 studies. The only study that did not work used supraliminal stimuli. The other studies used subliminal stimuli, suggesting that only our unconscious self can feel the future.

Over the past decade it has become apparent that Bem and other social psychologists had misused significance testing. They only paid attention to significant results, p < .05, and ignored non-significant results, p > .05. Selective publishing of significant results means that statistical results no longer distinguished between true and false findings. Everything was significant, even time-reversed implicit priming.

Some areas of social psychology have been hit particularly hard by replication failures. Most prominently, implicit priming research has been called out as a poster child of doubt about social psychological results by Nobel Laureate Kahneman. The basic idea of implicit priming is that stimuli outside of participants’ awareness can influence their behavior. Many implicit priming studies have failed to replicate.

Ten years later, we can examine how social psychologists have responded to the growing evidence that many classic findings were obtained with questionable practices (not reporting the failures) and cannot be replicated. Unfortunately, the response is consistent with psychodynamic theories of ego-defense mechanisms and social psychologists’ own theories of motivated reasoning. For the most part, social psychologists have simply ignored the replication failures in the 2010s and continue to treat old articles as if they provide scientific insights into human behavior. For example, Bargh – a leading figure in the implicit priming world – wrote a whole book about implicit priming that does not mention replication failures and presents questionable research as if it were a collection of well-established facts (Schimmack, 2017).

Given the questionable status of implicit priming research, it is not surprising that concerns are also growing about measures that were designed to reflect individual differences in implicit cognitions (Schimmack, 2019). The measures often have low reliability (when you test yourself you get different results each time) and show low convergent validity (one measure of your unconscious feelings towards your spouse doesn’t correlate with another measure of your unconscious feelings towards your spouse). It is therefore suspicious when researchers consistently find significant results with these measures, because measurement error should make it difficult to get significant results all the time.

Implicit Love

In an article from 2019 (i.e., after the replication crisis in social psychology had been well established), Hicks and McNulty make the following claims about implicit love; that is, feelings that are not reflected in self-reports of affection or marital satisfaction.

Their title is based on a classic article by Bargh and Chartrand.

Readers are not informed that the big claims made by Bargh twenty years ago have failed to be supported by empirical evidence. In particular, the claim that stimuli often influence behavior without awareness lacks any credible evidence. It is therefore sad to say that social psychologists have moved on from self-deception (they thought they were doing science, but they were not) to other-deception (spreading false information knowing that credible doubts have been raised about this research). Just like it is time to reclaim humility and honesty in American political life, it is important to demand humility and honesty from American social psychologists, who are dominating social psychology.

The empirical question is whether research on implicit love has produced robust and credible results. One advantage for relationship researchers is that a lot of this research was published after Bem (2011). Thus, researchers could have improved their research practices. This could result in two outcomes. Either relationship researchers reported their results more honestly and did report non-significant results when they emerged, or they increased sample sizes to ensure that small effect sizes could produce statistically significant results.

Hicks and McNulty’s (2019) narrative review makes the following claims about implicit love.

1. The frequency of various sexual behaviors was prospectively associated with automatic partner evaluations assessed with an implicit measure but not with self-reported relationship satisfaction. (Hicks, McNulty, Meltzer, & Olson, 2016).

2. Participants with less responsive partners who felt less connected to their partners during conflict-of-interest situations had more negative automatic partner attitudes at a subsequent assessment but not more negative subjective evaluations (Murray, Holmes, & Pinkus, 2010).

3. Pairing the partner with positive affect from other sources (i.e., positive words and pleasant images) can increase the positivity of automatic partner attitudes relative to a control group.

4. The frequency of orgasm during sex was associated with automatic partner attitudes, whereas sexual frequency was associated only with deliberate reports of relationship satisfaction for participants who believed frequent sex was important for relationship health.

5. More positive automatic partner attitudes have been linked to perceiving fewer problems over time (McNulty, Olson, Meltzer, & Shaffer, 2013).

6. More positive automatic partner attitudes have been linked to self-reporting fewer destructive behaviours (Murray et al., 2015).

7. More positive automatic partner attitudes have been linked to more cooperative relationship behaviors (LeBel & Campbell, 2013)

8. More positive automatic partner attitudes have been linked to displaying attitude-consistent nonverbal communication in conflict discussions (Faure et al., 2018).

9. More positive automatic partner attitudes were associated with a decreased likelihood of dissolution the following year, even after controlling for explicit relationship satisfaction (Lee, Rogge, & Reis, 2010).

10. Newlyweds’ implicit partner evaluations but not explicit satisfaction within the first few months of marriage were more predictive of their satisfaction 4 years later.

11. People with higher motivation to see their relationship in a positive light because of barriers to exiting their relationships (i.e., high levels of relationship investments and poor alternatives) demonstrated a weaker correspondence between their automatic attitudes and their relationship self-reports.

12. People with more negative automatic evaluations are less trusting of their partners when their working memory capacity is limited (Murray et al., 2011).

These claims are followed with the assurance that “these studies provide compelling evidence that automatic partner attitudes do have implications for relationship outcomes” (p. 256).

Should anybody who reads this article or similar claims in the popular media believe them? Have social psychologists improved their methods to produce more credible results over the past decade?

Fortunately, we can answer this question by examining the statistical evidence that was used to support these claims, using the z-curve method. First, all test statistics are converted into z-scores that represent the strength of evidence against the null-hypothesis (i.e., implicit love has no effect or does not exist) in each study. These z-scores are a function of the effect size and the amount of sampling error in a study (signal/noise ratio). Second, the z-scores are plotted as a histogram to show how many of the reported results provide weak or strong evidence against the null-hypothesis. The data are here for full transparency (Implicit.Love.xlsx).

The figure shows the z-curve for the 30 studies that reported usable test results. Most published z-scores are clustered just above the threshold value of 1.96 that corresponds to the .05 criterion to claim a discovery. This clustering is indicative of selecting significant results from a much larger set of analyses that were run and produced non-significant results. The grey curve from z = 0 to 1.96 shows the predicted number of analyses that were not reported. The file-drawer ratio implies that for every significant result there were 12 analyses with non-significant results.

Another way to look at the results is to compare the observed discovery rate with the expected discovery rate. The observed discovery rate is simply the percentage of studies that reported a significant result, which is 29 out of 30 or 97%. The estimated discovery rate is the average power of the studies to produce a significant result. It is only 8%. This shows that social psychologists still continue to select only successes and do not report or interpret the failures. Moreover, in this small sample of studies, there is considerable uncertainty around the point estimates. The 95% confidence interval for the replication success probability includes 5%, which is not higher than chance. The complementary finding is that the maximum false discovery rate is estimated to be 63%, but could be as high as 100%. In other words, the results make it impossible to conclude that even some of these studies produced a credible result.
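
The file-drawer ratio mentioned above follows (approximately) from the estimated discovery rate:

# With an estimated discovery rate (EDR) of 8%, the implied number of
# non-significant analyses per published significant result is about:
edr <- 0.08
(1 - edr) / edr   # ~ 11.5, i.e., roughly 12 unreported analyses per success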

In short, the entire research on implicit love is bullshit. Ten years ago, social psychologists had the excuse that they did not know better and misused statistics because they were trained the wrong way. This excuse is wearing thin in 2020. They know better, but they continue to report misleading results and write unscientific articles. In psychology, this is called other-deception, in everyday life it is called lying. Don’t trust social psychologists. Doing so is as stupid as believing Donald Trump when he claims that he won the election.