Please help out to improve this post. If you have conducted successful or unsuccessful replication studies of work done by Jens Forster, please share this information with me and I will add it to this blog post.
Jens Forster was a social psychologist from Germany. He was a rising star and on the way to receiving a prestigious 5 million Euro award from the Alexander von Humboldt Foundation (Retraction Watch, 2015). Then an anonymous whistleblower accused him of scientific misconduct. Under pressure, Forster returned the award without admitting to any wrongdoing.
He was also in the process of moving from the University of Amsterdam to the University of Bochum. After a lengthy investigation, Forster was denied tenure and he is no longer working in academia (Science, 2016), despite the fact that an investigation by the German association of psychologists (DGP) did not conclude that he committed fraud.
While the personal consequences for Forster are similar to those of Stapel, who admitted to fraud and left his tenured position, the effect on the scientific record is different. Stapel retracted over 50 articles that are no longer being cited at high numbers. In contrast, Forster retracted only a few papers and most of his articles are not flagged to readers as potentially fraudulent. We can see the differences in citation counts for Stapel and Forster.
Stapel’s citation counts peaked at 350 and are now down to 150 citations a year. Some of these citations are with co-authors and from papers that have been cleared as credible.
Citation counts for Forster peaked at 450. They also decreased by 200 citations to 250 citations, but we are also seeing an uptick by 100 citations in 2019. The question is whether this muted correction is due to Forster’s denial of wrongdoing or whether the articles that were not retracted actually are more credible.
The difficulty in proving fraud in social psychology is that social psychologists also used many questionable practices to produce significant results. These questionable practices have the same effect as fraud, but they were not considered unethical or illegal. Thus, there are two reasons why articles that have not been retracted may still lack credible evidence. First, it is difficult to prove fraud when authors do not confess. Second, even if no fraud was committed, the data may lack credible evidence because they were produced with questionable practices that are not considered data fabrication.
For readers of the scientific literature it is irrelevant whether results with low credibility were produced with fraud or with other methods. The only question is whether the published results provide credible evidence for the theoretical claims in an article. Fortunately, meta-scientists have made progress over the past decade in answering this question. One method relies on a statistical examination of an author’s published test statistics. Test statistics can be converted into p-values or z-scores so that they have a common metric (e.g., t-values can be compared to F-values). The higher the z-score, the stronger the evidence against the null-hypothesis. High z-scores are also difficult to obtain with questionable practices. Thus, they are either fraudulent or provide real evidence for a hypothesis (i.e., against the null-hypothesis).
I have published z-curve analyses of over 200 social/personality psychologists that show clear evidence of variation in research practices across researchers (Schimmack, 2021). I did not include Stapel or Forster in these analyses because doubts have been raised about their research practices. However, it is interesting to compare Forster’s z-curve plot to the plot of other researchers because it is still unclear whether anomalous statistical patterns in Forster’s articles are due to fraud or the use of questionable research practices.
The distribution of z-scores shows clear evidence that questionable practices were used because the observed discovery rate of 78% is much higher than the estimated discovery rate of 18% and the ODR is outside of the 95% CI of the EDR, 9% to 47%. An EDR of 18% places Forster at rank #181 in the ranking of 213 social psychologists. Thus, even if Forster did not conduct fraud, many of his published results are questionable.
The comparison of Forster with other social psychologists is helpful because humans are prone to overgeneralize from salient examples, which is known as stereotyping. Fraud cases like Stapel and Forster have tainted the image of social psychology and undermined trust in social psychology as a science. The fact that Forster would rank very low in comparison to other social psychologists shows that he is not representative of research practices in social psychology. This does not mean that Stapel and Forster are bad apples and extreme outliers. The use of QRPs was widespread, but how much researchers used QRPs varied across researchers. Thus, we need to take an individual difference perspective and personalize credibility. The average z-curve plot for all social psychologists ignores that some research practices were much worse and others were much better. Thus, I argue against stereotyping social psychologists and in favor of evaluating each social psychologist based on their own merits. As much as all social psychologists acted within a reward structure that nearly rewarded Forster’s practices with a 5 million Euro prize, researchers navigated this reward structure differently. Hopefully, making research practices transparent can change the reward structure so that credibility gets rewarded.
Is there still something new to say about p-values? Yes, there is. Most discussions of p-values focus on a scenario where a researcher tests a new hypothesis, computes a p-value, and now has to interpret the result. The status quo follows Fisher’s 100-year-old approach of comparing the p-value to a value of .05. If the p-value is below .05 (two-sided), the inference is that the population effect size deviates from zero in the same direction as the observed effect in the sample. If the p-value is greater than .05, the results are deemed inconclusive.
This approach to the interpretation of the data assumes that we have no other information about our hypothesis or that we do not trust this information sufficiently to incorporate it in our inference about the population effect size. Over the past decade, Bayesian psychologists have argued that we should replace p-values with Bayes-Factors. The advantage of Bayes-Factors is that they can incorporate prior information to draw inferences from data. However, if no prior information is available, the use of Bayesian statistics may cause more harm than good. To use priors without prior information, Bayes-Factors are computed with generic, default priors that are not based on any information about a research question. Along with other problems of Bayes-Factors, this is not an appealing solution to the problem of p-values.
Here I introduce a new approach to the interpretation of p-values that has been called empirical Bayesian and has been successfully applied in genomics to control the field-wise false positive rate. That is, prior information does not rest on theoretical assumptions or default values, but rather on prior empirical information. The information that is used to interpret a new p-value is the distribution of prior p-values.
Every study is a new study because it relies on a new sample of participants that produces sampling error that is independent of the previous studies. However, studies are not independent in other characteristics. A researcher who conducted a study with N = 40 participants is likely to have used similar sample sizes in previous studies. And a researcher who used N = 200 is also likely to have used larger sample sizes in previous studies. Researchers are also likely to use similar designs. Social psychologists, for example, prefer between-subject designs to better deceive their participants. Cognitive psychologists care less about deception and study simple behaviors that can be repeated hundreds of times within an hour. Thus, researchers who used a between-subject design are likely to have used a between-subject design in previous studies and researchers who used a within-subject design are likely to have used a within-subject design before. Researchers may also be chasing different effect sizes. Finally, researchers can differ in their willingness to take risks. Some may only test hypotheses that are derived from prior theories that have a high probability of being correct, whereas others may be willing to shoot for the moon. All of these consistent differences between researchers (i.e., sample size, effect size, research design) influence the unconditional statistical power of their studies, which is defined as the long-run probability of obtaining significant results, p < .05.
Over the past decade, in the wake of the replication crisis, interest in the distribution of p-values has increased dramatically. For example, one approach uses the distribution of significant p-values, which is known as p-curve analysis (Simonsohn et al., 2014). If p-values were obtained with questionable research practices when the null-hypothesis is true (p-hacking), the distribution of significant p-values is flat. Thus, if the distribution is monotonically decreasing from 0 to .05, the data have evidential value. Although p-curve analysis has been extended to estimate statistical power, simulation studies show that the p-curve algorithm is systematically biased when power varies across studies (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020).
As shown in simulation studies, a better way to estimate power is z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Here I show how z-curve analyses of prior p-values can be used to demonstrate that p-values from one researcher are not equal to p-values of other researchers when we take their prior research practices into account. By using this prior information, we can adjust the alpha level of individual researchers to take their research practices into account. To illustrate this use of z-curve, I start with an illustration of how different research practices influence p-value distributions.
Scenario 1: P-hacking
In the first scenario, we assume that a researcher only tests false hypotheses (i.e., the null-hypothesis is always true; Bem, 2011; Simonsohn et al., 2011). In theory, it would be easy to spot false positives because replication studies would produce 19 non-significant results for every significant one and significant ones would have different signs. However, questionable research practices lead to a pattern of results where only significant results in one direction are reported, which is the norm in psychology (Sterling, 1959; Sterling et al., 1995; Schimmack, 2012).
In a z-curve analysis, p-values are first converted into z-scores, z = -qnorm(p/2) with qnorm being the inverse normal function and p being a two-sided p-value. A z-curve plot shows the histogram of all z-scores, including non-significant ones (Figure 1).
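The conversion can be tried out in a few lines. The following is a minimal Python sketch of the same formula, not part of the z-curve package; the function name p_to_z is made up for this illustration, and NormalDist().inv_cdf from the Python standard library plays the role of R's qnorm.

```python
from statistics import NormalDist


def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score.

    Mirrors the R expression -qnorm(p/2); p_to_z is a hypothetical
    helper name used only for this illustration.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # inv_cdf is the inverse of the standard normal CDF (qnorm in R).
    return -NormalDist().inv_cdf(p / 2)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.005):
        print(f"p = {p:<5} -> z = {p_to_z(p):.3f}")
```

A p-value of .05 maps onto the familiar criterion of z = 1.96, and smaller p-values map onto larger z-scores.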
Visual inspection of the z-curve plot shows that all 200 p-values are significant (on the right side of the criterion value z = 1.96). It also shows that the mode of the distribution is at the significance criterion. Most important, visual inspection shows a steep drop from the mode to the range of non-significant values. That is, while z = 1.96 is the most common value, z = 1.95 is never observed. This drop provides direct visual information that questionable research practices were used because normal sampling error cannot produce such dramatic changes in the distribution.
I am skipping the technical details of how the z-curve model is fitted to the distribution of z-scores (Bartos & Schimmack, 2020). It is sufficient to know that the model is fitted to the distribution of significant z-scores with a limited number of model parameters that are equally spaced over the range of z-scores from 0 to 6 (7 parameters, z = 0, z = 1, z = 2, …, z = 6). The model gives different weights to these parameters to match the observed distribution. Based on these estimates, z-curve.2.0 computes several statistics that can be used to interpret single p-values that have been published or future p-values by the same researcher, assuming that the same research practices are used.
The most important statistic is the expected discovery rate (EDR), which corresponds to the average power of all studies that were conducted by a researcher. Importantly, the EDR is an estimate that is based on only the significant results, but makes predictions about the number of non-significant results. In this example with 200 significant p-values, the EDR is 7%. Of course, we know that it really is only 5% because the expected discovery rate is 5% when the null-hypothesis is always true and tests use alpha = .05. However, sampling error can introduce biases in our estimates. Nevertheless, even with only 200 observations, the estimate of 7% is relatively close to 5%. Thus, z-curve tells us something important about the way these p-values were obtained. They were obtained in studies with very low power that is close to the criterion value for a false positive result.
Z-curve uses bootstrap to compute confidence intervals around the point estimate of the EDR. The 95%CI ranges from 5% to 18%. As the interval includes 5%, we cannot reject the hypothesis that all tests were false positives (which in this scenario is also the correct conclusion). At the upper end we can see that mean power is low, even if some true hypotheses are being tested.
The EDR can be used for two purposes. First, it can be used to examine the extent of selection for significance by comparing the EDR to the observed discovery rate (ODR; Schimmack, 2012). The ODR is simply the percentage of significant results that was observed in the sample of p-values. In this case, this is 200 out of 200 or 100%. The discrepancy between the EDR of 7% and 100% is large and 100% is clearly outside the 95%CI of the EDR. Thus, we have strong evidence that questionable research practices were used, which we know to be true in this simulation because the 200 tests were selected from a much larger sample of 4,000 tests.
Most important for the use of z-curve to interpret p-values is the ability to estimate the maximum False Discovery Rate (Soric, 1989). The false discovery rate is the percentage of significant results that are false positives or type-I errors. The false discovery rate is often confused with alpha, the long-run probability of making a type-I error. The significance criterion ensures that no more than 5% of all results, significant and non-significant, are false positives. When we test 4,000 false hypotheses (i.e., the null-hypothesis is true), we are not going to have more than 5% (4,000 * .05 = 200) false positive results. This is true in general and it is true in this example. However, when only significant results are published, it is easy to make the mistake of assuming that no more than 5% of the published 200 results are false positives. This would be wrong because the 200 were selected to be significant and they are all false positives.
The false discovery rate is the percentage of significant results that are false positives. It no longer matters whether non-significant results are published or not. We are only concerned with the population of p-values that are below .05 (z > 1.96). In our example, the question is how many of the 200 significant results could be false positives. Soric (1989) demonstrated that the EDR limits the number of false positive discoveries. The more discoveries there are, the lower is the risk that discoveries are false. Using a simple formula, we can compute the maximum false discovery rate from the EDR.
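The difference between alpha and the false discovery rate can be made concrete with a small simulation. This is a minimal Python sketch under stated assumptions, not the simulation behind Figure 1: the function name, the seed, and the choice to keep significant results in both directions are all made up for this illustration.

```python
import random
from statistics import NormalDist


def simulate_selection(n_tests: int = 4000, alpha: float = 0.05, seed: int = 1):
    """Simulate tests of false hypotheses and publish only significant ones.

    Hypothetical helper: every null-hypothesis is true, so every
    z-score is pure sampling error from a standard normal distribution.
    """
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05

    # Each study is a pair (z-score, null-hypothesis is true).
    studies = [(rng.gauss(0.0, 1.0), True) for _ in range(n_tests)]

    # Selection for significance (two-sided, both directions kept here).
    published = [(z, h0) for z, h0 in studies if abs(z) > z_crit]

    # Type-I error rate across ALL tests: close to alpha.
    rate_all = len(published) / n_tests
    # False discovery rate among PUBLISHED results: share with a true null.
    fdr_published = (
        sum(h0 for _, h0 in published) / len(published) if published else 0.0
    )
    return len(published), rate_all, fdr_published


if __name__ == "__main__":
    n_sig, rate_all, fdr = simulate_selection()
    print(f"significant results: {n_sig}")
    print(f"share of all tests:  {rate_all:.1%}")
    print(f"FDR among published: {fdr:.0%}")
```

Roughly 5% of all 4,000 tests come out significant, which is what alpha controls, yet every single one of the published results is a false positive.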
FDR = (1/EDR - 1) * (.05/.95), with alpha = .05
With an EDR of 7%, we obtained a maximum FDR of 68%. We know that the true FDR is 100%, thus, the estimate is too low. However, the reason is that sampling error can have dramatic effects on the FDR estimates when the EDR is low. With an EDR of 6%, the FDR estimate goes up to 82% and with an EDR estimate of 5% it is 100%. To take account of this uncertainty, we can use the 95%CI of the EDR to compute a 95%CI for the FDR estimate, 24% to 100%. Now we see that we cannot rule out that the FDR is 100%.
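These numbers can be checked with a few lines of Python. The sketch below implements Soric's bound as FDR = (1/EDR - 1) * (alpha/(1 - alpha)); the function name max_fdr and the cap at 100% are choices made for this illustration. Plugging in an EDR of exactly .07 gives about 70% rather than 68%, presumably because the reported value is based on the unrounded EDR estimate (an assumption on my part).

```python
def max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's (1989) upper bound on the false discovery rate.

    edr   : expected discovery rate as a proportion (e.g., 0.07)
    alpha : significance criterion
    max_fdr is a hypothetical helper name for this illustration.
    """
    if not 0.0 < edr <= 1.0:
        raise ValueError("edr must be in (0, 1]")
    fdr = (1.0 / edr - 1.0) * (alpha / (1.0 - alpha))
    # An EDR below alpha would push the bound above 1; cap it at 100%.
    return min(fdr, 1.0)


if __name__ == "__main__":
    for edr in (0.05, 0.06, 0.07, 0.18):
        print(f"EDR = {edr:.0%} -> maximum FDR = {max_fdr(edr):.0%}")
```

The limits of the 95%CI of the EDR (5% and 18%) translate into the limits of the 95%CI of the maximum FDR (100% and 24%), in reverse order because a higher EDR implies a lower FDR.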
In short, scenario 1 introduced the use of p-value distributions to provide useful information about the risk that the published results are false discoveries. In this extreme example, we can dismiss the published p-values as inconclusive or as lacking in evidential value.
Scenario 2: The Typical Social Psychologist
It is difficult to estimate the typical effect size in a literature. However, a meta-analysis of meta-analyses suggested that the average effect size in social psychology is d = .4 (Richard et al., 2003). A smaller set of replication studies that did not select for significance estimated an effect size of d = .3 for social psychology (d = .2 for JPSP, d = .4 for Psych Science; Open Science Collaboration, 2015). The latter estimate may include an unknown number of hypotheses where the null-hypothesis is true and the true effect size is zero. Thus, I used d = .4 as a reasonable effect size for true hypotheses in social psychology (see also LeBel, Campbell, & Loving, 2017).
It is also known that a rule of thumb in experimental social psychology was to allocate n = 20 participants to a condition, resulting in a sample size of N = 40 in studies with two groups. In a 2 x 2 design, the main effect would be tested with N = 80. However, to keep this scenario simple, I used d = .4 and N = 40 for true effects. This affords 23% power to obtain a significant result.
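The power value can be approximated without special software. The following Python sketch uses a normal approximation to the two-sample t-test, which is an assumption of this illustration; the exact calculation with the non-central t-distribution gives the slightly lower 23% reported above. The function name approx_power is made up.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation (hypothetical helper): the test statistic is
    treated as normal with mean d * sqrt(n/2) and standard deviation 1.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)  # expected z-score for effect size d
    # Probability of a significant result in either direction.
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


if __name__ == "__main__":
    print(f"d = .4, n = 20 per group:  {approx_power(0.4, 20):.0%}")
    print(f"d = 0,  n = 20 per group:  {approx_power(0.0, 20):.0%}")
```

With d = 0 the function returns alpha, which is the probability of a false positive result.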
Finkel, Eastwick, and Reis (2017) argued that power of 25% is optimal if 75% of the hypotheses that are being tested are true. However, the assumption that 75% of hypotheses are true may be on the optimistic side. Wilson and Wixted (2018) suggested that the false discovery risk is closer to 50%. With 23% power for true hypotheses, this implies a false discovery rate of Given uncertainty about the actual false discovery rate in social psychology, I used a scenario with 50% true and 50% false hypotheses.
I kept the number of significant results at 200. To obtain 200 significant results with an equal number of true and false hypotheses, we need 1,428 tests. The 714 true hypotheses contribute 714*.23 = 164 true positives and the 714 false hypotheses produce 714*.05 = 36 false positive results; 164 + 36 = 200. This implies a false discovery rate of 36/200 = 18%. The true EDR is (714*.23+714*.05)/(714+714) = 14%.
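The bookkeeping for this scenario can be reproduced in a short Python sketch. The function name and the returned dictionary keys are made up for this illustration; the inputs (200 significant results, 23% power, 50% true hypotheses, alpha = .05) are the values stated above.

```python
def scenario(n_significant: float, power: float,
             prop_true: float = 0.5, alpha: float = 0.05) -> dict:
    """Work out how many tests are needed for n_significant results.

    Hypothetical helper: true hypotheses are significant with
    probability `power`, false hypotheses with probability `alpha`.
    """
    # Average probability of a significant result per test = true EDR.
    edr = prop_true * power + (1 - prop_true) * alpha
    n_tests = n_significant / edr
    true_pos = n_tests * prop_true * power
    false_pos = n_tests * (1 - prop_true) * alpha
    return {
        "n_tests": n_tests,
        "true_positives": true_pos,
        "false_positives": false_pos,
        "fdr": false_pos / (true_pos + false_pos),
        "edr": edr,
    }


if __name__ == "__main__":
    s2 = scenario(n_significant=200, power=0.23)
    print(f"tests needed:    {s2['n_tests']:.0f}")
    print(f"true positives:  {s2['true_positives']:.0f}")
    print(f"false positives: {s2['false_positives']:.0f}")
    print(f"true FDR:        {s2['fdr']:.0%}")
    print(f"true EDR:        {s2['edr']:.0%}")
```

The unrounded number of tests is about 1,428.6, which is why the text works with 1,428 tests and 714 hypotheses of each kind.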
The z-curve plot looks very similar to the previous plot, but they are not identical. Although the EDR estimate is higher, it still includes zero. The maximum FDR is well above the actual FDR of 18%, but the 95%CI includes the actual value of 18%.
A notable difference between Figure 1 and Figure 2 is the expected replication rate (ERR), which corresponds to the average power of significant p-values. It is called the expected replication rate (ERR) because it predicts the percentage of significant results if the studies that were selected for significance were replicated exactly (Brunner & Schimmack, 2020). When power is heterogeneous, power of the studies with significant results is higher than power of studies with non-significant results (Brunner & Schimmack, 2020). In this case, with only two power values, the reason is that false positives have a much lower chance to be significant (5%) than true positives (23%). As a result, the average power of significant studies is higher than the average power of all studies. In this simulation, the true average power of significant studies is the weighted average of true and false positives with significant results, (164*.23 + 36*.05)/(164 + 36) = 20%. Z-curve perfectly estimated this value.
Importantly, the 95% CI of the ERR, 11% to 34%, does not include zero. Thus, we can reject the null-hypothesis that all of the significant results are false positives based on the ERR. In other words, the significant results have evidential value. However, we do not know the composition of this average. It could be a large percentage of false positives and a few true hypotheses with high power or it could be many true positives with low power. We also do not know which of the 200 significant results is a true positive or a false positive. Thus, we would need to conduct replication studies to distinguish between true and false hypotheses. And given the low power, we would only have a 23% chance of successfully replicating a true positive result. This is exactly what happened with the reproducibility project. And the inconsistent results led to debates and require further replications. Thus, we have real-world evidence of how uninformative p-values are when they are obtained this way.
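The weighted average can be written as a small Python function. This is only a sketch of the true value in the simulation, not of how z-curve estimates the ERR from data; the function name and the input format (pairs of a count of significant results and their power) are made up for this illustration.

```python
def true_replication_rate(groups) -> float:
    """Average power of significant results.

    groups: iterable of (number of significant results, power) pairs.
    Hypothetical helper: each significant result is weighted equally,
    so the result is the expected share of successful exact replications.
    """
    groups = list(groups)
    total = sum(n for n, _ in groups)
    if total <= 0:
        raise ValueError("need at least one significant result")
    return sum(n * power for n, power in groups) / total


if __name__ == "__main__":
    # Scenario 2: 164 true positives (power .23), 36 false positives (.05).
    err = true_replication_rate([(164, 0.23), (36, 0.05)])
    print(f"true ERR: {err:.0%}")
```

The result of about 20% is higher than the average power of all studies in this scenario (14%) because selection for significance favors the studies with higher power.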
Social psychologists might argue that the use of small samples is justified because most hypotheses in psychology are true. Thus, we can use prior information to assume that significant results are true positives. However, this logic fails when social psychologists test false hypotheses. In this case, the observed distribution of p-values (Figure 1) is not that different from the distribution that is observed when most significant results are true positives that were obtained with low power (Figure 2). Thus, it is doubtful that this is really an optimal use of resources (Finkel et al., 2015). However, until recently this was the way experimental social psychologists conducted their research.
Scenario 3: Cohen’s Way
In 1962 (!), Cohen conducted a meta-analysis of statistical power in social psychology. The main finding was that studies had only a 50% chance to get significant results with a median effect size of d = .5. Cohen (1988) also recommended that researchers should plan studies to have 80% power. However, this recommendation was ignored.
To achieve 80% power with d = .4, researchers need N = 200 participants. Thus, the number of studies is reduced from 5 studies with N = 40 to one study with N = 200. As Finkel et al. (2017) point out, we can make more discoveries with many small studies than a few large ones. However, this ignores that the results of the small studies are difficult to replicate. This was not a concern when social psychologists did not bother to test whether their discoveries are false discoveries or whether they can be replicated. The replication crisis shows the problems of this approach. Now we have results from decades of research that produced significant p-values without providing any information whether these significant results are true or false discoveries.
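The required sample size can be approximated with the standard formula n per group = 2 * ((z_alpha + z_power)/d)^2. The Python sketch below relies on the normal approximation, which is an assumption of this illustration; exact calculations based on the t-distribution give a slightly larger value of about 100 per group. The function name is made up.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate sample size per group for a two-sided, two-group test.

    Hypothetical helper based on the normal approximation.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_power = norm.inv_cdf(power)          # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    n = n_per_group(0.4)
    print(f"d = .4: n = {n} per group, N = {2 * n} in total")
```

For d = .4 the approximation gives 99 participants per group, or N = 198, which rounds to the N = 200 used here.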
Scenario 3 examines what social psychology would look like today, if social psychologists had listened to Cohen. The scenario is the same as in the second scenario, including publication bias. There are 50% false hypotheses and 50% true hypotheses with an effect size of d = .4. The only difference is that researchers used N = 200 to test their hypotheses to achieve 80% power.
With 80% power, we need 470 tests (compared to 1,428 in Scenario 2) to produce 200 significant results, 235*.80 + 235*.05 = 188 + 12 = 200. Thus, the EDR is 200/470 = 43%. The true false discovery rate is 12/200 = 6%. The expected replication rate is (188*.80 + 12*.05)/200 = 76%. Thus, we see that higher power increases replicability from 20% to 76% and lowers the false discovery rate from 18% to 6%.
Figure 3 shows the z-curve plot. Visual inspection shows that Figure 3 looks very different from Figures 1 and 2. The estimates are also different. In this example, sampling error inflated the EDR to be 58%, but the 95%CI includes the true value of 43%. The 95%CI does not include the ODR. Thus, there is evidence for publication bias, which is also visible by the steep drop in the distribution at 1.96.
Even with a low EDR of 20%, the maximum FDR is only 21%. Thus, we can conclude with confidence that at least 79% of the significant results are true positives. Remember, in the previous scenario, we could not rule out that most results are false positives. Moreover, the estimated replication rate is 73%, which underestimates the true replication rate of 76%, but the 95%CI includes the true value, 95%CI = 61% – 84%. Thus, if these studies were replicated, we would have a high success rate for actual replication studies.
Just imagine for a moment what social psychology might look like in a parallel universe where social psychologists followed Cohen’s advice. Why didn’t they? The reason is that they did not have z-curve. All they had was p < .05, and using p < .05, all three scenarios are identical. All three scenarios produced 200 significant results. Moreover, as Finkel et al. (2015) pointed out, smaller samples produce 200 significant results quicker than large samples. An additional advantage of small samples is that they inflate point estimates of the population effect size. Thus, the social psychologists with the smallest samples could brag about the biggest (illusory) effect sizes as long as nobody was able to publish replication studies with larger samples that deflated effect sizes of d = .8 to d = .08 (Joy-Gaba & Nosek, 2010).
This game is over, but social psychology – and other social sciences – have published thousands of significant p-values, and nobody knows whether they were obtained using scenario 1, 2, or 3, or probably a combination of these. This is where z-curve can make a difference. P-values are no longer equal when they are considered as a data point from a p-value distribution. In scenario 1, a p-value of .01 and even a p-value of .001 has no meaning. In contrast, in scenario 3 even a p-value of .02 is meaningful and more likely to reflect a true positive than a false positive result. This means that we can use z-curve analyses of published p-values to distinguish between probably false and probably true positives.
I illustrate this with three concrete examples from a project that examined the p-value distributions of over 200 social psychologists (Schimmack, in preparation). The first example has the lowest EDR in the sample. The EDR is 11% and because there are only 210 tests, the 95%CI is wide and includes 5%.
The maximum FDR estimate is high with 41% and the 95%CI includes 100%. This suggests that we cannot rule out the hypothesis that most significant results are false positives. However, the replication rate is 57% and the 95%CI, 45% to 69%, does not include 5%. Thus, some tests tested true hypotheses, but we do not know which ones.
Visual inspection of the plot shows a different distribution than Figure 2. There are more just significant p-values, z = 2.0 to 2.2 and more large z-scores (z > 4). This shows more heterogeneity in power. A comparison of the ODR with the EDR shows that the ODR falls outside the 95%CI of the EDR. This is evidence of publication bias or the use of questionable research practices. One solution to the presence of publication bias is to lower the criterion for statistical significance. As a result, the large number of just significant results is no longer significant and the ODR decreases. This is a post-hoc correction for publication bias. For example, we can lower alpha to .005.
As expected, the ODR decreases considerably from 70% to 39%. In contrast, the EDR increases. The reason is that many questionable research practices produce a pile of just significant p-values. As these values are no longer used to fit the z-curve, it predicts a lot fewer non-significant p-values. The model now underestimates the number of z-scores between 2 and 2.2. However, these values do not seem to come from a sampling distribution. Rather, they stick out like a tower. By excluding them, the p-values that are still significant with alpha = .005 look more credible. Thus, we can correct for the use of QRPs by lowering alpha and by examining whether these p-values produced interesting discoveries. At the same time, we can ignore the p-values between .05 and .005 and await replication studies to provide empirical evidence whether these hypotheses receive empirical support.
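The effect of a stricter criterion on the ODR is simple to compute. The Python sketch below uses a handful of invented p-values with a pile of just significant results; the data and the function name are hypothetical and are not taken from the researcher discussed here.

```python
def observed_discovery_rate(p_values, alpha: float = 0.05) -> float:
    """Share of reported p-values that are significant at `alpha`."""
    p_values = list(p_values)
    if not p_values:
        raise ValueError("need at least one p-value")
    return sum(p < alpha for p in p_values) / len(p_values)


# Hypothetical p-values, invented purely for illustration.
p_values = [0.001, 0.003, 0.004, 0.012, 0.021,
            0.034, 0.041, 0.048, 0.20, 0.45]

odr_05 = observed_discovery_rate(p_values, alpha=0.05)
odr_005 = observed_discovery_rate(p_values, alpha=0.005)

if __name__ == "__main__":
    print(f"ODR with alpha = .05:  {odr_05:.0%}")
    print(f"ODR with alpha = .005: {odr_005:.0%}")
```

The five p-values between .005 and .05 count as discoveries under the conventional criterion but not under the stricter one, which is why the ODR drops.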
The second example was picked because it was close to the median EDR (33) and ERR (66) in the sample of 200 social psychologists.
The larger sample of tests (k = 1,529) helps to obtain more precise estimates. A comparison of the ODR, 76%, and the 95%CI of the EDR, 12% to 48%, shows that publication bias is present. However, with an EDR of 33%, the maximum FDR is only 11% and the upper limit of the 95%CI is 39%. Thus, we can conclude with confidence that fewer than 50% of the significant results are false positives. However, numerous findings might still be false positives. Only replication studies can provide this information.
In this example, lowering alpha to .005 did not align the ODR and the EDR. This suggests that these values come from a sampling distribution where non-significant results were not published. Thus, there is no simple fix by adjusting the significance criterion. In this situation, we can conclude that the published p-values are unlikely to be false positives, but that replication studies are needed to ensure that published significant results are not false positives.
The third example is the social psychologist with the highest EDR. In this case, the EDR is actually a little bit lower than the ODR, suggesting that there is no publication bias. The high EDR also means that the maximum FDR is very small and even the upper limit of the 95%CI is only 7%.
Another advantage of data without publication bias is that it is not necessary to exclude non-significant results from the analysis. Fitting the model to all p-values produces much tighter estimates of the EDR and the maximum FDR.
The upper limit of the 95%CI for the FDR is now 4%. Thus, we conclude that no more than 5% of the p-values less than .05 are false positives. Even p = .02 is unlikely to be a false positive. Finally, the estimated replication rate is 84% with a tight confidence interval ranging from 78% to 90%. Thus, most of the published p-values are expected to replicate in an exact replication study.
I hope these examples make it clear how useful it can be to evaluate single p-values with prior information about the p-value distribution of a lab. As labs differ in their research practices, significant p-values are also different. Only if we ignore the research context and focus on a single result, p = .02 equals p = .02. But once we see the broader distribution, p-values of .02 can provide stronger evidence against the null-hypothesis than p-values of .002.
Cohen tried and failed to change the research culture of social psychologists. Meta-psychological articles have puzzled over why meta-analyses of power failed to increase power (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989). Finkel et al. (2015) provided an explanation. In a game where the winner publishes as many significant results as possible, the optimal strategy is to conduct as many studies as possible with low power. This strategy continues to be rewarded in psychology, where jobs, promotions, grants, and pay raises are based on the number of publications. Cohen (1990) said less is more, but that is not true in a science that does not self-correct and treats every p-value less than .05 as a discovery.
To improve psychology as a science, we need to change the incentive structure and author-wise z-curve analyses can do this. Rather than using p < .05 (or p < .005) as a general rule to claim discoveries, claims of discoveries can be adjusted to the research practices of a researcher. As demonstrated here, this will reward researchers who follow Cohen’s rules and punish those who use questionable practices to produce p-values less than .05 (or Bayes-Factors > 3) without evidential value. And maybe, there is a badge for credible p-values one day.
The past decade has seen major replication failures in social psychology. This has led to a method revolution in social psychology. Thanks to technological advances, many social psychologists moved from studies with smallish undergraduate samples to online studies with hundreds of participants. Thus, findings published after 2016 are more credible than those published before 2016.
However, social psychologists have avoided taking a closer look at theories that were built on the basis of questionable results. Review articles continue to present these theories and cite old studies as if they provided credible evidence for them, as if the replication crisis never happened.
One influential theory in social psychology is that stimuli can bypass conscious awareness and still influence behavior. This assumption is based on theories of emotions that emerged in the 1980s. In the famous Lazarus-Zajonc debate most social psychologists sided with Zajonc who quipped that “Preferences need no inferences.”
The influence of Zajonc can be seen in hundreds of studies with implicit primes (Bargh et al., 1996; Devine, 1989) and in modern measures of implicit cognition such as the evaluative priming task and the affect misattribution paradigm (AMP; Payne et al., 2005).
Payne and Lundberg (2014) credit a study by Murphy and Zajonc (1993) for the development of the AMP. Interestingly, the AMP was developed because Payne was unable to replicate a key finding from Murphy and Zajonc’s studies.
In these studies, a smiling or frowning face was presented immediately before a target stimulus (e.g., a Chinese character). Participants had to evaluate the target. The key finding was that the faces influenced evaluations of the targets only when the faces were processed without awareness. When participants were aware of the faces, they had no effect. When Payne developed the AMP, he found that preceding stimuli (e.g., faces of African Americans) still influenced evaluations of Chinese characters, even though the faces were presented long enough (75ms) to be clearly visible.
Although research with the AMP has blossomed, there has been little interest in exploring the discrepancy between Murphy and Zajonc’s (1993) findings and Payne’s findings.
One possible explanation for the discrepancy is that Murphy and Zajonc’s (1993) results were obtained with questionable research practices (QRPs, John et al., 2012). Fortunately, it is possible to detect the use of QRPs using forensic statistical tools. Here I use these tools to examine the credibility of Murphy and Zajonc’s claims that subliminal presentations of emotional faces produce implicit priming effects.
Before I examine the small set of studies from this article, it is important to point out that the use of QRPs in this literature is highly probable. This is revealed by examining the broader literature of implicit priming, especially with subliminal stimuli (Schimmack, 2020).
Figure 1 shows that published studies rarely report non-significant results, although the distribution of significant results shows low power and a high probability of non-significant results. While the observed discovery rate is 90%, the expected discovery rate is only 13%. This shows that QRPs were used to suppress results that did not show the expected implicit priming effects.
Study 1 in Murphy and Zajonc (1993) had 32 participants; 16 with subliminal presentations and 16 with supraliminal presentations. There were 4 within-subject conditions (smiling, frowning & two control conditions). The means of the affect ratings were 3.46 for smiling, 3.06 for both control conditions and 2.70 for the frowning faces. The perfect ordering of means is a bit suspicious, but even more problematic is that the mean differences of experimental conditions and control conditions were all statistically significant. The t-values, df = 15, are 2.23, 2.31, 2.31, and 2.59. Too many significant contrasts have been the downfall for a German social psychologist. Here we can only say that Murphy and Zajonc were very lucky that the two control conditions fell smack in the middle of the two experimental conditions. Any deviation in one direction would have increased one comparison, but decreased the other comparison and increased the risk of a non-significant result.
Study 2 was similar, except that the judgment was changed from subjective liking to objective goodness vs. badness judgments.
The means for the two control conditions were again right in the middle, nearly identical to each other, and nearly identical to the means in Study 1 (M = 3.05, 3.06). Given sampling error, it is extremely unlikely that even the same condition produces the same means. Without reporting actual t-values, the authors further claim that all four comparisons of experimental and control conditions are significant.
Taken together, these two studies with surprisingly similar t-values and 32 participants provide the only evidence for the claim that stimuli outside of awareness can elicit affective reactions. This weak evidence has garnered nearly 1,000 citations without ever being questioned and without any published replication attempts.
Studies 3-5 did not examine affective priming, but Study 6 did. The paradigm here was different. Participants were subliminally presented with a smiling or a frowning face. Then they had to choose between two pictures, the prime and a foil. The foil either had the same facial expression or a different facial expression. Another manipulation was to have the same or a different gender. This study showed a strong effect of facial expression, t(62) = 6.26, but not of gender.
I liked this design and conducted several conceptual replication studies with emotional pictures (beautiful beaches, dirty toilets). It did not work. Participants were not able to use their affect to pick the right picture from a prime-foil pair. I also manipulated presentation times and with increasing presentation times, participants could pick out the picture, even if the affect was the same (e.g., prime and foil were both pleasant).
Study 6 also explains why Payne was unable to get priming effects for subliminal stimuli that varied race or other features.
One possible explanation for the results in Study 6 is that it is extremely difficult to mask facial expressions, especially smiles. I also did some studies that tried that and at least with computers it was impossible to prevent detection of smiling faces.
Thus, we are left with some questionable results in Studies 1 and 2 as the sole evidence that subliminal stimuli can elicit affective reactions that are transferred to other stimuli.
I have tried to get implicit priming effects on affect measures and failed. It was difficult to publish these failures in the early 2000s. I am sure there are many other replication failures (see Figure 1) and Payne et al.’s (2014) account of the development of the AMP implies as much. Social psychology is still in the process of cleaning up the mess that the use of QRPs created. Implicit priming research is a poster child of the replication crisis and researchers should stop citing these old articles as if they produced credible evidence.
Emotion researchers may also benefit from revisiting the Lazarus-Zajonc debate. Appraisal theory may not have the sex appeal of unconscious emotions, but it may be a more robust and accurate theory of emotions. Preference may not always require inferences, but preferences that are based on solid inferences are likely to be a better guide of behavior. Therefore I prefer Lazarus over Zajonc.
This is the third part in a mini-series of building a monster-model of well-being. The first part (Part 1) introduced the measurement of well-being and the relationship between affect and well-being. The second part added measures of satisfaction with life-domains (Part 2). Part 2 ended with the finding that most of the variance in global life-satisfaction judgments is based on evaluations of important life domains. Satisfaction in important life domains also influences the amount of happiness and sadness individuals experience, but affect had relatively small unique effects on global life-satisfaction judgments. In fact, happiness made a trivial, non-significant unique contribution.
The effects of the various life domains on happiness, sadness, and the weighted average of domain satisfactions are shown in the table below. Regarding happy affective experiences, the results showed that friendships and recreation are important for high levels of positive affect (experiencing happiness), but health and money are relatively unimportant.
In part 3, I am examining how we can add the personality trait extraversion to the model. Evidence that extraverts have higher well-being was first reviewed by Wilson (1967). An influential article by Costa and McCrae (1980) showed that this relationship is stable over a period of 10 years, suggesting that stable dispositions contribute to this relationship. Since then, meta-analyses have repeatedly reaffirmed that extraversion is related to well-being (DeNeve & Cooper, 1998; Heller et al., 2004; Horwood, Smillie, Marrero, & Wood, 2020).
Here, I am examining the question how extraversion influences well-being. One criticism of structural equation modeling of correlational, cross-sectional data is that causal arrows are arbitrary and that the results do not provide evidence of causality. This is nonsense. Whether a causal model is plausible or not depends on what we know about the constructs and measures that are being used in a study. Not every study can test all assumptions, but we can build models that make plausible assumptions given well-established findings in the literature. Fortunately, personality psychology has established some robust findings about extraversion and well-being.
First, personality traits and well-being measures show evidence of heritability in twin studies. If well-being showed no evidence of heritability, we could not postulate that a heritable trait like extraversion influences well-being because genetic variance in a cause would produce genetic variance in an outcome.
Second, both personality and well-being have a highly stable variance component. However, the stable variance in extraversion is larger than the stable variance in well-being (Anusic & Schimmack, 2016). This implies that extraversion causes well-being rather than the other way around because causality goes from the more stable variable to the less stable variable (Conley, 1984). The reasoning is that a variable that changes quickly and influences another variable would produce changes, which contradicts the finding that the outcome is stable. For example, if height were correlated with mood, we would know that height causes variation in mood rather than the other way around because mood changes daily, but height does not. We also have direct evidence that life events that influence well-being such as unemployment can change well-being without changing extraversion (Schimmack, Wagner, & Schupp, 2008). This implies that well-being does not cause extraversion because the changes in well-being due to unemployment would then produce changes in extraversion, which is contradicted by evidence. In short, even though the cross-sectional data used here cannot test the assumption that extraversion causes well-being, the broader literature makes it very likely that causality runs from extraversion to well-being rather than the other way around.
Despite 50 years of research, it is still unknown how extraversion influences well-being. “It is widely appreciated that extraversion is associated with greater subjective well-being. What is not yet clear is what processes relate the two” (Harris, English, Harms, Gross, & Jackson, 2017, p. 170). Costa and McCrae (1980) proposed that extraversion is a disposition to experience more pleasant affective experiences independent of actual stimuli or life circumstances. That is, extraverts are disposed to be happier than introverts. A key problem with this affect-level model is that it is difficult to test. One way of doing so is to falsify alternative models. One alternative model is the affective reactivity model. According to this model, extraverts are only happier in situations with rewarding stimuli. This model implies personality x situation interactions that can be tested. So far, however, the affective reactivity model has received very little support in several attempts (Lucas & Baird, 2004). Another model assumes that extraversion is related to situation selection. Extraverts may spend more time in situations that elicit pleasure. According to this model, both introverts and extraverts enjoy socializing, but extraverts actually spend more time socializing than introverts. This model implies person-situation correlations that can be tested.
Nearly 20 years ago, I proposed a mediation model that assumes extraversion has a direct influence on affective experiences and the amount of affective experiences is used to evaluate life-satisfaction (Schimmack, Diener, & Oishi, 2002). Although the article is cited relatively frequently, none of these citations are replication studies. The findings above cast doubt on this model because there is no direct influence of positive affect (happiness) on life-satisfaction judgments.
The following analyses examine how extraversion is related to well-being in the Mississauga Family Study dataset.
1. A multi-method study of extraversion and well-being
I start with a very simple model that predicts well-being from extraversion, CFI = .989, RMSEA = .027. The correlated residuals show some rater-specific correlations between ratings of extraversion and life-satisfaction. Most important, the correlation between the extraversion and well-being factors is only r = .11, 95%CI = .03 to .19.
The effect size is noteworthy because extraversion is often considered to be a very powerful predictor of well-being. For example, Kesebir and Diener (2008) write “Other than extraversion and neuroticism, personality traits such as extraversion … have been found to be strong predictors of happiness” (p. 123)
There are several explanations for the weak relationship in this model. First, many studies did not control for shared method variance. Even McCrae and Costa (1991) found a weak relationship when they used informant ratings of extraversion to predict self-ratings of well-being, but they ignored the effect size estimate.
Another possible explanation is that Mississauga is a highly diverse community and that the influence of extraversion on well-being can be weaker in non-Western samples (r ~ .2, Kim et al., 2017).
I next added the two affect factors (happiness and sadness) to the model to test the mediation model. This model had good fit, CFI = .986, RMSEA = .026. The moderate to strong relationships from extraversion to happy feelings and happy feelings to life-satisfaction were highly significant, z > 5. Thus, without taking domain satisfaction into account, the results appear to replicate Schimmack et al.’s (2002) findings.
However, including domain satisfaction changes the results, CFI = .988, RMSEA = .015.
Although extraversion is a direct predictor of happy feelings, b = .25, z = 6.5, the non-significant path from happy feelings to life-satisfaction implies that extraversion does not influence life-satisfaction via this path, indirect effect b = .00, z = 0.2. Thus, the total effect of b = .14, z = 3.7, is fully mediated by the domain satisfactions.
A broad affective disposition model would predict that extraversion enhances positive affect across all domains, including work. However, the path coefficients show that extraversion is a stronger predictor of satisfaction with some domains than others. The strongest coefficients are obtained for satisfaction with friendships and recreation. In contrast, extraversion has only very small relationships with financial satisfaction, health satisfaction, or housing satisfaction that are not statistically significant. Inspection of the indirect effects shows that friendship (b = .026), leisure (.022), romance (.026), and work (.024) account for most of the total effect. However, power is too low to test significance of individual path coefficients.
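As a rough check on the claim that these four domains account for most of the total effect, the reported indirect effects can be summed and compared with the total effect of b = .14. This is only a sketch that redoes the arithmetic with the rounded coefficients given above; the variable names are my own and do not come from the model output.

```python
# Indirect effects of extraversion on life-satisfaction via four domains,
# copied from the rounded values reported in the text.
indirect_effects = {
    "friendship": 0.026,
    "leisure": 0.022,
    "romance": 0.026,
    "work": 0.024,
}
total_effect = 0.14  # total effect of extraversion reported in the text

# Sum the four indirect paths and express them as a share of the total effect.
sum_indirect = sum(indirect_effects.values())
share = sum_indirect / total_effect

print(f"sum of indirect effects = {sum_indirect:.3f}")
print(f"share of total effect   = {share:.0%}")
```

With these rounded values, the four domains add up to about .098, which is roughly 70% of the total effect.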
The results replicate previous work. First, extraversion is a statistically significant predictor of life-satisfaction, even when method variance is controlled, but the effect size is small. Second, extraversion is a stronger predictor of happy feelings than life-satisfaction and unrelated to sad feelings. However, the inclusion of domain satisfaction judgments shows that happy feelings do not mediate the influence of extraversion on life-satisfaction. Rather, extraversion predicts higher satisfaction with some life domains. It may seem surprising that this is a new finding in 2021, 40 years after Costa and McCrae (1980) emphasized the importance of extraversion for well-being. The reason is that few psychological studies of well-being include measures of domain satisfaction and few sociological studies of well-being include personality measures (Schimmack, Schupp, & Wagner, 2008). The present results show that it would be fruitful to examine how extraversion is related to satisfaction with friendships, romantic relationships, and recreation. This is an important avenue for future research. However, for the monster model of well-being the next step will be to include neuroticism in the model. Stay tuned.
Psychological Science is the flagship journal of the Association for Psychological Science (APS). In response to the replication crisis, D. Stephen Lindsay worked hard to increase the credibility of results published in this journal as editor from 2014-2019 (Schimmack, 2020). This work paid off and meta-scientific evidence shows that publication bias decreased and replicability increased (Schimmack, 2020). In the replicability rankings, Psychological Science is one of a few journals that show reliable improvement over the past decade (Schimmack, 2020).
The good news is that these concerns were unfounded. The meta-scientific criteria of credibility did not change notably from 2019 to 2020.
The observed discovery rates were 64% in 2019 and 66% in 2020. The estimated discovery rates were 58% in 2019 and 59% in 2020. Visual inspection of the z-curves and the slightly higher ODR than EDR suggest that there is still some selection for significant results. That is, researchers use so-called questionable research practices to produce statistically significant results. However, the magnitude of these questionable research practices is small and much lower than in 2010 (ODR = 77%, EDR = 38%).
Based on the EDR, it is possible to estimate the maximum false discovery rate (i.e., the percentage of significant results where the null-hypothesis is true). This rate is low with 4% in both years. Even the upper limit of the 95%CI is only 12%. This contradicts the widespread concern that most published (significant) results are false (Ioannidis, 2005).
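The step from an EDR to a maximum false discovery rate can be sketched in a few lines. The sketch assumes the bound is Soric's (1989) formula with alpha = .05, and the function name `max_fdr` is my own. It uses only the point estimates of the EDR reported above, so it does not reproduce the confidence interval.

```python
def max_fdr(discovery_rate, alpha=0.05):
    """Upper bound on the false discovery rate (Soric, 1989).

    Assumes the worst case in which all true hypotheses are tested
    with 100% power, so every non-discovery comes from a true null.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))


# Estimated discovery rates (EDR) reported in the text.
for year, edr in [(2010, 0.38), (2019, 0.58), (2020, 0.59)]:
    print(year, f"EDR = {edr:.0%}", f"max FDR = {max_fdr(edr):.1%}")
```

For 2019 and 2020 the bound rounds to 4%, in line with the values reported above; for the 2010 EDR of 38% it is roughly twice as high.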
The expected replication rate is slightly, but not significantly (i.e., it could be just sampling error) lower in 2020 (76% vs. 83%). Given the small risk of a false positive result, this means that on average significant results were obtained with the recommended power of 80% (Cohen, 1988).
Overall, these results suggest that published results in Psychological Science are credible and replicable. However, this positive evaluation comes with a few caveats.
First, null-hypothesis significance testing can only provide information that there is an effect and the direction of the effect. It cannot provide information about the effect size. Moreover, it is not possible to use the point estimates of effect sizes in small samples to draw inferences about the actual population effect size. Often the 95% confidence interval will include small effect sizes that may have no practical significance. Readers should clearly evaluate the lower limit of the 95%CI to examine whether a practically significant effect was demonstrated.
Second, the replicability estimate of 80% is an average. The average power of results that are just significant is lower. The local power estimates below the x-axis suggest that results with z-scores between 2 and 3 (p < .05 & p > .005) have only 50% power. It is recommended to increase sample sizes for follow-up studies.
Third, the local power estimates also show that most non-significant results are false negatives (type-II errors). Z-scores between 1 and 2 are estimated to have 40% average power. It is unclear how often articles falsely infer that an effect does not exist or can be ignored because the test was not significant. Often sampling error alone is sufficient to explain differences between test statistics in the range from 1 to 2 and from 2 to 3.
Finally, 80% power is sufficient for a single focal test. However, with 80% power, multiple focal tests are likely to produce at least one non-significant result. If all focal tests are significant, there is a concern that questionable research practices were used (Schimmack, 2012).
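To see why a clean sweep of significant focal tests becomes suspicious, here is a minimal sketch of the probability that all k tests are significant when each has 80% power. It assumes the tests are independent and that power is the same for every test, which real articles will only approximate; the function name is hypothetical.

```python
def prob_all_significant(power, k):
    """Probability that k independent tests are all significant,
    assuming every test has the same power."""
    return power ** k


for k in range(1, 6):
    p_all = prob_all_significant(0.80, k)
    print(f"{k} focal tests: P(all significant) = {p_all:.3f}, "
          f"P(at least one non-significant) = {1 - p_all:.3f}")
```

Under these assumptions, the chance of at least one non-significant result passes 50% once an article reports four focal tests.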
Readers should also carefully examine the results of individual articles. The present results are based on automatic extraction of all statistical tests. If focal tests have only p-values in the range between .05 and .005, the results are less credible than if at least some p-values are below .005 (Schimmack, 2020).
In conclusion, Psychological Science has responded to concerns about a high rate of false positive results by increasing statistical power and reducing publication bias. This positive trend continued in 2020 under the leadership of the new editor Patricia Bauer.
CORRECTION: Open science also means that our mistakes are open and transparent. Shortly after I posted this blog, Spencer Greenberg pointed out that I made a mistake when I used the discovery rate in OSC to estimate the discovery rate in psychological science. I am glad he caught my mistake quickly and I can warn readers that my conclusions do not hold. A 50% success rate for replications in cognitive psychology suggests that most results in cognitive psychology are not false positives, but the low replication rate of 25% for social psychology does allow for a much higher false discovery rate than I estimated in this blog post.
Money does not make the world go round, it cannot buy love, but it does pretty much everything else. Money is behind most scientific discoveries. Just like investments in stock markets, investments in science are unpredictable. Some of these investments are successful (e.g., Covid-19 vaccines), but others are not.
Most scientists, like myself, rely on government funding that is distributed in a peer-reviewed process by scientists to scientists. It is difficult to see how scientists would fund research that aims to show that most of their work is useless, if not fraudulent. This is where private money comes in.
One grant was given to Ioannidis, who was famous for declaring that “most published results are false” (Ioannidis, 2005). The other grant was given to Nosek, to establish the Open Science Foundation.
Ioannidis and Nosek also worked together as co-authors (Button et al., 2013). In terms of traditional metrics of impact, the Arnold Foundation’s investment paid off. Ioannidis’s (2005) article has been cited over 4,000 times. Button et al.’s article has been cited over 2,000 times. And an influential article by Nosek and many others that replicated 100 studies from psychology has been cited over 2,000 times.
These articles are go-to citations for authors to claim that science is in a replication crisis, most published results are false, and major reforms to scientific practices are needed. It is no secret that many authors who cite these articles have not read the actual article. This explains why thousands of citations do not include a single article that points out that the Open Science Collaboration findings contradict Ioannidis’s claim that most published results are false.
Ioannidis (2005) used hypothetical examples to speculate that most published results are false. The main assumption underlying these scenarios was that researchers are much more likely to test false hypotheses (a vaccine has no effect) than true hypotheses (a vaccine has an effect). The second assumption was that even when researchers test true hypotheses, they do so with a low probability to provide enough evidence (p < .05) that an effect occurred.
Under these assumptions, most empirical tests of hypotheses produce non-significant results (p > .05) and among those that are significant, the majority come from the large number of tests that tested a false hypothesis (false positives).
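A small sketch shows how these two assumptions combine. The input values (10% true hypotheses, 20% power) are hypothetical numbers chosen by me to illustrate the logic; they are not taken from Ioannidis (2005).

```python
def scenario(prop_true, power, alpha=0.05):
    """Return (discovery rate, false discovery rate) for a set of tests.

    prop_true: proportion of tested hypotheses that are true (effect exists)
    power:     probability of p < alpha when the hypothesis is true
    alpha:     probability of p < alpha when the null-hypothesis is true
    """
    true_positives = prop_true * power
    false_positives = (1 - prop_true) * alpha
    discovery_rate = true_positives + false_positives
    return discovery_rate, false_positives / discovery_rate


# Hypothetical scenario: few true hypotheses, tested with low power.
dr, fdr = scenario(prop_true=0.10, power=0.20)
print(f"discovery rate = {dr:.1%}, false discovery rate = {fdr:.1%}")
```

In this hypothetical scenario only a small fraction of all tests is significant, and more than half of the significant results are false positives.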
In theory, it would be easy to verify Ioannidis’s predictions because he predicts that most results are not significant, p > .05. Thus, a simple count of significant and non-significant results would reveal that many published results are false. The problem is that not all hypothesis tests are published and that significant results are more likely to be published than non-significant results. This bias in the selection of results is known as publication bias. Ioannidis (2005) called it researcher bias. As the amount of researcher bias is unknown, there is ample room to suggest that it is large enough to fit Ioannidis’s prediction that most published significant results are false positives.
The Missing Piece
Fifteen years after Ioannidis claimed that most published results are false, there have been few attempts to test this hypothesis empirically. One attempt was made by Jager and Leek (2014). This article made two important contributions. First, Jager and Leek created a program to harvest statistical results from abstracts in medical journals. Second, they developed a model to analyze the harvested p-values to estimate the percentage of false positive results in the medical literature. They ended up with an estimate of 14%, which is well below Ioannidis’s claim that over 50% of published results are false.
Ioannidis’s reply made it clear that a multi-million investment in his idea made it impossible to look at this evidence objectively. Clearly, his speculations based on no data must be right and an actual empirical test must be wrong, if it didn’t confirm his prediction. In science this is known as confirmation bias. Ironically, confirmation bias is one of the main obstacles that prevent science from making progress and correcting false beliefs.
Fortunately, there is a much easier way to test Ioannidis’s claim than Jager and Leek’s model that may have underestimated the false discovery risk. All we need to estimate the false discovery rate under the worst case scenario is a credible estimate of the discovery rate (i.e., the percentage of significant results). Once we know how many tests produced a positive result, we can compute the maximum false discovery rate using a simple formula developed by Soric (1989).
Maximum False Discovery Rate = (1/Discovery Rate – 1)*(.05/.95)
The only challenge is to find a discovery rate that is not inflated by publication bias. And that is where Nosek and the Open Science Foundation come in.
The Reproducibility Project
It has been known for decades that psychology has a publication bias problem. Sterling (1959) observed that over 90% of published results report a statistically significant result. This finding was replicated in 1995 (Sterling et al., 1995) and again in 2015, when a large team of psychologists replicated 100 studies and 97% of the original studies reported a statistically significant result (Open Science Collaboration, 2015).
Using Soric’s formula this would imply a false discovery rate of 0. However, the replication studies showed that this high discovery rate is inflated by publication bias. More important, the replication studies provide an unbiased estimate of the actual discovery rate in psychology. Thus, these results can be used to estimate the maximum false discovery rate in psychology, using Soric’s formula.
The headline finding of this article was that 36% (35/97) of the replication studies reproduced a significant result.
Using Soric’s formula, this implies a maximum (!) false discovery rate of 9%, which is well below the 50% predicted by Ioannidis. The difference is so large that no statistical test is needed to infer that Nosek’s results falsify Ioannidis’s claim.
Table 1 also shows the discovery rates for specific journals or research areas. The discovery rate for cognitive psychology in the journal Psychological Science is 53%, which implies a maximum FDR of 5%. For cognitive psychology published in the Journal of Experimental Psychology: Learning, Memory, and Cognition the DR of 48% implies a maximum FDR of 6%.
Things look worse for social psychology, which has also seen a string of major replication failures (Schimmack, 2020). However, even here we do not get false discovery rates over 50%. For social psychology published in Psychological Science, the discovery rate of 29% implies a maximum false discovery rate of 13%, and social psychology published in JPSP has a discovery rate of 23% and a maximum false discovery rate of 18%.
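The maximum false discovery rates quoted in this section follow directly from Soric’s formula and can be reproduced with a short sketch. The function name and dictionary labels are my own; the discovery rates are the ones reported above, and alpha is assumed to be .05. The sketch only reproduces the arithmetic and says nothing about whether these discovery rates are the right input.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Maximum FDR = (1/Discovery Rate - 1) * (alpha / (1 - alpha))."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))


# Discovery rates of the replication studies as reported in the text.
discovery_rates = {
    "All replications (35/97)": 35 / 97,
    "Cognitive, Psychological Science": 0.53,
    "Cognitive, JEP: LMC": 0.48,
    "Social, Psychological Science": 0.29,
    "Social, JPSP": 0.23,
}

for label, dr in discovery_rates.items():
    print(f"{label}: DR = {dr:.0%}, max FDR = {soric_max_fdr(dr):.0%}")
```

Even the lowest discovery rate in this list (23%) yields a bound well below 50%; the bound only crosses 50% when the discovery rate falls to about 10% or less.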
These results do not imply that everything is going well in social psychology, but they do show how unrealistic Ioannidis’s scenarios were that produced false discovery rates over 50%.
The Arnold foundation has funded major attempts to improve science. This is a laudable goal and I have spent the past 10 years working towards the same goal. Here I simply point out that one big successful initiative, the reproducibility project (Open Science Collaboration, 2015), produced valuable data that can be used to test a fundamental assumption in the open science movement, namely the fear that most published results are false. Using the empirical data from the Open Science Collaboration we find no empirical support for this claim. Rather the results are in line with Jager and Leek’s (2014) findings that strictly false results where the null-hypothesis is true are the exception rather than the norm.
This does not mean that everything is going well in science because rejecting the null-hypothesis is only a first step towards testing a theory. However, it is also not helpful to spread false claims about science that may undermine trust in science. “Most published results are false” is an eye-catching claim, but it lacks empirical support. In fact, it has been falsified in every empirical test that has been conducted. Ironically, the strongest empirical evidence based on actual replication studies comes from a project that used open science practices that would not have happened without Ioannidis’s alarmist claim. This shows the advantages of open science practices and implementing these practices remains a valuable goal even if most published results are not strictly false positives.
Many sciences, including psychology, rely on statistical significance to draw inferences from data. A widely accepted practice is to consider results with a p-value less than .05 as evidence that an effect occurred.
Hundreds of articles have discussed the problems of this approach, but few have offered attractive alternatives. As a result, very little has changed in the way results are interpreted and published in 2020.
Even if this would suddenly change, researchers still have to decide what they should do with the results that have been published so far. At present there are only two options. Either trust all results and hope for the best or assume that most published results are false and start from scratch. Trust everything or trust nothing are not very attractive options. Ideally, we would want to find a method that can sperate more credible findings from less credible ones.
One solution to this problem comes from molecular genetics. When it became possible to measure genetic variation across individuals, geneticists started correlating single variants with phenotypes (e.g., the serotonin transporter gene variation and neuroticism). These studies used the standard approach of declaring results with p-values below .05 as a discovery. Actual replication studies showed that many of these results could not be replicated. In response to these replication failures, the field moved towards genome-wide association studies that tested many genetic variants simultaneously. This further increased the risk of false discoveries. To avoid this problem, geneticists lowered the criterion for a significant finding. This criterion was not picked arbitrarily. Rather it was determined by estimating the false discovery rate or false discovery risk. The classic article that recommeded this approach has been cited over 40,000 times (Benjamin & Hochberg, 1995).
In genetics, a single study produces thousands of p-values that require a correction for multiple comparisons. Studies in other disciplines usually produce a much smaller number of p-values (typically fewer than 100). However, an entire scientific field also generates thousands of p-values. This makes it necessary to control for multiple comparisons and to lower the significance criterion from the nominal value of .05 to maintain a reasonably low false discovery rate.
The main difference between original studies in genomics and meta-analysis of studies in other fields is that publication bias can inflate the percentage of significant results. This leads to biased estimates of the actual false discovery rate (Schimmack, 2020).
One solution to this problem is to use selection models that take publication bias into account. Jager and Leek (2014) used this approach to estimate the false discovery rate in medical journals for statistically significant results, p < .05. In response to this article, Goodman (2014) suggested asking a different question.
What significance criterion would ensure a false discovery rate of 5%?
Although this is a useful question, selection models have not been used to answer it. Instead, recommendations for adjusting alpha have been based on ad-hoc assumptions about the number of true hypotheses that are being tested and power of studies.
For example, the false positive rate is greater than 33% with prior odds of 1:10 and a P value threshold of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005 would reduce this minimum false positive rate to 5% (D. J. Benjamin et al., 2017, p. 7).
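The arithmetic behind this quoted example can be checked directly. The following is a minimal sketch that assumes the best case of 100% power, which gives the minimum false positive rate for given prior odds; the function name and its arguments are my own illustration and are not taken from Benjamin et al.

```python
def min_false_positive_rate(alpha, prior_odds_true, power=1.0):
    """Share of significant results that are false positives.

    prior_odds_true is the odds of a true to a false hypothesis
    (1:10 is passed as 0.1). With power=1.0 the result is a lower
    bound, because lower power can only increase the share.
    """
    true_pos = power * prior_odds_true  # significant results from true hypotheses
    false_pos = alpha * 1.0             # significant results from false hypotheses
    return false_pos / (false_pos + true_pos)


# Compare the conventional criterion with the proposed stricter one.
for alpha in (0.05, 0.005):
    rate = min_false_positive_rate(alpha, prior_odds_true=1 / 10)
    print(f"alpha = {alpha}: minimum false positive rate = {rate:.1%}")
```

Note that the result depends entirely on the assumed prior odds, which is exactly the kind of ad-hoc assumption discussed above.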
Rather than relying on assumptions, it is possible to estimate the maximum false discovery rate based on the distribution of statistically significant p-values (Bartos & Schimmack, 2020).
Here, I illustrate this approach with p-values from 120 psychology journals for articles published between 2010 and 2019. An automated extraction of test-statistics found 670,055 useable test-statistics. All test-statistics were converted into absolute z-scores that reflect the amount of evidence against the null-hypothesis.
Figure 1 shows the distribution of the absolute z-scores. The first notable observation is the drop (from right to left) in the distribution right at the standard level for statistical significance, p < .05 (two-tailed), which corresponds to a z-score of 1.96. This drop reveals publication bias. The amount of bias is reflected in a comparison of the observed discovery rate and the estimated discovery rate. The observed discovery rate of 67% is simply the percentage of p-values below .05. The estimated discovery rate is the percentage of significant results based on the z-curve model that is fitted to the significant results (grey curve). The estimated discovery rate is only 38% and the 95% confidence interval around this estimate, 32% to 49%, does not include the observed discovery rate. This shows that significant results are more likely to be reported and that non-significant results are missing from published articles.
If we would use the observed discovery rate of 67%, we would underestimate the risk of false positive results. Using Soric’s (1989) formula,
FDR = (1/DR – 1)*(.05/.95)
a discovery rate of 67% implies a maximum false discovery rate of 3%. Thus, no adjustment to the significance criterion would be needed to maintain a false discovery rate below 5%.
However, publication bias is present and inflates the discovery rate. To adjust for this, we can use the estimated discovery rate of 38% and get a maximum false discovery rate of 9%. As this value exceeds the desired number of false discoveries, we need to lower alpha to reduce the false discovery rate.
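The two values above can be reproduced with a few lines of code. This is a minimal sketch of Soric's formula as written above; the function name is my own and alpha = .05 is assumed as the default.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate.

    The bound assumes that all true hypotheses are tested with 100% power,
    so the actual false discovery rate can only be lower.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))


# Observed discovery rate (inflated by publication bias)
print(f"DR = 67%: max FDR = {soric_max_fdr(0.67):.0%}")
# Bias-corrected (estimated) discovery rate
print(f"DR = 38%: max FDR = {soric_max_fdr(0.38):.0%}")
```

The same function can be applied to any of the discovery rates reported in this post.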
Figure 2 shows the results when alpha is set to .005 (z = 2.80) as recommended by Benjamin et al. (2017). The model is only fitted to data that are significant with this new criterion. We now see that the observed discovery rate (44%) is even lower than the estimated discovery rate (49%), although the difference is not significant. Thus, there is no evidence of publication bias with this new criterion for significance. The reason is that many questionable practices that are used to report significant results produce just significant results. This is seen in the excess of just significant results between z = 2 and z = 2.8. These results no longer inflate the discovery rate because they are no longer counted as discoveries. We also see that the estimated discovery rate produces a maximum false discovery rate of 6%, which may be close enough to the desired level of 5%.
Another piece of useful information is the estimated replication rate (ERR). This is the average power of results that are significant with p < .005 as criterion. Although lowering the alpha level decreases power, the average power of 66% suggests that many results should replicate successfully in exact replication studies with the same sample size. Increasing sample sizes could help to achieve 80% power.
In conclusion, we can use the distribution of p-values in the psychological literature to evaluate published findings. Based on the present results, readers of published articles could use p < .005 (rule of thumb: z > 2.8, t > 3, or chi-square > 9, F > 9) to evaluate statistical evidence.
The empirical approach to justify alpha with FDRs has the advantage that it can be adjusted for different literatures. This is illustrated with the Attitudes and Social Cognition section of JPSP. Social cognition research has experienced a replication crisis due to massive use of questionable research practices. It is possible that even alpha = .005 is too liberal for this research area.
Figure 3 shows the results for test statistics published in JPSP-ASC from 2000 to 2020.
There is clear evidence of publication bias (ODR = 71%, EDR = 31%). Based on the EDR of 31%, the maximum false discovery rate is 11%, well above the desired level of 5%. Even the 95%CI around the FDR does not include 5%. Thus, it is necessary to lower the alpha criterion.
Using p = .005 as criterion improves things, but not fully. First, a comparison of the ODR and EDR suggests that publication bias was not fully removed, 43% vs. 35%. Second, the EDR of 35% still implies a maximum FDR of 10%, although the 95%CI now touches 5%, but also has 35% as the upper limit. Thus, even with p = .005, the social cognition literature is not credible.
Lowering the criterion further does not solve this problem. The reason is that there are now so few significant results that the discovery rate remains low. This is shown in the next figure where the criterion is set to p < .0005 (z = 3.5). The model cannot be fitted to z-scores so extreme because there is insufficient information about lower power studies. Thus, the model was fitted to z-scores greater than 2.8 (p < .005). In this scenario, the expected discovery rate is 27%, which implies a maximum false discovery rate of 14%, and the 95%CI still does not include 5%.
These results illustrate the problem of conducting many studies with low power. The false discovery risk remains high because there are only few test statistics with extreme values and a few extreme test statistics are expected by chance.
In short, setting alpha to .005 is still too liberal for this research area. Given the ample replication failures in social cognition research, most results cannot be trusted. This conclusion is also consistent with the actual replication rate in the Open Science Collaboration (2015) project that could only replicate 7/31 (23%) results. With a discovery rate of 23%, the maximum false discovery rate is 18%. This is still way below Ioannidis’s claim that most published results are false positives, but it is also well above 5%.
Different results are expected for the Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP-LMC). Here the OSC project was able to replicate 13/27 (48%) results. A discovery rate of 48% implies a maximum false discovery rate of 6%. Thus, no adjustment to the alpha level may be needed for this journal.
Figure 6 shows the results for the z-curve analysis of test statistics published from 2000 to 2020. There is evidence of publication bias. The ODR of 67% is outside the 95%CI of the EDR 45%, 95%CI = . However, with an EDR of 45%, the maximum FDR is 7%. This is close to the estimate based on the OSC results and close to the desired level of 5%.
For this journal it was sufficient to set the alpha criterion to p < .03. This produced a fairly close match between the ODR (61%) and EDR (58%) and a maximum FDR of 4%.
Significance testing was introduced by Fisher, 100 years ago. He would recognize the way scientists analyze their data because not much has changed. Over the past 100 years, many statisticians and practitioners have pointed out problems with this approach, but no practical alternatives have been offered. Adjusting the significance criterion depending on the research question is one reasonable modification, but often requires more a priori knowledge than researchers have (Lakens et al., 2018). Lowering alpha makes sense when there is a concern about too many false positive results, but can be a costly mistake when false positive results are fewer than feared (Benjamin et al., 2017). Here I presented a solution to this problem. It is possible to use the maximum false-discovery rate to pick alpha so that the percentage of false discoveries is kept at a reasonable minimum.
Even if this recommendation does not influence the behavior of scientists or the practices of journals, it can be helpful to compute alpha values that ensure a low false discovery rate. At present, consumers of scientific research (mostly other scientists) are used to treating all significant results with p-values less than .05 as discoveries. Literature reviews mention studies with p = .04 as if they have the same status as studies with p = .000001. Once a p-value crosses the magic .05 level, it becomes a solid fact. This is wrong because statistical significance alone does not ensure that a finding is a true positive. To avoid this fallacy, consumers of research can do their own adjustment to the alpha level. Readers of JEP:LMC may use .05 or .03 because this alpha level is sufficient. Readers of JPSP-ASC may lower alpha to .001.
Once readers demand stronger evidence from journals that publish weak evidence, researchers may actually change their practices. As long as consumers buy every p-value less than .05, there is little incentive for producers of p-values to try harder to produce stronger evidence, but when consumers demand p-values below .005, supply will follow. Unfortunately, consumers have been gullible and it was easy to sell them results that do not replicate with a p < .05 warranty because they had no rational way to decide which p-values they should trust. Maintaining a reasonably low false discovery rate has proved useful in genomics; it may also prove useful for other sciences.
In 2005, Ioannidis wrote an influential article with the title “Why most published research findings are false.” This article has been widely cited by scientists and in the popular media as evidence that we cannot trust scientific results (The Atlantic).
It is often overlooked that Ioannidis’s big claim was not supported by empirical evidence. It rested entirely on hypothetical examples. The problem with big claims that are based on intuition rather than empirical observations is that they can induce confirmation bias. Just like original researchers with their pet theories, Ioannidis was no longer an objective meta-scientist who could explore how often science is wrong. He had to go out and find evidence to support his claim. And that is what he did.
In 2017, Denes Szucs and John P. A. Ioannidis published an article that examined the risk of false positive results in cognitive neuroscience and psychology. The abstract suggests that the empirical results support Ioannidis’s claim that most published results are false positives.
“We conclude that more than 50% of published findings deemed to be statistically significant are likely to be false.”
The authors shared their data, which made it possible for me to verify this conclusion using my own statistical method that can be used to assess the maximum false positive rate (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). I first used the information about t-values and their degrees of freedom to compute absolute z-scores. Z-scores have the advantage that they all have the same sampling distribution, so the values provide standardized information about the strength of evidence against the null-hypothesis. The distribution of the absolute z-scores was then analyzed using zcurve.2.0 (Bartos & Schimmack, 2020).
Figure 1 shows the results with the assumption that there is no publication bias. As a result, both non-significant and significant results are fitted. Visual inspection shows some evidence that there are too many significant results, especially those that just reached significance (z > 1.96 corresponds to p = .05, two-tailed). There are also too few results that just missed significance or are sometimes considered to be marginally significant (p < .10, z > 1.65). This pattern suggests that researchers used questionable research practices to present marginally significant results as significant. However, in the big picture of all tests, this bias is relatively small. The observed discovery rate of 64% is only slightly higher than the expected discovery rate of 60%. This is a small amount of inflation and even with this large sample size, the deviation is not statistically significant (i.e., 64% is within the 95%CI of the EDR from 55% to 66%).
Szucs and Ioannidis also create a scenario without researcher bias and still conclude that most published results are false.
“For example, if we consider the recent estimate of 13:1 H0:H1 odds, then FRP exceeds 50% even in the absence of bias.”
Figure 1 shows that the assumption is totally incompatible with the data. A model that assumes no bias has a discovery rate of 60%, and a discovery rate of 60% implies that no more than 3% of significant results can be false positives (Soric, 1989). Even the upper limit of the 95% CI is only 4% false discoveries. Thus, empirical data clearly falsify Szucs and Ioannidis’s wild guess that psychologists test only 7% true hypotheses. Even actual replication studies have produced 37% significant results, which puts the rate of true hypotheses at a minimum of 37% (OSC, 2015). Thus, the conclusion in the abstract is based on false assumptions and not on an unbiased examination of the data.
Despite the small amount of bias in Figure 1, it is likely that some researcher bias is present. It is therefore reasonable to see what happens when a model allows for researcher bias. To do so, z-curve can be fitted only to the distribution of significant results and correct for the selection for significance. These results are shown in Figure 2.
This model shows clearer evidence of selection for significance. The expected discovery rate is 42% and the 95% CI, 24% to 52%, does not include the observed discovery rate of 64%. It is therefore safe to assume that publication bias inflates the observed discovery rate. However, even with a discovery rate of 42%, the maximum false discovery rate is only 7%, and even if we use the lower bound of the 95%CI of the EDR, 24%, the false discovery rate is only 17%, which is still well below the 50% level needed to support Ioannidis’s famous claim that most published results are false.
In short, an objective assessment of Ioannidis’s own data falsifies his claim that most published results are false positives. So, how did he end up concluding that the data support his claim?
To make any claims about the false discovery rate, the authors had to make several assumptions because their model did not estimate the actual power of studies and did not measure the actual amount of bias. Thus, all Ioannidis had to do was to adjust the assumptions to fit the data. As in 2005, Ioannidis then presents these speculations as if they are empirical facts.
Non-scientists may be surprised that somebody can get away with such big claims that are not supported by evidence. After all, scientific articles are peer-reviewed. However, insiders are well aware that peer-review is an imperfect method of quality control. Still, it is amazing that Ioannidis has been getting away with his bold claim that undermines trust in science for so long. Science is not perfect, and Ioannidis is a perfect example of the staying power of false claims, but science is still the best way to search for truth and solutions. Fortunately, Ioannidis was wrong about science. Science needs improvement, but it has produced many important and robust findings such as the discovery of highly effective vaccines against Covid-19. We should not blindly trust science. Instead, we need to examine the data and the assumptions underlying scientific claims, including meta-scientific ones. When we do this, it turns out that Ioannidis’s fight against researcher bias is based on a biased assessment of bias.
John P. A. Ioannidis is a rock star in the world of science (wikipedia).
By traditional standards of science, he is one of the most prolific and influential scientists alive. He has published over 1,000 articles that have been cited over 100,000 times.
He is best known for the title of his article “Why most published research findings are false” that has been cited nearly 5,000 times. The irony of this title is that it may also apply to Ioannidis, especially because there is a trade-off between quality and quantity in publishing.
Fact Checking Ioannidis
The title of Ioannidis’s article implies a factual statement: “Most published results ARE false.” However, the actual article does not contain empirical data to support this claim. Rather, Ioannidis presents some hypothetical scenarios that show under what conditions published results MAY BE false.
To produce mostly false findings, a literature has to meet two conditions.
First, it has to test mostly false hypotheses. Second, it has to test hypotheses in studies with low statistical power, that is a low probability of producing true positive results.
To give a simple example, imagine a field that tests only 10% true hypotheses with just 20% power. As power predicts the percentage of true discoveries, only 2 out of the 10 true hypotheses will be significant. Meanwhile, the alpha criterion of 5% implies that 5% of the false hypotheses will also produce a significant result. Thus, 4.5 of the 90 false hypotheses will also produce a significant result. As a result, there will be about two times more false positives (4.5 over 100) than true positives (2 over 100).
These relatively simple calculations were well known by 2005 (Soric, 1989). Thus, why did Ioannidis’s article have such a big impact? The answer is that Ioannidis convinced many people that his hypothetical examples are realistic and describe most areas in science.
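The hypothetical scenario can be written out as a short calculation. This is a minimal sketch with a function name of my own choosing; the inputs (10% true hypotheses, 20% power, alpha = 5%) are the ones from the example.

```python
def significant_results(share_true, power, alpha=0.05):
    """Return (true positives, false positives, false discovery rate)
    as proportions of all tested hypotheses."""
    true_pos = share_true * power         # true hypotheses that reach significance
    false_pos = (1 - share_true) * alpha  # false hypotheses that reach significance
    fdr = false_pos / (true_pos + false_pos)
    return true_pos, false_pos, fdr


true_pos, false_pos, fdr = significant_results(share_true=0.10, power=0.20)
print(f"true positives per 100 tests:  {100 * true_pos:.1f}")
print(f"false positives per 100 tests: {100 * false_pos:.1f}")
print(f"share of significant results that are false: {fdr:.0%}")
```

Raising either the share of true hypotheses or the power quickly pushes the false discovery rate below 50%, which is why both conditions are needed for the claim to hold.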
2020 has shown that Ioannidis’s claim does not apply to all areas of science. With amazing speed, bio-tech companies were able to make not just one but several successful vaccines with high effectiveness. Clearly some sciences are making real progress. On the other hand, other areas of science suggest that Ioannidis’s claims were accurate. For example, the whole literature on single-gene variations as predictors of human behavior has produced mostly false claims. Social psychology has a replication crisis where only 25% of published results could be replicated (OSC, 2015).
Aside from this sporadic and anecdotal evidence, it remains unclear how many false results are published in science as a whole. The reason is that it is impossible to quantify the number of false positive results in science. Fortunately, it is not necessary to know the actual rate of false positives to test Ioannidis’s prediction that most published results are false positives. All we need to know is the discovery rate of a field (Soric, 1989). The discovery rate makes it possible to quantify the maximum percentage of false positive discoveries. If the maximum false discovery rate is well below 50%, we can reject Ioannidis’s hypothesis that most published results are false.
The empirical problem is that the observed discovery rate in a field may be inflated by publication bias. It is therefore necessary to estimate the amount of publication bias and, if bias is present, to correct the discovery rate.
In 2005, Ioannidis and Trikalinos (2005) developed their own test for publication bias, but this test had a number of shortcomings. First, it could be biased in heterogeneous literatures. Second, it required effect sizes to compute power. Third, it only provided information about the presence of publication bias and did not quantify it. Fourth, it did not provide bias-corrected estimates of the true discovery rate.
When the replication crisis became apparent in psychology, I started to develop new bias tests that address these limitations (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020; Schimmack, 2012). The newest tool, called z-curve.2.0 (and yes, there is an app for that), overcomes all of the limitations of Ioannidis’s approach. Most important, it makes it possible to compute a bias-corrected discovery rate that is called the expected discovery rate. The expected discovery rate can be used to examine and quantify publication bias by comparing it to the observed discovery rate. Moreover, the expected discovery rate can be used to compute the maximum false discovery rate.
The data were compiled by Simon Schwab from the Cochrane database (https://www.cochrane.org/) that covers results from thousands of clinical trials. The data are publicly available (https://osf.io/xjv9g/) under a CC-By Attribution 4.0 International license (“Re-estimating 400,000 treatment effects from intervention studies in the Cochrane Database of Systematic Reviews”; see also van Zwet, Schwab, & Senn, 2020).
Studies often report results for several outcomes. I selected only results for the primary outcome. It is often suggested that researchers switch outcomes to produce significant results. Thus, primary outcomes are the most likely to show evidence of publication bias, while secondary outcomes might even be biased to show more negative results for the same reason. The choice of primary outcomes also ensures that the test statistics are statistically independent because they are based on independent samples.
I first fitted the default model to the data. The default model assumes that publication bias is present and only uses statistically significant results to fit the model. Z-curve.2.0 uses a finite mixture model to approximate the observed distribution of z-scores with a limited number of non-centrality parameters. After finding optimal weights for the components, power can be computed as the weighted average of the implied power of the components (Bartos & Schimmack, 2020). Bootstrapping is used to compute 95% confidence intervals that have shown to have good coverage in simulation studies (Bartos & Schimmack, 2020).
The main finding with the default model is that the model (grey curve) fits the observed distribution of z-scores very well in the range of significant results. However, z-curve has problems extrapolating from significant results to the distribution of non-significant results. In this case, the model (grey curve) underestimates the amount of non-significant results. Thus, there is no evidence of publication bias. This is seen in a comparison of the observed and expected discovery rates. The observed discovery rate of 26% is lower than the expected discovery rate of 38%.
When there is no evidence of publication bias, there is no reason to fit the model only to the significant results. Rather, the model can be fitted to the full distribution of all test statistics. The results are shown in Figure 2.
The key finding for this blog post is that the estimated discovery rate of 27% closely matches the observed discovery rate of 26%. Thus, there is no evidence of publication bias. In this case, simply counting the percentage of significant results provides a valid estimate of the discovery rate in clinical trials. Roughly one-quarter of trials end up with a positive result. The new question is how many of these results might be false positives.
To maximize the rate of false positives, we have to assume that true positives were obtained with maximum power (Soric, 1989). In this scenario, we could get as many as 14% (4 over 27) false positive results.
Even if we use the upper limit of the 95% confidence interval, we only get 19% false positives. Moreover, it is clear that Soric’s (1989) scenario overestimates the false discovery rate because it is unlikely that all tests of true hypotheses have 100% power.
In short, an empirical test of Ioannidis’s hypothesis that most published results in science are false shows that this claim is at best a wild overgeneralization. It is not true for clinical trials in medicine. In fact, the real problem is that many clinical trials may be underpowered to detect clinically relevant effects. This can be seen in the estimated replication rate of 61%, which is the mean power of studies with significant results. This estimate of power includes false positives with 5% power. If we assume that 14% of the significant results are false positives, the conditional power based on a true discovery is estimated to be 70% (.14 * .05 + .86 * .70 = .61).
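The conditional power can be obtained by solving the mixture equation above for the power of true discoveries. The sketch below uses a function name of my own and assumes, as in the text, that false positives have a "power" equal to alpha (5%).

```python
def power_of_true_discoveries(mean_power, fdr, alpha=0.05):
    """Solve mean_power = fdr * alpha + (1 - fdr) * power_true for power_true."""
    if not 0 <= fdr < 1:
        raise ValueError("fdr must be in [0, 1)")
    return (mean_power - fdr * alpha) / (1 - fdr)


# Estimated replication rate of 61% and an assumed 14% false positives
power_true = power_of_true_discoveries(mean_power=0.61, fdr=0.14)
print(f"power conditional on a true discovery: {power_true:.0%}")
```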
With information about power, we can modify Soric’s worst case scenario and change power from 100% to 70%. This has only a small influence on the false positive discovery rate that decreases to 11% (3 over 27). However, the rate of false negatives increases from 0 to 14% (10 over 74). This also means that there are now three-times more false negatives than false positives (10 over 3).
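The modified scenario can be sketched as follows. The function name is my own, and the calculation assumes that all true hypotheses are tested with the same power. Because the code works with unrounded proportions, its output can differ by about a percentage point from the rounded counts given above.

```python
def error_rates(discovery_rate, power, alpha=0.05):
    """Generalize Soric's scenario to an arbitrary power for true hypotheses.

    Solves discovery_rate = share_true * power + (1 - share_true) * alpha
    and returns (share_true, false discovery rate, false negative rate),
    where the false negative rate is the share of non-significant results
    that come from true hypotheses.
    """
    share_true = (discovery_rate - alpha) / (power - alpha)
    false_pos = (1 - share_true) * alpha
    false_neg = share_true * (1 - power)
    fdr = false_pos / discovery_rate
    fnr = false_neg / (1 - discovery_rate)
    return share_true, fdr, fnr


for power in (1.00, 0.70):
    share_true, fdr, fnr = error_rates(discovery_rate=0.27, power=power)
    print(f"power = {power:.0%}: true hypotheses = {share_true:.0%}, "
          f"false positives = {fdr:.0%}, false negatives = {fnr:.0%}")
```

With power set to 100%, the function reproduces Soric's maximum false discovery rate, so it is a direct extension of the worst case scenario rather than a different model.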
Even this scenario overestimates power of studies that produced false negative results because power of studies with significant results is higher than power of studies that produced non-significant results when power is heterogeneous (Brunner & Schimmack, 2020). In the worst case scenario, the null-hypothesis may rarely be true and power of studies with non-significant results could be as low as 14.5%. To explain, if we redo all of the studies, we expect that 61% of the significant studies produce a significant result again, producing 16.5% significant results. We also expect that the discovery rate will be 27% again. Thus, the remaining 73% of studies have to make up the difference between 27% and 16.5%, which is 10.5%. For 73 studies to produce 10.5 significant results, the studies have to have 14.5% power: 27 = 27 * .61 + 73 * .145.
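The same reasoning can be expressed as a short calculation. This is a minimal sketch with a function name of my own; it assumes that a full replication of all studies would reproduce the same overall discovery rate. The unrounded result is slightly below the 14.5% obtained from the rounded values in the text.

```python
def power_of_nonsignificant(discovery_rate, replication_rate):
    """Mean power that previously non-significant studies must have so that
    a full replication reproduces the same overall discovery rate."""
    # Significant results contributed by the originally significant studies
    from_significant = discovery_rate * replication_rate
    # Significant results that the originally non-significant studies must add
    from_nonsignificant = discovery_rate - from_significant
    return from_nonsignificant / (1 - discovery_rate)


power_ns = power_of_nonsignificant(discovery_rate=0.27, replication_rate=0.61)
print(f"implied power of non-significant studies: {power_ns:.1%}")
```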
In short, while Ioannidis predicted that most published results are false positives, it is much more likely that most published results are false negatives. This problem is of course not new. To make conclusions about effectiveness of treatments, medical researchers usually do not rely on a single clinical trial. Rather results of several studies are combined in a meta-analysis. As long as there is no publication bias, meta-analyses of original studies can boost power and reduce the risk of false negative results. It is therefore encouraging that the present results suggest that there is relatively little publication bias in these studies. Additional analyses for subgroups of studies can be conducted, but are beyond the main point of this blog post.
Ioannidis wrote an influential article that used hypothetical scenarios to make the prediction that most published results are false positives. Although this article is often cited as if it contained evidence to support this claim, the article contained no empirical evidence. Surprisingly, there also have been few attempts to test Ioannidis’s claim empirically. Probably the main reason is that nobody knew how to test it. Here I showed a way to test Ioannidis’s claim and I presented clear empirical evidence that contradicts this claim in Ioannidis’s own field of science, namely medicine.
The main feature that distinguishes science and fiction is not that science is always right. Rather, science is superior because proper use of the scientific method allows for science to correct itself, when better data become available. In 2005, Ioannidis had no data and no statistical method to prove his claim. Fifteen years later, we have good data and a scientific method to test his claim. It is time for science to correct itself and to stop making unfounded claims that science is more often wrong than right.
The danger of not trusting science has been on display this year, where millions of Americans ignored good scientific evidence, leading to the unnecessary death of many US Americans. So far, 330,000 US Americans are estimated to have died of Covid-19. In a similar country like Canada, 14,000 Canadians have died so far. To adjust for population, we can compare the number of deaths per million, which is 1000 in the USA and 400 in Canada. The unscientific approach to the pandemic in the US may explain some of this discrepancy. Along with the development of vaccines, it is clear that science is not always wrong and can save lives. Ioannidis (2005) made unfounded claims that success stories are the exception rather than the norm. At least in medicine, intervention studies show real successes more often than false ones.
The Covid-19 pandemic also provides another example where Ioannidis used off-the-cuff calculations to make big claims without any evidence. In a popular article titled “A fiasco in the making” he speculated that the Covid-19 virus might be less deadly than the flu and suggested that policies to curb the spread of the virus were irrational.
As the evidence accumulated, it became clear that the Covid-19 virus is claiming many more lives than the flu, despite policies that Ioannidis considered to be irrational. Scientific estimates suggest that Covid-19 is 5 to 10 times more deadly than the flu (BNN), not less deadly as Ioannidis implied. Once more, Ioannidis’s quick, unempirical claims were contradicted by hard evidence. It is not clear how many of his other 1,000 plus articles are equally questionable.
To conclude, Ioannidis should be the last one to be surprised that several of his claims are wrong. Why should he be better than other scientists? The question is only how he deals with this information. However, for science it is not important whether scientists correct themselves. Science corrects itself by replacing old, false information with better information. One question is what science does with false and misleading information that is highly cited.
If YouTube can remove a video with Ioannidis’s false claims about Covid-19 (WP), maybe PLOS Medicine can retract an article with the false claim that “most published results in science are false”.
The attention-grabbing title is simply misleading because nothing in the article supports the claim. Moreover, actual empirical data contradict the claim at least in some domains. Most claims in science are not false and in a world with growing science skepticism spreading false claims about science may be just as deadly as spreading false claims about Covid-19.
If we learned anything from 2020, it is that science and democracy are not perfect, but a lot better than superstition and demagogy.
Social psychologists, among others, have misused the scientific method. Rather than using it to separate false from true hypotheses, they used statistical tests to find and report statistically significant results. The main problem with the search for significance is that significant results are not automatically true discoveries. The probability that a selected significant result is a true discovery also depends on the power of statistical tests to detect a true finding. However, social psychologists have ignored power and often selected significant results from studies with lower power. In this case, significance is often more due to chance than a real effect and the results are difficult to replicate. A shocking finding revealed that less than 25% of results in social psychology could be replicated (OSC, 2015). This finding has been widely cited outside of social psychology, but social psychologists have preferred to ignore the implication that most of their published results may be false (Schimmack, 2020).
Some social psychologists have responded to this replication crisis by increasing power and reporting non-significant results as evidence that effects are small and negligible (e.g., Lai et al., 2014, 2016). However, others continue to use the same old practices. This creates a problem. While the average credibility of social psychology has increased, readers do not know whether they are reading an article that used the scientific method properly or improperly.
One solution to this problem is to examine the strength of the reported statistical results. Strong statistical results are more credible than weak statistical results. Thus, the average strength of the statistical results provides useful information about the credibility of individual articles. I demonstrate this approach with two articles from 2020 in the Attitudes and Social Cognition section of the Journal of Personality and Social Psychology (JPSP-ASC).
Before I examine individual articles, I present results for the entire journal based on automatic extraction of test-statistics for the years 2010 (pre-crisis) and 2020 (post-crisis).
Figure 1 shows the results for 2010. All test-statistics are first converted into p-values and then transformed into absolute z-scores. The higher the z-score, the stronger is the evidence against the null-hypothesis. The figure shows the mode of the distribution of z-scores at a value of 2, which coincides with the criterion for statistical significance (p = .05, two-tailed, z = 1.96). Fitting a model to the distribution of the significant z-scores, we would expect an even higher mode in the region of non-significant results. However, the actual distribution shows a sharp drop in reported z-scores. This pattern shows the influence of selection for significance.
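The second step of this conversion, from two-tailed p-values to absolute z-scores, can be sketched with Python's standard library. This is only an illustration and not the extraction program used for the figures; turning t- or F-values into p-values first would require a statistics library and is not shown.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_z(p_two_tailed: float) -> float:
    """Convert a two-tailed p-value (0 < p <= 1) into an absolute z-score."""
    # Using the lower tail (p / 2) avoids rounding 1 - p / 2 to exactly 1
    # when p is very small.
    return abs(STANDARD_NORMAL.inv_cdf(p_two_tailed / 2))


for p in (0.05, 0.01, 0.001):
    print(f"p = {p} -> z = {p_to_z(p):.2f}")
```

A p-value of .05 maps onto z = 1.96, the significance criterion marked in the figure.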
The amount of publication bias is quantified by a comparison of the observed discovery rate (ODR; i.e., the percentage of reported tests with significant results) and the expected discovery rate (EDR), which is the area of the grey curve for z-scores greater than 1.96. The ODR of 73% is much higher than the EDR of 15%. The fact that the confidence intervals for these two estimates do not overlap shows clear evidence of selection for significance in JPSP-ASC in 2010.
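The observed discovery rate is simple to compute once all tests are expressed as z-scores. The sketch below uses a made-up list of z-scores for illustration. The expected discovery rate cannot be computed this way because it comes from the model fitted to the significant z-scores, which is not reproduced here.

```python
def observed_discovery_rate(z_scores, criterion: float = 1.96) -> float:
    """Proportion of reported tests that are significant (|z| > criterion)."""
    significant = sum(1 for z in z_scores if abs(z) > criterion)
    return significant / len(z_scores)


# Hypothetical z-scores, not taken from any article.
example_z = [0.8, 1.5, 2.1, 2.3, 2.0, 2.6, 3.4, 1.2, 2.2, 4.1]
print(f"ODR = {observed_discovery_rate(example_z):.0%}")
```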
An EDR of 15% also implies that most statistical tests are extremely underpowered. Thus, even if there is an effect, it is unlikely to be significant. More relevant is the replication rate, which is the average power of results that were significant. As power determines the outcome of exact replication studies, the replication rate of 60% implies that 60% of published results are expected to be replicable in exact replication studies. However, observed effect sizes are expected to shrink and it is unclear whether the actual effect sizes are practically meaningful or would exceed the typical level of a small effect size (i.e., 0.2 standard deviations or 1% explained variance).
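The equivalence between 0.2 standard deviations and about 1% explained variance can be checked with the common conversion from a standardized mean difference d to a correlation r. The formula below assumes two groups of equal size.

```python
from math import sqrt


def d_to_r(d: float) -> float:
    """Convert a standardized mean difference into a correlation.

    Assumes two groups of equal size.
    """
    return d / sqrt(d ** 2 + 4)


r = d_to_r(0.2)
print(f"d = 0.2 -> r = {r:.3f}, explained variance = {r ** 2:.1%}")
```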
In short, Figure 1 visualizes incorrect use of the scientific method that capitalizes more on chance than on actual effects.
The good news is that research practices in social psychology have changed, as seen in Figure 2.
First, reporting of results is much less deceptive. The observed discovery rate of 73% is close to the expected discovery rate of 72%. However, visual inspection of the two curves shows a small dip for results that are marginally significant (z = 1.5 to 2) and a slight excess for just significant results (z = 2 to 2.2). Thus, some selection may still happen in some articles.
Another sign of improvement is that the EDR of 72% in 2020 is much higher than the EDR of 15% in 2010. This shows that social psychologists have dramatically improved the power of their studies. This is largely due to the move from small undergraduate samples to larger online samples.
The replication rate of 85% implies that most published results in 2020 are replicable. Even if exact replications are difficult, the EDR of 72% still suggests rather high replicability (see Bartos & Schimmack, 2020, for a discussion of EDR vs. ERR to predict actual replication results).
Despite this positive trend, it is possible that individual articles are less credible than the average results suggest. This is illustrated with the article by Leander et al. (2020).
This article was not picked at random. There are several cues that suggested the results of this article may be less credible than other results. First, Wolfgang Stroebe, one of the co-authors, has been an outspoken defender of the old unscientific practices in social psychology (Stroebe & Strack, 2014). Thus, it was interesting to see whether somebody who so clearly defended bad practices would have changed. This is of course a possibility because it is not clear how much influence Stroebe had on the actual studies. Another reason to be skeptical about this article is that it used priming as an experimental manipulation, although priming has been identified as a literature with many replication failures. The authors cite old priming studies as if there is no problem with these manipulations. Thus, it was interesting to see how credible these new priming results would be. Finally, the article reported many studies and it was interesting to see how the authors addressed the problem that the risk of a non-significant result increases with each additional study (Schimmack, 2012).
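The last point can be illustrated with a short calculation. The sketch assumes independent studies that all have the same power; the value of 80% is a hypothetical choice and not an estimate for this article.

```python
def prob_all_significant(power: float, n_studies: int) -> float:
    """Probability that every one of n independent studies is significant."""
    return power ** n_studies


# Hypothetical power of 80% for each study.
for k in range(1, 8):
    print(f"{k} studies: P(all significant) = {prob_all_significant(0.80, k):.2f}")
```

Even with 80% power per study, the chance that seven studies in a row all produce significant results is only about 21%. A long series of uniformly significant results is therefore itself a reason to look more closely.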
I first used the automatically extracted test-statistics for this article. The program found 51 test-statistics. The results are different from the z-curve for all articles in 2020.
Visual inspection shows a peak of p-values that are just significant. The comparison of the ODR of 65% and the EDR of 14% suggests selection for significance. However, even if we just focus on the significant results, the replication rate is low with just 38%, compared to the 85% average for 2020.
I also entered all test-statistics by hand. There were more test-statistics because I was able to use exact p-values and confidence intervals, which are not used by the automated procedure.
The results are very similar, showing that automatically extracted values are useful if an article reports results mostly in terms of F- and t-values in the text.
The low power of significant results creates a problem for focal hypothesis tests in a series of studies. This article included 7 studies (1a, 1b, 1c, 2, 3, 4, 5) and reported significant results for all of them, ps = 0.004, 0.007, 0.014, 0.020, 0.041, 0.033, and 0.002. This 100% success rate is higher than the average observed power of these studies, 70%. Average observed power overestimates real power when results are selected for significance. A simple correction is to subtract the inflation rate (100% – 70% = 30%) from the mean observed power. This index is called the Replication Index, and an R-Index of 40% shows that studies were underpowered and that a replication study with the same sample size is more likely to produce a non-significant result than a significant one.
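The arithmetic can be retraced from the seven reported p-values. The sketch below assumes that observed power is obtained by converting each two-tailed p-value into a z-score and asking how likely a result at least as large as the significance criterion would be if that z-score were the true value. It ignores the negligible chance of a significant result in the opposite direction.

```python
from statistics import NormalDist, mean

STANDARD_NORMAL = NormalDist()

# Focal p-values reported for Studies 1a, 1b, 1c, 2, 3, 4, and 5.
P_VALUES = [0.004, 0.007, 0.014, 0.020, 0.041, 0.033, 0.002]


def observed_power(p: float, alpha: float = 0.05) -> float:
    """Observed power implied by a two-tailed p-value."""
    z_observed = -STANDARD_NORMAL.inv_cdf(p / 2)
    z_criterion = -STANDARD_NORMAL.inv_cdf(alpha / 2)  # 1.96 for alpha = .05
    return STANDARD_NORMAL.cdf(z_observed - z_criterion)


def r_index(p_values, alpha: float = 0.05):
    """Return mean observed power, success rate, and the R-Index."""
    mean_power = mean([observed_power(p, alpha) for p in p_values])
    success_rate = mean([1.0 if p < alpha else 0.0 for p in p_values])
    inflation = success_rate - mean_power
    return mean_power, success_rate, mean_power - inflation


mean_power, success_rate, index = r_index(P_VALUES)
print(f"mean observed power = {mean_power:.0%}")
print(f"success rate        = {success_rate:.0%}")
print(f"R-Index             = {index:.0%}")
```

Under these assumptions the calculation gives a mean observed power of roughly 70% and an R-Index of about 40%, in line with the values reported above.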
A z-curve analysis produces a similar estimate, but also shows that these estimates are very unstable and that replicability could be as low as 5%. Because 5% is the rate of significant results expected by chance alone with alpha = .05, this would mean there is no effect. Thus, after taking selection for significance into account, the 7 significant p-values in Leander et al.’s (2020) article provide as much evidence for their claims as Bem’s (2011) 9 significant p-values did for the claim that priming effects can work when the prime FOLLOWS the behavior.
Judd and Gawronski (2011) argued that they had to accept Bem’s crazy article because (a) it passed critical peer-review and (b) they had to trust the author that results were not selected for significance. Nothing has changed in JPSP-ASC. The only criteria for acceptance are peer-review and trust. Bias tests that evaluate whether results are actually credible are not used by peer-reviewers or editors. Thus, readers have to carry out these tests themselves to protect themselves from fake science like Leander et al.’s (2020) priming studies. Readers still cannot trust social psychology journals to reject junk science like Bem’s (2011) article.
The second example shows how these tools can also provide evidence that published results are credible, using an article by Zlatev et al. (2020).
The automated method retrieved only 12 test statistics. This is a good sign because hypothesis tests are used sparingly to test only important effects, but it makes it more difficult to get precise estimates for a single article. Thus, article-based information should only be used as a heuristic, especially if no other information is available. Nevertheless, the limited information suggests that the results are credible. The observed discovery rate is even slightly below the expected discovery rate, and both the EDR and ERR are very high, 99%. Five of the 12 test statistics exceed a z-value of 6 (6 sigma), which is even higher than the 5-sigma rule used in particle physics.
The hand-coding retrieved 22 test statistics. The main reason for the difference is that the automated method does not include chi-square tests to avoid including results from structural equation modeling. However, the results are similar. The ODR of 86% is only slightly higher than the EDR of 74% and the replication rate is estimated to be 95%.
There were six focal tests with four p-values below .001. The other two p-values were .001 and .002. The mean observed power was 96%, which means that a success rate of 100% was justified and that there is very little inflation in the success rate, resulting in an R-Index of 93%.
Psychology, especially social psychology, has a history of publishing significant results that are selected from a larger set of tests with low statistical power. This renders published results difficult to replicate. Despite a reform movement, published articles still rely on three criteria to be published: (a) p-values below .05 for focal tests, (b) peer-review, and (c) trust that researchers did not use questionable practices to inflate effect sizes and type-I error risks. These criteria do not help to distinguish credible and incredible articles.
This blog post shows how post-hoc power analysis can be used to distinguish questionable evidence from credible evidence. Although post-hoc power analysis has been criticized when it is applied to a single test statistic, meta-analyses of observed power can show whether researchers actually had good power or not. It can also be used to provide information about the presence and amount of selection for significance. This can be helpful for readers to focus on articles that published credible and replicable results.
The reason why psychology has been slow in improving is that readers have treated all significant results as equal. This encouraged researchers to p-hack their results just enough to get significance. If readers become more discerning in their reading of method sections and no longer treat all p-values below .05 as equal, articles with more credible evidence will gain more attention and citations. For example, this R-Index analysis suggests that readers can ignore Leander et al.’s article and can focus on the credible evidence in Zlatev et al.’s article. Of course, solid empirical results are only a first step in assessing an article. Other questions about ecological validity remain, but there is no point in paying attention to p-hacked results, even if they are published in the most prestigious journal.
P.S. I ran a z-curve analysis on all articles with 10 or more z-scores between 2 and 6 published from 2000 to 2010. The Excel file contains the DOI, the observed discovery rate, the expected discovery rate, and the expected replication rate. It can be fun to plug a DOI into a search engine and to see what article pops up. I know nobody is going to believe me, but I did not know in advance which article would have the lowest EDR (5%) and ERR (9%). Still, the result is not surprising. I call it predictive validity of the R-Index.