A naive model of science assumes that scientists are objective. That is, they derive hypotheses from theories, collect data to test these theories, and then report the results. In reality, scientists are passionate about theories and often want to confirm that their own theories are right. This leads to conformation bias and the use of questionable research practices (QRPs, John et al., 2012; Schimmack, 2015). QRPs are defined as practices that increase the chances of the desired outcome (typically a statistically significant result) while at the same time inflating the risk of a false positive discovery. A simple QRP is to conduct multiple studies and to report only the results that support the theory.
The use of QRPs explains the astonishingly high rate of statistically significant results in psychology journals that is over 90% (Sterling, 1959; Sterling et al., 1995). While it is clear that this rate of significant results is too high, it is unclear how much it is inflated by QRPs. Given the lack of quantitative information about the extent of QRPs, motivated biases also produce divergent opinions about the use of QRPs by social psychologists. John et al. (2012) conducted a survey and concluded that QRPs are widespread. Fiedler and Schwarz (2016) criticized the methodology and their own survey of German psychologists suggested that QRPs are not used frequently. Neither of these studies is ideal because they relied on self-report data. Scientists who heavily use QRPs may simply not participate in surveys of QRPs or underreport the use of QRPs. It has also been suggested that many QRPs happen automatically and are not accessible to self-reports. Thus, it is necessary to study the use of QRPs with objective methods that reflect the actual behavior of scientists. One approach is to compare dissertations with published articles (Cairo et al., 2020). This method provided clear evidence for the use of QRPs, even though a published document could reveal their use. It is possible that this approach underestimates the use of QRPs because even the dissertation results could be influenced by QRPs and the supervision of dissertations by outsiders may reduce the use of QRPs.
With my colleagues, I developed a statistical method that can detect and quantify the use of QRPs (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve uses the distribution of statistically significant p-values to estimate the mean power of studies before selection for significance. This estimate predicts how many non-significant results were obtained in the serach for the significant ones. This makes it possible to compute the estimated discovery rate (EDR). The EDR can then be compared to the observed discovery rate, which is simply the percentage of published results that are statistically significant. The bigger the difference between the ODR and the EDR is, the more questionable research practices were used (see Schimmack, 2021, for a more detailed introduction).
I merely focus on social psychology because (a) I am a social/personality psychologists, who is interested in the credibility of results in my field, and (b) because social psychology has a large number of replication failures (Schimmack, 2020). Similar analyses are planned for other areas of psychology and other disciplines. I also focus on social psychology more than personality psychology because personality psychology is often more exploratory than confirmatory.
I illustrate the use of z-curve to quantify the use of QRPs with the most extreme examples in the credibility rankings of social/personality psychologists (Schimmack, 2021). Figure 1 shows the z-value plot (ZVP) of David Matsumoto. To generate this plot, the tests statistics from t-tests and F-tests were transformed into exact p-values and then transformed into the corresponding values on the standard normal distribution. As two-sided p-values are used, all z-scores are positive. However, because the curve is centered over the z-score that corresponds to the median power before selection for significance (and not zero, when the null-hypothesis is true), the distribution can look relatively normal. The variance of the distribution will be greater than 1 when studies vary in statistical power.
The grey curve in Figure 1 shows the predicted distribution based on the observed distribution of z-scores that are significant (z > 1.96). In this case, the observed number of non-significant results is similar to the predicted number of significant results. As a result, the ODR of 78% closely matches the EDR of 79%.
Figure 2 shows the results for Shelly Chaiken. The first notable observation is that the ODR of 75% is very similar to Matsumoto’s EDR of 78%. Thus, if we simply count the number of significant and non-significant p-values, there is no difference between these two researchers. However, the z-value plot (ZVP) shows a dramatically different picture. The peak density is 0.3 for Matsoumoto and 1.0 for Chaiken. As the maximum density of the standard normal distribution is .4, it is clear that the results in Chaiken’s articles are not from an actual sampling distribution. In other words, QRPs must have been used to produce too many just significant results with p-values just below .05.
The comparison of the ODR and EDR shows a large discrepancy of 64 percentage points too many significant results (ODR = 75% minus EDR = 11%). This is clearly not a chance finding because the ODR falls well outside the 95% confidence interval of the EDR, 5% to 21%.
To examine the use of QPSs in social psychology, I computed the EDR and ORDR for over 200 social/personality psychologists. Personality psychologists were excluded if they reported too few t-values and F-values. The actual values can be found and additional statistics can be found in the credibility rankings (Schimmack, 2021). Here I used these data to examine the use of QRPs in social psychology.
Average Use of QRPs
The average ODR is 73.48 with a 95% confidence interval ranging from 72.67 to 74.29. The average EDR is 35.28 with a 95% confidence interval ranging from 33.14 to 37.43. the inflation due to QRPs is 38.20 percentage points, 95%CI = 36.10 to 40.30. This difference is highly significant, t(221) = 35.89, p < too many zeros behind the decimal for R to give an exact value.
It is of course not surprising that QRPs have been used. More important is the effect size estimate. The results suggest that QRPs inflate the discovery rate by over 100%. This explains why unbiased replication studies in social psychology have only a 25% chance of being significant (Open Science Collaboration, 2015). In fact, we can use the EDR as a conservative predictor of replication outcomes (Bartos & Schimmack, 2020). While the EDR of 35% is a bit higher than the actual replication rate, this may be due to the inclusion of non-focal hypothesis tests in these analyses. Z-curve analyses of focal hypothesis tests typically produce lower EDRs. In contrast, Fiedler and Schwarz failed to comment on the low replicability of social psychology. If social psychologists would not have used QRPs, it remains a mystery why their results are so hard to replicate.
In sum, the present results confirm that, on average, social psychologists heavily used QRPs to produce significant results that support their predictions. However, these averages masks differences between researchers like Matsumoto and Chaiken. The next analyses explore these individual differences between researchers.
I had no predictions about the effect of cohort on the use of QRPs. I conducted a twitter poll that suggested a general intuition that the use of QRPs may not have changed over time, but there was a lot of uncertainty in these answers. Similar results were obtained in a Facebook poll in the Psychological Methods Discussion Group. Thus, the a priori hypothesis is a vague prior of no change.
The dataset includes different generations of researchers. I used the first publication listed in WebofScience to date researchers. The earliest date was 1964 (Robert S. Wyer). The latest date was 2012 (Kurt Gray). The histogram shows that researchers from the 1970s to 2000s were well-represented in the dataset.
There was a significant negative correlation between the ODR and cohort, r(N = 222) = -.25, 95%CI = -.12 to -.37, t(220) = 3.83, p = .0002. This finding suggests that over time the proportion of non-significant results increased. For researchers with the first publication in the 1970s, the average ODR was 76%, whereas it was 72% for researchers with the first publication in the 2000s. This is a modest trend. There are various explanations for this trend.
One possibility is that power decreased as researchers started looking for weaker effects. In this case, the EDR should also show a decrease. However, the EDR showed no relationship with cohort, r(N = 222) = -.03, 95%CI = -.16 to .10, t(220) = 0.48, p = .63. Thus, less power does not seem to explain the decrease in the ODR. At the same time, the finding that EDR does not show a notable, abs(r) < .2, relationship with cohort suggests that power has remained constant over time. This is consistent with previous examinations of statistical power in social psychology (Sedlmeier & Gigerenzer, 1989).
Although the ODR decreased significantly and the EDR did not decrease significantly, bias (ODR – EDR) did not show a significant relationship with cohort, r(N = 222) = -.06, 95%CI = -19 to .07, t(220) = -0.94, p = .35, but the 95%CI allows for a slight decrease in bias that would be consistent with the significant decrease in the ODR.
In conclusion, there is a small, statistically significant decrease in the ODR, but the effect over the past 40 decades is too small to have practical significance. The EDR and bias are not even statistically significantly related to cohort. These results suggest that research practices and the use of questionable ones has not changed notably since the beginning of empirical social psychology (Cohen, 1961; Sterling, 1959).
Another possibility is that in each generation, QRPs are used more by researches who are more achievement motivated (Janke et al., 2019). After all, the reward structure in science is based on number of publications and significant results are often needed to publish. In social psychology it is also necessary to present a package of significant results across multiple studies, which is nearly impossible without the use of QRPs (Schimmack, 2012). To examine this hypothesis, I correlated the EDR with researchers’ H-Index (as of 2/1/2021). The correlation was small, r(N = 222) = .10, 95%CI = -.03 to .23, and not significant, t(220) = 1.44, p = .15. This finding is only seemingly inconsistent with Janke et al.’s (2019) finding that self-reported QRPs were significantly correlated with self-reported ambition, r(217) = .20, p = .014. Both correlations are small and positive, suggesting that achievement motivated researchers may be slightly more likely to use QRPs. However, the evidence is by no means conclusive and the actual relationship is weak. Thus, there is no evidence to support that highly productive researchers with impressive H-indices achieved their success by using QRPs more than other researchers. Rather, they became successful in a field where QRPs are the norm. If the norms were different, they would have become successful following these other norms.
A common saying in science is that “extraordinary claims require extraordinary evidence.” Thus, we might expect stronger evidence for claims of time-reversed feelings (Bem, 2011) than for evidence that individuals from different cultures regulate their emotions differently (Matsumoto et al., 2008). However, psychologists have relied on statistical significance with alpha = .05 as a simple rule to claim discoveries. This is a problem because statistical significance is meaningless when results are selected for significance and replication failures with non-significant results remain unpublished (Sterling, 1959). Thus, psychologists have trusted an invalid criterion that does not distinguish between true and false discoveries. It is , however, possible that social psychologists used other information (e.g, gossip about replication failures at conferences) to focus on credible results and to ignore incredible ones. To examine this question, I correlated authors’ EDR with the number of citations in 2019. I used citation counts for 2019 because citation counts for 2020 are not yet final (the results will be updated with the 2020 counts). Using 2019 increases the chances of finding a significant relationship because replication failures over the past decade could have produced changes in citation rates.
The correlation between EDR and number of citations was statistically significant, r(N = 222) = .16, 95%CI = .03 to .28, t(220) = 2.39, p = .018. However, the lower limit of the 95% confidence interval is close to zero. Thus, it is possible that the real relationship is too small to matter. Moreover, the non-parametric correlation with Kendell’s tau was not significant, tau = .085, z = 1.88, p = .06. Thus, at present there is insufficient evidence to suggest that citation counts take the credibility of significant results into account. At present, p-values less than .05 are treated as equally credible no matter how they were produced.
There is general agreement that questionable research practices have been used to produce an unreal success rate of 90% or more in psychology journals (Sterling, 1959). However, there is less agreement about the amount of QRPs that are being used and the implications for the credibility of significant results in psychology journals (John et al., 2012; Fiedler & Schwarz, 2016). The problem is that self-reports may be biased because researchers are unable or unwilling to report the use of QRPs (Nisbett & Wilson, 1977). Thus, it is necessary to examine this question with alternative methods. The present study used a statistical method to compare the observed discovery rate with a statistically estimated discovery rate based on the distribution of significant p-values. The results showed that on average social psychologists have made extensive use of QRPs to inflate an expected discovery rate of around 35% to an observed discovery rate of 70%. Moreover, the estimated discovery rate of 35%is likely to be an inflated estimate of the discovery rate for focal hypothesis tests because the present analysis is based on focal and non-focal tests. This would explain why the actual success rate in replication studies is even lower thna the estimated discovery rate of 35% (Open Science Collaboration, 2015).
The main novel contribution of this study was to examine individual differences in the use of QRPs. While the ODR was fairly consistent across articles, the EDR varied considerably across researchers. However, this variation showed only very small relationships with a researchers’ cohort (first year of publication). This finding suggests that the use of QRPs varies more across research fields and other factors than over time. Additional analysis should explore predictors of the variation across researchers.
Another finding was that citations of authors’ work do not take credibility of p-values into account. Citations are influenced by popularity of topics and other factors and do not take the strength of evidence into account. One reason for this might be that social psychologists often publish multiple internal replications within a single article. This gives the illusion that results are robust and credible because it is very unlikely to replicate type-I errors. However, Bem’s (2011) article with 9 internal replications of time-reversed feelings showed that QRPs are also used to produce consistent results within a single article (Francis, 2012; Schimmack, 2012). Thus, number of significant results within an article or across articles is also an invalid criterion to evaluate the robustness of results.
In conclusion, social psychologists have conducted studies with low statistical power since the beginning of empirical social psychology. The main reason for this is the preference for between-subject designs that have low statistical power with small sample sizes of N = 40 participants and small to moderate effect sizes. Despite repeated warnings about the problems of selection for significance (Sterling, 1959) and the problems of small sample sizes (Cohen, 1961; Sedelmeier & Gigerenzer, 1989; Tversky & Kahneman, 1971), the practices have not changed since Festinger conducted his seminal study on dissonance with n = 20 per group. Over the past decades, social psychology journals have reported thousands of statistically significant results that are used in review articles, meta-analyses, textbooks, and popular books as evidence to support claims about human behavior. The problem is that it is unclear which of these significant results are true positives and which are false positives, especially if false positives are not just strictly nil-results, but also results with tiny effect sizes that have no practical significance. Without other reliable information, even social psychologists do not know which of their colleagues results are credible or not. Over the past decade, the inability to distinguish credible and incredible information has produced heated debates and a lack of confidence in published results. The present study shows that the general research practices of a researcher provide valuable information about credibility. For example, a p-value of .01 by a researcher with an EDR of 70 is more credible than a p-value of .01 by a researcher with an EDR of 15. Thus, rather than stereotyping social psychologists based on the low replication rate in the Open Science Collaboration project, social psychologists should be evaluated based on their own research practices.
Cairo, A. H., Green, J. D., Forsyth, D. R., Behler, A. M. C., & Raldiris, T. L. (2020). Gray (Literature) Matters: Evidence of Selective Hypothesis Reporting in Social Psychological Research. Personality and Social Psychology Bulletin, 46(9), 1344–1362. https://doi.org/10.1177/0146167220903896
Janke, S., Daumiller, M., & Rudert, S. C. (2019). Dark pathways to achievement in science: Researchers’ achievement goals predict engagement in questionable research practices. Social Psychological and Personality Science, 10(6), 783–791. https://doi.org/10.1177/1948550618790227
Is there still something new to say about p-values? Yes, there is. Most discussions of p-values focus on a scenario where a researcher tests a new hypothesis computes a p-value and now has to interpret the result. The status quo follows Fisher’s – 100 year old – approach to compare the p-value to a value of .05. If the p-value is below .05 (two-sided), the inference is that the population effect size deviates from zero in the same direction as the observed effect in the sample. If the p-value is greater than .05 the results are deemed inconclusive.
This approach to the interpretation of the data assumes that we have no other information about our hypothesis or that we do not trust this information sufficiently to incorporate it in our inference about the population effect size. Over the past decade, Bayesian psychologists have argued that we should replace p-values with Bayes-Factors. The advantage of Bayes-Factors is that they can incorporate prior information to draw inferences from data. However, if no prior information is available, the use of Bayesian statistics may cause more harm than good. To use priors without prior information, Bayes-Factors are computed with generic, default priors that are not based on any information about a research question. Along with other problems of Bayes-Factors, this is not an appealing solution to the problem of p-values.
Here I introduce a new approach to the interpretation of p-values that has been called empirical Bayesian and has been successfully applied in genomics to control the field-wise false positive rate. That is, prior information does not rest on theoretical assumptions or default values, but rather on prior empirical information. The information that is used to interpret a new p-value is the distribution of prior p-values.
Every study is a new study because it relies on a new sample of participants that produces sampling error that is independent of the previous studies. However, studies are not independent in other characteristics. A researcher who conducted a study with N = 40 participants is likely to have used similar sample sizes in previous studies. And a researcher who used N = 200 is also likely to have used larger sample sizes in previous studies. Researchers are also likely to use similar designs. Social psychologists, for example, prefer between-subject designs to better deceive their participants. Cognitive psychologists care less about deception and study simple behaviors that can be repeated hundreds of times within an hour. Thus, researchers who used a between-subject design are likely to have used a between-subject design in previous studies and researchers who used a within-subject design are likely to have used a within-subject design before. Researchers may also be chasing different effect sizes. Finally, researchers can differ in their willingness to take risks. Some may only test hypotheses that are derived from prior theories that have a high probability of being correct, whereas others may be willing to shoot for the moon. All of these consistent differences between researchers (i.e., sample size, effect size, research design) influence the unconditional statistical power of their studies, which is defined as the long-run probability of obtaining significant results, p < .05.
Over the past decade, in the wake of the replication crisis, interest in the distribution of p-values has increased dramatically. For example, one approach uses the distribution of significant p-values, which is known as p-curve analysis (Simonsohn et al., 2014). If p-values were obtained with questionable research practices when the null-hypothesis is true (p-hacking), the distribution of significant p-values is flat. Thus, if the distribution is monotonically decreasing from 0 to .05, the data have evidential value. Although p-curve analyses has been extended to estimate statistical power, simulation studies show that the p-curve algorithm is systematically biased when power varies across studies (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020).
As shown in simulation studies, a better way to estimate power is z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Here I show how z-curve analyses of prior p-values can be used to demonstrate that p-values from one researcher are not equal to p-values of other researchers when we take their prior research practices into account. By using this prior information, we can adjust the alpha level of individual researchers to take their research practices into account. To illustrate this use of z-curve, I first start with an illustration how different research practices influence p-value distributions.
Scenario 1: P-hacking
In the first scenario, we assume that a researcher only tests false hypotheses (i.e., the null-hypothesis is always true (Bem, 2011; Simonsohn et al., 2011). In theory, it would be easy to spot false positives because replication studies would produce produce 19 non-significant results for every significant one and significant ones would have different signs. However, questionable research practices lead to a pattern of results where only significant results in one direction are reported, which is the norm in psychology (Sterling, 1959, Sterling et al., 1995; Schimmack, 2012).
In a z-curve analysis, p-values are first converted into z-scores, z = -qnorm(p/2) with qnorm being the inverse normal function and p being a two-sided p-value. A z-curve plot shows the histogram of all z-scores, including non-significant ones (Figure 1).
Visual inspection of the z-curve plot shows that all 200 p-values are significant (on the right side of the criterion value z = 1.96). it also shows that the mode of the distribution as at the significance criterion. Most important, visual inspection shows a steep drop from the mode to the range of non-significant values. That is, while z = 1.96 is the most common value, z = 1.95 is never observed. This drop provides direct visual information that questionable research practices were used because normal sampling error cannot produce such dramatic changes in the distribution.
I am skipping the technical details how the z-curve model is fitted to the distribution of z-scores (Bartos & Schimmack, 2020). It is sufficient to know that the model is fitted to the distribution of significant z-scores with a limited number of model parameters that are equally spaced over the range of z-scores from 0 to 6 (7 parameters, z = 0, z = 1, z = 2, …. z = 6). The model gives different weights to these parameters to match the observed distribution. Based on these estimates, z-curve.2.0 computes several statistics that can be used to interpret single p-values that have been published or future p-values by the same researcher, assuming that the same research practices are used.
The most important statistic is the expected discovery rate (EDR), which corresponds to the average power of all studies that were conducted by a researcher. Importantly, the EDR is an estimate that is based on only the significant results, but makes predictions about the number of non-significant results. In this example with N = 200 participants, the EDR is 7%. Of course, we know that it really is only 5% because the expected discovery rate for true hypotheses that are tested with alpha = .05 is 5%. However, sampling error can introduce biases in our estimates. Nevertheless, even with only 200 observations, the estimate of 7% is relatively close to 5%. Thus, z-curve tells us something important about the way these p-values were obtained. They were obtained in studies with very low power that is close to the criterion value for a false positive result.
Z-curve uses bootstrap to compute confidence intervals around the point estimate of the EDR. the 95%CI ranges from 5% to 18%. As the interval includes 5%, we cannot reject the hypothesis that all tests were false positives (which in this scenario is also the correct conclusion). At the upper end we can see that mean power is low, even if some true hypotheses are being tested.
The EDR can be used for two purposes. First, it can be used to examine the extent of selection for significance by comparing the EDR to the observed discovery rate (ODR; Schimmack, 2012). The ODR is simply the percentage of significant results that was observed in the sample of p-values. In this case, this is 200 out of 200 or 100%. The discrepancy between the EDR of 7% and 100% is large and 100% is clearly outside the 95%CI of the EDR. Thus, we have strong evidence that questionable research practices were used, which we know to be true in this simulation because the 200 tests were selected from a much larger sample of 4,000 tests.
Most important for the use of z-curve to interpret p-values is the ability to estimate the maximum False Discovery Rate (Soric, 1989). The false discovery rate is the percentage of significant results that are false positives or type-I errors. The false discovery rate is often confused with alpha, the long-run probability of making a type-I error. The significance criterion ensures that no more than 5% of significant and non-significant results are false positives. When we test 4,000 false hypotheses (i.e., the null-hypothesis is true) were are not going to have more than 5% (4,000 * .05 = 200) false positive results. This is true in general and it is true in this example. However, when only significant results are published, it is easy to make the mistake to assume that no more than 5% of the published 200 results are false positives. This would be wrong because the 200 were selected to be significant and they are all false positives.
The false discovery rate is the percentage of significant results that are false positives. It no longer matters whether non-significant results are published or not. We are only concerned with the population of p-values that are below .05 (z > 1.96). In our example, the question is how many of the 200 significant results could be false positives. Soric (1989 demonstrated that the EDR limits the number of false positive discoveries. The more discoveries there are, the lower is the risk that discoveries are false. Using a simple formula, we can compute the maximum false discovery rate from the EDR.
FDR = (1/(EDR – 1)*(.05/.95), with alpha = .05
With an EDR of 7%, we obtained a maximum FDR of 68%. We know that the true FDR is 100%, thus, the estimate is too low. However, the reason is that sampling error can have dramatic effects on the FDR estimates when the EDR is low. With an EDR of 6%, the FDR estimate goes up to 82% and with an EDR estimate of 5% it is 100%. To take account of this uncertainty, we can use the 95%CI of the EDR to compute a 95%CI for the FDR estimate, 24% to 100%. Now we see that we cannot rule out that the FDR is 100%.
In short, scenario 1 introduced the use of p-value distributions to provide useful information about the risk that the published results are false discoveries. In this extreme example, we can dismiss the published p-values as inconclusive or as lacking in evidential value.
Scenario 2: The Typical Social Psychologist
It is difficult to estimate the typical effect size in a literature. However, a meta-analysis of meta-analyses suggested that the average effect size in social psychology is Cohen’s d = .4 (Richard et al., 2003). A smaller set of replication studies that did not select for significance estimated an effect size of d = .3 for social psychology (d = .2 for JPSP, d = .4 for Psych Science; Open Science Collaboration, 2015). The later estimate may include an unknown number of hypotheses where the null-hypothesis is true and the true effect size is zero. Thus, I used d = .4 as a reasonable effect size for true hypotheses in social psychology (see also LeBel, Campbell, & Loving, 2017).
It is also known that a rule of thumb in experimental social psychology was to allocate n = 20 participants to a condition, resulting in a sample size of N = 40 in studies with two groups. In a 2 x 2 design, the main effect would be tested with N = 80. However, to keep this scenario simple, I used d = .4 and N = 40 for true effects. This affords 23% power to obtain a significant result.
Finkel, Eastwick, and Reis (2017) argued that power of 25% is optimal if 75% of the hypotheses that are being tested are true. However, the assumption that 75% of hypotheses are true may be on the optimistic side. Wilson and Wixted (2018) suggested that the false discovery risk is closer to 50%. With 23% power for true hypotheses, this implies a false discovery rate of Given uncertainty about the actual false discovery rate in social psychology, I used a scenario with 50% true and 50% false hypotheses.
I kept the number of significant results at 200. To obtain 200 significant results with an equal number of true and false hypotheses, we need 1,428 tests. The 714 true hypotheses contribute 714*.23 = 164 true positives and the 714 false hypotheses produce 714*.05 = 36 false positive results; 164 + 36 = 200. This implies a false discovery rate of 36/200 = 18%. The true EDR is (714*.23+714*.05)/(714+714) = 14%.
The z-curve plot looks very similar to the previous plot, but they are not identical. Although the EDR estimate is higher, it still includes zero. The maximum FDR is well above the actual FDR of 18%, but the 95%CI includes the actual value of 18%.
A notable difference between Figure 1 and Figure 2 is the expected replication rate (ERR), which corresponds to the average power of significant p-values. It is called the estimated replication rate (ERR) because it predicts the percentage of significant results if the studies that were selected for significance were replicated exactly (Brunner & Schimmack, 2020). When power is heterogeneous, power of the studies with significant results is higher than power of studies with non-significant results (Brunner & Schimmack, 2020). In this case, with only two power values, the reason is that false positives have a much lower chance to be significant (5%) than true positives (23%). As a result, the average power of significant studies is higher than the average power of all studies. In this simulation, the true average power of significant studies is the weighted average of true and false positives with significant results, (164*.23 +36*.05)/(164+36) = 20%. Z-curve perfectly estimated this value.
Importantly, the 95% CI of the ERR, 11% to 34%, does not include zero. Thus, we can reject the null-hypotheses that all of the significant results are false positives based on the ERR. In other words, the significant results have evidential value. However, we do not know the composition of this average. It could be a large percentage of false positives and a few true hypotheses with high power or it could be many true positives with low power. We also do not know which of the 200 significant results is a true positive or a false positive. Thus, we would need to conduct replication studies to distinguish between true and false hypotheses. And given the low power, we would only have a 23% chance of successfully replicating a true positive result. This is exactly what happened with the reproducibility project. And the inconsistent results lead to debates and require further replications. Thus, we have real-world evidence how uninformative p-values are when they are obtained this way.
Social psychologists might argue that the use of small samples is justified because most hypotheses in psychology are true. Thus, we can use prior information to assume that significant results are true positives. However, this logic fails when social psychologists test false hypotheses. In this case, the observed distribution of p-values (Figure 1) is not that different from the distribution that is observed when most significant results are true positives that were obtained with low power (Figure 2). Thus, it is doubtful that this is really an optimal use of resources (Finkel et al., 2015). However, until recently this was the way experimental social psychologists conducted their research.
Scenario 3: Cohen’s Way
In 1962 (!), Cohen conducted a meta-analysis of statistical power in social psychology. The main finding was that studies had only a 50% chance to get significant results with a median effect size of d = .5. Cohen (1988) also recommended that researchers should plan studies to have 80% power. However, this recommendation was ignored.
To achieve 80% power with d = .4, researchers need N = 200 participants. Thus, the number of studies is reduced from 5 studies with N = 40 to one study with N = 200. As Finkel et al. (2017) point out, we can make more discoveries with many small studies than a few large ones. However, this ignores that the results of the small studies are difficult to replicate. This was not a concern when social psychologists did not bother to test whether their discoveries are false discoveries or whether they can be replicated. The replication crisis shows the problems of this approach. Now we have results from decades of research that produced significant p-values without providing any information whether these significant results are true or false discoveries.
Scenario 3 examines what social psychology would look like today, if social psychologists had listened to Cohen. The scenario is the same as in the second scenario, including publication bias. There are 50% false hypotheses and 50% true hypotheses with an effect size of d = .4. The only difference is that researchers used N = 200 to test their hypotheses to achieve 80% power.
With 80% power, we need 470 tests (compared to 1,428 in Scenario 2) to produce 200 significant results, 235*.80 + 235*.05 = 188 + 12 = 200. Thus, the EDR is 200/470 = 43%. The true false discovery rate is 6%. The expected replication rate is 188*.80 + 12*.05 = 76%. Thus, we see that higher power increases replicability from 20% to 76% and lowers the false discovery rate from 18% to 6%.
Figure 3 shows the z-curve plot. Visual inspection shows that Figure 3 looks very different from Figures 1 and 2. The estimates are also different. In this example, sampling error inflated the EDR to be 58%, but the 95%CI includes the true value of 46%. The 95%CI does not include the ODR. Thus, there is evidence for publication bias, which is also visible by the steep drop in the distribution at 1.96.
Even with a low EDR of 20%, the maximum FDR is only 21%. Thus, we can conclude with confidence that at least 79% of the significant results are true positives. Remember, in the previous scenario, we could not rule out that most results are false positives. Moreover, the estimated replication rate is 73%, which underestimates the true replication rate of 76%, but the 95%CI includes the true value, 95%CI = 61% – 84%. Thus, if these studies were replicated, we would have a high success rate for actual replication studies.
Just imagine for a moment what social psychology might look like in a parallel universe where social psychologists followed Cohen’s advice. Why didn’t they? The reason is that they did not have z-curve. All they had was p < .05, and using p < .05, all three scenarios are identical. All three scenarios produced 200 significant results. Moreover, as Finkel et al. (2015) pointed out, smaller samples produce 200 significant results quicker than large samples. An additional advantage of small samples is that they inflate point estimates of the population effect size. Thus, the social psychologists with the smallest samples could brag about the biggest (illusory) effect sizes as long as nobody was able to publish replication studies with larger samples that deflated effect sizes of d = .8 to d = .08 (Joy-Gaba & Nosek, 2010).
This game is over, but social psychology – and other social sciences – have published thousands of significant p-values, and nobody knows whether they were obtained using scenario 1, 2, or 3, or probably a combination of these. This is where z-curve can make a difference. P-values are no longer equal when they are considered as a data point from a p-value distribution. In scenario 1, a p-value of .01 and even a p-value of .001 has no meaning. In contrast, in scenario 3 even a p-value of .02 is meaningful and more likely to reflect a true positive than a false positive result. This means that we can use z-curve analyses of published p-values to distinguish between probably false and probably true positives.
I illustrate this with three concrete examples from a project that examined the p-value distributions of over 200 social psychologists (Schimmack, in preparation). The first example has the lowest EDR in the sample. The EDR is 11% and because there are only 210 tests, the 95%CI is wide and includes 5%.
The maximum EDR estimate is high with 41% and the 95%CI includes 100%. This suggests that we cannot rule out the hypothesis that most significant results are false positives. However, the replication rate is 57% and the 95%CI, 45% to 69%, does not include 5%. Thus, some tests tested true hypotheses, but we do not know which ones.
Visual inspection of the plot shows a different distribution than Figure 2. There are more just significant p-values, z = 2.0 to 2.2 and more large z-scores (z > 4). This shows more heterogeneity in power. A comparison of the ODR with the EDR shows that the ODR falls outside the 95%CI of the EDR. This is evidence of publication bias or the use of questionable research practices. One solution to the presence of publication bias is to lower the criterion for statistical significance. As a result, the large number of just significant results is no longer significant and the ODR decreases. This is a post-hoc correction for publication bias. For example, we can lower alpha to .005.
As expected, the ODR decreases considerably from 70% to 39%. In contrast, the EDR increases. The reason is that many questionable research practices produce a pile of just significant p-values. As these values are no longer used to fit the z-curve, it predicts a lot fewer non-significant p-values. The model now underestimates p-values between 2 and 2.2. However, these values do not seem to come from a sampling distribution. Rather they stick out like a tower. By excluding them, the p-values that are still significant with alpha = .005 look more credible. Thus, we can correct for the use of QRPs by lowering alpha and by examining whether these p-values produced interesting discoveries. At the same time, we can ignore the p-values between .05 and .005 and await replication studies to provide empirical evidence whether these hypotheses receive empirical support.
The second example was picked because it was close to the median EDR (33) and ERR (66) in the sample of 200 social psychologists.
The larger sample of tests (k = 1,529) helps to obtain more precise estimates. A comparison of the ODR, 76%, and the 95%CI of the EDR, 12% to 48%, shows that publication bias is present. However, with an EDR of 33%, the maximum FDR is only 11% and the upper limit of the 95%CI is 39%. Thus, we can conclude with confidence that fewer than 50% of the significant results are false positives, however numerous findings might be false positives. Only replication studies can provide this information.
In this example, lowering alpha to .005 did not align the ODR and the EDR. This suggests that these values come from a sampling distribution where non-significant results were not published. Thus, adjusting the there is no simple fix to adjust the significance criterion. In this situation, we can conclude that the published p-values are unlikely to be false positives, but that replication studies are needed to ensure that published significant results are not false positives.
The third example is the social psychologists with the highest EDR. In this case, the EDR is actually a little bit lower than the ODR, suggesting that there is no publication bias. The high EDR also means that the maximum FDR is very small and even the upper limit of the 95%CI is only 7%.
Another advantage of data without publication bias is that it is not necessary to exclude non-significant results from the analysis. Fitting the model to all p-values produces much tighter estimates of the EDR and the maximum FDR.
The upper limit of the 95%CI for the FDR is now 4%. Thus, we conclude that no more than 5% of the p-values less than .05 are false positives. Even p = .02 is unlikely to be a false positive. Finally, the estimated replication rate is 84% with a tight confidence interval ranging from 78% to 90%. Thus, most of the published p-values are expected to replicate in an exact replication study.
I hope these examples make it clear how useful it can be to evaluate single p-values with prior information about the p-values distribution of a lab. As labs differ in their research practices, significant p-values are also different. Only if we ignore the research context and focus on a single result p = .02 equals p = .02. But once we see the broader distribution, p-values of .02 can provide stronger evidence against the null-hypothesis than p-values of .002.
Cohen tried and failed to change the research culture of social psychologists. Meta-psychological articles have puzzled why meta-analyses of power failed to increase power (Maxwell, 2004; Schimmack, 2012; Sedelmeier & Gigerenzer, 1989). Finkel et al. (2015) provided an explanation. In a game where the winner publishes as many significant results as possible, the optimal strategy is to conduct as many studies as possible with low power. This strategy continues to be rewarded in psychology, where jobs, promotions, grants, and pay raises are based on the number of publications. Cohen (1990) said less is more, but that is not true in a science that does not self-correct and treats every p-value less than .05 as a discovery.
To improve psychology as a science, we need to change the incentive structure and author-wise z-curve analyses can do this. Rather than using p < .05 (or p < .005) as a general rule to claim discoveries, claims of discoveries can be adjusted to the research practices of a researchers. As demonstrated here, this will reward researchers who follow Cohen’s rules and punish those who use questionable practices to produce p-values less than .05 (or Bayes-Factors > 3) without evidential value. And maybe, there is a badge for credible p-values one day.
Psychology is not a unified paradigmatic science. That is, it lacks an overarching theory like evolution theory in biology. In a science without an empirically grounded paradigm, progress is made very much like evolution made progress in a process of trial and error. Some ideas may thrive for a moment, but if they are not fruitful, they are discarded. The emergence of a new idea is often characterized as a revolution, and psychology has seen its fair share of revolutions. Behaviorism replaced introspectionism and the cognitive revolution replaced behaviorism. For better or worse, cognitivism is dominating psychology at the moment. The cognitive revolution also had a strong influence on social psychology with the rise of social cognition research.
In the early days, social psychologists focussed on higher cognitive processes like attributions. However, in the 1980s, the implicit revolution shifted focus towards lower cognitive processes that may occur without awareness. This was not the first time, unconscious processes became popular. A special issue in the American Psychologists in 1992 called it the New Look 3 (Greenwald, 1992).
The first look was Freud’s exploration of conscious and unconscious processes. A major hurdle for this first look was conceptual confusion and a lack of empirical support. Puritan academic may also have shied away from the sexual content in Freudian theories (e.g., sexual desire directed at the mother).
However, the second look did try to study many of Freud’s ideas with empirical methods. For example, Silverman and Weinberger (1985) presented the phrase “Mommy and I are one” on a computer screen so quickly that participants were unable to say what they saw. This method is called subliminal priming. The idea was that the unconscious has a longing to be loved by mommy and that presenting this phrase would gratify the unconscious. Numerous studies used the “Mommy and I are one” priming method to see effects on behavior.
Greenwald (1992) reviewed this evidence.
Can subliminal presentations result in cognitive analyses of multiword strings? There have been reports of such effects, especially in association with tests of psychoanalytic hypotheses. The best known of these findings (described as subliminal psychodynamic activation [SPA], using “Mommy and I are One” as the text of a subliminal stimulus; Silverman & Weinberger, 1985) has been identified, on the basis of meta-analysis, as a reproducible phenomenon (Hardaway, 1990; Weinberger & Hardaway, 1990).
Despite this strong evidence, many researchers remain skeptical about the SPA result (see, e.g., the survey reported in Appendix B). Such skepticism is almost certainly due to the lack of widespread enthusiasm for the SPA result’s proposed psychodynamic interpretation (Silverman & Weinberger, 1985).
Because of the positive affective values of words in the critical stimulus (especially Mommy and I) , it is possible that observed effects might be explained by cognitive analysis limited to the level of single words. Some support for that interpretation is afforded by Hardaway’s demonstration (1990, p. 183, Table 3) that other affectively positive strings that include Mommy or One also produce significant effects. However, these other effects are weaker than the effect of the specific string, “Mommy and I are One.”
In summary of evidence from studies of subliminal activation, it is now well established that analysis occurs for stimuli presented at exposure conditions in a region between objective and subjective thresholds; this analysis can extract at least some semantic content of single words.
The New Look 3, however, was less interested in Freudian theory. Most of the influential subliminal priming studies used ordinary stimuli to study common topics in social psychology, including prejudice.
For example, Greenwald (1992) cites Devine’s (1989) highly influential subliminal priming studies with racial stimuli as evidence that “experiments using stimulus conditions that are clearly above objective thresholds (but presumably below subjective thresholds) have obtained semantic activation findings with apparent relative ease” (p. 769).
25 years later, in their Implicit Revolution article, Greenwald and Banaji feature Devine’s influential article.
“Patricia Devine’s (1989) dissertation research extended the previously mentioned subliminal priming methods of Bargh and Pietromonaco (1982) to automatic stereotypes. Devine’s article brought attention to the possibility of dissociation between automatic stereotype activation and controlled inhibition of stereotype expression” (p. 865).
In short, subliminal priming has played an important role in the implicit revolution. However, subliminal priming is still rare. Most studies use clearly visible stimuli. This is surprising, given the clear advantages of subliminal priming to study unconscious processes. A major concern with stimuli that are presented with awareness is that participants can control their behavior. In contrast, if they are not even aware that a racial stimulus was presented, they have no ability to supress a prejudice response.
Another revolution explains why subliminal studies remain rare despite their obvious advantages. This revolution has been called the credibility revolution, replication revolution, or open science revolution. The credibility revolution started in 2011, after a leading social cognition journal published a controversial article that showed time-reversed subliminal priming effects (Bem, 2011). This article revealed a fundamental problem in the way social psychologists conducted their research. Rather than using experiments to see whether effects exist, they used experiments to accumulate evidence in favor of effects. Studies that failed to show the expected effects were hidden. In the 2010s, it has become apparent that this flawed use of the scientific method has produced large literatures with results that cannot be replicated. A major replication project found that less than 25% of results in social psychological experiments could be replicated (OSC, 2015). Given these results, it is unclear which results provided credible evidence.
Despite these troubling findings, social psychologists continue to cite old studies like Devine’s (1989) study (it was just one study!) as if it provided conclusive evidence for subliminal priming of prejudice. If we need any evidence for Freud’s theory of repression, social psychologists would be a prime example. Through various defense mechanisms they maintain the belief that old findings that were obtained with bad scientific practices provided credible evidence that can inform our understanding of the unconscious.
Here I show that this is wishful thinking. To do so, I conducted a modern meta-analysis of subliminal priming studies. Unlike traditional meta-analysis that do not take publication bias into account, this new method provides a strong test of publication bias and corrects for its effect on the results. While there are several new methods, z-curve has been shown to be superior to other methods (Brunner & Schimmack, 2020).
The figure shows the results. The red line at z = 1.96, corresponds to the significance criterion of .05. It is easy to see that this criterion acts like a censor. Results with z-scores greater than 1.96 (i.e., p < .05) are made public and can enter researchers awareness. Results that are not significant, z < 1.06, are repressed and may linger only in the unconscious of researchers who prefer not to think about their failures.
Statistical evidence of repression is provided by a comparison of the observed discovery rate (i.e., the percentage of published results that are significant) of 90% and the expected discovery rate based on the z-curve model (i.e., the grey curve in the figure) of 13%. Evidently, published results are selected from a much larger number of analyses that failed to support subliminal priming. This clear evidence of selection for significance undermines the credibility of individual studies in the subliminal priming literature.
However, there is some evidence of heterogeneity across studies. This is seen in the increasing numbers below the x-axis. Whereas studies with z-scores below 4, have low average power, studies with z-scores above 4, have a mean power greater than 80%. This suggests that replications of these studies could produce significant results. This information could be used to salvage a few solid findings from a pile of junk findings. Closer examination of these studies is beyond the purpose of this blog post, and Devine’s study is not one of them.
The main point of this analysis is that there is strong scientific evidence to support the claim that subliminal priming researchers did not use the scientific method properly. By selecting only results that support the existence of subliminal priming, they created only illusory evidence in support of subliminal priming. Thirty years after Devine’s (1989) subliminal prejudice study was published, we have no scientific evidence in support of the claim that racial stimuli can bypass consciousness and directly influence behavior.
However, Greenwald and other social psychologists who made a career out of these findings repress the well-known fact that published results in experimental social psychology are not credible and cite them as if they are credible evidence (Greenwald & Banaj, 2017).
Social psychologists are of course very familiar with deception. First, they became famous for deceiving participants (Milgram studies). In 2011, it became apparent that they were deceiving themselves. Now, it seems they are willing to deceive others to avoid facing the inconvenient truth that decades of research have produced no scientific results.
The inability to face ego-threatening information is of course not new to psychologists. Freud studied defense mechanisms and social psychologists studied cognitive biases and motivated reasoning. Right now, this trait is on display in Donald Trump and his supporters inability to face the fact that he lost an election. It is ironic that social psychologists have the same inability when their own egos are on the line.
2.17.2020 [the blog post has been revised after I received reviews of the ms. The reference list has been expanded to include all major viewpoints and influential articles. If you find something important missing, please let me know.]
7.2.2020 [the blog post has been edited to match the print version behind the paywall]
You can email me to request a copy of the printed article (firstname.lastname@example.org)
Citation: Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. Advance online publication. https://doi.org/10.1037/cap0000246
Bem’s (2011) article triggered a string of replication failures in social psychology. A major replication project found that only 25% of results in social psychology could be replicated. I examine various explanations for this low replication rate and found most of them lacking in empirical support. I then provide evidence that the use of questionable research practices accounts for this result. Using z-curve and a representative sample of focal hypothesis tests, I find that the expected replication rate for social psychology is between 20% and 45%. I argue that quantifying replicability can provide an incentive to use good research practices and to invest more resources in studies that produce replicable results. The replication crisis in social psychology provides important lessons for other disciplines in psychology that have avoided to take a closer look at their research practices.
Keywords: Replication, Replicability, Replicability Crisis, Expected Replication Rate, Expected Discovery Rate, Questionable Research Practices, Power, Social Psychology
The 2010s started with a bang. Journal clubs were discussing the preprint of Bem’s (2011) article “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Psychologists were confronted with a choice. Either they had to believe in anomalous effects or they had to believe that psychology was an anomalous science. Ten years later, it is pos- sible to look back at Bem’s article with the hindsight of 2020. It is now clear that Bem used questionable practices to produce false evidence for his outlandish claims (Francis, 2012; Schim- mack, 2012, 2018b, 2020). Moreover, it has become apparent that these practices were the norm and that many other findings in social psychology cannot be replicated. This realisation has led to initiatives to change research practices that produce more credible and replicable results. The speed and the extent of these changes has been revolutionary. Akin to the cognitive revolution in the 1960s and the affective revolution in the 1980s, the 2010s have witnessed a method revolution. Two new journals were created that focus on methodological problems and improvements of research practices: Meta-Psychology and Advances in Methods and Practices in Psychological Science.
In my review of the method revolution, I focus on replication failures in experimental social psychology and the different explanations for these failures. I argue that the use of questionable research practices accounts for many replication failures, and I examine how social psychologists have responded to evidence that questionable research practices (QRPs) undermine the trustworthiness of social psychological results. Other disciplines may learn from these lessons and may need to reform their research practices in the coming decade.
Arguably, the most important development in psychology has been the publication of replication failures. When Bem (2011) published his abnormal results supporting paranormal phenomena, researchers quickly failed to replicate these sensational results. However, they had a hard time publishing these results. The editor of the journal that published Bem’s findings, the Journal of Personality and Social Psychology (JPSP), did not even send the article out for review. This attempt to suppress negative evidence failed for two reasons. First, online-only journals with unlimited journal space like PLoSOne or Frontiers were willing to publish null results (Ritchie, Wiseman, & French, 2012). Second, the decision to reject the replication studies was made public and created a lot of attention because Bem’s article had attracted so much attention (Aldhous, 2011). In response to social pressure, JPSP did publish a massive replication failure of Bem’s results (Galak, LeBoeuf, Nelson, & Simmons, 2012).
Over the past decade, new article formats have evolved that make it easier to publish results that fail to confirm theoretical predictions such as registered reports (Chambers, 2013) and registered replication reports (Association for Psychological Science, 2015). Registered reports are articles that are accepted for publication before the results are known, thus avoiding the problem of publishing only confirmatory findings. Scheel, Schijen, and Lakens (2020) found that this format reduced the rate of significant results from over 90% to about 50%. This difference suggests that the normal literature has a strong bias to publish significant results (Bakker, van Dijk, & Wicherts, 2012; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995).
Registered replication reports are registered reports that aim to replicate an original study in a high-powered study with many laboratories. Most registered replication reports have produced replication failures (Kvarven, Strømland, & Johannesson, 2020). These failures are especially stunning because registered replication reports have a much higher chance to produce a significant result than the original studies with much smaller samples. Thus, the failure to replicate ego depletion (Hagger et al., 2016) or facial feedback (Acosta et al., 2016) effects was shocking.
Replication failures of specific studies are important for specific theories, but they do not examine the crucial question of whether these failures are anomalies or symptomatic of a wider problem in psychological science. Answering this broader question requires a representative sample of studies from the population of results published in psychology journals. Given the diversity of psychology, this is a monumental task.
A first step toward this goal was the Reproducibility Project that focused on results published in three psychology journals in the year 2008. The journals represented social/personality psychology (JPSP), cognitive psychology (Journal of Experimental Psychology: Learning, Memory, and Cognition), and all areas of psychology (Psychological Science). Although all articles published in 2008 were eligible, not all studies were replicated, in part because some studies were very expensive or difficult to replicate. In the end, 97 studies with significant results were replicated. The headline finding was that only 37% of the replication studies replicated a statistically significant result.
This finding has been widely cited as evidence that psychology has a replication problem. However, headlines tend to blur over the fact that results varied as a function of discipline. While the success rate for cognitive psychology was 50% and even higher for within-subject designs with many observations per participant, the success rate was only 25% for social psychology and even lower for the typical between-subjects design that was employed to study ego depletion, facial feedback, or other prominent topics in social psychology.
These results do not warrant the broad claim that psychology has a replication crisis or that most results published in psychology are false. A more nuanced conclusion is that social psychology has a replication crisis and that methodological factors account for these differences. Disciplines that use designs with low statistical power are more likely to have a replication crisis.
To conclude, the 2010s have seen a rise in publications of nonsignificant results that fail to replicate original results and that contradict theoretical predictions. The replicability of published results is particularly low in social psychology.
Responses to the Replication Crisis in Social Psychology
There have been numerous responses to the replication crisis in social psychology. Broadly, they can be classified as arguments that support the notion of a crisis and arguments that claim that there is no crisis. I first discuss problems with no-crisis arguments. I then examine the pro-crisis arguments and discuss their implications for the future of psychology as a science.
No Crisis: Downplaying the Finding
Some social psychologists have argued that the term crisis is inappropriate and overly dramatic. “Every generation or so, social psychologists seem to enjoy experiencing a ‘crisis.’ While sympathetic to the underlying intentions underlying these episodes— first the field’s relevance, then the field’s methodological and statistical rigor—the term crisis seems to me overly dramatic. Placed in a positive light, social psychology’s presumed ‘crises’ actually marked advances in the discipline” (Pettigrew, 2018, p. 963). Others use euphemistic and vague descriptions of the low replication rate in social psychology. For example, Fiske (2017) notes that “like other sciences, not all our effects replicate” (p. 654). Crandall and Sherman (2016) note that the number of successful replications in social psychology was “at a lower rate than expected” (p. 94).
These comments downplay the stunning finding that only 25% of social psychology results could be replicated. Rather than admitting that there is a problem, these social psychologists find fault with critics of social psychology. “I have been proud of the professional stance of social psychology throughout my long career. But unrefereed blogs and social media attacks sent to thou- sands can undermine the professionalism of the discipline” (Pettigrew, 2018, p. 967). I would argue that lecturing thousands of students each year based on evidence that is not replicable is a bigger problem than talking openly about the low replicability of social psychology on social media.
No Crisis: Experts Can Reliably Produce Effects
After some influential priming results could not be replicated, Daniel Kahneman wrote a letter to John Bargh and suggested that leading priming researchers should conduct a series of replication studies to demonstrate that their original results are replicable (Yong, 2012). In response, Bargh and other prominent social psychologists conducted numerous studies that showed the effects are robust. At least, this is what might have happened in an alternate universe. In this universe, there have been few attempts to self-replicate original findings. Bartlett (2013) asked Bargh why he did not prove his critics wrong by doing the study again. “So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway” (Bartlett, 2013).
Bargh’s answer is not very convincing. “Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a ‘special touch’ when it comes to priming, a comment that sounds like a compliment but isn’t. ‘I don’t think anyone would believe me,’ he says” (Bartlett, 2013).
One self-replication ended with a replication failure (Elkins- Brown, Saunders, & Inzlicht, 2018). One notable successful self- replication was conducted by Petty and colleagues (Luttrell, Petty, & Xu, 2017), after a replication study by Ebersole et al. (2016) failed to replicate a seminal finding by Cacioppo, Petty, and Morris (1983) that need for cognition moderates the effect of argument strength on attitudes. Luttrell et al. (2017) were able to replicated the original finding by Cacioppo et al., and they repro- duced the nonsignificant result of Ebersole et al.’s replication study. In addition, they found a significant interaction with exper- imental design, indicating that procedural differences made the effect weaker in Ebersole et al.’s replication study. This study has been celebrated as an exemplary way to respond to replication failures. It also suggests that flaws in replication studies are some- times responsible for replication failures. However, it is impossible to generalise from this single instance to other replication failures. Thus, it remains unclear how many replication failures were caused by problems with the replication studies.
No Crisis: Decline Effect
The idea that replication failures occur because effects weaken over time was proposed by Johnathan Schooler and popularized in a New Yorker article (Lehrer, 2010). Schooler coined the term decline effect for the observation that effect sizes often decrease over time. Unfortunately, it does not work for more mundane behaviours like eating cheesecake. No matter how often you eat cheesecakes, they still add pounds to your weight. However, for effects in social psychology, it seems to be the case that it is easier to discover effects than to replicate them (Wegner, 1992). This is also true for Schooler and Engstler-Schooler’s (1990) verbal over- shadowing effect. A registered replication report replicated a statistically significant effect but with smaller effect sizes (Alogna et al., 2014). Schooler (2014) considered this finding a win-win because his original results had been replicated, and the reduced effect size supported the presence of a decline effect. However, the notion of a decline effect is misleading because it merely describes a phenomenon rather than providing an explanation for it. Schooler (2014) offered several possible explanations. One possible explanation was regression to the mean (see next paragraph). A second explanation was that slight changes in experimental procedures can reduce effect sizes (more detailed discussion below). More controversial, Schooler also eludes to the possibility that some paranormal processes may produce a decline effect. “Perhaps, there are some parallels between VO [verbal overshadowing] effects and parapsychology after all, but they reflect genuine unappreciated mechanisms of nature (Schooler, 2011) and not simply the product of publication bias or other artifact” (p. 582). Schooler, however, fails to acknowledge that a mundane explanation for the decline effect involves questionable research practices that inflate effect size estimates in original studies. Using statistical tools, Francis (2012) showed that Schooler’s original verbal over-shadowing studies showed signs of bias. Thus, there is no need to look for paranormal explanation of the decline effect in verbal overshadowing. The normal practices of selectively publishing only significant results are sufficient to explain it. In sum, the decline effect is descriptive rather than explanatory, and Schooler’s suggestion that it reflects some paranormal phenomena is not supported by scientific evidence.
No Crisis: Regression to the Mean Is Normal
Regression to the mean has been invoked as one possible explanation for the decline effect (Fiedler, 2015; Schooler, 2014). Fiedler’s argument is that random measurement error in psycho- logical measures is sufficient to produce replication failures. How- ever, random measurement error is neither necessary nor sufficient to produce replication failures. The outcome of a replication study is determined solely by a study’s statistical power, and if the replication study is an exact replication of an original study, both studies have the same amount of random measurement error and power (Brunner & Schimmack, 2020). Thus, if the Open Science Collaboration (OSC) project found 97 significant results in 100 published studies, the observed discovery rate of 97% suggests that the studies had 97% power to obtain a significant result. Random measurement error would have the same effect on power in the replication studies. Thus, random measurement error cannot ex- plain why the replication studies produced only 37% significant results. Therefore, Fiedler’s claim that random measurement error alone explains replication failures is based on a misunderstanding of the phenomenon of regression to the mean.
Moreover, regression to the mean requires that studies were selected for significance. Schooler (2014) ignores this aspect of regression to the mean when he suggests that regression to the mean is normal and expected. It is not. The effect sizes of eating cheesecake do not decrease over time because there is no selection process. In contrast, the effect sizes of social psychological experiments decrease when original articles selected significant results and replication studies do not select for significance. Thus, it is not normal for success rates to decrease from 97% to 25%, just like it would not be normal for a basketball players’ free-throw percent- age to drop from 97% to 25%. In conclusion, regression to the mean implies that original studies were selected for significance and would suggest that replication failures are produced by questionable research practices. Regression to the mean therefore be- comes an argument why there is a crisis once it is recognized that it requires selective reporting of significant results, which leads to illusory success rates in psychology journals.
No Crisis: Exact Replications Are Impossible
Heraclitus, an ancient Greek philosopher, observed that you can never step into the same river twice. Similarly, it is impossible to exactly re-create the conditions of a psychological experiment. This trivial observation has been used to argue that replication failures are neither surprising nor problematic but rather the norm. We should never expect to get the same result from the same paradigm because the actual experiments are never identical, just like a river is always changing (Stroebe & Strack, 2014). This argument has led to a heated debate about the distinction and value of direct versus conceptual replication studies (Crandall & Sherman, 2016; Pashler & Harris, 2012; Zwaan, Etz, Lucas, & Donnellan, 2018).
The purpose of direct replication studies is to replicate an original study as closely as possible so that replication failures can correct false results in the literature (Pashler & Harris, 2012). However, journals were reluctant to publish replication failures. Thus, a direct replication had little value. Either the results were not significant or they were not novel. In contrast, conceptual replication studies were publishable as long as they produced a significant result. Thus, publication bias provides an explanation for many seemingly robust findings (Bem, 2011) that suddenly cannot be replicated (Galak et al., 2012). After all, it is simply not plausible that conceptual replications that intentionally change features of a study are always successful, while direct replications that try to reproduce the original conditions as closely as possible fail in large numbers.
The argument that exact replications are impossible also ignores the difference between disciplines. Why is there no replication crisis in cognitive psychology if each experiment is like a new river? And why does eating cheesecake always lead to a weight gain, no matter whether it is chocolate cheesecake, raspberry white-truffle cheesecake, or caramel fudge cheesecake? The reason is that the main features of rivers remain the same. Even if the river is not identical, you still get wet every time you step into it. To explain the higher replicability of results in cognitive psychology than in social psychology, Van Bavel, Mende-Siedlecki, Brady, and Reinero (2016) proposed that social psychological studies are more difficult to replicate for a number of reasons. They called this property of studies contextual sensitivity. Coding studies for contextual sensitivity showed the predicted negative correlation between contextual sensitivity and replicability. However, Inbar (2016) found that this correlation was no longer significant when discipline was included as a predictor. Thus, the results suggested that social psychological studies are more contextually sensitive and less replicable but that contextual sensitivity did not explain the lower replicability of social psychology.
It is also not clear that contextual sensitivity implies that social psychology does not have a crisis. Replicability is not the only criterion of good science, especially if exact replications are impossible. Findings that can only be replicated when conditions are reproduced exactly lack generalizability, which makes them rather useless for applications and for construction of broader theories. Take verbal overshadowing as an example. Even a small change in experimental procedures reduced a practically significant effect size of 16% to a no longer meaningful effect size of 4% (Alogna et al., 2014), and neither of these experimental conditions were similar to real-world situations of eyewitness identification. Thus, the practical implications of this phenomenon remain unclear because it depends too much on the specific context.
In conclusion, empirical results are only meaningful if research- ers have a clear understanding of the conditions that can produce a statistically significant result most of the time (Fisher, 1926). Contextual sensitivity makes it harder to do so. Thus, it is one potential factor that may contribute to the replication crisis in social psychology because social psychologists do not know under which conditions their results can be reproduced. For example, I asked Roy F. Baumeister to specify optimal conditions to replicate ego depletion. He was unable or unwilling to do so (Baumeister, 2016).
No Crisis: The Replication Studies Are Flawed
The argument that replication studies are flawed comes in two flavors. One argument is that replication studies are often carried out by young researchers with less experience and expertise. They did their best, but they are just not very good experimenters (Gilbert, King, Pettigrew, & Wilson, 2016). Cunningham and Baumeister (2016) proclaim, “Anyone who has served on university thesis committees can attest to the variability in the competence and commitment of new researchers. Nonetheless, a graduate committee may decide to accept weak and unsuccessful replication studies to fulfill degree requirements if the student appears to have learned from the mistakes” (p. 4). There is little evidence to support this claim. In fact, a meta-analysis found no differences in effect sizes between studies carried out by Baumeister’s lab and other labs (Hagger, Wood, Stiff, & Chatzisarantis, 2010).
The other argument is that replication failures are sexier and more attention grabbing than successful replications. Thus, replication researchers sabotage their studies or data analyses to produce nonsignificant results (Bryan, Yeager, & O’Brien, 2019; Strack, 2016). The latter accusations have been made without empirical evidence to support this claim. For example, Strack (2016) used a positive correlation between sample size and effect size to claim that some labs were motivated to produce nonsignificant results, presumably by using a smaller sample size. However, a proper bias analysis showed no evidence that there were too few significant results (Schimmack, 2018a). Moreover, the overall effect size across all labs was also nonsignificant.
Inadvertent problems, however, may explain some replication failures. For example, some replication studies reduced statistical power by replicating a study with a smaller sample than the original study (OSC, 2015; Ritchie et al., 2012). In this case, a replication failure could be a false negative (Type II error). Thus, it is problematic to conduct replication studies with smaller samples. At the same time, registered replication reports with thou- sands of participants should be given more weight than original studies with fewer than 100 participants. Size matters.
However, size is not the only factor that matters, and researchers disagree about the implications of replication failures. Not surpris- ingly, authors of the original studies typically recognise some problems with the replication attempts (Baumeister & Vohs, 2016; Strack, 2016; cf. Skibba, 2016). Ideally, researchers would agree ahead of time on a research design that is acceptable to all parties involved. Kahneman called this model an adversarial collaboration (Kahneman, 2003). However, original researchers have either not participated in the planning of a study (Strack, 2016) or withdrawn their approval after the negative results were known (Baumeister & Vohs, 2016). No author of an original study that failed to replicate has openly admitted that questionable research practices contributed to replication failures.
In conclusion, replication failures can occur for a number of reasons, just like significant results in original studies can occur for a number of reasons. Inconsistent results are frustrating because they often require further research. This being said, there is no evidence that low quality of replication studies is the sole or the main cause of replication failures in social psychology.
No Crisis: Replication Failures Are Normal
In an opinion piece for the New York Times, Lisa Feldmann Barrett, current president of the Association for Psychological Science, commented on the OSC results and claimed that “the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works” (Barrett, 2015). On the surface, Barrett makes a valid point. It is true that replication failures are a normal part of science. First, if psychologists would conduct studies with 80% power, one out of five studies would fail to replicate, even if everything is going well and all predictions are true. Second, replication failures are expected when researchers test risky hypotheses (e.g., effects of candidate genes on personality) that have a high probability of being false. In this case, a significant result may be a false-positive result and replication failures demonstrate that it was a false positive. Thus, honest reporting of replication failures plays an integral part in normal science, and the success rate of replication studies provides valuable information about the empirical support for a hypothesis. However, a success rate of 25% or less for social psychology is not a sign of normal science, especially when social psychology journals publish over 90% significant results (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). This discrepancy suggests that the problem is not the low success rate in replication studies but the high success rate in psychology journals. If social psychologists tested risky hypotheses that have a high probability of being false, journals should report a lot of nonsignificant results, especially in articles that report multiple tests of the same hypothesis, but they do not (cf. Schimmack, 2012).
Crisis: Original Studies Are Not Credible Because They Used Null-Hypothesis Significance Testing
Bem’s anomalous results were published with a commentary by Wagenmakers, Wetzels, Borsboom, and van der Maas (2011). This commentary made various points that are discussed in more detail below, but one unique and salient point of Wagenmakers et al.’s comment concerned the use of null-hypothesis significance testing (NHST). Bem presented nine results with p values below .05 as evidence for ESP. Wagenmakers et al. object to the use of a significance criterion of .05 and argue that this criterion makes it too easy to publish false-positive results (see also Benjamin et al., 2016).
Wagenmakers et al. (2011) claimed that this problem can be avoided by using Bayes factors. When they used Bayes factors with default priors, several of Bem’s studies no longer showed evidence for ESP. Based on these findings, they argued that psychologists must change the way they analyse their data. Since then, Wagenmakers has worked tirelessly to promote Bayes factors as an alternative to NHST. However, Bayes factors have their own problems. The biggest problem is that they depend on the choice of a prior.
Bem, Utts, and Johnson (2011) pointed out that Wagenmakers et al.’s (2011) default prior assumed that there is a 50% probability that ESP works in the opposite direction (below chance accuracy) and a 25% probability that effect sizes are greater than one stan- dard deviation (Cohen’s d > 1). Only 25% of the prior distribution was allocated to effect sizes in the predicted direction between 0
No Crisis: Replication Failures Are Normal
In an opinion piece for the New York Times, Lisa Feldmann Barrett, current president of the Association for Psychological Science, commented on the OSC results and claimed that “the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works” (Barrett, 2015). On the surface, Barrett makes a valid point. It is true that replication failures are a normal part of science. First, if psychologists would conduct studies with 80% power, one out of five studies would fail to replicate, even if everything is going well and all predictions are true. Second, replication failures are expected when researchers test risky hy- potheses (e.g., effects of candidate genes on personality) that have a high probability of being false. In this case, a significant result may be a false-positive result and replication failures demonstrate that it was a false positive. Thus, honest reporting of replication failures plays an integral part in normal science, and the success rate of replication studies provides valuable information about the empirical support for a hypothesis. However, a success rate of 25% or less for social psychology is not a sign of normal science, especially when social psychology journals publish over 90% significant results (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). This discrepancy suggests that the problem is not the low success rate in replication studies but the high success rate in psychology journals. If social psychologists tested risky hypothe- ses that have a high probability of being false, journals should report a lot of nonsignificant results, especially in articles that report multiple tests of the same hypothesis, but they do not (cf. Schimmack, 2012).
Crisis: Original Studies Are Not Credible Because They Used Null-Hypothesis Significance Testing
Bem’s anomalous results were published with a commentary by Wagenmakers, Wetzels, Borsboom, and van der Maas (2011). This commentary made various points that are discussed in more detail below, but one unique and salient point of Wagenmakers et al.’s comment concerned the use of null-hypothesis significance testing (NHST). Bem presented nine results with p values below .05 as evidence for extrasensory perception (ESP). Wagenmakers et al. object to the use of a significance criterion of .05 and argue that this criterion makes it too easy to publish false-positive results (see also Benjamin et al., 2016).
Wagenmakers et al. (2011) claimed that this problem can be avoided by using Bayes factors. When they used Bayes factors with default priors, several of Bem’s studies no longer showed evidence for ESP. Based on these findings, they argued that psychologists must change the way they analyse their data. Since then, Wagenmakers has worked tirelessly to promote Bayes factors as an alternative to NHST. However, Bayes factors have their own problems. The biggest problem is that they depend on the choice of a prior.
Bem, Utts, and Johnson (2011) pointed out that Wagenmakers et al.’s (2011) default prior assumed that there is a 50% probability that ESP works in the opposite direction (below chance accuracy) and a 25% probability that effect sizes are greater than one standard deviation (Cohen’s d > 1). Only 25% of the prior distribution was allocated to effect sizes in the predicted direction between 0 and 1. This prior makes no sense for research on ESP processes that are expected to produce small effects.
When Bem et al. (2011) specified a more reasonable prior, Bayes factors actually showed more evidence for ESP than NHST. Moreover, the results of individual studies are less important than the combined evidence across studies. A meta-analysis of Bem’s studies shows that even with the default prior, Bayes factors reject the null hypothesis with an odds ratio of 1 billion to 1. Thus, if we trust Bem’s data, Bayes factors also suggest that Bem’s results are robust, and it remains unclear why Galak et al. (2012) were unable to replicate Bem’s results.
Another argument in favour of Bayes-Factors is that NHST is one-sided. Significant results are used to reject the null-hypothesis, but nonsignificant results cannot be used to affirm the null- hypothesis. This makes nonsignificant results difficult to publish, which leads to publication bias. The claim is that Bayes factors solve this problem because they can provide evidence for the null hypothesis. However, this claim is false (Tendeiro & Kiers, 2019). Bayes factors are odds ratios between two alternative hypotheses. Unlike in NHST, these two competing hypotheses are not mutually exclusive. That is, an infinite number of additional hypotheses are not tested. Thus, if the data favour the null hypothesis, they do not provide support for the null hypothesis. They merely provide evidence against one specified alternative hypothesis. There is always another possible alternative hypothesis that fits the data better than the null hypothesis. As a result, even Bayes factors that strongly favour H0 fail to provide evidence that the true effect size is exactly zero.
The solution to this problem is not new but unfamiliar to many psychologists. To demonstrate the absence of an effect, it is necessary to specify a region of effect sizes around zero and to demonstrate that the population effect size is likely to be within this region. This can be achieved using NHST (equivalence tests; Lakens, Scheel, & Isager, 2018) or Bayesian statistics (Kruschke & Liddell, 2018). The main reason why psychologists are not familiar with tests that demonstrate the absence of an effect may be that typical sample sizes in psychology are too small to produce precise estimates of effect sizes that could justify the conclusion that the population effect size is too close to zero to be meaningful.
An even more radical approach was taken by the editors of Basic and Applied Social Psychology (Trafimow & Marks, 2015), who claimed that NHST is logically invalid (Trafimow, 2003). Based on this argument, the editors banned p values from publications, which solves the problem of replication failures because there are no formal inferential tests. However, authors continue to draw causal inferences that are in line with NHST but simply omit statements about p values. It is not clear that this cosmetic change in the presentation of results is a solution to the replication crisis.
In conclusion, Wagenmakers et al. and others have blamed the use of NHST for the replication crisis, but this criticism ignores the fact that cognitive psychology also uses NHST and does not suffer a replication crisis. The problem with Bem’s results was not the use of NHST but the use of questionable research practices to produce illusory evidence (Francis, 2012; Schimmack, 2012, 2018b, 2020).
Crisis: Original Studies Report Many False Positives
An influential article by Ioannidis (2005) claimed that most published research findings are false. This eye-catching claim has been cited thousands of times. Few citing authors have bothered to point out that the claim is entirely based on hypothetical scenarios rather than empirical evidence. In psychology, fear that most published results are false positives was stoked by Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” ar- ticle that showed with simulation studies that the aggressive use of questionable research practices can dramatically increase the prob- ability that a study produces a significant result without a real effect. These articles shifted concerns about false negatives in the 1990s (e.g., Cohen, 1994) to concerns about false positives.
The problem with the current focus on false-positive results is that it implies that replication failures reveal false-positive results in original studies. This is not necessarily the case. There are two possible explanations for a replication failure. Either the original study had low power to show a true effect (the nil hypothesis is false) or the original study reported a false-positive result and the nil hypothesis is true. Replication failures do not distinguish be- tween true and false nil hypothesis, but they are often falsely interpreted as if replication failures reveal that the original hypothesis was wrong. For example, Nelson, Simmons, and Simonsohn (2018) write, “Experimental psychologists spent several decades relying on methods of data collection and analysis that make it too easy to publish false-positive, nonreplicable results. During that time, it was impossible to distinguish between findings that are true and replicable and those that are false and not replicable” (p. 512). This statement ignores that results can be true but difficult to replicate and that the nil hypothesis is often unlikely to be true.
The false assumption that replication failures reveal false- positive results has created a lot of confusion in the interpretation of replication failures (Maxwell, Lau, & Howard, 2015). For example, Gilbert et al. (2016) attribute the low replication rate in the reproducibility project to low power of the replication studies. This does not make sense when the replication studies had the same or sometimes even larger sample sizes than the original studies. As a result, the replication studies had as much or more power than the original studies. So, how could low power explain that discrepancy between the 97% success rate in original studies and the 25% success rate in replication studies? It cannot.
Gilbert et al.’s (2016) criticism only makes sense if replication failures in the replication studies are falsely interpreted as evidence that the original results were false positives. Now it makes sense to argue that both the original studies and the replication studies had low power to detect true effects and that replication failures are expected when true effects are tested in studies with low power. The only question that remains is why original studies all reported significant results when they had low power, but Gilbert et al. (2016) do not address this question.
Aside from Simmons et al.’s (2011) simulation studies, a few articles tried to examine the rate of false-positive results empirically. One approach is to examine sign changes in replication studies. If 100 true null hypotheses are tested, 50 studies are expected to show a positive sign and 50 studies are expected to show a negative sign due to random sampling error. If these 100 studies are replicated, this will happen again. Just like two coin flips, we would therefore expect 50 studies with the same outcome(both positive or both negative) and 50 studies with different outcomes (one positive, one negative).
Wilson and Wixted (2018) found that 25% of social psychological results in the OSC project showed a sign reversal. This would suggest that 50% of the studies tested a true null hypothesis. Of course, sign reversals are also possible when the effect size is not strictly zero. However, the probability of a sign reversal decreases as effect sizes increase. Thus, it is possible to say that about 50% of the replicated studies had an effect size close to zero. Unfortunately, this estimate is imprecise due to the small sample size.
Gronau, Duizer, Bakker, and Wagenmakers (2017) attempted to estimate the false discovery rate using a statistical model that is fitted to the exact p values of original studies. The applied this model to three data sets and found false discovery rates (FDRs) of 34-46% for cognitive psychology, 40 – 60% for social psychology in general, and 48-88% for social priming. However, Schimmack and Brunner (2019) discovered a statistical flaw in this model that leads to the overestimation of the FDR. They also pointed out that it is impossible to provide exact estimates of the FDR because the distinction between absolutely no effect and a very small effect is arbitrary.
Bartoš and Schimmack (2020) developed a statistical model, called z-curve.2.0, that makes it possible to estimate the maximum FDR. If this maximum is low, it suggests that most replication failures are due to low power. Applying z-curve2.0 to Gronau et al.’s (2017) data sets yields FDRs of 9% (95% CI [2%, 24%]) for cognitive psychology, 26% (95% CI [4%, 100%]) for social psychology, and 61% (95% CI [19%, 100%]) for social priming. The z-curve estimate that up to 61% of social priming results could be false positives justifies Kahneman’s letter to Bargh that called out social priming research as the “poster child for doubts about the integrity of psychological research” (cf. Yong, 2012). The difference between 9% for cognitive psychology and 61% for social priming makes it clear that it is not possible to generalize from the replication crisis in social psychology to other areas of psychology. In conclusion, it is impossible to specify exactly whether an original finding was a false-positive result or not. There have been several attempts to estimate the number of false-positive results in the literature, but there is no consensus about the proper method to do so. I believe that the distinction between false and true positives is not particularly helpful if the null hypothesis is specified as a value of zero. An effect size of d = .0001 is not any more meaningful than an effect size of d = 0000. To be meaningful, published results should be replicable given the same sample sizes as used in original research. Demonstrating a significant result in the same direction in a much larger sample with a much smaller effect size should not be considered a successful replication.
Crisis: Original Studies Are Selected for Significance
The most obvious explanation for the replication crisis is the well-known bias to publish only significant results that confirm theoretical predictions. As a result, it is not necessary to read the results section of a psychological article. It will inevitably report confirmatory evidence, p < .05. This practice is commonly known as publication bias. Concerns about publication bias are nearly as old as empirical psychology (Rosenthal, 1979; Sterling, 1959). Kerr (1998) published his famous “HARKing” (hypothesising after results are known) article to explain how social psychologists were able to report mostly significant results. Social psychology journals responded by demanding that researchers publish multiple replication studies within a single article (cf. Wegner, 1992). These multiple-study articles created a sense of rigor and made false- positive results extremely unlikely. With five significant results with p < .05, the risk of a false-positive result is smaller than the criterion used by particle physicists to claim a discovery (cf. Schimmack, 2012). Thus, Bem’s (2011) article that contained nine successful studies exceeded the stringent criterion that was used to claim the discovery of the Higgs-Boson particle, the most celebrated findings in physics in the 2010s. The key difference be- tween the discovery of the Higgs-Boson particle in 2012 and Bem’s discovery of mental time travel is that physicists conducted a single powerful experiment to test their predictions, while Bem conducted many studies and selectively published results that supported his claim (Schimmack, 2018b). Bem (2012) even admitted that he ran many small studies that were not included in the article. At the same time, he was willing to combine several small studies with promising trends into a single data set. For example, Study 6 was really four studies with Ns = 50, 41, 19, and 40 (cf. Schimmack, Schultz, Carlsson, & Schmukle, 2018). These questionable, to say the least, practices were so common in social psychology that leading social psychologists were unwilling to retract Bem’s article because this practice was considered acceptable (Kitayama, 2018).
There have been three independent approaches to examine the use of questionable research practices. All three approaches show converging evidence that questionable practices inflate the rate of significant results in social psychology journals. Cairo, Green, Forsyth, Behler, and Raldiris (2020) demonstrated that published articles report more significant results than dissertations. John et al. (2012) found evidence for the use of questionable practices with a survey of research practices. The most widely used QRPs were not reporting all dependent variables (65%), collecting more data after snooping (57%), and selectively reporting studies that worked (48%). Moreover, researchers found these QRPs acceptable with defensibility ratings (0 –2) of 1.84, 1.79, and 1.66, respectively. Thus, researchers are using questionable practices because they do not consider them to be problematic. It is unclear whether attitudes toward questionable research practices have changed in response to the replication crisis.
Social psychologists have responded to John et al.’s (2012) article in two ways. One response was to question the importance of the findings. Stroebe and Strack (2014) argued that these practices may not be questionable, but they do not counter Sterling’s argument that these practices invalidate the meaning of significance testing and p values. Fiedler and Schwarz (2016) argue that John et al.’s (2012) survey produced inflated estimates of the use of QRPs. However, they fail to provide an alternative explanation for the low replication rate of social psychological research.
Statistical methods that can reveal publication bias provide additional evidence about the use of QRPs. Although these tests often have low power in small sets of studies (Renkewitz & Keiner, 2019), they can provide clear evidence of publication bias when bias is large (Francis, 2012; Schimmack, 2012) or when the set of studies is large (Carter, Kofler, Forster, & McCullough, 2015; Carter & McCullough, 2013, 2014). One group of bias tests compares the success rate to estimates of mean power. The advantage of these tests is that they provide clear evidence of QRPs. Francis used this approach to demonstrate that 82% of articles with four or more studies that were published between 2009 and 2012 in Psychological Science showed evidence of bias. Given the small set of studies, this finding implies that selection for significance was severe (Schimmack, 2020).
Social psychologists have mainly ignored evidence that QRPs were used to produce significant results. John et al.’s article has been cited over 500 times, but it has not been cited by social psychologists who commented on the replication crisis like Fiske, Baumeister, Gilbert, Wilson, or Nisbett. This is symptomatic of the response by some eminent social psychologists to the replication crisis. Rather than engaging in a scientific debate about the causes of the crisis, they have remained silent or dismissed critics as unscientific. “Some critics go beyond scientific argument and counterargument to imply that the entire field is inept and misguided (e.g., Gelman, 2014; Schimmack, 2014)” (Fiske, 2017, p. 653). Yet, Fiske fails to explain why social psychological results cannot be replicated.
Others have argued that Francis’s work is unnecessary because the presence of publication bias is a well-known fact. Therefore, “one is guaranteed to eventually reject a null we already know is false” (Simonsohn, 2013, p. 599). This argument ignores that bias tests can help to show that social psychology is improving. For example, bias tests show no bias in registered replication reports, indicating that this new format produces more credible results (Schimmack, 2018a).
Murayama, Pekrun, and Fiedler (2014) noted that demonstrating the presence of bias does not justify the conclusion that there is no effect. This is true but not very relevant. Bias undermines the credibility of the evidence that is supposed to demonstrate an effect. Without credible evidence, it remains uncertain whether an effect is present or not. Moreover, Murayama et al. acknowledge that bias always inflates effect size estimates, which makes it more difficult to assess the practical relevance of published results.
A more valid criticism of Francis’s bias analyses is that they do not reveal the amount of bias (Simonsohn, 2013). That is, when we see 95% significant results in a journal and there is bias, it is not clear whether mean power was 75% or 25%. To be more useful, bias tests should also provide information about the amount of bias.
In conclusion, selective reporting of significant results inflates effect sizes, and the observed discovery rate in journals gives a false impression of the power and replicability of published results. Surveys and bias tests show that the use of QRPs in social psychology were widespread. However, bias tests merely show that QRPs were used. They do not show how much QRPs influenced reported results.
z-Curve: Quantifying the Crisis
Some psychologists developed statistical models that can quantify the influence of selection for significance on replicability. Brunner and Schimmack (2020) compared four methods to estimate the expected replication rate (ERR), including the popular p-curve method (Brunner, 2018; Simonsohn, Nelson, & Simmons, 2014; Ulrich & Miller, 2018). They found that p-curve overestimated replicability when effect sizes vary across studies. In contrast, a new method called z-curve performed well across many scenarios, especially when heterogeneity was present.
Bartoš and Schimmack (2020) validated an extended version of z-curve (z-curve2.0) that provides confidence intervals and pro- vides estimates of the expected discovery rate, that is, the percent- age of observed significant results for all tests that were conducted, even if they were not reported. To do so, z-curve estimates the size of the file drawer of unpublished studies with nonsignificant results. The z-curve has already been applied to various data sets of results in social psychology (see R-Index blog for numerous examples).
The most important data set was created by Motyl et al. (2017), who used representative sampling of social psychology journals to examine the credibility of social psychology. The data set was also much larger than the 100 studies of the actual replication project (OSC, 2015). The main drawback of Motyl et al.’s audit of social psychology was that they did not have a proper statistical tool to estimate replicability. I used this data set to estimate the replica- bility of social psychology based on a representative sample of studies. To be included in the z-curve analysis, a study had to use a t test or F test with no more than four numerator degrees of freedom. I excluded studies from the journal Psychological Science to focus on social psychology. This left 678 studies for analysis. The set included 450 between-subjects studies, 139 mixed designs, and 67 within-subject designs. The preponderance of between-subjects designs is typical of social psychology and one of the reasons for the low power of studies in social psychology.
Figure 1 was created with the R-package zcurve. The figure shows a histogram of test statistics converted into z-scores. The red line shows statistical significance at z = 1.96, which corresponds to p < .05 (two-tailed). The blue line shows the predicted values based on the best-fitting mixture model that is used to estimate the expected replication rate and the expected discovery rate. The dotted lines show 95% confidence intervals.
The results in Figure 1 show an expected replication rate of 43% (95% CI [36%, 52%]). This result is a bit better than the 25% estimate obtained in the OSC project. There are a number of possible explanations for the discrepancy between the OSC estimate and the z-curve estimate. First of all, the number of studies in the OSC project is very small and sampling error alone could explain some of the differences. Second, the set of studies in the OSC project was not representative and may have selected studies with lower replicability. Third, some actual replication studies may have modified procedures in ways that lowered the chance of obtaining a significant result. Finally, it is never possible to exactly replicate a study (Stroebe & Strack, 2014; Van Bavel et al., 2016). Thus, z-curve estimates are overly optimistic because they assume exact replications. If there is contextual sensitivity, selection for significance will produce additional regression to the mean, and a better estimate of the actual replication rate is the expected discovery rate, EDR (Bartoš & Schimmack, 2020). The estimated EDR of 21% is close to the 25% estimate based on actual replication studies. In combination, the existing evidence suggests that the replicability of social psychological research is somewhere be- tween 20% and 50%, which is clearly unsatisfactory and much lower than the observed discovery rate of 90% or more in social psychology journals.
Figure 1 also clearly shows that questionable research practices explain the gap between success rates in laboratories and success rates in journals. The z-curve estimate of nonsignificant results shows that a large proportion of nonsignificant results is expected, but hardly any of these expected studies ever get published. This is reflected in an observed discovery rate of 90% and an expected discovery rate of 21%. The confidence intervals do not overlap, indicating that this discrepancy is statistically significant. Given such extreme selection for significance, it is not surprising that published effect sizes are inflated and replication studies fail to reproduce significant results. In conclusion, out of all explanations for replication failures in psychology, the use of questionable research practices is the main factor.
The z-curve can also be used to examine the power of subgroups of studies. In the OSC project, studies with a z-score greater than 4 had an 80% chance to be replicated. To achieve an ERR of 80% with Motyl et al.’s (2017) data, z-scores have to be greater than 3.5. In contrast, studies with just significant results (p < .05 and p > .01) have an ERR of only 28%. This information can be used to reevaluate published results. Studies with p values between .05 and .01 should not be trusted unless other information suggests otherwise (e.g., a trustworthy meta-analysis). In contrast, results with z-scores greater than 4 can be used to plan new studies. Unfortunately, there are much more questionable results with p values greater than .01 (42%) than trustworthy results with z > 4 (17%), but at least there are some findings that are likely to replicate even in social psychology.
An Inconvenient Truth
Every crisis is an opportunity to learn from mistakes. Lending practices were changed after the financial crisis in the 2000s. Psychologists and other sciences can learn from the replication crisis in social psychology, but only if they are honest and upfront about the real cause of the replication crisis. Social psychologists did not use the scientific method properly. Neither Fisher nor Neyman and Pearson, who created NHST, proposed that nonsignificant results are irrelevant or that only significant results should be published. The problems of selection for significance is evident and has been well known (Rosenthal, 1979; Sterling, 1959). Cohen (1962) warned about low power, but the main concern was a large file drawer filled with Type II errors. Nobody could imagine that whole literatures with hundreds of studies are built on nothing but sampling error and selection for significance. Bem’s article and replication failures in the 2010s showed that the abuse of questionable research practices was much more excessive than any- body was willing to believe.
The key culprit were conceptual replication studies. Even social psychologists were aware that it is unethical to hide replication failures. For example, Bem advised researchers to use questionable research practices to find significant results in their data. “Go on a fishing expedition for something—anything—interesting, even if this meant to ‘err on the side of discovery’” (Bem, 2000). However, even Bem made it clear that “this is not advice to suppress negative results. If your study was genuinely designed to test hypotheses that derive from a formal theory or are of wide general interest for some other reason, then they should remain the focus of your article. The integrity of the scientific enterprise requires the reporting of disconfirming results.”
How did social psychologists justify to themselves that it is OK to omit nonsignificant results? One explanation is the distinction between direct and conceptual replications. Conceptual replications always vary at least a small detail of a study. Thus, a nonsignificant result is never a replication failure of a previous study. It is just a failure of a specific study to show a predicted effect. Graduate students were explicitly given the advice to “never do a direct replication; that way, if a conceptual replication doesn’t work, you maintain plausible deniability” (Anonymous, cited in Spellman, 2015). This is also how Morewedge, Gilbert, and Wilson (2014) explain why they omitted nonsignificant results from a publication:
Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know.
It was only in 2012 that psychologists realized that changing results in their studies were heavily influenced by sampling error and not by some minor changes in the experimental procedure. Only a few psychologists have been open about this. In a commendable editorial, Lindsay (2019) talks about his realization that his research practices were suboptimal:
Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules— but we sometimes eventually threw in the towel because results were maddeningly inconsistent.
Rather than invoking some supernatural decline effect, Lindsay realized that his research practices were suboptimal. A first step for social psychologists is to acknowledge their past mistakes and to learn from their mistakes. Making mistakes is a fact of life. What counts is the response to a mistake. So far, the response by social psychologists has been underwhelming. It is time for some leaders to step up or to step down and make room for a new generation of social psychologists who follow open and transparent practices.
The Way Out of the Crisis
A clear analysis of the replication crisis points toward a clear path out of the crisis. Given that “lax data collection, analysis, and reporting” standards (Carpenter, 2012, p. 1558) allowed for the use of QRPs that undermine the credibility of social psychology, the most obvious solution is to ban the use of questionable research practices and to treat them like other types of unethical behaviours (Engel, 2015). However, no scientific organisation has clearly
stated which practices are acceptable and which practices are not, and prominent social psychologists oppose clear rules of scientific misconduct (Fiske, 2016).
At present, the enforcement of good practices is left to editors of journals who can ask pertinent questions during the submission process (Lindsay, 2019). Another solution has been to ask re- searchers to preregister their studies, which limits researchers’ freedom to go on a fishing expedition (Nosek, Ebersole, DeHaven, & Mellor, 2018). Some journals reward preregistering with badges (JESP), but some social psychology journals do not (PSPB, SPPS). There has been a lot of debate about the value of preregistration and concerns that it may reduce creativity. However, preregistra- tion does not imply that all research has to be confirmatory. It merely makes it possible to distinguish clearly between explor- atory and confirmatory research.
It is unlikely that preregistration alone will solve all problems, especially because there are no clear standards about preregistra- tions and how much they constrain the actual analyses. For exam- ple, Noah, Schul, and Mayo (2018) preregistered the prediction of an interaction between being observed and a facial feedback ma- nipulation. Although the predicted interaction was not significant, they interpreted the nonsignificant pattern as confirming their prediction rather than stating that there was no support for their preregistered prediction of an interaction effect. A z-curve analysis of preregistered studies in JESP still found evidence of QRPs, although less so than for articles that were not preregistered (Schimmack, 2020). To improve the value of preregistration, so- cieties should provide clear norms for research ethics that can be used to hold researchers accountable when they try to game preregistration (Yamada, 2018).
Preregistration of studies alone will only produce more nonsig- nificant results and not increase the replicability of significant results because studies are underpowered. To increase replicabil- ity, social psychologists finally have to conduct power analysis to plan studies that can produce significant results without QRPs. This also means they need to publish less because more resources are needed for a single study (Schimmack, 2012).
To ensure that published results are credible and replicable, I argue that researchers should be rewarded for conducting high- powered studies. As a priori power analyses are based on estimates of effect sizes, they cannot provide information about the actual power of studies. However, z-curve can provide information about the typical power of studies that are conducted within a lab. This information provides quantitative information about the research practices of a lab.
This can be useful information to evaluate the contribution of a research to psychological science. Imagine an eminent scholar [I had to delete the name of this imaginary scholar in the published version, I used the R-Index of Roy F. Baumeister for this example] with an H-index of 100, but assume that this H-index was achieved by publishing many studies with low power that are difficult to replicate. A z-curve analysis might produce an estimate of 25%. This information can be integrated with the H-index to produce a replicability-weighted H-index of RH = 100 * .25 = 25. Another researcher may be less prolific and only have an H-index of 50. A z-curve analysis shows that these studies have a replicability of 80%. This yields an RH-index of 40, which is higher than the RH index of the prolific researcher. By quantifying replicability, we can reward researchers who make replicable contributions to psychological science.
By taking replicability into account, the incentive to publish as many discoveries as possible without concerns about their truth- value (i.e., “to err on the side of discovery”) is no longer the best strategy to achieve fame and recognition in a field. The RH-index could also motivate researchers to retract articles that they no longer believe in, which would lower the H-index but increase the R-index. For highly problematic studies, this could produce a net gain in the RH-index.
Social psychology is changing in response to a replication crisis. To (re)gain trust in social psychology as a science, social psychol- ogists need to change their research practices. The problem of low power has been known since Cohen (1962), but only in recent years, power of social psychological studies has increased (Schim- mack, 2020). Aside from larger samples, social psychologists are also starting to use within-subject designs that increase power (Lin, Saunders, Friese, Evans, & Inzlicht, 2020). Finally, social psychologists need to change the way they report their results. Most important, they need to stop reporting only results that confirm their predictions. Fiske (2016) recommended that scientists keep track of their questionable practices, and Wicherts et al. (2016) provided a checklist to do so. I think it would be better to ban these practices altogether. Most important, once a discovery has been made, failures to replicate this finding provide valuable, new information and need to be published (Galak et al., 2012), and theories that fail to provide consistent support need to be abandoned or revised (Ferguson & Heene, 2012).
My personal contribution to improving science has been the development of tools that make it possible to examine whether reported results are credible or not (Bartoš & Schimmack, 2020; Schimmack, 2012; Brunner & Schimmack, 2020). I agree with Fiske (2017) that science works better when we can trust scientists, but a science with a replication rate of 25% is not trustworthy. Ironically, the same tool that reveals shady practices in the past can also demonstrate that practices in social psychology are improving (Schimmack, 2020). Hopefully, z-curve analyses of social psychology will eventually show that social psychology has become a trustworthy science.
Acosta, A., Adams, R. B., Jr., Albohn, D. N., Allard, E. S., Beek, T., Benning, S. D., . . . Zwaan, R. A. (2016). Registered replication report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Sci- ence, 11, 917–928. http://dx.doi.org/10.1177/1745691616674458
Alogna, V. K., Attaya, M. K., Aucoin, P., Bahník, Š., Birch, S., Birt, A. R.,. . . Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556 –578. http://dx.doi.org/10.1177/1745691614545653
Bem, D. J. (2011). Feeling the future: Experimental evidence for anoma- lous retroactive influences on cognition and affect. Journal of Person- ality and Social Psychology, 100, 407– 425. http://dx.doi.org/10.1037/a0021524
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101, 716 –719. http://dx.doi.org/10.1037/a0024777
Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta- Psychology. MP.2018.874, https://doi.org/10.15626/MP.2018.874
Bryan, C. J., Yeager, D. S., & O’Brien, J. M. (2019). Replicator degrees of freedom allow publication of misleading failures to replicate. Proceed- ings of the National Academy of Sciences USA, 116, 25535–25545. http://dx.doi.org/10.1073/pnas.1910951116
Cacioppo, J. T., Petty, R. E., & Morris, K. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805– 818. http://dx.doi.org/10.1037/0022-35220.127.116.115
Cairo, A. H., Green, J. D., Forsyth, D. R., Behler, A. M. C., & Raldiris, T. L. (2020). Gray (literature) mattes: Evidence of selective hypothesis reporting in social psychological research. Personality and Social Psy- chology Bulletin. Advance online publication. http://dx.doi.org/10.1177/ 0146167220903896
Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psy- chology: General, 144, 796 – 815. http://dx.doi.org/10.1037/xge0000083 Carter, E. C., & McCullough, M. E. (2013). Is ego depletion too incredible? Evidence for the overestimation of the depletion effect. Behavioraland Brain Sciences, 36, 683– 684. http://dx.doi.org/10.1017/S0140525X13000952
Carter, E. C., & McCullough, M. E. (2014). Publication bias and the limited strength model of self-control: Has the evidence for ego depletion been overestimated? Frontiers in Psychology, 5, 823.http://dx.doi.org/10.3389/fpsyg.2014.00823
Crandall, C. S., & Sherman, J. W. (2016). On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology, 66, 93–99. http://dx.doi.org/10.1016/j.jesp.2015.10.002
Cunningham, M. R., & Baumeister, R. F. (2016). How to make nothing out of something: Analyses of the impact of study sampling and statistical interpretation in misleading meta-analytic conclusions. Frontiers in Psy- chology, 7, 1639. http://dx.doi.org/10.3389/fpsyg.2016.01639
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., . . . Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68 – 82. http://dx.doi.org/10.1016/j.jesp.2015.10.012
Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and psychological science’s aversion to the null. Per- spectives on Psychological Science, 7, 555–561. http://dx.doi.org/10.1177/1745691612459059
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate. Journal of Personality andSocial Psychology, 103, 933–948. http://dx.doi.org/10.1037/a0029709
Gronau, Q. F., Duizer, M., Bakker, M., & Wagenmakers, E.-J. (2017). Bayesian mixture modeling of significant p values: A meta-analytic method to estimate the degree of contamination from Ho. Journal of Experimental Psychology: General, 146, 1223–1233. http://dx.doi.org/10.1037/xge0000324
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., . Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11, 546 –573. http://dx.doi.org/10.1177/1745691616652873
Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136, 495–525. http://dx.doi.org/10.1037/a0019486
Inbar, Y. (2016). Association between contextual dependence and replicability in psychology may be spurious. Proceedings of the National Academy of Sciences, 113(34):E4933-9334, doi.org/10.1073/pnas.1608676113
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178 –206. http://dx.doi.org/10.3758/s13423-016-1221-4
Kvarven, A., Strømland, E. & Johannesson, M. (2020). Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour 4, 423–434 (2020). https://doi.org/10.1038/s41562-019-0787-z
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1, 259 –269. http://dx.doi.org/10.1177/2515245918770963
Luttrell, A., Petty, R. E., & Xu, M. (2017). Replicating and fixing failed replications: The case of need for cognition and argument quality. Journal of Experimental Social Psychology, 69, 178 –183. http://dx.doi.org/10.1016/j.jesp.2016.09.006
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70, 487– 498. http://dx.doi.org/10.1037/a0039400
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., . . . Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113, 34 –58. http://dx.doi.org/10.1037/pspa0000084
Murayama, K., Pekrun, R., & Fiedler, K. (2014). Research practices that can prevent an inflation of false-positive rates. Personality and Social Psychology Review, 18, 107–118. http://dx.doi.org/10.1177/1088868313496330
Noah, T., Schul, Y., & Mayo, R. (2018). When both the original study and its failed replication are correct: Feeling observed eliminates the facial- feedback effect. Journal of Personality and Social Psychology, 114, 657– 664. http://dx.doi.org/10.1037/pspa0000121
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences USA, 115, 2600 –2606. http://dx.doi.org/10.1073/pnas.1708274114
Renkewitz, F., & Keiner, M. (2019). How to detect publication bias in psychological research: A comparative evaluation of six statistical methods. Zeitschrift für Psychologie, 227(4), 261-279. http://dx.doi.org/10.1027/2151-2604/a000386
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PLoS One, 7, e33423. http://dx.doi.org/10.1371/journal.pone.0033423
Schimmack, U. (2018b). Why the Journal of Personality and Social Psychology Should Retract Article DOI:10.1037/a0021524 “Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem. Retrieved January 6, 2020, from https://replicationindex.com/2018/01/05/bem-retraction
Schooler, J. W. (2014). Turning the lens of science on itself: Verbal overshadowing, replication, and metascience. Perspectives on Psycho- logical Science, 9, 579 –584. http://dx.doi.org/10.1177/1745691614547878
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359 –1366. http://dx.doi.org/10.1177/0956797611417632
Simonsohn, U. (2013). It does not follow: Evaluating the one-off publication bias critiques by Francis (2012a, 2012b, 2012c, 2012d, 2012e, in press). Perspective on Psychological Science, 7, 597–599. http://dx.doi.org/10.1177/1745691612463399
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666 – 681. http://dx.doi.org/10.1177/1745691614553988
Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54, 30 –34.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108 –112.
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences USA, 113, 6454 – 6459. http://dx.doi.org/10.1073/pnas.1521897113
Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426 – 432. http://dx.doi.org/10.1037/a0022790
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., … Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. https://doi.org/10.1177/1745691616674458
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832.http://dx.doi.org/10.3389/fpsyg.2016.01832
Wilson, B. M., & Wixted, J. T. (2018). The prior odds of testing a true effect in cognitive and social psychology. Advances in Methods and Practices in Psychological Science, 1, 186 –197. http://dx.doi.org/10.1177/2515245918767122
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Improving social and behavioral science by making replication mainstream: A response to commentaries. Behavioral and Brain Sciences, 41, e157. http://dx.doi.org/10.1017/S0140525X18000961
Social psychologists have responded differently to the replication crisis. Some eminent social psychologists were at the end of their careers when the crisis started in 2011. Their research output in the 2010s is too small for quantitative investigations. Thus, it makes sense to look at the younger generation of future leaders in the field.
A prominent social psychologist is Mickey Inzlicht. Not only is he on the path to becoming an eminent social psychologist (current H-Index in WebOfScience 40, over 1,000 citations in 2018), he is also a prominent commentator on the replication crisis. Most notable are Mickey’s blog posts that document his journey from believing in social psychology to becoming a skeptic, if not nihilist as more and more studies failed to replicate, including his areas of research (ego-depletion, stereotype threat; Inzlicht, 2016). Mickey is also one of the few researchers who has expressed doubts about his own findings that were obtained with methods that are now considered questionable and are difficult to replicate (Inzlicht, 2015).
He used some bias-detection tools on older and newer articles and found that the older articles showed clear evidence that questionable practices were used. His critical self-analysis was meant to stimulate more critical self-examinations, but it remains a rare example of honesty among social psychologists.
In 2016, Mickey did another self-examination that showed some positive trends in research practices. However, 2016 leaves little time for improvement and the tools were not the best tools. Here I use the most powerful method to examine questionable research practices and replicability, z-curve (Brunner & Schimmack, 2019). Following another case-study (Adam D. Galinsky), I divide the time periods into before and including 2012 and the years after 2012.
One notable difference between the two time periods is that the observed discovery rate decreased from 64% , 95%CI 59%-69%), to 49%, 95%CI = 44%-55%. This change shows that there is less selection for significance after 2012. There is also positive evidence that results before 2012 were selected for significance. The Observed Discovery Rate of 64% is higher than the expected discovery rate based on z-curve, EDR = 26%, 95%CI = 7% to 41%. However, the results after 2012 show no significant evidence that results are selected for significance because the ODR = 49% is within the 95%CI of the EDR, 7% to 64%. Visual inspection suggests a large file-drawer but that is caused by the blip of p-values just below .05 (z = 2 to z = 2.2). If these values are excluded and z-curve is fitted to z-values greater than 2.2, the model even suggests that there are more non-significant results than expected (Figure 2).
Overall, these results show that the reported results after 2013 are more trustworthy, in part because more non-significant results are reported.
Honest reporting of non-significant results is valuable, but these results are inconclusive. Thus, another important question is whether power has increased to produce more credible significant results. This is evaluated by examining the replicability of significant results. Replicability increased from 47%, 95%CI = 36% to 59%, to 68%, 95% 57% to 78%. This shows that significant results published after 2012 are more likely to replicate. However, an average replicability of 68% is still a bit short of the recommended level of 80%. Moreover, this estimate includes focal and non-focal tests and there is heterogeneity. For p-values in the range between .05 and .01, replicability is estimated to be only 30%. However, this estimate increases to 56% for the model in Figure 2. Thus, there is some uncertainty about the replicability of just significant p-values. For p-values beween .01 and .001 replicabilty is about 50%, which is acceptable but not ideal.
In conclusion, Mickey Inzlicht has been more self-critical about his past research practices than other social psychologists who have used the same questionable research practices to produce publishable significant results. Consistent with his own self-analysis, these results show that research practices changed mostly by reporting more non-significant results, but also by increasing power of studies.
I hope these positive results make Mickey revise his opinion about the value of z-curve results (Inzlicht, 2015). In 2015, Mickey argued that z-curve results are not ready for prime time. Meanwhile, z-curve has been vetted in simulation studies and is in press in Meta-Psychology. The present results show that z-curve is a valuable tool to reward the use of open science practices that lead to the publication of more credible results.
Social psychologists have responded differently to the replication crisis. Some eminent social psychologists were at the end of their careers when the crisis started in 2011. Their research output in the 2010s is too small for quantitative investigations. Thus, it makes sense to look at the younger generation of future leaders in the field.
By quantitative measures one of the leading social psychologists with an active lab in the 2010s is Adam D. Galinsky. Web of Science shows that he is on track to become a social psychologists with an H-Index of 100. He currently has 213 articles with 14,004 citations and an H-Index of 62.
Several of Adam D. Galinsky’s Psychological Science articles published between 2009-2012 were examined by Greg Francis and showed signs of questionable research practices. This is to be expected because the use of QRPs was the norm in social psychology. The more interesting question is how a productive and influential social psychologists like Adam D. Galinsky responded to the replication crisis. Given his large number of articles, it is possible to examine this quantiatively by z-curving the automatically extracted test-statistics of the articles. Although automatic extraction has the problem that it does not distinguish between focal and non-focal tests, it has the advantage that it is 100% objective and can reveal changes in research practices over time.
The good news is that results have become more replicable. The average replicability for all tests was 48% (95%CI = 42%-57%) before 2012 and 61% (95%CI = 54%-69%) since then. Zooming in on p-values between .05 and .01, replicability increased from 23% to 38%.
The observed discovery rate has not changed (71% vs. 69%). Thus, articles do not report more non-significant results, although it is not clear whether articles report more non-significant focal (and fewer non-significant non-focal tests). This observed discovery rates are significantly higher than the estimated discovery rates before 2012, 26% (9%-36%) and after 2012, 40% (18%-59%). Thus, there is evidence of selection bias; that is published results are selected for significance. The extend of selection bias can be seen visually by comparing the histogram of observed non-significant results to the predicted densities shown by the grey line. This ‘file-drawer’ has decreased but is still clearly visible after 2012.
Social psychology is a large field and the response to the replication crisis has been mixed. Whereas some social psychologists are leaders in open science practices and have changed their research practices considerably, others have not. At present, journals still reward significant results and researchers who continue to use questionable research practices continue to have an advantage. The good news is that it is now possible to examine and quantify the use of questionable research practices and to take this information into account. The 2020s will show whether the field will finally take information about replicability into account and reward slow and solid results more than fast and wobbly results.
Citation: Francis G., (2014). The frequency of excess success for articles in Psychological Science. Psychon Bull Rev (2014) 21:1180–1187 DOI 10.3758/s13423-014-0601-x
The Open Science Collaboration article in Science has over 1,000 articles (OSC, 2015). It showed that attempting to replicate results published in 2008 in three journals, including Psychological Science, produced more failures than successes (37% success rate). It also showed that failures outnumbered successes 3:1 in social psychology. It did not show or explain why most social psychological studies failed to replicate.
Since 2015 numerous explanations have been offered for the discovery that most published results in social psychology cannot be replicated: decline effect (Schooler), regression to the mean (Fiedler), incompetent replicators (Gilbert), sabotaging replication studies (Strack), contextual sensitivity (vanBavel). Although these explanations are different, they share two common elements, (a) they are not supported by evidence, and (b) they are false.
A number of articles have proposed that the low replicability of results in social psychology are caused by questionable research practices (John et al., 2012). Accordingly, social psychologists often investigate small effects in between-subject experiments with small samples that have large sampling error. A low signal to noise ratio (effect size/sampling error) implies that these studies have a low probability of producing a significant result (i.e., low power and high type-II error probability). To boost power, researchers use a number of questionable research practices that inflate effect sizes. Thus, the published results provide the false impression that effect sizes are large and results are replicated, but actual replication attempts show that the effect sizes were inflated. The replicability projected suggested that effect sizes are inflated by 100% (OSC, 2015).
In an important article, Francis (2014) provided clear evidence for the widespread use of questionable research practices for articles published from 2009-2012 (pre crisis) in the journal Psychological Science. However, because this evidence does not fit the narrative that social psychology was a normal and honest science, this article is often omitted from review articles, like Nelson et al’s (2018) ‘Psychology’s Renaissance’ that claims social psychologists never omitted non-significant results from publications (cf. Schimmack, 2019). Omitting disconfirming evidence from literature reviews is just another sign of questionable research practices that priorities self-interest over truth. Given the influence that Annual Review articles hold, many readers maybe unfamiliar with Francis’s important article that shows why replication attempts of articles published in Psychological Science often fail.
Francis (2014) “The frequency of excess success for articles in Psychological Science”
Francis (2014) used a statistical test to examine whether researchers used questionable research practices (QRPs). The test relies on the observation that the success rate (percentage of significant results) should match the mean power of studies in the long run (Brunner & Schimmack, 2019; Ioannidis, J. P. A., & Trikalinos, T. A., 2007; Schimmack, 2012; Sterling et al., 1995). Statistical tests rely on the observed or post-hoc power as an estimate of true power. Thus, mean observed power is an estimate of the expected number of successes that can be compared to the actual success rate in an article.
It has been known for a long time that the actual success rate in psychology articles is surprisingly high (Sterling, 1995). The success rate for multiple-study articles is often 100%. That is, psychologists rarely report studies where they made a prediction and the study returns a non-significant results. Some social psychologists even explicitly stated that it is common practice not to report these ‘uninformative’ studies (cf. Schimmack, 2019).
A success rate of 100% implies that studies required 99.9999% power (power is never 100%) to produce this result. It is unlikely that many studies published in psychological science have the high signal-to-noise ratios to justify these success rates. Indeed, when Francis applied his bias detection method to 44 studies that had sufficient results to use it, he found that 82 % (36 out of 44) of these articles showed positive signs that questionable research practices were used with a 10% error rate. That is, his method could at most produce 5 significant results by chance alone, but he found 36 significant results, indicating the use of questionable research practices. Moreover, this does not mean that the remaining 8 articles did not use questionable research practices. With only four studies, the test has modest power to detect questionable research practices when the bias is relatively small. Thus, the main conclusion is that most if not all multiple-study articles published in Psychological Science used questionable research practices to inflate effect sizes. As these inflated effect sizes cannot be reproduced, the effect sizes in replication studies will be lower and the signal-to-noise ratio will be smaller, producing non-significant results. It was known that this could happen since 1959 (Sterling, 1959). However, the replicability project showed that it does happen (OSC, 2015) and Francis (2014) showed that excessive use of questionable research practices provides a plausible explanation for these replication failures. No review of the replication crisis is complete and honest, without mentioning this fact.
Limitations and Extension
One limitation of Francis’s approach and similar approaches like my incredibility Index (Schimmack, 2012) is that p-values are based on two pieces of information, the effect size and sampling error (signal/noise ratio). This means that these tests can provide evidence for the use of questionable research practices, when the number of studies is large, and the effect size is small. It is well-known that p-values are more informative when they are accompanied by information about effect sizes. That is, it is not only important to know that questionable research practices were used, but also how much these questionable practices inflated effect sizes. Knowledge about the amount of inflation would also make it possible to estimate the true power of studies and use it as a predictor of the success rate in actual replication studies. Jerry Brunner and I have been working on a statistical method that is able to to this, called z-curve, and we validated the method with simulation studies (Brunner & Schimmack, 2019).
I coded the 195 studies in the 44 articles analyzed by Francis and subjected the results to a z-curve analysis. The results are shocking and much worse than the results for the studies in the replicability project that produced an expected replication rate of 61%. In contrast, the expected replication rate for multiple-study articles in Psychological Science is only 16%. Moreover, given the fairly large number of studies, the 95% confidence interval around this estimate is relatively narrow and includes 5% (chance level) and a maximum of 25%.
There is also clear evidence that QRPs were used in many, if not all, articles. Visual inspection shows a steep drop at the level of significance, and the only results that are not significant with p < .05 are results that are marginally significant with p < .10. Thus, the observed discovery rate of 93% is an underestimation and the articles claimed an amazing success rate of 100%.
Correcting for bias, the expected discovery rate is only 6%, which is just shy of 5%, which would imply that all published results are false positives. The upper limit for the 95% confidence interval around this estimate is 14, which would imply that for every published significant result there are 6 studies with non-significant results if file-drawring were the only QRP that was used. Thus, we see not only that most article reported results that were obtained with QRPs, we also see that massive use of QRPs was needed because many studies had very low power to produce significant results without QRPs.
Social psychologists have used QRPs to produce impressive results that suggest all studies that tested a theory confirmed predictions. These results are not real. Like a magic show they give the impression that something amazing happened, when it is all smoke and mirrors. In reality, social psychologists never tested their theories because they simply failed to report results when the data did not support their predictions. This is not science. The 2010s have revealed that social psychological results in journals and text books cannot be trusted and that influential results cannot be replicated when the data are allowed to speak. Thus, for the most part, social psychology has not been an empirical science that used the scientific method to test and refine theories based on empirical evidence. The major discovery in the 2010s was to reveal this fact, and Francis’s analysis provided valuable evidence to reveal this fact. However, most social psychologists preferred to ignore this evidence. As Popper pointed out, this makes them truly ignorant, which he defined as “the unwillingness to acquire knowledge.” Unfortunately, even social psychologists who are trying to improve it wilfully ignore Francis’s evidence that makes replication failures predictable and undermines the value of actual replication studies. Given the extent of QRPs, a more rational approach would be to dismiss all evidence that was published before 2012 and to invest resources in new research with open science practices. Actual replication failures were needed to confirm predictions made by bias tests that old studies cannot be trusted. The next decade should focus on using open science practices to produce robust and replicable findings that can provide the foundation for theories.
One of the worst articles about the decade of replication failures is the “Psychology’s Renaissance” article by the datacolada team (Leif Nelson, Joseph Simmons, & Uri Simonsohn).
This is not your typical Annual Review article that aims to give a review over developments in the field. it is an opinion piece filled with bold claims that lack empirical evidence.
The worst claim is that p-hacking is so powerful that pretty much every study can be made to work.
“Experiments that work are sent to a journal, whereas experiments that fail are sent to the file drawer (Rosenthal 1979). We believe that this “file-drawer explanation” is incorrect. Most failed studies are not missing. They are published in our journals, masquerading as successes.”
We can all see that not publishing failed studies is a bit problematic. Even Bem’s famous manual for p-hackers warned that it is unethical to hide contradictory evidence. “The integrity of the scientific enterprise requires the reporting of disconfirming results” (Bem). Thus, the idea that researchers are sitting on a pile of failed studies that they failed to disclose makes psychologists look bad and we can’t have that in Fiske’s Annual Review of Psychology journal. Thus, psychologists must have been doing something that is not dishonest and can be sold as normal science.
“P-hacking is the only honest and practical way to consistently get underpowered studies to be statistically significant. Researchers did not learn from experience to increase their sample sizes precisely because their underpowered studies were not failing.” (p. 515).
This is utter nonsense. First, researchers have file-drawers of studies that did not work. Just ask them and they may tell you that they do.
“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)
Leading social psychologists, Gilbert and Wilson provide an even more detailed account of their research practices that produce many non-significant results that are not reported (a.k.a. a file drawer), which has been preserved thanks to Greg Francis.
First, it’s important to be clear about what “publication bias” means. It doesn’t mean that anyone did anything wrong, improper, misleading, unethical, inappropriate, or illegal. Rather it refers to the well known fact that scientists in every field publish studies whose results tell them something interesting about the world, and don’t publish studies whose results tell them nothing. Francis uses sophisticated statistical tools to discover what everyone already knew—and what he could easily have discovered simply by asking us. Yes, of course we ran some studies on “consuming experience” that failed to show interesting effects and are not reported in our JESP paper. Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know. Failed studies are often (though not always) inconclusive, which is why they are often (but not always) unpublishable. So yes, we had to mess around for a while to establish a paradigm that was sensitive and powerful enough to observe the effects that we had hypothesized. In one study we might have used foods that didn’t differ sufficiently in quality, in another we might have made the metronome tick too fast for people to chew along. Exactly how good a potato chip should be and exactly how fast a person can chew it are the kinds of mundane things that scientists have to figure out in preliminary testing, and they are the kinds of mundane things that scientists do not normally report in journals (but that they informally share with other scientists who work on similar phenomenon). Looking back at our old data files, it appears that in some cases we went hunting for potentially interesting mediators of our effect (i.e., variables that might make it larger or smaller) and although we replicated the effect, we didn’t succeed in making it larger or smaller. We don’t know why, which is why we don’t describe these blind alleys in our paper. All of this is the hum-drum ordinary stuff of day-to-day science.
Aside from this anecdotal evidence, the datacolada crew actually had access to empirical evidence in an article that they cite, but maybe never read. An important article in the 2010s reported a survey of research practices (John, Loewenstein, & Prelec, 2012). The survey asked about several questionable research practices, including not reporting entire studies that failed to support the main hypothesis.
Not reporting studies that “did not work” was the third most frequently used QRP. Unfortunately, this result contradicts datacolada’s claim that there are no studies in file-drawers and so they ignore this inconvenient empirical fact to tell their fairy tail of honest p-hackers that didn’t know better until 2011 when they published their famous “False Positive Psychology” article.
This is a cute story that isn’t supported by evidence, but that has never stopped psychologists from writing articles that advance their own career. The beauty of review articles is that you don’t even have to phack data. You just pick and choose citations or make claims without evidence. As long as the editor (Fiske) likes what you have to say, it will be published. Welcome to psychology’s renaissance; same bullshit as always.
The statistics wars go back all the way to Fisher, Pearson, and Neyman-Pearson(Jr), and there is no end in sight. I have no illusion that I will be able to end these debates, but at least I can offer a fresh perspective. Lately, statisticians and empirical researchers like me who dabble in statistics have been debating whether p-values should be banned and if they are not banned outright whether they should be compared to a criterion value of .05 or .005 or be chosen on an individual basis. Others have advocated the use of Bayes-Factors.
However, most of these proposals have focused on the traditional approach to test the null-hypothesis that the effect size is zero. Cohen (1994) called this the nil-hypothesis to emphasize that this is only one of many ways to specify the hypothesis that is to be rejected in order to provide evidence for a hypothesis.
For example, a nil-hypothesis is that the difference in the average height of men and women is exactly zero). Many statisticians have pointed out that a precise null-hypothesis is often wrong a priori and that little information is provided by rejecting it. The only way to make nil-hypothesis testing meaningful is to think about the nil-hypothesis as a boundary value that distinguishes two opposing hypothesis. One hypothesis is that men are taller than women and the other is that women are taller than men. When data allow rejecting the nil-hypothesis, the direction of the mean difference in the sample makes it possible to reject one of the two directional hypotheses. That is, if the sample mean height of men is higher than the sample mean height of women, the hypothesis that women are taller than men can be rejected.
However, the use of the nil-hypothesis as a boundary value does not solve another problem of nil-hypothesis testing. Namely, specifying the null-hypothesis as a point value makes it impossible to find evidence for it. That is, we could never show that men and women have the same height or the same intelligence or the same life-satisfaction. The reason is that the population difference will always be different from zero, even if this difference is too small to be practically meaningful. A related problem is that rejecting the nil-hypothesis provides no information about effect sizes. A significant result can be obtained with a large effect size and with a small effect size.
In conclusion, nil-hypothesis testing has a number of problems, and many criticism of null-hypothesis testing are really criticism of nil-hypothesis testing. A simple solution to the problem of nil-hypothesis testing is to change the null-hypothesis by specifying a minimal effect size that makes a finding theoretically or practically useful. Although this effect size can vary from research question to research question, Cohen’s criteria for standardized effect sizes can give some guidance about reasonable values for a minimal effect size. Using the example of mean differences, Cohen considered an effect size of d = .2 small, but meaningful. So, it makes sense to set a criterion for a minimum effect size somewhere between 0 and .2, and d = .1 seems a reasonable value.
We can even apply this criterion retrospectively to published studies with some interesting implications for the interpretation of published results. Shifting the null-hypothesis from d = 0 to d < abs(.1), we are essentially raising the criterion value that a test statistic has to meet in order to be significant. Let me illustrate this first with a simple one-sample t-test with N = 100.
Conveniently, the sampling error for N = 100 is 1/sqrt(100) = .1. To achieve significance with alpha = .05 (two-tailed) and H0:d = 0, the test statistic has to be greater than t.crit = 1.98. However, if we change H0 to d > abs(.1), the t-distribution is now centered at the t-value that is expected for an effect size of d = .1. The criterion value to get significance is now t.crit = 3.01. Thus, some published results that were able to reject the nil-hypothesis would be non-significant when the null-hypothesis specifies a range of values between d = -.1 to .1.
If the null-hypothesis is specified in terms of standardized effect sizes, the critical values vary as a function of sample size. For example, with N = 10 the critical t-value is 2.67, with N = 100 it is 3.01, and with N = 1,000 it is 5.14. An alternative approach is to specify H0 in terms of a fixed test statistic which implies different effect sizes for the boundary value. For example, with t = 2.5, the effect sizes would be d = .06 with N = 10, d = .05 with N = 100, and d = .02 with N = 1000. This makes sense because researchers should use larger samples to test weaker effects. The example also shows that a t-value of 2.5 specifies a very narrow range of values around zero. However, the example was based on one-sample t-tests. For the typical comparison of two groups, a criterion value of 2.5 corresponds to an effect size of d = .1 with N = 100. So, while t = 2.5 is arbitrary, it is a meaningful value to test for statistical significance. With N = 100, t(98) = 2.5 corresponds to an alpha criterion of .014, which is a bit more stringent than .05, but not as strict as a criterion value of .005. With N = 100, alpha = .005 corresponds to a criterion value of t.crit = 2.87, which implies a boundary value of d = .17.
In conclusion, statistical significance depends on the specification of the null-hypothesis. While it is common to specify the null-hypothesis as an effect size of zero, this is neither necessary, nor ideal. An alternative approach is to (re)specify the null-hypothesis in terms of a minimum effect size that makes a finding theoretically interesting or practically important. If the population effect size is below this value, the results could also be used to show that a hypothesis is false. Examination of various effect sizes shows that criterion values in the range between 2 and 3 provide can be used to define reasonable boundary values that vary around a value of d = .1
The problem with t-distributions is that they differ as a function of the degrees of freedom. To create a common metric it is possible to convert t-values into p-values and then to convert the p-values into z-scores. A z-score of 2.5 corresponds to a p-value of .01 (exact .0124) and an effect size of d = .13 with N = 100 in a between-subject design. This seems to be a reasonable criterion value to evaluate statistical significance when the null-hypothesis is defined as a range of smallish values around zero and alpha is .05.
Shifting the significance criterion in this way can dramatically change the evaluation of published results, especially results that are just significant, p < .05 & p > .01. There have been concerns that many of these results have been obtained with questionable research practices that were used to reject the nil-hypothesis. However, these results would not be strong enough to reject the modified hypothesis that the population effect size exceeds a minimum value of theoretical or practical significance. Thus, no debates about the use of questionable research practices are needed. There is also no need to reduce the type-I error rate at the expense of increasing the type-II error rate. It can be simply noted that the evidence is insufficient to reject the hypothesis that the effect size is greater than zero but too small to be important. This would shift any debates towards discussion about effect sizes and proponents of theories would have to make clear which effect sizes they consider to be theoretically important. I believe that this would be more productive than quibbling over alpha levels.
To demonstrate the implications of redefining the null-hypothesis, I use the results of the replicability project (Open Science Collaboration, 2015). The first z-curve shows the traditional analysis for the nil-hypothesis and alpha = .05, which has z = 1.96 as the criterion value for statistical significance (red vertical line).
Figure 1 shows that 86 out of 90 studies reported a test-statistic that exceeded the criterion value of 1.96 for H0:d = 0, alpha = .05 (two-tailed). The other four studies met the criterion for marginal significance (alpha = .10, two-tailed or .05 one-tailed). The figure also shows that the distribution of observed z-scores is not consistent with sampling error. The steep drop at z = 1.96 is inconsistent with random sampling error. A comparison of the observed discovery rate (86/90, 96%) and the expected discovery rate 43% shows evidence that the published results are selected from a larger set of studies/tests with non-significant results. Even the upper limit of the confidence interval around this estimate (71%) is well below the observed discovery rate, showing evidence of publication bias. Z-curve estimates that only 60% of the published results would reproduce a significant result in an actual replication attempt. The actual success rate for these studies was 39%.
Results look different when the null-hypothesis is changed to correspond to a range of effect sizes around zero that correspond to a criterion value of z = 2.5. Along with shifting the significance criterion, z-curve is also only fitted to studies that produced z-scores greater than 2.5. As questionable research practices have a particularly strong effect on the distribution of just significant results, the new estimates are less influenced by these practices.
Figure 2 shows the results. Most important, the observed discovery rate dropped from 96% to 61%, indicating that many of the original results provided just enough evidence to reject the nil-hypothesis, but not enough evidence to rule out even small effect sizes. The observed discovery rate is also more in line with the expected discovery rate. Thus, some of the missing non-significant results may have been published as just significant results. This is also implied by the greater frequency of results with z-scores between 2 and 2.5 than the model predicts (grey curve). However, the expected replication rate of 63% is still much higher than the actual replication rate with a criterion value of 2.5 (33%). Thus, other factors may contribute to the low success rate in the actual replication studies of the replicability project.
In conclusion, statisticians have been arguing about p-values, significance levels, and Bayes-Factors. Proponents of Bayes-Factors have argued that their approach is supreme because Bayes-Factors can provide evidence for the null-hypothesis. I argue that this is wrong because it is theoretically impossible to demonstrate that a population effect size is exactly zero or any other specific value. A better solution is to specify the null-hypothesis as a range of values that are too small to be meaningful. This makes it theoretically possible to demonstrate that a population effect size is above or below the boundary value. This approach can also be applied retrospectively to published studies. I illustrate this by defining the null-hypothesis as the region of effect sizes that is defined by the effect size that corresponds to a z-score of 2.5. While a z-score of 2.5 corresponds to p = .01 (two-tailed) for the nil-hypothesis, I use this criterion value to maintain an error rate of 5% and to change the null-hypothesis to a range of values around zero that becomes smaller as sample sizes increase.
As p-hacking is often used to just reject the nil-hypothesis, changing the null-hypothesis to a range of values around zero makes many ‘significant’ results non-significant. That is, the evidence is too weak to exclude even trivial effect sizes. This does not mean that the hypothesis is wrong or that original authors did p-hack their data. However, it does mean that they can no longer point to their original results as empirical evidence. Rather they have to conduct new studies to demonstrate with larger samples that they can reject the new null-hypothesis that the predicted effect meets some minimal standard of practical or theoretical significance. With a clear criterion value for significance, authors also risk to obtain evidence that positively contradicts their predictions. Thus, the biggest improvement that arises form rethinking null-hypothesis testing is that authors have to specify effect sizes a priori and that that studies can provide evidence for and against a zero. Thus, changing the nil-hypothesis to a null-hypothesis with a non-null value makes it possible to provide evidence for or against a theory. In contrast, computing Bayes-Factors in favor of the nil-hypothesis fails to achieve this goal because the nil-hypothesis is always wrong, the real question is only how wrong.
Wegner’s article “The Premature Demise of the Solo Experiment” in PSPB (1992) is an interesting document for meta-psychologists. It provides some insight into the thinking of leading social psychologists at the time; not only the author, but reviewers and the editor who found this article worthy of publishing, and numerous colleagues who emailed Wegner with approving comments.
The article starts with the observation that in the 1990s social psychology journals increasingly demanded that articles contain more than one study. Wegner thinks that the preference of multiple-study articles is a bias rather than a preference in favour of stronger evidence.
“it has become evident that a tremendous bias against the “solo” experiment exists that guides both editors and reviewers” (p. 504).
The idea of bias is based on the assumption that rejection a null-hypothesis with a long-run error-probability of 5% is good enough to publish exciting new ideas and give birth to wonderful novel theories. Demanding even just one replication of this finding would create a lot more burden without any novel insights just to lower this probability to 0.25%.
“But let us just think a moment about the demise of the solo experiment. Here we have a case in which skepticism has so overcome the love of ideas that we seem to have squared the probability of error we are willing to allow. Once, p < .05 was enough. Now, however, we must prove things twice. The multiple experiment ethic has surreptitiously changed alpha to .0025 or below.”
That’s right. The move from solo-experiment to multiple-study articles shifted the type-I error probability. Even a pair of studies reduced the type-I error probability more than the highly cited and controversial call to move alpha from .05 to .005. A pair of studies with p < .05 reduces the .005 probability by 50%!
Wegner also explains why journals started demanding multiple studies.
After all, the statistical reasons for multiple experiments are obvious-what better protection of the truth than that each article contain its own replication? (p. 505)
Thus, concerns about replicabilty in social psychology were prominent in the early 1990s, twenty years before the replication crisis. And demanding replication studies was considered to be a solution to this problem. If researchers were able to replicate their findings, ideally with different methods, stimuli, and dependent variables, the results are robust and generalizable. So much for the claim that psychologists did not value or conduct replication studies before the open science movement was born in the early 2010.
Wegner also reports about his experience with attempting to replicate his perfectly good first study.
“Sometimes it works wonderfully….more often than not, however, we find the second experiment is harder to do than the first…Even if we do the exact same experiment again” (p. 506).
He even cheerfully acknowledge that the first results are difficult to replicate because the first results were obtained with some good fortune.
“Doing it again, we will be less likely to find the same thing even if it is true, because the error variance regresses our effects to the mean. So we must add more subjects right off the bat. The joy of discovery we felt on bumbling into the first study is soon replaced by the strain of collecting an all new and expanded set of data to fend off the pointers [pointers = method-terrorists]” (p. 506).
Wegner even thinks that publishing these replication studies is pointless because readers expect the replication study to work. Sure, if the first study worked, so will the second.
This is something of a nuisance in light of the reception that our second experiment will likely get Readers who see us replicate our own findings roll their eyes and say “Sure,” and we wonder why we’ve even gone to the trouble.
However, he fails to examine more carefully why a successful replication study receives only a shoulder-shrug from readers. After all, his own experience was that it was quite difficult to get these replication studies to work. Doesn’t this mean readers should be at the edge of their seats and wonder whether the original result was a false positive or whether it can actually be replicated? Isn’t the second study the real confirmatory test where the rubber hits the road? Insiders of course know that this is not the case. The second study works because it would not have been included in the multiple-study article if it hadn’t worked. That is after all how the field operated. Everybody had the same problems to get studies to work that Wegner describes, but many found a way to get enough studies to work to meet the demands of the editor. The number of studies was just a test of the persistence of a researcher, not a test of a theory. And that is what Wegner rightfully criticized. What is the point of producing a set of studies with p < .05, if more studies do not strengthen the evidence for a claim. We might as well publish a single finding and then move on to find more interesting ideas and publish them with p-values less than .05. Even 9 studies with p < .05 don’t mean that people can foresee the future (Bem, 2011), but it is surely an interesting idea.
Wegner also comments on the nature of replication studies that are now known as conceptual replication studies. The justification for conceptual replication studies is that they address limitations that are unavoidable in a single study. For example, including a manipulation check may introduce biases, but without one, it is not clear whether a manipulation worked. So, ideally the effect could be demonstrated with and without a manipulation check. However, this is not how conceptual replication studies are conducted.
We must engage in a very delicate “tuning” process to dial in a second experiment that is both sufficiently distant from and sufficiently similar to the original. This tuning requires a whole set of considerations and skills that have nothing to do with conducting an experiment. We are not trained in multi experiment design, only experimental design, and this enterprise is therefore largely one of imitation, inspiration, and luck.
So, to replicate original results that were obtained with a healthy dose of luck, more luck is needed in finding a condition that works, or simply to try often enough until luck strikes again.
Given the negative attitude towards rigor, Wegner and colleagues also used a number of tricks to make replication studies work.
“Some of us use tricks to disguise our solos. We run “two experiments” in the same session with the same subjects and write them up separately. Or we run what should rightfully be one experiment as several parts, analyzing each separately and writing it up in bite-sized pieces as a multi experiment Many times, we even hobble the first experiment as a way of making sure there will be something useful to do when we run another.” (p. 506).
If you think this sounds like some charlatans who enjoy pretending to be scientists, your impression is rather accurate because the past decade has shown that many of these internal replications in multiple study articles were obtained with tricks and provide no empirical test of empirical hypotheses; p-values are just for show so that it looks like science, but it isn’t.
My own views on this issue are that the multiple study format was a bad fix for a real problem. The real problem was that it was all to easy to get p < .05 in a single study to make grand claims about the causes of human behavior. Multiple-study articles didn’t solve this problem because researchers found ways to get significant results again and again even when their claims were false.
The failure of multiple-study articles to fix psychology has some interesting lessons for the current attempts to improve psychology. Badges for data sharing and preregistration will not improve psychology, if they are being gamed like psychologists gamed the multiple-study format. Ultimately, science can only advance if results are reported honestly and if results are finally able to falsify theoretical predictions. Psychology will only become a science when brilliant novel ideas can be proven false and scientific rigor is prized as much as the creation of interesting ideas. Coming up with interesting ideas is philosophy. Psychology emerged as a distinct discipline in order to subject those theories to empirical tests. After a century of pretending to do so, it is high time to do so for real.