# Loken and Gelman are still wrong

## Introduction

Loken and Gelman published a brief essay on “Measurement error and the replication crisis” in the magazine Science. As it turns out, the claims in the article are ambiguous because key terms like measurement error are not properly defined. However, the article does present the results of simulation studies in a figure. The key figure is Figure 3.

This figure is used to support the claim that “of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small.”

Some points about this claim are important. It is not a claim about a single study. In a single study, measurement error, sampling error, and other factors CAN produce a stronger result with a less reliable measure, just like some people can win the lottery, even though it is a very unlikely event. The claim is clearly about the outcome in the long run after many repeated trials. That is also implied by a figure that is based on simulations of many repeated trials of the same study. What does the figure imply? It implies that measurement error attenuates observed correlations (or regression coefficients with measurement error in the predictor variable, x) in large samples. The reason is simply that random measurement error adds variance to a variable that is unrelated to the outcome measure. As a result, the observed correlation is a mixture of the true relationship and a correlation of zero, and the mixture depends on the amount of random measurement error in the predictor variable.
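The attenuation in large samples is easy to check numerically. The following is a minimal sketch, not the simulation from the article: it assumes an error-free predictor with variance 1, a true slope of .15, and a reliability of .8 for the observed predictor. All variable names are made up for this illustration.

```r
set.seed(1)
N   <- 1e5   # large sample, so sampling error is negligible
rel <- 0.8   # assumed reliability of the observed predictor

true_x <- rnorm(N)                                   # error-free predictor
y      <- 0.15 * true_x + rnorm(N, sd = sqrt(1 - 0.15^2))

# Error variance (1 - rel) / rel gives reliability = 1 / (1 + error variance) = rel
obs_x  <- true_x + rnorm(N, sd = sqrt((1 - rel) / rel))

b_true <- unname(coef(lm(y ~ true_x))[2])   # close to .15
b_obs  <- unname(coef(lm(y ~ obs_x))[2])    # close to .15 * .8 = .12
c(b_true = b_true, b_obs = b_obs)
```

With error only in the predictor, the slope shrinks in proportion to the reliability, which is the "mixture" of the true relationship and zero described above.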

Selection for significance on the other hand has the opposite effect. To obtain significance, the observed correlation has to have a minimum value so that the observed correlation is approximately twice as large as the sampling error (t ~ 2 equals p < .05, two-tailed). In large samples, sampling error is small and correlations of r = .15 are significant in most cases (i.e., the study has high power). When 99% of all studies are significant, selecting for significance to get a success rate of 100% is irrelevant. However, in small samples with N = 50, a small correlation of r = .15 is not enough to get significance. Thus, all significant correlations are inflated. Measurement error attenuates correlations and makes it even harder to get significant results. With reliability = .8 and a correlation of r = .15, the expected correlation is only .15 * .8 = .12 and more inflation is needed to get significance.
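The minimum correlation needed for significance can be computed directly. This is a small sketch that uses the standard conversion between t and r; `critical_r` is a hypothetical helper name, not a function from any package.

```r
# Smallest absolute correlation that reaches p < alpha (two-tailed) with n cases
critical_r <- function(n, alpha = 0.05) {
  df     <- n - 2
  t_crit <- qt(1 - alpha / 2, df)
  t_crit / sqrt(df + t_crit^2)
}

critical_r(50)     # roughly .28, well above a true correlation of .15
critical_r(1000)   # roughly .06, well below a true correlation of .15
```

With N = 50, every significant correlation has to be nearly twice as large as the true value of .15, which is why all significant results are inflated in small samples.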

Figure 3 in Loken and Gelman’s article suggests that selection for significance with unreliable measures produces even more inflated effect size estimates than selection for significance without measurement error. This is implied by the results of a simulation study that produced a majority (over 50%) of outcomes where the effect size estimate was higher (and more inflated) when random measurement error was added than in the ideal setting without random measurement error. Loken and Gelman’s claim “of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small” is based on this result of their simulation studies. With N = 50, r = .15, and reliability of .8, a majority of the comparisons showed a stronger effect size estimate for the simulation with random error than for the simulation without random error.

I believe that this outcome is based on an error in their simulation studies. The simulation does not clearly distinguish between sampling error and random measurement error. I have tried to make this point repeatedly on Gelman’s blog, but this discussion and my previous blog post (that Gelman probably did not read) failed to resolve this controversy. However, it helped me to understand the source of the disagreement more clearly. I maintain that Gelman does not make a clear distinction between sampling error (i.e., even with perfectly reliable measures, results will vary from sample to sample and this variability is larger in small samples, STATS101) and random measurement error (i.e., two measures of the same construct are not perfectly correlated with each other, NOT A TOPIC OF STATISTICS, which typically assumes perfect measures). Based on this insight, I wrote a new R script that clearly distinguishes between sampling error and random measurement error. I ran the script 10,000 times. Here are the key results.

The simulation ensured that reliability in each run is exactly 80%.

The expected effect sizes are r = .15 for the true relationship and r = .12 for the measure with 80% reliability. The average effect sizes across the 10,000 simulations match these expected values. We also see that sampling error produces huge variability in specific runs. However, even extreme deviations are attenuated by random measurement error. Thus, random measurement error makes values less extreme.

What about sign errors? We don’t really know the true correlation and two-tailed testing allows researchers to reject H0 with the wrong sign. To allow for this possibility, we can compute the absolute correlations.

This does not matter. The results for the measure with random error are still lower and less extreme.

Now we can examine how conditioning on significance influences the results.

Once more the effect size estimates for the true correlation are stronger and more extreme than those for the measure with random measurement error. This is also true for absolute effect size estimates.

Loken and Gelman’s Figure 3 required the direct comparison of two outcomes in the same run after selection for significance. This creates a problem because sometimes one result will be significant and the other one will not be significant. As a result, the comparison is biased because it compares estimates after selection for significance with estimates without selection for significance. However, even with this bias in favor of the unreliable measure, random measurement error produced weaker effect size estimates in the majority of all cases.

## Conclusion

In short, these results confirm Hausman’s iron law of econometrics that random measurement error typically attenuates effect size estimates. Typically, of course, does not mean always. However, Loken and Gelman claimed that they identified a situation in which the iron law of econometrics does not apply and can lead to false inferences. They claimed that (a) in small samples and (b) after selection for significance, random measurement error will produce stronger effect size estimates not once or twice but IN A MAJORITY of studies. This claim was implied by the results of their simulations displayed in Figure 3. Here I showed that their simulation fails to simulate the influence of random measurement error. Holding reliability constant at 80% produces the expected outcome that random measurement error is more likely to attenuate effect size estimates than to inflate them, even in small samples and after selection for significance. Thus, researchers are justified to claim that they could have obtained stronger correlations with a more reliable measure or to use latent variable models to correct for unreliability. What they cannot do is to claim that the true population correlation is stronger than their observed correlation because this claim ignores the influence of selection for significance that inevitably inflates observed correlations in small samples with small effect sizes. It is also not correct to assume that two wrongs (selecting for significance with unreliable measures) make one right. Robust and replicable results require good data. Effect sizes of correlations should only be interpreted if measures have demonstrated good reliability (and validity, but that is another topic) and when sampling error is small enough to produce a meaningful range of plausible values.

### New Simulation of Reliability

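```r
# Simulation settings
N     <- 50      # sample size of each simulated study
REL   <- .80     # reliability of the observed measures
n.sim <- 10000   # number of simulated studies

res <- c()

for (i in 1:n.sim) {

  # True-score component with a sample variance of exactly REL
  SV <- scale(rnorm(N)) * sqrt(REL)

  # Random measurement error for x1: exactly uncorrelated with SV in this
  # sample and with a variance of exactly 1 - REL
  x1 <- rnorm(N)
  x1 <- residuals(lm(x1 ~ SV))
  x1 <- scale(x1)
  x1 <- x1 * sqrt(1 - REL)

  # Random measurement error for x2: exactly uncorrelated with SV and
  # with the error in x1
  x2 <- rnorm(N)
  x2 <- residuals(lm(x2 ~ x1 + SV))
  x2 <- scale(x2)
  x2 <- x2 * sqrt(1 - REL)

  # Observed measures = true score + error (variance 1, reliability REL)
  x1 <- x1 + SV
  x2 <- x2 + SV

  # Outcome variable; sampling variability enters through this noise term
  y <- .15 * SV + rnorm(N) * sqrt(1 - .15^2)

  # Store the variances, the correlation between the two measures, and the
  # slope row (estimate, SE, t, p) of each regression
  r <- c(var(x1), var(x2), cor(x1, x2),
         summary(lm(y ~ SV))$coefficients[2, ],
         summary(lm(y ~ x1))$coefficients[2, ],
         summary(lm(y ~ x2))$coefficients[2, ])

  res <- rbind(res, r)

} # End of sim

summary(res)
```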

# Open Science Practices and Replicability

## Summary

A recent article in the flashy journal “Nature Human Behaviour” that charges authors or their universities \$6,000 published the claim “high replicability of newly discovered social-behavioural findings is achievable” (Protzko et al., 2023). This is good news for social scientists and consumers of social psychology after a decade of replication failures caused by questionable research practices, including fraud.

So, what is the magic formula to produce replicable and credible findings in the social sciences?

The paper attributes success to the implementation of four rigour-enhancing practices, namely confirmatory tests, large sample sizes, preregistration, and methodological transparency. The problem with this multi-pronged approach is that it is not possible to say which of these features are necessary or sufficient to produce replicable results.

I analyze the results of this article with the R-Index. Based on these results, I conclude that none of the four rigor-enhancing practices are necessary to produce highly replicable results. The key ingredients for high replicability are honesty and high power. It is wrong to confuse large samples (N = 1,500) with high power. As shown, sometimes N = 1,500 has low power and sometimes much smaller samples are sufficient to have high power.

## Introduction

The article reports 16 studies. Each study was proposed by one lab and the lab reported the results of a confirmatory test that produced significant results in 15 of the 16 studies. The replication studies by the other three labs produced significant results in 79% of the studies.

I predicted these replication outcomes with the Replicability-Index (R-Index). The R-Index is a simple method to estimate replicability for a small set of studies. The key insight of the R-Index is that the outcome of unbiased replication studies is a function of the mean (I once assumed the median would be better, but this was wrong) power of the original studies (Brunner & Schimmack, 2021). Unfortunately, it can be difficult to estimate the true mean power based on original studies because original studies are often selected for significance and selection for significance leads to inflated estimates of observed power. The R-Index adjusts for this inflation by comparing the success rate (percentage of significant results) to the mean observed power. If the success rate is higher than the mean observed power, selection bias is present and the mean power is inflated. A simple heuristic to correct for this inflation is to subtract the inflation from the observed power.
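The heuristic described above amounts to two subtractions. A minimal sketch follows; `r_index` is a hypothetical helper name and the input values are made up for illustration, not taken from the article.

```r
# R-Index: mean observed power minus the inflation,
# where inflation = success rate - mean observed power
r_index <- function(success_rate, mean_observed_power) {
  inflation <- success_rate - mean_observed_power
  mean_observed_power - inflation
}

# Hypothetical example: 90% significant results, 70% mean observed power
r_index(success_rate = 0.90, mean_observed_power = 0.70)   # .70 - .20 = .50
```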

The article reported the outcomes of “original” (blue = self-replication) and replication studies (green = independent replications by other labs) in Figure 1.

To obtain estimates of observed power, I used the point estimates of the original (blue) studies and the lower limit of the 95%CI. I converted these statistics into z-scores, using the formula z = ES / ((ES – LL.CI) / 2). The z-scores were converted into p-values and p-values below .05 were considered significant. Visual inspection of Figure 1 shows that one original study (blue) did not have a statistically significant result (i.e., the 95%CI includes a value of zero). Thus, the actual success rate was 15/16 = 94%.
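The conversion from a point estimate and a lower confidence limit to a z-score, a p-value, and observed power can be sketched as follows. The function names and the example values are hypothetical; the divisor of 2 follows the approximation in the text (the exact value would be about 1.96), and observed power is assumed here to be two-sided power at alpha = .05.

```r
# z-score from an effect size and the lower limit of its 95% CI
# (ci_mult = 2 approximates the exact multiplier of about 1.96)
z_from_ci <- function(es, ll_ci, ci_mult = 2) {
  se <- (es - ll_ci) / ci_mult
  es / se
}

# Two-tailed p-value for a z-score
p_from_z <- function(z) 2 * pnorm(-abs(z))

# Observed power: probability of a significant result if the true
# z-score equals the observed one (two-sided test)
observed_power <- function(z, alpha = 0.05) {
  crit <- qnorm(1 - alpha / 2)
  pnorm(abs(z) - crit) + pnorm(-abs(z) - crit)
}

# Hypothetical study: effect size .30, lower limit of the 95% CI .10
z <- z_from_ci(es = 0.30, ll_ci = 0.10)
c(z = z, p = p_from_z(z), power = observed_power(z))
```

Averaging `observed_power` across studies gives the mean observed power that is compared to the success rate.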

Table 1 shows that the mean observed power is 87%. Thus, there is evidence of a small amount of selection for significance and the predicted success rate of replication studies is .87 – .06 = .81. The actual success rate was computed as the percentage of replication studies (k = 3) that produced a significant result. The overall success rate of replication studies was 79%, which is close to the estimate of the R-Index, 81%. Finally, it is evident that power varies across studies. Nine studies had z-scores greater than 5 (the 5 sigma rule of particle physics) and all nine studies had a replication success rate of 100%. The only reason for replication failures of studies with z-scores greater than 5 is fraud or problems in the implementation of the actual replication study. In contrast, studies with z-scores below 4 have insufficient power to produce consistent significant results. The correlation between observed power and replication success rates is r = .93. This finding demonstrates empirically that power determines the outcome of unbiased replication studies.

## Discussion

Honest reporting of results is necessary to trust published results. Open Science Practices may help to ensure that results are reported honestly. This is particularly valuable for the evaluation of a single study. However, statistical tools like the R-Index can be used to examine whether a set of studies is unbiased or whether the results are biased. In the present set of 16 original studies, it detected a small bias that explains the differences in success rate for the original studies (blue, 94%) and the replication studies (green, 79%).

More importantly, the investigation of power shows that some of the studies were underpowered to reject the nil-hypothesis even with N = 1,500 because the real effect sizes were too close to zero. This shows how difficult it is to provide evidence for the absence of an important effect.

At the same time, other studies had large effect sizes and were dramatically overpowered to demonstrate an effect. As shown, z-scores of 5 are sufficient to provide conclusive evidence against a nil-hypothesis and this criterion is used in particle physics for strong hypothesis tests. Using N = 1,500 for an effect size of d = .6 is overkill. This means that researchers who cannot easily collect data from large samples can produce credible results. There are also other methods to reduce sampling error and to increase power than increasing sample sizes. Within-subject designs with many repeated trials can produce credible and replicable results with sample sizes as small as N = 8. Sample size should not be used as a criterion to evaluate studies and large samples should not be used as a criterion for good science.
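A quick power calculation illustrates the point about overkill. This sketch assumes a simple between-subject design with two equally sized groups, which may not match the designs of the actual studies; it uses `power.t.test` from base R's stats package.

```r
# Power with N = 1,500 (750 per group) for d = .6, two-sided alpha = .05
power_large <- power.t.test(n = 750, delta = 0.6, sd = 1,
                            sig.level = 0.05)$power

# Per-group sample size that already gives 99% power for d = .6
n_needed <- ceiling(power.t.test(delta = 0.6, sd = 1,
                                 sig.level = 0.05, power = 0.99)$n)

c(power_with_750_per_group = power_large, n_per_group_for_99 = n_needed)
```

Under these assumptions, a small fraction of the 1,500 participants is enough to make a non-significant result very unlikely.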

To evaluate the credibility of results in single studies, it is useful to examine confidence intervals and to see which effect sizes are excluded by the lower limit of the confidence interval. Confidence intervals that exclude zero, but not values close to zero suggest that a study was underpowered and that the true population effect size may be so close to zero that it is practically zero. In addition, p-values or z-scores provide valuable information about replicability. Results with z-scores greater than 5 are extremely likely to replicate in an exact replication study and replication failures suggest a significant moderating factor.

Finally, the present results suggest that other aspects of open science like pre-registration are not necessary to produce highly replicable results. Even exploratory results that produced strong evidence (z > 5) are likely to replicate. The reason is that luck or extreme p-hacking does not produce such extreme evidence against the null-hypothesis. A better understanding of the strength of evidence may help to produce credible results without wasting precious resources on unnecessarily large samples.

# Random Measurement Error and the Replication Crisis

The code for all simulations is available on OSF (https://osf.io/pyhmr).

P.S. I have been arguing with Andrew Gelman on his blog about his confusing and misleading article with Loken. Things have escalated and I just want to share his latest response.

Ulrich:
I’m not dragging McShane into anything. You’re the one who showed up in the comment thread, mischaracterized his paper in two different ways, and called him an “asshole” for doing something that he didn’t do. You say now that you don’t care that he cited you; earlier you called him an asshole for not citing you, even though you did.

Also, your summaries of the McShane et al. and Gelman and Loken papers are inaccurate, as are your comments about confidence intervals, as are a few zillion other things you’ve been saying in this thread.

Please just stop it with the comments here. You’re spreading false statements and wasting our time; also I don’t think this is doing you any good either. I don’t really mind the insults if you had anything useful to say (here’s an example of where I received some critical feedback that was kinda rude but still very helpful), but whatever useful you have to say, you’ve already said, and at this point you’re just going around in circles. So, no more.

The main reason to share this is that I already showed that confidence intervals are often accurate even after selection for significance and that this is even more true when studies use unreliable measures because the attenuation due to random measurement error compensates for the inflation due to selection for significance. I am not saying that this makes it ok to suppress non-significant results, but it does show that Gelman is not interested in telling researchers how to avoid misinterpretation of biased point estimates. He likes to point out mistakes in other people’s work, but he is not very good at noticing mistakes in his own work. I have repeatedly asked for feedback on my simulation results and if there are mistakes I am going to correct them. Gelman hasn’t done so and so far nobody else has. Of course, I cannot find a mistake in my own simulations. Ergo, I maintain that confidence intervals are useful to avoid misinterpretation of pointless point estimates. The real reason why confidence intervals are rarely interpreted (other than saying CI = .01 to 1.00 excludes zero, therefore the nil-hypothesis can be rejected, which is just silly nil-hypothesis testing, Cohen, 1994) is that confidence intervals in between-study designs with small samples are so wide that they do not allow strong conclusions about population effect sizes.

## Introduction

A few years ago, Loken and Gelman (2017) published an article in the Magazine “Science.” A key novel claim in this article was that random measurement error can inflate effect size estimates.

“In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance”

Language is famously ambiguous and open to interpretation. However, the article also presented a figure that seemed to support this counterintuitive conclusion.

The figure seems to suggest that with selection for significance, overestimation of effect sizes is increasingly more common in studies that use an unreliable measure rather than a reliable measure. At some point, the proportion of studies where the effect size estimate is greater with rather than without error seems to be over 50%.

Paradoxical findings are interesting, and this one attracted our attention (Schimmack & Carlson, 2017). We believed that this conclusion is based on a mistake in the simulation code. We also tried to explain the combined effects of sampling error and random measurement error on effect sizes in a short commentary that remained unpublished. We never published our extensive simulation results.

Recently, a Ph.D. student also questioned the simulation code and Andrew Gelman posted the student’s concerns on his blog (Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?). The blog post also included the simulation code.

The simulation is simple. It generates two variables with SD = 1 and a correlation ~ r = .15. It then adds 25% random measurement error to both variables, so that the two variables are measures of the former variables with 4/5 = 80% reliability. This attenuates the true correlation slightly to .15*.8 = .12. The crucial condition is when this simulation is run with a small sample size of N = 50.
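The data-generating step can be checked in a very large sample, where sampling error is negligible. This is a sketch based only on the description above, not the original simulation code; the variable names are made up.

```r
set.seed(123)
N <- 1e5     # very large sample to show the population values
r <- 0.15

x <- rnorm(N)
y <- r * x + rnorm(N, sd = sqrt(1 - r^2))

# Error with SD = .5 adds 25% variance, so reliability = 1 / 1.25 = .8
x_obs <- x + rnorm(N, sd = 0.5)
y_obs <- y + rnorm(N, sd = 0.5)

reliability_x <- var(x) / var(x_obs)
reliability_y <- var(y) / var(y_obs)
r_true <- cor(x, y)           # close to .15
r_obs  <- cor(x_obs, y_obs)   # close to .15 * .8 = .12

round(c(reliability_x, reliability_y, r_true, r_obs), 3)
```

Because both variables have 80% reliability, the correlation is attenuated by sqrt(.8 * .8) = .8.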

N = 50 is a small sample size to study an effect size of .15 or .12. So, we expect mostly non-significant results. The crucial question is what happens when researchers get lucky and obtain a statistically significant result. Would selection for significance produce a stronger effect size estimate for the perfect measure or the unreliable measure?

It is not easy to answer this question because selection for significance requires conditioning on an outcome and Loken and Gelman’s simulation has two outcomes in the simulation. The outcomes for the perfect measure are paired with the outcome for the unreliable measure. So, which outcome should be used to select for significance? Using either measure will of course benefit the measure that was used to select for significance. To avoid this problem, I simply examined all four possible outcomes, neither measure was significant, the perfect measure was significant and the unreliable was not, the unreliable was significant and the perfect was not, or both were significant. To obtain stable cell frequencies, I ran 10,000 simulations.

Here are the results.

1. Neither measure produced a significant result

4870 times the perfect measure had a higher estimate than the unreliable measure (58%)
3629 times the unreliable measure had a higher estimate than the perfect measure (42%)

2. Both measures produced a significant result

579 times the perfect measure had a higher estimate than the unreliable measure (61%)
377 times the unreliable measure had a higher estimate than the perfect measure (39%)

3. The reliable measure is significant and the unreliable measure is not significant

981 times the perfect measure had a higher estimate than the unreliable measure (100%)
0 times the unreliable measure had a higher estimate than the perfect measure (0%)

4. The unreliable measure is significant and the reliable measure is not significant

0 times the perfect measure had a higher estimate than the unreliable measure (0%)
464 times the unreliable measure had a higher estimate than the perfect measure (100%)

The main point of these results is that selection for significance will always favor the measure that is used for conditioning on significance. By definition, the effect size of a significant result will be larger than the effect size of a non-significant result given equal sample size. However, it is also clear that the unreliable measure produces fewer significant results because random measurement error attenuates the effect size and reduces power; that is, the probability to obtain a significant result.
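The four-way classification can be sketched as follows. This is an illustration, not the code that produced the counts above: it uses fewer runs, compares absolute correlations (an assumption about how "higher estimate" was defined), and all names are hypothetical.

```r
set.seed(2023)
n_sim <- 2000   # fewer runs than in the text, to keep the sketch fast
N     <- 50
r     <- 0.15

sim <- as.data.frame(t(replicate(n_sim, {
  x  <- rnorm(N)
  y  <- r * x + rnorm(N, sd = sqrt(1 - r^2))
  xx <- x + rnorm(N, sd = 0.5)   # 80% reliability
  yy <- y + rnorm(N, sd = 0.5)   # 80% reliability
  perfect <- cor.test(x, y)
  noisy   <- cor.test(xx, yy)
  c(r_perfect = unname(perfect$estimate), p_perfect = perfect$p.value,
    r_noisy   = unname(noisy$estimate),   p_noisy   = noisy$p.value)
})))

# Assign each run to one of the four significance patterns
sim$cell <- with(sim,
  ifelse(p_perfect < .05 & p_noisy < .05, "both",
  ifelse(p_perfect < .05, "perfect only",
  ifelse(p_noisy < .05, "noisy only", "neither"))))

# Which measure has the larger absolute estimate in each run?
sim$noisy_higher <- abs(sim$r_noisy) > abs(sim$r_perfect)

table(sim$cell, sim$noisy_higher)
```

In the two mixed cells the winner is fixed by construction: with equal N, a significant correlation is always larger in absolute value than a non-significant one.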

Based on these results, we can reproduce Loken and Gelman’s results that showed larger effect size estimates more often with the unreliable measure. To produce this result, they conditioned on significance for the measure with random error, but not for the measure without random measurement error. That is, they combined conditions 2 (both measures produced significant results) and 4 (ONLY the unreliable measure produced a significant result).

5. (2 + 4) The unreliable measure is significant, the reliable measure can be significant or not significant.

When we simply select for significance on the unreliable measure, we see that the unreliable measure has the stronger effect size over 50% of the time.

579 times the perfect measure had a higher estimate than the unreliable measure (41%)
377+464 = 841 times the unreliable measure had a higher estimate than the perfect measure (59%)

However, this is not a fair comparison of the two measures. Selection for significance is applied to one of them and not the other. The illusion of reversal is produced by selection bias in the simulation, not in a real world scenario where researchers use one or the other measure. This is easy to see, when we condition on the reliable measure.

6. (2 + 3) The reliable measure is significant, significance on the other measure does not matter.

579+981 = 1560 times the perfect measure had a higher estimate than the unreliable measure (81%)
377 times the unreliable measure had a higher estimate than the perfect measure (19%)

Now, we overestimate the advantage of the reliable measure. Conditioning on significance selectively for one variable and not the other produces biased simulation results that falsely suggest that an unreliable measure produces stronger correlations despite the fact that random measurement error attenuates correlation coefficients and other standardized effect size estimates.

## Extension 1: A Fair Comparison

I posted polls on Twitter (X) and in a Facebook Group that discusses methodological issues in psychological science. I created a scenario in which two labs are studying the same effect, say r = .15 in the population between two variables. Both labs use the same small sample size of N = 50. Both labs only publish when they get a statistically significant result, p < .05 (two-tailed). Both labs use a perfectly reliable predictor variable (e.g., height). The only difference between the two labs is that LAB A uses a more reliable measure (Cronbach’s alpha = .8) than LAB B (Cronbach’s alpha = .5). The question is simply which lab is more likely to publish a bigger (more biased) effect size estimate? The key difference between this scenario and Loken and Gelman’s simulation is that both outcomes are conditioned on significance. Thus, we can actually see the influence of random measurement error on correlation coefficients when results are selected for significance. The polls showed that the correct answer is not widely known. In both polls, two-thirds of respondents sided with Loken and Gelman and answered that LAB B is more likely to report larger correlations (Twitter/X, 65%; Facebook Group, 68%). They may have been influenced by Loken and Gelman’s article that claimed “in noisy research settings, poor measurement can contribute to exaggerated estimates of effect size.”

ChatGPT on Bing thinks LAB A will report stronger effect sizes, but its reasoning does not take conditioning for significance into account. “According to the results, the effect size is a measure of the magnitude of the experimental effect, and it is influenced by the reliability of the measures used. The higher the reliability, the more accurate the effect size estimate will be. Therefore, LAB A, which uses a more reliable measure (Cronbach’s alpha = .8) than LAB B (Cronbach’s alpha = .5), is more likely to publish a bigger (more biased) effect size estimate. LAB B, which uses a less reliable measure, will have more measurement error and lower statistical power, making it harder to detect the true effect size in the population.”

To obtain the correct answer, I made only two small changes to Loken and Gelman’s simulation. First, I did not add measurement error to the predictor variable, X. Second, I added different amounts of random measurement error to two outcome variables, Y1 with 80% reliable variance for LAB A, and Y2 with 50% reliable variance for LAB B. I ran 10,000 simulations to have a reasonably large number of cases after selection for significance. LAB A had more significant results because its population correlation is larger (.15 × √.8 ≈ .13) than the one for LAB B (.15 × √.5 ≈ .11; with random error in only one of the two variables, a correlation is attenuated by the square root of the reliability), and studies with larger effect sizes in the same sample size have more statistical power, a greater chance to produce a significant result. In the simulation, LAB A had 1,435 significant results (~ 14% power) and LAB B had 1,106 significant results (11% power). I then compared the first 1,106 significant results from LAB A to the 1,106 results from LAB B and computed how often LAB A had a higher effect size estimate than LAB B.

Results: LAB A had a higher effect size estimate in 569 cases (51%) and LAB B had a higher effect size estimate in 537 cases (49%). Thus, there is no reversal that less reliable measures produce stronger (more biased) correlations in studies with low power and after selection for significance. Loken and Gelman’s false conclusion is based on an error in their simulations that conditioned on significance for the unreliable measure, but not for the measure without random measurement error.
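The fair comparison can be sketched as follows. This is an illustration of the procedure described above rather than the code that produced the reported counts: it uses fewer runs, compares absolute correlations (an assumption), and the names are hypothetical.

```r
set.seed(42)
n_sim <- 5000
N     <- 50
r     <- 0.15

# SD of the added error so that the observed outcome has reliability `rel`
# (the true outcome has a population variance of 1)
err_sd <- function(rel) sqrt((1 - rel) / rel)

sims <- t(replicate(n_sim, {
  x  <- rnorm(N)                               # perfectly reliable predictor
  y  <- r * x + rnorm(N, sd = sqrt(1 - r^2))
  y1 <- y + rnorm(N, sd = err_sd(0.8))         # LAB A: 80% reliability
  y2 <- y + rnorm(N, sd = err_sd(0.5))         # LAB B: 50% reliability
  a  <- cor.test(x, y1)
  b  <- cor.test(x, y2)
  c(r_a = unname(a$estimate), p_a = a$p.value,
    r_b = unname(b$estimate), p_b = b$p.value)
}))

# Select for significance separately in each lab
sig_a <- sims[sims[, "p_a"] < .05, "r_a"]
sig_b <- sims[sims[, "p_b"] < .05, "r_b"]

# Pair the first k significant results of each lab and compare them
k <- min(length(sig_a), length(sig_b))
share_a_higher <- mean(abs(sig_a[seq_len(k)]) > abs(sig_b[seq_len(k)]))

c(power_a = length(sig_a) / n_sim,
  power_b = length(sig_b) / n_sim,
  share_a_higher = share_a_higher)
```

The key design choice is that both labs' results pass the same significance filter before they are compared, so neither measure is favored by the selection step.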

Would a more extreme scenario produce a reversal? Power is already low and nobody should compute correlation coefficients in samples with N = 20, but fMRI researchers famously reported dramatic correlations between brain and behavior in studies with N = 8 (“voodoo correlations”; Vul et al., 2012). So, I repeated the simulation with N = 20, pitting a measure with 100% reliability against a measure with 20% reliability. Given the low power, I ran 100,000 simulations to get stable results.

Results:

LAB A obtained 9,620 significant results (Power ~ 10%). LAB B obtained 6,030 (Power ~ 6%, close to chance, 5% with alpha = .05).

The comparison of the first 6,030 significant results from LAB A with the 6,030 significant results from LAB B showed that LAB A reported a stronger effect size 3,227 times (54%) and LAB B reported a stronger effect size 2,803 times (46%). Thus, more reliable measures not only help to report voodoo correlations more often, but also to report higher correlations. Clearly, using less reliable measures does not contribute to the replication crisis as Loken and Gelman claimed. Their claim is based on a mistake in their simulations that conditioned yoked outcomes on significance of the unreliable measure.

## Extension 2: Simulation with two equally unreliable measures

The next simulation is a new simulation that has two purposes. First, it drives home the message that Gelman’s simulation study unfairly biased the results in favor of the unreliable measure by conditioning on significance for this measure. Second, it provides a new insight into the contribution of unreliable measures to the replication crisis. The simulation assumes that researchers really use two dependent variables (or more) and are going to report results if at least one of the measures has a significant result. Evidently, this doesn’t really work with two perfect measures because they are perfectly correlated, r = 1. As a result, they will always show the same correlation with the independent variable. However, unreliable measures are not perfectly correlated with each other and produce different correlations. This provides room for capitalizing on chance and getting significance with low power. The lower the reliability of the measures the better. I picked a reliability of .5 for both dependent measures (Y1, Y2) and assumed that the independent variable has perfect reliability (e.g., an experimental manipulation).

1. Neither measure produced a significant result

4011 times Y1 had a higher estimate than Y2 (49%)
4109 times Y2 had a higher estimate than Y1 (51%).

2. Both measures produced a significant result.

212 times Y1 had a higher estimate than Y2 (57%)
162 times Y2 had a higher estimate than Y1 (43%).

3. Y1 is significant and Y2 is not significant

743 times Y1 had a higher estimate than Y2 (100%)
0 times Y2 had a higher estimate than Y1 (0%).

4. Y2 is significant and Y1 is not significant

0 times Y1 had a higher estimate than Y2 (0%)
763 times Y2 had a higher estimate than Y1 (100%).

The results show that using two measures with 50% reliability increases the chances of obtaining a significant result by about 750 / 10000 tries (7.5 percentage points). Thus, unreliable measures can contribute to the replication crisis if researchers use multiple unreliable measures and selectively publish results for the significant one. However, using a single unreliable measure versus a single reliable measure is not beneficial because an unreliable measure makes it less likely to obtain a significant result. Gelman’s reversal is an artifact of conditioning on one outcome. This can be easily seen by comparing the results after conditioning on significance for Y1 or Y2.

5. (2 + 3) Y1 is significant, significance of Y2 does not matter

212+743 = 955 times Y1 had a higher estimate than Y2 (85%)
162 times Y2 had a higher estimate than Y1 (15%).

6. (2 + 4) Y2 is significant, significance of Y1 does not matter

212 times Y1 had a higher estimate than Y2 (19%)
162+763 = 925 times Y2 had a higher estimate than Y1 (81%).

When we condition on significance for Y1, Y1 more often produces the stronger effect size estimate. When we condition on significance for Y2, Y2 more often produces the stronger effect size estimate. This has nothing to do with the reliability of the measures because they have the same reliability. The difference is illusory because selection for significance in the simulation produces biased results.
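The two-measure simulation can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the numbers above. It assumes N = 50, a true correlation of r = .15, normally distributed variables, and a comparison of absolute correlations. The function names and the hard-coded critical t-value are illustrative assumptions.

```python
import math
import random

# Two-tailed .05 critical t for df = 48; hard-coded assumption, only valid for n = 50.
T_CRIT_DF48 = 2.0106


def pearson_r(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_two_dvs(reps=10_000, n=50, rho=0.15, reliability=0.5, seed=1):
    """Count which of two equally unreliable DVs shows the larger |r| with X,
    split by which of them reached p < .05."""
    rng = random.Random(seed)
    # Smallest |r| that is significant at the hard-coded critical t.
    r_crit = T_CRIT_DF48 / math.sqrt(n - 2 + T_CRIT_DF48 ** 2)
    true_sd, err_sd = math.sqrt(reliability), math.sqrt(1 - reliability)
    counts = {key: {"y1_higher": 0, "y2_higher": 0}
              for key in ("neither", "both", "only_y1", "only_y2")}
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # Error-free outcome with unit variance and correlation rho with x.
        y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
        # Two measures of y with independent random measurement error.
        y1 = [true_sd * yi + err_sd * rng.gauss(0, 1) for yi in y]
        y2 = [true_sd * yi + err_sd * rng.gauss(0, 1) for yi in y]
        r1, r2 = abs(pearson_r(x, y1)), abs(pearson_r(x, y2))
        sig1, sig2 = r1 >= r_crit, r2 >= r_crit
        key = ("both" if sig1 and sig2 else
               "only_y1" if sig1 else
               "only_y2" if sig2 else "neither")
        counts[key]["y1_higher" if r1 > r2 else "y2_higher"] += 1
    return counts
```

Calling `simulate_two_dvs()` returns the four scenarios as a dictionary of counts. Because both measures share the same sample size and therefore the same critical correlation, the measure that is significant alone must have the larger absolute estimate, which is why scenarios 3 and 4 are 100% by construction.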

## Another Interesting Observation

While working on this issue with Rickard, we also discovered an important distinction between standardized and unstandardized effect sizes. Loken and Gelman simulated standardized effect sizes by correlating two variables. Random measurement error lowers standardized effect sizes because the unstandardized effect sizes are divided by the standard deviation and random measurement error adds to the naturally occurring variance in a variable. However, unstandardized effect sizes like the covariance or the mean difference between two groups are not attenuated by random measurement error. For this reason, it would be wrong to claim that unreliability of a measure attenuates unstandardized effect sizes or that they should be corrected for unreliability of a measure.
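The distinction can be made explicit with a short derivation. It is a sketch under the assumption that the random error e is added to the outcome Y and is uncorrelated with both X and Y. The symbols Y*, b*, and r* for the observed outcome, slope, and correlation are introduced here for illustration.

```latex
\begin{align}
Y^{*} &= Y + e, \qquad \operatorname{Cov}(X, e) = \operatorname{Cov}(Y, e) = 0 \\
\operatorname{Cov}(X, Y^{*}) &= \operatorname{Cov}(X, Y) + \operatorname{Cov}(X, e) = \operatorname{Cov}(X, Y) \\
b^{*} &= \frac{\operatorname{Cov}(X, Y^{*})}{\operatorname{Var}(X)} = b \\
r^{*} &= \frac{\operatorname{Cov}(X, Y^{*})}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y^{*})}}
       = r \sqrt{\frac{\operatorname{Var}(Y)}{\operatorname{Var}(Y) + \operatorname{Var}(e)}}
\end{align}
```

The covariance and the unstandardized slope are unchanged in expectation, whereas the correlation shrinks because only its denominator grows with the error variance.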

Random measurement error will however increase the standard error and make it more difficult to get a significant result. As a result, selection for significance will inflate the unstandardized effect size more for an unreliable measure. The following simulation demonstrates this point. To keep things similar, I kept the effect size of b = .15, but used the unstandardized effect size of a regression analysis as the effect size.

First, I show the distribution of the effect size estimates. Both distributions are centered over the simulated effect size of b = .15. However, the measure with random error produces a wider distribution which often results in more extreme effect size estimates.

1. Neither measure produced a significant result

3170 times the perfect measure had a higher estimate than the unreliable measure (41%)
4587 times the unreliable measure had a higher estimate than the perfect measure (59%)

This scenario shows the surprising reversal that Loken and Gelman wanted to demonstrate: the less reliable measure produces the stronger absolute effect size estimate more than 50% of the time. However, their simulation used standardized effect size estimates, which do not produce this reversal. Only unstandardized effect size estimates show it.

2. Both measures produced a significant result

73 times the perfect measure had a higher estimate than the unreliable measure (10%)
659 times the unreliable measure had a higher estimate than the perfect measure (90%)

When both effect size estimates are significant, the one with the unreliable measure is much more likely to show a stronger effect size estimate. The reason is simple. Sampling error is larger and it takes a stronger effect size estimate to produce the same t-value that produces a significant result.

3. The reliable measure is significant and the unreliable measure is not significant

790 times the perfect measure had a higher estimate than the unreliable measure (73%)
299 times the unreliable measure had a higher estimate than the perfect measure (27%)

With standardized effect sizes, conditioning on significance favored the conditioning variable 100% of the time. With unstandardized coefficients, the unreliable measure still produces the higher estimate 27% of the time, even though only the perfect measure is significant. Nevertheless, the conditioning effect is notable because conditioning on significance for the perfect measure reverses the usual pattern that the unreliable measure produces stronger effect size estimates.

4. The unreliable measure is significant and the reliable measure is not significant

0 times the perfect measure had a higher estimate than the unreliable measure (0%)
422 times the unreliable measure had a higher estimate than the perfect measure (100%)

Conditioning on significance for the unreliable measure produces a 100% rate of stronger effect sizes because effect sizes are already biased in favor of the unreliable measure.

The interesting observation is that Loken and Gelman were right that effect size estimates can be inflated with unreliable measures, but they failed to demonstrate this reversal because they used standardized effect size estimates. Inflation occurs with unstandardized effect sizes. Moreover, it does not require selection for significance. Even non-significant effect size estimates tend to be larger because there is more sampling error.
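The behavior of unstandardized estimates can be checked with a minimal sketch. It assumes N = 50, a slope of b = .15, random error added to the outcome with the same variance as the true scores (50% reliability), and ordinary least squares. These settings and the function names are assumptions made for illustration, not the code behind the counts above.

```python
import math
import random
import statistics


def ols_slope(x, y):
    """Unstandardized regression slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(reps=10_000, n=50, b=0.15, error_sd=1.0, seed=1):
    """Return slope estimates for an error-free outcome and for the same
    outcome with added random measurement error."""
    rng = random.Random(seed)
    perfect, noisy = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [b * xi + math.sqrt(1 - b ** 2) * rng.gauss(0, 1) for xi in x]
        # Random error in the outcome: more variance, same expected slope.
        y_noisy = [yi + error_sd * rng.gauss(0, 1) for yi in y]
        perfect.append(ols_slope(x, y))
        noisy.append(ols_slope(x, y_noisy))
    return perfect, noisy
```

Comparing `statistics.mean` and `statistics.stdev` of the two returned lists shows both distributions centered near b = .15, with a wider spread for the noisy measure. The wider spread is what makes extreme estimates more frequent for the unreliable measure.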

## The Fallacy of Interpreting Point Estimates of Effect Sizes

Loken and Gelman’s article is framed as a warning to practitioners to avoid misinterpretation of effect size estimates. They are concerned that researchers “assume that the observed effect size would have been even larger if not for the burden of measurement error” and that “when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger.” They add: “our concern is that researchers are sometimes tempted to use the “iron law” reasoning to defend or justify surprisingly large statistically significant effects from small studies.”

They missed an opportunity to point out that there is a simple solution to avoid misinterpretation of effect size estimates that has been recommended by psychological methodologists since the 1990s (I highly recommend Cohen, 1994; also Cumming, 2013). The solution is to consider the uncertainty in effect size estimates by means of confidence intervals. Confidence intervals provide a simple solution to many fallacies of traditional null-hypothesis tests, p < .05. A confidence interval can be used to test not only the nil-hypothesis, but also hypotheses about specific effect sizes. A confidence interval may exclude zero, but it might include other values of theoretical interest, especially if sampling error is large. To claim an effect size larger than the true population effect size of b = .15, the confidence interval has to exclude a value of b = .15. Otherwise, it is a fallacy to claim that the effect size in the population is larger than .15.

As demonstrated before, random measurement error inflates effect size estimates of unstandardized effect sizes, but it also increases sampling error, resulting in wider confidence intervals. Thus, it is an important question whether unreliable measures really allow researchers to claim effect sizes that are significantly larger than the simulated true effect size of b = .15.

A final simulation examined how often the 95%CI excluded the true value of b = .15 for the perfect measure and the unreliable measure. To produce more precise estimates, I ran 100,000 simulations.

1. Measure without error

2809 Significant underestimations (2.8%)
2732 Significant overestimations (2.7%)
5541 Errors

2. Measure with error

2761 Significant underestimations (2.8%)
2750 Significant overestimations (2.7%)
5511 Errors

The results should not come as a surprise. 95% confidence intervals are designed to have a 5% error rate and to split these errors equally between both sides. The addition of random measurement error does not affect this property of confidence intervals. Most important, there is no reversal in the probability of overestimation. The measure without error produces confidence intervals that overestimate the true effect size as often as the measure with error. However, the effect of random measurement error is noticeable in the amount of bias.

For the measure without error, the lower bound of the 95%CI ranges from .15 to .55, M = .21.
For the measure with error, the lower bound of the 95%CI ranges from .15 to .65, M = .22.

These differences are small and have no practical consequences. Thus, the use of confidence intervals provides a simple solution to false interpretation of effect size estimates. Although selection for significance in small samples inflates the point estimate of effect sizes, the confidence interval often includes the smaller true effect size.
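The coverage check can be sketched as follows. This is a minimal sketch that assumes N = 50, normally distributed errors, an ordinary least squares slope with the usual standard error, and a hard-coded critical t-value for 48 degrees of freedom. The function names and parameters are illustrative assumptions.

```python
import math
import random

# Two-tailed .05 critical t for df = 48; hard-coded assumption, only valid for n = 50.
T_CRIT_DF48 = 2.0106


def slope_ci(x, y, t_crit=T_CRIT_DF48):
    """95% confidence interval for the OLS slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((c - intercept - slope * a) ** 2 for a, c in zip(x, y))
    se = math.sqrt(sse / (n - 2) / sxx)
    return slope - t_crit * se, slope + t_crit * se


def ci_errors(reps=10_000, n=50, b=0.15, error_sd=0.0, seed=1):
    """Count intervals that lie entirely below (under) or above (over) b."""
    rng = random.Random(seed)
    under = over = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [b * xi + math.sqrt(1 - b ** 2) * rng.gauss(0, 1)
             + error_sd * rng.gauss(0, 1) for xi in x]
        low, high = slope_ci(x, y)
        if high < b:
            under += 1   # significant underestimation
        elif low > b:
            over += 1    # significant overestimation
    return under, over
```

Running `ci_errors(error_sd=0.0)` and `ci_errors(error_sd=1.0)` compares the perfect measure with a measure that has 50% reliability. Under these assumptions both calls should return error rates near 5%, because the standard error grows along with the spread of the estimates.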

## The Replication Crisis

Loken and Gelman’s article aimed to relate random measurement error to the replication crisis. They write “If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong. Measurement error and selection bias thus can combine to exacerbate the replication crisis.”

The previous results show that this statement ignores the key influence of random measurement error on statistical significance. Random measurement error increases the standard deviation and the standard deviation is in the denominator of the t-statistic. Thus, t-values are biased downwards and it is harder to get statistical significance with unreliable measures. The key point is that studies in small samples with unreliable measures have low statistical power. It is therefore misleading to claim that random measurement error inflates t-values. It actually attenuates t-values. Selection for significance inflates point estimates of effect sizes, but these are meaningless and the confidence interval around such an estimate often includes the true population parameter.

More important, it is not clear what Loken and Gelman mean by the replication crisis. Let’s assume a researcher conducts a study with N = 50, a measure with 50% reliability, and an effect size of r = .15. Luck, the winner’s curse, gives them a statistically significant result with an effect size estimate of r = .6 and a 95% confidence interval ranging from .42 to .86. They get a major publication out of this finding. Another researcher conducts a replication study and gets a non-significant result with r = .11 and a 95%CI ranging from -.11 to .33. This outcome is often called a replication failure because significant results are considered successes and non-significant results are considered failures. However, findings like this do not signal a crisis. Replication failures are normal and to be expected because significance testing allows for error and replication failures.

The replication crisis in psychology is caused by the selective omission of replication failures from the literature or even from a set of studies within a single article (Schimmack, 2012). The problem is not that a single significant result is followed by a non-significant result. The problem is that non-significant results are not published. The success rate in psychology journals is over 90% (Sterling, 1959; Sterling et al., 1995). Thus, the replication crisis refers to the fact that psychologists never published failed replication studies. When publication of replication failures became more acceptable in the past decade, we just saw that selection bias inflated the success rate. Given the typical power of studies in psychology, replication failures are to be expected. This has nothing to do with random measurement error. The main contribution of random measurement error is to reduce power and increase the percentage of studies with non-significant results.

Over the past decade, a few influential articles have created a fear of false positive results (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). The real problem, however, is that selection for significance makes it impossible to know whether an effect exists or not. Whereas real effects in reasonably powered studies would often produce significant results, false positives would be followed by many replication failures. Without credible replication studies that are published independent of the outcome, statistical significance has no meaning. This led to concerns that over 50% of published results could be false positives. However, empirical studies of the false positive risk often find much lower plausible values (Bartos & Schimmack, 2023). Arguably, the bigger problem in studies with small samples and unreliable measures is that these studies will often produce a false negative result. Take Loken and Gelman’s simulation as an example. The key outcome of studies that look for a small effect size of r = .15 or r = .12 with a noisy measure is a non-significant result. This is a false negative result because we know a priori that there is a non-zero correlation between the two variables with a theoretically important effect size. For example, the correlation between income and a noisy measure of happiness is around r = .15. Looking for this small relationship in a small sample will often suggest that money does not buy happiness, while large samples consistently show this small relationship. One might even argue that the few studies that produce a significant result with an inflated point estimate but a confidence interval that includes r = .15 avoid the false negative result without providing inflated estimates of the effect size given the wide range of plausible values. Only the 2.5% of studies that produce confidence intervals that do not include r = .15 are misleading, but a single replication study is likely to correct this inflated estimate.

This line of reasoning does not justify selective publishing of significant results. Rather, it draws attention back to the concerns of methodologists in the 1990s that low power is wasteful because many studies produce inconclusive results. To address this problem, researchers need to think carefully about the plausible range of effect sizes and plan studies that can produce significant results for real effects. Researchers also need to be able and willing to publish results when the results are not significant. No statistical method can produce valid results when the data are biased. In comparison, the problem of inflated point estimates of effect sizes in a single small sample is trivial. Confidence intervals make it clear that the true effect size can be much smaller and rare outcomes of extreme inflation will be corrected quickly by failed replication studies.

In short, as much as Gelman likes to think that there is something fundamentally wrong with the statistical methods that psychologists use, the real problems are practical. Resource constraints often limit researchers’ ability to collect large samples and the preference for novel significant results over replication failures of old findings gives researchers an incentive to selectively report their “successes.” To do so, they may even use multiple unreliable measures in order to capitalize on chance. The best way to address these problems is to establish a clear code of research practices and to hold researchers accountable if they violate this code. Editors should also enforce the already existing guidelines to report meaningful effect sizes with confidence intervals. In this utopian world, researchers would benefit from using reliable measures because they increase power and the probability of publishing a true positive result.

## Abandon Gelman

I pointed out the mistake in Loken and Gelman’s article on Gelman’s blog post. He is unable to see that his claim of a reversal in effect size estimates due to random measurement error is a mistake. Instead he tries to explain my vehement insistence as a personality flaw.

Instead, his overconfidence makes it impossible for him to consider the possibility that he made a mistake. This arrogant response to criticism is by no means unique. I have seen it many times from Greenwald, Bargh, Baumeister, and others. However, it is ironic when meta-scientists like Ioannidis, Gelman, or Simonsohn, who are known for harsh criticism of others, are unable to admit when they made a mistake. A notable exception is Kahneman’s response to my criticism of his book “Thinking, Fast and Slow.”

Gelman has criticized psychologists without offering any advice on how they could improve their credibility. His main advice is to “abandon statistical significance” without any guidelines on how we should distinguish real findings from false positives or avoid interpretation of inflated effect size estimates. Here I showed how the use of confidence intervals provides a simple solution to avoid many of the problems that Gelman likes to point out. To learn about statistics, I suggest reading less Gelman and more Cohen.

Cohen’s work shaped my understanding of methodology and statistics and he actually cared about psychology and tried to improve it. Without him, I might not have learned about statistical power or contemplated the silly practice of refuting the nil-hypothesis. I also think his work was influential in changing the way results are reported in psychology journals, which enabled me to detect biases and estimate false positive rates in our field. He also tried to tell psychologists about the importance of replication studies.

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

If psychologists had listened to Cohen, they could have avoided the replication crisis in the 2010s. However, his work can still help psychologists to learn from the replication crisis to build a credible science that is built on true positive results and avoids false negative results. The lessons are simple.
1. Plan studies with a reasonable chance to get a significant result. Try to maximize power by thinking about all possible ways to reduce sampling error, including using more reliable measures.
2. Publish studies independent of outcome, especially replication failures that can correct false positives.
3. Focus on effect sizes, but ignore the point estimates. Instead, use confidence intervals to avoid interpreting effect size estimates that are inflated by selection for significance.

# Replicability Report 2023: Aggressive Behavior

This report was created in collaboration with Anas Alsayed Hasan.
Citation: Alsayed Hasan, A. & Schimmack, U. (2023). Replicability Report 2023: Aggressive Behavior. Replicationindex.com

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors choose journals they want to publish in, provide feedback to journal editors, who influence selection bias and the replicability of results published in their journals, and, most importantly, inform readers of these journals.

## Aggressive Behavior

Aggressive Behavior is the official journal of the International Society for Research on Aggression.  Founded in 1974, this journal provides a multidisciplinary view of aggressive behavior and its physiological and behavioral consequences on subjects.  Published articles use theories and methods from psychology, psychiatry, anthropology, ethology, and more. So far, Aggressive Behavior has published close to 2,000 articles. Nowadays, it publishes about 60 articles a year in 6 annual issues. The journal has been cited by close to 5000 articles in the literature and has an H-Index of 104 (i.e., 104 articles have received 104 or more citations). The journal also has a moderate impact factor of 3. This journal is run by an editorial board containing over 40 members. The Editor-In-Chief is Craig Anderson. The associate editors are Christopher Barlett, Thomas Denson, Ann Farrell, Jane Ireland, and Barbara Krahé.

## Report

Replication reports are based on automatically extracted test-statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).
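The conversion into absolute z-scores can be illustrated with a small sketch. It assumes that each test statistic has already been turned into a two-sided p-value, which is an assumption about the pipeline and is not shown here. The function name is hypothetical.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_sided):
    """Absolute z-score whose two-sided p-value equals p_two_sided.

    Uses the lower tail (p / 2) rather than 1 - p / 2 to stay accurate
    for very small p-values.
    """
    return -NormalDist().inv_cdf(p_two_sided / 2)
```

For example, `p_to_abs_z(0.05)` is approximately 1.96 and `p_to_abs_z(0.10)` is approximately 1.645, the two conventional cut-offs for two-sided tests at alpha = .05 and alpha = .10.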

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

## Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 71%, the expected discovery rate is 45%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

## False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result. An EDR of 45% implies that no more than 7% of the significant results are false positives. The 95%CI puts the upper limit for false positive results at 12%. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of original articles need to focus on confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but may not provide enough information to determine whether a statistically significant result is theoretically or practically important.
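The inverse relationship between the discovery rate and the maximum false positive risk can be sketched with Sorić’s (1989) upper bound. That this is the exact formula behind the reported values is an assumption, and the function name is hypothetical.

```python
def soric_max_fdr(edr, alpha=0.05):
    """Upper bound on the share of significant results that are false
    positives, given a discovery rate (edr) and a significance level."""
    return (1 / edr - 1) * (alpha / (1 - alpha))
```

With the rounded EDR of 45% and alpha = .05, `soric_max_fdr(0.45)` is about 0.064, which is in the neighborhood of the 7% reported above. The small gap may reflect rounding of the EDR. Lower discovery rates produce a higher bound, which is why a wide confidence interval for the EDR translates into a higher upper limit for the false positive risk.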

## Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published. The ERR of 69% suggests that the majority of results published in Aggressive Behavior are replicable, but the EDR allows for a replication rate as low as 45%. Thus, replicability is estimated to range from 45% to 69%. There are currently no large replication studies in this field, making it difficult to compare these estimates to outcomes of empirical replication studies. However, the ERR for the OSC reproducibility project that produced 36% successful actual replications was around 60%, suggesting that roughly 50% of actual replication studies of articles in this journal would be significant. It is unlikely that the success rate would be lower than the EDR of 45%. Given the relatively low risk of type-I errors, most of these replication failures are likely to occur because studies in this journal tend to be underpowered. Thus, replication studies should use larger samples.

## Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The ODR, EDR, and ERR were regressed on time and time-squared to allow for non-linear relationships. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.52 percentage points per year (SE = .22). The EDR showed no significant trends, p > .30. There were no linear or quadratic time trends for the ERR, p > .10. Figure 2 shows the ODR and EDR to examine selection bias.

The decrease in the ODR implies that selection bias is decreasing over time. In the last years, the confidence intervals for the ODR and EDR overlap, indicating that there are no longer statistically reliable differences. However, this does not imply that all results are being reported. The main reason for the overlap is the low certainty about the annual EDR. Given the lack of a significant time trend for the EDR, the average EDR across all years implies that there is still selection bias. Finally, automatically extracted test-statistics make it impossible to say whether researchers are reporting more focal or non-focal results as non-significant. To investigate this question, it is necessary to hand-code focal tests (see Limitations section).

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

The FDR is based on the EDR that also showed no time trends. Thus, the estimates for all years can be used to obtain more precise estimates than the annual ones. Based on the results in Figure 1, the expected failure rate is 31% and the FDR is 7%. This suggests that replication failures are more likely to be false negatives due to modest power rather than false positive results in original studies. To avoid false negative results in replication studies, these studies should use larger samples.

## Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Using alpha = .01 lowers the discovery rate by about 15 percentage points. The stringent criterion of alpha = .001 lowers it by another 10 percentage points to around 40% discoveries. This would mean that many published results that were used to make claims no longer have empirical support.

Figure 5 shows the effects of alpha on the false positive risk. Even alpha = .01 is sufficient to ensure a false positive risk of 5% or less. Thus, alpha = .01 seems a reasonable criterion to avoid too many false positive results without discarding too many true positive results. Authors may want to increase statistical power to increase their chances of obtaining a p-value below .01 when their hypotheses are true to produce credible evidence for their hypotheses.

## Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

Hand-coding of other journals shows that publications of non-significant focal hypothesis tests are still rare. As a result, the ODR for focal hypothesis tests in Aggressive Behavior is likely to be higher and selection bias larger than the present results suggest. Hand-coding of a representative sample of articles in this journal is needed.

## Conclusion

The replicability report for Aggressive Behavior shows clear evidence of selection bias, although there is a trend suggesting that selection bias may have decreased in the last years. The results also suggest that replicability is in a range from 40% to 70%. This replication rate does not deserve to be called a crisis, but it does suggest that many studies are underpowered and require luck to get a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Finally, time trend analyses show no important changes in response to the open science movement. An important goal is to reduce the selective publishing of studies that worked (p < .05) and the hiding of studies that did not work (p > .05). Preregistration or registered reports can help to address this problem. Given concerns that most published results in psychology are false positives, the present results are reassuring and suggest that most results with p-values below .01 are true positive results.

# Replicability Report 2023: Cognition & Emotion

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors choose journals they want to publish in, provide feedback to journal editors, who influence selection bias and the replicability of results published in their journals, and, most importantly, inform readers of these journals.

## Cognition & Emotion

The study of emotions largely disappeared from psychology after the Second World War and during the reign of behaviorism, or was limited to facial expressions. The study of emotional experiences reemerged in the 1980s. Cognition & Emotion was established in 1987 as an outlet for this research.

So far, the journal has published close to 3,000 articles. The average number of citations per article is 46. The journal has an H-Index of 155 (i.e., 155 articles have 155 or more citations). These statistics show that Cognition & Emotion is an influential journal for research on emotions.

Nine articles have more than 1,000 citations. The most highly cited article is a theoretical article by Paul Ekman arguing for basic emotions (Ekman, 1992).

## Report

Replicability reports are based on automatically extracted test-statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.96) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.96 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.
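The conversion to absolute z-scores and the position of the two vertical lines can be illustrated with a minimal sketch. It assumes that each test statistic has already been turned into a two-sided p-value; the function name `p_to_abs_z` is hypothetical and is not part of the z-curve software.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_to_abs_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score (assumed conversion)."""
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    return STD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)


# The two vertical lines in the plot correspond to these p-values.
z_solid = p_to_abs_z(0.05)   # two-sided alpha = .05, roughly 1.96
z_dotted = p_to_abs_z(0.10)  # two-sided alpha = .10 (one-sided .05), roughly 1.645

print(f"solid line: {z_solid:.3f}, dotted line: {z_dotted:.3f}")
```

Working with absolute z-scores puts F-tests, t-tests, and z-tests on a common scale, which is what makes a single histogram across very different studies possible.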

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

### Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 68%, the expected discovery rate is 34%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.
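The comparison of the ODR and EDR can be made concrete with a small sketch. The ODR is simply the share of absolute z-scores above the significance criterion. The EDR comes from the fitted z-curve model and is not computed here; the sketch only translates both rates into the number of non-significant results per significant result. The function names and the example z-scores are hypothetical.

```python
def observed_discovery_rate(abs_z_scores, z_crit=1.96):
    """Share of results whose absolute z-score exceeds the significance criterion."""
    significant = sum(1 for z in abs_z_scores if z > z_crit)
    return significant / len(abs_z_scores)


def nonsig_per_sig(discovery_rate):
    """Non-significant results per significant result implied by a discovery rate."""
    return (1.0 - discovery_rate) / discovery_rate


# Hypothetical z-scores, for illustration only (7 of 10 exceed 1.96).
example_z = [0.4, 1.2, 1.7, 2.1, 2.3, 2.8, 3.5, 4.0, 2.0, 5.2]
odr_example = observed_discovery_rate(example_z)

# Rates reported in the text: ODR = 68%, EDR = 34%.
published_ratio = nonsig_per_sig(0.68)  # about 0.47 non-significant per significant result
expected_ratio = nonsig_per_sig(0.34)   # about 1.94 non-significant per significant result

print(odr_example, round(published_ratio, 2), round(expected_ratio, 2))
```

Read this way, the reported rates imply that roughly two non-significant results would be expected for every significant one, whereas fewer than one for every two appears in the journal.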

### False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 34% implies that up to 10% of the significant results could be false positives. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of statistical results in this journal need to examine the range of plausible effect sizes, confidence intervals, to see whether results have practical significance. Unfortunately, these estimates are inflated by selection bias, especially when the evidence is weak and the confidence interval already includes effect sizes close to zero.
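The step from the EDR to the maximum false positive risk can be sketched with Sorić's upper bound, which the z-curve method uses for this purpose. The function name `soric_max_fdr` is hypothetical, and the sketch assumes the conventional alpha = .05.

```python
def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Upper bound on the false discovery risk implied by a discovery rate.

    Soric's bound: FDR_max = (1 / EDR - 1) * alpha / (1 - alpha).
    """
    if not 0.0 < edr <= 1.0:
        raise ValueError("EDR must be in (0, 1]")
    return (1.0 / edr - 1.0) * alpha / (1.0 - alpha)


fdr_at_34 = soric_max_fdr(0.34)  # about 0.10, the "up to 10%" quoted above
print(f"maximum false discovery risk: {fdr_at_34:.1%}")
```

The bound is a worst case: it assumes that all true effects are detected with perfect power, so the actual share of false positives can be lower.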

### Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 70% suggests that most results published in this journal are replicable, but the EDR allows for a replication rate as low as 34%. Thus, replicability is estimated to range from 34% to 70%. There is no representative sample of replication studies from this journal to compare this estimate with the outcome of actual replication studies. However, a journal with lower ERR and EDR estimates, Psychological Science, had an actual replication rate of 41%. Thus, it is plausible to predict a higher actual replication rate than this for Cognition & Emotion.

### Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. Confidence intervals were created by regressing the estimates on time and time-squared to examine non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.79 percentage points per year (SE = .10). The EDR showed no significant linear, b = .23, SE = .41, or non-linear, b = -.10, SE = .07, trends.

The decreasing ODR implies that selection bias is decreasing, but it is not clear whether this trend also applies to focal hypothesis tests (see limitations section). The lack of an increase in the EDR implies that researchers continue to conduct studies with low statistical power and that the non-significant results often remain unpublished. To improve credibility of this journal, editors could focus on power rather than statistical significance in the review process.

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

There was a significant linear trend, b = .24, SE = .11, for the ERR, indicating an increase in the ERR. The increase in the ERR implies fewer replication failures in the later years. However, because the FDR is not decreasing, a larger portion of these replication failures could be false positives.

## Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Lowering alpha to .01 reduces the observed discovery rate by 20 to 30 percentage points. It is also interesting that the ODR decreases more with alpha = .05 than for other alpha levels. This suggests that changes in the ODR are in part caused by fewer p-values between .05 and .01. These significant results are more likely to result from unscientific methods and often do not replicate.

Figure 5 shows the effects of alpha on the false positive risk. Lowering alpha to .01 reduces the false positive risk to less than 5%. Thus, readers can use this criterion to reduce the false positive risk to an acceptable level.

## Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Cognition & Emotion, a small set of articles was hand-coded as part of a study on the effects of open science reforms on the credibility of psychological science. Figure 6 shows the z-curve plot and results for 117 focal hypothesis tests.

The main difference between manually and automatically coded data is a much higher ODR (95%) for manually coded data. This finding shows that selection bias for focal hypothesis tests is much more severe than the automatically extracted data suggest.

The point estimate of the EDR, 37%, is similar to the EDR for automatically extracted data, 34%. However, due to the small sample size, the 95%CI for manually coded data is wide and it is impossible to draw firm conclusions about the EDR, but results from other journals and large samples also show similar results.

The ERR estimates are also similar and the 95%CI for hand-coded data suggests that the majority of results are replicable.

Overall, these results suggest that automatically extracted results are informative, but underestimate selection bias for focal hypothesis tests.

## Conclusion

The replicability report for Cognition & Emotion shows clear evidence of selection bias, but also a relatively low risk of false positive results that can be further reduced by using alpha = .01 as a criterion to reject the null-hypothesis. There are no notable changes in credibility over time. Editors of this journal could improve credibility by reducing selection bias. The best way to do so would be to evaluate the strength of evidence rather than using alpha = .05 as a dichotomous criterion for acceptance. Moreover, the journal needs to publish more articles that fail to support theoretical predictions. The best way to do so is to accept articles that preregistered predictions and failed to confirm them or to invite registered reports that publish articles independent of the outcome of a study. Readers can set their own level of alpha depending on their appetite for risk, but alpha = .01 is a reasonable criterion because it (a) maintains a false positive risk below 5% and (b) eliminates p-values between .01 and .05 that are often obtained with unscientific practices and fail to replicate.

Link to replicability reports for other journals.

# How to Interpret Z-Curve Plots

work in progress.

# Replicability Report 2023: Acta Psychologica

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors choose journals they want to publish in, gives feedback to journal editors who influence selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

## Acta Psychologica

Acta Psychologica is an old psychological journal that was founded in 1936. The journal publishes articles from various areas of psychology, but cognitive psychological research seems to be the most common area.

So far, Acta Psychologica has published close to 6,000 articles. Nowadays, it publishes about 150 articles a year in 10 annual issues. Over the past 30 years, articles have an average citation rate of 24.48 citations, and the journal has an H-Index of 116 (i.e., 116 articles have received 116 or more citations). The journal has an impact factor of 2 which is typical of most empirical psychology journals.

So far, the journal has published 4 articles with more than 1,000 citations, but all of these articles were published in the 1960s and 1970s. The most highly cited article in the 2000s examined the influence of response categories on the psychometric properties of survey items (Preston & Colman, 2000; 947 citations).

Given the multidisciplinary nature of the journal, the journal has a team of editors. The current editors are Mohamed Alansari, Martha Arterberry, Colin Cooper, Martin Dempster, Tobias Greitemeyer, Matthieu Guitton, and Nhung T Hendy.

## Report

Replicability reports are based on automatically extracted test-statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.96) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.96 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.
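Why z-scores above 6 are uninformative for diagnosing replicability can be seen from the power of a two-sided z-test. The sketch below assumes that the expected z-score of a study (its true strength of evidence) is known, which is never the case for a single observed result; it is an illustration of the logic and not a computation performed by z-curve. The function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(expected_z: float, alpha: float = 0.05) -> float:
    """Probability of a significant result given the expected z-score of a study."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - STD_NORMAL.cdf(z_crit - expected_z)  # significant, same sign
    lower_tail = STD_NORMAL.cdf(-z_crit - expected_z)        # significant, opposite sign
    return upper_tail + lower_tail


for expected in (0.0, 1.96, 3.0, 6.0):
    print(expected, round(power_two_sided(expected), 5))
```

With an expected z-score of 0 the power equals alpha, with an expected z-score at the significance criterion it is about 50%, and with an expected z-score of 6 it is above 99.99%. Because selection for significance inflates observed z-scores, observed values in the range from 2 to 6 leave the true power uncertain, which is why this range is the diagnostic one.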

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

### Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 70%, the expected discovery rate is 46%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

### False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 46% implies that no more than 6% of the significant results are false positives. The 95%CI puts the upper limit for false positive results at 8%. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of original articles need to focus on confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but may not provide enough information to determine whether a statistically significant result is theoretically or practically important.

### Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 72% suggests that the majority of results published in Acta Psychologica are replicable, but the EDR allows for a replication rate as low as 46%. Thus, replicability is estimated to range from 46% to 72%. Actual replications of cognitive research suggest that 50% of results produce a significant result again (Open Science Collaboration, 2015). This finding is consistent with the present results. Taking the low false positive risk into account, most replication failures are likely to be false negatives due to insufficient power in the original and replication studies. This suggests that replication studies should increase sample sizes to have sufficient statistical power to replicate true positive effects.

### Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The results were regressed on time and time-squared to allow for non-linear relationships.
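The trend analysis can be sketched as an ordinary least-squares regression of the annual estimates on time and time-squared. The sketch below uses only the standard library, centers time on the mean year (an assumption; the original analysis may code time differently), and omits standard errors. The annual values are hypothetical and were generated from an exact linear trend so that the fit is easy to check.

```python
def fit_quadratic(years, values):
    """Least-squares fit of value = b0 + b1*t + b2*t**2, with t = year - mean(year)."""
    t_mean = sum(years) / len(years)
    rows = [[1.0, y - t_mean, (y - t_mean) ** 2] for y in years]
    # Normal equations (X'X) b = X'y, stored as an augmented 3 x 4 matrix.
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * v for r, v in zip(rows, values))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(3):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [x - factor * p for x, p in zip(a[k], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


# Hypothetical annual ODR estimates (in percent) that fall by 0.43 points per year.
years = list(range(2000, 2023))
odr_by_year = [75.0 - 0.43 * (y - 2011) for y in years]

intercept, linear, quadratic = fit_quadratic(years, odr_by_year)
print(round(intercept, 2), round(linear, 2), round(quadratic, 4))
```

The linear coefficient is the change in percentage points per year; a quadratic coefficient near zero indicates that the trend is not accelerating or leveling off.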

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.43 percentage points per year (SE = .13). The EDR showed no significant trends, p > .20.

The decrease in the ODR implies that selection bias is decreasing over time. However, a low EDR still implies that many studies that produced non-significant results remain unpublished. Moreover, it is unclear whether researchers are reporting more focal results as non-significant. To investigate this question it is necessary to hand-code focal tests (see Limitations section).

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

There were no linear or quadratic time trends for the ERR, p > .2. The FDR is based on the EDR that also showed no time trends. Thus, the estimates for all years can be used to obtain more precise estimates than the annual ones. Based on the results in Figure 1, the expected failure rate is 28% and the FDR is 5%. This suggests that replication failures are more likely to be false negatives due to modest power rather than false positive results in original studies. To avoid false negative results in replication studies, these studies should use larger samples.
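The comparison of the expected replication failure rate with the false discovery risk is simple arithmetic, sketched below. The ratio FDR / EFR is used here as a rough upper bound on the share of replication failures that could stem from false positives; this bound is an assumption of the sketch (it treats every false positive as a replication failure) and is not a statistic reported by z-curve. The function names are hypothetical.

```python
def expected_failure_rate(err: float) -> float:
    """Expected replication failure rate, EFR = 1 - ERR."""
    return 1.0 - err


def max_share_of_failures_from_false_positives(fdr: float, err: float) -> float:
    """Rough upper bound: treats every false positive as a replication failure."""
    efr = expected_failure_rate(err)
    if efr <= 0.0:
        raise ValueError("no replication failures are expected when ERR is 1")
    return min(1.0, fdr / efr)


# Values reported in the text: ERR = 72%, FDR = 5%.
efr = expected_failure_rate(0.72)                              # 28%
share = max_share_of_failures_from_false_positives(0.05, 0.72)  # about 18%

print(f"EFR = {efr:.0%}, at most {share:.0%} of failures from false positives")
```

Under these assumptions, fewer than one in five expected replication failures could be traced to false positives in original studies, which is in line with the conclusion that most failures are false negatives.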

## Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Using alpha = .01 lowers the discovery rate by about 20 percentage points. The stringent criterion of alpha = .001 lowers it by another 20 percentage points to around 40% discoveries. This would mean that many published results that were used to make claims no longer have empirical support.
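How a stricter alpha lowers the discovery rate can be sketched by re-screening the same absolute z-scores against the critical value for each alpha. The z-scores below are hypothetical and chosen only to make the counts easy to follow; the function name is hypothetical as well.

```python
from statistics import NormalDist


def odr_at_alpha(abs_z_scores, alpha):
    """Share of absolute z-scores that remain significant at a two-sided alpha."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return sum(1 for z in abs_z_scores if z > z_crit) / len(abs_z_scores)


# Hypothetical absolute z-scores, for illustration only.
example_z = [0.5, 1.1, 1.8, 2.0, 2.2, 2.4, 2.7, 3.0, 3.6, 4.5]

rates = {alpha: odr_at_alpha(example_z, alpha) for alpha in (0.05, 0.01, 0.005, 0.001)}
for alpha, rate in rates.items():
    print(f"alpha = {alpha}: discovery rate = {rate:.0%}")
```

The critical values are roughly 1.96, 2.58, 2.81, and 3.29, so results with z-scores just above 1.96 are the first to lose their significant status when alpha is lowered.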

Figure 5 shows the effects of alpha on the false positive risk. Even alpha = .01 is sufficient to ensure a false positive risk of 5% or less. Thus, alpha = .01 seems a reasonable criterion to avoid too many false positive results without discarding too many true positive results. Authors may want to increase statistical power to increase their chances of obtaining a p-value below .01 when their hypotheses are true to produce credible evidence for their hypotheses.

## Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Acta Psychologica, hand-coded data are available for the years 2010 and 2020 from a study that examines changes in replicability from 2010 to 2020. Figure 6 shows the results.

The most notable difference is the higher observed discovery rate for hand-coding of focal hypothesis tests (94%) than for automatically extracted test statistics (70%). Thus, results based on automatically extracted data underestimate selection bias.

In contrast, the expected discovery rates are similar in hand-coded (46%) and automatically extracted (46%) data. Given the small set of hand-coded tests, the 95% confidence interval around the 46% estimate is wide, but there is no evidence that automatically extracted data overestimate the expected discovery rate and by implication underestimate the false discovery rate.

The ERR for hand-coded focal tests (70%) is also similar to the ERR for automatically extracted tests (72%).

This comparison suggests that the main limitation of automatic extraction of test statistics is that this method underestimates the amount of selection bias because authors are more likely to report non-focal tests than focal results that are not significant. Thus, selection bias remains a pervasive problem in this journal.

## Conclusion

The replicability report for Acta Psychologica shows clear evidence of selection bias, although there is a trend suggesting that selection bias has been decreasing in recent years. The results also suggest that replicability is in a range from 40% to 70%. This replication rate does not deserve to be called a crisis, but it does suggest that many studies are underpowered and require luck to get a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Finally, time trend analyses show no important changes in response to the open science movement. An important goal is to reduce the selective publishing of studies that worked (p < .05) and the hiding of studies that did not work (p > .05). Preregistration or registered reports can help to address this problem. Given concerns that most published results in psychology are false positives, the present results are reassuring and suggest that most results with p-values below .01 are true positive results.

# Replicability Reports of Psychology Journals

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability reports use z-curve to provide information about psychological journals. This information can help authors choose journals they want to publish in, gives feedback to journal editors who influence selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

#### List of Journals with Links to Replicability Report

Psychological Science (2000-2022)

Acta Psychologica (2000-2022)

# Replicability Report 2023: Psychological Science

updated 7/11/23
[slightly different results due to changes in the extraction code and a mistake in the formula for the false discovery risk with different levels of alpha]

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can help authors choose journals they want to publish in, gives feedback to journal editors who influence selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

## Psychological Science

Psychological Science is often called the flagship journal of the Association for Psychological Science. It publishes articles from all areas of psychology, but most articles are experimental studies.

The journal started in 1990. So far, it has published over 5,000 articles with an average citation rate of 90 citations per article. The journal currently has an H-Index of 300 (i.e., 300 articles have received 300 or more citations).

Ironically, the most cited article (3,800 citations) is a theoretical article that illustrated how easy it is to produce statistically significant results with statistical tricks that capitalize on chance, increase the risk of a false discovery, and inflate effect size estimates (Simmons, Nelson, & Simonsohn, 2011). This article is often cited as evidence that published results lack credibility. The impact of this article also suggests that most researchers are now aware that selective publishing of significant results is harmful.

After concerns about the replicability of psychological science emerged in the early 2010s, Eric Eich initiated changes to increase the credibility of published results. Further changes were made by Stephen Lindsay during his editorship from 2015 to 2019. Replicability reports provide an opportunity to examine the effect of these changes on the credibility of published results.

## Report

Replicability reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.96) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.96 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.
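The conversion to absolute z-scores can be illustrated with a small sketch. It assumes that each test statistic has already been turned into a two-sided p-value; the function names `p_to_abs_z` and `abs_z_to_p` are made up for this illustration and are not part of the z-curve software. The last two lines compute the z-scores that correspond to two-sided alpha levels of .05 and .10, the two criteria marked in the plot.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_abs_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    if not 0.0 < p_two_sided < 1.0:
        raise ValueError("p-value must be strictly between 0 and 1")
    # Half of the p-value lies in each tail of the standard normal.
    return STANDARD_NORMAL.inv_cdf(1.0 - p_two_sided / 2.0)


def abs_z_to_p(z: float) -> float:
    """Convert an absolute z-score back into a two-sided p-value."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


# Hypothetical usage: the z-scores implied by two common alpha levels.
z_05 = p_to_abs_z(0.05)  # conventional two-sided criterion
z_10 = p_to_abs_z(0.10)  # more liberal criterion
```

Because only the p-value enters the conversion, results from F-tests, t-tests, and z-tests end up on the same scale.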

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

### Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 71%, the expected discovery rate is 25%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

### False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 25% implies that up to 16% of the significant results could be false positives. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of statistical results in Psychological Science need to examine the range of plausible effect sizes (i.e., confidence intervals) to see whether results have practical significance. Unfortunately, these estimates are inflated by selection bias, especially when the evidence is weak and the confidence interval already includes effect sizes close to zero.
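The step from an EDR of 25% to a maximum false positive risk of 16% can be reproduced with a short calculation. The sketch below assumes that the conversion is Soric's upper bound on the false discovery rate, which depends only on the discovery rate and alpha. The function name `soric_max_fdr` is chosen for this illustration.

```python
def soric_max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Upper bound on the false discovery risk for a given discovery rate.

    Assumes Soric's bound: max FDR = (1 / EDR - 1) * alpha / (1 - alpha).
    """
    if not 0.0 < edr <= 1.0:
        raise ValueError("edr must be in the interval (0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    bound = (1.0 / edr - 1.0) * (alpha / (1.0 - alpha))
    return min(1.0, bound)  # a risk cannot exceed 100%


# EDR of 25% at alpha = .05, as reported for this journal.
max_fdr = soric_max_fdr(0.25, 0.05)  # about 0.158, i.e., roughly 16%
```

The bound grows quickly as the discovery rate falls, which is why the false positive risk is described as inversely related to the EDR.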

### Expected Replication Rate

The expected replication rate (ERR) estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 67% suggests that most results published in this journal are replicable, but the EDR allows for a replication rate as low as 25%. Thus, replicability is estimated to range from 25% to 67%. Actual replications of results in this journal suggest a replication rate of 41% (Open Science Collaboration, 2015). This finding is consistent with the present results. Thus, replicability of results in Psychological Science is much lower than trusting readers might expect.

### Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. Confidence intervals were created by regressing the estimates on time and time-squared to examine non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.53 percentage points per year (SE = .09). The EDR showed significant linear, b = .70, SE = .31, and non-linear, b = .24, SE = .05, trends.
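A regression of yearly estimates on time and time-squared can be sketched as follows. This is only an illustration under stated assumptions: time is centered at the mean year, the fit is ordinary least squares, the function name `fit_quadratic` is hypothetical, and the yearly values are a synthetic, noise-free series rather than the journal's estimates. Standard errors are not computed.

```python
def fit_quadratic(years, estimates):
    """Least-squares fit of estimate = b0 + b1*t + b2*t**2, t = year - mean(year)."""
    n = len(years)
    center = sum(years) / n
    # Design matrix columns: intercept, centered time, centered time squared.
    rows = [[1.0, y - center, (y - center) ** 2] for y in years]
    # Normal equations (X'X) b = X'y stored as an augmented 3 x 4 matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(3)]
        + [sum(r[i] * e for r, e in zip(rows, estimates))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(3):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][3] / aug[i][i] for i in range(3)]


# Synthetic series (NOT real data): intercept 30, linear 0.5, quadratic 0.1.
years = list(range(2000, 2023))
series = [30.0 + 0.5 * (y - 2011) + 0.1 * (y - 2011) ** 2 for y in years]
b0, b1, b2 = fit_quadratic(years, series)
```

With centered time, a positive quadratic coefficient describes a U-shaped trajectory with its low point near the middle of the period, and the linear coefficient gives the slope at the mean year.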

The decreasing ODR implies that selection bias is decreasing, but it is not clear whether this trend also applies to focal hypothesis tests (see limitations section). The curvilinear trend for the EDR is notable because it suggests that concerns about the credibility of published results were triggered by a negative trend in the EDR from 2000 to 2010. Since then, the EDR has been moving up. The positive trend can be attributed to the reforms initiated by Eric Eich and Stephen Lindsay that have been maintained by the current editor, Patricia J. Bauer.

Figure 3 shows the false discovery risk (FDR) and the expected replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 - ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

There were linear, b = .40, SE = .11, and quadratic, b = .14, SE = .02, time trends for the ERR. The FDR is based on the EDR that also showed linear and quadratic trends. The non-linear trends imply that credibility was lowest from 2005 to 2015. During this time up to 40% of published results might not be replicable and up to 50% of these results might be false positive results. The Open Science replication project replicated studies from 2008. Given the present findings, this result cannot be generalized to other years.

## Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion to make decisions about a hypothesis. It was used by authors to conclude that an effect is present and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide what alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Lowering alpha to .01 reduces the observed discovery rate by 20 to 30 percentage points. The effect is stronger during the dark period from 2005 to 2015 because more results during this period had p-values between .05 and .01. These results often do not replicate and are more likely to be the result of unscientific research practices.
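The effect of a stricter alpha on the observed discovery rate can be illustrated with a few lines of code. The sketch assumes two-sided tests and uses a small set of made-up absolute z-scores; the function names and the example values are hypothetical and do not come from the journal's data.

```python
from statistics import NormalDist


def critical_z(alpha: float) -> float:
    """Absolute z-score needed for significance in a two-sided test."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def observed_discovery_rate(abs_z_scores, alpha: float = 0.05) -> float:
    """Proportion of results that are significant at the given alpha."""
    threshold = critical_z(alpha)
    significant = sum(1 for z in abs_z_scores if z > threshold)
    return significant / len(abs_z_scores)


# Made-up absolute z-scores for illustration only.
example_z = [0.4, 1.2, 1.8, 2.0, 2.1, 2.3, 2.5, 2.7, 3.1, 4.5]

odr_05 = observed_discovery_rate(example_z, 0.05)    # 7 of 10 exceed the criterion
odr_01 = observed_discovery_rate(example_z, 0.01)    # 3 of 10 exceed the criterion
odr_005 = observed_discovery_rate(example_z, 0.005)  # 2 of 10 exceed the criterion
```

In this toy example, the results that drop out when alpha is lowered are the ones with z-scores just above the conventional criterion, which is the pattern described for p-values between .05 and .01.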

Figure 5 shows the effects of alpha on the false positive risk. Lowering alpha to .01 reduces the false positive risk considerably, but it remains above 5% during the dark period from 2005 to 2015. These results suggest that readers could use alpha = .005 from 2005 to 2015 and alpha = .01 during other years to achieve a false positive risk below 5%.

## Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Psychological Science, hand-coded data are available from coding by Motyl et al. (2017) and my own lab. The datasets were combined and analyzed with z-curve (Figure 4).

The ODR of 84% is higher than the ODR of 68% for automatic extraction. The EDR of 34% is identical to the estimate for automatic extraction. The ERR of 61% is 8 percentage points lower than the ERR for automatic extraction. Given the period effects on z-curve estimates, I also conducted a z-curve analysis for automatically extracted tests for the matching years (2003, 2004, 2010, 2016, 2020). The results were similar: ODR = 73%, EDR = 25%, and ERR = 64%. Thus, automatically extracted results produce similar results to results based on hand-coded data. The main difference is that non-significant results are less likely to be focal tests.

## Conclusion

The replicability report for Psychological Science shows (a) clear evidence of selection bias, (b) unacceptably high false positive risks at the conventional criterion for statistical significance, and (c) modest replicability. However, time trend analyses show that the credibility of published results decreased at the beginning of this century, but has improved since 2015. Further improvements are needed to eliminate selection bias and increase the expected discovery rate by increasing power (reducing sampling error). Reducing sampling error is also needed to produce strong evidence against theoretical predictions that are important for theory development. The present results can be used as a benchmark for further improvements that can increase the credibility of results in psychological science (e.g., more Registered Reports that publish results independent of outcomes). The results can also help readers of psychological science to choose significance criteria that match their personal preferences for risk and their willingness to “err on the side of discovery” (Bem, 2004).

# The Relationship between Positive Affect and Negative Affect: It’s Complicated

About 20 years ago, I was an emotion or affect researcher. I was interested in structural models of affect, which was a hot research topic in the 1980s (Russell, 1980; Watson & Tellegen, 1985; Diener & Iran-Nejad, 1986; Shaver et al., 1987). In the 1990s, a consensus emerged that the structure of affect has a two-dimensional core, but a controversy remained about the basic dimensions that create the two-dimensional space. One model assumed that Positive Affect and Negative Affect are opposite ends of a single dimension (like hot and cold are opposite ends of a bipolar temperature dimension). The other model assumed that Positive Affect and Negative Affect are independent dimensions. This controversy was never resolved, probably because neither model is accurate (Schimmack & Grob, 2000).

When Seligman was pushing positive psychology as a new discipline in psychology, I was asked to write a chapter for a Handbook of Methods in Positive Psychology. This was a strange request because it is questionable whether Positive Psychology is really a distinct discipline and there are no distinct methods to study topics under the umbrella term positive psychology. Nevertheless, I obliged and wrote a chapter about the relationship between Positive Affect and Negative Affect that questions the assumption that positive emotions are a new and previously neglected topic and the assumption that Positive Affect can be studied separately from Negative Affect. The chapter basically summarized the literature on the relationship between PA and NA up to this point, including some mini meta-analyses that shed light on moderators of the relationship between PA and NA.

As with many handbooks that are expensive and not easily available as electronic documents, the chapter had very little impact on the literature. Web of Science shows only 25 citations. As the topic is still unresolved, I thought I would make the chapter available as a free text in addition to the Google Book option that is a bit harder to navigate.