
About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish between studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Loken and Gelman’s Simulation Is Not a Fair Comparison

“What I’d like to say is that it is OK to criticize a paper, even [if, typo in original] it isn’t horrible.” (Gelman, 2023)

In this spirit, I would like to criticize Loken and Gelman’s confusing article about the interpretation of effect sizes in studies with small samples and selection for significance. They compare random measurement error to a backpack and the outcome of a study to running speed. Common sense suggests that the same individual under identical conditions would run faster without a backpack than with a backpack. The same outcome is also suggested by psychometric theory: random measurement error attenuates population effect sizes, which makes it harder to demonstrate significance and produces, on average, weaker effect sizes.

The key point of Loken and Gelman’s article is to suggest that this intuition fails under some conditions: “Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise? We caution against the fallacy.”

To support their claim that common sense is a fallacy under certain conditions, they present the results of a simple simulation study. After some concerns about their conclusions were raised, Loken and Gelman shared the actual code of their simulation study. In this blog post, I share the code with annotations and reproduce their results. I also show that their results are based on selecting for significance only for the measure with random measurement error (with a backpack) and not for the measure without random measurement error (without a backpack). Reversing the selection shows that selection for significance without measurement error produces stronger effect sizes even more often than selection for significance with a backpack. Thus, it is not a fallacy to assume that we would all run faster without a backpack, holding all other factors equal. However, a runner with a heavy backpack and tailwinds might run faster than a runner without a backpack facing strong headwinds. While this is true, the influence of wind on performance makes it difficult to see the influence of the backpack. Under identical conditions, backpacks slow people down and random measurement error attenuates effects.

Loken and Gelman’s presentation of the results may explain why some readers, including us, misinterpreted their results to imply that selection bias and random measurement error may interact in some complex way to produce even more inflated estimates of the true correlation. We added some lines of code to their simulation to compute the average correlations after selection for significance separately for the measure without error and the measure with error. This way, both measures benefit equally from selection bias. The plot also provides more direct evidence about the amount of bias that is introduced by selection bias and random measurement error. In addition, the plot shows the average 95% confidence intervals around the estimated correlation coefficients.


The plot shows that for large samples (N > 1,000), the measure without error always produces the expected true correlation of r = .15, whereas the measure with error always produces the expected attenuated correlation of r = .15 * .80 = .12. As sample sizes get smaller, the effect of selection bias becomes apparent. For the measure without error, the observed effect sizes are now inflated. For the measure with error, the attenuation due to unreliability counteracts the inflation due to selection bias, and the two biases cancel each other out to produce more accurate estimates of the true effect size than the measure without error. For sample sizes below N = 400, however, both measures produce inflated estimates, and in really small samples the attenuation effect due to unreliability is overwhelmed by selection bias. Yet, while the difference due to unreliability becomes negligible and approaches zero, random measurement error combined with selection bias never produces even stronger estimates than the measure without error. Thus, it remains true that we should expect a measure without random measurement error to produce stronger correlations than a measure with random error. This fundamental principle of psychometrics, however, does not warrant the conclusion that an observed statistically significant correlation in small samples underestimates the true correlation coefficient, because the correlation may have been inflated by selection for significance.

The plot also shows how researchers can avoid misinterpretation of inflated effect size estimates in small samples. In small samples, confidence intervals are wide. Figure 2 shows that the confidence interval around inflated effect size estimates in small samples is so wide that it includes the true correlation of r = .15. The width of the confidence interval in small samples makes it clear that the study provides no meaningful information about the size of an effect. This does not mean the results are useless. After all, the results correctly show that the relationship between the variables is positive rather than negative. For the purpose of effect size estimation, it is necessary to conduct a meta-analysis that includes studies with significant and non-significant results. Furthermore, meta-analyses need to test for the presence of selection bias and correct for it when it is present.

P.S. If somebody claims that they ran a marathon in 2 hours with a heavy backpack, they may not be lying. They may just not be telling you all of the information. We often fill in the blanks, and that is where things can go wrong. If the backpack were a jet pack and the person were using it to fly for part of the race, we would no longer be surprised by the amazing feat. Similarly, if somebody tells you that they got a correlation of r = .8 in a sample of N = 8 with a measure that has only 20% reliable variance, you should not be surprised if they also tell you that they got this result after picking 1 out of 20 studies, because selection for significance will produce strong correlations in small samples even if there is no correlation at all. Once they tell you that they tried many times to get the one significant result, it is obvious that the next study is unlikely to replicate the significant result.
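To see how unsurprising such a result is, here is a minimal sketch (my own addition, not part of Loken and Gelman’s code) that computes the smallest correlation that can reach significance in N = 8 and the chance of getting at least one significant result in 20 attempts when the true correlation is zero.

n <- 8
t.crit <- qt(.975, df = n - 2)             # two-tailed critical t for alpha = .05
r.crit <- t.crit / sqrt(n - 2 + t.crit^2)  # smallest significant |r|
r.crit                                     # ~ .71, so any significant r in N = 8 is at least .71
1 - .95^20                                 # ~ .64 chance of at least one significant result in 20 null studies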

Sometimes You Can Be Faster With a Heavy Backpack

Annotated Original Code

 
### This is the final code used for the simulation studies posted by Andrew Gelman on his blog
 
### Comments are highlighted with my initials #US#
 
# First just the original two plots, high power N = 3000, low power N = 50, true slope = .15
 
r <- .15
sims<-array(0,c(1000,4))
xerror <- 0.5
yerror<-0.5
 
for (i in 1:1000) {
x <- rnorm(50,0,1)
y <- r*x + rnorm(50,0,1) 
 
#US# this is a sloppy way to simulate a correlation of r = .15
#US# The proper code is r*x + rnorm(50,0,1)*sqrt(1-r^2)
#US# However, with the specific value of r = .15, the difference is trivial
#US# However, it raises some concerns about expertise
 
xx<-lm(y~x)
sims[i,1]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(50,0,xerror)
y<-y + rnorm(50,0,yerror)
xx<-lm(y~x)
sims[i,2]<-summary(xx)$coefficients[2,1]
 
x <- rnorm(3000,0,1)
y <- r*x + rnorm(3000,0,1)
xx<-lm(y~x)
sims[i,3]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(3000,0,xerror)
y<-y + rnorm(3000,0,yerror)
xx<-lm(y~x)
sims[i,4]<-summary(xx)$coefficients[2,1]
 
}
 
plot(sims[,2] ~ sims[,1],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")
 
plot(sims[,4] ~ sims[,3],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")
 
#US# There is no major issue with graphs 1 and 2. 
#US# They merely show that high sampling error produces large uncertainty in the estimates.
#US# The small attenuation effect of r = .15 vs. r = .12 is overwhelmed by sampling error
#US# The real issue is the simulation of selection for significance in the third graph
 
# third graph
 
# run 2000 regressions at points between N = 50 and N = 3050 
 
r <- .15
 
propor <-numeric(31)
powers<-seq(50,3050,100)
 
#US# These lines of code are added to illustrate the biased selection for significance 
propor.reversed.selection <-numeric(31) 
mean.sig.cor.without.error <- numeric(31) # mean correlation for the measure without error when t > 2
mean.sig.cor.with.error <- numeric(31) # mean correlation for the measure with error when t > 2
 
#US# It is sloppy to refer to sample sizes as powers. 
#US# In between subject studies, the power to produce a true positive result
#US# is a function of the population correlation and the sample size
#US# With population correlations fixed at r = .15 or r = .12, sample size is the
#US# only variable that influences power
#US# However, power varies from alpha to 1 and it would be interesting to compare the 
#US# power of studies with r = .15 and r = .12 to produce a significant result.
#US# The claim that “one would always run faster without a backpack” 
#US# could be interpreted as a claim that it is always easier to obtain a 
#US# significant result without measurement error, r = .15, than with measurement error, r = .12
#US# This claim can be tested with Loken and Gelman’s simulation by computing 
#US# the percentage of significant results obtained without and with measurement error
#US# Loken and Gelman do not show this comparison of power.
#US# The reason might be the confusion of sample size with power. 
#US# While sample sizes are held constant, power varies as a function of the population correlations
#US# without, r = .15, and with, r = .12, measurement error. 
 
xerror<-0.5
yerror<-0.5
 
j = 1
i = 1
 
for (j in 1:31)  {
 
sims<-array(0,c(1000,4))
for (i in 1:1000) {
x <- rnorm(powers[j],0,1)
y <- r*x + rnorm(powers[j],0,1)
#US# the same sloppy simulation of population correlations as before
xx<-lm(y~x)
sims[i,1:2]<-summary(xx)$coefficients[2,1:2]
x<-x + rnorm(powers[j],0,xerror)
y<-y + rnorm(powers[j],0,yerror)
xx<-lm(y~x)
sims[i,3:4]<-summary(xx)$coefficients[2,1:2]
}
 
#US# The code is the same as before, it just adds variation in sample sizes
#US# The crucial aspect to understand figure 3 is the following code that 
#US# compares the results for the paired outcomes without and with measurement error
 
#US# Carlos Ungil (https://ch.linkedin.com/in/ungil) pointed out on Gelman’s blog
#US# that there is another sloppy mistake in the simulation code that does not alter the results.
#US# The code compares absolute t-values (coefficient/sampling error), while the article
#US# talks about inflated effect size estimates. However, while the sampling error variation
#US# creates some variability, the pattern remains the same.
#US# For the sake of reproducibility I kept the comparison of t-values.
 
# find significant observations (t test > 2) and then check proportion
temp<-sims[abs(sims[,3]/sims[,4])> 2,]
 
#US# the use of t > 2 is sloppy and unnecessary.
#US# summary(lm) gives the exact p-values that could be used to select for significance
#US# summary(xx)$coefficients[2,4] < .05
#US# However, this does not make a substantial difference 
 
#US# The crucial part of this code is that it uses the outcomes of the simulation 
#US# with random measurement error to select for significance
#US# As outcomes are paired, this means that the code sometimes selects outcomes
#US# in which sampling error produces significance with random measurement error 
#US# but not without measurement error. 
 
propor[j] <- table((abs(temp[,3]/temp[,4])> abs(temp[,1]/temp[,2])))[2]/length(temp[,1])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# when measurement error is present.
mean.sig.cor.with.error[j] = mean(temp[,3])
 
#US# Conditioning on significance for one of the two measures is a strange way
#US# to compare outcomes with and without measurement error.
#US# Obviously, the opposite selection bias would favor the measure without error.
#US# This can be shown by computing the same proportion after selecting for significance
#US# for the measure without error
 
temp<-sims[abs(sims[,1]/sims[,2])> 2,]
propor.reversed.selection[j] <- table((abs(temp[,1]/temp[,2])> abs(temp[,3]/temp[,4])))[2]/length(temp[,4])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# without measurement error. 
mean.sig.cor.without.error[j] = mean(temp[,1])
 
print(j)
 
#US# We could also add comparisons that are more meaningful and avoid conditioning on only one measure
###
 
}
 
 
#US# the plot code had to be modified slightly to have matching y-axes 
#US# I also added a title 
title = "Original Loken and Gelman Code"
 
plot(powers,propor,type="l",
ylim=c(0,1),main=title,  ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="blue")
 
#US# text that explains what the plot displays, not shown
#US# #text(200,.8,"How often is the correlation higher for the measure with error",pos=4)
#US# #text(200,.75,"when pairs of outcomes are selected based on significance of",pos=4)
#US# #text(200,.70,"of the measure with error?",pos=4)
 
#US# We can now plot the two outcomes in the same figure 
#US# The original color was blue. I used red for the reversed selection
par(new=TRUE)
plot(powers,propor.reversed.selection,type="l",
ylim=c(0,1), ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="firebrick2")
 
#US# adding a legend 
legend(1500,.9,legend=c("with backpack only sig. \n shown in article \n ",
"without backpack only sig. \n added by me"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at 50%
abline(h=.5,lty=2)
 
 
#US# The following code shows the plot of mean correlations after selection for significance
#US# for the measure with error (blue) and the measure without error (red)
 
title = "Comparison of Correlations after Selection for Significance"
 
plot(powers,mean.sig.cor.with.error,type="l",ylim=c(.1,.4),main=title,
xlab="Sample Size",ylab="Mean Observed Correlation",col="blue")
 
par(new=TRUE)
 
plot(powers,mean.sig.cor.without.error,type="l",ylim=c(.1,.4),main="",
xlab="Sample Size",ylab="Mean Observed Correlation",col="firebrick2")
 
#US# adding a legend 
legend(2000,.4,legend=c("with error",
"without error"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at the true correlation of .15
abline(h=.15,lty=2)
 

Loken and Gelman are still wrong

Abstract
Loken and Gelman published the results of a study that aimed to simulate the influence of random measurement error on effect size estimates in studies with low power (small sample, small correlation). Their Figure 3 suggested that “of statistically significant effects observed after error, a majority could be greater than in the ‘ideal’ setting when N is small.” I show with a simple simulation that this result is based on a mistake in their simulation code that conflates sampling error and random measurement error. Holding random measurement error constant across simulations reaffirms Hausman’s iron law of econometrics that random measurement error is likely to produce attenuated effect size estimates. The article concludes with the iron law of meta-science: original authors of a novel discovery are the least likely people to find an error in their work.

Introduction

Loken and Gelman published a brief essay on “Measurement error and the replication crisis” in the magazine Science. As it turns out, the claims in the article are ambiguous because key terms like measurement error are not properly defined. However, the article does contain the results of simulation studies that are presented in a Figure. The key figure is Figure 3.

This figure is used to support the claim that “of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small.”

Some points about this claim are important. It is not a claim about a single study. In a single study, measurement error, sampling error, and other factors CAN produce a stronger result with a less reliable measure, just like some people can win the lottery, even though it is a very unlikely event. The claim is clearly about the outcome in the long run after many repeated trials. That is also implied by a figure that is based on simulations of many repeated trials of the same study. What does the figure imply? It implies that measurement error attenuates observed correlations (or regression coefficients with measurement error in the predictor variable, x) in large samples. The reason is simply that random measurement error adds variance to a variable that is unrelated to the outcome measure. As a result, the observed correlation is a mixture of the true relationship and a correlation of zero, and the mixture depends on the amount of random measurement error in the predictor variable.

Selection for significance on the other hand has the opposite effect. To obtain significance, the observed correlation has to have a minimum value so that the observed correlation is approximately twice as large as the sampling error (t ~ 2 equals p < .05, two-tailed). In large samples, sampling error is small and correlations of r = .15 are significant in most cases (i.e., the study has high power). When 99% of all studies are significant, selecting for significance to get a success rate of 100% is irrelevant. However, in small samples with N = 50, a small correlation of r = .15 is not enough to get significance. Thus, all significant correlations are inflated. Measurement error attenuates correlations and makes it even harder to get significant results. With reliability = .8 and a correlation of r = .15, the expected correlation is only .15 * .8 = .12 and more inflation is needed to get significance.
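A minimal sketch (added by me, not part of the original article) makes this concrete by computing the smallest correlation that reaches p < .05 (two-tailed) for a given sample size:

r.crit <- function(n) {
  t.crit <- qt(.975, df = n - 2)          # two-tailed critical t for alpha = .05
  t.crit / sqrt(n - 2 + t.crit^2)         # smallest significant |r|
}
r.crit(50)    # ~ .28: every significant correlation in N = 50 is inflated relative to .15 or .12
r.crit(1000)  # ~ .06: r = .15 reaches significance without any inflation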

Figure 3 in Loken and Gelman’s article suggests that selection for significance with unreliable measures produces even more inflated effect size estimates than selection for significance without measurement error. This is implied by the results of a simulation study that produced a majority (over 50%) of outcomes where the effect size estimate was higher (and more inflated) when random measurement error was added than in the ideal setting without random measurement error. Loken and Gelman’s claim “of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small” is based on this result of their simulation studies. With N = 50, r = .15, and reliability of .8, a majority of the comparisons showed a stronger effect size estimate for the simulation with random error than for the simulation without random error.

I believe that this outcome is based on an error in their simulation studies. The simulation does not clearly distinguish between sampling error and random measurement error. I have tried to make this point repeatedly on Gelman’s blog post, but this discussion and my previous blog post (that Gelman probably did not read) failed to resolve this controversy. However, it helped me to understand the source of the disagreement more clearly. I maintain that Gelman does not make a clear distinction between sampling error (i.e., even with perfectly reliable measures, results will vary from sample to sample, and this variability is larger in small samples, STATS101) and random measurement error (i.e., two measures of the same construct are not perfectly correlated with each other, NOT A TOPIC OF STATISTICS, which typically assumes perfect measures). Based on this insight, I wrote a new R-script that clearly distinguishes between sampling error and random measurement error. I ran the script 10,000 times. Here are the key results.

The simulation ensured that reliability in each run is exactly 80%.


The expected effect sizes are r = .15 for the true relationship and r = .12 for the measure with 80% reliability. The average effect sizes across the 10,000 simulations match these expected values. We also see that sampling error produces huge variability in specific runs. However, even extreme deviations are attenuated by random measurement error. Thus, random measurement error makes values less extreme.

What about sign errors? We don’t really know the true correlation, and two-tailed testing allows researchers to reject H0 with the wrong sign. To allow for this possibility, we can compute the absolute correlations.

This does not matter. The results for the measure with random error are still lower and less extreme.

Now we can examine how conditioning on significance influences the results.

Once more the effect size estimates for the true correlation are stronger and more extreme than those for the measure with random measurement error. This is also true for absolute effect size estimates.

Loken and Gelman’s Figure 3 required the direct comparison of two outcomes in the same run after selection for significance. This creates a problem because sometimes one result will be significant and the other one will not be significant. As a result, the comparison is biased because it compares estimates after selection for significance with estimates without selection for significance. However, even with this bias in favor of the unreliable measure, random measurement error produced weaker effect size estimates in the majority of all cases.

Conclusion

In short, these results confirm Hausman’s iron law of econometrics that random measurement error typically attenuates effect size estimates. Typically, of course, does not mean always. However, Loken and Gelman claimed that they identified a situation in which the iron law of econometrics does not apply and can lead to false inferences. They claimed that (a) in small samples and (b) after selection for significance, random measurement error will produce stronger effect size estimates not once or twice but IN A MAJORITY of studies. This claim was implied by the results of their simulations displayed in Figure 3. Here I showed that their simulation fails to simulate the influence of random measurement error. Holding random measurement error constant at 80% reliability produces the expected outcome that random measurement error is more likely to attenuate effect size estimates than to inflate them, even in small samples and after selection for significance. Thus, researchers are justified in claiming that they could have obtained stronger correlations with a more reliable measure or in using latent variable models to correct for unreliability. What they cannot do is claim that the true population correlation is stronger than their observed correlation, because this claim ignores the influence of selection for significance that inevitably inflates observed correlations in small samples with small effect sizes. It is also not correct to assume that two wrongs (selecting for significance with unreliable measures) make one right. Robust and replicable results require good data. Effect sizes of correlations should only be interpreted if measures have demonstrated good reliability (and validity, but that is another topic) and when sampling error is small enough to produce a meaningful range of plausible values.

New Simulation of Reliability

N = 50

REL = .80
n.sim = 10000

res = c()

for (i in 1:n.sim) {

# true score SV with variance equal to the reliability
SV = scale(rnorm(N))*sqrt(REL)
var(SV)

# error component for x1: orthogonal to SV, variance 1 - REL
x1 = rnorm(N)
x1 = residuals(lm(x1 ~ SV))
x1 = scale(x1)
x1 = x1*sqrt(1-REL)
var(x1)

# error component for x2: orthogonal to x1 and SV, variance 1 - REL
x2 = rnorm(N)
x2 = residuals(lm(x2 ~ x1 + SV))
x2 = scale(x2)
x2 = x2*sqrt(1-REL)
var(x2)

# observed measures: true score plus error, so reliability is exactly REL in every run
x1 = x1 + SV
x2 = x2 + SV

# criterion y with a true regression slope of .15 on the true score
y = .15 * SV + rnorm(N)*sqrt(1-.15^2)

# store variances, the correlation between the two parallel measures (= reliability),
# and the slope estimates for the true score and the two observed measures
r = c(var(x1),var(x2),cor(x1,x2),
summary(lm(y ~ SV))$coef[2,],
summary(lm(y ~ x1))$coef[2,],
summary(lm(y ~ x2))$coef[2,]
);r

res = rbind(res,r)

} # End of sim

summary(res)

Open Science Practices and Replicability

Summary

A recent article in the flashy journal “Nature Human Behaviour,” which charges authors or their universities $6,000, published the claim that “high replicability of newly discovered social-behavioural findings is achievable” (Protzko et al., 2023). This is good news for social scientists and consumers of social psychology after a decade of replication failures caused by questionable research practices, including fraud.

So, what is the magic formula to produce replicable and credible findings in the social sciences?

The paper attributes success to the implementation of four rigour-enhancing practices, namely confirmatory tests, large sample sizes, preregistration, and methodological transparency. The problem with this multi-pronged approach is that it is not possible to say which of these features are necessary or sufficient to produce replicable results.

I analyze the results of this article with the R-Index. Based on these results, I conclude that none of the four rigor-enhancing practices are necessary to produce highly replicable results. The key ingredients for high replicability are honesty and high power. It is wrong to confuse large samples (N = 1,500) with high power. As shown, sometimes N = 1,500 has low power and sometimes much smaller samples are sufficient to have high power.

Introduction

The article reports 16 studies. Each study was proposed by one lab, and the proposing lab reported the results of a confirmatory test; these tests produced significant results in 15 of the 16 studies. The replication studies by the other three labs produced significant results in 79% of the studies.

I predicted these replication outcomes with the Replicability-Index (R-Index). The R-Index is a simple method to estimate replicability for a small set of studies. The key insight of the R-Index is that the outcome of unbiased replication studies is a function of the mean (I once assumed the median would be better, but this was wrong) power of the original studies (Brunner & Schimmack, 2021). Unfortunately, it can be difficult to estimate the true mean power based on original studies because original studies are often selected for significance, and selection for significance leads to inflated estimates of observed power. The R-Index adjusts for this inflation by comparing the success rate (percentage of significant results) to the mean observed power. If the success rate is higher than the mean observed power, selection bias is present and the mean power is inflated. A simple heuristic to correct for this inflation is to subtract the inflation from the observed power, as illustrated below.
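A minimal sketch of this heuristic (my own illustration with hypothetical numbers, not the values from the Protzko et al. data):

success.rate   <- .95   # hypothetical percentage of significant original results
mean.obs.power <- .80   # hypothetical mean observed power of the original studies
inflation      <- success.rate - mean.obs.power   # evidence of selection bias
r.index        <- mean.obs.power - inflation      # predicted replication success rate
r.index                                           # .65 in this hypothetical example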

The article reported the outcomes of “original” (blue = self-replication) and replication studies (green = independent replications by other labs) in Figure 1.

To obtain estimates of observed power, I used the point estimates of the original studies and the lower limit of the 95%CI. I converted these statistics into z-scores, using the formula ES/((ES − LL.CI)/2). The z-scores were converted into p-values, and p-values below .05 were considered significant. Visual inspection of Figure 1 shows that one original study (blue) did not have a statistically significant result (i.e., the 95%CI includes a value of zero). Thus, the actual success rate was 15/16 = 94%.
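This conversion is easy to reproduce; here is a minimal sketch (my own illustration with a hypothetical effect size and lower limit, not values read off Figure 1):

ES    <- 0.30                        # hypothetical point estimate of an original study
LL.CI <- 0.10                        # hypothetical lower limit of its 95% confidence interval
z     <- ES / ((ES - LL.CI)/2)       # half-width of the CI treated as two standard errors
p     <- 2*(1 - pnorm(z))            # two-tailed p-value; p < .05 counts as significant
obs.power <- pnorm(z - qnorm(.975))  # observed power (standard normal approximation)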

Table 1 shows that the mean observed power is 87%. Thus, there is evidence of a small amount of selection for significance and the predicted success rate of replication studies is .87 – .06 = .81. The actual success rate was computed as the percentage of replication studies (k = 3) that produced a significant result. The overall success rate of replication studies was 79%, which is close to the estimate of the R-Index, 81%. Finally, it is evident that power varies across studies. 9 studies had z-scores greater than 5 (the 5 sigma rule of particle physics) and all 9 studies had a replication success rate of 100%. The only reason for replication failures of studies with z-scores greater than 5 is fraud or problems in the implementation of the actual replication study. In contrast, studies with z-scores below 4 have insufficient power to produce consistent significant results. The correlation between observed power and replication success rates is r = .93. This finding demonstrates empirically that power determines the outcome of unbiased replication studies.

Discussion

Honest reporting of results is necessary to trust published results. Open Science Practices may help to ensure that results are reported honestly. This is particularly valuable for the evaluation of a single study. However, statistical tools like the R-Index can be used to examine whether a set of studies is unbiased or whether the results are biased. In the present set of 16 original studies, it detected a small bias that explains the differences in success rates for the original studies (blue, 94%) and the replication studies (green, 79%).

More importantly, the investigation of power shows that some of the studies were underpowered to reject the nil-hypothesis even with N = 1,500 because the real effect sizes were too close to zero. This shows how difficult it is to provide evidence for the absence of an important effect.

At the same time, other studies had large effect sizes and were dramatically overpowered to demonstrate an effect. As shown, z-scores of 5 are sufficient to provide conclusive evidence against a nil-hypothesis, and this criterion is used in particle physics for strong hypothesis tests. Using N = 1,500 for an effect size of d = .6 is overkill. This means that researchers who cannot easily collect data from large samples can still produce credible results. There are also other methods than increasing sample sizes to reduce sampling error and increase power. Within-subject designs with many repeated trials can produce credible and replicable results with sample sizes of N = 8. Sample size should not be used as a criterion to evaluate studies, and large samples should not be used as a criterion for good science.

To evaluate the credibility of results in single studies, it is useful to examine confidence intervals and to see which effect sizes are excluded by the lower limit of the confidence interval. Confidence intervals that exclude zero, but not values close to zero suggest that a study was underpowered and that the true population effect size may be so close to zero that it is practically zero. In addition, p-values or z-scores provide valuable information about replicability. Results with z-scores greater than 5 are extremely likely to replicate in an exact replication study and replication failures suggest a significant moderating factor.

Finally, the present results suggest that other aspects of open science like pre-registration are not necessary to produce highly replicable results. Even exploratory results that produced strong evidence (z > 5) are likely to replicate. The reason is that luck or extreme p-hacking does not produce such extreme evidence against the null-hypothesis. A better understanding of the strength of evidence may help to produce credible results without wasting precious resources on unnecessarily large samples.

Random Measurement Error and the Replication Crisis

The code for all simulations is available on OSF (https://osf.io/pyhmr).

P.S. I have been arguing with Andrew Gelman on his blog about his confusing and misleading article with Loken. Things have escalated and I just want to share his latest response.

Andrew on  said:

Ulrich:
I’m not dragging McShane into anything. You’re the one who showed up in the comment thread, mischaracterized his paper in two different ways, and called him an “asshole” for doing something that he didn’t do. You say now that you don’t care that he cited you; earlier you called him an asshole for not citing you, even though you did.

Also, your summaries of the McShane et al. and Gelman and Loken papers are inaccurate, as are your comments about confidence intervals, as are a few zillion other things you’ve been saying in this thread.

Please just stop it with the comments here. You’re spreading false statements and wasting our time; also I don’t think this is doing you any good either. I don’t really mind the insults if you had anything useful to say (here’s an example of where I received some critical feedback that was kinda rude but still very helpful), but whatever useful you have to say, you’ve already said, and at this point you’re just going around in circles. So, no more.

The main reason to share this is that I already showed that confidence intervals are often accurate even after selection for significance and that this is even more true when studies use unreliable measures, because the attenuation due to random measurement error compensates for the inflation due to selection for significance. I am not saying that this makes it OK to suppress non-significant results, but it does show that Gelman is not interested in telling researchers how to avoid misinterpretation of biased point estimates. He likes to point out mistakes in other people’s work, but he is not very good at noticing mistakes in his own work. I have repeatedly asked for feedback on my simulation results, and if there are mistakes I am going to correct them. Gelman hasn’t done so, and so far nobody else has. Of course, I cannot find a mistake in my own simulations. Ergo, I maintain that confidence intervals are useful to avoid misinterpretation of pointless point estimates. The real reason why confidence intervals are rarely interpreted (other than saying CI = .01 to 1.00 excludes zero, therefore the nil-hypothesis can be rejected, which is just silly nil-hypothesis testing, Cohen, 1994) is that confidence intervals in between-subject designs with small samples are so wide that they do not allow strong conclusions about population effect sizes.

Introduction

A few years ago, Loken and Gelman (2017) published an article in the magazine Science. A key novel claim in this article was that random measurement error can inflate effect size estimates.

“In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance”

Language is famously ambiguous and open to interpretation. However, the article also presented a figure that seemed to support this counterintuitive conclusion.

The figure seems to suggest that with selection for significance, overestimation of effect sizes is increasingly more common in studies that use an unreliable measure rather than a reliable measure. At some point, the proportion of studies where the effect size estimate is greater with rather than without error seems to be over 50%.

Paradoxical findings are interesting and attracted our attention (Schimmack & Carlson, 2017). We believed that this conclusion was based on a mistake in the simulation code. We also tried to explain the combined effects of sampling error and random measurement error on effect sizes in a short commentary that remained unpublished. We never published our extensive simulation results.

Recently, a Ph.D. student also questioned the simulation code, and Andrew Gelman posted the student’s concerns on his blog (Simulations of measurement error and the replication crisis: Maybe Loken and I have a mistake in our paper?). The blog post also included the simulation code.

The simulation is simple. It generates two variables with SD = 1 and a correlation ~ r = .15. It then adds 25% random measurement error to both variables, so that the two variables are measures of the former variables with 4/5 = 80% reliability. This attenuates the true correlation slightly to .15*.8 = .12. The crucial condition is when this simulation is run with a small sample size of N = 50.
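The implied reliability is easy to verify; here is a minimal sketch (my own arithmetic, not part of the original code):

true.var    <- 1                                 # variance of the simulated variables
error.var   <- 0.5^2                             # adding error with SD = 0.5 adds .25 error variance
reliability <- true.var / (true.var + error.var) # = 0.8
attenuation <- sqrt(reliability * reliability)   # error in both variables: sqrt(rel.x * rel.y) = .8
.15 * attenuation                                # expected observed correlation ~ .12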

N = 50 is a small sample size to study an effect size of .15 or .12. So, we are expecting mostly non-significant results. The crucial question is what happens when researchers get lucky and obtain a statistically significant result. Would selection for significance produce a stronger effect size estimate for the perfect measure or the unreliable measure?

It is not easy to answer this question because selection for significance requires conditioning on an outcome, and Loken and Gelman’s simulation has two outcomes. The outcomes for the perfect measure are paired with the outcomes for the unreliable measure. So, which outcome should be used to select for significance? Using either measure will of course benefit the measure that was used to select for significance. To avoid this problem, I simply examined all four possible outcomes: neither measure was significant, the perfect measure was significant and the unreliable one was not, the unreliable measure was significant and the perfect one was not, or both were significant. To obtain stable cell frequencies, I ran 10,000 simulations (a sketch of this cross-tabulation is shown below).
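A minimal sketch of this cross-tabulation (my own re-implementation under the stated assumptions, not the exact script that produced the counts below):

set.seed(123)
n.sim <- 10000; N <- 50; r <- .15
sig1 <- sig2 <- est1 <- est2 <- numeric(n.sim)
for (i in 1:n.sim) {
  x <- rnorm(N); y <- r*x + rnorm(N)
  fit1 <- summary(lm(y ~ x))$coefficients                   # perfect measures
  x.e <- x + rnorm(N, 0, .5); y.e <- y + rnorm(N, 0, .5)    # 80% reliable measures
  fit2 <- summary(lm(y.e ~ x.e))$coefficients
  est1[i] <- fit1[2,1]; sig1[i] <- fit1[2,4] < .05
  est2[i] <- fit2[2,1]; sig2[i] <- fit2[2,4] < .05
}
table(sig1, sig2)   # the four cells used in conditions 1 to 4 below
mean(est1 > est2)   # how often the perfect measure has the higher estimate overall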

Here are the results.

1. Neither measure produced a significant result

4870 times the perfect measure had a higher estimate than the unreliable measure (58%)
3629 times the unreliable measure had a higher estimate than the perfect measure (42%)

2. Both measures produced a significant result

579 times the perfect measure had a higher estimate than the unreliable measure (61%)
377 times the unreliable measure had a higher estimate than the perfect measure (39%)

3. The reliable measure is significant and the unreliable measure is not significant

981 times the perfect measure had a higher estimate than the unreliable measure (100%)
0 times the unreliable measure had a higher estimate than the perfect measure (0%)

4. The unreliable measure is significant and the reliable measure is not significant

0 times the perfect measure had a higher estimate than the unreliable measure (0%)
464 times the unreliable measure had a higher estimate than the perfect measure (100%)

The main point of these results is that selection for significance will always favor the measure that is used for conditioning on significance. By definition, the effect size of a significant result will be larger than the effect size of a non-significant result given equal sample size. However, it is also clear that the unreliable measure produces fewer significant results because random measurement error attenuates the effect size and reduces power; that is, the probability to obtain a significant result.
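The power difference can be checked directly; here is a minimal sketch (added by me) using the Fisher z approximation in base R:

power.r <- function(r, n, alpha = .05) {
  z <- atanh(r) * sqrt(n - 3)                                # expected test statistic
  pnorm(z - qnorm(1 - alpha/2)) + pnorm(-z - qnorm(1 - alpha/2))
}
power.r(.15, 50)   # ~ .18 without measurement error
power.r(.12, 50)   # ~ .13 with 80% reliable measures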

Based on these results, we can reproduce Loken and Gelman’s finding that larger effect size estimates occur more often with the unreliable measure. To produce this result, they conditioned on significance for the measure with random error, but not for the measure without random measurement error. That is, they combined conditions 2 (both measures produced significant results) and 4 (ONLY the unreliable measure produced a significant result).

5. (2 + 4) The unreliable measure is significant, the reliable measure can be significant or not significant.

When we simply select for significance on the unreliable measure, we see that the unreliable measure has the stronger effect size over 50% of the time.

579 times the perfect measure had a higher estimate than the unreliable measure (41%)
377+464 = 841 times the unreliable measure had a higher estimate than the perfect measure (59%)

However, this is not a fair comparison of the two measures. Selection for significance is applied to one of them and not the other. The illusion of a reversal is produced by selection bias in the simulation, not in a real-world scenario where researchers use one or the other measure. This is easy to see when we condition on the reliable measure.

6. (2 + 3) The reliable measure is significant, significance on the other measure does not matter.

579+981 = 1560 times the perfect measure had a higher estimate than the unreliable measure (81%)
377 times the unreliable measure had a higher estimate than the perfect measure (19%)

Now, we overestimate the advantage of the reliable measure. Conditioning on significance selectively for one variable and not the other produces biased simulation results that falsely suggest that an unreliable measure produces stronger correlations despite the fact that random measurement error attenuates correlation coefficients and other standardized effect size estimates.

Extension 1: A Fair Comparison

I posted polls on Twitter (X) and in a Facebook group that discusses methodological issues in psychological science. I created a scenario in which two labs are studying the same effect, say r = .15 in the population between two variables. Both labs use the same small sample size of N = 50. Both labs only publish when they get a statistically significant result, p < .05 (two-tailed). Both labs use a perfectly reliable predictor variable (e.g., height). The only difference between the two labs is that LAB A uses a more reliable measure (Cronbach’s alpha = .8) than LAB B (Cronbach’s alpha = .5). The question is simply which lab is more likely to publish a bigger (more biased) effect size estimate? The key difference between this scenario and Loken and Gelman’s simulation is that both outcomes are conditioned on significance. Thus, we can actually see the influence of random measurement error on correlation coefficients when results are selected for significance. The polls showed that the correct answer is unknown. In both polls, two-thirds of respondents sided with Loken and Gelman and answered that LAB B is more likely to report larger correlations (Twitter/X 65%, Facebook group 68%). They may have been influenced by Loken and Gelman’s article that claimed “in noisy research settings, poor measurement can contribute to exaggerated estimates of effect size.”

ChatGPT on Bing thinks LAB A will report stronger effect sizes, but its reasoning does not take conditioning for significance into account. “According to the results, the effect size is a measure of the magnitude of the experimental effect, and it is influenced by the reliability of the measures used. The higher the reliability, the more accurate the effect size estimate will be1. Therefore, LAB A, which uses a more reliable measure (Cronbach’s alpha = .8) than LAB B (Cronbach’s alpha = .5), is more likely to publish a bigger (more biased) effect size estimate. LAB B, which uses a less reliable measure, will have more measurement error and lower statistical power, making it harder to detect the true effect size in the population.”

To obtain the correct answer, I made only a small change to Loken and Gelman’s simulation. First, I did not add measurement error to the predictor variable, X. Second, I added different amounts of random measurement error to two outcome variables, Y1 with 80% reliable variance for LAB A and Y2 with 50% reliable variance for LAB B. I ran 10,000 simulations to have a reasonably large number of cases after selection for significance. LAB A had more significant results because its population effect size is larger, .15 * .8 = .12, than the one for LAB B, .15 * .5 = .075, and studies with larger effect sizes in the same sample size have more statistical power, that is, a greater chance to produce a significant result. In the simulation, LAB A had 1,435 significant results (~14% power) and LAB B had 1,106 significant results (~11% power). I then compared the first 1,106 significant results from LAB A to the 1,106 significant results from LAB B and computed how often LAB A had a higher effect size estimate than LAB B (a sketch of this setup is shown below).
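A minimal sketch of this comparison (my own reconstruction under the stated assumptions; the exact script may differ):

set.seed(456)
n.sim <- 10000; N <- 50; r <- .15
estA <- estB <- rep(NA, n.sim)
for (i in 1:n.sim) {
  x  <- rnorm(N)                               # perfectly reliable predictor
  y  <- r*x + rnorm(N)*sqrt(1 - r^2)           # error-free criterion
  yA <- y + rnorm(N, 0, 0.5)                   # LAB A: ~80% reliable measure
  yB <- y + rnorm(N, 0, 1.0)                   # LAB B: ~50% reliable measure
  ctA <- cor.test(x, yA); ctB <- cor.test(x, yB)
  if (ctA$p.value < .05) estA[i] <- ctA$estimate
  if (ctB$p.value < .05) estB[i] <- ctB$estimate
}
sigA <- estA[!is.na(estA)]; sigB <- estB[!is.na(estB)]
c(length(sigA), length(sigB))              # LAB A produces more significant results
k <- min(length(sigA), length(sigB))
mean(sigA[1:k] > sigB[1:k])                # roughly .5, with no advantage for the noisier measure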

Results: LAB A had a higher effect size estimate in 569 cases (51%) and LAB B had a higher effect size estimate in 537 cases (49%). Thus, there is no reversal that less reliable measures produce stronger (more biased) correlations in studies with low power and after selection for significance. Loken and Gelman’s false conclusion is based on an error in their simulations that conditioned on significance for the unreliable measure, but not for the measure without random measurement error.

Would a more extreme scenario produce a reversal? Power is already low, and nobody should compute correlation coefficients in samples with N = 20, but fMRI researchers famously reported dramatic correlations between brain and behavior in studies with N = 8 (“voodoo correlations”; Vul et al., 2012). So, I repeated the simulation with N = 20, pitting a measure with 100% reliability against a measure with 20% reliability. Given the low power, I ran 100,000 simulations to get stable results.

Results:

LAB A obtained 9,620 significant results (Power ~ 10%). LAB B obtained 6,030 (Power ~ 6%, close to chance, 5% with alpha = .05).

The comparison of the first 6,030 significant results from LAB A with the 6,030 significant results from LAB B showed that LAB A reported a stronger effect size 3,227 times (54%) and LAB B reported a stronger effect size 2,803 times (46%). Thus, a more reliable measure not only helps to report voodoo correlations more often, but also to report higher correlations. Clearly, using less reliable measures does not contribute to the replication crisis in the way Loken and Gelman claimed. Their claim is based on a mistake in their simulations that conditioned yoked outcomes on significance of the unreliable measure.

Extension 2: Simulation with two equally unreliable measures

The next simulation is a new simulation that has two purposes. First, it drives home the message that Gelman’s simulation study unfairly biased the results in favor of the unreliable measure by conditioning on significance for this measure. Second, it provides a new insight into the contribution of unreliable measures to the replication crisis. The simulation assumes that researchers really use two dependent variables (or more) and are going to report results if at least one of the measures has a significant result. Evidently, this doesn’t really work with two perfect measures because they are perfectly correlated, r = 1. As a result, they will always show the same correlation with the independent variable. However, unreliable measures are not perfectly correlated with each other and produce different correlations. This provides room for capitalizing on chance and getting significance with low power. The lower the reliability of the measures the better. I picked a reliability of .5 for both dependent measures (Y1, Y2) and assumed that the independent variable has perfect reliability (e.g., an experimental manipulation).
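A minimal sketch of this two-dependent-variable scenario (my own reconstruction under the stated assumptions):

set.seed(789)
n.sim <- 10000; N <- 50; r <- .15
sig.single <- sig.either <- logical(n.sim)
for (i in 1:n.sim) {
  x  <- rnorm(N)                        # perfectly reliable independent variable
  y  <- r*x + rnorm(N)*sqrt(1 - r^2)    # error-free criterion
  y1 <- y + rnorm(N)                    # first measure, ~50% reliable
  y2 <- y + rnorm(N)                    # second measure, ~50% reliable
  p1 <- cor.test(x, y1)$p.value
  p2 <- cor.test(x, y2)$p.value
  sig.single[i] <- p1 < .05
  sig.either[i] <- p1 < .05 | p2 < .05
}
mean(sig.single)   # power of a single unreliable measure
mean(sig.either)   # success rate when either significant measure is reported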

1. Neither measure produced a significant result

4011 times Y1 had a higher estimate than Y2 (49%)
4109 times Y2 had a higher estimate than Y1 (51%).

2. Both measures produced a significant result.

212 times Y1 had a higher estimate than Y2 (57%)
162 times Y2 had a higher estimate than Y1 (43%).

3. Y1 is significant and Y2 is not significant

743 times Y1 had a higher estimate than Y2 (100%)
0 times Y2 had a higher estimate than Y1 (0%).

4. Y2 is significant and Y1 is not significant

0 times Y1 had a higher estimate than Y2 (0%)
763 times Y2 had a higher estimate than Y1 (100%).

The results show that using two measures with 50% reliability increases the chances of obtaining a significant result by about 750 / 10000 tries (7.5 percentage points). Thus, unreliable measures can contribute to the replication crisis if researchers use multiple unreliable measures and selectively publish results for the significant one. However, using a single unreliable measure versus a single reliable measure is not beneficial because an unreliable measure makes it less likely to obtain a significant result. Gelman’s reversal is an artifact by conditioning on one outcome. This can be easily seen by comparing the results after conditioning on significance for Y1 or Y2.

5. (2 + 3) Y1 is significant, significance of Y2 does not matter

212+743 = 955 times Y1 had a higher estimate than Y2 (85%)
162 times Y2 had a higher estimate than Y1 (15%).

6. (2 + 4) Y2 is significant, significance of Y1 does not matter

212 times Y1 had a higher estimate than Y2 (19%)
162+763 = 925 times Y2 had a higher estimate than Y1 (81%).

When we condition on significance for Y1, Y1 more often produces the higher estimate. When we condition on Y2, Y2 more often produces the higher estimate. This has nothing to do with the reliability of the measures because they have the same reliability. The difference is illusory because selection for significance in the simulation produces biased results.

Another interesting observation

While working on this issue with Rickard, we also discovered an important distinction between standardized and unstandardized effect sizes. Loken and Gelman simulated standardized effect sizes by correlating two standardized variables. Random measurement error lowers standardized effect sizes because the unstandardized effect sizes are divided by the standard deviation, and random measurement error adds to the naturally occurring variance in a variable. However, unstandardized effect sizes like the covariance or the mean difference between two groups are not attenuated by random measurement error. For this reason, it would be wrong to claim that unreliability of a measure attenuated unstandardized effect sizes or that they should be corrected for unreliability of a measure.

Random measurement error will, however, increase the standard error and make it more difficult to get a significant result. As a result, selection for significance will inflate the unstandardized effect size more for an unreliable measure. The following simulation demonstrates this point. To keep things similar, I kept the effect size of b = .15, but used the unstandardized regression coefficient as the effect size.
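A minimal sketch of this setup (my own reconstruction; I assume the random error is added to the outcome variable, so the unstandardized slope stays centered on b = .15 while its standard error grows):

set.seed(2023)
n.sim <- 10000; N <- 50; b <- .15
est.perfect <- est.noisy <- numeric(n.sim)
for (i in 1:n.sim) {
  x   <- rnorm(N)
  y   <- b*x + rnorm(N)
  y.e <- y + rnorm(N, 0, .5)                  # outcome measured with random error
  est.perfect[i] <- coef(lm(y ~ x))[2]
  est.noisy[i]   <- coef(lm(y.e ~ x))[2]
}
c(mean(est.perfect), mean(est.noisy))   # both ~ .15: no attenuation of the unstandardized slope
c(sd(est.perfect), sd(est.noisy))       # the noisy measure has the wider sampling distribution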

First, I show the distribution of the effect size estimates. Both distributions are centered over the simulated effect size of b = .15. However, the measure with random error produces a wider distribution which often results in more extreme effect size estimates.

1. Neither measure produced a significant result

3170 times the perfect measure had a higher estimate than the unreliable measure (41%)
4587 times the unreliable measure had a higher estimate than the perfect measure (59%)

This scenario shows the surprising reversal that the less reliable measure produces the stronger absolute effect size estimate more often, and in more than 50% of cases, which is what Loken and Gelman wanted to demonstrate. However, their simulation used standardized effect size estimates, which do not produce this reversal; only unstandardized effect size estimates show it.

2. Both measures produced a significant result

73 times the perfect measure had a higher estimate than the unreliable measure (10%)
659 times the unreliable measure had a higher estimate than the perfect measure (90%)

When both effect size estimates are significant, the one with the unreliable measure is much more likely to show a stronger effect size estimate. The reason is simple. Sampling error is larger and it takes a stronger effect size estimate to produce the same t-value that produces a significant result.

3. The reliable measure is significant and the unreliable measure is not significant

790 times the perfect measure had a higher estimate than the unreliable measure (73%)
299 times the unreliable measure had a higher estimate than the perfect measure (27%)

With standardized effect sizes, selection for significance always favored the conditioning variable 100% of the time. With unstandardized coefficients, the non-conditioned unreliable measure is still higher 27% of the time. However, the conditioning effect is notable because conditioning on significance for the perfect measure reverses the usual pattern that the unreliable measure produces stronger effect size estimates.

4. The unreliable measure is significant and the reliable measure is not significant

0 times the perfect measure had a higher estimate than the unreliable measure (0%)
422 times the unreliable measure had a higher estimate than the perfect measure (100%)

Conditioning on significance on the unreliable measure produces a 100% rate of stronger effect sizes because effect sizes are already biased in favor of the unreliable measure.

The interesting observation is that Loken and Gelman were right that effect size estimates can be inflated with unreliable measures, but they failed to demonstrate this reversal because they used standardized effect size estimates. Inflation occurs with unstandardized effect sizes. Moreover, it does not require selection for significance. Even non-significant effect size estimates tend to be larger because there is more sampling error.

The Fallacy of Interpreting Point Estimates of Effect Sizes

Loken and Gelman’s article is framed as a warning to practitioners to avoid misinterpretation of effect size estimates. They are concerned that researchers “assume that the observed effect size would have been even larger if not for the burden of measurement error,” that “when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger,” and that “our concern is that researchers are sometimes tempted to use the ‘iron law’ reasoning to defend or justify surprisingly large statistically significant effects from small studies.”

They missed an opportunity to point out that there is a simple solution to avoid misinterpretation of effect size estimates that has been recommended by psychological methodologists since the 1990s (I highly recommend Cohen, 1994; also Cumming, 2013). The solution is to consider the uncertainty in effect size estimates by means of confidence intervals. Confidence intervals provide a simple solution to many fallacies of traditional null-hypothesis tests, p < .05. A confidence interval can be used to test not only the nil-hypothesis but also hypotheses about specific effect sizes. A confidence interval may exclude zero, but it might include other values of theoretical interest, especially if sampling error is large. To claim an effect size larger than the true population effect size of b = .15, the confidence interval has to exclude a value of b = .15. Otherwise, it is a fallacy to claim that the effect size in the population is larger than .15.

As demonstrated before, random measurement error inflates unstandardized effect size estimates, but it also increases sampling error, resulting in wider confidence intervals. Thus, it is an important question whether unreliable measures really allow researchers to claim effect sizes that are significantly larger than the simulated true effect size of b = .15.

A final simulation examined how often the 95%CI excluded the true value of b = .15 for the perfect measure and the unreliable measure. To produce more precise estimates, I ran 100,000 simulations.
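A minimal sketch of this check (my own reconstruction from the description; the script on OSF may differ in details, and I use 10,000 runs to keep it quick):

set.seed(99)
n.sim <- 10000; N <- 50; b <- .15
miss <- matrix(0, 2, 2, dimnames = list(c("no error", "with error"), c("under", "over")))
for (i in 1:n.sim) {
  x <- rnorm(N)
  y <- b*x + rnorm(N)
  for (m in 1:2) {
    y.obs <- if (m == 1) y else y + rnorm(N, 0, .5)           # add random error in the second condition
    fit <- summary(lm(y.obs ~ x))$coefficients
    ci  <- fit[2,1] + c(-1, 1) * qt(.975, N - 2) * fit[2,2]
    if (ci[2] < b) miss[m, "under"] <- miss[m, "under"] + 1   # CI entirely below the true slope
    if (ci[1] > b) miss[m, "over"]  <- miss[m, "over"]  + 1   # CI entirely above the true slope
  }
}
miss / n.sim    # roughly 2.5% on each side for both measures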

1. Measure without error

2809 Significant underestimations (2.8%)
2732 Significant overestimations (2.7%)
5541 Errors

2. Measure with error

2761 Significant underestimations (2.8%)
2750 Significant overestimations (2.7%)
5511 Errors

The results should not come as a surprise. 95% confidence intervals are designed to have a 5% error rate and to split these errors evenly between the two sides. The addition of random measurement error does not affect this property of confidence intervals. Most importantly, there is no reversal in the probability of overestimation. The measure without error produces confidence intervals that overestimate the true effect size as often as the measure with error. However, the effect of random measurement error is noticeable in the amount of bias.

For the measure without error, the lower bound of the 95%CI ranges from .15 to .55, M = .21.
For the measure with error, the lower bound of the 95%CI ranges from .15 to .65, M = .22.

These differences are small and have no practical consequences. Thus, the use of confidence intervals provides a simple solution to false interpretation of effect size estimates. Although selection for significance in small samples inflates the point estimate of effect sizes, the confidence interval often includes the smaller true effect size.

The Replication Crisis

Loken and Gelman’s article aimed to relate random measurement error to the replication crisis. They write “If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong. Measurement error and selection bias thus can combine to exacerbate the replication crisis.”

The previous results show that this statement ignores the key influence of random measurement error on statistical significance. Random measurement error increases the standard deviation, and the standard deviation is in the denominator of the t-statistic. Thus, t-values are biased downwards, and it is harder to get statistical significance with unreliable measures. The key point is that studies in small samples with unreliable measures have low statistical power. It is therefore misleading to claim that random measurement error inflates t-values. It actually attenuates t-values. Selection for significance inflates point estimates of effect sizes, but these are meaningless, and the confidence interval around these estimates often includes the true population parameter.

More important, it is not clear what Loken and Gelman mean by the replication crisis. Let's assume a researcher conducts a study with N = 50, a measure with 50% reliability, and an effect size of r = .15. Thanks to luck (the winner's curse), they get a statistically significant result with an effect size estimate of r = .6 and a 95% confidence interval ranging from .42 to .86. They get a major publication out of this finding. Another researcher conducts a replication study and gets a non-significant result with r = .11 and a 95%CI ranging from -.11 to .33. This outcome is often called a replication failure because significant results are considered successes and non-significant results are considered failures. However, findings like this do not signal a crisis. Replication failures are normal and to be expected because significance testing allows for errors and therefore for replication failures.

The replication crisis in psychology is caused by the selective omission of replication failures from the literature or even from a set of studies within a single article (Schimmack, 2012). The problem is not that a single significant result is followed by a non-significant result. The problem is that non-significant results are not published. The success rate in psychology journals is over 90% (Sterling, 1959; Sterling et al., 1995). Thus, the replication crisis refers to the fact that psychologists never published failed replication studies. When publication of replication failures became more acceptable in the past decade, it became apparent that selection bias had inflated the success rate. Given the typical power of studies in psychology, replication failures are to be expected. This has nothing to do with random measurement error. The main contribution of random measurement error is to reduce power and increase the percentage of studies with non-significant results.

Forgetting about False Negatives

Over the past decade, a few influential articles have created a fear of false positive results (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). The real problem, however, is that selection for significance makes it impossible to know whether an effect exists or not. Whereas real effects in reasonably powered studies would often produce significant results, false positives would be followed by many replication failures. Without credible replication studies that are published independent of the outcome, statistical significance has no meaning. This led to concerns that over 50% of published results could be false positives. However, empirical studies of the false positive risk often find much lower plausible values (Bartos & Schimmack, 2023). Arguably, the bigger problem in studies with small samples and unreliable measures is that these studies will often produce a false negative result. Take Loken and Gelman's simulation as an example. The most likely outcome of a study that looks for a small effect size of r = .15 (or r = .12 with a noisy measure) in a small sample is a non-significant result. This is a false negative result because we know a priori that there is a non-zero correlation between the two variables with a theoretically important effect size. For example, the correlation between income and a noisy measure of happiness is around r = .15. Looking for this small relationship in a small sample will often suggest that money does not buy happiness, while large samples consistently show this small relationship. One might even argue that the few studies that produce a significant result with an inflated point estimate but a confidence interval that includes r = .15 avoid a false negative result without providing inflated estimates of the effect size, given the wide range of plausible values. Only the 2.5% of studies that produce confidence intervals that do not include r = .15 are misleading, but a single replication study is likely to correct this inflated estimate.
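A rough power calculation illustrates why false negatives dominate in this scenario. The sketch below uses the Fisher-z approximation and the effect sizes mentioned in the text; the sample sizes are assumptions for illustration.

```r
# Approximate power to detect a correlation of .15 (or an attenuated .12)
# using the Fisher-z approximation for the sampling distribution of r.
power_r <- function(r, n, alpha = .05) {
  z <- atanh(r) * sqrt(n - 3)       # expected test statistic on the Fisher-z scale
  pnorm(z - qnorm(1 - alpha / 2))   # probability of a significant result (dominant tail)
}
power_r(.15, 50)     # ~ .18: most small studies end as false negatives
power_r(.12, 50)     # ~ .13: even lower with an unreliable measure
power_r(.15, 1000)   # ~ .998: large samples reliably detect the effect
```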

This line of reasoning does not justify selective publishing of significant results. Rather, it draws attention back to the concerns of methodologists in the 1990s that low power is wasteful because many studies produce inconclusive results. To address this problem, researchers need to think carefully about the plausible range of effect sizes and plan studies that can produce significant results for real effects. Researchers also need to be able and willing to publish results when the results are not significant. No statistical method can produce valid results when the data are biased. In comparison, the problem of inflated point estimates of effect sizes in a single small sample is trivial. Confidence intervals make it clear that the true effect size can be much smaller, and rare outcomes of extreme inflation will be corrected quickly by failed replication studies.

In short, as much as Gelman likes to think that there is something fundamentally wrong with the statistical methods that psychologists use, the real problems are practical. Resource constraints often limit researchers' ability to collect large samples, and the preference for novel significant results over replication failures of old findings gives researchers an incentive to selectively report their “successes.” To do so, they may even use multiple unreliable measures in order to capitalize on chance. The best way to address these problems is to establish a clear code of research practices and to hold researchers accountable if they violate this code. Editors should also enforce the already existing guidelines to report meaningful effect sizes with confidence intervals. In this utopian world, researchers would benefit from using reliable measures because they increase power and the probability of publishing a true positive result.

Abandon Gelman

I pointed out the mistake in Loken and Gelman's article on Gelman's blog. He is unable to see that his claim of a reversal in effect size estimates due to random measurement error is a mistake. Instead, he tries to explain my vehement insistence as a personality flaw.

Instead, his overconfidence makes it impossible for him to consider the possibility that he made a mistake. This arrogant response to criticism is by no means unique. I have seen it many times from Greenwald, Bargh, Baumeister, and others. However, it is ironic when meta-scientists like Ioannidis, Gelman, or Simonsohn, who are known for harsh criticism of others, are unable to admit when they made a mistake. A notable exception is Kahneman, who responded constructively to my criticism of his book “Thinking, Fast and Slow.”

Gelman has criticized psychologists without offering any advice on how they could improve their credibility. His main advice is to “abandon statistical significance” without any guidelines for how we should distinguish real findings from false positives or avoid interpreting inflated effect size estimates. Here I showed how the use of confidence intervals provides a simple solution to avoid many of the problems that Gelman likes to point out. To learn about statistics, I suggest reading less Gelman and more Cohen.

Cohen's work shaped my understanding of methodology and statistics, and he actually cared about psychology and tried to improve it. Without him, I might not have learned about statistical power or contemplated the silly practice of refuting the nil-hypothesis. I also think his work was influential in changing the way results are reported in psychology journals, which enabled me to detect biases and estimate false positive rates in our field. He also tried to tell psychologists about the importance of replication studies.

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

If psychologists had listened to Cohen, they could have avoided the replication crisis in the 2010s. However, his work can still help psychologists to learn from the replication crisis and to build a credible science that is built on true positive results and avoids false negative results. The lessons are simple.
1. Plan studies with a reasonable chance to get a significant result. Try to maximize power by thinking about all possible ways to reduce sampling error, including using more reliable measures.
2. Publish studies independent of outcome, especially replication failures that can correct false positives.
3. Focus on effect sizes, but ignore the point estimates. Instead, use confidence intervals to avoid interpreting effect size estimates that are inflated by selection for significance.

Gino-Colada – 2: The line between fraud and other QRPs

“It wasn’t fraud. It was other QRPs”

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.” (Gino, 2023)

Experimental social scientists have considered themselves superior to other social scientists because experiments provide strong evidence about causality that correlational studies cannot provide. Their experimental studies often produced surprising results, but because these results were obtained with the experimental method and published in respected, peer-reviewed journals, they seemed to provide profound novel insights into human behavior.

In his popular book “Thinking, Fast and Slow,” Nobel Laureate Daniel Kahneman told readers “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.” He probably regrets writing these words, because he no longer believes these findings (Kahneman, 2017).

What happened between 2011 and 2017? Social scientists started to distrust their own findings (or at least those of their colleagues) because it became clear that they did not use the scientific method properly. The key problem is that they only published results when they provided evidence for their theories, hypotheses, and predictions, but did not report when their studies did not work. As one prominent experimental social psychologist put it:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister)

Researchers not only selectively published studies with favorable results. They also used a variety of statistical tricks to increase the chances of obtaining evidence for their claims. John et al. (2012) called these tricks questionable research practices (QRPs) and compared them to doping in sport. The difference is that doping is banned in sports, but the use of many QRPs is not banned or punished by social scientific organizations.

The use of QRPs explains why scientific journals that report the results of experiments with human participants report over 90% of the time that the results confirmed researchers' predictions. For statistical reasons, this high success rate is implausible even if all predictions were true (Sterling et al., 1995). The selective publishing of studies that worked renders the evidence meaningless (Sterling, 1959). Even clearly false hypotheses like “learning after an exam can increase exam performance” can receive empirical support when QRPs are being used (Bem, 2011). The use of QRPs also explains why results of experimental social scientists often fail to replicate (Schimmack, 2020).

John et al. (2012) used the term questionable research practices broadly. However, it is necessary to distinguish three types of QRPs that have different implications for the credibility of results.

One QRP is the selective publishing of significant results. In this case, the results are what they are and the data are credible. The problem is mainly that these results are likely to be inflated by selection bias. This bias would disappear if all studies were published and the results were averaged. However, if non-significant results are not published, the average remains inflated.

The second type of QRP consists of various statistical tricks that can be used to “massage” the data to produce a more favorable result. These practices are now often called p-hacking. Presumably, these practices are used mainly after an initial analysis did not produce the desired result but showed a trend in the expected direction. P-hacking alters the data, and it is no longer clear how strong the actual evidence was. While lay people may consider these practices fraud or a type of doping, professional organizations tolerate them, and even evidence of their use would not lead to disciplinary actions against a researcher.

The third QRP is fraud. Like p-hacking, fraud implies data manipulation with the goal of getting a desirable result, but the difference is … well, it is hard to say what the difference to p-hacking is, except that fraud is not tolerated by professional organizations. Outright fraud in which a whole data set is made up (as with some datasets by the disgraced Diederik Stapel) is a clear case of fraud. However, it is harder to draw the line between fraud and p-hacking when a researcher selectively deletes outliers from two groups to get significance (p-hacking) or switches extreme cases from one group to another (fraud) (GinoColada1). In both cases, the data are meaningless, but only fraud leads to reputation damage and public outrage, while p-hackers can continue to present their claims as scientific truths.

The distinction between different types of QRPs is important to understand Gino's latest defense against accusations that she committed fraud, which have been widely publicized in newspaper articles and a long article in the New Yorker. In her response, she cites from Harvard's investigative report to make the point that she is not a data fabricator.

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.”

The argument is clear. Why would I have so many failed studies if I could just make up fake data that support my claims? Indeed, Stapel claims that he started faking studies outright because p-hacking is a lot of work and making up data is the most efficient QRP (“Why not just make the data up. Same results with less effort”). Gino makes it clear that she did not just fabricate data because she clearly collected a lot of data and has many failed studies that were not p-hacked or manipulated to get significance. She only did what everybody else did: hiding the studies that did not work, and lots of them.

Whether she sometimes engaged in practices that cross the line from p-hacking to fraud is currently being investigated and not my concern. What I find interesting is the frank admission in her defense that 80% of her studies failed to provide evidence for her hypotheses. However, if somebody were to look up her published work, they would see mainly the results of studies that worked. And she has no problem telling us that these published results are just the tip of an iceberg of studies, many more of which did not work. She thinks this is totally ok, because she has been trained / brainwashed to believe that this is how science works. Significance testing is like a gold pan.

Get a lot of datasets, look for p < .05, keep the significant ones (gold), and throw away the rest. The more studies you run, the more gold you find, and the richer you are. Unfortunately for her and the other experimental social scientists who think every p-value below .05 is a discovery, this is not how science works, as pointed out by Sterling (1959) many, many years before, but nobody wants to listen to people who tell you that something is hard work.

Let's for the moment assume that Gino really runs 100 studies to get 20 significant results (80% do not work, i.e., do not produce a significant result). Using a formula from Soric (1989), we can compute the risk that one of her 20 significant results is a false positive result (i.e., the significant result is a fluke without a real effect), even if she did not use p-hacking or other QRPs, which would further increase the risk of false claims.

FDR = ((1/.20) – 1)*(.05/.95) = 21%
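For readers who want to apply this bound to other numbers, here is a minimal R helper (the 20% discovery rate is taken from Gino's own statement above):

```r
# Soric's (1989) upper bound on the false discovery rate, given a discovery rate
# (share of studies that produce p < alpha) and the significance criterion alpha.
soric_fdr <- function(discovery_rate, alpha = .05) {
  (1 / discovery_rate - 1) * (alpha / (1 - alpha))
}
soric_fdr(.20)  # 0.21: up to 21% of the significant results could be false positives
```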

Based on Gino's own claim that 80% of her studies fail to produce significant results, we can infer that up to 21% of her published significant results could be false positive results. Moreover, selective publishing also inflates effect sizes, and even if a result is not a false positive, the effect size may be in the same direction but too small to be practically important. In other words, Gino's empirical findings are meaningless without independent replications, even if she didn't use p-hacking or manipulate any data. The question whether she committed fraud is only relevant for her personal future. It has no relevance for the credibility of her published findings or those of others in her field like Dan Air-Heady. The whole field is a train wreck. In 2012, Kahneman asked researchers in the field to clean up their act, but nobody listened, and Kahneman has lost faith in their findings. Maybe it is time to stop nudging social scientists with badges and use some operant conditioning to shape their behavior. But until this happens, if it ever happens, we can just ignore this pseudo-science, no matter what happens in the Gino versus Harvard/DataColada case. As interesting as scandals are, they have no practical importance for the evaluation of the work that has been produced by experimental social scientists.

P.S. Of course, there are also researchers who have made real contributions, but unless we find ways to distinguish between credible work that was obtained without QRPs and incredible findings that were obtained with scientific doping, we don’t know which results we can trust. Maybe we need a doping test for scientists to find out.

The Gino-Colada Affair – 1

Link to Gino Colada Affair – 2

Link to Gino-Colada Affair – 3

There is no doubt that social psychology and its applied fields like behavioral economics and consumer psychology have a credibility problem. Many of the findings cannot be replicated because they were obtained with questionable research practices or p-hacking. QRPs are statistical tricks that help researchers to obtain p-values below the necessary threshold to claim a discovery (p < .05). To be clear, although lay people and undergraduate students consider these practices to be deceptive, fraudulent, and unscientific, they are not considered fraudulent by researchers, professional organizations, funding agencies, or universities. Demonstrating that a researcher used QRPs to obtain significant results is easy-peasy and undermines the credibility of their work, but they can keep their jobs because it is not (yet) illegal to use these practices.

The Gino-Harvard scandal is different because the DataColada team claimed that they found “four studies for which we had accumulated the strongest evidence of fraud” and that they “believe that many more Gino-authored papers contain fake data.” To lay people, it can be hard to understand the difference between allowed QRPs and forbidden fraud or data manipulation. An example of a QRP could be selectively removing extreme values so that the difference between two groups becomes larger (e.g., removing extremely low depression scores from a control group to show a bigger treatment effect). Outright data manipulation would be switching participants with low scores from the control group to the treatment group and vice versa.

DataColada used features of the Excel spreadsheet that contained the data to claim that the data were manually manipulated.

The focus is on six rows that have a strong influence on the results for all three dependent variables that were reported in the article, namely cheated or not, overreporting of performance, and deductions.

Based on the datasheet, participants in the sign-at-the-top condition (1) in rows 67, 68, and 69 did not cheat, therefore also did not overreport performance, and had very low deductions, an independent measure of cheating. In contrast, participants in rows 70, 71, and 72 all cheated, had moderate amounts of overreporting, and very high deductions.

Yadi, yadi, yada, yesterday Gino posted a blog post that responded to these accusations. Personally, the most interesting rebuttal was the claim that there was no need to switch rows because the study results hold even without the flagged rows.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

This argument makes sense to me because fraud appears to be the last resort for researchers who are eager to present a statistically significant result. After all, nobody claims that no data were collected at all, as in some studies by Diederik Stapel, who committed blatant fraud around the time the article in question was published, when the use of questionable research practices was rampant. When researchers conduct an actual study, they probably hope to get the desired result without QRPs or fraud. As significance requires luck, they may just hope to get lucky. When this does not work, they can use a few QRPs. When this does not work, they can just shelve the study and try again. All of this would be perfectly legal by current standards of research ethics. However, if the results are close and it is not easy to collect more data in the hope of better results, it may be tempting to change a few condition labels to reach p < .05. And the accusation here (there are other studies) is that only 6 (or a couple more) rows were switched to get significance. However, Gino claims that the results were already significant, and I agree that it makes no sense for somebody to tamper with data if the p-value is already below .05.

However, Gino did not present evidence that the results hold without the contested cases. So, I downloaded the data and took a look.

First, I was able to reproduce the published result of an ANOVA with the three conditions as a categorical predictor variable and deductions as the outcome variable.

In addition, the original article reported that the differences between the experimental “signature-on-top” and each of the two control conditions (“signature-on-bottom”, “no signature”) were significant. I also confirmed these results.

Now I repeated the analysis without rows 67 to 72. Without the six contested cases, the results are no longer statistically significant, F(2, 92) = 2.96, p = .057.

Interestingly, the comparisons of the experimental group with the two control groups were statistically significant.

Combining the two control groups and comparing it to the experimental group and presenting the results as a planned contrast would also have produced a significant result.

However, these results do not support Gino’s implication that the same analysis that was reported in the article would have produced a statistically significant result, p < .05, without the six contested cases. Moreover, the accusation is that she switched rows with low values to the experimental condition and rows with high values to the control condition. To simulate this scenario, I recoded the contested rows 67-69 as signature-at-the-bottom and 70-72 as signature-at-the-top and repeated the analysis. In this case, there was no evidence that the group means differed from each other, F(2,98) = 0.45, p = .637.
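The analyses described above follow a simple pattern; a sketch in R is shown below. The file and column names are hypothetical placeholders (the posted spreadsheet uses different labels), so this is an outline of the steps rather than a reproducible script.

```r
# Outline of the reanalysis: full sample, without the six contested rows,
# and with the alleged switch of rows 67-69 and 70-72 reversed.
dat <- read.csv("study1_data.csv")                       # hypothetical file name

summary(aov(deductions ~ condition, data = dat))         # reproduces the published ANOVA

dat_trim <- dat[-(67:72), ]                              # drop the six contested rows
summary(aov(deductions ~ condition, data = dat_trim))    # F(2, 92) = 2.96, p = .057 in the text

dat_rev <- dat                                           # reverse the alleged switch
dat_rev$condition[67:69] <- "signature-at-the-bottom"
dat_rev$condition[70:72] <- "signature-at-the-top"
summary(aov(deductions ~ condition, data = dat_rev))     # F(2, 98) = 0.45, p = .637 in the text
```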

Conclusion

Experimental social psychology has a credibility crisis because researchers were (and still are) allowed to use many statistical tricks to get significant results or to hide studies that didn't produce the desired results. The Gino scandal is only remarkable because outright manipulation of data is the only ethics violation that has personal consequences for researchers when it can be proven. Lack of evidence that fraud was committed, or lack of fraud, does not imply that results are credible. For example, the results in Study 2 are meaningless even without fraud because the null-hypothesis was rejected with a confidence interval that included a value close to zero as a plausible value. While the article claims to show evidence of mediation, the published data alone show that there is no empirical evidence for this claim even if p < .05 was obtained without p-hacking or fraud. Misleading claims based on weak data, however, do not violate any ethics guidelines and are a common, if not essential, part of a game called social psychology.

This blog post only examined one minor question. Gino claimed that she did not have to manipulate data because the results were already significant.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

My results suggest that this claim lacks empirical support. A key result was only significant with the rows of data that have been contested. Of course, this finding does not warrant the conclusion that the data were tampered with to get statistical significance. We have to wait to get the answer to this 25-million-dollar question.

Replicability Report 2023: Aggressive Behavior

This report was created in collaboration with Anas Alsayed Hasan.
Citation: Alsayed Hasan, A. & Schimmack, U. (2023). Replicability Report 2023: Aggressive Behavior. Replicationindex.com

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers' theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Aggressive Behavior

Aggressive Behavior is the official journal of the International Society for Research on Aggression.  Founded in 1974, this journal provides a multidisciplinary view of aggressive behavior and its physiological and behavioral consequences on subjects.  Published articles use theories and methods from psychology, psychiatry, anthropology, ethology, and more. So far, Aggressive Behavior has published close to 2,000 articles. Nowadays, it publishes about 60 articles a year in 6 annual issues. The journal has been cited by close to 5000 articles in the literature and has an H-Index of 104 (i.e., 104 articles have received 104 or more citations). The journal also has a moderate impact factor of 3. This journal is run by an editorial board containing over 40 members. The Editor-In-Chief is Craig Anderson. The associate editors are Christopher Barlett, Thomas Denson, Ann Farrell, Jane Ireland, and Barbara Krahé.

Report

Replication reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).
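The conversion from reported test statistics to absolute z-scores is straightforward. Below is a minimal sketch in R; the test statistics in the example are made up for illustration and are not extracted from this journal.

```r
# Convert reported t- and F-statistics to (two-sided) p-values and then to
# absolute z-scores, the input for a z-curve analysis.
t_vals <- c(2.10, 1.45, 3.60); t_df <- c(48, 30, 120)        # made-up t-tests
F_vals <- c(5.2, 0.8); F_df1 <- c(1, 2); F_df2 <- c(60, 90)  # made-up F-tests

p_t <- 2 * pt(abs(t_vals), t_df, lower.tail = FALSE)  # two-sided p for t-tests
p_F <- pf(F_vals, F_df1, F_df2, lower.tail = FALSE)   # p for F-tests
p   <- c(p_t, p_F)

z   <- qnorm(1 - p / 2)   # absolute z-score with the same two-sided p-value
odr <- mean(p < .05)      # observed discovery rate: share of significant results
```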

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 71%, the expected discovery rate is 45%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate, and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result. An EDR of 45% implies that no more than 7% of the significant results are false positives. The 95%CI puts the upper limit for false positive results at 12%. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be practically significant. Readers of original articles need to focus on confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but may not provide enough information to determine whether a statistically significant result is theoretically or practically important.

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcomes of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published. The ERR of 69% suggests that the majority of results published in Aggressive Behavior are replicable, but the EDR allows for a replication rate as low as 45%. Thus, replicability is estimated to range from 45% to 69%. There are currently no large replication studies in this field, making it difficult to compare these estimates to outcomes of empirical replication studies. However, the ERR for the OSC reproducibility project that produced 36% successful actual replications was around 60%, suggesting that roughly 50% of actual replication studies of articles in this journal would be significant. It is unlikely that the success rate would be lower than the EDR of 45%. Given the relatively low risk of type-I errors, most of these replication failures are likely to occur because studies in this journal tend to be underpowered. Thus, replication studies should use larger samples.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The ODR, EDR, and ERR were regressed on time and time-squared to allow for non-linear relationships. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.52 percentage points per year (SE = .22). The EDR showed no significant trends, p > .30. There were no linear or quadratic time trends for the ERR, p > .10. Figure 2 shows the ODR and EDR to examine selection bias.

Figure 2

The decrease in the ODR implies that selection bias is decreasing over time. In recent years, the confidence intervals for the ODR and EDR overlap, indicating that there are no longer statistically reliable differences. However, this does not imply that all results are being reported. The main reason for the overlap is the low certainty about the annual EDR. Given the lack of a significant time trend for the EDR, the average EDR across all years implies that there is still selection bias. Finally, automatically extracted test statistics make it impossible to say whether researchers are reporting more focal or non-focal results as non-significant. To investigate this question, it is necessary to hand-code focal tests (see Limitations section).
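The time-trend analyses follow a simple template: annual estimates are regressed on time and time-squared. The sketch below illustrates this in R with placeholder values; the actual annual ODR estimates for this journal are not reproduced here.

```r
# Regress annual estimates (e.g., the ODR in percentage points) on time and
# time-squared to test for linear and quadratic trends.
set.seed(3)
years <- 2000:2022
odr   <- 75 - 0.5 * (years - 2000) + rnorm(length(years), sd = 2)  # placeholder data
yr_c  <- years - mean(years)                # center year to reduce collinearity
summary(lm(odr ~ yr_c + I(yr_c^2)))         # linear and quadratic time trends
```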

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

The FDR is based on the EDR that also showed no time trends. Thus, the estimates for all years can be used to obtain more precise estimates than the annual ones. Based on the results in Figure 1, the expected failure rate is 31% and the FDR is 7%. This suggests that replication failures are more likely to be false negatives due to modest power rather than false positive results in original studies. To avoid false negative results in replication studies, these studies should use larger samples.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary threshold for making decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide which alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Using alpha = .01 lowers the discovery rate by about 15 percentage points. The stringent criterion of alpha = .001 lowers it by another 10 percentage points to around 40% discoveries. This would mean that many published results that were used to make claims no longer have empirical support.

Figure 5 shows the effects of alpha on the false positive risk. Even alpha = .01 is sufficient to ensure a false positive risk of 5% or less. Thus, alpha = .01 seems a reasonable criterion to avoid too many false positive results without discarding too many true positive results. Authors may want to increase statistical power to increase their chances of obtaining a p-value below .01 when their hypotheses are true to produce credible evidence for their hypotheses.

Figure 5

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

Hand-coding of other journals shows that publications of non-significant focal hypothesis tests are still rare. As a result, the ODR for focal hypothesis tests in Aggressive Behavior is likely to be higher and selection bias larger than the present results suggest. Hand-coding of a representative sample of articles in this journal is needed.

Conclusion

The replicability report for Aggressive Behavior shows clear evidence of selection bias, although there is a trend suggesting that selection bias may be decreasing in recent years. The results also suggest that replicability is in a range from 40% to 70%. This replication rate does not deserve to be called a crisis, but it does suggest that many studies are underpowered and require luck to get a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Finally, time trend analyses show no important changes in response to the open science movement. An important goal is to reduce the selective publishing of studies that worked (p < .05) and the hiding of studies that did not work (p > .05). Preregistration or registered reports can help to address this problem. Given concerns that most published results in psychology are false positives, the present results are reassuring and suggest that most results with p-values below .01 are true positive results.

Replicability Report 2023: Cognition & Emotion

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers' theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Cognition & Emotion

The study of emotions largely disappeared from psychology after the Second World War and during the reign of behaviorism, or was limited to facial expressions. The study of emotional experiences reemerged in the 1980s. Cognition & Emotion was established in 1987 as an outlet for this research.

So far, the journal has published close to 3,000 articles. The average number of citations per article is 46. The journal has an H-Index of 155 (i.e., 155 articles have 155 or more citations). These statistics show that Cognition & Emotion is an influential journal for research on emotions.

Nine articles have more than 1,000 citations. The most highly cited article is a theoretical article by Paul Ekman arguing for basic emotions (Ekman, 1992).

Report

Replication reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 68%, the expected discovery rate is 34%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 34% implies that up to 10% of the significant results could be false positives. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be practically significant. Readers of statistical results in this journal need to examine the range of plausible effect sizes (i.e., confidence intervals) to see whether results have practical significance. Unfortunately, these estimates are inflated by selection bias, especially when the evidence is weak and the confidence interval already includes effect sizes close to zero.

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcomes of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 70% suggests that most results published in this journal are replicable, but the EDR allows for a replication rate as low as 34%. Thus, replicability is estimated to range from 34% to 70%. There is no representative sample of replication studies from this journal to compare this estimate with the outcome of actual replication studies. However, a journal with lower ERR and EDR estimates, Psychological Science, had an actual replication rate of 41%. Thus, it is plausible to predict an actual replication rate above 41% for Cognition & Emotion.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. Confidence intervals were created by regressing the estimates on time and time-squared to examine non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.79 percentage points per year (SE = .10). The EDR showed no significant linear, b = .23, SE = .41, or non-linear, b = -.10, SE = .07, trends.

Figure 2

The decreasing ODR implies that selection bias is decreasing, but it is not clear whether this trend also applies to focal hypothesis tests (see Limitations section). The lack of an increase in the EDR implies that researchers continue to conduct studies with low statistical power and that the non-significant results often remain unpublished. To improve the credibility of this journal, editors could focus on power rather than statistical significance in the review process.

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

There was a significant linear trend for the ERR, b = .24, SE = .11, indicating an increase in the ERR. The increase in the ERR implies fewer replication failures in later years. However, because the FDR is not decreasing, a larger portion of these replication failures could be false positives.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary threshold for making decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide which alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even if the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Lowering alpha to .01 reduces the observed discovery rate by 20 to 30 percentage points. It is also interesting that, over time, the ODR decreases more for alpha = .05 than for the other alpha levels. This suggests that changes in the ODR are in part caused by fewer p-values between .05 and .01. These significant results are more likely to result from unscientific methods and often do not replicate.

Figure 5 shows the effects of alpha on the false positive risk. Lowering alpha to .01 reduces the false positive risk to less than 5%. Thus, readers can use this criterion to reduce the false positive risk to an acceptable level.

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Cognition & Emotion a small set of articles were hand-coded as part of a study on the effects of open science reforms on the credibility of psychological science. Figure 6 shows the z-curve plot and results for 117 focal hypothesis tests.

Figure 6

The main difference between manually and automatically coded data is a much higher ODR (95%) for manually coded data. This finding shows that selection bias for focal hypothesis tests is much more severe than the automatically extracted data suggest.

The point estimate of the EDR, 37%, is similar to the EDR for automatically extracted data, 34%. However, due to the small sample size, the 95%CI for manually coded data is wide and it is impossible to draw firm conclusions about the EDR, but results from other journals and large samples also show similar results.

The ERR estimates are also similar and the 95%CI for hand-coded data suggests that the majority of results are replicable.

Overall, these results suggest that automatically extracted results are informative, but underestimate selection bias for focal hypothesis tests.

Conclusion

The replicability report for Cognition & Emotion shows clear evidence of selection bias, but also a relatively low risk of false positive results that can be further reduced by using alpha = .01 as a criterion to reject the null-hypothesis. There are no notable changes in credibility over time. Editors of this journal could improve credibility by reducing selection bias. The best way to do so would be to evaluate the strength of evidence rather than using alpha = .05 as a dichotomous criterion for acceptance. Moreover, the journal needs to publish more articles that fail to support theoretical predictions. The best way to do so is to accept articles that preregistered predictions and failed to confirm them or to invite registered reports that are published independent of the outcome of a study. Readers can set their own level of alpha depending on their appetite for risk, but alpha = .01 is a reasonable criterion because it maintains a false positive risk below 5% and eliminates p-values between .01 and .05 that are often obtained with unscientific practices and fail to replicate.

Link to replicability reports for other journals.

Replicability Report 2023: Acta Psychologica

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Acta Psychologica

Acta Psychologica is an old psychological journal that was founded in 1936. The journal publishes articles from various areas of psychology, but cognitive psychological research seems to be the most common area.

So far, Acta Psychologica has published close to 6,000 articles. Nowadays, it publishes about 150 articles a year in 10 annual issues. Over the past 30 years, articles have an average citation rate of 24.48 citations, and the journal has an H-Index of 116 (i.e., 116 articles have received 116 or more citations). The journal has an impact factor of 2, which is typical of most empirical psychology journals.

So far, the journal has published 4 articles with more than 1,000 citations, but all of these articles were published in the 1960s and 1970s. The most highly cited article in the 2000s examined the influence of response categories on the psychometric properties of survey items (Preston & Colman, 2000; 947 citations).

Given the multidisciplinary nature of the journal, the journal has a team of editors. The current editors are Mohamed Alansari, Martha Arterberry, Colin Cooper, Martin Dempster, Tobias Greitemeyer, Matthieu Guitton, and Nhung T Hendy.

Report

Replication reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.
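A back-of-the-envelope calculation illustrates why such extreme values are omitted: if the observed z-score of 6 reflected the true strength of evidence and an exact replication used the same sample size, the probability of obtaining a significant result again would be practically 1.

```python
from scipy import stats

z_obs = 6.0
z_crit = stats.norm.isf(0.05 / 2)   # alpha = .05 criterion, z = 1.96

# Probability of a significant result (in either direction) in an exact replication,
# assuming the replication z-score is normally distributed around the observed z-score.
power = stats.norm.sf(z_crit - z_obs) + stats.norm.sf(z_crit + z_obs)
print(power)                        # about 0.99997
```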

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than the model predicts (grey curve). This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 70%, the expected discovery rate is only 46%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.
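The logic of this comparison can be shown with a small sketch. The counts and the EDR interval below are placeholders chosen for illustration; the actual values come from the z-curve fit in Figure 1.

```python
from statsmodels.stats.proportion import proportion_confint

k_significant, n_tests = 7000, 10000          # hypothetical counts chosen to give an ODR of .70
odr = k_significant / n_tests
odr_ci = proportion_confint(k_significant, n_tests, alpha=0.05, method="wilson")

edr_ci = (0.30, 0.60)                          # placeholder 95% CI for the EDR from a z-curve fit
print(odr, odr_ci)                             # ODR with its 95% confidence interval
print(odr_ci[0] > edr_ci[1])                   # True -> the intervals do not overlap -> selection bias
```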

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 46% implies that no more than 6% of the significant results are false positives. The 95% confidence interval puts the upper limit of the false positive risk at 8%. Thus, concerns that most published results are false are overblown. However, a narrow focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be practically significant. Readers of original articles need to focus on the confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but they may not provide enough information to determine whether a statistically significant result is theoretically or practically important.
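To my understanding, the conversion from the EDR to the false positive risk is based on Sorić's (1989) upper bound on the false discovery rate. The sketch below reproduces the reported 6% from the 46% EDR; the EDR value used to illustrate the 8% upper limit is an assumption, because the EDR's confidence interval is not reported here.

```python
def max_false_positive_risk(edr, alpha=0.05):
    """Soric's (1989) upper bound on the share of significant results that are false positives."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(max_false_positive_risk(0.46), 3))   # 0.062 -> "no more than 6%"
print(round(max_false_positive_risk(0.40), 3))   # an EDR near .40 (assumed) would imply the 8% upper limit
```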

Expected Replication Rate

The expected replication rate (ERR) estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. Comparisons of the ERR with the outcomes of actual replication studies show that the ERR is higher than the actual replication rate. Several factors can explain this discrepancy, such as the difficulty of conducting exact replications. Thus, the ERR is an optimistic estimate. A more conservative estimate is the EDR. The EDR predicts replication outcomes under the assumption that selection for significance does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power end up being published.

The ERR of 72% suggests that the majority of results published in Acta Psychologica are replicable, but the EDR allows for a replication rate as low as 46%. Thus, replicability is estimated to range from 46% to 72%. Actual replications of cognitive research suggest that about 50% of results produce a significant result again (Open Science Collaboration, 2015). This finding is consistent with the present results. Taking the low false positive risk into account, most replication failures are likely to be false negatives due to insufficient power in the original and replication studies. This suggests that replication studies should increase sample sizes to have sufficient statistical power to replicate true positive effects.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The results were regressed on time and time-squared to allow for non-linear relationships.
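The following sketch shows one way to implement this analysis with placeholder annual estimates; details such as centering the year are my assumptions, not necessarily those of the original analysis.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
years = np.arange(2000, 2023)
odr_by_year = rng.uniform(0.60, 0.80, size=years.size)   # placeholder annual ODR estimates

t = years - years.mean()                                 # center year to reduce collinearity
X = sm.add_constant(np.column_stack([t, t ** 2]))        # intercept, linear, and quadratic terms
fit = sm.OLS(odr_by_year, X).fit()

# The slope is in proportion units per year; multiplying by 100 gives
# percentage points per year, the unit used in the text (e.g., b = -.43).
print(fit.params)
print(fit.bse, fit.pvalues)
```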

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating that more non-significant results are being published over time, b = -.43 percentage points per year (SE = .13). The EDR showed no significant trends, p > .20.

Figure 2

The decrease in the ODR implies that selection bias is decreasing over time. However, a low EDR still implies that many studies that produced non-significant results remain unpublished. Moreover, it is unclear whether researchers are reporting more focal results as non-significant. To investigate this question, it is necessary to hand-code focal tests (see the Limitations section).

Figure 3 shows the false discovery risk (FDR; the false positive risk discussed above) and the expected replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

There were no linear or quadratic time trends for the ERR, p > .2. The FDR is based on the EDR, which also showed no time trends. Thus, the estimates for all years combined can be used to obtain more precise estimates than the annual ones. Based on the results in Figure 1, the expected failure rate is 28% and the FDR is 5%. This suggests that replication failures are more likely to be false negatives due to modest power than false positive results in original studies. To avoid false negative results in replication studies, these studies should use larger samples.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion for making decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to choose an alpha criterion is to consider the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but it also increases the percentage of false negatives (i.e., cases in which there is an effect even though the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Using alpha = .01 lowers the discovery rate by about 20 percentage points. The stringent criterion of alpha = .001 lowers it by another 20 percentage points to around 40% discoveries. This would mean that many published results that were used to make claims no longer have empirical support.
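The underlying calculation simply re-counts the extracted z-scores against stricter significance thresholds. A minimal sketch with simulated z-scores (the report itself uses the journal's extracted values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
abs_z = np.abs(rng.normal(loc=1.8, scale=1.2, size=5000))   # simulated stand-in for extracted |z| scores

for alpha in (0.05, 0.01, 0.001):
    z_crit = stats.norm.isf(alpha / 2)                      # 1.96, 2.58, 3.29
    print(alpha, round(float(np.mean(abs_z > z_crit)), 2))  # discovery rate at this alpha
```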

Figure 5 shows the effects of alpha on the false positive risk. Even alpha = .01 is sufficient to ensure a false positive risk of 5% or less. Thus, alpha = .01 seems a reasonable criterion to avoid too many false positive results without discarding too many true positive results. Authors may want to increase statistical power so that they have a good chance of obtaining a p-value below .01 when their hypotheses are true, which would provide credible evidence for their hypotheses.

Figure 5

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Acta Psychologica, hand-coded data are available for the years 2010 and 2020 from a study that examines changes in replicability from 2010 to 2020. Figure 6 shows the results.

Figure 6

The most notable difference is the higher observed discovery rate for hand-coding of focal hypothesis tests (94%) than for automatically extracted test statistics (70%). Thus, results based on automatically extracted data underestimate selection bias.

In contrast, the expected discovery rates are similar in hand-coded (46%) and automatically extracted (46%) data. Given the small set of hand-coded tests, the 95% confidence interval around the 46% estimate is wide, but there is no evidence that automatically extracted data overestimate the expected discovery rate and by implication underestimate the false discovery rate.

The ERR for hand-coded focal tests (70%) is also similar to the ERR for automatically extracted tests (72%).

This comparison suggests that the main limitation of automatic extraction of test statistics is that this method underestimates the amount of selection bias because authors are more likely to report non-focal tests than focal results that are not significant. Thus, selection bias remains a pervasive problem in this journal.

Conclusion

The replicability report for Acta Psychologica shows clear evidence of selection bias, although there is a trend suggesting that selection bias has been decreasing in recent years. The results also suggest that replicability is in a range from 40% to 70%. This replication rate does not deserve to be called a crisis, but it does suggest that many studies are underpowered and require luck to obtain a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Finally, apart from the gradual decline in the observed discovery rate, time trend analyses show no important changes in response to the open science movement. An important goal is to reduce the selective publishing of studies that worked (p < .05) and the hiding of studies that did not work (p > .05). Preregistration or registered reports can help to address this problem. Given concerns that most published results in psychology are false positives, the present results are reassuring and suggest that most results with p-values below .01 are true positive results.