
Gelman’s Type-S Error: A Misunderstanding of Hypothesis Testing

Andrew Gelman is well known for strong opinions about psychological science, including its methods and research culture (Fiske, 2017). For the most part, he writes as if psychologists are still following a statistical ritual that cannot produce meaningful results. This criticism is not new. It was already made by influential psychologists and methodologists, including Cohen (1990, 1994) and Gigerenzer (2004). The problem with Gelman's critique is that it is outdated and largely ignores the discussion of null-hypothesis significance testing that took place in psychology during the 1990s. As evidence for this claim, one can simply inspect the reference list of Gelman and Carlin (2014). The article, published in Perspectives on Psychological Science, does not cite Cohen (1990, 1994), Gigerenzer (2004), or Tukey's directional reformulation of significance testing (Tukey, 1991; Jones & Tukey, 2000). Although an outsider perspective can be useful for challenging untested assumptions, a commentary that ignores key insights produced by eminent statisticians and methodologists within psychology is unlikely to do so.

The Null-Hypothesis Significance Testing Strawman

As Gigerenzer (2004) pointed out, statistics is often taught as a ritual to be followed rather than as a principled approach to drawing conclusions from data. Rituals are not necessarily bad, but in science it is usually better to understand the rationale and assumptions underlying routine practices.

Null-hypothesis significance testing (NHST) has been described and criticized for decades (Tukey, 1991; Cohen, 1994). Most students of psychology will recognize the following brief description of it. First, researchers collect data that relate one variable to another. Ideally, this is an experiment in which one variable is experimentally manipulated (the independent variable) and the other is observed (the dependent variable). In experiments, a relationship between the independent and dependent variable may justify causal claims, but NHST itself is indifferent to causality. It can be applied to both experimental and correlational data. The main information produced by statistical analyses is the p-value. P-values below a conventional threshold are called statistically significant; those above the threshold are treated as not significant (ns). Significant results are easier to publish. As a result, data analysis often becomes a series of statistical tests searching for statistically significant results (Bem, 2010).

This approach to data analysis has been criticized for several reasons. First, statistical significance by itself does not provide information about effect size. For this reason, psychologists have increasingly reported effect-size estimates in addition to tests of statistical significance, in large part due to Cohen’s (1990) emphasis on effect sizes. Second, NHST has been criticized for its focus on statistically significant findings. Psychology journals have long reported rates of over 90% statistically significant results (Sterling, 1959; Sterling et al., 1995). Publication bias in favor of significant results then leads to inflated effect-size estimates (Rosenthal, 1979).

Most importantly, NHST has been criticized because it appears to reject a null hypothesis that is known to be false before any data are collected. Cohen (1994) called this the nil hypothesis. The nil hypothesis assumes that the population effect size is exactly zero. Statistical significance is then taken to imply that this hypothesis is unlikely to be true and can be rejected. The problem is that rejecting one specific possible effect size tells us very little about the data. It would be equally uninformative to test the hypothesis that the effect size equals any other single value, such as Cohen’s d = .20. So what if the effect size can be said not to be 0 or .20? It could still be 0.01 or 1.99. In short, hypothesis testing with a single point as the null hypothesis is meaningless. Yet that is exactly what psychological articles seem to be reporting when they state p < .05.

What Psychological Scientists Are Implicitly Doing

In reality, however, psychological scientists are doing something different. It may look as if they are testing the nil hypothesis, but in practice they are often testing two directional hypotheses at the same time (Kaiser, 1960; Lakens et al., 2025; Tukey, 1991; Jones & Tukey, 2000). When the nil hypothesis is rejected, researchers do not merely conclude that there is a difference. They also inspect the sign of the effect size estimate and infer that the experimental manipulation increased or decreased behavior.

Some authors have argued that drawing directional conclusions from a two-sided test is conceptually problematic (e.g., Rubin, 2020). However, Jones and Tukey (2000) explain the rationale for doing so. The easiest way to see this is to reinterpret the standard nil-hypothesis test as two directional tests with two complementary null hypotheses. One null hypothesis states that the effect size is zero or negative. The other states that the effect size is zero or positive. Rejecting the first leads to the inference that the effect is probably positive. Rejecting the second leads to the inference that the effect is probably negative. Viewed this way, zero is simply the boundary between two rejection regions.

Because NHST can be understood in this way as involving two directional possibilities, alpha must be allocated across both tails to maintain the long-run error rate. No psychology student would be surprised to see a t distribution with 2.5% of the area in each tail. Each tail represents the error rate for one directional rejection, and together they produce the familiar two-sided alpha level of 5%.
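To make this concrete, here is a minimal R sketch (with invented data; the group labels and means are arbitrary) of the three-decision rule that follows from reading a standard two-sided t-test as two one-sided tests:

set.seed(1)
alpha <- .05
treatment <- rnorm(50, mean = 0.5)   # hypothetical treatment group
control   <- rnorm(50, mean = 0.0)   # hypothetical control group
tt <- t.test(treatment, control)     # standard two-sided t-test

if (tt$p.value < alpha && tt$statistic > 0) {
  "reject 'effect <= 0': infer a positive effect"
} else if (tt$p.value < alpha && tt$statistic < 0) {
  "reject 'effect >= 0': infer a negative effect"
} else {
  "direction remains undetermined"
}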

Most psychology students are not taught that they are implicitly conducting directional tests when they interpret significant p values, but their actual practice shows that this is what they are doing. They routinely draw directional inferences from NHST, and this is a legitimate use of the procedure. It also makes NHST more meaningful than the strawman version in which researchers merely reject an exact value of zero that is often known in advance to be false.

Using NHST to infer the direction of population effects is meaningful because researchers often do not know that direction before data are collected. Empirical data can therefore provide genuinely new information. This is not a full defense of NHST, because effect size and practical importance can still be ignored, but it does show that psychologists have not spent decades and millions of dollars merely to establish that effect sizes are not exactly zero.

Gelman’s Type-S Error

Gelman and Tuerlinckx (2000) criticized NHST because “the significance of comparisons … is calibrated using the Type 1 error rate, relying on the assumption that the true difference is zero, which makes no sense in many applications.” To replace this framework, they proposed focusing on Type S error, where S stands for sign. A Type S error occurs when a researcher makes a confident directional claim even though the true effect has the opposite sign.

The label Type S error is potentially confusing because it suggests a replacement for the Type I error framework rather than a refinement of it. A Type I error is the unconditional long-run probability of falsely rejecting a null hypothesis across all tests that are conducted. For example, suppose a researcher conducts 100 tests with a significance criterion (alpha) of 5%. This criterion ensures that in the long run no more than 5% of all tests will be false positives. Testing at least some true effects lowers this rate further; in the extreme case that every tested effect is real, no false positives can occur (Sorić, 1989). Thus, alpha sets an upper bound: the relative frequency of false positives across all tests falls somewhere between 0 and alpha.

This unconditional probability must be distinguished from the conditional probability of error among the subset of studies that produced statistically significant results. In the previous example, if only 5 results were significant, it is likely that all 5 rejections were errors and that the conditional probability of a false positive given a significant result is 5 / 5 = 100% (Sorić, 1989). The proportion of false rejections among statistically significant results is called the false discovery rate (FDR), and the estimation and control of FDRs has become a large literature in statistics (Benjamini & Hochberg, 1995).
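The numbers in this example follow from Sorić's upper bound on the false discovery rate. The short R sketch below (my own illustration, not code from any of the cited papers) computes the bound from the observed discovery rate and alpha:

max.fdr <- function(discovery.rate, alpha = .05) {
  (1 / discovery.rate - 1) * alpha / (1 - alpha)   # Soric's (1989) upper bound on the FDR
}
max.fdr(5 / 100)    # = 1.00: all 5 significant results out of 100 tests could be false positives
max.fdr(50 / 100)   # ~ .05: a high discovery rate caps the false discovery rate at about 5%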

Applying Jones and Tukey’s interpretation of NHST to false discovery rates, a false discovery occurs not only when the true effect size is zero but also when it is in the opposite direction of the significant result. Gelman’s Type S error rate, also called the false sign rate (Stephens, 2017), assumes that effect sizes are never zero and counts only false rejections with the opposite sign. False sign rates are necessarily smaller than false discovery rates because wrong-sign rejections are only a subset of all false rejections. Exact-zero effects can produce significant results in either direction, whereas nonzero effects make correct-sign rejections more likely and wrong-sign rejections less likely.
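A small simulation illustrates this point. The R sketch below uses arbitrary assumptions (30% exactly zero effects, the remaining effects drawn from a normal distribution centered on .20, and a fixed standard error of .10) and shows that the false sign rate among significant results is only a small fraction of the false discovery rate:

set.seed(123)
n.studies   <- 100000
true.effect <- ifelse(runif(n.studies) < .30, 0,                # 30% true nulls (assumption)
                      rnorm(n.studies, mean = .20, sd = .20))   # 70% mostly positive effects
se       <- .10                                                 # assumed standard error per study
estimate <- rnorm(n.studies, mean = true.effect, sd = se)
sig      <- abs(estimate / se) > 1.96                           # two-sided significance

wrong.sign <- sig & true.effect != 0 & sign(estimate) != sign(true.effect)
false.zero <- sig & true.effect == 0
round(c(FDR = sum(wrong.sign | false.zero) / sum(sig),          # zero effect or wrong sign
        FSR = sum(wrong.sign) / sum(sig)), 4)                   # wrong sign only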

The key source of confusion is that Gelman’s criticism of NHST and FDR estimation rests on a misunderstanding of NHST (Gelman, 2021). He maintains that FDR estimates are limited to the unlikely scenario that an effect is exactly zero and ignores sign errors. However, as Jones and Tukey (2000) pointed out, psychological researchers routinely use NHST as a directional sign test. Once NHST is understood in this way, Type S errors are no longer a fundamentally new kind of inferential problem and are already included in conditional and unconditional error rates. Moreover, NHST provides researchers with concrete statistical tools to estimate and control error rates, whereas Gelman’s Type S error is not something that can be estimated and was introduced as a rhetorical tool without practical use (Gelman, 2025; Lakens et al., 2025). In contrast, estimation of false discovery rates and false sign rates is an active area of research in statistics that builds on the foundations of NHST (Benjamini & Hochberg, 1995; Stephens, 2017) and has been largely ignored in psychology.

Statistical Power

So far, the distinction between Type I and (unconditional) Type S errors is mostly harmless. It may even help clarify that NHST is really used as a test of the sign of the population effect size rather than as a literal test of the nil hypothesis (Jones & Tukey, 2000). However, the wheels come off when Gelman and Carlin (2014) extend this critique from Type I error to Type II error and statistical power.

The distinction between Type I and Type II errors was introduced by Neyman and Pearson. A Type II error is the probability of failing to reject a false null hypothesis. Neyman and Pearson were cautious and avoided framing results as inferences about a true effect or as acceptance of a true hypothesis. In practice, however, failure to reject a false hypothesis means that either the population effect is positive and the study failed to produce a statistically significant result with a positive sign, or the population effect is negative and the study failed to produce a statistically significant result with a negative sign.

Statistical power is simply the complementary probability of obtaining a statistically significant result with the correct sign. Unlike the discussion of Type I errors, there is no important distinction here between a point null and an opposite-sign error. Power calculations are inherently directional. Researchers assume either a positive or a negative effect and then choose a design and sample size that reduce sampling error while controlling the Type I error rate. For example, a comparison of two groups with n = 50 per group, a population effect size of half a standard deviation (Cohen’s d = .50), and alpha = .05 has about a 70% probability of producing a statistically significant result with the correct sign.
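This number can be reproduced with base R's power.t.test function (a minimal sketch; none of the articles discussed here provide this code):

power.t.test(n = 50, delta = .50, sd = 1, sig.level = .05,
             type = "two.sample", alternative = "two.sided")$power   # ~ .70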

By definition, then, power already concerns rejections with the correct sign. At this point, there is no meaningful difference between standard NHST and Gelman's Type S framework (Stephens, 2017). The only minor difference arises in hypothetical scenarios with extremely low power. For two-sided (non-directional) power calculations, low power can produce significant results with sign errors. To use NHST as a sign test in Jones and Tukey's framework of two simultaneous one-sided tests, power should be estimated for one-sided directional tests with alpha/2. In practice, however, this distinction is irrelevant because Gelman and Carlin already showed that even modest power of 50% renders sign errors practically impossible.
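The sketch below (base R plus a normal approximation of my own) shows that the one-sided calculation with alpha/2 gives essentially the same power and that, at 50% power, a significant result with the wrong sign is practically impossible:

# one-sided power at alpha/2 matches the familiar two-sided calculation
power.t.test(n = 50, delta = .50, sd = 1, sig.level = .025,
             alternative = "one.sided")$power                    # ~ .70

# approximate probability of a significant result with the WRONG sign at ~50% power
d  <- power.t.test(n = 50, power = .50, sig.level = .05)$delta   # effect size that yields 50% power
se <- sqrt(2 / 50)                                               # SE of the mean difference (sd = 1)
pnorm(qnorm(.025) - d / se)                                      # < .0001 of all studies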

Thus, the main concern about Gelman and Carlin’s (2014) article is the false implication that power calculations ignore sign errors and that researchers must move “beyond power” to control them. Grounding NHST in Jones and Tukey’s (2000) framework of two simultaneous directional tests shows that power calculations are not flawed. High power prevents both false negatives and sign errors. Gelman’s critique rests on a false premise: the assumption that NHST is nil-hypothesis testing. Under that assumption, power appears disconnected from sign errors. But once NHST is understood as directional inference, the criticism is invalid. Power analysis is not only useful but essential for controlling sign errors and the false sign rate.

Implications


Gelman positions the Type S error as a new concept that requires moving “beyond power” because “power analysis is flawed” (p. 641). On closer inspection, power analysis is necessary and sufficient to control Type S error rates. Studies with high power ensure that most significant results have the correct sign, and high power also ensures a high discovery rate, which limits the proportion of false discoveries (Sorić, 1989). Power delivers everything needed to make significant results credible. It is paradoxical to criticize psychology for relying on small samples while also criticizing the tool that tells researchers how to avoid them. Cohen’s lasting contribution was precisely this: demonstrating that many studies lack power to detect plausible but small effect sizes and providing the tools to do better (Cohen, 1962).
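The link between power and false discoveries can be made explicit with the standard relation FDR = alpha * pi0 / (alpha * pi0 + power * (1 - pi0)), where pi0 is the proportion of tested hypotheses that are true nulls. The sketch below uses pi0 = .50 purely as an assumption for illustration:

fdr <- function(power, pi0, alpha = .05) {
  alpha * pi0 / (alpha * pi0 + power * (1 - pi0))
}
fdr(power = .20, pi0 = .50)   # = .20: with low power, false positives make up a fifth of discoveries
fdr(power = .80, pi0 = .50)   # ~ .06: with high power, the false discovery rate stays low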

Gelman and Carlin’s (2014) framing of power as flawed may have added to misunderstandings about the role of power in ensuring credible results. NHST and power analysis are not flawed. They are statistical tools for drawing conclusions about the direction of population effect sizes (Maxwell, Kelley, & Rausch, 2008). It would be desirable to conduct all studies with enough precision to provide informative effect size estimates, but limited resources often make this impossible. Meta-analysis of smaller studies can yield precise estimates, provided results are reported without selection bias. Reporting outcomes regardless of statistical significance is the most effective way to address selection bias, which remains the biggest threat to the credibility of NHST in practice (Sterling, 1959).

The real problem of NHST is not solved by a focus on Type S errors. The real problem is that non-significant results are inconclusive because failure to provide evidence for a positive or negative effect does not allow inferring the absence of an effect (Altman & Bland, 1995). The solution is to distinguish three hypotheses (Rice & Krakauer, 2023): (a) the effect is positive and larger than a smallest effect size of interest, (b) the effect is negative and larger in magnitude than a smallest effect size of interest, and (c) the effect falls within a region of practical equivalence around zero. Evidence for absence is established if the confidence interval falls entirely within the middle region. Replacing the point nil hypothesis with a range of practically equivalent values is an important addition to statistics for psychologists (Lakens, 2017; Lakens, Scheel, & Isager, 2018). It helps distinguish between statistical and practical significance, and it can turn non-significant results into significant evidence for the absence of a meaningful effect. However, providing evidence for absence often requires large samples because precise confidence intervals are needed to fit within a narrow region around zero. Power analysis remains essential for planning studies with this goal.
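As a minimal sketch of this three-decision approach (with invented data and an assumed smallest effect size of interest of 0.10 standard deviations), the R code below checks where the 95% confidence interval falls relative to the region of practical equivalence; note the large sample needed to make the interval narrow enough:

set.seed(2)
sesoi  <- 0.10                                   # smallest effect size of interest (assumption)
group1 <- rnorm(5000, mean = 0.02)               # hypothetical data with a negligible true effect
group2 <- rnorm(5000, mean = 0.00)
ci <- t.test(group1, group2)$conf.int            # 95% confidence interval for the mean difference

if (ci[1] > sesoi) {
  "evidence for a positive effect larger than the smallest effect size of interest"
} else if (ci[2] < -sesoi) {
  "evidence for a negative effect larger in magnitude than the smallest effect size of interest"
} else if (ci[1] > -sesoi && ci[2] < sesoi) {
  "evidence for the absence of a practically meaningful effect"
} else {
  "inconclusive: the confidence interval overlaps the boundary of the equivalence region"
}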

Conclusion

Continued controversy about NHST shows that better education about its underlying logic is needed. Jones and Tukey (2000) provided a clear explanation that deserves to be foundational for the teaching of NHST. Understanding NHST as two simultaneous directional tests avoids the confusion created by decades of criticism directed at a strawman version of the procedure. NHST has persisted for nearly a century despite harsh criticism because it provides a minimal but useful inference: determining the likely sign of a population effect size. Students need to learn about the real limitations of NHST and how they can be addressed. Changing statistical methods does not solve the problem that researchers need to publish and that precise effect size estimates are often out of reach. Even power to infer the sign of an effect is often low. Honest reporting of a single well-powered study is more important than reporting multiple underpowered studies that are p-hacked or selected for significance (Schimmack, 2012). With good data, different statistical approaches lead to the same conclusion. Open science reforms that improve the quality of data are more important than new statistical methods. The main reason NHST continues to attract criticism is that criticism is easy, but finding a better solution is harder. Real progress requires a real analysis of the problem. NHST has many problems, but ignoring sign errors is not one of them.

References

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304–1312.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003.

Fiske, S. T. (2017). Going in many right directions, all at once. Perspectives on Psychological Science, 12, 652–655. https://doi.org/10.1177/1745691617706506

Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5), 587–606.

Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5(4), 411–414.

Tukey, J. W. (1991). The philosophy of multiple comparisons. Statistical Science, 6(1), 100–116.

Countering Misinformation about Z-Curve on Gelman’s Blog

I am all in favor of open science and a critic of closed pre-publication peer review. The downside of open communication is that there is no quality control, and internet searches can amplify misinformation. This is the case with Erik van Zwet's critique of z-curve. Even though I addressed his criticisms in the comment section, search engines – like humans – do not scroll to the end and process all of the information. I have also addressed concerns about z-curve 2.0 by improving z-curve 3.0 to handle edge cases like the one van Zwet used to cast doubt on z-curve's performance in general. In science, facts trump visibility. Z-curve has been validated with many simulations across a wide range of scenarios and works well even with just 50 significant z-values. For more information, check out the Replication Index blog or the FAQ about z-curve page.

The bias in the Bing (AI) summary becomes evident when we compare it to Google's search summary. The Google summary still makes a false claim about assumptions based on Erik van Zwet's blog post, but it does not dismiss the method on the basis of a single edge case that was easy to address and is no longer a concern in z-curve 3.0. In short, don't trust the first generic response of an AI. Use AI to probe arguments.

[Google search summary screenshot]

Loken and Gelman’s Simulation Is Not a Fair Comparison

“What I’d like to say is that it is OK to criticize a paper, even [if, typo in original] it isn’t horrible.” (Gelman, 2023)

In this spirit, I would like to criticize Loken and Gelman's confusing article about the interpretation of effect sizes in studies with small samples and selection for significance. They compare random measurement error to a backpack and the outcome of a study to running speed. Common sense suggests that the same individual under identical conditions would run faster without a backpack than with one. The same conclusion follows from psychometric theory, which implies that random measurement error attenuates population effect sizes, making it harder to obtain significance and producing, on average, weaker observed effects.

The key point of Loken and Gelman's article is to suggest that this intuition fails under some conditions: "Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise? We caution against the fallacy."

To support their claim that common sense is a fallacy under certain conditions, they present the results of a simple simulation study. After some concerns about their conclusions were raised, Loken and Gelman shared the actual code of their simulation study. In this blog post, I share the code with annotations and reproduce their results. I also show that their results are based on selecting for significance only for the measure with random measurement error (with a backpack) and not for the measure without random measurement error (no backpack). Reversing the selection shows that selection for significance without measurement error produces stronger effect sizes even more often than selection for significance with a backpack. Thus, it is not a fallacy to assume that we would all run faster without a backpack, holding all other factors equal. A runner with a heavy backpack and tailwinds might run faster than a runner without a backpack facing strong headwinds, but the influence of wind on performance simply makes it harder to see the influence of the backpack. Under identical conditions, backpacks slow people down and random measurement error attenuates effects.

Loken and Gelman's presentation of the results may explain why some readers, including us, misinterpreted their results to imply that selection bias and random measurement error may interact in some complex way to produce even more inflated estimates of the true correlation. We added some lines of code to their simulation to compute the average correlations after selection for significance separately for the measure without error and the measure with error. This way, both measures benefit equally from selection bias. The plot also provides more direct evidence about the amount of bias that is introduced by selection bias and random measurement error. In addition, the plot shows the average 95% confidence intervals around the estimated correlation coefficients.


The plot shows that for large samples (N > 1,000), the measure without error always produces the expected true correlation of r = .15, whereas the measure with error always produces the expected attenuated correlation of r = .15 * .80 = .12. As sample sizes get smaller, the effect of selection bias becomes apparent. For the measure without error, the observed effect sizes are now inflated. For the measure with error, the downward bias from unreliability and the upward bias from selection partially cancel each other out, producing more accurate estimates of the true effect size than the measure without error. For sample sizes below N = 400, however, both measures produce inflated estimates, and in really small samples the attenuation effect due to unreliability is overwhelmed by selection bias. Although the difference due to unreliability becomes negligible in this range, random measurement error combined with selection bias never produces stronger estimates than the measure without error. Thus, it remains true that we should expect a measure without random measurement error to produce stronger correlations than a measure with random error. What this fundamental principle of psychometrics does not warrant is the conclusion that an observed statistically significant correlation in a small sample underestimates the true correlation, because the observed correlation may have been inflated by selection for significance.

The plot also shows how researchers can avoid misinterpreting inflated effect size estimates in small samples. In small samples, confidence intervals are wide. Figure 2 shows that the confidence interval around inflated effect size estimates in small samples is so wide that it includes the true correlation of r = .15. The width of the confidence interval in small samples makes it clear that the study provided no meaningful information about the size of the effect. This does not mean the results are useless. After all, the results correctly show that the relationship between the variables is positive rather than negative. For the purpose of effect size estimation, it is necessary to conduct a meta-analysis that includes studies with significant and non-significant results. Furthermore, meta-analyses need to test for the presence of selection bias and correct for it when it is present.
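The following short sketch (my own illustration with invented numbers, not Loken and Gelman's code) makes the same point for a single small study that was selected for significance: the slope estimate is inflated, but its 95% confidence interval is wide and typically still includes the true value of .15:

set.seed(3)
n <- 50
repeat {
  x <- rnorm(n)
  y <- .15 * x + rnorm(n)
  fit <- summary(lm(y ~ x))$coefficients
  if (fit[2, 4] < .05) break                  # keep only a significant result (selection for significance)
}
fit[2, 1]                                     # inflated slope estimate
fit[2, 1] + c(-1.96, 1.96) * fit[2, 2]        # wide 95% CI that typically covers the true .15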

P.S. If somebody claims that they ran a marathon in 2 hours with a heavy backpack, they may not be lying. They may just not tell you all of the information. We often fill in the blanks and that is where things can go wrong. If the backpack were a jet pack and the person was using it to fly for some of the race, we would no longer be surprised by the amazing feat. Similarly, if somebody tells you that they got a correlation of r = .8 in a sample of N = 8 with a measure that has only 20% reliable variance, you should not be surprised if they tell you that they got this result after picking 1 out of 20 studies because selection for significance will produce strong correlations in small samples even if there is no correlation at all. Once they tell you that they tried many times to get the one significant result, it is obvious that the next study is unlikely to replicate a significant result.

Sometimes You Can Be Faster With a Heavy Backpack

Annotated Original Code

 
### This is the final code used for the simulation studies posted by Andrew Gelman on his blog
 
### Comments are highlighted with my initials #US#
 
# First just the original two plots, high power N = 3000, low power N = 50, true slope = .15
 
r <- .15
sims<-array(0,c(1000,4))
xerror <- 0.5
yerror<-0.5
 
for (i in 1:1000) {
x <- rnorm(50,0,1)
y <- r*x + rnorm(50,0,1) 
 
#US# this is a sloppy way to simulate a correlation of r = .15
#US# The proper code is r*x + rnorm(50,0,1)*sqrt(1-r^2)
#US# However, with the specific value of r = .15, the difference is trivial
#US# Still, it raises some concerns about expertise
 
xx<-lm(y~x)
sims[i,1]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(50,0,xerror)
y<-y + rnorm(50,0,yerror)
xx<-lm(y~x)
sims[i,2]<-summary(xx)$coefficients[2,1]
 
x <- rnorm(3000,0,1)
y <- r*x + rnorm(3000,0,1)
xx<-lm(y~x)
sims[i,3]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(3000,0,xerror)
y<-y + rnorm(3000,0,yerror)
xx<-lm(y~x)
sims[i,4]<-summary(xx)$coefficients[2,1]
 
}
 
plot(sims[,2] ~ sims[,1],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")

plot(sims[,4] ~ sims[,3],ylab="Observed with added error",xlab="Ideal Study")
abline(0,1,col="red")
 
#US# There is no major issue with graphs 1 and 2. 
#US# They merely show that high sampling error produces large uncertainty in the estimates.
#US# The small attenuation effect of r = .15 vs. r = .12 is overwhelmed by sampling error
#US# The real issue is the simulation of selection for significance in the third graph
 
# third graph
 
# run 2000 regressions at points between N = 50 and N = 3050 
 
r <- .15
 
propor <-numeric(31)
powers<-seq(50,3050,100)
 
#US# These lines of code are added to illustrate the biased selection for significance
propor.reversed.selection <-numeric(31) 
mean.sig.cor.without.error <- numeric(31) # mean correlation for the measure without error when t > 2
mean.sig.cor.with.error <- numeric(31) # mean correlation for the measure with error when t > 2
 
#US# It is sloppy to refer to sample sizes as powers. 
#US# In between-subject studies, the power to produce a true positive result
#US# is a function of the population correlation and the sample size
#US# With population correlations fixed at r = .15 or r = .12, sample size is the
#US# only variable that influences power
#US# However, power varies from alpha to 1 and it would be interesting to compare the 
#US# power of studies with r = .15 and r = .12 to produce a significant result.
#US# The claim that "one would always run faster without a backpack"
#US# could be interpreted as a claim that it is always easier to obtain a 
#US# significant result without measurement error, r = .15, than with measurement error, r = .12
#US# This claim can be tested with Loken and Gelman’s simulation by computing 
#US# the percentage of significant results obtained without and with measurement error
#US# Loken and Gelman do not show this comparison of power.
#US# The reason might be the confusion of sample size with power. 
#US# While sample sizes are held constant, power varies as a function of the population correlations
#US# without, r = .15, and with, r = .12, measurement error. 
 
xerror<-0.5
yerror<-0.5
 
j = 1
i = 1
 
for (j in 1:31)  {
 
sims<-array(0,c(1000,4))
for (i in 1:1000) {
x <- rnorm(powers[j],0,1)
y <- r*x + rnorm(powers[j],0,1)
#US# the same sloppy simulation of population correlations as before
xx<-lm(y~x)
sims[i,1:2]<-summary(xx)$coefficients[2,1:2]
x<-x + rnorm(powers[j],0,xerror)
y<-y + rnorm(powers[j],0,yerror)
xx<-lm(y~x)
sims[i,3:4]<-summary(xx)$coefficients[2,1:2]
}
 
#US# The code is the same as before, it just adds variation in sample sizes
#US# The crucial aspect to understand figure 3 is the following code that 
#US# compares the results for the paired outcomes without and with measurement error
 
#US# Carlos Ungil (https://ch.linkedin.com/in/ungil) pointed out on Gelman's blog
#US# that there is another sloppy mistake in the simulation code that does not alter the results.
#US# The code compares absolute t-values (coefficient/sampling error), while the article
#US# talks about inflated effect size estimates. However, while the sampling error variation
#US# creates some variability, the pattern remains the same.
#US# For the sake of reproducibility I kept the comparison of t-values.
 
# find significant observations (t test > 2) and then check proportion
temp<-sims[abs(sims[,3]/sims[,4])> 2,]
 
#US# the use of t > 2 is sloppy and unnecessary.
#US# summary(lm) gives the exact p-values that could be used to select for significance
#US# summary(xx)$coefficients[2,4] < .05
#US# However, this does not make a substantial difference 
 
#US# The crucial part of this code is that it uses the outcomes of the simulation 
#US# with random measurement error to select for significance
#US# As outcomes are paired, this means that the code sometimes selects outcomes
#US# in which sampling error produces significance with random measurement error 
#US# but not without measurement error. 
 
propor[j] <- table((abs(temp[,3]/temp[,4])> abs(temp[,1]/temp[,2])))[2]/length(temp[,1])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# when measurement error is present.
mean.sig.cor.with.error[j] = mean(temp[,3])
 
#US# Conditioning on significance for one of the two measures is a strange way
#US# to compare outcomes with and without measurement error.
#US# Obviously, the opposite selection bias would favor the measure without error.
#US# This can be shown by computing the same proportion after selecting for significance
#US# for the measure without error
 
temp<-sims[abs(sims[,1]/sims[,2])> 2,]
propor.reversed.selection[j] <- table((abs(temp[,1]/temp[,2])> abs(temp[,3]/temp[,4])))[2]/length(temp[,4])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# without measurement error. 
mean.sig.cor.without.error[j] = mean(temp[,1])
 
print(j)
 
#US# one could also add comparisons that are more meaningful and avoid the one-sided conditioning
###
 
}
 
 
#US# the plot code had to be modified slightly to have matching y-axes 
#US# I also added a title 
title = "Original Loken and Gelman Code"

plot(powers,propor,type="l",
ylim=c(0,1),main=title,  ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="blue")
 
#US# text that explains what the plot displays, not shown
#US# #text(200,.8,"How often is the correlation higher for the measure with error",pos=4)
#US# #text(200,.75,"when pairs of outcomes are selected based on the significance",pos=4)
#US# #text(200,.70,"of the measure with error?",pos=4)
 
#US# We can now plot the two outcomes in the same figure 
#US# The original color was blue. I used red for the reversed selection
par(new=TRUE)
plot(powers,propor.reversed.selection,type="l",
ylim=c(0,1), ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="firebrick2")
 
#US# adding a legend 
legend(1500,.9,legend=c("with backpack only sig. \n shown in article \n ",
"without backpack only sig. \n added by me"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at 50%
abline(h=.5,lty=2)
 
 
#US# The following code shows the plot of mean correlations after selection for significance
#US# for the measure with error (blue) and the measure without error (red)
 
title = "Comparison of Correlations after Selection for Significance"
 
plot(powers,mean.sig.cor.with.error,type="l",ylim=c(.1,.4),main=title,
xlab="Sample Size",ylab="Mean Observed Correlation",col="blue")
 
par(new=TRUE)
 
plot(powers,mean.sig.cor.without.error,type="l",ylim=c(.1,.4),main="",
xlab="Sample Size",ylab="Mean Observed Correlation",col="firebrick2")
 
#US# adding a legend 
legend(2000,.4,legend=c("with error",
"without error"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at the true correlation of r = .15
abline(h=.15,lty=2)