Background. The Loken and Gelman article “Measurement Error and the Replication Crisis” created a lot of controversy in the Psychological Methods Discussion Group. I believe the article is confusing and potentially misleading. For example, the authors do not clearly distinguish between unstandardized and standardized effect size measures, although random measurement error has different consequences for each. I think a blog post by Gelman makes clear what the true purpose of the article is.
This explains why the article tries to construct one more fallacy in the use of traditional statistics, but fails to point out a simple solution to avoid this fallacy. Moreover, I argue in the blog post that Loken and Gelman committed several fallacies of their own in an attempt to discredit t-values and significance testing.
I asked Gelman to clarify several statements that made no sense to me.

Ulrich:
Sure, fair enough. The z-score provides some information. I guess I’d just say it provides less information than people think.
I believe that the article contains many more statements that are misleading and do not inform readers how t-values and significance testing work. Maybe the article is not as bad as I think it is, but I am pretty sure that it provides less information than people think.
In contrast, Jacob Cohen has provided clear and instructive recommendations for psychologists to improve their science. If psychologists had listened to him, we wouldn’t have a replication crisis.
The main points to realize about random measurement error and replicability are:
1. Neither population nor sample mean differences (or covariances) are effect sizes. They are statistics that provide some information about effects and the magnitude of effects. The main problem in psychology has been the interpretation of mean differences in small samples as “observed effect sizes.” Effects cannot be observed.
2. Point estimates of effect sizes vary from sample to sample. It is incorrect to interpret a point estimate as information about the size of an effect in a sample or a population. To avoid this problem, researchers should always report a confidence interval of plausible effect sizes. In small samples with just significant results, these intervals are wide and their lower bound is often close to zero. Thus, no researcher should interpret a moderate to large point estimate when effect sizes close to zero are also consistent with the data.
3. Random measurement error creates more uncertainty about effect sizes. It has no systematic effect on unstandardized effect sizes, but it systematically lowers standardized effect sizes (correlations, Cohen’s d, amount of explained variance).
4. Selection for significance inflates standardized and unstandardized effect size estimates. Replication studies may fail if original studies were selected for significance, depending on the amount of bias introduced by selection for significance (this is essentially regression to the mean).
5. As random measurement error attenuates standardized effect sizes, selection for significance partially corrects for this attenuation. Applying a correction formula (Spearman) to estimates after selection for significance would produce even more inflated effect size estimates.
6. The main cause of the replication crisis is undisclosed selection for significance. Random measurement error has nothing to do with the replication crisis because random measurement error has the same effect on original and replication studies. Thus, it cannot explain why an original study was significant and a replication study failed to be significant.
Questionable Claims in Loken and Gelman’s Backpack article.
If you learned that a friend had run a mile in 5 minutes, you would be respectful; if you learned she had done it while carrying a heavy backpack, you would be awed. The obvious inference is that she would have been even faster without the backpack.
This makes sense. We assume that our friend’s ability is relatively fixed, that everybody is slower with a heavy backpack, that the distance really is a mile, that the clock was working properly, and that no magic potion or tricks are involved. As a result, we expect very little variability in our friend’s performance and an even faster time without the backpack.
But should the same intuition always be applied to research findings? Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise?
How do we translate this analogy? Let’s say running 1 mile in 5 minutes corresponds to statistical significance. Any time below 5 minutes is significant and any time longer than 5 minutes is not significant. The friend’s ability is the sample size. The larger the sample size, the easier it is to get a significant result. Finally, the backpack is measurement error. Just like a heavy backpack makes it harder to run 1 mile in 5 minutes, more measurement error makes it harder to get significance.
The question is whether it follows that the “associated effects” (mean difference or regression coefficient that are used to estimate effect sizes) would have been stronger without random measurement error?
The answer is no. This may not be obvious, but it directly follows from basic introductory statistics, like the formula for the t-statistic.
t-value = (mean.difference / SD) * (sqrt(N) / 2)
Here SD reflects the variability of a construct in the population plus additional variability due to measurement error. So, measurement error increases the SD component of the t-value, but it has no effect on the mean difference.
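To make the role of SD concrete, here is a minimal sketch of the formula above. The function name and the numbers (a mean difference of 4, SDs of 10 and 20, N = 400) are chosen only for illustration; the sketch assumes two equal-sized groups, as the formula does.

```python
import math

def t_value(mean_difference, sd, n_total):
    """t-value for two equal-sized groups: (mean difference / SD) * sqrt(N) / 2."""
    return mean_difference / sd * math.sqrt(n_total) / 2

# Illustrative numbers: the mean difference is held constant,
# only the SD changes because measurement error adds variance.
t_reliable = t_value(4.0, 10.0, 400)  # SD without added measurement error
t_noisy = t_value(4.0, 20.0, 400)     # SD doubled by measurement error

print(t_reliable, t_noisy)
```

Doubling the SD halves the t-value while the mean difference in the numerator stays the same.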
We caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger.
With all due respect for trying to make statistics accessible, there is a tradeoff between accessibility and sensibility. First, statistical significance cannot be made stronger. A finding is either significant or it is not. Surely a test statistic like a t-value can be (made) stronger or weaker depending on changes in its components. If we interpret “that which does not kill” as “obtaining a significant result with a lot of random measurement error,” it is correct to expect a larger t-value and stronger evidence against the null hypothesis in a study with a more reliable measure. This follows directly from the effect of random error on the standard deviation in the denominator of the formula. So how can it be a fallacy to assume something that can be deduced from a mathematical formula? Maybe the authors are not talking about t-values.
It is understandable, then, that many researchers have the intuition that if they manage to achieve statistical significance under noisy conditions, the observed effect would have been even larger in the absence of noise. As with the runner, they assume that without the burden—that is, uncontrolled variation—their effects would have been even larger.
Although this statement makes it clear that the authors are not talking about t-values, it is not clear why researchers should have the intuition that a study with a more reliable measure should produce larger effect sizes. As shown above, random measurement error adds to the variability of observations, but it has no systematic effect on the mean difference or regression coefficient.
Now the authors introduce a second source of bias. Unlike random measurement error, this error is systematic and can lead to inflated estimates of effect sizes.
The reasoning about the runner with the backpack fails in noisy research for two reasons. First, researchers typically have so many “researcher degrees of freedom”—unacknowledged choices in how they prepare, analyze, and report their data—that statistical significance is easily found even in the absence of underlying effects and even without multiple hypothesis testing by researchers. In settings with uncontrolled researcher degrees of freedom, the attainment of statistical significance in the presence of noise is not an impressive feat.
The main reason for inferential statistics is to generalize results from a sample to a wider population. The problem with these inductive inferences is that results vary from sample to sample. This variation is called sampling error. Sampling error is separate from measurement error: even studies with perfect measures have sampling error, and sampling error shrinks with the square root of the sample size (2/sqrt(N) standard deviations for a comparison of two equal-sized groups). Sampling error alone is again unbiased. It can produce larger mean differences or smaller mean differences. However, if studies are split into significant and nonsignificant studies, mean differences of significant results are inflated and mean differences of nonsignificant results are deflated estimates of the population mean difference. So, effect size estimates in studies that are selected for significance are inflated. This is true even in studies with reliable measures.
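The inflation produced by selecting significant results can be sketched without simulation. The following assumes, purely for illustration, that the estimated mean difference is normally distributed around the true difference, that "significant" means the estimate exceeds 2 sampling errors in the positive direction, and that the true difference equals one sampling error; the function name is hypothetical.

```python
from statistics import NormalDist

def mean_of_significant_estimates(true_diff, se, t_crit=2.0):
    """Expected estimate among results significant in the positive direction.

    Assumes estimate ~ Normal(true_diff, se) and that a result counts as
    significant when estimate > t_crit * se (negative results are ignored).
    """
    z = NormalDist()
    cutoff = t_crit * se
    a = (cutoff - true_diff) / se        # cutoff in standard units
    prob_significant = 1 - z.cdf(a)      # share of studies that pass the filter
    # Mean of a normal distribution truncated from below at the cutoff.
    conditional_mean = true_diff + se * z.pdf(a) / prob_significant
    return conditional_mean, prob_significant

cond_mean, prob = mean_of_significant_estimates(true_diff=1.0, se=1.0)
print(round(cond_mean, 2), round(prob, 2))
```

Under these assumptions only about one study in six passes the filter, and the studies that pass overestimate the true difference by a factor of about 2.5 on average, with no measurement error involved.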
In a study with noisy measurements and small or moderate sample size, standard errors will be high and statistically significant estimates will therefore be large, even if the underlying effects are small.
To give an example, assume there were a height difference of 1 cm between brown-eyed and blue-eyed individuals. The standard deviation of height is 10 cm. A study with 400 participants has a sampling error of 10 cm / (sqrt(400) / 2) = 1 cm. To achieve significance, the observed mean difference has to be about twice as large as the sampling error (t = 2 ~ p = .05). Thus, a significant result requires a mean difference of 2 cm, which is 100% larger than the population mean difference in height.
Another researcher uses an unreliable measure (25% reliability) of height that quadruples the variance (100 cm^2 vs. 400 cm^2) and doubles the standard deviation (10 cm vs. 20 cm). The sampling error also doubles to 2 cm, and now a mean difference of 4 cm is needed to achieve significance with the same t-value of 2 as in the study with the perfect measure.
The mean difference is two times larger than before and four times larger than the mean difference in the population.
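The arithmetic of the height example can be checked with a short sketch. It uses the numbers from the text (N = 400, SD = 10 cm or 20 cm, true difference 1 cm, t = 2 as the significance criterion); the function names are illustrative.

```python
import math

def sampling_error(sd, n_total):
    """Sampling error of a mean difference between two equal-sized groups."""
    return 2 * sd / math.sqrt(n_total)

def min_significant_difference(sd, n_total, t_crit=2.0):
    """Smallest mean difference that reaches the t-value needed for significance."""
    return t_crit * sampling_error(sd, n_total)

TRUE_DIFFERENCE_CM = 1.0

for label, sd in [("perfect measure", 10.0), ("noisy measure", 20.0)]:
    se = sampling_error(sd, 400)
    needed = min_significant_difference(sd, 400)
    # Ratio of the just-significant difference to the true difference.
    print(label, se, needed, needed / TRUE_DIFFERENCE_CM)
```

The loop prints a sampling error of 1 cm and a required difference of 2 cm for the perfect measure, and 2 cm and 4 cm for the noisy measure.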
The fallacy would be to look at this difference of 4 cm and to believe that an even larger difference could have been obtained with a more reliable measure. This is a fallacy, but not for the reasons the authors suggest. The fallacy is to assume that random measurement error in the measure of height reduced the estimate of 4 cm and that an even bigger difference would be obtained with a more reliable measure. This is a fallacy because random measurement error does not influence the mean difference of 4 cm. Instead, it increased the standard deviation. With a more reliable measure, the standard deviation would be smaller (10 cm, for a sampling error of 1 cm), and the mean difference of 4 cm would have a t-value of 4 rather than 2, which is significantly stronger evidence for an effect.
How can the authors overlook that random measurement error has no influence on mean differences? The reason is that they do not clearly distinguish between standardized and unstandardized estimates of effect sizes.
Spearman famously derived a formula for the attenuation of observed correlations due to unreliable measurement.
Spearman’s formula applies to correlation coefficients and correlation coefficients are standardized measures of effect sizes because the covariance is divided by the standard deviations of both variables. Similarly Cohen’s d is a standardized coefficient because the mean difference is divided by the pooled standard deviation of the two groups.
Random measurement error clearly does influence standardized effect size estimates because the standard deviation is used to standardize effect sizes.
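As a rough illustration of attenuation, the sketch below applies the classical attenuation relationship to a correlation and to Cohen's d. It assumes classical test theory (observed variance = true variance / reliability) and measurement error in the outcome only for d; the function names and the correlation of .30 are hypothetical, while d = .10 and 25% reliability follow from the height example.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y=1.0):
    """Expected observed correlation given the reliabilities of both measures."""
    return true_r * math.sqrt(reliability_x * reliability_y)

def attenuated_d(true_d, reliability):
    """Expected observed Cohen's d when the outcome has the given reliability.

    The mean difference is unchanged; the SD grows by 1 / sqrt(reliability).
    """
    return true_d * math.sqrt(reliability)

print(attenuated_correlation(0.30, 0.25))  # hypothetical true correlation
print(attenuated_d(0.10, 0.25))            # 1 cm / 10 cm with 25% reliability
```

With 25% reliability the standard deviation doubles, so both standardized coefficients are cut in half.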
The true population mean difference of 1 cm divided by the population standard deviation of 10 cm yields a Cohen’s d = .10; that is, one-tenth of a standard deviation.
In the example, the mean difference for a just significant result with a perfect measure was 2 cm, which yields a Cohen’s d = 2 cm divided by 10 cm = .20, two-tenths of a standard deviation.
The mean difference for a just significant result with a noisy measure was 4 cm, which yields a standardized effect size of 4 cm divided by 20 cm = .20, also two-tenths of a standard deviation.
Thus, the inflation of the mean difference is proportional to the increase in the standard deviation. As a result, the standardized effect size is the same for the perfect measure and the unreliable measure.
Compared to the true mean difference of one-tenth of a standard deviation, the standardized effect sizes are both inflated by the same amount (d = .20 vs. d = .10, 100% inflation).
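The three standardized effect sizes above can be reproduced in a few lines. This is only a restatement of the arithmetic in the text; the variable names are illustrative.

```python
def cohens_d(mean_difference, sd):
    """Standardized mean difference: mean difference divided by the SD."""
    return mean_difference / sd

true_d = cohens_d(1.0, 10.0)     # population difference, perfect measure
d_perfect = cohens_d(2.0, 10.0)  # just significant result, perfect measure
d_noisy = cohens_d(4.0, 20.0)    # just significant result, noisy measure

# Relative inflation of the just-significant estimate over the true value.
inflation = (d_perfect - true_d) / true_d
print(true_d, d_perfect, d_noisy, f"{inflation:.0%}")
```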
This example shows the main point the authors are trying to make. Standardized effect size estimates are attenuated by random measurement error. At the same time, random measurement error increases sampling error and the mean difference has to be inflated to get significance. This inflation already corrects for the attenuation of standardized effect sizes, and any additional correction for unreliability with the Spearman formula would inflate effect size estimates rather than correct for attenuation.
This would have been a noteworthy observation, but the authors suggest that random measurement error can even have paradoxical effects on effect size estimates.
But in the small-N setting, this will not hold; the observed correlation can easily be larger in the presence of measurement error (see the figure, middle panel).
This statement is confusing because the most direct effect of measurement error on standardized effect sizes is attenuation. In the height example, any observed mean difference is divided by 20 rather than 10, reducing the standardized effect sizes by 50%. The variability of these standardized effect sizes is simply a function of sample size and therefore equal. Thus, it is not clear how a study with more measurement error can produce larger standardized effect sizes. As demonstrated above, the inflation produced by the significance filter at most compensates for the deflation due to random measurement error. There is simply no paradox whereby researchers obtain stronger evidence (larger t-values or larger standardized effect sizes) with noisier measures, even if results are selected for significance.
Our concern is that researchers are sometimes tempted to use the “iron law” reasoning to defend or justify surprisingly large statistically significant effects from small studies. If it really were true that effect sizes were always attenuated by measurement error, then it would be all the more impressive to have achieved significance.
This makes no sense. If random measurement error attenuates effect sizes, it cannot be used to justify surprisingly large mean differences. Either we are talking about unstandardized effect sizes, which are not influenced by measurement error, or we are talking about standardized effect sizes, which are attenuated by measurement error, so that obtaining large mean differences is surprising. If the true mean difference is 1 cm and an effect of 4 cm is needed to get significance with SD = 20 cm, it is surprising to get significance because the power to do so is only about 8% (it would be about 17% with the perfect measure and SD = 10 cm). Of course, it is only surprising if we knew that the population effect size is only 1 cm, but the main point is that we cannot use random measurement error to justify large effect sizes because random measurement error always attenuates standardized effect size estimates.
More confusing claims follow.
If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.
As explained above, random measurement error makes t-values weaker, not stronger. It therefore makes no sense to attribute strong t-values to random measurement error as a potential source of variance. The most likely explanation for strong effect sizes in studies with large sampling error is selection for significance, not random measurement error.
After all of these confusing claims the authors end with a key point.
A key point for practitioners is that surprising results from small studies should not be defended by saying that they would have been even better with improved measurement.
This is true because it is not a logical argument and not an argument researchers actually make. The bigger problem is that researchers do not realize that their significance filter makes it necessary to find moderate to large effects and that sampling error in small samples alone can produce these effect sizes, especially when questionable research practices are being used. No claims about hypothetically larger effect sizes are necessary or regularly made.
Next the authors simply make random statements about significance testing that reveal their ideological bias rather than adding to the understanding of tvalues.
It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high.
Of course, the t-value is a measure of the strength of evidence against the null hypothesis, typically the hypothesis that the data were obtained without a mean difference in the population. The larger the t-value, the less likely it is that the observed t-value could have been obtained without a population mean difference in the direction of the mean difference in the sample. And with t-values of 4 or higher, published results also have a high probability of producing a significant result in a replication study (Open Science Collaboration, 2015). It can be debated whether a t-value of 2 is weak, moderate, or strong evidence, but it is not debatable whether t-values provide information that can be used for inductive inferences. Even Bayes factors rely on t-values. So, the authors’ criticism of t-values makes little sense from any statistical perspective.
It is also a mistake to assume that the observed effect size would have been even larger if not for the burden of measurement error. Intuitions that are appropriate when measurements are precise are sometimes misapplied in noisy and more probabilistic settings.
Once more these broad claims are false and misleading. Everything else equal, estimates of standardized effect sizes are attenuated by random measurement error and would be larger if a more reliable measure had been used. Once selection for significance is present, the inflation introduced by selection for significance inflates standardized effect size estimates for perfect measures and it starts to disattenuate standardized effect size estimates with unreliable measures.
In the end, the authors try to link their discussion of random measurement error to the replication crisis.
The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin. We would all run faster without a backpack on our backs. But when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger.
This is confusing. Replicability is a function of power, and power is a function of the population mean difference and the sampling error of the design of a study. Random measurement error increases sampling error, which reduces standardized effect sizes, power, and replicability. As a result, studies with unreliable measures are less likely to produce significant results in original studies and in replication studies.
The only reason for surprising replication failures (e.g., 100% significant original studies and 25% significant replication studies for social psychology; OSC, 2015) is questionable practices that inflate the percentage of significant results in original studies. It is irrelevant whether the original result was produced with a small population mean difference and a reliable measure or with a moderate population mean difference and an unreliable measure. It only matters how strong the mean difference is for the measure that was used. That is, replicability is the same for a height difference of 1 cm with a perfect measure and a standard deviation of 10 cm as for a height difference of 2 cm with a noisy measure and a standard deviation of 20 cm. However, the chance of obtaining a significant result in a study if the mean difference is 1 cm and the SD is 20 cm is lower because the noisy measure reduces the standardized effect size to Cohen’s d = 1 cm / 20 cm = 0.05.
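The comparison of the three scenarios can be sketched with an approximate power calculation. The sketch assumes N = 400 (carried over from the earlier height example), two equal-sized groups, a two-sided test at alpha = .05, and a normal approximation to the t-test; the function name is illustrative.

```python
from statistics import NormalDist

def power_two_groups(mean_difference, sd, n_total, alpha=0.05):
    """Approximate two-sided power via a normal approximation to the t-test."""
    z = NormalDist()
    se = 2 * sd / n_total ** 0.5          # sampling error of the mean difference
    ncp = mean_difference / se            # expected t-value (noncentrality)
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability of a significant result in either direction.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

power_small_reliable = power_two_groups(1.0, 10.0, 400)  # d = .10
power_moderate_noisy = power_two_groups(2.0, 20.0, 400)  # d = .10
power_small_noisy = power_two_groups(1.0, 20.0, 400)     # d = .05

print(round(power_small_reliable, 2),
      round(power_moderate_noisy, 2),
      round(power_small_noisy, 2))
```

Under these assumptions the first two scenarios have identical power (roughly 17%), while the small difference measured with the noisy instrument has roughly 8%.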
Conclusion
Loken and Gelman wrote a very confusing article about measurement error. Although confusion about statistics is the norm among social scientists, it is surprising that a statistician has problems explaining basic statistical concepts and how they relate to the outcome of original and replication studies.
The most probable explanation for the confusion is that the authors seem to believe that the combination of random measurement error and large sampling error creates a novel problem that has been overlooked.
Measurement error and selection bias thus can combine to exacerbate the replication crisis.
In the large-N scenario, adding measurement error will almost always reduce the observed correlation. Take these scenarios and now add selection on statistical significance… for smaller N, a fraction of the observed effects exceeds the original.
If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.
“Of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small”
The quotes suggest that the authors believe something extraordinary is happening in studies with large random measurement error and small samples. However, this is not the case. Random measurement error attenuates t-values and selection for significance inflates them, and these two effects are independent. There is no evidence to suggest that random measurement error suddenly inflates effect size estimates in small samples, with or without selection for significance.
Recommendations for Researchers
It is also disconcerting that the authors fail to give recommendations for how researchers can avoid fallacies, even though such recommendations have been made before and would easily fix the problems associated with the interpretation of effect sizes in studies with noisy measures and small samples.
The main problem in noisy studies is that point estimates of effect sizes are not a meaningful statistic. This is not necessarily a problem. Many exploratory studies in psychology aim to examine whether there is an effect at all and whether this effect is positive or negative. A statistically significant result only allows researchers to infer that a positive or negative effect contributed to the outcome of the study (because the extreme t-value falls into a range of values that are unlikely without an effect). So, conclusions should be limited to discussion of the sign of the effect.
Unfortunately, psychologists have misinterpreted Jacob Cohen’s work and started to interpret standardized coefficients like correlation coefficients or Cohen’s d that they observed in their samples. To make matters worse these coefficients are sometimes called observed effect sizes, as in the article by Loken and Gelman.
This might have been a reasonable term for trained statisticians, but for poorly trained psychologists it suggested that this number tells them something about the magnitude of the effect they were studying. After all, this seems a reasonable interpretation of the term “observed effect size.” They then used Cohen’s book to interpret these values as evidence that they obtained a small, moderate, or large effect. In small studies, the effects have to be moderate (2 groups, n = 20, p = .05 => d = .64) to reach significance.
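The figure of d = .64 for two groups of n = 20 can be reproduced with a short sketch. The critical t-value for df = 38 (two-sided, alpha = .05) is hardcoded from a t-table as an assumption rather than computed, and the confidence interval uses the simple approximation SE(d) = sqrt(2/n), which ignores the smaller term that depends on d itself.

```python
import math

T_CRIT_DF38 = 2.024  # assumed table value: two-sided .05 critical t, df = 38

def min_significant_d(n_per_group, t_crit):
    """Smallest Cohen's d that reaches significance with two equal groups."""
    return t_crit * math.sqrt(2 / n_per_group)

n = 20
d_min = min_significant_d(n, T_CRIT_DF38)

# Approximate 95% CI around the just-significant estimate.
se_d = math.sqrt(2 / n)
ci_low = d_min - T_CRIT_DF38 * se_d
ci_high = d_min + T_CRIT_DF38 * se_d

print(round(d_min, 2), round(ci_low, 2), round(ci_high, 2))
```

Under this approximation the interval around a just-significant d of .64 runs from zero to about 1.28, so the data are consistent with anything from a negligible to a very large effect.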
However, Cohen explicitly warned against this use of effect sizes. He developed standardized effect size measures to help researchers plan studies that can provide meaningful tests of hypotheses. A small effect size requires a large sample. If researchers think an effect is small, they shouldn’t run a study with 40 participants because the study is so noisy that it is likely to fail. So, standardized effect sizes were intended to be assumptions about unobservable population parameters.
However, psychologists ignored Cohen’s guidelines for the planning of studies. Instead they used his standardized effect sizes to examine how strong the “observed effects” in their studies were. The misinterpretation of Cohen is partially responsible for the replication crisis because researchers ignored the significance filter and were happy to report that they consistently observed moderate to large effect sizes.
However, they also consistently observed replication failures in their labs. This was puzzling because moderate to large effects should be easy to replicate. However, without training in statistics, social psychologists found an explanation for this variability of observed effect sizes as well: surely, the variability in observed effect sizes (!) from study to study meant that their results were highly dependent on context. I still remember joking with some other social psychologists that effects even depended on the color of research assistants’ shirts. Only after reading Cohen did I understand what was really happening. In studies with large sampling error, the “observed effect sizes” move around a lot because they are not observations of effects. Most of the variation in mean differences from study to study is purely random sampling error.
At the end of his career, Cohen seemed to have lost faith in psychology as a science. He wrote a dark and sarcastic article titled “The Earth is Round, p < .05.” In this article, he proposes a simple solution for misinterpretation of “observed effect sizes” in small samples. The abstract of this article is more informative and valuable than Loken and Gelman’s entire article.
Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods is suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences,
The key point is that any sample statistic like an “effect size estimate” (not an observed effect size) has to be considered in the context of the precision of the estimate. Nobody would take seriously a public opinion poll of 40 respondents that gave a candidate a 55% chance of winning an election if the result were reported together with the information that the 95% CI ranges from 40% to 70%.
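The interval in the poll example can be approximated with a simple calculation. The sketch assumes that the 55% is a sample proportion among the 40 respondents and uses the basic normal-approximation (Wald) interval; the function name is illustrative.

```python
import math

def proportion_ci(p_hat, n, z_crit=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_crit * se, p_hat + z_crit * se

low, high = proportion_ci(0.55, 40)
print(f"{low:.0%} to {high:.0%}")
```

With only 40 respondents the interval is about 15 percentage points wide on each side, which matches the 40% to 70% range in the example.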
The same is true for tons of articles that reported effect size estimates without confidence intervals. For studies with just significant results this is not a problem for inferences about the sign of an effect because significance translates into a confidence interval that does not contain the value specified by the null hypothesis, typically zero. For a just significant result this means that the boundary of the CI is close to zero. So, researchers are justified in interpreting the result as evidence about the sign of an effect, but the effect size is uncertain. Nobody would rush to buy stocks in a drug company if it reported that its new drug extended life expectancy by somewhere between 1 day and 3 years. But if we are misled into focusing on an observed effect size of 1.5 years, we might be foolish enough to invest in the company and lose some money.
In short, noisy studies with unreliable measures and wide confidence intervals cannot be used to make claims about effect sizes. The reporting of standardized effect size measures can be useful for meta-analysis or to help future researchers in the planning of their studies, but researchers should never interpret their point estimates as observed effect sizes.
Final Conclusion
Although mathematics and statistics are fundamental for all quantitative, empirical sciences, each scientific discipline has its own history, terminology, and unique challenges. Political science differs from psychology in many ways. On the one hand, political science has access to large representative samples because there is a lot of interest in this kind of data and a lot of money is spent on collecting it. These data make it possible to obtain relatively precise estimates. The downside is that many data are unique to a historic context. The 2016 election in the United States cannot be replicated.
Psychology is different. Research budgets and ethics often limit sample sizes. However, within-subject designs with many repeated measures can increase power, something political scientists cannot do. In addition, studies in psychology can be replicated because the results are less sensitive to a particular historic context (and yes, there are many replicable findings in psychology that generalize across time and culture).
Gelman knows about as much about psychology as I know about political science. Maybe his article is more useful for political scientists, but psychologists would be better off if they finally recognized the important contribution of one of their own methodologists.
To paraphrase Cohen: Sometimes reading less is more, except for Cohen.
“It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high” (Loken and Gelman)
Ulrich Schimmack
Would you say that there is no meaningful difference between a z-score of 2 and a z-score of 4? These z-scores are significantly different from each other. Why would we not say that a study with a z-score of 4 provides stronger evidence for an effect than a study with a z-score of 2?