Richard D. Morey, Rink Hoekstra, Jeffrey N. Rouder, Michael D. Lee, and Eric-Jan Wagenmakers (2016), henceforth psycho-Bayesians, have a clear goal. They want psychologists to change the way they analyze their data.

Although this goal motivates the flood of method articles by this group, the most direct attack on other statistical approaches is made in the article “The fallacy of placing confidence in confidence intervals.” In this article, the authors claim that everybody, including textbook writers in statistics, misunderstood Neyman’s classic article on interval estimation. What are the prior odds that, after 80 years, a group of psychologists has discovered a fundamental flaw in the interpretation of confidence intervals (H1) versus a few psychologists being either unable or unwilling to understand Neyman’s article (H2)?

Underlying this quest for change in statistical practices lies the ultimate attribution error that Fisher’s p-values or Neyman-Pearson significance testing, with or without confidence intervals, are responsible for the replication crisis in psychology (Wagenmakers et al., 2011).

This is an error because numerous articles have argued and demonstrated that questionable research practices undermine the credibility of the psychological literature. The unprincipled use of p-values (undisclosed multiple testing), also called p-hacking, means that many statistically significant results have inflated error rates and the long-run probabilities of false positives are not 5%, as stated in each article, but could be 100% (Rosenthal, 1979; Sterling, 1959; Simmons, Nelson, & Simonsohn, 2011).
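To make the inflation of error rates concrete, here is a minimal sketch. It assumes a simplified scenario that is not taken from any of the cited articles: a researcher runs several independent tests of true null hypotheses and reports the result if at least one of them is significant. The function name `familywise_error_rate` is made up for this illustration.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which is a simplification of real undisclosed multiple testing.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # The nominal 5% rate only holds for a single, pre-planned test.
    for k in (1, 5, 20, 100):
        print(f"{k:>3} tests: {familywise_error_rate(k):.3f}")
```

Under these assumptions the long-run false positive rate moves away from the nominal 5% as soon as more than one test is run, and it approaches 100% when the number of undisclosed tests is large.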

You will not find a single article by Psycho-Bayesians that will acknowledge the contribution of unprincipled use of p-values to the replication crisis. The reason is that they want to use the replication crisis as a vehicle to sell Bayesian statistics.

It is hard to believe that classic statistics are fundamentally flawed and misunderstood, because they are used in industry to produce smartphones and other technology that requires tight error control in mass production. Nevertheless, this article claims that everybody misunderstood Neyman’s seminal article on confidence intervals.

The authors claim that Neyman wanted us to compute confidence intervals only before we collect data, but warned readers that confidence intervals provide no useful information after the data are collected.

*Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative”*

This is utter nonsense. Of course Neyman was asking us to interpret confidence intervals after we have collected data, because we need a sample to compute a confidence interval. It is hard to believe that this could have passed peer review in a statistics journal, and it is not clear who was qualified to review this paper for Psychonomic Bullshit Review.

The way the psycho-statisticians use Neyman’s quote is unscientific because they omit the context and the statements that follow. In fact, Neyman was arguing against Bayesian attempts to estimate probabilities that can be applied to a single event.

*It is important to notice that for this conclusion to be true, it is not necessary that the problem of estimation should be the same in all the cases. For instance, during a period of time the statistician may deal with a thousand problems of estimation and in each the parameter M to be estimated and the probability law of the X’s may be different. As far as in each case the functions L and U are properly calculated and correspond to the same value of alpha, his steps (a), (b), and (c), though different in details of sampling and arithmetic, will have this in common—the probability of their resulting in a correct statement will be the same, alpha. Hence the frequency of actually correct statements will approach alpha. It will be noticed that in the above description the probability statements refer to the problems of estimation with which the statistician will be concerned in the future. In fact, I have repeatedly stated that the frequency of correct results tend to alpha.*

*Consider now the case when a sample, S, is already drawn and the calculations have given, say, L = 1 and U = 2. Can we say that in this particular case the probability of the true value of M falling between 1 and 2 is equal to alpha? The answer is obviously in the negative.*

*The parameter M is an unknown constant and no probability statement concerning its value may be made, that is except for the hypothetical and trivial ones P{1 < M < 2} = 1 if 1 < M < 2, or 0 if either M < 1 or 2 < M, which we have decided not to consider.*

The full quote makes it clear that Neyman is considering the problem of quantifying the probability that a population parameter is in a specific interval and dismisses it as trivial because it doesn’t solve the estimation problem. We don’t even need to observe data and compute a confidence interval. The statement that a specific unknown number lies between two other numbers (1 and 2) is either TRUE (P = 1) or FALSE (P = 0). To imply that this trivial observation leads to the conclusion that we cannot make post-data inferences based on confidence intervals is ridiculous.

Neyman continues.

*The theoretical statistician [constructing a confidence interval] may be compared with the organizer of a game of chance in which the gambler has a certain range of possibilities to choose from while, whatever he actually chooses, the probability of his winning and thus the probability of the bank losing has permanently the same value, 1 – alpha. The choice of the gambler on what to bet, which is beyond the control of the bank, **corresponds to the uncontrolled possibilities of M having this or that value. The case in which the bank wins the game corresponds to the correct statement of the actual value of M. In both cases the frequency of “successes” in a long series of future “games” is approximately known.** On the other hand, if the owner of the bank, say, in the case of roulette, knows that in a particular game the ball has stopped at the sector No. 1, this information does not help him in any way to guess how the gamblers have betted. Similarly, once the boundaries of the interval are drawn and the values of L and U determined, the calculus of probability adopted here is helpless to provide answer to the question of what is the true value of M.*

What Neyman was saying is that population parameters are unknowable and remain unknown even after researchers compute a confidence interval. Moreover, the construction of a confidence interval doesn’t allow us to quantify the probability that an unknown value is within the constructed interval. This probability remains unspecified. Nevertheless, we can use the property of the long-run success rate of the method to place confidence in the belief that the unknown parameter is within the interval. This is common sense. If we place bets in roulette or other random events, we rely on long-run frequencies of winnings to calculate our odds of winning in a specific game.
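The long-run success rate can be checked with a small simulation. This is only a sketch under assumptions I chose for illustration: normally distributed data, a known standard deviation, and a made-up true mean of 0.3. The function `coverage_rate` is not from Neyman or from the fallacy article. Note that Neyman’s “alpha” in the quotes above is the confidence coefficient, which is called `level` here.

```python
import random
import statistics
from statistics import NormalDist


def coverage_rate(true_mean, sigma, n, num_studies, level=0.95, seed=1):
    """Share of simulated studies whose interval contains the true mean.

    Each study draws n normal observations and builds the interval
    sample mean +/- z * sigma / sqrt(n), treating sigma as known.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(num_studies):
        sample_mean = statistics.fmean(
            rng.gauss(true_mean, sigma) for _ in range(n)
        )
        # In any single study this check is simply True or False.
        if sample_mean - half_width < true_mean < sample_mean + half_width:
            hits += 1
    return hits / num_studies


coverage = coverage_rate(true_mean=0.3, sigma=1.0, n=25, num_studies=10_000)
print(f"Long-run coverage: {coverage:.3f}")
```

Each individual interval either contains the true mean or it does not, yet the proportion of intervals that do should settle close to the chosen confidence level. That proportion is the property we rely on when we interpret a single interval.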

It is absurd to suggest that Neyman himself argued that confidence intervals provide no useful information after data are collected because the computation of a confidence interval requires a sample of data. That is, while the width of a confidence interval can be determined a priori before data collection (e.g. in precision planning and power calculations), the actual confidence interval can only be computed based on actual data because the sample statistic determines the location of the confidence interval.
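The distinction between width and location can be sketched in a few lines. The example assumes a normal model with a known standard deviation, which is a simplification. The helper names `planned_half_width` and `sample_size_for_half_width` are hypothetical and only serve this illustration.

```python
import math
from statistics import NormalDist


def planned_half_width(sigma, n, level=0.95):
    """Half-width of the interval for a mean, known before any data exist."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sigma / math.sqrt(n)


def sample_size_for_half_width(sigma, target_half_width, level=0.95):
    """Smallest n whose planned half-width does not exceed the target."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.ceil((z * sigma / target_half_width) ** 2)


# Planning step: no data are needed to fix the width.
n_needed = sample_size_for_half_width(sigma=1.0, target_half_width=0.2)
print(n_needed, round(planned_half_width(1.0, n_needed), 3))

# The location is only known once a sample mean has been observed.
observed_mean = 0.45  # made-up value for illustration
half = planned_half_width(1.0, n_needed)
print(round(observed_mean - half, 3), round(observed_mean + half, 3))
```

The planning functions never see data, whereas the final two lines cannot be written down without an observed sample mean.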

Readers of this blog may face a dilemma. Why should they place confidence in another psycho-statistician? The probability that I am right is 1, if I am right and 0 if I am wrong, but this doesn’t help readers to adjust their beliefs in confidence intervals.

The good news is that they can use prior information. Neyman is widely regarded as one of the most influential figures in statistics. His methods are taught in hundreds of textbooks, and statistical software programs compute confidence intervals. Major advances in statistics have been new ways to compute confidence intervals for complex statistical problems (e.g., confidence intervals for standardized coefficients in structural equation models; MPLUS; Muthen & Muthen). What are the a priori chances that generations of statisticians misinterpreted Neyman and committed the fallacy of interpreting confidence intervals after data are obtained?

However, if readers need more evidence of the psycho-statisticians’ deceptive practices, it is important to point out that they omitted Neyman’s criticism of their favored approach, namely Bayesian estimation.

The fallacy article gives the impression that Neyman’s (1936) approach to estimation is outdated and should be replaced with more modern, superior approaches like Bayesian credibility intervals. For example, they cite Jeffreys’s (1961) Theory of Probability, which gives the impression that Jeffreys’s work followed Neyman’s work. However, an accurate representation of Neyman’s work reveals that Jeffreys’s work preceded Neyman’s and that Neyman discussed some of the problems with Jeffreys’s approach in great detail. Neyman’s critical article was even “communicated” by Jeffreys (these were different times, when scientists had open conflict with honor and integrity and actually engaged in scientific debates).

Given that Jeffreys’s approach was published just one year before Neyman’s (1936) article, Neyman’s article probably also offers the first thorough assessment of Jeffreys’s approach. Neyman first gives a thorough account of Jeffreys’s approach (those were the days).

Neyman then offers his critique of Jeffreys’s approach.

*It is known that, as far as we work with the conception of probability as adopted in this paper, the above theoretically perfect solution may be applied in practice only in quite exceptional cases, and this for two reasons.*

Importantly, he does not challenge the theory. He only points out that the theory is not practical because it requires knowledge that is often not available. That is, to estimate the probability that an unknown parameter is within a specific interval, we need to make prior assumptions about unknown parameters. This is the problem that has plagued subjective Bayesian approaches.

Neyman then discusses Jeffreys’s approach to solving this problem. I do not claim to have the statistical expertise to decide whether Neyman or Jeffreys is right. Even statisticians have been unable to resolve these issues, and I believe the consensus is that Bayesian credibility intervals and Neyman’s confidence intervals are both mathematically viable approaches to interval estimation with different strengths and weaknesses.

I am only trying to point out to unsuspecting readers of the fallacy article that both approaches are as old as statistics and that the presentation of the issue in this article is biased and violates my personal, and probably idealistic, standards of scientific integrity. Using a selective quote by Neyman to dismiss confidence intervals and then omitting Neyman’s critique of Bayesian credibility intervals is deceptive and shows an unwillingness or inability to engage in open scientific examination of arguments for and against different estimation methods.

It is sad and ironic that Wagenmakers’ efforts to convert psychologists into Bayesian statisticians are similar to Bem’s (2011) attempt to convert psychologists into believers in parapsychology, or at least in parapsychology as a respectable science. While Bem fudged data to show false empirical evidence, Wagenmakers is misrepresenting the way classic statistics works and ignoring the key problem of Bayesian statistics, namely that Bayesian inferences are contingent on prior assumptions that can be gamed to show what a researcher wants to show. Wagenmakers used this flexibility in Bayesian statistics to suggest that Bem (2011) presented weak evidence for extra-sensory perception. However, a rebuttal by Bem showed that Bayesian statistics also showed support for extra-sensory perception with different and more reasonable priors. Thus, Wagenmakers et al. (2011) were simply wrong to suggest that Bayesian methods would have prevented Bem from providing strong evidence for an incredible phenomenon.

The problem with Bem’s article is not the way he “analyzed” the data. The problem is that Bem violated basic principles of science that are required to draw valid statistical inferences from data. It would be a miracle if Bayesian methods that assume unbiased data could correct for data falsification. The problem with Bem’s data has been revealed using statistical tools for the detection of bias (Francis, 2012; Schimmack, 2012, 2015, 2018). There has been no rebuttal from Bem, and he admits to the use of practices that invalidate the published p-values. So, the problem is not the use of p-values, confidence intervals, or Bayesian statistics. The problem is abuse of statistical methods. There are few cases of abuse of Bayesian methods simply because they are used rarely. However, Bayesian statistics can be gamed without data fudging by specifying convenient priors and failing to inform readers about the effect of priors on results (Gronau et al., 2017).
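How much a Bayes factor can depend on the prior is easy to illustrate. The sketch below uses a deliberately simple normal model and made-up numbers; it is not a reanalysis of Bem’s data and it is not the analysis used by Wagenmakers et al. or by Bem’s rebuttal. The function `bayes_factor_10` and the values of `estimate`, `standard_error`, and `prior_sd` are assumptions for this example only.

```python
from statistics import NormalDist


def bayes_factor_10(estimate, standard_error, prior_sd):
    """Bayes factor for H1 over H0 in a normal model with known error.

    H0: effect = 0.  H1: effect ~ Normal(0, prior_sd).
    Under H1 the estimate is marginally Normal(0, sqrt(se^2 + prior_sd^2)).
    """
    marginal_sd_h1 = (standard_error ** 2 + prior_sd ** 2) ** 0.5
    density_h1 = NormalDist(0.0, marginal_sd_h1).pdf(estimate)
    density_h0 = NormalDist(0.0, standard_error).pdf(estimate)
    return density_h1 / density_h0


# Hypothetical result: estimate 0.25 with standard error 0.10 (z = 2.5).
for prior_sd in (0.1, 0.3, 1.0, 3.0):
    bf = bayes_factor_10(estimate=0.25, standard_error=0.10, prior_sd=prior_sd)
    print(f"prior sd {prior_sd:>3}: BF10 = {bf:.2f}")
```

With the same hypothetical data, a prior that expects small effects favors H1, while a very wide prior can push the Bayes factor below 1 and favor the null. Readers can only judge such a result if the prior and its influence are reported.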

In conclusion, it is not a fallacy to interpret confidence intervals as a method for interval estimation of unknown parameters. It would be a fallacy to cite Morey et al.’s article as a valid criticism of confidence intervals. This does not mean that Bayesian credibility intervals are bad or could not be better than confidence intervals. It only means that this article is so blatantly biased and dogmatic that it does not add to the understanding of Neyman’s or Jeffreys’s approach to interval estimation.

P.S. More discussion of the article can be found on Gelman’s blog.

Andrew Gelman himself comments:

*My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% confidence interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic.*

I have to admit some *Schadenfreude* when I see one Bayesian attacking another Bayesian for the use of an ill-informed prior. While Bayesians are still fighting over the right priors, practical researchers may be better off to use statistical methods that do not require priors, like, hm, confidence intervals?
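Gelman’s numbers can be reproduced approximately with a normal approximation on the log odds ratio scale. This is a sketch under my own assumptions, not Gelman’s calculation: the flat prior is taken to be flat on the log odds ratio, and the informative prior, Normal(0, 0.5) on the log odds ratio, is an arbitrary choice for illustration. The function `posterior_interval` is hypothetical.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)

# Recover the estimate and standard error on the log scale from the
# reported odds ratio of 3.0 and the interval [1.1, 8.2].
log_or = math.log(3.0)
se = (math.log(8.2) - math.log(1.1)) / (2 * Z95)


def posterior_interval(estimate, se, prior_sd=None, prior_mean=0.0):
    """95% posterior interval on the odds ratio scale.

    Normal likelihood with a normal prior on the log odds ratio.
    prior_sd=None stands for a flat prior (zero prior precision).
    """
    data_precision = 1.0 / se ** 2
    prior_precision = 0.0 if prior_sd is None else 1.0 / prior_sd ** 2
    post_var = 1.0 / (data_precision + prior_precision)
    post_mean = post_var * (
        data_precision * estimate + prior_precision * prior_mean
    )
    half = Z95 * post_var ** 0.5
    return math.exp(post_mean - half), math.exp(post_mean + half)


flat_low, flat_high = posterior_interval(log_or, se)
shrunk_low, shrunk_high = posterior_interval(log_or, se, prior_sd=0.5)
print(f"flat prior:        [{flat_low:.2f}, {flat_high:.2f}]")
print(f"informative prior: [{shrunk_low:.2f}, {shrunk_high:.2f}]")
```

Under the flat prior the posterior interval coincides with the confidence interval, which is Gelman’s point. The informative prior pulls the interval toward an odds ratio of 1, and a different prior would pull it somewhere else.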

P.P.S. Science requires trust. At some point, we cannot check all assumptions. I trust Neyman, Cohen, and Muthen and Muthen’s confidence intervals in MPLUS.

I’ve been reading your P-curve articles and thoughts on the replicability crisis for a while now, and appreciate your work. However, I think this article is off the mark. It disappoints me in several ways.

First, the overall tone. You pine for the days when “scientists had open conflict with honor and integrity and actually engaged in scientific debates” yet begin by calling Morey et al. psycho-Bayesians. Am I correct to infer you did not check whether your reading of their article matched theirs? See below, but I think you’ve misread. But maybe we have different drafts, because…

Second, in trying to check your reading of their article, I cannot find the bit you claim was “utter nonsense”. I searched for “Post-data” and “advertised”. Nothing.

I’m using: “http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf” and article by the same name, which I found by following the Gelman link you provided.

Third, your penultimate paragraph also seems to miss the entire point of a Bayesian approach, so I suspect there is some history driving what I take to be an odd mis-reading.

::: What I _think_ is going on :::::::

Assuming I have the right article etc., and disclaiming I read their article only briefly (esp. as it might be the wrong one).

(1) They say CIs do not give parameter estimates, and are not designed to. You seem to AGREE. Now, they think people should estimate parameters, and you think they’re unknowable, but that’s just the difference in Bayes/freq approaches.

(2) You think they argue Neyman said you can’t use CIs after the data are gathered.

That would be absurd. So much so I’m betting they didn’t say it in your draft either. I think they said (1). But .. and I’m guessing here .. because you’re not interested in estimating, I think you read much more into what they said.

(3) You note that Bayes alone won’t solve replicability. I think that’s right, but a separate point almost entirely unrelated to this harangue.

(4) They note many practitioners misinterpret CIs, and ascribe to them inferential properties they do not possess, and that this *has* harmed scientific inference in ways relevant to the replicability crisis. I think that’s been demonstrated repeatedly since the 1950s.

(5) I suspect like a friend of mine, you really know your frequentist stats. My friend was totally unsurprised by “Dance of the p-values” because he knows the chi-square distribution for replicability at any alpha level, and wondered why anyone would point this out. I think he can’t get his head around people actually getting this stuff wrong. But there’s decades of literature showing published authors get this way wrong, and quite likely it *does* contribute to the replicability crisis.

(6) Repeating (3): but hacking can / will happen in any paradigm. Bayes may be more robust (as I like to think), or it may simply have other failure modes. I think Wasserman points out that while people misinterpret CIs as parameter estimates, people also misinterpret Bayesian credible intervals as long-run frequencies.

I hope at least half my claims were correct, and that you can help me find any errors.

Dear Charles,

Thank you for your feedback. I may be wrong, but I think we do need a discussion about this article. If it is correct, it is a fundamental milestone in statistics with far-reaching implications. This follows directly from the claim by the authors that we should abandon confidence intervals, just when they are being reported more frequently in psychology (see reporting of results in registered replication reports), in favor of credibility intervals.

I hope I didn’t misread the article when I believe the authors are suggesting confidence intervals are useless and do not provide the information that we think they provide.

Maybe we can just start here with trying to figure out possible misunderstandings.

P.S. Dear Charles,

“It may seem strange to the modern user of CIs, but Neyman is quite clear that CIs do not support any sort of reasonable belief about the parameter.”

How do you interpret this statement?

P.P.S.

“If confidence procedures do not allow an assessment of the probability that an interval contains the true value, if they do not yield measures of precision, and if they do not yield assessments of the likelihood or plausibility of parameter values, then what are they?”

Translate: We have shown that

– confidence procedures do not allow an assessment of the probability that an interval contains the true value

– CI do not yield measures of precision

– CI do not yield assessments of the likelihood or plausibility of parameter values

If this were true, what implications would that have for 80 years of statistics and interpretations of results based on CIs?

Can I have a link to the paper as you read it? I do not find those quotes in http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf, which is by those authors and has the right title.

“As we have seen, according to CI theory, what happens in step (c) is not a belief, a conclusion, or any sort of reasoning from the data.”

Does that make sense?

“Furthermore, a confidence interval is not associated with any level of uncertainty about whether θ is, actually, in the interval.”

Do you agree?

I had to check that this free copy is identical with the article. I believe it is.

https://learnbayes.org/papers/confidenceIntervalsFallacy/

Excellent post!

https://invertedlogicblog.wordpress.com/2019/05/27/philosophical-rants-20-the-gamblers-fallacy/

Thanks, interesting post.

I agree that we must collect data before calculating a CI.

I have two questions, though. Say that you in a large random sample estimate a mean to be 8.5 (95% CI 5.0 – 12.0) and that we assume there are no systematic errors and that the underlying assumptions are valid.

Based on the CI, are there any legit inferences one can draw in relation to the true mean?

And are there any legit inferences in relation to future estimations?

Thanks in advance.

If we don’t have access to any other information, or there simply is no other information, we can state with a (long-run) error frequency of no more than 5% that the effect size is likely to be greater than 5 and less than 12. We can narrow this interval by increasing the error probability or widen it by lowering it.
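The trade-off between interval width and error probability can be sketched with the numbers from the question (mean 8.5, 95% CI 5.0 to 12.0). The sketch assumes a large-sample normal approximation, so the standard error is recovered from the reported 95% interval. The function `interval_at_level` is a hypothetical helper.

```python
from statistics import NormalDist


def interval_at_level(estimate, lower95, upper95, level):
    """Rescale a reported 95% interval to another confidence level.

    Assumes the 95% interval was built as estimate +/- z * standard error.
    """
    z95 = NormalDist().inv_cdf(0.975)
    standard_error = (upper95 - lower95) / (2 * z95)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * standard_error, estimate + z * standard_error


for level in (0.80, 0.95, 0.99):
    low, high = interval_at_level(8.5, 5.0, 12.0, level)
    print(f"{level:.0%} interval: [{low:.2f}, {high:.2f}]")
```

Accepting a 20% long-run error rate gives a narrower interval than the reported one, while demanding a 1% error rate gives a wider one.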
