Richard D. Morey, Rink Hoekstra, Jeffrey N. Rouder, Michael D. Lee, and Eric-Jan Wagenmakers (2016), henceforth the psycho-Bayesians, have a clear goal. They want psychologists to change the way they analyze their data.

Although this goal motivates the flood of method articles by this group, the most direct attack on other statistical approaches is made in the article “The fallacy of placing confidence in confidence intervals.” In this article, the authors claim that everybody, including textbook writers in statistics, has misunderstood Neyman’s classic article on interval estimation. What are the prior odds that, after 80 years, a group of psychologists has discovered a fundamental flaw in the interpretation of confidence intervals (H1), versus the alternative that a few psychologists are unable or unwilling to understand Neyman’s article?

Underlying this quest for change in statistical practices lies the ultimate attribution error: the belief that Fisher’s p-values or Neyman-Pearson significance testing, with or without confidence intervals, are responsible for the replication crisis in psychology (Wagenmakers et al., 2011).

This is an error because numerous articles have argued and demonstrated that questionable research practices undermine the credibility of the psychological literature. The unprincipled use of p-values (undisclosed multiple testing), also called p-hacking, means that many statistically significant results have inflated error rates: the long-run probability of a false positive is not 5%, as stated in each article, but can approach 100% (Sterling, 1959; Rosenthal, 1979; Simmons, Nelson, & Simonsohn, 2011).
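The arithmetic behind this inflation is easy to check with a short simulation (my own illustration, not taken from any of the cited articles): if a researcher runs many tests under the null hypothesis and reports a result whenever any of them is significant, the chance of at least one false positive far exceeds the nominal 5%.

```python
import random

random.seed(1)

def one_study(n_tests=20, alpha=0.05):
    # Under the null hypothesis, each test is "significant" with probability alpha;
    # the study counts as a success if ANY of the tests crosses the threshold.
    return any(random.random() < alpha for _ in range(n_tests))

studies = 10_000
false_positive_rate = sum(one_study() for _ in range(studies)) / studies
# Theoretical rate: 1 - 0.95**20, roughly 0.64, far above the nominal 5%.
print(round(false_positive_rate, 2))
```

The numbers (20 hidden tests per study) are hypothetical, but the qualitative point holds for any amount of undisclosed multiple testing.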

You will not find a single article by the psycho-Bayesians that acknowledges the contribution of the unprincipled use of p-values to the replication crisis. The reason is that they want to use the replication crisis as a vehicle to sell Bayesian statistics.

It is hard to believe that classic statistics is fundamentally flawed and misunderstood when it is used in industry to produce smartphones and other technology whose mass production requires tight error control. Nevertheless, this article claims that everybody misunderstood Neyman’s seminal article on confidence intervals.

The authors claim that Neyman wanted us to compute confidence intervals only before we collect data, and that he warned readers that confidence intervals provide no useful information after the data are collected.

*Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative”*

This is utter nonsense. Of course Neyman was asking us to interpret confidence intervals after we collected data, because we need a sample to compute a confidence interval. It is hard to believe that this could have passed peer review in a statistics journal, and it is not clear who was qualified to review this paper for Psychonomic Bullshit Review.

The way the psycho-statisticians use Neyman’s quote is unscientific because they omit the context and the statements that follow it. In fact, Neyman was arguing against Bayesian attempts to estimate probabilities that can be applied to a single event.

*It is important to notice that for this conclusion to be true, it is not necessary that the problem of estimation should be the same in all the cases. For instance, during a period of time the statistician may deal with a thousand problems of estimation and in each the parameter M to be estimated and the probability law of the X’s may be different. As far as in each case the functions L and U are properly calculated and correspond to the same value of alpha, his steps (a), (b), and (c), though different in details of sampling and arithmetic, will have this in common—the probability of their resulting in a correct statement will be the same, alpha. Hence the frequency of actually correct statements will approach alpha. It will be noticed that in the above description the probability statements refer to the problems of estimation with which the statistician will be concerned in the future. In fact, I have repeatedly stated that the frequency of correct results tend to alpha.*

*Consider now the case when a sample, S, is already drawn and the calculations have given, say, L = 1 and U = 2. Can we say that in this particular case the probability of the true value of M falling between 1 and 2 is equal to alpha? The answer is obviously in the negative.*

*The parameter M is an unknown constant and no probability statement concerning its value may be made, that is, except for the hypothetical and trivial ones P{1 < M < 2} = 1 if 1 < M < 2, or 0 if either M < 1 or 2 < M, which we have decided not to consider.*

The full quote makes it clear that Neyman is considering the problem of quantifying the probability that a population parameter is in a specific interval and dismisses it as trivial because it doesn’t solve the estimation problem. We don’t even need to observe data and compute a confidence interval: the statement that a specific unknown number is between two other numbers (1 and 2) is either TRUE (P = 1) or FALSE (P = 0). To imply that this trivial observation leads to the conclusion that we cannot make post-data inferences based on confidence intervals is ridiculous.

Neyman continues.

*The theoretical statistician [constructing a confidence interval] may be compared with the organizer of a game of chance in which the gambler has a certain range of possibilities to choose from while, whatever he actually chooses, the probability of his winning and thus the probability of the bank losing has permanently the same value, 1 – alpha. The choice of the gambler on what to bet, which is beyond the control of the bank, corresponds to the uncontrolled possibilities of M having this or that value. The case in which the bank wins the game corresponds to the correct statement of the actual value of M. In both cases the frequency of “successes” in a long series of future “games” is approximately known. On the other hand, if the owner of the bank, say, in the case of roulette, knows that in a particular game the ball has stopped at the sector No. 1, this information does not help him in any way to guess how the gamblers have betted. Similarly, once the boundaries of the interval are drawn and the values of L and U determined, the calculus of probability adopted here is helpless to provide answer to the question of what is the true value of M.*

What Neyman was saying is that population parameters are unknowable and remain unknown even after researchers compute a confidence interval. Moreover, the construction of a confidence interval doesn’t allow us to quantify the probability that an unknown value is within the constructed interval. This probability remains unspecified. Nevertheless, we can use the property of the long-run success rate of the method to place confidence in the belief that the unknown parameter is within the interval. This is common sense. If we place bets in roulette or other random events, we rely on long-run frequencies of winnings to calculate our odds of winning in a specific game.

It is absurd to suggest that Neyman himself argued that confidence intervals provide no useful information after data are collected because the computation of a confidence interval requires a sample of data. That is, while the width of a confidence interval can be determined a priori before data collection (e.g. in precision planning and power calculations), the actual confidence interval can only be computed based on actual data because the sample statistic determines the location of the confidence interval.
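Neyman’s long-run coverage property is easy to demonstrate with a small simulation (my own sketch, assuming a normal mean with known sigma): every interval is computed after the data are in, and yet about 95% of the intervals contain the true parameter.

```python
import random
import statistics

random.seed(0)
true_mu, sigma, n, z = 10.0, 2.0, 25, 1.96  # z for a two-sided 95% interval

trials = 10_000
hits = 0
for _ in range(trials):
    # Each interval is computed from observed data; the 95% guarantee is a
    # property of the procedure across repeated samples.
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    m = statistics.fmean(sample)
    half = z * sigma / n ** 0.5  # half-width with sigma known
    hits += (m - half) <= true_mu <= (m + half)

print(hits / trials)  # close to 0.95
```

The location of each interval (the sample mean) can only be known post-data, which is exactly why the intervals are useful after data collection.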

Readers of this blog may face a dilemma. Why should they place confidence in another psycho-statistician? The probability that I am right is 1, if I am right and 0 if I am wrong, but this doesn’t help readers to adjust their beliefs in confidence intervals.

The good news is that they can use prior information. Neyman is widely regarded as one of the most influential figures in statistics. His methods are taught in hundreds of textbooks, and statistical software programs compute confidence intervals. Major advances in statistics have been new ways to compute confidence intervals for complex statistical problems (e.g., confidence intervals for standardized coefficients in structural equation models; MPLUS; Muthen & Muthen). What are the a priori chances that generations of statisticians misinterpreted Neyman and committed the fallacy of interpreting confidence intervals after data are obtained?

However, if readers need more evidence of the psycho-statisticians’ deceptive practices, it is important to point out that they omitted Neyman’s criticism of their favored approach, namely Bayesian estimation.

The fallacy article gives the impression that Neyman’s (1936) approach to estimation is outdated and should be replaced with more modern, superior approaches like Bayesian credibility intervals. For example, the authors cite Jeffreys’s (1961) Theory of Probability, which gives the impression that Jeffreys’s work followed Neyman’s. However, an accurate representation of Neyman’s work reveals that Jeffreys’s work preceded Neyman’s and that Neyman discussed some of the problems with Jeffreys’s approach in great detail. Neyman’s critical article was even “communicated” by Jeffreys (these were different times where scientists had open conflict with honor and integrity and actually engaged in scientific debates).

Given that Jeffreys’s approach was published just one year before Neyman’s (1936) article, Neyman’s article probably also offers the first thorough assessment of Jeffreys’s approach. Neyman first gives a thorough account of it (those were the days).

Neyman then offers his critique of Jeffreys’s approach.

*It is known that, as far as we work with the conception of probability as adopted in this paper, the above theoretically perfect solution may be applied in practice only in quite exceptional cases, and this for two reasons.*

Importantly, he does not challenge the theory. He only points out that the theory is not practical because it requires knowledge that is often not available. That is, to estimate the probability that an unknown parameter is within a specific interval, we need to make prior assumptions about unknown parameters. This is the problem that has plagued subjective Bayesian approaches.

Neyman then discusses Jeffreys’s approach to solving this problem. I do not claim the statistical expertise to decide whether Neyman or Jeffreys was right. Even statisticians have been unable to resolve these issues, and I believe the consensus is that Bayesian credibility intervals and Neyman’s confidence intervals are both mathematically viable approaches to interval estimation, with different strengths and weaknesses.

I am only trying to point out to unassuming readers of the fallacy article that both approaches are as old as statistics and that the presentation of the issue in this article is biased and violates my personal, and probably idealistic, standards of scientific integrity. Using a selective quote from Neyman to dismiss confidence intervals, and then omitting Neyman’s critique of Bayesian credibility intervals, is deceptive and shows an unwillingness or inability to engage in open scientific examination of the arguments for and against different estimation methods.

It is sad and ironic that Wagenmakers’s effort to convert psychologists into Bayesian statisticians is similar to Bem’s (2011) attempt to convert psychologists into believers in parapsychology, or at least in parapsychology as a respectable science. While Bem fudged data to show false empirical evidence, Wagenmakers misrepresents the way classic statistics works and ignores the key problem of Bayesian statistics, namely that Bayesian inferences are contingent on prior assumptions that can be gamed to show what a researcher wants to show. Wagenmakers used this flexibility in Bayesian statistics to suggest that Bem (2011) presented weak evidence for extra-sensory perception. However, a rebuttal by Bem showed that Bayesian statistics also supported extra-sensory perception under different and more reasonable priors. Thus, Wagenmakers et al. (2011) were simply wrong to suggest that Bayesian methods would have prevented Bem from providing strong evidence for an incredible phenomenon.

The problem with Bem’s article is not the way he “analyzed” the data. The problem is that Bem violated basic principles of science that are required to draw valid statistical inferences from data. It would be a miracle if Bayesian methods that assume unbiased data could correct for data falsification. The problem with Bem’s data has been revealed using statistical tools for the detection of bias (Francis, 2012; Schimmack, 2012, 2015, 2018). There has been no rebuttal from Bem, and he admits to the use of practices that invalidate the published p-values. So, the problem is not the use of p-values, confidence intervals, or Bayesian statistics. The problem is the abuse of statistical methods. There are few cases of abuse of Bayesian methods simply because they are rarely used. However, Bayesian statistics can be gamed without data fudging by specifying convenient priors and failing to inform readers about the effect of priors on results (Gronau et al., 2017).

In conclusion, it is not a fallacy to interpret confidence intervals as a method for the interval estimation of unknown parameters. It would be a fallacy to cite Morey et al.’s article as a valid criticism of confidence intervals. This does not mean that Bayesian credibility intervals are bad or could not be better than confidence intervals. It only means that this article is so blatantly biased and dogmatic that it does not add to the understanding of Neyman’s or Jeffreys’s approach to interval estimation.

P.S. More discussion of the article can be found on Gelman’s blog.

Andrew Gelman himself comments:

*My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% confidence interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic.*

I have to admit some *Schadenfreude* when I see one Bayesian attacking another Bayesian for the use of an ill-informed prior. While Bayesians are still fighting over the right priors, practical researchers may be better off using statistical methods that do not require priors, like, hm, confidence intervals?

P.P.S. Science requires trust. At some point, we cannot check all assumptions. I trust Neyman, Cohen, and Muthen and Muthen’s confidence intervals in MPLUS.

I’ve been reading your P-curve articles and thoughts on the replicability crisis for a while now, and I appreciate your work. However, I think this article is off the mark. It disappoints me in several ways.

First, the overall tone. You pine for the days “where scientists had open conflict with honor and integrity and actually engaged in scientific debates”, yet begin by calling Morey et al. psycho-Bayesians. Am I correct to infer you did not check whether your reading of their article matched theirs? See below, but I think you’ve misread. But maybe we have different drafts, because…

Second, in trying to check your reading of their article, I cannot find the bit you claim was “utter nonsense”. I searched for “Post-data” and “advertised”. Nothing.

I’m using: “http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf” and article by the same name, which I found by following the Gelman link you provided.

Third, your penultimate paragraph also seems to miss the entire point of a Bayesian approach, so I suspect there is some history driving what I take to be an odd mis-reading.

::: What I _think_ is going on :::::::

Assuming I have the right article etc., and disclaiming I read their article only briefly (esp. as it might be the wrong one).

(1) They say CIs do not give parameter estimates, and are not designed to. You seem to AGREE. Now, they think people should estimate parameters, and you think they’re unknowable, but that’s just the difference in Bayes/freq approaches.

(2) You think they argue Neyman said you can’t use CIs after the data are gathered.

That would be absurd. So much so I’m betting they didn’t say it in your draft either. I think they said (1). But .. and I’m guessing here .. because you’re not interested in estimating, I think you read much more into what they said.

(3) You note that Bayes alone won’t solve replicability. I think that’s right, but it’s a separate point, almost entirely unrelated to this harangue.

(4) They note many practitioners misinterpret CIs, and ascribe to them inferential properties they do not possess, and that this *has* harmed scientific inference in ways relevant to the replicability crisis. I think that’s been demonstrated repeatedly since the 1950s.

(5) I suspect like a friend of mine, you really know your frequentist stats. My friend was totally unsurprised by “Dance of the p-values” because he knows the chi-square distribution for replicability at any alpha level, and wondered why anyone would point this out. I think he can’t get his head around people actually getting this stuff wrong. But there’s decades of literature showing published authors get this way wrong, and quite likely it *does* contribute to the replicability crisis.

(6) Repeating (3): hacking can and will happen in any paradigm. Bayes may be more robust (as I like to think), or it may simply have other failure modes. I think Wasserman points out that while people misinterpret CIs as parameter estimates, people also misinterpret Bayesian credible intervals as long-run frequencies.

I hope at least half my claims were correct, and that you can help me find any errors.


Dear Charles,

thank you for your feedback. I may be wrong, but I think we do need a discussion about this article. If it is correct, it is a fundamental milestone in statistics with far-reaching implications. This follows directly from the authors’ claim that we should abandon confidence intervals, just when they are being reported more frequently in psychology (see the reporting of results in registered replication reports), in favor of credibility intervals.

I hope I didn’t misread the article; my reading is that the authors are suggesting confidence intervals are useless and do not provide the information that we think they provide.

Maybe we can just start here with trying to figure out possible misunderstandings.


P.S. Dear Charles,

“It may seem strange to the modern user of CIs, but Neyman is quite clear that CIs do not support any sort of reasonable belief about the parameter.”

How do you interpret this statement?


P.P.S

“If confidence procedures do not allow an assessment of the probability that an interval contains the true value, if they do not yield measures of precision, and if they do not yield assessments of the likelihood or plausibility of parameter values, then what are they?”

Translate: We have shown that

– confidence procedures do not allow an assessment of the probability that an interval contains the true value

– CI do not yield measures of precision

– CI do not yield assessments of the likelihood or plausibility of parameter values

If this were true, what implications would that have for 80 years of statistics and interpretations of results based on CIs?


Can I have a link to the paper as you read it? I do not find those quotes in http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf, which is by those authors and has the right title.


“As we have seen, according to CI theory, what happens in step (c) is not a belief, a conclusion, or any sort of reasoning from the data.”

Does that make sense?


“Furthermore, a confidence interval is not associated with any level of uncertainty about whether θ is, actually, in the interval.”

Do you agree?


I had to check that this free copy is identical with the article. I believe it is.

https://learnbayes.org/papers/confidenceIntervalsFallacy/


“Furthermore, a confidence interval is not associated with any level of uncertainty about whether θ is, actually, in the interval.”

Answer if you don’t agree:

The smaller the confidence level, the smaller the interval. That means that, all other things being equal, the closer we approach a point estimate, the smaller the probability that θ is in its CI.


Excellent post!

https://invertedlogicblog.wordpress.com/2019/05/27/philosophical-rants-20-the-gamblers-fallacy/


Thanks, interesting post.

I agree that we must collect data before calculating a CI.

I have two questions, though. Say that you estimate, in a large random sample, a mean of 8.5 (95% CI: 5.0–12.0), and assume there are no systematic errors and that the underlying assumptions are valid.

Based on the CI, are there any legit inferences one can draw in relation to the true mean?

And are there any legit inferences in relation to future estimations?

Thanks in advance.


If we don’t have access to any other information, or there simply is no other information, we can state, with a long-run error frequency of no more than 5%, that the effect size is greater than 5 and less than 12. We can narrow this interval by increasing the error probability or widen it by lowering it.
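The trade-off between the error probability and the interval width can be sketched with z-based intervals (my own illustration; the standard error of 1.75 is a hypothetical value chosen to roughly match the 8.5 (5.0, 12.0) example above):

```python
from statistics import NormalDist

se = 1.75  # hypothetical standard error for the 8.5 (5.0, 12.0) example
for level in (0.80, 0.90, 0.95, 0.99):
    # Two-sided critical value: z grows with the confidence level,
    # so lower error probability means a wider interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    print(f"{level:.0%} CI: 8.5 ± {z * se:.2f}")
```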


The probability attaches to the confidence procedure (CP). So we can say that if we do similar studies on samples drawn from the same population, 95% of the CIs of these studies would contain the true (population) mean. That is, if we believed that every CI from these studies contains the mean, we would be wrong 5% of the time.

That is not the same as saying that if we consider the true mean to be between 5.0 and 12.0, we would be wrong 5% of the time: if the population mean is outside this particular CI, we are wrong 100% of the time. The true mean is not linked to any probability measurement; it is what it is.

From many studies, you could guess what the true mean is. But you would not need to deduce that from the CIs. Simple aggregation computations will give you the means of bigger and bigger samples, approaching the true mean.

Call x_i the sample mean of study i (no same point can be found in two different samples). You can compute the mean of the sample that contains all the previous samples as x_{i=1 to n} = sum(n_i*x_i for i=1 to n) / sum(n_i for i=1 to n) where n is the number of studies available.
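The aggregation formula above can be written as a small function (a sketch; the study sizes and means are made up):

```python
def pooled_mean(sizes, means):
    # Weight each study's sample mean by its sample size, per
    # x = sum(n_i * x_i) / sum(n_i); samples are assumed disjoint.
    total_n = sum(sizes)
    return sum(n * m for n, m in zip(sizes, means)) / total_n

# Three hypothetical studies with sizes n_i and sample means x_i:
print(pooled_mean([10, 40, 50], [8.0, 9.0, 10.0]))  # (80 + 360 + 500) / 100 = 9.4
```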


Isn’t that true for many things, like opinion polls or pregnancy tests? There is a true state of affairs (pregnant or not), but we don’t know what this state is. Then we use a test that has a proven accuracy rate. So, if the test gives the right result in 99.9% of cases, we conclude with high confidence that somebody is pregnant if the test is positive. Just don’t see the problem that people have with long-run frequencies as probabilities.


“if the test gives the right result in 99.9% of cases, we conclude with high confidence that somebody is pregnant if the test is positive”

Even if your test has high accuracy, if pregnancy is rare, then you won’t be often confident in your prediction of pregnancy.

Indeed, diagnostic tests use posterior probabilities (Bayes theorem):

Pr(pregnancy | test_value) = Pr(test_value | pregnancy) * Pr(pregnancy) / Pr(test_value)

Pr(pregnancy | test_value) = Pr(test_value | pregnancy) * Pr(pregnancy) / (Pr(test_value | pregnancy) * Pr(pregnancy) + Pr(test_value | no pregnancy) * Pr(no pregnancy))

Pr(pregnancy | test_value) = sensitivity * Pr(pregnancy) / (sensitivity * Pr(pregnancy) + (1-specificity) * Pr(no pregnancy))

CIs are not posterior probabilities. They speak neither for one study nor for the true population parameter. If you want to decide about the “true” parameter of a population from a study using the same reasoning as for diagnostic tests, you are looking for Bayesian analyses. I refer you to Frank Davidoff, “Standing Statistics Right Side Up”, Annals of Internal Medicine, 1999;130:1019-1021.

(Or https://en.wikipedia.org/wiki/Base_rate_fallacy)
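The posterior calculation written out above can be checked numerically (a sketch with illustrative numbers, not data from any real test):

```python
def ppv(prior, sensitivity, specificity):
    # Positive predictive value via Bayes' theorem, as in the equations above:
    # Pr(condition | positive) = sens * prior / Pr(positive).
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# Even with sensitivity = specificity = 0.999, a 1-in-1000 base rate
# leaves the posterior probability at only 50% after a positive test.
print(round(ppv(prior=0.001, sensitivity=0.999, specificity=0.999), 6))  # 0.5
```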

“Just don’t see the problem that people have with long-run frequencies as probabilities.”

The problem is that it doesn’t help us when deciding about a specific outcome. If one’s interest is “for all the decisions I make (on any question or subject), I want 95% of them to be true”, then a 95% CI is what one is looking for. If one concludes that the “true” parameter is in the 95% CI, one will be right 95 times out of 100, but one will not know for which parameter exactly. Among parameters A, B, C, …, T, one may be wrong about either A or B or … or T when saying “the parameter is inside the parameter’s CI”. And the greater the level of confidence, the less interesting the CI, since the wider it is. [see final note]

Identifying which specific parameter is true is an understandable concern. Imagine you are a cancer patient and someone estimates your lifetime as T ± CI. Do you really care about the long-run frequencies that rule research findings? I don’t. If I’m about to die, I care about my specific lifetime.

Also, as an example, take Fig. 1 in Cumming, G. (2014), “The New Statistics: Why and How”, Psychological Science, 25(1), 7–29 (https://doi.org/10.1177/0956797613504966). If you reject the hypothesis of an effect based on the 95% CI (as some do), i.e., because the 95% CI contains the nil effect (µ0), then you will be wrong 14 times out of 25 (56%, that’s crazy!). So if you are studying an important drug and this method rules your decision, you are very likely to prevent people from being cured.

Currently, I’m not convinced that either confidence intervals or credible intervals can give us what we are looking for in scientific research. I think that point estimates, descriptive statistics, and the aggregation techniques used in meta-analyses are already okay. Both frequentist and Bayesian decisions rely on arbitrary assumptions (called “priors” for Bayesians), while in science, often, we are actually looking for what the assumptions are, i.e., the underlying process. In that case, neither confidence intervals, nor credible intervals, nor P-values, nor likelihood ratios/Bayes factors (Goodman, 1993, 1999), nor false positive risks (Colquhoun, 2018), nor epistemic ratios (Trafimow, 2006) can answer our question. Only an increased number of studies and larger sample sizes can make us converge toward the “true” parameter (underlying process) of the data of interest. In my current opinion, whether you have a CI that contains such or such a value should not guide your research or the money and effort you put into it. What should guide them is whether the question interests you and the vision of the world you want. If you want a world with very sophisticated health care, you invest a lot in health care: you make many studies about past and present treatments, you run many controls, etc.

[Final note : As Popper explained ~60 years ago, the more precise a theory, the less probable it is. So we should actually be looking for the less probable theories. You can observe this phenomenon with confidence interval: the higher the level of confidence, the wider the confidence interval, i.e, the less it informs us.]


The a priori probability that a tornado will hit your house is 1/100000. There is a tornado warning that has 90% accuracy. Are you going to stay or leave?


“There is a tornado warning that has 90% accuracy”

If by “accuracy” you mean the positive predictive value after adjusting for the prior, and not the typical measurement used in statistical learning (1 − test error), then yes, I will leave.
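Under the other reading (treating “90% accuracy” as sensitivity = specificity = 0.9, which is my assumption, not something given in the question), Bayes’ theorem shows why the prior still dominates:

```python
prior = 1 / 100_000          # a priori probability of a tornado hitting the house
sens, spec = 0.9, 0.9        # ASSUMED meaning of "90% accuracy"

# Posterior probability of a tornado given a warning, via Bayes' theorem.
posterior = sens * prior / (sens * prior + (1 - spec) * (1 - prior))
print(f"{posterior:.6f}")    # ~0.00009: the warning raises the odds ~9x, yet the risk stays tiny
```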
