Richard D. Morey, Rink Hoekstra, Jeffrey N. Rouder, Michael D. Lee, and Eric-Jan Wagenmakers (2016), henceforth the psycho-Bayesians, have a clear goal. They want psychologists to change the way they analyze their data.

Although this goal motivates the flood of method articles by this group, the most direct attack on other statistical approaches is made in the article “The fallacy of placing confidence in confidence intervals.” In this article, the authors claim that everybody, including textbook writers in statistics, has misunderstood Neyman’s classic article on interval estimation. What are the prior odds that, after 80 years, a group of psychologists has discovered a fundamental flaw in the interpretation of confidence intervals (H1), versus the alternative that a few psychologists are unable or unwilling to understand Neyman’s article?

Underlying this quest for change in statistical practices lies the ultimate attribution error: the belief that Fisher’s p-values or Neyman-Pearson significance testing, with or without confidence intervals, are responsible for the replication crisis in psychology (Wagenmakers et al., 2011).

This is an error because numerous articles have argued and demonstrated that questionable research practices undermine the credibility of the psychological literature. The unprincipled use of p-values (undisclosed multiple testing), also called p-hacking, means that many statistically significant results have inflated error rates: the long-run probability of a false positive is not 5%, as stated in each article, but can approach 100% (Sterling, 1959; Rosenthal, 1979; Simmons, Nelson, & Simonsohn, 2011).
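The arithmetic behind this inflation is easy to check with a short simulation (my own illustration, not taken from any of the cited articles): if a researcher runs many tests under the null hypothesis and reports a result whenever any of them is significant, the chance of at least one false positive far exceeds the nominal 5%.

```python
import random

random.seed(1)

def one_study(n_tests=20, alpha=0.05):
    # Under the null hypothesis, each test is "significant" with probability alpha;
    # the study counts as a success if ANY of the tests crosses the threshold.
    return any(random.random() < alpha for _ in range(n_tests))

studies = 10_000
false_positive_rate = sum(one_study() for _ in range(studies)) / studies
# Theoretical rate: 1 - 0.95**20, roughly 0.64, far above the nominal 5%.
print(round(false_positive_rate, 2))
```

The numbers (20 hidden tests per study) are hypothetical, but the qualitative point holds for any amount of undisclosed multiple testing.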

You will not find a single article by the psycho-Bayesians that acknowledges the contribution of the unprincipled use of p-values to the replication crisis. The reason is that they want to use the replication crisis as a vehicle to sell Bayesian statistics.

It is hard to believe that classic statistics is fundamentally flawed and misunderstood when it is used in industry to produce smartphones and other technology whose mass production requires tight error control. Nevertheless, this article claims that everybody misunderstood Neyman’s seminal article on confidence intervals.

The authors claim that Neyman wanted us to compute confidence intervals only before we collect data, and that he warned readers that confidence intervals provide no useful information after the data are collected.

*Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative”*

This is utter nonsense. Of course Neyman was asking us to interpret confidence intervals after we collected data, because we need a sample to compute a confidence interval. It is hard to believe that this could have passed peer review in a statistics journal, and it is not clear who was qualified to review this paper for Psychonomic Bullshit Review.

The way the psycho-statisticians use Neyman’s quote is unscientific because they omit the context and the statements that follow it. In fact, Neyman was arguing against Bayesian attempts to estimate probabilities that can be applied to a single event.

*It is important to notice that for this conclusion to be true, it is not necessary that the problem of estimation should be the same in all the cases. For instance, during a period of time the statistician may deal with a thousand problems of estimation and in each the parameter M to be estimated and the probability law of the X’s may be different. As far as in each case the functions L and U are properly calculated and correspond to the same value of alpha, his steps (a), (b), and (c), though different in details of sampling and arithmetic, will have this in common—the probability of their resulting in a correct statement will be the same, alpha. Hence the frequency of actually correct statements will approach alpha. It will be noticed that in the above description the probability statements refer to the problems of estimation with which the statistician will be concerned in the future. In fact, I have repeatedly stated that the frequency of correct results tend to alpha.*

*Consider now the case when a sample, S, is already drawn and the calculations have given, say, L = 1 and U = 2. Can we say that in this particular case the probability of the true value of M falling between 1 and 2 is equal to alpha? The answer is obviously in the negative.*

*The parameter M is an unknown constant and no probability statement concerning its value may be made, that is, except for the hypothetical and trivial ones P{1 < M < 2} = 1 if 1 < M < 2, or 0 if either M < 1 or 2 < M, which we have decided not to consider.*

The full quote makes it clear that Neyman is considering the problem of quantifying the probability that a population parameter is in a specific interval and dismisses it as trivial because it doesn’t solve the estimation problem. We don’t even need to observe data and compute a confidence interval: the statement that a specific unknown number is between two other numbers (1 and 2) is either TRUE (P = 1) or FALSE (P = 0). To imply that this trivial observation leads to the conclusion that we cannot make post-data inferences based on confidence intervals is ridiculous.

Neyman continues.

*The theoretical statistician [constructing a confidence interval] may be compared with the organizer of a game of chance in which the gambler has a certain range of possibilities to choose from while, whatever he actually chooses, the probability of his winning and thus the probability of the bank losing has permanently the same value, 1 – alpha. The choice of the gambler on what to bet, which is beyond the control of the bank, corresponds to the uncontrolled possibilities of M having this or that value. The case in which the bank wins the game corresponds to the correct statement of the actual value of M. In both cases the frequency of “successes” in a long series of future “games” is approximately known. On the other hand, if the owner of the bank, say, in the case of roulette, knows that in a particular game the ball has stopped at the sector No. 1, this information does not help him in any way to guess how the gamblers have betted. Similarly, once the boundaries of the interval are drawn and the values of L and U determined, the calculus of probability adopted here is helpless to provide answer to the question of what is the true value of M.*

What Neyman was saying is that population parameters are unknowable and remain unknown even after researchers compute a confidence interval. Moreover, the construction of a confidence interval doesn’t allow us to quantify the probability that an unknown value is within the constructed interval. This probability remains unspecified. Nevertheless, we can use the property of the long-run success rate of the method to place confidence in the belief that the unknown parameter is within the interval. This is common sense. If we place bets in roulette or other random events, we rely on long-run frequencies of winnings to calculate our odds of winning in a specific game.

It is absurd to suggest that Neyman himself argued that confidence intervals provide no useful information after data are collected because the computation of a confidence interval requires a sample of data. That is, while the width of a confidence interval can be determined a priori before data collection (e.g. in precision planning and power calculations), the actual confidence interval can only be computed based on actual data because the sample statistic determines the location of the confidence interval.
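Neyman’s long-run coverage property is easy to demonstrate with a small simulation (my own sketch, assuming a normal mean with known sigma): every interval is computed after the data are in, and yet about 95% of the intervals contain the true parameter.

```python
import random
import statistics

random.seed(0)
true_mu, sigma, n, z = 10.0, 2.0, 25, 1.96  # z for a two-sided 95% interval

trials = 10_000
hits = 0
for _ in range(trials):
    # Each interval is computed from observed data; the 95% guarantee is a
    # property of the procedure across repeated samples.
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    m = statistics.fmean(sample)
    half = z * sigma / n ** 0.5  # half-width with sigma known
    hits += (m - half) <= true_mu <= (m + half)

print(hits / trials)  # close to 0.95
```

The location of each interval (the sample mean) can only be known post-data, which is exactly why the intervals are useful after data collection.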

Readers of this blog may face a dilemma. Why should they place confidence in another psycho-statistician? The probability that I am right is 1, if I am right and 0 if I am wrong, but this doesn’t help readers to adjust their beliefs in confidence intervals.

The good news is that they can use prior information. Neyman is widely regarded as one of the most influential figures in statistics. His methods are taught in hundreds of textbooks, and statistical software programs compute confidence intervals. Major advances in statistics have been new ways to compute confidence intervals for complex statistical problems (e.g., confidence intervals for standardized coefficients in structural equation models; MPLUS; Muthen & Muthen). What are the a priori chances that generations of statisticians misinterpreted Neyman and committed the fallacy of interpreting confidence intervals after data are obtained?

However, if readers need more evidence of the psycho-statisticians’ deceptive practices, it is important to point out that they omitted Neyman’s criticism of their favored approach, namely Bayesian estimation.

The fallacy article gives the impression that Neyman’s (1936) approach to estimation is outdated and should be replaced with more modern, superior approaches like Bayesian credibility intervals. For example, the authors cite Jeffreys’s (1961) Theory of Probability, which gives the impression that Jeffreys’s work followed Neyman’s. However, an accurate representation of Neyman’s work reveals that Jeffreys’s work preceded Neyman’s and that Neyman discussed some of the problems with Jeffreys’s approach in great detail. Neyman’s critical article was even “communicated” by Jeffreys (these were different times where scientists had open conflict with honor and integrity and actually engaged in scientific debates).

Given that Jeffreys’s approach was published just one year before Neyman’s (1936) article, Neyman’s article probably also offers the first thorough assessment of Jeffreys’s approach. Neyman first gives a thorough account of it (those were the days).

Neyman then offers his critique of Jeffreys’s approach.

*It is known that, as far as we work with the conception of probability as adopted in this paper, the above theoretically perfect solution may be applied in practice only in quite exceptional cases, and this for two reasons.*

Importantly, he does not challenge the theory. He only points out that the theory is not practical because it requires knowledge that is often not available. That is, to estimate the probability that an unknown parameter is within a specific interval, we need to make prior assumptions about unknown parameters. This is the problem that has plagued subjective Bayesian approaches.

Neyman then discusses Jeffreys’s approach to solving this problem. I do not claim the statistical expertise to decide whether Neyman or Jeffreys was right. Even statisticians have been unable to resolve these issues, and I believe the consensus is that Bayesian credibility intervals and Neyman’s confidence intervals are both mathematically viable approaches to interval estimation, with different strengths and weaknesses.

I am only trying to point out to unassuming readers of the fallacy article that both approaches are as old as statistics and that the presentation of the issue in this article is biased and violates my personal, and probably idealistic, standards of scientific integrity. Using a selective quote from Neyman to dismiss confidence intervals, and then omitting Neyman’s critique of Bayesian credibility intervals, is deceptive and shows an unwillingness or inability to engage in open scientific examination of the arguments for and against different estimation methods.

It is sad and ironic that Wagenmakers’s effort to convert psychologists into Bayesian statisticians is similar to Bem’s (2011) attempt to convert psychologists into believers in parapsychology, or at least in parapsychology as a respectable science. While Bem fudged data to show false empirical evidence, Wagenmakers misrepresents the way classic statistics works and ignores the key problem of Bayesian statistics, namely that Bayesian inferences are contingent on prior assumptions that can be gamed to show what a researcher wants to show. Wagenmakers used this flexibility in Bayesian statistics to suggest that Bem (2011) presented weak evidence for extra-sensory perception. However, a rebuttal by Bem showed that Bayesian statistics also supported extra-sensory perception under different and more reasonable priors. Thus, Wagenmakers et al. (2011) were simply wrong to suggest that Bayesian methods would have prevented Bem from providing strong evidence for an incredible phenomenon.

The problem with Bem’s article is not the way he “analyzed” the data. The problem is that Bem violated basic principles of science that are required to draw valid statistical inferences from data. It would be a miracle if Bayesian methods that assume unbiased data could correct for data falsification. The problem with Bem’s data has been revealed using statistical tools for the detection of bias (Francis, 2012; Schimmack, 2012, 2015, 2018). There has been no rebuttal from Bem, and he admits to the use of practices that invalidate the published p-values. So, the problem is not the use of p-values, confidence intervals, or Bayesian statistics. The problem is the abuse of statistical methods. There are few cases of abuse of Bayesian methods simply because they are rarely used. However, Bayesian statistics can be gamed without data fudging by specifying convenient priors and failing to inform readers about the effect of priors on results (Gronau et al., 2017).

In conclusion, it is not a fallacy to interpret confidence intervals as a method for the interval estimation of unknown parameters. It would be a fallacy to cite Morey et al.’s article as a valid criticism of confidence intervals. This does not mean that Bayesian credibility intervals are bad or could not be better than confidence intervals. It only means that this article is so blatantly biased and dogmatic that it does not add to the understanding of Neyman’s or Jeffreys’s approach to interval estimation.

P.S. More discussion of the article can be found on Gelman’s blog.

Andrew Gelman himself comments:

*My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% confidence interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic.*

I have to admit some *Schadenfreude* when I see one Bayesian attacking another Bayesian for the use of an ill-informed prior. While Bayesians are still fighting over the right priors, practical researchers may be better off using statistical methods that do not require priors, like, hm, confidence intervals?

P.P.S. Science requires trust. At some point, we cannot check all assumptions. I trust Neyman, Cohen, and Muthen and Muthen’s confidence intervals in MPLUS.

I’ve been reading your P-curve articles and thoughts on the replicability crisis for a while now, and I appreciate your work. However, I think this article is off the mark. It disappoints me in several ways.

First, the overall tone. You pine for the days “where scientists had open conflict with honor and integrity and actually engaged in scientific debates”, yet begin by calling Morey et al. psycho-Bayesians. Am I correct to infer you did not check whether your reading of their article matched theirs? See below, but I think you’ve misread. But maybe we have different drafts, because…

Second, in trying to check your reading of their article, I cannot find the bit you claim was “utter nonsense”. I searched for “Post-data” and “advertised”. Nothing.

I’m using: “http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf” and article by the same name, which I found by following the Gelman link you provided.

Third, your penultimate paragraph also seems to miss the entire point of a Bayesian approach, so I suspect there is some history driving what I take to be an odd mis-reading.

::: What I _think_ is going on :::::::

Assuming I have the right article etc., and disclaiming I read their article only briefly (esp. as it might be the wrong one).

(1) They say CIs do not give parameter estimates, and are not designed to. You seem to AGREE. Now, they think people should estimate parameters, and you think they’re unknowable, but that’s just the difference in Bayes/freq approaches.

(2) You think they argue Neyman said you can’t use CIs after the data are gathered.

That would be absurd. So much so I’m betting they didn’t say it in your draft either. I think they said (1). But .. and I’m guessing here .. because you’re not interested in estimating, I think you read much more into what they said.

(3) You note that Bayes alone won’t solve replicability. I think that’s right, but it’s a separate point, almost entirely unrelated to this harangue.

(4) They note many practitioners misinterpret CIs, and ascribe to them inferential properties they do not possess, and that this *has* harmed scientific inference in ways relevant to the replicability crisis. I think that’s been demonstrated repeatedly since the 1950s.

(5) I suspect like a friend of mine, you really know your frequentist stats. My friend was totally unsurprised by “Dance of the p-values” because he knows the chi-square distribution for replicability at any alpha level, and wondered why anyone would point this out. I think he can’t get his head around people actually getting this stuff wrong. But there’s decades of literature showing published authors get this way wrong, and quite likely it *does* contribute to the replicability crisis.

(6) Repeating (3): hacking can and will happen in any paradigm. Bayes may be more robust (as I like to think), or it may simply have other failure modes. I think Wasserman points out that while people misinterpret CIs as parameter estimates, people also misinterpret Bayesian credible intervals as long-run frequencies.

I hope at least half my claims were correct, and that you can help me find any errors.


Dear Charles,

thank you for your feedback. I may be wrong, but I think we do need a discussion about this article. If it is correct, it is a fundamental milestone in statistics with far-reaching implications. This follows directly from the authors’ claim that we should abandon confidence intervals, just when they are being reported more frequently in psychology (see the reporting of results in registered replication reports), in favor of credibility intervals.

I hope I didn’t misread the article; my reading is that the authors are suggesting confidence intervals are useless and do not provide the information that we think they provide.

Maybe we can just start here with trying to figure out possible misunderstandings.


P.S. Dear Charles,

“It may seem strange to the modern user of CIs, but Neyman is quite clear that CIs do not support any sort of reasonable belief about the parameter.”

How do you interpret this statement?


P.P.S

“If confidence procedures do not allow an assessment of the probability that an interval contains the true value, if they do not yield measures of precision, and if they do not yield assessments of the likelihood or plausibility of parameter values, then what are they?”

Translate: We have shown that

– confidence procedures do not allow an assessment of the probability that an interval contains the true value

– CI do not yield measures of precision

– CI do not yield assessments of the likelihood or plausibility of parameter values

If this were true, what implications would that have for 80 years of statistics and interpretations of results based on CIs?


Can I have a link to the paper as you read it? I do not find those quotes in http://andrewgelman.com/wp-content/uploads/2014/09/fundamentalError.pdf, which is by those authors and has the right title.


“As we have seen, according to CI theory, what happens in step (c) is not a belief, a conclusion, or any sort of reasoning from the data.”

Does that make sense?


“Furthermore, a confidence interval is not associated with any level of uncertainty about whether θ is, actually, in the interval.”

Do you agree?


I had to check that this free copy is identical with the article. I believe it is.

https://learnbayes.org/papers/confidenceIntervalsFallacy/


“Furthermore, a confidence interval is not associated with any level of uncertainty about whether θ is, actually, in the interval.”

Answer if you don’t agree:

The smaller the confidence level, the smaller the interval. That means that, all other things being equal, the closer we approach a point estimate, the smaller the probability that θ is in its CI.


Excellent post!

https://invertedlogicblog.wordpress.com/2019/05/27/philosophical-rants-20-the-gamblers-fallacy/


Thanks, interesting post.

I agree that we must collect data before calculating a CI.

I have two questions, though. Say that you estimate, in a large random sample, a mean of 8.5 (95% CI: 5.0–12.0), and assume there are no systematic errors and that the underlying assumptions are valid.

Based on the CI, are there any legit inferences one can draw in relation to the true mean?

And are there any legit inferences in relation to future estimations?

Thanks in advance.


If we don’t have access to any other information, or there simply is no other information, we can state, with a long-run error frequency of no more than 5%, that the effect size is greater than 5 and less than 12. We can narrow this interval by increasing the error probability or widen it by lowering it.
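The trade-off between the error probability and the interval width can be sketched with z-based intervals (my own illustration; the standard error of 1.75 is a hypothetical value chosen to roughly match the 8.5 (5.0, 12.0) example above):

```python
from statistics import NormalDist

se = 1.75  # hypothetical standard error for the 8.5 (5.0, 12.0) example
for level in (0.80, 0.90, 0.95, 0.99):
    # Two-sided critical value: z grows with the confidence level,
    # so lower error probability means a wider interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    print(f"{level:.0%} CI: 8.5 ± {z * se:.2f}")
```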


The probability attaches to the confidence procedure (CP). So we can say that if we do similar studies on samples drawn from the same population, 95% of the CIs of these studies would contain the true (population) mean. That is, if we believed that every CI from these studies contains the mean, we would be wrong 5% of the time.

That is not the same as saying that if we consider the true mean to be between 5.0 and 12.0, we would be wrong 5% of the time: if the population mean is outside this particular CI, we are wrong 100% of the time. The true mean is not linked to any probability measurement; it is what it is.

From many studies, you could guess what the true mean is. But you would not need to deduce that from the CIs. Simple aggregation computations will give you the means of bigger and bigger samples, approaching the true mean.

Call x_i the sample mean of study i (no same point can be found in two different samples). You can compute the mean of the sample that contains all the previous samples as x_{i=1 to n} = sum(n_i*x_i for i=1 to n) / sum(n_i for i=1 to n) where n is the number of studies available.
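The aggregation formula above can be written as a small function (a sketch; the study sizes and means are made up):

```python
def pooled_mean(sizes, means):
    # Weight each study's sample mean by its sample size, per
    # x = sum(n_i * x_i) / sum(n_i); samples are assumed disjoint.
    total_n = sum(sizes)
    return sum(n * m for n, m in zip(sizes, means)) / total_n

# Three hypothetical studies with sizes n_i and sample means x_i:
print(pooled_mean([10, 40, 50], [8.0, 9.0, 10.0]))  # (80 + 360 + 500) / 100 = 9.4
```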


Isn’t that true for many things, like opinion polls or pregnancy tests? There is a true state of affairs (pregnant or not), but we don’t know what this state is. Then we use a test that has a proven accuracy rate. So, if the test gives the right result in 99.9% of cases, we conclude with high confidence that somebody is pregnant if the test is positive. Just don’t see the problem that people have with long-run frequencies as probabilities.


“if the test gives the right result in 99.9% of cases, we conclude with high confidence that somebody is pregnant if the test is positive”

Even if your test has high accuracy, if pregnancy is rare, then you won’t be often confident in your prediction of pregnancy.

Indeed, diagnostic tests use posterior probabilities (Bayes theorem):

Pr(pregnancy | test_value) = Pr(test_value | pregnancy) * Pr(pregnancy) / Pr(test_value)

Pr(pregnancy | test_value) = Pr(test_value | pregnancy) * Pr(pregnancy) / (Pr(test_value | pregnancy) * Pr(pregnancy) + Pr(test_value | no pregnancy) * Pr(no pregnancy))

Pr(pregnancy | test_value) = sensitivity * Pr(pregnancy) / (sensitivity * Pr(pregnancy) + (1-specificity) * Pr(no pregnancy))

CIs are not posterior probabilities. They speak neither for one study nor for the true population parameter. If you want to decide about the “true” parameter of a population from a study using the same reasoning as for diagnostic tests, you are looking for Bayesian analyses. I refer you to Frank Davidoff, “Standing Statistics Right Side Up”, Annals of Internal Medicine, 1999;130:1019-1021.

(Or https://en.wikipedia.org/wiki/Base_rate_fallacy)
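The posterior calculation written out above can be checked numerically (a sketch with illustrative numbers, not data from any real test):

```python
def ppv(prior, sensitivity, specificity):
    # Positive predictive value via Bayes' theorem, as in the equations above:
    # Pr(condition | positive) = sens * prior / Pr(positive).
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

# Even with sensitivity = specificity = 0.999, a 1-in-1000 base rate
# leaves the posterior probability at only 50% after a positive test.
print(round(ppv(prior=0.001, sensitivity=0.999, specificity=0.999), 6))  # 0.5
```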

“Just don’t see the problem that people have with long-run frequencies as probabilities.”

The problem is that it doesn’t help us when deciding about a specific outcome. If one’s interest is “for all the decisions I make (on any question or subject), I want 95% of them to be true”, then a 95% CI is what one is looking for. If one concludes that the “true” parameter is in the 95% CI, one will be right 95 times out of 100, but one will not know for which parameter exactly. Among parameters A, B, C, …, T, one may be wrong about either A or B or … or T when saying “the parameter is inside the parameter’s CI”. And the greater the level of confidence, the less interesting the CI, since the wider it is. [see final note]

Identifying which specific parameter is true is an understandable concern. Imagine you are a cancer patient and someone estimates your lifetime as T ± CI. Do you really care about the long-run frequencies that rule research findings? I don’t. If I’m about to die, I care about my specific lifetime.

Also, as an example, take Fig. 1 in Cumming, G. (2014), “The New Statistics: Why and How”, Psychological Science, 25(1), 7–29 (https://doi.org/10.1177/0956797613504966). If you reject the hypothesis of an effect based on the 95% CI (as some do), i.e., because the 95% CI contains the nil effect (µ0), then you will be wrong 14 times out of 25 (56%, that’s crazy!). So if you are studying an important drug and this method rules your decision, you are very likely to prevent people from being cured.

Currently, I’m not convinced that either confidence intervals or credible intervals can give us what we are looking for in scientific research. I think that point estimates, descriptive statistics, and the aggregation techniques used in meta-analyses are already okay. Both frequentist and Bayesian decisions rely on arbitrary assumptions (called “priors” for Bayesians), while in science, often, we are actually looking for what the assumptions are, i.e., the underlying process. In that case, neither confidence intervals, nor credible intervals, nor P-values, nor likelihood ratios/Bayes factors (Goodman, 1993, 1999), nor false positive risks (Colquhoun, 2018), nor epistemic ratios (Trafimow, 2006) can answer our question. Only an increased number of studies and larger sample sizes can make us converge toward the “true” parameter (underlying process) of the data of interest. In my current opinion, whether you have a CI that contains such or such a value should not guide your research or the money and effort you put into it. What should guide them is whether the question interests you and the vision of the world you want. If you want a world with very sophisticated health care, you invest a lot in health care: you make many studies about past and present treatments, you run many controls, etc.

[Final note : As Popper explained ~60 years ago, the more precise a theory, the less probable it is. So we should actually be looking for the less probable theories. You can observe this phenomenon with confidence interval: the higher the level of confidence, the wider the confidence interval, i.e, the less it informs us.]


The a priori probability that a tornado will hit your house is 1/100000. There is a tornado warning that has 90% accuracy. Are you going to stay or leave?


“There is a tornado warning that has 90% accuracy”

If by “accuracy” you mean the positive predictive value after adjusting for the prior, and not the typical measurement used in statistical learning (1 − test error), then yes, I will leave.
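Under the other reading (treating “90% accuracy” as sensitivity = specificity = 0.9, which is my assumption, not something given in the question), Bayes’ theorem shows why the prior still dominates:

```python
prior = 1 / 100_000          # a priori probability of a tornado hitting the house
sens, spec = 0.9, 0.9        # ASSUMED meaning of "90% accuracy"

# Posterior probability of a tornado given a warning, via Bayes' theorem.
posterior = sens * prior / (sens * prior + (1 - spec) * (1 - prior))
print(f"{posterior:.6f}")    # ~0.00009: the warning raises the odds ~9x, yet the risk stays tiny
```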
