The Bayesian Mixture Model for P-Curves is Fundamentally Flawed

Draft. Comments are welcome. To be submitted to Meta-Psychology

Authors: Ulrich Schimmack & Jerry Brunner
[Jerry Brunner is a professor in the statistics department of the University of Toronto and an expert on Bayesian mixture models. He wrote the R code to estimate the false discovery rate without a dogmatic prior that limits heterogeneity in the evidence against H0.]

There are many mixture models. I am using the term Bayesian Mixture Model (BMM) to refer to the specific model proposed by Gronau et al. (2017). Criticism of their model does not generalize to other Bayesian mixture models.

Meta-Analysis in Psychology

The primary purpose of psychological science is to collect data and to interpret them. The focus is usually on internal consistency, that is, on whether the data of a study are consistent with theoretical predictions. The problem with this focus on single studies is that a single study rarely provides conclusive evidence for a hypothesis, and even more rarely against a hypothesis, let alone a complete theory.

The solution to this problem has been the use of meta-analysis. Meta-analyses aim to provide conclusive evidence by aggregating information from several studies. The most common form of meta-analysis in psychology converts information from single studies into estimates of standardized effect sizes and then draws conclusions about the effect size in the population.

There are numerous problems with effect-size meta-analyses in psychology. One problem is that a collection of convenience samples does not allow generalization to a population. Another problem is that studies are often very heterogeneous, and it is not clear which of them can be replicated and which produced false positive results. The biggest problem is that original studies are selected for significance (Sterling, 1959; Sterling et al., 1995). As a result, effect size estimates in meta-analyses are inflated and can provide false evidence for effects that do not exist.

The problem of selection for significance has led to the development of a new type of meta-analysis that takes selection for significance into account. The Bayesian Mixture Model for significant p-values is one of these models (Gronau, Duizer, Bakker, & Wagenmakers, 2017). Unlike well-established mixture models that assume all data are available (Allison et al., 2002), the BMM uses only significant p-values. In this regard, the model is similar to p-curve (Simonsohn et al., 2014), p-uniform (van Assen et al., 2014), and z-curve (Brunner & Schimmack, 2018).

Henceforth, we will use the acronym BMM to refer to Gronau et al.’s model, but we want to clarify that any criticism of their model is not a criticism of Bayesian Mixture Models in general. In fact, we present an alternative mixture model that fixes the fundamental problem of their model.

A Meta-Analysis of P-values

Like p-curve (Simonsohn et al., 2014), BMM is a meta-analysis of significant p-values. It is implied, but never explicitly stated, that the p-values in a p-curve come from two-sided tests. That is, the direction of an effect is not reflected in the test statistic. The test statistic could be one that ignores the sign of an effect (F-values, chi-square values) or the absolute value of a directional test statistic (absolute t-values, absolute z-scores). As a result, p-values of 1 correspond to test statistics and effect sizes of zero, and p-values decrease towards zero as test statistics increase towards infinity.

In a meta-analysis of p-values, sampling error will produce some variation in test statistics and p-values even if the population effect size is zero. It is well known in statistics that the distribution of p-values in this scenario is uniform. This can be modeled with a beta distribution with shape parameters a = 1 and b = 1.

However, if all of the tests examined a true hypothesis (i.e., the null-hypothesis is false), the density of p-values decreases monotonically as p-values increase. This can also be modeled with a beta distribution by setting the first shape parameter a to a value less than 1. The steeper the decrease, the stronger the evidence against the null-hypothesis. Figure 1 illustrates this p-curve with a shape parameter of a = .5.

We see that both distributions contribute p-values close to 1. However, the uniform distribution based on true null-hypotheses contributes more p-values close to 1. The reason is simply that p-values are more likely to be close to 0 when the null-hypothesis is false.
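
To make this concrete, here is a small R simulation (ours, not the code behind Figure 1) that generates p-values under both scenarios and compares how often they fall close to 1.

# p-values under H0 are uniform, i.e., Beta(a = 1, b = 1);
# p-values under H1 are modeled with Beta(a < 1, b = 1), which piles mass near 0.
set.seed(123)
n    <- 1e5
p_h0 <- rbeta(n, shape1 = 1,  shape2 = 1)
p_h1 <- rbeta(n, shape1 = .5, shape2 = 1)
# Proportion of p-values close to 1 (e.g., p > .95) under each scenario
mean(p_h0 > .95)   # about .05
mean(p_h1 > .95)   # smaller, about .025 for a = .5
# Overlay the two p-curves with a common set of bins
br <- seq(0, 1, by = .05)
hist(p_h1, breaks = br, freq = FALSE, col = rgb(1, 0, 0, .4),
     xlab = "two-sided p-value", main = "p-curves under H1 (red) and H0 (blue)")
hist(p_h0, breaks = br, freq = FALSE, col = rgb(0, 0, 1, .4), add = TRUE)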

The First Flaw in the Bayesian Mixture Model

Like z-curve, BMM does not try to fit a model to the distribution of p-values. Rather, it first transforms p-values into z-scores, which the authors call probit-transformed p-values. To convert p-values into z-scores, it is important to take into account whether the p-values are one-sided or two-sided.

For one-sided p-values, a p-value of .5 corresponds to a z-score of 0, a p-value of 0 corresponds to a z-score of infinity, and a p-value of 1 corresponds to a z-score of minus infinity. However, p-values in a p-curve analysis are two-sided: a p-value of 0 can correspond to a z-score of plus or minus infinity, while a p-value of 1 corresponds to a z-score of zero.

The proper formula to transform two-sided p-values into z-scores is -qnorm(p/2). To illustrate, take a z-score of 1.96 and compute the two-sided p-value using the formula (1-pnorm(abs(z)))*2. We obtain a p-value of .05 because 1.96 is the critical value for a two-sided z-test with alpha = .05. We can do the same with a z-score of -1.96 and again obtain .05. Using -qnorm(p/2) with p = .05 recovers z = 1.96, showing that the test statistic is about two standard deviations away from zero. With decreasing p-values, the evidence against the null-hypothesis becomes stronger. For example, p = .005 corresponds to a z-score of about 2.8, that is, nearly three standard deviations away from zero.
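
These conversions can be checked in a few lines of R (the object names are ours):

# Round trip between a test statistic and a two-sided p-value
z <- 1.96
p <- (1 - pnorm(abs(z))) * 2   # two-sided p-value, about .05
-qnorm(p / 2)                  # recovers 1.96, also for z = -1.96
-qnorm(.005 / 2)               # p = .005 corresponds to |z| of about 2.81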

The BMM uses qnorm(p) to convert two-sided p-values into z-scores. The omission of the minus sign is just a minor nuisance: now negative values indicate evidence against the null-hypothesis. However, using p rather than p/2 has more dramatic consequences. With this formula, a p-value of .5 is converted into a z-score of zero, p-values greater than .5 produce positive values that would suggest evidence against the null-hypothesis in the opposite direction, and a p-value of 1 is converted into a value of infinity.

The only reason why this flawed conversion does not produce major problems is that only significant p-values are used. All p-values are therefore well below .5, where the problems would occur. As a result, z-scores of 1.96, which produce a two-sided p-value of .05, are converted into z-scores of -1.65, the critical value for a one-sided z-test.
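
A quick comparison of the two transformations for typical significant p-values (again our own illustration, not code from the BMM):

# BMM-style transform versus the proper two-sided transform
p_sig <- c(.05, .01, .005, .001)
round(qnorm(p_sig), 3)        # -1.645 -2.326 -2.576 -3.090: BMM-style z-scores
round(-qnorm(p_sig / 2), 3)   #  1.960  2.576  2.807  3.291: the absolute z-scores of the tests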

The Fatal Flaw of the Bayesian Mixture Model

The fatal flaw of the BMM is the prior for the standard deviation of the distribution of z-scores produced by true hypotheses (i.e., false null-hypotheses). Theoretically, this standard deviation can range from small to large values, depending on the strength of the evidence against the null-hypothesis and the heterogeneity of effect sizes and sample sizes across studies. However, the BMM constrains the prior distribution of the standard deviation to the range from 0 to 1. Importantly, this restriction is applied to the theoretical distribution in the model, not to the truncated distribution that is produced by selection for significance.

It should also be noted that it is highly unusual in Bayesian statistics to use priors to restrict the range of possible parameter values. Priors that set the probability of certain parameter values to zero are known as dogmatic priors. The problem with dogmatic priors is that empirical data can never correct a bad prior. Thus, the prior continues to influence the results even in very large data sets.

The authors justify their use of a dogmatic prior with a single sentence. They claim that values greater than 1 “make the implausible prediction that p values near 1 are more common under H1 than under H0” (p. 1226).

Figure 2 makes it clear that this claim is blatantly false. With a monotonically decreasing density of p-values when the null-hypothesis is false, the proportion of studies with two-sided p-values close to 1 will always be smaller than the proportion of p-values close to 1 produced by the uniform distribution when the null-hypothesis is true. It is therefore totally unnecessary to impose a restriction on the prior for the standard deviation of z-scores that are based on transformed two-sided p-values.

If standard deviations of z-scores were always less than 1, imposing an unnecessary constraint on this parameter would not be a problem. However, the standard deviation of z-scores can easily exceed 1. This is illustrated when the p-curves in Figure 2 are converted into z-curves using the improper transformation qnorm(p). For a = .9, the standard deviation is just a little above 1; for a = .5 (green), it is 1.25; and for a = .1 (blue), it is 2.32. In general, the standard deviation increases beyond 1 as the shape parameter a of the beta distribution decreases.
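
These standard deviations are easy to check by simulation. The following sketch (ours) should approximately reproduce the values just reported; exact results will vary slightly with the seed.

# Standard deviation of qnorm(p) when p-values follow a Beta(a, 1) distribution
set.seed(123)
sd_probit <- function(a, n = 1e6) sd(qnorm(rbeta(n, shape1 = a, shape2 = 1)))
round(sapply(c(1, .9, .5, .1), sd_probit), 2)   # 1 for a = 1, then increasing as a shrinks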


The next figure shows how limiting the standard deviation of the normal distribution under H1 inflates the estimate of the proportion of false positives (i.e., of cases where H0 is true). BMM tries to model the observed distribution of z-scores with two standard normal distributions. The center of the standard normal for H0 is fixed at zero; the center of the standard normal for H1 can move to maximize model fit. No model will fit well, because the data were not generated by two standard normal distributions. However, the best achievable fit is obtained by letting H0 account for the z-scores close to the significance criterion and letting H1 account for the more extreme z-scores. As a result, BMM produces estimates of high proportions of false positives even when there are no false positives in the data.

This can be easily demonstrated by submitting data that were generated from a beta distribution with a = .2 to the shiny app for the BMM. The output is shown in the next figure. The model estimates that one third (33.2%) of the data are false positives, and it is very confident in this estimate, with a credibility interval ranging from .27 to .39. More data would only tighten this interval, not substantially alter the point estimate. The reason for this false positive result (pun intended) is the dogmatic prior that limits the standard deviation of the normal distribution.
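
The following R sketch shows how such data can be generated and why the restriction matters. The mixture fit below is a maximum-likelihood analogue of the two-component model described above, not Gronau et al.'s Bayesian implementation; the truncation of both components at the significance criterion and all object names are our assumptions.

# Generate two-sided p-values from a Beta(a = .2, b = 1) p-curve (no true
# null-hypotheses), select for significance, and probit-transform like BMM does.
set.seed(123)
p     <- rbeta(2e4, shape1 = .2, shape2 = 1)
p_sig <- p[p < .05]            # selection for significance
z     <- qnorm(p_sig)          # BMM-style probit transform
crit  <- qnorm(.05)            # all z-scores fall below this value

# Density of N(mu, sd) truncated to z < crit
dtnorm <- function(z, mu, sd) dnorm(z, mu, sd) / pnorm(crit, mu, sd)

# Negative log-likelihood of the mixture: weight w on H0 = N(0, 1) and
# weight (1 - w) on H1 = N(mu, sd), with sd restricted to (0, max_sd)
negll <- function(par, z, max_sd) {
  w  <- plogis(par[1])
  mu <- par[2]
  sd <- max_sd * plogis(par[3])
  -sum(log(w * dtnorm(z, 0, 1) + (1 - w) * dtnorm(z, mu, sd)))
}

fit_mix <- function(z, max_sd) {
  fit <- optim(c(0, mean(z), 0), negll, z = z, max_sd = max_sd,
               control = list(maxit = 2000))
  c(prop_H0 = plogis(fit$par[1]), mu = fit$par[2],
    sd = max_sd * plogis(fit$par[3]))
}

fit_mix(z, max_sd = 1)    # H1 standard deviation capped at 1 (the BMM restriction)
fit_mix(z, max_sd = 10)   # H1 standard deviation effectively unrestricted

With the cap in place, the H0 component has to absorb the z-scores near the significance criterion; without the cap, the H1 component can cover the whole distribution, so the estimated proportion of false positives should drop accordingly.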


To demonstrate that the fatal flaw really is the dogmatic prior, we used a program written by Jerry Brunner that fits the BMM without the restriction on the standard deviation of the normal distribution under H1 (link). This model estimated only 3% false positives, placed the mean of the normal distribution for H1 at z = -1.22, and estimated the standard deviation as 2.10. The following figure shows that even this model fails to recover the actual distribution of z-scores, due to the improper conversion of two-sided p-values into z-scores. However, the key point is that removing the restriction on the standard deviation leads to a much lower estimate of false positives than the actual BMM with the dogmatic prior that limits the standard deviation to 1.

It Matters: False Positive Rate in Cognitive Psychology

The authors of BMM also applied their model to actual data. The most relevant dataset consists of the 855 t-tests from cognitive psychology journals (Wetzels, Matzke, Lee, Rouder, Iverson, & Wagenmakers, 2011). This set of t-values was not selected for significance. The next figure shows the p-curve for all t-values converted into two-sided p-values.

The main finding is that most p-values are below .05. Given the low density of non-significant p-values, it is hard to say much about their distribution based on visual inspection. We used the frequentist mixture model dBUMfit to analyze these data. Like BMM, this model aims to estimate the proportion of true null-hypotheses. dBUMfit provides an estimate of 0% false positives for these data. Thus, even the non-significant results reported in cognitive journals do not provide evidence for the null-hypothesis, and it is wrong to interpret non-significant results as evidence for the absence of an effect (McShane REF). Because the proportion of true null-hypotheses decreases with decreasing p-values, the estimate for the subset of significant results also has to be zero. Thus, there is no evidence in these data that a subset of p-values follows a uniform distribution, which would produce a flat distribution of p-values greater than .05.

Johnson (2013) limited the analysis of Wetzels's data to significant results. The p-curve for the significant results was also published by Gronau et al. (2017).

Johnson writes, quote,

The P values displayed in Fig. 3 presumably arise from two types of experiments: experiments in which a true effect was present and the alternative hypothesis was true, and experiments in which there was no effect present and the null hypothesis was true. For the latter experiments, the nominal distribution of P values is uniformly distributed on the range (0.0, 0.05). The distribution of P values reported for true alternative hypotheses is, by assumption, skewed to the left. The P values displayed in this plot thus represent a mixture of a uniform distribution and some other distribution.

Even without resorting to complicated statistical methods to fit this mixture, the appearance of this histogram suggests that many, if not most, of the P values falling above 0.01 are approximately uniformly distributed. That is, most of the significant P values that fell in the range (0.01, 0.05) probably represent P values that were computed from data in which the null hypothesis of no effect was true.

Based on Johnson’s visual inspection of the p-curve, we would estimate that up to 32% of Wetzels's significant t-tests were false positives, as 32% of the p-values are greater than .01.

Gronau et al. (2017) quote Johnson at length and state that their model was inspired by Johnson’s article: “Our Bayesian mixture model was inspired by a suggestion from Johnson (2013)” (p. 1230). In addition, they present the BMM as a confirmation of Johnson’s claim, with an estimate that 40% of the p-values stem from a true null-hypothesis. This estimate implies that even some p-values below .01 are assumed to be false positives.

The percentage of true null-hypotheses would be even larger for the full set of t-tests that includes non-significant results. With 31% non-significant results, this implies that 59% of all t-tests in cognitive psychology test a true null-hypothesis (.31 + .69 × .40 ≈ .59, if the non-significant results are attributed to true null-hypotheses).

In contrast, the mixture model that was fitted to all p-values returned an estimate of 0%. We believe this is a practically significant difference that requires an explanation.

We propose that the 40% estimate for significant p-values and the 59% estimate for all p-values are inflated estimates caused by restricting the standard deviation of z-scores to 1. To test this prediction, we fitted the significant p-values with an alternative Bayesian Mixture Model (altBMM). The only difference between BMM and altBMM is that altBMM allows the data to determine the standard deviation of z-scores for true hypotheses. The results confirmed our prediction: the estimate of false positives dropped from 40% to 11%, and the estimate of the standard deviation increased from 1 to 2.83.

Another way to show that restricting the variability of z-scores under H1 is a problem is to remove extreme p-values from the data set. In particle physics, a threshold of 5 sigma is used to rule out false positives. We can therefore set aside all p-values lower than (1-pnorm(5))*2. As we are removing cases that definitely stem from H1, the proportion of false positives in the remaining set should increase. However, the estimate provided by BMM drops from 40% to 22%. The reason is that removing extreme z-scores reduces the variability of z-scores and allows the standard normal distribution for H1 to cover lower z-scores. This estimate is also closer to the estimate of altBMM, which, fitted to the same restricted set, also produced a smaller estimate of the standard deviation of the normal distribution under H1 (11% false positives, SD = 1.17).
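
In R, the threshold and the resulting selection look like this (p_sig is a hypothetical vector holding the significant p-values):

# Two-sided p-value corresponding to a 5-sigma result
p_5sigma <- (1 - pnorm(5)) * 2   # about 5.7e-7
# Keep only p-values above the 5-sigma threshold
p_kept <- p_sig[p_sig > p_5sigma]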

It Matters: False Positive Rate in Social Psychology

Gronau et al. also conducted an analysis of focal hypothesis tests in social psychology. There are two reasons why the false positive rate in social psychology should be higher than in cognitive psychology. First, selecting focal tests selects significant results with weaker evidence than sets of studies that also include non-focal tests (e.g., manipulation checks). Second, social psychology is known to be less replicable than cognitive psychology (OSC, 2015). The BMM point estimate was 52%. Given the small set of studies, the 95% credibility interval ranged from 40% to 64%. To avoid the problem of small samples, we also fitted the model to Motyl et al.’s (2017) data, which contain 803 significant focal hypothesis tests from social psychology articles. The estimate of false positives was even higher, 64%, with a tight 95% credibility interval ranging from 59% to 69%. The point estimate decreases to 40% for the altBMM model, which also estimates the standard deviation to be 1.94. Limiting the set of p-values to those above the 5-sigma threshold also reduces the BMM estimate, to 48%, and the 95% credibility interval becomes wider (38% to 58%) because there is now more uncertainty about the allocation of the lower z-scores. These results suggest that a false positive rate of 60% is an inflated estimate of the false positive rate in social psychology. However, consistent with the outcome of replication studies, the false positive rate in social psychology is higher than in cognitive psychology, and high enough to warrant the claim of a replication crisis in social psychology.

Conclusion

We demonstrated that the BMM for p-curves overestimates the percentage of false positives because it imposes an unnecessary and dogmatic restriction on the variability of the probit-transformed p-values that are used to estimate the false positive rate. We also showed that this inflation has practical consequences for the assessment of false positive rates in cognitive psychology and social psychology. While the flawed BMM estimates are 40% and 60%, respectively, unbiased estimates are much lower, about 10% and 40%.

We therefore recommend our alternative Bayesian Mixture Model (altBMM), which does not impose the dogmatic prior, as a superior method for estimating the false positive rate.

Even though the rate of false positives in social psychology is below 50%, it is unacceptably high. It is even more disconcerting that this estimate counts only studies in which the true effect size is exactly zero. If we also consider true positives with practically insignificant effect sizes, the estimate would be even higher and probably exceed 50%. Thus, many published results in social psychology cannot be trusted and require new evidence from credible replication studies to provide empirical support for theories in social psychology.

The relatively low false positive rate for cognitive psychology does not imply that most results in cognitive psychology are replicable. It remains a concern that selection for significance produces many true positive results with low power that are difficult to replicate. Another concern is that effect size estimates are inflated and many results may be true positives with very small effect sizes that have no theoretical significance. For example, BMM estimates the percentage of significant results that were produced with an effect size of exactly 0, while it treats population effect sizes that deviate from zero only in the fourth or fifth decimal (d = 0.00001) as evidence that the null-hypothesis is false. From a practical point of view, it is more reasonable to define the null-hypothesis as a range of small effect sizes. Knowing the proportion of effect sizes that are exactly zero provides little added value (Lakens, 2017). Thus, statistical power and effect size estimates are more important than the distinction between true positives and false positives (Brunner & Schimmack, 2018).

An Invitation for Open Debate

We have a strong prior that our criticism of BMM is correct, but we are not going to make it a dogmatic prior that excludes the possibility that we are wrong. We are sharing our criticism of the dogmatic BMM openly to receive constructive criticism of our work. We also expect the authors of BMM to respond to our criticism and to point out flaws in our arguments if they can. To do so, the authors need to be open to the idea that their dogmatic prior on the standard deviation was unnecessary, a bad idea, and produces inflated estimates of false positives in the psychological literature. Ultimately, it is not important whether we are right or wrong; what matters is that we can trust the statistical tools that are used to evaluate the credibility of psychological science. If the BMM provides dramatically inflated estimates of false positives, its results are misleading and should not be trusted.

9 thoughts on “The Bayesian Mixture Model for P-Curves is Fundamentally Flawed”

  1. “The Bayesian Mixture Model is Fundamentally Flawed”.

    This post caught my attention. Here’s my 2¢.

    It seems to me that the title itself is fundamentally flawed because I don’t think that the post uncovers a flaw that is by any definition fundamental.

    A poorly chosen prior does not qualify the Bayesian Mixture Model for the adjective used in the post’s title. The critique itself uses a BMM to show that it is the prior that drives the incredible conclusion. The post also explains that the referenced paper does not provide a good justification for the dogmatic prior.

    I find this post to be a little clickbaity with an easily misconstrued point (I don’t necessarily fault the authors for the latter).

    1. If I choose the prior for effect sizes to be in the range from 0 to .5, I can never find that the data have an effect size greater than .5. Would anybody want to know about the results of such a model? Maybe I am a bit too strong in my opinion, but poor choice of priors is not helping the Bayesian cause. Put a normal distribution over d = .25 if you expect small effect sizes, but then the data can still show that the prior was off. Limiting a prior to a small range that excludes plausible values is not what Bayesians should do.

      1. I totally agree. Priors need to be discussed more openly and evaluated critically, especially in the context of BFs. For example, Cauchy on effect sizes is a quirky prior in my mind; either too much mass on meaninglessly small effects or too much mass on implausibly large effects. People are using this prior without discussing the issues.

        But this is hardly a criticism of the Bayesian approach, per se.

    2. On second thought, maybe there is some confusion because I used the term “Bayesian Mixture Model” to refer to one specific model, while the term applies to a whole set of models. I revised the blog (draft) to make clear that the term BMM applies only to the specific model that is critiqued.

      1. Yes. My thoughts exactly. The criticism appears to be a very good one but is also fairly narrowly focused on one use of BMM. I hedge here (i.e. “appears”) because I need to work through the examples you provide in more detail. It’s on my list!
