
The Bayesian Mixture Model for P-Curves is Fundamentally Flawed

Draft. Comments are welcome. To be submitted to Meta-Psychology

Authors: Ulrich Schimmack & Jerry Brunner
[Jerry Brunner is a professor in the statistics department of the University of Toronto and an expert on Bayesian mixture models. He wrote the R code to estimate the false discovery rate without a dogmatic prior that limits heterogeneity in the evidence against H0.]

There are many mixture models. We use the term Bayesian Mixture Model (BMM) to refer to the specific model proposed by Gronau et al. (2017). Criticism of their model does not generalize to other Bayesian mixture models.

Meta-Analysis in Psychology

The primary purpose of psychological science is to collect data and to interpret them. The focus is usually on internal consistency: are the data of a study consistent with theoretical predictions? The problem with this focus on single studies is that a single study rarely provides conclusive evidence for a hypothesis, and even more rarely against a hypothesis, let alone a complete theory.

The solution to this problem has been the use of meta-analyses. Meta-analyses aim to provide conclusive evidence by aggregating information from several studies. The most common form of meta-analysis in psychology converts information from single studies into estimates of standardized effect sizes and then draws conclusions about the effect size in the population.

There are numerous problems with effect-size meta-analyses in psychology. One problem is that a collection of convenience samples does not allow generalizing to a population. Another problem is that studies are often very heterogeneous, and it is not clear which of these studies can be replicated and which studies produced false positive results. The biggest problem is that original studies are selected for significance (Sterling, 1959; Sterling et al., 1995). As a result, effect size estimates in meta-analyses are inflated and can provide false evidence for effects that do not exist.

The problem of selection for significance has led to the development of a new type of meta-analysis that takes selection for significance into account. The Bayesian Mixture Model for Significant P-Values is one of these models (Gronau, Duizer, Bakker, & Wagenmakers, 2017). Compared to well-established mixture models that assume all data are available (Allison et al., 2002), the BMM uses only significant p-values. In this regard, the model is similar to p-curve (Simonsohn et al., 2014), p-uniform (van Assen et al., 2014), and z-curve (Brunner & Schimmack, 2018).

Henceforth, we will use the acronym BMM to refer to Gronau et al.’s model, but we want to clarify that any criticism of their model is not a criticism of Bayesian Mixture Models in general. In fact, we present an alternative mixture model that fixes the fundamental problem of their model.

A Meta-Analysis of P-values

Like p-curve (Simonsohn et al., 2014), BMM is a meta-analysis of significant p-values. It is implied, but never explicitly stated, that the p-values in a p-curve are p-values from two-sided tests. That is, the direction of an effect is not reflected in the test statistic. The test statistic could be one that ignores the sign of an effect (F-values, chi-square) or the absolute value of a directional test statistic (absolute t-values, absolute z-scores). As a result, p-values of 1 correspond to test statistics and effect sizes of zero, and p-values decrease towards zero as test statistics increase towards infinity.

In a meta-analysis of p-values, sampling error will produce some variation in test statistics and p-values even if the population effect size is zero. It is well known in statistics that the distribution of p-values in this scenario is uniform. This can be modeled with a beta distribution with shape parameters a = 1 and b = 1.

However, if all of the tests tested a true hypothesis, the distribution of p-values is monotonically decreasing with increasing p-values. This can also be modeled with a beta distribution by setting the first shape parameter a to a value less than 1. The steeper the decrease, the stronger the evidence against the null-hypotheses. Figure 1 illustrates this p-curve with a shape parameter of a = .5.

We see that both distributions contribute p-values close to 1. However, we also see that the uniform distribution based on true null-hypotheses contributes more p-values close to 1. The reason is simply that p-values are more likely to be close to 0 when the null-hypothesis is false.
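To make the two shapes concrete, here is a minimal simulation sketch in Python (we use the standard library's `random.betavariate` for the beta distribution; the shape values mirror the ones discussed above, and the exact proportions are ours, not figures from the paper):

```python
import random
from statistics import fmean

random.seed(1)

# Under H0, a Beta(1, 1) distribution is simply the uniform distribution.
p_h0 = [random.betavariate(1, 1) for _ in range(100_000)]

# Under H1, a shape parameter a < 1 (here a = .5, as in Figure 1)
# piles p-values up near zero.
p_h1 = [random.betavariate(0.5, 1) for _ in range(100_000)]

# H1 produces far more small p-values than H0, and correspondingly
# fewer p-values near 1.
share_h0 = fmean(p < .05 for p in p_h0)  # ≈ .05
share_h1 = fmean(p < .05 for p in p_h1)  # ≈ .22, i.e. sqrt(.05)
```

The Beta(a, 1) CDF is p^a, so under a = .5 the expected share of p-values below .05 is sqrt(.05) ≈ .22, more than four times the uniform rate.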

The First Flaw in the Bayesian Mixture Model

Like z-curve, BMM does not try to fit a model to the distribution of p-values. Rather, it first transforms p-values into z-scores, which the authors call probit transformed p-values. To convert p-values into z-scores, it is important to take into account whether p-values are one-sided or two-sided.

For one-sided p-values, a value of .5 corresponds to a z-score of 0, while p-values of 0 correspond to a z-score of infinity, and p-values of 1 correspond to a z-score of minus infinity. However, p-values in a p-curve analysis are two-tailed, and p-values of 0 can correspond to a z-score of infinity or minus infinity, while a p-value of 1 corresponds to a z-score of zero.

The proper formula to transform two-sided p-values into z-scores is -qnorm(p/2). To illustrate, take a z-score of 1.96 and compute the two-sided p-value using the formula (1-pnorm(abs(z)))*2. We obtain a p-value of .05 because 1.96 is the critical value for a two-sided z-test with alpha = .05. We can do the same with a z-score of -1.96. Again, we obtain a value of .05. We can now use the formula -qnorm(p/2) with p = .05 and obtain z = 1.96. This shows that the test statistic is about two standard deviations from a value of zero. With decreasing p-values, the evidence against the null-hypothesis is stronger. For example, p = .005 gives us a z-score of 2.81, nearly three standard deviations away from zero.
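The round trip described above can be checked directly; here is a minimal sketch in Python (the standard library's `NormalDist` plays the role of R's `pnorm`/`qnorm`):

```python
from statistics import NormalDist

phi = NormalDist()  # standard normal: cdf ~ pnorm, inv_cdf ~ qnorm

def two_sided_p(z):
    """Two-sided p-value, mirroring (1 - pnorm(abs(z))) * 2 in R."""
    return (1 - phi.cdf(abs(z))) * 2

def p_to_z(p):
    """Proper back-transformation, mirroring -qnorm(p / 2) in R."""
    return -phi.inv_cdf(p / 2)

p_pos = two_sided_p(1.96)   # ≈ .05
p_neg = two_sided_p(-1.96)  # ≈ .05: the sign does not matter
z_back = p_to_z(p_pos)      # ≈ 1.96: about two standard deviations from zero
```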

The BMM model uses qnorm(p) to convert two-sided p-values into z-scores. The omission of the minus sign is just a minor nuisance: negative values now show evidence against the null-hypothesis. However, using p rather than p/2 has more dramatic consequences. Based on this formula, a p-value of .5 is converted into a z-score of zero, and any p-value greater than .5 produces a positive value that would suggest evidence against the null-hypothesis in the opposite direction. A p-value of 1 is converted into a value of infinity.

The only reason why this flawed conversion of p-values does not produce major problems is that only significant p-values are used. Thus, all p-values are well below .5, so the problems never occur. As a result, z-scores of 1.96 that produce a p-value of .05 are converted into z-scores of -1.65, which is the critical value for a one-sided z-test.
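The consequence of using qnorm(p) instead of -qnorm(p/2) is easy to verify; a short sketch (again with Python's `NormalDist` standing in for R's `qnorm`):

```python
from statistics import NormalDist

phi = NormalDist()

p = 0.05  # two-sided p-value produced by z = 1.96 or z = -1.96

# BMM's qnorm(p): ≈ -1.645, the critical value of a ONE-sided test.
z_bmm = phi.inv_cdf(p)

# Proper -qnorm(p / 2): ≈ 1.96, the two-sided critical value.
z_proper = -phi.inv_cdf(p / 2)
```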

The Fatal Flaw of the Bayesian Mixture Model

The fatal flaw of BMM is the prior for the standard deviation of the distribution of z-scores produced by true hypotheses (i.e., false null-hypotheses). Theoretically, this standard deviation can range from small to large values, depending on the strength of the evidence against the null-hypothesis and the heterogeneity in effect sizes and sample sizes across studies. However, the BMM constrains the prior distribution of the standard deviation to a range from 0 to 1. Importantly, this restriction is applied to the theoretical distribution in the model, not to the truncated distribution that is produced by selection for significance.

It should also be noted that it is highly unusual in Bayesian statistics to use priors to restrict the range of possible values. Priors that set the probability of certain parameters to zero are known as dogmatic priors. The problem with dogmatic priors is that empirical data can never correct a bad prior. Thus, they continue to have an influence on results even in very large sets of data.

The authors justify their use of a dogmatic prior with a single sentence. They claim that values greater than 1 “make the implausible prediction that p values near 1 are more common under H1 than under H0” (p. 1226).

Figure 2 makes it clear that this claim is blatantly false. With a monotonically decreasing distribution of p-values when the null-hypothesis is false, the proportion of studies with two-sided p-values close to 1 will always be less than the proportion of p-values close to 1 produced by the uniform distribution when the null-hypothesis is true. It is therefore totally unnecessary to impose a restriction on the prior for the standard deviation of z-scores that are based on transformed two-sided p-values.

If standard deviations of z-scores were always less than 1, imposing an unnecessary constraint on this parameter would not be a problem. However, the standard deviation of z-scores can easily be greater than 1. This is illustrated when the p-curves in Figure 2 are converted into z-curves, using the improper transformation of p-values with qnorm(p). The standard deviation for a = .9 is just a little higher than 1. For a = .5 (green) the standard deviation is 1.25, and for a = .1 (blue) the standard deviation is 2.32. In general, the standard deviation increases from 1 to values greater than 1 with decreasing values of the shape parameter a of the beta function.
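This can be checked by simulation; here is a minimal sketch in Python (we assume, as above, that two-sided p-values under H1 follow a Beta(a, 1) distribution and are converted with the improper qnorm(p); `NormalDist.inv_cdf` stands in for R's `qnorm`):

```python
import random
from statistics import NormalDist, stdev

random.seed(1)
phi = NormalDist()

def sd_of_probit(a, n=50_000):
    """SD of qnorm(p) when p ~ Beta(a, 1); a = 1 is the uniform H0 case."""
    return stdev(phi.inv_cdf(random.betavariate(a, 1)) for _ in range(n))

sd_h0 = sd_of_probit(1.0)  # ≈ 1.0: under H0 the probit scores are standard normal
sd_a5 = sd_of_probit(0.5)  # > 1 (the post reports 1.25 for a = .5)
sd_a1 = sd_of_probit(0.1)  # well above 2 (the post reports 2.32 for a = .1)
```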

The next figure shows how limiting the standard deviation of the normal distribution under H1 inflates the estimates of the proportion of false positives (H0 is true). BMM tries to model the observed distribution of z-scores with two standard normals. The center of the standard normal for H0 is fixed at zero. The center of the standard normal for H1 can move to maximize model fit. No model will have good fit because the data were not generated by two standard normal distributions. However, the best fit will be achieved by letting H0 account for the low z-scores and using H1 to account for the high z-scores. As a result, BMM will produce estimates of high proportions of false positives without any false positives in the data.

This can easily be demonstrated by submitting data that were generated with a beta distribution with a = .2 to the shiny app for the BMM model. The output is shown in the next figure. The model returns the estimate that one third, 33.2%, of the data were false positives. It is also very confident in this estimate, with a credibility interval ranging from 27% to 39%. More data would only tighten this interval and not dramatically alter the point estimate. The reason for this false positive result (pun intended) is the dogmatic prior that limits the standard deviation of the normal distribution.


To demonstrate that the fatal flaw is really the dogmatic prior, we used a program written by Jerry Brunner that fits the BMM without the restriction on the standard deviation of the normal distribution under H1 (link). The model estimated only 3% false positives, placed the mean of the normal distribution for H1 at z = -1.22, and estimated the standard deviation as 2.10. The following figure shows that even this model fails to recover the actual distribution of z-scores, due to the wrong conversion of two-tailed p-values into z-scores. However, the key point is that removing the restriction on the standard deviation leads to a much lower estimate of false positives than the actual BMM model with the dogmatic prior that limits the standard deviation to 1.

It Matters: False Positive Rate in Cognitive Psychology

The authors of BMM also applied their model to actual data. The most relevant dataset is the 855 t-tests from cognitive psychology journals (Wetzels, Matzke, Lee, Rouder, Iverson, & Wagenmakers, 2011). This set of t-values was not selected for significance. The next figure shows the p-curve for all t-values converted into two-sided p-values.

The main finding is that most p-values are below .05. Given the low density of non-significant p-values, it is hard to say anything about the distribution of these p-values based on visual inspection of the data. We used the frequentist mixture model, dBUMfit, to analyze these data. Just like BMM, the model aims to estimate the proportion of true null-hypotheses. dBUMfit estimates that 0% of the results are false positives. Thus, even the non-significant results reported in cognitive journals do not provide evidence for the null-hypothesis, and it is wrong to interpret non-significant results as evidence for the absence of an effect (McShane REF). As the proportion of true null-hypotheses decreases with decreasing p-values, it is clear that the estimate for the subset of significant results also has to be zero. Thus, there is no evidence in these data that a subset of p-values has a uniform distribution that leads to a flat distribution of p-values greater than .05.

Johnson (2013) limited the analysis of Wetzels's data to significant results. The p-curve for the significant results was also published by Gronau et al. (2017).

Johnson writes:

The P values displayed in Fig. 3 presumably arise from two types of experiments: experiments in which a true effect was present and the alternative hypothesis was true, and experiments in which there was no effect present and the null hypothesis was true. For the latter experiments, the nominal distribution of P values is uniformly distributed on the range (0.0, 0.05). The distribution of P values reported for true alternative hypotheses is, by assumption, skewed to the left. The P values displayed in this plot thus represent a mixture of a uniform distribution and some other distribution.

Even without resorting to complicated statistical methods to fit this mixture, the appearance of this histogram suggests that many, if not most, of the P values falling above 0.01 are approximately uniformly distributed. That is, most of the significant P values that fell in the range (0.01, 0.05) probably represent P values that were computed from data in which the null hypothesis of no effect was true.

Based on Johnson’s visual inspection of the p-curve, we would estimate that up to 32% of Wetzels's significant t-tests were false positives, as 32% of the p-values are greater than .01.

Gronau et al. (2017) quote Johnson at length and state that their model was inspired by Johnson’s article. “Our Bayesian mixture model was inspired by a suggestion from Johnson (2013)” (p. 1230). In addition, they present the BMM as a confirmation of Johnson’s claim with an estimate that 40% of the p-values stem from a true null-hypothesis. This estimate implies that even some p-values less than .01 are assumed to be false positives.

The percentage of true null-hypotheses would be even larger for the full set of t-tests that includes non-significant results. With 31% non-significant results, the implication would be that 59% of all t-tests in cognitive psychology tested a true null-hypothesis.

In contrast, the mixture model that was fitted to all p-values returned an estimate of 0%. We believe that this is a practically significant difference that requires explanation.

We proposed that the 40% estimate for significant p-values and the 59% estimate for all p-values are inflated estimates that are caused by restricting the standard deviation of z-scores to 1. To test this prediction, we fitted an alternative Bayesian Mixture Model (altBMM) to the significant p-values. The only difference between BMM and altBMM is that altBMM allows the data to determine the standard deviation of z-scores for true hypotheses. Results confirmed our predictions. The estimate of false positive results dropped from 40% to 11% and the estimate of the standard deviation increased from 1 to 2.83.

Another way to show that restricting the variability of z-scores under H1 is a problem is to remove extreme p-values from the data set. In particle physics, a threshold of 5 sigma is used to rule out false positives. We can therefore set aside all p-values lower than (1-pnorm(5))*2. As we are removing cases that must definitely stem from H1, the proportion of false positives in the limited set should increase. However, the estimate provided by BMM drops from 40% to 22%. The reason is that removing extreme z-scores reduces the variability of z-scores and allows the standard normal distribution for H1 to cover lower z-scores. This estimate is also closer to the estimate of altBMM for the same truncated data, which again produces a smaller estimate of the standard deviation of the normal distribution under H1 (11% false positives, SD = 1.17).
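For reference, the 5-sigma cutoff used here is a tiny p-value; a quick sketch in Python (`NormalDist.cdf` standing in for R's `pnorm`):

```python
from statistics import NormalDist

phi = NormalDist()

# Two-sided p-value of a 5-sigma result, mirroring (1 - pnorm(5)) * 2 in R.
threshold = (1 - phi.cdf(5)) * 2   # ≈ 5.7e-7

# Any p-value below this threshold is set aside as a near-certain true positive.
```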

It Matters: False Positive Rate in Social Psychology

Gronau et al. also conducted an analysis of focal hypothesis tests in social psychology.
There are two reasons why the false positive rate in social psychology would be higher than in cognitive psychology. First, selecting focal tests selects significant results with weaker evidence compared to sets of studies that also include non-focal tests (e.g., manipulation checks). Second, social psychology is known to be less replicable than cognitive psychology (OSC, 2015).

The BMM point estimate was 52%. Given the small set of studies, the 95% credibility interval ranged from 40% to 64%. To avoid the problem of small samples, we also fitted the model to Motyl et al.’s (2017) data that contained 803 significant focal hypothesis tests from social psychology articles. The estimate of false positives was even higher, 64%, with a tight 95% credibility interval ranging from 59% to 69%. The point estimate decreases to 40% for the altBMM model. This model also estimates the standard deviation to be 1.94. Limiting the set to p-values above the 5-sigma threshold also reduces the BMM estimate to 48%, and the 95% CI gets wider (38% to 58%) because there is now more uncertainty about the allocation of lower z-scores.

These results suggest that a false positive rate of 60% is an inflated estimate of false positives in social psychology. However, consistent with the outcome of replication studies, the false positive rate in social psychology is higher than in cognitive psychology, and high enough to warrant the claim of a replication crisis in social psychology.


Conclusion

We demonstrated that the BMM for p-curve overestimates the percentage of false positives because it imposes an unnecessary and dogmatic restriction on the variability of probit-transformed p-values that are used to estimate the false positive rate. We also found that this inflation has practical consequences for the assessment of false positive rates in cognitive psychology and social psychology. While the flawed BMM estimates are 40% and 60%, respectively, unbiased estimates are much lower, about 10% and 40%.

We therefore recommend our alternative Bayesian Mixture Model (altBMM), which does not impose the dogmatic prior, as a superior method for the estimation of the false positive rate.

Even though the rate of false positives in social psychology is below 50%, the rate of false positives is unacceptably high. It is even more disconcerting that this estimate is limited to studies where the true effect size is so small that it is practically zero. If we also consider true positives with practically insignificant effect sizes, the estimate would be even higher and probably exceed 50%. Thus, many published results in social psychology cannot be trusted and require new evidence from credible replication studies to provide empirical support for theories in social psychology.

The relatively low false positive rate for cognitive psychology does not imply that most results in cognitive psychology are replicable. It remains a concern that selection for significance produces many true positive results with low power that are difficult to replicate. Another concern is that effect size estimates are inflated and many results may be true positives with very small effect sizes that have no theoretical significance. For example, BMM estimates the percentage of significant results that were produced with an effect size of exactly 0, while it treats population effect sizes with non-zero values in the fourth or fifth decimal (d = 0.00001) as evidence that the null-hypothesis is false. From a practical point of view, it is more reasonable to use a range of small effect sizes to define the null-hypothesis. Knowing the proportion of effect sizes that are exactly zero provides little added value (Lakens, 2017). Thus, statistical power and effect size estimates are more important than the distinction between true positives and false positives (Brunner & Schimmack, 2018).

An Invitation for Open Debate

We have a strong prior that our criticism of BMM is correct, but we are not going to make this a dogmatic prior that excludes the possibility that we are wrong. We are sharing our criticism of the dogmatic BMM openly to receive constructive criticism of our work. We also expect the authors of BMM to respond to our criticism and to point out flaws in our arguments if they can. To do so, the authors need to be open to the idea that their dogmatic prior on the standard deviation was unnecessary, a bad idea, and produces inflated estimates of false positives in the psychological literature. Ultimately, it is not important whether we are right or wrong, but we need to be able to trust statistical tools that are used to evaluate the credibility of psychological science. If the BMM provides dramatically inflated estimates of false positives, the results are misleading and should not be trusted.

One-tail or two-tails: That is the question

In this mini-tutorial, I discuss the relationship between p-values and z-scores. Although the standard normal distribution is a staple of intro stats, it plays a minor role when researchers conduct actual research with t-tests and F-tests, and researchers often only look up test statistics and p-values without thinking about the underlying sampling distributions of their test statistics. A better understanding of p-values and z-scores is needed because new statistical methods rely on meta-analyses of p-values or z-scores to make claims about the quality of psychological research.

Basic Introduction

Let’s assume that researchers use z-tests to analyze their data and convert z-scores into p-values to determine whether a result is statistically significant, p < .05. In the statistics program R, the conversion of a z-score into a p-value uses the command pnorm(z, mean, sd). For significance testing we want to know how extreme the observed z-score is relative to the null-hypothesis, which is defined by a standard normal distribution with mean = 0 and sd = 1. So, we would use the command pnorm(z, mean=0, sd=1). Because the standard normal is the default assumption, we can also simply request the p-value with pnorm(z).

However, using this command will produce some strange results. For example, if we observed a z-score of 2.5, we obtain a p-value of .99, which would suggest that our result is not significant, p > .05. The problem is that the default option in R is to provide the area under the standard normal distribution on the left side of the z-score. So, we see that 99% of the distribution is on the left side, the lower tail, and only 1% is on the right side, the upper tail. With only 1% in the upper tail, we can claim a significant result, p < .05.

pnorm(2.5)
[1] 0.9937903

There are various options to obtain the p-value we really want. One option is to write pnorm(2.5, lower.tail=FALSE), which gives us p = .006.

pnorm(2.5, lower.tail=FALSE)
[1] 0.006209665

A simpler option is to make use of the symmetry of the standard normal distribution and simply turn the positive z-score into a negative z-score.

pnorm(-2.5)
[1] 0.006209665

Yet another option is to subtract the lower tail from 1.

1-pnorm(2.5)
[1] 0.006209665

So, we see that a z-score of 2.5 is statistically significant with p < .05. However, z-scores are two-sided; they can have positive and negative values. What if we had observed a z-score of -2.5? Would that also be significant? As we can see, the answer is no. The reason is that we are conducting one-tailed tests, where only positive deviations from H0 can be used to reject the null-hypothesis.

pnorm(-2.5, lower.tail=FALSE)
[1] 0.9937903

Typically, psychologists prefer two-tailed tests, which is the default for F-tests that ignore the sign of an effect. To make the sign irrelevant, we can simply use the absolute z-score to obtain our upper tail p-value.

pnorm(abs(-2.5), lower.tail=FALSE)
[1] 0.006209665

Now we get the same p-value that we obtained for z = 2.5. However, checking both tails doubles the risk of a type-I error. Therefore, we have to double the p-value, if we want to conduct a two-tailed test.

pnorm(abs(-2.5), lower.tail=FALSE)*2
[1] 0.01241933
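The one-tailed and two-tailed computations above can be collected in one place; here is a short sketch in Python (`NormalDist.cdf` stands in for R's `pnorm`):

```python
from statistics import NormalDist

phi = NormalDist()

def upper_tail_p(z):
    """One-tailed (upper-tail) p-value, as pnorm(z, lower.tail=FALSE) in R."""
    return 1 - phi.cdf(z)

def two_tailed_p(z):
    """Two-tailed p-value: double the tail area beyond |z|."""
    return 2 * (1 - phi.cdf(abs(z)))

p1 = upper_tail_p(2.5)    # ≈ 0.0062: significant, one-tailed
p2 = upper_tail_p(-2.5)   # ≈ 0.9938: not significant, one-tailed
p3 = two_tailed_p(-2.5)   # ≈ 0.0124: significant, two-tailed
```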

Multiple p-values

Psychologists are familiar with effect size meta-analysis. However, before effect size meta-analysis became common, meta-analyses were carried out with p-values or z-scores. Fisher not only popularized p-values, he also introduced a method to combine p-values from multiple studies. Meta-analyses of p-values have experienced a renaissance in psychology with the introduction of p-curve, which is essentially a histogram of statistically significant p-values. Importantly, p-curve is based on two-tailed p-values, as I will demonstrate below.

Assume that we have a large set of z-tests from 1000 studies, but all 1000 studies tested a true null-hypothesis. As a result, we would expect that the 1000 z-scores follow the sampling distribution of a standard normal distribution.

z = rnorm(1000)

After we convert the z-scores into ONE-TAILED p-values, we see that they follow a uniform distribution.

p = pnorm(-z)
hist(p)

The same is true for TWO-TAILED p-values.

p = (1-pnorm(abs(z)))*2
hist(p)

However, this is only true for the special case, when the null-hypothesis is true. When the null-hypothesis is false, the histograms of p-values (p-curves) differ dramatically.

z = rnorm(1000,1,3)

p = pnorm(-z)
hist(p)

For the one-tailed p-values the distribution is bimodal. The reason is that null-effects are represented by p-values of .5. As we simulated many extreme positive and extreme negative deviations from 0, we have more p-values in the tails, close to 0 and close to 1, than p-values in the middle. Evidently, p-values are not just decreasing from 0 to 1.

However, if we compute two-tailed p-values, the distribution of p-values shows decreasing frequencies from 0 to 1.

p = (1-pnorm(abs(z)))*2
hist(p)

In sum, it is important to think about the tails of a p-value. One-tailed p-values should be used when the sign of a test is meaningful, for example in a meta-analysis of studies that tested the same hypothesis. In this case, we need to obtain p-values from test statistics that have a direction (z-scores, t-values) and we cannot use test statistics that remove information about the direction of a test (F-values, chi-square values). However, if we do not care about the sign of an effect, we should use two-tailed p-values because we only care about the strength of evidence against the null-hypothesis.

Going from p-values to z-scores

Meta-analyses of p-values can use p-values that are based on different test statistics (t-tests, F-tests, etc.). The reason is that all p-values have the same meaning. A p-value of .02 from a z-test provides the same information as a p-value of .02 from a t-test. However, p-values have an undesirable distribution. A solution to this problem is to convert p-values into values that follow a distribution with more desirable characteristics. The most desirable distribution is the standard normal distribution. Thus, we can use z-scores as a common metric to compare results of different studies (Stouffer et al., 1938).
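To illustrate why a common metric is useful, here is a minimal sketch of Stouffer's classic method of combining one-tailed p-values via z-scores (the three p-values are hypothetical, chosen only for illustration):

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist()

def stouffer(p_values):
    """Combine one-tailed p-values by converting each to a z-score,
    summing, and renormalizing; the result is standard normal under H0."""
    z = [phi.inv_cdf(1 - p) for p in p_values]  # upper-tail p -> z
    return sum(z) / sqrt(len(z))

# Three modest one-tailed p-values combine into stronger evidence.
z_combined = stouffer([0.10, 0.06, 0.04])   # ≈ 2.65
p_combined = 1 - phi.cdf(z_combined)        # ≈ .004
```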

However, we have to think again about the tails of p-values when we convert p-values into z-scores. If all p-values are one-tailed p-values, we can simply convert our upper-tail p-values into z-scores using the qnorm command. Simulating a uniform distribution of p-values and converting the p-values into z-scores gives us the standard normal distribution centered over 0.

p = runif(1000,0,1)
z = qnorm(p,lower.tail=FALSE)
hist(z)

However, things are more complicated with TWO-TAILED p-values, as shown in the diagram below. First, we simulate a set of z-tests as before. We then convert the results into two-tailed p-values and convert them back. The back-transformation has to take into account that the p-values were doubled for two-tailed testing, so we need to halve the p-values before we convert from p to z.

z = rnorm(1000,1,3)
p = (1-pnorm(abs(z)))*2
qz = -qnorm(p/2)

However, we see that the distribution of the original z-scores (black) differs from the distribution of the z-scores obtained from two-tailed p-values. The difference is that the converted z-scores do not have negative values. They are ABSOLUTE z-scores because the computation of two-tailed p-values erased information about the sign of a test. A low p-value could have been obtained from a high positive or a high negative z-score. To see this, we can compare the converted z-scores to the absolute values of the original z-scores.


In sum, if we convert one-tailed p-values into z-scores, we retain information about the sign of an effect and the sampling error follows a standard normal distribution. However, if we use two-tailed p-values and convert them into z-scores, the distribution of z-scores is truncated at zero and only positive z-scores can be observed. Sampling error no longer follows a standard normal distribution.
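The loss of the sign can be verified in a few lines; here is a sketch of the same round trip in Python (we use a smaller spread than the rnorm(1000,1,3) example above so that extreme tail p-values stay away from floating-point underflow):

```python
import random
from statistics import NormalDist

random.seed(1)
phi = NormalDist()

# Simulated z-tests with a true effect (mean 1), as in the R example.
z = [random.gauss(1, 1.5) for _ in range(1000)]

# Two-tailed p-values, then the proper back-transformation -qnorm(p / 2).
p  = [2 * (1 - phi.cdf(abs(zi))) for zi in z]
qz = [-phi.inv_cdf(pi / 2) for pi in p]

# The recovered scores equal the ABSOLUTE original z-scores:
# computing a two-tailed p-value erased the sign.
max_err = max(abs(qi - abs(zi)) for qi, zi in zip(qz, z))
```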

A Minor Technical Problem

As noted before, p-values have an undesirable distribution. A z-score of 1 corresponds to a two-tailed p-value of p = .32. A z-score of 2 corresponds to a p-value of .05. A z-score of 3 corresponds to a p-value of p = .003. A z-score of 4 corresponds to a p-value of .00006. A z-score of 5 corresponds to a p-value of .0000006. The number of zeros behind the decimal point increases quickly, and at some point rounding errors make it impossible to convert p-values into z-scores.

p = (1-pnorm(8.3))*2
-qnorm(p/2)
[1] Inf

All p-values for z-scores greater than 8.2 are treated as 0 and are converted into a z-score of infinity. To avoid this problem, R provides the option to use log p-values. Using the log.p option makes it possible to convert a z-score of 10 into a p-value and to retrieve the value of 10 by converting the log of the p-value back into a z-score.

p = pnorm(10,lower.tail=FALSE)*2
[1] 0.00000000000000000000001523971
-qnorm(log(p) - log(2), log.p=TRUE)
[1] 10

Dos and Don’ts

Just like meta-analysis of p-values has seen a renaissance, meta-analysis of z-scores has also seen renewed attention. Jerry Brunner and I developed z-curve to estimate mean power of a set of studies that were selected for significance and have heterogeneity in power as a result of heterogeneity in sample sizes and effect sizes (Brunner & Schimmack, 2018). Z-curve first converts all observed test-statistics into TWO-TAILED p-values and then converts two-tailed p-values into ABSOLUTE Z-SCORES. The method then fits several TRUNCATED standard normal distributions to the data to obtain estimates of statistical power.

Another method is the Bayesian Mixture Model (BMM) that aims to estimate the percentage of false positives in a set of studies. However, the BMM model has several deficiencies in the conversion process from p-values to z-scores.

First, it uses the formula for ONE-TAILED p-values when the inputs are TWO-TAILED p-values.

Second, it converts upper-tail p-values into z-scores using the formula for lower-tail p-values.

z = qnorm(p)

As a result, the distribution of z-scores obtained from p-values that were produced by z-tests differs from the distribution of the actual z-tests.

z = rnorm(10000,1)
p = (1-pnorm(abs(z)))*2

qz = qnorm(p) #BMM transformation

qz = -qnorm(p) #FLIPPED BMM transformation

Even if we correct for the sign error and flip the distribution, the reproduced distribution differs from the original distribution because TWO-TAILED p-values are converted into z-scores using the formula for ONE-TAILED p-values.

In conclusion, it is important to think about the tails of p-values. One-tailed p-values are not identical to two-tailed p-values. Using the formula for one-tailed p-values with two-tailed p-values distorts the information that is provided by the actual data. Two-tailed p-values do not contain information about the sign of an effect. Converting them into z-scores produces absolute z-scores that reflect the strength of evidence against the null-hypothesis without information about the direction of an effect.


P-values and z-scores contain valuable information about the results of studies. Both provide a common metric to compare results of studies that used different test statistics or differed in sample sizes (and degrees of freedom). Meta-analysts can use standardized effect sizes, p-values, or z-scores. P-values and z-scores can be transformed into each other without further information about sample sizes. However, to convert them properly, we have to take into account whether p-values tested a one-tailed or a two-tailed hypothesis. For one-tailed tests, a p-value of .5 corresponds to the null-hypothesis, with values of 0 and 1 corresponding to very strong (infinite) evidence against it in one direction or the other. For two-tailed tests, a p-value of 1 corresponds to the null-hypothesis and a value of 0 corresponds to infinite evidence against the null-hypothesis. For z-scores, a value of 0 corresponds to the null-hypothesis and increasing values in either direction provide evidence against it.

Thus, one-tailed p-values correspond to signed z-scores, with p = 0 corresponding to z = inf, p = .5 corresponding to z = 0, and p = 1 corresponding to z = -inf. In contrast, two-tailed p-values only provide information about strength of evidence: a p-value of 1 corresponds to z = 0, while a p-value of 0 corresponds to z = inf. Any meta-analysis of z-scores requires transforming test statistics into p-values to create a common metric. The conversion of these p-values into z-scores should take into account whether they are one-tailed or two-tailed. Converting two-tailed p-values into z-scores with the formula for one-tailed p-values may lead to false conclusions in a meta-analysis of z-scores.
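These correspondences are easy to verify numerically. A minimal check (Python standard library, where NormalDist().inv_cdf plays the role of R's qnorm):

```python
from statistics import NormalDist

nd = NormalDist()

# One-tailed: p = .5 corresponds to z = 0
z_one = -nd.inv_cdf(0.5)

# Two-tailed: p = 1 corresponds to z = 0, using -qnorm(p/2)
z_two = -nd.inv_cdf(1.0 / 2)

# Two-tailed p-values near 0 correspond to large absolute z-scores
z_big = -nd.inv_cdf(1e-20 / 2)
```

The endpoints (p = 0 mapping to infinite z) cannot be reached in floating point, but the pattern is visible well before that.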

S.O.S. We Need Open Reviews

I wrote a commentary that made a very simple point. A published model assumed that the variance of z-scores is typically less than 1. I pointed out that this is not a reasonable assumption because the standard deviation of z-scores is at least one and often greater than 1, when studies vary in effect sizes, sample sizes, or both. This commentary was rejected. One reviewer even provided R-Code to make his or her case. Here is my rebuttal.

Here is the r-code provided by the reviewer. We see SDs of 0.59, 0.49 and 0.46. Based on these results, the reviewer thinks that setting a prior to a range of values between 0 and 1 is reasonable.

Let’s focus on the example that the reviewer claims is realistic for a p-value distribution with 80% power. The reviewer simulates this scenario with a beta distribution with shape parameters 1 and 31. The Figure shows the implied distribution of p-values. What is most notable is that p-values greater than .38 are essentially absent; the effective maximum p-value is .38.

In this figure 80% of p-values are below .05 and 20% are above .05. This is why the reviewer suggests that the pattern of observed p-values corresponds to a set of studies with 80% power.

However, the reviewer does not consider whether this distribution of p-values could actually arise from a set of studies in which p-values are determined by a non-centrality parameter and sampling error that follows a known sampling distribution.

To simulate studies with 80% power, we can simply use a standard normal distribution centered over 2.80. Sampling error will produce z-scores greater and smaller than the non-centrality parameter of 2.80. Moreover, we already know that the standard deviation of these test statistics is 1 because z-scores have the standard normal distribution as a sampling distribution (a point made in my commentary and ignored by the reviewers and the editor).

We can now compute the two-tailed p-values for each z-test and plot the distribution of p-values. Figure 2 shows the actual distribution in black and the reviewer’s beta distribution in red.

It is visible that the actual distribution has many more p-values that are very close to zero, which correspond to high z-scores. We can now transform the p-values into z-scores using the reviewer’s formula (for one-tailed tests).

y = qnorm(p) # reviewer's formula for one-tailed p-values
mean(y) #-2.54
sd(y) #1.11

We see that the standard deviation of these z-scores is greater than 1.

Using the correct formula for two-tailed p-values, we of course get the result that we already know to be true.

y = -qnorm(p/2)
mean(y) #2.80
sd(y) #1.00
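Both computations can be replicated end-to-end. The sketch below (Python standard library, standing in for the R code; the seed is arbitrary, and math.erfc computes the two-tailed p-value stably) reproduces the key point: the reviewer's one-tailed formula inflates the spread of the z-scores, while -qnorm(p/2) returns a standard deviation of about 1:

```python
from statistics import NormalDist, mean, stdev
from math import erfc, sqrt
import random

nd = NormalDist()
random.seed(123)

# z-tests from studies with 80% power: ncp = 2.80, sampling error sd = 1
z = [random.gauss(2.80, 1) for _ in range(10_000)]
p = [erfc(abs(v) / sqrt(2)) for v in z]       # two-tailed p-values

y_rev = [nd.inv_cdf(v) for v in p]            # reviewer's formula: qnorm(p)
y_cor = [-nd.inv_cdf(v / 2) for v in p]       # correct formula: -qnorm(p/2)

print(round(mean(y_cor), 2), round(stdev(y_cor), 2), round(stdev(y_rev), 2))
```

With the correct conversion the mean lands near 2.80 and the standard deviation near 1, as the sampling distribution of a z-test dictates.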

It should be obvious that the reviewer made a mistake by assuming we can simulate p-value distributions with any beta-distribution. P-values cannot assume any distribution because the actual distribution of p-values is a function of the properties of the distribution of test-statistics that are used to compute p-values. With z-scores as test statistics it is well-known from intro statistics that sampling error follows a standard normal distribution, which is a normal distribution with a standard deviation of 1. Any transformation of z-scores into p-values and back into z-scores does not alter the standard deviation. Thus, the standard deviation has to be at least 1.

Heterogeneity in Power

The previous example assumed that all studies have the same amount of power. Allowing for heterogeneity in power will further increase the standard deviation of z-scores. This is illustrated with the next example, where mean power is again 80%, but this time the non-centrality parameters vary according to a normal distribution centered over 3.15 with a standard deviation of 1. Figure 3 shows the distribution of p-values, which is even more extreme and deviates even more from the beta distribution simulated by the reviewer.

Using the reviewer’s formula, we now get a standard deviation of 1.54; using the correct formula for two-tailed p-values, we end up with 1.39, close to the theoretically expected value of sqrt(2) = 1.41.

y = qnorm(p) # reviewer's formula
mean(y) #-2.90
sd(y) #1.54

y = -qnorm(p/2)
mean(y) #3.16
sd(y) #1.39

This value makes sense because we simulated variation in z-scores with two standard normal distributions: one for the variation in the non-centrality parameters and one for sampling error. Adding the two variances gives a joint variance of 1 + 1 = 2, and a standard deviation of sqrt(2) = 1.41.
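This variance-addition argument can be checked directly: drawing a non-centrality parameter for each study from a normal distribution centered over 3.15, adding standard normal sampling error, and converting the two-tailed p-values back into absolute z-scores yields a standard deviation close to sqrt(2). A sketch (Python standard library, standing in for the R simulation; math.erfc computes the two-tailed p-value stably in the far tails):

```python
from statistics import NormalDist, stdev
from math import erfc, sqrt
import random

nd = NormalDist()
random.seed(7)

# Heterogeneous power: non-centrality parameters vary across studies,
# and sampling error adds another unit of variance: 1 + 1 = 2.
ncp = [random.gauss(3.15, 1) for _ in range(10_000)]
z = [m + random.gauss(0, 1) for m in ncp]

p = [erfc(abs(v) / sqrt(2)) for v in z]    # two-tailed p-values
y_rev = [nd.inv_cdf(v) for v in p]         # reviewer's one-tailed formula
y = [-nd.inv_cdf(v / 2) for v in p]        # correct: -qnorm(p/2)

print(round(stdev(y_rev), 2), round(stdev(y), 2), round(sqrt(2), 2))
```

The recovered standard deviation is slightly below sqrt(2) because taking absolute z-scores folds the small fraction of negative values; the reviewer's formula inflates the spread even further.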


Unless I am totally crazy, I have demonstrated that we can use simple intro stats knowledge to realize that the standard deviation of p-values converted into z-scores has to be at least 1 because sampling error alone produces a standard deviation of 1. If the set of studies is heterogeneous and power varies across studies, the standard deviation will be even greater than 1. A variance less than 1 is only expected in unrealistic simulations or when researchers use questionable research practices, which reduces variability in p-values (e.g., all p-values greater than .05 are missing) and therewith also the variability in z-scores.

A broader conclusion is that the traditional publishing model in psychology is broken. Closed peer-review is too slow and unreliable to ensure quality control. Neither the editor of a prestigious journal, nor four reviewers were able to follow this simple line of argument. Open review is the only way forward. I guess I will be submitting this work to a journal with open reviews, where reviewers’ reputation is on the line and they have to think twice before they criticize a manuscript.

Replicability Audit of Steven J. Heine

“Trust is good, but control is better”  


Information about the replicability of published results is important because empirical results can only be used as evidence if they can be replicated. However, the replicability of published results in social psychology is doubtful. Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of psychologists to estimate the replicability of their studies.

Steven J. Heine

Under construction

Replicability Audit of John A. Bargh

“Trust is good, but control is better”  


Information about the replicability of published results is important because empirical results can only be used as evidence if they can be replicated. However, the replicability of published results in social psychology is doubtful. Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of psychologists to estimate the replicability of their studies.

John A. Bargh

Bargh is an eminent social psychologist (H-Index in WebofScience = 61). He is best known for his claim that unconscious processes have a strong influence on behavior. Some of his most cited articles used subliminal or unobtrusive priming to provide evidence for this claim.

Bargh also played a significant role in the replication crisis in psychology. In 2012, a group of researchers failed to replicate his famous “elderly priming” study (Doyen et al., 2012). He responded with a personal attack that was covered in various news reports (Bartlett, 2013). It also triggered a response by psychologist and Nobel Laureate Daniel Kahneman, who wrote an open letter to Bargh (Young, 2012).

As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research.

Kahneman also asked Bargh and other social priming researchers to conduct credible replication studies to demonstrate that the effects are real. However, seven years later neither Bargh nor other prominent social priming researchers have presented new evidence that their old findings can be replicated.

Instead, other researchers have conducted replication studies and produced further replication failures. As a result, confidence in social priming is decreasing, as reflected in Bargh’s citation counts (Figure 1).

Figure 1. John A. Bargh’s citation counts in Web of Science (3/17/19)

In this blog post, I examine the replicability and credibility of John A. Bargh’s published results using a statistical approach, z-curve (Brunner & Schimmack, 2018). It is well known that psychology journals only publish confirmatory evidence with statistically significant results, p < .05 (Sterling, 1959). This selection for significance is the main cause of the replication crisis in psychology because it makes it impossible to distinguish results that can be replicated from results that cannot be replicated: selection ensures that all published results appear successful, and we never see replication failures.

While selection for significance makes success rates uninformative, the strength of evidence against the null-hypothesis (signal/noise or effect size / sampling error) does provide information about replicability. Studies with higher signal to noise ratios are more likely to replicate. Z-curve uses z-scores as the common metric of signal-to-noise ratio for studies that used different test statistics. The distribution of observed z-scores provides valuable information about the replicability of a set of studies. If most z-scores are close to the criterion for statistical significance (z = 1.96), replicability is low.

Given the requirement to publish significant results, researchers had two options to meet this goal. One option is to obtain large samples to reduce sampling error and therewith increase the signal-to-noise ratio. The other is to conduct studies with small samples and carry out multiple statistical tests. Multiple testing increases the probability of obtaining a significant result with the help of chance. This strategy is more efficient in producing significant results, but these results are less replicable because a replication study will not be able to capitalize on chance again. The latter strategy is a questionable research practice (John et al., 2012), and it produces questionable results because it is unknown how much chance contributed to an observed significant result. Z-curve reveals how much a researcher relied on questionable research practices to produce significant results.


I used WebofScience to identify the most cited articles by John A. Bargh (datafile).  I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 43 empirical articles (H-Index = 41).  The 43 articles reported 111 studies (average 2.6 studies per article).  The total number of participants was 7,810 with a median of 56 participants per study.  For each study, I identified the most focal hypothesis test (MFHT).  The result of the test was converted into an exact p-value and the p-value was then converted into a z-score.  The z-scores were submitted to a z-curve analysis to estimate mean power of the 100 results that were significant at p < .05 (two-tailed). Four studies did not produce a significant result. The remaining 7 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 111 reported hypothesis tests was 96%. This is a typical finding in psychology journals (Sterling, 1959).


The z-curve estimate of replicability is 29% with a 95%CI ranging from 15% to 38%.  Even at the upper end of the 95% confidence interval this is a low estimate. The average replicability is lower than for social psychology articles in general (44%, Schimmack, 2018) and for other social psychologists. At present, only one audit has produced an even lower estimate (Replicability Audits, 2019).

The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results.  The area under the grey curve is an estimate of the file drawer of studies that need to be conducted to achieve 100% successes if hiding replication failures were the only questionable research practice that is used. The ratio of the area of non-significant results to the area of all significant results (including z-scores greater than 6) is called the File Drawer Ratio.  Although this is just a projection, and other questionable practices may have been used, the file drawer ratio of 7.53 suggests that for every published significant result about 7 studies with non-significant results remained unpublished. Moreover, often the null-hypothesis may be false, but the effect size is very small and the result is still difficult to replicate. When the definition of a false positive includes studies with very low power, the false positive estimate increases to 50%. Thus, about half of the published studies are expected to produce replication failures.

Finally, z-curve examines heterogeneity in replicability. Studies with p-values close to .05 are less likely to replicate than studies with p-values less than .0001. This fact is reflected in the replicability estimates for segments of studies that are provided below the x-axis. Without selection for significance, z-scores of 1.96 correspond to 50% replicability. However, we see that selection for significance lowers this value to just 14% replicability. Thus, we would not expect that published results with p-values that are just significant would replicate in actual replication studies. Even z-scores in the range from 3 to 3.5 average only 32% replicability. Thus, only studies with z-scores greater than 3.5 can be considered to provide some empirical evidence for the reported effects.

Inspection of the datafile shows that z-scores greater than 3.5 were consistently obtained in 2 out of the 43 articles. Both articles used a more powerful within-subject design.

The automatic evaluation effect: Unconditional automatic attitude activation with a pronunciation task (JPSP, 1996)

Subjective aspects of cognitive control at different stages of processing (Attention, Perception, & Psychophysics, 2009).


John A. Bargh’s work on unconscious processes with unobtrusive priming task is at the center of the replication crisis in psychology. This replicability audit suggests that this is not an accident. The low replicability estimate and the large file-drawer estimate suggest that replication failures are to be expected. As a result, published results cannot be interpreted as evidence for these effects.

So far, John Bargh has ignored criticism of his work. In 2017, he published a popular book about his work on unconscious processes. The book did not mention doubts about the reported evidence, while a z-curve analysis showed low replicability of the cited studies (Schimmack, 2017).

Recently, another study by John Bargh failed to replicate (Chabris et al., in press), Jesse Singal wrote a blog post about this replication failure (Research Digest), and John Bargh wrote a lengthy comment.

In the commentary, Bargh lists several studies that successfully replicated the effect. However, listing studies with significant results does not provide evidence for an effect unless we know how many studies failed to demonstrate the effect and often we do not know this because these studies are not published. Thus, Bargh continues to ignore the pervasive influence of publication bias.

Bargh then suggests that the replication failure was caused by a hidden moderator which invalidates the results of the replication study.

One potentially important difference in procedure is the temperature of the hot cup of coffee that participants held: was the coffee piping hot (so that it was somewhat uncomfortable to hold) or warm (so that it was pleasant to hold)? If the coffee was piping hot, then, according to the theory that motivated W&B, it should not activate the concept of social warmth – a positively valenced, pleasant concept. (“Hot” is not the same as just more “warm”, and actually participates in a quite different metaphor – hot vs. cool – having to do with emotionality.) If anything, an uncomfortably hot cup of coffee might be expected to activate the concept of anger (“hot-headedness”), which is antithetical to social warmth. With this in mind, there are good reasons to suspect that in C&S, the coffee was, for many participants, uncomfortably hot. Indeed, C&S purchased a hot or cold coffee at a coffee shop and then immediately handed that coffee to passersby who volunteered to take the study. Thus, the first few people to hold a hot coffee likely held a piping hot coffee (in contrast, W&B’s coffee shop was several blocks away from the site of the experiment, and they used a microwave for subsequent participants to keep the coffee at a pleasantly warm temperature). Importantly, C&S handed the same cup of coffee to as many as 7 participants before purchasing a new cup. Because of that feature of their procedure, we can check if the physical-to-social warmth effect emerged after the cups were held by the first few participants, at which point the hot coffee (presumably) had gone from piping hot to warm.

He overlooks that his original study produced only weak evidence for the effect, with a p-value of .0503, which is technically not below the .05 criterion for significance. As shown in the z-curve plot, results with a p-value of .0503 have an average replicability of only 13%. Moreover, the 95%CI for the effect size touches 0. Thus, the original study did not rule out that the effect size is extremely small and has no practical significance. Any claim that the effect of holding a warm cup on affection is theoretically relevant for our understanding of affection would require studies with larger samples and more convincing evidence.

At the end of his commentary, John A. Bargh assures readers that he is purely motivated by a search for the truth.

Let me close by affirming that I share your goal of presenting the public with accurate information as to the state of the scientific evidence on any finding I discuss publicly. I also in good faith seek to give my best advice to the public at all times, again based on the present state of evidence. Your and my assessments of that evidence might differ, but our motivations are the same.

Let me be crystal clear. I have no reasons to doubt that John A. Bargh believes what he says. His conscious mind sees himself as a scientist who employs the scientific method to provide objective evidence. However, Bargh himself would be the first to acknowledge that our conscious mind is not fully aware of the actual causes of human behavior. I submit that his response to criticism of his work shows that he is less capable of being objective than he thinks he is. I would be happy to be proven wrong in a response by John A. Bargh to my scientific criticism of his work. So far, eminent social psychologists have preferred to remain silent about the results of their replicability audits.


It is nearly certain that I made some mistakes in the coding of John A. Bargh’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust, and everybody can audit this audit. The data are openly available and can be submitted to a z-curve analysis using a shiny app. Thus, this replicability audit is fully transparent and open to revision.


Many psychologists do not take this work seriously because it has not been peer-reviewed. However, nothing is stopping them from conducting a peer-review of this work and to publish the results of their review as a commentary here or elsewhere. Thus, the lack of peer-review is not a reflection of the quality of this work, but rather a reflection of the unwillingness of social psychologists to take criticism of their work seriously.

If you found this audit interesting, you might also be interested in other replicability audits of eminent social psychologists.

Psychological Science is Self-Correcting

It is easy to say that science is self-correcting.  The notion of a self-correcting science is based on the naive model of science as an objective process that incorporates new information and updates beliefs about the world depending on the available evidence.  When new information suggests that old beliefs are false, the old beliefs are replaced by new beliefs.   

It has been a while since I read Kuhn’s book on paradigm shifts, but I do remember that a main point of the book was that science doesn’t work this way for a number of reasons.  

Thus, self-correction cannot be taken for granted. Rather, it is an attribute that needs to be demonstrated for a discipline to be an actual science. If psychological science wants to be a science, there should be empirical evidence that it is self-correcting. 

One piece of evidence for self-correction is that theories that are in doubt receive fewer citations.  Fortunately, modern software like the database WebofScience makes it very easy to count citations by year of publication.

Social Priming

In the past years, research on social priming has come under attack. Several replication studies failed to replicate key findings in this literature. In 2012, Nobel Laureate Daniel Kahneman wrote an open letter to John A. Bargh calling social priming “the poster child for doubts about the integrity of psychological research” (cf. Train Wreck blog post). I have demonstrated with statistical methods that many of the published results in this literature were obtained with questionable research methods that inflate the risk of false positive results (Before You Know It).

If science is self-correcting, we should see a decrease in citations of social priming articles.

John A. Bargh

The graph below shows the citations of John A. Bargh’s articles by year. 2019 does not count because it has just started. 2018 citations are still being added, but at a very low rate, so the 2018 data can be interpreted.

The graph shows that John A. Bargh’s citation counts still increased after 2012, when Kahneman published the open letter. However, publishing is a slow process and many articles published in 2013 and 2014 had been written before 2012. Starting with 2015, we see a decrease in citations and this decrease continues to 2018. The decrease seems to be accelerating with a drop by 200 citations from 2017 to 2018.

In conclusion, there is some evidence of self-correction in psychology. However, Bargh may be an exception because an open letter by a Nobel Laureate is a rare and powerful impetus for self-correction.

Ap Dijksterhuis

Dijksterhuis is also known for work on unconscious processes and social priming. Importantly, a large replication study failed to replicate his professor-priming results in 2018 (Registered Replication Report).

The increase in citation counts stalled in 2011, even before the citation counts of John A. Bargh started to decrease. However, there was no clear decrease in the years from 2012 to 2017, while citation counts decreased by over 100 citations in 2018. Thus, there are some signs of self-correction here as well.

Fritz Strack

The work by Fritz Strack was also featured in Kahneman’s book. There have been two registered replication reports of work by Fritz Strack and both failed to replicate the original results (facial feedback, item-order effects).

Strack’s citation counts increased dramatically after 2012. However, in 2018 they decreased by 150 counts. We need the 2019 data to see whether this is a blip or the beginning of a downward trend.

Susan T. Fiske

To make sure that the trends for social priming researchers are not just general trends, we need a control condition. I picked Susan T. Fiske because she is an eminent social psychologist, but her work is different from social priming experiments. Her work is also more replicable than work by social priming researchers (social psychologists’ replicability rankings).

Fiske’s graph shows no decrease in 2018. Thus, the decreases seen for social priming researchers do not reflect a general trend in social psychology.


This blog post shows how citation counts can be used to examine whether psychological science is self-correcting, which is an essential feature of a science. There are some positive signs that the recent replication crisis in social psychology has triggered a process of self-correction. I suggest that further investigation of changes in citation counts is a fruitful area of research for meta-psychologists.

The Limited Utility of Network Models

This blog post is based on a commentary that was published in the European Journal of Personality Psychology in 2012. Republishing it as a blog post makes it openly accessible.

The Utility of Network Analysis for Personality Psychology
European Journal of Personality, 26: 446–447 (2012)
DOI: 10.1002/per.1876


We note that network analysis provides some new opportunities but also has some limitations: (i) network analysis relies on observed measures such as single items or scale scores; (ii) it is a descriptive method and, as such, cannot test causal hypotheses; and (iii) it does not test the influence of outside forces on the network, such as dispositional influences on behaviour. We recommend structural equation modelling as a superior method that overcomes limitations of exploratory factor analysis and network analysis.


Cramer et al. (2012) introduce network analysis (NA) as a new statistical tool for the study of personality that addresses some limitations of exploratory factor analysis (EFA). We concur with the authors that NA provides valuable new opportunities but feel forced by the situational pressure of a 1000-word limit to focus on some potential limitations of NA.

We also compare NA to structural equation modelling (SEM) because we agree with the authors that SEM is currently the most powerful statistical method for the testing of competing (causal) theories of personality.

One limitation of EFA and NA is that these methods rely on observed measures to examine relationships between personality constructs. For example, Cramer et al. (2012) apply NA to correlations among ratings of single items. The authors recognize this limitation but do not present an alternative to this suboptimal approach.

A major advantage of SEM is that it allows researchers to create measurement models that can remove random and systematic measurement error from observed measures of personality constructs. Measurement models of multimethod data are particularly helpful to separate perception and rater biases from actual personality traits (e.g. Gere & Schimmack, 2011; Schimmack, 2010).

Our second concern is that NA is presented as a statistical tool that can test dynamic process models of personality. Yet, NA is a descriptive method that provides graphical representations of patterns in correlation matrices. Thus, NA is akin to other descriptive methods (e.g. multidimensional scaling, cluster analysis and principal component analysis) that reveal patterns in complex data. These descriptive methods make no assumptions about causality. In contrast, SEM forces researchers to make a priori assumptions about causal processes and provides information about the ability of a causal theory to explain the observed pattern of correlations. Thus, we recommend SEM for theory testing and do not think it is appropriate to use NA for this purpose.

Specifically, we think it is questionable to make inferences about the Big Five model based on network graphs. Cramer et al. (2012) highlight the ability to visualize the centrality of items in a network as a major strength of NA. However, factor loading patterns and communalities in EFA provide similar information. In our opinion, the authors go beyond the statistical method of NA when they propose that activation of central components will increase the chances that neighbouring components will also become more activated. This assumption is problematic for several reasons.

First, it is not clear what the authors mean by the notion of activation of personality components. Second, the connections in a network graph are not causal paths. An item could be central because it is influenced by many personality components (e.g. life satisfaction is influenced by neuroticism, extraversion, agreeableness and conscientiousness) or because it is the cause of neighbouring items (life satisfaction influences neuroticism, extraversion, agreeableness and conscientiousness). Researchers interested in testing causal relationships should collect data that are informative about causality (e.g. twin data) and use SEM to test whether the data favour one causal theory over another.

We are also concerned about the suggestion of Cramer et al. (2012) that NA provides an alternative account of classic personality constructs such as extraversion and neuroticism. It is important to make clear that this alternative view challenges the core assumption of many personality theories that behaviour is influenced by personality dispositions.

That is, whereas the conception of neuroticism as a personality trait assumes that neuroticism has causal force (Funder, 1991), the conceptualization of neuroticism as a personality component implies that it does not have causal force. The authors compare personality constructs such as neuroticism with the concept of a flock. The term flock in the expression a flock of birds does not refer to an independent entity that exists apart from the individual birds, and it makes no sense to attribute the gathering of birds to the causal effect of flocking (the birds are gathered in the same place because they are a flock of birds). We prefer to compare neuroticism with the causal force of seasonal changes that make individual birds flock together.


Since we published this commentary, network models have become even more popular to make claims about important constructs like depression. So far, we have only seen pretty pictures of item clusters, but no evidence that network models provide new insights into the causes of depression or dynamic developments over time. The reason is that the statistical tool is merely descriptive, whereas the articles talk a lot about things that go well beyond the empirical contribution of plotting correlations or partial correlations. In this regard, network articles remind me of the old days in personality psychology, when researchers told stories about their principal components. Instead, researchers interested in individual differences should learn how to use structural equation modeling to test causality and to study stability and change of personality traits and states. Unfortunately, learning structural equation modeling is a bit more difficult than network analysis, which requires no theory and does not test model fit. Maybe that is the reason for the popularity of network models. Easy to do and pretty pictures. Who can resist.

Ulrich Schimmack, March 1, 2019