All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Social psychology textbook audit: Something smells fishy

Social psychology textbooks like colorful laboratory experiments that illustrate a theoretical point. As famous social psychologist Daryl Bem stated, he considered his experiments more illustrations of what could happen than empirical tests of what actually happens. Unfortunately, social psychology textbooks make it less obvious that the results of highlighted studies should not be generalized to real life.

Myers and Twenge (2019) tell the story of fishy smells.

In a laboratory experiment, exposure to a fishy smell caused people to be suspicious of each other and cooperate less—priming notions of a shady deal as “fishy” (Lee & Schwarz, 2012). All these effects occurred without the participants’ conscious awareness of the scent and its influence.

They don’t even mention some other fun facts about this study. To make sure that the effect is not just a mood effect induced by bad odors in general, fishy smells were contrasted with fart smells, and the effect seemed to be limited to fishy smells.

The article was published in the top journal for experimental social psychology (JPSP:ASC) and is relatively highly cited.

However, the studies reported in this article smell a bit fishy and should be consumed with a grain of salt and a lot of lemon. The problem is that all of the results are significant, which is highly unlikely unless studies have very high statistical power (Schimmack, 2012).

And it even works the other way around.

And making people think about suspicion also makes them think about fish, in theory.

Suspicion also makes you more sensitive to fishy smells.

Undergraduate students may not realize what the problem with these studies is. After all, they all worked out; that is, they produced a p-value less than .05, which is supposed to ensure that no more than 1 out of 20 studies is a false positive result. As all of these studies are significant, it is extremely unlikely that all of them are false positives. So, we would have to infer that suspicion is related to fishy smells in our minds.

However, since 2012 it is clear that we have to draw another conclusion. The reason is that results in social psychology articles like this one smell fishy and suggest that the authors are telling us a fun story, but they are not telling us what really happened in their lab. It is extremely unlikely that the authors reported all of their studies and data analyses that they conducted. Instead they may have used a variety of so-called questionable research practices that increase the chances of reporting a significant result. Questionable research practices are also known as fishing for significance. These questionable research practices have the undesirable effect that they increase the type-I error rate. Thus, while the reported p-values are below .05, the risk of a false positive result is not and could be as high as 100%.

To demonstrate that researchers used questionable research practices, we can conduct a bias test. The most powerful bias test for small sets of studies is the Test of Insufficient Variance. When results are always significant, but most p-values are only just significant (p < .05 and p > .005), the results are not trustworthy because sampling error should produce more variability than we see.

The table lists the test statistics, converts the two-tailed p-values into z-scores and computes the variance of the z-scores. The variance is expected to be 1, but the actual variance is only 0.14. A chi-square test shows that this deviation is significant with p = .01. Thus, we have scientific evidence to claim that these results smell a bit fishy.
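The logic of the test can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis above: the p-values in the example are made up, the function names are hypothetical, and the chi-square probability is computed with a standard series expansion so that nothing beyond the standard library is needed.

```python
import math
from statistics import NormalDist, variance

def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable with df degrees of freedom,
    using the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, y = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= y / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))

def tiva(two_sided_p_values):
    """Test of Insufficient Variance: returns (variance of z, left-tailed p)."""
    norm = NormalDist()
    # two-sided p-values -> absolute z-scores
    z = [-norm.inv_cdf(p / 2) for p in two_sided_p_values]
    k = len(z)
    var_z = variance(z)          # sample variance; about 1 without selection
    stat = (k - 1) * var_z       # chi-square distributed if the true variance is 1
    return var_z, chi2_cdf(stat, k - 1)

# Hypothetical p-values that are all "just significant"
var_z, p_tiva = tiva([0.03, 0.02, 0.04, 0.01, 0.045, 0.025, 0.015])
print(f"variance of z = {var_z:.2f}, TIVA p = {p_tiva:.4f}")
```

A small left-tailed p-value means the z-scores vary less than sampling error alone would allow. If the set contained seven z-scores (an assumption, because the rows of the table are not reproduced here), a variance of 0.14 would correspond to chi2_cdf(6 * 0.14, 6), which is roughly .009 and in line with the reported p = .01.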

Study    test    value    df    p    z

Unfortunately, these results are not the only fishy results in social psychology textbooks. Thus, students of social psychology should read textbook claims with a healthy dose of skepticism. They should also ask their professors to provide information about the replicability of textbook findings. Has this study been replicated in a preregistered replication attempt? Would you think you could replicate this result in your own lab? It is time to get rid of the fishy smell and let the fresh wind of open science clean up social psychology.

We can only hope that sooner rather than later, articles like this will sleep with the fishes.

Social-Psychology Textbook Audit: External Validity

Every social psychology textbook emphasizes the main problem of naturalistic studies (correlational research): it is difficult to demonstrate cause-effect relationships in these studies.

Social psychology has a proud tradition of addressing this problem with laboratory experiments. The advantage of laboratory experiments is that they make it easy to demonstrate causality. The disadvantage is that laboratory experiments have low ecological validity. It is therefore important to demonstrate that findings from laboratory experiments generalize to real world behavior.

Myers and Twenge’s (2019) textbook (13th edition) addresses this issue in a section called “Generalizing from Laboratory to Life.”

What people saw in everyday life suggested correlational research, which led to experimental research. Network and government policymakers, those with the power to make changes, are now aware of the results. In many areas, including studies of helping, leadership style, depression, and self-efficacy, effects found in the lab have been mirrored by effects in the field, especially when the laboratory effects have been large (Mitchell, 2012).

Mitchell, G. (2012). Revisiting truth or triviality: The external validity of research in the psychological laboratory. Perspectives on Psychological Science, 7, 109–117.

Curious about the evidence, I examined Mitchell’s article. I didn’t need to read beyond the abstract to see that the textbook misrepresented Mitchell’s findings.

[A replication and extension] using 217 lab-field comparisons from 82 meta-analyses found that the external validity of laboratory research differed considerably by psychological subfield, research topic, and effect size. Laboratory results from industrial-organizational psychology most reliably predicted field results, effects found in social psychology laboratories most frequently changed signs in the field (from positive to negative or vice versa), and large laboratory effects were more reliably replicated in the field than medium and small laboratory effects.

Mitchell, G. (2012). Revisiting truth or triviality: The external validity of research in the psychological laboratory. Perspectives on Psychological Science, 7(2), 109–117.

So, 80% of the results covered in a social psychology course are based on laboratory experiments that may not generalize to the real world. In addition, students are given the false information that these results do generalize to the real world, when evidence of ecological validity is often missing. On top of this, many articles based on laboratory experiments report inflated effect sizes due to selection for significance, and the results may not even replicate in other laboratory contexts.

Estimating the Replicability of Psychological Science

Over the past years, psychologists have become increasingly concerned about the credibility of published results. The credibility crisis started in 2011, when Bem published incredible results that seemed to suggest that humans can foresee random future events. Bem’s article revealed fundamental flaws in the way psychologists conduct research. The main problem is that psychology journals only publish statistically significant results (Sterling, 1959). If only significant results are published, all hypotheses will receive empirical support as long as they are tested. This is akin to saying that everybody has a 100% free throw average or nobody ever makes a mistake if we do not count failures.

The main problem of selection for significance is that we do not know the real strength of evidence that empirical studies provide. Maybe the selection effect is small and most studies would replicate. However, it is also possible that many studies might fail a replication test. Thus, the crisis of confidence is a crisis of uncertainty.
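A small simulation can make the uncertainty concrete. The sketch below is an illustration under simple assumptions, not an analysis of real data: every study is a two-sided z-test with the same true (expected) z-score, only significant results are "published", and each published study is then replicated exactly. The function name and the chosen true z-score of 1 are hypothetical.

```python
import random

def simulate_selection(true_z, n_studies=20_000, crit=1.96, seed=1):
    """Simulate selection for significance for studies with the same true z."""
    rng = random.Random(seed)
    observed = [rng.gauss(true_z, 1.0) for _ in range(n_studies)]
    # Only significant results make it into the literature
    published = [z for z in observed if abs(z) > crit]
    # Exact replications: a new draw with the same true effect
    replicated = [abs(rng.gauss(true_z, 1.0)) > crit for _ in published]
    return {
        "share_published": len(published) / n_studies,
        "mean_published_z": sum(abs(z) for z in published) / len(published),
        "replication_rate": sum(replicated) / len(replicated),
    }

result = simulate_selection(true_z=1.0)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

In this setup every published result is significant, the published z-scores are on average much larger than the true value, and the replication rate of published studies equals the low power of the original studies rather than the 100% success rate seen in print.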

The Open Science Collaboration conducted actual replication studies to estimate the replicability of psychological science. They replicated 97 studies with statistically significant results and were able to reproduce 35 significant results (a 36% success rate). This is a shockingly low success rate. Based on this finding, most published results cannot be trusted, especially because there is heterogeneity across studies. Some studies would have an even lower chance of replication and several studies might even be outright false positives (there is actually no real effect).

As important as this project was to reveal major problems with the research culture in psychological science, there are also some limitations that cast doubt on the 36% estimate as a valid estimate of the replicability of psychological science. First, the sample size is small and sampling error alone might have led to an underestimation of the replicability in the population of studies. However, sampling error could also have produced a positive bias. Another problem is that most of the studies focused on social psychology and that replicability in social psychology could be lower than in other fields. In fact, a moderator analysis suggested that the replication rate in cognitive psychology is 50%, while the replication rate in social psychology is only 25%. The replicated studies were also limited to a single year (2008) and three journals. It is possible that the replication rate has increased since 2008 or could be higher in other journals. Finally, there have been concerns about the quality of some of the replication studies. These limitations do not undermine the importance of the project, but they do imply that the 36% estimate is an estimate and that it may underestimate the replicability of psychological science.
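The sampling error attached to 35 successes out of 97 attempts can be quantified with a confidence interval. The sketch below uses the Wilson score interval, which is one common choice and an assumption made here, not a method taken from the OSC report; the function name is hypothetical. It only captures sampling error and says nothing about the other limitations listed above.

```python
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * (p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) ** 0.5 / denom
    return center - half, center + half

low, high = wilson_interval(35, 97)
print(f"success rate {35 / 97:.1%}, 95% interval {low:.1%} to {high:.1%}")
```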

Over the past years, I have been working on an alternative approach to estimate the replicability of psychological science. This approach starts with the simple fact that replicability is tightly connected to the statistical power of a study because statistical power determines the long-run probability of producing significant results (Cohen, 1988). Thus, estimating statistical power provides valuable information about replicability. Cohen (1962) conducted a seminal study of statistical power in social psychology. He found that the average power to detect an average effect size was around 50%. This is the first estimate of replicability of psychological science, although it was only based on one journal and limited to social psychology. However, subsequent studies replicated Cohen’s findings and found similar results over time and across journals (Sedlmeier & Gigerenzer, 1989). It is noteworthy that the 36% estimate from the OSC project is not statistically different from Cohen’s estimate of 50%. Thus, there is convergent evidence that replicability in social psychology is around 50%.
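The link between power and the long-run rate of significant results can be illustrated for the simplest case, a two-sided z-test. The sketch below is a textbook calculation, not part of z-curve; the function name is hypothetical, and the "expected z-score" stands for the true effect size divided by its standard error.

```python
from statistics import NormalDist

def power_two_sided_z(expected_z, alpha=0.05):
    """Probability of a significant two-sided z-test for a given expected z."""
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    upper = 1 - norm.cdf(crit - expected_z)
    lower = norm.cdf(-crit - expected_z)
    return upper + lower

for ez in (0.0, 1.0, 1.96, 2.8):
    print(f"expected z = {ez:4.2f} -> power = {power_two_sided_z(ez):.2f}")
```

With no effect the function returns alpha, an expected z-score equal to the critical value of 1.96 gives about 50% power, and an expected z-score of 2.8 gives about 80% power.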

In collaboration with Jerry Brunner, I have developed a new method that can estimate mean power for a set of studies that are selected for significance and that vary in effect sizes and sample sizes, which produces heterogeneity in power (Brunner & Schimmack, 2018). The inputs for this method are the actual test statistics of significance tests (e.g., t-tests, F-tests). These test statistics are first converted into two-tailed p-values and then converted into absolute z-scores. The magnitude of these absolute z-scores provides information about the strength of evidence against the null-hypotheses. The histogram of these z-scores, called a z-curve, is then used to fit a finite mixture model to the data that estimates mean power, while taking selection for significance into account. Extensive simulation studies demonstrate that z-curve performs well and provides better estimates than alternative methods. Thus, z-curve is the method of choice for estimating the replicability of psychological science on the basis of the test statistics that are reported in original articles.

For this blog post, I am reporting preliminary results from a large project that extracts focal hypothesis tests from a broad range of journals that cover all areas of psychology for the years 2010 to 2017. The hand-coding of these articles complements a similar project that relies on automatic extraction of test statistics (Schimmack, 2018).

Table 1 shows the journals that have been coded so far. It also shows the estimates based on the automated method and for hand-coding of focal hypotheses.

Journal                                            Hand-coded   Automated
Journal of Abnormal Psychology                         76           68
Journal of Cross-Cultural Psychology                   73           77
Journal of Research in Personality                     68           75
J. Exp. Psych: Learning, Memory, & Cognition           58           77
Journal of Experimental Social Psychology              55           62
Behavioral Neuroscience                                53           68
Psychological Science                                  52           66
JPSP-Interpersonal Relations & Group Processes         33           63
JPSP-Attitudes and Social Cognition                    30           65

Hand coding of focal hypotheses produces lower estimates than the automated method because the automated analysis also codes manipulation checks and other highly significant results that are not theoretically important. The correlation between the two methods shows consistency across the two methods, r = .67. Finally, the mean for the automated method, 69%, is close to the mean for over 100 journals, 72%, suggesting that the sample of journals is an unbiased sample.

The hand coding results also confirm results found with the automated method that social psychology has a lower replicability than some other disciplines. Thus, the OSC reproducibility results that are largely based on social psychology should not be used to make claims about psychological science in general.

The figure below shows the output of the latest version of z-curve. The first finding is that the replicability estimate for all 1,671 focal tests is 56% with a relatively tight confidence interval ranging from 45% to 56%. The next finding is that the discovery rate or success rate is 92%, using p < .05 as the criterion. This confirms that psychology journals continue to publish results that are selected for significance (Sterling, 1959). The histogram further shows that even more results would be significant if p-values below .10 are included as evidence for “marginal significance.”

Z-Curve.19.1 also provides an estimate of the size of the file drawer. It does so by projecting the distribution of observed significant results into the range of non-significant results (grey curve). The file drawer ratio shows that for every published result, we would expect roughly two unpublished studies with non-significant results. However, z-curve cannot distinguish between different questionable research practices. Rather than not disclosing failed studies, researchers may not disclose other statistical analyses within a published study in order to report significant results.

Z-Curve.19.1 also provides an estimate of the false discovery rate (FDR). FDR is the percentage of significant results that may arise from testing a true nil-hypothesis, where the population effect size is zero. For a long time, the consensus has been that false positives are rare because the nil-hypothesis is rarely true (Cohen, 1994). Consistent with this view, Soric’s estimate of the maximum false discovery rate is only 10% with a tight CI ranging from 8% to 16%.
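Soric's bound follows directly from the discovery rate, that is, the share of all conducted tests that produced a significant result. The sketch below shows the arithmetic. It assumes that a file drawer ratio of two unpublished non-significant results per published significant result translates into a discovery rate of 1/3; this is a simplification for illustration and not the exact z-curve computation, and the function names are hypothetical.

```python
def discovery_rate_from_file_drawer(ratio):
    """Share of significant results when `ratio` non-significant results
    are hidden for every significant one."""
    return 1 / (1 + ratio)

def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's upper bound on the false discovery rate."""
    return min(1.0, (1 / discovery_rate - 1) * alpha / (1 - alpha))

dr = discovery_rate_from_file_drawer(2)
print(f"discovery rate = {dr:.2f}, maximum FDR = {soric_max_fdr(dr):.3f}")
```

Under these assumptions the bound is 2 × .05 / .95, or about 10%, which matches the point estimate reported above. The bound grows quickly as the discovery rate falls, which is why the size of the file drawer matters for the false positive risk.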

However, the focus on the nil-hypothesis is misguided because it treats tiny deviations from zero as true hypotheses even if the effect size has no practical or theoretical significance. These effect sizes also lead to low power and replication failures. Therefore, Z-Curve 19.1 also provides an estimate of the FDR that treats studies with very low power as false positives. This broader definition of false positives raises the FDR estimate slightly, but 15% is still a low percentage. Thus, the modest replicability of results in psychological science is mostly due to low statistical power to detect true effects rather than a high number of false positive discoveries.

The reproducibility project showed that studies with low p-values were more likely to replicate. This relationship follows from the influence of statistical power on p-values and replication rates. To achieve a replication rate of 80%, p-values had to be less than .00005 or the z-score had to exceed 4 standard deviations. However, this estimate was based on a very small sample of studies. Z-Curve.19.1 also provides estimates of replicability for different levels of evidence. These values are shown below the x-axis. Consistent with the OSC results, a replication rate over 80% is only expected once z-scores are greater than 4.

The results also provide information about the choice of the alpha criterion to draw inferences from significance tests in psychology. To do so, it is important to distinguish observed p-values and type-I probabilities. For a single unbiased test, we can infer from an observed p-value less than .05 that the risk of a false positive result is less than 5%. However, when multiple comparisons are made or results are selected for significance, an observed p-value less than .05 does not imply that the type-I error risk is below .05. To claim a type-I error risk of 5% or less, we have to correct the observed p-values, just like a Bonferroni correction. As 50% power corresponds to statistical significance, we see that z-scores between 2 and 3 are not statistically significant; that is, the type-I error risk is greater than 5%. Thus, the standard criterion to claim significance with alpha = .05 is a p-value of .003. Given the popularity of .005, I suggest using p = .005 as a criterion for statistical significance. However, this claim is not based on lowering the criterion for statistical significance because p < .005 still only allows us to claim that the type-I error probability is less than 5%. The need for a lower criterion value stems from the inflation of the type-I error rate due to selection for significance. This is a novel argument that has been overlooked in the significance wars, which ignored the influence of publication bias on false positive risks.

Finally, z-curve.19.1 makes it possible to examine the robustness of the estimates by using different selection criteria. One problem with selection models is that p-values just below .05, say in the .01 to .05 range, can arise from various questionable research practices that have different effects on replicability estimates. To address this problem, it is possible to estimate the density with a different selection criterion, while still estimating the replicability with alpha = .05 as the criterion. Figure 2 shows the results of using only z-scores greater than 2.5 (p = .012) to fit the observed z-curve.

The blue dashed line at z = 2.5 shows the selection criterion. The grey curve between 1.96 and 2.5 is projected from the distribution for z-scores greater than 2.5. Results show a close fit with the observed distribution. As a result, the parameter estimates are also very similar. Thus, the results are robust and the selection model seems to be reasonable.


Psychology is in a crisis of confidence about the credibility of published results. The fundamental problems are as old as psychology itself. Psychologists have conducted low powered studies and selected only studies that worked for decades (Cohen, 1962; Sterling, 1959). However, awareness of these problems has increased in recent years. Like many crises, the confidence crisis in psychology has created confusion. Psychologists are aware that there is a problem, but they do not know how large the problem is. Some psychologists believe that there is no crisis and pretend that most published results can be trusted. Others are worried that most published results are false positives. Meta-psychologists aim to reduce the confusion among psychologists by applying the scientific method to psychological science itself.

This blog post provided the most comprehensive assessment of the replicability of psychological science so far. The evidence is largely consistent with previous meta-psychological investigations. First, replicability is estimated to be slightly above 50%. However, replicability varies across disciplines and the replicability of social psychology is below 50%. The fear that most published results are false positives is not supported by the data. Replicability increases with the strength of evidence against the null-hypothesis. If the p-value is below .00001, studies are likely to replicate. However, significant results with p-values above .005 should not be considered statistically significant with an alpha level of 5%, because selection for significance inflates the type-I error. Only studies with p < .005 can claim statistical significance with alpha = .05.

The correction for publication bias implies that researchers have to increase sample sizes to meet the more stringent p < .005 criterion. However, a better strategy is to preregister studies to ensure that reported results can be trusted. In this case, p-values below .05 are sufficient to demonstrate statistical significance with alpha = .05. Given the low prevalence of false positives in psychology, I see no need to lower the alpha criterion.

Future Directions

This blog post is just an interim report. The final project requires hand-coding of a broader range of journals. Readers who think that estimating the replicability of psychological science is beneficial and who want information about a particular journal are invited to collaborate on this project and can obtain authorship if their contribution is substantial enough to warrant authorship. Please consider taking part in this project. Although it is a substantial time commitment, it doesn’t require participants or materials that are needed for actual replication studies. Contact me if you are interested and want to know how you can get involved.

The Bayesian Mixture Model for P-Curves is Fundamentally Flawed

Draft. Comments are welcome. To be submitted to Meta-Psychology

Authors: Ulrich Schimmack & Jerry Brunner
[Jerry Brunner is a professor in the statistics department of the University of Toronto, and an expert on Bayesian Mixture Models. He wrote the R code to estimate the false discovery rate without a dogmatic prior that limits heterogeneity in the evidence against H0.]

There are many mixture models. I am using the term Bayesian Mixture Model (BMM) to refer to the specific model proposed by Gronau et al. (2017). Criticism of their model does not generalize to other Bayesian mixture models.

Meta-Analysis in Psychology

The primary purpose of psychological science is to collect data and to interpret them. The focus is usually on internal consistency. That is, are the data of a study consistent with theoretical predictions? The problem with this focus on single studies is that a single study rarely provides conclusive evidence for a hypothesis and even more rarely against a hypothesis, let alone a complete theory.

The solution to this problem has been the use of meta-analyses. Meta-analyses aim to provide conclusive evidence by aggregating information from several studies. The most common form of meta-analysis in psychology converts information from single studies into estimates of standardized effect sizes and then draws conclusions about the effect size in the population.

There are numerous problems with effect-size meta-analyses in psychology. One problem is that a collection of convenience samples does not allow generalizing to a population. Another problem is that studies are often very heterogeneous and it is not clear which of these studies can be replicated and which studies produced false positive results. The biggest problem is that original studies are selected for significance (Sterling, 1959; Sterling et al., 1995). As a result, effect size estimates in meta-analyses are inflated and can provide false evidence for effects that do not exist.

The problem of selection for significance has led to the development of a new type of meta-analysis that takes selection for significance into account. The Bayesian Mixture Model for Significant P-Values is one of these models (Gronau, Duizer, Bakker, & Wagenmakers, 2017). Compared to well-established mixture models that assume all data are available (Allison et al., 2002), the BMM uses only significant p-values. In this regard, the model is similar to pcurve (Simonsohn et al., 2014), puniform (van Assen et al., 2014), and zcurve (Brunner & Schimmack, 2018).

Henceforth, we will use the acronym BMM to refer to Gronau et al.’s model, but we want to clarify that any criticism of their model is not a criticism of Bayesian Mixture Models in general. In fact, we present an alternative mixture model that fixes the fundamental problem of their model.

A Meta-Analysis of P-values

Like p-curve (Simonsohn et al., 2014), BMM is a meta-analysis of significant p-values. It is implied, but never explicitly stated, that the p-values in a p-curve are p-values from two-sided tests. That is, the direction of an effect is not reflected in the test statistic. The test statistic could be a test statistic that ignores signs of effects (F-values, chi-square) or the absolute value of a directional test statistic (absolute t-values, absolute z-scores). As a result, p-values of 1 correspond to test statistics and effect sizes of zero and p-values decrease towards zero as test statistics increase towards infinity.

In a meta-analysis of p-values, sampling error will produce some variation in test statistics and p-values even if the population effect size is zero. It is well known in statistics that the distribution of p-values in this scenario is uniform. This can be modeled with a beta distribution with shape parameters a = 1 and b = 1.

However, if all of the tests tested a true hypothesis, the distribution of p-values is monotonically decreasing with increasing p-values. This can also be modeled with a beta distribution by setting the first shape parameter a to a value less than 1. The steeper the decrease is, the stronger is the evidence against the null-hypotheses. Figure 1 illustrates this p-curve with a shape parameter of a = .5.

We see that both distributions contribute p-values close to 1. However, we also see that the uniform distribution based on true null-hypotheses contributes more p-values close to 1. The reason is simply that p-values are more likely to be close to 0, when the null-hypothesis is false.
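The comparison of the two curves can be checked with the closed-form expressions for a beta distribution whose second shape parameter is fixed at b = 1. The sketch below is an illustration with hypothetical function names; it relies only on the facts that the density of Beta(a, 1) is a·p^(a-1) and its cumulative distribution function is p^a.

```python
def beta_a1_pdf(p, a):
    """Density of a Beta(a, 1) distribution at p (0 < p <= 1)."""
    return a * p ** (a - 1)

def beta_a1_cdf(p, a):
    """Cumulative distribution function of Beta(a, 1) at p."""
    return p ** a

def share_above(p, a):
    """Proportion of p-values greater than p under Beta(a, 1)."""
    return 1 - beta_a1_cdf(p, a)

for a in (1.0, 0.5):
    print(f"a = {a}: share of p > .9 = {share_above(0.9, a):.3f}, "
          f"density at p = 1 is {beta_a1_pdf(1.0, a):.2f}")
```

For the uniform distribution (a = 1), 10% of p-values fall above .9. For a = .5 the share is about 5%, and the density at p = 1 equals a, which is below 1 for every a < 1. The decreasing curve therefore always contributes fewer p-values close to 1 than the uniform curve.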

The First Flaw in the Bayesian Mixture Model

Like z-curve, BMM does not try to fit a model to the distribution of p-values. Rather, it first transforms p-values into z-scores, which the authors call probit transformed p-values. To convert p-values into z-scores, it is important to take into account whether p-values are one-sided or two-sided.

For one-sided p-values, values of .5 correspond to a z-score of 0, while p-values of 0 correspond to a z-score of infinity, and p-values of 1 correspond to a z-score of minus infinity. However, p-values in a p-curve analysis are two-tailed and p-values of 0 can correspond to a z-score of infinity or minus infinity while a p-value of 1 corresponds to a z-score of zero.

The proper formula to transform two-sided p-values into z-scores is -qnorm(p/2). To illustrate, take a z-score of 1.96 and compute the two-sided p-value using the formula (1-pnorm(abs(z)))*2. We obtain a p-value of .05 because 1.96 is the critical value for a two-sided z-test with alpha = .05. We can do the same with a z-score of -1.96. Again, we obtain a value of .05. We can now use the formula -qnorm(p/2) with p = .05 and obtain z = 1.96. This shows that the test statistic is about two standard deviations from a value of zero. With decreasing p-values, the evidence against the null-hypothesis is stronger. For example, p = .005 gives us a z-score of 2.81, which is nearly three standard deviations away from zero.
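The same round trip can be reproduced outside of R. The sketch below uses Python's standard library, where NormalDist().cdf plays the role of pnorm and NormalDist().inv_cdf plays the role of qnorm; the two function names are hypothetical.

```python
from statistics import NormalDist

norm = NormalDist()  # standard normal distribution

def z_to_two_sided_p(z):
    """Equivalent of (1 - pnorm(abs(z))) * 2 in R."""
    return 2 * (1 - norm.cdf(abs(z)))

def two_sided_p_to_z(p):
    """Equivalent of -qnorm(p / 2) in R; returns the absolute z-score."""
    return -norm.inv_cdf(p / 2)

for z in (1.96, -1.96):
    print(f"z = {z:5.2f} -> two-sided p = {z_to_two_sided_p(z):.3f}")
for p in (0.05, 0.005):
    print(f"p = {p} -> z = {two_sided_p_to_z(p):.2f}")
```

Positive and negative z-scores of the same size give the same two-sided p-value, so the conversion back can only recover the absolute z-score.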

The BMM model uses qnorm(p) to convert two-sided p-values into z-scores. The omission of the minus sign is just a minor nuisance. Now negative values show evidence against the null-hypothesis. However, using p rather than p/2 has more dramatic consequences. Based on this formula, a p-value of .5 is converted into a z-score of zero, and any p-values greater than .5 now produce positive values that would suggest we now have evidence against the null-hypothesis in the opposite direction. A p-value of 1 is converted into a value of infinity.

The only reason why this flawed conversion of p-values does not produce major problems is that only significant p-values are used. Thus, all p-values are well below .5, where the problems would occur. Still, the conversion misstates the strength of evidence: z-scores of 1.96 that produce a p-value of .05 are converted into z-scores of -1.65, which is the critical value for a one-sided z-test.
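The difference between the two conversions is easy to tabulate. The sketch below contrasts qnorm(p), as described above for BMM, with the two-sided formula -qnorm(p/2), using Python's NormalDist().inv_cdf as a stand-in for qnorm. The function names are hypothetical and the code is not taken from the BMM software.

```python
from statistics import NormalDist

norm = NormalDist()

def probit_p(p):
    """qnorm(p): treats a two-sided p-value as if it were one-sided."""
    return norm.inv_cdf(p)

def two_sided_z(p):
    """-qnorm(p / 2): absolute z-score implied by a two-sided p-value."""
    return -norm.inv_cdf(p / 2)

print("     p   qnorm(p)   -qnorm(p/2)")
for p in (0.001, 0.05, 0.5, 0.9):
    print(f"{p:6.3f}   {probit_p(p):8.2f}   {two_sided_z(p):11.2f}")
```

For p = .05 the first formula gives about -1.65 instead of 1.96, for p = .5 it gives zero, and for p-values above .5 the sign flips, whereas the two-sided formula always returns a non-negative z-score that shrinks towards zero as p approaches 1.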

The Fatal Flaw of the Bayesian Mixture Model

The fatal flaw of BMM is the prior for the standard deviation of the distribution of z-scores produced by true hypotheses (a.k.a. false null-hypotheses). Theoretically, this standard deviation can range from small values to large values depending on the strength of the evidence against the null-hypothesis and the heterogeneity in effect sizes and sample sizes across studies. However, the BMM constrains the prior distribution of the standard deviation to a range from 0 to 1. Importantly, this restriction is applied to the theoretical distribution in the model, not to the truncated distribution that is produced by selection for significance.

It should also be noted that it is highly unusual in Bayesian statistics to use priors to restrict the range of possible values. Priors that set the probability of certain parameters to zero are known as dogmatic priors. The problem with dogmatic priors is that empirical data can never correct a bad prior. Thus, they continue to have an influence on results even in very large sets of data.

The authors justify their use of a dogmatic prior with a single sentence. They claim that values greater than 1 “make the implausible prediction that p values near 1 are more common under H1 than under H0” (p. 1226).

Figure 2 makes it clear that this claim is blatantly false. With a monotonic decreasing function of p-values when the null-hypothesis is false, the proportion of studies with two-sided p-values close to 1 will always be less than the proportion of p-values close to 1 that are produced by the uniform distribution when the null-hypothesis is true. It is therefore totally unnecessary to impose a restriction on the prior for the standard deviation of z-scores that are based on transformed two-sided p-values.

If standard deviations of z-scores were always less than 1, imposing an unnecessary constraint on this parameter would not be a problem. However, the standard deviation of z-scores can easily be greater than 1. This is illustrated when the p-curves in Figure 2 are converted into z-curves, using the improper transformation of p-values with qnorm(p). The standard deviation of the z-curve for a = .9 is just a little bit higher than 1. For a = .5 (green) the standard deviation is 1.25 and for a = .1 (blue), the standard deviation is 2.32. In general, the standard deviation increases from 1 to values greater than 1 with decreasing values for the shape parameter a of the beta function.
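The relationship between the shape parameter and the standard deviation can be explored with a simulation. The sketch below is an illustration under stated assumptions: p-values are drawn from Beta(a, 1) by inverse-transform sampling (p = u^(1/a) for uniform u, valid for a ≤ 1), then converted with qnorm(p), here NormalDist().inv_cdf. The function name, sample size, and seed are arbitrary choices, and the code is not the one used to produce the figures.

```python
import random
from statistics import NormalDist, pstdev

def sd_of_probit_p(a, n=50_000, seed=1):
    """Standard deviation of qnorm(p) when p follows Beta(a, 1), a <= 1."""
    rng = random.Random(seed)
    norm = NormalDist()
    z_scores = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:           # inv_cdf is undefined at p = 0
            u = rng.random()
        p = u ** (1 / a)          # inverse-transform draw from Beta(a, 1)
        z_scores.append(norm.inv_cdf(p))
    return pstdev(z_scores)

for a in (1.0, 0.9, 0.5, 0.1):
    print(f"a = {a}: SD of z-scores = {sd_of_probit_p(a):.2f}")
```

For a = 1 the p-values are uniform and the z-scores are standard normal, so the standard deviation is close to 1. Smaller values of a should produce larger standard deviations, which allows a direct check of the values quoted above.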

The next figure shows how limiting the standard deviation of the normal distribution under H1 inflates the estimates of the proportion of false positives (H0 is true). BMM tries to model the observed distribution of z-scores with two standard normals. The center of the standard normal for H0 is fixed at zero. The center of the standard normal for H1 can move to maximize model fit. No model will have good fit because the data were not generated by two standard normal distributions. However, the best fit will be achieved by letting H0 account for the low z-scores and use H1 to account for the high z-scores. As a result, BMM will produce estimates of high proportions of false positives without any false positives in the data.

This can be easily demonstrated by submitting data that were generated with a beta distribution with a = .2 to the shiny app for the BMM model. The output is shown in the next Figure. The model returns the estimate that one third, 33.2%, of the data were false positives. It also is very confident in this estimate with a credibility interval ranging from .27 to .39. More data would only tighten this interval and not dramatically alter the point estimate. The reason for this false positive result (pun intended) is the dogmatic prior that limits the standard deviation of the normal distribution.

To demonstrate that the fatal flaw is really the dogmatic prior, we used a program written by Jerry Brunner that fits the BMM without the restriction on the standard deviation of the normal distribution under H1 (link). The model estimated only 3% false positives, placed the mean of the normal distribution for H1 at z = -1.22 and estimated the standard deviation as 2.10. The following figure shows that even this model fails to recover the actual distribution of z-scores, due to the wrong conversion of two-tailed p-values into z-scores. However, the key point is that removing the restriction on the standard deviation leads to a much lower estimate of false positives than the actual BMM model with the dogmatic prior that limits the standard deviation to 1.

It Matters: False Positive Rate in Cognitive Psychology

The authors of BMM also applied their model to actual data. The most relevant dataset is the set of 855 t-tests from cognitive psychology journals (Wetzels, Matzke, Lee, Rouder, Iverson, & Wagenmakers, 2011). This set of t-values was not selected for significance. The next figure shows the p-curve for all t-values converted into two-sided p-values.

The main finding is that most p-values are below .05. Given the low density of non-significant p-values, it is hard to say anything about their distribution based on visual inspection of the data. We used the frequentist mixture model, dBUMfit, to analyze these data. Just like BMM, the model aims to estimate the proportion of true null-hypotheses. dBUMfit provides an estimate of 0% false positives for these data. Thus, even the non-significant results reported in cognitive journals do not provide evidence for the null-hypothesis and it is wrong to interpret non-significant results as evidence for the absence of an effect (McShane REF). As the proportion of true null-hypotheses decreases with decreasing p-values, it is clear that the estimate for the subset of significant results has to be zero. Thus, there is no evidence in these data that a subset of p-values have a uniform distribution that leads to a flat distribution of p-values greater than .05.

Johnson (2013) limited the analysis of Wetzels et al.’s data to significant results. The p-curve for the significant results was also published by Gronau et al. (2017).

Johnson writes, quote,

The P values displayed in Fig. 3 presumably arise from two types of experiments: experiments in which a true effect was present and the alternative hypothesis was true, and experiments in which there was no effect present and the null hypothesis was true. For the latter experiments, the nominal distribution of P values is uniformly distributed on the range (0.0, 0.05). The distribution of P values reported for true alternative hypotheses is, by assumption, skewed to the left. The P values displayed in this plot thus represent a mixture of a uniform distribution and some other distribution.

Even without resorting to complicated statistical methods to fit this mixture, the appearance of this histogram suggests that many, if not most, of the P values falling above 0.01 are approximately uniformly distributed. That is, most of the significant P values that fell in the range (0.01-0.05) probably represent P values that were computed from data in which the null hypothesis of no effect was true.

Based on Johnson’s visual inspection of the p-curve, we would estimate that up to 32% of Wetzels et al.’s significant t-tests were false positives, as 32% of the p-values are greater than .01.

Gronau et al. (2017) quote Johnson at length and state that their model was inspired by Johnson’s article. “Our Bayesian mixture model was inspired by a suggestion from Johnson (2013)” (p. 1230). In addition, they present the BMM as a confirmation of Johnson’s claim with an estimate that 40% of the p-values stem from a true null-hypothesis. This estimate implies that even some p-values less than .01 are assumed to be false positives.

The percentage of true null-hypotheses would be even larger for the full set of t-tests that includes non-significant results. With 31% non-significant results attributed to H0 and 41% of the remaining 69% significant results estimated to be false positives (.31 + .69 × .41 = .59), this would imply that 59% of all t-tests in cognitive psychology test a true null-hypothesis.

In contrast, the mixture model that was fitted to all p-values returned an estimate of 0%. We believe that this is a practically significant difference that requires explanation.

We propose that the 41% estimate for significant p-values and the 59% estimate for all p-values are inflated estimates that are caused by restricting the standard deviation of z-scores to 1. To test this prediction, we fitted the significant p-values to an alternative Bayesian Mixture Model (altBMM). The only difference between BMM and altBMM is that altBMM allows the data to determine the standard deviation of z-scores for true hypotheses. Results confirmed our predictions. The estimate for false positive results dropped from 40% to 11% and the estimate of the standard deviation increased from 1 to 2.83.

Another way to show that restricting the variability of z-scores under H1 is a problem is to remove extreme p-values from the data set. In particle physics, a threshold of 5 sigma is used to rule out false positives. We can therefore set aside all p-values lower than (1-pnorm(5))*2. As we are removing cases that must definitely stem from H1, the proportion of false positives in the limited set should increase. However, the estimate provided by BMM drops from 41% to 22%. The reason is that removing extreme z-scores reduces the variability of z-scores and allows the standard normal distribution for H1 to cover lower z-scores. This estimate is also closer to the estimate for altBMM, which also produced a smaller estimate for the standard deviation of the normal distribution under H1 (11% false positives, SD = 1.17).
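
For readers who want to see the cutoff, the following sketch computes the two-sided p-value that corresponds to 5 sigma and applies it to a handful of made-up p-values. The vector p is purely hypothetical and is not taken from the data analyzed here.

```r
# Two-sided p-value that corresponds to a z-score of 5 (5 sigma).
threshold <- 2 * pnorm(5, lower.tail = FALSE)   # same quantity as (1 - pnorm(5)) * 2
print(threshold)

# Hypothetical p-values; set aside those below the threshold.
p <- c(.04, .001, 1e-5, 1e-9, 1e-12)
p_kept <- p[p >= threshold]
print(p_kept)
```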

It Matters: False Positive Rate in Social Psychology

Gronau et al. also conducted an analysis of focal hypothesis tests in social psychology. There are two reasons why the false positive rate in social psychology would be higher than in cognitive psychology. First, selecting focal tests selects significant results with weaker evidence compared to sets of studies that also include non-focal tests (e.g., manipulation checks). Second, social psychology is known to be less replicable than cognitive psychology (OSC, 2015). The BMM point estimate was 52%. Given the small set of studies, the 95% credibility interval ranged from 40% to 64%. To avoid the problem of small samples, we also fitted the model to Motyl et al.’s (2017) data that contained 803 significant focal hypothesis tests from social psychology articles. The estimate of false positives was even higher with 64% and a tight 95% credibility interval ranging from 59% to 69%. The point estimate decreases to 40% for the altBMM model. This model also estimates the standard deviation to be 1.94. Limiting the set to p-values that correspond to z-scores below 5 sigma also reduces the BMM estimate to 48%, and the 95%CI gets wider (38% to 58%) because there is now more uncertainty about the allocation of lower z-scores. These results suggest that a false positive rate of 60% is an inflated estimate of false positives in social psychology. However, consistent with the outcome from replication studies, the false positive rate in social psychology is higher than in cognitive psychology, and high enough to warrant the claim of a replication crisis in social psychology.


We demonstrated that the BMM for p-curve overestimates the percentage of false positives because it imposes an unnecessary and dogmatic restriction on the variability of probit transformed p-values that are used to estimate the false positive rate. We also found that this inflation has practical consequences for the assessment of the false positive rates in cognitive psychology and social psychology. While the flawed BMM estimates are 40% and 60%, respectively, unbiased estimates are much lower, about 10% and 40%.

We therefore recommend our Alternative Mixture Model (AMM) that does not impose the dogmatic prior as a superior method for the estimation of the false positive rate.

Even though the rate of false positives in social psychology is below 50%, the rate of false positives is unacceptably high. It is even more disconcerting that this estimate is limited to studies where the true effect size is so small that it is practically zero. If we also consider true positives with practically insignificant effect sizes, the estimate would be even higher and probably exceed 50%. Thus, many published results in social psychology cannot be trusted and require new evidence from credible replication studies to provide empirical support for theories in social psychology.

The relatively low false positive rate for cognitive psychology does not imply that most results in cognitive psychology are replicable. It remains a concern that selection for significance produces many true positive results with low power that are difficult to replicate. Another concern is that effect size estimates are inflated and many results may be true positives with very small effect sizes that have no theoretical significance. For example, BMM estimates the percentage of significant results that were produced with an effect size of exactly 0, while it treats population effect sizes with non-zero values in the fourth or fifth decimal (d = 0.00001) as evidence that the null-hypothesis is false. From a practical point of view it is more reasonable to use a range of small effect sizes to define the null-hypothesis. Knowing the proportion of effect sizes that are exactly zero provides little added value (Lakens, 2017). Thus, statistical power and effect size estimates are more important than the distinction between true positives and false positives (Brunner & Schimmack, 2018).

An Invitation for Open Debate

We have a strong prior that our criticism of BMM is correct, but we are not going to make this a dogmatic prior that excludes the possibility that we are wrong. We are sharing our criticism of the dogmatic BMM openly to receive constructive criticism of our work. We also expect the authors of BMM to respond to our criticism and to point out flaws in our arguments if they can. To do so, the authors need to be open to the idea that their dogmatic prior on the standard deviation was unnecessary, a bad idea, and produces inflated estimates of false positives in the psychological literature. Ultimately, it is not important whether we are right or wrong, but we need to be able to trust statistical tools that are used to evaluate the credibility of psychological science. If the BMM provides dramatically inflated estimates of false positives, the results are misleading and should not be trusted.

One-tail or two-tails: That is the question

In this mini-tutorial, I discuss the relationship between p-values and z-scores. Although the standard normal distribution is a staple of intro stats, it plays a minor role in actual research, where researchers conduct t-tests and F-tests and often only look up test statistics and p-values without thinking about the underlying sampling distributions of their test statistics. A better understanding of p-values and z-scores is needed because new statistical methods rely on meta-analyses of p-values or z-scores to make claims about the quality of psychological research.

Basic Introduction

Let’s assume that researchers would use z-tests to analyze their data and convert z-tests into p-values to determine whether a result is statistically significant, p < .05. In the statistics program R, the conversion of a z-score into a p-value uses the command pnorm(z, mean, sd). For significance testing we want to know how extreme the observed z-score is relative to the null-hypothesis, which is defined by a standard normal distribution (mean = 0 and sd = 1). So, we would use the command pnorm(z, mean=0, sd=1). Because the standard normal is the default assumption, we can also simply request the p-value with pnorm(z).

However, using this command will produce some strange results. For example, if we observed a z-score of 2.5, we obtain a p-value of .99, which would suggest that our result is not significant p > .05. The problem is that the default option in R is to provide the area under the standard normal distribution on the left side of the z-score. So, we see that 99% of the distribution is on the left side, the lower tail, and only 1% is on the right side, the upper tail. With only 1% in the upper tail, we can claim a significant result, p < .05.

pnorm(2.5)
[1] 0.9937903

There are various options to obtain the p-value we really want. One option is to write pnorm(2.5, lower.tail=FALSE), which gives us p = .01.

pnorm(2.5, lower.tail=FALSE)
[1] 0.006209665

A simpler option is to make use of the symmetry of the standard normal distribution and simply turn the positive z-score into a negative z-score.

pnorm(-2.5)
[1] 0.006209665

Yet another option is to subtract the lower tail from 1.

1-pnorm(2.5)
[1] 0.006209665

So, we see that a z-score of 2.5 is statistically significant with p < .05. However, z-scores are two-sided. That is, they have positive and negative values. What if we had observed a z-score of -2.5? Would that also be significant? As we can see, the answer is no. The reason is that we are conducting one-tailed tests, where only positive deviations from H0 can be used to reject the null-hypothesis.

pnorm(-2.5, lower.tail=FALSE)
[1] 0.9937903

Typically, psychologists prefer two-tailed tests, which is the default for F-tests that ignore the sign of an effect. To make the sign irrelevant, we can simply use the absolute z-score to obtain our upper tail p-value.

pnorm(abs(-2.5), lower.tail=FALSE)
[1] 0.006209665

Now we get the same p-value that we obtained for z = 2.5. However, checking both tails doubles the risk of a type-I error. Therefore, we have to double the p-value, if we want to conduct a two-tailed test.

pnorm(abs(-2.5), lower.tail=FALSE)*2
[1] 0.01241933
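
The steps above can be wrapped into a small helper. This is just a convenience sketch; two_tailed_p is a name chosen here and is not a base R function.

```r
# Two-tailed p-value for a z-score; the sign of z is ignored.
# (two_tailed_p is a name made up for this sketch.)
two_tailed_p <- function(z) 2 * pnorm(abs(z), lower.tail = FALSE)

print(two_tailed_p(c(-2.5, 2.5)))   # both signs give the same p-value
print(two_tailed_p(1.96))           # just below .05
```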

Multiple p-values

Psychologists are familiar with effect size meta-analysis. However, before effect size meta-analysis became common, meta-analyses were carried out with p-values or z-scores. Fisher not only invented p-values, he also introduced a method to combine p-values from multiple studies. Meta-analyses of p-values have experienced a renaissance in psychology with the introduction of p-curve, which is essentially a histogram of statistically significant p-values. Importantly, p-curve is based on two-tailed p-values, as I will demonstrate below.
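
As a brief illustration of combining p-values, here is a sketch of Fisher’s method in R. The four p-values are hypothetical, and the method assumes that the studies are independent.

```r
# Fisher's method: under H0, -2 * sum(log(p)) follows a chi-square
# distribution with 2k degrees of freedom for k independent p-values.
p <- c(.04, .20, .01, .08)            # hypothetical p-values from four studies
chi2 <- -2 * sum(log(p))
p_combined <- pchisq(chi2, df = 2 * length(p), lower.tail = FALSE)
print(c(chi2 = chi2, p_combined = p_combined))
```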

Assume that we have a large set of z-tests from 1000 studies, but all 1000 studies tested a true null-hypothesis. As a result, we would expect that the 1000 z-scores follow the sampling distribution of a standard normal distribution.

z = rnorm(1000)

After we convert the z-scores into ONE-TAILED p-values, we see that they follow a uniform distribution.

p = pnorm(-z)

The same is true for TWO-TAILED p-values

p = (1-pnorm(abs(z)))*2
hist(p)

However, this is only true for the special case, when the null-hypothesis is true. When the null-hypothesis is false, the histograms of p-values (p-curves) differ dramatically.

z = rnorm(1000,1,3)

For the one-tailed p-values the distribution is bimodal. The reason is that null-effects are represented by p-values of .5. As we simulated many extreme positive and extreme negative deviations from 0, we have more p-values in the tails, close to 0 and close to 1, than p-values in the middle. Evidently, p-values are not just decreasing from 0 to 1.

However, if we compute two-tailed p-values, the distribution of p-values shows decreasing frequencies from 0 to 1.
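
Because the histograms are not reproduced here, the following sketch tabulates both kinds of p-values in ten bins. The seed is arbitrary and only makes the run reproducible.

```r
set.seed(1)                              # arbitrary seed for reproducibility
z <- rnorm(1000, 1, 3)                   # same simulation as above
p_one <- pnorm(z, lower.tail = FALSE)    # one-tailed (upper-tail) p-values
p_two <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-tailed p-values

# Counts in ten equal-width bins from 0 to 1.
bins <- seq(0, 1, by = .1)
counts_one <- as.vector(table(cut(p_one, bins, include.lowest = TRUE)))
counts_two <- as.vector(table(cut(p_two, bins, include.lowest = TRUE)))
print(rbind(one_tailed = counts_one, two_tailed = counts_two))
```

The one-tailed row should pile up in the first and last bins, while the two-tailed row should be highest in the first bin.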

In sum, it is important to think about the tails of a p-value. One-tailed p-values should be used when the sign of a test is meaningful, for example, in a meta-analysis of studies that tested the same hypothesis. In this case, we need to obtain p-values from test statistics that have a direction (z-scores, t-values) and we cannot use test statistics that remove information about the direction of a test (F-values, chi-square values). However, if we do not care about the sign of an effect, we should use two-tailed p-values because we only care about the strength of evidence against the null-hypothesis.

Going from p-values to z-scores

Meta-analyses of p-values can use p-values that are based on different test statistics (t-tests, F-tests, etc.). The reason is that all p-values have the same meaning. A p-value of .02 from a z-test provides the same information as a p-value of .02 from a t-test. However, p-values have an undesirable distribution. A solution to this problem is to convert p-values into values that follow a distribution with more desirable characteristics. The most desirable distribution is the standard normal distribution. Thus, we can use z-scores as a common metric to compare results of different studies (Stouffer et al., 1949).

However, we have to think again about the tails of p-values when we convert p-values into z-scores. If all p-values are one-tailed p-values, we can simply convert our upper-tail p-values into z-scores using the qnorm command. Simulating a uniform distribution of p-values and converting the p-values into z-scores gives us the standard normal distribution centered over 0.

p = runif(1000,0,1)
z = qnorm(p,lower.tail=FALSE)

However, things are more complicated with TWO-TAILED p-values as shown in the diagram below. First, we simulate a set of z-tests as before. We then convert the results into two-tailed p-values and convert them back. The conversion has to take into account that we doubled the p-values for two-tailed testing. So, now we need to halve the p-values before we convert from p to z.

z = rnorm(1000,1,3)
p = (1-pnorm(abs(z)))*2
qz = -qnorm(p/2)

However, we see that the distribution of the original z-scores (black) differs from the distribution of the z-scores obtained from two-tailed p-values. The difference is that the converted z-scores do not have negative values. They are ABSOLUTE z-scores because the computation of two-tailed p-values erased information about the sign of a test. A low p-value could have been obtained from a high positive or a high negative z-score. To see this we can compare the converted z-scores to the absolute values of the original z-scores.
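
This comparison can also be run numerically. The sketch below computes the upper tail directly with lower.tail=FALSE instead of 1-pnorm(), which is a deliberate change to the code above so that very large z-scores are not rounded to p = 0.

```r
set.seed(2)                                   # arbitrary seed
z <- rnorm(1000, 1, 3)
p <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-tailed p-values
qz <- -qnorm(p / 2)                           # back-converted z-scores

print(all.equal(qz, abs(z)))   # the converted values match the absolute z-scores
print(range(qz))               # no negative values
```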


In sum, if we convert one-tailed p-values into z-scores, we retain information about the sign of an effect and the sampling error follows a standard normal distribution. However, if we use two-tailed p-values and convert them into z-scores, the distribution of z-scores is truncated at zero and only positive z-scores can be observed. Sampling error no longer follows a standard normal distribution.

A Minor Technical Problem

As noted before, p-values have an undesirable distribution. A z-score of 1 corresponds to a two-tailed p-value of p = .32. A z-score of 2 corresponds to a p-value of .05. A z-score of 3 corresponds to a p-value of p = .003. A z-score of 4 corresponds to a p-value of .0001. A z-score of 5 corresponds to a p-value of .000001. The number of zeros behind the decimal point increases quickly and at some point, rounding errors make it impossible to convert p-values into z-scores.

p = (1-pnorm(8.3))*2
-qnorm(p/2)
[1] Inf

All p-values for z-scores greater than 8.2 are treated as 0 and are converted into a z-score of infinity. To avoid this problem, R provides the option to use log p-values. With the log.p option, we can convert a z-score of 10 into a p-value, take the log of the p-value, and retrieve the correct z-score of 10.

p = pnorm(10,lower.tail=FALSE)*2
[1] 0.00000000000000000000001523971
-qnorm(log(p) - log(2),log.p=TRUE)
[1] 10
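
The same idea can be pushed further by never leaving the log scale. The following is a sketch with two helper functions whose names (z_to_logp2, logp2_to_z) are made up for this example; it uses the log.p argument of both pnorm and qnorm.

```r
# Two-tailed log p-value computed directly on the log scale, then back to |z|.
# (z_to_logp2 and logp2_to_z are names made up for this sketch.)
z_to_logp2 <- function(z) pnorm(abs(z), lower.tail = FALSE, log.p = TRUE) + log(2)
logp2_to_z <- function(logp) -qnorm(logp - log(2), log.p = TRUE)

logp <- z_to_logp2(40)      # the ordinary p-value is too small for a double
print(exp(logp))            # underflows to 0
print(logp2_to_z(logp))     # the z-score is still recovered from the log p-value
```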

Dos and Don’ts

Just like meta-analysis of p-values has seen a renaissance, meta-analysis of z-scores has also seen renewed attention. Jerry Brunner and I developed z-curve to estimate mean power of a set of studies that were selected for significance and have heterogeneity in power as a result of heterogeneity in sample sizes and effect sizes (Brunner & Schimmack, 2018). Z-curve first converts all observed test-statistics into TWO-TAILED p-values and then converts two-tailed p-values into ABSOLUTE Z-SCORES. The method then fits several TRUNCATED standard normal distributions to the data to obtain estimates of statistical power.

Another method is the Bayesian Mixture Model (BMM) that aims to estimate the percentage of false positives in a set of studies. However, the BMM model has several deficiencies in the conversion process from p-values to z-scores.

First, it uses the formula for ONE-TAILED p-values when the input are TWO-TAILED p-values.

Second, it converts upper-tail p-values into z-scores using the formula for lower-tail p-values.

z = qnorm(p).

As a result, the distribution of z-scores obtained from p-values that were produced by z-tests differs from the distribution of the actual z-tests.

z = rnorm(10000,1)
p = (1-pnorm(abs(z)))*2

qz = qnorm(p) #BMM transformation

qz = -qnorm(p) #FLIPPED BMM transformation

Even if we correct for the sign error and flip the distribution, the reproduced distribution differs from the original distribution because TWO-TAILED p-values are converted into z-scores using the formula for ONE-TAILED p-values.
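
A side-by-side run makes the difference concrete. This sketch uses the same simulation as above with an arbitrary seed; z_bmm is the flipped one-tailed conversion discussed here and z_correct is the two-tailed conversion.

```r
set.seed(3)                                   # arbitrary seed
z <- rnorm(10000, 1)                          # z-tests with a true effect
p <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-tailed p-values

z_bmm     <- -qnorm(p)        # flipped one-tailed formula applied to two-tailed p
z_correct <- -qnorm(p / 2)    # two-tailed formula, returns |z|

print(c(mean_abs_z = mean(abs(z)), mean_correct = mean(z_correct), mean_bmm = mean(z_bmm)))
print(mean(z_bmm < 0))        # share of results that end up with a negative z-score
```

Every two-tailed p-value above .5 is turned into a negative z-score by the one-tailed formula, even though two-tailed p-values carry no information about the sign.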

In conclusion, it is important to think about the tails of p-values. One-tailed p-values are not identical to two-tailed p-values. Using the formula for one-tailed p-values with two-tailed p-values distorts the information that is provided by the actual data. Two-tailed p-values do not contain information about the sign of an effect. Converting them into z-scores produces absolute z-scores that reflect the strength of evidence against the null-hypothesis without information about the direction of an effect.


P-values and z-scores contain valuable information about the results of studies. Both p-values and z-scores provide a common metric to compare results of studies that used different test statistics or differed in sample sizes (and degrees of freedom). Meta-analysts can use standardized effect sizes, p-values, or z-scores. P-values and z-scores can be transformed into each other without further information about sample sizes. However, to convert them properly, we have to take into account whether p-values tested a one-tailed or a two-tailed hypothesis. For one-tailed tests, the null-hypothesis corresponds to a p-value of .5 with values of 0 and 1 corresponding to very strong (infinite) evidence against the null-hypothesis. For two-tailed tests, p-values of 1 correspond to the null-hypothesis and a value of 0 corresponds to infinite evidence against the null-hypothesis. For z-scores a value of 0 corresponds to the null-hypothesis and increasing values in either direction provide evidence against it. Thus, one-tailed p-values correspond to z-scores with p = 0 corresponding to z = inf, p = .5 corresponding to z = 0, and p = 1 corresponding to z = -inf. In contrast, two-tailed p-values only provide information about strength of evidence and a p-value of 1 corresponds to z = 0, while a p-value of 0 corresponds to z = inf. Any meta-analysis with z-scores requires a transformation into p-values to create a common metric. The conversion of p-values into z-scores for this purpose should take into account whether p-values are one-tailed or two-tailed. Converting two-tailed p-values into z-scores using the formula for one-tailed p-values may lead to false conclusions in a meta-analysis of z-scores.

S.O.S We need open reviews.

I wrote a commentary that made a very simple point. A published model assumed that the variance of z-scores is typically less than 1. I pointed out that this is not a reasonable assumption because the standard deviation of z-scores is at least one and often greater than 1, when studies vary in effect sizes, sample sizes, or both. This commentary was rejected. One reviewer even provided R-Code to make his or her case. Here is my rebuttal.

Here is the R code provided by the reviewer. We see SDs of 0.59, 0.49 and 0.46. Based on these results, the reviewer thinks that setting a prior to a range of values between 0 and 1 is reasonable.

Let’s focus on the example that the reviewer claims is realistic for a p-value distribution for 80% power. The reviewer simulates this scenario with a beta distribution with shape parameters 1 and 31. The Figure shows the implied distribution of p-values. What is most notable is that p-values greater than .38 are entirely missing; the maximum p-value is .38.

In this figure 80% of p-values are below .05 and 20% are above .05. This is why the reviewer suggests that the pattern of observed p-values corresponds to a set of studies with 80% power.
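
The 80% figure can be verified directly from the beta distribution with shape parameters 1 and 31. The sketch below uses only the base R functions pbeta and qbeta.

```r
# Share of p-values below .05 under the reviewer's Beta(1, 31) distribution.
share_sig <- pbeta(.05, 1, 31)        # equals 1 - .95^31
print(share_sig)

# Upper quantiles show how little probability mass lies at large p-values.
print(qbeta(c(.99, .9999), 1, 31))
```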

However, the reviewer does not consider whether this distribution of p-values could arise from a set of studies where p-values are the result of the non-centrality parameter and sampling error that follows a sampling distribution.

To simulate studies with 80% power, we can simply use a normal distribution with a standard deviation of 1 centered over 2.80. Sampling error will produce z-scores greater and smaller than the non-centrality parameter of 2.80. Moreover, we already know that the standard deviation of these test statistics is 1 because z-scores have the standard normal distribution as a sampling distribution (a point made and ignored by the reviewers and editor).
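
A minimal sketch of this simulation follows. The sample size and seed are arbitrary choices, and a two-tailed test with alpha = .05 is assumed.

```r
# Power of a two-tailed z-test (alpha = .05) with non-centrality parameter 2.80.
crit <- qnorm(.975)
power <- pnorm(2.80 - crit) + pnorm(-2.80 - crit)
print(power)

set.seed(4)                                   # arbitrary seed
z <- rnorm(100000, mean = 2.80, sd = 1)       # simulated z-tests
p <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-tailed p-values
print(c(share_significant = mean(p < .05), sd_z = sd(z)))
```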

We can now compute the two-tailed p-values for each z-test and plot the distribution of p-values. Figure 2 shows the actual distribution in black and the reviewer’s beta distribution in red.

It is visible that the actual distribution has a lot more p-values that are very close to zero, which corresponds to high z-scores. We can now transform the p-values into z-scores using the reviewers’ formula (for one-tailed tests).

y = qnorm(p)
mean(y) #-2.54
sd(y) #1.11

We see that the standard deviation of these z-scores is greater than 1.

Using the correct formula for two-tailed p-values, we of course get the result that we already know to be true.

y = -qnorm(p/2)
mean(y) #2.80
sd(y) #1.00

It should be obvious that the reviewer made a mistake by assuming we can simulate p-value distributions with any beta-distribution. P-values cannot assume any distribution because the actual distribution of p-values is a function of the properties of the distribution of test-statistics that are used to compute p-values. With z-scores as test statistics it is well-known from intro statistics that sampling error follows a standard normal distribution, which is a normal distribution with a standard deviation of 1. Any transformation of z-scores into p-values and back into z-scores does not alter the standard deviation. Thus, the standard deviation has to be at least 1.

Heterogeneity in Power

The previous example assumed that all studies have the same amount of power. Allowing for heterogeneity in power will further increase the standard deviation of z-scores. This is illustrated with the next example, where mean power is again 80%, but this time the non-centrality parameters vary with a normal distribution centered over 3.15 and a standard deviation of 1. Figure 3 shows the distribution of p-values which is even more extreme and deviates even more from the simulated beta-distribution by the reviewer.

Using the reviewer’s formula, we now get a standard deviation of 1.54, but if we use the correct formula for two-tailed p-values, we end up with 1.39, close to the expected value of 1.41.

y = qnorm(p)
mean(y) #-2.90
sd(y) #1.54

y = -qnorm(p/2)
mean(y) #3.16
sd(y) #1.39

This value makes sense because we simulated variation in z-scores with two normal distributions that each have a standard deviation of 1, one for the variation in the non-centrality parameters and one for the variation in sampling error. Adding the two variances gives a joint variance of 1 + 1 = 2 and a standard deviation of sqrt(2) = 1.41.
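
The variance addition can be checked with a short simulation. This is a sketch with an arbitrary seed and sample size, not the code that produced the numbers above.

```r
set.seed(5)                                   # arbitrary seed
ncp <- rnorm(100000, mean = 3.15, sd = 1)     # heterogeneous non-centrality parameters
z <- rnorm(100000, mean = ncp, sd = 1)        # add sampling error
p <- 2 * pnorm(abs(z), lower.tail = FALSE)    # two-tailed p-values
y <- -qnorm(p / 2)                            # back to absolute z-scores

print(c(sd_signed = sd(z), sd_absolute = sd(y), theory = sqrt(2)))
```

Because two-tailed p-values drop the sign, the few negative z-scores are folded onto the positive side, so the standard deviation of the absolute z-scores cannot exceed that of the signed z-scores. This may be why a simulated value can fall slightly below 1.41.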


Unless I am totally crazy, I have demonstrated that we can use simple intro stats knowledge to realize that the standard deviation of p-values converted into z-scores has to be at least 1 because sampling error alone produces a standard deviation of 1. If the set of studies is heterogeneous and power varies across studies, the standard deviation will be even greater than 1. A variance less than 1 is only expected in unrealistic simulations or when researchers use questionable research practices, which reduces variability in p-values (e.g., all p-values greater than .05 are missing) and therewith also the variability in z-scores.

A broader conclusion is that the traditional publishing model in psychology is broken. Closed peer-review is too slow and unreliable to ensure quality control. Neither the editor of a prestigious journal, nor four reviewers were able to follow this simple line of argument. Open review is the only way forward. I guess I will be submitting this work to a journal with open reviews, where reviewers’ reputation is on the line and they have to think twice before they criticize a manuscript.

Replicability Audit of Steven J. Heine

“Trust is good, but control is better”  


Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated. However, the replicability of published results in social psychology is doubtful. Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results is, if the studies were replicated exactly. In a replicability audit, I am applying z-curve to the most cited articles of psychologists to estimate the replicability of their studies.

Steven J. Heine

Under construction