You can now do your own z-curve analysis with this shinyApp
Interpretation of Z-Curve Output
The basic principles of z-curve were outlined in Brunner and Schimmack (2018). This post explains the latest version of the z-curve plot and the statistics reported with it. The data used for this article stem from a project that codes a representative sample of focal hypothesis tests in the Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP-LMC).
The Range information shows the full range of observed z-scores. Here z-scores range from 0.41 to 10. A value of 10 is the maximum because all larger z-scores were recoded to 10. Only z-scores less than 6 are shown because z-curve treats all studies with z-scores greater than 6 as having 100% power, and no estimation of power is needed.
There are 302 focal tests in the dataset and 273 are significant. The vertical, solid red line at z = 1.96 separates non-significant results on the left from significant results on the right with alpha = .05 (two-tailed). The dotted red line at 1.65 is the boundary for marginally significant results, alpha = .10 (two-tailed). The green line at 2.8 marks the z-score that corresponds to 80% power with alpha = .05 (1.96 plus the 80th percentile of the standard normal distribution, 0.84). If studies have an average power of 80%, the mode of the distribution should be here.
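As a rough check, the three reference lines can be reproduced from the standard normal distribution. The sketch below uses only Python's standard library; the function names are illustrative and are not part of the z-curve software.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def critical_z(alpha):
    """Two-tailed critical z-value for a given alpha."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2)


def z_for_power(power, alpha=0.05):
    """Expected z-score that yields the requested power.

    Approximation: ignores the negligible chance of a significant
    result in the opposite tail.
    """
    return critical_z(alpha) + STD_NORMAL.inv_cdf(power)


print(f"{critical_z(0.05):.3f}")   # solid red line, about 1.96
print(f"{critical_z(0.10):.3f}")   # dotted red line, about 1.645
print(f"{z_for_power(0.80):.3f}")  # green line, about 2.80
```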
The main part of the figure is a histogram of the test statistics (F-values, t-values) converted into p-values and then converted into z-scores; z = qnorm(1-p/2). The solid blue line shows the density distribution for the significant z-scores with a default bandwidth of .05. The grey line shows the fit of the predicted density distribution based on the z-curve model. The grey line is extended into the range of non-significant results, which provides an estimate of the file-drawer of non-significant results that were not reported.
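The conversion from p-values to z-scores can be sketched as follows. This is a minimal Python illustration rather than the code behind the plot. It uses the symmetric form abs(qnorm(p/2)), which is mathematically the same as qnorm(1-p/2) but avoids rounding 1-p/2 to exactly 1 for very small p-values, and it applies the recoding of large z-scores to 10 described above. The function name and the `cap` parameter are assumptions made for this example.

```python
from statistics import NormalDist


def p_to_z(p, cap=10.0):
    """Convert a two-tailed p-value into an absolute z-score.

    Equivalent to qnorm(1 - p/2) in R. Values above `cap` are
    recoded to `cap`, mirroring the recoding of large z-scores to 10.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    z = abs(NormalDist().inv_cdf(p / 2))
    return min(z, cap)


print(round(p_to_z(0.05), 2))  # p = .05 sits on the significance boundary
print(p_to_z(1e-30))           # extremely small p-value, capped at 10
```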
The observed discovery rate (ODR) is the proportion of significant results among the k = 302 tests that were coded. A 95% confidence interval is given to provide information about the accuracy of this estimate. In this example the observed discovery rate is 90% (273/302), which is typical for psychology (Sterling, 1959; Sterling et al., 1995).
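The observed discovery rate is a simple proportion, so it can be checked by hand. The sketch below also computes a 95% confidence interval with the normal approximation for a binomial proportion. That choice is an assumption made for illustration; the post does not say which interval method z-curve uses for this estimate.

```python
from math import sqrt
from statistics import NormalDist


def observed_discovery_rate(n_significant, n_tests, level=0.95):
    """Return (ODR, lower, upper) using a normal-approximation CI.

    The interval method is an assumption for this sketch, not
    necessarily the one used by the z-curve software.
    """
    odr = n_significant / n_tests
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(odr * (1 - odr) / n_tests)
    return odr, max(0.0, odr - z * se), min(1.0, odr + z * se)


odr, lower, upper = observed_discovery_rate(273, 302)
print(f"ODR = {odr:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```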
For all other results, a 95% confidence interval is obtained using bootstrapping with a default of 500 iterations.
The estimated discovery rate (EDR) is the proportion of significant results among all tests that were conducted, including the estimated file-drawer of non-significant results that were not reported. In this example, the estimated discovery rate is only 38%.
A comparison of these two rates provides information about the amount of publication bias (Schimmack, 2012). As the observed discovery rate is much higher than the estimated discovery rate, we can conclude that JEP-LMC selectively publishes significant results. This is consistent with the visual inspection of the file-drawer in the plot.
The file-drawer ratio is a simple conversion of the estimated discovery rate into the ratio of non-significant results (the file-drawer) to significant results. It estimates how many non-significant results were obtained for every significant result; file.drawer.ratio = (1-EDR)/EDR. In this example, the ratio is 1.63:1, meaning there are 1.63 non-significant results for every significant result.
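The conversion can be verified with the numbers from this example. The following Python sketch is only an illustration of the formula; the function name is made up for this purpose.

```python
def file_drawer_ratio(edr):
    """Non-significant results per significant result: (1 - EDR) / EDR."""
    if not 0 < edr <= 1:
        raise ValueError("EDR must be in the interval (0, 1]")
    return (1 - edr) / edr


print(round(file_drawer_ratio(0.38), 2))  # EDR of 38% gives about 1.63
```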
The latest addition to z-curve is Soric’s false discovery risk (FDR). Soric showed that it is possible to compute the maximum false discovery rate based on the assumption that all true discoveries were obtained with 100% power. If average power were less, the actual false discovery rate would be less than the stated false discovery risk; false discovery rate <= false discovery risk. Using Soric’s formula, FDR = ((1/EDR)-1)*(.05/.95), yields a false discovery risk of 9%. František Bartoš suggested a direct transformation of the file-drawer ratio (fdr) into the false discovery risk (FDR), which gives the same result, FDR = fdr*.05/.95 (or, more generally, FDR = fdr*alpha/(1-alpha)). This means that no more than 9% of the significant focal tests (discoveries) in JEP-LMC are false positives.
The accuracy of Soric’s FDR depends on the accuracy of the projection of the z-curve into the range of non-significant results. I am conducting simulation studies to evaluate the performance of z-curve. Preliminary results suggest that Soric’s FDR may underestimate the false discovery rate when a high proportion of studies have high power.
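Both versions of the formula are easy to check against each other. The sketch below, with illustrative function names, plugs in the estimated discovery rate of 38% and confirms that the two routes give the same value of roughly 9%.

```python
from math import isclose


def soric_fdr(edr, alpha=0.05):
    """Soric's maximum false discovery rate from the discovery rate."""
    return (1 / edr - 1) * (alpha / (1 - alpha))


def fdr_from_file_drawer_ratio(ratio, alpha=0.05):
    """Same quantity, starting from the file-drawer ratio (1 - EDR) / EDR."""
    return ratio * alpha / (1 - alpha)


edr = 0.38
ratio = (1 - edr) / edr
print(f"{soric_fdr(edr):.3f}")                     # about 0.086, i.e. 9%
print(f"{fdr_from_file_drawer_ratio(ratio):.3f}")  # same value
print(isclose(soric_fdr(edr), fdr_from_file_drawer_ratio(ratio)))
```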
Soric’s FDR defines a false discovery as a significant result with a population effect size of zero (i.e., the nil-hypothesis, Cohen, 1994). As a result, even studies with extremely small effect sizes and low power, which are difficult to replicate, are treated as true positives. Z-curve addresses this problem by computing an alternative false discovery risk. Z-curve is fitted with fixed proportions of false positives, and the fit is compared to the baseline model with no restrictions on the percentage of false positives. Once model fit deviates notably from the baseline model, the model has specified too many false positives. The model with the highest proportion of false positives that still has acceptable fit is used to estimate the maximum false discovery risk. The obtained value depends on the specification of the other possible power values. In the default model, the lowest power for true positive results is 17%, which corresponds to a non-central z-score of 1. Lowering this value would decrease the FDR and, in the limit, reach Soric’s FDR. Increasing this value would increase the FDR. The Z0-FDR estimate is considerably higher than Soric’s FDR, indicating that several studies with positive results are studies with very small effects. The 95% CI shows that up to 55% of published results could be false positives when very small effects are considered false positives. The drawback of this approach is that there is no clear definition of the effect sizes that are considered false positives.
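The link between a non-central z-score of 1 and 17% power can be checked directly. The sketch below computes two-tailed power for a z-test whose expected value is the non-centrality parameter; it illustrates the arithmetic only and is not the z-curve estimation routine. The function name is an assumption.

```python
from statistics import NormalDist


def power_two_tailed(ncp, alpha=0.05):
    """Probability of |z| > critical value when z ~ Normal(ncp, 1)."""
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - std.cdf(crit - ncp)  # significant, expected direction
    lower_tail = std.cdf(-crit - ncp)     # significant, opposite direction
    return upper_tail + lower_tail


print(f"{power_two_tailed(1.0):.2f}")  # lowest true-positive component, about .17
print(f"{power_two_tailed(0.0):.2f}")  # nil-hypothesis: power equals alpha
print(f"{power_two_tailed(2.8):.2f}")  # green line in the plot, about .80
```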
The last, yet most important, estimate is the replication rate. The replication rate is the mean power of the published studies with a significant result. As power predicts the long-run proportion of significant results, mean power is an estimate of the replication rate if the set of studies were replicated exactly with the same sample sizes. Increasing sample sizes would increase the replication rate, while lowering sample sizes would decrease it. With the same sample sizes as in the original studies, articles in JEP-LMC are expected to have a replication rate of 57%. The 95% CI includes the observed replication rate of 50% for cognitive studies in the Open Science Collaboration project (OSC, 2015). This result validates the replication rate estimate of z-curve with outcomes from actual replication studies.
Brunner, J. & Schimmack, U. (2018). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology.
Sorić, B. (1989). Statistical “discoveries” and effect-size estimation. Journal of the American Statistical Association, 84(406), 608-610. DOI: 10.1080/01621459.1989.10478811
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251).