Category Archives: Z-Curve

An Honorable Response to the Credibility Crisis by D.S. Lindsay: Fare Well

We all know what psychologists did before 2012. The name of the game was to get significant results that could be sold to a journal for publication. Some did it with more power and some did it with less power, but everybody did it.

In the beginning of the 2010s it became obvious that this was a flawed way to do science. Bem (2011) used this anything-goes-to-get-significance approach to publish nine significant demonstrations of a phenomenon that does not exist: mental time-travel. The cat was out of the bag. There were only two questions: How many other findings were unreal, and how would psychologists respond to the credibility crisis?

D. Steve Lindsay responded to the crisis by helping to implement tighter standards and enforcing them as editor of Psychological Science. As a result, Psychological Science has published more credible results over the past five years. At the end of his editorial term, Lindsay published a gutsy and honest account of his journey towards a better and more open psychological science. It starts with his own realization that his research practices were suboptimal.

Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent. For example, a chapter by Lindsay and Kantner (2011) reported 16 experiments with an on-again/off-again effect of feedback on recognition memory. Cumming’s talk explained that p values are very noisy. Moreover, when between-subjects designs are used to study small- to medium-sized effects, statistical tests often yield nonsignificant outcomes (sometimes with huge p values) unless samples are very large.

Hard on the heels of Cumming’s talk, I read Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” article, published in Psychological Science. Then I gobbled up several articles and blog posts on misuses of null-hypothesis significance testing (NHST). The authors of these works make a convincing case that hypothesizing after the results are known (HARKing; Kerr, 1998) and other forms of “p hacking” (post hoc exclusions, transformations, addition of moderators, optional stopping, publication bias, etc.) are deeply problematic. Such practices are common in some areas of scientific psychology, as well as in some other life sciences. These practices sometimes give rise to mistaken beliefs in effects that really do not exist. Combined with publication bias, they often lead to exaggerated estimates of the sizes of real but small effects.

This quote is exceptional because few psychologists have openly talked about their research practices before (or after) 2012. It is an open secret that questionable research practices were widely used, and anonymous surveys support this (John et al., 2012), but nobody likes to talk about it. Lindsay’s frank account is an honorable exception in the spirit of true leaders who confront mistakes head on, just like the Nobel laureate Frances Arnold, who recently retracted a Science article.

1. Acknowledge your mistakes.

2. Learn from your mistakes.

3. Teach others from your mistakes.

4. Move beyond your mistakes.

Lindsay’s acknowledgement also makes it possible to examine what these research practices look like when we examine published results, and to see whether this pattern changes in response to awareness that certain practices were questionable.

So, I z-curved Lindsay’s published results from 1998 to 2012. The graph shows some evidence of QRPs, in that the model assumes more non-significant results (grey line from 0 to 1.96) than are actually observed (histogram of non-significant results). This is confirmed by a comparison of the observed discovery rate (70% of published results are significant) and the expected discovery rate (44%). However, the confidence intervals overlap. So this test of bias is not significant.

The replication rate is estimated to be 77%. This means that there is a 77% probability that repeating a test with a new sample (of equal size) would produce a significant result again. Even for just significant results (z = 2 to 2.5), the estimated replicability is still 45%. I have seen much worse results.

Nevertheless, it is interesting to see whether things improved. First of all, being editor of Psychological Science is a full-time job, so Lindsay's research output has decreased. Maybe research also slowed down because studies were conducted with more care. I don’t know. I just know that there are very few statistics to examine.

Although the small number of tests makes the results somewhat uncertain, the graph shows some changes in research practices. Replicability increased further to 88%, and there is no longer a discrepancy between the observed and expected discovery rates.

If psychology as a whole had responded like D.S. Lindsay, it would be in a good position to start the new decade. The problem is that this response is the exception rather than the rule, and some areas of psychology and some individual researchers have not changed at all since 2012. This is unfortunate because questionable research practices hurt psychology, especially as undergraduates and the wider public learn more and more about how untrustworthy psychological science has been and often still is. Hopefully, reforms will come sooner rather than later, or we may have to sing a swan song for psychological science.

Estimating the Replicability of Psychological Science

Over the past years, psychologists have become increasingly concerned about the credibility of published results. The credibility crisis started in 2011, when Bem published incredible results that seemed to suggest that humans can foresee random future events. Bem’s article revealed fundamental flaws in the way psychologists conduct research. The main problem is that psychology journals only publish statistically significant results (Sterling, 1959). If only significant results are published, all hypotheses will receive empirical support as long as they are tested. This is akin to saying that everybody has a 100% free throw average or nobody ever makes a mistake if we do not count failures.

The main problem of selection for significance is that we do not know the real strength of evidence that empirical studies provide. Maybe the selection effect is small and most studies would replicate. However, it is also possible that many studies might fail a replication test. Thus, the crisis of confidence is a crisis of uncertainty.

The Open Science Collaboration conducted actual replication studies to estimate the replicability of psychological science. They replicated 97 studies with statistically significant results and were able to reproduce 35 significant results (a 36% success rate). This is a shockingly low success rate. Based on this finding, most published results cannot be trusted, especially because there is heterogeneity across studies. Some studies would have an even lower chance of replication and several studies might even be outright false positives (there is actually no real effect).

As important as this project was for revealing major problems with the research culture in psychological science, it also has some limitations that cast doubt on the 36% estimate as a valid estimate of the replicability of psychological science. First, the sample size is small, and sampling error alone might have led to an underestimation of replicability in the population of studies; by the same token, sampling error could also have produced a positive bias. Another problem is that most of the studies focused on social psychology, and replicability in social psychology could be lower than in other fields. In fact, a moderator analysis suggested that the replication rate in cognitive psychology is 50%, while the replication rate in social psychology is only 25%. The replicated studies were also limited to a single year (2008) and three journals. It is possible that the replication rate has increased since 2008 or could be higher in other journals. Finally, there have been concerns about the quality of some of the replication studies. These limitations do not undermine the importance of the project, but they do imply that the 36% figure is an estimate that may underestimate the replicability of psychological science.

Over the past years, I have been working on an alternative approach to estimate the replicability of psychological science. This approach starts with the simple fact that replicability is tightly connected to the statistical power of a study, because statistical power determines the long-run probability of producing significant results (Cohen, 1988). Thus, estimating statistical power provides valuable information about replicability. Cohen (1962) conducted a seminal study of statistical power in social psychology. He found that the average power to detect an average effect size was around 50%. This is the first estimate of the replicability of psychological science, although it was only based on one journal and limited to social psychology. However, subsequent studies replicated Cohen’s findings and found similar results over time and across journals (Sedlmeier & Gigerenzer, 1989). It is noteworthy that the 36% estimate from the OSC project is not statistically different from Cohen’s estimate of 50%. Thus, there is convergent evidence that replicability in social psychology is around 50%.

In collaboration with Jerry Brunner, I have developed a new method that can estimate mean power for a set of studies that are selected for significance and that vary in effect sizes and sample sizes, which produces heterogeneity in power (Brunner & Schimmack, 2018). The input for this method are the actual test statistics of significance tests (e.g., t-tests, F-tests). These test statistics are first converted into two-tailed p-values and then into absolute z-scores. The magnitude of these absolute z-scores provides information about the strength of evidence against the null-hypothesis. The histogram of these z-scores, called a z-curve, is then used to fit a finite mixture model to the data that estimates mean power while taking selection for significance into account. Extensive simulation studies demonstrate that z-curve performs well and provides better estimates than alternative methods. Thus, z-curve is the method of choice for estimating the replicability of psychological science on the basis of the test statistics that are reported in original articles.
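For readers who want to see what this input preparation looks like, here is a minimal R sketch (not the z-curve model itself) that converts reported test statistics into two-tailed p-values and then into absolute z-scores; the test statistics are made-up examples.

# Minimal sketch of the input preparation for z-curve: convert reported test
# statistics into two-tailed p-values and then into absolute z-scores.
t.values <- c(2.1, 2.8, 3.5)    # made-up t-tests with their degrees of freedom
t.df     <- c(28, 40, 120)
F.values <- c(4.5, 7.2)         # made-up F-tests with numerator/denominator df
F.df1    <- c(1, 1)
F.df2    <- c(60, 90)
p.t <- 2 * pt(abs(t.values), t.df, lower.tail = FALSE)   # two-tailed p-values for t-tests
p.F <- pf(F.values, F.df1, F.df2, lower.tail = FALSE)    # p-values for F-tests
p   <- c(p.t, p.F)
z   <- -qnorm(p / 2)   # absolute z-scores; the z-curve histogram is built from these
z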

For this blog post, I am reporting preliminary results from a large project that extracts focal hypothesis tests from a broad range of journals that cover all areas of psychology for the years 2010 to 2017. The hand-coding of these articles complements a similar project that relies on automatic extraction of test statistics (Schimmack, 2018).

Table 1 shows the journals that have been coded so far. It also shows the estimates based on the automated method and for hand-coding of focal hypotheses.

Journal                                           Hand   Automated
Psychophysiology                                    84       75
Journal of Abnormal Psychology                      76       68
Journal of Cross-Cultural Psychology                73       77
Journal of Research in Personality                  68       75
J. Exp. Psych: Learning, Memory, & Cognition        58       77
Journal of Experimental Social Psychology           55       62
Infancy                                             53       68
Behavioral Neuroscience                             53       68
Psychological Science                               52       66
JPSP-Interpersonal Relations & Group Processes      33       63
JPSP-Attitudes and Social Cognition                 30       65
Mean                                                58       69

Hand-coding of focal hypothesis tests produces lower estimates than the automated method because the automated analysis also includes manipulation checks and other highly significant results that are not theoretically important. The correlation between the two methods, r = .67, shows consistency across methods. Finally, the mean for the automated method, 69%, is close to the mean for over 100 journals, 72%, suggesting that the sample of journals is unbiased.

The hand coding results also confirm results found with the automated method that social psychology has a lower replicability than some other disciplines. Thus, the OSC reproducibility results that are largely based on social psychology should not be used to make claims about psychological science in general.

The figure below shows the output of the latest version of z-curve. The first finding is that the replicability estimate for all 1,671 focal tests is 56%, with a relatively tight confidence interval ranging from 45% to 56%. The next finding is that the observed discovery rate, or success rate, is 92%, using p < .05 as the criterion. This confirms that psychology journals continue to publish results that are selected for significance (Sterling, 1959). The histogram further shows that even more results would be significant if p-values below .10 were included as evidence for “marginal significance.”

Z-Curve.19.1 also provides an estimate of the size of the file drawer. It does so by projecting the distribution of observed significant results into the range of non-significant results (grey curve). The file-drawer ratio shows that for every published result, we would expect roughly two unpublished studies with non-significant results. However, z-curve cannot distinguish between different questionable research practices. Rather than not disclosing failed studies, researchers may have failed to disclose other statistical analyses within a published study in order to report significant results.

Z-Curve.19.1 also provides an estimate of the false discovery rate (FDR): the percentage of significant results that may arise from testing a true nil-hypothesis, where the population effect size is zero. For a long time, the consensus has been that false positives are rare because the nil-hypothesis is rarely true (Cohen, 1994). Consistent with this view, Soric’s estimate of the maximum false discovery rate is only 10%, with a tight CI ranging from 8% to 16%.
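For readers unfamiliar with Soric's bound, here is a small R sketch of the formula; the EDR value below is an assumption chosen only to show how an estimate of roughly 10% can arise, not the exact estimate behind the figure.

# Sketch of Soric's (1989) upper bound on the false discovery rate.
# The expected discovery rate (edr) used here is an illustrative assumption.
soric.fdr <- function(edr, alpha = .05) {
  (1 / edr - 1) * alpha / (1 - alpha)
}
soric.fdr(edr = .33)   # ~ 0.10, i.e., a maximum of about 10% false positives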

However, the focus on the nil-hypothesis is misguided because it treats tiny deviations from zero as true hypotheses even if the effect size has no practical or theoretical significance. These effect sizes also lead to low power and replication failures. Therefore, Z-Curve 19.1 also provides an estimate of the FDR that treats studies with very low power as false positives. This broader definition of false positives raises the FDR estimate slightly, but 15% is still a low percentage. Thus, the modest replicability of results in psychological science is mostly due to low statistical power to detect true effects rather than a high number of false positive discoveries.

The reproducibility project showed that studies with low p-values were more likely to replicate. This relationship follows from the influence of statistical power on p-values and replication rates. To achieve a replication rate of 80%, p-values had to be less than .00005 or the z-score had to exceed 4 standard deviations. However, this estimate was based on a very small sample of studies. Z-Curve.19.1 also provides estimates of replicability for different levels of evidence. These values are shown below the x-axis. Consistent with the OSC results, a replication rate over 80% is only expected once z-scores are greater than 4.

The results also provide information about the choice of the alpha criterion for drawing inferences from significance tests in psychology. To do so, it is important to distinguish observed p-values from type-I error probabilities. For a single unbiased test, we can infer from an observed p-value less than .05 that the risk of a false positive result is less than 5%. However, when multiple comparisons are made or results are selected for significance, an observed p-value less than .05 does not imply that the type-I error risk is below .05. To claim a type-I error risk of 5% or less, we have to correct the observed p-values, just like a Bonferroni correction. As 50% power corresponds to the boundary of statistical significance, we see that z-scores between 2 and 3 are not statistically significant; that is, the type-I error risk is greater than 5%. Thus, the criterion needed to claim significance with alpha = .05 is a p-value of .003. Given the popularity of .005, I suggest using p = .005 as a criterion for statistical significance. However, this suggestion does not lower the criterion for statistical significance, because p < .005 still only allows the claim that the type-I error probability is less than 5%. The need for a lower criterion value stems from the inflation of the type-I error rate due to selection for significance. This is a novel argument that has been overlooked in the significance wars, which ignored the influence of publication bias on false positive risks.
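A quick check of the numbers in this argument, in R: the two-tailed p-value that corresponds to a z-score of 3, compared to the conventional criterion of z = 1.96.

2 * pnorm(-3)      # about .0027, i.e., roughly .003
2 * pnorm(-1.96)   # about .05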

Finally, z-curve.19.1 makes it possible to examine the robustness of the estimates by using different selection criteria. One problem with selection models is that p-values just below .05, say in the .01 to .05 range, can arise from various questionable research practices that have different effects on replicability estimates. To address this problem, it is possible to fit the model with a different selection criterion, while still estimating replicability with alpha = .05 as the criterion. Figure 2 shows the results of using only z-scores greater than 2.5 (p = .012) to fit the observed z-curve.

The blue dashed line at z = 2.5 shows the selection criterion. The grey curve between 1.96 and 2.5 is projected from the distribution of z-scores greater than 2.5. Results show a close fit with the observed distribution. As a result, the parameter estimates are also very similar. Thus, the results are robust and the selection model seems reasonable.

Conclusion

Psychology is in a crisis of confidence about the credibility of published results. The fundamental problems are as old as psychology itself: for decades, psychologists have conducted low-powered studies and published only the studies that worked (Cohen, 1962; Sterling, 1959). However, awareness of these problems has increased in recent years. Like many crises, the confidence crisis in psychology has created confusion. Psychologists are aware that there is a problem, but they do not know how large the problem is. Some psychologists believe that there is no crisis and pretend that most published results can be trusted. Others worry that most published results are false positives. Meta-psychologists aim to reduce this confusion by applying the scientific method to psychological science itself.

This blog post provided the most comprehensive assessment of the replicability of psychological science so far. The evidence is largely consistent with previous meta-psychological investigations. First, replicability is estimated to be slightly above 50%. However, replicability varies across disciplines, and the replicability of social psychology is below 50%. The fear that most published results are false positives is not supported by the data. Replicability increases with the strength of evidence against the null-hypothesis: if the p-value is below .00001, studies are likely to replicate. However, significant results with p-values above .005 should not be considered statistically significant at an alpha level of 5%, because selection for significance inflates the type-I error rate. Only studies with p < .005 can claim statistical significance with alpha = .05.

The correction for publication bias implies that researchers have to increase sample sizes to meet the more stringent p < .005 criterion. However, a better strategy is to preregister studies to ensure that reported results can be trusted. In this case, p-values below .05 are sufficient to demonstrate statistical significance with alpha = .05. Given the low prevalence of false positives in psychology, I see no need to lower the alpha criterion.

Future Directions

This blog post is just an interim report. The final project requires hand-coding of a broader range of journals. Readers who think that estimating the replicability of psychological science is beneficial and who want information about a particular journal are invited to collaborate on this project and can obtain authorship if their contribution is substantial enough to warrant it. Although it is a substantial time commitment, it does not require the participants or materials that are needed for actual replication studies. Please consider taking part in this project, and contact me if you are interested and want to know how you can get involved.

An Introduction to Z-Curve: A method for estimating mean power after selection for significance (replicability)

UPDATE 5/13/2019   Our manuscript on the z-curve method for estimation of mean power after selection for significance has been accepted for publication in Meta-Psychology. As estimation of actual power is an important tool for meta-psychologists, we are happy that z-curve found its home in Meta-Psychology.  We also enjoyed the open and constructive review process at Meta-Psychology.  Definitely will try Meta-Psychology again for future work (look out for z-curve.2.0 with many new features).

Z.Curve.1.0.Meta.Psychology.In.Press

Since 2015, Jerry Brunner and I have been working on a statistical tool that can estimate mean (statistical) power for a set of studies with heterogeneous sample sizes and effect sizes (heterogeneity in non-centrality parameters and true power). This method corrects for the inflation in mean observed power that is introduced by the selection for statistical significance. Knowledge about mean power makes it possible to predict the success rate of exact replication studies. For example, if a set of studies with mean power of 60% were replicated exactly (including sample sizes), we would expect 60% of the replication studies to produce a significant result again.
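A minimal simulation sketch of this core claim (with made-up power values): among studies selected for significance, the expected success rate of exact replications equals their mean true power.

set.seed(123)
k      <- 100000
power  <- runif(k, .10, .90)                 # heterogeneous true power across studies
sig    <- rbinom(k, 1, power) == 1           # which original studies are significant
rep.ok <- rbinom(sum(sig), 1, power[sig])    # exact replications of the significant studies
mean(power[sig])   # mean true power after selection for significance
mean(rep.ok)       # observed replication success rate, matches the line above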

Our latest manuscript is a revision of an earlier manuscript that received a revise-and-resubmit decision from the free, open-peer-review journal Meta-Psychology. We consider it the most authoritative introduction to z-curve and recommend it for learning about z-curve, critiquing z-curve, or as a citation for studies that use z-curve.

Cite as “submitted for publication”.

Final.Revision.874-Manuscript in PDF-2236-1-4-20180425 mva final (002)

Feel free to ask questions, provide comments, and critique our manuscript in the comments section. We are proud to be an open science lab, and we consider criticism an opportunity to improve z-curve and our understanding of power estimation.

R-CODE
Latest R-Code to run Z.Curve (Z.Curve.Public.18.10.28).
[updated 18/11/17]   [35 lines of code]
Call the function as: mean.power = zcurve(pvalues, Plot = FALSE, alpha = .05, bw = .05)[1]
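A hedged usage sketch: it assumes the linked script above has been sourced so that zcurve() is available with the call signature shown; the p-values and the file extension in the comment are illustrative assumptions.

# source("Z.Curve.Public.18.10.28.R")   # file name as linked above; extension assumed
pvalues <- c(.001, .003, .004, .012, .018, .020, .032, .041, .045, .049)   # made-up significant p-values
mean.power <- zcurve(pvalues, Plot = FALSE, alpha = .05, bw = .05)[1]
mean.power   # estimated mean power (replicability) of the significant results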

Z-Curve related Talks
Presentation on Z-curve and application to BS Experimental Social Psychology and (Mostly) WS-Cognitive Psychology at U Waterloo (November 2, 2018)
[Powerpoint Slides]

Can the Bayesian Mixture Model Estimate the Percentage of False Positive Results in Psychology Journals?

A method revolution is underway in psychological science.  In 2011, an article published in JPSP-ASC made it clear that experimental social psychologists were publishing misleading p-values because researchers violated basic principles of significance testing  (Schimmack, 2012; Wagenmakers et al., 2011).  Deceptive reporting practices led to the publication of mostly significant results, while many non-significant results were not reported.  This selective publishing of results dramatically increases the risk of a false positive result from the nominal level of 5% that is typically claimed in publications that report significance tests  (Sterling, 1959).

Although experimental social psychologists think that these practices are defensible, no statistician would agree with them. In fact, Sterling (1959) already pointed out that the success rate in psychology journals is too high and that claims about statistical significance are meaningless. Similar concerns were raised again within psychology (Rosenthal, 1979), but deceptive practices remain acceptable to this day (Kitayama, 2018). As a result, most published results in social psychology do not replicate and cannot be trusted (Open Science Collaboration, 2015).

For non-methodologists it can be confusing to make sense of the flood of method papers that have been published in the past years.  It is therefore helpful to provide a quick overview of methodological contributions concerned with detection and correction of biases.

First, some methods focus on effect sizes, (pcurve2.0; puniform), whereas others focus on strength of evidence (Test of Excessive Significance; Incredibility Index; R-Index, Pcurve2.1; Pcurve4.06; Zcurve).

Another important distinction is between methods that assume a fixed parameter and methods that allow for heterogeneity. If all studies have a common effect size or the same strength of evidence, it is relatively easy to demonstrate bias and to correct for it (Pcurve2.1; Puniform; TES). However, heterogeneity in effect sizes or sampling error produces challenges, and relatively few methods have been developed for this challenging yet realistic scenario. For example, Ioannidis and Trikalinos (2005) developed a method to reveal publication bias that assumes a fixed effect size across studies while allowing for variation in sampling error, but this method can be biased if there is heterogeneity in effect sizes. In contrast, I developed the Incredibility Index (also called the Magic Index) to allow for heterogeneity in effect sizes and sampling error (Schimmack, 2012).

Following my work on bias detection in heterogeneous sets of studies, I started working with Jerry Brunner on methods that can estimate average power of a heterogeneous set of studies that are selected for significance.  I first published this method on my blog in June 2015, when I called it post-hoc power curves.   These days, the term Zcurve is used more often to refer to this method.  I illustrated the usefulness of Zcurve in various posts in the Psychological Methods Discussion Group.

In September 2015, I posted replicability rankings of social psychology departments using this method. The post generated a lot of discussion and questions about the method. Although the details were still unpublished, I described the main approach of the method. To deal with heterogeneity, the method uses a mixture model.

[Figure (EJ.Mixture.png): excerpt from the online discussion describing the mixture-model approach]

In 2016, Jerry Brunner and I submitted a manuscript for publication that compared four methods for estimating the average power of heterogeneous studies selected for significance (Puniform1.1; Pcurve2.1; Zcurve; and a maximum likelihood method). In this comparison, the mixture model, Zcurve, outperformed the other methods, including the maximum likelihood method developed by Jerry Brunner. The manuscript was rejected by Psychological Methods.

In 2017, Gronau, Duizer, Bakker, and Eric-Jan Wagenmakers published an article titled “A Bayesian Mixture Modeling of Significant p Values: A Meta-Analytic Method to Estimate the Degree of Contamination From H0”  in the Journal of Experimental Psychology: General.  The article did not mention z-curve, presumably because it was not published in a peer-reviewed journal.

Although a reference to our mixture model would have been nice, the Bayesian Mixture Model differs in several ways from Zcurve.  This blog post examines the similarities and differences between the two mixture models, it shows that BMM fails to provide useful estimates with simulations and social priming studies, and it explains why BMM fails. It also shows that Zcurve can provide useful information about replicability of social priming studies, while the BMM estimates are uninformative.

Aims

The Bayesian Mixture Model (BMM) and Zcurve have different aims.  BMM aims to estimate the percentage of false positives (significant results with an effect size of zero). This percentage is also called the False Discovery Rate (FDR).

FDR = False Positives / (False Positives + True Positives)

Zcurve aims to estimate the average power of studies selected for significance. Importantly, Brunner and Schimmack use the term power to refer to the unconditional probability of obtaining a significant result and not the common meaning of power as being conditional on the null-hypothesis being false. As a result, Zcurve does not distinguish between false positives with a 5% probability of producing a significant result (when alpha = .05) and true positives with an average probability between 5% and 100% of producing a significant result.

Average unconditional power is simply the percentage of false positives times alpha plus the percentage of true positives times their average conditional power (Sterling et al., 1995).

Unconditional Power = False Positives * Alpha + True Positives * Mean(1 – Beta)

Zcurve therefore avoids the thorny issue of defining false positives and trying to distinguish between false positives and true positives with very small effect sizes and low power.
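A worked example of the formula above in R, with illustrative numbers: 30% false positives tested with alpha = .05 and 70% true positives with a mean conditional power of 60%.

prop.false <- .30
prop.true  <- .70
alpha      <- .05
mean.power <- .60
prop.false * alpha + prop.true * mean.power   # unconditional power = .435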

Approach 

BMM and zcurve use p-values as input.  That is, they ignore the actual sampling distribution that was used to test statistical significance.  The only information that is used is the strength of evidence against the null-hypothesis; that is, how small the p-value actually is.

The problem with p-values is that they have a specified sampling distribution only when the null-hypothesis is true. When the null-hypothesis is true, p-values have a uniform sampling distribution.  However, this is not useful for a mixture model, because a mixture model assumes that the null-hypothesis is sometimes false and the sampling distribution for true positives is not defined.

Zcurve solves this problem by using the inverse normal distribution to convert all p-values into absolute z-scores (abs(z) = -qnorm(p/2)). Absolute z-scores are used because F-tests and two-sided t-tests do not have a sign, and a test score of 0 corresponds to a probability of 1. Thus, the results do not say anything about the direction of an effect, while the size of the p-value provides information about the strength of evidence.

BMM also transforms p-values. The only difference is that BMM uses the full normal distribution with positive and negative z-scores (z = qnorm(p)). That is, a p-value of .5 corresponds to a z-score of zero; p-values greater than .5 yield positive z-scores, and p-values less than .5 yield negative z-scores. However, because only significant p-values are selected, all z-scores are negative, in the range from -1.65 (p = .05, one-tailed) to negative infinity (p = 0).
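A side-by-side sketch of the two transformations in R for a few significant two-sided p-values (illustrative values, not data from either paper):

p <- c(.001, .01, .049)
-qnorm(p / 2)   # z-curve: absolute z-scores, roughly 3.29, 2.58, 1.97
qnorm(p)        # BMM as described above: full normal, significant p-values map to negative z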

The non-centrality parameter (i.e., the true parameter that generates the sampling distribution) is simply the mean of the normal distribution. For the null-hypothesis and false positives, the mean is zero.

Zcurve and BMM differ in the modeling of studies with true positive results that are heterogeneous.  Zcurve uses several normal distributions with a standard deviation of 1 that reflects sampling error for z-tests.  Heterogeneity in power is modeled by varying means of normal distributions, where power increases with increasing means.

BMM uses a single normal distribution with varying standard deviation.  A wider distribution is needed to predict large observed z-scores.

The main difference between Zcurve and BMM is that Zcurve either does not have fixed means (Brunner & Schimmack, 2016) or has fixed means but does not interpret the weight assigned to a mean of zero as an estimate of the percentage of false positives (Schimmack & Brunner, 2018). The reason is that the weights attached to individual components are not very reliable estimates of the weights in the data-generating model. Importantly, this does not matter for the goal of z-curve, which is to estimate average power, because the weighted average of the model components is a good estimate of the average true power in the data-generating model even if the weights do not match the weights of the data-generating model.

For example, Zcurve does not care whether 50% average power is produced by a mixture of 50% false positives and 50% true positives with 95% power, or by 50% of studies with 20% power and 50% of studies with 80% power. If all of these studies were exactly replicated, they would be expected to produce 50% significant results.
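A small R sketch of this point: two very different mixtures with the same mean power of 50% produce the same expected rate of significant results in exact replications.

set.seed(42)
k <- 100000
mix.a <- sample(c(.05, .95), k, replace = TRUE)   # 50% false positives, 50% true positives with 95% power
mix.b <- sample(c(.20, .80), k, replace = TRUE)   # all true positives, with 20% or 80% power
mean(rbinom(k, 1, mix.a))   # ~ .50 significant results
mean(rbinom(k, 1, mix.b))   # ~ .50 significant results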

BMM uses the weights assigned to the standard normal with a mean of zero as an estimate of the percentage of false positive results.  It does not estimate the average power of true positives or average unconditional power.

Given my simulation studies with z-curve, I was surprised that BMM appeared to solve a problem I had encountered: the weights of individual components cannot be reliably estimated because the same distribution of p-values can be produced by many mixture models with different weights. The next section examines how BMM tries to estimate the percentage of false positives from the distribution of p-values.

A Bayesian Approach

Another difference between BMM and Zcurve is that BMM uses prior distributions, whereas Zcurve does not.  Whereas Zcurve makes no assumptions about the percentage of false positives, BMM uses a uniform distribution with values from 0 to 1 (100%) as a prior.  That is, it is equally likely that the percentage of false positives is 0%, 100%, or any value in between.  A uniform prior is typically justified as being agnostic; that is, no subjective assumptions bias the final estimate.

For the mean of the true positives, the authors use a truncated normal prior, which they also describe as a folded standard normal.  They justify this prior as reasonable based on extensive simulation studies.

Most important, however, is the parameter for the standard deviation.  The prior for this parameter was a uniform distribution with values between 0 and 1.   The authors argue that larger values would produce too many p-values close to 1.

“implausible prediction that p values near 1 are more common under H1 than under H0” (p 1226). 

But why would this be implausible? If there are very few false positives and many true positives with low power, most p-values close to 1 would be the result of true positives (H1) rather than false positives (H0).

Thus, one way BMM is able to estimate the false discovery rate is by setting the standard deviation in a way that there is a limit to the number of low z-scores that are predicted by true positives (H1).

Although understanding priors and how they influence results is crucial for meaningful use of Bayesian statistics, the choice of priors is not crucial for Bayesian estimation models with many observations because the influence of the priors diminishes as the number of observations increases.  Thus, the ability of BMM to estimate the percentage of false positives in large samples cannot be explained by the use of priors. It is therefore still not clear how BMM can distinguish between false positives and true positives with low power.

Simulation Studies

The authors report several simulation studies that suggest BMM estimates are close and robust across many scenarios.

“The online supplemental material presents a set of simulation studies that highlight that the model is able to accurately estimate the quantities of interest under a relatively broad range of circumstances” (p. 1226).

The first set of simulations uses a sample size of N = 500 (n = 250 per condition).  Heterogeneity in effect sizes is simulated with a truncated normal distribution with a standard deviation of .10 (truncated at 2*SD) and effect sizes of d = .45, .30, and .15.  The lowest values are .35, .20, and .05.  With N = 500, these values correspond to  97%, 61%, and 8% power respectively.

# Power for the lowest effect sizes (d = .35, .20, .05) in a two-sample design with N = 500:
d = c(.35,.20,.05); 1-pt(qt(.975,500-2),500-2,d*sqrt(500)/2)

The number of studies was k = 5,000 with half of the studies being false positives (H0) and half being true positives (H1).

Figure 1 shows the Zcurve plot for the simulation with high power (d = .45, power >  97%; median true power = 99.9%).

[Figure 1 (Sim1.png): z-curve plot for the simulation with d = .45]

The graph shows a bimodal distribution with clear evidence of truncation: the steep drop at z = 1.96 (p = .05, two-tailed) is inconsistent with the distribution of significant z-scores. The sharp drop from z = 1.96 to 3 shows that many studies with non-significant results are missing. The estimate of unconditional power (called replicability, the expected success rate in exact replication studies) is 53%. This estimate is consistent with a simulation in which 50% of studies have a success probability of 5% and 50% have a success probability of 99.9% (.5 * .05 + .5 * .999 = .525).

The values below the x-axis show average power for specific z-scores. A z-score of 2 corresponds roughly to p = .05 and 50% observed power without selection for significance. Due to selection for significance, the average power at z = 2 is only 9%. Thus, the observed power of 50% is a highly inflated estimate of replicability. A z-score of 3.5 is needed before the actual type-I error risk falls below .05, although the nominal p-value for z = 3.5 is p = .0002. Thus, selection for significance renders nominal p-values meaningless.

The sharp change in power from Z = 3 to Z = 3.5 is due to the extreme bimodal distribution.  While most Z-scores below 3 are from the sampling distribution of H0 (false positives), most Z-scores of 3.5 or higher come from H1 (true positives with high power).

Figure 2 shows the results for the simulation with d = .30. The results are very similar because d = .30 still gives 92% power. As a result, replicability is nearly as high as in the previous example.

[Figure 2 (Sim2.png): z-curve plot for the simulation with d = .30]

 

The most interesting scenario is the simulation with low powered true positives. Figure 3 shows the Zcurve for this scenario with an unconditional average power of only 23%.

[Figure 3 (Sim3.png): z-curve plot for the simulation with low-powered true positives]

It is no longer possible to recognize two sampling distributions, and average power increases rather gradually, from 18% for z = 2 to 35% for z = 3.5. Even with this challenging scenario, BMM performed well and correctly estimated the percentage of false positives. This is surprising because it is easy to generate a similar z-curve without any false positives.

Figure 4 shows a simulation with a mixture distribution in which the false positives (d = 0) have been replaced by true positives with d = .06, while the mean for the heterogeneous studies was reduced from d = .15 to d = .11. These values were chosen to produce the same average unconditional power (replicability) of 23%.

[Figure 4 (Sim4.png): z-curve plot for the mixture without false positives (d = .06 and d = .11)]
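Here is a hedged R sketch of how this kind of scenario can be simulated (two-sample design with n = 250 per cell, as in the article's simulations). The heterogeneity around d = .11 is omitted for brevity, so this only approximates the data behind Figure 4.

set.seed(7)
k  <- 5000
n  <- 250                                    # per condition, N = 500
d  <- c(rep(.06, k / 2), rep(.11, k / 2))    # no d = 0 (false positive) studies
se <- sqrt(2 / n)                            # approximate standard error of d
z.obs <- abs(rnorm(k, mean = d / se, sd = 1))
sig   <- z.obs > 1.96
mean(sig)                                    # discovery rate before selection
hist(z.obs[sig], breaks = 50, main = "Simulated z-curve (significant results only)", xlab = "absolute z-score")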

I transformed the z-scores into (two-sided) p-values and submitted them to the online BMM app at https://qfgronau.shinyapps.io/bmmsp/ .  I used only k = 1,500 p-values because the server timed me out several times with k = 5,000 p-values.  The estimated percentage of false positives was 24%, with a wide 95% credibility interval ranging from 0% to 48%.   These results suggest that BMM has problems distinguishing between false positives and true positives with low power.   BMM appears to be able to estimate the percentage of false positives correctly when most low z-scores are sampled from H0 (false positives). However, when these z-scores are due to studies with low power, BMM cannot distinguish between false positives and true positives with low power. As a result, the credibility interval is wide and the point estimates are misleading.

[Figure (BMM.output.png): BMM output for the simulated data with k = 1,500 p-values]

With k = 1,500, the influence of the priors is negligible. However, with smaller sample sizes, the priors do have an influence on the results and may lead to overestimation and misleading credibility intervals. A simulation with k = 200 produced a point estimate of 34% false positives with a very wide CI ranging from 0% to 63%. The authors suggest a sensitivity analysis that varies the model parameters. The most crucial parameter is the standard deviation: increasing it to 2 increases the upper limit of the 95% CI to 75%. Thus, without a good justification for a specific standard deviation, the data provide very little information about the percentage of false positives underlying this z-curve.

[Figure (BMM.k200.png): BMM output for k = 200 p-values]

 

For simulations with k = 100, the prior started to bias the results and the CI no longer included the true value of 0% false positives.

[Figure (BMM.k100): BMM output for k = 100 p-values]

In conclusion, these simulation results show that BMM promises more than it can deliver.  It is very difficult to distinguish p-values sampled from H0 (mean z = 0) and those sampled from H1 with weak evidence (e.g., mean z = 0.1).

In the Challenges and Limitations section, the authors pretty much agree with this assessment of BMM (Gronau et al., 2017, p. 1230).

The procedure does come with three important caveats.

First, estimating the parameters of the mixture model is an inherently difficult statistical problem. ..  and consequently a relatively large number of p values are required for the mixture model to provide informative results. 

A second caveat is that, even when a reasonable number of p values are available, a change in the parameter priors might bring about a noticeably different result.

The final caveat is that our approach uses a simple parametric form to account for the distribution of p values that stem from H1. Such simplicity comes with the risk of model-misspecification.

Practical Implications

Despite the limitations of BMM, the authors applied it to several real data sets. The most interesting application selected focal hypothesis tests from social priming studies. Social priming has come under attack as a research area with sloppy research methods as well as fraud (Stapel). Bias tests show clear evidence that published results were obtained with questionable research practices (Schimmack, 2017a, 2017b).

The authors analyzed 159 social priming p-values.  The 95%CI for the percentage of false positives ranged from 48% to 88%.  When the standard deviation was increased to 2, the 95%CI increased slightly to 56% to 91%.  However, when the standard deviation was halved, the 95%CI ranged from only 10% to 75%.  These results confirm the authors’ warning that estimates in small sets of studies (k < 200) are highly sensitive to the specification of priors.

What inferences can be drawn from these results about the social priming literature? A false positive percentage of 10% doesn’t sound so bad. A false positive percentage of 88% sounds terrible. A priori, the percentage is somewhere between 0 and 100%. After looking at the data, uncertainty about the percentage of false positives in the social priming literature remains large. Proponents will focus on the 10% estimate and critics will use the 88% estimate. The data simply do not resolve inconsistent prior assumptions about the credibility of discoveries in social priming research.

In short, BMM promises that it can estimate the percentage of false positives in a set of studies, but in practice these estimates are too imprecise and too dependent on prior assumptions to be very useful.

A Zcurve of Social Priming Studies (k = 159)

It is instructive to compare the BMM results to a Zcurve analysis of the same data.

[Figure (SocialPriming.png): z-curve plot of 159 social priming studies]

The z-curve graph shows a steep drop and very few z-scores greater than 4, which are the results that tend to have a high success rate in actual replication attempts (OSC, 2015). The average estimated replicability is only 27%. This is consistent with the more limited analysis of social priming studies cited in Kahneman’s Thinking, Fast and Slow (Schimmack, 2017a).

More important than the point estimate is that the 95%CI ranges from 15% to a maximum of 39%.  Thus, even a sample size of 159 studies is sufficient to provide conclusive evidence that these published studies have a low probability of replicating even if it were possible to reproduce the exact conditions again.

These results show that it is not very useful to distinguish between false positives with a replicability of 5% and true positives with a replicability of 6%, 10%, or 15%. Good research provides evidence that can be replicated at least with a reasonable degree of statistical power. Tversky and Kahneman (1971) suggested a minimum of 50%; most social priming studies fail to meet this minimal standard, and hardly any seem to have been planned with the typical standard of 80% power.

The power estimates below the x-axis show that a nominal z-score of 4 or higher is required to achieve 50% average power and an actual false positive risk of 5%. Thus, after correcting for deceptive publication practices, most of the seemingly statistically significant results are actually not significant with the common criterion of a 5% false positive risk.

The difference between BMM and Zcurve is captured in the distinction between evidence of absence and absence of evidence.  BMM aims to provide evidence of absence (false positives). In contrast, Zcurve has the more modest goal of demonstrating absence (or presence) of evidence.  It is unknown whether any social priming studies could produce robust and replicable effects and under what conditions these effects occur or do not occur.  However, it is not possible to conclude from the poorly designed studies and the selectively reported results that social priming effects are zero.

Conclusion

Zcurve and BMM are both mixture models, but they have different statistical approaches and different aims. They also differ in their ability to provide useful estimates. Zcurve is designed to estimate average unconditional power to obtain significant results without distinguishing between true positives and false positives. False positives reduce average power, just like low-powered studies do, and in practice it can be difficult or impossible to distinguish between a false positive with an effect size of zero and a true positive with an effect size that is negligibly different from zero.

The main problem of BMM is that it treats the nil-hypothesis as an important hypothesis that can be accepted or rejected. However, this is a logical fallacy. It is possible to reject an implausible effect size (e.g., the nil-hypothesis is probably false if the 95% CI ranges from 0.8 to 1.2), but it is not possible to accept the nil-hypothesis because there are always values close to 0 that are also consistent with the data.

The problem of BMM is that it contrasts the point-nil-hypothesis with all other values, even if these values are very close to zero.  The same problem plagues the use of Bayes-Factors that compare the point-nil-hypothesis with all other values (Rouder et al., 2009).  A Bayes-Factor in favor of the point nil-hypothesis is often interpreted as if all the other effect sizes are inconsistent with the data.  However, this is a logical fallacy because data that are inconsistent with a specific H1 can be consistent with an alternative H1.  Thus, a BF in favor of H0 can only be interpreted as evidence against a specific H1, but never as evidence that the nil-hypothesis is true.

To conclude, I have argued that it is more important to estimate the replicability of published results than to estimate the percentage of false positives.  A literature with 100% true positives and average power of 10% is no more desirable than a literature with 50% false positives and 50% true positives with 20% power.  Ideally, researchers should conduct studies with 80% power and honest reporting of statistics and failed replications should control the false discovery rate.  The Zcurve for social priming studies shows that priming researchers did not follow these basic and old principles of good science.  As a result, decades of research are worthless and Kahneman was right to compare social priming research to a train wreck because the conductors ignored all warning signs.

 

 

 

Visual Inspection of Strength of Evidence: P-Curve vs. Z-Curve

Statistics courses often introduce students to a bewildering range of statistical tests. They rarely point out how these test statistics are related. For example, although t-tests may be easier to understand than F-tests, every t-test could be performed as an F-test, and the F-value in the F-test is simply the square of the t-value (t^2 or t*t).
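A quick check of this relationship in R: for two groups, the t-test and the one-way ANOVA F-test give t^2 = F and identical p-values (data simulated for illustration).

set.seed(1)
x <- rnorm(20)
y <- rnorm(20, mean = .5)
group <- rep(c("a", "b"), each = 20)
t.res <- t.test(x, y, var.equal = TRUE)
f.res <- anova(lm(c(x, y) ~ group))
t.res$statistic^2                     # equals the F-value below
f.res$`F value`[1]
c(t.res$p.value, f.res$`Pr(>F)`[1])   # identical p-values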

At an even more conceptual level, all test statistics are ratios of an effect size (ES) and the amount of sampling error (SE). This ratio is sometimes called the signal (ES) to noise (SE) ratio. The higher the signal-to-noise ratio (ES/SE), the more strongly the observed results deviate from the hypothesis that the effect size is zero. This hypothesis is often called the null-hypothesis, but this terminology has created some confusion. It is also sometimes called the nil-hypothesis, the zero-effect hypothesis, or the no-effect hypothesis. Most important, if this hypothesis is true, the test statistic is expected to average zero if the same experiment could be replicated a gazillion times.

The test statistics of different statistical tests cannot be directly compared. A t-value of 2 in a study with N = 10 participants provides weaker evidence against the null-hypothesis than a z-score of 1.96, and an F-value of 4 with df(1,40) provides weaker evidence than an F(10,200) = 4 result. It is only possible to compare test values directly when they have the same sampling distribution (z with z, F(1,40) with F(1,40), etc.).

There are three solutions to this problem. One solution is to use effect sizes as the unit of analysis. This is useful if the aim is effect size estimation, which has become the dominant approach in meta-analysis. This blog post is not about effect size estimation; I mention it only because many readers may be familiar with effect size meta-analysis, but not with meta-analysis of test statistics that reflect the ratio of effect size and sampling error (effect size meta-analysis: unit = ES; test statistic meta-analysis: unit = ES/SE).

P-Curve

There are two approaches to standardizing test statistics so that they have a common unit of measurement. The first approach goes back to Ronald Fisher, who is considered the founder of modern statistics for researchers. Following Fisher, it is common practice to convert test statistics into p-values (this blog post assumes that you are familiar with p-values). P-values have the same meaning independent of the test statistic that was used to compute them. That is, p = .05 based on a z-test, t-test, or F-test provides equally strong evidence against the null-hypothesis (Bayesians disagree, but that is a different story). The use of p-values as a common metric to examine strength of evidence (evidential value) was largely forgotten until Simonsohn, Simmons, and Nelson (SSN) used p-values to develop a statistical tool that takes publication bias and questionable research practices into account. This statistical approach is called p-curve. P-curve is a family of statistical methods; this post is about the p-curve plot.

A p-curve plot is essentially a histogram of p-values with two characteristics. First, it only shows significant p-values (p < .05, two-tailed). Second, it plots the p-values between 0 and .05 in 5 bars. The Figure shows a p-curve for Motyl et al.’s (2017) focal hypothesis tests in social psychology. I only selected t-tests and F-tests from studies with between-subject manipulations.

[Figure (p.curve.motyl): p-curve plot of Motyl et al.’s (2017) focal hypothesis tests]
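A minimal R sketch of how such a plot is constructed: a histogram of the significant two-sided p-values in five bins of width .01. The p-values below are made up for illustration; they are not Motyl et al.'s data.

p <- c(.001, .002, .003, .004, .008, .009, .012, .015, .021, .024, .031, .038, .044, .049)
p.sig <- p[p < .05]   # keep only significant results
hist(p.sig, breaks = seq(0, .05, by = .01), col = "grey", main = "p-curve", xlab = "p-value")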

The main purpose of a p-curve plot is to examine whether the distribution of p-values is uniform (all bars have the same height). It is evident that the distribution for Motyl et al.’s data is not uniform. Most of the p-values fall into the lowest range between 0 and .01. This pattern is called “right-skewed.” A right-skewed plot shows that the set of studies has evidential value; that is, some test statistics are based on non-zero effect sizes. The taller the bar on the left, the greater the proportion of studies with an effect. Importantly, meta-analyses of p-values do not provide information about effect sizes because p-values take effect size and sampling error into account.

The main inference that can be drawn from a visual inspection of a p-curve plot is how unlikely it is that all significant results are false positives; that is, the p-value is below .05 (statistically significant), but this strong deviation from 0 was entirely due to sampling error, while the true effect size is 0.

The next Figure also shows a plot of p-values.  The difference is that it shows the full range of p-values and that it differentiates more between p-values because p = .09 provides weaker evidence than p = .0009.

[Figure (all.p.curve.motyl.png): histogram of the full range of p-values for Motyl et al.’s focal tests]

The histogram shows that most p-values are below .001. It also shows very few non-significant results. However, this plot is not more informative than the actual p-curve plot. The only conclusion that is readily visible is that the distribution is not uniform.

The main problem with p-value plots is that p-values do not have interval-scale properties. That is, the difference between p = .4 and p = .3 does not reflect the same difference in strength of evidence as the difference between p = .10 and a p-value close to zero (e.g., p = .001).

Z-Curve  

Stouffer developed an alternative to Fisher’s method of p-value meta-analysis. Every p-value can be transformed into the z-score that corresponds to that p-value. It is important to distinguish between one-sided and two-sided p-values. The transformation requires one-sided p-values, which can be obtained by simply dividing a two-sided p-value by 2. A z-score of -1.96 corresponds to a one-sided p-value of 0.025 in the lower tail, and a z-score of 1.96 corresponds to a one-sided p-value of 0.025 in the upper tail. In a two-sided test, the sign no longer matters and the two tail probabilities are added to yield 0.025 + 0.025 = 0.05.

In a standard meta-analysis, we would want to use one-sided p-values to maintain information about the sign. However, if the set of studies examines different hypotheses (as in Motyl et al.’s analysis of social psychology in general), the sign is no longer important. So, the transformed two-sided p-values produce absolute (only positive) z-scores.

The formula in R is Z = -qnorm(p/2)   [p = two.sided p-value]

For very strong evidence this formula creates numerical problems, which can be solved by using the log.p=TRUE option in R.

Z = -qnorm(log(p/2), log.p=TRUE)

[Figure (p.to.z.transformation.png): relationship between p-values and z-scores]

The plot shows the relationship between z-scores and p-values.  While z-scores are relatively insensitive to variation in p-values from .05 to 1, p-values are relatively insensitive to variation in z-scores from 2 to 15.

[Figure (only.sig.p.to.z.transformation): relationship between significant p-values and z-scores]

The next figure shows the relationship only for significant p-values.  Limiting the distribution of p-values does not change the fact that p-values and z-values have very different distributions and a non-linear relationship.

The advantage of using (absolute) z-scores is that z-scores have ratio-scale properties. A z-score of zero has real meaning and corresponds to the absence of evidence for an effect; the observed effect size is 0. A z-score of 2 is twice as strong as a z-score of 1. For example, given the same sampling error, the effect size for a z-score of 2 is twice as large as the effect size for a z-score of 1 (e.g., d = .2, se = .2, z = d/se = 1; d = .4, se = .2, z = d/se = 2).

It is possible to create the typical p-curve plot with z-scores by selecting only z-scores above z = 1.96. However, this graph is not informative because the null-hypothesis does not predict a uniform distribution of z-scores. For z-scores, the central tendency of the distribution is more important. When the null-hypothesis is true, p-values have a uniform distribution, and we would expect an equal number of p-values between 0 and 0.025 and between 0.025 and 0.050. A two-sided p-value of .025 corresponds to a one-sided p-value of 0.0125, and the corresponding z-value is 2.24.

p = .025
-qnorm(log(p/2),log.p=TRUE)
[1] 2.241403

Thus, the analog to a p-value plot is to examine how many significant z-scores fall into the region from 1.96 to 2.24 versus the region with z-values greater than 2.24.

[Figure (z.curve.plot1.png): z-curve, i.e., histogram of absolute z-scores, for Motyl et al.’s focal tests]

The histogram of z-values is called z-curve. The plot shows that most z-values are in the range between 1 and 6, but the histogram stretches out to 20 because a few studies had very high z-values. The red line shows z = 1.96. All values to the left are not significant with alpha = .05 and all values to the right are significant (p < .05). The dotted blue line corresponds to p = .025 (two-tailed). Clearly there are more z-scores above 2.24 than between 1.96 and 2.24. Thus, a z-curve plot provides the same information as a p-curve plot: the distribution of z-scores suggests that some significant results reflect true effects.
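A small R sketch of this comparison, using made-up z-scores rather than Motyl et al.'s data. Under the null-hypothesis the two counts should be about equal; evidential value shows up as an excess of z-scores above 2.24.

set.seed(3)
z <- abs(rnorm(500, mean = 1.5, sd = 1.2))   # illustrative absolute z-scores
sum(z > 1.96 & z <= 2.24)   # just significant: .025 < p < .05 (two-tailed)
sum(z > 2.24)               # stronger evidence: p < .025 (two-tailed)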

However, a z-curve plot provides a lot of additional information.  The next plot removes the long tail of rare results with extreme evidence and limits the plot to z-scores in the range between 0 and 6.  A z-score of six implies a signal-to-noise ratio of 6:1 and corresponds to a two-sided p-value of about 0.000000002, or roughly 1 out of 500 million events. Even particle physicists settle for z = 5 to decide that an effect was observed, because it is so unlikely for such a test result to occur by chance.

> pnorm(-6)*2
[1] 1.973175e-09

Another addition to the plot is to include a line that identifies z-scores between 1.65 and 1.96.  These z-scores correspond to two-sided p-values between .05 and .10. These values are often published as weak but sufficient evidence to support the inference that a (predicted) effect was detected. These z-scores also correspond to p-values below .05 in one-sided tests.
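These cut-offs are easy to verify:

pnorm(-1.96)*2    # .05 two-tailed
pnorm(-1.65)*2    # about .10 two-tailed; the one-sided p-value is just below .05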

[Figure: z.curve.plot2]

A major advantage of z-scores over p-values is that p-values are conditional probabilities based on the assumption that the null-hypothesis is true, but this hypothesis can be safely rejected with these data.  So, the actual p-values are not important because they are conditional on a hypothesis that we know to be false.   It is like saying, I would be a giant if everybody else were 1 foot tall (like Gulliver in Lilliput), but everybody else is not 1 foot tall and I am not a giant.

Z-scores are not conditioned on any hypothesis. They simply show the ratio of the observed effect size and sampling error.  Moreover, the distribution of z-scores tells us something about the ratio of the true effect sizes and sampling error.  The reason is that sampling error is random and has a mean of zero.  Therefore, the mode, median, or mean of a z-curve plot tells us something about the ratio of the true effect sizes and sampling error.  The more the center of the distribution is shifted to the right, the stronger is the evidence against the null-hypothesis.  In a p-curve plot, this is reflected in the height of the bar with p-values below .01 (z > 2.58), but a z-curve plot shows the actual distribution of the strength of evidence and makes it possible to see where the center of the distribution is (without more rigorous statistical analyses of the data).

For example, in the plot above it is not difficult to see the mode (peak) of the distribution.  The most common z-values are between 2 and 2.2, which correspond to p-values of .046 (pnorm(-2)*2) and .028 (pnorm(-2.2)*2).   This suggests that the modal study has a ratio of about 2:1 of effect size over sampling error.

The distribution of z-values does not look like a normal distribution. One explanation for this is that studies vary in sampling errors and population effect sizes.  Another explanation is that the set of studies is not a representative sample of all studies that were conducted.   It is possible to test these explanations by trying to fit a simple model to the data that assumes representative sampling of studies (no selection bias or p-hacking) and that assumes that all studies have the same ratio of population effect size over sampling error.   The median z-score provides an estimate of the center of the sampling distribution.  The median for these data is z = 2.56.   The next picture shows the predicted sampling distribution of this model, which is an approximately normal distribution with a folded tail.

 

[Figure: z.curve.plot3]
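A minimal sketch of this simple model in R (using the median of 2.56 reported above as the center of the folded normal):

center.z = 2.56                                      # median of the observed absolute z-values
z = seq(0, 6, .01)
y = dnorm(z, center.z, 1) + dnorm(z, -center.z, 1)   # folded normal density
plot(z, y, type="l", lwd=4, xlab="(absolute) z-values", ylab="Density")   # predicted distribution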

A comparison of the observed and predicted distribution of z-values shows some discrepancies. Most important is that there are too few non-significant results.  This observation provides evidence that the results are not a representative sample of studies.  Either non-significant results were not reported or questionable research practices were used to produce significant results by increasing the type-I error rate without reporting this (e.g., multiple testing of several DVs, or repeated checking for significance during the course of a study).

It is important to see the difference between the philosophies of p-curve and z-curve.  p-curve assumes that non-significant results provide no credible evidence and discards them if they are reported.  Z-curve first checks whether non-significant results are missing.  As a result, p-curve is not a suitable tool for assessing publication bias or other problems, whereas even a simple visual inspection of a z-curve plot provides information about publication bias and questionable research practices.

[Figure: z.curve.plot4]

The next graph shows a model that selects for significance.  It no longer attempts to match the distribution of non-significant results.  The objective is only to match the distribution of significant z-values.  You can do this by hand and simply try out different values for the center of the normal distribution.  The lower the center, the more z-scores are missing because they are not significant.  As a result, the density of the predicted curve needs to be adjusted to reflect the fact that some of the area is missing.

center.z = 1.8     # pick a value for the center of the folded normal
z = seq(0,6,.001)  # create the range of z-values
y = dnorm(z,center.z,1) + dnorm(z,-center.z,1)  # get the density for a folded normal
y2 = y             # duplicate densities
y2[z < 1.96] = 0   # simulate selection bias: density for non-significant results is zero
scale = sum(y2)/sum(y)  # scaling factor so that the area under the curve of only significant results is 1
y = y / scale      # adjust the densities accordingly

# draw a histogram of z-values
# input is z.val.input (a vector of absolute z-values)
# example: z.val.input = abs(rnorm(1000,2))
hist(z.val.input, freq=FALSE, xlim=c(0,6), ylim=c(0,1), breaks=seq(0,20,.2),
     xlab="", ylab="Density", main="Z-Curve")

abline(v=1.96, col="red")          # draw the line for alpha = .05 (two-tailed)
abline(v=1.65, col="red", lty=2)   # draw the line for marginal significance (alpha = .10, two-tailed)

par(new=TRUE)  # command to superimpose the next plot on the histogram

# draw the predicted sampling distribution
plot(z, y, type="l", lwd=4, ylim=c(0,1), xlim=c(0,6), xlab="(absolute) z-values", ylab="")

Although this model fits the data better than the previous model without selection bias, it still has problems fitting the data.  The reason is that there is substantial heterogeneity in the true strength of evidence.  In other words, the variability in z-scores reflects not just sampling error but also variability in sampling error (some studies have larger samples than others) and in population effect sizes (some studies examine weak effects and others examine strong effects).

Jerry Brunner and I developed a mixture model to fit a predicted distribution to the observed distribution of z-values.  In a nutshell, the mixture model combines multiple (folded) normal distributions.  Jerry's z-curve lets the centers of the normal distributions move around and gives them different weights.  Uli's z-curve uses fixed centers one standard deviation apart (0, 1, 2, 3, 4, 5 & 6) and uses different weights to fit the model to the data.  Simulation studies show that both methods work well.  Jerry's method works a bit better if there is little variability and Uli's method works a bit better with large variability.

The next figure shows the result for Uli’s method because the data have large variability.

[Figure: z.curve.plot5]

The dark blue line in the figure shows the density distribution for the observed data.  A kernel density estimate assigns densities to an observed distribution without assuming that it follows a mathematical sampling distribution like the standard normal distribution.   We use the kernel density estimation method implemented in base R (the density function).
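A minimal sketch of this step, assuming the observed absolute z-values are stored in a vector z.val.input (the hypothetical input name used throughout this post):

dens = density(z.val.input, from=0, to=6)   # kernel density estimate of the observed absolute z-values
plot(dens$x, dens$y, type="l", lwd=4, col="darkblue", xlab="(absolute) z-values", ylab="Density")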

The grey line shows the predicted density distribution based on Uli's z-curve method.  The z-curve plot makes it easy to see the fit of the model to the data, which is typically very good.  The result of the model is the weighted average of the true power values that correspond to the centers of the simulated normal distributions.  For this distribution, the weighted average is 48%.

The 48% estimate can be interpreted in two ways.  First, it means that if researchers randomly sampled from the set of studies in social psychology and were able to exactly reproduce the original study (including sample size), they would have a probability of 48% to replicate a significant result with alpha = .05.  The complementary interpretation is that if researchers were successful in replicating all studies exactly, the reproducibility project would be expected to produce 48% significant results and 52% non-significant results.  Because the average power of studies predicts the success of exact replication studies, Jerry and I refer to the average power of studies that were selected for significance as replicability.  Simulation studies show that our z-curve methods have good large-sample accuracy (+/- 2%), and we adjust for the small estimation bias by computing a conservative confidence interval that extends the limits by 2 percentage points.
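To make the logic concrete, here is a heavily simplified sketch of this kind of mixture fit; it is not the actual z-curve code (which can be downloaded below), and the input z-values are simulated for illustration.  Components are folded normal distributions with fixed centers 0 to 6, restricted to the significant region; the weights are chosen so that the weighted mixture matches the kernel density of the observed significant z-values, and replicability is the weighted average of the component power values.

set.seed(1)
z.obs = abs(rnorm(3000, 1.5, 1.5))              # simulated heterogeneous absolute z-values
crit  = qnorm(.975)                             # significance criterion, z = 1.96
z.sig = z.obs[z.obs > crit & z.obs < 6]         # keep only significant values up to z = 6

dens = density(z.sig, from=crit, to=6, n=512)   # kernel density of the significant z-values

centers = 0:6                                   # fixed centers, one standard deviation apart
# density of |z| for a component with center m, renormalized to the significant region
comp.dens = function(m, x) {
  (dnorm(x, m, 1) + dnorm(x, -m, 1)) /
  (pnorm(crit, m, 1, lower.tail=FALSE) + pnorm(-crit, m, 1))
}
X = sapply(centers, comp.dens, x=dens$x)        # one column of densities per component

# find weights (non-negative, summing to 1) that best match the observed density
loss = function(par) {
  w = exp(par) / sum(exp(par))
  sum((as.vector(X %*% w) - dens$y)^2)
}
par.hat = optim(rep(0, 7), loss)$par
w = exp(par.hat) / sum(exp(par.hat))

# replicability estimate = weighted average of the power of the components
power = pnorm(centers - crit) + pnorm(-centers - crit)
sum(w * power)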

Below is the R-Code to obtain estimates of replicability from a set of z-values using Uli’s method.

<<<Download Zcurve R.Code>>>

Download the R code to your computer, then run it from anywhere with the following commands:

location = "<user folder>/"   # folder where the z-curve code is stored (note the trailing slash)
source(paste0(location, "fun.uli.zcurve.sharing.18.1.R"))   # read the code
run.zcurve(z.val.input)   # get z-curve estimates with a vector of absolute z-values as input

Closed and Biased Peer-Reviews Do not Advance Psychological Science

Update: March 14, 2021

Jerry Brunner and I developed a method to estimate the average power of studies while taking selection for significance into account. We validated our method with simulation studies. We also show that other methods that are already in use for effect size estimation, like p-curve, produce biased (inflated) estimates. You might think that an article that relies on validated simulations to improve on an existing method (z-curve is better than p-curve) would be published, especially by a journal that was created to improve and advance psychological science. However, this blog post shows that AMPPS works like any other traditional, for-profit, behind-a-paywall journal. Normally, you would not be able to see the editorial decision letter or know that an author of the inferior p-curve method provided a biased review. But here you can see how traditional publishing works (or doesn't work).

Meanwhile, the article has been published in the journal Meta-Psychology, a journal with no fees for authors, open access to articles, and transparent peer reviews.

The article:  https://open.lnu.se/index.php/metapsychology/article/view/874
The peer-review report can be found on OSF: https://osf.io/peumw/

Also, three years after the p-curve authors were alerted to the fact that their method can provide biased estimates, they have not modified their app or posted a statement that alerts readers to this problem.  This is how even meta-scientists operate.

REJECTION LETTER:

Dear Dr. Schimmack:

Thank you for submitting your manuscript (AMPPS-17-0114) entitled “Z-Curve: A Method for the Estimating Replicability Based on Test Statistics in Original Studies” to Advances in Methods and Practices in Psychological Science (AMPPS). First, my apologies for the overly long review process. I initially struggled to find reviewers for the paper and I also had to wait for the final review. In the end, I received guidance from three expert reviewers whose comments appear at the end of this message.

Reviewers 1 and 2 chose to remain anonymous and Reviewer 3 is Leif Nelson (signed review). Reviewers 1 and 2 were both strongly negative and recommended rejection. Nelson was more positive about the goals of the paper and approach, although he wasn’t entirely convinced by the approach and evidence. I read the paper independently of the reviews, both before sending it out and again before reading the reviews (given that it had been a while). My take was largely consistent with that of the reviewers.

Although the issue of estimating replicability from published results is an important one, I was less convinced about the method and felt that the paper does not do enough to define the approach precisely, and it did not adequately demonstrate its benefits and limits relative to other meta-analytic bias correction techniques. Based on the comments of the reviewers and my independent evaluation, I found these issues to be substantial enough that I have decided to decline the manuscript.

The reviews are extensive and thoughtful, and I won’t rehash all of the details in my letter. I would like to highlight what I see as the key issues, but many of the other comments are important and substantive. I hope you will find the comments useful as you continue to develop this approach (which I do think is a worthwhile enterprise).

All three reviews raised concerns about the clarity of the paper and the figures as well as the lack of grounding for a number of strong claims and conclusions (they each quote examples). They also note the lack of specificity for some of the simulations and question the datasets used for the analyses.

I agreed that the use of some of the existing data sets (e.g., the scraped data, the Cuddy data, perhaps the Motyl data) are not ideal ways to demonstrate the usefulness of this tool. Simulations in which you know and can specify the ground truth seem more helpful in demonstrating the advantages and constraints of this approach.

Reviewers 1 and 2 both questioned the goal of estimating average power. Reviewer 2 presents the strongest case against doing so. Namely, average power is a weird quantity to estimate in light of a) decades of research on meta-analytic approaches to estimating the average effect size in the face of selection, and b) the fact that average power is a transformation of effect size. To demonstrate that Z-curve is a valid measure and an improvement over existing approaches, it seems critical to test it against other established meta-analytic models.

p-curve is relatively new, and as reviewer 2 notes, it has not been firmly established as superior to other more formal meta-analytic approaches (it might well be better in some contexts and worse in others). When presenting a new method like Z-curve, it is important to establish it against well-grounded methods or at least to demonstrate how accurate, precise, and biased it is under a range of realistic scenarios. In the context of this broader literature on bias correction, the comparison only to p-curve seems narrow, and a stronger case would involve comparing the ability of Z-curve to recover average effect size against other models of bias correction (or power if you want to adapt them to do that).

[FYI:  p-curve is the only other method that aims to estimate average power of studies selected for significance.  Other meta-analytic tools aim to estimate effect sizes, which are related to power but not identical. ]

Nelson notes that other analyses show p-curve to be robust to heterogeneity and argues that you need to more clearly specify why and when Z curve does better or worse. I would take that as a constructive suggestion that is worth pursuing. (e.g., he and the other reviewers are right that you need to provide more specificity about the nature of the heterogeneity you're modeling.)

I thought Nelson’s suggestions for ways to explain the discrepant results of these two approaches were constructive, and they might help to explain when each approach does better, which would be a useful contribution. Just to be clear, I know that the datacolada post that Nelson cites was posted after your paper was submitted and I’m not factoring your paper’s failure to anticipate it into my decision (after all, Bem was wrong).

[That blog post was posted after I shared our manuscript with Uri and tried to get him to comment on z-curve.  In a long email exchange he came up with scenarios in which p-curve did better, but never challenged the results of my simulations showing that it performs a lot worse when there is heterogeneity.  To refer to this self-serving blog post as a reason for rejection is problematic at best, especially if the simulation results in the manuscript are ignored.]

Like Reviewer 2 and Nelson, I was troubled by the lack of a data model for Z curve (presented around page 11-12). As best I can tell, it is a weighted average of 7 standard normal curves with different means. I could see that approach being useful, and it might well turn out to be optimal for some range of cases, but it seems  arbitrary and isn’t suitably justified. Why 7? Why those 7? Is there some importance to those choices?

The data model (I had never heard that term before) was specified, and it is so simple that the editorial letter even characterizes it correctly.  The observed density distribution is modeled with weighted averages of the density distributions of 7 normal distributions (SD = 1, different means), and yes, 7 is arbitrary because it has very little influence on the results.

Do they reflect some underlying principle or are they a means to an end? If the goal is to estimate only the end output from these weights, how do we know that those are the right weights to use?

Because the simulation results show that the model recovers the simulated average power correctly within 3 percentage points?

If the discrete values themselves are motivated by a model, then the fact that the weight estimates for each component are not accurate even with k=10000 seems worrisome.

No it is not worrisome because the end goal is the average, not the individual weights.

If they aren’t motivated by a more formal model, how were they selected and what aspects of the data do they capture? Similarly, doesn’t using absolute value mean that your model can’t handle sign errors for significant results? And, how are your results affected by an arbitrary ceiling at z=6?

There are no sign errors in analyses of significant results that cover different research questions. Heck, there is not even a reasonable way to speak about signs.  Is neuroticism a negative predictor of wellbeing or is emotional stability a positive predictor of wellbeing?

Finally, the paper comments near the end that this approach works well if k=100, but that doesn’t inform the reader about whether it works for k=15 or k=30 as would be common for meta-analysis in psychology.

The editor doesn’t even seem to understand that this method is not intended to be used for a classic effect size meta-analysis.  We do have statistical methods for that. Believe me I know that.  But how would we apply these methods to estimate the replicability of social psychology? And why would we use k = 30 to do so, when we can use k = 1,000?

To show that this approach is useful in practice, it would be good to show how it fares with sets of results that are more typical in scale in psychology. What are the limits of its usefulness? That could be demonstrated more fully with simulations in which the ground truth is known.

No bias correction method that relies on only significant results provides meaningful results with k = 30.  We provide 95%CI and they are huge with k = 30. 

I know you worked hard on the preparation of this manuscript, and that you will be disappointed by this outcome. I hope that you will find the reviewer comments helpful in further developing this work and that the outcome for this submission will not discourage you from submitting future manuscripts to AMPPS.

Sincerely,
Daniel J. Simons, Editor
Advances in Methods and Practices in Psychological Science (AMPPS) Psychology

Unfortunately, I don't think the editor worked hard on reading the manuscript, and he missed the main point of the contribution.  So, no, I am not planning on wasting more time sending my best work to this journal.  I had hopes that AMPPS was serious about improving psychology as a science. Now I know better.  I will also no longer review for your journal.  Good luck with your efforts to do actual replication studies. I will look elsewhere to publish my work that makes original research more credible to start with.

Sincerely,
Ulrich  Schimmack

==============================================================

Reviewer: 1

The authors of this manuscript introduce a new statistical method, the z-curve, for estimating the average replicability of empirical studies. The authors evaluate the method via simulation methods and via select empirical examples; they also compare it to an alternative approach (p-curve). The authors conclude that the z-curve approach works well, and that it may be superior to the p-curve in cases where there is substantial heterogeneity in the effect sizes of the studies being examined. They also conclude, based on applying the z-curve to specific cases, that the average power of studies in some domains (e.g., power posing research and social psychology) is low.

One of the strengths of this manuscript is that it addresses an important issue: How can we evaluate the replicability of findings reported in the literature based on properties inherent to the studies (or their findings) themselves?  In addition, the manuscript approaches the issue with a variety of demonstrations, including simulated data, studies based on power posing, scraped statistics from psychology journals, and social psychological studies.

After reading the manuscript carefully, however, I’m not sure I understand how the z-curve works or how it is supposed to solve potential problems faced by other approaches for evaluating the power of studies published in the empirical literature.

That is too bad.  We provided detailed annotated R-Code to make it possible for quantitative psychologists to understand how z-curve works.  It is unfortunate that you were not able to understand the code.  We would have been happy to answer questions.

I realize the authors have included links to more technical discussions of the z-curve on their websites, but I think a manuscript like this should be self-contained–especially when it is billed as an effort “to introduce and evaluate a new statistical method” (p. 25, line 34).

We think that extensive code is better provided in a supplement. However, this is really an editorial question and not a comment on the quality or originality of our work.

Some additional comments, questions, and suggestions:

1. One of my concerns is that the approach appears to be based on using “observed power” (or observed effect sizes) as the basis for the calculations. And, although the authors are aware of the problems with this (e.g., published effect sizes are over-estimates of true effect sizes; p. 9), they seem content with using observed effect sizes when multiple effect sizes from diverse studies are considered. I don’t understand how averaging values that are over-estimates of true values can lead to anything other than an inflated average. Perhaps this can be explained better.

Again, we are sorry that you did not understand how our method achieves this goal, but that is surely not a reason to recommend rejection.

2. Figure 2 is not clear. What is varying on the x-axis? Why is there a Microsoft-style spelling error highlighted in the graph?

The Figure was added after an exchange with Uri Simonsohn and reproduces the simulation that they did for effect sizes (see text).  The x-axis shows d-values.

3. Figure 4 shows a z-curve for power posing research. But the content of the graph isn’t explained. What does the gray, dotted line represent? What does the solid blue line represent? (Is it a smoothed density curve?) What do the hashed red vertical lines represent? In short, without guidance, this graph is impossible to understand.

Thank you for your suggestion. We will revise the manuscript to make it easier to read the figures.

4. I’m confused on how Figure 2 is relevant to the discussion (see p. 19, line 22)

Again, we are sorry about the confusion that we caused.  The Figure shows that p-curve overestimates power in some scenarios (d = .8, SD = .2), which was not apparent when Simonsohn did these simulations to estimate effect sizes.

5. Some claims are made without any explanation or rationale. For example, on page 19 the authors write “Random sampling error cannot produce this drop” when commenting on the distribution of z-scores in the power posing data. But no explanation is offered for how this conclusion is reached.

Random sampling error of z-scores has a standard deviation of one.  So, we should see a lot of values next to the mode of a distribution.  A steep drop cannot be produced by random sampling error. The same observation has been made repeatedly about a string of just-significant p-values. If you can get .04, .03, .02 again and again, why do you not get .06 or .11?

6. I assume the authors are re-analyzing the data collected by Motyl and colleagues for Demonstration 3? This isn’t stated explicitly; one has to read between the lines to reach this conclusion.

You read correctly between the lines 

7. Figure 6 contains text which states that the estimated replicability is 67%. But the narrative states that the estimated replicability using the z-curve approach is 46% (p. 24, line 8). Is the figure using a different method than the z-curve method?

This is a real problem. This was the wrong Figure.  Thank you for pointing it out. The estimate in the text is correct.

8.  p. 25. Unclear why Figure 4 is being referenced here.

Another typo. Thanks for pointing it out.

9. The authors write that “a study with 80% power is expected to produce 4 out of 5 significant results in the long run.” (p. 6). This is only true when the null hypothesis is false. I assume the authors know this, but it would be helpful to be precise when describing concepts that most psychologists don’t “really” understand.

If a study has 80% power it is assumed that the null-hypothesis is false. A study where the null-hypothesis is true has a power of alpha to produce significant results.

10. I am not sure I understand the authors’ claim that, “once we take replicability into account, the distinction between false positives and true positives with low power becomes meaningless.” (p. 7).

We are saying in the article that there is no practical difference between a study with power = alpha (5%) where the null-hypothesis is true and a study with very low power (6%) where the null-hypothesis is false.  Maybe it helps to think about effect sizes.  d = 0 means the null is true and power is 5%, and d = 0.000000000000001 means the null-hypothesis is false and power is 5.000000001%.  In terms of the probability to replicate a significant result, both studies have a very low probability of doing so.

11. With respect to the second demonstration: The authors should provide a stronger justification for examining all reported test statistics. It seems that the z-curve’s development is mostly motivated by debates concerning the on-going replication crisis. Presumably, that crisis concerns the evaluation of specific hypotheses in the literature (e.g., power posing effects of hormone levels) and not a hodge-podge of various test results that could be relevant to manipulation checks, age differences, etc. I realize it requires more work to select the tests that are actually relevant to each research article than to scrape all statistics robotically from a manuscript, but, without knowing whether the tests are “relevant” or not, it seems pointless to analyze them and draw conclusions about them.

This is a criticism of one dataset, not a criticism of the method.

12. Some of the conclusions that the authors reach, such as “our results suggest that the majority of studies in psychology fail to meet the minimum standard of a good study . . . and even more studies fail to meet the well-known and accepted norm that studies should have 80% power” have been reached by other authors too.

But what methodology did these authors use to come to this conclusion?  Did they validate their method with simulation studies?

This leads me to wonder whether the z-curve approach represents an incremental advance over other approaches. (I’m nitpicking when I say this, of course. But, ultimately, the “true power” of a collection of studies is not really a “thing.”

What does it mean that the true power of studies is not a “thing”?  Researchers conduct (many) significance tests and see whether they get a publishable significant result. The average percentage of times they get a significant result is the true average power of the population of statistical tests that are being conducted.  Of course, we can only estimate this true value, but who says that the other estimates we use every day are any better than the z-curve estimates?

It is a useful fiction, of course, but getting a more precise estimate of it might be overkill.)

Sure, let's not get too precise.  Why don't we settle for 50% +/- 50% and call it a day?

Perhaps the authors can provide a stronger justification for the need of highly precise, but non-transparent, methods for estimating power in published research?

Just because you don't understand the method doesn't mean it is not transparent. And maybe it could be useful to know that social psychologists conduct studies with 30% power and only publish results that fit their theories and became significant with the help of luck?  Maybe we have had 6 years of talk about a crisis without any data except the OSC results in 2015, which are limited to 2008 and three journals.  But maybe we just don't care, because it is 2018 and it is time to get on with business as usual. Glad you were able to review for a new journal that was intended to Advance Methods and Practices in Psychological Science.   Clearly, estimating the typical power of studies in psychology is not important for this goal in your opinion.  Again, sorry for submitting such a difficult manuscript and wasting your time.

==========================================================================

Reviewer: 2

The authors present a new methodology (“z-curve”) that purports to estimate the average of the power of a set of studies included in a meta-analysis which is subject to publication bias (i.e., statistically significant studies are over-represented among the set meta-analyzed). At present, the manuscript is not suitable for publication largely for three major reasons.

[1] Average Power: The authors propose to estimate the average power of the set of prior historical studies included in a meta-analysis. This is a strange quantity: meta-analytic research has for decades focused on estimating effect sizes. Why are the authors proposing this novel quantity? This needs ample justification. I for one see no reason why I would be interested in such a quantity (for the record, I do not believe it says much at all about replicability).

Why three reasons, if the first reason is that we are doing something silly.  Who gives a fuck about power of studies.  Clearly knowing how powerful studies are is as irrelevant as knowing the number of potholes in Toronto.  Thank you for your opinion that unfortunately was shared by the editor and mostly the first reviewer.

There is another reason why this quantity is strange, namely it is redundant. In particular, under a homogeneous effect size, average power is a simple transformation of the effect size; under heterogeneous effect sizes, it is a simple transformation of the effect size distribution (if normality is assumed for the effect size distribution, then a simple transformation of the average effect size and the heterogeneity variance parameter; if a more complicated mixture distribution is assumed as here then a somewhat more complicated transformation). So, since it is just a transformation, why not stick with what meta-analysts have focused on for decades!

You should have stopped when things were going good.  Now you are making silly comments that show your prejudice and ignorance.  The whole point of the paper is to present a method that estimates average power when there is heterogeneity (if this is too difficult for you, let's call it variability, or, even better, you know, bro, power is not always the same in each study).  If you missed this, you clearly didn't read the manuscript for more than two minutes.  So, your clever remark about redundancy is just a waste of my time and the time of readers of this blog because things are no longer so simple when there is heterogeneity.  But maybe you even know this and just wanted to be a smart ass.

[2] No Data Model / Likelihood: On pages 10-13, the authors heuristically propose a model but never write down the formal data model or likelihood. This is simply unacceptable in a methods paper: we need to know what assumptions your model is making about the observed data!

We provided r-code that not only makes it clear how z-curve works but also was available for reviewers to test it.  The assumptions are made clear and are simple. This is not some fancy Bayesian model with 20 unproven priors. We simply estimate a single population parameter from the observed distribution of z-scores and we make this pretty clear.  It is simple, makes no assumptions, and it works. Take that! 

Further, what are the model parameters? It is unclear whether they are mixture weights as well as means, just mixture weights, etc. Further, if just the weights and your setting the means to 0, 1, …, 6 is not just an example but embedded in your method, this is sheer ad hockery.

Again, it works. What is your problem?

It is quite clear from your example (Page 12-13) the model cannot recover the weights correctly even with 10,000 (whoa!) studies! This is not good. I realize your interest is on the average power that comes out of the model and not the weights themselves (these are a means to an end) but I would nonetheless be highly concerned—especially as 20-100 studies would be much more common than 10,000.

Unlike some statisticians we do not pretend that we can estimate something that cannot be estimated without making strong and unproven assumptions.  We are content with estimating what we can estimate and that is average power, which of course, you think is useless. If average power is useless, why would it be better if we could estimate the weights?

[3] Model Validation / Comparison: The authors validate their z-curve by comparing it to an ad hoc improvised method known as the p-curve (“p-Curve and effect size”, Perspectives on Psychological Science, 2014). The p-curve method was designed to estimate effect sizes (as per the title of the paper) and is known to perform extremely poorly at estimating at this task (particularly under effect size heterogeneity); there is no work validating how well it performs at estimating this rather curious average power quantity (but likely it would do poorly given that it is poor at estimating effect sizes and average power is a transformation of the effect size). Thus, knowing the z-curve performs better than the p-curve at estimating average power tells me next to nothing: you cannot validate your model against a model that has no known validation properties! Please find a compelling way to validate your model estimates (some suggested in the paragraphs below) whether that is via theoretical results, comparison to other models known to perform well, etc. etc.

No we are not validating z-curve with p-curve. We are validating z-curve with simulation studies that show z-curve produces good estimates of simulated true power. We only included p-curve to show that this method produces biased estimates when there is considerable variability in power. 

At the same time, we disagree with the claim that p-curve is not a good tool to estimate average effect sizes from a set of studies that are selected for significance. It is actually surprisingly good at estimating the average effect size for the set of studies that were selected for significance (as is puniform). 

It is not a good tool to estimate the effect size for the population of studies before selection for significance, but this is irrelevant in this context because we focus on replicability which implies that an original study produced a significant result and we want to know how likely it is that a replication study will produce a significant result again.

Relatedly, the results in Table 1 are completely inaccessible. I have no idea what you are presenting here and this was not made clear either in the table caption or in the main text. Here is what we would need to see at minimum to understand how well the approach performs—at least in an absolute sense.

[It shows the estimates (mean, SD) by the various models for our 3 x 3 design of the simulation study.  But who cares; the objective is useless, so you probably spent 5 seconds trying to understand the Table.]

First, and least important, we need results around bias: what is the bias in each of the simulation scenarios (these are implicitly in the Table 1 results I believe)? However, we also need a measure of accuracy, say RMSE, a metric the authors should definitely include for each simulation setting. Finally, we need to know something about standard errors or confidence intervals so we can know the precision of individual estimates. What would be nice to report is the coverage percentage of your 95% confidence intervals and the average width of these intervals in each simulation setting.

There are many ways to present results about accuracy. Too bad we didn’t pick the right way, but would it matter to you?  You don’t really think it is useful anyways.

This would allow us to, if not compare methods in a relative way, to get an absolute assessment of model performance. If, for example, in some simulation you have a bias of 1% and an RMSE of 3% and coverage percentage of 94% and average width of 12% you would seem to be doing well on all metrics*; on the other hand, if you have a bias of 1% and an RMSE of 15% and coverage percentage of 82% and average width of 56%, you would seem to be doing poorly on all metrics but bias (this is especially the case for RMSE and average width bc average power is bounded between 0% and 100%).

* Of course, doing well versus poorly is in the eye of the beholder and for the purposes at hand, but I have tried to use illustrative values for the various metrics that for almost all tasks at hand would be good / poor performance.

For this reason, we presented the Figure that showed how often the estimates were outside +/- 10%, where we think estimates of power do not need to be more precise than that.  No need to make a big deal out of 33% vs. 38% power, but 30% vs. 80% matters. 

I have many additional comments. These are not necessarily minor at all (some are; some aren’t) but they are minor relative to the above three:

[a] Page 4: a prior effect size: You dismiss these hastily which is a shame. You should give them more treatment, and especially discuss the compelling use of them by Gelman and Carlin here:

http://www.stat.columbia.edu/~gelman/research/published/PPS551642_REV2.pdf

This paper is absolutely irrelevant for the purpose of z-curve to estimate the actual power that researchers achieve in their studies.    

[b] Page 5: What does “same result” and “successful replication” mean? You later define this in terms of statistical significance. This is obviously a dreadful definition as it is subject to all the dichotomization issues intrinsic to the outmoded null hypothesis significance paradigm. You should not rely on dichotomization and NHST so strongly.

What is obvious to you is not the scientific consensus.  The most widely used criterion for a successful replication study is to get a significant result again.  Of course, we could settle for getting the same sign again and a 50% type-I error probability, but hey, as a reviewer you get to say whatever you want without accountability.

Further, throughout please replace “significant” by “statistically significant” and related terms when it is the latter you mean.

[F… You]

[c] Page 6: Your discussion regarding if studies had 80% then up to 80% of results would be successful is not quite right: this would depend on the prior probability of “non-null” studies.

[that is why we wrote UP TO]

[d] Page 7: I do not think 50% power is at all “good”. I would be appalled in fact to trust my scientific results to a mere coin toss. You should drop this or justify why coin tosses are the way we should be doing science.

We didn’t say it is all good.  We used it as a minimum, less than that is all bad, but that doesn’t mean 50% is all good. But hey, you don’t care anyways. so what the heck.  

[e] Page 10: Taking absolute values of z-statistics seems wrong as the sign provides information about the sign of the effect. Why do you do this?

It is only wrong if you are thinking about a meta-analysis of studies that test the same hypothesis.  However, if I want to examine the replicability of more than one specific hypothesis, all results have to be coded so that a significant result implies support for the hypothesis in the direction of significance.

[f] Page 13 and throughout: There are ample references to working papers and blog posts in this paper. That really is not going to cut it. Peer review is far from perfect but these cited works do not even reach that low bar.

Well, better than quoting, in a peer review of a methods paper, hearsay rumors from blog posts that the coding in some dataset is debatable.

[g] Page 16: What was the “skewed distribution”? More details about this and all simulation settings are necessary. You need to be explicit about what you are doing so readers can evaluate it.

We provided the r-code to recreate the distributions or change them.  It doesn’t matter. The conclusions remain the same.

[h] Page 15, Figure 2: Why plot median and no mean? Where are SEs or CIs on this figure?

Why do you need a CI or SE for simulations and what do you need to see that there is a difference between 0 and 80%?

[i] Page 14: p-curve does NOT provide good estimates of effect sizes!

Wrong. You don’t know what you are talking about.  It does provide a good estimate of average effect sizes for the set of studies selected for significance, which is the relevant set here.  

[j] You find p-curve is biased upwards for average power under heterogeneity; this seems to follow directly from the fact that it is biased upwards for effect size under heterogeneity (“Adjusting for Publication Bias in Meta-analysis”, Perspectives on Psychological Science, 2016) and the simply mapping between effect size and average power discussed above.

Wrong again. You are confusing estimates of average effect size for the studies before selection and after selection for significance.

[k] Page 20: Can z-curve estimate heterogeneity (the answer is yes)? You should probably provide such estimates.

We do not claim that z-curve estimates heterogeneity. Maybe some misunderstanding.

[l] Page 21-23: I don’t think the concept of the “replicability of all of psychology” is at all meaningful*. You are mixing apples and oranges in terms of areas studies as well as in terms of tests (focal tests vs manipulation checks). I would entirely cut this.

Of course, we can look for moderators but that is not helpful to you because you don’t  think the concept of power is useful.

* Even if it were, it seems completely implausible that the way to estimate it would be to combine all the studies in a single meta-analysis as here.

Why? 

[m] Page 23-25: I also don’t think the concept of the “replicability of all of social psychology” is at all meaningful. Note also there has been much dispute about the Motyl coding of the data so it is not necessarily reliable.

Of course you don’t, but why should I care about your personal preferences.

Further, why do you exclude large sample, large F, and large df1 studies? This seems unjustified. 

They are not representative, but it doesn’t make a difference.  

[n] Page 25: You write “47% average power implies that most published results are not false positives because we would expect 52.5% replicability if 50% of studies were false positives and the other 50% of studies had 100% power.” No, I think this will depend on the prior probability.

Wrong again. If 50% of studies were false positives, their power estimate would be 5%.  If the other studies had the maximum of 100% power, we would see a clearly visible bimodal distribution of z-scores and we would get an average estimate of p(H0) * 5 + (1 - p(H0)) * 100 = 0.5 * 5 + 0.5 * 100 = 52.5%.  You are a smart boy (sorry assuming this is a dick), you figure it out.

[o] Page 25: What are the analogous z-curve results if those extreme outliers are excluded? You give them for p-curve but not z-curve.

We provided that information, but you would need to care to look for them. 

[p] Page 27: You say the z-curve limitations are not a problem when there are 100 or more studies and some heterogeneity. The latter is fine to assume as heterogeneity is rife in psychological research but seldom do we have 100+ studies. Usually 100 is an upper bound so this poses problems for your method.

It doesn’t mean our method doesn’t work with smaller N.  Moreover, the goal is not to conduct effect size meta-analysis, but apparently you missed that because you don’t really care about the main objective to estimate replicability.  Not sure why you agreed to review a paper that is titled “A method for estimating replicability?” 

Final comment: Thanks for nothing. 

===========================================================================

Reviewer: 3

This review was conducted by Leif Nelson

[Thank you for signing your review.]

Let me begin by apologizing for the delay in my review; the process has been delayed because of me and not anyone else in the review team. 

Not sure why the editor waited for your review.  He could have rejected the manuscript after reading the first two reviews, which argued that the whole objective (which you and I think is meaningful) is irrelevant for advancing psychological science.  Sorry for the unnecessary trouble.

Part of the delay was because I spent a long time working on the review (as witnessed by the cumbersome length of this document). The paper is dense, makes strong claims, and is necessarily technical; evaluating it is a challenge.

I commend the authors for developing a new statistical tool for such an important topic. The assessment of published evidence has always been a crucial topic, but in the midst of the current methodological renaissance, it has gotten a substantial spotlight.

Furthermore, the authors are technically competent and the paper articulates a clear thesis. A new and effective tool for identifying the underlying power of studies could certainly be useful, and though I necessarily have a positive view of p-curve, I am open to the idea that a new tool could be even better.

Ok, enough with the politeness. Let’s get to it.

I am not convinced that Z-curve is that tool. To be clear, it might be, but this paper does not convince me of that.

As expected,…. p < .01.  So let’s hear why the simulation results and the demonstration of inflated estimates in real datasets do not convince you.

I have a list of concerns, but a quick summary might save someone from the long slog through the 2500 words that follow:

  1. The authors claim that, relative to Z-curve, p-curve fails under heterogeneity and do not report, comment on, or explain analyses showing exactly the opposite of that assertion.

Wow. let me parse this sentence.  The authors claim p-curve fails under heterogeneity (yes) and do not report … analyses showing … the opposite of that assertion.

Yes, that is correct. We do not show results opposite to our assertion. We show results that confirm our assertion in Figure 1 and 2.  We show in simulations with R-code that we provided and you could have used to run your own simulations that z-curve provides very good estimates of average power when there is heterogeneity and that p-curve tends to overestimate average power.  That is the key point of this paper.  Now how much time did you spend on this review, exactly?

The authors do show that Z-curve gives better average estimates under certain circumstances, but they neither explain why, nor clarify what those circumstances look like in some easy to understand way, nor argue that those circumstances are representative of published results. 

Our understanding was that technical details are handled in the supplement that we provided.  The editor asked us to supply R-code again for a reviewer but it is not clear to us which reviewer actually used the provided R-code to answer technical questions like this.  The main point is made clear in the paper. When the true power (or z-values) varies across studies,  p-curve tends to overestimate.  Not sure the claims of being open are very credible if this main point is ignored.

3. They attempt to demonstrate the validity of the Z-curve with three sets of clearly invalid data.

[No, we do not attempt to validate z-curve with real datasets.  That would imply that we already know the average power in real data, which we do not.  We used simulations to validate z-curve and to show that p-curve estimates are biased.  We used real data only to show that the differences in estimates have real-world implications.  For example, when we use the Motyl et al. (JPSP) data to examine replicability, z-curve gives a reasonable estimate of 46% (in line with the reported R-Index estimates in the JPSP article), while p-curve gives an estimate of 72% power.  This is not a demonstration of validity; it is a demonstration that p-curve would overestimate the replicability of social psychological findings in a way that most readers would consider practically meaningful.]

I think that any one of those would make me an overall negative evaluator; the combination only more so. Despite that, I could see a version which clarified the “heterogeneity” differences, acknowledged the many circumstances where Z-curve is less accurate than p-curve, and pointed out why Z-curve performs better under certain circumstances. Those might not be easy adjustments, but they are possible, and I think that these authors could be the right people to do it. (the demonstrations should simply be removed, or if the authors are motivated, replaced with valid sets).

We already point out when p-curve does better. When there is minimal variability or actually identical power, precision of p-curve is 2-3% better.

Brief elaboration on the first point: In the initial description of p-curve the authors seem to imply that it should/does/might have “problems when the true power is heterogeneous”. I suppose that is an empirical question, but it is one that has been answered. In the original paper, Simonsohn et al. report results showing how p-curve behaves under some types of heterogeneity. Furthermore, and more recently, we have reported how p-curve responds under other different and severe forms of heterogeneity (datacolada.org/67). Across all of those simulations, p-curve does indeed seem to perform fine. If the authors want to claim that it doesn’t perform well enough (with some quantifiable statement about what that means), or perhaps that there are some special conditions in which it performs worse, that would be entirely reasonable to articulate. However, to say “the robustness of p-curve has not been tested” is not even slightly accurate and quite misleading.

These are totally bogus and cherry-picked simulations that were conducted after I shared a preprint of this manuscript with Uri.  I don't agree with Reviewer 2 that we shouldn't use blogs, but the content of the blog post needs to be accurate and scientific. The simulations in this blog post are not.  The variation of power is very small.  In contrast, we examine p-curve and z-curve in a fair comparison with varying amounts of heterogeneity that is found in real data sets.   In these simulations p-curve again does slightly better when there is no heterogeneity, but it does a lot worse when there is considerable variability.

To ignore the results in the manuscript and to claim that the blog post shows something different is not scientific.  It is pure politics. The good news is that simulation studies have a real truth and the truth is that when you simulate large variability in power,  p-curve starts overestimating average power.  We explain that this is due to the use of a single parameter model that cannot model heterogeneity. If we limit z-curve to a single parameter it has the same problem. The novel contribution of z-curve is to use multiple (3 or 7 doesn’t matter much) parameters to model heterogeneity.  Not surprisingly, a model that is more consistent with the data produces better estimates.

Brief elaboration on the second point: The paper claims (and shows) that p-curve performs worse than Z-curve with more heterogeneity. DataColada[67] claims (and shows) that p-curve performs better than Z-curve with more heterogeneity.

p-curve does not perform better with more heterogeneity. I had a two-week email exchange with Uri when he came up with simulations that showed better performance of p-curve.  For example, transformation to z-scores is an approximation, and when you use t-values with small N (all studies have N = 20), the approximation leads to suboptimal estimates. Also, smaller k is an issue because z-curve estimates density distributions. So, I am well aware of limited, specialized situations where p-curve can do better by up to 10 percentage points, but that doesn't change the fact that it does a lot worse when p-curve is applied to real heterogeneous data like I have been analyzing for years (ego-depletion replicability report, Motyl focal hypothesis tests, etc.).

I doubt neither set of simulations. That means that the difference – barring an error or similar – must lie in the operational definition of “heterogeneity.” Although I have a natural bias in interpretation (I assisted in the process of generating different versions of heterogeneity to then be tested for the DataColada post), I accept that the Z-curve authors may have entirely valid thinking here as well. So a few suggestions: 1. Since there is apparently some disagreement about how to operationalize heterogeneity, I would recommend not talking about it as a single monolithic construct.

How is variability in true power not a single construct.  We have a parameter and it can vary from alpha to 1.   Or we have a population effect size and a specific amount of sampling error and that gives us a ratio that reflects the deviation of a test statistic from 0.   I understand the aim of saving p-curve, but in the end p-curve in its current form is unable to handle larger amounts of heterogeneity.   You provide no evidence to the contrary.

Instead clarify exactly how it will be operationalized and tested and then talk about those. 2. When running simulations, rather than only reporting the variance or the skewness, simply show the distribution of power in the studies being submitted to Z-curve (as in DataColada[67]). Those distributions, at the end of the day, will convey what exactly Z-curve (or p-curve) is estimating. 3. To the extent possible, figure out why the two differ. What are the cases where one fails and the other succeeds? It is neither informative (nor accurate) to describe Z-curve as simply “better”. If it were better in every situation then I might say, “hey, who cares why?”. But it is not. So then it becomes a question of identifying when it will be better.

Again, I had a frustrating email correspondence with Uri and the issues are all clear and do not change the main conclusion of our paper.  When there is large heterogeneity, modeling this heterogeneity leads to unbiased estimates of average power, whereas a single component model tends to produce biased estimates.

Brief elaboration on the third point: Cuddy et al. selected incorrect test statistics from problematic studies. Motyl et al. selected lots and lots of incorrect tests. Scraping test statistics is not at all relevant to an assessment of the power of the studies where they came from. These are all unambiguously invalid. Unfortunately, one cannot therefore learn anything about the performance of Z-curve in assessing them.

I really don’t care about Cuddy. What I do care about is that they used p-curve as if it can produce accurate estimates of average power and reported an estimate to readers that suggested they had the right estimate, when p-curve again overestimated average power.

The claims about Motyl are false. I have done my own coding of these studies, and despite a few inconsistencies in the coding of some studies, I get the same results with my coding.  Please provide your own coding of these studies and I am sure the results will be the same.  Unless you have coded Motyl et al.'s studies, you should not make unfounded claims about this dataset or the results that are based on it.

OK, with those in mind, I list below concerns I have with specifics in the paper. These are roughly ordered based on where they occur in the paper:

Really, I would love to stop here, but I am a bit obsessive-compulsive, even though readers might already have enough information to draw their own conclusions.

* The paper contains a number of statements of fact that seem too certain. Just one early example: “the most widely used criterion for a successful replication is statistical significance (Killeen, 2005).” That is a common definition and it may be the most common, but that is hardly a certainty (even with a citation). It would be better to simply identify that definition as common and then consider its limitations (while also considering others).

Aside from being the most common criterion, it is also the most reasonable one. How else would we compare the results of an original study that claimed the effect is positive, 95% CI d = .03 to 1.26, to the results of a replication study? Would we say, “wow, the replication produced d = .05, this is consistent with the original study, therefore we have a successful replication”?
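As a toy illustration of why that criterion is so weak (using the hypothetical numbers above, nothing from the paper):

# Sketch: a trivially small replication estimate is "consistent" with a very
# wide original confidence interval, yet provides no evidence that the effect
# is positive.
orig_ci <- c(0.03, 1.26)                      # original 95% CI for d
rep_d   <- 0.05                               # replication point estimate
rep_d >= orig_ci[1] & rep_d <= orig_ci[2]     # TRUE: "consistent" by the CI criterion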

* The following statement seems incorrect (and I think that the authors would be the first to agree with me): “Exact replications of the original study should also produce significant results; at least we should observe more successful than failed replications if the hypothesis is true.” If original studies were all true, but all powered at 25%, then exact (including sample size) replications would be significant 25% of the time. I assume that I am missing the argument, so perhaps I am merely suggesting a simple clarification. (p. 6)

You misinterpret the intention here. We are stating that a good study should be replicable and are implying that a study with 25% power is not a good study. At a minimum, we would expect a good study to produce successful replications more often than failed ones, which happens when power is above 50%.

* I am not sure that I completely understand the argument about the equivalence of low power and false positives (e.g., “Once we take replicability into account, the distinction between false positives and true positives with low power becomes meaningless, and it is more important to distinguish between studies with good power that are replicable and studies with low power or false positives that are difficult to replicate.”) It seems to me that underpowered original studies may, in the extreme case, test true hypotheses, but they lack meaningful evidence. Alternatively, false positives are definitionally false hypotheses that also, definitionally, lack meaningful evidence. If a replicator were to use a very large sample size, they would certainly care about the difference. Note that I am hardly making a case in support of the underpowered original – I think the authors’ articulation of the importance of statistical power is entirely reasonable – but I think the statement of functional equivalence is a touch cavalier.

Replicability is a property of the original study. If the original study had 6% power, it is a bad study, even if a subsequent study with 10 times the sample size is able to show a significant result with much more power.

* I was surprised that there was no discussion of the Simonsohn Small Telescopes perspective in the statistical evaluation of replications. That offers a well-cited and frequently discussed definition of replicability that talks about many of the same issues considered in this introduction. If the authors think that work isn’t worth considering, that is fine, but they might anticipate that other readers would at least wonder why it was not.

The paper is about the replicability of published findings, not about sample-size planning for replication studies. Average power predicts what would happen in a study with the same sample sizes, not what would happen if sample sizes were increased. So the Small Telescopes paper is not relevant.
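To illustrate what “average power predicts what would happen with the same sample sizes” means, here is a minimal simulation sketch (purely hypothetical noncentralities, not the data or code from the paper):

# Sketch: the average true power of the significant originals equals the
# expected success rate of exact replications with the same sample sizes.
# Significance is treated one-sided (z > 1.96) to keep the sketch short.
set.seed(1)
ncp  <- runif(200000, 0, 3)                # hypothetical true noncentralities
z1   <- rnorm(length(ncp), mean = ncp)     # original z-statistics
sig  <- z1 > 1.96                          # the significant (published) originals
mean(pnorm(ncp[sig] - 1.96))               # average true power of those studies
z2   <- rnorm(sum(sig), mean = ncp[sig])   # exact replications, same noncentrality
mean(z2 > 1.96)                            # replication success rate: ~ the same value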

* The consideration of the Reproducibility Project struck me as lacking nuance. It takes the 36% estimate too literally, despite multiple articles and blog posts which have challenged that cut-and-dried interpretation. I think that it would be reasonable to at least give some voice to the Gilbert et al. criticisms which point out that, given the statistical imprecision of the replication studies, a more positive estimate is justifiable. (again, I am guessing that many people – including me – share the general sense of pessimism expressed by the authors, but a one-sided argument will not be persuasive).

Are you nuts? Gilbert may have had one or two points about specific replication studies, but his broader claims about the OSC results are utter nonsense, even if they were published as a commentary in Science.  It is a trivial fact that the success rate in a set of studies that is not selected for significance is an estimate of average power.  If we didn’t have a file drawer, we could just count the percentage of significant results to know how low power actually is. However, we do have file drawers, and therefore we need a statistical tool like z-curve to estimate average power if that is a desirable goal.  If you cannot see that the OSC data are the best possible dataset to evaluate bias-correction methods with heterogeneous data, you seem to lack the most fundamental understanding of statistical power and how it relates to success rates in significance tests.
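A toy simulation of the point about success rates (again hypothetical numbers, not the OSC data):

# Sketch: without selection for significance, the observed success rate
# estimates average true power (same one-sided convention as above).
set.seed(2)
ncp <- runif(200000, 0, 3)        # hypothetical true noncentralities
z   <- rnorm(length(ncp), ncp)    # observed z-statistics, no file drawer
mean(z > 1.96)                    # observed success rate
mean(pnorm(ncp - 1.96))           # average true power: essentially the same number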

* The initial description of Z-Curve is generally clear and brief. That is great. On the other hand I think that a reasonable standard should be that readers would need neither to download and run the R-code nor go and read the 2016 paper in order to understand the machinery of the algorithm. Perhaps a few extra sentences to clarify before giving up and sending readers to those other sources.

This is up to the editor. We are happy to move content from the Supplement to the main article or do anything else that can improve clarity and communication.  But first we need to be given an opportunity to do so.

* I don’t understand what is happening on pages 11-13. I say that with as much humility as possible, because I am sure that the failing is with me. Nevertheless, I really don’t understand. Is this going to be a telling example? Or is it the structure of the underlying computations? What was the data generating function that made the figure? What is the goal?

* Figure 1: (A few points.) The caption mentions “…how Z-curve models…” I am sure that it does, but it doesn’t make sense to me. Perhaps it would be worth clarifying what the inputs are, what the outputs are, what the inferences are, and in general, what the point of the figure is. The authors have spent far more time creating this figure than anyone who simply reads it, so I do not doubt that it is a good representation of something, but I am honestly indicating that I do not know what that is. Furthermore, the authors say “the dotted black line in Figure 1.” I found it eventually, but it is really hard to see. Perhaps make the other lines a very light gray and the critical line a pure and un-dashed black?

It is a visual representation of the contribution of each component of the model to the total density.
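For readers who want something more concrete than that one sentence, here is my rough sketch of the general idea (the component locations, weights, and helper function are made up for illustration, not the fitted model or code from the paper): the density of significant z-scores is approximated by a weighted sum of components, each truncated at the significance criterion, and the figure shows those weighted components alongside their total.

# Sketch: a weighted mixture of normal components, truncated at z = 1.96,
# adds up to a total model density (dashed lines: components; solid: total).
z       <- seq(1.96, 6, by = 0.01)
means   <- 0:5                               # hypothetical component locations
weights <- c(.20, .20, .20, .20, .10, .10)   # hypothetical mixture weights
dtrunc  <- function(z, m) dnorm(z, mean = m) / (1 - pnorm(1.96, mean = m))
comps   <- sapply(means, function(m) dtrunc(z, m))  # one column per component
total   <- as.vector(comps %*% weights)             # total model density
matplot(z, cbind(sweep(comps, 2, weights, `*`), total),
        type = "l", lty = c(rep(2, 6), 1), col = 1,
        xlab = "z", ylab = "density")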

* The authors say that they turn every Z-score of >6 into 6. How consequential is that decision? The explanation that those are all powered at 100% is not sufficient. If there are two results entered into Z-curve one with Z = 7 and one with Z = 12, Z-curve would treat them identically to each other and identically as if they were both Z = 6, right? Is that a strength? (without clarification, it sounds like a weakness). Perhaps it would be worth some sentences and some simulations to clarify the consequences of the arbitrary cutoff. Quite possibly the consequences are zero, but I can’t tell. 

Z-curve could also fit components in this region, but there are few z-scores above 6, and if you convert a z-score of 6 into power you get pnorm(6, 1.96) = .99997, or 99.997%. So does it matter? No, it doesn’t, which is the reason why we are doing it. If it made a difference, we wouldn’t be doing it.
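To put numbers on it (same conversion as above, ignoring the negligible opposite tail):

# Power implied by an observed z-score, relative to the 1.96 criterion:
pnorm(6 - 1.96)     # ~ .99997, the value quoted above
pnorm(7 - 1.96)     # even closer to 1
pnorm(12 - 1.96)    # indistinguishable from 1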

* On p. 13 the authors say, “… the average power estimate was 50% demonstrating large sample accuracy.” That seems like a good, solid conclusion, but I didn’t understand how they got to it. One possible approach would be to start a bit earlier by clarifying the approach. Something that sounded like, “Our goal was to feed in data from a 50% power distribution and then assess the accuracy of Z-curve by seeing whether or not it returned an average estimate of 50%.” From there, perhaps, it might be useful to explain in conversational language how that was done.

The main simulations are done later.  This is just an example.  So we can just delete the claim about large sample accuracy here.

* To reiterate, I simply cannot follow what the authors are doing. I accept that as my fault, but let’s assume that a different reader might share some of my shortcomings. If so, then some extra clarification would be helpful.

Thanks, but if you don’t understand what we are doing, why are you an expert reviewer for our paper? I did ask that Uri not be picked as a reviewer because he ignored all reasonable arguments when I sent him a preprint, but that did not mean that some other proponent of p-curve with less statistical background should be the reviewer.

* The authors say that p-curve generates an estimate of 76% for this analysis and that is bad. I believe them. Unfortunately, as I have indicated in a few places, I simply do not understand what the authors did, and so cannot assess the different results.

We used the R-code for the p-curve app, submitted the data, and read the output. And yes, we agree, it is bad that a tool is in the public domain without any warning that it is biased when there is heterogeneity and can overestimate average power by 25 percentage points. What are you going to do about it?

So clarification would help. Furthermore, the authors then imply that this is due to p-curve’s failure with heterogeneity. That sounds unlikely, given the demonstrations of p-curve’s robustness to heterogeneity (i.e., DataColada[67]), but let’s assume that they are correct.

Uri simulated minimal heterogeneity to save p-curve from embarrassment. So there is nothing surprising here. Uri essentially p-hacked p-curve results to get the results he wanted.

It then becomes absolutely critical for the authors to explain why that particular version is so far off. Based on lengthy exchanges between Uli and Uri, and as referenced in the DataColada post, across large and varied forms of heterogeneity, Z-curve performs worse than p-curve. What is special about this case? Is it one that exists frequently in nature?

Enough already.  That p-hacked post is not worth the bytes on the hosting server.

* I understand Figure 2. That is great. 

Do we have badges for reviewers who actually understand something in a paper?

* The authors imply that p-curve does worse at estimating high powered studies because of heterogeneity. Is there evidence for that causal claim? It would be great if they could identify the source of the difference.

The evidence is in the fucking paper you were supposed to review and evaluate.

* Uri, in the previous exchanges with Uli (and again, described in the blog post), came to the conclusion that Z-curve did better than p-curve when there were many very extreme (100% power) observations in the presence of other very low powered observations. The effect seemed to be carried by how Z-curve handles those extreme cases. I believe – and truly I am not sure here – that the explanation had to do with the fact that with Z-curve, extreme cases are capped at some upper bound. If that is true then (a) it is REALLY important for that to be described, clarified, and articulated. In addition, (b) it needs to be clearly justified. Is that what we want the algorithm to do? It seems important and potentially persuasive that Z-curve does better with certain distributions, but it clearly does worse with others. Given that, (c) it seems like the best case positioning for Z-curve would be if it could lay out the conditions under which it would perform better (e.g., one in which there are many low powered studies, but the mode was nevertheless >99.99% power), while acknowledging those in which it performs worse (e.g., all of the scenarios laid out in DataColada[67]).

I can read and I read the blog post. I didn’t know these p-hacked simulations would be used against me in the review process. 

* Table 1: Rather than presenting these findings in tabular form, I think it would be informative if there were histograms of the studies being entered into Z-curve (as in DataColada[67]). That allows someone to see what is being assessed rather than relying on their intuitive grasp of skewness, for example.

Of course we can add those, but that doesn’t change anything about the facts.

The Demonstrations:

* the Power Posing Meta-Analysis. I think it is interesting to look at how Z-curve evaluates a set of studies. I don’t think that one can evaluate the tool in this way (because we do not know the true power of Power Posing studies), but it is interesting to see. I would make some suggestions though. (a) In a different DataColada post (datacolada.org/66), we looked carefully at the Cuddy, Shultz, & Fosse p-curve and identified that the authors had selected demonstrably incorrect tests from demonstrably problematic studies. I can’t imagine anyone debating either contention (indeed, no one has, though the Z-curve authors might think the studies and tests selected were perfect. That would be interesting to add to this paper.). Those tests were also the most extreme (all >99% power). Without reading this section I would say, “well, no analysis should be run on those test statistics since they are meaningless. On the other hand, since they are extreme in the presence of other very low powered studies, this sounds like exactly the scenario where Z-curve will generate a different estimate from p-curve”. [again, the authors simply cite “heterogeneity” as the explanation and again, that is not informative]. I think that a better comparison might be on the original power-posing p-curve (Simmons & Simonsohn, 2017; datacolada.org/37). Since those test statistics were coded by two authors of the original p-curve, that part is not going to be casually contested. I have no idea what that comparison will look like, but I would be interested.

I don’t care about the silly power-posing research. I can take this out, but it just showed that p-curve is used without understanding its limitations, which have been neglected by the developers of p-curve (not sure how much you were involved). 

* The scraping of 995,654 test statistics. I suppose one might wonder “what is the average power of any test reported in psychology between 2010 and 2017?” So long as that is not seen as even vaguely relevant to the power of the studies in which they were reported, then OK. But any implied relevance is completely misleading. The authors link the numbers (68% or 83%) to the results of the Reproducibility Project. That is exactly the type of misleading reporting I am referring to. I would strongly encourage this demonstration to be removed from the paper.

How do we know what the replicability in developmental psychology is? How do we know what the replicability in clinical psychology is? The only information that we have comes from social and experimental cognitive research with simple paradigms. Clearly we cannot generalize to all areas of psychology. Of course, an analysis of focal and non-focal tests has some problems, which we discuss, but it clearly serves as an upper limit and can be used for temporal and cross-discipline comparisons without taking the absolute numbers too seriously. But this can only be done with a method that is unbiased, not a method that estimates 96% power when power is 75%.

* The Motyl et al p-curve. This is a nice idea, but the data set being evaluated is completely unreasonable to use. In yet another DataColada post (datacolada.org/60), we show that the Motyl et al. researchers selected a number of incorrect tests. Many omnibus tests and many manipulation checks. I honestly think that those authors made a sincere effort but there is no way to use those data in any reasonable fashion. It is certainly no better (and possibly worse) than simply scraping every p-value from each of the included studies. I wish the Motyl et al. study had been very well conducted and that the data were usable. They are not. I recommend that this be removed from the analysis or, time permitting, the Z-curve authors could go through the set of papers and select and document the correct tests themselves.

You are wrong, and I haven’t seen you posting a corrected data set. Give me a corrected data set and I bet you $1,000 that p-curve will again produce a higher estimate than z-curve.

* Since it is clearly relevant to the above, I will mention that the Z-curve authors do not mention how tests should be selected. In my experience, p-curve users infrequently make mistakes with the statistical procedure, but they frequently make mistakes in the selection of test statistics. I think that if the authors want their tool to be used correctly, they would be well served by giving serious consideration to how tests should be selected and then carefully explaining that.

Any statistical method depends on the data you supply – just like when Uri p-hacked simulations to show that p-curve does well with heterogeneity.

Final conclusion:

Dear reader, if you made it this far, please let me know in the comments section which of the following you take away from all of this:

1. The Motyl data are OK; p-curve overestimates, and that is because p-curve does not handle realistic amounts of heterogeneity well.

2. The Motyl data are OK; p-curve overestimates, but this only happens with the Motyl data.

3. The Motyl data are OK; p-curve overestimates, but that is because we did not use p-curve properly.

4. The Motyl data are not OK, our simulations are p-hacked, and p-curve does well with heterogeneity.