
Guest Post by Jerry Brunner: Response to an Anonymous Reviewer

Introduction

Jerry Brunner is a recent emeritus professor in the Department of Statistics at the University of Toronto Mississauga. Jerry first started in psychology, but was frustrated by the unscientific practices he observed in graduate school, and he went on to become a professor of statistics. Thus, he is not only an expert in statistics; he also understands the methodological problems in psychology.

Sometime in the wake of the replication crisis, around 2014/15, I went to his office to talk to him about power and bias detection. Working with Jerry was educational and motivational. Without him, z-curve would not exist. We spent years trying different methods and thinking about the underlying statistical assumptions. Simulations often shattered our intuitions. The Brunner and Schimmack (2020) article summarizes all of this work.

A few years later, the method is being used to examine the credibility of published articles across different research areas. However, not everybody is happy about a tool that can reveal publication bias, the use of questionable research practices, and a high risk of false positive results. An anonymous reviewer dismissed z-curve results based on a long list of criticisms (Post: Dear Anonymous Reviewer). It was funny to see how ChatGPT responded to these criticisms (Comment). However, the quality of ChatGPT responses is difficult to evaluate. Therefore, I am pleased to share Jerry’s response to the reviewer’s comments here. Let’s just say that the reviewer was wise to make their comments anonymously. Posting the review and the response in public also shows why we need open reviews like the ones published in Meta-Psychology by the reviewers of our z-curve article. Hidden and biased reviews are just one more reason why progress in psychology is so slow.

Jerry Brunner’s Response

This is Jerry Brunner, the “Professor of Statistics” mentioned in the post. I am also a co-author of Brunner and Schimmack (2020). Since the review Uli posted is mostly an attack on our joint paper (Brunner and Schimmack, 2020), I thought I’d respond.

First of all, z-curve is sort of a moving target. The method described by Brunner and Schimmack is strictly a way of estimating population mean power based on a random sample of tests that have been selected for statistical significance. I’ll call it z-curve 1.0. The algorithm has evolved over time, and the current z-curve R package (available at https://cran.r-project.org/web/packages/zcurve/index.html) implements a variety of diagnostics based on a sample of p-values. The reviewer’s comments apply to z-curve 1.0, and so do my responses. This is good from my perspective, because I was in on the development of z-curve 1.0, and I believe I understand it pretty well. When I refer to z-curve in the material that follows, I mean z-curve 1.0. I do believe z-curve 1.0 has some limitations, but they do not overlap with the ones suggested by the reviewer.

Here are some quotes from the review, followed by my answers.

(1) “… z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis …”

This is not accurate. Post-hoc power analysis is indeed fatally flawed; z-curve is something quite different. For later reference, in the “observed” power method, sample effect size is used to estimate population effect size for a single study. Estimated effect size is combined with observed sample size to produce an estimated non-centrality parameter for the non-central distribution of the test statistic, and estimated power is calculated from that, as an area under the curve of the non-central distribution. So, the observed power method produces an estimated power for an individual study. These estimates have been found to be too noisy for practical use.
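For concreteness, here is a minimal R sketch of the observed power procedure just described (this is not z-curve; the data, sample size, and effect are made up for illustration):

set.seed(1)
n  <- 20                                   # per-group sample size of one hypothetical study
g1 <- rnorm(n, mean = 0.5)
g2 <- rnorm(n, mean = 0)
d_hat <- (mean(g1) - mean(g2)) / sqrt((var(g1) + var(g2)) / 2)   # sample effect size
# Plug the estimated effect size back in as if it were the true effect size:
obs_power <- power.t.test(n = n, delta = d_hat, sd = 1, sig.level = .05)$power
round(c(d_hat = d_hat, observed_power = obs_power), 2)

The noise in d_hat carries straight through to the power estimate, which is why such single-study estimates are too noisy to be useful.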

The confusion of z-curve with observed power comes up frequently in the reviewer’s comments. To be clear, z-curve does not estimate effect sizes, nor does it produce power estimates for individual studies.

(2) “It should be noted that power is not a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should always be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.”

The reviewer appears to believe that once the results of a study are in, the study no longer has a power. To clear up this misconception, I will describe the model on which z-curve is based.

There is a population of studies, each with its own subject population. One designated significance test will be carried out on the data for each study. Given the subject population, the procedure and design of the study (including sample size), significance level and the statistical test employed, there is a probability of rejecting the null hypothesis. This probability has the usual frequentist interpretation; it’s the long-term relative frequency of rejection based on (hypothetical) repeated sampling from the particular subject population. I will use the term “power” for the probability of rejecting the null hypothesis, whether or not the null hypothesis is exactly true.

Note that the power of the test — again, a member of a population of tests — is a function of the design and procedure of the study, and also of the true state of affairs in the subject population (say, as captured by effect size).

So, every study in the population of studies has a power. It’s the same before any data are collected, and after the data are collected. If the study were replicated exactly with a fresh sample from the same population, the probability of observing significant results would be exactly the power of the study — the true power.
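A small R sketch illustrates this frequentist reading (the design and true effect size are arbitrary assumptions): across exact replications of the same design, the proportion of significant results settles at the analytically computed true power.

set.seed(7)
n <- 25; d <- 0.5                                   # hypothetical design and true effect size
true_power <- power.t.test(n = n, delta = d, sd = 1, sig.level = .05)$power
sig <- replicate(20000,
  t.test(rnorm(n, mean = d), rnorm(n, mean = 0), var.equal = TRUE)$p.value < .05)
round(c(analytic_power = true_power, long_run_rejection_rate = mean(sig)), 3)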

This takes care of the reviewer’s objection, but let me continue describing our model, because the details will be useful later.

For each study in the population of studies, a random sample is drawn from the subject population, and the null hypothesis is tested. The results are either significant, or not. If the results are not significant, they are rejected for publication, or more likely never submitted. They go into the mythical “file drawer,” and are no longer available. The studies that do obtain significant results form a sub-population of the original population of studies. Naturally, each of these studies has a true power value. What z-curve is trying to estimate is the population mean power of the studies with significant results.

So, we draw a random sample from the population of studies with significant results, and use the reported results to estimate population mean power — not of the original population of studies, but only of the subset that obtained significant results. To us, this roughly corresponds to the mean power in a population of published results in a particular field or sub-field.

Note that there are two sources of randomness in the model just described. One arises from the random sampling of studies, and the other from random sampling of subjects within studies. In an appendix containing the theorems, Brunner and Schimmack liken designing a study (and choosing a test) to the manufacture of a biased coin with probability of heads equal to the power. All the coins are tossed, corresponding to running the subjects, collecting the data and carrying out the tests. Then the coins showing tails are discarded. We seek to estimate the mean P(Head) for all the remaining coins.
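Here is a short R sketch of the coin analogy (the beta distribution for true power is purely illustrative): manufacture coins with heterogeneous P(Head), toss each one once, discard the tails, and compare mean P(Head) before and after discarding.

set.seed(42)
k <- 1e5
power_all <- rbeta(k, 2, 3)                 # true power of each study (assumed distribution)
heads     <- rbinom(k, 1, power_all) == 1   # one toss per coin = one significance test
round(c(mean_power_all      = mean(power_all),
        mean_power_selected = mean(power_all[heads])), 3)
# Selection for significance raises mean power; the selected mean is what z-curve estimates.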

(3) “In Brunner and Schimmack (2020), there is a problem with ‘Theorem 1 states that success rate and mean power are equivalent …’ Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size.”

There is no problem with Theorem 1. The theorem says that in the coin tossing experiment just described, suppose you (1) randomly select a coin from the population, and (2) toss it — so there are two stages of randomness. Then the probability of observing a head is exactly equal to the mean P(Heads) for the entire set of coins. This is pretty cool if you think about it. The theorem makes no use of the concept of effect size. In fact, it’s not directly about estimation at all; it’s actually a well-known result in pure probability, slightly specialized for this setting. The reviewer says “Focusing on observed power …” But why would he or she focus on observed power? We are talking about true power here.
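Theorem 1 can be checked by brute force in a few lines of R (the distribution of true power values is an arbitrary choice):

set.seed(2020)
k <- 1e6
power <- runif(k, min = .05, max = .95)   # a population of coins / studies
toss  <- rbinom(k, 1, power)              # toss each coin once
round(c(prob_of_head = mean(toss), mean_power = mean(power)), 3)
# The overall head rate matches the mean P(Heads), about .50 in this example.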

(4) “Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution).”

They said it couldn’t be done. Actually, deriving the distribution of the p-value under the alternative hypothesis is a reasonable homework problem for a master’s student in statistics. I could give some hints …
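For the record, here is the textbook case of a one-sided z-test with true mean theta, offered as a sketch rather than the general result: the p-value P = 1 − Φ(Z) has CDF Pr(P ≤ p) = Φ(theta − Φ⁻¹(1 − p)), which reduces to the uniform CDF when theta = 0. A quick check by simulation in R (theta = 2 is arbitrary):

set.seed(123)
theta <- 2
z <- rnorm(1e5, mean = theta, sd = 1)     # test statistics under the alternative
p <- pnorm(z, lower.tail = FALSE)         # one-sided p-values
F_theory <- function(p, theta) pnorm(theta - qnorm(1 - p))
probs <- c(.01, .025, .05, .10, .25)
round(rbind(empirical   = sapply(probs, function(x) mean(p <= x)),
            theoretical = F_theory(probs, theta)), 3)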

(5) “Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).”

Indeed, z-curve uses only p-values, from which useful estimates of effect size cannot be recovered. As previously stated, z-curve does not estimate power for individual studies. However, the reviewer is aware that p-values have a probability distribution. Intuitively, shouldn’t the distribution of p-values and the distribution of power values be connected in some way? For example, if all the null hypotheses in a population of tests were true so that all power values were equal to 0.05, then the distribution of p-values would be uniform on the interval from zero to one. When the null hypothesis of a test is false, the distribution of the p-value is right skewed and strictly decreasing (except in pathological artificial cases), with more of the probability piling up near zero. If average power were very high, one might expect a distribution with a lot of very small p-values. The point of this is just that the distribution of p-values surely contains some information about the distribution of power values. What z-curve does is to massage a sample of significant p-values to produce an estimate, not of the entire distribution of power after selection, but just of its population mean. It’s not an unreasonable enterprise, in spite of what the reviewer thinks. Also, it works well for large samples of studies. This is confirmed in the simulation studies reported by Brunner and Schimmack.

(6) “The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic.”

Okay, no problem. As described above, every study in the population of studies has its own test statistic, its own true (not estimated) effect size, its own sample size — and therefore its own true power. The relative frequency histogram of these numbers is the true population distribution of power.

(7) “There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively).”

Right. These were examples, illustrating the distribution of power before versus after selection for significance — as given in Theorem 2. Theorem 2 applies to any distribution of true power values.

(8) “If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution …”

The reviewer is objecting to Step 1 of constructing a z-curve estimate, given on page 6 of Brunner and Schimmack (2020). We start with a sample of significant p-values, arising from a variety of statistical tests, various F-tests, chi-squared tests, whatever — all with different sample sizes. Then we pretend that all the tests were actually two-sided z-tests with the results in the predicted direction, equivalent to one-sided z-tests with significance level 0.025. Then we transform the p-values to obtain the z statistics that would have generated them, had they actually been z-tests. Then we do some other stuff to the z statistics.
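In R, Step 1 is essentially a one-liner (the p-values below are hypothetical):

p_sig <- c(.049, .030, .010, .001, .00001)   # significant two-sided p-values from assorted tests
z_abs <- qnorm(1 - p_sig / 2)                # z statistics that would have produced them
round(z_abs, 2)                              # all exceed qnorm(.975) = 1.96, the selection cutoff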

But as the reviewer notes, most of the tests probably are not z-tests. The distributions of their p-values, which depend on the non-central distributions of their test statistics, are different from one another, and also different from the distribution for genuine z-tests. Our paper describes it as an approximation, but why should it be a good approximation? I honestly don’t know, and I have given it a lot of thought. I certainly would not have come up with this idea myself, and when Uli proposed it, I did not think it would work. We both came up with a lot of estimation methods that did not work when we tested them out. But when we tested this one, it was successful. Call it a brilliant leap of intuition on Uli’s part. That’s how I think of it.

Uli’s comment.
It helps to know your history. Well before psychologists focused on effect sizes for meta-analysis, Fisher already had a method to meta-analyze p-values. P-curve is just a meta-analysis of p-values with a selection model. However, p-values have ugly distributions, and Stouffer proposed transforming p-values into z-scores to conduct meta-analyses. This method was used by Rosenthal to compute the fail-safe N, one of the earliest methods to evaluate the credibility of published results (Fail-Safe-N). Ironically, even the p-curve app started using this transformation (p-curve changes). Thus, p-curve is really a version of z-curve. The problem with p-curve is that it has only one parameter and cannot model heterogeneity in true power. This is the key advantage of z-curve 1.0 over p-curve (Brunner & Schimmack, 2020). P-curve is biased even when all studies have the same population effect size but different sample sizes, which leads to heterogeneity in power (Brunner, 2018).

Such things are fairly common in statistics. An idea is proposed, and it seems to work. There’s a “proof,” or at least an argument for the method, but the proof does not hold up. Later on, somebody figures out how to fill in the missing technical details. A good example is Cox’s proportional hazards regression model in survival analysis. It worked great in a large number of simulation studies, and was widely used in practice. Cox’s mathematical justification was weak. The justification starts out being intuitively reasonable but not quite rigorous, and then deteriorates. I have taught this material, and it’s not a pleasant experience. People used the method anyway. Then decades after it was proposed by Cox, somebody else (Aalen and others) proved everything using a very different and advanced set of mathematical tools. The clean justification was too advanced for my students.

Another example (from mathematics) is Fermat’s last theorem, which took over 300 years to prove. I’m not saying that z-curve is in the same league as Fermat’s last theorem, just that statistical methods can be successful and essentially correct before anyone has been able to provide a rigorous justification.

Still, this is one place where the reviewer is not completely mixed up.

Another Uli comment
Undergraduate students are often taught different test statistics and distributions as if they were totally different things. However, most tests in psychology are practically z-tests. Just look at a t-distribution with N = 40 (df = 38) and try to see the difference from a standard normal distribution. The difference is tiny, and it becomes invisible as sample sizes increase beyond 40! And F-tests? F-values with 1 numerator degree of freedom are just squared t-values, so their square root is practically a z statistic. But what about chi-square? Well, with 1 df, chi-square is just a squared z-score, so we can take the square root and have a z-score. But what if we don’t have two groups, but compute correlations or regressions? Well, the significance test uses the t-distribution, and sample sizes are often well above 40. So, t and z are practically identical. It is therefore not surprising to me that results from many different test statistics can be approximated with the standard normal distribution (a quick R check follows below). We could make teaching statistics so much easier, instead of confusing students with F-distributions. The only exceptions are complex designs with 3 x 4 x 5 ANOVAs, but they don’t really test anything and are just used to p-hack. Rant over. Back to Jerry.
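A few lines of R make Uli’s point (the particular test-statistic values are arbitrary):

c(t_crit_df38 = qt(.975, df = 38), z_crit = qnorm(.975))     # 2.02 vs. 1.96
# An F(1, 38) value and the corresponding |t(38)| = sqrt(F) give identical p-values:
c(pf(4.5, 1, 38, lower.tail = FALSE), 2 * pt(sqrt(4.5), 38, lower.tail = FALSE))
# A chi-square(1) value and the corresponding |z| = sqrt(chi-square) also agree:
c(pchisq(5, 1, lower.tail = FALSE), 2 * pnorm(sqrt(5), lower.tail = FALSE))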

(9) “It is unclear how Theorem 2 is related to the z-curve procedure.”

Theorem 2 is about how selection for significance affects the probability distribution of true power values. Z-curve estimates are based only on studies that have achieved significant results; the others are hidden, by a process that can be called publication bias. There is a fundamental distinction between the original population of power values and the sub-population belonging to studies that produce significant results. The theorems in the appendix are intended to clarify that distinction. The reviewer believes that once significance has been observed, the studies in question no longer even have true power values. So, clarification would seem to be necessary.

(10) “In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant?”

If there were no selection for significance, this is what a reasonable person would do. But the point of the paper, and what makes the estimation problem challenging, is that all we can observe are statistics from studies with p < 0.05. Publication bias is real, and z-curve is designed to allow for it.

(11) “To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error …”

Wrong. Power is a feature of the design of a study, the significance test, and the subject population. All of these features still exist after data have been collected and the test is carried out.

Uli and Jerry comment:
Whenever a psychologist uses the word “ontological,” be very skeptical. Most psychologists who use the word understand philosophy as well as this reviewer understands statistics.

(12) “The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice.”

Observed power (a bad estimate of true power) is not the probability of significance upon exact replication. True power is the probability of significance upon exact replication. It’s based on true effect size, not estimated effect size. We were talking about true power, and we mistakenly thought that was obvious.

(13) “The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.”

This criticism would have been somewhat justified if the simulations had used p-values from a bunch of z-tests. However, they did not. The simulations reported in the paper are all F-tests with one numerator degree of freedom, and denominator degrees of freedom depending on the sample size. This covers all the tests of individual regression coefficients in multiple regression, as well as comparisons of two means using two-sample (and even matched) t-tests. Brunner and Schimmack say (p. 8):

Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.

So I was going to refer the reader (and the anonymous reviewer, who is probably not reading this post anyway) to the supplementary materials. Fortunately I checked first, and found that the supplementary materials include a bunch of OSF stuff like the letter submitting the article for publication, and the reviewers’ comments and so on — but not the full set of simulations. Oops.

All the code and the full set of simulation results are posted at

https://www.utstat.utoronto.ca/brunner/zcurve2018

You can download all the material in a single file at

https://www.utstat.utoronto.ca/brunner/zcurve2018.zip

After expanding, just open index.html in a browser.

Actually we did a lot more simulation studies than this, but you have to draw the line somewhere. The point is that z-curve performs well for large numbers of studies with chi-squared test statistics as well as F statistics — all with varying degrees of freedom.

(14) “The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.”

Now the reviewer is talking about the paper that was actually under review. The mistake is natural, because of our (my) error in not making sure that the full set of simulations was included in the supplementary materials. The conditions in question are not unexplored; they are thoroughly explored, and the accuracy of z-curve for large samples is confirmed.

(15+) There are some more comments by the reviewer, but these are strictly about the paper under review, and not about Brunner and Schimmack (2020). So, I will leave any further response to others.

Once a p-hacker, always a p-hacker?

The 2010s have seen a replication crisis in social psychology (Schimmack, 2020). The main reason why it is difficult to replicate results from social psychology is that researchers used questionable research practices (QRPs, John et al., 2012) to produce more significant results than their low-powered designs warranted. A catchy term for these practices is p-hacking (Simonsohn, 2014).

New statistical techniques made it possible to examine whether published results were obtained with QRPs. In 2012, I used the incredibility index to show that Bem (2011) used QRPs to provide evidence for extrasensory perception (Schimmack, 2012). In the same article, I also suggested that Gailliot, Baumeister, DeWall, Maner, Plant, Tice, and Schmeichel (2007) used QRPs to present evidence suggesting that will-power relies on blood glucose levels. During the review process of my manuscript, Baumeister confirmed that QRPs were used (cf. Schimmack, 2014). Baumeister defended the use of these practices with the statement that they were the norm in social psychology and were not considered unethical.

The revelation that research practices were questionable casts a shadow on the history of social psychology. However, many also saw it as an opportunity to change and improve these practices (Świątkowski and Dompnier, 2017). Over the past decades, the evaluation of QRPs has changed. Many researchers now recognize that these practices inflate error rates, make published results difficult to replicate, and undermine the credibility of psychological science (Lindsay, 2019).

However, there are no general norms regarding these practices, and some researchers continue to use them (e.g., Adam D. Galinsky, cf. Schimmack, 2019). This makes it difficult for readers of the social psychological literature to identify research that can be trusted, and the question has to be examined on a case-by-case basis. In this blog post, I examine the responses of Baumeister, Vohs, DeWall, and Schmeichel to the replication crisis and to concerns that their results provide false evidence about the causes of will-power (Friese, Loschelder, Gieseler, Frankenbach, & Inzlicht, 2019; Inzlicht, 2016).

To examine this question scientifically, I use test-statistics that are automatically extracted from psychology journals. I divide the test-statistics into those that were obtained until 2012, when awareness about QRPs emerged, and those published after 2012. The test-statistics are examined using z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Results provide information about the expected replication rate and discovery rate. The use of QRPs is examined by comparing the observed discovery rate (how many published results are significant) to the expected discovery rate (how many tests that were conducted produced significant results).
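For readers who want to run this kind of analysis themselves, here is a hedged sketch using the CRAN zcurve package. I am assuming, as an illustration, that its main function zcurve() accepts a vector of (absolute) z statistics; the z-values below are placeholders, not the actual extracted data.

library(zcurve)
set.seed(42)
z_pre  <- abs(rnorm(400, mean = 2.2, sd = 1))   # placeholder: |z| values from articles up to 2012
z_post <- abs(rnorm(400, mean = 2.4, sd = 1))   # placeholder: |z| values from articles after 2012
summary(zcurve(z_pre))    # reports the expected replication rate (ERR) and discovery rate (EDR)
summary(zcurve(z_post))   # compare the two periods
# The observed discovery rate is simply the share of reported tests that are significant,
# e.g., mean(z_pre > qnorm(.975)).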

Roy F. Baumeister’s replication rate was 60% (53% to 67%) before 2012 and 65% (57% to 74%) after 2012. The overlap of the 95% confidence intervals indicates that this small increase is not statistically reliable. Before 2012, the observed discovery rate was 70%, and it dropped to 68% after 2012. Thus, there is no indication that non-significant results are reported more often after 2012. The expected discovery rate was 32% before 2012 and 25% after 2012. Thus, there is also no change in the expected discovery rate, and the expected discovery rate is much lower than the observed discovery rate. This discrepancy shows that QRPs were used before 2012 and after 2012. The 95% confidence intervals do not overlap before or after 2012, indicating that this discrepancy is statistically significant. Figure 1 shows the influence of QRPs when the observed non-significant results (histogram of z-scores below 1.96 in blue) are compared to the model prediction (grey curve). The discrepancy suggests a large file drawer of unreported statistical tests.

An old saying is that you can’t teach an old dog new tricks. So, the more interesting question is whether the younger contributors to the glucose paper changed their research practices.

The results for C. Nathan DeWall show no notable response to the replication crisis (Figure 2). The expected replication rate increased slightly from 61% to 65%, but the difference is not significant and visual inspection of the plots suggests that it is mostly due to a decrease in reporting p-values just below .05. One reason for this might be a new goal to p-hack at least to the level of .025 to avoid detection of p-hacking by p-curve analysis. The observed discovery rate is practically unchanged from 68% to 69%. The expected discovery rate increased only slightly from 28% to 35%, but the difference is not significant. More important, the expected discovery rates are significantly lower than the observed discovery rates before and after 2012. Thus, there is evidence that DeWall used questionable research practices before and after 2012, and there is no evidence that he changed his research practices.

The results for Brandon J. Schmeichel are even more discouraging (Figure 3). Here the expected replication rate decreased from 70% to 56%, although this decrease is not statistically significant. The observed discovery rate decreased significantly from 74% to 63%, which shows that more non-significant results are reported. Visual inspection shows that this is particularly the case for test-statistics close to zero. Further inspection of the articles would be needed to see how these results are interpreted. More important, the expected discovery rates are significantly lower than the observed discovery rates both before 2012 and after 2012. Thus, there is evidence that QRPs were used before and after 2012 to produce significant results. Overall, there is no evidence that research practices changed in response to the replication crisis.

The results for Kathleen D. Vohs also show no response to the replication crisis (Figure 4). The expected replication rate dropped slightly from 62% to 58%; the difference is not significant. The observed discovery rate dropped slightly from 69% to 66%, and the expected discovery rate decreased from 43% to 31%, although this difference is also not significant. Most important, the observed discovery rates are significantly higher than the expected discovery rates before 2012 and after 2012. Thus, there is clear evidence that questionable research practices were used before and after 2012 to inflate the discovery rate.

Conclusion

After concerns about research practices and replicability emerged in the 2010s, social psychologists have debated this issue. Some social psychologists changed their research practices to increase statistical power and replicability. However, other social psychologists have denied that there is a crisis and attributed replication failures to a number of other causes. Not surprisingly, some social psychologists also did not change their research practices. This blog post shows that Baumeister and his students have not changed their research practices. They are able to publish questionable research because there has been no collective effort to define good research practices, to ban questionable practices, and to treat the hiding of non-significant results as a breach of research ethics. Thus, Baumeister and his students are simply exerting their right to use questionable research practices, whereas others voluntarily implemented good, open science practices. Given the freedom of social psychologists to decide which practices they use, social psychology as a field continues to have a credibility problem. Editors who accept questionable research in their journals are undermining the credibility of their journals. Authors are well advised to publish in journals that emphasize replicability and credibility with open science badges and with a high replicability ranking (Schimmack, 2019).

An Honorable Response to the Credibility Crisis by D.S. Lindsay: Fare Well

We all know what psychologists did before 2012. The name of the game was to get significant results that could be sold to a journal for publication. Some did it with more power and some did it with less power, but everybody did it.

In the beginning of the 2010s, it became obvious that this was a flawed way to do science. Bem (2011) used this anything-goes approach to getting significance to publish 9 significant demonstrations of a phenomenon that does not exist: mental time-travel. The cat was out of the bag. There were only two questions: How many other findings were unreal, and how would psychologists respond to the credibility crisis?

D. Steve Lindsay responded to the crisis by helping to implement tighter standards and enforcing these standards as editor of Psychological Science. As a result, Psychological Science has published more credible results over the past five years. At the end of his editorial term, Lindsay published a gutsy and honest account of his journey towards a better and more open psychological science. It starts with his own realization that his research practices were suboptimal.

Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent. For example, a chapter by Lindsay and Kantner (2011) reported 16 experiments with an on-again/off-again effect of feedback on recognition memory. Cumming’s talk explained that p values are very noisy. Moreover, when between-subjects designs are used to study small- to medium-sized effects, statistical tests often yield nonsignificant outcomes (sometimes with huge p values) unless samples are very large.

Hard on the heels of Cumming’s talk, I read Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” article, published in Psychological Science. Then I gobbled up several articles and blog posts on misuses of null-hypothesis significance testing (NHST). The authors of these works make a convincing case that hypothesizing after the results are known (HARKing; Kerr, 1998) and other forms of “p hacking” (post hoc exclusions, transformations, addition of moderators, optional stopping, publication bias, etc.) are deeply problematic. Such practices are common in some areas of scientific psychology, as well as in some other life sciences. These practices sometimes give rise to mistaken beliefs in effects that really do not exist. Combined with publication bias, they often lead to exaggerated estimates of the sizes of real but small effects.

This quote is exceptional because few psychologists have openly talked about their research practices before (or after) 2012. It is an open secret that questionable research practices were widely used, and anonymous surveys support this (John et al., 2012), but nobody likes to talk about it. Lindsay’s frank account is an honorable exception in the spirit of true leaders who confront mistakes head on, just like a Nobel laureate who recently retracted a Science article (Frances Arnold).

1. Acknowledge your mistakes.

2. Learn from your mistakes.

3. Teach others from your mistakes.

4. Move beyond your mistakes.

Lindsay’s acknowledgement also makes it possible to examine what these research practices look like when we examine published results, and to see whether this pattern changes in response to awareness that certain practices were questionable.

So, I z-curved Lindsay’s published results from 1998 to 2012. The graph shows some evidence of QRPs, in that the model assumes more non-significant results (grey line from 0 to 1.96) than are actually observed (histogram of non-significant results). This is confirmed by a comparison of the observed discovery rate (70% of published results are significant) and the expected discovery rate (44%). However, the confidence intervals overlap. So this test of bias is not significant.

The replication rate is estimated to be 77%. This means that there is a 77% probability that repeating a test with a new sample (of equal size) would produce a significant result again. Even for just significant results (z = 2 to 2.5), the estimated replicability is still 45%. I have seen much worse results.

Nevertheless, it is interesting to see whether things improved. First of all, being editor of Psychological Science is a full-time job. Thus, output has decreased. Maybe research also slowed down because studies were conducted with more care. I don’t know. I just know that there are very few statistics to examine.

Although the small number of tests makes the results somewhat uncertain, the graph shows some changes in research practices. Replicability increased further to 88%, and there is no longer a discrepancy between the observed and expected discovery rates.

If psychology as a whole had responded like D.S. Lindsay, it would be in a good position to start the new decade. The problem is that this response is the exception rather than the rule, and some areas of psychology and some individual researchers have not changed at all since 2012. This is unfortunate because questionable research practices hurt psychology, especially as undergraduates and the wider public learn more and more about how untrustworthy psychological science has been and often still is. Hopefully, reforms will come sooner rather than later, or we may have to sing a swan song for psychological science.

Estimating the Replicability of Psychological Science

Over the past years, psychologists have become increasingly concerned about the credibility of published results. The credibility crisis started in 2011, when Bem published incredible results that seemed to suggest that humans can foresee random future events. Bem’s article revealed fundamental flaws in the way psychologists conduct research. The main problem is that psychology journals only publish statistically significant results (Sterling, 1959). If only significant results are published, all hypotheses will receive empirical support as long as they are tested. This is akin to saying that everybody has a 100% free throw average or nobody ever makes a mistake if we do not count failures.

The main problem of selection for significance is that we do not know the real strength of evidence that empirical studies provide. Maybe the selection effect is small and most studies would replicate. However, it is also possible that many studies might fail a replication test. Thus, the crisis of confidence is a crisis of uncertainty.

The Open Science Collaboration conducted actual replication studies to estimate the replicability of psychological science. They replicated 97 studies with statistically significant results and were able to reproduce 35 significant results (a 36% success rate). This is a shockingly low success rate. Based on this finding, most published results cannot be trusted, especially because there is heterogeneity across studies. Some studies would have an even lower chance of replication and several studies might even be outright false positives (there is actually no real effect).

As important as this project was for revealing major problems with the research culture in psychological science, there are also some limitations that cast doubt on the 36% estimate as a valid estimate of the replicability of psychological science. First, the sample size is small, and sampling error alone might have led to an underestimation of the replicability in the population of studies. However, sampling error could also have produced a positive bias. Another problem is that most of the studies focused on social psychology and that replicability in social psychology could be lower than in other fields. In fact, a moderator analysis suggested that the replication rate in cognitive psychology is 50%, while the replication rate in social psychology is only 25%. The replicated studies were also limited to a single year (2008) and three journals. It is possible that the replication rate has increased since 2008 or could be higher in other journals. Finally, there have been concerns about the quality of some of the replication studies. These limitations do not undermine the importance of the project, but they do imply that 36% is only an estimate and may underestimate the replicability of psychological science.

Over the past years, I have been working on an alternative approach to estimate the replicability of psychological science. This approach starts with the simple fact that replicability is tightly connected to the statistical power of a study because statistical power determines the long-run probability of producing significant results (Cohen, 1988). Thus, estimating statistical power provides valuable information about replicability. Cohen (1962) conducted a seminal study of statistical power in social psychology. He found that the average power to detect an average effect size was around 50%. This is the first estimate of the replicability of psychological science, although it was only based on one journal and limited to social psychology. However, subsequent studies replicated Cohen’s findings and found similar results over time and across journals (Sedlmeier & Gigerenzer, 1989). It is noteworthy that the 36% estimate from the OSC project is not statistically different from Cohen’s estimate of 50%. Thus, there is convergent evidence that replicability in social psychology is around 50%.

In collaboration with Jerry Brunner, I have developed a new method that can estimate mean power for a set of studies that are selected for significance and that vary in effect sizes and sample sizes, which produces heterogeneity in power (Brunner & Schimmack, 2018). The input for this method are the actual test statistics of significance tests (e.g., t-tests, F-tests). These test statistics are first converted into two-tailed p-values and then into absolute z-scores. The magnitude of these absolute z-scores provides information about the strength of evidence against the null-hypotheses. The histogram of these z-scores, called a z-curve, is then used to fit a finite mixture model to the data that estimates mean power, while taking selection for significance into account. Extensive simulation studies demonstrate that z-curve performs well and provides better estimates than alternative methods. Thus, z-curve is the method of choice for estimating the replicability of psychological science on the basis of the test statistics that are reported in original articles.
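A minimal sketch of this input step in R (the test statistics are illustrative, not real data):

t_vals <- c(2.3, 3.1, 2.0);  t_df  <- c(38, 25, 60)      # reported t-tests
F_vals <- c(5.6, 9.8);       F_df2 <- c(42, 80)           # F-tests with 1 numerator df
p <- c(2 * pt(abs(t_vals), t_df, lower.tail = FALSE),     # two-tailed p-values
       pf(F_vals, 1, F_df2, lower.tail = FALSE))
z <- qnorm(1 - p / 2)                                     # absolute z-scores fed into z-curve
round(cbind(p = p, z = z), 3)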

For this blog post, I am reporting preliminary results from a large project that extracts focal hypothesis tests from a broad range of journals that cover all areas of psychology for the years 2010 to 2017. The hand-coding of these articles complements a similar project that relies on automatic extraction of test statistics (Schimmack, 2018).

Table 1 shows the journals that have been coded so far. It also shows the estimates based on the automated method and for hand-coding of focal hypotheses.

Journal | Hand | Automated
Psychophysiology | 84 | 75
Journal of Abnormal Psychology | 76 | 68
Journal of Cross-Cultural Psychology | 73 | 77
Journal of Research in Personality | 68 | 75
J. Exp. Psych: Learning, Memory, & Cognition | 58 | 77
Journal of Experimental Social Psychology | 55 | 62
Infancy | 53 | 68
Behavioral Neuroscience | 53 | 68
Psychological Science | 52 | 66
JPSP-Interpersonal Relations & Group Processes | 33 | 63
JPSP-Attitudes and Social Cognition | 30 | 65
Mean | 58 | 69

Hand coding of focal hypotheses produces lower estimates than the automated method because the automated analysis also codes manipulation checks and other highly significant results that are not theoretically important. The correlation between the two methods, r = .67, shows consistency across them. Finally, the mean for the automated method, 69%, is close to the mean for over 100 journals, 72%, suggesting that the sample of journals is unbiased.

The hand coding results also confirm results found with the automated method that social psychology has a lower replicability than some other disciplines. Thus, the OSC reproducibility results that are largely based on social psychology should not be used to make claims about psychological science in general.

The figure below shows the output of the latest version of z-curve. The first finding is that the replicability estimate for all 1,671 focal tests is 56%, with a relatively tight confidence interval ranging from 45% to 56%. The next finding is that the discovery rate or success rate is 92%, using p < .05 as the criterion. This confirms that psychology journals continue to publish results that are selected for significance (Sterling, 1959). The histogram further shows that even more results would be significant if p-values below .10 were included as evidence for “marginal significance.”

Z-Curve.19.1 also provides an estimate of the size of the file drawer. It does so by projecting the distribution of observed significant results into the range of non-significant results (grey curve). The file drawer ratio shows that for every published result, we would expect roughly two unpublished studies with non-significant results. However, z-curve cannot distinguish between different questionable research practices. Rather than failing to disclose entire failed studies, researchers may have failed to disclose additional statistical analyses within a published study that did not produce significant results.

Z-Curve.19.1 also provides an estimate of the false discovery rate (FDR). The FDR is the percentage of significant results that may arise from testing a true nil-hypothesis, where the population effect size is zero. For a long time, the consensus has been that false positives are rare because the nil-hypothesis is rarely true (Cohen, 1994). Consistent with this view, Soric’s estimate of the maximum false discovery rate is only 10%, with a tight CI ranging from 8% to 16%.
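Both the file drawer ratio and Soric’s bound follow from the expected discovery rate with simple formulas. The sketch below assumes an EDR of roughly one third, an illustrative value in the ballpark implied by the text rather than the exact estimate behind the reported figures:

EDR <- 1/3; alpha <- .05
file_drawer_ratio <- (1 - EDR) / EDR                            # non-significant tests per significant result
soric_max_fdr     <- ((1 - EDR) / EDR) * (alpha / (1 - alpha))  # Soric's (1989) upper bound on the FDR
round(c(file_drawer_ratio = file_drawer_ratio, soric_max_fdr = soric_max_fdr), 2)
# roughly 2 unreported tests per published result and a maximum FDR near 10%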

However, the focus on the nil-hypothesis is misguided because it treats tiny deviations from zero as true hypotheses even if the effect size has no practical or theoretical significance. These effect sizes also lead to low power and replication failures. Therefore, Z-Curve 19.1 also provides an estimate of the FDR that treats studies with very low power as false positives. This broader definition of false positives raises the FDR estimate slightly, but 15% is still a low percentage. Thus, the modest replicability of results in psychological science is mostly due to low statistical power to detect true effects rather than a high number of false positive discoveries.

The reproducibility project showed that studies with low p-values were more likely to replicate. This relationship follows from the influence of statistical power on p-values and replication rates. To achieve a replication rate of 80%, p-values had to be less than .00005 or the z-score had to exceed 4 standard deviations. However, this estimate was based on a very small sample of studies. Z-Curve.19.1 also provides estimates of replicability for different levels of evidence. These values are shown below the x-axis. Consistent with the OSC results, a replication rate over 80% is only expected once z-scores are greater than 4.

The results also provide information about the choice of the alpha criterion for drawing inferences from significance tests in psychology. To do so, it is important to distinguish observed p-values and type-I error probabilities. For a single unbiased test, we can infer from an observed p-value less than .05 that the risk of a false positive result is less than 5%. However, when multiple comparisons are made or results are selected for significance, an observed p-value less than .05 does not imply that the type-I error risk is below .05. To claim a type-I error risk of 5% or less, we have to correct the observed p-values, just like a Bonferroni correction. As 50% power corresponds to statistical significance, we see that z-scores between 2 and 3 are not statistically significant; that is, the type-I error risk is greater than 5%. Thus, the standard criterion to claim significance with alpha = .05 corresponds to a p-value of .003. Given the popularity of .005, I suggest using p = .005 as a criterion for statistical significance. However, this claim is not based on lowering the criterion for statistical significance, because p < .005 still only allows us to claim that the type-I error probability is less than 5%. The need for a lower criterion value stems from the inflation of the type-I error rate due to selection for significance. This is a novel argument that has been overlooked in the significance wars, which ignored the influence of publication bias on false positive risks.

Finally, z-curve.19.1 makes it possible to examine the robustness of the estimates by using different selection criteria. One problem with selection models is that p-values just below .05, say in the .01 to .05 range, can arise from various questionable research practices that have different effects on replicability estimates. To address this problem, it is possible to estimate the density with a different selection criterion, while still estimating replicability with alpha = .05 as the criterion. Figure 2 shows the results of using only z-scores greater than 2.5 (p = .012) to fit the observed z-curve.

The blue dashed line at z = 2.5 shows the selection criterion. The grey curve between 1.96 and 2.5 is projected from the distribution of z-scores greater than 2.5. Results show a close fit with the observed distribution. As a result, the parameter estimates are also very similar. Thus, the results are robust, and the selection model seems to be reasonable.

Conclusion

Psychology is in a crisis of confidence about the credibility of published results. The fundamental problems are as old as psychology itself. Psychologists have conducted low powered studies and selected only studies that worked for decades (Cohen, 1962; Sterling, 1959). However, awareness of these problems has increased in recent years. Like many crises, the confidence crisis in psychology has created confusion. Psychologists are aware that there is a problem, but they do not know how large the problem is. Some psychologists believe that there is no crisis and pretend that most published results can be trusted. Others are worried that most published results are false positives. Meta-psychologists aim to reduce the confusion among psychologists by applying the scientific method to psychological science itself.

This blog post provided the most comprehensive assessment of the replicability of psychological science so far. The evidence is largely consistent with previous meta-psychological investigations. First, replicability is estimated to be slightly above 50%. However, replicability varies across disciplines, and the replicability of social psychology is below 50%. The fear that most published results are false positives is not supported by the data. Replicability increases with the strength of evidence against the null-hypothesis. If the p-value is below .00001, studies are likely to replicate. However, significant results with p-values above .005 should not be considered statistically significant with an alpha level of 5%, because selection for significance inflates the type-I error. Only studies with p < .005 can claim statistical significance with alpha = .05.

The correction for publication bias implies that researchers have to increase sample sizes to meet the more stringent p < .005 criterion. However, a better strategy is to preregister studies to ensure that reported results can be trusted. In this case, p-values below .05 are sufficient to demonstrate statistical significance with alpha = .05. Given the low prevalence of false positives in psychology, I see no need to lower the alpha criterion.

Future Directions

This blog post is just an interim report. The final project requires hand-coding of a broader range of journals. Readers who think that estimating the replicability of psychological science is beneficial, or who want information about a particular journal, are invited to collaborate on this project and can obtain authorship if their contribution is substantial enough to warrant it. Although coding is a substantial time commitment, it doesn’t require the participants or materials that are needed for actual replication studies. Please consider taking part in this project, and contact me if you are interested and want to know how you can get involved.

An Introduction to Z-Curve: A method for estimating mean power after selection for significance (replicability)

UPDATE 5/13/2019   Our manuscript on the z-curve method for estimation of mean power after selection for significance has been accepted for publication in Meta-Psychology. As estimation of actual power is an important tool for meta-psychologists, we are happy that z-curve found its home in Meta-Psychology.  We also enjoyed the open and constructive review process at Meta-Psychology.  Definitely will try Meta-Psychology again for future work (look out for z-curve.2.0 with many new features).

Z.Curve.1.0.Meta.Psychology.In.Press

Since 2015, Jerry Brunner and I have been working on a statistical tool that can estimate mean (statistical) power for a set of studies with heterogeneous sample sizes and effect sizes (heterogeneity in non-centrality parameters and true power). This method corrects for the inflation in mean observed power that is introduced by selection for statistical significance. Knowledge about mean power makes it possible to predict the success rate of exact replication studies. For example, if a set of studies with mean power of 60% were replicated exactly (including sample sizes), we would expect that 60% of the replication studies produce a significant result again.

Our latest manuscript is a revision of an earlier manuscript that received a revise-and-resubmit decision from the free, open-peer-review journal Meta-Psychology. We consider it the most authoritative introduction to z-curve and the version that should be used to learn about z-curve, to critique z-curve, or as a citation for studies that use z-curve.

Cite as “submitted for publication”.

Final.Revision.874-Manuscript in PDF-2236-1-4-20180425 mva final (002)

Feel free to ask questions, provide comments, and critique our manuscript in the comments section. We are proud to be an open science lab, and consider criticism an opportunity to improve z-curve and our understanding of power estimation.

R-CODE
Latest R-Code to run Z.Curve (Z.Curve.Public.18.10.28).
[updated 18/11/17]   [35 lines of code]
Call the function as: mean.power = zcurve(pvalues, Plot = FALSE, alpha = .05, bw = .05)[1]

Z-Curve related Talks
Presentation on Z-curve and application to BS Experimental Social Psychology and (Mostly) WS-Cognitive Psychology at U Waterloo (November 2, 2018)
[Powerpoint Slides]

Can the Bayesian Mixture Model Estimate the Percentage of False Positive Results in Psychology Journals?

A method revolution is underway in psychological science. In 2011, an article published in JPSP-ASC made it clear that experimental social psychologists were publishing misleading p-values because researchers violated basic principles of significance testing (Schimmack, 2012; Wagenmakers et al., 2011). Deceptive reporting practices led to the publication of mostly significant results, while many non-significant results were not reported. This selective publishing of results dramatically increases the risk of a false positive result beyond the nominal level of 5% that is typically claimed in publications that report significance tests (Sterling, 1959).

Although experimental social psychologists think that these practices are defensible, no statistician would agree with them. In fact, Sterling (1959) already pointed out that the success rate in psychology journals is too high and that claims about statistical significance are meaningless. Similar concerns were raised again within psychology (Rosenthal, 1979), but deceptive practices remain acceptable to this day (Kitayama, 2018). As a result, most published results in social psychology do not replicate and cannot be trusted (Open Science Collaboration, 2015).

For non-methodologists it can be confusing to make sense of the flood of method papers that have been published in the past years.  It is therefore helpful to provide a quick overview of methodological contributions concerned with detection and correction of biases.

First, some methods focus on effect sizes (pcurve2.0; puniform), whereas others focus on strength of evidence (Test of Excess Significance; Incredibility Index; R-Index; Pcurve2.1; Pcurve4.06; Zcurve).

Another important distinction is between methods that assume a fixed parameter and methods that allow for heterogeneity. If all studies have a common effect size or the same strength of evidence, it is relatively easy to demonstrate bias and to correct for it (Pcurve2.1; Puniform; TES). However, heterogeneity in effect sizes or sampling error creates challenges, and relatively few methods have been developed for this challenging yet realistic scenario. For example, Ioannidis and Trikalinos (2005) developed a method to reveal publication bias that assumes a fixed effect size across studies while allowing for variation in sampling error, but this method can be biased if there is heterogeneity in effect sizes. In contrast, I developed the Incredibility Index (also called Magic Index) to allow for heterogeneity in effect sizes and sampling error (Schimmack, 2012).

Following my work on bias detection in heterogeneous sets of studies, I started working with Jerry Brunner on methods that can estimate average power of a heterogeneous set of studies that are selected for significance.  I first published this method on my blog in June 2015, when I called it post-hoc power curves.   These days, the term Zcurve is used more often to refer to this method.  I illustrated the usefulness of Zcurve in various posts in the Psychological Methods Discussion Group.

In September 2015, I posted replicability rankings of social psychology departments using this method. The post generated a lot of discussion and questions about the method. Although the details were still unpublished, I described the main approach of the method. To deal with heterogeneity, the method uses a mixture model.

EJ.Mixture.png

In 2016, Jerry Brunner and I submitted a manuscript for publication that compared four methods for estimating average power of heterogeneous studies selected for significance (Puniform1.1; Pcurve2.1; Zcurve; and a maximum likelihood method). In this article, the mixture model, Zcurve, outperformed the other methods, including a maximum likelihood method developed by Jerry Brunner. The manuscript was rejected by Psychological Methods.

In 2017, Gronau, Duizer, Bakker, and Eric-Jan Wagenmakers published an article titled “A Bayesian Mixture Modeling of Significant p Values: A Meta-Analytic Method to Estimate the Degree of Contamination From H0”  in the Journal of Experimental Psychology: General.  The article did not mention z-curve, presumably because it was not published in a peer-reviewed journal.

Although a reference to our mixture model would have been nice, the Bayesian Mixture Model differs in several ways from Zcurve. This blog post examines the similarities and differences between the two mixture models, shows with simulations and social priming data that BMM fails to provide useful estimates, and explains why BMM fails. It also shows that Zcurve can provide useful information about the replicability of social priming studies, while the BMM estimates are uninformative.

Aims

The Bayesian Mixture Model (BMM) and Zcurve have different aims.  BMM aims to estimate the percentage of false positives (significant results with an effect size of zero). This percentage is also called the False Discovery Rate (FDR).

FDR = False Positives / (False Positives + True Positives)

Zcurve aims to estimate the average power of studies selected for significance. Importantly, Brunner and Schimmack use the term power to refer to the unconditional probability of obtaining a significant result and not the common meaning of power as being conditional on the null-hypothesis being false. As a result, Zcurve does not distinguish between false positives with a 5% probability of producing a significant result (when alpha = .05) and true positives with an average probability between 5% and 100% of producing a significant result.

Average unconditional power is simply the proportion of false positives times alpha plus the proportion of true positives times their average conditional power (Sterling et al., 1995).

Unconditional Power = False Positives * Alpha + True Positives * Mean(1 – Beta)
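As a minimal illustration (my own numbers, not from either paper), the formula can be evaluated in R for a hypothetical mixture of studies:

# Hypothetical mixture: 30% false positives, 70% true positives with 60% mean power.
alpha <- .05                      # success probability of a false positive
prop.false <- .30
prop.true <- 1 - prop.false
mean.power.true <- .60
prop.false * alpha + prop.true * mean.power.true   # unconditional power = .435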

Zcurve therefore avoids the thorny issue of defining false positives and trying to distinguish between false positives and true positives with very small effect sizes and low power.

Approach 

BMM and zcurve use p-values as input.  That is, they ignore the actual sampling distribution that was used to test statistical significance.  The only information that is used is the strength of evidence against the null-hypothesis; that is, how small the p-value actually is.

The problem with p-values is that they have a specified sampling distribution only when the null-hypothesis is true. When the null-hypothesis is true, p-values have a uniform sampling distribution.  However, this is not useful for a mixture model, because a mixture model assumes that the null-hypothesis is sometimes false and the sampling distribution for true positives is not defined.

Zcurve solves this problem by using the inverse normal distribution to convert all p-values into absolute z-scores (abs(z) = -qnorm(p/2)). Absolute z-scores are used because F-tests and two-sided t-tests do not have a sign, and a test score of 0 corresponds to a probability of 1. Thus, the results do not say anything about the direction of an effect, while the size of the p-value provides information about the strength of evidence.

BMM also transforms p-values. The only difference is that BMM uses the full normal distribution with positive and negative z-scores  (z = qnorm(p)). That is, a p-value of .5 corresponds to a z-score of zero; p-values greater than .5 would be positive, and p-values less than .5 are assigned negative z-scores.  However, because only significant p-values are selected, all z-scores are negative in the range from -1.65 (p = .05, one-tailed) to negative infinity (p = 0).
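The two transformations are easy to compare in R (a short sketch with made-up p-values):

# A few significant two-sided p-values.
p <- c(.049, .01, .001, .0001)
-qnorm(p / 2)   # Zcurve: absolute z-scores (1.97, 2.58, 3.29, 3.89)
qnorm(p)        # BMM: signed z-scores, all below qnorm(.05) = -1.645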

The non-centrality parameter (i.e., the true parameter that generates the sampling distribution) is simply the mean of the normal distribution. For the null-hypothesis and false positives, the mean is zero.

Zcurve and BMM differ in the modeling of studies with true positive results that are heterogeneous.  Zcurve uses several normal distributions with a standard deviation of 1 that reflects sampling error for z-tests.  Heterogeneity in power is modeled by varying means of normal distributions, where power increases with increasing means.

BMM uses a single normal distribution with varying standard deviation.  A wider distribution is needed to predict large observed z-scores.

The main difference between Zcurve and BMM is that Zcurve either does not have fixed means (Brunner & Schimmack, 2016) or has fixed means but does not interpret the weight assigned to a mean of zero as an estimate of false positives (Schimmack & Brunner, 2018). The reason is that the weights attached to individual components are not very reliable estimates of the weights in the data-generating model. Importantly, this is not relevant for the goal of Zcurve to estimate average power, because the weighted average of the components of the model is a good estimate of the average true power in the data-generating model, even if the weights do not match the weights of the data-generating model.

For example, Zcurve does not care whether 50% average power is produced by a mixture of 50% false positives and 50% true positives with 95% power or 50% of studies with 20% power and 50% studies with 80% power. If all of these studies were exactly replicated, they are expected to produce 50% significant results.

BMM uses the weights assigned to the standard normal with a mean of zero as an estimate of the percentage of false positive results.  It does not estimate the average power of true positives or average unconditional power.

Given my simulation studies with Zcurve, I was surprised that BMM claimed to solve a problem that seemed unsolvable: the weights of individual components cannot be reliably estimated because the same distribution of p-values can be produced by many mixture models with different weights. The next section examines how BMM tries to estimate the percentage of false positives from the distribution of p-values.

A Bayesian Approach

Another difference between BMM and Zcurve is that BMM uses prior distributions, whereas Zcurve does not.  Whereas Zcurve makes no assumptions about the percentage of false positives, BMM uses a uniform distribution with values from 0 to 1 (100%) as a prior.  That is, it is equally likely that the percentage of false positives is 0%, 100%, or any value in between.  A uniform prior is typically justified as being agnostic; that is, no subjective assumptions bias the final estimate.

For the mean of the true positives, the authors use a truncated normal prior, which they also describe as a folded standard normal.  They justify this prior as reasonable based on extensive simulation studies.

Most important, however, is the parameter for the standard deviation.  The prior for this parameter was a uniform distribution with values between 0 and 1.   The authors argue that larger values would produce too many p-values close to 1.

“implausible prediction that p values near 1 are more common under H1 than under H0” (p 1226). 

But why would this be implausible? If there are very few false positives and many true positives with low power, most p-values close to 1 would be the result of true positives (H1) rather than of false positives (H0).

Thus, one way BMM is able to estimate the false discovery rate is by setting the standard deviation in a way that there is a limit to the number of low z-scores that are predicted by true positives (H1).

Although understanding priors and how they influence results is crucial for meaningful use of Bayesian statistics, the choice of priors is not crucial for Bayesian estimation models with many observations because the influence of the priors diminishes as the number of observations increases.  Thus, the ability of BMM to estimate the percentage of false positives in large samples cannot be explained by the use of priors. It is therefore still not clear how BMM can distinguish between false positives and true positives with low power.

Simulation Studies

The authors report several simulation studies that suggest BMM estimates are close to the true values and robust across many scenarios.

“The online supplemental material presents a set of simulation studies that highlight that the model is able to accurately estimate the quantities of interest under a relatively broad range of circumstances” (p. 1226).

The first set of simulations uses a sample size of N = 500 (n = 250 per condition).  Heterogeneity in effect sizes is simulated with a truncated normal distribution with a standard deviation of .10 (truncated at 2*SD) and effect sizes of d = .45, .30, and .15.  The lowest values are .35, .20, and .05.  With N = 500, these values correspond to  97%, 61%, and 8% power respectively.

d = c(.35, .20, .05)                                   # lowest effect sizes in the three simulations
1 - pt(qt(.975, 500 - 2), 500 - 2, d * sqrt(500) / 2)  # power of the two-sided t-test with N = 500 (n = 250 per group)

The number of studies was k = 5,000 with half of the studies being false positives (H0) and half being true positives (H1).

Figure 1 shows the Zcurve plot for the simulation with high power (d = .45, power >  97%; median true power = 99.9%).

Sim1.png

The graph shows a bimodal distribution with clear evidence of truncation: the steep drop at z = 1.96 (p = .05, two-tailed) is inconsistent with an unselected distribution of z-scores. The sharp drop from z = 1.96 to 3 shows that many studies with non-significant results are missing. The estimate of unconditional power (called replicability = expected success rate in exact replication studies) is 53%. This estimate is consistent with the simulation of 50% studies with a probability of success of 5% and 50% of studies with a success probability of 99.9% (.5 * .05 + .5 * .999 ≈ .52).

The values below the x-axis show average power for specific z-scores. A z-score of 2 corresponds roughly to p = .05 and 50% power without selection for significance. Due to selection for significance, the average power is only 9%. Thus, the observed power of 50% is a much inflated estimate of replicability. A z-score of 3.5 is needed to achieve significance with p < .05 after correcting for selection, although the nominal p-value for z = 3.5 is less than .001. Thus, selection for significance renders nominal p-values meaningless.

The sharp change in power from Z = 3 to Z = 3.5 is due to the extreme bimodal distribution.  While most Z-scores below 3 are from the sampling distribution of H0 (false positives), most Z-scores of 3.5 or higher come from H1 (true positives with high power).

Figure 2 shows the results for the simulation with d = .30. The results are very similar because d = .30 still gives 92% power. As a result, replicability is nearly as high as in the previous example.

Sim2.png

 

The most interesting scenario is the simulation with low powered true positives. Figure 3 shows the Zcurve for this scenario with an unconditional average power of only 23%.

Sim3.png

It is no longer possible to recognize two sampling distributions and average power increases rather gradually from 18% for z = 2, to 35% for z = 3.5.  Even with this challenging scenario, BMM performed well and correctly estimated the percentage of false positives.   This is surprising because it is easy to generate a similar Zcurve without false positives.

Figure 4 shows a simulation with a mixture distribution in which the false positives (d = 0) have been replaced by true positives (d = .06), while the mean for the heterogeneous studies was reduced from d = .15 to d = .11. These values were chosen to produce the same average unconditional power (replicability) of 23%.

Sim4.png

I transformed the z-scores into (two-sided) p-values and submitted them to the online BMM app at https://qfgronau.shinyapps.io/bmmsp/. I used only k = 1,500 p-values because the server timed me out several times with k = 5,000 p-values. The estimated percentage of false positives was 24%, with a wide 95% credibility interval ranging from 0% to 48%. These results suggest that BMM has problems distinguishing between false positives and true positives with low power. BMM appears to estimate the percentage of false positives correctly when most low z-scores are sampled from H0 (false positives). However, when these low z-scores come from studies with low power, BMM cannot tell them apart from false positives. As a result, the credibility interval is wide and the point estimate is misleading.
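For readers who want to reproduce this step, converting absolute z-scores back into two-sided p-values takes a single line of R (a sketch with hypothetical z-scores):

z <- c(2.1, 2.8, 3.5)      # hypothetical absolute z-scores
2 * pnorm(-abs(z))         # two-sided p-values: .036, .0051, .0005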

BMM.output.png

With k = 1,500 the influence of the priors is negligible. However, with smaller sample sizes, the priors do have an influence on results and may lead to overestimation and misleading credibility intervals. A simulation with k = 200 produced a point estimate of 34% false positives with a very wide CI ranging from 0% to 63%. The authors suggest a sensitivity analysis that changes model parameters. The most crucial parameter is the standard deviation. Increasing the standard deviation to 2 increases the upper limit of the 95%CI to 75%. Thus, without a good justification for a specific standard deviation, the data provide very little information about the percentage of false positives underlying this Zcurve.

BMM.k200.png

 

For simulations with k = 100, the prior started to bias the results and the CI no longer included the true value of 0% false positives.

BMM.k100

In conclusion, these simulation results show that BMM promises more than it can deliver. It is very difficult to distinguish p-values sampled from H0 (mean z = 0) from those sampled from H1 with weak evidence (e.g., mean z = 0.1).

In the Challenges and Limitations section, the authors pretty much agree with this assessment of BMM (Gronau et al., 2017, p. 1230).

The procedure does come with three important caveats.

First, estimating the parameters of the mixture model is an inherently difficult statistical problem … and consequently a relatively large number of p values are required for the mixture model to provide informative results.

A second caveat is that, even when a reasonable number of p values are available, a change in the parameter priors might bring about a noticeably different result.

The final caveat is that our approach uses a simple parametric form to account for the distribution of p values that stem from H1. Such simplicity comes with the risk of model-misspecification.

Practical Implications

Despite the limitations of BMM, the authors applied BMM to several real data sets. The most interesting application selected focal hypothesis tests from social priming studies. Social priming has come under attack as a research area with sloppy research methods as well as fraud (Stapel). Bias tests show clear evidence that published results were obtained with questionable scientific practices (Schimmack, 2017a, 2017b).

The authors analyzed 159 social priming p-values.  The 95%CI for the percentage of false positives ranged from 48% to 88%.  When the standard deviation was increased to 2, the 95%CI increased slightly to 56% to 91%.  However, when the standard deviation was halved, the 95%CI ranged from only 10% to 75%.  These results confirm the authors’ warning that estimates in small sets of studies (k < 200) are highly sensitive to the specification of priors.

What inferences can be drawn from these results about the social priming literature? A false positive percentage of 10% doesn’t sound so bad. A false positive percentage of 88% sounds terrible. A priori, the percentage is somewhere between 0% and 100%. After looking at the data, uncertainty about the percentage of false positives in the social priming literature remains large. Proponents will focus on the 10% estimate and critics will use the 88% estimate. The data simply do not resolve inconsistent prior assumptions about the credibility of discoveries in social priming research.

In short, BMM promises that it can estimate the percentage of false positives in a set of studies, but in practice these estimates are too imprecise and too dependent on prior assumptions to be very useful.

A Zcurve of Social Priming Studies (k = 159)

It is instructive to compare the BMM results to a Zcurve analysis of the same data.

SocialPriming.png

The zcurve graph shows a steep drop and very few z-scores greater than 4, which tend to have a high success rate in actual replication attempts (OSC, 2015). The average estimated replicability is only 27%. This is consistent with the more limited analysis of social priming studies in Kahneman’s Thinking Fast and Slow book (Schimmack, 2017a).

More important than the point estimate is that the 95%CI ranges from 15% to a maximum of 39%.  Thus, even a sample size of 159 studies is sufficient to provide conclusive evidence that these published studies have a low probability of replicating even if it were possible to reproduce the exact conditions again.

These results show that it is not very useful to distinguish between false positives with a replicability of 5% and true positives with a replicability of 6%, 10%, or 15%. Good research provides evidence that can be replicated at least with a reasonable degree of statistical power. Tversky and Kahneman (1971) suggested a minimum of 50%; most social priming studies fail to meet this minimal standard, and hardly any studies seem to have been planned with the typical standard of 80% power.

The power estimates below the x-axis show that a nominal z-score of 4 or higher is required to achieve 50% average power and an actual false positive risk of 5%. Thus, after correcting for deceptive publication practices, most of the seemingly statistically significant results are actually not significant with the common criterion of a 5% risk of a false positive.

The difference between BMM and Zcurve is captured in the distinction between evidence of absence and absence of evidence.  BMM aims to provide evidence of absence (false positives). In contrast, Zcurve has the more modest goal of demonstrating absence (or presence) of evidence.  It is unknown whether any social priming studies could produce robust and replicable effects and under what conditions these effects occur or do not occur.  However, it is not possible to conclude from the poorly designed studies and the selectively reported results that social priming effects are zero.

Conclusion

Zcurve and BMM are both mixture models, but they have different statistical approaches and different aims. They also differ in their ability to provide useful estimates. Zcurve is designed to estimate average unconditional power to obtain significant results without distinguishing between true positives and false positives. False positives reduce average power, just like low-powered studies, and in reality it can be difficult or impossible to distinguish between a false positive with an effect size of zero and a true positive with an effect size that is negligibly different from zero.

The main problem of BMM is that it treats the nil-hypothesis as an important hypothesis that can be accepted or rejected. However, this is a logical fallacy. It is possible to reject implausible effect sizes (e.g., the nil-hypothesis is probably false if the 95%CI ranges from .8 to 1.2), but it is not possible to accept the nil-hypothesis because there are always values close to 0 that are also consistent with the data.

The problem of BMM is that it contrasts the point-nil-hypothesis with all other values, even if these values are very close to zero.  The same problem plagues the use of Bayes-Factors that compare the point-nil-hypothesis with all other values (Rouder et al., 2009).  A Bayes-Factor in favor of the point nil-hypothesis is often interpreted as if all the other effect sizes are inconsistent with the data.  However, this is a logical fallacy because data that are inconsistent with a specific H1 can be consistent with an alternative H1.  Thus, a BF in favor of H0 can only be interpreted as evidence against a specific H1, but never as evidence that the nil-hypothesis is true.

To conclude, I have argued that it is more important to estimate the replicability of published results than to estimate the percentage of false positives.  A literature with 100% true positives and average power of 10% is no more desirable than a literature with 50% false positives and 50% true positives with 20% power.  Ideally, researchers should conduct studies with 80% power and honest reporting of statistics and failed replications should control the false discovery rate.  The Zcurve for social priming studies shows that priming researchers did not follow these basic and old principles of good science.  As a result, decades of research are worthless and Kahneman was right to compare social priming research to a train wreck because the conductors ignored all warning signs.

 

 

 

Charles Stangor’s Failed Attempt to Predict the Future

Background

It is 2018, and 2012 is a faint memory. So much has happened in the world and in psychology over the past six years.

Two events rocked Experimental Social Psychology (ESP) in the year 2011 and everybody was talking about the implications of these events for the future of ESP.

First, Daryl Bem had published an incredible article that seemed to suggest humans, or at least extraverts, have the ability to anticipate random future events (e.g., where an erotic picture would be displayed).

Second, it was discovered that Diederik Stapel had fabricated data for several articles. Several years later, over 50 articles have been retracted.

Opinions were divided about the significance of these two events for experimental social psychology.  Some psychologists suggested that these events are symptomatic of a bigger crisis in social psychology.  Others considered these events as exceptions with little consequences for the future of experimental social psychology.

In February 2012, Charles Stangor tried to predict how these events would shape the future of experimental social psychology in an essay titled “Rethinking my Science.”

How will social and personality psychologists look back on 2011? With pride at having continued the hard work of unraveling the mysteries of human behavior, or with concern that the only thing that is unraveling is their discipline?

Stangor’s answer is clear.

“Although these two events are significant and certainly deserve our attention, they are flukes rather than game-changers.”

He describes Bem’s article as a “freak event” and Stapel’s behavior as a “fluke.”

“Some of us probably do fabricate data, but I imagine the numbers are relatively few.”

Stangor is confident that experimental social psychology is not really affected by these two events.

As shocking as they are, neither of these events create real problems for social psychologists

In a radical turn, Stangor then suggests that experimental social psychology will change, not in response to these events, but in response to three other articles.

But three other papers published over the past two years must completely change how we think about our field and how we must conduct our research within it. And each is particularly important for me, personally, because each has challenged a fundamental assumption that was part of my training as a social psychologist.

Student Samples

The first article is a criticism of experimental social psychology for relying too much on first-year college students as participants (Henrich, Heine, & Norenzayan, 2010). Looking back, there is no evidence that US American psychologists have become more global in their research interests. One reason is that social phenomena are sensitive to the cultural context, and for Americans it is more interesting to study how online dating is changing relationships than to study arranged marriages in more traditional cultures. There is nothing wrong with a focus on a particular culture. It is not even clear that research articles on prejudice against African Americans were supposed to generalize to the world (how would this research apply to African countries where the vast majority of citizens are black?).

The only change that occurred was not in response to Henrich et al.’s (2010) article, but in response to technological changes that made it easier to conduct research and pay participants online. Many social psychologists now use the online service Mturk to recruit participants.

Thus, I don’t think this article significantly changed experimental social psychology.

Decline Effect 

The second article with the title (“The Truth Wears Off“) was published in the weekly magazine the New Yorker.  It made the ridiculous claim that true effects become weaker or may even disappear over time.

The basic phenomenon is that observed findings in the social and biological sciences weaken with time. Effects that are easily replicable at first become less so every day. Drugs stop working over time the same way that social psychological phenomena become more and more elusive. The “the decline effect” or “the truth wears off effect,” is not easy to dismiss, although perhaps the strength of the decline effect will itself decline over time.

The assumption that the decline effect applies to real effects is no more credible than Bem’s claims of time-reversed causality. I am still waiting for the effect of eating cheesecake on my weight (a biological effect) to wear off. My bathroom scale tells me it has not.

Why would Stangor believe in such a ridiculous idea?  The answer is that he observed it many times in his own work.

Frankly I have difficulty getting my head around this idea (I’m guessing others do too) but it is nevertheless exceedingly troubling. I know that I need to replicate my effects, but am often unable to do it. And perhaps this is part of the reason. Given the difficulty of replication, will we continue to even bother? And what becomes of our research if we do even less replicating than we do now? This is indeed a problem that does not seem likely to go away soon. 

In hindsight, it is puzzling that Stangor misses the connection between Bem’s (2011) article and the decline effect. Bem published 9 successful results with p < .05. This is not a fluke: the probability of getting lucky 9 times in a row, with a probability of just 5% for a single event, is very small (less than 1 in a billion). Bem also did not fabricate data like Stapel, but he falsified data to present results that are too good to be true (Definitions of Research Misconduct). Not surprisingly, neither he nor others can replicate these results in transparent studies that prevent the use of QRPs (just like paranormal phenomena such as spoon bending cannot be replicated in transparent experiments that prevent fraud).

The decline effect is real, but it is wrong to misattribute it to a decline in the strength of a true phenomenon.  The decline effect occurs when researchers use questionable research practices (John et al., 2012) to fabricate statistically significant results.  Questionable research practices inflate “observed effect sizes” [a misnomer because effects cannot be observed]; that is, the observed mean differences between groups in an experiment.  Unfortunately, social psychologists do not distinguish between “observed effects sizes” and true or population effect sizes. As a result, they believe in a mysterious force that can reduce true effect sizes when sampling error moves mean differences in small samples around.

In conclusion, the truth does not wear off because there was no truth to begin with. Bem’s (2011) results did not show a real effect that wore off in replication studies. The effect was never there to begin with.

P-Hacking

The third article mentioned by Stangor did change experimental social psychology. In this article, Simmons, Nelson, and Simonsohn (2011) demonstrate the statistical tricks experimental social psychologists have used to produce statistically significant results. They call these tricks p-hacking. All methods of p-hacking have one common feature: researchers conduct multiple statistical analyses and check the results. When they find a statistically significant result, they stop analyzing the data and report the significant result. There is nothing wrong with this practice so far, but it essentially constitutes research misconduct when the result is reported without fully disclosing how many attempts were made to get it. The failure to disclose all attempts is deceptive because the reported result (p < .05) is only valid if a researcher collected data and then conducted a single test of a hypothesis (it does not matter whether this hypothesis was made before or after data collection). The point is that at the moment a researcher presses a mouse button or a key on a keyboard to see a p-value, a statistical test has occurred. If this p-value is not significant and another test is run to look at another p-value, two tests have been conducted and the risk of a type-I error is greater than 5%. It is no longer valid to claim p < .05 if more than one test was conducted. With extreme abuse of the statistical method (p-hacking), it is possible to get a significant result even with randomly generated data.
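To make the inflation of the type-I error risk concrete, here is a minimal simulation sketch (my own illustration, not from Simmons et al.): a researcher tests five dependent variables with no true effects and reports a success whenever at least one p-value falls below .05.

# p-hacking sketch: check 5 dependent variables, report the study if any test is significant.
set.seed(123)
n.sim <- 10000   # simulated studies
n <- 20          # participants per group
n.dv <- 5        # dependent variables inspected per study
success <- replicate(n.sim, {
  p <- replicate(n.dv, t.test(rnorm(n), rnorm(n))$p.value)  # all true effects are zero
  min(p) < .05                                              # significant on at least one DV?
})
mean(success)    # about .23 (1 - .95^5), far above the nominal .05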

In 2010, the Publication Manual of the American Psychological Association advised researchers that “omitting troublesome observations from reports to present a more convincing story is also prohibited” (APA). It is telling that Stangor does not mention this section as a game-changer, because it has been widely ignored by experimental psychologists to this day. Even Bem’s (2011) article, which was published in an APA journal, violated this rule, but it has not been retracted or corrected so far.

The p-hacking article had a strong effect on many social psychologists, including Stangor.

Its fundamental assertions are deep and long-lasting, and they have substantially affected me. 

Apparently, social psychologists were not aware that some of their research practices undermined the credibility of their published results.

Although there are many ways that I take the comments to heart, perhaps most important to me is the realization that some of the basic techniques that I have long used to collect and analyze data – techniques that were taught to me by my mentors and which I have shared with my students – are simply wrong.

I don’t know about you, but I’ve frequently “looked early” at my data, and I think my students do too. And I certainly bury studies that don’t work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially “rewriting the research hypothesis.” Over the years my students have asked me about these practices (“What do you recommend, Herr Professor?”) and I have
routinely, but potentially wrongly, reassured them that in the end, truth will win out. 

Although it is widely recognized that many social psychologists p-hacked and buried studies that did not work out, Stangor’s essay remains one of the few open admissions that these practices were used; they were not considered unethical, at least until 2010. In fact, social psychologists were trained that telling a good story was essential for social psychologists (Bem, 2001).

In short, this important paper will – must – completely change the field. It has shined a light on the elephant in the room, which is that we are publishing too many Type-1 errors, and we all know it.

Whew! What a year 2011 was – let’s hope that we come back with some good answers to these troubling issues in 2012.

In hindsight, Stangor was right about the p-hacking article. It has been cited over 1,000 times so far, and the term p-hacking is widely used for methods that essentially constitute a violation of research ethics. P-values are only meaningful if all analyses are reported, and the failure to disclose analyses that produced inconvenient non-significant results in order to tell a more convincing story constitutes research misconduct according to the guidelines of the APA and the HHS.

Charles Stangor’s Z-Curve

Stangor’s essay is valuable in many ways. One important contribution is the open admission of the use of QRPs before the p-hacking article made Stangor realize that doing so was wrong. I have been working on statistical methods to reveal the use of QRPs. It is therefore interesting to see the results of this method when it is applied to data from a researcher who used QRPs.

stangor.png

This figure (see detailed explanation here) shows the strength of evidence (based on test statistics like t- and F-values converted into z-scores) in Stangor’s articles. The histogram shows a mode at 2, which is just significant (z = 1.96 ~ p = .05, two-tailed). The steep drop on the left shows that Stangor rarely reported marginally significant results (p = .05 to .10). It also shows the use of questionable research practices, because sampling error should produce a larger number of non-significant results than are actually observed. The grey line provides a rough estimate of the expected proportion of non-significant results. The so-called file-drawer (non-significant results that are not reported) is very large. It is unlikely that so many studies were attempted and not reported. As Stangor mentions, he also used p-hacking to get significant results. P-hacking can produce just-significant results without conducting many studies.

In short, the graph is consistent with Stangor’s account that he used QRPs in his research, which was common practice and even encouraged, and did not violate any research ethics code of the times (Bem, 2001).

The graph also shows that the significant studies have an estimated average power of 71%.  This means any randomly drawn statistically significant result from Stangor’s articles has a 71% chance of producing a significant result again, if the study and the statistical test were replicated exactly (see Brunner & Schimmack, 2018, for details about the method).  This average is not much below the 80% value that is considered good power.

There are two caveats with the 71% estimate. One caveat is that this graph uses all statistical tests that are reported, but not all of these tests are interesting. Other datasets suggest that the average for focal hypothesis tests is about 20-30 percentage points lower than the estimate for all tests. Nevertheless, an average of 71% is above average for social psychology.

The second caveat is that there is heterogeneity in power across studies. Studies with high power are more likely to produce really small p-values and larger z-scores. This is reflected in the estimates below the x-axis for different segments of studies.  The average for studies with just significant results (z = 2 to 2.5) is only 49%.  It is possible to use the information from this graph to reexamine Stangor’s articles and to adjust nominal p-values.  According to this graph p-values in the range between .05 and .01 would not be significant because 50% power corresponds to a p-value of .05. Thus, all of the studies with a z-score of 2.5 or less (~ p > .01) would not be significant after correcting for the use of questionable research practices.

The main conclusion that can be drawn from this analysis is that the statistical analysis of Stangor’s reported results shows convergent validity with the description of his research practices.  If test statistics by other researchers show a similar (or worse) distribution, it is likely that they also used questionable research practices.

Charles Stangor’s Response to the Replication Crisis 

Stangor was no longer an active researcher when the replication crisis started. Thus, it is impossible to see changes in actual research practices.  However, Stangor co-edited a special issue for the Journal of Experimental Social Psychology on the replication crisis.

The Introduction mentions the p-hacking article.

At the same time, the empirical approaches adopted by social psychologists leave room for practices that distort or obscure the truth (Hales, 2016-in this issue; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)

and that

social psychologists need to do some serious housekeeping in order to progress
as a scientific enterprise.

It quotes Dovidio to claim that social psychologists are

lucky to have the problem. Because social psychologists are rapidly developing new approaches and techniques, our publications will unavoidably contain conclusions that are uncertain, because the potential limitations of these procedures are not yet known. The trick then is to try to balance “new” with “careful.

It also mentions the problem of fabricating stories by hiding unruly non-significant results.

The availability of cheap data has a downside, however, which is that there is little cost in omitting data that contradict our hypotheses from our manuscripts (John et al., 2012). We may bury unruly data because it is so cheap and plentiful. Social psychologists justify this behavior, in part, because we think conceptually. When a manipulation fails, researchers may simply argue that the conceptual variable was not created by that particular manipulation and continue to seek out others that will work. But when a study is eventually successful, we don’t know if it is really better than the others or if it is instead a Type I error. Manipulation checks may help in this regard, but they are not definitive (Sigall & Mills, 1998).

It also mentioned file-drawers with unsuccessful studies like the one shown in the Figure above.

Unpublished studies likely outnumber published studies by an order of magnitude. This is wasteful use of research participants and demoralizing for social psychologists and their students.

It also notes that governing bodies have failed to crack down on the use of p-hacking and other questionable practices, although the APA guidelines are not mentioned.

There is currently little or no cost to publishing questionable findings

It foreshadows calls for a more stringent criterion of statistical significance, known as the p-value wars (alpha = .05 vs. alpha = .005 vs. justify your alpha vs. abandon alpha).

Researchers base statistical analyses on the standard normal distribution but the actual tails are probably bigger than this approach predicts. It is clear that p < .05 is not enough to establish the credibility of an effect. For example, in the Reproducibility Project (Open Science Collaboration, 2015), only 18% of studies with a p-value greater than .04 replicated whereas 63% of those with a p-value less than .001 replicated. Perhaps we should require, at minimum, p < .01

It is not clear why we should settle for p < .01 if only 63% of results replicated with p < .001. Moreover, it ignores that a more stringent criterion for significance also increases the risk of type-II errors (Cohen). It also ignores that only two studies are required to reduce the risk of a type-I error from .05 to .05*.05 = .0025. As many articles in experimental social psychology are based on multiple cheap studies, the nominal type-I error rate is well below .001. The real problem is that the reported results are not credible because QRPs are used (Schimmack, 2012). A simple and effective way to improve experimental social psychology would be to enforce the APA ethics guidelines and hold violators of these rules accountable for their actions. However, although no new rules would need to be created, experimental social psychologists are unable to police themselves and continue to use QRPs.

The Introduction ignores this valid criticism of multiple-study articles and continues to give the misleading impression that more studies translate into more replicable results. However, the Open Science Collaboration reproducibility project showed no evidence that long, multiple-study articles reported more replicable results than shorter articles in Psychological Science.

In addition, replication concerns have mounted with the editorial practice of publishing short papers involving a single, underpowered study demonstrating counterintuitive results (e.g., Journal of Experimental Social Psychology; Psychological Science; Social Psychological and Personality Science). Publishing newsworthy results quickly has benefits,
but also potential costs (Ledgerwood & Sherman, 2012), including increasing Type 1 error rates (Stroebe, 2016-in this issue). 

Once more, the problem is dishonest reporting of results. A risky study can be published, and a true type-I error rate of 20% informs readers that there is a high risk of a false positive result. In contrast, 9 studies with a misleading type-I error rate of 5% violate the implicit assumption that readers can trust a scientific research article to report the results of an objective test of a scientific question.

But things get worse.

We do, of course, understand the value of replication, and publications in the premier social-personality psychology journals often feature multiple replications of the primary findings. This is appropriate, because as the number of successful replications increases, our confidence in the finding also increases dramatically. However, given the possibility
of p-hacking (Head, Holman, Lanfear, Kahn, & Jennions, 2015; Simmons et al., 2011) and the selective reporting of data, replication is a helpful but imperfect gauge of whether an effect is real. 

Just like Stangor dismissed Bem’s multiple-study article in JPSP as a fluke that does not require further attention, he dismisses evidence that QRPs were used to p-hack other multiple-study articles (Schimmack, 2012). Ignoring this evidence is just another violation of research ethics. The data that are being omitted here are articles that contradict the story that an author wants to present.

And it gets worse.

Conceptual replications have been the field’s bread and butter, and some authors of the special issue argue for the superiority of conceptual over exact replications (e.g. Crandall & Sherman, 2016-in this issue; Fabrigar and Wegener, 2016–in this issue; Stroebe, 2016-in this issue).  The benefits of conceptual replications are many within social psychology, particularly because they assess the robustness of effects across variation in methods, populations, and contexts. Constructive replications are particularly convincing because they directly replicate an effect from a prior study as exactly as possible in some conditions but also add other new conditions to test for generality or limiting conditions (Hüffmeier, 2016-in this issue).

Conceptual replication is a euphemism for storytelling or, as Sternberg calls it, creative HARKing (Sternberg, in press). Stangor explained earlier how an article with several conceptual replication studies is constructed.

I certainly bury studies that don’t work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially “rewriting the research hypothesis.”

This is how Bem advised generations of social psychologists to write articles and that is how he wrote his 2011 article that triggered awareness of the replicability crisis in social psychology.

There is nothing wrong with doing multiple studies and examining conditions that make an effect stronger or weaker. However, it is pseudo-science if such a program of research reports only successful results, because reporting only successes renders statistical significance meaningless (Sterling, 1959).

The miraculous conceptual replications of Bem (2011) are even more puzzling in the context of social psychologists’ conviction that their effects can decrease over time (Stangor, 2012) or change dramatically from one situation to the next.

Small changes in social context make big differences in experimental settings, and the same experimental manipulations create different psychological states in different times, places, and research labs (Fabrigar andWegener, 2016–in this issue). Reviewers and editors would do well to keep this in mind when evaluating replications. 

How can effects be sensitive to context while the success rate in published articles is 95%?

And it gets worse.

Furthermore, we should remain cognizant of the fact that variability in scientists’ skills can produce variability in findings, particularly for studies with more complex protocols that require careful experimental control (Baumeister, 2016-in this issue). 

Baumeister is one of the few other social psychologists who has openly admitted not disclosing failed studies.  He also pointed out that in 2008 this practice did not violate APA standards.  However, in 2016 a major replication project failed to replicate the ego-depletion effect that he first “demonstrated” in 1998.  In response to this failure, Baumeister claimed that he had produced the effect many times, suggesting that he has some capabilities that researchers who fail to show the effect lack (in his contribution to the special issue in JESP he calls this ability “flair”).  However, he failed to mention that many of his attempts failed to show the effect and that his high success rate in dozens of articles can only be explained by the use of QRPs.

While there is ample evidence for the use of QRPs, there is no empirical evidence for the claim that research expertise matters.  Moreover, most of the research is carried out by undergraduate students supervised by graduate students and the expertise of professors is limited to designing studies and not to actually carrying out studies.

In the end, the Introduction also comments on the process of correcting mistakes in published articles.

Correctors serve an invaluable purpose, but they should avoid taking an adversarial tone. As Fiske (2016–this issue) insightfully notes, corrective articles should also
include their own relevant empirical results — themselves subject to
correction.

This makes no sense. If somebody writes an article and claims to find an interaction effect based on a significant result in one condition and a non-significant result in another condition, the article makes a statistical mistake (Gelman & Stern, 2005). If a pre-registration contains the statement that an interaction is predicted and a published article claims an interaction is not necessary, the article misrepresents the nature of the preregistration.  Correcting mistakes like this is necessary for science to be a science.  No additional data are needed to correct factual mistakes in original articles (see, e.g., Carlsson, Schimmack, Williams, & Bürkner, 2017).

Moreover, Fiske has been inconsistent in her assessment of psychologists who have been motivated by the events of 2011 to improve psychological science.  On the one hand, she has called these individuals “method terrorists” (2016 review).  On the other hand, she suggests that psychologists should welcome humiliation that may result from the public correction of a mistake in a published article.

Conclusion

In 2012, Stangor asked “How will social and personality psychologists look back on 2011?” Six years later, it is possible to provide at least a temporary answer. There is no unified response.

The main response by older experimental social psychologists has been denial, in line with Stangor’s initial response to Stapel and Bem. Despite massive replication failures and criticism, including criticism by Nobel Laureate Daniel Kahneman, no eminent social psychologist has responded to the replication crisis with an admission of mistakes. In contrast, the list of eminent social psychologists who stand by their original findings despite evidence for the use of QRPs and replication failures is long and growing every day as replication failures accumulate.

The response by some younger social psychologists has been to nudge social psychologists slowly towards improving their research methods, mainly by handing out badges for preregistrations of new studies.  Although preregistration makes it more difficult to use questionable research practices, it is too early to see how effective preregistration is in making published results more credible.  Another initiative is to conduct replication studies. The problem with this approach is that the outcome of replication studies can be challenged and so far these studies have not resulted in a consensual correction in the scientific literature. Even articles that reported studies that failed to replicate continue to be cited at a high rate.

Finally, some extremists are asking for more radical changes in the way social psychologists conduct research, but these extremists are dismissed by most social psychologists.

It will be interesting to see how social psychologists, funding agencies, and the general public will look back on 2011 in 2021. In the meantime, social psychologists have to ask themselves how they want to be remembered, and new investigators have to examine carefully where they want to allocate their resources. The published literature in social psychology is a minefield, and nobody knows which studies can be trusted.

I don’t know about you, but I am looking forward to reading the special issues in 2021 in celebration of the 10-year anniversary of Bem’s groundbreaking, or should I say earth-shattering, publication of “Feeling the Future.”

Visual Inspection of Strength of Evidence: P-Curve vs. Z-Curve

Statistics courses often introduce students to a bewildering range of statistical tests. They rarely point out how test statistics are related. For example, although t-tests may be easier to understand than F-tests, every t-test could be performed as an F-test, and the F-value in the F-test is simply the square of the t-value (t^2 or t*t).
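This relationship is easy to verify in R (a sketch with simulated data): for a two-group comparison, the F-value from a one-way ANOVA equals the squared t-value from the equal-variance t-test.

# Demonstration that F = t^2 for a two-group comparison.
set.seed(1)
y <- c(rnorm(20, mean = 0), rnorm(20, mean = .5))
group <- factor(rep(c("a", "b"), each = 20))
t.value <- t.test(y ~ group, var.equal = TRUE)$statistic
f.value <- anova(lm(y ~ group))[1, "F value"]
c(t.squared = unname(t.value)^2, F = f.value)   # identical up to rounding error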

At an even more conceptual level, all test statistics are ratios of the effect size (ES) and the amount of sampling error (SE). The ratio is sometimes called the signal (ES) to noise (SE) ratio. The higher the signal-to-noise ratio (ES/SE), the stronger the observed results deviate from the hypothesis that the effect size is zero. This hypothesis is often called the null-hypothesis, but this terminology has created some confusion. It is also sometimes called the nil-hypothesis, the zero-effect hypothesis, or the no-effect hypothesis. Most important, the test statistic is expected to average zero if this hypothesis is true and the same experiment could be replicated a gazillion times.

The test statistics of different statistical tests cannot be directly compared. A t-value of 2 in a study with N = 10 participants provides weaker evidence against the null-hypothesis than a z-score of 1.96, and an F-value of 4 with df(1,40) provides weaker evidence than an F(10,200) = 4 result. It is only possible to compare test values directly that have the same sampling distribution (z with z, F(1,40) with F(1,40), etc.).
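Converting each statistic to a p-value makes the point concrete (a small sketch):

2 * pt(-2, df = 8)                                # t = 2, N = 10 (df = 8):  p ~ .08
2 * pnorm(-1.96)                                  # z = 1.96:                p = .05
pf(4, df1 = 1, df2 = 40, lower.tail = FALSE)      # F(1, 40) = 4:            p ~ .05
pf(4, df1 = 10, df2 = 200, lower.tail = FALSE)    # F(10, 200) = 4:          p < .001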

There are three solutions to this problem. One solution is to use effect sizes as the unit of analysis. This is useful if the aim is effect size estimation.  Effect size estimation has become the dominant approach in meta-analysis.  This blog post is not about effect size estimation.  I just mention it because many readers may be familiar with effect size meta-analysis, but not familiar with meta-analysis of test statistics that reflect the ratio of effect size and sampling error (Effect size meta-analysis: unit = ES; Test Statistic Meta-Analysis: unit ES/SE).

P-Curve

There are two approaches to standardize test statistics so that they have a common unit of measurement. The first approach goes back to Ronald Fisher, who is considered the founder of modern statistics for researchers. Following Fisher, it is common practice to convert test statistics into p-values (this blog post assumes that you are familiar with p-values). P-values have the same meaning independent of the test statistic that was used to compute them. That is, p = .05 based on a z-test, t-test, or an F-test provides equally strong evidence against the null-hypothesis (Bayesians disagree, but that is a different story). The use of p-values as a common metric to examine strength of evidence (evidential value) was largely forgotten, until Simonsohn, Simmons, and Nelson (SSN) used p-values to develop a statistical tool that takes publication bias and questionable research practices into account. This statistical approach is called p-curve. P-curve is a family of statistical methods. This post is about the p-curve plot.

A p-curve plot is essentially a histogram of p-values with two characteristics. First, it only shows significant p-values (p < .05, two-tailed). Second, it plots the p-values between 0 and .05 with 5 bars. The Figure shows a p-curve for Motyl et al.’s (2017) focal hypothesis tests in social psychology. I only selected t-tests and F-tests from studies with between-subject manipulations.
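Such a plot can be produced with a few lines of R (a sketch using simulated p-values, not Motyl et al.’s data):

# Sketch of a p-curve plot: histogram of significant two-sided p-values in 5 bins.
set.seed(42)
p <- pchisq(rchisq(500, df = 1, ncp = 3), df = 1, lower.tail = FALSE)  # simulated p-values
p.sig <- p[p < .05]                                                    # keep only significant results
hist(p.sig, breaks = seq(0, .05, by = .01), main = "p-curve (sketch)", xlab = "p-value")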

p.curve.motyl

The main purpose of a p-curve plot is to examine whether the distribution of p-values is uniform (all bars have the same height). It is evident that the distribution for Motyl et al.’s data is not uniform. Most of the p-values fall into the lowest range between 0 and .01. This pattern is called “right-skewed.” A right-skewed plot shows that the set of studies has evidential value. That is, some test statistics are based on non-zero effect sizes. The taller the bar on the left is, the greater the proportion of studies with an effect. Importantly, meta-analyses of p-values do not provide information about effect sizes because p-values reflect both effect size and sampling error.

The main inference that can be drawn from a visual inspection of a p-curve plot is how unlikely it is that all significant results are false positives; that is, results where the p-value is below .05 (statistically significant), but the deviation from zero was entirely due to sampling error while the true effect size is zero.

The next Figure also shows a plot of p-values.  The difference is that it shows the full range of p-values and that it differentiates more between p-values because p = .09 provides weaker evidence than p = .0009.

all.p.curve.motyl.png

The histogram shows that most p-values are below .001.  It also shows very few non-significant results.  However, this plot is not more informative than the actual p-curve plot.  The only conclusion that is readily visible is that the distribution is not uniform.

The main problem with plots of p-values is that p-values do not have interval-scale properties.  That is, the difference between p = .40 and p = .30 does not reflect the same difference in strength of evidence as the difference between p = .10 and p = .001, even though the numerical differences are similar.

Z-Curve  

Stouffer developed an alternative to Fisher’s p-value meta-analysis.  Every p-value can be transformed into a corresponding z-score.  It is important to distinguish between one-sided and two-sided p-values.  The transformation requires one-sided p-values, which can be obtained by simply dividing a two-sided p-value by 2.  A z-score of -1.96 corresponds to a one-sided p-value of 0.025 (left tail) and a z-score of 1.96 corresponds to a one-sided p-value of 0.025 (right tail).  In a two-sided test, the sign no longer matters and the two tail probabilities are added to yield 0.025 + 0.025 = 0.05.

In a standard meta-analysis, we would want to use one-sided p-values to maintain information about the sign.  However, if the set of studies examines different hypotheses (as in Motyl et al.’s analysis of social psychology in general), the sign is no longer important.  So, the transformed two-sided p-values produce absolute (only positive) z-scores.

The formula in R is Z = -qnorm(p/2)   [p = two.sided p-value]

For very strong evidence (very small p-values), this formula creates numerical problems that can be solved by using the log.p = TRUE option in R.

Z = -qnorm(log(p/2), log.p=TRUE)
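To illustrate the numerical issue (the test statistic and degrees of freedom below are arbitrary values chosen for this example), an extreme result can have a p-value that underflows to zero, while its log can still be computed and converted into a z-score:

t.val <- 200; df <- 500                                    # arbitrary extreme example
pt(t.val, df, lower.tail = FALSE) * 2                      # two-sided p-value underflows to 0
log.p <- pt(t.val, df, lower.tail = FALSE, log.p = TRUE)   # log of the one-sided p-value
-qnorm(log.p, log.p = TRUE)                                # absolute z-score, a large finite value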

p.to.z.transformation.png

The plot shows the relationship between z-scores and p-values.  While z-scores are relatively insensitive to variation in p-values from .05 to 1, p-values are relatively insensitive to variation in z-scores from 2 to 15.

only.sig.p.to.z.transformation

The next figure shows the relationship only for significant p-values.  Limiting the distribution of p-values does not change the fact that p-values and z-values have very different distributions and a non-linear relationship.

The advantage of using (absolute) z-scores is that z-scores have ratio-scale properties.  A z-score of zero has real meaning and corresponds to the absence of evidence for an effect; the observed effect size is 0.  A z-score of 2 is twice as strong as a z-score of 1.  For example, given the same sampling error, the effect size for a z-score of 2 is twice as large as the effect size for a z-score of 1 (e.g., d = .2, se = .2, z = d/se = 1; d = .4, se = .2, z = d/se = 2).
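A quick check of this point in R, using the numbers from the example above:

d  <- c(.2, .4)   # two effect sizes
se <- .2          # same sampling error
d / se            # z-scores of 1 and 2: doubling the effect size doubles z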

It is possible to create the typical p-curve plot with z-scores by selecting only z-scores above 1.96.  However, this graph is not informative because the null-hypothesis does not predict a uniform distribution of z-scores.  For z-values, the central tendency of the distribution is more informative.  When the null-hypothesis is true, p-values have a uniform distribution and we would expect an equal number of p-values between 0 and 0.025 and between 0.025 and 0.050.  A two-sided p-value of .025 corresponds to a one-sided p-value of 0.0125, and the corresponding z-value is 2.24:

p = .025
-qnorm(log(p/2),log.p=TRUE)
[1] 2.241403

Thus, the analog to a p-value plot is to examine how many significant z-scores fall into the region from 1.96 to 2.24 versus the region with z-values greater than 2.24.
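As a small sketch (with simulated z-values standing in for the real data), this comparison amounts to counting how many significant z-scores fall into the two regions:

z     <- abs(rnorm(300, mean = 2.5, sd = 1.2))   # hypothetical absolute z-values
z.sig <- z[z > 1.96]                             # keep only significant results
c(z.1.96.to.2.24 = sum(z.sig <= 2.24),           # half of all significant results if H0 were true
  z.above.2.24   = sum(z.sig > 2.24))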

z.curve.plot1.png

The histogram of z-values is called z-curve.  The plot shows that most z-values are in the range between 1 and 6, but the histogram stretches out to 20 because a few studies had very high z-values.  The red line shows z = 1.96. All values on the left are not significant with alpha = .05 and all values on the right are significant (p < .05).  The dotted blue line corresponds to p = .025 (two tailed).  Clearly there are more z-scores above 2.24 than between 1.96 and 2.24.  Thus, a z-curve plot provides the same information as a p-curve plot.  The distribution of z-scores suggests that some significant results reflect true effects.

However, a z-curve plot provides a lot of additional information.  The next plot removes the long tail of rare results with extreme evidence and limits the plot to z-scores in the range from 0 to 6.  A z-score of 6 implies a signal-to-noise ratio of 6:1 and corresponds to a two-sided p-value of about 0.000000002, or roughly 1 out of 500 million events.  Even particle physicists settle for z = 5 to decide that an effect was observed, because it is so unlikely for such a result to occur by chance.

> pnorm(-6)*2
[1] 1.973175e-09

Another addition to the plot is a line that identifies z-scores between 1.65 and 1.96.  These z-scores correspond to two-sided p-values between .05 and .10.  Such values are often published as weak but sufficient evidence to support the inference that a (predicted) effect was detected.  These z-scores also correspond to p-values below .05 in one-sided tests.

z.curve.plot2

A major advantage of z-scores over p-values is that p-values are conditional probabilities based on the assumption that the null-hypothesis is true, but for these data this hypothesis can be safely rejected.  So, the actual p-values are not that informative because they are conditional on a hypothesis that we know to be false.  It is like saying, I would be a giant if everybody else were 1 foot tall (like Gulliver in Lilliput), but everybody else is not 1 foot tall and I am not a giant.

Z-scores are not conditioned on any hypothesis.  They simply show the ratio of the observed effect size and sampling error.  Moreover, the distribution of z-scores tells us something about the ratio of the true effect sizes and sampling error, because sampling error is random and has an expected value of zero.  Therefore, the mode, median, or mean of a z-curve plot tells us something about the ratio of the true effect sizes and sampling error.  The more the center of the distribution is shifted to the right, the stronger is the evidence against the null-hypothesis.  In a p-curve plot, this is reflected only in the height of the bar with p-values below .01 (z > 2.58), whereas a z-curve plot shows the actual distribution of the strength of evidence and makes it possible to see where the center of the distribution is (without more rigorous statistical analyses of the data).

For example, in the plot above it is not difficult to see the mode (peak) of the distribution.  The most common z-values are between 2 and 2.2, which corresponds to p-values of .046 (pnorm(-2)*2) and .028 (pnorm(-2.2)*2).  This suggests that the modal study has a ratio of about 2:1 of effect size over sampling error.

The distribution of z-values does not look like a normal distribution.  One explanation is that studies vary in sampling error and population effect size.  Another explanation is that the set of studies is not a representative sample of all studies that were conducted.  It is possible to test this by fitting a simple model to the data that assumes representative sampling of studies (no selection bias or p-hacking) and that assumes all studies have the same ratio of population effect size over sampling error.  The median z-score provides an estimate of the center of the sampling distribution.  The median for these data is z = 2.56.  The next picture shows the predicted sampling distribution of this model, which is a folded normal distribution (approximately normal with a folded left tail).

 

z.curve.plot3
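A minimal sketch of this simple model in R, assuming a folded normal distribution with standard deviation 1 centered at the reported median of z = 2.56 (the selection-adjusted version appears in the code block further below):

center <- 2.56                 # median of the observed absolute z-scores
z      <- seq(0, 6, .01)
dens   <- dnorm(z, center, 1) + dnorm(z, -center, 1)   # folded normal density
plot(z, dens, type = "l", lwd = 2,
     xlab = "(absolute) z-values", ylab = "Density")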

A comparison of the observed and predicted distribution of z-values shows some discrepancies.  Most important, there are too few non-significant results.  This observation provides evidence that the results are not a representative sample of the conducted studies.  Either non-significant results were not reported or questionable research practices were used to produce significant results by increasing the type-I error rate without reporting this (e.g., testing multiple dependent variables, or repeatedly checking for significance during the course of a study).

It is important to see the difference between the philosophies of p-curve and z-curve.  P-curve assumes that non-significant results provide no credible evidence and discards them even when they are reported.  Z-curve first checks whether non-significant results are missing.  As a result, p-curve is not a suitable tool for assessing publication bias or other selection problems, whereas even a simple visual inspection of a z-curve plot provides information about publication bias and questionable research practices.

z.curve.plot4.png

The next graph shows a model that selects for significance.  It no longer attempts to match the distribution of non-significant results; the objective is only to match the distribution of significant z-values.  You can do this by hand by simply trying out different values for the center of the normal distribution.  The lower the center, the more z-scores are missing because they are not significant.  As a result, the density of the predicted curve needs to be rescaled to reflect the fact that some of the area is missing.

center.z = 1.8   # pick a value for the center of the folded normal
z = seq(0, 6, .001)   # create the range of z-values
y = dnorm(z, center.z, 1) + dnorm(z, -center.z, 1)   # density of a folded normal
y2 = y   # duplicate the densities
y2[z < 1.96] = 0   # simulate selection bias: density for non-significant results is zero
scale = sum(y2)/sum(y)   # proportion of the area that is significant
y = y / scale   # rescale so that the area under the curve for significant results is 1

# draw a histogram of z-values
# input is z.val.input
# example: z.val.input = abs(rnorm(1000, 2))
hist(z.val.input, freq=FALSE, xlim=c(0,6), ylim=c(0,1), breaks=seq(0,20,.2),
     xlab="", ylab="Density", main="Z-Curve")

abline(v=1.96, col="red")   # line for alpha = .05 (two-tailed)
abline(v=1.65, col="red", lty=2)   # line for marginal significance (alpha = .10, two-tailed)

par(new=TRUE)   # superimpose the next plot on the histogram

# draw the predicted sampling distribution
plot(z, y, type="l", lwd=4, ylim=c(0,1), xlim=c(0,6), xlab="(absolute) z-values", ylab="")

Although this model fits the data better than the previous model without selection bias, it still has problems fitting the data.  The reason is that there is substantial heterogeneity in the true strength of evidence.  In other words, the variability in observed z-scores reflects not just sampling error but also variability in the true (non-central) values, because some studies have larger samples than others and population effect sizes differ across studies (some studies examine weak effects and others examine strong effects).

Jerry Brunner and I developed mixture models to fit a predicted distribution to the observed distribution of z-values.  In a nutshell, a mixture model combines several (folded) normal distributions.  Jerry’s z-curve lets the centers of the normal distributions move around and assigns them different weights.  Uli’s z-curve uses fixed centers one standard deviation apart (0, 1, 2, 3, 4, 5, and 6) and uses the weights to fit the model to the data.  Simulation studies show that both methods work well.  Jerry’s method works a bit better when there is little variability, and Uli’s method works a bit better with large variability.

The next figure shows the result for Uli’s method because the data have large variability.

z.curve.plot5

The dark blue line in the figure shows the density distribution of the observed data.  A density distribution assigns densities to an observed distribution that does not follow a standard mathematical sampling distribution such as the standard normal distribution.  We use the kernel density estimation method implemented in base R (the density function).

The grey line shows the predicted density distribution based on Uli’s z-curve method.  The z-curve plot makes it easy to see the fit of the model to the data, which is typically very good.  The result of the model is the weighted average of the true power values that correspond to the centers of the simulated normal distributions.  For this distribution, the weighted average is 48%.

The 48% estimate can be interpreted in two ways.  First, if researchers randomly sampled from this set of studies in social psychology and were able to reproduce the original study exactly (including sample size), they would have a probability of 48% of replicating a significant result with alpha = .05.  The complementary interpretation is that if researchers replicated all studies exactly, the reproducibility project would be expected to produce 48% significant results and 52% non-significant results.  Because the average power of studies predicts the success rate of exact replication studies, Jerry and I refer to the average power of studies that were selected for significance as replicability.  Simulation studies show that our z-curve methods have good large-sample accuracy (+/- 2%), and we adjust for the small estimation bias in large samples by computing a conservative confidence interval that is widened by 2 percentage points at each end.
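The following rough sketch illustrates the general idea of the fixed-centers approach.  It is my simplified illustration, not the actual fun.uli.zcurve code linked below, and the input z-values are simulated.  Folded-normal components centered at 0 to 6 receive non-negative weights so that their mixture matches the density of the significant z-scores, and the replicability estimate is the weighted average of the power implied by each center.

z.crit  <- qnorm(.975)                        # 1.96 for alpha = .05 (two-tailed)
centers <- 0:6                                # fixed component centers (non-centrality)
z.obs   <- abs(rnorm(500, 2.5, 1.2))          # hypothetical input z-values
z.sig   <- z.obs[z.obs > z.crit & z.obs < 6]  # significant values below the censoring point

grid     <- seq(z.crit, 6, .01)
dens.obs <- density(z.sig, from = z.crit, to = 6, n = length(grid))$y  # kernel density of the data

# density of each component, restricted to the significant region
comp <- sapply(centers, function(m) {
  d     <- dnorm(grid, m, 1) + dnorm(grid, -m, 1)   # folded normal
  p.sig <- pnorm(m - z.crit) + pnorm(-m - z.crit)   # P(significant | center = m)
  d / p.sig
})

# choose the weights (summing to 1) that minimize the distance to the observed density
fit <- optim(rep(0, length(centers)), function(par) {
  w <- exp(par) / sum(exp(par))
  sum((dens.obs - comp %*% w)^2)
})
w <- exp(fit$par) / sum(exp(fit$par))

# weighted average power of the significant studies = replicability estimate
pow <- pnorm(centers - z.crit) + pnorm(-centers - z.crit)
sum(w * pow)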

Below is the R-Code to obtain estimates of replicability from a set of z-values using Uli’s method.

<<<Download Zcurve R.Code>>>

Download the R code to your computer, then run it from anywhere with the following commands:

location = "<user folder>"   # location where the z-curve code is stored
source(paste0(location, "fun.uli.zcurve.sharing.18.1.R"))   # read the code
run.zcurve(z.val.input)   # get z-curve estimates with z-values as input


Z-curve vs. P-curve: Breakdown of an attempt to resolve disagreement in private.

Background:  In a tweet that I can no longer find because Uri Simonsohn blocked me from his Twitter account, Uri suggested that it would be good if scientists could discuss controversial issues in private before they start fighting on social media.  I was just about to submit a manuscript that showed some problems with his p-curve approach to power estimation and demonstrated that z-curve works better in some situations, namely when there is substantial variation in statistical power across studies.  So, I thought I would give it a try and sent him the manuscript so that we could try to find agreement in a private email exchange.

The outcome of this attempt was that we could not reach agreement on this topic.  At best, Uri admitted that p-curve is biased when some extreme test statistics (e.g., F(1,198) = 40, or t(48) = 5.00) are included in the dataset.  He likes to call these values outliers. I consider them part of the data that influence the variability and distribution of test statistics.

For the most part, Uri disagreed with my conclusions and considers the simulation results that support my claims unrealistic.  Meanwhile, Uri published a blog post with simulations that have only small heterogeneity to claim that p-curve works even better than z-curve when there is heterogeneity.

The reason for the discrepancy between his results and my results is different assumptions about what is realistic variability in the strength of evidence against the null-hypothesis, as reflected in absolute z-scores (transformation of p-values into z-scores by means of -qnorm(p.2t/2), where p.2t is the two-tailed p-value of a t-test or F-test).

To give everybody an opportunity to examine the arguments that were exchanged during our discussion of p-curve versus z-curve, I am sharing the email exchange.  I hope that more statisticians will examine the properties of p-curve and z-curve and add to the discussion.  To facilitate this, I will make the r-code to run simulation studies of p-curve and z-curve available in a separate blog post.

P.S.  P-curve is available as an online app that provides power estimates without any documentation of how p-curve behaves in simulation studies or warnings that datasets with large test statistics can produce inflated estimates of average power.

My email correspondence with Uri Simonsohn – RE: p-curve and heterogeneity

From:    URI
To:          ULI
Date:     11/24/2017

Hi Uli,

I think email is better at this point.

Ok I am behind a ton of stuff and have a short workday today so cannot look in detail at your z-curve paper right now.

I did a quick search for “osf”, “http” and “code” and could not find the R Code , that may facilitate things if you can share it. Mostly, I would like the code that shows p-curve is biased, especially looking at how the population parameter being estimated is being defined.

I then did a search for “p-curve” and found this

Quick reactions:

1)            For power estimation p-curve does not assume homogeneity of effect size, indeed, if anything it assumes homogeneity of power and allows each study to have a different effect size, but it is not really assuming a single power, it is asking what single power best fits the data, which is a different thing. It is computing an average. All average computations ask “what single value best fits the data” but that’s not the same as saying “I think all values are identical, and identical to the average”

2)            We do report a few tests of the impact of heterogeneity on p-curve, maybe you have something else in mind. But here they go just in case:

Figure 2C in our POPS paper, has d~N(x,sd=.2)

[Clarification: This Figure shows estimation of effect sizes. It does not show estimation of power.]

Supplement 2

[Again. It does not show simulations for power estimation.]

A key thing to keep in mind is the population parameter of interest. P-curve does not estimate the population effect size or power of all studies attempted, published, reported, etc. It does so for the set of studies included in p-curve. So note, for example, in the figure S2C above that when half of studies are .5 and half are .3 among the attempted, p-curve estimates the average included study accurately but differently from .4. The truth is .48 for included studies, p-curve says .47, and the average attempted study is .4

[This is not the issue. Replicability implies conditioning on significance. We want to predict the success rate of studies that replicate significant results. Of course it is meaningful to do follow up studies on non-significant results. But the goal here is not to replicate another inconclusive non-significant result.]

Happy to discuss of course, Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/24/2017

Hi Uri,

I will change the description of your p-curve code for power.

Honest, I am not fully clear about what the code does or what the underlying assumptions are.

So, thanks for clarifying.

I agree with you that pcurve (also puniform) are surprisingly robust estimates of effect sizes even with heterogeneity (I have pointed that out in comments in the Facebook Discussion group), but that doesn’t mean it works well for power.   If you have published any simulation tests for the power estimation function, I am happy to cite them.

Attached is a single R code file that contains (a) my shortened version of your p-curve code, (b) the z-curve code, (c) the code for the simulation studies.

The code shows the cumulative results. You don’t have to run all 5,000 replications before you see the means stabilizing.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

Hi Uli,

Thanks for sending the code, I am trying to understand it.  I am a little confused about how the true power is being generated. I think you are drawing “noncentrality” parameters  (ncp) that are skewed, and then turning those into power, rather than drawing directly skewedly distributed power, correct? (I am not judging that as good or bad, I am just verifying).

[Yes that is correct]

In any case, I created a histogram of the distribution of true power implied by the ncp’s that you are drawing (I think, not 100% sure I am getting that right).

For scenario 3.1 it looks like this:

 

Uri1

 

For scenario 3.3 it looks like this:

Uri.2

 

(the only code I added was to turn all the true power values into a vector before averaging it, and then plotting a histogram for that vector; if interested, you can copy-paste this into the line of code that just reads "tp" in your code and you will re-produce my histogram)

# ADDED BY URI
power.i = pnorm(z, z.crit)[obs.z > z.crit]   # line added by Uri Simonsohn to look at the distribution
hist(power.i, xlab = 'true power of each study')
mean.pi   = round(mean(power.i), 2)
median.pi = round(median(power.i), 2)
sd.pi     = round(sd(power.i), 2)
mtext(side = 3, line = 0, paste0("mean=", mean.pi, "   median=", median.pi, "   sd=", sd.pi))

I wanted to make sure

1)            I am correctly understanding this variable as being the true power of the observed studies, the average/median of which we are trying to estimate

2)            Those distributions are the distributions you intended to generate

[Yes, that is correct. To clarify, 90% power for p < .05 (two-tailed) is obtained with a z-score of  qnorm(.90, 1.96)  = 3.24.   A z-score of 4 corresponds to 97.9% power.  So, in the literature with adequately powered studies, we would expect studies to bunch up at the upper limit of power, while some studies may have very low power because the theory made the wrong prediction and effect sizes are close to zero and power is close to alpha (5%).]

Thanks, Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

Hi Uri,

Thanks for getting back to me so quickly.   You are right, it would be more accurate to describe the distribution as the distribution of the non-centrality parameters rather than power.

The distribution of power is also skewed but given the limit of 1,  all high power studies will create a spike at 1.  The same can happen at the lower end and you can easily get U-shaped distributions.

So, what you see is something that you would also see in actual datasets.  Actually, the dataset minimizes skew because I only used non-centrality parameters from 0 to 6.

I did this because z-curve only models z-values between 0 and 6 and treats all observed z-scores greater than 6 as having a true power of 1.  That reduces the pile on the right side.

You could do the same to improve performance of p-curve, but it will still not work as well as z-curve, as the simulations with z-scores below 6 show.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

OK, yes, probably worth clarifying that.

Ok, now I am trying to make sure I understand the function you use to estimate power with z-curve.

If I  see p-values, say c(.001,.002,.003,.004,.005) and I wanted to estimate true power for them via z-curve, I would run:

p= c(.001,.002,.003,.004,.005)

z= -qnorm(p/2)

fun.zcurve(z)

And estimate true power to be 85%, correct?

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

Yes.

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

Hi Uli,

To make sure I understood z-curve’s function I run a simple simulation.
I am getting somewhat biased results with z-curve, do you want to take a look and see if I may be doing something wrong?

I am attaching the code, I tried to make it clear but it is sometimes hard to convey what one is trying to do, so feel free to ask any questions.

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

Hi Uri,

What is the k in these simulations?   (z-curve requires somewhat large k because the smoothing of the density function can distort things)

You may also consult this paper (the smallest k was 15 in this paper).

http://www.utstat.toronto.edu/~brunner/zcurve2016/HowReplicable.pdf

In this paper, we implemented pcurve differently, so you can ignore the p-curve results.

If you get consistent underestimation with z-curve, I would like to see how you simulate the data.

I haven’t seen this behavior in z-curve in my simulations.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

Hi Uli,

I don’t know where “k” is set, I am using the function you sent me and it does not have k as a parameter

I am running this:

fun.zcurve = function(z.val.input, z.crit = 1.96, Int.End=6, bw=.05) {…

Where would k be set?

Into the function you have this

### resolution of density function (doesn’t seem to matter much)

bars = 500

Is that k?

URI

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

I mean the number of test statistics that you submit to z-curve.

length(z.val.input)

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

I just checked with k = 20, the z-curve code I sent you underestimates fixed power of 80 as 72.

The paper I sent you shows a similar trend with true power of 75.

k            15      25      50      100     250
Z-curve      0.704   0.712   0.717   0.723   0.728

[Clarification: This is from the Brunner & Schimmack, 2016, article]

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/30/2017

Hi Uli,

Sorry for disappearing, got distracted with other things.

I looked a bit more at the apparent bias downwards that z-curve has on power estimates.

First, I added p-curve’s estimates to the chart I had sent, I know p-curve performs well for that basic setup so I used it as a way to diagnose possible errors in my simulations, but p-curve did correctly recover power, so I conclude the simulations are fine.

If you spot a problem with them, however, let me know.

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/30/2017

Hi Uri,

I am also puzzled why z-curve underestimates power in the homogeneous case even with large N.  This is clearly an undesirable behavior and I am going to look for solutions to the problem.

However, in real data that I analyze, this is not a problem because there is heterogeneity.

When there is heterogeneity, z-curve performs very well, no matter what the distribution of power/non-centrality parameters is. That is the point of the paper.  Any comments on comparisons in the heterogeneous case?

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/30/2017

Hey Uli,

I have something with heterogeneity but want to check my work and am almost done for the day, will try tomorrow.

Uri

[Remember: I supplied Uri with r-code to rerun the simulations of heterogeneity and he ran them to show what the distribution of power looks like.  So at this point we could discuss the simulation results that are presented in the manuscript.]

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/30/2017

I ran simulations with t-distributions and N = 40.

The results look the same for me.

Mean estimates for 500 simulations

32, 48, 75

As you can see, p-curve also has bias when t-values are converted into z-scores and then analyzed with p-curve.

This suggests that with small N,  the transformation from t to z introduces some bias.

The simulations by Jerry Brunner showed less bias because we used the sample sizes in Psych Science for the simulation (median N ~ 80).

So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/30/2017

Hi Uli,

The fact that p-curve is also biased when you convert to z-scores suggests to me that approximation is indeed part of the problem.

[Clarification: I think URI means z-curve]

Fortunately p-curve analysis does not require that transformation and one of the reasons we ask in the app to enter test-statistics is to avoid unnecessary transformations.

I guess it would also be true that if you added .012 to p-values p-curve would get it wrong, but p-curve does not require one to add .012 to p-values.

You write “So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.”

Only partial agreement, because the statement implies that for larger N and larger K z-curve is not biased, I believe it is also biased for large k and large N. Here, for instance, is the chart with n=50 per cell (N=100 total) and 50 studies total.

Today I modified the code I sent you so that I would accommodate any power distribution in the submitted studies, not just a fixed level. (attached)

I then used the new montecarlo function to play around with heterogeneity and skewness.

The punchline is that p-curve continues to do well, and z-curve continues to be biased downward.

I also noted, by computing the standard deviation of estimates across simulations, that p-curve has slightly less random error.

My assessment is that z-curve and p-curve are very similar and will generally agree, but that z-curve is more biased and has more variance.

In any case, let's get to the simulations. Below I show 8 scenarios sorted by the ex-post average true power for the sets of studies.

[Note, N = 20 per cell.  As I pointed out earlier, with these small sample sizes the t to z-transformation is a factor. Also k = 20 is a small set of studies that makes it difficult to get good density distributions.  So, this plot is p-hacked to show that p-curve is perfect and z-curve consistently worse.  The results are not wrong, but they do not address the main question. What happens when we have substantial heterogeneity in true power?  Again, Uri has the data, he has the r-code, and he has the results that show p-curve starts overestimating.  However, he ignores this problem and presents simulations that are most favorable for p-curve.]

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/1/2017

Hi Uri,

I really do not care so much about bias in the homogeneous case. I just fixed the problem by first doing a test of the variance and if variance is small to use a fixed effects model.

[Clarification:  This is not yet implemented in z-curve and was not done for the manuscript submitted for publication which just acknowledges that p-curve is superior when there is no heterogeneity.]

The main point of the manuscript is really about data that I actually encounter in the literature (see demonstrations in the manuscript, including power posing) where there is considerable heterogeneity.

In this case, p-curve overestimates as you can see in the simulations that I sent you.   That is really the main point of the paper and any comments from you about p-curve and heterogeneity would be welcome.

And, I did not mean to imply that pcurve needs transformation. I just found it interesting that transformation is a problem when N is small (as N gets bigger t approaches z and the transformation has less influence).

So, we are in agreement that pcurve does very well when there is little variability in the true power across studies.  The question is whether we are in agreement about heterogeneity in power?

Best, Uli

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/1/2017

Hi Uri,

Why not simulate scenarios that match onto real data.

[I attached data from my focal hypothesis analysis of Bargh’s book “Before you know it” ]

https://replicationindex.com/2017/11/28/before-you-know-it-by-john-a-bargh-a-quantitative-book-review/

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/1/2017

P.P.S

Also, my simulations show that z-curve OVERestimates when true power is below 50%.   Do you find this as well?

This is important because power posing estimates are below 50%, so estimation problems with small k and N would mean that z-curve estimate is inflated rather than suggesting that p-curve estimate is correct.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

Hi Uli,

The results I sent show substantial heterogeneity and p-curve does well, do you disagree?

Uri

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

Not sure what you mean here. What aspect of real data would you like to add to the simulations? I did what I did to address the concerns you had that p-curve may not handle heterogeneity and skewed distributions of power, and it seems to do well with very substantial skew and heterogeneity.

What aspect are the simulations abstracting away from that you worry may lead p-curve to break down with real data?

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

I think you are not simulating sufficient heterogeneity to see that p-curve is biased in these situations.

Let’s focus on one example (simulation 2.3) in the r-code I sent you: High true power (.80) and heterogeneity.

This is the distribution of the non-centrality parameters.

And this is the distribution of true power for p < .05 (two-tailed, |z| >= 1.96).

[Clarification: this is not true power, it is the distribution of observed absolute z-scores]

More important, the variance of the observed significant (z > 1.96) z-scores is 2.29.

[Clarification: In response to this email exchange, I added the variance of significant z-scores to the manuscript as a measure of heterogeneity.  Due to the selection for significance, variance with low power can be well below 1.   A variance of 2.29 is large heterogeneity. ]

In comparison the variance for the fixed model (non-central z = 2.80) is 0.58.

So, we can start talking about heterogeneity in quantitative terms. How much variance do your simulated observed p-values have when you convert them into z-scores?

The whole point of the paper is that performance of p-curve suffers, the greater the heterogeneity of true power is.  As sampling error is constant for z-scores, variance of observed z-scores has a maximum of 1 if true power is constant. It is lower than 1 due to selection for significance, which is more severe the lower the power is.

The question is whether my simulations use some unrealistic, large amount of heterogeneity.   I attached some Figures for the Journal of Judgment and Decision Making.

As you can see, heterogeneity can be even larger than the heterogeneity simulated in scenario 2.3 (with a normal distribution around z = 2.75).

In conclusion, I don't doubt that you can find scenarios where p-curve does well with some heterogeneity.  However, the point of the paper is that it is possible to find scenarios where there is heterogeneity and p-curve does not do well.  What your simulations suggest is that z-curve can also be biased in some situations, namely with low variability, small N (so that transformation to z-scores matters), and a small number of studies.

I am already working on a solution for this problem, but I see it as a minor problem because most datasets that I have examined (like the ones that I used for the demonstrations in the ms) do not match this scenario.

So, if I can acknowledge that p-curve outperforms z-curve in some situations, I wonder whether you can do the same and acknowledge that z-curve outperforms p-curve when power is relatively high (50%+) and there is substantial heterogeneity?

Best, Uli

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

What surprises me is that I sent you r-code with 5 simulations that showed when p-curve is breaking down (starting with normal distributed variability of non-central z-scores and 50% power (sim2.2) followed by higher power (80%) and all skewed distributions (sim 3.1, 3.2, 3.3).  Do you find a problem with these simulations or is there some other reason why you ignore these simulation studies?

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

I tried "power = runif(n.sim)*.4 + .58" with k = 100.

Now pcurve starts to overestimate and zcurve is unbiased.

So, k makes a difference.  Even if pcurve does well with k = 20,  we also have to look for larger sets of studies.

Results of 500 simulations with k = 100

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Even with k = 40,  pcurve overestimates as much as zcurve underestimates.

zcurve            pcurve
Min.   :0.5395    Min.   :0.5600
1st Qu.:0.7232    1st Qu.:0.7900
Median :0.7898    Median :0.8400
Mean   :0.7817    Mean   :0.8246
3rd Qu.:0.8519    3rd Qu.:0.8700
Max.   :0.9227    Max.   :0.9400

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

This is what I find with systematic variation of number of studies (k) and the maximum heterogeneity for a uniform distribution of power and average power of 80% after selection for significance.

power = runif(n.sim)*.4 + .58

              zcurve   pcurve
k = 20          77.5     81.2
k = 40          78.2     82.5
k = 100         79.3     82.7
k = 10000       80.2     81.7

(1 run)

If we are going to look at k = 20, we also have to look at k = 100.

Best, Uli

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Why did you truncate the beta distributions so that they start at 50% power?

Isn’t it realistic to assume that some studies have less than 50% power, including false positives (power = alpha = 5%)?

How about trying this beta distribution?

curve(dbeta(x, .5, .35)*.95 + .05, 0, 1, ylim=c(0,3), col="red")

80% true power after selection for significance.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

Hi Uli,

I know I have a few emails from you, thanks.

My plan is to get to them on Monday or Tuesday. OK?

Uri

—————————————————————————————————————————————

Hi Uli,

We have a blogpost going up tomorrow and have been distracted with that, made some progress with z- vs p- but am not ready yet.

Sorry Uri

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

Hi Uli,

Ok, finally I have time to answer your emails from over the weekend.

Why I run something different?

First, you asked why I run simulations that were different from those you have in your paper (scenario 2.1 and 3.1).

The answer is that I tried to simulate what I thought you were describing in the text: heterogeneity in power that was skewed.

When I saw you had run simulations that led to a power distribution that looked like this:

I assumed that was not what was intended.

First, that’s not skewed

Second, that seems unrealistic, you are simulating >30% of studies powered above 90%.

[Clarification:  If studies were powered at 80%,  33% of studies would be above 90% :

1-pnorm(qnorm(.90,1.96),qnorm(.80,1.96))

It is important to remember that we are talking only about studies that produced a significant result. Even if many null-hypothesis are tested, relatively few of these would make it into the set of studies that produced a significant result.  Most important, this claim ignores the examples in the paper and my calculations of heterogeneity that can be used to compare simulations of heterogeneity with real data.]

Third, when one has extremely bimodal data, central tendency measures are less informative/important (e.g., the average human wears half a bra). So if indeed power was distributed that way, I don't think I would like to estimate average power anyway. And if it did, saying the average is 60% or 80% is almost irrelevant, hardly any studies are in that range in reality (like say the average person wears .65 bras, that's wrong, but inconsequentially worse than .5 bras).

Fourth, if indeed 30% of studies have >90% power, we don’t need p-curve or z-curve. Stuff is gonna be obviously true to naked eye.

But below I will ignore these reservations and stick to that extreme bimodal distribution you propose that we focus our attention on.

The impact of null findings

Actually, before that, let me acknowledge I think you raised a very valid point about the importance of adding null findings to the simulations. I don’t think the extreme bimodal you used is the way to do it, but I do think power=5% in the mix does make sense.

We had not considered p-curve’s performance there and we should have.

Prompted by this exchange I did that, and I am comfortable with how p-curve handles power=5% in the mix.

For example, I considered 40 studies, starting with all 40 null, and then having an increasing number drawn from U(40%-80%) power. Looks fine.

Why p-curve overshoots?

Ok. So having discussed the potential impact of null findings on estimates, and leaving aside my reservations with defining the extreme bimodal distribution of power as something we should worry about, let's try to understand why p-curve over-estimates and z-curve does not.

Your paper proposes it is because p-curve assumes homogeneity.

It doesn’t. p-curve does not assume homogeneity of power any more than computing average height involves assuming homogeneity of height. It is true that p-curve does not estimate heterogeneity in power, but averaging height also does not compute the SD(height). P-curve does not assume it is zero, in fact, one could use p-curve results to estimate heterogeneity.

But in any case, is z-curve handling the extreme bimodal better thanks to its mixture of distributions, as you propose in the paper, or due to something else?

Because power is nonlinearly related to ncp I assumed it had to do with the censoring of high z-values you did rather than the mixture (though I did not actually look into the mixture in any detail at all)..

To look into that I censored t-values going into p-curve. Not as a proposal for a modification but to make the discussion concrete. I censored at t<3.5 so that any t>3.5 is replaced by 3.5 before being entered into p-curve.  I did not spend much time fine-tuning it and I am definitely not proposing that if one were to censor t-values in p-curve they should be censored at 3.5.

 

Ok, so I run p-curve with censored t-values for the rbeta() distribution you sent and for various others of the same style.

We see that censored p-curve behaves very similarly to z-curve (which is censored also).

I also tried adding more studies, running rbeta(3,1) and (1,3), etc.. Across the board, I find that if there is a high share of extremely high powered studies, censored p-curve and z-curve look quite similar.

If we knew nothing else, we would be inclined to censor p-curve going forward, or to use z-curve instead. But censored p-curve, and especially z-curve, give worse answers when the set of studies does not include many extremely high-powered ones, and in real life we don't have many extremely high-powered studies. So z-curve and censored p-curve make gains in a world that I don't think exists, and exhibit losses in one that I do think exists.

In particular, z-curve estimates power to be about 10% when the null is true, instead of 5% (censored p-curve actually get this one right, null is estimated at 5%).

Also, z-curve underestimates power in most scenarios not involving an extreme bimodal distribution (see charts I sent in my previous email). In addition, z-curve tends to have higher variance than p-curve.

As indicated in my previous email, z-curve and p-curve agree most of the time, their differences will typically be within sampling error. It is a low stakes decision to use p-curve vs z-curve, especially compared to the much more important issue of which studies are selected and which tests are selected within studies.

Thanks for engaging in this conversation.

We don’t have to converge to agreement to gain from discussing things.

Btw, we will write a blog post on the repeated and incorrect claim that p-curve assumes homogeneity and does not deal with heterogeneity well. We will send you a draft when we do, but it could be several weeks till we get to that. I don't anticipate it being a contentious post from your point of view but figured I would tell you about it now.

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Now that we are on the same page, the only question is what is realistic.

First, your blog post on outliers already shows what is realistic. A single outlier in the power pose study increases the p-curve estimate by more than 10% points.

You can fix this now, but p-curve as it existed did not do this.   I would also describe this as a case of heterogeneity. Clearly the study with z = 7 is different from studies with z = 2.

This is in the manuscript that I asked you to evaluate and you haven’t commented on it at all, while writing a blog post about it.

The paper contains several other examples that are realistic because they are based on real data.

I mainly present them as histograms of z-scores rather than histograms of p-values or observed power because I find the distribution of the z-scores more informative (e.g., where is the mode, is the distribution roughly normal, etc.), but if you convert the z-scores into power you get distributions like the one shown below (U-shaped), which is not surprising because power is bounded at alpha and 1.  So, that is a realistic scenario, whereas your simulations of truncated distributions are not.

I think we can end the discussion here.  You have not shown any flaws with my analyses. You have shown that under very limited and unrealistic situations p-curve performs better than z-curve, which is fine because I already acknowledged in the paper that p-curve does better in the homogeneous case.

I will change the description of the assumption underlying p-curve, but leave everything else as is.

If you think there is an error let me know but I have been waiting patiently for you to comment on the paper, and examined your new simulations.

Best, Uli

—————————————————————————————————————————————

Hi Uri,

What about the real world of power posing?

A few z-scores greater than 4 mess up p-curve as you just pointed out in your outlier blog.

I have presented several real world data to you that you continue to ignore.

Please provide one REAL dataset where p-curve gets it right and z-curve underestimates.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/6/2017

Hi Uli,

With real datasets you don’t know true power so you don’t know what’s right and wrong.

The point of our post today is that there is no point statistically analyzing the studies that Cuddy et al put together, with p-curve or any other tool.

I personally don’t think we ever observe true power with enough granularity to make z- vs p-curve prediction differences consequential.

But I don’t think we, you and I, should debate this aspect (is this bias worth that bias). Let’s stick to debating basic facts such as whether or not p-curve assumes homogeneity, or z-curve differs from p-curve because of homogeneity assumption or because of censoring, or how big bias is with this or that assumption. Then when we write we present those facts as transparently as possible to our readers, and they can make an educated decision about it based on their priors and preferences.

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/6/2017

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values.

Agree

Disagree

z-curve uses multiple parameters, which improves prediction when there is substantial heterogeneity?

Agree

Disagree

In many cases, the differences are small and not consequential.

Agree

Disagree

When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates.

(see simulations in our manuscript)

Agree

Disagree

I want to submit the manuscript by end of the week.

Best, Uli

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/6/2017

Going through the manuscript one more time, I found this.

To examine the robustness of estimates against outliers, we also obtained estimates for a subset of studies with z-scores less than 4 (k = 49).  Excluding the four studies with extreme scores had relatively little effect on z-curve; replicability estimate = 34%.  In contrast, the p-curve estimate dropped from 44% to 5%, while the 90%CI of p-curve ranged from 13% to 30% and did not include the point estimate.

Any comments on this, I mean point estimate is 5% and 90%CI is 13 to 30%,

Best, Uli

[Clarification:  this was a mistake. I confused point estimate and lower bound of CI in my output]

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/7/2017

Hi Uli.

See below:

From: Ulrich Schimmack [mailto:ulrich.schimmack@utoronto.ca]

Sent: Wednesday, December 6, 2017 10:44 PM

To: Simonsohn, Uri <uws@wharton.upenn.edu>

Subject: RE: Its’ about censoring i think

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values. 

Agree

z-curve uses multiple parameters,

Agree I don’t know the details of how z-curve works, but I suspect you do and are correct.

which improves prediction when there is substantial heterogeneity?

Disagree.

Few fronts.

1)            I don’t think heterogeneity per-se is the issue, but extremity of the values. P-curve is accurate with very substantial heterogeneity. In your examples what causes the trouble are those extremely high power values. Even with minimal heterogeneity you will get over-estimation if you use such values.

2)            I also don't know that it is the extra parameters in z-curve that are helping because p-curve with censoring does just as well, so I suspect it is the censoring and not the multiple parameters. That's also consistent with z-curve under-estimating almost everywhere; the multiple parameters should not lead to that, I don't think.

In many cases, the differences are small and not consequential.

Agree, mostly. I would not state that in an unqualified way out of context.

For example, my personal assessment, which I realize you probably don't share, is that z-curve does worse in contexts that matter a bit more, and that are vastly more likely to be observed.

When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates. 

(see simulations in our manuscript)

Disagree.

You can have very substantial heterogeneity and very high power and p-curve is accurate (z-curve under-estimates).

For example, for the blogpost on heterogeneity and p-curve I figured that rather than simulating power directly  it made more sense to simulate n and d distributions, over which people have better intuitions.. and then see what happened to power (rather than simulating power or ncp directly).

Here is one example. Sets of 20 studies, drawn with n and d from the first two panels, with the implied true power and its estimate in the 3rd panel.

I don’t mention this in the post, but z-curve in this simulation under-estimates power, 86% instead of 93%

The parameters are

n~rnorm(mean=100,sd=10)

d~rnorm(mean=.5,sd=.05)

What you need for p-curve to over-estimate and for z-curve to not under-estimate is substantial share of studies at both extremes, many null, many with power>95%

In general very high power leads to over-estimation, but it is trivial in the absence of many very low power studies that lower the average enough that it matters.

That’s the combination I find unlikely, 30%+ with >90% power and at the same time 15% of null findings (approx., going off memory here).

I don’t generically find high power with heterogeneity unlikely, I find the figure above super plausible for instance.

NOTE: For the post I hope to gain more insight on the precise boundary conditions for over-estimation, I am not sure I totally get it just yet.

I want to submit the manuscript by end of the week.

Hope that helps.  Good luck.

Best, Uli

 

From     URI
To           ULI
Date      12/7/2017

Hi Uli,

First, I had not read your entire paper and only now I realize you analyze the Cuddy et al paper, that’s an interesting coincidence. For what is worth, we worked on the post before you and I had this exchange (the post was written in November and we first waited for Thanksgiving and then over 10 days for them to reply). And moreover, our post is heavily based off the peer-review Joe wrote when reviewing this paper, nearly a year ago, and which was largely ignored by the authors unfortunately.

In terms of the results. I am not sure I understand. Are you saying you get an estimate of 5% with a confidence interval between 13 and 30?

That’s not what I get.

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/7/2017

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

As you can see in your output, the numbers are switched (I should label columns in output).

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

Best, Uli

—————————————————————————————————————————————

Hi Uli,

See below

From: Ulrich Schimmack [mailto:ulrich.schimmack@utoronto.ca]

Sent: Friday, December 8, 2017 12:39 AM

To: Simonsohn, Uri <uws@wharton.upenn.edu>

Subject: RE: one more question

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

*I figured

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

*Happens to the best of us

As you can see in your output, the numbers are switched (I should label columns in output).

*I figured

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

*The tone is a bit accusatorial “admit”, but yes, in my blog post I will talk about it. My goal is to present facts in a way that lets readers decide with the same information I am using to decide.

It’s not always feasible to achieve that goal, but I strive for it. I prefer people making right inferences than relying on my work to arrive at them.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

*I don’t think that’s for us to decide. We can ‘fight’ about how to present the facts to readers, they decide which is more realistic.

I am not ignoring your simulation results.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

*I would prefer if you don’t speak on my behalf either way, our conversation is for each of us to learn from the other, then you speak for yourself.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

*I haven’t tried to reproduce your simulations, but I did indicate in our emails that if you run the rbeta(n,.35,.5)*.95+.05 p-curve over-estimates, I also explained why I don’t find that particularly worrisome. But you are not publishing a report on our email exchange, you are proposing a new tool. Our exchange hopefully helped make that paper clearer.

Please don’t quote any aspect of our exchange. You can say you discussed matters with me, but please do not quote me. This is a private email exchange. You can quote from my work and posts. The heterogeneity blog post may be up in a week or two.

Uri