Original Post: 11/26/2018. Modification: 4/15/2021. The z-curve analysis was updated using the latest version of z-curve.
“Trust is good, but control is better”
Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated. However, the replicability of published results in social psychology is doubtful.
Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results is, if the studies were replicated exactly. In a replicability audit, I am applying z-curve to the most cited articles of psychologists to estimate the replicability of their studies.
Fritz Strack is an eminent social psychologist (H-Index in WebofScience = 51).
Fritz Strack also made two contributions to meta-psychology.
First, he volunteered his facial-feedback study for a registered replication report; a major effort to replicate a published result across many labs. The study failed to replicate the original finding. In response, Fritz Strack argued that the replication study introduced cameras as a confound or that the replication team actively tried to find no effect (reverse p-hacking).
Second, Strack co-authored an article that tried to explain replication failures as a result of problems with direct replication studies (Strack & Stroebe, 2014). This is a concern, when replicability is examined with actual replication studies. However, this concern does not apply when replicability is examined on the basis of test statistics published in original articles. Using z-curve, we can estimate how replicable these studies are, if they could be replicated exactly, even if this is not possible.
Given Fritz Strack’s skepticism about the value of actual replication studies, he may be particularly interested in estimates based on his own published results.
I used WebofScience to identify the most cited articles by Fritz Strack (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 42 empirical articles (H-Index = 42). The 42 articles reported 117 studies (average 2.8 studies per article). The total number of participants was 8,029 with a median of 55 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value and the p-value was then converted into a z-score. The z-scores were submitted to a z-curve analysis to estimate mean power of the 103 results that were significant at p < .05 (two-tailed). Three studies did not test a hypothesis or predicted a non-significant result. The remaining 11 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 114 reported hypothesis tests was 100%.
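The last conversion step described above, from a two-tailed p-value to an absolute z-score, can be sketched in a few lines of Python. This is only an illustration and not the code used for the audit; the function name `p_to_z` is made up for this example, and computing the exact p-value from a reported t- or F-statistic is left out.

```python
from statistics import NormalDist

def p_to_z(p_two_tailed: float) -> float:
    """Absolute z-score whose two-tailed p-value equals p_two_tailed."""
    # Half of the p-value lies in each tail of the standard normal distribution.
    return NormalDist().inv_cdf(1.0 - p_two_tailed / 2.0)

for p in (0.05, 0.005, 0.001):
    print(f"p = {p:<6} -> z = {p_to_z(p):.2f}")
```

Because every test statistic ends up on the same z-scale, results from different tests with different degrees of freedom can be analyzed together.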
The z-curve estimate of replicability is 38% with a 95%CI ranging from 23% to 54%. The complementary interpretation of this result is that the actual type-II error rate is 62% compared to the 0% failure rate in the published articles.
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve is an estimate of the file drawer of studies that need to be conducted to achieve 100% successes with 28% average power. Although this is just a projection, the figure makes it clear that Strack and collaborators used questionable research practices to report only significant results.
Z-curve 2.0 also estimates the actual discovery rate in a laboratory, called the expected discovery rate (EDR). The EDR estimate is 24% with a 95%CI ranging from 5% to 44%. The observed discovery rate of 100% is well outside this confidence interval. Thus, there is strong evidence that questionable research practices had to be used to produce significant results. The estimated discovery rate can also be used to estimate the risk of false positive results (Soric, 1989). With an EDR of 24%, the false positive risk is 16%. This suggests that most of Strack’s results may show the correct sign of an effect, but that the effect sizes are inflated and that it is often unclear whether the population effect sizes would have practical significance.
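For readers who want to check the arithmetic, Soric's (1989) upper bound on the false discovery risk can be written as a one-line function. The sketch below is an illustration in Python, not output from the z-curve package, and the function name `soric_max_fdr` is chosen for this example only.

```python
def soric_max_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the proportion of false discoveries.

    discovery_rate is the share of all conducted tests that reached
    significance (here: the EDR estimate); alpha is the significance level.
    """
    return (1.0 / discovery_rate - 1.0) * alpha / (1.0 - alpha)

# Point estimate and the limits of the 95% CI for the EDR reported above.
for edr in (0.24, 0.05, 0.44):
    print(f"EDR = {edr:.2f} -> maximum false discovery risk = {soric_max_fdr(edr):.3f}")
```

With the rounded EDR of 24% the bound comes out just under 17%, close to the 16% reported above, which was presumably computed from the unrounded estimate. At the lower limit of the confidence interval (EDR = 5%), the bound reaches 100%.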
The false discovery risk decreases for more stringent criteria of statistical significance. A reasonable criterion for the false discovery risk is 5%. This criterion can be achieved by lowering alpha to .005. This is in line with suggestions to treat only p-values less than .005 as statistically significant (Benjamin et al., 2017). This leaves 35 significant results.
The analysis of Fritz Strack’s published results provides clear evidence that questionable research practices were used and that published significant results have a higher type-I error risk than 5%. More important, the actual discovery rate is low and implies that 16% of published results could be false positives. This explains some replication failures in large samples for Strack’s item-order and facial feedback studies. I recommend using alpha = .005 to evaluate Strack’s empirical findings. This leaves about a third of his discoveries as statistically significant results.
It is important to emphasize that Fritz Strack and colleagues followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, he did not commit research fraud.
It is nearly certain that I made some mistakes in the coding of Fritz Strack’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit. The data are openly available and the z-curve results can be reproduced with the z-curve package in R. Thus, this replicability audit is fully transparent and open to revision.
UPDATE 3/27/2018: Here is R-code to see how z-curve and p-curve work and to run the simulations used by Datacolada and to try other ones. (R-Code download)
The blog Datacolada is a joint blog by Uri Simonsohn, Leif Nelson, and Joe Simmons. Like this blog, Datacolada blogs about statistics and research methods in the social sciences with a focus on controversial issues in psychology. Unlike this blog, Datacolada does not have a comments section. However, this shouldn’t stop researchers from critically examining the content of Datacolada. As I have a comments section, I will first voice my concerns about a Datacolada blog post and then open the discussion to anybody who cares about estimating the average power of studies that reported a “discovery” in a psychology journal.
Estimating power is easy when all studies are honestly reported. In this ideal world, average power can be estimated by the percentage of significant results and with the median observed power (Schimmack, 2015). However, in reality not all studies are published and researchers use questionable research practices that inflate success rates and observed power. Currently two methods promise to correct for these problems and to provide estimates of the average power of studies that yielded a significant result.
Uri Simonsohn’s P-Curve has been in the public domain in the form of an app since January 2015. Z-Curve has been used to critique published studies and individual authors for low power in their published studies on blog posts since June 2015. Neither method has the stamp of approval of peer-review. P-Curve has been developed from Version 3.0 to Version 4.6 without presenting any simulations showing that the method works. It is simply assumed that the method works because it is built on a peer-reviewed method for the estimation of effect sizes. Jerry Brunner and I have developed four methods for the estimation of average power in a set of studies selected for significance, including z-curve and our own version of p-curve that estimates power and not effect sizes.
We have carried out extensive simulation studies and asked numerous journals to examine the validity of our simulation results. We also posted our results in a blog post and asked for comments. The fact that our work is still not published in 2018 does not reflect problems with our results. The reasons for rejection were mostly that it is not relevant to estimate average power of studies that have been published.
Respondents to an informal poll in the Psychological Methods Discussion Group mostly disagree and so do we.
There are numerous examples on this blog that show how this method can be used to predict that major replication efforts will fail (ego-depletion replicability report) or that claims about the way people (that is you and I) think in a popular book (Thinking: Fast and Slow) for a general audience (again that is you and me) by a Nobel Laureate are based on studies that were obtained with deceptive research practices.
The author, Daniel Kahneman, was as dismayed as I am by the realization that many published findings that are supposed to enlighten us have provided false facts and he graciously acknowledged this.
“I accept the basic conclusions of this blog. To be clear, I do so (1) without expressing an opinion about the statistical techniques it employed and (2) without stating an opinion about the validity and replicability of the individual studies I cited. What the blog gets absolutely right is that I placed too much faith in underpowered studies.” (Daniel Kahneman).
It is time to ensure that methods like p-curve and z-curve are vetted by independent statistical experts. The traditional way of closed peer review in journals that need to reject good work because for-profit publishers and organizations like APS need to earn money from selling print-copies of their journals has failed.
Therefore we ask statisticians and methodologists from any discipline that uses significance testing to draw inferences from empirical studies to examine the claims in our manuscript and to help us to correct any errors. If p-curve is the better tool for the job, so be it.
It is unfortunate that the comparison of p-curve and z-curve has become a public battle. In an idealistic world, scientists would not be attached to their ideas and would resolve conflicts in a calm exchange of arguments. What better field to reach consensus than math or statistics, where a true answer exists and can be revealed by means of mathematical proof or simulation studies?
However, the real world does not match the ideal world of science. Just like Uri Simonsohn is proud of p-curve, I am proud of z-curve and I want z-curve to do better. This explains why my attempt to resolve this conflict in private failed (see email exchange).
The main outcome of the failed attempt to find agreement in private was that Uri Simonsohn posted a blog on Datacolada with the bold claim “P-Curve Handles Heterogeneity Just Fine,” which contradicts the claims that Jerry and I made in the manuscript that I sent him before we submitted it for publication. So, not only did the private communication fail. Our attempt to resolve disagreement resulted in an open blog post that contradicted our claims. A few months later, this blog post was cited by the editor of our manuscript as a minor reason for rejecting our comparison of p-curve and z-curve.
Just to be clear, I know that the datacolada post that Nelson cites was posted after your paper was submitted and I’m not factoring your paper’s failure to anticipate it into my decision (after all, Bem was wrong). (Dan Simons, Editor of AMPPS)
Please remember, I shared a document and R-Code with simulations that document the behavior of p-curve. I had a very long email exchange with Uri Simonsohn in which I asked him to comment on our simulation results, which he never did. Instead, he wrote his own simulations to convince himself that p-curve works.
The tweet below shows that Uri is aware of the problem that statisticians can use statistical tricks, p-hacking, to make their methods look better than they are.
I will now demonstrate that Uri p-hacked his simulations to make p-curve look better than it is and to hide the fact that z-curve is the better tool for the job.
Critical Examination of Uri Simonsohn’s Simulation Studies
On the blog, Uri Simonsohn shows the Figure below, which was based on an example that I provided during our email exchange. The Figure shows the simulated distribution of true power. It also shows that the mean true power is 61%, whereas the p-curve estimate is 79%. Uri Simonsohn does not show the z-curve estimate. He also does not show what the distribution of observed t-values looks like. This is important because few readers are familiar with histograms of power and the fact that it is normal for power to pile up at 1 because 1 is the upper limit for power.
I used the R-Code posted on the Datacolada website to provide additional information about this example. Before I show the results it is important to point out that Uri Simonsohn works with a different selection model than Jerry and I. We verified that this has no implications for the performance of p-curve or z-curve, but it does have implications for the distribution of true power that we would expect in real data.
Selection for Significance 1: Jerry and I work with a simple model where researchers conduct studies, test for significance, and then publish the significant results. They may also publish the non-significant results, but they cannot be used to claim a discovery (of course, we can debate whether a significant result implies a discovery, but that is irrelevant here). We use z-curve to estimate the average power of those studies that produced a significant result. As power is the probability of obtaining a significant result, the average true power of significant results predicts the success rate in a set of exact replication studies. Therefore, we call this estimate an estimate of replicability.
Selection for Significance 2: The Datacolada team famously coined the term p-hacking. p-hacking refers to massive use of questionable research practices in order to produce statistically significant results. In an influential article, they created the impression that p-hacking allows researchers to get statistical significance in pretty much every study without a real effect (i.e., a false positive). If this were the case, researchers would not have failed studies hidden away like our selection model implies.
No File Drawers: Another Unsupported Claim by Datacolada
In the 2018 volume of Annual Review of Psychology (edited by Susan Fiske), the Datacolada team explicitly claims that psychology researchers do not have file drawers of failed studies.
There is an old, popular, and simple explanation for this paradox. Experiments that work are sent to a journal, whereas experiments that fail are sent to the file drawer (Rosenthal 1979). We believe that this “file-drawer explanation” is incorrect. Most failed studies are not missing. They are published in our journals, masquerading as successes.
They provide no evidence for this claim and ignore evidence to the contrary. For example, Bem (2011) pointed out that it is a common practice in experimental social psychology to conduct small studies so that failed studies can be dismissed as “pilot studies.” In addition, some famous social psychologists have stated explicitly that they have a file drawer of studies that did not work.
“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)
In response to replication failures, Kathleen Vohs acknowledged that a couple of studies with non-significant results were excluded from the manuscript submitted for publication that was published with only significant results.
(2) With regard to unreported studies, the authors conducted two additional money priming studies that showed no effects, the details of which were shared with us. (quote from Rohrer et al., 2015, who failed to replicate Vohs’s findings; see also Vadillo et al., 2016.)
Dan Gilbert and Timothy Wilson acknowledged that they did not publish non-significant results that they considered to be uninformative.
“First, it’s important to be clear about what “publication bias” means. It doesn’t mean that anyone did anything wrong, improper, misleading, unethical, inappropriate, or illegal. Rather it refers to the wellknown fact that scientists in every field publish studies whose results tell them something interesting about the world, and don’t publish studies whose results tell them nothing. Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know. Failed studies are often (though not always) inconclusive, which is why they are often (but not always) unpublishable. So yes, we had to mess around for a while to establish a paradigm that was sensitive and powerful enough to observe the effects that we had hypothesized.” (Gilbert and Wilson).
Bias analyses show some problems with the evidence for stereotype threat effects. In a radio interview, Michael Inzlicht reported that he had several failed studies that were not submitted for publication and he is now skeptical about the entire stereotype threat literature (conflict of interest: Mickey Inzlicht is a friend and colleague of mine who remains the only social psychologist who has published a critical self-analysis of his work before 2011 and is actively involved in reforming research practices in social psychology).
Steve Spencer also acknowledged that he has a file drawer with unsuccessful studies. In 2016, he promised to open his file-drawer and make the results available.
By the end of the year, I will certainly make my whole file drawer available for any one who wants to see it. Despite disagreeing with some of the specifics of what Uli says and certainly with his tone I would welcome everyone else who studies stereotype threat to make their whole file drawer available as well.
Nearly two years later, he hasn’t followed through on this promise (how big can it be? LOL).
Although this anecdotal evidence makes it clear that researchers have file drawers with non-significant results, it remains unclear how large file-drawers are and how often researchers p-hacked null-effects to significance (creating false positive results).
The Influence of P-Hacking on the Distribution of True Power and Observed Test-Statistics
Z-Curve, but not p-curve, can address this question to some extent because p-hacking influences the probability that a low-powered study will be published. A simple selection model with alpha = .05 implies that only 1 out of 20 studies of a true null effect produces a significant (false positive) result and will be included in the set of studies with significant results. In contrast, extreme p-hacking implies that every study of a true null effect (20 out of 20) will be included in the set of studies with significant results.
To illustrate the implications of selection for significance versus p-hacking, it is instructive to examine the distribution of observed significant results based on the simulated distribution of true power in Figure 1.
Figure 2 shows the distribution assuming that all studies will be p-hacked to significance. P-hacking can influence the observed distribution, but I am assuming a simple p-hacking model that is statistically equivalent to optional stopping with small samples. Just keep repeating the experiment (with minor variations that do not influence power to deceive yourself that you are not p-hacking) and stop when you have a significant result.
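The "repeat until significant" model can be mimicked with a small simulation. The Python sketch below is an illustration under simplifying assumptions and is not the R code behind the figures: each attempt is treated as an independent study that reaches significance with a probability equal to its true power, and the function names are invented for this example.

```python
import random

def attempts_until_significant(power: float, rng: random.Random) -> int:
    """Number of independent attempts needed to obtain one significant result."""
    attempts = 1
    while rng.random() >= power:  # this attempt failed, so run the study again
        attempts += 1
    return attempts

def mean_attempts(power: float, n_lines: int = 20000, seed: int = 1) -> float:
    """Average number of attempts per published result for a given true power."""
    rng = random.Random(seed)
    total = sum(attempts_until_significant(power, rng) for _ in range(n_lines))
    return total / n_lines

for power in (0.05, 0.50, 0.80):
    print(f"true power = {power:.2f}: about {mean_attempts(power):.1f} attempts per published result")
```

Under this model every line of research ends with a significant result, so the published set keeps the full distribution of true power, including studies with power as low as alpha. The price is paid in hidden attempts, which average 1/power per published result.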
The histogram of t-values looks very similar to a histogram of z-scores because t-values with df = 98 are approximately normally distributed. As all studies were p-hacked, all studies are significant with qt(.975,98) = 1.98 as criterion value. However, some studies have strong evidence against the null-hypothesis with t-values greater than 6. The huge pile of t-values just above the criterion value of 1.98 occurs because all low powered studies became significant.
The distribution in Figure 3 looks different than the distribution in Figure 2.
Now there are numerous non-significant results and even a few significant results with the opposite sign of the true effect (t < -1.98). For the estimation of replicability only the results that reached significance are relevant, if only for the reason that they are the only results that are published (success rates in psychology are above 90%; Sterling, 1959, see also real data later on). To compare the distributions it is more instructive to select only the significant results in Figure 3 and to compare the densities in Figures 2 and 3.
The graph in Figure 4 shows that p-hacking results in more just significant results with t-values between 2 and 2.5 than mere publication bias does. The reason is that the significance filter of alpha = .05 eliminates false positives and low powered true effects. As a result the true power of studies that produced significant results is higher in the set of studies that were selected for significance. The average true power of the honest significant results without p-hacking is 80% (as seen in Figure 1, the average power for the p-hacked studies in red is 61%).
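The reason why the significance filter raises average true power can be shown with simple arithmetic. A study with true power p passes the filter with probability p, so the studies that pass are weighted by their power. The Python sketch below uses five made-up power values, not the beta distribution from Figure 1, and the function names are chosen for this example only.

```python
def mean_power_all_studies(powers):
    """Average true power when every study ends up significant (extreme p-hacking)."""
    return sum(powers) / len(powers)

def mean_power_after_selection(powers):
    """Average true power of the studies that pass an honest significance filter.

    Each study is selected with probability equal to its power, so the
    selected set is a power-weighted sample: sum(p^2) / sum(p).
    """
    return sum(p * p for p in powers) / sum(powers)

# Hypothetical true power values, chosen only for illustration.
powers = [0.05, 0.20, 0.50, 0.80, 0.95]
print(f"all studies significant:    {mean_power_all_studies(powers):.3f}")
print(f"selected for significance:  {mean_power_after_selection(powers):.3f}")
```

With these values the average rises from .50 to about .73, the same direction as the shift from 61% to 80% described above.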
With real data, the distribution of true power is unknown. Thus, it is unknown how much p-hacking occurred. For the reader of a journal that reports only significant results, it is also irrelevant whether p-hacking occurred. A result may be reported because 10 similar studies tested a single hypothesis or 10 conceptual replication studies produced 1 significant result. In either scenario, the reported significant result provides weak evidence for an effect if the significant result occurred with low power.
It is also important to realize (and it took Jerry and me some time to convince ourselves with simulations that this is actually true) that p-curve and z-curve estimates do not depend on the selection mechanism. The only information that matters is the true power of studies and not how studies were selected. To illustrate this fact, I also used p-curve and z-curve to estimate the average power of the t-values without p-hacking (blue distribution in Figure 4). P-Curve again overestimates true power. While average true power is 80%, the p-curve estimate is 94%.
In conclusion, the datacolada blog post did present one out of several examples that I provided and that were included in the manuscript that I shared with Uri. The Datacolada post correctly showed that z-curve provides good estimates of the average true power and that p-curve produces inflated estimates.
I elaborated on this example by pointing out the distinction between p-hacking (all studies are significant) and selection for significance (e.g., due to publication bias or in assessing replicability of published results). I showed that z-curve produces the correct estimates with and without p-hacking because the selection process does not matter. The only consequence of p-hacking is that more low-powered studies become significant because it undermines the function of the significance filter to prevent studies with weak evidence from entering the literature.
In conclusion, the actual blog post shows that p-curve can be severely biased when data are heterogeneous, which contradicts the title that P-Curve handles heterogeneity just fine.
When The Shoe Doesn’t Fit, Cut Off Your Toes
To rescue p-curve and to justify the title, Uri Simonsohn suggests that the example that I provided is unrealistic and that p-curve performs as well or better in simulations that are more realistic. He does not mention that I also provided real world examples in my article that showed better performance of z-curve with real data.
So, the real issue is not whether p-curve handles heterogeneity well (it does not). The real issue is now how much heterogeneity we should expect.
Figure 5 shows what Uri Simonsohn considers to be realistic data. The distribution of true power uses the same beta distribution as the distribution in Figure 1, but instead of scaling it from the lowest possible value (alpha = 5%) to the highest possible value (1 - 1/infinity), it scales power from alpha to a maximum of 80%. For readers less familiar with power, a value of 80% implies that researchers deliberately plan studies with a 20% probability of ending up with a false negative result (i.e., the effect exists, but the evidence is not strong enough, p > .05).
The labeling in the graph implies that studies with more than the recommended 80% power, including 81% power, are considered to have extremely high power (again, with a 20% risk of a false negative result). The graph also shows that p-curve provided an unbiased estimate of true average power despite (extreme) heterogeneity in true power between 5% and 80%.
Figure 6 shows the histogram of observed t-values based on a simulation in which all studies in Figure 5 are p-hacked to get significance. As p-hacking inflates all t-values to meet the minimum value of 1.98, and truncation of power to values below 80% removes high t-values, 92% of t-values are within the limited range from 1.98 to 4. A crude measure of heterogeneity is the variance of t-values, which is 0.51. With N = 100, a t-distribution is just a little bit wider than the standard normal distribution, which has a standard deviation of 1. Thus, the small variance of 0.51 indicates that these data have low variability.
The histogram of observed t-values and the variance in these observed t-values makes it possible to quantify heterogeneity in true power. In Figure 2, heterogeneity was high (Var(t) = 1.56) and p-curve overestimated average true power. In Figure 6, heterogeneity is low (Var(t) = 0.51) and p-curve provided accurate estimates. This finding suggests that estimation bias in p-curve is linked to the distribution and variance in observed t-values, which reflects the distribution and variance in true power.
When the data are not simulated, test statistics can come from different tests with different degrees of freedom. In this case, it is necessary to convert all test statistics into z-scores so that strength of evidence is measured in a common metric. In our manuscript, we used the variance of z-scores to quantify heterogeneity and showed that p-curve overestimates when heterogeneity is high.
In conclusion, Uri Simonsohn demonstrated that p-curve can produce accurate estimates when the range of true power is arbitrarily limited to values below 80% power. He suggests that this is reasonable because having more than 80% power is extremely high power and rare.
Thus, there is no disagreement between Uri Simonsohn and us when it comes to the statistical performance of p-curve and z-curve. P-curve overestimates when power is not truncated at 80%. The only disagreement concerns the amount of actual variability in real data.
What is realistic?
Jerry and I are both big fans of Jacob Cohen, who has made invaluable contributions to psychology as a science, including his attempt to introduce psychologists to Neyman-Pearson’s approach to statistical inferences that avoids many of the problems of Fisher’s approach that dominates statistics training in psychology to this day.
The concept of statistical power requires that researchers formulate an alternative hypothesis, which requires specifying an expected effect size. To facilitate this task, Cohen developed standardized effect sizes. For example, Cohen’s d standardizes a mean difference (e.g., height difference between men and women in centimeters) by the standard deviation. As a result, the effect size is independent of the unit of measurement and is expressed in terms of percentages of a standard deviation. Cohen provided rough guidelines about the size of effect sizes that one could expect in psychology.
It is now widely accepted that most effect sizes are in the range between 0 and 1 standard deviation. It is common to refer to effect sizes of d = .2 (20% of a standard deviation) as small, d = .5 as medium, and d = .8 as large.
True power is a function of effect size and sampling error. In a between-subject study, sampling error is a function of sample size, and most sample sizes in between-subject designs fall into a range from 40 to 200 participants, although sample sizes have been increasing somewhat in response to the replication crisis. With N = 40 to 200, sampling error ranges from 0.14 (2/sqrt(200)) to 0.32 (2/sqrt(40)).
The non-central t-values are simply the ratio of standardized effect sizes and sampling error of standardized measures. At the lowest end, effect sizes of 0 have a non-central t-value of 0 (0/.14 = 0; 0/.32 = 0). At the upper end, a large effect size of .8 obtained in the largest sample (N = 200) yields a t-value of .8/.14 = 5.71. While smaller non-central t-values than 0 are not possible, larger non-central t-values can occur in some studies. Either the effect size is very large or sampling error is smaller. Smaller sampling errors are especially likely when studies use covariates, within-subject designs or one-sample t-tests. For example, a moderate effect size (d = .5) in a within-subject design with 90% fixed error variance (r = .9), yields a non-central t-value of 11.
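These numbers are easy to reproduce. The Python sketch below is an illustration, not code from the manuscript. The between-subject part uses the 2/sqrt(N) approximation from the text. The within-subject part rests on two assumptions of mine that the text does not state: a common textbook formula in which difference scores have a standard deviation of sqrt(2 * (1 - r)), and a sample size of N = 100.

```python
import math

def se_between(n_total: int) -> float:
    """Approximate sampling error of d with two equal groups: 2 / sqrt(N)."""
    return 2.0 / math.sqrt(n_total)

def ncp_between(d: float, n_total: int) -> float:
    """Non-central t-value: standardized effect size divided by sampling error."""
    return d / se_between(n_total)

def ncp_within(d: float, n: int, r: float) -> float:
    """Within-subject design (assumed formula, see lead-in).

    d is in raw-score SD units; difference scores have SD sqrt(2 * (1 - r)).
    """
    return d * math.sqrt(n) / math.sqrt(2.0 * (1.0 - r))

print(f"sampling error, N = 200: {se_between(200):.2f}")
print(f"sampling error, N = 40:  {se_between(40):.2f}")
print(f"d = .8, N = 200 (between): {ncp_between(0.8, 200):.2f}")
print(f"d = .5, N = 100, r = .9 (within, N assumed): {ncp_within(0.5, 100, 0.9):.2f}")
```

Without rounding the sampling error to .14, the between-subject value is about 5.66 rather than 5.71. Under the stated assumptions, the within-subject value is about 11.2, in line with the value of 11 mentioned above.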
A simple way to simulate data that are consistent with these well-known properties of results in psychology is to assume that the average effect size is half a standard deviation (d = .5) and to model variability in true effect sizes with a normal distribution with a standard deviation of SD = .2. Accordingly, 95% of effect sizes would fall into the range from d = .1 to d = .9. Sample sizes can be modeled with a simple uniform distribution (equal probability) from N = 40 to 200.
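The simulation described in this paragraph can be sketched in Python as follows. This is an approximation and not the R code that produced the figures: power is computed with a normal approximation instead of the non-central t-distribution, the sampling error is 2/sqrt(N) as above, and the function names, the seed, and the number of simulated studies are arbitrary choices for this example.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
CRIT = STD_NORMAL.inv_cdf(0.975)  # two-tailed criterion for alpha = .05

def power_two_sided(ncp: float) -> float:
    """Normal-approximation power of a two-tailed test with non-centrality ncp."""
    return STD_NORMAL.cdf(ncp - CRIT) + STD_NORMAL.cdf(-ncp - CRIT)

def simulate_true_power(k: int = 20000, seed: int = 1) -> list:
    """True power of k studies with d ~ Normal(.5, .2) and N ~ Uniform(40, 200)."""
    rng = random.Random(seed)
    powers = []
    for _ in range(k):
        d = rng.gauss(0.5, 0.2)           # true effect size
        n = rng.randint(40, 200)          # total sample size
        ncp = d * math.sqrt(n) / 2.0      # effect size / sampling error
        powers.append(power_two_sided(ncp))
    return powers

powers = simulate_true_power()
share_high = mean(1.0 if p >= 0.80 else 0.0 for p in powers)
print(f"mean true power: {mean(powers):.2f}")
print(f"share of studies with 80% power or more: {share_high:.2f}")
```

Because of the normal approximation and a different random number stream, the output will not match the figures exactly, but it should show a substantial share of studies with 80% power or more.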
Converting the non-centrality parameters to power with p < .05 shows that many values fall into the region from .80 to 1 that Uri Simonsohn called extremely high power. The graph shows that it does not require extremely large effect sizes (d > 1) or large samples (N > 200) to conduct studies with 80% power or more. Of course, the percentage of studies with 80% power or more depends on the distribution of effect sizes, but it seems questionable to assume that studies rarely have 80% power.
The mean true power is 66% (I guess you see where this is going).
This is the distribution of the observed t-values. The variance is 1.21 and 23% of the t-values are greater than 4. The z-curve estimate is 66% and the p-curve estimate is 83%.
In conclusion, a simulation that starts with knowledge about effect sizes and sample sizes in psychological research shows that it is misleading to call 80% power or more extremely high power that is rarely achieved in actual studies. It is likely that real datasets will include studies with more than 80% power and that this will lead p-curve to overestimate average power.
A comparison of P-Curve and Z-Curve with Real Data
The point of fitting p-curve and z-curve to real data is not to validate the methods. The methods have been validated in simulation studies that show good performance of z-curve and poor performance of p-curve when heterogeneity is high.
The only question remains how biased p-curve is with real data. Of course, this depends on the nature of the data. It is therefore important to remember that the Datacolada team proposed p-curve as an alternative to Cohen’s (1962) seminal study of power in the 1960 issue of the Journal of Abnormal and Social Psychology.
“Estimating the publication-bias corrected estimate of the average power of a set of studies can be useful for at least two purposes. First, many scientists are intrinsically interested in assessing the statistical power of published research (see e.g., Button et al., 2013; Cohen, 1962; Rossi, 1990; Sedlmeier & Gigerenzer, 1989).
There have been two recent attempts at estimating the replicability of results in psychology. One project conducted 100 actual replication studies (Open Science Collaboration, 2015). A more recent project examined the replicability of social psychology using a larger set of studies and statistical methods to assess replicability (Motyl et al., 2017).
The authors sampled articles from four journals (the Journal of Personality and Social Psychology, Personality and Social Psychology Bulletin, Journal of Experimental Social Psychology, and Psychological Science) and four years (2003, 2004, 2013, and 2014). They randomly sampled 543 articles that contained 1,505 studies. For each study, a coding team picked one statistical test that tested the main hypothesis. The authors converted test-statistics into z-scores and showed histograms for the years 2003-2004 and 2013-2014 to examine changes over time. The results were similar.
The histograms show clear evidence that non-significant results are missing either due to p-hacking or publication bias. The authors did not use p-curve or z-curve to estimate the average true power. I used these data to examine the performance of z-curve and p-curve. I selected only tests that were coded as ANOVAs (k = 751) or t-tests (k = 232). Furthermore, I excluded cases with very large test statistics (> 100) and experimenter degrees of freedom (10 or more). For participant degrees of freedom, I excluded values below 10 and above 1000. This left 889 test statistics. The test statistics were converted into z-scores. The variance of the significant z-scores was 2.56. However, this is due to a long tail of z-scores with a maximum value of 18.02. The variance of z-scores between 1.96 and 6 was 0.83.
Fitting z-curve to all significant z-scores yielded an estimate of 45% average true power. The p-curve estimate was 78% (90%CI = 75;81). This finding is not surprising given the simulation results and the variance in the Motyl et al. data.
One possible solution to this problem could be to modify p-curve in the same way that z-curve only models z-scores between 1.96 and 6 and treats all z-scores above 6 as having power = 1. The z-curve estimate is then adjusted by the proportion of extreme z-scores.
Using the same approach with p-curve does help to reduce the bias in p-curve estimates, but p-curve still produces a much higher estimate than z-curve, namely 63% (90%CI = 58;67). This is still nearly 20 percentage points higher than the z-curve estimate.
In response to these results, Leif Nelson argued that the problem is not with p-curve, but with the Motyl et al. data.
A detailed examination of datacolada 60 will be the subject of another open discussion about Datacolada. Here it is sufficient to point out that Nelson’s strong claim that Motyl et al.’s data are “clearly invalid” is not based on empirical evidence. It is based on disagreement about the coding of 10 out of over 1,500 tests (0.67%). Moreover, it is wrong to label these disagreements mistakes because there is no right or wrong way to pick one test from a set of tests.
In conclusion, the Datacolada team has provided no evidence to support their claim that my simulations are unrealistic. In contrast, I have demonstrated that their truncated simulation does not match reality. Their only defense is now that I cherry-picked data that make z-curve look good. However, a simulation with realistic assumptions about effect sizes and sample sizes also shows large heterogeneity and p-curve fails to provide reasonable estimates.
The fact that sometimes p-curve is not biased is not particularly important because z-curve provides practically useful estimates in these scenarios as well. So, the choice is between one method that gets it right sometimes and another method that gets it right all the time. Which method would you choose?
It is important to point out that z-curve shows some small systematic bias in some situations. The bias is typically about 2 percentage points. We developed a conservative 95%CI to address this problem and demonstrated that this 95% confidence interval has good coverage under these conditions and is conservative in situations when z-curve is unbiased. The good performance of z-curve is the result of several years of development. Not surprisingly, it works better than a method that has never been subjected to stress-tests by the developers.
Z-curve has many additional advantages over p-curve. First, z-curve is a model for heterogeneous data. As a result, it is possible to develop methods that can quantify the amount of variability in power while correcting for selection bias. Second, heterogeneity implies that power varies across studies. As studies with higher power tend to produce larger z-scores, it is possible to provide corrected power estimates for sets of z-values. For example, the average power of just significant results (z < 2.5) could be very low.
Although these new features are still under development, first tests show promising results. For example, the local power estimates for Motyl et al. suggest that test statistics with z-scores below 2.5 (p = .012) have only 26% power and even those between 2.5 and 3.0 (p = .0026) have only 34% power. Moreover, test statistics between 1.96 and 3 account for two-thirds of all test statistics. This suggests that many published results in social psychology will be difficult to replicate.
The problem with fixed-effect models like p-curve is that the average may be falsely generalized to individual studies. Accordingly, an average estimate of 45% might be misinterpreted as evidence that most findings are replicable and that replication studies with a little bit more power would be able to replicate most findings. However, this is not the case (OSC, 2015). In reality, there are many studies with low power that are difficult to replicate and relatively few studies with very high power that are easy to replicate. Averaging across these studies gives the wrong impression that all studies have moderate power. Thus, p-curve estimates may be misinterpreted easily because p-curve ignores heterogeneity in true power.
In the datacolada 67 blog post, the Datacolada team tried to defend p-curve against evidence that p-curve fails when data are heterogeneous. It is understandable that authors are defensive about their methods. In this comment on the blog post, I tried to reveal flaws in Uri’s arguments and to show that z-curve is indeed a better tool for the job. However, I am just as motivated to promote z-curve as the Datacolada team is to promote p-curve.
To address this problem of conflict of interest and motivated reasoning, it is time for third parties to weigh in. Neither method has been vetted by traditional peer-review because editors didn’t see any merit in p-curve or z-curve, but these methods are already being used to make claims about replicability. It is time to make sure that they are used properly. So, please contribute to the discussion about p-curve and z-curve in the comments section. Even if you simply have a clarification question, please post it.
“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).
The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.
I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science.
Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).
Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers and provide guidelines how to evaluate their published findings (Schimmack, 2021).
Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).
If you are interested in my story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).
Replicability rankings of psychology journals differ from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics in the results sections of empirical articles to estimate the average power of statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability to produce a significant result in an exact replication study and a lower probability of being false-positive results.
The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be used to interpret a result as evidence for an effect and against the null-hypothesis. Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.
The average power across the 105 psychology journals used for this ranking is 70%. This means that a representative sample of significant results in exact replication studies is expected to produce 70% significant results. The rankings for 2015 show variability across journals with average power estimates ranging from 84% to 54%. A factor analysis of annual estimates for 2010-2015 showed that random year-to-year variability accounts for 2/3 of the variance and that 1/3 is explained by stable differences across journals.
The Journal Names are linked to figures that show the powergraphs of a journal for the years 2010-2014 and 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and power estimates that estimate power including non-significant results even if these are not reported (the file-drawer).
Old post on rankings of social psychology research at 100 Psychology Departments
This post provides the first analysis of replicability for individual departments. The table focuses on social psychology and the results cannot be generalized to other research areas in the same department. An explanation of the rationale and methodology of replicability rankings follows in the text below the table.
Replicability scores of 50% and less are considered inadequate (grade F). The reason is that less than 50% of the published results are expected to produce a significant result in a replication study, and with less than 50% successful replications, the most rational approach is to treat all results as false because it is unclear which results would replicate and which results would not replicate.
RATIONALE AND METHODOLOGY
University rankings have become increasingly important in science. Top ranking universities use these rankings to advertise their status. The availability of a single number of quality and distinction creates pressures on scientists to meet criteria that are being used for these rankings. One key criterion is the number of scientific articles that are being published in top ranking scientific journals under the assumption that these impact factors of scientific journals track the quality of scientific research. However, top ranking journals place a heavy premium on novelty without ensuring that novel findings are actually true discoveries. Many of these high-profile discoveries fail to replicate in actual replication studies. The reason for the high rate of replication failures is that scientists are rewarded for successful studies, while there is no incentive to publish failures. The problem is that many of these successful studies are obtained with the help of luck or questionable research methods. For example, scientists do not report studies that fail to support their theories. The problem of bias in published results has been known for a long time (Sterling, 1959). However, few researchers were aware of the extent of the problem. New evidence suggests that more than half of published results provide false or extremely biased evidence. When more than half of published results are not credible, a science loses its credibility because it is not clear which results can be trusted and which results provide false information.
The credibility and replicability of published findings varies across scientific disciplines (Fanelli, 2010). More credible sciences are more willing to conduct replication studies and to revise original evidence. Thus, it is inappropriate to make generalized claims about the credibility of science. Even within a scientific discipline credibility and replicability can vary across sub-disciplines. For example, results from cognitive psychology are more replicable than results from social psychology. The replicability of social psychological findings is extremely low. Despite an increase in sample size, which makes it easier to obtain a significant result in a replication study, only 5 out of 38 replication studies produced a significant result. If the replication studies had used the same sample sizes as the original studies, only 3 out of 38 results would have replicated, that is, produced a significant result in the replication study. Thus, most published results in social psychology are not trustworthy.
There have been mixed reactions by social psychologists to the replication crisis in social psychology. On the one hand, prominent leaders of the field have defended the status quo with the following arguments.
1 – The experimenters who conducted the replication studies are incompetent (Bargh, Schnall, Gilbert).
2 – A mysterious force makes effects disappear over time (Schooler).
3 – A statistical artifact (regression to the mean) will always make it harder to find significant results in a replication study (Fiedler).
4 – It is impossible to repeat social psychological studies exactly and a replication study is likely to produce different results than an original study (the hidden moderator) (Schwarz, Strack).
These arguments can be easily dismissed because they do not explain why cognitive psychologists and other scientific disciplines have more successful replications and more failed results. The real reason for the low replicability of social psychology is that social psychologists conduct many, relatively cheap studies that often fail to produce the expected results. They then conduct exploratory data analyses to find unexpected patterns in the data or they simply discard the study and publish only studies that support a theory that is consistent with the data (Bem). This hazardous approach to science can produce false positive results. For example, it allowed Bem (2011) to publish 9 significant results that seemed to show that humans can foresee unpredictable outcomes in the future. Some prominent social psychologists defend this approach to science.
“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister,)
The lack of rigorous scientific standards also allowed Diederik Stapel, a prominent social psychologist, to fabricate data, which led to over 50 retractions of scientific articles. The commission that investigated Stapel came to the conclusion that he was only able to publish so many fake articles because social psychology is a “sloppy science,” where cute findings and sexy stories count more than empirical facts.
Social psychology faces a crisis of confidence. While social psychology tried hard to convince the general public that it is a real science, it actually failed to follow standard norms of science to ensure that social psychological theories are based on objective replicable findings. Social psychology therefore needs to reform its practices if it wants to be taken seriously as a scientific field that can provide valuable insights into important questions about human nature and human behavior.
There are many social psychologists who want to improve scientific standards. For example, the head of the OSF-reproducibility project, Brian Nosek, is a trained social psychologist. Mickey Inzlicht published a courageous self-analysis that revealed problems in some of his most highly cited articles and changed the way his lab is conducting studies to improve social psychology. Incoming editors of social psychology journals are implementing policies to increase the credibility of results published in their journals (Simine Vazire; Roger Giner-Sorolla). One problem for social psychologists willing to improve their science is that the current incentive structure does not reward replicability. The reason is that it is possible to count number of articles and number of citations, but it seems difficult to quantify replicability and scientific integrity.
To address this problem, Jerry Brunner and I developed a quantitative measure of replicability. The replicability-score uses published statistical results (p-values) and transforms them into absolute z-scores. The distribution of z-scores provides information about the statistical power of a study given the sample size, design, and observed effect size. Most important, the method takes publication bias into account and can estimate the true typical power of published results. It also reveals the presence of a file-drawer of unpublished failed studies, if the published studies contain more significant results than the actual power of studies allows. The method is illustrated in the following figure that is based on t- and F-tests published in the most important journals that publish social psychology research.
The green curve in the figure illustrates the distribution of z-scores that would be expected if a set of studies had 54% power. That is, random sampling error will sometimes inflate the observed effect size and sometimes deflate the observed effect size in a sample relative to the population effect size. With 54% power, there would be 46% (1 – .54 = .46) non-significant results because the study had insufficient power to demonstrate an effect that actually exists. The graph shows that the green curve fails to describe the distribution of observed z-scores. On the one hand, there are more extremely high z-scores. This reveals that the set of studies is heterogeneous. Some studies had more than 54% power and others had less than 54% power. On the other hand, there are fewer non-significant results than the green curve predicts. This discrepancy reveals that non-significant results are omitted from the published reports.
Given the heterogeneity of true power, the red curve is more appropriate. It provides the best fit to the observed z-scores that are significant (z-scores > 2). It does not model the z-scores below 2 because non-significant z-scores are not reported. The red-curve gives a lower estimate of power and shows a much larger file-drawer.
I limit the power analysis to z-scores in the range from 2 to 4. The reason is that z-scores greater than 4 imply very high power (> 99%). In fact, many of these results tend to replicate well. However, many theoretically important findings are published with z-scores less than 4 as evidence. These z-scores do not replicate well. If social psychology wants to improve its replicability, social psychologists need to conduct fewer studies with more statistical power that yield stronger evidence and they need to publish all studies to reduce the file-drawer.
To provide an incentive to increase the scientific standards in social psychology, I computed the replicability-score (homogeneous model for z-scores between 2 and 4) for different journals. Journal editors can use the replicability rankings to demonstrate that their journal publishes replicable results. Here I report the first rankings of social psychology departments. To rank departments, I searched the database of articles published in social psychology journals for the affiliation of articles’ authors. The rankings are based on the z-scores of these articles published in the years 2010 to 2014. I also conducted an analysis for the year 2015. However, the replicability scores were uncorrelated with those in 2010-2014 (r = .01). This means that the 2015 results are unreliable because the analysis is based on too few observations. As a result, the replicability rankings of social psychology departments cannot reveal recent changes in scientific practices. Nevertheless, they provide a first benchmark to track replicability of psychology departments. This benchmark can be used by departments to monitor improvements in scientific practices and can serve as an incentive for departments to create policies and reward structures that reward scientific integrity over quantitative indicators of publication output and popularity. Replicability is only one aspect of high-quality research, but it is a necessary one. Without sound empirical evidence that supports a theoretical claim, discoveries are not real discoveries.
The OSF-Reproducibility-Project examined the replicability of 99 statistical results published in three psychology journals. The journals covered mostly research in cognitive psychology and social psychology. An article in Science reported that only 35% of the results were successfully replicated (i.e., produced a statistically significant result in the replication study).
I have conducted more detailed analyses of replication studies in social psychology and cognitive psychology. Cognitive psychology had a notably higher success rate (50%, 19 out of 38) than social psychology (8%, 3 out of 38). The main reason for this discrepancy is that social psychologists and cognitive psychologists use different designs. Whereas cognitive psychologists typically use within-subject designs with many repeated measurements of the same individual, social psychologists typically assign participants to different groups and compare behavior on a single measure. This so-called between-subject design makes it difficult to detect small experimental effects because it does not control the influence of other factors that influence participants’ behavior (e.g., personality dispositions, mood, etc.). To detect small effects in these noisy data, between-subject designs require large sample sizes.
It has been known for a long time that sample sizes in between-subject designs in psychology are too small to have a reasonable chance to detect an effect (less than 50% chance to find an effect that is actually there) (Cohen, 1962; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989). As a result, many studies fail to find statistically significant results, but these studies are not submitted for publication. Thus, only studies that achieved statistical significance with the help of chance (the difference between two groups is inflated by uncontrolled factors such as personality) are reported in journals. The selective reporting of lucky results creates a bias in the published literature that gives a false impression of the replicability of published results. The OSF-results for social psychology make it possible to estimate the consequences of publication bias on the replicability of results published in social psychology journals.
A naïve estimate of the replicability of studies would rely on the actual success rate in journals. If journals published significant and non-significant results, this would be a reasonable approach. However, journals tend to publish exclusively significant results. As a result, the success rate in journals (over 90% significant results; Sterling, 1959; Sterling et al., 1995) gives a drastically inflated estimate of replicability.
A somewhat better estimate of replicability can be obtained by computing post-hoc power based on the observed effect sizes and sample sizes of published studies. Statistical power is the long-run probability that a series of exact replication studies with the same sample size would produce significant results. Cohen (1962) estimated that the typical power of psychological studies is about 60%. Thus, even for 100 studies that all reported significant results, only 60 are expected to produce a significant result again in the replication attempt.
The problem with Cohen’s (1962) estimate of replicability is that post-hoc-power analysis uses the reported effect sizes as an estimate of the effect size in the population. However, due to the selection bias in journals, the reported effect sizes and power estimates are inflated. In collaboration with Jerry Brunner, I have developed an improved method to estimate typical power of reported results that corrects for the inflation in reported effect sizes. I applied this method to results from 38 social psychology articles included in the OSF-reproducibility project and obtained a replicability estimate of 35%.
The OSF-reproducibility project provides another opportunity to estimate the replicability of results in social psychology. The OSF-project selected a representative set of studies from two journals and tried to reproduce the same experimental conditions as closely as possible. This should produce unbiased results and the success rate provides an estimate of replicability. The advantage of this method is that it does not rely on statistical assumptions. The disadvantage is that the success rate depends on the ability to exactly recreate the conditions of the original studies. Any differences between studies (e.g., recruiting participants from different populations) can change the success rate. The OSF replication studies also often changed the sample size of the replication study, which will also change the success rate. If sample sizes in a replication study are larger, power increases and the success rate no longer can be used as an estimate of the typical replicability of social psychology. To address this problem, it is possible to apply a statistical adjustment and use the success rate that would have occurred with the original sample sizes. I found that 5 out of 38 (13%) produced significant results and after correcting for the increase in sample size, replicability was only 8% (3 out of 38).
One important question is how representative the 38 results from the OSF-project are for social psychology in general. Unfortunately, it is practically impossible and too expensive to conduct a large number of exact replication studies. In comparison, it is relatively easy to apply post-hoc power analysis to a large number of statistical results reported in social psychology. Thus, I examined the representativeness of the OSF-reproducibility results by comparing the results of my post-hoc power analysis based on the 38 results in the OSF to a post-hoc-power analysis of a much larger number of results reported in major social psychology journals.
I downloaded articles from 12 social psychology journals, which are the primary outlets for publishing experimental social psychology research: Basic and Applied Social Psychology, British Journal of Social Psychology, European Journal of Social Psychology, Journal of Experimental Social Psychology, Journal of Personality and Social Psychology: Attitudes and Social Cognition, Journal of Personality and Social Psychology: Interpersonal Relationships and Group Processes, Journal of Social and Personal Relationships, Personal Relationships, Personality and Social Psychology Bulletin, Social Cognition, Social Psychological and Personality Science, Social Psychology.
I converted pdf files into text files and searched for all reports of t-tests or F-tests and converted the reported test-statistic into exact two-tailed p-values. The two-tailed p-values were then converted into z-scores by finding the z-score corresponding to the probability of 1-p/2, with p equal to the two-tailed p-value. The total number of z-scores included in the analysis is 134,929.
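The conversion from a two-tailed p-value to an absolute z-score can be illustrated with Python's standard library. This is a minimal sketch of the last step only (p to z); the function names are made up for this example, and the step from t- or F-values to p-values is not shown because it requires the t- and F-distributions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_to_z(p_two_tailed):
    """Absolute z-score for a two-tailed p-value.

    By symmetry, the quantile at 1 - p/2 equals minus the quantile at p/2;
    the second form is numerically more stable for very small p-values.
    """
    return -STD_NORMAL.inv_cdf(p_two_tailed / 2)


def z_to_p(z):
    """Two-tailed p-value for an absolute z-score."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


for p in (0.05, 0.01, 0.001):
    print(f"p = {p:<6} -> z = {p_to_z(p):.2f}")
```

A p-value of exactly .05 maps onto z = 1.96, which is why z-scores of about 2 mark the boundary of statistical significance.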
I limited my estimate of power to z-scores in the range between 2 and 4. Z-scores below 2 are not statistically significant (z = 1.96, p = .05). Sometimes these results are reported as marginal evidence for an effect, sometimes they are reported as evidence that an effect is not present, and sometimes they are reported without an inference about the population effect. It is more important to determine the replicability of results that are reported as statistically significant support for a prediction. Z-scores greater than 4 were excluded because z-scores greater than 4 imply that this test had high statistical power (> 99%). Many of these results replicated successfully in the OSF-project. Thus, a simple rule is to assign a success rate of 100% to these findings. The Figure below shows the distribution of z-scores in the range from z = 0 to 6, but the power estimate is applied to z-scores in the range between 2 and 4 (n = 66,212).
The power estimate based on the post-hoc-power curve for z-scores between 2 and 4 is 46%. It is important to realize that this estimate is based on 70% of all significant results that were reported. As z-scores greater than 4 essentially have a power of 100%, the overall power estimate for all statistical tests that were reported is .46*.70 + .30 = .62. It is also important to keep in mind that this analysis uses all statistical tests that were reported including manipulation checks (e.g., pleasant pictures were rated as more pleasant than unpleasant pictures). For this reason, the range of z-scores is limited to values between 2 and 4, which is much more likely to reflect a test of a focal hypothesis.
46% power for z-scores between 2 and 4 is a higher estimate than the estimate for the 38 studies in the OSF-reproducibility project (35%). This suggests that the estimated replicability based on the OSF-results is an underestimation of the true replicability. The discrepancy between predicted and observed replicability in social psychology (8 vs. 38) and cognitive psychology (50 vs. 75), suggests that the rate of actual successful replications is about 20 to 30 percentage points lower than the success rate based on statistical prediction. Thus, the present analysis suggests that actual replication attempts of results in social psychology would produce significant results in about a quarter of all attempts (46% – 20% = 26%).
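The adjustment for extreme z-scores is a simple weighted average, which can be written out as follows. This is only a sketch of the arithmetic in the preceding paragraph; the function name and arguments are made up for this example.

```python
def overall_power(power_modeled, share_modeled, power_extreme=1.0):
    """Weighted average of the power of modeled and extreme z-scores.

    power_modeled: power estimate for z-scores in the modeled range (2 to 4)
    share_modeled: proportion of significant z-scores in that range
    power_extreme: power assigned to z-scores above the range (assumed 100%)
    """
    share_extreme = 1.0 - share_modeled
    return power_modeled * share_modeled + power_extreme * share_extreme


estimate = overall_power(power_modeled=0.46, share_modeled=0.70)
print(f"overall power estimate: {estimate:.2f}")
```

The same function can be used with a different cut-off, as long as the proportion of z-scores above the cut-off is known and a power of 100% is a reasonable assumption for them.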
The large sample of test results makes it possible to make more detailed predictions for results with different strength of evidence. To provide estimates of replicability for different levels of evidence, I conducted post-hoc power analysis for intervals of half a standard deviation (z = .5). The power estimates are:
Strength of Evidence (z-score)    Power
2.0 to 2.5                        33%
2.5 to 3.0                        46%
3.0 to 3.5                        58%
3.5 to 4.0                        72%
IMPLICATIONS FOR PLANNING OF REPLICATION STUDIES
These estimates are important for researchers who are aiming to replicate a published study in social psychology. The reported effect sizes are inflated and a replication study with the same sample size has a low chance to produce a significant result even if a smaller effect exists. To conduct a properly powered replication study, researchers would have to increase sample sizes. To illustrate, imagine that a study demonstrates a significant difference between two groups with 40 participants (20 in each cell) with a z-score of 2.3 (p = .02, two-tailed). The observed power for this result is 65% and it would suggest that a slightly larger sample of N = 60 is sufficient to achieve 80% power (80% chance to get a significant result). However, after correcting for bias, the true power is more likely to be just 33% (see table above) and power for a study with N = 60 would still only be 50%. To achieve 80% power, the replication study would need a sample size of 130 participants. Sample sizes would need to be even larger taking into account that the actual probability of a successful replication is even lower than the probability based on post-hoc power analysis. In the OSF-project only 1 out of 30 studies with an original z-score between 2 and 3 was successfully replicated.
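The planning logic in this example can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: it treats the test as a two-tailed z-test, assumes that the expected z-score grows with the square root of the sample size, and uses hypothetical helper names (power_from_ncp, ncp_from_power, n_for_power) that do not come from any published z-curve code. Because of the normal approximation, the output lands near, but not exactly on, the rounded figures quoted above.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD = NormalDist()
Z_CRIT = _STD.inv_cdf(0.975)  # about 1.96 for p < .05, two-tailed


def power_from_ncp(ncp):
    """Two-tailed power of a z-test whose expected z-score is `ncp`."""
    return (1 - _STD.cdf(Z_CRIT - ncp)) + _STD.cdf(-Z_CRIT - ncp)


def ncp_from_power(power):
    """Expected z-score implied by a power value (ignores the far tail)."""
    return Z_CRIT + _STD.inv_cdf(power)


def n_for_power(ncp, n_old, target=0.80):
    """Smallest N at which the rescaled expected z-score reaches `target` power.

    Assumption: the expected z-score scales with sqrt(N).
    """
    return ceil(n_old * (ncp_from_power(target) / ncp) ** 2)


# Naive planning: treat the observed z = 2.3 (N = 40) as the true expected z-score.
naive_power = power_from_ncp(2.3)
naive_n = n_for_power(2.3, 40)

# Bias-corrected planning: assume the true power of the original study is only 33%.
true_ncp = ncp_from_power(0.33)
power_at_60 = power_from_ncp(true_ncp * sqrt(60 / 40))
corrected_n = n_for_power(true_ncp, 40)

print(f"naive: power = {naive_power:.2f}, N for 80% power = {naive_n}")
print(f"corrected: power at N = 60 is {power_at_60:.2f}, N for 80% power = {corrected_n}")
```

The point of the comparison is the gap between the two plans: the naive plan asks for roughly 60 participants, whereas the bias-corrected plan asks for more than twice that number.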
IMPLICATIONS FOR THE EVALUATION OF PUBLISHED RESULTS
The results also have implications for the way social psychologists should conduct and evaluate new research. The main reason why z-scores between 2 and 3 provide untrustworthy evidence for an effect is that they are obtained with underpowered studies and publication bias. As a result, it is likely that the strength of evidence is inflated. If, however, the same z-scores were obtained in studies with high power, a z-score of 2.5 would provide more credible evidence for an effect. The strength of evidence in a single study would still be subject to random sampling error, but it would no longer be subject to systematic bias. Therefore, the evidence would be more likely to reveal a true effect and it would be less likely to be a false positive. This implies that z-scores should be interpreted in the context of other information about the likelihood of selection bias. For example, a z-score of 2.5 in a pre-registered study provides stronger evidence for an effect than the same z-score in a study where researchers may have had a chance to conduct multiple studies and to select the most favorable results for publication.
The same logic can also be applied to journals and labs. A z-score of 2.5 in a journal with an average z-score of 2.3 is less trustworthy than a z-score of 2.5 in a journal with an average z-score of 3.5. In the former journal, a z-score of 2.5 is likely to be inflated, whereas in the latter journal a z-score of 2.5 is more likely to be negatively biased by sampling error. For example, currently a z-score of 2.5 is more likely to reveal a true effect if it is published in a cognitive journal than a social journal (see ranking of psychology journals).
The same logic applies even more strongly to labs because labs have a distinct research culture (MO). Some labs conduct many underpowered studies and publish only the studies that worked. Other labs may conduct fewer studies with high power. A z-score of 2.5 is more trustworthy if it comes from a lab with high average power than from a lab with low average power. Thus, providing information about the post-hoc-power of individual researchers can help readers to evaluate the strength of evidence of individual studies in the context of the typical strength of evidence that is obtained in a specific lab. This will create an incentive to publish results with strong evidence rather than fishing for significant results because a low replicability index increases the criterion at which results from a lab provide evidence for an effect.
Abstract: I predicted the replicability of 38 social psychology results in the OSF-Reproducibility Project. Based on post-hoc-power analysis I predicted a success rate of 35%. The actual success rate was 8% (3 out of 38) and post-hoc-power was estimated to be 3% for 36 out of 38 studies (5% power = type-I error rate, meaning the null-hypothesis is true).
The OSF-Reproducibility Project aimed to replicate 100 results published in original research articles in three psychology journals in 2008. The selected journals focus on publishing results from experimental psychology. The main paradigm of experimental psychology is to recruit samples of participants and to study their behaviors in controlled laboratory conditions. The results are then generalized to the typical behavior of the average person.
An important methodological distinction in experimental psychology is the research design. In a within-subject design, participants are exposed to several (a minimum of two) situations and the question of interest is whether responses to one situation differ from behavior in other situations. The advantage of this design is that individuals serve as their own controls and variation due to unobserved causes (mood, personality, etc.) does not influence the results. This design can produce high statistical power to study even small effects. The design is often used by cognitive psychologists because the actual behaviors are often simple behaviors (e.g., pressing a button) that can be repeated many times (e.g., to demonstrate interference in the Stroop paradigm).
In a between-subject design, participants are randomly assigned to different conditions. A mean difference between conditions reveals that the experimental manipulation influenced behavior. The advantage of this design is that behavior is not influenced by previous behaviors in the experiment (carry-over effects). The disadvantage is that many uncontrolled factors (e.g., mood, personality) also influence behavior. As a result, it can be difficult to detect small effects of an experimental manipulation among all of the other variance that is caused by uncontrolled factors. As a result, between-subject designs require large samples to study small effects or they can only be used to study large effects.
One of the main findings of the OSF-Reproducibility Project was that results from within-subject designs used by cognitive psychology were more likely to replicate than results from between-subject designs used by social psychologists. There were too few between-subject studies by cognitive psychologists or within-subject designs by social psychologists to separate these factors. This result of the OSF-reproducibility project was predicted by PHP-curves of the actual articles as well as PHP-curves of cognitive and social journals (Replicability-Rankings).
Given the reliable difference between disciplines within psychology, it seems problematic to generalize the results of the OSF-reproducibility project to all areas of psychology. The Replicability-Rankings suggest that social psychology has a lower replicability than other areas of psychology. For this reason, I conducted separate analyses for social psychology and for cognitive psychology. Other areas of psychology had too few studies to conduct a meaningful analysis. Thus, the OSF-reproducibility results should not be generalized to all areas of psychology.
The master data file of the OSF-reproducibility project contained 167 studies with replication results for 99 studies. 57 studies were classified as social studies. However, this classification used a broad definition of social psychology that included personality psychology and developmental psychology. It included six articles published in the personality section of the Journal of Personality and Social Psychology. As each section functions essentially like an independent journal, I excluded all studies from this section. The file also contained two independent replications of two experiments (experiment 5 and 7) in Albarracín et al. (2008; DOI: 10.1037/a0012833). As the main sampling strategy was to select the last study of each article, I only included Study 7 in the analysis (Study 5 did not replicate, p = .77). Thus, my selection did not lower the rate of successful replications. There were also two independent replications of the same result in Bressan and Stranieri (2008). Both replications produced non-significant results (p = .63, p = .75). I selected the replication study with the larger sample (N = 318 vs. 259). I also excluded two studies that were not independent replications. Rule and Ambady (2008) examined the correlation between facial features and success of CEOs. The replication study had new raters to rate the faces, but used the same faces. Heine, Buchtel, and Norenzayan (2008) examined correlates of conscientiousness across nations and the replication study examined the same relationship across the same set of nations. I also excluded replications of non-significant results because non-significant results provide ambiguous information and cannot be interpreted as evidence for the null-hypothesis. For this reason, it is not clear how the results of a replication study should be interpreted. Two underpowered studies could easily produce consistent results that are both type-II errors.
For this reason, I excluded Ranganath and Nosek (2008) and Eastwick and Finkel (2008). The final sample consisted of 38 articles.
I first conducted a post-hoc-power analysis of the reported original results. Test statistics were first converted into two-tailed p-values and two-tailed p-values were converted into absolute z-scores using the formula z = norm.inverse(1 – p/2). Post-hoc power was estimated by fitting the observed z-scores to predicted z-scores with a mixed-power model with three parameters (Brunner & Schimmack, in preparation).
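The conversion from a two-tailed p-value to an absolute z-score can be illustrated with Python's standard library. This is a minimal sketch, not the code used for the analysis; the function names p_to_z and z_to_p are made up for illustration. It uses the lower tail of the normal distribution, which is algebraically the same as the inverse normal of 1 – p/2 but stays accurate for very small p-values.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution


def p_to_z(p_two_tailed):
    """Absolute z-score that corresponds to a two-tailed p-value."""
    if not 0 < p_two_tailed < 1:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # Same as inv_cdf(1 - p/2), but numerically safer for tiny p-values.
    return -_STD.inv_cdf(p_two_tailed / 2)


def z_to_p(z):
    """Two-tailed p-value that corresponds to a z-score."""
    return 2 * _STD.cdf(-abs(z))


print(f"p = .05 corresponds to z = {p_to_z(0.05):.2f}")
print(f"z = 2.30 corresponds to p = {z_to_p(2.30):.3f}")
```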
Estimated power was 35%. This finding reflects the typical finding that reported results are a biased sample of studies that produced significant results, whereas non-significant results are not submitted for publication. Based on this estimate, one would expect that only 35% of the 38 findings (k = 13) would produce a significant result in an exact replication study with the same design and sample size.
The Figure visualizes the discrepancy between observed z-scores and the success rate in the original studies. Evidently, the distribution is truncated and the mode of the curve (its highest point) is projected to be on the left side of the significance criterion (z = 1.96, p = .05 (two-tailed)). Given the absence of reliable data in the range from 0 to 1.96, the data make it impossible to estimate the exact distribution in this region, but the steep decline of z-scores on the right side of the significance criterion suggests that many of the significant results achieved significance only with the help of inflated observed effect sizes. As sampling error is random, these results will not replicate again in a replication study.
The replication studies had different sample sizes than the original studies. This makes it difficult to compare the prediction to the actual success rate because the actual success rate could be much higher if the replication studies had much larger samples and more power to replicate effects. For example, if all replication studies had sample sizes of N = 1,000, we would expect a much higher replication rate than 35%. The median sample size of the original studies was N = 86. This is representative of studies in social psychology. The median sample size of the replication studies was N = 120. Given this increase in power, the predicted success rate would increase to 50%. However, the increase in power was not uniform across studies. Therefore, I used the p-values and sample size of the replication study to compute the z-score that would have been obtained with the original sample size and I used these results to compare the predicted success rate to the actual success rate in the OSF-reproducibility project.
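One way to carry out such a sample-size correction is sketched below. The exact procedure is not spelled out in the text, so the sketch rests on an assumption: the expected z-score is taken to scale with the square root of the sample size, and the function name adjust_z is hypothetical. The example values (p = .045 with N = 180 in the replication, N = 70 in the original) are those reported for the Payne et al. replication discussed further down.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()


def adjust_z(p_rep, n_rep, n_orig):
    """Rescale a replication result to the original sample size.

    Assumption: z scales with sqrt(N), so the adjusted z-score is
    z_rep * sqrt(n_orig / n_rep). Returns (adjusted z, adjusted two-tailed p).
    """
    z_rep = -_STD.inv_cdf(p_rep / 2)          # two-tailed p -> absolute z
    z_adj = z_rep * sqrt(n_orig / n_rep)      # shrink (or grow) to original N
    p_adj = 2 * _STD.cdf(-z_adj)              # back to a two-tailed p-value
    return z_adj, p_adj


z_adj, p_adj = adjust_z(p_rep=0.045, n_rep=180, n_orig=70)
print(f"adjusted z = {z_adj:.2f}, adjusted p = {p_adj:.2f}")
```

Under this assumption a result that was just significant with 180 participants is no longer significant once it is scaled back to 70 participants.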
The depressing finding was that the actual success rate was much lower than the predicted success rate. Only 3 out of 38 results (8%) produced a significant result (without the correction of sample size 5 findings would have been significant). Even more depressing is the fact that a 5% criterion implies that 1 out of every 20 studies is expected to produce a significant result just by chance. Thus, the actual success rate is close to the success rate that would be expected if all of the original results were false positives. A success rate of 8% would imply that the actual power of the replication studies was only 8%, compared to the predicted power of 35%.
The next figure shows the post-hoc-power curve for the sample-size corrected z-scores.
The PHP-Curve estimate of power for z-scores in the range from 0 to 4 is 3% for the homogeneous case. This finding means that the distribution of z-scores for 36 of the 38 results is consistent with the null-hypothesis that the true effect size for these effects is zero. Only two z-scores greater than 4 (one shown, the other greater than 6 not shown) appear to be replicable and robust effects.
One replicable finding was obtained in a study by Halevy, Bornstein, and Sagiv. The authors demonstrated that allocation of money to in-group and out-group members is influenced much more by favoring the in-group than by punishing the out-group. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.
The other successful replication was a study by Lemay and Clark (DOI: 10.1037/0022-35220.127.116.117). The replicated finding was that participants projected their own responsiveness in a romantic relationship onto their partners’ responsiveness while controlling for partners’ actual responsiveness. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.
Based on weak statistical evidence in the original studies, I had predicted failures of replication for 25 studies. Given the low success rate, it is not surprising that my success rate for these predictions was 100%.
I made the wrong prediction for 11 results. In all cases, I predicted a successful replication when the outcome was a failed replication. Thus, my overall success rate was 27/38 = 71%. Unfortunately, this success rate is easily beaten by a simple prediction rule that nothing in social psychology replicates, which is wrong in only 3 out of 38 predictions (35/38 = 92% success rate).
Below I briefly comment on the 11 failed predictions.
1 Based on strong statistics (z > 4), I had predicted a successful replication for Förster, Liberman, and Kuschel (DOI: 10.1037/0022-3518.104.22.1689). However, even when I made this prediction based on the reported statistics, I had my doubts about this study because statisticians had discovered anomalies in Jens Förster’s studies that cast doubt on the validity of these reported results. Post-hoc power analysis can correct for publication bias, but it cannot correct for other sources of bias that lead to vastly inflated effect sizes.
2 I predicted a successful replication of Payne, MA Burkley, MB Stokes. The replication study actually produced a significant result, but it was no longer significant after correcting for the larger sample size in the replication study (180 vs. 70, p = .045 vs. .21). Although the p-value in the replication study is not very reassuring, it is possible that this is a real effect. However, the original result was probably still inflated by sampling error to produce a z-score of 2.97.
3 I predicted a successful replication of McCrae (DOI: 10.1037/0022-3522.214.171.1244). This prediction was based on a transcription error. Whereas the z-score for the target effect was 1.80, I posted a z-score of 3.5. Ironically, the study did successfully replicate with a larger sample size, but the effect was no longer significant after adjusting the result for sample size (N = 61 vs. N = 28). This study demonstrates that marginally significant effects can reveal real effects, but it also shows that larger samples are needed in replication studies to demonstrate this.
4 I predicted a successful replication for EP Lemay, MS Clark (DOI: 10.1037/0022-35126.96.36.1990). This prediction was based on a transcription error because EP Lemay and MS Clark had another study in the project. With the correct z-score of the original result (z = 2.27), I would have predicted correctly that the result would not replicate.
5 I predicted a successful replication of Monin, Sawyer, and Marquez (DOI: 10.1037/0022-35188.8.131.52) based on a strong result for the target effect (z = 3.8). The replication study produced a z-score of 1.45 with a sample size that was not much larger than the original study (N = 75 vs. 67).
6 I predicted a successful replication for Shnabel and Nadler (DOI: 10.1037/0022-35184.108.40.206). The replication study increased sample size by 50% (Ns = 141 vs. 94), but the effect in the replication study was modest (z = 1.19).
7 I predicted a successful replication for van Dijk, van Kleef, Steinel, van Beest (DOI: 10.1037/0022-35220.127.116.110). The sample size in the replication study was slightly smaller than in the original study (N = 83 vs. 103), but even with adjustment the effect was close to zero (z = 0.28).
8 I predicted a successful replication of V Purdie-Vaughns, CM Steele, PG Davies, R Ditlmann, JR Crosby (DOI: 10.1037/0022-3518.104.22.1685). The original study had rather strong evidence (z = 3.35). In this case, the replication study had a much larger sample than the original study (N = 1,490 vs. 90) and still did not produce a significant result.
9 I predicted a successful replication of C Farris, TA Treat, RJ Viken, RM McFall (doi:10.1111/j.1467-9280.2008.02092.x). The replication study had a somewhat smaller sample (N = 144 vs. 280), but even with adjustment of sample size the effect in the replication study was close to zero (z = 0.03).
10 I predicted a successful replication of KD Vohs and JW Schooler (doi:10.1111/j.1467-9280.2008.02045.x). I made this prediction because of generally strong statistics, although the strength of the target effect was below 3 (z = 2.8) and the sample size was small (N = 30). The replication study doubled the sample size (N = 58), but produced weak evidence (z = 1.08). However, even the sample size of the replication study is modest and does not allow strong conclusions about the existence of the effect.
11 I predicted a successful replication of Blankenship and Wegener (DOI: 10.1037/0022-3522.214.171.124.2.196). The article reported strong statistics and the z-score for the target effect was greater than 3 (z = 3.36). The study also had a large sample size (N = 261). The replication study also had a similarly large sample size (N = 251), but the effect was much smaller than in the original study (z = 3.36 vs. 0.70).
In some of these failed predictions it is possible that the replication study failed to reproduce the same experimental conditions or that the population of the replication study differs from the population of the original study. However, there are twice as many studies where the failure of replication was predicted based on weak statistical evidence and the presence of publication bias in social psychology journals.
In conclusion, this set of results from a representative sample of articles in social psychology reported a 100% success rate. It is well known that this success rate can only be achieved with selective reporting of significant results. Even the inflated estimate of median observed power is only 71%, which shows that the success rate of 100% is inflated. A power estimate that corrects for inflation suggested that only 35% of results would replicate, and the actual success rate is only 8%. While mistakes by the replication experimenters may contribute to the discrepancy between the prediction of 35% and the actual success rate of 8%, it was predictable based on the results in the original studies that the majority of results would not replicate in replication studies with the same sample size as the original studies.
This low success rate is not characteristic of other sciences and other disciplines in psychology. As mentioned earlier, the success rate for cognitive psychology is higher and comparisons of psychological journals show that social psychology journals have lower replicability than other journals. Moreover, an analysis of time trends shows that replicability of social psychology journals has been low for decades and some journals even show a negative trend in the past decade.
The low replicability of social psychology has been known for over 50 years, since Cohen examined the replicability of results published in the Journal of Abnormal and Social Psychology (now Journal of Personality and Social Psychology), the flagship journal of social psychology. Cohen estimated a replicability of 60%. Social psychologists would rejoice if the reproducibility project had shown a replication rate of 60%. The depressing result is that the actual replication rate was 8%.
The main implication of this finding is that it is virtually impossible to trust any results that are being published in social psychology journals. Yes, two articles that posted strong statistics (z > 4) replicated, but several results with equally strong statistics did not replicate. Thus, it is reasonable to distrust all results with z-scores below 4 (4 sigma rule), but not all results with z-scores greater than 4 will replicate.
Given the low credibility of original research findings, it will be important to raise the quality of social psychology by increasing statistical power. It will also be important to allow publication of non-significant results to reduce the distortion that is created by a file-drawer filled with failed studies. Finally, it will be important to use stronger methods of bias-correction in meta-analysis because traditional meta-analysis seemed to show strong evidence even for incredible effects like premonition for erotic stimuli (Bem, 2011).
In conclusion, the OSF-project demonstrated convincingly that many published results in social psychology cannot be replicated. If social psychology wants to be taken seriously as a science, it has to change the way data are collected, analyzed, and reported and demonstrate replicability in a new test of reproducibility.
The silver lining is that a replication rate of 8% is likely to be an underestimation and that regression to the mean alone might lead to some improvement in the next evaluation of social psychology.
A type-I error is the rejection of the null-hypothesis (i.e., the effect size is zero) when the null-hypothesis is true. The probability of this outcome is the type-I error rate.
A type-II error is the failure to reject the null-hypothesis when the null-hypothesis is false (i.e., there is an effect). The probability of this outcome is the type-II error rate.
A common application of statistics is to provide empirical evidence for a theoretically predicted relationship between two variables (cause-effect or covariation). The results of an empirical study can produce two outcomes. Either the result is statistically significant or it is not statistically significant. Statistically significant results are interpreted as support for a theoretically predicted effect.
Statistically non-significant results are difficult to interpret because the prediction may be false (the null-hypothesis is true) or a type-II error occurred (the theoretical prediction is correct, but the results fail to provide sufficient evidence for it).
To avoid type-II errors, researchers can design studies that reduce the type-II error probability. The probability of avoiding a type-II error when a predicted effect exists is called power. It could also be called the probability of success because a significant result can be used to provide empirical support for a hypothesis.
Ideally researchers would want to maximize power to avoid type-II errors. However, powerful studies require more resources. Thus, researchers face a trade-off between the allocation of resources and their probability to obtain a statistically significant result.
Jacob Cohen dedicated a large portion of his career to help researchers with the task of planning studies that can produce a successful result, if the theoretical prediction is true. He suggested that researchers should plan studies to have 80% power. With 80% power, the type-II error rate is still 20%, which means that 1 out of 5 studies in which a theoretical prediction is true would fail to produce a statistically significant result.
Cohen (1962) examined the typical effect sizes in psychology and found that the typical effect size for the mean difference between two groups (e.g., men and women or experimental vs. control group) is about half of a standard deviation. The standardized effect size measure is called Cohen’s d in his honor. Based on his review of the literature, Cohen suggested that an effect size of d = .2 is small, d = .5 moderate, and d = .8 large. Importantly, a statistically small effect size can have huge practical importance. Thus, these labels should not be used to make claims about the practical importance of effects. The main purpose of these labels is that researchers can better plan their studies. If researchers expect a large effect (d = .8), they need a relatively small sample to have high power. If researchers expect a small effect (d = .2), they need a large sample to have high power. Cohen (1992) provided information about effect sizes and sample sizes for different statistical tests (chi-square, correlation, ANOVA, etc.).
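To make the relationship between effect size and sample size concrete, the following sketch computes the approximate number of participants per group that a two-group comparison needs for 80% power. It relies on the textbook normal approximation n = 2 * ((z_alpha + z_power) / d)^2 rather than on Cohen's tables, and the function name n_per_group is only illustrative. An exact calculation based on the t-distribution requires slightly larger samples.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate participants per group for a two-group mean comparison.

    Normal approximation to the two-sample t-test (two-tailed alpha).
    """
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)  # critical value, about 1.96
    z_power = std.inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for label, d in [("small", 0.2), ("moderate", 0.5), ("large", 0.8)]:
    print(f"{label:8s} d = {d}: about {n_per_group(d)} participants per group")
```

A small effect needs close to 400 participants per group, whereas a large effect needs only about 25, which illustrates why the expected effect size matters so much for planning.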
Cohen (1962) conducted a meta-analysis of studies published in a prominent psychology journal. Based on the typical effect size and sample size in these studies, Cohen estimated that the average power in studies is about 60%. Importantly, this also means that the typical power to detect small effects is less than 60%. Thus, many studies in psychology have low power and a high type-II error probability. As a result, one would expect that journals often report that studies failed to support theoretical predictions. However, the success rate in psychological journals is over 90% (Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). There are two explanations for discrepancies between the reported success rate and the success probability (power) in psychology. One explanation is that researchers conduct multiple studies and only report successful studies. The other studies remain unreported in a proverbial file-drawer (Rosenthal, 1979). The other explanation is that researchers use questionable research practices to produce significant results in a study (John, Loewenstein, & Prelec, 2012). Both practices have undesirable consequences for the credibility and replicability of published results in psychological journals.
A simple solution to the problem would be to increase the statistical power of studies. If the power of studies in psychology were over 90%, a success rate of 90% would be justified by the actual probability of obtaining significant results. However, meta-analysis and method articles have repeatedly pointed out that psychologists do not consider statistical power in the planning of their studies and that studies continue to be underpowered (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989).
One reason for the persistent neglect of power could be that researchers have no awareness of the typical power of their studies. This could happen because observed power in a single study is an imperfect indicator of true power (Yuan & Maxwell, 2005). If a study produced a significant result, the observed power is at least 50%, even if the true power is only 30%. Even if the null-hypothesis is true, and researchers publish only type-I errors, observed power is dramatically inflated to 62%, when the true power is only 5% (the type-I error rate). Thus, Cohen’s estimate of 60% power is not very reassuring.
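The inflation of observed power can be checked with a short calculation. The sketch below assumes a two-tailed z-test and computes observed power by treating the observed z-score as if it were the true expected z-score; the function name observed_power is hypothetical. It evaluates two cases: a result that is just significant, and the median significant result when the null-hypothesis is true and only significant results are published.

```python
from statistics import NormalDist

_STD = NormalDist()
Z_CRIT = _STD.inv_cdf(0.975)  # about 1.96 for p < .05, two-tailed


def observed_power(z):
    """Power computed as if the observed z-score were the true expected z-score."""
    z = abs(z)
    return (1 - _STD.cdf(Z_CRIT - z)) + _STD.cdf(-Z_CRIT - z)


# Under a true null, |z| > 1.96 occurs 5% of the time. The median of these
# significant results is the |z| value that is exceeded 2.5% of the time.
median_sig_z = _STD.inv_cdf(1 - 0.025 / 2)

print(f"just significant: observed power = {observed_power(Z_CRIT):.2f}")
print(f"median significant z under the null = {median_sig_z:.2f}")
print(f"observed power at that z = {observed_power(median_sig_z):.2f}")
```

With the median as the summary, observed power comes out at roughly 61% even though true power is 5%, which is close to the figure given above; the exact value depends on how the typical significant result is summarized.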
Over the past years, Schimmack and Brunner have developed a method to estimate power for sets of studies with heterogeneous designs, sample sizes, and effect sizes. A technical report is in preparation. The basic logic of this approach is to convert results of all statistical tests into z-scores using the one-tailed p-value of a statistical test. The z-scores provide a common metric for observed statistical results. The standard normal distribution predicts the distribution of observed z-scores for a fixed value of true power. However, for heterogeneous sets of studies the distribution of z-scores is a mixture of standard normal distributions with different weights attached to various power values. To illustrate this method, the histograms of z-scores below show simulated data with 10,000 observations with varying levels of true power: 20% null-hypotheses being true (5% power), 20% of studies with 33% power, 20% of studies with 50% power, 20% of studies with 66% power, and 20% of studies with 80% power.
The plot shows the distribution of absolute z-scores (there are no negative effect sizes). The plot is limited to z-scores below 6 (N = 9,985 out of 10,000). Z-scores above 6 standard deviations from zero are extremely unlikely to occur by chance. Even with a conservative estimate of effect size (lower bound of 95% confidence interval), observed power is well above 99%. Moreover, particle physics uses Z = 5 as a criterion to claim success (e.g., discovery of the Higgs boson). Thus, Z-scores above 6 can be expected to be highly replicable effects.
Z-scores below 1.96 (the vertical dotted red line) are not significant for the standard criterion of (p < .05, two-tailed). These values are excluded from the calculation of power because these results are either not reported or not interpreted as evidence for an effect. It is still important to realize that true power of all experiments would be lower if these studies were included because many of the non-significant results are produced by studies with 33% power. These non-significant results create two problems. Researchers wasted resources on studies with inconclusive results and readers may be tempted to misinterpret these results as evidence that an effect does not exist (e.g., a drug does not have side effects) when an effect is actually present. In practice, it is difficult to estimate power for non-significant results because the size of the file-drawer is difficult to estimate.
It is possible to estimate power for any range of z-scores, but I prefer the range of z-scores from 2 (just significant) to 4. A z-score of 4 has a 95% confidence interval that ranges from 2 to 6. Thus, even if the observed effect size is inflated, there is still a high chance that a replication study would produce a significant result (Z > 2). Thus, all z-scores greater than 4 can be treated as cases with 100% power. The plot also shows that conclusions are unlikely to change by using a wider range of z-scores because most of the significant results correspond to z-scores between 2 and 4 (89%).
The typical power of studies is estimated based on the distribution of z-scores between 2 and 4. A steep decrease from left to right suggests low power. A steep increase suggests high power. If the peak (mode) of the distribution were centered over Z = 2.8, the data would conform to Cohen’s recommendation to have 80% power.
Using the known distribution of power to estimate power in the critical range gives a power estimate of 61%. A simpler model that assumes a fixed power value for all studies produces a slightly inflated estimate of 63%. Although the heterogeneous model is correct, the plot shows that the homogeneous model provides a reasonable approximation when estimates are limited to a narrow range of Z-scores. Thus, I used the homogeneous model to estimate the typical power of significant results reported in psychological journals.
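The mean power of results in the critical range can also be worked out analytically for the simulated mixture, without drawing random numbers. The sketch below is an illustration under stated assumptions, not the estimation model itself: each component is treated as a normal distribution with a standard deviation of 1 whose mean is set by its true power, the five components have equal weights, and all names are made up for this example.

```python
from statistics import NormalDist

_STD = NormalDist()
Z_CRIT = _STD.inv_cdf(0.975)  # about 1.96 for p < .05, two-tailed
TRUE_POWER = [0.05, 0.33, 0.50, 0.66, 0.80]  # equal weights, as in the simulation


def ncp_from_power(power):
    """Expected z-score for a given true power (0 when the null is true)."""
    return 0.0 if power <= 0.05 else Z_CRIT + _STD.inv_cdf(power)


def prob_abs_z_between(ncp, lo, hi):
    """P(lo < |z| < hi) when z is normal with mean `ncp` and SD 1."""
    upper = _STD.cdf(hi - ncp) - _STD.cdf(lo - ncp)
    lower = _STD.cdf(-lo - ncp) - _STD.cdf(-hi - ncp)
    return upper + lower


def prob_significant(ncp):
    """P(|z| > 1.96) when z is normal with mean `ncp` and SD 1."""
    return (1 - _STD.cdf(Z_CRIT - ncp)) + _STD.cdf(-Z_CRIT - ncp)


ncps = [ncp_from_power(p) for p in TRUE_POWER]
in_range = [prob_abs_z_between(ncp, 2, 4) for ncp in ncps]

# Mean true power of the studies that produced a z-score between 2 and 4.
mean_power_2_to_4 = sum(w * p for w, p in zip(in_range, TRUE_POWER)) / sum(in_range)

# Share of all significant results that fall between 2 and 4.
share_2_to_4 = sum(in_range) / sum(prob_significant(ncp) for ncp in ncps)

print(f"mean power for z between 2 and 4: {mean_power_2_to_4:.2f}")
print(f"share of significant results between 2 and 4: {share_2_to_4:.2f}")
```

Under these assumptions the calculation gives about 61% mean power in the critical range and about 89% of significant results between 2 and 4, in line with the values reported for the simulated data.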
The results presented below are based on an ongoing project that examines power in psychological journals (see results section for the list of journals included so far). The set of journals does not include journals that primarily publish reviews and meta-analyses or clinical and applied journals. The data analysis is limited to the years from 2009 to 2015 to provide information about the typical power in contemporary research. Results regarding historic trends will be reported in a forthcoming article.
I downloaded pdf files of all articles published in the selected journals and converted the pdf files to text files. I then extracted all t-tests and F-tests that were reported in the text of the results section searching for t(df) or F(df1,df2). All t and F statistics were converted into one-tailed p-values and then converted into z-scores.
The plot above shows the results based on 218,698 t and F tests reported between 2009 and 2015 in the selected psychology journals. Unlike the simulated data, the plot shows a steep drop for z-scores just below the threshold of significance (z = 1.96). This drop is due to the tendency not to publish or report non-significant results. The heterogeneous model uses the distribution of non-significant results to estimate the size of the file-drawer (unpublished non-significant results). However, for the present purpose the size of the file-drawer is irrelevant because power is estimated only for significant results for Z-scores between 2 and 4.
The green line shows the best fitting estimate for the homogeneous model. The red curve shows the fit of the heterogeneous model. The heterogeneous model does a much better job of fitting the long tail of highly significant results, but for the critical interval of z-scores between 2 and 4, the two models provide similar estimates of power (55% homogeneous & 53% heterogeneous model). If the range is extended to z-scores between 2 and 6, power estimates diverge (82% homogeneous, 61% heterogeneous). The plot indicates that the heterogeneous model fits the data better and that the 61% estimate is a better estimate of true power for significant results in this range. Thus, the results are in line with Cohen's (1962) estimate that psychological studies average 60% power.
The distribution of z-scores between 2 and 4 was used to estimate the average power separately for each journal. As power is the probability of obtaining a significant result, this measure estimates the replicability of results published in a particular journal if researchers reproduced the studies under identical conditions with the same sample size (exact replication). Thus, even though the selection criterion ensured that all tests produced a significant result (100% success rate), the replication rate is expected to be only about 50%, even if the replication studies successfully reproduce the conditions of the published studies. The table below shows the replicability ranking of the journals, the replicability score, and a grade. Journals are graded based on a scheme that is similar to grading schemes for undergraduate students (below 50 = F, 50-59 = E, 60-69 = D, 70-79 = C, 80-89 = B, 90+ = A).
The average value in 2010-2014 is 57 (D+). The average value in 2015 is 58 (D+). The correlation for the values in 2010-2014 and those in 2015 is r = .66. These findings show that the replicability scores are reliable and that journals differ systematically in the power of published studies.
The main limitation of the method is that it focuses on t-tests and F-tests. The results might change when other statistics are included in the analysis. The next goal is to incorporate correlations and regression coefficients.
The second limitation is that the analysis does not discriminate between primary hypothesis tests and secondary analyses. For example, an article may find a significant main effect for gender, but the critical test is whether gender interacts with an experimental manipulation. It is possible that some journals have lower scores because they report more secondary analyses with lower power. To address this issue, it will be necessary to code articles in terms of the importance of each statistical test.
The ranking for 2015 is based on the currently available data and may change when more data become available. Readers should also avoid interpreting small differences in replicability scores as these scores are likely to fluctuate. However, the strong correlation over time suggests that there are meaningful differences in the replicability and credibility of published results across journals.
This article provides objective information about the replicability of published findings in psychology journals. None of the journals reaches Cohen’s recommended level of 80% replicability. Average replicability is just about 50%. This finding is largely consistent with Cohen’s analysis of power over 50 years ago. The publication of the first replicability analysis by journal should provide an incentive to editors to increase the reputation of their journal by paying more attention to the quality of the published data. In this regard, it is noteworthy that replicability scores diverge from traditional indicators of journal prestige such as impact factors. Ideally, the impact of an empirical article should be aligned with the replicability of the empirical results. Thus, the replicability index may also help researchers to base their own research on credible results that are published in journals with a high replicability score and to avoid incredible results that are published in journals with a low replicability score. Ultimately, I can only hope that journals will start competing with each other for a top spot in the replicability rankings and as a by-product increase the replicability of published findings and the credibility of psychological science.
Citation: Dr. R (2015). Meta-analysis of observed power. R-Index Bulletin, Vol(1), A2.
In a previous blog post, I presented an introduction to the concept of observed power. Observed power is an estimate of the true power on the basis of observed effect size, sampling error, and significance criterion of a study. Yuan and Maxwell (2005) concluded that observed power is a useless construct when it is applied to a single study, mainly because sampling error in a single study is too large to obtain useful estimates of true power. However, sampling error decreases as the number of studies increases and observed power in a set of studies can provide useful information about the true power in a set of studies.
This blog post introduces various methods that can be used to estimate power on the basis of a set of studies (meta-analysis). I then present simulation studies that compare the various estimation methods in terms of their ability to estimate true power under a variety of conditions. In this blog post, I examine only unbiased sets of studies. That is, the sample of studies in a meta-analysis is a representative sample from the population of studies with specific characteristics. The first simulation assumes that samples are drawn from a population of studies with fixed effect size and fixed sampling error. As a result, all studies have the same true power (homogeneous). The second simulation assumes that all studies have a fixed effect size, but that sampling error varies across studies. As power is a function of effect size and sampling error, this simulation models heterogeneity in true power. The next simulations assume heterogeneity in population effect sizes. One simulation uses a normal distribution of effect sizes. Importantly, a normal distribution has no influence on the mean because effect sizes are symmetrically distributed around the mean effect size. The next simulations use skewed normal distributions. This simulation provides a realistic scenario for meta-analysis of heterogeneous sets of studies such as a meta-analysis of articles in a specific journal or articles on different topics published by the same author.
Observed Power Estimation Method 1: The Percentage of Significant Results
The simplest method to determine observed power is to compute the percentage of significant results. As power is defined as the long-run percentage of significant results, the percentage of significant results in a set of studies is an unbiased estimate of the long-run percentage. The main limitation of this method is that the dichotomous measure (significant versus non-significant) is likely to be imprecise when the number of studies is small. For example, two studies can only show observed power values of 0, 50%, or 100%, even if true power were 75%. However, the percentage of significant results plays an important role in bias tests that examine whether a set of studies is representative. When researchers hide non-significant results or use questionable research methods to produce significant results, the percentage of significant results will be higher than the percentage of significant results that could have been obtained on the basis of the actual power to produce significant results.
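The coarseness of this estimate follows directly from the binomial distribution. The sketch below lists every percentage of significant results that k studies can produce, together with its probability; the function name `success_rate_distribution` is hypothetical.

```python
from math import comb

def success_rate_distribution(k, power):
    """Map each possible percentage of significant results to its probability."""
    return {
        100.0 * s / k: comb(k, s) * power**s * (1.0 - power) ** (k - s)
        for s in range(k + 1)
    }

# Two studies with 75% true power
dist = success_rate_distribution(k=2, power=0.75)
for pct, prob in dist.items():
    print(f"{pct:5.1f}% significant: probability {prob:.4f}")
```

With only two studies, no possible outcome equals the true power of 75%, and the grid of possible values only becomes finer as the number of studies grows.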
Observed Power Estimation Method 2: The Median
Schimmack (2012) proposed to average observed power of individual studies to estimate observed power. Yuan and Maxwell (2005) demonstrated that the average of observed power is a biased estimator of true power. It overestimates power when power is less than 50% and it underestimates true power when power is above 50%. Although the bias is not large (no more than 10 percentage points), Yuan and Maxwell (2005) proposed a method that produces an unbiased estimate of power in a meta-analysis of studies with the same true power (exact replication studies). Unlike the average that is sensitive to skewed distributions, the median provides an unbiased estimate of true power because sampling error is equally likely (50:50 probability) to inflate or deflate the observed power estimate. To avoid the bias of averaging observed power, Schimmack (2014) used median observed power to estimate the replicability of a set of studies.
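The difference between the mean and the median of observed power can be checked with a small simulation. This is a sketch under simplifying assumptions: each study is represented by a z-score drawn from a normal distribution with unit variance around the true non-centrality parameter, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, median

STD_NORMAL = NormalDist()
Z_CRIT = 1.96  # p < .05 (two-tailed)

def observed_power(z):
    """Power implied by treating the observed z-score as the true value."""
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - z)

def simulate_observed_power(true_power, n_studies, seed=1):
    """Draw observed z-scores for exact replications and convert them to observed power."""
    rng = random.Random(seed)
    ncp = Z_CRIT + STD_NORMAL.inv_cdf(true_power)  # expected z for this power
    return [observed_power(rng.gauss(ncp, 1.0)) for _ in range(n_studies)]

powers = simulate_observed_power(true_power=0.80, n_studies=50_000)
print(f"mean observed power:   {mean(powers):.3f}")
print(f"median observed power: {median(powers):.3f}")
```

In this setup the median stays close to the true value of 80%, whereas the mean is pulled toward 50% because the distribution of observed power is skewed.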
Observed Power Estimation Method 3: P-Curve’s KS Test
Another method is implemented in Simonsohn’s (2014) pcurve. Pcurve was developed to obtain an unbiased estimate of a population effect size from a biased sample of studies. To achieve this goal, it is necessary to determine the power of studies because bias is a function of power. The pcurve estimation uses an iterative approach that tries out different values of true power. For each potential value of true power, it computes the location (quantile) of observed test statistics relative to a potential non-centrality parameter. The best fitting non-centrality parameter is located in the middle of the observed test statistics. Once a non-central distribution has been found, it is possible to assign each observed test value a cumulative percentile of the non-central distribution. For the actual non-centrality parameter, these percentiles have a uniform distribution. To find the best fitting non-centrality parameter from a set of possible parameters, pcurve tests whether the distribution of observed percentiles follows a uniform distribution using the Kolmogorov-Smirnov test. The non-centrality parameter with the smallest test statistic is then used to estimate true power.
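The grid search described above can be sketched as follows. This is not the pcurve code: as a simplification, the sketch substitutes a normal distribution with unit variance for the non-central t and F distributions and does not condition on significance, which matches the unbiased sets examined in this post. All function names and the example z-scores are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def ks_distance_from_uniform(percentiles):
    """Kolmogorov-Smirnov statistic D between a sample and uniform(0, 1)."""
    u = sorted(percentiles)
    n = len(u)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))

def estimate_ncp_ks(z_scores, grid_step=0.01, max_ncp=6.0):
    """Return the candidate non-centrality parameter with the smallest D."""
    best_ncp, best_d = 0.0, float("inf")
    for i in range(int(round(max_ncp / grid_step)) + 1):
        ncp = i * grid_step
        # Percentile of each observed z-score under the candidate distribution
        percentiles = [STD_NORMAL.cdf(z - ncp) for z in z_scores]
        d = ks_distance_from_uniform(percentiles)
        if d < best_d:
            best_ncp, best_d = ncp, d
    return best_ncp

def power_from_ncp(ncp, z_crit=1.96):
    return 1.0 - STD_NORMAL.cdf(z_crit - ncp)

# Hypothetical z-scores from ten studies
z_scores = [1.2, 1.9, 2.1, 2.4, 2.6, 2.9, 3.1, 3.5, 3.8, 4.4]
ncp_hat = estimate_ncp_ks(z_scores)
print(f"estimated ncp = {ncp_hat:.2f}, estimated power = {power_from_ncp(ncp_hat):.2f}")
```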
Observed Power Estimation Method 4: P-Uniform
van Assen, van Aert, and Wicherts (2014) developed another method to estimate observed power. Their method is based on the use of the gamma distribution. Like the pcurve method, this method relies on the fact that the percentiles of observed test statistics should follow a uniform distribution when a potential non-centrality parameter matches the true non-centrality parameter. P-uniform transforms the probabilities given a potential non-centrality parameter with a negative log-function (-log[x]). These values are summed. When probabilities form a uniform distribution, the expected sum of the negative log-transformed probabilities matches the number of studies. Thus, the value with the smallest absolute discrepancy between the sum of negative log-transformed probabilities and the number of studies provides the estimate of observed power.
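The estimation principle can be sketched in a few lines. This is not the p-uniform implementation: the sketch assumes normally distributed z-scores with unit variance, uses upper-tail probabilities, and does not condition on significance. The function names and example values are hypothetical.

```python
from math import erfc, log, sqrt

def upper_tail(x):
    """P(Z > x) for a standard normal Z, computed via erfc for small tails."""
    return 0.5 * erfc(x / sqrt(2.0))

def estimate_ncp_sum_log(z_scores, grid_step=0.01, max_ncp=6.0):
    """Find the candidate ncp whose summed -log(p) is closest to the number of studies."""
    k = len(z_scores)
    best_ncp, best_gap = 0.0, float("inf")
    for i in range(int(round(max_ncp / grid_step)) + 1):
        ncp = i * grid_step
        # Probability of a z-score at least as large as observed, given ncp
        total = sum(-log(upper_tail(z - ncp)) for z in z_scores)
        gap = abs(total - k)
        if gap < best_gap:
            best_ncp, best_gap = ncp, gap
    return best_ncp

def power_from_ncp(ncp, z_crit=1.96):
    return upper_tail(z_crit - ncp)

# Hypothetical z-scores from five studies
z_scores = [1.8, 2.3, 2.8, 3.3, 3.8]
ncp_hat = estimate_ncp_sum_log(z_scores)
print(f"estimated ncp = {ncp_hat:.2f}, estimated power = {power_from_ncp(ncp_hat):.2f}")
```

The target value works because the negative logarithm of a uniformly distributed probability has an expected value of 1, so k studies are expected to sum to k.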
Observed Power Estimation Method 5: Averaging Standard Normal Non-Centrality Parameter
In addition to these existing methods, I introduce two novel estimation methods. The first new method converts observed test statistics into one-sided p-values. These p-values are then transformed into z-scores. This approach has a long tradition in meta-analysis that was developed by Stouffer et al. (1949). It was popularized by Rosenthal during the early days of meta-analysis (Rosenthal, 1979). Transformation of probabilities into z-scores makes it easy to aggregate probabilities because z-scores follow a symmetrical distribution. The average of these z-scores can be used as an estimate of the actual non-centrality parameter. The average z-score can then be used to estimate true power. This approach avoids the problem that arises when power estimates are averaged, namely that power has a skewed distribution. Thus, it should provide an unbiased estimate of true power when power is homogeneous across studies.
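A minimal sketch of this method, assuming one-sided p-values as input and the criterion z = 1.96; the function names and the example p-values are hypothetical.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def z_from_one_sided_p(p):
    """Convert a one-sided p-value into a z-score."""
    return STD_NORMAL.inv_cdf(1.0 - p)

def power_from_mean_z(p_values, z_crit=1.96):
    """Average the z-scores and treat the average as the non-centrality parameter."""
    ncp_hat = mean(z_from_one_sided_p(p) for p in p_values)
    return 1.0 - STD_NORMAL.cdf(z_crit - ncp_hat)

# Hypothetical one-sided p-values from five studies
p_values = [0.001, 0.010, 0.020, 0.040, 0.200]
print(f"estimated power = {power_from_mean_z(p_values):.2f}")
```

A set of studies that all produce p = .025 (one-tailed) has an average z-score of 1.96 and therefore an estimated power of 50%.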
Observed Power Estimation Method 6: Yuan-Maxwell Correction of Average Observed Power
Yuan and Maxwell (2005) demonstrated that a simple average of observed power is systematically biased. However, a simple average avoids the problems of transforming the data and can produce tighter estimates than the median method. Therefore, I explored whether it is possible to apply a correction to the simple average. The correction is based on Yuan and Maxwell’s (2005) mathematically derived formula for systematic bias. After averaging observed power, Yuan and Maxwell’s formula for bias is used to correct the estimate for systematic bias. The only problem with this approach is that bias is a function of true power. However, as observed power becomes an increasingly good estimator of true power in the long run, the bias correction will also become increasingly better at correcting the right amount of bias.
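Yuan and Maxwell's formula is not reproduced here. As a stand-in, the sketch below shows how such a correction can work under a simplifying assumption: observed z-scores are normally distributed with unit variance around the true non-centrality parameter and only one tail is considered. In that case the expected value of observed power is Φ(Φ⁻¹(true power) / √2), which can be inverted. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def expected_mean_observed_power(true_power):
    """Expected average of observed power under the normal, one-tailed assumption."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(true_power) / sqrt(2.0))

def corrected_power(mean_observed_power):
    """Invert the expected bias; the input must lie strictly between 0 and 1."""
    return STD_NORMAL.cdf(sqrt(2.0) * STD_NORMAL.inv_cdf(mean_observed_power))

for true_power in (0.20, 0.50, 0.80):
    expected = expected_mean_observed_power(true_power)
    print(f"true {true_power:.2f} -> expected mean {expected:.3f} "
          f"-> corrected {corrected_power(expected):.2f}")
```

Under this assumption the average is pulled toward 50% from both sides, which matches the direction of the bias described above, and the bias vanishes when true power is exactly 50%.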
The Yuan-Maxwell correction approach is particularly promising for meta-analysis of heterogeneous sets of studies such as sets of diverse studies in a journal. The main advantage of this method is that averaging of power makes no assumptions about the distribution of power across different studies (Schimmack, 2012). The main limitation of averaging power was the systematic bias, but Yuan and Maxwell’s formula makes it possible to reduce this systematic bias, while maintaining the advantage of having a method that can be applied to heterogeneous sets of studies.
Homogeneous Effect Sizes and Sample Sizes
The first simulation used 100 effect sizes ranging from .01 to 1.00 and 50 sample sizes ranging from 11 to 60 participants per condition (Ns = 22 to 120), yielding 5,000 different populations of studies. The true power of these studies was determined on the basis of the effect size, sample size, and the criterion p < .025 (one-tailed), which is equivalent to .05 (two-tailed). Sample sizes were chosen so that average power across the 5,000 studies was 50%. The simulation drew 10 random samples from each of the 5,000 populations of studies. Each sample of a study simulated a between-subject design with the given population effect size and sample size. The results were stored as one-tailed p-values. For the meta-analysis, p-values were converted into z-scores. To avoid biases due to extreme outliers, z-scores greater than 5 were set to 5 (observed power = .999).
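The design of this simulation can be sketched as follows. The sketch is an approximation rather than the original simulation code: instead of generating raw data and computing t-tests, it draws each study's z-score directly from a normal distribution around d × √(n / 2), where n is the sample size per condition. The function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = 1.96   # p < .025 (one-tailed)
Z_CAP = 5.0     # z-scores above 5 are set to 5

def expected_z(d, n_per_group):
    """Approximate non-centrality parameter of a two-group comparison."""
    return d * sqrt(n_per_group / 2.0)

def true_power(d, n_per_group):
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - expected_z(d, n_per_group))

def simulate_z_scores(d, n_per_group, k=10, rng=random):
    """Draw k observed z-scores for one population of studies."""
    ncp = expected_z(d, n_per_group)
    return [min(rng.gauss(ncp, 1.0), Z_CAP) for _ in range(k)]

# 100 effect sizes (.01 to 1.00) crossed with 50 sample sizes (11 to 60)
effect_sizes = [i / 100 for i in range(1, 101)]
sample_sizes = range(11, 61)
populations = [(d, n) for d in effect_sizes for n in sample_sizes]
print(len(populations), "populations of studies")

rng = random.Random(2015)
print(simulate_z_scores(0.5, 32, rng=rng))
```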
The six estimation methods were then used to compute observed power on the basis of samples of 10 studies. The following figures show observed power as a function of true power. The green lines show the 95% confidence interval for different levels of true power. The figure also includes red dashed lines for a value of 50% power. Studies with more than 50% observed power would be significant. Studies with less than 50% observed power would be non-significant. The figures also include a blue line for 80% true power. Cohen (1988) recommended that researchers should aim for a minimum of 80% power. It is instructive to see how accurately the estimation methods evaluate whether a set of studies met this criterion.
The histogram shows the distribution of true power across the 5,000 populations of studies.
The histogram shows that the simulation covers the full range of power. It also shows that high-powered studies are overrepresented because moderate to large effect sizes can achieve high power for a wide range of sample sizes. The distribution is not important for the evaluation of different estimation methods and benefits all estimation methods equally because observed power is a good estimator of true power when true power is close to the maximum (Yuan & Maxwell, 2005).
The next figure shows scatterplots of observed power as a function of true power. Values above the diagonal indicate that observed power overestimates true power. Values below the diagonal show that observed power underestimates true power.
Visual inspection of the plots suggests that all methods provide unbiased estimates of true power. Another observation is that the count of significant results provides the least accurate estimates of true power. The reason is simply that aggregation of dichotomous variables requires a large number of observations to approximate true power. The third observation is that visual inspection provides little information about the relative accuracy of the other methods. Finally, the plots show how accurate observed power estimates are in meta-analysis of 10 studies. When true power is 50%, estimates very rarely exceed 80%. Similarly, when true power is above 80%, observed power is never below 50%. Thus, observed power can be used to examine whether a set of studies met Cohen’s recommended guidelines to conduct studies with a minimum of 80% power. If observed power is 50%, it is nearly certain that the studies did not have the recommended 80% power.
To examine the relative accuracy of different estimation methods quantitatively, I computed bias scores (observed power – true power). As observed power can overestimate and underestimate true power, the standard deviation of these bias scores can be used to quantify the precision of various estimation methods. In addition, I present the mean to examine whether a method has large sample accuracy (i.e., the bias approaches zero as the number of simulations increases). I also present the percentage of studies with no more than 20 percentage points of bias. Although 20 percentage points of bias may seem large, it is not important to estimate power with very high precision. When observed power is below 50%, it suggests that a set of studies was underpowered even if the observed power estimate is an underestimation.
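The three summary statistics can be computed as in the sketch below. The function name `summarize_bias` and the example values are hypothetical and are not results from the simulations.

```python
from statistics import mean, stdev

def summarize_bias(observed, true, tolerance=0.20):
    """Mean bias, SD of bias, and percentage of estimates within the tolerance."""
    bias = [o - t for o, t in zip(observed, true)]
    within = sum(abs(b) <= tolerance for b in bias) / len(bias)
    return {
        "mean_bias": mean(bias),        # systematic over- or underestimation
        "sd_bias": stdev(bias),         # precision of the estimates
        "pct_within": 100.0 * within,   # estimates within 20 percentage points
    }

# Hypothetical observed and true power for three sets of studies
summary = summarize_bias(observed=[0.6, 0.4, 0.9], true=[0.5, 0.5, 0.6])
print(summary)
```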
The quantitative analysis also shows no meaningful differences among the estimation methods. The more interesting question is how these methods perform under more challenging conditions when the set of studies are no longer exact replication studies with fixed power.
The next simulation introduced variation in sample sizes. For each population of studies, sample sizes were varied by multiplying a particular sample size by factors of 1 to 5.5 (1.0, 1.5, 2.0, …, 5.5). Thus, a base sample size of 40 created a range of sample sizes from 40 to 220. A base sample size of 100 created a range of sample sizes from 100 to 550. As variation in sample sizes increases the average sample size, the range of effect sizes was limited to a range from .004 to .4 and effect sizes were increased in steps of d = .004. The histogram shows the distribution of power in the 5,000 populations of studies.
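The sample-size manipulation amounts to the following arithmetic. The rounding of fractional sample sizes is an assumption of this sketch, and the function name is hypothetical.

```python
# Ten multipliers from 1.0 to 5.5 in steps of 0.5, one per study in a set
FACTORS = [1.0 + 0.5 * i for i in range(10)]

def varied_sample_sizes(base_n):
    """Sample sizes of the ten studies derived from one base sample size."""
    return [round(base_n * factor) for factor in FACTORS]

print(varied_sample_sizes(40))
print(varied_sample_sizes(100))
```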
The simulation covers the full range of true power, although studies with low and very high power are overrepresented.
The results are visually not distinguishable from those in the previous simulation.
The quantitative comparison of the estimation methods also shows very similar results.
In sum, all methods perform well even when true power varies as a function of variation in sample sizes. This conclusion may not generalize to more extreme simulations of variation in sample sizes, but more extreme variations in sample sizes would further increase the average power of a set of studies because the average sample size would increase as well. Thus, variation in effect sizes poses a more realistic challenge for the different estimation methods.
Heterogeneous, Normally Distributed Effect Sizes
The next simulation used a random normal distribution of true effect sizes. Effect sizes were simulated to have a reasonable but large variation. Starting effect sizes ranged from .208 to 1.000 and increased in increments of .008. Sample sizes ranged from 10 to 60 and increased in increments of 2 to create 5,000 populations of studies. For each population of studies, effect sizes were sampled randomly from a normal distribution with a standard deviation of SD = .2. Extreme effect sizes below d = -.05 were set to -.05 and extreme effect sizes above d = 1.20 were set to 1.20. The first histogram of effect sizes shows the 50,000 population effect sizes. The histogram on the right shows the distribution of true power for the 5,000 sets of 10 studies.
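Drawing the population effect sizes for one set of studies can be sketched as follows, using the standard deviation and the truncation limits given above. The function names are hypothetical.

```python
import random

LOWER, UPPER = -0.05, 1.20   # limits for extreme effect sizes

def clamp(d):
    """Set effect sizes outside the limits to the nearest limit."""
    return min(max(d, LOWER), UPPER)

def sample_effect_sizes(mean_d, k=10, sd=0.2, rng=random):
    """Draw k population effect sizes from a normal distribution and truncate them."""
    return [clamp(rng.gauss(mean_d, sd)) for _ in range(k)]

rng = random.Random(42)
print([round(d, 2) for d in sample_effect_sizes(0.208, rng=rng)])
```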
The plots of observed and true power show that the estimation methods continue to perform rather well even when population effect sizes are heterogeneous and normally distributed.
The quantitative comparison suggests that puniform has some problems with heterogeneity. More detailed studies are needed to examine whether this is a persistent problem for puniform, but given the good performance of the other methods it seems easier to use these methods.
Heterogeneous, Skewed Normal Effect Sizes
The next simulation puts the estimation methods to a stronger challenge by introducing skewed distributions of population effect sizes. For example, a set of studies may contain mostly small to moderate effect sizes, but a few studies examined large effect sizes. To simulate skewed effect size distributions, I used the rsnorm function of the fGarch package. The function creates a random distribution with a specified mean, standard deviation, and skew. I set the mean to d = .2, the standard deviation to SD = .2, and skew to 2. The histograms show the distribution of effect sizes and the distribution of true power for the 5,000 sets of studies (k = 10).
This time the results show differences in the ability of the estimation methods to deal with skewed heterogeneity. The percentage of significant results is unbiased, but is imprecise due to the problem of averaging dichotomous variables. The other methods show systematic deviations from the 95% confidence interval around the true parameter. Visual inspection suggests that the Yuan-Maxwell correction method has the best fit.
This impression is confirmed in quantitative analyses of bias. The quantitative comparison confirms major problems with the puniform estimation method. It also shows that the median, p-curve, and the average z-score method have the same slight positive bias. Only the Yuan-Maxwell corrected average power shows little systematic bias.
To examine biases in more detail, the following graphs plot bias as a function of true power. These plots can reveal that a method may have little average bias, but has different types of bias for different levels of power. The results show little evidence of systematic bias for the Yuan-Maxwell corrected average of power.
The following analyses examined bias separately for simulations with less or more than 50% true power. The results confirm that all methods except the Yuan-Maxwell correction underestimate power when true power is below 50%. In contrast, most estimation methods overestimate true power when true power is above 50%. The exception is puniform, which still underestimated true power. More research needs to be done to understand the strange performance of puniform in this simulation. However, even if p-uniform could perform better, it is likely to be biased with skewed distributions of effect sizes because it assumes a fixed population effect size.
This investigation introduced and compared different methods to estimate true power for a set of studies. All estimation methods performed well when a set of studies had the same true power (exact replication studies), when effect sizes were homogeneous and sample sizes varied, and when effect sizes were normally distributed and sample sizes were fixed. However, most estimation methods were systematically biased when the distribution of effect sizes was skewed. In this situation, most methods run into problems because the percentage of significant results is a function of the power of individual studies rather than the average power.
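The last point can be illustrated with a small hypothetical example: nine studies with a small expected z-score and one study with a very large one. The expected percentage of significant results is the average of the individual power values, which is not the same as the power implied by the average z-score. The values below are illustrative assumptions, not simulation results.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
Z_CRIT = 1.96

def power_from_ncp(ncp):
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - ncp)

# Skewed set: nine low-powered studies and one very high-powered study
ncps = [1.0] * 9 + [5.0]

mean_of_powers = mean(power_from_ncp(ncp) for ncp in ncps)   # expected success rate
power_at_mean_ncp = power_from_ncp(mean(ncps))               # pooled shortcut

print(f"average of individual power: {mean_of_powers:.3f}")
print(f"power at the average ncp:    {power_at_mean_ncp:.3f}")
```

In this example the pooled shortcut gives a higher value than the average of the individual power values, so a method that assumes a single non-centrality parameter misses the expected success rate.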
The results of these analyses suggest that the R-Index (Schimmack, 2014) can be improved by simply averaging power and then applying the Yuan-Maxwell correction. However, it is important to realize that the median method tends to overestimate power when power is greater than 50%. This makes it even more difficult for the R-Index to produce an estimate of low power when power is actually high. The next step in the investigation of observed power is to examine how different methods perform in unrepresentative (biased) sets of studies. In this case, the percentage of significant results is highly misleading. For example, Sterling et al. (1995) found that about 95% of published results were significant, which would suggest that studies had 95% power. However, publication bias and questionable research practices create a bias in the sample of studies that are being published in journals. The question is whether other observed power estimates can reveal bias and can produce accurate estimates of the true power in a set of studies.