BACKGROUND
About 10 years ago, I became disillusioned with psychology, mostly social psychology broadly defined, the area of psychology in which I specialized. Articles published in top journals became longer and longer, with more and more studies and more and more incredible findings that made no sense to me and that I could not replicate in my own lab.
I also became more familiar with Jacob Cohen’s criticism of psychology and with the concept of statistical power. At some point during these dark years, I found a short article in the American Statistician that changed my life (Sterling et al., 1995). The article presented a simple formula and explained that the high success rate in psychology journals (over 90% of reported results confirm authors’ predictions) is incredible, unbelievable, or unreal. Of course, I was aware that publication bias contributed to these phenomenal success rates, but Sterling’s article suggested that there is a way to demonstrate this with statistical methods.
Cohen (1962) estimated that the typical study in psychology has only 50% power. This means that a paper with two studies has only a 25% probability of confirming the author’s predictions in both of them. An article with four studies has less than a 10% probability (0.5^4 = 6.25%) of doing so. Thus, it was clear that many of these multiple-study articles in top journals had to be produced by means of selective reporting of significant results.
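A quick calculation in R makes the arithmetic explicit (a minimal sketch; the 50% power figure is simply Cohen’s estimate from above, and the studies are assumed to be independent):

```r
# Probability that every study in a multiple-study article is significant,
# assuming 50% power per study and independent studies.
power <- 0.5
k <- 1:4                      # number of studies in the article
round(power^k, 4)             # 0.5000 0.2500 0.1250 0.0625
```

With four independent studies at 50% power, fewer than 7 in 100 articles should be able to report a perfect record of significant results, yet journals report success rates above 90%.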
I started doing research with large samples and I started ignoring research based on these made-up results. However, science is a social phenomenon and questionable theories about unconscious emotions and attitudes became popular in psychology. Sometimes being right and being popular are different things. I started trying to educate my colleagues about the importance of power, and I regularly questioned speakers at our colloquium about their small sample sizes. For a while using the word power became a running joke in our colloquium, but research practices did not change.
Then came the year 2011. At the end of 2010, psychology departments all over North America were talking about the Bem paper. An article in press at the top journal JPSP was going to present evidence for extrasensory perception. In 9 out of 10 statistical tests, undergraduate students appeared to show precognition of random future events. I was eager to participate in the discussion group at the University of Toronto to point out that these findings were unbelievable, not because we know ESP does not exist, but because it is impossible to get 9 out of 10 significant results without having very high power in each study. Using Sterling’s logic, it was clear to me that Bem’s article was not credible.
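The binomial distribution shows just how much power would be needed (a hedged sketch in R; the power values are illustrative and not estimates of Bem’s actual designs):

```r
# Probability of at least 9 significant results in 10 independent studies,
# as a function of the (illustrative) true power of each study.
power <- c(0.5, 0.6, 0.8, 0.9)
p_at_least_9 <- pbinom(8, size = 10, prob = power, lower.tail = FALSE)
round(data.frame(power, p_at_least_9), 3)
```

Even with 80% power per study, the chance of 9 or more significant results in 10 studies is well under 50%; the pattern only becomes likely when each study has roughly 90% power or more, which is far more power than Cohen’s (1962) estimates suggest is typical.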
When I made this argument, I was surprised that some participants in the discussion doubted the logic of my argument more than Bem’s results. I decided to use Bem’s article to make my case in a published article. I was not alone. In 2011 and 2012 numerous articles appeared that pointed out problems with the way psychologists (ab)use the scientific method. Although there are many problems, the key problem is publication bias. Once researchers can select which results they report, it is no longer clear how many reported results are false positive results (Sterling et al., 1995).
When I started writing my article, I wanted to develop a test that reveals selective reporting so that this unscientific practice can be detected and punished, just like a doping test for athletes. Many psychologists do not like punishment and think carrots are better than sticks. However, athletes do not get medals for not taking doping, and taxpayers do not get a reward for filing their taxes. If selective reporting of results violates a basic principle of science, scientists should not do it, and they do not deserve a reward for doing what they are supposed to be doing.
THE INCREDIBILITY INDEX
In June 2011, I submitted my manuscript to Psychological Methods. After one and a half years and three rounds of reviews my manuscript finally appeared in print (Schimmack, 2012). Meanwhile, Greg Francis had developed a similar method that also used statistical power to reveal bias. Psychologists were not very enthusiastic about the introduction of our doping test.
This is understandable because the use of scientific doping was a widely accepted practice and there was no formal ban on selective reporting of results. Everybody was doing it, so when Greg Francis used the method to target a specific article, the authors felt attacked. Why me? You could have attacked any other article and found the same result.
When Greg Francis did analyze all articles published in the top journal Psychological Science for a specific time period, he found, indeed, that over 80% showed signs of bias. So, selective reporting of results is a widely used practice, and it makes no sense to single out a specific article: most articles are produced with selective reporting of results. When Greg Francis submitted his findings to Psychological Science, the article was rejected. It was not rejected because it was flawed. After all, it merely confirmed what everybody already knew, namely that researchers report only the results that support their theory. It was probably rejected because it was undesirable to document this widely used practice scientifically and to show how common selective reporting is. It was probably more desirable to maintain the illusion that psychology is an exact science with excellent theories that make accurate predictions that are confirmed when they are submitted to an empirical test. In truth, it is unclear how many of these success stories are false and would fail if they were replicated without the help of selective reporting.
ESTIMATION OF REPLICABILITY
After the publication of my 2012 paper, I continued to work on the issue of publication bias. In 2013 I met Jerry Brunner in the statistics department. As a former, disillusioned social psychologist who had earned a second degree in statistics, he was interested in my ideas. Like many statisticians, he was skeptical (to say the least) about my use of post-hoc power to reveal publication bias. However, he kept an open mind, and we have been working together on statistical methods for the estimation of power. As this topic has been largely neglected by statisticians, we were able to make some new discoveries, and we developed the first method that can estimate power under difficult conditions: when publication bias is present and when power is heterogeneous (varies across studies).
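The following toy simulation is not Brunner and Schimmack’s estimator; it only illustrates the problem such an estimator has to solve. When power varies across studies and only significant results are reported, the naive average of observed power among the published results overestimates the true average power (the uniform power distribution and the z-test setup are assumptions for the illustration):

```r
# Toy simulation: heterogeneous power plus selective reporting.
# NOT the Brunner & Schimmack estimator, only an illustration of the problem.
set.seed(123)
n_studies   <- 100000
true_power  <- runif(n_studies, 0.2, 0.8)        # power varies across studies
crit        <- qnorm(0.975)                      # two-sided z-test, alpha = .05
ncp         <- qnorm(true_power) + crit          # noncentrality implied by power
z           <- rnorm(n_studies, mean = ncp)      # observed test statistics
significant <- z > crit                          # which results get published

mean(true_power)                       # true average power, about 0.50
mean(significant)                      # unbiased success rate, also about 0.50
mean(pnorm(z[significant] - crit))     # naive observed power of the published
                                       # results, inflated well above 0.50
```

Any useful estimation method has to undo this inflation, which is exactly what makes estimation under selection bias and heterogeneity difficult.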
In 2015, I learned programming in R and wrote software to extract statistical results from journal articles (PDFs converted into text files). After downloading all articles from 105 journals for a specific time period (2010-2015) with the help of Andrew, I was able to apply the method to over 1 million statistical tests reported in psychology journals. The beauty of using all articles is that the results do not suffer from selection bias (cherry-picking). Of course, the extraction method misses some tests (e.g., tests reported in figures or tables), and the average across journals depends on the selection of journals. But the result for a single journal is based on all tests that are automatically extracted.
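The actual extraction code covers many reporting formats, but the basic idea can be sketched in a few lines of R (the regular expression and the example sentences below are illustrative placeholders, not the patterns used for the rankings):

```r
# Minimal illustration of extracting t-test statistics from article text.
# The regular expression and example lines are illustrative only.
text <- c("The effect was significant, t(48) = 2.53, p = .015.",
          "No effect was found, t(102) = 1.21, p = .23.")
matches <- regmatches(text, regexpr("t\\((\\d+)\\)\\s*=\\s*(-?\\d+\\.?\\d*)", text))
parsed <- do.call(rbind, lapply(matches, function(m) {
  df <- as.numeric(sub("t\\((\\d+)\\).*", "\\1", m))   # degrees of freedom
  t  <- as.numeric(sub(".*=\\s*", "", m))              # t-value
  data.frame(df = df, t = t, p = 2 * pt(abs(t), df, lower.tail = FALSE))
}))
parsed
```

Each extracted test statistic can then be converted into a p-value or z-score, which serves as input for estimating power.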
It is important to realize the advantage of this method compared to typical situations in which researchers rely on samples to estimate population parameters. For example, the OSF-reproducibility project selected three journals and replicated a single statistical test from each of only 100 articles (Science, 2015). Not surprisingly, the results of the project have been criticized as not being representative of psychology in general or even of the subject areas represented in the three journals. Similarly, psychologists routinely collect data from students at their local university, but assume that the results generalize to other populations. It would be easy to dismiss these results as invalid simply because they are not based on a representative sample. However, most psychologists are willing to accept theories based on these small and unrepresentative samples until somebody demonstrates that the results cannot be replicated in other populations (or they still accept the theory because they dismiss the failed replication). None of these sampling problems plague research that obtains data for the total population.
When the data were available and the method had been validated in simulation studies, I started using it to estimate the replicability of results in psychological journals. I also used it for individual researchers and for departments. The estimates were in the range from 40% to 70%. This was broadly in line with estimates obtained using Cohen’s (1962) method, which yields power estimates of 50-60% (Sedlmeier & Gigerenzer, 1989). Estimates in this range were consistent with the well-known fact that reported success rates in journals of over 90% are inflated by publication bias (Sterling et al., 1995). It would also be unreasonable to assume that all reported results are false positives, which would imply an estimate of 5% replicability because false positive results have only a 5% probability of being significant again in a replication study. Clearly, psychology has produced some reliable findings that can be replicated every year in simple classroom demonstrations. Thus, an estimate somewhere in the middle of the extremes between nihilism (nothing in psychology is true) and naive optimism (everything is true) seems reasonable and is consistent across estimation methods.
My journal rankings also correctly predicted the ranking of journals in the OSF-reproducibility project, where articles published in JEP:General were most replicable, followed by Psychological Science, and then JPSP. There is even a direct causal link between power and the actual replication rate: cognitive psychologists use more powerful designs, and power determines replicability in an exact replication study (Sterling et al., 1995).
I was excited to share my results in blogs and in a Facebook discussion group because I believed (and still believe) that these results provide valuable information about the replicability of psychological research; a topic that has been hotly debated since Bem’s (2011) article appeared.
The lack of reliable and valid information fuels this debate because opponents in the debate do not agree about the extent of the crisis. Some people assume that most published results are replicable (Gilbert, Wilson), whereas others suggest that the majority of published results are false (Ioannidis). Surprisingly, this debate rarely mentions Cohen’s seminal estimate of 50%. I was hoping that my results would provide some much needed objective estimates of the replicability of psychological research based on a comprehensive analysis of published results.
At present, there exist about five different estimates of the replicability of psychological research, ranging from 20% or less to 95%.
Less than 20%: Button et al. (2013) used meta-analyses in neuroscience, broadly defined, to suggest that the typical power is only about 20%, and their method did not even correct for effect sizes that are inflated by publication bias.
About 40%: A project that replicated 100 studies from social and cognitive psychology yielded about 40% successful replications; that is, about 40% of the replication studies reproduced a significant result. This estimate is slightly inflated because the replication studies sometimes used larger samples, which increased the probability of obtaining a significant result, but it may also be attenuated because the replication studies were not carried out by the same researchers using the same population.
About 50%: Cohen (1962) and subsequent articles estimated that the typical power in psychology is about 50% to detect a moderate effect size of d = .5, which is slightly higher than the average effect size found in meta-analyses of social psychology.
50-80%: The average replicability in my rankings is 70% for journals and 60% for departments. The discrepancy is likely due to the fact that journals that publish more statistical results (e.g., a six-study article in JPSP) have lower replicability. There is variability across journals and departments, but few analyses have produced values below 50% or above 80%. If I had to pick a single number, I would pick 60%, the average for psychology departments. 60% is also the estimate for new, open-access journals that publish thousands of articles a year, compared to small quarterly journals that publish fewer than one hundred articles a year.
If we simply use the median of these five estimates, Cohen’s estimate of 50% provides the best estimate that we currently have. The average estimate for 51 psychology departments is 60%. The discrepancy may be explained by the fact that Cohen focused on theoretically important tests. In contrast, an automatic extraction of statistical results retrieves all statistical tests that are reported in articles. It is unfortunate that psychologists often report hypothesis tests even when they are meaningless (e.g., the positive stimuli were rated as more positive (M = 6.00, SD = 1.00) than the negative stimuli (M = 2.00, SD = 1.00), d = 4.00, p < .001). Eventually, it may be possible to develop algorithms that exclude these statistical tests, but while they are included, replicability estimates include the probability of rejecting the null hypothesis for these obvious hypotheses. Taking this into account, estimates of 60% are likely to overestimate the replicability of theoretically important tests, which may explain the discrepancy between Cohen’s estimate and the results in my rankings.
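To make the difference concrete, here is a small R calculation (the sample size of n = 30 per group is an illustrative assumption, not a value from Cohen’s survey): a two-sample t-test has roughly 50% power to detect a moderate effect of d = .5, but essentially 100% power for an "obvious" manipulation-check effect of d = 4.

```r
# Power of a two-sample t-test for a moderate effect (Cohen's benchmark)
# versus a trivially large manipulation-check effect.
# n = 30 per group is an illustrative assumption.
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = .05)$power  # ~0.48
power.t.test(n = 30, delta = 4.0, sd = 1, sig.level = .05)$power  # ~1.00
```

Because automatically extracted results mix both kinds of tests, the automated estimate is pulled upward relative to an estimate based only on focal tests.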
CONCERNS ABOUT MY RANKINGS
Since I started publishing my rankings, some psychologists have raised concerns about them. In this blog post, I address these concerns.
#1 CONCERN: Post-Hoc Power is not the same as Replicability
Some researchers have argued that only actual replication studies can be used to measure replicability. This argument has two problems. First, actual replication studies do not provide a gold standard for estimating replicability. There are many reasons why an actual replication study may fail, and there is no shortage of examples where researchers have questioned the validity of actual replication studies. Thus, even the success rate of actual replication studies is only an estimate of the replicability of the original studies.
Second, original studies would already provide an estimate of replicability if no publication bias were present. If a set of original studies produced 60% significant results, an exact replication of these studies would also be expected to produce 60% significant results, within the margins of sampling error. The reason is that the success rate of any set of studies is determined by the average power of the studies (Sterling et al., 1995), and the average power of identical sets of studies is the same. The problem with using published success rates as estimates of replicability is that published success rates are inflated by selection bias (the selective reporting of results that support a theoretical prediction).
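A toy simulation illustrates this point (a sketch under an illustrative assumption about the distribution of power, not an analysis of real data): without selection, the original and replication success rates both match the average power, whereas selection inflates only the published success rate.

```r
# Toy illustration: without selection, original and replication success rates
# both equal the average power; selection inflates only the published rate.
# A uniform distribution of power is assumed for illustration.
set.seed(42)
n_studies <- 100000
power     <- runif(n_studies, 0.2, 0.8)
crit      <- qnorm(0.975)
ncp       <- qnorm(power) + crit
orig_sig  <- rnorm(n_studies, ncp) > crit     # original studies
rep_sig   <- rnorm(n_studies, ncp) > crit     # exact replications, same power

mean(orig_sig); mean(rep_sig)        # both about 0.50 (= average power)
mean(rep_sig[orig_sig])              # replication rate of the *published*
                                     # results, about 0.56, far below the
                                     # 100% published success rate
```

The replication rate of the published studies stays far below the near-perfect success rate that selective reporting creates, which is exactly the gap that the published literature hides.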
The main achievement of Brunner and Schimmack’s statistical estimation method was to correct for selection bias so that reported statistical results can be used to estimate replicability. The estimate produced by this method is an estimate of the success rate in an unbiased set of exact replication studies.
#2 CONCERN: Post-Hoc Power does not predict replicability.
In the OSF-project, observed power predicted actual replication success with a correlation of r = .23. This may be interpreted as evidence that post-hoc power is a poor predictor of actual replicability. However, the problem with this argument is that statisticians have warned repeatedly about the use of post-hoc power for a single statistical result (Hoenig & Heisey, 2001). The confidence interval around a single estimate is so wide that only extremely high power (> 99%) leads to accurate predictions that a study will replicate. For most studies, the confidence interval around the point estimate is too wide to make accurate predictions.
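A small simulation shows how wide this uncertainty is (a sketch; the 60% true power is just an illustrative value):

```r
# Spread of single-study post-hoc power when the true power is 60% for
# every study (60% is an illustrative value).
set.seed(1)
crit <- qnorm(0.975)
ncp  <- qnorm(0.60) + crit                 # noncentrality for 60% power
z    <- rnorm(10000, mean = ncp)           # one test statistic per study
obs_power <- pnorm(z - crit)               # single-study post-hoc power
quantile(obs_power, c(.025, .5, .975))     # roughly 0.04, 0.60, 0.99
```

Any single study’s post-hoc power estimate is nearly uninformative, even though the true power is identical for all of them.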
However, this does not mean that post-hoc power cannot predict replicability for larger sets of studies. The reason is that the precision of the estimate increases as the number of tests increases. So, when my rankings are based on hundreds or thousands of tests published in a journal, the estimates are sufficiently precise to be useful. Moreover, Brunner and Schimmack developed a bootstrap method that provides 95% confidence intervals for the estimates, and these confidence intervals can be used to examine whether differences in ranks are statistically meaningful.
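Continuing the sketch above (an illustration of the general idea, not Brunner and Schimmack’s implementation), a simple percentile bootstrap shows how tight the interval for the average becomes once thousands of tests are aggregated:

```r
# Percentile bootstrap interval for the *average* post-hoc power estimate,
# reusing obs_power from the previous sketch (10,000 simulated tests).
set.seed(2)
boot_means <- replicate(2000, mean(sample(obs_power, replace = TRUE)))
quantile(boot_means, c(.025, .975))   # narrow interval around mean(obs_power)
```

With only a handful of tests, the same interval would be far wider, which is why averages over large sets of tests, not single studies, are the unit of analysis for the rankings.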
#3 CONCERN: INSUFFICIENT EVIDENCE OF VALIDITY
I have used the OSF-reproducibility project (Science, 2015) to validate my rankings of journals. My method correctly predicted that results from JEP:General would be more replicable than those in Psychological Science, which in turn would be more replicable than those in JPSP. My estimate based on the extraction of all test statistics from the three journals was 64%, whereas the actual replication success rate was 36%. The estimate based on all tests overestimates the replicability of theoretically important tests, and the actual success rate underestimates replicability because of problems in conducting exact replication studies. The average of the two estimates is 50%, close to the best estimate of replicability.
It has been suggested that a single comparison is insufficient to validate my method. However, this argument ignores that the OSF-project was the first attempt at replicating a representative set of psychological studies and that the study received a lot of praise for doing so. So, N = 1 is all we have to compare my estimates to estimates based on actual replication studies. When more projects of this nature become available, I will use the evidence to validate my rankings and, if there are discrepancies, use this information to increase the validity of my rankings.
Meanwhile, it is simply false that a single data point is insufficient to validate an estimate. There is only one Earth, so any estimate of global temperature has to be validated with just one data point. We cannot wait for validation of this method on 199 other planets to decide whether estimates of global temperature are valid.
To use an example from psychology, if a psychologist wants to validate a method that presents stimuli subliminally and lets participants guess whether a stimulus was presented or not, the method is valid if participants are correct 50% of the time. If the percentage is 55%, the method is invalid because participants can detect the stimuli at above-chance levels.
Also, validity is not an either-or construct. Validity comes in degrees. The estimates underlying my rankings do not perfectly match the OSF results or Cohen’s method. None of these methods are perfect. However, they converge on the conclusion that the glass is half full and half empty. The consensus across methods is encouraging. Future research has to examine why the methods differ.
In conclusion, the estimates underlying my replicability rankings are broadly consistent with two other methods of estimating replicability: Cohen’s method of estimating post-hoc power for medium effect sizes and the actual replication rate in the OSF-project. The replicability rankings are likely to overestimate the replicability of focal tests by about 10% because they include statistical tests of manipulation checks and covariates that are theoretically less important. This bias may not be constant across journals, which could affect the rankings to some extent, but it is unknown whether this is actually the case and how much the rankings would be affected. Pointing out that this potential bias could reduce the validity of the rankings does not lead to the conclusion that they are invalid.
#4 CONCERN: RANKINGS HAVE TO PASS PEER-REVIEW
Some researchers have suggested that I should wait to publish my results until this methodology has passed peer review. In my experience, this would probably take a couple of years. Maybe that would have been an option when I started as a scientist in the late 1980s, when articles were printed, photocopied, and sent by mail if the local library did not have a journal. However, this is 2016: information is shared at lightning speed, and articles are critiqued on twitter or pubpeer before they are officially published.
I learned my lesson when the Bem (2011) article appeared and it took one and a half years for my article to be published. By that time, numerous articles had been published, and Greg Francis had published a critique of Bem using a similar method. I was too slow.
In the meantime, Uri Simonsohn gave two SPSP symposia on pcurve before the actual pcurve article was published in print, and he had a pcurve.com website. When Uri presented the method for the first time (I was not there), it created an angry response from Norbert Schwarz. Nobody cares about Norbert’s response anymore; pcurve is widely accepted, and version 4.0 looks very different from the original version of pcurve. Angry and skeptical responses are to be expected when somebody does something new, important, and disruptive, but this is part of innovation.
Second, I am not the first one to rank journals or departments or individuals. Some researchers get awards suggesting that their work is better than the work of those who do not get awards. Journals with more citations are more prestigious, and departments are ranked in terms of popularity among peers. Who has validated these methods of evaluation and how valid are they? Are they more valid than my replicability rankings?
At least my rankings are based on solid statistical theory and predict correctly that cognitive psychology is more replicable than social psychology. The fact that mostly social psychologists have raised concerns about my method may reveal more about social psychologists than about the validity of my method. Social psychologists also conveniently ignore that the OSF replicability estimate of 36% is an average across areas and that the estimate for social psychology was an abysmal 25% and that my journal rankings place many social psychology journals at the bottom of the ranking. One would only have to apply social psychological theories about heuristics and biases in cognitive processes to explain social psychologists’ concerns about my rankings.
CONCLUSION
In conclusion, the actual replication rate for a set of exact replication studies is identical to the true average power of the studies. Average power can be estimated on the basis of reported test statistics, and Brunner and Schimmack’s method can produce valid estimates when power is heterogeneous and when selection bias is present. When this method is applied to all statistics in the population (all journals, all articles by an author, etc.), rankings are not affected by selection bias (cherry-picking). When the set of statistics includes all statistical tests, as, for example, with an automated extraction of test statistics, the estimate is an estimate of the replicability of a randomly picked statistically significant result from a journal. This may be a manipulation check or a theoretically important test. It is likely that this estimate overestimates the replicability of critically important tests, especially those that are just significant, because selection bias has a stronger impact on results with weak evidence. The estimates are broadly consistent with other estimation methods, and more data from actual replication studies are needed to further validate the rankings. Nevertheless, the rankings provide the first objective estimate of replicability for different journals and departments.
The main result of this first attempt at estimating replicability provides clear evidence that selective reporting undermines the validity of published success rates. Whereas published success rates are over 90%, the actual success rate for studies that end up being published because they produced a desirable result is closer to 50%. The negative consequences of selection bias are well known. Reliable information about actual replicability and selection bias is needed to increase the replicability, credibility, and trustworthiness of psychological research. It is also needed to demonstrate to consumers of psychological research that psychologists are improving the replicability of their research. Whereas rankings will always show differences, all psychologists are responsible for increasing the average. Real improvement would produce an increase in replicability on all three estimation methods (actual replications, Cohen’s method, and Brunner and Schimmack’s method). It is an interesting empirical question when and by how much replicability estimates will increase in the future. My replicability rankings will play an important role in answering this question.
Good write-up.
“(Mean = 6.00, SD = 1.00) than the negative stimuli (M = 2.00, SD = 1.00, d = 5.00, p < .001)" I think this should be d=4.00. 6-2=4, 4/1=4.
Given the extreme slowness of peer review, perhaps you should consider publishing on The Winnower. There is no pre-review, but it gets the material on Google Scholar and there is an option for post-review (i.e. with comments like here). Personally, I publish papers there that do not need heavy reviewing such as commentaries.
Thanks for the feedback and I will correct the mistake.