Dr. Ulrich Schimmack Blogs about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartoš & Schimmack, 2022).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science. 

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers and provide guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure what they are supposed to measure. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in my story how I ended up becoming a meta-critic of psychological science, you can read it here (my journey). 

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1–22. https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246


Mis-Modeling of Multi-Method Matrices

To be continued,…

I just discovered the journal “Applied Psychological Measurement” by accident. The 4th most cited article is from 1985 and discusses hierarchical modeling of multi-method data. It is probably telling that I was able to work on multi-method data for decades without finding this reference. Stay tuned for a modern update on the proper modeling of multi-method data.

If you are interested in this topic and want to work on it together, let me know.

Why You Should Not Trust P-Curve

In 2011, Simmons, Nelson, and Simonsohn published an article that showed with simulation studies how researchers can get significant results without a real effect. This practice has become widely known as p-hacking. In 2014, the authors presented a statistical method that uses the distribution of significant p-values to determine whether the significant results have evidential value or whether they were p-hacked. This method is called p-curve, and it has been used in numerous meta-analyses.

While p-hackers like Norbert Schwarz were initially afraid that p-curve could reveal their questionable research practices, the past 10 years have shown that p-curve is often used to sell p-hacked results as credible or even robust evidence.

In this blog post, I focus on the article “Meta-Analyses and P-Curves Support Robust Cycle Shifts in Women’s Mate Preferences: Reply to Wood and Carden (2014) and Harris, Pashler, and Mickes (2014)” by Kelly Gildersleeve, Martie G. Haselton, and Melissa R. Fales, in the prestigious journal Psychological Bulletin.

The article reports a p-curve analysis of studies that examined the influence of women’s cycle on their mate preferences. A meta-analysis by the authors in 2014 appeared to support the ovulatory shift hypothesis. Articles by Wood et al. and Harris et al. questioned these results and suggested that the evidence might be inflated by selective reporting of confirmatory evidence.

To respond to this criticism, Gildersleeve et al. conducted a p-curve analysis. A p-curve analysis has two parts. One part is a histogram of significant p-values with five bins for p-values from .05 to .04, .04 to .03, .03 to .02, .02 to .01, and .01 to .00. The second part is a significance test of the distribution of p-values against the null-hypothesis that p-values have a uniform distribution. A uniform distribution is expected when the null-hypothesis is true.
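To make the two parts concrete, here is a minimal sketch in Python (my own illustration, not the authors’ code and not the official p-curve implementation): the descriptive part bins the significant p-values, and a simple Kolmogorov–Smirnov test against a uniform distribution on (0, .05) stands in for the inferential part, which in the published p-curve method is based on aggregated pp-values.

```python
import numpy as np
from scipy import stats

def pcurve_sketch(p_values, alpha=0.05):
    """Descriptive half of a p-curve plus a simplified uniformity test.
    Only significant p-values enter the analysis."""
    p = np.asarray(p_values, dtype=float)
    sig = p[p < alpha]
    counts, _ = np.histogram(sig, bins=[0, .01, .02, .03, .04, .05])
    # Under the null-hypothesis (no effect), significant p-values are uniform on (0, alpha).
    ks = stats.kstest(sig / alpha, "uniform")
    labels = [".00-.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"]
    return dict(zip(labels, counts)), ks.pvalue

# Right-skewed p-values (piled up near zero) are taken as a sign of evidential value.
print(pcurve_sketch([.001, .003, .004, .012, .020, .031, .049]))
```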

They report two p-curve analyses. Figure 3 shows a so-called right-skewed distribution and a p-value of .0005.

Another analysis supports this conclusion, but with a less impressive p-value.

Based on these results, the authors were allowed to conclude that there is strong evidence in support of the cycle-shift hypotheses and to claim in the title that there is robust evidence for it.

As we previously reported, our meta-analysis revealed strong support for the ovulatory shift hypothesis. New analyses using the p-curve method again revealed strong support for genuine cycle shifts as predicted by the ovulatory shift hypothesis. Claims by Wood et al. (2014), Wood and Carden (2014), and Harris et al. (2014) that the abundance of positive findings in the cycle shifts literature merely reflects publication bias, p-hacking, or other research artifacts did not anticipate and cannot explain these new findings.

The authors go even further and suggest that their results contradict claims that p-hacking is a major problem in this line of research.

Given recent doubts about the evidential value of published research findings, many researchers have called for “cleaning up” psychological science. We fully support this effort. However, just as claims regarding the existence of hypothesized effects should be supported with strong empirical evidence, so should claims regarding whether p-hacking or other practices have produced the illusion of positive evidence. In the case of the literature on cycle shifts in women’s mate preferences, speculations about a widespread “false positive problem” are unwarranted.

So far, this article has been cited 50 times. I could not find articles criticizing these conclusions. Instead, some articles cited the article to claim that there is evidence for the ovulatory shift hypothesis. For example, Lewis (2020) wrote:

Gildersleeve, Haselton, and Fales (2014b) provide a robust defence of their meta-analysis suggesting that the p-hacking could not be the sole reason for the observed shifts in masculinity preferences (p. 2).

Since p-curve was published in 2014, new methods have been developed to examine publication bias and evidential value in meta-analyses. My colleagues and I have developed z-curve for this purpose (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020). I used the data in Gildersleeve et al.’s Supplement to conduct a z-curve analysis of their data.

A z-curve analysis converts two-sided p-values into absolute z-scores. A z-curve plot shows the distribution of these z-scores, but unlike a p-curve plot, it does not truncate p-values / z-scores at the level of significance (z = 1.96, red dashed line). Thus, visual inspection of a z-curve plot makes it easy to spot p-hacking and other practices that lead to an overrepresentation of significant results. The z-curve plot for Gildersleeve et al.’s data makes it obvious that the published results are selected for significance. The only reason there is a non-significant result at all is that one study used a one-sided test to get significance, but the two-sided p-value falls short of significance.

Z-curve also provides a statistical test of selection for significance. For this purpose, z-curve uses a mixture model to predict the distribution of non-significant results (the dotted blue curve from 0 to 1.96). Based on the model, the reported significant results are just 23% of all the results that would be expected under the selection model. Due to the small number of studies, the 95% CI around this point estimate ranges from 5% to 51%, but that is still well below the observed rate of 92% significant results. In short, z-curve makes it clear that selection bias is present, whereas p-curve does not provide information about the presence of selection bias. Z-curve does not show how much p-hacking or failure to report non-significant results contributes to the discrepancy between the observed and expected discovery rate, but that is not important. What matters is that questionable practices contributed to the evidence for the ovulatory cycle hypothesis.
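The conversion step is simple enough to show in a few lines of Python (an illustration under my own assumptions; the mixture-model estimation itself is done with the z-curve software and is not reproduced here):

```python
import numpy as np
from scipy.stats import norm

def p_to_abs_z(p_two_sided):
    """Convert two-sided p-values into the absolute z-scores shown in a z-curve plot."""
    p = np.asarray(p_two_sided, dtype=float)
    return norm.isf(p / 2)                     # p = .05 maps onto z = 1.96

z = p_to_abs_z([.001, .01, .04, .049, .20])
print(np.round(z, 2))                          # [3.29 2.58 2.05 1.97 1.28]
print("significant:", z > 1.96)                # only the last value falls below the criterion
```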

Z-curve also provides a different answer to the question whether the data provide evidence for the hypothesis after taking selection bias into account. One way to address this question is the expected replication rate (ERR). The ERR is a measure of the average power of studies with significant results. It predicts the outcome of exact replication studies with the same sample sizes because the long-run rate of significant results is determined by the average power of these studies (Brunner & Schimmack, 2020). The point estimate of the ERR is 23%. This is a bit lower than the implied power in p-curve plots, which show the predicted line for 33% power. More importantly, this point estimate also comes with a wide confidence interval due to the small number of studies, and the 95% confidence interval includes a value of 5%, which is expected when the null-hypothesis is true and all studies were false positives. Thus, there is insufficient evidence to rule out the possibility that all studies are false positives. Moreover, even if some of these studies are not false positives, replication failures with the same sample sizes are likely, and replication studies would need larger samples to provide evidence for the effect, despite the fact that two studies had sample sizes over 7,000 participants. This implies that the true effect sizes are very small even if they are not zero.
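The logic behind the ERR can be illustrated with a small simulation (entirely hypothetical numbers, not the meta-analytic data): draw studies with known true power, keep only the significant ones, and the rate of successful exact replications matches the average true power of the selected studies.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
crit = norm.isf(0.025)                           # 1.96 for two-sided alpha = .05

# Hypothetical mix of true mean z-values (noncentralities) across studies.
ncp = rng.choice([0.0, 1.0, 2.0, 3.0], size=200_000)
z_orig = ncp + rng.standard_normal(ncp.size)
selected = np.abs(z_orig) > crit                 # only significant originals get "published"

# True two-sided power of each selected study, and one exact replication per study.
true_power = norm.sf(crit - ncp[selected]) + norm.cdf(-crit - ncp[selected])
z_rep = ncp[selected] + rng.standard_normal(selected.sum())

print(round(true_power.mean(), 3))               # expected replication rate (mean power)
print(round((np.abs(z_rep) > crit).mean(), 3))   # observed replication rate, nearly identical
```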

Finally, z-curve 2.0 makes it possible to estimate the false positive risk; that is, the percentage of significant results that are false positives. This risk can be estimated from the expected discovery rate using a formula by Soric (1989). The point estimate is 22%, suggesting that some of the results may not be false positives, but the 95% confidence interval around this estimate ranges from 5% to 100%. Thus, the evidence is insufficient to rule out the possibility that all of the results are false positives at the typical level of certainty that is used to draw scientific conclusions (alpha = 5%).
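Soric’s bound is a one-line formula: the maximum false discovery rate follows from the discovery rate and the significance criterion. Here is a minimal sketch (the discovery-rate values below are illustrative; they are not the estimates from this meta-analysis):

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate,
    given a discovery rate and the significance criterion alpha."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

# Illustrative discovery rates only: lower discovery rates imply a higher false positive risk.
for dr in (0.10, 0.20, 0.50):
    print(dr, round(soric_max_fdr(dr), 2))    # 0.1 -> 0.47, 0.2 -> 0.21, 0.5 -> 0.05
```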

Importantly, the absence of evidence is not the same as evidence of the absence of an effect. The data are simply uninformative. Based on the 24 significant results in Gildersleeve et al.’s meta-analysis, we have no scientific evidence in support of the ovulatory shift hypothesis. The z-curve analysis makes it obvious that the data lack evidential value. In contrast, a p-curve analysis of these data allowed the authors to claim robust evidence for the hypothesis. Not surprisingly, p-curve results are often used to claim that effects are real, and p-curve is rarely used to claim that data have no evidential value. This problem is even worse when data are more heterogeneous than the present data because p-curve assumes a single parameter, whereas z-curve is a mixture model that allows for heterogeneity in population z-scores.

If you find the z-curve results convincing, you may not want to trust p-curve results and may prefer to subject the data to a z-curve analysis to ensure that you are not fooled by p-hacked data.

If you trust p-curve and evolutionary theory, you might change your mind after reading the reflections of a leading researcher in this area, whose lab contributed 12 of the 25 results in this meta-analysis.

Gangestad in “The Wax and Wane of Ovulating-Woman Science” by Daniel Engber in Slate:
“When we wrote the book, we were drawing on a broad literature,” Gangestad told me, “but some of what we wrote was just garbage because we trusted all that work, including our own.”

Harvard’s Implicit Psychopathology Problem

A popular children’s song in America has the lyrics “If you’re happy and you know it, clap your hands.”

Since the 1960s, hundreds of surveys with millions of people have relied on self-reports of happiness to study wellbeing. While critics have challenged the validity of these reports (Schwarz & Strack, 1999), hundreds of studies have shown that these reports have some validity (Diener, Lucas, Schimmack, & Helliwell, 2009). For example, they show moderate correlations with informant ratings (Schneider & Schimmack, 2009; Zou et al., 2013), and national averages correlate r = .8 with median income across nations (Gallup World Poll, 2008-2023).

At the same time, self-reports are not 100% valid. They are influenced by response styles and socially desirable responding. To address these concerns, it would be beneficial to have alternative measures of happiness that do not rely on self-reports.

Over the past 20 years, social psychologists have popularized the use of implicit measures. Implicit measures assess content in people’s memory without asking for the information directly. The most popular implicit measure is the Implicit Association Test (IAT). In this task, respondents work on two classification tasks that are easy to do on their own. For example, they have to press one button for words related to them and another button for words not related to them (me / not me). In the other task, they have to classify words related to happiness (smiling) and sadness (crying). The critical trials mix stimuli from both tasks and pair two categories on the same response key. Most people find it easier to do the task when me is paired with happy and not me is paired with sad than in the reverse order. The crucial question is whether differences in reaction times on this task provide a valid measure of individuals’ happiness. For example, somebody may “clap their hands,” report high happiness on survey items, and yet show faster responses in the “me/sad vs. not me/happy” condition than in the “me/happy vs. not me/sad” condition.
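The post does not describe how the reaction times are scored. For readers unfamiliar with the test, a common scoring approach divides the latency difference between the two critical pairings by the pooled standard deviation of the latencies; the sketch below (hypothetical latencies, omitting the error penalties and trimming steps of the full scoring algorithm) shows the basic idea.

```python
import numpy as np

def iat_d_score(rt_incompatible, rt_compatible):
    """Simplified IAT D-score: difference between mean latencies in the two critical
    pairings divided by the pooled SD of all latencies. The full scoring algorithm
    also trims extreme latencies and penalizes errors; those steps are omitted here."""
    rt_i = np.asarray(rt_incompatible, dtype=float)   # e.g., me/sad with not-me/happy
    rt_c = np.asarray(rt_compatible, dtype=float)     # e.g., me/happy with not-me/sad
    pooled_sd = np.concatenate([rt_i, rt_c]).std(ddof=1)
    return (rt_i.mean() - rt_c.mean()) / pooled_sd

# Hypothetical latencies in milliseconds; a positive score means faster "happy-me" responses.
print(round(iat_d_score([820, 790, 860, 900], [650, 700, 640, 690]), 2))
```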

Walker and Schimmack (2008) examined the convergent validity of the happiness IAT with self-reports and informant reports. The use of informant reports was crucial because the IAT is only useful if it provides information about happiness that self-reports do not provide. If the happiness IAT merely confirmed the information in self-reports (i.e., people who know how happy they are and report it accurately also show faster responses in the “happy-me” condition), it would add little. However, Walker and Schimmack replicated the moderate agreement between self-ratings and informant ratings and found equally low correlations of the happiness IAT with both. These results suggested that the happiness IAT cannot be used to improve the measurement of happiness.

The Harvard Implicit Association Test

In the United States, Harvard is a brand, although the brand has lost a bit of its luster lately. Nevertheless, many people may associate Harvard with excellence in science and assume a test administered by Harvard is scientifically valid (I did not administer a Harvard – Science IAT to test this hypothesis).

Visitors of the Harvard IAT website are not provided with any information about the validity of their test scores. This is problematic because people may assume that their test scores are valid without a warning that the scores can be biased by measurement error. This is particularly problematic when the test is used for the assessment of mental illnesses like depression or suicidal ideation. I have shown that the suicide IAT lacks evidence of validity (Schimmack, 2021). However, I was not able to find information about the validity of the depression IAT, which is essentially a happiness-sadness IAT (me/not me, happy/sad). I emailed the Project Implicit team to ask for information about the validity of the Harvard IAT as a measure of clinical depression.

Here is the answer:

Hello, and thank you for your question.
We would advise contacting the Project Implicit Health team: https://implicit.harvard.edu/implicit/user/pih/pih/thescientists.html
The Project Implicit Services Team

I then contacted Dr. Teachman, who was listed first on the Project Implicit Health website. Dr. Teachman sent me two articles. One of them was written by Dr. Teachman and colleagues and published in 2019 (Teachman et al., 2019).

To summarize the article, there is essentially no evidence that the depression IAT measures clinical depression or even just normal variation in happiness.

1. Disclosure Statement

All scientists, including myself, have biases. It is therefore important to make readers aware of potential conflicts of interest. Teachman does point out that she has a stake in IAT research on mental health. This makes it more likely that she would present ambiguous evidence in a more positive light.

2. The summary statement

The summary statement makes it clear that the IAT is a procedure that can be more or less valid. For example, I showed that it works well for the measurement of political orientation, but not for self-esteem (Schimmack, 2021). Teachman et al. point out that it works better for anxiety than for depression. Yet, they also point out that these measures have methodological limitations.

3. Major Depressive Disorder: Empirical Evidence

The review of the empirical evidence starts with the observation that there is a lack of convergent validity.

Then a couple of studies are mentioned that show the expected relationship.

These inconsistent findings are used to claim that there is convergent validity between IAT scores and clinical diagnoses, while pointing out that the evidence is mixed. What is lacking is some quantitative information about the amount of convergent validity between IAT scores and clinical diagnoses.

A valid measure of depression should be sensitive to changes in depression. However, this evidence is also mixed.

This section ends with the conclusion that “measures of implicit cognition are useful for understanding the phenomenology of MDD” (p. 135). Call me crazy, but I do not see how this conclusion follows from the mixed evidence reviewed in this section.

Conclusion

Project Implicit uses the Harvard brand to lure people to their website to take tests and obtain data without paying participants. Participants may assume that they are provided adequate compensation because they receive feedback about themselves from scientific tests of depression and other mental health problems. These visitors are not provided with information about the validity of the tests and are not warned that test scores may be invalid due to measurement error. Moreover, the Project Implicit researchers do not even have information about the validity of their tests and ignore valid criticism of the IAT as a measure. While this may not be a problem when the IAT is used to diagnose whether somebody likes Pepsi or Coke, it is a serious problem when visitors take a mental health IAT and receive false information that they suffer from implicit depression or have suicidal ideation outside of their awareness. It is clear why Project Implicit likes to sell their questionable tests with the Harvard brand, but it is less clear why Harvard would want to further taint its brand with pseudo-scientific tests.

Why you should not trust Jolynn Pek and Duane T. Wegener

When I started as an undergraduate student in 1988, a big problem was finding relevant articles or books. I became interested in emotion research and by 1993, I had read pretty much every relevant piece of academic research published at that time. I even called a researcher at the University of Hawaii, who sent me an unpublished manuscript.

Fast forward to 2024 and we live in a world where we are flooded with too many articles on every possible topic. The new problem is that most of these published articles are useless because they (a) ignore relevant prior articles because authors did not have time to read everything in their area, (b) misrepresent facts, or (c) present well-known facts as news. In this new world of information overflow, the new challenge is to find the articles that are actually useful. This is especially problematic for novices, who are unable to evaluate the quality of published articles.

What we urgently need is a consumer report that provides consumers of scientific information with credible information about the quality of the product they are about to consume. This requires funding researchers to be independent evaluators of research. Maybe there could be a meta-journal that republishes articles that have passed expert-review rather than peer-review.

Unfortunately, I am not an independent expert. Rather, my colleagues and I have worked for over a decade to create statistical tools that can reveal questionable research practices that turn non-significant results into significant ones that can be published. The basic idea of these statistical tools is simple and can be traced to Sterling et al.’s (1995) article that examined the success rate in psychology journals. That is, how often do psychology articles report a statistically significant result that supports researchers’ predictions about the direction of an effect (e.g., priming people with words about old people without their awareness makes them walk slower)?

Sterling et al. (1995) replicated Sterling’s earlier finding from 1959 that psychology journals have a success rate of over 90%. He pointed out that this is implausible for two reasons. First, such a high success rate would require testing only true hypotheses because false hypotheses have only a 5% probability of producing a significant result. Second, even if a true hypothesis is tested, sampling error can produce non-significant results, especially in the between-subject designs with small samples favored by experimental social psychologists before 2010. Thus, the high rate of significant results suggests that articles with significant results are published and articles with non-significant results are not (Rosenthal, 1979). The problem is that selection bias undermines the value of a significant result, and in theory most or even all published results could be false (Ioannidis, 2005).

Sterling’s theoretical insight led to several empirical attempts to detect selection bias (Francis, 2012; Ioannidis & Trikalinos, 2007; Schimmack, 2012). These tests rely on a comparison of the success rate and mean observed power. Without selection bias, the success rate cannot be higher than mean observed power (Brunner & Schimmack, 2020). Numerous articles have shown that social psychology articles often report success rates of 90% or more with much lower estimates of mean power. For example, Schimmack (2020) analyzed 678 statistical tests that were hand-coded by Motyl et al. (2017) from social psychology journals.

The results showed a success rate of 90%, which replicates Sterling’s findings from 1959 and 1995. The estimated mean power – called the expected discovery rate – was 19%. Even if we take uncertainty in the power estimate into account and use the upper limit of a conservative confidence interval, mean power is only 36%. These results also explain why actual replications of a representative sample of studies produced only 25% significant results (Open Science Collaboration, 2015).
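The logic can be illustrated with a simple binomial bound (my own simplification; the Incredibility Test in Schimmack, 2012, works with the observed power of each individual test): even using the generous 36% upper limit as the success probability, observing roughly 610 significant results out of 678 tests has essentially zero probability.

```python
from scipy.stats import binom

n_tests = 678
observed_sig = round(0.90 * n_tests)    # roughly 610 significant results reported
mean_power_upper = 0.36                 # conservative upper limit of the power estimate

# Probability of at least this many significant results if mean power were really 36%.
# The value is essentially zero, so the 90% success rate is not credible.
print(binom.sf(observed_sig - 1, n_tests, mean_power_upper))
```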

In short, estimating the mean power of published studies provides a powerful tool to examine the credibility of success rates by authors, journals, or scientific disciplines. The near perfect success rate in experimental social psychology is inconsistent with the power of studies to produce significant results. This is a key problem that has led to the replication crisis or crisis of confidence in social psychology and initiatives that encourage more honest reporting of non-significant results (badges for data sharing and pre-registration of designs and analysis plans).

You Shall Not Compute Observed Power

Not everybody likes a powerful tool to reveal shady practices, least of all people who have engaged or are engaging in shady practices. For example, many people who got away with murder were not pleased when it became possible to identify them with DNA analyses decades after they committed murder. In sports, it suddenly became possible to reveal doping when more powerful tests were used to analyze frozen urine samples decades later. In this regard, observed power is a doping test for scientists, especially prolific ones who have published hundreds of statistical results. Their large number of statistical results makes it easy to show that their success rates are incredible.

The first criticism against the use of bias tests was that it is obvious that 90% success rates are not realistic and that a bias test only reveals the obvious fact that everybody was selecting for significance. For this reason, the developers of p-curve did not even care to test for publication bias and just assumed that it is present. The response to this criticism is that bias tests can actually show that there is no bias and that they can quantify the amount of publication bias. For example, psychologists might be surprised that meta-analyses of clinical trials show no signs of publication bias (van Zwet et al., 2023).

Gaslighting about the Use of Bias Tests

A team of 11 authors, led by social psychologist Roger Giner-Sorolla, wrote a long review article about statistical power for social psychologists (Giner-Sorolla et al., 2023). They agree that “publication bias can be assumed in most topics of psychology” (p. 22). They also mention our attempt to estimate the amount of publication bias using statistical analyses of published results: “Numerous published analyses have attempted to test the credibility of published multi-study articles or even literatures, using one of the specific methods developed for these purposes: for example, p-curve (Simonsohn et al., 2014) or Z-curve (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020).” [To clarify, p-curve was explicitly not designed to assess publication bias because it assumes that publication bias is present.]

However, the authors caution readers that “the application of these methods has itself come under criticism, most strongly by Pek et al. (2022). Apart from definitional objections to calculating power post-hoc, they also bring up cautions about the theoretical assumptions of power analysis that are likely to be violated by properties of aggregated actual studies. These assumptions render diagnostic analyses based on observed statistical power imprecise, and “at best”, exploratory” (p. 261).

I was not aware of Pek et al.’s criticism of our method, and I was never asked to review any of their articles. So, let’s examine these arguments in open, post-publication review.

The Argument Against Post-Hoc Power Estimation

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding questionable research practices surrounding statistical power analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding questionable research practices in applied psychology (pp. 243-267). Springer International Publishing.

It is notable that Pek et al. published their criticism of our method, which can reveal questionable research practices, in a book on questionable research practices, with a title that implies our method is itself a questionable practice. That is of course fun when you write a chapter for a book and don’t have to defend yourself against the people you are attacking. However, it should not come as a surprise that the criticized authors feel motivated to respond and are not particularly motivated to be kind and polite in their response (it is also not my personality).

One way to evaluate the scientific value of an article is to examine the list of references that are cited and that are not cited. Notable omissions in Pek et al.’s article are Sterling’s observations of incredibly high success rates in psychology and his theoretical insight that success rates can be compared to the power of studies. They cite Simmons et al.’s famous “False Positive Psychology” article, but they do not mention that it led to widespread concerns that many published results in social psychology might be false positives. They do not mention that a celebrated replication project found that only 25% of replication attempts in social psychology produced a significant result, suggesting massive publication bias in original articles. Finally, the article does not discuss Motyl et al.’s coding of hundreds of research findings in social psychology journals or my z-curve analyses of their results (Schimmack, 2020). In short, a chapter in a book on questionable research practices does not mention any empirical evidence that publication bias is a serious concern in experimental social psychology. Even if our bias-detection method were flawed, that would not mean the conclusions are wrong, because there is ample convergent evidence that experimental social psychology has produced many false findings that do not hold up in actual replication attempts.

Pek et al. also (have to) admit that our method has been validated in extensive simulation studies, which is typically used as evidence that a statistical method works. So, what is Pek et al.’s criticism of our method to reveal publication bias and inflated success rates? Their only argument is that these simulations are not informative because real data are different from simulated data. I am not kidding you. Here is the quote:

“Simulation studies validate the performance of these methods on hypothetical data, but the separation between theoretical and empirical data raise questions about the valid performance of these methods on collected data” (p. 28).

There are two problems with this argument. First, stating the obvious fact that simulated data are different from real data does not explain why real data would not produce the same results as simulated data. If we simulate data with 50% power and only select the significant results, we are going to see a discrepancy between the success rate (100%) and observed power (75% when not corrected for selection bias, Schimmack, 2012; 50% when using a correction model, Bartoš & Schimmack, 2022). Pek et al. do not explain how real data can produce a success rate of 100% when our model estimates only 50% power. They would have to argue that our method underestimates power, but they do not provide any reasonable explanation why our method produces correct estimates with simulated data and underestimates power with real data.
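The 100% versus 75% versus 50% figures can be reproduced with a few lines of simulation (a sketch of the selection scenario only; the corrected 50% estimate requires the z-curve selection model, which is not reproduced here):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
crit = norm.isf(0.025)                          # 1.96

# 50% power in one direction means the true mean z equals the significance criterion.
z = crit + rng.standard_normal(1_000_000)
z_published = z[z > crit]                       # selection for significance: success rate = 100%

# "Observed power" computed from each published z-score (ignoring the negligible lower tail).
observed_power = norm.cdf(z_published - crit)
print(round(observed_power.mean(), 2))          # ~0.75, although the true power is 0.50
```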

The second problem is that they do not explain why replication studies of published findings so often produce replication failures. If real data have no publication bias and our method is flawed, we would expect high success rates in replication studies. However, if our method is correct and the power of real studies is much lower than success rates suggest, we would expect low replication rates, and that is exactly what we see in replication studies of social psychology.

In short, Pek et al.’s article lacks any scientifically valid criticism of our method, which is based on a simple mathematical necessity: if you conduct studies with 50% power, you can expect only 50% significant results, not 100% (Brunner & Schimmack, 2020; Sterling et al., 1995).

Despite the lack of any valid arguments, Pek et al. conclude that “results from such tests for QRPs cannot be definitive and remain, at best, exploratory,” which is then quoted verbatim by Giner-Sorolla et al. to discredit our method.

The attempt by (some) social psychologists to deflect criticism from their discipline to maintain a picture of integrity is comical, and few people may care because most of the results by social psychologists are harmless and void of real-world significance. However, every year thousands of students are eager to find out about social behaviors and pay for courses to learn what scientists have found out about social behavior. Universities and faculty are not interested in telling them the truth that decades have been wasted on bogus research. My students often hear for the first time from me about replication problems and how they can protect themselves from bad research. They are shocked to learn that only 25% of results can be replicated and welcome a statistical tool that can tell them whether results are honestly reported or not. Pek et al. have nothing to offer as an alternative, and they are not interested in an alternative. Their game is to discredit researchers who reveal shady practices. I know how I feel about academics who engage in questionable criticism of methods that can reveal questionable practices, but I don’t think I have to spell it out for you.

In conclusion, I am not an unbiased observer. I can only tell you what Pek et al. told their readers: “our method works well in a broad range of simulations” (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022). Their objection is that results based on simulated data differ from results with actual data in mysterious ways that make real data produce 90% significant results without the use of questionable practices to get significance. It is a free world and you are allowed to believe what you want to believe. Unfortunately, these beliefs also have real-world consequences for the world we live in. So, check your sources wisely. If you just read bullshit, you will believe bullshit.

Why you should not trust IAT researchers

An outdated idealistic concept of science is that scientists are trying hard to test their theories in empirical studies and revise false theories when studies do not confirm their predictions. In reality, scientists are human and act in accordance with social psychologists’ description of human information processing.

“Instead of a naïve scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (Fiske & Taylor, 1984).

So, a logical conclusion is that IAT researchers are charlatans because they are humans and humans are charlatans. More direct evidence for their untrustworthiness can be found in their publications (Schimmack, 2021). IAT researchers continue to conflate performance on an Implicit Association Test with measurement of implicit biases, although the wider community has rejected this view (Sherman & Klein, 2020). Even Greenwald and Banaji (2017) have walked back the original claim that the IAT probes implicit attitudes.

IAT researchers also continue to ignore valid criticism of their work. I feel compelled to write this blog post to highlight this blatant disregard of scientific criticism to promote a questionable computer task as an important tool to fight racism. A key claim is that changing scores on the race IAT is an important goal because these changes reflect changes in people’s attitudes that influence their behavior. This claim legitimizes simple and quick online studies that can be run with large samples but may have few practical consequences for the understanding of race relations and intergroup behaviors.

The latest propaganda piece by IAT researchers is Kurdi, Sanchez, Dasgupta, and Banaji’s (2023) article “(When) Do Counterattitudinal Exemplars Shift Implicit Racial Evaluations? Replications and Extensions of Dasgupta and Greenwald (2001)”.

I was actually a reviewer of a manuscript of this paper and made several critical comments that the authors blissfully ignored. Instead, they provide a misleading description of the history of studies that aim to change scores on the race IAT and omit relevant articles that do not fit their uplifting message that IAT research is thriving and making theoretical progress in the understanding of racism.

The Bullshit Story

The authors start with a highly cited article by Dasgupta and Greenwald (2001) that suggested showing some counter-attitudinal examples can change IAT scores and that these changes even last for a week.

They claim that “the article by Dasgupta and Greenwald (2001) set into motion what was soon to become a fundamental shift in our understanding of the nature of racial attitudes and, specifically, implicit racial evaluations.”

They claim that over the past 20 years, “a firm theoretical understanding has emerged that implicit evaluations, including implicit racial evaluations, can exhibit sizable temporary shifts toward neutrality in response to a wide range of interventions”

They cite Lai et al. (2014), who demonstrated that “implicit racial evaluations [a.k.a., IAT scores] were found to shift in response to a broad range of experimental manipulations.”

The authors note that it hardly seems necessary to conduct a replication study of Dasgupta and Greenwald’s study if there is robust evidence that IAT scores can be moved around. Their argument for doing so is based on the fact that the original article has been cited over 1,800 times in Google Scholar. However, a better reason is that Dasgupta and Greenwald’s study was the target of a large replication attempt well before large replication studies became fashionable in social psychology.

“Second, an independent replication today is timely given that a previous replication attempt published by Joy-Gaba and Nosek (2010) over a decade ago replicated the Dasgupta and Greenwald (2001) result with large samples but with considerably smaller effect sizes (Cohen’s ds = 0.17 and 0.14).” [The authors do not mention that this set of replication studies had over N = 3,000 participants compared to the N < 50 in the original study.]

The authors then spend a lot of time suggesting that population effect sizes may have decreased over time due to societal changes. They leave out the well-known fact that large effect sizes in small samples are often followed by small effect sizes in large samples because researchers with small samples require luck or flair to get significance with low statistical power.

Study 1 is a simple online study with N = 1,533 participants. Despite the large sample size, the study failed to replicate Dasgupta and Greenwald’s results and produced an effect size of d = 0.02. In other words, nada. Nothing to see here, a result very much consistent with Joy-Gaba and Nosek’s (2010) results in large online studies a decade earlier. 

Study 2 was also a bust, d = -0.04, and provided further evidence that effect sizes in Dasgupta and Greenwald’s underpowered study were inflated and that the true effect size is much smaller or zero (Joy-Gaba & Nosek, 2010).

Studies 3 and 4 once more confirmed that Dasgupta and Greenwald’s results could not be replicated, d = 0.15.

Although these replication failures are largely consistent with Joy-Gaba and Nosek’s results, the authors ignore this consistency.

“The failure to replicate the shifts in implicit racial evaluations observed by Dasgupta and Greenwald (2001) is puzzling for several reasons, the primary one being that other procedures, far less potent, have been shown to create malleability in implicit evaluations.” 

Studies 5-7 use different procedures to shift IAT scores. Although these results are interesting on their own, they do not explain why Dasgupta and Greenwald’s (2001) results could not be replicated.

In the General Discussion, the authors discuss the replication failures.

“In three high-powered (total N > 1,800) and close-to-exact replications, we failed to obtain the effect originally reported by Dasgupta and Greenwald (2001). That is, we found no reduction in pro-White/anti-Black implicit evaluations after exposure to positive Black and negative White exemplars (Experiments 1–3). Given the substantial amount of time that has elapsed since the original results were published, we can only make informed guesses about the reasons for the lack of replication.”

“At a first approximation, it is conceivable that the original result was a false positive, in which case one should expect replication attempts to yield null results. Contrary to this possibility, some of the experiments conducted as part of the only known previous independent replication attempt by Joy-Gaba and Nosek (2010) produced statistically significant results.”

The shift to statistical significance is problematic because the effect sizes in Joy-Gaba and Nosek’s studies were small and much closer to the zero effect sizes in this study than to the large effect sizes in the original study. Consistent with this finding, Joy-Gaba and Nosek’s article was titled “The surprisingly limited malleability of implicit racial evaluations”. Kurdi et al.’s results only show that the malleability is even more limited than in Joy-Gaba and Nosek’s studies, but the fact remains that Dasgupta and Greenwald’s results do not provide a solid empirical foundation for interventions that can reduce racial biases.

The authors then try to sell the hypothesis that Dasgupta and Greenwald’s study in a small sample miraculously produced a precise estimate of the population effect size and that population effect sizes have really decreased over time.

“As such, we believe that it is more likely that the effect originally obtained in the late 1990s decreased in size over time, both between 2001 and 2010 and between 2010 and 2023.”

If you believe this, you probably also believe in Santa Claus and the Immaculate Conception.

They then suggest that the difference between lab and online studies could contribute to the different findings. They may not have read Joy-Gaba and Nosek’s article or simply failed to mention the fact that Joy-Gaba and Nosek tested this hypothesis.

“Experiment 3 was a direct replication of Experiments 2a and 2b across different settings (Internet or laboratory) and samples (undergraduate participant pool or heterogeneous volunteers). Students in the participant pool at the University of Virginia completed the study either online or in the laboratory…. As shown in Figure 2, a 2 (Condition) × 3 (Sample) ANOVA revealed no main effect of Condition, F(1, 1178) = .76, p = .74, d = .05, and no interaction between Condition and Sample, F(2, 1177) = .04, p = .96, d = .01.” (Joy-Gaba & Nosek, 2010, Study 3).

To put a cherry on top, the authors totally ignore that one of Dasgupta and Greenwald’s amazing findings was that the manipulation seemed to last a full day.

“Results revealed that exposure to admired Black and disliked White exemplars significantly weakened automatic pro-White attitudes for 24 hr beyond the treatment” (Dasgupta & Greenwald, 2001, abstract).

The authors did not even attempt to replicate this important finding. They also fail to mention that Lai et al. (2016) found that none of the manipulations that had immediate effects on IAT scores produced changes several days later. And they did not examine whether their successful manipulations in Studies 5-7 produced lasting effects.

In sum, this article makes no scientific contribution to the understanding of racism and ways to reduce it. Instead, it is a glaring piece of evidence why you shouldn’t trust IAT researchers. Of course, whether you trust them or me is up to you. It is a free world and there are no ethical guidelines that regulate publications. There is the illusion that peer-review corrects mistakes, but authors can get away with bullshit if editors let them.

In conclusion, producing lasting changes on IAT scores is hard and there is no solid evidence that it is possible. It is also not clear why this would be important because scores on the race IAT are messy measures of consciously accessible racial biases that do not predict behavior (Schimmack, 2021). The notion of implicit bias is unscientific, lacks empirical support, and implicit bias training has been ineffective or even harmful. It is time to fund research that studies real behaviors of discrimination and to stop wasting time on reaction times in online studies (Baumeister, Vohs, & Funder, 2007).

Meta-Science vs. Meta-Physics: How many false discoveries are there?

The goal of empirical sciences is to rely on observations to test theories. As it is typically not possible to observe all relevant phenomena, empirical sciences often rely on samples to make claims that are assumed to generalize to all observations (i.e., the population). The generalization of results from samples to populations works pretty well when the phenomenon is clear (e.g., it is easier to see things during the day than at night, older people are more likely to die than younger people, people who have sex are more likely to have children, etc., etc.). It becomes more difficult to draw the correct conclusions about populations when samples are small and the relationship between two variables is less obvious (e.g., does eating more vegetables lead to a longer life, does a subliminal presentation of a coke bottle make people drink more coke, does solving crosswords reduce the risk of dementia, etc., etc.). When empirical data are used to answer non-obvious questions, it is possible that a single study will produce the wrong answer. There are many names for these false answers, such as Type 1 errors (Neyman & Pearson, 1928), False Positives (Simmons, Nelson, & Simonsohn, 2011), False Relationships (Ioannidis, 2005), or False Discoveries (Soric, 1989). I will use the term False Discoveries because it is used most commonly in the literature that discusses the risk of false discoveries (Bartoš & Schimmack, 2022).

Whether a study produced a false discovery depends on many factors. Most important, it depends on the specification of the hypotheses that are being tested. For example, a study might find that exercising more than 1 hour a week extends life expectancy by 219 days. The researchers conclude that the population effect size exactly matches their estimate in the sample. This claim is unlikely to be true because sampling error will produce different estimates of the population effect size. Thus, the risk that the discovery is false is very high and practically 100%. It is well known among statisticians that point-estimates of point-hypotheses are virtually always false. The solution to this problem is to propose a range of values. The wider the range of values, the more likely it is that the discovery is true. For example, based on their finding of a difference of 219 days, researchers might simply claim that exercise has a positive effect on life-expectancy without saying anything about the magnitude of the effect. It could be 1 day or several years. This conclusion is more likely to be true than the conclusion that the effect is exactly 219 days. The problem with this conclusion is that it does not tell us how big the benefits are. Few people might push themselves to exercise regularly if the benefit is 1 day. More people might be willing to do so if the effect is 1 year or more. While precise estimates of effect sizes are desirable, it is often not possible to get more precise estimates because sampling error is too large.

Continuing with the example of exercise, a difference of 219 days is a very small difference compared to the large variability in mortality. Assume that the average life-expectancy is 80 years and that the standard deviation is 10 years. Accordingly, 95% of people die within an age range of 60 to 100 (the real distribution is skewed, but that is not relevant for this example). Compared to the natural variability in life expectancy, 219 days is a very small difference. 219 days are 0.60 years, and 0.60 years are 0.060 standard deviations. Effects of this magnitude are statistically small, and it would require large samples to provide evidence that there is a positive effect of exercise. A sample size of N ~ 4,500 participants is needed to reduce sampling error to .03, which yields a statistically significant result with the standard criterion for statistical significance, z = .06/.03 = 2.00, p = .046.
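The arithmetic behind the N ~ 4,500 figure can be written out explicitly. The sketch below assumes a two-group comparison with equal group sizes, where the standard error of a standardized mean difference is roughly 2 / sqrt(N); the text does not specify the design, so treat that as my assumption.

```python
import math

d = 0.06          # 219 days relative to a standard deviation of 10 years
se = 0.03         # the sampling error the text aims for

# With SE(d) ~ 2 / sqrt(N) for two equal groups, reaching se = .03 requires N = (2 / se)^2.
n_total = (2 / se) ** 2
z = d / se
p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))   # two-sided p-value for z = 2.0

print(round(n_total), round(z, 2), round(p, 3))        # ~4444 participants (~4,500), z = 2.0, p = .046
```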

In conclusion, while we would like to know the direction and magnitude of effects in populations, empirical studies are often unable to provide this information. Therefore, scientists settle for answers that their studies can provide. At a minimum, scientists try to determine the sign of a relationship. For example, researchers have studied whether having children INCREASES or DECREASES happiness, or whether there are sex differences in hundreds of traits. While the magnitude of differences is important, the first question is often whether there is a relationship between two variables, and the most common conclusion drawn in empirical articles is that there is a positive or negative relationship. Thus, the most common false discoveries are claims where the results of a study lead to the conclusion of a positive relationship when the population relationship is not positive, or to the conclusion of a negative relationship when there is no negative relationship in the population.

There is a lot of confusion among statisticians and users of statistics about the hypotheses that are being tested in empirical studies. The confusion arises from the fact that most studies test two hypotheses simultaneously with a single test (Hodges & Lehmann, 1954). Take having children and happiness as an example. There are many reasons why children could increase or decrease happiness, and the overall effect in the population is unclear. To test the hypothesis that children increase happiness, we can test the hypothesis that the difference (happiness of people with children minus happiness of people without children) is positive, d > 0. To test the hypothesis that children make people less happy, we can test the hypothesis that the difference score is negative, d < 0. To test both hypotheses simultaneously, we can test the hypothesis that there is no difference, d = 0, and a statistically significant result allows us to reject the d = 0 hypothesis in favor of d > 0 if we observe a positive difference and in favor of d < 0 if we observe a negative difference.

It is a misunderstanding that rejecting the hypothesis d = 0 only tells us that there is an effect and does not tell us anything about the direction of the effect. When we accept d > 0, we are not only rejecting d = 0, but also rejecting d < 0. And when we accept d < 0, we are not only rejecting d = 0, but also d > 0. 
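The two-sided test and the directional conclusion can be combined in a simple decision rule. The sketch below is illustrative (the effect sizes and standard errors are made up); it shows the three possible outcomes and makes explicit that a significant result in the wrong direction is a sign error.

```python
from scipy.stats import norm

def directional_decision(d_hat, se, alpha=0.05):
    """Two-sided test of d = 0 used to choose among three conclusions:
    d > 0, d < 0, or no conclusion about the sign."""
    z = d_hat / se
    p = 2 * norm.sf(abs(z))
    if p >= alpha:
        return "no conclusion about the sign"
    return "conclude d > 0" if d_hat > 0 else "conclude d < 0"

# Hypothetical estimates and standard errors.
print(directional_decision(0.20, 0.08))    # conclude d > 0
print(directional_decision(-0.20, 0.08))   # conclude d < 0
print(directional_decision(0.05, 0.08))    # no conclusion about the sign
```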

Any decision in favor of a hypothesis implies that there is a risk that the decision was wrong. We may conclude that children increase happiness, when in fact there is no effect or a negative effect or we may conclude that children decrease happiness when having children has no effect on happiness or increases happiness.

In conclusion, empirical studies aim to provide information about populations based on information in samples. Sampling error can distort the results in samples and lead to false conclusions. Conclusions about the magnitude of effects require large samples. Empirical studies, especially of new questions, often settle for conclusions about the direction of an effect. These conclusions can be false when a significant result in a sample suggests a relationship in one direction, but there is no relationship or a relationship in the opposite direction in the population.

What is statistical significance?

Without going into the details of significance testing, it is necessary to point out that scientists have control over the risk of a false discovery. As noted above, it is riskier to make precise predictions about an effect size (e.g., exercise increases life-expectancy by 100 to 300 days) than to draw conclusions about the direction of an effect (exercise increases longevity by 1 day or more).

Another way to control the risk of a false discovery is to reduce sampling error. It is sometimes argued that scientists could also increase effect sizes, but that is not always possible. How would we increase the effect size of having children on happiness? The easiest and sometimes only way to decrease sampling error is to increase sample sizes. As noted above, 4,500 people are needed to get a sampling error of .03 and produce a statistically significant result when the effect size is .06 standard deviations and we require a p-value below .05 to conclude that exercise has a positive effect on longevity. What does this 5% criterion for statistical significance mean? It means that if we repeat the study over and over again with 4,500 new people, we would get p-values below .05 no more than 5% of the time when exercise has no effect. When the effect size is exactly 0, we would expect half of the significant results to be negative (and give us the false information that exercise decreases life-expectancy) and half of the results to be positive (false discoveries of positive effects in the population).

Some statisticians have argued that the 5% criterion is too liberal in these days when so many scientists are testing many hypotheses and have computers to run many statistical tests. This might lead to many false discoveries. A simple solution to this problem is to lower the criterion value, which is done in some fields. For example, particle physics uses p < 0.0000006 as a criterion to avoid false discoveries (e.g., the discovery of the Higgs boson particle). Molecular geneticists also use a similarly low criterion when they search for genetic variants that predict risks for depression or other mental illnesses. However, there is a cost to lowering the risk of a false discovery. A lower criterion value requires even larger samples to produce significant results. For example, N ~ 9,000 participants would be needed to obtain significance with an effect size of d = 0.06 and the criterion of p < .005 that some statisticians have suggested (Benjamin et al., 2017).
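Using the same two-group approximation as in the earlier sketch (again my assumption about the design), the stricter criterion raises the critical z-value from 1.96 to about 2.81 and roughly doubles the required sample size:

```python
from scipy.stats import norm

d = 0.06
z_crit = norm.isf(0.005 / 2)       # ~2.81 for a two-sided criterion of p < .005
se_needed = d / z_crit             # ~0.021: the sampling error that just reaches significance
n_total = (2 / se_needed) ** 2     # SE(d) ~ 2 / sqrt(N) for two equal groups

print(round(z_crit, 2), round(se_needed, 3), round(n_total))   # 2.81, 0.021, ~8755 (rounded up to ~9,000 in the text)
```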

In sum, scientists can control the risk of false discoveries by demanding more or less empirical evidence for a claim. While demanding stronger evidence reduces the risk of false discoveries, it also limits the ability to make true discoveries with limited resources when effect sizes are small. Ideally, researchers would set their criterion for statistical evidence to maintain a low risk of false discoveries while allowing for as many true discoveries as possible. To do so, they need information about the risk of false discoveries.

How Many False Discoveries Do Scientists Make?

As explained before, to determine whether a discovery is true or false requires a comparison of the results in a sample to the effect in the population, and population effects are by definition not observable. Not surprisingly, discussions of false positive rates resemble discussions among Medieval philosophers about how many angels can dance on the head of a pin.

A popular claim is that the nil-hypothesis is practically never true because effect sizes are never exactly zero. Thus, rejecting the hypothesis that the effect size is exactly zero is never wrong and there are no false discoveries. This argument ignores that researchers do not just reject the null-hypothesis, but draw inferences about the sign of the effect. Thus, sign errors are possible. Moreover, many effect sizes can be so small that they are practically zero. For example, if exercise extends life expectancy by 1 day, the standardized effect size is 0.0003 standard deviations (SD = 10 years). While this value is not zero, it is practically impossible to demonstrate that it is different from zero. Thus, the argument that false discoveries never occur is logically invalid.

The argument also sounds implausible in the face of concerns that many published results might be false discoveries because researchers use statistical tricks to produce significant results. Simmons et al. (2011) showed with simulations that these tricks could produce false discoveries in over 50% of studies. A large number of replication failures in some scientific disciplines (e.g., experimental social psychology) has also led to widespread concerns that false discoveries are common (Open Science Collaboration, 2015). Finally, Ioannidis speculated that some areas of medicine have false discovery rates over 50% because they test thousands of false hypotheses.
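
Simmons et al. combined several such tricks in their simulations. As an illustration, the sketch below (mine, not theirs) simulates just one of them, optional stopping, and shows how it inflates the rate of significant results when there is no effect at all; the function name p_hack is hypothetical.

# Minimal sketch of one questionable research practice: optional stopping.
# A researcher studying a null effect peeks after every 10 participants per
# group (up to 100 per group) and stops as soon as p < .05.
set.seed(42)
p_hack <- function(max_n = 100, step = 10) {
  x <- y <- numeric(0)
  for (n in seq(step, max_n, by = step)) {
    x <- c(x, rnorm(step))                        # both groups drawn from H0
    y <- c(y, rnorm(step))
    if (t.test(x, y)$p.value < .05) return(TRUE)  # stop at first significance
  }
  FALSE
}
mean(replicate(2000, p_hack()))  # well above the nominal 5% (around .15 to .20)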

I call these speculations meta-physical because they rely on unproven assumptions, and the wide variation in conclusions can be traced to the variation in assumptions. These speculations are not very helpful for applied researchers who want to reduce the risk of false discoveries while still being able to make discoveries with limited resources.

It is not possible to estimate the false discovery risk for a single study, but it is possible to do so for sets of studies from the same field. There have been numerous attempts to estimate the false discovery risk empirically, but simulation studies have shown problems with the underlying models. The main problem with these models is that they try to estimate the rate of false discoveries. This is impossible because very small effect sizes cannot be distinguished from each other, especially in small samples. Nobody knows whether the population effect size is d = -.05, 0, or .05. Once more, metaphysical assumptions are needed to make these models work, but the assumptions influence the conclusions. To avoid this problem, Bartos and Schimmack (2022) relied on Soric's (1989) approach to estimate the maximum false discovery rate. The maximum false discovery rate is a function of the discovery rate, and the discovery rate is the percentage of significant results in a set of studies.
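
In formula form, Soric's bound is FDR.max = ((1 - DR) / DR) * (alpha / (1 - alpha)), where DR is the discovery rate. A minimal R implementation (the function name is mine):

# Soric's (1989) maximum false discovery rate for a given discovery rate (DR),
# assuming that true hypotheses are tested with 100% power.
soric_fdr <- function(dr, alpha = .05) {
  ((1 - dr) / dr) * (alpha / (1 - alpha))
}
soric_fdr(.30)  # ~0.12: a 30% discovery rate caps the FDR at roughly 12-14%
soric_fdr(.70)  # ~0.02: a 70% discovery rate would cap the FDR at about 2%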

Schimmack and Bartos (2023) applied their model to abstracts of medical journals that reported the results of clinical trials. They found that 70% of abstracts reported a significant result. However, they also found evidence of selection bias. After correcting for selection bias, they estimated a discovery rate of 30% and a false discovery risk of 14%.

Another dataset that can be used to examine false discoveries in clinical trials are Cochrane Reviews. Cochrane reviews are meta-analyses of clinical trials. There are many types of clinical trials and some meta-analyses are based on fewer than 10 studies. However, there are sufficient Cochrane Reviews that have 10 or more studies and examine the effectiveness of a treatment against a placebo (194 reviews with 4,394 studies). All studies were scored so that positive values indicate the same sign as the meta-analytic z-score and negative values indicate a sign error. There were 14.5% sign errors, but only 0.75% significant sign errors (z < -1.96). Thus, we have a very low estimate of the false discovery rate in clinical trials with a placebo control condition.
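
The scoring logic can be expressed in a few lines of R; the numbers below are hypothetical placeholders, not the Cochrane data.

# Hypothetical illustration of the scoring: align each study's z-score with the
# sign of the meta-analytic z-score, then count sign errors.
study_z <- c(2.3, -0.4, 1.1, 3.0, -2.1)  # made-up z-scores from one review
meta_z  <- 1.8                           # made-up meta-analytic z-score
aligned <- study_z * sign(meta_z)        # positive = same sign as meta-analysis
mean(aligned < 0)                        # proportion of sign errors
mean(aligned < -1.96)                    # proportion of significant sign errors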

Figure 1 shows a histogram of the z-scores and the fit of the z-curve model to the data. Positive and negative values are shown in the figure, but only positive values are used to fit the model.

The model fits the positive values well, but it underestimates sign errors. The most plausible explanation for this is that there is heterogeneity in effect sizes. However, the difference is relatively small. Due to the larger-than-predicted number of sign reversals, the observed discovery rate (39%; i.e., the percentage of significant results in the right direction) is lower than the expected discovery rate (44%). Thus, there is no evidence of selection bias that produces more significant results than predicted by the model.

According to Soric's formula, an EDR of 44% implies a maximum false discovery rate of 7%. This is much higher than the observed false discovery rate of 0.75%. The reason is that Soric's model assumes that true hypotheses are tested with 100% power. If average power is lower, the false discovery risk decreases. It is possible to obtain a better estimate of the false discovery risk by taking the expected replication rate into account. The expected replication rate is 62%. It is a mixture of false discoveries, which produce a significant result in the same direction with only a 2.5% probability, and true discoveries, which replicate with their average power. Assuming that 7% of significant results are false discoveries implies an average power to detect true positive effects of 67%. With this power estimate, the false discovery risk decreases to 4.5%. This is still considerably higher than the actual false discovery rate of 0.75%. Thus, z-curve estimates of the false discovery risk are conservative and likely to overestimate the actual rate of false discoveries.

Z-curve also makes it possible to estimate the proportion of true and false hypotheses that were tested. Once more, these estimates rely on the observed discovery rate and assumptions about the average power of studies that test true hypotheses. Figure 2 shows the percentage of true and false hypotheses for different levels of power. With 80% power, the ratio of true and false hypotheses is 1:1. With the estimated power of 67%, the ratio is about 3:2. Thus, there is no evidence that clinical trials test many more false than true hypotheses.
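
For readers who want to retrace these numbers, here is a hedged sketch of the arithmetic in R; the formulas follow the logic described above, and the variable names are mine.

# Back-of-the-envelope version of the calculations described above.
alpha <- .05
edr   <- .44                                          # expected discovery rate
err   <- .62                                          # expected replication rate
fdr_max <- ((1 - edr) / edr) * (alpha / (1 - alpha))  # Soric bound, ~.07
# ERR mixes false discoveries (replicating in the same direction with
# probability alpha/2) and true discoveries (replicating with average power):
power <- (err - fdr_max * alpha / 2) / (1 - fdr_max)  # ~.66, close to 67%
# Share of false hypotheses among all tested hypotheses, given EDR and power:
pi_false <- (power - edr) / (power - alpha)
pi_true  <- 1 - pi_false
pi_false * alpha / edr                 # refined false discovery risk, ~.04
c(true = pi_true, false = pi_false)    # proportions of hypotheses tested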

Conclusion

Concerns about high false discovery rates in science have produced many articles that ask for a new statistical approach to analyze data. Many of these articles are based on a misguided interpretation of significance testing and false assumptions about the number of false hypotheses that are being tested.

I showed that significance testing can be used to draw inferences about the direction of an effect in the population from a significant result in a sample. These directional inferences can be justified by interpreting the two-sided test of the null-hypothesis as two one-sided tests of the null-hypotheses that the effect is less than or equal to zero (<= 0) or greater than or equal to zero (>= 0). While significance criteria control the rate of false discoveries in all tests, it remains unclear how many of the significant results are false discoveries. However, it is possible to estimate the false positive risk in sets of studies based on the percentage of significant results in all tests (i.e., the discovery rate). When no publication bias is present, the discovery rate can be directly observed. When selection bias is present, the discovery rate can be estimated using a model that takes selection bias into account (Bartos & Schimmack, 2022). I demonstrated that this approach produces conservative estimates of the false discovery risk and that the false discovery risk in clinical trials that test effectiveness against a placebo is very low. Thus, concerns about the credibility of clinical trials are based on wild speculations that are, ironically, not backed up by evidence. However, these results cannot be generalized to other disciplines. Claims about false discoveries in these areas need to be supported by empirical evidence, and z-curve provides a valuable tool to provide this information, especially when selection bias is severe (e.g., experimental social psychology; Schimmack, 2020).

A Comparison of False Discovery Rates and Type-S Error Rates

Applied researchers who have substantive research questions (e.g., is there a wage gap between men and women?) rely on statistical methods that were invented over 100 years ago. This statistical approach is known as Null-Hypothesis-Significance-Testing (NHST), and every year thousands of new students in the social sciences are introduced to NHST to understand published articles and to conduct their own research. Thus, NHST is as fundamental to quantitative research as microscopes are for biologists and telescopes are for astronomers. However, unlike builders of microscopes and telescopes, many statisticians believe that NHST is flawed and needs to be replaced, ideally with their own statistical approach. After all, statisticians are human and are influenced by the same incentives that motivate other scientists. The holy grail is to replace NHST and to become the father (most of these driven statisticians are male) of the new statistics.

The religious fervor is especially notable among statisticians who call themselves Bayesians. Like atheists who only have the belief that there is no God in common, hatred of NHST is the only common element among Bayesians, which explains why they have failed to provide a coherent alternative to NHST. A prominent example of the religious zeal of Bayesians is Andrew Gelman's blog. For example, a recent blog post was titled "Bayesians moving from defense to offense."

Criticism of NHST is as old as NHST itself and is often cited. However, there have also been articles in support of NHST that are less well known and often neglected, especially by Bayesian critics of NHST. One article that made a lot of sense to me was Tukey’s (1991) defense of NHST that I featured on my blog (Schimmack, 2019).

The article makes it clear that there is a fundamental misunderstanding of the hypotheses that are typically tested in NHST. Take the wage gap as an example. NHST would test this question by postulating the null-hypothesis, which is typically the assumption that there is no wage gap. This hypothesis is typically a strawman and nobody believes it to be true. The only reason to postulate the null-hypothesis that there is no wage gap is to use empirical data to falsify or reject it, when the data provide sufficient evidence that the hypothesis is false. This happens when the observed difference in wages is about twice the size of the sampling error, where sampling error reflects the amount of variability in the wage differences that are obtained in a random sample. This estimate of the real wage gap in the population will vary from sample to sample. However, sampling error alone is unlikely to produce differences that are twice the size of the sampling error. In fact, we can state that a wage gap that is twice the sampling error or more will occur in only 5% of all attempts if there is no wage gap in the population. And when this happens, p-values below .05 allow researchers to reject the null-hypothesis that the wage gap is zero and conclude that there is a wage gap.

This version of NHST can be easily criticized. For starters, what is the point of collecting data just to reject a hypothesis that nobody believed anyway? The best defense of this practice would be that we still need empirical data to be sure. After all, there is a 1/1000000 probability that the null-hypothesis might be true. Fair enough, but that only leads to the next problem. We really do not care about the conclusion that there is a wage gap because the immediate next question is the direction of the difference. Do men earn more than women or do women earn more than men? Strictly speaking, the rejection of the null-hypothesis that wages are exactly the same does not allow us to make claims about the direction of the effect. Does that mean we need to do another study to test this hypothesis, and if so, how would we analyze the data to do that if NHST doesn't allow us to draw inferences about the direction of an effect?

To use NHST to draw inferences about the direction of an effect, we need to test a directional hypothesis. In our example, we might specify H0 as the presumably false hypothesis that women earn the same as or more than men, and use the statistical significance criterion to reject this hypothesis when our sample shows that men earn more than women and the difference is significant, p < .05. When we test this directional hypothesis, we actually only need an effect size that is 1.65 times larger than sampling error to reject the null-hypothesis that women earn the same as or more than men. The use of NHST with directional hypotheses is known as one-tailed or one-sided testing.

With one-sided tests, p < .05 quantifies the risk of drawing a false conclusion about the sign of an effect. That is, the results of a study might show that men earn more than women and the p-value is below .05, but in the population there is no wage gap (zero difference) or women actually earn more than men. NHST does not differentiate between an outcome where the difference is exactly zero (equal pay) and an outcome where the difference is in the opposite direction (women earn more than men). Both outcomes are errors because the study suggested that men earn more when this is not the case.

The risk of drawing a false conclusion in NHST is called a Type-1 error. The statistical test produced the wrong result because sampling error produced a very unlikely outcome (e.g., the sample included many unemployed men). The 5% criterion is a conventional criterion to ensure that no more than 5% of statistical tests show significance when the null-hypothesis is true. This would be the same as using rapid Covid tests that have a 5% probability of showing a positive result when you are actually negative. Statisticians have also debated the 5% criterion, but NHST allows researchers to use other criterion values. Thus, for the discussion of NHST and criticisms of NHST, it is not important which Type-1 error rate we find acceptable, and I will continue to use the typical 5% value.

Now you may wonder what you can do when you are not sure about the sign of an effect (e.g., are men or women more extraverted?). It would be silly to make up a directional hypothesis just to falsify it and then conclude that the opposite is true. Moreover, you might find a significant result in the correct direction if you specified the null-hypothesis correctly, but you would not find significance if you picked the wrong hypothesis. That makes the procedure rather arbitrary and useless. Fortunately, there is a solution to this problem. You just do two one-sided tests. First, you try to reject the hypothesis that "men are as extraverted as or more extraverted than women," and then you try to reject the hypothesis that "women are as extraverted as or more extraverted than men." If you obtain significance for one of these tests, you can infer that the alternative hypothesis is true. That is, if your sample shows that men are more extraverted than women and p < .05, you are allowed to infer that men are more extraverted than women in the population. If your sample shows that women are more extraverted than men and p < .05, you are allowed to infer that women are more extraverted than men in the population. So, you can generalize the sign of the effect in your sample to the population.

However, there is a catch. You tested two hypotheses, and the Type-I error risk increases each time you test a new hypothesis. A 5% error rate implies that one in every 20 tests of a false hypothesis will produce a Type-I error in the long run. So, if you conduct two one-sided tests with the traditional 5% risk of a Type-1 error, your risk is actually 10%. Fortunately, there is a simple solution to this. You can lower the Type-I error risk of the directional tests. If you cut the risk in half, you have a 2.5% risk of making a Type-1 error in one direction and a 2.5% risk of a Type-1 error in the opposite direction, and your combined risk of making an error in either direction is 5%.

Maybe you already realized it, but conducting two one-sided tests with 2.5% Type-I error rates is identical to conducting a two-sided test with a 5% Type-I error rate. That is the essence of Tukey's defense of NHST. While it looks as if we are testing an implausible null-hypothesis that there is no wage gap, we are really testing whether men earn more than women or women earn more than men. From a significant result (p < .05) with higher pay for men than for women in a sample, we are allowed to infer that women earn less than men in the population under investigation. Rather than refuting a silly nil-hypothesis (Cohen, 1994), NHST is a statistical tool to draw inferences from the direction of an effect in a sample about the direction of the effect in the population.
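
A short simulation makes the equivalence concrete (a minimal sketch, assuming z-tests under the null-hypothesis):

# Under H0, a two-sided test at alpha = .05 rejects exactly when one of the two
# one-sided tests at alpha/2 = .025 rejects.
set.seed(123)
z <- rnorm(1e6)                                      # test statistics under H0
two_sided <- abs(z) > qnorm(.975)                    # two-sided test, alpha = .05
one_sided <- (z > qnorm(.975)) | (z < qnorm(.025))   # two one-sided tests at .025
all(two_sided == one_sided)                          # TRUE: identical decisions
mean(two_sided)                                      # ~.05 Type-1 error rate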

Unfortunately, Tukey's (1991) insight that a two-tailed test is really a convenient way to conduct two one-sided tests for significance in both directions is often ignored by critics of NHST. Gelman introduced sign errors as an alternative to dumb NHST under the assumption that NHST is only used to reject the hypothesis that an effect size is zero, but is never used to test the direction of an effect. This is a misrepresentation of the way NHST is used, especially in clinical trials. Nobody would argue that a treatment is beneficial if the p-value is below .05 but the results show more benefits in the placebo condition. The problem with Gelman's criticism of NHST is that it is a criticism of dumb NHST and not of NHST as it is used in practice. In reality, researchers follow Tukey's (1991) logic and use NHST to test two directional hypotheses simultaneously. Thus, Type-1 errors include sign errors. Type-1 errors occur when the sign of a significant result is different from the population effect size, including population effect sizes of zero.

The problem with Gelman's Type-S error is that Gelman ignores the possibility that the population effect size can be zero. Gelman's Type-S error is not defined when the population effect size is zero (personal communication, Gelman, 2024). The omission of the classic Type-1 error (rejecting the point-null or nil-hypothesis) is difficult to justify. The main justification for this assumption is that most population effect sizes are unlikely to be exactly zero. For example, the wage gap between men and women is unlikely to be less than 1/100000000 dollars. However, what about time-reversed causality and extrasensory perception (Bem, 2011)? Gelman clearly does not believe in ESP, even with effect sizes of 0.00000001 standard deviations. He is also often critical of other findings like ovulation effects on preferences for Obama. He might argue that these effect sizes are much smaller than the inflated estimates in small samples, but not zero. However, why couldn't some effects be zero or so close to zero that we don't care about the effect size?

Assuming that there are no zero effects can easily explain why vanZwet et al. (2023) ended up with an estimate that only 2% of significant results in clinical trials are false, whereas Schimmack & Bartos (2023) estimated that up to 14% of significant results could be false. The difference might be due to the fact that estimates of the false discovery rate include significant results with an effect size of zero as errors, whereas the type-S error assumes that these errors do not exist.

Unfortunately, we cannot quantify the proportion of population effect sizes that are exactly zero because sampling error always results in imprecise estimates of the population effect size. A solution to this problem is rounding. At some point, we simply do not care about a difference from 0. For example, a wage gap of $0.00001 is practically the same as a wage gap of 0 dollars. Even a difference of $0.01 (1 cent) is meaningless. Rounding has obvious implications for the estimation of Type-S and FDR rates. Let's say a set of 1,000 clinical trials (some would argue homeopathy trials fit the bill) have effect sizes that are close to zero (less than 1/1000th of a standard deviation) but not exactly zero. Rounding would turn these minuscule effects into zero effects, and findings in either direction would be considered false positives. However, Type-S errors are cut in half because significant results with the same sign are not considered errors even if the effect size is a fraction of a cent.

In conclusion, the interpretation of NHST as two one-sided tests with alpha/2 implies that sign errors are less likely than Type-1 errors and that the percentage of sign errors among significant results (Type-S error rate) is always lower than the false discovery rate (i.e., the percentage of false positives among significant results).

The real question is whether this difference is large enough to explain the difference between vanZwet et al.’s (2023) results and other studies that estimated the FDR (Jager & Leek, 2014; Schimmack, 2023). I had to resort to simulation studies to find the answer and I am happy to share the results of this simulation (r-code on OSF).

The simulation is based on Ioannidis's scenario for underpowered but well-performed clinical trials with 1 true hypothesis for every 5 false hypotheses and low power (20%). To simplify the simulation, it did not include bias. The false hypotheses were simulated with a small standard deviation of population effect sizes (SD = .01). This ensures that none of the false hypotheses are strictly zero, but effect sizes are close to zero (d = -.05 to .05).

Figure 1 shows the distribution of effect sizes for a between-subject design with N = 100 (50 per group).

This simulation produces equal Type-S and FDR estimates of 24%. The reason these estimates are the same is that the conditions "<" and "<=" are equivalent when no population effect sizes are exactly zero.

Figure 2 shows the same simulation, but effect sizes are rounded to one decimal. As a result, the small effect sizes around 0 all become zero.

In this scenario, the FDR is 53% and the Type-S error rate is only 0.4%. 
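
The r-code for the actual simulation is on OSF; the simplified stand-in below follows the same recipe and shows the same qualitative pattern, although the exact percentages will differ slightly.

# Simplified stand-in for the simulation described above (not the OSF script).
set.seed(1)
n_sim <- 1e6
se <- 2 / sqrt(100)                         # sampling error of d, N = 100 (50/group)
d_true <- (qnorm(.975) + qnorm(.20)) * se   # effect size with 20% power (~0.22)
is_true <- runif(n_sim) < 1/6               # 1 true hypothesis per 5 false ones
d <- ifelse(is_true, d_true, rnorm(n_sim, 0, .01))
z <- d / se + rnorm(n_sim)                  # observed z-scores
sig <- abs(z) > qnorm(.975)

# Scenario 1: no rounding. No effect is exactly zero, so the FDR that counts
# zero effects and sign errors equals the Type-S error rate (roughly .2-.3).
mean(sign(z[sig]) != sign(d[sig]))

# Scenario 2: round effect sizes to one decimal. Tiny effects become zero, the
# FDR rises sharply (~.5) while the Type-S error rate drops well below 1%.
d_r <- round(d, 1)
mean(d_r[sig] == 0 | sign(z[sig]) != sign(d_r[sig]))   # FDR
mean(sign(z[sig]) != sign(d_r[sig]) & d_r[sig] != 0)   # Type-S error rate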

In sum, the criticism of dumb NHST is that it produces no sign errors because it only tests whether the effect size is exactly zero, which it rarely is. Second, the smart version of NHST (Tukey, 1991) tests directional hypotheses and treats both sign errors and incorrect rejections of the point-null hypothesis as errors. Third, Gelman's Type-S error counts only sign errors and therefore underestimates error rates when many effect sizes are practically zero. Thus, it is a mistake to exclude zero effect sizes from the computation of false discovery rates; Type-S error rates underestimate error rates when some studies have an effect size that is practically zero.

The second simulation used vanZwet et al.'s (2023) model parameters to compute the FDR and to compare it to their estimate of the Type-S error of 2%. The model assumes that the distribution of effect sizes in the Cochrane data is a mixture of four normal distributions with mean 0 and standard deviations 0.61, 1.42, 2.16, and 5.64. The first three components are weighted about equally, .32, .31, and .30, and the last component is weighted less, .07.

Figure 3 shows the distribution of the implied effect sizes for N = 100.

Although the model fixes the mode of effect sizes at zero and effect sizes close to zero are the most frequent effect sizes, the probability of an exact value of zero is zero. Thus, the model assumes that there can only be sign errors. The simulation reproduced vanZwet et al.’s estimate that the Type-S Error Rate is 2%. As there are no exact zero values, the FDR is also 2%.

I then repeated the simulation with effect sizes rounded to one decimal. The distribution of effect sizes is not notably different (Figure 4).

However, 17% of the effect sizes are now zero. This increases the FDR to 4%, while the Type-S error decreases to 0.8 percent because zero effect sizes do not produce sign errors in Gelman’s formula. In short, these results show that effect sizes close to zero can produce a difference in calculations of Type-S and FDR, but this difference is relatively small and does not explain the differences between a Type-S error rate of 2% and an FDR of 14%.
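
A hedged sketch of this simulation (my code, not vanZwet et al.'s) illustrates these calculations.

# Draw signal-to-noise ratios from the mixture described above, convert them to
# effect sizes for N = 100, and compare Type-S and FDR with and without rounding.
set.seed(2)
n_sim <- 1e6
sds <- c(0.61, 1.42, 2.16, 5.64)                 # component SDs of the mixture
w   <- c(.32, .31, .30, .07)                     # component weights
mu  <- rnorm(n_sim, 0, sample(sds, n_sim, replace = TRUE, prob = w))  # true SNR
z   <- mu + rnorm(n_sim)                         # observed z-scores
sig <- abs(z) > qnorm(.975)

mean(sign(z[sig]) != sign(mu[sig]))              # Type-S error rate, ~.02

# Rounding the implied effect sizes (d = mu * 2/sqrt(100)) to one decimal turns
# effects with |mu| < .25 into exact zeros (~17% of all effects).
d_r <- round(mu * 0.2, 1)
mean(d_r == 0)                                         # ~.17
mean(d_r[sig] == 0 | sign(z[sig]) != sign(d_r[sig]))   # FDR, ~.04
mean(sign(z[sig]) != sign(d_r[sig]) & d_r[sig] != 0)   # Type-S, below .01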

The real reason for the different results is the different model assumptions. The same density distribution can be fitted with different models that make different assumptions about the underlying components. The key difference between vanZwet et al.'s (2023) model and other models is that vanZwet et al.'s model does not allow for a large cluster of effect sizes at or essentially at zero. In contrast, other models allow for a large proportion of effect sizes to be zero. Figure 5 shows the fit of vanZwet et al.'s model and z-curve (Bartos & Schimmack, 2022) to the Cochrane data. Z-curve was slightly modified to have only four components with means of 0, 2, 4, and 6. This modification models low z-scores as a mixture of studies with Type-I errors (z = 0) and modest power (50%), but does not allow for studies with power between 5% and 50%. This is of course an arbitrary assumption, but it is no more arbitrary than vanZwet et al.'s assumption that there are no effect sizes of exactly zero (p(z = 0) = 0).

Both models fit these data equally well. However, the z-curve model allocated a weight of 57% to the component with a mean of zero. This implies a false discovery rate of 10%. Thus, the same data are compatible with a Type-S error of 2% and an FDR of 10%.

This brings up the question which of these estimates is closer to the truth. The honest answer is that we do not know. At least, the distribution of z-scores alone does not provide this answer. I have tried for a year to find a way to estimate the true FDR, but simulation studies showed that it is just not possible to do so.

To avoid the problem of overly precise estimates that are based on unproven and untestable assumptions, Bartos and Schimmack (2022) suggested focusing on the false discovery risk. The false discovery risk is the maximum rate of false discoveries that is consistent with the data. Z-curve2.0 relies on the discovery rate to determine the false discovery risk using a formula developed by Soric (1989).

Soric's formula relies on the fact that the maximum false discovery rate for a given discovery rate occurs when all true hypotheses are tested with 100% power. With 29% significant results and no evidence of publication bias, the assumption is that the 29% significant results are produced by testing 25% true hypotheses with 100% power (25% significant results) and 75% null-hypotheses with a probability of 5% to produce a significant result (3.75% significant results).

We realize that it is implausible to assume that true hypotheses in clinical trials are tested with 100% power. However, we do not know the average power of tests of true hypotheses. Thus, we do not know the real FDR. The benefit of estimating the false discovery risk is that we can say that there are no more than 14% false positive results. In contrast, vanZwet et al.'s (2023) estimate of 2% sign errors is only correct when we assume that there are no effect sizes that are exactly zero and that there is only a small percentage of effect sizes close to zero. Thus, the problem with this result is that it depends on assumptions that may be false, whereas the FDR estimate of 14% is a worst case scenario. In this regard, the FDR is similar to the Type-1 error that is based on the worst case scenario that all false hypotheses have an effect size of zero. The risk of sign errors decreases the more effect sizes differ from zero. Not everybody might be happy with risk assessments that are conservative and based on worst case scenarios, but we think that this approach is useful to ensure the credibility of scientific results.

We also agree with Goodman (2014) that it is less interesting to know the FDR with the conventional criterion of .05 to reject the null-hypothesis. A more important question is how the alpha criterion can be adjusted to ensure an acceptable maximum percentage of false discoveries. With 29% discoveries, alpha can be set to .01 to produce a false discovery risk below 5%. It is often argued that many researchers confuse alpha with the FDR and assume that alpha = .05 ensures a false discovery risk of 5% or less. By setting alpha to .01, they can actually claim that the false discovery risk is below .05. Of course, any error rate allows for errors, and a single significant result with alpha = .05 or alpha = .01 requires replication and honest reporting of replication failures.
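
Using Soric's bound, the alpha adjustment is a short calculation; the simple version below holds the discovery rate fixed, whereas a fuller analysis would also adjust the discovery rate for the stricter criterion.

# How low does alpha have to be so that a 29% discovery rate caps the maximum
# false discovery rate below 5%?
soric_fdr <- function(dr, alpha) ((1 - dr) / dr) * (alpha / (1 - alpha))
soric_fdr(.29, alpha = .05)  # ~0.13: too high for a 5% target
soric_fdr(.29, alpha = .01)  # ~0.02: alpha = .01 keeps the maximum FDR below 5%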

In conclusion, statistics is needed to make sense of data, especially when data are noisy and effect sizes are small. However, statistics can only produce useful results for applied researchers if the assumptions underlying statistical models are consistent with reality, and when assumptions cannot be tested, it is important to conduct sensitivity analyses or to consider worst case scenarios. This blog post examined why vanZwet et al. (2023) estimated that only 2% of clinical trials produce a significant result with a sign error, whereas Schimmack and Bartos (2023) found a false discovery risk of 14%. These differences were not explained by the analysis of different data sets. After correcting for selection bias, Schimmack and Bartos found a bias-corrected/expected discovery rate that was identical to the discovery rate in vanZwet et al.'s (2023) data. Thus, both datasets implied a false discovery risk of 14%, using Soric's (1989) formula. The distinction between sign errors (Type-S error rates) and false discovery rates also did not explain the differences. The key factor was the specification of the mixture model. vanZwet et al.'s model made the assumption that effect sizes follow a normal distribution centered at zero. Thus, the model does not allow for a distinct cluster of effect sizes at or very close to zero. Other models allow for a large number of effect sizes at or very close to zero, and these models can fit the data equally well. Thus, it is impossible to determine the false discovery rate in the Cochrane data. However, the discovery rate of 29% does not allow for more than 14% false discoveries with an effect size of zero or an effect size in the wrong direction.

The 14% FDR is clearly inconsistent with Ioannidis's (2005) scenario in which clinical trials test only 1 true hypothesis out of every 6 with only 20% power, which leads to the prediction that most results are false positives. The results are more consistent with Ioannidis's scenario 1, with adequately powered clinical trials that have only a small amount of bias and test 1 true hypothesis for every false hypothesis. For this scenario, Ioannidis (2005) predicted 15% false discoveries. Thus, our results suggest that Cochrane reviews and abstracts in leading medical journals match this scenario. We hope that z-curve analyses of other types of studies can provide empirical tests of Ioannidis's predictions for those studies. We are pleased to provide researchers interested in the credibility of science with a tool that can provide sound empirical evidence about the false discovery risk.

A Comparison of the New Look and the Old Look at Z-Values from Clinical Trials

Authors: Z-Curve Development Team

Note 1/5/24.
This is Draft 2.0. It was revised after further consultation with Erik van Zwet about his approach. In this communication it became clear that Erik's method can be fitted to absolute z-scores. So, the comparison of the two methods with positive and negative z-scores is no longer relevant. The use of abs(z) also produced better estimates with the original dataset, but continues to perform worse than z-curve when the set of studies includes more powerful studies than those in the Cochrane Review. It remains the case that z-curve performs as well as or better than vanZwet's new approach because it does not require unrealistic assumptions about the distribution of power in sets of studies with varying effect sizes and sample sizes (Brunner & Schimmack, 2021).

Abstract

Two recent studies that extended Jager and Leek's (2014) analysis of p-values in medical research are reviewed. A comparison of the statistical models shows that z-curve (Schimmack & Bartos, 2023) is superior to the new method proposed by vanZwet et al. (2023). Both studies show that clinical trials have low power (~30%), but also low false positive rates (~14%) that can be reduced to less than 5% by setting alpha to .01. Whereas abstracts in clinical journals show clear evidence of selection bias (70% significant results), Cochrane reviews show no evidence of selection bias (~30% significant results). The results show that clinical trials are more credible than Ioannidis (2005) suggested. In the absence of publication bias, Cochrane reviews produce unbiased estimates of population effect sizes and can be used to guide practical decisions.

Introduction

The replication crisis is a term for the assumption that empirical science is less credible than it pretends to be and that most published results may be false (Ioannidis, 2005). While there are many factors that can produce misleading evidence for scientific claims, the key factor is the use of questionable research practices that produce more statistically significant results than the power of empirical studies warrants.

A seminal study examined the replicability of social psychological experiments and found that only 25% of replication studies obtained a statistically significant result again (OSC, 2015). This low success rate undermines the credibility of original articles published in social psychology journals, which report over 90% significant results (Schimmack, 2020).

Ioannidis (2005) famously suggested that many clinical trials in medicine also inflate evidence and that at least 50% of statistically significant results are false positives. Jager and Leek (2014) provided the first empirical test of this prediction. They extracted the statistical results of clinical trials from journal abstracts and modeled the distribution of p-values with a mixture model of true and false hypotheses (false hypotheses = no effect, H0 is true). They estimated a false positive risk of 13% for results that reached significance with alpha = .05. Jager and Leek's seminal attempt to provide empirical evidence about the credibility of evidence in medicine had relatively little impact on discussions about the replication crisis in medicine, presumably because it was harshly criticized by several prominent statisticians.

Ioannidis (2014) declared in the title of his comment that Jager and Leek's "estimate of the science-wise false discovery rate and application to the top medical literature is false" (p. 28). To justify this conclusion, he claimed that excluding 470 p-values (from 5,792, 8%) "markedly affects the overall FDR estimates, and totally invalidates FDR estimates comparing journals and years" (p. 32). Yet, even if all of these 470 results were false positives, it would increase the estimated FDR only from 14% to 21% (.14*.92 + .08*1 = .21). Ioannidis (2014) also claimed that "much of the data is either wrong or makes little sense" (p. 32). If this were true, it would imply that medical abstracts are useless. His final conclusion was that Jager and Leek's article serves as a warning "how badly things can go when automated scripts are combined with wrong methods and unreliable data" (p. 34). However, he never produced better empirical estimates of the false discovery risk in medical trials.

Gelman and O'Rourke (2014) simply found "their claims unbelievable" (p. 19). One of their key concerns is that "there is just too much selection going on" (p. 20) to obtain reasonable estimates, even though Jager and Leek's model took selection bias into account. Ultimately, they express skepticism about the value of modeling distributions of p-values: "To us, an empirical estimate would involve looking at some number of papers with p-values and then follow up and see if the claims were replicated" (p. 22). This did not stop Gelman from being a co-author of a recent article that used p-values to make recommendations about the interpretation of results in clinical trials (vanZwet et al., 2023).

Goodman (2014) criticizes the use of multiple p-values from a single study because these results are not independent and it is difficult to model these data without information about the amount of dependency. Another criticism was the choice of top journals. Presumably, higher false positive rates could be found in clinical trials published in less prominent journals or when pilot studies are included. A third concern was that results reported in abstracts may be misleading because abstracts feature significant results. Goodman (2014) also demands that “any estimate of the reliability of the medical literature must incorporate, as a primary result and not just as a sensitivity analysis, some estimate of the effect of biases” (p. 26). Yet, Goodman was a co-author of a recent article that used p-values to examine evidence in clinical trials without mentioning selection bias at all (vanZwet et al., 2023).

It would take nearly another decade before researchers tried again to examine the credibility of clinical trials based on distributions of p-values in two independent studies (Schimmack & Bartos, 2023; vanZwet et al., 2023). Both articles used z-scores rather than p-values, but this is a minor technical detail because p-values can be converted into z-scores and vice versa. The advantage of z-scores is that they have more interpretable distributions (see Figures below).

Schimmack and Bartos (2023) used a model that was first developed to estimate mean power of studies that are selected for significance (Brunner & Schimmack, 2021). The model is called z-curve because it fits a mixture model to the observed distribution of z-scores. The aim of z-curve1.0 was to predict the outcome of replication studies when original studies are selected for significance. This model was applied to studies in social psychology where selection bias leads to success rates over 90% (Schimmack, 2020; Sterling, 1995). Z-curve2.0 extended z-curve to estimate mean power of all studies, including non-significant studies that are not reported (Bartos & Schimmack, 2022). An estimate of the discovery rate without selection bias makes it possible to estimate the amount of selection bias (Sterling et al., 1995). Selection bias is simply the difference between the observed discovery rate (ODR; i.e., the percentage of significant results) and the estimated discovery rate (EDR; i.e., mean power to produce significant results). Without bias, the observed discovery rate matches the estimated discovery rate because the percentage of significant results is a direct function of mean power (Brunner & Schimmack, 2021). The EDR also provides an estimate of the maximum false discovery rate (Soric, 1989). The main advantage of this approach is that it does not require a priori assumptions about the amount of false and true hypotheses that are being tested. Thus, the results do not depend on assumptions that make claims about false discovery rates speculative.

Schimmack and Bartos (2023) applied z-curve2.0 to results extracted from abstracts in medical journals. They also compared z-curve2.0 to Jager and Leek's model. The key finding was that z-curve2.0 performed better than Jager and Leek's model in simulation studies. When the model was applied to 19,751 p-values from medical abstracts using an improved extraction method, it closely reproduced Jager and Leek's results. The point estimate was 13% with a 95% confidence interval ranging from 8% to 21%. Thus, the results confirmed that the false positive risk in clinical trials is well below 50%. The article also produced several new ways to assess clinical trials. Most important, there was clear evidence of selection bias in abstracts of journal articles. The observed discovery rate (i.e., the percentage of abstracts that reported a significant result) was 70%, while the model predicted only 30% significant results. Thus, the discovery rate in abstracts of published articles is more than double the estimated discovery rate. Finally, z-curve estimated that the mean power of studies that produced a significant result was 65% (95%CI = 61% to 69%), which implies that about 2/3 of clinical trials with a significant result are expected to produce a significant result again in an exact replication study with the same sample size. Overall, these results suggest that selection bias makes it difficult to interpret the point estimate of the population effect size in a single clinical trial, but that clinical trials in general produce solid empirical evidence, especially when the evidence is pooled in meta-analyses.

vanZwet et al. (2023) took a different approach. First, they created their own model of z-value distributions. Second, they relied on Cochrane reviews to obtain information about effect sizes and sampling error in clinical trials and used this information to compute z-values, z ~ ES/SE (ES = effect size, SE = sampling error). I will first discuss their model and then discuss the results based on Cochrane reviews.

vanZwet et al. do not provide evidence of the validity of their model, nor do they compare their model to previous models (Jager & Leek, 2014; Bartos & Schimmack, 2021). The main differences between their model and z-curve are that their model (a) assumes full normal distributions rather than truncated, folded normal distributions, (b) allows for component standard deviations greater than one to model variation in population parameters (i.e., differences in sample sizes and population effect sizes) in addition to sampling error, whereas z-curve uses truncated, folded standard normal distributions and only models sampling error (SD = 1), and (c) assumes a mean of zero, whereas z-curve allows for variation in means to model variability in population parameters. The key assumptions that distinguish vanZwet's model from z-curve are the use of full normal distributions centered at zero. Therefore, I call it the symmetrical-full-normal model (SFN-curve).
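
To make the contrast concrete, here is a hedged sketch of the two density families; the parameter values are illustrative, not fitted to any data, and the truncation step used in actual z-curve fitting is omitted for simplicity.

# Illustrative (not fitted) densities for the two model families described above.
# z-curve: mixture of folded standard-normal densities with non-negative means.
zcurve_density <- function(z, means, weights) {
  sapply(z, function(zi) sum(weights * (dnorm(zi - means) + dnorm(zi + means))))
}
# SFN model: signal-to-noise ratios ~ mixture of N(0, tau), so signed z-scores
# follow a mixture of N(0, sqrt(1 + tau^2)) densities centered at zero; the
# factor of 2 converts the symmetric density to the scale of absolute z-scores.
sfn_density <- function(z, taus, weights) {
  sapply(z, function(zi) sum(weights * dnorm(zi, 0, sqrt(1 + taus^2))))
}
z <- seq(0, 6, by = .1)
plot(z, zcurve_density(z, means = c(0, 2, 4, 6), weights = c(.4, .3, .2, .1)),
     type = "l", ylab = "density", ylim = c(0, .5))
lines(z, 2 * sfn_density(z, taus = c(0.61, 1.42, 2.16, 5.64),
                         weights = c(.32, .31, .30, .07)), lty = 2)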

vanZwet et al. provide no theoretical justification for their assumptions that distributions of z-values from clinical trials or other studies should fit the SFN model. In personal email communication, vanZwet justified the model by stating that they “analyzed a specific dataset in an entirely appropriate way.” (December, 2023). However, the article did not provide any information about model fit. Therefore, I conducted my own comparison of model fit by fitting the data from vanZwet et al.’s article with SFN-curve and with z-curve. While z-curve is usually fitted with absolute z-scores because the sign is irrelevant, the software allows specifying negative means of the mixture components. Using this approach makes it possible to directly compare the fit of the two models against each other in a plot of the predicted density distributions.

Figure 1 shows a histogram of the k = 23,557 z-values used by vanZwet et al. along with the density distributions estimated by kernel density (blue), z-curve (dark green line), and SFN-curve (red line). Visual inspection is sufficient to see that both models fit the data relatively well, but that z-curve fits the data better than SFN-curve.

However, the relatively good fit of the SFN-model is incidental. Figure 2 shows the histogram of a subset of studies from vanZwet et al. (2023) that compared treatments to placebo. The mean of the distribution shifts towards negative values because more studies tested outcomes where negative values imply clinical benefits of a treatment. Z-curve fits the data well, but SFN-curve does not because it makes the unreasonable assumption that the observed data have a mode at zero.

Figure 2 makes it clear that vanZwet et al.’s new SFN-model is suboptimal. There is no reason to use their approach to fit distributions of z-scores because z-curve fits the data as well or better.

Examining Cochrane Clinical Trials with Z-Curve

The previous analyses used positive and negative z-scores to allow for a direct comparison of z-curve and SFN-curve. However, the sign of z-scores in vanZwet's dataset has no meaning, and it is easier to use z-curve with absolute z-scores that only show how strong the evidence against the null-hypothesis is (larger z-scores are less likely to occur when the null-hypothesis is true). Figure 3 shows the z-curve with absolute z-values as data.

One interesting result of z-curving the Cochrane data is that the observed discovery rate (i.e., the percentage of significant results, p < .05) is practically identical to the estimated discovery rate (i.e., average power), ODR = 29%, EDR = 27%. This is a noteworthy empirical finding that vanZwet et al. (2023) fail to mention. The finding is even more remarkable because Schimmack and Bartos (2023) found evidence of publication bias in abstracts of leading medical journals. An interesting question for future research is how Cochrane reviews debias results from original articles. For now, it is important that z-curve is a simple tool that can be used to assess publication bias in sets of statistical results and that effect size estimates in Cochrane reviews are not inflated by selection for significance.

A second noteworthy finding is that the EDR in Cochrane reviews (29% including all studies used by vanZwet et al., see Figure 1) is similar to the EDR estimate based on abstracts in medical journals (30%). While this may be a coincidence, it suggests that z-curve provides reasonable estimates of the EDR even when selection bias is present. Future research should compare z-curve estimates based on the Cochrane database with matching original studies.

In his critique of Jager and Leek's article, Goodman (2014) pointed out that it is more important to find the alpha level that limits the false discovery risk to a reasonably low level (say 5%) than to estimate the false discovery risk at the traditional alpha level of .05. Schimmack and Bartos found that the false positive risk is 14%. Figure 3 replicates this finding with the Cochrane data because the FDR is based on the EDR and the EDRs are the same. Thus, there is converging evidence that the percentage of false positive results is 14% or less. A simple change of alpha shows that an EDR of 30% produces a false positive risk below 5% if alpha is set to .01. Thus, to answer Goodman's question, we are recommending alpha = .01 to maintain a reasonably low false positive risk.

vanZwet et al. (2023) do not provide estimates of the false positive risk, but they do estimate the risk of sign errors, which is closely related to the false positive risk. Sign errors depend on the magnitude of the population effect sizes; in the limit, when all effects are positive or negative but their magnitude is close to zero, the risk of sign errors approaches the Type-I error rate. In other words, the false positive risk is the worst case scenario in which observed effect sizes have a different sign than the population effect size (the effect size estimate is positive when the population effect size is zero or negative, and vice versa). Interestingly, vanZwet et al. estimate that significant results have only a 2% risk of being sign errors, which also implies a false positive risk around 2%. This is much lower than the 14% estimate obtained with z-curve.

The reason for this difference is the conservative nature of FDR estimates with z-curve. Using Soric's approach, the false positive risk assumes that the expected discovery rate is a mixture of null-hypotheses and true hypotheses tested with 100% power. Assuming lower power reduces the false positive risk. For example, if power is only 50%, twice as many true hypotheses need to be tested to get the same number of true positive results. As a result, fewer false hypotheses are tested and there are fewer false positives. With 50% power, the false discovery risk is only 6%, which is still higher than vanZwet et al.'s estimate of 2%. A false discovery rate of 2% can be obtained with a scenario where researchers test 80% true hypotheses with 35% power. This scenario is strikingly different from Ioannidis's (2005) scenarios that often assume researchers are testing more false hypotheses than true hypotheses. In short, although vanZwet et al. (2023) do not comment on Ioannidis's famous prediction, their estimate of only 2% false positive results in Cochrane clinical trials is noteworthy and further challenges Ioannidis's assumptions about the credibility of clinical trials.

Ignoring Heterogeneity

The main purpose of vanZwet et al.’s article was to propose to interpret results from new clinical trials in the context of the power of previous clinical trials. The main problem with this recommendation is that their database is a heterogeneous set of clinical trials that examined radically different treatments. We suggest that researchers conduct z-curve analyses of specific studies that more closely match their research topic and study designs. We recommend z-curve simply because the SFN-model makes assumptions that are typically violated in these subsets of studies (see Figure 2). To illustrate the use of z-curve for specific meta-analyses, we picked the Cochrane review #CD001886 that examined “Anti‐fibrinolytic use for minimizing perioperative allogeneic blood transfusion.” This review was chosen because the distribution of z-scores with positive and negative signs was clearly not symmetrical (90% negative, 10% positive, Median = -1.97, Mean = 2.24) and because there were many efficacy outcomes (k = 719).

Figure 4 shows the z-curve results. First, once more there is no evidence of publication bias (ODR = 51%, EDR = 50%). Second, the higher discovery rate implies a lower false positive risk. No more than 5% of statistically significant results can be false positive results. The distribution of z-scores and the implied parameters differ from those for vanZwet et al.’s (2023) data. Evidently, it would be a mistake to use vanZwet et al.’s results to interpret results in these clinical trials and the same would be true for other subsets of data. We recommend that researchers conduct their own z-curve analyses of theoretically relevant studies rather than using vanZwet et al.’s results. The key problem with these results is that the “inclusion of a trial in the Cochrane Database largely depends on whether someone happens to be interested in a particular treatment or intervention, so the database is not a random sample from the population of all trials” (vanZwet et al., 2023, p. 6).

We also believe that it is unreasonable to adjust effect size estimates based on their results. While it is true that selection for significance inflates point estimates of effect sizes, it is a fallacy to interpret point estimates of effect sizes, especially in small samples. Fortunately, medical researchers routinely report results with confidence intervals that provide information about the uncertainty in these point estimates. Even conditioned on significance, confidence intervals will often include the true population effect size. Moreover, the most notable finding was that there is no evidence of selection bias in the Cochrane database, so there is no selection bias inflating effect size estimates in Cochrane reviews. Thus, effect size estimates in these meta-analyses do not need to be corrected using vanZwet et al.'s or other correction methods. Rather, these methods are more likely to lead to underestimation of treatment effects. Thus, the most useful information that is provided by z-curve analyses is the examination of publication bias in the studies at hand.

Conclusion

One decade after Jager and Leek’s seminal study of p-values in medical research, two articles built on their seminal work using two different datasets and two different methods. The following conclusions can be drawn from these new studies.

First, the z-curve method is superior and more applicable to different datasets than the SFN model that fits only one dataset by chance. As both methods have the same goal of fitting a density distribution of z-scores, we recommend z-curve as the statistical tool of choice.

The two datasets produce surprisingly similar estimates of mean power to produce significant results (i.e., the expected discovery rate). About 1/3 of clinical trials are expected to produce p-values below .05. Thus, many clinical trials have low statistical power. It is therefore important to avoid the fallacy of confusing the absence of evidence (a non-significant result for a treatment effect) with evidence for the absence of an effect.

An expected discovery rate of 30% implies a relatively low false positive risk between 10% and 15%. Even this modest estimate is based on a worst case scenario, and the actual false positive rate is likely to be much lower. Thus, it is much more likely that a statistically significant result reveals a true effect than a false positive result. The results from the Cochrane database avoid many of the criticisms raised by Ioannidis against results reported in abstracts. Thus, evidence is accumulating that his estimate of 50% or more false positives is based on unrealistic scenarios and false assumptions (Schimmack & Bartos, 2023). This is an important finding in a time of widespread science skepticism. Moreover, lowering alpha to .01 can reduce the false positive risk to less than 5%.

The biggest statistical threat to the validity of meta-analysis is selection bias. Thus, it is essential to ensure that studies are not selected for significance or to correct for selection bias when it is present. Z-curve provides a simple way to estimate not only the presence of selection bias, but also to quantify selection bias. The interesting finding is that abstracts in journal articles show large selection bias whereas Cochrane reviews show no evidence of selection bias at all. Future research needs to examine how selection bias is reduced when original studies are entered into Cochrane reviews.

Overall, the results from both studies provide converging evidence that clinical trials produce robust and credible evidence about the effectiveness of treatments that can guide medical practices. The problem of low power is mitigated by the publication of many non-significant results that help to produce unbiased effect size estimates in Cochrane reviews. As researchers who have become interested in the replication crisis in psychology, we look at these results with envy and believe that the enforcement of preregistration and the high quality of Cochrane reviews can serve as an example for psychological science that is only starting to encourage preregistration and does not have rigorous standards to evaluate biases in meta-analyses. While medical research is clearly not without problems, the overly negative image that has been created by Ioannidis’s (2005) influential article is clearly not supported by empirical data (Jager & Leek, 2014; Schimmack & Bartos, 2023; vanZwet et al., 2023).

Maslow’s Hierarchy of Needs: An Empirical Perspective

Maslow’s Hierarchy of Needs is a staple of Introductory Psychology textbooks and popular psychology (Johns, 2023).

The popularity of this model of human motivation is inversely related to its empirical support. The main appeal appears to be the visual presentation of a ranking in the form of a pyramid.

The lack of empirical support may stem from the lack of a dedicated discipline that studies needs and motives. Animal psychologists can only study basic needs rather than self-esteem or self-actualization. Cognitive psychologists are not concerned with motives and experimental social psychologists focus on situations rather than internal causes of behavior. Finally, personality psychologists are interested in variation of internal causes across individuals, but Maslow’s hierarchy implies a universal law that does not leave room for variation across individuals or cultures.

Despite these problems, empirical researchers have tried to test Maslow’s theory, but the history of this work is largely forgotten. For example, the popular personality textbook “The personality puzzle” by David Funder does not include a single reference to a study that tested Maslow’s theory.

The most influential article on empirical tests of Maslow’s theory was published nearly 50 years ago (Wahba & Bridwell, 1976). The article reviewed factor analytic and ranking studies, neither of which provided much support for Maslow’s theory. The problem of factor analytic studies is obvious. Factor analysis relies on correlations across individuals, while Maslow’s theory makes predictions about the ordering of means and in its strong form assumes that any variation across individuals is just measurement error. Ranking studies, on the other hand, provide a straightforward test of Maslow’s theory. However, none of the ranking studies produced the predicted order from most to least important need.

1. Physiological Needs (most important)
2. Security Needs
3. Relationship Needs
4. Self-Esteem Needs
5. Self-Actualization (least important)

The disappointing results may have discouraged other researchers from further empirical tests. A notable exception is a recent study of an online convenience sample (N = 943). Participants were asked to rank Maslow's needs. The average rankings of four needs were consistent with Maslow's model, but relationship needs were ranked number 1, before physiological and safety needs. The problem might be that participants were asked to rank needs according to how important the fulfillment of each need is to them. It is possible that participants fail to consider already fulfilled needs as important and therefore did not rank the fulfillment of physiological needs highly.

In sum, empirical studies often fail to support Maslow’s hierarchy of needs, but this lack of empirical support is often ignored in textbooks and pop psychology.

Results From a Simple Engagement Exercise

E-textbooks and engagement tools for classrooms (iClicker, TopHat) make it possible to increase engagement with class materials through simple tasks. These exercises provide empirical data that can be shared with students. The results can be useful to demonstrate the replicability and generalizability of textbook findings that are based on older studies and on different populations. My textbook, Personality Science: The Science of Human Diversity (Schimmack, 2020), contains a ranking task of Maslow’s needs to make students think about the theory in relation to their own values. TopHat makes it easy to present a ranking task (see Figure 1).

The results have been consistent over the past three years and support Maslow’s theory in terms of the average importance of the five needs. Figure 2 shows the results for N = 129 students at the University of Toronto, Mississauga for 2023.

Physiological needs are ranked as most important by over 40% of the sample and second by over 20% of students. Safety needs are ranked second by over 40% of students and first by another 20%. These two needs are clearly ranked as more important than the other three. Relationship needs are ranked third by over 30% of students, followed by self-esteem and self-actualization. Self-esteem is ranked fourth by nearly 40% of the students, while over 40% rank self-actualization as the least important need.
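
For readers who want to produce this kind of summary from their own classroom data, here is a minimal sketch with simulated ranks; the assumption that responses come as a respondent-by-need matrix of ranks (1 = most important) is mine and not a description of TopHat’s export format.

# Sketch: percentage of students assigning each rank to each need (as summarized above)
# 'ranks' is a respondent-by-need matrix of ranks from 1 to 5; simulated here for illustration
set.seed(42)
ranks <- t(replicate(129, rank(1:5 + rnorm(5, 0, 1.5))))
colnames(ranks) <- c("physiological", "safety", "relationship", "esteem", "actualization")
rank.distribution <- apply(ranks, 2, function(x) 100 * table(factor(x, levels = 1:5)) / length(x))
round(rank.distribution, 1)                        # rows = rank positions, columns = needs
barplot(rank.distribution, legend.text = paste("rank", 1:5),
        ylab = "Percent of students", las = 2)     # stacked bars, one per need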

However, the results also show that students differ in their rank orders. For some students self-esteem is more important than relationships, and vice versa. Some students even rank self-actualization higher than physiological needs. While it is not clear whether these differences reflect true personality differences, the results suggest that the hierarchy is not universal and can vary across individuals, time periods, and cultures. I then move on to research on human values using Schwartz’s model of 10 values, which has received a lot more attention and explicitly allows for diversity in people’s values, needs, motives, and life goals.

Conclusion

Maslow’s hierarchy of needs is well-known inside and outside of psychology, despite a lack of empirical support. In fact, most empirical tests failed to support it. A simple ranking task allows students to reflect on the importance of needs in their lives and to examine the plausibility of Maslow’s theory. Ironically, this simple class exercise provides the best empirical evidence for Maslow’s theory. In this regard, the results do not merely replicate existing evidence; they provide the strongest evidence so far for Maslow’s claim that needs differ in their importance, while also demonstrating that the hierarchy is probabilistic rather than deterministic and universal.

Loken and Gelman’s Simulation Is Not a Fair Comparison

“What I’d like to say is that it is OK to criticize a paper, even [if, typo in original] it isn’t horrible.” (Gelman, 2023)

In this spirit, I would like to criticize Loken and Gelman’s confusing article about the interpretation of effect sizes in studies with small samples and selection for significance. They compare random measurement error to a backpack and the outcome of a study to running speed. Common sense suggests that the same individual under identical conditions would run faster without a backpack than with one. The same prediction follows from psychometric theory: random measurement error attenuates population effect sizes, which makes it harder to obtain significance and produces, on average, weaker observed effect sizes.
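
To see the attenuation effect in isolation, without any selection for significance, here is a minimal sketch (my code, not Loken and Gelman’s); the error standard deviation of 0.5 matches the value used in their simulation further below.

# Sketch: attenuation of a correlation by random measurement error (no selection involved)
set.seed(1)
n <- 1e6                          # very large sample so sampling error is negligible
r <- .15
x <- rnorm(n)
y <- r*x + rnorm(n)*sqrt(1 - r^2)
x.obs <- x + rnorm(n, 0, 0.5)     # add measurement error (SD = 0.5) to both measures
y.obs <- y + rnorm(n, 0, 0.5)
cor(x, y)                         # close to .15, the true correlation
cor(x.obs, y.obs)                 # close to .15*.80 = .12; reliability of each measure is 1/1.25 = .80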

The key point of Loken and Gelman’s article is to suggest that this intuition fails under some conditions: “Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise? We caution against the fallacy.”

To support their claim that common sense is a fallacy under certain conditions, they present the results of a simple simulation study. After some concerns about their conclusions were raised, Loken and Gelman shared the actual code of their simulation study. In this blog post, I share the code with annotations and reproduce their results. I also show that their results are based on selecting for significance only for the measure with random measurement error (with a backpack) and not for the measure without random measurement error (without a backpack). Reversing the selection shows that selection for significance without measurement error produces stronger effect sizes even more often than selection for significance with a backpack. Thus, it is not a fallacy to assume that we would all run faster without a backpack, holding all other factors equal. However, a runner with a heavy backpack and tailwinds might run faster than a runner without a backpack facing strong headwinds. While this is true, the influence of wind on performance makes it difficult to see the influence of the backpack. Under identical conditions backpacks slow people down and random measurement error attenuates effects.
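
The asymmetry can also be illustrated with a compact simulation that conditions on significance for each measure in turn. This is a stripped-down sketch of my own and not Loken and Gelman’s code, which is reproduced and annotated further below.

# Sketch: condition on significance of one measure at a time and compare the paired estimates
set.seed(2)
r <- .15; n <- 50; nsim <- 5000
res <- t(replicate(nsim, {
  x <- rnorm(n); y <- r*x + rnorm(n)*sqrt(1 - r^2)
  x.obs <- x + rnorm(n, 0, 0.5); y.obs <- y + rnorm(n, 0, 0.5)
  c(no.err = cor(x, y), err = cor(x.obs, y.obs),
    p.no.err = cor.test(x, y)$p.value, p.err = cor.test(x.obs, y.obs)$p.value)
}))
# selection on significance of the error-prone measure only
sel <- res[, "p.err"] < .05
mean(res[sel, "err"] > res[sel, "no.err"])        # how often the noisy estimate is larger
# reversed selection: condition on significance of the error-free measure only
sel2 <- res[, "p.no.err"] < .05
mean(res[sel2, "no.err"] > res[sel2, "err"])      # larger proportion, as argued in this post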

Loken and Gelman’s presentation of the results may explain why some readers, including us, misinterpreted their results to imply that selection bias and random measurement error may interact in some complex way to produce even more inflated estimates of the true correlation. We added some lines of code to their simulation to compute the average correlations after selection for significance separately for the measure without error and the measure with error. This way, both measures benefit equally from selection bias. The plot also provides more direct evidence about the amount of bias that is introduced by selection bias and random measurement error. In addition, the plot shows the average 95% confidence intervals around the estimated correlation coefficients.


The plot shows that for large samples (N > 1,000), the measure without error always recovers the true correlation of r = .15, whereas the measure with error always produces the expected attenuated correlation of r = .15 * .80 = .12 (with error SD = 0.5 added to unit-variance scores, each measure retains 1/1.25 = 80% true variance, which attenuates the correlation by a factor of .80). As sample sizes get smaller, the effect of selection bias becomes apparent. For the measure without error, the observed effect sizes are now inflated. For the measure with error, the downward bias due to unreliability and the upward bias due to selection partially cancel each other out, producing estimates that are closer to the true effect size than those of the measure without error. For sample sizes below N = 400, however, both measures produce inflated estimates, and in really small samples the attenuation effect due to unreliability is overwhelmed by selection bias. Yet, even when the difference due to unreliability becomes negligible and approaches zero, random measurement error combined with selection bias never produces stronger estimates than the measure without error. Thus, it remains true that we should expect a measure without random measurement error to produce stronger correlations than a measure with random error. This fundamental principle of psychometrics, however, does not warrant the conclusion that an observed statistically significant correlation in a small sample underestimates the true correlation, because the observed correlation may have been inflated by selection for significance.

The plot also shows how researchers can avoid misinterpreting inflated effect size estimates in small samples. In small samples, confidence intervals are wide. Figure 2 shows that the confidence interval around inflated effect size estimates in small samples is so wide that it includes the true correlation of r = .15. The width of the confidence interval in small samples makes it clear that the study provides no meaningful information about the size of an effect. This does not mean the results are useless. After all, the results correctly show that the relationship between the variables is positive rather than negative. For the purpose of effect size estimation, it is necessary to conduct a meta-analysis that includes studies with significant and non-significant results. Furthermore, meta-analyses need to test for the presence of selection bias and correct for it when it is present.
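
As a concrete illustration, the following sketch (my code; N = 30 is an arbitrary choice) simulates a small study until it reaches significance and then inspects the confidence interval. In most runs the interval stretches from just above zero to well above .5 and therefore includes the true correlation of r = .15.

# Sketch: a significant correlation in a small sample comes with a very wide 95% CI
set.seed(3)
repeat {                                   # simulate until a significant positive result is obtained
  x <- rnorm(30)
  y <- .15*x + rnorm(30)*sqrt(1 - .15^2)
  ct <- cor.test(x, y)
  if (ct$p.value < .05 && ct$estimate > 0) break
}
ct$estimate   # inflated point estimate (selected for significance)
ct$conf.int   # wide interval that typically still includes the true correlation of r = .15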

P.S. If somebody claims that they ran a marathon in 2 hours with a heavy backpack, they may not be lying. They may just not tell you all of the information. We often fill in the blanks and that is where things can go wrong. If the backpack were a jet pack and the person was using it to fly for some of the race, we would no longer be surprised by the amazing feat. Similarly, if somebody tells you that they got a correlation of r = .8 in a sample of N = 8 with a measure that has only 20% reliable variance, you should not be surprised if they tell you that they got this result after picking 1 out of 20 studies because selection for significance will produce strong correlations in small samples even if there is no correlation at all. Once they tell you that they tried many times to get the one significant result, it is obvious that the next study is unlikely to replicate a significant result.

Sometimes You Can Be Faster With a Heavy Backpack

Annotated Original Code

 
### This is the final code used for the simulation studies posted by Andrew Gelman on his blog
 
### Comments are highlighted with my initials #US#
 
# First just the original two plots, high power N = 3000, low power N = 50, true slope = .15
 
r <- .15
sims<-array(0,c(1000,4))
xerror <- 0.5
yerror<-0.5
 
for (i in 1:1000) {
x <- rnorm(50,0,1)
y <- r*x + rnorm(50,0,1) 
 
#US# this is a sloppy way to simulate a correlation of r = .15
#US# The proper code is r*x + rnorm(50,0,1)*sqrt(1-r^2)
#US# However, with the specific value of r = .15, the difference is trivial
#US# However, however, it raises some concerns about expertise
 
xx<-lm(y~x)
sims[i,1]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(50,0,xerror)
y<-y + rnorm(50,0,yerror)
xx<-lm(y~x)
sims[i,2]<-summary(xx)$coefficients[2,1]
 
x <- rnorm(3000,0,1)
y <- r*x + rnorm(3000,0,1)
xx<-lm(y~x)
sims[i,3]<-summary(xx)$coefficients[2,1]
x<-x + rnorm(3000,0,xerror)
y<-y + rnorm(3000,0,yerror)
xx<-lm(y~x)
sims[i,4]<-summary(xx)$coefficients[2,1]
 
}
 
plot(sims[,2] ~ sims[,1], ylab="Observed with added error", xlab="Ideal Study")
abline(0,1,col="red")
 
plot(sims[,4] ~ sims[,3], ylab="Observed with added error", xlab="Ideal Study")
abline(0,1,col="red")
 
#US# There is no major issue with graphs 1 and 2. 
#US# They merely show that high sampling error produces large uncertainty in the estimates.
#US# The small attenuation effect of r = .15 vs. r = .12 is overwhelmed by sampling error
#US# The real issue is the simulation of selection for significance in the third graph
 
# third graph
 
# run 2000 regressions at points between N = 50 and N = 3050 
 
r <- .15
 
propor <-numeric(31)
powers<-seq(50,3050,100)
 
#US# These lines of code are added to illustrate the biased selection for significance 
propor.reversed.selection <-numeric(31) 
mean.sig.cor.without.error <- numeric(31) # mean correlation for the measure without error when t > 2
mean.sig.cor.with.error <- numeric(31) # mean correlation for the measure with error when t > 2
 
#US# It is sloppy to refer to sample sizes as powers. 
#US# In between subject studies, the power to produce a true positive result
#US# is a function of the population correlation and the sample size
#US# With population correlations fixed at r = .15 or r = .12, sample size is the
#US# only variable that influences power
#US# However, power varies from alpha to 1 and it would be interesting to compare the 
#US# power of studies with r = .15 and r = .12 to produce a significant result.
#US# The claim that “one would always run faster without a backpack” 
#US# could be interpreted as a claim that it is always easier to obtain a 
#US# significant result without measurement error, r = .15, than with measurement error, r = .12
#US# This claim can be tested with Loken and Gelman’s simulation by computing 
#US# the percentage of significant results obtained without and with measurement error
#US# Loken and Gelman do not show this comparison of power.
#US# The reason might be the confusion of sample size with power. 
#US# While sample sizes are held constant, power varies as a function of the population correlations
#US# without, r = .15, and with, r = .12, measurement error. 
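 
### Illustrative sketch (not part of Loken and Gelman's original code): the power comparison
### described in the comments above can be computed directly with a small helper function.
power.sketch <- function(n, nsim = 10000, r = .15, xerror = .5, yerror = .5) {
  sig <- replicate(nsim, {
    x <- rnorm(n); y <- r*x + rnorm(n)*sqrt(1 - r^2)
    t1 <- summary(lm(y ~ x))$coefficients[2, 3]                                                  # t-value without error
    t2 <- summary(lm(I(y + rnorm(n, 0, yerror)) ~ I(x + rnorm(n, 0, xerror))))$coefficients[2, 3] # t-value with error
    c(abs(t1) > 2, abs(t2) > 2)
  })
  rowMeans(sig)  # proportion of significant results without error vs. with error
}
# power.sketch(50)  # with N = 50, the first proportion should be larger, consistent with attenuation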
 
xerror<-0.5
yerror<-0.5
 
j = 1
i = 1
 
for (j in 1:31)  {
 
sims<-array(0,c(1000,4))
for (i in 1:1000) {
x <- rnorm(powers[j],0,1)
y <- r*x + rnorm(powers[j],0,1)
#US# the same sloppy simulation of population correlations as before
xx<-lm(y~x)
sims[i,1:2]<-summary(xx)$coefficients[2,1:2]
x<-x + rnorm(powers[j],0,xerror)
y<-y + rnorm(powers[j],0,yerror)
xx<-lm(y~x)
sims[i,3:4]<-summary(xx)$coefficients[2,1:2]
}
 
#US# The code is the same as before, it just adds variation in sample sizes
#US# The crucial aspect to understand figure 3 is the following code that 
#US# compares the results for the paired outcomes without and with measurement error
 
#US# Carlos Ungil (https://ch.linkedin.com/in/ungil) pointed out on Gelman’s blog
#US# that there is another sloppy mistake in the simulation code that does not alter the results.
#US# The code compares absolute t-values (coefficient/sampling error), while the article
#US# talks about inflated effect size estimates. However, while the sampling error variation
#US# creates some variability, the pattern remains the same.
#US# For the sake of reproducibility I kept the comparison of t-values.
 
# find significant observations (t test > 2) and then check proportion
temp<-sims[abs(sims[,3]/sims[,4])> 2,]
 
#US# the use of t > 2 is sloppy and unnecessary.
#US# summary(lm) gives the exact p-values that could be used to select for significance
#US# summary(xx)$coefficients[2,4] < .05
#US# However, this does not make a substantial difference 
 
#US# The crucial part of this code is that it uses the outcomes of the simulation 
#US# with random measurement error to select for significance
#US# As outcomes are paired, this means that the code sometimes selects outcomes
#US# in which sampling error produces significance with random measurement error 
#US# but not without measurement error. 
 
propor[j] <- table((abs(temp[,3]/temp[,4])> abs(temp[,1]/temp[,2])))[2]/length(temp[,1])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# when measurement error is present.
mean.sig.cor.with.error[j] = mean(temp[,3])
 
#US# Conditioning on significance for one of the two measures is a strange way
#US# to compare outcomes with and without measurement error.
#US# Obviously, the opposite selection bias would favor the measure without error.
#US# This can be shown by computing the same proportion after selecting for significance 
#US# for the measure without error
 
temp<-sims[abs(sims[,1]/sims[,2])> 2,]
propor.reversed.selection[j] <- table((abs(temp[,1]/temp[,2]) > abs(temp[,3]/temp[,4])))[2]/length(temp[,4])
 
#US# this line is added to compute the mean correlation for significant outcomes 
#US# without measurement error. 
mean.sig.cor.without.error[j] = mean(temp[,1])
 
print(j)
 
#US# we could also add two comparisons that are more meaningful and avoid this selective comparison 
###
 
}
 
 
#US# the plot code had to be modified slightly to have matching y-axes 
#US# I also added a title 
title = "Original Loken and Gelman Code"
 
plot(powers,propor,type="l",
ylim=c(0,1),main=title,  ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="blue")
 
#US# text that explains what the plot displays, not shown
#US# #text(200,.8,"How often is the correlation higher for the measure with error",pos=4)
#US# #text(200,.75,"when pairs of outcomes are selected based on significance of",pos=4) 
#US# #text(200,.70,"the measure with error?",pos=4)
 
#US# We can now plot the two outcomes in the same figure 
#US# The original color was blue. I used red for the reversed selection
par(new=TRUE)
plot(powers,propor.reversed.selection,type="l",
ylim=c(0,1), ### added code
xlab="Sample Size",ylab="Prop where error slope greater",col="firebrick2")
 
#US# adding a legend 
legend(1500,.9,legend=c("with backpack only sig. \n shown in article \n ",
"without backpack only sig. \n added by me"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at 50%
abline(h=.5,lty=2)
 
 
#US# The following code shows the plot of mean correlations after selection for significance
#US# for the measure with error (blue) and the measure without error (red)
 
title = "Comparison of Correlations after Selection for Significance"
 
plot(powers,mean.sig.cor.with.error,type="l",ylim=c(.1,.4),main=title,
xlab="Sample Size",ylab="Mean Observed Correlation",col="blue")
 
par(new=TRUE)
 
plot(powers,mean.sig.cor.without.error,type="l",ylim=c(.1,.4),main="",
xlab="Sample Size",ylab="Mean Observed Correlation",col="firebrick2")
 
#US# adding a legend 
legend(2000,.4,legend=c("with error",
"without error"),pch=c(15),
pt.cex=2,col=c("blue","firebrick2"))
 
#US# adding a horizontal line at the true correlation of r = .15
abline(h=.15,lty=2)