The Implicit Association Test (IAT) is the most widely used indirect measure of attitudes, including attitudes toward racial and ethnic groups such as European Americans (Caucasians), African Americans, and Asian Americans.
Psychologists disagree about the ability of the IAT to measure individuals’ racial attitudes (Singal, 2017). Thus, it provides another opportunity to examine Gilovich, Keltner, Chen, and Nisbett’s (2019) promise to be “scrupulous about noting when the evidence about a given point is mixed” (p. 57).
The IAT is introduced in Chapter 11 on stereotypes, prejudice, and discrimination. Students are told that the IAT is a technique for “revealing, subtle, nonconscious prejudices, even among those who sincerely believe they are bias-free” (p. 365). This statement makes it clear that the authors consider the IAT a valid measure of unconscious attitudes, that is, evaluations outside of individuals’ awareness. Somebody may think that they are prejudiced when they are actually not, or vice versa. Students are encouraged to visit the IAT website to see whether they “hold any implicit stereotypes or prejudice towards a variety of groups” (p. 367).
The difference between response latencies in the critical IAT conditions is presented as a “nonconscious attitude” (p. 367).
Students are informed that
“both young and older individuals show a pronounced prejudice in favor of the young over the old” (p. 368)
“and about two-thirds of white respondents show a strong to moderate prejudice for white over black” (p. 368)
Criticism of the IAT is countered with a claim that the race IAT shows convergent validity with other indirect measures of prejudice.
Although the test has its critics (Blanton & Jaccard, 2008; Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013; Roddy, Stewart, & Barnes-Holmes, 2010), there’s evidence that IAT responses correlate with other indirect measures of prejudice (Lane, Banaji, Nosek, & Greenwald, 2007; Rudman & Ashmore, 2007). (p. 368)
The authors also mention a brain-imaging study that found correlations between IAT scores and brain activity (Phelps et al., 2000).
The authors then discuss the importance of predictive validity. They cite a single study as evidence that the IAT has predictive validity (McConnell & Leibold, 2001).
In conclusion, the IAT is introduced as a measure of unconscious prejudice that people cannot know about without taking the IAT. The test is presented as having convergent validity with other indirect measures and as being predictive of actual behavior.
This presentation of the IAT is one-sided and misleading. A major problem of social psychology textbooks is that they present evidence without quantifying effect sizes. Surely, it matters whether the IAT correlates with other implicit measures at r = .1, r = .3, r = .5, or r = .7. Even a replicable correlation of r = .1 with other indirect measures would not justify the conclusion that the IAT has good convergent validity. Validity is typically assessed against a standard of r = .7 as the minimum for acceptable validity if a test is used for personality assessment.
The same holds for predictive validity. A test that explains only 1% or 10% of the variance in actual behavior may be scientifically interesting, but for individuals who are given feedback on an online test of their hidden prejudices, it would be important to know that most of the variance in their actual behavior is explained by other factors.
The textbook does mention criticism of the IAT, but it does not mention what the criticism is. The key criticisms are that the IAT has low convergent validity, low predictive validity, and no evidence of discriminant validity; that is, there is no evidence that it measures nonconscious prejudices “even among those who sincerely believe they are bias-free” (p. 368).
Lane, Banaji, Nosek, and Greenwald (2007)
The key claim about convergent validity is based on Lane, Banaji, Nosek, and Greenwald’s chapter, which includes a section on “Convergent Validity: The Relationship of the IAT to Other Implicit Measures.”
The first article mentioned is Bosson et al.’s (2000) study of the IAT as a measure of implicit self-esteem.
“Although the IAT was uncorrelated with the other implicit measures, it was not alone in failing to converge” (Lane et al., 2007, p. 72).
The next set of studies examined convergent validity between the IAT and priming tasks and found no significant correlation.
“IATs and priming tasks measuring implicit attitudes toward smoking (Sherman, Rose, Koch, Presson, & Chassin, 2003) and condom use (Marsh, Johnson, & Scott-Sheldon, 2001) were unrelated” (Lane et al., 2007, p. 72).
Fazio and Olson (2003) reported that across four studies in their lab, priming tasks and IATs measuring racial attitudes did not correlate (p. 73).
Other studies found statistically significant correlations, but the chapter does not mention the magnitude of these correlations. Moreover, these studies did not examine the convergent validity of the IAT as a measure of prejudice.
Stereotypes about gender authority (associations between female and low-status jobs), as measured by the IAT, were correlated with three indices of attitudes toward female authorities derived from the priming task (Rudman & Kilianski, 2000). (Lane et al., 2007, p. 73).
Participants who tended to show strong implicit gender stereotypes (on the IAT) also showed more positive attitudes toward women on the priming measure, and political attitudes measured by a priming task and the IAT were reliably related (Nosek & Hansen, 2004). (Lane et al., 2007, p. 73).
The chapter then attributes low correlations to the low reliability of implicit measures.
These findings suggest that when reliability is accounted for, implicit measures are more likely to be related.
This is important because students of the textbook are encouraged to receive feedback about their nonconscious prejudice based on a single visit to the IAT website. However, a single IAT measurement has low reliability and, as a result, low validity. These results should not be treated as measures of prejudice, conscious or unconscious.
The chapter concludes with the hope that future research will shed more light on the low convergent validity of implicit prejudice measures. However, there has been little research on this fundamental question and the 2019 textbook draws on the 2007 chapter to support the claim that the IAT is a valid measure of nonconscious prejudice.
Students could just trust the textbook authors, but fact checking is better. As it turns out, the chapter provides no evidence to support the claim that a visit to the IAT website can reveal some hidden secrets about their prejudices.
The textbook cites Rudman and Ashmore’s 2007 article as evidence for convergent validity, but the authors fail to note that the study also examined, and failed to find, evidence for predictive validity.
Table 3 shows the correlations of the IAT with three behavioral measures. Importantly, the typical IAT is the Attitude IAT, while the Stereotype IAT was uniquely designed for this study. The Attitude IAT shows only one significant correlation with verbal behavior and two non-significant correlations with defensive and offensive behaviors (Table 3). Moreover, in a regression analysis the Attitude IAT did not explain unique variance over the other attitude measures.
Instead, they present a study by McConnell and Leibold from 2001 with a sample size of N = 42 participants. The results are shown in Table 3 below.
It is easy to find other studies that failed to replicate this finding. For example, a more recent study by Axt, Ebersole, and Nosek (2016) found that the IAT was not a significant predictor of discrimination in a hypothetical selection task. After the IAT failed in Study 1, it was replaced with a different, although similar, task in the following studies. This shows low trust in the IAT by one of the developers of the online IAT. Moreover, the set of studies showed that White participants showed an overall pro-Black bias in their behaviors. This finding raises doubts about the claim that most White Americans’ behavior is influenced by nonconscious prejudice even if they try not to be prejudiced.
In conclusion, the textbook does not provide an accurate and balanced account of the controversies surrounding the IAT. It repeats the controversial claim that the IAT is a valid measure of some hidden prejudices that can only be revealed by the IAT. This is a strong claim that is not supported by evidence. Students deserve a better textbook, but I am not aware of a social psychology textbook that is more scientific in the introduction of the IAT.
It is 2019 and everybody is familiar with fake news. Liberals accuse Fox News of spreading fake news and conservatives return the favor by accusing the liberal media of fake news. Each side has experts and facts to support their biased and ideological construction of reality.
Innocent consumers of scientific information, like many undergraduate students who pay money for an education, may assume that universities are a safe space where professors are given a good salary and job security to search for the truth.
While I cannot speak for all science, I can say that this naive assumption about scientists does not describe the behavior of eminent social psychologists.
The problem for social psychologists is that many of their studies were obtained with questionable research practices (John et al., 2012; Schimmack, 2015). As a result, many of the published results in social psychology do not replicate. Since 2011, over 100 social psychology experiments have been replicated and less than 25% produced a significant result (Open Science Collaboration, 2015; Curate Science, 2018). The replication crisis in social psychology suggests that many published results in social psychology textbooks may not replicate and could be false positive results.
An unflinching, scientific response to replication failures in social psychology would require a thorough revision of social psychology textbooks. However, there is no evidence that textbook writers are able or willing to tell undergraduate students about the replication crisis in their field.
For example, Gilovich, Keltner, Chen, and Nisbett (2019) claim that replication failures are best explained by problems with the replication studies and point to more successful replications in other behavioral sciences such as economics to suggest that students should trust social psychology (Schimmack, 2018a). However, it can be shown that the replication failures in social psychology are mostly caused by the use of questionable research practices that inflate effect sizes and the risk of false discoveries (Schimmack, 2018b), rather than by incompetent replication attempts.
On page 57 of their textbook the authors pledge to their readers that they “tried to be scrupulous about noting when the evidence about a given point is mixed” (p. 57). Students could simply trust the authors, but it is always better to check whether they deserve students’ trust.
There is probably no better example of mixed evidence than social priming. As Daniel Kahneman wrote in an open letter to social priming researchers:
your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it… all I have personally at stake is that I recently wrote a book that emphasizes priming research as a new approach to the study of associative memory…Count me as a general believer… My reason for writing this letter is that I see a train wreck looming.
Several years later, the train wreck has materialized. First, social priming researchers have not responded to Kahneman’s plea to replicate their findings in their own laboratories to demonstrate that replication failures were caused by improper replication studies. Second, other researchers have published numerous replication studies that failed to reproduce the original findings (Curate Science, 2018). Finally, statistical analyses of social priming studies show evidence that questionable research practices were used to produce the textbook findings on social priming (Schimmack, 2017a; Schimmack, 2017b).
How does the 5th edition, 2019, of the textbook differ from previous versions in the presentation of priming research?
Several priming studies that were included in the 3rd edition (2013) are no longer mentioned in the 5th (2019) edition.
“activating the concept ‘professor’ actually makes students do better on a trivia test” (Dijksterhuis & van Knippenberg, 1998).
“More remarkably still, Dijksterhuis, van Knippenberg, and their colleagues demonstrated that activating the stereotype of professor or supermodel led participants to perform in a manner consistent with the stereotype, but activating a specific (extreme) example of the stereotyped group (for example, Albert Einstein or Claudia Schiffer) led participants to perform in a manner inconsistent with the stereotype” (p. 128).
“just mentioning words that call to mind the elderly (cane, Florida) causes college students to walk down a hall more slowly” (Bargh, Chen, & Burrows, 1996).
The following sentence has not been changed, but the reference to Bargh, Raymond, Pryor, & Strack, 1995, has been removed.
“Dispositionally high-powered individuals or individuals primed with feelings of power are more likely to touch others and approach them closely, to have sexual ideas running through their minds, to feel attraction to a stranger, to overestimate other people’s sexual interest in them, and to flirt in an overly forward fashion (Bargh, Raymond, Pryor, & Strack, 1995; Kuntsman & Maner, 2011; Rudman & Borgida, 1995).”
While it is interesting to see that the textbook authors lost confidence in some priming studies, the quiet removal of these studies contradicts the earlier claim that the authors would inform students about mixed evidence. More important, other priming studies are cited without mentioning widespread doubt about social priming among psychologists.
In Chapter 1, “An Invitation to Social Psychology” (which would be better called an indoctrination to social psychology) the authors cite Bargh and Pietromonaco (1982) to claim that “often we can’t even identify some of the crucial factors that affect our beliefs and behavior” (p. 16). The cited article reports the results of a single subliminal priming study with a just significant result, F(1,128) = 4.15, p = .044. It is doubtful that this finding could be replicated today in a preregistered replication study.
In Chapter 4 the authors make strong claims about priming effects.
“We’ve seen how schemas influence our attention, memory, and construal. Can they also influence behavior? Absolutely. Studies have shown that certain types of behavior are elicited automatically when people are exposed to stimuli in the environment that bring to mind a particular action or schema (Loersch & Payne, 2011; Weingarten et al., 2016). Such exposure is called priming a concept or schema.”
The 2011 reference is pre-crisis and the Weingarten et al. (2016) article reports a meta-analysis of studies that used questionable research practices. Without taking QRPs into account, conclusions based on such meta-analysis are as doubtful as conclusions drawn from original studies that used QRPs.
To examine the credibility of the evidence on priming in this textbook, I z-curved the test-statistics in the cited priming articles (see DATA for the list of studies). Most of the articles are cited in Chapter 4 on pages 120-121. I coded 36 empirical articles that contained 89 studies with useful information. 84 tests were significant at p < .05 and the other 5 studies reported a marginally significant result as support for a priming effect.
The Figure shows the z-curve based on the 84 significant test statistics converted into z-scores. The blue line shows the density distribution of the observed test statistics. The grey line shows the fit of the model to the observed data; it is also projected into the range of non-significant z-scores. The area under this projected curve shows the estimated size of the file drawer if dropping non-significant studies were the only questionable practice used to produce significant priming results. The file-drawer ratio suggests that for every published significant result there would be about 12 non-significant results that remained unpublished.
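Under the assumption that dropping non-significant studies is the only questionable practice, the file-drawer ratio follows directly from the pre-selection rate of significant results. A minimal sketch in Python (the 12:1 figure is the z-curve estimate discussed here, not a new result):

```python
def file_drawer_ratio(p_sig):
    """Expected unpublished non-significant studies per published significant
    result, assuming only non-significant results are dropped."""
    return (1 - p_sig) / p_sig

# a 12:1 file-drawer ratio implies that only about 1 in 13 attempts
# produced a significant result before selection
print(round(file_drawer_ratio(1 / 13), 1))  # 12.0
```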
More important is the estimate of replicability for the 84 studies with significant results. The estimated success rate is 33%, with a 95% CI ranging from 11% to 41%. This estimate is consistent with other analyses of priming studies and the low success rate in actual replication studies (Curate Science, 2018). It suggests that at best one-third of the cited studies could be successfully replicated even if it were possible to redo each study exactly. As it is often impossible to conduct exact replications, the actual success rate is likely to be even lower.
The maximum false discovery rate is also very high at 75%. This estimate means that it is possible to fit a model to the data with the false discovery rate fixed at 75% whose fit is close to that of the unconstrained model. Thus, most of the cited results could be false positives for which the actual effect size is close to zero. Effect sizes matter because social psychologists claim that stimuli outside our awareness can have a strong influence on behavior.
Finally, the numbers below the x-axis show mean power for different levels of z-scores. For just significant results with p-values greater than .01, mean power is only 15%. These results are unlikely to replicate even with much larger sample sizes. A z-score greater than 3 (p < .001) is needed for a greater than 50% chance of a successful replication. Interested readers can examine the data file to see which studies fall into this category.
In conclusion, the textbook authors do not provide an honest, balanced, and self-critical introduction to social priming studies. Contrary to their claim to be “scrupulous about noting when the evidence about a given point is mixed,” they present social priming as a well-established phenomenon that generalizes across modalities and behaviors, while they are well aware that some studies failed to replicate. Rather than informing readers about these failures, some spectacular replication failures like professor priming were simply removed.
The Power of the Situation
Social psychologists emphasize situational influences on behavior. The textbook authors’ behavior illustrates how powerful situational influences can be. While stereotypes of professors suggest that they have meta-human strength to overcome human weaknesses, they are just as susceptible to situational influences like peer-pressure and monetary incentives.
The textbook authors probably believe that they tried to be objective, and the removal of some priming studies can be considered evidence of that effort, but their position as eminent social psychologists who earn money from publishing a textbook ensured that they would be unable to present replication failures of social priming studies. The desire to invite (initiate) a new generation of students into social psychology was too strong to talk openly about the replication crisis.
Another psychologist once described the power of situational influences in science as being “a prisoner of a paradigm.” A paradigm is a scientific belief system that guides researchers’ beliefs, values, and actions. To accept the replication crisis in social psychology as a fact would trigger an existential crisis and a paradigm shift. Paradigm shifts occur only when new researchers, who are outsiders, see the problems that insiders cannot see. The older researchers are unable to change their beliefs because doing so would trigger an existential crisis and probably a depressive episode. As Freud pointed out, humans use powerful defense mechanisms to protect the self from threatening thoughts.
For instructors who are looking for a scientific textbook about social psychology and for students who want to learn about social psychology, I recommend using a different textbook, although at this point, I cannot recommend one that is less biased than this one. At the moment, it is best to read about social psychology with a healthy dose of skepticism unless a particular finding has been successfully replicated in credible preregistered replication studies.
Since 2011, psychologists no longer know whether they should trust published results. The crisis of confidence was triggered by Bem’s successful demonstration that extraverts have extrasensory abilities to feel random future events. Although the evidence of individual studies was weak, the combined evidence of 9 studies provided strong evidence against the common-sense null hypothesis that causality cannot be time-reversed.
The response to Bem’s (2011) incredible claims was swift. Two articles pointed out that Bem used questionable research practices to produce significant results (Francis, 2012; Schimmack, 2012). Two other articles failed to replicate his results in replication attempts without the help of QRPs (Galak, LeBoeuf, Nelson, & Simmons, 2012; Ritchie, Wiseman, & French, 2012).
John, Loewenstein, and Prelec (2012) found that many psychologists assumed that the use of questionable research practices was acceptable and admitted to using them in their own research. As a result, it is now unclear which published results in psychology journals can be trusted.
To provide much needed empirical information about the replicability of published findings, Brian Nosek initiated a series of replication studies. The vast majority of these replication attempts targeted experimental studies in social psychology. The most important finding was that replications of a representative sample of studies in social psychology produced only 25% statistically significant results (Open Science Collaboration, 2015). This shocking result has dramatic implications for the credibility of social psychology as a science.
A simple prediction rule would be to bet on the high base rate of replication failures. With a success rate of only 25%, this prediction rule would be correct 75% of the time. The aim of this article is to present a statistical model that is superior to the base-rate rule.
A better prediction model is also needed for cognitive psychology, where the success rate of replication studies was 50%. With success rates around 50%, base rates are not very informative.
A statistical prediction model would be very useful because there have been few systematic replication projects in other disciplines of psychology, where studies are more costly (e.g., infant studies, fMRI studies, longitudinal studies, etc.).
A Statistical Prediction Model
Brunner and Schimmack (2018) developed a statistical model that can predict the success rate of a set of replication studies on the basis of the test statistics in original research articles. The model works because the success rate of exact replication studies is equivalent to the mean statistical power of the original studies. The only challenge is to estimate mean power based on test statistics with heterogeneity in sample sizes and effect sizes after selection for significance. Brunner and Schimmack (2018) demonstrated with simulation studies that their method, z-curve, provides good estimates of mean power.
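The equivalence between the mean power of selected studies and the success rate of exact replications can be checked with a small simulation. This is a sketch, not Brunner and Schimmack’s implementation, and the mixture of power levels is made up for illustration:

```python
import random
from statistics import NormalDist

random.seed(1)
nd = NormalDist()
crit = nd.inv_cdf(0.975)  # two-sided alpha = .05 criterion, z = 1.96

def ncp(power):
    # noncentrality parameter of a study with the given power
    return crit + nd.inv_cdf(power)

# a made-up heterogeneous literature: each study has 20%, 50%, or 80% power
true_power = [random.choice([0.2, 0.5, 0.8]) for _ in range(200_000)]

selected = []   # true power of the studies that reach significance
successes = 0   # significant exact replications of those studies
for p in true_power:
    z_orig = random.gauss(ncp(p), 1)
    if z_orig > crit:                     # selection for significance
        selected.append(p)
        z_rep = random.gauss(ncp(p), 1)   # exact replication: same ncp
        if z_rep > crit:
            successes += 1

mean_power = sum(selected) / len(selected)
success_rate = successes / len(selected)
print(round(mean_power, 3), round(success_rate, 3))  # agree up to simulation error
```

Because studies with higher power are more likely to survive selection, the mean power of the selected studies is higher than the pre-selection average, and the replication success rate tracks it closely.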
One goal of this article is to compare z-curve predictions with the observed success rates in social psychology and cognitive psychology. However, a more important goal is to develop better prediction rules that are able to distinguish studies with low power and studies with high power within a set of studies.
To make more refined predictions of replicability, we use the concept of local power, which is akin to Efron’s (2015) local false discovery rates. If studies are heterogeneous in sample size and effect size, they also vary in power. As sampling error simply adds random noise to the non-centrality parameters of each study, studies with higher observed z-scores are more likely to have higher power. As a result, we can postulate a monotonic relationship between observed power and true power. Thus, it is possible to provide local power estimates for segments of observed z-scores. The overall estimate of mean power is simply a weighted average of local power estimates that takes the percentage of observations for each local region into account.
To illustrate the concept of local power, imagine a simple scenario where studies have only one of three true power values: 20%, 50%, and 80%. Further, assume that 50% of studies have 20% power, 30% of studies have 50% power, and 20% of studies have 80% power. The overall mean power for this simple example equals .5*20 + .3*50 + .2*80 = 41%.
Figure 1 shows the corresponding density distribution of the mixture model. The black area shows the contribution of the low powered studies to the total density. The red area shows the contribution of the moderately powered studies and the blue area shows the contribution of the high powered studies.
For a low z-score of 2, it is readily apparent that most of the density stems from the low-powered studies, followed by the medium-powered studies, with very few high-powered studies. As z-scores increase, the relative contributions shift until the density is dominated by the high-powered studies for z-scores greater than 4.5. For each z-score, it is possible to compute a weighted average of the true power values. The total mean power is the average across all slices. More important, local power is the weighted average of the slices for a specific region.
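For the three-component example, this weighted average can be written out directly. The sketch below assumes the mixture is known; the actual z-curve estimates the components from the data, so its local power estimates will differ somewhat from these analytic values:

```python
from statistics import NormalDist

nd = NormalDist()
crit = nd.inv_cdf(0.975)  # 1.96, the two-sided .05 criterion
# the example mixture: 50% of studies with 20% power, 30% with 50%, 20% with 80%
components = [(0.5, 0.2), (0.3, 0.5), (0.2, 0.8)]

def ncp(power):
    # noncentrality parameter of a study with the given power
    return crit + nd.inv_cdf(power)

def local_power(z):
    # expected true power given an observed z-score: each component is
    # weighted by its share times its normal density at z
    w = [share * NormalDist(ncp(p), 1).pdf(z) for share, p in components]
    return sum(wi * p for wi, (_, p) in zip(w, components)) / sum(w)

# overall mean power is the share-weighted average: .5*.2 + .3*.5 + .2*.8 = .41
print(round(sum(share * p for share, p in components), 2))  # 0.41
# local power rises monotonically with the observed z-score
print(local_power(2.0) < local_power(3.0) < local_power(4.5))  # True
```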
To obtain local estimates of mean power (local power), it is possible to average mean power of slices for regions of z-scores (e.g., 2 to 2.5). Figure 2 shows the results of a z-curve analysis with local power estimates (Z.Curve Version 19.1).
The model is estimated using only significant z-scores although the simulated non-significant results that would normally be in the proverbial file-drawer are also shown. The grey line shows the fit of the model to the data. The model also makes projections about the shape of the file drawer based on the distribution of significant z-scores. More important, below the x-axis are the local power estimates in intervals of .5 standard deviations. Consistent with the simulation, power starts with 20% on the left side because most of the left-most observations represent studies with 20% power. As z-scores increase on the x-axis, mean power is increasingly a mixture of all three power levels and mean power gradually increases. For extremely high z-scores most of the few observations are from the set with 80% power. Thus, the mean power approaches 80%.
Figure 2 shows that just significant results with p-values between .01 and .05 (2 < z < 2.5) have a mean power of only 35%. Thus, exact replication studies are more likely to produce a non-significant result than a significant one. For optimal prediction of replication outcomes, betting on success only makes sense when mean power is greater than 50%. For these simulated data, this level is achieved with z-scores greater than 3.5.
The observed z-score at which mean power exceeds 50% depends on the power distribution of all studies. If all studies had 50% power, even non-significant results would have a 50% probability of producing a significant result in a replication.
By fitting z-curve to observed test-statistics of original studies, it is possible to create optimal prediction rules depending on the power of studies in a particular field. We validated these predictions against the results of actual replication studies in social psychology and cognitive psychology.
Application to Social Psychology
To have the largest set of credible replication studies, we used the Curate Science database, which includes all major replication projects as well as individual replication efforts. It currently contains over 180 original studies that have been replicated. Of these, 130 were social psychology studies, and 126 reported a significant result with a usable test statistic.
Figure 3 shows the success rates as a function of the observed z-scores.
As can be seen, it requires z-scores greater than 4.5 to break the 50% barrier. A simple “don’t believe anything” rule would make 108 correct predictions and 16 false predictions (83% success rate). A rule that predicts failure for z-scores below 4.5 and success for z-scores greater than 4.5 makes 112 correct predictions and 12 false predictions (86% success rate). Although this is not a statistically significant difference, given the small set of actual replication studies, the evidence suggests that a rule that predicts successes with z-scores greater than 4.5 is superior.
Next, we fitted z-curve to the test-statistics of the original studies. Test-statistics were first converted into p-values and p-values were then converted into z-scores.
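For a reported two-sided p-value, the second conversion step uses only the standard normal quantile function. A minimal sketch (converting t or F statistics to p-values first requires their respective distributions, e.g. via scipy.stats):

```python
from statistics import NormalDist

nd = NormalDist()

def p_to_z(p):
    # two-sided p-value -> absolute z-score on the standard normal scale
    return nd.inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))    # 1.96, the conventional significance criterion
print(round(p_to_z(0.001), 2))   # 3.29
print(round(p_to_z(0.0001), 2))  # 3.89
```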
Figure 4 shows the mean power estimates for the significant studies as the predicted replication rate. If these studies could be replicated exactly, we would expect 35% significant results in the replication studies. The graph also shows the estimated size of the file drawer of studies that would be required to produce the published significant results. The file-drawer ratio suggests that for every published study there are about 4 unpublished studies with non-significant results. This estimate assumes that no other questionable research practices were used. Most important, the local power estimates show that just significant results have low power: 26% for z-scores between 2 and 2.5 and 32% for z-scores between 2.5 and 3. The actual replication rates for studies with z-scores in this range are even lower. Studies with z-scores between 3.5 and 4 are estimated to have 49% power, compared to an actual success rate of 40%. For studies with z-scores greater than 4, the success rates are above 50%. Given the small sample size, all of these estimates are imprecise, but the results are encouraging. The z-curve model also suggests that z-scores of 4 or higher are needed to achieve 50% mean power.
Taken together, these results provide first evidence that z-curve can be used to develop prediction rules that take the amount of selection in a literature into account. For social psychology, most replication attempts were failures. Given the high base-rate of failures, it is difficult to find predictors of success. These analyses suggest that original results are likely to replicate if the test statistic corresponds to a z-score of 4 or p < .0001.
The main advantage of z-curve over actual replication studies is that it is possible to apply z-curve to a representative sample of studies in social psychology. This makes it possible to create a rule that is not biased by the selection criteria for studies that were actually replicated (e.g., an oversampling of ego-depletion and priming studies).
Motyl et al. (2016) coded a representative sample of studies from three social psychology journals (JESP, PSPB, JPSP). Applying z-curve to their dataset produces a similar estimate of the predicted replication rate. It also showed that studies with just significant results have very low power. However, local power estimates suggested that results with z-scores greater than 3.5 rather than 4 have more than 50% mean power. As this estimate is based on a larger and more representative sample, the prediction rule might be adjusted accordingly. However, validation with actual replication studies is needed to verify this rule.
Application to Cognitive Psychology
The Curate Science database included 49 cognitive studies. The replication success rate for cognitive studies was considerably higher than for social psychology (57%). More important, the success rates for different levels of evidence were above 50% for z-scores greater than 3, rather than greater than 4 as in social psychology.
Given the overall success rate of 57%, the base-rate rule would predict successes for all studies and would be correct 57% of the time. Using a rule that bets on success only for z > 3 and on failure for z < 3 is correct 60% of the time. Given the small sample of studies, this is not a significant improvement.
The z-curve analysis is consistent with the actual replication results. The predicted success rate based on test statistics in the original studies is 61%, compared to 57% in actual replication studies. The file drawer of unreported non-significant results is much smaller than in social psychology. Local power estimates are just below 50% for z-scores ranging from 2 to 3, but are above 50% for z-scores above 3. This confirms that the optimal prediction rule for cognitive psychology is to bet on studies with a z-score greater than 3 or a p-value less than .001.
It would be desirable to compare this estimate against a larger, representative set of cognitive studies. We are currently working on such a project. Here we can report preliminary results based on studies published in the Journal of Experimental Psychology: Learning, Memory, and Cognition. So far, 153 studies have been coded.
The results based on a representative sample of studies that is three times larger than the sample of studies with actual replication studies are similar to the results for the Curate Science sample. On average, the model predicts 55% successful replications. The rate of successes is below 50% for studies with z-scores less than 3 and above 50% for studies with z-scores above 3.
In this article, we introduced z-curve as a prediction model for replication outcomes in social and cognitive psychology. Z-curve is based on the deterministic relationship between statistical power and replicability. If it were possible to replicate studies exactly, the success rate in a set of exact replication studies would equal the mean power of the original studies. Z-curve is able to estimate mean power based on the reported test statistics in original studies (Brunner & Schimmack, 2018). We found that z-curve overestimates success rates in social psychology (35% vs. 13%), whereas predictions for cognitive psychology were close to the actual success rate (61% vs. 57%). We also introduced local power and provided mean power estimates for different levels of evidence. For social psychology, the success rates were above 50% only for z-scores greater than 4.5; z-curve produced estimates of mean power greater than 50% for z-scores greater than 4. For cognitive psychology, mean success rates were above 50% starting with z-scores of 3, and z-curve also obtained mean power estimates greater than 50% for z-scores greater than 3. Similar results were obtained in two independent and representative samples of studies in social and cognitive psychology. Overall, these results suggest that z-curve is a valuable tool to examine the replicability of published results on the basis of the published test statistics. The statistical approach of examining replicability complements attempts to examine the replicability of published results with actual replication studies, but it has several advantages. The main advantage is that it is much easier to examine the replicability of a large and representative set of studies. As the Curate Science database reveals, replication attempts have so far focused on social psychological studies that are easy to carry out. Hardly anything is known about the replicability of research in other areas of psychology.
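The core identity behind z-curve, namely that the success rate of exact replications equals the mean power of the original significant studies, can be illustrated with a small simulation. All quantities here (the distribution of true effects, the sample size) are arbitrary assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate a literature with heterogeneous true effects and selection for significance.
n_studies, n_per_group = 100_000, 20
d_true = rng.normal(0.3, 0.2, n_studies)       # assumed distribution of true effects
se = np.sqrt(2 / n_per_group)                  # sampling error of d, two-group design
z_obs = rng.normal(d_true / se, 1.0)           # observed z-scores
published = z_obs > 1.96                       # only significant results are published

# Mean power of the published studies (each study's chance of a significant result).
power = norm.sf(1.96 - d_true / se)
mean_power = power[published].mean()

# Success rate of exact replications (same true effect, same sample size).
z_rep = rng.normal(d_true[published] / se, 1.0)
success_rate = (z_rep > 1.96).mean()

print(round(mean_power, 3), round(success_rate, 3))  # the two values agree closely
```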
Z-curve can be applied to a representative sample of studies across all disciplines in psychology to estimate the replicability in psychology. It can also be used to compare disciplines or to examine changes in replicability over time. Our results also have implications for some hotly debated issues in meta-psychology.
What explains replication failures in social psychology?
The success rate of replication attempts in social psychology is abysmal. Social psychologists have largely ignored this evidence and blamed problems with the replication studies for this unflattering outcome. For example, Gilovich et al. (2019) write in a textbook for undergraduate students that many of the attempted replication studies in the OSC (2015) project were so poorly designed that “they weren’t replications at all.” Another criticism was that the replication studies were inconclusive because they were underpowered (Gilbert et al., 2016). The present results show that the replication studies were underpowered because they replicated underpowered original studies. Moreover, the results show that poorly designed replication studies may account for the discrepancy between 13% and 38%, but not for the discrepancy between the 100% success rate of published studies and the 38% mean power to produce these successes. This discrepancy reveals that the high success rate in social psychology is only obtained with the use of questionable research practices that inflate the type-I error risk. As a result, the implicit claim that at most 5% of published results are false positives is misleading (Sterling, 1959). The actual risk of a false positive result is much higher than 5%.
The ability to estimate replicability based on original results makes it possible to resolve controversies about replication failures. If studies have low power, replication failures are to be expected, even if the replication study is a perfect replica of the original study. However, in social psychology it is often difficult to replicate studies exactly. Thus, the probability that an actual replication study is successful is even lower. As a result, it is paramount for social psychologists to increase the power of their studies to have a realistic chance of publishing results that generalize across different experimental contexts and populations.
The positive contribution of z-curve is the ability to isolate subsets of studies that produce replicable results. One way to do so is to focus on the strength of evidence. Questionable research practices are more likely to produce just significant results with z-scores less than 3. As a result, studies with stronger evidence are more likely to replicate. Based on z-curve analysis, studies with z-scores greater than 3.5 or 4 have a mean power of at least 50% and are unlikely to be false positives. Future research should build on these studies to develop an empirical foundation for social psychological theories.
The Alpha Wars
Benjamin and co-authors (2018) argued that psychological science is in a crisis because the standard criterion for rejecting the null-hypothesis, alpha = .05, is too liberal. They propose alpha = .005 as a better criterion. The main problem with their critique is its Fisherian focus on type-I errors. A balanced assessment of research practices needs to take type-II errors into account. Type-II errors are often ignored because they depend on the unknown population effect sizes. Z-curve overcomes this problem by providing estimates of mean power without requiring knowledge about the distribution of population effect sizes.
The results presented here suggest that the biggest problem in psychological science is an unacceptably high type-II error rate rather than a liberal standard for type-I errors. With 40% power in social psychology, the average type-II error rate is 60% (beta = 1 - power). This means the risk of a type-II error is .60/.05 = 12 times greater than the risk of a type-I error. Cohen (1988) suggested that a ratio of 4:1 (20% type-II error vs. 5% type-I error) is reasonable. Accordingly, social psychologists would have to double power or increase alpha from .05 to .15. Lowering alpha to .005 would create an imbalanced ratio of 120 to 1 between type-II and type-I errors. Given the current level of resources, lowering type-I errors will only increase the risk of type-II errors.
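The error-ratio arithmetic in this paragraph is simple enough to verify in a few lines; this is a sketch of the back-of-envelope calculation using the 40% power figure for social psychology.

```python
alpha = 0.05
power = 0.40
beta = 1 - power                        # type-II error rate (60%)

ratio = beta / alpha                    # type-II risk relative to type-I risk
alpha_for_4_to_1 = beta / 4             # alpha that restores Cohen's 4:1 ratio
ratio_at_005 = beta / 0.005             # imbalance if alpha is lowered to .005

print(round(ratio), round(alpha_for_4_to_1, 2), round(ratio_at_005))  # 12 0.15 120
```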
Another problem of Benjamin et al.’s recommendation is that the authors ignore the influence of questionable research practices. Questionable research practices inflate the type-I error risk. Thus, published results that claim p < .05 are not justified as long as questionable research practices are used. Cracking down on the use of questionable research practices and increasing power would yield fewer just significant results that would have a higher probability of successful replication.
Cognitive psychology shows that it is unnecessary to lower alpha because studies have higher power. As a result, there are fewer just significant results and more of these were obtained with relatively high power. Cognitive psychology has not encountered stunning replication failures like social psychology. Lowering alpha for cognitive psychology would be a waste of resources. The present results suggest that cognitive psychology could benefit from doubling sample sizes in within-subject, repeated-measurement studies from 25 to 50 to increase power to 80%. This alone would produce just significant results at alpha = .05 with good replication rates.
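The claim that doubling n from 25 to 50 lifts power to roughly 80% can be checked for a within-subject (paired) t-test. The effect size dz = 0.4 used here is an illustrative assumption; the text does not specify one.

```python
import numpy as np
from scipy.stats import nct, t as t_dist

def paired_t_power(dz: float, n: int, alpha: float = 0.05) -> float:
    """Power of a two-sided one-sample/paired t-test for standardized effect dz."""
    df = n - 1
    ncp = dz * np.sqrt(n)                       # noncentrality parameter
    t_crit = t_dist.ppf(1 - alpha / 2, df)      # two-sided critical value
    return nct.sf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)

# Assuming dz = 0.4: n = 25 gives roughly 50% power, n = 50 roughly 80%.
print(round(paired_t_power(0.4, 25), 2), round(paired_t_power(0.4, 50), 2))
```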
Benjamin et al.’s article is a recommendation for future studies. They do not discuss how researchers should interpret already published results. One could interpret their article as suggesting that alpha = .005 could also be applied to published articles. However, we think a better approach is to estimate replicability and to base decisions about the strength of evidence on the type-II error risk rather than the type-I error risk. Studies with mean power of 50% or more are unlikely to contain many false positive results and are likely to replicate in future studies. Z-curve can be used to find studies with mean power that exceeds a minimum standard of 50% or meets a higher standard such as Cohen’s 80% criterion. Using this criterion, we would only consider studies in social psychology with z-scores greater than 4 as significant results, while the standard for cognitive psychology is a z-score greater than 3. Future research can develop standards for other disciplines or areas of research.
We demonstrated how z-curve can be used to predict outcomes of replication studies and how this information can be used to develop rational decision rules that distinguish studies that are likely to replicate from those that are not. The results show that social psychology has a replication crisis because many results were obtained with low statistical power and the help of questionable research practices. In contrast, cognitive psychology is faring relatively well, although a success rate that is just above 50% is not ideal. Cognitive psychologists would also benefit from increasing power in their studies to publish more replicable results. We believe that z-curve is a promising tool to examine the replicability of other disciplines in psychology and other sciences that use inferential statistics. Z-curve can also be used to evaluate the effectiveness of open science initiatives to improve psychological science. In short, z-curve is awesome; at least, we think so.
Humanistic psychologists have a positive image of human nature: given the right environment, humans will act in the interest of the greater good. Similarly, academia was founded on the idealistic notion of a shared understanding of the world based on empirical facts. Prominent representatives of psychological science still present this naive image of science.
“Our field has always encouraged — required, really — peer critiques” (Susan T. Fiske, 2016).
The notion of peer criticism is naive because scientific peers are both active players and referees. If you don’t think that the World Cup final could be refereed by the players, you should also not believe that scientists can be objective when they have to critique their own science. This is not news to social psychologists, who teach about motivated biases in their classes, but suddenly these rules of human behavior don’t apply to social psychologists as if they were meta-humans.
Should active researchers write introductory textbooks?
It can be difficult to be objective in the absence of strong empirical evidence. Thus, disagreements among scientists are part of the normal scientific process of searching for a scientific answer to an important question. However, textbooks are supposed to introduce a new generation of students to fundamental facts that serve as the foundation for further discoveries. There is no excuse for self-serving biases in introductory textbooks.
Some textbooks are written by professional textbook writers. However, other textbooks are written by active and often eminent researchers. Everything we know about human behavior predicts that they will be unable to present criticism of their field objectively. And the discussion of the replication crisis in social psychology in Gilovich, Keltner, Chen, and Nisbett (2019) confirms this prediction.
The Replication Crisis in Social Psychology in a Social Psychology Textbook
During the past decade, social psychology has been rocked by scandals ranging from outright fraud to replication failures of some of its most celebrated textbook findings, such as unconscious priming of social behavior (Bargh) and ego-depletion (Baumeister). The low point was the finding that a representative sample of replication studies failed to replicate 75% of published results in social psychology (OSC, 2015).
The forthcoming 5th edition of this social psychology textbook does mention the influential OSC reproducibility project. However, the presentation is severely biased and fails to inform students that many findings in social psychology were obtained with questionable research practices and may not replicate.
How to Whitewash Replication Failures in Social Psychology
The textbook starts with the observation that replication failures generate controversy, but ends with the optimistic conclusion that scientists then reach a consensus about the reasons why a replication failure occurred.
“These debates usually result in a consensus about whether a particular finding should be accepted or not. In this way, science is self-correcting.”
This rosy picture of science is contradicted by the authors’ own response to the replication failures in the Open Science Reproducibility Project. There is no consensus about the outcome of the reproducibility project, and social psychologists’ views are very different from outsiders’ interpretations of these results.
“In 2015, Brian Nosek and dozens of other psychologists published an article in the journal Science reporting on attempts to replicate [attempts to replicate!!!] 100 psychological studies (Open Science Collaboration, 2015). They found that depending on the criterion used, 36-47 percent of the original studies were successfully replicated.”
They leave out that the article also reported different success rates for social psychology, the focus of the textbook, and cognitive psychology. The success rate for social psychology was only 25%, but this also included some personality studies. The success rate for the classic between-subject experiment in social psychology was only 4%! This information is missing, although (or because?) it would make undergraduate students wonder about the robustness of the empirical studies in their textbook.
Next students are informed that they should not trust the results of this study.
“The findings received heavy criticism from some quarters (Gilbert, King, Pettigrew, & Wilson, 2016).”
No mention is made of who these critics are or that Wilson is a student of textbook author Nisbett.
“The most prominent concern was that many of the attempted replications utilized procedures that differed substantially from the original studies and thus weren’t replications at all.”
What is “many” and what is a “substantial” difference? Students are basically told that the replication project was carried out in the most incompetent way (replication studies weren’t replications) and that the editor of the most prominent journal for all sciences didn’t realize this. This is the way social psychologists often create their facts: with a stroke of a pen and without empirical evidence to back it up.
Students are then told that other studies have produced much higher estimates of replicability that reassure students that textbook findings are credible.
“Other systematic efforts to reproduce the results of findings reported in behavioral science journals have yielded higher replication rates, on the order of 75-85 percent (Camerer et al., 2016; Klein et al., 2014).”
I have been following the replication crisis since its beginning and I have never seen success rates of this magnitude. Thus, I fact-checked these estimates, which are presented to undergraduate students as the “real” replication rates of psychology, presumably including social psychology.
The Camerer et al. (2016) article is titled “Evaluating replicability of laboratory experiments in economics.” Economics! Even if the success rate in this article were 75%, it would have no relevance for the majority of studies reported in a social psychology textbook. Maybe telling students that replicability in economics is much better than in psychology would make some students switch to economics.
The Klein et al. (2014) article did report on the replicability of studies in psychology. However, it only replicated 13 studies and the studies were not a representative sample of studies, which makes it impossible to generalize the success rate to a population of studies like the studies in a social psychology textbook.
We all know the saying: there are lies, damn lies, and statistics. The 75-85% success rate in “good” replication studies is a damn lie with statistics. It misrepresents the extent of the replication crisis in social psychology. An analysis of a representative set of hundreds of original results leads to the conclusion that no more than 50% of exact replication studies would reproduce a significant result, even if the studies could be replicated exactly (How replicable is psychological science). Telling students otherwise is misleading.
The textbook authors do acknowledge that failed replication studies can sometimes reveal shoddy work by original researchers.
“In those cases, investigators who report failed attempts to replicate do a great service to everyone for setting the record straight.”
They also note that social psychologists are slowly changing research practices to reduce the number of significant results that are obtained with “shoddy practices” that do not replicate.
“Foremost among these changes has been an increase in the sample sizes generally used in research.”
One wonders why these changes are needed if success rates are already 75% or higher.
The discussion of the replication crisis ends with the reassurance that most of the results reported in the textbook are probably credible and that evidence is presented objectively.
“In this textbook we have tried to be scrupulous about noting when the evidence about a given point is mixed.”
How credible is this claim when the authors misrepresent the OSC (2015) article as a collection of amateur studies that can be ignored and then cite a study of economics to claim that social psychology is replicable?
Moreover, the authors have a conflict of interest because they have a monetary incentive to present social psychology in the most positive light so that students take social psychology courses and buy social psychology textbooks.
A more rigorous audit of this and other social psychology textbooks by independent investigators is needed because we cannot trust social psychologists to be objective in the assessment of their field. After all, they are human.
The crisis of confidence in psychological science started with Bem’s (2011) article in the Journal of Personality and Social Psychology. The article made the incredible claim that extraverts can foresee future random events (e.g., the location of an erotic picture) above chance.
Rather than demonstrating some superhuman abilities, the article revealed major problems in the way psychologists conduct research and report their results.
Wagenmakers and colleagues were given the opportunity to express their concerns in a commentary that was published along with the original article, which is highly unusual (Wagenmakers et al., 2011).
Wagenmakers used this opportunity to attribute the problems in psychological science to the use of p-values. The claim that the replication crisis in psychology follows from the use of p-values has been repeated several times, most recently in a special issue that promotes Bayes Factors as an alternative statistical approach.
“the edifice of NHST appears to show subtle signs of decay. This is arguably due to the recent trials and tribulations collectively known as the “crisis of confidence” in psychological research, and indeed, in empirical research more generally (e.g., Begley & Ellis, 2012; Button et al., 2013; Ioannidis, 2005; John, Loewenstein, & Prelec, 2012; Nosek & Bar-Anan, 2012; Nosek, Spies, & Motyl, 2012; Pashler & Wagenmakers, 2012; Simmons, Nelson, & Simonsohn, 2011). This crisis of confidence has stimulated a methodological reorientation away from the current practice of p value NHST” (Wagenmakers et al., 2018, Psychonomic Bulletin & Review).
In short, Bem used NHST and p-values, Bem’s claims are false, therefore NHST and p-values are false.
However, it does not follow from Bem’s use of p-values that NHST is flawed or caused the replication crisis in experimental social psychology, just as it does not follow from the fact that Bem is a man and that his claims were false that all claims made by men are false.
The key problem with Bem’s article is that he used questionable and, some would argue, fraudulent research practices to produce incredible p-values (Francis, 2012; Schimmack, 2012). For example, he combined several smaller studies with promising trends into a single dataset to report a p-value less than .05 (Schimmack, 2018). This highly problematic practice violates the assumption that the observations in a dataset are randomly sampled. It is not clear how any statistical method could produce valid results when its basic assumptions are violated.
So, we have two competing accounts of the replication crisis in psychology. Wagenmakers argues that even proper use of NHST produces questionable results that are difficult to replicate. In contrast, I argue that proper use of NHST produces credible p-values that can be replicated and only questionable research practices and abuse of NHST produce incredible p-values that cannot be replicated.
Who is right?
The answer is simple. Wagenmakers et al. (2011) engaged in a questionable research practice to demonstrate the superiority of Bayes-Factors when they examined Bem’s results with Bayesian statistics. They analyzed each study individually to show that each study alone produced fairly weak evidence for extraverts’ miraculous extrasensory abilities. However, they did not report the results of a meta-analysis of all studies.
The weak evidence in each single study is not important because JPSP would not have accepted Bem’s manuscript for publication, if he had presented a significant result in a single study. In 2011, social psychologists were well aware that a single p-value less than .05 provides only suggestive evidence and does not warrant publication in a top journal (Kerr, 1998). Most articles in JPSP report four or more studies. Bem reported 9 studies. Thus, the crucial statistical question is how strong the combined evidence of all 9 studies is. This question is best addressed by means of a meta-analysis of the evidence. Wagenmakers et al. (2011) are well-aware of this fact, but avoided reporting the results of a Bayesian meta-analysis.
In this article, we have assessed the evidential impact of Bem’s (2011) experiments in isolation. It is certainly possible to combine the information across experiments, for instance by means of a meta-analysis (Storm, Tressoldi, & Di Risio, 2010; Utts, 1991). We are ambivalent about the merits of meta-analyses in the context of psi: One may obtain a significant result by combining the data from many experiments, but this may simply reflect the fact that some proportion of these experiments suffer from experimenter bias and excess exploration (Wagenmakers et al., 2011)
I believe the real reason why they did not report the results of a Bayesian meta-analysis is that it would have shown that p-values and Bayes-Factors lead to the same inference: Bem’s data are inconsistent with the null-hypothesis. After all, Bayes-Factors and p-values are mere transformations of a test statistic into a different metric. Holding sample size constant, p-values and Bayes-Factors in favor of the null-hypothesis decrease as the test statistic (e.g., a t-value) increases. This is shown below with Bem’s data.
Bayesian Meta-Analysis of Bem
Bem reported a mean effect size of d = .22 based on 9 studies with a total of 1170 participants. A better measure of effect size is the weighted average, which is slightly smaller, d = .197. The effect size can be tested against an expected value of 0 (no ESP) with a one-sample t-test with a sampling error of 1 / sqrt(1170) = 0.029. The t-value is .197/.029 = 6.73. The corresponding z-score is 6.66 (cf. Bem, Utts, & Johnson, 2011).
The p-value for t(1169) = 6.73 is 2.65e-11 or 0.00000000003.
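These numbers are easy to reproduce; this is a sketch of the arithmetic in the two preceding paragraphs, and small differences in the reported p-value can arise from rounding of d and from using the normal rather than the t distribution.

```python
import math
from scipy.stats import t as t_dist

d = 0.197                          # weighted mean effect size
N = 1170                           # total sample size across Bem's 9 studies
se = 1 / math.sqrt(N)              # sampling error, ~0.029
t_val = d / se                     # ~6.7
p = 2 * t_dist.sf(t_val, N - 1)    # two-sided p, on the order of 1e-11

print(round(se, 3), round(t_val, 2), p)
```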
I used Rouder’s online app to compute the default Bayes-Factor.
Converted into a BF in favor of the null-hypothesis (BF01), which is more comparable to a p-value expressing evidence against the null-hypothesis, the result has eight zeros after the decimal point: BF01 = 1/139,075,597 = 7.190334e-09, or 0.00000000719.
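For readers who want to reproduce this number without the web app, the default (JZS) Bayes factor can be computed by numerical integration of the formula in Rouder et al. (2009). The prior scale r = 1 used here is an assumption matching the early default of the app; with different settings the exact value will differ somewhat, but the order of magnitude does not.

```python
import math
from scipy import integrate

def jzs_bf01(t: float, n: int, r: float = 1.0) -> float:
    """JZS default Bayes factor in favor of the null for a one-sample t-test
    (Rouder et al., 2009), with Cauchy prior scale r on the effect size."""
    v = n - 1
    null_marginal = (1 + t ** 2 / v) ** (-(v + 1) / 2)

    def integrand(g):
        if g < 1e-8:                  # integrand underflows to zero near g = 0
            return 0.0
        a = 1 + n * g * r ** 2
        return (a ** -0.5
                * (1 + t ** 2 / (a * v)) ** (-(v + 1) / 2)
                * (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g)))

    alt_marginal, _ = integrate.quad(integrand, 0, math.inf)
    return null_marginal / alt_marginal

bf01 = jzs_bf01(6.73, 1170)
print(bf01)  # on the order of 1e-9, in line with BF01 = 7.19e-09 in the text
```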
Given the data, it is reasonable to reject the null-hypothesis using p-values or Bayes-Factors. Thus, the problem is the high t-value and not the transformation of the t-value into a p-value.
The problem with the t-value becomes clear when we consider that particle physicists (a.k.a. real scientists) use z-scores greater than 5 to rule out chance findings. Thus, Bem’s evidence meets the same strict criterion that was used to celebrate the discovery of the Higgs boson in physics (cf. Schimmack, 2012).
The problem with Bem’s article is not that he used p-values. He could also have used Bayesian statistics to support his incredible claims. The problem is that Bem engaged in highly questionable research practices and was not transparent in reporting these practices. Holding p-values accountable for his behavior would be like holding cars responsible for drunk drivers.
Wagenmakers’ railing against p-values is akin to Don Quixote’s railing against windmills. It is not uncommon for a group of scientists to vehemently push an agenda. In fact, the incentive structure in science seems to promote self-promoters. However, it is disappointing that a peer-reviewed journal uncritically accepted the questionable claim that p-values caused the replication crisis. There is ample evidence that questionable research practices are being used to produce too many significant results (John et al., 2012; Schimmack, 2012). Disregarding this evidence to make false, self-serving attributions is just as questionable as other questionable research practices that impede scientific progress.
The biggest danger of Wagenmakers and colleagues’ agenda is that it distracts from the key problems that need to be fixed. Curbing the use of questionable research practices and increasing the statistical power of studies to produce strong evidence (i.e., high t-values) is paramount to improving psychological science. However, there is little evidence that psychologists have changed their practices since 2011, with the exception of some social psychologists (Schimmack, 2017).
Thus, it is important to realize that Wagenmakers’ attribution of the replication crisis to the use of NHST is a fundamental attribution error in meta-psychology that is rooted in a motivated bias to find some useful application for Bayes-Factors. Contrary to Wagenmakers et al.’s claim that “Psychologists need to change the way they analyze their data” they actually need to change the way they obtain their data. With good data, the differences between p-values and Bayes-Factors are of minor importance.
In 2002, Daniel Kahneman was awarded the Nobel Prize for Economics. He received the award for his groundbreaking work on human irrationality in collaboration with Amos Tversky in the 1970s.
In 1999, Daniel Kahneman was the lead editor of the book “Well-Being: The foundations of Hedonic Psychology.” Subsequently, Daniel Kahneman conducted several influential studies on well-being.
The aim of the book was to draw attention to hedonic or affective experiences as an important, if not the sole, contributor to human happiness. He called for a return to Bentham’s definition of a good life as a life filled with pleasure and devoid of pain (a.k.a. displeasure).
The book was co-edited by Norbert Schwarz and Ed Diener, who both contributed chapters to the book. These chapters make contradictory claims about the usefulness of life-satisfaction judgments as an alternative measure of a good life.
Ed Diener is famous for his conception of wellbeing in terms of a positive hedonic balance (lots of pleasure, little pain) and high life-satisfaction. In contrast, Schwarz is known as a critic of life-satisfaction judgments. In fact, Schwarz and Strack’s contribution to the book ended with the claim that “most readers have probably concluded that there is little to be learned from self-reports of global well-being” (p. 80).
To a large part, Schwarz and Strack’s pessimistic view is based on their own studies that seemed to show that life-satisfaction judgments are influenced by transient factors such as current mood or priming effects.
“the obtained reports of SWB are subject to pronounced question-order effects because the content of preceding questions influences the temporary accessibility of relevant information” (Schwarz & Strack, p. 79).
There is only one problem with this claim; it is only true for a few studies conducted by Schwarz and Strack. Studies by other researchers have produced much weaker and often not statistically reliable context effects (see Schimmack & Oishi, 2005, for a meta-analysis). In fact, a recent attempt to replicate Schwarz and Strack’s results in a large sample of over 7,000 participants failed to show the effect and even found a small, but statistically significant effect in the opposite direction (ManyLabs2).
Figure 1 summarizes the results of the meta-analysis from Schimmack and Oishi (2005), enhanced by new developments in meta-analysis. The blue line in the graph regresses effect sizes (converted into Fisher-z scores) onto sampling error (1/sqrt(N − 3)). Publication bias and other statistical tricks produce a correlation between effect size and sampling error. The slope of the blue line shows clear evidence of publication bias, z = 3.85, p = .0001. The intercept (the predicted effect size when sampling error is zero) can be interpreted as a bias-corrected estimate of the real effect size. The value is close to zero and not statistically significant, z = 1.70, p = .088. The green line shows the effect size in the replication study, which was also close to zero, but statistically significant in the opposite direction. The vertical red line shows the average effect size without controlling for publication bias. We see that this naive meta-analysis overestimates the effect size and falsely suggests that item-order effects are a robust phenomenon. Finally, the graph highlights the three results from studies by Strack and Schwarz. These results are clear outliers and fall even above the biased blue regression line. The biggest outlier was obtained by Strack et al. (1991), and this is the finding that is featured in Kahneman’s book, even though it is not reproducible and clearly inflated by sampling error. Interestingly, sampling error is also called noise, and Kahneman wrote a whole new book about the problems of noise in human judgments.
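The logic of the blue regression line can be illustrated with a simulation: when the true effect is zero and only significant results are “published,” effect sizes become correlated with sampling error, and the intercept at zero sampling error recovers the true (null) effect. All quantities below are simulated; none come from the actual meta-analysis.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate many attempted studies with a true effect of zero.
n_attempts = 20_000
N = rng.integers(20, 200, n_attempts)            # hypothetical sample sizes
se = 1 / np.sqrt(N - 3)                          # sampling error of Fisher-z
effect = rng.normal(0.0, se)                     # observed Fisher-z effect sizes
published = effect / se > 1.96                   # selection for significance

# Regress published effect sizes on sampling error (as in the figure).
slope, intercept = np.polyfit(se[published], effect[published], 1)
print(round(slope, 2), round(intercept, 3))      # positive slope, intercept near zero
```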
While the figure is new, the findings were published in 2005, several years before Kahneman wrote his book “Thinking, Fast and Slow.” He was simply too lazy to use the slow process of a thorough literature search to write about life-satisfaction judgments. Instead, he relied on a fast memory search that retrieved a study by his buddy. Thus, while the chapter is a good example of biases that result from fast information processing, it is not a good chapter to tell readers about life-satisfaction judgments.
To be fair, Kahneman did inform his readers that he is biased against life-satisfaction judgments: “Having come to the topic of well-being from the study of the mistaken memories of colonoscopies and painfully cold hands, I was naturally suspicious of global satisfaction with life as a valid measure of well-being” (Kindle Locations 6796-6798). Later on, he even admits to his mistake: “Life satisfaction is not a flawed measure of their experienced well-being, as I thought some years ago. It is something else entirely” (Kindle Locations 6911-6912).
However, insight into his bias was not enough to motivate him to search for evidence that might contradict his bias. This is known as confirmation bias. Even ideal prototypes of scientists like Nobel Laureates are not immune to this fallacy. Thus, this example shows that we cannot rely on simple cues like “professor at Ivy League,” “respected scientist,” or “published in prestigious journals” to trust scientific claims. Scientific claims need to be backed up by credible evidence. Unfortunately, social psychology has produced a literature that is not trustworthy because studies were only published if they confirmed theories. It will take time to correct these mistakes of the past by carefully controlling for publication bias in meta-analyses and by conducting pre-registered studies that are published even if they falsify theoretical predictions. Until then, readers should be skeptical about claims based on psychological ‘science,’ even if they are made by a Nobel Laureate.
Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated. However, the replicability of published results in social psychology is doubtful. Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results are, if the studies were replicated exactly. In a replicability audit, I am applying z-curve to the most cited articles of psychologists to estimate the replicability of their studies.
Susan T. Fiske
Susan T. Fiske is an eminent social psychologist (H-Index in WebofScience = 66). She also is a prominent figure in meta-psychology. Her most important contribution to meta-psychology was a guest column in the APS Observer (Fiske, 2016), titled “A Call to Change Science’s Culture of Shaming”: “Our field has always encouraged — required, really — peer critiques. But the new media (e.g., blogs, Twitter, Facebook) can encourage a certain amount of uncurated, unfiltered denigration. In the most extreme examples, individuals are finding their research programs, their careers, and their personal integrity under attack.”
In her article, she refers to researchers who examine the replicability of published results as “self-appointed data police,” which is relatively mild in comparison to the term “method terrorist” that she used in a leaked draft of her article.
She accuses meta-psychologists of speculating about the motives of researchers who use questionable research practices, but she never examines the motives of meta-psychologists. Why are they devoting their time and resources to meta-psychology and publishing their results on social media rather than advancing their careers by publishing original research in peer-reviewed journals? One possible reason is that meta-psychologists recognize deep and fundamental problems in the way social psychologists conduct research and are trying to improve it.
Instead, Fiske denies that psychological science has a problem and claims that the Association for Psychological Science (APS) is a leader in promoting good scientific practices.
“What’s more, APS has been a leader in encouraging robust methods: transparency, replication, power analysis, effect-size reporting, and data access.”
She also dismisses meta-psychological criticism of social psychology as unfounded.
“But some critics do engage in public shaming and blaming, often implying dishonesty on the part of the target and other innuendo based on unchecked assumptions.”
In this blog post, I am applying z-curve to Susan T. Fiske’s results to examine whether she used questionable research practices to report mostly significant results that support her predictions, and to examine how replicable her published results are. The z-curve method makes assumptions that have been validated in simulation studies (Brunner & Schimmack, 2018).
I used WebofScience to identify the most cited articles by Susan T. Fiske (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 41 empirical articles (H-Index = 41). The 41 articles reported 76 studies (average 1.9 studies per article). The total number of participants was 21,298 with a median of 54 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value, and the p-value was then converted into a z-score. Six studies did not test a hypothesis or predicted a non-significant result, leaving 70 hypothesis tests. The z-scores were submitted to a z-curve analysis to estimate the mean power of the 65 results that were significant at p < .05 (two-tailed). The remaining 5 results were interpreted as evidence with lower standards of significance. Thus, the success rate for the 70 reported hypothesis tests was 100%.
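The conversion steps in the coding pipeline just described can be sketched as follows; the t-value and degrees of freedom below are hypothetical values chosen for illustration:

```python
from scipy import stats

# A reported test statistic (here a t-test) is converted into an
# exact two-tailed p-value, and the p-value into a z-score.
t_value, df = 2.50, 52  # hypothetical focal hypothesis test

p_value = 2 * stats.t.sf(abs(t_value), df)  # exact two-tailed p-value
z_score = stats.norm.isf(p_value / 2)       # equivalent two-tailed z-score

print(f"p = {p_value:.4f}, z = {z_score:.2f}")
```

The resulting z-scores are the input for the z-curve analysis; values above 1.96 correspond to significance at p < .05 (two-tailed).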
The z-curve estimate of replicability is 59% with a 95%CI ranging from 42% to 77%. The complementary interpretation of this result is that the actual type-II error rate is 41% compared to the 0% of non-significant results reported in the articles.
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve is an estimate of the file drawer of studies that need to be conducted to achieve 100% successes with 59% average power. The ratio of the area of non-significant results to the area of all significant results (including z-scores greater than 6) is called the File Drawer Ratio. Although this is just a projection, and other questionable practices may have been used, the file drawer ratio of 1.63 and the figure make it clear that the reported results were selected to support theoretical predictions.
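Under the simplifying assumption that all studies have the same power, the expected size of the file drawer can be sketched with a one-line function. Z-curve estimates the ratio from the fitted density, which allows for heterogeneous power, so its value of 1.63 differs from this back-of-envelope number:

```python
def file_drawer_ratio(power: float) -> float:
    """Expected number of non-significant (unpublished) studies per
    significant (published) study, assuming homogeneous power."""
    return (1 - power) / power

# With the 59% mean power estimated above:
print(f"{file_drawer_ratio(0.59):.2f}")  # ≈ 0.69 unpublished per published
```

The gap between 0.69 (homogeneous power) and 1.63 (z-curve) arises because low-powered studies contribute disproportionately to the file drawer.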
Z-curve is under development and offers additional information beyond the replicability of significant results. One new feature is an estimate of the maximum number of false positive results. The maximum percentage of false positive results is estimated to be 30% (95%CI = 10% to 60%). This estimate means that a z-curve with a fixed percentage of 30% false positives fits the data nearly as well as a z-curve without restrictions on the percentage of false positives. Given the relatively small number of studies, the estimate is not very precise and the upper limit goes as high as 60%. It is unlikely that there are 60% false positives, but the point of empirical research is to reduce the risk of false positives to an acceptable level of 5%. Thus, the actual risk is unacceptably high.
A 59% replicability estimate is actually very high for a social psychologist. However, it would be wrong to apply this estimate to all studies. The estimate is an average, and replicability varies as a function of the strength of evidence against the null-hypothesis (the magnitude of a z-score). This is shown with the replicability estimates for segments of z-scores below the x-axis. For just significant results with z-scores from 2 to 2.5 (~ p < .05 & p > .01), replicability is only 33%. This means these results are less likely to replicate, and results of actual replication studies show very low success rates for studies with just significant results. Without selection bias, significant results have an average replicability greater than 50%. However, with selection for significance, this is no longer the case. For Susan T. Fiske’s data, the criterion value to achieve 50% average replicability is a z-score greater than 3 (as opposed to 1.96 without selection). 56 reported results meet this criterion. This is a high percentage of credible results for a social psychologist (see links to other replicability audits at the end of this post).
Although Susan T. Fiske’s work has not been the target of criticism by meta-psychologists, she has been a vocal critic of meta-psychologists. This audit shows that her work is more replicable than the work of other empirical social psychologists. One explanation for Fiske’s defense of social psychology could be the false consensus effect, which is a replicable social psychological phenomenon. In the absence of hard evidence, humans tend to believe that others are more similar to them than they actually are. Maybe Susan Fiske assumed that social psychologists who have been criticized for their research practices were conducting research like herself. A comparison of different audits (see below) shows that this is not the case. I wonder what Fiske thinks about the research practices of her colleagues that produce replicability estimates well below 50%. I believe that a key contributor to the conflict between experimental social psychologists and meta-psychologists is the lack of credible information about the extent of the crisis. Actual replication studies and replicability reports provide much needed objective facts. The question is whether social psychologists like Susan Fiske are willing to engage in a scientific discussion about these facts or whether they continue to ignore these facts to maintain the positive illusion that social psychological results can be trusted.
It is nearly certain that I made some mistakes in the coding of Susan T. Fiske’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit. The data are openly available and the z-curve code is also openly available. Thus, this replicability audit is fully transparent and open to revision.
Original Post: 11/26/2018 Modification: 4/15/2021 The z-curve analysis was updated using the latest version of z-curve
“Trust is good, but control is better”
I asked Fritz Strack to comment on this post, but he did not respond to my request.
Fritz Strack is an eminent social psychologist (H-Index in WebofScience = 51).
Fritz Strack also made two contributions to meta-psychology.
First, he volunteered his facial-feedback study for a registered replication report; a major effort to replicate a published result across many labs. The study failed to replicate the original finding. In response, Fritz Strack argued that the replication study introduced cameras as a confound or that the replication team actively tried to find no effect (reverse p-hacking).
Second, Strack co-authored an article that tried to explain replication failures as a result of problems with direct replication studies (Strack & Stroebe, 2014). This is a concern, when replicability is examined with actual replication studies. However, this concern does not apply when replicability is examined on the basis of test statistics published in original articles. Using z-curve, we can estimate how replicable these studies are, if they could be replicated exactly, even if this is not possible.
Given Fritz Strack’s skepticism about the value of actual replication studies, he may be particularly interested in estimates based on his own published results.
I used WebofScience to identify the most cited articles by Fritz Strack (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 42 empirical articles (H-Index = 42). The 42 articles reported 117 studies (average 2.8 studies per article). The total number of participants was 8,029 with a median of 55 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value, and the p-value was then converted into a z-score. Three studies did not test a hypothesis or predicted a non-significant result, leaving 114 hypothesis tests. The z-scores were submitted to a z-curve analysis to estimate the mean power of the 103 results that were significant at p < .05 (two-tailed). The remaining 11 results were interpreted as evidence with lower standards of significance. Thus, the success rate for the 114 reported hypothesis tests was 100%.
The z-curve estimate of replicability is 38% with a 95%CI ranging from 23% to 54%. The complementary interpretation of this result is that the actual type-II error rate is 62% compared to the 0% failure rate in the published articles.
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve is an estimate of the file drawer of studies that need to be conducted to achieve 100% successes with 38% average power. Although this is just a projection, the figure makes it clear that Strack and collaborators used questionable research practices to report only significant results.
Z-curve 2.0 also estimates the actual discovery rate in a laboratory. The EDR estimate is 24%, with a 95%CI ranging from 5% to 44%. The actual observed discovery rate of 100% is well outside this confidence interval. Thus, there is strong evidence that questionable research practices had to be used to produce significant results. The estimated discovery rate can also be used to estimate the risk of false positive results (Soric, 1989). With an EDR of 24%, the false positive risk is 16%. This suggests that most of Strack’s results may show the correct sign of an effect, but that the effect sizes are inflated and it is often unclear whether the population effect sizes would have practical significance.
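Soric’s (1989) upper bound mentioned above can be written as a short function. Plugging in the rounded 24% EDR yields about 16.7%, slightly different from the 16% reported in the text, which reflects the unrounded estimate; the second call anticipates the effect of lowering alpha discussed below:

```python
def soric_fdr(edr: float, alpha: float = 0.05) -> float:
    """Maximum false discovery risk given an estimated discovery rate
    (Soric, 1989). Assumes tests of true hypotheses have 100% power,
    so this is an upper bound on the actual false positive risk."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

# With the estimated discovery rate of 24%:
print(f"alpha = .05:  {soric_fdr(0.24):.1%}")         # ~16.7%
print(f"alpha = .005: {soric_fdr(0.24, 0.005):.1%}")  # ~1.6%
```

Lowering alpha from .05 to .005 shrinks the maximum false discovery risk by an order of magnitude, which is the rationale for the stricter criterion recommended below.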
The false discovery risk decreases for more stringent criteria of statistical significance. A reasonable criterion for the false discovery risk is 5%. This criterion can be achieved by lowering alpha to .005. This is in line with suggestions to treat only p-values less than .005 as statistically significant (Benjamin et al., 2017). This leaves 35 significant results.
The analysis of Fritz Strack’s published results provides clear evidence that questionable research practices were used and that published significant results have a higher type-I error risk than 5%. More important, the actual discovery rate is low and implies that 16% of published results could be false positives. This explains some replication failures in large samples for Strack’s item-order and facial feedback studies. I recommend using alpha = .005 to evaluate Strack’s empirical findings. This leaves about a third of his discoveries as statistically significant results.
It is important to emphasize that Fritz Strack and colleagues followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, he did not commit research fraud.
It is nearly certain that I made some mistakes in the coding of Fritz Strack’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit. The data are openly available and the z-curve results can be reproduced with the z-curve package in R. Thus, this replicability audit is fully transparent and open to revision.
Some commentators have asked me to examine myself, and I finally did it. I used the new format of a replicability audit (Baumeister, Wilson). A replicability audit picks the most cited articles until the number of articles matches the number of citations (the H-Index). For each study, the most focal hypothesis test is selected. The test statistic is converted into a p-value and then into a z-score. The z-scores are analysed with z-curve (Brunner & Schimmack, 2018) to estimate replicability.
I used WebofScience to identify my most cited articles (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 27 empirical articles (H-Index = 27). The 27 articles reported 64 studies (average 2.4 studies per article). 45 of the 64 studies reported a hypothesis test. The total number of participants in these 45 studies was 350,148 with a median of 136 participants per statistical test. 42 of the 45 tests were significant with alpha = .05 (two-tailed). The remaining 3 results were interpreted as evidence with lower standards of significance. Thus, the success rate for 45 reported hypothesis tests was 100%.
The z-curve plot shows evidence of a file-drawer. Counting marginally significant results, the success rate is 100%, but power to produce significant results is estimated to be only 71%. Power for the set of significant results (excluding marginally significant ones), is estimated to be 76%. The maximum false discovery rate is estimated to be 5%. Thus, even if results would not replicate with the same sample size, studies with much larger sample sizes are expected to produce a significant result in the same direction.
These results are comparable to the results in cognitive psychology, where replicability estimates are around 80% and the maximum false discovery rate is also low. In contrast, results in experimental social psychology are a lot worse, with replicability estimates below 50%; both in statistical estimates with z-curve and in estimates based on actual replication studies (Open Science Collaboration, 2015). The reason is that experimental social psychologists conducted between-subject experiments with small samples. In contrast, most of my studies are correlational studies with large samples or experiments with within-subject designs and many repeated trials. These studies have high power and tend to replicate fairly well (Open Science Collaboration, 2015).
Actual replication studies have produced many replication failures and created the impression that results published in psychology journals are not credible. This is unfortunate because these replication projects have focused on between-subject paradigms in experimental social psychology. It is misleading to generalize these results to all areas of psychology.
Psychologists who want to demonstrate that their work is replicable do not have to wait for somebody to replicate their study. They can conduct a self-audit using z-curve and demonstrate that their results are different from experimental social psychology. Even just reporting the number of observations rather than number of participants may help to signal that a study had good power to produce a significant result. A within-subject study with 8 participants and 100 repetitions has 800 observations, which is 10 times more than the number of observations in the typical between-subject study with 80 participants and one observation per participant.
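The power advantage of the within-subject design described above can be checked with a small simulation; all parameters (a true effect of 0.3 units, trial-level noise SD of 1, small between-person variability in the effect) are hypothetical assumptions chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical parameters: true effect, trial noise SD, and SD of
# individual differences in the effect (assumed small).
EFFECT, TRIAL_SD, PERSON_SD, SIMS = 0.3, 1.0, 0.05, 2000

def between_subject_power():
    """Two groups of 40, one observation per participant."""
    hits = 0
    for _ in range(SIMS):
        control = rng.normal(0, TRIAL_SD, 40)
        treatment = rng.normal(EFFECT, TRIAL_SD, 40)
        if stats.ttest_ind(treatment, control).pvalue < 0.05:
            hits += 1
    return hits / SIMS

def within_subject_power():
    """8 participants, 100 trials per condition each."""
    hits = 0
    for _ in range(SIMS):
        person_effects = rng.normal(EFFECT, PERSON_SD, 8)
        # each participant's mean difference, averaged over 100 trial pairs
        noise = rng.normal(0, np.sqrt(2) * TRIAL_SD / np.sqrt(100), 8)
        if stats.ttest_1samp(person_effects + noise, 0).pvalue < 0.05:
            hits += 1
    return hits / SIMS

print(f"between (N=80):            {between_subject_power():.2f}")
print(f"within  (N=8, 100 trials): {within_subject_power():.2f}")
```

Under these assumptions, the within-subject design with only 8 participants achieves much higher power than the between-subject design with 80, because averaging over 100 trials nearly eliminates trial-level noise.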