Social psychologists, among others, have misused the scientific method. Rather than using it to separate false from true hypotheses, they have used statistical tests merely to find and report statistically significant results. The main problem with this search for significance is that significant results are not automatically true discoveries. The probability that a selected significant result is a true discovery also depends on the power of statistical tests to detect a true effect. However, social psychologists have ignored power and often selected significant results from studies with low power. In this case, significance is often due more to chance than to a real effect, and the results are difficult to replicate. A shocking finding revealed that less than 25% of results in social psychology could be replicated (OSC, 2015). This finding has been widely cited outside of social psychology, but social psychologists have preferred to ignore the implication that most of their published results may be false (Schimmack, 2020).
Some social psychologists have responded to this replication crisis by increasing power and reporting non-significant results as evidence that effects are small and negligible (e.g., Lai et al., 2014, 2016). However, others continue to use the same old practices. This creates a problem. While the average credibility of social psychology has increased, readers do not know whether they are reading an article that used the scientific method properly or improperly.
One solution to this problem is to examine the strength of the reported statistical results. Strong statistical results are more credible than weak statistical results. Thus, the average strength of the statistical results provides useful information about the credibility of individual articles. I demonstrate this approach with two articles from 2020 in the Attitudes and Social Cognition section of the Journal of Personality and Social Psychology (JPSP-ASC).
Before I examine individual articles, I present results for the entire journal based on automatic extraction of test-statistics for the years 2010 (pre-crisis) and 2020 (post-crisis).
Figure 1 shows the results for 2010. All test-statistics are first converted into p-values and then transformed into absolute z-scores. The higher the z-score, the stronger is the evidence against the null-hypothesis. The figure shows the mode of the distribution of z-scores at a value of 2, which coincides with the criterion for statistical significance (p = .05, two-tailed, z = 1.96). Fitting a model to the distribution of the significant z-scores, we would expect an even higher mode in the region of non-significant results. However, the actual distribution shows a sharp drop in reported z-scores. This pattern shows the influence of selection for significance.
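The transformation described above is easy to reproduce. The sketch below is my own illustration in Python (z-curve itself is distributed as an R package, which is not shown here): a reported two-sided p-value is converted into an absolute z-score via the inverse normal distribution.

```python
from statistics import NormalDist

def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

# p = .05 corresponds to the significance criterion z = 1.96
print(round(p_to_z(0.05), 2))  # → 1.96
```

Test statistics such as t- or F-values are first converted into p-values; the z-scores then put all results on a common scale of evidential strength.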
The amount of publication bias is quantified by a comparison of the observed discovery rate (ODR; the percentage of reported tests with significant results) and the expected discovery rate (EDR; the area of the grey curve for z-scores greater than 1.96). The ODR of 73% is much higher than the EDR of 15%. The fact that the confidence intervals for these two estimates do not overlap shows clear evidence of selection for significance in JPSP-ASC in 2010.
An EDR of 15% also implies that most statistical tests are extremely underpowered. Thus, even if there is an effect, it is unlikely to be significant. More relevant is the replication rate, which is the average power of results that were significant. As power determines the outcome of exact replication studies, the replication rate of 60% implies that 60% of published results are expected to be replicable in exact replication studies. However, observed effect sizes are expected to shrink and it is unclear whether the actual effect sizes are practically meaningful or would exceed the typical level of a small effect size (i.e., 0.2 standard deviations or 1% explained variance).
In short, Figure 1 visualizes incorrect use of the scientific method that capitalizes more on chance than on actual effects.
The good news is that research practices in social psychology have changed, as seen in Figure 2.
First, reporting of results is much less deceptive. The observed discovery rate of 73% is close to the estimated discovery rate of 72%. However, visual inspection of the two curves shows a small dip for results that are marginally significant (z = 1.5 to 2) and a slight excess for just significant results (z = 2 to 2.2). Thus, some selection may still happen in some articles.
Another sign of improvement is that the EDR of 72% in 2020 is much higher than the EDR of 15% in 2010. This shows that social psychologists have dramatically improved the power of their studies. This is largely due to the move from small undergraduate samples to larger online samples.
The replication rate of 85% implies that most published results in 2020 are replicable. Even if exact replications are difficult, the EDR of 72% still suggests rather high replicability (see Bartos & Schimmack, 2020, for a discussion of EDR vs. ERR to predict actual replication results).
Despite this positive trend, it is possible that individual articles are less credible than the average results suggest. This is illustrated with the article by Leander et al. (2020).
This article was not picked at random. Several cues suggested that the results of this article may be less credible than other results. First, Wolfgang Stroebe has been an outspoken defender of the old unscientific practices in social psychology (Stroebe & Strack, 2014). Thus, it was interesting to see whether somebody who so clearly defended bad practices would have changed. Change is of course possible, because it is not clear how much influence Stroebe had on the actual studies. Another reason to be skeptical about this article is that it used priming as an experimental manipulation, although priming has been identified as a literature with many replication failures. The authors cite old priming studies as if there were no problem with these manipulations. Thus, it was interesting to see how credible these new priming results would be. Finally, the article reported many studies, and it was interesting to see how the authors addressed the problem that the risk of a non-significant result increases with each additional study (Schimmack, 2012).
I first used the automatically extracted test-statistics for this article. The program found 51 test-statistics. The results are different from the z-curve for all articles in 2020.
Visual inspection shows a peak of p-values that are just significant. The comparison of the ODR of 65% and the EDR of 14% suggests selection for significance. However, even if we focus only on the significant results, the replication rate is low at just 38%, compared to the 85% average for 2020.
I also entered all test-statistics by hand. There were more test-statistics because I was able to use exact p-values and confidence intervals, which are not used by the automated procedure.
The results are very similar showing that automatically extracted values are useful if an article reports results mostly in terms of F and t-values in the text.
The low power of significant results creates a problem for focal hypothesis tests in a series of studies. This article included 7 studies (1a, 1b, 1c, 2, 3, 4, 5) and reported significant results for all of them, ps = .004, .007, .014, .020, .041, .033, and .002. This 100% success rate is higher than the average observed power of these studies, 70%. Average observed power overestimates real power when results are selected for significance. A simple correction is to subtract the inflation rate (100% – 70% = 30%) from the mean observed power. The resulting index is called the Replication Index (R-Index). An R-Index of 40% shows that the studies were underpowered and that a replication study with the same sample size is more likely to produce a non-significant result than a significant one.
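The R-Index calculation can be reproduced from the seven reported p-values. The sketch below is my own Python illustration (not the author's tools); it estimates observed power as the probability that a normally distributed replicate of the observed z-score again exceeds the significance criterion of 1.96.

```python
from statistics import NormalDist

nd = NormalDist()
ps = [0.004, 0.007, 0.014, 0.020, 0.041, 0.033, 0.002]

# Observed power: probability that a replicate z-score exceeds 1.96
powers = [nd.cdf(nd.inv_cdf(1 - p / 2) - 1.96) for p in ps]
mean_power = sum(powers) / len(powers)              # ≈ .70
success_rate = 1.0                                  # 7 out of 7 significant
r_index = mean_power - (success_rate - mean_power)  # ≈ .40

print(f"mean observed power: {mean_power:.2f}, R-Index: {r_index:.2f}")
```

The inflation rate is the difference between the success rate and the mean observed power; subtracting it once from the mean observed power yields the R-Index.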
A z-curve analysis produces a similar estimate, but also shows that these estimates are very unstable and that replicability could be as low as 5%, which would mean there is no effect. Thus, after taking selection for significance into account, the 7 significant p-values in Leander et al.’s (2020) article provide as much evidence for their claims as Bem’s (2011) 9 significant p-values did for the claim that priming effects can work when the prime FOLLOWS the behavior.
Judd and Gawronski (2011) argued that they had to accept Bem’s crazy article because (a) it passed critical peer-review and (b) they had to trust the author that results were not selected for significance. Nothing has changed in JPSP-ASC. The only criteria for acceptance are peer-review and trust. Bias tests that could evaluate whether results are actually credible are not used by peer-reviewers or editors. Thus, readers have to carry out these tests themselves to protect themselves from fake science like Leander et al.’s (2020) priming studies. Readers still cannot trust social psychology journals to reject junk science like Bem’s (2011) article.
The second example shows how these tools can also provide evidence that published results are credible, using an article by Zlatev et al. (2020).
The automated method retrieved only 12 test statistics. This is a good sign because hypothesis tests are used sparingly to test only important effects, but it makes it more difficult to get precise estimates for a single article. Thus, article-based information should only be used as a heuristic, especially if no other information is available. Nevertheless, the limited information suggests that the results are credible. The observed discovery rate is even slightly below the estimated discovery rate, and both the EDR and ERR are very high, 99%. Five of the 12 test statistics exceed a z-value of 6 (6 sigma), which is even higher than the 5-sigma rule used in particle physics.
The hand-coding retrieved 22 test statistics. The main reason for the difference is that the automated method does not include chi-square tests to avoid including results from structural equation modeling. However, the results are similar. The ODR of 86% is only slightly higher than the EDR of 74% and the replication rate is estimated to be 95%.
There were six focal tests with four p-values below .001. The other two p-values were .001 and .002. The mean observed power was 96%, which means that a success rate of 100% was justified and that there is very little inflation in the success rate, resulting in an R-Index of 93%.
Psychology, especially social psychology, has a history of publishing significant results that are selected from a larger set of tests with low statistical power. This renders published results difficult to replicate. Despite a reform movement, published articles still rely on three criteria to be published: (a) p-values below .05 for focal tests, (b) peer-review, and (c) trust that researchers did not use questionable practices to inflate effect sizes and type-I error risks. These criteria do not help to distinguish credible and incredible articles.
This blog post shows how post-hoc power analysis can be used to distinguish questionable evidence from credible evidence. Although post-hoc power analysis has been criticized when it is applied to a single test statistic, meta-analyses of observed power can show whether researchers actually had good power or not. It can also be used to provide information about the presence and amount of selection for significance. This can be helpful for readers to focus on articles that published credible and replicable results.
The reason why psychology has been slow in improving is that readers have treated all significant results as equal. This encouraged researchers to p-hack their results just enough to get significance. If readers become more discerning in their reading of method sections and no longer treat all p-values below .05 as equal, articles with more credible evidence will gain more attention and citations. For example, this R-Index analysis suggests that readers can ignore Leander et al.’s article and can focus on the credible evidence in Zlatev et al.’s article. Of course, solid empirical results are only a first step in assessing an article. Other questions about ecological validity remain, but there is no point in paying attention to p-hacked results, even if they are published in the most prestigious journal.
P.S. I ran a z-curve analysis on all articles with 10 or more z-scores between 2 and 6 published from 2000 to 2010. The excel file contains the DOI, the observed discovery rate, the expected discovery rate, and the expected replication rate. It can be fun to plug a DOI into a search engine and see what article pops up. I know nobody is going to believe me, but I did not know in advance which article had the lowest EDR (5%) and ERR (9%); the result, however, is not surprising. I call it predictive validity of the R-Index.
A science that cannot face its history has no future. (Anonymous).
Bem (2011) presented incredible results that seemed to provide strong empirical evidence (p < .05^9). The article was published because it passed “a rigorous review process, involving a large set of extremely thorough reviews by distinguished experts in social cognition” and “editors can only take the author at his word that his data are in fact genuine.” Here I show that social psychologists have avoided discussing the broader implications of the method crisis in social psychology. The same standards that were used for Bem’s article are still used for most articles published today: a few significant p-values, peer-review, and trust that researchers are honest are supposed to ensure that results are robust and replicable. However, the replication crisis has shown that this is not the case. Consumers of social psychology need to be aware that even 10 years after Bem’s infamous article, evidence for social psychological theories is no more credible than Bem’s evidence for extra-sensory perception.
Daryl Bem was a highly respected social psychologist, until he published his “Feeling the Future” article in 2011.
The article became a poster child of everything that is wrong with research methods in social psychology and has been cited over 300 times.
The article was also accompanied by an editorial that justified the publication of an article that seemed to provide evidence for the incredible claim that humans, or at least extraverts, can feel events that haven’t happened yet.
The editorial suggests that Bem’s article will “stimulate further thoughts about appropriate methods in research on social cognition and attitudes” (p. 406). Ten years later, we can see whether publishing Bem’s article had the intended effect.
The high citation count shows that the article did indeed generate lots of discussion about research practices in social cognition research and social psychology more broadly. However, an inspection of these citations shows that most of this discussion occurred outside of social psychology, by meta-psychologists who reflected on research practices by social psychologists. In stark contrast, critical self-reflection by social psychologists is insignificant.
Here I provide a bibliography of these citations. An examination of these citations shows that social psychologists have carefully avoided asking themselves the most important question that follows from Bem’s (2011) article.
If we cannot trust Bem’s article that reported nine statistically significant results, which article in social psychology can we trust?
This article clearly spells out the problems of QRPs and uses Bem’s article to raise questions about other research findings. The first author was trained as a graduate student by Gawronski, but is no longer in social psychology.
This article implies that the problem was inappropriate treatment of variation across stimuli. It does not mention the use of QRPs in social psychology, nor does it mention evidence that Bem (2011) used QRPs (Francis, 2012; Schimmack, 2012).
The authors do not cite John et al. (2012), and do not cite Francis (2012) or Schimmack (2012) as evidence that Bem used QRPs.
Bem (2011) presented incredible results that seemed to provide strong empirical evidence (p < .05^9). The article was published because it passed “a rigorous review process, involving a large set of extremely thorough reviews by distinguished experts in social cognition” and “we can only take the author at his word that his data are in fact genuine.” The same is true for all other articles published in social psychology. A few significant p-values, peer-review, and trust are supposed to ensure that results are robust and replicable. However, the replication crisis has shown that this is not the case. As research practices have not dramatically changed, consumers of social psychology need to be warned that even 10 years after Bem’s article, published results in social psychology are no more trustworthy than Bem’s claims of extra-sensory perception.
Ten years after Bem’s (2011) demonstration of extrasensory perception with the standard statistical practices in psychology it is clear that these practices were unable to distinguish true findings from false findings. In the following decade, replication studies revealed that many textbook findings, especially in social psychology, were false findings, including extrasensory perception (Świątkowski & Benoît, 2017; Schimmack, 2020).
Although a few changes have been made, especially in social psychology, research practices in psychology are mostly unchanged one decade after the method crisis in psychology became apparent. Most articles continue to diligently report results that are statistically significant, p < .05, and do not report when critical hypotheses were not confirmed. This selective publishing of significant results has characterized psychology as an abnormal science for decades (Sterling, 1959).
Some remedies are unlikely to change this. Preregistration is only useful if good studies are preregistered. Nobody would benefit from publishing all badly designed preregistered studies with null-results. Open sharing of data is also not useful if the data are meaningless. Finally, sharing of materials that helps with replication is not useful if the original studies were meaningless. What psychology needs is a revolution of research practices that leads to the publication of studies that address meaningful questions.
The fundamental problem in psychology is the asymmetry of statistical tests that focus on the nil-hypothesis that the effect size is zero (Cohen, 1994). The problem with this hypothesis is that it is impossible to demonstrate an effect size of zero. The only way would be to study the entire population. However, often the population is not clearly defined, and it is unlikely that the effect size is exactly zero in the population. This asymmetry has led to the belief that non-significant results, p > .05, are inconclusive. There is always the possibility that a non-zero effect exists, so one is not allowed to draw conclusions that effects do not exist (even time-reversed pre-cognition always remains a possibility).
This problem was recognized in the 1990s, but APA came up with an even worse solution to fix this problem. Instead of just reporting statistical significance, researchers were also asked to report effect sizes. In theory, reporting of effect sizes would help researchers to evaluate whether an effect size is large enough to be meaningful or not. For example, if a researcher reported a result with p < .05, but an extremely small effect size of d = .03, it might be considered so small, that it is practically irrelevant.
So why did reporting effect sizes not improve the quality and credibility of psychological science? The reason is that studies still had to pass the significance filter, and to do so effect size estimates in a sample have to exceed a threshold value. The perverse incentive was that studies with small samples and large sampling error require larger effect size estimates than qualitatively better studies with large samples that provide more precise estimates of effect sizes. Thus, researchers who invested few resources in small studies were able to brag about their large effect sizes. Sloppy language disguised the fact that these large effects were merely estimates of the actual population effect sizes.
Many researchers relied on Cohen’s guidelines for a priori power analysis to label their effect size estimates as small, moderate, or large. The problem with this is that selection for significance inflates effect size estimates, and the inflation is inversely related to sample size. The smaller the sample, the bigger the inflation, and the larger the effect size that is reported.
This inflation only becomes apparent when replication studies with larger samples are available. For example, Joy-Gaba and Nosek (2010) obtained a standardized effect size estimate of d = .08 with N = 3,000 participants, whereas the original study with N = 48 participants reported an effect size estimate of d = .82.
The title of the article “The Surprisingly Limited Malleability of Implicit Racial Evaluations” draws attention to the comparison of the two effect size estimates, as does the discussion section.
“Further, while DG reported a large effect of exposure on implicit racial (and age) preferences (d = .82), the effect sizes in our studies were considerably smaller. None exceeded d = .20, and a weighted average by sample size suggests an average effect size of d = .08…” (p. 143).
The problem is the sloppy usage of the term effect size for effect size estimates. Dasgupta and Greenwald did not report a large effect because their small sample had so much sampling error that it was impossible to provide any precise information about the population effect size. This becomes evident when we compare the results in terms of confidence intervals (frequentist or objective Bayesian doesn’t matter).
The sampling error for a study with N = 33 participants is 2/sqrt(33) = .35. To create a 95% confidence interval, we multiply the sampling error by qt(.975,31) = 2. So, the 95% CI around the effect size estimate of .80 ranges from .80 – .70 = .10 to .80 + .70 = 1.50. In short, small samples produce extremely noisy estimates of effect sizes. It is a mistake to interpret the point estimates of these studies as reasonable estimates of the population effect size.
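This hand calculation can be checked in a few lines of Python (my own sketch, using the author's rounding of the critical t-value to 2):

```python
import math

n = 33
d_hat = 0.80
se = 2 / math.sqrt(n)                 # sampling error of d ≈ .35
ci = (d_hat - 2 * se, d_hat + 2 * se)  # 95% CI with critical value ≈ 2
print(f"SE = {se:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")  # ≈ [.10, 1.50]
```

The width of this interval, 1.4 standard deviations, shows how little the point estimate of d = .80 constrains the population effect size.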
Moreover, when results are selected for significance, these noisy estimates are truncated at high values. To get a significant result in their extremely small sample, Dasgupta and Greenwald required a minimum effect size estimate of d = .70. In this case, the effect size estimate is two times the sampling error, which produces a p-value of .05.
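A small simulation illustrates how this truncation inflates published estimates. The sketch below is my own; it assumes, hypothetically, a true effect of d = .2 and the study's sampling error of about .35, so that only estimates above roughly .70 become significant.

```python
import math
import random

random.seed(1)
true_d = 0.2                   # assumed (hypothetical) population effect size
se = 2 / math.sqrt(33)         # sampling error of d in a study with N = 33
crit = 1.96 * se               # minimum significant estimate ≈ .68

# Simulate many studies and keep only the significant ones
estimates = [random.gauss(true_d, se) for _ in range(100_000)]
significant = [d for d in estimates if d > crit]

print(f"true d = {true_d}, mean published d = {sum(significant) / len(significant):.2f}")
```

With these assumptions, the average significant estimate comes out around d = .84, more than four times the assumed population effect size.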
This example is not an isolated incident. It is symptomatic of the reporting of results in psychology. Only recently has a new initiative asked for the reporting of confidence intervals, but it is not clear whether psychologists fully grasp the importance of this information. Maybe point estimates should not be reported unless confidence intervals are reasonably narrow.
In any case, the reporting of effect sizes did not change research practices and reporting of confidence intervals will also fail to do so because they do not change the asymmetry of nil-hypothesis testing. This is illustrated with another example.
Using a large online sample (N = 93,239), a study produced an effect size estimate (Ravary, Baldwin, & Bartz, 2019) of d = .02 (d = .0177 in the article). This effect size is reported with a 95% confidence interval from .004 to .03.
Using the standard logic of nil-hypothesis testing, this finding is used to reject the nil-hypothesis and to support the conclusion (in the abstract) that “as predicted, fat-shaming led to a spike in women’s (N=93,239) implicit anti-fat attitudes, with events of greater notoriety producing greater spikes” (p. 1580).
We can now ask ourselves a counterfactual question: What finding would have led the authors to conclude that there was no effect? What if a sample size of 1.5 million participants had shown an effect size of d = .002 with CI = .001 to .003? Would that have been sufficiently small to conclude that nothing notable happened; let’s move on? Or would it still have been interpreted as evidence against the nil-hypothesis?
The main lesson from this Gedankenexperiment is that psychologists lack a procedure to weed out effects that are so small that chasing them would be a waste of time and resources.
I am by no means the first one to make this observation. In fact, some psychologists like Rouder and Wagenmakers have criticized nil-hypothesis testing for the very same reason and proposed a solution to the problem. Their solution is to test two competing hypotheses and allow for empirical data to favor either one. One of these hypotheses specifies an actual effect size, but because we do not know what the effect size is, this hypothesis is specified as a distribution of effect sizes. The other hypothesis is the nil-hypothesis that there is absolutely no effect. The consistency of the data with these two hypotheses is evaluated in terms of Bayes-Factors.
The advantage of this method is that it is possible for researchers to decide in favor of the absence of an effect. The disadvantage of this method is that the absence of a relevant effect is still specified as absolutely no effect. This makes it possible to sometimes end up with the wrong inference that there is absolutely no effect, when in fact a small but practically significant population effect size exists.
A better way to solve the problem is to specify two hypotheses that are distinguished by the minimum relevant population effect size. Lakens, Scheel, and Isager (2018) give a detailed tutorial on equivalence testing. I diverge from their approach in two ways. First, they leave it to researchers’ expertise to define the smallest effect size of interest (SESOI) or minimum effect size (MES). This is a problem because psychologists have shown a cunning ability to game any methodological innovation to avoid changing their research practices. For example, nothing would change if the MES were set to d = .001. Rejecting d = .001 is not very different from rejecting d = .000, and it would require 10 million participants to establish that an effect falls below such an MES.
In fact, when psychologists obtain small effect sizes, they are quick to argue that these effects still have huge practical implications (Greenwald, Banaji, & Nosek, 2015). The danger is that these arguments are made in the discussion section, but that the results section reports effect sizes that are inflated by publication bias, d = .5, 95%CI = .1 to .9.
To solve this problem, MESs should correspond to sampling error. Studies with small samples and large sampling error need to specify a high MES, which increases the risk that researchers end up with a result that falsifies their predictions. For example, the race IAT does not predict voting against Obama, p < .05.
I therefore suggest an MES of d = .2 or r = .1 as a default criterion. This is consistent with Cohen’s (1988) criteria for a small effect. In terms of significance testing, not much changes. For example, for a t-test, we simply subtract .2 from the standardized mean difference. This has implications for a priori power analysis. To have 80% power to reject the nil-hypothesis with a moderate effect size of d = .5, a total of 128 participants is needed (n = 64 per cell).
To compute power with MES = .2, I wrote a little R-script. It shows that N = 356 participants are needed to achieve 80% power with a population effect size of d = .5. The program also computes the power for the hypothesis that the population effect size is below the MES. Once more, it is important to assume a fixed population effect size. A plausible value is zero, but if there is a small but negligible effect, power would be lower. The figure shows that power is only 47%. Power less than 50% implies that the effect size estimate has to be negative to produce a significant result.
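The R-script itself is not shown here. The normal-approximation sketch below is my own (not the author's code); it treats the sampling error of d as 2/sqrt(N) and shifts the tested value by the MES of .2, which reproduces both numbers in the text.

```python
import math
from statistics import NormalDist

nd = NormalDist()

def power_above_mes(true_d: float, n: int, mes: float = 0.2, z_crit: float = 1.96) -> float:
    """Power to show that the population effect size exceeds the MES."""
    se = 2 / math.sqrt(n)  # approximate sampling error of d for total N
    return nd.cdf((true_d - mes) / se - z_crit)

def power_below_mes(true_d: float, n: int, mes: float = 0.2, z_crit: float = 1.96) -> float:
    """Power to show that the population effect size is below the MES."""
    se = 2 / math.sqrt(n)
    return nd.cdf((mes - true_d) / se - z_crit)

print(f"{power_above_mes(0.5, 356):.2f}")  # ≈ .80 with d = .5, N = 356
print(f"{power_below_mes(0.0, 356):.2f}")  # ≈ .47 with d = 0,  N = 356
```

With a population effect size of d = .5, N = 356 yields roughly 80% power to show that the effect exceeds the MES; with a population effect size of zero, the same sample yields only about 47% power to show that the effect is below it.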
Of course, the values can be changed to make other assumptions. The main point is to demonstrate the advantage of specifying a minimum effect size. It is now possible to falsify predictions. For example, with a sample of 93,239 participants, we have 100% power to determine whether an effect is larger or smaller than .2, even if the population effect size is d = .1. Thus, we can falsify the prediction that media events about fat-shaming move scores on a fat-IAT with a statistically small effect size.
The obvious downside of this approach is that it makes it harder to get statistically significant results. For many research areas in psychology, a sample size of N = 356 is very large. Animal studies or fMRI studies often have much smaller sample sizes. One solution to this problem is to increase the number of observations with repeated measurements, but this is also not always possible or not much cheaper. Limited resources are the main reason why psychologists are often conducting underpowered studies. This is not going to change overnight.
Fortunately, thinking about the minimum effect size is still helpful for consumers of research articles because they can retroactively apply these criteria to published research findings. For example, take Dasgupta and Greenwald’s seminal study that aimed to change race IAT scores with an experimental manipulation. If we apply an MES of d = .2 to a study with N = 33 participants, we easily see that this study provided no valuable information about effect sizes, because a d-score of -.5 is needed to claim that the population effect size is below d = .2, and an effect size estimate of d = .9 is needed to claim that it is greater than .2. If we assume that the population effect size is d = .5, the study has only 13% power to produce a significant result. Given the selection for significance, it is clear that published significant results are dramatically inflated by sampling error.
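The numbers in this paragraph can be verified with the same normal approximation (again my own sketch, not code from any of the articles discussed):

```python
import math
from statistics import NormalDist

nd = NormalDist()
n, mes = 33, 0.2
se = 2 / math.sqrt(n)                    # sampling error of d ≈ .35

lower_crit = mes - 1.96 * se             # estimate needed to claim d < .2 → ≈ -.5
upper_crit = mes + 1.96 * se             # estimate needed to claim d > .2 → ≈ .9
power = nd.cdf((0.5 - mes) / se - 1.96)  # power with true d = .5 → ≈ .13

print(f"criteria: {lower_crit:.2f} / {upper_crit:.2f}, power = {power:.2f}")
```

The two critical values straddle the entire range of plausible effect sizes, which is another way of saying that an N = 33 study cannot discriminate between a negligible and a large effect.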
In conclusion, the biggest obstacle to improving psychological science is the asymmetry in nil-hypothesis significance testing. Whereas significant results that are obtained with “luck” can be published, non-significant results are often not published or considered inconclusive. Bayes-factors have been suggested as a solution to this problem, but Bayes-Factors do not take effect sizes into account and can also reject the nil-hypothesis with practically meaningless effect sizes. To get rid of the asymmetry, it is necessary to specify non-null effect sizes (Lakens et al., 2018). This will not happen any time soon because it requires an actual change in research practices that requires more resources. If we have learned anything from the history of psychology, it is that sample sizes have not changed. To protect themselves from “lucky” false positive results, consumers of scientific articles can specify their own minimum effect sizes and see whether results remain significant. With the typical p-values between .05 and .005 this will often not be the case. These results should be treated as interesting suggestions that need to be followed up with larger sample sizes, but readers can skip the policy implication discussion section. Maybe if readers get more demanding, researchers will work harder to convince them of their pet theories. Amen.
Psychology is not a unified paradigmatic science. That is, it lacks an overarching theory like evolution theory in biology. In a science without an empirically grounded paradigm, progress is made very much like evolution made progress in a process of trial and error. Some ideas may thrive for a moment, but if they are not fruitful, they are discarded. The emergence of a new idea is often characterized as a revolution, and psychology has seen its fair share of revolutions. Behaviorism replaced introspectionism and the cognitive revolution replaced behaviorism. For better or worse, cognitivism is dominating psychology at the moment. The cognitive revolution also had a strong influence on social psychology with the rise of social cognition research.
In the early days, social psychologists focused on higher cognitive processes like attributions. However, in the 1980s, the implicit revolution shifted the focus towards lower cognitive processes that may occur without awareness. This was not the first time that unconscious processes had become popular. A special issue of the American Psychologist in 1992 called it the New Look 3 (Greenwald, 1992).
The first look was Freud’s exploration of conscious and unconscious processes. A major hurdle for this first look was conceptual confusion and a lack of empirical support. Puritanical academics may also have shied away from the sexual content in Freudian theories (e.g., sexual desire directed at the mother).
However, the second look did try to study many of Freud’s ideas with empirical methods. For example, Silverman and Weinberger (1985) presented the phrase “Mommy and I are one” on a computer screen so quickly that participants were unable to say what they saw. This method is called subliminal priming. The idea was that the unconscious has a longing to be loved by mommy and that presenting this phrase would gratify the unconscious. Numerous studies used the “Mommy and I are one” priming method to see effects on behavior.
Greenwald (1992) reviewed this evidence.
Can subliminal presentations result in cognitive analyses of multiword strings? There have been reports of such effects, especially in association with tests of psychoanalytic hypotheses. The best known of these findings (described as subliminal psychodynamic activation [SPA], using “Mommy and I are One” as the text of a subliminal stimulus; Silverman & Weinberger, 1985) has been identified, on the basis of meta-analysis, as a reproducible phenomenon (Hardaway, 1990; Weinberger & Hardaway, 1990).
Despite this strong evidence, many researchers remain skeptical about the SPA result (see, e.g., the survey reported in Appendix B). Such skepticism is almost certainly due to the lack of widespread enthusiasm for the SPA result’s proposed psychodynamic interpretation (Silverman & Weinberger, 1985).
Because of the positive affective values of words in the critical stimulus (especially Mommy and I), it is possible that observed effects might be explained by cognitive analysis limited to the level of single words. Some support for that interpretation is afforded by Hardaway’s demonstration (1990, p. 183, Table 3) that other affectively positive strings that include Mommy or One also produce significant effects. However, these other effects are weaker than the effect of the specific string, “Mommy and I are One.”
In summary of evidence from studies of subliminal activation, it is now well established that analysis occurs for stimuli presented at exposure conditions in a region between objective and subjective thresholds; this analysis can extract at least some semantic content of single words.
The New Look 3, however, was less interested in Freudian theory. Most of the influential subliminal priming studies used ordinary stimuli to study common topics in social psychology, including prejudice.
For example, Greenwald (1992) cites Devine’s (1989) highly influential subliminal priming studies with racial stimuli as evidence that “experiments using stimulus conditions that are clearly above objective thresholds (but presumably below subjective thresholds) have obtained semantic activation findings with apparent relative ease” (p. 769).
25 years later, in their Implicit Revolution article, Greenwald and Banaji feature Devine’s influential article.
“Patricia Devine’s (1989) dissertation research extended the previously mentioned subliminal priming methods of Bargh and Pietromonaco (1982) to automatic stereotypes. Devine’s article brought attention to the possibility of dissociation between automatic stereotype activation and controlled inhibition of stereotype expression” (p. 865).
In short, subliminal priming has played an important role in the implicit revolution. However, subliminal priming is still rare. Most studies use clearly visible stimuli. This is surprising, given the clear advantages of subliminal priming for studying unconscious processes. A major concern with stimuli that are presented with awareness is that participants can control their behavior. In contrast, if they are not even aware that a racial stimulus was presented, they have no ability to suppress a prejudiced response.
Another revolution explains why subliminal studies remain rare despite their obvious advantages. This revolution has been called the credibility revolution, replication revolution, or open science revolution. The credibility revolution started in 2011, after a leading social cognition journal published a controversial article that showed time-reversed subliminal priming effects (Bem, 2011). This article revealed a fundamental problem in the way social psychologists conducted their research. Rather than using experiments to see whether effects exist, they used experiments to accumulate evidence in favor of effects. Studies that failed to show the expected effects were hidden. In the 2010s, it has become apparent that this flawed use of the scientific method has produced large literatures with results that cannot be replicated. A major replication project found that less than 25% of results in social psychological experiments could be replicated (OSC, 2015). Given these results, it is unclear which results provided credible evidence.
Despite these troubling findings, social psychologists continue to cite old studies like Devine’s (1989) study (it was just one study!) as if it provided conclusive evidence for subliminal priming of prejudice. If we needed any evidence for Freud’s theory of repression, social psychologists would provide a prime example. Through various defense mechanisms they maintain the belief that old findings obtained with bad scientific practices provide credible evidence that can inform our understanding of the unconscious.
Here I show that this is wishful thinking. To do so, I conducted a modern meta-analysis of subliminal priming studies. Unlike traditional meta-analyses, which do not take publication bias into account, this new method provides a strong test of publication bias and corrects for its effect on the results. While there are several new methods, z-curve has been shown to be superior to other methods (Brunner & Schimmack, 2020).
The figure shows the results. The red line at z = 1.96 corresponds to the significance criterion of .05. It is easy to see that this criterion acts like a censor. Results with z-scores greater than 1.96 (i.e., p < .05) are made public and can enter researchers’ awareness. Results that are not significant, z < 1.96, are repressed and may linger only in the unconscious of researchers who prefer not to think about their failures.
Statistical evidence of repression is provided by a comparison of the observed discovery rate (i.e., the percentage of published results that are significant) of 90% and the expected discovery rate based on the z-curve model (i.e., the grey curve in the figure) of 13%. Evidently, published results are selected from a much larger number of analyses that failed to support subliminal priming. This clear evidence of selection for significance undermines the credibility of individual studies in the subliminal priming literature.
However, there is some evidence of heterogeneity across studies. This is seen in the increasing numbers below the x-axis. Whereas studies with z-scores below 4 have low average power, studies with z-scores above 4 have a mean power greater than 80%. This suggests that replications of these studies could produce significant results. This information could be used to salvage a few solid findings from a pile of junk findings. A closer examination of these studies is beyond the purpose of this blog post, and Devine’s study is not one of them.
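The mean-power values below the x-axis follow from a simple relationship: if a study's true strength of evidence equals its observed z-score (a simplification; z-curve actually fits a mixture model), the probability that an exact replication reaches significance is Φ(z − 1.96). A minimal sketch:

```python
from statistics import NormalDist

def replication_power(z: float, alpha_z: float = 1.96) -> float:
    """Probability that an exact replication reaches p < .05,
    assuming the true noncentrality equals the observed z-score
    (a simplification; z-curve models a mixture of true powers)."""
    return 1 - NormalDist().cdf(alpha_z - z)

print(round(replication_power(1.96), 2))  # just-significant result -> 0.5
print(round(replication_power(4.0), 2))   # z = 4 -> roughly 0.98
```

A just-significant result replicates only half the time under this idealization, whereas z = 4 implies a replication probability near 98%, which is why z-curve treats such results as credible.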
The main point of this analysis is that there is strong scientific evidence to support the claim that subliminal priming researchers did not use the scientific method properly. By selecting only results that support the existence of subliminal priming, they created only illusory evidence in support of subliminal priming. Thirty years after Devine’s (1989) subliminal prejudice study was published, we have no scientific evidence in support of the claim that racial stimuli can bypass consciousness and directly influence behavior.
However, Greenwald and other social psychologists who made a career out of these findings repress the well-known fact that published results in experimental social psychology are not credible and cite them as if they are credible evidence (Greenwald & Banaji, 2017).
Social psychologists are of course very familiar with deception. First, they became famous for deceiving participants (Milgram studies). In 2011, it became apparent that they were deceiving themselves. Now, it seems they are willing to deceive others to avoid facing the inconvenient truth that decades of research have produced no scientific results.
The inability to face ego-threatening information is of course not new to psychologists. Freud studied defense mechanisms, and social psychologists studied cognitive biases and motivated reasoning. Right now, this trait is on display in Donald Trump and his supporters’ inability to face the fact that he lost an election. It is ironic that social psychologists have the same inability when their own egos are on the line.
The notion of implicit bias has taken root in North America and influential politicians like Hillary Clinton or FBI director James Comey used the idea to understand persistent racism and prejudice in the United States (Greenwald, 2015).
The main idea of implicit bias is that most White Americans have negative associations about Blacks that influence their behaviors without their awareness. This explains why even Americans who hold egalitarian values and do not want to discriminate end up discriminating against Black Americans.
The idea of implicit bias emerged in experimental social psychology in the 1980s. Until then, most academic psychologists dismissed Freudian ideas of unconscious processes. However, research in cognitive psychology with computerized tasks suggested that some behaviors may be directly guided by unconscious processes that cannot be controlled by our conscious mind and may even influence behavior without our awareness (Greenwald, 1992).
Some examples of these unconscious processes are physiological processes (breathing), highly automated behaviors (driving while talking to a friend), and basic cognitive processes (e.g., color perception). These processes differ from cognitive tasks like adding 2 + 3 + 5 or deciding what take out food to order tonight. There is no controversy about this distinction. The controversial and novel suggestion was that prejudice could work like color perception. We automatically notice skin color and our unconscious guides our actions based on this information. Eventually the term implicit bias was coined to refer to automatic prejudice.
To provide evidence for implicit bias, experimental social psychologists adapted experiments from cognitive psychology to study prejudice. For example, one procedure is to present racial stimuli on a computer screen very quickly and immediately replace them with a neutral stimulus to prevent participants from actually seeing the stimulus. This method is called subliminal (below the threshold of awareness) priming.
Some highly cited studies suggested that subliminal priming influences behavior without awareness (Bargh et al., 1996; Devine, 1989). However, in the past decade it has become apparent that these results are not credible (Schimmack, 2020). The reason is that social psychologists did not use the scientific method properly. Instead of using experiments to examine whether an effect exists, they only looked for evidence that shows an effect. Studies that failed to show the expected effects of subliminal priming were simply not reported. As a result, even incredible subliminal priming studies that reversed the order of cause and effect were successful (Bem, 2011). In the 2010s, some courageous researchers started to publish replication failures (Doyen et al., 2012). They were attacked for doing so because it was a well-known secret among experimental social psychologists that many studies fail, but you were not supposed to tell anybody about it. In short, the evidence that started the implicit revolution (Greenwald & Banaji, 2017) is invalid and casts a shadow over the whole notion of prejudice without awareness.
Measuring Implicit Bias
In the 1990s, experimental psychologists started developing methods to measure individuals’ implicit biases. The most prominent method is the Implicit Association Test (IAT; Greenwald et al., 1998), which has produced a large literature with thousands of studies that used the IAT to measure attitudes towards the self (self-esteem), exercise, political candidates, and so on. However, the most important literature with the IAT consists of studies of implicit bias. In these studies, White Americans tend to show a clear preference for Whites over Black Americans. This preference can also be shown with self-ratings. However, a notable group of participants shows much stronger preferences for Whites on the IAT than in their self-ratings. This finding has been used to claim that some White Americans are more prejudiced than they are aware of.
One problem with the IAT and other measures of implicit bias is that they are not very good. That is, an individual’s test score is much more strongly influenced by measurement error than by their implicit bias. One way to demonstrate this is to examine the reliability of IAT scores. A good measure should produce similar results when it is used twice (e.g., two Covid-19 tests should both be positive or both negative, not one positive and one negative). Reliability can be assessed by examining the correlation between two IATs. Assuming bivariate normality, a correlation of r = .5 would imply roughly a two-thirds chance that somebody scores on the same side of the average on both tests and a one-third chance of conflicting results (i.e., above average on one test and below average on the other).
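The arithmetic behind such agreement rates can be checked under a bivariate-normal assumption: two tests with retest correlation r classify a person on the same side of the mean with probability 1/2 + arcsin(r)/π. A quick simulation (illustrative only, not part of any of the cited analyses):

```python
import math, random

def agreement_rate(r: float, n: int = 200_000, seed: int = 1) -> float:
    """Simulate two test scores correlated at r (bivariate normal) and
    return the share of cases classified on the same side of the mean
    by both tests."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n):
        x = rng.gauss(0, 1)
        # Construct a second score with correlation r to the first.
        y = r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1)
        same += (x > 0) == (y > 0)
    return same / n

analytic = 0.5 + math.asin(0.5) / math.pi  # = 2/3 for r = .5
print(round(analytic, 3))
print(round(agreement_rate(0.5), 3))       # simulation agrees closely
```

For r = .5 the agreement rate is about two-thirds; for the retest correlations of .2 to .4 actually reported for the IAT below, it drops even closer to the 50% expected from coin flips.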
Experimental social psychologists rarely examine reliability because most of their studies are cross-sectional (a single experimental session lasting from 10 minutes to 1 hour). However, a few studies with repeated measurements provide some information. Short intervals are preferable to avoid any real changes in implicit bias. Bar-Anan and Nosek (2014) reported a retest correlation of r = .4 for tests taken within a few hours. Lai et al. (2016) conducted the largest study, with several hundred participants taking tests within a few days. The retest correlations ranged from .22 to .30. Even two similar, but not identical, race IATs in the same session produce low correlations, r ~ .2 (Cunningham et al., 2001). More extensive psychometric analyses further suggest that some of the variance in implicit bias measures is systematic measurement error that influences one type of measure but not other measures (Schimmack, 2019). Longitudinal studies over several years further show that the reliable variance in IATs is highly stable over time (Onyeador et al., 2020).
In short, ample evidence suggests that most of the variance in implicit bias measures is measurement error. This has important implications for research that tries to change implicit bias or uses implicit bias measures to predict behaviors. However, experimental social psychologists have ignored these implications and implicitly assumed that their measures are perfectly valid.
The Numbers do not add up
Some simple math shows the problems experimental social psychologists face when studying implicit bias. The main method to study implicit bias is to conduct experiments in which participants are randomly assigned to two or more groups. Each group receives a different treatment, and then the effects on an implicit bias measure and actual behaviors are observed. For illustrative purposes, I assume that manipulations actually have a moderate effect size of half a standard deviation (d = .5) on implicit bias. However, because only a small proportion of the variance in implicit bias measures is valid (here the generous assumption is a validity of .5, i.e., .5^2 = 25% valid variance), the effect that an experimental social psychologist can observe is only .25 standard deviations. That is, measurement error cuts the actual effect size in half. The effect on an actual behavior is even smaller because the link between attitudes and a single behavior is also small, d = .5 * .3 = .15. Thus, even under favorable conditions, experimental social psychologists can only expect to observe small effect sizes.
A good scientist would plan studies to be able to reliably detect these small effect sizes. Cohen (1988) provided guidelines for planning sample sizes that make it possible to detect them. A so-called power analysis shows that N = 500 participants are needed to detect an effect size of d = .25, and 1,400 participants are needed to detect an effect size of d = .15 for behavior.
However, experimental social psychologists tend to conduct studies with much smaller samples, often fewer than 100 participants. With N = 100, they would have only a 25% chance to reliably (with a p-value below .05) detect an effect, and the observed effect size would be severely inflated, because a result in such a small sample can only reach significance with an inflated effect size estimate. Thus, we would expect many non-significant results in the implicit bias literature. However, we do not see these results because experimental social psychologists did not report their failures.
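These sample-size and power claims can be reproduced with the normal approximation to the two-sample t-test (a sketch; exact t-based power differs slightly):

```python
import math
from statistics import NormalDist

Z = NormalDist()

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-group comparison
    (normal approximation to the t-test)."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2)

def power_two_group(d: float, n_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-group comparison with n per group."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    return 1 - Z.cdf(z_a - d * math.sqrt(n_group / 2))

# Attenuated effect sizes from the text: d = .5 * .5 = .25 on the
# measure, d = .5 * .3 = .15 on behavior.
print(2 * n_per_group(0.25))               # total N of roughly 500
print(2 * n_per_group(0.15))               # total N of roughly 1,400
print(round(power_two_group(0.25, 50), 2)) # N = 100 -> power of about .24
```

The roughly 500 and 1,400 participants match Cohen-style planning for d = .25 and d = .15, and a total N of 100 yields power of about one in four, as stated above.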
Implicit Bias Intervention Studies
For 20 years, experimental social psychologists have reported studies that seemed to change implicit bias (Dasgupta & Greenwald, 2001; Kawakami, Dovidio, Moll, Hermsen, & Russin, 2000). The most influential article was Dasgupta and Greenwald’s (2001), with nearly 700 citations. As this article spawned an entire literature, it is worthwhile to take a closer look at it.
There were two studies, but only Study 1 focused on implicit race bias. The sample size was N = 48. These 48 participants were divided into three groups, leaving n = 16 per group. Aside from a control group, one group was shown positive examples of Blacks and negative examples of Whites, and another group was shown the reverse. The extreme comparison of the two opposing groups therefore rests on 32 participants. To have an 80% chance of obtaining a significant result for this contrast, an observed difference of about d = 1 is needed. Taking measurement error into account, this requires a change in actual implicit bias of about 2 standard deviations. Otherwise, a non-significant result is likely, and the study is a risky gamble.
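The required effect size for Study 1 can be sketched the same way (normal approximation; the exact cell sizes depend on how the 48 participants were split across conditions):

```python
import math
from statistics import NormalDist

Z = NormalDist()

def min_detectable_d(n_group: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest observed standardized difference a two-group contrast
    can detect with the given power (normal approximation)."""
    return (Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)) * math.sqrt(2 / n_group)

for n in (16, 18):                  # plausible cell sizes for N = 48
    d_obs = min_detectable_d(n)
    d_true = d_obs / 0.5            # assumed IAT validity of .5
    print(n, round(d_obs, 2), round(d_true, 2))
```

With 16 to 18 participants per cell, an observed difference of about d = 1 is needed, which corresponds to a change of nearly 2 standard deviations in the underlying bias once the assumed validity of .5 is taken into account.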
Surprisingly, the authors did find a very strong effect size for their manipulation, d = 1.29. They even found a significant difference with the control group, d = .58.
As shown in Figure 1, Panel A, results revealed that exposure to pro-Black exemplars had a substantial effect on automatic racial associations (or the IAT effect).5 The magnitude of the automatic White preference effect was significantly smaller immediately after exposure to pro-Black exemplars (IAT effect = 78 ms; d = 0.58) compared with nonracial exemplars (IAT effect = 174 ms; d = 1.15), F(1, 31) = 6.79, p = .01; or pro-White exemplars (IAT effect = 176 ms; d = 1.29), F(1, 31) = 5.23, p = .029. IAT effects in control and pro-White conditions were statistically comparable (F < 1).
Dasgupta and Greenwald not only wanted to show an immediate effect. They also wanted to show that this effect lasts at least for a short time. Thus, they repeated the measurement on a second day. The problem is that they now needed two significant results when they had a relatively low chance of obtaining even one. The risk of failure therefore increased considerably, but they were successful again.
Panel B of Figure 1 illustrates the response latency data 24 hr after exemplar exposure. Compared with the control condition, the magnitude of the IAT effect in the pro-Black condition remained significantly diminished 1 day after encountering admired Black and disliked White images (IAT effects = 126 ms vs. 51 ms, respectively; ds = 0.98 vs. 0.38, respectively), F(1, 31) = 4.16, p = .05. Similarly, compared with the pro-White condition, the IAT effect in the pro-Black exemplar condition remained substantially smaller as well (IAT effects = 107 vs. 51 ms, respectively; ds = 1.06 vs. 0.38, respectively), F(1, 31) = 3.67, p = .065.
Nobody used to care about p-values that are strictly not significant (p = .05, p = .065), but these days such p-values are considered red flags that may suggest the use of questionable research practices to find significance. Another sign of questionable practices is when multiple tests are all successful, because each test produces a new opportunity for failure. Thus, the fact that everything always works in experimental social psychology is a sign of widespread abuse of the scientific method (Sterling, 1959; Schimmack, 2012).
Study 2 did not examine racial bias, but it is relevant because it presents more statistical tests. If they also show the desired results, we have additional evidence that QRPs were used. Study 2 examined prejudice towards old people. Notably, the reported study did not have a control group as in Study 1; thus, there is only a comparison of a manipulation with favorable examples of old people versus favorable examples of young people. Study 2 also did not examine whether the changes lasted for a day, or at least no results were reported if this was examined. Thus, there is only one statistical test, and it was significant with p = .03.
As illustrated in Figure 2, exposure to pro-elderly exemplars yielded a substantially smaller automatic age bias effect (IAT effect = 182 ms, d = 1.23) than exposure to pro-young exemplars (IAT effect = 336 ms, d = 1.75), F(1, 24) = 5.13, p = .03.
Over the past decade, meta-scientists have developed new tools to examine the presence of questionable practices even in small sets of studies. One test, the Test of Insufficient Variance (TIVA), examines whether p-values vary as much as sampling error predicts. After converting the p-values into z-scores, we would expect a variance of 1, but the variance is only 0.05. This outcome has a probability of only 1 in 180 of occurring by chance. Even if we are conservative and make this 1 in 100, Dasgupta and Greenwald were extremely lucky to get significant results in all of their critical tests. We can also examine the power of their studies given the reported test statistics. The average observed power is 56%, yet they had a 100% success rate. This suggests that QRPs were used to inflate the success rate. This test is extremely conservative because mean observed power is itself inflated by the use of QRPs. A simple correction is to subtract the inflation (100% – 56% = 44%) from the observed mean power. This yields a corrected replicability index of 56% – 44% = 12%. For comparison, a replicability index of 21% is expected when there is actually no effect.
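Both tests can be sketched in a few lines. As an illustration I use the five critical p-values quoted above (.01, .029, .05, .065, .03); the original analysis may have included a slightly different set of tests, so the numbers differ somewhat from the variance of 0.05 and mean observed power of 56% cited in the text:

```python
import math
from statistics import NormalDist

Z = NormalDist()

def chi2_cdf_even_df(x: float, df: int) -> float:
    """Chi-square CDF, closed form valid for even df."""
    assert df % 2 == 0
    return 1 - math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i)
                                      for i in range(df // 2))

def tiva(pvalues):
    """Test of Insufficient Variance: z-scores of independent, honestly
    reported tests should have variance ~1; the left-tail chi-square
    probability says how unlikely the observed smaller variance is."""
    z = [Z.inv_cdf(1 - p / 2) for p in pvalues]
    m = sum(z) / len(z)
    var = sum((v - m) ** 2 for v in z) / (len(z) - 1)
    df = len(z) - 1
    return var, chi2_cdf_even_df(df * var, df)

def r_index(pvalues, alpha: float = 0.05):
    """Replicability Index: mean observed power minus the inflation
    implied by a 100% success rate."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    powers = [1 - Z.cdf(z_a - Z.inv_cdf(1 - p / 2)) for p in pvalues]
    mean_power = sum(powers) / len(powers)
    return mean_power - (1.0 - mean_power)

ps = [0.01, 0.029, 0.05, 0.065, 0.03]
v, p_low = tiva(ps)
print(round(v, 3), round(p_low, 3))  # variance far below 1, chance ~1%
print(round(r_index(ps), 2))         # low corrected replicability
```

With these five p-values the variance of the z-scores is about 0.08 and the chance probability about 1%, which is the same qualitative verdict: the results are too consistent to be an honest record of independent tests.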
In short, power analyses and bias tests suggest that Dasgupta and Greenwald’s article contains no empirical evidence that simple experimental manipulations can produce lasting changes in implicit bias. Yet, this article suggested to other experimental social psychologists that changing IAT scores is relatively easy and worthwhile. This generated a large literature with hundreds of studies. Next we are going to examine what we can learn from 20 years of research with over 40,000 participants.
A Z-Curve Analysis of Implicit Bias Intervention Studies
Psychologists often use meta-analyses to make sense of a literature. The implicit bias literature is no exception (Forscher et al., 2019; Kurdi et al., 2019). The problem with traditional meta-analyses is that they are uninformative. Their main purpose is to claim that an effect exists and to provide an average effect size estimate that nobody cares about. Take the meta-analysis by Forscher et al. (2019) as an example. After finding as many published and unpublished studies as possible, the results were converted into effect size estimates to end up with the conclusion that
“implicit measures can be changed, but effects are often relatively weak (|ds| < .30).”
What do we do with this information? After all, Dasgupta and Greenwald (2001) reported an effect size of d > 1. Does this mean they had a more powerful manipulation, or does it mean their results were inflated by QRPs?
Traditional meta-analysis suffers from two problems. First, unlike a medical meta-analysis, where all studies administer the same treatment, meta-analyses in social psychology pool very different manipulations of implicit bias, ranging from living with a Black roommate for a semester to subliminal presentations of stimuli on a computer screen. Not surprisingly, there is evidence of heterogeneity; that is, effect sizes vary, making conclusions about the average effect size meaningless. What we really want to know is which manipulations can reliably produce the largest changes in implicit attitudes.
The next problem with this meta-analysis is that it did not differentiate between IATs. Implicit measures of attitudes towards alcohol or consumer products were treated the same as measures of implicit bias. Thus, the average results may not hold for implicit bias.
The biggest problem is that meta-analyses in psychology do not take publication bias into account. Either they do not examine it at all or, as in this case, they find evidence for publication bias but do not correct their conclusions accordingly.
“we found that procedures that directly or indirectly targeted associations, depleted mental resources, or induced goals all changed implicit measures relative to neutral procedures” (p. 541).
It is not clear whether this conclusion holds after taking publication bias into account. Meta-scientists have developed better tools to examine and correct for the influence of questionable research practices (QRPs; John et al., 2012) that inflate effect sizes. A simulation study found that z-curve is superior to several alternative methods (Brunner & Schimmack, 2020). Thus, I conducted a z-curve analysis of the literature on implicit bias interventions.
The meta-analysis by Forscher et al. (2019) was very helpful for finding studies until 2014. I also looked for newer studies that cited Dasgupta and Greenwald (2001), the seminal study in this field. I did not bother to get data from unpublished studies or dissertations. The reason is that these sources are only included in traditional meta-analyses to give the illusion that all studies were included and that there is no bias. However, original researchers who used QRPs are not going to share their failed studies. Z-curve can correct for bias in the published studies and does not require cooperation from the original researchers to correct the scientific record.
I found 214 studies with over 49,000 participants (data). Figure 1 shows the z-curve. A z-curve is a histogram of the reported test statistics converted into z-scores. Each z-score reflects the strength of evidence (effect size over sampling error) against the null hypothesis in each study. As the direction of the effect is irrelevant, all z-scores are positive.
The first notable finding is that the peak of the distribution is at z = 1.96, which corresponds to a two-sided p-value of .05. The second finding is the sharp drop from the peak to values below 1.96. The third observation is that the peak of the distribution has a density of 1.1, which is much larger than the peak density of a standard normal distribution (~ .4). All of these results together make it clear that non-significant results are missing. To quantify the amount of bias due to the use of QRPs, we can compare the observed discovery rate (the percentage of significant results) with the expected discovery rate based on the z-curve model (the grey curve is the predicted distribution without QRPs). The literature contains 74% significant results, when we would expect only 8% significant results.
Thus, there is strong evidence that QRPs undermine the credibility of this literature. In particular, p-values like those reported by Dasgupta and Greenwald (2001) are often a sign of studies with low power that required QRPs to produce a p-value below .05 (see values below the x-axis: 12% for z-scores between 2 and 2.5). However, there is also clear evidence of heterogeneity. Studies with z-scores greater than 4 are expected to replicate with 90% probability or more (again, values below the x-axis), and 6 studies are not shown because their z-scores exceeded the maximum value of 6 on the x-axis. To provide context, particle physicists require a z-score of 5 to claim major discoveries. Thus, a few studies produced credible evidence, while the bulk of studies used QRPs to achieve statistical significance in studies with low power.
There are two remarkable articles in this literature that deserve closer attention (Lai et al., 2014, 2016). Before I examine these two articles in more detail, I also conducted a z-curve analysis of the literature without these two articles to examine the credibility of typical articles in this literature.
The z-curve plot for the remaining, traditional articles in this literature looks even worse. The expected discovery rate of 7% is just above the discovery rate of 5% that would be expected from studies without any effect, simply because the alpha criterion of .05 allows for 5% false positive discoveries. Moreover, the 95% confidence interval of the expected discovery rate includes 5%, which means we cannot rule out that all of the published significant results are false positives. This is also reflected in the maximum false discovery rate of 73%, whose 95% confidence interval reaches 100%.
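The maximum false discovery rate follows from the expected discovery rate via Sorić's bound, which assumes that every non-discovery tested a true null hypothesis. A sketch of the arithmetic (the rounded EDR of 7% gives roughly 70% rather than the exact 73% reported by z-curve):

```python
def max_fdr(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate implied by a
    discovery rate edr at significance level alpha."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

print(round(max_fdr(0.07), 2))  # EDR = 7% -> bound of roughly 0.70
print(round(max_fdr(0.05), 2))  # EDR = 5% -> bound reaches 1.00 (100%)
```

At the 5% lower bound of the EDR confidence interval, the bound reaches 100%, which is why the text cannot rule out that every significant result is a false positive.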
While there may be two or three studies with credible evidence, 154 studies with nearly 20,000 participants have produced no scientific information about implicit bias. In short, like several other areas of research in experimental social psychology, implicit bias research is junk science, and the seminal study by Dasgupta and Greenwald is no exception.
Exception No 1: Lai et al. (2014)
The IAT is a popular measure of implicit bias in part because the developers of the IAT created an online site where visitors can get feedback on their (invalid) IAT scores, including race IAT scores. This website is called Project Implicit. Some visitors also volunteer to participate in studies with the IAT. This makes it possible to get large samples. Lai et al. (2014) used Project Implicit to conduct 50 studies of 18 different interventions. Each study had several hundred participants, which allows for higher power to obtain significant results and more precise effect size estimates. The next figure shows the z-curve for these 50 studies.
Visual inspection of the histogram does not show the previous steep cliff around z = 1.96. In addition, the replication rate for significant studies is high, and the lower limit of the 95% CI is still 65%. Thus, even if some minor QRPs may have produced the little bump around 1.96, this article provides credible evidence that IAT scores can be changed with some manipulations. However, it also shows that several manipulations produce hardly any effects.
Moreover, it is possible that the little bump around 1.96 is a chance finding. This can be examined by fitting z-curve to all values, including non-significant ones. Now the estimated discovery rate perfectly matches the observed discovery rate, suggesting that no QRPs were used.
In short, a single article with well-powered studies that honestly reported their results provides more information than a literature with hundreds of underpowered studies that used QRPs to publish significant results. This shows how powerful real science can be, while at the same time exposing the flaws in the way most experimental social psychologists conduct their research to this day.
Do Successful Changes of IAT scores Reveal Changes in Implicit Bias?
If we think about measures as perfect representations of constructs, any change in a measure implies that we changed the construct. However, Figure 1 showed that we need to distinguish measures and constructs. This brings up a new question: Did Lai et al. successfully change implicit biases, or did they merely change IAT scores without changing attitudes?
This question can be difficult to answer. One way to examine it would be to see whether the manipulation also influenced behavior. In the figure, a change in actual implicit bias would also produce a change in behavior, whereas the direct effect on the measure (red path) would not imply a change in behavior. However, as we saw, studies with actual behaviors require even larger samples than those used in the Project Implicit studies. So, this information is not available.
This brings us to the second exceptional study, which was also conducted by Lai and colleagues (2016). It is essentially a replication and extension of their first study. Focusing on the successful interventions in Lai et al. (2014), the authors examined whether the immediate effects would persist for a few days. First, the authors successfully replicated the immediate effects. More importantly, they failed to find significant effects a few days later, despite high power to do so. Even participants who were trained to fake the IAT did not bother to fake it again the second time. Thus, even successful interventions that change IAT scores do not seem to change implicit biases measured with the IAT.
Don’t just trust me. Even Greenwald himself has declared that there are no proven ways to change implicit bias, although he fails to explain how he obtained strong effects in his seminal study.
“Importantly, there are no such situational interventions that have been established to have durable effects on IAT measures (Lai et al., 2016)” (Rae and Greenwald, 2017).
“None of the eight effective interventions produced an effect that persisted after a delay of one or a few days. This lack of persistence was not previously known because more than 90% of prior intervention studies had considered changes only within a single experimental session (Lai et al. 2013).” (Greenwald and Lai, 2020).
In short, 20 years of research that started with strong and persistent effects in Dasgupta and Greenwald’s seminal article has produced no useful information about how to change implicit bias, despite hundreds of articles that claimed to change implicit bias successfully.
Where do we go from here?
Based on the famous saying “insanity is doing the same thing over and over again and expecting different results,” we have to declare experimental social psychologists insane. For decades they have tried to make a contribution to the understanding of prejudice by bringing White students at White universities into labs run by mostly White professors, exposing them to some stimuli, and measuring prejudice right afterwards. The only thing that has changed is that social psychologists now do even shorter studies with larger samples over the Internet. Should anybody expect that a brief manipulation can have profound effects? The only people who think this could work are social psychologists who have been deluded by inflated effect sizes in p-hacked studies into believing that even subliminal manipulations can have profound effects on prejudice. Meanwhile, racism remains a troubling reality in the United States, as the summer of 2020 made clear.
It is time to use research funding wisely and not to waste it on experimental social psychology that is more concerned with publications and citations than with effecting real change. Resources need to be invested in longitudinal studies, studies with children, and studies at workplaces with real outcome measures. Right now, this research does not attract funding because researchers who pump out five quick, p-hacked experiments get more publications, funding, and positions than researchers who do one well-designed longitudinal study that may fail to show a statistically significant result. Junk is drowning out good science. Maybe a new administration that actually cares about racial justice will allocate research money more wisely. Meanwhile, experimental social psychologists need to rethink their research practices and wonder what their real priorities are. As a group, they can either continue to do meaningless research or step up. However, they can no longer deceive themselves or others that their past research made a real contribution. Denial is not an answer, unless they want to take a place next to Trump in history. Publishing only studies that work was a big mistake. It is time to own up to it.
One popular topic in social psychology is persuasion. How can we make people believe something and change their attitudes? A number of variables influence how persuasive a message is. One of them is source credibility. When Donald Trump claims that he won the election, we may want to check what others say. If even Fox News calls the election for Biden, we may not trust Donald Trump and ignore his claim. Similarly, scientists respond to the reputation of other scientists. For example, in 2011 it was revealed that Diederik Stapel faked the data for many of his articles. The figure below shows that other scientists responded by citing these articles less.
In 2011, it also became apparent that social psychologists used other practices to publish results that cannot be replicated (OSC, 2015). These practices are known as questionable research practices, but unlike fraud they are not banned and articles that reported these results have not been retracted. As a result, social psychologists continue to cite articles with false evidence that present misleading information.
One literature that has lost credibility is research on implicit cognitions. Early on, replication failures of implicit priming undermined the assumption that social behavior is often elicited by stimuli without awareness (Kahneman, 2012). Now, research with implicit bias measures has come under scrutiny. A main problem with implicit bias measures is that they have low convergent validity with each other (Schimmack, 2019). As most of the variance in these measures is measurement error, one can only expect small effects of experimental manipulations. This means that studies with implicit measures often have low statistical power to produce replicable effects. Nevertheless, articles that use implicit measures report mostly successful results. This can only be explained with questionable research practices. Therefore, it is currently unclear whether there are any robust and replicable results in the literature with implicit bias measures.
This is even true, when an article reports several replication studies (Schimmack, 2012). Although multiple replications give the impression that a result is credible, questionable research practices undermine the trustworthiness of results. Fortunately, it is possible to examine the credibility of published results with forensic statistical tools that can reveal the use of questionable practices. Here I use these tools to examine the credibility of a five-study article that claimed implicit measures are influenced by the credibility of a source.
Consider the Source: Persuasion of Implicit Evaluations Is Moderated by Source Credibility
The article by Colin Tucker Smith, Jan De Houwer, and Brian A. Nosek reports five experimental manipulations of attitudes towards a consumer product.
“Supporting the central hypothesis of the study, source expertise significantly affected implicit preferences; participants showed a stronger implicit preference for Soltate when that information was presented by an individual “High” in expertise (M = 0.54, SD = 0.36) than “Low” in expertise (M = 0.42, SD = 0.41), t(195) = 2.24, p = .026, d = .32”
Participants indicated a stronger implicit preference for Soltate when that information was presented by an individual “High” in expertise (M = 0.60, SD = 0.33) than “Low” in expertise (M = 0.48, SD = 0.40), t(194) = 2.18, p = .031, d = 0.31
Implicit preferences were significantly affected by manipulating the level of source trustworthiness; participants indicated a stronger implicit preference for Soltate when that information was presented by an individual “High” in trustworthiness (M = 0.52, SD = 0.34) than “Low” in trustworthiness (M = 0.42, SD = 0.39), t(280) = 2.40, p = .017, d = 0.29.
Replicating Study 3, manipulating the level of source trustworthiness significantly affected implicit preferences as measured by the IAT. Participants implicitly preferred Soltate more when presented by an individual “High” in trustworthiness (M = 0.51, SD = 0.35) than “Low” in trustworthiness (M = 0.43, SD = 0.43), t(419) = 2.17, p = .031, d = 0.21.
There was a main effect of credibility, F(1, 549) = 4.43, p = .036, such that participants implicitly preferred Soltate more when presented by a source high in credibility (M = 0.62, SD = 0.36) than low in credibility (M = 0.55, SD = 0.39).
For naive readers of social psychology results, it may look impressive that the key finding was replicated in five studies. After all, the chance of a false positive result decreases exponentially with each significant result. While a single p-value less than .05 can occur by chance in 1 out of 20 studies, five significant results can only happen by chance in 1 out of 20*20*20*20*20 = 3,200,000 attempts. So, it would seem reasonable to believe that implicit attitude measures are influenced by the credibility of the source. However, when researchers use questionable practices, a p-value less than .05 is ensured and even 9 significant results do not mean that an effect is real (Bem, 2011). So, the question is whether the researchers used questionable practices to produce their results.
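The multiplication in the previous paragraph is easy to verify; this minimal sketch computes the chance of a run of false positives when every study tests a true null hypothesis.

```python
# Chance that k independent studies of a null effect all reach p < .05
alpha = 0.05

for k in (1, 5, 9):
    prob = alpha ** k
    print(f"{k} significant results by chance: 1 in {1 / prob:,.0f}")
# for k = 5 this is 1 in 3,200,000, matching the figure in the text
```

Of course, this arithmetic only holds for honestly reported, independent tests; questionable research practices break the assumption that each study has only a 5% chance of success.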
To answer this question, we can examine the probability of obtaining five very similar p-values in a row, p = .026, p = .031, p = .017, p = .031, p = .036. The Test of Insufficient Variance (TIVA) converts the p-values into z-scores and compares the variance against the sampling error of z-scores, which is 1. The variance is just 0.012. The probability of this happening by chance is 1/3287. In other words, it is extremely unlikely that five independent studies would produce such a small variance in p-values by chance.
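The TIVA computation can be reproduced in a few lines. This is a sketch, not the original analysis code: it converts the two-sided p-values to z-scores and uses the fact that the scaled sample variance of independent z-scores (true variance 1) follows a chi-square distribution.

```python
import numpy as np
from scipy import stats

# Two-sided p-values reported in the five studies
p_values = np.array([0.026, 0.031, 0.017, 0.031, 0.036])

# Convert each p-value to the corresponding standard-normal z-score
z = stats.norm.ppf(1 - p_values / 2)

# For independent z-scores with true variance 1,
# (n - 1) * s^2 follows a chi-square distribution with n - 1 df;
# the left tail gives the probability of variance this low by chance
n = len(z)
var = np.var(z, ddof=1)
prob = stats.chi2.cdf((n - 1) * var, df=n - 1)

print(f"variance = {var:.3f}")               # about 0.012
print(f"P(variance this low) = {prob:.5f}")  # about 1 in 3,300
```

The exact tail probability depends slightly on rounding in the reported p-values, which is why this sketch lands near, rather than exactly on, the 1/3287 quoted in the text.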
Another test is to compute the average observed power of the studies (Schimmack, 2012), which is 60%. We can now ask how probable it is to get five significant results in a row with a probability of 60%, which is .6*.6*.6*.6*.6 = .08. The probability is also low, but not as low as the one for the previous test. The reason is that QRPs also inflate observed power. Thus, the 60% estimate is an overestimation. A rough index of inflation is simply the difference between the 100% success rate and the inflated power estimate of 60%, which is 40%. Subtracting the inflation from the observed power index gives a replicability index of 20%. A value of 20% is also what is obtained in simulation studies where all studies are false positives (i.e., there is no real effect).
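The same five p-values can be used to reproduce the observed power estimate and the replicability index; this sketch follows the formulas described in the text.

```python
import numpy as np
from scipy import stats

p_values = np.array([0.026, 0.031, 0.017, 0.031, 0.036])
z = stats.norm.ppf(1 - p_values / 2)

# Observed power: probability that a study with this z-score
# would again exceed the 1.96 significance criterion
observed_power = stats.norm.cdf(z - 1.96).mean()   # about 0.60

# Chance of five significant results in a row at this power
prob_five_hits = observed_power ** len(z)          # about 0.08

# Inflation: gap between the 100% success rate and observed power;
# replicability index (R-index): observed power minus inflation
inflation = 1.0 - observed_power
r_index = observed_power - inflation               # about 0.20
```

As the text notes, an R-index around 20% is what simulations produce when every study is a false positive.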
So, does source credibility influence implicitly measured attitudes? We do not know. At least these five studies provide no evidence for it. However, these results do provide further evidence that consumers of IAT research should consider the source. IAT researchers have a vested interest in making you believe that implicit measures can reveal something important about you that exists outside of your awareness. This gives them power to make big claims about social behavior that benefit their careers.
However, you also need to consider the source of this blog post. I have a vested interest in showing that social psychologists are full of shit. After all, who cares about bias analyses that always show there is no bias. So, who should you believe? The answer is that you should believe the data. Is it possible to get five p-values between .05 and .005 in a row? If you disregard probability theory, you can ignore this post. If you trust probability theory, you might wonder what other results in the IAT literature you can trust. In science we don’t trust people. We trust the evidence, but only after we make sure that we are presented with credible evidence. Unfortunately, this is often not the case in psychological science, even in 2020.
Every year, some of our best undergraduate students apply to work with professors on their research projects for one year. For several years, I have worked with students to examine the credibility of psychological science. After an intensive crash course in statistics, students code published articles. The biggest challenge for them and everybody else is to find the critical statistical test that supports the main conclusion of the article. Moreover, results are often not reported sufficiently (e.g., effect sizes without sampling error or exact p-values). For students it is a good opportunity to see why good understanding of statistics is helpful in reading original research articles.
One advantage of my ROP is that it is based on secondary data. Thus, the Covid-19 pandemic didn’t impede the project. In fact, it probably helped me to get a larger number of students. In addition, zoom made it easy to meet with students to discuss critical articles one on one.
The 2020 ROP team has 13 members: Sara Al-Omani, Samanvita Bajpai, Nidal Chaudhry, Yeshoda Harry-Paul, Nouran Hashem, Memoona Maah, Andrew Sedrak, Dellannia Segreti, Yashika Shroff, Brook Tan, Ze Yearwood, Maria Zainab, and Xinyu Zhu.
The main aim of the project is to get a sense of the credibility of psychological research across the diverse areas of psychology. The reason is that actual replication initiatives have focused mostly on social and cognitive psychology, where recruitment of participants is easy and studies are easy to do (Open Science Collaboration, 2015). Despite concerns about other areas, actual replication projects are lacking due to the huge costs involved. A statistical approach has the advantage that credibility can also be assessed by simply examining the strength of evidence (signal/noise ratio) in published articles.
The team started with coding articles from 2010, the year just before the replication crisis started. The journals represent a broad range of areas in psychology with an emphasis on clinical psychology because research in clinical psychology has the most direct practical implications.
Addictive Behaviors
Cognitive Therapy and Research
Journal of Anxiety Disorders
Journal of Consulting and Clinical Psychology
Journal of Counseling Psychology
Journal of Applied Psychology
Behavioural Neuroscience
Child Development
Social Development
The test statistics are converted into z-scores as a common metric to reflect the strength of evidence against the null-hypothesis. These z-scores are then analyzed with z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020).
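The conversion works by mapping each reported test statistic to its two-sided p-value and then to the standard-normal quantile with the same p-value. A minimal sketch (the helper names are mine, not from the z-curve package):

```python
from scipy import stats

def t_to_z(t_value: float, df: int) -> float:
    """Convert a t-statistic to the z-score with the same two-sided p-value."""
    p = 2 * stats.t.sf(abs(t_value), df)       # two-tailed p-value
    return stats.norm.ppf(1 - p / 2)           # equivalent standard-normal quantile

def f_to_z(f_value: float, df1: int, df2: int) -> float:
    """Convert an F-statistic (one-tailed p) to an equivalent z-score."""
    p = stats.f.sf(f_value, df1, df2)
    return stats.norm.ppf(1 - p / 2)

# A t-value with many degrees of freedom converts to a slightly smaller z,
# because the t-distribution has heavier tails than the normal distribution
print(t_to_z(2.24, 195))
```

Since F(1, df) is the square of t(df), both helpers give the same z for the same test, which is a handy sanity check when coding mixed test statistics.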
The figure and statistical results are similar to results in social psychology (Schimmack, 2020). First, the graph shows the well-known practice in psychology to publish mostly successful studies; that is, statistically significant results with p < .05 (z > 1.96) (Sterling, 1959). Here, the observed discovery rate is 88%, but the actual discovery rate is even higher because marginally significant results (p < .10, z > 1.65) are also often interpreted as sufficient evidence to reject the null-hypothesis.
In comparison, the estimated discovery rate is much lower at 33%. The discrepancy between the observed and expected discovery rates provides clear evidence that questionable research practices were used (QRPs; John et al., 2012; Schimmack, 2014). QRPs are research practices that increase the chances of reporting a statistically significant result, including selective reporting of significant results or highlighting significant results as discoveries (Kerr et al., 1998). The presence of QRPs in psychological research in 2010 is expected, but information about the extent of QRPs is lacking. Z-curve suggests that there is massive use of QRPs to boost an actual success rate of 33% to a nearly perfect success rate in published articles. This has important implications for replication attempts. If reported results are selected to be significant from results with low power, replication studies have a low probability of being significant again.
However, the chance of replicating a significant result depends on the mean power of the studies with significant results, and selection for significance also increases the actual power of these studies (Brunner & Schimmack, 2020). The reason is that studies with higher power have a higher chance to produce significant results even without QRPs. The z-curve estimate of the expected replication rate is 52%. This would suggest that every second study could be successfully replicated. The problem with this estimate is that it assumes that exact replications are possible. However, psychological studies are difficult or impossible to replicate exactly. This may explain why the expected replication rate is higher than the success rate in actual replication studies (cf. Bartos & Schimmack, 2020). For actual replication studies, the expected discovery rate seems to be a better predictor.
In conclusion, the results for clinical psychology and other areas of psychology are similar to those for social psychology (Schimmack, 2020). This is consistent with a comparison of disciplines based on automatic extraction of all test statistics rather than hand-coding of focal hypothesis tests (Schimmack, 2020).
In the upcoming semester (aptly called the winter semester in Canada), the team will code articles from 2019 to see whether a decade of soul searching about research practices in psychology has produced notable changes. There are two possibilities. On the one hand, journals could have become more accepting of non-significant results leading to more publications of non-significant results (i.e., a decrease in the observed discovery rate). On the other hand, journals may have asked for a priori power analysis and bigger sample sizes to reduce sampling error to produce stronger evidence against the null-hypothesis (i.e., an increase in the expected discovery rate).
Until 2011, social psychologists were able to believe that they were actually doing science. They conducted studies, often rigorous experiments with random assignment, analyzed the data and reported results only when they achieved statistical significance, p < .05. This is how they were trained to do science and most of them believed that this is how science works.
However, in 2011 an article by a well-respected social psychologist changed all this. Daryl Bem published an article that showed time-reversed causal processes. Seemingly, people were able to feel the future (Bem, 2011). This article shook the foundations of social psychology because most social psychologists did not believe in paranormal phenomena. Yet, Bem presented evidence for his crazy claim in 8 out of 9 studies. The only study that did not work was with supraliminal stimuli. The other studies used subliminal stimuli, suggesting that only our unconscious self can feel the future.
Over the past decade it has become apparent that Bem and other social psychologists had misused significance testing. They only paid attention to significant results, p < .05, and ignored non-significant results, p > .05. Selective publishing of significant results means that statistical results no longer distinguished between true and false findings. Everything was significant, even time-reversed implicit priming.
Some areas of social psychology have been hit particularly hard by replication failures. Most prominently, implicit priming research has been called out as a poster child of doubt about social psychological results by Nobel Laureate Kahneman. The basic idea of implicit priming is that stimuli outside of participants’ awareness can influence their behavior. Many implicit priming studies have failed to replicate.
Ten years later, we can examine how social psychologists have responded to the growing evidence that many classic findings were obtained with questionable practices (not reporting the failures) and cannot be replicated. Unfortunately, the response is consistent with psychodynamic theories of ego-defense mechanisms and social psychologists’ own theories of motivated reasoning. For the most part, social psychologists have simply ignored the replication failures in the 2010s and continue to treat old articles as if they provide scientific insights into human behavior. For example, Bargh – a leading figure in the implicit priming world – wrote a whole book about implicit priming that does not mention replication failures and presents questionable research as if they were well-established facts (Schimmack, 2017).
Given the questionable status of implicit priming research, it is not surprising that concerns are also growing about measures that were designed to reflect individual differences in implicit cognitions (Schimmack, 2019). The measures often have low reliability (when you test yourself you get different results each time) and show low convergent validity (one measure of your unconscious feelings towards your spouse doesn’t correlate with another measure of your unconscious feelings towards your spouse). It is therefore suspicious, when researchers consistently find results with these measures because measurement error should make it difficult to get significant results all the time.
In an article from 2019 (i.e., after the replication crisis in social psychology had been well established), Hicks and McNulty make the following claims about implicit love; that is, feelings that are not reflected in self-reports of affection or marital satisfaction.
Their title is based on a classic article by Bargh and Chartrand.
Readers are not informed that the big claims made by Bargh twenty years ago have failed to be supported by empirical evidence. Especially the claim that stimuli often influence behavior without awareness lacks any credible evidence. It is therefore sad to say that social psychologists have moved on from self-deception (they thought they were doing science, but they did not) to other-deception (spreading false information knowing that credible doubts have been raised about this research). Just like it is time to reclaim humility and honesty in American political life, it is important to demand humility and honesty from American social psychologists, who are dominating social psychology.
The empirical question is whether research on implicit love has produced robust and credible results. One advantage for relationship researchers is that a lot of this research was published after Bem (2011). Thus, researchers could have improved their research practices. This could result in two outcomes. Either relationship researchers reported their results more honestly, including non-significant results when they emerged, or they increased sample sizes to ensure that small effect sizes could produce statistically significant results.
Hicks and McNulty’s (2019) narrative review makes the following claims about implicit love.
1. The frequency of various sexual behaviors was prospectively associated with automatic partner evaluations assessed with an implicit measure but not with self-reported relationship satisfaction. (Hicks, McNulty, Meltzer, & Olson, 2016).
2. Participants with less responsive partners who felt less connected to their partners during conflict-of-interest situations had more negative automatic partner attitudes at a subsequent assessment but not more negative subjective evaluations (Murray, Holmes, & Pinkus, 2010).
3. Pairing the partner with positive affect from other sources (i.e., positive words and pleasant images) can increase the positivity of automatic partner attitudes relative to a control group.
4. The frequency of orgasm during sex was associated with automatic partner attitudes, whereas sexual frequency was associated only with deliberate reports of relationship satisfaction for participants who believed frequent sex was important for relationship health.
5. More positive automatic partner attitudes have been linked to perceiving fewer problems over time (McNulty, Olson, Meltzer, & Shaffer, 2013).
6. More positive automatic partner attitudes have been linked to self-reporting fewer destructive behaviours (Murray et al., 2015).
7. More positive automatic partner attitudes have been linked to more cooperative relationship behaviors (LeBel & Campbell, 2013)
8. More positive automatic partner attitudes have been linked to displaying attitude-consistent nonverbal communication in conflict discussions (Faure et al., 2018).
9. More positive automatic partner attitudes were associated with a decreased likelihood of dissolution the following year, even after controlling for explicit relationship satisfaction (Lee, Rogge, & Reis, 2010).
10. Newlyweds’ implicit partner evaluations but not explicit satisfaction within the first few months of marriage were more predictive of their satisfaction 4 years later.
11. People with higher motivation to see their relationship in a positive light because of barriers to exiting their relationships (i.e., high levels of relationship investments and poor alternatives) demonstrated a weaker correspondence between their automatic attitudes and their relationship self-reports.
12. People with more negative automatic evaluations are less trusting of their partners when their working memory capacity is limited (Murray et al., 2011).
These claims are followed with the assurance that “these studies provide compelling evidence that automatic partner attitudes do have implications for relationship outcomes” (p. 256).
Should anybody who reads this article or similar claims in the popular media believe them? Have social psychologists improved their methods to produce more credible results over the past decade?
Fortunately, we can answer this question by examining the statistical evidence that was used to support these claims, using the z-curve method. First, all test statistics are converted into z-scores that represent the strength of evidence against the null-hypothesis (i.e., implicit love has no effect or does not exist) in each study. These z-scores are a function of the effect size and the amount of sampling error in a study (signal/noise ratio). Second, the z-scores are plotted as a histogram to show how many of the reported results provide weak or strong evidence against the null-hypothesis. The data are here for full transparency (Implicit.Love.xlsx).
The figure shows the z-curve for the 30 studies that reported usable test results. Most published z-scores are clustered just above the threshold value of 1.96 that corresponds to the .05 criterion to claim a discovery. This clustering is indicative of selecting significant results from a much larger set of analyses that produced non-significant results. The grey curve from z = 0 to 1.96 shows the predicted number of analyses that were not reported. The file drawer ratio implies that for every significant result there were 12 analyses with non-significant results.
Another way to look at the results is to compare the observed discovery rate with the expected discovery rate. The observed discovery rate is simply the percentage of studies that reported a significant result, which is 29 out of 30 or 97%. The estimated discovery rate is the average power of studies to produce a significant result. It is only 8%. This shows that social psychologists still continue to select only successes and do not report or interpret the failures. Moreover, in this small sample of studies, there is considerable uncertainty around the point estimates. The 95% confidence interval for the replication success probability includes 5%, which is not higher than chance. The complementary finding is that the maximum percentage of false positives is estimated to be 63%, but could be as high as 100%. In other words, the results make it impossible to conclude that even some of these studies produced a credible result.
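The discovery rates and the implied file drawer follow directly from these counts and estimates; a sketch using the figures quoted in the text:

```python
# Observed discovery rate: share of reported tests that were significant
significant_results = 29
total_results = 30
odr = significant_results / total_results   # about 0.97

# Expected discovery rate: z-curve estimate quoted in the text
edr = 0.08

# Implied file drawer: nonsignificant analyses per reported significant result
file_drawer_ratio = (1 - edr) / edr         # about 11.5, i.e. roughly 12

print(f"ODR = {odr:.0%}, EDR = {edr:.0%}, file drawer = {file_drawer_ratio:.1f}")
```

The gap between a 97% observed and an 8% expected discovery rate is the statistical signature of selective reporting.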
In short, the entire research on implicit love is bullshit. Ten years ago, social psychologists had the excuse that they did not know better and misused statistics because they were trained the wrong way. This excuse is wearing thin in 2020. They know better, but they continue to report misleading results and write unscientific articles. In psychology, this is called other-deception, in everyday life it is called lying. Don’t trust social psychologists. Doing so is as stupid as believing Donald Trump when he claims that he won the election.
For decades, psychologists have misused the scientific method and statistical significance testing. Instead of using significance tests to confirm or falsify theoretical predictions, they only published statistically significant results that confirmed predictions. This selection for significance undermines the ability of statistical tests to distinguish between true and false hypotheses (Sterling, 1959).
Another problem is that psychologists ignore effect sizes. Significant results in a test of the nil-hypothesis (no effect) only reject the hypothesis that the effect size is zero. It is still possible that the population effect size is so small that it has no practical significance. In the 1990s, psychologists addressed this problem by publishing standardized effect sizes. The problem is that selection for significance also inflates these effect size estimates. Thus, journals may publish effect size estimates that seem important, when the actual effect sizes are trivial.
The impressive reproducibility project (OSC, 2015) found that original effect sizes were cut in half in replication studies that did not select for significance. In other words, published effect sizes are, on average, inflated by 100%. Importantly, this average inflation applied equally to cognitive and social psychology. However, social psychology has more replication failures, which also implies larger inflation of effect sizes. Thus, most published effect sizes in social psychology are likely to provide misleading information about the actual effect sizes.
There have been some dramatic examples of effect-size inflation. Most prominently, a large literature with the ego-depletion paradigm (Baumeister et al., 1998) produced a meta-analytic mean effect size of d = .6. However, a recent replication study, organized by researchers who had published many studies with results selected for significance, produced an effect size of only d = .06 without selection for significance (Schmeichel & Vohs, 2020). It is not important whether this effect size is different from zero or not. The real null-hypothesis here is an effect size of d = .6, and d = .06 is both statistically and practically significantly different from .6. In other words, the effect sizes in studies selected for significance were dramatically inflated by about 1,000%. This means that none of the published results on ego-depletion are credible.
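The inflation percentages in this section come from a simple ratio; this sketch computes them for the two examples given.

```python
def inflation_pct(published_d: float, replication_d: float) -> float:
    """Percentage by which a published effect size exceeds the replication estimate."""
    return 100.0 * (published_d - replication_d) / replication_d

# Ego depletion: meta-analytic d = .6 versus replication d = .06
print(inflation_pct(0.60, 0.06))   # 900%, i.e. "about 1,000%" in round numbers

# OSC (2015): effect sizes cut in half implies 100% inflation
print(inflation_pct(0.50, 0.25))   # 100%
```

Note that "cut in half" and "inflated by 100%" are the same fact stated from the two directions, which is why the second call returns 100 rather than 50.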
As I pointed out in my criticism of research practices in social psychology (Schimmack, 2012), other paradigms in social psychology have produced equally shocking inflation of effect sizes.
One possible explanation is that researchers do not care about effect sizes. Researchers may not consider it unethical to use questionable research methods that inflate effect sizes as long as they are convinced that the sign of the reported effect is consistent with the sign of the true effect. For example, the theory that implicit attitudes are malleable is supported by a positive effect of experimental manipulations on the implicit association test, no matter whether the effect size is d = .8 (Dasgupta & Greenwald, 2001) or d = .08 (Joy-Gaba & Nosek, 2010), and the influence of blood glucose levels on self-control is supported by a strong correlation of r = .6 (Gailliot et al., 2007) and a weak correlation of r = .1 (Dvorak & Simons, 2009).
How have IAT researchers responded to the realization that original effect sizes may have been dramatically inflated? Not much. Citations show that the original article with the 10 times inflated effect size is still cited much more frequently than the replication study with a trivial effect size.
Closer inspection of these citations shows that implicit bias researchers continue to cite the old study as if it provided credible evidence.
Axt, Casola, and Nosek (2019) mention the new study, but do not mention the results.
“The closest are studies investigating malleability of implicit attitudes (Joy-Gaba & Nosek, 2010; Lai et al., 2014). For example, in Lai et al. (2014), priming the concept of multiculturalism was moderately effective at reducing implicit preferences for White versus Black people, but did not alter implicit preferences for White versus Hispanic people or White versus Asian people.”
Devine, Forscher, Austin, and Cox (2012) wrote:
“The reality of lingering racial disparities, combined with the empirically established links between implicit bias and pernicious discriminatory outcomes, has led to a clarion call for strategies to reduce these biases (Fiske, 1998; Smedley, Stith, & Nelson, 2003). In response, the field has witnessed an explosion of empirical efforts to reduce implicit biases (Blair, 2002). These efforts have yielded a number of easy-to-implement strategies, such as taking the perspective of stigmatized others (Galinsky & Moskowitz, 2000) and imagining counter-stereotypic examples (Blair, Ma, & Lenton, 2001; Dasgupta & Greenwald, 2001), that lead to substantial reductions in implicit bias, at least for a short time (i.e., up to 24 hours)” (p. 1268).
Lai et al. (2014) write:
“How can the expression of implicit racial preferences be reduced to mitigate subsequent discriminatory behavior? Indeed, significant progress has been made in the goal of identifying the processes underlying malleability and change in implicit evaluations (Dasgupta & Greenwald, 2001; Mitchell, Nosek, & Banaji, 2003; Olson & Fazio, 2006; Rudman, Ashmore, & Gary, 2001; for reviews, see Blair, 2002; Dasgupta, 2009; Gawronski & Bodenhausen, 2006; Gawronski & Sritharan, 2010; Lai, Hoffman, & Nosek, 2013; Sritharan & Gawronski, 2010).”
Even more problematic is the statement “Prior research demonstrates that exposure to positive Black and negative White exemplars can shift implicit racial preferences (Dasgupta & Greenwald, 2001; Joy-Gaba & Nosek, 2010).” as if d = .8 is equivalent to d = .08 (Lai et al., 2014, p. 1771)
Payne and colleagues write:
“Numerous studies have documented that performance on implicit bias tests is malleable in response to various manipulations of the context. For example, implicit racial bias scores can be shifted by interacting with an African American experimenter, listening to rap music, or looking at a photo of Denzel Washington (Dasgupta & Greenwald, 2001; Lowery, Hardin, & Sinclair, 2001; Rudman & Lee, 2002).” (p. 235).
A positive example that cites Joy-Gaba and Nosek (2010) correctly comes from an outsider.
Natalie Salmanowitz writes in the Journal of Law and the Biosciences that “a short, impersonal exposure to counterstereotypical exemplars cannot be expected to counteract a lifetime of ingrained mental associations” (p. 180).
In conclusion: science is self-correcting; IAT researchers are not self-correcting; therefore, IAT research is not science until IAT researchers are honest about the research practices that produced dramatically inflated effect sizes and irreproducible results. Open practices alone are not enough. Honesty and a commitment to pursuing the truth (rather than fame or happiness) are essential for scientific progress.
This is part 2 of a series of blog posts that introduce a “monster model” of well-being. The first part explained how to build measurement models of affect and how to predict life-satisfaction from affect (Part 1). The key finding was that happiness and sadness are strongly negatively correlated, independently contribute to the prediction of well-being, and together account for about two-thirds of the variance in well-being. This means that about one-third (33%) of the variance in well-being is not explained by feelings.
While it has been known for over two decades that well-being has unique variance that is not explained by affect (Lucas et al., 1996), and the literature on well-being has increased exponentially in the 2000s, nobody has examined predictors of the unique variance in well-being that is not explained by affect. Part 2 examines this question.
When well-being research took off in the 1960s, social scientists also started to measure well-being by asking separate questions about satisfaction with specific life domains (e.g., marriage, work, housing, income, etc.). It was commonly assumed that global life-satisfaction judgments are a summary of information about satisfaction with important domains (Andrews & Withey, 1976). However, this assumption was challenged by psychologists, who proposed that global judgments are rooted in personality and that global happiness creates a halo that colors the evaluation of life domains (Diener, 1984). Influential review articles emphasized the importance of personality and downplayed the relevance of life domains (Diener et al., 1999; Heller et al., 2004). This changed in the 2000s, when longitudinal panel studies with representative samples showed that life-events and circumstances matter a lot more than studies with small student samples had suggested (Diener, Lucas, & Scollon, 2006). However, empirical research practices did not change and domain satisfaction remained neglected in psychological research. To my knowledge, there are only two datasets that used multiple methods (self-ratings and informant ratings) to measure domain satisfaction (Schneider & Schimmack, 2010; Payne & Schimmack, 2020). One of them is the Mississauga Family study.
Having multi-method data on domain satisfaction makes it possible to add domain satisfaction as additional predictors of well-being to the model. The interesting novel question is whether domain satisfaction adds information about well-being that is not reflected in affective experiences. For example, is romantic satisfaction (marital satisfaction for parents) a predictor of life-satisfaction simply because a happy marriage makes life more pleasurable, or are there aspects of marital life that influence well-being independent of its influence on affective experiences?
Rather than just adding all domain satisfaction measures in one step, it is useful to add these measures slowly, just like in cooking you don’t just toss all ingredients into a big pot at the same time.
Let’s start with romantic satisfaction. Only social psychologists could believe that marital satisfaction has no influence on well-being (Kahneman, 2011; Strack et al., 1988), exactly because it goes against everything that makes sense and would prove grandma wrong. The boring truth is that marital satisfaction is strongly correlated with well-being measures (Heller et al., 2004).
As a first step, we can also be agnostic about the relationship between marital satisfaction and affect and simply let all predictors correlate with each other. This is simply multiple regression with unobserved variables that control for measurement error. This model had good fit, CFI = .984, RMSEA = .028.
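For readers unfamiliar with these fit indices, CFI and RMSEA are simple functions of the model and baseline (independence) chi-square statistics. The sketch below uses the standard formulas (some SEM programs use N rather than N − 1 in the RMSEA denominator) with made-up numbers, not the actual output of this model.

```python
from math import sqrt

def fit_indices(chi2_model, df_model, chi2_base, df_base, n):
    """CFI and RMSEA from model and baseline (independence) chi-squares."""
    ncp_model = max(chi2_model - df_model, 0.0)  # model non-centrality
    ncp_base = max(chi2_base - df_base, 0.0)     # baseline non-centrality
    cfi = 1.0 - ncp_model / max(ncp_base, ncp_model, 1e-12)
    rmsea = sqrt(ncp_model / (df_model * (n - 1)))
    return cfi, rmsea

# illustrative values only (not from the Mississauga Family study model)
cfi, rmsea = fit_indices(chi2_model=120, df_model=100,
                         chi2_base=2000, df_base=120, n=300)
print(f"CFI = {cfi:.3f}, RMSEA = {rmsea:.3f}")
```

The intuition: CFI compares the model's misfit to the misfit of a model with no relationships at all, while RMSEA scales the model's misfit by its degrees of freedom and sample size, so a chi-square only slightly above its degrees of freedom yields values like those reported here.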
The figure shows the autogenerated model, which looks very messy because of all the correlated residual variances among ratings by the same rater as well as between informant ratings of mothers and fathers. This is another problem for publishing this work. Reviewers may be vaguely familiar with SEM and able to follow a diagram, but complex models at some point can no longer be presented in a nice figure. When results are just a list of parameters, most psychologists lose interest very quickly (please send to a specialized journal for nerds, but these nerds are not interested in happiness).
Although a measurement model can be very messy and complex, the real hypothesis that was tested here is very simple, and the next figure shows the results. Romantic satisfaction had a significant (z = 4.99) unique effect on well-being. The unexplained variance decreased from 37% to 34%, a reduction of three percentage points.
Trying this for several domain satisfaction judgments revealed unique effects for work/academics, .3, goal-progress, .6, health, .2, housing, .4, recreation, .3, friendships, .2, and finances, .4. The only domains that did not add unique variance were weather and commute. However, just because domains predicted well-being above and beyond affect does not mean that they really contribute to well-being. The reason is that domain satisfactions are not independent of each other. It is therefore necessary to include them as simultaneous predictors of well-being.
A model with all of the significant domains and affect as predictors had good fit, CFI = .992, RMSEA = .015. The results showed that goal progress dominated as a unique predictor, .6. The only other significant (alpha = .05) predictors were housing, .2, and love, .2. The only other predictor with an effect size greater than .1 was money, p = .08. The residual variance in well-being in this model was reduced to 10%. Thus, domains make a substantial contribution to well-being above the influence of affect.
Most psychologists are used to multiple regression and would now focus on the predictors that make a unique significant contribution. However, these results depend strongly on the choice of predictors. The fact that work/academics is not a unique predictor could simply reflect that this domain is important but its influence on well-being is fully captured by the measure of goal progress. It is therefore necessary to examine the relationships among the predictor variables. The correlations are shown in the next table.
The first observation is that everything is correlated with everything at non-trivial levels (all z > 5; all r > .1). This is not just a simple response style, because the measurement model removes rater-specific biases. It is, however, possible that some response styles are shared across family members.
The second observation is that there is clearly a structure to these correlations. Some correlations are notably stronger than others. For example, a very strong correlation is obtained for work/academics and goal progress, r = .8. Thus, the fact that work/academics was not a unique predictor does not mean that it is not an important life domain. Rather, it is a strong predictor of overall goal progress, which was a strong predictor of well-being. Another high correlation exists for friendships and recreation. Adding both as unique predictors can be a problem because most of the satisfaction with recreation may overlap with the satisfaction with friendships. A simple regression model only shows these effects in the explained variance, while the unique effects can be small and non-significant.
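The consequence of such overlap is easy to demonstrate with simulated data (a hypothetical sketch, not the actual family-study data): two predictors that share most of their variance can each correlate strongly with the outcome, while their unique regression weights are much smaller than the zero-order correlations suggest.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# a shared "social leisure" factor drives both domains (hypothetical data)
shared = rng.normal(size=n)
friends = shared + 0.5 * rng.normal(size=n)
recreation = shared + 0.5 * rng.normal(size=n)
wellbeing = shared + rng.normal(size=n)  # outcome depends only on the shared part

def betas(X, y):
    """OLS standardized regression coefficients."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

X = np.column_stack([friends, recreation])
b = betas(X, wellbeing)
r = np.corrcoef(np.column_stack([X, wellbeing]).T)[2, :2]
print("zero-order r with well-being:", np.round(r, 2))
print("unique betas:                 ", np.round(b, 2))
```

Each predictor correlates around .6 with the outcome, yet the unique betas are roughly half that size, because the predictors compete for the same shared variance. In a small sample the unique effects could easily be non-significant even though both domains predict well-being.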
The first decision that has to be made is what to do with goal progress. It is not a specific life domain, but rather a broad measure that focuses on a specific type of satisfaction that arises from goal-related activities. One possibility is to simply not include it in the monster model (Payne et al., 2013; Zou et al., 2013). However, a monster model doesn’t try to be simple and easy. Moreover, excluding this predictor increased the residual variance in well-being from 10% to 17%. So, I decided to keep it, to allow specific domains to contribute to goal progress, and to let goal progress mediate the effects of these domains on life-satisfaction. The new question now is how much specific domains contribute to goal progress.
The overall model fit is not changed by modifying the relationship of goal progress to the other domains. The main finding was that work/academics was by far the strongest predictor of goal progress, .7 (z = 21). The only other significant (alpha = .05) effect was love, with a small effect size, .1 (z = 2.4). When I modified the model and let work/academics be the only predictor of goal progress, the relationship strengthened to the point that there is little discriminant validity, .9. This still leaves the question why goal progress predicted unique variance in well-being. One possibility is shared method variance or top-down effects. That is, when raters make judgments of goal progress, they partly rely on overall life-satisfaction. In fact, modification indices suggested that model fit could be improved by allowing for a residual correlation between the goal-progress and life-satisfaction factors.
The complex story for goal progress is instructive, but would be difficult to publish in a traditional journal that likes to pretend that scientists are clairvoyant and can predict everything ahead of time. Of course, most articles are written after researchers explored their data (Kerr et al., 1998). Here we see how the sausage is made, like in a restaurant where you can see the chef prepare your meal.
In short, goal progress is out after careful inspection of the correlations among the predictor variables. In this model, love (.2), work (.2), housing (.2), and money (.2) were all significant predictors (z > 3). However, remember that recreation and friendship were highly correlated. To see whether the shared variance between them adds to the prediction of well-being, it is necessary to remove one of them as a predictor and see whether the other becomes significant. Although recreation showed a slightly stronger effect size, I chose to keep friendship because recreation can happen at home and with one’s spouse, and Table 3 also shows higher correlations with other domains for recreation than for friendship. As expected, removing recreation as a predictor increased the effect size for friendship, but the effect was still small (.1) and not significant. With a sampling error of .05, we can say that in this sample friendships did not matter as much as other domains. So, neither recreation nor friendship seems to contribute to well-being (if you think that doesn’t make sense and you are going crazy because Covid-19 doesn’t let you enjoy life, please wait for the final discussion. For UofT students and their parents, work matters more than fun).
In this model, 17% of the variance in well-being is unexplained, compared to the model with only affect as predictor that left 37% unexplained. However, this does not mean that life domains explain 20% of the variance and affect explains 63%. We could reverse the order of predictors and suddenly life domains would explain the bulk of the variance in well-being. So, how important are our feelings really for well-being?
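The order-dependence of variance decomposition is a general feature of correlated predictors and can be sketched with simulated data (hypothetical numbers, not the actual model): whichever predictor enters first claims the shared variance, so each predictor's increment is much smaller than the variance it explains on its own.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# two correlated predictors (stand-ins for "affect" and "domains") with a shared cause
core = rng.normal(size=n)
affect = core + rng.normal(size=n)
domains = core + rng.normal(size=n)
wellbeing = affect + domains + rng.normal(size=n)

def r2(y, *xs):
    """R-squared of an OLS regression of y on the given predictors."""
    X = np.column_stack([np.ones(len(y)), *xs])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

total = r2(wellbeing, affect, domains)
print(f"affect first:  affect {r2(wellbeing, affect):.2f}, "
      f"domains adds {total - r2(wellbeing, affect):.2f}")
print(f"domains first: domains {r2(wellbeing, domains):.2f}, "
      f"affect adds {total - r2(wellbeing, domains):.2f}")
```

Here each predictor alone explains roughly 64% of the variance, but adds only about 21% after the other is in the model. Incremental variance explained therefore cannot settle how important feelings are relative to life domains; that requires a causal model.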
To answer this question, we have to make assumptions about the causal relationship of domain satisfaction and affect. Conceptually, affect is supposed to reflect the overall amount of happiness and sadness that people experience in their lives. This can happen at work, while hanging out with friends and while spending time at home with one’s spouse. Ideally, we would have measured affect during these different situations, but we didn’t. So, we have to assume that domain satisfaction judgments reflect how things are going in a domain and that how things are going in a domain can contribute to overall affect. Based on this reasoning, we assume that causality runs from domain satisfaction to affect, which in turn can influence well-being directly because we care about our feelings. If feelings matter, they should make a unique contribution to well-being and even sources of affect that are not influenced by life domains (e.g., affective dispositions) should contribute to life-satisfaction.
With some tricks, we can create unobserved variables that represent the variance in happiness and sadness that is not explained by life domains and then test the indirect effect of these variables on well-being (the trick is to create a residual factor and remove the default residual; the created residual factor can then be treated like any other latent variable and be included in the model's indirect-effect function). The results show virtually no effect for the unique variance in happiness, r = .05, z = 1.3, but a small and significant effect for the unique variance in sadness, r = -.19, z = 5.6. This shows that the hedonic model of well-being has some serious problems (Schimmack et al., 2002). Rather than relying on their feelings to judge their lives, individuals seem to feel happy or sad when their lives are going well in important life domains.
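The indirect effects above come from the SEM software's built-in indirect-effect function. As a sketch of what such a test computes, the product of the two path coefficients can be tested with the first-order delta method (the Sobel test); the path estimates and standard errors below are made up for illustration, not taken from the model.

```python
from math import sqrt

def sobel(a, se_a, b, se_b):
    """Indirect effect a*b and its z value (first-order delta method).
    a: path from predictor to mediator, b: path from mediator to outcome."""
    ab = a * b
    se_ab = sqrt(a**2 * se_b**2 + b**2 * se_a**2)
    return ab, ab / se_ab

# hypothetical path estimates and standard errors
ab, z = sobel(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"indirect effect = {ab:.2f}, z = {z:.2f}")
```

The product a*b is the effect transmitted through the mediator, and dividing it by its delta-method standard error yields the z statistic reported for indirect effects like the ones above.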
The final model for today shows the main results, CFI = .991, RMSEA = .016. Combined, the life domains make a strong unique contribution to well-being. Happiness alone has a small effect that is not significant in this sample. Sadness makes a reliable but also relatively small unique contribution. The important results are the effects of the domains on happiness, sadness and life-satisfaction. It is easier to make sense of these results in a table.
The numbers in the table are not very meaningful in isolation. This is another problem of univariate meta-analyses that estimate effect sizes and leave nobody knowing what to do with this information. Here we can use the numbers to see how much domains contribute to happiness, sadness, and the weighted average of domains that predicts life-satisfaction independently of affect.
One notable finding is that recreation and friendship are the strongest predictors of happiness. However, happiness was not a unique predictor of life-satisfaction judgments and these two domains also made a very small direct contribution to life-satisfaction judgments. Thus, these two domains seem to be rather unimportant for well-being. In contrast, financial satisfaction has no unique effect on affect, but a direct effect on life-satisfaction. This finding has also been observed in other studies, although with questionable measures of affect (Kahneman et al., 2006). The fact that even a multi-method model replicates this finding is notable. So, based on the present results, we would conclude that money or at least financial satisfaction matters for well-being, but friendships or health for that matter do not (see also Payne & Schimmack, 2020).
A big caveat is that this conclusion rests on the assumption that life-satisfaction judgments reflect the optimal weighting of information (Andrews & Withey, 1976). This is a strong assumption with surprisingly little support. In fact, studies with importance ratings of life domains consistently fail to confirm the prediction that more important domains are weighted more heavily, especially at the level of individuals (Rohrer & Schmukle, 2018). If we abandon this assumption, we either lose the ability to measure well-being, because it is no longer clear how life domains should be weighted, or we fall back on the assumption that well-being is nothing more than pleasure minus pain and use hedonic balance as our well-being measure. This is not an ideal solution because it makes well-being objective: scientists would be imposing a criterion on individuals that they do not actually find ideal. As a result, well-being science would give invalid advice about the predictors of well-being. Thus, for now we are stuck with the assumption that global life-satisfaction judgments are reasonably valid indicators of individuals’ typical concerns.
This concludes Part 2 of the monster model series. We covered a lot of ground today and had to deal with some of the most difficult questions in research on well-being that pushed us to consider the possibility that the most widely used measure of well-being, a simple global rating of life-satisfaction, may not be as valid as most well-being researchers think.
We also found out that individuals, at least those in this sample, seem to be much less hedonistic than I assumed twenty years ago (Schimmack et al., 2002). Twenty years ago it made sense to me that all that matters is how we are feeling, because I was very high in affect intensity (Larsen & Diener, 1987). I have come to realize that some people are relatively calm and don’t look at their lives through the affective lens. I have also become a bit calmer myself. So, I am happy to revise my earlier model and to see emotions more as guides that help us realize how things are going. Thus, they are strongly related to well-being because they reflect to some extent whether our lives are good or not. However, affect that is elicited by other sources seems to have a relatively small effect on evaluations of our lives.
Part 3 will examine how all of this fits with theories that assume heritable and highly stable personality traits like neuroticism and extraversion influence well-being. To examine this question, I will add measures of the Big Five personality traits to the monster model.