All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Unconscious Thought Theory in Decline

In the late 1980s, experimental social psychology rediscovered the unconscious. One reason might be that psychological laboratories started to use personal computers to conduct studies, which made it possible to present subliminal stimuli or measure reaction times cheaply. Another reason might have been that conscious social cognitive processes are relatively boring and easily accessible to introspection; it was difficult to find novel and eye-catching results with self-reports. The so-called implicit revolution (Greenwald & Banaji, 2017) is still going strong, but first signs of problems are visible everywhere. An article by Ap Dijksterhuis (2004) proposed that unconscious processes are better than conscious deliberation at making complex choices. This article stimulated research into unconscious thought theory.

Figure 1 shows publication and citation rates in Web of Science. Notably, the publication rate increased steeply until 2011, the year the replication crisis in social psychology started. Afterwards, publications show a slowly decreasing trend. However, citations continue to increase, suggesting that concerns about the robustness of published results have not reduced trust in this literature.

A meta-analysis and failed replication study raised concerns that many findings in this literature may be false positive results (Nieuwenstein et al., 2017). To further examine the credibility of this literature, I subjected the 220 articles in the Web of Science topic search to a z-curve analysis. I first looked for matching articles in a database of articles from 121 psychology journals that includes all major social psychology journals (Schimmack, 2022). This search retrieved 44 articles. An automatic search of these 44 articles produced 534 test statistics. A z-curve analysis of these test statistics showed 64% significant results (not counting marginally significant results, z > 1.65), but the z-curve estimate of the expected discovery rate (EDR; the average power of the studies) was only 30%. The 95% confidence interval ranges from 10% to 50% and does not include the observed discovery rate of 64%. Thus, there is clear evidence that the published rate of significant results is inflated by unscientific research practices.

An EDR of 30% implies that up to 12% of significant results could be false positives (Soric, 1989). However, due to uncertainty in the estimate of the EDR, the upper limit of the false positive rate could be as high as 49%. The main problem is that it is unclear which of the published results are false positives and which ones reflect real effects. Another problem is that selection for significance inflates effect size estimates, so actual effects are likely to be smaller than the published estimates.
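Soric's bound is simple enough to compute directly. Here is a minimal sketch in Python (the function name is my own; the formula is Soric's 1989 upper bound applied to the EDR):

def soric_max_fdr(edr, alpha=0.05):
    # Soric's (1989) upper bound on the false discovery rate implied by
    # an expected discovery rate (edr) at significance criterion alpha
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_max_fdr(0.30), 2))  # 0.12: up to 12% false positives at an EDR of 30%
print(round(soric_max_fdr(0.10), 2))  # 0.47: roughly the worst case, using the lower CI limit of the EDR

The 49% worst case quoted above presumably reflects the unrounded lower limit of the EDR confidence interval.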

One solution to this problem is to focus on results with stronger evidence against the null-hypothesis by lowering the criterion for statistical significance. Some researchers have proposed setting alpha to .005. Figure 3 shows the implications of this criterion value.
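In terms of z-scores, the stricter criterion raises the significance threshold from 1.96 to about 2.81. A small illustration (Python with scipy; the z-scores below are made up for the example, not taken from the actual data):

from scipy import stats

z_scores = [0.8, 1.2, 1.7, 2.0, 2.3, 2.9, 3.4]     # hypothetical absolute z-scores

for alpha in (0.05, 0.005):
    z_crit = stats.norm.isf(alpha / 2)              # 1.96 for .05, about 2.81 for .005
    n_sig = sum(z > z_crit for z in z_scores)
    print(f"alpha = {alpha}: z > {z_crit:.2f}, {n_sig}/{len(z_scores)} significant")

With the stricter criterion, fewer results count as discoveries, which is why the observed and expected discovery rates drop while the false discovery risk improves.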

The observed discovery rate is now only 34% because many results that were significant with alpha = .05 are no longer significant with alpha = .005. The expected discovery rate also decreases, but the more stringent criterion for significance lowers the false discovery risk to 3% and even the upper limit of the 95% confidence interval is only 20%. This suggests that most of the results with p-values below .005 reported a real effect. However, automatic extraction of test statistics does not distinguish between focal tests of unconscious thought theory and incidental tests of other hypotheses. Thus, it is unclear how many and which of these 184 significant results provide support for unconscious thought theory. The failed replication study by Nieuwenstein et al. (2017) suggests that it is not easy to find conditions under which unconscious thought is superior. In conclusion, there is presently little to no empirical support for unconscious thought theory, but research articles and literature reviews often cite the existing literature as if these studies can be trusted. The decrease in new studies suggests that it is difficult to find credible evidence.

2021 Replicability Report for the Psychology Department at the University of Toronto

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process for articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research, so the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by authors from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

University of Toronto

I used the department website to find core members of the psychology department. I counted 27 professors and 25 associate professors, which makes it one of the larger departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 19 professors and 13 associate professors who had at least 100 test statistics.

Figure 1 shows the z-curve for all 13,462 test statistics in articles published by these 32 faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,743 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (dashed blue/red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red/white line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown to present a clear picture of the observed distribution. The statistically significant results (including z > 6) make up 41% of the total area under the grey curve. This is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 69% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 69% ODR and a 41% EDR provides an estimate of the extent of selection for significance. The difference of roughly 30 percentage points is fairly large, but other departments have even bigger discrepancies. The upper limit of the 95% confidence interval for the EDR is 50%, well below the ODR. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR is similar (69% vs. 72%), but the EDR is higher (41% vs. 28%).

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 41% implies that no more than 8% of the significant results are false positives; however, the lower limit of the 95%CI of the EDR, 33%, allows for 11% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .01 reduces the point estimate of the FDR to 2%, with an upper limit of the 95% confidence interval of 4%. Thus, without any further information, readers could use this criterion to interpret results published in articles by UofT faculty members.

The University of Toronto has three distinct campuses with a joint graduate program. Faculty members are appointed to one of the campuses, and hiring and promotion decisions are made autonomously at each of the campuses. The three campuses also have different specializations. For example, clinical psychology is concentrated at the Scarborough (UTSC) campus. It is therefore interesting to examine whether results differ across the three campuses. The next figure shows the results for the University of Toronto – Mississauga (UTM) campus, home of the R-Index.

The observed discovery rate and the expected replication rate are very similar, but the point estimate of the EDR for the UTM campus is lower than for UofT in general (29% vs. 41%). The confidence intervals do overlap. Thus, it is not clear whether this is a systematic difference or just sampling error.

The results for the Scarborough campus also show a similar ODR and ERR. The point estimate of the expected discovery rate is a bit higher than for UTM and lower than for the combined analysis, but the confidence intervals overlap.

The results for the St. George campus are mostly in line with the overall results. This is partially due to the fact that researchers on this campus contributed a large number of test results. Overall, these results show that the three campuses are more similar than different from each other.

Another potential moderator is the area of research. Social psychology has been shown to be less replicable than cognitive psychology (OSC, 2015). UofT has a fairly large number of social psychologists who contributed to the z-curve (k = 13), especially on the St. George campus (k = 8). The z-curve for social psychologists at UofT is not different from the overall z-curve and the EDR is higher than for social psychologists at other universities.

The results for the other areas are based on smaller numbers of faculty members. Developmental psychology has a slightly lower EDR but the confidence interval is very wide.

There were only 4 associate or full professors in cognitive psychology with sufficient z-scores (many cognitive researchers publish in neuropsychology journals that are not yet covered). The results are similar to the overall z-curve. Thus, UofT research does not show the difference between social and cognitive psychology that is observed in general or at other universities (Schimmack, 2022).

Another possible moderator is time. Before 2011, researchers were often not aware that searching for significant p-values with many analyses inflates the risk of false positive results considerably. After 2011, some researchers have changed their research practices to increase replicability and reduce the risk of false positive results. As change takes time, I looked for articles published after 2015 to see whether UofT faculty shows signs of improved research practices. Unfortunately, this is not the case. The z-curve is similar to the z-curve for all tests.

The table below shows the meta-statistics of all 32 faculty members who provided results for the departmental z-curve. You can see the z-curve for each individual faculty member by clicking on their name.

Rank   Name ARP EDR ERR FDR

Publication Bias in the Stereotype Threat Literature

Two recent meta-analyses of stereotype threat studies found evidence of publication bias (Flore & Wicherts, 2014; Shewach et al., 2019). This blog post adds to this evidence by using a new method, called z-curve, that also quantifies the amount of publication bias. The data are based on a search for studies in Web of Science that include “stereotype threat” in the Title or Abstract. This search found 1,077 articles. Figure 1 shows that publications and citations are still increasing.

I then searched for matching articles in a database with 121 psychology journals that includes all major social psychology journals. This search yielded 256 matching articles. I then searched these 256 articles for the results of hypothesis tests. This search produced 3,872 test results that were converted into absolute z-scores as a measure of the strength of evidence against the null-hypothesis. Figure 2 shows a histogram of these z-scores that is called a z-curve plot.
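The conversion into absolute z-scores works the same way throughout these posts. A minimal sketch (Python with scipy; these helper functions are my own simplified versions, and the actual extraction pipeline handles many more cases):

from scipy import stats

def z_from_p(p_two_tailed):
    # absolute z-score that corresponds to the same two-tailed p-value
    return stats.norm.isf(p_two_tailed / 2)

def z_from_t(t, df):
    # convert a t-statistic to an absolute z-score via its two-tailed p-value
    return z_from_p(2 * stats.t.sf(abs(t), df))

def z_from_f(f_value, df1, df2):
    # convert an F-statistic to an absolute z-score via its p-value
    return z_from_p(stats.f.sf(f_value, df1, df2))

print(round(z_from_p(0.05), 2))     # 1.96, the significance criterion in the plots
print(round(z_from_t(2.5, 30), 2))  # roughly 2.4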

Visual inspection of the plot shows a clear drop in reported results at z = 1.96. This value corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This finding confirms the meta-analytic findings of publication bias. To quantify the amount of publication bias, we can compare the observed discovery rate to the expected discovery rate. The observed discovery rate is simply the percentage of statistically significant results, 2,841/3,872 = 73%. The expected discovery rate is based on fitting a model to the distribution of statistically significant results and extrapolating from these results to the expected distribution of non-significant results (i.e., the grey curve in Figure 2). The full grey curve is not shown because the mode of the density distribution exceeds the maximum value on the y-axis. The significant results make up only 16% of the area under the grey curve. This suggests that actual tests of stereotype threat effects only produce significant results in 16% of all attempts.
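Both discovery rates can be checked with a few lines of arithmetic (Python; the Soric bound computed here anticipates the next paragraph):

odr = 2841 / 3872                        # observed discovery rate
edr = 0.16                               # expected discovery rate from the z-curve fit
max_fdr = (1 / edr - 1) * 0.05 / 0.95    # Soric's (1989) upper bound on the false discovery rate
print(round(odr, 2), round(max_fdr, 2))  # 0.73 and about 0.28 (the post reports 27%, presumably from the unrounded EDR)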

The expected discovery rate can also be used to compute the maximum percentage of significant results that are false positives; that is, studies produced a significant result without a real effect. An expected discovery rate of 16% implies a false positive risk of 27%. Thus, about a quarter of published results could be false positives. The problem is that we do not know which of the published results are false positives and which ones are not. Another problem is that selection for significance also inflates effect size estimates. Thus, even real effects may be so small that they have no practical significance.

Is Terror Management Theory Immortal?

The basic idea of terror management theory is that humans are aware of their own mortality and that thoughts about one’s own death elicit fear. To cope with this fear or anxiety, humans engage in various behaviors that reduce death anxiety. To study these effects, participants in experimental studies are either asked to think about death or some other unpleasant event (e.g., dental visits). Numerous studies show statistically significant effects of these manipulations on a variety of measures.

Figure 1 shows that terror management research has grown exponentially. Although the rate of publications is leveling off, citations are still increasing exponentially.

While the growth of terror management research suggests that the theory rests on a large body of evidence, it is not clear whether this evidence is trustworthy. The reason is that psychologists have used a biased statistical procedure to test theories. When a statistically significant result is obtained, the results are written up and submitted for publication. However, when the results are not significant and do not support a prediction, they typically remain unpublished. It has been pointed out long ago that this bias can produce entire literatures with significant results in the absence of real effects (Rosenthal, 1979).

Recent advances in statistical methods make it possible to examine the strength of evidence for a theory after taking publication bias into account. To use this method, I searched Web of Science for articles with the topic “terror management”. This search retrieved 2,394 articles. I then searched for matching articles in a database of 121 psychology journals that includes all major social psychology journals (Schimmack, 2022). This search produced a list of 259 articles. I then searched these 259 articles for statistical tests and converted the results of these tests into absolute z-scores as a measure of the strength of evidence against the null-hypothesis. Figure 2 shows the z-curve plot of the 4,014 results.

The z-curve shows a peak at the criterion for statistical significance (z = 1.96 equals p = .05, two-tailed). This peak is not a natural phenomenon. It rather reflects the selective reporting of supporting evidence. Whereas the published results show 72% significant results, the z-curve model that is fitted to the distribution of significant z-scores estimates that studies had only 14% power to produce significant results. This difference between the observed discovery rate of 72% and the expected discovery rate of 14% shows that unscientific practices dramatically inflate the evidence in favor of terror management theory. This means reported effect sizes are inflated. Moreover, an expected discovery rate of 14% implies that up to 32% of the significant results could be false positive results that were obtained without any real effect. The upper limit of the 95% confidence interval even allows for 71% false positive results. The problem is that it is unclear which published results produced real findings that could be replicated. Thus, it is currently unclear how reminders of death influence human behavior.

One limitation of the method used to generate Figure 2 is the automatic extraction of all test results from articles. A better method uses hand-coding of focal hypothesis tests of terror management theory. Fortunately, an independent team of researchers conducted a hand-coding analysis of terror management studies (Chen, Benjamin, Lai, & Heine, 2022).

The results mostly confirm the results of the automated analysis. The key difference is that the selection for significance is even more evident because researchers hardly ever report non-significant results for focal hypothesis tests. The observed discovery rate of 95% is consistent with analyses by Sterling over 50 years ago (Sterling, 1959). Moreover, most of the non-significant results fall in the range between z = 1.65 (p = .10) and z = 1.96 (p = .05), values that are often presented as marginally significant evidence against the null-hypothesis. While the observed discovery rate in published articles is nearly 100%, the expected discovery rate is only 9%, and the 95%CI includes 5%, the rate expected by chance alone. Thus, the data provide no credible evidence for any terror management effects, and it is possible that 100% of the significant results are false positives without any real effect.

False Psychology Glossary

The core feature of science is that it can correct itself. Any mature science has a history with older theories, findings, or measures that were replaced with better ones. Unfortunately, psychology lacks clear evidence of progress that is marked by a graveyard of discarded concepts that failed to be supported by empirical evidence. The reason is that psychologists have used a biased statistical tool, called Null-Hypothesis Significance Testing, to test theoretical predictions. This tool only allows researchers to confirm theoretical predictions when the null-hypothesis is rejected, but it does not allow them to falsify predictions when the null-hypothesis is not rejected. Due to this asymmetry, psychology journals only publish results that support a theory and fail to report results when a prediction was not confirmed.

While this problem has been known for decades (Sterling, 1959), only some psychologists have started to do something about it in the past decade. The so-called open science movement has started to publish studies even when they fail to support existing theories, and new statistical methods have been developed to correct for publication bias and examine whether a literature is credible or not.

Another problem is that even replicable findings can be misleading when measures are invalid or biased. Psychologists have also neglected to carefully validate their measures, and many claims in the literature are distorted by measurement error.

Unfortunately, scientific self-correction is a slow process and motivated biases prevent researchers from correcting themselves. This glossary serves as a warning for consumers of psychological research (e.g., undergraduate students) that some of the information that they may encounter in books, articles, or lectures may not be based on scientific evidence. Of course, it is also possible that the information provided here is misleading and will be corrected in the future. However, the evidence presented here can alert readers to the fact that published results in a particular literature may be less credible than they appear and that unscientific practices were used to produce a literature that appears to be much stronger than the evidence actually is.

B

Behavioral Priming

Behavioral priming is the most commonly used term for a specific form of priming. The key difference between cognitive priming and behavioral priming is that cognitive priming examines the influence of stimuli that are in the focus of attention on related cognitions. This is typically done in studies in which stimuli are presented in close temporal sequence and responses to the second stimulus are recorded. For example, showing the word “hospital” speeds up identification of the word “doctor” as a word. In contrast, behavioral priming assumes that stimuli that are no longer in the focus of attention continue to influence subsequent behaviors. A classic study found that showing words related to the elderly made participants walk more slowly from one room to another. A replication failure of this study triggered the replication crisis in social psychology. During the replication crisis, it has become apparent that behavioral priming researchers used unscientific practices to provide false evidence for behavioral priming effects (Schimmack, 2017a; Schimmack, 2017b). Nobel Laureate Daniel Kahneman featured behavioral priming research in his popular book “Thinking, Fast and Slow,” but he distanced himself from this research after behavioral priming researchers were unwilling or unable to replicate their own findings (Kahneman, 2012, 2017). Most recently, Kahneman declared “behavioral priming research is effectively dead. Although the researchers never conceded, everyone now knows that it’s not a wise move for a graduate student to bet their job prospects on a priming study. The fact that social psychologists didn’t change their minds is immaterial” (Kahneman, 2022). You may hear about priming studies in social psychology with various primes (e.g., elderly priming, flag priming, goal priming, professor priming, religious priming, money priming, etc.). Although it is impossible to say that none of these findings are real, only results from pre-registered studies with large samples should be trusted. Even if behavioral priming effects can be demonstrated under controlled laboratory conditions, it is unlikely that residual activation of stimuli has a strong influence on behavior outside our awareness. This does not mean that our behavior is not influenced by previous situations. For example, slick advertising can influence our behavior, but it is much more likely that it does so with awareness (I want an iPhone because I think it is cooler) than without awareness (an iPhone ad makes you want to buy one without you knowing why you prefer an iPhone over another smartphone).

C

Construal-Level Theory

The key assumption of construal level theory is that individuals think about psychologically distant events differently than about psychologically close events and that these differences can influence their decisions, emotions, and behaviors. Self-serving meta-analyses that do not correct for publication bias suggest that hundreds of studies provide clear evidence for construal-level theory (Soderberg et al., 2022). However, meta-analyses that correct for bias show no clear evidence for construal-level effects (Maier, 2022). This finding is consistent with my own statistical analysis of the construal level literature (Schimmack, 2022). The literature shows strong evidence that unscientific practices were used to publish only results that support the theory, while hiding findings that failed to support predictions. After taking this bias into account, the published results have a high false positive risk, and it is currently unclear which findings, if any, are replicable. New evidence will emerge from a large replication project, but the results will not be known until 2024 (https://climr.org/).

E

Ego-depletion

The main hypothesis of ego-depletion theory is that exerting mental effort requires energy and that engaging in one task that requires mental energy reduces the ability to exert mental energy on a second task. Hundreds of studies have examined ego-depletion effects with simple tasks like crossing out letters or a measure of handgrip strength. Ten years after the theory was invented, it was also proposed that blood glucose levels track the energy that is required for mental effort. A string of replication failures showed that the evidence for blood glucose effects is not robust, and statistical analyses showed clear evidence that unscientific methods were used to produce the initial evidence for glucose effects; the lead author even admitted to the use of these practices (Schimmack, 2014). Even the proponents of ego-depletion effects no longer link it to glucose. More importantly, even the basic ego-depletion effect is not replicable. Two large registered replication reports, one led by key proponents of the theory, failed to produce the effect (Vohs et al., 2021). This is not surprising because statistical analyses of the published studies show that unscientific practices were used to present only significant results in support of the theory (Schimmack, 2022).

I

Implicit Bias

The main theoretical assumption in the implicit bias literature is that individuals can hold two attitudes that can be in conflict with each other (dual-attitude model). One attitude is consciously accessible and can be measured with (honest) self-reports. The other attitude is not consciously accessible and can only be measured indirectly; typically with computerized tasks like the Implicit Association Test. The key evidence to support the notion of implicit bias is that self-ratings of some attitudes are only weakly correlated with scores on implicit measures like the IAT. The key problem with this evidence is that measurement error alone can produce low correlations between two measures. In studies that correct for random and systematic measurement error, the valid variance in self-ratings and implicit measures is often highly correlated. This suggests that discrepancies between self-ratings and implicit measures are mostly due to measurement error (Schimmack, 2019).
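The attenuating effect of measurement error can be illustrated with Spearman's classic correction formula. A minimal sketch (Python; the correlation and reliability values are hypothetical and chosen only for illustration):

# Spearman: observed r = true r * sqrt(reliability_1 * reliability_2)
true_r = 0.70            # hypothetical true correlation between the two constructs
rel_self_report = 0.80   # hypothetical reliability of the self-report measure
rel_indirect = 0.25      # hypothetical reliability of the indirect (IAT-type) measure
observed_r = true_r * (rel_self_report * rel_indirect) ** 0.5
print(round(observed_r, 2))  # 0.31: a weak observed correlation despite a strong true correlation

This is why a low observed correlation between self-ratings and IAT scores cannot, by itself, establish that the two measures tap different attitudes.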

Implicit Self-Esteem

The concept of implicit self-esteem is based on theories that assume individuals have two types of self-esteem. Explicit self-esteem is consciously accessible and can be measured with honest self-reports. Implicit self-esteem is not consciously accessible and can only be measured indirectly. The most widely used indirect measure of implicit self-esteem is the self-esteem Implicit Association Test. Evidence for the distinction between implicit and explicit self-esteem rests entirely on the fact that self-ratings and IAT scores have very low correlations. However, several studies have shown that the main reason for this low correlation is that most of the variance in self-esteem IAT scores is measurement error (Schimmack, 2019). It is therefore surprising that a large literature of studies with this invalid measure has produced statistically significant results that seem to support predictions based on implicit self-esteem theory. The reason for this seemingly robust evidence is that researchers used unscientific practices to hide findings that are not consistent with predictions. This can be seen in a statistical analysis of the published studies (Schimmack, 2022). At present, credible evidence for an unconscious form of self-esteem that is hidden from honest self-reflection is lacking.

S

Serotonin-Transporter Gene

The serotonin transporter gene theory (also 5-HTTLPR, serotonin transporter polymorphism) postulated that genetic variations in the serotonin reuptake mechanism are linked to personality traits like neuroticism that are a risk factor for mood disorders. When it became possible to measure this variation in human DNA, many studies used this biological marker as a predictor of personality measures and measures of depression and anxiety. After an initial period of euphoria, replication failures showed that many of the first results could not be replicated even in studies with much larger samples. It became apparent that variations in a single gene have much smaller effects on complex traits than initial studies suggested and research on this topic decreased. This healthy self-correction of science is visible in decreasing publications and citations of the older, discredited studies. A statistical analysis of the published studies further confirms that significant results were obtained with unscientific methods that led to the selection of significant results (Schimmack, 2022). After correcting for this bias, there is little evidence that the genetic variation in the serotonin reuptake gene makes a practically significant contribution to variation in personality. The research has moved on to predicting personality from patterns of genetic variations across a large number of genes (genome-wide association studies). This correction is one of the few examples of scientific progress in psychology that is reflected in a body of false claims that have been discarded.

Stereotype Threat

The stereotype-threat literature is based on the main hypothesis that stereotypes about performance (e.g., White people can’t dance) can be elicited in situations in which individuals are under pressure to perform well (e.g., a White man on a date with a Black woman), and that activation of the stereotype impairs performance. Initially, stereotype threat researchers focused on African Americans’ performance in academic testing situations. Later, the focus shifted to women’s performance on math and STEM-related tests. Stereotype threat effects have often been used to counter biological theories of performance differences by showing that performance can also be influenced by environmental factors. The focus on the testing situation is partially explained by psychologists’ preference for experimental studies: it is easier to manipulate the testing situation experimentally than to study actual environmental influences on performance (e.g., discrimination by teachers or lack of funding in poor neighborhoods). Meta-analyses of this literature show that more recent studies with large samples have much smaller effect sizes than the initial studies with small samples (Flore & Wicherts, 2014; Shewach et al., 2019). These meta-analyses also found evidence of publication bias. A z-curve analysis confirms these findings (Schimmack, 2022). A large replication study found no evidence of stereotype-threat effects for women and math (Flore et al., 2018). It is possible that stereotype threat effects occur for some groups under some conditions, but at present there are no robust findings to show these effects. These results suggest that situational factors in testing situations are unlikely to explain performance differences in high-stakes testing.

T

Terror Management Theory

The basic idea of terror management theory is that humans are aware of their own mortality and that thoughts about one’s own death elicit fear. To cope with this fear or anxiety, humans engage in various behaviors that reduce death anxiety. To study these effects, participants in experimental studies are either asked to think about death or some other unpleasant event (e.g., dental visits). Numerous studies show statistically significant effects of these manipulations on a variety of measures (wikipedia). However, there is strong evidence that this evidence was produced with unscientific practices that prevented disconfirming evidence from being published. After correcting for this bias, the published studies lack credible evidence for terror management effects (Schimmack, 2022).

U

Unconscious Thought Theory

Unconscious thought theory assumed that unconscious processes are better at solving complex decision problems than conscious thought. Publications supporting the theory increased from 2004 to 2011, but output has decreased since then (Schimmack, 2022). A meta-analysis and failed replication study in 2017 suggested that evidence for unconscious thought theory was inconsistent and often weak, especially in bigger samples (Nieuwenstein et al., 2017). A direct examination of publication bias shows strong evidence that unscientific practices were used to publish evidence in favor of rather than against the theory (Schimmack, 2022). At present, strong evidence from pre-registered studies with positive results is lacking. Thus, the theory lacks empirical support.

Is Ego-Depletion Resistant to Replication Failures?

The hallmark of science is that it is self-correcting. While initial results may be false, replication studies can reveal that these findings cannot be reproduced. Unfortunately, psychologists disabled the self-correction mechanism of science when they decided to publish only statistically significant results (Sterling, 1959). As a result, replication failures occurred, but remained unpublished and could not correct false claims. This has created large bubbles of topics in psychology that are based on illusory evidence.

In the past decade, some areas of research have been scrutinized using registered replication reports. These projects combine the results of many labs to provide a powerful test of a hypothesis, and the results are published independent of the outcome. Ego depletion has failed in two registered replication reports. Thus, it is the most severely tested theory in psychology.

However, a look into Web of Science shows that researchers continue to publish articles on ego-depletion and that citations are still increasing despite evidence that the published studies are not credible. This shows that psychology is not a science because self-correction is a necessary feature of science.

To examine the credibility of the empirical findings in ego-depletion articles, I conducted a z-curve analysis. I looked for articles that were published in 121 psychology journals, including all leading social psychology journals (Schimmack, 2022). This search retrieved 166 matching articles. A search for test statistics in these articles produced 1,818 results of hypothesis tests that were converted into absolute z-scores as a measure of the strength of evidence against the null-hypothesis. Figure 2 shows the results of the z-curve analysis.

Visual inspection of the plot (i.e., the histogram of z-scores) shows that the most common results are just significant (z = 1.96 equals p = .05, two-tailed). This is not a natural phenomenon. The observed discovery rate of 69% significant results is inconsistent with the expected discovery rate of 13% based on the distribution of statistically significant z-scores. As a result, published effect sizes are dramatically inflated. Moreover, a low EDR of 13% implies that up to 34% of significant results could be false positives that were produced without a real effect. The 95% confidence interval ranges from 18% all the way up to 84%. Thus, it is unclear how many published results are false positives and, more importantly, which results are false positives. Not surprisingly, even key proponents of ego-depletion theory are unable to identify conditions that produce the effect (Vohs et al., 2021).

It is instructive to compare the results with those for literatures that have already been discredited like research on variations in a single candidate gene like the serotonin reuptake inhibitor gene (Schimmack, 2022). This literature shows a decline in publications and citations after it became apparent that key findings could not be replicated. The evidence for ego-depletion is just as weak, but so far literature reviews fail to take into account that convincing evidence for ego-depletion effects is lacking.

Construal Level Theory is Bullshit

I don’t know anything about construal level theory. It emerged in social psychology after I had started to pay less and less attention to the latest developments in a field that published incredible results that were unlikely to be true. After 2011, it became clear that my intuitions about the credibility of social psychology were correct (Schimmack, 2020). After a string of fraud cases and major replication failures, one would have to be a fool to trust published results in social psychology journals (OSC, 2015).

To protect myself and other consumers of social science, I have developed powerful statistical methods to examine the credibility of published results. One way to validate these tools is to apply them to literatures that have been discredited by actual replication failures, such as studies of single gene variations (Schimmack, 2022). These literatures show clear evidence that significant results were obtained with questionable methods and that non-significant results remained unpublished. Today, this literature is in decline and geneticists have increased sample sizes and improved methods to reduce the risk of false positive findings. The same cannot be said about social psychology. Most social psychologists have responded to replication failures in their fields with defiant and stoic ignorance. They continue to cite questionable results in the introduction section of their new articles and present these findings to unaware undergraduate students as if they are robust evidence from credible empirical tests of theories. The following figure shows that the construal level literature is still growing in terms of publications and citations. There is no sign of self-correction.

However, I became skeptical about the credibility of this work when I examined the replicability of published work by professors at New York University (Schimmack, 2022). In these analyses, the inventor of construal level theory had a very low ranking (17/19), and the z-curve analysis showed clear evidence that questionable practices were used to report more significant results than were actually obtained in tests of theoretical predictions. That is, the observed discovery rate of 74% is much higher than the expected discovery rate of 14%, an inflation by more than a factor of five!

To examine whether this finding generalizes to the construal level literature, I used Web of Science (WOS) to look for articles with “construal level” as a topic. This search retrieved 1,915 articles. I then looked for matching articles in a database of articles from 121 journals that cover a broad range of psychology and include most, if not all, social psychology journals. I found 200 matching articles. I then extracted the test statistics reported in these articles and converted them into absolute z-scores as a measure of the strength of evidence against the null-hypothesis. The next figure shows the z-curve for the 3,970 test statistics reported in the 200 construal level articles. The results are very similar to those for Yaacov Trope’s articles.

The key finding is that the expected discovery rate is only 14%. This is even lower than the expected discovery rate for candidate gene studies that have been discredited. An EDR of 14% implies that up to 31% of published significant results might be false positives. The upper limit of the 95% confidence interval around this point estimate is 67%. Proponents of construal level theory might argue that this still shows that at least 33% of published results are not false positives (with an error rate of 2.5%). However, it is not clear which of the published results are false positives and which ones are not. Moreover, this analysis mixes focal tests that actually test construal level theory with other statistical tests that have nothing to do with construal level theory. Thus, the false positive risk of actual tests of construal level theory could be even higher.

One solution to this problem is to lower the criterion of statistical significance to a level that ensures a low false positive risk. For the construal level literature, a criterion value of .001 is needed to do so. The question is whether any of the remaining significant results provide credible evidence for construal level effects under some specific conditions. Another problem is that the uncertainty around the point estimate increases and the upper limit of the 95%CI now reaches 100%. Thus, there is no credible evidence for construal level effects despite a large literature with many statistically significant results. When questionable research practices are used, these significant results do not provide empirical evidence for the presence of an effect.

Not everybody is going to be convinced by these statistical analyses. Fortunately, some researchers are planning a major replication project (https://climr.org/). This blog post can serve as a pre-registered prediction of the outcome of this project. While a large sample may produce a statistically significant result, p < .05, the population effect size will be negligible and much lower than the effect size estimates in studies that used questionable research practices, d < .2. This means that many of the basic findings and extensions of the theory lack empirical support. The only question is when social psychologists will eventually correct their theoretical reviews, textbooks, and lectures.

2021 Replicability Report for the Psychology Department at the University of Michigan

Introduction

Since 2011, it has been an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).

A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.

To change the incentive structure, I developed the Replicability Index, a blog that critically examines the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process for articles to be used in departmental reviews.

1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research, so the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.

2. I am using departments’ websites to identify researchers who belong to the psychology department. This eliminates articles by authors from other departments.

3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances of getting tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.

Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that
(a) only results that were found in an automatic search are included
(b) only results published in 120 journals are included (see list of journals)
(c) published significant results (p < .05) may not be a representative sample of all significant results
(d) point estimates are imprecise and can vary based on sampling error alone.

These limitations do not invalidate the results. Large differences in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).

University of Michigan

I used the department website to find core members of the psychology department. I counted 55 professors and 11 associate professors, which makes it one of the largest departments in North America. Not all researchers conduct quantitative research and report test statistics in their result sections. I limited the analysis to 29 professors and 5 associate professors who had at least 100 test statistics.

Figure 1 shows the z-curve for all 12,365 test statistics in articles published by these faculty members. I use the figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.

1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,781 z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.

2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (dashed blue/red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red/white line shows significance for p < .10, which is often used for marginal significance. Thus, there are more results that are presented as significant than the .05 criterion suggests.

3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown to present a clear picture of the observed distribution. The statistically significant results (including z > 6) make up 41% of the total area under the grey curve. This is called the expected discovery rate because it estimates the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, significant results (including z > 6) make up 72% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 72% ODR and a 41% EDR provides an estimate of the extent of selection for significance. The difference of roughly 30 percentage points is fairly large, but other departments have even bigger discrepancies. The upper limit of the 95% confidence interval for the EDR is 57%, well below the ODR. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR is similar (both 72%), but the EDR is higher (41% vs. 28%).

4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 41% implies that no more than 8% of the significant results are false positives; however, the lower limit of the 95%CI of the EDR, 28%, allows for 13% false positive results. Most readers are likely to agree that this is too high. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .01 reduces the point estimate of the FDR to 3%, with an upper limit of the 95% confidence interval of 5%. Thus, without any further information, readers could use this criterion to interpret results published in articles by University of Michigan faculty members.

Replicability varies across disciplines. Before 2015, replicability was particularly low in social psychology (Schimmack, 2021). This difference is also visible in separate analyses of researchers from different fields at the University of Michigan. For the 9 social psychologists with sufficient data, the EDR is only 27%, which also implies a higher false positive risk of 14%, but these point estimates have wide confidence intervals.

There were only six cognitive researchers with usable data. Their z-curve shows less selection bias and a higher EDR estimate than the estimate for social psychologists. The difference between social and cognitive psychologists is significant (i.e., the 95%CIs do not overlap).

However, it is notable that the z-curve overestimates the number of z-scores that are just significant (z = 2 to 2.2), while it underestimates the percentage of z-scores between 2.2 and 2.4. This may suggest that the selection model is wrong and that sometimes just-significant p-values are not published. A sensitivity (or multiverse) analysis can use different selection models. Using only z-scores above 2.2 (the vertical blue dotted line in the figure below) doesn’t change the ERR estimate much, but the EDR estimate is considerably lower, 43%, and the 95%CI goes as low as 18%. This leads to higher false discovery risks, with an upper limit of the 95%CI of 24%. Caution therefore suggests being careful with p-values greater than .01.

University of Michigan has a large group of developmental psychologists. The 9 faculty with usable data provided 4,382 test statistics. The results are better than those for social psychology, but not as good as those for cognitive psychology when the standard selection criterion is used. These results are consistent with typical differences between these disciplines that are reflected in analyses of 120 psychology journals (Schimmack, 2022).

Most of the faculty at U Michigan are full professors. Only 5 associate professors provided sufficient data for a z-curve analysis. The total number of test-statistics was 1,100.

The z-curve shows no evidence that research practices of associate professors are different from those of full professors.

Another way to look at research practices is to limit the analysis to articles published since 2016, which is the first year in which some journals show an increase in replicability (Schimmack, 2022). However, there is no notable difference from the z-curve for all years. This is in part due to the relatively (not absolutely) good performance of the University of Michigan. Other departments have a lot more room for improvement.

The table below shows the meta-statistics of all faculty members included in the departmental analysis. You can see the z-curve for each faculty member by clicking on their name.

Rank  Name  ARP  ERR  EDR  FDR
1  Patricia A. Reuter-Lorenz  78  81  76  2
2  Daniel H. Weissman  74  79  69  2
3  John Jonides  73  76  70  2
4  William J. Gehring  72  75  68  2
5  Cindy Lustig  69  73  66  3
6  Henry M. Wellman  69  73  65  3
7  Terri D. Conley  68  71  64  3
8  Allison Earl  65  72  57  4
9  Terry E. Robinson  65  69  61  3
10  Arnold K. Ho  65  72  57  4
11  Felix Warneken  63  65  60  4
12  Stephanie A. Fryberg  62  65  60  4
13  Twila Tardif  62  65  58  4
14  Martin F. Sarter  60  64  56  4
15  Ashley N. Gearhardt  57  74  41  8
16  Thad A. Polk  54  76  32  11
17  Ethan Kross  52  64  40  8
18  Susan A. Gelman  52  78  26  15
19  Julie E. Boland  51  63  39  8
20  Shinobu Kitayama  51  71  32  11
21  David Dunning  50  68  32  11
22  Christopher S. Monk  47  74  20  21
23  Priti Shah  44  70  18  24
24  Patricia J. Deldin  43  52  34  10
25  Robert M. Sellers  42  63  21  20
26  Brenda L. Volling  40  48  33  11
27  Joshua M. Ackerman  40  62  17  26
28  Abigail J. Stewart  37  49  26  15
29  Nestor L. Lopez-Duran  33  52  14  34
30  Kent C. Berridge  32  50  14  33
31  Sheryl L. Olson  32  51  12  37
32  Denise Sekaquaptewa  31  52  10  46
33  Fiona Lee  25  42  9  56

Implicit Self-Esteem: A bubble ready to burst

A hallmark characteristic of science is its ability to correct itself. When empirical observations accumulate that are inconsistent with theoretical claims, false theories are abandoned. Unfortunately, this self-correcting process can take decades because scientists are often invested in theories and are reluctant or unwilling to abandon them. This motivated bias is particularly strong among researchers who benefited from a popular theory.

Another problem that slows down the self-correction process of science is that scientists often hide evidence that contradicts their theoretical predictions. This unscientific practice is prevalent because scientific organizations that are run by scientists are unwilling to sanction these practices. As a result, consumers of scientific information (e.g., undergraduate students) are introduced to false claims with false empirical evidence.

Over the past 10 years, meta-scientists have developed powerful statistical tools to reveal the use of unscientific practices that stand in the way of scientific progress. In this blog post, I used z-curve to demonstrate that research on implicit self-esteem is unscientific and untrustworthy.

The concept of implicit self-esteem emerged in the 1990s, when social psychologists started to believe in powerful unconscious processes that guide human behavior without our knowledge. Research on implicit self-esteem became particularly popular when Anthony G. Greenwald invented a computerized task to measure implicit associations and used it to measure implicit self-esteem (Greenwald & Farnham, 2000).

Figure 1 shows the number of articles (bars) and citations (graph) for articles listed in Web of Science with the topic implicit self-esteem. Publications increased rapidly in the 2000s, and citations continue to increase, with over 7,000 citations in 2021. This shows that the scientific community is unaware of major problems with the validity of implicit measures of self-esteem that have been known since 2000 (Bosson et al., 2000; Buhrmester et al., 2011; Falk et al., 2015; Jusepeitis & Rothermund, 2022; Schimmack, 2021). Greenwald and colleagues simply ignore this evidence and perpetuate the illusion that the self-esteem IAT is a valid measure of implicit self-esteem (cf. Schimmack, 2021).

A typical argument for the validity of a measure is to point to the large number of published articles that produced statistically significant results with it. We would not expect so many findings from a measure that has no validity. However, this argument ignores that published results are selected for significance. Thus, publications give an overly positive impression of the support for a theory. To examine the extent of publication bias in the implicit self-esteem literature, I downloaded the list of 1,585 articles in Web of Science. I then looked for matching articles in a database of 121 major psychology journals (Schimmack, 2022). This search produced 604 matching articles (71 JPSP, 58 JESP, 45 Self & Identity, 44 PSPB, 40 PAID). A search for statistical test results in these articles produced 11,637 tests. Figure 2 shows a z-curve plot of these results. All tests are converted into absolute z-scores that show the strength of evidence (signal-to-noise ratio) against the null-hypothesis of no effect.
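To make the conversion step concrete, the base-R sketch below turns two common test statistics into absolute z-scores via their two-sided p-values. The numerical inputs are invented for illustration; they are not values from the implicit self-esteem database.

t_to_z <- function(t, df) {                    # t-statistic to absolute z-score
  p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # two-sided p-value
  qnorm(p / 2, lower.tail = FALSE)             # z-score with the same p-value
}
f_to_z <- function(f, df1, df2) {              # F-statistic to absolute z-score
  p <- pf(f, df1, df2, lower.tail = FALSE)
  qnorm(p / 2, lower.tail = FALSE)
}
t_to_z(2.02, 40)    # about 1.96, i.e. a result right at p = .05
f_to_z(4.08, 1, 60) # also close to the significance threshold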

The key finding is that 72% of the results are statistically significant at p < .05 (two-tailed). However, a z-curve analysis of the distribution of significant results (z > 1.96) shows that these results are selected from a much larger set of statistical results that are not reported. The expected discovery rate (EDR) is the proportion of significant results implied by the fitted model (the grey curve), which extends into the range of non-significant results that were never reported. The EDR is 20% and the upper limit of the 95% confidence interval is 28%. Thus, the observed percentage of 72% is dramatically inflated by selection for significance.
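The gap between a 72% observed discovery rate and a 20% expected discovery rate is easiest to grasp with a small simulation. The base-R sketch below uses made-up settings: it runs a batch of studies in which only some test real effects, then "publishes" every significant result but only a small fraction of the non-significant ones. With these settings, the discovery rate across all studies that were run is about 19%, but the published record shows roughly 70% significant results, the same kind of inflation that the z-curve analysis detects.

set.seed(1)
n_studies  <- 10000
has_effect <- runif(n_studies) < 0.3                     # 30% of studies test a real effect
z <- rnorm(n_studies, mean = ifelse(has_effect, 2, 0))   # test results expressed as z-scores
significant <- abs(z) > 1.96

mean(significant)                                        # discovery rate across all studies run (~0.19)

published <- significant | (runif(n_studies) < 0.10)     # all significant + 10% of non-significant results
mean(significant[published])                             # observed discovery rate in the published record (~0.70)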

The EDR can also be used to estimate the false discovery risk, that is, the risk that a significant result was obtained without a real effect. The false discovery risk is estimated at 22%, but it could be as high as 37% given the uncertainty about the actual discovery rate. Moreover, an analysis of all test statistics includes tests of hypotheses that do not involve the implicit measures. A focal analysis limited to tests of implicit self-esteem is likely to produce an even lower discovery rate and a higher false positive risk.

The key conclusion is that the published significant results cannot be used to make claims about implicit self-esteem. To make credible claims about implicit self-esteem, results need to be replicated in credible, pre-registered replication studies. It is noteworthy that Greenwald has not conducted any replication studies of his seminal validation study from 2000.

It is interesting to compare these results with the literature on the serotonin transporter gene (Schimmack, 2022). Behavioral geneticists also had a period of euphoria when it became possible to measure variation in human genes. A large literature focused on genetic variation in the gene responsible for the reuptake of serotonin because this mechanism is targeted by the treatment of mood disorders with selective serotonin reuptake inhibitors (SSRIs). After one decade, it became apparent that most results did not replicate, and interest in single gene variations decreased. This can be seen in decreasing publications and citations, an example of scientific self-correction. A z-curve analysis of this literature produced nearly identical results (EDR = 19%, FDR = 22%). The notable difference is that geneticists listened to their data and mostly abandoned this line of research. In contrast, the Web of Science statistics suggest that social psychologists are ignoring the warning signs that they are chasing a phenomenon that does not exist or that they have not been able to measure properly. Time will tell how long it will take for social psychology to correct itself.

The 5-HTTLPR Debacle

When it became possible to measure genetic variation using molecular biological methods, psychologists started to correlate genetic variation in single genes with phenotypes. After one decade of research, it became apparent that few of these results replicated because the effects of a single gene on complex human traits are at best very small and would require astronomically large sample sizes to produce replicable results. Many psychologists today think that this decade of research with thousands of findings served mainly as an example of the problems of exploratory research with small samples and small effect sizes that uses the classic significance threshold of alpha = .05 to reject the null-hypothesis.

For meta-scientists, the articles that published these results provide an opportunity to test meta-statistical methods. Here I examine research on one of the most widely studied genetic variants, the serotonin transporter-linked polymorphic region (5-HTTLPR), which has been used to predict individual differences in personality traits and gene x situation interactions.

I used Web of Science to find articles with "5-HTTLPR" in the title and found 650 articles. The citation report for these 650 articles shows an unusual decrease in citations, indicating that many researchers no longer believe in these results. It also shows that publications are decreasing.

I then searched a database of articles from 121 journals for matching articles (Schimmack, 2022) and identified 181 articles. I then used an R-program to search for statistical test results reported in these articles. One limitation of this method is that the results are not limited to statistical tests involving the serotonin transporter gene. However, chances are that this makes the results conservative, because genetic effects are likely to be smaller than the other effects examined in these articles.
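To give a flavor of the extraction step, here is a deliberately simplified base-R sketch that pulls t-test reports of the form "t(df) = value" out of a sentence with a regular expression. It only illustrates the general approach; it is not the actual program used to build the database, which handles many more test formats.

text <- "The effect was significant, t(84) = 2.31, p = .023, but the interaction was not, t(84) = 1.12, p = .27."
pattern <- "t\\([0-9]+\\) = [0-9]+\\.[0-9]+"     # matches reports like t(84) = 2.31
regmatches(text, gregexpr(pattern, text))[[1]]
# returns "t(84) = 2.31" and "t(84) = 1.12"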

The search produced 1,262 test statistics (see Figure 1).

All test statistics are converted into absolute z-scores as a common metric that shows how strong the evidence against the null-hypothesis is. The histogram shows a peak (mode) at p = .05 (z = 1.96) with a steep drop for results with p-values greater than .05 (z < 1.96). This shows that results are selected for significance. Based on a statistical model (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020) of the distribution of statistically significant results (z > 1.96), the expected discovery rate is 19%. This means that researchers have only a 19% chance of obtaining a significant result. However, due to selection for significance, the articles report 63% significant results. Thus, the observed discovery rate is three times higher than the expected discovery rate. This implies that reported effect sizes are at least two, if not three, times larger than the actual effect sizes.
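The model itself is implemented in the zcurve package for R (Bartos & Schimmack, 2021). The sketch below shows the typical workflow under the assumption that the package's zcurve() function accepts a vector of absolute z-scores and that summary() reports the expected replication and discovery rates; consult the package documentation for the exact interface. The z-scores here are simulated placeholders, not the 1,262 extracted test statistics.

# install.packages("zcurve")               # CRAN package implementing the z-curve model
library(zcurve)

set.seed(5)
z <- abs(rnorm(1262, mean = 1, sd = 1.5))  # placeholder z-scores, not the real data

fit <- zcurve(z)                           # the model is fitted to the significant z-scores (z > 1.96)
summary(fit)                               # expected replication rate and expected discovery rate
plot(fit)                                  # histogram of z-scores with the fitted grey curve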

The expected discovery rate of 19% can also be used to estimate the false discovery risk (Soric, 1989), because false discoveries become more likely as the number of discoveries decreases. With an expected discovery rate of 19%, the false discovery risk is 22%. However, due to sampling error this estimate may be too low, and the upper limit of the 95% confidence interval allows for 79% false positive results.
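Soric's bound can be written as FDR = (1/EDR - 1) * alpha / (1 - alpha). Plugging the estimated discovery rate of 19% and alpha = .05 into this formula reproduces the 22% figure; the one-line check below does the arithmetic. The same formula, applied with alpha = .005 and the discovery rate estimated at that stricter threshold, yields the much smaller risk discussed in the next paragraph.

(1 / 0.19 - 1) * 0.05 / 0.95    # = 0.224, i.e. a false discovery risk of about 22%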

One solution to this problem is to lower the criterion for statistical significance, which is what modern molecular genetics studies do today to keep the false discovery risk at an acceptable level. Setting alpha to .005 reduces the false discovery risk to 5%, but the confidence interval widens and its upper limit is 100%. Moreover, the percentage of significant results obtained with alpha = .05 (63%) is cut in half with the new alpha of .005 (32%).

The next figure shows that most of the significant results are from articles published before 2015. After 2015, only 262 test results were found. These results are more credible, with an EDR of 38%, but there is still evidence of selection for significance (ODR = 61%).

In conclusion, researchers often get carried away with new methods that produce novel findings. In these exploratory studies, it is problematic to use alpha = .05 as the standard criterion for statistical significance. Honest reporting of results would reveal that the actual discovery rate is low and that alpha = .05 produces too many false positive results. In the absence of clear scientific norms that prevent researchers from cherry-picking the results they publish, z-curve analysis can be used to detect low discovery rates and to recommend appropriate alpha levels.