
Replicability Report No. 1: Is Ego-Depletion a Replicable Effect?

Abstract

It has been a common practice in social psychology to publish only significant results.  As a result, success rates in the published literature do not provide empirical evidence for the existence of a phenomenon.  A recent meta-analysis suggested that ego-depletion is a much weaker effect than the published literature suggests and a registered replication study failed to find any evidence for it.  This article presents the results of a replicability analysis of the ego-depletion literature.  Out of 165 articles with 429 studies (total N  = 33,927),  128 (78%) showed evidence of bias and low replicability (Replicability-Index < 50%).  Closer inspection of the top 10 articles with the strongest evidence against the null-hypothesis revealed some questionable statistical analyses, and only a few articles presented replicable results.  The results of this meta-analysis show that most published findings are not replicable and that the existing literature provides no credible evidence for ego-depletion.  The discussion focuses on the need for a change in research practices and suggests a new direction for research on ego-depletion that can produce conclusive results.

INTRODUCTION

In 1998, Roy F. Baumeister and colleagues published a groundbreaking article titled “Ego Depletion: Is the Active Self a Limited Resource?”  The article stimulated research on the newly minted construct of ego-depletion.  At present, more than 150 articles and over 400 studies with more than 30,000 participants have contributed to the literature on ego-depletion.  In 2010, a meta-analysis of nearly 100 articles, 200 studies, and 10,000 participants concluded that ego-depletion is a real phenomenon with a moderate to strong effect size of six-tenths of a standard deviation (Hagger et al., 2010).

In 2011, Roy F. Baumeister and John Tierney published a popular book on ego-depletion titled “Willpower,” and Roy F. Baumeister came to be known as the leading expert on self-regulation and willpower (The Atlantic, 2012).

Everything looked as if ego-depletion research had a bright future, but five years later the future of ego-depletion research looks gloomy, and even prominent ego-depletion researchers wonder whether ego-depletion even exists (Slate, “Everything is Crumbling”, 2016).

An influential psychological theory, borne out in hundreds of experiments, may have just been debunked. How can so many scientists have been so wrong?

What Happened?

It has been known for 60 years that scientific journals tend to publish only successful studies (Sterling, 1959).  That is, when Roy F. Baumeister reported his first ego-depletion study and found that resisting the temptation to eat chocolate cookies reduced persistence on a difficult task by about 10 minutes, the results were published as a groundbreaking discovery.  However, when studies do not produce the predicted outcome, they are not published.  This bias is known as publication bias.  Every researcher knows about publication bias, but the practice is so widespread that it is not considered a serious problem.  Surely, researchers would not conduct more failed studies than successful studies and only report the successful ones.  Yes, omitting a few studies with weaker effects leads to an inflation of the effect size, but the successful studies still show the general trend.

The publication of one controversial article in the same journal that published the first ego-depletion article challenged this indifferent attitude towards publication bias. In a shocking article, Bem (2011) presented 9 successful studies demonstrating that extraverted students at Cornell University were seemingly able to foresee random events in the future. In Study 1, they seemed to be able to predict where a computer would present an erotic picture even before the computer randomly determined the location of the picture.  Although the article presented 9 successful studies and 1 marginally successful study, researchers were not convinced that extrasensory perception is a real phenomenon.  Rather, they wondered how credible the evidence in other articles is if it is possible to get 9 significant results for a phenomenon that few researchers believed to be real.  As Sterling (1959) pointed out, a 100% success rate does not provide evidence for a phenomenon if only successful studies are reported. In this case, the success rate is by definition 100% no matter whether an effect is real or not.

In the same year, Simmons et al. (2011) showed how researchers can increase their chances of getting significant results without a real effect by using a number of statistical practices that seem harmless but, in combination, can increase the chance of a false discovery by more than 1000% (from 5% to 60%).  The use of these questionable research practices has been compared to the use of doping in sports (John et al., 2012).  Researchers who use QRPs are able to produce many successful studies, but the results of these studies cannot be reproduced when other researchers repeat the reported studies without QRPs.  Skeptics wondered whether many discoveries in psychology are as incredible as Bem’s discovery of extrasensory perception; groundbreaking, spectacular, and false.  Is ego-depletion a real effect or is it an artificial product of publication bias and questionable research practices?
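To see how quickly seemingly harmless practices inflate the false-positive rate, consider a minimal simulation of just one of the practices Simmons et al. (2011) discussed, optional stopping: the researcher peeks at the data repeatedly and stops as soon as the result is significant.  The sketch below is only an illustration of the principle; the peeking schedule and all names are my own assumptions, and it does not reproduce Simmons et al.’s exact scenarios, which combined several practices to reach the 60% figure.

```python
# Sketch: optional stopping alone, with no true effect, pushes the
# false-positive rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2011)

def one_experiment(peeks=(10, 20, 30, 40, 50)):
    """Peek after every batch of participants; stop at the first p < .05."""
    n_max = max(peeks)
    control = rng.normal(size=n_max)    # true effect is zero
    treatment = rng.normal(size=n_max)  # true effect is zero
    return any(stats.ttest_ind(control[:n], treatment[:n]).pvalue < .05
               for n in peeks)

simulations = 10_000
rate = sum(one_experiment() for _ in range(simulations)) / simulations
print(f"False-positive rate with optional stopping: {rate:.2f}")  # roughly .13-.14
```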

Does Ego-Depletion Depend on Blood Glucose?

The core assumption of ego-depletion theory is that working on an effortful task requires energy and that performance decreases as energy levels decrease.  If this theory is correct, it should be possible to find a physiological correlate of this energy.  Ten years after the inception of ego-depletion theory, Baumeister and colleagues claimed to have found the biological basis of ego-depletion in an article called “Self-control relies on glucose as a limited energy source.”  (Gailliot et al., 2007).  The article had a huge impact on ego-depletion researchers and it became a common practice to measure blood-glucose levels.

Unfortunately, Baumeister and colleagues had not consulted with physiological psychologists when they developed the idea that brain processes depend on blood-glucose levels.  To maintain vital functions, the human body ensures that the brain is relatively independent of peripheral processes.  A large literature in physiological psychology suggested that inhibiting the impulse to eat delicious chocolate cookies would not lead to a measurable drop in blood glucose levels (Kurzban, 2011).

Let’s look at the numbers. A well-known statistic is that the brain, while only 2% of body weight, consumes 20% of the body’s energy. That sounds like the brain consumes a lot of calories, but if we assume a 2,400 calorie/day diet – only to make the division really easy – that’s 100 calories per hour on average, 20 of which, then, are being used by the brain. Every three minutes, then, the brain – which includes memory systems, the visual system, working memory, then emotion systems, and so on – consumes one (1) calorie. One. Yes, the brain is a greedy organ, but it’s important to keep its greediness in perspective.

But, maybe experts on physiology were just wrong and Baumeister and colleagues made another groundbreaking discovery.  After all, they presented 9 successful studies that appeared to support the glucose theory of will-power, but 9 successful studies alone provide no evidence because it is not clear how these successful studies were produced.

To answer this question, Schimmack (2012) developed a statistical test that provides information about the credibility of a set of successful studies. Experimental researchers try to hold many factors that can influence the results constant (all studies are done in the same laboratory, glucose is measured the same way, etc.).  However, there are always factors that the experimenter cannot control. These random factors make it difficult to predict the exact outcome of a study even if everything goes well and the theory is right.  To minimize the influence of these random factors, researchers need large samples, but social psychologists often use small samples where random factors can have a large influence on results.  As a result, conducting a study is a gamble and some studies will fail even if the theory is correct.  Moreover, the probability of failure increases with the number of attempts.  You may get away with playing Russian roulette once, but you cannot play forever.  Thus, eventually failed studies are expected and a 100% success rate is a sign that failed studies were simply not reported.  Schimmack (2012) was able to use the reported statistics in Gailliot et al. (2007) to demonstrate that it was very likely that the 100% success rate was only achieved by hiding failed studies or with the help of questionable research practices.

Baumeister was a reviewer of Schimmack’s manuscript and confirmed the finding that a success rate of 9 out of 9 studies was not credible.

 “My paper with Gailliot et al. (2007) is used as an illustration here. Of course, I am quite familiar with the process and history of that one. We initially submitted it with more studies, some of which had weaker results. The editor said to delete those. He wanted the paper shorter so as not to use up a lot of journal space with mediocre results. It worked: the resulting paper is shorter and stronger. Does that count as magic? The studies deleted at the editor’s request are not the only story. I am pretty sure there were other studies that did not work. Let us suppose that our hypotheses were correct and that our research was impeccable. Then several of our studies would have failed, simply given the realities of low power and random fluctuations. Is anyone surprised that those studies were not included in the draft we submitted for publication? If we had included them, certainly the editor and reviewers would have criticized them and formed a more negative impression of the paper. Let us suppose that they still thought the work deserved publication (after all, as I said, we are assuming here that the research was impeccable and the hypotheses correct). Do you think the editor would have wanted to include those studies in the published version?”

To summarize, Baumeister defends the practice of hiding failed studies with the argument that this practice is acceptable if the theory is correct.  But we do not know whether the theory is correct without looking at unbiased evidence.  Thus, his line of reasoning does not justify the practice of selectively reporting successful results, which provides biased evidence for the theory.  If we could know whether a theory is correct without data, we would not need empirical tests of the theory.  In conclusion, Baumeister’s response shows a fundamental misunderstanding of the role of empirical data in science.  Empirical results are not mere illustrations of what could happen if a theory were correct. Empirical data are supposed to provide objective evidence that a theory needs to explain.

Since my article was published, there have been several failures to replicate Gailliot et al.’s findings, and recent theoretical articles on ego-depletion no longer assume that blood glucose is the source of ego-depletion.

“Upon closer inspection notable limitations have emerged. Chief among these is the failure to replicate evidence that cognitive exertion actually lowers blood glucose levels.” (Inzlicht, Schmeichel, & Macrae, 2014, p 18).

Thus, the 9 successful studies that were selected for publication in Gailliot et al. (2007) did not illustrate an empirical fact; they created false evidence for a physiological correlate of ego-depletion that could not be replicated.  Precious research resources were wasted on a line of research that could have been avoided by consulting with experts on human physiology and by honestly examining the successful and failed studies that led to the Gailliot et al. (2007) article.

Even Baumeister agrees that the original evidence was false and that glucose is not the biological correlate of ego-depletion.

“In retrospect, even the initial evidence might have gotten a boost in significance from a fortuitous control condition. Hence at present it seems unlikely that ego depletion’s effects are caused by a shortage of glucose in the bloodstream” (Baumeister, 2014, p. 315).

Baumeister fails to mention that the initial evidence also got a boost from selection bias.

In sum, the glucose theory of ego-depletion was based on selective reporting of studies that provided misleading support for the theory and the theory lacks credible empirical support.  The failure of the glucose theory raises questions about the basic ego-depletion effect.  If researchers in this field used selective reporting and questionable research practices, the evidence for the basic effect is also likely to be biased and the effect may be difficult to replicate.

If 200 studies show ego-depletion effects, it must be real?

Psychologists have not ignored publication bias altogether.  The main solution to the problem is to conduct meta-analyses.  A meta-analysis combines information from several small studies to examine whether an effect is real.  The problem for meta-analysis is that publication bias also influences the results of a meta-analysis.  If only successful studies are published, a meta-analysis of published studies will show evidence for an effect no matter whether the effect actually exists or not.  For example, the top journal for meta-analysis, Psychological Bulletin, has published meta-analyses that provide evidence for extrasensory perception (Bem & Honorton, 1994).

To address this problem, meta-analysts have developed a number of statistical tools to detect publication bias.  The most prominent method is Egger’s regression of effect size estimates on sampling error.  A positive correlation can reveal publication bias because studies with larger sampling errors (small samples) require larger effect sizes to achieve statistical significance.  To produce these large effect sizes when the actual effect does not exist or is smaller, researchers need to hide more studies or use more questionable research practices.  As a result, these results are particularly difficult to replicate.

Although the use of these statistical methods is state of the art, the original ego-depletion meta-analysis that showed moderate to large effects did not examine the presence of publication bias (Hagger et al., 2010). This omission was corrected in a meta-analysis by Carter and McCullough (2014).

Upon reading Hagger et al. (2010), we realized that their efforts to estimate and account for the possible influence of publication bias and other small-study effects had been less than ideal, given the methods available at the time of its publication (Carter & McCullough, 2014).

The authors then used Egger regression to examine publication bias.  Moreover, they used a new method that was not available at the time of Hagger et al.’s (2010) meta-analysis to estimate the effect size of ego-depletion after correcting for the inflation caused by publication bias.

Not surprisingly, the regression analysis showed clear evidence of publication bias.  More stunning were the results of the effect size estimate after correcting for publication bias.  The bias-corrected effect size estimate was d = .25 with a 95% confidence interval ranging from d = .18 to d = .32.   Thus, even the upper limit of the confidence interval is about 50% less than the effect size estimate in the original meta-analysis without correction for publication bias.   This suggests that publication bias inflated the effect size estimate by 100% or more.  Interestingly, a similar result was obtained in the reproducibility project, where a team of psychologists replicated 100 original studies and found that published effect sizes were over 100% larger than effect sizes in the replication project (OSC, 2015).

An effect size of d = .2 is considered small.  This does not mean that the effect has no practical importance, but it raises questions about the replicability of ego-depletion results.  To obtain replicable results, researchers should plan studies so that they have an 80% chance to get significant results despite the unpredictable influence of random error.  For small effects, this implies that studies require large samples.  For the standard ego-depletion paradigm with an experimental group and a control group and an effect size of d = .2, a sample size of 788 participants is needed to achieve 80% power. However, the largest sample size in an ego-depletion study was only 501 participants.  Even to have just a 50% chance of obtaining a significant result without an inflated effect size estimate, a sample size of 388 participants is needed, and most ego-depletion studies fall short of even this requirement.  Thus, most published ego-depletion results are unlikely to replicate and future ego-depletion studies are likely to produce non-significant results.
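These sample-size figures can be checked with any standard power calculator.  Here is a minimal sketch using statsmodels; the rounding to whole participants is my own convention, so the totals come out within a few participants of the 788 and 388 quoted above.

```python
# Required total sample size for a two-group comparison with d = .2
# (alpha = .05, two-tailed) at 80% and 50% power.
import math
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
for power in (0.80, 0.50):
    n_per_group = power_analysis.solve_power(effect_size=0.2, alpha=0.05,
                                             power=power, alternative='two-sided')
    total = 2 * math.ceil(n_per_group)
    print(f"{power:.0%} power: about {total} participants in total")
# Prints roughly 788 and 386-388 total participants, in line with the figures above.
```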

In conclusion, even 100 studies with 100% successful results do not provide convincing evidence that ego-depletion exists, nor do they reveal which experimental procedures can be used to replicate the basic effect.

Replicability without Publication Bias

In response to concerns about replicability, the Association for Psychological Science created a new publication format, the registered replication report (RRR).  A team of researchers can propose a replication project.  The research proposal is peer-reviewed like a grant application.  When the project is approved, researchers conduct the studies and publish the results independent of the outcome of the project.  If the project is successful, the results confirm that the earlier findings, which were reported with publication bias, are replicable, although probably with a smaller effect size.  If the studies fail, the results suggest that the effect may not exist or that the effect size is very small.

In the fall of 2014 Hagger and Chatzisarantis announced a replication project of an ego-depletion study.

The third RRR will do so using the paradigm developed and published by Sripada, Kessler, and Jonides (2014), which is similar to that used in the original depletion experiments (Baumeister et al., 1998; Muraven et al., 1998), using only computerized versions of tasks to minimize variability across laboratories. By using preregistered replications across multiple laboratories, this RRR will allow for a precise, objective estimate of the size of the ego depletion effect.

In the end, 23 laboratories participated and the combined sample size of all studies was N = 2141.  This sample size affords an 80% probability of obtaining a significant result (p < .05, two-tailed) with an effect size of d = .12, which is below the lower limit of the confidence interval of the bias-corrected meta-analysis.  Nevertheless, the study failed to produce a statistically significant result, d = .04 with a 95% CI ranging from d = -.07 to d = .14.  Thus, the results are inconsistent with a small effect size of d = .20 and suggest that ego-depletion may not exist at all.

Ego-depletion researchers have responded to this result differently.  Michael Inzlicht, winner of a theoretical innovation prize for his work on ego-depletion, wrote:

The results of a massive replication effort, involving 24 labs (or 23, depending on how you count) and over 2,000 participants, indicates that short bouts of effortful control had no discernable effects on low-level inhibitory control. This seems to contradict two decades of research on the concept of ego depletion and the resource model of self-control. Like I said: science is brutal.

In contrast, Roy F. Baumeister questioned the outcome of this research project that provided the most comprehensive and scientific test of ego-depletion.  In a response with co-author Kathleen D. Vohs titled “A misguided effort with elusive implications,” Baumeister tries to explain why ego depletion is a real effect, despite the lack of unbiased evidence for it.

The first line of defense is to question the validity of the paradigm that was used for the replication project. The only problem is that this paradigm seemed reasonable to the editors who approved the project, researchers who participated in the project and who expected a positive result, and to Baumeister himself when he was consulted during the planning of the replication project.  In his response, Baumeister reverses his opinion about the paradigm.

In retrospect, the decision to use new, mostly untested procedures for a large replication project was foolish.

He further claims that he proposed several well-tested procedures, but that these procedures were rejected by the replication team for technical reasons.

Baumeister nominated several procedures that have been used in successful studies of ego depletion for years. But none of Baumeister’s suggestions were allowable due to the RRR restrictions that it must be done with only computerized tasks that were culturally and linguistically neutral.

Baumeister and Vohs then claim that the manipulation did not lead to ego-depletion and that it is not surprising that an unsuccessful manipulation does not produce an effect.

Signs indicate the RRR was plagued by manipulation failure — and therefore did not test ego depletion.

They then assure readers that ego-depletion is real because they have demonstrated the effect repeatedly using various experimental tasks.

For two decades we have conducted studies of ego depletion carefully and honestly, following the field’s best practices, and we find the effect over and over (as have many others in fields as far-ranging as finance to health to sports, both in the lab and large-scale field studies). There is too much evidence to dismiss based on the RRR, which after all is ultimately a single study — especially if the manipulation failed to create ego depletion.

This last statement is, however, misleading if not outright deceptive.  As noted earlier, Baumeister admitted to the practice of not publishing disconfirming evidence.  He and I disagree whether the selective publication of successful studies is honest or dishonest.  He wrote:

 “We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

So, when Baumeister and Vohs assure readers that they conducted ego-depletion research carefully and honestly, they are not saying that they reported all studies that they conducted in their labs.  The successful studies published in articles are not representative of the studies conducted in their labs.

In a response to Baumeister and Vohs, the lead authors of the replication project pointed out that the onus is on proponents of ego-depletion theory to specify experimental procedures that reliably produce the predicted effect.

The onus is on researchers to develop a clear set of paradigms that reliably evoke depletion in large samples with high power (Hagger & Chatzisarantis, 2016)

In an open email letter, I asked Baumeister and Vohs to name paradigms that could replicate a published ego-depletion effect.  They were not able or willing to name a single paradigm. Roy Baumeister’s response was “In view of your reputation as untrustworthy, dishonest, and otherwise obnoxious, i prefer not to cooperate or collaborate with you.”

I did not request to collaborate with him.  I merely asked which paradigm would be able to produce ego-depletion effects in an open and transparent replication study, given his criticism of the most rigorous replication study that he initially approved.

If an expert who invented a theory and published numerous successful studies cannot name a paradigm that will work, it suggests that he does not know which studies would work.  For each published successful study there may be unpublished, unsuccessful studies that used the same procedure, and it is not obvious which published result would actually replicate in an honest and transparent replication project.

A New Meta-Analysis of Ego-Depletion Studies:  Are there replicable effects?

Since I published the incredibility index (Schimmack, 2012) and demonstrated bias in research on glucose and ego-depletion, I have developed new and more powerful ways to reveal selection bias and questionable research practices.  I applied these methods to the large literature on ego-depletion to examine whether there are some credible ego-depletion effects and a paradigm that produces replicable effects.

The first method uses powergraphs (Schimmack, 2015) to examine selection bias and the replicability of a set of studies. To create a powergraph, original research results are converted into absolute z-scores.  A z-score shows how much evidence a study result provides against the null-hypothesis that there is no effect.  Unlike effect size measures, z-scores also contain information about the sample size (sampling error).  I therefore distinguish between meta-analysis of effect sizes and meta-analysis of evidence.  Effect size meta-analysis aims to determine the typical, average size of an effect.  Meta-analyses of evidence examine how strong the evidence for an effect (i.e., against the null-hypothesis of no effect) is.

The distribution of absolute z-scores provides important information about selection bias, questionable research practices, and replicability.  Selection bias is revealed if the distribution of z-scores shows a steep drop on the left side of the criterion for statistical significance (this is analogous to the empty space below the line for significance in a funnel plot). Questionable research practices are revealed if z-scores cluster in the area just above the significance criterion.  Replicability is estimated by fitting a weighted composite of several non-central distributions that simulate studies with different non-centrality parameters and sampling error.

A literature search retrieved 165 articles that reported 429 studies.  For each study, the most important statistical test was converted first into a two-tailed p-value and then into a z-score.  A single test statistic was used to ensure that all z-scores are statistically independent.
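For reference, this conversion only requires the cumulative distribution function of the test statistic and the inverse of the standard normal distribution.  A minimal sketch is shown below (the function names are my own); as an example it uses the F-value of the original study behind the registered replication report, which is analyzed in more detail further down.

```python
# Convert a reported test statistic into a two-tailed p-value and then into
# an absolute z-score, as described above.
from scipy import stats

def z_from_t(t_value, df):
    p = 2 * stats.t.sf(abs(t_value), df)   # two-tailed p-value
    return stats.norm.isf(p / 2)           # absolute z-score

def z_from_f(f_value, df1, df2):
    p = stats.f.sf(f_value, df1, df2)      # p-value of the F-test
    return stats.norm.isf(p / 2)

print(z_from_f(4.64, 1, 90))   # F(1,90) = 4.64  ->  z of about 2.12 (see below)
```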

Figure 1.  Powergraph for Ego Depletion (Focal Tests)

The results show clear evidence of selection bias (Figure 1).  Although there are some results below the significance criterion (z = 1.96, p < .05, two-tailed), most of these results are above z = 1.65, which corresponds to p < .10 (two-tailed) or p < .05 (one-tailed).  These results are typically reported as marginally significant and used as evidence for an effect.  There are hardly any results that fail to confirm a prediction based on ego-depletion theory.  Using z = 1.65 as the criterion, the success rate is 96%, which is typical of reported success rates in psychological journals (Sterling, 1959; Sterling et al., 1995; OSC, 2015).  The steep cliff in the powergraph shows that this success rate is due to selection bias because random error would have produced a more gradual decline with many more non-significant results.

The next observation is the tall bar just above the significance criterion with z-scores between 2 and 2.2.   This result is most likely due to questionable research practices that lead to just significant results such as optional stopping or selective dropping of outliers.

Another steep drop is observed at z-scores of 2.6.  This drop is likely due to the use of further questionable research practices such as dropping of experimental conditions, use of multiple dependent variables, or simply running multiple studies and selecting only significant results.

A rather large proportion of z-scores are in the questionable range from z = 1.96 to 2.60.  These results are unlikely to replicate. Although some studies may have reported honest results, there are too many questionable results and it is impossible to say which results are trustworthy and which results are not.  It is like getting information from a group of people where 60% are liars and 40% tell the truth.  Even though 40% are telling the truth, the information is useless without knowing who is telling the truth and who is lying.

The best bet to find replicable ego-depletion results is to focus on the largest z-scores as replicability increases with the strength of evidence (OSC, 2015). The power estimation method uses the distribution of z-scores greater than 2.6 to estimate the average power of these studies.  The estimated power is 47% with a 95% confidence interval ranging from 32% to 63%.  This result suggests that some ego-depletion studies have produced replicable results.  In the next section, I examine which studies this may be.

In sum, a state-of-the-art meta-analysis of evidence for an effect in the ego-depletion literature shows clear evidence for selection bias and the use of questionable research practices.  Many published results are essentially useless because the evidence is not credible.  However, the results also show that some studies produced replicable effects, which is consistent with Carter and McCullough’s finding that the average effect size is likely to be above zero.

What Ego-Depletion Studies Are Most Likely to Replicate?

Powergraphs are useful for large sets of heterogeneous studies.  However, they are not useful for examining the replicability of a single study or small sets of studies, such as a set of studies in a multiple-study article.  For this purpose, I developed two additional tools that detect bias in published results.

The Test of Insufficient Variance (TIVA) requires a minimum of two independent studies.  As z-scores follow a normal distribution (the normal distribution of random error), the variance of z-scores should be 1.  However, if non-significant results are omitted from reported results, the variance shrinks.  TIVA uses the standard comparison of variances to compute the probability that an observed variance of z-scores could have been obtained by sampling from a normal distribution with a variance of 1.  TIVA has been shown to reveal selection bias in Bem’s (2011) article and it is a more powerful test than the incredibility index (Schimmack, 2012).
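A minimal sketch of this variance comparison is shown below (the function name is my own).  As a usage example, plugging in the values reported further down for Baumeister et al. (2005), six studies with Var(z) = 0.15, recovers the p-value of about .02 cited there.

```python
# Test of Insufficient Variance (TIVA): under unbiased reporting the variance
# of z-scores should be about 1; selection for significance shrinks it.
from scipy import stats

def tiva(var_z, k):
    """Left-tailed chi-square test that the variance of k z-scores is below 1."""
    chi2_statistic = (k - 1) * var_z / 1.0          # variance ratio against 1
    return stats.chi2.cdf(chi2_statistic, df=k - 1)

print(round(tiva(0.15, 6), 3))   # Baumeister et al. (2005), see below: p of about .02
```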

The R-Index is based on the Incredibility Index in that it compares the success rate (percentage of significant results) with the observed statistical power of a test. However, the R-Index does not test the probability of the success rate.  Rather, it uses the observed power to predict the replicability of an exact replication study.  The R-Index has two components. The first component is the median observed power of a set of studies.  In the limit, median observed power approaches the average power of an unbiased set of exact replication studies.  However, when selection bias is present, median observed power is biased and provides an inflated estimate of true power.  The R-Index measures the extent of selection bias by means of the difference between the success rate and median observed power.  If median observed power is 75% and the success rate is 100%, the inflation rate is 25% (100 – 75 = 25).  The inflation rate is subtracted from median observed power to correct for the inflation.  The resulting replication index is not directly an estimate of power, except for the special case when power is 50% and the success rate is 100%.  When power is 50% and the success rate is 100%, median observed power increases to 75%.  In this case, the inflation correction of 25% returns the actual power of 50%.

I emphasize this special case because 50% power is also a critical point at which point a rational bet would change from betting against replication (Replicability < 50%) to betting on a successful replication (Replicability > 50%).  Thus, an R-Index of 50% suggests that a study or a set of studies produced a replicable result.  With success rates close to 100%, this criterion implies that median observed power is 75%, which corresponds to a z-score of 2.63.  Incidentally, a z-score of 2.6 also separated questionable results from more credible results in the powergraph analysis above.

It may seem problematic to use the R-Index even for a single study because the observed power of a single study is strongly influenced by random factors and observed power is by definition above 50% for a significant result. However, the R-Index provides a correction for selection bias, because a single significant result implies a 100% success rate.  Of course, it could also be an honestly reported result, but if the study was published in a field with evidence of selection bias, the R-Index provides a reasonable correction for publication bias.  To achieve an R-Index above 50%, observed power has to be greater than 75%.

This criterion has been validated with social psychology studies in the reproducibility project, where the R-Index predicted replication success with over 90% accuracy. This criterion also correctly predicted that the ego-depletion replication project would produce fewer than 50% successful replications, which it did, because the R-Index for the original study was way below 50% (F(1,90) = 4.64, p = .034, z = 2.12, OP = .56, R-Index = .12).  If this information had been available during the planning of the RRR, researchers might have opted for a paradigm with a higher chance of a successful replication.
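The numbers in parentheses above can be reproduced with a few lines of code.  The sketch below (helper names are my own) computes observed power and the R-Index for the original study behind the RRR, F(1,90) = 4.64, treating the single significant result as a 100% success rate.

```python
# Observed power and R-Index for a set of results, following the description
# above: R-Index = median observed power - inflation,
# where inflation = success rate - median observed power.
from statistics import median
from scipy import stats

def observed_power(z, crit=1.96):
    """Post-hoc power of a two-tailed test given an observed absolute z-score."""
    return stats.norm.sf(crit - z) + stats.norm.sf(crit + z)

def r_index(z_scores, crit=1.96):
    success_rate = sum(abs(z) > crit for z in z_scores) / len(z_scores)
    median_op = median(observed_power(abs(z), crit) for z in z_scores)
    inflation = success_rate - median_op
    return median_op - inflation

p = stats.f.sf(4.64, 1, 90)        # about .034
z = stats.norm.isf(p / 2)          # about 2.12
print(observed_power(z))           # about .56
print(r_index([z]))                # about .12-.13, well below the .50 criterion
```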

To identify paradigms with higher replicability, I computed the R-Index and TIVA (for articles with more than one study) for all 165 articles in the meta-analysis.  For TIVA I used p < .10 as criterion for bias and for the R-Index I used .50 as the criterion.   37 articles (22%) passed this test.  This implies that 128 (78%) showed signs of statistical bias and/or low replicability.  Below I discuss the Top 10 articles with the highest R-Index to identify paradigms that may produce a reliable ego-depletion effect.

1. Robert D. Dvorak and Jeffrey S. Simons (PSPB, 2009) [ID = 142, R-Index > .99]

This article reported a single study with an unusually large sample size for ego-depletion studies. 180 participants were randomly assigned to the conditions of a standard ego-depletion paradigm. In the control condition, participants watched an amusing video.  In the depletion condition, participants watched the same video, but they were instructed to suppress all feelings and expressions.  The dependent variable was persistence on a set of solvable and unsolvable anagrams.  The t-value in this study suggests strong evidence for an ego-depletion effect, t(178) = 5.91.  The large sample size contributes to this, but the effect size is also large, d = .88.

Interestingly, this study is an exact replication of Study 3 in the seminal ego-depletion article by Baumeister et al. (1998), which obtained a significant effect with just 30 participants and a strong effect size of d = .77, t(28) = 2.12.

The same effect was also reported in a study with 132 smokers (Heckman, Ditre, & Brandon, 2012). Smokers who were not allowed to smoke persisted longer on a figure tracing task when they could watch an emotional video normally than when they had to suppress emotional responses, t(64) = 3.15, d = .78.  The depletion effect was weaker when smokers were allowed to smoke between the video and the figure tracing task. The interaction effect was significant, F(1, 128) = 7.18.

In sum, a set of studies suggests that emotion suppression influences persistence on a subsequent task.  The existing evidence suggests that this is a rather strong effect that can be replicated across laboratories.

2. Megan Oaten, Kipling D. Williams, Andrew Jones, & Lisa Zadro (J Soc Clinical Psy, 2008) [ID = 118, R-Index > .99]

This article reports two studies that manipulated social exclusion (ostracism) under the assumption that social exclusion is ego-depleting. The dependent variable was consumption of an unhealthy food in Study 1 and drinking a healthy, but unpleasant drink in Study 2.  Both studies showed extremely strong effects of ego-depletion (Study 1: d = 2.69, t(71) = 11.48;  Study 2: d = 1.48, t(72) = 6.37).

One concern about these unusually strong effects is the transformation of the dependent variable.  The authors report that they first ranked the data and then assigned z-scores corresponding to the estimated cumulative proportion.  This is an unusual procedure and it is difficult to say whether this procedure inadvertently inflated the effect size of ego-depletion.

Interestingly, one other article used social exclusion as an ego-depletion manipulation (Baumeister et al., 2005).  This article reported six studies and TIVA showed evidence of selection bias, Var(z) = 0.15, p = .02.  Thus, the reported effect sizes in this article are likely to be inflated.  The first two studies used consumption of an unpleasant tasting drink and eating cookies, respectively, as dependent variables. The reported effect sizes were weaker than in the article by Oaten et al. (d = 1.00, d = .90).

In conclusion, there is some evidence that participants avoid displeasure and seek pleasure after social rejection. A replication study with a sufficient sample size may replicate this result with a weaker effect size.  However, even if this effect exists it is not clear that the effect is mediated by ego-depletion.

3. Kathleen D. Vohs & Ronald J. Faber (Journal of Consumer Research) [ID = 29, R-Index > .99]

This article examined the effect of several ego-depletion manipulations on purchasing behavior.  Study 1 found a weaker effect, t(33) = 2.83, than Studies 2 and 3, t(63) = 5.26, t(33) = 5.52, respectively.  One possible explanation is that the latter studies used actual purchasing behavior.  Study 2 used the White Bear paradigm and Study 3 used amplification of emotion expressions as ego-depletion manipulations.  Although statistically robust, purchasing behavior does not seem to be the best indicator of ego-depletion.  Thus, replication efforts may focus on other dependent variables that measure ego-depletion more directly.

4. Kathleen D. Vohs, Roy F. Baumeister, & Brandon J. Schmeichel (JESP, 2012/2013) [ID = 49, R-Index = .96]

This article was first published in 2012, but the results for Study 1 were misreported and a corrected version was published in 2013.  The article presents two studies with a 2 x 3 between-subject design. Study 1 had n = 13 participants per cell and Study 2 had n = 35 participants per cell.  Both studies showed an interaction between ego-depletion manipulations and manipulations of self-control beliefs. The dependent variables in both studies were the Cognitive Estimation Test and a delay of gratification task.  Results were similar for both dependent measures. I focus on the CET because it provides a more direct test of ego-depletion; that is, the draining of resources.

In the condition with limited-will-power beliefs of Study 1, the standard ego-depletion effect that compares depleted participants to a control condition was a decrease of about 6 points, from about 30 to 24 points (no exact means, standard deviations, or t-values for this contrast are provided).  The unlimited will-power condition shows a smaller decrease of about 2 points (31 vs. 29).  Study 2 replicates this pattern. In the limited-will-power condition, CET scores decreased again by 6 points from 32 to 26, and in the unlimited-will-power condition CET scores decreased by about 2 points from about 31 to 29 points.  This interaction effect would again suggest that the standard depletion effect can be reduced by manipulating participants’ beliefs.

One interesting aspect of the study was the demonstration that ego-depletion effects increase with the number of ego-depleting tasks.  Performance on the CET decreased further when participants completed 4 vs. 2 or 3 vs. 1 depleting task.  Thus, given the uncertainty about the existence of ego-depletion, it would make sense to start with a strong manipulation that compares a control condition with a condition with multiple ego-depleting tasks.

One concern about this article is the use of the CET as a measure of ego-depletion.  The task was used in only one other study by Schmeichel, Vohs, and Baumeister (2003) with a small sample of N = 37 participants.  The authors reported a just significant effect on the CET, t(35) = 2.18.  However, Vohs et al. (2013) increased the number of items from 8 to 20, which makes the measure more reliable and sensitive to experimental manipulations.

Another limitation of this study is that there was no control condition without manipulation of beliefs. It is possible that the depletion effect in this study was amplified by the limited-will-power manipulation. Thus, a simple replication of this study would not provide clear evidence for ego-depletion.  However, it would be interesting to do a replication study that examines the effect of ego-depletion on the CET without manipulation of beliefs.

In sum, this study could provide the basis for a successful demonstration of ego-depletion by comparing effects on the CET for a control condition versus a condition with multiple ego-depletion tasks.

5. Veronika Job, Carol S. Dweck, and Gregory M. Walton (Psy Science, 2010) [ID = 191, R-Index = .94]

The article by Job et al. (2010) is noteworthy for several reasons.  First, the article presented three close replications of the same effect with high t-values, ts = 3.88, 8.47, 2.62.  Based on these results, one would expect that other researchers can replicate the results.  Second, the effect is an interaction between a depletion manipulation and a subtle manipulation of theories about the effect of working on an effortful task.  Hidden among other questionnaires, participants received either items that suggested depletion (“After a strenuous mental activity your energy is depleted and you must rest to get it refueled again”) or items that suggested energy is unlimited (“Your mental stamina fuels itself; even after strenuous mental exertion you can continue doing more of it”). The pattern of the interaction effect showed that only participants who received the depletion items showed the depletion effect.  Participants who received the unlimited energy items showed no significant difference in Stroop performance.  Taken at face value, this finding would challenge depletion theory, which assumes that depletion is an involuntary response to exerting effort.

However, the study also raises questions because the authors used an unconventional statistical method to analyze their data.  Data were analyzed with a multi-level model that modeled errors as a function of factors that vary within participants over time and factors that vary between participants, including the experimental manipulations.  In an email exchange, the lead author confirmed that the model did not include random factors for between-subject variance.  A statistician assured the lead author that this was acceptable.  However, a simple computation of the standard deviation around mean accuracy levels would show that this variance is not zero.  Thus, the model artificially inflated the evidence for an effect by treating between-subject variance as within-subject variance. In a between-subject analysis, the small differences in error rates (about 5 percentage points) are unlikely to be significant.

In sum, it is doubtful that a replication study would replicate the interaction between the depletion manipulation and the implicit theory manipulation reported in Job et al. (2010) in an appropriate between-subject analysis.  Even if this result would replicate, it would not support the theory that will-power is a limited resource that is depleted by a short effortful task, because the effect can be undone with a simple manipulation of beliefs in unlimited energy.

6. Roland Imhoff, Alexander F. Schmidt, & Friederike Gerstenberg (Journal of Personality, 2014) [ID = 146, R-Index = .90]

Study 1 reports results of a standard ego-depletion paradigm with a relatively large sample (N = 123).  The ego-depletion manipulation was a Stroop task with 180 trials.  The dependent variable was consumption of chocolates (M&M).  The study reported a large effect, d = .72, and strong evidence for an ego-depletion effect, t(127) = 4.07.  The strong evidence is in part justified by the large sample size, but the standardized effect size seems a bit large for a difference of 2g in consumption, whereas the standard deviation of consumption appears a bit small (3g).  A similar study with M&M consumption as the dependent variable found a 2g difference in the opposite direction with a much larger standard deviation of 16g and no significant effect, t(48) = -0.44.

The second study produced results in line with other ego-depletion studies and did not contribute to the high R-Index of the article, t(101) = 2.59. The third study was a correlational study that examined correlates of a trait measure of ego-depletion.  Even if this correlation is replicable, it does not support the fundamental assumption of ego-depletion theory of situational effects of effort on subsequent effort.  In sum, it is unlikely that Study 1 is replicable, and the strong results may be due to misreported standard deviations.

7. Hugo J.E.M. Alberts, Carolien Martijn, & Nanne K. de Vries (JESP, 2011) [ID = 56, R-Index = .86]

This article reports the results of a single study that crossed an ego-depletion manipulation with a self-awareness priming manipulation (2 x 2 with n = 20 per cell).  The dependent variable was persistence in a hand-grip task.  Like many other handgrip studies, this study assessed handgrip persistence before and after the manipulation, which increases the statistical power to detect depletion effects.

The study found weak evidence for an ego-depletion effect, but relatively strong evidence for an interaction effect, F(1,71) = 13.00.  The conditions without priming showed a weak ego depletion effect (6s difference, d = .25).  The strong interaction effect was due to the priming conditions, where depleted participants showed an increase in persistence by 10s and participants in the control condition showed a decrease in performance by 15s.  Even if this is a replicable finding, it does not support the ego-depletion effect.  The weak evidence for ego depletion with the handgrip task is consistent with a meta-analysis of handgrip studies (Schimmack, 2015).

In short, although this study produced an R-Index above .50, closer inspection of the results shows no strong evidence for ego-depletion.

8. James M. Tyler (Human Communications Research, 2008) [ID = 131, R-Index = .82]

This article reports four studies; the first two show depletion effects after sharing intimate information with strangers.  In the depletion condition, participants were asked to answer 10 private questions in a staged video session that suggested several other people were listening.  This manipulation had strong effects on persistence in an anagram task (Study 1: d = 1.6, F(2,45) = 16.73) and the hand-grip task (Study 2: d = 1.35, F(2,40) = 11.09). Study 3 reversed tasks and showed that the crossing-E task influenced identification of complex non-verbal cues, but not simple non-verbal cues, F(1,24) = 13.44. The effect of the depletion manipulation on complex cues was very large, d = 1.93.  Study 4 crossed the social manipulation of depletion from Studies 1 and 2 with the White Bear suppression manipulation and used identification of non-verbal cues as the dependent variable.  The study showed strong evidence for an interaction effect, F(1,52) = 19.41.  The pattern of this interaction is surprising, because the White Bear suppression task showed no significant effect after not sharing intimate details, t(28) = 1.27, d = .46.  In contrast, the crossing-E task had produced a very strong effect in Study 3, d = 1.93.  The interaction was driven by a strong effect of the White Bear manipulation after sharing intimate details, t(28) = 4.62, d = 1.69.

Even though the statistical results suggest that these results are highly replicable, the small sample sizes and very large effect sizes raise some concerns about replicability.  The large effects cannot be attributed to the ego-depletion tasks or measures that have been used in many other studies that produced much weaker effects. Thus, the only theoretical explanation for these large effect sizes would be that ego-depletion has particularly strong effects on social processes.  Even if these effects could be replicated, it is not clear that ego-depletion is the mediating mechanism.  Especially the complex manipulation in the first two studies allows for multiple causal pathways.  It may also be difficult to recreate this manipulation, and a failure to replicate the results could be attributed to problems with reproducing the original procedure.  Thus, a replication of this study is unlikely to advance understanding of ego-depletion without first establishing that ego-depletion exists.

9. Brandon J. Schmeichel, Heath A. Demaree, Jennifer L. Robinson, & Jie Pu (Social Cognition, 2006) [ID = 52, R-Index = .80]

This article reported one study with an emotion regulation task. Participants in the depletion condition were instructed to exaggerate their emotional responses to a disgusting film clip.  The study used two tasks to measure ego-depletion.  One task required generation of words; the other task required generation of figures.  The article reports strong evidence in an ANOVA with both dependent variables, F(1,46) = 11.99.  Separate analyses of the means show a stronger effect for the figural task, d = .98, than for the verbal task, d = .50.

The main concern with this study is that the fluency measures were never used in any other study.  If a replication study fails, one could argue that the task is not a valid measure of ego-depletion.  However, the study shows the advantage of using multiple measures to increase statistical power (Schimmack, 2012).

10. Mark Muraven, Marylene Gagne, and Heather Rosman (JESP, 2008) [ID = 15, R-Index = .78]

Study 1 reports the results of a 2 x 2 design with N = 30 participants (~ 7.5 participants per condition).  It crossed an ego-depletion manipulation (resist eating chocolate cookies vs. radishes) with a self-affirmation manipulation.  The dependent variable was the number of errors in a vigilance task (respond to a 4 after a 6).  The results section shows some inconsistencies.  The 2 x 2 ANOVA shows strong evidence for an interaction, F(1,28) = 10.60, but the planned contrast that matches the pattern of means shows a just significant effect, F(1,28) = 5.18.  Neither of these statistics is consistent with the reported means and standard deviations, where the depleted, not affirmed group has more than twice as many errors (M = 12.25, SD = 1.63) as the depleted group with affirmation (M = 5.40, SD = 1.34). These results would imply a standardized effect size of d = 4.59.
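The implied effect size can be checked directly from the reported cell statistics; a quick sketch (assuming, as the text implies, that the reported values are standard deviations rather than standard errors):

```python
# Standardized mean difference implied by the reported means and SDs for the
# two depleted cells in Study 1.
m_not_affirmed, sd_not_affirmed = 12.25, 1.63
m_affirmed, sd_affirmed = 5.40, 1.34

pooled_sd = ((sd_not_affirmed**2 + sd_affirmed**2) / 2) ** 0.5
d = (m_not_affirmed - m_affirmed) / pooled_sd
print(round(d, 2))   # 4.59 -- an implausibly large effect for this literature
```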

Study 2 did not manipulate ego-depletion and reported a more reasonable, but also less impressive result for the self-affirmation manipulation, F(2,63) = 4.67.

Study 3 crossed an ego-depletion manipulation with a pressure manipulation.  The ego-depletion manipulation was a computerized task in which participants in the depletion condition had to type a paragraph without copying the letter E or spaces, which is more difficult than just copying a paragraph.  The pressure manipulation consisted of constant reminders to avoid making errors and to be as fast as possible.  The sample size was N = 96 (n = 24 per cell).  The dependent variable was the vigilance task from Study 1.  The evidence for a depletion effect was strong, F(1, 92) = 10.72 (z = 3.17).  However, the effect was qualified by the pressure manipulation, F(1,92) = 6.72.  There was a strong depletion effect in the pressure condition, d = .78, t(46) = 2.63, but there was no evidence for a depletion effect in the no-pressure condition, d = -.23, t(46) = 0.78.

The standard deviations in Study 3, which used the same dependent variable, were considerably wider than the standard deviations in Study 1, which explains the larger standardized effect sizes in Study 1.  With the standard deviations of Study 3, Study 1 would not have produced a significant result.

DISCUSSION AND FUTURE DIRECTIONS

The original ego-depletion article published in 1998 has spawned a large literature with over 150 articles, more than 400 studies, and a total number of over 30,000 participants. There have been numerous theoretical articles and meta-analyses of this literature.  Unfortunately, the empirical results reported in this literature are not credible because there is strong evidence that reported results are biased.  The bias makes it difficult to predict which effects are replicable. The main conclusion that can be drawn from this shaky mountain of evidence is that ego-depletion researchers have to change the way they conduct their studies and report their findings.

Importantly, this conclusion is in stark disagreement with Baumeister’s recommendations.  In a forthcoming article, he suggests that “the field has done very well with the methods and standards it has developed over recent decades,” (p. 2), and he proposes that “we should continue with business as usual” (p. 1).

Baumeister then explicitly defends the practice of selectively publishing studies that produced significant results without reporting failures to demonstrate the effect in conceptually similar studies.

Critics of the practice of running a series of small studies seem to think researchers are simply conducting multiple tests of the same hypothesis, and so they argue that it would be better to conduct one large test. Perhaps they have a point: One big study could be arguably better than a series of small ones. But they also miss the crucial point that the series of small studies is typically designed to elaborate the idea in different directions, such as by identifying boundary conditions, mediators, moderators, and extensions. The typical Study 4 is not simply another test of the same hypothesis as in Studies 1–3. Rather, each one is different. And yes, I suspect the published report may leave out a few other studies that failed. Again, though, those studies’ purpose was not primarily to provide yet another test of the same hypothesis. Instead, they sought to test another variation, such as a different manipulation, or a different possible boundary condition, or a different mediator. Indeed, often the idea that motivated Study 1 has changed so much by the time Study 5 is run that it is scarcely recognizable. (p. 2)

Baumeister overlooks that a program of research that tests novel hypotheses with new experimental procedures in small samples is most likely to produce non-significant results.  When these failures are not reported, the published significant results do not demonstrate that these studies successfully established an effect or elucidated moderating factors. The result of this program of research is a complicated pattern of results that is shaped by random error, selection bias, and weak true effects that are difficult to replicate (Figure 1).

Baumeister makes the logical mistake of assuming that the type-I error rate is reset when a study is not a direct replication and that the type-I error rate only increases for exact replications. For example, it is obvious that we should not believe that eating green jelly beans decreases the risk of cancer if 1 out of 20 studies with green jelly beans produced a significant result.  With a 5% error rate, we would expect one significant result in 20 attempts by chance alone.  Importantly, this does not change if green jelly beans showed an effect, but red, orange, purple, blue, and other jelly beans did not show an effect.  With each study, the risk of a false positive result increases, and if 1 out of 20 studies produced a significant result, the success rate is not higher than one would expect by chance alone.  It is therefore important to report all results; reporting only the one green-jelly-bean study with a significant result distorts the scientific evidence.

Baumeister overlooks the multiple comparison problem when he claims that “a series of small studies can build and refine a hypothesis much more thoroughly than a single large study.”

As the meta-analysis shows, a series of over 400 small studies with selection bias tells us very little about ego-depletion, and it remains unclear under which conditions the effect can be reliably demonstrated.  To his credit, Baumeister is humble enough to acknowledge that his sanguine view of social psychological research is biased.

In my humble and biased view, social psychology has actually done quite well. (p. 2)

Baumeister remembers fondly the days when he learned how to conduct social psychological experiments.  “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants.”  A simple power analysis with these sample sizes shows that a study with n = 10 per cell (N = 20) has the sensitivity to detect effect sizes of d = 1.32 or larger with 80% probability.  Even the biased effect size estimate for ego-depletion studies was only about half of this effect size.  Thus, a sample size of n = 10 is ridiculously low.  What about a sample size of n = 20?  It still requires an effect size of d = .91 to have an 80% chance to produce a significant result.  Maybe Baumeister thinks that it is sufficient to aim for a 50% success rate and to drop the other 50%.  An effect size of d = .64 gives researchers a 50% chance to get a significant result with N = 40.  But the meta-analysis shows that the bias-corrected effect size is less than this.  So, even n = 20 is not sufficient to demonstrate ego-depletion effects.  Does this mean the effects are too flimsy to study?
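These sensitivity figures can be verified with the same power routine used earlier; a brief sketch (the printed values are approximate):

```python
# Smallest effect size detectable with the cell sizes Baumeister mentions
# (two-group design, alpha = .05, two-tailed).
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()
print(power_analysis.solve_power(nobs1=10, alpha=0.05, power=0.80))  # d of about 1.32
print(power_analysis.solve_power(nobs1=20, alpha=0.05, power=0.80))  # d of about .91
print(power_analysis.solve_power(nobs1=20, alpha=0.05, power=0.50))  # d of about .64
```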

Inadvertently, Baumeister seems to dismiss ego-depletion effects as irrelevant if demonstrating them requires large sample sizes.

Large samples increase statistical power. Therefore, if social psychology changes to insist on large samples, many weak effects will be significant that would have failed with the traditional and smaller samples. Some of these will be important effects that only became apparent with larger samples because of the constraints on experiments. Other findings will however make a host of weak effects significant, so more minor and trivial effects will enter into the body of knowledge.

If ego-depletion effects are not as strong as the published literature suggests, but are inflated by selection bias, the real effects may be minor and trivial effects with little practical significance for the understanding of self-control in real life.

Baumeister then comes to the most controversial claim of his article, one that has produced a vehement response on social media.  He claims that a special skill called flair is needed to produce significant results with small samples.

Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure.

The need for flair also explains why some researchers fail to replicate original studies by researchers with flair.

But in that process, we have created a career niche for bad experimenters. This is an underappreciated fact about the current push for publishing failed replications. I submit that some experimenters are incompetent. In the past their careers would have stalled and failed. But today, a broadly incompetent experimenter can amass a series of impressive publications simply by failing to replicate other work and thereby publishing a series of papers that will achieve little beyond undermining our field’s ability to claim that it has accomplished anything.

Baumeister even noticed individual differences in flair among his graduate and post-doctoral students.  The measure of flair was whether students were able to present significant results to him.

Having mentored several dozen budding researchers as graduate students and postdocs, I have seen ample evidence that people’s ability to achieve success in social psychology varies. My laboratory has been working on self-regulation and ego depletion for a couple decades. Most of my advisees have been able to produce such effects, though not always on the first try. A few of them have not been able to replicate the basic effect after several tries. These failures are not evenly distributed across the group. Rather, some people simply seem to lack whatever skills and talents are needed. Their failures do not mean that the theory is wrong.

The first author of the glucose paper was the victim of a doctoral advisor who believed that one could demonstrate a correlation between blood glucose levels and behavior with samples of 20 or fewer participants.  He found a way to produce these results, but they show statistical evidence of bias, and the effort was wasted on a false theory and a program of research that could not produce evidence for or against the theory because the sample sizes were too small to show the effect even if the theory were correct.  Furthermore, it is not clear how many graduate students left Baumeister’s lab thinking that they were failures because they lacked research skills, when they had only applied the scientific method correctly.

Baumeister does not elaborate further on what distinguishes researchers with flair from those without it.  To better understand flair, I examined the seminal ego-depletion study.  In this study, 67 participants were assigned to three conditions (n = 22 per cell).  The study was advertised as a study on taste perception.  Experimenters baked chocolate cookies in a laboratory room, and the room smelled of freshly baked chocolate cookies.  Participants were seated at a table with a bowl of freshly baked cookies and a bowl of red and white radishes.  Participants were instructed to taste either the radishes or the chocolate cookies.  They were then told that they had to wait at least 15 minutes to allow the sensory memory of the food to fade.  During this time, they were asked to work on an unrelated task: a figure-tracing puzzle with two unsolvable puzzles.  Participants were told that they could take as much time and as many trials as they wanted, that they would not be judged on the number of trials or the time they took, but that they would be judged on whether or not they finished the task.  However, if they wished to stop without finishing, they could ring a bell to notify the experimenter.  The time spent on this task was the dependent variable.  The study showed a strong effect of the manipulation.  Participants who had to taste radishes rang the bell 10 minutes earlier than participants who got to taste the chocolate cookies, t(44) = 6.03, d = 1.80, and 12 minutes earlier than participants in a control condition without the tasting part of the experiment, t(44) = 6.88, d = 2.04.   The ego-depletion effect in this study is gigantic.  Thus, flair might be important to create conditions that produce strong effects, but once a researcher with flair has created such an experiment, others should be able to replicate it.  It does not take flair to bake chocolate cookies, put a plate of radishes on a table, instruct participants how a figure-tracing task works, and tell them to ring a bell when they no longer want to work on the task.  In fact, Baumeister et al. (1998) proudly reported that even a junior high school student was able to replicate the study in a science fair project.

As this article went to press, we were notified that this experiment had been independently replicated by Timothy J. Howe, of Cole Junior High School in East Greenwich, Rhode Island, for his science fair project. His results conformed almost exactly to ours, with the exception that mean persistence in the chocolate condition was slightly (but not significantly) higher than in the control condition. These converging results strengthen confidence in the present findings.

If ego-depletion effects can be replicated in a school project, this undermines the idea that successful results require special skills.  Moreover, the meta-analysis shows that flair is little more than selective publishing of significant results, a conclusion that is confirmed by Baumeister’s response to my bias analyses: “you may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication).

In conclusion, future researchers interested in self-regulation have a choice. They can believe in ego-depletion, ignore the statistical evidence of selection bias, failed replications, and admissions of suppressed evidence, and conduct further studies with existing paradigms and sample sizes to see what they get.  Alternatively, they may go to the other extreme and dismiss the entire literature.

“If all the field’s prior work is misleading, underpowered, or even fraudulent, there is no need to pay attention to it.” (Baumeister, p. 4).

This meta-analysis offers a third possibility: identifying replicable results that can inform the planning of future studies that provide better tests of ego-depletion theory.  I do not suggest directly replicating any past study.  Rather, I think future research should aim for a strong demonstration of ego-depletion.  To achieve this goal, future studies should maximize statistical power in four ways.

First, use a strong experimental manipulation by comparing a control condition with a combination of multiple ego-depletion paradigms to maximize the standardized effect size.

Second, the study should use multiple, reliable, and valid measures of ego-depletion to minimize the influence of random and systematic measurement error in the dependent variable.

Third, the study should use a within-subject design or at least a pre-post design to control for individual differences in performance on the ego-depletion tasks to further reduce error variance.

Fourth, the study should have a sufficient sample size to make a non-significant result theoretically important.  I suggest planning for a standard error of .10 standard deviations.  As a result, any effect size greater than d = .20 will be significant, and a non-significant result is consistent with the null-hypothesis that the effect size is less than d = .20 (a rough sample-size sketch follows below).
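As a rough sketch of what a standard error of .10 implies for sample size, the common approximation SE(d) ≈ sqrt(1/n1 + 1/n2) for a between-subject comparison (ignoring the small term that depends on d itself) gives the following numbers; the exact requirement depends on the design and, for within-subject or pre-post designs, on the correlation between measures:

import math

target_se = 0.10  # planned standard error of the standardized mean difference

# Between-subject design with equal group sizes:
# SE(d) is approximately sqrt(1/n + 1/n) = sqrt(2/n) for small effects.
n_per_group = math.ceil(2 / target_se ** 2)
print(f"Between-subject: about {n_per_group} per group "
      f"(N = {2 * n_per_group}) for SE = {target_se}")

# Within-subject (or pre-post) design:
# SE(dz) is approximately sqrt(1/N), so the same precision needs fewer people.
n_within = math.ceil(1 / target_se ** 2)
print(f"Within-subject: about {n_within} participants for SE = {target_se}")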

The next replicability report will show which path ego-depletion researchers have taken.  Even if they follow Baumeister’s suggestion to continue with business as usual, they can no longer claim that they were unaware of the consequences of going down this path.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

More blogs on replicability.


R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community

R-Index Bulletin

The world of scientific publishing is changing rapidly, and there is a growing need to share scientific information as quickly and as cheaply as possible.

Traditional journals with pre-publication peer-review are too slow and focussed on major ground-breaking discoveries.

Open-access journals can be expensive.

R-Index Bulletin offers a new opportunity to share results with the scientific community quickly and free of charge.

R-Index Bulletin also avoids the problems of pre-publication peer-review by moving to a post-publication peer-review process. Readers are welcome to comment on posted contributions and to post their own analyses. This process ensures that scientific disputes and their resolution are open and part of the scientific process.

For the time being, submissions can be uploaded as comments to this blog. In the future, R-Index Bulletin may develop into a free online journal.

A submission should contain a brief description of the research question (e.g., what is the R-Index of studies on X, by X, or in the journal X?), the main statistical results (median observed power, success rate, inflation rate, R-Index) and a brief discussion of the implications of the analysis. There is no page restriction and analyses of larger data sets can include moderator analysis. Inclusion of other bias tests (Egger’s regression, TIVA, P-Curve, P-Uniform) is also welcome.

If you have conducted an R-Index analysis, please submit it to R-Index Bulletin to share your findings.

Submissions can be made anonymously or with an author’s name.

Go ahead and press the “Leave a comment” or “Leave a reply” button, or scroll to the bottom of the page and paste your results in the “Leave a reply” box.

Questionable Research Practices: Definition, Detection, and Recommendations for Better Practices

Further reflections on the linearity in Dr. Förster’s Data

A previous blog examined how and why Dr. Förster’s data showed incredibly improbable linearity.

The main hypothesis was that two experimental manipulations have opposite effects on a dependent variable.

Assuming that the average effect size of a single manipulation is similar to effect sizes in social psychology, a single manipulation is expected to have an effect size of d = .5 (a change by half a standard deviation). As the two manipulations are expected to have opposite effects, the mean difference between the two experimental groups should be one standard deviation (0.5 + 0.5 = 1). With N = 40 and d = 1, a study has 87% power to produce a significant effect (p < .05, two-tailed). With power of this magnitude, it would not be surprising to get significant results in all 12 comparisons (Table 1).

The R-Index for the comparison of the two experimental groups in Table 1 is Ř = 87%
(Success Rate = 100%, Median Observed Power = 94%, Inflation Rate = 6%).

The Test of Insufficient Variance (TIVA) shows that the variance in z-scores is less than 1, but the probability of this event occurring by chance is 10%, Var(z) = .63, Chi-square (df = 11) = 17.43, p = .096.

Thus, the results for the two experimental groups are perfectly consistent with real empirical data and the large effect size could be the result of two moderately strong manipulations with opposite effects.

The problem for Dr. Förster started when he included a control condition and wanted to demonstrate in each study that the two experimental groups also differed significantly from the control group. As already pointed out in the original post, samples of 20 participants per condition do not provide sufficient power to demonstrate effect sizes of d = .5 consistently.

To make matters worse, the three-group design has even less power than two independent studies because the same control group is used in a three-group comparison. When sampling error inflates the mean in the control group (e.g., true mean = 33, estimated mean = 36), it benefits the comparison for the experimental group with the lower mean, but it hurts the comparison for the experimental group with the higher mean (e.g., M = 27, M = 33, M = 39 vs. M = 27, M = 36, M = 39). When sampling error leads to an underestimation of the true mean in the control group (e.g., true mean = 33, estimated mean = 30), it benefits the comparison of the higher experimental group with the control group, but it hurts the comparison of the lower experimental group and the control group.

Thus, total power to produce significant results for both comparisons is even lower than for two independent studies.
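This can be illustrated with a small simulation (a sketch with hypothetical, normally distributed data; true effects of d = +.5 and d = -.5, n = 20 per condition, two-tailed tests at p < .05):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, d, n_sim = 20, 0.5, 10_000

def significant(a, b):
    # two-tailed independent-samples t-test at p < .05
    return stats.ttest_ind(a, b).pvalue < .05

both_shared = both_independent = 0
for _ in range(n_sim):
    high = rng.normal(+d, 1, n)
    low = rng.normal(-d, 1, n)
    ctrl = rng.normal(0.0, 1, n)  # one shared control group
    both_shared += significant(high, ctrl) and significant(low, ctrl)

    # two independent two-group studies, each with its own control group
    both_independent += (significant(rng.normal(+d, 1, n), rng.normal(0, 1, n)) and
                         significant(rng.normal(-d, 1, n), rng.normal(0, 1, n)))

print(f"P(both contrasts significant | shared control):      {both_shared / n_sim:.2f}")
print(f"P(both contrasts significant | independent studies): {both_independent / n_sim:.2f}")

In runs of this sketch, the shared-control design yields both significant contrasts less often than two independent studies, because sampling error in the single control group necessarily works against one of the two comparisons.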

It follows that the problem for a researcher with real data was the control group. Most studies would have produced significant results for the comparison of the two experimental groups, but failed to show significant differences between one of the experimental groups and the control group.

At this point, it is unclear how Jens Förster achieved significant results under the contested assumption that real data were collected. However, it seems most plausible that QRPs would be used to move the mean of the control group to the center so that both experimental groups show a significant difference. When this was impossible, the control group could be dropped, which may explain why 3 studies in Table 1 did not report results for a control group.

The influence of QRPs on the control group can be detected by examining the variation of means in Table 1 across the 12 (or 9) studies. Sampling error should randomly increase or decrease means relative to the overall mean of an experimental condition. Thus, there is no reason to expect a correlation in the pattern of means. Consistent with this prediction, the means of the two experimental groups are unrelated, r(12) = .05, p = .889; r(9) = .36, p = .347. In contrast, the means of the control group are correlated with the means of the two experimental groups, r(9) = .73 and r(9) = .71. If the means in the control group were adjusted toward the means in the experimental groups rather than sampled independently, it should be possible to predict the means in the control group from the means in the two experimental groups. A regression equation shows that 77% of the variance in the means of the control group is explained by the variation in the means of the experimental groups, R = .88, F(2,6) = 10.06, p = .01.

This analysis clarifies the source of the unusual linearity in the data. Studies with n = 20 per condition have very low power to demonstrate significant differences between a control group and opposite experimental groups because sampling error in the control group is likely to move the mean of the control group too close to one of the experimental groups to produce a significant difference.

This problem of low power may lead researchers to use QRPs to move the mean of the control group to the center. The problem for users of QRPs is that this statistical boost of power leaves a trace in the data that can be detected with various bias tests. The pattern of the three means will be too linear, there will be insufficient variance in the effect sizes, p-values, and observed power in the comparisons of experimental groups and control groups, the success rate will exceed median observed power, and, as shown here, the means in the control group will be correlated with the means in the experimental group across conditions.

In a personal email Dr. Förster did not comment on the statistical analyses because his background in statistics is insufficient to follow the analyses. However, he rejected this scenario as an account for the unusual linearity in his data; “I never changed any means.” Another problem for this account of what could have happened is that dropping cases from the middle group would lower the sample size of this group, but the sample size is always close to n = 20. Moreover, oversampling and dropping of cases would be a QRP that Dr. Förster would remember and could report. Thus, I now agree with the conclusion of the LOWI commission that the data cannot be explained by using QRPs, mainly because Dr. Förster denies having used any plausible QRPs that could have produced his results.

Some readers may be confused about this conclusion because it may appear to contradict my first blog. However, my first blog merely challenged the claim by the LOWI commission that linearity cannot be explained by QRPs. I found a plausible way in which QRPs could have produced linearity, and these new analyses still suggest that secretive and selective dropping of cases from the middle group could be used to show significant contrasts. Depending on the strength of the original evidence, this use of QRPs would be consistent with the widespread use of QRPs in the field and would not be considered scientific misconduct. As Roy F. Baumeister, a prominent social psychologist, put it, “this is just how the field works.” However, unlike Roy Baumeister, who explained improbable results with the use of QRPs, Dr. Förster denies any use of QRPs that could potentially explain the improbable linearity in his results.

In conclusion, the following facts have been established with sufficient certainty:
(a) the reported results are too improbable to reflect just true effects and sampling error; they are not credible.
(b) the main problem for a researcher to obtain valid results is the low power of multiple-study articles and the difficulty of demonstrating statistical differences between one control group and two opposite experimental groups.
(c) to avoid reporting non-significant results, a researcher must drop failed studies and selectively drop cases from the middle group to move the mean of the middle group to the middle.
(d) Dr. Förster denies the use of QRPs and he denies data manipulation.
Evidently, the facts do not add up.

The new analyses suggest that there is one simple way for Dr. Förster to show that his data have some validity. The comparison of the two experimental groups has an R-Index of 87%, which means there is nothing statistically improbable about these comparisons. If the reported results are based on real data, a replication study is highly likely to reproduce the mean difference between the two experimental groups. With n = 20 in each cell (N = 40), it would be relatively easy to conduct a preregistered and transparent replication study. However, without such evidence, the published results lack scientific credibility, and it would be prudent to retract all articles that show unusual statistical patterns that cannot be explained by the author.

Why are Stereotype-Threat Effects on Women’s Math Performance Difficult to Replicate?

Updated on May 19, 2016
– corrected mistake in calculation of p-value for TIVA

A Replicability Analysis of Spencer, Steele, and Quinn’s seminal article on stereotype threat effects on gender differences in math performance.

Background

In a seminal article, Spencer, Steele, and Quinn (1999) proposed the concept of stereotype threat. They argued that women may experience stereotype-threat during math tests and that stereotype threat can interfere with their performance on math tests.

The original study reported three experiments.

STUDY 1

Study 1 had 56 participants (28 male and 28 female undergraduate students). The main aim was to demonstrate that stereotype-threat influences performance on difficult, but not on easy math problems.

A 2 x 2 mixed model ANOVA with sex and difficulty produced the following results.

Main effect for sex, F(1, 52) = 3.99, p = .051 (reported as p = .05), z = 1.96, observed power = 50%.

Interaction between sex and difficulty, F(1, 52) = 5.34 , p = .025, z = 2.24, observed power = 61%.

The low observed power suggests that sampling error contributed to the significant results. Assuming observed power is a reliable estimate of true power, the chance of obtaining significant results in both tests would be only 31%. Moreover, if true power is in the range between 50% and 80%, there is only a 32% chance that an observed power estimate falls into this range. The chance that both observed power values fall into this range is only 10%.

Median observed power is 56%. The success rate is 100%. Thus, the success rate is inflated by 44 percentage points (100% – 56%).

The R-Index for these two results is low, Ř = 12 (56 – 44).
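For readers who want to verify these numbers, the following is a minimal sketch of the computation as described above (observed power from the reported F-tests, median observed power, inflation, R-Index); small rounding differences are possible.

from scipy import stats
import numpy as np

def observed_power(p_two_tailed, alpha=0.05):
    # convert a two-tailed p-value to a z-score and compute post-hoc power
    z = stats.norm.isf(p_two_tailed / 2)
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.sf(z_crit - z)

# the two F(1, 52) tests reported for Study 1
p_values = [stats.f.sf(F, 1, 52) for F in (3.99, 5.34)]
powers = [observed_power(p) for p in p_values]

mop = np.median(powers)      # median observed power
success_rate = 1.0           # both tests were reported as significant
inflation = success_rate - mop
r_index = mop - inflation    # equivalently: 2 * MOP - success rate

print(f"observed power: {[round(p, 2) for p in powers]}")
print(f"MOP = {mop:.2f}, inflation = {inflation:.2f}, R-Index = {r_index:.2f}")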

Empirical evidence shows that studies with low R-Indices often fail to replicate in exact replication studies.

It is even more problematic that Study 1 was supposed to demonstrate just the basic phenomenon that women perform worse on math problems than men and that the following studies were designed to move this pre-existing gender difference around with an experimental manipulation. If the actual phenomenon is in doubt, it is unlikely that experimental manipulations of the phenomenon will be successful.

STUDY 2

The main purpose of Study 2 was to demonstrate that gender differences in math performance would disappear when the test is described as gender neutral.

Study 2 recruited 54 students (30 women, 24 men). This small sample size is problematic for several reasons. Power analysis of Study 1 suggested that the authors were lucky to obtain significant results. If power is 50%, there is a 50% chance that an exact replication study with the same sample size will produce a non-significant result. Another problem is that sample sizes need to increase to demonstrate that the gender difference in math performance can be influenced experimentally.

The data were not analyzed according to this research plan because the second test was so difficult that nobody was able to solve these math problems. However, rather than repeating the experiment with a better selection of math problems, the results for the first math test were reported.

As there was no second, repeated performance measure, this is a 2 x 2 between-subject design that crosses sex and the threat manipulation. With a total sample size of 54 students, the n per cell is approximately 13.

The main effect for sex was significant, F(1, 50) = 5.66, p = .021, z = 2.30, observed power = 63%.

The interaction was also significant, F(1, 50) = 4.18, p = .046, z = 1.99, observed power = 51%.

Once more, median observed power is just 57%, yet the success rate is 100%. Thus, the success rate is inflated by 43% and the R-Index is low, Ř = 14%, suggesting that an exact replication study will not produce significant results.

STUDY 3

Studies 1 and 2 used highly selective samples (women in the top 10% in math performance). Study 3 aimed to replicate the results of Study 2 in a less selective sample. One might expect that stereotype-threat has a weaker effect on math performance in this sample because stereotype threat can undermine performance when ability is high, but anxiety is not a factor in performance when ability is low. Thus, Study 3 is expected to yield a weaker effect and a larger sample size would be needed to demonstrate the effect. However, sample size was approximately the same as in Study 2 (36 women, 31 men).

The ANOVA showed a main effect of sex on math performance, F(1, 63) = 6.44, p = .014, z = 2.47, observed power = 69%.

The ANOVA also showed a significant interaction between sex and stereotype-threat-assurance, F(1, 63) = 4.78, p = .033, z = 2.14, observed power = 57%.

Once more, the R-Index is low, Ř = 26 (MOP = 63%, Success Rate = 100%, Inflation Rate = 37%).

Combined Analysis

The three studies reported six statistical tests. The R-Index for the combined analysis is low, Ř = 18 (MOP = 59%, Success Rate = 100%, Inflation Rate = 41%).

The probability that this outcome occurred by chance can be assessed with the Test of Insufficient Variance (TIVA). TIVA tests the hypothesis that the variance in p-values, converted into z-scores, is less than 1. A variance of one is expected in a set of exact replication studies with fixed true power. Less variance suggests that the z-scores are not a representative sample of independent test scores.   The variance of the six z-scores is low, Var(z) = .04, p < .001 (about 1 in 1,309).

Correction: I initially reported, “A chi-square test shows that the probability of this event is less than 1 out of 1,000,000,000,000,000, chi-square (df = 5) = 105.”

I made a mistake in the computation of the probability. When I developed TIVA, I confused the numerator and denominator in the test. I was thrilled that the test was so powerful and happy to report the result in bold, but it is incorrect. A small sample of six z-scores cannot produce such low p-values.
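For readers who want to check such results themselves, the sketch below implements the corrected computation as I understand it: the test statistic is (k − 1)·Var(z), compared against the lower tail of a chi-square distribution with k − 1 degrees of freedom. With the six (rounded) z-scores reported above, it reproduces a probability of roughly 1 in 1,300.

import numpy as np
from scipy import stats

# the six (rounded) z-scores reported for the three studies above
z = np.array([1.96, 2.24, 2.30, 1.99, 2.47, 2.14])
k = len(z)

var_z = z.var(ddof=1)                    # observed variance of the z-scores
chi2_stat = (k - 1) * var_z              # compare to chi-square with k - 1 df
p = stats.chi2.cdf(chi2_stat, df=k - 1)  # lower tail: too little variance

print(f"Var(z) = {var_z:.2f}, chi2({k - 1}) = {chi2_stat:.2f}, "
      f"p = {p:.5f} (about 1 in {1 / p:,.0f})")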

Conclusion

The replicability analysis of Spencer, Steele, and Quinn (1999) suggests that the original data provided inflated estimates of effect sizes and replicability. Thus, the R-Index predicts that exact replication studies would fail to replicate the effect.

Meta-Analysis

A forthcoming article in the Journal of School Psychology reports the results of a meta-analysis of stereotype-threat studies in applied school settings (Flore & Wicherts, 2014). The meta-analysis was based on 47 comparisons of girls with stereotype threat versus girls without stereotype threat. The abstract concludes that stereotype threat in this population is a statistically reliable, but small effect (d = .22). However, the authors also noted signs of publication bias. As publication bias inflates effect sizes, the true effect size is likely to be even smaller than the uncorrected estimate of .22.

The article also reports that after a correction for bias, using the trim-and-fill method, the estimated effect size is d = .07, which is not significantly different from zero. Thus, the meta-analysis reveals that there is no replicable evidence for stereotype-threat effects on schoolgirls’ math performance. The meta-analysis also implies that any true effect of stereotype threat is likely to be small (d < .2). With a true effect size of d = .2, the original studies by Spencer et al. (1999) and most replication studies had insufficient power to demonstrate stereotype-threat effects, even if the effect exists. An a priori power analysis with d = .2 shows that 788 participants are needed to have an 80% chance of obtaining a significant result. Thus, future research on this topic is futile unless statistical power is increased by increasing sample sizes or by using more powerful designs that can demonstrate small effects in smaller samples.

One possibility is that the existing studies vary in quality and that good studies showed the effect reliably, whereas bad studies failed to show the effect. To test this hypothesis, it is possible to select studies from a meta-analysis with the goal of maximizing the R-Index. The best chance to obtain a high R-Index is to focus on studies with large sample sizes because statistical power increases with sample size. However, the table below shows that there are only 8 studies with more than 100 participants, and the success rate in these studies is 13% (1 out of 8), which is consistent with the median observed power in these studies (12%).

[Table: R-Index analysis of studies in the stereotype-threat meta-analysis, grouped by sample size]

It is also possible to select studies that produced significant results (z > 1.96). Of course, this set of studies is biased, but the R-Index corrects for bias. If these studies were successful because they had sufficient power to demonstrate effects, the R-Index would be greater than 50%. However, the R-Index is only 49%.

CONCLUSION

In conclusion, a replicability analysis with the R-Index shows that stereotype-threat is an elusive phenomenon. Even large replication studies with hundreds of participants were unable to provide evidence for an effect that appeared to be a robust effect in the original article. The R-Index of the meta-analysis by Flore and Wicherts corroborates concerns that the importance of stereotype-threat as an explanation for gender differences in math performance has been exaggerated. Similarly, Ganley, Mingle, Ryan, Ryan, and Vasilyeva (2013) found no evidence for stereotype threat effects in studies with 931 students and suggested that “these results raise the possibility that stereotype threat may not be the cause of gender differences in mathematics performance prior to college.” (p 1995).

The main novel contribution of this post is to show that this disappointing outcome was predicted by the empirical results reported in the original article by Spencer et al. (1999). The article suggested that stereotype threat is a pervasive phenomenon that explains gender differences in math performance. However, the R-Index and the insufficient variance in the statistical results suggest that the reported results were biased and overestimated the effect size of stereotype threat. The R-Index corrects for this bias and correctly predicts that replication studies will often produce non-significant results. The meta-analysis confirms this prediction.

In sum, the main conclusions that one can draw from 15 years of stereotype-threat research are that (a) the real reasons for gender differences in math performance are still unknown, (b) resources have been wasted in the pursuit of a negligible factor that may contribute to gender differences in math performance under very specific circumstances, and (c) the R-Index could have prevented the irrational exuberance about stereotype-threat as a simple solution to an important social issue.

In a personal communication, Dr. Spencer suggested that studies not included in the meta-analysis might produce different results. I suggested that Dr. Spencer provide a list of studies that offer empirical support for the hypothesis. A year later, Dr. Spencer has not provided any studies that offer credible evidence for stereotype-threat effects.  At present, the existing evidence suggests that published studies provide inflated estimates of the replicability and importance of the effect.

This blog also provides further evidence that male and female psychologists could benefit from a better education in statistics and research methods to avoid wasting resources in the pursuit of false-positive results.

How Power Analysis Could Have Prevented the Sad Story of Dr. Förster

[further information can be found in a follow up blog]

Background

In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.

For a while this article was just another article that reported a large number of studies that all worked and neither reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.

In 2012, an anonymous letter raised suspicions of scientific misconduct by Jens Förster. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations of scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.

The Accusation

The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2,284 participants, that is, an average sample size of N = 57 participants per experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation designed to increase scores on the dependent variable, a second group received the opposite manipulation to decrease scores on the dependent variable, and a third group served as a control condition with the expectation that its average would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.

The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variables by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.

The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.

Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on retraction watch. An important question was whether the reported results could have been obtained by means of questionable research practices or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.

According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct.

In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.

Four QRPs were considered.

  1. Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.
  2. Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.
  3. Collecting additional research data after an initial analysis revealed a non-significant result. This description of a QRP is ambiguous. Presumably it refers to optional stopping, that is, when the data trend in the right direction, data collection continues with repeated checking of p-values and stops when the p-value is significant. This practice leads to random variation in sample sizes. However, the studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would reach significance in every third attempt, and only 60 participants would be needed to produce one significant result out of three attempts. The report provides insufficient information to rule out this QRP, although it is well-known that excluding failed studies is a common practice in all sciences.
  4. Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation of how this QRP can be ruled out as an explanation. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null-hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observations is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null-hypothesis is true.

The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.

A New Perspective: The Curse of Low Powered Studies

One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).

One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table 1 below).

[Table 1: contrasts of the experimental conditions with the control condition across studies]

In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.

Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68).   An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.

An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67.

The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are 12.4 significant results and 5.6 non-significant results.

The actual success rate in Table 1 should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60), but it shows a significant difference between means of 31 and 27 (d = .33). Most likely the subscript for the control condition should be c, not a.

Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).

The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account).

The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).

Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.

Thus, one interpretation of the results in Table 1 is that the true effect sizes of the manipulation is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed) and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).

The use of questionable research practices would also explain the unusual linearity in the data. Questionable research practices will inflate, or lead to the omission of, effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result, d = .5, se = .32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.

The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.
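A small simulation illustrates this clustering (a sketch with hypothetical data; true effect d = .5, n = 20 per condition, one-tailed selection of significant results):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, d_true, n_sim = 20, 0.5, 20_000

published = []
for _ in range(n_sim):
    treat = rng.normal(d_true, 1, n)   # population SD = 1, so the mean
    ctrl = rng.normal(0.0, 1, n)       # difference approximates d
    t, p_two = stats.ttest_ind(treat, ctrl)
    if t > 0 and p_two / 2 < .05:      # one-tailed selection of "successes"
        published.append(treat.mean() - ctrl.mean())

published = np.array(published)
print(f"significant (and thus 'published') studies: {len(published) / n_sim:.0%}")
print(f"mean published effect: {published.mean():.2f} (true effect = {d_true})")
print(f"SD of published effects: {published.std():.2f} "
      f"(unselected sampling SD is about {np.sqrt(2 / n):.2f})")

In runs of this sketch, roughly half of the simulated studies reach significance, the average published effect is inflated well above the true d = .5, and the published effects cluster far more tightly than an unselected set of studies would; the mirror image holds for the opposite contrast, which is what pushes the three means toward a linear pattern.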

One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.

Conclusion

Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what could have [edited January 19, 2015] happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis supports the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”

How Unusual is an R-Index of 51?

The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. However, in response to further inquiries [see follow up blog] Dr. Förster denied having used QRPs that could explain the linearity in his data.

Nevertheless, an R-Index of 51% is not unusual and has been explained with the use of QRPs.  For example, the R-Index for a set of studies by Roy Baumeister was 49%, and Roy Baumeister stated that QRPs were used to obtain significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).

How to Avoid Similar Sad Stories in the Future

One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.

Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim of demonstrating significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many tests will all produce significant results. Thus, a non-significant result is likely to emerge, and researchers should think ahead of time about how they would deal with non-significant results.

For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conducted a significance test only once for the pattern across all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that was used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.

Jacob Cohen tried for decades to teach psychologists the importance of statistical power, but psychologists stubbornly ignored his valuable contributions to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).

One explanation is that small samples offered a huge incentive. A non-significant result can be discarded at little cost in resources, whereas a significant result can be published with the additional benefit of an inflated effect size, which boosts the apparent importance of the published results.

The R-Index was developed to shift the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult for other researchers to replicate. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants who may be motivated to bias results to please a professor.

Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that come with a high R-Index, because for a true scientist nothing is more exciting than the truth.

The R-Index of Nicotine-Replacement-Therapy Studies: An Alternative Approach to Meta-Regression

Stanley and Doucouliagos (2013) demonstrated how meta-regression can be used to obtain unbiased estimates of effect sizes from a biased set of original studies. The regression approach relies on the fact that small samples often need luck or questionable practices to produce significant results, whereas large samples can show true effects without the help of luck and questionable practices. If questionable practices or publication bias are present, effect sizes in small samples are inflated and this bias is evident in a regression of effect sizes on sampling error. When bias is present, the intercept of the regression equation can provide a better estimate of the average effect size in a set of studies.

One limitation of this approach is that other factors can also produce a correlation between effect size and sampling error. Another problem is that the regression equation can only approximate the effect of bias on effect size estimates.

The R-Index can complement meta-regression in several ways. First, it can be used to examine whether a correlation between effect size and sampling error reflects bias. If small samples have higher effect sizes due to bias, they should also yield more significant results than the power of these studies justifies. If this is not the case, the correlation may simply show that smaller samples examined stronger effects. Second, the R-Index can be used as an alternative way to estimate unbiased effect sizes that does not rely on the relationship between sample size and effect size.

The usefulness of the R-Index is illustrated with Stanley and Doucouliagos’s (2013) meta-analysis of the effectiveness of nicotine replacement therapy (the patch). Table A1 lists sampling errors and t-values of 42 studies. Stanley and Doucouliagos (2013) found that the 42 studies suggested that nicotine replacement therapy increases smoking cessation by 93%, but that this estimate of effectiveness dropped to 22% in a regression that controlled for biased reporting of results. This suggests that published studies inflate the true effect by more than 300%.

I entered the t-values and standard errors into the R-Index spreadsheet. I used sampling error to estimate sample sizes and degrees of freedom (se = 2 / sqrt[N]). I used one-tailed t-tests, allowing for negative t-values, because the direction of the effect is known in a meta-analysis of studies that try to show treatment effects. Significance was tested using p = .025, which is equivalent to the conventional two-tailed .05 criterion (z > 1.96).
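These conversion steps can be sketched as follows for a single hypothetical (t, se) pair; the values t = 2.1 and se = .25 are made up for illustration, and the real inputs are the 42 pairs in Table A1.

from scipy import stats

def study_stats(t, se, alpha_one_tailed=0.025):
    N = round((2 / se) ** 2)        # from se = 2 / sqrt(N)
    df = N - 2
    d = t * se                      # standardized effect size
    p_one = stats.t.sf(t, df)       # one-tailed p-value (the sign is known)
    z = stats.norm.isf(p_one)       # convert to a z-score
    z_crit = stats.norm.isf(alpha_one_tailed)   # 1.96, same cutoff as two-tailed .05
    observed_power = stats.norm.sf(z_crit - z)
    return N, d, p_one, observed_power

N, d, p_one, observed_power = study_stats(t=2.1, se=0.25)
print(f"N = {N}, d = {d:.2f}, one-tailed p = {p_one:.3f}, "
      f"observed power = {observed_power:.2f}")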

The R-Index for all 42 studies was 27%. The low R-Index was mostly explained by the low power of studies with small samples. Median observed power was just 34%. The percentage of significant results was only slightly higher (40%). The inflation rate was only 7%.

As studies with low power add mostly noise, Stanley (2010) showed that it can be preferable to exclude them from estimates of actual effect sizes. The problem is that it is difficult to find a principled way to determine which studies should be included or excluded. One solution is to retain only studies with large samples. The problem with this approach is that this often limits a meta-analysis to a small set of studies.

One solution is to compute the R-Index for different sets of studies and to base conclusions on the largest unbiased set of studies. For the 42 studies of nicotine replacement therapy, the following effect size estimates were obtained (effect sizes are d-values, d = t * se).

[Table: R-Index and weighted effect size estimates for subsets of nicotine-replacement studies defined by sample size]

The results show the highest R-Index for studies with more than 80 participants. For these studies, observed power is 83% and the percentage of significant results is also 83%, suggesting that this set of studies is an unbiased sample of studies. The weighted average effect size for this set of studies is d = .44. The results also show that the weighted average effect size does not change much as a function of the selection of studies. When all studies are included, there is evidence of bias (8% inflation) and the weighted average effect size is inflated, but the amount of inflation is small (d = .56 vs. d = .44, difference d = .12).

The small amount of bias appears to be inconsistent with Stanley and Doucouliagos’s (2013) estimate that an uncorrected meta-analysis overestimates the true effect size by over 300% (93% vs. 22% RR). I therefore also examined the log(RR) values in Table 1a.

The average is .68 (compared to the simple mean reported as .66); the median is .53 and the weighted average is .49.   The regression-corrected estimate reported by Stanley and Doucouliagos (2013) is .31. The weighted mean for studies with more than 80 participants is .43. It is now clear why Stanley and Doucouliagos (2013) reported a large effect of the bias correction. First, they used the simple mean as a comparison standard (.68 vs. .31). The effect would be smaller if they had used the weighted mean as a comparison standard (.49 vs. .31). Another factor is that the regression procedure produces a lower estimate than the R-Index approach (.31 vs. .43). More research is needed to compare these approaches, but the R-Index has a simple logic. When there is no evidence of bias, the weighted average provides a reasonable estimate of the true effect size.

Conclusion

Stanley and Doucouliagos (2013) used regression of effect sizes on sampling error to reveal biases and to obtain an unbiased estimate of the typical effect size in a set of studies. This approach provides a useful tool in the fight against biased reporting of research results. One limitation of this approach is that other factors can produce a correlation between sampling error and effect size. The R-Index can be used to examine how much reporting biases contribute to this correlation. The R-Index can also be used to obtain an unbiased estimate of effect size by computing a weighted average for a select set of studies with a high R-Index.

A meta-analysis of 42 studies of nicotine replacement theory illustrates this approach. The R-Index for the full set of studies was low (24%). This reveals that many studies had low power to demonstrate an effect. These studies provide little information about effectiveness because non-significant results are just as likely to be type-II errors as demonstrations of low effectiveness.

The R-Index increased when studies with larger samples were selected. The maximum R-Index was obtained for studies with at least 80 participants. In this case, observed power was above 80% and there was no evidence of bias. The weighted average effect size for this set of studies was only slightly lower than the weighted average effect size for all studies (log(RR) = .43 vs. .49, RR = 54% vs. 63%, respectively). This finding suggests that smokers who use a nicotine patch are about 50% more likely to quit smoking than smokers without a nicotine patch.

The estimate of 50% risk reduction challenges Stanley and Doucouliagos’s (2013) preferred estimate that bias correction “reduces the efficacy of the patch to only 22%.” The R-Index suggests that this bias-corrected estimate is itself biased.

Another important conclusion is that studies with low power are wasteful and uninformative. They generate a lot of noise, are likely to be systematically biased, and contribute little to a meta-analysis that weights studies by sample size. The best estimate of effect size was based on only 6 out of 42 studies. Researchers should not conduct, and editors should not publish, studies with low power.

Dr. Schnall’s R-Index

In several blog posts, Dr. Schnall made some critical comments about attempts to replicate her work, and these posts created a heated debate about replication studies. Heated debates are typically a reflection of insufficient information. Is the Earth flat? This question created heated debates hundreds of years ago. In the age of space travel it is no longer debated. In this post, I present some statistical information that sheds light on the debate about the replicability of Dr. Schnall's research.

The Original Study

Dr. Schnall and colleagues conducted a study with 40 participants. A comparison of two groups on a dependent variable showed a significant difference, F(1,38) = 3.63. At the time, Psychological Science asked researchers to report p-rep instead of p-values. P-rep was 90%. The interpretation of p-rep was that there is a 90% chance of finding an effect with the SAME SIGN in an exact replication study with the same sample size. The conventional p-value for F(1,38) = 3.63 is p = .06, a finding that is commonly interpreted as marginally significant. The standardized effect size is d = .60, which is considered a moderate effect size. The 95% confidence interval is -.01 to 1.47.

The wide confidence interval makes it difficult to know the true effect size. A post-hoc power analysis, assuming the true effect size is d = .60, suggests that an exact replication study has a 46% chance to produce a significant result (p < .05, two-tailed). However, if the true effect size is lower, actual power is lower. For example, if the true effect size is small (d = .2), a study with N = 40 has only 9% power (that is, a 9% chance) to produce a significant result.
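These power figures can be checked with any standard power calculator. The sketch below uses the independent-samples t-test power function from statsmodels, assuming two equal groups of half the reported total N; it also includes the replication sample sizes discussed below. The numbers match the percentages in this post up to rounding.

```python
# Post-hoc power for the scenarios discussed in this post,
# assuming two equal groups of N/2 participants each.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
scenarios = [
    ("original study,  d = .6, N = 40",  0.6, 20),
    ("original study,  d = .2, N = 40",  0.2, 20),
    ("replication,     d = .2, N = 209", 0.2, 104.5),
    ("replication,     d = .2, N = 440", 0.2, 220),
]

for label, d, n_per_group in scenarios:
    p = analysis.power(effect_size=d, nobs1=n_per_group, alpha=0.05,
                       ratio=1.0, alternative='two-sided')
    print(f"{label}: power = {p:.2f}")

# Prints approximately .46, .09, .31, and .55, matching the figures in the text.
```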

The First Replication Study

Drs. Johnson, Cheung, and Donnellan conducted a replication study with 209 participants. Assuming the effect size in the original study is the true effect size, this replication study has 99% power. However, assuming the true effect size is only d = .2, the study has only 31% power to produce a significant result. The study produced a non-significant result, F(1, 206) = .004, p = .95. The effect size was d = .01 (in the same direction). Due to the larger sample, the confidence interval is narrower and ranges from -.26 to .28. The confidence interval includes d = .2. Thus, both studies are consistent with the hypothesis that the effect exists and that the effect size is small, d = .2.

The Second Replication Study

Dr. Huang conducted another replication study with N = 214 participants (Huang, 2014, Study 1). Based on the previous two studies, the true effect might be expected to be somewhere between -.01 and .28, which includes a small effect size of d = .20. Assuming d = .2, a study with N = 214 participants has 31% power to produce a significant result. Not surprisingly, the study produced a non-significant result, t(212) = 1.22, p = .23. At the same time, the effect size, d = .17, fell within the confidence interval set by the previous two studies.

A Third Replication Study

Dr. Huang conducted a further replication study with N = 440 participants (Study 2). Maintaining the plausible effect size of d = .2 as the best estimate of the true effect size, the study has 55% power to produce a significant result, which means that, if the effect size is indeed small (d = .2), a non-significant result is almost as likely as a significant one. The study failed to produce a significant result, t(438) = 0.42, p = .68. The effect size was d = .04 with a confidence interval ranging from -.14 to .23. Again, this confidence interval includes a small effect size of d = .2.

A Fourth Replication Study

Dr. Huang reported another replication study in the supplementary materials to the article. The study again failed to demonstrate a main effect, t(434) = 0.42, p = .38. The effect size is d = .08 with a confidence interval of -.11 to .27. Again, the confidence interval is consistent with a small true effect size of d = .2. However, the study with 436 participants had only a 55% chance (again assuming d = .2) to produce a significant result.

If Dr. Huang had combined the two samples to conduct a more powerful study, a study with 878 participants would have had 80% power to detect a small effect size of d = .2. However, the effect size for the combined samples, d = .06, is still not significant, t(876) = .89. The confidence interval ranges from -.07 to .19. It no longer includes d = .20, but the results are still consistent with a positive, yet small, effect in the range between 0 and .20.
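As a quick sanity check, the combined-sample numbers follow directly from the large-sample approximation se(d) = sqrt(4/N) for two equal groups; the snippet below reproduces the t-value and confidence interval quoted above.

```python
# Check of the combined-sample result above, using se(d) = sqrt(4/N)
# for two equal groups (a large-sample approximation).
from math import sqrt

d, n_total = 0.06, 878
se = sqrt(4 / n_total)
t = d / se
lo, hi = d - 1.96 * se, d + 1.96 * se
print(f"t = {t:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")  # t = 0.89, CI = [-0.07, 0.19]
```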

Conclusion

In sum, nobody has been able to replicate Schnall's finding that a simple priming manipulation with cleanliness-related words has a moderate to strong effect (d = .6) on moral judgments of hypothetical scenarios. However, all replication studies show a trend in the same direction. This suggests that the effect exists, but that the effect size is much smaller than in the original study: somewhere between 0 and .2 rather than .6.

Now there are three possible explanations for the much larger effect size in Schnall’s original study.

1. The replication studies were not exact replications and the true effect size in Schnall’s version of the experiment is stronger than in the other studies.

2. The true effect size is the same in all studies, but Dr. Schnall was lucky to observe an effect size that was three times as large as the true effect size and large enough to produce a marginally significant result.

3. It is possible that Dr. Schnall did not disclose all of the information about her original study. For example, she may have conducted additional studies that produced smaller and non-significant results and did not report these results. Importantly, this practice is common and legal, and in an anonymous survey many researchers admitted to using practices that produce inflated effect sizes in published studies. However, it is extremely rare for researchers to admit that these common practices explain one of their own findings, and Dr. Schnall has attributed the discrepancy in effect sizes to problems with the replication studies.

Dr. Schnall’s Replicability Index

Based on Dr. Schnall’s original study it is impossible to say which of these explanations accounts for her results. However, additional evidence makes it possible to test the third hypothesis that Dr. Schnall knew more than she reported in her article. The reason is that luck does not repeat itself. If Dr. Schnall was just lucky, other studies by her should have failed, because Lady Luck is only on your side half the time. If, however, disconfirming evidence is systematically excluded from a manuscript, the rate of successful studies is higher than the observed statistical power in published studies (Schimmack, 2012).

To test this hypothesis, I downloaded Dr. Schnall’s 10 most cited articles (in Web of Science, July 2014). These 10 articles contained 23 independent studies. For each study, I computed the median observed power of statistical tests that tested a theoretically important hypothesis. I also calculated the success rate for each study. The average success rate was 91% (ranging from 45% to 100%, median = 100%). The median observed power was 61%. The inflation rate was 30% (91% - 61%). Importantly, observed power is an inflated estimate of replicability when the success rate is inflated. I created the replicability index (R-Index) to take this inflation into account. The R-Index subtracts the inflation rate from median observed power.

Dr. Schnall’s R-Index is 31% (61% – 30%).

What does an R-Index of 31% mean? Here are some comparisons that can help to interpret the Index.

Imagine the null-hypothesis is always true, and a researcher publishes only type-I errors. In this case, observed power is 61% and the success rate is 100%. The R-Index is 22%.

Dr. Baumeister admitted that his publications select studies that report the most favorable results. His R-Index is 49%.

The Open Science Framework conducted replication studies of psychological studies published in 2008. A set of 25 completed studies in November 2014 had an R-Index of 43%. The actual rate of successful replications was 28%.

Given these comparison standards, it is hardly surprising that one of Dr. Schnall’s studies did not replicate even when the sample size and power of the replication studies were considerably higher.

Conclusion

Dr. Schnall’s R-Index suggests that the omission of failed studies provides the most parsimonious explanation for the discrepancy between Dr. Schnall’s original effect size and effect sizes in the replication studies.

Importantly, the selective reporting of favorable results was and still is an accepted practice in psychology. It is a statistical fact that these practices reduce the replicability of published results. So why do failed replication studies that are entirely predictable create so much heated debate? Why does Dr. Schnall fear that her reputation is tarnished when a replication study reveals that her effect sizes were inflated? The reason is that psychologists are collectively motivated to exaggerate the importance and robustness of empirical results. Replication studies break with the code to maintain an image that psychology is a successful science that produces stunning novel insights. Nobody was supposed to test whether published findings are actually true.

However, Bem (2011) let the cat out of the bag and there is no turning back. Many researchers have recognized that the public is losing trust in science. To regain trust, science has to be transparent and empirical findings have to be replicable. The R-Index can be used to show that researchers reported all the evidence and that significant results are based on true effect sizes rather than gambling with sampling error.

In this new world of transparency, researchers still need to publish significant results. Fortunately, there is a simple and honest way to do so that was proposed by Jacob Cohen over 50 years ago. Conduct a power analysis and invest resources only in studies that have high statistical power. If your expertise led you to make a correct prediction, the force of the true effect size will be with you and you do not have to rely on Lady Luck or witchcraft to get a significant result.

P.S. I nearly forgot to comment on Dr. Huang’s moderator effects. Dr. Huang claims that the effect of the cleanliness manipulation depends on how much effort participants exert on the priming task.

First, as noted above, no moderator hypothesis is needed because all studies are consistent with a true effect size in the range between 0 and .2.

Second, Dr. Huang found significant interaction effects in two studies. In Study 2, the effect was F(1,438) = 6.05, p = .014, observed power = 69%. In Study 2a, the effect was F(1,434) = 7.53, p = .006, observed power = 78%. The R-Index for these two studies is 74% (average observed power) - 26% (inflation) = 48%; a short sketch reproducing these numbers follows after these points. I am waiting for an open science replication with 95% power before I believe in the moderator effect.

Third, even if the moderator effect exists, it doesn’t explain Dr. Schnall’s main effect of d = .6.
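For readers who want to verify the figures in the second point above, observed power for an F-test with one numerator degree of freedom can be approximated by treating the square root of F as a z-value. The sketch below is only a rough check under that approximation and reproduces the 69%, 78%, and 48% values up to rounding.

```python
# Rough check of the observed-power figures above, treating sqrt(F) as a
# z-value for F-tests with one numerator degree of freedom.
from math import sqrt
from scipy.stats import norm

f_values = [6.05, 7.53]        # F(1,438) and F(1,434) as reported
z_crit = norm.ppf(0.975)       # two-tailed .05 criterion

powers = [norm.cdf(sqrt(f) - z_crit) for f in f_values]
mean_power = sum(powers) / len(powers)
r_index = mean_power - (1.0 - mean_power)   # success rate is 100% here

print([round(p, 2) for p in powers])            # [0.69, 0.78]
print(round(mean_power, 2), round(r_index, 2))  # 0.74 0.47 (the post's 74% - 26% = 48%, up to rounding)
```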

The Replicability-Index (R-Index): Quantifying Research Integrity

ANNIVERSARY POST. Slightly edited version of the first R-Index blog post from December 1, 2014.

In a now infamous article, Bem (2011) produced 9 (out of 10) statistically significant results that appeared to show time-reversed causality. Not surprisingly, subsequent studies failed to replicate this finding. Although Bem never admitted it, it is likely that he used questionable research practices to produce his results. That is, he did not just run 10 studies and find 9 significant results. He may have dropped failed studies, deleted outliers, and so on. It is well known among scientists (but not lay people) that researchers routinely use these questionable practices to produce results that advance their careers. Think of it as doping for scientists.

I have developed a statistical index that tracks whether published results were obtained by conducting a series of studies with a good chance of producing a positive result (high statistical power) or whether researchers used questionable research practices. The R-Index is a function of the observed power in a set of studies. More power means that results are more likely to be confirmed in a replication attempt. The second component of the R-Index is the discrepancy between observed power and the rate of significant results: 100 studies with 80% power should produce, on average, 80% significant results. If observed power is 80% and the success rate is 100%, questionable research practices were used to obtain more significant results than the data justify. In this case, the actual power is less than 80% because questionable research practices inflate observed power. The R-Index subtracts the discrepancy (in this case, 20% too many significant results) from observed power to adjust for this inflation. For example, if observed power is 80% and the success rate is 100%, the discrepancy is 20% and the R-Index is 60%.
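In code, the index as described here reduces to a one-line formula. The sketch below simply applies it to the 80%/100% example in the preceding paragraph and to Dr. Schnall's numbers from the earlier post; it adds nothing beyond that arithmetic.

```python
# R-Index = observed power minus inflation (success rate minus observed power),
# which simplifies to 2 * observed power - success rate.
def r_index(observed_power: float, success_rate: float) -> float:
    inflation = success_rate - observed_power
    return observed_power - inflation

print(f"{r_index(0.80, 1.00):.2f}")  # 0.60, the example above
print(f"{r_index(0.61, 0.91):.2f}")  # 0.31, Dr. Schnall's R-Index from the earlier post
```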

In a paper, I show that the R-index predicts success in empirical replication studies.

The R-index also sheds light on the recent controversy about failed replications in psychology (repligate) between replicators and “replihaters.”   Replicators sometimes imply that failed replications are to be expected because original studies used small samples with surprisingly large effects, possibly due to the use of questionable research practices. Replihaters counter that replicators are incompetent researchers who are motivated to produce failed studies.  The R-Index makes it possible to evaluate these claims objectively and scientifically.  It shows that the rampant use of questionable research practices in original studies makes it extremely likely that replication studies will fail.  Replihaters should take note that questionable research practices can be detected and that many failed replications are predicted by low statistical power in original articles.