Replicability Review of 2016

2016 was surely an exciting year for anybody interested in the replicability crisis in psychology. Some of the biggest news stories in 2016 came from attempts by the psychology establishment to downplay the replication crisis in psychological research (Wired Magazine). At the same time, 2016 delivered several new replication failures that provide further ammunition for the critics of established research practices in psychology.

I. The Empire Strikes Back

1. The Open Science Collaboration Reproducibility Project was flawed.

Daniel Gilbert and Tim Wilson published a critique of the Open Science Collaboration in Science. According to Gilbert and Wilson, the project that replicated 100 original research studies and reported that only 36% could be replicated was error-riddled. In their view, the low success rate only reveals the incompetence of replicators and has no implications for the replicability of original studies published in prestigious psychological journals like Psychological Science. Science Daily suggested that the critique overturned the landmark study.

Nature published a more balanced commentary. In an interview, Gilbert explains that “the number of studies that actually did fail to replicate is about the number you would expect to fail to replicate by chance alone — even if all the original studies had shown true effects.” This quote is rather strange if we really consider the replication studies to be flawed and error-riddled. If the replication studies were bad, we would expect fewer studies to replicate than we would expect based on chance alone. If the success rate of 36% is consistent with the effect of chance alone, the replication studies are just as good as the original studies and the only reason for non-significant results would be chance. Thus, Gilbert’s comment implies that he believes the typical statistical power of a study in psychology is about 36%. Gilbert doesn’t seem to realize that he is inadvertently admitting that published articles report vastly inflated success rates because 97% of the original studies reported a significant result. To report 97% significant results with an average power of 36%, researchers are either hiding studies that failed to support their favored hypotheses in proverbial file-drawers or they are using questionable research practices to inflate evidence in favor of their hypotheses. Thus, ironically, Gilbert’s comments confirm the critics’ view that the low success rate in the reproducibility project can be explained by selective reporting of evidence that supports authors’ theoretical predictions.
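The arithmetic behind this argument can be checked with a short calculation. The sketch below is only an illustration and not an analysis from the reproducibility project: it assumes 100 independent studies that all have exactly 36% power, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, power: float) -> float:
    """Binomial probability of k or more significant results in n studies."""
    return sum(comb(n, i) * power**i * (1 - power) ** (n - i) for i in range(k, n + 1))


# Assumed for illustration: 100 independent studies, each with 36% power.
N_STUDIES = 100
POWER = 0.36
N_SIGNIFICANT = 97  # 97% of the original studies reported a significant result

p_no_selection = prob_at_least(N_SIGNIFICANT, N_STUDIES, POWER)

# Number of studies one would expect to run in order to obtain 97 significant results.
expected_attempts = N_SIGNIFICANT / POWER

print(f"P(at least 97 of 100 significant | power = .36) = {p_no_selection:.2e}")
print(f"Expected studies needed for 97 successes: {expected_attempts:.0f}")
```

Under these assumptions the probability of 97 successes without any selection is vanishingly small, and roughly 270 studies would have to be run to obtain 97 significant ones. The difference is what would have to sit in the file drawer or be produced by questionable research practices.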

2. Contextual Sensitivity Explains Replicability Problem in Social Psychology

Jay van Bavel and colleagues made a second attempt to downplay the low replicability of published results in psychology. He even got to write about it in the New York Times.

Van Bavel blames the Open Science Collaboration for overlooking the importance of context. “Our results suggest that many of the studies failed to replicate because it was difficult to recreate, in another time and place, the exact same conditions as those of the original study.” This statement caused a lot of bewilderment. First, the OSC carefully tried to replicate the original studies as closely as possible. At the same time, they were sensitive to the effect of context. For example, if a replication study of an original study in the US was carried out in Germany, stimulus words were translated from English into German because one might expect that native German speakers might not respond the same way to the original English words as native English speakers. However, the switching of languages means that the replication study is not identical to the original study. Maybe the effect can only be obtained with English speakers. And if the study was conducted at Harvard, maybe the effect can only be replicated with Harvard students. And if the study was conducted primarily with female students, it may not replicate with male students.

To provide evidence for his claim, Jay van Bavel obtained subjective ratings of contextual sensitivity. That is, raters guessed how sensitive the outcome of a study is to variations in the context. These ratings were then used to predict the success of the 100 replication studies in the OSC project.

Jay van Bavel proudly summarized the results in the NYT article. “As we predicted, there was a correlation between these context ratings and the studies’ replication success: The findings from topics that were rated higher on contextual sensitivity were less likely to be reproduced. This held true even after we statistically adjusted for methodological factors like sample size of the study and the similarity of the replication attempt. The effects of some studies could not be reproduced, it seems, because the replication studies were not actually studying the same thing.”

The article leaves out a few important details. First, the correlation between contextual sensitivity ratings and replication success was small, r = .20. Thus, even if contextual sensitivity contributed to replication failures, it would only explain replication failures for a small percentage of studies. Second, the authors used several measures of replicability and some of these measures failed to show the predicted relationship. Third, the statement makes the elementary mistake of confusing correlation and causality. The authors merely demonstrated that subjective ratings of contextual sensitivity predicted outcomes of replication studies. They did not show that contextual sensitivity caused replication failures. Most important, Jay van Bavel failed to mention that they also conducted an analysis that controlled for discipline. The Open Science Collaboration had already demonstrated that studies in cognitive psychology are more replicable (50% success rate) than studies in social psychology (an awful 25%). In an analysis that controlled for differences in disciplines, contextual sensitivity was no longer a statistically significant predictor of replication failures. This hidden fact was revealed in a commentary (or should we say correction) by Yoel Inbar. In conclusion, this attempt at propping up the image of social psychology as a respectable science with replicable results turned out to be another embarrassing example of sloppy research methodology.

3. Anti-Terrorism Manifesto by Susan Fiske

Later that year, Susan Fiske, a former president of the Association for Psychological Science (APS), caused a stir by comparing critics of established psychology to terrorists (see Business Insider article). She later withdrew the comparison to terrorists in response to the criticism of her remarks on social media (APS website).

Fiske attempted to defend established psychology by arguing that established psychology is self-correcting and does not require self-appointed social-media vigilantes. She claimed that these criticisms were destructive and damaging to psychology.

“Our field has always encouraged — required, really — peer critiques.”

“To be sure, constructive critics have a role, with their rebuttals and letters-to-the-editor subject to editorial oversight and peer review for tone, substance, and legitimacy.”

“One hopes that all critics aim to improve the field, not harm people. But the fact is that some inappropriate critiques are harming people. They are a far cry from temperate peer-reviewed critiques, which serve science without destroying lives.”

Many critics of established psychology did not share Fiske’s rosy and false description of the way psychology operates. Peer-review has been shown to be a woefully unreliable process. Moreover, the key criterion for accepting a paper is that it presents flawless results that seem to support some extraordinary claims (a 5-minute online manipulation reduces university drop-out rates by 30%), no matter how these results were obtained and whether they can be replicated.

In her commentary, Fiske is silent about the replication crisis and does not reconcile her image of a critical peer-review system with the fact that only 25% of social psychological studies are replicable and some of the most celebrated findings in social psychology (e.g., elderly priming) are now in doubt.

The rise of blogs and Facebook groups that break with the rules of the establishment poses a threat to the APS establishment, whose main goal is lobbying for psychological research funding in Washington. By trying to paint critics of the establishment as terrorists, Fiske tried to dismiss criticism of established psychology without having to engage with the substantive arguments for why psychology is in crisis.

In my opinion, her attempt to do so backfired. The response to her column showed that the reform movement is gaining momentum and that few young researchers are willing to prop up a system that is more concerned about publishing articles and securing grant money than about making real progress in understanding human behavior.

II. Major Replication Failures in 2016

4. Epic Failure to Replicate Ego-Depletion Effect in a Registered Replication Report

Ego-depletion is a big theory in social psychology and the inventor of the ego-depletion paradigm, Roy Baumeister, is arguably one of the biggest names in contemporary social psychology. In 2010, a meta-analysis seemed to confirm that ego-depletion is a highly robust and replicable phenomenon. However, this meta-analysis failed to take publication bias into account. In 2014, a new meta-analysis revealed massive evidence of publication bias. It also found that there was no statistically reliable evidence for ego-depletion after taking publication bias into account (Slate, Huffington Post).

A team of researchers, including the first author of the supportive meta-analysis from 2010, conducted replication studies, using the same experiment in 24 different labs. Each of these studies alone would have had a low probability of detecting a small ego-depletion effect, but the combined evidence from all 24 labs made it possible to detect an ego-depletion effect even if it were much smaller than published articles suggest. Yet, the project failed to find any evidence for an ego-depletion effect, suggesting that it is much harder to demonstrate ego-depletion effects than one would believe based on over 100 published articles with successful results.

Critics of Baumeister’s research practices (Schimmack) felt vindicated by this stunning failure. However, even proponents of ego-depletion theory (Inzlicht) acknowledged that ego-depletion theory lacks a strong empirical foundation and that it is not clear what 20 years of research on ego-depletion have taught us about human self-control.

Not so, Roy Baumeister. Like a bank that is too big to fail, Baumeister defended ego-depletion as a robust empirical finding and blamed the replication team for the negative outcome. Although he was consulted and approved the design of the study, he later argued that the experimental task was unsuitable to induce ego-depletion. It is not hard to see the circularity in Baumeister’s argument. If a study produces a positive result, the manipulation of ego-depletion was successful. If a study produces a negative result, the experimental manipulation failed. The theory is never being tested because it is taken for granted that the theory is true. The only empirical question is whether an experimental manipulation was successful.

Baumeister also claimed that his own lab has been able to replicate the effect many times, without explaining the strong evidence for publication bias in the ego-depletion literature and the results of a meta-analysis that showed results from his own lab are no different from results from other labs.

A related article by Baumeister in a special issue on the replication crisis in psychology was another highlight in 2016. In this article, Baumeister introduced the concept of FLAIR.

Scientist with FLAIR

Baumeister writes “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants. Over the years the norm crept up to about n = 20. Now it seems set to leap to n = 50 or more.” (JESP, 2016, p. 154). He misses the good old days and suggests that the old system rewarded researchers with flair. “Patience and diligence may be rewarded, but competence may matter less than in the past. Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure. Flair, intuition, and related skills matter much less with n = 50.” (JESP, 2016, p. 156).

This quote explains the low replication rate in social psychology and the failure to replicate ego-depletion effects. It is simply not possible to conduct studies with n = 10 and be successful in most studies because empirical studies in psychology are subject to sampling error. Each study with n = 10 on a new sample of participants will produce dramatically different results because samples of n = 10 are very different from each other. This is a fundamental fact of empirical research that appears to elude one of the most successful empirical social psychologists. So, a researcher with FLAIR may set up a clever experiment with a strong manipulation (e.g., smelling chocolate cookies and having participants eat radishes instead) and get a significant result. But this is not a replicable finding. For every study with flair that worked, there are numerous studies that did not work. However, researchers with flair ignore these failed studies and focus on the studies that worked and then use these studies for publication. It can be shown statistically that they do, as I did with Baumeister’s glucose studies (Schimmack, 2012) and Baumeister’s ego-depletion studies in general (Schimmack, 2016). So, a researcher who gets significant results with small samples (n = 10) surely has FLAIR (False, Ludicrous, And Incredible Results).
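The role of sampling error with n = 10 per cell can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reanalysis of any published study: it assumes a two-group design with normally distributed scores and a hypothetical true effect of d = 0.5, and all function names are illustrative.

```python
import random
import statistics


def two_sample_t(a, b):
    """Student's t statistic for two independent groups of equal size."""
    n = len(a)
    pooled_var = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / (pooled_var * 2 / n) ** 0.5


def simulated_power(n_per_cell, true_d, t_crit, n_sims=5000, seed=2016):
    """Share of simulated studies with |t| above the two-tailed critical value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_cell)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_cell)]
        if abs(two_sample_t(treatment, control)) > t_crit:
            hits += 1
    return hits / n_sims


TRUE_D = 0.5  # hypothetical effect size, chosen only for illustration

# Two-tailed critical t values for alpha = .05 with df = 18 and df = 98.
power_n10 = simulated_power(10, TRUE_D, t_crit=2.101)
power_n50 = simulated_power(50, TRUE_D, t_crit=1.984)

print(f"n = 10 per cell: {power_n10:.0%} of studies significant")
print(f"n = 50 per cell: {power_n50:.0%} of studies significant")
```

Even though the simulated effect is real, only a minority of the n = 10 studies reach significance, whereas most of the n = 50 studies do. A lab that reports mostly significant results with n = 10 must therefore be running many more studies than it publishes.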

Baumeister’s article contained additional insights into the research practices that fueled a highly productive and successful career. For example, he distinguishes boring researchers who report boring true positive results from interesting researchers who publish interesting false positive results. He argues that science needs both types of researchers. Unfortunately, most people assume that scientists prioritize truth, which is the main reason for subjecting theories to empirical tests. But scientists with FLAIR get positive results even when their interesting ideas are false (Bem, 2011).

Baumeister mentions psychoanalysis as an example of interesting psychology. What could be more interesting than the Freudian idea that every boy goes through a phase where he wants to kill daddy and make love to mommy? Interesting stuff, indeed, but this idea has no scientific basis. In contrast, twin studies suggest that many personality traits, values, and abilities are partially inherited. To reveal this boring fact, it was necessary to recruit large samples of thousands of twins. That is not something a psychologist with FLAIR can handle. “When I ran my own experiments as a graduate student and young professor, I struggled to stay motivated to deliver the same instructions and manipulations through four cells of n=10 each. I do not know how I would have managed to reach n=50. Patient, diligent researchers will gain, relative to others” (Baumeister, JESP, 2016, p. 156). So, we may see the demise of researchers with FLAIR, and diligent and patient researchers who listen to their data may take their place. Now there is something to look forward to in 2017.

Scientist without FLAIR

5. No Laughing Matter: Replication Failure of Facial Feedback Paradigm

A second Registered Replication Report (RRR) delivered another blow to the establishment. This project replicated a classic study on the facial-feedback hypothesis. Like other peripheral emotion theories, facial-feedback theories assume that experiences of emotions depend (fully or partially) on bodily feedback. That is, we feel happy because we smile rather than smile because we are happy. Numerous studies had examined the contribution of bodily feedback to emotional experience and the evidence was mixed. Moreover, studies that found effects had a major methodological problem. Simply asking participants to smile might make them think happy thoughts, which could elicit positive feelings. In the 1980s, social psychologist Fritz Strack invented a procedure that solved this problem (see Slate article). Participants are deceived to believe that they are testing a procedure for handicapped people to complete a questionnaire by holding a pen in their mouth. Participants who hold the pen with their lips are activating muscles that are activated during sadness. Participants who hold the pen with their teeth activate muscles that are activated during happiness. Thus, randomly assigning participants to one of these two conditions made it possible to manipulate facial muscles without making participants aware of the associated emotion. Strack and colleagues reported two experiments that showed effects of the experimental manipulation. Or did they? It depends on the statistical test being used.

Experiment 1 had three conditions. The control group did the same study without manipulation of the facial muscles. The dependent variable was funniness ratings of cartoons. The mean funniness of cartoons was highest in the smile condition, followed by the control condition, and the lowest mean in the frown condition. However, a commonly used Analysis of Variance would not have produced a significant result. A two-tailed t-test also would not have produced a significant result. However, a linear contrast with a one-tailed t-test produced a just-significant result, t(89) = 1.85, p = .03. So, Fritz Strack was rather lucky to get a significant result. Sampling error could have easily changed the pattern of means slightly and even the directional test of the linear contrast would not have been significant. At the same time, sampling error might have worked against the facial feedback hypothesis, in which case the real effect would be stronger than this study suggests. In this case, we would expect to see stronger evidence in Study 2. However, Study 2 failed to show any effect on funniness ratings of cartoons. “As seen in Table 2, subjects’ evaluations of the cartoons were hardly affected under the different experimental conditions. The ANOVA showed no significant main effects or interactions, all ps > .20” (Strack et al., 1988). However, Study 2 also included amusement ratings, and the amusement ratings once more showed a just-significant result with a one-tailed t-test, t(75) = 1.78, p = .04. The article also provides an explanation for the just-significant result in Study 1, even though Study 1 used funniness ratings of cartoons. When participants are not asked to differentiate between their subjective feelings of amusement and the objective funniness of cartoons, subjective feelings influence ratings of funniness, but given a chance to differentiate between the two, subjective feelings no longer influence funniness ratings.
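How much the conclusion depends on the choice of test can be checked directly from the reported t values. The following sketch computes one-tailed and two-tailed p-values using only the Python standard library; the numerical integration of the t density is an illustrative shortcut rather than the method used in the original article, and the function names are hypothetical.

```python
from math import exp, lgamma, log, pi


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def one_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Upper-tail probability P(T > t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t


# t values and degrees of freedom as reported for the two experiments.
reported = {"Study 1, t(89) = 1.85": (1.85, 89), "Study 2, t(75) = 1.78": (1.78, 75)}

results = {}
for label, (t, df) in reported.items():
    p_one = one_tailed_p(t, df)
    results[label] = (p_one, 2 * p_one)
    print(f"{label}: one-tailed p = {p_one:.3f}, two-tailed p = {2 * p_one:.3f}")
```

Both results fall below .05 only with the one-tailed test. With the conventional two-tailed test, neither study would have been reported as significant.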

For 25 years, this article was uncritically cited as evidence for the facial feedback hypothesis, but none of the 17 labs that participated in the RRR were able to produce a significant result. More important, even an analysis with the combined power of all studies failed to detect an effect. Some critics pointed out that this result successfully replicates the finding of the original two studies that also failed to report statistically significant results by conventional standards of a two-tailed test (or z > 1.96).

Given the shaky evidence in the original article, it is not clear why Fritz Strack volunteered his study for a replication attempt. However, it is easier to understand his response to the results of the RRR. He does not take the results seriously. He would rather believe his two original, marginally significant studies than the 17 replication studies.

“Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.” (Slate).

One of the most bizarre statements by Strack can only be interpreted as revealing a shocking lack of understanding of probability theory.

“So when Strack looks at the recent data he sees not a total failure but a set of mixed results. Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed? Maybe there’s a reason why half the labs could not elicit the effect.” (Slate).

This is like a roulette player who after a night of gambling sees 49% wins and 49% losses and ponders why 49% of the attempts produced losses. Strack does not seem to realize that results of individual studies vary simply by chance, just like roulette balls produce different results by chance. Some people find cartoons funnier than others and the mean will depend on the allocation of these individuals to the different groups. This is called sampling error, and this is why we need to do statistical tests in the first place. And apparently it is possible to become a famous social psychologist without understanding the purpose of computing and reporting p-values.

And the full force of defense mechanisms is apparent in the next statement. “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. (Slate).

No, there were not eight non-replications. There were 17! We would expect half of the studies to match the direction of the original effect simply due to chance alone.
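The claim about direction can be verified with a simple sign test. The sketch below assumes that, if the true effect is zero, each lab is equally likely to land on either side of zero and that labs are independent; the variable names are illustrative.

```python
from math import comb

N_LABS = 17
LABS_IN_ORIGINAL_DIRECTION = 9

# If the true effect is zero, each lab's sign is treated as a fair coin flip.
favourable_outcomes = sum(
    comb(N_LABS, k) for k in range(LABS_IN_ORIGINAL_DIRECTION, N_LABS + 1)
)
p_at_least_9 = favourable_outcomes / 2**N_LABS

print(f"P(9 or more of 17 labs in the original direction | no effect) = {p_at_least_9:.2f}")
```

By symmetry this probability is exactly one half, so a 9 to 8 split is precisely what a null effect looks like and offers no hint that the two groups of labs differed in any systematic way.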

But this is not all. Strack even accused the replication team of “reverse p-hacking.” (Strack, 2016). The term p-hacking was coined by Simmons et al. (2011) to describe a set of research practices that can be used to produce statistically significant results in the absence of a real effect (fabricating false positives). Strack turned it around and suggested that the replication team used statistical tricks to make the facial feedback effect disappear. “Without insinuating the possibility of a reverse p hacking, the current anomaly needs to be further explored.” (p. 930).

However, the statistical anomaly that requires explanation could just be sampling error (Hilgard), and it actually is the wrong statistical pattern to claim reverse p-hacking. Reverse p-hacking implies that some studies did produce a significant result, but statistical tricks were used to report the result as non-significant. This would lead to a restriction in the variability of results across studies, which can be detected with the Test of Insufficient Variance (Schimmack, 2015), but there is no evidence for reverse p-hacking in the RRR.

Fritz Strack also tried to make his case on social media, but there was very little support for his view that 17 failed replication studies can be ignored (PsychMAP thread).

Strack’s desperate attempts to defend his famous original study in the light of a massive replication failure provide further evidence for the inability of the psychology establishment to face the reality that many celebrated discoveries in psychology rest on shaky evidence and a mountain of repressed failed studies.

Meanwhile, the Test of Insufficient Variance provides a simple explanation for the replication failure, namely that the original results were rather unlikely to occur in the first place. Converting the observed t-values into z-scores shows very low variability, Var(z) = 0.003. The probability of observing a variance this small or smaller in a pair of studies is only p = .04. It is just not very likely for such an improbable event to repeat itself.
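The two numbers above can be approximately reproduced from the reported t values. The sketch below is a simplified version of the test for the special case of two studies: it converts each t value into the z-score with the same two-tailed p-value, computes the variance of the z-scores, and compares it to a chi-square distribution with one degree of freedom under the assumption that z-scores have a variance of 1 without selection. The helper functions are illustrative, and the numerical integration is a standard-library stand-in for a statistics package.

```python
from math import erf, exp, lgamma, log, pi, sqrt
from statistics import NormalDist, variance


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def one_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Upper-tail probability P(T > t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_to_z(t: float, df: int) -> float:
    """z-score that has the same two-tailed p-value as the given t value."""
    p_two = 2 * one_tailed_p(t, df)
    return NormalDist().inv_cdf(1 - p_two / 2)


# Reported results of the two original studies: t(89) = 1.85 and t(75) = 1.78.
z_scores = [t_to_z(1.85, 89), t_to_z(1.78, 75)]
var_z = variance(z_scores)

# For k = 2 studies, (k - 1) * Var(z) is referred to a chi-square distribution
# with 1 degree of freedom, whose CDF at x equals erf(sqrt(x / 2)).
p_tiva = erf(sqrt(var_z / 2))

print(f"z-scores: {z_scores[0]:.2f}, {z_scores[1]:.2f}")
print(f"Var(z) = {var_z:.3f}, left-tail p = {p_tiva:.2f}")
```

The closed-form chi-square CDF used here only holds for a pair of studies. A larger set of studies would need a general chi-square distribution function with k - 1 degrees of freedom.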

6. Insufficient Power in Power-Posing Research

When you google “power posing” the featured link shows Amy Cuddy giving a TED talk about her research. Not unlike facial feedback, power posing assumes that bodily feedback can have powerful effects.

When you scroll down the page, you might find a link to an article by Gelman and Fung (Slate).

Gelman has been an outspoken critic of social psychology for some time. This article is no exception. “Some of the most glamorous, popular claims in the field are nothing but tabloid fodder. The weakest work with the boldest claims often attracts the most publicity, helped by promotion from newspapers, television, websites, and best-selling books.”

They point out that a much larger study than the original study failed to replicate the original findings.

“An outside team led by Eva Ranehill attempted to replicate the original Carney, Cuddy, and Yap study using a sample population five times larger than the original group. In a paper published in 2015, the Ranehill team reported that they found no effect.”

They have little doubt that the replication study can be trusted and suggest that the original results were obtained with the help of questionable research practices.

“We know, though, that it is easy for researchers to find statistically significant comparisons even in a single, small, noisy study. Through the mechanism called p-hacking or the garden of forking paths, any specific reported claim typically represents only one of many analyses that could have been performed on a dataset.”

The replication study was published in 2015, so this replication failure does not really belong in a review of 2016. Indeed, the big news in 2016 was that Cuddy’s co-author Carney distanced herself from her contribution to the power posing article. Her public rejection of her own work (New Yorker Magazine) spread like wildfire through social media (Psych Methods FB Group Posts 1, 2, but see 3). Most responses were very positive. Although science is often considered a self-correcting system, individual scientists rarely correct mistakes or retract articles if they discover a mistake after publication. Carney’s statement was seen as breaking with the implicit norm of the establishment to celebrate every published article as an important discovery and to cover up mistakes even in the face of replication failures.

Not surprisingly, Amy Cuddy, a proponent of power posing, defended her claims about power posing. Her response makes many points, but there is one glaring omission. She does not mention the evidence that published results are selected to confirm theoretical claims and she does not comment on the fact that there is no evidence for power posing after correcting for publication bias. The psychology establishment also appears to be more interested in propping up a theory that has created a lot of publicity for psychology than in critically examining the scientific evidence for or against power posing (APS Annual Meeting, 2017, Program, Presidential Symposium).

7. Commitment Priming: Another Failed Registered Replication Report

Many research questions in psychology are difficult to study experimentally. For example, it seems difficult and unethical to study the effect of infidelity on romantic relationships by assigning one group of participants to an infidelity condition and making them engage in non-marital sex. Social psychologists have developed a solution to this problem. Rather than creating real situations, participants are primed to think about infidelity. If these thoughts change their behavior, the results are interpreted as evidence for the effect of real infidelity.

Eli Finkel and colleagues used this approach to experimentally test the effect of commitment on forgiveness. To manipulate commitment, participants in the experimental group were given some statements that were supposed to elicit commitment-related thoughts. To make sure that this manipulation worked, participants then completed a commitment measure. In the original article, the experimental manipulation had a strong effect, d = .74, which was highly significant, t(87) = 3.43, p < .001.

Irene Cheung, Lorne Campbell, and Etienne P. LeBel spearheaded an initiative to replicate the experimental effect of commitment priming on forgiveness. Eli Finkel worked closely with the replication team to ensure that the replication study replicated the original study as closely as possible. Yet, the replication studies failed to demonstrate effectiveness of the commitment manipulation. Even with the much larger sample size, there was no significant effect and the effect size was close to zero.

The authors of the replication report were surprised by the failure of the manipulation. “It is unclear why the RRR studies observed no effect of priming on subjective commitment when the original study observed a large effect. Given the straightforward nature of the priming manipulation and the consistency of the RRR results across settings, it seems unlikely that the difference resulted from extreme context sensitivity or from cohort effects (i.e., changes in the population between 2002 and 2015).” (PPS, 2016, p. 761). The author of the original article, Eli Finkel, also has no explanation for the failure of the experimental manipulation. “Why did the manipulation that successfully influenced commitment in 2002 fail to do so in the RRR? I don’t know.” (PPS, 2016, p. 765).

However, Eli Finkel also reports that he made changes to the manipulation in subsequent studies. “The RRR used the first version of a manipulation that has been refined in subsequent work. Although I believe that the original manipulation is reasonable, I no longer use it in my own work. For example, I have become concerned that the “low commitment” prime includes some potentially commitment-enhancing elements (e.g., “What is one trait that your partner will develop as he/she grows older?”). As such, my collaborators and I have replaced the original 5-item primes with refined 3-item primes (Hui, Finkel, Fitzsimons, Kumashiro, & Hofmann, 2014). I have greater confidence in this updated manipulation than in the original 2002 manipulation. Indeed, when I first learned that the 2002 study would be the target of an RRR—and before I understood precisely how the RRR mechanism works—I had assumed that it would use this updated manipulation.” (PPS, 2016, p. 766). Surprisingly, the potential problem with the original manipulation was never brought up during the planning of the replication study (FB discussion group).

Hui et al. (2014) also do not mention any concerns about the original manipulation. They simply wrote “Adapting procedures from previous research (Finkel et al., 2002), participants in the high commitment prime condition answered three questions designed to activate thoughts regarding dependence and commitment.” (JPSP, 2014, p. 561). The results of the manipulation check closely replicated the results of the 2002 article. “The analysis of the manipulation check showed that participants in the high commitment prime condition (M = 4.62, SD = 0.34) reported a higher level of relationship commitment than participants in the low commitment prime condition (M = 4.26, SD = 0.62), t(74) = 3.11, p < .01.” (JPSP, 2014, p. 561). The study also produced a just-significant result for a predicted effect of the manipulation on support for partner’s goals that are incompatible with the relationship, beta = .23, t(73) = 2.01, p = .05. These just-significant results are rare and often fail to replicate in replication studies (OSC, Science, 2015).

Altogether, the results of yet another registered replication report raise major concerns about the robustness of priming as a reliable method to alter participants’ beliefs and attitudes. Selective reporting of studies that “worked” has created an illusion that priming is a very effective and reliable method to study social cognitions. However, even social cognition theories suggest that priming effects should be limited to specific situations and should not have strong effects for judgments that are highly relevant or for which chronically accessible information is readily available.

8. Concluding Remarks

Looking back, 2016 has been a good year for the reform movement in psychology. High-profile replication failures have shattered the credibility of established psychology. Attempts by the establishment to discredit critics have backfired. A major problem for the establishment is that they themselves do not know how big the crisis is and which findings are solid. Consequently, there has been no major initiative by the establishment to mount replication projects that provide positive evidence for some important discoveries in psychology. Looking forward to 2017, I anticipate no major changes. Several registered replication studies are in the works, and prediction markets anticipate further failures. For example, a registered replication report of “professor priming” studies is predicted to produce a null result.

If you are still looking for a New Year’s resolution, you may consider signing on to Brent W. Roberts, Rolf A. Zwaan, and Lorne Campbell’s initiative to improve research practices. You may also want to become a member of the Psychological Methods Discussion Group, where you can find out in real time about major events in the world of psychological science.

Have a wonderful new year.

Replicability-Index

Improving the replicability of empirical research
