Disgust - Replicability-Index

Psychological scientists are human and, like all humans, they can be tempted to violate social norms (Fiske, 2015). To help psychologists conduct ethical research, professional organizations have developed codes of conduct (APA). These rules are designed to help researchers resist temptations to engage in unethical practices such as fabricating or falsifying data (Pain, Science, 2008).

Psychological science has ignored the problem of research integrity for a long time. The Association for Psychological Science (APS) still does not have formal guidelines about research misconduct (APS, 2016).

Two eminent psychologists recently edited a book with case studies that examine ethical dilemmas for psychological scientists (Sternberg & Fiske, 2015). Unfortunately, this book lacks moral fiber and fails to discuss recent initiatives to address the lax ethical standards in psychology.

Many of the brief chapters in this book are concerned with unethical behaviors of students, ethics in clinical settings, or the ethics of conducting research with animals or human participants. These chapters have no relevance for the current debates about improving psychological science. Nevertheless, a few chapters do address these issues, and these chapters show how poorly prepared eminent psychologists are to address an ethical crisis that threatens the foundation of psychological science.

Chapter 29
Desperate Data Analysis by a Desperate Job Candidate Jonathan Haidt

Pursuing a career in science is risky and getting an academic job is hard. After a two-year funded post-doc, I didn’t have a job for one year and I worked hard to get more publications. Jonathan Haidt was in a similar situation. He didn’t get an academic job after his first post-doc and was lucky to get a second post-doc, but he needed more publications.

He was interested in the link between feelings of disgust and moral judgments. A common way to demonstrate causality in experimental social psychology is to use an incidental manipulation of the cause (disgust) and to show that the manipulation has an effect on a measure of the effect (moral judgments).

“I was looking for carry-over effects of disgust”

In the chapter, JH tells readers about the moral dilemma he faced when he collected data and the analysis showed the predicted pattern, but the result was not statistically significant. This meant the evidence was not strong enough to be publishable. He looked carefully at the data and saw several outliers. He came up with various reasons to exclude some of them. Many researchers have been in the same situation, but few have told their story in a book.

I knew I was doing this post hoc, and that it was wrong to do so. But I was so confident that the effect was real, and I had defensible justifications! I made a deal with myself: I would go ahead and write up the manuscript now, without the outliers, and while it was under review I would collect more data, which would allow me to get the result cleanly, including all outliers.

This account contradicts various assertions by psychological scientists that they did not know better or that questionable research practices just happen without intent. JH's story is much more plausible. He needed publications to get a job. He had a promising dataset and all he was doing was eliminating a few outliers to beat an arbitrary criterion of statistical significance. So what if the p-value was .11 with the three cases included? The difference between p = .04 and p = .11 is not statistically significant. Plus, he was not going to rely on these results. He would collect more data. Surely, there was a good reason to bend the rules slightly or, as Sternberg (2015) calls it, to go a couple of miles over the speed limit. Everybody does it. JH realized that his behavior was unethical; it just was not significantly unethical (Sternberg, 2015).
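The claim that p = .04 and p = .11 do not differ significantly can be checked with a rough calculation. The sketch below converts each two-sided p-value into a z-score and tests the difference between the two z-scores. It assumes, purely for illustration, that the two results come from independent samples; in JH's case both p-values came from overlapping data, so this only illustrates the general point and is not a reanalysis of his study. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and SD 1


def z_from_two_sided_p(p):
    """Convert a two-sided p-value into the corresponding absolute z-score."""
    return STD_NORMAL.inv_cdf(1 - p / 2)


def p_for_difference(p1, p2):
    """Two-sided p-value for the difference between two results.

    Assumption (illustration only): the two results are independent,
    so the difference of their z-scores has a standard error of sqrt(2).
    """
    z_diff = (z_from_two_sided_p(p1) - z_from_two_sided_p(p2)) / sqrt(2)
    return 2 * (1 - STD_NORMAL.cdf(abs(z_diff)))


print(round(p_for_difference(0.04, 0.11), 2))
```

Under these assumptions the p-value for the difference is far above .05, which is why crossing the .05 line by dropping a few cases says little about the strength of the evidence.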

“Decide That the Ethical Dimension Is Significant. If one observes a driver going one mile per hour over the speed limit on a highway, one is unlikely to become perturbed about the unethical behavior of the driver, especially if the driver is oneself.” (Sternberg, 2015).

So what if JH was speeding a little bit to get an academic job? He wasn’t driving 80 miles per hour in front of an elementary school like Diederik Stapel, who just made up data. But that is not how this chapter ends. JH tells us that he never published the results of this study.

Fortunately, I ended up recruiting more participants before finishing the manuscript, and the new data showed no trend whatsoever. So I dropped the whole study and felt an enormous sense of relief. I also felt a mix of horror and shame that I had so blatantly massaged my data to make it comply with my hopes.

What vexes me about this story is that Jonathan Haidt is known for his work on morality and disgust and published a highly cited (> 2,000 citations in WebofScience) article that suggested disgust does influence moral judgments.

Wheatley and Haidt (2001) manipulated somatic markers even more directly. Highly hypnotizable participants were given the suggestion, under hypnosis, that they would feel a pang of disgust when they saw either the word take or the word often. Participants were then asked to read and make moral judgments about six stories that were designed to elicit mild to moderate disgust, each of which contained either the word take or the word often. Participants made higher ratings of both disgust and moral condemnation about the stories containing their hypnotic disgust word. This study was designed to directly manipulate the intuitive judgment link (Link 1), and it demonstrates that artificially increasing the strength of a gut feeling increases the strength of the resulting moral judgment (Haidt, 2001, Psychological Review).

A more detailed report of these studies was published a few years later (Wheatley & Haidt, 2005). Study 1 reported a significant difference between the disgust-hypnosis group and the control group, t(44) = 2.41, p = .020. Study 2 produced a marginally significant result that was significant in a non-parametric test.

For the morality ratings, there were substantially more outliers (in both directions) than in Experiment 1 or for the other ratings in this experiment. As the paired-samples t test loses power in the presence of outliers, we used its non-parametric analogue, the Wilcoxon signed-rank test, as well (Hollander & Wolfe, 1999). Participants judged the actions to be more morally wrong when their hypnotic word was present (M = 73.4) than when it was absent (M = 69.6), t(62) = 1.74, p = .09, Wilcoxon Z = 2.18, p < .05.

Although JH's account of his failed study suggests he acted ethically, the same story also reveals that he had at least one study that failed to support the moral disgust hypothesis and that was not mentioned in his Psychological Review article. Disregarding an entire study that ultimately did not support a hypothesis is a questionable research practice, just as removing some outliers is (John et al., 2012; see also next section about Chapter 35). However, JH seems to believe that he acted morally.

Yet in 2015 social psychologists were well aware that hiding failed studies and other questionable practices undermine the credibility of published findings. It is therefore particularly troubling that JH was a co-author of another article that failed to mention this study. Schnall, Haidt, Clore, and Jordan (2015) responded to a meta-analysis that suggested the effect of incidental disgust on moral judgments is not reliable and that there was evidence for publication bias (e.g., not reporting the failed study JH mentions in his contribution to the book on ethical challenges). This would have been a good opportunity to admit that some studies failed to show the effect and that these studies were not reported. However, the response is rather different.

“With failed replications on various topics getting published these days, we were pleased that Landy and Goodwin’s (2015) meta-analysis supported most of the findings we reported in Schnall, Haidt, Clore, and Jordan (2008). They focused on what Pizarro, Inbar and Helion (2011) had termed the amplification hypothesis of Haidt’s (2001) social intuitionist model of moral judgment, namely that “disgust amplifies moral evaluations—it makes wrong things seem even more wrong (Pizarro et al., 2011, p. 267, emphasis in original).” Like us, Landy and Goodwin (2015) found that the overall effect of incidental disgust on moral judgment is usually small or zero when ignoring relevant moderator variables.”

Somebody needs to go back in time and correct JH’s Psychological Review article and the hypnosis studies that reported main effects with moderate effect sizes and no moderator effects. Apparently, even JH no longer believed in these effects in 2015, and so it was not important to mention failed studies. However, it might have been relevant to point out that the studies that did report main effects were false positives and what theoretical implications this would have.

More troubling is that the moderator effects are also not robust. The moderator effects were shown in studies by Schnall and may be inflated by the use of questionable research practices. In support of this interpretation of her results, a large replication study failed to replicate the results of Schnall et al.’s (2008) Study 3. Neither the main effect of the disgust manipulation nor the interaction with the personality measure was significant (Johnson et al., 2016).

The fact that JH openly admits to hiding disconfirming evidence, while he would have considered selective deletion of outliers a moral violation, and was ashamed of even thinking about it, suggests that he does not consider hiding failed studies a violation of ethics (but see APA Guidelines, 6th edition, 2010). This confirms Sternberg’s (2015) first observation about moral behavior. A researcher needs to define an event as having an ethical dimension to act ethically. As long as social psychologists do not consider hiding failed studies unethical, reported results cannot be trusted to be objective fact. Maybe it is time to teach social psychologists that hiding failed studies is a questionable research practice that violates scientific standards of research integrity.

Chapter 35
“Getting it Right” Can also be Wrong by Ronnie Janoff-Bulman

This chapter provides the clearest introduction to the ethical dilemma that researchers face when they report the results of their research. JB starts with a typical example that all empirical psychologists have encountered. A study showed a promising result, but a second study failed to show the desired and expected result (p > .10). She then did what many researchers do. She changed the design of the study (a different outcome measure) and collected new data. There is nothing wrong with trying again because there are many reasons why a study may produce an unexpected result. However, JB also makes it clear that the article would not include the non-significant results.

“The null-result of the intermediary experiment will not be discussed or mentioned, but will be ignored and forgotten.”

The suppression of the failed study is called a questionable research practice (John et al., 2012). The Publication Manual of APA considers this unethical reporting of research results.

JB makes it clear that hiding failed studies undermines the credibility of published results.

“Running multiple versions of studies and ignoring the ones that “didn’t work” can have far-reaching negative effects by contributing to the false positives that pervade our field and now pass for psychological knowledge. I plead guilty.”

JB also explains why it is wrong to neglect failed studies. Running study after study to get a successful outcome is likely to “capitalize on chance, noise, or situational factors and increase the likelihood of finding a significant (but unreliable) effect.”
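The arithmetic behind this point is simple. As a minimal sketch, assume each attempt is an independent study of an effect that does not exist, tested at an alpha level of .05; the chance that at least one attempt comes out significant then grows quickly with the number of attempts. The function name is hypothetical.

```python
def chance_of_at_least_one_hit(attempts, alpha=0.05):
    """Probability that at least one of `attempts` independent studies of a
    nonexistent effect reaches significance at level `alpha`."""
    return 1 - (1 - alpha) ** attempts


# Print the probability for a few numbers of attempts.
for attempts in (1, 2, 5, 10):
    print(attempts, round(chance_of_at_least_one_hit(attempts), 3))
```

With ten attempts the probability is already about 40%, so a literature that reports only the attempt that "worked" hides most of the information needed to evaluate it.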

This observation is by no means new. Sterling (1959) pointed out that publication bias (publishing only p-values below .05) can increase the risk of a false positive result from the nominal level of 5% to an actual level of 100%. Even evidently false hypotheses will produce only significant results in the published literature if failures are not reported (Bem, 2011).
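Sterling's argument can be illustrated with a small simulation. The sketch below makes simplifying assumptions that are mine, not Sterling's: every study compares two groups of 20 drawn from the same normal distribution (so the true effect is zero), the standard deviation is treated as known so that a z-test applies, and only significant studies are "published". All names and parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_literature(n_studies=10_000, n_per_group=20, alpha=0.05, seed=1):
    """Run `n_studies` two-group studies of a true null effect and count how
    many reach two-sided significance, i.e. how many would be published
    under a strict significance filter."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)  # SE of the mean difference when SD = 1 is known
    published = 0
    for _ in range(n_studies):
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(treatment) - mean(control)) / se
        if abs(z) > z_crit:
            published += 1
    return published, n_studies


published, total = simulate_literature()
print(f"significant and published: {published} of {total} studies")
print("false positives among published studies: 100% (the true effect is zero)")
```

About 5% of the simulated studies pass the filter, and because the true effect is zero by construction, every one of them is a false positive. A reader who sees only the published studies cannot tell this literature apart from one that reports a real effect.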

JB asks what can be done about this. Apparently, JB is not aware of recent developments in psychological science that range from statistical tests that reveal missing studies (like an X-ray for locked file-drawers) to preregistration of studies that will be published without a significance filter.

Although utterly unlikely given current norms, reporting that we didn’t find the effect in a previous study (and describing the measures and manipulations used) would be broadly informative for the field and would benefit individual researchers conducting related studies. Certainly publication of replications by others would serve as a corrective as well.

It is not clear why publishing non-significant results is considered utterly unlikely in 2015, if the 2010 APA Publication Manual mandates publication of these studies.

Despite her pessimism about the future of psychological science, JB has a clear vision of how psychologists could improve their science.

A major, needed shift in research and publication norms is likely to be greatly facilitated by an embrace of open access publishing, where immediate feedback, open evaluations and peer reviews, and greater communication among researchers (including replications and null results) hold the promise of opening debate and discussion of findings. Such changes would help preclude false-positive effects from becoming prematurely reified as facts; but such changes, if they are to occur, will clearly take time.

The main message of this chapter is that researchers in psychology have been trained to chase significance because obtaining statistical significance by all means was considered a form of creativity and good research (Sternberg, 2018). Unfortunately, this is wrong. Statistical significance is only meaningful if it is obtained the right way and in an open and transparent manner.

Chapter 33
Commentary to Part V by Susan T. Fiske

It was surprising to read Fiske’s (2015) statement that “contrary to human nature, we as scientists should welcome humiliation, because it shows that the science is working.”

In marked contrast to this quote, Fiske has attacked psychologists who are trying to correct some of the errors in published articles as “method terrorists.”

I personally find both statements problematic. Nobody should welcome humiliation and nobody who points out errors in published articles is a terrorist. Researchers should simply realize that publications in peer-reviewed journals can still contain errors and that it is part of the scientific process to correct these errors. The biggest problem in the past seven years was not that psychologists made mistakes, but that they resisted efforts to correct them, a resistance that arises from a flawed understanding of the scientific method.

Chapter 36
Commentary to Part VI by Susan T. Fiske

Social psychologists have justified not reporting failed studies (cf. the Jonathan Haidt example) by calling these studies pilot studies (Bem, 2011). Bem pointed out that social psychologists have a lot of these pilot studies. But a pilot study is not a study that tests the cause-effect relationship. A pilot study tests either whether a manipulation is effective or whether a measure is reliable and valid. It is simply wrong to treat a study that tests the effect of a manipulation on an outcome as a pilot study just because the study did not work.

“However, few of the current proposals for greater transparency recommend describing each and every failed pilot study.”

The next statement makes it clear that Fiske conflates pilot studies with failed studies.

As noted, the reasons for failures to produce a given result are multiple, and supporting the null hypothesis is only one explanation.

Yes, but it is one plausible explanation and not disclosing the failure renders the whole purpose of empirical hypothesis testing irrelevant (Sterling, 1959).

“Deciding when one has failed to replicate is a matter of persistence and judgment.”

No, it is not. Preregister the study, and if you are willing to use a significant result if you obtain it, you have to report the non-significant result if you do not. Everything else is not science, and Susan Fiske seems to lack an understanding of the most basic reason for conducting an experiment.

What is an ethical scientist to do? One resolution is to treat a given result – even if it required fine-tuning to produce – as an existence proof: This result demonstrably can occur, at least under some circumstances. Over time, attempts to replicate will test generalizability.

This statement ignores that the observed pattern of results is heavily influenced by sampling error, especially in the typical between-subject design with small samples that is so popular in experimental social psychology. A mean difference between two groups does not mean that anything happened in this study. It could just be sampling error. But maybe the thought that most of the published results in experimental social psychology are just errors is too much to bear for somebody at the end of her career.
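How large a mean difference can sampling error alone produce? The simulation below is a rough sketch under assumptions of my choosing: two groups of 20 participants are drawn from the same standard normal population, so any observed difference is pure noise, and a difference of 0.3 standard deviations counts as "large". The function name and parameter values are hypothetical.

```python
import random
from statistics import mean


def share_of_large_null_differences(n_per_group=20, threshold=0.3,
                                    n_sims=10_000, seed=1):
    """Share of simulated null studies whose absolute mean difference
    (in population SD units) is at least `threshold`."""
    rng = random.Random(seed)
    large = 0
    for _ in range(n_sims):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if abs(mean(group_a) - mean(group_b)) >= threshold:
            large += 1
    return large / n_sims


print(round(share_of_large_null_differences(), 2))
```

With these settings, roughly a third of the studies show a difference of at least 0.3 standard deviations even though nothing happened in any of them. An observed difference in a single small study is therefore weak evidence that an effect can occur at all.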

I have watched the replication crisis unfold over the past seven years, since Bem (2011) published the eye-opening, ridiculous claims about feeling the future location of randomly displayed erotica. I cannot predict random events in the future, but I can notice trends and I do have a feeling that the future will not look kindly on those who tried to stand in the way of progress in psychological science. A new generation of psychologists is learning every day about replication failures and how to conduct better studies. For old people there are only two choices: step aside or help them to learn from the mistakes of the older generation.

P.S. I think there is a connection between morality and disgust but it (mainly) goes from immoral behaviors to disgust. So let me tell you, psychological science, Uranus stinks.

Ethical Challenges for Psychological Scientists