DataColada: No Comments

Science is like an iceberg. The published record is only a fraction of what university-paid academics do. Some time ago, Brian Nosek dreamed of a scientific utopia of open science that would make the workings of academia more transparent, but all we got were preprints and some badges – which are apparently being rolled back. He was never interested in an open discussion about the IAT or the ethics of Project Implicit.

Other famous open-science academics are also less open than you may think. DataColada benefited from open data to find evidence of fraud. As there is no law that requires sharing data, the fraudsters must be kicking themselves for being foolish enough to share their data and lose millions and their reputations.

Meanwhile, DataColada is not as open as one might think. Most notably, their blog does not even have an open comment section that would allow readers to share alternative viewpoints or – oh my god – criticism of the claims made by the DataColada team. This is worse than an old-school journal with peer review, which would occasionally publish some critical comments to maintain the image of a science that searches for the truth. Not so DataColada: here the opinions of three academics are presented as facts.

In comparison, I have fully embraced open science. My blog has comment sections, and I have even corrected errors that people have pointed out in my blog posts. Living in utopia, I have also shared emails that the authors wanted to hide (like my exchange with Uri about the poor performance of p-curve when data are heterogeneous). After our discussion, Uri simply posted a blog post claiming that p-curve handles heterogeneity just fine, based on simulation studies that assume very low levels of heterogeneity and on the claim that more heterogeneity does not exist in real data. A simple examination of actual heterogeneity in real datasets shows this to be false, but I was not able to correct the false claim on the blog: No Comments Allowed!

In keeping with my radically open approach to science, I am sharing an email from a colleague who also expressed frustration about Uri Simonsohn’s use of the blog to present his side of the story without giving the people he criticized a chance to respond openly to his criticism.

I recently found this email from Greg Francis from 2015 in an unrelated search of my inbox, and I think it is interesting enough to share, if only for historians who want to write about the replication crisis one day. The email was written in response to DataColada’s claim that we do not need statistical tests to reveal publication bias.

[24] P-curve vs. Excessive Significance Test – Data Colada

Uri writes with typical intellectual humility: “In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significant Test, with the (critically important) inferences one arrives at with p-curve.”

It may seem strange that a data sleuth and p-hacking detector argues against bias tests, but that is how academia works. My bias tests are good, all other bias tests are bad. Now let me write a blog post that makes other tests look bad to sell my own work. Not science, but academia.

In short, the DataColada team used the replication crisis to their own benefit. After using simulations to claim that questionable research practices can make practically any result significant, they published p-curve to reveal studies that were severely p-hacked. P-curve has over 1,000 citations, but at best a handful of articles have been shown to lack evidential value. Meanwhile, publication bias that cannot be detected by p-curve is present in over 80% of the articles that have been examined (Francis, 2012). So, which tool is useless?
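
For readers who are not familiar with the logic of excess-significance tests, here is a minimal sketch in Python. It is my own illustration with made-up numbers, not the code by Francis, Ioannidis and Trikalinos, or DataColada, and it uses a simple binomial approximation where the published test uses an exact calculation over the individual study powers.

```python
# Minimal illustration of the excess-significance logic (made-up numbers,
# not the actual TES or p-curve code discussed in this post).
from scipy import stats
from statsmodels.stats.power import TTestIndPower

d_pooled = 0.60                      # hypothetical pooled effect size
n_per_group = [20, 25, 30, 22, 28]   # hypothetical per-group sample sizes
observed_significant = 5             # all five studies reported p < .05

# Power of each study at alpha = .05 if the pooled effect size is the true effect
powers = [TTestIndPower().power(effect_size=d_pooled, nobs1=n, alpha=0.05)
          for n in n_per_group]
expected_significant = sum(powers)

# Rough binomial approximation of the probability of observing at least this
# many significant results; the published test uses an exact computation.
p_excess = stats.binom.sf(observed_significant - 1, len(powers),
                          expected_significant / len(powers))

print(f"expected: {expected_significant:.2f}, observed: {observed_significant}, "
      f"p(excess) = {p_excess:.3f}")
```

If the observed number of significant results is much larger than the number expected from the studies’ power, the reported set of studies is unlikely to be complete and unbiased.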

—————————————————————————-

Uri’s email to Greg Francis

On 27 Jun 2014, at 04:59 pm, Simonsohn, Uri <uws@wharton.upenn.edu> wrote:

Hi Greg,

Thanks for your email.
The policy is to contact authors whose research we discuss. I do not discuss research you conducted, so I did not contact you.

One could extend the policy to contacting everybody whose work is related to the post, but that would be impractical, I would have needed to contact Kahneman, Klein et at, Ioannidins & Trikalinos, Uli and you, and presumably the people whose work you have analyzed via EST and perhaps even the OSF people. Or perhaps extend the policy to contacting anybody who is likely to disagree with the post. Similarly impractical.

Looking at the comments you sent via email, note how you don’t need to refer to any paper you have written to make your arguments, they are based exclusively on new analyses I run on data you had never analyzed before. That indicates to me the post is separate from your past work.  

When I wrote a post about Bayesian analysis (http://datacolada.org/2014/01/13/13-posterior-hacking/) , I did not contact Bayesian statisticians like Kruschke or EJ. As in this case, I was talking about statistical tools they use, but not about analyses they have run, so our policy did not require me to contact them either. When we have written about replications we have not contacted Nosek.  
When I wrote about ceiling effects in one replication paper, I did not contact authors of other papers that may also have a ceiling effect, or other people who have talked about ceiling effects in that paper, I only contacted the authors whose work I was directly discussing.

Now, if I write a post about analyses EJ runs, or a replication that Nosek does, then of course we will contact them.
If I write a post about your use of the EST in this or that paper, then of course I will contact you.

You may disagree with the policy,  but I thought it would be fair to share the rationale with you.

Thanks again,

Uri

—–Original Message—–
From: Gregory Francis [mailto:gfrancis@purdue.edu]
Sent: Friday, June 27, 2014 8:51 AM
To: Simonsohn, Uri
Cc: <uli.schimmack@utoronto.ca> Schimmack; Leif Nelson; Simmons, Joseph
Subject: Data Colada

Hi Uri,

I saw your Data Colada posting on the P-curve vs. the excessive significance test (http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/ ). I really don’t understand the motivation for this posting, and I think you misrepresented the TES (Test for Excess Significance- Ioannidis’ term).

In particular, you conclude that the inference from the TES is pointless because we know there are 5 studies not reported. Indeed, if you know some relevant studies were not reported (since you removed them!) then you are correct that there is no reason to run the TES.  I would suggest that the more interesting test for this set of data would be to include the 5 non-significant studies (since they were actually published). Running the TES then gives 0.9699841 (I quickly modified your code to include all published studies; I am pretty sure this correct).  The details are

Pooled d: 0.598
Observed number of significant studies: 31
Expected number of significant studies: 31.08
Chi-square: 0.0014159
p: 0.96998

So, the TES would not claim that there is anything amiss with the full set of 36 reported studies.

I also object to your argument that nobody publishes “all” findings. Taken broadly enough, the statement is true, but somewhat silly and naive. What the TES considers is whether the stated theoretical claims are consistent with the reported findings. For example, in the TES analysis of all 36 studies, the theoretical claims (a fixed effect size of d=0.598) is consistent with the reported frequency of rejecting the null. On the other hand, if we take just the 31 significant experiments, then the theoretical claim (a fixed effect size of d=0.629) is not consistent with the reported frequency of rejecting the null. One need not report all studies for consistency to hold, and if there are valid methodological reasons to not publish some studies then they should not be published. I have explained this to you many times, so I get the feeling you are being deliberately obtuse on this issue, which is a shame because you are confusing people and, in the long-run, undermining your own credibility.

I also think your post is misleading in a broader context. The “about” section of Data Colada states:

“When discussing research by other authors we contact them before posting; we ask for suggestions to improve the post, and invite them to comment within the original blog post.”

Readers of your blog who believe you take the policy seriously should infer that Uli and I were shown a draft, asked for feedback, and given an opportunity to comment, which is not true. It is too late for you to follow parts one and two of your policy, but you can fix the third: allow Uli (if he wishes) and me to write a follow-up post on Data Colada that explains our views of the TES and p-curve analyses.  

Greg Francis

Professor of Psychological Sciences
Purdue University

What a jerk!

Greg

Gino-Colada – 2: The line between fraud and other QRPs

“It wasn’t fraud. It was other QRPs”

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.” (Gino, 2023)

Experimental social scientists have considered themselves superior to other social scientists because experiments provide strong evidence about causality that correlational studies cannot provide. Their experimental studies often produced surprising results, but because they were obtained with the experimental method and published in respected, peer-reviewed journals, they seemed to provide profound novel insights into human behavior.

In his popular book “Thinking, Fast and Slow,” Nobel Laureate Daniel Kahneman told readers that “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.” He probably regrets writing these words, because he no longer believes these findings (Kahneman, 2017).

What happened between 2011 and 2017? Social scientists started to distrust their own findings (or at least those of their colleagues) because it became clear that they did not use the scientific method properly. The key problem is that they only published results that provided evidence for their theories, hypotheses, and predictions, and did not report when their studies did not work. As one prominent experimental social psychologist put it:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister)

Researchers not only selectively published studies with favorable results; they also used a variety of statistical tricks to increase the chances of obtaining evidence for their claims. John et al. (2012) called these tricks questionable research practices (QRPs) and compared them to doping in sport. The difference is that doping is banned in sports, whereas the use of many QRPs is neither banned nor punished by social-science organizations.

The use of QRPs explains why scientific journals that report the results of experiments with human participants report over 90% of the time that the results confirmed researchers’ predictions. For statistical reasons, this high success rate is implausible even if all predictions were true (Sterling et al., 1995). The selective publishing of studies that worked renders the evidence meaningless (Sterling, 1959). Even clearly false hypotheses like “learning after an exam can increase exam performance” can receive empirical support when QRPs are used (Bem, 2011). The use of QRPs also explains why the results of experimental social scientists often fail to replicate (Schimmack, 2020).
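
A back-of-the-envelope calculation shows why such success rates are implausible. This is my own sketch with assumed numbers, not an analysis from the cited papers: even if every tested hypothesis were true, the expected success rate equals the studies’ statistical power, and average power in this literature is nowhere near 90%.

```python
# How likely is a 95% success rate across 100 honest tests if every
# hypothesis is true but each study has only 50% power? (Assumed numbers.)
from scipy import stats

power = 0.50      # assumed average power; actual estimates vary by field
n_tests = 100     # hypothetical number of published focal hypothesis tests
observed = 95     # roughly the success rate reported in journals

p = stats.binom.sf(observed - 1, n_tests, power)
print(f"P(at least {observed} significant results out of {n_tests}) = {p:.2e}")
```

The probability is essentially zero, which is Sterling’s point: success rates this high can only be produced by selection, not by honest reporting of all studies.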

John et al. (2012) used the term questionable research practices broadly. However, it is necessary to distinguish three types of QRPs that have different implications for the credibility of results.

One QRP is the selective publishing of significant results. In this case, the results are what they are and the data are credible. The problem is mainly that the published results are likely to be inflated by selection bias. This bias would disappear if all studies were published and the results were averaged. However, as long as non-significant results are not published, the average remains inflated.
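
A small simulation illustrates this inflation. It is my own sketch with arbitrary parameter values, not an analysis of any real data set: a modest true effect is studied with small samples, and only studies that reach p < .05 are “published.”

```python
# Simulation of effect-size inflation under selective publishing
# (arbitrary parameters; not an analysis of any real data set).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, n_sims = 0.3, 20, 5000    # assumed true effect, per-group n, runs
published = []

for _ in range(n_sims):
    control   = rng.normal(0.0,    1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(treatment, control)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d_obs = (treatment.mean() - control.mean()) / pooled_sd
    if p < .05:                       # only "successful" studies get published
        published.append(d_obs)

print(f"true d = {true_d}, mean published d = {np.mean(published):.2f}")
# The published average typically comes out around 0.7 or higher,
# more than double the true effect.
```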

The second type of QRP consists of various statistical tricks that can be used to “massage” the data into a more favorable result. These practices are now often called p-hacking. Presumably, they are used mainly after an initial analysis did not produce the desired result but showed a trend in the expected direction. P-hacking alters the data, and it is no longer clear how strong the actual evidence was. While lay people may consider these practices fraud or a type of doping, professional organizations tolerate them, and even evidence of their use would not lead to disciplinary actions against a researcher.
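
How much this kind of flexibility inflates false positives is easy to demonstrate. The following sketch (again my own, with made-up parameters) simulates one simple p-hacking strategy: measuring two outcomes and reporting whichever one reaches significance, when in fact there is no effect at all.

```python
# One simple p-hacking strategy under a true null effect: test two outcome
# measures and report whichever reaches p < .05. (Made-up parameters.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sims, hits = 30, 5000, 0

for _ in range(n_sims):
    group1 = rng.normal(size=(n, 2))     # two outcomes, no real group difference
    group2 = rng.normal(size=(n, 2))
    p1 = stats.ttest_ind(group1[:, 0], group2[:, 0]).pvalue
    p2 = stats.ttest_ind(group1[:, 1], group2[:, 1]).pvalue
    if min(p1, p2) < .05:                # report whichever outcome "worked"
        hits += 1

print(f"nominal alpha = .05, actual false-positive rate = {hits / n_sims:.3f}")
# With two independent outcomes this lands near 1 - .95**2, i.e. about 10%.
```

Adding more outcomes, covariates, and optional stopping pushes the rate much higher, which is why p-hacked significance tests carry little evidential value.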

The third QRP is fraud. Like p-hacking, fraud implies data manipulation with the goal of getting a desirable result, but the difference is… well, it is hard to say what the difference from p-hacking is, except that fraud is not tolerated by professional organizations. Outright fraud, in which a whole data set is made up (as with some datasets by the disgraced Diederik Stapel), is a clear case. It is harder to draw the line when a researcher selectively deletes outliers from two groups to get significance (p-hacking) or switches extreme cases from one group to another (fraud) (see Gino-Colada 1). In both cases, the data are meaningless, but only fraud leads to reputation damage and public outrage, while p-hackers can continue to present their claims as scientific truths.

The distinction between these types of QRPs is important for understanding Gino’s latest defense against the accusations of fraud that have been widely publicized in newspaper articles and in a long article in the New Yorker. In her response, she cites from Harvard’s investigative report to make the point that she is not a data fabricator.

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.”

The argument is clear: why would I have so many failed studies if I could just make up fake data that support my claims? Indeed, Stapel claims that he started faking studies outright because p-hacking is a lot of work and making up data is the most efficient QRP (“Why not just make the data up. Same results with less effort”). Gino makes it clear that she did not simply fabricate data, because she clearly collected a lot of data and has many failed studies that were not p-hacked or manipulated to get significance. She only did what everybody else did: hide the studies that did not work, and lots of them.

Whether she sometimes engaged in practices that cross the line from p-hacking to fraud is currently being investigated and is not my concern. What I find interesting is the frank admission in her defense that 80% of her studies failed to provide evidence for her hypotheses. However, anybody who looks up her published work will see mainly the results of studies that worked. And she has no problem telling us that these published results are just the tip of an iceberg of studies, many more of which did not work. She thinks this is totally fine, because she has been trained / brainwashed to believe that this is how science works. Significance testing is like a gold pan.

Get a lot of datasets, look for p < .05, keep the significant ones (gold), and throw away the rest. The more studies you run, the more gold you find, and the richer you are. Unfortunately for her and the other experimental social scientists who think every p-value below .05 is a discovery, this is not how science works, as Sterling (1959) pointed out many, many years ago, but nobody wants to listen to people who tell you that something is hard work.

Let’s assume for the moment that Gino really runs 100 studies to get 20 significant results (80% do not produce a significant result). Using a formula from Soric (1989), we can compute the risk that one of her 20 significant results is a false positive (i.e., a significant result that is a fluke without a real effect), even if she did not use p-hacking or other QRPs, which would further increase the risk of false claims.

FDR = ((1/.20) – 1)*(.05/.95) = 21%
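
For readers who want to check the arithmetic, here is the same calculation as a small Python helper (my own sketch; soric_fdr is a hypothetical name, not a function from any package):

```python
# Soric's (1989) upper bound on the false discovery rate, given the rate of
# significant results and the alpha level (soric_fdr is a hypothetical name).
def soric_fdr(discovery_rate: float, alpha: float = .05) -> float:
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# 20 significant results out of 100 studies at alpha = .05
print(f"{soric_fdr(0.20):.0%}")   # prints 21%, matching the calculation above
```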

Based on Gino’s own claim that 80% of her studies fail to produce significant results, we can infer that up to 21% of her published significant results could be false positives. Moreover, selective publishing also inflates effect sizes, and even if a result is not a false positive, the effect may be in the same direction but too small to be practically important. In other words, Gino’s empirical findings are meaningless without independent replications, even if she did not p-hack or manipulate any data. The question of whether she committed fraud is only relevant for her personal future. It has no relevance for the credibility of her published findings or those of others in her field, like Dan Air-Heady. The whole field is a train wreck. In 2012, Kahneman asked researchers in the field to clean up their act, but nobody listened, and Kahneman has lost faith in their findings. Maybe it is time to stop nudging social scientists with badges and to use some operant conditioning to shape their behavior. But until this happens, if it ever happens, we can just ignore this pseudo-science, no matter what happens in the Gino versus Harvard/DataColada case. As interesting as scandals are, they have no practical importance for the evaluation of the work that has been produced by experimental social scientists.

P.S. Of course, there are also researchers who have made real contributions, but unless we find ways to distinguish between credible work that was obtained without QRPs and incredible findings that were obtained with scientific doping, we don’t know which results we can trust. Maybe we need a doping test for scientists to find out.

The Gino-Colada Affair – 1

Link to Gino-Colada Affair – 2

Link to Gino-Colada Affair – 3

There is no doubt that social psychology and its applied fields, like behavioral economics and consumer psychology, have a credibility problem. Many findings cannot be replicated because they were obtained with questionable research practices (QRPs) or p-hacking. QRPs are statistical tricks that help researchers obtain p-values below the threshold needed to claim a discovery (p < .05). To be clear, although lay people and undergraduate students consider these practices deceptive, fraudulent, and unscientific, they are not considered fraudulent by researchers, professional organizations, funding agencies, or universities. Demonstrating that a researcher used QRPs to obtain significant results is easy-peasy and undermines the credibility of their work, but they can keep their jobs because it is not (yet) illegal to use these practices.

The Gino-Harvard scandal is different because the DataColada team claimed that they found “four studies for which we had accumulated the strongest evidence of fraud” and that they “believe that many more Gino-authored papers contain fake data.” To lay people, it can be hard to understand the difference between allowed QRPs and forbidden fraud or data manipulation. An example of a QRP would be selectively removing extreme values so that the difference between two groups becomes larger (e.g., removing extremely low depression scores from a control group to show a bigger treatment effect). Outright data manipulation would be switching participants with low scores from the control group to the treatment group and vice versa.

DataColada used features of the Excel spreadsheet that contained the data to claim that the data were manually manipulated.

The focus is on six rows that have a strong influence on the results for all three dependent variables reported in the article: whether participants cheated, overreporting of performance, and deductions.

Based on the datasheet, participants in the sign-at-the-top condition (1) in rows 67, 68, and 69 did not cheat, therefore also did not overreport performance, and had very low deductions, an independent measure of cheating. In contrast, participants in rows 70, 71, and 72 all cheated, had moderate amounts of overreporting, and had very high deductions.

Yada, yada, yada: yesterday Gino posted a blog post responding to these accusations. Personally, the most interesting rebuttal was the claim that there was no need to switch rows because the study results hold even without the flagged rows.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

This argument makes sense to me because fraud appears to be the last resort for researchers who are eager to present statistically significant results. After all, nobody claims that there was no data collection at all, as in some cases by Diederik Stapel, who committed blatant fraud around the time the article in question was published, when the use of questionable research practices was rampant. When researchers conduct an actual study, they probably hope to get the desired result without QRPs or fraud. As significance requires luck, they may just hope to get lucky. When this does not work, they can use a few QRPs. When this does not work either, they can shelf the study and try again. All of this would be perfectly legal by current standards of research ethics. However, if the results are close and it is not easy to collect more data (to hope for better results), it may be tempting to change a few condition labels to reach p < .05. And the accusation here (there are other studies) is that only six (or a couple more) rows were switched to get significance. However, Gino claims that the results were already significant, and I agree that it makes no sense for somebody to tamper with data if the p-value is already below .05.

However, Gino did not present evidence that the results hold without the contested cases. So, I downloaded the data and took a look.

First, I was able to reproduce the published result of an ANOVA with the three conditions as a categorical predictor variable and deductions as the outcome variable.

In addition, the original article reported that the differences between the experimental condition (“signature-on-top”) and each of the two control conditions (“signature-on-bottom”, “no signature”) were significant. I also confirmed these results.

Next, I repeated the analysis without rows 67 to 72. Without the six contested cases, the results are no longer statistically significant, F(2, 92) = 2.96, p = .057.

Interestingly, the pairwise comparisons of the experimental group with each of the two control groups were still statistically significant.

Combining the two control groups, comparing them to the experimental group, and presenting the result as a planned contrast would also have produced a significant result.

However, these results do not support Gino’s implication that the same analysis that was reported in the article would have produced a statistically significant result, p < .05, without the six contested cases. Moreover, the accusation is that she switched rows with low values into the experimental condition and rows with high values into the control condition. To simulate this scenario, I recoded the contested rows 67-69 as signature-at-the-bottom and rows 70-72 as signature-at-the-top and repeated the analysis. In this case, there was no evidence that the group means differed from each other, F(2, 98) = 0.45, p = .637.
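
For readers who want to repeat this kind of check, the sketch below outlines the analysis in Python. The file name, column name, condition labels, and index positions are hypothetical placeholders, not the labels used in the posted spreadsheet, and the contested spreadsheet rows 67-72 have to be mapped onto the zero-based index of the loaded data.

```python
# Sketch of the reanalysis described above (hypothetical file, column, and
# condition names; index positions are illustrative, not verified mappings).
import pandas as pd
from scipy import stats

df = pd.read_excel("study1_data.xlsx")        # hypothetical file name

def deduction_anova(data):
    """One-way ANOVA of deductions across the three signing conditions."""
    groups = [g["deduction"] for _, g in data.groupby("condition")]
    return stats.f_oneway(*groups)

contested = [66, 67, 68, 69, 70, 71]          # illustrative index positions

# 1. Full data set: should reproduce the published ANOVA result
print(deduction_anova(df))

# 2. Without the six contested rows: reported above as F(2, 92) = 2.96, p = .057
print(deduction_anova(df.drop(index=contested)))

# 3. Undo the alleged switch by reassigning the contested rows' condition labels
df_swapped = df.copy()
df_swapped.loc[contested[:3], "condition"] = "signature_bottom"
df_swapped.loc[contested[3:], "condition"] = "signature_top"
print(deduction_anova(df_swapped))
```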

Conclusion

Experimental social psychology has a credibility crisis because researchers were (and still are) allowed to use many statistical tricks to get significant results or to hide studies that did not produce the desired results. The Gino scandal is only remarkable because outright manipulation of data is the only ethics violation that has personal consequences for researchers when it can be proven. Lack of evidence that fraud was committed, or even the absence of fraud, does not imply that results are credible. For example, the results in Study 2 are meaningless even without fraud because the null hypothesis was rejected with a confidence interval that included values close to zero as plausible. While the article claims to show evidence of mediation, the published data alone show that there is no empirical evidence for this claim, even if p < .05 was obtained without p-hacking or fraud. Misleading claims based on weak data, however, do not violate any ethics guidelines and are a common, if not essential, part of the game called social psychology.

This blog post only examined one minor question: Gino’s claim that she did not have to manipulate data because the results were already significant.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

My results suggest that this claim lacks empirical support. A key result was only significant with the contested rows of data included. Of course, this finding does not warrant the conclusion that the data were tampered with to get statistical significance. We will have to wait for the answer to this 25-million-dollar question.