Tag Archives: Fraud

Gino-Colada – 2: The line between fraud and other QRPs

“It wasn’t fraud. It was other QRPs”

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.” (Gino, 2023)

Experimental social scientists have considered themselves superior to other social scientists because experiments provide strong evidence about causality that correlational studies cannot provide. Their experimental studies often produced surprising results, but because they were obtained using the experimental method and published in respected, peer-reviewed journals, they seemed to provide profound novel insights into human behavior.

In his popular book “Thinking, Fast and Slow,” Nobel Laureate Daniel Kahneman told readers “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.” He probably regrets writing these words, because he no longer believes these findings (Kahneman, 2017).

What happened between 2011 and 2017? Social scientists started to distrust their own findings (or at least those of their colleagues) because it became clear that they had not used the scientific method properly. The key problem is that they only published results when they provided evidence for their theories, hypotheses, and predictions, but did not report when their studies did not work. As one prominent experimental social psychologist put it:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister)

Researchers not only selectively published studies with favorable results. They also used a variety of statistical tricks to increase the chances of obtaining evidence for their claims. John et al. (2012) called these tricks questionable research practices (QRPs) and compared them to doping in sport. The difference is that doping is banned in sports, but the use of many QRPs is not banned or punished by social scientific organizations.

The use of QRPs explains why scientific journals that report the results of experiments with human participants report over 90% of the time that the results confirmed researchers’ predictions. For statistical reasons, this high success rate is implausible even if all predictions were true (Sterling et al., 1995). The selective publishing of studies that worked renders the evidence meaningless (Sterling, 1959). Even clearly false hypotheses like “learning after an exam can increase exam performance” can receive empirical support when QRPs are being used (Bem, 2011). The use of QRPs also explains why results of experimental social scientists often fail to replicate (Schimmack, 2020).

John et al. (2012) used the term questionable research practices broadly. However, it is necessary to distinguish three types of QRPs that have different implications for the credibility of results.

One QRP is the selective publishing of significant results. In this case, the results are what they are and the data are credible. The problem is mainly that these results are likely to be inflated by sampling bias. This bias would disappear if all studies were published and the results were averaged. However, if non-significant results are not published, the average remains inflated.
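The inflation described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the true effect (0.2 standard deviations), the sample size (20 per group), the known-variance z-test, and the function name `simulate_study` are all hypothetical choices for illustration, not values taken from any study discussed here.

```python
import random
import statistics


def simulate_study(true_d, n_per_group, rng):
    """Simulate one two-group study; return (observed effect, z statistic)."""
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treatment) - statistics.fmean(control)
    # Simplification: the population SD is 1 by construction, so a z-test
    # with known variance stands in for the usual t-test.
    se = (2.0 / n_per_group) ** 0.5
    return diff, diff / se


rng = random.Random(1)          # fixed seed so the run is repeatable
true_d, n_per_group = 0.2, 20   # hypothetical values
results = [simulate_study(true_d, n_per_group, rng) for _ in range(5000)]

all_effects = [d for d, _ in results]
# "Published" = significant in the predicted direction (z > 1.96).
published = [d for d, z in results if z > 1.96]

mean_all = statistics.fmean(all_effects)
mean_published = statistics.fmean(published)
print(f"true effect:                 {true_d:.2f}")
print(f"mean of all studies:         {mean_all:.2f}")
print(f"mean of 'published' studies: {mean_published:.2f}")
print(f"share of studies published:  {len(published) / len(results):.1%}")
```

With these settings a study can only reach significance if its observed effect exceeds roughly 0.62 (1.96 times the standard error), so the average of the selected studies has to be several times larger than the true effect of 0.2, while the average of all studies stays close to it.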

The second type of QRPs are various statistical tricks that can be used to “massage” the data to produce a more favorable result. These practices are now often called p-hacking. Presumably, these practices are used mainly after an initial analysis did not produce the desired result, but perhaps showed a trend in the expected direction. P-hacking alters the data and it is no longer clear how strong the actual evidence was. While lay people may consider these practices fraud or a type of doping, professional organizations tolerate these practices and even evidence of their use would not lead to disciplinary actions against a researcher.

The third QRP is fraud. Like p-hacking, fraud implies data manipulation with the goal of getting a desirable result, but the difference is … well, it is hard to say what the difference to p-hacking is except that it is not tolerated by professional organizations. Cases in which a whole dataset is made up (like some datasets by the disgraced Diederik Stapel) are clear cases of fraud. However, it is harder to distinguish between fraud and p-hacking when one researcher selectively deletes outliers from two groups to get significance (p-hacking) or switches extreme cases from one group to another (fraud) (GinoColada1). In both cases, the data are meaningless, but only fraud leads to reputation damage and public outrage, while p-hackers can continue to present their claims as scientific truths.

The distinction between different types of QRPs is important to understand Gino’s latest defense against accusations that she committed fraud, which have been widely publicized in newspaper articles and a long article in the New Yorker. In her response, she quotes from Harvard’s investigative report to make the point that she is not a data fabricator.

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.”

The argument is clear. Why would I have so many failed studies if I could just make up fake data that support my claims? Indeed, Stapel claims that he started faking studies outright because p-hacking is a lot of work and making up data is the most efficient QRP (“Why not just make the data up. Same results with less effort”). Gino makes it clear that she did not just fabricate data because she clearly collected a lot of data and has many failed studies that were not p-hacked or manipulated to get significance. She only did what everybody else did: hiding the studies that did not work, and lots of them.

Whether she sometimes did engage in practices that cross the line from p-hacking to fraud is currently being investigated and not my concern. What I find interesting is the frank admission in her defense that 80% of her studies failed to provide evidence for her hypotheses. However, if somebody looked up her published work, they would see mainly the results of studies that worked. And she has no problem telling us that these published results are just the tip of an iceberg of studies, where many more did not work. She thinks this is totally ok, because she has been trained / brainwashed to believe that this is how science works. Significance testing is like a gold pan.

Get a lot of datasets, look for p < .05, keep the significant ones (gold) and throw away the rest. The more studies you run, the more gold you find, and the richer you are. Unfortunately for her and the other experimental social scientists who think every p-value below .05 is a discovery, this is not how science works, as Sterling (1959) pointed out many, many years ago, but nobody wants to listen to people who tell them that science is hard work.

Let’s for the moment assume that Gino really runs 100 studies to get 20 significant results (80% do not work, p < .10). Using a formula from Soric (1989), we can compute the risk that one of her 20 significant results is a false positive result (i.e., the significant result is a fluke without a real effect), even if she did not use p-hacking or other QRPs, which would further increase the risk of false claims.

FDR = ((1/.20) – 1)*(.05/.95) = 21%
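The arithmetic above can be checked with a few lines of code. This is a minimal sketch of Soric’s bound as it is written in the formula; the function name `soric_max_fdr` is my own label, and the cap at 100% is an added safeguard for very low discovery rates.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Upper bound on the false discovery rate, following the formula above.

    discovery_rate: share of all conducted studies that were significant.
    alpha: significance criterion used to call a result a discovery.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    bound = (1 / discovery_rate - 1) * (alpha / (1 - alpha))
    return min(bound, 1.0)  # a rate cannot exceed 100%


# 20 significant results out of 100 studies, alpha = .05
fdr = soric_max_fdr(0.20)
print(f"{fdr:.0%}")  # prints 21%
```

The bound is a worst case: it assumes that every true effect was tested with perfect power, so the actual share of false positives may be lower.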

Based on Gino’s own claim that 80% of her studies fail to produce significant results, we can infer that up to 21% of her published significant results could be false positive results. Moreover, selective publishing also inflates effect sizes, and even if a result is not a false positive, the true effect may be in the same direction but too small to be practically important. In other words, Gino’s empirical findings are meaningless without independent replications, even if she didn’t use p-hacking or manipulate any data. The question whether she committed fraud is only relevant for her personal future. It has no relevance for the credibility of her published findings or those of others in her field like Dan Air-Heady. The whole field is a train wreck. In 2012, Kahneman asked researchers in the field to clean up their act, but nobody listened and Kahneman has lost faith in their findings. Maybe it is time to stop nudging social scientists with badges and use some operant conditioning to shape their behavior. But until this happens, if it ever happens, we can just ignore this pseudo-science, no matter what happens in the Gino versus Harvard/DataColada case. As interesting as scandals are, they have no practical importance for the evaluation of the work that has been produced by experimental social scientists.

P.S. Of course, there are also researchers who have made real contributions, but unless we find ways to distinguish between credible work that was obtained without QRPs and incredible findings that were obtained with scientific doping, we don’t know which results we can trust. Maybe we need a doping test for scientists to find out.

The Gino-Colada Affair – 1

Link to Gino Colada Affair – 2

Link to Gino-Colada Affair – 3

There is no doubt that social psychology and its applied fields like behavioral economics and consumer psychology have a credibility problem. Many of the findings cannot be replicated because they were obtained with questionable research practices or p-hacking. QRPs are statistical tricks that help researchers to obtain p-values below the necessary threshold to claim a discovery (p < .05). To be clear, although lay people and undergraduate students consider these practices to be deceptive, fraudulent, and unscientific, they are not considered fraudulent by researchers, professional organizations, funding agencies, or universities. Demonstrating that a researcher used QRPs to obtain significant results is easy-peasy and undermines the credibility of their work, but they can keep their jobs because it is not (yet) illegal to use these practices.

The Gino-Harvard scandal is different because the DataColada team claimed that they found “four studies for which we had accumulated the strongest evidence of fraud” and that they “believe that many more Gino-authored papers contain fake data.” To lay people, it can be hard to understand the difference between allowed QRPs and forbidden fraud or data manipulation. An example of a QRP could be selectively removing extreme values so that the difference between two groups becomes larger (e.g., removing extremely low depression scores from a control group to show a bigger treatment effect). Outright data manipulation would be switching participants with low scores from the control group to the treatment group and vice versa.

DataColada used features of the Excel spreadsheet that contained the data to claim that the data were manually manipulated.

The focus is on six rows that have a strong influence on the results for all three dependent variables that were reported in the article, namely cheated or not, overreporting of performance, and deductions.

Based on the datasheet, participants in the sign-at-the-top condition (1) in rows 67, 68, and 69 did not cheat, therefore did not overreport performance, and had very low deductions, an independent measure of cheating. In contrast, participants in rows 70, 71, and 72 all cheated, had moderate amounts of overreporting, and very high deductions.

Yadi, yadi, yada, yesterday Gino posted a blog post that responded to these accusations. Personally, the most interesting rebuttal was the claim that there was no need to switch rows because the study results hold even without the flagged rows.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

This argument makes sense to me because fraud appears to be the last resort for researchers who are eager to present a statistically significant result. After all, nobody claims that there was no data collection, as in some cases by Diederik Stapel, who committed blatant fraud around the time the article in question was published and the use of questionable research practices was rampant. When researchers conduct an actual study, they probably hope to get the desired result without QRPs or fraud. As significance requires luck, they may just hope to get lucky. When this does not work, they can use a few QRPs. When this does not work, they can just shelve the study and try again. All of this would be perfectly legal by current standards of research ethics. However, if the results are close and it is not easy to collect more data to hope for better results, it may be tempting to change a few labels of conditions to reach p < .05. And the accusation here (there are other studies) is that only 6 (or a couple more) rows were switched to get significance. However, Gino claims that the results were already significant and I agree that it makes no sense for somebody to tamper with data if the p-value is already below .05.

However, Gino did not present evidence that the results hold without the contested cases. So, I downloaded the data and took a look.

First, I was able to reproduce the published result of an ANOVA with the three conditions as categorical predictor variable and deductions as outcome variable.

In addition, the original article reported that the differences between the experimental “signature-on-top” and each of the two control conditions (“signature-on-bottom”, “no signature”) were significant. I also confirmed these results.

Now I repeated the analysis without rows 67 to 72. Without the six contested cases, the results are no longer statistically significant, F(2, 92) = 2.96, p = .057.

Interestingly, the comparisons of the experimental group with the two control groups were statistically significant.

Combining the two control groups, comparing them to the experimental group, and presenting the results as a planned contrast would also have produced a significant result.

However, these results do not support Gino’s implication that the same analysis that was reported in the article would have produced a statistically significant result, p < .05, without the six contested cases. Moreover, the accusation is that she switched rows with low values to the experimental condition and rows with high values to the control condition. To simulate this scenario, I recoded the contested rows 67-69 as signature-at-the-bottom and 70-72 as signature-at-the-top and repeated the analysis. In this case, there was no evidence that the group means differed from each other, F(2,98) = 0.45, p = .637.
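The logic of these re-analyses can be sketched in code. Everything below is hypothetical: the toy numbers, the row ids, and the helper names (`one_way_anova`, `drop_rows`, `relabel_rows`) are made up for illustration and are not the study data, so the output does not reproduce the F-values reported above. The sketch only computes the F statistic and its degrees of freedom; a p-value additionally needs the F distribution, which a statistics package such as SciPy provides.

```python
from collections import defaultdict
from statistics import fmean


def one_way_anova(rows):
    """Return (F, df_between, df_within) for rows of (row_id, condition, value)."""
    groups = defaultdict(list)
    for _, condition, value in rows:
        groups[condition].append(value)
    values = [v for g in groups.values() for v in g]
    grand_mean = fmean(values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


def drop_rows(rows, flagged):
    """Analysis 1: exclude the contested rows."""
    return [r for r in rows if r[0] not in flagged]


def relabel_rows(rows, new_labels):
    """Analysis 2: assign contested rows to a different condition."""
    return [(rid, new_labels.get(rid, cond), val) for rid, cond, val in rows]


# Made-up toy data: (row id, condition, deductions). Rows 3 and 6 play the
# role of the contested rows.
rows = [
    (1, "top", 1.0), (2, "top", 2.0), (3, "top", 0.0),
    (4, "bottom", 4.0), (5, "bottom", 5.0), (6, "bottom", 9.0),
    (7, "none", 4.0), (8, "none", 5.0), (9, "none", 6.0),
]

f_full = one_way_anova(rows)
f_dropped = one_way_anova(drop_rows(rows, {3, 6}))
f_swapped = one_way_anova(relabel_rows(rows, {3: "bottom", 6: "top"}))

for label, (f, df1, df2) in [("all rows", f_full),
                             ("without contested rows", f_dropped),
                             ("contested rows relabeled", f_swapped)]:
    print(f"{label}: F({df1}, {df2}) = {f:.2f}")
```

In this toy example, swapping the labels of just two rows is enough to move the F-value from a large to a negligible one, which is the kind of leverage a handful of extreme cases can have in a small sample.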

Conclusion

Experimental social psychology has a credibility crisis because researchers were (and still are) allowed to use many statistical tricks to get significant results or to hide studies that didn’t produce the desired results. The Gino scandal is only remarkable because outright manipulation of data is the only ethics violation that has personal consequences for researchers when it can be proven. Neither a lack of evidence that fraud was committed nor an actual lack of fraud implies that results are credible. For example, the results in Study 2 are meaningless even without fraud because the null-hypothesis was rejected with a confidence interval that had a value close to zero as a plausible value. While the article claims to show evidence of mediation, the published data alone show that there is no empirical evidence for this claim even if p < .05 was obtained without p-hacking or fraud. Misleading claims based on weak data, however, do not violate any ethics guidelines and are a common, if not essential, part of a game called social psychology.

This blog post only examined one minor question. Gino claimed that she did not have to manipulate data because the results were already significant.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

My results suggest that this claim lacks empirical support. A key result was only significant with the rows of data that have been contested. Of course, this finding does not warrant the conclusion that the data were tampered with to get statistical significance. We have to wait to get the answer to this 25 million dollar question.

Dan Ariely and the Credibility of (Social) Psychological Science

It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).

The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity and may end a career (e.g., Stapel). However, there are many other reasons to be suspicious of the credibility of Dan Ariely’s published results and those of many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest behavior.

Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, any given published result in a social psychology journal is more likely to fail than to succeed in a replication attempt. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.

Questionable Research Practices

The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).

This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a study produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis, and not mention that other ways failed to show the effect. Selection for significance has many names such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end-result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.

Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).

The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve.2.0, to examine the credibility of results published in Dan Ariely’s articles.

Data

To examine the credibility of results published in Dan Ariely’s articles, I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.

For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.

Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.
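The last step of this conversion, from a two-sided p-value to an absolute z-score, can be sketched with the standard library alone. The function name `p_to_abs_z` is a hypothetical label, and this is not necessarily the code used for the audit; the earlier step of turning t, F, or chi-square values into p-values needs the corresponding distribution functions from a statistics package and is not shown.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    if not 0 < p_two_sided < 1:
        raise ValueError("p-value must be strictly between 0 and 1")
    # Half of the two-sided p-value lies in each tail of the standard normal.
    # Using the lower tail and flipping the sign keeps precision for tiny p.
    return -NormalDist().inv_cdf(p_two_sided / 2)


for p in (0.05, 0.01):
    print(f"p = {p} -> z = {p_to_abs_z(p):.2f}")
```

The two example values recover the familiar thresholds: p = .05 corresponds to z = 1.96 and p = .01 to z = 2.58.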

Z-Curve Results

The key results of the z-curve analysis are captured in Figure 1.

Figure 1

Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.

A formal test of selection for significance compares the observed discovery rate (95% of z-scores are greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.

Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, but fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2011). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear losing his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumeister is 11% (Baumeister Audit).

The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and (b) replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.

Z-curve makes it possible to do so because the EDR estimates can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., it is now on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).
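The link between the EDR and the false discovery risk can be made concrete with a short sketch. It assumes that the risk is computed with Soric’s (1989) bound, which reproduces the 30% figure for an EDR of 15%; the function names are hypothetical, and the second function simply solves the same bound for the discovery rate.

```python
def max_false_discovery_risk(edr, alpha):
    """Soric's upper bound on the false discovery risk for a given discovery rate."""
    return min((1 / edr - 1) * (alpha / (1 - alpha)), 1.0)


def min_edr_for_target(target_fdr, alpha):
    """Smallest discovery rate that keeps the bound at or below target_fdr.

    Obtained by rearranging FDR = (1/EDR - 1) * alpha / (1 - alpha) for EDR.
    """
    return 1 / (1 + target_fdr * (1 - alpha) / alpha)


print(f"EDR = 15%, alpha = .05 -> FDR up to {max_false_discovery_risk(0.15, 0.05):.0%}")
for alpha in (0.05, 0.01):
    needed = min_edr_for_target(0.05, alpha)
    print(f"alpha = {alpha}: FDR <= 5% requires a discovery rate of at least {needed:.0%}")
```

Under this bound, a false discovery risk below 5% requires a discovery rate of about 51% with alpha = .05, but only about 17% with alpha = .01, which illustrates why lowering alpha is such an effective way to reduce the risk.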

With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.

The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.

Figure 3

In Figure 3 only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model using alpha = .05 (the red vertical line at 1.96) as criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.

Conclusion

The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this method to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off by ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.

Aber bitte ohne Sanna

Abstract

Social psychologists have failed to clean up their act and their literature. Here I show unusually high effect sizes in non-retracted articles by Sanna, who retracted several articles. I point out that non-retraction does not equal credibility and I show that co-authors like Norbert Schwarz lack any motivation to correct the published record. The inability of social psychologists to acknowledge and correct their mistakes renders social psychology a para-science that lacks credibility. Even meta-analyses cannot be trusted because they do not correct properly for the use of questionable research practices.

Introduction

When I grew up, a popular German Schlager was the song “Aber bitte mit Sahne.” The song is about Germans’ love of desserts with whipped cream. So, when I saw articles by Sanna, I had to think about whipped cream, which is delicious. Unfortunately, articles by Sanna are the exact opposite. In the early 2010s, it became apparent that Sanna had fabricated data. However, unlike the thorough investigation of a similar case in the Netherlands, the extent of Sanna’s fraud remains unclear (Retraction Watch, 2012). The latest count of Sanna’s retracted articles was 8 (Retraction Watch, 2013).

WebOfScience shows 5 retraction notices for 67 articles, which means 62 articles have not been retracted. The question is whether these articles can be trusted to provide valid scientific information. The answer to this question matters because Sanna’s articles are still being cited at a rate of over 100 citations per year.

Meta-Analysis of Ease of Retrieval

The data are also being used in meta-analyses (Weingarten & Hutchinson, 2018). Fraudulent data are particularly problematic for meta-analysis because fraud can produce large effect sizes that inflate the meta-analytic effect size estimate. Here I report the results of my own investigation that focusses on the ease-of-retrieval paradigm that was developed by Norbert Schwarz and colleagues (Schwarz et al., 1991).

The meta-analysis included 7 studies from 6 articles. Two studies produced independent effect size estimates for 2 conditions for a total of 9 effect sizes.

Sanna, L. J., Schwarz, N., & Small, E. M. (2002). Accessibility experiences and the hindsight bias: I knew it all along versus it could never have happened. Memory & Cognition, 30(8), 1288–1296. https://doi.org/10.3758/BF03213410 [Study 1a, 1b]

Sanna, L. J., Schwarz, N., & Stocker, S. L. (2002). When debiasing backfires: Accessible content and accessibility experiences in debiasing hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(3), 497–502. https://doi.org/10.1037/0278-7393.28.3.497 [Study 1 & 2]

Sanna, L. J., & Schwarz, N. (2003). Debiasing the hindsight bias: The role of accessibility experiences and (mis)attributions. Journal of Experimental Social Psychology, 39(3), 287–295. https://doi.org/10.1016/S0022-1031(02)00528-0 [Study 1]

Sanna, L. J., Chang, E. C., & Carter, S. E. (2004). All Our Troubles Seem So Far Away: Temporal Pattern to Accessible Alternatives and Retrospective Team Appraisals. Personality and Social Psychology Bulletin, 30(10), 1359–1371. https://doi.org/10.1177/0146167204263784 [Study 3a]

Sanna, L. J., Parks, C. D., Chang, E. C., & Carter, S. E. (2005). The Hourglass Is Half Full or Half Empty: Temporal Framing and the Group Planning Fallacy. Group Dynamics: Theory, Research, and Practice, 9(3), 173–188. https://doi.org/10.1037/1089-2699.9.3.173 [Study 3a, 3b]

Carter, S. E., & Sanna, L. J. (2008). It’s not just what you say but when you say it: Self-presentation and temporal construal. Journal of Experimental Social Psychology, 44(5), 1339–1345. https://doi.org/10.1016/j.jesp.2008.03.017 [Study 2]

When I examined Sanna’s results, I found that all 9 effect sizes were extremely large, with effect size estimates larger than one standard deviation. A logistic regression analysis that predicted authorship (With Sanna vs. Without Sanna) from effect size showed that the large effect sizes in Sanna’s articles were unlikely to be due to sampling error alone, b = 4.6, se = 1.1, t(184) = 4.1, p = .00004 (1 / 24,642).
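As a rough check on the reported statistics, the following sketch recomputes the test statistic from the coefficient and its standard error and converts a two-sided p-value into "1 in N" odds. It assumes a standard normal approximation rather than the t distribution with 184 degrees of freedom, so the numbers will be close to, but not identical with, the ones reported above. The function names are illustrative and not part of the original analysis.

```python
from statistics import NormalDist


def wald_statistic(b: float, se: float) -> float:
    """Ratio of a regression coefficient to its standard error."""
    return b / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def one_in_n(p: float) -> float:
    """Express a p-value as '1 in N' odds."""
    return 1.0 / p


# Values taken from the text: b = 4.6, se = 1.1
z = wald_statistic(4.6, 1.1)
p = two_sided_p(z)
print(f"z = {z:.2f}, p = {p:.6f}, about 1 in {one_in_n(p):,.0f}")
```

Because the coefficient and standard error are rounded in the text, small differences between this recomputation and the reported t-value and odds are expected.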

These results show that Sanna’s effect sizes are not typical for the ease-of-retrieval literature. As one of his retracted articles used the ease-of-retrieval paradigm, it is possible that these articles are equally untrustworthy. As many other studies have investigated ease-of-retrieval effects, it seems prudent to exclude articles by Sanna from future meta-analyses.

These articles should also not be cited as evidence for specific claims about ease-of-retrieval effects for the specific conditions that were used in these studies. As the meta-analysis shows, there have been no credible replications of these studies and it remains unknown how much ease of retrieval may play a role under the specified conditions in Sanna’s articles.

Discussion

This blog post is also a warning for young scientists and students of social psychology that they cannot trust researchers who became famous with the help of questionable research practices that produced too many significant results. As the reference list shows, several articles by Sanna were co-authored by Norbert Schwarz, the inventor of the ease-of-retrieval paradigm. It is most likely that he was unaware of Sanna’s fraudulent practices. However, he seemed to lack any concerns that the results might be too good to be true. After all, he encountered replication failures in his own lab.

“of course, we had studies that remained unpublished. Early on we experimented with different manipulations. The main lesson was: if you make the task too blatantly difficult, people correctly conclude the task is too difficult and draw no inference about themselves. We also had a couple of studies with unexpected gender differences” (Schwarz, email communication, 5/18/21).

So, why was he not suspicious when Sanna only produced successful results? I was wondering whether Schwarz had some doubts about these studies with the help of hindsight bias. After all, a decade or more later, we know that Sanna committed fraud for some articles on this topic, we know about replication failures in larger samples (Yeager et al., 2019), and we know that the true effect sizes are much smaller than Sanna’s reported effect sizes (Weingarten & Hutchinson, 2018).

Hi Norbert, 
   thank you for your response. I am doing my own meta-analysis of the literature as I have some issues with the published one by Evan. More about that later. For now, I have a question about some articles that I came across, specifically Sanna, Schwarz, and Small (2002). The results in this study are very strong (d ~ 1).  Do you think a replication study powered for 95% power with d = .4 (based on meta-analysis) would produce a significant result? Or do you have concerns about this particular paradigm and do not predict a replication failure?
Best, Uli (email)
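The replication scenario raised in this email can be made concrete with a standard sample-size calculation. The sketch below assumes a two-group between-subjects design, a two-sided alpha of .05, and a normal approximation to the t-test. These are assumptions made for illustration, not details taken from the original studies, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.95, alpha: float = 0.05) -> int:
    """Approximate per-group sample size for a two-sample comparison.

    Uses the normal approximation n = 2 * ((z_alpha/2 + z_power) / d)^2,
    which slightly underestimates the exact t-test requirement.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, two-sided
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Effect sizes mentioned in the email: d = .4 (meta-analysis) and d ~ 1 (original)
for d in (0.4, 1.0):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under these assumptions, a replication planned for d = .4 needs several times as many participants per group as a study planned for d = 1.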

His response shows that he is unwilling or unable to even consider the possibility that Sanna used fraud to produce the results in this article that he co-authored.

Uli, that paper has 2 experiments, one with a few vs many manipulation and one with a facial manipulation.  I have no reason to assume that the patterns won’t replicate. They are consistent with numerous earlier few vs many studies and other facial manipulation studies (introduced by Stepper & Strack,  JPSP, 1993). The effect sizes always depend on idiosyncracies of topic, population, and context, which influence accessible content and accessibility experience. The theory does not make point predictions and the belief that effect sizes should be identical across decades and populations is silly — we’re dealing with judgments based on accessible content, not with immutable objects.  

This response is symptomatic of social psychologists’ response to decades of research that has produced questionable results that often fail to replicate (see Schimmack, 2020, for a review). Even when there is clear evidence of questionable practices, journals are reluctant to retract articles that make false claims based on invalid data (Kitayama, 2020). And social psychologist Daryl Bem would rather be remembered as a loony para-psychologist than as a real scientist (Bem, 2021).

The problem with these social psychologists is not that they made mistakes in the way they conducted their studies. The problem is their inability to acknowledge and correct their mistakes. While they are clinging to their CVs and H-Indices to protect their self-esteem, they are further eroding trust in psychology as a science and forcing junior scientists who want to improve things out of academia (Hilgard, 2021). After all, the key feature of science that distinguishes it from ideologies is the ability to correct itself. A science that shows no signs of self-correction is a para-science and not a real science. Thus, social psychology is currently a para-science (i.e., “Parascience is a broad category of academic disciplines, that are outside the scope of scientific study,” Wikipedia).

The only hope for social psychology is that young researchers are unwilling to play by the old rules and start a credibility revolution. However, the incentives still favor conformists who suck up to the old guard. Thus, it is unclear if social psychology will ever become a real science. A first sign of improvement would be to retract articles that make false claims based on results that were produced with questionable research practices. Instead, social psychologists continue to write review articles that ignore the replication crisis (Schwarz & Strack, 2016) as if repression can bend reality.

Nobody should believe them.