DataColada: No Comments

Science is like an iceberg. The published record is only a fraction of what university-paid academics do. Some time ago, Brian Nosek dreamed of a scientific utopia of open science that would make the workings of academia more transparent, but all we got were preprints and some badges – which are apparently being rolled back. He was never interested in an open discussion about the IAT or the ethics of Project Implicit.

Other famous open-science academics are also less open than you may think. DataColada benefited from open data to find evidence of fraud. As there is no law that requires sharing data, the fraudsters must be kicking themselves for being foolish enough to share their data and lose millions and their reputations.

Meanwhile, DataColada is not as open as one might think. Most notably, their blog does not even have an open comment section that would allow readers to share alternative viewpoints or – oh my god – criticism of the claims made by the DataColada team. This is worse than an old-school journal with peer review, which would occasionally publish some critical comments to maintain the image of a science that searches for the truth. Not so DataColada: here the opinions of three academics are presented as facts.

In comparison, I have fully embraced open science. My blog has comment sections, and I have even corrected errors that people have pointed out in my blog posts. Living in utopia, I have also shared emails that the authors wanted to hide (like my exchange with Uri about the poor performance of p-curve when data are heterogeneous). After our discussion, Uri simply posted a blog post claiming that p-curve handles heterogeneity just fine, based on simulation studies that assume very low levels of heterogeneity and on the claim that more heterogeneity does not exist in real data. A simple examination of actual heterogeneity in real datasets shows this to be false, but I was not able to correct the false claim on the blog: No Comments Allowed!

In keeping with my radically open approach to science, I am sharing an email from a colleague who also expressed frustration about Uri Simonsohn’s use of the blog to present his side of the story without giving the people he criticized a chance to respond openly to his criticism.

I recently found this email from Greg Francis from 2015 in an unrelated search of my inbox, and I think it is interesting enough to share, if only for historians who want to write about the replication crisis one day. The email was written in response to DataColada’s claim that we do not need statistical tests to reveal publication bias.

[24] P-curve vs. Excessive Significance Test – Data Colada

Uri writes with typical intellectual humility: “In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significant Test, with the (critically important) inferences one arrives at with p-curve.”

It may seem strange that a data sleuth and p-hacking detector argues against bias tests, but that is how academia works. My bias tests are good, all other bias tests are bad. Now let me write a blog post that makes other tests look bad to sell my own work. Not science, but academia.

In short, the DataColada team used the replication crisis to their own benefit. After using simulations to claim that questionable research practices can make practically any result significant, they published p-curve to reveal studies that were severely p-hacked. P-curve has over 1,000 citations, but at best a handful of articles have been shown to lack evidential value. Meanwhile, publication bias that cannot be detected by p-curve is present in over 80% of the articles that have been examined (Francis, 2012). So, which tool is useless?
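
For readers who are not familiar with the logic of excess-significance tests, here is a minimal sketch in Python. It is my own illustration with made-up numbers, not the code by Francis, Ioannidis and Trikalinos, or DataColada, and it uses a simple binomial approximation where the published test uses an exact calculation over the individual study powers.

```python
# Minimal illustration of the excess-significance logic (made-up numbers,
# not the actual TES or p-curve code discussed in this post).
from scipy import stats
from statsmodels.stats.power import TTestIndPower

d_pooled = 0.60                      # hypothetical pooled effect size
n_per_group = [20, 25, 30, 22, 28]   # hypothetical per-group sample sizes
observed_significant = 5             # all five studies reported p < .05

# Power of each study at alpha = .05 if the pooled effect size is the true effect
powers = [TTestIndPower().power(effect_size=d_pooled, nobs1=n, alpha=0.05)
          for n in n_per_group]
expected_significant = sum(powers)

# Rough binomial approximation of the probability of observing at least this
# many significant results; the published test uses an exact computation.
p_excess = stats.binom.sf(observed_significant - 1, len(powers),
                          expected_significant / len(powers))

print(f"expected: {expected_significant:.2f}, observed: {observed_significant}, "
      f"p(excess) = {p_excess:.3f}")
```

If the observed number of significant results is much larger than the number expected from the studies’ power, the reported set of studies is unlikely to be complete and unbiased.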

—————————————————————————-

Uri’s email to Greg Francis

On 27 Jun 2014, at 04:59 pm, Simonsohn, Uri <uws@wharton.upenn.edu> wrote:

Hi Greg,

Thanks for your email.
The policy is to contact authors whose research we discuss. I do not discuss research you conducted, so I did not contact you.

One could extend the policy to contacting everybody whose work is related to the post, but that would be impractical, I would have needed to contact Kahneman, Klein et at, Ioannidins & Trikalinos, Uli and you, and presumably the people whose work you have analyzed via EST and perhaps even the OSF people. Or perhaps extend the policy to contacting anybody who is likely to disagree with the post. Similarly impractical.

Looking at the comments you sent via email, note how you don’t need to refer to any paper you have written to make your arguments, they are based exclusively on new analyses I run on data you had never analyzed before. That indicates to me the post is separate from your past work.  

When I wrote a post about Bayesian analysis (http://datacolada.org/2014/01/13/13-posterior-hacking/) , I did not contact Bayesian statisticians like Kruschke or EJ. As in this case, I was talking about statistical tools they use, but not about analyses they have run, so our policy did not require me to contact them either. When we have written about replications we have not contacted Nosek.  
When I wrote about ceiling effects in one replication paper, I did not contact authors of other papers that may also have a ceiling effect, or other people who have talked about ceiling effects in that paper, I only contacted the authors whose work I was directly discussing.

Now, if I write a post about analyses EJ runs, or a replication that Nosek does, then of course we will contact them.
If I write a post about your use of the EST in this or that paper, then of course I will contact you.

You may disagree with the policy,  but I thought it would be fair to share the rationale with you.

Thanks again,

Uri

—–Original Message—–
From: Gregory Francis [mailto:gfrancis@purdue.edu]
Sent: Friday, June 27, 2014 8:51 AM
To: Simonsohn, Uri
Cc: <uli.schimmack@utoronto.ca> Schimmack; Leif Nelson; Simmons, Joseph
Subject: Data Colada

Hi Uri,

I saw your Data Colada posting on the P-curve vs. the excessive significance test (http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/ ). I really don’t understand the motivation for this posting, and I think you misrepresented the TES (Test for Excess Significance- Ioannidis’ term).

In particular, you conclude that the inference from the TES is pointless because we know there are 5 studies not reported. Indeed, if you know some relevant studies were not reported (since you removed them!) then you are correct that there is no reason to run the TES.  I would suggest that the more interesting test for this set of data would be to include the 5 non-significant studies (since they were actually published). Running the TES then gives 0.9699841 (I quickly modified your code to include all published studies; I am pretty sure this correct).  The details are

Pooled d: 0.598
Observed number of significant studies: 31
Expected number of significant studies: 31.08
Chi-square: 0.0014159
p: 0.96998

So, the TES would not claim that there is anything amiss with the full set of 36 reported studies.

I also object to your argument that nobody publishes “all” findings. Taken broadly enough, the statement is true, but somewhat silly and naive. What the TES considers is whether the stated theoretical claims are consistent with the reported findings. For example, in the TES analysis of all 36 studies, the theoretical claims (a fixed effect size of d=0.598) is consistent with the reported frequency of rejecting the null. On the other hand, if we take just the 31 significant experiments, then the theoretical claim (a fixed effect size of d=0.629) is not consistent with the reported frequency of rejecting the null. One need not report all studies for consistency to hold, and if there are valid methodological reasons to not publish some studies then they should not be published. I have explained this to you many times, so I get the feeling you are being deliberately obtuse on this issue, which is a shame because you are confusing people and, in the long-run, undermining your own credibility.

I also think your post is misleading in a broader context. The “about” section of Data Colada states:

“When discussing research by other authors we contact them before posting; we ask for suggestions to improve the post, and invite them to comment within the original blog post.”

Readers of your blog who believe you take the policy seriously should infer that Uli and I were shown a draft, asked for feedback, and given an opportunity to comment, which is not true. It is too late for you to follow parts one and two of your policy, but you can fix the third: allow Uli (if he wishes) and me to write a follow-up post on Data Colada that explains our views of the TES and p-curve analyses.  

Greg Francis

Professor of Psychological Sciences
Purdue University

What a jerk!

Greg

Gino-Colada – 2: The line between fraud and other QRPs

“It wasn’t fraud. It was other QRPs”

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.” (Gino, 2023)

Experimental social scientists have considered themselves superior to other social scientists because experiments provide strong evidence about causality that correlational studies cannot provide. Their experimental studies often produced surprising results, but because they were obtained with the experimental method and published in respected, peer-reviewed journals, they seemed to provide profound novel insights into human behavior.

In his popular book “Thinking, Fast and Slow,” Nobel Laureate Daniel Kahneman told readers that “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.” He probably regrets writing these words, because he no longer believes these findings (Kahneman, 2017).

What happened between 2011 and 2017? Social scientists started to distrust their own findings (or at least those of their colleagues) because it became clear that they did not use the scientific method properly. The key problem is that they only published results that provided evidence for their theories, hypotheses, and predictions, and did not report when their studies did not work. As one prominent experimental social psychologist put it:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister)

Researchers not only selectively published studies with favorable results; they also used a variety of statistical tricks to increase the chances of obtaining evidence for their claims. John et al. (2012) called these tricks questionable research practices (QRPs) and compared them to doping in sport. The difference is that doping is banned in sports, whereas the use of many QRPs is neither banned nor punished by social-science organizations.

The use of QRPs explains why scientific journals that report the results of experiments with human participants report over 90% of the time that the results confirmed researchers’ predictions. For statistical reasons, this high success rate is implausible even if all predictions were true (Sterling et al., 1995). The selective publishing of studies that worked renders the evidence meaningless (Sterling, 1959). Even clearly false hypotheses like “learning after an exam can increase exam performance” can receive empirical support when QRPs are used (Bem, 2011). The use of QRPs also explains why the results of experimental social scientists often fail to replicate (Schimmack, 2020).
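
A back-of-the-envelope calculation shows why such success rates are implausible. This is my own sketch with assumed numbers, not an analysis from the cited papers: even if every tested hypothesis were true, the expected success rate equals the studies’ statistical power, and average power in this literature is nowhere near 90%.

```python
# How likely is a 95% success rate across 100 honest tests if every
# hypothesis is true but each study has only 50% power? (Assumed numbers.)
from scipy import stats

power = 0.50      # assumed average power; actual estimates vary by field
n_tests = 100     # hypothetical number of published focal hypothesis tests
observed = 95     # roughly the success rate reported in journals

p = stats.binom.sf(observed - 1, n_tests, power)
print(f"P(at least {observed} significant results out of {n_tests}) = {p:.2e}")
```

The probability is essentially zero, which is Sterling’s point: success rates this high can only be produced by selection, not by honest reporting of all studies.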

John et al. (2012) used the term questionable research practices broadly. However, it is necessary to distinguish three types of QRPs that have different implications for the credibility of results.

One QRP is the selective publishing of significant results. In this case, the results are what they are and the data are credible. The problem is mainly that the published results are likely to be inflated by selection bias. This bias would disappear if all studies were published and the results were averaged. However, as long as non-significant results are not published, the average remains inflated.
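
A small simulation illustrates this inflation. It is my own sketch with arbitrary parameter values, not an analysis of any real data set: a modest true effect is studied with small samples, and only studies that reach p < .05 are “published.”

```python
# Simulation of effect-size inflation under selective publishing
# (arbitrary parameters; not an analysis of any real data set).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, n_sims = 0.3, 20, 5000    # assumed true effect, per-group n, runs
published = []

for _ in range(n_sims):
    control   = rng.normal(0.0,    1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    t, p = stats.ttest_ind(treatment, control)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d_obs = (treatment.mean() - control.mean()) / pooled_sd
    if p < .05:                       # only "successful" studies get published
        published.append(d_obs)

print(f"true d = {true_d}, mean published d = {np.mean(published):.2f}")
# The published average typically comes out around 0.7 or higher,
# more than double the true effect.
```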

The second type of QRP consists of various statistical tricks that can be used to “massage” the data into a more favorable result. These practices are now often called p-hacking. Presumably, they are used mainly after an initial analysis did not produce the desired result but showed a trend in the expected direction. P-hacking alters the data, and it is no longer clear how strong the actual evidence was. While lay people may consider these practices fraud or a type of doping, professional organizations tolerate them, and even evidence of their use would not lead to disciplinary actions against a researcher.
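
How much this kind of flexibility inflates false positives is easy to demonstrate. The following sketch (again my own, with made-up parameters) simulates one simple p-hacking strategy: measuring two outcomes and reporting whichever one reaches significance, when in fact there is no effect at all.

```python
# One simple p-hacking strategy under a true null effect: test two outcome
# measures and report whichever reaches p < .05. (Made-up parameters.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sims, hits = 30, 5000, 0

for _ in range(n_sims):
    group1 = rng.normal(size=(n, 2))     # two outcomes, no real group difference
    group2 = rng.normal(size=(n, 2))
    p1 = stats.ttest_ind(group1[:, 0], group2[:, 0]).pvalue
    p2 = stats.ttest_ind(group1[:, 1], group2[:, 1]).pvalue
    if min(p1, p2) < .05:                # report whichever outcome "worked"
        hits += 1

print(f"nominal alpha = .05, actual false-positive rate = {hits / n_sims:.3f}")
# With two independent outcomes this lands near 1 - .95**2, i.e. about 10%.
```

Adding more outcomes, covariates, and optional stopping pushes the rate much higher, which is why p-hacked significance tests carry little evidential value.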

The third QRP is fraud. Like p-hacking, fraud implies data manipulation with the goal of getting a desirable result, but the difference is… well, it is hard to say what the difference from p-hacking is, except that fraud is not tolerated by professional organizations. Outright fraud, in which a whole data set is made up (as with some datasets by the disgraced Diederik Stapel), is a clear case. It is harder to draw the line when a researcher selectively deletes outliers from two groups to get significance (p-hacking) or switches extreme cases from one group to another (fraud) (see Gino-Colada 1). In both cases, the data are meaningless, but only fraud leads to reputation damage and public outrage, while p-hackers can continue to present their claims as scientific truths.

The distinction between these types of QRPs is important for understanding Gino’s latest defense against the accusations of fraud that have been widely publicized in newspaper articles and in a long article in the New Yorker. In her response, she cites from Harvard’s investigative report to make the point that she is not a data fabricator.

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.”

The argument is clear: why would I have so many failed studies if I could just make up fake data that support my claims? Indeed, Stapel claims that he started faking studies outright because p-hacking is a lot of work and making up data is the most efficient QRP (“Why not just make the data up. Same results with less effort”). Gino makes it clear that she did not simply fabricate data, because she clearly collected a lot of data and has many failed studies that were not p-hacked or manipulated to get significance. She only did what everybody else did: hide the studies that did not work, and lots of them.

Whether she sometimes engaged in practices that cross the line from p-hacking to fraud is currently being investigated and is not my concern. What I find interesting is the frank admission in her defense that 80% of her studies failed to provide evidence for her hypotheses. However, anybody who looks up her published work will see mainly the results of studies that worked. And she has no problem telling us that these published results are just the tip of an iceberg of studies, many more of which did not work. She thinks this is totally fine, because she has been trained / brainwashed to believe that this is how science works. Significance testing is like a gold pan.

Get a lot of datasets, look for p < .05, keep the significant ones (gold), and throw away the rest. The more studies you run, the more gold you find, and the richer you are. Unfortunately for her and the other experimental social scientists who think every p-value below .05 is a discovery, this is not how science works, as Sterling (1959) pointed out many, many years ago, but nobody wants to listen to people who tell you that something is hard work.

Let’s assume for the moment that Gino really runs 100 studies to get 20 significant results (80% do not produce a significant result). Using a formula from Soric (1989), we can compute the risk that one of her 20 significant results is a false positive (i.e., a significant result that is a fluke without a real effect), even if she did not use p-hacking or other QRPs, which would further increase the risk of false claims.

FDR = ((1/.20) – 1)*(.05/.95) = 21%
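
For readers who want to check the arithmetic, here is the same calculation as a small Python helper (my own sketch; soric_fdr is a hypothetical name, not a function from any package):

```python
# Soric's (1989) upper bound on the false discovery rate, given the rate of
# significant results and the alpha level (soric_fdr is a hypothetical name).
def soric_fdr(discovery_rate: float, alpha: float = .05) -> float:
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# 20 significant results out of 100 studies at alpha = .05
print(f"{soric_fdr(0.20):.0%}")   # prints 21%, matching the calculation above
```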

Based on Gino’s own claim that 80% of her studies fail to produce significant results, we can infer that up to 21% of her published significant results could be false positives. Moreover, selective publishing also inflates effect sizes, and even if a result is not a false positive, the effect may be in the same direction but too small to be practically important. In other words, Gino’s empirical findings are meaningless without independent replications, even if she did not p-hack or manipulate any data. The question of whether she committed fraud is only relevant for her personal future. It has no relevance for the credibility of her published findings or those of others in her field, like Dan Air-Heady. The whole field is a train wreck. In 2012, Kahneman asked researchers in the field to clean up their act, but nobody listened, and Kahneman has lost faith in their findings. Maybe it is time to stop nudging social scientists with badges and to use some operant conditioning to shape their behavior. But until this happens, if it ever happens, we can just ignore this pseudo-science, no matter what happens in the Gino versus Harvard/DataColada case. As interesting as scandals are, they have no practical importance for the evaluation of the work that has been produced by experimental social scientists.

P.S. Of course, there are also researchers who have made real contributions, but unless we find ways to distinguish between credible work that was obtained without QRPs and incredible findings that were obtained with scientific doping, we don’t know which results we can trust. Maybe we need a doping test for scientists to find out.

The Gino-Colada Affair – 1

Link to Gino-Colada Affair – 2

Link to Gino-Colada Affair – 3

There is no doubt that social psychology and its applied fields, like behavioral economics and consumer psychology, have a credibility problem. Many findings cannot be replicated because they were obtained with questionable research practices (QRPs) or p-hacking. QRPs are statistical tricks that help researchers obtain p-values below the threshold needed to claim a discovery (p < .05). To be clear, although lay people and undergraduate students consider these practices deceptive, fraudulent, and unscientific, they are not considered fraudulent by researchers, professional organizations, funding agencies, or universities. Demonstrating that a researcher used QRPs to obtain significant results is easy-peasy and undermines the credibility of their work, but they can keep their jobs because it is not (yet) illegal to use these practices.

The Gino-Harvard scandal is different because the DataColada team claimed that they found “four studies for which we had accumulated the strongest evidence of fraud” and that they “believe that many more Gino-authored papers contain fake data.” To lay people, it can be hard to understand the difference between allowed QRPs and forbidden fraud or data manipulation. An example of a QRP would be selectively removing extreme values so that the difference between two groups becomes larger (e.g., removing extremely low depression scores from a control group to show a bigger treatment effect). Outright data manipulation would be switching participants with low scores from the control group to the treatment group and vice versa.

DataColada used features of the Excel spreadsheet that contained the data to claim that the data were manually manipulated.

The focus is on six rows that have a strong influence on the results for all three dependent variables reported in the article: whether participants cheated, overreporting of performance, and deductions.

Based on the datasheet, participants in the sign-at-the-top condition (1) in rows 67, 68, and 69 did not cheat, therefore also did not overreport performance, and had very low deductions, an independent measure of cheating. In contrast, participants in rows 70, 71, and 72 all cheated, had moderate amounts of overreporting, and had very high deductions.

Yada, yada, yada: yesterday Gino posted a blog post responding to these accusations. Personally, the most interesting rebuttal was the claim that there was no need to switch rows because the study results hold even without the flagged rows.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

This argument makes sense to me because fraud appears to be the last resort for researchers who are eager to present statistically significant results. After all, nobody claims that there was no data collection at all, as in some cases by Diederik Stapel, who committed blatant fraud around the time the article in question was published, when the use of questionable research practices was rampant. When researchers conduct an actual study, they probably hope to get the desired result without QRPs or fraud. As significance requires luck, they may just hope to get lucky. When this does not work, they can use a few QRPs. When this does not work either, they can shelf the study and try again. All of this would be perfectly legal by current standards of research ethics. However, if the results are close and it is not easy to collect more data (to hope for better results), it may be tempting to change a few condition labels to reach p < .05. And the accusation here (there are other studies) is that only six (or a couple more) rows were switched to get significance. However, Gino claims that the results were already significant, and I agree that it makes no sense for somebody to tamper with data if the p-value is already below .05.

However, Gino did not present evidence that the results hold without the contested cases. So, I downloaded the data and took a look.

First, I was able to reproduce the published result of an ANOVA with the three conditions as a categorical predictor variable and deductions as the outcome variable.

In addition, the original article reported that the differences between the experimental condition (“signature-on-top”) and each of the two control conditions (“signature-on-bottom”, “no signature”) were significant. I also confirmed these results.

Next, I repeated the analysis without rows 67 to 72. Without the six contested cases, the results are no longer statistically significant, F(2, 92) = 2.96, p = .057.

Interestingly, the pairwise comparisons of the experimental group with each of the two control groups were still statistically significant.

Combining the two control groups, comparing them to the experimental group, and presenting the result as a planned contrast would also have produced a significant result.

However, these results do not support Gino’s implication that the same analysis that was reported in the article would have produced a statistically significant result, p < .05, without the six contested cases. Moreover, the accusation is that she switched rows with low values into the experimental condition and rows with high values into the control condition. To simulate this scenario, I recoded the contested rows 67-69 as signature-at-the-bottom and rows 70-72 as signature-at-the-top and repeated the analysis. In this case, there was no evidence that the group means differed from each other, F(2, 98) = 0.45, p = .637.
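
For readers who want to repeat this kind of check, the sketch below outlines the analysis in Python. The file name, column name, condition labels, and index positions are hypothetical placeholders, not the labels used in the posted spreadsheet, and the contested spreadsheet rows 67-72 have to be mapped onto the zero-based index of the loaded data.

```python
# Sketch of the reanalysis described above (hypothetical file, column, and
# condition names; index positions are illustrative, not verified mappings).
import pandas as pd
from scipy import stats

df = pd.read_excel("study1_data.xlsx")        # hypothetical file name

def deduction_anova(data):
    """One-way ANOVA of deductions across the three signing conditions."""
    groups = [g["deduction"] for _, g in data.groupby("condition")]
    return stats.f_oneway(*groups)

contested = [66, 67, 68, 69, 70, 71]          # illustrative index positions

# 1. Full data set: should reproduce the published ANOVA result
print(deduction_anova(df))

# 2. Without the six contested rows: reported above as F(2, 92) = 2.96, p = .057
print(deduction_anova(df.drop(index=contested)))

# 3. Undo the alleged switch by reassigning the contested rows' condition labels
df_swapped = df.copy()
df_swapped.loc[contested[:3], "condition"] = "signature_bottom"
df_swapped.loc[contested[3:], "condition"] = "signature_top"
print(deduction_anova(df_swapped))
```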

Conclusion

Experimental social psychology has a credibility crisis because researchers were (and still are) allowed to use many statistical tricks to get significant results or to hide studies that did not produce the desired results. The Gino scandal is only remarkable because outright manipulation of data is the only ethics violation that has personal consequences for researchers when it can be proven. Lack of evidence that fraud was committed, or even the absence of fraud, does not imply that results are credible. For example, the results in Study 2 are meaningless even without fraud because the null hypothesis was rejected with a confidence interval that included values close to zero as plausible. While the article claims to show evidence of mediation, the published data alone show that there is no empirical evidence for this claim, even if p < .05 was obtained without p-hacking or fraud. Misleading claims based on weak data, however, do not violate any ethics guidelines and are a common, if not essential, part of the game called social psychology.

This blog post only examined one minor question: Gino’s claim that she did not have to manipulate data because the results were already significant.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

My results suggest that this claim lacks empirical support. A key result was only significant with the contested rows of data included. Of course, this finding does not warrant the conclusion that the data were tampered with to get statistical significance. We will have to wait for the answer to this 25-million-dollar question.