Category Archives: Datacolada

Open Science Needs Open Scientists

November 5, 2025 | Datacolada, Simonsohn | DataColada, Open Science, P-Curve, Publication Bias | Ulrich Schimmack

Science is like an iceberg. The published record is only a fraction of what university-paid academics do. Some time ago, Brian Nosek dreamed about a scientific utopia of open science that would make the workings of academia more transparent, but all we got was preprints and some badges, which are apparently being rolled back. He was never interested in an open discussion about the IAT or the ethics of Project Implicit.

Other famous open science academics are also less open than you may think. Datacolada benefited from open data to find evidence of fraud. As no law requires researchers to share data, the fraudsters must be kicking themselves for being foolish enough to share theirs and losing millions and their reputation.

Meanwhile, Datacolada is not as open as one might think. Most notably, their blog does not even have an open comment section that allows sharing of alternative viewpoints or – oh my god – criticism of the claims made by the Datacolada team. This is worse than an old-school peer-reviewed journal that would occasionally publish some critical comments to maintain the image of being a science that searches for the truth. Not so DataColada: here the opinions of three academics are presented as facts.

In comparison, I have fully embraced open science. My blog has comment sections, and I have even corrected errors that people have pointed out in my blog posts. Living in utopia, I have also shared emails that the authors wanted to hide (like my exchange with Uri about the poor performance of p-curve when data are heterogeneous). After our discussion, Uri just posted a blog post claiming that p-curve handles heterogeneity just fine, based on simulation studies that assume very low levels of heterogeneity, and claiming that more heterogeneity does not exist in real data. A simple examination of actual heterogeneity in real datasets shows this to be false, but I was not able to correct the false claim on the blog: No Comments Allowed!

For the sake of my radically open approach to science, I am sharing an email by a colleague, who also expressed frustration about Uri Simonsohn’s use of the blog to present his side of the story without giving people criticized by him a chance to openly respond to his criticism.

I recently found this 2015 email from Greg Francis during an unrelated search of my inbox, and I think it is interesting enough, if only for historians who want to write about the replication crisis one day. The email was written in response to Datacolada’s claim that we do not need statistical tests to reveal publication bias.

[24] P-curve vs. Excessive Significance Test – Data Colada

Uri writes with typical intellectual humility “In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significant Test, with the (critically important) inferences one arrives at with p-curve.”

It may seem strange that a data sleuth and p-hacking detector argues against bias tests, but that is how academia works. My bias tests are good, all other bias tests are bad. Now let me write a blog post that makes other tests look bad to sell my own work. Not science, but academia.

In short, the DataColada team used the replication crisis to their own benefit. After using simulations to claim that questionable research practices can make practically every result significant, they published p-curve to reveal studies that were severely p-hacked. P-curve has over 1,000 citations but at best a handful of articles that show lack of evidential value. Meanwhile, publication bias that cannot be detected by p-curve is present in over 80% of articles that are examined (Francis, 2012). So, which tool is useless?

—————————————————————————-

Uri’s email to Greg Francis

On 27 Jun 2014, at 04:59 pm, Simonsohn, Uri <uws@wharton.upenn.edu> wrote:

Hi Greg,

Thanks for your email.
The policy is to contact authors whose research we discuss. I do not discuss research you conducted, so I did not contact you.

One could extend the policy to contacting everybody whose work is related to the post, but that would be impractical, I would have needed to contact Kahneman, Klein et at, Ioannidins & Trikalinos, Uli and you, and presumably the people whose work you have analyzed via EST and perhaps even the OSF people. Or perhaps extend the policy to contacting anybody who is likely to disagree with the post. Similarly impractical.

Looking at the comments you sent via email, note how you don’t need to refer to any paper you have written to make your arguments, they are based exclusively on new analyses I run on data you had never analyzed before. That indicates to me the post is separate from your past work.

When I wrote a post about Bayesian analysis (http://datacolada.org/2014/01/13/13-posterior-hacking/) , I did not contact Bayesian statisticians like Kruschke or EJ. As in this case, I was talking about statistical tools they use, but not about analyses they have run, so our policy did not require me to contact them either. When we have written about replications we have not contacted Nosek.
When I wrote about ceiling effects in one replication paper, I did not contact authors of other papers that may also have a ceiling effect, or other people who have talkeda bout ceiling effects in that paper, I only contacted the authors whose work I was directly discussing.

Now, if I write a post about analyses EJ runs, or a replication that Nosek does, then of course we will contact them.
If I write a post about your use of the EST in this or that paper, then of course I will contact you.

You may disagree with the policy, but I thought it would be fair to share the rationale with you.

Thanks again,

Uri

—–Original Message—–
From: Gregory Francis [mailto:gfrancis@purdue.edu]
Sent: Friday, June 27, 2014 8:51 AM
To: Simonsohn, Uri
Cc: <uli.schimmack@utoronto.ca> Schimmack; Leif Nelson; Simmons, Joseph
Subject: Data Colada

Hi Uri,

I saw your Data Colada posting on the P-curve vs. the excessive significance test (http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/ ). I really don’t understand the motivation for this posting, and I think you misrepresented the TES (Test for Excess Significance- Ioannidis’ term).

In particular, you conclude that the inference from the TES is pointless because we know there are 5 studies not reported. Indeed, if you know some relevant studies were not reported (since you removed them!) then you are correct that there is no reason to run the TES. I would suggest that the more interesting test for this set of data would be to include the 5 non-significant studies (since they were actually published). Running the TES then gives 0.9699841 (I quickly modified your code to include all published studies; I am pretty sure this correct). The details are

Pooled d: 0.598
Observed number of significant studies: 31
Expected number of significant studies: 31.08
Chi-square: 0.0014159
p: 0.96998

So, the TES would not claim that there is anything amiss with the full set of 36 reported studies.

I also object to your argument that nobody publishes “all” findings. Taken broadly enough, the statement is true, but somewhat silly and naive. What the TES considers is whether the stated theoretical claims are consistent with the reported findings. For example, in the TES analysis of all 36 studies, the theoretical claims (a fixed effect size of d=0.598) is consistent with the reported frequency of rejecting the null. On the other hand, if we take just the 31 significant experiments, then the theoretical claim (a fixed effect size of d=0.629) is not consistent with the reported frequency of rejecting the null. One need not report all studies for consistency to hold, and if there are valid methodological reasons to not publish some studies then they should not be published. I have explained this to you many times, so I get the feeling you are being deliberately obtuse on this issue, which is a shame because you are confusing people and, in the long-run, undermining your own credibility.

I also think your post is misleading in a broader context. The “about” section of Data Colada states::

“When discussing research by other authors we contact them before posting; we ask for suggestions to improve the post, and invite them to comment within the original blog post.”

Readers of your blog who believe you take the policy seriously should infer that Uli and I were shown a draft, asked for feedback, and given an opportunity to comment, which is not true. It is too late for you to follow parts one and two of your policy, but you can fix the third: allow Uli (if he wishes) and me to write a follow-up post on Data Colada that explains our views of the TES and p-curve analyses.

Greg Francis

Professor of Psychological Sciences
Purdue University

What a jerk!

Greg
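
The TES numbers in Greg's email above can be roughly checked by hand. The sketch below is my own reconstruction, not the code that Greg or Uri used: it assumes the test is a one-degree-of-freedom chi-square that compares the observed and expected counts of significant and non-significant studies. The function names are made up for illustration, and the counts are the rounded values from the email, so the recomputed chi-square only approximates the reported one.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def tes_chi_square(observed: float, expected: float, n_studies: int) -> float:
    """Chi-square over significant and non-significant counts.

    Assumed form of the TES statistic; the email does not spell it out.
    """
    diff_sq = (observed - expected) ** 2
    return diff_sq / expected + diff_sq / (n_studies - expected)


# Chi-square value quoted in the email.
reported_chi2 = 0.0014159
p_from_reported = chi2_sf_df1(reported_chi2)

# Recomputed from the rounded counts in the email:
# 31 observed, 31.08 expected, 36 studies in total.
approx_chi2 = tes_chi_square(31, 31.08, 36)
p_from_counts = chi2_sf_df1(approx_chi2)

print(f"p from reported chi-square: {p_from_reported:.5f}")
print(f"chi-square from rounded counts: {approx_chi2:.5f}")
print(f"p from rounded counts: {p_from_counts:.3f}")
```

The first p-value should agree with the p of about .970 in the email. The small gap between the recomputed and reported chi-square is what one would expect from rounding the expected count to 31.08.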

Gino-Colada – 2: The line between fraud and other QRPs

October 9, 2023 | Datacolada, Gino, Harvard | DataColada, File-Drawer, Fraud, Gino, Harvard, P-Hacking, QRP | Ulrich Schimmack

“It wasn’t fraud. It was other QRPs”

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.” (Gino, 2023)

Experimental social scientists have considered themselves superior to other social scientists because experiments provide strong evidence about causality that correlational studies cannot provide. Their experimental studies often produced surprising results, but because they were obtained using the experimental method and published in respected, peer-reviewed journals, they seemed to provide profound novel insights into human behavior.

In his popular book “Thinking, Fast and Slow,” Nobel Laureate Daniel Kahneman told readers “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.” He probably regrets writing these words, because he no longer believes these findings (Kahneman, 2017).

What happened between 2011 and 2017? Social scientists started to distrust their own findings (or at least those of their colleagues) because it became clear that they had not used the scientific method properly. The key problem is that they only published results when they provided evidence for their theories, hypotheses, and predictions, but did not report studies that did not work. As one prominent experimental social psychologist put it:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister)

Researchers not only selectively published studies with favorable results. They also used a variety of statistical tricks to increase the chances of obtaining evidence for their claims. John et al. (2012) called these tricks questionable research practices (QRPs) and compared them to doping in sport. The difference is that doping is banned in sports, but the use of many QRPs is not banned or punished by social scientific organizations.

The use of QRPs explains why scientific journals that report the results of experiments with human participants report over 90% of the time that the results confirmed researchers’ predictions. For statistical reasons, this high success rate is implausible even if all predictions were true (Sterling et al., 1995). The selective publishing of studies that worked renders the evidence meaningless (Sterling, 1959). Even clearly false hypotheses like “learning after an exam can increase exam performance” can receive empirical support when QRPs are being used (Bem, 2011). The use of QRPs also explains why results of experimental social scientists often fail to replicate (Schimmack, 2020).

John et al. (2012) used the term questionable research practices broadly. However, it is necessary to distinguish three types of QRPs that have different implications for the credibility of results.

The first type of QRP is selective publishing of significant results. In this case, the results are what they are and the data are credible. The problem is mainly that these results are likely to be inflated by sampling bias. This bias would disappear if all studies were published and the results averaged. However, if non-significant results are not published, the average remains inflated.

The second type of QRP consists of various statistical tricks that can be used to “massage” the data to produce a more favorable result. These practices are now often called p-hacking. Presumably, these practices are used mainly after an initial analysis did not produce the desired result but perhaps showed a trend in the expected direction. P-hacking alters the data, and it is no longer clear how strong the actual evidence was. While lay people may consider these practices fraud or a type of doping, professional organizations tolerate them, and even evidence of their use would not lead to disciplinary action against a researcher.

The third type of QRP is fraud. Like p-hacking, fraud implies data manipulation with the goal of getting a desirable result, but the difference is …. well, it is hard to say what the difference to p-hacking is except that it is not tolerated by professional organizations. Outright fabrication of a whole data set (as with some datasets by the disgraced Diederik Stapel) is a clear case of fraud. However, it is harder to distinguish between fraud and p-hacking when one researcher selectively deletes outliers from two groups to get significance (p-hacking) and another switches extreme cases from one group to another (fraud) (GinoColada1). In both cases, the data are meaningless, but only fraud leads to reputation damage and public outrage, while p-hackers can continue to present their claims as scientific truths.
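
To illustrate the first type of QRP, here is a small simulation of how selective publishing alone inflates the published average, even when every data point is honest. This is only a sketch under assumptions I made up for the illustration: each study yields a z-statistic drawn from a normal distribution around a true value of 1.0, and only studies with z > 1.96 get published. None of these numbers come from real data.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

TRUE_Z = 1.0       # hypothetical true effect on the z-scale (assumption)
CRITICAL_Z = 1.96  # significance criterion for p < .05, two-sided
N_STUDIES = 20000  # number of simulated studies (assumption)

# Every simulated study is honest: no p-hacking, no fraud, only sampling error.
z_values = [random.gauss(TRUE_Z, 1.0) for _ in range(N_STUDIES)]

# Selective publishing: only significant results leave the file drawer.
published = [z for z in z_values if z > CRITICAL_Z]

mean_all = statistics.mean(z_values)
mean_published = statistics.mean(published)
share_significant = len(published) / len(z_values)

print(f"share significant: {share_significant:.2f}")
print(f"mean z, all studies: {mean_all:.2f}")
print(f"mean z, published only: {mean_published:.2f}")
```

Averaging all studies recovers the true value, while averaging only the published ones gives a much larger estimate. The data in each published study are still credible; it is the selection that makes the average misleading.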

The distinction between different types of QRPs is important to understand Gino’s latest defense against accusations that she committed fraud, which have been widely publicized in newspaper articles and a long article in the New Yorker. In her response, she quotes from Harvard’s investigative report to make the point that she is not a data fabricator.

[collaborator] “Francesca and I have done so many studies, a lot of them as part of the CLER lab, the behavioral lab at Harvard. And I’d say 80% of them never worked out.”

The argument is clear: why would I have so many failed studies if I could just make up fake data that support my claims? Indeed, Stapel claims that he started faking studies outright because it was clear that p-hacking is a lot of work and making up data is the most efficient QRP (“Why not just make the data up. Same results with less effort”). Gino makes it clear that she did not just fabricate data because she clearly collected a lot of data and has many failed studies that were not p-hacked or manipulated to get significance. She only did what everybody else did: hiding the studies that did not work, and lots of them.

Whether she sometimes did engage in practices that cross the line from p-hacking to fraud is currently being investigated and not my concern. What I find interesting is the frank admission in her defense that 80% of her studies failed to provide evidence for her hypotheses. However, if somebody looked up her published work, they would see mainly the results of studies that worked. And she has no problem telling us that these published results are just the tip of an iceberg of studies, where many more did not work. She thinks this is totally ok, because she has been trained / brainwashed to believe that this is how science works. Significance testing is like a gold pan.

Get a lot of datasets, look for p < .05, keep the significant ones (gold) and throw away the rest. The more studies you run, the more gold you find, and the richer you are. Unfortunately for her and the other experimental social scientists who think every p-value below .05 is a discovery, this is not how science works, as Sterling (1959) pointed out many, many years ago, but nobody wants to listen to people who tell them that something is hard work.

Let’s for the moment assume that Gino really runs 100 studies to get 20 significant results (80% do not work, p < .10). Using a formula from Soric (1989), we can compute the maximum risk that any one of her 20 significant results is a false positive result (i.e., the significant result is a fluke without a real effect), even if she did not use p-hacking or other QRPs, which would further increase the risk of false claims.

FDR = ((1/.20) – 1)*(.05/.95) = 21%
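
For readers who want to plug in other numbers, the calculation above can be written as a few lines of code. This is just a sketch of the same arithmetic; the function name `soric_max_fdr` is mine, and the extra discovery rates are hypothetical values chosen to show how the bound behaves.

```python
def soric_max_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Maximum false discovery rate given the share of significant results.

    Same arithmetic as the formula above: ((1 / DR) - 1) * (alpha / (1 - alpha)).
    """
    return (1.0 / discovery_rate - 1.0) * (alpha / (1.0 - alpha))


# Gino's case: 20 significant results out of 100 studies.
max_fdr = soric_max_fdr(0.20)
print(f"discovery rate 20%: maximum FDR = {max_fdr:.1%}")

# Hypothetical discovery rates for comparison (not taken from any dataset).
for rate in (0.05, 0.10, 0.50, 0.90):
    print(f"discovery rate {rate:.0%}: maximum FDR = {soric_max_fdr(rate):.1%}")
```

The bound rises quickly as the share of significant results falls. At a discovery rate of 5% with alpha = .05, it reaches 100%, meaning that all significant results could be flukes.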

Based on Gino’s own claim that 80% of her studies fail to produce significant results, we can infer that up to 21% of her published significant results could be false positive results. Moreover, selective publishing also inflates effect sizes, and even if a result is not a false positive, the effect may be in the same direction but too small to be practically important. In other words, Gino’s empirical findings are meaningless without independent replications, even if she didn’t use p-hacking or manipulate any data. The question whether she committed fraud is only relevant for her personal future. It has no relevance for the credibility of her published findings or those of others in her field like Dan Air-Heady. The whole field is a train wreck. In 2012, Kahneman asked researchers in the field to clean up their act, but nobody listened and Kahneman has lost faith in their findings. Maybe it is time to stop nudging social scientists with badges and use some operant conditioning to shape their behavior. But until this happens, if it ever happens, we can just ignore this pseudo-science, no matter what happens in the Gino versus Harvard/DataColada case. As interesting as scandals are, this one has no practical importance for the evaluation of the work that has been produced by experimental social scientists.

P.S. Of course, there are also researchers who have made real contributions, but unless we find ways to distinguish between credible work that was obtained without QRPs and incredible findings that were obtained with scientific doping, we don’t know which results we can trust. Maybe we need a doping test for scientists to find out.

The Gino-Colada Affair – 1

September 30, 2023 | Datacolada, Gino, Harvard | DataColada, File-Drawer, Fraud, Gino, Harvard, P-Hacking | Ulrich Schimmack

Link to Gino Colada Affair – 2

Link to Gino-Colada Affair – 3

There is no doubt that social psychology and its applied fields like behavioral economics and consumer psychology have a credibility problem. Many of the findings cannot be replicated because they were obtained with questionable research practices or p-hacking. QRPs are statistical tricks that help researchers to obtain p-values below the necessary threshold to claim a discovery (p < .05). To be clear, although lay people and undergraduate students consider these practices to be deceptive, fraudulent, and unscientific, they are not considered fraudulent by researchers, professional organizations, funding agencies, or universities. Demonstrating that a researcher used QRPs to obtain significant results is easy-peasy and undermines the credibility of their work, but they can keep their jobs because it is not (yet) illegal to use these practices.

The Gino-Harvard scandal is different because the DataColada team claimed that they found “four studies for which we had accumulated the strongest evidence of fraud” and that they “believe that many more Gino-authored papers contain fake data.” To lay people, it can be hard to understand the difference between allowed QRPs and forbidden fraud or data manipulation. An example of a QRP would be selectively removing extreme values so that the difference between two groups becomes larger (e.g., removing extremely low depression scores from a control group to show a bigger treatment effect). Outright data manipulation would be switching participants with low scores from the control group to the treatment group and vice versa.

DataColada used features of the Excel spreadsheet that contained the data to claim that the data were manually manipulated.

The focus is on six rows that have a strong influence on the results for all three dependent variables that were reported in the article, namely cheated or not, overreporting of performance, and deductions.

Based on the datasheet, participants in the sign-at-the-top condition (1) in rows 67, 68, and 69 did not cheat, therefore did not overreport performance, and had very low deductions, an independent measure of cheating. In contrast, participants in rows 70, 71, and 72 all cheated, had moderate amounts of overreporting, and very high deductions.

Yadi, yadi, yada, yesterday Gino posted a blog post that responded to these accusations. Personally, the most interesting rebuttal was the claim that there was no need to switch rows because the study results hold even without the flagged rows.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

This argument makes sense to me because fraud appears to be the last resort for researchers who are eager to present a statistically significant result. After all, nobody claims that there was no data collection, as in some cases by Diederik Stapel, who committed blatant fraud around the time the article in question was published and the use of questionable research practices was rampant. When researchers conduct an actual study, they probably hope to get the desired result without QRPs or fraud. As significance requires luck, they may just hope to get lucky. When this does not work, they can use a few QRPs. When this does not work, they can just shelve the study and try again. All of this would be perfectly legal by current standards of research ethics. However, if the results are close and it is not easy to collect more data (to hope for better results), it may be tempting to change a few labels of conditions to reach p < .05. And the accusation here (there are other studies) is that only 6 (or a couple more) rows were switched to get significance. However, Gino claims that the results were already significant, and I agree that it makes no sense for somebody to tamper with data if the p-value is already below .05.

However, Gino did not present evidence that the results hold without the contested cases. So, I downloaded the data and took a look.

First, I was able to reproduce the published result of an ANOVA with the three conditions as categorical predictor variable and deductions as outcome variable.

In addition, the original article reported that the differences between the experimental “signature-on-top” and each of the two control conditions (“signature-on-bottom”, “no signature”) were significant. I also confirmed these results.

Now I repeated the analysis without rows 67 to 72. Without the six contested cases, the results are no longer statistically significant, F(2, 92) = 2.96, p = .057.

Interestingly, the comparisons of the experimental group with the two control groups were statistically significant.

Combining the two control groups and comparing it to the experimental group and presenting the results as a planned contrast would also have produced a significant result.

However, these results do not support Gino’s implication that the same analysis that was reported in the article would have produced a statistically significant result, p < .05, without the six contested cases. Moreover, the accusation is that she switched rows with low values to the experimental condition and rows with high values to the control condition. To simulate this scenario, I recoded the contested rows 67-69 as signature-at-the-bottom and 70-72 as signature-at-the-top and repeated the analysis. In this case, there was no evidence that the group means differed from each other, F(2,98) = 0.45, p = .637.
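
The p-values for the two reanalyses can be double-checked from the reported F-values alone. The sketch below relies on a property of the F-distribution: when the numerator has 2 degrees of freedom, the upper-tail probability has a simple closed form, so no statistics library is needed. The function name is mine, and because the F-values are rounded to two decimals, the results can differ slightly in the third decimal from what the software reported.

```python
def f_sf_two_numerator_df(f_value: float, df_error: float) -> float:
    """Upper-tail p-value of an F statistic with 2 numerator degrees of freedom.

    For df1 = 2 the survival function reduces to
    (1 + 2 * F / df2) ** (-df2 / 2).
    """
    return (1.0 + 2.0 * f_value / df_error) ** (-df_error / 2.0)


# Analysis without the six contested rows: F(2, 92) = 2.96
p_without_rows = f_sf_two_numerator_df(2.96, 92)

# Analysis with the contested rows recoded: F(2, 98) = 0.45
p_recoded = f_sf_two_numerator_df(0.45, 98)

print(f"F(2, 92) = 2.96 -> p = {p_without_rows:.3f}")
print(f"F(2, 98) = 0.45 -> p = {p_recoded:.3f}")
```

The drop in error degrees of freedom from 98 to 92 also matches the removal of exactly six rows, so the reported statistics are internally consistent.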

Conclusion

Experimental social psychology has a credibility crisis because researchers were (and still are) allowed to use many statistical tricks to get significant results or to hide studies that didn’t produce the desired results. The Gino scandal is only remarkable because outright manipulation of data is the only ethics violation that has personal consequences for researchers when it can be proven. Neither a lack of evidence that fraud was committed nor an actual absence of fraud implies that results are credible. For example, the results in Study 2 are meaningless even without fraud because the null-hypothesis was rejected with a confidence interval that had a value close to zero as a plausible value. While the article claims to show evidence of mediation, the published data alone show that there is no empirical evidence for this claim even if p < .05 was obtained without p-hacking or fraud. Misleading claims based on weak data, however, do not violate any ethics guidelines and are a common, if not essential, part of a game called social psychology.

This blog post only examined one minor question. Gino claimed that she did not have to manipulate data because the results were already significant.

My results suggest that this claim lacks empirical support. A key result was only significant with the rows of data that have been contested. Of course, this finding does not warrant the conclusion that the data were tampered with to get statistical significance. We have to wait to get the answer to this 25 million dollar question.

When DataColada kissed Fiske’s ass to publish in Annual Review of Psychology

December 20, 2019 | Datacolada, File-Drawer, P-Hacking, Questionable Research Practices | Ulrich Schimmack

One of the worst articles about the decade of replication failures is the “Psychology’s Renaissance” article by the datacolada team (Leif Nelson, Joseph Simmons, & Uri Simonsohn).

This is not your typical Annual Review article that aims to give an overview of developments in the field. It is an opinion piece filled with bold claims that lack empirical evidence.

The worst claim is that p-hacking is so powerful that pretty much every study can be made to work.

“Experiments that work are sent to a journal, whereas experiments that fail are sent to the file drawer (Rosenthal 1979). We believe that this “file-drawer explanation” is incorrect. Most failed studies are not missing. They are published in our journals, masquerading as successes.”

We can all see that not publishing failed studies is a bit problematic. Even Bem’s famous manual for p-hackers warned that it is unethical to hide contradictory evidence. “The integrity of the scientific enterprise requires the reporting of disconfirming results” (Bem). Thus, the idea that researchers are sitting on a pile of failed studies that they failed to disclose makes psychologists look bad and we can’t have that in Fiske’s Annual Review of Psychology journal. Thus, psychologists must have been doing something that is not dishonest and can be sold as normal science.

“P-hacking is the only honest and practical way to consistently get underpowered studies to be statistically significant. Researchers did not learn from experience to increase their sample sizes precisely because their underpowered studies were not failing.” (p. 515).

This is utter nonsense. First, researchers have file-drawers of studies that did not work. Just ask them and they may tell you that they do.

Leading social psychologists Gilbert and Wilson provide an even more detailed account of research practices that produce many non-significant results that are not reported (a.k.a. a file drawer). Their account has been preserved thanks to Greg Francis.

First, it’s important to be clear about what “publication bias” means. It doesn’t mean that anyone did anything wrong, improper, misleading, unethical, inappropriate, or illegal. Rather it refers to the well known fact that scientists in every field publish studies whose results tell them something interesting about the world, and don’t publish studies whose results tell them nothing. Francis uses sophisticated statistical tools to discover what everyone already knew—and what he could easily have discovered simply by asking us. Yes, of course we ran some studies on “consuming experience” that failed to show interesting effects and are not reported in our JESP paper. Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know. Failed studies are often (though not always) inconclusive, which is why they are often (but not always) unpublishable. So yes, we had to mess around for a while to establish a paradigm that was sensitive and powerful enough to observe the effects that we had hypothesized. In one study we might have used foods that didn’t differ sufficiently in quality, in another we might have made the metronome tick too fast for people to chew along. Exactly how good a potato chip should be and exactly how fast a person can chew it are the kinds of mundane things that scientists have to figure out in preliminary testing, and they are the kinds of mundane things that scientists do not normally report in journals (but that they informally share with other scientists who work on similar phenomenon). 
Looking back at our old data files, it appears that in some cases we went hunting for potentially interesting mediators of our effect (i.e., variables that might make it larger or smaller) and although we replicated the effect, we didn’t succeed in making it larger or smaller. We don’t know why, which is why we don’t describe these blind alleys in our paper. All of this is the hum-drum ordinary stuff of day-to-day science.

Aside from this anecdotal evidence, the datacolada crew actually had access to empirical evidence in an article that they cite, but maybe never read. An important article in the 2010s reported a survey of research practices (John, Loewenstein, & Prelec, 2012). The survey asked about several questionable research practices, including not reporting entire studies that failed to support the main hypothesis.

Not reporting studies that “did not work” was the third most frequently used QRP. Unfortunately, this result contradicts datacolada’s claim that there are no studies in file-drawers, so they ignore this inconvenient empirical fact to tell their fairy tale of honest p-hackers who didn’t know better until 2011, when they published their famous “False Positive Psychology” article.

This is a cute story that isn’t supported by evidence, but that has never stopped psychologists from writing articles that advance their own career. The beauty of review articles is that you don’t even have to p-hack data. You just pick and choose citations or make claims without evidence. As long as the editor (Fiske) likes what you have to say, it will be published. Welcome to psychology’s renaissance; same bullshit as always.

Z-curve vs. P-curve: Breakdown of an attempt to resolve disagreement in private.

March 20, 2018 | Datacolada, Heterogeneity, P-Curve, Simonsohn, Uri Simonsohn, Z-Curve, Zcurve | Ulrich Schimmack

Background: In a tweet that I can no longer find because Uri Simonsohn blocked me from his Twitter account, Uri suggested that it would be good if scientists could discuss controversial issues in private before they start fighting on social media. I was just about to submit a manuscript that showed some problems with his p-curve approach to power estimation and demonstrated that z-curve works better in some situations, namely when studies vary substantially in statistical power. So, I thought I would give it a try and sent him the manuscript so that we could try to find agreement in a private email exchange.

The outcome of this attempt was that we could not reach agreement on this topic. At best, Uri admitted that p-curve is biased when some extreme test statistics (e.g., F(1,198) = 40, or t(48) = 5.00) are included in the dataset. He likes to call these values outliers. I consider them part of the data that influence the variability and distribution of test statistics.

For the most part Uri disagreed with my conclusions and considers the simulation results that show evidence for my claims unrealistic. Meanwhile, Uri published a blog post with his simulations that have small heterogeneity to claim that p-curve works even better than z-curve when there is heterogeneity.

The reason for the discrepancy between his results and my results is that we make different assumptions about what counts as realistic variability in the strength of evidence against the null-hypothesis, as reflected in absolute z-scores (p-values are transformed into z-scores by means of -qnorm(p.2t/2), where p.2t is the two-tailed p-value of a t-test or F-test).
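The transformation is easy to try out. The snippet below is a minimal sketch of it; the function name p_to_abs_z is made up for this illustration and is not part of the p-curve or z-curve code.

```r
# Convert two-tailed p-values into absolute z-scores.
# p_to_abs_z is a hypothetical helper name used only for this illustration.
p_to_abs_z <- function(p_two_tailed) {
  # Half of the two-tailed p-value lies in each tail of the standard normal.
  -qnorm(p_two_tailed / 2)
}

p <- c(.05, .005, .001)
z <- p_to_abs_z(p)
print(round(z, 2))  # 1.96 2.81 3.29
```

The smaller the p-value, the larger the absolute z-score, so the spread of these z-scores can be used to describe how much the strength of evidence varies across studies.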

To give everybody an opportunity to examine the arguments that were exchanged during our discussion of p-curve versus z-curve, I am sharing the email exchange. I hope that more statisticians will examine the properties of p-curve and z-curve and add to the discussion. To facilitate this, I will make the R code to run simulation studies of p-curve and z-curve available in a separate blog post.

P.S. P-curve is available as an online app that provides power estimates without any documentation of how p-curve behaves in simulation studies and without warnings that datasets with large test statistics can produce inflated estimates of average power.

My email correspondence with Uri Simonsohn – RE: p-curve and heterogeneity

From:    URI
To:          ULI
Date:     11/24/2017

Hi Uli,

I think email is better at this point.

Ok I am behind a ton of stuff and have a short workday today so cannot look in detail at your z-curve paper right now.

I did a quick search for “osf”, “http” and “code” and could not find the R Code , that may facilitate things if you can share it. Mostly, I would like the code that shows p-curve is biased, especially looking at how the population parameter being estimated is being defined.

I then did a search for “p-curve” and found this

Quick reactions:

1) For power estimation p-curve does not assume homogeneity of effect size, indeed, if anything it assumes homogeneity of power and allows each study to have a different effect size, but it is not really assuming a single power, it is asking what single power best fits the data, which is a different thing. It is computing an average. All average computations ask “what single value best fits the data” but that’s not the same as saying “I think all values are identical, and identical to the average”

2) We do report a few tests of the impact of heterogeneity on p-curve, maybe you have something else in mind. But here they go just in case:

Figure 2C in our POPS paper, has d~N(x,sd=.2)

[Clarification: This Figure shows estimation of effect sizes. It does not show estimation of power.]

Supplement 2

[Again. It does not show simulations for power estimation.]

A key thing to keep in mind is the population parameter of interest. P-curve does not estimate the population effect size or power of all studies attempted, published, reported, etc. It does so for the set of studies included in p-curve. So note, for example, in the figure S2C above that when half of studies are .5 and half are .3 among the attempted, p-curve estimates the average included study accurately but differently from .4. The truth is .48 for included studies, p-curve says .47, and the average attempted study is .4

[This is not the issue. Replicability implies conditioning on significance. We want to predict the success rate of studies that replicate significant results. Of course it is meaningful to do follow up studies on non-significant results. But the goal here is not to replicate another inconclusive non-significant result.]

Happy to discuss of course, Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/24/2017

Hi Uri,

I will change the description of your p-curve code for power.

Honestly, I am not fully clear about what the code does or what the underlying assumptions are.

So, thanks for clarifying.

I agree with you that pcurve (also puniform) are surprisingly robust estimates of effect sizes even with heterogeneity (I have pointed that out in comments in the Facebook Discussion group), but that doesn’t mean it works well for power. If you have published any simulation tests for the power estimation function, I am happy to cite them.

Attached is a single R code file that contains (a) my shortened version of your p-curve code, (b) the z-curve code, (c) the code for the simulation studies.

The code shows the cumulative results. You don’t have to run all 5,000 replications before you see the means stabilizing.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

Hi Uli,

Thanks for sending the code, I am trying to understand it. I am a little confused about how the true power is being generated. I think you are drawing “noncentrality” parameters (ncp) that are skewed, and then turning those into power, rather than drawing directly skewedly distributed power, correct? (I am not judging that as good or bad, I am just verifying).

[Yes that is correct]

In any case, I created a histogram of the distribution of true power implied by the ncp’s that you are drawing (I think, not 100% sure I am getting that right).

For scenario 3.1 it looks like this:

For scenario 3.3 it looks like this:

(the only code I added was to turn all the true power values into a vector before averaging it, and then plotting a histogram for that vector; if interested, you can copy and paste this into the line of code that just reads “tp” in your code and you will reproduce my histogram)

#ADDED BY URI uri

power.i=pnorm(z,z.crit)[obs.z > z.crit] #line added by Uri SImonsohn to look at the distribution

hist(power.i,xlab='true power of each study')

mean.pi=round(mean(power.i),2)

median.pi=round(median(power.i),2)

sd.pi=round(sd(power.i),2)

mtext(side=3,line=0,paste0("mean=",mean.pi," median=",median.pi," sd=",sd.pi))

I wanted to make sure

1) I am correctly understanding this variable as being the true power of the observed studies, the average/median of which we are trying to estimate

2) Those distributions are the distributions you intended to generate

[Yes, that is correct. To clarify, 90% power for p < .05 (two-tailed) is obtained with a z-score of qnorm(.90, 1.96) = 3.24. A z-score of 4 corresponds to 97.9% power. So, in the literature with adequately powered studies, we would expect studies to bunch up at the upper limit of power, while some studies may have very low power because the theory made the wrong prediction and effect sizes are close to zero and power is close to alpha (5%).]
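[Illustration: the two numbers in the clarification above can be reproduced with base R. This is a minimal sketch that assumes a test statistic distributed as N(ncp, 1) and ignores the negligible chance of a significant result in the wrong direction; the function names are made up for this example and do not come from the z-curve code.]

```r
z.crit <- 1.96  # critical value for p < .05 (two-tailed)

# Power of a study whose observed z-score is drawn from N(ncp, 1).
power_from_ncp <- function(ncp, crit = z.crit) pnorm(ncp, mean = crit)

# Non-centrality parameter (expected z-score) needed for a target power.
ncp_for_power <- function(power, crit = z.crit) qnorm(power, mean = crit)

print(round(ncp_for_power(.90), 2))  # 3.24
print(round(power_from_ncp(4), 3))   # 0.979
```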

Thanks, Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

Hi Uri,

Thanks for getting back to me so quickly. You are right, it would be more accurate to describe the distribution as the distribution of the non-centrality parameters rather than power.

The distribution of power is also skewed but given the limit of 1, all high power studies will create a spike at 1. The same can happen at the lower end and you can easily get U-shaped distributions.

So, what you see is something that you would also see in actual datasets. Actually, the dataset minimizes skew because I only used non-centrality parameters from 0 to 6.

I did this because z-curve only models z-values between 0 and 6 and treats all observed z-scores greater than 6 as having a true power of 1. That reduces the pile on the right side.

You could do the same to improve performance of p-curve, but it will still not work as well as z-curve, as the simulations with z-scores below 6 show.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

OK, yes, probably worth clarifying that.

Ok, now I am trying to make sure I understand the function you use to estimate power with z-curve.

If I see p-values, say c(.001,.002,.003,.004,.005) and I wanted to estimate true power for them via z-curve, I would run:

p= c(.001,.002,.003,.004,.005)

z= -qnorm(p/2)

fun.zcurve(z)

And estimate true power to be 85%, correct?

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

Yes.

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

Hi Uli,

To make sure I understood z-curve’s function I ran a simple simulation.
I am getting somewhat biased results with z-curve, do you want to take a look and see if I may be doing something wrong?

I am attaching the code, I tried to make it clear but it is sometimes hard to convey what one is trying to do, so feel free to ask any questions.

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

Hi Uri,

What is the k in these simulations? (z-curve requires somewhat large k because the smoothing of the density function can distort things)

You may also consult this paper (the smallest k was 15 in this paper).

http://www.utstat.toronto.edu/~brunner/zcurve2016/HowReplicable.pdf

In this paper, we implemented pcurve differently, so you can ignore the p-curve results.

If you get consistent underestimation with z-curve, I would like to see how you simulate the data.

I haven’t seen this behavior in z-curve in my simulations.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/27/2017

Hi Uli,

I don’t know where “k” is set, I am using the function you sent me and it does not have k as a parameter

I am running this:

fun.zcurve = function(z.val.input, z.crit = 1.96, Int.End=6, bw=.05) {…

Where would k be set?

Into the function you have this

### resolution of density function (doesn’t seem to matter much)

bars = 500

Is that k?

URI

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

I mean the number of test statistics that you submit to z-curve.

length(z.val.input)

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/27/2017

I just checked with k = 20, the z-curve code I sent you underestimates fixed power of 80 as 72.

The paper I sent you shows a similar trend with true power of 75.

k        15     25     50     100    250
Z-curve  0.704  0.712  0.717  0.723  0.728

[Clarification: This is from the Brunner & Schimmack, 2016, article]

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/30/2017

Hi Uli,

Sorry for disappearing, got distracted with other things.

I looked a bit more at the apparent bias downwards that z-curve has on power estimates.

First, I added p-curve’s estimates to the chart I had sent, I know p-curve performs well for that basic setup so I used it as a way to diagnose possible errors in my simulations, but p-curve did correctly recover power, so I conclude the simulations are fine.

If you spot a problem with them, however, let me know.

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/30/2017

Hi Uri,

I am also puzzled why z-curve underestimates power in the homogeneous case even with large N. This is clearly an undesirable behavior and I am going to look for solutions to the problem.

However, in real data that I analyze, this is not a problem because there is heterogeneity.

When there is heterogeneity, z-curve performs very well, no matter what the distribution of power/non-centrality parameters is. That is the point of the paper. Any comments on comparisons in the heterogeneous case?

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/30/2017

Hey Uli,

I have something with heterogeneity but want to check my work and am almost done for the day, will try tomorrow.

Uri

[Remember: I supplied Uri with r-code to rerun the simulations of heterogeneity and he ran them to show what the distribution of power looks like. So at this point we could discuss the simulation results that are presented in the manuscript.]

—————————————————————————————————————————————

From     ULI
To           URI
Date      11/30/2017

I ran simulations with t-distributions and N = 40.

The results look the same for me.

Mean estimates for 500 simulations

32, 48, 75

As you can see, p-curve also has bias when t-values are converted into z-scores and then analyzed with p-curve.

This suggests that with small N, the transformation from t to z introduces some bias.

The simulations by Jerry Brunner showed less bias because we used the sample sizes in Psych Science for the simulation (median N ~ 80).

So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.
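[Illustration: the size of the t-to-z transformation can be seen with a few lines of base R. This is a minimal sketch, not the simulation code from the exchange. It assumes a two-group design, so that N = 40 gives df = 38, and it maps a t-value to the z-score that has the same two-tailed p-value. The observed t-value of 2.8 is an arbitrary choice.]

```r
# Map a t-value to the z-score with the same two-tailed p-value.
# t_to_z is a hypothetical helper name used only for this illustration.
t_to_z <- function(t, df) {
  p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # two-tailed p-value of the t-test
  qnorm(p / 2, lower.tail = FALSE)             # z-score with that p-value
}

t_obs   <- 2.8
z_small <- t_to_z(t_obs, df = 38)    # N = 40, assumed two-group design
z_large <- t_to_z(t_obs, df = 998)   # N = 1000, t is nearly normal

print(round(c(t = t_obs, z_N40 = z_small, z_N1000 = z_large), 3))
```

With small samples the same t-value maps onto a noticeably smaller z-score, and the gap shrinks as N grows. The sketch only shows the size of the transformation; it does not by itself show how much bias it produces in either method.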

—————————————————————————————————————————————

From     URI
To           ULI
Date      11/30/2017

Hi Uli,

The fact that p-curve is also biased when you convert to z-scores suggests to me that approximation is indeed part of the problem.

[Clarification: I think URI means z-curve]

Fortunately p-curve analysis does not require that transformation and one of the reasons we ask in the app to enter test-statistics is to avoid unnecessary transformations.

I guess it would also be true that if you added .012 to p-values p-curve would get it wrong, but p-curve does not require one to add .012 to p-values.

You write “So, we are in agreement that zcurve underestimates power when true power is fixed, above 50%, and N and k are small.”

Only partial agreement, because the statement implies that for larger N and larger K z-curve is not biased, I believe it is also biased for large k and large N. Here, for instance, is the chart with n=50 per cell (N=100 total) and 50 studies total.

Today I modified the code I sent you so that I would accommodate any power distribution in the submitted studies, not just a fixed level. (attached)

I then used the new montecarlo function to play around with heterogeneity and skewness.

The punchline is that p-curve continues to do well, and z-curve continues to be biased downward.

I also noted, by computing the standard deviation of estimates across simulations, that p-curve has slightly less random error.

My assessment is that z-curve and p-curve are very similar and will generally agree, but that z-curve is more biased and has more variance.

In any case, let’s get to the simulations. Below I show 8 scenarios sorted by the ex-post average true power for the sets of studies.

[Note, N = 20 per cell. As I pointed out earlier, with these small sample sizes the t to z-transformation is a factor. Also k = 20 is a small set of studies that makes it difficult to get good density distributions. So, this plot is p-hacked to show that p-curve is perfect and z-curve consistently worse. The results are not wrong, but they do not address the main question. What happens when we have substantial heterogeneity in true power? Again, Uri has the data, he has the r-code, and he has the results that show p-curve starts overestimating. However, he ignores this problem and presents simulations that are most favorable for p-curve.]

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/1/2017

Hi Uri,

I really do not care so much about bias in the homogeneous case. I just fixed the problem by first doing a test of the variance and if variance is small to use a fixed effects model.

[Clarification: This is not yet implemented in z-curve and was not done for the manuscript submitted for publication which just acknowledges that p-curve is superior when there is no heterogeneity.]

The main point of the manuscript is really about data that I actually encounter in the literature (see demonstrations in the manuscript, including power posing) where there is considerable heterogeneity.

In this case, p-curve overestimates as you can see in the simulations that I sent you. That is really the main point of the paper and any comments from you about p-curve and heterogeneity would be welcome.

And, I did not mean to imply that pcurve needs transformation. I just found it interesting that transformation is a problem when N is small (as N gets bigger t approaches z and the transformation has less influence).

So, we are in agreement that pcurve does very well when there is little variability in the true power across studies. The question is whether we are in agreement about heterogeneity in power?

Best, Uli

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/1/2017

Hi Uri,

Why not simulate scenarios that match real data?

[I attached data from my focal hypothesis analysis of Bargh’s book “Before you know it” ]

https://replicationindex.com/2017/11/28/before-you-know-it-by-john-a-bargh-a-quantitative-book-review/

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/1/2017

P.P.S

Also, my simulations show that z-curve OVERestimates when true power is below 50%. Do you find this as well?

This is important because power posing estimates are below 50%, so estimation problems with small k and N would mean that z-curve estimate is inflated rather than suggesting that p-curve estimate is correct.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

Hi Uli,

The results I sent show substantial heterogeneity and p-curve does well, do you disagree?

Uri

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

Not sure what you mean here. What aspect of real data would you like to add to the simulations? I did what I did to address the concerns you had that p-curve may not handle heterogeneity and skewed distributions of power, and it seems to do well with very substantial skew and heterogeneity.

What aspect are the simulations abstracting away from that you worry may lead p-curve to break down with real data?

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

I think you are not simulating sufficient heterogeneity to see that p-curve is biased in these situations.

Let’s focus on one example (simulation 2.3) in the r-code I sent you: High true power (.80) and heterogeneity.

This is the distribution of the non-centrality parameters.

And this is the distribution of true power for p < .05 (two-tailed, |z| >= 1.96).

[Clarification: this is not true power, it is the distribution of observed absolute z-scores]

More important, the variance of the observed significant (z > 1.96) z-scores is 2.29.

[Clarification: In response to this email exchange, I added the variance of significant z-scores to the manuscript as a measure of heterogeneity. Due to the selection for significance, variance with low power can be well below 1. A variance of 2.29 is large heterogeneity. ]

In comparison the variance for the fixed model (non-central z = 2.80) is 0.58.

So, we can start talking about heterogeneity in quantitative terms. How much variance do your simulated observed p-values have when you convert them into z-scores?

The whole point of the paper is that performance of p-curve suffers, the greater the heterogeneity of true power is. As sampling error is constant for z-scores, variance of observed z-scores has a maximum of 1 if true power is constant. It is lower than 1 due to selection for significance, which is more severe the lower the power is.
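[Illustration: the benchmark of 0.58 for the fixed model can be checked with the standard variance formula for a truncated normal distribution. This is a minimal sketch that assumes observed z-scores are N(ncp, 1) and that only z > 1.96 is selected; the function name is made up for this example.]

```r
# Variance of z ~ N(ncp, 1) after selecting only z > crit
# (textbook result for a normal distribution truncated from below).
var_significant_z <- function(ncp, crit = 1.96) {
  a <- crit - ncp                                   # truncation point in standard units
  lambda <- dnorm(a) / pnorm(a, lower.tail = FALSE) # inverse Mills ratio
  1 + a * lambda - lambda^2
}

print(round(var_significant_z(2.80), 2))  # 0.58

# Simulation check of the same quantity.
set.seed(1)
z <- rnorm(1e6, mean = 2.80)
sim_var <- var(z[z > 1.96])
print(round(sim_var, 2))
```

Under homogeneity the variance stays below 1 and shrinks further as power drops, so a variance of significant z-scores well above 1 can only come from heterogeneity in the non-centrality parameters.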

The question is whether my simulations use some unrealistic, large amount of heterogeneity. I attached some Figures for the Journal of Judgment and Decision Making.

As you can see, heterogeneity can be even larger than the heterogeneity simulated in scenario 2.3 (with a normal distribution around z = 2.75).

In conclusion, I don’t doubt that you can find scenarios where p-curve does well with some heterogeneity. However, the point of the paper is that it is possible to find scenarios where there is heterogeneity and p-curve does not do well. What your simulations suggest is that z-curve can also be biased in some situations, namely with low variability, small N (so that transformation to z-scores matters) and small number of studies.

I am already working on a solution for this problem, but I see it as a minor problem because most datasets that I have examined (like the ones that I used for the demonstrations in the ms) do not match this scenario.

So, if I can acknowledge that p-curve outperforms z-curve in some situations, I wonder whether you can do the same and acknowledge that z-curve outperforms p-curve when power is relatively high (50%+) and there is substantial heterogeneity?

Best, Uli

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

What surprises me is that I sent you r-code with 5 simulations that showed when p-curve is breaking down, starting with normally distributed variability of non-central z-scores and 50% power (sim 2.2), followed by higher power (80%) and all skewed distributions (sim 3.1, 3.2, 3.3). Do you find a problem with these simulations or is there some other reason why you ignore these simulation studies?

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

I tried “power = runif(n.sim)*.4 + .58” with k = 100.

Now pcurve starts to overestimate and zcurve is unbiased.

So, k makes a difference. Even if pcurve does well with k = 20, we also have to look for larger sets of studies.

Results of 500 simulations with k = 100

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Even with k = 40, pcurve overestimates as much as zcurve underestimates.

         zcurve  pcurve
Min.     0.5395  0.5600
1st Qu.  0.7232  0.7900
Median   0.7898  0.8400
Mean     0.7817  0.8246
3rd Qu.  0.8519  0.8700
Max.     0.9227  0.9400

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

This is what I find with systematic variation of number of studies (k) and the maximum heterogeneity for a uniform distribution of power and average power of 80% after selection for significance.

power = runif(n.sim)*.4 + .58

           zcurve  pcurve
k = 20     77.5    81.2
k = 40     78.2    82.5
k = 100    79.3    82.7
k = 10000  80.2    81.7  (1 run)

If we are going to look at k = 20, we also have to look at k = 100.

Best, Uli
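
[Illustration: the true value that both methods are trying to recover in this scenario can be simulated directly. This is a minimal sketch, not the code that was exchanged. It assumes one z-test per study with z ~ N(ncp, 1) and selection of z > 1.96, and it does not run p-curve or z-curve. The number of simulated studies and the seed are arbitrary.]

```r
set.seed(123)
n.sim  <- 200000
z.crit <- 1.96

power <- runif(n.sim) * .4 + .58       # true power, uniform between .58 and .98
ncp   <- qnorm(power, mean = z.crit)   # non-centrality that yields this power
z     <- rnorm(n.sim, mean = ncp)      # one observed z-score per study
sig   <- z > z.crit                    # selection for significance

mean_power_all <- mean(power)          # average power before selection
mean_power_sig <- mean(power[sig])     # average power of the significant studies

# Selection weights each study by its own power, so the expected value
# after selection is sum(power^2) / sum(power).
expected_sig <- sum(power^2) / sum(power)

print(round(c(all = mean_power_all, significant = mean_power_sig,
              expected = expected_sig), 3))
```

Average power is about 78% before selection and close to 80% after selection, which is the value that the estimates in the table should be compared against.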

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Why did you truncate the beta distributions so that they start at 50% power?

Isn’t it realistic to assume that some studies have less than 50% power, including false positives (power = alpha = 5%)?

How about trying this beta distribution?

curve(dbeta(x,.5,.35)*.95+.05,0,1,ylim=c(0,3),col="red")

80% true power after selection for significance.
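
[Illustration: the 80% figure can be checked without simulation. This is a minimal sketch that assumes true power is generated as .05 + .95 * B with B drawn from a Beta(.5, .35) distribution, which is one reading of the curve() call above, and that the chance of a study being selected equals its power.]

```r
a <- .5
b <- .35

# Mean and variance of a Beta(a, b) variable.
mean_beta <- a / (a + b)
var_beta  <- a * b / ((a + b)^2 * (a + b + 1))

# Assumed mapping to power: rescale so that power ranges from .05 to 1.
mean_power    <- .05 + .95 * mean_beta
mean_power_sq <- .95^2 * var_beta + mean_power^2   # E[power^2]

# Selection for significance weights each study by its power,
# so average power after selection is E[power^2] / E[power].
power_after_selection <- mean_power_sq / mean_power

print(round(c(before = mean_power, after = power_after_selection), 2))  # 0.61 0.80
```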

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

HI Uli,

I know I have a few emails from you, thanks.

My plan is to get to them on Monday or Tuesday. OK?

Uri

—————————————————————————————————————————————

Hi Uli,

We have a blogpost going up tomorrow and have been distracted with that, made some progress with z- vs p- but am not ready yet.

Sorry Uri

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/2/2017

Hi Uli,

Ok, finally I have time to answer your emails from over the weekend.

Why I run something different?

First, you asked why I run simulations that were different from those you have in your paper (scenario 2.1 and 3.1).

The answer is that I tried to simulate what I thought you were describing in the text: heterogeneity in power that was skewed.

When I saw you had run simulations that led to a power distribution that looked like this:

I assumed that was not what was intended.

First, that’s not skewed

Second, that seems unrealistic, you are simulating >30% of studies powered above 90%.

[Clarification: If studies were powered at 80%, 33% of studies would be above 90% :

1-pnorm(qnorm(.90,1.96),qnorm(.80,1.96))

It is important to remember that we are talking only about studies that produced a significant result. Even if many null-hypotheses are tested, relatively few of these would make it into the set of studies that produced a significant result. Most important, this claim ignores the examples in the paper and my calculations of heterogeneity that can be used to compare simulations of heterogeneity with real data.]

Third, when one has extremely bimodal data, central tendency measures are less informative/important (e.g., the average human wears half a bra). So if indeed power was distributed that way, I don’t think I would like to estimate average power anyway. And if it did, saying the average is 60% or 80% is almost irrelevant, hardly any studies are in that range in reality (like say the average person wears .65 bras, that’s wrong, but inconsequentially worse than .5 bras).

Fourth, if indeed 30% of studies have >90% power, we don’t need p-curve or z-curve. Stuff is gonna be obviously true to naked eye.

But below I will ignore these reservations and stick to that extreme bimodal distribution you propose that we focus our attention on.

The impact of null findings

Actually, before that, let me acknowledge I think you raised a very valid point about the importance of adding null findings to the simulations. I don’t think the extreme bimodal you used is the way to do it, but I do think power=5% in the mix does make sense.

We had not considered p-curve’s performance there and we should have.

Prompted by this exchange I did that, and I am comfortable with how p-curve handles power=5% in the mix.

For example, I considered 40 studies, starting with all 40 null, and then having an increasing number drawn from U(40*-80%) power. Looks fine.

Why p-curve overshoots?

Ok. So having discussed the potential impact of null findings on estimates, and leaving aside my reservations with defining the extreme bimodal distribution of power as something we should worry about, let’s try to understand why p-curve over-estimates and z-curve does not.

Your paper proposes it is because p-curve assumes homogeneity.

It doesn’t. p-curve does not assume homogeneity of power any more than computing average height involves assuming homogeneity of height. It is true that p-curve does not estimate heterogeneity in power, but averaging height also does not compute the SD(height). P-curve does not assume it is zero, in fact, one could use p-curve results to estimate heterogeneity.

But in any case, is z-curve handling the extreme bimodal better thanks to its mixture of distributions, as you propose in the paper, or due to something else?

Because power is nonlinearly related to ncp, I assumed it had to do with the censoring of high z-values you did rather than the mixture (though I did not actually look into the mixture in any detail at all).

To look into that I censored t-values going into p-curve. Not as a proposal for a modification but to make the discussion concrete. I censored at 3.5, so that any t>3.5 is replaced by 3.5 before being entered into p-curve. I did not spend much time fine-tuning it, and I am definitely not proposing that if one were to censor t-values in p-curve they should be censored at 3.5.
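
[Illustration: a minimal sketch of the censoring step described in this paragraph, not code from the exchange. It assumes the t-values are positive (for example, absolute values) and uses the cap of 3.5 mentioned in the email; the function name is hypothetical and the p-curve estimation itself is not shown.]

```python
def censor_t(t_values, cap=3.5):
    """Replace any t-value above `cap` with `cap`; leave the others unchanged."""
    return [min(t, cap) for t in t_values]


t_values = [2.1, 2.8, 3.5, 4.2, 7.0]
print(censor_t(t_values))  # values above 3.5 are pulled down to 3.5
```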

Ok, so I ran p-curve with censored t-values for the rbeta() distribution you sent and for various others of the same style.

We see that censored p-curve behaves very similarly to z-curve (which is censored also).

I also tried adding more studies, running rbeta(3,1) and (1,3), etc. Across the board, I find that if there is a high share of extremely high powered studies, censored p-curve and z-curve look quite similar.

If we knew nothing else, we would be inclined to censor p-curve going forward, or to use z-curve instead. But censored p-curve, and especially z-curve, give worse answers when the set of studies does not include many extremely high-powered ones, and in real life we don’t have many extremely high-powered studies. So z-curve and censored p-curve make gains in a world that I don’t think exists, and exhibit losses in one that I do think exists.

In particular, z-curve estimates power to be about 10% when the null is true, instead of 5% (censored p-curve actually gets this one right, null is estimated at 5%).

Also, z-curve underestimates power in most scenarios not involving an extreme bimodal distribution (see charts I sent in my previous email). In addition, z-curve tends to have higher variance than p-curve.

As indicated in my previous email, z-curve and p-curve agree most of the time, their differences will typically be within sampling error. It is a low stakes decision to use p-curve vs z-curve, especially compared to the much more important issue of which studies are selected and which tests are selected within studies.

Thanks for engaging in this conversation.

We don’t have to converge to agreement to gain from discussing things.

Btw, we will write a blog post on the repeated and incorrect claim that p-curve assumes homogeneity and does not deal with heterogeneity well. We will send you a draft when we do, but it could be several weeks till we get to that. I don’t anticipate it being a contentious post from your point of view but figured I would tell you about it now.

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/2/2017

Hi Uri,

Now that we are on the same page, the only question is what is realistic.

First, your blog post on outliers already shows what is realistic. A single outlier in the power pose study increases the p-curve estimate by more than 10 percentage points.

You can fix this now, but p-curve as it existed did not do this. I would also describe this as a case of heterogeneity. Clearly the study with z = 7 is different from studies with z = 2.

This is in the manuscript that I asked you to evaluate and you haven’t commented on it at all, while writing a blog post about it.

The paper contains several other examples that are realistic because they are based on real data.

I mainly present them as histograms of z-scores rather than histograms of p-values or observed power because I find the distribution of the z-scores more informative (e.g., where is the mode, is the distribution roughly normal, etc.), but if you convert the z-scores into power you get distributions like the one shown below (U-shaped), which is not surprising because power is bounded at alpha and 1. So, that is a realistic scenario, whereas your simulations of truncated distributions are not.
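
[Illustration: a minimal sketch of converting z-scores into power, not code from the exchange or the manuscript. It assumes a two-sided test with alpha = .05 and treats each observed z-score as the noncentrality of a normal test statistic, which gives "observed power". The function name is hypothetical.]

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_from_z(z, alpha=0.05):
    """Two-sided power of a z-test whose noncentrality equals z."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    upper_tail = 1 - STD_NORMAL.cdf(crit - z)  # significant in the expected direction
    lower_tail = STD_NORMAL.cdf(-crit - z)     # significant in the opposite direction
    return upper_tail + lower_tail


for z in (0.0, 1.0, 2.0, 3.0, 4.0, 7.0):
    print(z, round(power_from_z(z), 3))
```

[The conversion cannot go below alpha (at z = 0) or above 1, so a wide spread of z-scores piles up near both bounds, which is where the U shape comes from.]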

I think we can end the discussion here. You have not shown any flaws with my analyses. You have shown that under very limited and unrealistic situations p-curve performs better than z-curve, which is fine because I already acknowledged in the paper that p-curve does better in the homogeneous case.

I will change the description of the assumption underlying p-curve, but leave everything else as is.

If you think there is an error let me know but I have been waiting patiently for you to comment on the paper, and examined your new simulations.

Best, Uli

—————————————————————————————————————————————

Hi Uri,

What about the real world of power posing?

A few z-scores greater than 4 mess up p-curve as you just pointed out in your outlier blog.

I have presented several real-world datasets to you that you continue to ignore.

Please provide one REAL dataset where p-curve gets it right and z-curve underestimates.

Best, Uli

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/6/2017

Hi Uli,

With real datasets you don’t know true power so you don’t know what’s right and wrong.

The point of our post today is that there is no point statistically analyzing the studies that Cuddy et al put together, with p-curve or any other tool.

I personally don’t think we ever observe true power with enough granularity to make z- vs p-curve prediction differences consequential.

But I don’t think we, you and I, should debate this aspect (is this bias worth that bias). Let’s stick to debating basic facts such as whether or not p-curve assumes homogeneity, or z-curve differs from p-curve because of homogeneity assumption or because of censoring, or how big bias is with this or that assumption. Then when we write we present those facts as transparently as possible to our readers, and they can make an educated decision about it based on their priors and preferences.

Uri

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/6/2017

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values.

Agree

Disagree

z-curve uses multiple parameters, which improves prediction when there is substantial heterogeneity?

Agree

Disagree

In many cases, the differences are small and not consequential.

Agree

Disagree

When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates.

(see simulations in our manuscript)

Agree

Disagree

I want to submit the manuscript by end of the week.

Best, Uli

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/6/2017

Going through the manuscript one more time, I found this.

To examine the robustness of estimates against outliers, we also obtained estimates for a subset of studies with z-scores less than 4 (k = 49). Excluding the four studies with extreme scores had relatively little effect on z-curve; replicability estimate = 34%. In contrast, the p-curve estimate dropped from 44% to 5%, while the 90%CI of p-curve ranged from 13% to 30% and did not include the point estimate.

Any comments on this? I mean, the point estimate is 5% and the 90%CI is 13 to 30%.

Best, Uli

[Clarification: this was a mistake. I confused point estimate and lower bound of CI in my output]

—————————————————————————————————————————————

From     URI
To           ULI
Date      12/7/2017

Hi Uli.

See below:

From: Ulrich Schimmack [mailto:ulrich.schimmack@utoronto.ca]

Sent: Wednesday, December 6, 2017 10:44 PM

To: Simonsohn, Uri <uws@wharton.upenn.edu>

Subject: RE: Its’ about censoring i think

Just checking where we agree or disagree.

p-curve uses a single parameter for true power to predict observed p-values.

Agree

z-curve uses multiple parameters,

Agree I don’t know the details of how z-curve works, but I suspect you do and are correct.

which improves prediction when there is substantial heterogeneity?

Disagree.

Few fronts.

1) I don’t think heterogeneity per-se is the issue, but extremity of the values. P-curve is accurate with very substantial heterogeneity. In your examples what causes the trouble are those extremely high power values. Even with minimal heterogeneity you will get over-estimation if you use such values.

2) I also don’t know that it is the extra parameters in z-curve that are helping, because p-curve with censoring does just as well. So I suspect it is the censoring and not the multiple parameters. That’s also consistent with z-curve under-estimating almost everywhere; the multiple parameters should not lead to that, I don’t think.

In many cases, the differences are small and not consequential.

Agree, mostly. I would not state that in an unqualified way out of context.

For example, my personal assessment, which I realize you probably don’t share, is that z-curve does worse in contexts that matter a bit more, and that are vastly more likely to be observed.

When there is substantial heterogeneity and moderate to high power (which you think is rare), z-curve is accurate and p-curve overestimates.

(see simulations in our manuscript)

Disagree.

You can have very substantial heterogeneity and very high power and p-curve is accurate (z-curve under-estimates).

For example, for the blogpost on heterogeneity and p-curve I figured that rather than simulating power directly it made more sense to simulate n and d distributions, over which people have better intuitions, and then see what happened to power (rather than simulating power or ncp directly).

Here is one example. Sets of 20 studies, drawn with n and d from the first two panels, with the implied true power and its estimate in the 3rd panel.

I don’t mention this in the post, but z-curve in this simulation under-estimates power, 86% instead of 93%

The parameters are

n~rnorm(mean=100,sd=10)

d~rnorm(mean=.5,sd=.05)
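
[Illustration: a minimal sketch of how true power follows from the n and d distributions given above, not code from the exchange. The email does not state the design, so this assumes a two-group comparison with n as the per-group sample size, a two-sided alpha of .05, and a normal approximation to the t-test. It computes only the implied true power; it does not run p-curve or z-curve. All names and the seed are hypothetical.]

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()
CRIT = STD_NORMAL.inv_cdf(0.975)  # two-sided critical value for alpha = .05


def power_two_group(n_per_group, d):
    """Approximate power of a two-group test with effect size d (normal approximation)."""
    ncp = d * (n_per_group / 2) ** 0.5
    return (1 - STD_NORMAL.cdf(CRIT - ncp)) + STD_NORMAL.cdf(-CRIT - ncp)


def simulate_true_power(k, rng):
    """Draw k studies with n ~ N(100, 10) and d ~ N(0.5, 0.05); return their power."""
    powers = []
    for _ in range(k):
        n = max(2.0, rng.gauss(100, 10))  # guard against implausibly small draws
        d = rng.gauss(0.5, 0.05)
        powers.append(power_two_group(n, d))
    return powers


rng = random.Random(2017)
powers = simulate_true_power(20000, rng)
print(round(mean(powers), 3), round(min(powers), 3), round(max(powers), 3))
```

[Under these assumptions nearly all studies have high power, so the heterogeneity in n and d translates into a fairly narrow range of power values near the top of the scale.]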

What you need for p-curve to over-estimate and for z-curve to not under-estimate is a substantial share of studies at both extremes: many null, many with power>95%.

In general very high power leads to over-estimation, but it is trivial in the absence of many very low power studies that lower the average enough that it matters.

That’s the combination I find unlikely, 30%+ with >90% power and at the same time 15% of null findings (approx., going off memory here).

I don’t generically find high power with heterogeneity unlikely, I find the figure above super plausible for instance.

NOTE: For the post I hope to gain more insight on the precise boundary conditions for over-estimation, I am not sure I totally get it just yet.

I want to submit the manuscript by end of the week.

Hope that helps. Good luck.

Best, Uli

From     URI
To           ULI
Date      12/7/2017

Hi Uli,

First, I had not read your entire paper and only now I realize you analyze the Cuddy et al paper, that’s an interesting coincidence. For what it’s worth, we worked on the post before you and I had this exchange (the post was written in November and we first waited for Thanksgiving and then over 10 days for them to reply). And moreover, our post is heavily based off the peer-review Joe wrote when reviewing this paper, nearly a year ago, and which was largely ignored by the authors unfortunately.

In terms of the results. I am not sure I understand. Are you saying you get an estimate of 5% with a confidence interval between 13 and 30?

That’s not what I get.

—————————————————————————————————————————————

From     ULI
To           URI
Date      12/7/2017

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

As you can see in your output, the numbers are switched (I should label columns in output).

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

Best, Uli

—————————————————————————————————————————————

Hi Uli,

See below

From: Ulrich Schimmack [mailto:ulrich.schimmack@utoronto.ca]

Sent: Friday, December 8, 2017 12:39 AM

To: Simonsohn, Uri <uws@wharton.upenn.edu>

Subject: RE: one more question

Hi Uri

That was a mistake. It should be 13% estimate with 5% to 30% confidence interval.

*I figured

I was happy to see pcurve mess up (motivated bias), but I already corrected it yesterday when I went through the manuscript again and double checked.

*Happens to the best of us

As you can see in your output, the numbers are switched (I should label columns in output).

*I figured

So, the question is whether you will eventually admit that pcurve overestimates when there is substantial heterogeneity.

*The tone is a bit accusatorial “admit”, but yes, in my blog post I will talk about it. My goal is to present facts in a way that lets readers decide with the same information I am using to decide.

It’s not always feasible to achieve that goal, but I strive for it. I prefer people making right inferences than relying on my work to arrive at them.

We can then fight over what is realistic and substantial, etc. but to simply ignore the results of my simulations seems defensive.

*I don’t think that’s for us to decide. We can ‘fight’ about how to present the facts to readers, they decide which is more realistic.

I am not ignoring your simulation results.

This is the last chance before I will go public and quote you as saying that pcurve is not biased when there is substantial heterogeneity.

*I would prefer if you don’t speak on my behalf either way, our conversation is for each of us to learn from the other, then you speak for yourself.

If that is really your belief, so be it. Maybe my simulations are wrong, but you never commented on them.

*I haven’t tried to reproduce your simulations, but I did indicate in our emails that if you run the rbeta(n,.35,.5)*.95+.05, p-curve over-estimates; I also explained why I don’t find that particularly worrisome. But you are not publishing a report on our email exchange, you are proposing a new tool. Our exchange hopefully helped make that paper clearer.
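
[Illustration: a minimal sketch of the power distribution referred to as rbeta(n,.35,.5)*.95+.05, not code from the exchange. It uses Python's random.betavariate as an analogue of R's rbeta and only describes the distribution of true power; it does not run p-curve or z-curve. The function name, the seed, and the 10% and 90% cut-offs are hypothetical choices for display.]

```python
import random


def draw_power(k, rng, a=0.35, b=0.5):
    """Draw k power values: Beta(a, b) rescaled from [0, 1] to [0.05, 1]."""
    return [rng.betavariate(a, b) * 0.95 + 0.05 for _ in range(k)]


rng = random.Random(7)
powers = draw_power(100_000, rng)

average = sum(powers) / len(powers)
share_high = sum(p > 0.90 for p in powers) / len(powers)  # very high power
share_low = sum(p < 0.10 for p in powers) / len(powers)   # close to the null
print(round(average, 3), round(share_high, 3), round(share_low, 3))
```

[Both shape parameters are below 1, so the draws cluster near the two ends of the range, which is the extreme bimodal pattern discussed throughout the exchange.]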

Please don’t quote any aspect of our exchange. You can say you discussed matters with me, but please do not quote me. This is a private email exchange. You can quote from my work and posts. The heterogeneity blog post may be up in a week or two.

Uri
