
False False Positive Psychology

The article “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, henceforth the FPP article, is now a classic in the meta-psychological literature that emerged in the wake of the replication crisis in psychology.

The main claim of the FPP article is that it is easy to produce false positive results when researchers p-hack their data. P-hacking is a term coined by the FPP authors for the use of questionable research practices (QRPs, John et al., 2012). There are many QRPs, but they all have in common that researchers conduct more statistical analyses than they report and selectively report only the results of those analyses that produce a statistically significant result.

The main distinction between p-hacking and QRPs is that p-hacking covers only some QRPs. John et al. (2012) also include fraud as a QRP, but I prefer to treat the fabrication of data as a distinct form of malpractice that clearly requires intent to deceive others. More importantly, p-hacking does not consider publication bias. Publication bias implies that researchers fail to publish entire studies with non-significant results. The FPP authors are not concerned about publication bias because their main claim is that p-hacking makes it so easy to obtain significant results that it is unnecessary to discard entire datasets. After showing that a combination of four QRPs can produce false positive results with a 60% success rate (for alpha = .05), the authors hasten to warn readers that this is a conservative estimate because actual researchers might use even more QRPs: “As high as these estimates are, they may actually be conservative” (p. 1361).

The article shook the foundations of mainstream psychology because it suggested that most published results in psychology could be false positive results; that is, a statistically significant result was reported even though the effect does not exist. The FPP article provided a timely explanation for Bem’s (2011) controversial finding that humans have extrasensory abilities, a finding that unintentionally contributed to the credibility crisis in social psychology (Simmons, Nelson, & Simonsohn, 2018; Schimmack, 2020).

In 2018, the FPP authors published their own reflection on their impactful article for a special issue on the most highly cited articles in Psychological Science (Simmons et al., 2018). In this article, the authors acknowledge that they used questionable research practices in their own work and knew that using these practices was wrong. However, like many other psychologists, they thought these practices were harmless because nothing substantial changes when a p-value is .04 rather than .06. Their own article convinced them that their practices were more like robbing a bank than jaywalking.

The FPP authors were also asked to critically reflect on their article and to comment on things they might have done differently with the benefit of hindsight. Their main regret was the recommendation to require a minimum sample size of n = 20 per cell. After learning about statistical power, they realized that sample sizes should be justified based on power analysis. Otherwise, false positive psychology would simply become false negative psychology, where articles mostly report non-significant results even when real effects exist. To increase the credibility of psychological science it is necessary to curb the use of questionable research practices and to increase statistical power (Schimmack, 2020).

The 2018 reflections reinforce the main claims of the 2011 article that (a) p-hacking nil-effects to significance is easy and (b) many published significant results might be false positive results. A blog post by the FPP authors in 2016 makes clear that the authors consider these to be the core findings of their article (http://datacolada.org/55).

In my critical examination of the FPP article, I challenge both of these claims. First, it is important to clarify what the authors mean by “a bit of p-hacking.” To use an analogy, what does a bit of making out mean? Answers range from kissing to intercourse. So, what do you actually have to do to have a 60% probability of getting pregnant? The FPP article falsely suggests that a bit of kissing may get you there. However, Table 1 shows that you actually have to f*&% the data to get a significant result.

The table also shows that it gets harder to p-hack results as the alpha criterion decreases. While the combination of four QRPs can produce 81.5% marginally significant results (p < .10), only 21.5% of attempts were successful with p < .01 as the significance criterion. One sensible recommendation based on this finding would be to disregard significant results with p-values greater than .01.

Another important finding is that each QRP alone increased the probability of a false positive result only slightly, from the nominal 5% to an actual level of no more than 12.6%. Based on these results, I would not claim that it is easy to get false positive results. I consider the routine combination of four QRPs in every study that is being conducted a form of research fraud that is nevertheless sanctioned by professional organizations. That is, even if a raid of a laboratory were to find that a researcher actually uses this approach to analyze data, the researcher would not be considered to be engaging in fraudulent practices by psychological organizations like the Association for Psychological Science or by granting agencies.

The distinction between a little and massive p-hacking is not just a matter of semantics. It influences beliefs about the prevalence of false positive results in psychology journals. If it takes only a little bit of p-hacking to get false positive results, it is reasonable to assume that many published results are false positives. Hence, the title “False Positive Psychology.”

Aside from the simulation study, the FPP article also presents two p-hacked studies. The presentation of these two studies reinforces the narrative that p-hacking virtually guarantees significant results. At least, the authors do not mention that they also ran some additional studies with non-significant results that they did not report. However, their own simulation results suggest that a file-drawer of non-significant studies should exist despite massive p-hacking. After all, the probability of getting two significant results in a row with a 60% success rate per study is only 36%. This means that the authors were lucky to get the desired results, used even more QRPs to ensure a nearly 100% success rate, or failed to disclose a file-drawer of non-significant results. To examine these hypotheses, I simulated the actual p-hacking design of Study 2.

A Z-curve analysis of massive p-hacking

The authors do not disclose how they p-hacked Study 1. For Study 2, they provide the following information. The study had three groups (“When I’m Sixty-Four”, “Kalimba”, “Hot Potato”), and the “Hot Potato” condition was dropped like a hot potato. It is not clear how the sample size decreased from 34 to 20 as a result, but maybe participants were not equally assigned to the three conditions and there were 14 participants in the “Hot Potato” condition. The next QRP was that there were two dependent variables: actual age and felt age. Then there were a number of covariates, including bizarre and silly ones like the square root of 100 to enhance the humor of the article. In total, there were 10 covariates. Finally, the authors used optional stopping. They checked after every 10 participants. It is not specified whether they meant 10 participants per condition or in total, but to increase the chances of a significant result it is better to use smaller increments. So, I assume it was just 3 participants per condition.

To examine the long-run success rate of this p-hacking design, I simulated the following combination of QRPs: (a) three conditions, (b) two dependent variables, (c) 10 covariates, and (d) increasing sample size from n = 10 until N > 200 per condition in steps of 3. I ran 10,000 simulations of this p-hacking design. The first finding was that it provided a success rate of 77% (7718 / 10,000), which is even higher than the 60% success rate featured in the FPP article. Thus, more massive p-hacking partially explains why both studies were significant.
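For readers who want to explore this design themselves, the following R sketch implements a simplified version of it. The specific choices (testing the ten covariates jointly rather than in all subsets, testing the three pairwise group contrasts separately, and running 1,000 rather than 10,000 iterations for speed) are my assumptions, so the success rate from this sketch need not match the 77% reported above exactly.

```r
# Simplified sketch of the p-hacking design: three conditions, two DVs,
# ten irrelevant covariates, optional stopping in steps of 3 per condition.
# All effects are nil, so every "success" is a false positive.
set.seed(123)

phack_once <- function(n_start = 10, n_max = 200, step = 3, k_cov = 10, alpha = .05) {
  new_batch <- function(n) {
    X <- matrix(rnorm(3 * n * k_cov), ncol = k_cov,
                dimnames = list(NULL, paste0("X", 1:k_cov)))  # irrelevant covariates
    data.frame(g  = factor(rep(1:3, each = n)),
               y1 = rnorm(3 * n),                             # DV 1 (e.g., felt age)
               y2 = rnorm(3 * n),                             # DV 2 (e.g., actual age)
               X)
  }
  dat <- new_batch(n_start)
  repeat {
    ps <- c()
    for (dv in c("y1", "y2")) {
      for (pair in list(c(1, 2), c(1, 3), c(2, 3))) {         # drop one condition at a time
        d  <- droplevels(dat[dat$g %in% pair, ])
        f0 <- reformulate("g", dv)                            # test without covariates
        f1 <- reformulate(c("g", paste0("X", 1:k_cov)), dv)   # test with all covariates
        ps <- c(ps, summary(lm(f0, data = d))$coefficients[2, 4],
                    summary(lm(f1, data = d))$coefficients[2, 4])
      }
    }
    n_cell <- sum(dat$g == 1)
    if (min(ps) < alpha) return(c(sig = 1, n = n_cell))
    if (n_cell >= n_max) return(c(sig = 0, n = n_cell))
    dat <- rbind(dat, new_batch(step))                        # optional stopping: add 3 per cell
  }
}

sim <- t(replicate(1000, phack_once()))
mean(sim[, "sig"])   # long-run success rate of the design despite all null effects
```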

The simulation also produced a new insight into p-hacking by examining the success rates for every increment in sample size (Figure 1). It is readily apparent that the chances of a significant result decrease as more participants are added. The reason is that favorable sampling error at the beginning quickly produces significant results, whereas unfavorable sampling error at the beginning takes a long time to be reversed.

It follows that a smart p-hacker would not use optional stopping indiscriminately and would only continue data collection if the first test shows a promising trend. This is what Bem (2011) did to get his significant results (Schimmack, 2016). It is not clear why the FPP authors did not simulate optional stopping. However, the failure to include this QRP explains why they maintain that p-hacking does not leave a file drawer of non-significant results. In theory, adding participants would eventually produce a significant result, resulting in a success rate of 100%. However, in practice resources would often be depleted before a significant result emerges. Thus, even with massive p-hacking a file drawer of non-significant results is inevitable.

It is notable that both studies that are reported in the FPP article have very small sample sizes (Ns = 30, 34). This shows that extensive optional stopping (adding participants until significance) does not explain the 100% success rate. It also means that the actual probability of a success on the first attempt was only about 40% based on the QRP design for Study 2. Accordingly, the chance of getting two significant results in a row was only 16%. This low success rate suggests that the significant p-values in the FPP article are not replicable. I bet that a replication project would produce more non-significant than significant results.

In sum, the FPP article suggested that it is easy to get significant results with a little bit of p-hacking. Careful reading of the article and a new simulation study show that this claim is misleading. It requires massive p-hacking that is difficult to distinguish from fraud to consistently produce significant results in the absence of a real effect and even massive p-hacking is likely to produce a file-drawer of non-significant results unless researchers are willing to continue data collection until sample sizes are very large.

Detecting massive p-hacking

In the wake of the replication crisis, numerous statistical methods have been developed that make it possible to detect the bias introduced by QRPs (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012; Simonsohn et al., 2014; Schimmack, 2016). The advantage of z-curve is that it also provides valuable additional information, such as estimates of the success rate of replication attempts and of the false discovery risk (Bartos & Schimmack, 2021).
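For readers who want to run such an analysis themselves, a minimal sketch using the zcurve R package looks like the following (assuming the interface of the CRAN version of the package; the z vector here is a simulated placeholder standing in for any set of test statistics converted to z-scores):

```r
# install.packages("zcurve")   # R package implementing z-curve 2.0
library(zcurve)

# z: absolute z-scores of reported test statistics (simulated placeholder values)
set.seed(1)
z <- abs(rnorm(500, mean = 2, sd = 1))

fit <- zcurve(z)   # fits the finite mixture model to the significant z-scores
summary(fit)       # reports the expected replication and discovery rates with CIs
plot(fit)          # produces a z-curve plot like the figures discussed here
```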

Figure 2 shows the z-curve plot for the 10,000 p-values from the previous simulation of the FPP p-hacking design. To create a z-curve plot, the p-values are converted into z-scores using the formula qnorm(1-p/2). Accordingly, a p-value of .05 corresponds to a z-score of 1.96, and all z-scores greater than 1.96 (the solid red line) are significant.
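As a quick illustration of this transformation in R (the p-values below are arbitrary examples):

```r
# Two-sided p-values mapped to absolute z-scores
p <- c(.20, .05, .01, .005)
round(qnorm(1 - p / 2), 2)   # 1.28 1.96 2.58 2.81
```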

Visual inspection shows that z-curve is unable to fit the actual distribution of z-scores because the distribution of actual z-scores is even steeper than z-curve predicts. However, this misfit merely reflects the distinction between p-hacking and other QRPs such as publication bias, which is irrelevant for the evaluation of evidential value. Z-curve correctly predicts that the actual discovery rate is 5%, which is expected when only false hypotheses are tested with alpha = .05. It also correctly predicts that the probability of a successful replication without QRPs is only 5%. Finally, z-curve also correctly estimates that the false discovery risk is 100%. That is, the distribution of z-scores suggests that all of the significant results are false positive results.

The results address outdated criticisms of bias-detection methods that they merely show the presence of publication bias. First, the methods do not care about the distinction between p-hacking and publication bias. All QRPs inflate the success rate, and bias-detection methods reveal inflated success rates (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). Second, while older methods merely showed the presence of bias, newer methods like z-curve also quantify the amount of bias. Thus, even if bias is always present, they provide valuable information about the amount of bias. In the present example, massive p-hacking produced massive bias in the success rate. Finally, z-curve 2.0 also quantifies the false positive risk after correcting for bias and correctly shows that massive p-hacking of nil-hypotheses produces only false positive results.

The simulation also makes it possible to replicate the influence of alpha on the false positive risk. A simple selection model predicts that only 20% of the results that are significant with alpha = .05 are still significant with alpha = .01. This follows from the uniform distribution of p-values under the null hypothesis, which implies that a fraction of .01/.05 = 20% of significant p-values also fall below .01. However, massive p-hacking clusters even more p-values in the range between .01 and .05. In this simulation, only 6% (500 / 7,718) of significant p-values were below .01. Thus, it is possible to reduce the rate of false positive results from 77% of all p-hacking attempts to close to 5% (500 / 10,000) by disregarding all p-values between .01 and .05. Thus, massive p-hacking provides another reason for calls to adjust the alpha level for statistically significant results to .005 (Benjamin et al., 2017) to reduce the risk of false positive results even for p-hacked literatures.
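A quick way to see where the 20% benchmark comes from is to sample significant p-values from a uniform distribution, which is what simple selection of independent null results produces (the 6% figure for the p-hacked design comes from the simulation above):

```r
# Under simple selection, significant p-values of true nulls are uniform on (0, .05)
set.seed(2)
p_selected <- runif(1e6, min = 0, max = .05)
mean(p_selected < .01)   # ~ .20, compared to ~ .06 after massive p-hacking
```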

In sum, since the FPP article was published it has become possible to detect p-hacking in actual data using statistical methods like z-curve. These methods work even when massive p-hacking was used because massive p-hacking makes the detection of bias easier, especially when massive p-hacking is used to produce false positive results. The development of z-curve makes it possible to compare the FPP scenarios with 60% or more false positive results to actual p-values in published journals.

How Prevalent is Massive P-Hacking?

Since the FPP article was published, other articles have examined the prevalence of questionable research practices. Most of these studies rely on surveys (John et al., 2012; Fiedler & Schwarz, 2016). The problem with survey results is that they do not provide precise information about the amount or severity of p-hacking. Furthermore, it is possible that many researchers use QRPs when they are testing a real effect. This practice would inflate effect sizes, but it does not increase the risk of false positive results. These problems are addressed by z-curve analyses of published results. Figure 3 shows the results for Motyl et al.’s (2017) representative sample of test statistics in social psychology journals.

The z-curve plot of actual p-values differs in several ways from the z-curve plot of massive p-hacking. The estimated discovery rate is 23% and the estimated replication rate is 45%. The point estimate of the false discovery risk is only 18%, suggesting that no more than a quarter of published results are false positives. However, due to the small set of p-values, the 95%CI around the point estimate of the false positive risk reaches all the way to 70%. Thus, it remains unclear how high the false positive risk in social psychology is.

Results from a bigger coding project help to narrow down the uncertainty about the actual EDR in social psychology. This project coded at least 20 focal hypothesis tests from the most highly cited articles by eminent social psychologists, where eminence was based on the H-Index (Radosic & Diener, 2021). This project produced 2,208 p-values. The z-curve analysis of these p-values closely replicated the point estimates for Motyl et al.’s (2017) data (EDRs 26% vs. 23%, ERR 49% vs. 45%, FDR 15% vs. 18%). The confidence intervals are narrower and the upper limit of the false positive risk decreased to 47%.


However, combining the two samples did not notably reduce the confidence interval around the false discovery risk, 15%, 95%CI = 10% to 47%. Thus, up to 50% of published results in social psychology could be false positive results. This is an unacceptably high risk of false positive results, but the risk may seem small in comparison to a scenario where a little bit of p-hacking can produce over 60% false positive results.

In sum, empirical analyses of actual data suggest that false positive results are not as prevalent as the FPP article suggested. The main reason for the relatively low false positive risk is not that QRPs are rare. Rather, QRPs also help to inflate success rates when a small true effect exists. If effect sizes were not important, it might seem justifiable to reduce false negative rates with the help of QRPs. However, effect sizes matter, and QRPs inflate effect size estimates by over 100% (Open Science Collaboration, 2015). Thus, p-hacking is a problem even if it does not generate a high rate of false positive results.

Individual Differences in Massive P-Hacking?

Psychologists study different topics and use different methods. Some research areas and some research methods have many true effects and high power to detect them. For example, cognitive psychology appears to have few false positive results and relatively high replication rates (Open Science Collaboration, 2015; Schimmack, 2020). In contrast, between-subject experiments in social psychology are the most likely candidate for massive p-hacking and high rates of false positive results (Schimmack, 2020). Depending on the topics and paradigms they focus on, researchers are more or less likely to require massive p-hacking to produce significant results. To examine this variation across researchers, I combined the data from the 10 eminent social psychologists with the lowest EDR.

The results are disturbing. The EDR of 6% is just one percentage point above the 5% that is expected when only nil-hypotheses are tested and the 95%CI includes 5%. The upper limit reaches only 14%. The corresponding false discovery risk is 76% and the 95%CI includes 100%. Thus, the FPP article may describe the actual practices of some psychologists, but not the practices of psychology in general. It may not be surprising that one of the authors of the FPP article has a low EDR of 16%, even if the analysis is not limited to focal tests (Schimmack, 2021). It is well-known that the consensus bias leads individuals to project themselves onto others. The present results suggest that massive p-hacking of true null-results is the exception rather than the norm in psychology.

The last figure shows the z-curve plot for the 10 social psychologists with the highest EDR. The z-curve looks very different and shows that not all researchers were massive p-hackers. There is still publication bias because the ODR, 91%, matches the upper limit of the 95%CI of the EDR, 66% to 91%, but the effect size is much smaller (91% – 78% = 13%) than for the other extreme group (90% – 6% = 84%). As this comparison is based on extreme groups, a replication study would show a smaller difference due to regression to the mean, but the difference is likely to remain substantial.

In sum, z-curve analysis of actual data can be used to evaluate how prevalent massive p-hacking actually is. The results suggest that only a minority of psychologists consistently used massive p-hacking to produce significant results that have a high risk of being false positive results.

Discussion

The FPP article made an important positive contribution to psychological science. The recommendations motivated some journals and editors to implement policies that discourage the use of QRPs and motivate researchers to preregister their data analysis plans. At the same time, the FPP article also had some negative consequences. The main problem with the article is that it ignored statistical power, false negatives, and effect sizes. That is, the article suggested that the main problem in psychological science is a high risk of false positive results. Instead, the key problem in psychological science remains qualitative thinking in terms of true and false hypotheses that is rooted in the nil-hypothesis ritual that is still being taught to undergraduate and graduate students. Psychological science will only advance by replacing nil-hypothesis testing with quantitative statistics that take effect sizes into account. However, the FPP article succeeded where previous meta-psychologists failed by suggesting that most published results are false positives. It therefore stimulated much needed reforms that decades of methodological criticism failed to deliver.

False False Positive Psychology

The main claim of the FPP article was that many published results in psychology journals could be false positives. Unfortunately, the focus on false positive results has created a lot of confusion and misguided attempts to quantify false positive results in psychology. The problem with false positives is that they are mathematical entities rather than empirically observable phenomena. Based on the logic of nil-hypothesis testing, a false positive result requires an effect size that is exactly zero. Even a significant result with a population effect size of d = 0.0000000001 would count as a true positive result, although it is only possible to produce significant results for this effect size with massive p-hacking.

Thus, it is not very meaningful to worry about false positive results. For example, ego-depletion researchers have seen effect sizes shrink from d = .6 to d = .1 in studies without p-hacking. Proponents of ego-depletion point to the fact that d = .1 is different from 0 and supports their theory. However, the honest effect size invalidates hundreds of studies that claim to have demonstrated the effect for different dependent variables and under different conditions. None of these p-hacked studies are credible, and each one would require a new replication study with over 1,000 participants to see whether the small effect size is really observed for a specific outcome under specific conditions. Whether the effect size is really zero or small is entirely irrelevant.

A study with N = 40 participants and d = .1 has only 6% power to produce a significant result. Thus, there is hardly a difference between a study with a true null-effect (5% power) and a study with a small effect size. Nothing is learned from a significant result in either case, and, as Cohen once said, “God hates studies with 5% power as much as studies with 6% power.”
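The 6% figure can be checked with a standard power calculation, assuming a two-sided, two-sample t-test with 20 participants per group:

```r
# Power of a two-sample t-test with N = 40 (n = 20 per group) and d = .1
power.t.test(n = 20, delta = 0.1, sd = 1, sig.level = .05)$power   # ~ 0.06
```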

To demonstrate that an effect is real, it is important to show that the majority of studies are successful without the use of questionable research practices (Brunner & Schimmack, 2020; Cohen, 1994). Thus, the empirical foundation of a real science requires (a) making true predictions, (b) designing studies that can provide evidence for the prediction, and (c) honest reporting of results. The FPP article illustrated the importance of honest reporting of results. It did not address the other problems that have plagued psychological science since its inception. As Cohen pointed out, results have to be replicable, and to be replicable, studies need high power. Honest reporting alone is insufficient.

P-Hacking versus Publication Bias

In my opinion, the main problem of the FPP article is the implicit claim that publication bias is less important than p-hacking. This attitude has led the authors to claim that bias detection tools are irrelevant because “we already know that researchers do not report 100% of their nonsignificant studies and analyses” (Nelson, Simmons, & Simonsohn, 2018). This argument is invalid for several reasons. Most important, bias detection tools do not distinguish between p-hacking and publication bias. As demonstrated here, they detect p-hacking as well as publication bias. For the integrity of a science it is also not important whether 1 researcher tests 20 dependent variables in one study or 20 researchers test 1 dependent variable in 20 independent studies. As long as results are only reported when they are significant, p-hacking and publication bias introduce bias in the literature and undermine the credibility of a science.

It is also unwarranted to make the strong claim that publication bias is unavoidable. Pre-registration and registered reports are designed to ensure that there is no bias in the reporting of results. Bias-detection methods can be used to verify this assumption. For example, they show no bias in the replication studies of the Open Science Collaboration project.

Third, new developments in bias detection methods do not only test for the presence of bias. As shown here, a comparison of the ODR and EDR provides a quantitative estimate of the amount of bias and small amounts of bias have no practical implications for the credibility of findings.

In conclusion, it makes no sense to draw a sharp line between hiding of dependent variables and hiding of entire studies. All questionable research practices produce bias and massive use of QRPs leads to more bias. Bias-detection methods play an important role in verifying that published results can be trusted.

Reading a Questionable Literature

One overlooked implication of the FPP article is the finding that it is much harder to produce significant results with p-hacking if the significance criterion is lowered from .05 to .01. This provides an easy solution to the problem of how psychologists should interpret findings in psychology journals with dramatically inflated success rates (Sterling, 1959; Sterling et al., 1995). The success rate and the false positive risk can be reduced by adjusting alpha to .01 or even .005 (Benjamin et al., 2017). This way it is possible to build on some published results that produced credible evidence.

Conclusion

The FPP article was an important stepping stone in the evolution of psychology towards becoming a respectable science. It alerted many psychologists who were exploiting questionable research practices to various degrees that these practices were undermining the credibility of psychology as a science. However, one constant in science is that science is always evolving. The other one is that scientists who made a significant contribution think that they reached the end of history. In this article, I showed that meta-psychology has evolved over the past decade since the FPP article appeared. Ten years later, it is clear that massive p-hacking of nil-results is the exception rather than the norm in psychological science. As a result, the false positive risk is lower than feared ten years ago. However, this does not imply that psychological science is credible. The reason is that success rates and effect sizes in psychology journals are severely inflated by the use of questionable research practices. This makes it difficult to trust published results. One overlooked implication of the FPP article is that p-values below .01 are much more trustworthy than p-values below .05 because massive p-hacking mostly produces p-values between .05 and .01. Thus, one recommendation for readers of psychology journals is to ignore results with p-values greater than .01. Finally, bias detection tools like z-curve can be used to assess the credibility of published literatures and to correct for the bias introduced by questionable research practices.

Power and Success: When the R-Index meets the H-Index

A main message of the Lord of the Rings novels is that power is dangerous and corrupts. The main exception is statistical power. High statistical power is desirable because it reduces the risk of false negative results and thereby increases the rate of true discoveries. A high rate of true discoveries is desirable because it reduces the risk that significant results are false positives. For example, a researcher who conducts many studies with low power to produce many significant results, but who thereby also tests many false hypotheses, will have a high rate of false positive discoveries (Finkel, 2018). In contrast, a researcher who invests more resources in any single study will have fewer significant results, but a lower risk of false positives. Another advantage of high power is that true discoveries are more replicable. A true positive that was obtained with 80% power has an 80% chance to produce a successful replication. In contrast, a true discovery that was obtained with 20% power has an 80% chance to end with a failure to replicate, which requires additional replication studies to determine whether the original result was a false positive.
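The claim that the replication probability of a true positive equals the power of the design can be illustrated with a small simulation (the effect size and sample size below are chosen to give roughly 80% power; they are not taken from any particular study):

```r
# With a real effect and the same design, the chance of a significant replication
# equals the study's statistical power.
set.seed(3)
d <- 0.5; n <- 64   # two-sample t-test with ~80% power
p_rep <- replicate(1e4, t.test(rnorm(n, mean = d), rnorm(n))$p.value)
mean(p_rep < .05)   # ~ 0.80
```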

Although most researchers agree that high power is desirable – and specify that they are planning studies with 80% power in their grant proposals – they no longer care about power once the study is completed and a significant result was obtained. The fallacy is to assume that a significant result was obtained because the hypothesis was true and the study had good power. Until recently, there was also no statistical method to estimate researchers’ actual power. The main problem was that questionable research practices inflate post-hoc estimates of statistical power. Selection for significance ensures that post-hoc power is at least 50%. This problem has been solved with selection models that correct for selection for significance, namely p-curve and z-curve. A comparison of these methods with simulation studies shows that p-curve estimates can be dramatically inflated when studies are heterogeneous in power (Brunner & Schimmack, 2020). Z-curve is also the only method that estimates power for all studies that were conducted and not just the subset of studies that produced a significant result. A comparison with actual success rates of replication studies shows that these estimates predict actual replication outcomes (Bartos & Schimmack, 2021).

The ability to estimate researchers’ actual power offers new opportunities for meta-psychologists. One interesting question is how statistical power is related to traditional indicators of scientific success or eminence. There are three possible outcomes.

One possibility is that power could be positively correlated with success, especially for older researchers. The reason is that low power should produce many replication failures for other researchers who are trying to build on the work of this researcher. Faced with replication failures, they are likely to abandon this research, and work on this topic will cease after a while. Accordingly, low-powered studies are unlikely to produce a large body of research. In contrast, high-powered studies replicate, and many other researchers can build on these findings, leading to many citations and a large H-Index.

A second possibility is that there is no relationship between power and success. The reason would be that power is determined by many other factors such as the effect sizes in a research area and the type of design that is used to examine these effects. Some research areas will have robust findings that replicate often. Other areas will have low power, but everybody in this area accepts that studies do not always work. In this scenario, success is determined by other factors that vary within research areas and not by power, which varies mostly across research areas.

Another reason for the lack of a correlation could be a floor effect. In a system that does not value credibility and replicability, researchers who use questionable practices to push articles out might win and the only way to survive is to do bad research (Smaldino & McElreath, 2016).

A third possibility is that power is negatively correlated with success. Although there is no evidence for a negative relationship, concerns have been raised that some researchers are gaming the system by conducting many studies with low power to produce as many significant results as possible. The costs of replication failures are passed on to other researchers that try to build on these findings, whereas the innovator moves on to produce more significant results on new questions.

Given the lack of data and plausible predictions for any type of relationship, it is not possible to make a priori predictions about the relationship. Thus, the theoretical implications can only be examined after we look at the data.

Data

Success was measured with the H-Index in Web of Science. Information about the statistical power of over 300 social/personality psychologists was obtained using z-curve analyses of automatically extracted test statistics (Schimmack, 2021). A sample size of N = 300 provides reasonably tight confidence intervals to evaluate whether there is a substantial relationship between the H-Index and power. I log-transformed the H-Index to compute the correlation with the estimated discovery rate, which corresponds to the average power before selection for significance (Brunner & Schimmack, 2020). The results show a weak positive relationship that is not significantly different from zero, r(N = 304) = .07, 95%CI = -.04 to .18. Thus, the results are most consistent with theories that predict no relationship between success and research practices. Figure 1 shows the scatterplot, and there is no indication that the weak correlation is due to a floor effect. There is considerable variation in the estimated discovery rate across researchers.

One concern could be that the EDR is just a very noisy and unreliable measure of statistical power. To examine this, I split the z-values of each researcher in half, computed separate z-curves, and then computed the split-half correlation and adjusted it to estimate the reliability (alpha) for the full set of z-scores. The reliability of the EDR was alpha = .5. To increase reliability, I used extreme groups for the EDR and excluded values between 25 and 45. However, the correlation with the H-Index did not increase, r = .08, 95%CI = -.08 to .23.
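The step from a split-half correlation to the reliability of the full set of z-scores uses the Spearman-Brown formula; a minimal sketch of the adjustment (the input value is illustrative):

```r
# Spearman-Brown adjustment of a split-half correlation to full-length reliability
sb_alpha <- function(r_half) 2 * r_half / (1 + r_half)
sb_alpha(1 / 3)   # a split-half correlation of ~.33 corresponds to alpha ~ .5
```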

I also correlated the H-Index with the more reliable estimated replication rate (reliability = .9), which is power after selection for significance. This correlation was also not significant, r = .08, 95%CI = -.04 to .19.

In conclusion, we can reject the hypothesis that higher success is related to conducting many small studies with low power and selectively reporting only significant results (r > -.1, p < .05). There may be a small positive correlation (r < .2, p < .05), but a larger sample would be needed to reject the hypothesis that there is no relationship between success and statistical power.

Discussion

Low replication rates and major replication failures of some findings in social psychology created a crisis of confidence. Some articles suggest that most published results are false and were obtained with questionable research practices. The present results suggest that these fears are unfounded and that it would be false to generalize from a few researchers to the whole group of social psychologists.

The present results also suggest that it is not necessary to burn social psychology to the ground. Instead, social psychologists should carefully examine which important findings are credible and replicable and which ones are not. Although this work has begun, it is moving slowly. The present results show that researchers’ success, which is measured in terms of citations by peers, is not tied to the credibility of their findings. Personalized information about power may help to change this in the future.

A famous quote in management is “If You Can’t Measure It, You Can’t Improve It.” This might explain why statistical power remained low despite early warnings about low power (Cohen, 1962; Tversky & Kahneman, 1971). Z-curve analysis is a game changer because it makes it possible to measure power, and with modern computers it is possible to do so quickly and on a large scale. If we agree that power is important and that it can be measured, it is time to improve it. Every researcher can do so, and the present results suggest that increasing power is not a career-ending move. So, I hope this post empowers researchers to invest more resources in high-powered studies.

If Consumer Psychology Wants to be a Science It Has to Behave Like a Science

Consumer psychology is an applied branch of social psychology that uses insights from social psychology to understand consumers’ behaviors. Although there is cross-fertilization and authors may publish in both more basic and more applied journals, it is its own field in psychology with its own journals. As a result, it has escaped the close attention that has been paid to the replicability of studies published in mainstream social psychology journals (see Schimmack, 2020, for a review). However, given the similarity in theories and research practices, it is fair to ask why consumer research should be more replicable and credible than basic social psychology. This question was indirectly addressed in a dialogue about the merits of pre-registration that was published in the Journal of Consumer Psychology (Krishna, 2021).

Open science proponents advocate pre-registration to increase the credibility of published results. The main concern is that researchers can use questionable research practices to produce significant results (John et al., 2012). Preregistration of analysis plans would reduce the chances of using QRPs and increase the chances of a non-significant result. This would make the reporting of significant results more valuable because significance was produced by the data and not by the creativity of the data analyst.

In my opinion, the focus on pre-registration in the dialogue is misguided. As Pham and Oh (2021) point out, pre-registration would not be necessary, if there is no problem that needs to be fixed. Thus, a proper assessment of the replicability and credibility of consumer research should inform discussions about preregistration.

The problem is that the past decade has seen more articles talking about replications than actual replication studies, especially outside of social psychology. Thus, most of the discussion about actual and ideal research practices occurs without facts about the status quo. How often do consumer psychologists use questionable research practices? How many published results are likely to replicate? What is the typical statistical power of studies in consumer psychology? What is the false positive risk?

Rather than writing another meta-psychological article that is based on paranoid or wishful thinking, I would like to add to the discussion by providing some facts about the health of consumer psychology.

Do Consumer Psychologists Use Questionable Research Practices?

John et al. (2012) conducted a survey study to examine the use of questionable research practices. They found that respondents admitted to using these practices and that they did not consider these practices to be wrong. In 2021, however, nobody is defending the use of questionable practices that can inflate the risk of false positive results and hide replication failures. Consumer psychologists could have conducted an internal survey to find out how prevalent these practices are among consumer psychologists. However, Pham and Oh (2021) do not present any evidence about the use of QRPs by consumer psychologists. Instead, they cite a survey among German social psychologists to suggest that QRPs may not be a big problem in consumer psychology. Below, I will show that QRPs are a big problem in consumer psychology and that consumer psychologists have done nothing over the past decade to curb the use of these practices.

Are Studies in Consumer Psychology Adequately Powered?

Concerns about low statistical power go back to the 1960s (Cohen, 1962; Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989; Smaldino & McElreath, 2016). Tversky and Kahneman (1971) refused “to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110). Yet, results from the reproducibility project suggest that social psychologists conduct studies with less than 50% power all the time (Open Science Collaboration, 2015). It is not clear why we should expect higher power from consumer research. More concerning is that Pham and Oh (2021) do not even mention low power as a potential problem for consumer psychology. One advantage of pre-registration is that researchers are forced to think ahead of time about the sample size that is required to have a good chance to show the desired outcome, assuming the theory is right. More than 20 years ago, the APA taskforce on statistical inference recommended a priori power analysis, but researchers continued to conduct underpowered studies. Pre-registration, however, would not be necessary if consumer psychologists already conducted studies with adequate power. Here I show that power in consumer psychology is unacceptably low and has not increased over the past decade.

False Positive Risk

Pham and Oh note that Simmons, Nelson, and Simonsohn’s (2011) influential article relied exclusively on simulations and speculations, and they suggest that the fear of massive p-hacking may be unfounded: “Whereas Simmons et al. (2011) highly influential computer simulations point to massive distortions of test statistics when QRPs are used, recent empirical estimates of the actual impact of self-serving analyses suggest more modest degrees of distortion of reported test statistics in recent consumer studies (see Krefeld-Schwalb & Scheibehenne, 2020).” Here I present empirical analyses to estimate the false discovery risk in consumer psychology.

Data

The data are part of a larger project that examines research practices in psychology over the past decade. For this purpose, my research team and I downloaded all articles from 2010 to 2020 published in 120 psychology journals that cover a broad range of disciplines. Four journals represent research in consumer psychology, namely the Journal of Consumer Behavior, the Journal of Consumer Psychology, the Journal of Consumer Research, and Psychology and Marketing. The articles were converted into text files, and the text files were searched for test statistics. All F, t, and z-tests were used, but most test statistics were F and t tests. There were 2,304 tests for the Journal of Consumer Behavior, 8,940 for the Journal of Consumer Psychology, 10,521 for the Journal of Consumer Research, and 5,913 for Psychology and Marketing.
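The extraction step can be illustrated with a simple regular expression (a minimal sketch; the actual extraction pipeline is more elaborate and also handles F and z tests and various reporting formats):

```r
# Pull t-test statistics out of article text
txt <- "The effect was significant, t(38) = 2.45, p = .019, and F(1, 38) = 6.00, p = .019."
regmatches(txt, gregexpr("t\\(\\s*\\d+\\s*\\)\\s*=\\s*-?\\d+\\.?\\d*", txt))[[1]]
# returns "t(38) = 2.45"
```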

Results

I first conducted z-curve analyses for each journal and year separately. The 40 results were analyzed with year as a continuous and journal as a categorical predictor variable. No time trends were significant, but the main effect of journal on the expected replication rate was significant, F(3,36) = 9.63, p < .001. Inspection of the means showed higher values for the Journal of Consumer Psychology and Psychology & Marketing than for the other two journals. No other effects were significant. Therefore, I combined the data of the Journal of Consumer Psychology and Psychology and Marketing into one set and the data of the Journal of Consumer Behavior and the Journal of Consumer Research into another.

Figure 1 shows the z-curve analysis for the first set of journals. The observed discovery rate (ODR) is simply the percentage of results that are significant. Out of the 14,853 tests, 10,636 were significant, which yields an ODR of 72%. To examine the influence of questionable research practices, the ODR can be compared to the estimated discovery rate (EDR). The EDR is an estimate that is based on a finite mixture model that is fitted to the distribution of the significant test statistics. Figure 1 shows that the fitted grey curve closely matches the observed distribution of test statistics that are all converted into z-scores. Figure 1 also shows the projected distribution that is expected for non-significant results. Contrary to the predicted distribution, observed non-significant results drop off sharply at the level of significance (z = 1.96). This pattern provides visual evidence that non-significant results do not follow a sampling distribution. The EDR is the area under the curve for the significant values relative to the total distribution. The EDR is only 34%. The 95%CI of the EDR can be used to test statistical significance. The ODR of 72% is well outside the 95% confidence interval of the EDR that ranges from 17% to 34%. Thus, there is strong evidence that consumer researchers use QRPs and publish too many significant results.
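The ODR and its sampling uncertainty can be reproduced directly from the reported counts (a sketch using the numbers given above):

```r
# Observed discovery rate: share of reported tests that are statistically significant
sig <- 10636; total <- 14853
sig / total                       # ~ 0.72
binom.test(sig, total)$conf.int   # a very narrow interval, far above the EDR estimate
```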

The EDR can also be used to assess the risk of publishing false positive results; that is, significant results without a true population effect. Using a formula from Soric (1989), we can use the EDR to estimate the maximum percentage of false positive results. As the EDR decreases, the false discovery risk increases. With an EDR of 34%, the FDR is 10%, with a 95% confidence interval ranging from 7% to 26%. Thus, the present results do not suggest that most results in consumer psychology journals are false positives, as some meta-scientists have suggested (Ioannidis, 2005; Simmons et al., 2011).
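Soric's bound can be computed directly from the EDR and alpha; a sketch of the standard formula, FDR_max = (1/EDR − 1) × α/(1 − α):

```r
# Soric (1989): maximum false discovery rate implied by a discovery rate and alpha
soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)
round(soric_fdr(c(.34, .17)), 2)   # .10 for the point estimate, .26 for the lower EDR bound
```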

It is more difficult to assess the replicability of results published in these two journals. On the one hand, z-curve provides an estimate of the expected replication rate (ERR); that is, the probability that a significant result produces a significant result again in an exact replication study (Brunner & Schimmack, 2020). The ERR is higher than the EDR because studies that produced a significant result have higher power than studies that did not produce a significant result. The ERR of 63% suggests that more than 50% of significant results can be successfully replicated. However, a comparison of the ERR with the success rate of actual replication studies showed that the ERR overestimates actual replication rates (Brunner & Schimmack, 2020). There are a number of reasons for this discrepancy. One reason is that replication studies in psychology are never exact replications and that regression to the mean lowers the chances of reproducing the same effect size in a replication study. In social psychology, the EDR is actually a better predictor of the actual success rate. Thus, the present results suggest that actual replication studies in consumer psychology are likely to produce as many replication failures as studies in social psychology have (Schimmack, 2020).

Figure 2 shows the results for the Journal of Consumer Behavior and the Journal of Consumer Research.

The results are even worse. The ODR of 73% is well above the EDR of 26% and falls outside the 95%CI of the EDR. This EDR implies a false discovery risk of 15%.

Conclusion

The present results show that consumer psychology is plagued by the same problems that have produced replication failures in social psychology. Given the similarities between consumer psychology and social psychology, it is not surprising that the two disciplines are alike. Researchers conduct underpowered studies and use QRPs to report inflated success rates. These illusory results cannot be replicated, and it is unclear which statistically significant results reveal effects that have practical significance and which ones are mere false positives. To make matters worse, while social psychologists have responded to awareness of these problems by increasing the power of their studies and by implementing changes in their research practices, z-curve analyses of consumer psychology show no improvement in research practices over the past decade. In light of this disappointing trend, it is disconcerting to read an article that suggests improvements in consumer psychology are not needed and that everything is well (Pham & Oh, 2021). I have demonstrated with hard data and objective analyses that this assessment is false. It is time for consumer psychologists to face reality and to follow in the footsteps of social psychologists to increase the credibility of their science. While preregistration may be optional, increasing power is not.

Guest Post by Peter Holtz: From Experimenter Bias Effects To the Open Science Movement

This post was first shared as a post in the Facebook Psychological Methods Discussion Group. (Group, Post). I thought it was interesting and deserved a wider audience.

Peter Holtz

I know that this is too long for this group, but I don’t have a blog …

A historical anecdote:

In 1963, Rosenthal and Fode published a famous paper on the Experimenter Bias Effect (EBE): There were of course several different experiments and conditions etc., but for example, research assistants were given a set of 20 photos of people that were to be rated by participants on a scale from -10 ([will experience …] “extreme failure”) to + 10 (…“extreme success”).

The research assistants (e.g., participants in a class on experimental psychology) were told to replicate a “well-established” psychological finding just like “students in physics labs are expected to do” (p. 494). On average, the sets of photos had been rated in a large pre-study as neutral (M=0), but some research assistants were told that the expected mean of their photos was -5, whereas others were told that it was +5. When the research assistants, who were not allowed to communicate with each other during the experiments, handed in the results of their studies, their findings were biased in the direction of the effect that they had expected. Funnily enough, similar biases could be found for experiments with rats in Skinner boxes as well (Rosenthal & Fode, 1963b).

The findings on the EBE were met with skepticism from other psychologists since they cast doubt on experimental psychology’s self-concept as a true and unbiased natural science. And what do researchers do since the days of Socrates if they doubt the findings of a colleague? Sure, they attempt to replicate them. Whereas Rosenthal and colleagues (by and large) produced several successful “conceptual replications” in slightly different contexts (for a summary see e.g. Rosenthal, 1966), others (most notably T. X. Barber) couldn’t replicate Rosenthal and Fode’s original study (e.g., Barber et al., 1969; Barber & Silver, 1968, but also Jacob, 1968; Wessler & Strauss, 1968).

Rosenthal, a well-versed statistician, responded (e.g., Rosenthal, 1969) that the difference between significant and non-significant may not itself be significant and used several techniques that about ten years later came to be known as “meta-analysis” to argue that although Barber’s and others’ replications, which of course used other groups of participants and materials etc., most often did not yield significant results, a summary of results suggests that there may still be an EBE (1968; albeit probably smaller than in Rosenthal and Fode’s initial studies – let me think… how can we explain that…).

Of course, Barber and friends responded to Rosenthal’s responses (e.g., Barber, 1969 titled “invalid arguments, post-mortem analyses, and the experimenter bias effect”) and vice versa and a serious discussion of psychology’s methodology emerged. Other notables weighed in as well and frequently statisticians such as Rozeboom (1960) and Bakan (1966) were quoted who had by then already done their best to explain to their colleagues the problems of the p-ritual that psychologists use(d) as a verification procedure. (On a side note: To me, Bakan’s 1966 paper is better than much of the recent work on the problems with the p-ritual; in particular the paragraph on the problematic assumption of an “automacity of inference” on p. 430 is still worth reading).

Lykken (1968) and Meehl (1967) soon joined the melee and attacked the p-ritual also from an epistemological perspective. In 1969, Levy wrote an interesting piece about the value of replications in which he argued that replicating the EBE-studies doesn’t make much sense as long as there are no attempts to embed the EBE into a wider explanatory theory that allows for deducing other falsifiable hypotheses as well. Levy knew very well already by 1969 that the question whether some effect “exists” or “does not exist” is only in very rare cases relevant (namely, when there are strong reasons to assume that an effect does not exist – as is the case, for example, with para-psychological phenomena).

Eventually Rosenthal himself (e.g., 1968a) came to think critically of the “reassuring nature of the null hypothesis decision procedure”. What happened then? At some point Rosenthal moved away from experimenter expectancy effects in the lab to Pygmalion effects in the classroom (1968b) – an idea that is much less likely to provoke criticism and replication attempts: Who doesn’t believe that teachers’ stereotypes influence the way they treat children and consequently the children’s chances to succeed in school? The controversy fizzled out and if you take up a social psychology textbook, you may find the comforting story in it that this crisis was finally “overcome” (Stroebe, Hewstone, & Jonas, 2013, p. 18) by enlarging psychology’s methodological arsenal, for example, with meta-analytic practices and by becoming a stronger and better science with a more rigid methodology etc. Hooray!

So psychology was finally great again from the 1970s on … was it? What can we learn from this episode?

– It is not the case that psychologists didn’t know the replication game, but they only played it whenever results went against their beliefs – and that was rarely the case (exceptions are, apart from Rosenthal’s studies, of course Bem’s “feeling the future” experiments).

– Science is self-correcting – but only whenever there are controversies (and not if subcommunities just happily produce evidence in favor of their pet theories).

– Everybody who wanted to know it could know by the 1960s that something is wrong with the p-ritual – but no one cared. This was the game that needed to be played to produce evidence in favor of theories and to get published and to make a career; consequently, people learned to play the verification game more and more effectively. (Bakan writes on p. 423: “What will be said in this paper is hardly original. It is, in a certain sense, what “everybody knows.” To say it “out loud” is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear.” – in 1966!)

– Just making it more difficult to verify a theory will not solve the problem imo; ambitious psychologists will again find ways to play the game – and to win.

– I see two risks with the changes that have been proposed by the “open science community” (in particular preregistration): First, I am afraid that since the verification game still dominates in psychology, researchers will simply shift towards “proving” more boring hypotheses; second, there is the risk that psychological theories will be shielded even more from criticism since only criticism based on “good science” (preregistered experiments with a priori power analysis and open data) will be valid, whereas criticism based on other types of research activities (e.g., simulations, case studies … or just rational thinking for a change) will be dismissed as “unscientific” => no criticism => no controversy => no improvement => no progress.

– And of course, pre-registration and open science etc. allow psychologists to still maintain the misguided, unfortunate, and highly destructive myth of the “automacity of inferences”; no inductive mechanism whatsoever can ensure “true discovery”.

– I think what is needed more is a discussion about the relationship between data and theory and about epistemological questions such as what a “growth of knowledge” in science could look like and how it can be facilitated (I call this a “falsificationist turn”).

– Irrespective of what is going to happen, authors of textbooks will find ways to write up the history of psychology as a flawless cumulative success story …

A Z-Curve Analysis of a Self-Replication: Shah et al. (2012) Science

Since 2011, psychologists have been wondering which published results are credible and which results are not. One way to answer this question would be for researchers to self-replicate their most important findings. However, most psychologists have avoided conducting or publishing self-replications (Schimmack, 2020).

It is therefore always interesting when a self-replication is published. I just came across Shah, Mullainathan, and Shafir (2019). The authors conducted high-powered replications (with much larger sample sizes) of five studies that were published in Shah, Mullainathan, and Shafir’s (2012) Science article.

The article reported five studies with 1, 6, 2, 3, and 1 focal hypothesis tests. One additional test was significant, but the authors focussed on the small effect size and considered it not theoretically important. The replication studies successfully replicated 9 of the 13 significant results, a success rate of 69%. This is higher than the success rate in the famous reproducibility project of 100 studies in social and cognitive psychology, which was 37% (OSC, 2015).

One interesting question is whether this success rate was predictable based on the original findings. An even more interesting question is whether original results provide clues about the replicability of specific effects. For example, why were the results of Studies 1 and 5 harder to replicate than those of the other studies?

Z-curve relies on the strength of the evidence against the null-hypothesis in the original studies to predict replication outcomes (Brunner & Schimmack, 2020; Bartos & Schimmack, 2020). It also takes into account that original results may be selected for significance. For example, the original article reported 14 out of 14 significant results. It is unlikely that all statistical tests of critical hypotheses produce significant results (Schimmack, 2012). Thus, some questionable practices were probably used although the authors do not mention this in their self-replication article.

I converted the 13 test statistics into exact p-values and converted the exact p-values into z-scores. Figure 1 shows the z-curve plot and the results of the z-curve analysis. The first finding is that the observed discovery rate of 100% is much higher than the expected discovery rate of 15%. Given the small number of tests, the 95% CI around the estimated discovery rate is wide, but it does not include 100%. This suggests that some questionable practices were used to produce a pretty picture of results, in line with the widespread use of QRPs in psychology in 2012.
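The conversion step itself is simple. Below is a minimal sketch in Python (using scipy); the t- and F-values in it are made-up placeholders for illustration, not the statistics reported by Shah et al. (2012).

```python
from scipy import stats

# Turn reported test statistics into exact two-sided p-values and then into
# absolute z-scores, the input format for a z-curve analysis.
# The values below are hypothetical, chosen only to illustrate the conversion.
tests = [("t", 2.10, 45), ("t", 3.40, 120), ("F", 8.50, (1, 60))]

for kind, value, df in tests:
    if kind == "t":
        p = 2 * stats.t.sf(abs(value), df)   # exact two-sided p-value from a t-test
    else:
        p = stats.f.sf(value, *df)           # p-value from an F(1, df2) test
    z = stats.norm.isf(p / 2)                # absolute z-score with the same p-value
    print(f"{kind} = {value} -> p = {p:.4f}, z = {z:.2f}")
```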

The next finding is that, despite the low estimated discovery rate, the estimated replication rate of 66% is in line with the observed replication rate of 69%. The reason for the difference between the two estimates is that the estimated discovery rate includes the large set of non-significant results that the model predicts. Selection for significance selects studies with higher power that have a higher chance to be significant (Brunner & Schimmack, 2020).

It is unlikely that the authors conducted many additional studies to obtain only significant results. It is more likely that they used a number of other QRPs. Whatever method they used, QRPs make just-significant results questionable. One solution to this problem is to alter the significance criterion post hoc. This can be done gradually. For example, a first adjustment might lower the significance criterion to alpha = .01.

Figure 2 shows the adjusted results. The observed discovery rate decreased to 69%. In addition, the estimated discovery rate increased to 48% because the model no longer needs to accommodate the large number of just-significant results. Thus, the expected and observed discovery rates are much more in line with each other and suggest little need to assume additional QRPs. The estimated replication rate decreased because it now uses the more stringent criterion of alpha = .01; otherwise, it would be even more in line with the observed replication rate.

Thus, a simple explanation for the replication outcomes is that some results were obtained with QRPs that produced just-significant results with p-values between .01 and .05. These results did not replicate, but the other results did.

There was also a strong point-biserial correlation between the original z-scores and the dichotomous replication outcome. When the original p-values were split at .01, they perfectly predicted the replication outcome: p-values greater than .01 did not replicate; those below .01 did.

In conclusion, a single p-value from a single analysis provides little information about replicability, although replicability increases as p-values decrease. However, meta-analyses of p-values with models that take QRPs and selection for significance into account are a promising tool to predict replication outcomes and to distinguish between questionable and solid results in the psychological literature.

Meta-analyses that take QRPs into account can also help to avoid replication studies that merely confirm highly robust results. Four of the z-scores in Shah et al.’s (2019) project were above 4, which makes it very likely that these results replicate. Resources are better spent on findings that have high theoretical importance but weak evidence. Z-curve can help to identify these results because it corrects for the influence of QRPs.

Conflict of Interest statement: Z-curve is my baby.

How Credible is Clinical Psychology?

Don Lynam and the clinical group at Purdue University invited me to give a talk and they generously gave me permission to share it with you.

Talk (the first 4 min. were not recorded, it starts right away with my homage to Jacob Cohen).

The first part of the talk discusses the problems with Fisher’s approach to significance testing and the practice in psychology to publish only significant results. I then discuss Neyman-Pearson’s alternative approach, statistical power, and Cohen’s seminal meta-analysis of power in social/abnormal psychology. I then point out that questionable research practices must have been used to publish 95% significant results with only 50% power.

The second part of the talk discusses Soric’s insight that we can estimate the false discovery risk based on the discovery rate. I discuss the Open Science Collaboration project as one way to estimate the discovery rate (pretty high for within-subject cognitive psychology, terribly low for between-subject social psychology), but point out that it doesn’t tell us about clinical psychology. I then introduce z-curve to estimate the discovery rate based on the distribution of significant p-values (converted into z-scores).
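For readers who want to see the arithmetic, here is a minimal sketch of Soric’s bound as I understand it; the discovery rates plugged in at the end are illustrative values, not estimates from the talk.

```python
def max_false_discovery_risk(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery risk:
    if only a fraction `discovery_rate` of all tests is significant at `alpha`,
    then at most (1/DR - 1) * alpha / (1 - alpha) of the significant results
    can be false positives."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

print(round(max_false_discovery_risk(0.50), 2))  # DR = 50% -> FDR at most ~ .05
print(round(max_false_discovery_risk(0.13), 2))  # DR = 13% -> FDR at most ~ .35
```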

In the empirical part, I show the z-curve for Positive Psychology Interventions that shows massive use of QRPs and a high false discovery risk.

I end with a comparison of the z-curve for the Journal of Abnormal Psychology in 2010 and 2020 that shows no change in research practices over time.

The discussion focussed on changing the way we do research and what research we reward. I argued strongly against the blanket implementation of alpha = .005 and for the adoption of Neyman-Pearson’s approach with pre-registration, which would allow researchers to study small populations (e.g., mental health issues in the African American community) with a higher false-positive risk in order to balance type-I and type-II errors.

A tutorial about effect sizes, power, z-curve analysis, and personalized p-values.

I recorded a meeting with my research assistants who are coding articles to estimate the replicability of psychological research. It is unedited and raw, but you might find it interesting to listen to. Below I give a short description of the topics that were discussed starting from an explanation of effect sizes and ending with a discussion about the choice of a graduate supervisor.

Link to video

The meeting is based on two blog posts that introduce personalized p-values.
1. https://replicationindex.com/2021/01/15/men-are-created-equal-p-values-are-not/
2. https://replicationindex.com/2021/01/19/personalized-p-values/

1. Rant about Fisher’s approach to statistics, which ignores effect sizes.
– look for p < .05, and do a happy dance if you find it, now you can publish.
– still the way statistics is taught to undergraduate students.

2. Explaining statistics starting with effect sizes.
– unstandardized effect size (height difference between men and women in cm)
– unstandardized effect sizes depend on the unit of measurement
– to standardize effect sizes, we divide the raw difference by the standard deviation (Cohen’s d); a minimal sketch follows below
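A minimal sketch of this standardization step, with made-up height values (the numbers are illustrative only, not real survey data):

```python
import numpy as np

# Made-up height samples in cm; only meant to illustrate the standardization step.
men = np.array([178.0, 185.0, 170.0, 182.0, 176.0])
women = np.array([165.0, 172.0, 158.0, 169.0, 164.0])

raw_diff = men.mean() - women.mean()                             # unstandardized effect (cm)
pooled_sd = np.sqrt((men.var(ddof=1) + women.var(ddof=1)) / 2)   # pooled SD (equal group sizes)
cohens_d = raw_diff / pooled_sd                                  # standardized effect size
print(f"raw difference = {raw_diff:.1f} cm, Cohen's d = {cohens_d:.2f}")
```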

3. Why do/did social psychologists run studies with n = 20 per condition?
– limited resources, small subject pool, statistics can be used with n = 20 ~ 30.
– obvious that these sample sizes are too small after Cohen (1962) introduced power analysis to psychologists
– but some argued that low power is ok because it is more efficient to get significant results.

4. Simulation of social psychology: 50% of hypotheses are true, 50% are false; the effect size of true hypotheses is d = .4 and the sample size is n = 20 per condition (see the sketch after this list item).
– Analyzing the simulated results (k = 200 studies) with z-curve.2.0. In this simulation, the true discovery rate is 14%. That is, 14% of the 200 studies produced a significant result.
– Z-curve correctly estimates this discovery rate based on the distribution of the significant p-values, converted into z-scores.
– If only significant results are published, the observed discovery rate is 100%, but the true discovery rate is only 14%.
– Publication bias leads to false confidence in published results.
– Publication is wasteful because we are discarding useful information.
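A minimal sketch of this simulation, assuming two-group t-tests with n = 20 per condition; the random seed and implementation details are my own, not taken from the video.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

k, n, d = 200, 20, 0.4            # 200 studies, n = 20 per condition, true effect d = .4
p_values, has_effect = [], []
for i in range(k):
    effect = d if i < k // 2 else 0.0            # half true, half false hypotheses
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(effect, 1.0, n)
    p_values.append(stats.ttest_ind(treatment, control).pvalue)
    has_effect.append(effect > 0)

p_values, has_effect = np.array(p_values), np.array(has_effect)
significant = p_values < 0.05
print(f"true discovery rate: {significant.mean():.2f}")                        # around .14 in expectation
print(f"false positives among significant results: {(~has_effect[significant]).mean():.2f}")
```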

5. Power analysis.
– Fisher did not have power analysis.
– Neyman and Pearson invented power analysis, but Fisher wrote the textbook for researchers.
– We had 100 years to introduce students to power analysis, but it hasn’t happened.
– Cohen wrote books about power analysis, but he was ignored.
– Cohen suggested we should aim for 80% power (more is not efficient).
– Think a priori about effect size to plan sample sizes.
– Power analysis was ignored because it often implied very large samples.
(very hard to get participants in Germany with small subject pools).
– no change because all p-values were treated as equal: p < .05 = truth.
– Literature reviews and textbooks treat every published significant result as truth. (A minimal a priori power calculation of the kind Cohen advocated is sketched below.)
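To make the a priori logic concrete, here is a sketch using statsmodels; the d = .4 value matches the simulations discussed in this meeting, and the design assumed is a simple two-group comparison.

```python
from statsmodels.stats.power import TTestIndPower

# How many participants per group are needed to detect d = .4 with 80% power
# at alpha = .05 in a two-group between-subjects design?
n_per_group = TTestIndPower().solve_power(effect_size=0.4, power=0.80, alpha=0.05)
print(round(n_per_group))   # roughly 100 per group, i.e. N ~ 200 in total
```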

6. Repeating simulation (50% true hypotheses, effect size d = .4) with 80% power, N = 200.
– much higher discovery rate (58%)
– much more credible evidence
– z-curve makes it possible to distinguish between p-values from research with low or high discovery rate.
– Will this change the way psychologists look at p-values? Maybe, but Cohen and others have tried to change psychology without success. Will z-curve be a game-changer?

7. Personalized p-values
– P-values are being created by scientists.
– Scientists have some control about the type of p-values they publish.
– There are systemic pressures to publish more p-values based on low powered studies.
– But at some point, researchers get tenure.
– nobody can fire you if you stop publishing
– social media allow researchers to publish without censure from peers.
– tenure also means you have a responsibility to do good research.
– Researchers who are listed on the post with personalized p-values all have tenure.
– Some researchers, like David Matsumoto, have a good z-curve.
– Other researchers have way too many just significant results.
– The observed discovery rates of good and bad researchers are about the same.
– Z-curve shows that the significant results were produced very differently and differ in credibility and replicability; this could be a game changer if people care about it.
– My own z-curve doesn’t look so good. 😦
– How can researchers improve their z-curve?
– publish better research now
– distance yourself from bad old research
– So far, few people have distanced themselves from bad old work because there was no incentive to do so.
– Now there is an incentive to do so, because researchers can increase credibility of their good work.
– some people may move up when we add the 2020 data.
– hand-coding of articles will further improve the work.

8. Conclusion and Discussion
– not all p-values are created equal.
– working with undergraduates is easy because they are unbiased.
– once you are in grad school, you have to produce significant results.
– z-curve can help to avoid getting into labs that use questionable practices.
– I was lucky to work in labs that cared about the science.

Nations’ Well-Being and Wealth

Scientists have made a contribution when a phenomenon or a statistic is named after them. Thus, it is fair to say that Easterlin made a contribution to happiness research because researchers who write about income and happiness often mention his 1974 article “Does Economic Growth Improve the Human Lot? Some Empirical Evidence” (Easterlin, 1974).

To be fair, the article examines the relationship between income and happiness from three perspectives: (a) the correlation between income and happiness across individuals within nations, (b) the correlation of average incomes and average happiness across nations, and (c) the correlation between average income and average happiness within nations over time. A fourth perspective, namely the correlation between income and happiness within individuals over time, was not examined because no data were available in 1974.

Even for some of the other questions, the data were limited. Here I want to draw attention to Easterlin’s examination of correlations between nations’ wealth and well-being. He draws heavily on Cantril’s seminal contribution to this topic. Cantril (1965) not only developed a measure that can be used to compare well-being across nations; he also used this measure to compare the well-being of 14 nations (Cuba is not included in Table 1 because I did not have new data).

[Figure: Cantril’s cross-cultural well-being data]

Cantril also correlated the happiness scores with a measure of nations’ wealth. The correlation was r = .5. Cantril also suggested that Cuba and the Dominican Republic were positive and negative outliers, respectively. Excluding these two nations increases the correlation to r = .7.

Easterlin took issue with these results.

“Actually the association between wealth and happiness indicated by Cantril’s international data is not so clear-cut. This is shown by a scatter diagram of the data (Fig. I). The inference about a positive association relies heavily on the observations for India and the United States. [According to Cantril (1965, pp. 130-131), the values for Cuba and the Dominican Republic reflect unusual political circumstances – the immediate aftermath of a successful revolution in Cuba and prolonged political turmoil in the Dominican Republic].

What is perhaps most striking is that the personal happiness ratings for 10 of the 14 countries lie virtually within half a point of the midpoint rating of 5, as is brought out by the broken horizontal lines in the diagram. While a difference of rating of only 0.2 is significant at the 0.05 level, nevertheless there is not much evidence, for these 10 countries, of a systematic association between income and happiness. The closeness of the happiness ratings implies also that a similar lack of association would be found between happiness and other economic magnitudes such as income inequality or the rate of change of income.

Nearly 50 years later, it is possible to revisit Easterlin’s challenge to Cantril’s claim that nations’ well-being is tied to their wealth with much better data from the Gallup World Poll. The Gallup World Poll used the same measure of well-being. However, it also provides a better measure of citizens’ wealth by asking directly about income; in contrast, GDP can be distorted and may not reflect the spending power of the average citizen very well. The data on well-being (World Happiness Report, 2020) and median per capita income (Gallup) are publicly available. All I needed to do was to compute the correlation and make a pretty graph.
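The computation itself takes only a few lines. Here is a sketch; the file and column names are placeholders for a merged table of World Happiness Report ladder scores and Gallup median income estimates, not the actual files.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical merged file with one row per nation; column names are placeholders.
df = pd.read_csv("nations_income_wellbeing.csv")
ladder = df["ladder_score"]       # average Cantril ladder score (World Happiness Report, 2020)
income = df["median_income"]      # Gallup median per-capita income

pearson, _ = stats.pearsonr(income, ladder)
spearman, _ = stats.spearmanr(income, ladder)
log_pearson, _ = stats.pearsonr(np.log(income), ladder)
print(f"Pearson r = {pearson:.2f}, rank r = {spearman:.2f}, log-income r = {log_pearson:.2f}")
```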

The Pearson correlation between income and the ladder scores is r(126) = .75, the rank correlation is r(126) = .80, and the Pearson correlation between the log of income and the ladder scores is r(126) = .85. These results strongly support Cantril’s prediction based on his interpretation of the first cross-national study in the 1960s and refute Easterlin’s challenge that this correlation is merely driven by two outliers. Other researchers who analyzed the Gallup World Poll data also reported correlations of r = .8 and showed high stability of nations’ wealth and income over time (Zyphur et al., 2020).

Figure 2 also shows that Easterlin underestimated the range of well-being scores. Even ignoring additional factors like wars, income alone can move well-being from about 4 in one of the poorest countries in the world (Burundi) to close to 8 in one of the richest countries in the world (Norway). The figure also provides no evidence that Scandinavian countries have a happiness secret: the main reason for their high average well-being appears to be that median personal incomes are very high.

The main conclusion is that social scientists are often biased for a number of reasons. The bias is evident in Easterlin’s interpretation of Cantril’s data. The same anti-materialistic bias can be found in many other articles on this topic that claim that the benefits of wealth are limited.

To be clear, a log-function implies that the same absolute amount of money buys more well-being in poor countries than in rich ones (a doubling of income buys roughly the same increment in well-being everywhere), but the graph shows no evidence that the benefits of wealth level off. It is also true that the relationship between GDP and happiness over time is more complicated. However, regarding cross-national differences the results are clear: there is a very strong relationship between wealth and well-being. Studies that do not control for this relationship may report spurious relationships that disappear when income is included as a predictor.

Furthermore, the focus on happiness ignores that wealth also buys longer lives. Thus, individuals in richer nations not only have happier lives, they also have more happy life years. The current Covid-19 pandemic further increases these inequalities.

In conclusion, one concern about subjective measures of well-being has been that individuals in poor countries may be happy with less and that happiness measures fail to reflect human suffering. This is not the case. Sustainable, global economic growth that raises per capita wealth remains a challenge to improve human well-being.

Jens Forster and the Credibility Crisis in Social Psychology

  • Please help to improve this post. If you have conducted successful or unsuccessful replication studies of work done by Jens Forster, please share this information with me and I will add it to this blog post.

Jens Forster was a social psychologist from Germany. He was a rising star and on his way to receiving a prestigious 5 million Euro award from the Alexander von Humboldt Foundation (Retraction Watch, 2015). Then an anonymous whistleblower accused him of scientific misconduct. Under pressure, Forster returned the award without admitting to any wrongdoing.

He was also in the process of moving from the University of Amsterdam to the University of Bochum. After a lengthy investigation, Forster was denied tenure and he is no longer working in academia (Science, 2016), even though an investigation by the German association of psychologists (DGP) did not conclude that he had committed fraud.

While the personal consequences for Forster are similar to those for Stapel, who admitted to fraud and left his tenured position, the effects on the scientific record are different. Stapel retracted over 50 articles, which are no longer being cited at high rates. In contrast, Forster retracted only a few papers, and most of his articles are not flagged to readers as potentially fraudulent. We can see the differences in the citation counts for Stapel and Forster.

Stapel’s Citation Counts

Stapel’s citation counts peaked at 350 and are now down to 150 citations a year. Some of these citations are with co-authors and from papers that have been cleared as credible.

Jens Forster Citations

Citation counts for Forster peaked at 450. They also decreased, by 200 citations, to 250 citations a year, but there was also an uptick of 100 citations in 2019. The question is whether this muted correction is due to Forster’s denial of wrongdoing or whether the articles that were not retracted actually are more credible.

The difficulty in proving fraud in social psychology is that social psychologists also used many questionable practices to produce significant results. These questionable practices have the same effect as fraud, but they were not considered unethical or illegal. Thus, there are two reasons why articles that have not been retracted may still lack credible evidence. First, it is difficult to prove fraud when authors do not confess. Second, even if no fraud was committed, the data may lack credible evidence because they were produced with questionable practices that are not considered data fabrication.

For readers of the scientific literature it is irrelevant whether results with low credibility were produced by fraud or by other means. The only question is whether the published results provide credible evidence for the theoretical claims in an article. Fortunately, meta-scientists have made progress over the past decade in answering this question. One method relies on a statistical examination of an author’s published test statistics. Test statistics can be converted into p-values or z-scores so that they have a common metric (e.g., t-values can be compared to F-values). The higher the z-score, the stronger the evidence against the null-hypothesis. High z-scores are also difficult to obtain with questionable practices. Thus, they are either fraudulent or provide real evidence for a hypothesis (i.e., against the null-hypothesis).

I have published z-curve analyses of over 200 social/personality psychologists that show clear evidence of variation in research practices across researchers (Schimmack, 2021). I did not include Stapel or Forster in these analyses because doubts have been raised about their research practices. However, it is interesting to compare Forster’s z-curve plot to the plot of other researchers because it is still unclear whether anomalous statistical patterns in Forster’s articles are due to fraud or the use of questionable research practices.

The distribution of z-scores shows clear evidence that questionable practices were used because the observed discovery rate of 78% is much higher than the estimated discovery rate of 18% and the ODR is outside of the 95% CI of the EDR, 9% to 47%. An EDR of 18% places Forster at rank #181 in the ranking of 213 social psychologists. Thus, even if Forster did not conduct fraud, many of his published results are questionable.

The comparison of Forster with other social psychologists is helpful because humans are prone to overgeneralize from salient examples, which is known as stereotyping. Fraud cases like Stapel and Forster have tainted the image of social psychology and undermined trust in social psychology as a science. The fact that Forster would rank very low in comparison to other social psychologists shows that he is not representative of research practices in social psychology. This does not mean that Stapel and Forster are simply bad apples and extreme outliers: the use of QRPs was widespread, but how much researchers used QRPs varied across researchers. Thus, we need to take an individual-difference perspective and personalize credibility. The average z-curve plot for all social psychologists ignores that some researchers’ practices were much worse and others’ were much better. Thus, I argue against stereotyping social psychologists and in favor of evaluating each social psychologist based on their own merits. As much as all social psychologists acted within a reward structure that nearly rewarded Forster’s practices with a 5 million Euro prize, researchers navigated this reward structure differently. Hopefully, making research practices transparent can change the reward structure so that credibility gets rewarded.

Unconscious Emotions: Mindless Citations of Questionable Evidence

The past decade has seen major replication failures in social psychology. This has led to a method revolution in social psychology. Thanks to technological advances, many social psychologists moved from studies with smallish undergraduate samples to online studies with hundreds of participants. Thus, findings published after 2016 are more credible than those published before 2016.

However, social psychologists have avoided taking a closer look at theories that were built on questionable results. Review articles continue to present these theories and cite old studies as if they provided credible evidence for them, as if the replication crisis never happened.

One influential theory in social psychology is that stimuli can bypass conscious awareness and still influence behavior. This assumption is based on theories of emotions that emerged in the 1980s. In the famous Lazarus-Zajonc debate most social psychologists sided with Zajonc who quipped that “Preferences need no inferences.”

The influence of Zajonc can be seen in hundreds of studies with implicit primes (Bargh et al., 1996; Devine, 1989) and in modern measures of implicit cognition such as the evaluative priming task and the affect misattribution paradigm (AMP; Payne et al., 2005).

Payne and Lundberg (2014) credit a study by Murphy and Zajonc (1993) for the development of the AMP. Interestingly, the AMP was developed because Payne was unable to replicate a key finding from Murphy and Zajonc’s studies.

In these studies, a smiling or frowning face was presented immediately before a target stimulus (e.g., a Chinese character). Participants had to evaluate the target. The key finding was that the faces influenced evaluations of the targets only when the faces were processed without awareness. When participants were aware of the faces, they had no effect. When Payne developed the AMP, he found that preceding stimuli (e.g., faces of African Americans) still influenced evaluations of Chinese characters, even though the faces were presented long enough (75ms) to be clearly visible.

Although research with the AMP has blossomed, there has been little interest in exploring the discrepancy between Murphy and Zajonc’s (1993) findings and Payne’s findings.

Payne and Lundberg (2014)

One possible explanation for the discrepancy is that the Murphy and Zajonc’s (1993) results were obtained with questionable research practices (QRPs, John et al., 2012). Fortunately, it is possible to detect the use of QRPs using forensic statistical tools. Here I use these tools to examine the credibility of Murphy and Zajonc’s claims that subliminal presentations of emotional faces produce implicit priming effects.

Before I examine the small set of studies from this article, it is important to point out that the use of QRPs in this literature is highly probable. This is revealed by examining the broader literature of implicit priming, especially with subliminal stimuli (Schimmack, 2020).

[Figure 1: z-curve plot of the subliminal/implicit priming literature]

Figure 1 shows that published studies rarely report non-significant results, although the distribution of significant results shows low power and a high probability of non-significant results. While the observed discovery rate is 90%, the expected discovery rate is only 13%. This shows that QRPs were used to suppress results that did not show the expected implicit priming effects.

Study 1

Study 1 in Murphy and Zajonc (1993) had 32 participants: 16 with subliminal presentations and 16 with supraliminal presentations. There were 4 within-subject conditions (smiling faces, frowning faces, and two control conditions). The means of the affect ratings were 3.46 for smiling faces, 3.06 for both control conditions, and 2.70 for frowning faces. The perfect ordering of means is a bit suspicious, but even more problematic is that the mean differences between experimental and control conditions were all statistically significant. The t-values (df = 15) are 2.23, 2.31, 2.31, and 2.59. Too many significant contrasts have been the downfall of a German social psychologist before. Here we can only say that Murphy and Zajonc were very lucky that the two control conditions fell smack in the middle of the two experimental conditions. Any deviation in one direction would have increased one comparison but decreased the other, increasing the risk of a non-significant result.
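As a rough back-of-the-envelope check in the spirit of the incredibility index (Schimmack, 2012), one can treat each reported t-value as the noncentrality parameter of its test, compute the implied “observed power”, and multiply the probabilities. This is only a sketch: it assumes the four contrasts are independent, which the shared control conditions violate.

```python
from scipy import stats

# Observed power per contrast, using the reported t-values (df = 15) as
# noncentrality parameters, and the joint probability of 4/4 significant
# results under the (incorrect) independence assumption.
t_values, df = [2.23, 2.31, 2.31, 2.59], 15
t_crit = stats.t.isf(0.025, df)                           # two-sided critical value at alpha = .05
powers = [stats.nct.sf(t_crit, df, t) for t in t_values]  # P(significant | ncp = observed t)

joint = 1.0
for pw in powers:
    joint *= pw
print("observed power per contrast:", [round(pw, 2) for pw in powers])
print(f"probability of 4/4 significant contrasts (if independent): {joint:.2f}")
```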

Study 2

Study 2 was similar, except that the judgment was changed from subjective liking to objective goodness vs. badness judgments.

The means for the two control conditions were again right in the middle, nearly identical to each other, and nearly identical to the means in Study 1 (M = 3.05, 3.06). Given sampling error, it is extremely unlikely that even the same condition would produce such similar means twice. Without reporting the actual t-values, the authors further claim that all four comparisons of experimental and control conditions are significant.

Taken together, these two studies with surprisingly similar t-values and 32 participants provide the only evidence for the claim that stimuli outside of awareness can elicit affective reactions. This weak evidence has garnered nearly 1,000 citations without ever being questioned and without published replication attempts.

Studies 3-5 did not examine affective priming, but Study 6 did. The paradigm was different here. Participants were subliminally presented with a smiling or a frowning face. Then they had to choose between two pictures, the prime and a foil. The foil either had the same facial expression or a different facial expression. Another manipulation was whether the foil had the same or a different gender. This study showed a strong effect of facial expression, t(62) = 6.26, but not of gender.

I liked this design and conducted several conceptual replication studies with emotional pictures (beautiful beaches, dirty toilets). It did not work. Participants were not able to use their affect to pick the right picture from a prime-foil pair. I also manipulated presentation times, and with increasing presentation times participants could pick out the correct picture even when the affect was the same (e.g., prime and foil were both pleasant).

Study 6 also explains why Payne was unable to get priming effects for subliminal stimuli that varied in race or other features.

One possible explanation for the results in Study 6 is that it is extremely difficult to mask facial expressions, especially smiles. I also ran some studies that attempted this, and at least with computer presentation it was impossible to prevent detection of smiling faces.

Thus, we are left with some questionable results in Studies 1 and 2 as the sole evidence that subliminal stimuli can elicit affective reactions that are transferred to other stimuli.

Conclusion

I have tried to get implicit priming effects on affect measures and failed. It was difficult to publish these failures in the early 2000s. I am sure there are many other replication failures (see Figure 1), and Payne et al.’s (2014) account of the development of the AMP implies as much. Social psychology is still in the process of cleaning up the mess that the use of QRPs created. Implicit priming research is a posterchild of the replication crisis, and researchers should stop citing these old articles as if they produced credible evidence.

Emotion researchers may also benefit from revisiting the Lazarus-Zajonc debate. Appraisal theory may not have the sex appeal of unconscious emotions, but it may be a more robust and accurate theory of emotions. Preferences may not always require inferences, but preferences that are based on solid inferences are likely to be a better guide for behavior. Therefore, I prefer Lazarus over Zajonc.