
Replicability Review of 2016

2016 was surely an exciting year for anybody interested in the replicability crisis in psychology. Some of the biggest news stories in 2016 came from attempts by the psychology establishment to downplay the replication crisis in psychological research (Wired Magazine). At the same time, 2016 delivered several new replication failures that provide further ammunition for the critics of established research practices in psychology.

I. The Empire Strikes Back

1. The Open Science Collaboration Reproducibility Project was flawed.

Daniel Gilbert and Tim Wilson published a critique of the Open Science Collaboration in Science. According to Gilbert and Wilson, the project, which replicated 100 original research studies and reported that only 36% could be replicated, was error-riddled. Consequently, the low success rate only reveals the incompetence of the replicators and has no implications for the replicability of original studies published in prestigious psychological journals like Psychological Science. Science Daily suggested that the critique overturned the landmark study.


Nature published a more balanced commentary. In an interview, Gilbert explains that “the number of studies that actually did fail to replicate is about the number you would expect to fail to replicate by chance alone — even if all the original studies had shown true effects.” This quote is rather strange if we really consider the replication studies to be flawed and error-riddled. If the replication studies were bad, we would expect fewer studies to replicate than we would expect based on chance alone. If the success rate of 36% is consistent with the effect of chance alone, the replication studies are just as good as the original studies and the only reason for non-significant results would be chance. Thus, Gilbert’s comment implies that he believes the typical statistical power of a study in psychology is about 36%. Gilbert doesn’t seem to realize that he is inadvertently admitting that published articles report vastly inflated success rates, because 97% of the original studies reported a significant result. To report 97% significant results with an average power of 36%, researchers are either hiding studies that failed to support their favored hypotheses in proverbial file-drawers or they are using questionable research practices to inflate evidence in favor of their hypotheses. Thus, ironically, Gilbert’s comments confirm what critics of the establishment have been saying: the low success rate in the reproducibility project can be explained by selective reporting of evidence that supports authors’ theoretical predictions.
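To see how implausible 97% significant results are when average power is 36%, here is a minimal sketch. It assumes, purely for illustration, 100 independent studies that all have exactly 36% power; real studies vary in power, so this is a simplification and not an analysis of the actual data.

```python
from math import comb

def prob_at_least(k, n, p):
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Illustrative assumptions: 100 studies, each with 36% power.
n_studies = 100
power = 0.36

expected_significant = n_studies * power
p_97_or_more = prob_at_least(97, n_studies, power)

print(f"Expected number of significant results: {expected_significant:.0f}")
print(f"Probability of 97 or more significant results: {p_97_or_more:.2e}")
```

Under these assumptions, about 36 significant results are expected, and the chance of seeing 97 or more without any selection is vanishingly small.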

2. Contextual Sensitivity Explains Replicability Problem in Social Psychology

Jay van Bavel and colleagues made a second attempt to downplay the low replicability of published results in psychology. He even got to write about it in the New York Times.


Van Bavel blames the Open Science Collaboration for overlooking the importance of context. “Our results suggest that many of the studies failed to replicate because it was difficult to recreate, in another time and place, the exact same conditions as those of the original study.” This statement caused a lot of bewilderment. First, the OSC carefully tried to replicate the original studies as closely as possible. At the same time, they were sensitive to the effect of context. For example, if a replication study of an original study in the US was carried out in Germany, stimulus words were translated from English into German because one might expect that native German speakers might not respond the same way to the original English words as native English speakers. However, the switching of languages means that the replication study is not identical to the original study. Maybe the effect can only be obtained with English speakers. And if the study was conducted at Harvard, maybe the effect can only be replicated with Harvard students. And if the study was conducted primarily with female students, it may not replicate with male students.

To provide evidence for his claim, Jay van Bavel obtained subjective ratings of contextual sensitivity. That is, raters guessed how sensitive the outcome of a study is to variations in the context. These ratings were then used to predict the success of the 100 replication studies in the OSC project.

Jay van Bavel proudly summarized the results in the NYT article. “As we predicted, there was a correlation between these context ratings and the studies’ replication success: The findings from topics that were rated higher on contextual sensitivity were less likely to be reproduced. This held true even after we statistically adjusted for methodological factors like sample size of the study and the similarity of the replication attempt. The effects of some studies could not be reproduced, it seems, because the replication studies were not actually studying the same thing.”

The article leaves out a few important details. First, the correlation between contextual sensitivity ratings and replication success was small, r = .20. Thus, even if contextual sensitivity contributed to replication failures, it would only explain replication failures for a small percentage of studies. Second, the authors used several measures of replicability and some of these measures failed to show the predicted relationship. Third, the statement makes the elementary mistake of confusing correlation with causation. The authors merely demonstrated that subjective ratings of contextual sensitivity predicted outcomes of replication studies. They did not show that contextual sensitivity caused replication failures. Most important, Jay van Bavel failed to mention that they also conducted an analysis that controlled for discipline. The Open Science Collaboration had already demonstrated that studies in cognitive psychology are more replicable (50% success rate) than studies in social psychology (an awful 25%). In an analysis that controlled for differences in disciplines, contextual sensitivity was no longer a statistically significant predictor of replication failures. This hidden fact was revealed in a commentary (or should we say correction) by Yoel Inbar. In conclusion, this attempt at propping up the image of social psychology as a respectable science with replicable results turned out to be another embarrassing example of sloppy research methodology.

3. Anti-Terrorism Manifesto by Susan Fiske

Later that year, Susan Fiske, a former president of the Association for Psychological Science (APS), caused a stir by comparing critics of established psychology to terrorists (see Business Insider article). She later withdrew the comparison to terrorists in response to the criticism of her remarks on social media (APS website).


Fiske attempted to defend established psychology by arguing that established psychology is self-correcting and does not require self-appointed social-media vigilantes. She claimed that these criticisms were destructive and damaging to psychology.

“Our field has always encouraged — required, really — peer critiques.”

“To be sure, constructive critics have a role, with their rebuttals and letters-to-the-editor subject to editorial oversight and peer review for tone, substance, and legitimacy.”

“One hopes that all critics aim to improve the field, not harm people. But the fact is that some inappropriate critiques are harming people. They are a far cry from temperate peer-reviewed critiques, which serve science without destroying lives.”

Many critics of established psychology did not share Fiske’s rosy and false description of the way psychology operates. Peer review has been shown to be a woefully unreliable process. Moreover, the key criterion for accepting a paper is that it presents flawless results that seem to support some extraordinary claim (e.g., a 5-minute online manipulation reduces university drop-out rates by 30%), no matter how these results were obtained and whether they can be replicated.

In her commentary, Fiske is silent about the replication crisis and does not reconcile her image of a critical peer-review system with the fact that only 25% of social psychological studies are replicable and some of the most celebrated findings in social psychology (e.g., elderly priming) are now in doubt.

The rise of blogs and Facebook groups that break with the rules of the establishment poses a threat to the APS establishment, whose main goal is lobbying for psychological research funding in Washington. By trying to paint critics of the establishment as terrorists, Fiske tried to dismiss criticism of established psychology without having to engage with the substantive arguments for why psychology is in crisis.

In my opinion, her attempt to do so backfired, and the response to her column showed that the reform movement is gaining momentum and that few young researchers are willing to prop up a system that is more concerned about publishing articles and securing grant money than about making real progress in understanding human behavior.

II. Major Replication Failures in 2016

4. Epic Failure to Replicate Ego-Depletion Effect in a Registered Replication Report

Ego-depletion is a big theory in social psychology and the inventor of the ego-depletion paradigm, Roy Baumeister, is arguably one of the biggest names in contemporary social psychology. In 2010, a meta-analysis seemed to confirm that ego-depletion is a highly robust and replicable phenomenon. However, this meta-analysis failed to take publication bias into account. In 2014, a new meta-analysis revealed massive evidence of publication bias. It also found that there was no statistically reliable evidence for ego-depletion after taking publication bias into account (Slate, Huffington Post).


A team of researchers, including the first author of the supportive meta-analysis from 2010, conducted replication studies, using the same experiment in 24 different labs. Each of these studies alone would have had a low probability of detecting a small ego-depletion effect, but the combined evidence from all 24 labs made it possible to detect an ego-depletion effect even if it were much smaller than published articles suggest. Yet, the project failed to find any evidence for an ego-depletion effect, suggesting that it is much harder to demonstrate ego-depletion effects than one would believe based on over 100 published articles with successful results.

Critics of Baumeister’s research practices (Schimmack) felt vindicated by this stunning failure. However, even proponents of ego-depletion theory (Inzlicht) acknowledged that ego-depletion theory lacks a strong empirical foundation and that it is not clear what 20 years of research on ego-depletion have taught us about human self-control.

Not so, Roy Baumeister.  Like a bank that is too big to fail, Baumeister defended ego-depletion as a robust empirical finding and blamed the replication team for the negative outcome.  Although he was consulted and approved the design of the study, he later argued that the experimental task was unsuitable to induce ego-depletion. It is not hard to see the circularity in Baumeister’s argument.  If a study produces a positive result, the manipulation of ego-depletion was successful. If a study produces a negative result, the experimental manipulation failed. The theory is never being tested because it is taken for granted that the theory is true. The only empirical question is whether an experimental manipulation was successful.

Baumeister also claimed that his own lab has been able to replicate the effect many times, without explaining the strong evidence for publication bias in the ego-depletion literature and the results of a meta-analysis that showed results from his own lab are no different from results from other labs.

A related article by Baumeister in a special issue on the replication crisis in psychology was another highlight in 2016.  In this article, Baumeister introduced the concept of FLAIR.

[Image: Scientist with FLAIR]

Baumeister writes “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants. Over the years the norm crept up to about n = 20. Now it seems set to leap to n = 50 or more.” (JESP, 2016, p. 154). He misses the good old days and suggests that the old system rewarded researchers with flair. “Patience and diligence may be rewarded, but competence may matter less than in the past. Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure. Flair, intuition, and related skills matter much less with n = 50.” (JESP, 2016, p. 156).

This quote explains the low replication rate in social psychology and the failure to replicate ego-depletion effects. It is simply not possible to conduct studies with n = 10 and be successful in most studies because empirical studies in psychology are subject to sampling error. Each study with n = 10 on a new sample of participants will produce dramatically different results because samples of n = 10 are very different from each other. This is a fundamental fact of empirical research that appears to elude one of the most successful empirical social psychologists. So, a researcher with FLAIR may set up a clever experiment with a strong manipulation (e.g., having participants smell chocolate cookies but eat radishes instead) and get a significant result. But this is not a replicable finding. For every study with flair that worked, there are numerous studies that did not work. However, researchers with flair ignore these failed studies, focus on the studies that worked, and then use these studies for publication. It can be shown statistically that they do, as I did with Baumeister’s glucose studies (Schimmack, 2012) and Baumeister’s ego-depletion studies in general (Schimmack, 2016). So, a researcher who gets significant results with small samples (n = 10) surely has FLAIR (False, Ludicrous, And Incredible Results).
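The role of sampling error with n = 10 can be illustrated with a small simulation. This is only a sketch under assumed values: two cells of n = 10, normally distributed scores, and a true standardized effect of d = 0.4 that is chosen here for illustration and is not taken from any of the studies discussed.

```python
import random
import statistics

# Two-tailed critical t-value for alpha = .05 with df = 18 (two cells of n = 10).
T_CRIT_DF18 = 2.101

def simulate_study(n_per_cell, true_d, rng):
    """Run one two-group study; return the observed effect size d and t-value."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_cell)]
    treatment = [rng.gauss(true_d, 1.0) for _ in range(n_per_cell)]
    mean_diff = statistics.fmean(treatment) - statistics.fmean(control)
    pooled_var = (statistics.variance(control) + statistics.variance(treatment)) / 2
    d = mean_diff / pooled_var ** 0.5
    t = d * (n_per_cell / 2) ** 0.5  # equal cell sizes
    return d, t

rng = random.Random(2016)  # fixed seed so the sketch is reproducible
results = [simulate_study(10, 0.4, rng) for _ in range(5000)]

observed_d = [d for d, _ in results]
share_significant = sum(abs(t) > T_CRIT_DF18 for _, t in results) / len(results)

print(f"Smallest and largest observed d: {min(observed_d):.2f}, {max(observed_d):.2f}")
print(f"SD of observed d across studies: {statistics.stdev(observed_d):.2f}")
print(f"Share of significant studies:    {share_significant:.2f}")
```

Even though every simulated study examines the same true effect, the observed effect sizes scatter widely and only a minority of the studies reach significance. Publishing only that minority would produce a literature that looks far more successful than it is.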

Baumeister’s article contained additional insights into the research practices that fueled a highly productive and successful career. For example, he distinguishes boring researchers who report true positive results from interesting researchers who publish interesting false positive results. He argues that science needs both types of researchers. Unfortunately, most people assume that scientists prioritize truth, which is the main reason for subjecting theories to empirical tests. But scientists with FLAIR get positive results even when their interesting ideas are false (Bem, 2011).

Baumeister mentions psychoanalysis as an example of interesting psychology. What could be more interesting than the Freudian idea that every boy goes through a phase where he wants to kill daddy and make love to mommy? Interesting stuff, indeed, but this idea has no scientific basis. In contrast, twin studies suggest that many personality traits, values, and abilities are partially inherited. To reveal this boring fact, it was necessary to recruit large samples of thousands of twins. That is not something a psychologist with FLAIR can handle. “When I ran my own experiments as a graduate student and young professor, I struggled to stay motivated to deliver the same instructions and manipulations through four cells of n=10 each. I do not know how I would have managed to reach n=50. Patient, diligent researchers will gain, relative to others” (Baumeister, JESP, 2016, p. 156). So, we may see the demise of researchers with FLAIR, and diligent and patient researchers who listen to their data may take their place. Now there is something to look forward to in 2017.

[Image: Scientist without FLAIR]

5. No Laughing Matter: Replication Failure of Facial Feedback Paradigm

A second Registered Replication Report (RRR) delivered another blow to the establishment. This project replicated a classic study on the facial-feedback hypothesis. Like other peripheral emotion theories, facial-feedback theories assume that experiences of emotions depend (fully or partially) on bodily feedback. That is, we feel happy because we smile, rather than smiling because we are happy. Numerous studies had examined the contribution of bodily feedback to emotional experience and the evidence was mixed. Moreover, studies that found effects had a major methodological problem. Simply asking participants to smile might make them think happy thoughts, which could elicit positive feelings. In the 1980s, social psychologist Fritz Strack invented a procedure that solved this problem (see Slate article). Participants are deceived into believing that they are testing a procedure for handicapped people to complete a questionnaire by holding a pen in their mouth. Participants who hold the pen with their lips are activating muscles that are activated during sadness. Participants who hold the pen with their teeth activate muscles that are activated during happiness. Thus, randomly assigning participants to one of these two conditions made it possible to manipulate facial muscles without making participants aware of the associated emotion. Strack and colleagues reported two experiments that showed effects of the experimental manipulation. Or did they? It depends on the statistical test being used.


Experiment 1 had three conditions. The control group did the same study without manipulation of the facial muscles. The dependent variable was funniness ratings of cartoons. The mean funniness of cartoons was highest in the smile condition, followed by the control condition, and lowest in the frown condition. However, a commonly used Analysis of Variance would not have produced a significant result. A two-tailed t-test also would not have produced a significant result. However, a linear contrast with a one-tailed t-test produced a just-significant result, t(89) = 1.85, p = .03. So, Fritz Strack was rather lucky to get a significant result. Sampling error could have easily changed the pattern of means slightly, and then even the directional test of the linear contrast would not have been significant. At the same time, sampling error might have worked against the facial feedback hypothesis, so that the real effect is stronger than this study suggests. In this case, we would expect to see stronger evidence in Study 2. However, Study 2 failed to show any effect on funniness ratings of cartoons. “As seen in Table 2, subjects’ evaluations of the cartoons were hardly affected under the different experimental conditions. The ANOVA showed no significant main effects or interactions, all ps > .20” (Strack et al., 1988). However, Study 2 also included amusement ratings, and the amusement ratings once more showed a just-significant result with a one-tailed t-test, t(75) = 1.78, p = .04. The article also provides an explanation for the just-significant result in Study 1, even though Study 1 used funniness ratings of cartoons. When participants are not asked to differentiate between their subjective feelings of amusement and the objective funniness of cartoons, subjective feelings influence ratings of funniness, but given a chance to differentiate between the two, subjective feelings no longer influence funniness ratings.
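The difference between the one-tailed and two-tailed tests can be checked directly from the reported t-values and degrees of freedom. The sketch below uses only the Python standard library and obtains the tail area of the t-distribution by numerical integration (Simpson's rule); the function names are my own and purely illustrative.

```python
from math import exp, lgamma, log, pi

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(t, df, steps=2000):
    """One-tailed p-value P(T >= t) for t >= 0, via Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# t-values and degrees of freedom as reported in the text above.
reported = {
    "Study 1, funniness": (1.85, 89),
    "Study 2, amusement": (1.78, 75),
}

p_values = {}
for label, (t, df) in reported.items():
    one_tailed = t_upper_tail(t, df)
    p_values[label] = (one_tailed, 2 * one_tailed)
    print(f"{label}: one-tailed p = {one_tailed:.3f}, two-tailed p = {2 * one_tailed:.3f}")
```

Both results fall below .05 only with the one-tailed test; the corresponding two-tailed p-values are above .05.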

For 25 years, this article was uncritically cited as evidence for the facial feedback hypothesis, but none of the 17 labs that participated in the RRR was able to produce a significant result. More important, even an analysis with the combined power of all studies failed to detect an effect. Some critics pointed out that this result successfully replicates the finding of the original two studies, which also failed to produce statistically significant results by the conventional standard of a two-tailed test (z > 1.96).

Given the shaky evidence in the original article, it is not clear why Fritz Strack volunteered his study for a replication attempt. However, it is easier to understand his response to the results of the RRR. He does not take the results seriously. He would rather believe his two original, marginally significant studies than the 17 replication studies.

“Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.”  (Slate).

One of the most bizarre statements by Strack can only be interpreted as revealing a shocking lack of understanding of probability theory.

“So when Strack looks at the recent data he sees not a total failure but a set of mixed results. Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed? Maybe there’s a reason why half the labs could not elicit the effect.” (Slate).

This is like a roulette player who after a night of gambling sees 49% wins and 49% losses and ponders why 49% of the attempts produced losses. Strack does not seem to realize that results of individual studies vary simply by chance, just like roulette balls produce different results by chance. Some people find cartoons funnier than others and the mean will depend on the allocation of these individuals to the different groups. This is called sampling error, and this is why we need to do statistical tests in the first place. And apparently it is possible to become a famous social psychologist without understanding the purpose of computing and reporting p-values.

And the full force of defense mechanisms is apparent in the next statement.  “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. (Slate).

No, there were not eight non-replications. There were 17!  We would expect half of the studies to match the direction of the original effect simply due to chance alone.
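A quick calculation shows how unremarkable a 9-to-8 split is. The sketch assumes that, if there is no true effect, each lab's result is equally likely to fall in either direction and that labs are independent; this is a simple sign-test view of the data, not the analysis used in the RRR itself.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_labs = 17

# Chance that exactly 9 labs land in the "right" direction under the null.
p_exactly_9 = binom_pmf(9, n_labs)

# Chance of the most balanced splits possible with 17 labs (9-8 or 8-9).
p_9_8_split = binom_pmf(8, n_labs) + binom_pmf(9, n_labs)

# Two-sided sign test: how surprising are 9 or more labs in one direction?
p_upper = sum(binom_pmf(k, n_labs) for k in range(9, n_labs + 1))
p_sign_test = min(1.0, 2 * p_upper)

print(f"P(exactly 9 of 17 in the predicted direction) = {p_exactly_9:.3f}")
print(f"P(9-8 split in either direction)              = {p_9_8_split:.3f}")
print(f"Two-sided sign test p-value                   = {p_sign_test:.2f}")
```

A 9-to-8 split is the most likely outcome when the true effect is zero, so it provides no hint that half of the labs did something different from the other half.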

But this is not all.  Strack even accused the replication team of “reverse p-hacking.” (Strack, 2016).  The term p-hacking was coined by Simmons et al. (2011) to describe a set of research practices that can be used to produce statistically significant results in the absence of a real effect (fabricating false positives).  Strack turned it around and suggested that the replication team used statistical tricks to make the facial feedback effect disappear.  “Without insinuating the possibility of a reverse p hacking, the current anomaly needs to be further explored.” (p. 930).

However, the statistical anomaly that requires explanation could just be sampling error (Hilgard), and it actually is the wrong statistical pattern to claim reverse p-hacking. Reverse p-hacking implies that some studies did produce a significant result, but statistical tricks were used to report the result as non-significant. This would lead to a restriction in the variability of results across studies, which can be detected with the Test of Insufficient Variance (Schimmack, 2015), but there is no evidence for reverse p-hacking in the RRR.

Fritz Strack also tried to make his case on social media, but there was very little support for his view that 17 failed replication studies can be ignored (PsychMAP thread).


Strack’s desperate attempts to defend his famous original study in the light of a massive replication failure provide further evidence for the inability of the psychology establishment to face the reality that many celebrated discoveries in psychology rest on shaky evidence and a mountain of repressed failed studies.

Meanwhile, the Test of Insufficient Variance provides a simple explanation for the replication failure, namely that the original results were rather unlikely to occur in the first place. Converting the observed t-values into z-scores shows very low variability, Var(z) = 0.003. The probability of observing a variance this small or smaller in a pair of studies is only p = .04. It is just not very likely for such an improbable event to repeat itself.
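The calculation behind these numbers can be sketched as follows. This reflects my reading of the test for the special case of two studies: each t-value is converted into a z-score with the same one-tailed p-value, and the variance of the z-scores is compared against the expected variance of 1 with a left-tailed chi-square test. The helper names are illustrative, and only the Python standard library is used.

```python
from math import erf, exp, lgamma, log, pi, sqrt
from statistics import NormalDist, variance

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# The two original results: t(89) = 1.85 and t(75) = 1.78.
original_tests = [(1.85, 89), (1.78, 75)]

# Convert each t-value into the z-score with the same one-tailed p-value.
z_scores = [NormalDist().inv_cdf(1 - t_upper_tail(t, df)) for t, df in original_tests]

# Without selection, z-scores of independent studies should have a variance of about 1.
var_z = variance(z_scores)

# Left-tailed chi-square test with k - 1 = 1 degree of freedom.
# For df = 1, the chi-square CDF is erf(sqrt(x / 2)).
chi2_stat = var_z * (len(z_scores) - 1)
p_tiva = erf(sqrt(chi2_stat / 2))

print(f"z-scores: {z_scores[0]:.2f}, {z_scores[1]:.2f}")
print(f"Var(z) = {var_z:.3f}, p = {p_tiva:.2f}")
```

The closed-form chi-square CDF used here only holds for one degree of freedom, so a larger set of studies would need a general chi-square function.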

6. Insufficient Power in Power-Posing Research

When you google “power posing” the featured link shows Amy Cuddy giving a TED talk about her research. Not unlike facial feedback, power posing assumes that bodily feedback can have powerful effects.


When you scroll down the page, you might find a link to an article by Gelman and Fung (Slate).

Gelman has been an outspoken critic of social psychology for some time.  This article is no exception. “Some of the most glamorous, popular claims in the field are nothing but tabloid fodder. The weakest work with the boldest claims often attracts the most publicity, helped by promotion from newspapers, television, websites, and best-selling books.”


They point out that a much larger study than the original study failed to replicate the original findings.

“An outside team led by Eva Ranehill attempted to replicate the original Carney, Cuddy, and Yap study using a sample population five times larger than the original group. In a paper published in 2015, the Ranehill team reported that they found no effect.”

They have little doubt that the replication study can be trusted and suggest that the original results were obtained with the help of questionable research practices.

“We know, though, that it is easy for researchers to find statistically significant comparisons even in a single, small, noisy study. Through the mechanism called p-hacking or the garden of forking paths, any specific reported claim typically represents only one of many analyses that could have been performed on a dataset.”

The replication study was published in 2015, so this replication failure does not really belong in a review of 2016. Indeed, the big news in 2016 was that Cuddy’s co-author Carney distanced herself from her contribution to the power posing article. Her public rejection of her own work (New Yorker Magazine) spread like wildfire through social media (Psych Methods FB Group Posts 1, 2, but see 3). Most responses were very positive. Although science is often considered a self-correcting system, individual scientists rarely correct mistakes or retract articles if they discover a mistake after publication. Carney’s statement was seen as breaking with the implicit norm of the establishment to celebrate every published article as an important discovery and to cover up mistakes even in the face of replication failures.


Not surprisingly, Amy Cuddy, a proponent of power posing, defended her claims about power posing. Her response makes many points, but there is one glaring omission. She does not mention the evidence that published results are selected to confirm theoretical claims and she does not comment on the fact that there is no evidence for power posing after correcting for publication bias. The psychology establishment also appears to be more interested in propping up a theory that has created a lot of publicity for psychology than in critically examining the scientific evidence for or against power posing (APS Annual Meeting, 2017, Program, Presidential Symposium).

7. Commitment Priming: Another Failed Registered Replication Report

Many research questions in psychology are difficult to study experimentally. For example, it seems difficult and unethical to study the effect of infidelity on romantic relationships by assigning one group of participants to an infidelity condition and making them engage in non-marital sex. Social psychologists have developed a solution to this problem. Rather than creating real situations, participants are primed to think about infidelity. If these thoughts change their behavior, the results are interpreted as evidence for the effect of real infidelity. Eli Finkel and colleagues used this approach to experimentally test the effect of commitment on forgiveness. To manipulate commitment, participants in the experimental group were given some statements that were supposed to elicit commitment-related thoughts. To make sure that this manipulation worked, participants then completed a commitment measure. In the original article, the experimental manipulation had a strong effect, d = .74, which was highly significant, t(87) = 3.43, p < .001.

Irene Cheung, Lorne Campbell, and Etienne P. LeBel spearheaded an initiative to replicate the experimental effect of commitment priming on forgiveness. Eli Finkel worked closely with the replication team to ensure that the replication study replicated the original study as closely as possible. Yet, the replication studies failed to demonstrate the effectiveness of the commitment manipulation. Even with the much larger sample size, there was no significant effect and the effect size was close to zero. The authors of the replication report were surprised by the failure of the manipulation. “It is unclear why the RRR studies observed no effect of priming on subjective commitment when the original study observed a large effect. Given the straightforward nature of the priming manipulation and the consistency of the RRR results across settings, it seems unlikely that the difference resulted from extreme context sensitivity or from cohort effects (i.e., changes in the population between 2002 and 2015).” (PPS, 2016, p. 761).

The author of the original article, Eli Finkel, also has no explanation for the failure of the experimental manipulation. “Why did the manipulation that successfully influenced commitment in 2002 fail to do so in the RRR? I don’t know.” (PPS, 2016, p. 765). However, Eli Finkel also reports that he made changes to the manipulation in subsequent studies. “The RRR used the first version of a manipulation that has been refined in subsequent work. Although I believe that the original manipulation is reasonable, I no longer use it in my own work. For example, I have become concerned that the “low commitment” prime includes some potentially commitment-enhancing elements (e.g., “What is one trait that your partner will develop as he/she grows older?”). As such, my collaborators and I have replaced the original 5-item primes with refined 3-item primes (Hui, Finkel, Fitzsimons, Kumashiro, & Hofmann, 2014). I have greater confidence in this updated manipulation than in the original 2002 manipulation. Indeed, when I first learned that the 2002 study would be the target of an RRR—and before I understood precisely how the RRR mechanism works—I had assumed that it would use this updated manipulation.” (PPS, 2016, p. 766). Surprisingly, the potential problem with the original manipulation was never brought up during the planning of the replication study (FB discussion group).


Hui et al. (2014) also do not mention any concerns about the original manipulation.  They simply wrote “Adapting procedures from previous research (Finkel et al., 2002), participants in the high commitment prime condition answered three questions designed to activate thoughts regarding dependence and commitment.” (JPSP, 2014, p. 561).  The results of the manipulation check closely replicated the results of the 2002 article. “The analysis of the manipulation check showed that participants in the high commitment prime condition (M = 4.62, SD = 0.34) reported a higher level of relationship commitment than participants in the low commitment prime condition (M = 4.26, SD = 0.62), t(74) = 3.11, p < .01.” (JPSP, 2014, p. 561).  The study also produced a just-significant result for a predicted effect of the manipulation on support for partner’s goals that are incompatible with the relationship, relationship, beta = .23, t(73) = 2.01, p = .05.  These just significant results are rare and often fail to replicate in replication studies (OSC, Science, 2016).

Altogether, the results of yet another registered replication report raise major concerns about the robustness of priming as a reliable method to alter participants’ beliefs and attitudes.  Selective reporting of studies that “worked” has created an illusion that priming is a very effective and reliable method to study social cognitions. However, even social cognition theories suggest that priming effects should be limited to specific situations and should not be strong for judgments that are highly relevant and for which chronically accessible information is readily available.

8. Concluding Remarks

Looking back, 2016 has been a good year for the reform movement in psychology.  High-profile replication failures have shattered the credibility of established psychology.  Attempts by the establishment to discredit critics have backfired. A major problem for the establishment is that they themselves do not know how big the crisis is and which findings are solid.  Consequently, there has been no major initiative by the establishment to mount replication projects that provide positive evidence for some important discoveries in psychology.  Looking forward to 2017, I anticipate no major changes. Several registered replication studies are in the works, and prediction markets anticipate further failures.  For example, a registered replication report of “professor priming” studies is predicted to produce a null-result.


If you are still looking for a New Year’s resolution, you may consider signing on to Brent W. Roberts, Rolf A. Zwaan, and Lorne Campbell’s initiative to improve research practices. You may also want to become a member of the Psychological Methods Discussion Group, where you can find out in real time about major events in the world of psychological science.

Have a wonderful new year.



Z-Curve: Estimating Replicability of Published Results in Psychology (Revision)

Jerry Brunner and I developed two methods to estimate replicability of published results based on test statistics in original studies.  One method, z-curve, is used to provide replicability estimates in my powergraphs.

In September, we submitted a manuscript that describes these methods to Psychological Methods, where it was rejected.

We have now revised the manuscript. The new manuscript contains a detailed discussion of various criteria for replicability with arguments why a significant result in an exact replication study is an important, if not the only, criterion to evaluate the outcome of replication studies.

It also makes a clear distinction between selection for significance in an original study and the file drawer problem in a series of conceptual or exact replication studies. Our method only assumes selection for significance in original studies, but no file drawer or questionable research practices.  This idealistic assumption may explain why our model predicts a much higher success rate in the OSC reproducibility project (66%) than was actually obtained (36%).  As there is ample evidence for file-drawers with non-significant conceptual replication studies, we believe that file-drawers and QRPs contribute to the low success rate in the OSC project. However, we also mention concerns about the quality of some replication studies.

We hope that the revised version is clearer, but fundamentally nothing has changed. Reviewers at Psychological Methods didn’t like our paper, and the editor thought NHST is no longer relevant (see editorial letter and reviews), but nobody challenged our statistical method or the results of our simulation studies that validate the method. It works and it provides an estimate of replicability under very idealistic conditions, which means we can only expect a considerably lower success rate in actual replication studies as long as researchers file-drawer non-significant results.


Diederik A. Stapel: Not Retracted, but Still Incredible

Originally Posted: December 6, 2016
Revised Version with Z-curve plots: January 16, 2021

Diederik A. Stapel represents everything that has gone wrong in experimental social psychology.  Until 2011, he was seen as a successful scientist who made important contributions to the literature on social priming.

In 2011, an investigation into Diederik Stapel’s research practices revealed scientific fraud, which resulted in over 50 retractions (Retraction Watch), including the article on unconscious social comparisons (Retraction Notice).  In a book, Diederik Stapel told his story about his motives and practices, but the book is not detailed enough to explain how particular datasets were fabricated.  All we know is that he used a number of different methods that range from making up datasets to the use of questionable research practices that increase the chance of producing a significant result.  These practices are widely used and are not considered scientific fraud, although the end result is the same. Published results no longer provide credible empirical evidence for the claims made in a published article.

One retracted article is the article “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations” by Stapel and Blanton (2004). The APA retraction notice claims that the data were considered to be fraudulent.

The report by the Noort Committee lists the following problems with the article.

There is no indication or admission that the data were fabricated, which is often the way Stapel’s practices are described. Rather, the problem appears to be that data were collected, but deceptive research practices were used to present results that supported the main hypothesis. It is well known that these practices were common in social psychology and psychology in general. Thus, the only reason this article was retracted and other articles that used QRPs were not retracted was that Stapel declared these data to be fraudulent. It is therefore interesting to examine what these results look like and how they compare to other results that have been published and are not retracted.

A researcher who starts with real data and then uses questionable practices to get significance is likely to use as few dishonest practices as possible because this makes it easier to justify the questionable decisions.  For example, removing 10% of data may seem justified, especially if some rationale for exclusion can be found.  However, removing 60% of data cannot be justified.  The researcher only needs to use these practices until they produce the desired outcome, namely a p-value below .05 (or at least very close to .05).  As further use of questionable practices is unnecessary and harder to justify, the researcher will stop before producing stronger evidence.  This should produce a disproportionate number of p-values that are just significant.

I developed two statistical tests that detect the presence of too many just significant results. One test is the Replicability-Index (R-Index). The other one is the Test of Insufficient Variance (TIVA).  I applied these tests to the focal statistical tests in the 8 studies. The table shows the key finding of each study.  


All results were interpreted as evidence for an effect and the p-value for Study 6 was reported as p = .05. Although the Noort Committee highlights this misreporting of a p-value, it is common practice to report p = .053 as significant and many more articles would have to be retracted if this was not acceptable. However, a real p-value of .053 provides as much or as little evidence against the null-hypothesis as a real p-value of .047.

A much bigger problem that was not noticed by the Noort Committee is that all the p-values are just significant. This is a highly improbable outcome of actual statistical results because sampling error produces high variability in p-values.

TIVA examines whether the observed variance in p-values is significantly lower than we would expect based on sampling error. First, p-values are converted into z-scores.  The variance of z-scores due to sampling error alone is expected to be approximately 1.  However, the observed variance is only Var(z) = 0.032.  A chi-square test shows that this observed variance is unlikely to occur by chance alone,  p = .00035. Thus, there is strong evidence that the results were obtained with questionable research practices (p-hacked).

The last column transforms z-scores into a measure of observed power. Observed power is an estimate of the probability of obtaining a significant result under the assumption that the observed effect size matches the population effect size.  These estimates are influenced by sampling error.  To get a more reliable estimate of the probability of a successful outcome, the R-Index uses the average power across the 8 studies. The average is 54%. It is unlikely that a set of 8 studies with a 54% chance of obtaining a significant result produced significant results in all studies (Schimmack, 2012).  Thus, once more we have evidence that the article reported too many significant results. The R-Index quantifies the inflation of the success rate by subtracting the observed power from the success rate (100% – 54% = 46%). This is close to the maximum discrepancy that is possible because the minimum value for observed power with a significant result is 50% (all power values below 50% imply p-values > .05).
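The improbability of eight out of eight successes can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the eight studies are independent and that each has the same true power equal to the average observed power of 54%, and the function name is made up for this example.

```python
# Hypothetical helper: probability that all k independent studies are
# significant when each one has the same probability of success (power).
def prob_all_significant(power: float, k: int) -> float:
    return power ** k

# Assumption: true power equals the average observed power reported above.
p_all = prob_all_significant(0.54, 8)
print(f"P(8 of 8 significant | power = .54) = {p_all:.4f}")
```

Under these assumptions, a perfect record would be expected in fewer than 1 out of 100 such sets of studies.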

To make matters worse, the probability of obtaining a significant result is inflated when a set of studies contains too many significant results.  To correct for this bias, the R-Index computes the inflation rate.  With 53% probability of success and 100% success rate, the inflation rate is 47%. To correct for inflation, the inflation rate is subtracted from median observed probability, which yields an R-Index of 53% – 47% = 6%.  Based on this value, it is extremely unlikely that a researcher would obtain a significant result, if they would actually replicate the original studies exactly.  

In short, we have positive evidence that the results reported in this article provide no credible evidence for the hypotheses. However, there is no evidence that Stapel simply made up data. Rather, he seems to have used questionable research practices that are considered acceptable until this day. Many articles that are not retracted also used these practices. This might also be true for some of Stapel’s articles that have not been retracted. To examine this, I conducted a z-curve analysis of Stapel’s articles that have not been retracted (data). This analysis relies on automatic extraction of all test-statistics rather than hand-coding of focal hypotheses.

In a z-curve analysis, p-values are converted into z-scores and a model is used to fit the distribution of the significant values (z > 1.96). The key finding is that the program found 1,075 test statistics and 866 were significant. This is an observed discovery rate of 81%. However, there is clear evidence of publication bias because the mode of significant results is right at z = 1.96 and then there is a steep drop when results are not significant (z < 1.96). This is revealed by a comparison of the observed discovery rate and the expected discovery rate (34%). The 95%CI ranges from 15% to 60%. The fact that it does not include the ODR shows that questionable practices were used to inflate the percentage of significant results. So, even some articles that were not retracted used practices that led to the retraction of Stapel and Blanton (2004).
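The descriptive part of this comparison is easy to reproduce. The sketch below does not fit a z-curve model; it only converts a two-sided p-value into a z-score, computes the observed discovery rate from the counts given above, and checks it against the reported confidence interval for the expected discovery rate. The function and variable names are chosen for illustration.

```python
from statistics import NormalDist

def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

# Counts reported in the text for the non-retracted articles.
n_tests = 1075
n_significant = 866
odr = n_significant / n_tests  # observed discovery rate

# Expected discovery rate and its 95% CI, taken from the text
# (these come from the fitted model, which is not reproduced here).
edr = 0.34
edr_ci = (0.15, 0.60)
odr_outside_ci = not (edr_ci[0] <= odr <= edr_ci[1])

print(f"z for p = .05: {p_to_z(0.05):.2f}")
print(f"ODR = {odr:.0%}, EDR = {edr:.0%}, ODR outside EDR CI: {odr_outside_ci}")
```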

However, it is possible that Stapel made up data to match the p-values that other articles report. As most researchers used QRPs, it seemed normal that most p-values were between .05 and .005. So, he might have fabricated data to produce p-values in this range. To test this hypothesis I also conducted a z-curve analysis of retracted articles.

The results look very similar. Thus, it is not clear when Stapel used QRPs with real data and when Stapel made up fraudulent data that look similar to other p-values in the literature. However, in both cases, the data do not conform to distributions that are produced with proper scientific methods.

The main conclusion is that p-values below .05 in published articles are insufficient to claim a discovery. It is therefore necessary to find other ways to distinguish between credible and incredible evidence to support scientific claims about human social behavior.

A sarcastic comment on “Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology” by Harry Reis

“Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology”
Journal of Experimental Social Psychology 66 (2016) 148–152

a.k.a The Swan Song of Social Psychology During the Golden Age

Disclaimer: I wrote this piece because Jamie Pennebaker recommended writing as therapy to deal with trauma.  However, in his defense, he didn’t propose publishing the therapeutic writings.


You might think an article with reproducibility in the title would have something to say about the replicability crisis in social psychology.  However, this article has very little to say about the causes of the replication crisis in social psychology and possible solutions to improve replicability. Instead, it appears to be a perfect example of repressive coping to avoid the traumatic realization that decades of work were fun, yet futile.

1. Introduction

The authors start with a very sensible suggestion. “We propose that the goal of achieving sound scientific insights and useful applications will be better facilitated over the long run by promoting good scientific practice rather than by stressing the need to prevent any and all mistakes.”  (p. 149).  The only questions are how many mistakes we consider tolerable and what the actual error rates are, which we do not know. Rosenthal pointed out it could be 100%, which even the authors might consider to be a little bit too high.

2. Improving research practice

In this chapter, the authors suggest that “if there is anything on which all researchers might agree, it is the call for improving our research practices and techniques.” (p. 149).  If this were the case, we wouldn’t see articles in 2016 that make statistical mistakes that have been known for decades like pooling data from a heterogeneous set of studies or computing difference scores and using one of the variables as a predictor of the difference score.

It is also puzzling to read “the contemporary literature indicates just how central methodological innovation has been to advancing the field” (p. 149), when the key problem of low power has been known since 1962 and there is still no sign of improvement.

The authors also are not exactly in favor of adopting better methods, when these methods might reveal major problems in older studies.  For example, a meta-analysis in 2010 might not have examined publication bias and produced an effect size of more than half a standard deviation, when a new method that controls for publication bias finds that it is impossible to reject the null-hypothesis. No, these new methods are not welcome. “In our view, they will stifle progress and innovation if they are seen primarily through the lens of maladaptive perfectionism; namely as ways of rectifying flaws and shortcomings in prior work.”  (p. 149).  So, what is the solution? Let’s pretend that subliminal priming made people walk slower in 1996, but stopped working in 2011?

This ends the chapter of improving research practice.  Yes, that is the way to deal with a crisis.  When the city is bankrupt, cut back on the Christmas decorations. Problem solved.

3. How to think about replications

Let’s start with a trivial statement that is as meaningless as saying, we would welcome more funding.  “Replications are valuable.” (p. 149).  Let’s also not mention that social psychologists have been the leaders in requesting replication studies. No single-study article shall be published in a social psychology journal. A minimum of three studies with conceptual replications of the key finding are needed to show that the results are robust and always produce significant results with p < .05 (or at least p < .10).  Yes, no other science has cherished replications as much as social psychology.

And eminent social psychologists Crandall and Sherman explain why. “to be a cumulative and self-correcting enterprise, replications, be their results supportive, qualifying, or contradictory, must occur.”  Indeed, but what explains the 95% success rate of published replications in social psychology?  No need for self-correction, if the predictions are always confirmed.

Surprisingly, however, since 2011 a number of replication studies have been published in obscure journals that fail to replicate results.  This has never happened before and raises some concerns. What is going on here?  Why can these researchers not replicate the original results?  The answer is clear. They are doing it wrong.  “We concur with several authors (Crandall and Sherman, Stroebe) that conceptual replications offer the greatest potential to our field…  Much of the current debate, however, is focused narrowly on direct or exact replications.” (p. 149). As philosophers know, you cannot step into the same river twice and so you cannot replicate the same study again.  To get a significant result, you need to do a similar, but not an identical replication study.

Another problem with failed replication studies is that these researchers assume that they are doing an exact replication study, but do not test this assumption. “In this light, Fabrigar’s insistence that researchers take more care to demonstrate psychometric invariance is well-placed” (p. 149).  Once more, the superiority of conceptual replication studies is self-evident. When you do a conceptual replication study, psychometric invariance is guaranteed and does not have to be demonstrated. Just one more reason why conceptual replication studies in social psychology journals produce a 95% success rate, whereas misguided exact replication attempts have failure rates of over 50%.

It is also important to consider the expertise of researchers.  Social psychologists often have demonstrated their expertise by publishing dozens of successful, conceptual replications.  In contrast, failed replications are often produced by novices with no track-record of ever producing a successful study.  These vast differences in previous success rate need to be taken into account in the evaluation of replication studies.  “Errors caused by low expertise or inadvertent changes are often catastrophic, in the sense of causing a study to fail completely, as Stroebe nicely illustrates.”

It would be a shame if psychology started rewarding these replication studies.  Already limited research funds would be diverted to studies that are easy to do, yet too difficult for inexperienced researchers to do correctly, and away from senior researchers who do difficult novel studies that always work and produced groundbreaking new insights into social phenomena during the “golden age” (p. 150) of social psychology.

The authors also point out that failed studies are rarely failed studies. When these studies are properly combined with successful studies in a meta-analysis, the results nearly always show the predicted effect and that it was wrong to doubt original studies simply because replication studies failed to show the effect. “Deeper consideration of the terms “failed” and “underpowered” may reveal just how limited the field is by dichotomous thinking. “Failed” implies that a result at p = .06 is somehow inferior to one at p = .05, a conclusion that scarcely merits disputation.” (p. 150).

In conclusion, we learn nothing from replication studies. They are a waste of time and resources and can only impede further development of social psychology by means of conceptual replication studies that build on the foundations laid during the “golden age” of social psychology.

4. Differential demands of different research topics

Some studies are easier to replicate than others, and replication failures might be “limited to studies that presented methodological challenges (i.e., that had protocols that were considered difficult to carry out) and that provided opportunities for experimenter bias” (p. 150).  It is therefore better, not to replicate difficult studies or to let original authors with a track-record of success conduct conceptual replication studies.

Moreover, some people have argued that the high success rate of original studies is inflated by publication bias (not writing up failed studies) and the use of questionable research practices (run more participants until p < .05).  To ensure that reported successes are real successes, some initiatives call for data sharing, pre-registration of data analysis plans, and a priori power analysis.  Although these may appear to be reasonable suggestions, the authors disagree.  “We worry that reifying any of the various proposals as a “best practice” for research integrity may marginalize researchers and research areas that study phenomena or use methods that have a harder time meeting these requirements.” (p. 150).

They appear to be concerned that researchers who do not preregister data analysis plans or do not share data may be stigmatized. “If not, such principles, no matter how well-intentioned, invite the possibility of discrimination, not only within the field but also by decision-makers who are not privy to these realities.”  (p. 150).

5. Considering broader implications

These are confusing times.  In the old days, the goal of research was clearly defined. Conduct at least three loosely related, successful studies and write them up with a good story.  During these times, it was not acceptable to publish failed studies because that would have endangered the 95% success rate. This made it hard for researchers who did not understand the rules of publishing only significant results. “Recently, a colleague of ours relayed his frustrating experience of submitting a manuscript that included one null-result study among several studies with statistically significant findings. He was met with rejection after rejection, all the while being told that the null finding weakened the results or confused the manuscript” (p. 151).

It is not clear what researchers should be doing now. Should they now report all of their studies, the good, the bad, and the ugly, or should they continue to present only the successful studies?   What if some researchers continue to publish the good old fashioned way that evolved during the golden age of social psychology and others try to publish results more in accordance with what actually happened in their lab?  “There is currently, a disconnect between what is good for scientists and what is good for science” and nobody is going to change while researchers who report only significant results get rewarded with publications in top journals.






There may also be little need to make major changes. “We agree with Crandall and Sherman, and also Stroebe, that social psychology is, like all sciences, a self-correcting enterprise” (p. 151).   And if social psychology is already self-correcting, it does not need new guidelines on how to do research or new replication studies. Rather than instituting new policies, it might be better to make social psychology great again. Rather than publishing means and standard deviations or test statistics that allow data detectives to check results, it might be better to report only whether a result was significant, p < .05, and because 95% of studies are significant and the others are failed studies, we might simply not report any numbers.  False results will be corrected eventually because they will no longer be reported in journals and the old results might have been true even if they fail to replicate today.   The best approach is to fund researchers with a good track record of success and let them publish in the top journals.


Most likely, the replication crisis only exists in the imagination of overly self-critical psychologists. “Social psychologists are often reputed to be among the most severe critics of work within their own discipline” (p. 151).  A healthier attitude is to realize that “we already know a lot; with these practices, we can learn even more” (p. 151).

So, let’s get back to doing research and forget this whole thing that was briefly mentioned in the title called “concerns about reproducibility.”  Who cares that only 25% of social psychology studies from 2008 could be replicated in 2014?  In the meantime, thousands of new discoveries were made and it is time to make more new discoveries. “We should not get so caught up in perfectionistic concerns that they impede the rapid accumulation and dissemination of research findings” (p. 151).

There you have it folks.  Don’t worry about recent failed replications. This is just a normal part of science, especially a science that studies fragile, contextually sensitive phenomena. The results from 2008 do not necessarily replicate in 2014 and the results from 2014 may not replicate in 2018.  What we need is fewer replications. We need permanent research because many effects may disappear the moment they were discovered. This is what makes social psychology so exciting.  If you want to study stable phenomena that replicate decade after decade you might as well become a personality psychologist.





A replicability analysis of “I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning”

Dijksterhuis, A. (2004). I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning. Journal of Personality and Social Psychology, 86(2), 345-355.

DOI: 10.1037/0022-3514.86.2.345

There are a lot of articles with questionable statistical results and it seems pointless to single out particular articles.  However, once in a while, an article catches my attention and I will comment on the statistical results in it.  This is one of these articles….

The format of this review highlights why articles like this passed peer-review and are cited at high frequency as if they provided empirical facts.  The reason is a phenomenon called “verbal overshadowing.”   In work on eye-witness testimony, participants first see the picture of a perpetrator. Before the actual line-up task, they are asked to give a verbal description of the perpetrator.  The verbal description can distort the memory of the actual face and lead to a higher rate of misidentifications.  Something similar happens when researchers read articles. Sometimes they only read abstracts, but even when they read the article, the words can overshadow the actual empirical results. As a result, memory is more strongly influenced by verbal descriptions than by the cold and hard statistical facts.

In the first part, I will present the results of the article verbally without numbers. In the second part, I will present only the numbers.

Part 1:

In the article “I Like Myself but I Don’t Know Why: Enhancing Implicit Self-Esteem by Subliminal Evaluative Conditioning” Ap Dijksterhuis reports the results of six studies (1-4, 5a, 5b).  All studies used a partially or fully subliminal evaluative conditioning task to influence implicit measures of self-esteem. The abstract states: “Participants were repeatedly presented with trials in which the word I was paired with positive trait terms. Relative to control conditions, this procedure enhanced implicit self-esteem.”  Study 1 used preferences for initials to measure implicit self-esteem, and “results confirmed the hypothesis that evaluative conditioning enhanced implicit self-esteem.” (p. 348). Study 2 modified the control condition and showed that “participants in the conditioned self-esteem condition showed higher implicit self-esteem after the treatment than before the treatment, relative to control participants” (p. 348).  Study 3 changed the evaluative conditioning procedure. Now, both the CS and the US (positive trait terms) were presented subliminally for 17 ms.  It also used the Implicit Association Test to measure implicit self-esteem.  The results showed that “difference in response latency between blocks was much more pronounced in the conditioned self-esteem condition, indicating higher self-esteem” (p. 349).  Study 4 also showed that “participants in the conditioned self-esteem condition exhibited higher implicit self-esteem than participants in the control condition” (p. 350).  Study 5a and 5b showed that “individuals whose self-esteem was enhanced seemed to be insensitive to personality feedback, whereas control participants whose self-esteem was not enhanced did show effects of the intelligence feedback.” (p. 352).  The General Discussion section summarizes the results. “In our experiments, implicit self-esteem was enhanced through subliminal evaluative conditioning. Pairing the self-depicting word I with positive trait terms consistently improved implicit self-esteem.” (p. 352).  A final conclusion section points out the potential of this work for enhancing self-esteem. “It is worthwhile to explicitly mention an intriguing aspect of the present work. Implicit self-esteem can be enhanced, at least temporarily, subliminally in about 25 seconds.” (p. 353).


Part 2:

Study Statistic p z OP
1 F(1,76)=5.15 0.026 2.22 0.60
2 F(1,33)=4.32 0.046 2.00 0.52
3 F(1,14)=8.84 0.010 2.57 0.73
4 F(1,79)=7.45 0.008 2.66 0.76
5a F(1,89)=4.91 0.029 2.18 0.59
5b F(1,51)=4.74 0.034 2.12 0.56

All six studies produced statistically significant results. To achieve this outcome two conditions have to be met: (a) the effect exists and (b) sampling error is small to avoid a failed study  (i.e., a non-significant result even though the effect is real).   The probability of obtaining a significant result is called power. The last column shows observed power. Observed power can be used to estimate the actual power of the six studies. Median observed power is 60%.  With 60% power, we would expect that only 60% of the 6 studies (3.6 studies) produce a significant result, but all six studies show a significant result.  The excess of significant results shows that the results in this article present an overly positive picture of the robustness of the effect.  If these six studies were replicated exactly, we would not expect to obtain six significant results again.  Moreover, the inflation of significant results also leads to an inflation of the power estimate. The R-Index corrects for this inflation by subtracting the inflation rate (100% observed success rate – 60% median observed power) from the power estimate.  The R-Index is .60 – .40 = .20.  Results with such a low R-Index often do not replicate in independent replication attempts.
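The observed power column and the R-Index can be recomputed from the z-scores in the table. The following is a minimal sketch rather than the official R-Index code: it assumes a two-sided test with alpha = .05 (critical z = 1.96), and the function and variable names are chosen for this example.

```python
from statistics import NormalDist, median

Z_CRIT = 1.96  # critical value for two-sided alpha = .05 (assumption)

def observed_power(z: float) -> float:
    """Probability of a significant result if the true z equals the observed z."""
    nd = NormalDist()
    return (1 - nd.cdf(Z_CRIT - z)) + nd.cdf(-Z_CRIT - z)

# z-scores of Studies 1, 2, 3, 4, 5a, 5b from the table above.
z_scores = [2.22, 2.00, 2.57, 2.66, 2.18, 2.12]
powers = [observed_power(z) for z in z_scores]

median_power = median(powers)
success_rate = 1.0                       # all six studies were significant
inflation = success_rate - median_power  # excess of reported successes
r_index = median_power - inflation
expected_hits = median_power * len(z_scores)

print([round(p, 2) for p in powers])
print(f"median power = {median_power:.2f}, expected significant = {expected_hits:.1f}")
print(f"R-Index = {r_index:.2f}")
```

With unrounded values the R-Index comes out slightly below the .20 obtained from the rounded 60% figure; the difference is due to rounding only.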

Another method to examine the replicability of these results is to examine the variability of the z-scores (second last column).  Each z-score reflects the strength of evidence against the null-hypothesis. Even if the same study is replicated, this measure will vary as a function of random sampling.  The expected variance is approximately 1 (the variance of a standard normal distribution).  Low variance suggests that future studies will produce more variable results and with p-values close to .05, this means that future studies are expected to produce non-significant results.  This bias test is called the Test of Insufficient Variance (TIVA).  The variance of the z-scores is Var(z) = 0.07.  The probability of such a restricted variance occurring by chance is p = .003 (1/300).
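The TIVA computation can be sketched with the standard library alone. This is an illustration under stated assumptions, not the original implementation: it uses the sample variance of the z-scores, compares (k - 1) times that variance against a chi-square distribution with k - 1 degrees of freedom, and takes the left tail as the p-value. The helper `chi2_cdf` is a hand-written series approximation that is adequate for the small values that arise here.

```python
import math
from statistics import variance

def chi2_cdf(x: float, df: int) -> float:
    """Lower-tail chi-square probability, computed with the power series for
    the regularized lower incomplete gamma function P(df/2, x/2).
    Intended for small to moderate x."""
    a, t = df / 2.0, x / 2.0
    if t <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= t / (a + n)
        total += term
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))

# z-scores of Studies 1, 2, 3, 4, 5a, 5b from the table above.
z_scores = [2.22, 2.00, 2.57, 2.66, 2.18, 2.12]
k = len(z_scores)

var_z = variance(z_scores)     # sample variance (n - 1 in the denominator)
chi2 = (k - 1) * var_z / 1.0   # expected variance under sampling error is 1
p_tiva = chi2_cdf(chi2, k - 1) # left tail: is the variance too small?

print(f"Var(z) = {var_z:.2f}, chi2({k - 1}) = {chi2:.3f}, p = {p_tiva:.4f}")
```

The result agrees with the rounded values reported above.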

Based on these results, the statistical evidence presented in this article is questionable and does not provide support for the conclusion that subliminal evaluative conditioning can enhance implicit self-esteem.  Another problem with this conclusion is that implicit self-esteem measures have low reliability and low convergent validity.  As a result, we would not expect strong and consistent effects of any experimental manipulation on these measures.  Finally, even if a small and reliable effect could be obtained, it remains an open question whether this effect shows an effect on implicit self-esteem or whether the manipulation produces a systematic bias in the measurement of implicit self-esteem.  “It is not yet known how long the effects of this manipulation last. In addition, it is not yet known whether people who could really benefit from enhanced self-esteem (i.e., people with problematically low levels of self-esteem) can benefit from subliminal conditioning techniques.” (p. 353).  Twelve years later, we may wonder whether these results have been replicated in other laboratories and whether these effects last more than a few minutes after the conditioning experiment.

If you like Part I better, feel free to boost your self-esteem here.



Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes-Factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size. We present an alternative Bayesian multi-level approach that avoids the pitfalls of Bayesian Evidence Synthesis, and show that the original set of studies produced at best weak evidence for an effect of social norms on towel reuse.

Peer-Reviews from Psychological Methods

Times are changing. Media are flooded with fake news and journals are filled with fake novel discoveries. The only way to fight bias and fake information is full transparency and openness.
Jerry Brunner and I submitted a paper that examined the validity of z-curve, the method underlying powergraphs, to Psychological Methods.

As soon as we submitted it, we made the manuscript and the code available. Nobody used the opportunity to comment on the manuscript. Now we got the official reviews.

We would like to thank the editor and reviewers for spending time and effort on reading (or at least skimming) our manuscript and writing comments. Normally, this effort would be largely wasted because, like many other authors, we are going to ignore most of their well-meaning comments and suggestions and try to publish the manuscript mostly unchanged somewhere else. As the editor pointed out, we are hopeful that our manuscript will eventually be published because 95% of written manuscripts eventually get published. So, why change anything? However, we think the work of the editor and reviewers deserves some recognition, and some readers of our manuscript may find their comments valuable. Therefore, we are happy to share their comments for readers interested in replicability and our method of estimating replicability from test statistics in original articles.


Dear Dr. Brunner,

I have now received the reviewers’ comments on your manuscript. Based on their analysis and my own evaluation, I can no longer consider this manuscript for publication in Psychological Methods. There are two main reasons that I decided not to accept your submission. The first deals with the value of your statistical estimate of replicability. My first concern is that you define replicability specifically within the context of NHST by focusing on power and p-values. I personally have fewer problems with NHST than many methodologists, but given the fact that the literature is slowly moving away from this paradigm, I don’t think it is wise to promote a method to handle replicability that is unusable for studies that are conducted outside of it. Instead of talking about replicability as estimating the probability of getting a significant result, I think it would be better to define it in more continuous terms, focusing on how similar we can expect future estimates (in terms of effect sizes) to be to those that have been demonstrated in the prior literature. I’m not sure that I see the value of statistics that specifically incorporate the prior sample sizes into their estimates, since, as you say, these have typically been inappropriately low.

Sure, it may tell you the likelihood of getting significant results if you conducted a replication of the average study that has been done in the past. But why would you do that instead of conducting a replication that was more appropriately powered?

Reviewer 2 argues against the focus on original study/replication study distinction, which would be consistent with the idea of estimating the underlying distribution of effects, and from there selecting sample sizes that would produce studies of acceptable power. Reviewer 3 indicates that three of the statistics you discussed are specifically designed for single studies, and are no longer valid when applied to sets of studies, although this reviewer does provide information about how these can be corrected.

The second main reason, discussed by Reviewer 1, is that although your statistics may allow you to account for selection biases introduced by journals not accepting null results, they do not allow you to account for selection effects prior to submission. Although methodologists will often bring up the file drawer problem, it is much less of an issue than people believe. I read about a survey in a meta-analysis text (I unfortunately can’t remember the exact citation) that indicated that over 95% of the studies that get written up eventually get published somewhere. The journal publication bias against non-significant results is really more an issue of where articles get published, rather than if they get published. The real issue is that researchers will typically choose not to write up results that are non-significant, or will suppress non-significant findings when writing up a study with other significant findings. The latter case is even more complicated, because it is often not just a case of including or excluding significant results, but is instead a case where researchers examine the significant findings they have and then choose a narrative that makes best use of them, including non-significant findings when they are part of the story but excluding them when they are irrelevant. The presence of these author-side effects means that your statistic will almost always be overestimating the actual replicability of a literature.

The reviewers bring up a number of additional points that you should consider. Reviewer 1 notes that your discussion of the power of psychological studies is 25 years old, and therefore likely doesn’t apply. Reviewer 2 felt that your choice to represent your formulas and equations using programming code was a mistake, and suggests that you stick to standard mathematical notation when discussing equations. Reviewer 2 also felt that you characterized researcher behaviors in ways that were more negative than is appropriate or realistic, and that you should tone down your criticisms of these behaviors. As a grant-funded researcher, I can personally promise you that a great many researchers are concerned about power, since you cannot receive government funding without presenting detailed power analyses. Reviewer 2 noted a concern with the use of web links in your code, in that this could be used to identify individuals using your syntax. Although I have no suspicions that you are using this to keep track of who is reviewing your paper, you should remove those links to ensure privacy. Reviewer 1 felt that a number of your tables were not necessary, and both reviewers 2 and 3 felt that there were parts of your writing that could be notably condensed. You might consider going through the document to see if you can shorten it while maintaining your general points. Finally, reviewer 3 provides a great many specific comments that I feel would greatly enhance the validity and interpretability of your results. I would suggest that you attend closely to those suggestions before submitting to another journal.

For your guidance, I append the reviewers’ comments below and hope they will be useful to you as you prepare this work for another outlet.

Thank you for giving us the opportunity to consider your submission.

Sincerely, Jamie DeCoster, PhD
Associate Editor
Psychological Methods


Reviewers’ comments:

Reviewer #1:

The goals of this paper are admirable and are stated clearly here: “it is desirable to have an alternative method of estimating replicability that does not require literal replication. We see this method as complementary to actual replication studies.”

However, I am bothered by an assumption of this paper, which is that each study has a power (for example, see the first two paragraphs on page 20). This bothers me for several reasons. First, any given study in psychology will often report many different p-values. Second, there is the issue of p-hacking or forking paths. The p-value, and thus the power, will depend on the researcher’s flexibility in analysis. With enough researcher degrees of freedom, power approaches 100% no matter how small the effect size is. Power in a preregistered replication is a different story. The authors write, “Selection for significance (publication bias) does not change the power values of individual studies.” But to the extent that there is selection done _within_ a study–and this is definitely happening–I don’t think that quoted sentence is correct.

So I can’t really understand the paper as it is currently written, as it’s not clear to me what they are estimating, and I am concerned that they are not accounting for the p-hacking that is standard practice in published studies.

Other comments:

The authors write, “Replication studies ensure that false positives will be promptly discovered when replication studies fail to confirm the original results.” I don’t think “ensure” is quite right, since any replication is itself random. Even if the null is true, there is a 5% chance that a replication will confirm just by chance. Also many studies have multiple outcomes, and if any appears to be confirmed, this can be taken as a success. Also, replications will not just catch false positives, they will also catch cases where the null hypothesis is false but where power is low. Replication may have the _goal_ of catching false positives, but it is not so discriminating.

The Fisher quote, “A properly designed experiment rarely fails to give …significance,” seems very strange to me. What if an experiment is perfectly designed, but the null hypothesis happens to be true? Then it should have a 95% chance of _not_ giving significance.

The authors write, “Actual replication studies are needed because they provide more information than just finding a significant result again. For example, they show that the results can be replicated over time and are not limited to a specific historic, cultural context. They also show that the description of the original study was sufficiently precise to reproduce the study in a way that it successfully replicated the original result.” These statements seem too strong to me. Successful replication is rejection of the null, and this can happen even if the original study was not described precisely, etc.

The authors write, “A common estimate of power is that average power is about 50% (Cohen 1962, Sedlmeier and Gigerenzer 1989). This means that about half of the studies in psychology have less than 50% power.” I think they are confusing the mean with the median here. Also I would guess that 50% power is an overestimate. For one thing, psychology has changed a lot since 1962 or even 1989 so I see no reason to take this 50% guess seriously.

The authors write, “We define replicability as the probability of obtaining the same result in an exact replication study with the same procedure and sample sizes.” I think that by “exact” they mean “pre-registered” but this is not clear. For example, suppose the original study was p-hacked. Then, strictly speaking, an exact replication would also be p-hacked. But I don’t think that’s what the authors mean. Also, it might be necessary to restrict the definition to pre-registered studies with a single test. Otherwise there is the problem that a paper has several tests, and any rejection will be taken as a successful replication.

I recommend that the authors get rid of tables 2-15 and instead think more carefully about what information they would like to convey to the reader here.

Reviewer #2:

This paper is largely unclear, and in the areas where it is clear enough to decipher, it is unwise and unprofessional.

This study’s main claim seems to be: “Thus, statistical estimates of replicability and the outcome of replication studies can be seen as two independent methods that are expected to produce convergent evidence of replicability.” This is incorrect. The approaches are unrelated. Replication of a scientific study is part of the scientific process, trying to find out the truth. The new study is not the judge of the original article, its replicability, or scientific contribution. It is merely another contribution to the scientific literature. The replicator and the original article are equals; one does not have status above the other. And certainly a statistical method applied to the original article has no special status unless the method, data, or theory can be shown to be an improvement on the original article.

They write, “Rather than using traditional notation from Statistics that might make it difficult for non-statisticians to understand our method, we use computer syntax as notation.” This is a disqualifying stance for publication in a serious scholarly journal, and it would be an embarrassment to any journal or author to publish these results. The point of statistical notation is clarity, generality, and cross-discipline understanding. Computer syntax is specific to the language adopted, is not general, and is completely opaque to anyone who uses a different computer language. Yet everyone who understands their methods will have at least seen, and needs to understand, statistical notation. Statistical (i.e., mathematical) notation is the one general language we have that spans the field and different fields. No computer syntax does this. Proofs and other evidence are expressed in statistical notation, not computer syntax in the (now largely unused) S statistical language. Computer syntax, as used in this paper, is also ill-defined in that any quantity defined by a primitive function of the language can change any time, even after publication, if someone changes the function. In fact, the S language, used in this paper, is not equivalent to R, and so the authors are incorrect that R will be more understandable. Not including statistical notation, when the language of the paper is so unclear and self-contradictory, is an especially unfortunate decision. (As it happens I know S and R, but I find the manuscript very difficult to understand without imputing my own views about what the authors are doing. This is unacceptable. It is not even replicable.) If the authors have claims to make, they need to state them in unambiguous mathematical or statistical language and then prove their claims. They do not do any of these things.

It is untrue that “researchers ignore power”. If they do, they will rarely find anything of interest. And they certainly write about it extensively. In my experience, they obsess over power, balancing whether they will find something with the cost of doing the experiment. In fact, this paper misunderstands and misrepresents the concept: Power is not “the long-run probability of obtaining a statistically significant result.” It is the probability that a statistical test will reject a false null hypothesis, as the authors even say explicitly at times. These are very different quantities.

This paper accuses “researchers” of many other misunderstandings. Most of these are theoretically incorrect or empirically incorrect. One point of the paper seems to be “In short, our goal is to estimate average power of a set of studies with unknown population effect sizes that can assume any value, including zero.” But I don’t see why we need to know this quantity or how the authors’ methods contribute to us knowing it. The authors make many statistical claims without statistical proofs, without any clear definition of what their claims are, and without empirical evidence. They use simulation that inquires about a vanishingly small portion of the sample space to substitute for an infinite domain of continuous parameter values; they need mathematical proofs but do not even state their claims in clear ways that are amenable to proof.

No coherent definition is given of the quantity of interest. “Effect size” is not generic and hypothesis tests are not invariant to the definition, even if it is true that they are monotone transformations of each other. One effect size can be “significant” and a transformation of the effect size can be “not significant” even if calculated from the same data. This alone invalidates the authors’ central claims.

The first 11.5 pages of this paper should be summarized in one paragraph. The rest does not seem to contribute anything novel. Much of it is incorrect as well. Better to delete throat clearing and get on with the point of the paper.

I’d also like to point out that the authors have hard-coded URL links to their own web site in the replication code. The code cannot be run without making a call to the author’s web site, and recording the reviewer’s IP address in the authors’ web logs. Because this enables the authors to track who is reviewing the manuscript, it is highly inappropriate. It also makes it impossible to replicate the authors results. Many journals (and all federal grants) have prohibitions on this behavior.

I haven’t checked whether Psychological Methods has this rule, but the authors should know better regardless.

Reviewer #3:

Review of “How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies”

It was my pleasure to review this manuscript. The authors compare four methods of estimating replicability. One undeniable strength of the general approach is that these measures of replicability can be computed before or without actually replicating the study/studies. As such, one can see the replicability measure of a set of statistically significant findings as an index of trust in these findings, in the sense that the measure provides an estimate of the percentage of these studies that is expected to be statistically significant when replicating them under the same conditions and same sample size (assuming the replication study and the original study assess the same true effect). As such, I see value in this approach. However, I have many comments, major and minor, which will enable the authors to improve their manuscript.

Major comments

1. Properties of index.

What I miss, and what would certainly be appreciated by the reader, is a description of properties of the replicability index. This would include that it has a minimum value equal to 0.05 (or more generally, alpha), when the set of statistically significant studies has no evidential value. Its maximum value equals 1, when the power of studies included in the set was very large. A value of .8 corresponds to the situation where statistical power of the original situation was .8, as often recommended. Finally, I would add that both sample size and true effect size affect the replicability index; a high value of say .8 can be obtained when true effect size is small in combination with a large sample size (you can consider giving a value of N, here), or with a large true effect size in combination with a small sample size (again, consider giving values).

Consider giving a story like this early, e.g. bottom of page 6.
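
As an aside to this comment, the properties the reviewer lists are easy to verify numerically. The sketch below is only an illustration under stated assumptions: it uses a normal approximation to a two-sided, two-group comparison with a standardized effect size d and equal group sizes, and the function name and example sample sizes are chosen for this illustration rather than taken from the manuscript.

```python
from statistics import NormalDist


def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test (normal approximation).

    d is the standardized true effect size; the noncentrality is d * sqrt(n / 2).
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


# No true effect: power falls to its floor, alpha
print(round(power_two_group(0.0, 50), 3))
# Small effect with a large sample, and large effect with a small sample
print(round(power_two_group(0.2, 394), 2))
print(round(power_two_group(0.8, 25), 2))
```

With a true effect of zero the function returns alpha, and the last two calls both land at roughly .80, which is the reviewer's point that the same replicability value can arise from very different combinations of effect size and sample size.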

2. Too long explanations/text

Perhaps it is a matter of taste, but sometimes I consider explanations much too long. Readers of Psychological Methods may be expected to know some basics. To give you an example, the text on page 7 in “Introduction of Statistical Methods for Power estimation” is very long. I believe its four paragraphs can be summarized into just one; particularly the first one can be summarized in one or two sentences. Similarly, the section on “Statistical Power” can be shortened considerably, imo. Other specific suggestions for shortening the text, I mention below in the “minor comments” section. Later on I’ll provide one major comment on the tables, and how to remove a few of them and how to combine several of them.

3. Wrong application of ML, p-curve, p-uniform

This is THE main comment, imo. The problem is that ML (Hedges, 1984), p-curve, p-uniform, enable the estimation of effect size based on just ONE study. Moreover,  Simonsohn (p-curve) as well as the authors of p-uniform would argue against estimating the average effect size of unrelated studies. These methods are meant to meta-analyze studies on ONE topic.

4. P-uniform and p-curve section, and ML section

This section needs a major revision. First, I would start the section with describing the logic of the method. Only statistically significant results are selected. Conditional on statistical significance, the methods are based on conditional p-values (not just p-values), and then I would provide the formula on top of page 18. Most importantly, these techniques are not constructed for estimating effect size of a bunch of unrelated studies. The methods should be applied to related studies. In your case, to each study individually. See my comments earlier.

Ln(p), which you use in your paper, is not a good idea here for two reasons: (1) it is most sensitive to heterogeneity (which is also put forward by Van Assen et al (2014)), and (2) applied to single studies it estimates effect size such that the conditional p-value equals 1/e, rather than .5 (resulting in less nice properties).

The ML method, as it was described, focuses on estimating effect size using one single study (see Hedges, 1984). So I was very surprised to see it applied differently by the authors. Applying ML in the context of this paper should be the same as p-uniform and p-curve, using exactly the same conditional probability principle. So, the only difference between the three methods is the method of optimization. That is the only difference.

You develop a set-based ML approach, which needs to assume a distribution of true effect size. As said before, I leave it up to you whether you still want to include this method. For now, I have a slight preference to include the set-based approach because it (i) provides a nice reference to your set-based approach, called z-curve, and (ii) using this comparison you can “test” how robust the set-based ML approach is against a violation of the assumption of the distribution of true effect size.

Moreover, I strongly recommend showing how their estimates differ for certain studies, and include this in a table. This allows you to explain the logic of the methods very well. Here a suggestion. I would provide the estimates of four methods (…) for p-values (.04, .025, .01, .001, and perhaps .0001). This will be extremely insightful. For small p-values, the three methods’ estimates will be similar to the traditional estimate. For p-values > .025, the estimate will be negative, for p = .025 the estimate will be (close to) 0. Then, you can also use these same studies and p-values to calculate the power of a replication study (R-index).

I would exclude Figure 1, and the corresponding text. Is not (no longer) necessary.

For the set-based ML approach, if you still include it, please explain how you get to the true value distribution (g(theta)).

5a. The MA set, and test statistics

Many different effect sizes and test statistics exist. Many of them can be transformed to ONE underlying parameter, with a sensible interpretation and certain statistical properties. For instance, the chi2, t, and F(1,df) can all be transformed to d or r, and their SE can be derived. In the RPP project and by Johnson et al (2016) this is called the MA set. Other test statistics, such as F(>1, df) cannot be converted to the same metric, and no SE is defined on that metric. Therefore, the statistics F(>1,df) were excluded from the meta-analyses in the RPP (see the supplementary materials of the RPP) and by Johnson et al (2016) and also Morey and Lakens (2016), who also re-analyzed the data of the RPP.

Fortunately, in your application you do not estimate effect size but only estimate power of a test, which only requires estimating the ncp and not effect size. So, in principle you can include the F(>1,df) statistics in your analyses, which is a definite advantage. Although I can see you can incorporate it for the ML, p-curve, p-uniform approach, I do not directly see how these F(>1,df) statistics can be used for the two set-based methods (ML and z-curve); in the set-based methods, you put all statistics on one dimension (z) using the p-values. How do you defend this?

5b. Z-curve

Some details are not clear to me, yet. How many components (called r in your text) are selected, and why? Your text states: “First, select a ncp parameter m. Then generate Z from a normal distribution with mean m.” I do not understand, since the normal distribution does not have an ncp. Is it that you nonparametrically model the distribution of observed Z, with different components?

Why do you use kernel density estimation? What is its added value? Why make it more imprecise by having this step in between? Please explain.

Except for these details, the procedure and logic of z-curve are clear.

6. Simulations (I): test statistics

I have no reasons, theoretical or empirical, why the analyses would provide different results for Z, t, F(1,df), F(>1,df), chi2. Therefore, I would omit all simulation results of all statistics except 1, and not talk about results of these other statistics. For instance, in the simulations section I would state that results are provided on each of these statistics but present here only the results of t, and of others in supplementary info. When applying the methods to RPP, you apply them to all statistics simultaneously, which you could mention in the text (see also comment 4 above).

7. mean or median power (important)

One of my most important points is the assessment of replicability itself. Consider a set of studies for which replicability is calculated, for each study. So, in case of M studies, there are M replicability indices. Which statistics would be most interesting to report, i.e., are most informative? Note that the distribution of power is far from symmetrical, and actually may be bimodal with modes at 0.05 and 1. For that reason alone, I would include in any report of replicability in a field the proportion of R-indices equal to 0.05 (which amounts to the proportion of results with .025 < p < .05) and the proportion of R-indices equal to 1.00 (e.g., using two decimals, i.e. > .995). Moreover, because power values of .8 or more are recommended, I also could include the proportion of studies with power > .8.

We also would need a measure of central tendency. Because the distribution is not symmetric, and may be skewed, I recommend using the median rather than the mean. Another reason to use the median rather than the mean is because the mean does not provide useable information on whether methods are biased or not, in the simulations. For instance, if true effect size = 0, because of sampling error the observed power will equal .05 in exactly 50% of the cases (this is the case for p-uniform; since with probability .5 the p-value will exceed .025) and be larger than .05 in the other 50% of the cases. Hence, the median will be exactly equal to .05, whereas the mean will exceed .05. Similarly, if true effect size is large the mean power will be too small (distribution skewed to the left). To conclude, I strongly recommend including the median in the results of the simulation.

In a report, such as for the RPP later on in the paper, I recommend including (i) p(R=.05), (ii) p(R >= .8), (iii) p(R>= .995), (iv) median(R), (v) sd(R), (vi) distribution R, (vii) mean R. You could also distinguish this for soc psy and cog psy.

8. simulations (II): selection of conditions

I believe it is unnatural to select conditions based on “mean true power” because we are most familiar with effect size and their distribution, and sample sizes and their distribution. I recommend describing these distributions, and then the implied power distribution (surely the median value as well, not, or not only, the mean).

9.  Omitted because it could reveal identity of reviewer

10. Presentation of results

I have comments on what you present, and on how you present the results. First, what you present. For the ML and p-methods, I recommend presenting the distribution of R in each of the conditions (at least for fixed true effect size and fixed N, where results can be derived exactly relatively easily). For the set-based methods, if you focus on average R (which I do not recommend, I recommend median R), then present the RMSE. The absolute median error is minimized when you use the median. So, average-RMSE is a couple, and median-absolute error is a couple.

Now the presentation of results. Results of p-curve/p-uniform/ML are independent of the number of tests, but set-based methods (your ML variant) and z-curve are not.

Here the results I recommend presenting:

Fixed effect size, heterogeneity sample size

**For single-study methods, the probability distribution of R (figure), including mean(R), median(R), p(R=.05), p(R>= .995), sd(R). You could use simulation for approximating this distribution. Figures look like those in Figure 3, to the right.

**Median power, mean/sd as a function of K

**Bias for ML/p-curve/p-uniform amounts to the difference between median of distribution and the actual median, or the difference between the average of the distribution and the actual average. Note that this is different from set-based methods.

**For set-based methods, a table is needed (because of its dependence on k).

Results can be combined in one table (i.e., 2-3, 5-6, etc)

Significance tests comparing methods

I would exclude Table 4, Table 7, Table 10, Table 13. These significance tests do not make much sense. One method is better than another, or not – significance should not be relevant (for a very large number of iterations, a true difference will show up). You could simply describe in the text which method works best.

Heterogeneity in both sample size and effect size

You could provide similar results as for fixed effect size (but not for chi2, or other statistics). I would also use the same values of k as for the fixed effect case. For the fixed effect case you used 15, 25, 50, 100, 250. I can imagine using as values of k for both conditions k = 10, 30, 100, 400, 2,000 (or something).

Including the k = 10 case is important, because set-based methods will have more problems there, and because one paper or a meta-analysis or one author may have published just one or few statistically significant effect sizes. Note, however, that k=2,000 is only realistic when evaluating a large field.

Simulation of complex heterogeneity

Same results as for fixed effect size and heterogeneity in both sample size and effect size. Good to include a condition where the assumption of set-based ML is violated. I do not yet see why a correlation between N and ES may affect the results. Could you explain? For instance, for the ML/p-curve/p-uniform methods, all true effect sizes in combination with N result in a distribution of R for different studies; how this distribution is arrived at, is not relevant, so I do not yet see the importance of this correlation. That is, this correlation should only affect the results through the distribution of R. More reasoning should be provided, here.

Simulation of full heterogeneity

I am ambivalent about this section. If test statistic should not matter, then what is the added value of this section? Other distributions of sample size may be incorporated in previous section “complex heterogeneity”. Other distributions of true effect may also be incorporated in the previous section. Note that Johnson et al (2016) use the RPP data to estimate that 90% of effects in psychology estimate a true zero effect. You assume only 10%.

Conservative bootstrap

Why present only the results of z-curve? By changing the limits of the interval, the interpretation becomes a bit awkward; what kind of interval is it now? Most importantly, coverages of .9973 or .9958 are horrible (in my opinion, these coverages are just as bad as coverages of .20). I prefer results of 95% confidence intervals, and then show their coverages in the table. Your ‘conservative’ CIs are hard to interpret. Note also that this is a paper on statistical properties of the methods, and one property is how well the methods perform w.r.t. 95% CI.

By the way, examining 95% CI of the methods is very valuable.
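
The coverage criterion can be made concrete with a small simulation. The sketch below is not z-curve and does not come from the manuscript; it uses a plain normal-theory interval for a mean, with made-up parameter values, only to show how coverage is computed and why an interval that is widened beyond its nominal level overshoots 95%.

```python
import random
import statistics

def coverage(true_mean=0.3, sd=1.0, n=50, n_sim=2000, z_crit=1.96, seed=1):
    """Share of simulated intervals that contain the true mean.

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        # Count the run as a hit if the interval covers the true value.
        if m - z_crit * se <= true_mean <= m + z_crit * se:
            hits += 1
    return hits / n_sim

nominal = coverage(z_crit=1.96)  # nominal 95% interval
widened = coverage(z_crit=3.0)   # deliberately widened interval
print(f"nominal interval coverage: {nominal:.3f}")
print(f"widened interval coverage: {widened:.3f}")
```

An interval is well calibrated when its coverage is close to the nominal level; the widened interval covers the true value more often only because it is less informative.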

11. RPP

In my opinion, this section should be expanded substantially. This is where you can finally test your methodology, using real data! What I would add is the following:

**Provide the distribution of R (including all statistics mentioned previously, i.e. p(R=0.05), p(R >= .8), p(R >= .995), median(R), mean(R), sd(R)), using single-study methods
**Provide previously mentioned results for soc psy and cog psy separately
**Provide results of z-curve, and show your kernel density curve (strange that you never show this curve, if it is important in your algorithm).

What would be really great is if you predicted the probability of replication success (power) using the effect size estimate from the original study (derived from a single study) and the N of the replication sample. You could make a graph with this power on the X-axis and the result of the replication on the Y-axis. Strong evidence in favor of your method would be if your result predicts future replicability better than any other index (see RPP for what they tried). Logistic regression seems to be the most appropriate technique for this.

Using multiple logistic regression, you can also assess if other indices have an added value above your predictions.
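
The predictor proposed for the X-axis could be computed along the following lines. This is a minimal sketch under assumptions that are not in the review: a two-group design, Cohen's d as the effect size, and a normal approximation to the t-test. The function name and the example values are hypothetical.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided power of a two-group comparison.

    Uses the normal approximation; d is Cohen's d from the original
    study and n_per_group is the per-group N of the replication.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # expected z-value of the replication
    return (1 - nd.cdf(z_crit - ncp)) + nd.cdf(-z_crit - ncp)

# Hypothetical study: original d = 0.5, replication with 64 per group.
predicted_power = approx_power(0.5, 64)
print(f"predicted probability of a significant replication: {predicted_power:.2f}")
```

Computed for every original study, these values would serve as the predictor in the logistic regression, with replication outcome (significant or not) as the criterion.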

To conclude, for now you provide too limited results to convince readers that your approach is very useful.

Minor comments

P4 top: “heated debates”. A few more sentences on this debate, including references to those debates, would be fair. I would like to mention/recommend the studies of Maxwell et al (2015) in American Psychologist, the comment on the OSF piece in Science, and its response, and the very recent piece of Valen E Johnson et al (2016).

P4, middle: consider starting a new paragraph at “Actual replication”. In the sentence after this one, you may add “or not”.

Another advantage of replication is that it may reveal heterogeneity (context dependence). Here, you may refer to the ManyLabs studies, which indeed reveal heterogeneity in about half of the replicated effects. Then, the next paragraph may start with “At the same time”. To conclude, this piece starting with “Actual replication” can be expanded a bit.

P4, bottom, “In contrast”: This and the preceding sentence are formulated as if sampling error does not exist. It is much too strong! Moreover, if the replication study had low power, sampling error is likely the reason for a statistically insignificant result. Here you can be more careful/precise. The last sentence of this paragraph is perfect.

P5, middle: consider adding more refs on estimates of power in psychology, e.g. Bakker and Wicherts 35% and that study on neuroscience with power estimates close to 20%. Last sentence of the same paragraph; assuming same true effect and same sample size.

P6, first paragraph around Rosenthal. Consider referring to the study of Johnson et al (2016), who used a Bayesian analysis to estimate how many non-significant studies remain unpublished.

P7, top: “studies have the same power (homogenous case)” and “(heterogenous case)”. This is awkward. Homogeneity and heterogeneity are generally reserved for variation in true effect size. Stick to that. Another problem here is that “heterogeneous” power can be created by “heterogeneity” in sample size and/or heterogeneity in effect size. These should be distinguished, because some methods can deal with heterogeneous power caused by heterogeneous N, but not heterogeneous true effect size. So, here, I would simply delete the texts between brackets.

P7, last sentence of first paragraph; I do not understand the sentence.

P10, “average power”. I did not understand this sentence.

P10, bottom: Why do you believe these methods to be most promising?

P11, 2nd par: Rephrase this sentence. Heterogeneity of effect size is not because of sampling variation. Later in this paragraph you also mix up heterogeneity with variation in power again. Of course, you could re-define heterogeneity, but I strongly recommend not doing so (in order not to confuse others); reserve heterogeneity to heterogeneity in true effect size.

P11, 3rd par, 1st sentence: I do not understand this sentence. But then again, this sentence may not be relevant (see major comments), because for applying p-uniform and p-curve heterogeneity of effect size is not relevant.

P11 bottom: maximum likelihood method. This sentence is not specific enough. But then again, this sentence may not be relevant (see major comments).

P12: Statistics without capital.

P12: “random sampling distribution”: delete “random”. By the way, I liked this section on Notation and statistical background.

Section “Two populations of power”. I believe this section is unnecessarily long, with a lot of text. Consider shortening. The spinning wheel analogy is ok.

P16, “close to the first” You mean second?

P16, last paragraph, 1st sentence: English?

Principle 2: The effect on what? Delete last sentence in the principle.

P17, bottom: include the average power after selection in your example.

p-curve/p-uniform: modify, as explained in one of the major comments.

P20, last sentence: Modify the sentence – the ML approach has excellent properties asymptotically, but not when sample size is small. Now it states that it generally yields more precise estimates.

P25, last sentence of 4. Consider deleting this sentence (does not add anything useful).

P32: “We believe that a negative correlation between” some part of sentence is missing.

P38, penultimate sentence: explain what you mean by “decreasing the lower limit by .02” and “increasing the upper limit by .02”.

How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Manuscript under review, copyright belongs to Jerry Brunner and Ulrich Schimmack

Jerry Brunner and Ulrich Schimmack
University of Toronto @ Mississauga

In the past five years, the replicability of original findings published in psychology journals has been questioned. We show that replicability can be estimated by computing the average power of studies. We then present four methods that can be used to estimate average power for a set of studies that were selected for significance: p-curve, p-uniform, maximum likelihood, and z-curve. We present the results of large-scale simulation studies with both homogeneous and heterogeneous effect sizes. All methods work well with homogeneous effect sizes, but only maximum likelihood and z-curve produce accurate estimates with heterogeneous effect sizes. All methods overestimate replicability using the Open Science Collaborative reproducibility project and we discuss possible reasons for this. Based on the simulation studies, we recommend z-curve as a valid method to estimate replicability. We also validated a conservative bootstrap confidence interval that makes it possible to use z-curve with small sets of studies.

Keywords: Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, P-curve, P-uniform, Z-curve, Effect size, Replicability, Simulation.
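
To illustrate why selection for significance matters for average power, here is a small worked example. It is a sketch with made-up power values and is not taken from the manuscript. It relies only on the fact that a study with power p produces a significant result with probability p, so studies in the selected set are weighted by their power.

```python
def mean_power(powers):
    """Average power of all studies that were run."""
    return sum(powers) / len(powers)

def mean_power_after_selection(powers):
    """Average power of the studies that produced a significant result.

    A study with power p enters the selected set with probability p,
    so each study is weighted by its own power.
    """
    return sum(p * p for p in powers) / sum(powers)

# Hypothetical population: equal numbers of studies at four power levels.
powers = [0.05, 0.20, 0.50, 0.80]
before = mean_power(powers)
after = mean_power_after_selection(powers)
print(f"mean power before selection: {before:.3f}")
print(f"mean power after selection:  {after:.3f}")
```

With heterogeneous power, the mean after selection is higher than the mean before selection, because high-powered studies are more likely to end up in the published set. With identical power in all studies, the two values coincide.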

Link to manuscript:

Link to website with technical supplement:



A Critical Review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”

In this review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”, I present verbatim quotes from their chapter and explain why these statements are misleading or false, and how the authors distort the actual evidence by selectively citing research that supports their claims, while hiding evidence that contradicts their claims. I show that the empirical evidence for the claims made by Schwarz and Strack is weak and biased.

Unfortunately, this chapter has had a strong influence on Daniel Kahneman’s attitude towards life-satisfaction judgments, and his fame as a Nobel laureate has led many people to believe that life-satisfaction judgments are highly sensitive to the context in which these questions are asked and practically useless for the measurement of well-being.  This has led to claims that wealth is not a predictor of well-being, but only a predictor of invalid life-satisfaction judgments (Kahneman et al., 2006) or that the effects of wealth on well-being are limited to low incomes.  None of these claims are valid because they rely on the unsupported assumption that life-satisfaction judgments are invalid measures of well-being.

The original quotes are highlighted in bold followed by my comments.

Much of what we know about individuals’ subjective well-being (SWB) is based on self-reports of happiness and life satisfaction.

True. The reason is that sociologists developed brief, single-item measures of well-being that could be included easily in large surveys such as the World Values Survey, the German Socio-Economic Panel, or the US General Social Survey.  As a result, there is a wealth of information about life-satisfaction judgments that transcends scientific disciplines. The main contribution of social psychologists to this research program, which examines how social factors influence human well-being, has been to dismiss the results based on claims that the measure of well-being is invalid.

As Angus Campbell (1981) noted, the “use of these measures is based on the assumption that all the countless experiences people go through from day to day add to . . . global feelings of well-being, that these feelings remain relatively constant over extended periods, and that people can describe them with candor and accuracy”

Half true.  Like all self-report measures, the validity of life-satisfaction judgments depends on respondents’ ability and willingness to provide accurate information.  However, it is not correct to suggest that life-satisfaction judgments assume that feelings remain constant over extended periods of time or that respondents have to rely on feelings to answer questions about their satisfaction with life.  There is a long tradition in the well-being literature of distinguishing cognitive measures of well-being like Cantril’s ladder from affective measures that focus on affective experiences in the recent past like Bradburn’s affect balance scale.  The key assumption underlying life-satisfaction judgments is that respondents have chronically accessible information about their lives or can accurately estimate the frequency of positive and negative feelings. It is not necessary that the feelings are stable.

These assumptions have increasingly been drawn into question, however, as the empirical work has progressed.

It is not clear which assumptions have been drawn into question.  Are people unwilling to report their well-being, are they unable to do so, or are feelings not as stable as they are assumed to be? Moreover, the statement ignores a large literature that has demonstrated validity of well-being measures going back to the 1930s (see Diener et al., 2009; Schneider & Schimmack, 2009, for a meta-analysis).

First, the relationship between individuals’ experiences and objective conditions of life and their subjective sense of well-being is often weak and sometimes counter-intuitive.  Most objective life circumstances account for less than 5 percent of the variance in measures of SWB, and the combination of the circumstances in a dozen domains of life does not account for more than 10 percent (Andrews and Withey 1976; Kammann 1982; for a review, see Argyle, this volume).


First, it is not clear what weak means. How strong should the correlation between objective conditions of life and subjective well-being be?  For example, should marital status be a strong predictor of happiness? Maybe it matters more whether people are happily married or unhappily married than whether they are married or single?  Second, there is no explanation for the claim that these relationships are counter-intuitive.  Employment, wealth, and marriage are positively related to well-being as most people would expect. The only finding in the literature that may be considered counter-intuitive is that having children does not notably increase well-being and sometimes decreases well-being. However, this does not mean well-being measures are false; it may mean that people’s intuitions about the effects of life-events on well-being are wrong. If intuitions were always correct, we would not need scientific studies of determinants of well-being.


Second, measures of SWB have low test-retest reliabilities, usually hovering around .40, and not exceeding .60 when the same question is asked twice during the same one-hour interview (Andrews and Withey 1976; Glatzer 1984).


This argument ignores that responses to a single self-report item often have a large amount of random measurement error, unless participants can recall their previous answer.  The typical reliability of a single-item self-report measure is about r = .6 +/- .2.  There is nothing unique about the results reported here for well-being measures. Moreover, the authors blatantly ignore evidence that scales with multiple items like Diener’s Satisfaction with Life Scale have retest correlations over r = .8 over a one-month period (see Schimmack & Oishi, 2005, for a meta-analysis).  Thus, this statement is misleading and factually incorrect.
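
Why multi-item scales are more reliable than single items can be illustrated with the Spearman-Brown formula. This is a rough sketch that assumes parallel items with equal reliability; it describes how random error averages out across items and is not a model of the retest correlations in the meta-analysis. The five-item case is chosen because the Satisfaction with Life Scale has five items.

```python
def spearman_brown(r_single, k):
    """Predicted reliability of a scale built from k parallel items,
    each with reliability r_single."""
    return k * r_single / (1 + (k - 1) * r_single)

# Single-item reliabilities in the range mentioned above (.6 +/- .2).
for r in (0.4, 0.6, 0.8):
    print(f"single item r = {r:.1f} -> five-item scale r = {spearman_brown(r, 5):.2f}")
```

Under these assumptions, even items with a reliability of .4 combine into a five-item scale with a reliability well above .7.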


Moreover, these measures are extremely sensitive to contextual influences.


This claim is inconsistent with the high retest correlation over periods of one month. Moreover, survey researchers have conducted numerous studies in which they examined the influence of the survey context on well-being measures and a meta-analysis of these studies shows only a small effect of previous items on these judgments and the pattern of results is not consistent across studies (see Schimmack & Oishi, 2005 for a meta-analysis).


Thus, minor events, such as finding a dime (Schwarz 1987) or the outcome of soccer games (Schwarz et al. 1987), may profoundly affect reported satisfaction with one’s life as a whole.


As I will show, the chapter makes many statements about what may happen.  For example, finding a dime may profoundly affect well-being reports or it may not have any effect on these judgments.  These statements are correct because well-being reports can be made in many different ways. The real question is how these judgments are made when well-being measures are used to measure well-being. Experimental studies that manipulate the situation cannot answer this question because they purposefully create the situation to demonstrate that respondents may use mood (when mood is manipulated) or may use information that is temporarily accessible, when relevant information is made salient and temporarily accessible. The processes underlying judgments in these experiments may reveal influences on life-satisfaction judgments in a real survey context or they may reveal processes that do not occur under normal circumstances.


Most important, however, the reports are a function of the research instrument and are strongly influenced by the content of preceding questions, the nature of the response alternatives, and other “technical” aspects of questionnaire design (Schwarz and Strack 1991a, 1991b).


We can get different answers to different questions.  The item “So far, I have gotten everything I wanted in life” may be answered differently than the item “I feel good about my life, these days.”  If so, it is important to examine which of these items is a better measure of well-being.  It does not imply that all well-being items are flawed.  The same logic applies to the response format.  If some response formats produce different results than others, it is important to determine which response formats are better for the measurement of well-being.  Last, but not least, the claim that well-being reports are “strongly influenced by the content of preceding questions” is blatantly false.  A meta-analysis shows that strong effects were only observed in two studies by Strack, but that other studies find much weaker or no effects (see Schimmack & Oishi, 2005, for a meta-analysis).


Such findings are difficult to reconcile with the assumption that subjective social indicators directly reflect stable inner states of well-being (Campbell 1981) or that the reports are based on careful assessments of one’s objective conditions in light of one’s aspirations (Glatzer and Zapf 1984). Instead, the findings suggest that reports of SWB are better conceptualized as the result of a judgment process that is highly context-dependent.


Indeed. A selective and biased list of evidence is inconsistent with the hypothesis that well-being reports are valid measures of well-being, but this only shows that the authors misrepresent the evidence, not that well-being reports lack validity. Validity was carefully examined in Andrews & Withey’s book (1976), which the authors cite without mentioning the evidence presented in the book for the usefulness of well-being reports.




Not surprisingly, individuals may draw on a wide variety of information when asked to assess the subjective quality of their lives.


Indeed. This means that it is impossible to generalize from an artificial context created in an experiment to the normal conditions of a well-being survey because respondents may use different information in the experiment than in the naturalistic context. The experiment may lead respondents to use information that they normally would not use.




Comparison-based evaluative judgments require a mental representation of the object of judgment, commonly called a target, as well as a mental representation of a relevant standard to which the target can be compared.


True. In fact, Cantril’s ladder explicitly asks respondents to compare their actual life to the best possible life they could have and the worst possible life they could have.  We can think about these possible lives as imaginary intrapersonal comparisons.


When asked, “Taking all things together, how would you say things are these days?” respondents are ideally assumed to review the myriad of relevant aspects of their lives and to integrate them into a mental representation of their life as a whole.


True, this is the assumption underlying the use of well-being reports as measures of well-being.


In reality, however, individuals rarely retrieve all information that may be relevant to a judgment


This is also true. It is impossible to retrieve ALL of the relevant information. But it is possible that respondents retrieve most of the relevant information or enough relevant information to make these judgments valid. We do not require 100% validity for measures to be useful.


Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987).


This is also plausible. The question is what would be the criterion for sufficient certainty for well-being judgments and whether this level of certainty is reached without retrieval of relevant information. For example, if I have to report how satisfied I am with my life overall and I am thinking first about my marriage would I stop there or would I think that my overall life is more than my marriage and also think about my work?  Depending on the answer to this question, well-being judgments may be more or less valid.


Hence, the judgment is based on the information that is most accessible at that point in time. In general, the accessibility of information depends on the recency and frequency of its use (for a review, see Higgins 1996).


This also makes sense.  A sick person may think about their health. A person in a happy marriage may think about their loving wife, and a person with financial problems may think about their problems paying bills.  Any life domain that is particularly salient in a person’s life is also likely to be salient when they are confronted with a life-satisfaction question. However, we still do not know which information people will use and how much information they will use before they consider their judgment sufficiently accurate to provide an answer. Would they use just one salient, temporarily accessible piece of information or would they continue to look for more information?


Information that has just been used, for example, to answer a preceding question in the questionnaire, is particularly likely to come to mind later on, although only for a limited time.


Wait a second.  Higgins emphasized that accessibility is driven by recency and frequency (!) of use. Individuals who are going through a divorce or cancer treatment have probably thought frequently about this aspect of their lives.  A single question about their satisfaction with their recreational activities may not make them judge their lives based on their hobbies. Thus, it does not follow from Higgins’s work on accessibility that preceding items have a strong influence on well-being judgments.


This temporarily accessible information is the basis of most context effects in survey measurement and results in variability in the judgment when the same question is asked at different times (see Schwarz and Strack 1991b; Strack 1994a; Sudman, Bradburn, and Schwarz 1996, chs. 3 to 5; Tourangeau and Rasinski 1988)


Once more, the evidence for these temporary accessibility effects is weak and it is not clear why well-being judgments would be highly stable over time if they were driven by irrelevant information that is made temporarily accessible.  In fact, the evidence is more consistent with Higgins’s suggestion that frequency of use influences well-being judgments.  Life domains that are salient to individuals are likely to influence life-satisfaction judgments because they are chronically accessible even if other information is temporarily accessible or primed by preceding questions.


Other information, however, may come to mind because it is used frequently, for example, because it relates to the respondent’s current concerns (Klinger 1977) or life tasks (Cantor and Sanderson, this volume). Such chronically accessible information reflects important aspects of respondents’ lives and provides for some stability in judgments over time.


Indeed, but look at the wording. “This temporarily accessible information IS the basis of most context effects in survey measurement” vs. “Other information, however, MAY come to mind.”  The wording is not balanced and it does not match the evidence that most of the variation in well-being reports across individuals is stable over time and only a small proportion of the variance changes systematically over time. The wording is an example of how some scientists create the illusion of a balanced literature review while pushing their biased opinions.


As an example, consider experiments on question order. Strack, Martin, and Schwarz (1988) observed that dating frequency was unrelated to students’ life satisfaction when a general satisfaction question preceded a question about the respondent’s dating frequency, r = -.12.  Yet reversing the question order increased the correlation to r = .66.  Similarly, marital satisfaction correlated with general life satisfaction r = .32 when the general question preceded the marital one in another study (Schwarz, Strack, and Mai 1991). Yet reversing the question order again increased this correlation to r = .67.


The studies that are cited here are not representative. They show the strongest item-order effects and the effects are much stronger than the meta-analytic average (Schimmack & Oishi, 2005). Both studies were conducted by Strack. Thus, these examples are at best examples of what might happen under very specific conditions that differ from other specific conditions where the effect was much smaller. Moreover, it is not clear why dating frequency should be a strong positive predictor of life-satisfaction. Why would my life be better when I have a lot of dates than the life of somebody who is in a steady relationship? And we would not expect a married respondent with lots of dates to be happy with their marriage. The difference between r = .32 and r = .67 is strong, but it was obtained with small samples and it is common that small samples overestimate effect sizes. In fact, large survey studies show much weaker effects. In short, by focusing on these two examples, the authors create the illusion that strong effects of preceding items are common and that these studies are just an example of these effects. In reality, these are the only two studies with extremely and unusually strong effects that are not representative of the literature. The selective use of evidence is another example of unscientific practices that undermine a cumulative science.


Findings of this type indicate that preceding questions may bring information to mind that respondents would otherwise not consider.


Yes, it may happen, but we do not know under what specific circumstances it happens.  At present, the only predictor of these strong effects is that the studies were conducted by Fritz Strack. Nobody else has reported such strong effects.


If this information is included in the representation that the respondent forms of his or her life, the result is an assimilation effect, as reflected in increased correlations. Thus, we would draw very different inferences about the impact of dating frequency or marital satisfaction on overall SWB, depending on the order in which the questions are asked.


Now the authors extrapolate from extreme examples and discuss possible theoretical implications if this were a consistent and replicable finding.  “We would draw different inferences.”  True. If this were a replicable finding and we asked about specific life domains first, we would end up with false inferences about the importance of dating and marriage for life-satisfaction. However, it is irrelevant what follows logically from a false assumption (if Daniel Kahneman had not won the Nobel Prize, it would be widely accepted that money buys some happiness). Second, it is possible to ask the global life-satisfaction question first without making information about specific aspects of life temporarily salient.  This simple procedure would ensure that well-being reports are more strongly influenced by chronically accessible information that reflects people’s life concerns.  After all, participants may draw on chronically accessible information or temporarily accessible information, and if no relevant information was made temporarily accessible, respondents will use chronically accessible information.


Theoretically, the impact of a given piece of accessible information increases with its extremity and decreases with the amount and extremity of other information that is temporarily or chronically accessible at the time of judgment (see Schwarz and Bless 1992a). To test this assumption, Schwarz, Strack, and Mai (1991) asked respondents about their job satisfaction, leisure time satisfaction, and marital satisfaction prior to assessing their general life satisfaction, thus rendering a more varied set of information accessible. In this case, the correlation between marital satisfaction and life satisfaction increased from r = .32 (in the general-marital satisfaction order) to r = .46, yet this increase was less pronounced than the r = .67 observed when marital satisfaction was the only specific domain addressed.


This finding also suggests that strong effects of temporarily accessible information are highly context dependent. Just asking for satisfaction with several life-domains reduces the item-order effect. Moreover, with the small samples in Schwarz et al. (1991), the difference between r = .32 and r = .46 is not statistically significant, meaning it could be a chance finding.  So, their own research suggests that temporarily accessible information may typically have a small effect on life-satisfaction and this conclusion would be consistent with the evidence in the literature.
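
The comparison of two correlations from independent samples can be sketched with Fisher’s r-to-z transformation. The sample sizes below are hypothetical, because the group sizes of the original study are not given here; the point is only to show how strongly the conclusion depends on N.

```python
from math import atanh, sqrt
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two correlations
    from independent samples (Fisher r-to-z transformation)."""
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (atanh(r2) - atanh(r1)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical sample sizes; not the values from the original study.
z_small, p_small = compare_correlations(0.32, 50, 0.46, 50)
z_large, p_large = compare_correlations(0.32, 1000, 0.46, 1000)
print(f"n = 50 per condition:   z = {z_small:.2f}, p = {p_small:.3f}")
print(f"n = 1000 per condition: z = {z_large:.2f}, p = {p_large:.5f}")
```

With 50 respondents per condition, a difference of this size is well within the range of sampling error; it would only become statistically significant in samples that are much larger.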


In light of these findings, it is important to highlight some limits for the emergence of question-order effects. First, question-order effects of the type discussed here are to be expected only when answering a preceding question increases the temporary accessibility of information that is not chronically accessible anyway…  Hence, chronically accessible current concerns would limit the size of any emerging effect, and the more they do so, the more extreme the implications of these concerns are.


Here the authors acknowledge that there are theoretical reasons why item-order effects should typically not have a strong influence on well-being reports.  One reason is that some information such as marital satisfaction is likely to be used even if marriage is not made salient by a preceding question.  It is, therefore, not clear why marital satisfaction would produce a big increase from r = .32 to r = .67, as this would imply that numerous respondents do not consider their marriage when they make the judgment. It would also explain why other studies found much weaker item-order effects with marital satisfaction and higher correlations between marital satisfaction and life-satisfaction than r = .32.  However, it is interesting that this important theoretical point is offered only as a qualification after presenting evidence from two studies that did show strong item-order effects. If the argument had been presented first, the question would arise why these studies did produce strong item-order effects and it would be evident that it is impossible to generalize from these specific studies to well-being reports in general.




“Complicating things further, information rendered accessible by a preceding question may not always be used.”


How is this complicating things further?  If there are ways to communicate to respondents that they should not be influenced by previous items (e.g., “Now on to another topic” or “take a moment to think about the most important aspects of your life”) and this makes context effects disappear, why don’t we just use the proper conversational norms to avoid these undesirable effects? And some surveys actually do this and we would therefore expect that they elicit valid reports of well-being that are not based on responses to previous questions in the survey.


In the above studies (Strack et al. 1988; Schwarz et al. 1991), the conversational norm of nonredundancy was evoked by a joint lead-in that informed respondents that they would now be asked two questions pertaining to their well-being. Following this lead-in, they first answered the specific question (about dating frequency or marital satisfaction) and subsequently reported their general life satisfaction. In this case, the previously observed correlations of r = .66 between dating frequency and life satisfaction, or of r = .67 between marital satisfaction and life satisfaction, dropped to r = -.15 and .18, respectively. Thus, the same question order resulted in dramatically different correlations, depending on the elicitation of the conversational norm of nonredundancy.


The only evidence for these effects comes from a couple of studies by the authors.  Even if these results hold, they suggest that it should be possible to use conversational norms to get the same results for both item-orders if conversational norms suggest that participants should use all relevant chronically accessible information.  However, the authors did not conduct such a study. One reason may be that the prediction would be that there is no effect and researchers are only interested in using manipulations that show effects so that they can reject the null-hypothesis. Another explanation could be that Schwarz and Strack’s program of research on well-being reports was built on the heuristics and biases program in social psychology that is only interested in showing biases and ignores evidence for accuracy (Funder, 1987). The only results that are deemed relevant and worthy of publishing are those from experiments that successfully created a bias in judgments. The problem with this approach is that it cannot reveal that these judgments are also accurate and can be used as valid measures of well-being.




Judgments are based on the subset of potentially applicable information that is chronically or temporarily accessible at the time.


Yes, it is not clear what else the judgments could be based on.


Accessible information, however, may not be used when its repeated use would violate conversational norms of nonredundancy.


Interestingly this statement would imply that participants are not influenced by subtle information (priming). The information has to be consciously accessible to determine whether it is relevant and only accessible information that is considered relevant is assumed to influence judgments.  This also implies that making information accessible that is not considered relevant will not have an influence on well-being reports. For example, asking people about their satisfaction with the weather or the performance of a local sports team does not lead to a strong influence of this information on life-satisfaction judgments because most people do not consider this information relevant (Schimmack et al., 2002). Once more, it is not clear how well-being reports can be highly context dependent, if information is carefully screened for relevance and responses are only made when sufficient relevant information was retrieved.




Suppose that an extremely positive (or negative) life event comes to mind. If this event is included in the temporary representation of the target “my life now,” it results in a more positive (negative) assessment of SWB, reflecting an assimilation effect, as observed in an increased correlation in the studies discussed earlier. However, the same event may also be used in constructing a standard of comparison, resulting in a contrast effect: compared to an extremely positive (negative) event, one’s life in general may seem relatively bland (or pretty benign). These opposite influences of the same event are sometimes referred to as endowment (assimilation) and contrast effects (Tversky and Griffin 1991).


This is certainly a possibility, but it is not necessarily limited to temporarily accessible information.  A period in an individual’s life may be evaluated relative to other periods in that person’s life.  In this way, subjective well-being is subjective. Objectively identical lives can be evaluated differently because past experiences created different ideals or comparison standards (see Cantril’s early work on human concerns).  This may happen for chronically accessible information just as much as for temporarily accessible information and it does not imply that well-being reports are invalid; it just shows that they are subjective.


Strack, Schwarz, and Gschneidinger (1985, Experiment 1) asked respondents to report either three positive or three negative recent life events, thus rendering these events temporarily accessible.  As shown in the top panel of Table 1, these respondents reported higher current life satisfaction after they recalled three positive rather than negative recent events. Other respondents, however, had to recall events that happened at least five years before. These respondents reported higher current life satisfaction after recalling negative rather than positive past events.


This finding shows that contrast effects can occur.  However, it is important to note that these context effects were created by the experimental manipulation.  Participants were asked to recall events from 5 years ago.  In the naturalistic scenario, where participants are simply asked to report “how is your life these days,” participants are unlikely to suddenly recall events from 5 years ago.   Similarly, if you were asked about your happiness with your last vacation, you are unlikely to recall earlier vacations and contrast your most recent vacation with them.  Indeed, Suh et al. (1996) showed that life-satisfaction judgments are influenced by recent events and that older events do not have an effect. They found no evidence for contrast effects when participants were not asked to recall events from the distant past.  So, this research shows what can happen in a specific context where participants were asked to recall extremely negative or positive events from their past, but without prompting by an experimenter this context would hardly ever occur.  Thus, this study has no ecological or external validity for the question of how participants actually make life-satisfaction judgments.


These experimental results are consistent with correlational data (Elder 1974) indicating that U.S. senior citizens, the “children of the Great Depression,” are more likely to report high subjective well-being the more they suffered under adverse economic conditions when they were adolescents. 


This finding again does not mean that elderly US Americans who suffered more during the Great Depression were actively thinking about the Great Depression when they answered questions about their well-being. It is more likely that they may have lower aspirations and expectations from life (see Easterlin). This means that we can interpret this result in many ways. One explanation would be that well-being judgments are subjective and that cultural and historic events can shape individuals’ evaluation standards of their lives.




In combination, the reviewed research illustrates that the same life event may affect judgments of SWB in opposite directions, depending on its use in the construction of the target “my life now” and of a relevant standard of comparison.


Again, the word “may” makes this statement true. Many things may happen, but that tells us very little about what actually is happening when respondents report on their well-being.  How past negative events can become positive events (e.g., a divorce was terrible, but it feels like a blessing after being happily remarried) and positive events can become negative events (e.g., the dream of getting tenure comes true, but doing research for life happens to be less fulfilling than one anticipated) is an interesting topic for well-being research, but none of these evaluative reversals undermines the usefulness of well-being measures. In fact, these measures are needed to reveal that subjective evaluations have changed and that past evaluations may have carry-over effects on future evaluations.


It therefore comes as no surprise that the relationship between life events and judgments of SWB is typically weak. Today’s disaster can become tomorrow’s standard, making it impossible to predict SWB without a consideration of the mental processes that determine the use of accessible information.


Actually, the relationship between life-events and well-being is not weak.  Lottery winners are happier and accident victims are unhappier.  And cross-cultural research shows that people do not simply get used to terrible life circumstances.  Starving is painful. It does not become a normal standard for well-being reports on day 2 or 3.  Most of the time, past events simply lose importance and are replaced by new events and well-being measures are meant to cover a certain life period rather than an individual’s whole life from birth to death.  And because subjective evaluations are not just objective reports of life-events, they depend on mental processes. The problem is that a research program that uses experimental manipulations does not tell us about the mental processes that are underlying life-satisfaction judgments when participants are not manipulated.




Counterfactual thinking can influence affect and subjective well-being in several ways (see Roese 1997; Roese and Olson 1995b).


Yes, it can, it may, and it might, but the real question is whether it does influence well-being reports and if so, how it influences these reports.


For example, winners of Olympic bronze medals reported being more satisfied than silver medalists (Medvec, Madey, and Gilovich 1995), presumably because for winners of bronze medals, it is easier to imagine having won no medal at all (a “downward counterfactual”), while for winners of silver medals, it is easier to imagine having won the gold medal (an “upward counterfactual”).


This is not an accurate summary of the article, which contained three studies.  Study 1 used ratings of video clips of Olympic medalists immediately after the event (23 silver & 18 bronze medalists).  The study showed a strong effect that bronze medalists were happier than silver medalists, F(1,72) = 18.98.  The authors also noted that in some events the silver medal means that an athlete lost a finals match, whereas in other events they just placed second in a field of 8 or more athletes.  An analysis that excluded final matches showed weaker evidence for the effect, F(1,58) = 6.70.  Most important, this study did not include subjective reports of satisfaction as claimed in the review article. Study 2 examined interviews of 13 silver and 9 bronze medalists.  Participants in Study 2 rated interviews of silver medal athletes to contain more counterfactual statements (e.g., “I almost”), t(20) = 2.37, p < .03.  Importantly, no results regarding satisfaction are reported. Study 3 actually recruited athletes for a study and had a larger sample size (N = 115). Participants were interviewed by the experimenters after they won a silver or bronze medal at an athletic competition (not the Olympics).   The description of the procedure is presented verbatim here.


Procedure. The athletes were approached individually following their events and asked to rate their thoughts about their performance on the same 10-point scale used in Study 2. Specifically, they were asked to rate the extent to which they were concerned with thoughts of “At least I. . .” (1) versus “I almost” (10). Special effort was made to ensure that the athletes understood the scale before making their ratings. This was accomplished by mentioning how athletes might have different thoughts following an athletic competition, ranging from “I almost did better” to “at least I did this well.”


What is most puzzling about this study is why the experimenters seemingly did not ask questions about emotions or satisfaction with performance.  It would have taken only a couple of questions to obtain reports that speak to the question of the article, namely whether winning a bronze medal is subjectively better than winning a silver medal.  Alas, these questions are missing. The only result from Study 3 is “as predicted, silver medalists’ thoughts following the competition were more focused on ‘I almost’ than were bronze medalists’.  Silver medalists described their thoughts with a mean rating of 6.8 (SD = 2.2), whereas bronze medalists assigned their thoughts an average rating of 5.7 (SD = 2.7), t(113) = 2.4, p < .02.”
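
To put the single reported result of Study 3 on a standardized scale, the quoted means, standard deviations, and t-value can be converted into Cohen's d. The following is only a rough sketch: the group sizes are not given in the passage above, so the code assumes two equal-sized groups, and the variable names are chosen here for illustration. Note that the effect size describes "I almost" thoughts, not satisfaction.

```python
import math

# Summary statistics as quoted above for Study 3.
mean_silver, sd_silver = 6.8, 2.2
mean_bronze, sd_bronze = 5.7, 2.7
t_value, df = 2.4, 113

# ASSUMPTION: equal group sizes (only the total N = 115 is given above).
pooled_sd = math.sqrt((sd_silver**2 + sd_bronze**2) / 2)
d_from_means = (mean_silver - mean_bronze) / pooled_sd

# Cross-check under the same assumption: d = 2t / sqrt(df).
d_from_t = 2 * t_value / math.sqrt(df)

print(f"d from means and SDs: {d_from_means:.2f}")
print(f"d from t statistic:   {d_from_t:.2f}")
```

Both routes give a standardized difference of a little under half a standard deviation in counterfactual thoughts, which says nothing about how happy the athletes were.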


In sum, there is no evidence in this study that winning an Olympic silver medal or any other silver medal for that matter makes athletes less happy than winning a bronze medal. The misrepresentation of the original study by Schwarz and Strack is another example of unscientific practices that can lead to the fabrication of false facts that are difficult to correct and can have a lasting negative effect on the creation of a cumulative science.


In summary, judgments of SWB can be profoundly influenced by mental constructions of what might have been.


This statement is blatantly false. The cited study on medal winners does not justify this claim and there is no scientific basis for the claim that these effects are profound.


In combination, the discussion in the preceding sections suggests that nearly any aspect of one’s life can be used in constructing representations of one’s “life now” or a relevant standard, resulting in many counterintuitive findings.


A collection of selective findings that were obtained using different experimental procedures does not mean that well-being reports obtained under naturalistic conditions produce many counterintuitive findings, nor is there any evidence that they do produce many counterintuitive findings.  This statement lacks any empirical foundation and is inconsistent with other findings in the well-being literature.


Common sense suggests that misery that lasts for years is worse than misery that lasts only for a few days.


Indeed. Extended periods of severe depression can drive some people to attempt suicide. A week with the flu does not. Consistent with this common sense observation, well-being reports of depressed people are much lower than those of other people, once more showing that well-being reports often produce results that are consistent with intuitions.


Recent research suggests, however, that people may largely neglect the duration of the episode, focusing instead on two discrete data points, namely, its most intense hedonic moment (“peak”) and its ending (Fredrickson and Kahneman 1993; Varey and Kahneman 1992). Hence, episodes whose worst (or best) moments and endings are of comparable intensity are evaluated as equally (un)pleasant, independent of their duration (for a more detailed discussion, see Kahneman, this volume).


Yes, but this research focusses on brief episodes with a single emotional event.  It is interesting that duration of episodes seems to matter very little, but life is a complex series of events and episodes. Having sex for 20 minutes or 30 minutes may not matter, but having sex regularly, at least once a week, does seem to matter for couples’ well-being.  As Diener et al. (1985) noted, it is the frequency, not the intensity (or duration) of positive and negative events in people’s lives that matters.


Although the data are restricted to episodes of short duration, it is tempting to speculate about the possible impact of duration neglect on the evaluation of more extended episodes.


Yes, interesting, but this statement clearly indicates that the research on duration neglect is not directly relevant for well-being reports.


Moreover, retrospective evaluations should crucially depend on the hedonic value experienced at the end of the respective episode.


This is a prediction not a fact. I have actually examined this question and found that frequency of positive and negative events has a stronger influence on satisfaction judgments with a day than how respondents felt at the end of the day when they reported daily satisfaction.




As our selective review illustrates, judgments of SWB are not a direct function of one’s objective conditions of life and the hedonic value of one’s experiences.


First, it is great that the authors acknowledge here that their review is selective.  Second, we do not need a review to know that subjective well-being is not a direct function of objective life conditions. The whole point of subjective well-being reports is to allow respondents to evaluate these events from their own subjective point of view.  And finally, at no point has this selective review shown that these reports do not depend on the hedonic value of one’s experiences. In fact, measures of hedonic experiences are strong predictors of life-satisfaction judgments (Schimmack et al., 2002; Lucas et al., 1996; Zou et al., 2012).


Rather they crucially depend on the information that is accessible at the time of judgment and how this information is used in constructing mental representations of the to-be-evaluated episode and a relevant standard.


This factual statement cannot be supported by a selective review of the literature. You cannot say, my selective review of creationist literature shows that evolution theory is wrong.  You can say that a selective review of creationist literature would suggest that evolution theory is wrong, but you cannot say that it is wrong. To make scientific statements about what is (highly probable to be) true and what is (highly probable to be) false, you need to conduct a review of the evidence that is not selective and not biased.


As a result of these construal processes, judgments of SWB are highly malleable and difficult to predict on the basis of objective conditions. 


This is not correct.  Evaluations do not directly depend on objective conditions. This is not a feature of well-being reports but a feature of evaluations.  At the same time, the construal processes that relate objective events to subjective well-being are systematic, predictable, and depend on chronically accessible and stable information.  Well-being reports are highly correlated with objective characteristics of nations; bereavement, unemployment, and divorce have negative effects on well-being; and winning the lottery, marriage, and remarriage have positive effects on well-being.  Schwarz and Strack are fabricating facts. This is not considered fraud. Only data manipulation and fabricating data are considered scientific fraud, but this does not mean that fabricated facts are less harmful than fabricated data.  Science can only provide a better understanding if it is based on empirically verified and replicable facts. Simply stating “judgments of SWB are difficult to predict” without providing any evidence for this claim is unscientific.




The causal impact of comparison processes has been well supported in laboratory experiments that exposed respondents to relevant comparison standards…For example, Strack and his colleagues (1990) observed that the mere presence of a handicapped confederate was sufficient to increase reported SWB under self-administered questionnaire conditions, presumably because the confederate served as a salient standard of comparison….As this discussion indicates, the impact of social comparison processes on SWB is more complex than early research suggested. As far as judgments of global SWB are concerned, we can expect that exposure to someone who is less well off will usually result in more positive, and to someone who is better off in more negative, assessments of one’s own life.  However, information about the other’s situation will not always be used as a comparison standard.

The whole section about social comparison does not really address the question of the influence of social comparison effects on well-being reports.  Only a single study with a small sample is used to provide evidence that respondents may engage in social comparison processes when they report their well-being.  The danger of this occurring in a naturalistic context is rather slim.  Even in face-to-face interviews, the respondent is likely to have answered several questions about themselves and it seems far-fetched that they would suddenly think about the interviewer as a relevant comparison standard, especially if the interviewer does not have a salient characteristic like a disability that may be considered relevant. Once more the authors generalize from one very specific laboratory experiment to the naturalistic context in which SWB reports are normally made without considering the possibility that the experimental results are highly context sensitive and do not reveal how respondents normally judge their lives.

[Standards Provided by the Social Environment]

In combination, these examples draw attention to the possibility that salient comparison standards in one’s immediate environment, as well as socially shared norms, may constrain the impact of fortuitous temporary influences. At present, the interplay of chronically and temporarily accessible standards on judgments of SWB has received little attention. The complexities that are likely to result from this interplay provide a promising avenue for future research.

Here the authors acknowledge that their program of research is limited and fails to address how respondents use chronically accessible information. They suggest that this is a promising avenue for future research, but they fail to acknowledge why they haven’t conducted studies that start to address this question. The reason is that their research program with experimental manipulations of the situation does not allow them to study the use of chronically accessible information.  The use of information that by definition comes to mind spontaneously, independent of researchers’ experimental manipulations, is a blind spot of the experimental approach.

[Interindividual Standards Implied by the Research Instrument]

Finally, we extend our look at the influences of the research instrument by addressing a frequently overlooked source of temporarily accessible comparison information…As numerous studies have indicated (for a review, see Schwarz 1996), respondents assume that the list of response alternatives reflects the researcher’s knowledge of the distribution of the behavior: they assume that the “average” or “usual” behavioral frequency is represented by values in the middle range of the scale, and that the extremes of the scale correspond to the extremes of the distribution. Accordingly, they use the range of the response alternatives as a frame of reference in estimating their own behavioral frequency, resulting in different estimates of their own behavioral frequency, as shown in table 4.2. More important for our present purposes, they further extract comparison information from their low location on the scale…Similar findings have been obtained with regard to the frequency of physical symptoms and health satisfaction (Schwarz and Scheuring 1992), the frequency of sexual behaviors and marital satisfaction (Schwarz and Scheuring 1988), and various consumer behaviors (Menon, Raghubir, and Schwarz 1995).

One study is in German and not available.  I examined the study by Schwarz and Scheuring (1988) in European Journal of Social Psychology.   Study 1 had four conditions with n = 12 or 13 per cell (N = 51).  The response format varied frequencies so that having sex or masturbating once a week was either a high or low frequency occurrence.  Subsequently, participants reported their relationship satisfaction. The relationship satisfaction ratings were analyzed with an ANOVA.  “Analysis of variance indicates a marginally reliable interaction of both experimental variables, F(1,43) = 2.95, p < 0.10, and no main effects.”  The result is not significant by conventional standards and the degrees of freedom show that some participants were excluded from this analysis without any mention of this fact.  Study 2 manipulated the response format for frequency of sex and masturbation within subject. That is, all subjects were asked to rate frequencies of both behaviors in four different combinations. There were n = 16 per cell, N = 64. No ANOVA is reported, presumably because it was not significant. However, a PLANNED contrast between the high sex/low masturbation and the low sex/high masturbation group showed a just significant result, t(58) = 2.17, p = .034. Again, the degrees of freedom do not match the sample size. In conclusion, the evidence that subtle manipulations of response formats can lead to social comparison processes that influence well-being reports is not conclusive. Replication studies with larger samples would be needed to show that these effects are replicable and to determine how strong these effects are.
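
The degrees-of-freedom mismatch can be made explicit with a little bookkeeping. The sketch below assumes that the reported tests used a between-subjects error term with one observation per participant, so that the error df should equal N minus the number of cells; that assumption, the helper `expected_error_df`, and the dictionary layout are illustrative choices and not taken from the original article.

```python
def expected_error_df(total_n: int, n_cells: int) -> int:
    """Error df for a between-subjects design (ASSUMED): N minus number of cells."""
    return total_n - n_cells

# Sample sizes and reported df as summarized in the paragraph above.
studies = {
    "Study 1": {"total_n": 51, "n_cells": 4, "reported_df": 43},
    "Study 2": {"total_n": 64, "n_cells": 4, "reported_df": 58},
}

gaps = {}
for name, s in studies.items():
    expected = expected_error_df(s["total_n"], s["n_cells"])
    gaps[name] = expected - s["reported_df"]
    print(f"{name}: expected df = {expected}, "
          f"reported df = {s['reported_df']}, unexplained gap = {gaps[name]}")
```

Under this assumption, four participants in Study 1 and two in Study 2 are unaccounted for. A different error term would change the expected values, which is exactly the kind of detail the original report should have stated.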

In combination, they illustrate that response alternatives convey highly salient comparison standards that may profoundly affect subsequent evaluative judgments.

Once more, the word “may” makes the statement true in the trivial sense that many things may happen. However, there is no evidence that these manipulations actually have profound effects on well-being reports, and the existing studies show statistically weak evidence and provide no information about the magnitude of these effects.

Researchers are therefore well advised to assess information about respondents’ behaviors or objective conditions in an open-response format, thus avoiding the introduction of comparison information that respondents would not draw on in the absence of the research instrument.

There is no evidence that this would improve the validity of frequency reports and research on sexual frequency shows similar results with open and closed measures of sexual frequency (Muise et al., 2016).


In summary, the use of interindividual comparison information follows the principle of cognitive accessibility that we have highlighted in our discussion of intraindividual comparisons. Individuals often draw on the comparison information that is rendered temporarily accessible by the research instrument or the social context in which they form the judgment, although chronically accessible standards may attenuate the impact of temporarily accessible information.

The statement that people often rely on interpersonal comparison standards is not justified by the research.  By design, experiments that manipulate one type of information and make it salient cannot determine how often participants use this type of information when it is not made salient.


In the preceding sections, we considered how respondents use information about their own lives or the lives of others in comparison-based evaluation strategies. However, judgments of well-being are a function not only of what one thinks about but also of how one feels at the time of judgment.

Earlier, the authors stated that respondents are likely to use a minimum of information that is deemed sufficient. “Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987).”  Now we are supposed to believe that they use intrapersonal and interpersonal information that is temporarily and chronically accessible and their feelings.  That is a lot of information and it is not clear how all of this information is combined into a single judgment. A more parsimonious explanation for the host of findings is that each experiment carefully created a context that made respondents use the information that the experimenters wanted respondents to use to confirm the hypothesis that they use this information. The problem is that this only shows that a particular source of information may be used in one particular context. It does not mean that all of these sources of information are used and need to be integrated into a single judgment under naturalistic conditions. The program of research simply fails to address the question of which information respondents actually use when they are asked to judge their well-being in a normal context.

A wide range of experimental data confirms this intuition. Finding a dime on a copy machine (Schwarz 1987), spending time in a pleasant rather than an unpleasant room (Schwarz et al. 1987, Experiment 2), or watching the German soccer team win rather than lose a championship game (Schwarz et al. 1987, Experiment 1) all resulted in increased reports of happiness and satisfaction with one’s life as a whole…Experimental evidence supports this assumption. For example, Schwarz and Clore (1983, Experiment 2) called respondents on sunny or rainy days and assessed reports of SWB in telephone interviews. As expected, respondents reported being in a better mood, and being happier and more satisfied with their life as a whole, on sunny rather than on rainy days. Not so, however, when respondents’ attention was subtly drawn to the weather as a plausible cause of their current feelings.

The problem is that all of the cited studies were conducted by Schwarz and that other studies that produced different results are not mentioned.  The famous weather study has recently been called into question.  Moreover, weather is not an ideal manipulation for studying mood effects on life-satisfaction judgments because weather effects on mood are not very strong either.  Respondents in sunny California do not report higher life-satisfaction than respondents in Ohio (Schkade & Kahneman), and several large scale studies have now failed to replicate the famous weather effect on well-being reports (Lucas & Lawless, 2013; Schmiedeberg, 2014).

On theoretical grounds, we may assume that people are more likely to use the simplifying strategy of consulting their affective state the more burdensome it would be to form a judgment on the basis of comparison information.

Here it is not clear why it would be burdensome to make global life-satisfaction judgments. The previous sections suggested that respondents have access to a large amount of chronically and temporarily accessible information that they apparently used in the previous studies. Suddenly, it is claimed that retrieving relevant information is too hard and mood is used instead. It is not clear why respondents would consider their current mood sufficient to evaluate their lives, especially if inconsistent accessible information also comes to mind.

Note in this regard that evaluations of general life satisfaction pose an extremely complex task that requires a large number of comparisons along many dimensions with ill-defined criteria and the subsequent integration of the results of these comparisons into one composite judgment. Evaluations of specific life domains, on the other hand, are often less complex.

If evaluations of specific life domains are less complex and global questions are just an average of specific domains, it is not clear why it would be so difficult to evaluate satisfaction in a few important life domains (health, family, work) and integrate this information.  The hypothesis that mood is only used as a heuristic for global well-being reports also suggests that it would be possible to avoid the use of this heuristic by asking participants to report satisfaction with specific life domains. As these questions are supposed to be easier to answer, participants would not use mood. Moreover, preceding items are less likely to make information accessible that is relevant for a specific life domain.  For example, a dating question is irrelevant for academic or health satisfaction.  Thus, participants are most likely to draw on chronically accessible information that is relevant for answering a question about satisfaction with specific domains. It follows that averages of domain satisfaction judgments would be more valid than global judgments, if participants were relying on mood to make global judgments. For example, finding a dime would make people judge their lives more positively, but not their health, social relationships, and income.  Thus, many of the alleged problems with global well-being reports could be avoided by asking for domain-specific reports and then aggregating them (Andrews & Withey, 1976; Zou et al., 2013).

If judgments of general well-being are based on respondents’ affective state, whereas judgments of domain satisfaction are based on comparison processes, it is conceivable that the same event may influence evaluations of one’s life as a whole and evaluations of specific domains in opposite directions. For example, an extremely positive event in domain X may induce good mood, resulting in reports of increased global SWB. However, the same event may also increase the standard of comparison used in evaluating domain X, resulting in judgments of decreased satisfaction with this particular domain. Again, experimental evidence supports this conjecture. In one study (Schwarz et al. 1987, Experiment 2), students were tested in either a pleasant or an unpleasant room, namely, a friendly office or a small, dirty laboratory that was overheated and noisy, with flickering lights and a bad smell. As expected, participants reported lower general life satisfaction in the unpleasant room than in the pleasant room, in line with the moods induced by the experimental rooms. In contrast, they reported higher housing satisfaction in the unpleasant than in the pleasant room, consistent with the assumption that the rooms served as salient standards of comparison.

The evidence here is a study with 22 female students assigned to two conditions (n = 12 and 10 per condition).  The 2 x 2 ANOVA with room (pleasant vs. unpleasant) and satisfaction judgment (life vs. housing) produced a significant interaction of measure and room, F(1,20) = 7.25, p = .014.  The effect for life-satisfaction was significant, F(1,20) = 8.02, p = .010 (reported as p < .005), and not significant for housing satisfaction, F(1,20) = 1.97, p = .18 (reported as p < .09 one-tailed).
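The p-values above can be checked from the reported F-values alone. The following is a minimal sketch, not the original authors' analysis: it assumes only that an F-test with 1 numerator degree of freedom is equivalent to a two-sided t-test (F = t squared) and integrates the t density numerically so that no statistics package is needed. The function names are my own.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def p_from_f_1df(f_value, df_error, steps=2000):
    """Two-sided p-value for F(1, df_error), using F = t**2.

    Integrates the t density from 0 to sqrt(F) with Simpson's rule
    (steps must be even) and converts the central area to a tail area.
    """
    t = math.sqrt(f_value)
    h = t / steps
    total = t_pdf(0.0, df_error) + t_pdf(t, df_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    central_area = total * h / 3  # P(0 < T < t)
    return 1 - 2 * central_area

# F-values as reported for the room study (error df = 20)
for label, f in [("interaction", 7.25),
                 ("life satisfaction", 8.02),
                 ("housing satisfaction", 1.97)]:
    print(f"{label}: F(1,20) = {f:.2f}, p = {p_from_f_1df(f, 20):.3f}")
```

Running this kind of check is a quick way to compare the p-values an article reports with the p-values implied by its test statistics.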

This weak evidence in a single study with a very small sample is used to conclude that life-satisfaction judgments and domain satisfaction judgments may diverge.  However, numerous studies have shown high correlations between average domain satisfaction judgments and global life-satisfaction judgments (Andrews & Withey, 1976; Schimmack & Oishi, 2005; Zou et al., 2013).  This finding cannot occur if respondents use mood for life-satisfaction judgments and other information for domain satisfaction judgments.  Yet readers are not informed about this finding that undermines Schwarz and Strack’s model of well-being reports and casts doubt on the claim that the same information has opposite effects on global life-satisfaction judgments and domain-specific judgments. This may happen in highly artificial laboratory conditions, but it does not happen often in normal survey contexts.

The Relative Salience of Mood and Competing Information

If recalling a happy or sad life event elicits a happy or sad mood at the time of recall, however, respondents are likely to rely on their feelings rather than on recalled content as a source of information. This overriding impact of current feelings is likely to result in mood-congruent reports of SWB, independent of the mental construal variables discussed earlier. The best evidence for this assumption comes from experiments that manipulated the emotional involvement that subjects experienced while thinking about past life events.

This section introduces a qualification of the earlier claim that recall of events in the remote past leads to a contrast effect.  Here the claim is that recalling a positive event from the remote past (a happy time with a deceased spouse) will not lead to a contrast effect (intensify dissatisfaction of a bereaved person), if the recall of the event triggers an actual emotional experience (My life these days is good because I feel good when I think about the good times in the past).  The problem with this theory is that it is inconsistent with the earlier claim that people will discount their current feelings if they think they are irrelevant. If respondents do not use mood to judge their lives when they attribute it to the weather, it is not clear why they would use their feelings if they are triggered by recall of an emotional event from their past.  Why would a widower evaluate his current life as a widower more favorably when he is recalling the good times with his wife?

Even if this were a reliable finding, it would be practically irrelevant for actual ratings of life-satisfaction because respondents are unlikely to recall specific events in sufficient detail to elicit strong emotional reactions.  The studies that demonstrated the effect instructed participants to do so, but under normal circumstances participants make judgments very quickly, often without recall of detailed, specific emotional episodes.  In fact, even the studies that showed these effects showed only weak evidence that recall of emotional events had notable effects on mood (Strack et al., 1985).


Self-presentation and social desirability concerns may arise at the reporting stage, and respondents may edit their private judgment before they communicate it

True. All subjective ratings are susceptible to reporting styles. This is why it is important to corroborate self-ratings of well-being with other evidence such as informant ratings of well-being.  However, the problem of reporting biases would be irrelevant, if the judgment without these biases is already valid. A large literature on reporting biases in general shows that these biases account for a relatively small amount of the total variance in ratings. Thus, the key question remains whether the remaining variance provides meaningful information about respondents’ subjective evaluations of their lives or whether this variance reflects highly unreliable and context-dependent information that has no relationship to individuals’ subjective well-being.


Figure 4.2 summarizes the processes reviewed in this chapter. If respondents are asked to report their happiness and satisfaction with their “life as a whole,” they are likely to base their judgment on their current affective state; doing so greatly simplifies the judgmental task.

As noted before, this would imply that global well-being reports are highly unstable and strongly correlated with measures of current mood, but the empirical evidence does not support these predictions.  Current mood has a small effect on global well-being reports (Eid & Diener, 2004) and they are highly stable (Schimmack & Oishi, 2005) and predicted by personality traits even when these traits are measured a decade before the well-being reports (Costa & McCrae, 1980).

If the informational value of their affective state is discredited, or if their affective state is not pronounced and other information is more salient, they are likely to use a comparison strategy. This is also the strategy that is likely to be used for evaluations of less complex specific life domains.

Schwarz and Strack’s model would allow for weak mood effects. We only have to make the plausible assumption that respondents often have other information to judge their lives and that they find this information more relevant than their current feelings.  Therefore, this first stage of the judgment model is consistent with evidence that well-being judgments are only weakly correlated with mood and highly stable over time.

When using a comparison strategy, individuals draw on the information that is chronically or temporarily most accessible at that point in time. 

Apparently the term “comparison strategy” is now used to refer to the retrieval of any information rather than an active comparison that takes place during the judgment process.  Moreover, it is suddenly equally plausible that participants draw on chronically accessible information or on temporarily accessible information.  While the authors did not review evidence that would support the use of chronically accessible information, their model clearly allows for the use of chronically accessible information.

Whether information that comes to mind is used in constructing a representation of the target  “my life now” or a representation of a relevant standard depends on the variables that govern the use of information in mental construal (Schwarz and Bless 1992a; Strack 1992). 

This passage suggests that participants have to go through the process of evaluating their life each time they are asked to make a well-being report. They have to construct what their life is like, what they want from life, and make a comparison. However, it is also possible that they can draw on previous evaluations of life domains (e.g., I hate my job, I am healthy, I love my wife, etc.). As life-satisfaction judgments are made rather quickly within a few seconds, it seems more plausible that some pre-established evaluations are retrieved than to assume that complex comparison processes are being made at the time of judgments.

If the accessibility of information is due to temporary influences, such as preceding questions in a questionnaire, the obtained judgment is unstable over time and a different judgment will be obtained in a different context.

This statement makes it obvious that retest correlations provide direct evidence on the use of temporarily accessible information.  Importantly, low retest stability could be caused by several factors (e.g. random responding).  So, we cannot verify that participants rely on temporarily accessible information when retest correlations are low. However, we can use high retest stability to falsify the hypothesis that respondents rely heavily on temporarily accessible information because the theory makes the opposite prediction.  It is therefore highly relevant that retest correlations show high temporal consistency in global well-being reports.  Based on this solid empirical evidence we can infer that responses are not heavily influenced by temporarily accessible information (Schimmack & Oishi, 2005).
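The logic of this falsification test can be illustrated with a small simulation. This is only a sketch under assumptions that are mine and not part of Schwarz and Strack's model: each judgment is the sum of a stable component (chronically accessible information) and an independent transient component (temporarily accessible information or mood) that is redrawn at each occasion, and total variance is fixed at 1. The names and the variance shares are hypothetical.

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate_retest(transient_share, n=5000, seed=1):
    """Retest correlation when transient_share of the variance is occasion-specific."""
    rng = random.Random(seed)
    sd_stable = math.sqrt(1 - transient_share)
    sd_transient = math.sqrt(transient_share)
    time1, time2 = [], []
    for _ in range(n):
        stable = rng.gauss(0, sd_stable)  # same value at both occasions
        time1.append(stable + rng.gauss(0, sd_transient))
        time2.append(stable + rng.gauss(0, sd_transient))
    return pearson_r(time1, time2)

for share in (0.2, 0.8):
    print(f"transient share {share:.0%}: retest r = {simulate_retest(share):.2f}")
```

Under these assumptions the expected retest correlation equals the share of stable variance, so heavy reliance on temporarily accessible information cannot coexist with high retest correlations.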

On the other hand, if the accessibility of information reflects chronic influences such as current concerns or life tasks, or stable characteristics of the social environment, the judgment is likely to be less context dependent.

This implies that high retest correlations are consistent with the use of chronically accessible information, but high retest correlations do not prove that participants use chronically accessible information. It is also possible that stable variance is due to reporting styles. Thus, other information is needed to test the use of chronically accessible information. For example, agreement in well-being reports by several raters (self, spouse, parent, etc.) cannot be attributed to response styles and shows that different raters rely on the same chronically accessible information to provide well-being reports (Schneider & Schimmack, 2012).

The size of context-dependent assimilation effects increases with the amount and extremity of the temporarily accessible information that is included in the representation of the target. 

This part of the model would explain why experiments and naturalistic studies often produce different results. Experiments make temporarily accessible information extremely salient, which may lead participants to use it. In contrast, such extremely salient information is typically absent in naturalistic studies, which explains why chronically accessible information is used. The results are only inconsistent if results from experiments with extreme manipulations are generalized to normal contexts without these extreme conditions.


Our review emphasizes that reports of well-being are subject to a number of transient influences. 

This is correct. The review emphasized evidence from the authors’ experimental research that showed potential threats to the validity of well-being judgments. The review did not examine how serious these threats are for the validity of well-being judgments.

Although the information that respondents draw on reflects the reality in which they live, which aspects of this reality they consider and how they use these aspects in forming a judgment is profoundly influenced by features of the research instrument.

This statement is blatantly false.  The reviewed evidence suggests that the testing situation (a confederate, a room) or an experimental manipulation (recall positive or negative events) can influence well-being reports. There was very little evidence that the research instrument influenced well-being reports and there was no evidence that these effects are profound.

[Implications for Survey Research]

The reviewed findings have profound methodological implications.

This is wrong. The main implication is that researchers have to consider a variety of potential threats to the validity of well-being judgments. All of these threats can be reduced and many survey studies do take care to avoid some of these potential problems.

First, the obtained reports of SWB are subject to pronounced question-order effects because the content of preceding questions influences the temporary accessibility of relevant information.

As noted earlier, this was only true in two studies by the authors. Other studies do not replicate this finding.

Moreover, questionnaire design variables, like the presence or absence of a joint lead-in to related questions, determine how respondents use the information that comes to mind. As a result, mean reported well-being may differ widely, as seen in many of the reviewed examples

The dramatic shifts in means are limited to experimental studies that manipulated lead-ins to demonstrate these effects. National representative surveys show very similar means year after year.

Moreover, the correlation between an objective condition of life (such as dating frequency) and reported SWB can run anywhere from r = –.1 to r = .6, depending on the order in which the same questions are asked (Strack et al. 1988), suggesting dramatically different substantive conclusions.

Moreover?  This statement just repeats the first false claim that question order has profound effects on life-satisfaction judgments.

Second, the impact of information that is rendered accessible by preceding questions is attenuated the more the information is chronically accessible (see Schwarz and Bless 1992a).

How, then, can we see pronounced item-order effects for marital satisfaction if marital satisfaction is a highly salient and chronically accessible aspect of married people’s lives? This conclusion directly undermines the previous claim that item-order has profound effects.

Third, the stability of reports of SWB over time (that is, their test-retest reliability) depends on the stability of the context in which they are assessed. The resulting stability or change is meaningful when it reflects the information that respondents spontaneously consider because the same, or different, concerns are on their mind at different points in time. 

There is no support for this claim. If participants draw on chronically accessible information, which the authors’ model allows, the judgments do not depend on the stability of the context because chronically accessible information is by definition context-independent.

Fourth, in contrast to influences of the research instrument, influences of respondents’ mood at the time of judgment are less likely to result in systematic bias. The fortuitous events that affect one respondent’s mood are unlikely to affect the mood of many others.

This is true, but it would still undermine the validity of the judgments.  If participants rely on their current mood, variation in these responses will be unreliable and unreliable measures are by definition invalid. Moreover, the average mood of participants during the time of a survey is also not a valid measure of average well-being. So, even though mood effects may not be systematic, they would undermine the validity of well-being reports. Fortunately, there is no evidence that mood has a strong influence on these judgments, while there is evidence that participants draw on chronically accessible information from important life domains (Schimmack & Oishi, 2005).

Hence, mood effects are likely to introduce random variation.

Yes, this is a correct prediction, but evidence contradicts this prediction, and the correct conclusion is that mood does not introduce a lot of random variation in well-being reports because it is not heavily used by respondents to evaluate their lives or specific aspects of their lives.

Fifth, as our review indicates, there is no reason to expect strong relationships between the objective conditions of life and subjective assessments of well-being under most circumstances.

There are many reasons not to expect strong correlations between life-events and well-being reports. One reason is that a single event is only a small part of a whole life and that few life events have such dramatic effects on life-satisfaction that they make any notable contribution to life-satisfaction judgments.  Another reason is that well-being is subjective and the same life event can be evaluated differently by different individuals. For example, the publication of this review in a top journal in psychology would have different effects on my well-being and on the well-being of Schwarz and Strack.

Specifically, strong positive relationships between a given objective aspect of life and judgments of SWB are likely to emerge when most respondents include the relevant aspect in the representation that they form of their life and do not draw on many other aspects. This is most likely to be the case when (a) the target category is wide (“my life as a whole”) rather than narrow (a more limited episode, for example); (b) the relevant aspect is highly accessible; and (c) other information that may be included in the representation of the target is relatively less accessible. These conditions were satisfied, for example, in the Strack, Martin, and Schwarz (1988) dating frequency study, in which a question about dating frequency rendered this information highly accessible, resulting in a correlation of r = .66 with evaluations of the respondent’s life as a whole. Yet, as this example illustrates, we would not like to take the emerging correlation seriously when it reflects only the impact of the research instrument, as indicated by the fact that the correlation was r = –.1 if the question order was reversed.

The unrepresentative extreme result from Strack’s study is used again as evidence, when other studies do not show the effect (Schimmack & Oishi, 2005).

Finally, it is worth noting that the context effects reviewed in this chapter limit the comparability of results obtained in different studies. Unfortunately, this comparability is a key prerequisite for many applied uses of subjective social indicators, in particular their use in monitoring the subjective side of social change over time (for examples see Campbell 1981; Glatzer and Zapf 1984).

This claim is incorrect. The experimental demonstrations of effects under artificial conditions that were needed to manipulate judgment processes do not have direct implications for the way participants actually judge well-being. The authors’ model allows for chronically accessible information to have a strong influence on these judgments under less extreme and less artificial conditions. Moreover, the authors’ model makes predictions that are disconfirmed by evidence of high stability and low correlations with mood.

Which Measures Are We to Use?

By now, most readers have probably concluded that there is little to be learned from self-reports of global well-being.

If so, the authors succeeded with their biased presentation of the evidence to convince readers that these reports are highly susceptible to a host of context effects that make the outcome of the judgment process essentially unpredictable. Readers would be surprised to learn that well-being reports of twins who never met are positively correlated (Lykken & Tellegen, 1996).

Although these reports do reflect subjectively meaningful assessments, what is being assessed, and how, seems too context dependent to provide reliable information about a population’s well-being, let alone information that can guide public policy (but see Argyle, this volume, for a more optimistic take).

The claim that well-being reports are too context dependent to provide reliable information about a population’s well-being is false for several reasons.  First, the authors did not show that well-being reports are context dependent. They showed that with very extreme manipulations in highly contrived and unrealistic contexts, judgments moved around statistically significantly in some studies.  Second, they did not show that these shifts are large, as it would require larger samples to estimate effect sizes. Third, they did not show that these effects have a notable influence on well-being reports in actual surveys of populations’ well-being.  Fourth, they already pointed out that some of these effects (e.g., mood effects) would only add random noise, which would lower the reliability of individuals’ well-being reports, but when aggregated across responses would not alter the mean of a sample. Finally, the authors blatantly ignore evidence (reviewed in this volume by Diener and colleagues) that nationally representative samples show highly reliable differences between nations and that these differences are highly correlated with objective life circumstances that are related to nations’ wealth.

In short, Schwarz and Strack’s claims are not scientifically founded and merely express the authors’ pessimistic take on the validity of well-being reports.  This pessimistic view is a direct consequence of a myopic focus on laboratory experiments that were designed to invalidate well-being reports and ignoring evidence from actual well-being surveys that are more suitable to examine the reliability and validity of well-being reports when well-being reports are provided under naturalistic conditions.

As an alternative approach, several researchers have returned to Bentham’s (1789/1948) notion of happiness as the balance of pleasure over pain (for examples, see Kahneman, this volume; Parducci 1995).

This statement ignores the important contribution of Diener (1984) who argued that the concept of well-being may consist of life evaluations as well as the balance of pleasure over pain or Positive Affect and Negative Affect, as these constructs are called in contemporary psychology. As a result of Diener’s (1984) conception of well-being as a construct with three components, researchers have routinely measured global life-evaluations along with measures of positive and negative affect. A key finding is that these measures are highly correlated, although not perfectly identical (Lucas et al., 1996; Zou et al., 2013).  Schwarz and Strack ignore this evidence, presumably because it would undermine their view that global life-satisfaction judgments are highly context sensitive and that measures of positive and negative affect could produce notably different results.


In conclusion, Schwarz and Strack’s (1999) chapter is a prototypical example of several bad scientific practices.  First, the authors conduct a selective review of the literature that focuses on one specific paradigm and ignores evidence from other approaches.  Second, the review focuses strongly on original studies conducted by the authors themselves and ignores studies by other researchers that produced different results. Third, the original studies are often obtained with small samples and there are no independent replications by other researchers, but the results are discussed as if they are generalizable.  Fourth, life-satisfaction judgments are influenced by a host of factors and any study that focuses on one possible predictor of these judgments is likely to account for only a small amount of the variance. Yet, the literature review does not take effect sizes into account and the theoretical model overemphasizes the causes that were studied and ignores causes that were not studied.  Fifth, the experimental method has the advantage of isolating single causes, but it has the disadvantage that results cannot be generalized to ecologically valid contexts in which well-being reports are normally obtained. Nevertheless, the authors generalize from artificial experiments to the typical survey context without examining whether their predictions are confirmed.  Finally, the authors make broad and profound claims that do not logically follow from their literature review. They suggest that decades of research with global well-being reports can be dismissed because the measures are unreliable, but these claims are inconsistent with a mountain of evidence that shows the validity of these measures that the authors willfully ignore (Diener et al., 2009).

Unfortunately, the claims in this chapter were used by Nobel Laureate Daniel Kahneman as arguments to push for an alternative conception and measurement of well-being.  In combination, the unscientific review of the literature and the political influence of a Nobel Prize have had a negative influence on well-being science.  The biggest damage to the field has been the illusion that the processes underlying global well-being reports are well-understood. In fact, we know very little about how respondents make these judgments and how accurate these judgments are.  The chapter lists a number of possible threats to the validity of well-being reports, but it is not clear how much these threats actually undermine the validity of well-being reports and what can be done to reduce biases in these measures to improve their validity.  A program that relies exclusively on experimental manipulations that create biases in well-being reports is unable to answer these questions because well-being judgments can be made in numerous ways and results that are obtained in artificial laboratory contexts may or may not generalize to the context that is most relevant, namely when well-being reports are used to measure well-being.

What is needed is a real scientific program of research that examines accuracy and biases in well-being reports and creates well-being measures that maximize accuracy and minimize biases. This is what all other sciences do when they develop measures of theoretically important constructs. It is time for well-being researchers to act like a normal science. To do so, research on well-being reports needs a fresh start and needs an objective and scientific review of the empirical evidence regarding the validity of well-being measures.

Dr. R responds to Finkel, Eastwick, & Reis (FER)’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

My response is organized as a commentary on key sections of the article. The sections of the article are direct quotations to give readers quick and easy access to FER’s arguments and conclusions, followed by my comments.  The quotations are printed in bold.

Here, we extend FER2015’s analysis to suggest that much of the discussion of best research practices since 2011 has focused on a single feature of high-quality science—replicability—with insufficient sensitivity to the implications of recommended practices for other features, like discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.

I see replicability as being equivalent to the concept of reliability in psychological measurement.  Reliability is necessary for validity, which means a measure needs to be reliable to produce valid results and this includes internal validity and external validity. And valid results are needed to create a solid body of research that provides the basis for a cumulative science.

Take life-satisfaction judgments as an example. In a review article, Schwarz and Strack (1999) claimed that life-satisfaction judgments are unreliable, extremely sensitive to context, and that responses can change dramatically as a function of characteristics of the survey questions.  Do we think a measure with low reliability can be used to study well-being and to build a cumulative science of well-being? No. It seems self-evident that reliable measures are better than unreliable measures.

The reason why some measures are not reliable is that scores on the measure are influenced by factors that are difficult or too expensive to control.  As a result, these factors have an undesirable effect on responses. The effect is not systematic, or it is too difficult to study the systematic effects, and therefore results will randomly change when the same measure is used again and again.  We can assess the influence of these random factors by administering the same measurement procedure again and again and see how much scores change (in the absence of real change).

The same logic applies to replicability.  Replicability means that we get the same result if we repeat a study again and again.  Just like scores on a psychological measure can change, the results of even exact replication studies will not be the same. The reason is the same. Random factors that are outside the control of the experimenter influence the results that are obtained in a single study.  Hence, we cannot expect that exact replication studies will always produce the same results.  For example, the gender ratio in a psychology class will not be the same year after year, even if there is no real change in the gender ratio of psychology students over time.

So what does it even mean for a result to be replicable; that is, for a replication study to produce the same result as the original study?  It depends on the interpretation of the results of an original study.  A professor interested in the gender composition of psychology could compute the gender ratio for each year. The exact number would vary from year to year.  However, the researcher could also compute a 95% confidence interval around these numbers.  This interval specifies the amount of variability that is expected by chance.  We may then say that a study is replicable if subsequent studies produce results that are compatible with the 95% confidence interval of the original study.  In contrast, low replicability would mean that results vary from study to study by more than chance alone would produce.  For example, in one year the gender ratio is 70% female (+/- 10% 95% CI), in the next year it is 25% female (again +/-10%), and the following year it is 99% (+/- 10%).  In this case, the gender ratio jumps around dramatically and the result from one study cannot be used to predict gender ratios in other years, and provides no solid empirical foundations for theories of the effect of gender on interest in psychology.
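The confidence-interval criterion can be written down in a few lines. The sketch below is only an illustration: the class size of 80 is a hypothetical number that I picked because it yields a margin of roughly +/- 10% for a ratio of 70%, and the simple Wald interval is one of several ways to compute a confidence interval for a proportion.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def proportion_ci(p_hat, n):
    """Wald 95% confidence interval for an observed proportion."""
    half_width = Z95 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def is_compatible(original_p, n, replication_p):
    """True if the replication result falls inside the original study's 95% CI."""
    low, high = proportion_ci(original_p, n)
    return low <= replication_p <= high

# Hypothetical class of 80 students, 70% of them female
low, high = proportion_ci(0.70, 80)
print(f"original year: 70% female, 95% CI [{low:.2f}, {high:.2f}]")
for later in (0.65, 0.25):
    print(f"later year {later:.0%} female: compatible = {is_compatible(0.70, 80, later)}")
```

By this criterion a later ratio of 65% counts as a replication of the 70% result, whereas a ratio of 25% does not.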

Using this criterion of replicability, many results in psychology are highly replicable.  The problem is that using this criterion, many results in psychology are also not very informative because effect sizes tend to be relatively small compared to the width of confidence intervals (Cohen, 1994).  With a standardized effect size of d = .4, and a typical confidence interval width of d ~ 1 (se = .25), the typical finding in psychology is that the effect size ranges from d = -.1 to d = .9. This means the typical result is consistent with a small effect in the opposite direction from the one in the sample (chocolate eating leads to weight gain, even if my study shows that chocolate eating leads to weight loss) and very large effects in the same direction (chocolate eating is a highly effective way of losing weight). Most important, the result is also consistent with the null-hypothesis (chocolate eating has no effect on weight; which in this case would be a sensational and important finding that would make Willy Wonka very happy).  I hope this example makes the point that it is not very informative to conduct studies of small effect sizes with wide confidence intervals because we do not learn much from these studies. Mostly, we are not more informed about a research question after looking at the data than we were before we looked at the data.
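The arithmetic behind this example is short enough to spell out. The sketch below uses the values from the text (d = .4, se = .25) and a normal approximation for the 95% interval; the three candidate effect sizes that it checks against the interval are illustrative values of my own choosing.

```python
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def effect_size_ci(d, se):
    """95% confidence interval around an observed standardized effect size."""
    return d - Z95 * se, d + Z95 * se

low, high = effect_size_ci(0.4, 0.25)
print(f"d = .40, se = .25, 95% CI [{low:.2f}, {high:.2f}]")

# Which true effect sizes is the typical result compatible with?
candidates = [("small effect in the opposite direction", -0.05),
              ("no effect at all", 0.0),
              ("very large effect", 0.85)]
for label, value in candidates:
    print(f"{label} (d = {value}): inside CI = {low <= value <= high}")
```

All three candidates fall inside the interval, which is the sense in which the typical result is uninformative.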

Not surprisingly, psychology journals do not publish findings like d = .2 +/- .8.  The typical criterion for reporting a newsworthy result is that the confidence interval falls into one of two regions: the region of effect sizes less than zero or the region of effect sizes greater than zero.  If the 95%CI falls in one of these two regions, it is possible to say that there is only a maximum error rate of 5% when we infer from a confidence interval in the positive region that the actual effect size is positive, and from a confidence interval in the negative region that the actual effect size is negative.  In other words, it wasn’t just those random factors that produced a positive effect in a sample when the actual effect size is 0 or negative and it wasn’t just random factors that produced a negative effect when the actual effect size is 0 or positive.  To examine whether the results of a study provide sufficient information to claim that an effect is real and not just due to random factors, researchers compute p-values and check whether the p-value is less than 5%.
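Checking whether the 95% confidence interval excludes zero and checking whether p < .05 are two ways of asking the same question. A minimal sketch with a normal approximation makes this visible; the effect sizes and standard errors in the loop are hypothetical examples, not results from any study.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z95 = STD_NORMAL.inv_cdf(0.975)  # about 1.96

def two_sided_p(d, se):
    """Two-sided p-value for the null hypothesis that the true effect is zero."""
    z = abs(d / se)
    return 2 * (1 - STD_NORMAL.cdf(z))

def ci_excludes_zero(d, se):
    """True if the 95% confidence interval lies entirely above or below zero."""
    low, high = d - Z95 * se, d + Z95 * se
    return low > 0 or high < 0

# Hypothetical (d, se) pairs
for d, se in [(0.2, 0.4), (0.4, 0.25), (0.5, 0.25), (-0.6, 0.25)]:
    p = two_sided_p(d, se)
    print(f"d = {d:+.1f}, se = {se}: p = {p:.3f}, CI excludes zero = {ci_excludes_zero(d, se)}")
```

In every case the interval excludes zero exactly when the p-value is below .05.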

If the original study reported a significant result to make inferences about the direction of an effect, and replicability is defined as obtaining the same result, replicability means that we obtain a significant result again in the replication study. The famous statistician Sir Ronald Fisher made replicability a criterion for a good study. “A properly designed experiment rarely fails to give … significance” (Fisher, 1926, p. 504).

What are the implications of replication studies that do not replicate a significant result? These studies are often called failed replication studies, but this term is unfortunate because the study was not a failure. We might call these studies unsuccessful replication studies instead, although I am not sure this term is much better. The problem with unsuccessful replication studies is that there are a host of reasons why a replication study might not produce a significant result. This means that additional research is needed to uncover why the original study and the replication study produced different results. In contrast, if a series of studies produces significant results, it is highly likely that the result is a real finding and can be used as an empirical foundation for theories. For example, the gender ratio in my PSY230 course is always significantly different from a 50/50 split that we might expect if both genders were equally interested in psychology. This shows that my study, which recorded the gender of students and compared the ratio of men and women against a fixed probability of 50%, meets at least one criterion of a properly designed experiment, namely it rarely fails to reject the null-hypothesis.

In short, it is hard to argue with the proposition that replicability is an important criterion for a good study.  If study results cannot be replicated, it is not clear whether a phenomenon exists, and if it is not clear whether a phenomenon exists, it is impossible to make theoretical predictions about other phenomena based on this phenomenon.  For example, we cannot predict gender differences in professions that require a psychology degree if we do not have replicable evidence that there is a gender difference in psychology students.

The present analysis extends FER2015’s “error balance” logic to emphasize tradeoffs among features of a high-quality science (among scientific desiderata). When seeking to optimize the quality of our science, scholars must consider not only how a given research practice influences replicability, but also how it influences other desirable features.

A small group of social relationship researchers (Finkel, Eastwick, & Reis; henceforth FER) are concerned about the recent shift in psychology from a scientific discipline that ignored replicability entirely to a field that actually cares about the replicability of results published in original research articles. Although methodologists have criticized psychology for a long time, it was only after Bem (2011) published extraordinarily unbelievable results that psychologists finally started to wonder how replicable published results actually are. In response to this new focus on replicability, several projects have conducted replication studies with shocking results. In FER’s research area, replicability is estimated to be as low as 25%. That is, three-quarters of published results are not replicable and require further research efforts to examine why original studies and replication studies produced inconsistent results. In a large-scale replication study, one of the authors’ original findings failed to replicate, and the replication studies cast doubt on theoretical assumptions about the determinants of forgiveness in close relationships.

FER ask “Should Scientists Consistently Prioritize Replicability Above Other Core Features?”

As FER are substantive researchers with little background in research methodology, it may be understandable that they do not mention important contributions by methodologists like Jacob Cohen.  Cohen’s answer is clear.  Less is more, except for sample size.  This statement makes it clear that replicability is necessary for a good study.  According to Cohen a study design can be perfect in many ways (e.g., elaborate experimental manipulation of real-world events with highly valid outcome measures), but if the sample size is small (e.g., N = 3), the study simply cannot produce results that can be used as an empirical foundation for theories.  If a study cannot reject the null-hypothesis with some degree of confidence, it is impossible to say whether there is a real effect or whether the result was just caused by random factors.

Unfazed by their lack of knowledge about research methodology, FER take a different view.

In our view, the field’s discussion of best research practices should revolve around how we prioritize the various features of a high-quality science and how those priorities may shift across our discipline’s many subfields and research contexts.

Similarly, requiring very large sample sizes increases replicability by reducing false-positive rates and increases cumulativeness by reducing false-negative rates, but it also reduces the number of studies that can be run with the available resources, so conceptual replications and real-world extensions may remain unconducted.

So, who is right? Should researchers follow Cohen’s advice and conduct a small number of studies with large samples, or is it better to conduct a large number of studies with small samples? If resources are limited and a researcher can collect data from 500 participants in one year, should the researcher conduct one study with N = 500, five studies with N = 100, or 25 studies with N = 20? FER suggest that we have a trade-off between replicability and discoveries.

Also, large sample size norms and requirements may limit the feasibility of certain sorts of research, thereby reducing discovery.

This is true, if we consider true and false discoveries as discoveries (FER do not make a distinction).  Bem (2011) discovered that human minds can time travel. This was a fascinating discovery, yet it was a false discovery. Bem (2001) himself advocated the view that all discoveries are valuable, even false discoveries (Let’s err on the side of discovery.).  Maybe FER learned about research methods from Bem’s chapter.  Most scientists and lay people, however, value true discoveries over false discoveries.  Many people would feel cheated if the Moon landing was actually faked, for example, and if billions spent on cancer drugs are not helping to fight cancer (it really was just eating garlic).  So, the real question is whether many studies with small samples produce more true discoveries than a single study with a large sample.

This question was examined in LeBel, Campbell, and Loving (2015), who concluded largely in favor of Cohen’s recommendation that a slow approach with fewer studies and high replicability is advantageous for a cumulative science.

For example, LCL2016’s Table 3 shows that the N-per-true discovery decreases from N=1,742 when the original research is statistically powered at 25% to N=917 when the original research is statistically powered at 95%.

FER criticize that LCL focused on efficient use of resources for replication studies and ignored the efficient use of resources for original researchers. As many researchers conduct more than one study on a particular research question, the distinction between original researcher and replication researcher is artificial. Ultimately, researchers may conduct a number of studies. The studies can be totally new, conceptual replications of previous studies, or exact replications of previous studies. A significant result always will be used to claim a discovery. When a non-significant result contradicts a previous significant result, additional research is needed to examine whether the original result was a false discovery or whether the replication result was a false negative.

FER observe that “original researchers will be more efficient (smaller N-per-true discovery) when they prioritize lower-powered studies. That is, when assuming that an original researcher wishes to spend her resources efficiently to unearth many true effects, plans never to replicate her own work, and is insensitive to the resources required to replicate her studies, she should run many weakly powered studies.”

FER may have discovered why some researchers, including themselves, pursue a strategy of conducting many studies with relatively low power. It produces many discoveries that can be published. These studies also produce many non-significant results that do not lead to a discovery. But the absolute number of true discoveries is still likely to be greater than the 1 true discovery by a researcher who conducted only one study. The problem is that these researchers are also likely to make more false discoveries than the researcher who conducts only one study. They just make more discoveries, true discoveries and false discoveries, and replication studies are needed to examine whether the results are true discoveries or false discoveries. When other researchers conduct replication studies and fail to replicate an effect, further resources are needed to examine why the original study and the replication study produced different results. However, this is not a problem for discoverers who are only in the business of testing new and original hypotheses, reporting those that produced a significant result, and leaving it to other researchers to examine which of these discoveries are true or false. These researchers were rewarded handsomely in the years before Bem (2011) because nobody wanted to be in the business of conducting replication studies. As a result, all discoveries produced by original researchers were treated as if they would replicate, and researchers with a high number of discoveries were treated as researchers with more true discoveries. There just was no distinction between true and false discoveries and it made sense to err on the side of discovery.
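
The trade-off between true and false discoveries can be made concrete with a rough calculation for the three ways of spending 500 participants mentioned earlier. The sketch below rests on assumptions that are not part of FER's article: a two-group design with equal group sizes, a true effect of d = .4 whenever the hypothesis is true, half of all tested hypotheses being true, alpha = .05, and a normal approximation to power. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()

def power_two_groups(total_n, d, alpha=0.05):
    """Approximate two-tailed power for a two-group comparison (normal approximation)."""
    ncp = d * sqrt(total_n / 4)  # equals d * sqrt(n_per_group / 2) with equal groups
    z_crit = NORMAL.inv_cdf(1 - alpha / 2)
    # Chance of a significant result in either direction when the true effect is d.
    return NORMAL.cdf(ncp - z_crit) + NORMAL.cdf(-ncp - z_crit)

def expected_discoveries(n_studies, total_n, d=0.4, prior_true=0.5, alpha=0.05):
    """Expected true and false discoveries for n_studies studies of size total_n."""
    true_hits = n_studies * prior_true * power_two_groups(total_n, d, alpha)
    false_hits = n_studies * (1 - prior_true) * alpha
    return true_hits, false_hits

# Three ways to spend 500 participants (assumed d = .4, 50% true hypotheses).
for n_studies, total_n in [(1, 500), (5, 100), (25, 20)]:
    true_hits, false_hits = expected_discoveries(n_studies, total_n)
    false_share = false_hits / (true_hits + false_hits)
    print(f"{n_studies:>2} studies, N = {total_n:>3}: true = {true_hits:.2f}, "
          f"false = {false_hits:.2f}, false share = {false_share:.0%}")
```

Under these assumptions, the 25 small studies are expected to yield more true discoveries than the single large study, but they also yield many more false discoveries, and a larger share of all significant results are false.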

Given the conflicting efficiency goals between original researchers and replicators, whose goals shall we prioritize?

This is a bizarre question. The goal of science is to uncover the truth and to create theories that rest on a body of replicable, empirical findings. Apparently, this is not the goal of original researchers. Their goal is to make as many discoveries as possible and to leave it to replicators to test which of these discoveries are replicable or not. This division is not very appealing and few scientists want to be the maid of original scientists and clean up their mess when they do cooking experiments in the kitchen. Original researchers should routinely replicate their own results, and when they do so with small studies, they suddenly face the same problem as replicators: they end up with non-significant results and have to conduct further studies to uncover the reasons for these discrepancies. FER seem to agree.

We must prioritize the field’s efficiency goals rather than either the replicator’s or the original researcher’s in isolation. The solid line in Figure 2 illustrates N-per-true-discovery from the perspective of the field—when the original researcher’s 5,000 participants are added to the pool of participants used by the replicator. This line forms a U-shaped pattern, suggesting that the field will be more efficient (smaller N-per true-discovery) when original researchers prioritize moderately powered studies).

This conclusion is already implied in Cohen’s power calculations. The reason is that studies with very low power have a low chance of getting a significant result. As a result, resources are wasted on these studies and it would have been better not to conduct these studies, especially when we take into account that each study requires a new ethics approval, training of personnel, data analysis time, etc. All of these costs multiply with the number of studies that are conducted to get a significant result. At the other extreme, power increases with sample size at a diminishing rate (roughly like a log-function). This means that once power has reached a certain level, it requires more and more resources to increase power even further. Moreover, 80% power means that 8 out of 10 studies are significant and 90% power means that 9 out of 10 studies are significant. The extra costs of increasing power to 90% may not warrant the increase in success rate from 8 to 9 studies. For this reason, Cohen did not really suggest that infinite sample sizes are optimal. Instead, he suggested that researchers should aim for 80% power. That is, 4 out of 5 studies that examine a real effect show a significant result.
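
The diminishing returns of larger samples can be illustrated with the standard normal-approximation formula for the sample size of a two-group comparison. This is a sketch under assumptions (d = .4, alpha = .05, two-tailed, equal group sizes); an exact calculation based on the t-distribution would give slightly larger numbers, and the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist

NORMAL = NormalDist()

def n_per_group(power, d=0.4, alpha=0.05):
    """Approximate participants per group needed to reach the target power."""
    z_alpha = NORMAL.inv_cdf(1 - alpha / 2)  # critical value for a two-tailed test
    z_power = NORMAL.inv_cdf(power)          # quantile that corresponds to the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

for target in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"power {target:.0%}: about {n_per_group(target)} participants per group")
```

Each additional percentage point of power costs more participants than the previous one, which is why aiming for 80% power is a reasonable compromise rather than an argument for ever larger samples.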

However, FER’s simulations come to a different conclusion.  Their Figure suggests that studies with 30% power are just as good as studies with 70% power and could be even better than studies with 80% power.

For example, if a hypothesis is 75% likely to be true, which might be the case if the finding had a strong theoretical foundation, the most efficient use of field-wide N appears to favor power of ~25% for d=.41 and ~40% for d=.80.

The problem with taking these results seriously is that the criterion N per true discovery does not take into account the costs of a type-I error. Conducting many studies with small samples and low power can produce a larger number of significant results than a smaller number of studies with large samples, simply due to the larger number of studies. However, it also implies a higher rate of false positives. Thus, it is important to take the seriousness of a type-I error or a type-II error into account.

So, let’s use a scenario where original results need to be replicated. In fact, many journals require at least two if not more significant results to provide evidence for an effect. The researcher who conducts many studies with low power has a problem because the probability of obtaining two significant results in a row is only power squared. Even if a single significant result is reported, other researchers need to replicate this finding and many of these replication studies will fail, until eventually a replication study with a significant result corroborates the original finding.
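
The power-squared argument is simple arithmetic. The sketch below assumes that the studies are independent and examine the same real effect with the same power; the function names are illustrative.

```python
def chance_all_significant(power, n_studies=2):
    """Probability that n independent studies of a real effect are all significant."""
    return power ** n_studies

def expected_attempts_until_success(power):
    """Mean number of independent replication attempts until the first significant one."""
    return 1 / power

for power in (0.20, 0.50, 0.80):
    print(f"power {power:.0%}: "
          f"two in a row = {chance_all_significant(power):.0%}, "
          f"expected attempts until a significant replication = "
          f"{expected_attempts_until_success(power):.2f}")
```

With 20% power, two significant results in a row occur only 4% of the time even when the effect is real, compared to 64% with 80% power.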

In a simulation with d = .4 and an equal proportion of true null-hypotheses and real effects, a researcher with 80% power (N = 200, d = .4, alpha = .05, two-tailed) needs about 900 participants for every true discovery. A researcher with 20% power (N = 40, d = .4, alpha = .05, two-tailed) needs about 1800 participants for every true discovery.

When the rate of true null-results decreases, the number of true discoveries increases and it is easier to make true discoveries. Nevertheless, the advantage of high powered studies remains. High powered studies need only about half as many participants per true discovery as low powered studies (N = 665 vs. 1157).
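
One way to set up such a calculation analytically is sketched below. It is not the simulation code behind the numbers above, and it rests on assumptions that should be read as hypothetical: every significant original result is followed by exactly one replication with the same sample size, a true discovery is a real effect that is significant in both studies, power is computed with a normal approximation, and the scenario with fewer true null-hypotheses uses a 75% rate of real effects. The exact numbers therefore differ from the simulation, but the ordering of the two strategies is the point.

```python
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()

def power_two_groups(total_n, d, alpha=0.05):
    """Approximate two-tailed power for a two-group comparison (normal approximation)."""
    ncp = d * sqrt(total_n / 4)  # equals d * sqrt(n_per_group / 2) with equal groups
    z_crit = NORMAL.inv_cdf(1 - alpha / 2)
    return NORMAL.cdf(ncp - z_crit) + NORMAL.cdf(-ncp - z_crit)

def n_per_replicated_true_discovery(total_n, d=0.4, prior_true=0.5, alpha=0.05):
    """Expected participants per real effect that is significant twice in a row."""
    power = power_two_groups(total_n, d, alpha)
    # An original study is significant for real effects (power) or by chance (alpha).
    p_original_sig = prior_true * power + (1 - prior_true) * alpha
    # Every original study costs total_n; a significant one triggers one replication.
    participants = total_n * (1 + p_original_sig)
    # Assumed definition: a real effect that is significant in both studies.
    true_discoveries = prior_true * power ** 2
    return participants / true_discoveries

for prior_true in (0.50, 0.75):
    high = n_per_replicated_true_discovery(200, prior_true=prior_true)
    low = n_per_replicated_true_discovery(40, prior_true=prior_true)
    print(f"real effects {prior_true:.0%}: N = 200 needs about {high:.0f}, "
          f"N = 40 needs about {low:.0f} participants per true discovery")
```

In both scenarios the high powered strategy needs clearly fewer participants for every replicated true discovery than the low powered strategy.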

The reason for the discrepancy between my results and FER’s result is that they do not take replicability into account. This is ironic because their title suggests that they are going to write about replicability, when they actually ignore that results from small studies with low power have low replicability. That is, if we only try to get a result once, it can be more efficient to do so with small, underpowered studies because random sampling error will often dramatically inflate effect sizes and produce a significant result. However, this inflation is not replicable and replication studies are likely to produce non-significant results and cast doubt on the original finding. In other words, they ignore the key characteristic of replicability that replication studies of the same effect should produce significant results again. Thus, FER’s argument is fundamentally flawed because it ignores the very key concept of replicability. Low powered studies are less replicable and original studies that are not replicable make it impossible to create a cumulative science.

The problems of underpowered studies increase exponentially in a research environment that rewards publication of discoveries, whether they are true or false, and provides no incentives for researchers to publish non-significant results, even if these non-significant results challenge the significant results of an original article. Rather than treating these unsuccessful replications as a warning sign that the original results might have been false positives, the non-significant result is treated as evidence that the replication study must have been flawed; after all, the original study found the effect. Moreover, the replication study might just have low power and the effect exists. As a result, false positive results can poison theory development because theories have to explain findings that are actually false positives, and researchers continue to conduct unsuccessful replication studies because they are unaware that other researchers have already failed to replicate an original false positive result. These problems have been discussed at length in recent years, but FER blissfully ignore these arguments and discussions.

Since 2011, psychological science has witnessed major changes in its standard operating procedures—changes that hold great promising for bolstering the replicability of our science. We have come a long way, we hope, from the era in which editors routinely encouraged authors to jettison studies or variables with ambiguous results, the file drawer received only passing consideration, and p<.05 was the statistical holy of holies. We remain, as in FER2015, enthusiastic about such changes. Our goal is to work alongside other meta-scientists to generate an empirically grounded, tradeoff-based framework for improving the overall quality of our science.

That sounds good, but it is not clear what FER bring to the table.

We must focus greater attention on establishing which features are most important in a given research context, the extent to which a given research practice influences the alignment of a collective knowledge base with each of the relevant features, and, all things considered, which research practices are optimal in light of the various tradeoffs involved. Such an approach will certainly prioritize replicability, but it will also prioritize other features of a high-quality science, including discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.

What is lacking here is a demonstration that it is possible to prioritize internal validity, external validity, consequentiality, and cumulativeness without replicability. How do we build on results that emerge only in one out of two, three, or five studies, let alone 1 out of 10 studies? FER create the illusion that we can make more true discoveries by conducting many small studies with low power. This is true in the limited sense of needing fewer participants for an initial discovery. But their own criterion of cumulativeness implies that we are not interested in a single finding that may or may not replicate. To build on original findings, others should be able to redo a study and get a significant result again. This is what Fisher had in mind and what Neyman and Pearson formalized into power analysis.

FER also overlook a much simpler solution to balance the rate of original discovery and replicability. Namely, researchers can increase the type-I error rate from the conventional 5% criterion to 20% (or more). As the type-I error rate increases, power increases. At the same time, readers are properly warned that the results are only suggestive, definitely require further research, and cannot be treated as evidence that needs to be incorporated in a theory. Meanwhile, researchers with large samples do not have to waste their resources on rejecting H0 with alpha = .05 and 99.9% power. They can use their resources to make more definitive statements about their data and reject H0 with a p-value that corresponds to 5 standard deviations of a standard normal (5 sigma rule in particle physics).
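
Both ends of this proposal can be illustrated numerically. The sketch below assumes a small two-group study with N = 40 and d = .4 (the low powered design from the simulation above), a normal approximation to power, and a one-tailed p-value for the 5 sigma criterion; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()

def power_two_groups(total_n, d, alpha):
    """Approximate two-tailed power for a two-group comparison (normal approximation)."""
    ncp = d * sqrt(total_n / 4)  # equals d * sqrt(n_per_group / 2) with equal groups
    z_crit = NORMAL.inv_cdf(1 - alpha / 2)
    return NORMAL.cdf(ncp - z_crit) + NORMAL.cdf(-ncp - z_crit)

# Relaxing alpha raises the power of a small study (assumed N = 40, d = .4).
for alpha in (0.05, 0.20):
    print(f"alpha = {alpha:.2f}: power = {power_two_groups(40, 0.4, alpha):.2f}")

# One-tailed p-value that corresponds to 5 standard deviations of a standard normal.
five_sigma_p = NORMAL.cdf(-5)
print(f"5 sigma corresponds to p = {five_sigma_p:.2e}")
```

Under these assumptions, moving from alpha = .05 to alpha = .20 roughly doubles the power of the small study, while the 5 sigma criterion corresponds to a p-value of roughly 3 in 10 million.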

No matter what the solution to the replicability crisis in psychology is, the solution cannot be a continuation of the old practice to conduct numerous statistical tests on a small sample and then report only the results that are statistically significant at p < .05.  It is unfortunate that FER’s article can be easily misunderstood as suggesting that using small samples and testing for significance with p < .05 and low power can be a viable research strategy in some specific context.  I think they failed to make their case and to demonstrate in which research context this strategy benefits psychology.