Happiness has become a big top in the social sciences. Many universities offer happiness courses that teach how to be happier. Many of the exercises that are being taught in these courses are not based on evidence of effectiveness. I am teaching a different course. The course is an introduction to the science of well-being. The aim of this course is to provide an overview of the empirical research on well-being.

Textbooks that cover well-being science are often written by textbook writers who are not experts on the topic. They are often pretty bad. A better alternative is the free textbook published by well-being expert Ed Diener on his free textbook site Noba publishing (link).

For my students, I wrote my own textbook. It is still a work in progress, but given the costly alternatives, I decided to make it public. As I said, it is a work in progress. I am always looking for ways to improve it and to correct it. Feel free to provide comments in the comment section or by email.

Neither the authors, not the critics appear to be familiar with the statistical concept of power that is being discussed.

The article mentions Jacob Cohen as the pioneer of power analysis only to argue that his recommendation that studies should have 80% power is not applicable to surgical science.

They apparently didn’t read the rest of Cohen’s book on power analysis or any other textbook about statistical power.

Let’s first define statistical power. Statistical power is the long-run proportion of studies with a statistically significant result that one can expect given the sample size and population effect size of a study and the criterion for statistical significance.

Given this definition of power, we can ask whether an 80% success rate is to high and what success rate would be more applicable in studies with small sample sizes. Assuming that sample sizes are fixed by low frequency of events and effect sizes are not under the control of a researcher, we might simply have to accept that power is only 50% or only 20%. There is nothing we can do about it.

What are the implications of conducting significance tests with 20% power? 80% of the studies will produce a type-II error; that is, the test cannot reject the null-hypothesis (e.g., two surgical treatments are equally effective), when the null-hypothesis is actually false (one surgical procedure is better than another). Is it desirable to have an error rate of 80% in surgery studies? This is what the article seems to imply, but it is unlikely that the authors would actually agree with this, unless they are insane.

So, what the authors are really trying to say is probably something like “some data are better than no data and we should be able to report results even if they are based on small samples.” The authors might be surprised that many online trolls would agree with them, while they vehemently disagree with the claim that we can empower studies with small samples by increasing the type-II error rate.

What Cohen really said was that researchers should balance the type-I error risk (concluding one surgical procedure is better than the other) when this is actually not the case (both surgical procedures are approximately equally effective) and the type-II error risk (the reverse error).

To balance error probabilities, researchers should set the criterion for statistical significance according to the risk of drawing false conclusions (Lakens et al., 2018). In small samples with modest effect sizes, a reasonable balance of type-I and type-II errors is achieved by increasing the type-I risk from the standard criterion of alpha = .05 to, say, alpha = .20, or if necessary even to alpha = .50.

Changing alpha is the only way to empower small studies to produce significant results. Somehow the eight authors, the reviewers, and the editor of the target article missed this basic fact about statistical power.

In conclusion, the article is another example that applied researchers receive poor training in statistics and that the concept of statistical power is poorly understood. Jacob Cohen made an invaluable contribution to statistics by popularizing Neyman-Pearson’s extension of null-hypothesis testing by considering type-II error probabilities. However, his work is not finished and it is time for statistics textbooks and introductory statistics courses to teach statistical power so that mistakes like this article will not happen again. Nobody should think that it is desirable to run studies with less than 50% power (Tversky & Kahneman, 1971). Setting alpha to 5% even if this implies that a study has a high chance of producing a type-II error is insane and may even be considered unethical, especially in surgery where a better procedure may save lives.

It is well known that focal
hypothesis tests in psychology journals nearly always reject the
null-hypothesis (Sterling, 1959; Sterling et al., 1995). However, meta-analyses
often contain a fairly large number of non-significant results. To my
knowledge, the emergence of non-significant results in meta-analysis has not
been examined systematically (happy to be proven wrong). Here I used the
extremely well-done meta-analysis of money priming studies to explore this
issue (Lodder, Ong, Grasman, & Wicherts, 2019).

I downloaded their data and computed z-scores by (1) dividing Cohen’s d by sampling errror (2/sqrt(N)) to compute t-values, (2) convert the absolute t-values into two-sided p-values, and (3) converting the p-values into absolute z-scores. The z-scores were submitted to a z-curve analysis (Brunner & Schimmack, 2019).

The first figure shows the z-curve
for all test-statistics. Out of 282 tests, only 116 (41%) are significant. This
finding is surprising, given the typical discovery rates over 90% in psychology
journals. The figure also shows that the observed discovery rate of 41% is
higher than the expected discovery rate of 29%, although the difference is
relatively small and the confidence intervals overlap. This might suggest that
publication bias in the money priming literature is not a serious problem. On
the other hand, meta-analysis may mask the presence of publication bias in the
published literature for a number of reasons.

Published
vs. Unpublished Studies

Publication bias implies that studies with non-significant results end up in the proverbial file-drawer. Meta-analysts try to correct for publication bias by soliciting unpublished studies. The money-priming meta-analysis included 113 unpublished studies.

Figure 2 shows the z-curve for these studies. The observed discovery rate is slightly lower than for the full set of studies, 29%, and more consistent with the expected discovery rate, 25%. Thus, there this set of studies appears to be unbiased.

The complementary finding for published studies (Figure 3) is that the observed discovery rate increases, 49%, while the expected discovery rate remains low, 31%. Thus, published articles report a higher percentage of significant results without more statistical power to produce significant results.

A
New Type of Publications: Independent Replication Studies

In response to concerns about publication bias and questionable research practices, psychology journals have become more willing to publish null-results. An emerging format are pre-registered replication studies with the explicit aim of probing the credibility of published results. The money priming meta-analysis included 47 independent replication studies.

Figure 4 shows that independent replication studies had a very low observed discovery rate, 4%, that is matched by a very low expected discovery rate, 5%. It is remarkable that the discovery rate for replication studies is lower than the discovery rate for unpublished studies. One reason for this discrepancy is that significance alone is not sufficient to get published and authors may be selective in the sharing of unpublished results.

Removing independent replication studies from the set of published studies further increases the observed discovery rate, 66%. Given the low power of replication studies, the expected discovery rate also increases somewhat, but it is notably lower than the observed discovery rate, 35%. The difference is now large enough to be statistically significant, despite the rather wide confidence interval around the expected discovery rate estimate.

Coding
of Interaction Effects

After a (true or false) effect has
been established in the literature, follow up studies often examine boundary
conditions and moderators of an effect. Evidence for moderation is typically
demonstrated with interaction effects that are sometimes followed by contrast
analysis for different groups. One way to code these studies would be to focus
on the main effect and to ignore the moderator analysis. However, meta-analysts
often split the sample and treat different subgroups as independent samples.
This can produce a large number of non-significant results because a moderator
analysis allows for the fact that the effect emerged only in one group. The
resulting non-significant results may provide false evidence of honest
reporting of results because bias tests rely on the focal moderator effect to
examine publication bias.

The next figure is based on studies that involved an interaction hypothesis. The observed discovery rate, 42%, is slightly higher than the expected discovery rate, 25%, but bias is relatively mild and interaction effects contribute 34 non-significant results to the meta-analysis.

The analysis of the published main effect shows a dramatically different pattern. The observed discovery rate increased to 56/67 = 84%, while the expected discovery rate remained low with 27%. The 95%CI do not overlap, demonstrating that the large file-drawer of missing studies is not just a chance finding.

I also examined more closely the 7
non-significant results in this set of studies.

Gino and Mogliner (2014) reported results of a money priming study with cheating as the dependent variable. There were 98 participants in 3 conditions. Results were analyzed with percentage of cheating participants and extent of cheating. The percentage of cheating participants produced a significant contrast of the money priming and control condition, chi2(1, N = 65) = 3.97. However, the meta-analysis used the extent of cheating dependent variable, which should only a marginally significant effect with a one-tailed p-value of .07. “Simple contrasts revealed that participants cheated more in the money condition (M = 4.41, SD = 4.25) than in both the control condition (M = 2.76, SD = 3.96; p = .07) and the time condition (M = 1.55, SD = 2.41; p = .002).” Thus, this non-significant results was presented as supporting evidence in the original article.

Jin, Z., Shiomura, K., & Jiang, L. (2015) conducted a priming studies with reaction times as dependent variables. This design is different from social priming studies in the meta-analysis. Moreover, money priming effects were examined within-participants, and the study produced several significant complex interaction effects. Thus, this study also does not count as a published failure to replicate money priming effects.

Mukherjee, S., Nargundkar, M., & Manjaly, J. A. (2014) examined the influence of money primes on various satisfaction judgments. Study 1 used a small sample of N = 48 participants with three dependent variables. Two achieved significance, but the meta-analysis aggregated across DVs, which resulted in a non-significant outcome. Study 2 used a larger sample and replicated significance for two outcomes. It was not included in the meta-analysis. In this case, aggregation of DVs explains a non-significant result in the meta-analysis, while the original article reported significant results.

I was unable to retrieve this article, but the abstract suggests that the article reports a significant interaction. ” We found that although money-primed reactance in control trials in which the majority provided correct responses, this effect vanished in critical trials in which the majority provided incorrect answers.” [https://www.sbp-journal.com/index.php/sbp/article/view/3227]

Wierzbicki, J., & Zawadzka, A. (2014) published two studies. Study 1 reported a significant result. Study 2 added a non-significant result to the meta-analysis. Although the effect for money priming was not significant, this study reported a significant effect for credit-card priming and a money priming x morality interaction effect. Thus, the article also did not report a money-priming failure as the key finding.

Gasiorowska, A. (2013) is an article in Polish.

is a duplication of article 5

In conclusion, none of the 7 studies with non-significant results in the meta-analysis that were published in a journal reported that money priming had no effect on a dependent variable. All articles reported some significant results as the key finding. This further confirms how dramatically publication bias distorts the evidence reported in psychology journals.

Conclusion

In this blog post, I examined the
discrepancy between null-results in journal articles and in meta-analysis,
using a meta-analysis of money priming. While the meta-analysis suggested that
publication bias is relatively modest, published articles showed clear evidence
of publication bias with an observed discovery rate of 89%, while the expected
discovery rate was only 27%.

Three factors contributed to this
discrepancy: (a) the inclusion of unpublished studies, (b) independent
replication studies, and (c) the coding of interaction effects as separate
effects for subgroups rather than coding the main effect.

After correcting for publication
bias, expected discovery rates are consistently low with estimates around 30%.
The main exception are the independent replication studies that found no
evidence at all. Overall, these results confirm that published money priming
studies and other social priming studies cannot be trusted because the
published studies overestimate replicability and effect sizes.

It is not the aim of this blog post
to examine whether some money priming paradigms can produce replicable effects.
The main goal was to explain why publication bias in meta-analysis is often
small, when publication bias in the published literature is large. The results
show that several factors contribute to this discrepancy and that the inclusion
of unpublished studies, independent replication studies, and coding of effects
explain most of these discrepancies.

Here is a link to the manuscript, data, and MPLUS scripts for reproducibility. https://osf.io/mu7e6/

ABSTRACT

Greenwald et al. (1998) proposed that the IAT measures
individual differences in implicit social cognition. This claim requires evidence of construct validity.
I review the evidence and show that there is insufficient evidence for this
claim. Most important, I show that few
studies were able to test discriminant validity of the IAT as a measure of
implicit constructs. I examine discriminant validity in several multi-method
studies and find no or weak evidence for discriminant validity. I also show
that validity of the IAT as a measure of attitudes varies across constructs. Validity
of the self-esteem IAT is low, but estimates vary across studies. About 20% of the variance in the race IAT
reflects racial preferences. The highest validity is obtained for measuring
political orientation with the IAT (64% valid variance). Most of this valid variance stems from a
distinction between individuals with opposing attitudes, while reaction times
contribute less than 10% of variance in the prediction of explicit attitude
measures. In all domains, explicit
measures are more valid than the IAT, but the IAT can be used as a measure of
sensitive attitudes to reduce measurement error by using a multi-method
measurement model.

Despite its popularity, relatively little is known about the construct validity of the IAT.

As Cronbach (1989) pointed out, construct validation is better examined by independent experts than by authors of a test because “colleagues are especially able to refine the interpretation, as they compensate for blind spots and capitalize on their own distinctive experience” (p. 163).

It is of utmost importance to determine how much of the variance in IAT scores is valid variance and how much of the variance is due to measurement error, especially when IAT scores are used to provide individualized feedback.

There is also no consensus in the literature whether the IAT measures something different from explicit measures.

In conclusion, while there is general consensus to make a distinction between explicit measures and implicit measures, it is not clear what the IAT measures

To complicate matters further, the validity of the IAT may vary across attitude objects. After all the IAT is a method, just like Likert scales are a method, and it is impossible to say that a method is valid (Cronbach, 1971).

At present, relatively little is known about the contribution of these three parameters to observed correlations in hundreds of mono-method studies.

A Critical Review
of Greenwald et al.’s (1998) Original Article

In conclusion, the seminal IAT article introduced the IAT as a measure of implicit constructs that cannot be measured with explicit measures, but it did not really test this dual-attitude model.

Construct Validity in 2007

In conclusion, the 2007 review of construct validity revealed major psychometric challenges for the construct validity of the IAT, which explains why some researchers have concluded that the IAT cannot be used to measure individual differences (Payne et al., 2017). It also revealed that most studies were mono-method studies that could not examine convergent and discriminant validity

Cunningham, Preacher and Banaji (2001)

Another noteworthy finding is that a single factor accounted for correlations among all measures on the same occasion and across measurement occasions. This finding shows that there were no true changes in racial attitudes over the course of this two-month study. This finding is important because Cunningham et al.’s (2001) study is often cited as evidence that implicit attitudes are highly unstable and malleable (e.g., Payne et al., 2017). This interpretation is based on the failure to distinguish random measurement error and true change in the construct that is being measured (Anusic & Schimmack, 2016). While Cunningham et al.’s (2001) results suggest that the IAT is a highly unreliable measure, the results also suggest that the racial attitudes that are measured with the race IAT are highly stable over periods of weeks or months.

Bar-Anan & Vianello, 2018

this large study of construct validity also provides little evidence for the original claim that the IAT measures a new construct that cannot be measured with explicit measures, and confirms the estimate from Cunningham et al. (2001) that about 20% of the variance in IAT scores reflects variance in racial attitudes.

Greenwald et al. (2009)

“When entered after the self-report measures, the two implicit measures incrementally explained 2.1% of vote intention variance, p=.001, and when political conservativism was also included in the model, “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05.” (Greenwald et al., 2009, p. 247).

I tried to reproduce these results with the published correlation matrix and failed to do so. I contacted Anthony Greenwald, who provided the raw data, but I was unable to recreate the sample size of N = 1,057. Instead I obtained a similar sample size of N = 1,035. Performing the analysis on this sample also produced non-significant results (IAT: b = -.003, se = .044, t = .070, p = .944; AMP: b = -.014, se = .042, t = 0.344, p = .731). Thus, there is no evidence for incremental predictive validity in this study.

Axt (2018)

With N = 540,723 respondents, sampling error is very small, σ = .002, and parameter estimates can be interpreted as true scores in the population of Project Implicit visitors. A comparison of the factor loadings shows that explicit ratings are more valid than IAT scores. The factor loading of the race IAT on the attitude factor once more suggests that about 20% of the variance in IAT scores reflects racial attitudes

Falk, Heine, Zhang, and Hsu (2015)

Most important, the self-esteem IAT and the other implicit measures have low and non-significant loadings on the self-esteem factor.

Bar-Anan & Vianello (2018)

Thus, low validity contributes considerably to low observed correlations between IAT scores and explicit self-esteem measures.

Bar-Anan & Vianello (2018) – Political Orientation

More important, the factor loading of the IAT on the implicit factor is much higher than for self-esteem or racial attitudes, suggesting over 50% of the variance in political orientation IAT scores is valid variance, π = .79, σ = .016. The loading of the self-report on the explicit ratings was also higher, π = .90, σ = .010

Variation of Implicit – Explicit Correlations Across Domains

This suggests that the IAT is good in classifying individuals into opposing groups, but it has low validity of individual differences in the strength of attitudes.

What Do IATs Measure?

The present results suggest that measurement error alone is often sufficient to explain these low correlations. Thus, there is little empirical support for the claim that the IAT measures implicit attitudes that are not accessible to introspection and that cannot be measured with self-report measures.

For 21 years the lack of discriminant validity has been overlooked because psychologists often fail to take measurement error into account and do not clearly distinguish between measures and constructs.

In the future, researchers need to be more careful when they make claims about constructs based on a single measure like the IAT because measurement error can produce misleading results.

Researchers should avoid terms like implicit attitude or implicit preferences that make claims about constructs simply because attitudes were measured with an implicit measure

Recently, Greenwald and Banaji (2017) also expressed concerns about their earlier assumption that IAT scores reflect unconscious processes. “Even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning, they strongly endorse the empirical understanding of the implicit– explicit distinction” (p. 862).

How
Well Does the IAT Measure What it Measures?

Studies with the IAT can be divided into applied studies (A-studies) and basic studies (B-studies). B-studies employ the IAT to study basic psychological processes. In contrast, A-studies use the IAT as a measure of individual differences. Whereas B-studies contribute to the understanding of the IAT, A-studies require that IAT scores have construct validity. Thus, B-studies should provide quantitative information about the psychometric properties for researchers who are conducting A-studies. Unfortunately, 21 years of B-studies have failed to do so. For example, after an exhaustive review of the IAT literature, de Houwer et al. (2009) conclude that “IAT effects are reliable enough to be used as a measure of individual differences” (p. 363). This conclusion is not helpful for the use of the IAT in A-studies because (a) no quantitative information about reliability is given, and (b) reliability is necessary but not sufficient for validity. Height can be measured reliably, but it is not a valid measure of happiness.

This article provides the first quantitative information about validity of three IATs. The evidence suggests that the self-esteem IAT has no clear evidence of construct validity (Falk et al., 2015). The race-IAT has about 20% valid variance and even less valid variance in studies that focus on attitudes of members from a single group. The political orientation IAT has over 40% valid variance, but most of this variance is explained by group-differences and overlaps with explicit measures of political orientation. Although validity of the IAT needs to be examined on a case by case basis, the results suggest that the IAT has limited utility as a measurement method in A-studies. It is either invalid or the construct can be measured more easily with direct ratings.

Implications for the Use of IAT scores in Personality Assessment

I suggest to replace the reliability coefficient with the validity coefficient. For example, if we assume that 20% of the variance in scores on the race IAT is valid variance, the 95%CI for IAT scores from Project Implicit (Axt, 2018), using the D-scoring method, with a mean of .30 and a standard deviation of.46 ranges from -.51 to 1.11. Thus, participants who score at the mean level could have an extreme pro-White bias (Cohen’s d = 1.11/.46 = 2.41), but also an extreme pro-Black Bias (Cohen’s d = -.51/.46 = -1.10). Thus, it seems problematic to provide individuals with feedback that their IAT score may reveal something about their attitudes that is more valid than their beliefs.

Conclusion

Social psychologists have always distrusted self-report, especially for the measurement of sensitive topics like prejudice. Many attempts were made to measure attitudes and other constructs with indirect methods. The IAT was a major breakthrough because it has relatively high reliability compared to other methods. Thus, creating the IAT was a major achievement that should not be underestimated because the IAT lacks construct validity as a measure of implicit constructs. Even creating an indirect measure of attitudes is a formidable feat. However, in the early 1990s, social psychologists were enthralled by work in cognitive psychology that demonstrated unconscious or uncontrollable processes (Greenwald & Banaji, 1995). Implicit measures were based on this work and it seemed reasonable to assume that they might provide a window into the unconscious (Banaji & Greenwald, 2013). However, the processes that are involved in the measurement of attitudes with implicit measures are not the personality characteristics that are being measured. There is nothing implicit about being a Republican or Democrat, gay or straight, or having low self-esteem. Conflating implicit processes in the measurement of attitudes with implicit personality constructs has created a lot of confusion. It is time to end this confusion. The IAT is an implicit measure of attitudes with varying validity. It is not a window into people’s unconscious feelings, cognitions, or attitudes.

Social psychology textbook like colorful laboratory experiments that illustrate a theoretical point. As famous social psychologist Daryl Bem stated, he considered his experiments more illustrations of what could happen than empirical tests of what actually happens. Unfortunately, social psychology textbooks make it less obvious that the results of highlighted studies should not be generalized to real life.

Myers and Twenge (2019) tell the story of fishy smells.

In a laboratory experiment, exposure to a fishy smell caused people to be suspicious of each other and cooperate less—priming notions of a shady deal as “fishy” (Lee & Schwarz, 2012). All these effects occurred without the participants’ conscious awareness of the scent and its influence.

They don’t even mention some other fun facts about this study. To make sure that the effect is not just a mood effect induced by bad odors in general, fishy smells were contrasted with fart smells, and the effect seemed to be limited to fishy smells.

The article was published in the top journal for experimental social psychology (JPSP:ASC) and is relatively highly cited.

However, the studies reported in this article smell a bit fishy and should be consumed with a grain of salt and a lot of lemon. The problem is that all of the results are significant, which is highly unlikely unless studies have very high statistical power (Schimmack, 2012).

And it even works the other way around.

And making people think about suspicion, also makes them think about fish, in theory.

Suspicion also makes you be more sensitive to fishy smells.

Undergraduate students may not realize what the problem with these studies is. After all, they all worked out; that is they produced a p-value less than .05, which is supposed to ensure that no more than 1 out of 20 studies are a false positive result. As all of these studies are significant, it is extremely unlikely that all of them are false positives. So, we would have to infer that suspicion is related to fishy smells in our minds.

However, since 2012 it is clear that we have to draw another conclusion. The reason is that results in social psychology articles like this one smell fishy and suggest that the authors are telling us a fun story, but they are not telling us what really happened in their lab. It is extremely unlikely that the authors reported all of their studies and data analyses that they conducted. Instead they may have used a variety of so-called questionable research practices that increase the chances of reporting a significant result. Questionable research practices are also known as fishing for significance. These questionable research practices have the undesirable effect that they increase the type-I error rate. Thus, while the reported p-values are below .05, the risk of a false positive result is not and could be as high as 100%.

To demonstrate that researchers used questionable research practices, we can conduct a bias test. The most powerful bias test for small sets of studies is the Test of Insufficient Variance. When most p-values are just significant , p < .05 and p > .005, but always significant the results are not trustworthy because sampling error should produce more variability than we see.

The table lists the test statistics, converts the two-tailed p-values into z-scores and computes the variance of the z-scores. The variance is expected to be 1, but the actual variance is only 0.14. A chi-square test shows that this deviation is significant with p = .01. Thus, we have scientific evidence to claim that these results smell a bit fishy.

Study

test

value

df

p

z

1

t

2.22

42

0.032

2.15

2

t

2.01

79

0.048

1.98

3a

chisq

4.27

1

0.039

2.07

3b

chisq

6.28

1

0.012

2.51

3c

chisq

7.77

1

0.005

2.79

5

F

8.24

116

0.005

2.82

6

F

3.93

1614

0.048

1.98

VAR(z)

0.14

TIVA

0.01

Unfortunately, these results are not the only fishy results in social psychology textbooks. Thus, students of social psychology should read textbook claims with a healthy dose of skepticism. They should also ask their professors to provide information about the replicability of textbook findings. Has this study been replicated in a preregistered replication attempt? Would you think you could replicate this result in your own lab? It is time to get rid of the fishy smell and let the fresh wind of open science clean up social psychology.

We can only hope that sooner than later, articles like this will sleep with the fishes.

Every social psychology textbook emphasizes the problem of naturalistic studies (correlational research) that it is difficult to demonstrate cause-effect relationships in these studies.

Social psychology has a proud tradition of addressing this problem with laboratory experiments. The advantage of laboratory experiments is that they make it easy to demonstrate causality. The disadvantage is that laboratory experiments have low ecological validity. It is therefore important to demonstrate that findings from laboratory experiments generalize to real world behavior.

Myers and Twenge’s (2019) textbook (13e edition) addresses this issue in a section called “Generalizing from Laboratory to Life”

What people saw in everyday life suggested correlational research, which led to experimental research. Network and government policymakers, those with the power to make changes, are now aware of the results. In many areas, including studies of helping, leadership style, depression, and self-efficacy, effects found in the lab have been mirrored by effects in the field, especially when the laboratory effects have been large (Mitchell, 2012).

Mitchell, G. (2012). Revisiting truth or triviality: The external validity of research in the psychological laboratory. Perspectives on Psychological Science, 7, 109–117.

Curious about the evidence, I examined Mitchell’s article. I didn’t need to read beyond the abstract to see that the textbook misrepresented Mitchell’s findings.

Using 217 lab-field comparisons from 82 meta-analyses found that the external validity of laboratory research differed considerably by psychological subfield, research topic, and effect size. Laboratory results from industrial-organizational psychology most reliably predicted field results, effects found in social psychology laboratories most frequently changed signs in the field (from positive to negative or vice versa), and large laboratory effects were more reliably replicated in the field than medium and small laboratory effects.

Mitchell, G. (2012). Revisiting Truth or Triviality: The External Validity of Research in the Psychological Laboratory. Perspectives on Psychological Science, 7(2), 109–117. https://doi.org/10.1177/1745691611432343

So, a course in social psychology covers 80% results based on laboratory experiments that may not generalize to the real world. In addition, students are given the false information that these results do generalize to the real world, when evidence of ecological validity is often missing. On top of this, many articles based on laboratory experiments report inflated effect sizes due to selection for significance and the results may not even replicate in other laboratory contexts.

Over the past years, psychologists have become increasingly concerned about the credibility of published results. The credibility crisis started in 2011, when Bem published incredible results that seemed to suggest that humans can foresee random future events. Bem’s article revealed fundamental flaws in the way psychologists conduct research. The main problem is that psychology journals only publish statistically significant results (Sterling, 1959). If only significant results are published, all hypotheses will receive empirical support as long as they are tested. This is akin to saying that everybody has a 100% free throw average or nobody ever makes a mistake if we do not count failures.

The main problem of selection for significance is that we do not know the real strength of evidence that empirical studies provide. Maybe the selection effect is small and most studies would replicate. However, it is also possible that many studies might fail a replication test. Thus, the crisis of confidence is a crisis of uncertainty.

The Open Science Collaboration conducted actual replication studies to estimate the replicability of psychological science. They replicated 97 studies with statistically significant results and were able to reproduce 35 significant results (a 36% success rate). This is a shockingly low success rate. Based on this finding, most published results cannot be trusted, especially because there is heterogeneity across studies. Some studies would have an even lower chance of replication and several studies might even be outright false positives (there is actually no real effect).

As important as this project was to reveal major problems with the research culture in psychological science, there are also some limitations that cast doubt about the 36% estimate as a valid estimate of the replicability of psychological science. First, the sample size is small and sampling error alone might have lead to an underestimation of the replicability in the population of studies. However, sampling error could also have produced a positive bias. Another problem is that most of the studies focused on social psychology and that replicability in social psychology could be lower than in other fields. In fact, a moderator analysis suggested that the replication rate in cognitive psychology is 50%, while the replication rate in social psychology is only 25%. The replicated studies were also limited to a single year (2008) and three journals. It is possible that the replication rate has increased since 2008 or could be higher in other journals. Finally, there have been concerns about the quality of some of the replication studies. These limitations do not undermine the importance of the project, but they do imply that the 36% estimate is an estimate and that it may underestimate the replicability of psychological science.

Over the past years, I have been working on an alternative approach to estimate the replicability of psychological science. This approach starts with the simple fact that replicabiliity is tightly connected to the statistical power of a study because statistical power determines the long-run probability of producing significant results (Cohen, 1988). Thus, estimating statistical power provides valuable information about replicability. Cohen (1962) conducted a seminal study of statistical power in social psychology. He found that the average power to detect an average effect size was around 50%. This is the first estimate of replicability of psychological science, although it was only based on one journal and limited to social psychology. However, subsequent studies replicated Cohen’s findings and found similar results over time and across journals (Sedlmeier & Gigerenzer, 1989). It is noteworthy that the 36% estimate from the OSC project is not statistically different from Cohen’s estimate of 50%. Thus, there is convergent evidence that replicability in social psychology is around 50%.

In collaboration with Jerry Brunner, I have developed a new method that can estimate mean power for a set of studies that are selected for significance and that vary in effect sizes and samples sizes, which produces heterogeneity in power (Brunner & Schimmack, 2018). The input for this method are the actual test statistics of significance tests (e.g., t-tests, F-tests). These test-statistics are first converted into two-tailed p-values and then converted into absolute z-scores. The magnitude of these absolute z-scores provides information about the strength of evidence against the null-hypotheses. The histogram of these z-scores, called a z-curve, is then used to fit a finite mixture model to the data that estimates mean power, while taking selection for significance intro account. Extensive simulation studies demonstrate that z-curve performs well and provides better estimates than alternative methods. Thus, z-curve is the method of choice for estimating the replicability of psychological science on the basis of the test statistics that are reported in original articles.

For this blog post, I am reporting results based on preliminary results from a large project that extracts focal hypothesis from a broad range of journals that cover all areas of psychology for the years 2010 to 2017. The hand-coding of these articles complements a similar project that relies on automatic extraction of test statistics (Schimmack, 2018).

Table 1 shows the journals that have been coded so far. It also shows the estimates based on the automated method and for hand-coding of focal hypotheses.

Journal

Hand

Automated

Psychophysiology

84

75

Journal of Abnormal Psychology

76

68

Journal of Cross-Cultural Psychology

73

77

Journal of Research in Personality

68

75

J. Exp. Psych: Learning, Memory, &
Cognition

58

77

Journal of Experimental Social Psychology

55

62

Infancy

53

68

Behavioral Neuroscience

53

68

Psychological Science

52

66

JPSP-Interpersonal Relations & Group
Processes

33

63

JPSP-Attitudes and Social Cognition

30

65

Mean

58

69

Hand coding of focal hypothesis produces lower estimates than the automated method because the automated analysis also codes manipulation checks and other highly significant results that are not theoretically important. The correlation between the two methods shows consistency across the two methods, r = .67. Finally, the mean for the automated method, 69%, is close to the mean for over 100 journals, 72%, suggesting that the sample of journals is an unbiased sample.

The hand coding results also confirm results found with the automated method that social psychology has a lower replicability than some other disciplines. Thus, the OSC reproducibility results that are largely based on social psychology should not be used to make claims about psychological science in general.

The figure below shows the output of the latest version of z-curve. The first finding is that the replicability estimate for all 1,671 focal tests is 56% with a relatively tight confidence interval ranging from 45% to 56%. ZZZ The next finding is that the discovery rate or success rate is 92%, using p < .05 as the criterion. This confirms that psychology journals continue to published results are selected for significance (Sterling, 1959). The histogram further shows that even more results would be significant if p-values below .10 are included as evidence for “marginal significance.”

Z-Curve.19.1 also provides an estimate of the size of the file drawer. It does so by projecting the distribution of observed significant results into the range of non-significant results (grey curve). The file drawer ratio shows that for every published result, we would expect roughly two unpublished studies with non-significant results. However, z-curve cannot distinguish between different questionable research practices. Rather than not disclosing failed studies researchers may not disclose other statistical analyses within a published study to report significant results.

Z-Curve.19.1 also provides an estimate of the false positive rate (FDR). FDR is the percentage of significant results that may arise from testing a true nil-hypothesis, where the population effect size is zero. For a long time, the consensus has been that false positives are rare because the nil-hypothesis is rarely true (Cohen, 1994). Consistent with this view, Soric’s estimate of the maximum false discovery rate is only 10% with a tight CI ranging from 8% to 16%.

However, the focus on the nil-hypothesis is misguided because it treats tiny deviations from zero as true hypotheses even if the effect size has no practical or theoretical significance. These effect sizes also lead to low power and replication failures. Therefore, Z-Curve 19.1 also provides an estimate of the FDR that treats studies with very low power as false positives. This broader definition of false positives raises the FDR estimate slightly, but 15% is still a low percentage. Thus, the modest replicability of results in psychological science is mostly due to low statistical power to detect true effects rather than a high number of false positive discoveries.

The reproducibility project showed that studies with low p-values were more likely to replicate. This relationship follows from the influence of statistical power on p-values and replication rates. To achieve a replication rate of 80%, p-values had to be less than .00005 or the z-score had to exceed 4 standard deviations. However, this estimate was based on a very small sample of studies. Z-Curve.19.1 also provides estimates of replicability for different levels of evidence. These values are shown below the x-axis. Consistent with the OSC results, a replication rate over 80% is only expected once z-scores are greater than 4.

The results also provide information about the choice of the alpha criterion to draw inferences from significance tests in psychology. To do so, it is important to distinguish observed p-values and type-I probabilities. For a single unbiased tests, we can infer from an observed p-value less than .05 that the risk of a false positive result is less than 5%. However, when multiple comparisons are made or results are selected for significance, an observed p-values less than .05 does not imply that the type-I error risk is below .05. To claim a type-I error risk of 5% or less, we have to correct the observed p-values, just like a Bonferroni correction. As 50% power corresponds to statistical significance, we see that z-scores between 2 and 3 are not statistically significant; that is, the type-I error risk is greater than 5%. Thus, the standard criterion to claim significance with alpha = .05 is a p-value of .003. Given the popularity of .005, I suggest to use p = .005 as a criterion for statistical significance. However, this claim is not based on lowering the criterion for statistical significance because p < .005 still only allows to claim that the type-I error probability is less than 5%. The need for a lower criterion value stems from the inflation of the type-I error rate due to selection for significance. This is a novel argument that has been overlooked in the significance wars, which ignored the influence of publication bias on false positive risks.

Finally, z-curve.19.1 makes it possible to examine the robustness of the estimates by using different selection criteria. One problem with selection models is that p-values just below .05, say in the .01 to .05 range, can arise from various questionable research practices that have different effects on replicability estimates. To address this problem, it is possible to estimate the density with a different selection criterion, while still estimating the replicability with alpha = .05 as the criterion. Figure 2 shows the results by using only z-scores greater than 2.5, p = .012) to fit the observed z-curve for z-scores greater than 2.5.

The blue dashed line at z = 2.5 shows the selection criterion. The grey curve between 1.96 and 2.5 is projected form the distribution for z-scores greater than 2.5. Results show a close fit with the observed distribution. A s a result, the parameter estimates are also very similar. Thus, the results are robust and the selection model seems to be reasonable.

Conclusion

Psychology is in a crisis of confidence about the credibility of published results. The fundamental problems are as old as psychology itself. Psychologists have conducted low powered studies and selected only studies that worked for decades (Cohen, 1962; Sterling, 1959). However, awareness of these problems has increased in recent years. Like many crises, the confidence crisis in psychology has created confusion. Psychologists are aware that there is a problem, but they do not know how large the problem is. Some psychologists believe that there is no crisis and pretend that most published results can be trusted. Others are worried that most published results are false positives. Meta-psychologists aim to reduce the confusion among psychologists by applying the scientific method to psychological science itself.

This blog post provided the most comprehensive assessment of the replicability of psychological science so far. The evidence is largely consistent with previous meta-psychological investigations. First, replicability is estimated to be slightly above 50%. However, replicability varies across discipline and the replicability of social psychology is below 50%. The fear that most published results are false positives is not supported by the data. Replicability increases with the strength of evidence against the null-hypothesis. If the p-value is below .00001, studies are likely to replicate. However, significant results with p-values above .005 should not be considered statistically significant with an alpha level of 5%, because selection for significance inflates the type-I error. Only studies with p < .005 can claim statistical significance with alpha = .05.

The correction for publication bias implies that researchers have to increase sample sizes to meet the more stringent p < .005 criterion. However, a better strategy is to preregister studies to ensure that reported results can be trusted. In this case, p-values below .05 are sufficient to demonstrate statistical significance with alpha = .05. Given the low prevalence of false positives in psychology, I do see no need to lower the alpha criterion.

Future Directions

This blog post is just an interim report. The final project requires hand-coding of a broader range of journals. Readers who think that estimating the replicability of psychological science is beneficial and who want information about a particular journal are invited to collaborate on this project and can obtain authorship if their contribution is substantial enough to warrant authorship. Please consider taking part in this project. Although it is a substantial time commitment, it doesn’t require participants or materials that are needed for actual replication studies. Please consider taking part in this project. Contact me, if you are interested and want to know how you can get involved.