Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).

BLOGS BY YEAR:  2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 

 

TOP TEN BLOGS


  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 psychology journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presented the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal.  The post explains the new method for estimating observed power from the distribution of test statistics converted into absolute z-scores.  The method has since been developed further to estimate power for a wider range of z-scores by allowing for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.


3. An Introduction to the R-Index

 

The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
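For readers who like to see the arithmetic, here is a minimal sketch in R with made-up z-scores. It assumes two-sided z-tests and assumes that inflation is computed as the difference between the success rate and median observed power; the actual R-Index calculations may differ in detail.

# made-up absolute z-scores from a set of published studies
z <- c(2.10, 2.45, 1.98, 2.60, 2.20)
success_rate <- mean(z > qnorm(.975))                          # proportion significant at p < .05, two-tailed
obs_power <- pnorm(z - qnorm(.975)) + pnorm(-z - qnorm(.975))  # observed power of each two-sided z-test
median_power <- median(obs_power)
inflation <- success_rate - median_power                       # assumed definition of inflation
r_index <- median_power - inflation
r_index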


4.  The Test of Insufficient Variance (TIVA)

 

The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, the z-scores are expected to have a variance of one.  Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
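A minimal sketch of this test in R, assuming the left-tailed chi-square test uses k - 1 degrees of freedom (the z-scores are made up and chosen to be suspiciously homogeneous):

# z-scores from k independent studies that are all just significant
z <- c(2.01, 2.15, 1.99, 2.30, 2.08)
k <- length(z)
v <- var(z)                            # observed variance; far below the expected value of 1
p <- pchisq(v * (k - 1), df = k - 1)   # left-tailed p-value; a small p suggests bias
c(variance = v, p.value = p)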

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been made to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.
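For readers who prefer R to a GPower screenshot, base R’s power.t.test() performs the same kind of calculation (the effect size and sample size below are purely illustrative):

# power of a two-sample t-test with n = 20 per group and a medium effect (d = .5)
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = .05, type = "two.sample")
# solving for n instead: the sample size per group needed for 80% power
power.t.test(power = .80, delta = 0.5, sd = 1, sig.level = .05, type = "two.sample")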


8.  The Problem with Bayesian Null-Hypothesis Testing

 

Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
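To make the dependence on the alternative hypothesis concrete, here is a minimal sketch in R (not Bem’s actual analysis): a small observed effect is compared against a point null and against normal priors of different widths for the alternative. The numbers are made up; only the pattern matters.

# made-up summary data: observed effect size d = .2 with standard error .1
d_obs <- 0.2
se    <- 0.1
bf01 <- function(prior_sd) {
  dnorm(d_obs, 0, se) /                     # likelihood of the data if the effect is exactly zero
  dnorm(d_obs, 0, sqrt(se^2 + prior_sd^2))  # marginal likelihood under a normal prior on the effect size
}
round(sapply(c(0.2, 0.5, 1), bf01), 2)      # wider priors (expecting large effects) favor the null more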

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

The Power-Corrected H-Index

I was going to write this blog post eventually, but the online first publication of Radosic and Diener’s (2021) article “Citation Metrics in Psychological Science” provided a good opportunity to do so now.

Radosic and Diener’s (2021) article’s main purpose was to “provide norms to help evaluate the citation counts of psychological scientists” (p. 1). The authors also specify the purpose of these evaluations. “Citation metrics are one source of information that can be used in hiring, promotion, awards, and funding, and our goal is to help these evaluations” (p. 1).

The authors caution readers that they are agnostic about the validity of citation counts as a measure of good science. “The merits and demerits of citation counts are beyond the scope of the current article” (p. 8). Yet, they suggest that “there is much to recommend citation numbers in evaluating scholarly records” (p. 11).

At the same time, they list some potential limitations of using citation metrics to evaluate researchers.

1. Articles that developed a scale can have high citation counts. For example, Ed Diener has over 71,000 citations. His most cited article is the 1985 article with his Satisfaction with Life Scale. With 12,000 citations, it accounts for 17% of his citations. The fact that articles that published a measure have such high citation counts reflects a problem in psychological science. Researchers continue to use the first measure that was developed for a new construct (e.g., Rosenberg’s 1965 self-esteem scale) instead of improving measurement, which would lead to citations of newer articles. So, the high citation counts of articles with scales are a problem, but they are only a problem if raw citation counts are used as a metric. A better metric is the H-Index, which takes the number of publications and citations into account. Ed Diener also has a very high H-Index of 108 publications with 108 or more citations. His scale article is only one of these articles. Thus, scale development articles are not a major problem.

2. Review articles are cited more heavily than original research articles. Once more, Ed Diener is a good example. His second and third most cited articles are the 1984 and the co-authored 1999 Psychological Bulletin review articles on subjective well-being that together account for another 9,000 citations (13%). However, even review articles are not a problem. First, they are also unlikely to have an undue influence on the H-Index, and second, it is possible to exclude review articles and compute metrics only for empirical articles. Web of Science makes this very easy. In Web of Science, 361 out of Diener’s 469 publications are listed as articles. The others are listed as reviews, book chapters, or meeting abstracts. With a click of a button, we can produce the citation metrics only for the 361 articles. The H-Index drops from 108 to 102. Careful hand-selection of articles is unlikely to change this.

3. Finally, Radosic and Diener (2021) mention large-scale collaborations as a problem. For example, one of the most important research projects in psychological science in the last decade was the Reproducibility Project that examined the replicability of psychological science with 100 replication studies (Open Science Collaboration, 2015). This project required a major effort by many researchers. Participation earned researchers over 2,000 citations in just five years and the article is likely to be the most cited article for many of the collaborators. I do not see this as a problem because large-scale collaborations are important and can produce results that no single lab can produce. Thus, high citation counts provide a good incentive to engage in these collaborations.

To conclude, Radosic and Diener’s article provides norms for citation counts that can and will be used to evaluate psychological scientists. However, the article sidesteps the main questions about the use of citation metrics, namely (a) what criteria should be used to evaluate scientists and (b) whether citation metrics are valid indicators of these criteria. In short, the article is just another example that psychologists develop and promote measures without examining their construct validity (Schimmack, 2021).

What is a good scientist?

I didn’t do an online study to examine the ideal prototype of a scientist, so I have to rely on my own image of a good scientist. A key criterion is to search for objectively verifiable information that can inform our understanding of the world, or in psychology of ourselves; that is, human affect, behavior, and cognition, the ABC of psychology. The second criterion elaborates the term objective. Scientists use methods that produce the same results independently of who uses them. That is, studies should be reproducible and results should be replicable within the margins of error. Third, the research question should have some significance beyond the personal interests of a scientist. This is of course a tricky criterion, but research that solves major problems like finding a vaccine for Covid-19 is more valuable and more likely to receive citations than research on the liking of cats versus dogs (I know, this is the most controversial statement I am making; go cats!). The problem is that not everybody can do research that is equally important to a large number of people. Once more, Ed Diener is a good example. In the 1980s, he decided to study human happiness, which was not a major topic in psychology at the time. Ed Diener’s high H-Index reflects his choice of a topic that is of interest to pretty much everybody. In contrast, research on the stigma of minority groups speaks to a smaller audience and is unlikely to attract the same amount of attention. Thus, a blind focus on citation metrics is likely to steer researchers toward general topics and away from research that applies psychology to specific problems. The problem is clearly visible in research on prejudice, where the past 20 years have produced hundreds of studies with button-press tasks by White researchers with White participants that gobbled up funding that could have been used by BIPOC researchers to study the actual issues in BIPOC populations. In short, the relevance and significance of research are very difficult to evaluate, and they are unlikely to be reflected in citation metrics. Thus, the danger is that metrics are used because they are easy to measure, while relevance is ignored because it is harder to measure.

Do Citation Metrics Reward Good or Bad Research?

The main justification for the use of citation metrics is the hypothesis that the wisdom of crowds will lead to more citations of high quality work.

“The argument in favor of personal judgments overlooks the fact that citation counts are also based on judgments by scholars. In the case of citation counts, however, those judgments are broadly derived from the whole scholarly community and are weighted by the scholars who are publishing about the topic of the cited publications. Thus, there is much to recommend citation numbers in evaluating scholarly records.” (Radosic & Diener, 2021, p. 8).

This statement is out of touch with discussions about psychological science over the past decade in the wake of the replication crisis (see Schimmack, 2020, for a review; I have to cite myself to drive up my citation metrics. LOL). In order to get published and cited, authors of original research articles in psychological science need statistically significant p-values. The problem is that it can be difficult to find significant results when novel hypotheses are false or effect sizes are small. Given the pressure to publish in order to rise in the H-Index rankings, psychologists have learned to use a number of statistical tricks to get significant results in the absence of strong evidence in the data. These tricks are known as questionable research practices, but most researchers consider them acceptable (John et al., 2012). However, these practices undermine the value of significance testing: published results may be false positives or difficult to replicate, and they do not add to the progress of science. Thus, citation metrics may have the negative consequence of pressuring scientists into using bad practices and of rewarding scientists who publish more false results simply because they publish more.

Meta-psychologists have produced strong evidence that the use of these practices was widespread and accounts for the majority of replication failures that occurred over the past decade.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Motyl et al. (2017) collected focal test statistics from a representative sample of articles in social psychology. I analyzed their data using z-curve.2.0 (Brunner & Schimmack, 2020; Bartos & Schimmack, 2021). Figure 1 shows the distribution of the test-statistics after converting them into absolute z-scores, where higher values show a higher signal/noise (effect size / sampling error) ratio. A z-score of 1.96 is needed to claim a discovery with p < .05 (two-sided). Consistent with publication practices since the 1960s, most focal hypothesis tests confirm predictions (Sterling, 1959). The observed discovery rate is 90% and even higher if marginally significant results are included (z > 1.65). This high success rate is not something to celebrate. Even I could win all marathons if I took a short-cut and ran only 5 km. The problem with this high success rate is clearly visible when we fit a model to the distribution of the significant z-scores and extrapolate the distribution of z-scores that are not significant (the blue curve in the figure). Based on this distribution, the significant results make up only 19% of all tests that were conducted, indicating that many more non-significant results are expected than observed. The discrepancy between the observed and estimated discovery rate provides some indication of the use of questionable research practices. Moreover, the estimated discovery rate shows how much statistical power studies have to produce significant results without questionable research practices. The results confirm suspicions that power in social psychology is abysmally low (Cohen, 1961; Tversky & Kahneman, 1971).
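For readers who want to see the mechanics, here is a rough sketch of the workflow in R with simulated z-scores. It assumes the zcurve R-package (mentioned in the rankings post below) is installed; the exact arguments of the actual analysis may differ from this sketch.

# simulated absolute z-scores standing in for the focal test statistics
set.seed(1)
z <- abs(rnorm(300, mean = 2, sd = 1))
mean(z > qnorm(.975))   # observed discovery rate: share of significant results (p < .05, two-sided)
library(zcurve)         # assumed to be installed from CRAN
fit <- zcurve(z)        # fits the finite mixture model to the significant z-scores
summary(fit)            # reports the estimated replication rate (ERR) and discovery rate (EDR)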

The use of questionable practices raises the possibility that citation metrics are invalid. When everybody in a research field uses p < .05 as a criterion to evaluate manuscripts and these p-values are obtained with questionable research practices, the system will reward researchers who use the most questionable methods to produce more questionable results than their peers. In other words, citation metrics are no longer a valid criterion of research quality. Instead, bad research is selected and rewarded (Smaldino & McElreath, 2016). However, it is also possible that implicit knowledge helps researchers to focus on robust results and that questionable research practices are not rewarded. For example, prediction markets suggest that it is fairly easy to spot shoddy research and to predict replication failures (Dreber et al., 2015). Thus, we cannot assume that citation metrics are valid or invalid. Instead, citation metrics, like all measures, require a program of construct validation.

Do Citation Metrics Take Statistical Power Into Account?

A few days ago, I published the first results of an ongoing research project that examines the relationship between researchers’ citation metrics and estimates of the average power of their studies based on z-curve analyses like the one shown in Figure 1 (see Schimmack, 2021, for details). The key finding is that there is no statistically or practically significant relationship between researchers’ H-Index and the average power of their studies. Thus, researchers who invest a lot of resources in their studies to produce results with a low false positive risk and high replicability are not cited more than researchers who flood journals with low-powered studies that produce questionable results that are difficult to replicate.

These results reveal a major problem of citation metrics. Although methodologists have warned against underpowered studies, researchers have continued to run them because they can use questionable practices to produce the desired outcome. This strategy is beneficial for scientists and their careers, but it hurts the larger goal of science to produce a credible body of knowledge. This does not mean that we need to abandon citation metrics altogether, but they must be complemented with other information that reflects the quality of researchers’ data.

The Power-Corrected H-Index

In my 2020 review article, I proposed to weight the H-Index by estimates of researchers’ replicability. For my illustration, I used the estimated replication rate (ERR), which is the average power of significant tests, p < .05 (Brunner & Schimmack, 2020). One advantage of the ERR is that it is highly reliable. The reliability of the ERRs for 300 social psychologists is .90. However, the ERR has some limitations. First, it predicts replication outcomes under the unrealistic assumption that psychological studies can be replicated exactly. It has been pointed out that this is often impossible, especially in social psychology (Stroebe & Strack, 2014). As a result, ERR predictions are overly optimistic and overestimate the success rate of actual replication studies (Bartos & Schimmack, 2021). In contrast, EDR estimates are much more in line with actual replication outcomes because effect sizes in replication studies can regress towards the mean. For example, Figure 1 shows an EDR of 19% for social psychology, and the actual success rate (if we can call it that) for social psychology was 25% in the reproducibility project (Open Science Collaboration, 2015). Another advantage of the EDR is that it is sensitive to questionable research practices that tend to produce an abundance of p-values that are just significant. Thus, the EDR more strongly punishes researchers for using these undesirable practices. The main limitation of the EDR is that it is less reliable than the ERR. The reliability for 300 social psychologists was only .5. Of course, it is not necessary to choose between the ERR and the EDR. Just as there are many citation metrics, it is possible to evaluate the pattern of power-corrected metrics using both the ERR and the EDR. I am presenting both values here, but the rankings are sorted by the EDR-weighted H-Indices.

The H-Index is an absolute number that can range from 0 to infinity. In contrast, power is limited to a range from 5% (with alpha = .05) to 100%. Thus, it makes sense to use power as a weight and to multiply the H-Index by a researcher’s EDR. A researcher who published only studies with 100% power has a power-corrected H-Index that is equivalent to the actual H-Index. The average EDR of social psychologists, however, is 35%. Thus, the average H-Index is reduced to about a third of the unadjusted value.
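The weighting itself is simple arithmetic. A minimal sketch in R with made-up numbers:

# made-up values for a single researcher
h   <- 60      # H-Index
edr <- .35     # estimated discovery rate (average power before selection for significance)
err <- .70     # estimated replication rate (average power after selection for significance)
h * edr        # EDR-corrected H-Index: 21
h * err        # ERR-corrected H-Index: 42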

To illustrate this approach, I am using two researchers with a large H-Index, but different EDRs. One researcher is James J. Gross with an H-Index of 99 in Web of Science. His z-curve plot shows some evidence that questionable research practices were used to report 72% significant results with 50% power. However, the 95%CI around the EDR ranges from 23% to 78% and includes the observed discovery rate. Thus, the evidence for QRPs is weak and not statistically significant. More importantly, the EDR-corrected H-Index is 99 * .50 ≈ 50.

A different example is provided by Shelley E. Taylor, with a similarly high H-Index of 84, but her z-curve plot shows clear evidence that the observed discovery rate is inflated by questionable research practices. Her low EDR reduces the H-Index considerably and results in a PC-H-Index of only 12.6.

Weighting the two researchers’ H-Indices by their respective ERRs, 77 vs. 54, has similar but less extreme effects in absolute terms, yielding ERR-adjusted H-Indices of 76 vs. 45.

In the sample of 300 social psychologists, the H-Index (r = .74) and the EDR (r = .65) contribute about equal amounts of variance to the power-corrected H-Index. Of course, a different formula could be used to weigh power more or less.

Discussion

Ed Diener is best known for his efforts to measure well-being and to point out that traditional economic indicators of well-being are imperfect. While the wealth of countries is a strong predictor of citizens’ average well-being, r ~ .8, income is a poor predictor of individuals’ well-being within countries. However, economists continue to rely on income and GDP because they are more easily quantified and counted than subjective life-evaluations. Ironically, Diener advocates the opposite approach when it comes to measuring research quality. Counting articles and citations is relatively easy and objective, but it may not measure what we really want to measure, namely how much somebody contributes to the advancement of knowledge. The construct of scientific advancement is probably as difficult to define as well-being, but producing replicable results with reproducible studies is one important criterion of good science. At present, citation metrics fail to track this indicator of research quality. Z-curve analyses of published results make it possible to measure this aspect of good science, and I recommend taking it into account when researchers are being evaluated.

However, I do not recommend the use of quantitative information for hiring and promotion decisions. The reward system in science is heavily biased in favor of privileged, upper-class, White US Americans (see APS rising stars lists). That being said, a close examination of published articles can be used to detect and eliminate researchers who severely p-hacked to get their significant results. Open science criteria can also be used to evaluate researchers who are just starting their careers.

In conclusion, Radosic and Diener’s (2021) article disappointed me because it sidesteps the fundamental questions about the validity of citation metrics as a criterion for scientific excellence.

Conflict of Interest Statement: At the beginning of my career, I was motivated to succeed in psychological science by publishing as many JPSP articles as possible, and I made the unhealthy mistake of trying to compete with Ed Diener. That didn’t work out for me. Maybe I am just biased against citation metrics because my work is not cited as much as I would like. Alternatively, my disillusionment with the system reflects some real problems with the reward structure in psychological science and helped me to see the light. The goal of science cannot be to have the most articles or the most citations if these metrics do not really reflect scientific contributions. Chasing indicators is a trap, just like chasing happiness is a trap. Most scientists can hope to make maybe one lasting contribution to the advancement of knowledge. You need to please others to stay in the game, but beyond the minimum requirements to get tenure, personal criteria of success are better than social comparisons for the well-being of science and scientists. The only criterion that is healthy to maximize is statistical power. As Cohen said, less is more, and by this criterion psychology is not doing well, as more and more research is published with little concern for quality.

Name  EDR.H.Index  ERR.H.Index  H-Index  EDR  ERR
James J. Gross5076995077
John T. Cacioppo48701024769
Richard M. Ryan4661895269
Robert A. Emmons3940468588
Edward L. Deci3643695263
Richard W. Robins3440576070
Jean M. Twenge3335595659
William B. Swann Jr.3244555980
Matthew D. Lieberman3154674780
Roy F. Baumeister31531013152
David Matsumoto3133397985
Carol D. Ryff3136486476
Dacher Keltner3144684564
Michael E. McCullough3034446978
Kipling D. Williams3034446977
Thomas N Bradbury3033486369
Richard J. Davidson30551082851
Phoebe C. Ellsworth3033466572
Mario Mikulincer3045714264
Richard E. Petty3047744064
Paul Rozin2949585084
Lisa Feldman Barrett2948694270
Constantine Sedikides2844634570
Alice H. Eagly2843614671
Susan T. Fiske2849664274
Jim Sidanius2730426572
Samuel D. Gosling2733535162
S. Alexander Haslam2740624364
Carol S. Dweck2642663963
Mahzarin R. Banaji2553683778
Brian A. Nosek2546574481
John F. Dovidio2541663862
Daniel M. Wegner2434524765
Benjamin R. Karney2427376573
Linda J. Skitka2426327582
Jerry Suls2443633868
Steven J. Heine2328376377
Klaus Fiedler2328386174
Jamil Zaki2327356676
Charles M. Judd2336534368
Jonathan B. Freeman2324307581
Shinobu Kitayama2332455071
Norbert Schwarz2235564063
Antony S. R. Manstead2237593762
Patricia G. Devine2125375867
David P. Schmitt2123307177
Craig A. Anderson2132593655
Jeff Greenberg2139732954
Kevin N. Ochsner2140573770
Jens B. Asendorpf2128415169
David M. Amodio2123336370
Bertram Gawronski2133434876
Fritz Strack2031553756
Virgil Zeigler-Hill2022277481
Nalini Ambady2032573556
John A. Bargh2035633155
Arthur Aron2036653056
Mark Snyder1938603263
Adam D. Galinsky1933682849
Tom Pyszczynski1933613154
Barbara L. Fredrickson1932523661
Hazel Rose Markus1944642968
Mark Schaller1826434361
Philip E. Tetlock1833454173
Anthony G. Greenwald1851613083
Ed Diener18691011868
Cameron Anderson1820276774
Michael Inzlicht1828444163
Barbara A. Mellers1825325678
Margaret S. Clark1823305977
Ethan Kross1823345267
Nyla R. Branscombe1832493665
Jason P. Mitchell1830414373
Ursula Hess1828404471
R. Chris Fraley1828394572
Emily A. Impett1819257076
B. Keith Payne1723305876
Eddie Harmon-Jones1743622870
Wendy Wood1727434062
John T. Jost1730493561
C. Nathan DeWall1728453863
Thomas Gilovich1735503469
Elaine Fox1721276278
Brent W. Roberts1745592877
Harry T. Reis1632433874
Robert B. Cialdini1629513256
Phillip R. Shaver1646652571
Daphna Oyserman1625463554
Russell H. Fazio1631503261
Jordan B. Peterson1631394179
Bernadette Park1624384264
Paul A. M. Van Lange1624384263
Jeffry A. Simpson1631572855
Russell Spears1529522955
A. Janet Tomiyama1517236576
Jan De Houwer1540552772
Samuel L. Gaertner1526423561
Michael Harris Bond1535423584
Agneta H. Fischer1521314769
Delroy L. Paulhus1539473182
Marcel Zeelenberg1429373979
Eli J. Finkel1426453257
Jennifer Crocker1432483067
Steven W. Gangestad1420483041
Michael D. Robinson1427413566
Nicholas Epley1419265572
David M. Buss1452652280
Naomi I. Eisenberger1440512879
Andrew J. Elliot1448712067
Steven J. Sherman1437592462
Christian S. Crandall1421363959
Kathleen D. Vohs1423453151
Jamie Arndt1423453150
John M. Zelenski1415206976
Jessica L. Tracy1423324371
Gordon B. Moskowitz1427472957
Klaus R. Scherer1441522678
Ayelet Fishbach1321363759
Jennifer A. Richeson1321403352
Charles S. Carver1352811664
Leaf van Boven1318274767
Shelley E. Taylor1244841452
Lee Jussim1217245271
Edward R. Hirt1217264865
Shigehiro Oishi1232522461
Richard E. Nisbett1230432969
Kurt Gray1215186981
Stacey Sinclair1217304157
Niall Bolger1220343658
Paula M. Niedenthal1222363461
Eliot R. Smith1231422973
Tobias Greitemeyer1221313967
Rainer Reisenzein1214215769
Rainer Banse1219264672
Galen V. Bodenhausen1228462661
Ozlem Ayduk1221353459
E. Tory. Higgins1238701754
D. S. Moskowitz1221333663
Dale T. Miller1225393064
Jeanne L. Tsai1217254667
Roger Giner-Sorolla1118225180
Edward P. Lemay1115195981
Ulrich Schimmack1122353263
E. Ashby Plant1118363151
Ximena B. Arriaga1113195869
Janice R. Kelly1115225070
Frank D. Fincham1135601859
David Dunning1130432570
Boris Egloff1121372958
Karl Christoph Klauer1125392765
Caryl E. Rusbult1019362954
Tessa V. West1012205159
Jennifer S. Lerner1013224661
Wendi L. Gardner1015244263
Mark P. Zanna1030621648
Michael Ross1028452262
Jonathan Haidt1031432373
Sonja Lyubomirsky1022382659
Sander L. Koole1018352852
Duane T. Wegener1016273660
Marilynn B. Brewer1027442262
Christopher K. Hsee1020313163
Sheena S. Iyengar1015195080
Laurie A. Rudman1026382568
Joanne V. Wood916263660
Thomas Mussweiler917392443
Shelly L. Gable917332850
Felicia Pratto930402375
Wiebke Bleidorn920273474
Jeff T. Larsen917253667
Nicholas O. Rule923303075
Dirk Wentura920312964
Klaus Rothermund930392376
Joris Lammers911165669
Stephanie A. Fryberg913194766
Robert S. Wyer930471963
Mina Cikara914184980
Tiffany A. Ito914224064
Joel Cooper914352539
Joshua Correll914233862
Peter M. Gollwitzer927461958
Brad J. Bushman932511762
Kennon M. Sheldon932481866
Malte Friese915263357
Dieter Frey923392258
Lorne Campbell914233761
Monica Biernat817292957
Aaron C. Kay814283051
Yaacov Schul815233664
Joseph P. Forgas823392159
Guido H. E. Gendolla814302747
Claude M. Steele813312642
Igor Grossmann815233566
Paul K. Piff810165063
Joshua Aronson813282846
William G. Graziano820302666
Azim F. Sharif815223568
Juliane Degner89126471
Margo J. Monteith818243277
Timothy D. Wilson828451763
Kerry Kawakami813233356
Hilary B. Bergsieker78116874
Gerald L. Clore718391945
Phillip Atiba Goff711184162
Elizabeth W. Dunn717262864
Bernard A. Nijstad716312352
Mark J. Landau713282545
Christopher R. Agnew716213376
Brandon J. Schmeichel714302345
Arie W. Kruglanski728491458
Eric D. Knowles712183864
Yaacov Trope732571257
Wendy Berry Mendes714312244
Jennifer S. Beer714252754
Nira Liberman729451565
Penelope Lockwood710144870
Jeffrey W Sherman721292371
Geoff MacDonald712183767
Eva Walther713193566
Daniel T. Gilbert727411665
Grainne M. Fitzsimons611232849
Elizabeth Page-Gould611164066
Mark J. Brandt612173770
Ap Dijksterhuis620371754
James K. McNulty621331965
Dolores Albarracin618331956
Maya Tamir619292164
Jon K. Maner622431452
Alison L. Chasteen617252469
Jay J. van Bavel621302071
William A. Cunningham619302064
Glenn Adams612173573
Wilhelm Hofmann622331866
Ludwin E. Molina67124961
Lee Ross626421463
Andrea L. Meltzer69134572
Jason E. Plaks610153967
Ara Norenzayan621341761
Batja Mesquita617232573
Tanya L. Chartrand69282033
Toni Schmader518301861
Abigail A. Scholer59143862
C. Miguel Brendl510153568
Emily Balcetis510153568
Diana I. Tamir59153562
Nir Halevy513182972
Alison Ledgerwood58153454
Yoav Bar-Anan514182876
Paul W. Eastwick517242169
Geoffrey L. Cohen513252050
Yuen J. Huo513163180
Benoit Monin516291756
Gabriele Oettingen517351449
Roland Imhoff515212373
Mark W. Baldwin58202441
Ronald S. Friedman58192544
Shelly Chaiken522431152
Kristin Laurin59182651
David A. Pizarro516232069
Michel Tuan Pham518271768
Amy J. C. Cuddy517241972
Gun R. Semin519301564
Laura A. King419281668
Yoel Inbar414202271
Nilanjana Dasgupta412231952
Kerri L. Johnson413172576
Roland Neumann410152867
Richard P. Eibach410221947
Roland Deutsch416231871
Michael W. Kraus413241755
Steven J. Spencer415341244
Gregory M. Walton413291444
Ana Guinote49202047
Sandra L. Murray414251655
Leif D. Nelson416251664
Heejung S. Kim414251655
Elizabeth Levy Paluck410192155
Jennifer L. Eberhardt411172362
Carey K. Morewedge415231765
Lauren J. Human49133070
Chen-Bo Zhong410211849
Ziva Kunda415271456
Geoffrey J. Leonardelli46132848
Danu Anthony Stinson46113354
Kentaro Fujita411182062
Leandre R. Fabrigar414211767
Melissa J. Ferguson415221669
Nathaniel M Lambert314231559
Matthew Feinberg38122869
Sean M. McCrea38152254
David A. Lishner38132563
William von Hippel313271248
Joseph Cesario39191745
Martie G. Haselton316291154
Daniel M. Oppenheimer316261260
Oscar Ybarra313241255
Simone Schnall35161731
Travis Proulx39141962
Spike W. S. Lee38122264
Dov Cohen311241144
Ian McGregor310241140
Dana R. Carney39171553
Mark Muraven310231144
Deborah A. Prentice312211257
Michael A. Olson211181363
Susan M. Andersen210211148
Sarah E. Hill29171352
Michael A. Zarate24141331
Lisa K. Libby25101854
Hans Ijzerman2818946
James M. Tyler1681874
Fiona Lee16101358

References

Open Science Collaboration (OSC). (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. http://dx.doi.org/10.1126/science.aac4716

Radosic, N., & Diener, E. (2021). Citation Metrics in Psychological Science. Perspectives on Psychological Science. https://doi.org/10.1177/1745691620964128

Schimmack, U. (2021). The validation crisis. Meta-Psychology, in press.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Power and Success: When the R-Index meets the H-Index

A main message of the Lord of the Rings novels is that power is dangerous and corrupts. The main exception is statistical power. High statistical power is desirable because it reduces the risk of false negative results and thereby increases the rate of true discoveries. A high rate of true discoveries is desirable because it reduces the risk that significant results are false positives. For example, a researcher who conducts studies with low power to produce many significant results, but also tests many false hypotheses, will have a high rate of false positive discoveries (Finkel, 2018). In contrast, a researcher who invests more resources in any single study will have fewer significant results, but a lower risk of false positives. Another advantage of high power is that true discoveries are more replicable. A true positive that was obtained with 80% power has an 80% chance of producing a successful replication. In contrast, a true discovery that was obtained with 20% power has an 80% chance of ending in a replication failure, which requires additional replication studies to determine whether the original result was a false positive.

Although most researchers agree that high power is desirable (and specify that they are planning studies with 80% power in their grant proposals), they no longer care about power once a study is completed and a significant result has been obtained. The fallacy is to assume that a significant result was obtained because the hypothesis was true and the study had good power. Until recently, there was also no statistical method to estimate researchers’ actual power. The main problem was that questionable research practices inflate post-hoc estimates of statistical power; selection for significance alone ensures that post-hoc power is at least 50%. This problem has been solved with selection models that correct for selection for significance, namely p-curve and z-curve. A comparison of these methods with simulation studies shows that p-curve estimates can be dramatically inflated when studies are heterogeneous in power (Brunner & Schimmack, 2020). Z-curve is also the only method that estimates power for all studies that were conducted and not just the subset of studies that produced a significant result. A comparison with actual success rates of replication studies shows that these estimates predict actual replication outcomes (Bartos & Schimmack, 2021).

The ability to estimate researchers’ actual power offers new opportunities for meta-psychologists. One interesting question is how statistical power is related to traditional indicators of scientific success or eminence. There are three possible outcomes.

One possibility is that power could be positively correlated with success, especially for older researchers. The reason is that low power should produce many replication failures for other researchers who are trying to build on the original work. Faced with replication failures, they are likely to abandon this research, and work on the topic will cease after a while. Accordingly, low-powered studies are unlikely to produce a large body of research. In contrast, high-powered studies replicate, and many other researchers can build on these findings, leading to many citations and a large H-Index.

A second possibility is that there is no relationship between power and success. The reason would be that power is determined by many other factors such as the effect sizes in a research area and the type of design that is used to examine these effects. Some research areas will have robust findings that replicate often. Other areas will have low power, but everybody in this area accepts that studies do not always work. In this scenario, success is determined by other factors that vary within research areas and not by power, which varies mostly across research areas.

Another reason for the lack of a correlation could be a floor effect. In a system that does not value credibility and replicability, researchers who use questionable practices to push articles out might win and the only way to survive is to do bad research (Smaldino & McElreath, 2016).

A third possibility is that power is negatively correlated with success. Although there is no evidence for a negative relationship, concerns have been raised that some researchers are gaming the system by conducting many studies with low power to produce as many significant results as possible. The costs of replication failures are passed on to other researchers that try to build on these findings, whereas the innovator moves on to produce more significant results on new questions.

Given the lack of data and plausible predictions for any type of relationship, it is not possible to make a priori predictions about the relationship. Thus, the theoretical implications can only be examined after we look at the data.

Data

Success was measured with the H-Index in Web of Science. Information about the statistical power of over 300 social/personality psychologists was obtained using z-curve analyses of automatically extracted test statistics (Schimmack, 2021). A sample size of N = 300 provides reasonably tight confidence intervals to evaluate whether there is a substantial relationship between the H-Index and power. I log-transformed the H-Index and computed the correlation with the estimated discovery rate, which corresponds to the average power before selection for significance (Brunner & Schimmack, 2020). The results show a weak positive relationship that is not significantly different from zero, r(N = 304) = .07, 95%CI = -.04 to .18. Thus, the results are most consistent with theories that predict no relationship between success and research practices. Figure 1 shows the scatterplot, and there is no indication that the weak correlation is due to a floor effect. There is considerable variation in the estimated discovery rate across researchers.
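The main analysis is a simple correlation. Here is a sketch in R with simulated placeholder data (the real data come from the z-curve analyses of the 300+ researchers):

# placeholder data; replace with the actual H-Indices and EDR estimates
set.seed(1)
h   <- pmax(1, round(rlnorm(304, meanlog = 3, sdlog = 0.5)))   # simulated H-Indices
edr <- runif(304, .05, 1)                                      # simulated estimated discovery rates
cor.test(log(h), edr)                                          # Pearson correlation with a 95% confidence interval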

One concern could be that the EDR is just a very noisy and unreliable measure of statistical power. To examine this, I split each researcher’s z-values in half, computed separate z-curves, and then computed the split-half correlation and adjusted it to obtain alpha for the full set of z-scores. The reliability (alpha) of the EDR was .5. To increase reliability, I used extreme groups for the EDR and excluded values between 25% and 45%. However, the correlation with the H-Index did not increase, r = .08, 95%CI = -.08 to .23.
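A sketch of this reliability check in R with placeholder estimates, using the Spearman-Brown formula to step the half-length correlation up to the full set of z-scores:

# placeholder EDR estimates from two random halves of each researcher's z-scores
set.seed(2)
edr_half1 <- runif(304, .05, 1)
edr_half2 <- pmin(1, pmax(.05, edr_half1 + rnorm(304, 0, .25)))
r_half <- cor(edr_half1, edr_half2)      # split-half correlation
alpha  <- 2 * r_half / (1 + r_half)      # Spearman-Brown adjustment to full length
alpha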

I also correlated the H-Index with the more reliable estimated replication rate (reliability = .9), which is power after selection for significance. This correlation was also not significant, r = .08, 95%CI = -.04 to .19.

In conclusion, we can reject the hypothesis that higher success is related to conducting many small studies with low power and selectively reporting only significant results (r > -.1, p < .05). There may be a small positive correlation (r < .2, p < .05), but a larger sample would be needed to reject the hypothesis that there is no relationship between success and statistical power.

Discussion

Low replication rates and major replication failures of some findings in social psychology created a crisis of confidence. Some articles suggest that most published results are false and were obtained with questionable research practices. The present results suggest that these fears are unfounded and that it would be false to generalize from a few researchers to the whole group of social psychologists.

The present results also suggest that it is not necessary to burn social psychology to the ground. Instead, social psychologists should carefully examine which important findings are credible and replicable and which ones are not. Although this work has begun, it is moving slowly. The present results show that researchers’ success, which is measured in terms of citations by peers, is not tied to the credibility of their findings. Personalized information about power may help to change this in the future.

A famous quote in management is “If You Can’t Measure It, You Can’t Improve It.” This might explain why statistical power remained low despite early warnings about low power (Cohen, 1961; Tversky & Kahneman, 1971). Z-curve analysis is a game changer because it makes it possible to measure power and with the use of modern computers, it is possible to do so quickly and on a large scale. If we agree that power is important and that it can be measured, it is time to improve it. Every researcher can do so and the present results suggest that increasing power is not a career ending move. So, I hope this post empowers researchers to invest more resources in high-powered studies.

Replicability Rankings 2010-2020

Welcome to the replicability rankings for 120 psychology journals. More information about the statistical method that is used to create the replicability rankings can be found elsewhere (Z-Curve; Video Tutorial; Talk; Examples). The rankings are based on automated extraction of test statistics from all articles published in these 120 journals from 2010 to 2020 (data). The results can be reproduced with the R-package zcurve.

To give a brief explanation of the method, I use the journal with the highest ranking and the journal with the lowest ranking as examples. Figure 1 shows the z-curve plot for the 2nd highest ranking journal for the year 2020 (the Journal of Organizational Psychology is ranked #1, but it has very few test statistics). Plots for all journals that include additional information and information about test statistics are available by clicking on the journal name. Plots for previous years can be found on the site for the 2010-2019 rankings (previous rankings).

To create the z-curve plot in Figure 1, the 361 test statistics were first transformed into exact p-values that were then transformed into absolute z-scores. Thus, each value represents the deviation from zero on a standard normal distribution. A value of 1.96 (solid red line) corresponds to the standard criterion for significance, p = .05 (two-tailed). The dashed line represents the threshold for marginal significance, p = .10 (two-tailed). A z-curve analysis fits a finite mixture model to the distribution of the significant z-scores (the blue density distribution on the right side of the solid red line). The distribution provides information about the average power of studies that produced a significant result. As power determines the success rate in future studies, power after selection for significance is used to estimate replicability. For the present data, the z-curve estimate of the replication rate is 84%. The bootstrapped 95% confidence interval around this estimate ranges from 75% to 92%. Thus, we would expect the majority of these significant results to replicate.
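The conversion works the same way for any test statistic. A short sketch in R for a t-test with made-up values:

# made-up test statistic
t_value <- 2.5
df      <- 40
p <- 2 * pt(-abs(t_value), df)   # exact two-sided p-value
z <- qnorm(1 - p / 2)            # absolute z-score
c(p = p, z = z)
qnorm(.975)                      # 1.96, the solid red line for p = .05 (two-tailed)
qnorm(.95)                       # 1.64, the dashed line for marginal significance, p = .10 (two-tailed)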

However, the graph also shows some evidence that questionable research practices produce too many significant results. The observed discovery rate (i.e., the percentage of p-values below .05) is 82%. This is outside the 95%CI of the estimated discovery rate, which is represented by the grey line in the range of non-significant results; EDR = 31%, 95%CI = 18% to 81%. We see that fewer non-significant results are reported than z-curve predicts. This finding casts doubt on the replicability of the just-significant p-values. The replicability rankings ignore this problem, which means that the predicted success rates are overly optimistic. A more pessimistic predictor of the actual success rate is the EDR. However, the ERR still provides useful information to compare the power of studies across journals and over time.

Figure 2 shows a journal with a low ERR in 2020.

The estimated replication rate is 64%, with a 95%CI ranging from 55% to 73%. This 95%CI does not overlap with the 95%CI for the Journal of Sex Research, indicating a significant difference in replicability. Visual inspection also shows clear evidence for the use of questionable research practices, with many more results that are just significant than results that are not significant. The observed discovery rate of 75% is inflated and falls outside the 95%CI of the EDR, which ranges from 10% to 56%.

To examine time trends, I regressed the ERR of each year on the year and computed the predicted values and 95%CI. Figure 3 shows the results for the journal Social Psychological and Personality Science as an example (x = 0 is 2010, x = 1 is 2020). The upper bound of the 95%CI for 2010, 62%, is lower than the lower bound of the 95%CI for 2020, 74%.

This shows a significant difference with alpha = .01. I use alpha = .01 so that only 1.2 out of the 120 journals are expected to show a significant change in either direction by chance alone. There are 22 journals with a significant increase in the ERR and no journals with a significant decrease. This shows that about 20% of these journals have responded to the crisis of confidence by publishing studies with higher power that are more likely to replicate.
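A sketch of the trend analysis for a single journal in R, with made-up yearly ERR estimates (the actual rankings use the z-curve estimates for each journal and year):

# x is the year rescaled so that 0 = 2010 and 1 = 2020
x   <- seq(0, 1, length.out = 11)
err <- c(.55, .58, .56, .60, .63, .62, .66, .70, .72, .74, .78)   # made-up ERR estimates
fit <- lm(err ~ x)
predict(fit, newdata = data.frame(x = c(0, 1)), interval = "confidence")
# non-overlapping 95% confidence intervals for 2010 and 2020 are taken as a significant change (roughly alpha = .01)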

Rank  Journal  Observed 2020  Predicted 2020  Predicted 2010
1Journal of Organizational Psychology88 [69 ; 99]84 [75 ; 93]73 [64 ; 81]
2Journal of Sex Research84 [75 ; 92]84 [74 ; 93]75 [65 ; 84]
3Evolution & Human Behavior84 [74 ; 93]83 [77 ; 90]62 [56 ; 68]
4Judgment and Decision Making81 [74 ; 88]83 [77 ; 89]68 [62 ; 75]
5Personality and Individual Differences81 [76 ; 86]81 [78 ; 83]68 [65 ; 71]
6Addictive Behaviors82 [75 ; 89]81 [77 ; 86]71 [67 ; 75]
7Depression & Anxiety84 [76 ; 91]81 [77 ; 85]67 [63 ; 71]
8Cognitive Psychology83 [75 ; 90]81 [76 ; 87]71 [65 ; 76]
9Social Psychological and Personality Science85 [78 ; 92]81 [74 ; 89]54 [46 ; 62]
10Journal of Experimental Psychology – General80 [75 ; 85]80 [79 ; 81]67 [66 ; 69]
11J. of Exp. Psychology – Learning, Memory & Cognition81 [75 ; 87]80 [77 ; 84]73 [70 ; 77]
12Journal of Memory and Language79 [73 ; 86]80 [76 ; 83]73 [69 ; 77]
13Cognitive Development81 [75 ; 88]80 [75 ; 85]67 [62 ; 72]
14Sex Roles81 [74 ; 88]80 [75 ; 85]72 [67 ; 77]
15Developmental Psychology74 [67 ; 81]80 [75 ; 84]67 [63 ; 72]
16Canadian Journal of Experimental Psychology77 [65 ; 90]80 [73 ; 86]74 [68 ; 81]
17Journal of Nonverbal Behavior73 [59 ; 84]80 [68 ; 91]65 [53 ; 77]
18Memory and Cognition81 [73 ; 87]79 [77 ; 81]75 [73 ; 77]
19Cognition79 [74 ; 84]79 [76 ; 82]70 [68 ; 73]
20Psychology and Aging81 [74 ; 87]79 [75 ; 84]74 [69 ; 79]
21Journal of Cross-Cultural Psychology83 [76 ; 91]79 [75 ; 83]75 [71 ; 79]
22Psychonomic Bulletin and Review79 [72 ; 86]79 [75 ; 83]71 [67 ; 75]
23Journal of Experimental Social Psychology78 [73 ; 84]79 [75 ; 82]52 [48 ; 55]
24JPSP-Attitudes & Social Cognition82 [75 ; 88]79 [69 ; 89]55 [45 ; 65]
25European Journal of Developmental Psychology75 [64 ; 86]79 [68 ; 91]74 [62 ; 85]
26Journal of Business and Psychology82 [71 ; 91]79 [68 ; 90]74 [63 ; 85]
27Psychology of Religion and Spirituality79 [71 ; 88]79 [66 ; 92]72 [59 ; 85]
28J. of Exp. Psychology – Human Perception and Performance79 [73 ; 84]78 [77 ; 80]75 [73 ; 77]
29Attention, Perception and Psychophysics77 [72 ; 82]78 [75 ; 82]73 [70 ; 76]
30Psychophysiology79 [74 ; 84]78 [75 ; 82]66 [62 ; 70]
31Psychological Science77 [72 ; 84]78 [75 ; 82]57 [54 ; 61]
32Quarterly Journal of Experimental Psychology81 [75 ; 86]78 [75 ; 81]72 [69 ; 74]
33Journal of Child and Family Studies80 [73 ; 87]78 [74 ; 82]67 [63 ; 70]
34JPSP-Interpersonal Relationships and Group Processes81 [74 ; 88]78 [73 ; 82]53 [49 ; 58]
35Journal of Behavioral Decision Making77 [70 ; 86]78 [72 ; 84]66 [60 ; 72]
36Appetite78 [73 ; 84]78 [72 ; 83]72 [67 ; 78]
37Journal of Comparative Psychology79 [65 ; 91]78 [71 ; 85]68 [61 ; 75]
38Journal of Religion and Health77 [57 ; 94]78 [70 ; 87]75 [67 ; 84]
39Aggressive Behaviours82 [74 ; 90]78 [70 ; 86]70 [62 ; 78]
40Journal of Health Psychology74 [64 ; 82]78 [70 ; 86]72 [64 ; 80]
41Journal of Social Psychology78 [70 ; 87]78 [70 ; 86]69 [60 ; 77]
42Law and Human Behavior81 [71 ; 90]78 [69 ; 87]70 [61 ; 78]
43Psychological Medicine76 [68 ; 85]78 [66 ; 89]74 [63 ; 86]
44Political Psychology73 [59 ; 85]78 [65 ; 92]59 [46 ; 73]
45Acta Psychologica81 [75 ; 88]77 [74 ; 81]73 [70 ; 76]
46Experimental Psychology73 [62 ; 83]77 [73 ; 82]73 [68 ; 77]
47Archives of Sexual Behavior77 [69 ; 83]77 [73 ; 81]78 [74 ; 82]
48British Journal of Psychology73 [65 ; 81]77 [72 ; 82]74 [68 ; 79]
49Journal of Cognitive Psychology77 [69 ; 84]77 [72 ; 82]74 [69 ; 78]
50Journal of Experimental Psychology – Applied82 [75 ; 88]77 [72 ; 82]70 [65 ; 76]
51Asian Journal of Social Psychology79 [66 ; 89]77 [70 ; 84]70 [63 ; 77]
52Journal of Youth and Adolescence80 [71 ; 89]77 [70 ; 84]72 [66 ; 79]
53Memory77 [71 ; 84]77 [70 ; 83]71 [65 ; 77]
54European Journal of Social Psychology82 [75 ; 89]77 [69 ; 84]61 [53 ; 69]
55Social Psychology81 [73 ; 90]77 [67 ; 86]73 [63 ; 82]
56Perception82 [74 ; 88]76 [72 ; 81]78 [74 ; 83]
57Journal of Anxiety Disorders80 [71 ; 89]76 [72 ; 80]71 [67 ; 75]
58Personal Relationships65 [54 ; 76]76 [68 ; 84]62 [54 ; 70]
59Evolutionary Psychology63 [51 ; 75]76 [67 ; 85]77 [68 ; 86]
60Journal of Research in Personality63 [46 ; 77]76 [67 ; 84]70 [61 ; 79]
61Cognitive Behaviour Therapy88 [73 ; 99]76 [66 ; 86]68 [58 ; 79]
62Emotion79 [73 ; 85]75 [72 ; 79]67 [64 ; 71]
63Animal Behavior79 [72 ; 87]75 [71 ; 80]68 [64 ; 73]
64Group Processes & Intergroup Relations80 [73 ; 87]75 [71 ; 80]60 [56 ; 65]
65JPSP-Personality Processes and Individual Differences78 [70 ; 86]75 [70 ; 79]64 [59 ; 69]
66Psychology of Men and Masculinity88 [77 ; 96]75 [64 ; 87]78 [67 ; 89]
67Consciousness and Cognition74 [67 ; 80]74 [69 ; 80]67 [62 ; 73]
68Personality and Social Psychology Bulletin78 [72 ; 84]74 [69 ; 79]57 [52 ; 62]
69Journal of Cognition and Development70 [60 ; 80]74 [67 ; 81]65 [59 ; 72]
70Journal of Applied Psychology69 [59 ; 78]74 [67 ; 80]73 [66 ; 79]
71European Journal of Personality80 [67 ; 92]74 [65 ; 83]70 [61 ; 79]
72Journal of Positive Psychology75 [65 ; 86]74 [65 ; 83]66 [57 ; 75]
73Journal of Research on Adolescence83 [74 ; 92]74 [62 ; 87]67 [55 ; 79]
74Psychopharmacology75 [69 ; 80]73 [71 ; 75]67 [65 ; 69]
75Frontiers in Psychology75 [70 ; 79]73 [70 ; 76]72 [69 ; 75]
76Cognitive Therapy and Research73 [66 ; 81]73 [68 ; 79]67 [62 ; 73]
77Behaviour Research and Therapy70 [63 ; 77]73 [67 ; 79]70 [64 ; 76]
78Journal of Educational Psychology82 [73 ; 89]73 [67 ; 79]76 [70 ; 82]
79British Journal of Social Psychology74 [65 ; 83]73 [66 ; 81]61 [54 ; 69]
80Organizational Behavior and Human Decision Processes70 [65 ; 77]72 [69 ; 75]67 [63 ; 70]
81Cognition and Emotion75 [68 ; 81]72 [68 ; 76]72 [68 ; 76]
82Journal of Affective Disorders75 [69 ; 83]72 [68 ; 76]74 [71 ; 78]
83Behavioural Brain Research76 [71 ; 80]72 [67 ; 76]70 [66 ; 74]
84Child Development81 [75 ; 88]72 [66 ; 78]68 [62 ; 74]
85Journal of Abnormal Psychology71 [60 ; 82]72 [66 ; 77]65 [60 ; 71]
86Journal of Vocational Behavior70 [59 ; 82]72 [65 ; 79]84 [77 ; 91]
87Journal of Experimental Child Psychology72 [66 ; 78]71 [69 ; 74]72 [69 ; 75]
88Journal of Consulting and Clinical Psychology81 [73 ; 88]71 [64 ; 78]62 [55 ; 69]
89Psychology of Music78 [67 ; 86]71 [64 ; 78]79 [72 ; 86]
90Behavior Therapy78 [69 ; 86]71 [63 ; 78]70 [63 ; 78]
91Journal of Occupational and Organizational Psychology66 [51 ; 79]71 [62 ; 80]87 [79 ; 96]
92Journal of Happiness Studies75 [65 ; 83]71 [61 ; 81]79 [70 ; 89]
93Journal of Occupational Health Psychology77 [65 ; 90]71 [58 ; 83]65 [52 ; 77]
94Journal of Individual Differences77 [62 ; 92]71 [51 ; 90]74 [55 ; 94]
95Frontiers in Behavioral Neuroscience70 [63 ; 76]70 [66 ; 75]66 [62 ; 71]
96Journal of Applied Social Psychology76 [67 ; 84]70 [63 ; 76]70 [64 ; 77]
97British Journal of Developmental Psychology72 [62 ; 81]70 [62 ; 79]76 [67 ; 85]
98Journal of Social and Personal Relationships73 [63 ; 81]70 [60 ; 79]69 [60 ; 79]
99Behavioral Neuroscience65 [57 ; 73]69 [64 ; 75]69 [63 ; 75]
100Psychology and Marketing71 [64 ; 77]69 [64 ; 74]67 [63 ; 72]
101Journal of Family Psychology71 [59 ; 81]69 [63 ; 75]62 [56 ; 68]
102Journal of Personality71 [57 ; 85]69 [62 ; 77]64 [57 ; 72]
103Journal of Consumer Behaviour70 [60 ; 81]69 [59 ; 79]73 [63 ; 83]
104Motivation and Emotion78 [70 ; 86]69 [59 ; 78]66 [57 ; 76]
105Developmental Science67 [60 ; 74]68 [65 ; 71]65 [63 ; 68]
106International Journal of Psychophysiology67 [61 ; 73]68 [64 ; 73]64 [60 ; 69]
107Self and Identity80 [72 ; 87]68 [60 ; 76]70 [62 ; 78]
108Journal of Counseling Psychology57 [41 ; 71]68 [55 ; 81]79 [66 ; 92]
109Health Psychology63 [50 ; 73]67 [62 ; 72]67 [61 ; 72]
110Hormones and Behavior67 [58 ; 73]66 [63 ; 70]66 [62 ; 70]
111Frontiers in Human Neuroscience68 [62 ; 75]66 [62 ; 70]76 [72 ; 80]
112Annals of Behavioral Medicine63 [53 ; 75]66 [60 ; 71]71 [65 ; 76]
113Journal of Child Psychology and Psychiatry and Allied Disciplines58 [45 ; 69]66 [55 ; 76]63 [53 ; 73]
114Infancy77 [69 ; 85]65 [56 ; 73]58 [50 ; 67]
115Biological Psychology64 [58 ; 70]64 [61 ; 67]66 [63 ; 69]
116Social Development63 [54 ; 73]64 [56 ; 72]74 [66 ; 82]
117Developmental Psychobiology62 [53 ; 70]63 [58 ; 68]67 [62 ; 72]
118Journal of Consumer Research59 [53 ; 67]63 [55 ; 71]58 [50 ; 66]
119Psychoneuroendocrinology63 [53 ; 72]62 [58 ; 66]61 [57 ; 65]
120Journal of Consumer Psychology64 [55 ; 73]62 [57 ; 67]60 [55 ; 65]

If Consumer Psychology Wants to be a Science It Has to Behave Like a Science

Consumer psychology is an applied branch of social psychology that uses insights from social psychology to understand consumers’ behaviors. Although there is cross-fertilization and authors may publish in both more basic and more applied journals, it is its own field in psychology with its own journals. As a result, it has escaped the close attention that has been paid to the replicability of studies published in mainstream social psychology journals (see Schimmack, 2020, for a review). However, given the similarity in theories and research practices, it is fair to ask why consumer research should be more replicable and credible than basic social psychology. This question was indirectly addressed in a dialogue about the merits of pre-registration that was published in the Journal of Consumer Psychology (Krishna, 2021).

Open science proponents advocate pre-registration to increase the credibility of published results. The main concern is that researchers can use questionable research practices to produce significant results (John et al., 2012). Preregistration of analysis plans would reduce the chances of using QRPs and increase the chances of a non-significant result. This would make the reporting of significant results more valuable because significance was produced by the data and not by the creativity of the data analyst.

In my opinion, the focus on pre-registration in the dialogue is misguided. As Pham and Oh (2021) point out, pre-registration would not be necessary if there were no problem that needs to be fixed. Thus, a proper assessment of the replicability and credibility of consumer research should inform discussions about preregistration.

The problem is that the past decade has seen more articles talking about replications than actual replication studies, especially outside of social psychology. Thus, most of the discussion about actual and ideal research practices occurs without facts about the status quo. How often do consumer psychologists use questionable research practices? How many published results are likely to replicate? What is the typical statistical power of studies in consumer psychology? What is the false positive risk?

Rather than writing another meta-psychological article that is based on paranoid or wishful thinking, I would like to add to the discussion by providing some facts about the health of consumer psychology.

Do Consumer Psychologists Use Questionable Research Practices?

John et al. (2012) conducted a survey study to examine the use of questionable research practices. They found that respondents admitted to using these practices and that they did not consider these practices to be wrong. In 2021, however, nobody is defending the use of questionable practices that can inflate the risk of false positive results and hide replication failures. Consumer psychologists could have conducted an internal survey to find out how prevalent these practices are among consumer psychologists. However, Pham and Oh (2021) do not present any evidence about the use of QRPs by consumer psychologists. Instead, they cite a survey among German social psychologists to suggest that QRPs may not be a big problem in consumer psychology. Below, I will show that QRPs are a big problem in consumer psychology and that consumer psychologists have done nothing over the past decade to curb the use of these practices.

Are Studies in Consumer Psychology Adequately Powered?

Concerns about low statistical power go back to the 1960s (Cohen, 1961; Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989; Smaldino & McElreath, 2016). Tversky and Kahneman (1971) refused to believe “that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110). Yet, results from the reproducibility project suggest that social psychologists conduct studies with less than 50% power all the time (Open Science Collaboration, 2015). It is not clear why we should expect higher power from consumer research. More concerning is that Pham and Oh (2021) do not even mention low power as a potential problem for consumer psychology. One advantage of pre-registration is that researchers are forced to think ahead of time about the sample size that is required to have a good chance to show the desired outcome, assuming the theory is right. More than 20 years ago, the APA task force on statistical inference recommended a priori power analysis, but researchers continued to conduct underpowered studies. Pre-registration, however, would not be necessary if consumer psychologists already conducted studies with adequate power. Here I show that power in consumer psychology is unacceptably low and has not increased over the past decade.

False Positive Risk

Pham and Oh note that Simmons, Nelson, and Simonsohn’s (2011) influential article relied exclusively on simulations and speculations and suggest that the fear of massive p-hacking may be unfounded: “Whereas Simmons et al. (2011) highly influential computer simulations point to massive distortions of test statistics when QRPs are used, recent empirical estimates of the actual impact of self-serving analyses suggest more modest degrees of distortion of reported test statistics in recent consumer studies (see Krefeld-Schwalb & Scheibehenne, 2020).” Here I present the results of empirical analyses to estimate the false discovery risk in consumer psychology.

Data

The data are part of a larger project that examines research practices in psychology over the past decade. For this purpose, my research team and I downloaded all articles from 2010 to 2020 published in 120 psychology journals that cover a broad range of disciplines. Four journals represent research in consumer psychology, namely the Journal of Consumer Behavior, the Journal of Consumer Psychology, the Journal of Consumer Research, and Psychology and Marketing. The articles were converted into text files and the text files were searched for test statistics. All F, t, and z-tests were used, but most test statistics were F and t tests. There were 2,304 tests for the Journal of Consumer Behavior, 8,940 for the Journal of Consumer Psychology, 10,521 for the Journal of Consumer Research, and 5,913 for Psychology and Marketing.
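For readers who want to see what this extraction step looks like in practice, here is a minimal, illustrative R sketch (not the project’s actual extraction script) that pulls reported t- and F-statistics out of plain text with a regular expression; the function name and the pattern are my own simplifications.

```r
# Illustrative sketch only: pull reported t- and F-statistics such as
# "t(38) = 3.41" or "F(3, 36) = 9.63" out of article text.
extract_tests <- function(txt) {
  pattern <- "\\b(t|F)\\s*\\(\\s*[0-9.]+(\\s*,\\s*[0-9.]+)?\\s*\\)\\s*=\\s*-?[0-9]+\\.?[0-9]*"
  unlist(regmatches(txt, gregexpr(pattern, txt, perl = TRUE)))
}

example <- "The interaction was not significant, F(1, 88) = 2.37, but the main effect was, t(88) = 3.41."
extract_tests(example)
# [1] "F(1, 88) = 2.37" "t(88) = 3.41"
```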

Results

I first conducted z-curve analyses for each journal and year separately. The 40 results were analyzed with year as a continuous and journal as a categorical predictor variable. No time trends were significant, but the main effect for the expected replication rate of journals was significant, F(3,36) = 9.63, p < .001. Inspection of the means showed higher values for the Journal of Consumer Psychology and Psychology & Marketing than for the other two journals. No other effects were significant. Therefore, I combined the data of the Journal of Consumer Psychology and Psychology & Marketing into one set and the data of the Journal of Consumer Behavior and the Journal of Consumer Research into another.
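The analysis described above can be sketched in a few lines of R. The data frame below is a placeholder with made-up values (the real input is one z-curve estimate per journal and year, and the journal abbreviations are hypothetical labels); only the model formula, with year as a continuous and journal as a categorical predictor, is the point of the example.

```r
# Placeholder data: 4 journals x 10 years of expected replication rate (ERR)
# estimates; the values are random and only illustrate the data structure.
set.seed(1)
est <- expand.grid(year = 2010:2019,
                   journal = c("JCP", "P&M", "JCB", "JCR"))  # hypothetical labels
est$err <- round(runif(nrow(est), 50, 80))

# Year as continuous predictor, journal as categorical predictor.
fit <- lm(err ~ year + journal, data = est)
anova(fit)   # tests the linear time trend and the journal main effect
```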

Figure 1 shows the z-curve analysis for the first set of journals. The observed discovery rate (ODR) is simply the percentage of results that are significant. Out of the 14,853 tests, 10,636 were significant, which yields an ODR of 72%. To examine the influence of questionable research practices, the ODR can be compared to the estimated discovery rate (EDR). The EDR is an estimate that is based on a finite mixture model that is fitted to the distribution of the significant test statistics. Figure 1 shows that the fitted grey curve closely matches the observed distribution of test statistics, which are all converted into z-scores. Figure 1 also shows the projected distribution that is expected for non-significant results. Contrary to the predicted distribution, the observed non-significant results drop off sharply at the level of significance (z = 1.96). This pattern provides visual evidence that the non-significant results do not follow a sampling distribution. The EDR is the area under the curve for the significant values relative to the total distribution. The EDR is only 34%. The 95%CI of the EDR can be used to test statistical significance. The ODR of 72% is well outside the 95% confidence interval of the EDR, which ranges from 17% to 34%. Thus, there is strong evidence that consumer researchers use QRPs and publish too many significant results.

The EDR can also be used to assess the risk of publishing false positive results; that is, significant results without a true population effect. Using a formula from Soric (1989), we can use the EDR to estimate the maximum percentage of false positive results. As the EDR decreases, the false discovery risk increases. With an EDR of 34%, the FDR is 10%, with a 95% confidence interval ranging from 7% to 26%. Thus, the present results do not suggest that most results in consumer psychology journals are false positives, as some meta-scientists have suggested (Ioannidis, 2005; Simmons et al., 2011).
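Soric’s bound is easy to compute from the EDR. The small R function below (the function name is mine) reproduces the numbers reported above: about 10% at an EDR of 34%, and about 26% at the lower confidence limit of 17%.

```r
# Soric's (1989) maximum false discovery risk for a given estimated discovery
# rate (EDR) and significance criterion alpha.
soric_fdr <- function(edr, alpha = .05) {
  ((1 / edr) - 1) * (alpha / (1 - alpha))
}

soric_fdr(.34)  # ~0.10: at most about 10% false positives, as reported above
soric_fdr(.17)  # ~0.26: the 26% upper bound implied by the lower EDR limit
```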

It is more difficult to assess the replicability of results published in these two journals. On the one hand, z-curve provides an estimate of the expected replication rate (ERR); that is, the probability that a significant result produces a significant result again in an exact replication study (Brunner & Schimmack, 2020). The ERR is higher than the EDR because studies that produced a significant result have higher power than studies that did not produce a significant result. The ERR of 63% suggests that more than 50% of significant results can be successfully replicated. However, a comparison of the ERR with the success rate in actual replication studies showed that the ERR overestimates actual replication rates (Brunner & Schimmack, 2020). There are a number of reasons for this discrepancy. One reason is that replication studies in psychology are never exact replications and that regression to the mean lowers the chances of reproducing the same effect size in a replication study. In social psychology, the EDR is actually a better predictor of the actual success rate. Thus, the present results suggest that actual replication studies in consumer psychology are likely to produce as many replication failures as studies in social psychology have (Schimmack, 2020).

Figure 2 shows the results for the Journal of Consumer Behavior and the Journal of Consumer Research.

The results are even worse. The ODR of 73% is above the EDR of 26% and well outside the 95%CI of the EDR. The EDR of 26% implies a false discovery risk of 15%.

Conclusion

The present results show that consumer psychology is plagued by the same problems that have produced replication failures in social psychology. Given the similarities between consumer psychology and social psychology, it is not surprising that the two disciplines are alike. Researchers conduct underpowered studies and use QRPs to report inflated success rates. These illusory results cannot be replicated, and it is unclear which statistically significant results reveal effects that have practical significance and which ones are mere false positives. Making matters worse for consumer psychology, social psychologists have responded to awareness of these problems by increasing the power of their studies and by implementing changes in their research practices. In contrast, z-curve analyses of consumer psychology show no improvement in research practices over the past decade. In light of this disappointing trend, it is disconcerting to read an article that suggests improvements in consumer psychology are not needed and that everything is well (Pham & Oh, 2021). I demonstrated with hard data and objective analyses that this assessment is false. It is time for consumer psychologists to face reality and to follow in the footsteps of social psychologists to increase the credibility of their science. While preregistration may be optional, increasing power is not.

Guest Post by Peter Holtz: From Experimenter Bias Effects To the Open Science Movement

This post was first shared in the Facebook Psychological Methods Discussion Group (Group, Post). I thought it was interesting and deserved a wider audience.

Peter Holtz

I know that this is too long for this group, but I don’t have a blog …

A historical anecdote:

In 1963, Rosenthal and Fode published a famous paper on the Experimenter Bias Effect (EBE): There were of course several different experiments and conditions etc., but for example, research assistants were given a set of 20 photos of people that were to be rated by participants on a scale from -10 ([will experience …] “extreme failure”) to + 10 (…“extreme success”).

The research assistants (e.g., participants in a class on experimental psychology) were told to replicate a “well-established” psychological finding just like “students in physics labs are expected to do” (p. 494). On average, the sets of photos had been rated in a large pre-study as neutral (M=0), but some research assistants were told that the expected mean of their photos was -5, whereas others were told that it was +5. When the research assistants, who were not allowed to communicate with each other during the experiments, handed in the results of their studies, their findings were biased in the direction of the effect that they had expected. Funnily enough, similar biases could be found for experiments with rats in Skinner boxes as well (Rosenthal & Fode, 1963b).

The findings on the EBE were met with skepticism from other psychologists because they cast doubt on experimental psychology’s self-concept as a true and unbiased natural science. And what do researchers do, since the days of Socrates, if they doubt the findings of a colleague? Sure, they attempt to replicate them. Whereas Rosenthal and colleagues (by and large) produced several successful “conceptual replications” in slightly different contexts (for a summary see e.g. Rosenthal, 1966), others (most notably T. X. Barber) couldn’t replicate Rosenthal and Fode’s original study (e.g., Barber et al., 1969; Barber & Silver, 1968, but also Jacob, 1968; Wessler & Strauss, 1968).

Rosenthal, a well-versed statistician, responded (e.g., Rosenthal, 1969) that the difference between significant and non-significant may not itself be significant and used several techniques that about ten years later came to be known as “meta-analysis” to argue that although Barber’s and others’ replications, which of course used other groups of participants and materials etc., most often did not yield significant results, a summary of results suggests that there may still be an EBE (1968; albeit probably smaller than in Rosenthal and Fode’s initial studies – let me think… how can we explain that…).

Of course, Barber and friends responded to Rosenthal’s responses (e.g., Barber, 1969, titled “invalid arguments, post-mortem analyses, and the experimenter bias effect”) and vice versa, and a serious discussion of psychology’s methodology emerged. Other notables weighed in as well, and statisticians such as Rozeboom (1960) and Bakan (1966), who had by then already done their best to explain to their colleagues the problems of the p-ritual that psychologists use(d) as a verification procedure, were quoted frequently. (On a side note: To me, Bakan’s 1966 paper is better than much of the recent work on the problems with the p-ritual; in particular, the paragraph on the problematic assumption of an “automaticity of inference” on p. 430 is still worth reading.)

Lykken (1968) and Meehl (1967) soon joined the melee and attacked the p-ritual also from an epistemological perspective. In 1969, Levy wrote an interesting piece about the value of replications in which he argued that replicating the EBE-studies doesn’t make much sense as long as there are no attempts to embed the EBE into a wider explanatory theory that allows for deducing other falsifiable hypotheses as well. Levy knew very well already by 1969 that the question whether some effect “exists” or “does not exist” is only in very rare cases relevant (exactly then when there are strong reasons to assume that an effect does not exist – as is the case, for example, with para-psychological phenomena).

Eventually Rosenthal himself (e.g., 1968a) came to think critically of the “reassuring nature of the null hypothesis decision procedure”. What happened then? At some point Rosenthal moved away from experimenter expectancy effects in the lab to Pygmalion effects in the classroom (1968b) – an idea that is much less likely to provoke criticism and replication attempts: Who doesn’t believe that teachers’ stereotypes influence the way they treat children and consequently the children’s chances to succeed in school? The controversy fizzled out and if you take up a social psychology textbook, you may find the comforting story in it that this crisis was finally “overcome” (Stroebe, Hewstone, & Jonas, 2013, p. 18) by enlarging psychology’s methodological arsenal, for example, with meta-analytic practices and by becoming a stronger and better science with a more rigid methodology etc. Hooray!

So psychology was finally great again from the 1970s on … was it? What can we learn from this episode?

– It is not the case that psychologists didn’t know the replication game, but they only played it whenever results went against their beliefs – and that was rarely the case (exceptions, apart from Rosenthal’s studies, are of course Bem’s “feeling the future” experiments).

– Science is self-correcting – but only when there are controversies (and not if subcommunities just happily produce evidence in favor of their pet theories).

– Everybody who wanted to know it could know by the 1960s that something is wrong with the p-ritual – but no one cared. This was the game that needed to be played to produce evidence in favor of theories and to get published and to make a career; consequently, people learned to play the verification game more and more effectively. (Bakan writes on p. 423: “What will be said in this paper is hardly original. It is, in a certain sense, what ‘everybody knows.’ To say it ‘out loud’ is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear.” – in 1966!)

– Just making it more difficult to verify a theory will not solve the problem imo; ambitious psychologists will again find ways to play the game – and to win.

– I see two risks with the changes that have been proposed by the “open science community” (in particular preregistration): First, I am afraid that since the verification game still dominates in psychology, researchers will simply shift towards “proving” more boring hypotheses; second, there is the risk that psychological theories will be shielded even more from criticism since only criticism based on “good science” (preregistered experiments with a priori power analysis and open data) will be valid, whereas criticism based on other types of research activities (e.g., simulations, case studies … or just rational thinking for a change) will be dismissed as “unscientific” => no criticism => no controversy => no improvement => no progress.

– And of course, pre-registration and open science etc. allow psychologists to still maintain the misguided, unfortunate, and highly destructive myth of the “automaticity of inferences”; no inductive mechanism whatsoever can ensure “true discovery”.

– I think what is needed more is a discussion about the relationship between data and theory and about epistemological questions such as what a “growth of knowledge” in science could look like and how it can be facilitated (I call this a “falsificationist turn”).

– Irrespective of what is going to happen, authors of textbooks will find ways to write up the history of psychology as a flawless cumulative success story …

A Z-Curve Analysis of a Self-Replication: Shah et al. (2012) Science

Since 2011, psychologists have been wondering which published results are credible and which results are not. One way to answer this question would be for researchers to self-replicate their most important findings. However, most psychologists have avoided conducting or publishing self-replications (Schimmack, 2020).

It is therefore always interesting when a self-replication is published. I just came across Shah, Mullainathan, and Shafir (2019). The authors conducted high-powered (much larger sample sizes) replications of five studies that were published in Shah, Mullainathan, and Shafir’s (2012) Science article.

The article reported five studies with 1, 6, 2, 3, and 1 focal hypothesis tests. One additional test was significant, but the authors focussed on the small effect size and considered it not theoretically important. The replication studies successfully replicated 9 of the 13 significant results; a success rate of 69%. This is higher than the success rate in the famous reproducibility project of 100 studies in social and cognitive psychology; 37% (OSC, 2015).

One interesting question is whether this success rate was predictable based on the original findings. An even more interesting question is whether original results provide clues about the replicability of specific effects. For example, why were the results of Studies 1 and 5 harder to replicate than those of the other studies?

Z-curve relies on the strength of the evidence against the null-hypothesis in the original studies to predict replication outcomes (Brunner & Schimmack, 2020; Bartos & Schimmack, 2020). It also takes into account that original results may be selected for significance. For example, the original article reported 14 out of 14 significant results. It is unlikely that all statistical tests of critical hypotheses produce significant results (Schimmack, 2012). Thus, some questionable practices were probably used although the authors do not mention this in their self-replication article.

I converted the 13 test statistics into exact p-values and converted the exact p-values into z-scores. Figure 1 shows the z-curve plot and the results of the z-curve analysis. The first finding is that the observed success rate of 100% is much higher than the expected discovery rate of 15%. Given the small sample of tests, the 95%CI around the estimated discovery rate is wide, but it does not include 100%. This suggests that some questionable practices were used to produce a pretty picture of results. This practice is in line with widespread practices in psychology in 2012.
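For readers who want to see the mechanics of this conversion, here is a minimal R sketch with made-up test statistics (the actual 13 values from the original article are not reproduced here). The commented-out z-curve call at the end assumes the interface of the CRAN zcurve package and is not the exact analysis code behind Figure 1.

```r
# Illustrative values only, not the original test statistics.
t_vals <- c(2.3, 2.8, 3.5)   # hypothetical t-values
dfs    <- c(58, 90, 120)     # hypothetical degrees of freedom

p <- 2 * pt(abs(t_vals), dfs, lower.tail = FALSE)  # exact two-sided p-values
z <- qnorm(1 - p / 2)                              # corresponding absolute z-scores

# F(df1, df2) statistics are converted the same way:
# p <- pf(F_val, df1, df2, lower.tail = FALSE); z <- qnorm(1 - p / 2)

# Assuming the interface of the CRAN 'zcurve' package, the z-scores could then
# be analyzed with something like:
# fit <- zcurve::zcurve(z)
# summary(fit)   # reports EDR and ERR estimates with confidence intervals
```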

The next finding is that, despite a low estimated discovery rate, the estimated replication rate of 66% is in line with the observed replication rate. The reason for the difference is that the estimated discovery rate includes the large set of non-significant results that the model predicts. Selection for significance selects studies with higher power that have a higher chance to be significant (Brunner & Schimmack, 2020).

It is unlikely that the authors conducted many additional studies to get only significant results. It is more likely that they used a number of other QRPs. Whatever method they used, QRPs make just significant results questionable. One solution to this problem is to alter the significance criterion post-hoc. This can be done gradually. For example, a first adjustment might lower the significance criterion to alpha = .01.

Figure 2 shows the adjusted results. The observed discovery rate decreased to 69%. In addition, the estimated discovery rate increased to 48% because the model no longer needs to predict the large number of just significant results. Thus, the expected and observed discovery rate are much more in line and suggest little need for additional QRPs. The estimated replication rate decreased because it uses the more stringent criterion of alpha = .01. Otherwise, it would be even more in line with the observed replication rate.

Thus, a simple explanation for the replication outcomes is that some results were obtained with QRPs that produced just significant results with p-values between .01 and .05. These results did not replicate, but the other results did replicate.

There was also a strong point-biserial correlation between the original z-scores and the dichotomous replication outcome. When the original p-values were split into p-values above or below .01, they perfectly predicted the replication outcome; p-values greater than .01 did not replicate, those below .01 did replicate.

In conclusion, a single p-value from a single analysis provides little information about replicability, although replicability increases as p-values decrease. However, meta-analyses of p-values with models that take QRPs and selection for significance into account are a promising tool to predict replication outcomes and to distinguish between questionable and solid results in the psychological literature.

Meta-analyses that take QRPs into account can also help to avoid replication studies that merely confirm highly robust results. Four of the z-scores in Shah et al.’s (2019) project were above 4, which makes it very likely that the results replicate. Resources are better spent on findings that have high theoretical importance but weak evidence. Z-curve can help to identify these results because it corrects for the influence of QRPs.

Conflict of Interest statement: Z-curve is my baby.

How Credible is Clinical Psychology?

Don Lynam and the clinical group at Purdue University invited me to give a talk and they generously gave me permission to share it with you.

Talk (the first 4 min. were not recorded, it starts right away with my homage to Jacob Cohen).

The first part of the talk discusses the problems with Fisher’s approach to significance testing and the practice in psychology to publish only significant results. I then discuss Neyman-Pearson’s alternative approach, statistical power, and Cohen’s seminal meta-analysis of power in social/abnormal psychology. I then point out that questionable research practices must have been used to publish 95% significant results with only 50% power.

The second part of the talk discusses Soric’s insight that we can estimate the false discovery risk based on the discovery rate. I discuss the Open Science Collaboration project as one way to estimate the discovery rate (pretty high for within-subject cognitive psychology, terribly low for between-subject social psychology), but point out that it doesn’t tell us about clinical psychology. I then introduce z-curve to estimate the discovery rate based on the distribution of significant p-values (converted into z-scores).

In the empirical part, I show the z-curve for Positive Psychology Interventions that shows massive use of QRPs and a high false discovery risk.

I end with a comparison of the z-curve for the Journal of Abnormal Psychology in 2010 and 2020 that shows no change in research practices over time.

The discussion focussed on changing the way we do research and what research we reward. I argue strongly against the implementation of alpha = .005 and for the adoption of Neyman-Pearson’s approach with pre-registration, which would allow researchers to study small populations (e.g., mental health issues in the African American community) with a higher false-positive risk to balance type-I and type-II errors.

A tutorial about effect sizes, power, z-curve analysis, and personalized p-values.

I recorded a meeting with my research assistants who are coding articles to estimate the replicability of psychological research. It is unedited and raw, but you might find it interesting to listen to. Below I give a short description of the topics that were discussed starting from an explanation of effect sizes and ending with a discussion about the choice of a graduate supervisor.

Link to video

The meeting is based on two blog posts that introduce personalized p-values.
1. https://replicationindex.com/2021/01/15/men-are-created-equal-p-values-are-not/
2. https://replicationindex.com/2021/01/19/personalized-p-values/

1. Rant about Fisher’s approach to statistics that ignores effect sizes.
– look for p < .05, and do a happy dance if you find it, now you can publish.
– still the way statistics is taught to undergraduate students.

2. Explaining statistics starting with effect sizes (a short R sketch follows below).
– unstandardized effect size (height difference between men and women in cm)
– unstandardized effect sizes depend on the unit of measurement
– to standardize effect sizes we divide by standard deviation (Cohen’s d)
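A minimal R sketch of this standardization, using round, made-up numbers for the height example:

```r
# Made-up illustrative values, not real data.
mean_men   <- 178  # average male height in cm
mean_women <- 165  # average female height in cm
sd_height  <- 7    # (pooled) standard deviation in cm

d <- (mean_men - mean_women) / sd_height  # unstandardized difference divided by SD
d  # Cohen's d of roughly 1.9
```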

3. Why do/did social psychologists run studies with n = 20 per condition?
– limited resources, small subject pool, statistics can be used with n = 20 ~ 30.
– obvious that these sample sizes are too small after Cohen (1961) introduced power analysis
– but some argued that low power is ok because it is more efficient to get significant results.

4. Simulation of social psychology: 50% of hypotheses are true, 50% are false; the effect size of true hypotheses is d = .4 and the sample size of studies is N = 20 (a minimal simulation sketch follows below).
– Analyzing the simulated results (with k = 200 studies) with z-curve.2.0. In this simulation, the true discovery rate is 14%. That is, 14% of the 200 studies produced a significant result.
– Z-curve correctly estimates this discovery rate based on the distribution of the significant p-values, converted into z-scores.
– If only significant results are published, the observed discovery rate is 100%, but the true discovery rate is only 14%.
– Publication bias leads to false confidence in published results.
– Publication is wasteful because we are discarding useful information.
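A minimal R sketch of this simulation, assuming n = 20 per condition (as in point 3) and a two-sample t-test for each study:

```r
# 200 studies: 50% true null (d = 0), 50% true effect (d = .4), n = 20 per group.
set.seed(123)
k <- 200
true_d <- rep(c(0, .4), each = k / 2)
n <- 20

p <- sapply(true_d, function(d) {
  control   <- rnorm(n, mean = 0)
  treatment <- rnorm(n, mean = d)
  t.test(treatment, control, var.equal = TRUE)$p.value
})

mean(p < .05)  # discovery rate; the expected value is about .14, as stated above
```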

5. Power analysis.
– Fisher did not have power analysis.
– Neyman and Pearson invented power analysis, but Fisher wrote the textbook for researchers.
– We had 100 years to introduce students to power analysis, but it hasn’t happened.
– Cohen wrote books about power analysis, but he was ignored.
– Cohen suggested we should aim for 80% power (more is not efficient).
– Think a priori about effect size to plan sample sizes.
– Power analysis was ignored because it often implied very large samples.
(very hard to get participants in Germany with small subject pools).
– no change because all p-values were treated as equal. p < .05 = truth.
– Literature reviews and textbooks treat every published significant result as truth.

6. Repeating the simulation (50% true hypotheses, effect size d = .4) with 80% power, N = 200 (see the power calculation sketch below).
– much higher discovery rate (58%)
– much more credible evidence
– z-curve makes it possible to distinguish between p-values from research with low or high discovery rate.
– Will this change the way psychologists look at p-values? Maybe, but Cohen and others have tried to change psychology without success. Will z-curve be a game-changer?
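The “80% power, N = 200” scenario can be checked with base R’s power.t.test (two-sample t-test, d = .4, alpha = .05):

```r
# About 99 participants per group (roughly N = 200 in total) for 80% power.
power.t.test(delta = .4, sd = 1, sig.level = .05, power = .80)

# Conversely, with 100 participants per group, power is approximately .80.
power.t.test(n = 100, delta = .4, sd = 1, sig.level = .05)
```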

7. Personalized p-values
– P-values are being created by scientists.
– Scientists have some control about the type of p-values they publish.
– There are systemic pressures to publish more p-values based on low powered studies.
– But at some point, researchers get tenure.
– nobody can fire you if you stop publishing
– social media allow researchers to publish without censure from peers.
– tenure also means you have a responsibility to do good research.
– Researchers who are listed on the post with personalized p-values all have tenure.
– Some researchers, like David Matsumoto, have a good z-curve.
– Other researchers have way too many just significant results.
– The observed discovery rates between good and bad researchers are the same.
– Z-curve shows that the significant results were produced very differently and differ in credibility and replicability; this could be a game changer if people care about it.
– My own z-curve doesn’t look so good. 😦
– How can researchers improve their z-curve?
– publish better research now
– distance yourself from bad old research
– So far, few people have distanced themselves from bad old work because there was no incentive to do so.
– Now there is an incentive to do so, because researchers can increase credibility of their good work.
– some people may move up when we add the 2020 data.
– hand-coding of articles will further improve the work.

8. Conclusion and Discussion
– not all p-values are created equal.
– working with undergraduates is easy because they are unbiased.
– once you are in grad school, you have to produce significant results.
– z-curve can help to avoid getting into labs that use questionable practices.
– I was lucky to work in labs that cared about the science.

The Prevalence of Questionable Research Practices in Social Psychology

Introduction

A naive model of science assumes that scientists are objective. That is, they derive hypotheses from theories, collect data to test these theories, and then report the results. In reality, scientists are passionate about theories and often want to confirm that their own theories are right. This leads to confirmation bias and the use of questionable research practices (QRPs, John et al., 2012; Schimmack, 2015). QRPs are defined as practices that increase the chances of the desired outcome (typically a statistically significant result) while at the same time inflating the risk of a false positive discovery. A simple QRP is to conduct multiple studies and to report only the results that support the theory.

The use of QRPs explains the astonishingly high rate of statistically significant results in psychology journals that is over 90% (Sterling, 1959; Sterling et al., 1995). While it is clear that this rate of significant results is too high, it is unclear how much it is inflated by QRPs. Given the lack of quantitative information about the extent of QRPs, motivated biases also produce divergent opinions about the use of QRPs by social psychologists. John et al. (2012) conducted a survey and concluded that QRPs are widespread. Fiedler and Schwarz (2016) criticized the methodology and their own survey of German psychologists suggested that QRPs are not used frequently. Neither of these studies is ideal because they relied on self-report data. Scientists who heavily use QRPs may simply not participate in surveys of QRPs or underreport the use of QRPs. It has also been suggested that many QRPs happen automatically and are not accessible to self-reports. Thus, it is necessary to study the use of QRPs with objective methods that reflect the actual behavior of scientists. One approach is to compare dissertations with published articles (Cairo et al., 2020). This method provided clear evidence for the use of QRPs, even though a published document could reveal their use. It is possible that this approach underestimates the use of QRPs because even the dissertation results could be influenced by QRPs and the supervision of dissertations by outsiders may reduce the use of QRPs.

With my colleagues, I developed a statistical method that can detect and quantify the use of QRPs (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve uses the distribution of statistically significant p-values to estimate the mean power of studies before selection for significance. This estimate predicts how many non-significant results were obtained in the search for the significant ones. This makes it possible to compute the estimated discovery rate (EDR). The EDR can then be compared to the observed discovery rate (ODR), which is simply the percentage of published results that are statistically significant. The bigger the difference between the ODR and the EDR, the more questionable research practices were used (see Schimmack, 2021, for a more detailed introduction).

I focus on social psychology because (a) I am a social/personality psychologist who is interested in the credibility of results in my field, and (b) social psychology has a large number of replication failures (Schimmack, 2020). Similar analyses are planned for other areas of psychology and other disciplines. I also focus on social psychology more than personality psychology because personality psychology is often more exploratory than confirmatory.

Method

I illustrate the use of z-curve to quantify the use of QRPs with the most extreme examples in the credibility rankings of social/personality psychologists (Schimmack, 2021). Figure 1 shows the z-value plot (ZVP) of David Matsumoto. To generate this plot, the test statistics from t-tests and F-tests were transformed into exact p-values and then transformed into the corresponding values on the standard normal distribution. As two-sided p-values are used, all z-scores are positive. However, because the curve is centered over the z-score that corresponds to the median power before selection for significance (and not over zero, as it would be when the null-hypothesis is true), the distribution can look relatively normal. The variance of the distribution will be greater than 1 when studies vary in statistical power.

The grey curve in Figure 1 shows the predicted distribution based on the observed distribution of z-scores that are significant (z > 1.96). In this case, the observed number of non-significant results is similar to the predicted number of non-significant results. As a result, the ODR of 78% closely matches the EDR of 79%.

Figure 2 shows the results for Shelly Chaiken. The first notable observation is that the ODR of 75% is very similar to Matsumoto’s ODR of 78%. Thus, if we simply count the number of significant and non-significant p-values, there is no difference between these two researchers. However, the z-value plot (ZVP) shows a dramatically different picture. The peak density is 0.3 for Matsumoto and 1.0 for Chaiken. As the maximum density of the standard normal distribution is .4, it is clear that the results in Chaiken’s articles are not from an actual sampling distribution. In other words, QRPs must have been used to produce too many just significant results with p-values just below .05.

The comparison of the ODR and EDR shows a large discrepancy of 64 percentage points too many significant results (ODR = 75% minus EDR = 11%). This is clearly not a chance finding because the ODR falls well outside the 95% confidence interval of the EDR, 5% to 21%.

To examine the use of QRPs in social psychology, I computed the EDR and ODR for over 200 social/personality psychologists. Personality psychologists were excluded if they reported too few t-values and F-values. The actual values and additional statistics can be found in the credibility rankings (Schimmack, 2021). Here I used these data to examine the use of QRPs in social psychology.

Average Use of QRPs

The average ODR is 73.48 with a 95% confidence interval ranging from 72.67 to 74.29. The average EDR is 35.28 with a 95% confidence interval ranging from 33.14 to 37.43. The inflation due to QRPs is 38.20 percentage points, 95%CI = 36.10 to 40.30. This difference is highly significant, t(221) = 35.89; the p-value is so small that R’s default output only reports it as p < 2.2e-16.
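For readers who want the exact value, the p-value can be computed directly from the t-distribution; the line below is a sketch using the reported test statistic, not the original analysis script.

```r
# Exact two-sided p-value for t(221) = 35.89; R's t.test print method would only
# show "p-value < 2.2e-16" for a value this small.
2 * pt(35.89, df = 221, lower.tail = FALSE)
```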

It is of course not surprising that QRPs have been used. More important is the effect size estimate. The results suggest that QRPs inflate the discovery rate by over 100%. This explains why unbiased replication studies in social psychology have only a 25% chance of being significant (Open Science Collaboration, 2015). In fact, we can use the EDR as a conservative predictor of replication outcomes (Bartos & Schimmack, 2020). While the EDR of 35% is a bit higher than the actual replication rate, this may be due to the inclusion of non-focal hypothesis tests in these analyses. Z-curve analyses of focal hypothesis tests typically produce lower EDRs. In contrast, Fiedler and Schwarz failed to comment on the low replicability of social psychology. If social psychologists had not used QRPs, it would remain a mystery why their results are so hard to replicate.

In sum, the present results confirm that, on average, social psychologists have heavily used QRPs to produce significant results that support their predictions. However, these averages mask differences between researchers like Matsumoto and Chaiken. The next analyses explore these individual differences between researchers.

Cohort Effects

I had no predictions about the effect of cohort on the use of QRPs. I conducted a twitter poll that suggested a general intuition that the use of QRPs may not have changed over time, but there was a lot of uncertainty in these answers. Similar results were obtained in a Facebook poll in the Psychological Methods Discussion Group. Thus, the a priori hypothesis is a vague prior of no change.

The dataset includes different generations of researchers. I used the first publication listed in WebofScience to date researchers. The earliest date was 1964 (Robert S. Wyer). The latest date was 2012 (Kurt Gray). The histogram shows that researchers from the 1970s to 2000s were well-represented in the dataset.

There was a significant negative correlation between the ODR and cohort, r(N = 222) = -.25, 95%CI = -.12 to -.37, t(220) = 3.83, p = .0002. This finding suggests that over time the proportion of non-significant results increased. For researchers with the first publication in the 1970s, the average ODR was 76%, whereas it was 72% for researchers with the first publication in the 2000s. This is a modest trend. There are various explanations for this trend.

One possibility is that power decreased as researchers started looking for weaker effects. In this case, the EDR should also show a decrease. However, the EDR showed no relationship with cohort, r(N = 222) = -.03, 95%CI = -.16 to .10, t(220) = 0.48, p = .63. Thus, less power does not seem to explain the decrease in the ODR. At the same time, the finding that EDR does not show a notable, abs(r) < .2, relationship with cohort suggests that power has remained constant over time. This is consistent with previous examinations of statistical power in social psychology (Sedlmeier & Gigerenzer, 1989).

Although the ODR decreased significantly and the EDR did not decrease significantly, bias (ODR – EDR) did not show a significant relationship with cohort, r(N = 222) = -.06, 95%CI = -.19 to .07, t(220) = -0.94, p = .35, but the 95%CI allows for a slight decrease in bias that would be consistent with the significant decrease in the ODR.

In conclusion, there is a small, statistically significant decrease in the ODR, but the effect over the past four decades is too small to have practical significance. The EDR and bias are not even statistically significantly related to cohort. These results suggest that research practices, including the use of questionable ones, have not changed notably since the beginning of empirical social psychology (Cohen, 1961; Sterling, 1959).

Achievement Motivation

Another possibility is that in each generation, QRPs are used more by researchers who are more achievement motivated (Janke et al., 2019). After all, the reward structure in science is based on the number of publications, and significant results are often needed to publish. In social psychology it is also necessary to present a package of significant results across multiple studies, which is nearly impossible without the use of QRPs (Schimmack, 2012). To examine this hypothesis, I correlated the EDR with researchers’ H-Index (as of 2/1/2021). The correlation was small, r(N = 222) = .10, 95%CI = -.03 to .23, and not significant, t(220) = 1.44, p = .15. This finding is only seemingly inconsistent with Janke et al.’s (2019) finding that self-reported QRPs were significantly correlated with self-reported ambition, r(217) = .20, p = .014. Both correlations are small and positive, suggesting that achievement-motivated researchers may be slightly more likely to use QRPs. However, the evidence is by no means conclusive and the actual relationship is weak. Thus, there is no evidence that highly productive researchers with impressive H-indices achieved their success by using QRPs more than other researchers. Rather, they became successful in a field where QRPs are the norm. If the norms were different, they would have become successful following these other norms.

Impact

A common saying in science is that “extraordinary claims require extraordinary evidence.” Thus, we might expect stronger evidence for claims of time-reversed feelings (Bem, 2011) than for evidence that individuals from different cultures regulate their emotions differently (Matsumoto et al., 2008). However, psychologists have relied on statistical significance with alpha = .05 as a simple rule to claim discoveries. This is a problem because statistical significance is meaningless when results are selected for significance and replication failures with non-significant results remain unpublished (Sterling, 1959). Thus, psychologists have trusted an invalid criterion that does not distinguish between true and false discoveries. It is, however, possible that social psychologists used other information (e.g., gossip about replication failures at conferences) to focus on credible results and to ignore incredible ones. To examine this question, I correlated authors’ EDR with the number of citations in 2019. I used citation counts for 2019 because citation counts for 2020 are not yet final (the results will be updated with the 2020 counts). Using 2019 increases the chances of finding a significant relationship because replication failures over the past decade could have produced changes in citation rates.

The correlation between the EDR and the number of citations was statistically significant, r(N = 222) = .16, 95%CI = .03 to .28, t(220) = 2.39, p = .018. However, the lower limit of the 95% confidence interval is close to zero. Thus, it is possible that the real relationship is too small to matter. Moreover, the non-parametric correlation with Kendall’s tau was not significant, tau = .085, z = 1.88, p = .06. Thus, at present there is insufficient evidence to suggest that citation counts take the credibility of significant results into account. At present, p-values less than .05 are treated as equally credible no matter how they were produced.
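The analysis described above amounts to two calls to cor.test. The sketch below uses a placeholder data frame with random values and hypothetical column names, because the real researcher-level data live in the credibility rankings; it is not the original analysis script.

```r
# Placeholder data standing in for ~222 researchers; values are random.
set.seed(2)
dat <- data.frame(edr           = runif(222, 10, 90),
                  citations2019 = rpois(222, 400))

cor.test(dat$edr, dat$citations2019)                      # Pearson r with 95% CI
cor.test(dat$edr, dat$citations2019, method = "kendall")  # Kendall's tau
```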

Conclusion

There is general agreement that questionable research practices have been used to produce an unreal success rate of 90% or more in psychology journals (Sterling, 1959). However, there is less agreement about the amount of QRPs that are being used and the implications for the credibility of significant results in psychology journals (John et al., 2012; Fiedler & Schwarz, 2016). The problem is that self-reports may be biased because researchers are unable or unwilling to report the use of QRPs (Nisbett & Wilson, 1977). Thus, it is necessary to examine this question with alternative methods. The present study used a statistical method to compare the observed discovery rate with a statistically estimated discovery rate based on the distribution of significant p-values. The results showed that on average social psychologists have made extensive use of QRPs to inflate an expected discovery rate of around 35% to an observed discovery rate of 70%. Moreover, the estimated discovery rate of 35% is likely to be an inflated estimate of the discovery rate for focal hypothesis tests because the present analysis is based on focal and non-focal tests. This would explain why the actual success rate in replication studies is even lower than the estimated discovery rate of 35% (Open Science Collaboration, 2015).

The main novel contribution of this study was to examine individual differences in the use of QRPs. While the ODR was fairly consistent across articles, the EDR varied considerably across researchers. However, this variation showed only very small relationships with a researcher’s cohort (first year of publication). This finding suggests that the use of QRPs varies more across research fields and other factors than over time. Additional analyses should explore predictors of the variation across researchers.

Another finding was that citations of authors’ work do not take credibility of p-values into account. Citations are influenced by popularity of topics and other factors and do not take the strength of evidence into account. One reason for this might be that social psychologists often publish multiple internal replications within a single article. This gives the illusion that results are robust and credible because it is very unlikely to replicate type-I errors. However, Bem’s (2011) article with 9 internal replications of time-reversed feelings showed that QRPs are also used to produce consistent results within a single article (Francis, 2012; Schimmack, 2012). Thus, number of significant results within an article or across articles is also an invalid criterion to evaluate the robustness of results.

In conclusion, social psychologists have conducted studies with low statistical power since the beginning of empirical social psychology. The main reason for this is the preference for between-subject designs that have low statistical power with small sample sizes of N = 40 participants and small to moderate effect sizes. Despite repeated warnings about the problems of selection for significance (Sterling, 1959) and the problems of small sample sizes (Cohen, 1961; Sedlmeier & Gigerenzer, 1989; Tversky & Kahneman, 1971), these practices have not changed since Festinger conducted his seminal study on dissonance with n = 20 per group. Over the past decades, social psychology journals have reported thousands of statistically significant results that are used in review articles, meta-analyses, textbooks, and popular books as evidence to support claims about human behavior. The problem is that it is unclear which of these significant results are true positives and which are false positives, especially if false positives are not just strictly nil results but also results with tiny effect sizes that have no practical significance. Without other reliable information, even social psychologists do not know which of their colleagues’ results are credible and which are not. Over the past decade, the inability to distinguish credible and incredible information has produced heated debates and a lack of confidence in published results. The present study shows that the general research practices of a researcher provide valuable information about credibility. For example, a p-value of .01 from a researcher with an EDR of 70% is more credible than a p-value of .01 from a researcher with an EDR of 15%. Thus, rather than stereotyping social psychologists based on the low replication rate in the Open Science Collaboration project, social psychologists should be evaluated based on their own research practices.

References

Cairo, A. H., Green, J. D., Forsyth, D. R., Behler, A. M. C., & Raldiris, T. L. (2020). Gray (Literature) Matters: Evidence of Selective Hypothesis Reporting in Social Psychological Research. Personality and Social Psychology Bulletin, 46(9), 1344–1362. https://doi.org/10.1177/0146167220903896

Janke, S., Daumiller, M., & Rudert, S. C. (2019). Dark pathways to achievement in science: Researchers’ achievement goals predict engagement in questionable research practices. Social Psychological and Personality Science, 10(6), 783–791. https://doi.org/10.1177/1948550618790227

Nations’ Well-Being and Wealth

Scientists have made a contribution when a phenomenon or a statistic is named after them. Thus, it is fair to say that Easterlin made a contribution to happiness research because researchers who write about income and happiness often mention his 1974 article “Does Economic Growth Improve the Human Lot? Some Empirical Evidence” (Easterlin, 1974).

To be fair, the article examines the relationship between income and happiness from three perspectives: (a) the correlation between income and happiness across individuals within nations, (b) the correlation of average incomes and average happiness across nations, and (c) the correlation between average income and average happiness within nations over time. A fourth perspective, namely the correlation between income and happiness within individuals over time, was not examined because no data were available in 1974.

Even for some of the other questions, the data were limited. Here I want to draw attention to Easterlin’s examination of correlations between nations’ wealth and well-being. He draws heavily on Cantril’s seminal contribution to this topic. Cantril (1965) not only developed a measure that can be used to compare well-being across nations, he also used this measure to compare the well-being of 14 nations (Cuba is not included in Table 1 because I did not have new data).

[Figure: Cantril’s cross-cultural well-being data]

Cantril also correlated the happiness scores with a measure of nations’ wealth. The correlation was r = .5. Cantril also suggested that Cuba and the Dominican Republic were positive and negative outliers, respectively. Excluding these two nations increases the correlation to r = .7.

Easterlin took issue with these results.

“Actually the association between wealth and happiness indicated by Cantril’s international data is not so clear-cut. This is shown by a scatter diagram of the data (Fig. I). The inference about a positive association relies heavily on the observations for India and the United States. [According to Cantril (1965, pp. 130-131), the values for Cuba and the Dominican Republic reflect unusual political circumstances – the immediate aftermath of a successful revolution in Cuba and prolonged political turmoil in the Dominican Republic].

What is perhaps most striking is that the personal happiness ratings for 10 of the 14 countries lie virtually within half a point of the midpoint rating of 5, as is brought out by the broken horizontal lines in the diagram. While a difference of rating of only 0.2 is significant at the 0.05 level, nevertheless there is not much evidence, for these 10 countries, of a systematic association between income and happiness. The closeness of the happiness ratings implies also that a similar lack of association would be found between happiness and other economic magnitudes such as income inequality or the rate of change of income.”

Nearly 50 years later, it is possible to revisit Easterlin’s challenge of Cantril’s claim that nations’ well-being is tied to their wealth with much better data from the Gallup World Poll. The Gallup World Poll used the same measure of well-being. However, it also provides a better measure of citizens’ wealth by asking for income. In contrast, GDP can be distorted and may not reflect the spending power of the average citizen very well. The data about well-being (World Happiness Report, 2020) and median per capita income (Gallup) are publicly available. All I needed to do was to compute the correlation and make a pretty graph.

The Pearson correlation between income and the ladder scores is r(126) = .75. The rank correlation is r(126) = .80, and the Pearson correlation between the log of income and the ladder scores is r(126) = .85. These results strongly support Cantril’s prediction based on his interpretation of the first cross-national study in the 1960s and refute Easterlin’s challenge that this correlation is merely driven by two outliers. Other researchers who analyzed the Gallup World Poll data also reported correlations of r = .8 and showed high stability of nations’ wealth and income over time (Zyphur et al., 2020).
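As a sketch of how these correlations can be computed, the R code below uses a placeholder data frame with hypothetical column names (ladder for the World Happiness Report score, income for Gallup’s median per-capita income); it is not the original analysis script, and the values are random stand-ins for the 126 nations.

```r
# Placeholder data standing in for 126 nations; values are random.
set.seed(3)
nations <- data.frame(income = exp(rnorm(126, 8, 1)),
                      ladder = runif(126, 3, 8))

cor(nations$income, nations$ladder)                       # Pearson r
cor(nations$income, nations$ladder, method = "spearman")  # rank correlation
cor(log(nations$income), nations$ladder)                  # Pearson r with log income

plot(log(nations$income), nations$ladder,
     xlab = "log median per-capita income", ylab = "Cantril ladder score")
```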

Figure 2 also shows that Easterlin underestimated the range of well-being scores. Even ignoring additional factors like wars, income alone can move well-being from a 4 in one of the poorest countries in the world (Burundi) to close to an 8 in one of the richest countries in the world (Norway). It also does not show that Scandinavian countries have a happiness secret. The main reason for their high average well-being appears to be that median personal incomes are very high.

The main conclusion is that social scientists are often biased for a number of reasons. The bias is evident in Easterlin’s interpretation of Cantril’s data. The same anti-materialistic bias can be found in many other articles on this topic that claim the benefits of wealth are limited.

To be clear, a log-function implies that the same amount of wealth buys more well-being in poor countries, but the graph shows no evidence that the benefits of wealth level off. It is also true that the relationship between GDP and happiness over time is more complicated. However, regarding cross-national differences the results are clear. There is a very strong relationship between wealth and well-being. Studies that do not control for this relationship may report spurious relationships that disappear when income is included as a predictor.

Furthermore, the focus on happiness ignores that wealth also buys longer lives. Thus, individuals in richer nations not only have happier lives they also have more happy life years. The current Covid-19 pandemic further increases these inequalities.

In conclusion, one concern about subjective measures of well-being has been that individuals in poor countries may be happy with less and that happiness measures fail to reflect human suffering. This is not the case. Sustainable, global economic growth that raises per capita wealth remains a challenge to improve human well-being.