Personalized P-Values for Social/Personality Psychologists

Last update 8/25/2021
(expanded to 410 social/personality psychologists; included Dan Ariely)

Introduction

Since Fisher invented null-hypothesis significance testing, researchers have used p < .05 as a statistical criterion to interpret results as discoveries worthy of discussion (i.e., as evidence that the null-hypothesis is false). Once published, these results are often treated as real findings, even though alpha does not control the risk of false discoveries.

Statisticians have warned against exclusive reliance on p < .05, but nearly 100 years after Fisher popularized this approach, it is still the most common way to interpret data, mainly because attempts to improve on this practice have failed. The core problem is that a single statistical result is difficult to interpret; when individual results are interpreted in the context of other results, they become more informative. Based on the distribution of p-values, it is possible to estimate the maximum false discovery rate (Bartos & Schimmack, 2020; Jager & Leek, 2014). This approach can be applied to the p-values published by individual authors to adjust their alpha criterion so that the risk of false discoveries stays at a reasonable level, FDR < .05.
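
As a rough sketch of the logic (assuming the Soric, 1989, upper bound that z-curve's maximum FDR builds on), the conversion from a discovery rate to a maximum false discovery rate is a one-liner in R:

```r
# Maximum false discovery rate implied by a discovery rate (Soric, 1989):
# if a fraction edr of all tests is expected to be significant at the given
# alpha, at most this fraction of significant results can be false discoveries.
max_fdr <- function(edr, alpha = .05) {
  (1 / edr - 1) * alpha / (1 - alpha)
}

max_fdr(.79)  # ~0.01: a high discovery rate caps the FDR at about 1%
max_fdr(.11)  # ~0.43 (the post reports 41%, based on the unrounded EDR)
```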

Researchers who mainly test true hypotheses with high power have a high discovery rate (many p-values below .05) and a low false discovery rate (FDR < .05). Figure 1 shows an example of a researcher who followed this strategy (for a detailed description of z-curve plots, see Schimmack, 2021).

We see that out of the 317 test statistics retrieved from David Matsumoto's articles, 246 were significant with alpha = .05. This is an observed discovery rate (ODR) of 78%. We also see that this discovery rate closely matches the estimated discovery rate (EDR) based on the distribution of the significant p-values, p < .05. The EDR is 79%. With an EDR of 79%, the maximum false discovery rate is only 1%. However, the 95% CI is wide, and the lower bound of the CI for the EDR, 27%, allows for up to 14% false discoveries.
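
For readers who want to check the arithmetic, the ODR and its sampling uncertainty follow directly from the reported counts (base R):

```r
# Observed discovery rate: significant tests out of all reported tests.
binom.test(x = 246, n = 317)  # estimate ~0.78 with an exact 95% CI
```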

When the ODR matches the EDR, there is no evidence of publication bias. In this case, we can improve the estimates by fitting all p-values, including the non-significant ones (Figure 2). With a tighter CI for the EDR, the 95% CI for the maximum FDR ranges from 1% to 3%. Thus, we can be confident that no more than 5% of the significant results with alpha = .05 are false discoveries. Readers can therefore continue to use alpha = .05 to look for interesting discoveries in Matsumoto's articles.

Figure 3 shows the results for a different type of researcher, one who took risks and studied weak effect sizes with small samples. This strategy produces many non-significant results that are often not published. The selection for significance inflates the observed discovery rate, but the z-curve plot and the comparison of the ODR with the EDR reveal the influence of publication bias. Here the ODR is similar to Figure 1, but the EDR is only 11%. An EDR of 11% translates into a large maximum false discovery rate of 41%. In addition, the 95% CI of the EDR includes 5%, which means the risk of false positives could be as high as 100%. In this case, using alpha = .05 to interpret results as discoveries is very risky. Clearly, p < .05 means something very different when reading an article by David Matsumoto or Shelly Chaiken.

Rather than dismissing all of Chaiken’s results, we can try to lower alpha to reduce the false discovery rate. If we set alpha = .01, the FDR is 15%. If we set alpha = .005, the FDR is 8%. To get the FDR below 5%, we need to set alpha to .001.
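
In code, this adjustment amounts to picking the largest alpha whose implied maximum FDR stays below the target. The EDR values below are hypothetical, chosen to reproduce the FDR figures just quoted; in practice each one would come from re-fitting z-curve at the candidate alpha.

```r
# Largest alpha whose implied maximum FDR stays below the target.
personalized_alpha <- function(edr_at_alpha, target = .05) {
  alphas <- as.numeric(names(edr_at_alpha))
  fdr <- (1 / edr_at_alpha - 1) * alphas / (1 - alphas)
  max(alphas[fdr <= target])
}

# Hypothetical EDR estimates for a risky research profile:
edr <- c(".05" = .110, ".01" = .063, ".005" = .055, ".001" = .045)
personalized_alpha(edr)  # .001, matching the adjustment described above
```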

A uniform criterion of FDR < 5% is applied to all researchers in the rankings below. For some this means no adjustment to the traditional criterion. For others, alpha is lowered to .01, and for a few even lower than that.

The rankings below are based on automatically extracted test statistics from 40 journals (List of journals). The results should be interpreted with caution and treated as preliminary. They depend on the specific set of journals that were searched, the way results are reported, and many other factors. The data are available (data.drop), and researchers can exclude or add articles and run their own analyses using the z-curve package in R (https://replicationindex.com/2020/01/10/z-curve-2-0/).
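
For readers who want to try this themselves, a minimal analysis might look like the sketch below; the simulated z-values are a stand-in for real extracted test statistics, and the call assumes the CRAN zcurve package's default interface.

```r
library(zcurve)

set.seed(1)
z <- abs(rnorm(300, mean = 1.5, sd = 1.2))  # stand-in for extracted z-values

fit <- zcurve(z)   # by default the package fits the significant z-values
summary(fit)       # reports ERR and EDR with bootstrapped confidence intervals
```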

I am also happy to receive feedback about coding errors. I also recommend hand-coding articles to adjust alpha on the basis of focal hypothesis tests. This typically lowers the EDR and increases the FDR. For example, the automated method produced an EDR of 31% for Bargh, whereas hand-coding of focal tests produced an EDR of 12% (Bargh-Audit).

And here are the rankings. The results are fully automated, and I was not able to cover up the fact that I placed at only #188 out of 400. In another post, I will explain how researchers can move up in the rankings. Of course, one way to do so is to increase statistical power in future studies. The rankings will be updated again when the 2021 data are available.

Despite their preliminary nature, I am confident that the results provide valuable information. Until now, all p-values below .05 have been treated as if they are equally informative. The rankings here show that this is not the case. While p = .02 can be informative for one researcher, p = .002 may still entail a high false discovery risk for another.

Good science requires not only open and objective reporting of new data; it also requires unbiased review of the literature. However, there are no rules and regulations regarding citations, and many authors cherry-pick citations that are consistent with their claims. Even when studies have failed to replicate, original studies are cited without citing the replication failures. In some cases, authors even cite original articles that have been retracted. Fortunately, it is easy to spot these acts of unscientific behavior. Here I am starting a project to list examples of such behavior. Hopefully, more scientists will take the time to hold their colleagues accountable for ethical citation practices. They can even do so anonymously on the PubPeer comment site.

(Tests = number of extracted test statistics; ODR = observed discovery rate; EDR = estimated discovery rate; ERR = expected replication rate; FDR = maximum false discovery rate; Alpha = adjusted significance criterion. Rates are percentages.)

Rank | Name | Tests | ODR | EDR | ERR | FDR | Alpha
---- | ---- | ----- | --- | --- | --- | --- | -----
1 | Robert A. Emmons | 53 | 87 | 89 | 90 | 1 | .05
2 | Allison L. Skinner | 229 | 59 | 81 | 85 | 1 | .05
3 | David Matsumoto | 378 | 83 | 79 | 85 | 1 | .05
4 | Linda J. Skitka | 532 | 68 | 75 | 82 | 2 | .05
5 | Todd K. Shackelford | 305 | 77 | 75 | 82 | 2 | .05
6 | Jonathan B. Freeman | 274 | 59 | 75 | 81 | 2 | .05
7 | Virgil Zeigler-Hill | 515 | 72 | 74 | 81 | 2 | .05
8 | Arthur A. Stone | 310 | 75 | 73 | 81 | 2 | .05
9 | David P. Schmitt | 207 | 78 | 71 | 77 | 2 | .05
10 | Emily A. Impett | 549 | 77 | 70 | 76 | 2 | .05
11 | Paula Bressan | 62 | 82 | 70 | 76 | 2 | .05
12 | Kurt Gray | 487 | 79 | 69 | 81 | 2 | .05
13 | Michael E. McCullough | 334 | 69 | 69 | 78 | 2 | .05
14 | Kipling D. Williams | 843 | 75 | 69 | 77 | 2 | .05
15 | John M. Zelenski | 156 | 71 | 69 | 76 | 2 | .05
16 | Amy J. C. Cuddy | 212 | 83 | 68 | 78 | 2 | .05
17 | Elke U. Weber | 312 | 69 | 68 | 77 | 0 | .05
18 | Hilary B. Bergsieker | 439 | 67 | 68 | 74 | 2 | .05
19 | Cameron Anderson | 652 | 71 | 67 | 74 | 3 | .05
20 | Rachael E. Jack | 249 | 70 | 66 | 80 | 3 | .05
21 | Jamil Zaki | 430 | 78 | 66 | 76 | 3 | .05
22 | A. Janet Tomiyama | 76 | 78 | 65 | 76 | 3 | .05
23 | Benjamin R. Karney | 392 | 56 | 65 | 73 | 3 | .05
24 | Phoebe C. Ellsworth | 605 | 74 | 65 | 72 | 3 | .05
25 | Jim Sidanius | 487 | 69 | 65 | 72 | 3 | .05
26 | Amelie Mummendey | 461 | 70 | 65 | 72 | 3 | .05
27 | Carol D. Ryff | 280 | 84 | 64 | 76 | 3 | .05
28 | Juliane Degner | 435 | 63 | 64 | 71 | 3 | .05
29 | Steven J. Heine | 597 | 78 | 63 | 77 | 3 | .05
30 | David M. Amodio | 584 | 66 | 63 | 70 | 3 | .05
31 | Thomas N Bradbury | 398 | 61 | 63 | 69 | 3 | .05
32 | Elaine Fox | 472 | 79 | 62 | 78 | 3 | .05
33 | Miles Hewstone | 1427 | 70 | 62 | 73 | 3 | .05
34 | Linda R. Tropp | 344 | 65 | 61 | 80 | 3 | .05
35 | Rainer Greifeneder | 944 | 75 | 61 | 77 | 3 | .05
36 | Klaus Fiedler | 1950 | 77 | 61 | 74 | 3 | .05
37 | Jesse Graham | 377 | 70 | 60 | 76 | 3 | .05
38 | Richard W. Robins | 270 | 76 | 60 | 70 | 4 | .05
39 | Simine Vazire | 137 | 66 | 60 | 64 | 4 | .05
40 | On Amir | 267 | 67 | 59 | 88 | 4 | .05
41 | Edward P. Lemay | 289 | 87 | 59 | 81 | 4 | .05
42 | William B. Swann Jr. | 1070 | 78 | 59 | 80 | 4 | .05
43 | Margaret S. Clark | 505 | 75 | 59 | 77 | 4 | .05
44 | Bernhard Leidner | 724 | 64 | 59 | 65 | 4 | .05
45 | B. Keith Payne | 879 | 71 | 58 | 76 | 4 | .05
46 | Ximena B. Arriaga | 284 | 66 | 58 | 69 | 4 | .05
47 | Joris Lammers | 728 | 69 | 58 | 69 | 4 | .05
48 | Patricia G. Devine | 606 | 71 | 58 | 67 | 4 | .05
49 | Rainer Reisenzein | 201 | 65 | 57 | 69 | 4 | .05
50 | Barbara A. Mellers | 287 | 80 | 56 | 78 | 4 | .05
51 | Joris Lammers | 705 | 69 | 56 | 69 | 4 | .05
52 | Jean M. Twenge | 381 | 72 | 56 | 59 | 4 | .05
53 | Nicholas Epley | 1504 | 74 | 55 | 72 | 4 | .05
54 | Kaiping Peng | 566 | 77 | 54 | 75 | 4 | .05
55 | Krishna Savani | 638 | 71 | 53 | 69 | 5 | .05
56 | Leslie Ashburn-Nardo | 109 | 80 | 52 | 83 | 5 | .05
57 | Lee Jussim | 226 | 80 | 52 | 71 | 5 | .05
58 | Richard M. Ryan | 998 | 78 | 52 | 69 | 5 | .05
59 | Ethan Kross | 614 | 66 | 52 | 67 | 5 | .05
60 | Edward L. Deci | 284 | 79 | 52 | 63 | 5 | .05
61 | Roger Giner-Sorolla | 663 | 81 | 51 | 80 | 5 | .05
62 | Bertram F. Malle | 422 | 73 | 51 | 75 | 5 | .05
63 | George A. Bonanno | 479 | 72 | 51 | 70 | 5 | .05
64 | Jens B. Asendorpf | 253 | 74 | 51 | 69 | 5 | .05
65 | Samuel D. Gosling | 108 | 58 | 51 | 62 | 5 | .05
66 | Tessa V. West | 691 | 71 | 51 | 59 | 5 | .05
67 | Paul Rozin | 449 | 78 | 50 | 84 | 5 | .05
68 | Joachim I. Krueger | 436 | 78 | 50 | 81 | 5 | .05
69 | Sheena S. Iyengar | 207 | 63 | 50 | 80 | 5 | .05
70 | James J. Gross | 1104 | 72 | 50 | 77 | 5 | .05
71 | Mark Rubin | 306 | 68 | 50 | 75 | 5 | .05
72 | Pieter Van Dessel | 578 | 70 | 50 | 75 | 5 | .05
73 | Shinobu Kitayama | 983 | 76 | 50 | 71 | 5 | .05
74 | Matthew J. Hornsey | 1656 | 74 | 50 | 71 | 5 | .05
75 | Janice R. Kelly | 366 | 75 | 50 | 70 | 5 | .05
76 | Antonio L. Freitas | 247 | 79 | 50 | 64 | 5 | .05
77 | Paul K. Piff | 166 | 77 | 50 | 63 | 5 | .05
78 | Mina Cikara | 392 | 71 | 49 | 80 | 5 | .05
79 | Beate Seibt | 379 | 72 | 49 | 62 | 6 | .01
80 | Ludwin E. Molina | 163 | 69 | 49 | 61 | 5 | .05
81 | Bertram Gawronski | 1803 | 72 | 48 | 76 | 6 | .01
82 | Penelope Lockwood | 458 | 71 | 48 | 70 | 6 | .01
83 | Edward R. Hirt | 1042 | 81 | 48 | 65 | 6 | .01
84 | Matthew D. Lieberman | 398 | 72 | 47 | 80 | 6 | .01
85 | John T. Cacioppo | 438 | 76 | 47 | 69 | 6 | .01
86 | Agneta H. Fischer | 952 | 75 | 47 | 69 | 6 | .01
87 | Leaf van Boven | 711 | 72 | 47 | 67 | 6 | .01
88 | Stephanie A. Fryberg | 248 | 62 | 47 | 66 | 6 | .01
89 | Daniel M. Wegner | 602 | 76 | 47 | 65 | 6 | .01
90 | Anne E. Wilson | 785 | 71 | 47 | 64 | 6 | .01
91 | Rainer Banse | 402 | 78 | 46 | 72 | 6 | .01
92 | Alice H. Eagly | 330 | 75 | 46 | 71 | 6 | .01
93 | Jeanne L. Tsai | 1241 | 73 | 46 | 67 | 6 | .01
94 | Jennifer S. Lerner | 181 | 80 | 46 | 61 | 6 | .01
95 | Andrea L. Meltzer | 549 | 52 | 45 | 72 | 6 | .01
96 | R. Chris Fraley | 642 | 70 | 45 | 72 | 7 | .01
97 | Constantine Sedikides | 2566 | 71 | 45 | 70 | 6 | .01
98 | Paul Slovic | 377 | 74 | 45 | 70 | 6 | .01
99 | Dacher Keltner | 1233 | 72 | 45 | 64 | 6 | .01
100 | Brian A. Nosek | 816 | 68 | 44 | 81 | 7 | .01
101 | George Loewenstein | 752 | 71 | 44 | 72 | 7 | .01
102 | Ursula Hess | 774 | 78 | 44 | 71 | 7 | .01
103 | Jason P. Mitchell | 600 | 73 | 43 | 73 | 7 | .01
104 | Jessica L. Tracy | 632 | 74 | 43 | 71 | 7 | .01
105 | Charles M. Judd | 1054 | 76 | 43 | 68 | 7 | .01
106 | S. Alexander Haslam | 1198 | 72 | 43 | 64 | 7 | .01
107 | Mark Schaller | 565 | 73 | 43 | 61 | 7 | .01
108 | Susan T. Fiske | 911 | 78 | 42 | 74 | 7 | .01
109 | Lisa Feldman Barrett | 644 | 69 | 42 | 70 | 7 | .01
110 | Jolanda Jetten | 1956 | 73 | 42 | 67 | 7 | .01
111 | Mario Mikulincer | 901 | 89 | 42 | 64 | 7 | .01
112 | Bernadette Park | 973 | 77 | 42 | 64 | 7 | .01
113 | Paul A. M. Van Lange | 1092 | 70 | 42 | 63 | 7 | .01
114 | Wendi L. Gardner | 798 | 67 | 42 | 63 | 7 | .01
115 | Will M. Gervais | 110 | 69 | 42 | 59 | 7 | .01
116 | Jordan B. Peterson | 266 | 60 | 41 | 79 | 7 | .01
117 | Philip E. Tetlock | 549 | 79 | 41 | 73 | 7 | .01
118 | Amanda B. Diekman | 438 | 83 | 41 | 70 | 7 | .01
119 | Daniel H. J. Wigboldus | 492 | 76 | 41 | 67 | 8 | .01
120 | Michael Inzlicht | 686 | 66 | 41 | 63 | 8 | .01
121 | Naomi Ellemers | 2388 | 74 | 41 | 63 | 8 | .01
122 | Phillip Atiba Goff | 299 | 68 | 41 | 62 | 7 | .01
123 | Stacey Sinclair | 327 | 70 | 41 | 57 | 8 | .01
124 | Francesca Gino | 2521 | 75 | 40 | 69 | 8 | .01
125 | Michael I. Norton | 1136 | 71 | 40 | 69 | 8 | .01
126 | David J. Hauser | 156 | 74 | 40 | 68 | 8 | .01
127 | Elizabeth Page-Gould | 411 | 57 | 40 | 66 | 8 | .01
128 | Tiffany A. Ito | 349 | 80 | 40 | 64 | 8 | .01
129 | Richard E. Petty | 2771 | 69 | 40 | 64 | 8 | .01
130 | Tim Wildschut | 1374 | 73 | 40 | 64 | 8 | .01
131 | Norbert Schwarz | 1337 | 72 | 40 | 63 | 8 | .01
132 | Veronika Job | 362 | 70 | 40 | 63 | 8 | .01
133 | Wendy Wood | 462 | 75 | 40 | 62 | 8 | .01
134 | Minah H. Jung | 156 | 83 | 39 | 83 | 8 | .01
135 | Marcel Zeelenberg | 868 | 76 | 39 | 79 | 8 | .01
136 | Tobias Greitemeyer | 1737 | 72 | 39 | 67 | 8 | .01
137 | Jason E. Plaks | 582 | 70 | 39 | 67 | 8 | .01
138 | Carol S. Dweck | 1028 | 70 | 39 | 63 | 8 | .01
139 | Christian S. Crandall | 362 | 75 | 39 | 59 | 8 | .01
140 | Harry T. Reis | 998 | 69 | 38 | 74 | 9 | .01
141 | Vanessa K. Bohns | 420 | 77 | 38 | 74 | 8 | .01
142 | Jerry Suls | 413 | 71 | 38 | 68 | 8 | .01
143 | Eric D. Knowles | 384 | 68 | 38 | 64 | 8 | .01
144 | C. Nathan DeWall | 1336 | 73 | 38 | 63 | 9 | .01
145 | Clayton R. Critcher | 697 | 82 | 38 | 63 | 9 | .01
146 | John F. Dovidio | 2019 | 69 | 38 | 62 | 9 | .01
147 | Joshua Correll | 549 | 61 | 38 | 62 | 9 | .01
148 | Abigail A. Scholer | 556 | 58 | 38 | 62 | 9 | .01
149 | Chris Janiszewski | 107 | 81 | 38 | 58 | 9 | .01
150 | Herbert Bless | 586 | 73 | 38 | 57 | 9 | .01
151 | Mahzarin R. Banaji | 880 | 73 | 37 | 78 | 9 | .01
152 | Rolf Reber | 280 | 64 | 37 | 72 | 9 | .01
153 | Kevin N. Ochsner | 406 | 79 | 37 | 70 | 9 | .01
154 | Mark J. Brandt | 277 | 70 | 37 | 70 | 9 | .01
155 | Geoff MacDonald | 406 | 67 | 37 | 67 | 9 | .01
156 | Mara Mather | 1038 | 78 | 37 | 67 | 9 | .01
157 | Antony S. R. Manstead | 1656 | 72 | 37 | 62 | 9 | .01
158 | Lorne Campbell | 433 | 67 | 37 | 61 | 9 | .01
159 | Sanford E. DeVoe | 236 | 71 | 37 | 61 | 9 | .01
160 | Ayelet Fishbach | 1416 | 78 | 37 | 59 | 9 | .01
161 | Fritz Strack | 607 | 75 | 37 | 56 | 9 | .01
162 | Jeff T. Larsen | 181 | 74 | 36 | 67 | 10 | .01
163 | Nyla R. Branscombe | 1276 | 70 | 36 | 65 | 9 | .01
164 | Yaacov Schul | 411 | 61 | 36 | 64 | 9 | .01
165 | D. S. Moskowitz | 3418 | 74 | 36 | 63 | 9 | .01
166 | Pablo Brinol | 1356 | 67 | 36 | 62 | 9 | .01
167 | Todd B. Kashdan | 377 | 73 | 36 | 61 | 9 | .01
168 | Barbara L. Fredrickson | 287 | 72 | 36 | 61 | 9 | .01
169 | Duane T. Wegener | 980 | 77 | 36 | 60 | 9 | .01
170 | Joanne V. Wood | 1093 | 74 | 36 | 60 | 9 | .01
171 | Daniel A. Effron | 484 | 66 | 36 | 60 | 9 | .01
172 | Niall Bolger | 376 | 67 | 36 | 58 | 9 | .01
173 | Craig A. Anderson | 467 | 76 | 36 | 55 | 9 | .01
174 | Michael Harris Bond | 378 | 73 | 35 | 84 | 10 | .01
175 | Glenn Adams | 270 | 71 | 35 | 73 | 10 | .01
176 | Daniel M. Bernstein | 404 | 73 | 35 | 70 | 10 | .01
177 | C. Miguel Brendl | 121 | 76 | 35 | 68 | 10 | .01
178 | Azim F. Sharif | 183 | 74 | 35 | 68 | 10 | .01
179 | Emily Balcetis | 599 | 69 | 35 | 68 | 10 | .01
180 | Eva Walther | 493 | 82 | 35 | 66 | 10 | .01
181 | Michael D. Robinson | 1388 | 78 | 35 | 66 | 10 | .01
182 | Igor Grossmann | 203 | 64 | 35 | 66 | 10 | .01
183 | Diana I. Tamir | 156 | 62 | 35 | 62 | 10 | .01
184 | Samuel L. Gaertner | 321 | 75 | 35 | 61 | 10 | .01
185 | John T. Jost | 794 | 70 | 35 | 61 | 10 | .01
186 | Eric L. Uhlmann | 457 | 67 | 35 | 61 | 10 | .01
187 | Nalini Ambady | 1256 | 62 | 35 | 56 | 10 | .01
188 | Daphna Oyserman | 446 | 55 | 35 | 54 | 10 | .01
189 | Victoria M. Esses | 295 | 75 | 35 | 53 | 10 | .01
190 | Linda J. Levine | 495 | 74 | 34 | 78 | 10 | .01
191 | Wiebke Bleidorn | 99 | 63 | 34 | 74 | 10 | .01
192 | Thomas Gilovich | 1193 | 80 | 34 | 69 | 10 | .01
193 | Alexander J. Rothman | 133 | 69 | 34 | 65 | 10 | .01
194 | Francis J. Flynn | 378 | 72 | 34 | 63 | 10 | .01
195 | Paula M. Niedenthal | 522 | 69 | 34 | 61 | 10 | .01
196 | Ozlem Ayduk | 549 | 62 | 34 | 59 | 10 | .01
197 | Paul Ekman | 88 | 70 | 34 | 55 | 10 | .01
198 | Alison Ledgerwood | 214 | 75 | 34 | 54 | 10 | .01
199 | Christopher R. Agnew | 325 | 75 | 33 | 76 | 10 | .01
200 | Michelle N. Shiota | 242 | 60 | 33 | 63 | 11 | .01
201 | Malte Friese | 501 | 61 | 33 | 57 | 11 | .01
202 | Kerry Kawakami | 487 | 68 | 33 | 56 | 10 | .01
203 | Danu Anthony Stinson | 494 | 77 | 33 | 54 | 11 | .01
204 | Jennifer A. Richeson | 831 | 67 | 33 | 52 | 11 | .01
205 | Margo J. Monteith | 773 | 76 | 32 | 77 | 11 | .01
206 | Ulrich Schimmack | 318 | 75 | 32 | 63 | 11 | .01
207 | Mark Snyder | 562 | 72 | 32 | 63 | 11 | .01
208 | Michele J. Gelfand | 365 | 76 | 32 | 63 | 11 | .01
209 | Russell H. Fazio | 1094 | 69 | 32 | 61 | 11 | .01
210 | Eric van Dijk | 238 | 67 | 32 | 60 | 11 | .01
211 | Tom Meyvis | 377 | 77 | 32 | 60 | 11 | .01
212 | Eli J. Finkel | 1392 | 62 | 32 | 57 | 11 | .01
213 | Robert B. Cialdini | 379 | 72 | 32 | 56 | 11 | .01
214 | Jonathan W. Kunstman | 430 | 66 | 32 | 53 | 11 | .01
215 | Delroy L. Paulhus | 121 | 77 | 31 | 82 | 12 | .01
216 | Yuen J. Huo | 132 | 74 | 31 | 80 | 11 | .01
217 | Gerd Bohner | 513 | 71 | 31 | 70 | 11 | .01
218 | Christopher K. Hsee | 689 | 75 | 31 | 63 | 11 | .01
219 | Vivian Zayas | 251 | 71 | 31 | 60 | 12 | .01
220 | John A. Bargh | 651 | 72 | 31 | 55 | 12 | .01
221 | Tom Pyszczynski | 948 | 69 | 31 | 54 | 12 | .01
222 | Roy F. Baumeister | 2442 | 69 | 31 | 52 | 12 | .01
223 | E. Ashby Plant | 831 | 77 | 31 | 51 | 11 | .01
224 | Kathleen D. Vohs | 944 | 68 | 31 | 51 | 12 | .01
225 | Jamie Arndt | 1318 | 69 | 31 | 50 | 12 | .01
226 | Anthony G. Greenwald | 357 | 72 | 30 | 83 | 12 | .01
227 | Nicholas O. Rule | 1294 | 68 | 30 | 75 | 13 | .01
228 | Lauren J. Human | 447 | 59 | 30 | 70 | 12 | .01
229 | Jennifer Crocker | 515 | 68 | 30 | 67 | 12 | .01
230 | Dale T. Miller | 521 | 71 | 30 | 64 | 12 | .01
231 | Thomas W. Schubert | 353 | 70 | 30 | 60 | 12 | .01
232 | Joseph A. Vandello | 494 | 73 | 30 | 60 | 12 | .01
233 | W. Keith Campbell | 528 | 70 | 30 | 58 | 12 | .01
234 | Arthur Aron | 307 | 65 | 30 | 56 | 12 | .01
235 | Pamela K. Smith | 149 | 66 | 30 | 52 | 12 | .01
236 | Aaron C. Kay | 1320 | 70 | 30 | 51 | 12 | .01
237 | Steven W. Gangestad | 198 | 63 | 30 | 41 | 13 | .005
238 | Eliot R. Smith | 445 | 79 | 29 | 73 | 13 | .01
239 | Nir Halevy | 262 | 68 | 29 | 72 | 13 | .01
240 | E. Allan Lind | 370 | 82 | 29 | 72 | 13 | .01
241 | Richard E. Nisbett | 319 | 73 | 29 | 69 | 13 | .01
242 | Hazel Rose Markus | 674 | 76 | 29 | 68 | 13 | .01
243 | Emanuele Castano | 445 | 69 | 29 | 65 | 13 | .01
244 | Dirk Wentura | 830 | 65 | 29 | 64 | 13 | .01
245 | Boris Egloff | 274 | 81 | 29 | 58 | 13 | .01
246 | Monica Biernat | 813 | 77 | 29 | 57 | 13 | .01
247 | Gordon B. Moskowitz | 374 | 72 | 29 | 57 | 13 | .01
248 | Russell Spears | 2286 | 73 | 29 | 55 | 13 | .01
249 | Jeff Greenberg | 1358 | 77 | 29 | 54 | 13 | .01
250 | Caryl E. Rusbult | 218 | 60 | 29 | 54 | 13 | .01
251 | Naomi I. Eisenberger | 179 | 74 | 28 | 79 | 14 | .01
252 | Brent W. Roberts | 562 | 72 | 28 | 77 | 14 | .01
253 | Yoav Bar-Anan | 525 | 75 | 28 | 76 | 13 | .01
254 | Eddie Harmon-Jones | 738 | 73 | 28 | 70 | 14 | .01
255 | Matthew Feinberg | 295 | 77 | 28 | 69 | 14 | .01
256 | Roland Neumann | 258 | 77 | 28 | 67 | 13 | .01
257 | Eugene M. Caruso | 822 | 75 | 28 | 64 | 13 | .01
258 | Ulrich Kuehnen | 822 | 75 | 28 | 64 | 13 | .01
259 | Elizabeth W. Dunn | 395 | 75 | 28 | 64 | 14 | .01
260 | Jeffry A. Simpson | 697 | 74 | 28 | 55 | 13 | .01
261 | Sander L. Koole | 767 | 65 | 28 | 52 | 14 | .01
262 | Richard J. Davidson | 380 | 64 | 28 | 51 | 14 | .01
263 | Shelly L. Gable | 364 | 64 | 28 | 50 | 14 | .01
264 | Adam D. Galinsky | 2154 | 70 | 28 | 49 | 13 | .01
265 | Grainne M. Fitzsimons | 585 | 68 | 28 | 49 | 14 | .01
266 | Geoffrey J. Leonardelli | 290 | 68 | 28 | 48 | 14 | .005
267 | Joshua Aronson | 183 | 85 | 28 | 46 | 14 | .005
268 | Henk Aarts | 1003 | 67 | 28 | 45 | 14 | .005
269 | Vanessa K. Bohns | 422 | 76 | 27 | 74 | 15 | .01
270 | Jan De Houwer | 1972 | 70 | 27 | 72 | 14 | .01
271 | Dan Ariely | 600 | 70 | 27 | 69 | 14 | .01
272 | Charles Stangor | 185 | 81 | 27 | 68 | 15 | .01
273 | Karl Christoph Klauer | 801 | 67 | 27 | 65 | 14 | .01
274 | Mario Gollwitzer | 500 | 58 | 27 | 62 | 14 | .01
275 | Jennifer S. Beer | 80 | 56 | 27 | 54 | 14 | .01
276 | Eldar Shafir | 107 | 78 | 27 | 51 | 14 | .01
277 | Guido H. E. Gendolla | 422 | 76 | 27 | 47 | 14 | .005
278 | Klaus R. Scherer | 467 | 83 | 26 | 78 | 15 | .01
279 | William G. Graziano | 532 | 71 | 26 | 66 | 15 | .01
280 | Galen V. Bodenhausen | 585 | 74 | 26 | 61 | 15 | .01
281 | Sonja Lyubomirsky | 530 | 71 | 26 | 59 | 15 | .01
282 | Kai Sassenberg | 872 | 71 | 26 | 56 | 15 | .01
283 | Kristin Laurin | 648 | 63 | 26 | 51 | 15 | .01
284 | Claude M. Steele | 434 | 73 | 26 | 42 | 15 | .005
285 | David G. Rand | 392 | 70 | 25 | 81 | 15 | .01
286 | Paul Bloom | 502 | 72 | 25 | 79 | 16 | .01
287 | Kerri L. Johnson | 532 | 76 | 25 | 76 | 15 | .01
288 | Batja Mesquita | 416 | 71 | 25 | 73 | 16 | .01
289 | Rebecca J. Schlegel | 261 | 67 | 25 | 71 | 15 | .01
290 | Phillip R. Shaver | 566 | 81 | 25 | 71 | 16 | .01
291 | David Dunning | 818 | 74 | 25 | 70 | 16 | .01
292 | Laurie A. Rudman | 482 | 72 | 25 | 68 | 16 | .01
293 | David A. Lishner | 105 | 65 | 25 | 63 | 16 | .01
294 | Mark J. Landau | 950 | 78 | 25 | 45 | 16 | .005
295 | Ronald S. Friedman | 183 | 79 | 25 | 44 | 16 | .005
296 | Joel Cooper | 257 | 72 | 25 | 39 | 16 | .005
297 | Alison L. Chasteen | 223 | 68 | 24 | 69 | 16 | .01
298 | Jeff Galak | 313 | 73 | 24 | 68 | 17 | .01
299 | Steven J. Sherman | 888 | 74 | 24 | 62 | 16 | .01
300 | Shigehiro Oishi | 1109 | 64 | 24 | 61 | 17 | .01
301 | Thomas Mussweiler | 604 | 70 | 24 | 43 | 17 | .005
302 | Mark W. Baldwin | 247 | 72 | 24 | 41 | 17 | .005
303 | Evan P. Apfelbaum | 256 | 62 | 24 | 41 | 17 | .005
304 | Nurit Shnabel | 564 | 76 | 23 | 78 | 18 | .01
305 | Klaus Rothermund | 738 | 71 | 23 | 76 | 18 | .01
306 | Felicia Pratto | 410 | 73 | 23 | 75 | 18 | .01
307 | Jonathan Haidt | 368 | 76 | 23 | 73 | 17 | .01
308 | Roland Imhoff | 365 | 74 | 23 | 73 | 18 | .01
309 | Jeffrey W Sherman | 992 | 68 | 23 | 71 | 17 | .01
310 | Jennifer L. Eberhardt | 202 | 71 | 23 | 62 | 18 | .005
311 | Bernard A. Nijstad | 693 | 71 | 23 | 52 | 18 | .005
312 | Brandon J. Schmeichel | 652 | 66 | 23 | 45 | 17 | .005
313 | Sam J. Maglio | 325 | 72 | 23 | 42 | 17 | .005
314 | David M. Buss | 461 | 82 | 22 | 80 | 19 | .01
315 | Yoel Inbar | 280 | 67 | 22 | 71 | 19 | .01
316 | Serena Chen | 865 | 72 | 22 | 67 | 19 | .005
317 | Spike W. S. Lee | 145 | 68 | 22 | 64 | 19 | .005
318 | Marilynn B. Brewer | 314 | 75 | 22 | 62 | 18 | .005
319 | Michael Ross | 1164 | 70 | 22 | 62 | 18 | .005
320 | Dieter Frey | 1538 | 68 | 22 | 58 | 18 | .005
321 | G. Daniel Lassiter | 189 | 82 | 22 | 55 | 19 | .01
322 | Sean M. McCrea | 584 | 73 | 22 | 54 | 19 | .005
323 | Wendy Berry Mendes | 965 | 68 | 22 | 44 | 19 | .005
324 | Paul W. Eastwick | 583 | 65 | 21 | 69 | 19 | .005
325 | Kees van den Bos | 1150 | 84 | 21 | 69 | 20 | .005
326 | Maya Tamir | 1342 | 80 | 21 | 64 | 19 | .005
327 | Joseph P. Forgas | 888 | 83 | 21 | 59 | 19 | .005
328 | Michaela Wanke | 362 | 74 | 21 | 59 | 19 | .005
329 | Dolores Albarracin | 540 | 66 | 21 | 56 | 20 | .005
330 | Elizabeth Levy Paluck | 31 | 84 | 21 | 55 | 20 | .005
331 | Vanessa LoBue | 299 | 68 | 20 | 76 | 21 | .01
332 | Christopher J. Armitage | 160 | 62 | 20 | 73 | 21 | .005
333 | Elizabeth A. Phelps | 686 | 78 | 20 | 72 | 21 | .005
334 | Jay J. van Bavel | 437 | 64 | 20 | 71 | 21 | .005
335 | David A. Pizarro | 227 | 71 | 20 | 69 | 21 | .005
336 | Andrew J. Elliot | 1018 | 81 | 20 | 67 | 21 | .005
337 | William A. Cunningham | 238 | 76 | 20 | 64 | 22 | .005
338 | Laura D. Scherer | 212 | 69 | 20 | 64 | 21 | .01
339 | Kentaro Fujita | 458 | 69 | 20 | 62 | 21 | .005
340 | Geoffrey L. Cohen | 1590 | 68 | 20 | 50 | 21 | .005
341 | Ana Guinote | 378 | 76 | 20 | 47 | 21 | .005
342 | Tanya L. Chartrand | 424 | 67 | 20 | 33 | 21 | .001
343 | Selin Kesebir | 328 | 66 | 19 | 73 | 22 | .005
344 | Vincent Y. Yzerbyt | 1412 | 73 | 19 | 73 | 22 | .01
345 | James K. McNulty | 1047 | 56 | 19 | 65 | 23 | .005
346 | Robert S. Wyer | 871 | 82 | 19 | 63 | 22 | .005
347 | Travis Proulx | 174 | 63 | 19 | 62 | 22 | .005
348 | Peter M. Gollwitzer | 1303 | 64 | 19 | 58 | 22 | .005
349 | Nilanjana Dasgupta | 383 | 76 | 19 | 52 | 22 | .005
350 | Jamie L. Goldenberg | 568 | 77 | 19 | 50 | 22 | .01
351 | Richard P. Eibach | 753 | 69 | 19 | 47 | 23 | .001
352 | Gerald L. Clore | 456 | 74 | 19 | 45 | 22 | .001
353 | James M. Tyler | 130 | 87 | 18 | 74 | 24 | .005
354 | Roland Deutsch | 365 | 78 | 18 | 71 | 24 | .005
355 | Ed Diener | 498 | 64 | 18 | 68 | 24 | .005
356 | Kennon M. Sheldon | 698 | 74 | 18 | 66 | 23 | .005
357 | Wilhelm Hofmann | 624 | 67 | 18 | 66 | 23 | .005
358 | Laura L. Carstensen | 723 | 77 | 18 | 64 | 24 | .005
359 | Toni Schmader | 546 | 69 | 18 | 61 | 24 | .005
360 | Frank D. Fincham | 734 | 69 | 18 | 59 | 24 | .005
361 | David K. Sherman | 1128 | 61 | 18 | 57 | 24 | .005
362 | Lisa K. Libby | 418 | 65 | 18 | 54 | 24 | .005
363 | Chen-Bo Zhong | 327 | 68 | 18 | 49 | 25 | .005
364 | Stefan C. Schmukle | 114 | 62 | 17 | 71 | 26 | .005
365 | Michel Tuan Pham | 246 | 86 | 17 | 68 | 25 | .005
366 | Leandre R. Fabrigar | 632 | 70 | 17 | 67 | 26 | .005
367 | Neal J. Roese | 368 | 64 | 17 | 65 | 25 | .005
368 | Carey K. Morewedge | 633 | 76 | 17 | 65 | 26 | .005
369 | Timothy D. Wilson | 798 | 65 | 17 | 63 | 26 | .005
370 | Brad J. Bushman | 897 | 74 | 17 | 62 | 25 | .005
371 | Ara Norenzayan | 225 | 72 | 17 | 61 | 25 | .005
372 | Benoit Monin | 635 | 65 | 17 | 56 | 25 | .005
373 | Michael W. Kraus | 617 | 72 | 17 | 55 | 26 | .005
374 | Ad van Knippenberg | 683 | 72 | 17 | 55 | 26 | .001
375 | E. Tory. Higgins | 1868 | 68 | 17 | 54 | 25 | .001
376 | Ap Dijksterhuis | 750 | 68 | 17 | 54 | 26 | .005
377 | Joseph Cesario | 146 | 62 | 17 | 45 | 26 | .001
378 | Simone Schnall | 270 | 62 | 17 | 31 | 26 | .001
379 | Joshua M. Ackerman | 380 | 53 | 16 | 70 | 13 | .01
380 | Melissa J. Ferguson | 1163 | 72 | 16 | 69 | 27 | .005
381 | Laura A. King | 391 | 76 | 16 | 68 | 29 | .005
382 | Daniel T. Gilbert | 724 | 65 | 16 | 65 | 27 | .005
383 | Charles S. Carver | 154 | 82 | 16 | 64 | 28 | .005
384 | Leif D. Nelson | 409 | 74 | 16 | 64 | 28 | .005
385 | David DeSteno | 201 | 83 | 16 | 57 | 28 | .005
386 | Sandra L. Murray | 697 | 60 | 16 | 55 | 28 | .001
387 | Heejung S. Kim | 858 | 59 | 16 | 55 | 29 | .001
388 | Mark P. Zanna | 659 | 64 | 16 | 48 | 28 | .001
389 | Nira Liberman | 1304 | 75 | 15 | 65 | 31 | .005
390 | Gun R. Semin | 159 | 79 | 15 | 64 | 29 | .005
391 | Tal Eyal | 439 | 62 | 15 | 62 | 29 | .005
392 | Nathaniel M Lambert | 456 | 66 | 15 | 59 | 30 | .001
393 | Angela L. Duckworth | 122 | 61 | 15 | 55 | 30 | .005
394 | Dana R. Carney | 200 | 60 | 15 | 53 | 30 | .001
395 | Garriy Shteynberg | 168 | 54 | 15 | 31 | 30 | .005
396 | Lee Ross | 349 | 77 | 14 | 63 | 31 | .001
397 | Arie W. Kruglanski | 1228 | 78 | 14 | 58 | 33 | .001
398 | Ziva Kunda | 217 | 67 | 14 | 56 | 31 | .001
399 | Shelley E. Taylor | 427 | 69 | 14 | 52 | 31 | .001
400 | Jon K. Maner | 1040 | 65 | 14 | 52 | 32 | .001
401 | Gabriele Oettingen | 1047 | 61 | 14 | 49 | 33 | .001
402 | Nicole L. Mead | 240 | 70 | 14 | 46 | 33 | .01
403 | Gregory M. Walton | 587 | 69 | 14 | 44 | 33 | .001
404 | Michael A. Olson | 346 | 65 | 13 | 63 | 35 | .001
405 | Fiona Lee | 221 | 67 | 13 | 58 | 34 | .001
406 | Melody M. Chao | 237 | 57 | 13 | 58 | 36 | .001
407 | Adam L. Alter | 314 | 78 | 13 | 54 | 36 | .001
408 | Sarah E. Hill | 509 | 78 | 13 | 52 | 34 | .001
409 | Jaime L. Kurtz | 91 | 55 | 13 | 38 | 37 | .001
410 | Michael A. Zarate | 120 | 52 | 13 | 31 | 36 | .001
411 | Jennifer K. Bosson | 659 | 76 | 12 | 64 | 40 | .001
412 | Daniel M. Oppenheimer | 198 | 80 | 12 | 60 | 37 | .001
413 | Deborah A. Prentice | 89 | 80 | 12 | 57 | 38 | .001
414 | Yaacov Trope | 1277 | 73 | 12 | 57 | 38 | .001
415 | Oscar Ybarra | 305 | 63 | 12 | 55 | 40 | .001
416 | William von Hippel | 398 | 65 | 12 | 48 | 40 | .001
417 | Steven J. Spencer | 541 | 67 | 12 | 44 | 38 | .001
418 | Martie G. Haselton | 186 | 73 | 11 | 54 | 43 | .001
419 | Shelly Chaiken | 360 | 74 | 11 | 52 | 44 | .001
420 | Susan M. Andersen | 361 | 74 | 11 | 48 | 43 | .001
421 | Dov Cohen | 641 | 68 | 11 | 44 | 41 | .001
422 | Mark Muraven | 496 | 52 | 11 | 44 | 41 | .001
423 | Ian McGregor | 409 | 66 | 11 | 40 | 41 | .001
424 | Hans Ijzerman | 214 | 56 | 9 | 46 | 51 | .001
425 | Linda M. Isbell | 115 | 64 | 9 | 41 | 50 | .001
426 | Cheryl J. Wakslak | 278 | 73 | 8 | 35 | 59 | .001

What would Cohen say? A comment on p < .005

Most psychologists are trained in Fisherian statistics, which has become known as Null-Hypothesis Significance Testing (NHST).  NHST compares an observed effect size against a hypothetical effect size. The hypothetical effect size is typically zero; that is, the hypothesis is that there is no effect.  The deviation of the observed effect size from zero relative to the amount of sampling error provides a test statistic (test statistic = effect size / sampling error).  The test statistic can then be compared to a criterion value. The criterion value is typically chosen so that only 5% of test statistics would exceed the criterion value by chance alone.  If the test statistic exceeds this value, the null-hypothesis is rejected in favor of the inference that an effect greater than zero was present.
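
As a numeric illustration (with an arbitrary effect size and sample size), the whole procedure fits in a few lines of R:

```r
# test statistic = effect size / sampling error, compared to a criterion value
d  <- .4                  # observed standardized mean difference (illustrative)
se <- sqrt(2 / 50)        # sampling error for two groups of n = 50
z  <- d / se              # test statistic; here z = 2
p  <- 2 * pnorm(-abs(z))  # two-sided p-value, ~.046
p < .05                   # TRUE: the null hypothesis is rejected
```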

One major problem of NHST is that non-significant results are not considered. To address this limitation, Neyman and Pearson extended Fisherian statistics and introduced the concepts of type-I (alpha) and type-II (beta) errors. A type-I error occurs when researchers falsely reject a true null-hypothesis; that is, they infer from a significant result that an effect was present when there is actually no effect. The type-I error rate is fixed by the criterion for significance, which is typically p < .05. This means that a set of studies cannot produce more than 5% false-positive results. The maximum of 5% false positives would be observed only if none of the studies had a true effect; in that case, we would expect 5% significant and 95% non-significant results.

The important contribution by Neyman and Pearson was to consider the complementary type-II error.  A type-II error occurs when an effect is present, but a study produces a non-significant result.  In this case, researchers fail to detect a true effect.  The type-II error rate depends on the size of the effect and the amount of sampling error.  If effect sizes are small and sampling error is large, test statistics will often be too small to exceed the criterion value.
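
Under the normal approximation, this dependence can be written out directly; the function below is a sketch, not the exact t-test calculation.

```r
# Type-II error rate for a two-sample comparison of means
# (normal approximation; ignores rejections in the wrong direction).
type2_error <- function(d, n_per_cell, alpha = .05) {
  crit <- qnorm(1 - alpha / 2)        # criterion value on the z-scale
  ncp  <- d / sqrt(2 / n_per_cell)    # expected value of the test statistic
  pnorm(crit - ncp)                   # chance of a non-significant result
}

type2_error(d = .4, n_per_cell = 50)  # ~0.48: even a moderate effect is a coin flip
type2_error(d = .2, n_per_cell = 50)  # ~0.83: small effects are mostly missed
```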

Neyman-Pearson statistics was popularized in psychology by Jacob Cohen. In 1962, Cohen examined effect sizes and sample sizes (as a proxy for sampling error) in the Journal of Abnormal and Social Psychology and concluded that there is a high risk of type-II errors because sample sizes are too small to detect even moderate effect sizes and inadequate to detect small effect sizes. Over the following decades, methodologists repeatedly pointed out that psychologists often conduct studies with a high risk of failing to provide empirical evidence for real effects (Sedlmeier & Gigerenzer, 1989).

The concern about type-II errors has been largely ignored by empirical psychologists.  One possible reason is that journals had no problem filling volumes with significant results, while rejecting 80% of submissions that also presented significant results.  Apparently, type-II errors were much less common than methodologists feared.

However, in 2011 it became apparent that the high success rate in journals was illusory. Published results were not representative of studies that were conducted. Instead, researchers used questionable research practices or simply did not report studies with non-significant results.  In other words, the type-II error rate was as high as methodologists suspected, but selection of significant results created the impression that nearly all studies were successful in producing significant results.  The influential “False Positive Psychology” article suggested that it is very easy to produce significant results without an actual effect.  This led to the fear that many published results in psychology may be false positive results.

Doubt about the replicability and credibility of published results has led to numerous recommendations for the improvement of psychological science. One of the most obvious recommendations is to ensure that published results are representative of the studies that are actually conducted. Given the high type-II error rates, this would mean that journals would be filled with many non-significant and inconclusive results. This is not a very attractive solution because it is not clear what the scientific community can learn from an inconclusive result. A better solution is to increase the statistical power of studies. Statistical power is the complement of the type-II error rate (power = 1 − beta). As power increases, studies with a true effect have a higher chance of producing a true positive result (e.g., showing that a drug is an effective treatment for a disease). Numerous articles have suggested that researchers should increase power to increase the replicability and credibility of published results (e.g., Schimmack, 2012).

In a recent article, a team of 72 authors proposed another solution. They recommended that psychologists reduce the probability of a type-I error from 5% (1 out of 20 studies) to 0.5% (1 out of 200 studies). This recommendation is based on the belief that the replication crisis in psychology reflects a large number of type-I errors. Lowering the alpha criterion would reduce the maximum rate of type-I errors from 10 out of 200 studies to 1 out of 200 studies.

I believe that this recommendation is misguided because it ignores the consequences of a more stringent significance criterion on type-II errors.  Keeping resources and sampling error constant, reducing the type-I error rate increases the type-II error rate. This is undesirable because the actual type-II error is already large.

For example, a between-subject comparison of two means with a standardized effect size of d = .4 and a sample size of N = 100 (n = 50 per cell) has a 50% risk of a type-II error. The risk of a type-II error rises to 80% if alpha is reduced to .005. It makes no sense to conduct a study with an 80% chance of failure (Tversky & Kahneman, 1971). Thus, the call for a lower alpha implies that researchers will have to invest more resources to discover true positive results. Many researchers may simply lack the resources to meet this stringent significance criterion.
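
Base R's power.t.test reproduces these figures approximately:

```r
# Power for d = .4 with n = 50 per cell at the two alpha levels:
power.t.test(n = 50, delta = .4, sd = 1, sig.level = .05)$power   # ~0.51
power.t.test(n = 50, delta = .4, sd = 1, sig.level = .005)$power  # ~0.20
```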

My suggestion is exactly the opposite of the recommendation of a more stringent criterion. The main cause of selection bias in journals is that even the existing criterion of p < .05 is too stringent and leads to a high percentage of type-II errors that cannot be published. This has produced the replication crisis, with large file-drawers of studies with p-values greater than .05, the use of questionable research practices, and the publication of inflated effect sizes that cannot be replicated.

To avoid this problem, researchers should use a significance criterion that balances the risk of a type-I and a type-II error. For example, in a between-subject design with an expected effect size of d = .4 and N = 100, researchers should use p < .20 for significance, which reduces the risk of a type-II error to about 20%. In this case, type-I and type-II errors are balanced. If the study produces a p-value of, say, .15, researchers can publish the result with the conclusion that the study provided evidence for the effect. At the same time, readers are warned that they should not interpret this result as strong evidence for the effect because there is a 20% probability of a type-I error.
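
The same routine shows the rough balance of the two error rates at alpha = .20:

```r
# With alpha = .20 and the same design, power rises to roughly 80%:
power.t.test(n = 50, delta = .4, sd = 1, sig.level = .20)$power  # ~0.76
```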

Given this positive result, researchers can then follow up their initial study with a larger replication study that allows for stricter type-I error control while holding power constant. With d = .4, they now need N = 200 participants to have 80% power with alpha = .05. Even if the second study does not produce a significant result (the probability that two studies with 80% power are both significant is only 64%; Schimmack, 2012), researchers can combine the results of both studies, and with N = 300, the combined studies have 80% power with alpha = .01.

The advantage of starting with smaller studies and a higher alpha criterion is that researchers are able to test risky hypotheses with fewer resources. In the example, the first study used "only" 100 participants. In contrast, the proposal to require p < .005 as evidence for an original, risky study implies that researchers need to invest a lot of resources in a study that may provide inconclusive results if it fails to produce a significant result. A power analysis shows that a sample size of N = 338 participants is needed to have 80% power for an effect size of d = .4 with p < .005 as the criterion for significance.
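
The sample-size claims in the last two paragraphs can be checked with the same power routine:

```r
# n per cell for 80% power with d = .4 at different alpha levels:
power.t.test(delta = .4, sd = 1, power = .80, sig.level = .05)$n   # ~100 (N ~ 200)
power.t.test(delta = .4, sd = 1, power = .80, sig.level = .01)$n   # ~148 (N ~ 296)
power.t.test(delta = .4, sd = 1, power = .80, sig.level = .005)$n  # ~169 (N ~ 338)
```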

Rather than investing 300 participants in a risky study that may produce a non-significant and uninteresting result (eating green jelly beans does not cure cancer), researchers may be better able and willing to start with 100 participants and to follow up an encouraging result with a larger study. The evidential value that arises from one study with 300 participants or two studies with 100 and 200 participants is the same, but requiring p < .005 from the start discourages risky studies and puts even more pressure on researchers to produce significant results when all of their resources are tied up in a single study. In contrast, a more liberal alpha for initial studies reduces the need for questionable research practices and reduces the risk of type-II errors.

In conclusion, it is time to learn Neyman-Pearson statistics and to remember Cohen's important insight that many studies in psychology are underpowered. Low power produces inconclusive results that are not worth publishing. A study with low power is like a high-jumper who sets the bar too high and fails every time; we learn nothing about the jumper's ability. Scientists may learn from high-jump contests, where jumpers start at lower, realistic heights and raise the bar as they succeed. In the same manner, researchers should conduct pilot studies or risky exploratory studies with small samples and a high type-I error probability, and lower the alpha criterion gradually if the results are encouraging, while maintaining a reasonably low type-II error rate.

Evidently, a significant result with alpha = .20 does not provide conclusive evidence for an effect. However, the arbitrary p < .005 criterion also falls short of demonstrating conclusively that an effect exists. Journals publish thousands of results a year, and some of these results may be false positives even if the error rate is set at 1 out of 200. Thus, p < .005 is neither defensible as a criterion for a first exploratory study nor conclusive evidence for an effect. A better criterion for conclusive evidence is that an effect can be replicated across different laboratories with a type-I error probability of less than 1 in a billion (six sigma). This is by no means an unrealistic target. To achieve this criterion with an effect size of d = .4, a sample size of N = 1,000 is needed. The combined evidence of 5 labs with N = 200 per lab would be sufficient to produce conclusive evidence for an effect, but only if there is no selection bias. Thus, the best way to increase the credibility of psychological science is to conduct studies with high power and to minimize selection bias.
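
For the multi-lab scenario, Stouffer's method is one standard way to combine the evidence; the z-scores below are purely hypothetical.

```r
# Combine z-scores from independent labs (Stouffer's method):
stouffer <- function(z) sum(z) / sqrt(length(z))

z_labs <- c(2.1, 1.8, 2.5, 1.4, 2.2)  # hypothetical results from 5 labs
z_pool <- stouffer(z_labs)            # ~4.47
pnorm(z_pool, lower.tail = FALSE)     # combined one-sided p, ~4e-06
```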

This is what I believe Cohen would have said, but even if I am wrong about this, I think it follows from his futile efforts to teach psychologists about type-II errors and statistical power.