All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish between studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Replicability Rankings 2010-2020

Welcome to the replicability rankings for 120 psychology journals. More information about the statistical method that is used to create the replicability rankings can be found elsewhere (Z-Curve; Video Tutorial; Talk; Examples). The rankings are based on automated extraction of test statistics from all articles published in these 120 journals from 2010 to 2020 (data). The results can be reproduced with the R-package zcurve.
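For readers who want to try this themselves, here is a minimal sketch of how such an analysis can be run with the zcurve R-package. The z-scores are simulated placeholders standing in for the values extracted from a journal's articles, and the call assumes the package's default interface (zcurve() on a vector of z-scores, with summary() and plot() methods).

```r
# Minimal sketch: fit a z-curve model to a vector of absolute z-scores.
# The z-scores below are simulated placeholders, not real journal data.
# install.packages("zcurve")
library(zcurve)

set.seed(123)
z <- abs(rnorm(500, mean = 2.5, sd = 1.5))  # placeholder z-scores

fit <- zcurve(z)   # the model is fitted to the significant z-scores (z > 1.96)
summary(fit)       # reports ERR and EDR with bootstrapped confidence intervals
plot(fit)          # draws a z-curve plot like the ones discussed below
```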

To give a brief explanation of the method, I use a journal near the top of the rankings and a journal near the bottom as examples. Figure 1 shows the z-curve plot for the 2nd highest ranking journal for the year 2020 (the Journal of Organizational Psychology is ranked #1, but it has very few test statistics). Plots for all journals, which include additional information about the test statistics, are available by clicking on the journal name. Plots for previous years can be found on the site for the 2010-2019 rankings (previous rankings).

To create the z-curve plot in Figure 1, the 361 test statistics were first transformed into exact p-values, which were then transformed into absolute z-scores. Thus, each value represents a deviation from zero on the standard normal distribution. A value of 1.96 (solid red line) corresponds to the standard criterion for significance, p = .05 (two-tailed). The dashed line represents the threshold for marginal significance, p = .10 (two-tailed). A z-curve analysis fits a finite mixture model to the distribution of the significant z-scores (the blue density distribution on the right side of the solid red line). The distribution provides information about the average power of studies that produced a significant result. As power determines the success rate in future studies, power after selection for significance is used to estimate replicability. For the present data, the z-curve estimate of the replication rate is 84%. The bootstrapped 95% confidence interval around this estimate ranges from 75% to 92%. Thus, we would expect the majority of these significant results to replicate.
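The two-step transformation only requires the distribution functions of the test statistics. Here is a small sketch in base R with made-up test statistics (the t- and F-values are examples, not values from the journal):

```r
# Sketch of the transformation: test statistic -> exact p-value -> absolute z-score.
t_value <- 2.50; df <- 38
p_t <- 2 * pt(abs(t_value), df, lower.tail = FALSE)  # exact two-tailed p-value
z_t <- qnorm(p_t / 2, lower.tail = FALSE)            # absolute z-score

F_value <- 6.25; df1 <- 1; df2 <- 120
p_F <- pf(F_value, df1, df2, lower.tail = FALSE)     # exact p-value for an F-test
z_F <- qnorm(p_F / 2, lower.tail = FALSE)

round(c(p_t = p_t, z_t = z_t, p_F = p_F, z_F = z_F), 3)  # z > 1.96 corresponds to p < .05
```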

However, the graph also shows some evidence that questionable research practices produce too many significant results. The observed discovery rate (i.e., the percentage of p-values below .05) is 82%. This is outside the 95%CI of the estimated discovery rate, which is represented by the grey line in the range of non-significant results; EDR = 31%, 95%CI = 18% to 81%. Thus, fewer non-significant results are reported than z-curve predicts. This finding casts doubt on the replicability of the just significant p-values. The replicability rankings ignore this problem, which means that the predicted success rates are overly optimistic. A more pessimistic predictor of the actual success rate is the EDR. However, the ERR still provides useful information to compare the power of studies across journals and over time.

Figure 2 shows a journal with a low ERR in 2020.

The estimated replication rate is 64%, with a 95%CI ranging from 55% to 73%. The 95%CI does not overlap with the 95%CI for the Journal of Sex Research, indicating a significant difference in replicability. Visual inspection also shows clear evidence for the use of questionable research practices, with many more just significant results than non-significant results. The observed discovery rate of 75% is inflated and outside the 95%CI of the EDR, which ranges from 10% to 56%.

To examine time trends, I regressed each journal's yearly ERR estimates on the year and computed the predicted values and their 95%CI. Figure 3 shows the results for the journal Social Psychological and Personality Science as an example (x = 0 is 2010, x = 1 is 2020). The upper bound of the 95%CI for 2010, 62%, is lower than the lower bound of the 95%CI for 2020, 74%.

This shows a significant difference with alpha = .01. I use alpha = .01 so that only 1.2 out of the 120 journals are expected to show a significant change in either direction by chance alone. There are 22 journals with a significant increase in the ERR and no journals with a significant decrease. This shows that about 20% of these journals have responded to the crisis of confidence by publishing studies with higher power that are more likely to replicate.
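The time-trend analysis can be sketched with base R. The yearly ERR values below are placeholders for one journal's estimates, not the values behind the table; the regression and the confidence intervals for the predicted 2010 and 2020 values follow the approach described above.

```r
# Sketch of the time-trend analysis for one journal (placeholder ERR values).
err  <- c(54, 56, 59, 60, 63, 65, 69, 72, 74, 78, 81)   # placeholder ERRs, 2010-2020
year <- (2010:2020 - 2010) / 10                          # x = 0 is 2010, x = 1 is 2020

fit <- lm(err ~ year)
predict(fit, newdata = data.frame(year = c(0, 1)),
        interval = "confidence", level = 0.95)           # predicted 2010 and 2020 values with 95%CI
```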

Rank | Journal | Observed 2020 | Predicted 2020 | Predicted 2010
1 | Journal of Organizational Psychology | 88 [69 ; 99] | 84 [75 ; 93] | 73 [64 ; 81]
2 | Journal of Sex Research | 84 [75 ; 92] | 84 [74 ; 93] | 75 [65 ; 84]
3 | Evolution & Human Behavior | 84 [74 ; 93] | 83 [77 ; 90] | 62 [56 ; 68]
4 | Judgment and Decision Making | 81 [74 ; 88] | 83 [77 ; 89] | 68 [62 ; 75]
5 | Personality and Individual Differences | 81 [76 ; 86] | 81 [78 ; 83] | 68 [65 ; 71]
6 | Addictive Behaviors | 82 [75 ; 89] | 81 [77 ; 86] | 71 [67 ; 75]
7 | Depression & Anxiety | 84 [76 ; 91] | 81 [77 ; 85] | 67 [63 ; 71]
8 | Cognitive Psychology | 83 [75 ; 90] | 81 [76 ; 87] | 71 [65 ; 76]
9 | Social Psychological and Personality Science | 85 [78 ; 92] | 81 [74 ; 89] | 54 [46 ; 62]
10 | Journal of Experimental Psychology – General | 80 [75 ; 85] | 80 [79 ; 81] | 67 [66 ; 69]
11 | J. of Exp. Psychology – Learning, Memory & Cognition | 81 [75 ; 87] | 80 [77 ; 84] | 73 [70 ; 77]
12 | Journal of Memory and Language | 79 [73 ; 86] | 80 [76 ; 83] | 73 [69 ; 77]
13 | Cognitive Development | 81 [75 ; 88] | 80 [75 ; 85] | 67 [62 ; 72]
14 | Sex Roles | 81 [74 ; 88] | 80 [75 ; 85] | 72 [67 ; 77]
15 | Developmental Psychology | 74 [67 ; 81] | 80 [75 ; 84] | 67 [63 ; 72]
16 | Canadian Journal of Experimental Psychology | 77 [65 ; 90] | 80 [73 ; 86] | 74 [68 ; 81]
17 | Journal of Nonverbal Behavior | 73 [59 ; 84] | 80 [68 ; 91] | 65 [53 ; 77]
18 | Memory and Cognition | 81 [73 ; 87] | 79 [77 ; 81] | 75 [73 ; 77]
19 | Cognition | 79 [74 ; 84] | 79 [76 ; 82] | 70 [68 ; 73]
20 | Psychology and Aging | 81 [74 ; 87] | 79 [75 ; 84] | 74 [69 ; 79]
21 | Journal of Cross-Cultural Psychology | 83 [76 ; 91] | 79 [75 ; 83] | 75 [71 ; 79]
22 | Psychonomic Bulletin and Review | 79 [72 ; 86] | 79 [75 ; 83] | 71 [67 ; 75]
23 | Journal of Experimental Social Psychology | 78 [73 ; 84] | 79 [75 ; 82] | 52 [48 ; 55]
24 | JPSP-Attitudes & Social Cognition | 82 [75 ; 88] | 79 [69 ; 89] | 55 [45 ; 65]
25 | European Journal of Developmental Psychology | 75 [64 ; 86] | 79 [68 ; 91] | 74 [62 ; 85]
26 | Journal of Business and Psychology | 82 [71 ; 91] | 79 [68 ; 90] | 74 [63 ; 85]
27 | Psychology of Religion and Spirituality | 79 [71 ; 88] | 79 [66 ; 92] | 72 [59 ; 85]
28 | J. of Exp. Psychology – Human Perception and Performance | 79 [73 ; 84] | 78 [77 ; 80] | 75 [73 ; 77]
29 | Attention, Perception and Psychophysics | 77 [72 ; 82] | 78 [75 ; 82] | 73 [70 ; 76]
30 | Psychophysiology | 79 [74 ; 84] | 78 [75 ; 82] | 66 [62 ; 70]
31 | Psychological Science | 77 [72 ; 84] | 78 [75 ; 82] | 57 [54 ; 61]
32 | Quarterly Journal of Experimental Psychology | 81 [75 ; 86] | 78 [75 ; 81] | 72 [69 ; 74]
33 | Journal of Child and Family Studies | 80 [73 ; 87] | 78 [74 ; 82] | 67 [63 ; 70]
34 | JPSP-Interpersonal Relationships and Group Processes | 81 [74 ; 88] | 78 [73 ; 82] | 53 [49 ; 58]
35 | Journal of Behavioral Decision Making | 77 [70 ; 86] | 78 [72 ; 84] | 66 [60 ; 72]
36 | Appetite | 78 [73 ; 84] | 78 [72 ; 83] | 72 [67 ; 78]
37 | Journal of Comparative Psychology | 79 [65 ; 91] | 78 [71 ; 85] | 68 [61 ; 75]
38 | Journal of Religion and Health | 77 [57 ; 94] | 78 [70 ; 87] | 75 [67 ; 84]
39 | Aggressive Behaviours | 82 [74 ; 90] | 78 [70 ; 86] | 70 [62 ; 78]
40 | Journal of Health Psychology | 74 [64 ; 82] | 78 [70 ; 86] | 72 [64 ; 80]
41 | Journal of Social Psychology | 78 [70 ; 87] | 78 [70 ; 86] | 69 [60 ; 77]
42 | Law and Human Behavior | 81 [71 ; 90] | 78 [69 ; 87] | 70 [61 ; 78]
43 | Psychological Medicine | 76 [68 ; 85] | 78 [66 ; 89] | 74 [63 ; 86]
44 | Political Psychology | 73 [59 ; 85] | 78 [65 ; 92] | 59 [46 ; 73]
45 | Acta Psychologica | 81 [75 ; 88] | 77 [74 ; 81] | 73 [70 ; 76]
46 | Experimental Psychology | 73 [62 ; 83] | 77 [73 ; 82] | 73 [68 ; 77]
47 | Archives of Sexual Behavior | 77 [69 ; 83] | 77 [73 ; 81] | 78 [74 ; 82]
48 | British Journal of Psychology | 73 [65 ; 81] | 77 [72 ; 82] | 74 [68 ; 79]
49 | Journal of Cognitive Psychology | 77 [69 ; 84] | 77 [72 ; 82] | 74 [69 ; 78]
50 | Journal of Experimental Psychology – Applied | 82 [75 ; 88] | 77 [72 ; 82] | 70 [65 ; 76]
51 | Asian Journal of Social Psychology | 79 [66 ; 89] | 77 [70 ; 84] | 70 [63 ; 77]
52 | Journal of Youth and Adolescence | 80 [71 ; 89] | 77 [70 ; 84] | 72 [66 ; 79]
53 | Memory | 77 [71 ; 84] | 77 [70 ; 83] | 71 [65 ; 77]
54 | European Journal of Social Psychology | 82 [75 ; 89] | 77 [69 ; 84] | 61 [53 ; 69]
55 | Social Psychology | 81 [73 ; 90] | 77 [67 ; 86] | 73 [63 ; 82]
56 | Perception | 82 [74 ; 88] | 76 [72 ; 81] | 78 [74 ; 83]
57 | Journal of Anxiety Disorders | 80 [71 ; 89] | 76 [72 ; 80] | 71 [67 ; 75]
58 | Personal Relationships | 65 [54 ; 76] | 76 [68 ; 84] | 62 [54 ; 70]
59 | Evolutionary Psychology | 63 [51 ; 75] | 76 [67 ; 85] | 77 [68 ; 86]
60 | Journal of Research in Personality | 63 [46 ; 77] | 76 [67 ; 84] | 70 [61 ; 79]
61 | Cognitive Behaviour Therapy | 88 [73 ; 99] | 76 [66 ; 86] | 68 [58 ; 79]
62 | Emotion | 79 [73 ; 85] | 75 [72 ; 79] | 67 [64 ; 71]
63 | Animal Behavior | 79 [72 ; 87] | 75 [71 ; 80] | 68 [64 ; 73]
64 | Group Processes & Intergroup Relations | 80 [73 ; 87] | 75 [71 ; 80] | 60 [56 ; 65]
65 | JPSP-Personality Processes and Individual Differences | 78 [70 ; 86] | 75 [70 ; 79] | 64 [59 ; 69]
66 | Psychology of Men and Masculinity | 88 [77 ; 96] | 75 [64 ; 87] | 78 [67 ; 89]
67 | Consciousness and Cognition | 74 [67 ; 80] | 74 [69 ; 80] | 67 [62 ; 73]
68 | Personality and Social Psychology Bulletin | 78 [72 ; 84] | 74 [69 ; 79] | 57 [52 ; 62]
69 | Journal of Cognition and Development | 70 [60 ; 80] | 74 [67 ; 81] | 65 [59 ; 72]
70 | Journal of Applied Psychology | 69 [59 ; 78] | 74 [67 ; 80] | 73 [66 ; 79]
71 | European Journal of Personality | 80 [67 ; 92] | 74 [65 ; 83] | 70 [61 ; 79]
72 | Journal of Positive Psychology | 75 [65 ; 86] | 74 [65 ; 83] | 66 [57 ; 75]
73 | Journal of Research on Adolescence | 83 [74 ; 92] | 74 [62 ; 87] | 67 [55 ; 79]
74 | Psychopharmacology | 75 [69 ; 80] | 73 [71 ; 75] | 67 [65 ; 69]
75 | Frontiers in Psychology | 75 [70 ; 79] | 73 [70 ; 76] | 72 [69 ; 75]
76 | Cognitive Therapy and Research | 73 [66 ; 81] | 73 [68 ; 79] | 67 [62 ; 73]
77 | Behaviour Research and Therapy | 70 [63 ; 77] | 73 [67 ; 79] | 70 [64 ; 76]
78 | Journal of Educational Psychology | 82 [73 ; 89] | 73 [67 ; 79] | 76 [70 ; 82]
79 | British Journal of Social Psychology | 74 [65 ; 83] | 73 [66 ; 81] | 61 [54 ; 69]
80 | Organizational Behavior and Human Decision Processes | 70 [65 ; 77] | 72 [69 ; 75] | 67 [63 ; 70]
81 | Cognition and Emotion | 75 [68 ; 81] | 72 [68 ; 76] | 72 [68 ; 76]
82 | Journal of Affective Disorders | 75 [69 ; 83] | 72 [68 ; 76] | 74 [71 ; 78]
83 | Behavioural Brain Research | 76 [71 ; 80] | 72 [67 ; 76] | 70 [66 ; 74]
84 | Child Development | 81 [75 ; 88] | 72 [66 ; 78] | 68 [62 ; 74]
85 | Journal of Abnormal Psychology | 71 [60 ; 82] | 72 [66 ; 77] | 65 [60 ; 71]
86 | Journal of Vocational Behavior | 70 [59 ; 82] | 72 [65 ; 79] | 84 [77 ; 91]
87 | Journal of Experimental Child Psychology | 72 [66 ; 78] | 71 [69 ; 74] | 72 [69 ; 75]
88 | Journal of Consulting and Clinical Psychology | 81 [73 ; 88] | 71 [64 ; 78] | 62 [55 ; 69]
89 | Psychology of Music | 78 [67 ; 86] | 71 [64 ; 78] | 79 [72 ; 86]
90 | Behavior Therapy | 78 [69 ; 86] | 71 [63 ; 78] | 70 [63 ; 78]
91 | Journal of Occupational and Organizational Psychology | 66 [51 ; 79] | 71 [62 ; 80] | 87 [79 ; 96]
92 | Journal of Happiness Studies | 75 [65 ; 83] | 71 [61 ; 81] | 79 [70 ; 89]
93 | Journal of Occupational Health Psychology | 77 [65 ; 90] | 71 [58 ; 83] | 65 [52 ; 77]
94 | Journal of Individual Differences | 77 [62 ; 92] | 71 [51 ; 90] | 74 [55 ; 94]
95 | Frontiers in Behavioral Neuroscience | 70 [63 ; 76] | 70 [66 ; 75] | 66 [62 ; 71]
96 | Journal of Applied Social Psychology | 76 [67 ; 84] | 70 [63 ; 76] | 70 [64 ; 77]
97 | British Journal of Developmental Psychology | 72 [62 ; 81] | 70 [62 ; 79] | 76 [67 ; 85]
98 | Journal of Social and Personal Relationships | 73 [63 ; 81] | 70 [60 ; 79] | 69 [60 ; 79]
99 | Behavioral Neuroscience | 65 [57 ; 73] | 69 [64 ; 75] | 69 [63 ; 75]
100 | Psychology and Marketing | 71 [64 ; 77] | 69 [64 ; 74] | 67 [63 ; 72]
101 | Journal of Family Psychology | 71 [59 ; 81] | 69 [63 ; 75] | 62 [56 ; 68]
102 | Journal of Personality | 71 [57 ; 85] | 69 [62 ; 77] | 64 [57 ; 72]
103 | Journal of Consumer Behaviour | 70 [60 ; 81] | 69 [59 ; 79] | 73 [63 ; 83]
104 | Motivation and Emotion | 78 [70 ; 86] | 69 [59 ; 78] | 66 [57 ; 76]
105 | Developmental Science | 67 [60 ; 74] | 68 [65 ; 71] | 65 [63 ; 68]
106 | International Journal of Psychophysiology | 67 [61 ; 73] | 68 [64 ; 73] | 64 [60 ; 69]
107 | Self and Identity | 80 [72 ; 87] | 68 [60 ; 76] | 70 [62 ; 78]
108 | Journal of Counseling Psychology | 57 [41 ; 71] | 68 [55 ; 81] | 79 [66 ; 92]
109 | Health Psychology | 63 [50 ; 73] | 67 [62 ; 72] | 67 [61 ; 72]
110 | Hormones and Behavior | 67 [58 ; 73] | 66 [63 ; 70] | 66 [62 ; 70]
111 | Frontiers in Human Neuroscience | 68 [62 ; 75] | 66 [62 ; 70] | 76 [72 ; 80]
112 | Annals of Behavioral Medicine | 63 [53 ; 75] | 66 [60 ; 71] | 71 [65 ; 76]
113 | Journal of Child Psychology and Psychiatry and Allied Disciplines | 58 [45 ; 69] | 66 [55 ; 76] | 63 [53 ; 73]
114 | Infancy | 77 [69 ; 85] | 65 [56 ; 73] | 58 [50 ; 67]
115 | Biological Psychology | 64 [58 ; 70] | 64 [61 ; 67] | 66 [63 ; 69]
116 | Social Development | 63 [54 ; 73] | 64 [56 ; 72] | 74 [66 ; 82]
117 | Developmental Psychobiology | 62 [53 ; 70] | 63 [58 ; 68] | 67 [62 ; 72]
118 | Journal of Consumer Research | 59 [53 ; 67] | 63 [55 ; 71] | 58 [50 ; 66]
119 | Psychoneuroendocrinology | 63 [53 ; 72] | 62 [58 ; 66] | 61 [57 ; 65]
120 | Journal of Consumer Psychology | 64 [55 ; 73] | 62 [57 ; 67] | 60 [55 ; 65]

If Consumer Psychology Wants to be a Science It Has to Behave Like a Science

Consumer psychology is an applied branch of social psychology that uses insights from social psychology to understand consumers' behaviors. Although there is cross-fertilization and authors may publish in both more basic and more applied journals, consumer psychology is its own field within psychology with its own journals. As a result, it has escaped the close attention that has been paid to the replicability of studies published in mainstream social psychology journals (see Schimmack, 2020, for a review). However, given the similarity in theories and research practices, it is fair to ask why consumer research should be more replicable and credible than basic social psychology. This question was indirectly addressed in a dialogue about the merits of pre-registration that was published in the Journal of Consumer Psychology (Krishna, 2021).

Open science proponents advocate pre-registration to increase the credibility of published results. The main concern is that researchers can use questionable research practices to produce significant results (John et al., 2012). Preregistration of analysis plans would reduce the chances of using QRPs and increase the chances of a non-significant result. This would make the reporting of significant results more valuable because significance was produced by the data and not by the creativity of the data analyst.

In my opinion, the focus on pre-registration in the dialogue is misguided. As Pham and Oh (2021) point out, pre-registration would not be necessary if there were no problem that needs to be fixed. Thus, a proper assessment of the replicability and credibility of consumer research should inform discussions about preregistration.

The problem is that the past decade has seen more articles talking about replications than actual replication studies, especially outside of social psychology. Thus, most of the discussion about actual and ideal research practices occurs without facts about the status quo. How often do consumer psychologists use questionable research practices? How many published results are likely to replicate? What is the typical statistical power of studies in consumer psychology? What is the false positive risk?

Rather than writing another meta-psychological article that is based on paranoid or wishful thinking, I would like to add to the discussion by providing some facts about the health of consumer psychology.

Do Consumer Psychologists Use Questionable Research Practices?

John et al. (2012) conducted a survey study to examine the use of questionable research practices. They found that respondents admitted to using these practices and that they did not consider these practices to be wrong. In 2021, however, nobody is defending the use of questionable practices that can inflate the risk of false positive results and hide replication failures. Consumer psychologists could have conducted an internal survey to find out how prevalent these practices are among consumer psychologists. However, Pham and Oh (2021) do not present any evidence about the use of QRPs by consumer psychologists. Instead, they cite a survey among German social psychologists to suggest that QRPs may not be a big problem in consumer psychology. Below, I will show that QRPs are a big problem in consumer psychology and that consumer psychologists have done nothing over the past decade to curb the use of these practices.

Are Studies in Consumer Psychology Adequately Powered?

Concerns about low statistical power go back to the 1960s (Cohen, 1962; Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989; Smaldino & McElreath, 2016). Tversky and Kahneman (1971) refused "to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis" (p. 110). Yet, results from the reproducibility project suggest that social psychologists conduct studies with less than 50% power all the time (Open Science Collaboration, 2015). It is not clear why we should expect higher power from consumer research. More concerning is that Pham and Oh (2021) do not even mention low power as a potential problem for consumer psychology. One advantage of pre-registration is that researchers are forced to think ahead of time about the sample size that is required to have a good chance of showing the desired outcome, assuming the theory is right. More than 20 years ago, the APA taskforce on statistical inference recommended a priori power analysis, but researchers continued to conduct underpowered studies. Pre-registration, however, would not be necessary if consumer psychologists already conducted studies with adequate power. Here I show that power in consumer psychology is unacceptably low and has not increased over the past decade.

False Positive Risk

Pham and Oh note that Simmons, Nelson, and Simonsohn's (2011) influential article relied exclusively on simulations and speculations and suggest that the fear of massive p-hacking may be unfounded: "Whereas Simmons et al. (2011) highly influential computer simulations point to massive distortions of test statistics when QRPs are used, recent empirical estimates of the actual impact of self-serving analyses suggest more modest degrees of distortion of reported test statistics in recent consumer studies (see Krefeld-Schwalb & Scheibehenne, 2020)." Here I present the results of empirical analyses that estimate the false discovery risk in consumer psychology.

Data

The data are part of a larger project that examines research practices in psychology over the past decade. For this purpose, my research team and I downloaded all articles from 2010 to 2020 published in 120 psychology journals that cover a broad range of disciplines. Four journals represent research in consumer psychology, namely the Journal of Consumer Behavior, the Journal of Consumer Psychology, the Journal of Consumer Research, and Psychology and Marketing. The articles were converted into text files, and the text files were searched for test statistics. All F, t, and z-tests were used, but most test statistics were F and t tests. There were 2,304 tests for the Journal of Consumer Behavior, 8,940 for the Journal of Consumer Psychology, 10,521 for the Journal of Consumer Research, and 5,913 for Psychology and Marketing.
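The extraction step can be illustrated with a simple text search. The regular expressions below are only meant to convey the idea; they are not the project's actual extraction code.

```r
# Illustrative sketch of the extraction step (not the project's actual code):
# search an article's text for reported t-tests and F-tests.
txt <- "The effect was significant, t(38) = 2.50, p = .017, and F(1, 120) = 6.25, p = .014."

t_tests <- regmatches(txt, gregexpr("t\\(\\s*\\d+\\s*\\)\\s*=\\s*-?\\d+\\.?\\d*", txt))[[1]]
F_tests <- regmatches(txt, gregexpr("F\\(\\s*\\d+\\s*,\\s*\\d+\\s*\\)\\s*=\\s*\\d+\\.?\\d*", txt))[[1]]

t_tests   # "t(38) = 2.50"
F_tests   # "F(1, 120) = 6.25"
```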

Results

I first conducted z-curve analyses for each journal and year separately. The 40 results were analyzed with year as a continuous and journal as a categorical predictor variable. No time trends were significant, but the main effect of journal on the expected replication rate was significant, F(3,36) = 9.63, p < .001. Inspection of the means showed higher values for the Journal of Consumer Psychology and Psychology & Marketing than for the other two journals. No other effects were significant. Therefore, I combined the data of the Journal of Consumer Psychology and Psychology and Marketing, and the data of the Journal of Consumer Behavior and the Journal of Consumer Research.
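This journal-and-year analysis corresponds to a simple linear model. The sketch below uses a placeholder data frame in place of the actual z-curve estimates; the journal abbreviations and ERR values are made up.

```r
# Sketch of the model described above: ERR estimates regressed on year (continuous)
# and journal (categorical). `dat` is a placeholder data frame, not the real estimates.
set.seed(42)
dat <- expand.grid(year = 2010:2019,
                   journal = c("JCB", "JCP", "JCR", "PM"))
dat$err <- rnorm(nrow(dat), mean = 65, sd = 5)   # placeholder ERR values

fit <- lm(err ~ year + journal, data = dat)
anova(fit)   # F-tests for the year trend and the journal main effect
```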

Figure 1 shows the z-curve analysis for the first set of journals. The observed discovery rate (ODR) is simply the percentage of results that are significant. Out of the 14,853 tests, 10,636 were significant, which yields an ODR of 72%. To examine the influence of questionable research practices, the ODR can be compared to the estimated discovery rate (EDR). The EDR is an estimate that is based on a finite mixture model fitted to the distribution of the significant test statistics. Figure 1 shows that the fitted grey curve closely matches the observed distribution of test statistics, which are all converted into z-scores. Figure 1 also shows the projected distribution that is expected for non-significant results. Contrary to the predicted distribution, observed non-significant results drop off sharply at the level of significance (z = 1.96). This pattern provides visual evidence that non-significant results do not follow a sampling distribution. The EDR is the area under the curve for the significant values relative to the total distribution. The EDR is only 34%. The 95%CI of the EDR can be used to test statistical significance. The ODR of 72% is well outside the 95% confidence interval of the EDR, which ranges from 17% to 34%. Thus, there is strong evidence that consumer researchers use QRPs and publish too many significant results.

The EDR can also be used to assess the risk of publishing false positive results, that is, significant results without a true population effect. Using a formula from Soric (1989), the EDR can be converted into the maximum percentage of false positive results. As the EDR decreases, the false discovery risk increases. With an EDR of 34%, the FDR is 10%, with a 95% confidence interval ranging from 7% to 26%. Thus, the present results do not suggest that most results in consumer psychology journals are false positives, as some meta-scientists have suggested (Ioannidis, 2005; Simmons et al., 2011).
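Soric's bound is easy to compute. The short sketch below reproduces the 10% figure implied by the EDR of 34% with alpha = .05, and the value implied by the lower bound of the EDR's confidence interval.

```r
# Soric's (1989) upper bound on the false discovery risk, given the estimated
# discovery rate (EDR) and the significance criterion alpha.
soric_fdr <- function(edr, alpha = .05) {
  (1 / edr - 1) * (alpha / (1 - alpha))
}

round(soric_fdr(0.34), 2)   # ~0.10, the FDR implied by an EDR of 34%
round(soric_fdr(0.17), 2)   # ~0.26, the FDR implied by the lower bound of the EDR's 95%CI
```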

It is more difficult to assess the replicability of results published in these two journals. On the one hand, z-curve provides an estimate of the expected replication rate, that is, the probability that a significant result produces a significant result again in an exact replication study (Brunner & Schimmack, 2020). The ERR is higher than the EDR because studies that produced a significant result have higher power than studies that did not produce a significant result. The ERR of 63% suggests that more than 50% of significant results can be successfully replicated. However, a comparison of the ERR with the success rate in actual replication studies showed that the ERR overestimates actual replication rates (Brunner & Schimmack, 2020). There are a number of reasons for this discrepancy. One reason is that replication studies in psychology are never exact replications and that regression to the mean lowers the chances of reproducing the same effect size in a replication study. In social psychology, the EDR is actually a better predictor of the actual success rate. Thus, the present results suggest that actual replication studies in consumer psychology are likely to produce as many replication failures as studies in social psychology have (Schimmack, 2020).

Figure 2 shows the results for the Journal of Consumer Behavior and the Journal of Consumer Research.

The results are even worse. The ODR of 73% is above the EDR of 26% and well outside the 95%CI of the EDR. This EDR implies a false discovery risk of 15%.

Conclusion

The present results show that consumer psychology is plagued by the same problems that have produced replication failures in social psychology. Given the similarities between consumer psychology and social psychology, it is not surprising that the two disciplines are alike. Researchers conduct underpowered studies and use QRPs to report inflated success rates. These illusory results cannot be replicated, and it is unclear which statistically significant results reveal effects that have practical significance and which ones are mere false positives. To make matters worse, social psychologists have responded to awareness of these problems by increasing the power of their studies and by implementing changes in their research practices. In contrast, z-curve analyses of consumer psychology show no improvement in research practices over the past decade. In light of this disappointing trend, it is disconcerting to read an article that suggests improvements in consumer psychology are not needed and that everything is well (Pham and Oh, 2021). I demonstrated with hard data and objective analysis that this assessment is false. It is time for consumer psychologists to face reality and to follow in the footsteps of social psychologists to increase the credibility of their science. While preregistration may be optional, increasing power is not.

Guest Post by Peter Holtz: From Experimenter Bias Effects To the Open Science Movement

This post was first shared as a post in the Facebook Psychological Methods Discussion Group. (Group, Post). I thought it was interesting and deserved a wider audience.

Peter Holtz

I know that this is too long for this group, but I don’t have a blog …

A historical anecdote:

In 1963, Rosenthal and Fode published a famous paper on the Experimenter Bias Effect (EBE): There were of course several different experiments and conditions etc., but for example, research assistants were given a set of 20 photos of people that were to be rated by participants on a scale from -10 ([will experience …] “extreme failure”) to + 10 (…“extreme success”).

The research assistants (e.g., participants in a class on experimental psychology) were told to replicate a “well-established” psychological finding just like “students in physics labs are expected to do” (p. 494). On average, the sets of photos had been rated in a large pre-study as neutral (M=0), but some research assistants were told that the expected mean of their photos was -5, whereas others were told that it was +5. When the research assistants, who were not allowed to communicate with each other during the experiments, handed in the results of their studies, their findings were biased in the direction of the effect that they had expected. Funnily enough, similar biases could be found for experiments with rats in Skinner boxes as well (Rosenthal & Fode, 1963b).

The findings on the EBE were met with skepticism from other psychologists since they cast doubt on experimental psychology's self-concept as a true and unbiased natural science. And what do researchers do since the days of Socrates if they doubt the findings of a colleague? Sure, they attempt to replicate them. Whereas Rosenthal and colleagues (by and large) produced several successful "conceptual replications" in slightly different contexts (for a summary see e.g. Rosenthal, 1966), others (most notably T. X. Barber) couldn't replicate Rosenthal and Fode's original study (e.g., Barber et al., 1969; Barber & Silver, 1968, but also Jacob, 1968; Wessler & Strauss, 1968).

Rosenthal, a versed statistician, responded (e.g., Rosenthal, 1969) that the difference between significant and non-significant may not itself be significant and used several techniques that about ten years later came to be known as "meta-analysis" to argue that although Barber's and others' replications, which of course used other groups of participants and materials etc., most often did not yield significant results, a summary of results suggests that there may still be an EBE (1968; albeit probably smaller than in Rosenthal and Fode's initial studies – let me think… how can we explain that…).

Of course, Barber and friends responded to Rosenthal's responses (e.g., Barber, 1969, titled "invalid arguments, post-mortem analyses, and the experimenter bias effect") and vice versa, and a serious discussion of psychology's methodology emerged. Other notables weighed in as well, and statisticians such as Rozeboom (1960) and Bakan (1966), who had by then already done their best to explain to their colleagues the problems of the p-ritual that psychologists use(d) as a verification procedure, were frequently quoted. (On a side note: To me, Bakan's 1966 paper is better than much of the recent work on the problems with the p-ritual; in particular the paragraph on the problematic assumption of an "automaticity of inference" on p. 430 is still worth reading).

Lykken (1968) and Meehl (1967) soon joined the melee and attacked the p-ritual also from an epistemological perspective. In 1969, Levy wrote an interesting piece about the value of replications in which he argued that replicating the EBE-studies doesn’t make much sense as long as there are no attempts to embed the EBE into a wider explanatory theory that allows for deducing other falsifiable hypotheses as well. Levy knew very well already by 1969 that the question whether some effect “exists” or “does not exist” is only in very rare cases relevant (exactly then when there are strong reasons to assume that an effect does not exist – as is the case, for example, with para-psychological phenomena).

Eventually Rosenthal himself (e.g., 1968a) came to think critically of the “reassuring nature of the null hypothesis decision procedure”. What happened then? At some point Rosenthal moved away from experimenter expectancy effects in the lab to Pygmalion effects in the classroom (1968b) – an idea that is much less likely to provoke criticism and replication attempts: Who doesn’t believe that teachers’ stereotypes influence the way they treat children and consequently the children’s chances to succeed in school? The controversy fizzled out and if you take up a social psychology textbook, you may find the comforting story in it that this crisis was finally “overcome” (Stroebe, Hewstone, & Jonas, 2013, p. 18) by enlarging psychology’s methodological arsenal, for example, with meta-analytic practices and by becoming a stronger and better science with a more rigid methodology etc. Hooray!

So psychology was finally great again from the 1970s on … was it? What can we learn from this episode?

– It is not the case that psychologists didn't know the replication game, but they only played it whenever results went against their beliefs – and that was rarely the case (exceptions are, apart from Rosenthal's studies, of course Bem's "feeling the future" experiments).

– Science is self-correcting – but only whenever there are controversies (and not if subcommunities just happily produce evidence in favor of their pet theories).

– Everybody who wanted to know it could know by the 1960s that something is wrong with the p-ritual – but no one cared. This was the game that needed to be played to produce evidence in favor of theories and to get published and to make a career; consequently, people learned to play the verification game more and more effectively. (Bakan writes on p. 423: "What will be said in this paper is hardly original. It is, in a certain sense, what "everybody knows." To say it "out loud" is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear." – in 1966!)

– Just making it more difficult to verify a theory will not solve the problem imo; ambitious psychologists will again find ways to play the game – and to win.

– I see two risks with the changes that have been proposed by the "open science community" (in particular preregistration): First, I am afraid that since the verification game still dominates in psychology, researchers will simply shift towards "proving" more boring hypotheses; second, there is the risk that psychological theories will be shielded even more from criticism since only criticism based on "good science" (preregistered experiments with a priori power analysis and open data) will be valid, whereas criticism based on other types of research activities (e.g., simulations, case studies … or just rational thinking for a change) will be dismissed as "unscientific" => no criticism => no controversy => no improvement => no progress.

– And of course, pre-registration and open science etc. allow psychologists to still maintain the misguided, unfortunate, and highly destructive myth of the "automaticity of inferences"; no inductive mechanism whatsoever can ensure "true discovery".

– I think what is needed more is a discussion about the relationship between data and theory and about epistemological questions such as what a "growth of knowledge" in science could look like and how it can be facilitated (I call this a "falsificationist turn").

– Irrespective of what is going to happen, authors of textbooks will find ways to write up the history of psychology as a flawless cumulative success story …

A Z-Curve Analysis of a Self-Replication: Shah et al. (2012) Science

Since 2011, psychologists have been wondering which published results are credible and which results are not. One way to answer this question would be for researchers to self-replicate their most important findings. However, most psychologists have avoided conducting or publishing self-replications (Schimmack, 2020).

It is therefore always interesting when a self-replication is published. I just came across Shah, Mullainathan, and Shafir (2019). The authors conducted high-powered (much larger sample sizes) replications of five studies that were published in Shah, Mullainathan, and Shafir's (2012) Science article.

The article reported five studies with 1, 6, 2, 3, and 1 focal hypothesis tests. One additional test was significant, but the authors focussed on the small effect size and considered it not theoretically important. The replication studies successfully replicated 9 of the 13 significant results; a success rate of 69%. This is higher than the success rate in the famous reproducibility project of 100 studies in social and cognitive psychology; 37% (OSC, 2015).

One interesting question is whether this success rate was predictable based on the original findings. An even more interesting question is whether original results provide clues about the replicability of specific effects. For example, why were the results of Studies 1 and 5 harder to replicate than those of the other studies?

Z-curve relies on the strength of the evidence against the null-hypothesis in the original studies to predict replication outcomes (Brunner & Schimmack, 2020; Bartos & Schimmack, 2020). It also takes into account that original results may be selected for significance. For example, the original article reported 14 out of 14 significant results. It is unlikely that all statistical tests of critical hypotheses produce significant results (Schimmack, 2012). Thus, some questionable practices were probably used although the authors do not mention this in their self-replication article.

I converted the 13 test statistics into exact p-values and converted the exact p-values into z-scores. Figure 1 shows the z-curve plot and the results of the z-curve analysis. The first finding is that the observed success rate of 100% is much higher than the expected discovery rate of 15%. Given the small sample of tests, the 95%CI around the estimated discovery rate is wide, but it does not include 100%. This suggests that some questionable practices were used to produce a pretty picture of results. This practice is in line with widespread practices in psychology in 2012.

The next finding is that despite a low discovery rate, the estimated replication rate of 66% is in line with the observed replication rate. The reason for the difference is that the estimated discovery rate includes the large set of non-significant results that the model predicts. Selection for significance selects studies with higher power that have a higher chance to be significant (Brunner & Schimmack, 2020).

It is unlikely that the authors conducted many additional studies to get only significant results. It is more likely that they used a number of other QRPs. Whatever method they used, QRPs make just significant results questionable. One solution to this problem is to alter the significance criterion post-hoc. This can be done gradually. For example, a first adjustment might lower the significance criterion to alpha = .01.
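The effect of such a post-hoc adjustment on the observed discovery rate can be sketched in a few lines of R. The z-scores below are placeholders chosen only for illustration, not the values from the original article.

```r
# Sketch of a post-hoc adjustment of the significance criterion. The 13 z-scores
# are placeholders, not Shah et al.'s actual values.
z <- c(2.1, 2.2, 2.3, 2.4, 2.6, 2.8, 3.0, 3.3, 3.6, 4.1, 4.3, 4.6, 5.0)

mean(z > qnorm(1 - .05 / 2))   # share of results significant at alpha = .05
mean(z > qnorm(1 - .01 / 2))   # share of results significant at the stricter alpha = .01
```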

Figure 2 shows the adjusted results. The observed discovery rate decreased to 69%. In addition, the estimated discovery rate increased to 48% because the model no longer needs to predict the large number of just significant results. Thus, the expected and observed discovery rate are much more in line and suggest little need for additional QRPs. The estimated replication rate decreased because it uses the more stringent criterion of alpha = .01. Otherwise, it would be even more in line with the observed replication rate.

Thus, a simple explanation for the replication outcomes is that some results were obtained with QRPs that produced just significant results with p-values between .01 and .05. These results did not replicate, but the other results did replicate.

There was also a strong point-biserial correlation between the z-scores and the dichotomous replication outcome. When the original p-values were split into p-values above or below .01, they perfectly predicted the replication outcome; p-values greater than .01 did not replicate, those below .01 did replicate.

In conclusion, a single p-value from a single analysis provides little information about replicability, although replicability increases as p-values decrease. However, meta-analyses of p-values with models that take QRPs and selection for significance into account are a promising tool to predict replication outcomes and to distinguish between questionable and solid results in the psychological literature.

Meta-analyses that take QRPs into account can also help to avoid replication studies that merely confirm highly robust results. Four of the z-scores in Shah et al.'s (2019) project were above 4, which makes it very likely that the results replicate. Resources are better spent on findings that have high theoretical importance but weak evidence. Z-curve can help to identify these results because it corrects for the influence of QRPs.

Conflict of Interest statement: Z-curve is my baby.

How Credible is Clinical Psychology?

Don Lynam and the clinical group at Purdue University invited me to give a talk and they generously gave me permission to share it with you.

Talk (the first 4 min. were not recorded, it starts right away with my homage to Jacob Cohen).

The first part of the talk discusses the problems with Fisher’s approach to significance testing and the practice in psychology to publish only significant results. I then discuss Neyman-Pearson’s alternative approach, statistical power, and Cohen’s seminal meta-analysis of power in social/abnormal psychology. I then point out that questionable research practices must have been used to publish 95% significant results with only 50% power.

The second part of the talk discusses Soric's insight that we can estimate the false discovery risk based on the discovery rate. I discuss the Open Science Collaboration project as one way to estimate the discovery rate (pretty high for within-subject cognitive psychology, terribly low for between-subject social psychology), but point out that it doesn't tell us about clinical psychology. I then introduce z-curve to estimate the discovery rate based on the distribution of significant p-values (converted into z-scores).

In the empirical part, I present a z-curve analysis of Positive Psychology Interventions that shows massive use of QRPs and a high false discovery risk.

I end with a comparison of the z-curve for the Journal of Abnormal Psychology in 2010 and 2020 that shows no change in research practices over time.

The discussion focussed on changing the way we do research and what research we reward. I argue strongly against the implementation of alpha = .005 and for the adoption of Neyman-Pearson's approach with pre-registration, which would allow researchers to study small populations (e.g., mental health issues in the African American community) with a higher false-positive risk to balance type-I and type-II errors.

A tutorial about effect sizes, power, z-curve analysis, and personalized p-values.

I recorded a meeting with my research assistants who are coding articles to estimate the replicability of psychological research. It is unedited and raw, but you might find it interesting to listen to. Below I give a short description of the topics that were discussed starting from an explanation of effect sizes and ending with a discussion about the choice of a graduate supervisor.

Link to video

The meeting is based on two blog posts that introduce personalized p-values.
1. https://replicationindex.com/2021/01/15/men-are-created-equal-p-values-are-not/
2. https://replicationindex.com/2021/01/19/personalized-p-values/

1. Rant about Fisher's approach to statistics that ignores effect sizes.
– look for p < .05, and do a happy dance if you find it, now you can publish.
– still the way statistics is taught to undergraduate students.

2. Explaining statistics starting with effect sizes.
– unstandardized effect size (height difference between men and women in cm)
– unstandardized effect sizes depend on the unit of measurement
– to standardize effect sizes, we divide by the standard deviation (Cohen's d; see the sketch below)
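Here is a minimal sketch of that standardization step, with made-up height data:

```r
# Minimal sketch of standardization: raw mean difference (in cm) divided by the
# pooled standard deviation gives Cohen's d. The heights are made-up numbers.
set.seed(7)
men   <- rnorm(100, mean = 178, sd = 7)
women <- rnorm(100, mean = 165, sd = 7)

raw_diff  <- mean(men) - mean(women)             # unstandardized effect size (cm)
sd_pooled <- sqrt((var(men) + var(women)) / 2)   # pooled SD (equal group sizes)
d <- raw_diff / sd_pooled                        # standardized effect size (Cohen's d)
round(c(raw_diff = raw_diff, d = d), 2)
```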

3. Why do/did social psychologists run studies with n = 20 per condition?
– limited resources, small subject pool, statistics can be used with n = 20 ~ 30.
– obvious that these sample sizes are too small after Cohen (1962) introduced power analysis
– but some argued that low power is ok because it is more efficient to get significant results.

4. Simulation of social psychology: 50% of hypotheses are true, 50% are false, the effect size of true hypotheses is d = .4, and the sample size of studies is n = 20 per condition (a simulation sketch follows this list).
– Analyzing the simulated results (with k = 200 studies) with z-curve.2.0. In this simulation, the true discovery rate is 14%. That is 14% of the 200 studies produced a significant result.
– Z-curve correctly estimates this discovery rate based on the distribution of the significant p-values, converted into z-scores.
– If only significant results are published, the observed discovery rate is 100%, but the true discovery rate is only 14%.
– Publication bias leads to false confidence in published results.
– Selective publication is wasteful because we are discarding useful information.
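The simulation described in this section can be reproduced with a few lines of R; the sketch below uses two-sample t-tests with n = 20 per condition and recovers a discovery rate of roughly 14%.

```r
# Simulation sketch: 50% true hypotheses (d = .4) and 50% false hypotheses (d = 0),
# tested with two-sample t-tests and n = 20 per condition.
set.seed(1)
k <- 200; n <- 20
d <- rep(c(0.4, 0), each = k / 2)

p <- sapply(d, function(delta) {
  t.test(rnorm(n, mean = delta), rnorm(n, mean = 0), var.equal = TRUE)$p.value
})

mean(p < .05)                          # true discovery rate, roughly 14% in this scenario
z <- qnorm(p / 2, lower.tail = FALSE)  # z-scores that could be fed into a z-curve analysis
```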

5. Power analysis.
– Fisher did not have power analysis.
– Neyman and Pearson invented power analysis, but Fisher wrote the textbook for researchers.
– We had 100 years to introduce students to power analysis, but it hasn’t happened.
– Cohen wrote books about power analysis, but he was ignored.
– Cohen suggested we should aim for 80% power (more is not efficient).
– Think a priori about effect size to plan sample sizes (see the power analysis sketch after this list).
– Power analysis was ignored because it often implied very large samples.
(very hard to get participants in Germany with small subject pools).
– no change because all p-values were treated as equal. p < .05 = truth.
– Literature reviews and textbooks treat every published significant result as truth.
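A basic a priori power analysis is built into base R. The sketch below shows the sample size needed to detect d = .4 with 80% power, which is the scenario used in the next simulation.

```r
# A priori power analysis with base R: sample size per group for a two-sample
# t-test with d = .4, alpha = .05, and 80% power.
power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# n is roughly 99 per group, i.e., about 200 participants for a two-group study
```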

6. Repeating simulation (50% true hypotheses, effect size d = .4) with 80% power, N = 200.
– much higher discovery rate (58%)
– much more credible evidence
– z-curve makes it possible to distinguish between p-values from research with low or high discovery rate.
– Will this change the way psychologists look at p-values? Maybe, but Cohen and others have tried to change psychology without success. Will z-curve be a game-changer?

7. Personalized p-values
– P-values are being created by scientists.
– Scientists have some control over the type of p-values they publish.
– There are systemic pressures to publish more p-values based on low powered studies.
– But at some point, researchers get tenure.
– nobody can fire you if you stop publishing
– social media allow researchers to publish without censure from peers.
– tenure also means you have a responsibility to do good research.
– Researchers who are listed on the post with personalized p-values all have tenure.
– Some researchers, like David Matsumoto, have a good z-curve.
– Other researchers have way too many just significant results.
– The observed discovery rates between good and bad researchers are the same.
– Z-curve shows that the significant results were produced very differently and differ in credibility and replicability; this could be a game changer if people care about it.
– My own z-curve doesn’t look so good. 😦
– How can researchers improve their z-curve?
– publish better research now
– distance yourself from bad old research
– So far, few people have distanced themselves from bad old work because there was no incentive to do so.
– Now there is an incentive to do so, because researchers can increase credibility of their good work.
– some people may move up when we add the 2020 data.
– hand-coding of articles will further improve the work.

8. Conclusion and Discussion
– not all p-values are created equal.
– working with undergraduates is easy because they are unbiased.
– once you are in grad school, you have to produce significant results.
– z-curve can help to avoid getting into labs that use questionable practices.
– I was lucky to work in labs that cared about the science.

The Prevalence of Questionable Research Practices in Social Psychology

Introduction

A naive model of science assumes that scientists are objective. That is, they derive hypotheses from theories, collect data to test these theories, and then report the results. In reality, scientists are passionate about theories and often want to confirm that their own theories are right. This leads to confirmation bias and the use of questionable research practices (QRPs, John et al., 2012; Schimmack, 2015). QRPs are defined as practices that increase the chances of the desired outcome (typically a statistically significant result) while at the same time inflating the risk of a false positive discovery. A simple QRP is to conduct multiple studies and to report only the results that support the theory.

The use of QRPs explains the astonishingly high rate of statistically significant results in psychology journals that is over 90% (Sterling, 1959; Sterling et al., 1995). While it is clear that this rate of significant results is too high, it is unclear how much it is inflated by QRPs. Given the lack of quantitative information about the extent of QRPs, motivated biases also produce divergent opinions about the use of QRPs by social psychologists. John et al. (2012) conducted a survey and concluded that QRPs are widespread. Fiedler and Schwarz (2016) criticized the methodology and their own survey of German psychologists suggested that QRPs are not used frequently. Neither of these studies is ideal because they relied on self-report data. Scientists who heavily use QRPs may simply not participate in surveys of QRPs or underreport the use of QRPs. It has also been suggested that many QRPs happen automatically and are not accessible to self-reports. Thus, it is necessary to study the use of QRPs with objective methods that reflect the actual behavior of scientists. One approach is to compare dissertations with published articles (Cairo et al., 2020). This method provided clear evidence for the use of QRPs, even though a published document could reveal their use. It is possible that this approach underestimates the use of QRPs because even the dissertation results could be influenced by QRPs and the supervision of dissertations by outsiders may reduce the use of QRPs.

With my colleagues, I developed a statistical method that can detect and quantify the use of QRPs (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve uses the distribution of statistically significant p-values to estimate the mean power of studies before selection for significance. This estimate predicts how many non-significant results were obtained in the search for the significant ones. This makes it possible to compute the estimated discovery rate (EDR). The EDR can then be compared to the observed discovery rate, which is simply the percentage of published results that are statistically significant. The bigger the difference between the ODR and the EDR, the more questionable research practices were used (see Schimmack, 2021, for a more detailed introduction).

I focus on social psychology because (a) I am a social/personality psychologist who is interested in the credibility of results in my field, and (b) social psychology has a large number of replication failures (Schimmack, 2020). Similar analyses are planned for other areas of psychology and other disciplines. I also focus on social psychology more than personality psychology because personality psychology is often more exploratory than confirmatory.

Method

I illustrate the use of z-curve to quantify the use of QRPs with the most extreme examples in the credibility rankings of social/personality psychologists (Schimmack, 2021). Figure 1 shows the z-value plot (ZVP) of David Matsumoto. To generate this plot, the test statistics from t-tests and F-tests were transformed into exact p-values and then into the corresponding values on the standard normal distribution. As two-sided p-values are used, all z-scores are positive. However, because the curve is centered over the z-score that corresponds to the median power before selection for significance (and not over zero, as it would be if the null-hypothesis were true), the distribution can look relatively normal. The variance of the distribution will be greater than 1 when studies vary in statistical power.

The grey curve in Figure 1 shows the predicted distribution based on the observed distribution of z-scores that are significant (z > 1.96). In this case, the observed number of non-significant results is similar to the predicted number of non-significant results. As a result, the ODR of 78% closely matches the EDR of 79%.

Figure 2 shows the results for Shelly Chaiken. The first notable observation is that the ODR of 75% is very similar to Matsumoto's ODR of 78%. Thus, if we simply count the number of significant and non-significant p-values, there is no difference between these two researchers. However, the z-value plot (ZVP) shows a dramatically different picture. The peak density is 0.3 for Matsumoto and 1.0 for Chaiken. As the maximum density of the standard normal distribution is .4, it is clear that the results in Chaiken's articles are not from an actual sampling distribution. In other words, QRPs must have been used to produce too many just significant results with p-values just below .05.
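For reference, the peak density of the standard normal distribution can be checked directly in R:

```r
# The density of a standard normal distribution peaks at about 0.40, which is why
# a peak density of 1.0 in a z-value plot cannot come from an undistorted
# sampling distribution of z-scores.
dnorm(0)   # 0.3989423
```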

The comparison of the ODR and EDR shows a large discrepancy of 64 percentage points too many significant results (ODR = 75% minus EDR = 11%). This is clearly not a chance finding because the ODR falls well outside the 95% confidence interval of the EDR, 5% to 21%.

To examine the use of QRPs in social psychology, I computed the EDR and ODR for over 200 social/personality psychologists. Personality psychologists were excluded if they reported too few t-values and F-values. The actual values and additional statistics can be found in the credibility rankings (Schimmack, 2021). Here I used these data to examine the use of QRPs in social psychology.

Average Use of QRPs

The average ODR is 73.48 with a 95% confidence interval ranging from 72.67 to 74.29. The average EDR is 35.28 with a 95% confidence interval ranging from 33.14 to 37.43. The inflation due to QRPs is 38.20 percentage points, 95%CI = 36.10 to 40.30. This difference is highly significant, t(221) = 35.89; the p-value has too many zeros behind the decimal for R to report an exact value.
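The inflation test is a one-sample t-test of the ODR minus EDR differences across researchers. The sketch below uses placeholder values drawn to roughly match the reported means; the real values are in the credibility rankings, and the data frame and column names here are assumptions.

```r
# Sketch of the inflation test: one-sample t-test of the ODR - EDR difference
# across researchers. The data frame is a placeholder, not the real estimates.
set.seed(2)
researchers <- data.frame(odr = rnorm(222, mean = 73, sd = 6),
                          edr = rnorm(222, mean = 35, sd = 16))

inflation <- researchers$odr - researchers$edr
t.test(inflation)   # mean inflation in percentage points, with its 95%CI and t(221)
```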

It is of course not surprising that QRPs have been used. More important is the effect size estimate. The results suggest that QRPs inflate the discovery rate by over 100%. This explains why unbiased replication studies in social psychology have only a 25% chance of being significant (Open Science Collaboration, 2015). In fact, we can use the EDR as a conservative predictor of replication outcomes (Bartos & Schimmack, 2020). While the EDR of 35% is a bit higher than the actual replication rate, this may be due to the inclusion of non-focal hypothesis tests in these analyses. Z-curve analyses of focal hypothesis tests typically produce lower EDRs. In contrast, Fiedler and Schwarz failed to comment on the low replicability of social psychology. If social psychologists had not used QRPs, it would remain a mystery why their results are so hard to replicate.

In sum, the present results confirm that, on average, social psychologists heavily used QRPs to produce significant results that support their predictions. However, these averages mask differences between researchers like Matsumoto and Chaiken. The next analyses explore these individual differences between researchers.

Cohort Effects

I had no predictions about the effect of cohort on the use of QRPs. I conducted a twitter poll that suggested a general intuition that the use of QRPs may not have changed over time, but there was a lot of uncertainty in these answers. Similar results were obtained in a Facebook poll in the Psychological Methods Discussion Group. Thus, the a priori hypothesis is a vague prior of no change.

The dataset includes different generations of researchers. I used the first publication listed in Web of Science to date researchers. The earliest date was 1964 (Robert S. Wyer). The latest date was 2012 (Kurt Gray). The histogram shows that researchers from the 1970s to 2000s were well represented in the dataset.

There was a significant negative correlation between the ODR and cohort, r(N = 222) = -.25, 95%CI = -.12 to -.37, t(220) = 3.83, p = .0002. This finding suggests that over time the proportion of non-significant results increased. For researchers with the first publication in the 1970s, the average ODR was 76%, whereas it was 72% for researchers with the first publication in the 2000s. This is a modest trend. There are various explanations for this trend.

One possibility is that power decreased as researchers started looking for weaker effects. In this case, the EDR should also show a decrease. However, the EDR showed no relationship with cohort, r(N = 222) = -.03, 95%CI = -.16 to .10, t(220) = 0.48, p = .63. Thus, less power does not seem to explain the decrease in the ODR. At the same time, the finding that the EDR does not show a notable relationship with cohort, abs(r) < .2, suggests that power has remained constant over time. This is consistent with previous examinations of statistical power in social psychology (Sedlmeier & Gigerenzer, 1989).

Although the ODR decreased significantly and the EDR did not decrease significantly, bias (ODR – EDR) did not show a significant relationship with cohort, r(N = 222) = -.06, 95%CI = -.19 to .07, t(220) = -0.94, p = .35, but the 95%CI allows for a slight decrease in bias that would be consistent with the significant decrease in the ODR.

In conclusion, there is a small, statistically significant decrease in the ODR, but the effect over the past four decades is too small to have practical significance. The EDR and bias are not even statistically significantly related to cohort. These results suggest that research practices, and the use of questionable ones, have not changed notably since the beginning of empirical social psychology (Cohen, 1962; Sterling, 1959).

Achievement Motivation

Another possibility is that in each generation, QRPs are used more by researchers who are more achievement motivated (Janke et al., 2019). After all, the reward structure in science is based on the number of publications, and significant results are often needed to publish. In social psychology it is also necessary to present a package of significant results across multiple studies, which is nearly impossible without the use of QRPs (Schimmack, 2012). To examine this hypothesis, I correlated the EDR with researchers' H-Index (as of 2/1/2021). The correlation was small, r(N = 222) = .10, 95%CI = -.03 to .23, and not significant, t(220) = 1.44, p = .15. This finding is only seemingly inconsistent with Janke et al.'s (2019) finding that self-reported QRPs were significantly correlated with self-reported ambition, r(217) = .20, p = .014. Both correlations are small and positive, suggesting that achievement motivated researchers may be slightly more likely to use QRPs. However, the evidence is by no means conclusive and the actual relationship is weak. Thus, there is no evidence that highly productive researchers with impressive H-indices achieved their success by using QRPs more than other researchers. Rather, they became successful in a field where QRPs are the norm. If the norms were different, they would have become successful following these other norms.

Impact

A common saying in science is that “extraordinary claims require extraordinary evidence.” Thus, we might expect stronger evidence for claims of time-reversed feelings (Bem, 2011) than for claims that individuals from different cultures regulate their emotions differently (Matsumoto et al., 2008). However, psychologists have relied on statistical significance with alpha = .05 as a simple rule to claim discoveries. This is a problem because statistical significance is meaningless when results are selected for significance and replication failures with non-significant results remain unpublished (Sterling, 1959). Thus, psychologists have trusted an invalid criterion that does not distinguish between true and false discoveries. It is, however, possible that social psychologists used other information (e.g., gossip about replication failures at conferences) to focus on credible results and to ignore incredible ones. To examine this question, I correlated authors’ EDR with the number of citations in 2019. I used citation counts for 2019 because citation counts for 2020 are not yet final (the results will be updated with the 2020 counts). Using 2019 also increases the chances of finding a significant relationship because replication failures over the past decade could already have changed citation rates.

The correlation between the EDR and the number of citations was statistically significant, r(N = 222) = .16, 95%CI = .03 to .28, t(220) = 2.39, p = .018. However, the lower limit of the 95% confidence interval is close to zero, so the real relationship may be too small to matter. Moreover, the non-parametric correlation with Kendall’s tau was not significant, tau = .085, z = 1.88, p = .06. Thus, at present there is insufficient evidence that citation counts take the credibility of significant results into account. P-values below .05 are treated as equally credible no matter how they were produced.
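Both the parametric and the non-parametric correlation can be checked with base R. A minimal sketch; the column names `edr` and `citations_2019` are hypothetical placeholders for the author-level variables, and the values are simulated for illustration.

```r
# Hypothetical author-level data (placeholder values).
dat <- data.frame(
  edr            = runif(222, .1, .8),   # estimated discovery rate
  citations_2019 = rpois(222, 500)       # citation counts in 2019
)

cor.test(dat$edr, dat$citations_2019)                      # Pearson r, 95% CI, t-test
cor.test(dat$edr, dat$citations_2019, method = "kendall")  # Kendall's tau, z-test
```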

Conclusion

There is general agreement that questionable research practices have been used to produce an unrealistically high success rate of 90% or more in psychology journals (Sterling, 1959). However, there is less agreement about the extent to which QRPs are being used and about the implications for the credibility of significant results in psychology journals (John et al., 2012; Fiedler & Schwarz, 2016). The problem is that self-reports may be biased because researchers are unable or unwilling to report their use of QRPs (Nisbett & Wilson, 1977). Thus, it is necessary to examine this question with alternative methods. The present study used a statistical method to compare the observed discovery rate with a statistically estimated discovery rate based on the distribution of significant p-values. The results showed that, on average, social psychologists have made extensive use of QRPs to inflate an expected discovery rate of around 35% to an observed discovery rate of 70%. Moreover, the estimated discovery rate of 35% is likely to be an inflated estimate of the discovery rate for focal hypothesis tests because the present analysis is based on focal and non-focal tests. This would explain why the actual success rate in replication studies is even lower than the estimated discovery rate of 35% (Open Science Collaboration, 2015).

The main novel contribution of this study was to examine individual differences in the use of QRPs. While the ODR was fairly consistent across articles, the EDR varied considerably across researchers. However, this variation showed only a very small relationship with a researcher’s cohort (first year of publication). This finding suggests that the use of QRPs varies more as a function of research area and other factors than over time. Additional analyses should explore predictors of the variation across researchers.

Another finding was that citations of authors’ work do not take the credibility of their p-values into account. Citations are influenced by the popularity of topics and other factors and do not reflect the strength of evidence. One reason for this might be that social psychologists often publish multiple internal replications within a single article. This creates the illusion that results are robust and credible because it is very unlikely that a type-I error replicates. However, Bem’s (2011) article with 9 internal replications of time-reversed feelings showed that QRPs are also used to produce consistent results within a single article (Francis, 2012; Schimmack, 2012). Thus, the number of significant results within an article or across articles is also an invalid criterion for evaluating the robustness of results.
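To see why consistent internal replications look so convincing, consider a back-of-the-envelope calculation: if all nine significant results in an article were false positives obtained without QRPs, the chance of every one of them reaching p < .05 would be vanishingly small. That is exactly why QRPs, which inflate the per-study error rate, are needed to explain such patterns.

```r
# Probability that nine independent tests of true null hypotheses all reach
# p < .05 without any questionable research practices:
alpha <- .05
alpha^9   # about 2e-12
```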

In conclusion, social psychologists have conducted studies with low statistical power since the beginning of empirical social psychology. The main reason for this is the preference for between-subject designs, which have low statistical power with small sample sizes of N = 40 participants and small to moderate effect sizes. Despite repeated warnings about the problems of selection for significance (Sterling, 1959) and the problems of small sample sizes (Cohen, 1962; Sedlmeier & Gigerenzer, 1989; Tversky & Kahneman, 1971), these practices have not changed since Festinger conducted his seminal study on dissonance with n = 20 per group. Over the past decades, social psychology journals have reported thousands of statistically significant results that are used in review articles, meta-analyses, textbooks, and popular books as evidence to support claims about human behavior. The problem is that it is unclear which of these significant results are true positives and which are false positives, especially if false positives are not just strictly nil-results but also results with tiny effect sizes that have no practical significance. Without other reliable information, even social psychologists do not know which of their colleagues’ results are credible and which are not. Over the past decade, the inability to distinguish credible and incredible findings has produced heated debates and a lack of confidence in published results. The present study shows that the general research practices of a researcher provide valuable information about credibility. For example, a p-value of .01 by a researcher with an EDR of 70% is more credible than a p-value of .01 by a researcher with an EDR of 15%. Thus, rather than stereotyping social psychologists based on the low replication rate in the Open Science Collaboration project, individual social psychologists should be evaluated based on their own research practices.

References

Cairo, A. H., Green, J. D., Forsyth, D. R., Behler, A. M. C., & Raldiris, T. L. (2020). Gray (Literature) Matters: Evidence of Selective Hypothesis Reporting in Social Psychological Research. Personality and Social Psychology Bulletin, 46(9), 1344–1362. https://doi.org/10.1177/0146167220903896

Janke, S., Daumiller, M., & Rudert, S. C. (2019). Dark pathways to achievement in science: Researchers’ achievement goals predict engagement in questionable research practices. Social Psychological and Personality Science, 10(6), 783–791. https://doi.org/10.1177/1948550618790227

Nations’ Well-Being and Wealth

Scientists have made a contribution when a phenomenon or a statistic is named after them. Thus, it is fair to say that Easterlin made a contribution to happiness research because researchers who write about income and happiness often mention his 1974 article “Does Economic Growth Improve the Human Lot? Some Empirical Evidence” (Easterlin, 1974).

To be fair, the article examines the relationship between income and happiness from three perspectives: (a) the correlation between income and happiness across individuals within nations, (b) the correlation between average incomes and average happiness across nations, and (c) the correlation between average income and average happiness within nations over time. A fourth perspective, namely the correlation between income and happiness within individuals over time, was not examined because no such data were available in 1974.

Even for some of the other questions, the data were limited. Here I want to draw attention to Easterlin’s examination of correlations between nations’ wealth and well-being. He draws heavily on Cantril’s seminal contribution to this topic. Cantril (1965) not only developed a measure that can be used to compare well-being across nations, he also used this measure to compare the well-being of 14 nations (Cuba is not included in Table 1 because I did not have new data).

[Figure: Cantril’s (1965) cross-national well-being data]

Cantril also correlated the happiness scores with a measure of nations’ wealth. The correlation was r = .5. Cantril also suggested that Cuba and the Dominican Republic were positive and negative outliers, respectively. Excluding these two nations increases the correlation to r = .7.

Easterlin took issue with these results.

“Actually the association between wealth and happiness indicated by Cantril’s international data is not so clear-cut. This is shown by a scatter diagram of the data (Fig. 1). The inference about a positive association relies heavily on the observations for India and the United States. [According to Cantril (1965, pp. 130-131), the values for Cuba and the Dominican Republic reflect unusual political circumstances – the immediate aftermath of a successful revolution in Cuba and prolonged political turmoil in the Dominican Republic].

What is perhaps most striking is that the personal happiness ratings for 10 of the 14 countries lie virtually within half a point of the midpoint rating of 5, as is brought out by the broken horizontal lines in the diagram. While a difference of rating of only 0.2 is significant at the 0.05 level, nevertheless there is not much evidence, for these 10 countries, of a systematic association between income and happiness. The closeness of the happiness ratings implies also that a similar lack of association would be found between happiness and other economic magnitudes such as income inequality or the rate of change of income.”

Nearly 50 years later, it is possible to revisit Easterlin’s challenge to Cantril’s claim that nations’ well-being is tied to their wealth with much better data from the Gallup World Poll. The Gallup World Poll uses the same measure of well-being (Cantril’s ladder). However, it also provides a better measure of citizens’ wealth by asking directly about income. In contrast, GDP can be distorted and may not reflect the spending power of the average citizen very well. The data about well-being (World Happiness Report, 2020) and median per capita income (Gallup) are publicly available. All I needed to do was compute the correlation and make a pretty graph.

The Pearson correlation between income and the ladder scores is r(126) = .75. The rank correlation is r(126) = .80, and the Pearson correlation between the log of income and the ladder scores is r(126) = .85. These results strongly support Cantril’s prediction based on his interpretation of the first cross-national study in the 1960s and refute Easterlin’s challenge that this correlation is merely driven by two outliers. Other researchers who analyzed the Gallup World Poll data also reported correlations of r = .8 and showed high stability of nations’ wealth and income over time (Zyphur et al., 2020).
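Because the ladder scores and the income figures are public, these correlations are easy to reproduce. A minimal R sketch, assuming a data frame `wb` with hypothetical column names `ladder` (national mean ladder score) and `income` (median per capita income); the values below are simulated placeholders, not the actual data.

```r
# Placeholder data; in practice, wb would be built by merging the World
# Happiness Report ladder scores with Gallup's median income estimates.
wb <- data.frame(
  income = exp(runif(128, log(500), log(60000))),  # median per-capita income
  ladder = runif(128, 3, 8)                        # mean Cantril ladder score
)

cor(wb$income, wb$ladder)                          # Pearson correlation
cor(wb$income, wb$ladder, method = "spearman")     # rank correlation
cor(log(wb$income), wb$ladder)                     # Pearson r with log income
```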

Figure 2 also shows that Easterlin underestimated the range of well-being scores. Even ignoring additional factors like wars, income alone can move well-being from about 4 in one of the poorest countries in the world (Burundi) to close to 8 in one of the richest countries in the world (Norway). The figure also does not suggest that Scandinavian countries have a happiness secret. The main reason for their high average well-being appears to be that median personal incomes are very high.

The main conclusion is that social scientists are often biased for a number of reasons. The bias is evident in Easterlin’s interpretation of Cantril’s data. The same anti-materialistic bias can be found in many other articles on this topic that claim that the benefits of wealth are limited.

To be clear, a log-function implies that the same amount of money buys more well-being in poor countries than in rich ones, but the graph shows no evidence that the benefits of wealth level off. It is also true that the relationship between GDP and happiness over time is more complicated. However, regarding cross-national differences the results are clear: there is a very strong relationship between wealth and well-being. Studies that do not control for this relationship may report spurious relationships that disappear when income is included as a predictor.
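A quick illustration of what a log-linear relationship implies, using a hypothetical slope rather than an estimate from the data: the same extra $1,000 raises log-income, and therefore predicted well-being, far more at a low income than at a high one, yet predicted well-being keeps rising as long as income keeps growing proportionally.

```r
# Hypothetical slope of well-being on log10(income); shows why the same
# absolute amount of money buys more well-being in a poor country.
b <- 1
b * (log10(2000)  - log10(1000))    # +$1,000 at an income of $1,000  -> ~0.30
b * (log10(41000) - log10(40000))   # +$1,000 at an income of $40,000 -> ~0.01
```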

Furthermore, the focus on happiness ignores that wealth also buys longer lives. Thus, individuals in richer nations not only have happier lives, they also have more happy life years. The current Covid-19 pandemic further increases these inequalities.

In conclusion, one concern about subjective measures of well-being has been that individuals in poor countries may be happy with less and that happiness measures therefore fail to reflect human suffering. This is not the case. Sustainable, global economic growth that raises per capita wealth remains a challenge and an important means to improve human well-being.

Jens Forster and the Credibility Crisis in Social Psychology

• Please help improve this post. If you have conducted successful or unsuccessful replication studies of work done by Jens Forster, please share this information with me and I will add it to this blog post.

Jens Forster was a social psychologist from Germany. He was a rising star and on the way to receiving a prestigious 5 million Euro award from the Alexander von Humboldt Foundation (Retraction Watch, 2015). Then an anonymous whistleblower accused him of scientific misconduct. Under pressure, Forster returned the award without admitting to any wrongdoing.

At the time, he was also in the process of moving from the University of Amsterdam to the University of Bochum. After a lengthy investigation, Forster was denied tenure and he is no longer working in academia (Science, 2016), even though an investigation by the German association of psychologists (DGP) did not conclude that he committed fraud.

While the personal consequences for Forster are similar to those for Stapel, who admitted to fraud and left his tenured position, the effect on the scientific record is different. Stapel retracted over 50 articles, which are no longer cited at high rates. In contrast, Forster retracted only a few papers, and most of his articles are not flagged to warn readers that they may be fraudulent. The differences are visible in the citation counts for Stapel and Forster.

[Figure: Stapel’s citation counts by year]

Stapel’s citation counts peaked at 350 and are now down to 150 citations a year. Some of these citations are with co-authors and from papers that have been cleared as credible.

[Figure: Forster’s citation counts by year]

Citation counts for Forster peaked at 450 a year. They have since dropped by about 200 to roughly 250 citations a year, but there was also an uptick of about 100 citations in 2019. The question is whether this muted correction is due to Forster’s denial of wrongdoing or whether the articles that were not retracted are actually more credible.

The difficulty in proving fraud in social psychology is that social psychologists also used many questionable practices to produce significant results. These questionable practices have the same effect as fraud, but they were not considered unethical or illegal. Thus, there are two reasons why articles that have not been retracted may still lack credible evidence. First, it is difficult to prove fraud when authors do not confess. Second, even if no fraud was committed, the data may lack credible evidence because they were produced with questionable practices that are not considered data fabrication.

For readers of the scientific literature it is irrelevant whether incredible results (i.e., results with low credibility) were produced with fraud or with other methods. The only question is whether the published results provide credible evidence for the theoretical claims in an article. Fortunately, meta-scientists have made progress over the past decade in answering this question. One method relies on a statistical examination of an author’s published test statistics. Test statistics can be converted into p-values or z-scores so that they have a common metric (e.g., t-values can be compared to F-values). The higher the z-score, the stronger the evidence against the null-hypothesis. High z-scores are also difficult to obtain with questionable practices. Thus, they are either fraudulent or provide real evidence for a hypothesis (i.e., against the null-hypothesis).
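A minimal R sketch of this conversion, using made-up test statistics and degrees of freedom purely for illustration:

```r
# Convert reported test statistics into two-sided p-values and then into
# absolute z-scores so that different tests share a common metric.
p_t <- 2 * pt(abs(2.45), df = 38, lower.tail = FALSE)    # e.g., t(38) = 2.45
p_F <- pf(7.20, df1 = 1, df2 = 85, lower.tail = FALSE)   # e.g., F(1, 85) = 7.20
z   <- qnorm(c(p_t, p_F) / 2, lower.tail = FALSE)        # absolute z-scores
z                                                        # z > 1.96 means p < .05
```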

I have published z-curve analyses of over 200 social/personality psychologists that show clear evidence of variation in research practices across researchers (Schimmack, 2021). I did not include Stapel or Forster in these analyses because doubts have been raised about their research practices. However, it is interesting to compare Forster’s z-curve plot to the plots of other researchers because it is still unclear whether the anomalous statistical patterns in Forster’s articles are due to fraud or to the use of questionable research practices.

The distribution of z-scores shows clear evidence that questionable practices were used: the observed discovery rate of 78% is much higher than the estimated discovery rate of 18%, and the ODR falls outside the 95%CI of the EDR, 9% to 47%. An EDR of 18% places Forster at rank #181 in the ranking of 213 social psychologists. Thus, even if Forster did not commit fraud, many of his published results are questionable.

The comparison of Forster with other social psychologists is helpful because humans are prone to overgeneralize from salient examples, which is known as stereotyping. Fraud cases like Stapel and Forster have tainted the image of social psychology and undermined trust in social psychology as a science. The fact that Forster ranks very low in comparison to other social psychologists shows that he is not representative of research practices in social psychology. This does not mean that Stapel and Forster are merely bad apples and extreme outliers. The use of QRPs was widespread, but how much researchers used QRPs varied across researchers. Thus, we need to take an individual-differences perspective and personalize credibility. The average z-curve plot for all social psychologists ignores that some researchers’ practices were much worse and others’ were much better. Thus, I argue against stereotyping social psychologists and in favor of evaluating each social psychologist based on their own merits. As much as all social psychologists acted within a reward structure that nearly rewarded Forster’s practices with a 5 million Euro prize, researchers navigated this reward structure differently. Hopefully, making research practices transparent can change the reward structure so that credibility gets rewarded.

Personalized P-Values for Social/Personality Psychologists

Last update: 2/24/2021
(the latest update included articles published in 2020; this produced some changes in the rankings)

Introduction

Since Fisher invented null-hypothesis significance testing, researchers have used p < .05 as a statistical criterion to interpret results as discoveries worthy of discussion (i.e., the null-hypothesis is false). Once published, these results are often treated as real findings, even though alpha controls only the rate of false positives among tests of true null-hypotheses, not the risk that a published significant result is a false discovery.

Statisticians have warned against exclusive reliance on p < .05, but nearly 100 years after Fisher popularized this approach, it is still the most common way to interpret data. The main reason is that many attempts to improve on this practice have failed. The core problem is that a single statistical result is difficult to interpret. However, when individual results are interpreted in the context of other results, they become more informative. Based on the distribution of p-values, it is possible to estimate the maximum false discovery rate (Bartos & Schimmack, 2020; Jager & Leek, 2014). This approach can be applied to the p-values published by individual authors to adjust the significance criterion so that the risk of false discoveries stays at a reasonable level, FDR < .05.

Researchers who mainly test true hypotheses with high power have a high discovery rate (many p-values below .05) and a low false discovery rate (FDR < .05). Figure 1 shows an example of a researcher who followed this strategy (for a detailed description of z-curve plots, see Schimmack, 2021).

We see that, out of the 317 test statistics retrieved from his articles, 246 were significant with alpha = .05. This is an observed discovery rate of 78%. We also see that this discovery rate closely matches the estimated discovery rate based on the distribution of the significant p-values: the EDR is 79%. With an EDR of 79%, the maximum false discovery rate is only 1%. However, the 95%CI is wide, and the lower bound of the CI for the EDR, 27%, allows for up to 14% false discoveries.
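These maximum false discovery rates are consistent with Soric’s (1989) upper bound, which requires only the discovery rate and alpha. A minimal R sketch (the function name is mine, not part of any package):

```r
# Soric's (1989) upper bound on the false discovery rate for a given
# discovery rate (here the EDR) and significance criterion alpha.
max_fdr <- function(edr, alpha = .05) ((1 - edr) / edr) * (alpha / (1 - alpha))

max_fdr(.79)   # ~0.01: point estimate of the EDR
max_fdr(.27)   # ~0.14: lower bound of the EDR's 95% CI
max_fdr(.05)   # = 1.00: an EDR of 5% allows for 100% false discoveries
```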

When the ODR matches the EDR, there is no evidence of publication bias. In this case, we can improve the estimates by fitting all p-values, including the non-significant ones. With the resulting tighter CI for the EDR, the 95%CI for the maximum FDR ranges from 1% to 3%. Thus, we can be confident that no more than 5% of the significant results with alpha = .05 are false discoveries. Readers can therefore continue to use alpha = .05 to look for interesting discoveries in Matsumoto’s articles.

Figure 3 shows the results for a different type of researcher, who took risks and studied weak effect sizes with small samples. This produces many non-significant results that are often not published. Selection for significance inflates the observed discovery rate, but the z-curve plot and the comparison with the EDR reveal the influence of publication bias. Here the ODR is similar to Figure 1, but the EDR is only 11%. An EDR of 11% translates into a large maximum false discovery rate of 41%. In addition, the 95%CI of the EDR includes 5%, which means the risk of false positives could be as high as 100%. In this case, using alpha = .05 to interpret results as discoveries is very risky. Clearly, p < .05 means something very different when reading an article by David Matsumoto or by Shelly Chaiken.

Rather than dismissing all of Chaiken’s results, we can try to lower alpha to reduce the false discovery rate. If we set alpha = .01, the FDR is 15%. If we set alpha = .005, the FDR is 8%. To get the FDR below 5%, we need to set alpha to .001.

A uniform criterion of FDR < 5% is applied to all researchers in the rankings below. For some this means no adjustment to the traditional criterion. For others, alpha is lowered to .01, and for a few even lower than that.

The rankings below are based on automatically extracted test statistics from 40 journals (List of journals). The results should be interpreted with caution and treated as preliminary. They depend on the specific set of journals that were searched, the way results are reported, and many other factors. The data are available (data.drop), and researchers can exclude or add articles and run their own analyses using the z-curve package in R (https://replicationindex.com/2020/01/10/z-curve-2-0/).
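For readers who want to run their own analysis, a minimal sketch with the CRAN zcurve package might look like the following. The vector of z-scores is a simulated placeholder; in practice it would be the absolute z-scores converted from the extracted test statistics for the selected articles. I believe zcurve() accepts a vector of z-scores and that summary() and plot() return the ERR/EDR estimates and the z-curve plot, but check the package documentation for the current interface.

```r
# install.packages("zcurve")
library(zcurve)

# Placeholder z-scores standing in for the converted test statistics.
z <- abs(rnorm(500, mean = 2, sd = 1))

fit <- zcurve(z)    # fits the finite mixture model to the significant z-scores
summary(fit)        # ERR and EDR estimates with bootstrapped confidence intervals
plot(fit)           # z-curve plot
```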

I am also happy to receive feedback about coding errors. In addition, I recommend hand-coding articles to adjust alpha for focal hypothesis tests. This typically lowers the EDR and increases the FDR. For example, the automated method produced an EDR of 31% for Bargh, whereas hand-coding of focal tests produced an EDR of 12% (Bargh-Audit).

And here are the rankings. The results are fully automated, and I was not able to cover up the fact that I rank only #139 out of 300. In another post, I will explain how researchers can move up in the rankings. Of course, one way to move up is to increase statistical power in future studies. The rankings will be updated again when the 2021 data are available.

Despite their preliminary nature, I am confident that the results provide valuable information. Until now, all p-values below .05 have been treated as if they were equally informative. The rankings here show that this is not the case. While p = .02 can be informative for one researcher, p = .002 may still entail a high false discovery risk for another researcher.

Name | Tests | ODR | EDR | ERR | FDR | Alpha
Robert A. Emmons588885881.05
David Matsumoto3788379851.05
Linda J. Skitka5326875822.05
Jonathan B. Freeman2745975812.05
Virgil Zeigler-Hill5157274812.05
David P. Schmitt2077871772.05
Emily A. Impett5497770762.05
John M. Zelenski1567169762.05
Kurt Gray4877969812.05
Michael E. McCullough3346969782.05
Kipling D. Williams8437569772.05
Hilary B. Bergsieker4396768742.05
Cameron Anderson6527167743.05
Jamil Zaki4307866763.05
Phoebe C. Ellsworth6057465723.05
Benjamin R. Karney3925665733.05
Jim Sidanius4876965723.05
A. Janet Tomiyama767865763.05
Juliane Degner4356364713.05
Carol D. Ryff2808464763.05
Steven J. Heine5977863773.05
Thomas N Bradbury3986163693.05
David M. Amodio5846663703.05
Elaine Fox4727962783.05
Klaus Fiedler14217860723.05
Richard W. Robins2707660704.05
Margaret S. Clark5057559774.05
William B. Swann Jr.10707859804.05
Edward P. Lemay2898759814.05
Ximena B. Arriaga2846658694.05
Patricia G. Devine6067158674.05
B. Keith Payne8797158764.05
Rainer Reisenzein2016557694.05
Joris Lammers7056956694.05
Jean M. Twenge3817256594.05
Nicholas Epley15047455724.05
Barbara Mellers1488553795.05
Edward L. Deci2847952635.05
Richard M. Ryan9987852695.05
Lee Jussim2268052715.05
Ethan Kross6146652675.05
Tessa V. West6917151595.05
Jens B. Asendorpf2537451695.05
Samuel D. Gosling1085851625.05
Roger Giner-Sorolla6638151805.05
Sheena S. Iyengar2076350805.05
James J. Gross11047250775.05
Paul Rozin4497850845.05
Janice R. Kelly3667550705.05
Shinobu Kitayama9837650715.05
Paul K. Piff1667750635.05
Mina Cikara3927149805.05
Penelope Lockwood4587148706.01
Bertram Gawronski18037248766.01
Edward R. Hirt10428148656.01
Matthew D. Lieberman3987247806.01
Stephanie A. Fryberg2486247666.01
Leaf van Boven7117247676.01
Daniel M. Wegner6027647656.01
Agneta H. Fischer9527547696.01
John T. Cacioppo4387647696.01
Alice H. Eagly3307546716.01
Rainer Banse4027846726.01
Jennifer S. Lerner1818046616.01
Jeanne L. Tsai12417346676.01
Constantine Sedikides25667145706.01
Dacher Keltner12337245646.01
Andrea L. Meltzer5495245726.01
R. Chris Fraley6427045727.01
Brian A. Nosek8166844817.01
Ursula Hess7747844717.01
Mark Schaller5657343617.01
S. Alexander Haslam11987243647.01
Charles M. Judd10547643687.01
Jessica L. Tracy6327443717.01
Jason P. Mitchell6007343737.01
Lisa Feldman Barrett6446942707.01
Susan T. Fiske9117842747.01
Bernadette Park9737742647.01
Paul A. M. Van Lange10927042637.01
Mario Mikulincer9018942647.01
Wendi L. Gardner7986742637.01
Jordan B. Peterson2666041797.01
Philip E. Tetlock5497941737.01
Michael Inzlicht5666441618.01
Stacey Sinclair3277041578.01
Norbert Schwarz13377240638.01
Tiffany A. Ito3498040648.01
Richard E. Petty27716940648.01
Wendy Wood4627540628.01
Elizabeth Page-Gould4115740668.01
Jason E. Plaks5827039678.01
Carol S. Dweck10287039638.01
Christian S. Crandall3627539598.01
Tobias Greitemeyer17377239678.01
Marcel Zeelenberg8687639798.01
Eric D. Knowles3846838648.01
Jerry Suls4137138688.01
Abigail A. Scholer5565838629.01
Harry T. Reis9986938749.01
John F. Dovidio20196938629.01
Joshua Correll5496138629.01
C. Nathan DeWall13367338639.01
Molly J. Crockett1797337799.01
Mahzarin R. Banaji8807337789.01
Mark J. Brandt2777037709.01
Fritz Strack6077537569.01
Antony S. R. Manstead16567237629.01
Kevin N. Ochsner4067937709.01
Lorne Campbell4336737619.01
Ayelet Fishbach14167837599.01
Geoff MacDonald4066737679.01
Barbara L. Fredrickson2877236619.01
Craig A. Anderson4677636559.01
Niall Bolger3766736589.01
Duane T. Wegener9807736609.01
D. S. Moskowitz34187436639.01
Yaacov Schul4116136649.01
Joanne V. Wood10937436609.01
Nyla R. Branscombe12767036659.01
Jeff T. Larsen18174366710.01
Igor Grossmann20364356610.01
Michael D. Robinson138878356610.01
C. Miguel Brendl12176356810.01
Eva Walther49382356610.01
Samuel L. Gaertner32175356110.01
Nalini Ambady125662355610.01
Azim F. Sharif18374356810.01
Daphna Oyserman44655355410.01
Emily Balcetis59969356810.01
Diana I. Tamir15662356210.01
Michael Harris Bond37873358410.01
John T. Jost79470356110.01
Wiebke Bleidorn9963347410.01
Paula M. Niedenthal52269346110.01
Ozlem Ayduk54962345910.01
Thomas Gilovich119380346910.01
Alison Ledgerwood21475345410.01
Kerry Kawakami48768335610.01
Christopher R. Agnew32575337610.01
Malte Friese50161335711.01
Danu Anthony Stinson49477335411.01
Jennifer A. Richeson83167335211.01
Ulrich Schimmack31875326311.01
Russell H. Fazio109469326111.01
Mark Snyder56272326311.01
Eli J. Finkel139262325711.01
Margo J. Monteith77376327711.01
Robert B. Cialdini37972325611.01
E. Ashby Plant83177315111.01
Yuen J. Huo13274318011.01
Christopher K. Hsee68975316311.01
Delroy L. Paulhus12177318212.01
Kathleen D. Vohs94468315112.01
Jamie Arndt131869315012.01
John A. Bargh65172315512.01
Roy F. Baumeister244269315212.01
Tom Pyszczynski94869315412.01
Anthony G. Greenwald35772308312.01
Jennifer Crocker51568306712.01
Dale T. Miller52171306412.01
Arthur Aron30765305612.01
Aaron C. Kay132070305112.01
Lauren J. Human44759307012.01
Nicholas O. Rule129468307513.01
Steven W. Gangestad19863304113.005
Richard E. Nisbett31973296913.01
Hazel Rose Markus67476296813.01
Dirk Wentura83065296413.01
Nir Halevy26268297213.01
Caryl E. Rusbult21860295413.01
Russell Spears228673295513.01
Gordon B. Moskowitz37472295713.01
Jeff Greenberg135877295413.01
Eliot R. Smith44579297313.01
Boris Egloff27481295813.01
Jeffry A. Simpson69774285513.01
Yoav Bar-Anan52575287613.01
Adam D. Galinsky215470284913.01
Roland Neumann25877286713.01
Matthew Feinberg29577286914.01
Sander L. Koole76765285214.01
Joshua Aronson18385284614.005
Naomi I. Eisenberger17974287914.01
Geoffrey J. Leonardelli29068284814.005
Shelly L. Gable36464285014.01
Grainne M. Fitzsimons58568284914.01
Richard J. Davidson38064285114.01
Brent W. Roberts56272287714.01
Elizabeth W. Dunn39575286414.01
Eddie Harmon-Jones73873287014.01
Jan De Houwer197270277214.01
Karl Christoph Klauer80167276514.01
Joshua M. Ackerman58075274914.01
Jennifer S. Beer8056275414.01
Guido H. E. Gendolla42276274714.005
Claude M. Steele43473264215.005
William G. Graziano53271266615.01
Kristin Laurin64863265115.01
Klaus R. Scherer46783267815.01
Galen V. Bodenhausen58574266115.01
Sonja Lyubomirsky53171265915.01
Kerri L. Johnson53276257615.01
Batja Mesquita41671257316.01
Joel Cooper25772253916.005
Ronald S. Friedman18379254416.005
Phillip R. Shaver56681257116.01
Laurie A. Rudman48272256816.01
David Dunning81874257016.01
Steven J. Sherman88874246216.01
Alison L. Chasteen22368246916.01
Shigehiro Oishi110964246117.01
Thomas Mussweiler60470244317.005
Mark W. Baldwin24772244117.005
Jonathan Haidt36876237317.01
Brandon J. Schmeichel65266234517.005
Jeffrey W Sherman99268237117.01
Jennifer L. Eberhardt20271236218.005
Felicia Pratto41073237518.01
Klaus Rothermund73871237618.01
Bernard A. Nijstad69371235218.005
Michael Ross116470226218.005
Dieter Frey153868225818.005
Marilynn B. Brewer31475226218.005
David M. Buss46182228019.01
Spike W. S. Lee14568226419.005
Yoel Inbar28067227119.01
Wendy Berry Mendes96568224419.005
Sean M. McCrea58473225419.005
Joseph P. Forgas88883215919.005
Maya Tamir134280216419.005
Paul W. Eastwick58365216919.005
Elizabeth Levy Paluck3184215520.005
Jay J. van Bavel43764207121.005
Geoffrey L. Cohen159068205021.005
Tanya L. Chartrand42467203321.001
David A. Pizarro22771206921.005
Andrew J. Elliot101881206721.005
Kentaro Fujita45869206221.005
Ana Guinote37876204721.005
Nilanjana Dasgupta38376195222.005
Amy J. C. Cuddy17081197222.005
Peter M. Gollwitzer130364195822.005
Robert S. Wyer87182196322.005
Gerald L. Clore45674194522.001
Travis Proulx17463196222.005
James K. McNulty104756196523.005
Dolores Albarracin52067195623.005
Richard P. Eibach75369194723.001
Kennon M. Sheldon69874186623.005
Wilhelm Hofmann62467186623.005
James M. Tyler13087187424.005
Ed Diener49864186824.005
Roland Deutsch36578187124.005
Frank D. Fincham73469185924.005
Toni Schmader54669186124.005
Lisa K. Libby41865185424.005
Chen-Bo Zhong32768184925.005
Ara Norenzayan22572176125.005
Benoit Monin63565175625.005
Brad J. Bushman89774176225.005
Michel Tuan Pham24686176825.005
Ap Dijksterhuis75068175426.005
E. Tory. Higgins192068175426.001
Michael W. Kraus61772175526.005
Simone Schnall27062173126.001
Carey K. Morewedge63376176526.005
Timothy D. Wilson79865176326.005
Leandre R. Fabrigar63270176726.005
Melissa J. Ferguson116372166927.005
Daniel T. Gilbert72465166527.005
William A. Cunningham21377166328.005
Mark P. Zanna65964164828.001
Sandra L. Murray69760165528.001
Charles S. Carver15482166428.005
Laura A. King39176166829.005
Heejung S. Kim85859165529.001
Gun R. Semin15979156429.005
Nathaniel M Lambert45666155930.001
Nira Liberman130475156531.005
Shelley E. Taylor43869155431.001
Ziva Kunda21767145631.001
Lee Ross34977146331.001
Jon K. Maner104065145232.001
Gabriele Oettingen104761144933.001
Arie W. Kruglanski122878145833.001
Gregory M. Walton58769144433.001
Sarah E. Hill50978135234.001
Fiona Lee22167135834.001
Michael A. Olson34665136335.001
Michael A. Zarate12052133136.001
Daniel M. Oppenheimer19880126037.001
Steven J. Spencer54167124438.001
Yaacov Trope127773125738.001
Deborah A. Prentice8980125738.001
William von Hippel39865124840.001
Oscar Ybarra30563125540.001
Dov Cohen64168114441.001
Mark Muraven49652114441.001
Ian McGregor40966114041.001
Martie G. Haselton18673115443.001
Susan M. Andersen36174114843.001
Shelly Chaiken36074115244.001
Hans Ijzerman2145694651.001