Yearly Archives: 2018

Estimating the Size of File-Drawers with Z-Curve

Every student in psychology is introduced to the logic of Null-Hypothesis Significance Testing (NHST). The basic idea is to establish the long-run maximum probability that a significant result is a false positive result. A false positive result is called a type-I error. The standard for an acceptable type-I error risk is 5%. Statistics programs and articles often highlight results with a p-value less than 0.05. Students quickly learn that the goal of statistical analysis is to find p-values less than .05.

NHST has been criticized for many reasons. This blog post focuses on the problem that arises when NHST is used to hunt for significant results and only significant results are reported. Hunting for significant results in itself is not a problem. If a researcher conducts 100 statistical tests and reports all results, the risk of a type-I error is controlled by the significance criterion. With alpha = .05, no more than 5 of the 100 tests are expected to produce a false positive result in the long run. If, for example, 20 results are significant, it is clear that some of the significant results are true discoveries.

The problem arises when only significant results are reported (Sterling, 1959). If a researcher reports 20 significant results, it is not clear whether 20, 100, or 400 tests were conducted. However, this has important implications for the assessment of type-I errors. With 20 tests and 20 significant results, the type-I error risk is minimal. With 100 tests, it is moderate (5 expected false positives, so 1 out of 4 significant results could be a false positive). With 400 tests, the 20 significant results are only 1 out of 20 tests (5%), and it is practically certain that at least some of the significant results are false positives. After all, 5% is the expected percentage of significant results if all 400 studies tested false hypotheses. So observing only 5% significant results in 400 tests suggests that some of these significant results are false positives.
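The arithmetic behind these three scenarios can be sketched in a few lines of Python. This is only an illustration; the function and constant names are made up for this example, and the calculation uses the long-run expected number of false positives (alpha times the number of tests) as the upper bound.

```python
ALPHA = 0.05
REPORTED_SIGNIFICANT = 20


def max_false_positive_share(n_tests, n_significant=REPORTED_SIGNIFICANT, alpha=ALPHA):
    """Largest share of the significant results that could be false positives.

    Uses the long-run expectation alpha * n_tests as the maximum number of
    false positives, capped at 100% of the reported significant results.
    """
    expected_false_positives = alpha * n_tests
    return min(1.0, expected_false_positives / n_significant)


for n_tests in (20, 100, 400):
    share = max_false_positive_share(n_tests)
    print(f"{n_tests} tests: up to {share:.0%} of the 20 significant results")
```

With 400 tests, the bound reaches 100%: every one of the 20 reported results could be a false positive.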

The Replication Crisis and the True Type-I Error Risk

The selective publishing of only significant hypothesis tests is a major problem in psychological science (Sterling, 1959; Sterling et al., 1995), but psychologists only recently became aware of this problem (Francis, 2012; John et al., 2012; Schimmack, 2012). Once results are selected for significance, the true type-I error risk increases as a function of the actual number of tests that were conducted. While alpha is 5% in all studies, the percentage of significant results is unknown because it is unknown how many tests were conducted.

Type-I Error Risk and the File Drawer

Rosenthal (1979) introduced the concept of a file drawer. The proverbial file-drawer contains all of the unpublished studies that a researcher conducted that produced non-significant results.

If all studies had the same statistical power to produce a significant result, the size of the file-drawer would be self-evident. Studies with 50% power have a long-run probability of obtaining 50% significant results, by definition. Thus, 50% of studies also produce non-significant results. It follows that for each published significant result, there is a non-significant result in the proverbial file-drawer (File-Drawer Ratio 1:1; this simple example assumes independence of hypothesis tests).

If power were 80%, there would be only one non-significant result in the file-drawer for every 4 published significant results (File-Drawer Ratio 1:4 or 0.25:1). However, if power were only 20%, there would be 4 non-significant results for every published significant result (File-Drawer Ratio 4:1).
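For the homogeneous case, the file-drawer ratio is simply (1 - power) / power. A minimal Python sketch, assuming that all studies have the same power and that every non-significant result stays unpublished (the function name is illustrative):

```python
def file_drawer_ratio(power):
    """Non-significant results per significant result when all studies share one power."""
    if not 0 < power <= 1:
        raise ValueError("power must be greater than 0 and at most 1")
    return (1 - power) / power


for power in (0.5, 0.8, 0.2):
    print(f"power = {power:.0%}: file-drawer ratio = {file_drawer_ratio(power):.2f} : 1")
```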

Things are more complicated when studies vary in power. If we assume that some studies test true hypotheses and others test false hypotheses, the probability of a significant result varies across studies. Using a simple example, assume that 80 studies test false hypotheses (so any significant result is a false positive) and 20 studies have 50% power. In this case, we expect 14 significant results: 80 * .05 = 4 false positives plus 20 * .5 = 10 true positives, and 4 + 10 = 14.

The 5% error rate holds for the 100 studies that were conducted, but it would be wrong to believe that only 5% of the selected set of 14 studies with significant results could be false positives. In this example, we would falsely assume that at most 1 of the 14 studies is a false positive (14 * .05 = 0.7 studies). However, in this case, we know that 4 false positive results are expected. We get the correct estimate of the maximum number of false positives if we start with the actual number of studies that were conducted: 100 * .05 = 5 studies, which is 5/14 = 36% of the reported results. Thus, up to 36% of the 14 reported studies could be false positives, and the actual risk is 7 times larger than the claim p < .05 suggests.
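The numbers in this example can be reproduced with a short Python sketch. The variable and function names are illustrative; the only inputs are the 80 studies with 5% "power" (alpha) and the 20 studies with 50% power described above.

```python
ALPHA = 0.05


def expected_significant(groups):
    """Expected number of significant results for (number_of_studies, power) pairs."""
    return sum(k * power for k, power in groups)


groups = [(80, ALPHA), (20, 0.50)]        # 80 false hypotheses, 20 studies with 50% power
n_run = sum(k for k, _ in groups)         # 100 studies were conducted
n_sig = expected_significant(groups)      # 4 + 10 = 14 significant results

naive_max_fp = ALPHA * n_sig              # bound based on the 14 reported results only
actual_max_fp = ALPHA * n_run             # bound based on all 100 studies
max_fp_share = actual_max_fp / n_sig      # share of the reported results

print(f"expected significant results: {n_sig:.0f}")
print(f"naive bound: {naive_max_fp:.1f} studies, actual bound: {actual_max_fp:.0f} studies")
print(f"up to {max_fp_share:.0%} of the reported results could be false positives")
```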

In short, we need to know the size of the file-drawer to estimate the percentage of reported results that could be false positives.

Estimating the Size of the File Drawer

Brunner and Schimmack (2018) developed a statistical method, z-curve, that can estimate mean power for a set of studies with heterogeneity in power, including some false positive results. The main purpose of the method was to estimate mean power for the set of published studies that produced significant results. However, the article also contained some theorems that make it possible to estimate the size of the file drawer.

Z-curve is a mixture model that models the distribution of observed test statistics (z-scores) as a mixture of studies with different levels of power. Brunner and Schimmack (2018) introduced a model with varying non-centrality parameters and weights. However, it is also possible to keep the non-centrality parameters constant so that only the weights are free model parameters. The fixed non-centrality parameters can include a value of 0 to model the presence of false positive results. The latest version of z-curve uses fixed values of 0, 1, 2, 3, 4, 5, and 6. Values greater than 6 are not needed because z-curve treats all observed z-scores greater than 6 as having a power of 1.

The power values corresponding to these fixed non-centrality parameters are 5%, 17%, 52%, 85%, 98%, 99.9%, and 100%. Only the lower power values are important for the estimation of the file-drawer because high values imply that nearly all attempts produce significant results.
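These power values can be checked with Python's standard library. The sketch below assumes a two-sided z-test with alpha = .05 (critical value of about 1.96), in which a significant result in either direction counts; the function name is illustrative and this is not the z-curve code itself.

```python
from statistics import NormalDist


def power_two_sided(ncp, alpha=0.05):
    """Probability that |z| exceeds the critical value when z ~ Normal(ncp, 1)."""
    norm = NormalDist()                      # standard normal distribution
    crit = norm.inv_cdf(1 - alpha / 2)       # about 1.96 for alpha = .05
    upper = 1 - norm.cdf(crit - ncp)         # P(z > crit)
    lower = norm.cdf(-crit - ncp)            # P(z < -crit)
    return upper + lower


for ncp in range(7):
    print(f"non-centrality {ncp}: power = {power_two_sided(ncp):.3f}")
```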

To illustrate the method, I focus on the lowest three power values: 5%, 17% and 52%. Assume that we observe 100 significant results with the following mixture of power values: 40 studies have 5% power, 34 studies have 17% power, and 26 studies have 52% power, and we want to know the size of the file drawer.

To get from the observed number of significant studies to the number of studies that were actually run, we need to divide the number of observed studies by power (see Brunner & Schimmack, 2018, for a mathematical proof). With 5% power (i.e., false positive results), it requires 1/0.05 = 20 studies to produce 1 significant result in the long run. Thus, if 40 significant results were obtained with 5% power, 800 studies had to be run (800 * 0.05 = 40). With 17% power, it would require 200 studies to produce 34 significant results. And with 52% power, it would require 50 studies to produce 26 significant results. Thus, the total number of studies that are needed to obtain 100 significant results is 800 + 200 + 50 = 1,050. It follows that 950 (1,050 – 100) non-significant results are in the file drawer.
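The division-by-power step can be sketched in Python. The mixture used here (40, 34, and 26 significant results with 5%, 17%, and 52% power) is the one that sums to 100 significant results and matches the proportions of the first simulation described next; the function name is illustrative.

```python
def studies_run(n_significant, power):
    """Long-run number of studies needed to obtain n_significant significant results."""
    return n_significant / power


# (observed significant results, power) for each component of the mixture
mixture = [(40, 0.05), (34, 0.17), (26, 0.52)]

n_sig = sum(k for k, _ in mixture)
total_run = sum(studies_run(k, power) for k, power in mixture)
file_drawer = total_run - n_sig

print(f"significant results: {n_sig}")
print(f"studies run: {total_run:.0f}")
print(f"file drawer: {file_drawer:.0f} ({file_drawer / n_sig:.1f} : 1)")
```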

The following simulation illustrates how z-curve estimates the size of the file-drawer. Data are generated using normal distributions with a standard deviation of 1 and means of 0, 1, and 2. To achieve large sample accuracy, there are 1,050,000 observations (M = 0, k = 800,000; M = 1, k = 200,000; & M = 2, k = 50,000).

Only significant results (to the right of the red line at z = 1.96) were used to fit the model. The non-significant results are shown to see how well the model predicts the size of the file drawer.

The gray line shows the predicted distribution by the model. It shows that the predicted distribution of non-significant results matches the observed distribution of non-significant results, although the model slightly overestimates the size of the file-drawer.

The Expected Discovery Rate is the percentage of significant results for all studies including the file-drawer. The actual discovery rate is given by the number of studies (k = 1,050,000) and the actual number of significant results (k = 99,908), which is 99,908/1,050,000 = 9.52%. The expected discovery rate is 9%, a fairly close match given the size of the file drawer.

Another way to look at the size of the file-drawer is the file-drawer ratio. That is, how many studies with non-significant results are in the file drawer for every significant result. The actual file-drawer ratio is (1,050,000 – 99,908)/99,908 = 9.51. That is, for every significant result, 9 to 10 non-significant results were not reported. The estimated file-drawer ratio is 9.7, a fairly close match.
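The actual discovery rate and file-drawer ratio of such a simulation can be approximated with a small Monte Carlo sketch in Python. It does not fit z-curve; it only generates z-scores and counts significant results. To keep the run time short, it uses the same 800 : 200 : 50 proportions at one tenth of the sample size, and it assumes that results are significant when the absolute z-score exceeds 1.96. The function name and the seed are arbitrary choices for this illustration.

```python
import random


def simulate_discovery_rate(group_sizes, crit=1.96, seed=1):
    """Share of simulated z-scores with |z| > crit.

    group_sizes maps the mean of a normal distribution (SD = 1) to the
    number of z-scores drawn from it.
    """
    rng = random.Random(seed)
    n_total = 0
    n_sig = 0
    for mean, k in group_sizes.items():
        for _ in range(k):
            n_total += 1
            if abs(rng.gauss(mean, 1.0)) > crit:
                n_sig += 1
    return n_sig / n_total


sizes = {0: 80_000, 1: 20_000, 2: 5_000}
discovery_rate = simulate_discovery_rate(sizes)
ratio = (1 - discovery_rate) / discovery_rate

print(f"discovery rate: {discovery_rate:.2%}")
print(f"file-drawer ratio: {ratio:.2f} : 1")
```

Because of random sampling error, the output varies a little with the seed, but it should be close to the discovery rate of about 9.5% and the file-drawer ratio of about 9.5 : 1 reported above.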

The next example shows how z-curve performs when mean power is higher and the file-drawer is smaller. In this example, there were 100,000 cases with M = 0, 200,000 cases with M = 1, 400,000 cases with M = 2, and 300,000 cases with M = 3. The expected discovery rate for this simulation is 50%. With mean power of 50%, the file-drawer ratio is 1:1. That is, for each significant result there is one non-significant result.

The grey line shows that z-curve slightly overestimates the size of the file-drawer. However, this bias is small. The expected discovery rate is estimated to be 49% and the file-drawer ratio is estimated to be 1.05 : 1. These estimates closely match the actual results.

If power is greater than 50%, the file-drawer ratio is less than 1:1. The final simulation assumes that researchers have 80% power to test a true hypothesis, but that 20% of all studies are false positives. The mixture of actual power is 200,000 cases with M = 0, 100,000 cases with M = 2, 400,000 cases with M = 3, and 300,000 cases with M = 4. The mean power is 70%.

Once more, z-curve fits the actual data quite well. The expected discovery rate of 71% matches the actual discovery rate of 70% and the estimated file-drawer ratio of 0.4 to 1 also matches the actual file-drawer ratio of 0.44 to 1.
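The actual discovery rates of the three simulations can also be derived analytically as the weighted mean of the power values of the components. The Python sketch below assumes a two-sided test with alpha = .05 and uses the group sizes given in the text; it computes the values that z-curve is trying to recover and is not the z-curve estimation itself. All names are illustrative.

```python
from statistics import NormalDist


def power_two_sided(ncp, alpha=0.05):
    """Probability that |z| exceeds the critical value when z ~ Normal(ncp, 1)."""
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)
    return (1 - norm.cdf(crit - ncp)) + norm.cdf(-crit - ncp)


def discovery_rate(group_sizes):
    """Weighted mean power for a mapping of non-centrality -> number of studies."""
    total = sum(group_sizes.values())
    return sum(k * power_two_sided(ncp) for ncp, k in group_sizes.items()) / total


simulations = {
    "low power": {0: 800_000, 1: 200_000, 2: 50_000},
    "medium power": {0: 100_000, 1: 200_000, 2: 400_000, 3: 300_000},
    "high power": {0: 200_000, 2: 100_000, 3: 400_000, 4: 300_000},
}

for label, sizes in simulations.items():
    rate = discovery_rate(sizes)
    ratio = (1 - rate) / rate
    print(f"{label}: discovery rate = {rate:.1%}, file-drawer ratio = {ratio:.2f} : 1")
```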

More extensive simulations are needed to examine the performance of z-curve. With smaller sets of studies, random sampling error alone will produce some variability in estimates. However, large differences in file-drawer estimates such as 0.4:1 versus 10:1 are unlikely to occur by chance alone.

Real Examples

To provide an illustration with real data, I fitted z-curve to Roy F. Baumeister’s results in his most influential studies (see Baumeister audiT for the data).

Visual inspection shows that Roy F. Baumeister’s z-curve matches most closely to the first simulation. The quantitative estimates confirm this impression. The expected discovery rate is estimated to be 11% and the file-drawer ratio is estimated to be 9.65 : 1. That is, for every published significant result, z-curve predicts 9 to 10 unpublished non-significant results. The figure shows that only a few non-significant results were reported in Baumeister’s articles. However, all of these non-significant results cluster in the region of marginally significant results (z > 1.65 & z < 1.96) and were interpreted as support for a hypothesis. Thus, all non-confirming evidence remained hidden in a fairly large file-drawer.

It is rare that social psychologists comment on their research practices, but in a personal communication Roy Baumeister confirmed that he has a file-drawer with non-significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

Other social psychologists have a smaller file-drawer. For example, the file-drawer ratio for Susan T. Fiske is only 2.8 : 1, which is only a third of Roy F. Baumeister’s file-drawer. Thus, while publication bias ensures that virtually everybody has a file-drawer, the size of the file-drawer can vary considerably across labs.

File Drawer of Statistical Analysis Rather than Entire Studies

It is unlikely that actual file-drawers are as large as z-curve estimates. Dropping studies with non-significant results is only one of several questionable research practices that can be used to report only significant results. For example, including several dependent variables in a study can help to produce a significant result for a single dependent variable. In this case, most studies can be published. Thus, it is more accurate to think of the file-drawer as being filled with statistical outputs with non-significant results rather than entire studies. This does not reduce the problem of questionable research practices. Undisclosed multiple comparisons within a single data set undermine the replicability of published results just as much as failures to disclose results from a whole study.

Nevertheless, z-curve estimates should not be interpreted too literally. If there were such a thing as a Replicability Bureau of Investigation (RBI), and the RBI raided the labs of a researcher, the actual size of the file-drawer may differ from the z-curve prediction because it is impossible to know which questionable research practices were actually used to report only confirming results. However, the estimated file-drawer provides some information about the credibility of the published results. Estimates below the ratio of 1:1 suggest that the data are credible. The higher the file-drawer ratio is, the less credible the published results become.

File-Drawer Estimates and Open Science 

The main advantage of being able to estimate file-drawers is that it is possible to monitor research practices in psychology labs without the need of an RBI. Reforms such as a priori power calculations, preregistration and more honest reporting of non-significant results should reduce the size of file-drawers. Requirements to share all data ensure open file-drawers. Z-curve can be used to evaluate whether these reforms are actually improving psychological science.

Wagenmakers’ Crusade Against p-values

More than a decade ago, Wagenmakers (2007) started his crusade against p-values. His article “A practical solution to the pervasive problems of p-values” (PPPV) has been cited over 800 times, and it is Wagenmakers’ most cited original article (he also contributed to the OSC, 2015, reproducibility project that has already garnered over 1,000 citations).

In PPPV, Wagenmakers claims that statisticians have identified three fundamental problems of p-values: (a) p-values do not quantify statistical evidence, (b) p-values depend on hypothetical data, and (c) p-values depend on researchers’ unknown intentions.

When I read the article many years ago, statistics was a side-interest for me, and I didn’t fully understand the article. Since the replication crisis started in 2011, I have learned a lot about statistics, and I am ready to share my thoughts about Wagenmakers’ critique of p-values. In short, I think Wagenmakers’ arguments are a load of rubbish and the proposed solution to use Bayesian model comparisons is likely to make matters worse.

P-Values Depend on Hypothetical Data

Most readers of this blog post are familiar with the way p-values are computed. Some data are observed. Based on this observed data, an effect size is estimated. In addition, sampling error is computed either based on sample size alone or based on observed information about the distribution of observations (variance). The ratio of the effect size and the sampling error is used to compute a test statistic. To be clear, the same test statistics are used in frequentist statistics with p-values as in Bayesian statistics. So, any problems that occur during these steps are the same for p-values and Bayesian statistics.

What are the hypothetical data that Wagenmakers sees as a problem?

“These hypothetical data are data expected under H0, without which it is impossible to construct the sampling distribution of the test statistic t(xrep | H0).”

Two things should be immediately obvious. First, the hypothetical data are no more or less hypothetical than the null-hypothesis. The null-hypothesis is hypothetical (hypothesis – hypothetical, see the connection), and based on the null-hypothesis, predictions about the distribution of a test statistic are made. The actual data are then compared to this prediction. There are no hypothetical data. There is a hypothetical distribution and an actual test statistic. Inferences are based on the comparison. Second, the “hypothetical data” that are expected under H0 are also expected in a Bayesian statistical framework because the same sampling distribution is used to compute the Bayesian Information Criterion or a Bayes Factor.

In short, it is easy to see that Wagenmakers’ problem is not a problem at all. Theories and hypotheses are abstractions. To use inferential statistics, the predictions have to be translated into a sampling distribution of a test statistic.

Wagenmakers presents an example from Pratt (1962) in full to drive home his point; and I reproduce this example again in full.

An engineer draws a random sample of electron tubes and measures the plate voltage under certain conditions with a very accurate volt-meter, accurate enough so that measurement error is negligible compared with the variability of the tubes. A statistician examines the measurements, which look normally distributed and vary from 75 to 99 volts with a mean of 87 and a standard deviation of 4. He makes the ordinary normal analysis, giving a confidence interval for the true mean. Later he visits the engineer’s laboratory, and notices that the volt meter used reads only as far as 100, so the population appears to be “censored.” This necessitates a new analysis, if the statistician is orthodox. However, the engineer says he has another meter, equally accurate and reading to 1000 volts, which he would have used if any voltage had been over 100. This is a relief to the orthodox statistician, because it means the population was effectively uncensored after all. But the next day the engineer telephones and says: “I just discovered my high-range volt-meter was not working the day I did the experiment you analyzed for me.” The statistician ascertains that the engineer would not have held up the experiment until the meter was fixed, and informs him that a new analysis will be required. The engineer is astounded. He says: “But the experiment turned out just the same as if the high-range meter had been working. I obtained the precise voltages of my sample anyway, so I learned exactly what I would have learned if the high-range meter had been available. Next you’ll be asking me about my

What is the problem here? Truncating the measure at 100 changes the statistical model. If we have to suspect that the data are truncated, we cannot use a statistical model that assumes a normal distribution. We could use a non-parametric test to get a p-value or a more sophisticated model that models the truncation process. This model would notice that there is little truncation in these hypothetical data because there are actually no values greater than 100.
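To get a feel for how little truncation matters in this example, one can ask how often a voltage above 100 would occur at all. The Python sketch below assumes, purely for illustration, that the tube voltages follow a normal distribution with the sample mean (87) and standard deviation (4) as its parameters.

```python
from statistics import NormalDist

# Assumption for illustration: population parameters equal the sample values.
voltages = NormalDist(mu=87, sigma=4)
meter_limit = 100

z_limit = (meter_limit - voltages.mean) / voltages.stdev   # 3.25 SD above the mean
p_over_limit = 1 - voltages.cdf(meter_limit)               # chance that a tube exceeds 100 volts

print(f"limit is {z_limit:.2f} standard deviations above the mean")
print(f"probability of a voltage above {meter_limit}: {p_over_limit:.5f}")
```

Under this assumption, fewer than 1 in 1,000 tubes would exceed the limit of the meter, so a model that accounts for truncation would give nearly the same answer as the ordinary normal analysis.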

Thus, this example merely illustrates that statistical inferences depend on the proper modeling of the sampling distribution of a test statistic. All statistical inferences are only valid if the assumptions of the statistical model hold. Otherwise, all bets are off. Most important, this is also true for Bayesian statistics because they rely on the same test statistics and distribution assumptions as p-values. There is nothing magical about Bayes Factors that would allow them to produce valid inferences when distribution assumptions are violated.

P-Values Depend on Researchers’ Intentions

The second alleged problem of p-values is that they depend on researchers’ intentions.

“The same data may yield quite different p values, depending on the intention with which the experiment was carried out.”

This fact is illustrated with several examples like this one.

Imagine that you answered 9 out of 12 questions about statistics correctly (if it were possible to say what is correct and what is false), and I wanted to compute the p-value for the hypothesis that you were simply guessing. The two-sided p-value is p = .146 if we assume that the test has 12 questions in total. However, if the plan was to keep asking questions until the third incorrect answer, the p-value is .033.
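Both p-values can be reproduced with exact counting in Python. The sketch assumes that guessing means a 50% chance of a correct answer on each question, and that the second p-value comes from a sampling plan of asking questions until the third incorrect answer, as in Wagenmakers’ (2007) version of this example. The function names are illustrative.

```python
from math import comb


def fixed_n_two_sided_p(correct, n_questions):
    """Two-sided binomial p-value for guessing (p = .5) with a fixed number of questions.

    Doubling the upper tail is appropriate here because the distribution is
    symmetric and the observed score is above half of the questions.
    """
    upper_tail = sum(comb(n_questions, k) for k in range(correct, n_questions + 1))
    return min(1.0, 2 * upper_tail / 2 ** n_questions)


def stop_at_error_p(n_asked, n_errors):
    """P(the n_errors-th error occurs at question n_asked or later) under guessing.

    Equivalent to observing at most n_errors - 1 errors in the first n_asked - 1 questions.
    """
    favorable = sum(comb(n_asked - 1, j) for j in range(n_errors))
    return favorable / 2 ** (n_asked - 1)


p_fixed = fixed_n_two_sided_p(correct=9, n_questions=12)
p_stop = stop_at_error_p(n_asked=12, n_errors=3)

print(f"fixed number of questions: p = {p_fixed:.3f}")
print(f"stop at the third error:   p = {p_stop:.3f}")
```

The data are identical in both cases; only the assumed sampling plan, and therefore the statistical model, differs.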

Since 2011, it has been well known that data peeking alters the statistical model and that optional stopping alters p-values. If the decision to terminate data collection was in any way systematically influenced by some previous results, a p-value that assumes no data peeking occurred is wrong because it is based on the wrong statistical model. Undisclosed checking of data is now known as a questionable research practice (John et al., 2012). Thus, Wagenmakers’ example merely shows that p-values cannot be trusted when researchers engaged in questionable research practices. It does not show that p-values are inherently flawed.
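A small simulation shows how undisclosed peeking changes the type-I error rate. The Python sketch below is an illustration under simple assumptions: the null-hypothesis is true, observations are normal with a known standard deviation of 1, a z-test is computed after every batch of 10 observations, and data collection stops at the first result with |z| > 1.96 or after 100 observations. All names, the number of looks, and the seed are arbitrary choices for this example.

```python
import random
from math import sqrt


def false_positive_rate(n_sims, looks, batch, crit=1.96, seed=1):
    """Share of simulated null studies that reach |z| > crit at any look."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for _ in range(looks):
            for _ in range(batch):
                total += rng.gauss(0.0, 1.0)   # true effect is zero
            n += batch
            z = total / sqrt(n)                # z-test with known SD = 1
            if abs(z) > crit:
                hits += 1                      # stop and report a significant result
                break
    return hits / n_sims


fixed = false_positive_rate(4000, looks=1, batch=100)    # one test at N = 100
peeking = false_positive_rate(4000, looks=10, batch=10)  # test after every 10 observations

print(f"single test at N = 100: {fixed:.1%} false positives")
print(f"peeking 10 times:       {peeking:.1%} false positives")
```

With a single test, the false positive rate stays close to the nominal 5%. With ten looks at the data, it is several times higher, which is why a p-value that ignores the peeking is based on the wrong model.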

How do Bayesian statistics avoid this problem? They avoid the problem only partially. Bayes Factors always express information as a comparison between two models. As long as researchers peek at the data and continue because the data do not favor either model, data peeking does not introduce a bias. However, if they peek and continue data collection until the data favor one model, Bayesian statistics would be just as biased by data peeking as the use of p-values. Even data peeking with inconclusive data can be biased if one of the models is implausible and would never receive support. In this case, the data can only produce evidence for one model or be undecided, which leads to the same problem that Wagenmakers sees with p-values. For example, testing the null-hypothesis against Wagenmakers’ prior that assumes large effects of 1 SD or more would eventually produce evidence for the null-hypothesis, even if it were false, because the data can never produce support for the implausible alternative hypothesis.

In conclusion, the second argument is a good reason for preregistration and against the use of questionable research practices, but not a good argument against p-values.

P Values Do Not Quantify Statistical Evidence

The third claim is probably the most surprising for users of p-values. The main reason for computing p-values is that they are considered to be a common metric that can be used across different types of studies. Everything else being equal, a lower p-value is assumed to provide stronger evidence against the null-hypothesis.

In the Fisherian framework of statistical hypothesis testing, a p value is meant to indicate “the strength of the evidence against the hypothesis” (Fisher, 1958, p. 80).

What are the chances that all textbook writers got this wrong?

To make his point, Wagenmakers uses the ambiguity of everyday language and decides that “the most common and well-worked-out definition is the Bayesian definition”

Nobody is surprised that p-values do not provide evidence given a Bayesian definition of evidence, just like nobody would be surprised that Bayes Factors do not provide information about the long-run probability of false positive discoveries.

What is surprising is that Wagenmakers provides no argument. Instead, he reviews some surveys of statisticians and psychologists that examined the influence of sample size on the evaluation of identical p-values.

For example, which study produces stronger evidence against the null-hypothesis: a study with N = 300 and p = .01, or a study with N = 30 and p = .01? Most statisticians favor the larger study. A quick survey in the Psychological Methods Discussion Group confirmed this finding: 37 respondents favored the larger sample, 7 said no difference, and 4 favored the smaller sample.

Although this is interesting, it does not answer the question whether a p-value of .0001 provides stronger evidence against the null-hypothesis than a p-value of .10, which is the question at hand.

So, Wagenmakers’ strongest argument against p-values, that they are misinterpreted as a measure of strength of evidence, is not an argument at all.

In short, Wagenmakers has been successful in casting doubt about the use of p-values amongst psychologists. He was able to do so because statistics training in psychology is poor and most users of p-values have only a vague understanding of the underlying statistical theory. As a result, they are swayed by strong claims that they cannot evaluate. It took me some time, and time away from my original research, to understand these issues. In my opinion, Wagenmakers’ critique falls apart under closer scrutiny.

The main problem of p-values is that they are not Bayesian, but that is only a problem if you like Bayesian statistics. For most practical purposes, p-values and Bayes-Factors lead to the same conclusions regarding the rejection of the null-hypothesis. In addition, Bayes-Factors offer the promise that they can provide evidence for the nil-hypothesis. This promise is false, but that is the topic of another blog post.

The real problem in psychological science is not the use of p-values, but the abuse of p-values. That is, a study with N = 30 participants and p = .01 would produce just as much evidence as a study with N = 300 and p = .01, if we did not have to worry that the researcher with N = 30 also ran 300 participants, but only presented the results of the one study that produced a significant result by chance. For this reason, I have invested my time and energy in studying the real power of studies to produce significant results and in detecting the use of questionable research practices. It does not matter to me whether effect size estimates and sampling error are reported as confidence intervals, converted into p-values, or reported as Bayes Factors. What matters is that the results are credible and strong claims are supported by strong evidence, no matter how it is reported.

Related blog Posts

Why Wagenmakers is wrong (about Bayesian Analysis of Bem, 2011)

Wagenmakers’ Prior is Inconsistent with Empirical Results in Psychology

Confidence Intervals are More Informative than Bayes Factors

The Bayesian Mixture Model Does Not Estimate the False Positive Rate

Wagenmakers Confuses Evidence Against H1 with Evidence For H0

2018 Journal Replicability Rankings

This table shows the Replicability Rankings for 117 psychology journals.

Journals are ranked based on the replicability estimates for the year 2018.

Replicability estimates are obtained from z-curve analyses of automatically extracted test statistics. If you click on the journal name, you can see plots of the z-curve distributions for the years 2010-2018.

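Rank | Journal | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010
1 | European Journal of Developmental Psychology | 89 | 86 | 83 | 63 | 73 | 75 | 78 | 78 | 67
2 | Journal of Cognition and Development | 89 | 74 | 77 | 67 | 65 | 68 | 55 | 66 | 69
3 | Political Psychology | 88 | 75 | 78 | 74 | 70 | 71 | 73 | 43 | 66
4 | Social Development | 84 | 72 | 78 | 62 | 74 | 71 | 71 | 73 | 72
5 | Social Psychology | 84 | 74 | 74 | 72 | 74 | 70 | 64 | 76 | 72
6 | Depression & Anxiety | 83 | 75 | 78 | 70 | 73 | 80 | 80 | 89 | 86
7 | Journal of Counseling Psychology | 83 | 69 | 78 | 77 | 70 | 78 | 77 | 62 | 82
8 | Personal Relationships | 83 | 76 | 71 | 70 | 69 | 65 | 70 | 58 | 66
9 | Sex Roles | 83 | 81 | 80 | 73 | 72 | 76 | 79 | 73 | 73
10 | Journal of Occupational and Organizational Psychology | 82 | 72 | 82 | 79 | 70 | 74 | 77 | 70 | 63
11 | Cognitive Psychology | 81 | 75 | 80 | 72 | 75 | 77 | 71 | 81 | 74
12 | Epilepsy & Behavior | 81 | 82 | 81 | 79 | 85 | 84 | 79 | 89 | 76
13 | Experimental Psychology | 81 | 74 | 72 | 71 | 76 | 72 | 74 | 72 | 69
14 | Journal of Consumer Behaviour | 81 | 69 | 79 | 75 | 73 | 81 | 73 | 83 | 79
15 | Journal of Health Psychology | 81 | 63 | 71 | 79 | 78 | 80 | 76 | 63 | 71
16 | Journal of Pain | 81 | 67 | 77 | 72 | 80 | 72 | 77 | 73 | 70
17 | Law and Human Behavior | 81 | 75 | 76 | 69 | 60 | 74 | 76 | 83 | 72
18 | Psychology of Religion and Spirituality | 81 | 71 | 80 | 80 | 75 | 70 | 55 | 73 | 75
19 | Social Psychological and Personality Science | 81 | 76 | 65 | 60 | 64 | 61 | 57 | 65 | 54
20 | Evolution & Human Behavior | 80 | 73 | 80 | 75 | 75 | 62 | 64 | 69 | 62
21 | Journal of Personality | 80 | 77 | 73 | 68 | 72 | 69 | 72 | 60 | 66
22 | JPSP-Attitudes & Social Cognition | 80 | 79 | 55 | 74 | 69 | 49 | 61 | 61 | 60
23 | Journal of Vocational Behavior | 80 | 74 | 85 | 83 | 65 | 83 | 79 | 85 | 77
24 | Memory and Cognition | 80 | 80 | 74 | 79 | 76 | 76 | 79 | 76 | 77
25 | Attention, Perception and Psychophysics | 79 | 79 | 70 | 73 | 76 | 77 | 80 | 74 | 73
26 | Consciousness and Cognition | 79 | 78 | 69 | 69 | 74 | 66 | 70 | 73 | 73
27 | Journal of Cognitive Psychology | 79 | 75 | 77 | 74 | 77 | 72 | 73 | 79 | 85
28 | Journal of Educational Psychology | 79 | 78 | 72 | 67 | 75 | 74 | 76 | 77 | 83
29 | Journal of Nonverbal Behavior | 79 | 89 | 73 | 63 | 72 | 76 | 71 | 63 | 64
30 | Journal of Research in Personality | 79 | 78 | 76 | 81 | 77 | 76 | 70 | 72 | 68
31 | Psychophysiology | 79 | 78 | 78 | 70 | 72 | 68 | 71 | 77 | 78
32 | Quarterly Journal of Experimental Psychology | 79 | 76 | 76 | 75 | 74 | 73 | 76 | 75 | 72
33 | Aggressive Behavior | 78 | 72 | 77 | 67 | 70 | 60 | 69 | 79 | 68
34 | Evolutionary Psychology | 78 | 78 | 82 | 77 | 76 | 81 | 73 | 80 | 69
35 | Health Psychology | 78 | 70 | 61 | 66 | 66 | 67 | 59 | 69 | 68
36 | J. of Exp. Psychology – Human Perception and Performance | 78 | 76 | 77 | 75 | 75 | 75 | 77 | 78 | 76
37 | J. of Exp. Psychology – Learning, Memory & Cognition | 78 | 79 | 78 | 77 | 81 | 74 | 76 | 71 | 80
38 | Psychonomic Bulletin and Review | 78 | 75 | 77 | 82 | 78 | 83 | 71 | 70 | 78
39 | British Journal of Psychology | 77 | 77 | 77 | 82 | 75 | 71 | 78 | 79 | 69
40 | British Journal of Developmental Psychology | 77 | 71 | 76 | 74 | 64 | 67 | 85 | 77 | 77
41 | Journal of Cross-Cultural Psychology | 77 | 75 | 75 | 80 | 77 | 80 | 71 | 77 | 77
42 | Journal of Experimental Psychology – General | 77 | 77 | 74 | 74 | 72 | 74 | 66 | 73 | 68
43 | Journal of Family Psychology | 77 | 69 | 62 | 72 | 71 | 70 | 64 | 67 | 68
44 | Journal of Memory and Language | 77 | 80 | 82 | 79 | 74 | 75 | 71 | 79 | 73
45 | JPSP-Personality Processes and Individual Differences | 77 | 65 | 74 | 71 | 73 | 65 | 68 | 70 | 61
46 | Personality and Individual Differences | 77 | 76 | 74 | 77 | 77 | 76 | 73 | 71 | 70
47 | Appetite | 76 | 77 | 71 | 64 | 66 | 73 | 71 | 72 | 73
48 | Cognition | 76 | 76 | 73 | 72 | 74 | 76 | 74 | 72 | 72
49 | European Journal of Personality | 76 | 76 | 79 | 68 | 81 | 67 | 67 | 70 | 79
50 | Journal of Anxiety Disorders | 76 | 79 | 73 | 69 | 76 | 75 | 78 | 71 | 74
51 | Journal of Occupational Health Psychology | 76 | 80 | 73 | 72 | 73 | 54 | 75 | 79 | 71
52 | Cognition and Emotion | 75 | 65 | 68 | 74 | 73 | 85 | 85 | 81 | 81
53 | Journal of Affective Disorders | 75 | 75 | 84 | 85 | 77 | 84 | 78 | 72 | 71
54 | Journal of Child and Family Studies | 75 | 73 | 72 | 69 | 68 | 74 | 73 | 74 | 73
55 | Journal of Experimental Social Psychology | 75 | 71 | 67 | 62 | 61 | 56 | 54 | 57 | 55
56 | Journal of Social and Personal Relationships | 75 | 71 | 84 | 59 | 57 | 69 | 61 | 78 | 82
57 | Psychological Science | 75 | 71 | 68 | 69 | 65 | 65 | 63 | 61 | 61
58 | Cognitive Therapy and Research | 74 | 75 | 70 | 71 | 61 | 75 | 74 | 67 | 65
59 | Frontiers in Psychology | 74 | 76 | 74 | 73 | 73 | 72 | 72 | 68 | 82
60 | Journal of Applied Social Psychology | 74 | 71 | 79 | 67 | 72 | 69 | 77 | 71 | 75
61 | Journal of Religion and Health | 74 | 74 | 85 | 80 | 76 | 76 | 89 | 80 | 68
62 | Psychological Medicine | 74 | 73 | 82 | 67 | 75 | 78 | 66 | 77 | 72
63 | Animal Behavior | 73 | 77 | 71 | 69 | 70 | 71 | 71 | 70 | 75
64 | Child Development | 73 | 66 | 73 | 73 | 68 | 69 | 74 | 71 | 73
65 | Cognitive Development | 73 | 80 | 74 | 82 | 71 | 71 | 74 | 69 | 63
66 | Developmental Psychology | 73 | 75 | 75 | 74 | 75 | 72 | 67 | 68 | 66
67 | Emotion | 73 | 73 | 71 | 68 | 69 | 72 | 68 | 68 | 73
68 | Frontiers in Human Neuroscience | 73 | 70 | 74 | 73 | 74 | 76 | 78 | 76 | 72
69 | Judgment and Decision Making | 73 | 81 | 78 | 76 | 77 | 68 | 73 | 70 | 71
70 | Journal of Experimental Child Psychology | 73 | 72 | 71 | 77 | 75 | 72 | 72 | 71 | 74
71 | Journal of Social Psychology | 73 | 75 | 73 | 70 | 65 | 62 | 77 | 71 | 75
72 | Memory | 73 | 74 | 79 | 67 | 87 | 76 | 77 | 84 | 88
73 | Perception | 73 | 75 | 76 | 78 | 73 | 79 | 82 | 89 | 93
74 | Annals of Behavioral Medicine | 72 | 70 | 73 | 63 | 70 | 75 | 77 | 72 | 72
75 | Archives of Sexual Behavior | 72 | 78 | 78 | 79 | 75 | 81 | 78 | 76 | 87
76 | Frontiers in Behavioral Neuroscience | 72 | 74 | 70 | 70 | 67 | 70 | 72 | 70 | 67
77 | International Journal of Psychophysiology | 72 | 74 | 64 | 70 | 67 | 62 | 71 | 70 | 65
78 | Psychology and Aging | 72 | 78 | 80 | 76 | 81 | 71 | 77 | 76 | 75
79 | Behaviour Research and Therapy | 71 | 70 | 72 | 75 | 76 | 72 | 77 | 66 | 69
80 | Journal of Organizational Psychology | 71 | 73 | 71 | 66 | 73 | 62 | 72 | 66 | 75
81 | Journal of Positive Psychology | 71 | 81 | 69 | 72 | 74 | 62 | 67 | 63 | 73
82 | JPSP-Interpersonal Relationships and Group Processes | 71 | 68 | 73 | 64 | 61 | 62 | 56 | 61 | 54
83 | Organizational Behavior and Human Decision Processes | 71 | 68 | 72 | 69 | 69 | 72 | 69 | 71 | 63
84 | Personality Disorders | 71 | 87 | 64 | 63 | 72 | 77 | 52 | 55 | 84
85 | Personality and Social Psychology Bulletin | 71 | 73 | 69 | 64 | 64 | 60 | 59 | 61 | 62
86 | Acta Psychologica | 70 | 77 | 73 | 73 | 76 | 74 | 74 | 76 | 74
87 | British Journal of Social Psychology | 70 | 78 | 63 | 67 | 61 | 63 | 59 | 70 | 63
88 | Hormones & Behavior | 70 | 61 | 63 | 62 | 62 | 62 | 61 | 66 | 63
89 | Journal of Abnormal Psychology | 70 | 69 | 64 | 63 | 65 | 69 | 66 | 73 | 70
90 | Journal of Consulting and Clinical Psychology | 70 | 77 | 61 | 66 | 65 | 62 | 65 | 66 | 65
91 | Journal of Experimental Psychology – Applied | 70 | 80 | 69 | 68 | 72 | 65 | 75 | 70 | 71
92 | Journal of Happiness Studies | 70 | 56 | 79 | 78 | 78 | 80 | 77 | 88 | 77
93 | Behavioural Brain Research | 69 | 71 | 68 | 74 | 67 | 70 | 71 | 71 | 72
94 | Cognitive Behavioral Therapy | 69 | 75 | 80 | 76 | 62 | 70 | 80 | 72 | 62
95 | Journal of Applied Psychology | 69 | 79 | 80 | 70 | 74 | 69 | 73 | 69 | 71
96 | Journal of Autism and Developmental Disorders | 69 | 71 | 72 | 70 | 65 | 72 | 67 | 67 | 70
97 | Psychology of Music | 69 | 80 | 79 | 72 | 73 | 75 | 72 | 82 | 87
98 | Biological Psychology | 68 | 63 | 66 | 70 | 66 | 66 | 61 | 70 | 70
99 | Developmental Science | 68 | 73 | 67 | 69 | 65 | 71 | 67 | 68 | 67
100 | Journal of Comparative Psychology | 68 | 66 | 75 | 75 | 79 | 80 | 71 | 68 | 62
101 | Psychology and Marketing | 68 | 70 | 70 | 65 | 76 | 65 | 71 | 63 | 71
102 | Psychoneuroendocrinology | 68 | 65 | 66 | 63 | 63 | 64 | 62 | 64 | 61
103 | Psychopharmacology | 68 | 74 | 75 | 74 | 71 | 73 | 75 | 71 | 71
104 | Behavior Therapy | 67 | 71 | 69 | 71 | 74 | 74 | 75 | 63 | 77
105 | Developmental Psychobiology | 67 | 63 | 66 | 65 | 69 | 70 | 70 | 71 | 64
106 | Journal of Consumer Psychology | 66 | 56 | 53 | 67 | 66 | 64 | 59 | 59 | 64
107 | Journal of Consumer Research | 66 | 64 | 63 | 51 | 63 | 48 | 61 | 60 | 64
108 | Journal of Individual Differences | 65 | 82 | 65 | 74 | 63 | 86 | 55 | 91 | 70
109 | Journal of Youth and Adolescence | 65 | 70 | 84 | 77 | 81 | 76 | 74 | 74 | 75
110 | European Journal of Social Psychology | 64 | 73 | 76 | 64 | 71 | 67 | 56 | 68 | 66
111 | Group Processes & Intergroup Relations | 64 | 68 | 66 | 70 | 67 | 69 | 65 | 67 | 59
112 | Journal of Research on Adolescence | 62 | 67 | 71 | 67 | 64 | 72 | 74 | 78 | 67
113 | Journal of Child Psychology and Psychiatry and Allied Disciplines | 61 | 68 | 67 | 67 | 62 | 68 | 71 | 57 | 62
114 | Motivation and Emotion | 61 | 72 | 63 | 66 | 64 | 66 | 63 | 81 | 67
115 | Infancy | 59 | 60 | 61 | 61 | 66 | 67 | 63 | 71 | 53
116 | Behavioral Neuroscience | 57 | 73 | 68 | 70 | 70 | 68 | 70 | 66 | 72
117 | Self and Identity | 57 | 68 | 68 | 57 | 72 | 71 | 72 | 70 | 73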

Social Psychology Textbook audiT: Ego Depletion

Since 2011, social psychology has been in a crisis of confidence. Many published results were obtained with questionable research practices (QRPs) and failed to replicate. The Open Science Collaboration found that only 25% of social psychological results could be successfully replicated (OSC, 2015).

One of the biggest scandals in social psychology is the ego-depletion literature. The main assumption of ego-depletion theory is that working on a cognitively demanding task lowers individuals’ ability to do well on a second demanding task.

A meta-analysis in 2010 seemed to show that ego-depletion effects in laboratory studies are robust and have a moderate effect size (d = .5). However, this meta-analysis did not control for the influence of questionable research practices. A subsequent meta-analysis did take QRPs into account and found no evidence for the effect.

This meta-analysis triggered a crisis of confidence in the ego-depletion effect and an initiative to investigate ego-depletion in a massive replication attempt. The outcome of this major replication study confirmed the finding of the second meta-analysis. There was no evidence for an ego-depletion effect, despite the massive statistical power to detect even a small effect (d = .2) (Hagger et al., 2016).

There have been different responses to the replication failure. The inventor of ego-depletion theory, Roy F. Baumeister, blames the design of the replication study for the replication failure (cf. Drummond & Philipp, 2017). However, others, including myself (Schimmack, 2016), pointed out that Baumeister and colleagues used QRPs in their original studies and therefore do not provide credible evidence for the effect. Some ego-depletion researchers, like Michael Inzlicht (pdf), openly expressed concern that ego-depletion may not be real.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)  [Schimmack, 2014]

Social psychology textbooks responded differently to these developments in ego-depletion research.

Gilovich et al. (2019, 5ed) simply removed ego-depletion from their textbook, whereas the 3ed (2013) had covered ego-depletion, including the even more controversial claim that links ego-depletion to blood glucose levels, which was also obtained with QRPs (Schimmack, 2012).

In contrast, Myers and Twenge (2018, 13ed) continue to cover ego-depletion without mentioning any replication failures or concerns about the robustness of the evidence.

Neither treatment of the doubts about ego-depletion is acceptable. Simply removing ego-depletion misses the opportunity to teach students to think critically about social psychological research, which probably is the point. However, presenting ego-depletion without any mention of the replication failures is even worse. Examples like this show that social psychologists are unwilling to be open about the recent developments in their field and that students cannot trust social psychology textbooks to provide a balanced and scientific introduction to the field.

An Introduction to Anti-Social Psychology

Social psychology textbooks aim to inform students about social psychology. However, the authors also want to promote social psychology. As a result, they present social psychology in the most favorable light. The results are textbooks that hide embarrassing facts (most textbook findings probably do not replicate) and do not allow students to think critically about social psychology.

Ideally, somebody would publish a balanced and objective textbook. This blog post has a different aim. It introduces students of social psychology to critical evidence that is missing from most social psychology textbooks.

The evidence presented in the “anti-social” textbook may be biased against social psychology, but it provides students with some information that they can use to make up their own mind about social psychology.

The aim is to use a Hegelian approach of teaching, where the textbook provides a thesis (e.g., self-perception theory explains how individuals form attitudes), the anti-social textbook provides the anti-thesis (self-perception theory is an outdated attempt to explain attitudes from the perspective of radical behaviorism), and then students can synthesize the conflicting claims into their own perspective on social psychology.

Affective Misattribution Paradigm (AMP)

Culture of Honor (Northern vs. Southern US States)

Ease of Retrieval

Ego Depletion

Implicit Association Test (IAT)

Priming (Subliminal Priming; Unconscious Processes)

Replication, Replicability, Replication Outcomes in Social Psychology

Self-Knowledge (Accuracy and Biases)

Stereotype Threat


Self-Perception Theory

Terror-Management Theory

Social Psychology Textbook audit: The (In)Accuracy of Self-Knowledge

Gilovich, Keltner, Chen, & Nisbett, 2019, Social Psychology (5ed), p. 65-66

To understand social psychologists’ claims about accuracy of self-knowledge, it is important to be aware of the person-situation debate in the 1970s. On the one hand, social psychologists maintained that self-concepts are largely illusory and do not predict behavior. On the other hand, personality psychologists assumed that people have some accurate self-knowledge about stable personality dispositions that influence their behavior.

It is also important to know that textbook author Nisbett was actively engaged in the person-situation controversy. It is therefore not surprising that the textbook chapter about accuracy in self-knowledge is strongly biased and fails to mention decades of research that has demonstrated convergent validity of self-ratings and informant ratings of personality (e.g., Connelly & Ones, 2010, for a meta-analysis).

Instead, students are given the impression that self-knowledge is rather poor.

Recall the research described in Chapter 1 in which Nisbett and Wilson (1977) discovered that people can readily provide explanations for their behaviors that are not in fact accurate. Someone might say that she picked her favorite nightgown because of its texture or color, when in fact she picked it out because it was the last one she saw. Even our ability to report accurately on more important decisions – such as why we chose job candidate A over job candidate B, why we like Joe better than Jack, or how we solved a particular problem – can be wide of the mark (Nisbett & Wilson, 1977).

…much of the time, we draw inaccurate conclusions about the self because we don’t have access to certain mental processes, such as those that lead us to prefer objects we looked at last (Wilson, 2002, Wilson & Dunn, 2004).

… such mental processes are nonconscious, occurring outside of our awareness, leaving us to generate alternative, plausible accounts for our preferences and behaviors instead.

Given such roadblocks, how can a person gain accurate self-knowledge?

The textbook doesn’t provide an answer to its own question. One possible answer could have been that it doesn’t require introspection to know oneself. Later the textbook introduces self-perception theory, which states that we can know ourselves like we know other people by observing ourselves and making attributions about our behaviors. For example, if I regularly order vanilla ice cream rather than chocolate ice cream, I can infer that I have a preference for vanilla; I do not need to know why I have a preference for vanilla (e.g., my mother gave me formula as a baby with vanilla flavor).

In any case, Nisbett’s musings about limitations of introspection fail to explain how people acquire accurate self-knowledge about their personality, values, happiness, and past behaviors, nor do they cite relevant studies by personality psychologists.

The section on accuracy of self-knowledge ends with studies by Vazire (Vazire & Mehl, 2008; Vazire, 2010; Vazire & Carlson, 2011). These studies show that, on average, self-ratings and informant ratings are equally good predictors of an objective criterion of behavior. They also suggest that the self is better able to make judgments about internal states.

“Because we have greater information than others do about our internal states (such as our internal thoughts and feelings), we are better judges of our internal traits (being optimistic or pessimistic, for instance).”

Students may be a little bit confused by the earlier claims that introspection often leads us astray and the concluding statement that the self is most accurate in judging internal states. Apparently, introspection does provide some valuable information that can be used to know oneself.

In conclusion, social psychologists have ignored accuracy in self-knowledge because they were more interested in demonstrating biases and errors in human information processing. The textbook is stuck in some old studies on limits of introspection and does not review decades of research on accuracy of self-knowledge (e.g., Funder, 1995). To learn about accuracy of self-knowledge, students are better off taking a course on personality psychology.

Auditing Social Psychology Textbooks: Hitler had High Self-Esteem

Social psychologists see themselves as psychological “scientists,”  who study people and therefore believe that they know people better than you or me. However, often their claims are not based on credible scientific evidence and are merely personal opinions disguised as science.

For example, a popular undergraduate psychology textbook claims that

Hitler had high self-esteem.

quoting an article in the journal “Psychological Science in the Public Interest” that has been cited over 500 times. At the end of the article with the title “Does High Self-Esteem Cause Better Performance, Interpersonal Success, Happiness, or Healthier Lifestyles?” the authors write:

“High self-esteem feels good and fosters initiative. It may still prove a useful tool to promote success and virtue, but it should be clearly and explicitly linked to desirable behavior. After all, Hitler had very high self-esteem and plenty of initiative, too, but those were hardly guarantees of ethical behavior.”

In the textbook this quote is linked to boys who engage in sex at an “inappropriately young age,” which is not further specified (in Canada this would be 14, according to recent statistics).

“High self-esteem does have some benefits—it fosters initiative, resilience, and pleasant feelings (Baumeister & others, 2003). Yet teen males who engage in sexual activity at an “inappropriately young age” tend to have higher than average self-esteem. So do teen gang leaders, extreme ethnocentrists, terrorists, and men in prison for committing violent crimes (Bushman & Baumeister, 2002; Dawes, 1994, 1998). “Hitler had very high self-esteem,” note Baumeister and co-authors (2003).”  (Myers, 2011, Social Psychology, 12th edition)

Undergraduate students pay a lot of money to be informed that people with high self-esteem are like sexual deviants, terrorists, violent criminals, and Hitler (maybe we should add scientists with big claims and small samples to the list).

The problem is that this is not even true. Students who work with me on fact checking the textbook found this quote in the original article.

“There was no [!] significant difference in self-esteem scores between violent offenders and non-offenders, Ms = 28.90 and 28.89, respectively, t(7653) = 0.02, p > .9, d = 0.0001.”

Although the df of the t-test look impressive, the study compared 63 violent offenders to 7590 unmatched, mostly undergraduate student (gender not specified, probably mostly female) participants. So the sampling error of this study is high and the theoretical importance of comparing these two groups is questionable.
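To make the point about sampling error concrete, here is a minimal sketch that uses the standard large-sample approximation for the standard error of Cohen's d with two independent groups. The group sizes and the effect size are taken from the text above; the function name and the balanced comparison design are hypothetical and only serve as illustration.

```python
import math


def se_cohens_d(n1: int, n2: int, d: float = 0.0) -> float:
    """Approximate standard error of Cohen's d for two independent groups.

    Uses the common large-sample formula
    Var(d) ~ (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)).
    """
    return math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))


# Group sizes reported in the text: 63 violent offenders vs. 7590 others.
se_unbalanced = se_cohens_d(63, 7590, d=0.0001)
ci_half_width = 1.96 * se_unbalanced

# Hypothetical comparison: the same total N split into two equal groups.
se_balanced = se_cohens_d(3826, 3827)

print(f"SE of d with 63 vs. 7590: {se_unbalanced:.3f}")
print(f"95% CI half-width:        {ci_half_width:.3f}")
print(f"SE of d with equal groups: {se_balanced:.3f}")
```

Under these assumptions the precision is driven almost entirely by the small group: the confidence interval around d = 0 extends to roughly a quarter of a standard deviation in either direction, so the large df do not imply a precise estimate.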

[The latest edition (13th, 2018) still contains the quote.]

How Many Correct Citations Could be False Positives?  

Of course, the example above is an exception. Most of the time a cited reference contains an empirical finding that is consistent with the textbook claim. However, this does not mean that textbook findings are based on credible and replicable evidence. Until recently it was common to assume that statistical significance ensures that most published results are true positives (i.e., not false positive random findings). However, this is only the case if all results are reported. It has been known since 1959 that this is not the case in psychology (Sterling, 1959). Jerry Brunner and I developed a statistical tool that can be used to clean up the existing literature. Rather than actually redoing 50 years of research, we use the statistical results reported in original studies to apply a significance filter post-hoc. Our tool is called zcurve. Below I used zcurve to examine the replicability of studies that were used in chapter 2 about the self.


More detailed information about the interpretation of the graph above is provided elsewhere (link). In short, for each citation in the textbook chapter that is used as evidence for a claim, a team of undergraduate students retrieved the cited article and extracted the main statistical result that matches the textbook claim. These statistical results are then converted into z-scores that reflect the strength of evidence for a claim. Only significant results are important because non-significant results cannot support an empirical claim. Zcurve fits a model to the (density) distribution of significant z-scores (z-scores > 1.96). The shape of the density distribution provides information about the probability that a randomly drawn study from the set would replicate (i.e., reproduce a significant result). The grey line shows the distribution predicted by zcurve. It matches the observed density in dark blue well. Simulation studies show good performance of zcurve.

Zcurve estimates that the average replicability of studies in this chapter is 56%. This number would be reassuring if all studies had 56% power. This would mean that all studies are true positives and that, if the studies were replicated, roughly every other replication would be successful. However, reality does not match this rosy scenario. In reality, studies vary in replicability. Studies with z-scores greater than 5 have 99% replicability (see numbers below x-axis). However, studies with just significant results (z < 2.5) have only 21% replicability. As you can see, there are a lot more studies with z < 2.5 than studies with z > 5. So there are more studies with low replicability than studies with high replicability.

The next plot shows model fit (higher numbers = worse fit) for zcurve models with a fixed proportion of false positives. If the data are inconsistent with a fixed proportion of false positives, model fit gets worse (higher numbers).
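The conversion of reported results into z-scores and the link between a z-score and the probability of a significant result can be sketched as follows. This is not the zcurve algorithm, which fits a mixture model to the whole distribution of significant z-scores; it only illustrates the two building blocks under the assumption of a two-sided z-test with alpha = .05. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.05 / 2)  # two-sided criterion, about 1.96


def p_to_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return STD_NORMAL.inv_cdf(1 - p_two_sided / 2)


def power_from_ncp(ncp: float, z_crit: float = Z_CRIT) -> float:
    """Probability of a significant two-sided z-test when the test
    statistic is normally distributed around a true mean of `ncp`."""
    upper = 1 - STD_NORMAL.cdf(z_crit - ncp)
    lower = STD_NORMAL.cdf(-z_crit - ncp)
    return upper + lower


for p in (0.05, 0.01, 0.001):
    print(f"p = {p:<5} -> z = {p_to_z(p):.2f}")

for ncp in (0.0, 1.96, 2.8, 5.0):
    print(f"true mean z = {ncp:.2f} -> power = {power_from_ncp(ncp):.2f}")
```

Note that plugging an observed z-score into `power_from_ncp` yields observed power, which is inflated when only significant results are selected. Correcting for that selection is the reason a model has to be fitted to the distribution of significant z-scores rather than applying this formula study by study.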


The graph shows that models with 100%, 90% or 80% false positives clearly do not fit the data as well as models with fewer false positives. This shows that some textbook claims are based on solid empirical evidence. However, model fit for models with 0% to 60% false positives looks very similar. Thus, it is possible that the majority of claims in the self chapter of this textbook are false positives. It is even more problematic that textbook claims are often based on a single study with a student sample at one university. Social psychologists have warned repeatedly that their findings are very sensitive to minute variations in studies, which makes it difficult to replicate these effects even under very similar conditions (Van Bavel et al., 2016), and that it is impossible to reproduce exactly the same experimental conditions (Stroebe and Strack, 2014). Thus, the zcurve estimate of 56% replicability is a wildly optimistic estimate of replicability in actual replication studies. In fact, the average replicability of studies in social psychology is only 25% (Open Science Collaboration, 2015).


Social psychology textbooks present many findings as if they are established facts, when this is not the case. It is time to audit psychology textbooks to ensure that students receive accurate scientific information to inform their beliefs about human behavior. Ideally, textbook authors will revise their textbooks to make them more scientific and instructors will choose textbooks based on the credibility of the evidence in textbooks.

Social Psych Textbook AudiT: The Affective Misattribution Paradigm

The Affective Misattribution Paradigm (AMP) is a popular measure of attitudes. The main promise of this measure is that it measures implicit attitudes, where the term “implicit” suggests that participants are (a) not aware of their attitude, (b) not aware that their attitude is being measured, or (c) aware that their attitude is being measured, but unable to control (fake) their responses.

The picture below illustrates the basic principle of the AMP. The critical attitude object is the picture of a tropical beach. The aim is to measure your attitude towards tropical beaches. However, the task is presented as a task to evaluate the Chinese character and to ignore the tropical beach. The problem for participants is that the Chinese character elicits no strong emotion (unless you are Chinese and know that the character means death), while the picture of the beach elicits a positive emotional response for many participants. The proposition of the AMP is that participants involuntarily rely on their emotional response to the tropical beach (called the prime) to judge the character (called the target).

An alternative version of the AMP would present the prime subliminally; that is, the presentation would be so short and masked with another image that participants cannot identify the prime picture. This would make the AMP an implicit measure of attitudes without awareness of the true source of the evaluation. However, subliminal presentations are not reliable enough to measure attitudes (Payne, 2017).

In the standard AMP, the prime stimuli are presented in plain view, so participants are clearly aware of the prime. The question is whether they are aware that their emotional response is elicited by the prime rather than the target. Maybe they are simply too lazy to bother controlling their attention or response, which is more effortful than simply reporting the emotional response that was elicited.

Three studies have provided evidence that participants are aware of the true source of their feelings and can control their responses.

Bar-Anan and Nosek (2012) asked participants how they made their responses and found priming effects for participants who stated that their responses were guided by the prime rather than the target.

Teige-Mocigemba, Penzl, Becker, Henn, and Klauer (2015) simply instructed participants to respond opposite to their emotional responses and found that participants were able to do so.

Hazlett and Berinsky (2018) gave students small monetary incentives to control their automatic emotional responses to primes. The key finding was that providing a monetary incentive further decreased the influence of primes on participants’ responses over a simple instruction to ignore the primes. This supports the motivation hypothesis that participants are able to control their responses, but lack motivation to do so unless there is a reason for it.

In conclusion, the AMP is not an implicit measure of attitudes in the sense that participants are unaware that their attitude is being measured or unable to control their responses.

It is also noteworthy that the AMP has only modest correlations with other implicit measures of attitudes like the Implicit Association Test (IAT).

The problem with the low correlation between the IAT and AMP is that both tests are promoted as measures of implicit attitudes, but the low correlation means that they are poor measures of a single construct.

In conclusion, there is evidence to suggest that the AMP may not be an implicit measure of attitudes and evidence that it correlates poorly with other implicit measures.

How do Gilovich, Keltner, Chen, and Nisbett introduce the AMP to undergraduate students?

The textbook merely mentions that AMP scores have demonstrated convergent and predictive validity in a number of studies.

Responses on the AMP have been shown to be related to political attitudes, other measures of racial bias, and significant personal habits like smoking and drinking (Greenwald, Smith, Sriram, Bar-Anan & Nosek, 2009; Payne et al., 2005; Payne, Govorun, & Arbuckle, 2008; Payne, McClernon, & Dobbins, 2007). (p. 369)

Three of the four citations are by Payne, who developed the AMP and has a conflict of interest to show that his measure is valid and useful. None of the citations is more recent than 2010, meaning that no significant updates have been made in response to recent critical articles about the AMP.

The cited Greenwald et al. (2009) article is particularly informative because it examined convergent, discriminant, and predictive validity in a large, non-student sample.

The textbook claim that the AMP correlates with other implicit racial bias measures is supported by the r = .218 correlation with the Brief IAT measure. However, there is little evidence for discriminant validity because the AMP also correlates r = .220 and r = .208 with two explicit measures of prejudice (4. Thermometer & 5. Likert, respectively).

Moreover, predictive validity of the AMP is shown by the correlation with voter intentions, r = .113. This correlation is low and not higher than those for the two explicit measures, r = .211 and r = .124.
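One way to put these correlations in perspective is to square them, which gives the proportion of variance that two measures share. The sketch below uses the AMP correlations reported above; the dictionary labels are only descriptive names chosen for this illustration.

```python
# Correlations of the AMP with other measures, as reported in the text
# (Greenwald et al., 2009).
correlations = {
    "AMP with Brief IAT": 0.218,
    "AMP with feeling thermometer": 0.220,
    "AMP with Likert scale": 0.208,
    "AMP with voting intention": 0.113,
}

# Squared correlation = proportion of shared variance.
shared_variance = {label: r ** 2 for label, r in correlations.items()}

for label, r in correlations.items():
    print(f"{label}: r = {r:.3f}, shared variance = {shared_variance[label]:.1%}")
```

None of these correlations corresponds to more than 5% shared variance, and the AMP shares about as much variance with the explicit measures as with the other implicit measure.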

Finally, the study failed to find strong pro-White biases in this largely White sample (70% White) for the AMP (M = -0.02, SD = 0.17, d = .12) and the Brief IAT (M = 0.06, SD = 0.42, d = .14), which were not larger than the pro-White bias for explicit measures that are subject to socially desirable responding: feeling thermometer (M = 0.35, SD = 1.63, d = .21) and Likert scale (M = 0.35, SD = 0.86, d = .41).
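The effect sizes in this paragraph appear to be the mean bias score divided by its standard deviation. That is an assumption, because the formula is not stated here, but it reproduces the reported values, as this small sketch shows. The helper name is hypothetical.

```python
# Means and standard deviations of the bias scores as reported in the text.
measures = {
    "AMP": (-0.02, 0.17),
    "Brief IAT": (0.06, 0.42),
    "Feeling thermometer": (0.35, 1.63),
    "Likert scale": (0.35, 0.86),
}


def standardized_mean(mean: float, sd: float) -> float:
    """One-sample effect size: mean bias score in standard deviation units."""
    return mean / sd


effect_sizes = {name: standardized_mean(m, sd) for name, (m, sd) in measures.items()}

for name, d in effect_sizes.items():
    print(f"{name}: d = {d:.2f}")
```

With the AMP mean as printed, the ratio comes out negative while the text reports d = .12; the sign presumably depends on the scoring direction of the measure. In absolute terms, the two implicit measures show the smallest effects.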

These results do not justify claims that the AMP and the IAT are measures of some hidden, implicit attitudes that are only accessible by means of indirect attitude measures and that influence participants’ behavior without their knowledge. However, this is the citation provided in the textbook to support these claims.

If textbook authors had to present the actual evidence rather than a citation, such distortions of the truth would not be possible. Thus, students should demand more scientific figures and tables and fewer cute pictures in their social psychology textbook. After all, they pay good money for it.

2016 Blogs


12/31 ****
Review of an “eventful” 2016 (“Method Terrorists”)

A Forensic Analysis of Stapel: Fabrication or Falsification?

Replicability Analysis of Dijksterhuis’s “Enhancing Implicit Self-Esteem by Subliminal Evaluative Conditioning”


9/13 ***
Critique of Finkel, Eastwick, & Reis’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”


6/30 ***
Wagenmakers’ Default Prior is Unrealistic

6/25 ****
A Principled Approach to Setting the Prior of the Null-Hypothesis

6/18 ***
What is the Difference between the Test of Excessive Significance and the Incredibility Index?

6/16 ****
The A Priori Probability of the Point Null-Hypothesis is not 50%


Replicability Report on Social Priming Studies with Mating Primes

Critique of Jeffrey N. Rouder, Richard D. Morey, and Eric-Jan Wagenmakers’s article “The Interplay between Subjectivity, Statistical Practice, and Psychological Science” 

5/9 ***
Questionable Research Practices Invalidate Bayes-Factors Just As Much as P-Values


4/18 *****
Replicability Report of the Ego Depletion Literature


2/16 ****
Discussion of Sterling et al.’s (1995) Seminal Article on Inflated Success Rates in Psychological Science [also recommend reading the original article]

Replicability AudiT of a 10 Study Article by Adam D. Galinsky

A Replicability AudiT of Yaacov Trope’s Publications

2/3 ***
A Critique of Finkel, Eastwick, & Reis’s Views on the Replication Crisis


1/31 *****
Introduction to the R-Index
[The R-Index builds on the Incredibility Index, Schimmack (2012)]

Replicability Analysis of Damisch, Stoberock, & Mussweiler (2010)
[Anonymous Submission to R-Index Blog]

Replicability Analysis of Williams & Bargh (2008)

1/14 ***
Discussion of Hoenig and Heisey’s Critique of Observed Power Calculations

2017 Blogs


11/29 *****
A Quantitative Book Review of John A. Bargh’s Book “Before you know it”

My Response to the Rejection of the Z-Curve manuscript from AMPPS
[Reviewer 3 is author of the competing P-Curve method]


10/24 *****
Replicability Rankings of Psychology Journals (2010-2017)


9/4 ****
Replicability Report: The Pen-Paradigm of Facial Feedback Studies


8/2 *****
A Comment on the Alpha Wars: Focus on Beta


Replicability AudiT of the Journal Psychological Science


3/5 *****
Meta-Psychology: A New Discipline and a New Journal
[the journal now exists Meta-Psychology link]


2/26 ***
A Brief Introduction to Null-Hypothesis Significance Testing and Power
[1 Figure and 1500 words]

2/23 ****
On Random Measurement Error, Reliability, and Replicability

2/21 ***
Examining the Influence of Selection for Significance on Observed Power

2/2 ***** (100,000 views)
A Quantitative Review of Kahneman’s Thinking Fast and Slow Chapter on Social Priming [Co-Authored with Moritz Heene]