# Estimating the Size of File-Drawers with Z-Curve

Every student in psychology is introduced to the logic of Null-Hypothesis Significance Testing (NHST). The basic idea is to establish the long-run maximum probability that a significant result is a false positive result. A false positive result is called a type-I error. The standard for an acceptable type-I error risk is 5%. Statistics programs and articles often highlight results with a p-value less than 0.05. Students quickly learn that the goal of statistical analysis is to find p-values less than .05.

NHST has been criticized for many reasons. This blog post focuses on the problem that arises when NHST is used to hunt for significant results and only significant results are reported. Hunting for significant results is not a problem in itself. If a researcher conducts 100 statistical tests and reports all results, the risk of a type-I error is controlled by the significance criterion. With alpha = .05, no more than 5 of the 100 tests are expected to produce a false positive result in the long run. If, for example, 20 results are significant, it is clear that some of the significant results are true discoveries.

The problem arises when only significant results are reported (Sterling, 1959). If a researcher reports 20 significant results, it is not clear whether 20, 100, or 400 tests were conducted. However, this has important implications for the assessment of type-I errors. With 20 tests and 20 significant results, the type-I error risk is minimal; with 100 tests it is moderate (1 out of 4 significant results could be false positives); and with 400 tests it is practically certain that at least some of the significant results are false positives. After all, if all 400 tests examined false hypotheses, the expected percentage of significant results is 5%, and 20 out of 400 is exactly 5%. Thus, observing only 5% significant results in 400 tests suggests that many, if not all, of these significant results are false positives.

## The Replication Crisis and the True Type-I Error Risk

The selective publishing of only significant hypothesis tests is a major problem in psychological science (Sterling, 1959; Sterling et al., 1995), but psychologists only recently became aware of this problem (Francis, 2012; John et al., 2012; Schimmack, 2012). Once results are selected for significance, the true type-I error risk increases as a function of the actual number of tests that were conducted. While alpha is 5% in all studies, the percentage of reported significant results that are false positives is unknown because it is unknown how many tests were conducted.

## Type-I Error Risk and the File Drawer

Rosenthal (1979) introduced the concept of a file drawer. The proverbial file-drawer contains all of the unpublished studies that a researcher conducted that produced non-significant results.

If all studies had the same statistical power to produce a significant result, the size of the file-drawer would be self-evident. Studies with 50% power have a long-run probability of obtaining 50% significant results, by definition. Thus, 50% of studies also produce non-significant results. It follows that for each published significant result, there is one non-significant result in the proverbial file-drawer (File-Drawer Ratio 1:1; this simple example assumes independence of hypothesis tests).

If power were 80%, there would be only one non-significant result in the file-drawer for every 4 published significant results (File-Drawer Ratio 1:4, or 0.25:1). However, if power is only 20%, there would be 4 non-significant results for every published significant result (File-Drawer Ratio 4:1).
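The arithmetic behind these ratios can be sketched in a few lines (assuming independent tests and that all significant results, and only significant results, are published):

```python
# File-drawer ratio implied by a constant level of power: for power p,
# each significant result implies (1 - p) / p non-significant results.

def file_drawer_ratio(power):
    """Expected non-significant results per published significant result."""
    return (1.0 - power) / power

for p in (0.8, 0.5, 0.2):
    print(f"power = {p:.0%}: file-drawer ratio = {file_drawer_ratio(p):.2f} : 1")
```

With 80%, 50%, and 20% power this reproduces the ratios above: 0.25:1, 1:1, and 4:1.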

Things are more complicated when studies vary in power. If we assume that some studies test true hypotheses and others test false hypotheses, the probability of a significant result varies across studies. Using a simple example, assume that 80 studies test false hypotheses (so any significant result would be a false positive) and 20 studies have 50% power. In this case, we expect 14 significant results: 80 × .05 = 4 false positives and 20 × .5 = 10 true positives, and 4 + 10 = 14.

The 5% error rate applies to the 100 studies that were conducted, but it would be wrong to believe that at most 5% of the selected set of 14 significant results could be false positives. Under that false belief, we would assume that at most 1 of the 14 studies is a false positive (14 × .05 = 0.7 studies). However, in this case, we know that there are actually 4 expected false positive results. We get the correct estimate of the maximum number of false positives if we start with the actual number of studies that were conducted: 100 × .05 = 5 studies, which is a percentage of 5/14 = 36%. Thus, up to 36% of the 14 reported studies could be false positives, and the actual risk is 7 times larger than the claim p < .05 suggests.
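The numbers in this example can be checked with simple expected-value arithmetic (this is only the bookkeeping from the text, not a simulation):

```python
# The worked example: 80 tests of false hypotheses (alpha = .05)
# plus 20 tests with 50% power.
n_false, n_true = 80, 20
alpha, power = 0.05, 0.50

exp_false_pos = n_false * alpha            # 4 expected false positives
exp_true_pos = n_true * power              # 10 expected true positives
exp_sig = exp_false_pos + exp_true_pos     # 14 expected significant results

naive_max = exp_sig * alpha                # 0.7 (wrongly applies alpha to the 14)
actual_max = (n_false + n_true) * alpha    # 5 (applies alpha to all 100 tests)

print(f"expected significant results: {exp_sig:.0f}")
print(f"maximum false positives: {actual_max:.0f} "
      f"= {actual_max / exp_sig:.0%} of the reported results")
```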

In short, we need to know the size of the file-drawer to estimate the percentage of reported results that could be false positives.

## Estimating the Size of the File Drawer

Brunner and Schimmack (2018) developed a statistical method, z-curve, that can estimate mean power for a set of studies with heterogeneity in power, including some false positive results. The main purpose of the method was to estimate mean power for the set of published studies that produced significant results. However, the article also contained some theorems that make it possible to estimate the size of the file drawer.

Z-curve is a mixture model that models the distribution of observed test statistics (z-scores) as a mixture of studies with different levels of power. Brunner and Schimmack (2018) introduced a model with varying non-centrality parameters and weights. However, it is also possible to keep the non-centrality parameters constant and let only the weights be free model parameters. The fixed non-centrality parameters can include a value of 0 to model the presence of false positive results. The latest version of z-curve uses fixed values of 0, 1, 2, 3, 4, 5, and 6. Values greater than 6 are not needed because z-curve treats all observed z-scores greater than 6 as having a power of 1.

The power values corresponding to these fixed non-centrality parameters are 5%, 17%, 52%, 85%, 98%, 99.9%, and 100%. Only the lower power values are important for the estimation of the file-drawer because high values imply that nearly all attempts produce significant results.
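These power values follow directly from the standard normal distribution. A short sketch that recomputes them for a two-sided z-test with alpha = .05 (critical value 1.96):

```python
# Two-sided power at z-curve's fixed non-centrality parameters (0 to 6).
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(ncp, crit=1.96):
    """P(|Z| > crit) when Z ~ N(ncp, 1)."""
    return (1.0 - norm_cdf(crit - ncp)) + norm_cdf(-crit - ncp)

for ncp in range(7):
    print(f"ncp = {ncp}: power = {power(ncp):.3f}")
```

This reproduces the values in the text (5%, 17%, 52%, 85%, 98%, 99.9%, and, for practical purposes, 100%).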

To illustrate the method, I focus on the lowest three power values: 5%, 17% and 52%. Assume that we observe 100 significant results with the following mixture of power values: 30 studies have 5% power, 34 studies have 17% power, and 26 studies have 52% power, and we want to know the size of the file drawer.

To get from the observed number of significant studies to the number of studies that were actually run, we need to divide the number of observed studies by power (see Brunner & Schimmack, 2018, for a mathematical proof). With 5% power (i.e., false positive results), it takes 1/0.05 = 20 studies to produce 1 significant result in the long run. Thus, if 30 significant results were obtained with 5% power, 600 studies had to be run (600 × 0.05 = 30). With 17% power, it would require 200 studies to produce 34 significant results. And with 52% power, it would require 50 studies to produce 26 significant results. Thus, the total number of studies needed to obtain 100 significant results is 600 + 200 + 50 = 850. It follows that 750 (850 – 100) non-significant results are in the file drawer.

The following simulation illustrates how z-curve estimates the size of the file-drawer. Data are generated using normal distributions with a standard deviation of 1 and means of 0, 1, and 2. To achieve large-sample accuracy, there are 1,050,000 observations in total (M = 0, k = 800,000; M = 1, k = 200,000; M = 2, k = 50,000).
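A minimal version of this simulation can be sketched in a few lines (an assumed setup, not the original z-curve code; exact counts will differ slightly from those reported in the text because of random sampling):

```python
# Draw z-scores from normals with SD = 1 and means 0, 1, and 2; count
# |z| > 1.96 as significant, since z-curve works with absolute z-scores.
import random

random.seed(1)
mixture = [(0.0, 800_000), (1.0, 200_000), (2.0, 50_000)]
z_scores = [random.gauss(m, 1.0) for m, k in mixture for _ in range(k)]

total = len(z_scores)                                # 1,050,000 studies
significant = sum(abs(z) > 1.96 for z in z_scores)   # "published" results
discovery_rate = significant / total
fd_ratio = (total - significant) / significant

print(f"discovery rate: {discovery_rate:.3f}")    # close to .095
print(f"file-drawer ratio: {fd_ratio:.2f} : 1")   # close to 9.5 : 1
```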

Only significant results (to the right of the red line at z = 1.96) were used to fit the model. The non-significant results are shown to see how well the model predicts the size of the file drawer.

The gray line shows the predicted distribution by the model. It shows that the predicted distribution of non-significant results matches the observed distribution of non-significant results, although the model slightly overestimates the size of the file-drawer.

The Expected Discovery Rate is the estimated percentage of significant results among all studies, including those in the file-drawer. The actual discovery rate follows from the total number of studies (k = 1,050,000) and the actual number of significant results (k = 99,908): 99,908/1,050,000 = 9.52%. The expected discovery rate is 9%, a fairly close match given the size of the file drawer.

Another way to look at the size of the file-drawer is the file-drawer ratio. That is, how many studies with non-significant results are in the file drawer for every significant result. The actual file-drawer ratio is (1,050,000 – 99,908)/99,908 = 9.51. That is, for every significant result, 9 to 10 non-significant results were not reported. The estimated file-drawer ratio is 9.7, a fairly close match.

The next example shows how z-curve performs when mean power is higher and the file-drawer is smaller. In this example, there were 100,000 cases with M = 0, 200,000 cases with M = 1, 400,000 cases with M = 2, and 300,000 cases with M = 3. The expected discovery rate for this simulation is 50%. With mean power of 50%, the file-drawer ratio is 1:1. That is, for each significant result there is one non-significant result.

The grey line shows that z-curve slightly overestimates the size of the file-drawer. However, this bias is small. The expected discovery rate is estimated to be 49% and the file-drawer ratio is estimated to be 1.05 : 1. These estimates closely match the actual results.

If power is greater than 50%, the file-drawer ratio is less than 1:1. The final simulation assumes that researchers have 80% power to test a true hypothesis, but that 20% of all studies are false positives. The mixture of actual power is 200,000 cases with M = 0, 100,000 cases with M = 2, 400,000 cases with M = 3, and 300,000 cases with M = 4. The mean power is 70%.

Once more, z-curve fits the actual data quite well. The expected discovery rate of 71% matches the actual discovery rate of 70% and the estimated file-drawer ratio of 0.4 to 1 also matches the actual file-drawer ratio of 0.44 to 1.

More extensive simulations are needed to examine the performance of z-curve. With smaller sets of studies, random sampling error alone will produce some variability in estimates. However, large differences in file-drawer estimates such as 0.4:1 versus 10:1 are unlikely to occur by chance alone.

## Real Examples

To provide an illustration with real data, I fitted z-curve to Roy F. Baumeister’s results in his most influential studies (see Baumeister audiT for the data).

Visual inspection shows that Roy F. Baumeister’s z-curve most closely matches the first simulation. The quantitative estimates confirm this impression. The expected discovery rate is estimated to be 11% and the file-drawer ratio is estimated to be 9.65:1. That is, for every published significant result, z-curve predicts nearly 10 unpublished non-significant results. The figure shows that only a few non-significant results were reported in Baumeister’s articles. However, all of these non-significant results cluster in the region of marginally significant results (z > 1.65 & z < 1.96) and were interpreted as support for a hypothesis. Thus, all non-confirming evidence remained hidden in a fairly large file-drawer.

It is rare that social psychologists comment on their research practices, but in a personal communication Roy Baumeister confirmed that he has a file-drawer with non-significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

Other social psychologists have a smaller file-drawer. For example, the file-drawer ratio for Susan T. Fiske is only 2.8 : 1, which is only a third of Roy F. Baumeister’s file-drawer. Thus, while publication bias ensures that virtually everybody has a file-drawer, the size of the file-drawer can vary considerably across labs.

## File Drawer of Statistical Analyses Rather than Entire Studies

It is unlikely that actual file-drawers are as large as z-curve estimates. Dropping studies with non-significant results is only one of several questionable research practices that can be used to report only significant results. For example, including several dependent variables in a study can help to produce a significant result for a single dependent variable. In this case, most studies can be published. Thus, it is more accurate to think of the file-drawer as being filled with statistical outputs with non-significant results rather than entire studies. This does not reduce the problem of questionable research practices. Undisclosed multiple comparisons within a single data set undermine the replicability of published results just as much as failures to disclose results from a whole study.

Nevertheless, z-curve estimates should not be interpreted too literally. If there were such a thing as a Replicability Bureau of Investigation (RBI), and the RBI raided the labs of a researcher, the actual size of the file-drawer might differ from the z-curve prediction because it is impossible to know which questionable research practices were actually used to report only confirming results. However, the estimated file-drawer provides some information about the credibility of the published results. Estimates below a ratio of 1:1 suggest that the data are credible. The higher the file-drawer ratio, the less credible the published results become.

## File-Drawer Estimates and Open Science

The main advantage of being able to estimate file-drawers is that it is possible to monitor research practices in psychology labs without the need of an RBI. Reforms such as a priori power calculations, preregistration and more honest reporting of non-significant results should reduce the size of file-drawers. Requirements to share all data ensure open file-drawers. Z-curve can be used to evaluate whether these reforms are actually improving psychological science.

# Wagenmakers’ Crusade Against p-values

More than a decade ago, Wagenmakers (2007) started his crusade against p-values. His article “A practical solution to the pervasive problems of p-values” (PPPV) has been cited over 800 times, and it is Wagenmakers’ most cited original article (he also contributed to the OSC, 2015, reproducibility project, which has already garnered over 1,000 citations).

In PPPV, Wagenmakers claims that statisticians have identified three fundamental problems of p-values: (a) p-values do not quantify statistical evidence, (b) p-values depend on hypothetical data, and (c) p-values depend on researchers’ unknown intentions.

When I read the article many years ago, statistics was a side-interest for me, and I didn’t fully understand the article. Since the replication crisis started in 2011, I have learned a lot about statistics, and I am ready to share my thoughts about Wagenmakers’ critique of p-values. In short, I think Wagenmakers’ arguments are a load of rubbish and the proposed solution to use Bayesian model comparisons is likely to make matters worse.

## P-Values Depend on Hypothetical Data

Most readers of this blog post are familiar with the way p-values are computed. Some data are observed. Based on this observed data, an effect size is estimated. In addition, sampling error is computed either based on sample size alone or based on observed information about the distribution of observations (variance). The ratio of the effect size and the sampling error is used to compute a test statistic. To be clear, the same test statistics are used in frequentist statistics with p-values as in Bayesian statistics. So, any problems that occur during these steps are the same for p-values and Bayesian statistics.
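These shared first steps can be illustrated with a one-sample t-test on made-up data (the observations and the resulting numbers here are purely hypothetical):

```python
# Estimate the effect, compute the sampling error, and form the test
# statistic. The same statistic is the starting point for computing
# p-values and for Bayesian model comparisons.
from math import sqrt
from statistics import mean, stdev

data = [0.8, 1.2, -0.3, 0.5, 1.1, 0.9, 0.2, 1.4]  # hypothetical observations

effect = mean(data)                    # effect size estimate
se = stdev(data) / sqrt(len(data))     # sampling error of the mean
t_stat = effect / se                   # test statistic (one-sample t)

print(f"effect = {effect:.3f}, SE = {se:.3f}, t = {t_stat:.2f}")
```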

What are the hypothetical data that Wagenmakers sees as a problem?

“These hypothetical data are data expected under H0, without which it is impossible to construct the sampling distribution of the test statistic t(xrep | H0).”

Two things should be immediately obvious. First, the hypothetical data are no more or less hypothetical than the null-hypothesis. The null-hypothesis is hypothetical (hypothesis – hypothetical, see the connection) and based on the null-hypothesis predictions about the distribution of a test-statistic are made. The actual data are then compared to this prediction. There are no hypothetical data. There is a hypothetical distribution and an actual test statistic. Inferences are based on the comparison. Second, the “hypothetical data” that are expected under H0 are also expected in a Bayesian statistical framework because the same sampling distribution is used to compute the Bayesian Information Criterion or a Bayes Factor.

In short, it is easy to see that Wagenmakers’ problem is not a problem at all. Theories and hypotheses are abstractions. To use inferential statistics, the predictions have to be translated into a sampling distribution of a test statistic.

Wagenmakers presents an example from Pratt (1962) in full to drive home his point, and I reproduce this example in full as well.

“An engineer draws a random sample of electron tubes and measures the plate voltage under certain conditions with a very accurate volt-meter, accurate enough so that measurement error is negligible compared with the variability of the tubes. A statistician examines the measurements, which look normally distributed and vary from 75 to 99 volts with a mean of 87 and a standard deviation of 4. He makes the ordinary normal analysis, giving a confidence interval for the true mean. Later he visits the engineer’s laboratory, and notices that the volt meter used reads only as far as 100, so the population appears to be “censored.” This necessitates a new analysis, if the statistician is orthodox. However, the engineer says he has another meter, equally accurate and reading to 1000 volts, which he would have used if any voltage had been over 100. This is a relief to the orthodox statistician, because it means the population was effectively uncensored after all. But the next day the engineer telephones and says: “I just discovered my high-range volt-meter was not working the day I did the experiment you analyzed for me.” The statistician ascertains that the engineer would not have held up the experiment until the meter was fixed, and informs him that a new analysis will be required. The engineer is astounded. He says: “But the experiment turned out just the same as if the high-range meter had been working. I obtained the precise voltages of my sample anyway, so I learned exactly what I would have learned if the high-range meter had been available. Next you’ll be asking me about my oscilloscope.””

What is the problem here? Truncating the measure at 100 changes the statistical model. If we have to suspect that the data are truncated, we cannot use a statistical model that assumes a normal distribution. We could use a non-parametric test to get a p-value or a more sophisticated model that models the truncation process. This model would notice that there is little truncation in these hypothetical data because there are actually no values greater than 100.

Thus, this example merely illustrated that statistical inferences depend on the proper modeling of the sampling distribution of a test statistic. All statistical inferences are only valid if the assumptions of the statistical model hold. Otherwise, all bets are off. Most important, this is also true for Bayesian statistics because they rely on the same test statistics and distribution assumptions as p-values. There is nothing magical about Bayes Factors that would allow them to produce valid inferences when distribution assumptions are violated.

## P-Values Depend on Researchers’ Intentions

The second alleged problem of p-values is that they depend on researchers’ intentions.

“The same data may yield quite different p values, depending on the intention with which the experiment was carried out.”

This fact is illustrated with several examples like this one.

Imagine that you answered 9 out of 12 questions about statistics correctly (assuming it were possible to say what is correct and what is false), and I wanted to compute the p-value against the null-hypothesis that you were simply guessing. If we assume that the test was fixed at 12 questions in total, the two-sided p-value is p = .146. However, if the plan had been to continue asking questions until you made your third error, the very same answers would yield a p-value of .033.
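Both p-values can be recomputed from first principles (a sketch of the standard calculation; the stopping rule assumed for the second value is "continue until the third error," i.e., negative binomial sampling):

```python
# 9 of 12 answers correct; H0: guessing (success probability .5).
from math import comb

n, k = 12, 9

# Fixed-n (binomial) design: two-sided p-value for 9 or more correct.
p_fixed_n = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# Negative binomial design: probability of 9 or more correct answers
# before the 3rd error.
p_stop_rule = sum(comb(s + 2, 2) * 0.5 ** (s + 3) for s in range(k, 200))

print(f"fixed-n p-value: {p_fixed_n:.3f}")          # 0.146
print(f"stopping-rule p-value: {p_stop_rule:.3f}")  # 0.033
```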

Since 2011, it has been well known that data peeking alters the statistical model and that optional stopping alters p-values. If the decision to terminate data collection was in any way systematically influenced by previous results, a p-value that assumes no data peeking occurred is wrong because it is based on the wrong statistical model. Undisclosed checking of data is now known as a questionable research practice (John et al., 2012). Thus, Wagenmakers’ example merely shows that p-values cannot be trusted when researchers engage in questionable research practices. It does not show that p-values are inherently flawed.

How do Bayesian statistics avoid this problem? They avoid it only partially. Bayes Factors always express information as a comparison between two models. As long as researchers peek at the data and continue because the data do not favor either model, data peeking does not introduce a bias. However, if they were to peek and continue data collection until the data favor one model, Bayesian statistics would be just as biased by data peeking as the use of p-values. Even data peeking with inconclusive data can be biased if one of the models is implausible and would never receive support. In this case, the data can only produce evidence for one model or remain undecided, which leads to the same problem that Wagenmakers sees with p-values. For example, testing the null-hypothesis against Wagenmakers’ prior that assumes large effects of 1 SD or more would eventually produce evidence for the null-hypothesis, even if it were false, because the data can never produce support for the implausible alternative hypothesis.

In conclusion, the second argument is a good reason for preregistration and against the use of questionable research practices, but not a good argument against p-values.

## P-Values Do Not Quantify Statistical Evidence

The third claim is probably the most surprising for users of p-values. The main reason for computing p-values is that they are considered to be a common metric that can be used across different types of studies. Everything else being equal, a lower p-value is assumed to provide stronger evidence against the null-hypothesis.

In the Fisherian framework of statistical hypothesis testing, a p value is meant to indicate “the strength of the evidence against the hypothesis” (Fisher, 1958, p. 80).

What are the chances that all textbook writers got this wrong?

To make his point, Wagenmakers uses the ambiguity of everyday language and decides that “the most common and well-worked-out definition is the Bayesian definition.”

Nobody is surprised that p-values do not provide evidence given a Bayesian definition of evidence, just like nobody would be surprised that Bayes Factors do not provide information about the long-run probability of false positive discoveries.

What is surprising is that Wagenmakers provides no argument. Instead, he reviews some surveys of statisticians and psychologists that examined the influence of sample size on the evaluation of identical p-values.

For example, which study produces stronger evidence against the null-hypothesis: a study with N = 300 and p = .01, or a study with N = 30 and p = .01? Most statisticians favor the larger study. A quick survey in the Psychological Methods Discussion group confirmed this finding: 37 respondents favored the larger sample, 7 said there was no difference, and 4 favored the smaller sample.

Although this is interesting, it does not answer the question whether a p-value of .0001 provides stronger evidence against the null-hypothesis than a p-value of .10, which is the question at hand.

Thus, Wagenmakers’ strongest argument against p-values, namely that they are misinterpreted as a measure of the strength of evidence, is not an argument at all.

In short, Wagenmakers has been successful in casting doubt on the use of p-values amongst psychologists. He was able to do so because statistics training in psychology is poor and most users of p-values have only a vague understanding of the underlying statistical theory. As a result, they are swayed by strong claims that they cannot evaluate. It took me some time, and time away from my original research, to understand these issues. In my opinion, Wagenmakers’ critique falls apart under closer scrutiny.

The main problem of p-values is that they are not Bayesian, but that is only a problem if you like Bayesian statistics. For most practical purposes, p-values and Bayes-Factors lead to the same conclusions regarding the rejection of the null-hypothesis. In addition, Bayes-Factors offer the false promise that they can provide evidence for the nil-hypothesis, but that is the topic of another blog post.

The real problem in psychological science is not the use of p-values, but the abuse of p-values. That is, a study with N = 30 participants and p = .01 would produce just as much evidence as a study with N = 300 and p = .01 if we did not have to worry that the researcher with N = 30 also ran 300 participants, but only presented the results of one study that produced a significant result by chance. For this reason, I have invested my time and energy in studying the real power of studies to produce significant results and in detecting the use of questionable research practices. It does not matter to me whether effect size estimates and sampling error are reported as confidence intervals, converted into p-values, or reported as Bayes Factors. What matters is that the results are credible and that strong claims are supported by strong evidence, no matter how it is reported.

## Related Blog Posts

Why Wagenmakers is wrong (about Bayesian Analysis of Bem, 2011)

Wagenmakers’ Prior is Inconsistent with Empirical Results in Psychology

Confidence Intervals are More Informative than Bayes Factors

The Bayesian Mixture Model Does Not Estimate the False Positive Rate

Wagenmakers Confuses Evidence Against H1 with Evidence For H0

# 2018 Journal Replicability Rankings

This table shows the Replicability Rankings for 117 psychology journals.

Journals are ranked based on the replicability estimates for the year 2018.

Replicability estimates are obtained from z-curve analyses of automatically extracted test statistics. If you click on the journal name, you can see plots of the z-curve distributions for the years 2010-2018.

| Rank | Journal | 2018 | 2017 | 2016 | 2015 | 2014 | 2013 | 2012 | 2011 | 2010 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | European Journal of Developmental Psychology | 89 | 86 | 83 | 63 | 73 | 75 | 78 | 78 | 67 |
| 2 | Journal of Cognition and Development | 89 | 74 | 77 | 67 | 65 | 68 | 55 | 66 | 69 |
| 3 | Political Psychology | 88 | 75 | 78 | 74 | 70 | 71 | 73 | 43 | 66 |
| 4 | Social Development | 84 | 72 | 78 | 62 | 74 | 71 | 71 | 73 | 72 |
| 5 | Social Psychology | 84 | 74 | 74 | 72 | 74 | 70 | 64 | 76 | 72 |
| 6 | Depression & Anxiety | 83 | 75 | 78 | 70 | 73 | 80 | 80 | 89 | 86 |
| 7 | Journal of Counseling Psychology | 83 | 69 | 78 | 77 | 70 | 78 | 77 | 62 | 82 |
| 8 | Personal Relationships | 83 | 76 | 71 | 70 | 69 | 65 | 70 | 58 | 66 |
| 9 | Sex Roles | 83 | 81 | 80 | 73 | 72 | 76 | 79 | 73 | 73 |
| 10 | Journal of Occupational and Organizational Psychology | 82 | 72 | 82 | 79 | 70 | 74 | 77 | 70 | 63 |
| 11 | Cognitive Psychology | 81 | 75 | 80 | 72 | 75 | 77 | 71 | 81 | 74 |
| 12 | Epilepsy & Behavior | 81 | 82 | 81 | 79 | 85 | 84 | 79 | 89 | 76 |
| 13 | Experimental Psychology | 81 | 74 | 72 | 71 | 76 | 72 | 74 | 72 | 69 |
| 14 | Journal of Consumer Behaviour | 81 | 69 | 79 | 75 | 73 | 81 | 73 | 83 | 79 |
| 15 | Journal of Health Psychology | 81 | 63 | 71 | 79 | 78 | 80 | 76 | 63 | 71 |
| 16 | Journal of Pain | 81 | 67 | 77 | 72 | 80 | 72 | 77 | 73 | 70 |
| 17 | Law and Human Behavior | 81 | 75 | 76 | 69 | 60 | 74 | 76 | 83 | 72 |
| 18 | Psychology of Religion and Spirituality | 81 | 71 | 80 | 80 | 75 | 70 | 55 | 73 | 75 |
| 19 | Social Psychological and Personality Science | 81 | 76 | 65 | 60 | 64 | 61 | 57 | 65 | 54 |
| 20 | Evolution & Human Behavior | 80 | 73 | 80 | 75 | 75 | 62 | 64 | 69 | 62 |
| 21 | Journal of Personality | 80 | 77 | 73 | 68 | 72 | 69 | 72 | 60 | 66 |
| 22 | JPSP-Attitudes & Social Cognition | 80 | 79 | 55 | 74 | 69 | 49 | 61 | 61 | 60 |
| 23 | Journal of Vocational Behavior | 80 | 74 | 85 | 83 | 65 | 83 | 79 | 85 | 77 |
| 24 | Memory and Cognition | 80 | 80 | 74 | 79 | 76 | 76 | 79 | 76 | 77 |
| 25 | Attention, Perception and Psychophysics | 79 | 79 | 70 | 73 | 76 | 77 | 80 | 74 | 73 |
| 26 | Consciousness and Cognition | 79 | 78 | 69 | 69 | 74 | 66 | 70 | 73 | 73 |
| 27 | Journal of Cognitive Psychology | 79 | 75 | 77 | 74 | 77 | 72 | 73 | 79 | 85 |
| 28 | Journal of Educational Psychology | 79 | 78 | 72 | 67 | 75 | 74 | 76 | 77 | 83 |
| 29 | Journal of Nonverbal Behavior | 79 | 89 | 73 | 63 | 72 | 76 | 71 | 63 | 64 |
| 30 | Journal of Research in Personality | 79 | 78 | 76 | 81 | 77 | 76 | 70 | 72 | 68 |
| 31 | Psychophysiology | 79 | 78 | 78 | 70 | 72 | 68 | 71 | 77 | 78 |
| 32 | Quarterly Journal of Experimental Psychology | 79 | 76 | 76 | 75 | 74 | 73 | 76 | 75 | 72 |
| 33 | Aggressive Behavior | 78 | 72 | 77 | 67 | 70 | 60 | 69 | 79 | 68 |
| 34 | Evolutionary Psychology | 78 | 78 | 82 | 77 | 76 | 81 | 73 | 80 | 69 |
| 35 | Health Psychology | 78 | 70 | 61 | 66 | 66 | 67 | 59 | 69 | 68 |
| 36 | J. of Exp. Psychology – Human Perception and Performance | 78 | 76 | 77 | 75 | 75 | 75 | 77 | 78 | 76 |
| 37 | J. of Exp. Psychology – Learning, Memory & Cognition | 78 | 79 | 78 | 77 | 81 | 74 | 76 | 71 | 80 |
| 38 | Psychonomic Bulletin and Review | 78 | 75 | 77 | 82 | 78 | 83 | 71 | 70 | 78 |
| 39 | British Journal of Psychology | 77 | 77 | 77 | 82 | 75 | 71 | 78 | 79 | 69 |
| 40 | British Journal of Developmental Psychology | 77 | 71 | 76 | 74 | 64 | 67 | 85 | 77 | 77 |
| 41 | Journal of Cross-Cultural Psychology | 77 | 75 | 75 | 80 | 77 | 80 | 71 | 77 | 77 |
| 42 | Journal of Experimental Psychology – General | 77 | 77 | 74 | 74 | 72 | 74 | 66 | 73 | 68 |
| 43 | Journal of Family Psychology | 77 | 69 | 62 | 72 | 71 | 70 | 64 | 67 | 68 |
| 44 | Journal of Memory and Language | 77 | 80 | 82 | 79 | 74 | 75 | 71 | 79 | 73 |
| 45 | JPSP-Personality Processes and Individual Differences | 77 | 65 | 74 | 71 | 73 | 65 | 68 | 70 | 61 |
| 46 | Personality and Individual Differences | 77 | 76 | 74 | 77 | 77 | 76 | 73 | 71 | 70 |
| 47 | Appetite | 76 | 77 | 71 | 64 | 66 | 73 | 71 | 72 | 73 |
| 48 | Cognition | 76 | 76 | 73 | 72 | 74 | 76 | 74 | 72 | 72 |
| 49 | European Journal of Personality | 76 | 76 | 79 | 68 | 81 | 67 | 67 | 70 | 79 |
| 50 | Journal of Anxiety Disorders | 76 | 79 | 73 | 69 | 76 | 75 | 78 | 71 | 74 |
| 51 | Journal of Occupational Health Psychology | 76 | 80 | 73 | 72 | 73 | 54 | 75 | 79 | 71 |
| 52 | Cognition and Emotion | 75 | 65 | 68 | 74 | 73 | 85 | 85 | 81 | 81 |
| 53 | Journal of Affective Disorders | 75 | 75 | 84 | 85 | 77 | 84 | 78 | 72 | 71 |
| 54 | Journal of Child and Family Studies | 75 | 73 | 72 | 69 | 68 | 74 | 73 | 74 | 73 |
| 55 | Journal of Experimental Social Psychology | 75 | 71 | 67 | 62 | 61 | 56 | 54 | 57 | 55 |
| 56 | Journal of Social and Personal Relationships | 75 | 71 | 84 | 59 | 57 | 69 | 61 | 78 | 82 |
| 57 | Psychological Science | 75 | 71 | 68 | 69 | 65 | 65 | 63 | 61 | 61 |
| 58 | Cognitive Therapy and Research | 74 | 75 | 70 | 71 | 61 | 75 | 74 | 67 | 65 |
| 59 | Frontiers in Psychology | 74 | 76 | 74 | 73 | 73 | 72 | 72 | 68 | 82 |
| 60 | Journal of Applied Social Psychology | 74 | 71 | 79 | 67 | 72 | 69 | 77 | 71 | 75 |
| 61 | Journal of Religion and Health | 74 | 74 | 85 | 80 | 76 | 76 | 89 | 80 | 68 |
| 62 | Psychological Medicine | 74 | 73 | 82 | 67 | 75 | 78 | 66 | 77 | 72 |
| 63 | Animal Behavior | 73 | 77 | 71 | 69 | 70 | 71 | 71 | 70 | 75 |
| 64 | Child Development | 73 | 66 | 73 | 73 | 68 | 69 | 74 | 71 | 73 |
| 65 | Cognitive Development | 73 | 80 | 74 | 82 | 71 | 71 | 74 | 69 | 63 |
| 66 | Developmental Psychology | 73 | 75 | 75 | 74 | 75 | 72 | 67 | 68 | 66 |
| 67 | Emotion | 73 | 73 | 71 | 68 | 69 | 72 | 68 | 68 | 73 |
| 68 | Frontiers in Human Neuroscience | 73 | 70 | 74 | 73 | 74 | 76 | 78 | 76 | 72 |
| 69 | Judgment and Decision Making | 73 | 81 | 78 | 76 | 77 | 68 | 73 | 70 | 71 |
| 70 | Journal of Experimental Child Psychology | 73 | 72 | 71 | 77 | 75 | 72 | 72 | 71 | 74 |
| 71 | Journal of Social Psychology | 73 | 75 | 73 | 70 | 65 | 62 | 77 | 71 | 75 |
| 72 | Memory | 73 | 74 | 79 | 67 | 87 | 76 | 77 | 84 | 88 |
| 73 | Perception | 73 | 75 | 76 | 78 | 73 | 79 | 82 | 89 | 93 |
| 74 | Annals of Behavioral Medicine | 72 | 70 | 73 | 63 | 70 | 75 | 77 | 72 | 72 |
| 75 | Archives of Sexual Behavior | 72 | 78 | 78 | 79 | 75 | 81 | 78 | 76 | 87 |
| 76 | Frontiers in Behavioral Neuroscience | 72 | 74 | 70 | 70 | 67 | 70 | 72 | 70 | 67 |
| 77 | International Journal of Psychophysiology | 72 | 74 | 64 | 70 | 67 | 62 | 71 | 70 | 65 |
| 78 | Psychology and Aging | 72 | 78 | 80 | 76 | 81 | 71 | 77 | 76 | 75 |
| 79 | Behaviour Research and Therapy | 71 | 70 | 72 | 75 | 76 | 72 | 77 | 66 | 69 |
| 80 | Journal of Organizational Psychology | 71 | 73 | 71 | 66 | 73 | 62 | 72 | 66 | 75 |
| 81 | Journal of Positive Psychology | 71 | 81 | 69 | 72 | 74 | 62 | 67 | 63 | 73 |
| 82 | JPSP-Interpersonal Relationships and Group Processes | 71 | 68 | 73 | 64 | 61 | 62 | 56 | 61 | 54 |
| 83 | Organizational Behavior and Human Decision Processes | 71 | 68 | 72 | 69 | 69 | 72 | 69 | 71 | 63 |
| 84 | Personality Disorders | 71 | 87 | 64 | 63 | 72 | 77 | 52 | 55 | 84 |
| 85 | Personality and Social Psychology Bulletin | 71 | 73 | 69 | 64 | 64 | 60 | 59 | 61 | 62 |
| 86 | Acta Psychologica | 70 | 77 | 73 | 73 | 76 | 74 | 74 | 76 | 74 |
| 87 | British Journal of Social Psychology | 70 | 78 | 63 | 67 | 61 | 63 | 59 | 70 | 63 |
| 88 | Hormones & Behavior | 70 | 61 | 63 | 62 | 62 | 62 | 61 | 66 | 63 |
| 89 | Journal of Abnormal Psychology | 70 | 69 | 64 | 63 | 65 | 69 | 66 | 73 | 70 |
| 90 | Journal of Consulting and Clinical Psychology | 70 | 77 | 61 | 66 | 65 | 62 | 65 | 66 | 65 |
| 91 | Journal of Experimental Psychology – Applied | 70 | 80 | 69 | 68 | 72 | 65 | 75 | 70 | 71 |
| 92 | Journal of Happiness Studies | 70 | 56 | 79 | 78 | 78 | 80 | 77 | 88 | 77 |
| 93 | Behavioural Brain Research | 69 | 71 | 68 | 74 | 67 | 70 | 71 | 71 | 72 |
| 94 | Cognitive Behavioral Therapy | 69 | 75 | 80 | 76 | 62 | 70 | 80 | 72 | 62 |
| 95 | Journal of Applied Psychology | 69 | 79 | 80 | 70 | 74 | 69 | 73 | 69 | 71 |
| 96 | Journal of Autism and Developmental Disorders | 69 | 71 | 72 | 70 | 65 | 72 | 67 | 67 | 70 |
| 97 | Psychology of Music | 69 | 80 | 79 | 72 | 73 | 75 | 72 | 82 | 87 |
| 98 | Biological Psychology | 68 | 63 | 66 | 70 | 66 | 66 | 61 | 70 | 70 |
| 99 | Developmental Science | 68 | 73 | 67 | 69 | 65 | 71 | 67 | 68 | 67 |
| 100 | Journal of Comparative Psychology | 68 | 66 | 75 | 75 | 79 | 80 | 71 | 68 | 62 |
| 101 | Psychology and Marketing | 68 | 70 | 70 | 65 | 76 | 65 | 71 | 63 | 71 |
| 102 | Psychoneuroendocrinology | 68 | 65 | 66 | 63 | 63 | 64 | 62 | 64 | 61 |
| 103 | Psychopharmacology | 68 | 74 | 75 | 74 | 71 | 73 | 75 | 71 | 71 |
| 104 | Behavior Therapy | 67 | 71 | 69 | 71 | 74 | 74 | 75 | 63 | 77 |
| 105 | Developmental Psychobiology | 67 | 63 | 66 | 65 | 69 | 70 | 70 | 71 | 64 |
| 106 | Journal of Consumer Psychology | 66 | 56 | 53 | 67 | 66 | 64 | 59 | 59 | 64 |
| 107 | Journal of Consumer Research | 66 | 64 | 63 | 51 | 63 | 48 | 61 | 60 | 64 |
| 108 | Journal of Individual Differences | 65 | 82 | 65 | 74 | 63 | 86 | 55 | 91 | 70 |
| 109 | Journal of Youth and Adolescence | 65 | 70 | 84 | 77 | 81 | 76 | 74 | 74 | 75 |
| 110 | European Journal of Social Psychology | 64 | 73 | 76 | 64 | 71 | 67 | 56 | 68 | 66 |
| 111 | Group Processes & Intergroup Relations | 64 | 68 | 66 | 70 | 67 | 69 | 65 | 67 | 59 |
| 112 | Journal of Research on Adolescence | 62 | 67 | 71 | 67 | 64 | 72 | 74 | 78 | 67 |
| 113 | Journal of Child Psychology and Psychiatry and Allied Disciplines | 61 | 68 | 67 | 67 | 62 | 68 | 71 | 57 | 62 |
| 114 | Motivation and Emotion | 61 | 72 | 63 | 66 | 64 | 66 | 63 | 81 | 67 |
| 115 | Infancy | 59 | 60 | 61 | 61 | 66 | 67 | 63 | 71 | 53 |
| 116 | Behavioral Neuroscience | 57 | 73 | 68 | 70 | 70 | 68 | 70 | 66 | 72 |
| 117 | Self and Identity | 57 | 68 | 68 | 57 | 72 | 71 | 72 | 70 | 73 |

# Social Psychology Textbook AudiT: Ego Depletion

Since 2011, social psychology has been in a crisis of confidence. Many published results were obtained with questionable research practices (QRPs) and failed to replicate. The Open Science Collaboration found that only 25% of social psychological results could be successfully replicated (OSC, 2015).

One of the biggest scandals in social psychology is the ego-depletion literature. The main assumption of ego-depletion theory is that working on a cognitively demanding task lowers individuals’ ability to do well on a second demanding task.

A meta-analysis in 2010 seemed to show that ego-depletion effects in laboratory studies are robust and have a moderate effect size (d = .5). However, this meta-analysis did not control for the influence of questionable research practices. A subsequent meta-analysis did take QRPs into account and found no evidence for the effect.

This meta-analysis triggered a crisis of confidence in the ego-depletion effect and an initiative to investigate ego-depletion in a massive replication attempt. The outcome of this major replication study confirmed the finding of the second meta-analysis. There was no evidence for an ego-depletion effect, despite the massive statistical power to detect even a small effect (d = .2) (Hagger et al., 2016).

There have been different responses to the replication failure. The inventor of ego-depletion theory, Roy F. Baumeister, blames the design of the replication study for the replication failure (cf. Drummond & Philipp, 2017). However, others, including myself (Schimmack, 2016), pointed out that Baumeister and colleagues used QRPs in their original studies and therefore do not provide credible evidence for the effect. Some ego-depletion researchers, like Michael Inzlicht (pdf), openly expressed concern that ego-depletion may not be real.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)  [Schimmack, 2014]

Social psychology textbooks responded differently to these developments in ego-depletion research.

Gilovich et al. (2019, 5ed) simply removed ego-depletion from their textbook, while the 3ed (2013) covered ego depletion, including the even more controversial claim that links ego-depletion to blood glucose levels, which was also obtained with QRPs (Schimmack, 2012).

In contrast, Myers and Twenge (2018, 13ed) continue to cover ego-depletion without mentioning any replication failures or concerns about the robustness of the evidence.

Neither treatment of the doubts about ego-depletion is acceptable. Simply removing ego-depletion misses the opportunity to teach students to think critically about social psychological research, which probably is the point. However, presenting ego-depletion without mentioning the replication failures is even worse. Examples like this show that social psychologists are unwilling to be open about the recent developments in their field and that students cannot trust social psychology textbooks to provide a balanced and scientific introduction to the field.

# An Introduction to Anti-Social Psychology

Social psychology textbooks aim to inform students about social psychology. However, the authors also want to promote social psychology. As a result, they present social psychology in the most favorable light. The result is textbooks that hide embarrassing facts (most textbook findings probably do not replicate) and do not allow students to think critically about social psychology.

Ideally, somebody would publish a balanced and objective textbook. This blog post has a different aim. It introduces students of social psychology to critical evidence that is missing from most social psychology textbooks.

The evidence presented in the “anti-social” textbook may be biased against social psychology, but it provides students with some information that they can use to make up their own mind about social psychology.

The aim is to use a Hegelian approach to teaching, where the textbook provides a thesis (e.g., self-perception theory explains how individuals form attitudes), the anti-social textbook provides the antithesis (self-perception theory is an outdated attempt to explain attitudes from the perspective of radical behaviorism), and students can then synthesize the conflicting claims into their own perspective on social psychology.

Culture of Honor (Northern vs. Southern US States)

Ease of Retrieval

Ego Depletion

Implicit Association Test (IAT)

Priming (Subliminal Priming; Unconscious Processes)

Replication, Replicability, Replication Outcomes in Social Psychology

Self-Knowledge (Accuracy and Biases)

Stereotype Threat

Forthcoming

Self-Perception Theory

Terror-Management Theory

# Social Psychology Textbook AudiT: The (In)Accuracy of Self-Knowledge

Gilovich, Keltner, Chen, & Nisbett, 2019, Social Psychology (5ed), p. 65-66

To understand social psychologists’ claims about the accuracy of self-knowledge, it is important to be aware of the person-situation debate of the 1970s. On the one hand, social psychologists maintained that self-concepts are largely illusory and do not predict behavior. On the other hand, personality psychologists assumed that people have some accurate self-knowledge about stable personality dispositions that influence their behavior.

It is also important to know that textbook author Nisbett was actively engaged in the person-situation controversy. It is therefore not surprising that the textbook chapter about accuracy in self-knowledge is strongly biased and fails to mention decades of research that has demonstrated convergent validity of self-ratings and informant ratings of personality (e.g., Connelly & Ones, 2010, for a meta-analysis).

Instead, students are given the impression that self-knowledge is rather poor.

Recall the research described in Chapter 1 in which Nisbett and Wilson (1977) discovered that people can readily provide explanations for their behaviors that are not in fact accurate. Someone might say that she picked her favorite nightgown because of its texture or color, when in fact she picked it out because it was the last one she saw. Even our ability to report accurately on more important decisions – such as why we chose job candidate A over job candidate B, why we like Joe better than Jack, or how we solved a particular problem – can be wide of the mark (Nisbett & Wilson, 1977).

…much of the time, we draw inaccurate conclusions about the self because we don’t have access to certain mental processes, such as those that lead us to prefer objects we looked at last (Wilson, 2002, Wilson & Dunn, 2004).

… such mental processes are nonconscious, occurring outside of our awareness, leaving us to generate alternative, plausible accounts for our preferences and behaviors instead.

Given such roadblocks, how can a person gain accurate self-knowledge?

The textbook doesn’t provide an answer to its own question. One possible answer could have been that it doesn’t require introspection to know oneself. Later the textbook introduces self-perception theory, which states that we can know ourselves like we know other people: by observing ourselves and making attributions about our behaviors. For example, if I regularly order vanilla ice cream rather than chocolate ice cream, I can infer that I have a preference for vanilla; I do not need to know why I have that preference (e.g., my mother fed me vanilla-flavored formula as a baby).

In any case, Nisbett’s musings about the limitations of introspection fail to explain how people acquire accurate self-knowledge about their personality, values, happiness, and past behaviors, nor do they cite relevant studies by personality psychologists.

The section on accuracy of self-knowledge ends with studies by Vazire (Vazire & Mehl, 2008; Vazire, 2010; Vazire & Carlson, 2011). These studies show that, on average, self-ratings and informant ratings are equally good predictors of an objective criterion of behavior. They also suggest that the self is better able to make judgments about internal states.

“Because we have greater information than others do about our internal states (such as our internal thoughts and feelings), we are better judges of our internal traits (being optimistic or pessimistic, for instance).”

Students may be a little bit confused by the earlier claims that introspection often leads us astray and the concluding statement that the self is most accurate in judging internal states. Apparently, introspection does provide some valuable information that can be used to know oneself.

In conclusion, social psychologists have ignored accuracy in self-knowledge because they were more interested in demonstrating biases and errors in human information processing. The textbook is stuck in some old studies on limits of introspection and does not review decades of research on accuracy of self-knowledge (e.g., Funder, 1995). To learn about accuracy of self-knowledge, students are better off taking a course on personality psychology.

# Auditing Social Psychology Textbooks: Hitler had High Self-Esteem

Social psychologists see themselves as psychological “scientists,”  who study people and therefore believe that they know people better than you or me. However, often their claims are not based on credible scientific evidence and are merely personal opinions disguised as science.

For example, a popular undergraduate psychology textbook claims that

Hitler had high self-esteem.

quoting an article that has been cited over 500 times in the journal “Psychological Science in the Public Interest.”  At the end of the article with the title “Does High Self-Esteem Cause Better Performance, Interpersonal Success, Happiness, or Healthier Lifestyles?” the authors write:

“High self-esteem feels good and fosters initiative. It may still prove a useful tool to promote success and virtue, but it should be clearly and explicitly linked to desirable behavior. After all, Hitler had very high self-esteem and plenty of initiative, too, but those were hardly guarantees of ethical behavior.”

In the textbook this quote is linked to boys who engage in sex at an “inappropriately young age,” which is not further specified (in Canada this would be 14, according to recent statistics).

“High self-esteem does have some benefits—it fosters initiative, resilience, and pleasant feelings (Baumeister & others, 2003). Yet teen males who engage in sexual activity at an “inappropriately young age” tend to have higher than average self-esteem. So do teen gang leaders, extreme ethnocentrists, terrorists, and men in prison for committing violent crimes (Bushman & Baumeister, 2002; Dawes, 1994, 1998). “Hitler had very high self-esteem,” note Baumeister and co-authors (2003).”  (Myers, 2011, Social Psychology, 12th edition)

Undergraduate students pay a lot of money to be informed that people with high self-esteem are like sexual deviants, terrorists, violent criminals, and Hitler (maybe we should add scientists with big claims and small samples to the list).

The problem is that this is not even true. Students who work with me on fact checking the textbook found this quote in the original article.

“There was no [!] significant difference in self-esteem scores between violent offenders and non-offenders, Ms = 28.90 and 28.89, respectively, t(7653) = 0.02, p > .9, d = 0.0001.”

Although the df of the t-test look impressive, the study compared 63 violent offenders to 7,590 unmatched, mostly undergraduate participants (gender not specified, probably mostly female). So the sampling error of this study is high, and the theoretical importance of comparing these two groups is questionable.
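The point about sampling error can be made concrete. With such unbalanced groups, the precision of the comparison is dominated by the smaller group (n = 63). Here is a minimal sketch in Python, using the standard large-sample approximation for the standard error of Cohen’s d (the numbers are taken from the study described above):

```python
import math

def se_cohens_d(d, n1, n2):
    """Approximate standard error of Cohen's d for two independent groups."""
    return math.sqrt(1 / n1 + 1 / n2 + d**2 / (2 * (n1 + n2)))

# 63 violent offenders vs. 7,590 comparison participants, d ~ 0
se = se_cohens_d(d=0.0001, n1=63, n2=7590)
ci_half_width = 1.96 * se  # half-width of the 95% confidence interval

print(round(se, 3))             # ≈ 0.127
print(round(ci_half_width, 2))  # ≈ 0.25
```

Despite the impressive degrees of freedom, the data are consistent with true effects anywhere between roughly d = -.25 and d = .25; the non-significant result is much less informative than the df suggest.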

[The latest edition (13ed, 2018) still contains the quote.]

How Many Correct Citations Could be False Positives?

Of course, the example above is an exception. Most of the time a cited reference contains an empirical finding that is consistent with the textbook claim. However, this does not mean that textbook findings are based on credible and replicable evidence. Until recently it was common to assume that statistical significance ensures that most published results are true positives (i.e., not false positive, chance findings). However, this is only the case if all results are reported. It has been known since 1959 that this is not the case in psychology (Sterling, 1959). Jerry Brunner and I developed a statistical tool that can be used to clean up the existing literature. Rather than actually redoing 50 years of research, we use the statistical results reported in original studies to correct for the significance filter post hoc. Our tool is called z-curve. Below I used z-curve to examine the replicability of studies that were used in Chapter 2 about the self.

More detailed information about the interpretation of the graph above is provided elsewhere (link). In short, for each citation in the textbook chapter that is used as evidence for a claim, a team of undergraduate students retrieved the cited article and extracted the main statistical result that matches the textbook claim. These statistical results are then converted into z-scores that reflect the strength of evidence for a claim. Only significant results matter because non-significant results cannot support an empirical claim. Z-curve fits a model to the (density) distribution of significant z-scores (z-scores > 1.96). The shape of the density distribution provides information about the probability that a randomly drawn study from the set would replicate (i.e., reproduce a significant result).

The grey line shows the distribution predicted by z-curve. It matches the observed density (dark blue) well, and simulation studies show good performance of z-curve. Z-curve estimates that the average replicability of studies in this chapter is 56%. This number would be reassuring if all studies had 56% power; it would mean that all studies are true positives and that every other replication attempt would succeed. However, reality does not match this rosy scenario. In reality, studies vary in replicability. Studies with z-scores greater than 5 have 99% replicability (see numbers below the x-axis), whereas studies with just-significant results (z < 2.5) have only 21% replicability. As the graph shows, there are many more studies with z < 2.5 than with z > 5, so studies with low replicability outnumber those with high replicability.

The next plot shows model fit (higher numbers = worse fit) for z-curve models with a fixed proportion of false positives. If the data are inconsistent with a fixed proportion of false positives, model fit decreases (higher numbers).
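The conversion of reported results into z-scores can be sketched in a few lines of Python (a minimal illustration, not the actual z-curve code; the p-values below are made up):

```python
from statistics import NormalDist

def p_to_z(p_two_sided):
    """Convert a two-sided p-value into the absolute z-score that
    reflects the strength of evidence against the null hypothesis."""
    return NormalDist().inv_cdf(1 - p_two_sided / 2)

# Hypothetical p-values extracted from cited articles
p_values = [0.049, 0.03, 0.004, 0.20, 0.001]
z_scores = [p_to_z(p) for p in p_values]

# Only significant results (z > 1.96, i.e., p < .05) enter the z-curve model
significant = [z for z in z_scores if z > 1.96]
print(len(significant))  # → 4 (the p = .20 result is excluded)
```

The distribution of these significant z-scores is the input that z-curve models; the non-significant results are exactly the ones that selective reporting hides.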

The graph shows that models with 100%, 90%, or 80% false positives clearly do not fit the data as well as models with fewer false positives. This shows that some textbook claims are based on solid empirical evidence. However, model fit for models with 0% to 60% false positives looks very similar. Thus, it is possible that the majority of claims in the self chapter of this textbook are false positives.

It is even more problematic that textbook claims are often based on a single study with a student sample at one university. Social psychologists have warned repeatedly that their findings are very sensitive to minute variations in studies, which makes it difficult to replicate these effects even under very similar conditions (Van Bavel et al., 2016), and that it is impossible to reproduce exactly the same experimental conditions (Stroebe & Strack, 2014). Thus, the z-curve estimate of 56% replicability is a wildly optimistic estimate of replicability in actual replication studies. In fact, the average replicability of studies in social psychology is only 25% (Open Science Collaboration, 2015).
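The logic of the significance filter discussed above can be illustrated with a short simulation (a sketch with made-up numbers): if 400 studies all test true null hypotheses and only significant results are reported, every published result is a false positive, and roughly 5% of all tests clear the threshold by chance alone.

```python
from statistics import NormalDist

# Simulate 400 studies that all test a true null hypothesis:
# each test statistic is a draw from the standard normal distribution.
z_stats = NormalDist().samples(400, seed=42)

# Apply the significance filter: only |z| > 1.96 gets published.
published = [z for z in z_stats if abs(z) > 1.96]

# About 5% of the 400 tests (~20) come out significant, and every one
# of these "discoveries" is a false positive.
print(len(published))
```

A reader who only sees the published studies has no way of knowing that 400 tests were run; the file-drawer hides the denominator that would reveal the true type-I error risk.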

Conclusion

Social psychology textbooks present many findings as if they are established facts, when this is not the case. It is time to audit psychology textbooks to ensure that students receive accurate scientific information to inform their beliefs about human behavior. Ideally, textbook authors will revise their textbooks to make them more scientific, and instructors will choose textbooks based on the credibility of the evidence in them.

# Social Psych Textbook AudiT: The Affective Misattribution Paradigm

The Affective Misattribution Paradigm (AMP) is a popular measure of attitudes. The main promise of this measure is that it measures implicit attitudes, where the term “implicit” suggests that participants are (a) not aware of their attitude, (b) not aware that their attitude is being measured, or (c) aware that their attitude is being measured, but unable to control (fake) their responses.

The picture below illustrates the basic principle of the AMP. The critical attitude object is the picture of a tropical beach. The aim is to measure your attitude towards tropical beaches. However, the task is presented as a task to evaluate the Chinese character and to ignore the tropical beach. The problem for participants is that the Chinese character elicits no strong emotion (unless you are Chinese and know that the character means death), while the picture of the beach elicits a positive emotional response for many participants. The proposition of the AMP is that participants involuntarily rely on their emotional response to the tropical beach (called the prime) to judge the character (called the target).

An alternative version of the AMP would present the prime subliminally; that is, the presentation would be so short and masked with another image that participants cannot identify the prime picture. This would make the AMP an implicit measure of attitudes without awareness of the true source of the evaluation. However, subliminal presentations are not reliable enough to measure attitudes (Payne, 2017).

However, the presentation of prime stimuli in plain view makes it clear that participants are aware of the prime. The question is whether they are aware that their emotional response is elicited by the prime rather than the target. Maybe they are simply too lazy to bother controlling their attention or response, which is more effortful than to simply report the emotional response that was elicited.

Three studies have provided evidence that participants are aware of the true source of their feelings and can control their responses.

Bar-Anan and Nosek (2012) asked participants how they made their responses and found priming effects for participants who stated that their responses were guided by the prime rather than the target.

Teige-Mocigemba, Penzl, Becker, Henn, and Klauer (2015) simply instructed participants to respond opposite to their emotional responses and found that participants were able to do so.

Hazlett and Berinsky (2018) gave students small monetary incentives to control their automatic emotional responses to primes. The key finding was that providing a monetary incentive further decreased the influence of primes on participants’ responses over a simple instruction to ignore the primes. This supports the motivation hypothesis: participants are able to control their responses, but lack the motivation to do so unless there is a reason for it.

In conclusion, the AMP is not an implicit measure of attitudes in the sense that participants are unaware that their attitude is being measured or unable to control their responses.

It is also noteworthy that the AMP has modest correlations with implicit measures of attitudes like the Implicit Association Test.

The problem with the low correlation between the IAT and AMP is that both tests are promoted as measures of implicit attitudes, but the low correlation means that they are poor measures of a single construct.

Taken together, there is evidence to suggest that the AMP may not be an implicit measure of attitudes and evidence that it correlates poorly with other implicit measures.

How do Gilovich, Keltner, Chen, and Nisbett introduce the AMP to undergraduate students?

The textbook merely mentions that AMP scores have demonstrated convergent and predictive validity in a number of studies.

Responses on the AMP have been shown to be related to political attitudes, other measures of racial bias, and significant personal habits like smoking and drinking (Greenwald, Smith, Sriram, Bar-Anan & Nosek, 2009; Payne et al., 2005; Payne, Govorun, & Arbuckle, 2008; Payne, McClernon, & Dobbins, 2007). (p. 369)

Three of the four citations are by Payne, who developed the AMP and has a conflict of interest to show that his measure is valid and useful. None of the citations is more recent than 2010, meaning that no significant updates have been made in response to recent critical articles about the AMP.

The cited Greenwald et al. (2009) article is particularly informative because it examined convergent, discriminant, and predictive validity in a large, non-student sample.

The textbook claim that the AMP correlates with other implicit racial bias measures is supported by the r = .218 correlation with the Brief IAT measure. However, there is little evidence for discriminant validity because the AMP also correlates r = .220 and r = .208 with two explicit measures of prejudice (a feeling thermometer and a Likert scale, respectively).

Moreover, predictive validity of the AMP is shown by the correlation with voting intentions, r = .113. This correlation is low and not higher than those for the two explicit measures, r = .211 and r = .124.

Finally, the study failed to find strong pro-White biases in this largely White sample (70% White) for the AMP (M = -0.02, SD = 0.17, d = .12) and the brief IAT (M = 0.06, SD = 0.42, d = 0.14), which were not larger than the pro-White bias for explicit measures that are subject to socially desirable responding: the feeling thermometer (M = 0.35, SD = 1.63, d = .21) and the Likert scale (M = 0.35, SD = 0.86, d = .41).

These results do not justify claims that the AMP and the IAT are measures of some hidden, implicit attitudes that are only accessible by means of indirect attitude measures and that influence participants’ behavior without their knowledge. However, this is the citation provided in the textbook to support these claims.

If textbook authors had to present the actual evidence rather than a citation, such distortions of the truth would not be possible. Thus, students should demand more scientific figures and tables and fewer cute pictures in their social psychology textbooks. After all, they pay good money for them.

# 2016 Blogs

DECEMBER

12/31 ****
Review of an “eventful” 2016 (“Method Terrorists”)

12/6
A Forensic Analysis of Stapel: Fabrication or Falsification?

12/3
Replicability Analysis of Dijksterhuis’s “Enhancing Implicit Self-Esteem by Subliminal Evaluative Conditioning”

SEPTEMBER

9/13 ***
Critique of Finkel, Eastwick, & Reis’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

JUNE

6/30 ***
Wagenmakers’ Default Prior is Unrealistic

6/25 ****
A Principled Approach to Setting the Prior of the Null-Hypothesis

6/18 ***
What is the Difference between the Test of Excessive Significance and the Incredibility Index?

6/16 ****
The A Priori Probability of the Point Null-Hypothesis is not 50%

MAY

5/21
Replicability Report on Social Priming Studies with Mating Primes

5/18
Critique of Jeffrey N. Rouder, Richard D. Morey, and Eric-Jan Wagenmakers’s article “The Interplay between Subjectivity, Statistical Practice, and Psychological Science”

5/9 ***
Questionable Research Practices Invalidate Bayes-Factors Just As Much as P-Values

APRIL

4/18 *****
Replicability Report of the Ego Depletion Literature

FEBRUARY

2/16 ****
Discussion of Sterling et al.’s (1995) Seminal Article on Inflated Success Rates in Psychological Science [also recommend reading the original article]

2/10
Replicability AudiT of a 10 Study Article by Adam D. Galinsky

2/9
A Replicability AudiT of Yaacov Trope’s Publications

2/3 ***
A Critique of Finkel, Eastwick, & Reis’s Views on the Replication Crisis

JANUARY

1/31 *****
Introduction to the R-Index
[The R-Index builds on the Incredibility Index, Schimmack (2012)]

1/31
Replicability Analysis of Damisch, Stoberock, & Mussweiler (2010)
[Anonymous Submission to R-Index Blog]

1/31
Replicability Analysis of Williams & Bargh (2008)

1/14 ***
Discussion of Hoenig and Heisey’s Critique of Observed Power Calculations

# 2017 Blogs

NOVEMBER

11/29 *****
A Quantitative Book Review of John A. Bargh’s Book “Before you know it”

11/16
My Response to the Rejection of the Z-Curve manuscript from AMPPS
[Reviewer 3 is author of the competing P-Curve method]

OCTOBER

10/24 *****
Replicability Rankings of Psychology Journals (2010-2017)

SEPTEMBER

9/4 ****
Replicability Report: The Pen-Paradigm of Facial Feedback Studies

AUGUST

8/2 *****
A Comment on the Alpha Wars: Focus on Beta

MAY

5/15
Replicability AudiT of the Journal Psychological Science

MARCH

3/5 *****
Meta-Psychology: A New Discipline and a New Journal
[the journal now exists Meta-Psychology link]

FEBRUARY

2/26 ***
A Brief Introduction to Null-Hypothesis Significance Testing and Power
[1 Figure and 1500 words]

2/23 ****
On Random Measurement Error, Reliability, and Replicability

2/21 ***
Examining the Influence of Selection for Significance on Observed Power

2/2 ***** (100,000 views)
A Quantitative Review of Kahneman’s Thinking Fast and Slow Chapter on Social Priming [Co-Authored with Moritz Heene]