Yearly Archives: 2018

Estimating the Size of File-Drawers with Z-Curve

December 30, 2018 | Uncategorized | Ulrich Schimmack

Every student in psychology is introduced to the logic of Null-Hypothesis Significance Testing (NHST). The basic idea is to establish the long-run maximum probability that a significant result is a false positive result. A false positive result is called a type-I error. The standard for an acceptable type-I error risk is 5%. Statistics programs and articles often highlight results with a p-value less than 0.05. Students quickly learn that the goal of statistical analysis is to find p-values less than .05.

NHST has been criticized for many reasons. This blog post focuses on the problem that arises when NHST is used to hunt for significant results and only significant results are reported. Hunting for significant results in itself is not a problem. If a researcher conducts 100 statistical tests and reports all results, the risk of a type-I error is controlled by the significance criterion. With alpha = .05, no more than 5 of the 100 tests are expected to produce a false positive result in the long run. If, for example, 20 results are significant, it is clear that some of the significant results are true discoveries.

The problem arises when only significant results are reported (Sterling, 1959). If a researcher reports 20 significant results, it is not clear whether 20, 100, or 400 tests were conducted. However, this has important implications for the assessment of type-I errors. With 20 tests and 20 significant results, the type-I error risk is minimal. With 100 tests it is moderate (5 expected false positives, so 1 out of 4 significant results could be a false positive). With 400 tests, the 20 significant results are only 5% of all tests (20 out of 400 = 5%), and it is practically certain that at least some of the significant results are false positives. After all, 5% significant results is the expected value if all 400 studies tested true null hypotheses. So observing only 5% significant results in 400 tests suggests that some of these significant results are false positives.
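The arithmetic behind these three scenarios can be sketched in a few lines of Python. The function name `max_false_positive_share` is a label chosen for this illustration, and the calculation assumes the worst case in which every test examines a true null hypothesis.

```python
ALPHA = 0.05  # significance criterion used throughout the post

def max_false_positive_share(tests_run, significant, alpha=ALPHA):
    """Worst-case expected share of false positives among the significant results."""
    expected_false_positives = alpha * tests_run
    return min(1.0, expected_false_positives / significant)

# 20 reported significant results, three possible numbers of tests actually run
for tests_run in (20, 100, 400):
    share = max_false_positive_share(tests_run, significant=20)
    print(f"{tests_run:>3} tests, 20 significant: up to {share:.0%} false positives")
```

The share grows from 5% to 25% to 100% as the unreported number of tests grows, which is why the number of tests matters.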

The Replication Crisis and the True Type-I Error Risk

The selective publishing of only significant hypothesis tests is a major problem in psychological science (Sterling, 1959; Sterling et al., 1995), but psychologists only recently became aware of this problem (Francis, 2012; John et al., 2012; Schimmack, 2012). Once results are selected for significance, the true type-I error risk increases as a function of the actual number of tests that were conducted. While alpha is 5% in all studies, the percentage of significant results is unknown because it is unknown how many tests were conducted.

Type-I Error Risk and the File Drawer

Rosenthal (1979) introduced the concept of a file drawer. The proverbial file-drawer contains all of the unpublished studies that a researcher conducted that produced non-significant results.

If all studies had the same statistical power to produce a significant result, the size of the file-drawer would be self-evident. Studies with 50% power have a long-run probability of obtaining 50% significant results, by definition. Thus, 50% of studies also produce non-significant results. It follows that for each published significant result, there is a non-significant result in the proverbial file-drawer (File-Drawer Ratio 1:1; this simple example assumes independence of hypothesis tests).

If power were 80%, there would be only one non-significant result in the file-drawer for every 4 published significant results (File-Drawer Ratio 1:4 or 0.25 :1). However, if power is only 20%, there would be 4 non-significant results for every published significant result (File-Drawer Ratio 4:1).
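When all studies share the same power, the file-drawer ratio is simply (1 - power) / power. A minimal sketch, with `file_drawer_ratio` as a name chosen for this illustration:

```python
def file_drawer_ratio(power):
    """Non-significant results per significant result when all studies share one power."""
    return (1 - power) / power

# The three cases discussed in the text
for power in (0.5, 0.8, 0.2):
    print(f"power = {power:.0%}: file-drawer ratio = {file_drawer_ratio(power):.2f} : 1")
```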

Things are more complicated when studies vary in power. If we assume that some studies test true effects and others can only produce false positives, the probability of a significant result varies across studies. Using a simple example, assume that 80 studies test a true null hypothesis (any significant result is a false positive) and 20 studies have 50% power. In this case, we expect 14 significant results: 80 * .05 = 4 and 20 * .5 = 10, so 4 + 10 = 14.

The 5% error rate is true for the 100 studies that were conducted, but it would be wrong to believe that only 5% of the selected set of 14 studies with significant results could be false positives. In this example, we would falsely assume that at most 1 of the 14 studies is a false positive (14 * .05 = 0.7 studies). However, in this case, we know that there are actually 4 false positive results. We do get the correct estimate of the maximum number of false positives if we start with the actual number of studies that were conducted, which gives a false positive risk of 5 studies (100 * .05 = 5), or 5/14 = 36% of the significant results. Thus, up to 36% of the reported 14 studies could be false positives, and the actual risk is 7 times larger than the claim p < .05 suggests.
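The numbers in this example can be reproduced with a short calculation. This is only a sketch of the arithmetic; the variable names are chosen for illustration.

```python
ALPHA = 0.05

# (number of studies, probability of a significant result) for each group
groups = [
    (80, ALPHA),  # true null hypotheses: significant only by chance
    (20, 0.50),   # true effects tested with 50% power
]

total_studies = sum(n for n, _ in groups)
expected_significant = sum(n * p for n, p in groups)

naive_max_fp = ALPHA * expected_significant  # wrongly applies 5% to the selected set
actual_max_fp = ALPHA * total_studies        # applies 5% to all studies conducted
max_fp_share = actual_max_fp / expected_significant

print(f"expected significant results: {expected_significant:.0f}")
print(f"naive bound on false positives: {naive_max_fp:.1f}")
print(f"bound based on all studies: {actual_max_fp:.0f} ({max_fp_share:.0%} of significant results)")
```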

In short, we need to know the size of the file-drawer to estimate the percentage of reported results that could be false positives.

Estimating the Size of the File Drawer

Brunner and Schimmack (2018) developed a statistical method, z-curve, that can estimate mean power for a set of studies with heterogeneity in power, including some false positive results. The main purpose of the method was to estimate mean power for the set of published studies that produced significant results. However, the article also contained some theorems that make it possible to estimate the size of the file drawer.

Z-curve is a mixture model that models the distribution of observed test statistics (z-scores) as a mixture of studies with different levels of power. Brunner and Schimmack (2018) introduced a model with varying non-centrality parameters and weights. However, it is also possible to keep the non-centrality parameters constant so that only the weights are free model parameters. The fixed non-centrality parameters can include a value of 0 to model the presence of false positive results. The latest version of z-curve uses fixed values of 0, 1, 2, 3, 4, 5, and 6. Values greater than 6 are not needed because z-curve treats all observed z-scores greater than 6 as having a power of 1.

The power values corresponding to these fixed non-centrality parameters are 5%, 17%, 52%, 85%, 98%, 99.9%, and 100%. Only the lower power values are important for the estimation of the file-drawer because high values imply that nearly all attempts produce significant results.
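These power values can be recovered from the normal distribution. The sketch below assumes a two-sided test with alpha = .05 and counts significant results in both tails; this is an assumption about how the listed values were obtained, not a description of the z-curve code.

```python
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # about 1.96 for two-sided alpha = .05

def power_from_ncp(ncp, z_crit=Z_CRIT):
    """Probability that |z| exceeds the critical value when z ~ Normal(ncp, 1)."""
    sampling_dist = NormalDist(mu=ncp, sigma=1)
    return (1 - sampling_dist.cdf(z_crit)) + sampling_dist.cdf(-z_crit)

for ncp in range(7):
    print(f"ncp = {ncp}: power = {power_from_ncp(ncp):.1%}")
```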

To illustrate the method, I focus on the lowest three power values: 5%, 17% and 52%. Assume that we observe 100 significant results with the following mixture of power values: 40 studies have 5% power, 34 studies have 17% power, and 26 studies have 52% power, and we want to know the size of the file drawer.

To get from the observed number of studies to the number of studies that were actually run, we need to divide the number of observed studies by power (see Brunner & Schimmack, 2018, for a mathematical proof). With 5% power (i.e., false positive results), it requires 1/0.05 = 20 studies to produce 1 significant result in the long run. Thus, if 40 significant results were obtained with 5% power, 800 studies had to be run (800 * 0.05 = 40). With 17% power, it would require 200 studies to produce 34 significant results. And with 52% power, it would require 50 studies to produce 26 significant results. Thus, the total number of studies that are needed to obtain 100 significant results is 800 + 200 + 50 = 1,050. It follows that 950 (1,050 – 100) non-significant results are in the file drawer.
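The division-by-power step can be written as a small helper. This is a minimal sketch; `studies_needed` and `file_drawer` are hypothetical names, and the usage lines check single components from the text (one false positive at 5%, 34 results at 17% power, 26 results at 52% power).

```python
def studies_needed(significant, power):
    """Long-run number of studies required to obtain `significant` significant results."""
    return significant / power

def file_drawer(components):
    """components: (significant results, power) pairs; returns (studies run, file-drawer size)."""
    run = sum(studies_needed(k, p) for k, p in components)
    published = sum(k for k, _ in components)
    return run, run - published

# Single components from the text; rounding removes floating-point noise
print(round(studies_needed(1, 0.05)))
print(round(studies_needed(34, 0.17)))
print(round(studies_needed(26, 0.52)))

# Combining only the 17% and 52% components
run, drawer = file_drawer([(34, 0.17), (26, 0.52)])
print(round(run), round(drawer))
```

Adding further components to the list extends the same calculation to a full mixture.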

The following simulation illustrates how z-curve estimates the size of the file-drawer. Data are generated using normal distributions with a standard deviation of 1 and means of 0, 1, and 2. To achieve large sample accuracy, there are 1,050,000 observations (M = 0, k = 800,000; M = 1, k = 200,000; M = 2, k = 50,000).

Only significant results (to the right of the red line at z = 1.96) were used to fit the model. The non-significant results are shown to see how well the model predicts the size of the file drawer.

The gray line shows the predicted distribution by the model. It shows that the predicted distribution of non-significant results matches the observed distribution of non-significant results, although the model slightly overestimates the size of the file-drawer.

The Expected Discovery Rate is the percentage of significant results for all studies including the file-drawer. The actual discovery rate is given by the number of studies (k = 1,050,000) and the actual number of significant results (k = 99,908), which is 99,908/1,050,000 = 9.52%. The expected discovery rate is 9%, a fairly close match given the size of the file drawer.

Another way to look at the size of the file-drawer is the file-drawer ratio. That is, how many studies with non-significant results are in the file drawer for every significant result. The actual file-drawer ratio is (1,050,000 – 99,908)/99,908 = 9.51. That is, for every significant result, 9 to 10 non-significant results were not reported. The estimated file-drawer ratio is 9.7, a fairly close match.

The next example shows how z-curve performs when mean power is higher and the file-drawer is smaller. In this example, there were 100,000 cases with M = 0, 200,000 cases with M = 1, 400,000 cases with M = 2, and 300,000 cases with M = 3. The expected discovery rate for this simulation is 50%. With mean power of 50%, the file-drawer ratio is 1:1. That is, for each significant result there is one non-significant result.

The grey line shows that z-curve slightly overestimates the size of the file-drawer. However, this bias is small. The expected discovery rate is estimated to be 49% and the file-drawer ratio is estimated to be 1.05 : 1. These estimates closely match the actual results.

If power is greater than 50%, the file-drawer ratio is less than 1:1. The final simulation assumes that researchers have 80% power to test a true hypothesis, but that 20% of all studies are false positives. The mixture of actual power is 200,000 cases with M = 0, 100,000 cases with M = 2, 400,000 cases with M = 3, and 300,000 cases with M = 4. The mean power is 70%.

Once more, z-curve fits the actual data quite well. The expected discovery rate of 71% matches the actual discovery rate of 70% and the estimated file-drawer ratio of 0.4 to 1 also matches the actual file-drawer ratio of 0.44 to 1.
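The discovery rates and file-drawer ratios of the three simulated mixtures can be checked analytically, without drawing random numbers. The sketch below assumes a two-sided criterion of |z| > 1.96 and unit standard deviations; the dictionary layout and function names are chosen for this illustration.

```python
from statistics import NormalDist

Z_CRIT = 1.96  # significance threshold used in the simulations

def power_from_mean(mean):
    """Probability of |z| > Z_CRIT for z ~ Normal(mean, 1)."""
    dist = NormalDist(mu=mean, sigma=1)
    return (1 - dist.cdf(Z_CRIT)) + dist.cdf(-Z_CRIT)

# {mean of the z-score distribution: number of simulated studies}
simulations = {
    "first": {0: 800_000, 1: 200_000, 2: 50_000},
    "second": {0: 100_000, 1: 200_000, 2: 400_000, 3: 300_000},
    "third": {0: 200_000, 2: 100_000, 3: 400_000, 4: 300_000},
}

def expected_rates(mixture):
    """Return (expected discovery rate, file-drawer ratio) for a mixture."""
    total = sum(mixture.values())
    significant = sum(k * power_from_mean(m) for m, k in mixture.items())
    discovery_rate = significant / total
    return discovery_rate, (1 - discovery_rate) / discovery_rate

for name, mixture in simulations.items():
    edr, ratio = expected_rates(mixture)
    print(f"{name}: discovery rate = {edr:.1%}, file-drawer ratio = {ratio:.2f} : 1")
```

The analytic values differ slightly from the simulated ones because the simulations include random sampling error.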

More extensive simulations are needed to examine the performance of z-curve. With smaller sets of studies, random sampling error alone will produce some variability in estimates. However, large differences in file-drawer estimates such as 0.4:1 versus 10:1 are unlikely to occur by chance alone.

Real Examples

To provide an illustration with real data, I fitted z-curve to Roy F. Baumeister’s results in his most influential studies (see Baumeister audiT for the data).

Visual inspection shows that Roy F. Baumeister’s z-curve most closely matches the first simulation. The quantitative estimates confirm this impression. The expected discovery rate is estimated to be 11% and the file-drawer ratio is estimated to be 9.65 : 1. That is, for every published significant result, z-curve predicts 9 to 10 unpublished non-significant results. The figure shows that only a few non-significant results were reported in Baumeister’s articles. However, all of these non-significant results cluster in the region of marginally significant results (z > 1.65 & z < 1.96) and were interpreted as support for a hypothesis. Thus, all non-confirming evidence remained hidden in a fairly large file-drawer.

It is rare that social psychologists comment on their research practices, but in a personal communication Roy Baumeister confirmed that he has a file-drawer with non-significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

Other social psychologists have a smaller file-drawer. For example, the file-drawer ratio for Susan T. Fiske is only 2.8 : 1, which is only a third of Roy F. Baumeister’s file-drawer. Thus, while publication bias ensures that virtually everybody has a file-drawer, the size of the file-drawer can vary considerably across labs.

File Drawer of Statistical Analysis Rather than Entire Studies

It is unlikely that actual file-drawers are as large as z-curve estimates. Dropping studies with non-significant results is only one of several questionable research practices that can be used to report only significant results. For example, including several dependent variables in a study can help to produce a significant result for a single dependent variable. In this case, most studies can be published. Thus, it is more accurate to think of the file-drawer as being filled with statistical outputs with non-significant results rather than entire studies. This does not reduce the problem of questionable research practices. Undisclosed multiple comparisons within a single data set undermine the replicability of published results just as much as failures to disclose results from a whole study.

Nevertheless, z-curve estimates should not be interpreted too literally. If there were such a thing as a Replicability Bureau of Investigation (RBI), and the RBI raided the labs of a researcher, the actual size of the file-drawer might differ from the z-curve prediction because it is impossible to know which questionable research practices were actually used to report only confirming results. However, the estimated file-drawer provides some information about the credibility of the published results. Estimates below the ratio of 1:1 suggest that the data are credible. The higher the file-drawer ratio is, the less credible the published results become.

File-Drawer Estimates and Open Science

The main advantage of being able to estimate file-drawers is that it is possible to monitor research practices in psychology labs without the need of an RBI. Reforms such as a priori power calculations, preregistration and more honest reporting of non-significant results should reduce the size of file-drawers. Requirements to share all data ensure open file-drawers. Z-curve can be used to evaluate whether these reforms are actually improving psychological science.

Wagenmakers’ Crusade Against p-values

December 29, 2018 | Uncategorized | Ulrich Schimmack

More than a decade ago, Wagenmakers (2007) started his crusade against p-values. His article “A practical solution to the pervasive problems of p-values” (PPPV) has been cited over 800 times, and it is Wagenmakers’ most cited original article (he also contributed to the OSC, 2015, reproducibility project, which has already garnered over 1,000 citations).

In PPPV, Wagenmakers claims that statisticians have identified three fundamental problems of p-values: (a) p-values do not quantify statistical evidence, (b) p-values depend on hypothetical data, and (c) p-values depend on researchers’ unknown intentions.

When I read the article many years ago, statistics was a side-interest for me, and I didn’t fully understand the article. Since the replication crisis started in 2011, I have learned a lot about statistics, and I am ready to share my thoughts about Wagenmakers’ critique of p-values. In short, I think Wagenmakers’ arguments are a load of rubbish and the proposed solution to use Bayesian model comparisons is likely to make matters worse.

P-Values Depend on Hypothetical Data

Most readers of this blog post are familiar with the way p-values are computed. Some data are observed. Based on this observed data, an effect size is estimated. In addition, sampling error is computed either based on sample size alone or based on observed information about the distribution of observations (variance). The ratio of the effect size and the sampling error is used to compute a test statistic. To be clear, the same test statistics are used in frequentist statistics with p-values as in Bayesian statistics. So, any problems that occur during these steps are the same for p-values and Bayesian statistics.

What are the hypothetical data that Wagenmakers sees as a problem?

“These hypothetical data are data expected under H0, without which it is impossible to construct the sampling distribution of the test statistic t(x^rep | H0).”

Two things should be immediately obvious. First, the hypothetical data are no more or less hypothetical than the null-hypothesis. The null-hypothesis is hypothetical (hypothesis – hypothetical, see the connection) and based on the null-hypothesis predictions about the distribution of a test-statistic are made. The actual data are then compared to this prediction. There are no hypothetical data. There is a hypothetical distribution and an actual test statistic. Inferences are based on the comparison. Second, the “hypothetical data” that are expected under H0 are also expected in a Bayesian statistical framework because the same sampling distribution is used to compute the Bayesian Information Criterion or a Bayes Factor.

In short, it is easy to see that Wagenmakers’ problem is not a problem at all. Theories and hypotheses are abstractions. To use inferential statistics, the predictions have to be translated into a sampling distribution of a test statistic.

Wagenmakers presents an example from Pratt (1962) in full to drive home his point; and I reproduce this example again in full.

An engineer draws a random sample of electron tubes and measures the plate voltage under certain conditions with a very accurate volt-meter, accurate enough so that measurement error is negligible compared with the variability of the tubes. A statistician examines the measurements, which look normally distributed and vary from 75 to 99 volts with a mean of 87 and a standard deviation of 4. He makes the ordinary normal analysis, giving a confidence interval for the true mean. Later he visits the engineer’s laboratory, and notices that the volt meter used reads only as far as 100, so the population appears to be “censored.” This necessitates a new analysis, if the statistician is orthodox. However, the engineer says he has another meter, equally accurate and reading to 1000 volts, which he would have used if any voltage had been over 100. This is a relief to the orthodox statistician, because it means the population was effectively uncensored after all. But the next day the engineer telephones and says: “I just discovered my high-range volt-meter was not working the day I did the experiment you analyzed for me.” The statistician ascertains that the engineer would not have held up the experiment until the meter was fixed, and informs him that a new analysis will be required. The engineer is astounded. He says: “But the experiment turned out just the same as if the high-range meter had been working. I obtained the precise voltages of my sample anyway, so I learned exactly what I would have learned if the high-range meter had been available. Next you’ll be asking me about my oscilloscope.”

What is the problem here? Truncating the measure at 100 changes the statistical model. If we have to suspect that the data are truncated, we cannot use a statistical model that assumes a normal distribution. We could use a non-parametric test to get a p-value or a more sophisticated model that models the truncation process. This model would notice that there is little truncation in these hypothetical data because there are actually no values greater than 100.

Thus, this example merely illustrates that statistical inferences depend on the proper modeling of the sampling distribution of a test statistic. All statistical inferences are only valid if the assumptions of the statistical model hold. Otherwise, all bets are off. Most important, this is also true for Bayesian statistics because they rely on the same test statistics and distribution assumptions as p-values. There is nothing magical about Bayes Factors that would allow them to produce valid inferences when distribution assumptions are violated.

P-Values Depend on Researchers’ Intentions

The second alleged problem of p-values is that they depend on researchers’ intentions.

“The same data may yield quite different p values, depending on the intention with which the experiment was carried out.”

This fact is illustrated with several examples like this one.

Imagine that you answered 9 out of 12 questions about statistics correctly (if it were possible to say what is correct and what is false), and I wanted to compute the p-value for the hypothesis that you were simply guessing. The two-sided p-value is p = .146 if we assume that I planned to ask 12 questions in total. However, if I had instead planned to keep asking questions until you made your third mistake, the p-value is .033.
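In Wagenmakers’ (2007) example, the two p-values correspond to two sampling plans for the same data: asking a fixed number of 12 questions (binomial sampling), or asking questions until the third mistake (negative binomial sampling). The following sketch reproduces both values, assuming a guessing probability of .5; the variable names are chosen for illustration.

```python
from math import comb

N_QUESTIONS, N_CORRECT, N_WRONG = 12, 9, 3

# Plan A: ask exactly 12 questions (binomial sampling), two-sided test of guessing.
upper_tail = sum(comb(N_QUESTIONS, k) for k in range(N_CORRECT, N_QUESTIONS + 1))
p_fixed_n = 2 * upper_tail / 2**N_QUESTIONS

# Plan B: keep asking until the third mistake (negative binomial sampling).
# The outcome is at least this extreme if the third mistake occurs on question 12
# or later, i.e. if the first 11 questions contain at most 2 mistakes.
at_most_two_mistakes = sum(comb(N_QUESTIONS - 1, k) for k in range(N_WRONG))
p_stop_at_third_mistake = at_most_two_mistakes / 2**(N_QUESTIONS - 1)

print(f"fixed number of questions: p = {p_fixed_n:.3f}")
print(f"stop at third mistake:     p = {p_stop_at_third_mistake:.3f}")
```

The data are identical in both cases; only the assumed sampling distribution of the test statistic differs.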

Since 2011, it has been well known that data peeking alters the statistical model and that optional stopping alters p-values. If the decision to terminate data collection was in any way systematically influenced by some previous results, a p-value that assumes no data peeking occurred is wrong because it is based on the wrong statistical model. Undisclosed checking of data is now known as a questionable research practice (John et al., 2012). Thus, Wagenmakers’ example merely shows that p-values cannot be trusted when researchers engage in questionable research practices. It does not show that p-values are inherently flawed.
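A small simulation illustrates how optional stopping inflates the type-I error rate. The design is hypothetical and not taken from the post: the null hypothesis is true, a z-test with known standard deviation is computed after every 10 observations, and data collection stops at the first significant look or at 100 observations.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = .05

def significant_at(z_sum, n):
    """Two-sided z-test of mean = 0 for n observations with known SD = 1."""
    return abs(z_sum / n ** 0.5) > Z_CRIT

def simulate(n_sims=2000, batch=10, max_n=100, seed=1):
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        total, looks = 0.0, []
        for n in range(batch, max_n + 1, batch):
            total += sum(rng.gauss(0, 1) for _ in range(batch))  # null is true
            looks.append(significant_at(total, n))
        fixed_hits += looks[-1]      # analyze once, at the planned sample size
        peeking_hits += any(looks)   # stop at the first significant look
    return fixed_hits / n_sims, peeking_hits / n_sims

fixed_rate, peeking_rate = simulate()
print(f"type-I error rate, single planned test: {fixed_rate:.3f}")
print(f"type-I error rate, peeking every 10:    {peeking_rate:.3f}")
```

The single planned test stays near the nominal 5%, whereas the peeking strategy rejects the true null hypothesis far more often.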

How do Bayesian statistics avoid this problem? They avoid the problem only partially. Bayes Factors always express information as a comparison between two models. As long as researchers peek at the data and continue because the data do not favor either model, data peeking does not introduce a bias. However, if they peek and continue data collection until the data favor one model, Bayesian statistics would be just as biased by data peeking as p-values. Even data peeking with inconclusive data can be biased if one of the models is implausible and would never receive support. In this case, the data can only produce evidence for one model or be undecided, which leads to the same problem that Wagenmakers sees with p-values. For example, testing the null-hypothesis against Wagenmakers’ prior that assumes large effects of 1 SD or more would eventually produce evidence for the null-hypothesis, even if it were false, because the data can never produce support for the implausible alternative hypothesis.

In conclusion, the second argument is a good reason for preregistration and against the use of questionable research practices, but not a good argument against p-values.

P Values Do Not Quantify Statistical Evidence

The third claim is probably the most surprising for users of p-values. The main reason for computing p-values is that they are considered to be a common metric that can be used across different types of studies. Everything else being equal, a lower p-value is assumed to provide stronger evidence against the null-hypothesis.

In the Fisherian framework of statistical hypothesis testing, a p value is meant to indicate “the strength of the evidence against the hypothesis” (Fisher, 1958, p. 80).

What are the chances that all textbook writers got this wrong?

To make his point, Wagenmakers uses the ambiguity of everyday language and decides that “the most common and well-worked-out definition is the Bayesian definition.”

Nobody is surprised that p-values do not provide evidence given a Bayesian definition of evidence, just like nobody would be surprised that Bayes Factors do not provide information about the long-run probability of false positive discoveries.

What is surprising is that Wagenmakers provides no argument. Instead, he reviews some surveys of statisticians and psychologists that examined the influence of sample size on the evaluation of identical p-values.

For example, which study produces stronger evidence against the null-hypothesis: a study with N = 300 and p = .01 or a study with N = 30 and p = .01? Most statisticians favor the larger study. A quick survey in the Psychological Methods Discussion Group confirmed this finding: 37 respondents favored the larger sample, 7 said there was no difference, and 4 favored the smaller sample.

Although this is interesting, it does not answer the question whether a p-value of .0001 provides stronger evidence against the null-hypothesis than a p-value of .10, which is the question at hand.

So, Wagenmakers’ strongest argument against p-values, that they are misinterpreted as a measure of strength of evidence, is not an argument at all.

In short, Wagenmakers has been successful in casting doubt about the use of p-values amongst psychologists. He was able to do so because statistics training in psychology is poor and most users of p-values have only a vague understanding of the underlying statistical theory. As a result, they are swayed by strong claims that they cannot evaluate. It took me some time, and time away from my original research, to understand these issues. In my opinion, Wagenmakers’ critique falls apart under closer scrutiny.

The main problem of p-values is that they are not Bayesian, but that is only a problem if you like Bayesian statistics. For most practical purposes, p-values and Bayes-Factors lead to the same conclusions regarding the rejection of the null-hypothesis. In addition, Bayes-Factors offer the promise that they can provide evidence for the nil-hypothesis. This promise is false, but that is the topic of another blog post.

The real problem in psychological science is not the use of p-values, but the abuse of p-values. That is, a study with N = 30 participants and p = .01 would produce just as much evidence as a study with N = 300 and p = .01, if we did not have to worry that the researcher with N = 30 also ran 300 participants, but only presented the results of one study that produced a significant result by chance. For this reason, I have invested my time and energy in studying the real power of studies to produce significant results and in detecting the use of questionable research practices. It does not matter to me whether effect size estimates and sampling error are reported as confidence intervals, converted into p-values, or reported as Bayes Factors. What matters is that the results are credible and strong claims are supported by strong evidence, no matter how it is reported.

Related blog Posts

Why Wagenmakers is wrong (about Bayesian Analysis of Bem, 2011)

Wagenmakers’ Prior is Inconsistent with Empirical Results in Psychology

Confidence Intervals are More Informative than Bayes Factors

The Bayesian Mixture Model Does Not Estimate the False Positive Rate

Wagenmakers Confuses Evidence Against H1 with Evidence For H0

2018 Journal Replicability Rankings

December 29, 2018 | Uncategorized | Ulrich Schimmack

This table shows the Replicability Rankings for 117 psychology journals.

Journals are ranked based on the replicability estimates for the year 2018.

Replicability estimates are obtained from z-curve analyses of automatically extracted test statistics. If you click on the journal name, you can see plots of the z-curve distributions for the years 2010-2018.

Rank	Journal	2018	2017	2016	2015	2014	2013	2012	2011	2010
1	European Journal of Developmental Psychology	89	86	83	63	73	75	78	78	67
2	Journal of Cognition and Development	89	74	77	67	65	68	55	66	69
3	Political Psychology	88	75	78	74	70	71	73	43	66
4	Social Development	84	72	78	62	74	71	71	73	72
5	Social Psychology	84	74	74	72	74	70	64	76	72
6	Depression & Anxiety	83	75	78	70	73	80	80	89	86
7	Journal of Counseling Psychology	83	69	78	77	70	78	77	62	82
8	Personal Relationships	83	76	71	70	69	65	70	58	66
9	Sex Roles	83	81	80	73	72	76	79	73	73
10	Journal of Occupational and Organizational Psychology	82	72	82	79	70	74	77	70	63
11	Cognitive Psychology	81	75	80	72	75	77	71	81	74
12	Epilepsy & Behavior	81	82	81	79	85	84	79	89	76
13	Experimental Psychology	81	74	72	71	76	72	74	72	69
14	Journal of Consumer Behaviour	81	69	79	75	73	81	73	83	79
15	Journal of Health Psychology	81	63	71	79	78	80	76	63	71
16	Journal of Pain	81	67	77	72	80	72	77	73	70
17	Law and Human Behavior	81	75	76	69	60	74	76	83	72
18	Psychology of Religion and Spirituality	81	71	80	80	75	70	55	73	75
19	Social Psychological and Personality Science	81	76	65	60	64	61	57	65	54
20	Evolution & Human Behavior	80	73	80	75	75	62	64	69	62
21	Journal of Personality	80	77	73	68	72	69	72	60	66
22	JPSP-Attitudes & Social Cognition	80	79	55	74	69	49	61	61	60
23	Journal of Vocational Behavior	80	74	85	83	65	83	79	85	77
24	Memory and Cognition	80	80	74	79	76	76	79	76	77
25	Attention, Perception and Psychophysics	79	79	70	73	76	77	80	74	73
26	Consciousness and Cognition	79	78	69	69	74	66	70	73	73
27	Journal of Cognitive Psychology	79	75	77	74	77	72	73	79	85
28	Journal of Educational Psychology	79	78	72	67	75	74	76	77	83
29	Journal of Nonverbal Behavior	79	89	73	63	72	76	71	63	64
30	Journal of Research in Personality	79	78	76	81	77	76	70	72	68
31	Psychophysiology	79	78	78	70	72	68	71	77	78
32	Quarterly Journal of Experimental Psychology	79	76	76	75	74	73	76	75	72
33	Aggressive Behavior	78	72	77	67	70	60	69	79	68
34	Evolutionary Psychology	78	78	82	77	76	81	73	80	69
35	Health Psychology	78	70	61	66	66	67	59	69	68
36	J. of Exp. Psychology – Human Perception and Performance	78	76	77	75	75	75	77	78	76
37	J. of Exp. Psychology – Learning, Memory & Cognition	78	79	78	77	81	74	76	71	80
38	Psychonomic Bulletin and Review	78	75	77	82	78	83	71	70	78
39	British Journal of Psychology	77	77	77	82	75	71	78	79	69
40	British Journal of Developmental Psychology	77	71	76	74	64	67	85	77	77
41	Journal of Cross-Cultural Psychology	77	75	75	80	77	80	71	77	77
42	Journal of Experimental Psychology – General	77	77	74	74	72	74	66	73	68
43	Journal of Family Psychology	77	69	62	72	71	70	64	67	68
44	Journal of Memory and Language	77	80	82	79	74	75	71	79	73
45	JPSP-Personality Processes and Individual Differences	77	65	74	71	73	65	68	70	61
46	Personality and Individual Differences	77	76	74	77	77	76	73	71	70
47	Appetite	76	77	71	64	66	73	71	72	73
48	Cognition	76	76	73	72	74	76	74	72	72
49	European Journal of Personality	76	76	79	68	81	67	67	70	79
50	Journal of Anxiety Disorders	76	79	73	69	76	75	78	71	74
51	Journal of Occupational Health Psychology	76	80	73	72	73	54	75	79	71
52	Cognition and Emotion	75	65	68	74	73	85	85	81	81
53	Journal of Affective Disorders	75	75	84	85	77	84	78	72	71
54	Journal of Child and Family Studies	75	73	72	69	68	74	73	74	73
55	Journal of Experimental Social Psychology	75	71	67	62	61	56	54	57	55
56	Journal of Social and Personal Relationships	75	71	84	59	57	69	61	78	82
57	Psychological Science	75	71	68	69	65	65	63	61	61
58	Cognitive Therapy and Research	74	75	70	71	61	75	74	67	65
59	Frontiers in Psychology	74	76	74	73	73	72	72	68	82
60	Journal of Applied Social Psychology	74	71	79	67	72	69	77	71	75
61	Journal of Religion and Health	74	74	85	80	76	76	89	80	68
62	Psychological Medicine	74	73	82	67	75	78	66	77	72
63	Animal Behavior	73	77	71	69	70	71	71	70	75
64	Child Development	73	66	73	73	68	69	74	71	73
65	Cognitive Development	73	80	74	82	71	71	74	69	63
66	Developmental Psychology	73	75	75	74	75	72	67	68	66
67	Emotion	73	73	71	68	69	72	68	68	73
68	Frontiers in Human Neuroscience	73	70	74	73	74	76	78	76	72
69	Judgment and Decision Making	73	81	78	76	77	68	73	70	71
70	Journal of Experimental Child Psychology	73	72	71	77	75	72	72	71	74
71	Journal of Social Psychology	73	75	73	70	65	62	77	71	75
72	Memory	73	74	79	67	87	76	77	84	88
73	Perception	73	75	76	78	73	79	82	89	93
74	Annals of Behavioral Medicine	72	70	73	63	70	75	77	72	72
75	Archives of Sexual Behavior	72	78	78	79	75	81	78	76	87
76	Frontiers in Behavioral Neuroscience	72	74	70	70	67	70	72	70	67
77	International Journal of Psychophysiology	72	74	64	70	67	62	71	70	65
78	Psychology and Aging	72	78	80	76	81	71	77	76	75
79	Behaviour Research and Therapy	71	70	72	75	76	72	77	66	69
80	Journal of Organizational Psychology	71	73	71	66	73	62	72	66	75
81	Journal of Positive Psychology	71	81	69	72	74	62	67	63	73
82	JPSP-Interpersonal Relationships and Group Processes	71	68	73	64	61	62	56	61	54
83	Organizational Behavior and Human Decision Processes	71	68	72	69	69	72	69	71	63
84	Personality Disorders	71	87	64	63	72	77	52	55	84
85	Personality and Social Psychology Bulletin	71	73	69	64	64	60	59	61	62
86	Acta Psychologica	70	77	73	73	76	74	74	76	74
87	British Journal of Social Psychology	70	78	63	67	61	63	59	70	63
88	Hormones & Behavior	70	61	63	62	62	62	61	66	63
89	Journal of Abnormal Psychology	70	69	64	63	65	69	66	73	70
90	Journal of Consulting and Clinical Psychology	70	77	61	66	65	62	65	66	65
91	Journal of Experimental Psychology – Applied	70	80	69	68	72	65	75	70	71
92	Journal of Happiness Studies	70	56	79	78	78	80	77	88	77
93	Behavioural Brain Research	69	71	68	74	67	70	71	71	72
94	Cognitive Behavioral Therapy	69	75	80	76	62	70	80	72	62
95	Journal of Applied Psychology	69	79	80	70	74	69	73	69	71
96	Journal of Autism and Developmental Disorders	69	71	72	70	65	72	67	67	70
97	Psychology of Music	69	80	79	72	73	75	72	82	87
98	Biological Psychology	68	63	66	70	66	66	61	70	70
99	Developmental Science	68	73	67	69	65	71	67	68	67
100	Journal of Comparative Psychology	68	66	75	75	79	80	71	68	62
101	Psychology and Marketing	68	70	70	65	76	65	71	63	71
102	Psychoneuroendocrinology	68	65	66	63	63	64	62	64	61
103	Psychopharmacology	68	74	75	74	71	73	75	71	71
104	Behavior Therapy	67	71	69	71	74	74	75	63	77
105	Developmental Psychobiology	67	63	66	65	69	70	70	71	64
106	Journal of Consumer Psychology	66	56	53	67	66	64	59	59	64
107	Journal of Consumer Research	66	64	63	51	63	48	61	60	64
108	Journal of Individual Differences	65	82	65	74	63	86	55	91	70
109	Journal of Youth and Adolescence	65	70	84	77	81	76	74	74	75
110	European Journal of Social Psychology	64	73	76	64	71	67	56	68	66
111	Group Processes & Intergroup Relations	64	68	66	70	67	69	65	67	59
112	Journal of Research on Adolescence	62	67	71	67	64	72	74	78	67
113	Journal of Child Psychology and Psychiatry and Allied Disciplines	61	68	67	67	62	68	71	57	62
114	Motivation and Emotion	61	72	63	66	64	66	63	81	67
115	Infancy	59	60	61	61	66	67	63	71	53
116	Behavioral Neuroscience	57	73	68	70	70	68	70	66	72
117	Self and Identity	57	68	68	57	72	71	72	70	73

Social Psychology Textbook audiT: Ego Depletion

December 28, 2018 | Uncategorized | Ulrich Schimmack

Since 2011, social psychology has been in a crisis of confidence. Many published results were obtained with questionable research practices (QRPs) and failed to replicate. The Open Science Collaboration found that only 25% of social psychological results could be successfully replicated (OSC, 2015).

One of the biggest scandals in social psychology is the ego-depletion literature. The main assumption of ego-depletion theory is that working on a cognitively demanding task lowers individuals’ ability to do well on a second demanding task.

A meta-analysis in 2010 seemed to show that ego-depletion effects in laboratory studies are robust and have a moderate effect size (d = .5). However, this meta-analysis did not control for the influence of questionable research practices. A subsequent meta-analysis did take QRPs into account and found no evidence for the effect.

This meta-analysis triggered a crisis of confidence in the ego-depletion effect and an initiative to investigate ego-depletion in a massive replication attempt. The outcome of this major replication study confirmed the finding of the second meta-analysis. There was no evidence for an ego-depletion effect, despite the massive statistical power to detect even a small effect (d = .2) (Hagger et al., 2016).

There have been different responses to the replication failure. The inventor of ego-depletion theory, Roy F. Baumeister, blames the design of the replication study for the replication failure (cf. Drummond & Philipp, 2017). However, others, including myself (Schimmack, 2016), pointed out that Baumeister and colleagues used QRPs in their original studies and therefore do not provide credible evidence for the effect. Some ego-depletion researchers, like Michael Inzlicht (pdf), openly expressed concern that ego-depletion may not be real.

Social psychology textbooks responded differently to these developments in ego-depletion research.

Gilovich et al. (2019, 5th ed.) simply removed ego-depletion from their textbook, whereas the 3rd edition (2013) covered ego-depletion, including the even more controversial claim that links ego-depletion to blood glucose levels, which was also obtained with QRPs (Schimmack, 2012).

In contrast, Myers and Twenge (2018, 13th ed.) continue to cover ego-depletion without mentioning any replication failures or concerns about the robustness of the evidence.

Neither treatment of the doubts about ego-depletion is acceptable. Simply removing ego-depletion misses the opportunity to teach students to think critically about social psychological research, which probably is the point. However, presenting ego-depletion without mentioning the replication failures is even worse. Examples like this show that social psychologists are unwilling to be open about the recent developments in their field and that students cannot trust social psychology textbooks to provide a balanced and scientific introduction to the field.

An Introduction to Anti-Social Psychology

December 28, 2018 | Uncategorized | Ulrich Schimmack

Social psychology textbooks aim to inform students about social psychology. However, the authors also want to promote social psychology. As a result, they present social psychology in the most favorable light. The result is textbooks that hide embarrassing facts (most textbook findings probably do not replicate) and do not allow students to think critically about social psychology.

Ideally, somebody would publish a balanced and objective textbook. This blog post has a different aim. It introduces students of social psychology to critical evidence that is missing from most social psychology textbooks.

The evidence presented in the “anti-social” textbook may be biased against social psychology, but it provides students with some information that they can use to make up their own mind about social psychology.

The aim is to use a Hegelian approach of teaching, where the textbook provides a thesis (e.g., self-perception theory explains how individuals form attitudes), the anti-social textbook provides the anti-thesis (self-perception theory is an outdated attempt to explain attitudes from the perspective of radical behaviorism), and then students can synthesize the conflicting claims into their own perspective on social psychology.

Affective Misattribution Paradigm (AMP)

Culture of Honor (Northern vs. Southern US States)

Ease of Retrieval

Ego Depletion

Implicit Association Test (IAT)

Priming (Subliminal Priming; Unconscious Processes)

Replication, Replicability, Replication Outcomes in Social Psychology

Self-Knowledge (Accuracy and Biases)

Stereotype Threat

Forthcoming

Self-Perception Theory

Terror-Management Theory

Social Psychology Textbook audit: The (In)Accuracy of Self-Knowledge

December 28, 2018 | Uncategorized | Ulrich Schimmack

Gilovich, Keltner, Chen, & Nisbett, 2019, Social Psychology (5ed), p. 65-66

To understand social psychologists’ claims about accuracy of self-knowledge, it is important to be aware of the person-situation debate in the 1970s. On the one hand, social psychologists maintained that self-concepts are largely illusory and do not predict behavior. On the other hand, personality psychologists assumed that people have some accurate self-knowledge about stable personality dispositions that influence their behavior.

It is also important to know that textbook author Nisbett was actively engaged in the person-situation controversy. It is therefore not surprising that the textbook chapter about accuracy in self-knowledge is strongly biased and fails to mention decades of research that has demonstrated convergent validity of self-ratings and informant ratings of personality (e.g., Connelly & Ones, 2010, for a meta-analysis).

Instead, students are given the impression that self-knowledge is rather poor.

Recall the research described in Chapter 1 in which Nisbett and Wilson (1977) discovered that people can readily provide explanations for their behaviors that are not in fact accurate. Someone might say that she picked her favorite nightgown because of its texture or color, when in fact she picked it out because it was the last one she saw. Even our ability to report accurately on more important decisions – such as why we chose job candidate A over job candidate B, why we like Joe better than Jack, or how we solved a particular problem – can be wide of the mark (Nisbett & Wilson, 1977).

…much of the time, we draw inaccurate conclusions about the self because we don’t have access to certain mental processes, such as those that lead us to prefer objects we looked at last (Wilson, 2002, Wilson & Dunn, 2004).

… such mental processes are nonconscious, occurring outside of our awareness, leaving us to generate alternative, plausible accounts for our preferences and behaviors instead.

Given such roadblocks, how can a person gain accurate self-knowledge?

The textbook doesn’t provide an answer to its own question. One possible answer could have been that it doesn’t require introspection to know oneself. Later the textbook introduces self-perception theory, which states that we can know ourselves like we know other people by observing ourselves and making attributions about our behaviors. For example, if I regularly order vanilla ice cream rather than chocolate ice cream, I can infer that I have a preference for vanilla; I do not need to know why I have a preference for vanilla (e.g., my mother gave me formula as a baby with vanilla flavor).

In any case, Nisbett’s musings about limitations of introspection fail to explain how people acquire accurate self-knowledge about their personality, values, happiness, and past behaviors, nor do they cite relevant studies by personality psychologists.

The section on accuracy of self-knowledge ends with studies by Vazire (Vazire & Mehl, 2008; Vazire, 2010; Vazire & Carlson, 2011). These studies show that, on average, self-ratings and informant ratings are equally good predictors of an objective criterion of behavior. They also suggest that the self is better able to make judgments about internal states.

“Because we have greater information than others do about our internal states (such as our internal thoughts and feelings), we are better judges of our internal traits (being optimistic or pessimistic, for instance).”

Students may be a little bit confused by the earlier claims that introspection often leads us astray and the concluding statement that the self is most accurate in judging internal states. Apparently, introspection does provide some valuable information that can be used to know oneself.

In conclusion, social psychologists have ignored accuracy in self-knowledge because they were more interested in demonstrating biases and errors in human information processing. The textbook is stuck in some old studies on limits of introspection and does not review decades of research on accuracy of self-knowledge (e.g., Funder, 1995). To learn about accuracy of self-knowledge, students are better off taking a course on personality psychology.

Auditing Social Psychology Textbooks: Hitler had High Self-Esteem

December 28, 2018 | Uncategorized | Ulrich Schimmack

Social psychologists see themselves as psychological “scientists,” who study people and therefore believe that they know people better than you or me. However, often their claims are not based on credible scientific evidence and are merely personal opinions disguised as science.

For example, a popular undergraduate psychology textbook claims that

“Hitler had high self-esteem.”

quoting an article that has been cited over 500 times in the journal “Psychological Science in the Public Interest.” At the end of the article with the title “Does High Self-Esteem Cause Better Performance, Interpersonal Success, Happiness, or Healthier Lifestyles?” the authors write:

“High self-esteem feels good and fosters initiative. It may still prove a useful tool to promote success and virtue, but it should be clearly and explicitly linked to desirable behavior. After all, Hitler had very high self-esteem and plenty of initiative, too, but those were hardly guarantees of ethical behavior.”

In the textbook this quote is linked to boys who engage in sex at an “inappropriately young age,” which is not further specified (in Canada this would be 14, according to recent statistics).

“High self-esteem does have some benefits—it fosters initiative, resilience, and pleasant feelings (Baumeister & others, 2003). Yet teen males who engage in sexual activity at an “inappropriately young age” tend to have higher than average self-esteem. So do teen gang leaders, extreme ethnocentrists, terrorists, and men in prison for committing violent crimes (Bushman & Baumeister, 2002; Dawes, 1994, 1998). “Hitler had very high self-esteem,” note Baumeister and co-authors (2003).” (Myers, 2011, Social Psychology, 12th edition)

Undergraduate students pay a lot of money to be informed that people with high self-esteem are like sexual deviants, terrorists, violent criminals, and Hitler (maybe we should add scientists with big claims and small samples to the list).

The problem is that this is not even true. Students who work with me on fact checking the textbook found this quote in the original article.

“There was no [!] significant difference in self-esteem scores between violent offenders and non-offenders, Ms = 28.90 and 28.89, respectively, t(7653) = 0.02, p > .9, d = 0.0001.”

Although the df of the t-test look impressive, the study compared 63 violent offenders to 7590 unmatched, mostly undergraduate student (gender not specified, probably mostly female) participants. So the sampling error of this study is high and the theoretical importance of comparing these two groups is questionable.
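The point about sampling error can be made concrete with a rough calculation. The sketch below uses the standard large-sample approximation for the standard error of a standardized mean difference; the function name `se_cohens_d` is only illustrative, and the group sizes are the ones quoted above. With unequal groups, precision is limited by the smaller group, so the large df do not help much.

```python
import math

def se_cohens_d(n1: int, n2: int, d: float = 0.0) -> float:
    """Large-sample approximation of the standard error of Cohen's d
    for two independent groups of size n1 and n2."""
    return math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))

# Group sizes as reported in the text above
n_offenders = 63
n_comparison = 7590

se = se_cohens_d(n_offenders, n_comparison)
half_width = 1.96 * se  # half-width of an approximate 95% confidence interval

print(f"SE(d) = {se:.3f}")
print(f"approximate 95% CI for d: 0 +/- {half_width:.2f}")
```

Under this approximation the interval around d = 0 spans roughly a quarter of a standard deviation in either direction, so the comparison cannot rule out a small difference in either direction.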

[The latest edition (13th, 2018) still contains the quote.]

How Many Correct Citations Could be False Positives?

Of course, the example above is an exception. Most of the time a cited reference contains an empirical finding that is consistent with the textbook claim. However, this does not mean that textbook findings are based on credible and replicable evidence. Until recently it was common to assume that statistical significance ensures that most published results are true positives (i.e., not a false positive random finding). However, this is only the case if all results are reported. It has been known since 1959 that this is not the case in psychology (Sterling, 1959). Jerry Brunner and I developed a statistical tool that can be used to clean up the existing literature. Rather than actually redoing 50 years of research, we use the statistical results reported in original studies to apply a significance filter post-hoc. Our tool is called zcurve. Below I used zcurve to examine the replicability of studies that were used in chapter 2 about the self.

More detailed information about the interpretation of the graph above is provided elsewhere (link). In short, for each citation in the textbook chapter that is used as evidence for a claim, a team of undergraduate students retrieved the cited article and extracted the main statistical result that matches the textbook claim. These statistical results are then converted into a z-score that reflects the strength of evidence for a claim. Only significant results are important because non-significant results cannot support an empirical claim. Zcurve fits a model to the (density) distribution of significant z-scores (z-scores > 1.96). The shape of the density distribution provides information about the probability that a randomly drawn study from the set would replicate (i.e., reproduce a significant result). The grey line shows the predicted distribution by zcurve. It matches the observed density in dark blue well. Simulation studies show good performance of zcurve.

Zcurve estimates that the average replicability of studies in this chapter is 56%. This number would be reassuring if all studies had 56% power. This would mean that all studies are true positives and if a study were replicated every other study would be successful. However, reality does not match this rosy scenario. In reality, studies vary in replicability. Studies with z-scores greater than 5 have 99% replicability (see numbers below x-axis). However, studies with just significant results (z < 2.5) have only 21% replicability. As you can see, there are a lot more studies with z < 2.5 than studies with z > 5. So there are more studies with low replicability than studies with high replicability.

The next plot shows model fit (higher numbers = worse fit) for zcurve models with a fixed proportion of false positives. If the data are inconsistent with a fixed proportion of false positives, model fit decreases (higher numbers).
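The preprocessing step described above (converting reported results into z-scores and keeping only the significant ones) can be sketched as follows. This is not zcurve itself: the model fitting is not shown, the p-values are made up for illustration, and the helper name `p_to_z` is an assumption rather than part of any zcurve software.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)

# Two-sided significance criterion for alpha = .05 (about 1.96)
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.05 / 2)

def p_to_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return STD_NORMAL.inv_cdf(1 - p_two_sided / 2)

# Hypothetical p-values extracted from cited articles (illustration only)
p_values = [0.049, 0.03, 0.20, 0.004, 0.0000003]

z_scores = [p_to_z(p) for p in p_values]

# Significance filter: only z-scores above the criterion enter the model
significant = [z for z in z_scores if z > Z_CRIT]

for p, z in zip(p_values, z_scores):
    status = "kept" if z > Z_CRIT else "dropped"
    print(f"p = {p:<10g} z = {z:5.2f}  {status}")
```

In this made-up set, the result with p = .20 is dropped and the remaining z-scores would form the distribution to which a model is fitted.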

The graph shows that models with 100%, 90% or 80% false positives clearly do not fit the data as well as models with fewer false positives. This shows that some textbook claims are based on solid empirical evidence. However, model fit for models with 0% to 60% false positives looks very similar. Thus, it is possible that the majority of claims in the self chapter of this textbook are false positives. It is even more problematic that textbook claims are often based on a single study with a student sample at one university. Social psychologists have warned repeatedly that their findings are very sensitive to minute variations in studies, which makes it difficult to replicate these effects even under very similar conditions (Van Bavel et al., 2016), and that it is impossible to reproduce exactly the same experimental conditions (Stroebe and Strack, 2014). Thus, the zcurve estimate of 56% replicability is a wildly optimistic estimate of replicability in actual replication studies. In fact, the average replicability of studies in social psychology is only 25% (Open Science Collaboration, 2015).

Conclusion

Social psychology textbooks present many findings as if they are established facts, when this is not the case. It is time to audit psychology textbooks to ensure that students receive accurate scientific information to inform their beliefs about human behavior. Ideally, textbook authors will revise their textbooks to make them more scientific and instructors will choose textbooks based on the credibility of the evidence in textbooks.

Social Psych Textbook AudiT: The Affective Misattribution Paradigm

December 23, 2018 | Uncategorized | Ulrich Schimmack

The Affective Misattribution Paradigm (AMP) is a popular measure of attitudes. The main promise of this measure is that it measures implicit attitudes, where the term “implicit” suggests that participants are (a) not aware of their attitude, (b) not aware that their attitude is being measured, or (c) aware that their attitude is being measured, but unable to control (fake) their responses.

The picture below illustrates the basic principle of the AMP. The critical attitude object is the picture of a tropical beach. The aim is to measure your attitude towards tropical beaches. However, the task is presented as a task to evaluate the Chinese character and to ignore the tropical beach. The problem for participants is that the Chinese character elicits no strong emotion (unless you are Chinese and know that the character means death), while the picture of the beach elicits a positive emotional response for many participants. The proposition of the AMP is that participants involuntarily rely on their emotional response to the tropical beach (called the prime) to judge the character (called the target).

An alternative version of the AMP would present the prime subliminally; that is, the presentation would be so short and masked with another image that participants cannot identify the prime picture. This would make the AMP an implicit measure of attitudes without awareness of the true source of the evaluation. However, subliminal presentations are not reliable enough to measure attitudes (Payne, 2017).

However, the presentation of prime stimuli in plain view makes it clear that participants are aware of the prime. The question is whether they are aware that their emotional response is elicited by the prime rather than the target. Maybe they are simply too lazy to bother controlling their attention or response, which is more effortful than to simply report the emotional response that was elicited.

Three studies have provided evidence that participants are aware of the true source of their feelings and can control their responses.

Bar-Anan and Nosek (2012) asked participants how they made their responses and found priming effects for participants who stated that their responses were guided by the prime rather than the target.

Teige-Mocigemba, Penzl, Becker, Henn, and Klauer (2015) simply instructed participants to respond opposite to their emotional responses and found that participants were able to do so.

Hazlett and Berinsky (2018) gave students small monetary incentives to control their automatic emotional responses to primes. The key finding was that providing a monetary incentive further decreased the influence of primes on participants’ responses over a simple instruction to ignore the primes. This supports the motivation hypothesis that participants are able to control their responses, but lack motivation to do so unless there is a reason for it.

In conclusion, the AMP is not an implicit measure of attitudes in the sense that participants are unaware that their attitude is being measured or unable to control their responses.

It is also noteworthy that the AMP has modest correlations with implicit measures of attitudes like the Implicit Association Test.

The problem with the low correlation between the IAT and AMP is that both tests are promoted as measures of implicit attitudes, but the low correlation means that they are poor measures of a single construct.

In conclusion, there is evidence to suggest that the AMP may not be an implicit measure of attitudes and evidence that it correlates poorly with other implicit measures.

How do Gilovich, Keltner, Chen, and Nisbett introduce the AMP to undergraduate students?

The textbook merely mentions that AMP scores have demonstrated convergent and predictive validity in a number of studies.

Responses on the AMP have been shown to be related to political attitudes, other measures of racial bias, and significant personal habits like smoking and drinking (Greenwald, Smith, Sriram, Bar-Anan & Nosek, 2009; Payne et al., 2005; Payne, Govorun, & Arbuckle, 2008; Payne, McClernon, & Dobbins, 2007). (p. 369)

Three of the four citations are by Payne, who developed the AMP and has a conflict of interest to show that his measure is valid and useful. None of the citations is more recent than 2010, meaning that no significant updates have been made in response to recent critical articles about the AMP.

The cited Greenwald et al. (2009) article is particularly informative because it examined convergent, discriminant, and predictive validity in a large, non-student sample.

The textbook claim that the AMP correlates with other implicit racial bias measures is supported by the r = .218 correlation with the Brief IAT measure. However, there is little evidence for discriminant validity because the AMP also correlates r = .220 and r = .208 with two explicit measures of prejudice (4. Thermometer & 5. Likert, respectively).

Moreover, predictive validity of the AMP is shown by the correlation with voter intentions, r = .113. This correlation is low and not higher than those for the two explicit measures, r = .211 and r = .124.
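To get a feel for how small these correlations are, it helps to square them, which gives the proportion of variance two measures share. The sketch below uses only the correlations quoted above; the dictionary labels are mine, and the two explicit-measure correlations with voter intentions are labeled generically because the text does not say which value belongs to which measure.

```python
# Correlations as quoted in the text (Greenwald et al., 2009)
correlations = {
    "AMP with Brief IAT": 0.218,
    "AMP with thermometer (explicit)": 0.220,
    "AMP with Likert (explicit)": 0.208,
    "AMP with voter intentions": 0.113,
    "Explicit measure A with voter intentions": 0.211,
    "Explicit measure B with voter intentions": 0.124,
}

# Squared correlation = proportion of shared variance
shared_variance = {label: r ** 2 for label, r in correlations.items()}

for label, r2 in shared_variance.items():
    r = correlations[label]
    print(f"{label}: r = {r:.3f}, shared variance = {r2:.1%}")
```

None of these pairs shares as much as 5% of its variance, and the AMP shares about as much variance with the explicit measures as with the Brief IAT.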

Finally, the study failed to find strong pro-White biases in this largely White sample (70% White) for the AMP (M = -0.02, SD = 0.17, d = .12) and the brief IAT (M = 0.06, SD = 0.42, d = 0.14), which were not larger than the pro-White bias for explicit measures that are subject to socially desirable responding: feeling thermometer (M = 0.35, SD = 1.63, d = .21) and Likert scale (M = 0.35, SD = 0.86, d = .41).
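The effect sizes in this paragraph appear to be standardized means, that is, the mean divided by the standard deviation. Assuming that this is how they were computed, a few lines of code reproduce the magnitudes from the quoted means and standard deviations. The Likert value is read as d = .41, and magnitudes are compared because the AMP mean is quoted with a negative sign while its d is quoted as positive.

```python
# (mean, sd, reported d) as quoted in the text; Likert d is read as .41
reported = {
    "AMP": (-0.02, 0.17, 0.12),
    "Brief IAT": (0.06, 0.42, 0.14),
    "Feeling thermometer": (0.35, 1.63, 0.21),
    "Likert scale": (0.35, 0.86, 0.41),
}

def standardized_mean(mean: float, sd: float) -> float:
    """Effect size of a mean relative to zero: d = M / SD."""
    return mean / sd

computed = {}
for measure, (mean, sd, d_reported) in reported.items():
    d = standardized_mean(mean, sd)
    computed[measure] = d
    print(f"{measure}: M/SD = {d:+.2f}, reported d = {d_reported:.2f}")
```

The two indirect measures end up with the smallest standardized means, which is the pattern the paragraph describes.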

These results do not justify claims that the AMP and the IAT are measures of some hidden, implicit attitudes that are only accessible by means of indirect attitude measures and that influence participants’ behavior without their knowledge. However, this is the citation provided in the textbook to support these claims.

If textbook authors had to present the actual evidence rather than a citation, such distortions of the truth would not be possible. Thus, students should demand more scientific figures and tables and fewer cute pictures in their social psychology textbook. After all, they pay good money for it.

2016 Blogs

December 22, 2018 | 2016, Index, r-index | Ulrich Schimmack

DECEMBER

12/31 ****
Review of an “eventful” 2016 (“Method Terrorists”)

12/6
A Forensic Analysis of Stapel: Fabrication or Falsification?

12/3
Replicability Analysis of Dijksterhuis’s “Enhancing Implicit Self-Esteem by Subliminal Evaluative Conditioning”

SEPTEMBER

9/13 ***
Critique of Finkel, Eastwick, & Reis’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

JUNE

6/30 ***
Wagenmakers’ Default Prior is Unrealistic

6/25 ****
A Principled Approach to Setting the Prior of the Null-Hypothesis

6/18 ***
What is the Difference between the Test of Excessive Significance and the Incredibility Index?

6/16 ****
The A Priori Probability of the Point Null-Hypothesis is not 50%

MAY

5/21
Replicability Report on Social Priming Studies with Mating Primes

5/18
Critique of Jeffrey N. Rouder, Richard D. Morey, and Eric-Jan Wagenmakers’s article “The Interplay between Subjectivity, Statistical Practice, and Psychological Science”

5/9 ***
Questionable Research Practices Invalidate Bayes-Factors Just As Much as P-Values

APRIL

4/18 *****
Replicability Report of the Ego Depletion Literature

FEBRUARY

2/16 ****
Discussion of Sterling et al.’s (1995) Seminal Article on Inflated Success Rates in Psychological Science [also recommend reading the original article]

2/10
Replicability AudiT of a 10 Study Article by Adam D. Galinsky

2/9
A Replicability AudiT of Yaacov Trope’s Publications

2/3 ***
A Critique of Finkel, Eastwick, & Reis’s Views on the Replication Crisis

JANUARY

1/31 *****
Introduction to the R-Index
[The R-Index builds on the Incredibility Index, Schimmack (2012)]

1/31
Replicability Analysis of Damisch, Stoberock, & Mussweiler (2010)
[Anonymous Submission to R-Index Blog]

1/31
Replicability Analysis of Williams & Bargh (2008)

1/14 ***
Discussion of Hoenig and Heisey’s Critique of Observed Power Calculations

2017 Blogs

December 22, 2018 | Uncategorized | Ulrich Schimmack

NOVEMBER

11/29 *****
A Quantitative Book Review of John A. Bargh’s Book “Before you know it”

11/16
My Response to the Rejection of the Z-Curve manuscript from AMPPS
[Reviewer 3 is author of the competing P-Curve method]

OCTOBER

10/24 *****
Replicability Rankings of Psychology Journals (2010-2017)

SEPTEMBER

9/4 ****
Replicability Report: The Pen-Paradigm of Facial Feedback Studies

AUGUST

8/2 *****
A Comment on the Alpha Wars: Focus on Beta

MAY

5/15
Replicability AudiT of the Journal Psychological Science

MARCH

3/5 *****
Meta-Psychology: A New Discipline and a New Journal
[the journal now exists: Meta-Psychology]

FEBRUARY

2/26 ***
A Brief Introduction to Null-Hypothesis Significance Testing and Power
[1 Figure and 1500 words]

2/23 ****
On Random Measurement Error, Reliability, and Replicability

2/21 ***
Examining the Influence of Selection for Significance on Observed Power

2/2 ***** (100,000 views)
A Quantitative Review of Kahneman’s Thinking Fast and Slow Chapter on Social Priming [Co-Authored with Moritz Heene]

https://replicationindex.com/2017/11/28/before-you-know-it-by-john-a-bargh-a-quantitative-book-review/
