
COVID-19 in Canada: What Do the Numbers Mean?

The COVID-19 pandemic is unique because the virus emerged in a globally connected world, which enabled it to spread quickly across the globe. At the same time, the ability to fight the virus has never been better. Chinese scientists quickly developed a test to identify infected individuals, which made it possible to isolate them and slow the spread of the virus. It also produced an unprecedented amount of data that are widely shared on websites and in the news about the number of COVID-19 positive cases in different countries and in the Canadian provinces. For example, the government of Canada keeps citizens informed on a website that tracks cases and COVID-19 fatalities.

As a psychologist, I wonder what Canadians are learning from these numbers. When I teach psychology, I spend a lot of time explaining what numbers actually mean. My concern is that Canadians are bombarded with COVID-19 numbers with little information about what these numbers actually mean. Most Canadians will notice that the numbers are bigger for provinces with larger populations, but few know the exact population of each province or are willing to compute fatality rates that take population size into account. So, it remains unclear whether the situation is better or worse in Manitoba or Alberta. A simple solution to this problem would be to provide the number of cases for every 100,000 people.
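Per-capita normalization is a one-line computation. The population figures below are rough, illustrative values, not official counts:

```python
def cases_per_100k(cases, population):
    """Normalize a raw case count by population size."""
    return cases / population * 100_000

# The same raw count looks very different once population is taken
# into account (populations here are rough, illustrative figures).
print(cases_per_100k(1000, 14_500_000))  # Ontario-sized population: ~6.9
print(cases_per_100k(1000, 1_380_000))   # Manitoba-sized population: ~72.5
```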

Taking population size into account is a step in the right direction, but another problem makes comparisons of provinces difficult: the number of cases that are detected also depends on the number of tests that are done. Alberta rightfully points out that it is a world leader in the use of testing to fight the spread of COVID-19. While massive testing is positive, it also means that Alberta is likely to find more cases than Ontario, where the capacity to test the much larger population is more limited. A solution to this problem is to compute the positive rate: the number of positive tests divided by the number of tests conducted. This also makes it unnecessary to take population size into account. Provinces with larger populations are likely to conduct more tests, but what matters is how many tests are being done, not how many people live in a province. A province with a large population could have a low number of cases simply because there is very little testing.
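The positive rate can be sketched in a few lines; the counts below are made up for illustration:

```python
def positive_rate(positive_tests, total_tests):
    """Share of conducted tests that come back positive."""
    return positive_tests / total_tests

# A province that tests ten times as much finds ten times as many
# cases, yet the positive rate reveals the same underlying spread.
print(positive_rate(200, 10_000))  # 0.02
print(positive_rate(20, 1_000))    # 0.02
```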

Fortunately, Canadian labs report both the number of positive and the number of negative cases. This makes it possible to compute the positive rate as a meaningful statistic to examine the spread of COVID-19 across Canadian provinces. The positive rate can also be used to compare Canadian provinces to the states of the United States. Overall, the United States has had a different response to COVID-19 than Canada; in the beginning, there was a lot less testing. Therefore, a simple comparison of the number of positive cases can be very misleading. For example, on March 16, New York reported 49 positive cases for every 1 million inhabitants, while Ontario reported 23. Taken at face value, this suggests that New York had only about twice as many cases. However, NY had carried out only 282 tests per 1 million inhabitants, while Ontario had already carried out 1,044. This means Ontario had a positive rate of 2% when NY already had a positive rate of 17%, so things were roughly 8 times worse in NY than in Ontario, well before cases and fatalities in NY exploded. As of March 28, the numbers in Ontario haven't changed much. There are now 99 positives for every 1 million inhabitants, but also 3,120 tests per 1 million inhabitants; a positive rate of 3%. In comparison, NY now has 2,689 positives for 8,016 tests, a positive rate of 34%, so things are now more than 10 times worse in NY than in Ontario. Thus, the positive rate reflects that the situation in Ontario is much better than in NY. To summarize, the positive rate controls for differences between provinces and states in population size and in the rate of testing. This makes it possible to compare Canadian provinces to each other and to US states. The Table below ranks provinces and states according to the positive rate.
The good news for Canadians is that Canada currently has low positive rates, suggesting that early testing, school closures, and social distancing measures have helped to keep COVID-19 in Canada under control. Of course, these numbers are about the present and the future is unknown. However, it is noteworthy that NY already had a positive rate of 17% on March 16, when the situation seemed under control, while COVID-19 was spreading undetected in the community. The good news is that positive rates in Canada are well below this number.


It is worthwhile to discuss Quebec. On March 22, Quebec reported 221 cases with 10,005 tests (positive rate = 2%). On March 23, Quebec reported 628 cases, an increase of 184% from the day before. On March 24, numbers increased again dramatically, from 628 to 1,013 cases. This suggested a big surge in positive cases. However, the next day the number of tests more than doubled, from 13,215 to 27,973, while the number of positive cases increased by only about 300. This suggests that accounting problems produced an artificial surge in cases. Once the increase in tests was taken into account, the positive rate on March 25 was 5%, and it has remained at this level through March 28. Thus, there is no rampant increase in COVID-19 cases in Quebec; the spike was a statistical artifact.

The positive rate can also be used to compare Canada to other countries. The Table also includes fatality rates (deaths / population). The results show that Canada is doing better not only than the United States but also than many European countries and the UK, while Australia, New Zealand, and several Asian countries are doing better than Canada.

It is too early to examine more carefully the reasons for national differences in COVID-19 cases and fatalities, but the data are encouraging: Canada's response to the global pandemic has been reasonable and effective. It is also important to avoid politicizing the issue. The Liberal federal government and Conservative provincial governments have worked together to respond to the crisis. We only have to look south of the border to see how politics can make a global pandemic worse.

Like all statistics, the positive rate is not a perfect measure of the actual spread of COVID-19. One important limitation of the numbers here is that they are based on all cases so far. This means things might be changing for the better (Norway) or getting worse (UK), and these trends are not reflected in the numbers. Another problem is that positive rates depend on the criteria that are used for testing individuals. However, a low positive rate means that testing is widespread and not just limited to cases in hospitals. Thus, a low positive rate suggests that the virus is not spreading quickly. In contrast, a high positive rate could mean that testing is extremely limited or that there are a lot of cases in the community. Limited testing is a problem in itself, because it leaves no information about spread in the community. Thus, high numbers are a problem even though a 20% positive rate does not mean that 20% of the population is COVID-19 positive. An alternative measure is the fatality rate, but once fatality rates are high, the battle against COVID-19 is already lost. The advantage of the positive rate is that it provides information before individuals are dying in large numbers.

Hopefully, the school and business closures that were implemented two weeks ago will show some improvement in the numbers in the coming week. Once more, case numbers are the wrong numbers to show this. Case numbers will go up, especially if testing capacities increase. Thus, what we really want to see is a decrease in the positive rate. Personally, my analyses of the data give me some hope that we can manage this crisis.

Comforting Case Counts Can Be Misleading

Disclaimer: I tried my best to use the best available data, but I may have made some mistakes. The general points about tracking the spread of COVID-19 remain valid even if there are mistakes in the actual numbers, although the specific statistics may then be misleading. Of course, even the best available data can be distorted by errors. This is a best effort, and the results remain guesstimates of the current situation.

COVID-19 is a pandemic like and unlike previous pandemics. It is spreading faster than any previous pandemic because the world is more interconnected than it ever was. The virus spread early from China to neighboring countries like South Korea, to Iran, and to Italy. However, a virus doesn't know or respect borders; it is spreading around the world. COVID-19 is also unprecedented because modern technology made it possible to develop tests that can identify infected individuals within days. Testing makes it possible to track the spread of COVID-19 before people are dying in large numbers. Websites and newspapers report the number of confirmed COVID-19 cases as quickly as these numbers become available. Most of the time, the key statistic is the absolute number of COVID-19 cases.

For example, a popular COVID-19 tracking site posts a ranking of countries on the basis of case counts.

There are three problems with the use of case counts as a statistic that reflects the severity of the problem. First, countries with a larger population are bound to have a higher number of cases. Switzerland has a population of 8.5 million people, while China has a population of 1.4 billion. Scaling Switzerland's case count up to China's population implies 1,722,164 cases (1.7 million), far more than China has actually reported. Thus, relative to population size, the COVID-19 pandemic is a much bigger problem in Switzerland than in China. A solution to this problem is to report cases relative to the size of the population, but this is done too rarely. The German news channel ARD provided a map of cases for every 100,000 inhabitants for each of the German states. The map shows that small city-states like Berlin and Hamburg have a relatively high rate of infections compared to the larger neighboring states. Thus, it is much more informative to show information relative to population size than to compare absolute case numbers.
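The Switzerland-to-China projection is simple proportional scaling. The Swiss case count below (10,456) is an assumption chosen to reproduce the article's 1.7 million figure, not an official number:

```python
def scale_to_population(cases, population, target_population):
    """Project a region's per-capita case rate onto another population."""
    return cases / population * target_population

# Assumed Swiss count of 10,456 cases in a population of 8.5 million,
# projected onto China's 1.4 billion people.
implied = scale_to_population(10_456, 8_500_000, 1_400_000_000)
print(f"{implied:,.0f}")  # roughly 1.72 million implied cases
```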


Another factor that influences case counts is time. High case counts are observed in places where the pandemic is out of control and widespread community transmission is taking place. At this point, statistics only document the pandemic; they cannot inform policies or actions because it is too late. At this stage, case counts are also relatively unimportant because deaths are a more tragic and objective indicator of the severity of the crisis. For example, on March 16, the state of New York reported only 669 cases with a population of nearly 20 million people, which translates into 3 cases for every 100,000 inhabitants. Nine days later, this had increased exponentially to 91,270 cases, which translates into 456 cases for every 100,000. More importantly, the number of deaths increased from fewer than 10 to 200. Thus, within 10 days a region can go from a low case count to a crisis.

A better statistic is to track the increase in the number of cases over time. A case count of 10 cases per 100,000 inhabitants may not sound alarming, but this ignores that the number can explode if cases are increasing rapidly. The Financial Times provides informative graphs that take growth rates into account.

[Figure: Financial Times graph of COVID-19 case trajectories by country]

The graph shows that Italy and China had similar growth rates in the beginning, but then the growth rate in China slowed down more than the growth rate in Italy. It also shows that Singapore was able to slow its growth rate early on, and that South Korea was able to slow its growth rate even after it had a substantial number of cases and the same growth rate as China and Italy (about a 33% increase a day). It also shows that many countries are on the same trajectory as Italy and China.
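A constant daily growth rate translates directly into a doubling time, which is often easier to grasp. This is a standard exponential-growth formula, not one taken from the FT graphic itself:

```python
import math

def doubling_time(daily_growth_rate):
    """Days until cases double at a constant daily growth rate."""
    return math.log(2) / math.log(1 + daily_growth_rate)

# The ~33% daily increase mentioned above implies cases double
# roughly every two and a half days.
print(round(doubling_time(0.33), 1))  # 2.4
```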

Although the FT graphic is an improvement over simple count data, it has some problems and can easily be misinterpreted. First, it still relies on absolute numbers that are distorted by population size. The graph suggests that problems are worse in the United States than in the Netherlands. However, fatality rates tell a different story. The Netherlands has already seen 276 deaths with a population of 17 million, whereas the USA has seen 784 deaths with a population of about 330 million, nearly 20 times as many people as the Netherlands. If the Netherlands had the same population as the United States, it would have about 5,520 deaths, 7 times more than the United States. Thus, the situation in the Netherlands is considerably worse than in the United States.

A third factor that distorts comparisons is that regions differ considerably in the amount of testing that is done. The more tests are conducted, the more positive cases will be found. Thus, low case counts may provide a false sense of security if they are caused by low rates of testing. Testing can also distort comparisons over time: absolute numbers might jump simply because the number of tests has increased dramatically. For example, in NY the number of positive cases jumped from 669 on March 16 to 2,601 on March 17, an increase of 289%; in other words, case numbers nearly quadrupled in one day. This jump does not mean that infections quadrupled in one day; it reflects a dramatic increase in testing. Another problem is that data may not be updated regularly. For example, Germany announced a surprising flattening of the curve last weekend, only to be surprised again when cases increased more than usual on Monday. The reason was that some states did not report updated numbers on Sunday.

A solution to this problem is to report the proportion of positive tests relative to the number of tests that were conducted. Countries with bigger populations will conduct more tests, but that does not affect the proportion of positive tests. Similarly, the number of tests can vary from day to day without influencing the proportion of positive tests that are recorded. For most researchers, this is obvious. We may report that a study had 132 female and 68 male participants, but we really only care that the gender composition was 66% female and 34% male. However, when it comes to COVID-19 data, this statistic is rarely reported. To make matters worse, in some countries it is impossible to compute because only positive tests are recorded and reported. Nobody knows how many tests are conducted in Germany (dear Robert Koch Institute, please improve the recording and reporting of COVID-19 cases!).

Most US states and the Canadian provinces report the number of tests that are conducted. This makes it possible to examine the spread of the COVID-19 pandemic in the USA and Canada. For the US states, I used data from the covidtracking website. I was not able to compute statistics for all states because some states do not report negative results; without them, it is unclear how widespread COVID-19 is in these states. The best estimates may be based on results from neighboring states that report credible testing rates.

With every new statistic, we have to examine whether it produces useful, valid information. One indication that the statistic provides useful information is the ranking of New York as the state with the biggest problem. Not surprisingly, neighboring New Jersey also ranks highly. It is harder to validate the statistic at the lower end because it is not clear whether places with low case numbers have few cases of COVID-19 or simply low testing rates. However, Alberta has the highest testing rate of all states and provinces, and only 1% of its tests are positive. This suggests that COVID-19 is not widespread in Alberta.

The placement of Quebec is interesting because Quebec has been testing at a fairly high rate with relatively low case numbers, but it recorded big increases over the past two days. This spike made Quebec move to rank #7 and there is concern about widespread community transmission.

Thus, some results suggest that the positive rate is a useful indicator of the spread of COVID-19. However, this statistic has problems as well. Testing is not random: in places where tests are limited, testing will focus on cases in hospitals, which produces a high positive rate, while in places with more capacity, the general public has more opportunity to get tested and the positive rate will be lower. Interestingly, I found no correlation between the amount of testing (tests conducted as of March 24 / population) and the positive rates, Pearson r = .03, and only a small negative rank correlation, rank r = -.27. Thus, a considerable amount of the variation in positive rates may reflect the actual spread of COVID-19 in different states. This would suggest that COVID-19 has already spread in the East of the United States, while prevalence in the Great Plains is still low. It also suggests that the Southern states may be more affected than other statistics indicate, because testing rates in these states are low. There is still a chance that some Western states can prevent an epidemic by implementing the right measures before it is too late. In Canada, Ontario and Alberta seem to be doing well so far and have implemented measures to prevent community transmission, but the spike in Quebec shows that case numbers and positive rates can increase quickly.
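For readers who want to run the same check, here is a minimal sketch of the two correlations. The state-level data are not reproduced here, so the numbers below are made up for illustration:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Rank correlation: Pearson r computed on ranks (no ties assumed)."""
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]
    return pearson_r(rank(x), rank(y))

# Hypothetical testing rates and positive rates for five states.
tests_per_capita = [0.002, 0.004, 0.006, 0.008, 0.010]
positive_rates = [0.30, 0.10, 0.20, 0.05, 0.15]
print(round(pearson_r(tests_per_capita, positive_rates), 2))
print(round(spearman_rho(tests_per_capita, positive_rates), 2))
```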

In sum, statistics can provide valuable information, but they can also be misleading if numbers are not presented in a way that is informative. Absolute case counts are relatively easy to compute and to report, but absolutely uninformative. To provide the general public with valuable information, everybody needs to do better. Labs and governments need to record and report the number of negative tests, and media need to report numbers that actually reflect the risk to a particular population. Widespread community testing is needed to detect new outbreaks of the pandemic. This has worked in South Korea, the only country that has been able to stop the pandemic. Japan and Singapore were able to prevent it from spreading widely in the first place. In some places, it is not too late to follow these examples.

Totalitarian Scientists

I have read many, if not most, highly influential articles in the history of psychology, but once in a while I stumble upon an article I didn't know. Here is one of them: "The Totalitarian Ego: Fabrication and Revision of Personal History" by Tony Greenwald (1980). Back in the day, Tony Greenwald was a revolutionary who recognized many of the flaws that prevent psychology from being a real science. For example, in 1975 he published a critical article about the tendency to hide disconfirming evidence. This article is mentioned in the 1980 article.

The 1980 article was written during a time when social psychologists discovered cognitive biases and started to examine why humans often make errors in processing information. One influential hypothesis was that cognitive biases are actually beneficial for individuals, which led to Taylor and Brown's (1988) claim that positive illusions are a sign of mental health.

The same argument is made by Greenwald (1980), and he compares the benefits of biases for individuals to those for totalitarian regimes and scientific theories. The main function of biases is to preserve either the ego of individuals, the organization of a totalitarian regime, or the integrity of a theory.

The view of biases as beneficial has been challenged. Illusions about reality can have dramatic negative consequences for individuals. In fact, there is little evidence to support the claim that positive illusions are beneficial for well-being (Schimmack & Kim, 2020). The idea that illusions are beneficial for scientific theory is even more questionable. After all, the very idea of science is that scientific theories should be subjected to empirical tests and revised or abandoned when they fail these tests. Greenwald (1980) first seems to agree.

But then he cites Popper and Kuhn to come to the opposite conclusion.

At least in the short term, it is beneficial for individuals and scientific theories to be protected against disconfirming evidence. It is only in the long run, when hard evidence makes a theory untenable, that individuals or theories need to change. For individuals, these hard facts may be life experiences that are not under their control; it may take years before it becomes clear that a marriage is not worth saving. Scientists, however, can avoid this moment of painful reckoning as long as they can hide disconfirming evidence by avoiding strong tests of theories, dismissing disconfirming evidence in their own studies, and using their status as experts in the peer-review process to keep disconfirming evidence from being published. Thus, scientists have a strong incentive to protect their egos and their theories (their brain-children) from a confrontation with reality. Scientists whose egos are invested in a theory, as Greenwald is invested in the theory of implicit bias, are the least trustworthy individuals to evaluate that theory. As Feynman observed, scientists should not fool themselves, but when it comes to their own theories, they are the easiest ones to fool.

Thus, scientists end up behaving like totalitarian societies. They will use all of their energy to preserve theories, even when they are false. Moreover, the biggest fools have an advantage because they have the least doubt about their theories, which facilitates goal attainment. The research program on implicit bias is a great example. The theory that individuals have unconscious, hidden biases that guide their behavior has become a dominant theory in social cognition research, despite little evidence to support it (Schimmack, 2020). Criticism was sporadic and drowned out by the forces that pushed the theory.

While this has been extremely advantageous for the scientists pushing the theory, these totalitarian forces are bad for science as a whole. Thus, psychology needs to find mechanisms to counteract totalitarianism in science. Fortunately, there are some positive signs that this is happening. The 2010s saw a string of major replication failures in social psychology that would have been difficult to publish when psychology was prejudiced against null findings (Greenwald, 1975). Other changes are needed to subject theories to stronger tests so that they can fail before they become too big to fail.

In conclusion, Greenwald's (1980) article deserves some recognition for pointing out similarities between ego-defense mechanisms, totalitarian regimes, and scientific theories. They all want to live forever, but eternal life is an unattainable goal. The goal of empirical research should not be to feed an illusion, but to sustain a process of evolution in which old theories are constantly replaced by new theories that are better adapted to reality. Implicit bias theory had a good life. It's time to die.


Greenwald, A. G. (1980). The totalitarian ego: Fabrication and revision of personal history. American Psychologist, 35(7), 603–618.

Denial is Not Going to Fix Social Psychology

In 2015, social psychologists replicated 100 results published in psychology journals. While the original articles, which often included multiple studies, reported almost exclusively significant results (a 97% success rate), the replication studies produced only 25% significant results (Open Science Collaboration, 2015).

Since this embarrassing finding has been published, leaders of social psychology have engaged in damage control, using a string of false arguments to suggest that a replication rate of 25% is normal and not a crisis (see Schimmack, 2020, for a review).

One open question about the OSC results is what they actually mean. One explanation is that the original studies reported false positive results; that is, a significant result was reported although there is actually no effect of the experimental manipulation. The other explanation is that the original studies merely reported inflated effect sizes but got the direction of the effect right. As social psychologists do not care much about effect sizes, the latter explanation is not a problem for them. Unfortunately, a replication rate of 25% does not tell us how many original results were false positives, but there have been attempts to estimate the false discovery rate in the OSC studies.

Brent M. Wilson, a post-doctoral researcher at UC San Diego, and Distinguished Professor John T. Wixted, also at UC San Diego, published an article that used sign changes between original and replication studies to estimate the false discovery rate (Wilson & Wixted, 2018). The logic is straightforward. For a true null result, a replication is equally likely to show an effect in one direction (increase) or the other (decrease) due to sampling error alone. Thus, a sign change in a replication study suggests that the original result may have been a statistical fluke. Based on this logic, the authors concluded that 49% of the results in social psychology were false positives.
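The sign-change logic can be sketched with a back-of-the-envelope estimator: a false positive's replication flips sign with probability 1/2, so (assuming true effects rarely flip sign) the false discovery rate is roughly twice the observed sign-change rate. This is a simplified sketch of the idea, not Wilson and Wixted's exact model:

```python
def estimated_false_discovery_rate(sign_changes, replications):
    """Roughly twice the observed sign-change rate, capped at 1."""
    return min(1.0, 2 * sign_changes / replications)

# Hypothetical numbers: 25 sign reversals in 100 replications would
# imply about half the original findings were false positives.
print(estimated_false_discovery_rate(25, 100))  # 0.5
```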

The implications of this conclusion cannot be overstated. Every other result published in social psychology is a false positive. Half of the studies in social psychology textbooks support false claims, unless textbook writers are clairvoyant and can tell true effects from false ones. If this is not bad enough, the 49% estimate uses the nil-hypothesis to decide that a reported result is false. However, effects in the same direction that are very small have no practical significance, especially when effect sizes are difficult to estimate because they are susceptible to small changes in experimental procedures. Thus, the implication of Wilson and Wixted's article is that social psychology has a replication crisis because it is not clear which published results can be replicated with practically meaningful effect sizes. I cited the article accordingly in my review article (Schimmack, 2020).

You may understand my surprise when the same authors, a couple of years later, wrote another article claiming that most published results are true (Wilson, Harris, & Wixted, 2020).

Although the authors do not suffer from full-blown amnesia and do recall and cite their previous article, they fail to mention that they previously estimated that 49% of published results in social psychology are false positives. Instead, they blur the distinction between cognitive and social psychology, although cognitive psychology had an estimated 19% false positives, compared to 49% for social psychology.

So, apparently the authors remembered that they published an article on this topic, but they forgot their main argument and conclusions. In fact, in the original article they found a silver lining in the estimate that 49% or more of the results in social psychology are false positives: they argued that this shows social psychologists are willing to test risky hypotheses that have a high chance of being false. In contrast, cognitive psychologists should be ashamed of their 81% success rate, which only shows that they make obvious predictions.

Assuming their estimate is correct, it is not good news that only 1 out of 17 hypotheses tested by social psychologists is true. The problem is that social psychologists do not just test a hypothesis and give up when they get a non-significant result. Rather, they continue to run a series of conceptual replication studies with minor variations until a significant result is found. Thus, the chance that false findings are published is rather high, which would explain why findings are difficult to replicate.

In conclusion, Wilson and Wixted published two articles with opposing conclusions. One article claims that social psychology is a wild chase after effects in which most experiments test hypotheses that are false (i.e., the null-hypothesis is true). This leads to the publication of many false positive results that fail to replicate in honest replication studies that do not select for significance. Two years later, social psychology is a respectable science that may not be much different from cognitive psychology, and most published results are true, which also implies that most tested hypotheses must be true, because a high proportion of false hypotheses would produce false positive results in journals.

What caused this flip-flop about the replication crisis is unclear. Maybe the fact that Susan T. Fiske was in charge of publishing the new article in PNAS has something to do with it. Maybe she pressured the authors into saying nice things about social psychology. Maybe they were willing accomplices in whitewashing the embarrassing replication outcome for social psychology. I don't know and I don't care. Their new PNAS article is nonsense and ignores other evidence that social psychology has a major replication problem (Schimmack, 2020). Fiske may wish that articles like the PNAS article hide the fact that social psychologists made a mockery of the scientific method (publish only studies that work, err on the side of discovery, never replicate a study so that you have plausible deniability). I can only hope that young scholars realize that the old practices produce a pile of results that have no theoretical or practical meaning, and work towards improving scientific practices. The future of social psychology depends on it.


Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology, in press.

Wilson, B. M., & Wixted, J. T. (2018). The prior odds of testing a true effect in cognitive and social psychology. Advances in Methods and Practices in Psychological Science, 1, 186–197.

Wilson, B. M., Harris, C. R., & Wixted, J. T. (2020). Science is not a signal detection problem. Proceedings of the National Academy of Sciences.

Replicability Rankings of 120 Psychology Journals (2010-2019)

The individual z-curve plots can be found by clicking on the journal names.

Table 1 shows the results for the expected replication rate (ERR). The values are high because the analysis includes manipulation checks and the ERR assumes exact replications. Actual replication attempts of focal hypotheses are likely to produce lower success rates. The ERR is still useful for examining differences between journals and changes over time.

Table 1 reports the individual results for the past three years and aggregated results for 2017-2019, 2014-2016, and 2010-2013. These aggregates produce more stable estimates, especially for journals with fewer test statistics. Correlations among the aggregates are r = .53 for 17-19 with 14-16, r = .65 for 14-16 with 10-13, and r = .32 for 17-19 with 10-13. The lower stability over the longer time period indicates that rank-order changes are real and not just random measurement error.

The Change column is the difference between the average for the last three years (17-19) and the average for the first four years (2010-2013). The results give some indication of whether replicability increased, but the scores are still subject to sampling error. Individual time-trend analyses showed statistically significant changes for 13 journals. Statistical significance is indicated by a + and by printing the change score in bold. Some journals benefit from low sampling error and can show significance with a small increase; others show larger increases, but sampling error is too large to show a clear linear trend. As more data become available, persistent trends will become significant.
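A per-journal time trend of the kind described above can be sketched as an ordinary least-squares slope of yearly ERR estimates on year. The numbers below are hypothetical, not taken from Table 1:

```python
from statistics import mean

def ols_slope(years, err):
    """Least-squares slope of ERR estimates regressed on year."""
    my, me = mean(years), mean(err)
    num = sum((y - my) * (e - me) for y, e in zip(years, err))
    den = sum((y - my) ** 2 for y in years)
    return num / den

# Hypothetical yearly ERR estimates for one journal, 2010-2019.
years = list(range(2010, 2020))
err = [63, 62, 64, 63, 66, 67, 68, 71, 76, 82]
print(round(ols_slope(years, err), 2))  # ~1.9 ERR points per year
```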

Rank   Journal 2019 2018 2017 17-19 14-16 10-13 Change
1 Journal of Religion and Health 89 75 74 79 81 79 0
2 Journal of Individual Differences 88 69 83 80 66 76 4
3 Journal of Business and Psychology 87 72 86 82 73 77 5
4 Journal of Research in Personality 87 78 78 81 78 72 +9
5 Journal of Happiness Studies 85 69 57 70 79 80 -10
6 Journal of Occupational Health Psychology 85 65 55 68 72 68 0
7 Journal of Youth and Adolescence 85 66 70 74 80 75 -1
8 Journal of Nonverbal Behavior 84 79 88 84 70 68 +16
9 Journal of Research on Adolescence 84 64 66 71 67 73 -2
10 Cognitive Psychology 83 82 72 79 76 76 3
11 Evolution & Human Behavior 83 78 73 78 76 64 +14
12 Psychology of Men and Masculinity 83 69 57 70 69 82 -12
13 Developmental Psychology 82 78 76 79 74 68 +11
14 Evolutionary Psychology 82 78 78 79 78 75 4
15 Experimental Psychology 82 78 74 78 73 72 6
16 Journal of Anxiety Disorders 82 77 78 79 73 75 4
17 Psychological Science 82 76 71 76 67 63 +13
18 Attention, Perception and Psychophysics 81 80 79 80 73 76 4
19 Cognition 81 77 76 78 73 74 4
20 Journal of Behavioral Decision Making 81 76 70 76 75 68 8
21 Journal of Organizational Psychology 81 68 73 74 71 69 5
22 Aggressive Behavior 80 78 72 77 71 68 9
23 Cognitive Development 80 73 79 77 75 70 7
24 Consciousness and Cognition 80 78 77 78 70 70 8
25 Depression & Anxiety 80 82 75 79 74 84 -5
26 European Journal of Personality 80 73 76 76 76 72 4
27 Judgment and Decision Making 80 72 81 78 77 72 6
28 Memory and Cognition 80 80 79 80 76 77 3
29 Psychology and Aging 80 72 79 77 79 74 3
30 Psychonomic Bulletin and Review 80 79 75 78 79 76 2
31 Journal of Applied Psychology 79 67 78 75 74 70 5
32 Journal of Cross-Cultural Psychology 79 78 76 78 78 76 2
33 Journal of Experimental Psychology – General 79 78 78 78 74 70 +8
34 Journal of Memory and Language 79 78 80 79 78 75 4
35 Journal of Occupational and Organizational Psychology 79 82 71 77 76 72 5
36 Journal of Positive Psychology 79 71 82 77 71 68 9
37 Journal of Sex Research 79 80 81 80 78 80 0
38 Journal of Social Psychology 79 79 75 78 70 72 6
39 Law and Human Behavior 79 80 76 78 68 76 2
40 Personality and Individual Differences 79 76 76 77 77 72 +5
41 Perception 79 73 76 76 76 86 -10
42 Acta Psychologica 78 71 77 75 75 74 1
43 Asian Journal of Social Psychology 78 80 68 75 74 70 5
44 Journal of Child and Family Studies 78 75 73 75 70 73 2
45 Journal of Counseling Psychology 78 85 69 77 76 75 2
46 J. of Exp. Psychology – Learning, Memory & Cognition 78 78 79 78 78 76 2
47 Journal of Experimental Social Psychology 78 75 71 75 64 56 +19
48 Memory 78 74 74 75 78 81 -6
49 British Journal of Social Psychology 77 70 75 74 63 63 11
50 Cognitive Therapy and Research 77 74 75 75 67 71 4
51 European Journal of Social Psychology 77 64 71 71 70 64 7
52 Social Psychological and Personality Science 77 80 75 77 63 59 +18
53 Archives of Sexual Behavior 76 72 78 75 78 80 -5
54 Emotion 76 73 73 74 70 70 4
55 Journal of Affective Disorders 76 75 75 75 82 77 -2
56 J. of Exp. Psychology – Human Perception and Performance 76 78 76 77 76 76 1
57 Journal of Pain 76 79 68 74 77 72 2
58 Personal Relationships 76 83 76 78 69 64 +14
59 Psychology of Religion and Spirituality 76 82 69 76 79 69 7
60 Appetite 75 76 77 76 67 72 4
61 Group Processes & Intergroup Relations 75 65 70 70 68 65 5
62 Journal of Cognition and Development 75 82 74 77 70 65 12
63 Journal of Cognitive Psychology 75 79 75 76 76 77 -1
64 Journal of Experimental Psychology – Applied 75 75 80 77 70 70 7
65 JPSP-Personality Processes and Individual Differences 75 79 64 73 72 66 7
66 Political Psychology 75 88 76 80 74 63 17
67 Psychopharmacology 75 68 73 72 74 72 0
68 Psychophysiology 75 79 78 77 74 74 3
69 Quarterly Journal of Experimental Psychology 75 79 76 77 75 74 3
70 Animal Behavior 74 71 77 74 70 72 2
71 Behaviour Research and Therapy 74 75 70 73 74 71 2
72 British Journal of Developmental Psychology 74 77 72 74 72 76 -2
73 Frontiers in Psychology 74 74 76 75 74 72 3
74 Journal of Abnormal Psychology 74 71 69 71 64 70 1
75 Journal of Applied Social Psychology 74 75 73 74 73 72 2
76 Journal of Consumer Behaviour 74 81 72 76 76 79 -3
77 Journal of Health Psychology 74 81 63 73 75 72 1
78 JPSP-Attitudes & Social Cognition 74 80 79 78 66 58 20
79 Journal of Social and Personal Relationships 74 74 71 73 67 73 0
80 Psychology and Marketing 74 68 70 71 70 68 3
81 Behavioral Neuroscience 73 66 73 71 69 70 1
82 Canadian Journal of Experimental Psychology 73 87 73 78 77 76 2
83 Cognition and Emotion 73 75 66 71 72 82 -11
84 European Journal of Developmental Psychology 73 90 85 83 75 71 12
85 Journal of Child Psychology and Psychiatry and Allied Disciplines 73 62 68 68 66 64 4
86 Journal of Educational Psychology 73 73 79 75 71 77 -2
87 Organizational Behavior and Human Decision Processes 73 70 68 70 70 69 1
88 Psychological Medicine 73 76 73 74 75 73 1
89 Sex Roles 73 83 81 79 76 75 4
90 Social Psychology 73 83 73 76 73 71 5
91 Behavioural Brain Research 72 69 71 71 70 72 -1
92 British Journal of Psychology 72 79 76 76 79 74 2
93 Developmental Science 71 67 73 70 67 69 1
94 Journal of Personality 71 79 77 76 70 67 9
95 Behavior Therapy 70 67 71 69 71 72 -3
96 Child Development 70 73 66 70 71 72 -2
97 International Journal of Psychophysiology 70 71 73 71 67 68 3
98 Journal of Consulting and Clinical Psychology 70 71 77 73 64 65 8
99 Journal of Experimental Child Psychology 70 73 71 71 75 73 -2
100 Motivation and Emotion 70 62 73 68 65 70 -2
101 Frontiers in Behavioral Neuroscience 69 72 74 72 70 70 2
102 Journal of Comparative Psychology 69 68 67 68 75 70 -2
103 JPSP-Interpersonal Relationships and Group Processes 69 72 68 70 67 58 +12
104 Frontiers in Human Neuroscience 68 73 71 71 74 75 -4
105 Journal of Consumer Research 68 68 65 67 60 59 8
106 Journal of Family Psychology 68 76 70 71 68 66 5
107 Journal of Vocational Behavior 68 83 75 75 78 80 -5
108 Personality and Social Psychology Bulletin 68 71 74 71 65 61 +10
109 Self and Identity 68 60 67 65 66 72 -7
110 Hormones & Behavior 66 69 61 65 63 63 2
111 Psychoneuroendocrinology 66 68 65 66 64 62 +4
112 Annals of Behavioral Medicine 65 74 70 70 69 74 -4
113 Biological Psychology 65 68 63 65 68 66 -1
114 Cognitive Behavioral Therapy 65 67 75 69 73 71 -2
115 Health Psychology 63 77 70 70 64 65 5
116 Infancy 62 58 61 60 62 64 -4
117 Journal of Consumer Psychology 62 66 57 62 62 62 0
118 Psychology of Music 62 74 81 72 74 79 -7
119 Developmental Psychobiology 60 66 63 63 67 69 -6
120 Social Development 55 80 73 69 73 72 -3

The Lost Decades in Psychological Science

Methodologists have criticized psychological research for decades (Cohen, 1962; Maxwell, 2004; Sedlmeier & Gigerenzer, 1989; Sterling, 1959). A key concern is that psychologists conduct studies that are only informative when they reject the nil-hypothesis that a result is just a chance finding (p < .05), yet many studies have low statistical power to do so. As a result, many studies that were conducted remained unpublished, while published studies often reached significance only with the help of chance. Despite repeated attempts to educate psychologists about statistical power, there has been little evidence that researchers increased the power of their studies. The main reason is that power analyses often showed that large samples were required, which could take years of data collection for a single study. At the same time, pressure to publish increased, and nobody could afford to work for years on a study without publishing. Therefore, psychologists found ways to produce significant results with smaller samples. The problem is that these questionable research practices inflate effect sizes and make it difficult to replicate results. This produced the replication crisis in psychology (see Schimmack, 2020, for a review). So far, the replication crisis has played out mostly in social psychology, because social psychologists have conducted the most replication attempts and produced a pile of replication failures in the 2010s. The replication crisis has generated a lot of discussion about reforms and many suggestions to increase statistical power. However, the incentive structure has not changed: graduate students today still need a CV with many original articles to be competitive on the job market. Thus, strong forces counteract reforms of research practices in psychology.

In this blog post, I examine whether psychologists have changed their research practices in ways that increase statistical power. To do so, I use automatically extracted test statistics from 121 psychology journals that cover a broad range of psychology, including social, cognitive, personality, developmental, clinical, physiological, and brain sciences. With the help of undergraduate students, I downloaded all articles published in these journals from 2010 to 2019. To keep this post short, I present only the results for 2010 and 2019. The latest year is particularly important because reforms take time, and the latest year provides the best opportunity to see their effects.

All test statistics are converted into absolute z-scores. A bigger z-score indicates stronger evidence against the nil-hypothesis that there is no effect. The higher the z-scores, the greater the power of studies to reject the nil-hypothesis. Thus, any increase in power would shift the distribution of z-scores to the right. Z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020) uses the distribution of significant z-scores (z > 1.96) to estimate several statistics. One statistic is the expected replication rate (ERR): the percentage of significant results that would be significant again if the studies were replicated exactly. Another statistic is the expected discovery rate (EDR): the percentage of significant results that would be expected among all conducted tests, given the distribution of the significant results. The EDR can be lower than the observed discovery rate (ODR) if there is selection for significance.
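As a rough illustration of the conversion step, a two-sided p-value can be mapped onto an absolute z-score with the standard-normal quantile function. This is a minimal Python sketch for illustration only; the actual z-curve analyses use the authors' own software, and the `p_to_z` helper below is a hypothetical name, not part of it.

```python
from statistics import NormalDist

def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

# The familiar significance thresholds in z-score units:
print(round(p_to_z(0.05), 2))   # 1.96
print(round(p_to_z(0.005), 2))  # 2.81
```

Any test statistic with a known p-value (t, F, chi-square) can be converted this way, which puts results from different tests on a common scale.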

Figure 1 shows the results for 2010.

First, visual inspection shows clear evidence for the use of questionable research practices: there are far fewer z-scores just below 1.96 (not significant) than just above it. The grey curve in Figure 1 shows the distribution of non-significant results that would be expected. The amount of selection is reflected in the discrepancy between the observed discovery rate of 69% and the expected discovery rate of 32%, 95%CI = xx to xx.

The expected replication rate is 71%, 95%CI = xx to xx. This is much higher than the estimate of 37% based on the Open Science Collaboration project (Science, 2015). However, there are two caveats. First, the present analysis is based on all reported statistical tests, including manipulation checks that should produce very strong evidence against the null-hypothesis. Results for focal and risky hypothesis tests will be lower. Second, the ERR assumes that studies can be replicated exactly, which is typically not the case in psychology (Lykken, 1968; Stroebe & Strack, 2014). Bartos and Schimmack (2020) found that the EDR was a better predictor of actual replication outcomes. Assuming that this finding generalizes, the present EDR estimate of 32% would be consistent with the Open Science Collaboration result of 37%.
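The observed discovery rate itself is straightforward: it is just the share of reported tests that clear the significance threshold. A minimal sketch with made-up z-scores (the EDR, by contrast, requires fitting the z-curve mixture model to the significant results and is not shown here):

```python
from statistics import NormalDist

def observed_discovery_rate(z_scores, alpha=0.05):
    """Share of reported tests whose |z| exceeds the critical value."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    return sum(abs(z) > crit for z in z_scores) / len(z_scores)

# Toy data: 7 of 10 tests are significant, so the ODR is .70.
zs = [0.5, 1.2, 1.5, 2.1, 2.3, 2.5, 2.8, 3.0, 3.5, 4.0]
print(observed_discovery_rate(zs))  # 0.7
```

When the ODR computed this way is much higher than the model-based EDR, the published record contains more significant results than the underlying studies could have produced, which is the signature of selection for significance.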

Figure 1 establishes baseline results for 2010 at the beginning of the replication crisis in psychology. Figure 2 shows the results nine years later.

A comparison of the results shows a small improvement. The observed discovery rate decreased from 69% to 64%, which means that researchers are reporting more non-significant results. The EDR increased from 32% to 37%, and the ERR increased from 71% to 75%. However, there is still clear evidence that questionable research practices inflate the percentage of significant results in psychology journals, and the small increases in EDR and ERR predict only a small increase in replicability. Thus, if the Open Science Collaboration project were repeated with studies from 2019, it would still be likely to produce fewer than 50% successful replications.

Figure 3 shows the ERR (black), ODR (grey), and EDR (black, dotted) for all 10 years. The continuous upward trend in the ERR suggests that power is increasing a bit, at least in some areas of psychology. However, the trend for the EDR shows no consistent improvement, suggesting that editors are unwilling or unable to reject manuscripts that used QRPs to produce just-significant results. A simple solution might be to require either (a) pre-registration with a clear power analysis that is followed exactly or (b) a more stringent criterion of significance, p < .005, to compensate for the hidden multiple comparisons.
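To make concrete what a "clear power analysis" implies, the standard normal approximation for a two-sample comparison shows how the required sample size grows with a stricter alpha. This is a textbook sketch, not part of the z-curve analysis; the function name is mine.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-sample comparison of means
    with standardized effect size d (normal approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # critical value for two-sided test
    z_beta = nd.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.4))               # 99 per group at p < .05
print(n_per_group(0.4, alpha=0.005))  # 167 per group at p < .005
```

The stricter criterion roughly doubles the required sample for a typical effect size, which is exactly why adequately powered studies are costly and why rewarding them requires a change in incentives.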


These results provide important scientific evidence about research practices in psychology. Dozens of articles have discussed the replication crisis, and dozens of editorials by new editors have introduced policies to increase the replicability of psychological results. However, without hard evidence, claims about progress are essentially projective tests that say more about the authors than about the state of psychology as a science. The present results provide no evidence that psychology as a field has successfully addressed problems that are decades old, or that it can be considered a leader in openness and transparency.

Most importantly, there is no evidence that researchers listen to Cohen’s famous saying, “Less is more, except for sample size.” Instead, they are publishing more and more studies in more and more articles, which leads to more and more citations, which looks like progress on quantitative indicators like impact factors. However, most of these findings are only replicated in conceptual replication studies that are themselves selected for significance, giving false evidence of robustness (Schimmack, 2012). Thus, it is unclear which results are replicable and which are not.

It would require enormous resources to follow up on these questionable results with actual replication studies. To improve psychology, psychologists need to change the incentive structure. To do so, we need to quantify the strength of evidence and stop treating all results with p < .05 as equally significant. A z-score of 2 is just significant, while a z-score of 5 corresponds to the criterion used in particle physics to claim a decisive result. Investing resources in decisive studies needs to be rewarded, because a single well-designed experiment with z > 5 provides stronger evidence than many weak studies with z = 2, especially when just-significant results may be obtained with questionable practices that inflate effect sizes.
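The difference between z = 2 and z = 5 is easy to see when the z-scores are translated back into two-sided p-values. A minimal sketch (the `z_to_p` helper is just an illustrative name):

```python
from statistics import NormalDist

def z_to_p(z: float) -> float:
    """Two-sided p-value for an absolute z-score under the null hypothesis."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(z_to_p(2.0))  # about .046 -- just significant
print(z_to_p(5.0))  # about 6e-7 -- the "five sigma" standard of particle physics
```

The evidential gap is not a factor of 2.5 but roughly five orders of magnitude, which is why treating both results as simply "p < .05" discards most of the information in the data.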

The only difference between my criticism and previous criticisms of low-powered studies is that technology now makes it possible to track the research practices of psychologists in real time. Students downloaded the articles for 2019 at the beginning of February, and processing this information took a couple of days (mainly to convert PDFs into text files). Running a z-curve analysis with bootstrapped confidence intervals takes only a couple of minutes. Therefore, we have hard empirical evidence about research practices in 2019, and the results show that questionable research practices continue to be used to publish more significant results than the power of studies warrants. I hope that demonstrating and quantifying the use of these practices helps to curb their use and to reward researchers who conduct well-powered studies.

A simple way to change the incentive structure is to ban QRPs and treat them like other research fraud. John et al. (2012) introduced the term “scientific doping” for questionable research practices. If sports organizations can ban doping to create fair competitions, why couldn’t scientists do the same? The past decade has shown that they are unable to self-regulate. It is time for funders and consumers (science journalists, undergraduate students, textbook writers) to demand transparency about research practices and an end to fishing for significance.

Estimating Replicability in the “British Journal of Social Psychology”


There is a replication crisis in social psychology (see Schimmack, 2020, for a review). One major cause of the replication crisis is selection for statistical significance. Researchers conduct many studies with low power, but only the significant results get published. As these results are often significant only with the help of sampling error, replication studies fail to reproduce them. Awareness of these problems has led some journal editors to change submission guidelines in the hope of attracting more replicable results. As replicability depends on power, this would mean that the mean power of statistical tests increased. This can be tested by estimating the mean power before and after selection for significance (Bartos & Schimmack, 2020; Brunner & Schimmack, 2019).

In 2017, John Drury and Hanna Zagefka took over as editors of the “British Journal of Social Psychology” (BJSP). Their editorial directly addresses the replication crisis in social psychology.

A third small change has to do with the continuing crisis in social psychology (especially in quantitative experimental social psychology). We see the mission of social psychology to be to make sense of our social world, in a way which is necessarily selective and has subjective aspects (such as choice of topic and motivation for the research). This sense-making, however, must not entail deliberate distortions, fabrications, and falsifications. It seems apparent to us that the fundamental causes of the growth of data fraud, selective reporting of results and other issues of trust we now face are the institutional pressures to publish and the related reward structure of academic career progression. These factors need to be addressed.

In response to this analysis of problems in the field, they introduced new submission guidelines.

Current debate demonstrates that there is a considerable grey area when deciding which methodological choices are defensible and which ones are not. Clear guidelines are therefore essential. We have added to the submission portal a set of statements to which authors respond in relation to determining sample size, criteria for data exclusion, and reporting of all manipulations, conditions, and measures. We will also encourage authors to share their data with interested parties upon request. These responses will help authors understand what is considered acceptable, and they will help associate editors judge the scientific soundness of the work presented.

In this blog post, I examine the replicability of results published in BJSP and whether the changes in submission guidelines have increased replicability. To do so, I downloaded articles from 2000 to 2019 and automatically extracted test statistics (t-values, F-values) from them. All test statistics were converted into absolute z-scores; higher z-scores provide stronger evidence against the nil-hypothesis. I then submitted the 8,605 z-scores to a z-curve analysis. Figure 1 shows the results.

First, visual inspection shows a clear drop around z = 1.96, the value that corresponds to the typical significance criterion of .05 (two-sided). This drop shows the influence of selectively publishing significant results. A quantitative test of selection compares the observed discovery rate to the expected discovery rate. The observed discovery rate (ODR) is the percentage of reported results that are significant, 70%, 95%CI = 69% to 71%. The expected discovery rate (EDR) is estimated by z-curve on the basis of the distribution of the significant results (grey curve). The EDR is lower, 46%, and its 95%CI, 25% to 57%, does not include the ODR. Thus, there is clear evidence that results in BJSP are biased towards significant results.

Z-curve also estimates the replicability of significant results. The expected replication rate (ERR) is the percentage of significant results that would be significant again in exact replication studies. The ERR is 68%, with a 95%CI ranging from 68% to 73%. This is not a bad replication rate, but there are two caveats. First, automatic extraction does not distinguish theoretically important focal tests from other tests such as manipulation checks. A comparison of automated extraction and hand-coding shows that replication rates for focal tests are lower than the ERR from automated extraction (cf. the analysis of JESP). The results for BJSP are slightly better than the results for JESP (ERR: 68% vs. 63%; EDR: 46% vs. 35%), but the differences are not statistically significant (the confidence intervals overlap). Hand-coding of JESP articles produces an ERR of 39% and an EDR of 12%. Thus, the overall analysis of BJSP suggests that replication rates for actual replication studies are similar to those for social psychology in general, where the Open Science Collaboration found that only 25% of results could be replicated.

Figure 2 examines time trends by computing the ERR and EDR for each year. It also shows the ERR (solid) and EDR (dotted) for analyses that are limited to p-values smaller than .005 (grey), which are less likely to be produced by questionable practices. The EDR estimates are highly variable because they are very sensitive to the number of just-significant p-values; the ERR estimates are more stable. Importantly, none of them shows a significant trend over time. Visual inspection also suggests that the editorial changes in 2017 have not yet produced changes in the results published in 2018 or 2019.

Given concerns about questionable practices and low replicability in social psychology, readers should be cautious about empirical claims, especially when they are based on just-significant results. P-values should be at least below .005 to be considered empirical evidence.