This post was first shared in the Facebook Psychological Methods Discussion Group. I thought it was interesting and deserved a wider audience.
I know that this is too long for this group, but I don’t have a blog …
A historical anecdote:
In 1963, Rosenthal and Fode published a famous paper on the Experimenter Bias Effect (EBE). There were of course several different experiments and conditions, but in one example, research assistants were given a set of 20 photos of people that participants were to rate on a scale from -10 ([will experience …] “extreme failure”) to +10 (…“extreme success”).
The research assistants (e.g., participants in a class on experimental psychology) were told to replicate a “well-established” psychological finding just like “students in physics labs are expected to do” (p. 494). On average, the sets of photos had been rated in a large pre-study as neutral (M=0), but some research assistants were told that the expected mean of their photos was -5, whereas others were told that it was +5. When the research assistants, who were not allowed to communicate with each other during the experiments, handed in the results of their studies, their findings were biased in the direction of the effect that they had expected. Funnily enough, similar biases could be found for experiments with rats in Skinner boxes as well (Rosenthal & Fode, 1963b).
The findings on the EBE were met with skepticism from other psychologists since they cast doubt on experimental psychology’s self-concept as a true and unbiased natural science. And what have researchers done since the days of Socrates when they doubt the findings of a colleague? Sure, they attempt to replicate them. Whereas Rosenthal and colleagues (by and large) produced several successful “conceptual replications” in slightly different contexts (for a summary see e.g. Rosenthal, 1966), others (most notably T. X. Barber) couldn’t replicate Rosenthal and Fode’s original study (e.g., Barber et al., 1969; Barber & Silver, 1968, but also Jacob, 1968; Wessler & Strauss, 1968).
Rosenthal, a well-versed statistician, responded (e.g., Rosenthal, 1969) that the difference between significant and non-significant may not itself be significant. He used several techniques that about ten years later came to be known as “meta-analysis” to argue that although Barber’s and others’ replications, which of course used other groups of participants, other materials, etc., most often did not yield significant results, a summary of the results suggests that there may still be an EBE (1968; albeit probably smaller than in Rosenthal and Fode’s initial studies – let me think… how can we explain that…).
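Both statistical points in this paragraph can be made concrete with a small sketch. The z-scores below are invented purely for illustration and are not taken from any of the cited studies. The combination rule shown is the unweighted sum of z-scores often called Stouffer's method, chosen here as one simple example of combining results; it is an assumption that this resembles the techniques Rosenthal actually used. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def one_sided_p(z):
    """Upper-tail p-value for a standard normal z-score."""
    return 1.0 - STD_NORMAL.cdf(z)


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def stouffer_z(z_scores):
    """Combine independent z-scores by their unweighted sum (Stouffer's method)."""
    return sum(z_scores) / sqrt(len(z_scores))


def difference_z(z_a, z_b):
    """z-score for the difference between two independent z-scores.

    Assumes both z-scores have unit variance, so their difference has variance 2.
    """
    return (z_a - z_b) / sqrt(2)


# Hypothetical values, invented for illustration only.
original_z = 2.2                      # a "significant" original study
replication_zs = [1.2, 0.9, 1.4, 1.0]  # four "non-significant" replications

for z in replication_zs:
    print(f"replication z = {z:.2f}, one-sided p = {one_sided_p(z):.3f}")

combined = stouffer_z(replication_zs)
print(f"combined z = {combined:.2f}, one-sided p = {one_sided_p(combined):.3f}")

diff = difference_z(original_z, replication_zs[1])
print(f"original vs. replication: z = {diff:.2f}, two-sided p = {two_sided_p(diff):.3f}")
```

With these made-up numbers, no single replication reaches the one-sided .05 threshold, yet the combined z-score does, and the "significant" original does not differ significantly from the "non-significant" replication it is compared with. The sketch treats the studies as independent and equally weighted, which is a simplification.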
Of course, Barber and friends responded to Rosenthal’s responses (e.g., Barber, 1969, titled “invalid arguments, post-mortem analyses, and the experimenter bias effect”) and vice versa, and a serious discussion of psychology’s methodology emerged. Other notables weighed in as well, and statisticians such as Rozeboom (1960) and Bakan (1966), who had by then already done their best to explain to their colleagues the problems of the p-ritual that psychologists use(d) as a verification procedure, were frequently quoted. (On a side note: To me, Bakan’s 1966 paper is better than much of the recent work on the problems with the p-ritual; in particular the paragraph on the problematic assumption of an “automaticity of inference” on p. 430 is still worth reading).
Lykken (1968) and Meehl (1967) soon joined the melee and attacked the p-ritual from an epistemological perspective as well. In 1969, Levy wrote an interesting piece about the value of replications in which he argued that replicating the EBE studies doesn’t make much sense as long as there are no attempts to embed the EBE into a wider explanatory theory that allows for deducing other falsifiable hypotheses as well. Levy knew very well already by 1969 that the question of whether some effect “exists” or “does not exist” is relevant only in very rare cases (namely when there are strong reasons to assume that an effect does not exist – as is the case, for example, with para-psychological phenomena).
Eventually Rosenthal himself (e.g., 1968a) came to think critically of the “reassuring nature of the null hypothesis decision procedure”. What happened then? At some point Rosenthal moved away from experimenter expectancy effects in the lab to Pygmalion effects in the classroom (1968b) – an idea that is much less likely to provoke criticism and replication attempts: Who doesn’t believe that teachers’ stereotypes influence the way they treat children and consequently the children’s chances to succeed in school? The controversy fizzled out, and if you pick up a social psychology textbook, you may find in it the comforting story that this crisis was finally “overcome” (Stroebe, Hewstone, & Jonas, 2013, p. 18) by enlarging psychology’s methodological arsenal, for example, with meta-analytic practices and by becoming a stronger and better science with a more rigid methodology etc. Hooray!
So psychology was finally great again from the 1970s on … was it? What can we learn from this episode?
- It is not the case that psychologists didn’t know the replication game, but they only played it whenever results went against their beliefs – and that was rarely the case (exceptions, apart from Rosenthal’s studies, are of course Bem’s “feeling the future” experiments).
- Science is self-correcting – but only whenever there are controversies (and not if subcommunities just happily produce evidence in favor of their pet theories).
- Everybody who wanted to know could have known by the 1960s that something was wrong with the p-ritual – but no one cared. This was the game that needed to be played to produce evidence in favor of theories, to get published, and to make a career; consequently, people learned to play the verification game more and more effectively. (Bakan writes on p. 423: “What will be said in this paper is hardly original. It is, in a certain sense, what “everybody knows.” To say it “out loud” is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear.” – in 1966!)
- Just making it more difficult to verify a theory will not solve the problem imo; ambitious psychologists will again find ways to play the game – and to win.
- I see two risks with the changes that have been proposed by the “open science community” (in particular preregistration): First, I am afraid that since the verification game still dominates in psychology, researchers will simply shift towards “proving” more boring hypotheses; second, there is the risk that psychological theories will be shielded even more from criticism, since only criticism based on “good science” (preregistered experiments with a priori power analysis and open data) will be valid, whereas criticism based on other types of research activities (e.g., simulations, case studies … or just rational thinking for a change) will be dismissed as “unscientific” => no criticism => no controversy => no improvement => no progress.
- And of course, pre-registration and open science etc. allow psychologists to still maintain the misguided, unfortunate, and highly destructive myth of the “automaticity of inferences”; no inductive mechanism whatsoever can ensure “true discovery”.
- I think what is needed more is a discussion about the relationship between data and theory and about epistemological questions, such as what a “growth of knowledge” in science could look like and how it can be facilitated (I call this a “falsificationist turn”).
- Irrespective of what is going to happen, authors of textbooks will find ways to write up the history of psychology as a flawless cumulative success story …
1 thought on “Guest Post by Peter Holtz: From Experimenter Bias Effects To the Open Science Movement”
Fascinating, many thanks to Peter. The way that medicalised meditation has used ‘grey-zone’ unreplicated (unreplicable) work to promote meditation is surprising. Having completed my cognitive/neuropsychology MSc in 2018, the claims made for the reliability of the ‘method’ are still fresh in my mind. To discover now during my PhD that a lack of replication has been a feature of the science of meditation since the 1970s is perplexing. There are some replicated studies in meditation and mindfulness research, but these are not necessarily the most influential. Strategic reviews made these same critical points in both 1980 and 2018; what does this tell us about the procession of science and its system of self-regulation?