Social psychology aims to study real-world problems with the tools of experimental psychology. The classic obedience studies by Milgram aimed to provide insights into the Holocaust by examining participants’ reactions to a sadistic experimenter. In this tradition, social psychologists have studied prejudice against African Americans since the beginning of experimental social psychology.
As Amodio and Cikara (2021) note, these studies aim to answer questions such as “How do humans learn to favor some groups over others?” and “Why does merely knowing a person’s ethnicity or nationality affect how we see them, the emotions we feel toward them, and the way we treat them?”
In their chapter, the authors review the social neuroscience of prejudice. Looking for prejudice in the brain makes a lot of sense. A person without a brain is not prejudiced, so we know that prejudice is somewhere in the brain. The problem for neuroscience is that the brain is complex and it is not easy to figure out how it does what it does. Another problem is that prejudice is most relevant when individuals act on their prejudices. For example, it is possible that prejudice contributes to the high prevalence of police shootings that involve Black civilians (Schimmack & Carlson, 2020). However, the measurement of brain activity often requires repeated measurements of many trials under constrained laboratory conditions.
In short, using neuroscience to understand prejudice faces some practical challenges. I was therefore skeptical that this research has produced much useful information about prejudice. When I voiced my skepticism on Twitter, Amodio called me a bully. I therefore took a closer look at the chapter to see whether my skepticism was reasonable or whether I was uninformed and prejudiced against social neuroscientists.
To act on prejudice, the brain has to detect a difference between White and Black faces. Research shows that people do indeed notice these differences, especially in faces that were selected to be clear examples of the two categories. If asked to indicate whether a face is White or Black, participants can respond within a few hundred milliseconds.
The authors acknowledge that we do not need social neuroscience to know this: “behavioral studies suggest that social categorization occurs quickly (Macrae & Bodenhausen, 2000)” (p. 3). However, they suggest that neuroscience produces additional information. Unfortunately, the meaning of early brain signals in the EEG is often unclear. Thus, the main conclusion that the authors can draw from these findings is that they provide “support for the early detection and categorization of race.” In other words, when the brain sees a Black person, we usually notice that the person is Black. It is not clear, however, that any of these early brain measures reflect the evaluation of the person, which is really the core of prejudice. Categorization is necessary, but not sufficient, for prejudice. Thus, this research does not really help us to understand why noticing that somebody is Black leads to negative evaluations of this person.
Another problem with this research is the artificial nature of the task. Rather than presenting a heterogeneous set of faces that is representative of participants’ social environment, participants see 50% White and 50% Black faces. Every White person who has suddenly been in a situation where 50% or more of the people are Black may notice that they respond differently to this situation. The brain may look very different in these situations than in situations where race is not salient. In addition, the faces are those of strangers. Thus, these studies have no relevance for prejudice in work settings with colleagues or in classrooms where teachers know their students. This lack of ecological validity is of course not unique to brain studies of prejudice. It also applies to behavioral experiments.
The only interesting and surprising claim in this section is that Black participants respond to White faces just like White participants respond to Black faces. “Research with Black participants, in addition to White participants, has replicated this pattern and clarified that it is typically larger to racial outgroup faces rather than Black faces per se” (p. 5). The statement is a bit muddled because the out-group for Black participants is White.
Looking up the results from Dickter and Bartholow (2007) shows a clear participant x picture interaction effect (i.e., responses on opposite-race trials differ from responses on same-race trials). While the effect for White participants is clearly significant, the effect for Black participants is not, F(1,13) = 4.46, p = .054, but was misreported as significant, p < .05. The second study did not examine Black participants or faces. It also did not include White participants. It showed that Asian participants responded more strongly to the outgroup (White) than to the in-group (Asian), F(1, 19) = 17.06, p = .0006. The lack of a White group of participants is puzzling. The third study had the largest sample of Black and White participants (Volpert-Esmond & Bartholow, 2019), but did not replicate Dickter and Bartholow’s findings: “the predicted Target race × Participant race interaction was not significant, b=−0.09, t(61.3)=−1.47, p=.146.” I have seen shady citations and failures to cite disconfirming evidence, but it is pretty rare for authors to simply list a disconfirming study as if it produced consistent evidence. In conclusion, there is no clear evidence about how minority groups respond to faces of different groups because most of the research is done by White researchers at White universities with White students.
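The reported p-values can be rechecked from the test statistics alone. The following is a minimal sketch in Python, not the original authors' analysis code; the helper names (`t_pdf`, `two_sided_p_from_t`, `p_from_f`) are my own, and the sketch relies only on the fact that an F-test with one numerator degree of freedom is equivalent to a two-sided t-test with t equal to the square root of F. The t distribution is integrated numerically with Simpson's rule so that no statistics package is required.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_from_t(t, df, steps=2000):
    """Two-sided p-value; integrates the density from 0 to |t| (Simpson's rule)."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 1 - 2 * area

def p_from_f(f, df2):
    """p-value for F(1, df2), using F(1, df2) = t(df2) squared."""
    return two_sided_p_from_t(math.sqrt(f), df2)

# Test statistics as quoted in the text above
p_black = p_from_f(4.46, 13)                      # Black participants, Study 1
p_asian = p_from_f(17.06, 19)                     # Asian participants, Study 2
p_interaction = two_sided_p_from_t(-1.47, 61.3)   # Study 3 interaction

print(f"F(1,13) = 4.46   -> p = {p_black:.3f}")
print(f"F(1,19) = 17.06  -> p = {p_asian:.4f}")
print(f"t(61.3) = -1.47  -> p = {p_interaction:.3f}")
```

The first value falls just above .05, which is why a report of p < .05 for this test is a misreport rather than a rounding issue.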
The details are of course not important because the authors’ main goal is to sell social neuroscience. “Social neuroscience research has significantly advanced our understanding of the social categorization process” (p. 11). A close reading shows that this is not the case and that it is unclear what early brain signals mean and how they are modulated by the context, race of participants, and race of faces.
How is prejudice learned, represented, and activated?
Studying learning is a challenging task in an experimental context. To measure learning, some form of memory task must be administered. Moreover, this assessment has to be preceded by a learning task. To make learning experiments more realistic, it is ideal to have a retention interval between the learning and the memory task. However, most studies in psychology are one-shot laboratory studies. Thus, the ecological validity of learning studies is low. Not surprisingly, the chapter contains no studies that examine neurological responses during learning or memory tasks related to prejudice.
Instead, the chapter reviews circumstantial evidence that may be related to prejudice. First, the authors review the general literature on Pavlovian aversive conditioning. However, they provide no evidence that prejudice is rooted in fear conditioning. In fact, many White Americans in White parts of the country are prejudiced without ever having had threatening interactions with Black Americans. Not surprisingly, even the authors note that fear conditioning is not the most plausible root of prejudice.
“Some research has attempted to demonstrate a Pavlovian basis of prejudice using prepared fear or reversal learning paradigms (Dunsmoor et al., 2016; Olsson et al., 2005), but these results have been inconclusive regarding a prepared fear to Black faces (among White Participants) or have failed to replicate (Mallan et al., 2009; Molapour et al., 2015; Navarrete et al., 2009; Navarette et al., 2012). To our knowledge, research has not yet directly tested the hypothesis that social prejudice can be formed through Pavlovian aversive conditioning” (p. 14)
As processing of feared objects often involves the amygdala, one would expect White brains to show an amygdala response to Black faces. Contrary to this prediction, “most fMRI studies of race perception have not observed a difference in amygdala response to viewing racial outgroup compared with ingroup members (e.g., Beer et al., 2008; Gilbert et al., 2012; Golby et al., 2001; Knutson et al., 2007; Mattan et al., 2018; Phelps et al., 2000; Richeson et al., 2003; Ronquillo et al., 2005; Stanley et al., 2012; Telzer et al., 2013; Van Bavel et al., 2008, 2011).” (p. 15). The large number of studies shows how many resources were wasted on a hypothesis that is not grounded in an understanding of racism in the United States.
The chapter then reviews research on stereotypes. The main insight provided here is that “while the neural basis of stereotyping remains understudied, existing research consistently identifies the ATL (anterior temporal lobe) as supporting the representation of social stereotypes” (p. 17). However, it remains unclear what we learn about prejudice from this finding. If stereotypes were supported by some other brain area, would this change prejudice in some important way?
The authors next examine the involvement of instrumental learning in prejudice. “Although social psychologists have long hypothesized a role for instrumental learning in attitudes and social behavior (e.g., Breckler, 1984), this idea has only recently been tested using contemporary reinforcement learning paradigms and computational modeling (Behrens et al., 2009; Hackel & Amodio, 2018).” (p. 19). Checking Hackel and Amodio (2018) shows that this review article does not mention prejudice. Other statements have nothing to do with prejudice, but rather explain why prejudice may not influence responses to all group-members. “Behavioral studies confirm that people incrementally update their attitudes about both persons (Hackel et al., 2019)” (p. 19). The authors want (us) to believe that “a model of instrumental prejudice may help to understand aspects of implicit prejudice” (p. 20), but they fail to make clear how instrumental learning is related to prejudice, let alone implicit prejudice.
The section on prejudice as habits starts with a wrong premise. “Habits: A basis for automatic prejudice? Automatic prejudices are often likened to habits; they appear to emerge from repeated negative experiences with outgroup members, unfold without intention, and resist change (Devine, 1989).” Devine’s (1989) classic subliminal priming study has not been replicated, and the robustness of subliminal priming findings in general has been questioned. Moreover, the study has been questioned on methodological grounds, and it has been shown that classifying an individual as Black does not automatically trigger negative responses. The main reason why prejudice is not a habit is that forming a habit often requires many repeated instances, and many White individuals have too little contact with Black individuals to form prejudice habits. The whole section is irrelevant because the authors note that “social neuroscience has yet to investigate the role of habit in prejudice” (p. 21). We can only hope that funding agencies are smart enough not to waste money on this kind of research.
This whole section ends with the following summary.
“A major contribution of social neuroscience research on prejudice has been to link different aspects of prejudice—stereotypes, affective bias, and discriminatory actions—to neurocognitive models of learning and memory. It reveals that intergroup bias, and implicit bias in particular, is not one phenomenon, but a set of different processes that may be formed, represented in the mind, expressed in behavior, and potentially changed via distinct interventions.” In short, social neuroscience has not taught us anything about prejudice that we did not already know without it.
Effects of prejudice on perception
The first topic is face perception. Behavioral studies show that individuals tend to be better able to discriminate between faces of their own group than faces of another group. Faces are processed in a brain area called the fusiform gyrus. A study by Golby et al. (2001) with 10 White and 10 Black participants confirmed this finding for White Americans, t(8) = 2.10, p = .03, but not for African Americans, t(9) = 0.63. Given the small sample size, the interaction is not significant in this study. The more important finding was that the fusiform gyrus showed more activation to same-race faces, t(18) = 2.58, p = .02. Inconsistent with the behavioral data, African American participants showed this same-race advantage in fusiform activation as much as White participants did. Over the past two decades, this preliminary study has been cited over 300 times. We would expect a review in 2021 to include follow-up and replication studies, but the preliminary results of this seminal study are offered as evidence as if they were conclusive. Yet, in 2021 it is clear that many results with just significant p-values, p > .005, often do not replicate. The authors seem to be blissfully unaware of the replication crisis. I was able to find a recent study that examined own-group bias for White participants only, with three age groups. The study replicated findings that White participants show more activation to White faces than to Black faces, especially for adolescents and adults. The study also linked this finding to prejudice, but I will discuss these results later because it was not the focus of the review article.
In short, behavioral studies have demonstrated that White Americans have difficulties in distinguishing Black faces. This has led to false convictions based on misidentification by eye-witnesses. Expert testimony by psychologists has helped to draw awareness to this problem. Social neuroscience shows that this problem is correlated with activity in the fusiform gyrus. It is not clear, however, how knowledge about the localization of face processing in the brain provides a deeper understanding of the problem.
The authors suggest, however, that face processing may directly lead to discriminatory behavior based on an article by Krosch and Amodio (2019). In a pair of experiments, White participants were given a small or large amount of money and then had to allocate it to White or Black recipients based on some superficial impression of deservingness. In Study 1 (N = 81, 10 excluded), EEG responses to the faces showed a greater N170 response to Black faces, but only when resources were scarce, two-way interaction F(1, 69) = 4.97, p = .029. Furthermore, the results showed a significant mediation effect on resource allocation, b = .14, se = .09, p = .039. Study 2 used fMRI (N = 35, 5 excluded). This study showed the race effect on the fusiform gyrus, but only in the scarcity condition, F(1, 28) = 7.16, p = .012. Despite the smaller sample size, the mediation analysis was also significant, b = .43, se = .17, t = 2.64, p = .014. While the conceptual replication of the finding across two different studies with different brain measures makes these results look credible, the fact that all critical tests produced just significant results, p > .01, undermines the credibility of these findings (Schimmack, 2012). The most powerful test of credibility for a small set of tests is the Test of Insufficient Variance (Schimmack, 2014; Renkewitz & Keiner, 2019). The test first converts the p-values into z-scores. It then compares the observed variance of these z-scores to the expected variance of 1. The observed variance for these four p-values is much smaller, V = .05. A chi-square test shows that the probability of this outcome by chance is p = .013. Thus, it is unlikely that sampling error alone produced this restricted amount of variation. A more likely explanation is that the authors used questionable research practices to produce a perfect picture of significant results when the actual studies had insufficient power to produce significant results even if the main hypotheses are true.
The main problem is the mediation analyses, which rely on correlations in small samples. It has been shown that many mediation analyses cannot be trusted because they are biased by questionable research practices.
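The Test of Insufficient Variance computation described above can be reproduced in a few lines. This is a minimal sketch, not the published implementation: it assumes two-sided p-values, the sample variance with an n - 1 denominator, and a left-tailed chi-square test of (k - 1) x V with k - 1 degrees of freedom. The function names (`p_to_z`, `chi2_cdf`, `tiva`) are my own, and the chi-square distribution function is computed with a standard series expansion so that only the Python standard library is needed.

```python
import math
from statistics import NormalDist, variance

def p_to_z(p):
    """Convert a two-sided p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    a, t = df / 2, x / 2
    if t <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))

def tiva(p_values):
    """Return (variance of z-scores, left-tailed p-value against variance = 1)."""
    z = [p_to_z(p) for p in p_values]
    v = variance(z)            # sample variance, n - 1 denominator
    k = len(z)
    chi2 = (k - 1) * v / 1.0   # expected variance of z-scores is 1
    return v, chi2_cdf(chi2, k - 1)

# The four critical p-values from Krosch and Amodio (2019) quoted above
v, p = tiva([0.029, 0.039, 0.012, 0.014])
print(f"observed variance V = {v:.2f}, p = {p:.3f}")
```

Under these assumptions the sketch gives a variance of about .05 and a p-value between .01 and .02, in line with the values reported above.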
Effects of prejudice on emotion
Emotion is the most important topic for understanding prejudice. Whereas attitudes are broad dispositions to evaluate members of a specific group positively or negatively, emotions are the actual, momentary affective reactions to members of these groups. Ideally, neuroscience would be able to provide objective measures of emotions. These measures would reveal whether a White person responds with negative feelings in an interaction with a Black person. Obtaining objective, physiological indicators of emotions has been the holy grail of emotion research. First attempts to locate emotions in the body failed. Facial expressions (smiles and frowns) can provide valid information, but facial expressions can be controlled and do not always occur in response to emotional stimuli. Thus, the ability to measure brain activity seemed to open the door for objective measures of emotions. However, attempts to find signals of emotional valence in the EEG have failed. fMRI research focused on amygdala activity as a signal of fear, but later research showed that the amygdala also responds to some positive stimuli, specifically erotic stimuli. Given this disappointing history, I was curious to see what the latest social neuroscience research on emotion has uncovered.
As it turns out, this section provides no new insights into emotional responses to members of an outgroup. The main focus is on empathy in the context of taking the perspective of an in-group or out-group member and guilt. The main reason why fear or hate are not explored is probably that there are no known neural correlates of these emotions and that research with undergraduate students in response to pictures of Black and White faces is unlikely to elicit strong emotions.
In short, the main topic where neuroscience could make a contribution suffers from a lack of valid measures of emotions in the brain.
Effects of prejudice on decision making
Emotional responses would be less of a problem if individuals did not act on their emotions. Most adult individuals learn to regulate their emotions and to inhibit undesirable behaviors. The reason prejudice is a problem for minority groups is that some White individuals do not feel a need to regulate their negative emotions towards African Americans or that they lack the ability to do so in some situations, which is often called implicit bias. Thus, understanding how the brain is involved in actual behaviors is even more important than understanding its contribution to emotions. Although of prime importance, this section is short and contains few citations. One reference is to the resource allocation study by Krosch and Amodio that I reviewed in detail earlier. Blissfully unaware of the questions raised about oxytocin research, the authors also cite a study with oxytocin administration (Marsh et al., 2017). Thus, there is no research reviewed here that illuminates what the brain is doing when White individuals discriminate against African Americans. This does not stop the authors from making a big summary statement that “social neuroscience research has refined our understanding of how prejudice influences the visual processing of faces, intergroup emotion, and decision-making processes, particularly as each type of response pertains to behavior” (p. 34).
Self-regulation of Prejudice
This section starts off with a study by Amodio et al. (2004) and the claim that the results of this study have been replicated in numerous studies (Amodio et al., 2006; 2008; Amodio & Swencionis, 2018; Bartholow et al., 2006; Beer et al., 2008; Correll et al., 2006; Hughes et al., 2017). The main claim based on these studies is that self-regulation of prejudice relies on “detection of bias and initiation of control, in dACC—a process that can operate rapidly and in the absence of deliberation, and which can explain individual differences in prejudice control failures” (p. 39).
Amodio et al.’s (2004) study used the weapons identification task. This task is an artificial task that puts participants in the position of a police officer who has to make a split-second decision whether a civilian is holding a gun or some other object (cell phone). Respondents have to respond as quickly as possible whether the object is a gun or not. The race of the civilians is manipulated to examine racial biases. A robust finding is that White participants are faster to identify guns after seeing a Black face than a White face and slower to identify a tool after seeing a White face than a Black face. On some trials, participants also make mistakes. When the brain of participants notices that a mistake was made, EEG shows a distinct signal that is called the error-related negativity (ERN). The key finding in this article is that the ERN was more pronounced when participants identified a tool as a gun in trials with Black faces than in trials with White faces, t(33) = 2.94, p = .006. Correlational analysis suggested that participants with larger ERNs after mistakes with Black faces learned from their mistakes and reduced their errors, r(32) = -.50, p = .004. These results show that at least some individuals are aware when prejudice influences their behaviors and control their behaviors to avoid acting on their prejudice. It is difficult to generalize from this study to regulation of prejudice in real life because the task is artificial and most situations provide only ambiguous feedback about the appropriateness of actions. Even the behavior here is a mere identification rather than an actual behavior such as a shoot or no-shoot decision, which might produce more careful responses and fewer errors, especially in more realistic training scenarios (Andersen zzz).
Another limitation of these studies is the reliance on a few pictures to represent the large diversity of Black and White people.
All replication studies seem to have used the same faces. Therefore, it is unclear how generalizable these results are and how much contextual factors (e.g., gender, age, clothing, location, etc.) might moderate the effect.
Some limitations of the generalizability were reported by Amodio and Swencionis (2018). The racial bias effect was eliminated (no longer statistically significant) when 80% of trials showed Black faces with tools rather than guns. This finding is not predicted by models that assume racial bias often has an implicit (automatic and uncontrollable) effect on behavior. Here it seems that simple knowledge about the low frequency of Black people with guns was sufficient to block the behavioral expression of prejudice. Study 4 measured EEG, but did not report ERN results.
The summary of this section concludes that “social neuroscience research on prejudice control has significantly expanded psychological theory by identifying and distinguishing multiple mechanisms of control” (p. 39). I would disagree. The main finding appears to be that the brain sometimes fails to notice that it made an error and that lack of awareness of these errors prevents their correction. However, the studies are designed to produce errors in the first place to be able to measure the ERN. Without time pressure, few errors would be made, and Amodio and Swencionis (2018) show that racial bias depends on a specific context. That being said, lack of awareness may cause sustained prejudice in the real world. One important role of diversity training is to make majority members aware of behaviors that hurt minority members. Awareness of the consequences should reduce the frequency of these behaviors because they are controllable, as the reviewed research suggests.
The conclusion section repeats the claim that the review highlights “major theoretical advances produced by this literature to date” (p. 42). However, this claim rings hollow in comparison to the dearth of findings that inform our understanding of prejudice. The main problem for social neuroscience of prejudice is that the core component of prejudice, negative affect, has no clear neural correlates in EEG or fMRI measures of the brain, and that experimental designs suitable for neuroscience have low ecological validity. The authors suggest that this may change in the future. They provide a study with Black and White South Africans as an example. The study measured fMRI while participants viewed short video-clips of Black and White individuals in distress. The videos were taken from the South African Truth and Reconciliation Commission. The key finding was that brain signals related to empathy showed an in-group bias. Both groups responded more to distress by members of their own group. The fact that this study is offered as an example for greater ecological validity shows the problems for social neuroscience to study prejudice in realistic settings where one individual responds to another individual and their behavior is influenced by prejudice. The authors also point to technological advances as a way to increase ecological validity. Wearable neuroimaging makes it possible to measure the brain in naturalistic settings, but it is not clear what brain signals would produce valuable information about prejudice.
My main concern is that social neuroscience research on prejudice takes away resources from other, in my opinion more important, prejudice research that focuses on actual behaviors in the real world. I am not the only one who has observed that the focus on cognition and the brain has crowded out research on actual behaviors (Baumeister, Vohs, & Funder, 2007; Cesario, 2021). If a funding agency can spend a million dollars on a grant to study the brains of undergraduate students while they look at Black and White faces or on the shooting errors of police officers in realistic simulations, I would give money to the study of actual behavior. There is also a dearth of research on prejudice from the perspective of the victims. They know best what prejudice is and how it affects them. There needs to be more diversity in research and White researchers should collaborate with Black researchers who can draw on personal experiences and deep cultural knowledge that White researchers lack or fail to use in their research. Finally, the incentive structure needs to change. Prejudice researchers are rewarded like all other researchers for publishing in prestigious journals that are controlled by White researchers. Even journals dedicated to social issues have this systemic bias. Prejudice research more than any other field needs to ensure equity, diversity, and inclusion at all levels. Moving social neuroscience of prejudice out of White social cognition research into a diverse and interdisciplinary field might help to ensure that these studies actually inform our understanding of prejudice. Thus, a reallocation of funding is needed to ensure that funding for prejudice research benefits African Americans and other minority groups.
P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I showed that p-curve has serious limitations and provides misleading evidence about the strength of evidence against the null-hypothesis.
I showed that all of the information that is provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also show how alpha levels can be adjusted to separate significant results with weak and strong evidence to select credible findings even when selection for significance is present.
As z-curve does everything that p-curve does and more, the rational choice is to choose z-curve for the meta-analysis of p-values.
In 2011, it dawned on psychologists that something was wrong with their science. Daryl Bem had just published an article with nine studies that showed an incredible finding. Participants’ responses were influenced by random events that had not yet occurred. Since then, the flaws in research practices have become clear and it has been shown that they are not limited to mental time travel (Schimmack, 2020). For decades, psychologists assumed that statistically significant results reveal true effects and reported only statistically significant results (Sterling, 1959). However, selective reporting of significant results undermines the purpose of significance testing to distinguish true and false hypotheses. If only significant results are reported, most published results could be false positive results like those reported by Bem (2011).
Selective reporting of significant results also undermines the credibility of meta-analyses (Rosenthal, 1979), which explains why meta-analyses also suggest humans possess psychic abilities (Bem & Honorton, 1994). This sad state of affairs stimulated renewed interest in methods that detect selection for significance (Schimmack, 2012) and methods that correct for publication bias in meta-analyses. Here I focus on a comparison of p-curve (Simonsohn et al., 2014a; Simonsohn et al., 2014b) and z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).
P-curve is a name for a family of statistical tests that have been combined into the p-curve app that researchers can use to conduct p-curve analyses, henceforth called p-curve. The latest version of p-curve is version 4.06, which was last updated on November 30, 2017 (p-curve.com).
The first part of a p-curve analysis is a p-curve plot. A p-curve plot is a histogram of all significant p-values where p-values are placed into five bins, namely p-values ranging from 0 to .01, .01 to .02, .02 to .03, .03 to .04, and .04 to .05. If the set of studies contains mostly studies with true effects that have been tested with moderate to high power, the plot shows decreasing frequencies as p-values increase (more p-values between 0 and .01 than between .04 and .05). This pattern has been called a right-skewed distribution by the p-curve authors. If the distribution is flat or reversed (more p-values between .04 and .05 than between 0 and .01), most p-values may be false positive results.
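The binning behind a p-curve plot is simple enough to sketch directly. The following Python snippet is only an illustration of the description above, not the code of the p-curve app; the function name `p_curve_bins` and the example p-values are hypothetical.

```python
def p_curve_bins(p_values):
    """Count significant p-values (p < .05) in the five p-curve bins.

    Bins: (0, .01], (.01, .02], (.02, .03], (.03, .04], (.04, .05).
    Non-significant p-values are ignored, as in a p-curve plot.
    """
    upper_edges = [0.01, 0.02, 0.03, 0.04, 0.05]
    counts = [0] * len(upper_edges)
    for p in p_values:
        if not 0 < p < 0.05:
            continue  # p-curve only uses significant results
        for i, upper in enumerate(upper_edges):
            if p <= upper:
                counts[i] += 1
                break
    return counts

# Hypothetical p-values for illustration; the last one is not significant
example = [0.001, 0.004, 0.008, 0.012, 0.021, 0.034, 0.049, 0.20]
counts = p_curve_bins(example)
n_sig = sum(counts)
for upper, c in zip([0.01, 0.02, 0.03, 0.04, 0.05], counts):
    print(f"p up to {upper:.2f}: {c} ({100 * c / n_sig:.0f}%)")
```

A right-skewed p-curve in the sense used here is one where the first count is clearly larger than the last.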
The main limitation of p-curve plots is that it is difficult to evaluate ambiguous cases. To aid in the interpretation of p-curve plots, p-curve also provides statistical tests of evidential value. One test is a significance test against the null-hypothesis that all significant p-values are false positive results. If this null-hypothesis can be rejected with the traditional alpha criterion of .05, it is possible to conclude that at least some of the significant results are not false positives.
The main problem with significance tests is that they do not provide information about effect sizes. A right-skewed p-curve with a significant p-value may be due to weak evidence with many false positive results or strong evidence with few false positives.
To address this concern, the p-curve app also provides an estimate of statistical power. This estimate assumes that the studies in the meta-analysis are homogeneous because power is a conditional probability under the assumption that an effect is present. Thus, power does not apply to a meta-analysis of studies that contain true positive and false positive results because power is not defined for false positive results.
To illustrate the interpretation of p-curve analysis, I conducted a meta-analysis of all studies published by Leif D. Nelson, one of the co-authors of p-curve analysis. I found 119 studies with codable data and coded the most focal hypothesis for each of these studies. I then submitted the data to the online p-curve app. Figure 1 shows the output.
Visual inspection of the p-curve plot shows a right-skewed distribution with 57% of the p-values between 0 and .01 and only 6% of p-values between .04 and .05. The statistical test against the null-hypothesis that all of the significant p-values are false positives is highly significant. Thus, at least some of the p-values are likely to be true positives. Finally, the power estimate is very high, 97%, with a tight confidence interval ranging from 96% to 98%. Somewhat redundant with this information, the p-curve app also provides a significance test for the hypothesis that power is less than 33%. This test is not significant, which is not surprising given the estimated power of 97%.
The next part of a p-curve output provides more details about the significance tests, but does not add more information.
The next part provides users with an interpretation of the results.
The interpretation informs readers that this set of p-values provides evidential value. Somewhat surprisingly, this automated interpretation does not mention the power estimate to quantify the strength of evidence. The focus on p-values is problematic because p-values are influenced by the number of tests. The p-value could be lower with 100 studies with 40% power than with 10 studies with 99% power. As significance tests are redundant with confidence intervals, it is sufficient to focus on the confidence interval of the power estimate. With a 90% confidence interval ranging from 96% to 98%, we would be justified to conclude that this set of p-values provides strong support for the hypotheses tested in Nelson’s articles.
Like p-curve, z-curve analyses also start with a plot of the p-values. The main difference is that p-values are converted into z-scores using the formula for the inverse normal distribution; z = qnorm(1-p/2). The second difference is that significant and non-significant p-values are plotted. The third difference is that z-curve plots have a much finer resolution than p-curve plots. Whereas p-curve bins all z-scores from 2.58 to infinity into one bin (p < .01), z-curve uses the information about the distribution of z-scores all the way up to z = 6 (p = .000000002; 1/500,000,000).
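The conversion can be reproduced without R. The following minimal sketch uses Python's standard library as a stand-in for `qnorm`; the helper name `p_to_z` is my own.

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-sided p-value into an absolute z-score (R: qnorm(1 - p/2))."""
    return NormalDist().inv_cdf(1 - p / 2)

# Conventional thresholds and the z-scores they correspond to.
for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<6} -> z = {p_to_z(p):.2f}")
```

The loop makes visible why a single p-curve bin for p < .01 is coarse: it covers every z-score from about 2.58 upward.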
Visual inspection of the z-curve plot reveals something that the p-curve plot does not show, namely there is clear evidence for the presence of selection bias. Whereas p-curve suggests that “highly” significant results (0 to .01) are much more common than “just” significant results (.04 to .05), z-curve shows that just significant results (.05 to .005) are much more frequent than highly significant (p < .005) results. The difference is due to the implicit definition of high and low in the two plots. The high frequency of highly significant (p < .01) results in the p-curve plots is due to the wide range of values that are lumped together into this bin. Once it is clear that many p-values are clustered just below .05 (z > 1.96, the vertical red line), it is immediately notable that there are too few just non-significant (z < 1.96) values. This steep drop is not consistent with random sampling error. To summarize, z-curve plots provide more information than p-curve plots. Whereas z-curve plots make the presence of selection for significance visible, p-curve plots provide no means to evaluate selection bias. Even worse, right-skewed distributions are often falsely interpreted as evidence that there is no selection for significance. This example shows that notable right-skewed distributions can be found even when selection bias is present.
The second part of a z-curve analysis uses a finite mixture model to estimate two statistical parameters of the data. These parameters are called the estimated discovery rate and the estimated replication rate (Bartos & Schimmack, 2021). Another term for these parameters is mean power before selection and mean power after selection for significance (Brunner & Schimmack, 2020). The meaning of these terms is best understood with a simple example where a researcher tests 100 false hypotheses and 100 true hypotheses with 100% power. These tests produce significant and non-significant p-values. The expected frequency of significant p-values is 100 for the 100 true hypotheses tested with 100% power and 5 for the 100 false hypotheses when alpha is set to 5%. Thus, we are expecting 105 significant results and 95 non-significant results. Although we know the percentages of true and false hypotheses, this information is not available with real data. Thus, any estimate of average power changes the meaning of power. It now includes false hypotheses with a power equal to alpha. We call this unconditional power to distinguish it from the typical meaning of power conditioned on a true hypothesis.
It is now possible to compute mean unconditional power for two populations of studies. One population consists of all studies that were conducted. In this example, this population consists of all 200 studies (100 true, 100 false hypotheses). The average power for these 200 studies is easy to compute as (.05*100 + 1*100)/200 = 52.5%. The second population of studies focuses only on the significant studies. After selecting only significant studies, mean unconditional power is (.05*5 + 1*100)/105 = 95.5%. The reason why power is so much higher after selection for significance is that the significance filter keeps most false hypotheses out of the population of studies with a significant result (95 of the 100 studies to be exact). Thus, power is mostly determined by the true hypotheses that were tested with perfect power. Of course, real data are not as clean as this simple example, but the same logic applies to all sets of studies with a diverse range of power values for individual studies (Brunner & Schimmack, 2020).
Mean power before selection for significance determines the percentage of significant results for a number of tests. With 50% mean power before selection, 100 tests are expected to produce 50 significant results (Brunner & Schimmack, 2020). It is common to refer to statistically significant results as discoveries (Soric, 1989). Importantly, discoveries could be true or false, just like a significant result could be a true effect or a type-I error. In our example, there were 105 discoveries. Normally we would not know that 100 of these discoveries are true discoveries. All we know is the percentage of significant results. I use the term estimated discovery rate (EDR) to refer to mean unconditional power before selection, which is a mouthful. In short, EDR is an estimate of the percentage of significant results in a series of statistical tests.
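The arithmetic of this example can be checked with a short Python sketch. The helper `mean_power` and the variable names are hypothetical; the numbers are those of the example (100 false hypotheses with power equal to alpha, 100 true hypotheses with 100% power).

```python
def mean_power(groups):
    """groups: list of (number_of_studies, power) pairs."""
    total = sum(n for n, _ in groups)
    return sum(n * power for n, power in groups) / total

alpha = 0.05
conducted = [(100, alpha), (100, 1.0)]  # 100 false (power = alpha), 100 true hypotheses
# Expected number of significant studies in each group is n * power.
significant = [(n * power, power) for n, power in conducted]

before = mean_power(conducted)    # all 200 studies
after = mean_power(significant)   # the 105 expected significant studies
print(f"before selection: {before:.1%}, after selection: {after:.1%}")
```

Weighting each group by its expected number of significant results is what moves the average from the population of all studies to the population of significant studies.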
Mean power after selection for significance is relevant because power of significant results determines the probability that a significant result can be successfully replicated in a direct replication study with the same sample size (Brunner & Schimmack, 2020). Using the EDR would be misleading. In the present example, the EDR of 52.5% would dramatically underestimate replicability of significant results, which is actually 95.5%. Using the EDR would punish researchers who conduct high-powered tests of true and false hypotheses. To assess the replicability of this researcher, it is necessary to compute power only for the studies that produced significant results. The problem with traditional meta-analyses is that selection for significance leads to inflated effect size estimates even if the researcher reported all non-significant results. To estimate the replicability of the significant results, the data are conditioned on significance, which inflates replicability estimates. Z-curve models this selection process and corrects for regression to the mean in the estimation of mean unconditional power after selection for significance. I call this statistic the estimated replication rate. The reason is that mean unconditional power after selection for significance determines the percentage of significant results that is expected in direct replication studies of studies with a significant result. In short, the ERR is the probability that a direct replication study with the same sample size produces a significant result.
I start the discussion of the z-curve results for Nelson’s data with the estimated replication rate because this estimate is conceptually similar to the power estimate in the p-curve analysis. Both estimates focus on the population of studies with significant results and correct for selection for significance. Thus, one would expect similar results. However, the p-curve estimate of 97%, 95%CI = 96% to 98%, is very different from the z-curve estimate of 52%, 95%CI = 40% to 68%. The confidence intervals do not overlap, showing that the difference between these estimates is itself statistically significant.
The explanation for this discrepancy is that p-curve estimates are inflated estimates of the ERR when power is heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). This is true even if effect sizes are homogeneous and studies vary only in sample sizes (Brunner, 2018). The p-curve authors have been aware of this problem since 2018 (Datacolada), and have not updated the p-curve app in response to this criticism of their app. The present example shows that using the p-curve app can lead to extremely misleading conclusions. Whereas p-curve suggests that nearly every study by Nelson would produce a significant result again in a direct replication attempt, the correct z-curve estimate suggests that only every other result would replicate successfully. This difference is not only statistically significant, but also practically significant in the evaluation of Nelson’s work.
In sum, p-curve is not only redundant with z-curve. It also produces false information about the strength of evidence in a set of p-values.
Unlike p-curve, z-curve.2.0 also estimates the discovery rate based on the distribution of the significant p-values. The results are shown in Figure 2 as the grey curve in the range of non-significant results. As can be seen, while z-curve predicts a large number of non-significant results, the actual studies reported very few non-significant results. This suggests selection for significance. To quantify the amount of selection bias, it is possible to compare the observed discovery rate (i.e., the actual percentage of significant results), 87%, to the estimated discovery rate, EDR = 27%. The 95% confidence interval around the EDR can be used for a significance test. As 87% is well outside the 95%CI of the EDR, 5% to 51%, the results provide strong evidence that the reported results were selected from a larger set of tests with non-significant results that were not reported. In this specific case, this inference is consistent with the authors’ admission that questionable research practices were used (Simmons, Nelson, & Simonsohn, 2011).
“Our best guess was that so many published findings were false because researchers were conducting many analyses on the same data set and just reporting those that were statistically significant, a behavior that we later labeled “p-hacking” (Simonsohn, Nelson, & Simmons, 2014). We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants to achieve significance.” (Simmons, Nelson, & Simonsohn, 2018, p. 255).
The p-curve authors also popularized the idea that selection for significance may have produced many false positive results (Simmons et al., 2011). However, p-curve does not provide an estimate of the false positive risk. In contrast, z-curve provides information about the false discovery risk because the false discovery risk is a direct function of the discovery rate. Using the EDR with Soric’s formula shows that the false discovery risk for Nelson’s studies is 14%, but due to the small number of tests, the 95%CI around this estimate ranges from 5% to 100%. Thus, even though the ERR suggests that half of the studies can be replicated, it is possible that the other half of the studies contain a fairly large number of false positive results. Without the identification of moderator variables, it would be impossible to say whether a result is a true or a false discovery.
The ability to estimate the false positive risk makes it possible to identify a subset of studies with a low false positive risk by lowering alpha. Lowering alpha reduces the false positive risk for two reasons. First, it follows logically that a lower alpha produces a lower false positive risk. For example, in the prior example with 100 true and 100 false hypotheses, an alpha of 5% produced 105 significant results that included 5 false positive results and the false positive rate was 5/105 = 4.76%. Lowering alpha to 1% produces only 101 significant results and the false positive rate is 1/101 = 0.99%. Second, questionable research practices are much more likely to produce false positive results with alpha = .05 than with alpha = .01.
In a z-curve analysis, alpha can be set to different values to examine the false positive rate. A reasonable criterion is to aim for a false discovery rate of 5%, which many psychologists falsely assume is the goal of setting alpha to 5%. For Nelson’s 119 studies, alpha can be lowered to .01 to achieve a false discovery risk of 5%.
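Soric’s (1989) formula gives the maximum false discovery rate as (1/DR − 1) × alpha/(1 − alpha), where DR is the discovery rate. A minimal Python sketch, with a function name of my own choosing, applies this formula to the EDR and its confidence limits:

```python
def soric_fdr(edr, alpha=0.05):
    """Maximum false discovery rate implied by a discovery rate (Soric, 1989)."""
    return min(1.0, (1 / edr - 1) * alpha / (1 - alpha))

# Point estimate and 95% confidence limits of the EDR for Nelson's studies.
for label, edr in [("estimate", 0.27), ("lower bound", 0.05), ("upper bound", 0.51)]:
    print(f"EDR {label} = {edr:.0%} -> false discovery risk = {soric_fdr(edr):.0%}")
```

Because the risk rises as the discovery rate falls, the lower limit of the EDR yields the upper limit of the false discovery risk.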
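The first reason can be illustrated with the example of 100 true hypotheses tested with 100% power and 100 false hypotheses. The helper below is a hypothetical sketch that simply counts expected significant results at a given alpha.

```python
def false_positive_rate(n_true, n_false, alpha, power=1.0):
    """Expected share of false positives among significant results."""
    true_hits = n_true * power     # true hypotheses that reach significance
    false_hits = n_false * alpha   # false hypotheses that reach significance by chance
    total_hits = true_hits + false_hits
    return false_hits, total_hits, false_hits / total_hits

for alpha in (0.05, 0.01):
    fp, sig, rate = false_positive_rate(100, 100, alpha)
    print(f"alpha = {alpha}: {sig:.0f} significant, {fp:.0f} false, rate = {rate:.2%}")
```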
With alpha = .01, there are still 60 out of 119 (50%) significant results. It is therefore not necessary to dismiss all of the published results because some results were obtained with questionable research practices.
For Nelson’s studies, a plausible moderator is timing. As Nelson and colleagues reported, he used QRPs before he himself drew attention to the problems with these practices. In response, he may have changed his research practices. To test this hypothesis, it is possible to fit a z-curve analysis to articles published before and after 2012 (due to publication lag, articles in 2012 are likely to still contain QRPs).
Consistent with the hypothesis, the EDR for 2012 and before is only 11%, 95%CI 5% to 31%, and the false discovery risk increases to 42%, 95%CI = 12% to 100%. Even with alpha = .01, the FDR is still 11%, and with alpha = .005 it is still 10%. With alpha = .001, it is reduced to 2% and 18 results remain significant. Thus, most of the published results lack credible evidence against the null-hypothesis.
Results look very different after 2012. The EDR is 83% and not different from the ODR, suggesting no evidence that selection for significance occurred. The high EDR implies a low false discovery risk even with the conventional alpha criterion of 5%. Thus, all 40 results with p < .05 provide credible evidence against the null-hypothesis.
To see how misleading p-curves can be, I also conducted a p-curve analysis for the studies published in the years up to 2012. The p-curve analysis shows merely that the studies have evidential value and provides a dramatically inflated estimate of power (84% vs. 35%). It does not show evidence that p-values are selected for significance and it does not provide information to distinguish p-hacked studies from studies with evidential value.
P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I showed that p-curve has serious limitations and provides misleading evidence about the strength of evidence against the null-hypothesis.
I showed that all of the information that is provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also showed how alpha levels can be adjusted to separate significant results with weak and strong evidence to select credible findings even when selection for significance is present.
As z-curve does everything that p-curve does and more, the rational choice is to choose z-curve for the meta-analysis of p-values.
This blog post reports a replicability audit of Ap Dijksterhuis’s 48 most highly cited articles that provide the basis for his H-Index of 48 (WebofScience, 4/23/2021). The z-curve analysis shows lack of evidential value and a high false positive risk. Rather than dismissing all findings, it is possible to salvage 10 findings by setting alpha to .001 to maintain a false positive risk below 5%. The main article that contains evidential value was published in 2016. Based on these results, I argue that 47 of the 48 articles do not contain credible empirical information that supports the claims in these articles. These articles should not be cited as if they contain empirical evidence.
“Trust is good, but control is better”
Since 2011, it has become clear that social psychologists misused the scientific method. It was falsely assumed that a statistically significant result ensures that a finding is not a statistical fluke. This assumption is false for two reasons. First, even if the scientific method is used correctly, statistical significance can occur without a real effect in 5% of all studies. This is a low risk if most studies test true hypotheses with high statistical power, which produces a high discovery rate. However, if many false hypotheses are tested and true hypotheses are tested with low power, the discovery rate is low and the false discovery risk is high. Unfortunately, the true discovery rate is not known because social psychologists only published significant results. This selective reporting of significant results renders statistical significance insignificant. In theory, all published results could be false positive results.
The question is what we, the consumers of social psychological research, should do with thousands of studies that provide only questionable evidence. One solution is to “burn everything to the ground” and start fresh. Another solution is to correct the mistake in the application of the scientific method. I compare this correction to the repair of the Hubble telescope (https://www.nasa.gov/content/hubbles-mirror-flaw). Only after the Hubble telescope was launched into space was it discovered that a mistake had been made in the creation of the mirror. Replacing the mirror in space was impractical. As a result, a correction was made to take the discrepancy in the data into account.
The same can be done with significance testing. To correct for the misuse of the scientific method, the criterion for statistical significance can be lowered to ensure an acceptably low risk of false positive results. One solution is to apply this correction to articles on a specific topic or to articles in a particular journal. Here, I focus on authors for two reasons. First, authors are likely to use a specific approach to research that depends on their training and the field of study. Elsewhere I demonstrated that researchers differ considerably in their research practices (Schimmack, 2021). More controversially, I also think that authors are accountable for their research practices. If they realize that they made mistakes, they could help the research community by admitting to their mistakes and retracting articles or at least expressing their loss of confidence in some of their work (Rohrer et al., 2020).
Ap Dijksterhuis is a well-known social psychologist. His main focus has been on unconscious processes. Starting in the 1990s, social psychologists became fascinated by unconscious and implicit processes. This triggered what some call an implicit revolution (Greenwald & Banaji, 1995). Dijksterhuis has been prolific and his work is highly cited, which earned him an H-Index of 48 in WebOfScience.
However, after 2011 it became apparent that many findings in this literature are difficult to replicate (Kahneman, 2012). A large replication project also failed to replicate one of Dijksterhuis’s results (O’Donnell et al., 2018). It is therefore interesting and important to examine the credibility of Dijksterhuis’s studies.
I used WebofScience to identify the most cited articles by Dijksterhuis (datafile). I then coded empirical articles until the number of coded articles matched the number of citations. The 48 articles reported 105 studies with a codable focal hypothesis test.
The total number of participants was 7,470 with a median sample size of N = 57 participants. For each focal test, I first computed the exact two-sided p-value and then computed a z-score for the p-value divided by two. Consistent with practices in social psychology, all reported studies supported predictions, even when the results were not strictly significant. The success rate for p < .05 (two-tailed) was 100/105 = 95%, which has been typical for social psychology for decades (Sterling, 1959).
The z-scores were submitted to a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). The first part of a z-curve analysis is the z-curve plot (Figure 1).
The vertical red line at z = 1.96 represents the significance criterion with alpha = .05 (two-tailed). The figure shows that most p-values are just significant with z-scores just above 1.96. The distribution of z-scores is abnormal in the sense that random sampling error alone cannot produce the steep drop on the left side of the significance criterion. This provides visual evidence of selection for significance.
The second part of a z-curve analysis is to fit a finite mixture model to the distribution of the significant z-scores (z > 1.96). The model tries to match the distribution as closely as possible. The best fitting curve is shown with the grey/black checkered line. It is notable that the actual data decrease a bit more steeply than the grey curve. This shows that the model has problems fitting the data. This suggests that significance was obtained with massive p-hacking, which produces an abundance of just significant results. This is confirmed with a p-curve analysis that shows more p-values between .04 and .05 than p-values between 0 and .01; 24% vs. 19%, respectively (Simonsohn et al., 2014).
The main implication of a left-skewed p-curve is that most significant results do not provide evidence against the null-hypothesis. This is confirmed by the z-curve analysis. A z-curve analysis projects the model based on significant results into the range of non-significant results. This makes it possible to estimate how many tests were conducted to produce the observed significant results (assuming a simple selection model). The results for these data suggest that the reported significant results are only 5% of all statistical tests, which is what would be expected if only false hypotheses were tested. As a result, the false positive risk is 100%. Z-curve also computes bootstrapped confidence intervals around these estimates. The upper bound for the estimated discovery rate is 12%. Thus, most of the studies had a very low chance of producing a significant result (low statistical power), even if they did not test a false hypothesis. With a low discovery rate of 12%, the risk that a significant result is a false positive result is still 39%. This is unacceptably high.
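Under the simple selection model, the number of tests implied by a discovery rate is the number of significant results divided by that rate. The sketch below applies this to the 100 significant focal tests reported above; the function name is hypothetical and the calculation ignores everything except selective reporting of significant results.

```python
def implied_tests(n_significant, edr):
    """Tests needed to expect n_significant results at a given discovery rate."""
    return n_significant / edr

n_sig = 100  # focal tests with p < .05 among the 105 coded studies
for edr in (0.05, 0.12):  # point estimate and upper bound of the EDR
    total = implied_tests(n_sig, edr)
    print(f"EDR = {edr:.0%}: about {total:.0f} tests, {total - n_sig:.0f} not significant")
```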
The estimated replication rate of 7% is slightly higher than the estimated discovery rate of 5%. This suggests some heterogeneity across the studies which leads to higher power for studies that produced significant results. However, even 7% replicability is very low. Thus, most studies are expected to produce a non-significant result in a replication attempt.
Based on these results, it would be reasonable to burn everything to the ground and to dismiss the claims made in these 48 articles as empirically unfounded. However, it is also possible to reduce the false positive risk by increasing the significance threshold. With alpha = .01 the FDR is 19%, with alpha = .005 it is 10%, and with alpha = .001 it is 2%. So, to keep the false positive risk below 5%, it is possible to set alpha to .001. This renders most findings non-significant, but 10 findings remain significant.
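Raising the significance threshold amounts to raising the critical z-value and counting the results that still exceed it. The following sketch shows the mechanics with made-up z-scores; the function names and the values are illustrative assumptions, not the coded data from this audit.

```python
from statistics import NormalDist

def critical_z(alpha):
    """Two-sided critical z-value for a given alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def count_significant(z_scores, alpha):
    """Number of absolute z-scores that exceed the critical value."""
    return sum(abs(z) > critical_z(alpha) for z in z_scores)

hypothetical_z = [2.0, 2.1, 2.3, 2.7, 3.0, 3.5, 4.2]  # made-up values, not the coded data
for alpha in (0.05, 0.01, 0.005, 0.001):
    remaining = count_significant(hypothetical_z, alpha)
    print(f"alpha = {alpha}: z > {critical_z(alpha):.2f}, {remaining} of {len(hypothetical_z)} remain")
```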
One finding is evidence that liking of one’s initials has retest reliability. A more interesting finding is that 4 significant (p < .001) results were obtained in the most recent (2016) article that also included pre-registered studies. This suggests that Dijksterhuis changed research practices in the wake of the replicability crisis. Thus, new articles that have not garnered a lot of citations may be more credible, but the pre-2011 articles lack credible empirical evidence for most of the claims made in these articles.
It is nearly certain that I made some mistakes in the coding of Ap Dijksterhuis’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit. The data are openly available and the z-curve code is also openly available. Thus, this replicability audit is fully transparent and open to revision.
Moreover, the results are broadly consistent with the z-curve results based on automated extraction of test statistics (Schimmack, 2021). Based on automated coding, Dijksterhuis has an EDR of 17%, with a rank of 312 out of 357 social psychologists. The reason for the higher EDR is that automated coding does not distinguish focal and non-focal tests and focal tests tend to have lower power and a higher risk of being false positives.
If you found this audit interesting, you might also be interested in other replicability audits (Replicability Audits).
It has been proposed that psychologists used a number of p-hacking methods to produce false positive results. In this post, I examine the prevalence of two p-hacking methods, namely the use of covariates and peeking until a significant result is obtained. The evidence suggests that these strategies are not very prevalent. One explanation for this is that they are less efficient than other p-hacking methods. File-drawering of small studies and inclusion of multiple dependent measures are more likely to be the main questionable practices that inflate effect sizes and success rates in psychology.
P-hacking refers to statistical practices that inflate effect size estimates and increase the probability of false positive results (Simonsohn et al., 2014). There are a number of questionable practices that can be used to achieve this goal. Three major practices are (a) continuing to add participants until significance is reached, (b) adding covariates, and (c) testing multiple dependent variables.
In a previous blog-post, I pointed out that two of these practices are rather dumb because they require more resources than simply running many studies with small samples and putting non-significant results in the file drawer (Schimmack, 2021). The dumbest p-hacking method is to continue data collection until significance is reached. Even with sampling until N = 200, the majority of studies remain non-significant. The predicted pattern is a continuous decline in the frequencies with increasing sample sizes. The second dumb strategy is to measure additional variables and to use them as covariates. It is smarter to add additional variables as dependent variables.
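The inefficiency of continued data collection can be illustrated with a small simulation. The sketch below is not the simulation from the earlier post; it assumes a true null effect, a z-test with a known standard deviation of 1, a first look at N = 10, and another look after every added participant up to N = 200. All names and settings are my own.

```python
import random
from math import sqrt

def peek_until_significant(n_start=10, n_max=200, crit=1.96, rng=random):
    """Add one observation at a time under a true null; stop at the first |z| > crit."""
    total = sum(rng.gauss(0, 1) for _ in range(n_start))
    n = n_start
    while True:
        if abs(total / sqrt(n)) > crit:  # z-test with known SD = 1
            return True
        if n == n_max:
            return False
        total += rng.gauss(0, 1)
        n += 1

rng = random.Random(1)  # fixed seed so that the run is reproducible
runs = 2000
hits = sum(peek_until_significant(rng=rng) for _ in range(runs))
print(f"significant at some point up to N = {200}: {hits / runs:.0%}")
```

Under these assumptions, the share of significant runs is well above the nominal 5% but stays below one half, so most attempts use up 200 participants without producing a significant result.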
Simonsohn et al. (2014) suggested that it is possible to detect the use of dumb p-hacking methods by means of p-curve plots. Repeated sampling and the use of covariates produce markedly left-skewed (monotonic decreasing) p-curves. Schimmack (2021) noted that left-skewed p-curves are actually very rare. Figure 1 shows the p-curve for the most cited articles of 71 social psychologists (k = 2,570). The p-curve is clearly right-skewed.
I then examined p-curves for individual social psychologists (k ~ 30). The worst p-curve was flat, but not left-skewed.
The most plausible explanation for this finding is that no social psychologist tested only false hypotheses. As studies with true effect sizes produce right skew, p-hacking cannot be detected by left-skewed p-curves.
I therefore examined the use of dumb p-hacking strategies in other ways. The use of covariates is easy to detect by coding whether studies used covariates or not. If researchers use multiple covariates, the chances that a result becomes significant with a covariate are higher than the chances of getting a significant result without a covariate. Thus, we should see more results with covariates and the frequency of studies with covariates provides some information about the prevalence of covariate hacking. I also distinguished between strictly experimental studies and correlational studies because covariates are more likely to be used in correlational studies for valid reasons. Figure 3 shows that the use of covariates in experimental studies is fairly rare (8.6%). If researchers tried only one covariate, this would limit the number of studies that were p-hacked with covariates to 17.2%, but the true frequency is likely to be much lower because p-hacking with a single covariate barely increases the chances of a significant result.
To examine the use of peeking, I first plotted the histogram of sample sizes. I limited sample sizes to studies with N < 200 to make the distribution of small sample sizes more visible.
There is no evidence that researchers start with very small sample sizes (n = 5) and publish as soon as they get significance (simulation by Simonsohn et al., 2014). This would have produced a high frequency of studies with N = 10. The peak around N = 40 suggests that many researchers use n = 20 as a rule of thumb for the allocation of participants to cells in two-group designs. Another bump around N = 80 is explained by the same rule for 2 x 2 designs that are popular among social psychologists. N = 100 seems to be another rule of thumb. Except for these peaks, the distribution does show a decreasing trend suggesting that peeking was used. However, there is also no evidence that researchers simply stop after n = 15 when results are not significant (Simonsohn et al., 2014, simulation).
If the decreasing trend is due to peeking, sample sizes would be uncorrelated with the strength of the evidence. Otherwise, studies with larger samples have less sampling error and stronger evidence against the null-hypothesis. To test this prediction, I regressed p-values transformed into z-scores onto sampling error (1 / sqrt(N)). I included the use of covariates and the nature of the study (experimental vs. correlational) as additional predictors.
The strength of evidence increased with decreasing sampling error without, z = 3.73, and with covariates, z = 3.30. These results suggest that many studies tested a true effect because a true effect is necessary to increase the strength of evidence against the null-hypothesis. To conclude, peeking may have been used, but not at excessive levels that would produce many low z-scores with large samples.
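The prediction follows from the fact that the expected z-score is the effect size divided by its sampling error. A minimal sketch for a two-group design with equal cell sizes, where the standard error of a standardized mean difference is approximately 2 / sqrt(N) and thus proportional to 1 / sqrt(N), shows how the expected z-score grows with N when the effect is real. The effect size of d = .4 is a made-up value.

```python
from math import sqrt

def expected_z(d, n_total):
    """Approximate expected z for a standardized mean difference d in a
    two-group design with n_total / 2 participants per group."""
    sampling_error = 2 / sqrt(n_total)  # approximate standard error of d
    return d / sampling_error

d = 0.4  # hypothetical true effect size
for n_total in (40, 80, 100, 200):
    print(f"N = {n_total:>3}: sampling error = {2 / sqrt(n_total):.2f}, "
          f"expected z = {expected_z(d, n_total):.2f}")
```

With d = 0, the expected z-score is zero at every sample size, which is why a relationship between sampling error and strength of evidence points to true effects.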
The last analysis was used to examine whether social psychologists have used questionable research practices. The difference between p-hacking and questionable research practices is that p-hacking excludes publication bias (not reporting entire studies). The focus on questionable research practices has the advantage that it is no longer necessary to distinguish between selective reporting of analyses or entire studies. Most researchers are likely to use both p-hacking and publication bias and both practices inflate effect sizes and lower replicability. Thus, it is not important to distinguish between p-hacking and publication bias.
The results show clear evidence that social psychologists used questionable research practices to produce an abundance of significant results. Even not counting marginally significant results, the success rate is 89%, but the actual power to produce these significant results is estimated to be just 26%. This shows that a right-skewed does not tell us how much questionable research practices contributed to significant results. A low discovery rate of 26% translates into a maximum false discovery rate of 15%. This would suggest that one reason for the lack of left-skewed p-curves is that p-hacking of true null-hypothesis is fairly rare. A bigger problem is that p-hacking of real effects in small samples produces vastly inflated effect size estimates. However, the 95% confidence interval around this estimate reaches all the way to 47%. Thus, it cannot be ruled out that a substantial number of results was obtained with true null-hypotheses by using p-hacking methods that do not produce a marked left-skew and publication bias.
In the early 2010s, two articles suggested that (a) p-hacking is common, (b) false positives are prevalent, and (c) left-skewed p-curves reveal p-hacking to produce false positive results (Simmons et al., 2011; Simonsohn, 2014a). However, empirical applications of p-curve have produced few left-skewed p-curves. This raises questions about the absence of left-skewed p-curves. One explanation is that some p-hacking strategies do not produce notable left skew and that these strategies may be used more often because they require fewer resources. Another explanation could be that file-drawering is much more common than p-hacking. Finally, it could be that most of the time p-hacking is used to inflate true effect sizes rather than to chase false positive results. P-curve plots do not allow researchers to distinguish these alternative hypotheses. Thus, p-curve should be replaced by more powerful tools that detect publication bias or p-hacking and estimate the amount of evidence against the null-hypothesis. Fortunately, there is an app for this (zcurve package).
Simonsohn, Nelson, and Simmons (2014) coined the term p-hacking for a set of questionable research practices that increase the chances of obtaining a statistically significant result. In the worst case scenario, p-hacking can produce significant results without a real effect. In this case, the statistically significant result is entirely explained by p-hacking.
Simonsohn et al. (2014) make a clear distinction between p-hacking and publication bias. Publication bias is unlikely to produce a large number of false positive results because it requires 20 attempts to produce a single significant result in either direction or 40 attempts to get a significant result with a predicted direction. In contrast, “p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011)” (p. 535).
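The 20 and 40 attempts are averages that follow from treating each study of a true null-hypothesis as an independent attempt with a fixed chance of success. A minimal sketch, assuming alpha = .05 (two-sided) and hypothetical function names:

```python
import random


def expected_attempts(p_success):
    """Mean of a geometric distribution: average attempts until the first success."""
    return 1 / p_success


def simulate_attempts(p_success, n_sims=20000, seed=1):
    """Average number of studies run until the first significant one, by simulation."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sims):
        attempts = 1
        # Keep running new studies until one comes out significant
        while rng.random() >= p_success:
            attempts += 1
        total += attempts
    return total / n_sims


if __name__ == "__main__":
    print(expected_attempts(0.05))   # significant in either direction
    print(expected_attempts(0.025))  # significant in the predicted direction
```

Any single researcher may need far fewer or far more attempts; only the long-run average equals 1 divided by the success probability.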
There have been surprisingly few investigations of the best way to p-hack studies. Some p-hacking strategies may work in simulation studies that do not impose limits on resources, but they may not be practical in real applications of p-hacking. I postulate that the main goal of p-hacking is to get significant results with minimal resources rather than with a minimum number of studies and that p-hacking is more efficient with a file drawer of studies that are abandoned.
Simmons et al. (2011) and Simonsohn et al. (2014) suggest one especially dumb p-hacking strategy, namely simply collecting more data until a significant result emerges.
“For example, consider a researcher who p-hacks by analyzing data after every five per-condition participants and ceases upon obtaining significance.” (Simonsohn et al., 2014).
This strategy is known to produce more p-values close to .04 than .01.
The main problem with this strategy is that sample sizes can get very large before the significant result emerges. I limited the maximum sample size before a researcher would give up to N = 200. A limit of 200 makes sense because N = 200 would allow a researcher to run 20 studies with the starting sample size of N = 10 to get a significant result. The p-curve plot shows a similar distribution as the simulation in the p-curve article.
The success rate was 25%. This means that 75% of studies with N = 200 produced a non-significant result that had to be put in the file-drawer. Figure 2 shows the distribution of sample sizes for the significant results.
The key finding is that the chances of a significant result drop drastically after the first attempt. The reason is that the most favorable results in the first trial produce a significant result in the first trial. As a result, the non-significant ones are less favorable. It would be better to start a new study because the chances to get a significant result are higher than adding participants after an unsuccessful attempt. In short, just adding participants to get significance is a dumb p-hacking method.
Simonsohn et al. (2014) do not disclose the stopping rule, but they do show that they got only 5.6% significant results compared to the 25% with N = 200. This means they stopped much earlier. Simulations suggest that they stopped when N = 30 (n = 15 per cell) did not produce a significant result (1 million simulations, success rate = 5.547%). The success rates for N = 10, 20, and 30 were 2.5%, 1.8%, and 1.3%, respectively. These probabilities can be compared to a probability of 2.5% for each test with N = 10. It is clear that trying three studies is a more efficient strategy than to add participants until N reaches 30. Moreover, neither strategy avoids producing a file drawer. To avoid a file-drawer, researchers would need to combine several questionable research practices (Simmons et al., 2011).
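The logic of this comparison can be sketched in a few lines. This is not the code behind the numbers above. It is a simplified version that assumes a true null-hypothesis, a known standard deviation of 1 (so a z-test replaces the t-test), peeking after every 5 participants per cell, and success defined as p < .05 (two-sided) in the predicted direction. All function names are hypothetical.

```python
import random
from statistics import NormalDist


def optional_stopping(max_n_per_cell=15, step=5, alpha=0.05, rng=None):
    """Return the per-cell n at which significance was reached, or None if never."""
    rng = rng or random.Random()
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    sum_a = sum_b = 0.0
    n = 0
    while n < max_n_per_cell:
        # Add another batch of participants to each cell (no true effect)
        for _ in range(step):
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
        n += step
        # z-test for the mean difference with known SD = 1
        z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
        if z > z_crit:
            return n
    return None


def success_rate(n_sims=20000, seed=1, **kwargs):
    """Share of simulated studies that end with a significant result."""
    rng = random.Random(seed)
    hits = sum(optional_stopping(rng=rng, **kwargs) is not None for _ in range(n_sims))
    return hits / n_sims


def fresh_studies_rate(k, alpha=0.05):
    """Chance that at least one of k independent studies is significant as predicted."""
    return 1 - (1 - alpha / 2) ** k


if __name__ == "__main__":
    print(success_rate(max_n_per_cell=15))  # peek at n = 5, 10, 15 per cell
    print(fresh_studies_rate(3))            # three separate studies with n = 5 per cell
```

Three fresh studies use at most the same 30 participants as one study extended to N = 30, but they succeed with a probability of about 7.3%, which is higher than the success rate of repeated peeking.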
Simmons et al. (2011) proposed that researchers can add covariates to increase the number of statistical tests and to increase the chances of producing a significant result. Another option is to include several dependent variables. To simplify the simulation, I am assuming that dependent variables and covariates are independent of each other. Sample size has no influence on these results. To make the simulation consistent with typical results in actual studies, I used n = 20 per cell. Adding covariates or additional dependent variables requires the same amount of resources. For example, participants make additional ratings for one more item and this item is either used as a covariate or as a dependent variable. Following Simmons et al. (2011), I first simulated a scenario with 10 covariates.
The p-curve plot is similar to the repeated peeking plot and is called left-skewed. The success rate, however, is disappointing. Only 4.48% of results were statistically significant. This suggests that collecting data to be used as covariates is another dumb p-hacking strategy.
Adding dependent variables is much more efficient. In the simple scenario, with independent DVs, the probability of obtaining a significant result equals 1-(1-.025)^11 = 24.31%. A simulation with 100,000 trials produced a percentage of 24.55%. More important, the p-curve is flat.
Correlation among the dependent variables produces a slightly left-skewed distribution, but not as much as the other p-hacking methods. With a population correlation of r = .3, the percentages are 17% for p < .01 and 22% for p between .04 and .05.
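A rough sketch of the multiple-DV strategy is shown below. It is not the simulation reported above and rests on several assumptions of my own: the null-hypothesis is true, the test statistics are simulated directly as equicorrelated z-scores rather than computed from raw data, and the researcher reports the first DV that is significant in the predicted direction. Function names are hypothetical.

```python
import random
from statistics import NormalDist

ND = NormalDist()


def first_significant_p(k=11, r=0.0, alpha=0.05, rng=None):
    """Two-sided p-value of the first significant DV, or None if none is significant."""
    rng = rng or random.Random()
    z_crit = ND.inv_cdf(1 - alpha / 2)
    common = rng.gauss(0, 1)  # shared component that makes the DVs correlate r
    for _ in range(k):
        z = r ** 0.5 * common + (1 - r) ** 0.5 * rng.gauss(0, 1)
        if z > z_crit:
            return 2 * (1 - ND.cdf(z))
    return None


def simulate(n_sims=20000, seed=1, **kwargs):
    """Return the success rate and the list of significant p-values."""
    rng = random.Random(seed)
    ps = [first_significant_p(rng=rng, **kwargs) for _ in range(n_sims)]
    sig = [p for p in ps if p is not None]
    return len(sig) / n_sims, sig


def share(p_values, low, high):
    """Share of p-values that fall in the interval [low, high)."""
    return sum(low <= p < high for p in p_values) / len(p_values)


if __name__ == "__main__":
    for r in (0.0, 0.3):
        rate, sig = simulate(k=11, r=r)
        print(r, round(rate, 3), round(share(sig, 0, 0.01), 3), round(share(sig, 0.04, 0.05), 3))
```

With r = 0 the success rate should be close to 1-(1-.025)^11 and the significant p-values should be spread evenly, so that about 20% fall below .01. With r = .3 the success rate is lower because the tests are no longer independent attempts.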
These results provide three insights into p-hacking that have been overlooked. First, some p-hacking methods are more effective than others. Second, the amount of left-skewness varies across p-hacking methods. Third, efficient p-hacking produces a fairly large file-drawer of studies with non-significant results because it is inefficient to add participants to data that failed to produce a significant result.
False P-curve Citations
The p-curve authors made it fairly clear what p-curve does and what it does not do. The main point of a p-curve analysis is to examine whether a set of significant results was obtained at least partially with some true effects. That is, at least in a subset of the studies the null-hypothesis was false. The authors call this evidential value. A right-skewed p-curve suggests that a set of significant results has evidential value. This is the only valid inference that can be drawn from p-curve plots.
“We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole [italics added] explanation of those findings” (p. 535).
The emphasis on selective reporting as the sole explanation is important. A p-curve that shows evidential value can still be biased by p-hacking and publication bias, which can lead to inflated effect size estimates.
To make sure that I interpret the article correctly, I asked one of the authors on twitter and the reply confirmed that p-curve is not a bias test, but strictly a test that some real effects contributed to a right-skewed p-curve. The answer also explains why the p-curve authors did not care about testing for bias. They assume that bias is almost always present, which makes it unnecessary to test for it.
Although the authors stated the purpose of p-curve plots clearly, many meta-analysts have misunderstood the meaning of a p-curve analysis and have drawn false conclusions about right-skewed p-curves. For example, Rivers (2017) writes that a right-skewed p-curve suggests “that the WIT effect is a) likely to exist, and b) unlikely biased by extensive p-hacking.” The first inference is correct. The second one is incorrect because p-curve is not a bias detection method. A right-skewed p-curve could be a mixture of real effects and bias due to selective reporting.
Rivers also makes a misleading claim that a flat p-curve shows the lack of evidential value, whereas “a significantly left-skewed distribution indicates that the effect under consideration may be biased by p-hacking.” These statements are wrong because a flat p-curve can also be produced by p-hacking, especially when a real effect is also present.
Rivers is by no means the only one who misinterpreted p-curve results. Using the 10 most highly cited articles that applied p-curve analysis, we can see the same mistake in several articles. A tutorial for biologists claims “p-curve can, however, be used to identify p-hacking, by only considering significant findings” (Head, 2015, p. 3). Another tutorial for biologists repeats this false interpretation of p-curves. “One proposed method for identifying P-hacking is ‘P-curve’ analysis” (Parker et al., 2016, p. 714). A similar false claim is made by Polanin et al. (2016). “The p-curve is another method that attempts to uncover selective reporting, or “p-hacking,” in primary reports (Simonsohn, Nelson, Leif, & Simmons, 2014)” (p. 211). The authors of a meta-analysis of personality traits claim that they conduct p-curve analyses “to check whether this field suffers from publication bias” (Muris et al., 2017, p. 186). Another meta-analysis on coping also claims “p-curve analysis (Simonsohn, Nelson, & Simmons, 2014) allows the detection of selective reporting by researchers who “file-drawer” certain parts of their studies to reach statistical significance” (Cheng et al., 2014, p. 1594).
Shariff et al.’s (2016) article on religious priming effects provides a better explanation of p-curve, but their final conclusion is still misleading. “These results suggest that the body of studies reflects a true effect of religious priming, and not an artifact of publication bias and p-hacking.” (p. 38). The first part is correct, but the second part is misleading. The correct claim would be “not solely the result of publication bias and p-hacking”, but it is possible that publication bias and p-hacking inflate effect size estimates in this literature. The skew of p-curves simply does not tell us about this. The same mistake is made by Weingarten et al. (2016). “When we included all studies (published or unpublished) with clear hypotheses for behavioral measures (as outlined in our p-curve disclosure table), we found no evidence of p-hacking (no left-skew), but dual evidence of a right-skew and flatter than 33% power.” (p. 482). While a left-skewed p-curve does reveal p-hacking, the absence of left-skew does not ensure that p-hacking was absent. The same mistake is made by Steffens et al. (2017), who interpret a right-skewed p-curve as evidence “that the set of studies contains evidential value and that there is no evidence of p-hacking or ambitious p-hacking” (p. 303).
Although some articles correctly limit the interpretation of the p-curve to the claim that the data contain evidential value (Combs et al., 2015; Rand, 2016; Siks et al., 2018), the majority of applied p-curve articles falsely assume that p-curve can reveal the presence or absence of p-hacking or publication bias. This is incorrect. A left-skewed p-curve does provide evidence of p-hacking, but the absence of left-skew does not imply that p-hacking is absent.
How prevalent are left-skewed p-curves?
After 2011, psychologists were worried that many published results might be false positive results that were obtained with p-hacking (Simmons et al., 2011). As the combination of p-hacking in the absence of a real effect does produce left-skewed p-curves, one might expect that a large percentage of p-curve analyses revealed left-skewed distributions. However, empirical examples of left-skewed p-curves are extremely rare. Take power-posing as an example. It is widely assumed these days that original evidence for power-posing was obtained with p-hacking and that the real effect size of power-posing is negligible. Thus, power-posing would be expected to show a left-skewed p-curve.
Simmons and Simonsohn (2017) conducted a p-curve analysis of the power-posing literature. They did not observe a left-skewed p-curve. Instead, the p-curve was flat, which justifies the conclusion that the studies contain no evidential value (i.e., we cannot reject the null-hypothesis that all studies tested a true null-hypothesis). The interpretation of this finding is misleading.
“In this Commentary, we rely on p-curve analysis to answer the following question: Does the literature reviewed by Carney et al. (2015) suggest the existence of an effect once one accounts for selective reporting? We conclude that it does not. The distribution of p values from those 33 studies is indistinguishable from what would be expected if (a) the average effect size were zero and (b) selective reporting (of studies or analyses) were solely responsible for the significant effects that were published”
The interpretation only focuses on selective reporting (or testing of independent DVs) as a possible explanation for lack of evidential value. However, usually the authors emphasize p-hacking as the most likely explanation for significant results without evidential value. Ignoring p-hacking is deceptive because a flat p-curve can occur as a combination of p-hacking and a real effect, as the authors showed themselves (Simonsohn et al., 2014).
Another problem is that significance testing is also one-sided. A right-skewed p-curve can be used to reject the null-hypothesis that all studies are false positives, but the absence of significant right skew cannot be used to infer the lack of evidential value. Thus, p-curve cannot be used to establish that there is no evidential value in a set of studies.
There are two explanations for the surprising lack of left-skewed p-curves in actual studies. First, p-hacking may be much less prevalent than is commonly assumed and the bigger problem is publication bias which does not produce a left-skewed distribution. Alternatively, false positive results are much rarer than has been assumed in the wake of the replication crisis. The main reason for replication failures could be that published studies report inflated effect sizes and that replication studies with unbiased effect size estimates are underpowered and produce false negative results.
How useful are Right-skewed p-curves?
In theory, left-skew is diagnostic of p-hacking, but in practice left-skew is rarely observed. This leaves right-skew as the only diagnostic information of p-curve plots. Right skew can be used to reject the null-hypothesis that all of the significant results tested a true null-hypothesis. The problem with this information is shared by all significance tests. It does not provide evidence about the effect size. In this case, it does not provide evidence about the percentage of significant results that are false positives (the false positive risk), nor does it quantify the strength of evidence.
This problem has been addressed by other methods that quantify how strong the evidence against the null-hypothesis is. Confusingly, the p-curve authors used the term p-curve for a method that estimates the strength of evidence in terms of the unconditional power of the set of studies (Simonsohn et al., 2014b). The problem with these power estimates is that they are biased when studies are heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). Simulation studies show that z-curve is a superior method to quantify the strength of evidence against the null-hypothesis. In addition, z-curve 2.0 provides additional information about the false positive risk; that is, the maximum number of significant results that may be false positives.
In conclusion, p-curve plots no longer produce meaningful information. Left-skew can be detected in z-curve plots as well as in p-curve plots and is extremely rare. Right skew is diagnostic of evidential value, but does not quantify the strength of evidence. Finally, p-curve plots are not diagnostic when data contain evidential value and bias due to p-hacking or publication bias.
Ten years ago, the foundations of psychological science were shaken by the realization that the standard scientific method of psychological science is faulty. Since then it has become apparent that many classic findings are not replicable and many widely used measures are invalid, especially in social psychology (Schimmack, 2020).
However, it is not uncommon to read articles in 2021 that ignore the low credibility of published results. There are too many of these pseudo-scientific articles, but some articles matter more than others, at least to me. I do care about suicide and like many people my age, I know people who have committed suicide. I was therefore concerned when I saw a review article that examines suicide from a dual-process perspective.
My main concern about this article is that dual-process models in social cognition are based on implicit priming studies with low replicability and implicit measures with low validity (Schimmack, 2021a, 2021b). It is therefore unclear how dual-process models can help us to understand and prevent suicides.
After reading the article, it is clear that the authors make many false statements and present questionable studies that have never been replicated as if they produce a solid body of empirical evidence.
Introduction of the Article
The introduction cites outdated studies that have either not been replicated or produced replication failures.
“Our position is that even these integrative models omit a fundamental and well-established dynamic of the human mind: that complex human behavior is the result of an interplay between relatively automatic and relatively controlled modes of thought (e.g., Sherman et al., 2014). From basic processes of impression formation (e.g., Fiske et al., 1999) to romantic relationships (e.g., McNulty & Olson, 2015) and intergroup relations (e.g., Devine, 1989), dual-process frameworks that incorporate automatic and controlled cognition have provided a more complete understanding of a broad array of social phenomena.”
This is simply not true. For example, there is no evidence that we implicitly love our partners when we consciously hate them or vice versa, and there is no evidence that prejudice occurs outside of awareness.
Automatic cognitions can be characterized as unintentional (i.e., inescapably activated), uncontrollable (i.e., difficult to stop), efficient in operation (i.e., requiring few cognitive resources), and/or unconscious (Bargh, 1994) and are typically captured with implicit measures.
This statement ignores many articles that have criticized the assumption that implicit measures measure implicit constructs. Even the proponents of the most widely used implicit measure have walked back this assumption (Greenwald & Banaji, 2017).
The authors then make the claim that implicit measures of suicide have incremental predictive validity of suicidal behavior.
“For example, automatic associations between the self and death predict suicidal ideation and action beyond traditional explicit (i.e., verbal) responses (Glenn et al., 2017).“
This claim has been made repeatedly by proponents of implicit measures, so I meta-analyzed the small set of studies that tested this prediction (Schimmack, 2021). Some of these studies produced non-significant results and the literature showed evidence that questionable research practices were used to produce significant results. Overall, the evidence is inconclusive. It is therefore incorrect to point to a single study as if there is clear evidence that implicit measures of suicidality are valid.
Further statements are also based on outdated research and a single reference.
“Research on threat has consistently shown that people preferentially process dangers to physical harm by prioritizing attention, response, and recall regarding threats (e.g., Öhman &
There have been many proposals about stimuli that attract attention, and threatening stimuli are by no means the only attention-grabbing stimuli. Sexual stimuli also attract attention and in general arousal rather than valence or threat is a better predictor of attention (Schimmack, 2005).
It is also not clear how threatening stimuli are relevant for suicide, which is related to depression rather than anxiety disorders.
The introduction of implicit measures totally disregards the controversy about the validity of implicit measures or the fact that different implicit measures of the same construct show low convergent validity.
“Much has been written about implicit measures (for reviews, see De Houwer et al., 2009; Fazio & Olson, 2003; March et al., 2020; Nosek et al., 2011; Olson & Fazio, 2009), but for the present purposes, it is important to note the consensus that implicit measures index the automatic properties of attitudes.“
More relevant are claims that implicit measures have been successfully used to understand a variety of clinical topics.
The application of a dual-process framework has consequently improved explanation and prediction in a number of areas involving mental health, including addiction (Wiers & Stacy, 2006), anxiety (Teachman et al., 2012), and sexual assault (Widman & Olson, 2013). Much of this work incorporates advances in implicit measurement in clinical domains (Roefs et al., 2011).
The authors then make the common mistake of conflating self-deception and other-deception. The notion of implicit motives that can influence behavior without awareness implies self-deception. An alternative rationale for the use of implicit measures is that they are better measures of consciously accessible thoughts and feelings that individuals are hiding from others. Here we do not need to assume a dual-process model. We simply have to assume that self-report measures are easy to fake, whereas implicit measures can reveal the truth because they are difficult to fake. Thus, even incremental predictive validity does not automatically support a dual-process model of suicide. However, this question is only relevant if implicit measures of suicidality show incremental predictive validity, which has not been demonstrated.
Consistent with the idea that such automatic evaluative associations can predict suicidality later, automatic spouse-negative associations predicted increases in suicidal ideation over time across all three studies, even after accounting for their controlled counterparts (McNulty et al., 2019).
In the conclusion section, the authors repeat their false claim that implicit measures of suicidality reflect valid variance in implicit suicidality and that they are superior to explicit measures.
“As evidence of their impact on suicidality has accumulated, so has the need for incorporating automatic processes into integrative models that address questions surrounding how and under what circumstances automatic processes impact suicidality, as well as how automatic and controlled processes interact in determining suicide-relevant outcomes.”
“Implicit measures are better-suited to assess constructs that are more affective (Kendrick & Olson, 2012), spontaneous (e.g., Phillips & Olson, 2014), and uncontrollable (e.g., Klauer & Teige-Mocigemba, 2007).“
As recent work has shown (e.g., Creemers et al., 2012; Franck, De Raedt, Dereu, et al., 2007; Franklin et al., 2016; Glashouwer et al., 2010; Glenn et al., 2017; Hussey et al., 2016; McNulty et al., 2019; Nock et al., 2010; Tucker, Wingate, et al., 2018), the psychology of suicidality requires formal consideration of automatic processes, their proper measurement, and how they relate to one another and corresponding controlled processes.
We have articulated a number of hypotheses, several already with empirical support, regarding interactions between automatic and controlled processes in predicting suicidal ideation and lethal acts, as well as their combination into an integrated model.
Then they finally mention the measurement problems of implicit measures.
Research utilizing the model should be mindful of specific challenges. First, although the model answers calls to diversify measurement in suicidality research by incorporating implicit measures, such measures are not without their own problems. Reaction time measures often have problematically low reliabilities, and some include confounds (e.g., Olson et al., 2009). Further, implicit and explicit measures can differ in a number of ways, and structural differences between them can artificially deflate their correspondence (Payne et al., 2008). Researchers should be aware of the strengths and weaknesses of implicit measures.
Evaluation of the Evidence
Here I provide a brief summary of the actual results of studies cited in the review article so that readers can make up their own mind about the relevance and credibility of the evidence.
Creemers, D. H., Scholte, R. H., Engels, R. C., Prinstein, M. J., & Wiers, R. W. (2012). Implicit and explicit self-esteem as concurrent predictors of suicidal ideation, depressive symptoms, and loneliness. Journal of Behavior Therapy and Experimental Psychiatry, 43(1), 638–646
Participants: 95 undergraduate students
Implicit Construct / Measure: Implicit self-esteem / Name Letter Task
Dependent Variables: depression, loneliness, suicidal ideation
Results: No significant direct relationship. Interaction between explicit and implicit self-esteem for suicidal ideation only, b = .28.
Franck, E., De Raedt, R., & De Houwer, J. (2007). Implicit but not explicit self-esteem predicts future depressive symptomatology. Behaviour Research and Therapy, 45(10), 2448–2455.
Participants: 28 clinically depressed patients; 67 not-depressed participants.
Implicit Construct / Measure: Implicit self-esteem / Name Letter Task
Dependent Variable: change in depression controlling for T1
Result: However, after controlling for initial symptoms of depression, implicit, t(48) = 2.21, p = .03, b = .25, but not explicit self-esteem, t(48) = 1.26, p = .22, b = .17, proved to be a significant predictor for depressive symptomatology at 6 months follow-up.
Franck, E., De Raedt, R., Dereu, M., & Van den Abbeele, D. (2007). Implicit and explicit self- esteem in currently depressed individuals with and without suicidal ideation. Journal of Behavior Therapy and Experimental Psychiatry, 38(1), 75–85.
Participants: Depressed patients with suicidal ideation (N = 15), depressed patients without suicidal ideation (N = 14) and controls (N = 15)
Implicit Construct / Measure: Implicit self-esteem / IAT
Dependent variable: Group status
Contrast analysis revealed that the currently depressed individuals with suicidal ideation showed a significantly higher implicit self-esteem as compared to the currently depressed individuals without suicidal ideation, t(43) = 3.0, p < 0.01. Furthermore, the non-depressed controls showed a significantly higher implicit self-esteem as compared to the currently depressed individuals without suicidal ideation, t(43) = 3.7, p < 0.001.
[this finding implies that suicidal depressed patients have HIGHER implicit self-esteem than depressed patients who are not suicidal].
Glashouwer, K. A., de Jong, P. J., Penninx, B. W., Kerkhof, A. J., van Dyck, R., & Ormel, J. (2010). Do automatic self-associations relate to suicidal ideation? Journal of Psychopathology and Behavioral Assessment, 32(3), 428–437.
Participants: General population (N = 2,837)
Implicit Constructs / Measure: Implicit depression, Implicit Anxiety / IAT
Dependent variable: Suicidal Ideation, Suicide Attempt
Results: simple correlations
Depression IAT – Suicidal Ideation, r = .22
Depression IAT – Suicide Attempt, r = .12
Anxiety IAT – Suicidal Ideation, r = .18
Anxiety IAT – Suicide Attempt, r = .11
Controlling for Explicit Measures of Depression / Anxiety
Depression IAT – Suicidal Ideation, b = .024, p = .179
Depression IAT – Suicide Attempt, b = .037, p = .061
Anxiety IAT – Suicidal Ideation, b = .024, p = .178
Anxiety IAT – Suicide Attempt, b = .039, p = .046
Glenn, J. J., Werntz, A. J., Slama, S. J., Steinman, S. A., Teachman, B. A., & Nock, M. K. (2017). Suicide and self-injury-related implicit cognition: A large-scale examination and replication. Journal of Abnormal Psychology.
Participants: Self-selected online sample with high rates of self-harm (> 50%). Ns = 3,115 and 3,114
Implicit Constructs / Measure: Self-Harm, Death, Suicide / IAT
Dependent variables: Group differences (non-suicidal self-injury / control; suicide attempt / control)
Non-suicidal self-injury versus control
Self-injury IAT, d = .81/.97; Death IAT d = .52/.61, Suicide IAT d = .58/.72
Suicide Attempt versus control
Self-injury IAT, d = .52/.54; Death IAT d = .37/.32, Suicide IAT d = .54/.67
[these results show that self-ratings and IAT scores reflect a common construct; they do not show discriminant validity; no evidence that they measure distinct constructs and they do not show incremental predictive validity]
Hussey, I., Barnes-Holmes, D., & Booth, R. (2016). Individuals with current suicidal ideation demonstrate implicit “fearlessness of death.” Journal of Behavior Therapy and Experimental Psychiatry, 51, 1–9.
Participants: 23 patients with suicidal ideation and 25 controls (university students)
Implicit Constructs / Measure: Death attitudes (general / personal) / IRAP
Dependent variable: Group difference
Results: No main effects were found for either group (p = .08). Critically, however, a three-way interaction effect was found between group, IRAP type, and trial-type, F(3, 37) = 3.88, p = .01. Specifically, the suicidal ideation group produced a moderate “my death-not-negative” bias (M = .29, SD = .41), whereas the normative group produced a weak “my death-negative” bias (M = -.12, SD = .38, p < .01). This differential performance was of a very large effect size (Hedges’ g = 1.02).
[This study suggests that evaluations of personal death show stronger relationships than generic death]
McNulty, J. K., Olson, M. A., & Joiner, T. E. (2019). Implicit interpersonal evaluations as a risk factor for suicidality: Automatic spousal attitudes predict changes in the probability of suicidal thoughts. Journal of Personality and Social Psychology, 117(5), 978–997
Participants: Integrative analysis of 399 couples from 3 longitudinal studies of marriages.
Implicit Construct / Measure: Partner attitudes / evaluative priming task
Dependent variable: Change in suicidal thoughts (yes/no) over time
Result: (preferred scoring method)
without covariates, b = -.69, se = .27, p = .010.
with covariate, b = -.64, se = .29, p = .027
Nock, M. K., Park, J. M., Finn, C. T., Deliberto, T. L., Dour, H. J., & Banaji, M. R. (2010). Measuring the suicidal mind: Implicit cognition predicts suicidal behavior. Psychological Science, 21(4), 511–517.
Participants: 157 patients with mental health problems
Implicit Construct / Measure: death attitudes / IAT
Dependent variable: Prospective Prediction of Suicide
Result: controlling for prior attempts / no explicit covariates
b = 1.85, SE = 0.94, z = 2.03, p = .042
Tucker, R. P., Wingate, L. R., Burkley, M., & Wells, T. T. (2018). Implicit Association with Suicide as Measured by the Suicide Affect Misattribution Procedure (S-AMP) predicts suicide ideation. Suicide and Life-Threatening Behavior, 48(6), 720–731.
Participants. 138 students oversampled for suicidal ideation
Implicit Construct / Measure: suicide attitudes / AMP
Dependent variable: Suicidal Ideation
Result: simple correlation, r = .24
regression controlling for depression, b = .09, se = .04, p = .028
Taken together, the references show a mix of constructs, measures, and outcomes, and the p-values cluster just below .05. Not one of these p-values is below .005. Moreover, many studies relied on small convenience samples. The most informative study is the study by Glashouwer et al. that examined incremental predictive validity of a depression IAT in a large, population-wide sample. The result was not significant and the effect size was less than r = .1. Thus, the references do not provide compelling evidence for dual-attitude models of depression.
Social psychologists have abused the scientific method for decades. Over the past decade, criticism of their practices has become louder, but many social psychologists ignore this criticism and continue to abuse significance testing and to misrepresent these results as if they provide empirical evidence that can inform understanding of human behavior. This article is just another example of the unwillingness of social psychologists to “clean up their act” (Kahneman, 2012). Readers of this article should be warned that the claims made in this article are not scientific. Fortunately, there is credible research on depression and suicide outside of social psychology.
The article “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, henceforth the FPP article, is now a classic in the meta-psychological literature that emerged in the wake of the replication crisis in psychology.
The main claim of the FPP article is that it is easy to produce false positive results when researchers p-hack their data. P-hacking is a term coined by the FPP authors for the use of questionable research practices (QRPs, John et al., 2012). There are many QRPs, but they all have in common that researchers conduct more statistical analyses than they report and selectively report only the results of those analyses that produce a statistically significant result.
The main distinction between p-hacking and QRPs is that p-hacking ignores some QRPs. John et al. (2012) also include fraud as a QRP, but I prefer to treat the fabrication of data as a distinct form of malpractice that clearly requires intent to deceive others. More importantly, p-hacking does not consider publication bias. Publication bias implies that researchers fail to publish entire studies with non-significant results. The FPP authors are not concerned about publication bias because their main claim is that p-hacking makes it so easy to obtain significant results that it is unnecessary to discard entire datasets. After showing that a combination of four QRPs can produce false positive results with a 60% success rate (for alpha = .05), the authors hasten to warn readers that this is a conservative estimate because actual researchers might use even more QRPs. “As high as these estimates are, they may actually be conservative” (p. 1361).
The article shook the foundations of mainstream psychology because it suggested that most published results in psychology could be false positive results; that is, a statistically significant result was reported even though the reported effect does not exist. The FPP article provided a timely explanation for Bem’s (2011) controversial finding that humans have extrasensory abilities, which unintentionally contributed to the credibility crisis in social psychology (Simmons, Nelson, & Simonsohn, 2018; Schimmack, 2020).
In 2018, the FPP authors published their own reflection on their impactful article for a special issue of the most highly cited articles in Psychological Science (Simmons et al., 2018). In this article, the authors acknowledge that they used questionable research practices in their work and knew that using these practices was wrong. However, like many other psychologists they thought these practices were harmless because nothing substantial changes when a p-value is .04 rather than .06. Their own article convinced them that their practices were more like robbing a bank than jaywalking.
The FPP authors were also asked to critically reflect on their article and to comment on things they might have done differently with the benefit of hindsight. The main regret was the recommendation to require a minimum sample size of n = 20 per cell. After learning about statistical power, they realized that sample sizes should be justified based on power analysis. Otherwise, false positive psychology would simply become false negative psychology where articles mostly report non-significant results when effect sizes exist. To increase the credibility of psychological science it is necessary to curb the use of questionable research practices and to increase statistical power (Schimmack, 2020).
The 2018 reflections reinforce the main claim of the 2011 article that (a) p-hacking nil-effects to significance is easy and (b) that many published significant results might be false positive results. A blog post by the FPP authors in 2016 makes clear that the authors consider these to be the core findings of their article (http://datacolada.org/55).
In my critical examination of the FPP article, I challenge both of these claims. First, it is important to clarify what the authors mean by “a bit of p-hacking.” To use an analogy, what does a bit of making out mean? Answers range from kissing to intercourse. So, what do you actually have to do to have a 60% probability of getting pregnant? The FPP article falsely suggests that a bit of kissing may get you there. However, Table 1 shows that you actually have to f*&% the data to get a significant result.
The table also shows that it gets harder to p-hack results as the alpha criterion decreases. While the combination of four QRPs can produce 81.5% marginally significant results (p < .10), only 21.5% of attempts were successful with p < .01 as the significance criterion. One sensible recommendation based on this finding would be to disregard significant results with p-values greater than .01.
Another important finding is that each QRP alone increased the probability of a false positive result only slightly from the nominal 5% to an actual level of no more than 12.6%. Based on these results, I would not claim that it is easy to get false positive results. I consider the combination of four QRPs in every study that is being conducted to be research fraud that is tolerated by professional organizations. That is, even if a raid of a laboratory found that a researcher actually uses this approach to analyze data, the researcher would not be considered to engage in fraudulent practices by psychological organizations like the Association for Psychological Science or granting agencies.
The distinction between a little and massive p-hacking is not just a matter of semantics. It influences beliefs about the prevalence of false positive results in psychology journals. If it takes only a little bit of p-hacking to get false positive results, it is reasonable to assume that many published results are false positives. Hence, the title “False Positive Psychology.”
Aside from the simulation study, the FPP article also presents two p-hacked studies. The presentation of these two studies reinforces the narrative that p-hacking virtually guarantees significant results. At least, the authors do not mention that they also ran some additional studies with non-significant results that they did not report. However, their own simulation results suggest that a file-drawer of non-significant studies should exist despite massive p-hacking. After all, the probability of getting two significant results in a row with a probability of 60% is only 36%. This means that the authors were lucky to get the desired result, used even more QRPs to ensure a nearly 100% success rate, or failed to disclose a file-drawer of non-significant results. To examine these hypotheses, I simulated their actual p-hacking design of Study 2.
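The 36% figure follows from multiplying the per-study success rates, assuming the two studies are independent attempts with the same 60% chance of a significant result. A minimal Python sketch of that arithmetic (the function name is illustrative and not taken from the FPP article):

```python
def prob_all_significant(success_rate, n_studies):
    """Probability that n independent studies are all significant."""
    return success_rate ** n_studies

# Two studies, each with the 60% success rate from the FPP simulation.
print(round(prob_all_significant(0.60, 2), 2))  # 0.36
```

The same function can be applied to other per-study success rates to see how quickly a run of uniformly significant results becomes unlikely.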
A Z-curve analysis of massive p-hacking
The authors do not disclose how they p-hacked Study 1. For Study 2 they provide the following information. The study had three groups (“When I’m Sixty-Four”, “Kalimba”, “Hot Potato”) and the “Hot Potato” condition was dropped like a hot potato. It is not clear how the sample size decreased from 34 to 20 as a result, but maybe participants were not equally assigned to the three conditions and there were 14 participants in the “Hot Potato” condition. The next QRP was that there were two dependent variables: actual age and felt age. Then there were a number of covariates, including bizarre and silly ones like the square root of 100 to enhance the humor of the article. In total, there were 10 covariates. Finally, the authors used optional stopping. They checked after every 10 participants. It is not specified whether they meant 10 participants per condition or in total, but to increase the chances of a significant result it is better to use smaller increments. So, I assume it was just 3 participants per condition.
To examine the long-run success rate of this p-hacking design, I simulated the following combination of QRPs: (a) three conditions, (b) two dependent variables, (c) 10 covariates, and (d) increasing sample size from n = 10 until N > 200 per condition in steps of 3. I ran 10,000 simulations of this p-hacking design. The first finding was that it provided a success rate of 77% (7718 / 10,000), which is even higher than the 60% success rate featured in the FPP article. Thus, more massive p-hacking partially explains why both studies were significant.
The simulation also produced a new insight into p-hacking by examining the success rates for every increment in sample sizes (Figure 1). It is readily apparent that the chances of a significant result decrease with every additional increment. The reason is that favorable sampling error in the beginning quickly produces significant results. However, unfavorable sampling error in the beginning takes a long time to be reversed.
It follows that a smart p-hacker would either avoid optional stopping or continue only if the first test shows a promising trend. This is what Bem (2011) did to get his significant results (Schimmack, 2016). It is not clear why the FPP authors did not simulate optional stopping. However, the failure to include this QRP explains why they maintain that p-hacking does not leave a file drawer of non-significant results. In theory, adding participants would eventually produce a significant result, resulting in a success rate of 100%. However, in practice resources would often be depleted before a significant result emerges. Thus, even with massive p-hacking a file drawer of non-significant results is inevitable.
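The logic of optional stopping can be illustrated with a deliberately simplified simulation. The Python sketch below is not the simulation reported above: it assumes only two groups, one dependent variable, no covariates, and a two-sided z-test with known variance, and it stops at n = 199 per group. It only shows that testing after every increment of 3 participants inflates the false positive rate above 5%, and that most of the "successes" occur at early looks. All names and defaults are illustrative.

```python
import random
from statistics import NormalDist

def simulate_optional_stopping(n_sims=2000, n_start=10, n_max=200, step=3,
                               alpha=0.05, seed=1):
    """Count, per look, how often a nil effect first reaches significance."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    looks = list(range(n_start, n_max + 1, step))  # n per group at each test
    hits_by_look = [0] * len(looks)
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        for i, n_target in enumerate(looks):
            # Add participants to both groups; the true effect is zero.
            while n < n_target:
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
                n += 1
            # z-test for the mean difference with known SD = 1 per group.
            z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
            if abs(z) > z_crit:
                hits_by_look[i] += 1
                break  # stop collecting data at the first significant test
    return looks, hits_by_look

looks, hits = simulate_optional_stopping()
print("overall false positive rate:", sum(hits) / 2000)
print("first significant at n = 10:", hits[0], "| at n = 199:", hits[-1])
```

Under these assumptions, the counts in `hits` shrink as the sample grows, which mirrors the pattern described for Figure 1; the overall rate is lower than 77% because the other QRPs are left out.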
It is notable that both studies that are reported in the FPP article have very small sample sizes (Ns = 30, 34). This shows that adding participants does not explain the 100% success rate. This also means that the actual probability of a success on the first trial was only about 40% based on the QRP design for Study 2. This means the chance of getting two significant results in a row was only 16%. This low success rate suggests that the significant p-values in the FPP article are not replicable. I bet that a replication project would produce more non-significant than significant results.
In sum, the FPP article suggested that it is easy to get significant results with a little bit of p-hacking. Careful reading of the article and a new simulation study show that this claim is misleading. It requires massive p-hacking that is difficult to distinguish from fraud to consistently produce significant results in the absence of a real effect and even massive p-hacking is likely to produce a file-drawer of non-significant results unless researchers are willing to continue data collection until sample sizes are very large.
Detecting massive p-hacking
In the wake of the replication crisis, numerous statistical methods have been developed that enable detection of bias introduced by QRPs (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012; Simonsohn et al., 2014; Schimmack, 2016). The advantage of z-curve is that it also provides valuable additional information such as estimates of the success rate of replication attempts and information about the false discovery risk (Bartos & Schimmack, 2021).
Figure 2 shows the z-curve plot for the 10,000 p-values from the previous simulation of the FPP p-hacking design. To create a z-curve plot, the p-values are converted into z-scores, using the formula qnorm(1-p/2). Accordingly, a p-value of .05 corresponds to a z-score of 1.96 and all z-scores greater than 1.96 (the solid red line) are significant.
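The conversion is a single call to the inverse of the standard normal distribution function. The formula above is written in R; a minimal equivalent in Python, using only the standard library (the helper name `p_to_z` is illustrative), looks like this:

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-sided p-value to an absolute z-score, like qnorm(1 - p/2)."""
    return NormalDist().inv_cdf(1 - p / 2)

for p in (0.05, 0.01, 0.005):
    print(p, round(p_to_z(p), 2))
# 0.05 1.96
# 0.01 2.58
# 0.005 2.81
```

Smaller p-values map onto larger z-scores, so lowering alpha from .05 to .01 or .005 moves the significance line in a z-curve plot to the right.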
Visual inspection shows that z-curve is unable to fit the actual distribution of z-scores because the distribution of actual z-scores is even steeper than z-curve predicts. However, the distinction between p-hacking and other QRPs is irrelevant for the evaluation of evidential value. Z-curve correctly predicts that the actual discovery rate is 5%, which is expected when only false hypotheses are tested with alpha = .05. It also correctly predicts that the probability of a successful replication without QRPs is only 5%. Finally, z-curve also correctly estimates that the false discovery risk is 100%. That is, the distribution of z-scores suggests that all of the significant results are false positive results.
The results address outdated criticisms of bias-detection methods that they merely show the presence of publication bias. First, the methods do not care about the distinction between p-hacking and publication bias. All QRPs inflate the success rate and bias-detection methods reveal inflated success rates (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). Second, while older methods merely showed the presence of bias, newer methods like z-curve also quantify the amount of bias. Thus, even if bias is always present, they provide valuable information about the amount of bias. In the present example, massive p-hacking produced massive bias in the success rate. Finally, z-curve.2.0 also quantifies the false positive risk after correcting for bias and correctly shows that massive p-hacking of nil-hypotheses produces only false positive results.
The simulation also makes it possible to replicate the influence of alpha on the false positive risk. A simple selection model predicts that only 20% of the results that are significant with alpha = .05 are still significant with alpha = .01. This follows from the uniform distribution of p-values under the nil-hypothesis, which implies that a fraction of .01/.05 = 20% of the p-values below .05 are also below .01. However, massive p-hacking clusters even more p-values in the range between .01 and .05. In this simulation only 6% (500 / 7718) of the significant p-values were below .01. Thus, it is possible to reduce the false positive risk from 100% to close to 5% by disregarding all p-values between .05 and .01. Thus, massive p-hacking provides another reason for calls to adjust the alpha level for statistically significant results to .005 (Benjamin et al., 2017) to reduce the risk of false positive results even for p-hacked literatures.
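The proportions in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the counts reported above (10,000 simulated studies, 7718 with p < .05, 500 with p < .01); the variable names are illustrative.

```python
n_sims = 10_000
n_sig_05 = 7718   # simulated studies with p < .05
n_sig_01 = 500    # simulated studies with p < .01

# Uniform p-values under the nil-hypothesis: share of p < .05 that is also < .01.
expected_share = 0.01 / 0.05
# Share actually observed among the p-hacked significant results.
observed_share = n_sig_01 / n_sig_05
# Share of all simulated studies that still pass the stricter criterion.
rate_at_01 = n_sig_01 / n_sims

print(round(expected_share, 2), round(observed_share, 3), round(rate_at_01, 2))
```

Read this way, 5% of all p-hacked attempts still reach p < .01, compared with 77% that reach p < .05.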
In sum, since the FPP article was published it has become possible to detect p-hacking in actual data using statistical methods like z-curve. These methods work even when massive p-hacking was used because massive p-hacking makes the detection of bias easier, especially when massive p-hacking is used to produce false positive results. The development of z-curve makes it possible to compare the FPP scenarios with 60% or more false positive results to actual p-values in published journals.
How Prevalent is Massive P-Hacking?
Since the FPP article was published, other articles have examined the prevalence of questionable research practices. Most of these studies rely on surveys (John et al., 2012; Fiedler & Schwarz, 2016). The problem with survey results is that they do not provide sufficient evidence about the amount of p-hacking and do not provide information about the severity of p-hacking. Furthermore, it is possible that many researchers use QRPs when they are testing a real effect. This practice would inflate effect sizes, but does not increase the risk of false positive results. These problems are addressed by z-curve analyses of published results. Figure 3 shows the results for Motyl et al.’s (2017) representative sample of test statistics in social psychology journals.
The z-curve plot of actual p-values differs in several ways from the z-curve plot of massive p-hacking. The estimated discovery rate is 23% and the estimated replication rate is 45%. The point estimate of the false discovery risk is only 18%, suggesting that no more than a quarter of published results are false positives. However, due to the small set of p-values, the 95%CI around the point estimate of the false positive risk reaches all the way to 70%. Thus, it remains unclear how high the false positive risk in social psychology is.
Results from a bigger coding project help to narrow down the uncertainty about the actual EDR in social psychology. This project coded at least 20 focal hypothesis tests from the most highly cited articles by eminent social psychologists, where eminence was based on the H-Index (Radosic & Diener, 2021). This project produced 2,208 p-values. The z-curve analysis of these p-values closely replicated the point estimates for Motyl et al.’s (2017) data (EDRs 26% vs. 23%, ERR 49% vs. 45%, FDR 15% vs. 18%). The confidence intervals are narrower and the upper limit of the false positive risk decreased to 47%.
However, combining the two samples did not notably reduce the confidence interval around the false discovery risk, 15%, 95%CI = 10% to 47%. Thus, up to 50% of published results in social psychology could be false positive results. This is an unacceptably high risk of false positive results, but the risk may seem small in comparison to a scenario where a little bit of p-hacking can produce over 60% false positive results.
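The false discovery risk estimates can be related to the EDR with a short calculation. The sketch below assumes that the risk is obtained from Sorić's upper bound on the false discovery rate, FDR = (1/EDR - 1) × alpha/(1 - alpha), which is how z-curve 2.0 is usually described; the text above does not state the formula, so treat it as an assumption. The function name is illustrative.

```python
def max_false_discovery_risk(edr, alpha=0.05):
    """Upper bound on the false discovery rate implied by a discovery rate."""
    risk = (1 / edr - 1) * alpha / (1 - alpha)
    return min(1.0, risk)  # a risk cannot exceed 100%

for edr in (0.23, 0.26, 0.05):
    print(edr, round(max_false_discovery_risk(edr), 2))
# 0.23 0.18
# 0.26 0.15
# 0.05 1.0
```

Under this assumption, the reported pairs (EDR 23% with FDR 18%, EDR 26% with FDR 15%) are reproduced, and an EDR of 5% implies a risk of 100%, as in the simulation of massive p-hacking.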
In sum, empirical analyses of actual data suggest that false positive results are not as prevalent as the FPP article suggested. The main reason for the relatively low false positive risk is not that QRPs are rare. Rather, QRPs also help to inflate success rates when a small true effect exists. If effect sizes were not important, it might seem justifiable to reduce false negative rates with the help of QRPs. However, effect sizes matter and QRPs inflate effect size estimates by over 100% (Open Science Collaboration, 2015). Thus, p-hacking is a problem even if it does not generate a high rate of false positive results.
Individual Differences in Massive P-Hacking?
Psychologists study different topics and use different methods. Some research areas and some research methods have many true effects and high power to detect them. For example, cognitive psychology appears to have few false positive results and relatively high replication rates (Open Science Collaboration, 2015; Schimmack, 2020). In contrast, between-subject experiments in social psychology are the most likely candidate for massive p-hacking and high rates of false positive results (Schimmack, 2020). As researchers focus on specific topics and paradigms, they are more or less likely to require massive p-hacking to produce significant results. To examine this variation across researchers, I combined the data from the 10 eminent social psychologists with the lowest EDR.
The results are disturbing. The EDR of 6% is just one percentage point above the 5% that is expected when only nil-hypotheses are tested and the 95%CI includes 5%. The upper limit reaches only 14%. The corresponding false discovery risk is 76% and the 95%CI includes 100%. Thus, the FPP article may describe the actual practices of some psychologists, but not the practices of psychology in general. It may not be surprising that one of the authors of the FPP article has a low EDR of 16%, even if the analysis is not limited to focal tests (Schimmack, 2021). It is well-known that the consensus bias leads individuals to project themselves onto others. The present results suggest that massive p-hacking of true null-results is the exception rather than the norm in psychology.
The last figure shows the z-curve plot for the 10 social psychologists with the highest EDR. The z-curve looks very different and shows that not all researchers were massive p-hackers. There is still publication bias because the ODR, 91%, matches the upper limit of the 95%CI of the EDR, 66% to 91%, but the amount of inflation is much smaller (91% – 78% = 13%) than for the other extreme group (90% – 6% = 84%). As this comparison is based on extreme groups, a replication study would show a smaller difference due to regression to the mean, but the difference is likely to remain substantial.
In sum, z-curve analysis of actual data can be used to evaluate how prevalent massive p-hacking actually is. The results suggest that only a minority of psychologists consistently used massive p-hacking to produce significant results that have a high risk of being false positive results.
The FPP article made an important positive contribution to psychological science. The recommendations motivated some journals and editors to implement policies that discourage the use of QRPs and motivate researchers to preregister their data analysis plans. At the same time, the FPP article also had some negative consequences. The main problem with the article is that it ignored statistical power, false negatives, and effect sizes. That is, the article suggested that the main problem in psychological science is a high risk of false positive results. Instead, the key problem in psychological science remains qualitative thinking in terms of true and false hypotheses that is rooted in the nil-hypothesis ritual that is still being taught to undergraduate and graduate students. Psychological science will only advance by replacing nil-hypothesis testing with quantitative statistics that take effect sizes into account. However, the FPP article succeeded where previous meta-psychologists failed by suggesting that most published results are false positives. It therefore stimulated much needed reforms that decades of methodological criticism failed to deliver.
False False Positive Psychology
The main claim of the FPP article was that many published results in psychology journals could be false positives. Unfortunately, the focus on false positive results has created a lot of confusion and misguided attempts to quantify false positive results in psychology. The problem with false positives is that they are mathematical entities rather than empirically observable phenomena. Based on the logic of nil-hypothesis testing, a false positive result requires an effect size that is exactly zero. Even a significant result with a population effect size of d = 0.0000000001 would count as a true positive result, although it is only possible to produce significant results for this effect size with massive p-hacking.
Thus, it is not very meaningful to worry about false positive results. For example, ego-depletion researchers have seen effect sizes reduced from d = .6 to d = .1 in studies without p-hacking. Proponents of ego-depletion point to the fact that d = .1 is different from 0 and supports their theory. However, the honest effect size invalidates hundreds of studies that claim to have demonstrated the effect for different dependent variables and under different conditions. None of these p-hacked studies are credible and each one would require a new replication study with over 1,000 participants to see whether the small effect size is really observed for a specific outcome under specific conditions. Whether the effect size is really zero or small is entirely irrelevant.
A study with N = 40 participants and d = .1 has only 6% power to produce a significant result. Thus, there is hardly a difference between a study with a true null-effect (5% power) and a study with a small effect size. Nothing is learned from a significant result in either case and as Cohen once said “God hates studies with 5% power as much as studies with 6% power.”
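The 6% figure can be approximated with a short power calculation. The sketch below assumes a two-group between-subject design with 20 participants per group and uses a normal approximation to the two-sided t-test; both the design and the function name are assumptions made for illustration.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test for effect size d."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # expected z-score of the study
    # Probability of a significant result in either tail.
    return (1 - nd.cdf(z_crit - ncp)) + nd.cdf(-z_crit - ncp)

print(round(approx_power(0.1, 20), 2))  # 0.06
print(round(approx_power(0.0, 20), 2))  # 0.05
```

With d = 0 the function returns alpha, which is why a true null-effect is described as having 5% "power" in this context.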
To demonstrate that an effect is real, it is important to show that the majority of studies are successful without the use of questionable research practices (Brunner & Schimmack, 2020; Cohen, 1994). Thus, the empirical foundation of a real science requires (a) making true predictions, (b) designing studies that can provide evidence for the prediction, and (c) honest reporting of results. The FPP article illustrated the importance of honest reporting of results. It did not address the other problems that have plagued psychological science since its inception. As Cohen pointed out, results have to be replicable, and to be replicable, studies need high power. Honest reporting alone is insufficient.
P-Hacking versus Publication Bias
In my opinion, the main problem of the FPP article is the implicit claim that publication bias is less important than p-hacking. This attitude has led the authors to claim that bias detection tools are irrelevant because “we already know that researchers do not report 100% of their nonsignificant studies and analyses” (Nelson, Simmons, & Simonsohn, 2018). This argument is invalid for several reasons. Most important, bias detection tools do not distinguish between p-hacking and publication bias. As demonstrated here, they detect p-hacking as well as publication bias. For the integrity of a science it is also not important whether 1 researcher tests 20 dependent variables in one study or 20 researchers test 1 dependent variable in 20 independent studies. As long as results are only reported when they are significant, p-hacking and publication bias introduce bias in the literature and undermine the credibility of a science.
It is also unwarranted to make the strong claim that publication bias is unavoidable. Pre-registration and registered reports are designed to ensure that there is no bias in the reporting of results. Bias-detection methods can be used to verify this assumption. For example, they show no bias in the replication studies of the Open Science Collaboration project.
Third, new developments in bias detection methods do not only test for the presence of bias. As shown here, a comparison of the ODR and EDR provides a quantitative estimate of the amount of bias and small amounts of bias have no practical implications for the credibility of findings.
In conclusion, it makes no sense to draw a sharp line between hiding of dependent variables and hiding of entire studies. All questionable research practices produce bias and massive use of QRPs leads to more bias. Bias-detection methods play an important role in verifying that published results can be trusted.
Reading a Questionable Literature
One overlooked implication of the FPP article is the finding that it is much harder to produce significant results with p-hacking if the significance criterion is lowered from .05 to .01. This provides an easy solution to the problem of how psychologists should interpret findings in psychology journals with dramatically inflated success rates (Sterling, 1959; Sterling et al., 1995). The success rate and the false positive risk can be reduced by adjusting alpha to .01 or even .005 (Benjamin et al., 2017). This way it is possible to build on some published results that produced credible evidence.
The FPP article was an important stepping stone in the evolution of psychology towards becoming a respectable science. It alerted many psychologists who were exploiting questionable research practices to various degrees that these practices were undermining the credibility of psychology as a science. However, one constant in science is that science is always evolving. The other one is that scientists who made a significant contribution think that they reached the end of history. In this article, I showed that meta-psychology has evolved over the past decade since the FPP article appeared. Ten years later, it is clear that massive p-hacking of nil-results is the exception rather than the norm in psychological science. As a result, the false positive risk is lower than feared ten years ago. However, this does not imply that psychological science is credible. The reason is that success rates and effect sizes in psychology journals are severely inflated by the use of questionable research practices. This makes it difficult to trust published results. One overlooked implication of the FPP article is that p-values below .01 are much more trustworthy than p-values below .05 because massive p-hacking mostly produces p-values between .05 and .01. Thus, one recommendation for readers of psychology journals is to ignore results with p-values greater than .01. Finally, bias detection tools like z-curve can be used to assess the credibility of published literatures and to correct for the bias introduced by questionable research practices.
Cite as: Schimmack, U. (1997). Frequency Judgments of Emotions: How Accurate are They and How are They Made? Unpublished dissertation. Free University Berlin.
“If you haven’t read it, it is new to you.”
I received my Ph.D. from the Free University Berlin in 1997. My dissertation contained two daily diary studies and two laboratory experiments. The main question that intrigued me at that time was how individuals make judgments about the frequency of their emotions (e.g., how often did you feel happy in the past week or month). I was also interested in the accuracy of these judgments because they are routinely used in personality questionnaires and in the measurement of subjective well-being. My dissertation never got published. I was fortunate that I was able to publish other work. Otherwise, my life might have turned out differently.
Although 23 years are a long time in real sciences, psychology is not a real science. The past 20 years have been wasted on questionable research to support false theories like Kahneman and Tversky’s (1973) influential availability heuristic. It now turns out that key results cannot be replicated (Schimmack, 2019). As you can see from my dissertation, cognitive psychologists already showed that ease of retrieval is not a plausible model of frequency estimation in the 1980s. Social psychologists simply ignored this work. So, now that the ease-of-retrieval model has failed, it may be a good time to introduce social and personality psychologists to cognitive models of frequency estimation that were developed in the 1980s. These models may provide a framework for applied research on frequency judgments of emotions and behaviors that are routinely used to measure personality traits (Fleisher, Woehr, Edwards, & Cullen, 2011).
Frequency Judgments of Emotions: How Accurate are They and How are They Made?
Freie Universität Berlin
Fachbereich Erziehungswissenschaften, Psychologie und Sportwissenschaften
Betreuung durch Prof. Dr. Hubert Feger
To my grandparents Dr. Gerhard Walpert and Martha Walpert
Thanks are due to the many individuals and institutions that made this dissertation possible: Hubert Feger for the academic freedom and the resources needed to carry out the empirical investigations; “Studienstiftung des deutschen Volkes,” “Deutscher Akademischer Austauschdienst,” and my parents Frank and Liesel Schimmack for financial support; Ed Diener, Shigeh Oishi, and Mark Suh for collaboration on Study 1; Stephan Dutke, Hubert Feger, Bärbel Knäuper, Rainer Reisenzein, Thomas Rodenhausen, Matthias Siemer, and Germmi Temme for valuable comments on drafts of the dissertation; and, last but not least, Phanikiran Radhakrishnan and Joachim Stöber for social support.
The frequency of emotional experiences is an important topic for several basic and applied domains in psychology. Most studies investigating the frequency of emotions rely on retrospective self-reports of emotional experiences (e.g., “How frequently did you feel happy in the last month?”). However, relatively little is known about (a) the accuracy of such retrospective frequency estimates of emotions, (b) the representation of information about the frequency of emotions in memory, and (c) the cognitive processes underlying frequency judgments of emotions. The present dissertation addresses these questions in four studies. In two field studies, averaged daily frequency estimates of emotions are compared with frequency judgments of emotions extending over several weeks, to test the accuracy of the latter judgments. In two experiments the cognitive processes underlying frequency judgments of emotions were investigated under controlled conditions. In these studies, participants first rated their likely emotional reactions to several hypothetical scenarios and then judged the frequency of emotions in these scenarios. The results indicate that absolute frequency estimates of emotions underestimate the actual frequencies of emotional experiences, but they accurately discriminate between the frequencies of emotions across different emotions as well as across participants. In addition, the studies provide further support for the familiarity model of frequency judgments of emotions (Hintzman, 1988; Schimmack & Reisenzein, in press). This model assumes that memories of emotional experiences are stored in separate memory traces in an episodic memory. When a frequency judgment of an emotion is required, memories of experiences of this emotion are activated in parallel. This generates a feedback signal, which is experienced as a feeling of familiarity. The more memory traces are activated, the stronger is the feeling of familiarity, and the higher is the frequency judgment.
This model is contrasted with (a) models that assume the direct encoding of frequency information in memory and (b) models that assume the retrieval of memories into consciousness.
2.1 What is the Frequency of Emotions?
More than a hundred years after James’s (1884) question “What is an emotion?”, researchers of emotions still search for an answer to this question. In the last decade, however, several researchers used prototype theory (Rosch, 1975) to address this question empirically. Prototype theory at least allows one to identify a set of typical emotions (cf. Fehr & Russell, 1984). Typical emotions are, for example, love, hate, joy, anger, fear, and sadness.
Schimmack and Reisenzein (1994) extended this work, demonstrating that emotions can be differentiated from moods by means of typicality ratings: Some concepts denote emotions but not moods (e.g., love, pride), others denote moods but not emotions (e.g., relaxed, nervous), and there are also concepts that denote moods and emotions (e.g., happy, sad). Furthermore, Schimmack and Siemer (1995) found that intentionality or object directedness is a characteristic that differentiates typical emotion concepts (e.g., one is proud about something) from typical mood concepts (e.g., one feels relaxed, but not relaxed about something). This finding supports cognitive emotion theories, such as Stumpf’s theory (cf. Reisenzein & Schönpflug, 1992), in which intentionality is a necessary feature of emotions that differentiates them from other affective states.
Furthermore, the object directedness of emotions explains their episodic nature; that is, emotions are elicited by the cognitive appraisal of an event for one’s own well-being, are maintained as long as the feeling is directed at an object, and are terminated once the thoughts are no longer directed at the object (Lazarus, 1991). The aroused affect might remain; this is, however, often considered a mood and no longer an emotion (cf. Ekman & Davidson, 1994, chapter 2). Due to the episodic nature of emotional experiences like pride, disappointment, gratitude, or shame, there are times in which individuals do not experience emotions at all, compared to times when they experience emotions. As a consequence, it is possible to ask people at any moment in time whether they feel a particular emotion or not. If this question is asked repeatedly, one can determine the frequency with which the individual experienced the emotion. In other words, the question about the frequency of emotions is about the number of times that a particular emotion has been elicited. The frequency of emotional experiences has to be differentiated from two other important characteristics of emotional experiences, namely their duration and their intensity. The differences between these three features of emotional experiences are illustrated in Figure 1, which shows an individual’s experience of a particular emotion over time. Most of the time the individual does not experience this emotion at all (i.e., the intensity equals zero).
Over the time interval displayed in Figure 1, the emotion is elicited two times (i.e., a change from zero to an intensity greater than zero), so that the frequency of the emotion equals two. The duration of the first emotional experience is longer than the one of the second episode (i.e., a longer distance along the time axis with intensity greater than zero). Regarding the intensity of an emotional experience, different definitions have been proposed: (a) the maximum intensity at the peak of the experience or (b) the integral of the area under the curve from the beginning to the end of the emotional episode (Frijda, Ortony, Sonnemans, & Clore, 1992).
Duration and intensity are important aspects of emotional experiences which can vary independently from the frequency of emotional experiences. For example, Schimmack and Diener (in press) demonstrated that individuals who experience emotions frequently do not necessarily experience emotions more intensely. Hence, frequency, intensity, and duration should be studied separately. The present dissertation focuses exclusively on the frequency of emotions.
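The distinction between frequency, duration, and intensity can be made concrete with a small sketch. It assumes, purely for illustration, that the intensity of one emotion is sampled at equal time steps; the function name `summarize_episodes`, the step size `dt`, and the example series are hypothetical and not taken from the studies reported here. The integral is approximated by a simple rectangle rule.

```python
def summarize_episodes(intensity, dt=1.0):
    """Split an intensity time series into episodes and describe each one.

    intensity: equally spaced intensity samples (0 = emotion absent).
    dt: assumed time between samples (hypothetical unit, e.g., minutes).
    """
    episodes = []
    current = []
    for value in intensity:
        if value > 0:
            current.append(value)      # emotion is present at this sample
        elif current:
            episodes.append(current)   # intensity fell back to zero
            current = []
    if current:                        # series ended during an episode
        episodes.append(current)

    return [
        {
            "duration": len(ep) * dt,
            "peak_intensity": max(ep),            # definition (a)
            "integral_intensity": sum(ep) * dt,   # definition (b), rectangle rule
        }
        for ep in episodes
    ]


# Made-up series: a long, moderate first episode and a short, intense second one.
series = [0, 0, 1, 3, 4, 2, 1, 0, 0, 0, 2, 5, 0, 0]
episodes = summarize_episodes(series)
frequency = len(episodes)  # number of times the emotion was elicited
```

In this made-up series the frequency is two, the first episode lasts longer, and the second episode has the higher peak, which illustrates that the three features can vary independently.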
2.2 Why Study Frequency Judgments of Emotions?
The frequency of emotions is important in many everyday life situations. For example, a person who often feels joy seeing a friend is likely to spend time with this friend in the future, whereas a person who often feels fear of flying is likely to use other means of transportation. Apparently, information about the frequency of emotions in past situations can serve as a guide for future behavior (cf. Emmons & Diener, 1986a).
The frequency of emotions is also used as information in the formation of impressions of oneself (e.g., “I am an emotional person”) or of others (e.g., “She is cold-blooded”). Because frequency of emotions is an important source of information in everyday life, it is not surprising that it is also relevant for psychological disciplines, such as personality, social, cross-cultural, clinical, and industrial and organizational psychology.
I briefly summarize some prevalent questions regarding frequency of emotions in these fields of inquiry. This short review shows that a valid assessment of the frequency of emotions is a necessary requirement for some of the research in these diverse fields, but that the validity of the most often used measure of frequency of emotions, namely retrospective frequency judgments of emotions, is not yet firmly established. Indeed, various researchers are skeptical about the validity of retrospective reports in general. For example, Lewinsohn and Rosenbaum (1987) state that “retrospective memory should probably never be construed to represent what really occurred” (p. 618). However, evidence for such extreme claims is scarce (cf. Brewin, Andrews, & Gotlib, 1993). The present set of studies examines this issue for retrospective judgments of the frequency of emotional experiences.
2.2.1 Relevance for Personality Psychology
Over the last years, research on stable individual differences in the experience of affect has increased considerably. In several studies, retrospective estimates of experienced pleasant affect were correlated with extraversion and retrospective estimates of experienced unpleasant affect were correlated with neuroticism (Costa & McCrae, 1980; Emmons & Diener, 1986b; Izard, Libero, Putnam, & Haynes, 1993; Larsen & Diener, 1992; Meyer & Shack, 1989; Pavot, Diener, & Fujita, 1990; Watson & Clark, 1992). Some researchers even found these personality dimensions and experiences of affect to be so highly correlated that they equated neuroticism with the disposition to experience unpleasant affect and extraversion with the disposition to experience pleasant affect (Meyer & Shack, 1989).
The finding that extraversion and neuroticism are also correlated with averaged daily reports of affect suggests that this relation is substantial and that the retrospective judgments do possess some validity (Emmons & Diener, 1986b). However, it is possible that the correlations between retrospective estimates of experienced affect and personality traits overestimate the true strength of the relation between these traits and the actual frequencies of experienced affect. This hypothesis is suggested by a meta-analysis (Schimmack, 1996b), which shows that correlations between the two traits and the amount of experienced pleasant and unpleasant affect were higher for retrospective estimates than for averaged daily ratings of affect. Therefore, at least part of the shared variance between self-reports of personality traits and amount of experienced affect could be due to a so-called personality-congruent memory bias (Martin, 1985). That is, people overestimate experiences of affect that are consistent with their personality. For example, neurotic individuals tend to overestimate the amount of unpleasant affect, whereas extraverted individuals tend to overestimate the amount of pleasant affect (Diener, Larsen, & Emmons, 1984). One aim of the present dissertation is to explore the presence of personality-congruent biases in retrospective frequency estimates of emotions.
Furthermore, previous studies of the relation between personality and trait affect often did not differentiate between emotions and moods, and often did not separate the frequency of emotions from their typical intensity or duration. However, it has been demonstrated that individual differences in the frequency and the typical intensity of emotional experiences are separable constructs (cf. Diener, Larsen, Levine, & Emmons, 1985; Schimmack & Diener, in press). Therefore, the present dissertation also explores the structure of individual differences in the frequency of emotional experiences.
2.2.2 Relevance for Clinical Psychology
Abnormally frequent or infrequent experiences of emotions are symptoms of several psychological disorders (cf. Andreasen & Black, 1991). For example, in the diagnostic system DSM-III-R, symptoms of the paranoid personality disorder are frequent experiences of distrust, fear, jealousy, and resentment. Symptoms of the schizoid personality disorder are very infrequent experiences of rage and joy. And a symptom of the narcissistic personality disorder is the frequent experience of envy.
For a practitioner, it is difficult to assess these symptoms because epidemiological data about the frequency of these emotions in the general population are lacking. The results of the present dissertation can serve as a first guideline about the prevalence of emotions like envy or joy, although the results are limited to a student population. One aim of the dissertation is to suggest strategies that allow an economical but accurate estimation of the prevalence of emotions in the general population.
A second problem for the diagnostician is that it is unknown whether a patient’s reported frequencies of an emotion are accurate. It could be that psychological disorders bias the self-report of past emotional experiences. For example, depressed patients might overestimate the frequency of their unpleasant emotional experiences (Fitzgerald, Slade, & Lawrence, 1988). However, the evidence concerning long-term memory deficits due to psychological disorders is mixed (see Brewin et al., 1993, for a review). An investigation of the cognitive processes underlying frequency judgments of emotions could help to determine when a patient’s self-report is likely to be accurate and when it might be biased. It might also help to develop measurement instruments that are least susceptible to distortions.
2.2.3 Relevance for Social and Cross-Cultural Psychology
People often experience emotions in social situations (cf. Scherer, Wallbott, & Summerfield, 1986). Because the structure of a society influences the type of social situations that its members encounter, it is likely that social factors influence the frequency of emotions. For example, Briggs (1970) claimed that the Inuit never experience anger (but see Briggs, 1987). In contrast, the high homicide rate in the USA suggests that anger-related emotions such as anger, rage, or hate are experienced quite frequently in the United States, especially in the South (Cohen, 1996).
Evidently, understanding the cultural factors that influence the frequency of emotions – and the actions motivated by them – is relevant for political decisions. One important cultural dimension that is likely to influence the frequency of emotional experiences is the individualism-collectivism dimension. Because individualistic cultures provide looser social networks than collectivistic cultures (cf. Hofstede, 1980; Triandis, 1994), members of individualistic cultures might experience loneliness more frequently, but shame less frequently. Markus and Kitayama (1994) report a study in which the frequency of joy was more highly correlated with the frequency of pride in the USA (an individualistic culture) than in Japan (a collectivistic culture), indicating that achievement situations are a stronger source of happiness in individualistic cultures.
Studying the frequency of emotions is also important because the frequency of pleasant versus unpleasant emotions is one component of Subjective Well-Being (cf. Diener, 1984; Diener, Sandvik, & Pavot, 1991). Recently, Diener, Diener, and Diener (1995; Diener & Diener, 1995) explored what differentiates “happy” from “unhappy” nations. They found happy nations to be more affluent, individualistic, and democratic. These findings have potential implications for political questions such as whether China can promote the happiness of its people by economic growth, but without changing its political system. The interpretation of cross-cultural studies is, however, based on a number of assumptions. One basic assumption is that people can accurately estimate the frequency of their emotional experiences.
2.2.4 Relevance for Industrial and Organizational Psychology
Traditionally, job satisfaction is measured as an evaluative judgment or an attitude (Pekrun & Frese, 1992). Temme and Tränkle (1996) pointed out that this approach neglects the emotional aspects of work. An individual might be satisfied with his or her job, but only rarely experience emotions such as joy or pride. Similarly, global evaluations of one’s life are correlated with, but distinct from, measures of the frequency of pleasant versus unpleasant emotions (cf. Pavot & Diener, 1993). Therefore, the traditional assessment of job satisfaction should be complemented with the assessment of the frequencies of pleasant and unpleasant emotions at work. It is thus an important question whether people can make accurate estimates of the frequencies of their emotional experiences in different contexts. Otherwise, the frequency judgments would not only reflect the number of joy experiences at work, but also joy experiences in other contexts. In fact, several frequency judgment models, which are reviewed later, predict that people lack the ability to discriminate frequencies in different contexts. If this were true, assessing frequencies of emotions at work would be difficult.
2.3 Previous Research on Frequency Judgments of Emotions
Accuracy of frequency judgments can be globally defined as the agreement between frequency judgments and actual frequencies. Conversely, lack of accuracy is indicated by deviations of frequency judgments from actual frequencies. Because these deviations can be computed in various ways, several types of accuracy can be distinguished (cf. Naveh-Benjamin & Jonides, 1986; Thomas & Diener, 1990).
Absolute and relative accuracy compare the absolute level of actual and estimated frequencies. Absolute accuracy is defined as the ability of individuals to estimate the absolute frequency of an emotion accurately, irrespective of the direction of the estimation error; that is, whether the actual frequencies are over- or underestimated. A common index of absolute accuracy is the standard deviation of the frequency judgments from the actual frequencies (Naveh-Benjamin & Jonides, 1986). In contrast, relative accuracy takes the direction of the error into account. A common measure of relative accuracy is the difference between the actual and estimated absolute frequency (estimated minus actual). In contrast to absolute accuracy, relative accuracy indicates whether the actual frequencies are over- or underestimated.
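The two indices can be illustrated with a minimal sketch. It assumes that the “standard deviation of the frequency judgments from the actual frequencies” is computed as a root-mean-square deviation, and that relative accuracy is averaged over items; both choices, the function names, and the numbers are illustrative assumptions rather than the exact computations of the cited studies.

```python
import math


def absolute_accuracy(estimated, actual):
    """Root-mean-square deviation of estimates from actual frequencies.

    The sign of each error is discarded, so lower values mean higher accuracy.
    """
    squared = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_accuracy(estimated, actual):
    """Mean signed error (estimated minus actual).

    Negative values indicate underestimation, positive values overestimation.
    """
    diffs = [e - a for e, a in zip(estimated, actual)]
    return sum(diffs) / len(diffs)


# Made-up frequencies for four emotions
actual = [10, 20, 30, 40]
estimated = [8, 17, 31, 36]

rmsd = absolute_accuracy(estimated, actual)  # size of the error, sign ignored
bias = relative_accuracy(estimated, actual)  # direction of the error
```

With these made-up values the mean signed error is negative, indicating underestimation on average, whereas the root-mean-square deviation only conveys how large the errors are.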
In the literature on frequency judgments of emotions, absolute and relative accuracy have been neglected. So far, the only study of relative accuracy compared the actual number of pleasant days (i.e., days on which a person experienced more pleasure than displeasure) with the estimated number of pleasant days (Thomas & Diener, 1990). The results indicated that participants underestimated the number of pleasant days.
Because the number of unpleasant days is by definition perfectly inversely related to the number of pleasant days, the study also demonstrated that participants overestimated the number of unpleasant days. Because most participants experienced more pleasant than unpleasant days, this finding is consistent with the finding that frequency estimates regress toward the mean due to information loss (Howell, 1973; Fiedler, 1991). In contrast to the number of pleasant versus unpleasant days, the present dissertation explores for the first time the absolute and relative accuracy of frequency estimates of single emotions such as joy, anger, or sadness.
Two other types of accuracy are not concerned with the absolute frequency of a single entity (e.g., an emotion), but rather test how accurately frequency estimates discriminate between actual frequencies of different entities, where the entities can be the stimuli or the participants. These types of accuracy are subsequently called discriminative accuracy. A common index of discriminative accuracy is the Pearson correlation coefficient between actual and estimated frequencies computed across different entities.
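As a rough illustration of this index, the following sketch computes Pearson correlations between actual and estimated frequencies, once with stimuli (emotions) as the entities and once with participants as the entities. The data matrix and all names are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Rows = participants, columns = emotions (hypothetical frequencies).
actual = [[12, 5, 3], [20, 9, 4], [7, 6, 2]]
estimated = [[6, 3, 2], [11, 5, 3], [4, 4, 1]]

# Entities = emotions: one correlation per participant (row-wise).
across_emotions = [pearson(a, e) for a, e in zip(actual, estimated)]

# Entities = participants: one correlation per emotion (column-wise).
across_participants = [
    pearson([row[j] for row in actual], [row[j] for row in estimated])
    for j in range(len(actual[0]))
]
```

In this made-up example every estimate is far below the actual frequency, yet all correlations are high. Discriminative accuracy can therefore be high even when absolute accuracy is poor, because the correlation ignores the absolute level of the estimates.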
Experimental studies investigate the ability of frequency estimates to discriminate the actual frequencies of the stimuli, which are experimentally manipulated. In the present context, the stimuli are different emotions such as joy, fear, or gratitude. Therefore, this type of accuracy is called discriminative accuracy across emotions, which has not been tested in previous studies of frequency judgments of emotions. Exploring the discriminative accuracy across emotions has, however, several advantages. First, it can be tested in field and experimental studies of frequency judgments of emotions, which is not true for the discriminative accuracy across participants introduced below.
Second, tests of different models of frequency judgments are often based on the ability of a potential mediator variable (e.g., the number of recalled exemplars) to discriminate between actual frequencies of different emotions (cf. Fitzgerald et al., 1988). This is, however, only meaningful if the frequency estimates possess discriminative accuracy across emotions. As a consequence, the accuracy of frequency judgments across emotions is investigated in the present dissertation.
In contrast to experimental studies, personality psychologists are mainly concerned with the ability of frequency judgments of emotions to discriminate actual frequencies of emotions experienced by different participants (Diener, Smith, & Fujita, 1995; Feldman Barrett, in press; Parkinson, Briner, Reynolds, & Totterdell, 1995; Thomas & Diener, 1990). This type of accuracy is subsequently called discriminative accuracy across participants. For example, Diener et al. (1995) studied individual differences in the experience of six types of emotions (see below). Each type was measured by four items. At the end of each of 52 consecutive days, the participants rated the time that they experienced each emotion on that day. Before and after the diary period, the participants also made time judgments for the previous month. The correlations between the averaged pre- and post-diary judgments and the averaged daily judgments were r = .69 for threat emotions (e.g., fear), r = .61 for bad-other emotions (e.g., anger), r = .52 for bad-self emotions (e.g., shame), r = .64 for separation emotions (e.g., sadness), r = .65 for good-other emotions (e.g., love), and r = .68 for pleasure emotions (e.g., joy). This finding shows quite high discriminative accuracy across participants.
In addition, the correlations between actual and estimated times of emotions were higher within the same type of emotions (pleasure–pleasure) than across different types of emotions (pleasure-good-other), indicating that the frequency judgments were emotion specific. This pattern of results rules out a simple response set explanation. Feldman Barrett (in press) reported correlations ranging from r = .59 to .76 between averages of repeated momentary mood ratings over a 90 day period and retrospective estimates of these averages after the diary period. Again, in this study frequency was not measured in a pure fashion, because the average of repeated intensity ratings comprises frequency and intensity information (Schimmack & Diener, in press). Furthermore, previous studies might overestimate the true discriminative accuracy across participants, because the frequency judgments were made (partly) after the diary study. Therefore, it is possible that participants based their post-diary judgments on memories of their daily judgments, rather than their daily experiences, a source of information that is not available under natural circumstances.
An influence of daily ratings on the subsequent post-diary judgments is especially likely because the post-diary judgments were made on the same response scale as the daily ratings. This hypothesis is further strengthened by Thomas and Diener’s (1990) finding that the correlation between pre-diary estimates of the number of happy days and the number of happy days experienced during the diary period was lower than the correlation obtained for the post-diary estimates. To test this hypothesis, the present dissertation followed Thomas and Diener’s (1990) approach to compare pre- and post-diary frequency judgments of emotions. Furthermore, different response scales were used for the assessment of the daily frequencies and the retrospective frequency judgments.
In sum, previous studies suggest that the discriminative accuracy across participants is fairly high (Diener, Smith et al., 1995; Feldman Barrett, in press; Parkinson et al., 1995; Thomas & Diener, 1990). However, in none of these studies was the frequency of emotions measured in a pure fashion: Diener et al. (1995) studied frequency and duration, Feldman Barrett (in press) and Parkinson et al. (1995) investigated frequency and intensity, and Thomas and Diener (1990) studied the number of pleasant versus unpleasant days. Thomas and Diener’s study suggests that people underestimate the frequency of pleasant emotions and overestimate the frequency of unpleasant emotions. However, this finding might be limited to their measure of pleasant versus unpleasant days. Different results might be obtained for absolute frequency estimates of single pleasant and unpleasant emotions such as anger, joy, or sadness. Furthermore, several types of accuracy – namely absolute and relative accuracy as well as discriminative accuracy across emotions – have been neglected in previous studies. One major aim of the present dissertation is to explore the different types of accuracy for pure frequency judgments of emotions.
2.3.2 Underlying Cognitive Processes
Only two studies explored the cognitive processes underlying frequency judgments of emotions (Fitzgerald et al., 1988; MacLeod, Andersen, & Davies, 1994; see 2.4 for research on frequency judgments in general). Both studies compared frequency judgments of emotions with the latencies to retrieve a single autobiographical memory in which the target emotion occurred. The results consistently showed an inverse relation between these two variables: Retrieval latencies decreased with increasing frequency of emotions. Furthermore, retrieval latencies were shorter for the more frequent pleasant emotions than for the less frequent unpleasant emotions (MacLeod et al., 1994). These findings have been interpreted as support for the assumption of the ease-of-retrieval model (see 2.4.1 for more detail) that people base frequency judgments on the ease (speed) with which they can retrieve exemplars from memory. However, this design has two shortcomings. First, the evidence is only correlational. Second, the relation of both measures to the actual frequencies of emotions is unknown. Only if both measures show the same correlation with the actual frequencies of emotions can the frequency judgments be based on ease-of-retrieval. If, however, the frequency judgments were more highly correlated with the actual frequencies than a measure of ease-of-retrieval, ease-of-retrieval could not explain the accuracy of the frequency judgments. Therefore, more rigorous tests are needed to uncover the cognitive processes underlying frequency judgments of emotions.
2.4 The Experimental Literature on Frequency Judgments in General
2.4.1 Theoretical Models
Experimental research on frequency judgments has often relied on stimuli (e.g., word lists with concepts of natural objects such as fruits, furniture, birds, etc.) which, at first sight, bear little resemblance to experiences of emotions in everyday life. Therefore, one might be skeptical whether this research helps to understand frequency judgments of emotions. This skepticism is not justified for two reasons.
First, frequency judgments of emotion also employ emotion concepts. Although emotion concepts differ from concepts of natural objects – for example, concepts of natural objects are hierarchically organized and mutually exclusive at the same level of the hierarchy, whereas emotion concepts are not (Reisenzein, 1995) – frequency judgments might not be affected by these differences.
Second, even if frequency judgments of emotions employ other cognitive processes than frequency judgments of natural objects, the theories and experimental methods developed in the general frequency judgment literature are at least heuristically fruitful for the investigation of frequency judgments of emotions.
Various frequency judgment models have been proposed in the psychological literature, which are not mutually exclusive (Brown, 1995; Hintzman, 1988; Howell, 1973; Tversky & Kahneman, 1973). Each model might be correct in specific contexts and for specific domains. For example, Manis, Shedler, Jonides, and Nelson (1993) argued that direct-encoding models might account for frequency judgments of repeated occurrences of the same stimulus, whereas retrieval-based models might account for frequency judgments of categories (see also Brown, 1995).
Other studies show that the expectation about the frequency of the event to be estimated is also important. People are likely to use a counting strategy for rare events (seeing the dentist in the last year), but estimation strategies for more frequent events (restaurant visits in the last year) (Blair & Burton, 1987).
Figure 2 provides a taxonomy of the different frequency judgment models proposed in the literature. Two major characteristics differentiate between the frequency judgment models. The first distinction is between direct encoding models that assume frequency information to be encoded directly at the time of encoding (Hasher & Zacks, 1979; Jonides & Jones, 1992; Underwood, 1969) and indirect encoding models, which assume frequency information to be stored indirectly in memory in the form of multiple memory traces (see Figure 2).
Direct encoding models are, for example, the counter model, which is based on the idea that concepts are linked to a frequency counter that registers every activation of the concept (Underwood, 1969), or the concept strength model, which assumes that concepts are strengthened by each activation so that frequency judgments can be based on a readout of a concept’s strength (cf. Howell, 1973).
The second major distinction is between retrieval-based and retrieval-free models. Retrieval-based models assume that frequency judgments are based on information that is obtained by the retrieval of relevant exemplars to the level of consciousness. A straightforward strategy would be to count all available instances in memory (Brown, 1995; Meudall, 1971). However, research suggests that people do not use the counting strategy for irregular and frequent events (Menon, 1994). Because emotional experiences are irregular, and quite frequent over longer time periods, it is unlikely that people rely on a counting strategy when they judge the frequency of emotions.
The other possibility is that people use simple heuristics to make frequency judgments. Tversky and Kahneman (1973) suggested several heuristics that people might use to make frequency judgments. Somewhat confusingly, all of the proposed heuristics came to be known as availability heuristics, although they assume clearly distinct cognitive processes. Most commonly, the availability heuristic has been interpreted as the retrieval of a limited number of exemplars followed by an estimation based on the number of retrieved exemplars (cf. Watkins & LeCompte, 1991). “The subject could, therefore, use the number of instances retrieved in a short period to estimate the number of instances that could be retrieved in a much longer period of time” (Tversky & Kahneman, 1973, p. 210). To distinguish this heuristic from other heuristics, it has been named recall-estimate theory (Watkins & LeCompte, 1991).
Empirical support for the recall-estimate model stems from the finding that the number of recalled exemplars is correlated with frequency judgments and that both variables are influenced in the same way by experimental manipulations at the time of encoding (Manis et al., 1993; Tversky & Kahneman, 1973). For example, in a now classic study, Tversky and Kahneman demonstrated that people recall more female names than male names from a list with an equal number of female and male names, when the female names referred to famous people. In addition, they also made higher frequency estimates for female than for male names.
Today, the second availability heuristic proposed by Tversky and Kahneman is known as the ease-of-retrieval model (Schwarz, Bless, Strack, Klumpp, Rittenauer-Schatka, & Simons, 1991). According to this model, an individual “attempts to recall some instances and judges overall frequency by availability, i.e., by the ease with which instances come [italics added] to mind” (Tversky & Kahneman, 1973, p. 220). The ease-of-retrieval model has been empirically tested and supported in some studies (Gabrielcik & Fazio, 1984; Schwarz et al., 1991). For example, Schwarz et al. (1991) asked participants to recall either six instances when they were assertive, which was easy, or twelve instances, which was difficult. Subsequently, the participants in the easy recall condition judged themselves to be more assertive than those in the difficult recall condition.
In contrast to the retrieval-based models, retrieval-free models assume that frequency judgments do not involve retrieval of exemplars to the level of consciousness. Interestingly, Tversky and Kahneman (1973) also suggested a retrieval-free model of frequency judgments. “To assess availability it is not necessary to perform the actual operations of retrieval. It suffices to assess the ease with which these operations could [italics added] be performed, much as the difficulty of a puzzle or mathematical problem can be assessed without considering specific solutions” (p. 208).
This proposition bears a close resemblance to the finding in the metamemory literature that people often have a feeling of knowing the answer to a question, even when they cannot recall the answer. Nevertheless, the strength of this feeling predicts people’s performance in a later recognition test (see Nelson, 1988). Metcalfe (1993) proposed that this seemingly paradoxical ability is based on the familiarity of the question. Similarly, Hintzman (1988) proposed that people can judge the frequency of events without actual retrieval of related memories by means of a direct familiarity signal from memory. Hintzman’s familiarity model assumes that a question such as “How frequently did you experience joy in the last week?” automatically activates, in parallel, memories of joy experiences in the last week. The activation is based on a feature-matching process: The more features of a typical joy experience a memory possesses, the stronger the activation of this memory and the stronger the familiarity signal. Furthermore, some features encode the time of the experience, so that joy experiences in the last week are activated more strongly than joy experiences at other times. The automatic activation process produces an echo. The intensity of this echo reflects the amount of information that was activated in memory. This echo intensity is experienced as a feeling of familiarity. The major distinction between the familiarity model on the one hand, and the retrieval-based models on the other hand, is that the familiarity model does not require the retrieval of emotional memories to a conscious level. Therefore, it is possible that someone says: “I cannot recall a specific situation in which I felt joy last week, but I think I felt joy about 20 times.”
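The feature-matching process can be sketched in a few lines of code. The sketch below is loosely modeled on multiple-trace simulations of the MINERVA 2 type, in which each trace is a feature vector, the activation of a trace is the cube of its similarity to the probe, and echo intensity is the sum of all activations. The feature coding, the vector lengths, and the memory contents are hypothetical simplifications chosen for illustration and are not parameters from Hintzman (1988).

```python
def similarity(probe, trace):
    """Normalized feature match between probe and trace (range -1 to 1)."""
    relevant = sum(1 for p, t in zip(probe, trace) if p != 0 or t != 0)
    if relevant == 0:
        return 0.0
    return sum(p * t for p, t in zip(probe, trace)) / relevant


def echo_intensity(probe, memory):
    """Sum of cubed similarities over all traces, computed in parallel.

    Cubing preserves the sign and lets closely matching traces dominate.
    Only a single scalar is returned; no individual trace is retrieved.
    """
    return sum(similarity(probe, trace) ** 3 for trace in memory)


# Hypothetical feature codes: four emotion features plus two time features.
JOY, ANGER = (1, 1, -1, 1), (-1, 1, 1, -1)
LAST_WEEK, EARLIER = (1, -1), (-1, 1)

memory = (
    [JOY + LAST_WEEK] * 5      # five joy experiences last week
    + [JOY + EARLIER] * 3      # three joy experiences before that
    + [ANGER + LAST_WEEK] * 2  # two anger experiences last week
)

echo_joy = echo_intensity(JOY + LAST_WEEK, memory)
echo_anger = echo_intensity(ANGER + LAST_WEEK, memory)
```

In this toy memory the echo for joy in the last week is stronger than the echo for anger in the last week, and older joy traces contribute only weakly because their time features mismatch the probe. The sketch returns a familiarity value only; it does not say how that value is turned into a numerical estimate.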
One important limitation of the familiarity model is that it does not explain how participants make absolute frequency estimates. The familiarity model only predicts that the familiarity signal will be stronger for frequent stimuli and weaker for rare stimuli, but the model does not explain how a feeling of familiarity is converted into an absolute numerical estimate (cf. Brown, 1995; Brown & Siegler, 1993). This problem, however, exists also for the ease-of-retrieval and the recall-estimate model.
2.4.2 Empirical Paradigms
In the experimental literature, several experimental paradigms have been developed to differentiate between the various frequency judgment models. Subsequently, I review those paradigms that were employed in the present studies to explore the cognitive processes underlying frequency estimates of emotions. More specifically, I first review paradigms that differentiate direct from indirect encoding models, and then paradigms that differentiate between indirect encoding models.
One paradigm is modeled after a study by Hintzman and Block (1971). The authors asked participants to learn two lists of words in which the frequency of words was independently varied. Subsequently, the participants estimated the frequency of words separately for the first and the second list. The authors found that the participants were able to make accurate frequency judgments for each list. This finding is difficult to explain by direct encoding models, which assume that frequency information is constantly updated at the time of encoding. According to these models, only the total frequency is stored in memory, and it is impossible to differentiate frequencies in different contexts. Study 2 of the present dissertation used a similar paradigm. Participants first made daily frequency estimates of emotions for two weeks. Subsequently, they made separate frequency estimates for the first and second week of the diary study.
A second paradigm that has been used to test direct versus indirect encoding of frequency information relies on a manipulation of the salience of category membership (e.g., Bruce, Hockley, & Craik, 1991; Greene, 1989). In Greene’s study, participants were asked to study a list of words. In this list, words of different categories occurred with varying frequencies (e.g., fruits: orange, apple, banana, grapes; trees: oak, pine). In one study, he manipulated the salience of category membership: exemplars of the same category either appeared in one block or were spread across the list. The direct encoding model assumes that category members automatically activate category labels and that the frequency of the category is counted (Alba, Chromiak, Hasher, & Attig, 1980). Hence, making category membership salient should not have an effect on frequency judgments. However, Greene (1989) found that frequency was judged to be higher when category membership was salient; that is, in the blocked condition. This is once again difficult to explain by the direct encoding model. Manipulations of the salience of emotion concepts were employed in studies 1, 3, and 4 of the present dissertation to test whether salience has an influence on frequency judgments of emotions. In Study 1, participants made frequency judgments of emotions before and after a diary study for salient emotions; that is, those that were on the rating form during the diary study, and non-salient emotions; that is, those that were not on the form. First of all, it was expected that frequency judgments of all emotions would increase due to the participation in a diary study. In addition, it was expected that frequency judgments of salient emotions would increase more strongly than those of non-salient emotions. In studies 3 and 4, participants first rated for a number of emotions whether they would experience these emotions in various hypothetical scenarios. Subsequently, they were asked to estimate how frequently they would have experienced emotions in the set of scenarios. This question was asked for salient emotions; that is, those emotions that had been included in the previous rating task, and non-salient emotions; that is, those that had not been included in the scenario rating task. Note that frequency judgments of non-salient emotions are meaningful, because the fact that these emotion concepts were not included in the scenario rating task does not imply that these emotions could not have been experienced in the hypothetical scenarios. It was expected that the frequency judgments of salient emotions would be higher than those of non-salient emotions (Bruce et al., 1991; Greene, 1989; Hintzman, 1988).
The previously described paradigms can test direct and indirect encoding models against each other, but they do not make it possible to distinguish between retrieval-based and retrieval-free models, because all indirect models predict that different frequency judgments can be provided for different contexts (Hintzman & Block, 1971), or that salience at the time of encoding enhances frequency judgments (Bruce et al., 1991). Even the finding that frequency judgments and recall measures show discriminative accuracy across stimuli does not provide conclusive evidence that the frequency judgments were actually based upon a recall-estimate strategy (Bruce et al., 1991; Hastie & Park, 1986; Watkins & LeCompte, 1991). The correlation could simply be due to the fact that frequency judgments and the number of recalled exemplars are bound to be related through the number of exemplars stored in memory. Nevertheless, frequency judgments might not be based on the retrieval of exemplars. Therefore, a closer examination of the frequency judgment process is needed. The major difference between the two classes of models is the assumption of the retrieval-based models that exemplars are retrieved to the level of consciousness. Therefore, the time needed for a frequency judgment should be longer than the latency to retrieve at least a single exemplar. Similarly, Reder (1987) argued in the metamemory literature that feeling-of-knowing judgments should take more time than the retrieval of answers, if they are retrieval-based; however, consistent with retrieval-free models (Metcalfe, 1993), the feeling-of-knowing judgments were faster than retrieval times of answers.
Furthermore, if frequency judgments are based on the retrieval of exemplars, the judgment times of frequency judgments should be systematically related to the judged frequency: According to the recall-estimate theory, higher frequency judgments should need more time because more exemplars were retrieved, and the retrieval of more exemplars takes more time (Brown, 1995; Meudall, 1971). The opposite prediction is made by the ease-of-retrieval model. Higher frequency judgments are based on easier retrieval of exemplars, which implies that the exemplars come to mind faster so that the judgment can be made faster. In contrast to these predictions of the two retrieval-based models, Hanson and Hirst (1988) found neither a positive nor a negative relation between judgment times and size of the frequency judgments. Brown (1995) found a positive correlation when participants used the counting or the recall-estimate strategy, but not when participants used a retrieval-free strategy. In studies 3 and 4, the size of the frequency judgments, the time needed to make these judgments, and response times in a latency-to-retrieve task were compared to test the indirect encoding models of frequency judgments in the emotion domain.
2.5 Proposing a Familiarity Model of Frequency Judgments of Emotions
A finding by Schimmack and Reisenzein (in press) casts doubt on the basic assumption of the retrieval-based models that participants rely on the recall of exemplars to judge the frequency of emotions. Fitzgerald et al. (1988) found that on average the retrieval of emotional episodes from autobiographic memory needed more than 7s. MacLeod et al. (1994) also reported average retrieval times of more than 7s for unpleasant emotion memories, although pleasant memories were retrieved within 4s. In Schimmack and Reisenzein’s (in press) study, participants made conditional probability judgments of emotions (i.e., “If you experience joy, how frequently do you experience euphoria?”) on average within 5s. These judgment times appear to be too fast, especially given that the judgments were made only for unpleasant emotions, to be based on the recall of past emotional experiences. Of course, the fast responses could be due to random responding. However, covariation judgments of emotions are not only fast, but also reflect actual covariations between emotions fairly accurately (Reisenzein & Schimmack, 1996).
Finally, Schimmack and Reisenzein analyzed asymmetries in conditional probability judgments. According to Bayes’s theorem (Wiggins, 1973), p(A) > p(B) exactly if p(A|B) > p(B|A); that is, because sadness is in general a more frequent emotion than embarrassment, the conditional probability of sadness given embarrassment should be higher than the conditional probability of embarrassment given sadness. This prediction was confirmed for most emotion pairs. Therefore, conditional probability judgments reflect not only the actual co-occurrence of emotions, but also the separate frequencies of the two emotions. It is difficult to imagine how the participants (a) retrieved a sufficient number of exemplars and (b) carried out the necessary computations on a conscious level within 5s. Therefore, Schimmack and Reisenzein concluded that both frequency and co-occurrence judgments of emotions are either already pre-stored in memory, or the judgments are based on a feeling of familiarity (Hintzman, 1988; Metcalfe, 1993). As a consequence, a test between direct encoding models and the familiarity model seems to be highly desirable. Nevertheless, previous findings by Hintzman and Block (1971) and others (e.g., Greene, 1989) have challenged direct encoding models in other domains. Furthermore, the familiarity model has been successfully applied to other social judgments. For example, the familiarity model, but not the direct-encoding models, explains the phenomenon of illusory correlations (Smith, 1991; Smith & Zaraté, 1993; see also Fiedler, 1991). Therefore, Schimmack and Reisenzein recommended the familiarity model as an “inference to the best explanation” for frequency and co-occurrence judgments of emotions.
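The asymmetry argument follows from the definition of conditional probability: p(A|B) and p(B|A) share the joint probability p(A and B) as their numerator, so their ratio equals the ratio of the base rates, p(A)/p(B), provided the two emotions co-occur at least once. The following minimal sketch illustrates this with invented counts; the numbers are hypothetical and are not data from Schimmack and Reisenzein.

```python
def conditional_probabilities(n_a: int, n_b: int, n_both: int, n_total: int):
    """Return p(A), p(B), p(A|B), and p(B|A) from simple occurrence counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_a_given_b = n_both / n_b  # share of B occasions on which A also occurred
    p_b_given_a = n_both / n_a  # share of A occasions on which B also occurred
    return p_a, p_b, p_a_given_b, p_b_given_a


# Hypothetical counts over 100 days: sadness (A) on 40 days,
# embarrassment (B) on 10 days, both emotions on 8 days.
p_a, p_b, p_a_given_b, p_b_given_a = conditional_probabilities(40, 10, 8, 100)

print(p_a_given_b, p_b_given_a)  # 0.8 0.2

# The ratio of the conditional probabilities mirrors the ratio of the base rates.
print(abs(p_a_given_b / p_b_given_a - p_a / p_b) < 1e-9)  # True
```

Because sadness is the more frequent emotion in this example, sadness given embarrassment (0.8) exceeds embarrassment given sadness (0.2), which is the pattern of asymmetry described above.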
2.6 Biases in Frequency Judgments of Emotions
2.6.1 Mood-Congruent Biases
Up to now, frequency judgments of emotions have been treated just like frequency judgments of natural objects (e.g., fruits, cities, furniture). However, emotions differ from these stimuli in that emotions have a hedonic tone: they are either pleasant or unpleasant (cf. Clore, 1994). Several information processing models predict that affective information is processed differently from non-affective information. According to the mood-congruent-memory hypothesis (Bower, 1981), the current affective state renders mood-congruent memories more accessible. In combination with the indirect encoding models of frequency judgments, this leads to the prediction that frequency estimates of emotions are biased in a mood-congruent direction. In contrast, the competing model of mood effects on social judgments, the mood-as-information model (Schwarz & Clore, 1983), does not make this prediction. According to this model, people directly use their current mood to make evaluative judgments whenever they consider their current mood a valid and relevant source of information for the judgment. In an intriguing experiment, Schwarz and Clore (1983) demonstrated that participants rated the satisfaction with their lives to be higher in a good mood (e.g., on sunny days) than in a bad mood (e.g., on rainy days). This effect, however, disappeared when the influence of the weather on participants’ current mood was made salient to them. As a consequence, participants no longer considered their current mood a valid source of information and used other information. Subsequent studies showed that current mood was not used for judgments about satisfaction with specific life domains, presumably because participants considered their current mood as irrelevant (Schwarz, 1987). Because current mood does not appear to be a particularly relevant source of information for frequency judgments of specific emotions such as love, hate, joy, and fear, people should not use their current mood as information for these judgments. Therefore, the mood-congruent-memory model, but not the mood-as-information model, predicts an influence of current mood on frequency judgments of emotions.
To address this question empirically, individual differences in naturally occurring mood at the time of the frequency judgments were assessed in the two field studies. Naturally occurring mood, rather than a mood-induction procedure, was used for two reasons. First, an experimental mood manipulation may have distorted the results in the more important analyses of the accuracy of frequency judgments. Second, I believe that it is useful to start a scientific investigation with a demonstration of a phenomenon under natural conditions. If naturally occurring mood is unrelated to frequency judgments of emotions, an experimental investigation is at least of secondary importance to the present research question. This research strategy seems especially desirable in the light of a series of studies by Parrott and Sabini (1990), who did not find mood-congruent recall in naturalistic settings; indeed, the authors found mood-incongruent recall. In addition, mood effects in experimental studies are often quite small and inconsistent (Blaney, 1986; Brewin et al., 1993), suggesting that current mood leads only to small distortions in retrospective frequency estimates of emotions.
2.6.2 Personality-Congruent Biases
Martin’s (1985) notion of a personality-congruent memory bias suggests that a person’s personality might also influence frequency judgments of emotions. Specifically, participants might overestimate the frequency of personality-congruent emotions. In support of this hypothesis, Diener et al. (1984) found that neurotic individuals overestimated the amount of their unpleasant affect, whereas extravert individuals overestimated the amount of their pleasant affect (see also Feldman-Barrett, in press). In addition, Larsen (1992) found that neurotic individuals tended to overestimate the frequency of some physical symptoms.
Personality-congruent biases can be explained in several ways. First, personality-congruent memories are more accessible (Martin, 1985); therefore, at least the retrieval-based models would predict higher frequency judgments for personality-congruent emotions. Second, people may have generalized beliefs about their personality that can be based on various kinds of information, such as communication with others or abstractions from their own experiences (Fiske & Taylor, 1984; Hastie & Park, 1986). For example, a person might think that he or she is a “jealous”, “choleric”, or “happy” person. People might rely on such generalized beliefs when they are asked to estimate the frequency of their emotional experiences (Feldman Barrett, in press; Zuroff, 1989). The use of generalized beliefs is, of course, just another judgment model of frequency judgments of emotions: Once an individual has determined the frequency of his or her emotions, for example by means of one of the other judgment strategies, he or she simply retrieves this prestored frequency information to make subsequent frequency judgments. As long as this information accurately reflects the actual frequencies of emotional experiences, this judgment strategy provides a fast and efficient way to answer frequency questions about emotions. However, if the formerly derived frequency judgments deviate from the actual frequencies of emotional experiences, reliance on this information leads to personality-congruent biases. With regard to frequency questions over limited time periods (e.g., “in the last month”), a third explanation is possible: People may have difficulty distinguishing between episodes that fall within and those that fall outside of the time period asked about (cf. Schwarz, 1990). In this case, the frequency judgments would cover a longer time period than intended by the investigator’s question. Furthermore, personality tends to be a better predictor of emotional experiences the more they are aggregated over longer time periods (Epstein, 1983). As a consequence, personality explains additional variance in the frequency judgments of emotions that is not accounted for by the actual frequencies of emotions during the limited time period under investigation. Finally, personality-congruent memory biases could be a simple method artifact due to the fact that personality traits are measured by judgments that are very similar to frequency judgments of emotions.
In studies 1 and 2 the aim was simply to further explore whether a personality-congruent memory bias exists. If so, this could be the starting point for further analyses, differentiating between the different explanations described above. Note that a personality-congruent bias is consistent with the familiarity model, if it is due to the activation of memory traces of experiences outside of the time period under investigation. The bias should, however, account for much less variance than the actual frequencies of emotions, because the familiarity model assumes that memory traces can be activated for different contexts, including the time of experience (Hintzman & Block, 1971).
2.7 Summary of Hypotheses
The present dissertation has two main aims: (a) to test the accuracy of frequency judgments of emotions and (b) to explore the cognitive processes underlying these judgments.
With regard to the first question, the following predictions are made:
1. Participants should underestimate the absolute frequency of their emotional experiences. A related prediction is that people especially underestimate the frequencies of more frequent emotions. Both predictions are based primarily on the fact that these effects have been consistently obtained in the frequency judgment literature (cf. Thompson & Mingay, 1991; Williams & Durso, 1986). However, an explanation of this effect is lacking, because many frequency judgment models do not address the question of how frequency information (e.g., a feeling of familiarity) is converted into absolute frequencies (cf. Brown, 1995).
2. Frequency judgments of emotions show discriminative accuracy across emotions because the familiarity signal reflects the number of stored memories, and therewith the number of experiences of an emotion, fairly accurately (Hintzman, 1988). High discriminative accuracy across stimuli has been reported for frequency judgments of other stimuli (Hasher & Zacks, 1984; Hintzman, 1988).
3. The discriminative accuracy across participants is expected to be moderate. This prediction is based on some earlier findings (e.g., Thomas & Diener, 1990). The correlations reported by Diener et al. (1995) and Feldman-Barrett (in press) in the range from r = .50 to .70 are predicted to overestimate the true discriminative accuracy across participants, because the estimates were made (a) on the same scale that was used during the diary period and (b) after participation in a diary study.
With regard to the cognitive processes underlying frequency judgments of emotions, the following predictions were made:
1. Participants should be able to judge accurately the frequencies of emotions in the first and in the second week of a diary study. The reason is that it is possible to activate memory traces of different time periods separately, so that the familiarity signal reflects predominantly the frequencies in a specified time period (Hintzman & Block, 1971).
2. Making some emotion concepts salient during the encoding process should increase the judged frequency of salient emotion concepts compared to non-salient ones; that is, those emotion concepts that were not presented at the time of encoding, because salience leads to deeper encoding and less information loss (Greene, 1989; Hintzman, 1988).
3. The retrieval latency of an emotional episode from memory should be unable to account for the discriminative accuracy of frequency judgments across emotions. The basis for this prediction is that frequency judgments are assumed to be based on a sense of familiarity and that the familiarity signal reflects the actual frequencies more accurately than the ease-of-retrieval of exemplars (Watkins & LeCompte, 1991). This hypothesis allows for the possibility that frequency judgments of emotions and retrieval latencies of emotional episodes are negatively correlated (Fitzgerald et al., 1988; MacLeod et al., 1994). It only predicts that this correlation is not strong enough to account for the discriminative accuracy across emotions.
4. The familiarity model does not predict a relation between the size and the speed of frequency judgments. In contrast, the retrieval-based models predict such a relation; the counting and the recall-estimate model predict a positive correlation (Brown, 1995), whereas the ease-of-retrieval model predicts a negative correlation.
5. Finally, it is predicted that the recall of a single emotional episode takes longer than the complete frequency judgment process. This prediction is again based on the assumption of the familiarity model that frequency judgments are not based on the retrieval of emotional experiences from memory. This prediction is in agreement with previous results that frequency judgments are faster than the retrieval of exemplars (Alba et al., 1980; Schimmack & Reisenzein, in press).
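The contrasting predictions in the fourth hypothesis concern the sign of the correlation between the size of a frequency judgment and the time needed to make it. The following sketch shows how such a correlation could be computed; the judged frequencies and judgment times are invented solely for illustration and do not represent data from the present studies.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical judged weekly frequencies of five emotions.
judged_frequency = [2, 5, 8, 12, 20]

# Hypothetical judgment times (in seconds) under two retrieval-based accounts.
times_recall_estimate = [3.1, 3.9, 4.8, 6.0, 8.2]  # more exemplars, more time
times_ease_of_retrieval = [6.0, 5.1, 4.4, 3.6, 2.9]  # easier retrieval, less time

r_recall = pearson_r(judged_frequency, times_recall_estimate)
r_ease = pearson_r(judged_frequency, times_ease_of_retrieval)
print(r_recall > 0, r_ease < 0)  # True True
```

A clearly positive correlation would be in line with the counting and recall-estimate models, a clearly negative one with the ease-of-retrieval model, and a correlation near zero with the familiarity model.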
No explicit predictions were made concerning the influence of current mood or personality on frequency judgments of emotions because the familiarity model predicts such biases only under certain conditions. Mood-congruent effects could be due to a stronger activation of mood-congruent memory traces, leading to a stronger familiarity signal. Personality-congruent effects could be due to the activation of memory traces outside of the time frame of the question. However, it is predicted on account of the familiarity model that biases, if they exist, are relatively small compared to the amount of variance that is explained by the actual frequencies of emotions.
3 STUDY 1
Studies 1 and 2 used the pre-post design of Thomas and Diener (1990). In a pre-post design, participants first judge the frequency of emotions for a time period prior to the diary study. Then, they take part in a diary study, which serves to obtain a measure of the actual frequencies of emotions. Finally, they judge the frequency of emotions during the diary period. The advantage of the pre-post design is that it allows a test of salience effects; that is, whether the participation in the diary study influenced the post-diary judgments. The problem with the design is that pre-diary estimates necessarily cover a different time period than the one during which the actual frequencies of emotions are measured. Hence, changes in the true frequencies of emotions from the pre-diary period to the actual diary period can attenuate correlations between pre-diary estimates and the actual frequencies of emotions.
Study 1 served several goals: first, to test the accuracy of frequency judgments of emotions, using various measures of accuracy, and second, to test whether the salience of emotions at the time of encoding influences subsequent frequency judgments. Third, Study 1 tested the presence of mood- or personality-congruent biases in frequency judgments of emotions.
Fourth, the strengths and weaknesses of two different response formats were compared: (a) the participants made absolute estimates; that is, they estimated the absolute number of occurrences of an emotion (X times a week), and (b) they made vague quantifier ratings (Pepper, 1981; Wright, Gaskel, & O’Muircheartaigh, 1994); that is, they checked for a number of common frequency expressions (e.g., never, rarely, sometimes, often) which one most appropriately described how frequently they experienced an emotion. The use of both response formats appeared to be especially desirable because Schaeffer (1991) obtained different results for absolute estimates and vague quantifier ratings. In a survey study, respondents were first asked to rate the frequency of excitement and boredom in their lives by means of vague quantifiers. Then, they were asked to indicate which absolute frequency the chosen quantifier indicates. Black participants appeared to be more bored than white participants according to the vague quantifier ratings, but not according to the absolute estimates.
Two ancillary analyses were carried out. First, I explored the time period covered by frequency judgments of emotions. It is well known that memories increasingly decay over time (Hintzman, 1988). Therefore, frequency judgments of emotions should be influenced predominantly by more recent emotional experiences. Nevertheless, it remains to be discovered whether frequency judgments of emotions cover only experiences in the last few days or extend over much longer time periods. Second, the structure of individual differences in the frequencies of pleasant and unpleasant emotions was explored, which is an important topic in personality psychology (Bradburn, 1969; Diener, Smith et al., 1995; Green, Goldman, & Salovey, 1993; Meyer & Shack, 1989).
One hundred and fifty students in a semester-long course on research in personality at the University of Illinois took part in this study. Four participants were excluded because of missing data. The final sample consisted of 107 female and 39 male participants. Although the topic of the validity of retrospective frequency judgments of emotions was discussed in this course, this happened only after the data collection relevant to this study had been completed.
3.1.2 Material and Procedure
3.1.2.1 Daily Estimates
At the core of the present study, participants estimated the absolute frequencies of 20 emotions (see Table 1) at the end of each day for 23 days. The first two days were used to practice the use of the questionnaire and were excluded from all analyses. Using a free response format, participants entered any number that seemed to be appropriate as an estimate of the absolute frequency with which they experienced an emotion on a particular day. Participants returned the forms the next day, except for the weekend forms which were due on Monday.
In studies 1 and 2, the averaged daily ratings are used as a standard of comparison for the long-term frequency judgments. Therefore, the averaged daily judgments are labeled actual frequencies, although the measure can be expected to give only an approximation of participants’ actual frequencies of emotional experiences. However, random- or event-sampling methods (see Schimmack & Diener, in press) would reflect the frequency of emotional experiences only if the number of daily measurement points were extremely high; which would probably overtax the motivation of the participants. Therefore, daily, or, as in Study 2, twice-daily, frequency estimates were considered the most appropriate measure of actual frequencies of emotions in everyday life.
3.1.2.2 Vague Quantifier Ratings
Before (pre-diary) and after (post-diary) the diary study, the participants made vague quantifier ratings concerning the frequencies of emotions during the last three weeks. The emotions were the 20 emotions included in the daily form and 9 additional ones. For the rating task, participants were provided with frequency expressions commonly used in everyday language, and they had to check the most appropriate one (i.e., In the last three weeks, I experienced joy [never], [very rarely], [rarely], [sometimes], [often], [very often], [extremely often]). For statistical analyses, the vague quantifier ratings were later converted to numbers from 0=never to 6=extremely often.
3.1.2.3 Absolute Estimates
The participants also made pre- and post-diary absolute estimates of the frequency of emotions experienced during the last three weeks. These judgments were made after the vague quantifier ratings. This order was chosen because participants might have relied on their absolute estimates to make the vague quantifier ratings, whereas a transfer effect in the other direction seemed to be less likely. The emotions were the same as in the vague quantifier questionnaire. Also, the item sequence was the same for vague quantifier ratings and absolute estimates. The absolute estimates were made using a free response format; that is, the participants wrote down any number that they deemed appropriate. Although the estimates were required to cover the last three weeks, the questionnaire asked participants to estimate weekly frequencies (i.e., “In the last three weeks, I experienced joy _ times a week”). A weekly time frame was used for the following reasons. A daily time frame (_ times a day) seemed problematic. First, a daily time frame might have especially encouraged the participants to memorize their daily estimates to make the post-diary estimates. Second, a daily time frame does not discriminate between the frequencies of very rare emotions (e.g., envy, hate), which on many days are not experienced at all; therefore, the modal response for these emotions would be zero. A three-week time frame (_ times in the last three weeks) was not used because a weekly time frame has the advantage that it can be used for different time periods, ranging from one week (“In the last week, I experienced joy _ times a week.”) to people’s frequency of emotional experiences in general (“In general, I experience joy _ times a week”).
3.1.2.4 Mood Questionnaire
After completing the two frequency judgment tasks, the participants rated their current mood on the ELMI (Everyday Language Mood Inventory; Schimmack, 1996a) which is an English adaptation of the BASTI (Berliner Alltagssprachliche Stimmungsinventar; Schimmack, in press). The ELMI measures 10 specific mood dimensions, namely indifference, sentimentality, depression, grouchiness, irritation, anxiety, nervousness, euphoria, cheerfulness, and relaxation, and three global mood dimensions, pleasure-displeasure, aroused-calm, and wakeful-tired, with two items each. Ratings were made on an intensity scale ranging from 0=not at all to 6=extremely intense.
3.1.2.5 Personality Questionnaire
The personality dimensions of neuroticism and extraversion were measured by means of the NEO-PI-R (Costa & McCrae, 1992). Each trait is measured by six subscales, and each subscale comprises 8 items. Therefore, the NEO-PI-R provides a reliable and broad assessment of these two personality dimensions.
3.2.1 Absolute Accuracy
The absolute and relative accuracy of frequency judgments of emotions can only be tested for the absolute estimates, because vague quantifiers do not correspond to a fixed absolute frequency (Pepper, 1981; Wright et al., 1994). To test the absolute accuracy, each participant’s standard deviation from his or her actual frequencies (i.e., the square root of the mean squared difference between estimated and actual frequencies) was computed across the 20 pre-diary and the 20 post-diary estimates. A comparison of the standard deviations indicated that absolute accuracy increased from the pre-diary estimates (mean SD = 11.88) to the post-diary estimates (mean SD = 9.09), t(145) = 5.95, p < .01.
This finding suggests that the participation in the diary study increased the accuracy of the estimates. One problem in the interpretation of this finding is, however, that the post-diary estimates, but not the pre-diary estimates, cover the time period of the diary study. Therefore, the present finding might also be due to the fact that the actual emotion frequencies changed from the pre-diary period to the diary period.
The previous analysis measured absolute accuracy at the individual level. It is, however, also possible to compare the absolute accuracy of pre- and post-diary estimates at the group level. To this aim, the frequency judgments of each emotion were first averaged across participants. Then the standard deviations of the averaged pre- and post-diary estimates from the averaged actual frequencies were compared. This analysis also shows an improvement in absolute accuracy from pre-diary (mean SD = 7.54) to post-diary (mean SD = 3.60) estimates, t(19) = 4.46, p < .01.
This finding is stronger evidence for an improvement in absolute accuracy, because it is less likely that the average frequency of an emotion changed from the pre-diary to the diary period. That is, it is less likely that all participants experienced less anger or more joy in one of the two weeks. In sum, the analyses suggest that the participation in the diary study improved the absolute accuracy of the judgments.
3.2.2 Relative Accuracy
Table 1 shows the weekly frequencies of the 20 emotions, as derived from the daily frequency estimates, which were added up and divided by 3, and as estimated before and after the diary study. For all 20 emotions, the actual frequencies were higher than the estimated frequencies. The other notable finding, that all pre-diary estimates were lower than the post-diary estimates, is discussed later. In some cases the absolute differences were quite dramatic: For example, participants experienced contentment on average 28 times a week, but estimated only 7 times a week in their pre-diary estimates.
As a measure of relative accuracy, each participant’s actual frequencies were subtracted from his or her estimated frequencies of emotions. Underestimation was more severe for the pre-diary estimates (mean d = -7.54) than for the post-diary estimates (mean d = -3.59), t(145) = 7.09, p < .01. Before the diary study only 6 (of 146) participants showed overestimation, whereas after the diary study 23 participants overestimated their frequencies of emotions.
The analysis at the group level provided the same results as the previous analysis of the absolute accuracy because the averaged estimates always underestimated the averaged actual frequencies (absolute and relative accuracy differ only when both over- and underestimation occur). In sum, the analyses provide clear support for the first hypothesis that people in general underestimate the frequency of their emotional experiences.
Figure 3 shows the means of Table 1 to test the second part of hypothesis 1, that people underestimate higher frequencies more than lower frequencies (cf. Watkins & LeCompte, 1991). Clearly, underestimation increases with higher actual frequencies. In addition, it can be seen that the frequency estimates follow a linear trend. This finding is in agreement with results reported by Watkins and LeCompte (1991). To demonstrate higher underestimation for higher actual frequencies quantitatively, the relative accuracy score was correlated with the actual frequencies. This analysis produced, as predicted, negative correlations for both pre- and post-diary estimates: rs = -.99 and -.86 (ps < .01), respectively. Underestimation was more severe for higher actual frequencies. In addition, the relative accuracy scores of pre- and post-diary estimates were highly correlated with each other, r = .88, p < .01. In sum, analyses of the relative accuracy of frequency judgments of emotions revealed that (a) people underestimate the frequency of their emotions and (b) that they do so increasingly with increasing actual frequencies of emotions. This finding is consistent with experimental studies (cf. Watkins & LeCompte, 1991; Williams & Durso, 1986), and estimates of daily activities (Mingay, Shevell, Bradburn, & Ramirez, 1994). Furthermore, underestimation was more pronounced for pre- than for post-diary estimates. Because it is unlikely that the actual frequencies of all emotions increased from the pre-diary to the diary period, this finding can be interpreted as evidence that the relative accuracy of the estimates increased due to the participation in the diary studies.
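The correlation between relative accuracy and actual frequency can be illustrated with a short sketch. The group means below are hypothetical and are not the values of Table 1; `pearson_r` is a helper written for this illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical group means for five emotions (NOT values from Table 1).
actual = [28.0, 15.0, 9.0, 4.0, 2.0]
estimated = [7.0, 5.0, 4.0, 2.5, 1.5]

# Relative accuracy: estimated minus actual (negative = underestimation).
bias = [e - a for e, a in zip(estimated, actual)]

# A strongly negative r means underestimation grows with actual frequency.
r_bias = pearson_r(actual, bias)
```

In this toy example every emotion is underestimated and the bias becomes more negative as the actual frequency rises, which is the pattern the negative correlations above describe.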
3.2.3 Discriminative Accuracy across Emotions
Discriminative accuracy across emotions can be assessed at the group level as well as at the individual level. For the analysis at the group level, one simply has to compute the correlation between the actual frequencies and the estimated frequencies across the 20 emotions included in the daily report form (Table 1). The discriminative accuracy across emotions was very high for pre- and post-diary estimates, rs = .96 and .98, respectively, both ps < .01; the pre- and post-diary estimates were also highly correlated with each other, r = .96, p < .01. The same analyses were performed for the vague quantifier ratings. The correlations with the actual frequencies as well as with the absolute estimates were very high (all rs > .90, all ps < .01).
To test discriminative accuracy across emotions at the level of each participant, the correlations between actual frequencies and the four frequency judgments (pre- and post-diary absolute estimates and vague quantifier ratings) were computed for each individual. Subsequently, the correlation coefficients (see Footnote 4) were used as dependent variables in a 2 x 2 ANOVA with the within-subject factors response format (absolute estimates vs. vague quantifier ratings) and time of judgment (pre-diary vs. post-diary). This analysis revealed that the absolute estimates produced higher correlations (mean r = .79) than the vague quantifier ratings (mean r = .73), F(1,145) = 61.54, p < .01. Furthermore, post-diary judgments were more highly correlated (mean r = .85) with actual frequencies than pre-diary judgments (mean r = .67), F(1,145) = 415.51, p < .01. The interaction was not significant, F(1,145) = 0.02.
The higher correlations obtained for post-diary estimates suggest that participation in the diary study also increased the discriminative accuracy across emotions. However, the effect could also be due to changes in the true frequencies of emotions from the pre-diary to the diary period. The fact that vague quantifier ratings possessed less discriminative accuracy across emotions in the analysis at the individual level could be due to the limited number of response categories, which restricts the ability to discriminate between emotions with similar frequencies.
In sum, the results support hypothesis 2 that frequency judgments of emotions possess discriminative accuracy across emotions; that is, people are sensitive to the different frequencies with which they experience different emotions. This finding is consistent with studies of frequency judgment in other domains which also show that people are sensitive to variation in the frequencies of different stimuli (Hasher & Zacks, 1984; Hintzman, 1988).
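When individual correlation coefficients are averaged or entered into an ANOVA, they are often transformed first. The sketch below assumes Fisher's z-transformation, a common choice; this section does not state which treatment was actually applied, so the function is illustrative only.

```python
import math

def mean_r_fisher(correlations):
    """Average correlations via Fisher's z (assumed here, not confirmed
    by the text): transform r to z, average, and transform back to r."""
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))

# Hypothetical per-participant correlations (NOT the study's data).
example = [0.9, 0.1]
averaged = mean_r_fisher(example)
```

Because the transformation stretches values near 1, the Fisher-based mean of .90 and .10 is larger than the arithmetic mean of .50.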
3.2.4 Discriminative Accuracy across Participants
To test the discriminative accuracy across participants, the pre- and post-diary frequency judgments of both response formats were correlated with the actual frequencies across participants, separately for each of the 20 emotions (Table 2). Table 2 also shows the test-retest correlations of the absolute estimates and the vague quantifier ratings. In the last column of Table 2, the internal consistency of the daily estimates across the 21 days of the diary period is reported as a measure of the stability of individual differences in the frequency of emotional experiences during the diary period (Diener & Larsen, 1984).
As can be seen in Table 2, nearly all correlations between frequency estimates and actual frequencies of emotions were significant. Nonsignificant correlations were obtained only for pre-diary absolute estimates. These results support hypothesis 3 that frequency judgments of emotions possess discriminative accuracy across participants. However, the correlations in Table 2 vary considerably, ranging from r = -.03 to .78. A 2 x 2 ANOVA with the within-subject factors response format (absolute estimates vs. vague quantifier ratings) and time of judgment (pre- vs. post-diary judgments) was used to test whether these factors influence the discriminative accuracy across participants.
This analysis revealed significant main effects for response format, F(1,19) = 8.72, p < .01, and for time of estimate, F(1,19) = 148.54, p < .01. In addition, the interaction was also significant, F(1,19) = 12.93, p < .01. Follow-up analyses of the mean correlations (see Footnote 5) indicated that the post-diary correlations were higher than the pre-diary correlations (Figure 4). In addition, the significant interaction is due to the fact that the pre-diary vague quantifier ratings produced higher correlations than the pre-diary absolute estimates, whereas both response formats produced equally high correlations when the judgments were made after the diary period.
Again, the finding that post-diary estimates possess higher discriminative accuracy across participants can be due to two factors that are not mutually exclusive. First, rating the frequency of emotions on a daily basis might make emotional experiences more salient, leading to more accurate judgments. Second, individual differences in the actual frequency of emotional experiences may have changed from the pre-diary period to the actual diary period. It is important to distinguish between these two explanations, because the first explanation implies that the post-diary correlations overestimate the true discriminative accuracy of frequency judgments of emotions, whereas the second explanation implies that the pre-diary correlations underestimate the discriminative accuracy. Additional analyses were carried out to test the viability of the two accounts in greater detail.
The last column in Table 2 shows that the individual differences in the frequencies of emotions were highly stable over the three-week diary period. On the basis of this finding, a fairly high stability of individual differences in the frequencies of emotions can also be expected from the three weeks prior to the diary study to the three weeks of the diary period. If so, the higher correlations obtained for the post-diary judgments were at least partly due to the participation in the diary study. Table 2, however, also shows that emotions differ in their stability over time. For example, the frequency of affection is more stable (alpha = .93) than the frequency of feeling hurt (alpha = .71). If the correlation between pre-diary estimates and actual frequencies is attenuated by changes in the true frequencies of emotions, emotions with more variable frequencies over time (e.g., hurt) should be more affected than emotions with very stable frequencies over time (e.g., affection). To test this hypothesis, the pre-diary correlations (columns 1 and 3 in Table 2) were correlated with the stability coefficient (i.e., alpha in the last column in Table 2) across the 20 emotions. Both correlations indicate that the discriminative accuracy of pre-diary estimates increased with the stability of individual differences in the frequency of an emotion (absolute estimates r = .44, p = .05; vague quantifier ratings r = .54, p < .05), although the correlation for the absolute estimates was only marginally significant. This finding suggests that the temporal stability of an emotion influenced the size of the correlations between pre-diary estimates and actual frequencies during the diary study. Therefore, these correlations tend to underestimate the discriminative accuracy of frequency judgments of emotions.
In sum, the analyses show empirically that the true discriminative accuracy is higher than the correlation obtained for pre-diary estimates and lower than the correlation obtained for post-diary estimates. Therefore, a point-estimate of the true discriminative accuracy across participants is not possible, but it is on average in a range from r = .30 to .60. This finding suggests that the discriminative accuracy across participants has been overestimated in previous studies which used only post-diary judgments (Feldman Barrett, in press). The fact that the post-diary estimates in the present study are still lower than in previous studies can be attributed to the use of single-item measures in the present study, whereas previous studies used multiple-item measures which are bound to have a higher reliability.
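The stability coefficient used above is an internal consistency (alpha) computed across the days of the diary period, with days treated as items. A minimal sketch of Cronbach's alpha under that reading follows; the function name and the toy scores are hypothetical.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha; one row per participant, one column per day."""
    k = len(scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Variance of each day across participants, and of the row totals.
    day_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(day_variances) / total_variance)

# Hypothetical daily frequencies of one emotion for 4 participants over
# 3 days (NOT the study's data; the study used 21 days).
daily = [[2, 3, 2], [5, 4, 5], [1, 1, 2], [4, 5, 4]]
alpha = cronbach_alpha(daily)
```

Emotions whose daily frequencies rank participants consistently from day to day yield an alpha close to 1, as in the comparison of affection and hurt above.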
In a further set of analyses the specificity of the frequency judgments was explored; that is, whether individual differences in the judged frequency of an emotion are more highly correlated with individual differences in the actual frequencies of the same emotion than with those of other emotions (cf. Diener, Smith et al., 1995). The actual frequencies of each emotion were correlated with the frequency judgments of the remaining 19 emotions and the highest correlation was recorded (see Appendix 1). Subsequently, this correlation was compared to the correlation with the frequency judgment of the same emotion (Table 2). Specificity was established if the correlation with the judgments of the same emotion exceeded the highest correlation with judgments of another emotion. These analyses were carried out for all four frequency judgments (pre- and post-diary absolute estimates and vague quantifier ratings). The strongest evidence for specificity was obtained for the post-diary vague quantifier ratings: Estimates for all 20 emotions revealed specificity. For the other judgments, specificity existed for frequency judgments of 18 (post-diary absolute estimates), 17 (pre-diary vague quantifier ratings) and 14 emotions (pre-diary absolute estimates). Even 14 cases of specificity are many more than would be expected by chance; expected = 1, χ²(N = 20) = 177.89, p < .01. These results show that the participants clearly used information about specific emotions. This finding eliminates a simple response set explanation of discriminative accuracy across participants (cf. Diener, Smith et al., 1995). Furthermore, the results suggest that frequency judgments are not based on generalized beliefs, unless one assumes that participants have different beliefs for each of the 20 emotions.
In sum, frequency judgments of emotions were found (a) to possess moderate discriminative accuracy across participants and (b) to show remarkable specificity for each emotion. With regard to the two response formats, the vague quantifier ratings yielded higher correlations and more specificity than the absolute estimates, despite the use of absolute estimates on the daily report form to measure actual frequencies.
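The reported chi-square value can be reproduced with a goodness-of-fit computation over two cells (specific vs. not specific). The text does not spell out the computation, so the following is a reconstruction under that assumption: if the same-emotion judgment is only one of 20 candidates, 1 of the 20 emotions is expected to show specificity by chance.

```python
def chi_square_gof(observed, expected):
    """Pearson chi-square goodness-of-fit statistic."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n_emotions = 20
expected_specific = 1.0   # chance expectation stated in the text
observed_specific = 14    # pre-diary absolute estimates

chi2 = chi_square_gof(
    [observed_specific, n_emotions - observed_specific],
    [expected_specific, n_emotions - expected_specific],
)
print(round(chi2, 2))  # 177.89
```

The statistic is (14 - 1)²/1 + (6 - 19)²/19 = 169 + 8.89, which matches the value reported above.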
3.2.5 The Influence of Daily Ratings on Frequency Judgments of Emotions
Daily ratings of emotions during the diary study might make these emotions salient. According to hypothesis 5, this should increase the absolute level of the frequency estimates of these emotions. To test this prediction, 9 emotions were included in the pre- and post-diary questionnaires that had not been on the daily rating form. Furthermore, these emotions were selected to be related to one of the emotions on the daily form (not daily form-daily form: happiness-joy, love-affection, fear-anxiety, rage-anger, dislike-contempt, regret-guilt, shame-embarrassment, depression-sadness, helplessness-hopelessness).
The previous analysis of relative accuracy already demonstrated that people underestimated actual frequencies less in the post-diary judgments than in the pre-diary judgments. This effect implies that the frequency judgments increased from pre- to post-diary ratings (see Figure 3). If, however, the daily ratings increased especially the salience of those emotions on the daily report form, the increase should be stronger for those emotions on the report form than for their counterparts that were not on the form. In other words, salient emotions should reveal a higher increase from pre- to post-diary estimates than non-salient emotions.
To test this prediction, repeated-measures ANOVAs were carried out with the within-subject factors time (pre- vs. post-diary), salience (on the form vs. not on the form) and type of emotion (9 pairs of emotions). The first analysis was based on the absolute estimates and the second analysis on the vague quantifier ratings. The ANOVAs revealed significant effects for all main effects and interactions (Table 3). However, not all of the effects are theoretically important. For example, the strong effect (see Footnote 6) for the salience x emotion interaction simply shows that the frequencies of emotions were not equivalent across and within the 9 emotion-pairs.
The most important finding is that the predicted time x salience interaction was significant. Furthermore, Figure 5 shows that the interaction is due to the predicted stronger increase from pre- to post-diary frequency estimates of salient emotions.
However, the significant three-way interaction indicates that this effect differed across emotion pairs. Table 4 shows the pre- and post-diary absolute estimates for all 9 emotion pairs. Inspection of the data shows that the increase over time was replicated for all emotions, but three emotion pairs did not show the expected stronger increase for the salient emotion, namely joy-happiness, contempt-dislike, and hopelessness-helplessness.
Visual inspection of the effects suggests that frequent emotions showed a stronger increase in the frequency estimates from pre- to post-diary estimates, a hypothesis that is also suggested by the regression lines in Figure 3. To explore this hypothesis more thoroughly, I went back to the data in Table 2 and correlated the actual emotion frequencies with a change score, subtracting pre-diary from post-diary absolute estimates. The correlation proved to be highly significant, r(20) = .96, p < .01, indicating that more frequent emotions show a higher increase in the estimated absolute frequency from pre- to post-diary judgments. This finding is most likely due to the stronger underestimation of these emotions in the pre-diary estimates. Therefore, frequency judgments of more frequent emotions benefit in particular from making them salient.
Figure 6 shows the means of the pre- and post-diary vague quantifier ratings of salient and non-salient emotions. An unexpected finding was that the ratings of both salient and non-salient emotions decreased from pre- to post-diary judgments. This is exactly the opposite of what was expected. Furthermore, this effect occurred although the same participants had just made the absolute estimates which showed the expected increase. This finding strongly supports the hypothesis that vague quantifiers do not correspond in a one-to-one fashion to absolute frequencies (Pepper, 1981; Schaeffer, 1991; Wright et al., 1994). However, Figure 6 also shows that the significant time x salience interaction is due to a smaller decrease for the salient than the non-salient emotions. This finding is consistent with the predicted influence of salience: Given that vague quantifier ratings decrease over repeated assessments, they do so less for emotions which were made salient.
Table 5 shows the results for each emotion pair. All except three emotions showed the unexpected decrease from pre- to post-diary judgments. Next it was explored whether frequent emotions showed a smaller decrease than less frequent emotions, which would be equivalent to the stronger increase obtained for absolute estimates. Again, this hypothesis was tested by means of the data reported in Table 2. The actual emotion frequencies were correlated with a change score, subtracting pre-diary from post-diary vague quantifier ratings. As for the absolute estimates, a significant positive correlation was obtained, r(20) = .79, p < .01. In addition, the change scores of absolute estimates and vague quantifier ratings were significantly correlated, r(20) = .75, p < .01.
This finding suggests that absolute estimates and vague quantifier ratings also responded in the same way to the participation in the diary study. Frequent emotions show a higher increase for absolute estimates and a smaller decrease for vague quantifier ratings. In sum, the salience manipulation had the expected effect on both types of frequency judgments. However, for the vague quantifier ratings the expected salience effect was overshadowed by the unexpected and counterintuitive finding that vague quantifier ratings decreased from pre- to post-diary judgments.
A search in the psychological literature uncovered that this finding could have been predicted on the basis of earlier findings. As early as 1954, Windle conducted a meta-analysis and reported a decreasing mean in test-retest comparisons of social-adjustment measures. Although this effect has many practical implications, very little research tried to illuminate its causal mechanisms (see Knowles, Coker, Scott, Cook, & Neville, 1996). Recently, Knowles et al. (1996) suggested that the mean change is due to a meaning change of the items from test to retest. That is, participants better learn the common theme of the items in a questionnaire, which changes the meaning of single items. For example, participants might first think of all episodes of crying when they answer an item such as “I cry easily.” However, after learning that the questionnaire is about anxiety, the item is understood in this context and certain episodes of crying (e.g., crying for joy) are discounted, leading to the choice of lower response categories. Although this explanation might explain decreasing means in questionnaires which assess a single construct, it can hardly explain the findings in the present study. First, the items were not intended to measure a common construct, and it is unlikely that the participants falsely detected such a common theme. Second, meaning changes should have influenced the absolute estimates in the same way, but these estimates increased.
A different explanation could be Parducci’s range-frequency principle. Parducci (1968) demonstrated that people’s assignment of numbers between 100 and 1000 to vague quantifiers such as very small, small, large, very large was context dependent. 550 was high if most numbers fell in the range from 100 to 550, but low if most numbers fell in the range from 550 to 1000. In other words, the meaning of a response category depends on the distribution of the stimuli that have to be assigned to the response categories. Parducci demonstrated in several experiments with various types of stimuli that the assignment function is a compromise between a range and a frequency principle. The frequency principle implies that people try to accommodate an equal number of stimuli (i.e., in the present context a stimulus is the frequency of an emotion) in each category; that is, the same number of emotions should be in the “not at all”, “rarely,” or “often” category. Clearly, the frequency principle ignores the actual distribution of the stimuli; rather it forces the data into a uniform distribution. The range principle is most easily understood by its mathematical formula: R_ic = (S_i - S_min) / (S_max - S_min), where S is the actual scale value. As the actual scale represents frequencies, it is reasonable to assume that S_min equals zero; hence, R_ic = S_i / S_max. Although it is not clear which frequency corresponds to S_max, probably the number of all emotional experiences during a specified time period, it is clear that the range principle preserves the distribution of the stimuli, as long as the respondent has a sufficient number of response alternatives (Parducci & Wedell, 1986). Previous studies showed that in the final assignment of an item to a category, the two principles are weighted equally: J_ic = w R_ic + (1 - w) F_ic with w = .50 (Parducci, 1968). More recently, Parducci and Wedell (1986) demonstrated that the weight of the two principles is context dependent. For example, a higher number of response categories decreased the influence of the frequency principle.
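The weighted compromise between the two principles can be made concrete with a small sketch. Several details are assumptions made for illustration: the frequency value is computed as a rank-based value, (rank - 1) / (N - 1), which is not given in the text; S_max is set to the largest stimulus; and the stimulus values are hypothetical.

```python
def range_values(stimuli, s_min=0.0, s_max=None):
    """Range principle: R_i = (S_i - S_min) / (S_max - S_min)."""
    if s_max is None:
        s_max = max(stimuli)  # assumption: largest stimulus defines S_max
    return [(s - s_min) / (s_max - s_min) for s in stimuli]

def frequency_values(stimuli):
    """Frequency principle as rank-based values, (rank - 1) / (N - 1)
    with 1-based ranks. Assumes at least two stimuli and no ties."""
    order = sorted(range(len(stimuli)), key=lambda i: stimuli[i])
    values = [0.0] * len(stimuli)
    for rank, index in enumerate(order):  # enumerate gives rank - 1
        values[index] = rank / (len(stimuli) - 1)
    return values

def judgments(stimuli, w=0.5):
    """Weighted compromise: J_i = w * R_i + (1 - w) * F_i."""
    r = range_values(stimuli)
    f = frequency_values(stimuli)
    return [w * ri + (1 - w) * fi for ri, fi in zip(r, f)]

# Hypothetical, positively skewed emotion frequencies (NOT the study's data).
stimuli = [1, 2, 3, 5, 8, 30]
equal_weights = judgments(stimuli, w=0.5)
less_frequency = judgments(stimuli, w=0.8)
```

For positively skewed stimuli such as these, giving less weight to the frequency principle (a larger w) lowers the judgments of the small frequencies and therefore the mean judgment.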
The range-frequency principle would predict the observed mean change from pre- to post-diary vague quantifier ratings under certain conditions, namely, (a) if the distribution of the emotion frequencies is positively skewed; in this case the frequency principle leads to the assignment of small frequencies to medium response categories, and (b) if people weight the frequency principle less during the post-diary judgments; in this case the small frequencies are assigned to low categories. The major problem with this explanation is that it remains unclear why participants would shift the weights of the two principles. This problem is closely related to Haubensak’s (1994) criticism of range-frequency theory: It is descriptive but not explanatory; that is, a combination of the range and the frequency principle can predict outcomes of experiments, but this does not illuminate the underlying processes of the effect. To overcome this limitation of range-frequency theory, Haubensak (1994) developed a consistency model which might explain the finding in the present study that vague quantifier ratings decrease from pre- to post-diary judgments. According to the consistency model, respondents prefer to start the rating task with medium rating categories. If the distribution of stimuli is positively skewed, this implies that small frequencies are often assigned to medium rather than small categories. Furthermore, the first ratings influence all subsequent ratings because (a) the initial assignments of frequencies to response categories remain a standard for the complete task, and (b) the participants want to be consistent with their initial standard. Therefore, the tendency to assign small frequencies to moderate categories prevails throughout the task.
This model could explain the decreasing mean of vague quantifier ratings: The second time, participants have a better sense of the distribution of emotion frequencies, and they no longer try to be consistent with the standard of the first task, which might have been forgotten anyway. To test the viability of this post-hoc explanation, additional analyses were carried out.
As noted above, a basic assumption of this explanation is that the distribution of the actual frequencies of the emotions is positively skewed. As can be seen in Figure 3 this is indeed the case, which can also be shown quantitatively (skewness = 1.04). The pre- and post-diary absolute estimates show a similar skewness (pre-diary 1.00; post-diary 1.15). The prediction of the consistency model that the skewness of the vague quantifier ratings is reduced was also confirmed (pre-diary skewness = 0.24). The additional assumption made to explain the decreasing mean is that participants became more sensitive to the actual distribution of the stimuli so that the distribution of the vague quantifier ratings should be more similar to the actual distribution of the stimuli (this is equivalent to a decreasing influence of the frequency principle in range-frequency theory). This is also the case (post-diary skewness = 0.41). In sum, analyses of the distribution of the actual frequencies and the vague quantifier ratings are in agreement with Haubensak’s (1994) consistency model. Problems with the initial choice of rating categories lead to distorted assignments of vague quantifiers to frequencies. This problem persists within the same questionnaire because people want to be consistent. However, experience with the set of stimuli and the lack of a need (or ability) to be consistent from one measurement point to the other allows participants to improve their ratings. Because the finding was unexpected and the explanation is post-hoc, it was further explored in Study 2.
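The skewness comparisons above rest on a single summary statistic per distribution. The sketch below uses the simple moment-based estimator (third central moment divided by the second central moment raised to 1.5); the text does not say which estimator was used, so this choice is an assumption, and statistical packages often apply a small-sample correction that gives slightly different values.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical mean frequencies of six emotions (NOT the study's data).
skewed = [1, 2, 3, 5, 8, 30]   # a few very frequent emotions: positive skew
symmetric = [1, 2, 3, 4, 5]    # evenly spaced values: zero skew

skew_positive = skewness(skewed)
skew_zero = skewness(symmetric)
```

A positive value indicates a long right tail, that is, many rare emotions and a few very frequent ones.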
3.2.6 Exploration of Mood- and Personality-Congruent Biases
Personality- and mood-congruent biases were investigated simultaneously because extraversion and neuroticism are often correlated with current mood (Matthews, Jones, & Chamberlain, 1990; Schimmack, in press; Steyer, Schwenkmezger, Notz, & Eid, 1994). Current mood was measured with the 10 specific mood scales and the global pleasure-displeasure and aroused-unaroused dimensions of the ELMI (Schimmack, in press, 1996a).
To reduce the number of variables, a factor analysis of the 10 specific mood scales was carried out and the factor scores of the first two unrotated factors were retained. The obtained factors were very similar for the pre- and post-diary administration of the ELMI; therefore, only the post-diary factor analysis is reported in detail. The first factor was a displeasure-pleasure factor: The specific mood scales Depression, Grouchiness, and Irritation had high positive loadings on this factor, whereas the scales Good Humor and Relaxation had high negative loadings. The second factor was an arousal factor; the scales Nervousness, Anxiety, and Euphoria had high positive loadings on this factor. This interpretation of the factors was also supported by the simple correlations between the two factors and the directly measured pleasure and arousal dimensions. The first factor correlated highly negatively with the pleasure dimension (r = -.84, p < .01), and slightly with the arousal dimension (r = -.18, p < .05). The second factor correlated mainly with the arousal dimension (r = .45, p < .01), and slightly with the pleasure dimension (r = .18, p < .05).
In the following analysis, the factor scores were used, because they are based on a greater number of items than the direct measures of pleasure and arousal. To facilitate the interpretation of results, the factor scores of the first factor were inverted so that higher values indicate more pleasure. Extraversion and neuroticism were measured by the respective scales of the NEO-PI-R. To reduce the number of analyses, the frequency estimates of pleasant and those of unpleasant emotions were averaged (analyses for each emotion separately are included in Appendix 2).
In the first set of analyses, the post-diary frequency judgments were regressed simultaneously onto the actual frequencies, the mood and the personality variables to control for the intercorrelations between the predictor variables. Table 6 shows that for all analyses the daily averages were the strongest predictor of frequency estimates, indicating that the frequency judgments primarily reflect individual differences in the actual frequencies of experienced emotions. The personality and mood variables, however, were also related to the frequency judgments.
The absolute estimates showed a consistent bias for extraversion: Extraverted individuals estimated their pleasant and unpleasant emotions to occur more frequently than introverted individuals. Because extraversion is not generally assumed to be congruent with frequent experience of unpleasant emotions, this result does not indicate a personality-congruent effect. In contrast, the vague quantifier ratings showed a personality-congruent bias: Extraversion was a significant predictor of frequency estimates of pleasant emotions even after controlling for actual frequencies of emotions, and neuroticism predicted a bias in the frequency estimates of unpleasant emotions.
A mood-congruent effect was obtained in that a pleasant mood predicted lower vague quantifier ratings of unpleasant emotions, but not higher ratings of pleasant mood.
In a second analysis, the simple correlations between the personality and mood variables with the post-diary judgments were compared to the correlations with the pre-diary judgments. Conceivably, participation in the diary study could attenuates personality or mood biases, because the participants are more aware of their emotional experiences. If this is true, the simple correlations between personality and mood measures and frequency judgments should be higher for the pre- than for the post-diary judgments. However, Table 7 provides little support for this hypothesis. Only the absolute level of the correlation between current pleasure and vague quantifier ratings of pleasant and unpleasant emotions was higher for pre- than for post-diary estimates.
In sum, the results are mixed; only vague quantifier ratings showed a consistent personality-congruent bias (see also Feldman Barrett, in press). This finding could be a method artifact because the measurement of extraversion and neuroticism in the NEO-PI-R is partly based on items that include vague quantifiers (e.g., “I rarely feel depressed”). Therefore, individual differences in the interpretation of vague quantifiers could explain the finding that the personality measure explained additional variance on top of the actual frequencies, which are based on absolute estimates. Finally, it should be noted that the actual frequencies were by far the strongest predictor of the frequency judgments of emotions. This shows that frequency judgments are mainly based on the actual frequencies of emotions in the past, and that biases play only a minor role.
3.3 Additional Analyses
3.3.1 The Time Extension of Frequency Judgments of Emotions
People’s frequency judgments of emotions might only reflect the frequencies of emotions in the most recent past, or they may extend over longer time periods. To address this question empirically, the daily frequency estimates were averaged separately for the first, second, and third week of the diary period. Then, the post-diary estimates were regressed onto the three weekly averages in hierarchical regression analyses. In one set of analyses, the third week was entered first, followed by the second and first week, whereas in a second set of analyses, the predictors were entered in the reverse order. If information about more remote time periods; that is, the first week, is weighted less heavily in the frequency judgments, then entering the first week as the last predictor should explain less additional variance than entering the third week as the last predictor. Figure 7 shows the averaged incremental amount of explained variance for the separate analyses of the 20 emotions (see Appendix 3 for the results of each emotion).
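The logic of comparing the two orders of entry can be sketched in a few lines of code. The following is only an illustration under stated assumptions: the data are simulated, the weights (0.1, 0.2, 0.7) and the sample size are hypothetical, and the helper names `r_squared` and `increments` are not taken from the original analyses. It shows that the total explained variance is the same for both orders, whereas the increment at each step depends on the order of entry.

```python
import random

def r_squared(y, predictors):
    """R^2 of an ordinary least squares regression of y on the predictors."""
    n = len(y)
    # Design matrix with a leading intercept column.
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations (X'X | X'y), solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    b = [A[r][k] / A[r][r] for r in range(k)]
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    ss_res = sum((y[i] - sum(bj * xj for bj, xj in zip(b, X[i]))) ** 2
                 for i in range(n))
    return 1.0 - ss_res / ss_tot

def increments(y, ordered_predictors):
    """Increment in R^2 at each step of a hierarchical regression."""
    steps, previous = [], 0.0
    for step in range(1, len(ordered_predictors) + 1):
        current = r_squared(y, ordered_predictors[:step])
        steps.append(current - previous)
        previous = current
    return steps

# Simulated (hypothetical) data: weekly averages share a stable person
# component, and the retrospective judgment weights recent weeks more heavily.
rng = random.Random(0)
n = 200
trait = [rng.gauss(0, 1) for _ in range(n)]
week1 = [t + rng.gauss(0, 1) for t in trait]
week2 = [t + rng.gauss(0, 1) for t in trait]
week3 = [t + rng.gauss(0, 1) for t in trait]
judgment = [0.1 * a + 0.2 * b + 0.7 * c + rng.gauss(0, 0.5)
            for a, b, c in zip(week1, week2, week3)]

recent_first = increments(judgment, [week3, week2, week1])
remote_first = increments(judgment, [week1, week2, week3])
print("week 3 entered first:", [round(v, 3) for v in recent_first])
print("week 1 entered first:", [round(v, 3) for v in remote_first])
```

With a recency weighting of this kind, the week entered last adds little when it is the remote week and much more when it is the most recent week, which is the pattern the comparison of the two orders is designed to detect.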
A comparison of the increment in explained variance for the two orders in which the predictor variables were entered revealed that in the first step more variance was explained by week 3 than by week 1, t(19) = 2.24, p < .05. Week 2, entered always in the second step, explained more variance when it was entered after week 1 rather than after week 3, t(19) = 3.49, p < .01. There was no significant difference in the amount of explained variance for step 3, t(19) = 0.11, p = .91. This pattern of results indicates a slight recency effect in the vague quantifier ratings. However, week 1 and week 2 still explain 3% additional variance when they were entered after week 3. Therefore, vague quantifier ratings reflect the emotional experiences over the whole three weeks of the diary study (in individual analyses, an increment of 3% explained variance was significant).
The same analyses were performed for the absolute estimates. For these judgments, very different results were obtained (Figure 8): When week 3 was entered first, adding the second and first week hardly increased the amount of explained variance (2% and 1% respectively). In contrast, when week 1 was entered first, week 2 and 3 still explained a considerable amount of additional variance. The differences between the two orders of entry in amount of explained variance for all three steps were highly significant, all ts(19) > 5.00, ps < .01. This pattern of results reveals a strong recency effect for the absolute estimates.
The differences between vague quantifier ratings and absolute estimates can also be shown quantitatively. The third week entered in Step 1 explained more variance for the absolute estimates than for the vague quantifier ratings, t(19) = 2.14, p < .05. In contrast, week 1 entered in Step 3 explained more additional variance for the vague quantifier ratings than for the absolute estimates, t(19) = 5.84, p < .01. There were no differences for the second week entered in step 2, t(19) = 1.72, p = .10. In sum, both response formats show a recency effect; that is, frequency judgments are biased toward the frequencies in the more recent past. However, this effect is more pronounced for the absolute estimates than for the vague quantifier ratings.
The stronger recency effect for absolute estimates might be due to the use of absolute estimates to assess the actual frequencies during the diary study. Therefore, participants might have been influenced by recollections of their last daily absolute estimates when they made the absolute estimates, but not when they made the vague quantifier ratings. This could also explain why the absolute estimates were much less stable than the vague quantifier ratings from the pre- to the post-diary judgments (Table 2). Nevertheless, both response formats achieve an equally good prediction of emotional frequencies in the last three weeks (see Figure 4), but they do so differently: Whereas the absolute estimates better capture the frequencies in the most recent past, the vague quantifier ratings more accurately reflect frequencies in the remoter past.
3.3.2 Interrelations between the Frequencies of Pleasant and Unpleasant Emotions
The relation between the frequencies of pleasant and unpleasant emotions is an important topic in personality psychology. Previous researchers found the frequencies of pleasant and unpleasant affects to be independent or negatively correlated (Bradburn, 1969; Diener, Smith et al., 1995; Green et al., 1993; Watson, Clark, & Tellegen, 1988). However, these studies relied mostly on retrospective frequency estimates (for an exception see Diener, Smith et al., 1995). More importantly, the studies exclusively used vague quantifiers to assess the frequency of emotions. Because the present study already revealed several differences between vague quantifier ratings and absolute estimates, it seemed worthwhile to explore whether the response format also influenced the relation between frequency estimates of pleasant and unpleasant emotions. This is indeed the case as can be seen in Table 8: The absolute response format produced positive correlations, whereas the vague quantifier ratings produced negative correlations. These conflicting correlations indicate that one response format produces misleading results due to a method artifact.
Green et al. (1993) have argued that the experience of pleasant and unpleasant moods is highly negatively correlated, but that this negative correlation is often obscured by random and systematic measurement errors. At first sight, this argument would suggest that the positive correlation obtained for the absolute estimates is an artifact, for example due to an extremity bias. Furthermore, the low negative correlations obtained for the vague quantifier ratings could be attenuated by random measurement error. However, this interpretation of the data does not recognize the important distinction between mood and emotion. Green et al. (1993) asked their respondents how much pleasant or unpleasant mood they experienced in the last month. Assuming that a person is most of the time in a mood state (cf. Ekman & Davidson, 1994, chapter 2); that is, feels either pleasant or unpleasant, and that pleasant and unpleasant affects are rarely experienced at the same moment in time (Diener & Iran-Nejad, 1986; Green et al., 1993; Schimmack, in press; Steyer et al., 1994), it follows that the amount of pleasant mood must be negatively correlated with the amount of unpleasant mood experienced in the last month. This logical necessity, however, does not apply to the relation between the frequencies of pleasant and unpleasant emotions (see Figure 1), because emotions are not elicited and experienced all the time. Therefore the number of times that pleasant emotions are elicited can vary independently from the number of times that unpleasant emotions are elicited. Subsequently, I want to argue that the empirical relation between frequencies of pleasant and unpleasant emotional experiences is positive; that is, that some individuals experience more pleasant and unpleasant emotions than others, and that the negative correlation obtained for the vague quantifier ratings is an artifact.
A major problem of vague quantifier ratings is that it is unknown how the participants use the vague quantifiers for their judgments. One possibility could be that participants use vague quantifiers to indicate ranges of absolute frequencies; that is, rarely might mean 2-5 times a week. If this were true, however, vague quantifier ratings should show a similar pattern of results to the absolute estimates. The previous findings contradict this hypothesis. Another possibility could be that the participants used vague quantifiers to describe ranges of percentages (cf. Reisenzein, 1995). For example, experiencing an emotion often might mean in 80 to 90% of all emotional experiences. This, however, would produce a method artifact in the analysis of individual differences in the frequencies of experienced emotions, because percentages eliminate such individual differences. For example, one person might have only 10 emotional experiences a week, of which 8 elicited happiness and 2 elicited sadness. Another person might have 100 emotional experiences a week of which 80 elicited happiness and 20 sadness. Both respondents might say that they experience happiness often, meaning in 80-90% of their emotional experiences and sadness rarely, meaning in 10-20% of their emotional experiences. But the second person clearly experienced both emotions more frequently than the first person. Furthermore, because pleasant and unpleasant emotions co-occur very infrequently during a single emotional episode (Reisenzein, 1995; Schimmack & Reisenzein, in press), percentages, in contrast to absolute frequencies, of pleasant and unpleasant emotion frequencies are bound to be negatively correlated across participants.
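The two-person example can be worked through explicitly. The short sketch below uses the counts given in the text; the function name `percentages` is merely illustrative.

```python
def percentages(counts):
    """Convert absolute emotion counts into percentages of all experiences."""
    total = sum(counts.values())
    return {emotion: 100 * n / total for emotion, n in counts.items()}

# Weekly counts from the example in the text.
person_a = {"happiness": 8, "sadness": 2}    # 10 emotional experiences
person_b = {"happiness": 80, "sadness": 20}  # 100 emotional experiences

pct_a = percentages(person_a)
pct_b = percentages(person_b)
print(pct_a)  # {'happiness': 80.0, 'sadness': 20.0}
print(pct_b)  # {'happiness': 80.0, 'sadness': 20.0}
```

The percentages are identical for both persons, although the second person experienced each emotion ten times as often as the first. A rating that tracks percentages therefore cannot distinguish between them.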
The hypothesis that participants use vague quantifiers to indicate percentages makes the prediction that vague quantifier ratings of, for example, pleasant emotions reflect not only the actual frequencies of pleasant emotions, but also the actual frequencies of unpleasant emotions, although in the opposite direction; that is, higher frequency judgments are obtained for lower actual frequencies of emotions of the opposite valence. This follows from the fact that percentages take all emotional experiences into account (i.e., rating of pleasure = actual pleasure / (actual pleasure + actual displeasure)). To test this hypothesis, the vague quantifier ratings of pleasant and unpleasant emotions were regressed onto the actual frequencies of pleasant and unpleasant emotions. Table 8 shows the predicted pattern that vague quantifier ratings of pleasant (unpleasant) emotions are positively related to the actual frequencies of pleasant (unpleasant) emotions, but also negatively related to the actual frequencies of unpleasant (pleasant) emotions.
However, the correlations across types of affects are not close to -1, which is what one would expect if vague quantifiers were pure measures of percentages. It is therefore conceivable that they reflect partly absolute frequencies and partly percentages of emotional experiences. If the vague quantifier ratings also reflect individual differences in the average actual frequencies of all emotions, the sum of all vague quantifier ratings should be correlated with the sum of all absolute estimates. Indeed, the correlations are rs = .40 and .45, ps < .01, for the pre- and post-diary vague quantifier ratings. Apparently, vague quantifier ratings indicate partly absolute frequencies and partly percentages of individuals’ emotional experiences.
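The consequences of a pure percentage interpretation can be illustrated with a small simulation. This is a sketch under assumptions that are not part of the study: the number of simulated respondents, the range of total emotional experiences (5 to 50 per week), and the range of the pleasant proportion (.50 to .90) are all hypothetical.

```python
import random

def pearson(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 500
# Hypothetical respondents differ in how many emotional experiences they have
# per week and in the proportion of those experiences that is pleasant.
total = [rng.uniform(5, 50) for _ in range(n)]
prop_pleasant = [rng.uniform(0.5, 0.9) for _ in range(n)]
pleasant = [t * p for t, p in zip(total, prop_pleasant)]
unpleasant = [t * (1 - p) for t, p in zip(total, prop_pleasant)]

# Ratings under a pure percentage interpretation:
# rating of pleasure = actual pleasure / (actual pleasure + actual displeasure)
rating_pleasant = [a / (a + b) for a, b in zip(pleasant, unpleasant)]
rating_unpleasant = [b / (a + b) for a, b in zip(pleasant, unpleasant)]

r_counts = pearson(pleasant, unpleasant)             # absolute frequencies
r_ratings = pearson(rating_pleasant, rating_unpleasant)
r_cross = pearson(rating_pleasant, unpleasant)       # rating vs. opposite valence
print(round(r_counts, 2), round(r_ratings, 2), round(r_cross, 2))
```

In this simulation the absolute frequencies are positively correlated because more active respondents have more of both kinds of experiences, whereas the percentage-based ratings correlate at -1 by construction and relate negatively to the actual frequency of the opposite valence. Ratings that mix absolute frequencies and percentages would fall between these extremes.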
In sum, the major implication of the present findings is that the negative correlations between frequencies of pleasant and unpleasant emotions, obtained with vague quantifier ratings, are an artifact; that is, they conceal that the actual frequencies of pleasant and unpleasant emotions are positively correlated. Additional support for this claim stems from a study by Schimmack and Diener (in press), who also found a positive correlation between frequencies of pleasant and unpleasant emotions by means of a different method. In several studies, the participants indicated their emotional reactions to hypothetical or real life events, using an intensity scale ranging from 0=not at all to 6=extremely intense. Frequency of emotions was then determined as the number of non-zero ratings; that is, the number of times the emotion was experienced at all. In all studies, the frequencies of pleasant emotions were found to be positively correlated with the frequency scores of unpleasant emotions. Hence, yet another method supports a positive correlation between frequencies of pleasant and unpleasant emotions. Furthermore, Suh, Diener, and Fujita (1996) demonstrated that people with more positive life events also have more negative life events, presumably because they lead more active lives. For example, researchers who submit many papers to a journal have both more positive and negative reviews, and as a consequence more pleasant and unpleasant emotions, than researchers who submit only a few papers. This can explain the positive correlation between the frequencies of pleasant and unpleasant emotions, because positive events elicit pleasant emotions and negative events elicit negative emotions.
In sum, the present results indicate that the frequencies of pleasant and unpleasant emotions are positively correlated across individuals. This finding contradicts previous findings. However, previous studies either studied moods and not emotions (Green et al., 1993), or used exclusively vague quantifier ratings to measure the frequency of emotions. For reasons outlined above, the use of vague quantifiers is likely to produce method artifacts and lead to false conclusions about individual differences in the frequency of experienced emotions.
The findings of Study 1 support the first three hypotheses: Frequency judgments of emotions possess discriminative accuracy across emotions as well as across participants. They underestimate the actual frequencies of emotions, and they do so increasingly with increasing frequencies of actual occurrences. Study 1 also provided evidence for hypothesis 5 that the salience of emotions during the encoding stage increases subsequent frequency judgments: After participating in a diary study, participants provided higher absolute frequency estimates and did so especially for salient emotions; that is, emotions that were included in the daily report form. Furthermore, frequent emotions showed stronger increases than infrequent emotions. This is very likely due to the stronger underestimation of these emotions when they are not salient; there is simply more room for salience to boost the estimates of frequent emotions.
The increased salience also had an effect on the accuracy of the frequency judgments. All measures of accuracy showed higher accuracy for the post-diary than for the pre-diary judgments. Although the interpretation of this finding is ambiguous, because the pre-diary judgments necessarily covered a different time period than the period when the actual frequencies were measured, the consistency of the effects suggests that participants were better able to judge the frequencies of their emotions after participating in a diary study. Influences of salience on the accuracy of frequency judgments have also been reported in experimental studies (Naveh-Benjamin & Jonides, 1986).
The findings concerning mood- or personality-congruent biases were mixed. Only the vague quantifier ratings showed a personality-congruent bias: neurotic individuals overestimated the frequencies of their unpleasant emotions, and extraverted individuals overestimated the frequency of their pleasant emotions (Diener et al., 1984; Feldman Barrett, in press).
Study 1 also revealed interesting new findings that deserve attention in future research. First, in various analyses different results were obtained for the absolute estimates and the vague quantifier ratings. Concerning absolute estimates, pre-diary estimates revealed low discriminative accuracy across participants, and the test-retest correlations between pre- and post-diary estimates were low, despite a relatively high stability of emotion frequencies. It was also found that the absolute estimates reflect mostly frequencies in the most recent past (last week), and that they were unaffected by mood- or personality-congruent biases. On the other hand, vague quantifier ratings showed high temporal stability, reflected frequencies of emotions over the last three weeks, and appeared to be slightly biased in a personality- and mood-congruent direction. First of all, this finding indicates that the choice of the response format matters (Schaeffer, 1991); a factor that has been neglected in the frequency judgment literature. The present study only permits speculation about the causal mechanisms that produced these differences. Brown (1995) made the important point that most frequency judgment models do not explain how frequency information (e.g., a feeling of familiarity) is converted into an absolute estimate. The present study shows that the conversion into a vague quantifier rating also requires explanation. The finding that the two response formats produced divergent results, even though both formats were administered at the same time to the same participants, suggests that the response formats influence predominantly the conversion of frequency information into a response.
One very interesting difference between the two response formats was that absolute estimates increased considerably from pre- to post-diary judgments. In contrast, vague quantifier ratings, which were made right before the absolute estimates, decreased from pre- to post-diary judgments. This finding demonstrates clearly that vague quantifier ratings do not correspond to absolute frequencies. The unexpected decrease of vague quantifier ratings in a test-retest design has been observed in other studies (Knowles et al., 1996). Post-hoc analyses were in agreement with Haubensak’s consistency model. Vague quantifier ratings, but not absolute estimates, distorted the positively skewed distribution of the emotion frequencies. However, the post-diary vague quantifier ratings reflected the actual distribution better than the pre-diary estimates. This finding suggests that the participants had problems converting their frequency impressions into ratings along a limited number of vague quantifier categories.
Finally, the two response formats produced different correlations between frequencies of pleasant and unpleasant emotions: For absolute estimates the correlation was positive, whereas it was negative for the vague quantifier ratings. Regression analyses suggested that the negative correlations obtained with vague quantifier ratings were due to a method artifact: Participants used vague quantifiers partly to indicate percentages of the overall number of their emotional experiences (Reisenzein, 1995) which (a) eliminates individual differences in absolute frequencies of emotions and (b) pushes the correlation in a negative direction. Therefore, the absolute estimates are better suited to explore individual differences in the frequency of emotional experiences. According to the present study, people who experience pleasant emotions frequently are also likely to experience more unpleasant emotions. This finding challenges current structural models of personality which assume that the frequencies of pleasant and unpleasant emotions are independent (cf. Costa & McCrae, 1992; Watson et al., 1988).
4 STUDY 2
Study 2 was similar to Study 1 in that the participants again made daily frequency judgments and estimated the frequency of emotions before and after the diary study. Minor differences between Study 1 and 2 are that in Study 2 (a) the diary period extended only over two weeks, (b) the actual frequencies were based on twice-daily absolute estimates, and (c) the participants made only vague quantifier ratings, but not absolute estimates, before and after the diary period. The most important difference was that the participants in Study 2 estimated the frequencies of emotions separately for the first and second week of the diary period. In analogy to Hintzman and Block’s (1971) seminal experiment on frequency judgments for separate word lists, this made it possible to test whether information about the frequency of emotions is stored directly or indirectly in memory. If frequencies of emotions are stored directly in memory, participants should be unable to provide accurate frequency estimates separately for the first and second week. However, if frequency estimates are based on the activation of multiple memory traces, and if memory traces of particular time periods can be selectively activated, then participants should be able to judge accurately the frequencies of their emotions in the first and the second week.
Study 2 also provided an opportunity to replicate several of the findings in Study 1, namely to test (a) the discriminative accuracy of frequency judgments of emotions across emotions and participants, (b) the presence of personality- and mood-congruent biases, and (c) the unexpected decrease of vague quantifier ratings over repeated measurements.
The participants were undergraduate students at the Free University Berlin who took part in a course on emotions in everyday life. Eighty participants (24 men and 56 women) with a mean age of 25 completed all data collections.
4.1.2 Material and Procedure
4.1.2.1 Daily Estimates
At the core of Study 2 was the two week diary period. Participants rated the frequency of 34 emotions twice a day, which probably provides for a better estimation of the actual frequencies of emotions than the end-of-day judgments in Study 1. The response format for the daily estimates differed from Study 1. Whereas in Study 1 a free response format was used, the participants in Study 2 used a seven-point scale, ranging from 0 to 6, for their absolute estimates. All categories of this scale represented absolute frequencies (i.e. 0=never, 1=once, 2=twice, etc.) except the last category, which comprised all absolute frequencies greater than five. For most emotions, however, this category was used infrequently, so that the sum of the twice-daily estimates approximates the actual frequencies of emotional experiences during the diary period. Furthermore, the absolute level of the frequencies is less relevant in Study 2, because participants did not make absolute estimates before and after the diary period. Therefore, absolute and relative accuracy were not tested in Study 2.
In Study 1, participants had been asked to return questionnaires on a daily basis to ensure that the ratings were completed daily. This procedure was not feasible in Study 2 because students in Berlin do not live “on campus” and many students do not visit the university each day. Therefore, the report forms for the daily ratings were given to the participants in the form of two booklets; one for each week, so that at least the weekly completion of the report forms could be controlled. Afterwards, the twice-daily ratings were averaged across the repeated assessments to obtain a measure of the actual frequencies with which emotions were experienced.
4.1.2.2 Vague Quantifier Ratings
The vague quantifier ratings were always made for the time period of one week (How often did you experience joy in the last week?). The questionnaire included the 34 emotions that were on the daily report form, although in a different order, to discourage participants from using a stereotyped response pattern when they made the post-diary judgments. Judgments were made on the following scale: 0 = never, 1 = very rarely, 2 = rarely, 3 = sometimes, 4 = often, 5 = very often, and 6 = nearly always. Even though the daily report form and the vague quantifier ratings used a rating scale from 0 to 6, participants could not use the modal response on the daily form to make accurate vague quantifier ratings, because the same numeric category has different meanings in the two questionnaires. For example, a participant who always checks the response category “1” on the daily form as the frequency of his experiences of hate would report experiencing hate 14 times a week; this means experiencing hate quite often and not, as a vague quantifier rating of “1” would mean, “very rarely.”
The vague quantifier ratings were made four times: two times prior to the diary study with a two week interval between the two assessments, and two times immediately after the diary study to make separate judgments of the first and the second week of the diary period.
4.1.2.3 Mood Questionnaire
Current mood was assessed with the BASTI (Schimmack, in press). The BASTI is the German counterpart of the ELMI used in Study 1 (see 184.108.40.206). The BASTI was completed after each administration of the vague quantifier ratings.
4.1.2.4 Personality Questionnaires
Extraversion and neuroticism were assessed two weeks after the diary study with the NEO-FFI (Borkenau & Ostendorf, 1991). In this questionnaire, neuroticism and extraversion are assessed with 12 items each. The NEO-FFI is a short version of the NEO-PI-R (Costa & McCrae, 1992) used in Study 1. The NEO-FFI is somewhat less reliable than the NEO-PI-R, but the reliability is still good (Borkenau & Ostendorf, 1991).
Because the participants in Study 2 did not estimate absolute frequencies before or after the diary study, it was not possible to test the absolute or relative accuracy of frequency judgments in this study.
4.2.1 Discriminative Accuracy across Emotions
As in Study 1, the discriminative accuracy across emotions was assessed at the group and at the individual level. For the analysis at the group level, the actual frequencies and the vague quantifier ratings of the 34 emotions were averaged across the 80 participants. Then, the correlations of the actual frequencies with the two pre-diary and the averaged post-diary estimates were computed across the 34 emotions. All correlations were very high and statistically significant (pre-diary 1 r = .88, pre-diary 2 r = .92, averaged post-diary r = .96); a trend toward higher accuracy after repeated assessments is also apparent.
To test discriminative accuracy at the level of each participant, the same correlations were computed for each participant. Subsequently, the correlation coefficients were used as dependent variables in an ANOVA with the within-subject factor time of judgment. This analysis revealed a significant main effect, F(2,158) = 151.88, p < .01. Follow up analyses revealed that all three mean correlations differed significantly from each other (Figure 9).
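The computation of discriminative accuracy at the level of each participant can be sketched as follows. The data are hypothetical (two participants, four emotions), and the variable names are illustrative only; in the study, each correlation would be computed across the 34 emotions and for each of the three times of judgment.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: one value per emotion for each participant.
actual = {"p1": [2, 4, 6, 8], "p2": [5, 3, 1, 0]}    # diary-based frequencies
ratings = {"p1": [1, 2, 3, 4], "p2": [4, 3, 1, 0]}   # vague quantifier ratings

# One accuracy coefficient per participant, computed across emotions.
within_person_r = {pid: pearson(actual[pid], ratings[pid]) for pid in actual}
mean_r = sum(within_person_r.values()) / len(within_person_r)
print(within_person_r, round(mean_r, 2))
```

The resulting per-participant coefficients, one for each time of judgment, are the scores that enter the within-subject analysis as the dependent variable.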
The increased accuracy from the first to the second pre-diary ratings is especially important, because both ratings do not cover the period during which the actual frequencies of emotions were assessed. Therefore, the effect can be more easily attributed to an increase in people’s ability to discriminate the frequencies of emotions. Furthermore, the increased accuracy from pre- to post-diary ratings replicates the finding of Study 1; and it probably reflects an influence of the participation in the diary study on the discriminative accuracy across emotions. In sum, the results largely replicate the finding of Study 1 that frequency judgments of emotions possess discriminative accuracy across emotions which increased over repeated assessments.
4.2.2 Discriminative Accuracy across Participants
To estimate the discriminative accuracy across participants, the two post-diary estimates were averaged (separate analyses for each of the two diary weeks are reported later). As in Study 1, the actual frequencies were correlated with the pre- and the post-diary ratings.
Figure 10 shows that the frequency judgments of emotions in general possessed discriminative accuracy across participants (the separate correlations for each of the 34 emotions can be found in Appendix 5).
However, follow-up tests indicated that all three mean correlations differed significantly from each other. Study 2 also replicated the finding that pre-diary estimates are a better predictor of actual frequencies of emotions with higher temporal stability. However, this finding was only supported for the second pre-diary judgments (r = .49, p < .01), but not for the first pre-diary judgments (r = .10, p = .59). This pattern of results indicates that pre-diary judgments attenuate the discriminative accuracy of frequency judgments because they cover a different time period than the one during which the actual frequencies of emotions were assessed. Nevertheless, the high accuracy of the post-diary judgments is probably due to the participation in the diary study and overestimates the discriminative accuracy across participants under natural conditions.
Next, the discriminative accuracy across participants in Study 2 was quantitatively compared to the one in Study 1 for the 18 emotions that were included in both studies (all emotions of Study 1 except hurt and worry). These analyses revealed a very similar degree of accuracy in both studies (Study 1: pre-diary r = .39; post-diary r = .58; Study 2: second pre-diary r = .36; post-diary r = .58), which did not differ significantly from each other, ts < 1.50, ps > .20. Furthermore, it was explored whether emotions that revealed high discriminative accuracy across participants in one study also revealed high discriminative accuracy across participants in the other study; this was, however, not true (pre-diary r = .09, post-diary r = .22, both ps > .10). Therefore, at present it is not possible to recommend emotions that guarantee a high degree of discriminative accuracy across participants.
As in Study 1, the specificity of the frequency judgments was investigated in that the correlation between the actual frequencies of an emotion with the frequency judgment of this emotion was compared to the correlations with the frequency judgments of all other 33 emotions. Specificity was established if the correlation with the same emotion was higher than any of the other 33 correlations. For the post-diary estimates this was the case for all emotions except discontentment (see Appendix 6). However, for the two pre-diary estimates, only about half of the emotions revealed specificity (1st pre-diary estimates N = 16, 2nd pre-diary estimates N = 15 out of 34). Although this number is still significantly different from chance (expected value N = 1, both χ2s > 200, ps < .01), it indicates lower specificity for the pre-diary estimates.
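The specificity criterion can be expressed compactly in code. The sketch below uses hypothetical data for three emotions and five participants; the function name `specific_emotions` and all values are illustrative. An emotion counts as specific if the correlation between its actual frequency and its own frequency judgment exceeds the correlations with the judgments of every other emotion.

```python
def pearson(x, y):
    """Pearson product-moment correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def specific_emotions(actual, judged):
    """Emotions whose same-emotion correlation exceeds all cross-emotion ones."""
    result = []
    for emotion, frequencies in actual.items():
        same = pearson(frequencies, judged[emotion])
        others = [pearson(frequencies, judged[other])
                  for other in judged if other != emotion]
        if all(same > r for r in others):
            result.append(emotion)
    return result

# Hypothetical data: one value per participant (five participants).
actual = {"joy": [1, 4, 2, 6, 3], "anger": [3, 1, 5, 2, 4], "pride": [2, 2, 4, 1, 6]}
judged = {"joy": [1, 3, 2, 5, 3], "anger": [2, 1, 4, 2, 4], "pride": [1, 2, 3, 1, 5]}

specific = specific_emotions(actual, judged)
print(specific)  # ['joy', 'anger', 'pride']
```

If the judgments carried no emotion-specific information, each of k emotions would have a chance of 1/k of showing the highest correlation with its own judgment, so that only about one emotion would be expected to meet the criterion by chance, regardless of k.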
A similar effect had not been observed in Study 1. A possible explanation for this difference could be that Study 1 included more similar emotions. Another explanation would be the longer delay between pre-diary ratings and the assessment of the actual frequencies, which renders it more likely that the actual frequencies changed between the time periods covered by the pre-diary judgments and the time period during which the actual frequencies were assessed.
In sum, Study 2 closely replicated the findings of Study 1 that frequency judgments of emotions possess moderate discriminative accuracy across participants. The size of the correlations between pre-diary and post-diary estimates with actual frequencies was very similar in both studies in the range from r = .30 to r = .60. Furthermore, frequency judgments of emotions in Study 2 were quite often specific for each emotion, indicating that the respondents used different information for frequency judgments of each emotion. As in Study 1, this finding demonstrates that discriminative accuracy across participants is not due to response sets.
4.2.3 Exploration of Mood- and Personality-Congruent Biases
The analysis follows closely the procedure in Study 1. First, the 10 specific mood dimensions of the BASTI were submitted to a factor analysis and the factor scores of the first two unrotated factors were retained for further analyses. Replicating Study 1, the factor scores of the first factor were highly negatively correlated with the directly measured pleasure-displeasure dimension (r = -.81, p < .01), and the factor scores of the second factor were moderately correlated with the directly measured arousal dimension (r = .52, p < .01). To facilitate the interpretation, the factor scores of the first factor were inverted so that positive values indicate pleasure. The frequency estimates of pleasant and of unpleasant emotions were averaged to reduce the number of analyses. Emotion words denoting “mixed feelings” (e.g., sympathy) were dropped at this step. Next, multiple regression analyses were carried out, in which the post-diary vague quantifier ratings were regressed onto the actual frequencies of emotions, neuroticism, extraversion, current pleasure, and current arousal. Table 10 shows that the actual frequencies of emotions were the best predictor. The only additional significant effect was that current arousal predicted frequency estimates of pleasant emotions. However, current arousal did not predict vague quantifier ratings of pleasant emotions in Study 1; although it predicted absolute estimates. Similarly, Study 2 did not replicate the personality-congruency effects obtained for vague quantifier ratings in Study 1.
Finally, the simple correlations between the pre- and post-diary vague quantifier ratings and the five predictors of the previous analyses were compared to each other. If participation in a diary study attenuates biases, the simple correlations between personality and mood measures and pre-diary frequency judgments should be higher than those for the averaged post-diary estimates. Table 11 shows that only extraversion was more highly correlated with both pre-diary judgments than with the post-diary judgments. However, extraversion revealed no higher correlations with pre-diary estimates in Study 1. Hence, the present studies do not support the hypothesis that participation in the diary study reduces the influence of personality- or mood-congruent biases on frequency judgments of emotions.
In sum, only actual frequencies were consistently related to frequency judgments of emotions in studies 1 and 2. Personality and mood effects were not consistent across studies or response formats in Study 1. Although the present results do not indicate that personality- or mood-congruent effects do not exist, especially given the fact that other studies found at least personality-congruent biases (Diener et al., 1984; Feldman Barrett, in press), the studies clearly show that biases are small relative to the amount of accuracy in frequency judgments. This conclusion is in agreement with the previous studies, in which the bias was also small compared to the accuracy in the retrospective judgments (Diener et al., 1984; Larsen, 1992; Feldman Barrett, in press).
4.2.4 Accuracy of Separate Estimates of the First and Second Diary Week
The next analyses go beyond a simple test of the accuracy of frequency judgments of emotions, and start to investigate the cognitive processes underlying frequency judgments of emotions. The participants in Study 2 were asked to provide separate frequency estimates for the first and second week of the diary period. If frequency information is stored directly in memory and frequency judgments are simply based upon the retrieval of this prestored information, participants should not be able to make accurate frequency judgments for the two separate weeks. In contrast, if frequency information is stored indirectly in memory (e.g., in the form of multiple episodes) and frequencies are computed only at the time of the judgments, the judgments should accurately reflect differences between the frequencies of emotions in the two weeks, because contextual cues can be used to activate memory traces of specific time periods (Hintzman & Block, 1971).
First, the daily estimates were averaged separately for the first and second week. Then, a set of hierarchical regression analyses was computed to determine three components of the shared variance between the actual frequencies in the two weeks and the vague quantifier ratings of one week: (a) the variance that is uniquely explained by the actual frequencies of emotions in the first week, (b) the variance that is uniquely explained by the actual frequencies of emotions in the second week, and (c) the variance that is shared by the two predictor variables. Figure 11 illustrates how the shared variance was decomposed into these three components. The amount of shared variance is simply the total amount of explained variance minus the two unique variances ($R^2_{\text{total}} = R^2_{\text{unique week 1}} + R^2_{\text{unique week 2}} + R^2_{\text{shared}}$). This decomposition of the overall amount of explained variance is possible because all variables were positively intercorrelated so that suppression effects can be ruled out.
If participants are able to make accurate frequency estimates separately for the two weeks, entering the judged week in the second step should produce a higher increase in explained variance than entering the week that was not the target of the judgment (i.e., for ratings of week 1, $R^2_{\text{unique week 1}} > R^2_{\text{unique week 2}}$; for ratings of week 2, $R^2_{\text{unique week 2}} > R^2_{\text{unique week 1}}$). Furthermore, if participants discriminate frequencies of emotions in the two weeks perfectly, adding the non-target week in the second step should not increase the amount of explained variance (i.e., for ratings of week 1, $R^2_{\text{unique week 2}} = 0$; for ratings of week 2, $R^2_{\text{unique week 1}} = 0$). Finally, if participants are better able to estimate the frequencies of emotions in the more recent second week, the unique variance explained by the target week should be higher for estimates of week 2 than for estimates of week 1 (i.e., $R^2_{\text{unique week 2}}$ for ratings of week 2 $>$ $R^2_{\text{unique week 1}}$ for ratings of week 1).
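The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are invented, the function names `pearson` and `decompose` are hypothetical, and the two-predictor R² is computed from the three pairwise correlations rather than from the stepwise regression output that was actually used. With two predictors, both routes give the same unique and shared components.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def decompose(rating, week1, week2):
    """Split the variance in `rating` explained by week1 and week2
    into two unique components and one shared component."""
    r_y1 = pearson(rating, week1)
    r_y2 = pearson(rating, week2)
    r_12 = pearson(week1, week2)
    # R^2 of the regression of rating on both weeks (two-predictor formula)
    r2_total = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    # Unique variance = increase in R^2 when the predictor is entered last
    unique1 = r2_total - r_y2**2
    unique2 = r2_total - r_y1**2
    shared = r2_total - unique1 - unique2
    return {"total": r2_total, "unique_week1": unique1,
            "unique_week2": unique2, "shared": shared}

# Invented example: actual frequencies of one emotion for 8 participants
week1 = [2, 5, 3, 8, 6, 1, 7, 4]
week2 = [3, 4, 2, 7, 7, 2, 5, 5]
rating_week1 = [1, 3, 2, 5, 3, 0, 4, 2]   # ratings targeting week 1

print(decompose(rating_week1, week1, week2))
```

In this invented example the ratings track week 1 more closely than week 2, so the unique component for week 1 exceeds that for week 2, which is the pattern the first prediction describes.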
Figure 12 shows the amount of the three variance components averaged across the analyses of each of the 34 emotions (see Appendix 7 for the results of the individual analyses). Most important are the findings that (a) the actual frequencies in week 1 uniquely explained more variance in vague quantifier ratings of week 1 than the actual frequencies in week 2, t(33) = 5.24, p < .01, and (b) the actual frequencies in week 2 uniquely explained more variance in vague quantifier ratings of week 2 than the actual frequencies in week 1, t(33) = 3.11, p < .01. This confirms the prediction of the familiarity model that people are sensitive to frequencies of emotions in different contexts.
In addition, the amount of variance uniquely explained by the target week (e.g., actual frequencies in week 1 for ratings of week 1) did not differ between ratings of week 1 and ratings of week 2, t(33) = 1.15, p = .25. There were also no significant differences between the ratings of the two weeks in the variance uniquely explained by the actual frequencies in the non-target week (e.g., actual frequencies in week 1 for ratings of week 2), t(33) = 0.66, p = .52, or the amount of shared variance, t(33) = 0.60, p = .56. This pattern of results shows that the participants were equally able to make accurate frequency estimates for the first and the second week of the diary study, and that the accuracy of the frequency judgments for the remoter first week was as good as for the more recent second week. Finally, the amount of variance uniquely explained by the non-target week was significantly different from zero for ratings of both weeks, both Fs(1,33) > 20.00, both ps < .01. This finding indicates that participants are not perfect in discriminating the frequencies of emotions experienced in the two weeks: Ratings of one week were also influenced by actual frequencies in the other week. This influence is, however, relatively small (see Figure 12).
In sum, the findings show a high ability of the participants to detect changes in the frequencies of their emotional experiences from week 1 to week 2. This finding is particularly noteworthy because it replicates an experimental finding (Hintzman & Block, 1971) under natural conditions over a much longer interval between the encoding of stimuli (i.e. the experience of emotions) and the moment when the frequency judgments were made. Furthermore, it demonstrates this effect for the first time with regard to discriminative accuracy across participants. Finally, the present design provides for a strong test of the hypothesis of context sensitivity because the actual frequencies in the two weeks were highly correlated (mean r = .71). Therefore, the changes in the emotion frequencies from one week to the other were relatively small. Nevertheless, the participants detected these changes, a finding that contradicts direct encoding models of frequency information. It also speaks against the hypothesis that frequency judgments of emotions are based on generalized beliefs or are pre-stored in memory.
4.3 Additional Analyses
4.3.1 Repeated Assessment of Vague Quantifier Ratings
In Study 1 the mean of the vague quantifier ratings decreased from pre- to post-diary judgments, whereas the absolute estimates increased. In Study 2 the participants made vague quantifier ratings two times prior to the diary study. Therefore, it could be explored whether the decrease also occurs when participants do not participate in a diary study between the two ratings. Vague quantifier ratings at each measurement point were first averaged across participants. Then, a repeated-measures ANOVA was computed across emotions with the within-subject factor time. A strong effect was obtained, F(3,33) = 62.50, p < .01, partial ε2 = .65.
Follow-up analyses indicated that the mean decreased from the first to the second assessment and again to the post-diary assessment. The means of the two post-diary ratings were practically identical (Figure 13). Next, a prediction of Haubensak’s consistency model was tested, namely that decreasing means should be paralleled by a better approximation of the distribution of the actual frequencies of emotions, which should be positively skewed. The data support this prediction: The skewness of the actual frequencies was 1.50. The skewness of the first vague quantifier ratings was 0.45. It increased to 0.57 for the second pre-diary judgments, and once more to 0.91 and 0.93 for the two post-diary ratings.
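For readers who want to reproduce this kind of comparison, the following sketch computes a skewness coefficient for two distributions. It assumes the common moment-based definition (third central moment divided by the 1.5th power of the second); the text does not say which formula was used, and the numbers below are invented for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented data: mean actual frequencies of seven emotions and the
# corresponding mean vague quantifier ratings
actual_frequencies = [0, 1, 1, 2, 3, 5, 14]
quantifier_ratings = [1, 2, 2, 3, 3, 4, 6]

print(skewness(actual_frequencies))
print(skewness(quantifier_ratings))
```

Here the ratings are less positively skewed than the actual frequencies, which corresponds to the situation at the first assessment.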
In sum, the decreasing mean of vague quantifier ratings in Study 1 has been replicated. The finding has been extended by showing an additional decrease from a second to a third assessment. Furthermore, additional analyses suggest that the consistency model (Haubensak, 1994) is a promising candidate for a theory that can explain this effect. Future research should try to test the consistency model more directly. A better understanding of the effect seems to be highly desirable because the decreasing mean of vague quantifier ratings has important practical implications (Knowles et al., 1996). Most importantly, it makes the interpretation of changes in pre-post designs (e.g. in therapy evaluation studies) very difficult. A better understanding of the effect might help to develop measurement instruments that minimize this effect.
4.3.2 Interrelations between the Frequencies of Pleasant and Unpleasant Emotions
In Study 1 it was found that the relation between frequency estimates of pleasant and unpleasant emotions depended on the response format: The absolute format produced positive correlations whereas the vague quantifier ratings produced negative correlations. In Study 1 the absolute estimates were made using a free response format. This response format might be especially susceptible to an extremity response style. In contrast, the participants in Study 2 made (daily) absolute estimates on a predefined response scale which limited the range of responses. Therefore, the positive correlation should disappear, if it were simply due to the free response format. However, despite the modified procedure used in Study 2, it replicated the finding of a high positive correlation between absolute frequency estimates of pleasant and unpleasant emotions (Table 12). In contrast, the significant negative correlations obtained for the vague quantifier ratings in Study 1 were not replicated: The two pre-diary estimates revealed non-significant correlations close to zero, whereas the post-diary estimates produced low, but significant positive correlations.
As in Study 1, regression analyses were carried out to test the prediction that vague quantifier ratings of pleasant (unpleasant) emotions are influenced by the actual frequencies of unpleasant (pleasant) emotions. These analyses in general replicated the results of Study 1 (Table 13). Besides high correlations with the actual frequencies of emotions of the same valence, the vague quantifier ratings also show negative correlations to the actual frequencies of emotions of the opposite valence.
As in Study 1, actual and estimated frequencies were averaged across all emotions to test whether people who experience on average more emotions also used higher vague quantifiers. Again, positive correlations were obtained: the correlations were r = .36 for the first pre-diary judgments, r = .47 for the second pre-diary estimates, and r = .64 for the averaged post-diary estimates. Especially for the post-diary estimates the correlation is higher than in Study 1 (r = .45). This finding indicates that vague quantifier ratings partly reflect percentages and partly absolute frequencies of emotions. It seems that the participants in Study 2 used the vague quantifiers more to express absolute frequencies.
In sum, the results of Study 2 replicated a positive correlation between frequencies of pleasant and unpleasant emotions, and the finding that vague quantifier ratings mask this positive correlation because participants partly use vague quantifiers to indicate percentages and only partly to indicate absolute frequencies of emotional experiences. As noted before, this finding undermines the empirical support of prevailing theories of individual differences in emotional experiences (Costa & McCrae, 1992; Watson et al., 1988).
Study 2 replicated several findings of Study 1: The discriminative accuracy across emotions was high, the discriminative accuracy across participants was in the same moderate range (r = .30 to .60), and the mean of vague quantifier ratings decreased over repeated assessments. On the other hand, the few personality- and mood-congruent biases obtained in Study 1 were not replicated, suggesting that these biases are not an important factor in frequency judgments of emotions. Study 2 also replicated the finding of a positive correlation between the frequencies of pleasant and unpleasant emotions, which, this time, was even supported by the post-diary vague quantifier ratings. Most importantly, Study 2 provided a first test between direct and indirect frequency judgment models.
Study 2 showed that participants were able to estimate accurately the frequency of emotions in two separate weeks, even after controlling for the highly correlated frequencies in the other week. This finding is incompatible with direct encoding models of frequency information, which assume that frequency information is constantly updated (cf. Hintzman & Block, 1971). According to these models, it should be impossible to estimate frequencies in a remoter time period independently from the frequencies in the recent past. Contrary to this prediction, participants were sensitive to differences in the emotion frequencies in the two weeks. This result suggests that the information about the frequency of emotional experiences is indirectly stored in memory, probably in the form of multiple memory traces of emotional episodes. Study 2, however, does not allow one to differentiate between the different indirect encoding models (see Figure 2). To this end, the next two studies were carried out.
5 STUDY 3
The aim of studies 3 and 4 was to study the cognitive processes underlying frequency judgments of emotions under more controlled conditions. To do so, real emotional experiences investigated in the previous studies were replaced by emotional reactions that participants would experience in hypothetical scenarios. That is, participants indicated their likely emotional reactions to a number of hypothetical scenarios. Subsequently, the number of times participants indicated that they would feel joy, anger, or gratitude was used as the measure of the actual frequencies of emotions. The advantage of this new way to determine the actual frequency of emotions is that actual frequencies can be objectively determined. This avoids the problem of the previous studies that the measure of the actual frequencies was based on frequency judgments. The disadvantage of this paradigm is clearly that the frequency judgments concern only hypothetical situations and not real life emotional events. Nevertheless, the approach is similar to the strategy of experimental psychologists to study frequency judgments of words denoting natural objects and to assume that the results generalize to frequency judgments of objects or events in real life (cf. Hasher & Zacks, 1984).
The use of hypothetical scenarios as stimulus material allows one to manipulate to a certain degree the frequencies of “emotions,” because people’s emotional reactions to some kinds of situations (e.g., death of a loved one) are fairly universal (cf. Mesquita & Frijda, 1992). However, there are also individual differences in emotional reactions to the same situations, because people appraise situations differently (cf. Lazarus, 1991; Reisenzein & Hofmann, 1993). Therefore, the manipulation of emotion frequencies in the present studies is less reliable than the manipulation of the frequency of natural objects in previous experimental studies (e.g., Greene, 1989). For example, not everybody feels angry if he or she has to wait in line, but everybody agrees that a banana is a fruit. This “weakness” of the present approach can also be a strength, when researchers want to study individual differences in the processing of emotional information under controlled conditions (cf. Schimmack & Hartmann, in press). In the present context, however, individual differences constitute error variance, in accordance with experimental studies of frequency judgments in general (cf. Naveh-Benjamin & Jonides, 1986).
The standard paradigm used in the following two studies consists of an initial scenario rating task (SRT) and a subsequent frequency judgment task (FJT). In the SRT, participants indicated for several scenarios which emotions they would experience if they were in the described situation. These ratings allow the researcher to determine the actual frequencies of emotions, that is, the number of times that a respondent indicated that he or she would have experienced an emotion in a scenario. In the FJT, participants have to estimate how often they would have experienced various emotions in the situations of the SRT. These judgments can then be compared to the actual frequencies as defined above. As the following experiments were carried out on a personal computer, it was also possible to measure the judgment times of the frequency judgments.
In addition to the SRT and the FJT, other tasks can be added to explore the cognitive processes underlying frequency judgments of emotions. In the following two studies, a latency-of-retrieval task (LRT) was added (Fitzgerald et al., 1988; MacLeod et al., 1994). In this task, participants were asked to recall, as fast as possible, one of the scenarios in which an emotion would have been experienced. This allows one to test the ease-of-retrieval model: If the ease-of-retrieval hypothesis is correct, higher frequency judgments should be related to shorter retrieval times (see hypothesis 6). Furthermore, the judgment times of the frequency judgments should be correlated with the retrieval latencies as well as with the size of the frequency judgments (hypothesis 7), and frequency judgments should take longer than the retrieval of scenarios from memory (hypothesis 8). In contrast, the familiarity model does not predict these effects.
Forty-eight undergraduate students at the Free University Berlin participated in the study for course credit.
Reisenzein and Hofmann (1993) asked 20 students at the Free University Berlin to report for each of 23 emotions one personal experience of this emotion. Subsequently, they asked a different sample of 51 participants to indicate which of the 23 emotions was most likely felt by the protagonist of each scenario. The authors found that for most scenarios, the target emotion (i.e., the emotion that triggered the reported scenario) was recognized by the majority of the participants. For the present study, 12 of the 23 target emotions investigated by Reisenzein and Hofmann were selected: anger, anxiety, contempt, disappointment, disgust, embarrassment, gratitude, jealousy, joy, love, pride, and sadness. For each target emotion the ten scenarios with the highest recognition rate of the target emotion were selected, yielding a total of 120 scenarios. Each scenario was about 10 to 60 words long. The following example is a description of an anger experience:
A while ago, I bought some apples at the supermarket, because they were so cheap. At home, I found out that they were already rotten inside. I thought: “And this supermarket always advertises with its fresh fruits.”
For the present study, the selected 120 episodes were split into four sets of 30 episodes and one of these four sets was presented to each participant. To manipulate the frequency of emotions between the four scenario sets, unequal numbers of scenarios of one target emotion were assigned to each set (see Table 14). It has to be noted, however, that this procedure allows only a rough manipulation of emotion frequencies between sets of scenarios because each scenario tends to elicit several emotions besides the target emotion (cf. Reisenzein, 1995). For example, jealousy scenarios often also elicit anger, disappointment, and sadness. Therefore, another aim of Study 3 was to determine the pattern of emotions that is elicited in each scenario. This would provide for a better manipulation of emotion frequencies in Study 4 and other future studies.
Although the scenarios elicited most strongly the 12 target emotions, it is likely that the scenarios also elicit various other emotions. To obtain ratings for a comprehensive list of emotions, 32 emotions were selected for the rating task. Besides the 12 target emotions, the 11 additional emotions studied by Reisenzein and Hofmann (1993) were also included, namely compassion, discontentment, envy, guilt, hope, helplessness, loneliness, regret, relief, shame, and surprise. Furthermore, contentment, depression, euphoria, hate, and hopelessness were included because they had been studied by Reisenzein (1995, Study 3; see also Schimmack & Reisenzein, in press) in related research. Rage was included upon request from participants in a small pilot study. Finally, the global descriptions “a pleasant feeling” and “an unpleasant feeling” were added, to investigate frequency judgments of broad categories compared to those of specific emotions. For the scenario rating task, the 12 target emotions were split into two sets of 6 emotions each. Similarly, the 20 remaining emotions, labeled non-target emotions, were split into two sets of 10 emotions each (see Appendix 8 for the assignment of emotions to sets of target and non-target emotions). Each participant received only one set of target emotions and one set of non-target emotions in the SRT. In sum, the SRT was divided between participants according to a three-factorial design, with four sets of episodes, two sets of target emotions, and two sets of non-target emotions (Figure 14). However, analyses are not based upon the 4 x 2 x 2 design because the number of participants in each cell of the design is too small (N = 3). The purpose of this design was that an equal number of participants rated the intensity of each emotion in each of the four sets of scenarios.
Because each scenario set was presented to 12 participants, and each participant rated the intensity of half of the target and half of the non-target emotions, in each set of scenarios six participants rated the intensity of the same emotion.
Rating Scale
In the SRT, participants were mainly asked to rate whether they would feel an emotion or not. However, the intensity of the emotional reactions appeared to be of interest as well, especially regarding other research questions (Schimmack & Hartmann, in press; Schimmack & Diener, in press). Therefore, the participants were also asked to indicate how intense their emotional reactions would be, provided that they experienced an emotion at all. Reisenzein (1995, Study 3) used two separate ratings to obtain independent information about the presence and intensity of an emotion. To simplify the judgment process, Schimmack (in press; Schimmack & Diener, in press) proposed to decompose a single rating on an intensity scale into information about the presence/frequency and intensity of an emotional reaction. One simply uses zero responses as information that an emotion is not experienced; then all non-zero responses indicate the experience of an emotion. That is, it is proposed to dichotomize intensity ratings into ratings equal to zero and ratings greater than zero as information about the absence versus presence of emotions. Summed across scenarios, the number of non-zero judgments for one emotion represents the actual frequency of this emotion. In the present study, a four-point intensity scale was used (0 = not at all, 1 = slightly, 2 = medium, 3 = high intensity).
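The dichotomization rule can be made concrete with a short sketch. The function names and the ratings below are hypothetical; only the rule itself (zero means absent, any non-zero rating means present) and the 0 to 3 scale are taken from the text.

```python
def actual_frequencies(ratings):
    """Count, per emotion, the scenarios with a non-zero intensity rating.

    ratings: dict mapping an emotion label to a list of intensity ratings
             (0-3), one rating per scenario.
    """
    return {emotion: sum(1 for r in values if r > 0)
            for emotion, values in ratings.items()}

def mean_intensity_when_present(values):
    """Average intensity across the scenarios in which the emotion occurred."""
    present = [r for r in values if r > 0]
    return sum(present) / len(present) if present else None

# Invented ratings of one participant for five scenarios
ratings = {
    "anger": [0, 2, 3, 0, 1],
    "joy":   [0, 0, 0, 3, 0],
}

print(actual_frequencies(ratings))                    # frequency information
print(mean_intensity_when_present(ratings["anger"]))  # intensity information
```

A single rating per emotion and scenario thus yields both pieces of information: how often an emotion occurred and how intense it was when it occurred.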
The experiment comprised several tasks that were implemented in a single computer program. First, the participants judged how intensely they would experience a selection of 16 emotions in one of the four sets of 30 scenarios. Afterwards, they estimated how frequently they would have experienced each of the 32 emotions in the set of scenarios of the SRT. That is, they made frequency judgments for the 16 emotions included in their SRT, labeled salient emotions, plus the 16 emotions that were not presented in their SRT, labeled non-salient emotions. Subsequently, in a latency-to-retrieve task, the participants recalled, as quickly as possible, one scenario in which they would have experienced the emotion that was presented as a retrieval cue. To reduce the length of the already strenuous experiment, only the 12 target emotions (Table 14) were used as retrieval cues. Because each participant had six target emotions in his or her SRT, this guaranteed that each participant retrieved scenarios for six salient and six non-salient emotions.
Scenario Rating Task
In the SRT, the participants were asked to imagine being in the described situation and to indicate how they would feel in the situation. For each emotion the participants were asked to consider first whether they would feel this emotion or not. Only if they would feel the emotion were they to consider the intensity of the emotion. After having read the instructions, participants pressed the return key to start the scenario rating task. The scenarios were displayed in the upper half of the screen and could be studied by the participants as long as they wanted. When the participants had sufficiently studied the scenario, they pressed the return key to start the rating task. After pressing the return key, the rating scale was displayed below the scenario description, which remained on the screen. The rating scale was split into two parts, with the zero category on the left and all remaining categories on the right side, to increase the salience of the difference between zero and non-zero responses. Between the scenario description and the rating scale, the sentence “In this situation I would have felt …” followed by each of the 16 emotion words was displayed. The participants indicated their likely emotional reaction by pressing the appropriate number on the keypad. If they made an error, they could repeat the last entry using a special correction key. After all 16 emotions had been rated, the next scenario was displayed. New random sequences were generated by the computer for the 30 scenarios for each participant and for the 16 emotions for each scenario. The computer also measured the judgment times from the display of each emotion word to the intensity rating.
Frequency Judgment Task
After the SRT was completed, the participants were surprised with additional instructions that they would now be asked several questions concerning the scenarios presented in the SRT. Their first task, after the SRT, would be to estimate the absolute frequency with which various emotions had occurred in the previous episodes. For example, if they had rated anger to be present in five of the scenarios (i.e. if they had made five non-zero-ratings in the SRT), then five would be the correct answer. Participants were not informed that they read 30 scenarios; therefore it was possible that participants made frequency judgments greater than 30. The participants were also informed that they should estimate the frequency of salient and non-salient emotions and that judgments of non-salient emotions (i.e. those not included in the previous SRT) are meaningful because they could have been elicited in the scenarios even though the participants did not have to rate their intensity. For example, although a participant might not have been asked to rate the presence and intensity of disgust in the SRT, the assignment of scenarios to the four sets of scenarios made it very likely that each participant would have indicated to experience disgust at least once (see Table 14). The participants were also informed that the frequency judgments had to be made within 10s, and that the next item would be presented automatically if they exceeded this time limit. A pilot study had shown that participants exceeded this time limit very rarely.
After reading the instructions, participants pressed the return key to start the computer-paced frequency judgment task. The 32 emotions were displayed in a different random sequence for each participant. With the display of each emotion word, a timer appeared also on the screen that counted the elapsed seconds. After 7s, the computer sounded a warning tone. After the participants had entered the first number, the computer recorded the time since the emotion word was presented. After entering the complete number, participants pressed the return key to continue with the task.
Latency-To-Retrieve Task
In the latency-to-retrieve task (LRT), the 12 target emotions appeared on the screen in a new random order for each participant. For each emotion, the participants were asked to recall, as quickly as possible, a scenario from the SRT. When they recalled one, they pressed the return key. Subsequently, they entered a keyword to describe the recalled scenario (e.g., apple for the anger scenario quoted above). If a participant did not recall a scenario within 10s, the next emotion word was automatically presented on the screen. The computer recorded the retrieval latency from the presentation of an emotion word to the pressing of the return key.
5.2.1 Preliminary Analyses
First, the actual frequencies of emotions in each of the four sets of scenarios were determined. To do so, the number of times a participant made a non-zero rating in one of his or her 30 scenarios was counted. Due to the experimental design, six participants rated the same emotion for the same set of scenarios. The frequencies of these six participants were averaged to determine the actual frequency of an emotion in each set of scenarios (see Appendix 8).
Correlations of the frequencies of emotions between the four sets of scenarios revealed that the attempted experimental manipulation of the frequencies was not very successful because the frequencies were highly intercorrelated (Table 15). That is, emotions that were frequent in one set of scenarios also tended to be frequent in the other sets of scenarios.
5.2.2 Relative Accuracy
First, the number of times participants failed to make a frequency judgment within 10s was determined. This happened only 5 out of 1536 times. In these cases, the missing frequency judgment was set to zero and the missing judgment time was replaced by the maximum judgment time (10s).
As in Study 1, the relative accuracy of the absolute estimates was tested. Figure 15 shows the estimates of the 32 emotions plotted against the actual frequencies in each of the four scenario sets. As in Study 1, the regression slope of the estimates indicates that the actual frequencies were underestimated in all four sets of scenarios. The relative accuracy scores (estimated minus actual frequencies) for all four sets of scenarios are negative (Set 1 d = -6.29, Set 2 d = -2.41, Set 3 d = -4.20, Set 4 d = -3.96) and significantly different from zero (all Fs > 20.00, ps < .01), which shows the trend toward underestimation quantitatively.
As in Study 1, the actual frequencies of emotions were correlated with the relative accuracy score of each emotion. If frequent emotions are underestimated more strongly than infrequent ones, a negative correlation between the actual frequencies and the relative accuracy scores is expected. This prediction was confirmed in all four sets of scenarios (rs =-.88, -.61, -.72, -.80, all ps < .01). In sum, Study 3 replicated the finding in Study 1 that absolute estimates underestimate the actual frequencies of emotions and that they do so increasingly with increasing frequency of occurrence.
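The two steps of this analysis can be illustrated with a small sketch. It assumes that the relative accuracy score is the estimated minus the actual frequency, so that negative values denote underestimation (the reading under which negative scores and a negative correlation both indicate underestimation). The data and the function names are invented for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def relative_accuracy(estimated, actual):
    """Difference scores d = estimated - actual (assumed sign convention)."""
    return [e - a for e, a in zip(estimated, actual)]

# Invented mean frequencies of five emotions in one set of scenarios
actual = [20, 15, 10, 5, 2]
estimated = [10, 9, 8, 5, 3]

d = relative_accuracy(estimated, actual)
mean_d = sum(d) / len(d)          # negative mean: overall underestimation
r_actual_d = pearson(actual, d)   # negative r: stronger underestimation
                                  # for more frequent emotions
print(d, mean_d, r_actual_d)
```

In these invented data the frequent emotions are underestimated most, so both the mean difference score and its correlation with the actual frequencies are negative.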
Because no participant rated the intensity of all 32 emotions in the SRT, actual frequencies of all emotions were not available at the individual level. Therefore, only analyses at the group level were possible. For these analyses, the frequency judgments of those 12 participants who rated the same set of scenarios were averaged. The interrater agreement between the 12 participants in each condition was determined, using Shrout and Fleiss’s (1979) intra-class coefficient (ICC[2,k]). The interrater agreement for the four sets ranged from ICC[2,12] = .60 to .74. Furthermore, the estimated frequencies of emotions were correlated between the four sets of scenarios (Table 16), which is expected because the actual frequencies of emotions in the four sets of scenarios were also correlated (Table 15).
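For readers unfamiliar with the coefficient, the following is a minimal sketch of ICC[2,k] computed from the mean squares of a two-way layout (targets by judges). It assumes a complete ratings matrix with emotions as rows and participants as columns; the function name and the small example matrix are illustrative and are not the study data.

```python
def icc_2k(ratings):
    """ICC(2,k) for a complete matrix: rows are targets, columns are judges."""
    n = len(ratings)        # number of targets (here: emotions)
    k = len(ratings[0])     # number of judges (here: participants)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-targets mean square
    jms = ss_cols / (k - 1)                 # between-judges mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)


# Illustrative ratings: six targets rated by four judges.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

print(round(icc_2k(example), 2))
```

The coefficient refers to the reliability of the average of k judges, which is why it increases with the number of participants whose judgments are averaged.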
Table 17 shows the correlations between the actual and estimated emotion frequencies in the four sets of scenarios. First of all, the correlations are high, indicating general agreement between actual frequencies of emotions and frequency judgments. However, a stronger test of discriminative accuracy across emotions would require that the frequency judgments of participants who rated a particular set of scenarios are more highly correlated with the actual frequencies of this rather than a different set of scenarios. Table 17 shows that this was the case for all four sets of scenarios (note that this implies a comparison only along the rows in Table 17, but not necessarily also along the columns).
In sum, Study 3 replicates the finding of Study 1 that frequency judgments discriminate between actual frequencies of emotions. It was more difficult to demonstrate sensitivity to the particular frequencies in a specific set of scenarios. This was very likely due to the high correlations of the actual frequencies between sets of scenarios (Table 15) and the moderate interrater agreement of the frequency judgments (Table 16). A stronger test of sensitivity to experimentally manipulated frequencies of emotions seems desirable. This test was carried out in Study 4, which allowed a better manipulation of emotion frequencies on the basis of the SRT data obtained in this study.
5.2.4 Influence of the Salience of an Emotion on Frequency Estimates
As in Study 1, the effect of the salience of emotion concepts at the time of encoding on the subsequent frequency estimates was examined. In the present study, some emotions were salient, because they were included in the SRT, whereas others were not salient, because they were presented for the first time during the FJT. In the following analyses, the differences between scenario sets are ignored because the sets were varied orthogonally to the sets of emotion words. Therefore, differences in frequency judgments of salient and non-salient emotions cannot be attributed to the presentation of different scenarios. For each emotion, the frequency estimates of those 24 participants for whom the emotion was salient were compared to the frequency estimates of those 24 participants for whom the emotion was not salient (see Figure 14). In 31 of the 32 comparisons the frequency estimate was higher when the emotion was salient. In an analysis across emotions, the mean frequency estimate in the salient condition (M = 7.21, SD = 1.92) was significantly higher than the mean frequency estimate in the non-salient condition (M = 4.61, SD = 2.28), t(31) = 10.23, p < .01. This finding replicates the salience effect obtained in Study 1.
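The analysis across emotions can be sketched as a paired comparison in which each emotion contributes one mean estimate per condition. This reading is an assumption based on the reported degrees of freedom (32 emotions, df = 31). The data values and the function name are hypothetical.

```python
from math import sqrt


def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for x versus y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance of differences
    t = mean_d / sqrt(var_d / n)
    return t, n - 1


# Hypothetical mean frequency estimates for four emotions in each condition.
salient = [7.0, 9.0, 5.0, 8.0]
non_salient = [5.0, 6.0, 4.0, 4.0]

t, df = paired_t(salient, non_salient)
print(t, df)
```

Treating emotions rather than participants as the unit of analysis removes the large differences in overall frequency between emotions from the error term.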
This effect can also be seen in Figure 16. This figure also shows that, in contrast to Study 1, the slope of the regression line was not steeper in the salient condition. This implies that the salience condition produced higher estimates in general, but not for the more frequent emotions in particular. Indeed, the actual frequencies were not significantly correlated with a difference score between frequency estimates in the salient and non-salient condition, r = -.24, p = .19. The divergent findings could be due to (a) the fact that Study 1 comprised more participants, (b) the use of a within-subject design in Study 1 and a between-subject design in Study 3, or (c) the different manipulations of salience. This inconsistency should, however, not obscure the main finding of this analysis that frequency judgments were generally higher in the salient than in the non-salient condition. This finding was predicted in Hypothesis 5 and is inconsistent with direct encoding models of frequency information.
5.2.5 Testing the Ease-of-Retrieval Model
It has been proposed that people rely on the retrieval of exemplars to estimate frequencies (Tversky & Kahneman, 1973). One version of the retrieval-based models, the ease-of-retrieval model, was explicitly tested in this study in several ways. First, the relation between frequency judgments and retrieval times in the LRT was examined (cf. MacLeod et al., 1994). Second, the relation between the size and the speed of the frequency judgments was explored. And finally, the speed of frequency judgments was compared with retrieval latencies in the LRT. In the following analyses, the data were averaged across all 48 participants to increase the reliability of the variables. This was justifiable because of the high correlations of the actual frequencies between scenario sets (Table 15). The frequency estimates showed a high internal consistency, ICC[2,48] = .92, whereas the internal consistency of the averaged judgment times was only moderate, ICC[2,48] = .42. Contrary to the prediction of the ease-of-retrieval model, the judgment times were not significantly correlated with the size of the frequency judgments (r = .22, p = .22). The correlation was even slightly in the direction opposite to the one predicted by the ease-of-retrieval model. The direction of the correlation is more consistent with the recall-estimate model (see Brown, 1995); however, as it is not significant, it does not support this alternative retrieval-based model either.
For the next analysis, the retrieval latencies of the 12 target emotions were averaged across participants. The interrater agreement for these latencies was ICC[2,48] = .66. The latencies were then correlated with the frequency estimates across the 12 target emotions. The correlation with the retrieval latencies failed to be significant (r = .43, p = .17) and was again in the wrong direction. The positive correlation does not support the recall-estimate model either, because in the latency-to-retrieve task participants were asked to retrieve only one exemplar. Therefore, the longer retrieval times do not indicate counting of several exemplars (see Brown, 1995). Third, the average retrieval latencies were compared with the average times needed for the frequency judgment; for this analysis only the judgment times of the 12 target emotions were used, because only target emotions were used in the LRT. If frequency estimates are based on information about the speed of retrieval, frequency estimates should take at least as long as the retrieval of a single scenario. However, the mean retrieval latency (M = 3.56s, SD = 0.57) was longer than the mean time needed to make a frequency judgment (M = 3.16s, SD = 0.18), t(11) = 2.91, p < .05. This finding is incompatible with any retrieval-based model, whether the ease-of-retrieval or the recall-estimate model. In sum, the analyses support hypotheses 6 to 8: retrieval latencies and judgment times were independent of frequency judgments, and retrieval of a single exemplar required more time than the complete frequency judgment process.
The results of Study 3 replicated several earlier findings obtained in field studies of real emotional experiences. In studies 1 and 3 the actual frequencies of emotions were underestimated, especially those of frequent emotions. Nevertheless, the frequency judgments in both studies revealed discriminative accuracy across emotions (Study 3 did not allow a test of discriminative accuracy across participants). Finally, both studies showed that the frequencies of salient emotions were estimated to be higher than the frequencies of non-salient emotions. The only difference was that in Study 3 frequent emotions did not benefit more from the salience manipulation than infrequent emotions, which was the case in Study 1.
Besides replicating the results of the field study in a more controlled setting, Study 3 also provided several new findings. First, frequency judgments were not related to the times needed to make these judgments. Furthermore, retrieval of a single scenario needed more time than the complete frequency judgment process. This finding contradicts retrieval-based frequency judgment models, both the recall-estimate theory (Brown, 1995; Meudall, 1971; Watkins & LeCompte, 1991) and the ease-of-retrieval model (Schwarz et al., 1991). The finding that latencies in the LRT were unrelated to the frequency judgments was also contrary to the predictions of the ease-of-retrieval model. In sum, the direct encoding models cannot explain the salience effect, whereas the retrieval-based models cannot account for the speed of the frequency judgments. This leaves the familiarity model as the only model that is compatible with the present data.
Three possible objections can be raised against the present findings. First, the judgment and retrieval times might not have been accurately measured, which is suggested by the low consistency of the measures across participants. However, given that the retrieval times were measured by a computer to the nearest millisecond, it remains unclear how the participants themselves are able to distinguish differences in retrieval times more accurately than the computer, as such an ability is needed to use retrieval latencies as information about frequencies of emotions. Only if one notices that one retrieved a joy-scenario faster than an envy-scenario can one judge the frequency of joy to be higher than the frequency of envy. A second objection could be that the ease of retrieval is conceptually different from the speed of retrieval. That is, people do not base their judgment on the speed of retrieval but on a feeling of ease which is separate from and unrelated to the speed of retrieval. Although such a modified ease-of-retrieval model is logically possible, it would have to specify (a) how the feeling of ease is generated, and (b) why it is unrelated to the latency of retrieval. Straightforward answers to these questions are not in sight. Finally, one might object that the manipulation of emotion frequencies was only partly successful and that the high correlation across scenario sets was due to the fact that emotions which are frequent in everyday life were also frequent in all four scenario sets. As a consequence, the participants may have relied on generalized beliefs about the frequencies of different emotions when making their frequency estimates. However, the analyses of different sets of scenarios suggested that participants were also sensitive to differences in the frequencies of emotions between sets.
Nevertheless, a stronger demonstration of sensitivity to experimentally manipulated frequencies would be needed to rule out this hypothesis. This was attempted in Study 4, which also served to replicate the findings of Study 3.
6 STUDY 4
A main aim of Study 4 was to replicate the findings of Study 3. In addition, Study 4 was designed to examine how individual differences in a repressive way of coping influence the encoding and retrieval of emotion memories; the results bearing on this issue are reported elsewhere (Schimmack & Hartmann, in press; see 7.3.1 for a brief summary). As a consequence, all participants rated the same set of scenarios with regard to the same set of emotions. Furthermore, the selected scenarios elicited mainly unpleasant emotions. This had the advantage that the frequencies of emotions differed from the frequencies of emotions in everyday life and from the frequencies of emotions in Study 3. Hence, Study 4 provides a stronger test of participants’ sensitivity to experimentally manipulated frequencies of emotions.
Sixty-one undergraduate psychology students (14 male, 47 female) at the Free University Berlin participated in the study for course credit.
6.1.2 Material and Procedure
The SRT included 25 negative and 5 positive scenarios. Sixteen emotion words (13 unpleasant and 3 pleasant) were selected for the rating task. The rating scale was changed to a 7-point scale so that individual differences in the intensity of emotional reactions could be detected more easily, which was important for different research questions (cf. Schimmack & Diener, in press; Schimmack & Hartmann, in press). The response categories were labeled “not”, “very slightly”, “slightly”, “medium”, “strongly”, “very strongly”, and “extremely strongly” and were scored from 0 to 6. As in Study 3, participants were instructed that only a zero rating implies the complete absence of an emotion, whereas all remaining response categories imply its presence, although with different degrees of intensity. The frequency judgment task was identical to the one in Study 3 and the same 32 emotions were used. The LRT was identical to that used in Study 3. However, a different set of 10 emotions was used as retrieval cues, including five salient and five non-salient emotions.
6.2.1 Absolute Accuracy
In all following analyses the actual frequencies are based on the SRT ratings in Study 3. This had the advantage that Study 3 provided actual frequencies for all 32 emotions included in the frequency judgment task. Furthermore, the actual frequencies of salient and non-salient emotions are both based on ratings of a different group of participants. Absolute accuracy was determined separately for the salient and the non-salient emotions. A significant difference was obtained in that frequency judgments of salient emotions (mean SD = 8.09) were more accurate than those of non-salient emotions (mean SD = 9.41), t(60) = 4.99, p < .01. This finding replicates Study 1.
6.2.2 Relative Accuracy
Figure 17 shows that participants in Study 4 again underestimated the actual frequencies of emotions. Across all emotions, the relative accuracy was d = -5.99, which is significantly different from zero, F(1,31) = 54.97, p < .01. Figure 17 also shows that the frequencies of frequent emotions were underestimated more strongly than those of infrequent emotions. This is also evident in the correlation between actual frequencies and the relative accuracy score, r = -.90, p < .01. In sum, Study 4 replicates the previous finding that people underestimate the frequency of emotions and that they do so especially for frequent emotions.
6.2.3 Discriminative Accuracy across Emotions
One aim of Study 4 was to demonstrate that participants are sensitive to experimentally manipulated frequencies of emotions. Therefore, it is important to demonstrate that the selection of scenarios in Study 4 yielded frequencies of emotions that are independent of emotion frequencies in real life. For the 29 overlapping emotions between Studies 2 and 4, the correlation between the actual frequencies of emotions was r = -.21, p = .29. As a consequence, discriminative accuracy across emotions in the present study cannot be attributed to generalized beliefs about the frequencies of emotions.
For analysis at the group level, the frequency judgments of all participants were averaged. The correlation between actual frequencies and frequency estimates was r = .66, p < .01. This correlation is rather low, compared to the values in the previous studies. One explanation could be that the present analysis included salient and non-salient emotions. Figure 17 already shows that the frequency estimates of non-salient emotions were lower than those of the salient emotions. Therefore, salience can attenuate the present correlation. As a consequence, separate correlations were computed across the 16 salient and the 16 non-salient emotions. The correlation for the salient emotions was indeed higher, r = .85, p < .01, but the correlation across the 16 non-salient emotions was not, r = .63, p < .01. The difference between the two correlations also suggests that salience increased the discriminative accuracy across emotions.
The analysis at the individual level was carried out separately for the 16 salient and the 16 non-salient emotions. For the salient emotions the discriminative accuracy across emotions (mean r = .45) was significantly higher than the one for the non-salient emotions (mean r = .36), F(1,60) = 5.53, p < .05. Because the actual frequencies are based on ratings of a different sample, this finding suggests that salience also increased the discriminative accuracy across emotions.
6.2.4 Influence of the Salience of an Emotion on Frequency Estimates
The following analysis attempts to replicate the finding of Studies 1 and 3 that salience at the time of encoding increases frequency judgments. Figure 17 already suggests that this was also true in Study 4. In the present study, all participants rated the same emotions in the SRT. Therefore, the analysis had to be carried out across emotions. To control for any differences in the actual frequencies between salient and non-salient emotions, the actual frequencies of the emotions were used as a covariate. The analysis of variance revealed a highly significant effect of salience, F(1,29) = 23.61, p < .01. A comparison of the predicted means shows that participants judged the frequency of non-salient emotions to be lower (M = 5.18) than the frequency of salient emotions (M = 8.28). It was also tested whether salience boosted especially the frequency judgments of frequent emotions. Figure 17 already suggests that this was not true, because the slopes of the regression lines for salient and non-salient emotions were similar. To test this hypothesis quantitatively, a median split of the actual emotion frequencies was carried out. Then, frequency judgments were used as a dependent variable in an ANOVA with the factors actual frequency (high vs. low frequency) and salience.
If frequent emotions benefit more from the salience manipulation, the interaction should be significant, but the ANOVA does not confirm this prediction, F(1,28) < 1, p > .50. This finding is consistent with Study 3, where the same salience manipulation also failed to affect especially the frequent emotions, but it is inconsistent with Study 1, where participation in a diary study increased especially the frequencies of frequent emotions.
One explanation for this pattern of results is that the different salience manipulations influence different stages of the frequency judgment process (Brown, 1995). It might be that the participation in the diary study increased participants’ awareness of the absolute number of emotional experiences. Therefore, they converted the same familiarity signal into higher absolute frequencies than they did before the diary study. In contrast, rendering particular emotions salient does not influence the range of the absolute frequencies; it only boosts the familiarity signal of the salient emotions, which then receive higher absolute estimates for the same range of absolute frequencies as the non-salient emotions.
6.2.5 Testing the Ease-of-Retrieval Model
The interrater agreement for the frequency estimates was excellent, ICC[2,61] = .96. Because of the larger sample size, the judgment times were also more consistent across participants than in Study 3, ICC[2,61] = .61. Nevertheless, replicating the finding of Study 3, the size of the frequency judgments was unrelated to the speed of these judgments (r = -.19, p = .30).
The second test of the ease-of-retrieval hypothesis used the latencies in the LRT. As in Study 3, the retrieval latencies were averaged across participants. This time, the interrater agreement was excellent (ICC[2,61] = .95). The average latencies were then correlated with the frequency estimates across the 10 emotions included in the latency-to-retrieve task. The correlation with the frequency estimates was significant and consistent with predictions of the ease-of-retrieval model (r = -.76, p < .05). However, the next result indicates that this support is more apparent than real. As in Study 3, the average retrieval latencies were compared to the average speed of the frequency judgments. Again, the mean retrieval latency was significantly longer (M = 4.94, SD = 1.40) than the time needed to make a frequency judgment (M = 3.23, SD = 0.10), t(9) = 3.96, p < .01. Hence, it is not possible that the frequency estimates are based on retrieval processes. The finding that the speed of frequency judgment is unrelated to frequency judgments and that these judgments are made faster than the retrieval of scenarios contradicts the ease-of-retrieval hypothesis. It is instructive that the contradictory evidence was obtained concurrently with a high negative correlation between frequency judgments and latencies in the LRT. This finding demonstrates nicely that the significant negative correlations between frequency judgments and latencies in an LRT that were obtained in previous studies (MacLeod et al., 1994; Fitzgerald et al., 1988) do not indicate that the frequency judgments were based on the ease of retrieval.
It is also instructive to look at the speed of frequency judgments and retrieval times of the very rare emotion regret. On average, participants made a frequency judgment within 3.22s. In contrast, the average latency in the LRT was 8.13s. In addition, these long latencies are partly due to the responses of 42 participants, who were unable to retrieve a regret scenario within 10s (so that their retrieval latencies were set to 10s). These 42 participants were able to make a frequency judgment within the allotted time of 10s even though they could not recall a single scenario in this time limit. As a consequence, ease-of-retrieval cannot explain the frequency judgments of these participants, because the ease-of-retrieval hypothesis assumes that at least one exemplar was retrieved. One could try to rescue the ease-of-retrieval model and argue for a two-stage process. For example, a fast recognition process could inform the participant whether an emotion occurred at all. If so, an exemplar is actually retrieved from memory and the frequency is estimated following the ease-of-retrieval model. If the recognition signal suggests that no exemplar can be retrieved, the frequency judgment is zero. Such a two-stage model is not very parsimonious because Hintzman and Curran (1994) showed that recognition judgments are based on the same familiarity signal that is assumed to underlie frequency judgments. Therefore, the initial recognition process already supplies the frequency information that the additional ease-of-retrieval heuristic is supposed to provide. In sum, the analysis replicated two findings of Study 3. The judgment times of frequency judgments are not related to the size of the judgment and frequency judgments are made faster than the time needed to retrieve a single scenario. These findings are damaging for retrieval-based frequency judgment models.
The finding that latencies in the retrieval task were significantly related to the frequency judgments does not rescue the ease-of-retrieval model. Rather, it demonstrates that the same finding in other studies does not provide evidence for a causal role of ease-of-retrieval in the frequency judgment process.
Study 4 replicated many of the earlier findings. As in Studies 1 and 3, participants underestimated absolute frequencies of emotions, especially those of frequent emotions. Furthermore, increasing the salience of emotions at the time of encoding increased frequency judgments. As in Study 3, salience did not increase especially the frequency estimates of frequent emotions. Most importantly, Study 4 replicated the findings of Study 3 that (a) the times needed to make frequency judgments were unrelated to the magnitude of the judgments and (b) frequency judgments were made faster than the retrieval of a single scenario. Therefore, frequency judgments cannot be based on information about the retrieval of scenarios, although Study 4 found a significant negative correlation between frequency judgments and latencies in the LRT.
7 GENERAL DISCUSSION
The two main topics of the dissertation, namely (a) the accuracy of frequency judgments of emotions, and (b) the cognitive processes underlying these judgments, are discussed separately.
7.1 The Accuracy of Frequency Judgments of Emotions
Four types of accuracy were differentiated: (a) absolute and (b) relative accuracy as well as discriminative accuracy (c) across emotions and (d) across participants. The results bearing on each of the types of accuracy are discussed next.
7.1.1 Absolute Accuracy
Absolute accuracy was only explored in studies 1 and 4, because the participants in Study 2 did not make absolute estimates, and Study 3 lacked an appropriate standard of comparison. However, studies 1 and 4 both showed that frequency judgments of emotions are not very accurate in an absolute sense. This finding is inconsistent with a direct encoding of frequencies of emotions. On the other hand, estimation errors can be expected when participants use heuristics to make the absolute estimates. The finding in studies 1 and 4 that absolute accuracy was higher for salient compared to non-salient emotions indicates that salience can increase the accuracy of frequency judgments of emotions. Increasing accuracy due to salience has also been observed in other studies of frequency judgments (Naveh-Benjamin & Jonides, 1986). Brown and Singer (1993) pointed out that absolute accuracy is sensitive to two types of estimation errors: (a) errors in the estimation of the distribution of the actual frequencies and (b) errors in the estimation of the level of the absolute frequencies. Therefore, the salience effect on absolute accuracy can be due to an influence on either (or both) of these error sources. These possibilities are explored in the next analyses.
7.1.2 Relative Accuracy
Relative accuracy refers to the question of how well the absolute level of frequency judgments reflects the absolute level of the actual frequencies, or in other words, whether people over- or underestimate the actual frequency of their emotions. Studies 1, 3, and 4 (Study 2 did not allow this question to be addressed) all showed the predicted effect that frequency estimates underestimated the actual frequencies of emotions. Two objections might be raised against this finding. In the two field studies, actual frequencies were based on the sum of repeated frequency estimates, whereas the frequency judgments were based on a single estimate for the whole time period. Fiedler and Armbruster (1994) demonstrated that splitting a single frequency judgment of one category into two frequency judgments of two sub-categories produced different frequency estimates: The sum of the two estimates was higher than the frequency judgment of the whole category. Therefore, one might argue that the repeated daily estimates might overestimate the actual frequencies. This explanation of the effect in Study 1 encounters several difficulties. First, underestimation was also found in studies 3 and 4, where actual frequencies were not based on split frequency estimates. Second, Fiedler and Armbruster’s results did not show overestimation for the split-category judgments; rather they showed that split-judgments prevented categories from being underestimated. Therefore, the sum of the split-estimates was more accurate than the single judgments of a whole category.
Hence, the category-split effect supports, rather than contradicts, the current interpretation that the frequency estimates for the whole diary period underestimate the actual frequencies of emotions. Finally, underestimation is also prevalent in experimental studies of frequency judgments (cf. Watkins & LeCompte, 1991; Williams & Durso, 1986) in which the actual frequencies were objectively determined.
A second objection could be that the participants changed the meaning of the emotion words from the daily estimates to the frequency estimates over three weeks (Schwarz, Strack, Müller, & Chassein, 1988). For short time periods even very mild experiences of the emotion were counted, whereas for longer time periods the participants considered only severe experiences of the emotion. Again, this objection cannot explain underestimation in studies 3 and 4, where participants were explicitly told that frequency judgments should reflect all scenarios in which the emotion was rated to be present, irrespective of the intensity. Nevertheless, the actual frequencies were underestimated in the frequency judgment task. In sum, the present results provide strong support for the hypothesis that people underestimate the frequency of their emotions.
A related expectation was that underestimation should increase with the actual frequency of an emotion, which is also a common finding in the frequency judgment literature (Mingay et al., 1994; Watkins & LeCompte, 1991). Again, all studies that allowed a test of this prediction confirmed it. This finding has important practical implications when frequency judgments of emotions are used to measure subjective well-being. Often researchers compute a difference score between the frequencies of pleasant and unpleasant emotions. This hedonic-balance score is then used as a measure of subjective well-being. The problem with this index is that it underestimates the well-being of those people who experience more pleasant than unpleasant emotions, because the more frequent pleasant emotions are underestimated more than the less frequent unpleasant emotions. Similarly, the index underestimates the unhappiness of those people who experience unpleasant emotions more frequently than pleasant emotions.
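The consequence for the hedonic-balance score can be made concrete with a small numerical sketch. It assumes, purely for illustration, that estimates are a linear function of actual frequencies with a slope below 1; the intercept, the slope, and the frequencies are hypothetical and are not estimates from the present studies.

```python
def estimated_frequency(actual, intercept=1.0, slope=0.5):
    """Hypothetical linear compression: frequent emotions are underestimated more."""
    return intercept + slope * actual


def hedonic_balance(pleasant, unpleasant):
    """Difference between the frequencies of pleasant and unpleasant emotions."""
    return pleasant - unpleasant


# A person with more pleasant than unpleasant emotions (hypothetical frequencies).
actual_balance = hedonic_balance(20.0, 8.0)
estimated_balance = hedonic_balance(
    estimated_frequency(20.0), estimated_frequency(8.0)
)

# A person with the reverse pattern.
actual_balance_unhappy = hedonic_balance(8.0, 20.0)
estimated_balance_unhappy = hedonic_balance(
    estimated_frequency(8.0), estimated_frequency(20.0)
)

print(actual_balance, estimated_balance)
print(actual_balance_unhappy, estimated_balance_unhappy)
```

Under this assumption the intercept cancels in the difference score and the balance shrinks by the slope in both directions, so both the happiness of the first person and the unhappiness of the second are understated.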
Another noteworthy finding in studies 1, 3 and 4 was that making certain emotions salient increased the estimated frequencies of these emotions. This finding also has implications in many applied settings. For example, several psychotherapies are likely to increase the salience of emotional experiences. This could produce an expected increase in the reported frequencies of pleasant emotions, and an unexpected increase in the reported frequencies of unpleasant emotions, without any changes in the actual frequencies of emotional experiences. Because salience could affect the comparison of pre- and post-treatment measures, evaluation studies of treatment effects should include control groups in which emotional experiences are made salient.
Study 1 also indicated that salience increased especially the frequency estimates of the more frequent emotions. This was evident in a steeper slope of the regression line for post-diary estimates. In contrast, in studies 3 and 4 the regression slope of salient emotions was elevated but not steeper, indicating no special influence of salience on frequent emotions. The discrepant findings can be due to the fact that the different salience manipulations influenced different stages of the frequency judgment process (Brown, 1995). To judge frequencies of emotions, participants first have to construct a range of plausible frequencies. Then, they can assign frequencies to emotions by mapping the strength of the familiarity signal onto the frequency scale. The pre-post design in Study 1 very likely influenced participants’ beliefs about the plausible range of emotion frequencies. Making frequency judgments each day, they noticed that they experienced more emotions than they had believed before the diary study. Hence, they increased the upper limit of the frequency scale for the post-diary judgments. Similarly, Brown demonstrated that manipulations of participants’ beliefs about the range of plausible frequencies changed the slope of the regression line.
This effect is different from the salience effect observed in studies 3 and 4, which showed that frequency judgments were higher for emotions that were made salient than for emotions that were not made salient. It is plausible that this salience manipulation influenced the first stage of the frequency judgment process. Salient emotions appeared to be more familiar and were therefore rated as more frequent than non-salient, unfamiliar emotions. Nevertheless, the familiarity feelings of salient and non-salient emotions were mapped onto the same range of absolute frequencies during the second stage of the judgment process, which led to the observed differences in the level, but not in the slope, of the regression lines in studies 3 and 4. The distinction between two stages in the frequency judgment process (Brown, 1995) also has practical implications. The familiarity model predicts that people’s feeling of familiarity has high discriminative accuracy across emotions. However, it does not allow straightforward predictions of relative accuracy and discriminative accuracy across participants, because these two types of accuracy also depend on the second stage, in which the feeling of familiarity is converted into an absolute estimate. A better understanding of this conversion process might help to improve the measurement of emotion frequencies. For example, one could try to assist respondents in their selection of an appropriate range of frequencies. Blair and Williamson (1994) discussed the merits and pitfalls of providing participants with population norms of frequencies (e.g., “On average, people go to church once every three months”) to increase the relative accuracy of frequency estimates. This procedure could improve the accuracy of frequency estimates if participants have information about their relative standing on the relevant dimension (e.g., “I go to church much less frequently than the average person”).
With regard to internal states such as emotions, it is unlikely that people have accurate knowledge of how they compare to others in the frequency of emotional experiences. To conclude, the conversion of frequency information (i.e., the feeling of familiarity) into an absolute estimate is an important topic for future research, not only from a theoretical (Brown, 1995) but also from a practical point of view (Blair & Williamson, 1994).
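The difference between the two kinds of salience effects can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the model tested in the present studies: the linear mapping, the function names `judge` and `fit_line`, and all numbers are hypothetical.

```python
def judge(familiarity, upper_limit):
    """Stage 2 (assumed linear): map a familiarity signal in 0..1 onto 0..upper_limit."""
    return familiarity * upper_limit


def fit_line(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical actual weekly frequencies of four emotions.
actual = [2, 5, 10, 20]
# Stage 1 (assumed): familiarity grows linearly with actual frequency.
familiarity = [a / 40 for a in actual]

baseline = [judge(f, upper_limit=20) for f in familiarity]
# Diary-like manipulation: the range of plausible frequencies is extended (stage 2).
wider_range = [judge(f, upper_limit=30) for f in familiarity]
# Salience manipulation: every familiarity signal receives the same boost (stage 1).
salient = [judge(f + 0.1, upper_limit=20) for f in familiarity]

for label, judged in [("baseline", baseline),
                      ("wider range", wider_range),
                      ("salient", salient)]:
    slope, intercept = fit_line(actual, judged)
    print(f"{label}: slope={slope:.2f}, intercept={intercept:.2f}")
```

In this sketch, extending the upper limit of the range steepens the regression slope, whereas a uniform boost of the familiarity signal raises the level of the regression line and leaves its slope unchanged.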
7.1.3 Discriminative Accuracy Across Emotions
The present set of studies also explored the discriminative accuracy across emotions, that is, the question of how well frequency judgments discriminate between the actual frequencies of different emotions. This question is relevant for several issues in research on emotions. First, hierarchical models of the structure of emotions (Oatley & Johnson-Laird, 1987; Shaver, Schwartz, Kirson & O’Connor, 1987) predict that emotions on higher levels of the hierarchy are experienced more frequently than emotions lower in the hierarchy. For example, sadness should be experienced more frequently than disappointment because disappointment is assumed to be a subtype of sadness. To test these predictions of structural models of emotions, one needs accurate information that discriminates between the frequencies of different emotions. Second, why certain emotions are experienced more frequently than others seems to be an interesting topic for future research on emotions. For example, why is anxiety in general experienced more frequently than hate? Any emotion theory that explains how emotions are elicited should eventually explain differences between emotions in their frequencies of occurrence.
Discriminative accuracy across emotions was good in all four studies, and excellent when the data were first aggregated across participants. This finding is consistent with frequency judgments in other domains. Indeed, the claim that frequency judgments are very accurate, which has led some theorists to propose direct encoding models of frequency information (Hasher & Zacks, 1984), is predominantly based on findings of high discriminative accuracy across stimuli. However, even this type of accuracy was influenced by salience (studies 1 and 4), which contradicts the direct-encoding models. Furthermore, Study 1 also showed that the type of the response format influenced this type of accuracy. At the individual level, absolute estimates discriminated more accurately between the frequencies of emotions, presumably because vague quantifier ratings forced participants to assign the same frequency-category to several emotions, although they were able to discriminate between the frequencies of these emotions.
7.1.4 Discriminative Accuracy Across Participants
For many applied settings, the last type of accuracy is most important, namely discriminative accuracy across participants, that is, the question of how well retrospective frequency judgments reflect individual differences in the actual frequencies of emotions. The first two studies provide highly similar answers to this question, despite the fact that (a) the participants were from different nations, and (b) the daily ratings and the frequency judgments were obtained with slightly different methods. In both studies the discriminative accuracy was between r = .30 and .60. An exact estimate is difficult because frequency judgments made after daily frequency ratings overestimate this type of accuracy, whereas judgments made before the daily frequency ratings underestimate it.
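The two types of discriminative accuracy are computed over different axes of the same data matrix, which the following sketch makes concrete. It is an illustration only: the data are invented, the helper `pearson` is a hypothetical name, and the assumption that each participant maps the same information onto a different numerical range is an assumption of the example rather than a finding.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Rows = participants, columns = emotions (hypothetical actual frequencies).
actual = [[2, 4, 8],
          [3, 6, 12],
          [4, 8, 16]]
# Assumption: each participant converts frequencies with a different scale factor.
scale = [1.0, 0.5, 0.25]
judged = [[a * s for a in row] for row, s in zip(actual, scale)]

# Discriminative accuracy across emotions: one correlation per participant.
across_emotions = [pearson(a_row, j_row) for a_row, j_row in zip(actual, judged)]

# Discriminative accuracy across participants: one correlation per emotion.
n_emotions = len(actual[0])
across_participants = [
    pearson([row[e] for row in actual], [row[e] for row in judged])
    for e in range(n_emotions)
]

print("across emotions:", across_emotions)
print("across participants:", across_participants)
```

In this deliberately extreme example every participant orders his or her own emotions perfectly, yet the judgments misorder the participants, because the idiosyncratic conversion into absolute numbers affects only the comparison between persons.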
The present set of studies also provided some valuable results that rule out artifact explanations. First, the frequency estimates of a single emotion were often correlated most highly with the actual frequency of this emotion and not with those of other emotions. Furthermore, frequency judgments made for two separate weeks were more highly correlated with the actual frequencies in the target week than with those in the alternative week. This pattern of results rules out a simple response set explanation of discriminative accuracy across participants. Furthermore, it contradicts the contention that frequency judgments are simply based on some generalized beliefs. In particular, generalized beliefs cannot explain the context sensitivity of the frequency judgments. Furthermore, studies 1 and 2 provided little support for the hypothesis that frequency judgments are systematically biased by the self-concept or current mood of the participants: Neither emotion-related personality traits nor the current mood at the time of the frequency judgments appeared to have a consistent effect on the frequency judgments. In a similar vein, Schimmack and Hartmann (in press; see also Cutler, Larsen, & Bunce, 1996) investigated whether people with a repressive coping style, that is, people who are assumed to repress unpleasant feelings, underestimate the frequencies of their unpleasant emotions. Although so-called repressors reported experiencing unpleasant emotions less frequently when they were confronted with emotional scenarios (in a scenario rating task, see 220.127.116.11), their frequency estimates in the subsequent frequency judgment task were not biased. For unpleasant emotions, repressors’ lower frequency judgments correctly reflected the lower number of endorsements of unpleasant emotions in the scenario rating task. In sum, the search for personality dimensions that predict a systematic bias in frequency judgments of emotions has been unsuccessful.
Nevertheless, it is possible that frequency judgments of emotions are influenced by systematic biases that remained undetected in the previous studies.
7.2 The Cognitive Processes underlying Frequency Judgments of Emotions
How do people judge the frequency of emotions? In the present dissertation four models were compared with each other: (a) the direct encoding model, (b) the recall-estimate model, (c) the ease-of-retrieval model, and (d) the familiarity model. Several results of the studies were incompatible with the direct encoding model. Perhaps the most important result was that participants in Study 2 were able to estimate accurately the frequency of emotions in the first and second week of the diary period. This means that the frequencies were estimated at the time of retrieval. Additional evidence against the direct encoding model was that the salience of emotions at the time of encoding increased frequency judgments (studies 1, 3, and 4). According to the direct encoding model, emotional experiences should automatically activate emotion concepts and modify the frequency counter (Alba et al., 1980), independently of the salience of the concept. The same type of evidence has been used to challenge direct encoding models in other domains (Greene, 1989). The present studies show that the direct encoding model cannot explain frequency judgments of emotions either.
The present dissertation also challenges the ease-of-retrieval model. If frequency judgments were actually based on information about the ease of retrieval, higher frequency judgments should be made faster. Contrary to this prediction, studies 3 and 4 did not find a relation between the speed and the size of frequency judgments. Furthermore, in Study 4 many participants were able to judge the frequency of a very rare emotion (regret), although they were unable to retrieve a single scenario in which this emotion occurred. In addition, latencies in a separate retrieval task were related to frequency judgments only in Study 4, but not in Study 3. Nevertheless, the frequency judgments in both studies possessed discriminative accuracy across emotions. Probably the most damaging finding was that in studies 3 and 4 the frequency judgments were made faster than the time needed to retrieve a single scenario. Therefore, retrieval of exemplars to a conscious level is simply too slow to explain the fast (and accurate) frequency judgments. The same line of reasoning has been used to dismiss retrieval-based models in related research on metamemory (Metcalfe, 1993; Reder, 1987). The last finding contradicts not only the ease-of-retrieval model, but also other retrieval-based models, such as the recall-estimate model.
The only model that is compatible with all the present findings is the familiarity model. With regard to emotions, this model assumes that frequency questions activate multiple memory traces of previous emotional experiences simultaneously. As a consequence, memory sends a direct signal that reflects how many traces have been activated. This signal is experienced as a sense of familiarity. Like the other indirect encoding models, the model predicts that participants can differentiate frequencies in different contexts, such as week 1 and week 2 in the diary study, if the context variable is sufficiently encoded in memory (see Barsalou & Ross, 1986). Furthermore, it predicts that the salience of emotions at the time of encoding increases frequency judgments because salience strengthens memory traces, which results in a greater echo intensity (Hintzman, 1988). The familiarity model does not predict any relation between the size of a frequency judgment and the time needed to make it, and no such relation was obtained. The lack of such a relation therefore at least does not contradict the model. To conclude, the familiarity model seems to be the best candidate for a theory of frequency judgments of emotions. This conclusion should not be generalized to frequency judgments in other domains. Retrieval-based estimation strategies can be used and apparently are used under certain conditions (Brown, 1995; Menon, 1994).
To some readers, the superior performance of the familiarity model might appear to be due to the selection of paradigms, which mainly tested and disconfirmed predictions made by the competing models, but tested only a few predictions that follow from the familiarity model. Although a disconfirmation of the familiarity model would not rescue the other models, such tests are an important topic for future research. One prediction that follows from the familiarity model is, for example, that the feeling of familiarity should be influenced by the presence of similar exemplars in memory (Hintzman, Curran, & Oppy, 1992; Jones & Heit, 1993). For example, frequency estimates of eating carp should be inflated by memories of eating trout. With regard to emotions, this would imply that the presence of memories in which a person experienced disappointment but not anger should nevertheless increase the echo intensity of anger because disappointment episodes share some features with anger episodes.
A second prediction based on the familiarity model is that frequency judgments of very similar events should be higher than those of the same number of dissimilar events (Hintzman & Stern, 1978), because similar memories produce a stronger feeling of familiarity (but see Brown, 1995, for an alternative explanation). Therefore, the experience of similar anger episodes (e.g., always directed at one’s romantic partner) should lead to higher frequency judgments than the experience of anger in different contexts (e.g., directed at boss, partner, friends, and strangers).
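The prediction that similar exemplars inflate the familiarity signal can be sketched with a simplified, noise-free version of the echo intensity computation in Hintzman’s (1988) multiple-trace model: each trace is activated by the cube of its similarity to the probe, and the activations are summed. The feature vectors below are purely hypothetical and were chosen only so that disappointment shares three of four features with anger.

```python
def similarity(probe, trace):
    """Dot product of probe and trace, normalized by the number of
    features that are nonzero in the probe or in the trace."""
    relevant = sum(1 for p, t in zip(probe, trace) if p != 0 or t != 0)
    if relevant == 0:
        return 0.0
    return sum(p * t for p, t in zip(probe, trace)) / relevant


def echo_intensity(probe, memory):
    """Sum of cubed similarities across all traces in memory."""
    return sum(similarity(probe, trace) ** 3 for trace in memory)


# Hypothetical feature vectors with values in {-1, 0, +1}.
anger = [1, 1, 1, -1]
disappointment = [1, 1, -1, -1]  # differs from anger on one feature

# Memory A: three anger episodes only.
memory_a = [anger] * 3
# Memory B: the same three anger episodes plus four disappointment episodes.
memory_b = [anger] * 3 + [disappointment] * 4

print(echo_intensity(anger, memory_a))
print(echo_intensity(anger, memory_b))
```

Although both memories contain exactly three anger episodes, the probe for anger produces a stronger echo in the second memory, which under the familiarity model would translate into a higher frequency judgment for anger.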
7.3 Influence of the Response Format on Frequency Judgments of Emotions
The present dissertation bridged two research traditions: studies on the validity of self-reports of emotional experiences and experimental studies of the cognitive processes underlying frequency judgments. The preferred response format in the former tradition is the vague quantifier rating, whereas the latter tradition has preferred absolute estimates. Furthermore, a survey study that included both response formats obtained divergent results for the two formats (Schaeffer, 1991). This motivated the use of both response formats within the same study (Study 1). Highly surprising and interesting results were obtained. The most interesting effect was that vague quantifier ratings decreased from pre- to post-diary judgments, whereas at the same time the absolute estimates increased. The decrease in vague quantifier ratings was replicated in Study 2. Similar findings have been reported in the literature (Knowles et al., 1996), but the effect is still not understood, although it has important practical implications. For example, in therapy evaluation studies, patients often have to report the frequency of their emotions before and after treatment, commonly by means of vague quantifier ratings. The present study suggests that changes can be expected in these measures not only because of treatment effects but also because of changes in the use of the rating scale. Because the ratings tend to drop, a questionnaire assessing mainly unpleasant affect could indicate a positive treatment effect, that is, a decrease in the frequency of unpleasant emotional experiences, even if the treatment did not influence the actual frequencies of emotional experiences.
A second important difference between the two response formats was that averaged judgments of pleasant emotions and averaged judgments of unpleasant emotions were positively correlated for the absolute estimates, but not for the vague quantifier ratings, which revealed sometimes negative, sometimes positive, and sometimes non-significant correlations close to zero. Whereas the practical implications are discussed in the next paragraph, the cognitive processes underlying these effects of the response format are discussed now.
First, one might ask during which stage of the frequency judgment process the effects occur (Brown, 1995). That is, do the different response formats influence the generation of frequency information (e.g., a familiarity signal), or do they influence the conversion of this information into a response? It is likely that the influence of the response formats occurs during the second stage. During this stage, the participants are faced with the task of determining a reasonable range of absolute frequencies onto which the feeling of familiarity can be mapped. This part of the estimation process is likely to be difficult and error prone. Vague quantifier ratings do not require that the participants derive an absolute standard. The participants can simply map the different degrees of familiarity onto the categories of the response scale. This seems to suggest that vague quantifier ratings should be preferred. However, if participants use vague quantifier ratings simply to indicate the relative strength of their feeling of familiarity, the ratings can no longer be compared across participants. A rating of the highest category by one participant might reflect a very different absolute frequency than the same rating made by another participant. Therefore, vague quantifier ratings also do not solve the problem of how frequency information can be converted into a response that is comparable across participants. Furthermore, range-frequency theory has shown that even the assignment of absolute numbers printed on a sheet of paper is influenced by context effects such as the distribution of the numbers (Parducci & Wedel, 1986). Similar effects were obtained in the present study for vague quantifier ratings of emotion frequencies, but not for the absolute estimates. This finding would favor the absolute estimates.
Finally, Study 1 demonstrated that absolute estimates and vague quantifier ratings possessed the same amount of discriminative accuracy across participants when the judgments were made after the diary study; but before the diary study, the vague quantifier ratings outperformed the absolute estimates. In sum, it is not possible to recommend one of the two response formats over the other. Future research on the judgment process might help to reduce judgment errors, and might ultimately allow a rational choice of the best response format. Until then, a viable research strategy is to use both response formats, because each one is associated with different errors. As a consequence, the combined application of two short questionnaires with both response formats would produce more valid results than a long questionnaire with only one of the two response formats (Green et al., 1993).
7.4 The Structure of Individual Differences in the Frequencies of Pleasant and Unpleasant Emotional Experiences
The structure of individual differences in the frequencies of pleasant and unpleasant emotional experiences was not a central issue of the present investigations. However, important results were obtained that challenge current models of the personality structure of emotions. Currently, researchers assume that the frequencies of pleasant and unpleasant experiences of affect are independent (cf. Bradburn, 1969) or negatively correlated (Green et al., 1993). Furthermore, influential personality theories predict the frequencies of pleasant and unpleasant emotions to be independent (Costa & McCrae, 1992; Meyer & Shack, 1989; Watson & Clark, 1992), presumably because pleasant and unpleasant emotions are generated in different areas of the brain.
Studies 1 and 2 replicated previous results in that low correlations were obtained with the traditional response format, namely vague quantifier ratings. However, high positive correlations were obtained for absolute frequency estimates. Furthermore, regression analyses suggested that vague quantifier ratings produce an artifact, because the respondents use them partly to judge percentages of their emotional experiences and only partly to judge absolute frequencies of experienced emotions. As a consequence, individual differences in the overall number of emotional experiences are obscured and the correlation between frequencies of pleasant and unpleasant emotions becomes negative. This invalidates the conclusion of previous studies that the frequencies of pleasant and unpleasant emotional experiences are independent. The present study suggests that a person who often experiences pleasant emotions also often experiences unpleasant emotions. This finding is consistent with a study by Schimmack and Diener (in press), which also demonstrated a positive correlation between frequencies of pleasant and unpleasant emotions, derived from repeated ratings of emotional events in everyday life. Furthermore, the positive correlation between pleasant and unpleasant emotions is consistent with the positive correlation obtained for the number of pleasant and unpleasant events that people encounter in their lives (Suh et al., 1996).
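The logic of the proposed artifact can be illustrated with invented numbers. The sketch below is not a reanalysis of the present data: the frequencies, the mixing weight `w`, and the scaling constant of 60 are hypothetical, and the linear blend of percentage and absolute information is only one possible way to express the idea that ratings partly reflect percentages.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical absolute frequencies for four participants.
pleasant = [10, 20, 30, 40]
unpleasant = [5, 8, 18, 20]

# Percentage information: share of each valence among all emotional experiences.
share_pleasant = [p / (p + u) for p, u in zip(pleasant, unpleasant)]
share_unpleasant = [u / (p + u) for p, u in zip(pleasant, unpleasant)]


def mixed_rating(share, absolute, w, max_total=60):
    """Assumed rating: weight w on the percentage, 1 - w on the scaled absolute frequency."""
    return w * share + (1 - w) * absolute / max_total


w = 0.5  # hypothetical weight given to percentage information
rating_pleasant = [mixed_rating(s, p, w) for s, p in zip(share_pleasant, pleasant)]
rating_unpleasant = [mixed_rating(s, u, w) for s, u in zip(share_unpleasant, unpleasant)]

r_absolute = pearson(pleasant, unpleasant)
r_shares = pearson(share_pleasant, share_unpleasant)
r_mixed = pearson(rating_pleasant, rating_unpleasant)
print(r_absolute, r_mixed, r_shares)
```

Because the two shares sum to one for every person, pure percentage judgments are perfectly negatively correlated no matter how strongly the absolute frequencies covary, and any rating that blends percentage and absolute information falls between the two extremes.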
It should be noted, however, that this finding is limited to the frequency of emotions. It is likely that it does not hold for moods. Considering only the frequency with which a person is in a pleasant or unpleasant mood, it is very likely that the two frequencies are negatively correlated, because (a) a person is nearly always in a pleasant or unpleasant state, and (b) at any moment in time feelings of pleasure and displeasure rarely co-occur. As a consequence, it is a logical necessity that pleasure and displeasure are highly negatively correlated, a fact that is sometimes obscured by measurement error (Green et al., 1993).
In sum, evidence is growing that the frequencies of pleasant and unpleasant emotions are positively correlated, whereas the number of times a person is in a pleasant mood is inversely related to the number of times he or she is in an unpleasant mood. Although the results of the present study should not be regarded as a final answer to this question, the divergent findings for the two response formats underscore the need to understand the processes underlying frequency judgments of emotions before these measures can be used to answer fundamental questions about the causes and the structure of individual differences in the frequency with which people experience emotions.
Finally, I would like to discuss two important questions for future research, namely (a) individual differences in the accuracy of frequency judgments of emotions, and (b) the question of how the frequency of emotions should be assessed from a normative point of view.
8.1 Individual Differences in the Accuracy of Frequency Judgments of Emotions
In the present studies, accuracy scores of individuals were averaged to estimate the general accuracy of frequency judgments of emotions. However, the accuracy scores varied across participants. An important topic of future research would be to explore (a) whether these differences in accuracy are systematic, (b) whether it will be possible to assess a person’s level of accuracy, and (c) whether a person’s accuracy is related to existing constructs in the emotion literature, such as affect intensity (Larsen & Diener, 1987; Schimmack & Diener, in press), emotional intelligence (Mayer, DiPaolo, & Salovey, 1990), or alexithymia (Taylor, 1984).
People who experience emotions more intensely than others might tend to overestimate the frequency of their emotions or, in light of a consistent trend towards underestimation, at least underestimate the frequency of their emotions less, because intense emotional experiences are more memorable (Rapaport, 1942; Holmes, 1970). In contrast, people with high alexithymia scores might severely underestimate the frequency of their emotions, because they have difficulty labeling their emotional experiences, and the present studies found that labeling experiences increased frequency judgments when the same label was part of the frequency question. Finally, “emotionally intelligent” persons can be expected to be more accurate in their self-perceptions than others, because they pay more attention to their emotional experiences. However, it is also possible that traditional personality measures do not capture biases in the memory representation of emotions very well. If systematic individual differences in the accuracy of frequency judgments of emotions exist, this would require the development of new questionnaires that measure these differences.
8.2 Toward a Normative Assessment of the Frequency of Emotions
The present studies explored how people judge the frequency of emotions. An equally important question is how the frequency of emotions can be measured with the highest degree of accuracy by means of an economical research instrument at one moment in time. This question arises because on-line recording over a long time period is simply not a viable option in most assessment situations, although it would be the best strategy from a normative perspective. The present studies suggest that people rely on a sense of familiarity when they judge the frequency of emotions and that they do not use the ease-of-retrieval or a recall-estimate heuristic. However, the fact that most people do not use these strategies does not imply that they cannot be used. On the contrary, people can recall individual episodes (Fitzgerald et al., 1988) and people can judge the ease of retrieval (Schwarz et al., 1991). As a consequence, an important question for future research is whether these strategies would lead to better estimates of the frequency of emotions. For example, in two studies, Means, Swan, Jobe, and Esposito (1994) asked participants to record the number of cigarettes they smoked over a period of five days. Afterwards, the participants were asked to estimate the number of cigarettes smoked on one of these five days. Furthermore, they were instructed to use one of four strategies, namely (a) to use any strategy that they wanted, (b) to provide a spontaneous estimate without thinking of particular instances of smoking, (c) to think of different contexts (in the office, after dinner) and then to sum these separate estimates, and (d) to try to recall as many instances as possible. In these studies, the recall of exemplars appeared to be a better measure than the spontaneous estimation strategy, in which judgments were probably based on a familiarity signal. But different findings might be obtained for frequency judgments of emotions, especially when the time period is longer than one day.
To address this question, one needs a measure of the actual frequencies of emotions as a validation criterion, which is not biased in favor of any of the estimation strategies under investigation. Both the diary and the scenario rating task could be used to assess actual frequencies. In the beginning, the SRT is preferable because it is more economical than a diary study. Ultimately, however, it is necessary to compare different estimation strategies against actual frequencies of emotions in real life.
Alba, J. W., Chromiak, W., Hasher, L., & Attig, M. S. (1980). Automatic encoding of category size information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 6, 370-378.
Andreasen, N. C., & Black, D. W. (1991). Lehrbuch Psychiatrie [Textbook on Psychiatry]. Weinheim: Beltz.
Barsalou, L. W., & Ross, B. H. (1986). The roles of automatic and strategic processing in sensitivity to superordinate and property frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 116-134.
Blair, E., & Burton, S. (1987). Cognitive processes used by survey respondents to answer behavioral frequency questions. Journal of Consumer Research, 14, 280-288.
Blair, E., & Williamson, K. (1994). On providing population data to respondents. In N. Schwarz & S. Sudman (Eds.), Autobiographical memory and the validity of retrospective reports (pp. 173-186). New York: Springer.
Blaney, P. H. (1986). Affect and memory: A review. Psychological Bulletin, 99, 229-246.
Borkenau, P., & Ostendorf, F. (1991). Ein Fragebogen zur Erfassung fünf robuster Persönlichkeitsfaktoren [A questionnaire for the assessment of five robust personality factors]. Diagnostica, 37, 29-41.
Bower, G. H. (1981). Mood and memory. American Psychologist, 36, 129-148.
Bradburn, N. M. (1969). The structure of psychological well-being. Chicago: Aldine.
Brewin, C. R., Andrews, B., & Gotlib, I. H. (1993). Psychopathology and early experience: A reappraisal of retrospective reports. Psychological Bulletin, 113, 82-98.
Briggs, J. L. (1970). Never in anger. Cambridge, MA: Harvard University.
Briggs, J. L. (1987). In search of emotional meaning. Ethos, 15, 8-15.
Brown, N. R. (1995). Estimation strategies and the judgment of event frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 1539-1553.
Brown, N. R., & Siegler, R. S. (1993). Metrics and mappings: A framework for understanding real-world quantitative estimation. Psychological Review, 100, 511-534.
Bruce, D., Hockley, W. E., & Craik, F. I. M. (1991). Availability and category-frequency estimation. Memory and Cognition, 19, 301-312.
Clore, G. L. (1994). Why emotions require cognition. In P. Ekman & R. J. Davidson (Eds.), The nature of emotion (pp. 181-191). New York: Oxford University Press.
Cohen, D. (1996). Law, social policy, and violence: The impact of regional cultures. Journal of Personality and Social Psychology, 70, 961-978.
Costa, P. T., & McCrae, R. R. (1980). Influence of extraversion and neuroticism on subjective well-being: Happy and unhappy people. Journal of Personality and Social Psychology, 38, 668-678.
Costa, P. T., & McCrae, R. R. (1992). The revised NEO personality inventory (NEO-PI R) professional manual. Odessa, FL: Psychological Assessment Resources.
Cutler, S. E., Larsen, R. J., & Bunce, S. C. (1996). Repressive coping style and the experience and recall of emotion: A naturalistic study of daily affect. Journal of Personality, 64, 379-405.
Diener, E. (1984). Subjective well-being. Psychological Bulletin, 95, 542-575.
Diener, E., & Diener, M. (1995). Cross-cultural correlates of life satisfaction and self-esteem. Journal of Personality and Social Psychology, 68, 653-663.
Diener, E., Diener, M., & Diener, C. (1995). Factors predicting the subjective well-being of nations. Journal of Personality and Social Psychology, 69, 851-864.
Diener, E., & Iran-Nejad, A. (1986). The relationship in experience between different types of affect. Journal of Personality and Social Psychology, 50, 1031-1038.
Diener, E., & Larsen, R. J. (1984). Temporal stability and cross-situational consistency of affective, behavioral, and cognitive responses. Journal of Personality and Social Psychology, 47, 871-883.
Diener, E., Larsen, R. J., & Emmons, R. A. (1984). Bias in mood recall in happy and unhappy persons. Paper delivered at the 92nd Annual Meeting of the American Psychological Association, Toronto, August 1984.
Diener, E., Larsen, R. J., Levine, S., & Emmons, R. A. (1985). Intensity and frequency: The underlying dimensions of positive and negative affect. Journal of Personality and Social Psychology, 48, 1253-1265.
Diener, E., Sandvik, E., & Pavot, W. (1991). Happiness is the frequency, not the intensity, of positive versus negative affect. In F. Strack, M. Argyle, & N. Schwarz (Eds.), Subjective well-being (pp. 119-139). Oxford: Pergamon Press.
Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69, 130-141.
Ekman, P., & Davidson, R. J. (Eds.). (1994). The nature of emotion. New York: Oxford University Press.
Emmons, R. A., & Diener, E. (1986a). A goal-affect analysis of everyday situational choices. Journal of Research in Personality, 20, 309-326.
Emmons, R. A., & Diener, E. (1986b). Influence of impulsivity and sociability on subjective well-being. Journal of Personality and Social Psychology, 50, 1211-1215.
Epstein, S. (1983). A research paradigm for the study of personality and emotions. In M. M. Page (Ed.), Personality – Current theory and research: 1982 Nebraska symposium on motivation (pp. 91-154). Lincoln: University of Nebraska Press.
Fehr, B., & Russell, J. A. (1984). Concept of emotion viewed from a prototype perspective. Journal of Experimental Psychology: General, 113, 464-486.
Feldman Barrett, L. (in press). The relationships among momentary emotional experiences, personality descriptions, and retrospective ratings of emotion. Personality and Social Psychology Bulletin.
Fiedler, K. (1991). The tricky nature of skewed frequency tables: An information loss account of distinctiveness-based illusory correlations. Journal of Personality and Social Psychology, 60, 24-36.
Fiedler, K., & Armbruster, T. (1994). Two halfs may be more than one whole: Category-split effects on frequency illusions. Journal of Personality and Social Psychology, 66, 633-645.
Fiske, S. T., & Taylor, S. E. (1984). Social cognition. Reading, MA: Addison-Wesley.
Fitzgerald, J. M., Slade, S., & Lawrence, R. (1988). Memory availability and judged frequency of affect. Cognitive Therapy and Research, 12, 379-390.
Frijda, N. H., Ortony, A., Sonnemans, J., & Clore, G. L. (1992). The complexity of intensity: Issues concerning the structure of emotion intensity. In M. S. Clark (Ed.), Review of Personality and Social Psychology: Emotion (Vol. 13, pp. 60-89). Newbury Park, CA: Sage.
Gabrielcik, A., & Fazio, R. H. (1984). Priming and frequency estimation: A strict test of the availability heuristic. Personality and Social Psychology Bulletin, 10, 85-89.
Green, D. P., Goldman, S. L., & Salovey, P. (1993). Measurement error masks bipolarity in affect ratings. Journal of Personality and Social Psychology, 64, 1029-1041.
Greene, R. L. (1989). On the relationship between categorical frequency estimation and cued recall. Memory and Cognition, 17, 235-239.
Hanson, C., & Hirst, W. (1988). Frequency encoding of token and type information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 289-297.
Hasher, L., & Zacks, R. T. (1979). Automatic and effortful processes in memory. Journal of Experimental Psychology: General, 108, 356-388.
Hasher, L., & Zacks, R. T. (1984). Automatic processing of fundamental information. American Psychologist, 39, 1372-1388.
Hastie, R., & Park, B. (1986). The relationship between memory and judgment depends on whether the judgment task is memory-based or on-line. Psychological Review, 93, 258-268.
Haubensak, G. (1994). Wie entsteht der Häufigkeitseffekt in absoluten Urteilen? [On the origin of the frequency effect in absolute judgments]. Zeitschrift für experimentelle und angewandte Psychologie, 16, 378-397.
Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.
Hintzman, D. L., & Block, R. A. (1971). Repetition and memory: Evidence for a multiple-trace hypothesis. Journal of Experimental Psychology, 88, 297-306.
Hintzman, D. L., & Curran, T. (1994). Retrieval dynamics of recognition and frequency judgments: Evidence for separate processes of familiarity and recall. Journal of Memory and Language, 33, 1-18.
Hintzman, D. L., Curran, T., & Oppy, B. (1992). Effects of similarity and repetition on memory: Registration without learning? Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 667-680.
Hintzman, D. L., & Stern, L. D. (1978). Contextual variability and memory for frequency. Journal of Experimental Psychology: Human Learning and Memory, 4, 539-549.
Hofstede, G. (1980). Culture’s consequences. Beverly Hills, CA: Sage.
Holmes, D. S. (1970). Differential change in affective intensity and the forgetting of unpleasant personal experiences. Journal of Personality and Social Psychology, 15, 234-239.
Howell, W. C. (1973). Representation of frequency in memory. Psychological Bulletin, 80, 44-53.
Isen, A. M. (1985). Asymmetry of happiness and sadness effects on memory in normal college students: Comments on Hasher, Rose, Zacks, Sanft, and Doren. Journal of Experimental Psychology: General, 114, 388-391.
Izard, C. E., Libero, D. Z., Putnam, P., & Haynes, O. M. (1993). Stability of emotion experiences and their relations to traits of personality. Journal of Personality and Social Psychology, 64, 847-860.
James, W. (1884). What is emotion? Mind, 9, 188-205.
Jones, C. M., & Heit, E. (1993). An evaluation of the total similarity principle: Effects of similarity on frequency judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 799-812.
Jonides, J., & Jones, C. M. (1992). Direct coding for frequency of occurrence. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 368-378.
Keppel, G. (1991). Design and analysis: A researcher’s handbook (3rd edition). Englewood Cliffs, NJ: Prentice-Hall.
Knowles, E. S., Coker, M. C., Scott, R. A., Cook, D. A., & Neville, J. W. (1996). Measurement-induced improvement in anxiety: Mean shifts with repeated assessment. Journal of Personality and Social Psychology, 71, 352-363.
Larsen, R. J. (1992). Neuroticism and selective encoding and recall of symptoms: Evidence from a combined concurrent-retrospective study. Journal of Personality and Social Psychology, 62, 480-488.
Larsen, R. J., & Diener, E. (1987). Affect intensity as an individual difference characteristic: A review. Journal of Research in Personality, 21, 1-39.
Larsen, R. J., & Diener, E. (1992). Promises and problems with the circumplex model of emotion. In M. S. Clark (Ed.), Review of personality and social psychology: Emotion (Vol. 13, pp. 25-59). Newbury Park, CA: Sage.
Lazarus, R. S. (1991). Emotion and adaptation. New York: Oxford University Press.
Lewinsohn, P. M., & Rosenbaum, M. (1987). Recall of parental behavior by acute depressives, remitted depressives and nondepressives. Journal of Personality and Social Psychology, 52, 611-619.
MacLeod, A. K., Andersen, A., & Davies, A. (1994). Self-ratings of positive and negative affect and retrieval of positive and negative affective memories. Cognition and Emotion, 8, 483-488.
Manis, M., Shedler, J., Jonides, J., & Nelson, T. E. (1993). Availability heuristic in judgments of set size and frequency of occurrence. Journal of Personality and Social Psychology, 65, 448-457.
Markus, H. R., & Kitayama, S. (1994). The cultural construction of self and emotion: Implications for Social Behavior. In S. Kitayama & H. R. Markus (Eds.), Emotion and Culture (pp. 89-130). Washington, DC: APA.
Martin, M. (1985). Neuroticism as predisposition toward depression: A cognitive mechanism. Personality and Individual Differences, 6, 353-365.
Matthews, G., Jones, D. M., & Chamberlain, A. G. (1990). Refining the measurement of mood: The UWIST Mood Adjective Checklist. British Journal of Psychology, 81, 17-42.
Mayer, J. D., DiPaolo, M., & Salovey, P. (1990). Perceiving affective content in ambiguous visual stimuli: A component of emotional intelligence. Journal of Personality, 54, 772-781.
Means, B., Swan, G. E., Jobe, J. B., & Esposito, J. L. (1994). The effects of estimation strategies on the accuracy of respondents’ reports of cigarette smoking. In N. Schwarz & S. Sudman (Eds.), Autobiographical memory and the validity of retrospective reports (pp. 107-120). New York: Springer.
Menon, G. (1994). Judgments of behavioral frequencies: Memory search and retrieval strategies. In N. Schwarz & S. Sudman (Eds.), Autobiographical memory and the validity of retrospective reports (pp. 161-172). New York: Springer.
Mesquita, B., & Frijda, N. H. (1992). Cultural variations in emotions: A review. Psychological Bulletin, 112, 176-204.
Metcalfe, J. (1993). Novelty monitoring, metacognition and control in a composite holographic associative recall model: Implications for Korsakoff amnesia. Psychological Review, 100, 3-22.
Meudall, P. R. (1971). Retrieval and representations in long-term memory. Psychonomic Science, 23, 295-296.
Meyer, G. J., & Shack, J. R. (1989). Structural convergence of mood and personality: Evidence for old and new directions. Journal of Personality and Social Psychology, 57, 691-706.
Mingay, D. J., Shevell, K., Bradburn, N. M., & Ramirez, C. (1994). Self and proxy reports of everyday events. In N. Schwarz & S. Sudman (Eds.), Autobiographical memory and the validity of retrospective reports (pp. 235-250). New York: Springer.
Naveh-Benjamin, M., & Jonides, J. (1986). On the automaticity of frequency encoding: Effects of competing task load, encoding strategy, and intention. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 378-386.
Nelson, T. O. (1988). Predictive accuracy of the feeling of knowing across different criterion tasks and across different subject populations and individuals. In M. M. Gruneberg, P. E. Morris, & R. N. Sykes (Eds.), Practical aspects of memory: Current research and issues (Vol. 1, pp. 190-196). New York: Wiley.
Oatley, K., & Johnson-Laird, P. N. (1987). Towards a cognitive theory of emotions. Cognition and Emotion, 1, 29-50.
Parducci, A. (1968). The relativism of absolute judgments. Scientific American, 219, 84-90.
Parducci, A., & Wedell, D. H. (1986). The category effect with rating scales: Number of categories, number of stimuli, and method of presentation. Journal of Experimental Psychology: Human Perception and Performance, 12, 496-516.
Parkinson, B., Briner, R. B., Reynolds, S., & Totterdell, P. (1995). Time frames of mood: Relations between momentary and generalized ratings of affect. Personality and Social Psychology Bulletin, 21, 331-339.
Parrott, W. G., & Sabini, J. (1990). Mood and memory under natural conditions: Evidence for mood incongruent recall. Journal of Personality and Social Psychology, 59, 321-326.
Pavot, W., & Diener, E. (1993). Review of the Satisfaction With Life Scale. Psychological Assessment, 5, 164-172.
Pavot, W., Diener, E., & Fujita, F. (1990). Extraversion and happiness. Personality and Individual Differences, 11, 1299-1306.
Pekrun, R., & Frese, M. (1992). Emotions in work and achievement. In C. L. Cooper & I. T. Robertson (Eds.), International Review of Industrial and Organizational Psychology (Vol. 7, pp. 153-200). New York: Wiley.
Pepper, S. (1981). Problems in the quantification of frequency expressions. In D. W. Fiske (Ed.), Problems with language imprecision (pp. 25-41). San Francisco: Jossey-Bass.
Rapaport, D. (1942). Emotions and memory. New York: International Universities Press.
Reder, L. M. (1987). Selection strategies in question answering. Cognitive Psychology, 19, 90-138.
Reisenzein, R. (1995). On Oatley and Johnson-Laird’s theory of emotion and hierarchical structures in the affective lexicon. Cognition and Emotion, 9, 383-416.
Reisenzein, R., & Hofmann, T. (1993). Discriminating emotions from appraisal-relevant situational information: Baseline data for structural models of cognitive appraisals. Cognition and Emotion, 7, 271-293.
Reisenzein, R., & Schimmack, U. (1996). Similarity and covariation of affects: Findings and implications. Manuscript submitted for publication.
Reisenzein, R., & Schönpflug, W. (1992). Stumpf’s cognitive-evaluative theory of emotion. American Psychologist, 47, 34-45.
Rosch, E. (1975). Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104, 192-223.
Schaeffer, N. C. (1991). Hardly ever or constantly? Group comparisons using vague quantifiers. Public Opinion Quarterly, 55, 395-423.
Scherer, K. R., Wallbott, H. G., & Summerfield, A. B. (Eds.). (1986). Experiencing emotion: A cross-cultural study. Cambridge: Cambridge University Press.
Schimmack, U. (1996). Resolving some controversies about the mood circumplex. Paper presented at the Annual Meetings of the Midwestern Psychological Association, Chicago, May, 1996.
Schimmack, U. (1996). The relation between extraversion/neuroticism and positive/negative affect: A meta-analysis. Manuscript in preparation.
Schimmack, U. (in press). Das Berliner-Alltagssprachliche-Stimmungsinventar (BASTI): Ein Vorschlag zur kontentvaliden Erfassung von Stimmungen [The Berlin Everyday Language Mood Inventory: Toward the content valid assessment of moods]. Diagnostica.
Schimmack, U., & Diener, E. (in press). Affect Intensity: Separating intensity and frequency in repeatedly measured affect. Journal of Personality and Social Psychology.
Schimmack, U., & Reisenzein, R. (1994). On the demarcation of the mood domain. Paper presented at the symposium, “Mood – Consensus and controversy” at the 102nd Annual Convention of the American Psychological Association, Los Angeles, CA.
Schimmack, U., & Reisenzein, R. (in press). Cognitive processes involved in similarity judgments of emotion concepts. Journal of Personality and Social Psychology.
Schimmack, U., & Siemer, M. (1995). e = m x c! Über Emotionen, Stimmungen und Kognitionen [On emotion, mood and cognition]. Positionsreferat gehalten auf der 37. TeaP in Bochum, 1995.
Schwarz, N. (1987). Stimmung als Information [Mood as information]. Heidelberg: Springer.
Schwarz, N. (1990). Assessing frequency reports of mundane behaviors: Contributions of cognitive psychology to questionnaire construction. In C. Hendrick & M. S. Clark (Eds.), Research methods in personality and social psychology (pp. 98-119). Beverly Hills, CA: Sage.
Schwarz, N., Bless, H., Strack, F., Klumpp, G., Rittenauer-Schatka, H., & Simons, A. (1991). Ease of retrieval as information: Another look at the availability heuristic. Journal of Personality and Social Psychology, 61, 195-202.
Schwarz, N., & Clore, G. L. (1983). Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. Journal of Personality and Social Psychology, 45, 513-523.
Schwarz, N., Strack, F., Müller, G., & Chassein, B. (1988). The range of response alternatives may determine the meaning of the question: Further evidence on information functions of response alternatives. Social Cognition, 6, 107-117.
Shaver, P., Schwartz, J., Kirson, D., & O’Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 52, 1061-1086.
Shrout, P. E., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Smith, E. R. (1991). Illusory correlation in a simulated exemplar-based memory. Journal of Experimental Social Psychology, 27, 107-123.
Smith, E. R., & Zarate, M. A. (1992). Exemplar-based model of social judgment. Psychological Review, 99, 3-21.
Steyer, R., Schwenkmezger, P., Notz, P., & Eid, M. (1994). Testtheoretische Analysen des Mehrdimensionalen Befindlichkeitsfragebogens [Test theoretical analyses of the multidimensional state questionnaire]. Diagnostica, 40, 320-328.
Suh, E., Diener, E., & Fujita, F. (1996). Events and subjective well-being: Only recent events matter. Journal of Personality and Social Psychology, 70, 1091-1102.
Taylor, G. J. (1984). Alexithymia: Concept, measurement, and implications for treatment. American Journal of Psychiatry, 141, 725-732.
Temme, G., & Tränkle, U. (1996). Arbeitsemotionen: Ein vernachlässigter Aspekt in der Arbeitszufriedenheitsforschung [Emotions at the workplace: A neglected aspect in research on job satisfaction]. Arbeit, 5, 275-297.
Thomas, D. L., & Diener, E. (1990). Memory accuracy in the recall of emotions. Journal of Personality and Social Psychology, 59, 291-297.
Thompson, C. P., & Mingay, D. (1991). Estimating the frequency of everyday events. Applied Cognitive Psychology, 5, 497-510.
Triandis, H. C. (1994). Culture and social behavior. New York: McGraw-Hill.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5, 207-232.
Underwood, B. J. (1969). The attributes of memory. Psychological Review, 76, 559-573.
Watkins, M. J., & LeCompte, D. C. (1991). Inadequacy of recall as a basis for frequency knowledge. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 1161-1176.
Watson, D., & Clark, L. A. (1992). On traits and temperaments: General and specific factors of emotional experience and their relation to the five-factor model. Journal of Personality, 60, 441-476.
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of Positive and Negative Affect: The PANAS Scales. Journal of Personality and Social Psychology, 54, 1063-1070.
Wiggins, J. S. (1973). Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley.
Williams, K. W., & Durso, F. T. (1986). Judging category frequency: Automaticity or availability? Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 387-396.
Windle, C. (1955). Test-retest effect on personality questionnaires. Educational and Psychological Measurement, 15, 246-253.
Wright, D. B., Gaskell, G. D., & O’Muircheartaigh, C. A. (1994). How much is ‘Quite a bit’? Mapping between numerical values and vague quantifiers. Applied Cognitive Psychology, 8, 479-498.
Zuroff, D. C. (1989). Judgments of frequency of social stimuli: How schematic is person memory? Journal of Personality and Social Psychology, 56, 890-898.
A few years ago, Motyl et al. (2017) published the article “The State of Social and Personality Science: Rotten to the Core, Not So Bad, Getting Better, or Getting Worse?” The article provided the first assessment of the credibility and replicability of social psychology based on a representative sample of over 1,000 hand-coded test statistics in original research articles. Given the amount of work involved, the authors may be a bit disappointed that their article has been largely ignored by social psychologists and meta-psychologists alike. So far, it has received only 23 citations in Web of Science. In comparison, the reproducibility project that replicated a quasi-representative sample of 55 studies has received over 2,700 citations, 580 of them in 2020 alone.
In my opinion, this difference is not proportional to the contributions of the two projects. Neither actual replications nor coding of original research findings are flawless methods to estimate the replicability of social psychology. Actual replication studies have the problem that they may fail to reproduce the original conditions, especially when research is conducted with different populations. In contrast, the coding of original test statistics is 100% objective and is biased only by misreporting of statistics in original articles. The advantage of actual replications is that they more directly answer the question of interest: can we reproduce a significant result if we conduct the same study again? As many authors from Fisher to Cohen have pointed out, actual replication is the foundation of empirical sciences. In contrast, statistical analysis of published test statistics can only estimate the outcome of actual replication studies based on a number of assumptions that are difficult or impossible to verify. In short, both approaches have their merits and shortcomings and they are best used in tandem to produce convergent evidence with divergent methods.
A key problem with Motyl et al.’s (2017) article was that they did not provide a clearly interpretable result that is akin to the shocking finding in the reproducibility project that only 14 out of the 55 (25%) replication attempts were successful, despite increased sample sizes and power for some of the replication studies. This may explain why Motyl et al. (2017) did not conclude that social psychology is rotten to the core, which would be an apt description of a failure rate of 75%.
Motyl et al. (2017) used a variety of statistical methods that were just being developed. They also converted all test statistics into z-scores and showed z-curves for studies in 2003/04 and 2013/14. Yet, they did not analyze these z-curve plots with a z-curve analysis to estimate power. Moreover, the new version, z-curve 2.0, had not yet been developed.
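The conversion of reported results into z-scores can be sketched in a few lines. This is a minimal illustration and not the code Motyl et al. used; the function names `p_to_z` and `z_to_p` are hypothetical, and the sketch assumes that each test statistic has already been turned into a two-sided p-value.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_to_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value into an absolute z-score.

    Hypothetical helper for illustration: smaller p-values map onto
    larger z-scores, and p = .05 maps onto z of about 1.96.
    """
    return STD_NORMAL.inv_cdf(1 - p_two_sided / 2)


def z_to_p(z: float) -> float:
    """Convert a z-score back into a two-sided p-value."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


if __name__ == "__main__":
    # Made-up p-values, only to show the direction of the mapping.
    for p in (0.05, 0.01, 0.005):
        print(f"p = {p:<5} -> z = {p_to_z(p):.2f}")
```

Plotting a histogram of such z-scores is what produces the z-curve plots discussed here; the significance criterion then sits at z = 1.96.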
The authors clearly point out that the steep drop of values below the significance criterion of z = 1.96 (p = .05, two-sided) provides evidence of publication bias. “There is clear evidence of publication bias (i.e., a sharp rise of the distribution near 1.96)” (p. 49). In contrast, the Open Science Collaboration article provided no explanation for the drop in success rates from 97% in the original articles to 25% in the replication studies. This may be justified given the small sample of studies. Thus, Motyl et al.’s (2017) article should be cited because it provides clear visual evidence of publication bias in the social psychological literature. However, the only people interested in social psychology are social psychologists and they are not motivated to cite research that makes their science look bad.
A bigger limitation of Motyl et al.’s (2017) article is the discussion of power and replicability. First, the authors examine post-hoc power, which is dramatically inflated when publication bias selects significant results.
“Although post hoc observed power estimates are extremely upwardly biased and should be interpreted with great caution, our median values were very near Cohen’s .80 threshold for both time periods, a conclusion more consistent with an interpretation of it’s not so bad than it’s rotten to the core.”
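How strongly selection for significance inflates observed power can be worked out directly. The sketch below is my own illustration under simplifying assumptions: all studies have the same true power, only results with z > 1.96 are published, and the negligible opposite tail of the two-sided test is ignored. The function name is hypothetical.

```python
from statistics import NormalDist

ND = NormalDist()            # standard normal distribution
CRIT = ND.inv_cdf(0.975)     # significance criterion, about 1.96


def median_observed_power_after_selection(true_power: float) -> float:
    """Median post-hoc power among published (significant) results.

    Assumes z-scores are normal with sd 1 around a common mean, and
    that only z > CRIT is published (one tail, for simplicity).
    """
    # Mean of the z-distribution that yields the requested true power.
    ncp = CRIT - ND.inv_cdf(1 - true_power)
    # Median z-score among the significant results only: the point that
    # splits the area above CRIT into two equal halves.
    z_median = ncp + ND.inv_cdf(1 - true_power / 2)
    # Post-hoc power treats the observed z-score as the true effect.
    return 1 - ND.cdf(CRIT - z_median)


if __name__ == "__main__":
    for power in (0.2, 0.5, 0.8):
        inflated = median_observed_power_after_selection(power)
        print(f"true power {power:.2f} -> median observed power {inflated:.2f}")
```

Under these assumptions, studies with 50% true power show a median observed power of 75% once only significant results are kept, which is why a median near .80 is not reassuring.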
To avoid these misleading conclusions, it is important to adjust power estimates for the effect of selection for significance. Motyl et al. (2017) actually report results for the R-Index, which corrects for this inflation. To correct for inflation by publication bias, the R-Index first computes the discrepancy between the observed discovery rate (i.e., the percentage of z-scores greater than 1.96 in Figure 1) and observed power, and then subtracts this discrepancy from observed power. The idea is that we cannot get 95% significant results if power is only 80%. The lower the observed power is, the more the success rate is inflated by questionable research practices. The R-Index is called an index because the correction method provides biased estimates of power. So, values should be used as a heuristic, but not as proper estimates of power. However, values around 50% are relatively unbiased. Thus, the R-Index results provide some initial information about the average power of studies.
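The R-Index logic can be sketched as follows. This is a simplified illustration rather than the code behind the published estimates: it uses median observed power, takes the observed discovery rate as the share of z-scores above 1.96, and subtracts the resulting inflation from observed power. The function names and the example values are made up.

```python
from statistics import NormalDist, median

ND = NormalDist()  # standard normal distribution


def observed_power(z: float, alpha: float = 0.05) -> float:
    """Power of a two-sided z-test if the true effect equaled |z|."""
    crit = ND.inv_cdf(1 - alpha / 2)
    return (1 - ND.cdf(crit - abs(z))) + ND.cdf(-crit - abs(z))


def r_index(z_scores, alpha: float = 0.05) -> float:
    """R-Index = median observed power minus inflation."""
    crit = ND.inv_cdf(1 - alpha / 2)
    # Observed discovery rate: share of results that are significant.
    discovery_rate = sum(abs(z) > crit for z in z_scores) / len(z_scores)
    median_power = median(observed_power(z, alpha) for z in z_scores)
    # Inflation: how much the success rate exceeds what power allows.
    inflation = discovery_rate - median_power
    return median_power - inflation


if __name__ == "__main__":
    example_z = [1.5, 2.0, 2.2, 2.5, 2.8]  # made-up z-scores
    print(f"R-Index: {r_index(example_z):.2f}")
```

Because the inflation is subtracted from observed power, a set of studies with many just-significant results receives a low R-Index even though its success rate is high.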
“The R-index decreased numerically, but not statistically over time, from .62 [95% CI = .54, .68] in 2003–2004 to .52 [95% CI = .47, .56] in 2013–2014”
This result could be used as a rough estimate of the statistically predicted replication rate for social psychology that can be directly compared to the replication rate in the Open Science Collaboration project. This leads to two different conclusions about the published studies in social psychology from 1900 to 2014. Based on the Open Science Reproducibility project the field is rotten. With a 75% failure rate, it is not clear which results can be trusted. The best approach forward would be to burn everything to the ground and start from scratch to build a science of social behavior. With a 50% replication rate, we might be more willing to call the glass half empty or half full and search for some robust findings in the rubble of the replication crisis. So, in 2021 we have no clear assessment of the credibility of social psychology. We have clear evidence of publication bias and inflation of success rates, but we do not have clear evidence about the replicability of social psychology. It would seem imprudent to ignore all published evidence based on actual replication outcomes of just 50 studies.
In a recent publication, I analyzed Motyl et al.’s data using the latest version of z-curve (Brunner & Schimmack, 2020; Bartos & Schimmack, 2021). The advantage of z-curve over the R-Index is that it does provide estimates of power that have been validated in simulation studies. I focussed on t-tests and F-tests with one degree of freedom because these tests most directly test predictions about group differences. As there were no significant differences between 2003/04 and 2013/14, only one model was fitted to all years.
Figure 2 shows the results. The first finding is that the expected replication rate (ERR) is estimated to be slightly lower than the R-Index results in Motyl et al. (2017) suggested, 43%, 95% CI = 36% to 52%. This estimate is closer to the success rate for actual replication studies (25%), but there is still a gap. One reason for this gap is that the ERR assumes exact replications. However, to the extent that replication studies are not exact, regression to the mean will lower replication rates and, in the worst-case scenario, the success rate of replication studies is no different from the expected discovery rate (Bartos & Schimmack, 2020). That is, researchers are essentially doing a new study whenever they do a conceptual replication study and the outcome of these studies is based on the average power of studies that are being conducted. The EDR estimate is 19% and the 95% CI ranges from 6% to 36%, which includes 25%. Thus, the EDR estimate for Motyl et al.’s data is consistent with the replication rate in actual replication studies.
The main purpose of this post (pre-print) is to replicate and extend the z-curve analysis of Motyl et al.’s data. There are several good reasons for doing so. First, replication is a good practice for all sciences, including meta-science. Second, a blog post by Leif Nelson and colleagues questioned the coding of test statistics and implied that the results were too good (Nelson et al., 2017). Accordingly, the actual power of studies in social psychology would be even lower than 19%, but selection for significance might boost the expected replication rate to 25%. However, direct replications are often not as informative as replication studies with an extension that addresses a new question. For this reason, this replication project did not use a random sample of studies. Instead, the focus was on the most cited articles by the most eminent social psychologists. There are several advantages of focusing on this set of studies. First, there have been concerns that studies by junior authors and studies with low citation counts are of lower quality. The wisdom of crowds might help to pick well-conducted studies with high replicability. Accordingly, this study should produce a higher ERR and EDR than Motyl et al.’s random sample of studies. Second, the replicability of highly cited articles is more important for the field than the replicability of studies with low citation counts that had no influence on the field of psychology.
A paid undergraduate student, who prefers to remain anonymous, and I coded the most highly cited articles of eminent social psychologists (an H-Index of 35 or higher in 2018). The goal was to code enough articles to have at least 20 studies per researcher.
For the most part, the results replicate the z-curve analysis of Motyl et al.’s data. The observed discovery rate is 89% compared to 90% for Motyl et al. Importantly, these values do not include marginally significant results. Including marginally significant results, the ODR is consistent with Sterling’s finding that over 90% of published focal tests in psychology are significant (Sterling, 1959; Sterling et al., 1995).
Z-curve provides the first estimates of the actual power to produce significant results. The EDR estimate for the replication study, 26%, is slightly higher than the estimate for Motyl et al., but the confidence intervals overlap considerably, showing that the differences are not statistically significant. The new confidence interval of 10% to 36% also includes the actual replication rate of 25%.
The ERR for the replication study, 49%, is a bit higher than the ERR of Motyl et al.’s study, 43%, but the confidence intervals overlap. Both confidence intervals exclude the actual replication rate of 25%, showing that the ERR of Motyl et al.’s study was not inflated by bad coding. Instead, the results provide further evidence that the ERR overestimates actual replication outcomes.
Social psychology lacks credibility
The foundations of an empirical science are objectively verified facts. In the social sciences, these building blocks are based on statistical inferences that come with the risk of false positive results. Only convergent evidence across multiple studies can provide solid foundations for theories of social behavior. However, selective publishing of studies that confirm theoretical predictions renders the published record inconclusive. The impressive success rates of close to 100% in psychology journals are a mirage and merely show psychologists’ aversion to disconfirming evidence (Sterling, 1959). The present study provides converging evidence that the actual discovery rate in social psychological laboratories is much lower and likely to be well below 50%. While statisticians are still debating the usefulness of statistical significance testing, they do agree that selecting significant results renders statistical significance useless. If only significant results are published, even false positive results like Bem’s embarrassing results of time-reversed priming get published (Bem, 2011). Nobody outside of social psychology needs to take claims based on these questionable results seriously. A science that does not publish disconfirming evidence is not a science. Period.
It is of course not easy to face the bitter truth that decades of research were wasted on pseudo-scientific publications and that the thousands of articles with discoveries may be filled with false discoveries (“Let’s err on the side of discovery” Bem, 2000). Not surprisingly, social psychologists have reacted in ways that are all too familiar to psychoanalysts. Ten years after concerns about the trustworthiness of social psychology triggered a crisis of confidence, not much has been done to correct the scientific record. Citation counts show that claims based on questionable practices are still treated as if they are based on solid empirical foundations. Textbooks continue to pretend that social psychological theories are empirically supported, even if replication failures cast doubt on these theories. However, science is like the stock market. We know it will correct eventually; we just don’t know when. Meanwhile, social psychology is losing credibility because social psychologists are unable or unwilling to even acknowledge the mistakes of the past.
Social psychology needs to improve statistical power
Criticisms of low power in social psychology are nearly as old as empirical social psychology itself (Cohen, 1962). However, despite repeated calls for increased power, power did not increase from 1960 to 2010 (I have produced the first evidence that power increased afterwards; Schimmack, 2016, 2017, 2021). The main problem of low power is that studies are likely to produce non-significant results even if a study tested a true hypothesis. However, low power also influences the false discovery risk. If only a small portion of studies produces a significant outcome, the risk of a false positive result relative to a true positive result increases (Soric, 1989). In theory, this is not a problem if replication studies can be used to separate true and false discoveries, but if replication studies are not credible, it remains unclear how many discoveries are false discoveries.
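The link between the discovery rate and the false discovery risk can be made concrete with Soric’s (1989) upper bound. The sketch below is a minimal illustration; the function name is hypothetical, and the bound gives the maximum false discovery risk that is consistent with a discovery rate, not the actual rate of false discoveries.

```python
def soric_max_false_discovery_risk(discovery_rate: float, alpha: float = 0.05) -> float:
    """Upper bound on the share of significant results that are false positives.

    The bound assumes that true hypotheses are tested with perfect power,
    which leaves the largest possible room for false positives.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))


if __name__ == "__main__":
    # Discovery rates taken from the EDR estimates reported in this post.
    for edr in (0.19, 0.26):
        risk = soric_max_false_discovery_risk(edr)
        print(f"EDR = {edr:.0%} -> maximum false discovery risk = {risk:.0%}")
```

With the EDR estimates of 19% and 26% reported above and alpha = .05, the bound comes to roughly 22% and 15%. When the discovery rate falls to 5%, the bound reaches 100%, because a 5% success rate is exactly what testing only false hypotheses would produce.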
Social psychology needs to invest more resources in original studies
Before the major replication crisis in the 2010s, social psychologists were concerned about questionable practices in the 1990s (Kerr, 1998). In response to these concerns, demands increased to demonstrate robustness of findings in multi-study articles (cf. Schimmack, 2012). Surprisingly, social psychologists were able to present significant results again and again in these multiple-study articles, creating the illusion of replicability. Even Bem (2011) demonstrated time-reversed causality in nine studies. This is practically impossible to happen by chance. However, these seemingly robust results did not show that social psychological results were credible. Instead, they showed that social psychologists had found ways to produce many significant results with questionable practices. The demand for multiple studies is no longer needed when original studies are credible because they used large samples and pre-registered dependent variables and other design features. However, social psychologists continue to expect multiple studies within a single article. To meet this expectation, social psychologists have moved online and conduct cheap, short studies that take a few minutes and cost little. These studies are not intrinsically bad, but they crowd out important research on actual social behavior or intervention studies that can actually reduce prejudice or change other social behaviors. Cohen famously said, less is more. By this he did not mean to lower standards of external validity. Instead, he was trying to push back against a research culture that prizes quantitative indicators of success like the number of significant results, articles, and citations. This research culture has produced no reliable interventions to reduce prejudice in 60 years of research. It is time to change this and to reward carefully planned, expensive, and difficult studies that can make a real contribution. This may require collaboration rather than competition among labs.
Social psychology needs a Hubble telescope, a CERN collider, or a large household panel study to tackle big questions. The genius scientist with a sample of 40 undergraduate students like Festinger was the wrong role model for social psychology for far too long. The Open Science Collaboration project showed how collaboration across many labs can have a big impact that no single replication study could have had. This should also be the model for original social psychology.
Evidence is accumulating that social psychology has made a lot of mistakes in the past. The evidence that has accumulated in social psychological journals has little evidential value. It will take time to separate what is credible and what is not. New researchers need to be careful to avoid investing resources in research lines that are mirages and to look for oases in the desert. A reasonable heuristic is to distrust all published findings with a p-value greater than .005 and to carefully check the research practices of individual researchers (Schimmack, 2021). Of course, it is not viable to retract all bad articles that have been published or to issue expressions of concern for entire volumes. However, consumers of social psychology need to be aware that the entire literature comes with a big warning label: “Readers are advised to proceed with caution.”