All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Hey Social Psychologists: Don’t Mess with Suicide!

Ten years ago, the foundations of psychological science were shaken by the realization that the standard scientific method of psychological science is faulty. Since then it has become apparent that many classic findings are not replicable and many widely used measures are invalid, especially in social psychology (Schimmack, 2020).

However, it is not uncommon to read articles in 2021 that ignore the low credibility of published results. There are too many of these pseudo-scientific articles, but some articles matter more than others; at least to me. I do care about suicide, and like many people my age, I know people who have committed suicide. I was therefore concerned when I saw a review article that examines suicide from a dual-process perspective.

Automatic and controlled antecedents of suicidal ideation and action: A dual-process conceptualization of suicidality.

My main concern about this article is that dual-process models in social cognition are based on implicit priming studies with low replicability and implicit measures with low validity (Schimmack, 2021a, 2021b). It is therefore unclear how dual-process models can help us to understand and prevent suicides.

After reading the article, it is clear that the authors make many false statements and present questionable studies that have never been replicated as if they produce a solid body of empirical evidence.

Introduction of the Article

The introduction cites outdated studies that have either not been replicated or produced replication failures.

“Our position is that even these integrative models omit a fundamental and well-established dynamic of the human mind: that complex human behavior is the result of an interplay between relatively automatic and relatively controlled modes of thought (e.g., Sherman et al., 2014). From basic processes of impression formation (e.g., Fiske et al., 1999) to romantic relationships (e.g., McNulty & Olson, 2015) and intergroup relations (e.g., Devine, 1989), dual-process frameworks that incorporate automatic and controlled cognition have provided a more complete understanding of a broad array of social phenomena.”

This is simply not true. For example, there is no evidence that we implicitly love our partners when we consciously hate them or vice versa, and there is no evidence that prejudice occurs outside of awareness.

Automatic cognitions can be characterized as unintentional (i.e., inescapably activated), uncontrollable (i.e., difficult to stop), efficient in operation (i.e., requiring few cognitive resources), and/or unconscious (Bargh, 1994) and are typically captured with implicit measures.

This statement ignores many articles that have criticized the assumption that implicit measures measure implicit constructs. Even the proponents of the most widely used implicit measure have walked back this assumption (Greenwald & Banaji, 2017).

The authors then make the claim that implicit measures of suicide have incremental predictive validity for suicidal behavior.

For example, automatic associations between the self and death predict suicidal ideation and action beyond traditional explicit (i.e., verbal) responses (Glenn et al., 2017).

This claim has been made repeatedly by proponents of implicit measures, so I meta-analyzed the small set of studies that tested this prediction (Schimmack, 2021). Some of these studies produced non-significant results and the literature showed evidence that questionable research practices were used to produce significant results. Overall, the evidence is inconclusive. It is therefore incorrect to point to a single study as if there is clear evidence that implicit measures of suicidality are valid.

Further statements are also based on outdated research and a single reference.

“Research on threat has consistently shown that people preferentially process dangers to physical harm by prioritizing attention, response, and recall regarding threats (e.g., Öhman & Mineka, 2001).”

There have been many proposals about stimuli that attract attention, and threatening stimuli are by no means the only attention-grabbing stimuli. Sexual stimuli also attract attention, and in general, arousal rather than valence or threat is a better predictor of attention (Schimmack, 2005).

It is also not clear how threatening stimuli are relevant for suicide, which is related to depression rather than to anxiety disorders.

The introduction of implicit measures totally disregards the controversy about the validity of implicit measures and the fact that different implicit measures of the same construct show low convergent validity.

Much has been written about implicit measures (for reviews, see De Houwer et al., 2009; Fazio & Olson, 2003; March et al., 2020; Nosek et al., 2011; Olson & Fazio, 2009), but for the present purposes, it is important to note the consensus that implicit measures index the automatic properties of attitudes.

More relevant are claims that implicit measures have been successfully used to understand a variety of clinical topics.

The application of a dual-process framework has consequently improved explanation and prediction in a number of areas involving mental health, including addiction (Wiers & Stacy, 2006), anxiety (Teachman et al., 2012), and sexual assault (Widman & Olson, 2013). Much of this work incorporates advances in implicit measurement in clinical domains (Roefs et al., 2011).

The authors then make the common mistake of conflating self-deception and other-deception. The notion of implicit motives that can influence behavior without awareness implies self-deception. An alternative rationale for the use of implicit measures is that they are better measures of consciously accessible thoughts and feelings that individuals are hiding from others. Here we do not need to assume a dual-process model. We simply have to assume that self-report measures are easy to fake, whereas implicit measures can reveal the truth because they are difficult to fake. Thus, even incremental predictive validity does not automatically support a dual-process model of suicide. However, this question is only relevant if implicit measures of suicidality show incremental predictive validity, which has not been demonstrated.

Consistent with the idea that such automatic evaluative associations can predict suicidality later, automatic spouse-negative associations predicted increases in suicidal ideation over time across all three studies, even after accounting for their controlled counterparts (McNulty et al., 2019).

Conclusion Section

In the conclusion section, the authors repeat their false claim that implicit measures of suicidality reflect valid variance in implicit suicidality and that they are superior to explicit measures.

“As evidence of their impact on suicidality has accumulated, so has the need for incorporating automatic processes into integrative models that address questions surrounding how and under what circumstances automatic processes impact suicidality, as well as how automatic and controlled processes interact in determining suicide-relevant outcomes.”

Implicit measures are better-suited to assess constructs that are more affective (Kendrick & Olson, 2012), spontaneous (e.g., Phillips & Olson, 2014), and uncontrollable (e.g., Klauer & Teige-Mocigemba, 2007).

As recent work has shown (e.g., Creemers et al., 2012; Franck, De Raedt, Dereu, et al., 2007; Franklin et al., 2016; Glashouwer et al., 2010; Glenn et al., 2017; Hussey et al., 2016; McNulty et al., 2019; Nock et al., 2010; Tucker, Wingate, et al., 2018), the psychology of suicidality requires formal consideration of automatic processes, their proper measurement, and how they relate to one another and to corresponding controlled processes.

We have articulated a number of hypotheses, several already with empirical support, regarding interactions between automatic and controlled processes in predicting suicidal ideation and lethal acts, as well as their combination into an integrated model.

Then they finally mention the measurement problems of implicit measures.

Research utilizing the model should be mindful of specific challenges. First, although the model answers calls to diversify measurement in suicidality research by incorporating implicit measures, such measures are not without their own problems. Reaction time measures often have problematically low reliabilities, and some include confounds (e.g., Olson et al., 2009). Further, implicit and explicit measures can differ in a number of ways, and structural differences between them can artificially deflate their correspondence (Payne et al., 2008). Researchers should be aware of the strengths and weaknesses of implicit measures.

Evaluation of the Evidence

Here I provide a brief summary of the actual results of studies cited in the review article so that readers can make up their own mind about the relevance and credibility of the evidence.

Creemers, D. H., Scholte, R. H., Engels, R. C., Prinstein, M. J., & Wiers, R. W. (2012). Implicit and explicit self-esteem as concurrent predictors of suicidal ideation, depressive symptoms, and loneliness. Journal of Behavior Therapy and Experimental Psychiatry, 43(1), 638–646

Participants: 95 undergraduate students
Implicit Construct / Measure: Implicit self-esteem / Name Letter Task
Dependent Variables: depression, loneliness, suicidal ideation
Results: No significant direct relationship. Interaction between explicit and implicit self-esteem for suicidal ideation only, b = .28.

Franck, E., De Raedt, R., & De Houwer, J. (2007). Implicit but not explicit self-esteem predicts future depressive symptomatology. Behaviour Research and Therapy, 45(10), 2448–2455.

Participants: 28 clinically depressed patients; 67 non-depressed participants.
Implicit Construct / Measure: Implicit self-esteem / Name Letter Task
Dependent Variable: change in depression controlling for T1
Result: However, after controlling for initial symptoms of depression, implicit, t(48) = 2.21, p = .03, b = .25, but not explicit self-esteem, t(48) = 1.26, p = .22, b = .17, proved to be a significant predictor for depressive symptomatology at 6 months follow-up.

Franck, E., De Raedt, R., Dereu, M., & Van den Abbeele, D. (2007). Implicit and explicit self-esteem in currently depressed individuals with and without suicidal ideation. Journal of Behavior Therapy and Experimental Psychiatry, 38(1), 75–85.

Participants: Depressed patients with suicidal ideation (N = 15), depressed patients without suicidal ideation (N = 14) and controls (N = 15)
Implicit Construct / Measure: Implicit self-esteem / IAT
Dependent variable: Group status
Results: Contrast analysis revealed that the currently depressed individuals with suicidal ideation showed a significantly higher implicit self-esteem as compared to the currently depressed individuals without suicidal ideation, t(43) = 3.0, p < 0.01. Furthermore, the non-depressed controls showed a significantly higher implicit self-esteem as compared to the currently depressed individuals without suicidal ideation, t(43) = 3.7, p < 0.001.
[this finding implies that suicidal depressed patients have HIGHER implicit self-esteem than depressed patients who are not suicidal].

Glashouwer, K. A., de Jong, P. J., Penninx, B. W., Kerkhof, A. J., van Dyck, R., & Ormel, J. (2010). Do automatic self-associations relate to suicidal ideation? Journal of Psychopathology and Behavioral Assessment, 32(3), 428–437.

Participants: General population (N = 2,837)
Implicit Constructs / Measure: Implicit depression, Implicit Anxiety / IAT
Dependent variable: Suicidal Ideation, Suicide Attempt
Results: simple correlations
Depression IAT – Suicidal Ideation, r = .22
Depression IAT – Suicide Attempt, r = .12
Anxiety IAT – Suicide Ideation, r = .18
Anxiety IAT – Suicide Attempt, r = .11
Controlling for Explicit Measures of Depression / Anxiety
Depression IAT – Suicidal Ideation, b = .024, p = .179
Depression IAT – Suicide Attempt, b = .037, p = .061
Anxiety IAT – Suicide Ideation, b = .024, p = .178
Anxiety IAT – Suicide Attempt, b = .039, p = .046

Glenn, J. J., Werntz, A. J., Slama, S. J., Steinman, S. A., Teachman, B. A., & Nock, M. K. (2017). Suicide and self-injury-related implicit cognition: A large-scale examination and replication. Journal of Abnormal Psychology, 126(2), 199–211.

Participants: Self-selected online sample with high rates of self-harm (> 50%). Ns = 3,115 and 3,114
Implicit Constructs / Measure: Self-Harm, Death, Suicide / IAT
Dependent variables: Group differences (non-suicidal self-injury / control; suicide attempt / control)
Results:
Non-suicidal self-injury versus control
Self-injury IAT, d = .81/.97; Death IAT d = .52/.61, Suicide IAT d = .58/.72
Suicide Attempt versus control
Self-injury IAT, d = .52/.54; Death IAT d = .37/.32, Suicide IAT d = .54/.67
[these results show that self-ratings and IAT scores reflect a common construct; they do not show discriminant validity; no evidence that they measure distinct constructs and they do not show incremental predictive validity]

Hussey, I., Barnes-Holmes, D., & Booth, R. (2016). Individuals with current suicidal ideation demonstrate implicit “fearlessness of death.” Journal of Behavior Therapy and Experimental Psychiatry, 51, 1–9.

Participants: 23 patients with suicidal ideation and 25 controls (university students)
Implicit Constructs / Measure: Death attitudes (general / personal) / IRAP
Dependent variable: Group difference
Results: No main effects were found for either group (p = .08). Critically, however, a three-way interaction effect was found between group, IRAP type, and trial-type, F(3, 37) = 3.88, p = .01. Specifically, the suicidal ideation group produced a moderate “my death-not-negative” bias (M = .29, SD = .41), whereas the normative group produced a weak “my death-negative” bias (M = -.12, SD = .38, p < .01). This differential performance was of a very large effect size (Hedges’ g = 1.02).
[This study suggests that evaluations of personal death show stronger relationships than generic death]

McNulty, J. K., Olson, M. A., & Joiner, T. E. (2019). Implicit interpersonal evaluations as a risk factor for suicidality: Automatic spousal attitudes predict changes in the probability of suicidal thoughts. Journal of Personality and Social Psychology, 117(5), 978–997

Participants: Integrative analysis of 399 couples from 3 longitudinal studies of marriages.
Implicit Construct / Measure: Partner attitudes / evaluative priming task
Dependent variable: Change in suicidal thoughts (yes/no) over time
Result: (preferred scoring method)
without covariates, b = -.69, se = .27, p = .010.
with covariate, b = -.64, se = .29, p = .027

Nock, M. K., Park, J. M., Finn, C. T., Deliberto, T. L., Dour, H. J., & Banaji, M. R. (2010). Measuring the suicidal mind: Implicit cognition predicts suicidal behavior. Psychological Science, 21(4), 511–517.

Participants: 157 patients with mental health problems
Implicit Construct / Measure: death attitudes / IAT
Dependent variable: Prospective Prediction of Suicide
Result: controlling for prior attempts / no explicit covariates
b = 1.85, SE = 0.94, z = 2.03, p = .042

Tucker, R. P., Wingate, L. R., Burkley, M., & Wells, T. T. (2018). Implicit Association with Suicide as Measured by the Suicide Affect Misattribution Procedure (S-AMP) predicts suicide ideation. Suicide and Life-Threatening Behavior, 48(6), 720–731.

Participants: 138 students oversampled for suicidal ideation
Implicit Construct / Measure: suicide attitudes / AMP
Dependent variable: Suicidal Ideation
Result: simple correlation, r = .24
regression controlling for depression, b = .09, se = .04, p = .028

Taken together, the references show a mix of constructs, measures, and outcomes, and the p-values cluster just below .05. Not one of these p-values is below .005. Moreover, many studies relied on small convenience samples. The most informative study is the one by Glashouwer et al., which examined the incremental predictive validity of a depression IAT in a large, population-wide sample. The result was not significant and the effect size was less than r = .1. Thus, the references do not provide compelling evidence for dual-attitude models of depression.

Conclusion

Social psychologists have abused the scientific method for decades. Over the past decade, criticism of their practices has become louder, but many social psychologists ignore this criticism and continue to abuse significance testing and to misrepresent the results as if they provide empirical evidence that can inform understanding of human behavior. This article is just another example of the unwillingness of social psychologists to “clean up their act” (Kahneman, 2012). Readers of this article should be warned that the claims made in this article are not scientific. Fortunately, there is credible research on depression and suicide outside of social psychology.

False False Positive Psychology

The article “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”, henceforth the FPP article, is now a classic in the meta-psychological literature that emerged in the wake of the replication crisis in psychology.

The main claim of the FPP article is that it is easy to produce false positive results when researchers p-hack their data. P-hacking is a term coined by the FPP authors for the use of questionable research practices (QRPs, John et al., 2012). There are many QRPs, but they all have in common that researchers conduct more statistical analyses than they report and selectively report only the results of those analyses that produce a statistically significant result.

P-hacking does not cover all QRPs. John et al. (2012) also include fraud as a QRP, but I prefer to treat the fabrication of data as a distinct form of malpractice that clearly requires intent to deceive others. The more important difference between p-hacking and QRPs is that p-hacking does not consider publication bias. Publication bias implies that researchers fail to publish entire studies with non-significant results. The FPP authors are not concerned about publication bias because their main claim is that p-hacking makes it so easy to obtain significant results that it is unnecessary to discard entire datasets. After showing that a combination of four QRPs can produce false positive results with a 60% success rate (for alpha = .05), the authors hasten to warn readers that this is a conservative estimate because actual researchers might use even more QRPs: “As high as these estimates are, they may actually be conservative” (p. 1361).

The article shook the foundations of mainstream psychology because it suggested that most published results in psychology could be false positive results; that is, a statistically significant result was reported even though the reported effect does not exist. The FPP article provided a timely explanation for Bem’s (2011) controversial finding that humans have extrasensory abilities, which unintentionally contributed to the credibility crisis in social psychology (Simmons, Nelson, & Simonsohn, 2018; Schimmack, 2020).

In 2018, the FPP authors published their own reflection on their impactful article for a special issue on the most highly cited articles in Psychological Science (Simmons et al., 2018). In this article, the authors acknowledge that they used questionable research practices in their own work and knew that using these practices was wrong. However, like many other psychologists, they thought these practices were harmless because nothing substantial changes when a p-value is .04 rather than .06. Their own article convinced them that their practices were more like robbing a bank than jaywalking.

The FPP authors were also asked to critically reflect on their article and to comment on things they might have done differently with the benefit of hindsight. The main regret was the recommendation to require a minimum sample size of n = 20 per cell. After learning about statistical power, they realized that sample sizes should be justified based on power analysis. Otherwise, false positive psychology would simply become false negative psychology, where articles mostly report non-significant results even when real effects exist. To increase the credibility of psychological science it is necessary to curb the use of questionable research practices and to increase statistical power (Schimmack, 2020).

The 2018 reflections reinforce the main claim of the 2011 article that (a) p-hacking nil-effects to significance is easy and (b) that many published significant results might be false positive results. A blog post by the FPP authors in 2016 makes clear that the authors consider this to be the core findings of their article (http://datacolada.org/55).

In my critical examination of the FPP article, I challenge both of these claims. First, it is important to clarify what the authors mean by “a bit of p-hacking.” To use an analogy, what does a bit of making out mean? Answers range from kissing to intercourse. So, what do you actually have to do to have a 60% probability of getting pregnant? The FPP article falsely suggests that a bit of kissing may get you there. However, Table 1 shows that you actually have to f*&% the data to get a significant result.

The table also shows that it gets harder to p-hack results as the alpha criterion decreases. While the combination of four QRPs can produce 81.5% marginally significant results (p < .10), only 21.5% of attempts were successful with p < .01 as the significance criterion. One sensible recommendation based on this finding would be to disregard significant results with p-values greater than .01.

Another important finding is that each QRP alone increased the probability of a false positive result only slightly, from the nominal 5% to an actual level of no more than 12.6%. Based on these results, I would not claim that it is easy to get false positive results. I consider the combination of four QRPs in every study that is being conducted to be research fraud that is nevertheless sanctioned by professional organizations. That is, even if a raid of a laboratory found that a researcher actually uses this approach to analyze data, the researcher would not be considered to be engaging in fraudulent practices by psychological organizations like the Association for Psychological Science or by granting agencies.

The distinction between a little and massive p-hacking is not just a matter of semantics. It influences beliefs about the prevalence of false positive results in psychology journals. If it takes only a little bit of p-hacking to get false positive results, it is reasonable to assume that many published results are false positives. Hence, the title “False Positive Psychology.”

Aside from the simulation study, the FPP article also presents two p-hacked studies. The presentation of these two studies reinforces the narrative that p-hacking virtually guarantees significant results. At least, the authors do not mention that they also ran additional studies with non-significant results that they did not report. However, their own simulation results suggest that a file-drawer of non-significant studies should exist despite massive p-hacking. After all, the probability of getting two significant results in a row with a success probability of 60% is only 36%. This means that the authors were lucky to get the desired result, used even more QRPs to ensure a nearly 100% success rate, or failed to disclose a file-drawer of non-significant results. To examine these hypotheses, I simulated their actual p-hacking design of Study 2.

A Z-curve analysis of massive p-hacking

The authors do not disclose how they p-hacked Study 1. For Study 2 they provide the following information. The study had three groups (“When I’m Sixty-Four,” “Kalimba,” “Hot Potato”), and the “Hot Potato” condition was dropped like a hot potato. It is not clear how the sample size decreased from 34 to 20 as a result, but maybe participants were not equally assigned to the three conditions and there were 14 participants in the “Hot Potato” condition. The next QRP was that there were two dependent variables: actual age and felt age. Then there were a number of covariates, including bizarre and silly ones like the square root of 100 to enhance the humor of the article. In total, there were 10 covariates. Finally, the authors used optional stopping. They checked after every 10 participants. It is not specified whether they meant 10 participants per condition or in total, but to increase the chances of a significant result it is better to use smaller increments. So, I assume it was just 3 participants per condition.

To examine the long-run success rate of this p-hacking design, I simulated the following combination of QRPs: (a) three conditions, (b) two dependent variables, (c) 10 covariates, and (d) increasing sample size from n = 10 until N > 200 per condition in steps of 3. I ran 10,000 simulations of this p-hacking design. The first finding was that it provided a success rate of 77% (7718 / 10,000), which is even higher than the 60% success rate featured in the FPP article. Thus, more massive p-hacking partially explains why both studies were significant.
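For readers who want to experiment with this kind of design, the following R script is a minimal sketch of such a simulation, not the code behind the numbers reported here. To keep it short, it uses a single covariate instead of ten and a simplified set of analysis options (all pairwise condition comparisons, two correlated dependent variables, analyses with and without the covariate, optional stopping in steps of 3 per condition), so its long-run success rate will differ somewhat from the 77% above.

# Minimal sketch of a p-hacking simulation under the null hypothesis.
# Simplifying assumptions: one covariate (not ten), two correlated DVs,
# all pairwise condition comparisons, optional stopping in steps of 3.
set.seed(123)

phack_once <- function(n_start = 10, n_max = 200, step = 3, alpha = .05) {
  n <- n_start
  repeat {
    cond <- factor(rep(c("A", "B", "C"), each = n))   # three conditions, no true effect
    dv1  <- rnorm(3 * n)
    dv2  <- .5 * dv1 + sqrt(.75) * rnorm(3 * n)        # correlated second DV
    cov1 <- rnorm(3 * n)                               # illustrative covariate
    dat  <- data.frame(cond, dv1, dv2, cov1)

    ps <- c()
    for (dv in c("dv1", "dv2")) {
      for (drop in levels(dat$cond)) {                 # drop one condition at a time
        sub <- droplevels(dat[dat$cond != drop, ])
        m1  <- lm(reformulate("cond", dv), data = sub)             # without covariate
        m2  <- lm(reformulate(c("cond", "cov1"), dv), data = sub)  # with covariate
        ps  <- c(ps, summary(m1)$coefficients[2, 4], summary(m2)$coefficients[2, 4])
      }
    }
    if (min(ps) < alpha) return(c(sig = 1, n = n))     # report the "best" result
    if (n >= n_max)      return(c(sig = 0, n = n))     # give up: file-drawer study
    n <- n + step                                      # optional stopping: add and re-test
  }
}

res <- replicate(500, phack_once())
mean(res["sig", ])   # long-run success rate of this p-hacking design under the null

Tracking the returned sample sizes for the successful runs can also be used to examine the pattern discussed next: how the chances of a significant result change as data collection continues.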

The simulation also produced a new insight into p-hacking by examining the success rates for every increment in sample size (Figure 1). It is readily apparent that the chances of a significant result decrease as the sample size increases. The reason is that favorable sampling error in the beginning quickly produces significant results, whereas unfavorable sampling error in the beginning takes a long time to be reversed.

It follows that a smart p-hacker would not rely on optional stopping indefinitely, but would only continue data collection if the first test shows a promising trend. This is what Bem (2011) did to get his significant results (Schimmack, 2016). It is not clear why the FPP authors did not simulate optional stopping. However, the failure to include this QRP explains why they maintain that p-hacking does not leave a file drawer of non-significant results. In theory, adding participants would eventually produce a significant result, resulting in a success rate of 100%. However, in practice resources would often be depleted before a significant result emerges. Thus, even with massive p-hacking a file drawer of non-significant results is inevitable.

It is notable that both studies that are reported in the FPP article have very small sample sizes (Ns = 30, 34). This shows that adding participants does not explain the 100% success rate. This also means that the actual probability of a success on the first trial was only about 40% based on the QRP design for Study 2. This means the chance of getting two significant results in a row was only 16%. This low success rate suggests that the significant p-values in the FPP article are not replicable. I bet that a replication project would produce more non-significant than significant results.

In sum, the FPP article suggested that it is easy to get significant results with a little bit of p-hacking. Careful reading of the article and a new simulation study show that this claim is misleading. It requires massive p-hacking that is difficult to distinguish from fraud to consistently produce significant results in the absence of a real effect and even massive p-hacking is likely to produce a file-drawer of non-significant results unless researchers are willing to continue data collection until sample sizes are very large.

Detecting massive p-hacking

In the wake of the replication crisis, numerous statistical methods have been developed that enable the detection of bias introduced by QRPs (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012; Simonsohn et al., 2014; Schimmack, 2016). The advantage of z-curve is that it also provides valuable additional information, such as estimates of the success rate of replication attempts and information about the false discovery risk (Bartos & Schimmack, 2021).

Figure 2 shows the z-curve plot for the 10,000 p-values from the previous simulation of the FPP p-hacking design. To create a z-curve plot, the p-values are converted into z-scores, using the formula qnorm(1-p/2). Accordingly, a p-value of .05 corresponds to a z-score of 1.96, and all z-scores greater than 1.96 (the solid red line) are significant.
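In R, the conversion looks like this (a small illustration of the formula, not the z-curve package itself):

# Convert two-sided p-values into the absolute z-scores used for a z-curve plot.
p <- c(.30, .05, .01, .001)
z <- qnorm(1 - p / 2)                       # p = .05 maps onto z = 1.96
data.frame(p = p, z = round(z, 2), significant = z > qnorm(1 - .05 / 2))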

Visual inspection shows that z-curve is unable to fit the actual distribution of z-scores because the distribution of actual z-scores is even steeper than z-curve predicts. However, the distinction between p-hacking and other QRPs is irrelevant for the evaluation of evidential value. Z-curve correctly predicts that the actual discovery rate is 5%, which is expected when only false hypotheses are tested with alpha = .05. It also correctly predicts that the probability of a successful replication without QRPs is only 5%. Finally, z-curve also correctly estimates that the false discovery risk is 100%. That is, the distribution of z-scores suggests that all of the significant results are false positive results.

The results address outdated criticisms of bias-detection methods that they merely show the presence of publication bias. First, the methods do not care about the distinction between p-hacking and publication bias. All QRPs inflate the success rate, and bias-detection methods reveal inflated success rates (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). Second, while older methods merely showed the presence of bias, newer methods like z-curve also quantify the amount of bias. Thus, even if bias is always present, they provide valuable information about the amount of bias. In the present example, massive p-hacking produced massive bias in the success rate. Finally, z-curve 2.0 also quantifies the false positive risk after correcting for bias and correctly shows that massive p-hacking of nil-hypotheses produces only false positive results.

The simulation also makes it possible to revisit the influence of alpha on the false positive risk. A simple selection model predicts that only 20% of the results that are significant with alpha = .05 are still significant with alpha = .01. This follows from the uniform distribution of p-values under the null hypothesis, which implies that only .01/.05 = 20% of the p-values below .05 are also below .01. However, massive p-hacking clusters even more p-values in the range between .01 and .05. In this simulation only 6% (500 / 7,718) of the significant p-values were below .01. Thus, it is possible to reduce the false positive risk from 100% to close to 5% by disregarding all p-values between .05 and .01. Hence, massive p-hacking provides another reason for calls to adjust the alpha level for statistically significant results to .005 (Benjamin et al., 2017) to reduce the risk of false positive results even for p-hacked literatures.
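The 20% benchmark for honestly tested null hypotheses is easy to verify by simulation (an illustrative check, not part of the FPP article):

# Under honest testing of true null hypotheses, p-values are uniform, so only
# .01/.05 = 20% of the "significant" results survive the stricter alpha of .01.
set.seed(1)
p <- replicate(20000, t.test(rnorm(20), rnorm(20))$p.value)
mean(p < .01) / mean(p < .05)   # approximately 0.20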

In sum, since the FPP article was published it has become possible to detect p-hacking in actual data using statistical methods like z-curve. These methods work even when massive p-hacking was used; in fact, massive p-hacking makes the detection of bias easier, especially when it is used to produce false positive results. The development of z-curve makes it possible to compare the FPP scenarios with 60% or more false positive results to actual p-values in published journals.

How Prevalent is Massive P-Hacking?

Since the FPP article was published, other articles have examined the prevalence of questionable research practices. Most of these studies rely on surveys (John et al., 2012; Fiedler & Schwarz, 2016). The problem with survey results is that they do not provide sufficient information about the amount or severity of p-hacking. Furthermore, it is possible that many researchers use QRPs when they are testing a real effect. This practice would inflate effect sizes, but it does not increase the risk of false positive results. These problems are addressed by z-curve analyses of published results. Figure 3 shows the results for Motyl et al.’s (2017) representative sample of test statistics in social psychology journals.

The z-curve plot of actual p-values differs in several ways from the z-curve plot of massive p-hacking. The estimated discovery rate is 23% and the estimated replication rate is 45%. The point estimate of the false discovery risk is only 18%, suggesting that no more than a quarter of published results are false positives. However, due to the small set of p-values, the 95%CI around the point estimate of the false positive risk reaches all the way to 70%. Thus, it remains unclear how high the false positive risk in social psychology is.

Results from a bigger coding project help to narrow down the uncertainty about the actual EDR in social psychology. This project coded at least 20 focal hypothesis tests from the most highly cited articles by eminent social psychologists, where eminence was based on the H-Index (Radosic & Diener, 2021). This project produced 2,208 p-values. The z-curve analysis of these p-values closely replicated the point estimates for Motyl et al.’s (2017) data (EDRs 26% vs. 23%, ERR 49% vs. 45%, FDR 15% vs. 18%). The confidence intervals are narrower and the upper limit of the false positive risk decreased to 47%.


However, combining the two samples did not notably reduce the confidence interval around the false discovery risk, 15%, 95%CI = 10% to 47%. Thus, up to 50% of published results in social psychology could be false positive results. This is an unacceptably high risk of false positive results, but the risk may seem small in comparison to a scenario where a little bit of p-hacking can produce over 60% false positive results.

In sum, empirical analyses of actual data suggest that false positive results are not as prevalent as the FPP article suggested. The main reason for the relatively low false positive risk is not that QRPs are rare. Rather, QRPs also help to inflate success rates when a small true effect exists. If effect sizes were not important, it might seem justifiable to reduce false negative rates with the help of QRPs. However, effect sizes matter, and QRPs inflate effect size estimates by over 100% (Open Science Collaboration, 2015). Thus, p-hacking is a problem even if it does not generate a high rate of false positive results.

Individual Differences in Massive P-Hacking?

Psychologists study different topics and use different methods. Some research areas and some research methods have many true effects and high power to detect them. For example, cognitive psychology appears to have few false positive results and relatively high replication rates (Open Science Collaboration, 2015; Schimmack, 2020). In contrast, between-subject experiments in social psychology are the most likely candidate for massive p-hacking and high rates of false positive results (Schimmack, 2020). As researchers focus on specific topics and paradigms, they are more or less likely to require massive p-hacking to produce significant results. To examine this variation across researchers, I combined the data from the 10 eminent social psychologists with the lowest EDR.

The results are disturbing. The EDR of 6% is just one percentage point above the 5% that is expected when only nil-hypotheses are tested and the 95%CI includes 5%. The upper limit reaches only 14%. The corresponding false discovery risk is 76% and the 95%CI includes 100%. Thus, the FPP article may describe the actual practices of some psychologists, but not the practices of psychology in general. It may not be surprising that one of the authors of the FPP article has a low EDR of 16%, even if the analysis is not limited to focal tests (Schimmack, 2021). It is well-known that the consensus bias leads individuals to project themselves onto others. The present results suggest that massive p-hacking of true null-results is the exception rather than the norm in psychology.

The last figure shows the z-curve plot for the 10 social psychologists with the highest EDR. The z-curve looks very different and shows that not all researchers were massive p-hackers. There is still publication bias because the ODR, 91%, matches the upper limit of the 95%CI of the EDR, 66% to 91%, but the amount of bias is much smaller (91% – 78% = 13%) than for the other extreme group (90% – 6% = 84%). As this comparison is based on extreme groups, a replication study would show a smaller difference due to regression to the mean, but the difference is likely to remain substantial.

In sum, z-curve analysis of actual data can be used to evaluate how prevalent massive p-hacking actually is. The results suggest that only a minority of psychologists consistently used massive p-hacking to produce significant results that have a high risk of being false positive results.

Discussion

The FPP article made an important positive contribution to psychological science. The recommendations motivated some journals and editors to implement policies that discourage the use of QRPs and motivate researchers to preregister their data analysis plans. At the same time, the FPP article also had some negative consequences. The main problem with the article is that it ignored statistical power, false negatives, and effect sizes. That is, the article suggested that the main problem in psychological science is a high risk of false positive results. Instead, the key problem in psychological science remains qualitative thinking in terms of true and false hypotheses that is rooted in the nil-hypothesis ritual that is still being taught to undergraduate and graduate students. Psychological science will only advance by replacing nil-hypothesis testing with quantitative statistics that take effect sizes into account. However, the FPP article succeeded where previous meta-psychologists failed by suggesting that most published results are false positives. It therefore stimulated much needed reforms that decades of methodological criticism failed to deliver.

False False Positive Psychology

The main claim of the FPP article was that many published results in psychology journals could be false positives. Unfortunately, the focus on false positive results has created a lot of confusion and misguided attempts to quantify false positive results in psychology. The problem with false positives is that they are mathematical entities rather than empirically observable phenomena. Based on the logic of nil-hypothesis testing, a false positive result requires an effect size that is exactly zero. Even a significant result with a population effect size of d = 0.0000000001 would count as a true positive result, although it is only possible to produce significant results for this effect size with massive p-hacking.

Thus, it is not very meaningful to worry about false positive results. For example, ego-depletion researchers have seen effect sizes shrink from d = .6 to d = .1 in studies without p-hacking. Proponents of ego-depletion point to the fact that d = .1 is different from 0 and claim that it supports their theory. However, the honest effect size invalidates hundreds of studies that claim to have demonstrated the effect for different dependent variables and under different conditions. None of these p-hacked studies are credible, and each one would require a new replication study with over 1,000 participants to see whether the small effect size is really observed for a specific outcome under specific conditions. Whether the effect size is really zero or small is entirely irrelevant.

A study with N = 40 participants and d = .1 has only about 6% power to produce a significant result. Thus, there is hardly a difference between a study with a true null-effect (5% power) and a study with a small effect size. Nothing is learned from a significant result in either case, and as Cohen once said, “God hates studies with 5% power as much as studies with 6% power.”
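The power figure is easy to check in R (a quick sketch, assuming a two-sample t-test with 20 participants per group):

# Power of a two-group study with N = 40 (n = 20 per group) and a true effect of d = .1.
power.t.test(n = 20, delta = 0.1, sd = 1, sig.level = .05)$power   # roughly .06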

To demonstrate that an effect is real, it is important to show that the majority of studies are successful without the use of questionable research practices (Brunner & Schimmack, 2020; Cohen, 1994). Thus, the empirical foundation of a real science requires (a) making true predictions, (b) designing studies that can provide evidence for the prediction, and (c) honest reporting of results. The FPP article illustrated the importance of honest reporting of results. It did not address the other problems that have plagued psychological science since its inception. As Cohen pointed out, results have to be replicable, and to be replicable, studies need high power. Honest reporting alone is insufficient.

P-Hacking versus Publication Bias

In my opinion, the main problem of the FPP article is the implicit claim that publication bias is less important than p-hacking. This attitude has led the authors to claim that bias detection tools are irrelevant because “we already know that researchers do not report 100% of their nonsignificant studies and analyses” (Nelson, Simmons, & Simonsohn, 2018). This argument is invalid for several reasons. Most important, bias detection tools do not distinguish between p-hacking and publication bias. As demonstrated here, they detect p-hacking as well as publication bias. For the integrity of a science it is also not important whether 1 researcher tests 20 dependent variables in one study or 20 researchers test 1 dependent variable in 20 independent studies. As long as results are only reported when they are significant, p-hacking and publication bias introduce bias in the literature and undermine the credibility of a science.

It is also unwarranted to make the strong claim that publication bias is unavoidable. Pre-registration and registered reports are designed to ensure that there is no bias in the reporting of results. Bias-detection methods can be used to verify this assumption. For example, they show no bias in the replication studies of the Open Science Collaboration project.

Third, new developments in bias detection methods do not only test for the presence of bias. As shown here, a comparison of the ODR and EDR provides a quantitative estimate of the amount of bias and small amounts of bias have no practical implications for the credibility of findings.

In conclusion, it makes no sense to draw a sharp line between hiding of dependent variables and hiding of entire studies. All questionable research practices produce bias and massive use of QRPs leads to more bias. Bias-detection methods play an important role in verifying that published results can be trusted.

Reading a Questionable Literature

One overlooked implication of the FPP article is the finding that it is much harder to produce significant results with p-hacking if the significance criterion is lowered from .05 to .01. This provides an easy solution to the problem of how psychologists should interpret findings in psychology journals with dramatically inflated success rates (Sterling, 1959; Sterling et al., 1995). The success rate and the false positive risk can be reduced by adjusting alpha to .01 or even .005 (Benjamin et al., 2017). In this way it is possible to build on the subset of published results that provide credible evidence.

Conclusion

The FPP article was an important stepping stone in the evolution of psychology towards becoming a respectable science. It alerted many psychologists who were exploiting questionable research practices to various degrees that these practices were undermining the credibility of psychology as a science. However, one constant in science is that science is always evolving. The other one is that scientists who made a significant contribution think that they reached the end of history. In this article, I showed that meta-psychology has evolved over the past decade since the FPP article appeared. Ten years later, it is clear that massive p-hacking of nil-results is the exception rather than the norm in psychological science. As a result, the false positive risk is lower than feared ten years ago. However, this does not imply that psychological science is credible. The reason is that success rates and effect sizes in psychology journals are severely inflated by the use of questionable research practices. This makes it difficult to trust published results. One overlooked implication of the FPP article is that p-values below .01 are much more trustworthy than p-values below .05 because massive p-hacking mostly produces p-values between .05 and .01. Thus, one recommendation for readers of psychology journals is to ignore results with p-values greater than .01. Finally, bias detection tools like z-curve can be used to assess the credibility of published literatures and to correct for the bias introduced by questionable research practices.

Frequency Judgments of Emotions: How Happy Were You Last Week?

Cite as: Schimmack, U. (1997). Frequency Judgments of Emotions: How Accurate are They and How are They Made? Unpublished dissertation. Free University Berlin.

Preface

“If you haven’t read it, it is new to you.”

I received my Ph.D. from the Free University Berlin in 1997. My dissertation contained two daily diary studies and two laboratory experiments. The main question that intrigued me at that time was how individuals make judgments about the frequency of their emotions (e.g., how often did you feel happy in the past week or month). I was also interested in the accuracy of these judgments because they are routinely used in personality questionnaires and in the measurement of subjective well-being. My dissertation never got published. I was fortunate that I was able to publish other work. Otherwise, my life might have turned out differently.

Although 23 years is a long time in real sciences, psychology is not a real science. The past 20 years have been wasted on questionable research to support false theories like Kahneman and Tversky’s (1973) influential availability heuristic. It now turns out that key results cannot be replicated (Schimmack, 2019). As you can see from my dissertation, cognitive psychologists already showed in the 1980s that ease of retrieval is not a plausible model of frequency estimation. Social psychologists simply ignored this work. So, now that the ease-of-retrieval model has failed, it may be a good time to introduce social and personality psychologists to cognitive models of frequency estimation that were developed in the 1980s. These models may provide a framework for applied research on frequency judgments of emotions and behaviors that are routinely used to measure personality traits (Fleisher, Woehr, Edwards, & Cullen, 2011).

DISSERTATION

Frequency Judgments of Emotions: How Accurate are They and How are They Made?

Ulrich Schimmack

Freie Universität Berlin
Fachbereich Erziehungswissenschaften, Psychologie und Sportwissenschaften

Betreuung durch Prof. Dr. Hubert Feger

Januar 1997

DEDICATION
To my grandparents Dr. Gerhard Walpert and Martha Walpert

ACKNOWLEDGMENT
Thanks are due to the many individuals and institutions that made this dissertation possible: Hubert Feger for the academic freedom and the resources needed to carry out the empirical investigations; “Studienstiftung des deutschen Volkes,” “Deutscher Akademischer Austauschdienst,” and my parents Frank and Liesel Schimmack for financial support; Ed Diener, Shigeh Oishi, and Mark Suh for collaboration on Study 1; Stephan Dutke, Hubert Feger, Bärbel Knäuper, Rainer Reisenzein, Thomas Rodenhausen, Matthias Siemer, and Germmi Temme for valuable comments on drafts of the dissertation; and, last but not least, Phanikiran Radhakrishnan and Joachim Stöber for social support.

1 ABSTRACT

The frequency of emotional experiences is an important topic for several basic and applied domains in psychology. Most studies investigating the frequency of emotions rely on retrospective self-reports of emotional experiences (e.g., “How frequently did you feel happy in the last month?”). However, relatively little is known about (a) the accuracy of such retrospective frequency estimates of emotions, (b) the representation of information about the frequency of emotions in memory, and (c) the cognitive processes underlying frequency judgments of emotions. The present dissertation addresses these questions in four studies. In two field studies, averaged daily frequency estimates of emotions are compared with frequency judgments of emotions extending over several weeks, to test the accuracy of the latter judgments. In two experiments, the cognitive processes underlying frequency judgments of emotions were investigated under controlled conditions. In these studies, participants first rated their likely emotional reactions to several hypothetical scenarios and then judged the frequency of emotions in these scenarios. The results indicate that absolute frequency estimates of emotions underestimate the actual frequencies of emotional experiences, but they accurately discriminate between the frequencies of emotions across different emotions as well as across participants. In addition, the studies provide further support for the familiarity model of frequency judgments of emotions (Hintzman, 1988; Schimmack & Reisenzein, in press). This model assumes that memories of emotional experiences are stored in separate memory traces in episodic memory. When a frequency judgment of an emotion is required, memories of experiences of this emotion are activated in parallel. This generates a feedback signal, which is experienced as a feeling of familiarity. The more memory traces are activated, the stronger the feeling of familiarity, and the higher the frequency judgment. This model is contrasted with (a) models that assume the direct encoding of frequency information in memory and (b) models that assume the retrieval of memories into consciousness.
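[Editorial illustration: to make the assumed mechanism concrete, here is a minimal R sketch in the spirit of Hintzman’s (1988) multiple-trace (MINERVA 2) account that the familiarity model builds on. All numerical details (feature coding, encoding probability, cubing of similarities) are illustrative assumptions, not the dissertation’s own implementation.]

# Each emotional episode leaves its own memory trace; a frequency probe activates
# all traces in parallel, and the summed activation ("echo intensity") serves as
# the feeling of familiarity on which the frequency judgment is based.
set.seed(42)
n_features <- 40
emotion    <- sample(c(-1, 1), n_features, replace = TRUE)   # prototype of one emotion

encode_episode <- function(prototype, encoding_prob = .7) {
  # an episode is an imperfect copy of the prototype: some features are not stored
  ifelse(runif(length(prototype)) < encoding_prob, prototype, 0)
}

echo_intensity <- function(probe, traces) {
  # each trace is activated by its (cubed) similarity to the probe; activations are summed
  sims <- apply(traces, 1, function(trace) sum(probe * trace) / length(probe))
  sum(sims^3)
}

for (freq in c(2, 5, 10, 20)) {
  traces <- t(replicate(freq, encode_episode(emotion)))
  cat("episodes:", freq, " echo intensity:", round(echo_intensity(emotion, traces), 2), "\n")
}

The printed echo intensities increase with the number of stored episodes, which is the monotonic relation between familiarity and frequency that the model exploits.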

2 INTRODUCTION

2.1 What is the Frequency of Emotions?

More than a hundred years after James’s (1884) question “What is an emotion?”, researchers of emotions still search for an answer to this question. In the last decade, however, several researchers used prototype theory (Rosch, 1975) to address this question empirically. Prototype theory at least allows the identification of a set of typical emotions (cf. Fehr & Russell, 1984). Typical emotions are, for example, love, hate, joy, anger, fear, and sadness.

Schimmack and Reisenzein (1994) extended this work, demonstrating that emotions can be differentiated from moods by means of typicality ratings: Some concepts denote emotions but not moods (e.g., love, pride), others denote moods but not emotions (e.g., relaxed, nervous), and there are also concepts that denote both moods and emotions (e.g., happy, sad). Furthermore, Schimmack and Siemer (1995) found that intentionality or object directedness is a characteristic that differentiates typical emotion concepts (i.e., one is proud about something) from typical mood concepts (i.e., one feels relaxed, but not relaxed about something). This finding supports cognitive emotion theories, such as Stumpf’s theory (cf. Reisenzein & Schönpflug, 1992), in which intentionality is a necessary feature of emotions that differentiates them from other affective states.

Furthermore, the object directedness of emotions explains their episodic nature; that is, emotions are elicited by the cognitive appraisal of an event for one’s own well-being, are maintained as long as the feeling is directed at an object, and are terminated once the thoughts are no longer directed at the object (Lazarus, 1991). The aroused affect might remain; this is, however, often considered a mood and no longer an emotion (cf. Ekman & Davidson, 1994, chapter 2). Due to the episodic nature of emotional experiences like pride, disappointment, gratitude, or shame, there are times in which individuals experience a particular emotion and times in which they do not. As a consequence, it is possible to ask people at any moment in time whether they feel a particular emotion or not. If this question is asked repeatedly, one can determine the frequency with which the individual experienced the emotion. In other words, the question about the frequency of emotions is about the number of times that a particular emotion has been elicited. The frequency of emotional experiences has to be differentiated from two other important characteristics of emotional experiences, namely their duration and their intensity. The differences between these three features of emotional experiences are illustrated in Figure 1, which shows an individual’s experience of a particular emotion over time. Most of the time the individual does not experience this emotion at all (i.e., the intensity equals zero).

Over the time interval displayed in Figure 1, the emotion is elicited two times (i.e., there are two changes from zero to an intensity greater than zero), so that the frequency of the emotion equals two. The duration of the first emotional experience is longer than that of the second episode (i.e., a longer stretch along the time axis with intensity greater than zero). Regarding the intensity of an emotional experience, different definitions have been proposed: (a) the maximum intensity at the peak of the experience or (b) the integral of the area under the curve from the beginning to the end of the emotional episode (Frijda, Ortony, Sonnemans, & Clore, 1992).
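[Editorial note: expressed as formulas, a sketch of the two intensity definitions just mentioned, writing i(t) for the momentary intensity and t_on, t_off for the onset and offset of an episode:]

I_{\text{peak}} = \max_{t \in [t_{\text{on}},\, t_{\text{off}}]} i(t), \qquad I_{\text{area}} = \int_{t_{\text{on}}}^{t_{\text{off}}} i(t)\, dt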

Duration and intensity are important aspects of emotional experiences which can vary independently from the frequency of emotional experiences. For example, Schimmack and Diener (in press) demonstrated that individuals who experience emotions frequently do not necessarily experience emotions more intensely. Hence, frequency, intensity, and duration should be studied separately. The present dissertation focuses exclusively on the frequency of emotions.

2.2 Why Study Frequency Judgments of Emotions?

The frequency of emotions is important in many everyday life situations. For example, a person who often feels joy seeing a friend is likely to spend time with this friend in the future, whereas a person who often feels fear of flying is likely to use other means of transportation. Apparently, information about the frequency of emotions in past situations can serve as a guide for future behavior (cf. Emmons & Diener, 1986a).

The frequency of emotions is also used as information in the formation of impressions of oneself (e.g., “I am an emotional person”) or of others (e.g., “She is cold-blooded”). Because the frequency of emotions is an important source of information in everyday life, it is not surprising that it is also relevant for psychological disciplines, such as personality, social, cross-cultural, clinical, and industrial and organizational psychology.

I briefly summarize some prevalent questions regarding the frequency of emotions in these fields of inquiry. This short review shows that a valid assessment of the frequency of emotions is a necessary requirement for some of the research in these diverse fields, but that the validity of the most often used measure of the frequency of emotions, that is, retrospective frequency judgments of emotions, is not yet firmly established. Indeed, various researchers are skeptical about the validity of retrospective reports in general. For example, Lewinsohn and Rosenbaum (1987) state that “retrospective memory should probably never be construed to represent what really occurred” (p. 618). However, evidence for such extreme claims is scarce (cf. Brewin, Andrews, & Gotlib, 1993). The present set of studies examines this issue for retrospective judgments of the frequency of emotional experiences.

2.2.1 Relevance for Personality Psychology

Over the last years, research on stable individual differences in the experience of affect has increased considerably. In several studies, retrospective estimates of experienced pleasant affect were correlated with extraversion and retrospective estimates of experienced unpleasant affect were correlated with neuroticism (Costa & McCrae, 1980; Emmons & Diener, 1986b; Izard, Libero, Putnam, & Haynes, 1993; Larsen & Diener, 1992; Meyer & Shack, 1989; Pavot, Diener, & Fujita, 1990; Watson & Clark, 1992). Some researchers even found these personality dimensions and experiences of affect to be so highly correlated that they equated neuroticism with the disposition to experience unpleasant affect and extraversion with the disposition to experience pleasant affect (Meyer & Shack, 1989).

The finding that extraversion and neuroticism are also correlated with averaged daily reports of affect suggests that this relation is substantial and that the retrospective judgments do possess some validity (Emmons & Diener, 1986b). However, it is possible that the correlations between retrospective estimates of experienced affect and personality traits overestimate the true strength of the relation between these traits and the actual frequencies of experienced affect. This hypothesis is suggested by a meta-analysis (Schimmack, 1996b), which shows that correlations between the two traits and the amount of experienced pleasant and unpleasant affect were higher for retrospective estimates than for averaged daily ratings of affect. Therefore, at least part of the shared variance between self-reports of personality traits and amount of experienced affect could be due to a so-called personality-congruent memory bias (Martin, 1985). That is, people overestimate experiences of affect that are consistent with their personality. For example, neurotic individuals tend to overestimate the amount of unpleasant affect, whereas extraverted individuals tend to overestimate the amount of pleasant affect (Diener, Larsen, & Emmons, 1984). One aim of the present dissertation is to explore the presence of personality-congruent biases in retrospective frequency estimates of emotions.

Furthermore, previous studies of the relation between personality and trait affect often did not differentiate between emotions and moods, and often did not separate the frequency of emotions from their typical intensity or duration. However, it has been demonstrated that individual differences in the frequency and the typical intensity of emotional experiences are separable constructs (cf. Diener, Larsen, Levine, & Emmons, 1985; Schimmack & Diener, in press). Therefore, the present dissertation also explores the structure of individual differences in the frequency of emotional experiences.

2.2.2 Relevance for Clinical Psychology

Abnormally frequent or infrequent experiences of emotions are symptoms of several psychological disorders (cf. Andreasen & Black, 1991). For example, in the diagnostic system DSM-III-R, symptoms of the paranoid personality disorder are frequent experiences of distrust, fear, jealousy, and resentment. Symptoms of the schizoid personality disorder are very infrequent experiences of rage and joy. And a symptom of the narcissistic personality disorder is the frequent experience of envy.

For a practitioner, it is difficult to assess these symptoms because epidemiological data about the frequency of these emotions in the general population are lacking. The results of the present dissertation can serve as a first guideline about the prevalence of emotions like envy or joy, although the results are limited to a student population. One aim of the dissertation is to suggest strategies that allow an economical but accurate estimation of the prevalence of emotions in the general population.

A second problem for the diagnostician is that it is unknown whether a patient’s reported frequencies of an emotion are accurate. It could be that psychological disorders bias the self-report of past emotional experiences. For example, depressed patients might overestimate the frequency of their unpleasant emotional experiences (Fitzgerald, Slade, & Lawrence, 1988). However, the evidence concerning long-term memory deficits due to psychological disorders is mixed (see Brewin et al., 1993, for a review). An investigation of the cognitive processes underlying frequency judgments of emotions could help to determine when a patient’s self-report is likely to be accurate and when it might be biased. It might also help to develop measurement instruments that are least susceptible to distortions.

2.2.3 Relevance for Social and Cross-Cultural Psychology

People often experience emotions in social situations (cf. Scherer, Wallbott, & Summerfield, 1986). Because the structure of a society influences the type of social situations that its members encounter, it is likely that social factors influence the frequency of emotions. For example, Briggs (1970) claimed that the Inuit never experience anger (but see Briggs, 1987). In contrast, the high homicide rate in the USA suggests that anger-related emotions such as anger, rage, or hate are experienced quite frequently in the United States, especially in the South (Cohen, 1996).

Evidently, understanding the cultural factors that influence the frequency of emotions – and the actions motivated by them – is relevant for political decisions. One important cultural dimension that is likely to influence the frequency of emotional experiences is the individualism-collectivism dimension. Because individualistic cultures provide looser social networks than collectivistic cultures (cf. Hofstede, 1980; Triandis, 1994), members of individualistic cultures might experience loneliness more frequently, but shame less frequently. Markus and Kitayama (1994) report a study in which the frequency of joy was more highly correlated with the frequency of pride in the USA (an individualistic culture) than in Japan (a collectivistic culture), indicating that achievement situations are a stronger source of happiness in individualistic cultures.

Studying the frequency of emotions is also important because the frequency of pleasant versus unpleasant emotions is one component of Subjective Well-Being (cf. Diener, 1984; Diener, Sandvik, & Pavot, 1991). Recently, Diener, Diener, and Diener (1995; Diener & Diener, 1995) explored what differentiates “happy” from “unhappy” nations. They found happy nations to be more affluent, individualistic, and democratic. These findings have potential implications for political questions such as whether China can promote the happiness of its people by economic growth, but without changing its political system. The interpretation of cross-cultural studies is, however, based on a number of assumptions. Among others, one basic assumption is that people can accurately estimate the frequency of their emotional experiences.

2.2.4 Relevance for Industrial and Organizational Psychology

Traditionally, job satisfaction is measured as an evaluative judgment or an attitude (Pekrun & Frese, 1992). Temme and Tränkle (1996) pointed out that this approach neglects the emotional aspects of work. An individual might be satisfied with his or her job, but only rarely experience emotions such as joy or pride. Similarly, global evaluations of one’s life are correlated with, but distinct from, measures of the frequency of pleasant versus unpleasant emotions (cf. Pavot & Diener, 1993). Therefore, the traditional assessment of job satisfaction should be complemented with the assessment of the frequencies of pleasant and unpleasant emotions at work. It is therefore an important question whether people can make accurate estimates of the frequencies of their emotional experiences in different contexts. Otherwise, the frequency judgments would reflect not only the number of joy experiences at work, but also joy experiences in other contexts. Indeed, several frequency judgment models, which are reviewed later, predict that people lack the ability to discriminate frequencies in different contexts. If this were true, assessing frequencies of emotions at work would be difficult.

2.3 Previous Research on Frequency Judgments of Emotions

2.3.1 Accuracy

Accuracy of frequency judgments can be globally defined as the agreement between frequency judgments and actual frequencies. Conversely, a lack of accuracy is indicated by deviations of frequency judgments from actual frequencies. Because these deviations can be computed in various ways, several types of accuracy can be distinguished (cf. Naveh-Benjamin & Jonides, 1986; Thomas & Diener, 1990).

Absolute and relative accuracy compare the absolute level of actual and estimated frequencies. Absolute accuracy is defined as the ability of individuals to estimate the absolute frequency of an emotion accurately, irrespective of the direction of the estimation error; that is, whether the actual frequencies are over- or underestimated. A common index of absolute accuracy is the standard deviation of the frequency judgments from the actual frequencies (Naveh-Benjamin & Jonides, 1986). In contrast, relative accuracy takes the direction of the error into account. A common measure of relative accuracy is the difference between the actual and estimated absolute frequency (estimated minus actual). In contrast to absolute accuracy, relative accuracy indicates whether the actual frequencies are over- or underestimated.
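
As an illustration, the two indices can be computed as in the following minimal sketch (Python); the numbers and variable names are invented for this example and are not data from any of the cited studies.

import numpy as np

# Hypothetical actual and estimated weekly frequencies of five emotions
# for one participant (invented numbers, for illustration only).
actual    = np.array([28.0, 10.0, 4.0, 15.0, 2.0])
estimated = np.array([ 7.0,  8.0, 5.0, 10.0, 3.0])

# Absolute accuracy: standard deviation of the estimates from the actual
# frequencies (root of the mean squared deviation); the direction of the
# error is ignored, so smaller values indicate higher absolute accuracy.
absolute_deviation = np.sqrt(np.mean((estimated - actual) ** 2))

# Relative accuracy: mean signed difference (estimated minus actual);
# negative values indicate underestimation, positive values overestimation.
relative_deviation = np.mean(estimated - actual)

print(absolute_deviation, relative_deviation)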
In the literature on frequency judgments of emotions, absolute and relative accuracy have been neglected. So far, the only study of relative accuracy compared the actual number of pleasant days (i.e., days on which a person experienced more pleasure than displeasure) with the estimated number of pleasant days (Thomas & Diener, 1990). The results indicated that participants underestimate the number of pleasant days.

Because the number of unpleasant days is by definition perfectly inversely related to the number of pleasant days, the study also demonstrated that participants overestimated the number of unpleasant days. Because most participants experienced more pleasant than unpleasant days, this finding is consistent with the finding that frequency estimates regress toward the mean due to information loss (Howell, 1973; Fiedler, 1991). In contrast to the number of pleasant versus unpleasant days, the present dissertation explores for the first time the absolute and relative accuracy of frequency estimates of single emotions such as joy, anger, or sadness.

Two other types of accuracy are not concerned with the absolute frequency of a single entity (e.g., an emotion), but rather test how accurately frequency estimates discriminate between actual frequencies of different entities, where the entities can be the stimuli or the participants. These types of accuracy are subsequently called discriminative accuracy. A common index of discriminative accuracy is the Pearson correlation coefficient between actual and estimated frequencies computed across different entities.
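
Continuing in the same spirit, the following sketch (Python, with invented data) illustrates how discriminative accuracy could be computed across emotions and across participants, the two variants discussed in the next paragraphs.

import numpy as np

# Hypothetical actual and estimated weekly frequencies:
# rows = participants, columns = emotions (invented numbers).
actual    = np.array([[28.0, 10.0, 4.0], [20.0, 15.0, 6.0], [35.0,  5.0, 2.0]])
estimated = np.array([[ 7.0,  8.0, 5.0], [10.0, 12.0, 4.0], [15.0,  6.0, 3.0]])

def pearson(x, y):
    # Pearson correlation between two one-dimensional arrays.
    return np.corrcoef(x, y)[0, 1]

# Discriminative accuracy across emotions: for each participant, correlate
# actual and estimated frequencies across the emotion columns.
across_emotions = [pearson(actual[i], estimated[i]) for i in range(actual.shape[0])]

# Discriminative accuracy across participants: for each emotion, correlate
# actual and estimated frequencies across the participant rows.
across_participants = [pearson(actual[:, j], estimated[:, j]) for j in range(actual.shape[1])]

print(across_emotions, across_participants)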

Experimental studies investigate the ability of frequency estimates to discriminate the actual frequencies of the stimuli, which are experimentally manipulated. In the present context, the stimuli are different emotions such as joy, fear, or gratitude. Therefore, this type of accuracy is called discriminative accuracy across emotions; it has not been tested in previous studies of frequency judgments of emotions. Exploring the discriminative accuracy across emotions has, however, several advantages. First, it can be tested in field and experimental studies of frequency judgments of emotions, which is not true for the discriminative accuracy across participants introduced below.

Second, tests of different models of frequency judgments are often based on the ability of a potential mediator variable (e.g., the number of recalled exemplars) to discriminate between actual frequencies of different emotions (cf. Fitzgerald et al., 1988). This is, however, only meaningful if the frequency estimates possess discriminative accuracy across emotions. As a consequence, the accuracy of frequency judgments across emotions is investigated in the present dissertation.

In contrast to experimental studies, personality psychologists are mainly concerned with the ability of frequency judgments of emotions to discriminate actual frequencies of emotions experienced by different participants (Diener, Smith, & Fujita, 1995; Feldman Barrett, in press; Parkinson, Briner, Reynolds, & Totterdell, 1995; Thomas & Diener, 1990). This type of accuracy is subsequently called discriminative accuracy across participants. For example, Diener et al. (1995) studied individual differences in the experience of six types of emotions (see below). Each type was measured by four items. At the end of 52 consecutive days, the participants rated the time that they experienced each emotion on the particular day. Before and after the diary period, the participants also made time judgments for the previous month. The correlations between the averaged pre- and post-diary judgments and the averaged daily judgments were r = .69 for threat emotions (e.g., fear), r = .61 for bad-other emotions (e.g., anger), r = .52 for bad-self emotions (e.g., shame), r = .64 for separation emotions (e.g., sadness), r = .65 for good-other emotions (e.g., love), and r = .68 for pleasure emotions (e.g., joy). This finding shows quite high discriminative accuracy across participants.

In addition, the correlations between actual and estimated times of emotions were higher within the same type of emotions (pleasure–pleasure) than across different types of emotions (pleasure–good-other), indicating that the frequency judgments were emotion specific. This pattern of results rules out a simple response set explanation. Feldman Barrett (in press) reported correlations ranging from r = .59 to .76 between averages of repeated momentary mood ratings over a 90-day period and retrospective estimates of these averages after the diary period. Again, in this study frequency was not measured in a pure fashion, because the average of repeated intensity ratings comprises frequency and intensity information (Schimmack & Diener, in press). Furthermore, previous studies might overestimate the true discriminative accuracy across participants, because the frequency judgments were made (partly) after the diary study. Therefore, it is possible that participants based their post-diary judgments on memories of their daily judgments, rather than their daily experiences, a source of information that is not available under natural circumstances.

An influence of daily ratings on the subsequent post-diary judgments is especially likely because the post-diary judgments were made on the same response scale as the daily ratings. This hypothesis is further strengthened by Thomas and Diener’s (1990) finding that the correlation between pre-diary estimates of the number of happy days and the number of happy days experienced during the diary period was lower than the correlation obtained for the post-diary estimates. To test this hypothesis, the present dissertation followed Thomas and Diener’s (1990) approach to compare pre- and post-diary frequency judgments of emotions. Furthermore, different response scales were used for the assessment of the daily frequencies and the retrospective frequency judgments.

In sum, previous studies suggest that the discriminative accuracy across participants is fairly high (Diener, Smith et al., 1995; Feldman Barrett, in press; Parkinson et al., 1995; Thomas & Diener, 1990). However, in none of these studies was the frequency of emotions measured in a pure fashion: Diener et al. (1995) studied frequency and duration, Feldman Barrett (in press; Parkinson et al., 1995) investigated frequency and intensity, and Thomas and Diener (1990) studied the number of pleasant versus unpleasant days. Thomas and Diener’s study suggests that people underestimate the frequency of pleasant emotions and overestimate the frequency of unpleasant emotions. However, this finding might be limited to their measure of pleasant versus unpleasant days. Different results might be obtained for absolute frequency estimates of single pleasant and unpleasant emotions such as anger, joy, or sadness. Furthermore, several types of accuracy (namely, absolute and relative accuracy as well as discriminative accuracy across emotions) have been neglected in previous studies. One major aim of the present dissertation is to explore the different types of accuracy for pure frequency judgments of emotions.

2.3.2 Underlying Cognitive Processes

Only two studies explored the cognitive processes underlying frequency judgments of emotions (Fitzgerald et al., 1988; MacLeod, Andersen, & Davies, 1994; see 2.4 for research on frequency judgments in general). Both studies compared frequency judgments of emotions with the latencies to retrieve a single autobiographic memory in which the target emotion occurred. The results consistently showed an inverse relation between these two variables: Retrieval latencies decreased with increasing frequency of emotions. Furthermore, retrieval latencies were faster for the more frequent pleasant emotions than for the less frequent unpleasant emotions (MacLeod et al., 1994). These findings have been interpreted as support for the assumption of the ease-of-retrieval model (see next paragraph for more detail) that people base frequency judgments on the ease (speed) with which they can retrieve exemplars from memory. However, this design has two shortcomings. First, the evidence is only correlational. Second, the relation of both measures to the actual frequencies of emotions is unknown. Only if both measures show the same correlation with the actual frequencies of emotions can the frequency judgments be based on ease-of-retrieval. If, however, the frequency judgments were more highly correlated with the actual frequencies than a measure of ease-of-retrieval, ease-of-retrieval could not explain the accuracy of the frequency judgments. Therefore, more rigorous tests are needed to uncover the cognitive processes underlying frequency judgments of emotions.

2.4 The Experimental Literature on Frequency Judgments in General

2.4.1 Theoretical Models

Experimental research on frequency judgments has often relied on stimuli (e.g., word lists with concepts of natural objects such as fruits, furniture, birds, etc.) which, at first sight, bear little resemblance to experiences of emotions in everyday life. Therefore, one might be skeptical whether this research helps to understand frequency judgments of emotions. This skepticism is not justified for two reasons.

First, frequency judgments of emotion also employ emotion concepts. Although emotion concepts differ from concepts of natural objects – for example, concepts of natural objects are hierarchically organized and mutually exclusive at the same level of the hierarchy, whereas emotion concepts are not (Reisenzein, 1995) – frequency judgments might not be affected by these differences.

Second, even if frequency judgments of emotions employ other cognitive processes than frequency judgments of natural objects, the theories and experimental methods developed in the general frequency judgment literature are at least heuristically fruitful for the investigation of frequency judgments of emotions.

Various frequency judgment models have been proposed in the psychological literature, which are not mutually exclusive (Brown, 1995; Hintzman, 1988; Howell, 1973; Tversky & Kahneman, 1973). Each model might be correct in specific contexts and for specific domains. For example, Manis, Shedler, Jonides, and Nelson (1993) argued that direct-encoding models might account for frequency judgments of repeated occurrences of the same stimulus, whereas retrieval-based models might account for frequency judgments of categories (see also Brown, 1995).

Other studies show that the expectation about the frequency of the event to be estimated is also important. People are likely to use a counting strategy for rare events (seeing the dentist in the last year), but estimation strategies for more frequent events (restaurant visits in the last year) (Blair & Burton, 1987).

Figure 2 provides a taxonomy of the different frequency judgment models proposed in the literature. Two major characteristics differentiate between the frequency judgment models. The first distinction is between direct encoding models that assume frequency information to be encoded directly at the time of encoding (Hasher & Zacks, 1979; Jonides & Jones, 1992; Underwood, 1969) and indirect encoding models, which assume frequency information to be stored indirectly in memory in the form of multiple memory traces (see Figure 2).

Direct encoding models are, for example, the counter model, which is based on the idea that concepts are linked to a frequency counter that registers every activation of the concept (Underwood, 1969), or the concept strength model, which assumes that concepts are strengthened by each activation so that frequency judgments can be based on a readout of a concept’s strength (cf. Howell, 1973).

The second major distinction is between retrieval-based versus retrieval-free models. Retrieval-based models assume that frequency judgments are based on information that is obtained by the retrieval of relevant exemplars to the level of consciousness. A straightforward strategy would be to count all available instances in memory (Brown, 1995; Meudall, 1971). However, research suggests that people do not use the counting strategy for irregular and frequent events (Menon, 1994). Because emotional experiences are irregular, and quite frequent over longer time periods, it is unlikely that people rely on a counting strategy when they judge the frequency of emotions.

The other possibility is that people use simple heuristics to make frequency judgments. Tversky and Kahneman (1973) suggested several heuristics that people might use to make frequency judgments. Somewhat confusingly, all of the proposed heuristics came to be known as availability heuristics, although they assume clearly distinct cognitive processes. Most commonly, the availability heuristic has been interpreted as the retrieval of a limited number of exemplars followed by an estimation based on the number of retrieved exemplars (cf. Watkins & LeCompte, 1991). “The subject could, therefore, use the number of instances retrieved in a short period to estimate the number of instances that could be retrieved in a much longer period of time” (Tversky & Kahneman, 1973, p. 210). To distinguish this heuristic from other heuristics, it has been named recall-estimate theory (Watkins & LeCompte, 1991).

Empirical support for the recall-estimate model stems from the finding that the number of recalled exemplars is correlated with frequency judgments and that both variables are influenced in the same way by experimental manipulations at the time of encoding (Manis et al., 1993; Tversky & Kahneman, 1973). For example, in a now classical study, Tversky and Kahneman demonstrated that people recall more female names than male names from a list with an equal number of female and male names, when the female names referred to famous people. In addition, they also made higher frequency estimates for female than for male names.

Today, the second availability heuristic proposed by Tversky and Kahneman is known as the ease-of-retrieval model (Schwarz, Bless, Strack, Klumpp, Rittenauer-Schatka, & Simons, 1991). According to this model, an individual “attempts to recall some instances and judges overall frequency by availability, i.e., by the ease with which instances come [italics added] to mind” (Tversky & Kahneman, 1973, p. 220). The ease-of-retrieval model has been empirically tested and supported in some studies (Gabrielcik & Fazio, 1984; Schwarz et al., 1991). For example, Schwarz et al. (1991) asked participants to recall either six instances when they were assertive, which was easy, or twelve instances, which was difficult. Subsequently, the participants in the easy recall condition judged themselves to be more assertive than those in the difficult recall condition.

In contrast to the retrieval-based models, retrieval-free models assume that frequency judgments do not involve retrieval of exemplars to the level of consciousness. Interestingly, Tversky and Kahneman (1973) also suggested a retrieval-free model of frequency judgments. “To assess availability it is not necessary to perform the actual operations of retrieval. It suffices to assess the ease with which these operations could [italics added] be performed, much as the difficulty of a puzzle or mathematical problem can be assessed without considering specific solutions” (p. 208).

This proposition bears a close resemblance to the finding in the metamemory literature that people often have a feeling-of-knowing the answer to a question, even when they cannot recall the answer. Nevertheless, the strength of this feeling predicts people’s performance in a later recognition test (see Nelson, 1988). Metcalfe (1993) proposed that this seemingly paradoxical ability is based on the familiarity of the question. Similarly, Hintzman (1988) proposed that people can judge the frequency of events without actual retrieval of related memories by means of a direct familiarity signal from memory. Hintzman’s familiarity model assumes that a question such as “How frequently did you experience joy in the last week?” activates automatically and in parallel memories of joy experiences in the last week. The activation is based on a feature-matching process: The more features of a typical joy experience a memory possesses, the stronger the activation of this memory, and the stronger the familiarity signal. Furthermore, some features encode the time of the experience, so that joy experiences in the last week are activated more strongly than joy experiences at other times. The automatic activation process produces an echo. The intensity of this echo reflects the amount of information that was activated in memory. This echo intensity is experienced as a feeling of familiarity. The major distinction between the familiarity model on the one hand, and the retrieval-based models on the other hand, is that the familiarity model does not require the retrieval of emotional memories to a conscious level. Therefore, it is possible that someone says: “I cannot recall a specific situation in which I felt joy last week, but I think I felt joy about 20 times.”
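
To make the feature-matching idea more tangible, the following is a deliberately simplified sketch (Python) of an echo-intensity computation in the spirit of Hintzman’s (1988) MINERVA 2 simulation model; the feature coding, the concrete vectors, and the cubing of the similarity belong to that model and to this illustration, not to the present studies.

import numpy as np

def echo_intensity(probe, traces):
    activations = []
    for trace in traces:
        # Similarity: dot product over the features on which probe or trace
        # is nonzero, normalized by the number of such features.
        relevant = (probe != 0) | (trace != 0)
        similarity = float(np.dot(probe, trace)) / max(int(relevant.sum()), 1)
        # Activation is the cubed similarity, so traces that match the probe
        # on many features dominate the echo.
        activations.append(similarity ** 3)
    # Echo intensity: summed activation of all traces; in the familiarity
    # model it is experienced as a feeling of familiarity.
    return sum(activations)

# Probe: "joy in the last week" (some features code the emotion, others the time).
probe = np.array([1, 1, 0, 1, -1, 0, 1, 0])

# Stored episode traces; the more joy-in-the-last-week traces are stored,
# the stronger the echo, without any trace being retrieved to consciousness.
traces = [np.array([1, 1, 0, 1, -1, 0, 1, 0]),
          np.array([1, 0, 0, 1, -1, 0, 1, 1]),
          np.array([-1, 0, 1, 0, 1, 1, -1, 0])]

print(echo_intensity(probe, traces))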

One important limitation of the familiarity model is that it does not explain how participants make absolute frequency estimates. The familiarity model only predicts that the familiarity signal will be stronger for frequent stimuli and weaker for rare stimuli, but the model does not explain how a feeling of familiarity is converted into an absolute numerical estimate (cf. Brown, 1995; Brown & Siegler, 1993). This problem, however, exists also for the ease-of-retrieval and the recall-estimate model.

2.4.2 Empirical Paradigms

In the experimental literature, several experimental paradigms have been developed to differentiate between the various frequency judgment models. Subsequently, I review those paradigms that were employed in the present studies to explore the cognitive processes underlying frequency estimates of emotions. More specifically, I first review paradigms that differentiate direct from indirect encoding models, and then paradigms that differentiate between indirect encoding models.

One paradigm is modeled after a study by Hintzman and Block (1971). The authors asked participants to learn two lists of words in which the frequency of words was independently varied. Subsequently, the participants estimated the frequency of words separately for the first and the second list. The authors found that the participants were able to make accurate frequency judgments for each list. This finding is difficult to explain by direct encoding models, which assume that frequency information is constantly updated at the time of encoding. Therefore, only the total frequency is stored in memory, and it is impossible to differentiate frequencies in different contexts. Study 2 of the present dissertation used a similar paradigm. Participants first made daily frequency estimates of emotions for two weeks. Subsequently, they made separate frequency estimates for the first and second week of the diary study.

A second paradigm that has been used to test direct versus indirect encoding of frequency information relies on a manipulation of the salience of category membership (e.g., Bruce, Hockley, & Craik, 1991; Greene, 1989). In Greene’s study, participants were asked to study a list of words. In this list, words of different categories occurred with varying frequencies (e.g., fruits: orange, apple, banana, grapes; trees: oak, pine). In one study, he manipulated the salience of category membership in that exemplars of the same category appeared either in one block or were spread across the list. The direct encoding model assumes that category members automatically activate category labels and that the frequency of the category is counted (Alba, Chromiak, Hasher, & Attig, 1980). Hence, making category membership salient should not have an effect on frequency judgments. However, Greene (1989) found that frequency was judged to be higher when category membership was salient; that is, in the blocked condition. This is once again difficult to explain by the direct encoding model. Manipulations of the salience of emotion concepts were employed in Studies 1, 3, and 4 of the present dissertation to test whether salience has an influence on frequency judgments of emotions.

In Study 1, participants made frequency judgments of emotions before and after a diary study for salient emotions; that is, those that were on the rating form during the diary study, and non-salient emotions; that is, those that were not on the form. First of all, it was expected that frequency judgments of all emotions would increase due to the participation in a diary study. In addition, it was expected that frequency judgments of salient emotions would increase more strongly than those of non-salient emotions.

In Studies 3 and 4, participants first rated for a number of emotions whether they would experience these emotions in various hypothetical scenarios. Subsequently, they were asked to estimate how frequently they would have experienced emotions in the set of scenarios. This question was asked for salient emotions; that is, those emotions that had been included in the previous rating task, and non-salient emotions; that is, those that had not been included in the scenario rating task. Note that frequency judgments of non-salient emotions are meaningful, because the fact that these emotion concepts were not included in the scenario rating task does not imply that these emotions could not have been experienced in the hypothetical scenarios. It was expected that the frequency judgments of salient emotions would be higher than those of non-salient emotions (Bruce et al., 1991; Greene, 1989; Hintzman, 1988).

The previously described paradigms can test direct and indirect encoding models against each other, but they do not allow one to distinguish between retrieval-based and retrieval-free models, because all indirect models predict that different frequency judgments can be provided for different contexts (Hintzman & Block, 1971), or that salience at the time of encoding enhances frequency judgments (Bruce et al., 1991). Even the finding that frequency judgments and recall measures show discriminative accuracy across stimuli does not provide conclusive evidence that the frequency judgments were actually based upon a recall-estimate strategy (Bruce et al., 1991; Hastie & Park, 1986; Watkins & LeCompte, 1991). The correlation could simply be due to the fact that frequency judgments and the number of recalled exemplars are bound to be related by the number of exemplars stored in memory. Nevertheless, frequency judgments might not be based on the retrieval of exemplars. Therefore, a closer examination of the frequency judgment process is needed. The major difference between the two types of models is the assumption of the retrieval-based models that exemplars are retrieved to the level of consciousness. Therefore, the time needed for a frequency judgment should be longer than the latency to retrieve at least a single exemplar. Similarly, Reder (1987) argued in the metamemory literature that feeling-of-knowing judgments should take more time than the retrieval of answers if they are retrieval-based; however, consistent with retrieval-free models (Metcalfe, 1993), the feeling-of-knowing judgments were faster than the retrieval times of answers.

Furthermore, if frequency judgments are based on the retrieval of exemplars, the judgment times of frequency judgments should be systematically related to the judged frequency: According to the recall-estimate theory, higher frequency judgments should take more time because more exemplars were retrieved, and the retrieval of more exemplars takes more time (Brown, 1995; Meudall, 1971). The opposite prediction is made by the ease-of-retrieval model. Higher frequency judgments are based on easier retrieval of exemplars, which implies that the exemplars come to mind faster so that the judgment can be made faster. In contrast to these predictions of the two retrieval-based models, Hanson and Hirst (1988) found neither a positive nor a negative relation between judgment times and the size of the frequency judgments. Brown (1995) found a positive correlation when participants used the counting or the recall-estimate strategy, but not when participants used a retrieval-free strategy. In Studies 3 and 4, the size of the frequency judgments, the time needed to make these judgments, and response times in a latency-to-retrieve task were compared to test the indirect encoding models of frequency judgments in the emotion domain.

2.5 Proposing a Familiarity Model of Frequency Judgments of Emotions

A finding by Schimmack and Reisenzein (in press) casts doubt on the basic assumption of the retrieval-based models that participants rely on the recall of exemplars to judge the frequency of emotions. Fitzgerald et al. (1988) found that on average the retrieval of emotional episodes from autobiographic memory needed more than 7s. MacLeod et al. (1994) also reported average retrieval times of more than 7s for unpleasant emotion memories, although pleasant memories were retrieved within 4s. In Schimmack and Reisenzein’s (in press) study, participants made conditional probability judgments of emotions (i.e., “If you experience joy, how frequently do you experience euphoria?”) on average within 5s. These judgment times appear to be too fast to be based on the recall of past emotional experiences, especially given that the judgments were made only for unpleasant emotions. Of course, the fast responses could be due to random responding. However, covariation judgments of emotions are not only fast, but also reflect actual covariations between emotions fairly accurately (Reisenzein & Schimmack, 1996).

Finally, Schimmack and Reisenzein analyzed asymmetries in conditional probability judgments. According to Bayes’s theorem (Wiggins, 1973), p(A) > p(B) if and only if p(A|B) > p(B|A); that is, because sadness is in general a more frequent emotion than embarrassment, the conditional probability of sadness given embarrassment should be higher than the conditional probability of embarrassment given sadness. This prediction was confirmed for most emotion pairs. Therefore, conditional probability judgments reflect not only the actual co-occurrence of emotions, but also the separate frequencies of the two emotions. It is difficult to imagine how the participants (a) retrieved a sufficient number of exemplars and (b) carried out the necessary computations on a conscious level within 5s. Therefore, Schimmack and Reisenzein concluded that both frequency and co-occurrence judgments of emotions are either already pre-stored in memory, or the judgments are based on a feeling of familiarity (Hintzman, 1988; Metcalfe, 1993). As a consequence, a test between direct encoding models and the familiarity model seems to be highly desirable. Nevertheless, previous findings by Hintzman and Block (1971) and others (Greene, 1989) have challenged direct encoding models in other domains. Furthermore, the familiarity model has been successfully applied to other social judgments. For example, the familiarity model, but not the direct-encoding models, explains the phenomenon of illusory correlations (Smith, 1991; Smith & Zaraté, 1993; see also Fiedler, 1991). Therefore, Schimmack and Reisenzein recommended the familiarity model as an “inference to the best explanation” for frequency and co-occurrence judgments of emotions.
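
For completeness, the asymmetry invoked above follows directly from the definition of conditional probability; the following is a standard derivation, stated here only to make the reasoning explicit:

p(A \mid B)\, p(B) \;=\; p(A \wedge B) \;=\; p(B \mid A)\, p(A) \quad\Longrightarrow\quad \frac{p(A \mid B)}{p(B \mid A)} \;=\; \frac{p(A)}{p(B)},

so that, provided the joint probability of A and B is greater than zero, p(A) > p(B) holds if and only if p(A|B) > p(B|A).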

2.6 Biases in Frequency Judgments of Emotions

2.6.1 Mood-Congruent Biases

Up to now, frequency judgments of emotions have been treated just like frequency judgments of natural objects (e.g., fruits, cities, furniture). However, emotions differ from these stimuli in that emotions have a hedonic tone: they are either pleasant or unpleasant (cf. Clore, 1994). Several information processing models predict that affective information is processed differently from non-affective information. According to the mood-congruent-memory hypothesis (Bower, 1981), the current affective state renders mood-congruent memories more accessible. In combination with the indirect encoding models of frequency judgments, this leads to the prediction that frequency estimates of emotions are biased in a mood-congruent direction. In contrast, the competing model of mood effects on social judgments, the mood-as-information model (Schwarz & Clore, 1983), does not make this prediction. According to this model, people directly use their current mood to make evaluative judgments whenever they consider their current mood a valid and relevant source of information for the judgment. In an intriguing experiment, Schwarz and Clore (1983) demonstrated that participants rated the satisfaction with their lives to be higher in a good mood (e.g., on sunny days) than in a bad mood (e.g., on rainy days). This effect, however, disappeared when the influence of the weather on participants’ current mood was made salient to them. As a consequence, participants no longer considered their current mood a valid source of information and used other information. Subsequent studies showed that current mood was not used for judgments about satisfaction with specific life domains, presumably because participants considered their current mood as irrelevant (Schwarz, 1987). Because current mood does not appear to be a particularly relevant source of information for frequency judgments of specific emotions such as love, hate, joy, and fear, people should not use their current mood as information for these judgments. Therefore, the mood-congruent-memory model, but not the mood-as-information model, predicts an influence of current mood on frequency judgments of emotions.

To address this question empirically, individual differences in naturally occurring mood at the time of the frequency judgments were assessed in the two field studies. Naturally occurring mood, rather than a mood-induction procedure, was used for two reasons. First, an experimental mood manipulation may have distorted the results in the more important analyses of the accuracy of frequency judgments. Second, I believe that it is useful to start a scientific investigation with a demonstration of a phenomenon under natural conditions. If naturally occurring mood is unrelated to frequency judgments of emotions, an experimental investigation is at least of secondary importance to the present research question. This research strategy seems especially desirable in the light of a series of studies by Parrott and Sabini (1990), who did not find mood-congruent recall in naturalistic settings; indeed, the authors found mood-incongruent recall. In addition, mood effects in experimental studies are often quite small and inconsistent (Blaney, 1986; Brewin et al., 1993), suggesting that current mood leads only to small distortions in retrospective frequency estimates of emotions.

2.6.2 Personality-Congruent Biases

Martin’s (1985) notion of a personality-congruent memory bias suggests that a person’s personality might also influence frequency judgments of emotions. Specifically, participants might overestimate the frequency of personality-congruent emotions. In support of this hypothesis, Diener et al. (1984) found that neurotic individuals overestimated the amount of their unpleasant affect, whereas extraverted individuals overestimated the amount of their pleasant affect (see also Feldman Barrett, in press). In addition, Larsen (1992) found that neurotic individuals tended to overestimate the frequency of some physical symptoms.
Personality-congruent biases can be explained in several ways. First, personality-congruent memories are more accessible (Martin, 1985); therefore, at least the retrieval-based models would predict higher frequency judgments for personality-congruent emotions.

Second, people may have generalized beliefs about their personality that can be based on various kinds of information, such as communication with others or abstractions from their own experiences (Fiske & Taylor, 1984; Hastie & Park, 1986). For example, a person might think that he or she is a “jealous”, “choleric”, or “happy” person. People might rely on such generalized beliefs when they are asked to estimate the frequency of their emotional experiences (Feldman Barrett, in press; Zuroff, 1989). The use of generalized beliefs is, of course, just another judgment model of frequency judgments of emotions: Once an individual has determined the frequency of his or her emotions, for example by means of one of the other judgment strategies, he or she simply retrieves this prestored frequency information to make subsequent frequency judgments. As long as this information accurately reflects the actual frequencies of emotional experiences, this judgment strategy provides a fast and efficient way to answer frequency questions about emotions. However, if the formerly derived frequency judgments deviate from the actual frequencies of emotional experiences, reliance on this information leads to personality-congruent biases.

With regard to frequency questions over limited time periods (e.g., “in the last month”), a third explanation is possible: People may have difficulty distinguishing between episodes that fall within and those that fall outside of the asked time period (cf. Schwarz, 1990). In this case, the frequency judgments would cover a longer time period than intended by the investigator’s question. Furthermore, personality tends to be a better predictor of emotional experiences the more they are aggregated over longer time periods (Epstein, 1983). As a consequence, personality explains additional variance in the frequency judgments of emotions that is not accounted for by the actual frequencies of emotions during the limited time period under investigation.

Finally, personality-congruent memory biases could be a simple method artifact due to the fact that personality traits are measured by judgments that are very similar to frequency judgments of emotions.

In Studies 1 and 2, the aim was simply to further explore whether a personality-congruent memory bias exists. If so, this could be the starting point for further analyses differentiating between the different explanations described above. Note that a personality-congruent bias is consistent with the familiarity model if it is due to the activation of memory traces of experiences outside of the time period under investigation. The bias should, however, account for much less variance than the actual frequencies of emotions, because the familiarity model assumes that memory traces can be activated for different contexts, including the time of experience (Hintzman & Block, 1971).

2.7 Summary of Hypotheses

The present dissertation has two main aims: (a) to test the accuracy of frequency judgments of emotions and (b) to explore the cognitive processes underlying these judgments.
With regard to the first question, the following predictions are made:

1. Participants should underestimate the absolute frequency of their emotional experiences. A related prediction is that people underestimate especially the frequencies of more frequent emotions. Both predictions are based primarily on the fact that these effects have been consistently obtained in the frequency judgment literature (cf. Thompson & Mingay, 1991; Williams & Durso, 1986). However, an explanation of this effect is lacking, because many frequency judgment models do not address the question of how frequency information (e.g., a feeling of familiarity) is converted into absolute frequencies (cf. Brown, 1995).

2. Frequency judgments of emotions show discriminative accuracy across emotions because the familiarity signal reflects the number of stored memories, and thereby the number of experiences of an emotion, fairly accurately (Hintzman, 1988). High discriminative accuracy across stimuli has been reported for frequency judgments of other stimuli (Hasher & Zacks, 1984; Hintzman, 1988).

3. The discriminative accuracy across participants is expected to be moderate. This prediction is based on some earlier findings (e.g., Thomas & Diener, 1990). The correlations reported by Diener et al. (1995) and Feldman Barrett (in press) in the range from r = .50 to .70 are predicted to overestimate the true discriminative accuracy across participants, because the estimates were made (a) on the same scale that was used during the diary period and (b) after participation in a diary study.

With regard to the cognitive processes underlying frequency judgments of emotions, the following predictions were made:

1. Participants should be able to judge accurately the frequencies of emotions in the first and in the second week of a diary study. The reason is that it is possible to activate memory traces of different time periods separately, so that the familiarity signal reflects predominantly the frequencies in a specified time period (Hintzman & Block, 1971).

2. Making some emotion concepts salient during the encoding process should increase the judged frequency of salient emotion concepts compared to non-salient ones; that is, those emotion concepts that were not presented at the time of encoding, because salience leads to deeper encoding and less information loss (Greene, 1989; Hintzman, 1988).

3. The retrieval latency of an emotional episode from memory should be unable to account for the discriminative accuracy of frequency judgments across emotions. The basis for this prediction is that frequency judgments are assumed to be based on a sense of familiarity and that the familiarity signal reflects the actual frequencies more accurately than the ease-of-retrieval of exemplars (Watkins & LeCompte, 1991). This hypothesis allows for the possibility that frequency judgments of emotions and retrieval latencies of emotional episodes are negatively correlated (Fitzgerald, et al., 1988; MacLeod et al., 1994). It only predicts that this correlation is not strong enough to account for the discriminative accuracy across emotions.

4. The familiarity model does not predict a relation between the size and the speed of frequency judgments. In contrast, the retrieval-based models predict such a relation; the counting and the recall-estimate model predict a positive correlation (Brown, 1995), whereas the ease-of-retrieval model predicts a negative correlation.

5. Finally, it is predicted that the recall of a single emotional episode takes longer than the complete frequency judgment process. This prediction is again based on the assumption of the familiarity model that frequency judgments are not based on the retrieval of emotional experiences from memory. This prediction is in agreement with previous results that frequency judgments are faster than the retrieval of exemplars (Alba et al., 1980; Schimmack & Reisenzein, in press).

No explicit predictions were made concerning the influence of current mood or personality on frequency judgments of emotions because the familiarity model predicts such biases only under certain conditions. Mood-congruent effects could be due to a stronger activation of mood-congruent memory traces, leading to a stronger familiarity signal. Personality-congruent effects could be due to the activation of memory traces outside of the time frame of the question. However, it is predicted on account of the familiarity model that biases, if they exist, are relatively small compared to the amount of variance that is explained by the actual frequencies of emotions.

3 STUDY 1

Studies 1 and 2 used the pre-post design of Thomas and Diener (1990). In a pre-post design, participants first judge the frequency of emotions for a time period prior to the diary study. Then, they take part in a diary study, which serves to obtain a measure of the actual frequencies of emotions. Finally, they judge the frequency of emotions during the diary period. The advantage of the pre-post design is that it allows a test of salience effects; that is, whether the participation in the diary study influenced the post-diary judgments. The problem of the design is that pre-diary estimates necessarily cover a different time period than the one during which the actual frequencies of emotions are measured. Hence, changes in the true frequencies of emotions from the pre-diary period to the actual diary period can attenuate correlations between pre-diary estimates and the actual frequencies of emotions.

Study 1 served several goals: first, to test the accuracy of frequency judgments of emotions, using various measures of accuracy, and second, to test whether the salience of emotions at the time of encoding influences subsequent frequency judgments. Third, Study 1 tested the presence of mood- or personality-congruent biases in frequency judgments of emotions.
Fourth, the strengths and weaknesses of two different response formats were compared: (a) the participants made absolute estimates; that is, they estimated the absolute number of occurrences of an emotion (X times a week), and (b) they made vague quantifier ratings (Pepper, 1981; Wright, Gaskell, & O’Muircheartaigh, 1994); that is, they checked which of a number of common frequency expressions (e.g., never, rarely, sometimes, often) most appropriately described how frequently they experienced an emotion. The use of both response formats appeared to be especially desirable because Schaeffer (1991) obtained different results for absolute estimates and vague quantifier ratings. In a survey study, respondents were first asked to rate the frequency of excitement and boredom in their lives by means of vague quantifiers. Then, they were asked to indicate which absolute frequency the chosen quantifier indicated. Black participants appeared to be more bored than white participants according to the vague quantifier ratings, but not according to the absolute estimates.

Two ancillary analyses were carried out. First, I explored the time period covered by frequency judgments of emotions. It is well known that memories increasingly decay over time (Hintzman, 1988). Therefore, frequency judgments of emotions should be influenced predominantly by more recent emotional experiences. Nevertheless, it remains to be discovered whether frequency judgments of emotions cover only experiences in the last few days or extend over much longer time periods. Second, the structure of individual differences in the frequencies of pleasant and unpleasant emotions was explored, which is an important topic in personality psychology (Bradburn, 1969; Diener, Smith et al., 1995; Green, Goldman, & Salovey, 1993; Meyer & Shack, 1989).

3.1 Method

3.1.1 Participants

One hundred and fifty students in a semester-long course on research in personality at the University of Illinois took part in this study. Four participants were excluded because of missing data. The final sample consisted of 107 female and 39 male participants. Although the topic of the validity of retrospective frequency judgments of emotions was discussed in this course, this happened only after the data collection relevant to this study had been completed.

3.1.2 Material and Procedure

3.1.2.1 Daily Estimates

At the core of the present study, participants estimated the absolute frequencies of 20 emotions (see Table 1) at the end of each day for 23 days. The first two days were used to practice the use of the questionnaire and were excluded from all analyses. Using a free response format, participants entered any number that seemed to be appropriate as an estimate of the absolute frequency with which they experienced an emotion on a particular day. Participants returned the forms the next day, except for the weekend forms which were due on Monday.

In Studies 1 and 2, the averaged daily ratings are used as a standard of comparison for the long-term frequency judgments. Therefore, the averaged daily judgments are labeled actual frequencies, although the measure can be expected to give only an approximation of participants’ actual frequencies of emotional experiences. However, random- or event-sampling methods (see Schimmack & Diener, in press) would reflect the frequency of emotional experiences only if the number of daily measurement points were extremely high, which would probably overtax the motivation of the participants. Therefore, daily, or, as in Study 2, twice-daily, frequency estimates were considered the most appropriate measure of actual frequencies of emotions in everyday life.

3.1.2.2 Vague Quantifier Ratings

Before (pre-diary) and after (post-diary) the diary study, the participants made vague quantifier ratings concerning the frequencies of emotions during the last three weeks. The emotions were the 20 emotions included in the daily form and 9 additional ones. For the rating task, participants were provided with frequency expressions commonly used in everyday language, and they had to check the most appropriate one (i.e., In the last three weeks, I experienced joy [never], [very rarely], [rarely], [sometimes], [often], [very often], [extremely often]). For statistical analyses, the vague quantifier ratings were later converted to numbers from 0=never to 6=extremely often.

3.1.2.3 Absolute Estimates

The participants also made pre- and post-diary absolute estimates of the frequency of emotions experienced during the last three weeks. These judgments were made after the vague quantifier ratings. This order was chosen because participants might have relied on their absolute estimates to make the vague quantifier ratings, whereas a transfer effect in the other direction seemed to be less likely. The emotions were the same as in the vague quantifier questionnaire. Also, the item sequence was the same for vague quantifier ratings and absolute estimates. The absolute estimates were made using a free response format; that is, the participants wrote down any number that seemed appropriate. Although the estimates were required to cover the last three weeks, the questionnaire asked for weekly frequencies (i.e., “In the last three weeks, I experienced joy _ times a week”). A weekly time frame was used for the following reasons. A daily time frame (_ times a day) seemed problematic. First, a daily time frame might have especially encouraged the participants to memorize their daily estimates to make the post-diary estimates. Second, a daily time frame does not allow discrimination of the frequencies of very rare emotions (e.g., envy, hate), which on many days are not experienced at all; therefore, the modal response for these emotions would be zero. A three-week time frame (_ times in the last three weeks) was not used because a weekly time frame has the advantage that it can be used for different time periods, ranging from one week (“In the last week, I experienced joy _ times a week.”) to people’s frequency of emotional experiences in general (“In general, I experience joy _ times a week”).

3.1.2.4 Mood Questionnaire

After completing the two frequency judgment tasks, the participants rated their current mood on the ELMI (Everyday Language Mood Inventory; Schimmack, 1996a) which is an English adaptation of the BASTI (Berliner Alltagssprachliche Stimmungsinventar; Schimmack, in press). The ELMI measures 10 specific mood dimensions, namely indifference, sentimentality, depression, grouchiness, irritation, anxiety, nervousness, euphoria, cheerfulness, and relaxation, and three global mood dimensions, pleasure-displeasure, aroused-calm, and wakeful-tired, with two items each. Ratings were made on an intensity scale ranging from 0=not at all to 6=extremely intense.

3.1.2.5 Personality Questionnaire

The personality dimensions of neuroticism and extraversion were measured by means of the NEO-PI-R (Costa & McCrae, 1992). Each trait is measured by six subscales, and each subscale comprises 8 items. Therefore, the NEO-PI-R provides a reliable and broad assessment of these two personality dimensions.

3.2 Results

3.2.1 Absolute Accuracy

The absolute and relative accuracy of frequency judgments of emotions can only be tested for the absolute estimates, because vague quantifiers do not correspond to a fixed absolute frequency (Pepper, 1981; Wright et al., 1994). To test absolute accuracy, each participant’s deviation from his or her actual frequencies (i.e., the square root of the averaged squared differences between estimated and actual frequencies) was computed for the 20 pre-diary and the 20 post-diary estimates. A comparison of these deviation scores indicated that absolute accuracy increased from the pre-diary (mean SD = 11.88) to the post-diary estimates (mean SD = 9.09), t(145) = 5.95, p < .01.

This finding suggests that the participation in the diary study increased the accuracy of the estimates. One problem in the interpretation of this finding is, however, that the post-diary estimates, but not the pre-diary estimates, cover the time period of the diary study. Therefore, the present finding might also be due to the fact that the actual emotion frequencies changed from the pre-diary period to the diary period.
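
For concreteness, the deviation score used in the preceding analysis can be sketched in a few lines of Python, assuming it is the root of the averaged squared differences between a participant's estimated and actual weekly frequencies. All numbers below are invented for illustration; they are not the study's data.

```python
import numpy as np

# Hypothetical weekly frequencies for one participant across 20 emotions
# (illustrative values only, not the study's data).
estimated = np.array([7, 3, 1, 12, 0, 5, 2, 9, 4, 6,
                      1, 0, 3, 8, 2, 5, 1, 4, 10, 6], dtype=float)
actual = np.array([28, 5, 2, 20, 1, 9, 3, 15, 6, 11,
                   2, 1, 4, 14, 3, 8, 2, 6, 18, 10], dtype=float)

# Absolute accuracy: root of the averaged squared deviation of the estimates
# from the actual frequencies (smaller values indicate higher accuracy).
deviation = np.sqrt(np.mean((estimated - actual) ** 2))
print(deviation)
```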

The previous analysis measured absolute accuracy at the individual level. It is, however, also possible to compare the absolute accuracy of pre- and post-diary estimates at the group level. To this aim, the frequency judgments of each emotion were first averaged across participants. Then the standard deviations of the averaged pre- and post-diary estimates from the averaged actual frequencies were compared. This analysis also shows an improvement in absolute accuracy from pre-diary (mean SD = 7.54) to post-diary (mean SD = 3.60) estimates, t(19) = 4.46, p < .01.

This finding is stronger evidence for an improvement in absolute accuracy, because it is less likely that the average frequency of an emotion changed from the pre-diary to the diary period. That is, it is less likely that all participants experienced less anger or more joy in one of the two weeks. In sum, the analyses suggest that the participation in the diary study improved the absolute accuracy of the judgments.

3.2.2 Relative Accuracy

Table 1 shows the weekly frequencies of the 20 emotions, as derived from the daily frequency estimates (summed across the diary period and divided by 3), and as estimated before and after the diary study. For all 20 emotions, the actual frequencies were higher than the estimated frequencies. The other notable finding, that all pre-diary estimates were lower than the post-diary estimates, is discussed later. In some cases the absolute differences were quite dramatic: For example, participants experienced contentment on average 28 times a week, but in their pre-diary estimates reported doing so only 7 times.

As a measure of relative accuracy, each participant’s actual frequencies were subtracted from his or her estimated frequencies of emotions. Underestimation was more severe for the pre-diary estimates (mean d = -7.54) than for the post-diary estimates (mean d = -3.59), t(145) = 7.09, p < .01. Before the diary study only 6 (of 146) participants revealed overestimation, whereas after the diary study 23 participants overestimated their frequencies of emotions.

The analysis at the group level provided the same results as the previous analysis of absolute accuracy because the averaged estimates always underestimated the averaged actual frequencies (absolute and relative accuracy differ only when both over- and underestimation occur). In sum, the analyses provide clear support for the first hypothesis that people in general underestimate the frequency of their emotional experiences.

Figure 3 shows the means of Table 1 to test the second part of hypothesis 1, that people underestimate higher frequencies more than lower frequencies (cf. Watkins & LeCompte, 1991). Clearly, underestimation increases with higher actual frequencies. In addition, it can be seen that the frequency estimates follow a linear trend. This finding is in agreement with results reported by Watkins and LeCompte (1991). To demonstrate higher underestimation for higher actual frequencies quantitatively, the relative accuracy score was correlated with the actual frequencies. This analysis produced, as predicted, negative correlations for both pre- and post-diary estimates: rs = -.99 and -.86 (ps < .01), respectively. Underestimation was more severe for higher actual frequencies. In addition, the relative accuracy scores of pre- and post-diary estimates were highly correlated with each other, r = .88, p < .01. In sum, analyses of the relative accuracy of frequency judgments of emotions revealed that (a) people underestimate the frequency of their emotions and (b) that they do so increasingly with increasing actual frequencies of emotions. This finding is consistent with experimental studies (cf. Watkins & LeCompte, 1991; Williams & Durso, 1986), and estimates of daily activities (Mingay, Shevell, Bradburn, & Ramirez, 1994). Furthermore, underestimation was more pronounced for pre- than for post-diary estimates. Because it is unlikely that the actual frequencies of all emotions increased from the pre-diary to the diary period, this finding can be interpreted as evidence that the relative accuracy of the estimates increased due to the participation in the diary studies.
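
As a rough sketch of this relative-accuracy analysis, the snippet below computes the signed deviation (estimated minus actual) and correlates it with the actual frequencies across 20 emotions. The values are invented for illustration and merely mimic the pattern of underestimation described above.

```python
import numpy as np

# Hypothetical mean weekly frequencies across the 20 emotions
# (illustrative values that mimic the pattern of underestimation).
actual = np.array([28, 20, 18, 15, 14, 12, 11, 10, 9, 8,
                   7, 6, 6, 5, 4, 3, 3, 2, 2, 1], dtype=float)
estimated = np.array([7, 9, 8, 7, 6, 6, 5, 5, 4, 4,
                      4, 3, 3, 3, 2, 2, 2, 1, 1, 1], dtype=float)

# Relative accuracy: signed deviation; negative values indicate underestimation.
relative_accuracy = estimated - actual

# A negative correlation means underestimation grows with the actual frequency.
print(np.corrcoef(relative_accuracy, actual)[0, 1])
```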

3.2.3 Discriminative Accuracy across Emotions

Discriminative accuracy across emotions can be assessed at the group level as well as at the individual level. For the analysis at the group level, one simply has to compute the correlation between the actual frequencies and the estimated frequencies across the 20 emotions included in the daily report form (Table 1). The discriminative accuracy across emotions was very high for pre- and post-diary estimates, rs = .96 and .98, respectively, both ps < .01; the pre- and post-diary estimates were also highly correlated with each other, r = .96, p < .01. The same analyses were performed for the vague quantifier ratings. The correlations with the actual frequencies as well as with the absolute estimates were very high (all rs > .90, all ps < .01).

To test discriminative accuracy across emotions at the level of each participant, the correlations between actual frequencies and the four frequency judgments (pre- and post-diary absolute estimates and vague quantifier ratings) were computed for each individual. Subsequently, the correlation coefficients were used as dependent variables in a 2 x 2 ANOVA with the within-subject factors response format (absolute estimates vs. vague quantifier ratings) and time of judgment (pre-diary vs. post-diary). This analysis revealed that the absolute estimates produced higher correlations (mean r = .79) than the vague quantifier ratings (mean r = .73), F(1,145) = 61.54, p < .01. Furthermore, post-diary judgments were more highly correlated (mean r = .85) with actual frequencies than pre-diary judgments (mean r = .67), F(1,145) = 415.51, p < .01. The interaction was not significant, F(1,145) = 0.02.
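
For readers who want to see the analysis logic in code, below is a minimal sketch of the individual-level analysis using simulated data: per-participant accuracy correlations are generated for the four judgment types and submitted to a 2 x 2 repeated-measures ANOVA (statsmodels' AnovaRM). The cell means and all variable names are hypothetical, not the study's results.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(1)
n = 146  # number of participants

# Hypothetical per-participant accuracy correlations for the four judgment
# types; in the real analysis each value would be the correlation between a
# participant's 20 judged and 20 actual emotion frequencies.
cell_means = {('absolute', 'pre'): .70, ('absolute', 'post'): .88,
              ('vague', 'pre'): .64, ('vague', 'post'): .82}
rows = []
for (fmt, time), mu in cell_means.items():
    r = np.clip(rng.normal(mu, .08, n), -.99, .99)
    rows.append(pd.DataFrame({'pid': np.arange(n), 'format': fmt,
                              'time': time, 'r': r}))
df = pd.concat(rows, ignore_index=True)

# 2 x 2 repeated-measures ANOVA: response format x time of judgment.
print(AnovaRM(df, depvar='r', subject='pid', within=['format', 'time']).fit())
```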

The higher correlations obtained for post-diary estimates suggest that participation in the diary study also increased the discriminative accuracy across emotions. However, the effect could also be due to changes in the true frequencies of emotions from the pre-diary to the diary period. The fact that vague quantifier ratings possessed less discriminative accuracy across emotions in the analysis at the individual level could be due to the limited number of response categories, which restricts the ability to discriminate between emotions with similar frequencies.

In sum, the results support hypothesis 2 that frequency judgments of emotions possess discriminative accuracy across emotions; that is, people are sensitive to the different frequencies with which they experience different emotions. This finding is consistent with studies of frequency judgment in other domains, which also show that people are sensitive to variation in the frequencies of different stimuli (Hasher & Zacks, 1984; Hintzman, 1988).

3.2.4 Discriminative Accuracy across Participants

To test the discriminative accuracy across participants, the pre- and post-diary frequency judgments of both response formats were correlated with the actual frequencies across participants, separately for each of the 20 emotions (Table 2). Table 2 also shows the test-retest correlations of the absolute estimates and the vague quantifier ratings. In the last column of Table 2, the internal consistency of the daily estimates across the 21 days of the diary period is reported as a measure of the stability of individual differences in the frequency of emotional experiences during the diary period (Diener & Larsen, 1984).

As can be seen in Table 2, nearly all correlations between frequency estimates and actual frequencies of emotions were significant. Nonsignificant correlations were obtained only for pre-diary absolute estimates. These results support hypothesis 3 that frequency judgments of emotions possess discriminative accuracy across participants. However, the correlations in Table 2 vary considerably, ranging from r = -.03 to .78. A 2 x 2 ANOVA with the within-subject factors response format (absolute estimates vs. vague quantifier ratings) and time of judgment (pre- vs. post-diary judgments) was used to test whether these factors influence the discriminative accuracy across participants.

This analysis revealed significant main effects for response format, F(1,19) = 8.72, p < .01, and for time of estimate, F(1,19) = 148.54, p < .01. In addition, the interaction was also significant, F(1,19) = 12.93, p < .01. Follow-up analyses of the mean correlations indicated that the post-diary correlations were higher than the pre-diary correlations (Figure 4). In addition, the significant interaction is due to the fact that the pre-diary vague quantifier ratings produced higher correlations than the pre-diary absolute estimates, whereas both response formats produced equally high correlations when the judgments were made after the diary period.

Again, the finding that post-diary estimates possess higher discriminative accuracy across participants can be due to two not mutually exclusive factors. First, rating the frequency of emotions on a daily basis might make emotional experiences more salient, leading to more accurate judgments. Second, individual differences in the actual frequency of emotional experiences may have changed from the pre-diary period to the actual diary period. It is important to distinguish between these two explanations, because the first explanation implies that the post-diary correlations overestimate the true discriminative accuracy of frequency judgments of emotions, whereas the second explanation implies that the pre-diary correlations underestimate the discriminative accuracy. Additional analyses were carried out to test the viability of the two accounts in greater detail.

The last column in Table 2 shows that the individual differences in the frequencies of emotions were highly stable over the three-week diary period. On the basis of this finding, a fairly high stability of individual differences in the frequencies of emotions can also be expected from the three weeks prior to the diary study to the three weeks of the diary period. If so, the higher correlations obtained for the post-diary judgments were at least partly due to the participation in the diary study. Table 2, however, also shows that emotions differ in their stability over time. For example, the frequency of affection is more stable (alpha = .93) than the frequency of feeling hurt (alpha = .71). If the correlation between pre-diary estimates and actual frequencies is attenuated by changes in the true frequencies of emotions, emotions with more variable frequencies over time (e.g., hurt) should be more affected than emotions with very stable frequencies over time (e.g., affection). To test this hypothesis, the pre-diary correlations (columns 1 and 3 in Table 2) were correlated with the stability coefficient (i.e., alpha in the last column of Table 2) across the 20 emotions. Both correlations indicate that the discriminative accuracy of pre-diary estimates increased with the stability of individual differences in the frequency of an emotion (absolute estimates r = .44, p = .05; vague quantifier ratings r = .54, p < .05), although the correlation for the absolute estimates was only marginally significant. This finding suggests that the temporal stability of an emotion influenced the size of the correlations between pre-diary estimates and actual frequencies during the diary study. Therefore, these correlations tend to underestimate the discriminative accuracy of frequency judgments of emotions. In sum, the analyses show empirically that the true discriminative accuracy is higher than the correlation obtained for pre-diary estimates and lower than the correlation obtained for post-diary estimates. Therefore, a point estimate of the true discriminative accuracy across participants is not possible, but it is on average in a range from r = .30 to .60. This finding suggests that the discriminative accuracy across participants has been overestimated in previous studies which used only post-diary judgments (Feldman Barrett, in press). The fact that the post-diary estimates in the present study are still lower than in previous studies can be attributed to the use of single-item measures in the present study, whereas previous studies used multiple-item measures, which are bound to have higher reliability.

In a further set of analyses the specificity of the frequency judgments was explored; that is, whether individual differences in the judged frequency of an emotion are more highly correlated with individual differences in the actual frequencies of the same emotion than with those of other emotions (cf. Diener, Smith et al., 1995). The actual frequencies of each emotion were correlated with the frequency judgments of the remaining 19 emotions and the highest correlation was recorded (see Appendix 1). Subsequently, this correlation was compared to the correlation with the frequency judgment of the same emotion (Table 2). Specificity was established if the correlation with the judgments of the same emotion exceeded the highest correlation with judgments of another emotion. These analyses were carried out for all four frequency judgments (pre- and post-diary absolute estimates and vague quantifier ratings). The strongest evidence for specificity was obtained for the post-diary vague quantifier ratings: Estimates for all 20 emotions revealed specificity. For the other judgments, specificity existed for frequency judgments of 18 (post-diary absolute estimates), 17 (pre-diary vague quantifier ratings), and 14 emotions (pre-diary absolute estimates). Even 14 cases of specificity are much more than would be expected by chance; expected = 1, χ2(N = 20) = 177.89, p < .01. These results show that the participants clearly used information about specific emotions. This finding eliminates a simple response set explanation of discriminative accuracy across participants (cf. Diener, Smith et al., 1995). Furthermore, the results suggest that frequency judgments are not based on generalized beliefs, unless one assumes that participants have different beliefs for each of the 20 emotions.
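
The specificity check can be sketched in a few lines of Python. The data below are simulated so that judgments track the actual frequencies; the function names and matrices are hypothetical placeholders for the study's participant-by-emotion data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_participants, n_emotions = 146, 20

# Hypothetical participants x emotions matrices of actual frequencies and
# frequency judgments (simulated so that judgments track actual frequencies).
actual = rng.poisson(8, (n_participants, n_emotions)).astype(float)
judged = 0.6 * actual + rng.normal(0, 2, actual.shape)

def shows_specificity(actual, judged, e):
    """True if the judgment of emotion e correlates more highly with the actual
    frequency of e than the judgment of any other emotion does."""
    own_r = np.corrcoef(actual[:, e], judged[:, e])[0, 1]
    other_rs = [np.corrcoef(actual[:, e], judged[:, j])[0, 1]
                for j in range(judged.shape[1]) if j != e]
    return own_r > max(other_rs)

n_specific = sum(shows_specificity(actual, judged, e) for e in range(n_emotions))
print(f"{n_specific} of {n_emotions} emotions show specificity")
```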

In sum, frequency judgments of emotions were found to (a) possess moderate discriminative accuracy across participants and (b) show remarkable specificity for each emotion. With regard to the two response formats, the vague quantifier ratings yielded higher correlations and more specificity than the absolute estimates, despite the use of absolute estimates on the daily report form to measure actual frequencies.

3.2.5 The Influence of Daily Ratings on Frequency Judgments of Emotions

Daily ratings of emotions during the diary study might make these emotions salient. According to hypothesis 5, this should increase the absolute level of the frequency estimates of these emotions. To test this prediction, 9 emotions were included in the pre- and post-diary questionnaires that had not been on the daily rating form. Furthermore, these emotions were selected to be related to one of the emotions on the daily form (listed as pairs, with the emotion not on the daily form first and its counterpart on the daily form second: happiness-joy, love-affection, fear-anxiety, rage-anger, dislike-contempt, regret-guilt, shame-embarrassment, depression-sadness, helplessness-hopelessness).

The previous analysis of relative accuracy already demonstrated that people underestimated actual frequencies less in the post-diary judgments than in the pre-diary judgments. This effect implies that the frequency judgments increased from pre- to post-diary ratings (see Figure 3). If, however, the daily ratings increased especially the salience of those emotions on the daily report form, the increase should be stronger for those emotions on the report form than for their counterparts that were not on the form. In other words, salient emotions should reveal a higher increase from pre- to post-diary estimates than non-salient emotions.

To test this prediction, repeated-measures ANOVAs were carried out with the within-subject factors time (pre- vs. post-diary), salience (on the form vs. not on the form), and type of emotion (9 pairs of emotions). The first analysis was based on the absolute estimates and the second analysis on the vague quantifier ratings. The ANOVAs revealed significant main effects and interactions throughout (Table 3). However, not all of these effects are theoretically important. For example, the strong effect for the salience x emotion interaction simply shows that the frequencies of emotions were not equivalent across and within the 9 emotion pairs.

The most important finding is that the predicted time x salience interaction was significant. Furthermore, Figure 5 shows that the interaction is due to the predicted stronger increase from pre- to post-diary frequency estimates for salient emotions.

However, the significant three-way interaction indicates that this effect differed across emotion pairs. Table 4 shows the pre- and post-diary absolute estimates for all 9 emotion pairs. Inspection of the data shows that the increase over time was replicated for all emotions, but three emotion pairs did not show the expected stronger increase for the salient emotion, namely joy-happiness, contempt-dislike, and hopelessness-helplessness.

Visual inspection of the effects suggests that frequent emotions showed a stronger increase in the frequency estimates from pre- to post-diary estimates, a hypothesis that is also suggested by the regression lines in Figure 3. To explore this hypothesis more thoroughly, I went back to the data in Table 2 and correlated the actual emotion frequencies with a change score, subtracting pre-diary from post-diary absolute estimates. The correlation proved to be highly significant, r(20) = .96, p < .01, indicating that more frequent emotions show a higher increase in the estimated absolute frequency from pre- to post-diary judgments. This finding is most likely due to the stronger underestimation of these emotions in the pre-diary estimates. Therefore, frequency judgments of more frequent emotions benefit in particular from making them salient.

Figure 6 shows the means of the pre- and post-diary vague quantifier ratings of salient and non-salient emotions. An unexpected finding was that the ratings of both salient and non-salient emotions decreased from pre- to post-diary judgments. This is exactly the opposite of what was expected. Furthermore, this effect occurred although the same participants had just made the absolute estimates, which showed the expected increase. This finding is strong support for the hypothesis that vague quantifiers do not correspond in a one-to-one fashion to absolute frequencies (Pepper, 1981; Schaeffer, 1991; Wright et al., 1994). However, Figure 6 also shows that the significant time x salience interaction is due to a smaller decrease for the salient than the non-salient emotions. This finding is consistent with the predicted influence of salience: Given that vague quantifier ratings decrease over repeated assessments, they do so less for emotions that were made salient.

Table 5 shows the results for each emotion pair. All except three emotions showed the unexpected decrease from pre- to post-diary judgments. Next it was explored whether frequent emotions showed a smaller decrease than less frequent emotions, which would be equivalent to the stronger increase obtained for the absolute estimates. Again, this hypothesis was tested by means of the data reported in Table 2. The actual emotion frequencies were correlated with a change score, subtracting pre-diary from post-diary vague quantifier ratings. As with the absolute estimates, a significant positive correlation was obtained, r(20) = .79, p < .01. In addition, the change scores of absolute estimates and vague quantifier ratings were significantly correlated, r(20) = .75, p < .01.

This finding suggests that absolute estimates and vague quantifier ratings also responded in the same way to the participation in the diary study. Frequent emotions show a higher increase for absolute estimates and a smaller decrease for vague quantifier ratings. In sum, the salience manipulation had the expected effect on both types of frequency judgments. However, for the vague quantifier ratings the expected salience effect was overshadowed by the unexpected and counterintuitive finding that vague quantifier ratings decreased from pre- to post-diary judgments.

A search of the psychological literature uncovered that this finding could have been predicted on the basis of earlier findings. As early as 1954, Windle conducted a meta-analysis and reported a decreasing mean in test-retest comparisons of social-adjustment measures. Although this effect has many practical implications, very little research has tried to illuminate its causal mechanisms (see Knowles, Coker, Scott, Cook, & Neville, 1996). Recently, Knowles et al. (1996) suggested that the mean change is due to a change in the meaning of the items from test to retest. That is, participants come to learn the common theme of the items in a questionnaire, which changes the meaning of single items. For example, participants might first think of all episodes of crying when they answer an item such as “I cry easily.” However, after learning that the questionnaire is about anxiety, the item is understood in this context and certain episodes of crying (e.g., crying for joy) are discounted, leading to the choice of lower response categories. Although this explanation might account for decreasing means in questionnaires that assess a single construct, it can hardly explain the findings in the present study. First, the items were not intended to measure a common construct, and it is unlikely that the participants falsely detected such a common theme. Second, meaning changes should have influenced the absolute estimates in the same way, but these estimates increased.

A different explanation could be Parducci’s range-frequency principle. Parducci (1968) demonstrated that people’s assignment of numbers between 100 and 1000 to vague quantifiers such as very small, small, large, and very large was context dependent. The number 550 was rated as high if most numbers fell in the range from 100 to 550, but as low if most numbers fell in the range from 550 to 1000. In other words, the meaning of a response category depends on the distribution of the stimuli that have to be assigned to the response categories. Parducci demonstrated in several experiments with various types of stimuli that the assignment function is a compromise between a range and a frequency principle. The frequency principle implies that people try to accommodate an equal number of stimuli (i.e., in the present context a stimulus is the frequency of an emotion) in each category; that is, the same number of emotions should be in the “not at all,” “rarely,” or “often” category. Clearly, the frequency principle ignores the actual distribution of the stimuli; rather, it forces the data into a uniform distribution. The range principle is most easily understood by its mathematical formula: Ric = (Si – Smin) / (Smax – Smin), where S is the actual scale value. As the actual scale represents frequencies, it is reasonable to assume that Smin equals zero; hence, Ric = Si / Smax. Although it is not clear which frequency corresponds to Smax (probably the number of all emotional experiences during a specified time period), it is clear that the range principle preserves the distribution of the stimuli, as long as the respondent has a sufficient number of response alternatives (Parducci & Wedell, 1986). Previous studies showed that in the final assignment of an item to a category, the two principles are weighted equally: Jic = w*Ric + (1 – w)*Fic, with w = .50 (Parducci, 1968). More recently, Parducci and Wedell (1986) demonstrated that the weight of the two principles is context dependent. For example, a higher number of response categories decreased the influence of the frequency principle.
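
To make the range-frequency compromise concrete, here is a minimal Python sketch. It assumes, as is standard in range-frequency theory, that the frequency value of a stimulus is its percentile rank; the mapping of the compromise value onto seven categories and all input values are my own illustrative choices, not taken from the dissertation.

```python
import numpy as np

def range_frequency_category(frequencies, n_categories=7, w=0.5):
    """Predicted category (0 .. n_categories-1) for each stimulus frequency
    under a range-frequency compromise:
      range value      R_i = (S_i - S_min) / (S_max - S_min)
      frequency value  F_i = percentile rank of S_i (0 .. 1)
      judgment         J_i = w * R_i + (1 - w) * F_i
    """
    s = np.asarray(frequencies, dtype=float)
    r = (s - s.min()) / (s.max() - s.min())
    f = np.argsort(np.argsort(s)) / (len(s) - 1)
    j = w * r + (1 - w) * f
    return np.floor(j * (n_categories - 1) + 0.5).astype(int)

# A positively skewed set of actual emotion frequencies (illustrative values).
freqs = [28, 20, 15, 10, 8, 6, 5, 4, 3, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0]
for w in (0.5, 0.8):
    print(w, range_frequency_category(freqs, w=w).mean())
# Down-weighting the frequency principle (larger w) lowers the mean category,
# which is the direction of the observed pre- to post-diary decrease.
```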

The range-frequency principle would predict the observed mean change from pre- to post-diary vague quantifier ratings under certain conditions, namely, (a) if the distribution of the emotion frequencies is positively skewed, in which case the frequency principle leads to the assignment of small frequencies to medium response categories, and (b) if people weight the frequency principle less during the post-diary judgments, in which case the small frequencies are assigned to low categories. The major problem with this explanation is that it remains unclear why participants would shift the weights of the two principles. This problem is closely related to Haubensak’s (1994) criticism of range-frequency theory: It is descriptive but not explanatory; that is, a combination of the range and the frequency principle can predict outcomes of experiments, but this does not illuminate the underlying processes of the effect. To overcome this limitation of range-frequency theory, Haubensak (1994) developed a consistency model which might explain the finding in the present study that vague quantifier ratings decrease from pre- to post-diary judgments. According to the consistency model, respondents prefer to start the rating task with medium rating categories. If the distribution of stimuli is positively skewed, this implies that small frequencies are often assigned to medium rather than small categories. Furthermore, the first ratings influence all subsequent ratings because (a) the initial assignments of frequencies to response categories remain a standard for the complete task, and (b) the participants want to be consistent with their initial standard. Therefore, the tendency to assign small frequencies to moderate categories prevails throughout the task. This model could explain the decreasing mean of vague quantifier ratings: The second time, participants have a better sense of the distribution of emotion frequencies, and they no longer try to be consistent with the standard of the first task, which might have been forgotten anyway. To test the viability of this post-hoc explanation, additional analyses were carried out.

As noted above, a basic assumption of this explanation is that the distribution of the actual frequencies of the emotions is positively skewed. As can be seen in Figure 3, this is indeed the case, which can also be shown quantitatively (skewness = 1.04). Similarly, the pre- and post-diary absolute estimates show a similar skewness (pre-diary 1.00; post-diary 1.15). The prediction of the consistency model that the skewness of the vague quantifier ratings is reduced was also confirmed (pre-diary skewness = 0.24). The additional assumption made to explain the decreasing mean is that participants became more sensitive to the actual distribution of the stimuli, so that the distribution of the vague quantifier ratings should be more similar to the actual distribution of the stimuli (this is equivalent to a decreasing influence of the frequency principle in range-frequency theory). This is also the case (post-diary skewness = 0.41). In sum, analyses of the distribution of the actual frequencies and the vague quantifier ratings are in agreement with Haubensak’s (1994) consistency model. Problems with the initial choice of rating categories lead to distorted assignments of vague quantifiers to frequencies. This problem persists within the same questionnaire because people want to be consistent. However, experience with the set of stimuli and the lack of a need (or ability) to be consistent from one measurement point to the next allows participants to improve their ratings. Because the finding was unexpected and the explanation is post-hoc, it was further explored in Study 2.
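
A quick way to run this kind of distributional check is scipy's skewness function. The sketch below uses invented mean frequencies and ratings that merely mimic the described pattern (positively skewed actual frequencies, flatter vague quantifier ratings); none of the numbers are the study's data.

```python
import numpy as np
from scipy.stats import skew

# Invented mean frequencies and mean vague quantifier ratings across emotions.
actual_freqs = np.array([28, 20, 15, 10, 8, 6, 5, 4, 3, 3,
                         2, 2, 1, 1, 1, 1, 0.5, 0.5, 0.3, 0.2])
vague_ratings = np.array([4.5, 4.2, 4.0, 3.6, 3.4, 3.2, 3.0, 2.8, 2.6, 2.6,
                          2.3, 2.3, 2.0, 2.0, 2.0, 2.0, 1.7, 1.7, 1.5, 1.4])

print(skew(actual_freqs))    # strongly positive: few frequent, many rare emotions
print(skew(vague_ratings))   # much closer to zero: ratings flatten the distribution
```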

3.2.6 Exploration of Mood- and Personality-Congruent Biases

Personality- and mood-congruent biases were investigated simultaneously because extraversion and neuroticism are often correlated with current mood (Matthews, Jones, & Chamberlain, 1990; Schimmack, in press; Steyer, Schwenkmezger, Notz, & Eid, 1994). Current mood was measured with the 10 specific mood scales and the global pleasure-displeasure and aroused-unaroused dimensions of the ELMI (Schimmack, in press, 1996a).

To reduce the number of variables, a factor analysis of the 10 specific mood scales was carried out and the factor scores of the first two unrotated factors were retained. The obtained factors were very similar for the pre- and post-diary administration of the ELMI; therefore, only the post-diary factor analysis is reported in detail. The first factor was a displeasure-pleasure factor: The specific mood scales Depression, Grouchiness, and Irritation had high positive loadings on this factor, whereas the scales Good Humor and Relaxation had high negative loadings. The second factor was an arousal factor; the scales Nervousness, Anxiety, and Euphoria had high positive loadings on this factor. This interpretation of the factors was also supported by the simple correlations between the two factors and the directly measured pleasure and arousal dimensions. The first factor correlated highly negatively with the pleasure dimension (r = -.84, p < .01), and slightly with the arousal dimension (r = -.18, p < .05). The second factor correlated mainly with the arousal dimension (r = .45, p < .01), and slightly with the pleasure dimension (r = .18, p < .05).
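
As a rough illustration of this step, the sketch below extracts two unrotated factors from a simulated participants-by-mood-scales matrix with scikit-learn. The extraction method (maximum-likelihood factor analysis), the simulated latent structure, and all variable names are my assumptions; the original analysis may have used a different procedure.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
n = 146
# Simulate 10 mood scales driven by two latent dimensions (pleasure, arousal).
pleasure = rng.normal(size=n)
arousal = rng.normal(size=n)
loadings_true = np.column_stack([rng.uniform(-1, 1, 10), rng.uniform(-1, 1, 10)])
mood = (np.column_stack([pleasure, arousal]) @ loadings_true.T
        + rng.normal(0, .5, (n, 10)))

fa = FactorAnalysis(n_components=2)   # two unrotated factors
scores = fa.fit_transform(mood)       # factor scores, one pair per participant
loadings = fa.components_.T           # estimated loadings of the 10 scales
print(loadings.round(2))
```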

In the following analysis, the factor scores were used, because they are based on a greater number of items than the direct measures of pleasure and arousal. To facilitate the interpretation of results, the factor scores of the first factor were inverted so that higher values indicate more pleasure. Extraversion and neuroticism were measured by the respective scales of the NEO-PI-R. To reduce the number of analyses, the frequency estimates of pleasant and those of unpleasant emotions were averaged (analyses for each emotion separately are included in Appendix 2).

In the first set of analyses, the post-diary frequency judgments were regressed simultaneously onto the actual frequencies, the mood and the personality variables to control for the intercorrelations between the predictor variables. Table 6 shows that for all analyses the daily averages were the strongest predictor of frequency estimates, indicating that the frequency judgments primarily reflect individual differences in the actual frequencies of experienced emotions. The personality and mood variables, however, were also related to the frequency judgments.
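
The logic of these simultaneous regressions can be sketched as follows. The data are simulated so that the actual frequencies dominate the estimates, and all variable names (actual, pleasure, arousal, extraversion, neuroticism, estimate) are placeholders rather than the study's variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 146
# Hypothetical, standardized participant-level predictors.
df = pd.DataFrame({
    'actual': rng.normal(size=n),        # averaged actual frequency (pleasant emotions)
    'pleasure': rng.normal(size=n),      # current-mood pleasure factor score
    'arousal': rng.normal(size=n),       # current-mood arousal factor score
    'extraversion': rng.normal(size=n),
    'neuroticism': rng.normal(size=n),
})
# Simulated post-diary estimate: dominated by actual frequencies, small extraversion bias.
df['estimate'] = 0.7 * df['actual'] + 0.15 * df['extraversion'] + rng.normal(0, .5, n)

model = smf.ols('estimate ~ actual + pleasure + arousal + extraversion + neuroticism',
                data=df).fit()
print(model.params.round(2))   # the actual frequencies should carry the largest weight
```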

The absolute estimates showed a consistent bias for extraversion: Extraverted individuals estimated their pleasant and unpleasant emotions to occur more frequently than introverted individuals. Because extraversion is not generally assumed to be congruent with frequent experience of unpleasant emotions, this result does not indicate a personality-congruent effect. In contrast, the vague quantifier ratings showed a personality-congruent bias: Extraversion was a significant predictor of frequency estimates of pleasant emotions even after controlling for actual frequencies of emotions, and neuroticism predicted a bias in the frequency estimates of unpleasant emotions.

A mood-congruent effect was obtained in that a pleasant mood predicted lower vague quantifier ratings of unpleasant emotions, but not higher ratings of pleasant emotions.

In a second analysis, the simple correlations between the personality and mood variables with the post-diary judgments were compared to the correlations with the pre-diary judgments. Conceivably, participation in the diary study could attenuate personality or mood biases, because the participants are more aware of their emotional experiences. If this is true, the simple correlations between personality and mood measures and frequency judgments should be higher for the pre- than for the post-diary judgments. However, Table 7 provides little support for this hypothesis. Only the absolute level of the correlation between current pleasure and vague quantifier ratings of pleasant and unpleasant emotions was higher for pre- than for post-diary estimates.

In sum, the results are mixed; only vague quantifier ratings showed a consistent personality-congruent bias (see also Feldman Barrett, in press). This finding could be a method artifact because the measurement of extraversion and neuroticism in the NEO-PI-R is partly based on items that include vague quantifiers (e.g., “I rarely feel depressed”). Therefore, individual differences in the interpretation of vague quantifiers could explain the finding that the personality measure explained additional variance on top of the actual frequencies, which are based on absolute estimates. Finally, it should be noted that the actual frequencies were by far the strongest predictor of the frequency judgments of emotions. This shows that frequency judgments are mainly based on the actual frequencies of emotions in the past, and that biases play only a minor role.

3.3 Additional Analyses

3.3.1 The Time Extension of Frequency Judgments of Emotions

People’s frequency judgments of emotions might reflect only the frequencies of emotions in the most recent past, or they may extend over longer time periods. To address this question empirically, the daily frequency estimates were averaged separately for the first, second, and third week of the diary period. Then, the post-diary estimates were regressed onto the three weekly averages in hierarchical regression analyses. In one set of analyses, the third week was entered first, followed by the second and first week, whereas in a second set of analyses, the predictors were entered in the reverse order. If information about more remote time periods (that is, the first week) is weighted less heavily in the frequency judgments, then entering the first week as the last predictor should explain less additional variance than entering the third week as the last predictor. Figure 7 shows the averaged incremental amount of explained variance for the separate analyses of the 20 emotions (see Appendix 3 for the results of each emotion).
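
A minimal sketch of this hierarchical (stepwise-entry) logic is shown below. The weekly averages and the post-diary estimate are simulated with a built-in recency effect; the column names w1, w2, w3, and estimate are placeholders, not the study's variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 146
# Hypothetical weekly actual frequencies of one emotion, plus a post-diary
# estimate that weights the most recent week most heavily (simulated recency).
df = pd.DataFrame({'w1': rng.poisson(8, n), 'w2': rng.poisson(8, n),
                   'w3': rng.poisson(8, n)})
df['estimate'] = 0.2 * df.w1 + 0.3 * df.w2 + 0.6 * df.w3 + rng.normal(0, 1.5, n)

def incremental_r2(order):
    """R-squared added at each step when predictors are entered in `order`."""
    r2, increments = 0.0, []
    for k in range(1, len(order) + 1):
        fit = smf.ols('estimate ~ ' + ' + '.join(order[:k]), data=df).fit()
        increments.append(round(fit.rsquared - r2, 3))
        r2 = fit.rsquared
    return increments

print(incremental_r2(['w3', 'w2', 'w1']))  # most recent week entered first
print(incremental_r2(['w1', 'w2', 'w3']))  # most remote week entered first
```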

A comparison of the increments in explained variance for the two orders in which the predictor variables were entered revealed that in the first step more variance was explained by week 3 than by week 1, t(19) = 2.24, p < .05. Week 2, which was always entered in the second step, explained more variance when it was entered after week 1 rather than after week 3, t(19) = 3.49, p < .01. There was no significant difference in the amount of explained variance for step 3, t(19) = 0.11, p = .91. This pattern of results indicates a slight recency effect in the vague quantifier ratings. However, week 1 and week 2 still explained 3% additional variance when they were entered after week 3. Therefore, vague quantifier ratings reflect the emotional experiences over the whole three weeks of the diary study (in individual analyses, an increment of 3% explained variance was significant).

The same analyses were performed for the absolute estimates. For these judgments, very different results were obtained (Figure 8): When week 3 was entered first, adding the second and first week hardly increased the amount of explained variance (2% and 1% respectively). In contrast, when week 1 was entered first, week 2 and 3 still explained a considerable amount of additional variance. The differences between the two orders of entry in amount of explained variance for all three steps were highly significant, all ts(19) > 5.00, ps < .01. This pattern of results reveals a strong recency effect for the absolute estimates.

The differences between vague quantifier ratings and absolute estimates can also be shown quantitatively. The third week entered in Step 1 explained more variance for the absolute estimates than for the vague quantifier ratings, t(19) = 2.14, p < .05. In contrast, week 1 entered in Step 3 explained more additional variance for the vague quantifier ratings than for the absolute estimates, t(19) = 5.84, p < .01. There were no differences for the second week entered in step 2, t(19) = 1.72, p = .10. In sum, both response formats show a recency effect; that is, frequency judgments are biased toward the frequencies in the more recent past. However, this effect is more pronounced for the absolute estimates than for the vague quantifier ratings.

The stronger recency effect for absolute estimates might be due to the use of absolute estimates to assess the actual frequencies during the diary study. Therefore, participants might have been influenced by recollections of their last daily absolute estimates when they made the absolute estimates, but not when they made the vague quantifier ratings. This could also explain why the absolute estimates were much less stable than the vague quantifier ratings from the pre- to the post-diary judgments (Table 2). Nevertheless, both response formats achieve an equally good prediction of emotional frequencies in the last three weeks (see Figure 4), but they do so differently: Whereas the absolute estimates better capture the frequencies in the most recent past, the vague quantifier ratings more accurately reflect frequencies in the more remote past.

3.3.2 Interrelations between the Frequencies of Pleasant and Unpleasant Emotions

The relation between the frequencies of pleasant and unpleasant emotions is an important topic in personality psychology. Previous researchers found the frequencies of pleasant and unpleasant affects to be independent or negatively correlated (Bradburn, 1969; Diener, Smith et al., 1995; Green et al., 1993; Watson, Clark, & Tellegen, 1988). However, these studies relied mostly on retrospective frequency estimates (for an exception see Diener, Smith et al., 1995). More importantly, the studies exclusively used vague quantifiers to assess the frequency of emotions. Because the present study already revealed several differences between vague quantifier ratings and absolute estimates, it seemed worthwhile to explore whether the response format also influenced the relation between frequency estimates of pleasant and unpleasant emotions. This is indeed the case as can be seen in Table 8: The absolute response format produced positive correlations, whereas the vague quantifier ratings produced negative correlations. These conflicting correlations indicate that one response format produces misleading results due to a method artifact.

Green et al. (1993) have argued that the experience of pleasant and unpleasant moods is highly negatively correlated, but that this negative correlation is often obscured by random and systematic measurement errors. At first sight, this argument would suggest that the positive correlation obtained for the absolute estimates is an artifact, for example due to an extremity bias. Furthermore, the low negative correlations obtained for the vague quantifier ratings could be attenuated by random measurement error. However, this interpretation of the data does not recognize the important distinction between mood and emotion. Green et al. (1993) asked their respondents how much pleasant or unpleasant mood they experienced in the last month. Assuming that a person is most of the time in a mood state (cf. Ekman & Davidson, 1994, chapter 2), that is, feels either pleasant or unpleasant, and that pleasant and unpleasant affects are rarely experienced at the same moment in time (Diener & Iran-Nejad, 1986; Green et al., 1993; Schimmack, in press; Steyer et al., 1994), it follows that the amount of pleasant mood must be negatively correlated with the amount of unpleasant mood experienced in the last month. This logical necessity, however, does not apply to the relation between the frequencies of pleasant and unpleasant emotions (see Figure 1), because emotions are not elicited and experienced all the time. Therefore, the number of times that pleasant emotions are elicited can vary independently of the number of times that unpleasant emotions are elicited. In the following, I want to argue that the empirical relation between frequencies of pleasant and unpleasant emotional experiences is positive, that is, that some individuals experience more pleasant and unpleasant emotions than others, and that the negative correlation obtained for the vague quantifier ratings is an artifact.

A major problem of vague quantifier ratings is that it is unknown how the participants use the vague quantifiers for their judgments. One possibility could be that participants use vague quantifiers to indicate ranges of absolute frequencies; that is, rarely might mean 2-5 times a week. If this were true, however, vague quantifier ratings should show a pattern of results similar to the absolute estimates. The previous findings contradict this hypothesis. Another possibility could be that the participants used vague quantifiers to describe ranges of percentages (cf. Reisenzein, 1995). For example, experiencing an emotion often might mean in 80 to 90% of all emotional experiences. This, however, would produce a method artifact in the analysis of individual differences in the frequencies of experienced emotions, because percentages eliminate such individual differences. For example, one person might have only 10 emotional experiences a week, of which 8 elicited happiness and 2 elicited sadness. Another person might have 100 emotional experiences a week, of which 80 elicited happiness and 20 sadness. Both respondents might say that they experience happiness often, meaning in 80-90% of their emotional experiences, and sadness rarely, meaning in 10-20% of their emotional experiences. But the second person clearly experienced both emotions more frequently than the first person. Furthermore, because pleasant and unpleasant emotions co-occur very infrequently during a single emotional episode (Reisenzein, 1995; Schimmack & Reisenzein, in press), percentages, in contrast to absolute frequencies, of pleasant and unpleasant emotions are bound to be negatively correlated across participants.
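
This argument is easy to verify with a small simulation: when individuals differ mainly in how many emotional episodes they have overall, absolute frequencies of pleasant and unpleasant emotions correlate positively, whereas proportion-based judgments are negatively correlated by construction. The distributions and parameter values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 146
# Individuals differ widely in how many emotional episodes they have per week.
total_episodes = rng.integers(10, 120, n)
# The proportion of pleasant episodes varies around .65 between individuals.
p_pleasant = np.clip(rng.normal(.65, .08, n), .05, .95)

pleasant = total_episodes * p_pleasant          # "actual" absolute frequencies
unpleasant = total_episodes * (1 - p_pleasant)

# Absolute frequencies of pleasant and unpleasant emotions correlate positively,
# because both scale with the overall number of episodes.
print(np.corrcoef(pleasant, unpleasant)[0, 1])

# Percentage-based judgments are negatively correlated by construction (-1 here),
# regardless of individual differences in absolute frequencies.
print(np.corrcoef(p_pleasant, 1 - p_pleasant)[0, 1])
```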

The hypothesis that participants use vague quantifiers to indicate percentages makes the prediction that vague quantifier ratings of, for example, pleasant emotions reflect not only the actual frequencies of pleasant emotions, but also the actual frequencies of unpleasant emotions, although in the opposite direction; that is, higher frequency judgments are obtained for lower actual frequencies of emotions of the opposite valence. This follows from the fact that percentages take all emotional experiences into account (i.e., rating of pleasure = actual pleasure / (actual pleasure + actual displeasure)). To test this hypothesis, the vague quantifier ratings of pleasant and unpleasant emotions were regressed onto the actual frequencies of pleasant and unpleasant emotions. Table 8 shows the predicted pattern that vague quantifier ratings of pleasant (unpleasant) emotions are positively related to the actual frequencies of pleasant (unpleasant) emotions, but also negatively related to the actual frequencies of unpleasant (pleasant) emotions.

However, the correlations across types of affects are not close to -1, which is what one would expect if vague quantifiers were pure measures of percentages. It is therefore conceivable that they reflect partly absolute frequencies and partly percentages of emotional experiences. If the vague quantifier ratings also reflect individual differences in the average actual frequencies of all emotions, the sum of all vague quantifier ratings should be correlated with the sum of all absolute estimates. Indeed, the correlations are rs = .40 and .45, ps < .01, for the pre- and post-diary vague quantifier ratings. Apparently, vague quantifier ratings indicate partly absolute frequencies and partly percentages of an individual’s emotional experiences.

In sum, the major implication of the present findings is that the negative correlations between frequencies of pleasant and unpleasant emotions, obtained with vague quantifier ratings, are an artifact; that is, they conceal that the actual frequencies of pleasant and unpleasant emotions are positively correlated. Additional support for this claim stems from a study by Schimmack and Diener (in press), who also found a positive correlation between frequencies of pleasant and unpleasant emotions by means of a different method. In several studies, the participants indicated their emotional reactions to hypothetical or real life events, using an intensity scale ranging from 0=not at all to 6=extremely intense. Frequency of emotions was then determined as the number of non-zero ratings; that is, the number of times the emotion was experienced at all. In all studies, the frequencies of pleasant emotions were found to be positively correlated with the frequency scores of unpleasant emotions. Hence, yet another method supports a positive correlation between frequencies of pleasant and unpleasant emotions. Furthermore, Suh, Diener, and Fujita (1996) demonstrated that people with more positive life events also have more negative life events, presumably because they lead more active lives. For example, researchers who submit many papers to a journal receive both more positive and more negative reviews, and as a consequence experience more pleasant and unpleasant emotions, than researchers who submit only a few papers. This can explain the positive correlation between the frequencies of pleasant and unpleasant emotions, because positive events elicit pleasant emotions and negative events elicit unpleasant emotions.

In sum, the present results indicate that the frequencies of pleasant and unpleasant emotions are positively correlated across individuals. This finding contradicts previous findings. However, previous studies either studied moods rather than emotions (Green et al., 1993) or relied exclusively on vague quantifier ratings to measure the frequency of emotions. For reasons outlined above, the use of vague quantifiers is likely to produce method artifacts and lead to false conclusions about individual differences in the frequency of experienced emotions.

3.4 Discussion

The findings of Study 1 support the first three hypotheses: Frequency judgments of emotions possess discriminative accuracy across emotions as well as across participants. They underestimate the actual frequencies of emotions, and they do so increasingly with increasing frequencies of actual occurrences. Study 1 also provided evidence for hypothesis 5 that the salience of emotions during the encoding stage increases subsequent frequency judgments: After participating in a diary study, participants provided higher absolute frequency estimates and did so especially for salient emotions, that is, emotions that were included in the daily report form. Furthermore, frequent emotions showed stronger increases than infrequent emotions. This is very likely due to the stronger underestimation of these emotions when they are not salient; there is simply more room for salience to boost the estimates of frequent emotions.

The increased salience also had an effect on the accuracy of the frequency judgments. All measures of accuracy showed higher accuracy for the post-diary than for the pre-diary judgments. Although the interpretation of this finding is ambiguous, because the pre-diary judgments necessarily covered a different time period than the period when the actual frequencies were measured, the consistency of the effects suggests that participants were better able to judge the frequencies of their emotions after participating in a diary study. Influences of salience on the accuracy of frequency judgments have also been reported in experimental studies (Naveh-Benjamin & Jonides, 1986).

The findings concerning mood- or personality-congruent biases were mixed. Only the vague quantifier ratings showed a personality-congruent bias: neurotic individuals overestimated the frequencies of their unpleasant emotions, and extraverted individuals overestimated the frequency of their pleasant emotions (Diener et al., 1984; Feldman Barrett, in press).

Study 1 also revealed interesting new findings that deserve attention in future research. First, in various analyses different results were obtained for the absolute estimates and the vague quantifier ratings. Concerning absolute estimates, pre-diary estimates revealed low discriminative accuracy across participants, and the test-retest correlations between pre- and post-diary estimates were low, despite a relatively high stability of emotion frequencies. It was also found that the absolute estimates reflect mostly frequencies in the most recent past (last week), and that they were unaffected by mood- or personality-congruent biases. On the other hand, vague quantifier ratings showed high temporal stability, reflected frequencies of emotions over the last three weeks, and appeared to be slightly biased in a personality- and mood-congruent direction. First of all, this finding indicates that the choice of the response format matters (Schaeffer, 1991), a factor that has been neglected in the frequency judgment literature. The present study only allows speculation about the causal mechanisms that produced these differences. Brown (1995) made the important point that most frequency judgment models do not explain how frequency information (e.g., a feeling of familiarity) is converted into an absolute estimate. The present study shows that the conversion into a vague quantifier rating also requires explanation. The finding that the two response formats produced divergent results, even though both formats were administered at the same time to the same participants, suggests that the response formats influence predominantly the conversion of frequency information into a response.

One very interesting difference between the two response formats was that absolute estimates increased considerably from pre- to post-diary judgments. In contrast, vague quantifier ratings, which were made right before the absolute estimates, decreased from pre- to post-diary judgments. This finding demonstrates clearly that vague quantifier ratings do not correspond to absolute frequencies. The unexpected decrease of vague quantifier ratings in a test-retest design has been observed in other studies (Knowles et al., 1996). Post-hoc analyses were in agreement with Haubensak’s consistency model. Vague quantifier ratings, but not absolute estimates, distorted the positively skewed distribution of the emotion frequencies. However, the post-diary vague quantifier ratings reflected the actual distribution better than the pre-diary ratings. This finding suggests that the participants had problems converting their frequency impressions into ratings on a limited number of vague quantifier categories.

Finally, the two response formats produced different correlations between frequencies of pleasant and unpleasant emotions: For absolute estimates the correlation was positive, whereas it was negative for the vague quantifier ratings. Regression analyses suggested that the negative correlations obtained with vague quantifier ratings were due to a method artifact: Participants used vague quantifiers partly to indicate percentages of the overall number of their emotional experiences (Reisenzein, 1995), which (a) eliminates individual differences in absolute frequencies of emotions and (b) pushes the correlation in a negative direction. Therefore, the absolute estimates are better suited to explore individual differences in the frequency of emotional experiences. According to the present study, people who experience pleasant emotions frequently are also likely to experience more unpleasant emotions. This finding challenges current structural models of personality which assume that the frequencies of pleasant and unpleasant emotions are independent (cf. Costa & McCrae, 1992; Watson et al., 1988).

4 STUDY 2

Study 2 was similar to Study 1 in that the participants again made daily frequency judgments and estimated the frequency of emotions before and after the diary study. Minor differences between Study 1 and Study 2 are that in Study 2 (a) the diary period extended over only two weeks, (b) the actual frequencies were based on twice-daily absolute estimates, and (c) the participants made only vague quantifier ratings, but not absolute estimates, before and after the diary period. The most important difference was that the participants in Study 2 estimated the frequencies of emotions separately for the first and second week of the diary period. In analogy to Hintzman and Block’s (1971) seminal experiment on frequency judgments for separate word lists, this permitted a test of whether information about the frequency of emotions is stored directly or indirectly in memory. If frequencies of emotions are stored directly in memory, participants should be unable to provide accurate frequency estimates separately for the first and second week. However, if frequency estimates are based on the activation of multiple memory traces, and if memory traces of particular time periods can be selectively activated, then participants should be able to judge accurately the frequencies of their emotions in the first and the second week.

Study 2 also provided for an opportunity to replicate several of the findings in Study 1, namely to test (a) the discriminative accuracy of frequency judgments of emotions across emotions and participants, (b) the presence of personality- and mood-congruent biases, and (c) the unexpected decrease of vague quantifier ratings over repeated measurements.

4.1 Method

4.1.1 Participants

The participants were undergraduate students at the Free University Berlin who took part in a course on emotions in everyday life. Eighty participants (24 men and 56 women) with a mean age of 25 completed all data collections.

4.1.2 Material and Procedure

4.1.2.1 Daily Estimates

At the core of Study 2 was the two-week diary period. Participants rated the frequency of 34 emotions twice a day, which probably provides a better estimate of the actual frequencies of emotions than the end-of-day judgments in Study 1. The response format for the daily estimates differed from Study 1. Whereas in Study 1 a free response format was used, the participants in Study 2 used a seven-point scale, ranging from 0 to 6, for their absolute estimates. All categories of this scale represented absolute frequencies (i.e., 0=never, 1=once, 2=twice, etc.) except the last category, which comprised all absolute frequencies greater than five. For most emotions, however, this category was used infrequently, so that the sum of the twice-daily estimates approximates the actual frequencies of emotional experiences during the diary period. Furthermore, the absolute level of the frequencies is less relevant in Study 2, because participants did not make absolute estimates before and after the diary period. Therefore, absolute and relative accuracy were not tested in Study 2.

In Study 1, participants had been asked to return questionnaires on a daily basis to ensure that the ratings were completed daily. This procedure was not feasible in Study 2 because students in Berlin do not live “on campus” and many students do not visit the university each day. Therefore, the report forms for the daily ratings were given to the participants in the form of two booklets; one for each week, so that at least the weekly completion of the report forms could be controlled. Afterwards, the twice-daily ratings were averaged across the repeated assessments to obtain a measure of the actual frequencies with which emotions were experienced.

4.1.2.2 Vague Quantifier Ratings

The vague quantifier ratings were always made for the time period of one week (How often did you experience joy in the last week?). The questionnaire included the 34 emotions that were on the daily report form, although in a different order, to discourage participants from using a stereotyped response pattern when they made the post-diary judgments. Judgments were made on the following scale: 0 = never, 1 = very rarely, 2 = rarely, 3 = sometimes, 4 = often, 5 = very often, and 6 = nearly always. Even though the daily report form and the vague quantifier ratings used a rating scale from 0 to 6, participants could not use the modal response on the daily form to make accurate vague quantifier ratings, because the same numeric category has different meanings in the two questionnaires. For example, a participant who always checks the response category “1” on the daily form as the frequency of his experiences of hate would thereby indicate experiencing hate 14 times a week; this means experiencing hate quite often, not “very rarely” as a vague quantifier rating of “1” would imply.

The vague quantifier ratings were made four times: twice prior to the diary study, with a two-week interval between the two assessments, and twice immediately after the diary study, to provide separate judgments of the first and the second week of the diary period.

4.1.2.3 Mood Questionnaire

Current mood was assessed with the BASTI (Schimmack, in press). The BASTI is the German counterpart of the ELMI used in Study 1 (see 3.1.2.4). The BASTI was completed after each administration of the vague quantifier ratings.

4.1.2.4 Personality Questionnaires

Extraversion and neuroticism were assessed two weeks after the diary study with the NEO-FFI (Borkenau & Ostendorf, 1991). In this questionnaire, neuroticism and extraversion are assessed with 12 items each. The NEO-FFI is a short version of the NEO-PI-R (Costa & McCrae, 1992) used in Study 1. The NEO-FFI is somewhat less reliable than the NEO-PI-R, but the reliability is still good (Borkenau & Ostendorf, 1991).

4.2 Results

Because the participants in Study 2 did not estimate absolute frequencies before or after the diary study, it was not possible to test the absolute or relative accuracy of frequency judgments in this study.

4.2.1 Discriminative Accuracy across Emotions

As in Study 1, the discriminative accuracy across emotions was assessed at the group and at the individual level. For the analysis at the group level, the actual frequencies and the vague quantifier ratings of the 34 emotions were averaged across the 80 participants. Then, the correlations of the actual frequencies with the two pre-diary and the averaged post-diary estimates were computed across the 34 emotions. All correlations were very high and statistically significant (pre-diary 1 r = .88, pre-diary 2 r = .92, averaged post-diary r = .96); a trend toward higher accuracy after repeated assessments is also apparent.
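
To make the computations behind these indices concrete, the following minimal sketch (Python, with simulated data and hypothetical variable names) illustrates how group-level and individual-level discriminative accuracy can be obtained from a participants-by-emotions matrix of actual frequencies and post-diary ratings. It is an illustration of the analysis logic, not the original analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_participants, n_emotions = 80, 34

# Simulated actual frequencies (rows = participants, columns = emotions) and
# post-diary vague quantifier ratings that track them imperfectly.
actual_freq = rng.poisson(lam=rng.uniform(1, 10, n_emotions),
                          size=(n_participants, n_emotions))
post_ratings = actual_freq / actual_freq.max() * 6 + rng.normal(0, 0.5, actual_freq.shape)

# Group level: average each measure across participants, then correlate the
# two 34-element profiles across emotions.
r_group = np.corrcoef(actual_freq.mean(axis=0), post_ratings.mean(axis=0))[0, 1]

# Individual level: one correlation across the 34 emotions per participant;
# these coefficients serve as the dependent variable of the repeated-measures ANOVA.
r_individual = np.array([np.corrcoef(actual_freq[i], post_ratings[i])[0, 1]
                         for i in range(n_participants)])

print(round(r_group, 2), round(r_individual.mean(), 2))
```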

To test discriminative accuracy at the level of each participant, the same correlations were computed for each participant. Subsequently, the correlation coefficients were used as dependent variables in an ANOVA with the within-subject factor time of judgment. This analysis revealed a significant main effect, F(2,158) = 151.88, p < .01. Follow up analyses revealed that all three mean correlations differed significantly from each other (Figure 9).

The increased accuracy from the first to the second pre-diary ratings is especially important because neither rating covers the period during which the actual frequencies of emotions were assessed. Therefore, the effect can be more easily attributed to an increase in people’s ability to discriminate the frequencies of emotions. Furthermore, the increased accuracy from pre- to post-diary ratings replicates the finding of Study 1 and probably reflects an influence of participation in the diary study on the discriminative accuracy across emotions. In sum, the results largely replicate the finding of Study 1 that frequency judgments of emotions possess discriminative accuracy across emotions, which increases over repeated assessments.

4.2.2 Discriminative Accuracy across Participants

To estimate the discriminative accuracy across participants, the two post-diary estimates were averaged (separate analyses for each of the two diary weeks are reported later). As in Study 1, the actual frequencies were correlated with the pre- and the post-diary ratings.
Figure 10 shows that the frequency judgments of emotions in general possessed discriminative accuracy across participants (the separate correlations for each of the 34 emotions can be found in Appendix 5).

However, follow-up tests indicated that all three mean correlations differed significantly from each other. Study 2 also replicated the finding that pre-diary estimates are better predictors of the actual frequencies of emotions with higher temporal stability. However, this finding was only supported for the second pre-diary judgments (r = .49, p < .01), but not for the first pre-diary judgments (r = .10, p = .59). This pattern of results indicates that the discriminative accuracy of pre-diary judgments is attenuated because they cover a different time period than the one during which the actual frequencies of emotions were assessed. Nevertheless, the high accuracy of the post-diary judgments is probably due to the participation in the diary study and overestimates the discriminative accuracy across participants under natural conditions.

Next, the discriminative accuracy across participants in Study 2 was quantitatively compared to that in Study 1 for the 18 emotions that were included in both studies (all emotions of Study 1 except hurt and worry). These analyses revealed a very similar degree of accuracy in both studies (Study 1: pre-diary r = .39, post-diary r = .58; Study 2: second pre-diary r = .36, post-diary r = .58), which did not differ significantly from each other, ts < 1.50, ps > .20. Furthermore, it was explored whether emotions that revealed high discriminative accuracy across participants in one study also revealed high discriminative accuracy across participants in the other study; this was, however, not the case (pre-diary r = .09, post-diary r = .22, both ps > .10). Therefore, at present it is not possible to recommend emotions that guarantee a high degree of discriminative accuracy across participants.

As in Study 1, the specificity of the frequency judgments was investigated by comparing the correlation of the actual frequencies of an emotion with the frequency judgment of the same emotion to the correlations with the frequency judgments of the other 33 emotions. Specificity was established if the correlation with the same emotion was higher than any of the other 33 correlations. For the post-diary estimates this was the case for all emotions except discontentment (see Appendix 6). However, for the two pre-diary estimates, only about half of the emotions revealed specificity (first pre-diary estimates N = 16, second pre-diary estimates N = 15 out of 34). Although this number is still significantly different from chance (expected value N = 1, both χ2s > 200, ps < .01), it indicates lower specificity for the pre-diary estimates.
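
A short sketch of this specificity check (Python; actual_freq and ratings are hypothetical names for the participants-by-emotions matrices described above) may clarify the comparison; it is an illustration under these assumptions, not the original analysis code.

```python
import numpy as np

def specificity_count(actual_freq, ratings):
    """Count emotions whose actual frequencies correlate more highly (across
    participants) with the frequency judgment of the same emotion than with
    the frequency judgments of every other emotion."""
    n_emotions = actual_freq.shape[1]
    n_specific = 0
    for j in range(n_emotions):
        r_all = np.array([np.corrcoef(actual_freq[:, j], ratings[:, k])[0, 1]
                          for k in range(n_emotions)])
        if r_all[j] > np.delete(r_all, j).max():
            n_specific += 1
    return n_specific

# e.g., specificity_count(actual_freq, post_ratings) for the post-diary estimates
```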

A similar effect had not been observed in Study 1. A possible explanation for this difference could be that Study 1 included more similar emotions. Another explanation would be the longer delay between pre-diary ratings and the assessment of the actual frequencies, which renders it more likely that the actual frequencies changed between the time periods covered by the pre-diary judgments and the time period during which the actual frequencies were assessed.

In sum, Study 2 closely replicated the findings of Study 1 that frequency judgments of emotions possess moderate discriminative accuracy across participants. The size of the correlations between pre-diary and post-diary estimates with actual frequencies was very similar in both studies in the range from r = .30 to r = .60. Furthermore, frequency judgments of emotions in Study 2 were quite often specific for each emotion, indicating that the respondents used different information for frequency judgments of each emotion. As in Study 1, this finding demonstrates that discriminative accuracy across participants is not due to response sets.

4.2.3 Exploration of Mood- and Personality-Congruent Biases

The analysis closely follows the procedure of Study 1. First, the 10 specific mood dimensions of the BASTI were submitted to a factor analysis and the factor scores of the first two unrotated factors were retained for further analyses. Replicating Study 1, the factor scores of the first factor were highly negatively correlated with the directly measured pleasure-displeasure dimension (r = -.81, p < .01), and the factor scores of the second factor were moderately correlated with the directly measured arousal dimension (r = .52, p < .01). To facilitate the interpretation, the factor scores of the first factor were inverted so that positive values indicate pleasure. The frequency estimates of pleasant and of unpleasant emotions were averaged to reduce the number of analyses. Emotion words denoting “mixed feelings” (e.g., sympathy) were dropped at this step. Next, multiple regression analyses were carried out, in which the post-diary vague quantifier ratings were regressed onto the actual frequencies of emotions, neuroticism, extraversion, current pleasure, and current arousal. Table 10 shows that the actual frequencies of emotions were the best predictor. The only additional significant effect was that current arousal predicted frequency estimates of pleasant emotions. However, current arousal did not predict vague quantifier ratings of pleasant emotions in Study 1, although it did predict absolute estimates. Similarly, Study 2 did not replicate the personality-congruency effects obtained for vague quantifier ratings in Study 1.
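
The bias regressions can be sketched as follows (Python with statsmodels; the data frame, column names, and simulated values are hypothetical stand-ins for the averaged actual frequencies, the NEO scales, and the two mood factor scores); the same model would be fit for unpleasant emotions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 80
df = pd.DataFrame({
    "actual_pleasant": rng.normal(10, 2, n),  # averaged actual frequencies of pleasant emotions
    "neuroticism": rng.normal(0, 1, n),
    "extraversion": rng.normal(0, 1, n),
    "pleasure": rng.normal(0, 1, n),          # inverted first factor score of the mood measure
    "arousal": rng.normal(0, 1, n),           # second factor score of the mood measure
})
df["rating_pleasant"] = 0.6 * df["actual_pleasant"] + rng.normal(0, 1, n)

# Post-diary vague quantifier ratings of pleasant emotions regressed on the
# actual frequencies plus the personality and mood predictors.
model = smf.ols("rating_pleasant ~ actual_pleasant + neuroticism + extraversion"
                " + pleasure + arousal", data=df).fit()
print(model.params.round(2))
```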

Finally, the simple correlations between the pre- and post-diary vague quantifier ratings and the five predictors of the previous analyses were compared to each other. If participation in a diary study attenuates biases, the simple correlations between personality and mood measures and pre-diary frequency judgments should be higher than those for the averaged post-diary estimates. Table 11 shows that only extraversion was more highly correlated with both pre-diary ratings than with the post-diary judgments. However, extraversion revealed no higher correlations with the pre-diary estimates in Study 1. Hence, the present studies do not support the hypothesis that participation in the diary study reduces the influence of personality- or mood-congruent biases on frequency judgments of emotions.

In sum, only the actual frequencies were consistently related to frequency judgments of emotions in Studies 1 and 2. Personality and mood effects were not consistent across studies or across response formats in Study 1. Although the present results do not show that personality- or mood-congruent effects do not exist, especially given that other studies found at least personality-congruent biases (Diener et al., 1984; Feldman Barrett, in press), they clearly show that biases are small relative to the amount of accuracy in frequency judgments. This conclusion is in agreement with previous studies, in which the bias was also small compared to the accuracy of the retrospective judgments (Diener et al., 1984; Larsen, 1992; Feldman Barrett, in press).

4.2.4 Accuracy of Separate Estimates of the First and Second Diary Week

The next analyses go beyond a simple test of the accuracy of frequency judgments of emotions, and start to investigate the cognitive processes underlying frequency judgments of emotions. The participants in Study 2 were asked to provide separate frequency estimates for the first and second week of the diary period. If frequency information is stored directly in memory and frequency judgments are simply based upon the retrieval of this prestored information, participants should not be able to make accurate frequency judgments for the two separate weeks. In contrast, if frequency information is stored indirectly in memory (e.g., in the form of multiple episodes) and frequencies are computed only at the time of the judgments, the judgments should accurately reflect differences between the frequencies of emotions in the two weeks, because contextual cues can be used to activate memory traces of specific time periods (Hintzman & Block, 1971).

First, the daily estimates were averaged separately for the first and second week. Then, a set of hierarchical regression analyses was computed to determine three components of the shared variance between the actual frequencies in the two weeks and the vague quantifier ratings of one week: (a) the variance that is uniquely explained by the actual frequencies of emotions in the first week, (b) the variance that is uniquely explained by the actual frequencies of emotions in the second week, and (c) the variance that is shared by the two predictor variables. Figure 11 illustrates how the explained variance was decomposed into these three components. The amount of shared variance is simply the total amount of explained variance minus the two unique variances (R²total = R²unique week1 + R²unique week2 + R²shared). This decomposition of the overall amount of explained variance is possible because all variables were positively intercorrelated, so that suppression effects can be ruled out.

If participants are able to make accurate frequency estimates separately for the two weeks, entering the judged week in the second step should produce a higher increase in explained variance than entering the week that was not the target of the judgment (i.e., for ratings of week 1, R²unique week1 > R²unique week2; for ratings of week 2, R²unique week2 > R²unique week1). Furthermore, if participants discriminate the frequencies of emotions in the two weeks perfectly, adding the non-target week in the second step should not increase the amount of explained variance (i.e., for ratings of week 1, R²unique week2 = 0; for ratings of week 2, R²unique week1 = 0). Finally, if participants are better able to estimate the frequencies of emotions in the more recent second week, the unique variance of the target week should be higher for estimates of week 2 than for estimates of week 1.
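
The following minimal sketch (Python, simulated data for a single emotion) spells out this decomposition; the helper name and the simulated correlation structure are assumptions, not the original analysis code.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

rng = np.random.default_rng(2)
n = 80
week1 = rng.normal(0, 1, n)                         # actual frequencies, week 1
week2 = 0.7 * week1 + rng.normal(0, 0.7, n)         # week 2, highly correlated with week 1
rating_week1 = 0.8 * week1 + 0.2 * week2 + rng.normal(0, 1, n)  # vague quantifier rating of week 1

r2_total = r_squared(np.column_stack([week1, week2]), rating_week1)
r2_unique_week1 = r2_total - r_squared(week2[:, None], rating_week1)  # increment of adding week 1 last
r2_unique_week2 = r2_total - r_squared(week1[:, None], rating_week1)  # increment of adding week 2 last
r2_shared = r2_total - r2_unique_week1 - r2_unique_week2

print(round(r2_total, 3), round(r2_unique_week1, 3),
      round(r2_unique_week2, 3), round(r2_shared, 3))
```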

Figure 12 shows the amount of the three variance components averaged across the analyses of each of the 34 emotions (see Appendix 7 for the results of the individual analyses). Most important are the findings that (a) the actual frequencies in week 1 uniquely explained more variance in vague quantifier ratings of week 1 than the actual frequencies in week 2, t(33) = 5.24, p < .01, and (b) the actual frequencies in week 2 uniquely explained more variance in vague quantifier ratings of week 2 than the actual frequencies in week 1, t(33) = 3.11, p < .01. This confirms the prediction of the familiarity model that people are sensitive to frequencies of emotions in different contexts.

In addition, the amount of uniquely explained variance by the target week (e.g., actual frequencies in week 1 for ratings of week 1) did not differ between ratings of week 1 and ratings of week 2, t(33) = 1.15, p = .25. There were also no significant differences between the ratings of the two weeks in the variance uniquely explained by the actual frequencies in the non-target week (e.g., actual frequencies in week 1 for ratings of week 2), t(33) = 0.66, p = .52, or the amount of shared variance, t(33) = 0.60, p = .56. This pattern of results shows that the participants were equally able to make accurate frequency estimates for the first and the second week of the diary study, and that the accuracy of the frequency judgments for the more remote first week was as good as for the more recent second week. Finally, the amount of variance uniquely explained by the non-target week was significantly different from zero for ratings of both weeks, both Fs(1,33) > 20.00, both ps < .01. This finding indicates that participants are not perfect in discriminating the frequencies of emotions experienced in the two weeks: Ratings of one week were also influenced by actual frequencies in the other week. This influence is, however, relatively small (see Figure 12).

In sum, the findings show a high ability of the participants to detect changes in the frequencies of their emotional experiences from week 1 to week 2. This finding is particularly noteworthy because it replicates an experimental finding (Hintzman & Block, 1971) under natural conditions over a much longer interval between the encoding of stimuli (i.e. the experience of emotions) and the moment when the frequency judgments were made. Furthermore, it demonstrates this effect for the first time with regard to discriminative accuracy across participants. Finally, the present design provides for a strong test of the hypothesis of context sensitivity because the actual frequencies in the two weeks were highly correlated (mean r = .71). Therefore, the changes in the emotion frequencies from one week to the other were relatively small. Nevertheless, the participants detected these changes, a finding that contradicts direct encoding models of frequency information. It also speaks against the hypothesis that frequency judgments of emotions are based on generalized beliefs or are pre-stored in memory.

4.3 Additional Analyses

4.3.1 Repeated Assessment of Vague Quantifier Ratings

In Study 1 the mean of the vague quantifier ratings decreased from pre- to post-diary judgments, whereas the absolute estimates increased. In Study 2 the participants made vague quantifier ratings twice prior to the diary study. Therefore, it could be explored whether the decrease also occurs when participants do not take part in a diary study between the two ratings. Vague quantifier ratings at each measurement point were first averaged across participants. Then, a repeated-measures ANOVA was computed across emotions with the within-subject factor time. A strong effect was obtained, F(3,33) = 62.50, p < .01, partial ε² = .65.

Follow-up analyses indicated that the mean decreased from the first to the second assessment and again to the post-diary assessment. The means of the two post-diary ratings were practically identical (Figure 13). Next, the prediction of Haubensak’s consistency model was tested, namely that decreasing means should be paralleled by a better approximation of the distribution of the actual frequencies of emotions, which should be positively skewed. The data support this prediction: The skewness of the actual frequencies was 1.50. The skewness of the first vague quantifier ratings was 0.45; it increased to 0.57 for the second pre-diary judgments, and further to 0.91 and 0.93 for the two post-diary ratings.
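
As an illustration of this check, the skewness values can be computed with a few lines of Python (simulated data; scipy’s sample skewness is used here, which may differ slightly from the formula used in the original analyses):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
actual = rng.gamma(shape=2.0, scale=3.0, size=34)          # positively skewed actual frequencies
ratings_time1 = np.sqrt(actual) + rng.normal(0, 0.5, 34)   # early ratings: distribution is less skewed

print(round(skew(actual), 2), round(skew(ratings_time1), 2))
```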

In sum, the decreasing mean of vague quantifier ratings in Study 1 has been replicated. The finding has been extended by showing an additional decrease from a second to a third assessment. Furthermore, additional analyses suggest that the consistency model (Haubensak, 1994) is a promising candidate for a theory that can explain this effect. Future research should try to test the consistency model more directly. A better understanding of the effect seems to be highly desirable because the decreasing mean of vague quantifier ratings has important practical implications (Knowles et al., 1996). Most importantly, it makes the interpretation of changes in pre-post designs (e.g. in therapy evaluation studies) very difficult. A better understanding of the effect might help to develop measurement instruments that minimize this effect.

4.3.2 Interrelations between the Frequencies of Pleasant and Unpleasant Emotions

In Study 1 it was found that the relation between frequency estimates of pleasant and unpleasant emotions depended on the response format: The absolute format produced positive correlations whereas the vague quantifier ratings produced negative correlations. In Study 1 the absolute estimates were made using a free response format. This response format might be especially susceptible to an extremity response style. In contrast, the participants in Study 2 made (daily) absolute estimates on a predefined response scale, which limited the range of responses. Therefore, the positive correlation should have disappeared if it were simply due to the free response format. However, despite the modified procedure, Study 2 replicated the finding of a high positive correlation between absolute frequency estimates of pleasant and unpleasant emotions (Table 12). In contrast, the significant negative correlations obtained for the vague quantifier ratings in Study 1 were not replicated: The two pre-diary estimates revealed non-significant correlations close to zero, whereas the post-diary estimates produced low, but significant, positive correlations.

As in Study 1, regression analyses were carried out to test the prediction that vague quantifier ratings of pleasant (unpleasant) emotions are influenced by the actual frequencies of unpleasant (pleasant) emotions. These analyses generally replicated the results of Study 1 (Table 13). Besides high correlations with the actual frequencies of emotions of the same valence, the vague quantifier ratings also showed negative correlations with the actual frequencies of emotions of the opposite valence.

As in Study 1, actual and estimated frequencies were averaged across all emotions to test whether people who experience on average more emotions also used higher vague quantifiers. Again, positive correlations were obtained: r = .36 for the first pre-diary judgments, r = .47 for the second pre-diary judgments, and r = .64 for the averaged post-diary estimates. Especially for the post-diary estimates, the correlation is higher than in Study 1 (r = .45). This finding indicates that vague quantifier ratings partly reflect percentages and partly reflect absolute frequencies of emotions. It seems that the participants in Study 2 used the vague quantifiers more to express absolute frequencies.

In sum, the results of Study 2 replicated a positive correlation between frequencies of pleasant and unpleasant emotions, and the finding that vague quantifier ratings mask this positive correlation because participants partly use vague quantifiers to indicate percentages and only partly to indicate absolute frequencies of emotional experiences. As noted before, this finding undermines the empirical support of prevailing theories of individual differences in emotional experiences (Costa & McCrae, 1992; Watson et al., 1988).

4.4 Discussion

Study 2 replicated several findings of Study 1: The discriminative accuracy across emotions was high, the discriminative accuracy across participants was in the same moderate range (r = .30 to .60), and the mean of vague quantifier ratings decreased over repeated assessments. On the other hand, the few personality- and mood-congruent biases obtained in Study 1 were not replicated, suggesting that these biases are not an important factor in frequency judgments of emotions. Study 2 also replicated the finding of a positive correlation between the frequencies of pleasant and unpleasant emotions, which, this time, was even supported by the post-diary vague quantifier ratings. Most importantly, Study 2 provided a first test between direct and indirect frequency judgment models.

Study 2 showed that participants were able to estimate accurately the frequency of emotions in two separate weeks, even after controlling for the highly correlated frequencies in the other week. This finding is incompatible with direct encoding models of frequency information, which assume that frequency information is constantly updated (cf. Hintzman & Block, 1971). According to these models, it should be impossible to estimate frequencies in a more remote time period independently of the frequencies in the recent past. Contrary to this prediction, participants were sensitive to differences in the emotion frequencies in the two weeks. This result suggests that the information about the frequency of emotional experiences is indirectly stored in memory, probably in the form of multiple memory traces of emotional episodes. Study 2, however, does not allow one to differentiate between the different indirect encoding models (see Figure 2). To this end, the next two studies were carried out.

5 STUDY 3

The aim of studies 3 and 4 was to study the cognitive processes underlying frequency judgments of emotions under more controlled conditions. To do so, real emotional experiences investigated in the previous studies were replaced by emotional reactions that participants would experience in hypothetical scenarios. That is, participants indicated their likely emotional reactions to a number of hypothetical scenarios. Subsequently, the number of times participants indicated that they would feel joy, anger, or gratitude was used as the measure of the actual frequencies of emotions. The advantage of this new way to determine the actual frequency of emotions is that actual frequencies can be objectively determined. This avoids the problem of the previous studies that the measure of the actual frequencies was based on frequency judgments. The disadvantage of this paradigm is clearly that the frequency judgments concern only hypothetical situations and not real life emotional events. Nevertheless, the approach is similar to the strategy of experimental psychologists to study frequency judgments of words denoting natural objects and to assume that the results generalize to frequency judgments of objects or events in real life (cf. Hasher & Zacks, 1984).

The use of hypothetical scenarios as stimulus material allows one to manipulate to a certain degree the frequencies of “emotions,” because people’s emotional reactions to some kinds of situations (e.g., the death of a loved one) are fairly universal (cf. Mesquita & Frijda, 1992). However, there are also individual differences in emotional reactions to the same situations, because people appraise situations differently (cf. Lazarus, 1991; Reisenzein & Hofmann, 1993). Therefore, the manipulation of emotion frequencies in the present studies is less reliable than the manipulation of the frequency of natural objects in previous experimental studies (e.g., Greene, 1989). For example, not everybody feels angry if he or she has to wait in line, but everybody agrees that a banana is a fruit. This “weakness” of the present approach can also be a strength when researchers want to study individual differences in the processing of emotional information under controlled conditions (cf. Schimmack & Hartmann, in press). In the present context, however, individual differences constitute error variance, as in experimental studies of frequency judgments in general (cf. Naveh-Benjamin & Jonides, 1986).

The standard paradigm used in the following two studies consists of an initial scenario rating task (SRT) and a subsequent frequency judgment task (FJT). In the SRT, participants indicated for several scenarios which emotions they would experience if they were in the described situation. These ratings allow the researcher to determine the actual frequencies of emotions, that is, the number of times that a respondent indicated that he or she would have experienced an emotion in a scenario. In the FJT, participants have to estimate how often they would have experienced various emotions in the situations of the SRT. These judgments can then be compared to the actual frequencies as defined above. As the following experiments were carried out on a personal computer, it was also possible to measure the judgment times of the frequency judgments.

On top of the SRT and the FJT, other tasks can be added to explore the cognitive processes underlying frequency judgments of emotions. In the following two studies, a latency-of-retrieval task (LRT) was added (Fitzgerald et al., 1988; MacLeod et al., 1994). In this task, participants were asked to recall, as fast as possible, one of the scenarios in which an emotion would have been experienced. This allows one to test the ease-of-retrieval model: If the ease-of-retrieval hypothesis is correct, higher frequency judgments should be related to shorter retrieval times (see hypothesis 6). Furthermore, the judgment times of the frequency judgments should be correlated with the retrieval latencies as well as with the size of the frequency judgments (hypothesis 7), and frequency judgments should take longer than the retrieval of scenarios from memory (hypothesis 8). In contrast, the familiarity model does not predict these effects.

5.1 Method

5.1.1 Participants

48 undergraduate students at the Free University Berlin participated in the study for course credit.

5.1.2 Material

5.1.2.1 Scenarios

Reisenzein and Hofmann (1993) asked 20 students at the Free University Berlin to report for each of 23 emotions one personal experience of this emotion. Subsequently, they asked a different sample of 51 participants to indicate which of the 23 emotions was most likely felt by the protagonist of each scenario. The authors found that for most scenarios the target emotion, that is, the emotion that had triggered the reported scenario, was recognized by the majority of the participants. For the present study, 12 of the 23 target emotions investigated by Reisenzein and Hofmann were selected: anger, anxiety, contempt, disappointment, disgust, embarrassment, gratitude, jealousy, joy, love, pride, and sadness. For each target emotion the ten scenarios with the highest recognition rate of the target emotion were selected, yielding a total of 120 scenarios. Each scenario was about 10 to 60 words long. The following example is a description of an anger experience:

A while ago, I bought some apples at the supermarket, because they were so cheap. At home, I found out that they were already rotten inside. I thought: “And this supermarket always advertises with its fresh fruits.”

For the present study, the selected 120 episodes were split into four sets of 30 episodes, and one of these four sets was presented to each participant. To manipulate the frequency of emotions between the four scenario sets, unequal numbers of scenarios of one target emotion were assigned to each set (see Table 14). It has to be noted, however, that this procedure allows only a rough manipulation of emotion frequencies between sets of scenarios because each scenario tends to elicit several emotions besides the target emotion (cf. Reisenzein, 1995). For example, jealousy scenarios often also elicit anger, disappointment, and sadness. Therefore, another aim of Study 3 was to determine the pattern of emotions that is elicited by each scenario. This would provide a better manipulation of emotion frequencies in Study 4 and other future studies.

5.1.2.2 Emotions

Although the scenarios most strongly elicited the 12 target emotions, it is likely that they also elicited various other emotions. To obtain ratings for a comprehensive list of emotions, 32 emotions were selected for the rating task. Besides the 12 target emotions, the 11 additional emotions studied by Reisenzein and Hofmann (1993) were also included, namely compassion, discontentment, envy, guilt, hope, helplessness, loneliness, regret, relief, shame, and surprise. Furthermore, contentment, depression, euphoria, hate, and hopelessness were included because they had been studied by Reisenzein (1995, Study 3; see also Schimmack & Reisenzein, in press) in related research. Rage was included upon request from participants in a small pilot study. Finally, the global descriptions “a pleasant feeling” and “an unpleasant feeling” were added to investigate frequency judgments of broad categories compared to those of specific emotions. For the scenario rating task, the 12 target emotions were split into two sets of 6 emotions each. Similarly, the 20 remaining emotions, labeled non-target emotions, were split into two sets of 10 emotions each (see Appendix 8 for the assignment of emotions to sets of target and non-target emotions). Each participant received only one set of target emotions and one set of non-target emotions in the SRT. In sum, the SRT was divided between participants according to a three-factorial design, with four sets of episodes, two sets of target emotions, and two sets of non-target emotions (Figure 14). However, analyses are not based upon the 4 x 2 x 2 design because the number of participants in each cell of the design is too small (N = 3). The purpose of this design was that an equal number of participants rated the intensity of each emotion in each of the four sets of scenarios. Because each scenario set was presented to 12 participants, and each participant rated the intensity of half of the target and half of the non-target emotions, in each set of scenarios six participants rated the intensity of the same emotion.

5.1.2.3 Rating Scale

In the SRT, participants were mainly asked to rate whether they would feel an emotion or not. However, the intensity of the emotional reactions appeared to be of interest as well, especially regarding other research questions (Schimmack & Hartmann, in press; Schimmack & Diener, in press). Therefore, the participants were also asked to indicate how intense their emotional reactions would be, provided that they experienced an emotion at all. Reisenzein (1995, Study 3) used two separate ratings to obtain independent information about the presence and intensity of an emotion. To simplify the judgment process, Schimmack (in press; Schimmack & Diener, in press) proposed to decompose a single rating on an intensity scale into information about the presence/frequency and intensity of an emotional reaction. One simply uses zero-responses as information that an emotion is not experienced; all non-zero responses then indicate the experience of an emotion. That is, it is proposed to use a dichotomization of intensity ratings into ratings equal to zero and those greater than zero as information about the absence versus presence of emotions. Summed across scenarios, the number of non-zero judgments for one emotion represents the actual frequency of this emotion. In the present study, a four-point intensity scale was used (0 = not at all, 1 = slightly, 2 = medium, 3 = high intensity).
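
A brief sketch of this scoring rule (Python, simulated ratings; array names are hypothetical) shows how presence/frequency and intensity are derived from a single rating per emotion and scenario:

```python
import numpy as np

rng = np.random.default_rng(4)
n_scenarios, n_emotions = 30, 16
# Intensity ratings on the 0-3 scale, one row per scenario, one column per emotion.
ratings = rng.integers(0, 4, size=(n_scenarios, n_emotions))

# A non-zero rating counts as one occurrence of the emotion.
actual_frequency = (ratings > 0).sum(axis=0)

# Mean intensity of the emotion on the occasions when it was experienced at all.
mean_intensity = ratings.sum(axis=0) / np.maximum(actual_frequency, 1)

print(actual_frequency[:5], mean_intensity[:5].round(2))
```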

5.1.3 Procedure

The experiment comprised several tasks that were implemented in a single computer program. First, the participants judged how intensely they would experience a selection of 16 emotions in one of the four sets of 30 scenarios. Afterwards, they estimated how frequently they would have experienced each of the 32 emotions in the set of scenarios of the SRT. That is, they made frequency judgments for the 16 emotions included in their SRT, labeled salient emotions, plus the 16 emotions that were not presented in their SRT, labeled non-salient emotions. Subsequently, in a latency-to-retrieve task, the participants recalled, as quickly as possible, one scenario in which they would have experienced the emotion that was presented as a retrieval cue. To reduce the length of the already strenuous experiment, only the 12 target emotions (Table 14) were used as retrieval cues. Because each participant had six target emotions in his or her SRT, this guaranteed that each participant retrieved scenarios for six salient and six non-salient emotions.

5.1.3.1 Scenario Rating Task

In the SRT, the participants were asked to imagine being in the described situation and to indicate how they would feel in the situation. For each emotion the participants were asked to consider first whether they would feel this emotion or not. Only if they would feel the emotion were they to consider the intensity of the emotion. After having read the instructions, participants pressed the return key to start the scenario rating task. The scenarios were displayed in the upper half of the screen and could be studied by the participants as long as they wanted. When the participants had sufficiently studied the scenario, they pressed the return key to start the rating task. After pressing the return key, the rating scale was displayed below the scenario description, which remained on the screen. The rating scale was split into two parts, with the zero-category on the left and all remaining categories on the right side, to increase the salience of the difference between zero and non-zero responses. Between the scenario description and the rating scale, the sentence “In this situation I would have felt …” followed by each of the 16 emotion words was displayed. The participants indicated their likely emotional reaction by pressing the appropriate number on the keypad. If they made an error, they could repeat the last entry using a special correction key. After all 16 emotions had been rated, the next scenario was displayed. New random sequences were generated by the computer for the 30 scenarios for each participant and for the 16 emotions for each scenario. The computer also measured the judgment times from the display of each emotion word to the intensity rating.

5.1.3.2 Frequency Judgment Task

After the SRT was completed, the participants were surprised with additional instructions that they would now be asked several questions concerning the scenarios presented in the SRT. Their first task, after the SRT, would be to estimate the absolute frequency with which various emotions had occurred in the previous episodes. For example, if they had rated anger to be present in five of the scenarios (i.e., if they had made five non-zero ratings in the SRT), then five would be the correct answer. Participants were not informed that they had read 30 scenarios; therefore it was possible for participants to make frequency judgments greater than 30. The participants were also informed that they should estimate the frequency of salient and non-salient emotions and that judgments of non-salient emotions (i.e., those not included in the previous SRT) were meaningful because these emotions could have been elicited by the scenarios even though the participants did not have to rate their intensity. For example, although a participant might not have been asked to rate the presence and intensity of disgust in the SRT, the assignment of scenarios to the four sets of scenarios made it very likely that each participant would have indicated experiencing disgust at least once (see Table 14). The participants were also informed that the frequency judgments had to be made within 10s, and that the next item would be presented automatically if they exceeded this time limit. A pilot study had shown that participants exceeded this time limit very rarely.

After reading the instructions, participants pressed the return key to start the computer-paced frequency judgment task. The 32 emotions were displayed in a different random sequence for each participant. With the display of each emotion word, a timer also appeared on the screen, counting the elapsed seconds. After 7s, the computer sounded a warning tone. After the participants had entered the first number, the computer recorded the time since the emotion word was presented. After entering the complete number, participants pressed the return key to continue with the task.

5.1.3.3 Latency-To-Retrieve Task

In the latency-to-retrieve task (LRT), the 12 target emotions appeared on the screen in a new random order for each participant. For each emotion, the participants were asked to recall, as quickly as possible, a scenario from the SRT. When they recalled one, they pressed the return key. Subsequently, they entered a keyword to describe the recalled scenario (e.g., apple for the example in 5.1.2.1). If a participant did not recall a scenario within 10s, the next emotion word was automatically presented on the screen. The computer recorded the retrieval latency from the presentation of an emotion word to the pressing of the return key.

5.2 Results

5.2.1 Preliminary Analyses

First, the actual frequencies of emotions in each of the four sets of scenarios were determined. To do so, the number of times a participant made a non-zero rating in one of his or her 30 scenarios was counted. Due to the experimental design, six participants rated the same emotion for the same set of scenarios. The frequencies of these six participants were averaged to determine the actual frequency of an emotion in each set of scenarios (see Appendix 8).
Correlations of the frequencies of emotions between the four sets of scenarios revealed that the attempted experimental manipulation of the frequencies was not very successful because the frequencies were highly intercorrelated (Table 15). That is, emotions that were frequent in one set of scenarios also tended to be frequent in the other sets of scenarios.

5.2.2 Relative Accuracy

First, the number of times participants failed to make a frequency judgment within 10s was determined. This happened only 5 out of 1536 times. In these cases, the missing frequency judgment was set to zero and the missing judgment time was replaced by the maximum judgment time (10s).

As in Study 1, the relative accuracy of the absolute estimates was tested. Figure 15 shows the estimates of the 32 emotions plotted against the actual frequencies in each of the four scenario sets. As in Study 1, the regression slope of the estimates indicates that the actual frequencies were underestimated in all four sets of scenarios. The relative accuracy scores (estimated – actual frequencies) for all four sets of scenarios are negative (Set 1 d = -6.29, Set 2 d = -2.41, Set 3 d = -4.20, Set 4 d = -3.96) and significantly different from zero (all Fs > 20.00, ps < .01), which quantifies the trend toward underestimation.

As in Study 1, the actual frequencies of emotions were correlated with the relative accuracy score of each emotion. If frequent emotions are underestimated more strongly than infrequent ones, a negative correlation between the actual frequencies and the relative accuracy scores is expected. This prediction was confirmed in all four sets of scenarios (rs =-.88, -.61, -.72, -.80, all ps < .01). In sum, Study 3 replicated the finding in Study 1 that absolute estimates underestimate the actual frequencies of emotions and that they do so increasingly with increasing frequency of occurrence.
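
A minimal sketch of these two checks (Python, simulated data) is given below, assuming the relative accuracy score is computed as estimated minus actual frequencies, so that negative values indicate underestimation as in the text; variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
actual = rng.uniform(2, 25, 32)                    # actual frequencies of the 32 emotions in one set
estimated = 0.6 * actual + rng.normal(0, 2, 32)    # judgments that underestimate high frequencies

relative_accuracy = estimated - actual             # negative values indicate underestimation
mean_d = relative_accuracy.mean()
r = np.corrcoef(actual, relative_accuracy)[0, 1]   # negative r: frequent emotions underestimated more

print(round(mean_d, 2), round(r, 2))
```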

5.2.3 Discriminative Accuracy across Emotions

Because the participants did not rate the intensity of all 32 emotions in the SRT, actual frequencies were not available for all emotions at the individual level. Therefore, only analyses at the group level were possible. For these analyses, the frequency judgments of the 12 participants who rated the same set of scenarios were averaged. The interrater agreement between the 12 participants in each condition was determined using Shrout and Fleiss’s (1979) intra-class coefficient (ICC[2,k]). The interrater agreement for the four sets ranged from ICC[2,12] = .60 to .74. Furthermore, the estimated frequencies of emotions were correlated between the four sets of scenarios (Table 16), which is expected because the actual frequencies of emotions in the four sets of scenarios were also correlated (Table 15).
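
For reference, ICC(2,k) can be computed from the two-way ANOVA mean squares as described by Shrout and Fleiss (1979); the following sketch (Python, simulated judgments; function and variable names are hypothetical) shows the computation for an emotions-by-raters matrix.

```python
import numpy as np

def icc_2k(x):
    """ICC(2,k): two-way random effects, absolute agreement, average of k raters
    (Shrout & Fleiss, 1979)."""
    n, k = x.shape                      # n targets (emotions), k raters (participants)
    grand = x.mean()
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    bms = ss_rows / (n - 1)                                      # between-targets mean square
    jms = ss_cols / (k - 1)                                      # between-raters mean square
    ems = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)

rng = np.random.default_rng(6)
true_freq = rng.uniform(2, 25, 32)                           # 32 emotions in one scenario set
judgments = true_freq[:, None] + rng.normal(0, 4, (32, 12))  # 12 raters per set
print(round(icc_2k(judgments), 2))
```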

Table 17 shows the correlations between the actual and estimated emotion frequencies in the four sets of scenarios. First of all, the correlations are high, indicating general agreement between actual frequencies of emotions and frequency judgments. However, a stronger test of discriminative accuracy across emotions would require that the frequency judgments of participants who rated a particular set of scenarios are more highly correlated with the actual frequencies of this set than with those of a different set of scenarios. Table 17 shows that this was the case for all four sets of scenarios (note that this implies a comparison only along the rows of Table 17, but not necessarily also along the columns).

In sum, Study 3 replicates the finding of Study 1 that frequency judgments discriminate between actual frequencies of emotions. It was more difficult to demonstrate sensitivity to the particular frequencies in a specific set of scenarios. This was very likely due to the high correlations of the actual frequencies between sets of scenarios (Table 15) and the moderate interrater agreement of the frequency judgments (Table 16). A stronger test of sensitivity to experimentally manipulated frequencies of emotions seems desirable. This test was carried out in Study 4 which allowed a better manipulation of emotion frequencies on the basis of the SRT data obtained in this study.

5.2.4 Influence of the Salience of an Emotion on Frequency Estimates

As in Study 1, the effect of the salience of emotion concepts at the time of encoding on the subsequent frequency estimates was examined. In the present study, some emotions were salient, because they were included in the SRT, whereas others were not salient, because they were presented for the first time during the FJT. In the following analyses, the differences between scenario sets are ignored because the sets were varied orthogonally to the sets of emotion words. Therefore, differences in frequency judgments of salient and non-salient emotions cannot be attributed to the presentation of different scenarios. For each emotion, the frequency estimates of the 24 participants for whom the emotion was salient were compared to the frequency estimates of the 24 participants for whom the emotion was not salient (see Figure 14). In 31 of the 32 comparisons the frequency estimate was higher when the emotion was salient. In an analysis across emotions, the mean frequency estimate in the salient condition (M = 7.21, SD = 1.92) was significantly higher than the mean frequency estimate in the non-salient condition (M = 4.61, SD = 2.28), t(31) = 10.23, p < .01. This finding replicates the salience effect obtained in Study 1.

This effect can also be seen in Figure 16. The figure also shows that, in contrast to Study 1, the slope of the regression line was not steeper in the salient condition. This implies that the salient condition produced higher estimates in general, but not for the more frequent emotions in particular. Indeed, the actual frequencies were not significantly correlated with the difference score between frequency estimates in the salient and the non-salient condition, r = -.24, p = .19. The divergent findings could be due to (a) the fact that Study 1 comprised more participants, (b) the use of a within-subject design in Study 1 and a between-subject design in Study 3, or (c) the different manipulations of salience. This inconsistency should, however, not obscure the main finding of this analysis, namely that frequency judgments were generally higher in the salient than in the non-salient condition. This finding was predicted in Hypothesis 5 and is inconsistent with direct encoding models of frequency information.

5.2.5 Testing the Ease-of-Retrieval Model

It has been proposed that people rely on the retrieval of exemplars to estimate frequencies (Tversky & Kahneman, 1973). One version of the retrieval-based models, the ease-of-retrieval model, was explicitly tested in this study in several ways. First, the relation between frequency judgments and retrieval times in the LRT was examined (cf. MacLeod et al., 1994). Second, the relation between the size and the speed of the frequency judgments was explored. And finally, the speed of the frequency judgments was compared with the retrieval latencies in the LRT. In the following analyses, the data were averaged across all 48 participants to increase the reliability of the variables. This was justifiable because of the high correlations of the actual frequencies between scenario sets (Table 15). The frequency estimates showed a high internal consistency, ICC[2,48] = .92, whereas the internal consistency of the averaged judgment times was only moderate, ICC[2,48] = .42. Contrary to the prediction of the ease-of-retrieval model, the judgment times were not significantly correlated with the size of the frequency judgments (r = .22, p = .22). The correlation was even slightly in the direction opposite to the one predicted by the ease-of-retrieval model. The direction of the correlation is more consistent with the recall-estimate model (see Brown, 1995); however, as it is not significant, it does not support this alternative retrieval-based model either.

For the next analysis, the retrieval latencies of the 12 target emotions were averaged across participants. The interrater agreement for these latencies was ICC[2,48] = .66. The latencies were then correlated with the frequency estimates across the 12 target emotions. The correlation with the retrieval latencies failed to be significant (r = .43, p = .17) and was again in the wrong direction. The positive correlation does not support the recall-estimate model either, because in the latency-to-retrieve task participants were asked to retrieve only one exemplar. Therefore, the longer retrieval times do not indicate counting of several exemplars (see Brown, 1995). Third, the average retrieval latencies were compared with the average times needed for the frequency judgments; for this analysis only the judgment times of the 12 target emotions were used, because only target emotions were used in the LRT. If frequency estimates are based on information about the speed of retrieval, frequency estimates should take at least as long as the retrieval of a single scenario. However, the mean retrieval latency (M = 3.56s, SD = 0.57) was longer than the mean time needed to make a frequency judgment (M = 3.16s, SD = 0.18), t(11) = 2.91, p < .05. This finding is incompatible with any retrieval-based model, whether the ease-of-retrieval or the recall-estimate model. In sum, the analyses support hypotheses 6 to 8: retrieval latencies and judgment times were independent of frequency judgments, and the retrieval of a single exemplar required more time than the complete frequency judgment process.

5.3 Discussion

The results of Study 3 replicated several earlier findings obtained in the field studies of real emotional experiences. In Studies 1 and 3 the actual frequencies of emotions were underestimated, especially those of frequent emotions. Nevertheless, the frequency judgments in both studies revealed discriminative accuracy across emotions (Study 3 did not allow a test of discriminative accuracy across participants). Finally, both studies showed that the frequencies of salient emotions were estimated to be higher than the frequencies of non-salient emotions. The only difference was that in Study 3 frequent emotions did not benefit more from the salience manipulation than infrequent emotions, which was the case in Study 1.

Besides replicating the results of the field studies in a more controlled setting, Study 3 also provided several new findings. First, frequency judgments were not related to the times needed to make these judgments. Furthermore, the retrieval of a single scenario needed more time than the complete frequency judgment process. This finding contradicts retrieval-based frequency judgment models, both the recall-estimate theory (Brown, 1995; Meudell, 1971; Watkins & LeCompte, 1991) and the ease-of-retrieval model (Schwarz et al., 1991). Also contrary to the predictions of the ease-of-retrieval model was the finding that latencies in the LRT were unrelated to the frequency judgments. In sum, the direct encoding models cannot explain the salience effect, whereas the retrieval-based models cannot account for the speed of the frequency judgments. This leaves the familiarity model as the only model that is compatible with the present data.

Three possible objections can be raised against the present findings. First, the judgment and retrieval times might not have been measured accurately, which is suggested by the low consistency of the measures across participants. However, given that the retrieval times were measured by a computer to the nearest millisecond, it remains unclear how the participants themselves would be able to distinguish differences in retrieval times more accurately than the computer, and such an ability is needed to use retrieval latencies as information about the frequencies of emotions. Only if one notices that one retrieved a joy scenario faster than an envy scenario can one judge the frequency of joy to be higher than the frequency of envy. A second objection could be that the ease of retrieval is conceptually different from the speed of retrieval. That is, people do not base their judgments on the speed of retrieval but on a feeling of ease that is separate from and unrelated to the speed of retrieval. Although such a modified ease-of-retrieval model is logically possible, it would have to specify (a) how the feeling of ease is generated and (b) why it is unrelated to the latency of retrieval. Straightforward answers to these questions are not in sight. Finally, one might object that the manipulation of emotion frequencies was only partly successful and that the high correlation across scenario sets was due to the fact that emotions which are frequent in everyday life were also frequent in all four scenario sets. As a consequence, the participants may have relied on generalized beliefs about the frequencies of different emotions when making their frequency estimates. However, the analyses of the different sets of scenarios suggested that participants were also sensitive to differences in the frequencies of emotions between sets. Nevertheless, a stronger demonstration of sensitivity to experimentally manipulated frequencies would be needed to rule out this hypothesis. This was attempted in Study 4, which also served the purpose of replicating the findings of Study 3.

6 STUDY 4

A main aim of Study 4 was to replicate the findings of Study 3. In addition, Study 4 was designed to study how individual differences in a repressive way of coping influence the encoding and retrieval of emotion memories; the results bearing on this issue are reported elsewhere (Schimmack & Hartmann, in press; see 7.3.1 for a brief summary). As a consequence, all participants rated the same set of scenarios with regard to the same set of emotions. Furthermore, the selected scenarios elicited mainly unpleasant emotions. This had the advantage that the frequencies of emotions differed from the frequencies of emotions in everyday life and from the frequencies of emotions in Study 3. Hence, Study 4 provides a stronger test of participants’ sensitivity to experimentally manipulated frequencies of emotions.

6.1 Method

6.1.1 Participants

61 undergraduate psychology students (14 male, 47 female) at the Free University Berlin participated in the study for course credit.

6.1.2 Material and Procedure

The SRT included 25 negative and 5 positive scenarios. 16 emotion words (13 unpleasant and 3 pleasant) were selected for the rating task. The rating scale was changed to a 7-point scale so that individual differences in the intensity of emotional reactions could be detected more easily, which was important for different research questions (cf. Schimmack & Diener, in press; Schimmack & Hartmann, in press). The response categories were labeled “not”, “very slightly”, “slightly”, “medium”, “strongly”, “very strongly”, and “extremely strongly” and were scored from 0 to 6. As in Study 3, participants were instructed that only a zero-rating implies the complete absence of an emotion, whereas all remaining response categories imply its presence, although with different degrees of intensity. The frequency judgment task was identical to the one in Study 3 and the same 32 emotions were used. The LRT was identical to that used in Study 3. However, a different set of 10 emotions was used as retrieval cues, including five salient and five non-salient emotions.

6.2 Results

6.2.1 Absolute Accuracy

In all following analyses the actual frequencies are based on the SRT ratings in Study 3. This had the advantage that Study 3 provided actual frequencies for all 32 emotions included in the frequency judgment task. Furthermore, the actual frequencies of salient and non-salient emotions are both based on ratings of a different group of participants. Absolute accuracy was determined separately for the salient and the non-salient emotions. A significant difference was obtained in that frequency judgments of salient emotions (mean SD = 8.09) were more accurate than those of non-salient emotions (mean SD = 9.41), t(60) = 4.99, p < .01. This finding replicates Study 1.

6.2.2 Relative Accuracy

Figure 17 shows that participants in Study 4 again underestimated the actual frequencies of emotions. Across all emotions, the relative accuracy was d = -5.99, which is significantly different from zero, F(1,31) = 54.97, p < .01. Figure 17 also shows that the frequencies of frequent emotions were underestimated more strongly than those of infrequent emotions. This is also evident in the correlation between actual frequencies and the relative accuracy score, r = -.90, p < .01. In sum, Study 4 replicates the previous finding that people underestimate the frequency of emotions and that they do so especially for frequent emotions.

6.2.3 Discriminative Accuracy across Emotions

One aim of Study 4 was to demonstrate that participants are sensitive to experimentally manipulated frequencies of emotions. Therefore, it is important to demonstrate that the selection of scenarios in Study 4 yielded frequencies of emotions that are independent of emotion frequencies in real life. For the 29 overlapping emotions between Studies 2 and 4, the correlation between the actual frequencies of emotions was r = -.21, p = .29. As a consequence, discriminative accuracy across emotions in the present study cannot be attributed to generalized beliefs about the frequencies of emotions.

For the analysis at the group level, the frequency judgments of all participants were averaged. The correlation between actual frequencies and frequency estimates was r = .66, p < .01. This correlation is rather low compared to the values in the previous studies. One explanation could be that the present analysis included salient and non-salient emotions. Figure 17 already shows that the frequency estimates of non-salient emotions were lower than those of the salient emotions. Therefore, mixing salient and non-salient emotions can attenuate the present correlation. As a consequence, separate correlations were computed across the 16 salient and the 16 non-salient emotions. The correlation for the salient emotions was indeed higher, r = .85, p < .01, whereas the correlation across the 16 non-salient emotions was lower, r = .63, p < .01. The difference between the two correlations also suggests that salience increased the discriminative accuracy across emotions.

The analysis at the individual level was carried out separately for the 16 salient and the 16 non-salient emotions. For the salient emotions the discriminative accuracy across emotions (mean r = .45) was significantly higher than for the non-salient emotions (mean r = .36), F(1,60) = 5.53, p < .05. Because the actual frequencies are based on ratings of a different sample, this finding suggests that salience also increased the discriminative accuracy across emotions.

6.2.4 Influence of the Salience of an Emotion on Frequency Estimates

The following analysis attempts to replicate the finding of Studies 1 and 3 that salience at the time of encoding increases frequency judgments. Figure 17 already suggests that this was also true in Study 4. In the present study, all participants rated the same emotions in the SRT. Therefore, the analysis had to be carried out across emotions. To control for differences in the actual frequencies between salient and non-salient emotions, the actual frequencies of the emotions were used as a covariate. The analysis of variance revealed a highly significant effect of salience, F(1,29) = 23.61, p < .01. A comparison of the predicted means shows that participants judged the frequency of non-salient emotions to be lower (M = 5.18) than the frequency of salient emotions (M = 8.28). It was also tested whether salience boosted especially the frequency judgments of frequent emotions. Figure 17 already suggests that this was not the case, because the slopes of the regression lines for salient and non-salient emotions were similar. To test this hypothesis quantitatively, a median split of the actual emotion frequencies was carried out. Then, frequency judgments were used as the dependent variable in an ANOVA with the factors actual frequency (high vs. low) and salience. If frequent emotions benefit more from the salience manipulation, the interaction should be significant, but the ANOVA did not confirm this prediction, F(1,28) < 1, p > .50. This finding is consistent with Study 3, where the same salience manipulation also failed to affect especially the frequent emotions, but it is inconsistent with Study 1, where participation in a diary study increased especially the frequency estimates of frequent emotions.
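The ANCOVA and the median-split interaction test described above could be set up as in the following Python sketch using statsmodels. The data frame, column names, and simulated numbers are assumptions for illustration; this is not the original analysis code.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 32  # emotions included in the frequency judgment task
df = pd.DataFrame({
    "actual": rng.integers(1, 20, n),          # actual frequency of each emotion
    "salient": np.repeat([1, 0], n // 2),      # salience manipulation (1 = salient)
})
df["judgment"] = 0.4 * df["actual"] + 3.0 * df["salient"] + rng.normal(0, 2, n)

# ANCOVA: effect of salience on frequency judgments, controlling for actual frequency.
ancova = smf.ols("judgment ~ C(salient) + actual", data=df).fit()
print(anova_lm(ancova, typ=2))

# Median split of actual frequency; the salience x frequency interaction tests
# whether salience boosts especially the judgments of frequent emotions.
df["freq_group"] = (df["actual"] > df["actual"].median()).astype(int)
interaction = smf.ols("judgment ~ C(salient) * C(freq_group)", data=df).fit()
print(anova_lm(interaction, typ=2))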

One explanation for this pattern of results is that the different salience manipulations influence different stages of the frequency judgment process (Brown, 1995). It might be that participation in the diary study increased participants’ awareness of the absolute number of emotional experiences. Therefore, they converted the same familiarity signal into higher absolute frequencies than they did before the diary study. In contrast, rendering particular emotions salient does not influence the range of the absolute frequencies; it only boosts the familiarity signal of the salient emotions, which then receive higher absolute estimates within the same range of absolute frequencies as the non-salient emotions.

6.2.5 Testing the Ease-of-Retrieval Model

The interrater agreement for the frequency estimates was excellent, ICC[2,61] = .96. Because of the larger sample size, the judgment times were also more consistent across participants than in Study 3, ICC[2,61] = .61. Nevertheless, replicating the finding of Study 3, the size of the frequency judgments was unrelated to the speed of these judgments (r = -.19, p = .30).
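The agreement values reported here are intraclass correlations for averaged ratings. As a minimal sketch, an ICC(2,k) coefficient can be computed from an emotions-by-participants matrix with the conventional two-way random effects, average-measures (absolute agreement) formula, as in the following Python fragment. The data are simulated and the function name is an assumption for illustration.

import numpy as np

def icc_2k(x):
    # ICC(2,k): two-way random effects, absolute agreement, average of k raters.
    # x is a targets-by-raters matrix (here: emotions by participants).
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

rng = np.random.default_rng(3)
true_freq = rng.integers(1, 20, 32)                           # 32 emotions
judgments = true_freq[:, None] + rng.normal(0, 4, (32, 61))   # 61 participants as raters
print(round(icc_2k(judgments), 2))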

The second test of the ease-of-retrieval hypothesis used the latencies in the LRT. As in Study 3, the retrieval latencies were averaged across participants. This time, the interrater agreement was excellent (ICC[2,61] = .95). The average latencies were then correlated with the frequency estimates across the 10 emotions included in the latency-to-retrieve task. The correlation with the frequency estimates was significant and consistent with predictions of the ease-of-retrieval model (r = -.76, p < .05). However, the next result indicates that this support is more apparent than real. As in Study 3, the average retrieval latencies were compared to the average speed of the frequency judgments. Again, the mean retrieval latency was significantly longer (M = 4.94, SD = 1.40) than the time needed to make a frequency judgment (M = 3.23, SD = 0.10), t(9) = 3.96, p < .01. Hence, it is not possible that the frequency estimates are based on retrieval processes. The finding that the speed of frequency judgments is unrelated to the size of these judgments, and that these judgments are made faster than the retrieval of scenarios, contradicts the ease-of-retrieval hypothesis. It is instructive that this contradictory evidence was obtained concurrently with a high negative correlation between frequency judgments and latencies in the LRT. This demonstrates that the significant negative correlations between frequency judgments and latencies in an LRT obtained in previous studies (MacLeod et al., 1994; Fitzgerald et al., 1988) do not indicate that the frequency judgments were based on the ease of retrieval.

It is also instructive to look at the speed of frequency judgments and the retrieval times for the very rare emotion regret. On average, participants made a frequency judgment within 3.22s. In contrast, the average latency in the LRT was 8.13s. In addition, these long latencies are partly due to the responses of 42 participants who were unable to retrieve a regret scenario within 10s (so that their retrieval latencies were set to 10s). These 42 participants were able to make a frequency judgment within the allotted time of 10s even though they could not recall a single scenario within this time limit. As a consequence, ease of retrieval cannot explain the frequency judgments of these participants, because the ease-of-retrieval hypothesis assumes that at least one exemplar was retrieved. One could try to rescue the ease-of-retrieval model by arguing for a two-stage process. For example, a fast recognition process could inform the participant whether an emotion occurred at all. If so, an exemplar is actually retrieved from memory and the frequency is estimated following the ease-of-retrieval model. If the recognition signal suggests that no exemplar can be retrieved, the frequency judgment is zero. Such a two-stage model is not very parsimonious, because Hintzman and Curran (1994) showed that recognition judgments are based on the same familiarity signal that is assumed to underlie frequency judgments. Therefore, the initial recognition process already supplies the frequency information that the additional ease-of-retrieval heuristic is supposed to provide. In sum, the analysis replicated two findings of Study 3: the judgment times of frequency judgments are not related to the size of the judgments, and frequency judgments are made faster than the time needed to retrieve a single scenario. These findings are damaging for retrieval-based frequency judgment models. The finding that latencies in the retrieval task were significantly related to the frequency judgments does not rescue the ease-of-retrieval model. Rather, it demonstrates that the same finding in other studies does not provide evidence for a causal role of ease of retrieval in the frequency judgment process.

6.3 Discussion

Study 4 replicated many of the earlier findings. As in studies 1 and 3, participants underestimated the absolute frequencies of emotions, especially those of frequent emotions. Furthermore, increasing the salience of emotions at the time of encoding increased frequency judgments. As in Study 3, salience did not increase especially the frequency estimates of frequent emotions. Most importantly, Study 4 replicated the findings of Study 3 that (a) the times needed to make frequency judgments were unrelated to the magnitude of the judgments and (b) frequency judgments were made faster than the retrieval of a single scenario. Therefore, frequency judgments cannot be based on information about the retrieval of scenarios, although Study 4 found a significant negative correlation between frequency judgments and latencies in the LRT.

7 GENERAL DISCUSSION

The two main topics of the dissertation, namely (a) the accuracy of frequency judgments of emotions and (b) the cognitive processes underlying these judgments, are discussed separately.

7.1 The Accuracy of Frequency Judgments of Emotions

Four types of accuracy were differentiated: (a) absolute and (b) relative accuracy as well as discriminative accuracy (c) across emotions and (d) across participants. The results bearing on each of the types of accuracy are discussed next.

7.1.1 Absolute Accuracy

Absolute accuracy was only explored in studies 1 and 4, because the participants in Study 2 did not make absolute estimates, and Study 3 lacked an appropriate standard of comparison. However, studies 1 and 4 both showed that frequency judgments of emotions are not very accurate in an absolute sense. This finding is inconsistent with a direct encoding of the frequencies of emotions. On the other hand, estimation errors are to be expected when participants use heuristics to make the absolute estimates. The finding in studies 1 and 4 that absolute accuracy was higher for salient compared to non-salient emotions indicates that salience can increase the accuracy of frequency judgments of emotions. An increase in accuracy due to salience has also been observed in other studies of frequency judgments (Naveh-Benjamin & Jonides, 1986). Brown and Siegler (1993) pointed out that absolute accuracy is sensitive to two types of estimation errors: (a) errors in the estimation of the distribution of the actual frequencies and (b) errors in the estimation of the level of the absolute frequencies. Therefore, the salience effect on absolute accuracy can be due to an influence on either (or both) of these error sources. These possibilities are explored in the following sections.

7.1.2 Relative Accuracy

Relative accuracy refers to the question of how well the absolute level of frequency judgments reflects the absolute level of the actual frequencies, or in other words, whether people over- or underestimate the actual frequency of their emotions. Studies 1, 3, and 4 (Study 2 did not allow addressing this question) all showed the predicted effect that frequency estimates underestimated the actual frequencies of emotions. Two objections might be raised against this finding. In the two field studies, actual frequencies were based on the sum of repeated frequency estimates, whereas the retrospective judgments consisted of a single estimate for the whole time period. Fiedler and Armbruster (1994) demonstrated that splitting a single frequency judgment of one category into two frequency judgments of two sub-categories produced different frequency estimates: the sum of the two estimates was higher than the frequency judgment of the whole category. Therefore, one might argue that the repeated daily estimates overestimate the actual frequencies. This explanation of the effect in Study 1 encounters several difficulties. First, underestimation was also found in studies 3 and 4, where actual frequencies were not based on split frequency estimates. Second, Fiedler and Armbruster’s results did not show overestimation for the split-category judgments; rather they showed that split judgments prevented categories from being underestimated. Therefore, the sum of the split estimates was more accurate than the single judgments of a whole category.

Hence, the category-split effect supports, rather than contradicts, the current interpretation that the frequency estimates for the whole diary period underestimate the actual frequencies of emotions. Finally, underestimation is also prevalent in experimental studies of frequency judgments (cf. Watkins & LeCompte, 1991; Williams & Durso, 1986) in which the actual frequencies were objectively determined.

A second objection could be that the participants changed the meaning of the emotion words between the daily estimates and the frequency estimates for the three-week period (Schwarz, Strack, Müller, & Chassein, 1988): for short time periods even very mild experiences of an emotion are counted, whereas for longer time periods participants may consider only severe experiences of the emotion. Again, this objection cannot explain underestimation in studies 3 and 4, where participants were explicitly told that frequency judgments should reflect all scenarios in which the emotion was rated to be present, irrespective of intensity. Nevertheless, the actual frequencies were underestimated in the frequency judgment task. In sum, the present results provide strong support for the hypothesis that people underestimate the frequency of their emotions.

A related expectation was that underestimation should increase with the actual frequency of an emotion, which is also a common finding in the frequency judgment literature (Mingay et al., 1994; Watkins & LeCompte, 1991). Again, all studies that allowed a test of this prediction confirmed it. This finding has important practical implications when frequency judgments of emotions are used to measure subjective well-being. Often researchers compute a difference score between the frequencies of pleasant and unpleasant emotions. This hedonic-balance score is then used as a measure of subjective well-being. The problem with this index is that it underestimates the well-being of those people who experience more pleasant than unpleasant emotions, because the more frequent pleasant emotions are underestimated more than the less frequent unpleasant emotions. Similarly, the index underestimates the unhappiness of those people who experience unpleasant emotions more frequently than pleasant emotions.
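A purely hypothetical numeric illustration of this compression, with made-up frequencies and assumed underestimation rates, in Python:

# Hypothetical illustration: stronger underestimation of the more frequent
# category compresses the hedonic-balance score.
actual_pleasant, actual_unpleasant = 40, 20        # assumed actual frequencies
judged_pleasant = 0.5 * actual_pleasant            # frequent -> underestimated more (20)
judged_unpleasant = 0.75 * actual_unpleasant       # infrequent -> underestimated less (15)
print(actual_pleasant - actual_unpleasant)         # actual hedonic balance: 20
print(judged_pleasant - judged_unpleasant)         # judged hedonic balance: 5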

Another noteworthy finding in studies 1, 3 and 4 was that making certain emotions salient increased the estimated frequencies of these emotions. This finding also has implications in many applied settings. For example, several psychotherapies are likely to increase the salience of emotional experiences. This could produce an expected increase in the reported frequencies of pleasant emotions, and an unexpected increase in the reported frequencies of unpleasant emotions, without any changes in the actual frequencies of emotional experiences. Because salience could affect the comparison of pre- and post-treatment measures, evaluation studies of treatment effects should include control groups in which emotional experiences are made salient.

Study 1 also indicated that salience increased especially the frequency estimates of the more frequent emotions. This was evident in a steeper slope of the regression line for post-diary estimates. In contrast, in studies 3 and 4 the regression line for salient emotions was elevated but not steeper, indicating no special influence of salience on frequent emotions. The discrepant findings may be due to the fact that the different salience manipulations influenced different stages of the frequency judgment process (Brown, 1995). To judge frequencies of emotions, participants first have to construct a range of plausible frequencies. Then, they can assign frequencies to emotions by mapping the strength of the familiarity signal onto the frequency scale. The pre-post design in Study 1 very likely influenced participants’ beliefs about the plausible range of emotion frequencies. Making frequency judgments each day, they noticed that they experienced more emotions than they had believed before the diary study. Hence, they increased the upper limit of the frequency scale for the post-diary judgments. Similarly, Brown demonstrated that manipulations of participants’ beliefs about the range of plausible frequencies changed the slope of the regression line.

This effect is different from the salience effect observed in studies 3 and 4, which showed that frequency judgments were higher for emotions that were made salient compared to emotions that were not made salient. It is plausible that this salience manipulation influenced the first stage of the frequency judgment process. Salient emotions appeared to be more familiar and therefore were rated to be more frequent than non-salient, unfamiliar emotions. Nevertheless, the familiarity feelings of salient and non-salient emotions were mapped onto the same range of absolute frequencies during the second stage of the judgment process, which leads to the observed differences in the level, but not in the slope, of the regression lines in studies 3 and 4.

The distinction between two stages in the frequency judgment process (Brown, 1995) also has practical implications. The familiarity model predicts that people’s feeling of familiarity has high discriminative accuracy across emotions. However, it does not allow straightforward predictions of relative accuracy and discriminative accuracy across participants, because these two types of accuracy also depend on the second stage, in which the feeling of familiarity is converted into an absolute estimate. A better understanding of this conversion process might help to improve the measurement of emotion frequencies. One could, for example, try to assist respondents in their selection of an appropriate range of frequencies. Blair and Williamson (1994) discussed the merits and pitfalls of providing participants with population norms of frequencies (e.g., on average, people go to church once every three months) to increase the relative accuracy of frequency estimates. This procedure could improve the accuracy of frequency estimates if participants have information about their relative standing on the relevant dimension (e.g., “I go to church much less frequently than the average person”). With regard to internal states such as emotions, it is unlikely that people have accurate knowledge of how they compare to others in the frequency of emotional experiences. To conclude, the conversion of frequency information (i.e., a feeling of familiarity) into an absolute estimate is an important topic for future research, not only from a theoretical (Brown, 1995) but also from a practical point of view (Blair & Williamson, 1994).
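The two-stage account discussed above can be made concrete in a schematic Python sketch. The functional form, the parameters, and the assumed familiarity function are illustrative assumptions, not a fitted model: widening the plausible range (as the diary study may have done) steepens the regression of estimates on actual frequencies, whereas boosting the familiarity of particular emotions (as the salience manipulation in studies 3 and 4 may have done) raises its level.

import numpy as np

def frequency_estimate(actual, salience_boost=0.0, range_upper=20.0):
    familiarity = np.log1p(actual) + salience_boost   # stage 1: familiarity signal
    scale = np.log1p(60.0)                            # assumed familiarity at the range ceiling
    return range_upper * familiarity / scale          # stage 2: map familiarity onto the range

actual = np.array([2, 5, 10, 20, 40])
print(frequency_estimate(actual))                        # baseline estimates
print(frequency_estimate(actual, salience_boost=0.5))    # level shifts up, slope unchanged
print(frequency_estimate(actual, range_upper=40.0))      # slope becomes steeper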

7.1.3 Discriminative Accuracy Across Emotions

The present set of studies also explored the discriminative accuracy across emotions, that is, the question of how well frequency judgments discriminate the actual frequencies of different emotions. This question is relevant for several issues in research on emotions. First, hierarchical models of the structure of emotions (Oatley & Johnson-Laird, 1987; Shaver, Schwartz, Kirson & O’Connor, 1987) predict that emotions on higher levels of the hierarchy are experienced more frequently than emotions lower in the hierarchy. For example, sadness should be experienced more frequently than disappointment because disappointment is assumed to be a subtype of sadness. To test these predictions of structural models of emotions, one needs accurate information that discriminates between the frequencies of different emotions. Why certain emotions are experienced more frequently than others also seems to be an interesting topic for future research on emotions. For example, why is anxiety in general experienced more frequently than hate? Any emotion theory that explains how emotions are elicited should eventually explain differences between emotions in their frequencies of occurrence.
Discriminative accuracy across emotions was good in all four studies, and excellent when the data were first aggregated across participants. This finding is consistent with frequency judgments in other domains. Indeed, the claim that frequency judgments are very accurate, which has led some theorists to propose direct encoding models of frequency information (Hasher & Zacks, 1984), is predominantly based on findings of high discriminative accuracy across stimuli. However, even this type of accuracy was influenced by salience (studies 1 and 4), which contradicts the direct-encoding models. Furthermore, Study 1 also showed that the type of response format influenced this type of accuracy. At the individual level, absolute estimates discriminated more accurately between the frequencies of emotions, presumably because vague quantifier ratings forced participants to assign the same frequency category to several emotions, although they were able to discriminate between the frequencies of these emotions.
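The response-format point can be illustrated with a small simulation: forcing otherwise identical estimates into a handful of ordered categories, as vague quantifiers do, creates ties and tends to lower the correlation with the actual frequencies. The numbers and the binning scheme below are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(4)
actual = rng.integers(0, 30, 32)                      # actual frequencies of 32 emotions
absolute_estimate = actual + rng.normal(0, 4, 32)     # noisy absolute estimates
category_rating = np.digitize(absolute_estimate, bins=np.linspace(0, 30, 6))  # 7 ordered categories

print(np.corrcoef(actual, absolute_estimate)[0, 1])   # discriminative accuracy, absolute estimates
print(np.corrcoef(actual, category_rating)[0, 1])     # typically somewhat lower after binning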

7.1.4 Discriminative Accuracy across Participants

For many applied settings, the last type of accuracy is most important, namely discriminative accuracy across participants, that is, the question of how well retrospective frequency judgments reflect individual differences in the actual frequencies of emotions. The first two studies provide highly similar answers to this question, despite the fact that (a) the participants were from different nations and (b) the daily ratings and the frequency judgments were obtained with slightly different methods. In both studies the discriminative accuracy was between r = .30 and r = .60. An exact estimate is difficult because frequency judgments made after daily frequency ratings overestimate this type of accuracy, whereas judgments made before the daily frequency ratings underestimate it.

The present set of studies also provided some valuable results that rule out artifact explanations. First, the frequency estimate for a single emotion was often correlated most highly with the actual frequency of this emotion and not with those of other emotions. Furthermore, frequency judgments made for two separate weeks were more highly correlated with the actual frequencies in the target week than with those in the alternative week. This pattern of results rules out a simple response set explanation of discriminative accuracy across participants. Furthermore, it contradicts the contention that frequency judgments are simply based on generalized beliefs. In particular, generalized beliefs cannot explain the context sensitivity of the frequency judgments. Furthermore, studies 1 and 2 provided little support for the hypotheses that frequency judgments are systematically biased by the self-concept or the current mood of the participants: neither emotion-related personality traits nor the current mood at the time of the frequency judgments appeared to have a consistent effect on the frequency judgments. In a similar vein, Schimmack and Hartmann (in press; see also Cutler, Larsen, & Bunce, 1996) investigated whether people with a repressive coping style, that is, people who are assumed to repress unpleasant feelings, underestimate the frequencies of their unpleasant emotions. Although so-called repressors reported experiencing unpleasant emotions less frequently when they were confronted with emotional scenarios (in a scenario rating task; see 5.1.3.1), their frequency estimates in the subsequent frequency judgment task were not biased. For unpleasant emotions, repressors’ lower frequency judgments correctly reflected the lower number of endorsements of unpleasant emotions in the scenario rating task. In sum, the search for personality dimensions that predict a systematic bias in frequency judgments of emotions has been unsuccessful. Nevertheless, it is possible that frequency judgments of emotions are influenced by systematic biases that remained undetected in the previous studies.

7.2 The Cognitive Processes underlying Frequency Judgments of Emotions

How do people judge the frequency of emotions? In the present dissertation four models were compared with each other: (a) the direct encoding model, (b) the recall-estimate model, (c) the ease-of-retrieval model, and (d) the familiarity model. Several results of the studies were incompatible with the direct encoding model. Perhaps the most important result was that participants in Study 2 were able to accurately estimate the frequencies of emotions in the first and second week of the diary period. This means that the frequencies were estimated at the time of retrieval. Additional evidence against the direct encoding model was that the salience of emotions at the time of encoding increased frequency judgments (studies 1, 3, and 4). According to the direct encoding model, emotional experiences should automatically activate emotion concepts and modify the frequency counter (Alba et al., 1980), independently of the salience of the concept. The same type of evidence has been used to challenge direct encoding models in other domains (Greene, 1989). The present studies show that the direct encoding model cannot explain frequency judgments of emotions either.

The present dissertation also challenges the ease-of-retrieval model. If frequency judgments were actually based on information about the ease of retrieval, higher frequency judgments should be made faster. Contrary to this prediction, studies 3 and 4 did not find a relation between the speed and the size of frequency judgments. Furthermore, in Study 4 many participants were able to judge the frequency of a very rare emotion (regret), although they were unable to retrieve a single scenario in which this emotion occurred. In addition, latencies in a separate retrieval task were related to frequency judgments only in Study 4, but not in Study 3. Nevertheless, the frequency judgments in both studies possessed discriminative accuracy across emotions. Probably the most damaging finding was that in studies 3 and 4 the frequency judgments were made faster than the time needed to retrieve a single scenario. Therefore, retrieval of exemplars to a conscious level is simply too slow to explain the fast (and accurate) frequency judgments. The same line of reasoning has been used to dismiss retrieval-based models in related research on metamemory (Metcalfe, 1993; Reder, 1987). The last finding contradicts not only the ease-of-retrieval model, but also other retrieval-based models, such as the recall-estimate model.

The only model that is compatible with all the present findings is the familiarity model. With regard to emotions, this model assumes that frequency questions activate multiple memory traces of previous emotional experiences simultaneously. As a consequence, memory returns a direct signal that reflects how many traces have been activated. This signal is experienced as a sense of familiarity. Like the other indirect encoding models, the familiarity model predicts that participants can differentiate frequencies in different contexts, such as week 1 and week 2 in the diary study, if the context variable is sufficiently encoded in memory (see Barsalou & Ross, 1986). Furthermore, it predicts that the salience of emotions at the time of encoding increases frequency judgments because salience strengthens memory traces, which results in a greater echo intensity (Hintzman, 1988). The familiarity model does not predict a relation between the size of a frequency judgment and the time needed to make it, and no such relation was obtained; the absence of this relation therefore does not contradict the model. To conclude, the familiarity model seems to be the best candidate for a theory of frequency judgments of emotions. This conclusion should not be generalized to frequency judgments in other domains. Retrieval-based estimation strategies can be used and apparently are used under certain conditions (Brown, 1995; Menon, 1994).
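To make the multiple-trace idea behind the familiarity model concrete, the following Python sketch loosely follows Hintzman’s (1988) echo-intensity computation: every experience leaves a (possibly degraded) trace, a probe activates all traces in parallel, and the summed cubed similarities yield the familiarity signal. The feature coding, the encoding parameter, and the trace counts are illustrative assumptions, not a claim about how emotion episodes are actually represented.

import numpy as np

rng = np.random.default_rng(5)
n_features = 40

def store_trace(prototype, encoding_strength=0.7):
    # Store a noisy copy of an experience; stronger encoding (higher salience)
    # preserves more features of the episode.
    keep = rng.random(n_features) < encoding_strength
    return np.where(keep, prototype, 0)

def echo_intensity(probe, traces):
    # Summed cubed similarity of the probe to all stored traces.
    sims = np.array([np.dot(probe, t) / n_features for t in traces])
    return float(np.sum(sims ** 3))

anger = rng.choice([-1, 1], n_features)   # feature vector cued by the word "anger"
joy = rng.choice([-1, 1], n_features)     # feature vector cued by the word "joy"

# Twelve stored anger episodes and three stored joy episodes.
memory = [store_trace(anger) for _ in range(12)] + [store_trace(joy) for _ in range(3)]

# The emotion word in the frequency question serves as the probe; the echo intensity
# (felt familiarity) is higher for the more frequently experienced emotion.
print(echo_intensity(anger, memory), echo_intensity(joy, memory))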

To some readers, the superior performance of the familiarity model might appear to be due to the selection of paradigms, which mainly tested and disconfirmed predictions made by the competing models but tested only a few predictions that follow from the familiarity model. Although a disconfirmation of the familiarity model would not rescue the other models, such tests are an important topic for future research. One prediction that follows from the familiarity model is, for example, that the feeling of familiarity should be influenced by the presence of similar exemplars in memory (Hintzman, Curran, & Oppy, 1992; Jones & Heit, 1993). For example, frequency estimates of eating carp should be inflated by memories of eating trout. With regard to emotions, this would imply that the presence of memories in which a person experienced disappointment but not anger should nevertheless increase the echo intensity of anger, because disappointment episodes share some features with anger episodes.

A second prediction based on the familiarity model is that frequency judgments of very similar events should be higher than those of the same number of dissimilar events (Hintzman and Stern, 1978), because similar memories produce a stronger feeling of familiarity (but see Brown, 1995 for an alternative explanation). Therefore, the experience of similar anger episodes (e.g., always directed at one’s romantic partner) should lead to higher frequency judgments than the experience of anger in different contexts (e.g., directed at boss, partner, friends, and strangers).
7.3 Influence of the Response Format on Frequency Judgments of Emotions

The present dissertation bridged two research traditions: studies on the validity of self-reports of emotional experiences and experimental studies of the cognitive processes underlying frequency judgments. The preferred response format in the former tradition is vague quantifier ratings, whereas the latter tradition preferred absolute estimates. Furthermore, a survey study which included both response formats obtained divergent results in the comparison of the two formats (Schaeffer, 1991). This stimulated the idea to use both response formats within the same study (Study 1). Highly surprising and interesting results were obtained. The most interesting effect was that vague quantifier ratings decreased from pre- to post-diary judgments, whereas at the same time the absolute estimates increased. The decrease in vague quantifier ratings was replicated in Study 2. Similar findings have been reported in the literature (Knowles et al., 1996), but the effect is still not understood, although it has important practical implications. For example, in therapy evaluation studies, patients often have to report the frequency of their emotions before and after treatment, commonly by means of vague quantifier ratings. The present study suggests that changes in these measures can be expected not only because of treatment effects but also because of changes in the use of the rating scale. Because the ratings tend to drop, a questionnaire assessing mainly unpleasant affect could indicate a positive treatment effect, that is, a decrease in the reported frequency of unpleasant emotional experiences, even if the treatment did not influence the actual frequencies of emotional experiences.

A second important difference between the two response formats was that averaged judgments of pleasant emotions and averaged judgments of unpleasant emotions were positively correlated for the absolute estimates, but not for the vague quantifier ratings, for which the correlations were sometimes negative, sometimes positive, and sometimes close to zero. Whereas the practical implications are discussed below, the cognitive processes underlying these effects of the response format are discussed first.

First, one might ask during which stage of the frequency judgment process the effects occur (Brown, 1995). That is, do the different response formats influence the generation of frequency information (e.g., a familiarity signal), or do they influence the conversion of this information into a response? It is likely that the influence of the response formats occurs during the second stage. During this stage, the participants are faced with the task of determining a reasonable range of absolute frequencies onto which the feeling of familiarity can be mapped. This part of the estimation process is likely to be difficult and error prone. The vague quantifier ratings do not require that the participants derive an absolute standard. The participants can simply map the different degrees of familiarity onto the categories of the response scale. This seems to suggest that vague quantifier ratings should be preferred. However, if participants use vague quantifier ratings simply to indicate the relative strength of their feeling of familiarity, the ratings can no longer be compared across participants. A rating in the highest category by one participant might reflect a very different absolute frequency than the same rating made by another participant. Therefore, vague quantifier ratings also do not solve the problem of how frequency information can be converted into a response that is comparable across participants. Furthermore, range-frequency theory has shown that even the assignment of absolute numbers printed on a sheet of paper is influenced by context effects such as the distribution of the numbers (Parducci & Wedel, 1986). Similar effects were obtained in the present study for vague quantifier ratings of emotion frequencies, but not for the absolute estimates. This finding would favor the absolute estimates. Finally, Study 1 demonstrated that absolute estimates and vague quantifier ratings possessed the same amount of discriminative accuracy across participants when the judgments were made after the diary study; before the diary study, however, the vague quantifier ratings outperformed the absolute estimates. In sum, it is not possible to recommend one of the two response formats over the other. Future research on the judgment process might help to reduce judgment errors and might ultimately allow a rational choice of the best response format. Until then, a viable research strategy is to use both response formats, because each one is associated with different errors. As a consequence, the combined application of two short questionnaires with both response formats would produce more valid results than a long questionnaire with only one of the two response formats (Green et al., 1993).

7.4 The Structure of Individual Differences in the Frequencies of Pleasant and Unpleasant Emotional Experiences

The structure of individual differences in the frequencies of pleasant and unpleasant emotional experiences was not a central issue of the present investigations. However, important results were obtained that challenge current models of the personality structure of emotions. Currently, researchers assume that the frequencies of pleasant and unpleasant experiences of affect are independent (cf. Bradburn, 1969) or negatively correlated (Green et al., 1993). Furthermore, influential personality theories predict the frequencies of pleasant and unpleasant emotions to be independent (Costa & McCrae, 1992; Meyer & Shack, 1989; Watson & Clark, 1992), presumably because pleasant and unpleasant emotions are generated in different areas of the brain.

Studies 1 and 2 replicated previous results in that low correlations were obtained with the traditional response format, namely vague quantifier ratings. However, high positive correlations were obtained for absolute frequency estimates. Furthermore, regression analyses suggested that vague quantifier ratings produce an artifact, because respondents use them partly to judge percentages of their emotional experiences and only partly to judge absolute frequencies of experienced emotions. As a consequence, individual differences in the overall number of emotional experiences are obscured and the correlation between the frequencies of pleasant and unpleasant emotions becomes negative. This invalidates the conclusion of previous studies that the frequencies of pleasant and unpleasant emotional experiences are independent. The present study suggests that a person who experiences pleasant emotions often also experiences unpleasant emotions often. This finding is consistent with a study by Schimmack and Diener (in press), which also demonstrated a positive correlation between frequencies of pleasant and unpleasant emotions derived from repeated ratings of emotional events in everyday life. Furthermore, the positive correlation between pleasant and unpleasant emotions is consistent with the positive correlation obtained for the number of pleasant and unpleasant events that people encounter in their lives (Suh et al., 1996).
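The argument can be illustrated with a simple simulation: when people differ in their overall number of emotional episodes, absolute-frequency-like measures of pleasant and unpleasant emotions correlate positively across persons, whereas ratings that partly track the percentage of pleasant versus unpleasant experiences pull the correlation toward zero or below. All distributions and weights are made up for illustration.

import numpy as np
from scipy.stats import zscore

rng = np.random.default_rng(6)
n = 200
emotionality = rng.gamma(shape=4, scale=10, size=n)   # overall number of emotional episodes
share_pleasant = rng.beta(6, 4, size=n)               # person-specific share of pleasant episodes

pleasant = emotionality * share_pleasant              # absolute pleasant frequency
unpleasant = emotionality * (1 - share_pleasant)      # absolute unpleasant frequency

# Absolute estimates preserve overall emotionality: positive correlation.
print(np.corrcoef(pleasant, unpleasant)[0, 1])

# Ratings that partly reflect percentages discard overall emotionality:
# the correlation drops and tends toward negative values.
mixed_pleasant = 0.7 * zscore(share_pleasant) + 0.3 * zscore(pleasant)
mixed_unpleasant = 0.7 * zscore(1 - share_pleasant) + 0.3 * zscore(unpleasant)
print(np.corrcoef(mixed_pleasant, mixed_unpleasant)[0, 1])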

It should be noted, however, that this finding is limited to the frequency of emotions. It is likely that it does not hold for moods. Considering only the frequency with which a person is in a pleasant or an unpleasant mood, it is very likely that the two frequencies are negatively correlated, because (a) a person is nearly always in either a pleasant or an unpleasant state, and (b) at any moment in time feelings of pleasure and displeasure rarely co-occur. As a consequence, it is a logical necessity that the frequencies of pleasant and unpleasant moods are highly negatively correlated, a fact which is sometimes obscured by measurement error (Green et al., 1993).
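Stated compactly, and only as an illustration under the simplifying assumption that every one of T sampled moments in a fixed observation period is classified as either pleasant or unpleasant: f_unpleasant = T - f_pleasant, and therefore corr(f_pleasant, f_unpleasant) = corr(f_pleasant, T - f_pleasant) = -1 across persons. Observed correlations are less extreme only to the extent that neutral moments and measurement error enter the data.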

In sum, evidence is growing that the frequencies of pleasant and unpleasant emotions are positively correlated, whereas the number of times a person is in a pleasant mood is inversely related to the number of times he or she is in an unpleasant mood. Although the results of the present study should not be regarded as a final answer to this question, the divergent findings for the two response formats underscore the need to understand the processes underlying frequency judgments of emotions before these measures can be used to answer fundamental questions about the causes and the structure of individual differences in the frequency with which people experience emotions.

8 OUTLOOK

In closing, I would like to discuss two important questions for future research, namely (a) individual differences in the accuracy of frequency judgments of emotions and (b) how the frequency of emotions should be assessed from a normative point of view.

8.1 Individual Differences in the Accuracy of Frequency Judgments of Emotions

In the present studies, accuracy scores of individuals were averaged to estimate the general accuracy of frequency judgments of emotions. However, the accuracy scores varied across participants. An important topic of future research would be to explore (a) whether these differences in accuracy are systematic, (b) whether it will be possible to assess a person’s level of accuracy, and (c) whether a person’s accuracy is related to existing constructs in the emotion literature, such as affect intensity (Larsen & Diener, 1987; Schimmack & Diener, in press), emotional intelligence (Mayer, DiPaolo, & Salovey, 1990), or alexithymia (Taylor, 1984).
People who experience emotions more intensely than others might tend to overestimate the frequency of their emotions or, in light of the consistent trend towards underestimation, at least underestimate it less, because intense emotional experiences are more memorable (Rapaport, 1942; Holmes, 1970). In contrast, people with high alexithymia scores might severely underestimate the frequency of their emotions, because they have problems labeling their emotional experiences; the present studies found that labeling experiences increased frequency judgments when the same label was part of the frequency question. Finally, “emotionally intelligent” persons can be expected to be more accurate in their self-perceptions than others, because they pay more attention to their emotional experiences. However, it is also possible that traditional personality measures do not capture biases in the memory representation of emotions very well. If systematic individual differences in the accuracy of frequency judgments of emotions exist but are not captured by existing measures, new questionnaires would have to be developed to measure these differences.

8.2 Toward a Normative Assessment of the Frequency of Emotions

The present studies explored how people judge the frequency of emotions. An equally important question is how the frequency of emotions can be measured with the highest degree of accuracy by means of an economical research instrument at one moment in time. This question arises because on-line recording over a long time period is simply not a viable option in most assessment situations, although it would be the best strategy from a normative perspective. The present studies suggest that people rely on a sense of familiarity when they judge the frequency of emotions and that they do not use an ease-of-retrieval or a recall-estimate heuristic. However, the fact that most people do not use these strategies does not imply that they cannot be used. To the contrary, people can recall individual episodes (Fitzgerald et al., 1988) and people can judge the ease of retrieval (Schwarz et al., 1991). As a consequence, an important question for future research is whether these strategies would lead to better estimates of the frequency of emotions. For example, in two studies, Means, Swan, Jobe, and Esposito (1994) asked participants to record the number of smoked cigarettes for a period of five days. Afterwards, the participants were asked to estimate the number of cigarettes smoked on one of these five days. Furthermore, they were instructed to use one of four strategies, namely (a) to use any strategy they wanted, (b) to provide a spontaneous estimate without thinking of particular instances of smoking, (c) to think of different contexts (in the office, after dinner) and then to sum these separate estimates, or (d) to try to recall as many instances as possible. In this study, the recall of exemplars appeared to be a better measure than the spontaneous estimation strategy, in which judgments were probably based on a familiarity signal. However, different findings might be obtained for frequency judgments of emotions, especially when the time period is longer than one day. To address this question, one needs a measure of the actual frequencies of emotions as a validation criterion that is not biased in favor of any of the estimation strategies under investigation. Both the diary and the scenario rating task could be used to assess actual frequencies. Initially, the SRT is preferable because it is more economical than a diary study. Ultimately, however, it is necessary to compare the different estimation strategies against actual frequencies of emotions in real life.

9 REFERENCES

Alba, J. W., Chromiak, W., Hasher, L., & Attig, M. S. (1980). Automatic encoding of category size information. Journal of Experimental Psychology: Human Learning and Memory, 6, 370-378.
Andreasen, N. C., & Black, D. W. (1991). Lehrbuch Psychiatrie [Textbook on Psychiatry]. Weinheim: Beltz.
Barsalou, L. W., & Ross, B. H. (1986). The roles of automatic and strategic processing in sensitivity to superordinate and property frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 116-134.
Blair, E., & Burton, S. (1987). Cognitive processes used by survey respondents to answer behavioral frequency questions. Journal of Consumer Research, 14, 280-288.
Blair, E., & Williamson, K. (1994). On providing population data to respondents. In N. Schwarz & S. Sudman (Eds.), Autobiographical memory and the validity of retrospective reports (pp. 173-186). New York: Springer.
Blaney, P. H. (1986). Affect and memory: A review. Psychological Bulletin, 99, 229-246.
Borkenau, P., & Ostendorf, F. (1991). Ein Fragebogen zur Erfassung fünf robuster Persönlichkeitsfaktoren [A questionnaire for the assessment of five robust personality factors]. Diagnostica, 37, 29-41.
Bower, G. H. (1981). Mood and memory. American Psychologist, 36, 129-148.
Bradburn, N. M. (1969). The structure of psychological well-being. Chicago: Aldine.
Brewin, C. R., & Andrews, B., & Gotlib, I. H. (1993). Psychopathology and early experience: A reappraisal of retrospective reports. Psychological Bulletin, 113, 82-98.
Briggs, J. L. (1970). Never in anger. Cambridge, MA: Harvard University.
Briggs, J. L. (1987). In search of emotional meaning. Ethos, 15, 8-15.
Brown, N. R. (1995). Estimation strategies and the judgment of event frequency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 21, 1539-1553.
Brown, N. R., & Siegler, R. S. (1993). Metrics and mappings: A framework for understanding real-world quantitative estimation. Psychological Review, 100, 511-534.
Bruce, D., Hockley, W. E., & Craik, F. I. M. (1991). Availability and category-frequency estimation. Memory and Cognition, 19, 301-312.
Clore, G. L. (1994). Why emotions require cognition. In P. Ekman & R. J. Davidson (Eds.), The nature of emotion (pp. 181-191). New York: Oxford University Press.
Cohen, D. (1996). Law, social policy, and violence: The impact of regional cultures. Journal of Personality and Social Psychology, 70, 961-978.
Costa, P. T., & McCrae, R. R. (1980). Influence of extraversion and neuroticism on subjective well-being: Happy and unhappy people. Journal of Personality and Social Psychology, 38, 668-678.
Costa, P. T., & McCrae, R. R. (1992). The revised NEO personality inventory (NEO-PI R) professional manual. Odessa, FL: Psychological Assessment Resources.
Cutler, S. E., Larsen, R. J., & Bunce, S. C. (1996). Repressive coping style and the experience and recall of emotion: A naturalistic study of daily affect. Journal of Personality, 64, 379-405.
Diener, E. (1984). Subjective well-being. Psychological Bulletin, 95, 542-575.
Diener, E., & Diener, M. (1995). Cross-cultural correlates of life satisfaction and self-esteem. Journal of Personality and Social Psychology, 68, 653-663.
Diener, E., Diener, M., & Diener, C. (1995). Factors predicting the subjective well-being of nations. Journal of Personality and Social Psychology, 69, 851-864.
Diener, E., & Iran-Nejad, A. (1986). The relationship in experience between different types of affect. Journal of Personality and Social Psychology, 50, 1031-1038.
Diener, E., & Larsen, R. J. (1984). Temporal stability and cross-situational consistency of affective, behavioral, and cognitive responses. Journal of Personality and Social Psychology, 47, 871-883.
Diener, E., Larsen, R. J., & Emmons, R. A. (1984). Bias in mood recall in happy and unhappy persons. Paper delivered at the 92nd Annual Meeting of the American Psychological Association, Toronto, August 1984.
Diener, E., Larsen, R. J., Levine, S., & Emmons, R. A. (1985). Intensity and frequency: The underlying dimensions of positive and negative affect. Journal of Personality and Social Psychology, 48, 1253-1265.
Diener, E., Sandvik, E., & Pavot, W. (1991). Happiness is the frequency, not the intensity, of positive versus negative affect. In F. Strack, M. Argyle, & N. Schwarz (Eds.), Subjective well-being (pp. 119-139). Oxford: Pergamon Press.
Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69, 130-141.
Ekman, P., & Davidson, R. J. (Eds.). (1994). The nature of emotion. New York: Oxford University Press.
Emmons, R. A., & Diener, E. (1986a). A goal-affect analysis of everyday situational choices. Journal of Research in Personality, 20, 309-326.
Emmons, R. A., & Diener, E. (1986b). Influence of impulsivity and sociability on subjective well-being. Journal of Personality and Social Psychology, 50, 1211-1215.
Epstein, S. (1983). A research paradigm for the study of personality and emotions. In M. M. Page (Ed.), Personality – Current theory and research: 1982 Nebraska symposium on motivation (pp. 91-154). Lincoln: University of Nebraska Press.
Fehr, B., & Russell, J. A. (1984). Concept of emotion viewed from a prototype perspective. Journal of Experimental Psychology: General, 113, 464-486.
Feldman Barrett, L. (in press). The relationships among momentary emotional experiences, personality descriptions, and retrospective ratings of emotion. Personality and Social Psychology Bulletin.
Fiedler, K. (1991). The tricky nature of skewed frequency tables: An information loss account of distinctiveness-based illusory correlations. Journal of Personality and Social Psychology, 60, 24-36.
Fiedler, K., & Armbruster, T. (1994). Two halfs may be more than one whole: Category-split effects on frequency illusions. Journal of Personality and Social Psychology, 66, 633-645.
Fiske, S. T., & Taylor S. E. (1984). Social cognition. Reading, MA: Addison-Wesley.
Fitzgerald, J. M., Slade, S., & Lawrence, R. (1988). Memory availability and judged frequency of affect. Cognitive Therapy and Research, 12, 379-390.
Frijda, N. H., Ortony, A., Sonnemans, J., & Clore, G. L. (1992). The complexity of intensity: Issues concerning the structure of emotion intensity. In M. S. Clark (Ed.), Review of Personality and Social Psychology: Emotion (Vol. 13, pp. 60-89). Newbury Park, CA: Sage.
Gabrielcik, A., & Fazio, R. H. (1984). Priming and frequency estimation: A strict test of the availability heuristic. Personality and Social Psychology Bulletin, 10, 85-89.
Green, D. P., & Goldman, S. L., & Salovey, P. (1993). Measurement error masks bipolarity in affect ratings. Journal of Personality and Social Psychology, 64, 1029-1041.
Greene, R. L. (1989). On the relationship between categorical frequency estimation and cued recall. Memory and Cognition, 17, 235-239.
Hanson, C., & Hirst, W. (1988). Frequency encoding of token and type information. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 289-297.
Hasher, L., & Zacks, R. T. (1979). Automatic and effortful processes in memory. Journal of Experimental Psychology: General, 108, 356-388.
Hasher, L., & Zacks, R. T. (1984). Automatic processing of fundamental information. American Psychologist, 39, 1372-1388.
Hastie, R. & Park, B. (1986). The relationship between memory and judgment depends on whether the judgment task is memory-based or on-line. Psychological Review, 93, 258-268.
Haubensak, G. (1994). Wie entsteht der Häufigkeitseffekt in absoluten Urteilen? [On the origin of the frequency effect in absolute judgments]. Zeitschrift für experimentelle und angewandte Psychologie, 16, 378-397.
Hintzman, D. L. (1988). Judgments of frequency and recognition memory in a multiple trace memory model. Psychological Review, 95, 528-551.
Hintzman, D. L., & Block, R. A. (1971). Repetition and memory: Evidence for a multiple-trace hypothesis. Journal of Experimental Psychology, 88, 297-306.
Hintzman, D. L., & Curran, T. (1994). Retrieval dynamics of recognition and frequency judgments: Evidence for separate processes of familiarity and recall. Journal of Memory and Language, 33, 1-18.
Hintzman, D. L., Curran, T., & Oppy, B. (1992). Effects of similarity and repetition on memory: Registration without learning? Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 667-680.
Hintzman, D. L., & Stern, L. D. (1978). Contextual variability and memory for frequency. Journal of Experimental Psychology: Human Learning and Memory, 4, 539-549.
Hofstede, G. (1980). Culture’s consequences. Beverly Hills, CA: Sage.
Holmes, D. S. (1970). Differential change in affective intensity and the forgetting of unpleasant personal experiences. Journal of Personality and Social Psychology, 15, 234-239.
Howell, W. C. (1973). Representation of frequency in memory. Psychological Bulletin, 80, 44-53.
Isen, A. M. (1985). Asymmetry of happiness and sadness effects on memory in normal college students: Comments on Hasher, Rose, Zacks, Sanft, and Doren. Journal of Experimental Psychology: General, 114, 388-391.
Izard, C. E., Libero, D. Z., Putnam, P., & Haynes, O. M. (1993). Stability of emotion experiences and their relations to traits of personality. Journal of Personality and Social Psychology, 64, 847-860.
James, W. (1884). What is emotion? Mind, 9, 188-205.
Jones, C. M., & Heit, E. (1993). An evaluation of the total similarity principle: Effects of similarity on frequency judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 799-812.
Jonides, J., & Jones, C. M. (1992). Direct coding for frequency of occurrence. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 368-378.
Keppel, G. (1991). Design and analysis: A researcher’s handbook (3rd edition). Englewood Cliffs, NJ: Prentice-Hall.
Knowles, E. S., Coker, M. C., Scott, R. A., Cook, D. A., & Neville, J. W. (1996). Measurement-induced improvement in anxiety: Mean shifts with repeated assessment. Journal of Personality and Social Psychology, 71, 352-363.
Larsen, R. J. (1992). Neuroticism and selective encoding and recall of symptoms: Evidence from a combined concurrent-retrospective study. Journal of Personality and Social Psychology, 62, 480-488.
Larsen, R. J., & Diener, E. (1987). Affect intensity as an individual difference characteristic: A review. Journal of Research in Personality, 21, 1-39.
Larsen, R. J., & Diener, E. (1992). Promises and problems with the circumplex model of emotion. In M. S. Clark (Ed.), Review of personality and social psychology: Emotion (Vol. 13, pp. 25-59). Newbury Park, CA: Sage.
Lazarus, R. S. (1991). Emotion and adaptation. New York: Oxford University Press.
Lewinsohn, P. M., & Rosenbaum, M. (1987). Recall of parental behavior by acute depressives, remitted depressives and nondepressives. Journal of Personality and Social Psychology, 52, 611-619.
MacLeod, A. K., & Andersen, A., & Davies, A. (1994). Self-ratings of positive and negative affect and retrieval of positive and negative affective memories. Cognition and Emotion, 8, 483-488.
Manis, M., Shedler, J., Jonides, J., & Nelson, T. E. (1993). Availability heuristic in judgments of set size and frequency of occurrence. Journal of Personality and Social Psychology, 65, 448-457.
Markus, H. R., & Kitayama, S. (1994). The cultural construction of self and emotion: Implications for Social Behavior. In S. Kitayama & H. R. Markus (Eds.), Emotion and Culture (pp. 89-130). Washington, DC: APA.
Martin, M. (1985). Neuroticism as predisposition toward depression: A cognitive mechanism. Personality and Individual Differences, 6, 353-365.
Matthews, G., Jones, D. M., & Chamberlain, A. G. (1990). Refining the measurement of mood: The UWIST Mood Adjective Checklist. British Journal of Psychology, 81, 17-42.
Mayer, J. D., & DiPaolo, & Salovey, P. (1990). Perceiving affective content in ambiguous visual stimuli: A component of emotional intelligence. Journal of Personality, 54, 772-781.
Means, B., Swan, G. E., Jobe, J. B., & Esposito, J. L. (1994). The effects of estimation strategies on the accuracy of respondents’ reports of cigarette smoking. In N. Schwarz & S. Sudman (Eds.), Autobiographical memory and the validity of retrospective reports (pp. 107-120). New York: Springer.
Menon, G. (1994). Judgments of behavioral frequencies: Memory search and retrieval strategies. In N. Schwarz & S. Sudman (Eds.), Autobiographic memory and the validity of retrospective reports (pp. 161-172). New York: Springer.
Mesquita, B., & Frijda, N. H. (1992). Cultural variations in emotions: A review. Psychological Bulletin, 112, 176-204.
Metcalfe, J. (1993). Novelty monitoring, metacognition and control in a composite holographic associative recall model: Implications for Korsakoff amnesia. Psychological Review, 100, 3-22.
Meudall, P. R. (1971). Retrieval and representations in long-term memory. Psychonomic Science, 23, 295-296.
Meyer, G. J., & Shack, J. R. (1989). Structural convergence of mood and personality: Evidence for old and new directions. Journal of Personality and Social Psychology, 57, 691-706.
Mingay, D. J., Shevell, K., Bradburn, N. M., & Ramirez, C. (1994). Self and proxy reports of everyday events. In N. Schwarz & S. Sudman (Eds.), Autobiographical memory and the validity of retrospective reports (pp. 235-250). New York: Springer.
Naveh-Benjamin, M., & Jonides, J. (1986). On the automaticity of frequency encoding: Effects of competing task load, encoding strategy, and intention. Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 378-386.
Nelson, T. O. (1988). Predictive accuracy of the feeling of knowing across different criterion tasks and across different subject populations and individuals. In M. M. Gruneberg, P. E. Morris, & R. N. Sykes (Eds.), Practical aspects of memory: Current research and issues (Vol. 1, pp. 190-196). New York: Wiley.
Oatley, K., & Johnson-Laird, P. N. (1987). Towards a cognitive theory of emotions. Cognition and Emotion, 1, 29-50.
Parducci, A. (1968). The relativism of absolute judgments. Scientific American, 219, 84-90.
Parducci, A., & Wedell, D. H. (1986). The category effect with rating scales: Number of categories, number of stimuli, and method of presentation. Journal of Experimental Psychology: Human Perception and Performance, 12, 496-516.
Parkinson, B., Briner, R. B., Reynolds, S., & Totterdell, P. (1995). Time frames of mood: Relations between momentary and generalized ratings of affect. Personality and Social Psychology Bulletin, 21, 331-339.
Parrott, W. G., & Sabini, J. (1990). Mood and memory under natural conditions: Evidence for mood incongruent recall. Journal of Personality and Social Psychology, 59, 321-326.
Pavot, W., & Diener, E. (1993). Review of the Satisfaction With Life Scale. Psychological Assessment, 5, 164-172.
Pavot, W., Diener, E., & Fujita, F. (1990). Extraversion and happiness. Personality and Individual Differences, 11, 1299-1306.
Pekrun, R., & Frese, M. (1992). Emotions in work and achievement. In C. L. Cooper & I. T. Robertson (Eds.), International Review of Industrial and Organizational Psychology (Vol. 7, pp. 153-200). New York: Wiley.
Pepper, S. (1981). Problems in the quantification of frequency expressions. In D. W. Fiske (Ed.), Problems with language imprecision (pp. 25-41). San Francisco: Jossey-Bass.
Rapaport, D. (1942). Emotions and memory. New York: International Universities Press.
Reder, L. M. (1987). Selection strategies in question answering. Cognitive Psychology, 19, 90-138.
Reisenzein, R. (1995). On Oatley and Johnson-Laird’s theory of emotion and hierarchical structures in the affective lexicon. Cognition and Emotion, 9, 383-416.
Reisenzein, R., & Hofmann, T. (1993). Discriminating emotions from appraisal-relevant situational information: Baseline data for structural models of cognitive appraisals. Cognition and Emotion, 7, 271-293.
Reisenzein, R. & Schimmack, U. (1996). Similarity and covariation of affects: Findings and implications. Manuscript submitted for publication.
Reisenzein, R., & Schönpflug, W. (1992). Stumpf’s cognitive-evaluative theory of emotion. American Psychologist, 47, 34-45.
Rosch, E. (1975). Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104, 192-223.
Schaeffer, N. C. (1991). Hardly ever or constantly? Group comparisons using vague quantifiers. Public Opinion Quarterly, 55, 395-423.
Scherer, K. R., Wallbott, H. G., & Summerfield, A. B. (Eds.). (1986). Experiencing emotion: A cross-cultural study. Cambridge: Cambridge University Press.
Schimmack, U. (1996). Resolving some controversies about the mood circumplex. Paper presented at the Annual Meetings of the Midwestern Psychological Association, Chicago, May, 1996.
Schimmack, U. (1996). The relation between extraversion/neuroticism and positive/negative affect: A meta-analysis. Manuscript in preparation.
Schimmack, U. (in press). Das Berliner-Alltagssprachliche-Stimmungsinventar (BASTI): Ein Vorschlag zur kontentvaliden Erfassung von Stimmungen [The Berlin Everyday Language Mood Inventory: Toward the content valid assessment of moods]. Diagnostica.
Schimmack, U., & Diener, E. (in press). Affect Intensity: Separating intensity and frequency in repeatedly measured affect. Journal of Personality and Social Psychology.
Schimmack, U., & Reisenzein, R. (1994). On the demarcation of the mood domain. Paper presented at the symposium, “Mood – Consensus and controversy” at the 102nd Annual Convention of the American Psychological Association, Los Angeles, CA.
Schimmack, U., & Reisenzein, R. (in press). Cognitive processes involved in similarity judgments of emotion concepts. Journal of Personality and Social Psychology.
Schimmack, U., & Siemer, M. (1995). e = m x c! Über Emotionen, Stimmungen und Kognitionen [On emotion, mood and cognition]. Positionsreferat gehalten auf der 37. TeaP in Bochum, 1995.
Schwarz, N. (1987). Stimmung als Information [Mood as information]. Heidelberg: Springer.
Schwarz, N. (1990). Assessing frequency reports of mundane behaviors: Contributions of cognitive psychology to questionnaire construction. In C. Hendrick & M. S. Clark (Eds.), Research methods in personality and social psychology (pp. 98-119). Beverly Hills, CA: Sage.
Schwarz, N., Bless, H., Strack, F., Klumpp, G., Rittenauer-Schatka, H., & Simons, A. (1991). Ease of retrieval as information: Another look at the availability heuristic. Journal of Personality and Social Psychology, 61, 195-202.
Schwarz, N., & Clore, G. L. (1983). Mood, misattribution, and judgments of well-being: Informative and directive functions of affective states. Journal of Personality and Social Psychology, 45, 513-523.
Schwarz, N., Strack, F., Müller, G., & Chassein, B. (1988). The range of response alternatives may determine the meaning of the question: Further evidence on information functions of response alternatives. Social Cognition, 6, 107-117.
Shaver, P., Schwartz, J., Kirson, D., & O'Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 52, 1061-1086.
Shrout, P. E., & Fleiss, J. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86, 420-428.
Smith, E. R. (1991). Illusory correlation in a simulated exemplar-based memory. Journal of Experimental Social Psychology, 27, 107-123.
Smith, E. R., & Zarate, M. A. (1992). Exemplar-based model of social judgment. Psychological Review, 99, 3-21.
Steyer, R., Schwenkmezger, P., Notz, P., & Eid, M. (1994). Testtheoretische Analysen des Mehrdimensionalen Befindlichkeitsfragebogens [Test theoretical analyses of the multidimensional state questionnaire]. Diagnostica, 40, 320-328.
Suh, E., Diener, E., & Fujita, F. (1996). Events and subjective well-being: Only recent events matter. Journal of Personality and Social Psychology, 70, 1091-1102.
Taylor, G. J. (1984). Alexithymia: Concept, measurement, and implications for treatment. American Journal of Psychiatry, 141, 725-732.
Temme, G., & Tränkle, U. (1996). Arbeitsemotionen: Ein vernachlässigter Aspekt in der Arbeitszufriedenheitsforschung [Emotions at the workplace: A neglected aspect in research on job satisfaction]. Arbeit, 5, 275-297.
Thomas, D. L., & Diener, E. (1990). Memory accuracy in the recall of emotions. Journal of Personality and Social Psychology, 59, 291-297.
Thompson, C. P., & Mingay, D. (1991). Estimating the frequency of everyday events. Applied Cognitive Psychology, 5, 497-510.
Triandis, H. C. (1994). Culture and social behavior. New York: McGraw-Hill.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5, 207-232.
Underwood, B. J. (1969). The attributes of memory. Psychological Review, 76, 559-573.
Watkins, M. J., & LeCompte, D. C. (1991). Inadequacy of recall as a basis for frequency knowledge. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 1161-1176.
Watson, D., & Clark, L. A. (1992). On traits and temperaments: General and specific factors of emotional experience and their relation to the five-factor model. Journal of Personality, 60, 441-476.
Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of Positive and Negative Affect: The PANAS Scales. Journal of Personality and Social Psychology, 54, 1063-1070.
Wiggins, J. S. (1973). Personality and prediction: Principles of personality assessment. Reading, MA: Addison-Wesley.
Williams, K. W., & Durso, F. T. (1986). Judging category frequency: Automaticity or availability? Journal of Experimental Psychology: Learning, Memory, and Cognition, 12, 387-396.
Windle, C. (1955). Test-retest effect on personality questionnaires. Educational and Psychological Measurement, 15, 246-253.
Wright, D. B., Gaskell, G. D., & O’Muircheartaigh, C. A. (1994). How much is ‘Quite a bit’? Mapping between numerical values and vague quantifiers. Applied Cognitive Psychology, 8, 479-498.
Zuroff, D. C. (1989). Judgments of frequency of social stimuli: How schematic is person memory? Journal of Personality and Social Psychology, 56, 890-898.

The Power-Corrected H-Index

I was going to write this blog post eventually, but the online first publication of Radosic and Diener’s (2021) article “Citation Metrics in Psychological Science” provided a good opportunity to do so now.

The main purpose of Radosic and Diener’s (2021) article was to “provide norms to help evaluate the citation counts of psychological scientists” (p. 1). The authors also specify the purpose of these evaluations: “Citation metrics are one source of information that can be used in hiring, promotion, awards, and funding, and our goal is to help these evaluations” (p. 1).

The authors caution readers that they are agnostic about the validity of citation counts as a measure of good science. “The merits and demerits of citation counts are beyond the scope of the current article” (p. 8). Yet, they suggest that “there is much to recommend citation numbers in evaluating scholarly records” (p. 11).

At the same time, they list some potential limitations of using citation metrics to evaluate researchers.

1. Articles that developed a scale can have high citation counts. For example, Ed Diener has over 71,000 citations. His most cited article is the 1985 article that introduced his Satisfaction with Life Scale. With 12,000 citations, it accounts for 17% of his citations. The fact that articles that introduce a measure attract such high citation counts reflects a problem in psychological science: researchers continue to use the first measure that was developed for a new construct (e.g., Rosenberg's 1965 self-esteem scale) instead of improving measurement, which would lead to citations of newer articles. However, the high citation counts of scale articles are only a problem if raw citation counts are used as the metric. A better metric is the H-Index, which takes the number of publications and citations into account (a short sketch of this computation follows after this list). Ed Diener also has a very high H-Index of 108 publications with 108 or more citations, and his scale article is only one of these articles. Thus, scale development articles are not a major problem.

2. Review articles are cited more heavily than original research articles. Once more, Ed Diener is a good example. His second and third most cited articles are the 1984 and the co-authored 1999 Psychological Bulletin review articles on subjective well-being, which together account for another 9,000 citations (13%). However, even review articles are not a major problem. First, they too are unlikely to have an undue influence on the H-Index; second, it is possible to exclude review articles and to compute metrics only for empirical articles. Web of Science makes this very easy. In Web of Science, 361 of Diener's 469 publications are listed as articles; the others are listed as reviews, book chapters, or meeting abstracts. With a click of a button, we can produce citation metrics only for the 361 articles. The H-Index drops from 108 to 102. Careful hand-selection of articles is unlikely to change this.

3. Finally, Radosic and Diener (2021) mention large-scale collaborations as a problem. For example, one of the most important research projects in psychological science in the last decade was the Reproducibility Project that examined the replicability of psychological science with 100 replication studies (Open Science Collaboration, 2015). This project required a major effort by many researchers. Participation earned researchers over 2,000 citations in just five years and the article is likely to be the most cited article for many of the collaborators. I do not see this as a problem because large-scale collaborations are important and can produce results that no single lab can produce. Thus, high citation counts provide a good incentive to engage in these collaborations.
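
For readers unfamiliar with the metric discussed in point 1, the H-Index is easy to compute from a list of citation counts: it is the largest number h such that h publications have at least h citations each. A minimal sketch (the citation counts are made up for illustration):

```python
def h_index(citations):
    """Return the largest h such that h publications have at least h citations."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(counts, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for illustration only
print(h_index([120, 80, 40, 12, 11, 5, 3, 0]))  # -> 5
```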

To conclude, Radosic and Diener’s article provides norms for citation counts that can and will be used to evaluate psychological scientists. However, the article sidesteps the main questions about the use of citation metrics, namely (a) what criteria should be used to evaluate scientists and (b) whether citation metrics are valid indicators of these criteria. In short, the article is just another example that psychologists develop and promote measures without examining their construct validity (Schimmack, 2021).

What is a good scientist?

I didn't do an online study to examine the ideal prototype of a scientist, so I have to rely on my own image of a good scientist. A key criterion is the search for objectively verifiable information that can inform our understanding of the world or, in psychology, of ourselves; that is, human affect, behavior, and cognition (the ABC of psychology). The second criterion elaborates the term objective: scientists use methods that produce the same results independent of the user of the methods. That is, studies should be reproducible and results should be replicable within the margins of error. Third, the research question should have some significance beyond the personal interests of a scientist. This is of course a tricky criterion, but research that solves major problems, like finding a vaccine for Covid-19, is more valuable and more likely to receive citations than research on the liking of cats versus dogs (I know, this is the most controversial statement I am making; go cats!).

The problem is that not everybody can do research that is equally important to a large number of people. Once more, Ed Diener is a good example. In the 1980s, he decided to study human happiness, which was not a major topic in psychology at the time. Ed Diener's high H-Index reflects his choice of a topic that is of interest to pretty much everybody. In contrast, research on the stigma of minority groups is of interest to a smaller group of people and unlikely to attract the same amount of attention. Thus, a blind focus on citation metrics is likely to favor research on general topics and to discourage research that applies psychology to specific problems. The problem is clearly visible in research on prejudice, where the past 20 years have produced hundreds of studies with button-press tasks by White researchers with White participants that gobbled up funding that could have been used by BIPOC researchers to study the actual issues in BIPOC populations. In short, the relevance and significance of research are very difficult to evaluate, but they are unlikely to be reflected in citation metrics. Thus, the danger is that citation metrics are used because they are easy to measure, while relevance is ignored because it is harder to measure.

Do Citation Metrics Reward Good or Bad Research?

The main justification for the use of citation metrics is the hypothesis that the wisdom of crowds will lead to more citations of high quality work.

“The argument in favor of personal judgments overlooks the fact that citation counts are also based on judgments by scholars. In the case of citation counts, however, those judgments are broadly derived from the whole scholarly community and are weighted by the scholars who are publishing about the topic of the cited publications. Thus, there is much to recommend citation numbers in evaluating scholarly records.” (Radosic & Diener, 2021, p. 8)

This statement is out of touch with discussions about psychological science over the past decade in the wake of the replication crisis (see Schimmack, 2020, for a review; I have to cite myself to drive up my citation metrics. LOL). In order to get original research articles published and cited in psychological science, researchers need statistically significant p-values. The problem is that it can be difficult to find significant results when novel hypotheses are false or effect sizes are small. Given the pressure to publish in order to rise in the H-Index rankings, psychologists have learned to use a number of statistical tricks to get significant results in the absence of strong evidence in the data. These tricks are known as questionable research practices, but most researchers consider them acceptable (John et al., 2012). However, these practices undermine the value of significance testing; published results may be false positives or difficult to replicate and do not add to the progress of science. Thus, citation metrics may have the negative consequence of pressuring scientists into using bad practices and of rewarding scientists who publish more false results simply because they publish more.

Meta-psychologists have produced strong evidence that the use of these practices was widespread and accounts for the majority of replication failures that occurred over the past decade.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Motyl et al. (2017) collected focal test statistics from a representative sample of articles in social psychology. I analyzed their data using z-curve 2.0 (Brunner & Schimmack, 2020; Bartos & Schimmack, 2021). Figure 1 shows the distribution of the test statistics after converting them into absolute z-scores, where higher values reflect a higher signal-to-noise ratio (effect size relative to sampling error). A z-score of 1.96 is needed to claim a discovery with p < .05 (two-sided). Consistent with publication practices since the 1960s, most focal hypothesis tests confirm predictions (Sterling, 1959). The observed discovery rate is 90% and even higher if marginally significant results are included (z > 1.65). This high success rate is not something to celebrate. Even I could win all marathons if I used a short-cut and ran only 5 km. The problem with this high success rate becomes clearly visible when we fit a model to the distribution of the significant z-scores and extrapolate the distribution of z-scores that are not significant (the blue curve in the figure). Based on this distribution, the significant results are only 19% of all tests, indicating that many more non-significant results are expected than observed. The discrepancy between the observed and estimated discovery rate provides some indication of the use of questionable research practices. Moreover, the estimated discovery rate shows how much statistical power studies have to produce significant results without questionable research practices. The results confirm suspicions that power in social psychology is abysmally low (Cohen, 1962; Tversky & Kahneman, 1971).
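
The conversion step is mechanical: each reported test statistic is transformed into an exact two-sided p-value, which is then mapped onto the absolute z-score of a standard normal distribution, and the observed discovery rate is simply the share of tests with p < .05. A minimal sketch of these two steps with made-up p-values is shown below (it does not fit the z-curve mixture model itself; for that, the R-package zcurve mentioned later in the rankings post can be used):

```python
import numpy as np
from scipy import stats

# Hypothetical two-sided p-values from focal hypothesis tests (illustration only)
p_values = np.array([0.001, 0.01, 0.03, 0.04, 0.049, 0.06, 0.20])

# Convert two-sided p-values into absolute z-scores (p = .05 maps onto z = 1.96)
z_scores = stats.norm.isf(p_values / 2)

# Observed discovery rate: share of significant results (p < .05)
odr = np.mean(p_values < 0.05)
print(np.round(z_scores, 2), f"ODR = {odr:.0%}")
```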

The use of questionable practices raises the possibility that citation metrics are invalid. When everybody in a research field uses p < .05 as a criterion to evaluate manuscripts and these p-values are obtained with questionable research practices, the system will reward researchers who use the most questionable methods to produce more questionable results than their peers. In other words, citation metrics are no longer a valid criterion of research quality. Instead, bad research is selected and rewarded (Smaldino & McElreath, 2016). However, it is also possible that implicit knowledge helps researchers to focus on robust results and that questionable research practices are not rewarded. For example, prediction markets suggest that it is fairly easy to spot shoddy research and to predict replication failures (Dreber et al., 2015). Thus, we cannot assume that citation metrics are valid or invalid. Instead, citation metrics – like all measures – require a program of construct validation.

Do Citation Metrics Take Statistical Power Into Account?

A few days ago, I published the first results of an ongoing research project that examines the relationship between researchers' citation metrics and estimates of the average power of their studies based on z-curve analyses like the one shown in Figure 1 (see Schimmack, 2021, for details). The key finding is that there is no statistically or practically significant relationship between researchers' H-Index and the average power of their studies. Thus, researchers who invest a lot of resources in their studies to produce results with a low false positive risk and high replicability are not cited more than researchers who flood journals with low-powered studies that produce questionable results that are difficult to replicate.

These results reveal a major problem of citation metrics. Although methodologists have warned against underpowered studies, researchers have continued to run them because questionable research practices can still produce the desired outcome. This strategy is beneficial for scientists and their careers, but it hurts the larger goal of science to produce a credible body of knowledge. This does not mean that we need to abandon citation metrics altogether, but they must be complemented with other information that reflects the quality of researchers' data.

The Power-Corrected H-Index

In my 2020 review article, I proposed to weight the H-Index by estimates of researchers' replicability. For my illustration, I used the estimated replication rate (ERR), which is the average power of significant tests, p < .05 (Brunner & Schimmack, 2020). One advantage of the ERR is that it is highly reliable. The reliability of the ERRs for 300 social psychologists is .90. However, the ERR has some limitations. First, it predicts replication outcomes under the unrealistic assumption that psychological studies can be replicated exactly, which is often impossible, especially in social psychology (Stroebe & Strack, 2014). As a result, ERR predictions are overly optimistic and overestimate the success rate of actual replication studies (Bartos & Schimmack, 2021). In contrast, EDR estimates are much more in line with actual replication outcomes because effect sizes in replication studies can regress towards the mean. For example, Figure 1 shows an EDR of 19% for social psychology, and the actual success rate (if we can call it that) for social psychology was 25% in the Reproducibility Project (Open Science Collaboration, 2015). Another advantage of the EDR is that it is sensitive to questionable research practices that tend to produce an abundance of p-values that are just significant. Thus, the EDR more strongly punishes researchers for using these undesirable practices. The main limitation of the EDR is that it is less reliable than the ERR; the reliability for 300 social psychologists was only .5. Of course, it is not necessary to choose between ERR and EDR. Just as there are many citation metrics, it is possible to evaluate the pattern of power-corrected metrics using both ERR and EDR. I am presenting both values here, but the rankings are sorted by EDR-weighted H-Indices.
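
The difference between the two estimates is easiest to see under the simplifying assumption that the true power of every study a researcher conducted were known: the EDR is then the mean power of all studies, whereas the ERR is the mean power of the studies that reached significance, which weights each study by its own power. A minimal sketch under that assumption (the actual z-curve estimates are obtained from the observed distribution of significant z-scores, not from known power values):

```python
import numpy as np

# Hypothetical true power values of all studies a researcher conducted
power = np.array([0.10, 0.20, 0.30, 0.80, 0.95])

edr = power.mean()                          # mean power before selection for significance
err = np.sum(power ** 2) / np.sum(power)    # power-weighted mean = mean power of significant studies
print(f"EDR = {edr:.2f}, ERR = {err:.2f}")  # ERR exceeds EDR whenever power varies across studies
```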

The H-Index is an absolute number that can range from 0 to infinity. In contrast, power is limited to a range from 5% (with alpha = .05) to 100%. Thus, it makes sense to use power as a weight and to multiply the H-Index by a researcher's EDR. A researcher who published only studies with 100% power has a power-corrected H-Index that is equivalent to the actual H-Index. The average EDR of social psychologists, however, is 35%. Thus, the average H-Index is reduced to about a third of its unadjusted value.
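
In code, the adjustment is a one-line multiplication. A minimal sketch using the values discussed next for James J. Gross (H-Index 99, EDR .50, ERR .77):

```python
def power_corrected_h(h_index, power_estimate):
    """Weight the H-Index by an estimate of average power (EDR or ERR, between .05 and 1)."""
    return h_index * power_estimate

print(power_corrected_h(99, 0.50))  # EDR-corrected H-Index, roughly 50
print(power_corrected_h(99, 0.77))  # ERR-corrected H-Index, roughly 76
```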

To illustrate this approach, I am using two researchers with a large H-Index but different EDRs. One researcher is James J. Gross with an H-Index of 99 in Web of Science. His z-curve plot shows some evidence that questionable research practices were used to report 72% significant results with 50% power. However, the 95%CI around the EDR ranges from 23% to 78% and includes the observed discovery rate. Thus, the evidence for QRPs is weak and not statistically significant. More importantly, the EDR-corrected H-Index is 99 * .50 = 49.5, or about 50.

A different example is provided by Shelley E. Taylor, with a similarly high H-Index of 84, but her z-curve plot shows clear evidence that the observed discovery rate is inflated by questionable research practices. Her low EDR reduces the H-Index considerably and results in a PC-H-Index of only 12.6.

Weighting the two researchers' H-Indices by their respective ERRs, 77 vs. 54, has similar but less extreme effects in absolute terms, yielding ERR-adjusted H-Indices of 76 vs. 45.

In the sample of 300 social psychologists, the H-Index (r = .74) and the EDR (r = .65) contribute about equal amounts of variance to the power-corrected H-Index. Of course, a different formula could be used to weight power more or less heavily.

Discussion

Ed Diener is best known for his efforts to measure well-being and to point out that traditional economic indicators of well-being are imperfect. While the wealth of countries is a strong predictor of citizens' average well-being, r ~ .8, income is a poor predictor of individuals' well-being within countries. However, economists continue to rely on income and GDP because they are more easily quantified and counted than subjective life-evaluations. Ironically, Diener advocates the opposite approach when it comes to measuring research quality. Counting articles and citations is relatively easy and objective, but it may not measure what we really want to measure, namely how much somebody contributes to the advancement of knowledge. The construct of scientific advancement is probably as difficult to define as well-being, but producing replicable results with reproducible studies is one important criterion of good science. At present, citation metrics fail to track this indicator of research quality. Z-curve analyses of published results make it possible to measure this aspect of good science, and I recommend taking it into account when researchers are being evaluated.

However, I do not recommend the use of quantitative information for hiring and promotion decisions. The reward system in science is too biased; it rewards privileged, upper-class, White US Americans (see APS rising stars lists). That being said, a close examination of published articles can be used to detect and eliminate researchers who severely p-hacked to get their significant results. Open science criteria can also be used to evaluate researchers who are just starting their careers.

In conclusion, Radosic and Diener’s (2021) article disappointed me because it sidesteps the fundamental questions about the validity of citation metrics as a criterion for scientific excellence.

Conflict of Interest Statement: At the beginning of my career, I was motivated to succeed in psychological science by publishing as many JPSP articles as possible, and I made the unhealthy mistake of trying to compete with Ed Diener. That didn't work out for me. Maybe I am just biased against citation metrics because my work is not cited as much as I would like. Alternatively, my disillusionment with the system reflects some real problems with the reward structure in psychological science and helped me to see the light. The goal of science cannot be to have the most articles or the most citations if these metrics do not really reflect scientific contributions. Chasing indicators is a trap, just like chasing happiness is a trap. Most scientists can hope to make maybe one lasting contribution to the advancement of knowledge. You need to please others to stay in the game, but beyond those minimum requirements to get tenure, personal criteria of success are better for the well-being of science and scientists than social comparisons. The only healthy criterion is to maximize statistical power. As Cohen said, less is more, and by this criterion psychology is not doing well, as more and more research is published with little concern about quality.

Name | EDR.H-Index | ERR.H-Index | H-Index | EDR | ERR
James J. Gross5076995077
John T. Cacioppo48701024769
Richard M. Ryan4661895269
Robert A. Emmons3940468588
Edward L. Deci3643695263
Richard W. Robins3440576070
Jean M. Twenge3335595659
William B. Swann Jr.3244555980
Matthew D. Lieberman3154674780
Roy F. Baumeister31531013152
David Matsumoto3133397985
Carol D. Ryff3136486476
Dacher Keltner3144684564
Michael E. McCullough3034446978
Kipling D. Williams3034446977
Thomas N Bradbury3033486369
Richard J. Davidson30551082851
Phoebe C. Ellsworth3033466572
Mario Mikulincer3045714264
Richard E. Petty3047744064
Paul Rozin2949585084
Lisa Feldman Barrett2948694270
Constantine Sedikides2844634570
Alice H. Eagly2843614671
Susan T. Fiske2849664274
Jim Sidanius2730426572
Samuel D. Gosling2733535162
S. Alexander Haslam2740624364
Carol S. Dweck2642663963
Mahzarin R. Banaji2553683778
Brian A. Nosek2546574481
John F. Dovidio2541663862
Daniel M. Wegner2434524765
Benjamin R. Karney2427376573
Linda J. Skitka2426327582
Jerry Suls2443633868
Steven J. Heine2328376377
Klaus Fiedler2328386174
Jamil Zaki2327356676
Charles M. Judd2336534368
Jonathan B. Freeman2324307581
Shinobu Kitayama2332455071
Norbert Schwarz2235564063
Antony S. R. Manstead2237593762
Patricia G. Devine2125375867
David P. Schmitt2123307177
Craig A. Anderson2132593655
Jeff Greenberg2139732954
Kevin N. Ochsner2140573770
Jens B. Asendorpf2128415169
David M. Amodio2123336370
Bertram Gawronski2133434876
Fritz Strack2031553756
Virgil Zeigler-Hill2022277481
Nalini Ambady2032573556
John A. Bargh2035633155
Arthur Aron2036653056
Mark Snyder1938603263
Adam D. Galinsky1933682849
Tom Pyszczynski1933613154
Barbara L. Fredrickson1932523661
Hazel Rose Markus1944642968
Mark Schaller1826434361
Philip E. Tetlock1833454173
Anthony G. Greenwald1851613083
Ed Diener18691011868
Cameron Anderson1820276774
Michael Inzlicht1828444163
Barbara A. Mellers1825325678
Margaret S. Clark1823305977
Ethan Kross1823345267
Nyla R. Branscombe1832493665
Jason P. Mitchell1830414373
Ursula Hess1828404471
R. Chris Fraley1828394572
Emily A. Impett1819257076
B. Keith Payne1723305876
Eddie Harmon-Jones1743622870
Wendy Wood1727434062
John T. Jost1730493561
C. Nathan DeWall1728453863
Thomas Gilovich1735503469
Elaine Fox1721276278
Brent W. Roberts1745592877
Harry T. Reis1632433874
Robert B. Cialdini1629513256
Phillip R. Shaver1646652571
Daphna Oyserman1625463554
Russell H. Fazio1631503261
Jordan B. Peterson1631394179
Bernadette Park1624384264
Paul A. M. Van Lange1624384263
Jeffry A. Simpson1631572855
Russell Spears1529522955
A. Janet Tomiyama1517236576
Jan De Houwer1540552772
Samuel L. Gaertner1526423561
Michael Harris Bond1535423584
Agneta H. Fischer1521314769
Delroy L. Paulhus1539473182
Marcel Zeelenberg1429373979
Eli J. Finkel1426453257
Jennifer Crocker1432483067
Steven W. Gangestad1420483041
Michael D. Robinson1427413566
Nicholas Epley1419265572
David M. Buss1452652280
Naomi I. Eisenberger1440512879
Andrew J. Elliot1448712067
Steven J. Sherman1437592462
Christian S. Crandall1421363959
Kathleen D. Vohs1423453151
Jamie Arndt1423453150
John M. Zelenski1415206976
Jessica L. Tracy1423324371
Gordon B. Moskowitz1427472957
Klaus R. Scherer1441522678
Ayelet Fishbach1321363759
Jennifer A. Richeson1321403352
Charles S. Carver1352811664
Leaf van Boven1318274767
Shelley E. Taylor1244841452
Lee Jussim1217245271
Edward R. Hirt1217264865
Shigehiro Oishi1232522461
Richard E. Nisbett1230432969
Kurt Gray1215186981
Stacey Sinclair1217304157
Niall Bolger1220343658
Paula M. Niedenthal1222363461
Eliot R. Smith1231422973
Tobias Greitemeyer1221313967
Rainer Reisenzein1214215769
Rainer Banse1219264672
Galen V. Bodenhausen1228462661
Ozlem Ayduk1221353459
E. Tory. Higgins1238701754
D. S. Moskowitz1221333663
Dale T. Miller1225393064
Jeanne L. Tsai1217254667
Roger Giner-Sorolla1118225180
Edward P. Lemay1115195981
Ulrich Schimmack1122353263
E. Ashby Plant1118363151
Ximena B. Arriaga1113195869
Janice R. Kelly1115225070
Frank D. Fincham1135601859
David Dunning1130432570
Boris Egloff1121372958
Karl Christoph Klauer1125392765
Caryl E. Rusbult1019362954
Tessa V. West1012205159
Jennifer S. Lerner1013224661
Wendi L. Gardner1015244263
Mark P. Zanna1030621648
Michael Ross1028452262
Jonathan Haidt1031432373
Sonja Lyubomirsky1022382659
Sander L. Koole1018352852
Duane T. Wegener1016273660
Marilynn B. Brewer1027442262
Christopher K. Hsee1020313163
Sheena S. Iyengar1015195080
Laurie A. Rudman1026382568
Joanne V. Wood916263660
Thomas Mussweiler917392443
Shelly L. Gable917332850
Felicia Pratto930402375
Wiebke Bleidorn920273474
Jeff T. Larsen917253667
Nicholas O. Rule923303075
Dirk Wentura920312964
Klaus Rothermund930392376
Joris Lammers911165669
Stephanie A. Fryberg913194766
Robert S. Wyer930471963
Mina Cikara914184980
Tiffany A. Ito914224064
Joel Cooper914352539
Joshua Correll914233862
Peter M. Gollwitzer927461958
Brad J. Bushman932511762
Kennon M. Sheldon932481866
Malte Friese915263357
Dieter Frey923392258
Lorne Campbell914233761
Monica Biernat817292957
Aaron C. Kay814283051
Yaacov Schul815233664
Joseph P. Forgas823392159
Guido H. E. Gendolla814302747
Claude M. Steele813312642
Igor Grossmann815233566
Paul K. Piff810165063
Joshua Aronson813282846
William G. Graziano820302666
Azim F. Sharif815223568
Juliane Degner89126471
Margo J. Monteith818243277
Timothy D. Wilson828451763
Kerry Kawakami813233356
Hilary B. Bergsieker78116874
Gerald L. Clore718391945
Phillip Atiba Goff711184162
Elizabeth W. Dunn717262864
Bernard A. Nijstad716312352
Mark J. Landau713282545
Christopher R. Agnew716213376
Brandon J. Schmeichel714302345
Arie W. Kruglanski728491458
Eric D. Knowles712183864
Yaacov Trope732571257
Wendy Berry Mendes714312244
Jennifer S. Beer714252754
Nira Liberman729451565
Penelope Lockwood710144870
Jeffrey W Sherman721292371
Geoff MacDonald712183767
Eva Walther713193566
Daniel T. Gilbert727411665
Grainne M. Fitzsimons611232849
Elizabeth Page-Gould611164066
Mark J. Brandt612173770
Ap Dijksterhuis620371754
James K. McNulty621331965
Dolores Albarracin618331956
Maya Tamir619292164
Jon K. Maner622431452
Alison L. Chasteen617252469
Jay J. van Bavel621302071
William A. Cunningham619302064
Glenn Adams612173573
Wilhelm Hofmann622331866
Ludwin E. Molina67124961
Lee Ross626421463
Andrea L. Meltzer69134572
Jason E. Plaks610153967
Ara Norenzayan621341761
Batja Mesquita617232573
Tanya L. Chartrand69282033
Toni Schmader518301861
Abigail A. Scholer59143862
C. Miguel Brendl510153568
Emily Balcetis510153568
Diana I. Tamir59153562
Nir Halevy513182972
Alison Ledgerwood58153454
Yoav Bar-Anan514182876
Paul W. Eastwick517242169
Geoffrey L. Cohen513252050
Yuen J. Huo513163180
Benoit Monin516291756
Gabriele Oettingen517351449
Roland Imhoff515212373
Mark W. Baldwin58202441
Ronald S. Friedman58192544
Shelly Chaiken522431152
Kristin Laurin59182651
David A. Pizarro516232069
Michel Tuan Pham518271768
Amy J. C. Cuddy517241972
Gun R. Semin519301564
Laura A. King419281668
Yoel Inbar414202271
Nilanjana Dasgupta412231952
Kerri L. Johnson413172576
Roland Neumann410152867
Richard P. Eibach410221947
Roland Deutsch416231871
Michael W. Kraus413241755
Steven J. Spencer415341244
Gregory M. Walton413291444
Ana Guinote49202047
Sandra L. Murray414251655
Leif D. Nelson416251664
Heejung S. Kim414251655
Elizabeth Levy Paluck410192155
Jennifer L. Eberhardt411172362
Carey K. Morewedge415231765
Lauren J. Human49133070
Chen-Bo Zhong410211849
Ziva Kunda415271456
Geoffrey J. Leonardelli46132848
Danu Anthony Stinson46113354
Kentaro Fujita411182062
Leandre R. Fabrigar414211767
Melissa J. Ferguson415221669
Nathaniel M Lambert314231559
Matthew Feinberg38122869
Sean M. McCrea38152254
David A. Lishner38132563
William von Hippel313271248
Joseph Cesario39191745
Martie G. Haselton316291154
Daniel M. Oppenheimer316261260
Oscar Ybarra313241255
Simone Schnall35161731
Travis Proulx39141962
Spike W. S. Lee38122264
Dov Cohen311241144
Ian McGregor310241140
Dana R. Carney39171553
Mark Muraven310231144
Deborah A. Prentice312211257
Michael A. Olson211181363
Susan M. Andersen210211148
Sarah E. Hill29171352
Michael A. Zarate24141331
Lisa K. Libby25101854
Hans Ijzerman2818946
James M. Tyler1681874
Fiona Lee16101358

References

Open Science Collaboration (OSC). (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. http://dx.doi.org/10.1126/science.aac4716

Radosic, N., & Diener, E. (2021). Citation Metrics in Psychological Science. Perspectives on Psychological Science. https://doi.org/10.1177/1745691620964128

Schimmack, U. (2021). The validation crisis. Meta-Psychology, in press.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Power and Success: When the R-Index meets the H-Index

A main message of the Lord of the Rings novels is that power is dangerous and corrupts. The main exception is statistical power. High statistical power is desirable because it reduces the risk of false negative results and thereby increases the rate of true discoveries. A high rate of true discoveries is desirable because it reduces the risk that significant results are false positives. For example, a researcher who conducts studies with low power to produce many significant results, but who also tests many false hypotheses, will have a high rate of false positive discoveries (Finkel, 2018). In contrast, a researcher who invests more resources in any single study will have fewer significant results, but a lower risk of false positives. Another advantage of high power is that true discoveries are more replicable. A true positive that was obtained with 80% power has an 80% chance to produce a successful replication. In contrast, a true discovery that was obtained with 20% power has an 80% chance to end with a failure to replicate, which then requires additional replication studies to determine whether the original result was a false positive.
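
The replication claim follows directly from the definition of power: if the effect is real and the replication study has the same power as the original study, the probability of a significant replication equals that power. A minimal sketch that approximates the power of a two-sample mean comparison with a normal approximation (the effect size and sample size are made-up values for illustration):

```python
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided power of a two-sample mean comparison (normal approximation)."""
    z_crit = stats.norm.isf(alpha / 2)
    ncp = d * np.sqrt(n_per_group / 2)  # noncentrality of the corresponding z-test
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

# With d = .5 and n = 64 per group, power (and thus the replication probability
# for a true effect) is roughly .80
print(round(approx_power(d=0.5, n_per_group=64), 2))
```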

Although most researchers agree that high power is desirable (and specify that they are planning studies with 80% power in their grant proposals), they no longer care about power once a study is completed and a significant result has been obtained. The fallacy is to assume that a significant result was obtained because the hypothesis was true and the study had good power. Until recently, there was also no statistical method to estimate researchers' actual power. The main problem was that questionable research practices inflate post-hoc estimates of statistical power; selection for significance ensures that post-hoc power is at least 50%. This problem has been solved with selection models that correct for selection for significance, namely p-curve and z-curve. A comparison of these methods with simulation studies shows that p-curve estimates can be dramatically inflated when studies are heterogeneous in power (Brunner & Schimmack, 2020). Z-curve is also the only method that estimates power for all studies that were conducted and not just for the subset of studies that produced a significant result. A comparison with actual success rates of replication studies shows that these estimates predict actual replication outcomes (Bartos & Schimmack, 2021).

The ability to estimate researchers' actual power offers new opportunities for meta-psychologists. One interesting question is how statistical power is related to traditional indicators of scientific success or eminence. There are several possible outcomes.

One possibility is that power is positively correlated with success, especially for older researchers. The reason is that low power should produce many replication failures for other researchers who are trying to build on this researcher's work. Faced with replication failures, they are likely to abandon this research, and work on the topic will cease after a while. Accordingly, low-powered studies are unlikely to produce a large body of research. In contrast, high-powered studies replicate, and the many researchers who build on these findings generate many citations and a large H-Index.

A second possibility is that there is no relationship between power and success. The reason would be that power is determined by many other factors such as the effect sizes in a research area and the type of design that is used to examine these effects. Some research areas will have robust findings that replicate often. Other areas will have low power, but everybody in this area accepts that studies do not always work. In this scenario, success is determined by other factors that vary within research areas and not by power, which varies mostly across research areas.

Another reason for the lack of a correlation could be a floor effect. In a system that does not value credibility and replicability, researchers who use questionable practices to push articles out might win and the only way to survive is to do bad research (Smaldino & McElreath, 2016).

A third possibility is that power is negatively correlated with success. Although there is no evidence for a negative relationship, concerns have been raised that some researchers are gaming the system by conducting many studies with low power to produce as many significant results as possible. The costs of replication failures are passed on to other researchers that try to build on these findings, whereas the innovator moves on to produce more significant results on new questions.

Given the lack of data and the fact that plausible arguments can be made for any type of relationship, it is not possible to make a priori predictions. Thus, the theoretical implications can only be examined after we look at the data.

Data

Success was measured with the H-Index in Web of Science. Information about the statistical power of over 300 social/personality psychologists was obtained using z-curve analyses of automatically extracted test statistics (Schimmack, 2021). A sample size of N = 300 provides reasonably tight confidence intervals to evaluate whether there is a substantial relationship between the H-Index and power. I log-transformed the H-Index and computed its correlation with the estimated discovery rate, which corresponds to the average power before selection for significance (Brunner & Schimmack, 2020). The results show a weak positive relationship that is not significantly different from zero, r(N = 304) = .07, 95%CI = -.04 to .18. Thus, the results are most consistent with theories that predict no relationship between success and research practices. Figure 1 shows the scatterplot, and there is no indication that the weak correlation is due to a floor effect. There is considerable variation in the estimated discovery rate across researchers.
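
A minimal sketch of this analysis is shown below: log-transform the H-Index, correlate it with the EDR, and compute a 95% confidence interval around r with Fisher's r-to-z transformation (one standard way to obtain such an interval; the two vectors are random placeholders, not the actual data for the 300+ researchers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
h_index = rng.integers(5, 100, size=304)   # placeholder for the real H-Indices
edr = rng.uniform(0.05, 1.0, size=304)     # placeholder for the real EDR estimates

r, p = stats.pearsonr(np.log(h_index), edr)

# 95% confidence interval via Fisher's r-to-z transformation
z = np.arctanh(r)
se = 1 / np.sqrt(len(edr) - 3)
ci = np.tanh([z - 1.96 * se, z + 1.96 * se])
print(round(r, 2), np.round(ci, 2))
```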

One concern could be that the EDR is just a very noisy and unreliable measure of statistical power. To examine this, I split the z-values of each researcher in half, computed separate z-curves, and then computed the split-half correlation and adjusted it to estimate the reliability (alpha) for the full set of z-scores. The reliability of the EDR was alpha = .5. To increase reliability, I used extreme groups for the EDR and excluded values between 25% and 45%. However, the correlation with the H-Index did not increase, r = .08, 95%CI = -.08 to .23.
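
The post does not spell out the adjustment formula; the standard way to step up a split-half correlation to the reliability of the full-length measure is the Spearman-Brown formula, sketched here under that assumption:

```python
def spearman_brown(split_half_r):
    """Step up a split-half correlation to the reliability of the full-length measure."""
    return 2 * split_half_r / (1 + split_half_r)

# For example, a split-half correlation of about .33 implies a full-length reliability of about .5
print(round(spearman_brown(0.33), 2))
```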

I also correlated the H-Index with the more reliable estimated replication rate (reliability = .9), which is power after selection for significance. This correlation was also not significant, r = .08, 95%CI = -.04 to .19.

In conclusion, we can reject the hypothesis that higher success is related to conducting many small studies with low power and selectively reporting only significant results (r > -.1, p < .05). There may be a small positive correlation (r < .2, p < .05), but a larger sample would be needed to reject the hypothesis that there is no relationship between success and statistical power.

Discussion

Low replication rates and major replication failures of some findings in social psychology have created a crisis of confidence. Some articles suggest that most published results are false and were obtained with questionable research practices. The present results suggest that these fears are unfounded and that it would be wrong to generalize from a few researchers to the whole group of social psychologists.

The present results also suggest that it is not necessary to burn social psychology to the ground. Instead, social psychologists should carefully examine which important findings are credible and replicable and which ones are not. Although this work has begun, it is moving slowly. The present results show that researchers' success, which is measured in terms of citations by peers, is not tied to the credibility of their findings. Personalized information about power may help to change this in the future.

A famous quote in management is “If You Can’t Measure It, You Can’t Improve It.” This might explain why statistical power has remained low despite early warnings about low power (Cohen, 1962; Tversky & Kahneman, 1971). Z-curve analysis is a game changer because it makes it possible to measure power, and with modern computers it is possible to do so quickly and on a large scale. If we agree that power is important and that it can be measured, it is time to improve it. Every researcher can do so, and the present results suggest that increasing power is not a career-ending move. So, I hope this post empowers researchers to invest more resources in high-powered studies.

Replicability Rankings 2010-2020

Welcome to the replicability rankings for 120 psychology journals. More information about the statistical method that is used to create the replicability rankings can be found elsewhere (Z-Curve; Video Tutorial; Talk; Examples). The rankings are based on automated extraction of test statistics from all articles published in these 120 journals from 2010 to 2020 (data). The results can be reproduced with the R-package zcurve.

To give a brief explanation of the method, I use a journal near the top and a journal near the bottom of the rankings as examples. Figure 1 shows the z-curve plot for the second-highest-ranking journal for the year 2020 (the Journal of Organizational Psychology is ranked #1, but it has very few test statistics). Plots for all journals, which include additional information about the test statistics, are available by clicking on the journal name. Plots for previous years can be found on the site for the 2010-2019 rankings (previous rankings).

To create the z-curve plot in Figure 1, the 361 test statistics were first transformed into exact p-values, which were then transformed into absolute z-scores. Thus, each value represents the deviation from zero on a standard normal distribution. A value of 1.96 (solid red line) corresponds to the standard criterion for significance, p = .05 (two-tailed). The dashed line represents the threshold for marginal significance, p = .10 (two-tailed). A z-curve analysis fits a finite mixture model to the distribution of the significant z-scores (the blue density distribution on the right side of the solid red line). The distribution provides information about the average power of studies that produced a significant result. As power determines the success rate in future studies, power after selection for significance is used to estimate replicability. For the present data, the z-curve estimate of the replication rate is 84%. The bootstrapped 95% confidence interval around this estimate ranges from 75% to 92%. Thus, we would expect the majority of these significant results to replicate.

However, the graph also shows some evidence that questionable research practices produced too many significant results. The observed discovery rate (i.e., the percentage of p-values below .05) is 82%. This falls outside the 95%CI of the estimated discovery rate, which is represented by the grey line in the range of non-significant results; EDR = 31%, 95%CI = 18% to 81%. In other words, fewer non-significant results are reported than z-curve predicts. This finding casts doubt on the replicability of the just-significant p-values. The replicability rankings ignore this problem, which means that the predicted success rates are overly optimistic. A more pessimistic predictor of the actual success rate is the EDR. However, the ERR still provides useful information for comparing the power of studies across journals and over time.

Figure 2 shows a journal with a low ERR in 2020.

The estimated replication rate is 64%, with a 95%CI ranging from 55% to 73%. This 95%CI does not overlap with the 95%CI for the Journal of Sex Research, indicating a significant difference in replicability. Visual inspection also shows clear evidence for the use of questionable research practices, with many more results that are just significant than results that are not significant. The observed discovery rate of 75% is inflated and falls outside the 95%CI of the EDR, which ranges from 10% to 56%.

To examine time trends, I regressed the ERR of each year on the year and computed the predicted values and 95%CI. Figure 3 shows the results for the journal Social Psychological and Personality Science as an example (x = 0 is 2010, x = 1 is 2020). The upper bound of the 95%CI for 2010, 62%, is lower than the lower bound of the 95%CI for 2020, 74%.

This shows a significant difference with alpha = .01. I use alpha = .01 so that only 1.2 out of the 120 journals are expected to show a significant change in either direction by chance alone. There are 22 journals with a significant increase in the ERR and no journals with a significant decrease. This shows that about 20% of these journals have responded to the crisis of confidence by publishing studies with higher power that are more likely to replicate.
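
A minimal sketch of the trend test for a single journal is shown below: regress the yearly ERR estimates on a 0-1 scaled year variable and compare the confidence intervals of the predicted values for 2010 and 2020 (the ERR values below are placeholders, not the actual data for any journal):

```python
import numpy as np
import statsmodels.api as sm

years = np.arange(2010, 2021)
x = (years - 2010) / 10                    # x = 0 is 2010, x = 1 is 2020
err = np.array([54, 55, 58, 60, 63, 66, 70, 72, 75, 78, 81]) / 100  # placeholder yearly ERRs

model = sm.OLS(err, sm.add_constant(x)).fit()
pred = model.get_prediction(sm.add_constant(np.array([0.0, 1.0])))

# 95% CIs of the predicted ERR for 2010 and 2020; non-overlapping intervals are
# the criterion used for a significant change
print(pred.conf_int(alpha=0.05))
```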

Rank | Journal | Observed 2020 | Predicted 2020 | Predicted 2010 (ERR estimates in %, with 95%CIs in brackets)
1Journal of Organizational Psychology88 [69 ; 99]84 [75 ; 93]73 [64 ; 81]
2Journal of Sex Research84 [75 ; 92]84 [74 ; 93]75 [65 ; 84]
3Evolution & Human Behavior84 [74 ; 93]83 [77 ; 90]62 [56 ; 68]
4Judgment and Decision Making81 [74 ; 88]83 [77 ; 89]68 [62 ; 75]
5Personality and Individual Differences81 [76 ; 86]81 [78 ; 83]68 [65 ; 71]
6Addictive Behaviors82 [75 ; 89]81 [77 ; 86]71 [67 ; 75]
7Depression & Anxiety84 [76 ; 91]81 [77 ; 85]67 [63 ; 71]
8Cognitive Psychology83 [75 ; 90]81 [76 ; 87]71 [65 ; 76]
9Social Psychological and Personality Science85 [78 ; 92]81 [74 ; 89]54 [46 ; 62]
10Journal of Experimental Psychology – General80 [75 ; 85]80 [79 ; 81]67 [66 ; 69]
11J. of Exp. Psychology – Learning, Memory & Cognition81 [75 ; 87]80 [77 ; 84]73 [70 ; 77]
12Journal of Memory and Language79 [73 ; 86]80 [76 ; 83]73 [69 ; 77]
13Cognitive Development81 [75 ; 88]80 [75 ; 85]67 [62 ; 72]
14Sex Roles81 [74 ; 88]80 [75 ; 85]72 [67 ; 77]
15Developmental Psychology74 [67 ; 81]80 [75 ; 84]67 [63 ; 72]
16Canadian Journal of Experimental Psychology77 [65 ; 90]80 [73 ; 86]74 [68 ; 81]
17Journal of Nonverbal Behavior73 [59 ; 84]80 [68 ; 91]65 [53 ; 77]
18Memory and Cognition81 [73 ; 87]79 [77 ; 81]75 [73 ; 77]
19Cognition79 [74 ; 84]79 [76 ; 82]70 [68 ; 73]
20Psychology and Aging81 [74 ; 87]79 [75 ; 84]74 [69 ; 79]
21Journal of Cross-Cultural Psychology83 [76 ; 91]79 [75 ; 83]75 [71 ; 79]
22Psychonomic Bulletin and Review79 [72 ; 86]79 [75 ; 83]71 [67 ; 75]
23Journal of Experimental Social Psychology78 [73 ; 84]79 [75 ; 82]52 [48 ; 55]
24JPSP-Attitudes & Social Cognition82 [75 ; 88]79 [69 ; 89]55 [45 ; 65]
25European Journal of Developmental Psychology75 [64 ; 86]79 [68 ; 91]74 [62 ; 85]
26Journal of Business and Psychology82 [71 ; 91]79 [68 ; 90]74 [63 ; 85]
27Psychology of Religion and Spirituality79 [71 ; 88]79 [66 ; 92]72 [59 ; 85]
28 | J. of Exp. Psychology – Human Perception and Performance | 79 [73; 84] | 78 [77; 80] | 75 [73; 77]
29 | Attention, Perception and Psychophysics | 77 [72; 82] | 78 [75; 82] | 73 [70; 76]
30 | Psychophysiology | 79 [74; 84] | 78 [75; 82] | 66 [62; 70]
31 | Psychological Science | 77 [72; 84] | 78 [75; 82] | 57 [54; 61]
32 | Quarterly Journal of Experimental Psychology | 81 [75; 86] | 78 [75; 81] | 72 [69; 74]
33 | Journal of Child and Family Studies | 80 [73; 87] | 78 [74; 82] | 67 [63; 70]
34 | JPSP-Interpersonal Relationships and Group Processes | 81 [74; 88] | 78 [73; 82] | 53 [49; 58]
35 | Journal of Behavioral Decision Making | 77 [70; 86] | 78 [72; 84] | 66 [60; 72]
36 | Appetite | 78 [73; 84] | 78 [72; 83] | 72 [67; 78]
37 | Journal of Comparative Psychology | 79 [65; 91] | 78 [71; 85] | 68 [61; 75]
38 | Journal of Religion and Health | 77 [57; 94] | 78 [70; 87] | 75 [67; 84]
39 | Aggressive Behaviours | 82 [74; 90] | 78 [70; 86] | 70 [62; 78]
40 | Journal of Health Psychology | 74 [64; 82] | 78 [70; 86] | 72 [64; 80]
41 | Journal of Social Psychology | 78 [70; 87] | 78 [70; 86] | 69 [60; 77]
42 | Law and Human Behavior | 81 [71; 90] | 78 [69; 87] | 70 [61; 78]
43 | Psychological Medicine | 76 [68; 85] | 78 [66; 89] | 74 [63; 86]
44 | Political Psychology | 73 [59; 85] | 78 [65; 92] | 59 [46; 73]
45 | Acta Psychologica | 81 [75; 88] | 77 [74; 81] | 73 [70; 76]
46 | Experimental Psychology | 73 [62; 83] | 77 [73; 82] | 73 [68; 77]
47 | Archives of Sexual Behavior | 77 [69; 83] | 77 [73; 81] | 78 [74; 82]
48 | British Journal of Psychology | 73 [65; 81] | 77 [72; 82] | 74 [68; 79]
49 | Journal of Cognitive Psychology | 77 [69; 84] | 77 [72; 82] | 74 [69; 78]
50 | Journal of Experimental Psychology – Applied | 82 [75; 88] | 77 [72; 82] | 70 [65; 76]
51 | Asian Journal of Social Psychology | 79 [66; 89] | 77 [70; 84] | 70 [63; 77]
52 | Journal of Youth and Adolescence | 80 [71; 89] | 77 [70; 84] | 72 [66; 79]
53 | Memory | 77 [71; 84] | 77 [70; 83] | 71 [65; 77]
54 | European Journal of Social Psychology | 82 [75; 89] | 77 [69; 84] | 61 [53; 69]
55 | Social Psychology | 81 [73; 90] | 77 [67; 86] | 73 [63; 82]
56 | Perception | 82 [74; 88] | 76 [72; 81] | 78 [74; 83]
57 | Journal of Anxiety Disorders | 80 [71; 89] | 76 [72; 80] | 71 [67; 75]
58 | Personal Relationships | 65 [54; 76] | 76 [68; 84] | 62 [54; 70]
59 | Evolutionary Psychology | 63 [51; 75] | 76 [67; 85] | 77 [68; 86]
60 | Journal of Research in Personality | 63 [46; 77] | 76 [67; 84] | 70 [61; 79]
61 | Cognitive Behaviour Therapy | 88 [73; 99] | 76 [66; 86] | 68 [58; 79]
62 | Emotion | 79 [73; 85] | 75 [72; 79] | 67 [64; 71]
63 | Animal Behavior | 79 [72; 87] | 75 [71; 80] | 68 [64; 73]
64 | Group Processes & Intergroup Relations | 80 [73; 87] | 75 [71; 80] | 60 [56; 65]
65 | JPSP-Personality Processes and Individual Differences | 78 [70; 86] | 75 [70; 79] | 64 [59; 69]
66 | Psychology of Men and Masculinity | 88 [77; 96] | 75 [64; 87] | 78 [67; 89]
67 | Consciousness and Cognition | 74 [67; 80] | 74 [69; 80] | 67 [62; 73]
68 | Personality and Social Psychology Bulletin | 78 [72; 84] | 74 [69; 79] | 57 [52; 62]
69 | Journal of Cognition and Development | 70 [60; 80] | 74 [67; 81] | 65 [59; 72]
70 | Journal of Applied Psychology | 69 [59; 78] | 74 [67; 80] | 73 [66; 79]
71 | European Journal of Personality | 80 [67; 92] | 74 [65; 83] | 70 [61; 79]
72 | Journal of Positive Psychology | 75 [65; 86] | 74 [65; 83] | 66 [57; 75]
73 | Journal of Research on Adolescence | 83 [74; 92] | 74 [62; 87] | 67 [55; 79]
74 | Psychopharmacology | 75 [69; 80] | 73 [71; 75] | 67 [65; 69]
75 | Frontiers in Psychology | 75 [70; 79] | 73 [70; 76] | 72 [69; 75]
76 | Cognitive Therapy and Research | 73 [66; 81] | 73 [68; 79] | 67 [62; 73]
77 | Behaviour Research and Therapy | 70 [63; 77] | 73 [67; 79] | 70 [64; 76]
78 | Journal of Educational Psychology | 82 [73; 89] | 73 [67; 79] | 76 [70; 82]
79 | British Journal of Social Psychology | 74 [65; 83] | 73 [66; 81] | 61 [54; 69]
80 | Organizational Behavior and Human Decision Processes | 70 [65; 77] | 72 [69; 75] | 67 [63; 70]
81 | Cognition and Emotion | 75 [68; 81] | 72 [68; 76] | 72 [68; 76]
82 | Journal of Affective Disorders | 75 [69; 83] | 72 [68; 76] | 74 [71; 78]
83 | Behavioural Brain Research | 76 [71; 80] | 72 [67; 76] | 70 [66; 74]
84 | Child Development | 81 [75; 88] | 72 [66; 78] | 68 [62; 74]
85 | Journal of Abnormal Psychology | 71 [60; 82] | 72 [66; 77] | 65 [60; 71]
86 | Journal of Vocational Behavior | 70 [59; 82] | 72 [65; 79] | 84 [77; 91]
87 | Journal of Experimental Child Psychology | 72 [66; 78] | 71 [69; 74] | 72 [69; 75]
88 | Journal of Consulting and Clinical Psychology | 81 [73; 88] | 71 [64; 78] | 62 [55; 69]
89 | Psychology of Music | 78 [67; 86] | 71 [64; 78] | 79 [72; 86]
90 | Behavior Therapy | 78 [69; 86] | 71 [63; 78] | 70 [63; 78]
91 | Journal of Occupational and Organizational Psychology | 66 [51; 79] | 71 [62; 80] | 87 [79; 96]
92 | Journal of Happiness Studies | 75 [65; 83] | 71 [61; 81] | 79 [70; 89]
93 | Journal of Occupational Health Psychology | 77 [65; 90] | 71 [58; 83] | 65 [52; 77]
94 | Journal of Individual Differences | 77 [62; 92] | 71 [51; 90] | 74 [55; 94]
95 | Frontiers in Behavioral Neuroscience | 70 [63; 76] | 70 [66; 75] | 66 [62; 71]
96 | Journal of Applied Social Psychology | 76 [67; 84] | 70 [63; 76] | 70 [64; 77]
97 | British Journal of Developmental Psychology | 72 [62; 81] | 70 [62; 79] | 76 [67; 85]
98 | Journal of Social and Personal Relationships | 73 [63; 81] | 70 [60; 79] | 69 [60; 79]
99 | Behavioral Neuroscience | 65 [57; 73] | 69 [64; 75] | 69 [63; 75]
100 | Psychology and Marketing | 71 [64; 77] | 69 [64; 74] | 67 [63; 72]
101 | Journal of Family Psychology | 71 [59; 81] | 69 [63; 75] | 62 [56; 68]
102 | Journal of Personality | 71 [57; 85] | 69 [62; 77] | 64 [57; 72]
103 | Journal of Consumer Behaviour | 70 [60; 81] | 69 [59; 79] | 73 [63; 83]
104 | Motivation and Emotion | 78 [70; 86] | 69 [59; 78] | 66 [57; 76]
105 | Developmental Science | 67 [60; 74] | 68 [65; 71] | 65 [63; 68]
106 | International Journal of Psychophysiology | 67 [61; 73] | 68 [64; 73] | 64 [60; 69]
107 | Self and Identity | 80 [72; 87] | 68 [60; 76] | 70 [62; 78]
108 | Journal of Counseling Psychology | 57 [41; 71] | 68 [55; 81] | 79 [66; 92]
109 | Health Psychology | 63 [50; 73] | 67 [62; 72] | 67 [61; 72]
110 | Hormones and Behavior | 67 [58; 73] | 66 [63; 70] | 66 [62; 70]
111 | Frontiers in Human Neuroscience | 68 [62; 75] | 66 [62; 70] | 76 [72; 80]
112 | Annals of Behavioral Medicine | 63 [53; 75] | 66 [60; 71] | 71 [65; 76]
113 | Journal of Child Psychology and Psychiatry and Allied Disciplines | 58 [45; 69] | 66 [55; 76] | 63 [53; 73]
114 | Infancy | 77 [69; 85] | 65 [56; 73] | 58 [50; 67]
115 | Biological Psychology | 64 [58; 70] | 64 [61; 67] | 66 [63; 69]
116 | Social Development | 63 [54; 73] | 64 [56; 72] | 74 [66; 82]
117 | Developmental Psychobiology | 62 [53; 70] | 63 [58; 68] | 67 [62; 72]
118 | Journal of Consumer Research | 59 [53; 67] | 63 [55; 71] | 58 [50; 66]
119 | Psychoneuroendocrinology | 63 [53; 72] | 62 [58; 66] | 61 [57; 65]
120 | Journal of Consumer Psychology | 64 [55; 73] | 62 [57; 67] | 60 [55; 65]

If Consumer Psychology Wants to be a Science It Has to Behave Like a Science

Consumer psychology is an applied branch of social psychology that uses insights from social psychology to understand consumers’ behaviors. Although there is cross-fertilization and authors may publish in both basic and applied journals, it is its own field within psychology with its own journals. As a result, it has escaped the close attention that has been paid to the replicability of studies published in mainstream social psychology journals (see Schimmack, 2020, for a review). However, given the similarity in theories and research practices, it is fair to ask why consumer research should be more replicable and credible than basic social psychology. This question was indirectly addressed in a dialogue about the merits of pre-registration that was published in the Journal of Consumer Psychology (Krishna, 2021).

Open science proponents advocate pre-registration to increase the credibility of published results. The main concern is that researchers can use questionable research practices (QRPs) to produce significant results (John et al., 2012). Preregistration of analysis plans would reduce the chances of using QRPs and increase the chances of a non-significant result. This would make the reporting of significant results more valuable because significance would be produced by the data and not by the creativity of the data analyst.

In my opinion, the focus on pre-registration in the dialogue is misguided. As Pham and Oh (2021) point out, pre-registration would not be necessary if there were no problem that needs to be fixed. Thus, a proper assessment of the replicability and credibility of consumer research should inform discussions about preregistration.

The problem is that the past decade has seen more articles talking about replications than actual replication studies, especially outside of social psychology. Thus, most of the discussion about actual and ideal research practices occurs without facts about the status quo. How often do consumer psychologists use questionable research practices? How many published results are likely to replicate? What is the typical statistical power of studies in consumer psychology? What is the false positive risk?

Rather than writing another meta-psychological article that is based on paranoid or wishful thinking, I would like to add to the discussion by providing some facts about the health of consumer psychology.

Do Consumer Psychologists Use Questionable Research Practices?

John et al. (2012) conducted a survey study to examine the use of questionable research practices. They found that respondents admitted to using these practices and did not consider them to be wrong. In 2021, however, nobody is defending the use of questionable practices that can inflate the risk of false positive results and hide replication failures. Consumer psychologists could have conducted an internal survey to find out how prevalent these practices are in their field. However, Pham and Oh (2021) do not present any evidence about the use of QRPs by consumer psychologists. Instead, they cite a survey among German social psychologists to suggest that QRPs may not be a big problem in consumer psychology. Below, I show that QRPs are a big problem in consumer psychology and that consumer psychologists have done nothing over the past decade to curb the use of these practices.

Are Studies in Consumer Psychology Adequately Powered?

Concerns about low statistical power go back to the 1960s (Cohen, 1962; Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989; Smaldino & McElreath, 2016). Tversky and Kahneman (1971) refused “to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110). Yet, results from the reproducibility project suggest that social psychologists conduct studies with less than 50% power all the time (Open Science Collaboration, 2015). It is not clear why we should expect higher power from consumer research. More concerning is that Pham and Oh (2021) do not even mention low power as a potential problem for consumer psychology. One advantage of pre-registration is that researchers are forced to think ahead of time about the sample size that is required to have a good chance of showing the desired outcome, assuming the theory is right. More than 20 years ago, the APA taskforce on statistical inference recommended a priori power analysis, but researchers continued to conduct underpowered studies. Pre-registration, however, would not be necessary if consumer psychologists already conducted studies with adequate power. Here I show that power in consumer psychology is unacceptably low and has not increased over the past decade.
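To illustrate what such an a priori power analysis involves, here is a minimal sketch in Python using statsmodels. The assumed effect size (Cohen’s d = .50) and the 80% power target are illustrative conventions, not values taken from any particular consumer study.

```python
# A priori power analysis for a two-group comparison (independent-samples t-test).
# Assumed inputs: medium effect size (d = .50), alpha = .05, target power = .80.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.50, alpha=0.05, power=0.80,
                                    alternative='two-sided')
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64 per group
```

Running this kind of calculation before data collection is exactly the step that a pre-registration form forces researchers to document.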

False Positive Risk

Pham and Oh note that Simmons, Nelson, and Simonsohn’s (2011) influential article relied exclusively on simulations and speculations and suggest that the fear of massive p-hacking may be unfounded: “Whereas Simmons et al. (2011) highly influential computer simulations point to massive distortions of test statistics when QRPs are used, recent empirical estimates of the actual impact of self-serving analyses suggest more modest degrees of distortion of reported test statistics in recent consumer studies (see Krefeld-Schwalb & Scheibehenne, 2020).” Here I present empirical analyses to estimate the false discovery risk in consumer psychology.

Data

The data are part of a larger project that examines research practices in psychology over the past decade. For this purpose, my research team and I downloaded all articles from 2010 to 2020 published in 120 psychology journals that cover a broad range of disciplines. Four journals represent research in consumer psychology, namely the Journal of Consumer Behavior, the Journal of Consumer Psychology, the Journal of Consumer Research, and Psychology and Marketing. The articles were converted into text files and the text files were searched for test statistics. All F, t, and z-tests were used, but most test statistics were F and t tests. There were 2,304 tests for the Journal of Consumer Behavior, 8,940 for the Journal of Consumer Psychology, 10,521 for the Journal of Consumer Research, and 5,913 for Psychology and Marketing.
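The exact text-mining pipeline is not described in detail here, but the basic idea can be sketched in a few lines of Python. The regular expressions below are simplified illustrations, not the project’s actual extraction code, and would need refinement to handle the many reporting formats found in real articles.

```python
# Sketch: pull F- and t-tests out of article text and convert them to p-values.
# The patterns are deliberately simple; real articles use more varied notation.
import re
from scipy import stats

text = "A main effect emerged, F(1, 120) = 6.54, p < .05, and t(48) = 2.10, p = .04."

f_tests = re.findall(r"F\((\d+),\s*(\d+)\)\s*=\s*([\d.]+)", text)
t_tests = re.findall(r"t\((\d+)\)\s*=\s*([\d.]+)", text)

p_values = []
for df1, df2, f in f_tests:
    p_values.append(stats.f.sf(float(f), int(df1), int(df2)))   # right-tailed F-test
for df, t in t_tests:
    p_values.append(2 * stats.t.sf(abs(float(t)), int(df)))     # two-sided t-test

print([round(p, 4) for p in p_values])
```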

Results

I first conducted z-curve analyses for each journal and year separately. The 40 journal-by-year results were analyzed with year as a continuous and journal as a categorical predictor variable. No time trends were significant, but the main effect of journal on the expected replication rate was significant, F(3,36) = 9.63, p < .001. Inspection of the means showed higher values for the Journal of Consumer Psychology and Psychology and Marketing than for the other two journals. No other effects were significant. Therefore, I combined the data of the Journal of Consumer Psychology and Psychology and Marketing, and the data of the Journal of Consumer Behavior and the Journal of Consumer Research.
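The analysis described here is a standard linear model; a minimal sketch in Python (using statsmodels) could look as follows. The data frame, file name, and variable names (err, year, journal) are assumptions for illustration, and the actual analysis may have been run in different software.

```python
# Sketch: regress journal-by-year z-curve estimates (e.g., the expected replication
# rate, ERR) on year (continuous) and journal (categorical).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical file with one row per journal-year combination and columns:
# 'journal' (categorical), 'year' (numeric), 'err' (z-curve estimate in percent)
df = pd.read_csv("zcurve_by_journal_year.csv")

model = smf.ols("err ~ year + C(journal)", data=df).fit()
print(anova_lm(model, typ=2))  # F-tests for the year trend and the journal effect
```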

Figure 1 shows the z-curve analysis for the first set of journals. The observed discovery rate (ODR) is simply the percentage of results that are significant. Out of the 14,853 tests, 10,636 were significant, which yields an ODR of 72%. To examine the influence of questionable research practices, the ODR can be compared to the estimated discovery rate (EDR). The EDR is an estimate that is based on a finite mixture model that is fitted to the distribution of the significant test statistics. Figure 1 shows that the fitted grey curve closely matches the observed distribution of test statistics, which are all converted into z-scores. Figure 1 also shows the projected distribution that is expected for non-significant results. Contrary to the predicted distribution, observed non-significant results drop off sharply at the level of significance (z = 1.96). This pattern provides visual evidence that non-significant results do not follow a sampling distribution. The EDR is the area under the curve for the significant values relative to the total distribution. The EDR is only 34%. The 95%CI of the EDR can be used to test whether the discrepancy between the ODR and the EDR is statistically significant. The ODR of 72% is well outside the 95% confidence interval of the EDR, which ranges from 17% to 34%. Thus, there is strong evidence that consumer researchers use QRPs and publish too many significant results.

The EDR can also be used to assess the risk of publishing false positive results; that is, significant results without a true population effect. Using a formula from Soric (1989), the EDR can be converted into an estimate of the maximum percentage of false positive results. As the EDR decreases, the false discovery risk (FDR) increases. With an EDR of 34%, the FDR is 10%, with a 95% confidence interval ranging from 7% to 26%. Thus, the present results do not suggest that most results in consumer psychology journals are false positives, as some meta-scientists have suggested (Ioannidis, 2005; Simmons et al., 2011).
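For readers who want to check this, Soric’s upper bound on the false discovery risk can be computed directly from the EDR and the significance criterion; the short sketch below reproduces the 10% figure reported above.

```python
# Soric's (1989) upper bound on the false discovery risk (FDR),
# given an estimated discovery rate (EDR) and significance criterion alpha.
def soric_fdr(edr, alpha=0.05):
    return (1 / edr - 1) * (alpha / (1 - alpha))

print(f"{soric_fdr(0.34):.0%}")  # EDR of 34% -> maximum FDR of about 10%
```

Plugging in the EDR of 26% reported for the second set of journals below yields the 15% estimate mentioned there.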

It is more difficult to assess the replicability of results published in these two journals. On the one hand, z-curve provides an estimate of the expected replication rate (ERR); that is, the probability that a significant result would be significant again in an exact replication study (Brunner & Schimmack, 2020). The ERR is higher than the EDR because studies that produced a significant result have higher power than studies that did not produce a significant result. The ERR of 63% suggests that more than 50% of significant results can be successfully replicated. However, a comparison of the ERR with the success rate in actual replication studies showed that the ERR overestimates actual replication rates (Brunner & Schimmack, 2020). There are a number of reasons for this discrepancy. One reason is that replication studies in psychology are never exact replications and that regression to the mean lowers the chances of reproducing the same effect size in a replication study. In social psychology, the EDR is actually a better predictor of the actual success rate. Thus, the present results suggest that actual replication studies in consumer psychology are likely to produce as many replication failures as studies in social psychology have (Schimmack, 2020).

Figure 2 shows the results for the Journal of Consumer Behavior and the Journal of Consumer Research.

The results are even worse. The ODR of 73% is well above the EDR of 26% and falls outside the 95%CI of the EDR. This EDR implies a false discovery risk of 15%.

Conclusion

The present results show that consumer psychology is plagued by the same problems that have produced replication failures in social psychology. Given the similarities between consumer psychology and social psychology, it is not surprising that the two disciplines are alike. Researchers conduct underpowered studies and use QRPs to report inflated success rates. These illusory results cannot be replicated, and it is unclear which statistically significant results reveal effects that have practical significance and which ones are mere false positives. To make matters worse, while social psychologists have responded to awareness of these problems by increasing the power of their studies and by implementing changes in their research practices, z-curve analyses of consumer psychology show no improvement in research practices over the past decade. In light of this disappointing trend, it is disconcerting to read an article that suggests improvements in consumer psychology are not needed and that everything is well (Pham & Oh, 2021). I have demonstrated with hard data and objective analyses that this assessment is false. It is time for consumer psychologists to face reality and to follow in the footsteps of social psychologists to increase the credibility of their science. While preregistration may be optional, increasing power is not.

Guest Post by Peter Holtz: From Experimenter Bias Effects To the Open Science Movement

This post was first shared in the Facebook Psychological Methods Discussion Group (Group, Post). I thought it was interesting and deserved a wider audience.

Peter Holtz

I know that this is too long for this group, but I don’t have a blog …

A historical anecdote:

In 1963, Rosenthal and Fode published a famous paper on the Experimenter Bias Effect (EBE): There were of course several different experiments and conditions etc., but for example, research assistants were given a set of 20 photos of people that were to be rated by participants on a scale from -10 ([will experience …] “extreme failure”) to + 10 (…“extreme success”).

The research assistants (e.g., participants in a class on experimental psychology) were told to replicate a “well-established” psychological finding just like “students in physics labs are expected to do” (p. 494). On average, the sets of photos had been rated in a large pre-study as neutral (M=0), but some research assistants were told that the expected mean of their photos was -5, whereas others were told that it was +5. When the research assistants, who were not allowed to communicate with each other during the experiments, handed in the results of their studies, their findings were biased in the direction of the effect that they had expected. Funnily enough, similar biases could be found for experiments with rats in Skinner boxes as well (Rosenthal & Fode, 1963b).

The findings on the EBE were met with skepticism from other psychologists since they cast doubt on experimental psychology’s self-concept as a true and unbiased natural science. And what have researchers done since the days of Socrates when they doubt the findings of a colleague? Sure, they attempt to replicate them. Whereas Rosenthal and colleagues (by and large) produced several successful “conceptual replications” in slightly different contexts (for a summary see e.g. Rosenthal, 1966), others (most notably T. X. Barber) could not replicate Rosenthal and Fode’s original study (e.g., Barber et al., 1969; Barber & Silver, 1968, but also Jacob, 1968; Wessler & Strauss, 1968).

Rosenthal, a well-versed statistician, responded (e.g., Rosenthal, 1969) that the difference between significant and non-significant may not itself be significant and used several techniques that about ten years later came to be known as “meta-analysis” to argue that, although Barber’s and others’ replications, which of course used other groups of participants and materials etc., most often did not yield significant results, a summary of the results suggests that there may still be an EBE (1968; albeit probably smaller than in Rosenthal and Fode’s initial studies – let me think… how can we explain that…).

Of course, Barber and friends responded to Rosenthal’s responses (e.g., Barber, 1969, titled “invalid arguments, post-mortem analyses, and the experimenter bias effect”) and vice versa, and a serious discussion of psychology’s methodology emerged. Other notables weighed in as well, and statisticians such as Rozeboom (1960) and Bakan (1966), who had by then already done their best to explain to their colleagues the problems of the p-ritual that psychologists use(d) as a verification procedure, were frequently quoted. (On a side note: To me, Bakan’s 1966 paper is better than much of the recent work on the problems with the p-ritual; in particular the paragraph on the problematic assumption of an “automacity of inference” on p. 430 is still worth reading).

Lykken (1968) and Meehl (1967) soon joined the melee and attacked the p-ritual also from an epistemological perspective. In 1969, Levy wrote an interesting piece about the value of replications in which he argued that replicating the EBE studies does not make much sense as long as there are no attempts to embed the EBE into a wider explanatory theory that allows for deducing other falsifiable hypotheses as well. Levy knew very well already by 1969 that the question whether some effect “exists” or “does not exist” is relevant only in very rare cases (namely, when there are strong reasons to assume that an effect does not exist – as is the case, for example, with para-psychological phenomena).

Eventually Rosenthal himself (e.g., 1968a) came to think critically of the “reassuring nature of the null hypothesis decision procedure”. What happened then? At some point Rosenthal moved away from experimenter expectancy effects in the lab to Pygmalion effects in the classroom (1968b) – an idea that is much less likely to provoke criticism and replication attempts: Who doesn’t believe that teachers’ stereotypes influence the way they treat children and consequently the children’s chances to succeed in school? The controversy fizzled out and if you take up a social psychology textbook, you may find the comforting story in it that this crisis was finally “overcome” (Stroebe, Hewstone, & Jonas, 2013, p. 18) by enlarging psychology’s methodological arsenal, for example, with meta-analytic practices and by becoming a stronger and better science with a more rigid methodology etc. Hooray!

So psychology was finally great again from the 1970s on … was it? What can we learn from this episode?

– It is not the case that psychologists didn’t know the replication game, but they only played it whenever results went against their beliefs – and that was rarely the case (exceptions are, apart from Rosenthal’s studies, of course Bem’s “feeling the future” experiments).

– Science is self-correcting – but only whenever there are controversies (and not if subcommunities just happily produce evidence in favor of their pet theories).

– Everybody who wanted to know it could know by the 1960s that something was wrong with the p-ritual – but no one cared. This was the game that needed to be played to produce evidence in favor of theories, to get published, and to make a career; consequently, people learned to play the verification game more and more effectively. (Bakan writes on p. 423: “What will be said in this paper is hardly original. It is, in a certain sense, what “everybody knows.” To say it “out loud” is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear.” – in 1966!)

– Just making it more difficult to verify a theory will not solve the problem imo; ambitious psychologists will again find ways to play the game – and to win.

– I see two risks with the changes that have been proposed by the “open science community” (in particular preregistration): First, I am afraid that, since the verification game still dominates in psychology, researchers will simply shift towards “proving” more boring hypotheses; second, there is the risk that psychological theories will be shielded even more from criticism since only criticism based on “good science” (preregistered experiments with a priori power analysis and open data) will be valid, whereas criticism based on other types of research activities (e.g., simulations, case studies … or just rational thinking for a change) will be dismissed as “unscientific” => no criticism => no controversy => no improvement => no progress.

– And of course, pre-registration and open science etc. allow psychologists to still maintain the misguided, unfortunate, and highly destructive myth of the “automacity of inferences”; no inductive mechanism whatsoever can ensure “true discovery”.

– I think what is needed more is a discussion about the relationship between data and theory and about epistemological questions such as what a “growth of knowledge” in science could look like and how it can be facilitated (I call this a “falsificationist turn”).

Irrespective of what is going to happen, authors of textbooks will find ways to write up the history of psychology as a flawless cumulative success story …

A Z-Curve Analysis of a Self-Replication: Shah et al. (2012) Science

Since 2011, psychologists have been wondering which published results are credible and which are not. One way to answer this question would be for researchers to self-replicate their most important findings. However, most psychologists have avoided conducting or publishing self-replications (Schimmack, 2020).

It is therefore always interesting when a self-replication is published. I just came across Shah, Mullainathan, and Shafir (2019). The authors conducted high-powered (much larger sample sizes) replications of five studies that were published in Shah, Mullainathan, and Shafir’s (2012) Science article.

The article reported five studies with 1, 6, 2, 3, and 1 focal hypothesis tests, respectively. One additional test was significant, but the authors focussed on the small effect size and considered it not theoretically important. The replication studies successfully replicated 9 of the 13 significant results, a success rate of 69%. This is higher than the success rate in the famous reproducibility project of 100 studies in social and cognitive psychology, which was 37% (OSC, 2015).

One interesting question is whether this success rate was predictable based on the original findings. An even more interesting question is whether original results provide clues about the replicability of specific effects. For example, why were the results of Studies 1 and 5 harder to replicate than those of the other studies?

Z-curve relies on the strength of the evidence against the null-hypothesis in the original studies to predict replication outcomes (Brunner & Schimmack, 2020; Bartos & Schimmack, 2020). It also takes into account that original results may be selected for significance. For example, the original article reported 14 out of 14 significant results. It is unlikely that all statistical tests of critical hypotheses produce significant results (Schimmack, 2012). Thus, some questionable practices were probably used although the authors do not mention this in their self-replication article.

I converted the 13 test statistics into exact p-values and the exact p-values into z-scores. Figure 1 shows the z-curve plot and the results of the z-curve analysis. The first finding is that the observed discovery rate of 100% is much higher than the expected discovery rate of 15%. Given the small sample of tests, the 95%CI around the estimated discovery rate is wide, but it does not include 100%. This suggests that some questionable practices were used to produce a pretty picture of results, in line with widespread practices in psychology in 2012.
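The conversion from a reported test statistic to an exact p-value and then to a z-score is straightforward; here is a minimal sketch in Python (the t-value below is a made-up example, not one of the 13 tests from the article).

```python
# Convert a reported test statistic to a two-sided p-value and then to the
# z-score used in z-curve plots (z = standard-normal quantile of 1 - p/2).
from scipy import stats

t_value, df = 2.50, 40                   # hypothetical test result
p = 2 * stats.t.sf(abs(t_value), df)     # exact two-sided p-value
z = stats.norm.isf(p / 2)                # equivalent z-score
print(f"p = {p:.4f}, z = {z:.2f}")
```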

The next finding is that, despite the low estimated discovery rate, the estimated replication rate of 66% is in line with the observed replication rate. The reason for the difference between the estimated replication rate and the estimated discovery rate is that the estimated discovery rate includes the large set of non-significant results that the model predicts. Selection for significance selects studies with higher power that have a higher chance of producing a significant result again (Brunner & Schimmack, 2020).

It is unlikely that the authors conducted many additional studies to get only significant results. It is more likely that they used a number of other QRPs. Whatever method they used, QRPs make just-significant results questionable. One solution to this problem is to alter the significance criterion post hoc. This can be done gradually. For example, a first adjustment might lower the significance criterion to alpha = .01.

Figure 2 shows the adjusted results. The observed discovery rate decreased to 69%. In addition, the estimated discovery rate increased to 48% because the model no longer needs to predict the large number of just-significant results. Thus, the expected and observed discovery rates are much more in line with each other and suggest little need for additional QRPs. The estimated replication rate decreased because it uses the more stringent criterion of alpha = .01; otherwise, it would be even more in line with the observed replication rate.

Thus, a simple explanation for the replication outcomes is that some results were obtained with QRPs that produced just significant results with p-values between .01 and .05. These results did not replicate, but the other results did replicate.

There was also a strong point-biserial correlation between the z-scores and the dichotomous replication outcome. When the original p-values were split into p-values above or below .01, they perfectly predicted the replication outcome; p-values greater than .01 did not replicate, those below .01 did replicate.
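For completeness, this kind of point-biserial correlation can be computed with scipy; the z-scores and replication outcomes below are placeholder values, not the actual data from Shah et al.

```python
# Point-biserial correlation between original-study z-scores (continuous)
# and replication outcome (1 = replicated, 0 = not replicated).
from scipy import stats

z_scores   = [2.0, 2.1, 2.3, 2.2, 2.8, 3.1, 3.5, 4.2, 4.5, 4.8, 5.0, 2.4, 3.0]  # placeholders
replicated = [0,   0,   0,   0,   1,   1,   1,   1,   1,   1,   1,   0,   1]     # placeholders

r, p = stats.pointbiserialr(replicated, z_scores)
print(f"r_pb = {r:.2f}, p = {p:.3f}")
```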

In conclusion, a single p-value from a single analysis provides little information about replicability, although replicability increases as p-values decrease. However, meta-analyses of p-values with models that take QRPs and selection for significance into account are a promising tool to predict replication outcomes and to distinguish between questionable and solid results in the psychological literature.

Meta-analyses that take QRPs into account can also help to avoid replication studies that merely confirm highly robust results. Four of the z-scores in Shah et al.’s (2019) project were above 4, which makes it very likely that these results replicate. Resources are better spent on findings that have high theoretical importance but weak evidence. Z-curve can help to identify these results because it corrects for the influence of QRPs.

Conflict of Interest statement: Z-curve is my baby.