This chapter continues the discussion of experienced well-being (ExWB) measurement, addressing some specific issues that have arisen as the research base has evolved and grown. The following issues are discussed here:
• Whether respondents’ answers to ExWB questions are subject to systematic biases and differences between groups—defined by culture, age, or other traits—that may invite misleading conclusions about respondents’ actual hedonic experiences;
• Susceptibility of ExWB measures to various biases induced by context or by question ordering, and the importance of these effects;
• Sensitivity of self-reported ExWB to changing situations and environments;
• The role of adaptation and response shift in ExWB measurement; and
• Scale and survey mode effects and the design of instruments.
The value that people place on various emotional states shapes their reports of subjective well-being (SWB). A large body of research shows systematic variations in self-reported well-being that appear to be associated with cultural norms about ideal affective states (see Tsai et al., 2006, for a review). Consequently, when making international comparisons or interpreting findings from various subpopulations within a country, care must be taken to consider cultural contexts. Asians and Asian Americans,
for example, appear to place less value on excitement and joy than on states characterized by calmness and serenity. In contrast, European Americans are exceptional in the considerable value they place on high-arousal positive states such as excitement and surprise. Such observations can raise questions, even doubts, about the meaning of comparisons of SWB across countries. The issue is obviously important in measurement, and because SWB is a topic frequently discussed in the media, it is also important when communicating findings to the public.
Arousal appears to be a key dimension that distinguishes subgroups. East Asians, as well as older people, tend to endorse more low-arousal positive emotions than high-arousal positive emotions (Kessler and Staudinger, 2009; Tsai, 2007; Tsai et al., 2006). As noted above, East Asians place less value on surgency than do westerners (Tsai, 2007; Tsai et al., 2006).1 Tsai et al. (2006) argue convincingly that despite great cultural consistency in the subjective and physiological experience of emotions once they are elicited, cultures vary considerably in how people want to feel. Anger and sadness appear to be more acceptable states among Germans than Americans, for example. Similarly, at older ages people report more mixed emotional experiences, even though they report higher overall levels of SWB than younger adults (Ersner-Hershfield et al., 2008). Moreover, mixed emotional experience is associated prospectively with better physical health across adulthood (Hershfield et al., 2013), suggesting that mixed emotions do not detract from SWB in older populations. More research is needed on ethnic and age differences in affect valuation, especially in the United States, where ethnic diversity is increasing. Under mainstream assumptions about immigration policy and trends, Hispanics will account for two-thirds of the growth in U.S. population from 2010 to 2050, and the proportion of older people will increase from 13 percent currently to 20 percent in 2030 (Passel and Cohn, 2008).
Despite such variations in the factors that contribute to it, happiness itself appears to be understood in much the same way across cultures. Thus, SWB measures based on “happy yesterday” questions may be especially useful for international comparisons because they offset the fact that different factors (e.g., arousal or calm) may contribute to subjective happiness at different ages or in different cultures.2
2 Fulmer et al. (2010) showed that people are happier when their personalities match their cultures. That is, the extent to which people’s personalities, for example, traits of the “big five” personality theory, predict their SWB and self-esteem depends on the degree of personality match to the dominant personality dimensions in the culture.
RECOMMENDATION 4.1 (Research): More study is needed about the role of cultural effects on ExWB. In particular, the value placed on high-arousal positive states versus low-arousal positive states and the acceptance of negative states, like anger and sadness, likely varies considerably by age and cultural context, which suggests that subpopulations assess ExWB differently. For example, if a measure relies heavily on high-arousal positive items, older populations will appear less happy; a similar bias may occur in assessing some Asian populations.
The use of anchoring vignettes is a promising approach to identifying and correcting systematic cross-cultural differences in question interpretation. Such approaches have been used in a number of contexts, such as cross-country comparisons of job satisfaction (Kristensen and Johansson, 2008) or life satisfaction (Kapteyn et al., 2010). Van Soest et al. (2011, p. 575), in an assessment of this growing literature, conclude that “vignette based corrections appear quite effective in bringing objective and subjective measures closer together.” Notwithstanding this promising beginning, the approach’s effectiveness will not be fully assessable until further research is conducted in a range of contexts and on a range of outcomes.
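The logic of a nonparametric vignette-based correction can be sketched briefly. In the sketch below, which is illustrative only (the function name and the example ratings are hypothetical), each respondent rates the same set of vignettes describing fixed situations, and the respondent’s self-rating is then re-expressed as its position among his or her own vignette ratings, so that respondents who use the response scale differently can be compared on a common footing.

```python
def recalibrate(self_rating, vignette_ratings):
    """Position a self-rating among the respondent's own vignette ratings.

    vignette_ratings: ratings of shared vignettes, ordered from the
    mildest to the most severe scenario. Returns a category from
    1 to 2k + 1 for k vignettes (a nonparametric scheme used in the
    anchoring-vignette literature).
    """
    category = 1
    for v in vignette_ratings:
        if self_rating < v:
            return category          # falls below this vignette
        if self_rating == v:
            return category + 1      # ties this vignette
        category += 2
    return category                  # above the most severe vignette

# Hypothetical respondents: same raw self-rating, different scale use.
lenient = recalibrate(6, [2, 4, 7])  # rates the vignettes low
strict = recalibrate(6, [5, 7, 9])   # rates the vignettes high
print(lenient, strict)               # prints: 5 3
```

Here two respondents give the same raw self-rating of 6, but the recalibrated categories (5 versus 3) reveal that the first respondent’s 6 sits above all but the most severe vignette, while the second’s sits below the middle one.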
Because of its relevance to both research and policy, the effect of aging on memory for emotional experience merits consideration in the measurement of ExWB. The positivity effect refers to an age-related trend that favors positive over negative stimuli in cognitive processing. Relative to their younger counterparts, older people attend to and remember more positive than negative information.
The positivity effect has been documented across a variety of experimental paradigms and a wide range of stimuli, supporting the robustness of the effect (Reed and Carstensen, 2012). It emerges in studies of working memory (Mikels et al., 2005), short-term memory (Charles et al., 2003), autobiographical memory (Kennedy et al., 2004; Schlagman et al., 2006), and even false memories (Fernandes et al., 2008). It is also evident in decision making. Compared to younger people, older people pay greater attention to positive as compared to negative attributes when, for example, choosing doctors and hospitals (Löckenhoff and Carstensen, 2007, 2008) and making decisions about consumer products (Kim et al., 2008). Compared to younger adults, older adults also remember their choices in a manner that is positively skewed—either via disproportionately recalling positive attributes or via attributing positive attributes to chosen options and negative attributes to rejected options (Löckenhoff and Carstensen, 2007, 2008; Mather and Johnson, 2000; Mather et al., 2005).
The effect appears to reflect a top-down, motivated process in which cognition operates in the service of affect regulation. That is, there are changes in cognitive processing associated with age-related changes in goals that prioritize emotional satisfaction and meaning (Carstensen, 2006). Positivity is most evident in automatic (impulsive) processing and less so in deliberative processing (which entails cognitive work); indeed, experiments that emphasize attention to detail eliminate the effect (Löckenhoff and Carstensen, 2007). Thus, although empirical examination is needed, the deliberative processing inherent in the DRM, which requires respondents to reflect on and reconstruct episodes, would likely reduce or eliminate age differences.
Among the most crucial issues for ExWB measures are their ability to distinguish groups or sectors of the population and their sensitivity to change. An additional and as yet unanswered question relevant to assessments of their applicability to policy is what constitutes a meaningful change in ExWB measures. Assessing significance (“meaningfulness”) is an obvious challenge given that these are subjective variables measured on an ordinal scale, and that what counts as a meaningful difference may not be the same for changes over time as for differences across cohorts in a cross-section (with the latter likely subject to a larger margin of error). While there is no single answer to this question, ExWB measures should not be held to an unachievable standard—for example, one that is higher than the standards set for other dimensions of social and economic measurement.
To influence long-term ExWB substantially at the population level, government policies designed to change the everyday circumstances of individuals would have to affect very large groups of people (as they sometimes do) on a day-to-day basis. Socially traumatic events, like the assassination of President Kennedy or the 2001 terrorist attacks, have had a detectable impact on measures of ExWB at the national level, but the measured effects have typically been short-lived. This highlights an important difference between evaluative well-being and ExWB: the latter primarily reflects what is currently engaging people’s attention, much less so events from the past, even important ones. (After a major event, the immediate environment, like being engaged in family or work activities, may be able to grab a person’s attention and influence his or her ExWB.) Even a large increase in unemployment, such as accompanied the severe recession of 2007-2011, may have only a muted impact on response means of SWB measures when the change
in unemployment takes place over a number of months and directly affects only a small percentage of the population.3 In addition to policy issues couched at the macro level, sensitivity to change is relevant to measures most likely to be useful at more local levels—for example, to assess the impact of local initiatives such as changes to traffic management, crime programs, or local school policies. This relates to the issue of targeted versus general measures. For example, if one wants to know about the impact of a traffic management policy or a health care innovation, then measures that specifically target people’s experience of traffic or health will likely be more sensitive than general well-being assessments.
In thinking about how to calibrate ExWB measures to address sensitivity concerns, it is instructive to think of examples of change in other statistical constructs, such as unemployment or income change. The unemployment rate rarely changes quickly, and a change from 6 to 6.1 percent reflects a change in status of only 1 in 1,000 people in the work force. Over the 50 years of existence of national unemployment statistics, economists have had time to learn how to understand and interpret what appears to be a small change; for example, the change from 6 to 6.1 percent represents a much larger impact among the population defined as actively looking for work. At present, the time series of SWB data is of insufficient length to instill confidence that it does or does not move over time or to know how to interpret a change as a small versus big movement.
Changes in income are often benchmarked against changes relative to a peer or professional group or against some threshold such as the poverty line. If one assumes a curvilinear relationship between income and happiness (that is, the widely supported generalization of decreasing marginal utility for higher levels of income), then a positive change in income will have less effect on ExWB than a negative change of the same percentage. But the exact relationship is debated, and, as Easterlin (2005, pp. 252-253) points out, “the cross-sectional relationship is not necessarily a trustworthy guide to experience over time or to inferences about policy.” Given these uncertainties, the answer as to what constitutes a meaningful change in income could be informed as much by SWB metrics as by income metrics. It is possible to measure the effects of these changes—and their relationship to changes or lack thereof in relevant cohorts—on SWB in a way that cannot be captured by revealed preferences.4
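The asymmetry implied by decreasing marginal utility can be made concrete with a small numerical example. The log specification below is only one common functional form, chosen here for illustration and not the form any particular study in this literature endorses.

```python
import math

income = 50_000   # hypothetical baseline income
pct = 0.10        # a 10 percent change, for illustration

# With log (decreasing-marginal-utility) preferences, equal percentage
# income changes have unequal effects on the utility index.
gain = math.log(income * (1 + pct)) - math.log(income)
loss = math.log(income * (1 - pct)) - math.log(income)
print(round(gain, 4), round(loss, 4))  # prints: 0.0953 -0.1054
```

Under this specification a 10 percent loss lowers the index by more (about 0.105) than a 10 percent gain raises it (about 0.095), consistent with the point that equal percentage changes in income need not have equal well-being effects.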
3 To be clear, life evaluation (evaluative well-being) measures may trend quite differently. The generalization made by Stiglitz et al. (2009) or the World Happiness Report (Helliwell et al., 2012) about the high human costs of unemployment is based on life-evaluation measures, not measures of momentary emotional states.
4 Economists have generally relied on revealed preferences—observations of people’s actual decisions and choices—as opposed to self-reports of intentions or inclinations. The opening
In the case of some ExWB measures, obvious thresholds exist—a move from a positive to a negative self-assessment, for example. And reducing suffering is surely a meaningful goal. In the same way that targeting the needs of the poor is an important but not the only objective of macroeconomic policy, prioritizing the needs of those in misery is one possible objective of policy informed by ExWB metrics. To the extent that policies aim to increase the capabilities and opportunities of the greatest possible number of citizens, some attention to increasing SWB as captured by measures of eudaimonic well-being may also enter into policy priorities. Ultimately, a discussion of what the priorities are would help establish what constitutes a meaningful change, or at least establish parameters for assessing changes in the aggregate and those that affect particular cohorts. We have already concluded that aggregate tracking is not what ExWB measures are likely to be most useful for. But can changes in ExWB be reliably detected at the individual level, or is it more realistic and useful to attempt measures for population groups? Research attention is needed to strengthen the evidence base for addressing these and related questions.
Additionally, the temporal nature of change has implications for the kinds of datasets needed. A different data collection approach—e.g., how often people are surveyed—is implied for measures that typically respond on a very short time frame to daily events versus those that move very slowly. If data are collected every 2 years on a large survey, they are unlikely to be capable of catching short-lived deviations in ExWB (such as those associated with weekends or holidays). Such trends may need to be assessed using higher-frequency data collections with smaller samples, as opposed to massive population surveys conducted annually or every several years. Consumer confidence may be an example of this kind of rapid-change pattern, which may explain why the survey on which the University of Michigan Consumer Sentiment Index is based uses fairly small samples but an ongoing data collection design. On the other hand, large samples may be needed to inform macroeconomic policies about the broad
paragraphs of Kahneman and Krueger (2006) identify some of the respective roles for and strengths and weaknesses of SWB and revealed preference approaches. Fujiwara and Campbell (2011) provided a detailed assessment of valuation techniques—specifically, those based on revealed preference, stated preference, and SWB methods—for estimating costs and benefits of social policies. One of their conclusions was that while, at the moment, SWB methods often yield implausible estimates (as do revealed preferences in many cases), “they may still be useful in challenging decision makers to think more carefully about the full range of impacts of their proposed policies. And they may help decision makers to question the values that they may otherwise place implicitly on these impacts” (Fujiwara and Campbell, 2011, p. 53). Dolan and Metcalfe (2008) compared individuals’ willingness to pay for goods and services related to urban regeneration using revealed preference and SWB methods. They found “that monetary estimates from SWB data are significantly higher than from revealed and stated preference data” and explain possible sources of these differences.
population impact of factors such as unemployment or inflation, which themselves do not often change quickly, or to identify outlier populations that are suffering substantially more than the general population.
If ExWB measures are to be used meaningfully, data users need to know something about how to interpret changes in their value. For example, on a scale from 0 to 10, how does a change from 1 to 2 compare to a change from 7 to 8? At a minimum, it is preferable for the range of values to lie on an interval scale such that each increment on the scale is valued equally. As an alternative, an adjustment factor that accounts for nonlinearities (e.g., end-point aversion) could be applied to the change in rating, but it is unclear what the adjustments should look like. If one can only say that 2 is better than 1 but not by how much, it would only be possible to use ExWB measures as ordinal representations of value; this would seriously limit their applicability. That said, ordinal data could be combined with duration to calculate the percentage of “unhappy” time over the day. This is the approach adopted by Kahneman and Krueger (2006) in calculating the U-index using data from the DRM. But this approach loses potentially important information about just how bad the “unhappy” time is (and just how good the remaining “happy” time is). And it assumes that feelings of relative goodness and badness are independent of, and linearly weighted by, their duration. People care about being happier for longer, but the SWB research field has not made much progress on methods for comparing “how much happier” with “how much longer.”
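The U-index computation described above can be sketched in a few lines. The episode records below are hypothetical; following Kahneman and Krueger (2006), an episode is classified as unpleasant when its most intense reported feeling is negative, and the index is the share of total time spent in such episodes.

```python
# Hypothetical DRM-style episode records:
# (duration in minutes, positive-emotion ratings, negative-emotion ratings)
episodes = [
    (60, {"happy": 5, "relaxed": 4}, {"stressed": 2, "sad": 0}),
    (30, {"happy": 1, "relaxed": 2}, {"stressed": 5, "sad": 3}),
    (90, {"happy": 4, "relaxed": 5}, {"stressed": 1, "sad": 1}),
]

def u_index(episodes):
    """Share of time in episodes whose most intense feeling is negative."""
    unhappy = sum(dur for dur, pos, neg in episodes
                  if max(neg.values()) > max(pos.values()))
    total = sum(dur for dur, _, _ in episodes)
    return unhappy / total

print(round(u_index(episodes), 3))  # 30 of 180 minutes -> prints: 0.167
```

As the text notes, the duration-weighted share discards intensity information: a mildly unpleasant half hour and an intensely unpleasant one contribute identically to the index.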
Hedonic adaptation is the psychological process whereby people adjust to and become accustomed to a positive or negative stimulus brought on by changed circumstances, a single event, or a recurring event. People’s responses to questions about their well-being or quality of life have often reflected this, which potentially poses problems for using SWB measures, particularly for sorting out longitudinal effects when multiple determinants are at work.
Interpreting “response shifts,” a term used to characterize change in reporting over time, is complicated by the possibility that observed differences over time in self-reports of well-being may reflect true change in a respondent’s quality-of-life assessment (e.g., hedonic adaptation); measurement error (e.g., that associated with “scale recalibration” bias); or both. Ubel et al. (2010, p. 466) provided the following hypothetical examples:
1. A person’s happiness is partially restored after paraplegia. Over time, reported mood improves as the person begins shifting his
focus away from what he cannot do and toward new goals (e.g., from jogging to participating in wheelchair basketball). The individual’s physical functioning does not improve or deteriorate but, over time, his responses to well-being questions shift as the percentage of time experiencing positive emotions increases and the percentage of time experiencing negative emotions decreases.
2. A person with chronic pain experiences kidney stones. Prior to the bout with kidney stones, the person rates her chronic pain a 7 out of 10, on average. Then, she experiences kidney stones for which the pain is much more intense. This episode leads her to reinterpret the pain scale, and she shifts her response to now rate the (unchanged) chronic pain at only 5 out of 10.
The reported scores of the case 1 person reflect a true change in ExWB that occurred as a result of hedonic adaptation or a change in values; case 2, in contrast, does not provide a valid assessment of the person’s pain levels over time but is simply a recalibration of the reporting scale.5 For most purposes, researchers are interested in isolating the first category of phenomena, without the potential confounding of the second type of response shift.
Another example of scale recalibration has to do with how questions are interpreted. People may normalize their responses to questions about experienced utility (or other dimensions of SWB) to implicit standards of comparison (Kahneman and Miller, 1986). For example, people who have experienced a decline in functioning may norm their responses relative to their perceived assessment of others with the same disability. People may also reconceptualize SWB questions. For example, after surviving a cancer scare, a person may reprioritize and become more concerned about engaging in meaningful activities as opposed to immediately enjoyable ones (or vice versa). This creates a measurement issue in applications for which a consistent definition of SWB (or one dimension of it, such as ExWB) over time is required; if the measurement objective allows for individual interpretation of the SWB construct of interest, then this reconceptualization may not be an issue. Ubel et al. (2010) argued that shifts in actual well-being (adaptation) and scale recalibration are distinct causes of response shift and need to be disentangled; they proposed doing away with “response shift” terminology because of this ambiguity.
Much of the relevant research on response shift—and components thereof—is in the health care/clinical trial literature and has been conducted to more accurately assess quality of life among the chronically ill or the
5 It may be possible that having experienced the more severe pain allows the person to cope with the chronic pain with a new perspective and less distress. In other words, the measurement of pain may not be valid, but measurement of SWB may truly have shifted (as in case 1).
disabled, as in the following two examples. On the topic of cognitive adaptation, the 1978 paper by Brickman et al. on lottery winners and long-term paraplegics was highly influential in establishing the idea that, after these events, reported life satisfaction of the affected individuals returns to pre-event levels more quickly and completely than would be expected either intuitively or by people predicting what their moods would be under those conditions. For ExWB, one possible explanation of hedonic adaptation is offered by the set-point theory, which posits that people initially react to events, but then return to some baseline that is determined by personality factors (Brickman and Campbell, 1971).
Research subsequent to Brickman et al. (1978), much of which has used longitudinal data, has shown that adaptation is more complex than portrayed by some of the earlier studies and not as universal as once thought. The extent to which adaptation occurs may vary a great deal, depending on the exact nature of the event or circumstance that alters SWB. For example, the impact of marriage on SWB, including affect, often appears to be short-lived (Clark et al., 2008), while the effects associated with unemployment and chronic pain appear to be more long-lasting (Lucas et al., 2004).
Loewenstein and Ubel (2008) measured the moment-to-moment mood of healthy people and dialysis patients over the course of a week; they found only small differences in the levels of positive and negative mood recorded by the two groups. In other words, the dialysis patients presumably experienced a significant amount of emotional adaptation to their illness. Riis et al. (2005) documented patterns of adaptation and under-prediction of adaptation when eliciting momentary measures of ExWB. Spikes of grief associated with loss of a child, say, may not show up in experience sampling methods. A separate question is how to account for the intensity of these kinds of emotions. Analyses by Diener and colleagues (1999) and others corroborate the conclusion that the experience and evaluation dimensions of SWB trend differently after a life-changing event in terms of the extent and pace of adaptation. Analyses of data from the Household, Income and Labour Dynamics in Australia survey, a long-term longitudinal panel study, found that disability cases showed approximately the same pattern of decrease in positive affect and increase in negative affect, with very little adaptation. For the loss of a spouse or child, negative feelings increase sharply after the death but then fully return to baseline, while positive feelings rebound somewhat but do not return to previous levels (Clark et al., 2008).
CONCLUSION 4.1: The evidence with regard to adaptation suggests that it cannot be characterized as a process that occurs uniformly; people adapt differently to different events and life changes, in some part due to norms and expectations. Ideally, question structures should
be designed to allow researchers to decompose changes in response scores into scale recalibration (or other measurement errors) and true quality-of-life change components.
For example, in hypothetical case 2 above from Ubel et al. (2010), the person could have been asked to rate both quality of life and pain, which might allow the separate effects to be teased out.
In terms of its effect on policy relevance, if reported SWB (either ExWB or evaluative well-being) were not closely linked with individuals’ circumstances and opportunities, due to adaptation, the question arises of whether the measures can usefully inform policy. Smith et al. (2006) found that people report a willingness to pay large sums of money or make other major sacrifices to restore lost functions. Loewenstein and Ubel (2008, p. 1797) wrote:
A key problem with experience utility as a welfare criterion for public policy is its failure to sufficiently value negative or positive outcomes that people adapt to emotionally. It is well documented that people exhibit near-normal levels of happiness not long after experiencing adverse outcomes such as paraplegia, colostomy or end-stage kidney disease. Yet, the same people often report a willingness to make great sacrifices to alleviate their condition. A welfare criterion based on experience utility would run the risk of failing to treat such outcomes as welfare-diminishing—e.g., of treating an increase in cases of paraplegia as a welfare-neutral event.
A broad implication of this line of thinking is that policy makers should be aware that people care about aspects of their life that cannot be captured by a single measure, whether it is willingness to pay, experienced utility, or something else. Multiple kinds of evidence need to be considered. Loewenstein and Ubel (2008) suggest that, given limitations of both decision utility (based on ordinal utility concepts) and of experienced utility measures, evaluations of welfare will inevitably have to be informed by a combination of both approaches, patched together in a fashion that will depend on the specific context. The goal of policies ought to be to maximize people’s SWB, but the moment-to-moment or ExWB dimension is only one component of SWB.
Variation in the extent to which adaptation occurs in response to different domains, conditions, or cases may actually convey a great deal of information that is relevant to policy. For example, information about how people respond and adapt to price inflation versus unemployment, or the threat of it (Di Tella et al., 2001), would seem highly relevant to policy. The same may be true for data on how people who become severely disabled from a job-related accident respond or adapt differentially to psychological scarring and emotional harm caused by inability to continue working and to a compensation package for lost income. In the health care context, Dolan and Kahneman (2008, p. 221) concluded that:
in general, it seems entirely appropriate [for ranking policy options] to give greater priority to those states that people do not adapt to over those that they do adapt to. This would seem to be particularly true when allocating resources amongst patients once the budget for health care has been determined i.e., once we have decided the priority afforded to patients in relation to other groups. Given this, we need to consider how well people predict changes—including any adaptation—in their future preferences.
Furthermore, people compensate for (adapt to) poor education, living in poverty, or living in high-crime areas, yet these are certainly important policy areas. Understanding why people tolerate poor norms of health, high levels of crime and corruption, or bad environments seems especially relevant (Graham, 2011; Sen, 1985).6
All human judgment is subject to contextual influences, and the same holds for all self-reports that serve as measures of SWB. Focusing effects, whereby whatever happens to be in mind at the moment of judgment receives disproportionate weight, can have a large impact. When comparing responses across people, it is difficult to know whether context is a biasing factor and, if so, for whom. How problematic a given contextual influence is depends on the objective of the measure. These objectives differ across measures of SWB, which renders various types of contextual influences differentially problematic.
Evaluative well-being involves assessments of extended periods of time, often a respondent’s “life-as-a-whole” or “life-these-days.” Such questions explicitly ask respondents to include all aspects of life (or the respective life-domain)—for example, “Taking all things together….” If this is the goal, any transient influence on judgment represents undue contamination in the form of giving too much weight at the moment to things the respondent would consider irrelevant if asked about them specifically. Typical examples include naturalistic context variables (e.g., the weather at the time of interview, sports news of the day) and research instrument variables (e.g., question order). For a comprehensive review, see Schwarz and Strack (1999).
In contrast, measures of ExWB attempt to assess how respondents feel during a much shorter reference period or episode, on which respondents
6 There are, as discussed in section 4.4, two distinct influences working here, which should not be confused: adaptation to conditions and cognitive states reflecting low expectations.
may report either immediately or retrospectively (see Chapter 3). In the ideal case, ExWB is assessed through concurrent reports of affect in situ (that is, with momentary assessment methods as discussed in section 3.1). Concurrent reports allow for introspective access to one’s momentary feelings and are the ideal option for their assessment (Robinson and Clore, 2002). Under such conditions, temporary influences arising from the context of daily life do not represent undue contamination; those whose moods were lifted by sunny weather or news about a sports event did indeed experience a period of higher ExWB. That such events are reflected in ExWB measures is testimony to their sensitivity, whereas the same influence would constitute a source of context bias for measures of evaluative well-being; people usually assume that good news about a favorite sports team can brighten one’s afternoon for a couple of hours but not improve one’s life-as-a-whole, “taking all things together.” In contrast, when ExWB is not assessed immediately, temporary real-life influences at the time of measurement can bias retrospective reports. Accordingly, different measures of ExWB differ in their susceptibility to bias.
Finally, the influence of research instrument variables always presents undesirable contamination on measures of SWB, whether they pertain to ExWB or evaluative well-being. When the goal is to draw conclusions about a population, any influence that merely affects the sample and was not part of the experience of the population undermines the purpose of the assessment.
Psychological research shows that many feelings are fleeting. An individual can introspect on them while they are occurring (making Ecological Momentary Assessment the gold standard for ExWB assessment) but will need to reconstruct them after they have dissipated (Robinson and Clore, 2002; Schwarz et al., 2009). The extent to which the reconstruction captures the actual experience depends on the temporal distance between the experience and the time of interview and the extent to which respondents “relive” the past experience prior to reporting on it (i.e., the extent to which they reinstantiate the experience in memory). Thus, the potential for bias is likely to increase with the length of the episode and its temporal distance from the interview, and to decrease with the detailed reinstantiation of the episode. The available data are compatible with these assumptions, but more systematic comparisons across measures, based on the same population and time frame, are needed.
Considerations of context effects have played a strong role in the conceptualization and development of ExWB measures, and work continues to explore the effects of context and to determine ways to reduce unwanted effects. Several effects specifically related to context are discussed in the following sections.
Assessments of SWB, both evaluative and experienced, typically depend on the administration of several questions. In the case of ExWB, these are often questions about a series of adjectives, posed to the respondent one after another. One concern is that the order of the questions or the order in which adjectives are presented may introduce random error or, worse, bias in the ExWB measures. In addition to the order of questions within an assessment, there is evidence (discussed next) that the content of questions that precede an ExWB assessment may influence the answers. The nature, directionality, and magnitude of these effects are important considerations in the design of SWB research protocols.
There is a large literature on question context and order effects.7 Schimmack and Oishi (2005), in a review of 16 studies, found that only 3 of them exhibited significant item-order effects. The authors concluded that order effects are often unimportant in actual survey settings because (as summarized by Diener et al., 2013, pp. 13-14), “chronically accessible information is not raised in importance by priming because it is already highly accessible, and other information is often ignored because it is seen as not relevant.” Other investigations, however, raise serious concerns. A split-sample randomized trial conducted by the UK Office for National Statistics (ONS), using experimental national data, reported an effect of question order on multiple-item positive and negative emotion questions (Office for National Statistics, 2011). Asking negative emotion questions first produced lower scores on several positive emotion items: “relaxed,” “calm,” “excited,” and “energized.” When positive emotion questions were asked first, the mean ratings for negative emotion questions were generally higher—except in the case of “pain”—and the increase was statistically significant for the adjectives “worried” and “bored” (OECD, 2013, p. 87). Similarly, when the order of positive and negative adjectives was varied, Krueger et al. (2009) observed higher ratings of positive emotions in a positive-to-negative order and lower ratings of negative emotions in a negative-to-positive order.
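The logic of the ONS split-sample design can be made concrete with a minimal sketch. The 0-10 ratings of “worried” below are made up for illustration (they are not the ONS data), and the Welch t statistic is hand-rolled from its textbook definition:

```python
# Minimal sketch of a split-sample order-effect analysis. The ratings are
# illustrative values, not survey data.
from statistics import mean, stdev
from math import sqrt

# 0-10 ratings of "worried," by randomized question order (hypothetical)
negative_first = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]
positive_first = [4, 5, 3, 5, 4, 6, 4, 3, 5, 4]

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    na, nb = len(a), len(b)
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(b) - mean(a)) / sqrt(va / na + vb / nb)

effect = mean(positive_first) - mean(negative_first)  # order effect on the mean
t = welch_t(negative_first, positive_first)
print(f"mean shift = {effect:.2f} points, Welch t = {t:.2f}")
```

In an actual trial, respondents would be randomized to one ordering arm or the other with item wording held fixed, so any mean shift is attributable to order alone.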
In the life-evaluation context, Deaton’s (2012) analysis of data from the Gallup-Healthways Well-Being Index demonstrated the importance of the content of questions (and responses) that precede an assessment of evaluative well-being. This randomized study showed that certain questions about political topics, which apparently altered respondents’ feelings while answering the questions, had a substantial impact on evaluative well-being
7 OECD Guidelines (2013) includes a more thorough discussion of this literature than is provided here. The OECD report also contains a number of thoughtful recommendations and priorities for future work, which this panel endorses, to improve understanding and deal better with question order and context issues.
as rated using the Cantril ladder. Specifically, “prompting them to think about [politics and politicians] has a very large downward effect on their assessment of their own lives” (Deaton, 2012, p. 19). The magnitude of the effect (about 0.6 point, more than half a rung on the Cantril ladder) was comparable to that associated with becoming unemployed. An effect of this magnitude translates into a larger impact on population averages because only a comparatively small percentage of respondents become unemployed, whereas all respondents can be influenced by question order and context. An assessment of ExWB (global-yesterday adjectives) was placed later in the interview, and the political questions had considerably less impact on those responses. However, it is not clear whether it was the “distance” from the political questions or the nature of the ExWB questions that was responsible for the smaller impact. Deaton concluded that these unintended effects linked to context could threaten the internal validity of studies that did not take steps to resolve them.
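The unemployment comparison rests on simple arithmetic worth making explicit. In this sketch, the 0.6-point effect size comes from Deaton (2012), while the 2 percent share of respondents becoming unemployed is purely an illustrative assumption:

```python
# Back-of-the-envelope arithmetic for why a context effect can matter more for
# population averages than a comparably sized individual-level shock.
effect_per_person = 0.6        # Cantril-ladder points, per affected respondent

context_share = 1.00           # every respondent sees the question context
unemployment_share = 0.02      # assumed share newly unemployed (hypothetical)

context_impact = effect_per_person * context_share       # shift in the mean
unemployment_impact = effect_per_person * unemployment_share

print(f"context shift to population mean:      {context_impact:.3f} points")
print(f"unemployment shift to population mean: {unemployment_impact:.3f} points")
# The same per-person effect moves the population average 50 times more when
# it applies to everyone rather than to 2 percent of respondents.
```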
RECOMMENDATION 4.2: As part of a general research program to study contextual influences on ExWB measures, survey designers should experiment with randomization of question ordering to create opportunities to study (and eventually minimize) the associated effects. Further work is likewise needed on the effectiveness of buffer and transition questions that precede and follow SWB question modules.
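One way a survey instrument might implement such randomization is to derive the module order deterministically from the respondent identifier, so the assignment is reproducible and can be recorded alongside the responses for later analysis. The module names and seeding scheme below are hypothetical:

```python
# Sketch of per-respondent randomization of question-module order.
# Module names and the seed constant are illustrative assumptions.
import random

MODULES = ["evaluative", "positive_affect", "negative_affect"]

def assign_order(respondent_id, seed=2024):
    """Deterministically shuffle module order from the respondent id, so the
    assignment is reproducible and can be stored with the response record."""
    rng = random.Random(seed * 100003 + respondent_id)
    order = MODULES[:]
    rng.shuffle(order)
    return order

# The assigned order should be recorded with each response for later
# estimation of order effects across randomized arms.
print(assign_order(1))
print(assign_order(2))
```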
Deaton’s analysis supports the notion that including a buffer or transition question between the political questions and the life-evaluation questions largely eliminates the item-order effect. This was the case when, after an initial period, Gallup added a transition question of the form, “Now thinking about your personal life, are you satisfied with your personal life today?” That this insertion virtually eliminated the item-order effect suggests that careful survey design can greatly minimize such effects. This finding supports earlier work by Schwarz and Schuman (1997) indicating that buffer questions, even a single one, can be effective at reducing context effects. However, they also found that buffer questions related to the subsequent SWB questions can prime responses in a way that generates additional context effects. More work is needed to study the frequency with which context and question-order effects arise, their severity, the effectiveness of methods to reduce them, and how they may affect measures of evaluative well-being and ExWB differentially.
CONCLUSION 4.2: Though not evaluated by the panel in detail, evaluative well-being and even global-yesterday ExWB questions likely benefit from being placed at the front of surveys or, when this is not possible, by the use of buffer questions. Further work is needed on the
most effective content and phrasing of these questions. In contrast, for reconstructed activity measures such as the DRM or other time-use formats, respondents need to reinstantiate the prior-day emotional context as much as possible. All SWB questions should appear in the same module of a given survey where possible. Based on the current body of research, the ordering should be questions on evaluative well-being first, questions requiring reinstantiation content next, and ExWB or hedonic questions last.8
To summarize the above discussion, much is already known about how to think about these effects. Considerations of context and order are important for deciding how to interpret data as well as how to design surveys. Many questions of design and interpretation can be addressed using fairly straightforward experiments. In many cases, existing research indicates what to do about these biases, for example, how to handle mood effects. Researchers (e.g., Eid and Diener, 2004; Schwarz, 1987) have documented mood changes associated with the weather, question order, or minor events such as finding a dime before answering a question, which in turn influence reported life satisfaction; others have “used structural models to attempt to separate situational variability from random error and basic stability” (Krueger and Schkade, 2008).
Another survey construction issue is the measurement scales used in response formats. At one end, dichotomous scales—for example, yes/no responses—are easy to summarize (as in “x% of this group reported high stress”), so they are useful and understandable. One methodological reason supporting the 0-1 dichotomous option is that it eliminates scale effects (although there is little evidence that they are major). Cultural differences affecting interpretation of terms such as “a lot” are similar to scale effects, and using dichotomous response options may minimize cultural effects; however, there is presently no evidence supporting this contention.9 The advantage of multipoint scales of the kind favored by
8 This conclusion is consistent with the similar OECD (2013, p. 127) conclusion: “Question order effects can be a significant problem, but one that can largely be managed when it is possible to ask subjective well-being questions before other sensitive survey items, allowing some distance between them. Where this is not possible, introductory text and other questions can also serve to buffer the impact of context.”
9 Extreme responses to scales could represent one form of arousal measurement, discussed in section 4.1. A possible drawback to multipoint scales is that there could be differential group-level reporting patterns associated with certain emotions or sensations—that is, a propensity to choose scores closer to the ends of the scale.
ONS, which uses a 0-10 version, is that they contain much more information. For this reason, the panel agrees with ONS (2011) and OECD (2013) conclusions that a multipoint numeric scale is generally preferable to a dichotomous question structure—though, as always, this hinges on the purpose to which the data will be put.
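The information loss from dichotomization can be seen in a small example. With the hypothetical 0-10 stress ratings below, two groups look identical once responses are collapsed to yes/no, while the multipoint data reveal very different distributions:

```python
# Illustrative sketch (made-up data): dichotomization can hide real
# differences that a 0-10 scale preserves.
from statistics import mean, pstdev

group_a = [6, 6, 7, 6, 7, 3, 2, 3, 2, 3]     # 0-10 stress ratings (hypothetical)
group_b = [10, 9, 10, 9, 10, 1, 0, 1, 0, 1]

def share_high(ratings, cutoff=5):
    """Fraction reporting 'high stress' under a yes/no dichotomization."""
    return sum(r > cutoff for r in ratings) / len(ratings)

# Both groups: 50 percent report high stress under the dichotomous summary...
print(share_high(group_a), share_high(group_b))
# ...but the multipoint means and spreads show very different distributions.
print(mean(group_a), mean(group_b))
print(round(pstdev(group_a), 2), round(pstdev(group_b), 2))
```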
Going further, for emotion measures, the panel agrees with the following OECD (2013, p. 126) conclusions about how response scales should be labeled and structured:
there is empirical support for the common practice of using 0-10 point numerical scales, anchored by verbal labels that represent conceptual absolutes (such as completely satisfied/completely dissatisfied). On balance, it seems preferable to label scale interval-points (between the anchors) with numerical, rather than verbal, labels, particularly for longer response scales…. In the case of affect measures, unipolar scales (i.e., those reflecting a continuous scale focused on only one dimension—such as those anchored from never/not at all through to all the time/completely) are desirable, as there are advantages to measuring positive and negative affect separately.
Standardized wording, scaling and ordering, question buffering, etc., are all important implementation considerations for which the field does not yet have a full understanding, and for which more research is therefore warranted.
Survey mode refers to the vehicle used to ask respondents questions—by personal interview, phone, Internet instrument, and so on. Preliminary results, discussed in this section, indicate that survey mode has a significant impact on responses and, perhaps more importantly, on who responds in the first place. More needs to be known about who is in a given study and who does and does not answer specific kinds of questions (for example, are happier people more likely to respond to the survey?). A major advance would be the ability to gain clearer insight into these kinds of selection biases and into how to correct for them.
Dolan and Kavetsos (2012) investigated the differences between interviewer-administered and telephone-administered responses to the UK Annual Population Survey. The authors examined (a) the impact of survey mode on SWB reports and (b) the determinants of SWB by mode, using the April-September 2011 pre-release of the survey data. Their analysis found large differences by survey mode; in fact, mode effects in the data swamped all other effects. This carries implications for descriptive statistics already published in ONS reports, which ONS has acknowledged (2013, p. 30). This kind of result, similar to Deaton’s (2012) findings about question ordering in the Gallup surveys, can seriously undermine a survey enterprise.
The results of Dolan and Kavetsos (2012) are particularly important for cross-region comparisons, because some regions covered by the Annual Population Survey are interviewed via one mode only (the study is based on region W1 respondents only, to avoid self-selection into mode). Their finding was that individuals report higher SWB over the telephone than in face-to-face interviews. Scores for average life satisfaction, happiness, and worthwhileness were about 0.5 points higher in the telephone interviews, and anxiety was about 0.3 points lower. For happiness, the telephone coefficient was three times as large as the (absolute) negative effect associated with being male. That effect is sufficient to offset more than half the effect of widowhood and is more than twice the coefficient of degree-level education; it offsets about a quarter of the effects of unemployment. A large research literature exists on the problem of survey mode effects generally; going forward, it will be crucial to study different survey modalities, including the Internet, for SWB applications specifically.
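The coefficient comparisons in this paragraph imply rough magnitudes that can be back-calculated from the stated ratios. The values below are approximations derived from the text, not the estimates published by Dolan and Kavetsos (2012):

```python
# Rough arithmetic implied by the happiness comparisons reported above.
# telephone is the ~0.5-point mode effect; the other magnitudes are
# back-calculated from the stated ratios, so they are approximations only.
telephone = 0.5

male = -telephone / 3             # mode effect is 3x the |male| effect
widowhood = -telephone / 0.5      # offsets "more than half" of widowhood
degree = telephone / 2            # "more than twice" the degree coefficient
unemployment = -telephone / 0.25  # offsets "about a quarter" of unemployment

for name, beta in [("male", male), ("widowhood", widowhood),
                   ("degree", degree), ("unemployment", unemployment)]:
    print(f"implied |{name}| coefficient ~ {abs(beta):.2f} points")
```

The point of the exercise is that a pure artifact of survey administration is of the same order as, or larger than, several substantive determinants of reported happiness.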
RECOMMENDATION 4.3: Given the potential magnitude of survey-mode and contextual effects (as shown in findings related to work by ONS and elsewhere), research on the magnitude of these effects and methods for mitigating them should be a priority for statistical agencies during the process of experimentation and testing of new SWB modules.10
The OECD Guidelines presents a thorough review of the issues and the evidence in the literature, and it offers sensible guidance on next steps:
Where mixed-mode surveys are unavoidable, it will be important for data comparability to select question and response formats that do not require extensive modifications for presentation in different modalities. Details of the survey mode should be recorded alongside responses, and mode effects across the data should be systematically tested and reported … enabling compilation of a more comprehensive inventory of questions known to be robust to mode effects. (OECD, 2013, pp. 127-128)
10 OECD (2013, p. 127) similarly recommends that “details of the survey mode should be recorded alongside responses, and mode effects across the data should be systematically tested and reported.”
This section has touched on most, though not all, of the major hurdles facing SWB measurement.11 Until the issues discussed here are more fully sorted out, using split-sample trials and other experiments, it is hard to make the case for expanding SWB questions into the major U.S. federal surveys.
11 For example, work is needed to better understand and estimate the role of traditionally unobservable characteristics for those who select into a survey (versus those who opt out); innovative methods are needed to ascertain how “happy” people are who refuse to participate in an SWB survey. Related is the effect that being surveyed itself has on other outcomes, bearing in mind that participation in well-being surveys is itself an intervention.