A number of techniques have been developed and used for obtaining experienced well-being (ExWB) data from subjects. These include momentary assessments that take place throughout the day, such as the Experience Sampling Method (ESM) and Ecological Momentary Assessment (EMA); overall assessments of a day—which may follow an end-of-day or a yesterday structure; and reconstructions of a previous day’s activities followed by well-being assessment for each episode, such as the Day Reconstruction Method (DRM). These approaches vary in depth of information and precision of measurement; they also vary in terms of respondent burden. In this section, we review the major techniques used to measure people’s ExWB. Some attempt to capture emotional states in the moment; others rely on longer reference or recall periods and thus require some reconstruction or reflective assessment by respondents, pushing them along the time frame continuum toward life evaluations.
ESM is a research methodology that asks participants to stop at certain times and make notes of their experience in real time; it measures immediate experience or feelings. EMA refers to a class of methods designed to track emotions associated with experiences as they occur, in everyday life; they thus avoid both reliance on memory and context effects caused by artificial environments (e.g., a laboratory). EMA also provides a methodological framework for capturing other data in the field, such as the use of ambulatory monitors of various physiological states.1 In the most paradigmatic type of EMA, respondents provide subjective assessments of their emotions and experiences in real time, as they go about their daily lives (see Shiffman and Stone, 1998, for a review). The usual method is for the respondent to wear an electronic device for a period of time, such as a week, that prompts the wearer at various times throughout the day to respond to a brief survey.2 Answers are input directly into the device. From EMA data, researchers are able to compute average levels of variables of interest (thus avoiding the problem of relying on participants to aggregate their own experiences) and can also explore peak and diurnal experiences. A primary advantage of this method is that it does not rely on memories constructed after the fact. EMA methods provide direct, subjective assessments of actual experiences, allowing the various biasing factors associated with recall to be bypassed.
On the downside, the intensive nature of EMA studies makes it very difficult to scale up this method to the level of nationally representative surveys. Devices must be provided to (and usually returned by) respondents, who must be trained in their use. Given the considerable respondent burden involved, response rates may be low, especially among some vulnerable or distressed groups, and participant compensation costs are likely to be substantial. Response rates may be especially low among people who have the most difficulty using the devices (e.g., those with certain disabilities), although technological innovation may make this less of an issue over time.3
1 For this report, the panel uses the term “EMA” to refer to the class of methods that includes both EMA and ESM.
2 In its original form, EMA uses randomly selected intervals to avoid bias that could be incurred by a fixed interval schedule. In addition, the traditional EMA approach asks respondents to assess their experiences right now, to avoid reliance on memory. However, this approach means that distinctive but important events may not be captured. This problem is exacerbated by the fact that, while the prompts are random, nonresponses to the prompts (missing data) may not be. Respondents may not wish to respond to the device if it beeps at particularly positive (e.g., watching one’s child’s first steps) or negative (e.g., a fight with one’s spouse) moments, yet these are precisely among the moments that researchers would probably want to capture.
One answer to the problem of “missed” events is to alter the EMA protocol from a “right now” sampling frame to a coverage frame. In a coverage frame, respondents are asked to characterize their experiences “since the last prompt.” Often in this case, a fixed sampling interval is selected, to standardize the length of time respondents are being asked to summarize. The drawback here is apparent; once again, participants are being asked to remember and summarize, albeit over a much shorter time frame (typically just a few hours). But the potential advantage is also apparent: transient but important experiences are more likely to be captured. Coverage sampling frames are probably most useful for smaller samples and when a researcher wishes to capture experiences that are known to fluctuate within the day but may be somewhat infrequent (pain in some patient populations, suicidal thoughts, etc.).
CONCLUSION 3.1: Momentary assessment methods are often regarded as the gold standard for capturing experiential states. However, these methods have not typically been practical for general population surveys because they involve highly intensive methods, are difficult to scale up to the level of nationally representative surveys, and involve considerable respondent burden, which can lead to low response rates. For these reasons, while momentary assessment methods have proven important in research, they have not typically been in the purview of federal statistical agencies.
The panel notes that this conclusion reflects the current (or past) state of monitoring and survey technology, which is of course changing rapidly. Thus, we would append some important qualifying statements to Conclusion 3.1:
• The ways in which government agencies administer surveys are surely going to change, and as monitoring technologies continue to evolve rapidly, new measurement opportunities will arise. Considered in terms of comparative respondent burden, it may become less intrusive to respond to a modern electronic EMA device (or smartphone beep) than to fill out a long-form survey. So, while EMA may not be practical for the American Community Survey or Current Population Survey for the foreseeable future, real-time analyses may become practical for a number of other surveys, particularly in the health realm. One way this works in practice is that, at various (usually irregular) intervals, respondents would be beeped. The National Institutes of Health is interested in this kind of technology for real-time health applications. As technology advances, such modes could become feasible, even for large-scale surveys at reasonable cost.
• Large-scale (more general) surveys could build in the possibility of mapping the data from single-day measures with the data from more detailed studies for a subset of the sample.
• Experiences in real time, because they are especially relevant to health, have been incorporated into health examination surveys, so there is precedent. It is also possible to monitor blood pressure and other physical signals related to affect in real time.
3 Differential response rates to questionnaires by subpopulations are a concern generally, but particularly when making group comparisons. Moreover, it is not just an issue for how to interpret measures once they are collected; these group differences may also need to inform the construction of the survey instruments themselves. Some peripheral insights can be learned from Abraham et al. (2009), who showed that high variability exists for survey estimates of volunteering due to the “greater propensity of those who do volunteer work to respond to surveys.”
Again, the appropriate methodology and data collection instruments will be framed by what questions are being asked and which policies are to be informed.
To date, the most common method of measuring ExWB in large-scale survey research is based on assessments of a single day. Although single-day measures are currently the standard in survey research, there are also alternative methods, as discussed in this section and elsewhere in this report. A potential criticism of the method is that there is considerable day-to-day variation in hedonic states (people have “good” and “bad” days) and, thus, a single-day assessment might be “too variable.” In this section, the adequacy of single-day ExWB measures for testing different hypotheses (e.g., between-group differences) is evaluated. The importance of this topic is that, if single-day ExWB measures are found to be credible for research, the case for including them in large-scale national surveys is strengthened. However, if single-day measures do not approximate ExWB as captured in the more intensive momentary approaches, then, depending on the survey objectives, the case for including them is undermined.
A number of national and international surveys have used single-day assessments to measure ExWB—that is, assessments that target affect for a single day. For example, in the United States, the Health and Retirement Study, the Disability and Use of Time supplement to the Panel Study of Income Dynamics, and the Gallup-Healthways survey employ single-day hedonic assessments, as do the English Longitudinal Survey of Ageing and the surveys on well-being of the UK Office for National Statistics (ONS).
End-of-day subjective well-being (SWB) measurement is a well-established and frequently used research method. The objective of end-of-day measures is to capture a respondent’s assessment of affect for an entire day, which is quite different from the goal of the momentary assessments, although the target objective for methods such as EMA is also often to add up to a full-day measure. ExWB measures are somewhat sensitive to what people are doing at the time of questioning (see Schneider et al., 2011). Compared with momentary assessment, an end-of-day method shifts measurement from a temporal integral of experienced affect to the respondents’ summary impressions of how good their day was. There may be some questions and policies for which a reflective assessment of this sort is the relevant measurement objective.
End-of-day self-reports of the form “Overall, how happy would you say your day was?” can be influenced by the same variables that drive answers to evaluative well-being questions, though presumably to a much lesser degree; and they will reflect both the respondent’s mood at the time of judgment and the most memorable moments of the day. One way to establish credibility for an overall day measure is to see how well it approximates an “integral” over one entire day of momentary assessments. This kind of credibility for single-day ExWB measures would seem to be a prerequisite for including them in large-scale national surveys.
Although research generally indicates that end-of-day measurement methods, typically used in smaller-scale studies, yield credible and consistent data about people’s experiences for the day, important research questions remain. For example, more needs to be known about how momentary or activity-based experiences map into longer-period assessments and about how different time increments are remembered by respondents. The usual issues, from salience and recency to duration neglect, apply as well. Some evidence of how respondents weight the day’s moments comes from the observation that patients’ end-of-day ratings of pain are more influenced by the last EMA measurement of the day than by earlier measurements (Schneider et al., 2011). This finding is consistent with many other studies (e.g., Redelmeier and Kahneman, 1996; Stone et al., 2000) that show higher recalled pain during episodes of higher concurrent pain. A day that ends well will almost certainly be reported as a better day than a day that ends poorly, even when the averages are identical.
One limitation of end-of-day measures (and a reason that they have not been used more by statistical agencies) is that large population surveys often depend on telephone interviews conducted throughout the day, not just at the end of the day. Because of the survey timing requirement, end-of-day instruments have typically been less practical for use in general surveys than global-yesterday methods (discussed next) have been. However, newer technologies, such as use of interactive cellphone assessments, may offer solutions to some of the data collection constraints associated with end-of-day methods.4 Mode of survey administration is a central issue when assessing the appropriateness of SWB measurement approaches that require precise timing, as do end-of-day measures.
4 The panel returns to the potential role of new technologies in SWB survey methods in section 6.3.
Global-yesterday measures ask respondents about emotions and feelings experienced the previous day. The concerns about end-of-day measures, in terms of approximating a time integral of real-time emotions, apply here as well, except that the increased temporal distance may accentuate them. That is, relative to global-yesterday assessments, one would expect end-of-day measures to correlate more closely with ESM. However, there has been little systematic experimentation into how the recall and contextual influences act differentially between end-of-day and global-yesterday measures.
As noted above, momentary assessment data collection has typically not been feasible for nationally representative government surveys because it involves considerable respondent burden, which can lead to low response rates. Similarly, end-of-day instruments (usually defined as “before bed”) have the practical disadvantage that the survey must take place somewhat precisely at the end of the day. Thus, global-yesterday measures are often the default approach for large surveys. In part because an interviewer can call at any point during the (following) day, global-yesterday questions have featured regularly in large surveys such as those conducted by the Gallup Organization and, more recently, by the ONS.
Christodoulou et al. (2013) validated a global-yesterday version of an ExWB measure by comparing the results against the same emotion adjectives administered using a DRM that links assessments to specific episodes of the day (this work is described in more detail later in this section). Although not the same as testing a single-day assessment against momentary assessments, the idea was that the DRM reconstruction techniques and the duration-weighted average of reported emotions over the previous day should be closer to the “truth” than the global-yesterday measure, which is usually completed quickly and without much contemplation of the day’s events. This study suggests that the correspondence between global-yesterday measures and the DRM is good: depending on the adjective, the correlation was in the 0.7 range.
Much of the evidence for the utility of global-yesterday measures has been established through research using the Gallup Organization’s datasets. These data have generated insights into which groups of the population report being happier from day to day (e.g., middle-aged versus young or old, married versus unmarried, employed versus unemployed) or at what times (e.g., weekends versus weekdays, holidays versus work days). Using the “yesterday” measures from Gallup surveys, Kahneman and Deaton (2010) found that income was related to ExWB in nonlinear ways. Stone et al. (2010) found that various measures of ExWB were related to respondent age in patterns that were very different from a measure of life evaluation (evaluative well-being). Deaton (2012) found that the negative impact on average self-reports of ExWB at the time of the 2008 financial collapse in the United States was short-lived. Stone et al. (2012) used global-yesterday measures to extend knowledge about day-of-the-week associations with ExWB measures. Using the Gallup data from 2008 for more than 340,000 U.S. citizens, the authors found contrasts in mood between weekend days and weekdays, but no significant difference in mood on Mondays compared with Tuesdays, Wednesdays, and Thursdays. Some of the effects contrasted quite sharply; for example, 60 percent of the individuals in one age group reported being stressed for much of the day while the figure was only 20 percent in another age group (Stone et al., 2012).
CONCLUSION 3.2: Global-yesterday measures represent a practical methodology for use in large population surveys. Data from such surveys have yielded important insights—for example, about the relationships between ExWB and income, age, health status, employment status, and other social and demographic characteristics. Research using these data has also revealed how these relationships differ from those associated with measures of evaluative well-being. Even so, there is much still to be learned about single-day measures, and it is possible that much of what has been concluded so far may end up being contested.
These positives notwithstanding, data from global-yesterday, and single-day methods more generally, provide less information about why these differentials exist or during which activities people are suffering more or less. Global-yesterday measures are therefore limited in terms of creating a more detailed understanding of the drivers of ExWB over the course of the day (e.g., variation at the individual level). For this level of analysis, time-use or activities-based data—for example, data generated by DRM-type methods, discussed in section 3.3.1—are needed. Describing group variation, which global-yesterday measures have been shown to do well, is different from explaining the sources of differences in level or the drivers of change for a population.
Survey purpose will dictate the appropriateness of single-day methods (SDMs). Findings that end-of-day ratings correlate well with averages for the day collected using EMA (e.g., Broderick et al., 2009, for pain and fatigue) do not imply that decontextualized end-of-day ratings are sufficient for all purposes. For policy, it is often essential to know what experiences or activities are driving affect or changes in affect, and how. That said, low-burden ExWB ratings from a single day may capture some activities and life events well. For example, it may be possible to meaningfully measure emotional effects associated with Valentine’s Day (Kahneman and Deaton, 2010) using daily affect questions because that is in a real sense the relevant reference unit of time. And some phenomena that unfold over long periods, such as the cumulative emotional impact of being unemployed or of marital or financial problems, may be captured more in responses to questions that involve reflection than they would be in momentary assessment.
Cross-sectional, single-day surveys are most often used to address group differences in ExWB—for example, are older people happier than younger people? Are females more stressed than males? Or, do males report more happiness than females? The main selling point of single-day measures, upon which the case for their inclusion in large surveys hinges, is their ability to accurately detect group differences in a minimally burdensome way. For this reason, the panel considered the analytic value of SDMs in the context of making between-group comparisons. Examining the contributors to SDM variability helps to address this consideration.
Several sources contribute to the variability in SDMs. If an SDM is assumed to contain a portion of its score that is related to the group factor being explored (e.g., gender differences), then this part of the score presumably remains stable from day to day. In other words, if males are in truth happier than females, that information is embedded in SDM values. However, there are many other factors that impact a particular daily score, including the various events that occur on the day the measurement was taken. The daily variability of an SDM, presumably driven by daily occurrences, can make detection of a group effect more difficult.
One way to measure the level of daily variability is to compute the ratio of over-days variation (that is, with SDMs repeated daily for some respondents) to all variation (that is, the total variation due to daily variation, between-person variation, measurement error, and so on). In one study where several hedonic states were assessed daily for 1 week, Stone et al. (2012) found, using intraclass correlations, that 30 to 50 percent of all variation was attributable to day-by-day variation. The researchers concluded that most of the day-to-day variation is “real,” in the sense that daily events and well-known cyclical effects (e.g., weekday to weekend cycling) were producing it; therefore, it was not reasonable to assume that all, or even most, of the daily variation is measurement error.
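The ratio of over-days variation to all variation can be illustrated with a small calculation. The sketch below is not the procedure used by Stone et al. (2012); it assumes a balanced design (the same number of daily scores per respondent), uses a simple one-way random-effects decomposition, and lumps measurement error in with day-to-day variation. The function name `day_to_day_share` and the example scores are hypothetical.

```python
from statistics import mean, variance

def day_to_day_share(scores_by_person):
    """Share of total variance that lies within persons (across days).

    Assumes a balanced design: every respondent has the same number of days.
    Within-person variance here includes measurement error.
    """
    day_counts = {len(days) for days in scores_by_person.values()}
    if len(day_counts) != 1:
        raise ValueError("sketch assumes the same number of days per person")
    k = day_counts.pop()

    # Mean square within: average of the per-person sample variances.
    msw = mean(variance(days) for days in scores_by_person.values())
    # Mean square between: k times the variance of the person means.
    person_means = [mean(days) for days in scores_by_person.values()]
    msb = k * variance(person_means)

    # Between-person variance component (floored at zero).
    var_between = max((msb - msw) / k, 0.0)
    icc = var_between / (var_between + msw)
    return 1.0 - icc

# Hypothetical daily happiness ratings for three respondents over three days.
example = {"A": [4, 5, 6], "B": [6, 7, 8], "C": [2, 3, 4]}
share = day_to_day_share(example)
print(round(share, 3))
```

A share near 0.3 to 0.5 would correspond to the range reported in the text; the toy data above are only meant to show the mechanics.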
CONCLUSION 3.3: Preliminary work suggests that SDMs for measuring ExWB are appropriate for many purposes and contain a valid signal that can be captured by survey studies. Thus, despite their variability, SDMs can be used for testing questions about group differences in ExWB.
Appropriate sampling, though, is necessary for estimates of group differences to be unbiased. One example of how inappropriate sampling could bias estimates occurs in the case where ExWB varies by day of the week and the groups to be compared are not sampled equally over days of the week. In this case, group differences would be confounded with day-of-the-week effects. Random sampling of SDMs over day-of-the-week is probably the best method for reducing the possibility of confounding, but stratified sampling strategies could be effective in smaller samples. (Appropriate data weighting methods may also be used to correct sampling bias.)
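The confounding described here, and the weighting correction mentioned in the parenthetical, can be shown with a toy example. This is a minimal sketch under stated assumptions: every group has at least one respondent on each day being compared, and each sampled day is given equal weight. The function name `day_standardized_mean` and the data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def day_standardized_mean(records):
    """Average the per-day means so that each sampled day counts equally.

    records: iterable of (day_of_week, score) pairs for one group.
    """
    by_day = defaultdict(list)
    for day, score in records:
        by_day[day].append(score)
    return mean(mean(scores) for scores in by_day.values())

# Both groups feel the same on a given day (8 on Saturday, 6 on Tuesday),
# but group A was sampled mostly on Saturday and group B mostly on Tuesday.
group_a = [("Sat", 8), ("Sat", 8), ("Sat", 8), ("Tue", 6)]
group_b = [("Sat", 8), ("Tue", 6), ("Tue", 6), ("Tue", 6)]

raw_a = mean(score for _, score in group_a)
raw_b = mean(score for _, score in group_b)
std_a = day_standardized_mean(group_a)
std_b = day_standardized_mean(group_b)

print(raw_a, raw_b)  # raw means differ only because of sampling days
print(std_a, std_b)  # standardized means agree
```

In this constructed case the raw group difference is entirely a day-of-the-week artifact, and equal weighting across days removes it.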
The Gallup datasets—along with others such as the International Social Survey Program, the World Values Survey, and the Survey of Health, Ageing and Retirement in Europe—have also afforded an opportunity to examine the statistical power (which is a function of daily variability) for the detection of between-group effects. This issue was addressed in Stone (2011) in which a survey of more than 300,000 individuals conducted by the Gallup Organization was analyzed. A small effect size of 0.11, statistical power of 0.80, and a two-tailed alpha level of 0.05 were used for determining the sample size necessary for detecting this magnitude of effect. A sample of only 2,796 people was necessary in this case (and these analyses were supplemented by simulations; see Stone, 2011), indicating that the very large sample for the original survey had extremely high power for detecting small effects with this “highly variable” daily measure (in this case it was a rating of the amount of stress experienced yesterday).
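The order of magnitude of such a sample-size figure can be checked with a standard normal-approximation formula for comparing two equal-size groups. This sketch does not reproduce the analysis in Stone (2011): the formula, the equal-groups assumption, and the function name `n_per_group` are illustrative choices. With a standardized effect size of exactly 0.11 it gives a total of roughly 2,600, close to the 2,796 reported; the gap would be consistent with an unrounded effect size slightly below 0.11 or a different computational procedure.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison.

    Uses the normal approximation n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized effect size.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n = n_per_group(0.11)
total = 2 * n
print(n, total)
```

Because the required n scales with 1/d², halving the detectable effect size roughly quadruples the sample needed, which is why surveys with hundreds of thousands of respondents have very high power for small effects.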
Another paper, by Krueger and Schkade (2008), examined the test-retest reliability of SDMs derived from DRMs administered 2 weeks apart. They found that most ExWB indices yielded reliability coefficients between 0.50 and 0.70, with the multi-item scales for positive affect and negative affect having reliability coefficients of 0.68 and 0.60, respectively. Perhaps surprisingly, these reliabilities were in the same range as those for measures of life satisfaction (evaluative well-being). The authors concluded that, among other things, the reliabilities of experience measures derived from the DRM are “sufficiently high to yield informative estimates for much of the research that is currently being undertaken on subjective well-being, particularly in cases where group means are being compared (e.g., rich vs. poor, employed vs. unemployed)” (Krueger and Schkade, 2008, p. 1843).
CONCLUSION 3.4: Although there may be an initial hesitancy by some to accept the utility of SDMs of ExWB because of their daily variability, a strong case can be made for their deployment in survey studies. Ideally, for a given respondent, capacity would be built into the survey design to aggregate over a number of days; controls for day-of-week effects need to be included in the survey design thoughtfully. In practice, multiday sampling will frequently not be possible; further, government surveys will not always be the best option for carrying out this sort of detailed data collection. Sometimes, government-funded surveys and nongovernment data collections will possess a comparative advantage.
Another concern for interpretation of SDM data has to do with effect sizes. The unstandardized effect size (say a difference of 2 points between men and women on a 7-point scale) should be estimated in an unbiased manner regardless of the amount of noise in the SDM—that is, a 2-point difference by sex (for example) would be evident regardless of the variability of the measure, although in any single study the 2-point difference will be better approximated by a study with relatively lower variability. The same cannot be said for the standardized effect size that is often used as a measure of the strength of an association. Here, the noisy (higher variance) measure will have a lower standardized effect size. This distinction is important when comparing effect sizes from different measurement strategies, especially those that do not contain daily variation. Standardized effect sizes from SDMs will often be relatively small because of the daily variation.
There is also value to increasing the number of SDMs completed by each respondent. In the example above, taking the mean of 20 SDMs for each person would continue to yield a 2-point difference by gender (unstandardized effect size). However, the standardized effect size would be considerably higher because the variability inherent in one SDM will have been “averaged out” and the gender effect will appear relatively larger. Fewer participants would be required to achieve a given level of statistical power in this case. But the feasibility of administering multiple SDMs depends on a host of issues, including the relative burden for participants, the feasibility of implementation, and the costs to the investigator. Furthermore, the value in obtaining additional SDMs per respondent depends upon the amount of daily variability in SDM content for the population in question. Generally speaking, more is to be gained by adding additional assessments from an individual when there is much day-to-day fluctuation in the SDM.
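The contrast between the two kinds of effect size can be made concrete. The sketch below assumes that a person's score on a given day is a stable person-level component plus an independent day-level deviation, so that the variance of a mean over k days is the between-person variance plus the within-person variance divided by k. The variance values and the function name `standardized_effect` are hypothetical.

```python
from math import sqrt

def standardized_effect(raw_diff, var_between, var_within, n_days=1):
    """Standardized group difference when each score is a mean over n_days.

    Assumes independent day-to-day deviations, so averaging over n_days
    shrinks the within-person variance by a factor of n_days.
    """
    return raw_diff / sqrt(var_between + var_within / n_days)

raw_diff = 2.0      # unstandardized difference, unchanged by averaging
var_between = 4.0   # hypothetical stable between-person variance
var_within = 4.0    # hypothetical day-to-day variance

d_1 = standardized_effect(raw_diff, var_between, var_within, n_days=1)
d_20 = standardized_effect(raw_diff, var_between, var_within, n_days=20)
print(round(d_1, 3), round(d_20, 3))
```

The unstandardized 2-point difference is the same in both cases; only the denominator changes. The gain from extra days is larger when `var_within` is large relative to `var_between`, which matches the closing point of the paragraph.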
Additionally, “simple” measures of ExWB, such as end-of-day and global-yesterday measures, which seem preferred for a broad range of surveys on practical grounds, need systematic experimentation that is informed by the extensive literature on retrospective versus concurrent reports of subjective experiences.
RECOMMENDATION 3.1 (Research): Despite the promising information available about SDMs, more information is needed about the psychometric properties of this class of ExWB measures. In particular, research is needed on how many days of data are generally needed to construct a reliable predictor (or average for a person) for end-of-day (or reconstructed yesterday) measures.
With respect to research into how many days might be needed to reduce within-person variability to tolerable levels or to identify the latent variable of ExWB, it should be noted that there cannot be one answer to the question that will be right for all people in all circumstances. Like correlation coefficients, the answer will be sensitive to a host of factors, so the most one study can do is provide a very rough estimate. Additional research is needed to further address such questions as “When is day-to-day variability itself of interest?” and “What do daily peaks and troughs in these data reveal?” For instance, some jobs create stress, which could be important. And various survey issues need more attention; for example, “What biases are created because working people are easier to reach on the weekends?” Additionally, because small effect sizes due to daily variation are compounded when the data are studied at the individual level (and most large-scale surveys are reported at the aggregate level where item reliability and effect sizes will be more substantial), measurement error differences between individual- and group-level measures should be investigated further.
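As a rough illustration of why no single answer exists, the Spearman-Brown formula relates the reliability of a k-day average to the reliability of a single day. It is offered here only as a back-of-the-envelope device, not as a method endorsed by the panel: it assumes that days are interchangeable, independent measurements of the same stable quantity, which the discussion above suggests will not hold exactly. The function names and the 0.80 target are hypothetical.

```python
def reliability_of_mean(single_day_reliability, n_days):
    """Spearman-Brown: reliability of the mean of n_days parallel measures."""
    r = single_day_reliability
    return n_days * r / (1 + (n_days - 1) * r)

def days_needed(single_day_reliability, target, max_days=365):
    """Smallest number of days whose mean reaches the target reliability."""
    for k in range(1, max_days + 1):
        # Small tolerance guards against floating-point rounding at equality.
        if reliability_of_mean(single_day_reliability, k) >= target - 1e-12:
            return k
    raise ValueError("target not reached within max_days")

# Single-day reliabilities spanning the 0.50 to 0.70 range cited earlier.
for r in (0.5, 0.6, 0.7):
    print(r, days_needed(r, target=0.80))
```

Under these assumptions the required number of days moves noticeably with the single-day reliability, which itself depends on the population, the measure, and the period studied.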
Although some of this research could be carried out by statistical agencies prior to fielding a survey, most of this work will continue to be done by academic researchers working in the field, perhaps under research-grant programs supported by funding agencies.
RECOMMENDATION 3.2: For SDMs, day of sampling, time of day, and even respondent location, especially for certain subgroups, are important considerations when designing a study. These variables must be controlled for (which might just mean randomized sampling, so that the effects wash out) or avoided in measures of ExWB.
The above considerations are especially important for a statistical agency charged with developing and experimenting with single-day SWB questions.
For some research and policy questions, contextual information about activities engaged in, specific behaviors, and proximate determinants is essential. For example, to investigate how people feel during job search activities, while undergoing medical procedures, or when engaged in child care, something more detailed than a global daily assessment is needed. Activity-based measures attempt to fill this measurement need. The attractive feature of activity-based measures is their capacity to improve understanding of the drivers of experience by providing dimensions of quality and context.
One promising activity-based ExWB measure is the DRM, developed by Kahneman et al. (2004). The DRM was created to assess subjective experiences in a manner that specifically avoids problems of many recall-based measures while being more affordable and less burdensome than momentary methods. The attractive feature of the DRM is its capacity to combine time-use information with the measurement of affective experiences. Respondents are asked to construct a diary of all activities they engaged in the preceding day; then they are given a list of positive and negative feelings and are asked to evaluate how strongly they felt each emotion during each activity listed in their diary, using a numeric rating (e.g., on a scale from 0 to 10). Participants follow a structured format in which they first divide a day into specific “episodes” or events. They then describe those events in terms of the type of activity (e.g., commuting to work, having a meal, exercising) and provide a detailed rating of their affective state during the activity.
Another attractive feature of DRM (and EMA) measures of ExWB is their potential to go beyond single indices of well-being that simply average across all ratings for an individual. While the mean certainly carries valuable information, it also ignores many other characteristics of experience, such as the amount of time spent in a particular hedonic state or the variability of hedonics throughout the day. Using DRM data, Kahneman et al. (2004) proposed the U-index, which is based on the relative intensity of positive and negative emotions during every episode; it yields a metric indicating the proportion of time respondents spent in predominantly positive or negative states. Thus, the richer data yielded by EMA and DRM have the potential to provide correspondingly deeper views of experience.
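A U-index-style calculation can be sketched from DRM episode data. The text does not spell out the classification rule, so the sketch assumes one common operationalization: an episode counts as unpleasant when its strongest negative rating exceeds its strongest positive rating, and the index is the duration-weighted share of such episodes. The `Episode` structure, the adjective names, and the example day are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    activity: str
    minutes: float
    positive: dict  # adjective -> rating, e.g. {"happy": 5}
    negative: dict  # adjective -> rating, e.g. {"stressed": 4}

def u_index(episodes):
    """Share of reported time spent in predominantly negative episodes.

    Assumed rule: an episode is unpleasant if its strongest negative
    rating is strictly greater than its strongest positive rating.
    """
    total = sum(e.minutes for e in episodes)
    unpleasant = sum(
        e.minutes
        for e in episodes
        if max(e.negative.values()) > max(e.positive.values())
    )
    return unpleasant / total

day = [
    Episode("commuting", 60, {"happy": 2}, {"stressed": 6}),
    Episode("working", 240, {"happy": 5}, {"stressed": 4}),
    Episode("dinner", 60, {"happy": 8}, {"stressed": 1}),
]
u = u_index(day)
print(round(u, 3))
```

Because the rule only compares ratings within an episode, it relies on ordinal information from each respondent rather than on comparing rating levels across respondents.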
By asking participants to first recall the events of their day and then provide ratings associated with them, the DRM exploits the fact that, while memories of ongoing experiences such as pain and mood are flawed, memory for discrete events is more accurate (Robinson and Clore, 2002). Thus it avoids, or at least reduces, some of the biasing factors noted above, such as the tendency to recall information that is congruent with peak or recent experiences, which are more easily remembered. The DRM is designed to be self-administered and can be completed by most participants in a single sitting. It is thus much less burdensome and costly to field than the most rigorous EMA methods, and it is scalable to large surveys.
In assessing the value of DRM for estimating emotional experience, an obvious question is how well DRM results mirror those of more intensive methods such as EMA. Numerous concerns have arisen regarding the accuracy of traditional self-report measures that require respondents to remember and summarize their emotional experiences over some period of time or that ask respondents for on-the-spot judgments of their overall quality of life (for reviews, see Diener et al., 1999; Schwarz and Strack, 1999). These concerns have led methodologists to consider ways of capturing subjective experiences that rely less on participants’ ability to remember subjective states accurately and to aggregate these experiences into a single summary score.
Among the issues remaining to be addressed about the scoring of DRM data is that many activity episodes are available for every person and many descriptive adjectives are available for each of the episodes. Apart from the U-index mentioned above, another attractive scoring method is to create a duration-weighted average of a selected adjective or composite of adjectives, where longer episodes contribute relatively more to the daily average than shorter episodes. This method has been employed in several DRM-type studies. For other purposes, however, different weighting schemes may be more appropriate. For example, when the problem at hand concerns high levels of a particular feeling, assigning greater weight to episodes in which that feeling surpasses some threshold may be productive. Yet another option is to create metrics based on the content of activities; this may be appropriate if one were interested in well-being at the workplace or during specific activities. More research is needed to document the most efficient and useful ways to combine the rich information produced by the DRM.
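The two weighting ideas described above can be sketched in a few lines. The first function is a plain duration-weighted average. The second is one hypothetical way to give extra weight to episodes whose rating exceeds a threshold; the `boost` multiplier and the example ratings are assumptions made for illustration and do not correspond to any scheme specified in this report.

```python
def duration_weighted_mean(episodes):
    """episodes: list of (minutes, rating) pairs for one adjective or composite."""
    total = sum(minutes for minutes, _ in episodes)
    if total == 0:
        raise ValueError("no time recorded")
    return sum(minutes * rating for minutes, rating in episodes) / total


def threshold_weighted_mean(episodes, threshold, boost=2.0):
    """Hypothetical variant: an episode whose rating exceeds `threshold`
    has its duration weight multiplied by `boost`."""
    weights = [
        minutes * (boost if rating > threshold else 1.0)
        for minutes, rating in episodes
    ]
    total = sum(weights)
    if total == 0:
        raise ValueError("no time recorded")
    return sum(w * rating for w, (_, rating) in zip(weights, episodes)) / total


# Invented stress ratings (0 to 10) for three episodes of one respondent's day.
stress = [(60, 6), (480, 4), (180, 1)]
print(duration_weighted_mean(stress))
print(threshold_weighted_mean(stress, threshold=5))
```

In this example the long, moderately stressful work episode dominates the duration-weighted average, while the threshold variant pulls the summary toward the short but intense commuting episode.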
For ranking the relative merits of the competing ExWB measurement approaches, the panel took as its starting point the following statement, which is distilled from assessments of the reliability of SWB measures articulated by Krueger and Schkade (2008) and by Krueger et al. (2009). There is a compelling conceptual case for measures of ExWB (and, more narrowly, hedonic well-being) that is best satisfied by ESM/EMA, reasonably satisfied by the full DRM, and—with some compromises—sometimes satisfied by truncated versions of the DRM such as the ATUS SWB module.
This statement holds for cases in which momentary ExWB is the measurement objective. For some questions (e.g., predicting consumer behavior or whether or not a person is likely to repeat a medical procedure), a reconstructed assessment of ExWB may be more relevant;5 it may also be better at predicting a policy’s impact on people’s choices, but worse at assessing a policy’s impact on experience. The director of a survey charged with adding self-reported well-being content has to answer the question, “Which ExWB approach should be used?” In many cases, the structure of the survey will rule out such approaches as EMA or end-of-day measures. For other predictive purposes, a cheaper evaluative well-being measure may perform as well as an ExWB measure.
5 Posing the issue in a medical context clarifies the distinction. For instance, is the goal of a drug treatment to alter how much pain a person is in at a given moment or to alter how one remembers being in pain? The U.S. Food and Drug Administration tends to focus on actual pain; drug manufacturers may have a different objective.
This section discusses how momentary measures and reconstructed measures fare in terms of their relative susceptibility to context influences and the implications for accurate measurement. Overall, the panel concludes that episodic reconstruction can be quite accurate, at least for recent episodes, if respondents are given sufficient encouragement and time to relive the episode.
• ESM/EMA allow for introspective access to concurrent affect. By contrast, end-of-day and global-yesterday measures of ExWB require reconstruction; they differ in the extent to which they encourage and enable episodic reconstruction.
• DRM—detailed reconstruction of yesterday. A fully executed DRM encourages reconstruction of specific episodes, which is likely to induce a mild version of the affect associated with the episode. It captures EMA-like patterns that are not part of respondents’ lay theories (Kahneman et al., 2004; Stone et al., 2006); this is important because it implies that the answers could not be produced by theory-driven reconstruction. A fully executed DRM takes considerable time, up to an hour in some cases, but Internet-based DRM versions may be more efficient. The time requirement precludes its routine use in representative surveys; even the reconstruction of partial days exceeds realistic resources for most studies.
• Episodic with limited reconstruction of yesterday. DRM adaptations with more limited reconstruction are more feasible and can reproduce core patterns obtained with the DRM and ESM. One version was implemented as the Princeton Affect and Time Survey; a variant is included in the ATUS SWB module, which assesses affect for three randomly selected episodes after respondents complete a whole-day stylized diary with minimal reconstruction of the three selected nonconsecutive episodes. (Notably, though, the entire day is reconstructed in both methods; only the feelings information is limited to being recalled for the selected episodes.)
The comparative properties of EMA and DRM measures of ExWB are a central concern for the future development of SWB survey modules. There are theoretical reasons, and some limited empirical evidence, to suggest that the DRM may provide some of the same advantages of EMA over traditional recall-based survey approaches. Several studies have directly tested whether the DRM can be used instead of EMA in some research contexts. Box 3-1 relates findings of a recent study that directly compares DRM and EMA data for the same participants during a given time period. The approach was to collect EMA, DRM, and standard recall-based measures of mood and physical symptoms in older adults, many of whom have osteoarthritis.6 EMA-based ratings of emotional experiences as they occurred throughout the day, a reconstruction of those experiences using the DRM, and memory-based estimates from standard survey items were all obtained. If the DRM measure provides a close replication of actual experiences, one would expect to see high concordance between the DRM and EMA measures. If, on the other hand, DRM estimates are biased due to their reliance on recall, they should more closely match the estimates based on standard recall-based measures.
The findings in Box 3-1 suggest that DRM measures of mood and physical symptoms closely approximate summary measures created from an EMA protocol. Where there were systematic differences, DRM estimates of negative mood and physical symptoms, such as pain and fatigue, tended to be lower than those collected by EMA. In terms of within-day patterns, the correspondence between EMA and DRM estimates was striking; furthermore, both estimates diverged from participants’ expressed beliefs about those patterns. The investigators also noted what appear to be advantages of the DRM measures over traditional recall-based summary measures, even with a time frame for the recall measures (4 days) that is shorter than usual.
CONCLUSION 3.5: Preliminary assessment of DRM measures of mood and physical symptoms suggests that they reasonably approximate summary measures created from EMA protocols. An attractive feature for survey objectives is that the DRM approach goes beyond simply addressing who in the surveyed population is happy to identifying when they are happy. Additionally, it appears that the DRM is less burdensome on respondents than experience sampling, and it might reduce memory biases that are inherent in global recall of feelings. The DRM is thus a promising method for assessing feelings, mood, and physical symptoms that accompany situations and activities more efficiently than with EMA methods and with greater specificity and accuracy than traditional recall-based methods.
6 Because this study was based on a sample of people with osteoarthritis, the associations observed may be somewhat higher than in other samples because of the relatively high variability of pain and fatigue likely in this group and therefore may not generalize to the population at large. On the other hand, one potential problem in comparisons between EMA and DRM data of this kind is that, if there is little within-person variation, the strength of associations between different measures will be low even if the measures themselves are relatively accurate.
BOX 3-1 A Test Comparison of EMA and DRM Estimates
Smith et al. (2012)* surveyed 120 older adults (age > 50), 80 of whom have osteoarthritis of the knee. These participants completed an EMA protocol over 4 days. It used a fixed interval schedule with 6 prompts per day (upon waking, 2, 4, 8, and 12 hours after waking, bedtime). Patients were asked about their mood (happiness, depression), symptoms (pain and fatigue), and level of physical activity. Because of the focus on transient physical symptoms in a smaller clinical sample, the investigators used a “coverage model” for the EMA protocol. That is, they asked participants to summarize their mood and symptom levels since the last prompt. On one day, participants also completed an Internet-based version of the DRM, which included the same measures (with identical wordings and scales) as the EMA protocol. The DRM protocol asked about activities and feelings for the previous day and therefore was administered on day 2, 3, 4, or 5 to correspond to EMA day 1, 2, 3, or 4, respectively. On day 5, participants completed the same measures in summary form (e.g., “Over the past 4 days, how happy were you?”).
With the data from this design, the investigators created overlapping EMA, DRM, and standard recall measures. A key difference is that the DRM sampled only one day. To allow comparison with the standard summary measures, the investigators created composite estimates of each measure by averaging responses to all 24 EMA prompts and all DRM activity ratings. Thus, each participant has one EMA score for average happiness, one DRM score for average happiness, and of course the single summary score from the standard recall-based measure.
[Table: Mean Levels of Mood and Symptoms by Method of Assessment. Columns: EMA Estimate, DRM Estimate, Recall-Based Estimate.]
Compared to the EMA estimates, standard recall-based survey measures of happiness, depression, pain, fatigue, and activity level all showed levels that were higher, and markedly so in the case of activity level (all comparisons significant at p < 0.05, with the exception of fatigue; Stone et al., 2012, p. 10). This pattern is consistent with memory estimates that were biased by “peak” experiences (Broderick et al., 2009). Person-level correlations between the recall-based and EMA estimates were strong, ranging from r = 0.53 (activity level) to r = 0.86 (physical pain); average correlation across all five measures was r = 0.75. In contrast, DRM levels of happiness were nearly identical to those of EMA (p = 0.24). Estimates of pain, depression, fatigue, and activity level were all slightly lower, which is not consistent with a peak bias (all p’s < 0.05). In addition, person-level correlations were higher than those observed with the recall-based measure in every instance, ranging from r = 0.65 (activity level) to r = 0.92 (physical pain); average correlation across all five measures was r = 0.81.
Interpretation of these comparisons is somewhat complicated by the fact that the EMA and recall-based measures cover 4 days, compared to 1 day for the DRM measures. Thus, the investigators restricted the EMA time range to the single day on which the DRM was completed, but this made little difference; the means were again similar across the EMA and DRM methods, and the correlations were nearly identical.
Diurnal patterns were examined next. Both EMA and DRM revealed similar cross-day changes in mood and symptom levels. When participants were asked to estimate how their mood and symptom levels typically changed throughout the day, they did not reproduce these patterns (with the notable exception of physical pain). Thus, they appeared to be mostly unaware of the patterns present in the scores they had provided over the previous 4 days. This analysis is important, because it shows that DRM diurnal patterns resembled EMA patterns more closely than they resembled participants’ beliefs about these patterns. Where the recall-based measure diverged from the EMA averaged estimate, the DRM measures still tracked with the EMA measures.
*This summary of findings was commissioned by the panel and funded by the National Institute on Aging. Susan Murphy, Norbert Schwarz, and Peter Ubel, as well as study team members William Lopez and Rachel Tocco, were study co-investigators with Dylan Smith.
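The person-level comparison summarized in Box 3-1 rests on two simple steps: averaging each participant's ratings into one composite score per method, and then correlating those composites across participants. The sketch below illustrates these steps with invented ratings for four hypothetical participants; the identifiers, the values, and the use of an unweighted mean for the DRM composite are assumptions for the example, not the investigators' data or code.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented happiness ratings: EMA prompt responses and DRM episode ratings.
ema_prompts = {
    "p1": [6, 7, 5, 6],
    "p2": [3, 4, 2, 3],
    "p3": [8, 9, 8, 7],
    "p4": [5, 5, 4, 6],
}
drm_ratings = {
    "p1": [5, 6, 7],
    "p2": [2, 3],
    "p3": [7, 8],
    "p4": [5, 6],
}

# One composite score per participant and method.
ema_score = {pid: mean(vals) for pid, vals in ema_prompts.items()}
drm_score = {pid: mean(vals) for pid, vals in drm_ratings.items()}

# Person-level correlation between the two sets of composites.
ids = sorted(ema_score)
r = pearson_r([ema_score[i] for i in ids], [drm_score[i] for i in ids])
print(round(r, 2))
```

A high correlation of this kind indicates that the two methods rank participants similarly; it does not by itself show that the mean levels agree, which is why the box reports mean comparisons separately.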
While the DRM is certainly promising, questions have been raised about its use—for example, about the extent to which estimates produced are unbiased and about whether the time weighting implied in the survey structure reflects psychological realities (see Diener and Tay, 2013). Thus, the panel adds to the above conclusion the caveat that research using the DRM is still in an early stage, so evidence for the validity and reliability of the DRM is, at this time, somewhat limited. This constraint suggests the need for further comparative research using ESM/EMA data to validate the DRM.
RECOMMENDATION 3.3 (Research): Additional research is needed to better establish the evidence base for determining when the DRM is an adequate substitute for EMA methods of measuring ExWB. In particular, better understanding is needed of the psychometric properties of the DRM; this may be achieved, for example, by comparing DRM reports to mobile phone assessments and other forms of momentary experience sampling, as well as to global reports of feelings in situations. Additionally, more research is needed comparing performance, sensitivity, and variation of DRM and EMA approaches to measuring ExWB.
For some purposes, the DRM will not be an adequate substitute for momentary experience sampling. While it is possible to learn a great deal about people’s emotional states associated with various activities using the DRM, additional work is needed on the meaning or interpretation of self-reports summarized over a period of time.
The American Time Use Survey (ATUS), conducted by the Bureau of Labor Statistics, included an SWB module in 2010 and 2012, which was funded by the National Institute on Aging. The ATUS SWB module, which is described in detail in Appendix B, is the only federal government data source of its kind, linking self-reported information on individuals’ well-being to their activities and time use. The ATUS SWB module is thus an abbreviated version of a DRM approach. There are other short-form versions of the DRM that have been used in experimentation, such as the Princeton Affect and Time Survey, mentioned above.
Much of the policy promise of ExWB measurement lies in its potential to be combined with time-use information designed to illuminate how activities and environments relate to a person’s emotional states. If activity and time allocation are not included in a survey design, data analyses are limited to considering the influence of sociodemographic characteristics, such as those that dominate the literature on evaluative well-being.
CONCLUSION 3.6: Capturing the time-use and activity details of survey respondents enhances the policy relevance of ExWB measures by embedding information about relationships between emotional states and specific activities of daily life.
It is a relatively easy task to identify examples where detailed time-use survey data add analytic content beyond that which is obtainable from global-yesterday measures:
• Commuting effects cannot be teased out of global-day measures. In contrast, Christmas encompasses a day, so a yesterday measure will in principle call attention to effects associated with it.
• Tracking the health of a population may not require detailed activity-based data. But to get at causes of stress or even pain, researchers need data on the activities associated with these affects.
• Using an overall day measure, an unemployed person may look only a little worse off (or not at all) than the population average. Analysis needs to look at differentials at work and during activities while not at work; otherwise any explanation of the self-reported results is incomplete.
• In terms of policy pathways, time-use data provide insights into what income is a proxy for. Such data capture effects on emotional states of being on vacation, enjoying leisure, being at work, etc.
How well the ATUS truncated version of the DRM will ultimately perform is yet to be determined. However, it is not too early to begin taking advantage of the opportunity afforded by the ATUS SWB module to explore this issue.
RECOMMENDATION 3.4: While it may not be practical to run the ATUS as a full DRM—although this would yield very valuable information—it may be possible to explore differences between the ATUS SWB module and a full DRM by using a pilot consisting of a sample of ATUS respondents. In addition, increasing the number of episodes examined for ExWB would be desirable.
More generally, for DRM-type survey designs, much more can be learned when survey modules are placed so that samples are drawn from people for whom much is already known (e.g., subsamples of the Understanding Society Survey, Current Population Survey, and others that are rich in relevant covariates). Among additional research questions, one is how to weight events in a DRM approach, given that people experience different numbers of episodes of different durations and that affect has been shown to correlate with duration of episode. Another key research question is the reliability and usefulness of shorter, hybrid, DRM-like methods linking to activities.7 The overall goal of this research would be to produce something better (more information content) than simple overall day measures, while still being short enough in administration time required to add to surveys with minimal increase in respondent burden.
7 In research being funded by the National Institute on Aging, Jacqui Smith and colleagues are tackling this issue by comparing Health and Retirement Study findings with the DRM data collected in the Panel Study of Income Dynamics, the ATUS SWB module, and the American Life Panel’s DRM measures. These secondary analyses will answer questions about the quality and comparability of responses to the fine-grained DRM approach versus brief DRM measures. Available: http://micda.psc.isr.umich.edu/project/detail/35382 [October 2013].
RECOMMENDATION 3.5 (Research): Additional research is needed on the optimal response scales and on the various ways of creating summary measures of the day’s affect. Although duration-weighted measures are usually used, other ways of combining time-use and affective data are possible, such as the U-index.
Chapter 6, on data collection strategies, returns to considerations about the next steps for the ATUS SWB module and other shortened variants of the DRM.