Suggested Citation:"6 Data Linkage to Supplement Health Surveys." National Academies of Sciences, Engineering, and Medicine. 2023. Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources. Washington, DC: The National Academies Press. doi: 10.17226/26804.

6

Data Linkage to Supplement Health Surveys

As with income, much work has been done on linking health survey data with administrative records. Many health data sources contain personally identifying information that permits record linkage; these include surveys about health, administrative data such as Medicare and Medicaid claims or databases of birth and death records, state data records, health claims submitted to private insurers, and electronic health records from government agencies (e.g., the Department of Veterans Affairs, the Indian Health Service, and municipal hospitals) and from private hospitals and doctors. These data have been used to study potential bias from missing data and to suggest improvements to measurement methods, as with the income studies discussed in Chapter 5.

This chapter focuses on the use of linked household survey and administrative data to enhance the study of health conditions and outcomes, as emphasized in the workshop session Data Linkage for Income and Health Statistics. For example, survey respondents might know they were hospitalized, but not the precise condition(s) treated, the results of all tests that were done, the actual medical procedures undertaken, or the total costs. By adding variables from administrative records on health claims or deaths to health survey data (which may provide information not available in administrative records such as demographic information, health attitudes and behaviors, and health conditions from self-reports or medical examinations), researchers can gain additional insights about health and diseases nationally and in population subgroups.

Sections 6.1 and 6.2 review key household surveys and administrative data sources used in linkage projects by the U.S. National Center for Health Statistics (NCHS). Section 6.3 highlights recent data-linkage activities by NCHS, and Section 6.4 discusses data-equity implications of the linkages. Section 6.5 examines challenges involved in linking data from longitudinal surveys, with a focus on data linkage with the Health and Retirement Study, a panel survey of Americans over the age of 50.

6.1 SURVEYS FROM THE U.S. NATIONAL CENTER FOR HEALTH STATISTICS

Many data linkages have involved two of the key household surveys administered by the NCHS: the National Health Interview Survey (NHIS) and the National Health and Nutrition Examination Survey (NHANES).1

National Health Interview Survey

The NHIS, the largest household survey conducted by NCHS, is a face-to-face, cross-sectional survey that monitors the health of the U.S. civilian noninstitutionalized population through interviews with survey participants. The NHIS has been conducted continually since 1957, but the survey design and content have been updated periodically to take advantage of new developments in survey methodology, include new health topics, reduce respondent burden, and harmonize content with other health data sources. Some content is included every year, including demographic information, health insurance coverage, health care access and use, chronic conditions, health-related behaviors such as diet and physical activity, and functioning and disability. Other questions are asked on a rotating schedule.2

One “sample adult” aged 18 years or older and one “sample child” aged 17 years or younger (if applicable) are selected randomly from each respondent household. Sampled adults provide their own health information if able to do so (otherwise information is provided by a proxy); information about the sample child is collected from a parent or other knowledgeable adult. In 2021, there were 29,482 sample adult interviews and 8,261 sample child interviews (NCHS, 2022b, p. 10).

___________________

1 NCHS also conducts many other surveys, which are listed at https://www.cdc.gov/nchs/. Other federal agencies also conduct surveys about health topics. For example, the Current Population Survey regularly has a supplement on tobacco use, the U.S. Veterans Health Administration conducts surveys about veterans’ health and use of health care, and the Substance Abuse and Mental Health Services Administration conducts the National Survey on Drug Use and Health.

2 See https://www.cdc.gov/nchs/nhis/about_nhis.htm and NCHS (2020a) for overviews of the survey, and https://www.cdc.gov/nchs/nhis/2019_quest_redesign.htm for a description of content in any given year.

As with the Current Population Survey (see Chapters 2 and 5), the target population for the NHIS is the U.S. civilian noninstitutionalized population:

The NHIS universe includes residents of households and noninstitutional group quarters (e.g., homeless shelters, rooming houses, and group homes). Persons residing temporarily in student dormitories or temporary housing are sampled within the households that they reside in permanently. Persons excluded from the universe are those with no fixed household address (e.g., homeless and/or transient persons not residing in shelters), active duty military personnel and civilians living on military bases, persons in long-term care institutions (e.g., nursing homes for the elderly, hospitals for the chronically ill or physically or intellectually disabled, and wards for abused or neglected children), persons in correctional facilities (e.g., prisons or jails, juvenile detention centers, and halfway houses), and U.S. nationals living in foreign countries (NCHS, 2022b, p. 11).

National Health and Nutrition Examination Survey

The NHANES began in the 1960s to assess the health and nutritional status of U.S. adults and children.3 It provides information not available from other health surveys because it has both interview and examination components. The interview asks questions about demographic and socioeconomic characteristics as well as dietary and health-related questions. The examination component, conducted by trained medical personnel, includes laboratory tests and medical, dental, and physiological measurements. Because it measures aspects of health directly, data from the NHANES can be used to estimate the prevalence of major diseases and risk factors. NHANES findings are also the basis for national standards for such measurements as height, weight, and blood pressure.

Interviews are conducted in respondents’ homes and medical examinations are performed in mobile examination centers that travel to the areas included in the sample. Because of the expense of conducting medical examinations of survey respondents, the sample size for the NHANES is smaller than for the NHIS: about 5,000 adults and children each year. Data must typically be accumulated for multiyear periods to allow computation of estimates for population subgroups.

Like the NHIS, the NHANES is a sample of the civilian noninstitutionalized population. Persons experiencing homelessness, persons residing in institutions such as nursing homes and prisons, and persons in the military are excluded.

___________________

3 See https://www.cdc.gov/nchs/nhanes/about_nhanes.htm and NCHS (2020a) for overviews of the survey.

Strengths and Limitations of Health Survey Data

A survey is the only way to measure some health topics, and the NHIS and NHANES both ask a broad array of questions about health, nutrition, and physical activity that are unavailable from administrative records.4 The NHANES examination component identifies health conditions that might be unknown to the survey participant—for example, some survey participants may be unaware that they have diabetes—and that information would not be found in any other data source.5

As with all household surveys, however, both the NHIS and NHANES have been subject to decreasing response rates, with accelerating declines since 2010. Figure 2-1 shows the response rates for the screener portion of the NHIS (which obtains the roster of household members for selecting the sample adult and child) and the interview portion of the NHANES. Additional nonresponse occurs because some sampled adults and children in the NHIS do not participate in interviews, some NHANES respondents do not participate in the medical examination, and participants may have missing data for survey items. In 2021, the NHIS response rates for the sample adult and sample child interviews were each close to 50 percent (NCHS, 2022b, p. 10). Of the 27,066 persons sampled for the 2017–2020 NHANES, 51.0 percent were interviewed and 46.9 percent were examined.6

The low response rates in recent years raise concern about possible nonresponse bias that might remain in the survey data after weighting adjustments for nonresponse are performed. Administrative data sources can be used to investigate how well nonresponse adjustments remove bias (see Section 6.3), but administrative records datasets may also omit parts of the population.

6.2 SOURCES OF ADMINISTRATIVE DATA ON HEALTH

The NCHS has a robust program linking data from its surveys with administrative data, including the National Death Index (NDI), Social Security and Supplemental Security Income benefit records collected by the Social Security Administration (SSA), data on Medicare and Medicaid/State Children’s Health Insurance Program from the Centers for Medicare & Medicaid Services, and administrative data for participants in the Department of Housing and Urban Development’s (HUD) largest housing-assistance programs (the Housing Choice Voucher program, public housing, and privately owned subsidized multifamily housing).

___________________

4 As discussed in Section 2.2, some information about these topics may be available from fitness-tracking devices, but data from these devices are typically available only through convenience samples.

5 Survey participants receive a report on results of the thorough medical examination as one of the benefits of participation. A participant is notified of any urgent health problems immediately.

6 https://wwwn.cdc.gov/nchs/data/nhanes3/ResponseRates/NHANES-2017-2020-Response%20Rates-2017-March2020-508.pdf. Data collection for this sample ended in March 2020 because of the COVID-19 pandemic.

The NDI contains records of nearly all deaths occurring since 1979 (see Section 2.2), and provides information that cannot be gathered from a household survey: date, location, causes, and circumstances of deaths.

Administrative records from the Centers for Medicare & Medicaid Services provide the opportunity to study changes in health status, health care utilization and costs, and prescription drug use among Medicare and Medicaid participants.7 NCHS is provided with Medicare program enrollment and claims/encounters data for survey participants who are matched with Medicare administrative records. Using Medicare and Medicaid data together with survey data allows researchers to study some of the populations excluded from the NHIS and NHANES, such as people living in institutions. Health care needs and expenditures of nursing home residents differ from those of people of similar age who live in households or noninstitutional group quarters, and Medicare and Medicaid data, either alone or combined with other data sources, can provide information on the health trajectories of nursing home residents.

Medicare and Medicaid data are not available for everyone in the U.S. population, however, since both programs have eligibility requirements. Medicare federal health insurance is limited to people who are 65 or older, people under 65 with disabilities, and people with end-stage renal disease. Medicare eligibility and enrollment files, containing information on demographics, reason for Medicare eligibility, and type of Medicare enrollment (fee-for-service Original Medicare or Medicare Advantage), are available for everyone in the program. But Medicare claims data generally do not include information about beneficiaries enrolled in Medicare Advantage plans, which are operated by private companies that contract with Medicare (NCHS, 2016); in 2021, about 44 percent of beneficiaries were enrolled in such plans.8

Federal law specifies mandatory eligibility groups for state Medicaid programs, including low-income families and individuals receiving Supplemental Security Income; some states cover additional groups.9 But Medicaid data do not have full coverage for studying health and expenditures of the low-income population because not all eligible people participate in the program. Certain population subgroups are particularly likely to be nonparticipants and thus to be without health insurance. Using American Community Survey data, Lukens and Sharer (2021) estimated that Black and Hispanic adults accounted for nearly 60 percent of the 2019 “coverage gap” (adults with incomes below the poverty line who do not have Medicaid or other insurance).10

___________________

7 https://www.cdc.gov/nchs/data-linkage/CMS-Medicare-Restricted.htm and https://www.cdc.gov/nchs/data-linkage/medicaid.htm

8 https://www.cms.gov/newsroom/news-alert/cms-releases-latest-enrollment-figures-medicare-medicaid-and-childrens-health-insurance-program-chip; https://data.cms.gov/collection/cms-program-statistics

9 https://www.medicaid.gov/medicaid/eligibility/index.html describes Medicaid eligibility requirements.

Medicare and Medicaid files both lack information about people who do not participate in Medicare or Medicaid—including those with private, or no, health insurance. Moyer (2021) discussed NCHS initiatives for using private-sector data.

HUD data, too, cover only part of the population: people receiving housing assistance through HUD’s three largest programs. HUD’s administrative data, submitted by local public housing authorities and contracted private owners or managers of apartment buildings, contain housing, income, and program information for participants.

6.3 DATA LINKAGE AT THE U.S. NATIONAL CENTER FOR HEALTH STATISTICS

The NCHS data linkage program “aims to maximize the scientific value of the Center’s population-based surveys, by linking NCHS survey data with data collected from vital and other administrative records. Linked data files enable researchers to augment information for major diseases, risk factors, and health service utilization, by linking exposures to outcomes and in some cases introducing a longitudinal component to survey data” (NCHS, 2022b, p. 118).11 Golden and Mirel (2021) and Mirel (2022) gave overviews of the program.12

In addition to linking individual records, NCHS performs linkages at the area level. Addresses are geocoded to standard census geography units, which allows researchers to merge area-level statistics such as county poverty rate or air quality with the survey data.

___________________

10 Being below the poverty line does not exactly coincide with Medicaid eligibility because eligibility criteria vary across states. Children from low-income families, however, are eligible for Medicaid in all states, and the percentage of eligible children enrolled in Medicaid across states ranged from 81 to 98 percent in 2018 (Schor & Johnson, 2021). Keisler-Starkey and Bunch (2022, Figure 4) estimated from CPS ASEC data that in 2021, 5 percent of all children under age 19 had no health insurance coverage, but the uninsurance rate was 8.6 percent for Hispanic children and 18.6 and 22.6 percent for foreign-born and noncitizen children, respectively.

11 An inventory of NCHS survey data already linked with administrative records can be found at https://www.cdc.gov/nchs/data/datalinkage/LinkageTable.pdf. Linked data from NCHS can be accessed for approved research projects at the NCHS Research Data Center or through the Federal Statistics Research Data Centers.

12 See also https://www.cdc.gov/nchs/data-linkage/index.htm for a general description. NCHS (2021c, 2022d) described the specific procedures used to link NCHS survey data to the NDI and Medicare/Medicaid records.

The linked data have been used for two main purposes. First, as with the income studies discussed in Section 5.4, linked data have been used to study the accuracy of items in survey and administrative datasets. Second, linked data have been used to study questions about health and to provide information that can be used to promote evidence-based health policy.

Linkages to Examine Accuracy of Health Data

As with linked income data, researchers have used linked health data to study the concordance between survey reports and information in administrative records, or to assess effects of survey nonresponse. For example, Keyes et al. (2018) studied potential nonresponse bias in the NHIS by comparing age-adjusted mortality rates estimated from survey respondents (with mortality status determined by linkages with the NDI) with population mortality rates from the National Vital Statistics System. Other researchers have examined the concordance between survey and administrative data on topics including Medicare enrollment (Gindi & Cohen, 2012), Medicaid enrollment (Mirel et al., 2014), receipt of rental assistance or Social Security disability benefits (Boudreaux, Fenelon, & Slopen, 2018; Mirel et al., 2019b), and reports of childhood asthma (Zablotsky & Black, 2019).

For example, Day and Parker (2013) compared self-reported diabetes in the 2005 NHIS with information about diabetes in linked Medicare claims files, using a procedure typical of concordance studies. They linked NHIS participants aged 65 and over with their Medicare records, finding that 93 percent of survey respondents who reported they had diabetes had a diabetes indicator in the Medicare files, but only 67 percent of those with a diabetes indicator in the Medicare files self-reported the condition on the NHIS. Day and Parker (2013) suggested that the discrepancy may have occurred because respondents misunderstood the survey questions or their doctors’ diagnoses.
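The two percentages in a concordance study of this kind are agreement rates read from the same two-way table in opposite directions. The following is a minimal Python sketch of that arithmetic. The function name and the counts are hypothetical, chosen only so that the rates come out near 93 and 67 percent; they are not Day and Parker's data.

```python
def concordance(both, survey_only, claims_only):
    """Return the two directional agreement rates for a linked 2x2 table.

    both         -- self-reported the condition AND have a claims indicator
    survey_only  -- self-reported the condition, no claims indicator
    claims_only  -- claims indicator present, condition not self-reported
    """
    # Share of self-reporters whose report is confirmed by the claims file.
    confirmed_by_claims = both / (both + survey_only)
    # Share of people flagged in the claims file who also self-reported.
    confirmed_by_survey = both / (both + claims_only)
    return confirmed_by_claims, confirmed_by_survey


# Hypothetical counts, NOT taken from Day and Parker (2013).
by_claims, by_survey = concordance(both=930, survey_only=70, claims_only=458)
print(f"self-reports with a claims indicator: {by_claims:.0%}")
print(f"claims indicators that were self-reported: {by_survey:.0%}")
```

Because the denominators differ, a high rate in one direction says little about the other, which is why concordance studies typically report both.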

Linkages to Study Health Outcomes and Associations

Linking health survey data with administrative data can provide information on health outcomes and associations with other participant characteristics that can inform medical practice and health policy. Mirel (2022) listed areas in which linked data have been used in evidence-based policymaking: to study health insurance coverage and costs, to evaluate policies such as smoking-cessation programs, and to generate evidence that can be used to improve public health. She mentioned the following examples of studies that used linked NCHS data:

  1. Excess deaths associated with underweight, overweight, and obesity (NHANES-NDI linked data; Flegal et al., 2005);
  2. Air pollution exposure and heart disease mortality (NHIS-NDI; Parker, Kravets, & Vaidyanathan, 2018);
  3. Differences in adult mortality by education level (NHANES-NDI; Rogers, Hummer, & Everett, 2013);
  4. Comparing health characteristics of people who chose Medicare Advantage with those who chose Original (fee-for-service) Medicare (NHANES-Medicare enrollment; Mirel et al., 2012);
  5. Use of health services among Medicare enrollees who were previously uninsured (NHIS-Medicare enrollment and claims; Decker et al., 2012);
  6. Medical costs of chronic kidney disease in the Medicare population (NHANES-Medicare claims; Honeycutt et al., 2013);
  7. Housing assistance and children’s blood lead levels (NHANES-HUD; Ahrens et al., 2016; see Section 1.1);
  8. Cigarette smoking and adverse health outcomes among adults receiving federal housing assistance (NHIS-HUD; Helms, King, & Ashley, 2017); and
  9. Association between housing assistance, health insurance coverage, and unmet medical needs (NHIS-HUD; Simon et al., 2017).

The research in these studies could not have been done with the survey data alone or with the administrative data alone. In the first three studies, linkages between NCHS survey data and the NDI allowed researchers to examine the association between personal characteristics and risk factors (measured in the surveys) and mortality. Parker, Kravets, and Vaidyanathan (2018) also used the geocoding of NHIS data to link each survey participant with an annual estimate of fine particulate matter for the participant’s census tract. The researchers were thus able to control for risk factors such as body mass index and smoking status (from NHIS) when examining the association between air pollution and heart disease mortality.

In studies 4–6, information about the type of Medicare plan, health care usage, and medical costs came from the Medicare data. The health surveys also do not ask about housing assistance; that information, for studies 7–9, came from the linked HUD data.

In concordance studies, comparisons are done using the set of records that can be linked, and conclusions typically apply only to those data. For studying health outcomes, however, researchers want to make inferences to the U.S. population or specific subpopulations. The NHIS and NHANES are designed to be representative of the U.S. civilian noninstitutionalized population at the time of the survey, but the set of records that can be linked is not necessarily a random subsample of respondents (Golden et al., 2015, p. 38). In addition, some administrative records datasets include only part of the population of interest (for example, Medicare data do not have claims information on Medicare Advantage participants). The next section describes approaches for addressing potential differences between records that can, and cannot, be linked.

6.4 LINKAGE AND DATA EQUITY

This section looks at data-equity issues for linked datasets and possible steps for investigating and documenting them. The issues are described in the context of the NCHS linkages described in Section 6.3 but apply to other data-linkage programs as well.

Linkage Eligibility

Linkage with NHIS or NHANES records is performed only for “linkage-eligible” participants—those who have provided consent and have sufficient personally identifiable information to enable successful linkage. For NHIS, “[s]urvey participants are informed of NCHS’ intent to conduct data linkage activities through a variety of procedures such as ‘advance letters,’ participant brochures, and during the interview when verbal consent is requested” (NCHS, 2022b, p. 118). Participants are asked to supply the last four digits of their Social Security Numbers (or, if unwilling to provide that, asked if they consent to linkage that uses other identifying information). Children are linkage eligible if consent is provided by their parent or guardian and they have enough identifying information to enable linkage, but that consent applies only to administrative data about events occurring before the child reaches the legal adult age of 18.

One approach for analyzing linked datasets is to treat ineligibility for linkage as an additional stage of nonresponse, and to perform weight adjustments similar to those used to adjust for nonresponse. NCHS (2022d) described the procedure used to produce survey weights for analyzing linked NHIS-NDI data, which involved adjusting the survey weights for linkage-eligible respondents so that they sum to known population counts for sex, age, race, and ethnicity subgroups. This procedure produces estimates similar to those that would be obtained from all NHIS respondents if, within each demographic subgroup, health characteristics of linkage-eligible persons are similar to those of non-linkage-eligible persons.
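The general idea of such an adjustment can be sketched in a few lines of Python. This is a simplified cell-by-cell ratio adjustment under stated assumptions, not the NCHS (2022d) procedure itself: the function name, the record layout, the cell labels, and all numbers are hypothetical.

```python
from collections import defaultdict


def adjust_weights(records, control_totals):
    """Scale weights of linkage-eligible records to hit cell control totals.

    records        -- list of dicts with keys "cell", "weight", "eligible"
    control_totals -- dict mapping cell label -> known population count
    Returns a list of adjusted weights (0.0 for ineligible records).
    """
    # Sum the original weights of linkage-eligible records in each cell.
    eligible_sum = defaultdict(float)
    for r in records:
        if r["eligible"]:
            eligible_sum[r["cell"]] += r["weight"]

    adjusted = []
    for r in records:
        if r["eligible"]:
            # Ratio factor so that eligible weights in the cell sum to the control.
            factor = control_totals[r["cell"]] / eligible_sum[r["cell"]]
            adjusted.append(r["weight"] * factor)
        else:
            # Ineligible records drop out of the linked analysis file.
            adjusted.append(0.0)
    return adjusted


# Hypothetical respondents in two demographic cells.
records = [
    {"cell": "F 18-44", "weight": 1000.0, "eligible": True},
    {"cell": "F 18-44", "weight": 1500.0, "eligible": False},
    {"cell": "F 18-44", "weight": 1500.0, "eligible": True},
    {"cell": "M 18-44", "weight": 2000.0, "eligible": True},
    {"cell": "M 18-44", "weight": 1000.0, "eligible": False},
]
control_totals = {"F 18-44": 4000.0, "M 18-44": 3000.0}

new_weights = adjust_weights(records, control_totals)
print(new_weights)
```

Within each cell, the eligible records are scaled up to stand in for the ineligible ones, which is appropriate only under the assumption stated above that the two groups have similar health characteristics within the cell.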

Many surveys conduct nonresponse bias analyses, and similar analyses can be carried out to investigate possible bias from differences in linkage eligibility across subpopulations. For example, Aram et al. (2021) found that about 88 percent of sample adults in the 2010–2013 NHIS were linkage-eligible regardless of age group, sex, and education. Linkage eligibility was slightly higher (about 90%) for adults with diabetes or obesity, and slightly lower for Hispanic and non-Hispanic Asian adults (85.5% and 85.6%, respectively).

Aram et al. (2021) also investigated possible linkage bias by comparing estimates of demographic and health characteristics (diabetes, hypertension, obesity, fair or poor self-rated health, having a doctor’s office visit in the past year, and smoking) for the full NHIS sample (using the nonresponse-adjusted weights) with estimates calculated from the set of linkage-eligible records (using linkage-eligibility-adjusted weights). They found that, while there were large differences for some of these characteristics before 2007, estimates were similar for the 2010–2013 NHIS, indicating that restricting to linkage-eligible records did not increase bias for these characteristics for those years. Lloyd et al. (2017) investigated potential bias in linked NHIS-HUD and NHANES-HUD data by comparing estimates of housing characteristics and demographic information computed from HUD administrative files with estimates calculated from the set of linked records.

Linkage Errors

Probabilistic linkage procedures compute a “match score” for pairs of records (see Box 2-1). Pairs with high match scores are thought likely to belong to the same person, and pairs with low scores are likely to belong to different persons. Record pairs with scores in the middle might or might not be a true match. There is evidence, however, that linkage uncertainties and errors affect some population groups more than others (see Chapters 2 and 3). Miller, McCarty, and Parker (2017, p. 83) wrote: “With data coming from multiple sources, there will be differences in availability, quality, and format of unique identifiers, which could disproportionately affect minority populations.”
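One widely used way to compute a match score is the Fellegi-Sunter approach, which sums log-likelihood-ratio weights over the identifiers being compared. The Python sketch below illustrates that approach only; it is not a description of the NCHS algorithm, and the field names, the m- and u-probabilities, and the cutoffs are all hypothetical.

```python
import math

# Hypothetical m- and u-probabilities for each identifier:
# m = P(fields agree | same person), u = P(fields agree | different persons).
FIELD_PARAMS = {
    "ssn_last4": (0.95, 0.0001),
    "birth_year": (0.98, 0.02),
    "last_name": (0.90, 0.001),
    "sex": (0.99, 0.50),
}


def match_score(rec_a, rec_b, params=FIELD_PARAMS):
    """Sum log2 agreement/disagreement weights over the compared fields."""
    score = 0.0
    for field, (m, u) in params.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # missing identifier: contributes no evidence either way
        if a == b:
            score += math.log2(m / u)  # agreement raises the score
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return score


def classify(score, lower=0.0, upper=15.0):
    """Three-way decision using hypothetical cutoffs."""
    if score >= upper:
        return "likely match"
    if score <= lower:
        return "likely non-match"
    return "uncertain"


# Hypothetical survey record compared with three candidate death records.
survey = {"ssn_last4": "1234", "birth_year": 1950, "last_name": "RIVERA", "sex": "F"}
same = dict(survey)
sparse = {"ssn_last4": None, "birth_year": 1950, "last_name": "RIVERA-LOPEZ", "sex": "F"}
other = {"ssn_last4": "9876", "birth_year": 1962, "last_name": "SMITH", "sex": "M"}

for label, cand in [("same", same), ("sparse", sparse), ("other", other)]:
    s = match_score(survey, cand)
    print(f"{label}: score={s:.1f} -> {classify(s)}")
```

In this sketch, the "sparse" candidate has a missing Social Security Number and a differently recorded surname, so it lands in the uncertain middle range even though it could be the same person. That is the mechanism by which differences in the availability and format of identifiers can translate into more linkage uncertainty for some groups.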

Lariscy (2017) studied data-equity issues related to linkage uncertainty by examining the distribution of match scores for Black and White men and women in the NCHS linkage of data from the 1986–2009 NHIS with the NDI (see NCHS, 2009, for the linkage procedures used for these data and NCHS, 2022d, for current linkage procedures). Lariscy (2017) found that linkage quality was lower for Black adults than for White adults. Among the persons whom NCHS had determined to be deceased, 51 percent of Black women and 54 percent of Black men were in Class 1 (considered to have a high likelihood of being a true match), compared with 59 percent of White women and 66 percent of White men. Black decedents had lower mean scores than White decedents, indicating less certainty about the matches. Similarly, a higher percentage of White men and women who were deemed to be still living were placed in Class 5 (considered to have a high likelihood that there is no match in the NDI) than were Black men and women. In a similar study, Lariscy (2011) found more linkage uncertainty for Hispanic adults (and especially for foreign-born Hispanic adults) than for U.S.-born non-Hispanic White adults under the linkage consent rules and procedures used at that time.

Mortality rates estimated from linked data may be less accurate for population subgroups with more uncertainty about linkages. For example, Black et al. (2017) found that even small numbers of missed links between a survey and the NDI can result in large underestimation of mortality rates for older age groups, a phenomenon they dubbed the “Methuselah effect.”13
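The mechanism can be illustrated with a small deterministic calculation in Python. The cohort size, the deaths per age group, and the 5 percent missed-link rate below are hypothetical values chosen for illustration, not figures from Black et al. (2017).

```python
def mortality_rates(deaths_by_group, cohort_size, missed_link_rate=0.0):
    """Estimated death rate per age group when some deaths go unlinked.

    deaths_by_group  -- true number of deaths in each successive age group
    cohort_size      -- respondents alive at the start of follow-up
    missed_link_rate -- share of true deaths with no matching death record
    """
    rates = []
    apparently_alive = float(cohort_size)
    for true_deaths in deaths_by_group:
        linked_deaths = true_deaths * (1.0 - missed_link_rate)
        rates.append(linked_deaths / apparently_alive)
        # Only linked deaths leave the denominator; unlinked decedents
        # are carried forward as if they were still alive.
        apparently_alive -= linked_deaths
    return rates


# Hypothetical cohort of 1,000 respondents followed through four age groups;
# in truth, everyone has died by the end of the last group.
groups = ["65-74", "75-84", "85-94", "95+"]
true_deaths = [200, 350, 350, 100]

true_rates = mortality_rates(true_deaths, 1000)
est_rates = mortality_rates(true_deaths, 1000, missed_link_rate=0.05)

for g, t, e in zip(groups, true_rates, est_rates):
    print(f"{g}: true={t:.3f} estimated={e:.3f} ratio={e / t:.2f}")
```

With these inputs the estimated rate falls further below the true rate in each successive age group, because the unlinked decedents make up a growing share of the apparent survivors.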

Investigating and Documenting Properties of Linked Survey Data

NCHS has performed multiple investigations of the quality of linked datasets, and a review of its work suggests some “best practices” for investigating and documenting the quality of linked data.14

  • Identify the exact datasets that were linked, with an assessment of coverage, missing data, and measurement methods. Describe how the data were collected, maintained, cleaned, and processed for each source. Provide references to the full documentation of the individual data sources, including nonresponse bias analyses of the surveys being linked (or supply such documentation if it does not exist).
  • Provide full documentation of the linkage method used, including descriptions of the data elements used for linkage, the accuracy of those elements for each data source, and the algorithm followed. Also provide documentation of weighting adjustments or other methods used for estimating population characteristics from the linked data.
  • Report rates for linkage consent and eligibility, with disaggregated statistics by age, sex, race, ethnicity, and other subgroups. If probabilistic linkage is used, provide information about the distribution of match scores for population subgroups.
  • Provide disaggregated estimates of linkage error rates, with a description of how these were estimated. How many missed links and false links were found in validation studies?
  • Analyze additional bias that may occur when restricting analyses to the set of linkage-eligible individuals or linked records. As part of this analysis, compare estimates computed from the linkage-eligible respondents with estimates from the full set of survey respondents. If the administrative records form the population of interest, compare characteristics calculated from the set of linked records with characteristics calculated from the full set of administrative records.
  • Investigate discrepancies in measurements between the survey and the administrative dataset, for example, differences in self-reports of disease and reports in claims data.
  • Describe how linkage errors and uncertainties about linkage might affect analyses performed on the linked data. For some linkage methods, uncertainties about linkage can be a component of measures of uncertainty for statistics produced from the linked data.

___________________

13 The effect occurs because a survey respondent who died at age a but is not matched to the NDI inflates the denominator of the estimated mortality rate (the estimated number of persons still alive) for all ages greater than a. For each successive age group, as the number of “real” survivors in the denominator decreases, the number of “nonreal” survivors in the denominator increases (because of the cumulative missed links of all persons younger than that age group), resulting in a higher proportion of “nonreal” survivors in the denominator and a too-large estimate of the percentage of persons who live to an advanced age. See Arias (2021) for a discussion of how data quality affects comparisons of longevity across race and ethnicity groups.

14 See also the guidance presented by Bohensky et al. (2011); Davern, Roemer, and Thomas (2014); and Gilbert et al. (2018).

Each step involves consideration of data-equity aspects. Discussions—within the agency and with data users and community members—of how a proposed linkage project might affect population subgroups can promote transparency and raise awareness of community concerns. What are the potential benefits and harms of the linkage, and should the effort even be undertaken? How does linkage quality vary by age, sex, race, ethnicity, disability, and other characteristics? What are the implications of those disparities for research performed on the data? Future reports in this series will address privacy and confidentiality concerns for data linkage.
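One way to begin answering the question of how linkage quality varies across subgroups is to tabulate weighted linkage-eligibility rates by group and to compare an estimate from the linkage-eligible respondents with the same estimate from all respondents. The following is a minimal sketch with made-up records; the field layout, group labels, and weights are hypothetical, and the comparison uses the original survey weights with no eligibility adjustment, which is an assumption rather than a description of any agency's procedure.

```python
from collections import defaultdict

# Hypothetical records: (age_group, survey_weight, linkage_eligible, has_condition)
RESPONDENTS = [
    ("18-44", 1000.0, True, False),
    ("18-44", 1000.0, True, False),
    ("18-44", 1000.0, False, True),
    ("18-44", 1000.0, True, False),
    ("65+", 500.0, True, True),
    ("65+", 500.0, False, True),
    ("65+", 500.0, False, False),
    ("65+", 500.0, True, True),
]


def weighted_eligibility_by_group(records):
    """Weighted share of respondents who are linkage-eligible, per subgroup."""
    totals = defaultdict(float)
    eligible = defaultdict(float)
    for group, weight, is_eligible, _ in records:
        totals[group] += weight
        if is_eligible:
            eligible[group] += weight
    return {group: eligible[group] / totals[group] for group in totals}


def weighted_prevalence(records, eligible_only=False):
    """Weighted prevalence of the condition, optionally restricted to
    linkage-eligible respondents (weights are not readjusted)."""
    numerator = denominator = 0.0
    for _, weight, is_eligible, has_condition in records:
        if eligible_only and not is_eligible:
            continue
        denominator += weight
        if has_condition:
            numerator += weight
    return numerator / denominator


if __name__ == "__main__":
    print(weighted_eligibility_by_group(RESPONDENTS))
    print(weighted_prevalence(RESPONDENTS))
    print(weighted_prevalence(RESPONDENTS, eligible_only=True))
```

A gap between the two prevalence figures, as in this toy dataset, signals that restricting analysis to linkage-eligible respondents may introduce bias unless the weights are adjusted.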

CONCLUSION 6-1: The U.S. National Center for Health Statistics has linked many of its surveys with administrative records datasets, providing valuable resources for investigating long-term health outcomes and promoting evidence-based policy. These linkage procedures and documentation can serve as models for other partnerships between program-oriented and federal statistical agencies.

6.5 LINKAGE OF LONGITUDINAL HEALTH SURVEYS

Longitudinal datasets allow researchers to investigate the dynamics of human behavior, such as how participation in government transfer programs might relate to subsequent labor force behavior or utilization of health care services. Understanding these interactions in a dynamic environment is helped by linkage with administrative datasets. Chapter 4 described two longitudinal datasets formed by linking administrative records: the Longitudinal Business Database and the Longitudinal Employer-Household Dynamics database. Data from the longitudinal Survey of Income and Program Participation (see Chapter 5) have been used to study the dynamics of poverty over time.

Linkage of longitudinal surveys presents challenges beyond those for linking cross-sectional surveys (Calderwood & Lessof, 2009). As discussed in Section 4.1, population coverage of administrative datasets may change over time (for example, Medicaid coverage expanded after passage of the Affordable Care Act in 2010) or data-access rules may change. Characteristics measured in administrative data and definitions of those characteristics may also change over time, and variables used to link records may be missing or may use different categories across administrative datasets. Attrition in a longitudinal survey, when combined with missing administrative data and missed links, can cause the set of survey respondents having data across all time periods and for all variables of interest to be small. Issues of consent for longitudinal data linkage are also more complex (Jäckle et al., 2021a).
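A back-of-the-envelope calculation shows how quickly these losses compound. The sketch below assumes, purely for illustration, that consent, successful linkage, wave-to-wave retention, and administrative-data availability are independent of one another; the function name and every rate in the example are hypothetical.

```python
def expected_complete_share(consent_rate, link_rate, wave_retention, n_waves,
                            admin_coverage_per_wave=1.0):
    """Expected share of the baseline sample with survey responses and
    linked administrative data in every one of n_waves waves.

    Assumes independent losses (an illustrative simplification):
      - consent_rate: share consenting to linkage
      - link_rate: share of consenters successfully linked
      - wave_retention: share of respondents retained from one wave to the next
      - admin_coverage_per_wave: share with usable administrative data per wave
    """
    survey_complete = wave_retention ** (n_waves - 1)  # baseline wave is given
    admin_complete = admin_coverage_per_wave ** n_waves
    return consent_rate * link_rate * survey_complete * admin_complete


if __name__ == "__main__":
    # Hypothetical inputs: each rate looks high on its own.
    for waves in (1, 3, 5, 8):
        share = expected_complete_share(
            consent_rate=0.85, link_rate=0.95, wave_retention=0.90,
            n_waves=waves, admin_coverage_per_wave=0.98,
        )
        print(f"{waves} waves: {share:.3f}")
```

Even when each individual rate is high, the product shrinks with every added wave, which is why the complete-data subset can become small. In practice the losses are unlikely to be independent, so this calculation is only a rough guide.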

This section illustrates data linkages with the Health and Retirement Study (HRS), a nonfederal longitudinal survey (Faul & Levy, 2022).15 The HRS started in 1992, with a nationally representative sample of about 12,000 people who were between the ages of 51 and 61 at the time of the initial face-to-face interview. Additional cohorts of persons over the age of 50 have been added every six years so that there are approximately 20,000 respondents at any point in time; more than 40,000 respondents have participated altogether. Both members of a couple are included in the sample for all cohorts, and participants are interviewed every two years.

The repeated interviews allow researchers to study changes in health and economic circumstances that are associated with aging. The study collects detailed information about demographic characteristics, cognition, health status and functional limitations, use of health care services, work history and employment, retirement plans, net worth, income, health and life insurance, family structure, and subjective well-being.

___________________

15 See Sonnega (2017) and Fisher and Ryan (2018) for overviews of the HRS. Sonnega et al. (2014) described the sample design and weighting. The HRS is conducted by the University of Michigan Institute for Social Research as a cooperative agreement with the U.S. National Institute on Aging, with additional funding from the SSA. The National Institute on Aging also sponsors the Longitudinal Studies of Aging Network at the University of Michigan (https://micda.isr.umich.edu/networks/longitudinal-studies-of-aging/) to promote research related to data-collection procedures and measurement issues, and has been working to expand and facilitate linkages between aging studies that it funds and administrative records (Rose Li & Associates, 2016, 2019).

Each new cohort is selected through a probability sample of households. As participants age, however, some of them may move into nursing homes, and these respondents are retained and followed in the sample. Thus, although the HRS does not sample from nursing homes at the time of recruitment, the sample contains members of the U.S. nursing home population, and weights are constructed to allow researchers to study that population (Sonnega et al., 2014; Lee et al., 2021).

An important data-equity issue for the HRS is inclusion of people with cognitive impairments. Excluding people who are physically or mentally unable to answer survey questions would create bias and result in underestimates of the prevalence of conditions such as dementia. The HRS asks a proxy respondent (usually a family member) to provide information about a participant who is unable or unwilling to answer questions after the baseline interview, or when an interview has started but the interviewer has concerns about the participant’s ability to provide accurate information. About 9 percent of interviews overall, and 18 percent of those for persons aged 80 or older, are with proxy respondents (Sonnega et al., 2014).

HRS data are linked to sources of administrative information at the individual level. Respondents must consent to having their data linked. Faul and Levy (2022) reported that from 1996 to 2018, consent for linkage to Medicare records was obtained after three attempts for about 85–90 percent of respondents. Linkage to SSA records provides earnings histories, benefit histories, and application histories for disability and Supplemental Security Income of HRS participants.

One of the main goals of the HRS is to understand the relationship between medical history and financial status and how health care usage changes as people age. For respondents who consent to linkage, information about diagnoses and costs of treatment has been obtained from Medicare and Medicaid records. For HRS participants who served in the military, medical records have been obtained from the Department of Veterans Affairs. Linkage to the NDI tracks mortality. Information on employer-provided pension plans is obtained from businesses at which respondents are or have been employed.16

Researchers can access HRS data linked with other sources in a protected research environment. Faul and Levy (2022) mentioned the following recent studies that used linked data:17

___________________

16 See https://hrs.isr.umich.edu/data-products/restricted-data/available-products for a list of datasets linked to the HRS. In addition, the Census-Enhanced HRS project is linking HRS data to U.S. Census Bureau data on characteristics of respondents’ employers (https://cenhrs.isr.umich.edu/).

17 A bibliography of studies that have used the HRS is at https://hrs.isr.umich.edu/publications/biblio/. Fisher and Ryan (2018) gave an extensive description of the research areas involving the HRS.

  • Potential bias from dropouts and proxy reporters in the HRS, using NDI data to identify respondents who died and Medicare claims data to identify the earliest reported diagnostic code for dementia (Weir, Faul, & Langa, 2011).
  • Monetary cost of dementia, using self-reports on the HRS to estimate out-of-pocket spending and nursing home costs, and linked Medicare claims data to identify costs paid by Medicare (Hurd et al., 2013).
  • Long-term consequences of sepsis for cognition and physical function, obtaining characteristics of hospitalizations for severe sepsis from Medicare claims data (Iwashyna et al., 2010).
  • Knowledge about Social Security and pensions, comparing self-reported expected Social Security and pension income with benefit entitlements calculated from SSA earnings histories and employer pension plan descriptions (Gustman & Steinmeier, 2005).
  • Impact of employer match on retirement contributions, linking with SSA data to obtain earnings histories (Engelhardt & Kumar, 2007).
  • Delayed diagnoses of dementia for Black and Hispanic older adults, using HRS data on cognitive and daily function and linked Medicare/Medicaid claims data to identify the time of dementia diagnosis (Lin et al., 2021).
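Several of the studies above rely on finding the earliest claim that carries a diagnosis code of interest. A minimal sketch of that step follows. The record layout, person identifiers, and the short list of ICD-10-CM code prefixes are illustrative assumptions and are not the code lists or data formats used in the cited studies; claims predating the 2015 U.S. transition to ICD-10-CM would use ICD-9-CM codes instead.

```python
from datetime import date

# Illustrative ICD-10-CM prefixes only; NOT the code list of any cited study.
DEMENTIA_PREFIXES = ("F01", "F02", "F03", "G30")


def earliest_dementia_claim(claims):
    """Map each person ID to the date of the first claim with a dementia code.

    claims: iterable of (person_id, service_date, diagnosis_code) tuples.
    Persons with no qualifying claim are absent from the result.
    """
    first_seen = {}
    for person_id, service_date, code in claims:
        normalized = code.replace(".", "").upper()
        if not normalized.startswith(DEMENTIA_PREFIXES):
            continue
        if person_id not in first_seen or service_date < first_seen[person_id]:
            first_seen[person_id] = service_date
    return first_seen


if __name__ == "__main__":
    # Hypothetical claims for three linked respondents.
    sample_claims = [
        ("A", date(2017, 5, 2), "I10"),
        ("A", date(2018, 3, 9), "G30.9"),
        ("A", date(2016, 11, 20), "F03.90"),
        ("B", date(2019, 1, 15), "E11.9"),
        ("C", date(2020, 7, 1), "f01.50"),
    ]
    print(earliest_dementia_claim(sample_claims))
```

The resulting dates could then be compared with survey-based measures of cognitive function to study the timing of diagnosis relative to onset of impairment.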

CONCLUSION 6-2: Longitudinal surveys provide perspectives on individual and household behavior not available in cross-sectional surveys. Data from such longitudinal surveys can be enhanced through data linkages to create new opportunities for social science research.
