National Academies Press: OpenBook
Suggested Citation:"1 The Promise of Integrated Data." National Academies of Sciences, Engineering, and Medicine. 2023. Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources. Washington, DC: The National Academies Press. doi: 10.17226/26804.

1

The Promise of Integrated Data

Probability surveys have been a cornerstone of federal statistics since the 1940s. Back then, almost any kind of data collection was expensive, and probability survey samples provided a way to produce accurate statistics without having to measure everyone. Probability surveys still serve that role, but they have faced a number of challenges in recent years, including declining response rates, increasing costs, and user demand for timelier and more granular data and statistics. Meanwhile, there has been a proliferation of other data sources, including data collected by government agencies while administering programs (administrative records), satellite and sensor data, private-sector data such as electronic health records and credit card transaction data, and massive amounts of data available on the internet.

There is increasing interest in using non-survey data sources together with probability surveys to improve official statistics and create new data resources for social and economic research. Data and statistics from the federal government “provide the foundation for policymakers, businesses, and individuals to make informed decisions regarding the economy, society, and their lives. An improved national data infrastructure would provide many societal benefits, including improved decisionmaking and more informed public policy.”1

The Committee on National Statistics (CNSTAT) in the Division of Behavioral and Social Sciences and Education of the National Academies of Sciences, Engineering, and Medicine received funding from the National Science Foundation to convene three panels of experts in statistics, economics, social science research, survey methodology, privacy, public policy, and computer science, under the collective title Toward a Vision for a New Data Infrastructure for Federal Statistics and Social and Economic Research in the 21st Century.

___________________

1 https://www.nationalacademies.org/our-work/toward-a-vision-for-a-new-data-infrastructure-for-federal-statistics-and-social-and-economic-research-in-the-21st-century

Box 1-1 gives the Statement of Task for the three panels. Each panel was charged with convening a 1.5-day workshop on particular aspects of a vision for a new data infrastructure and writing a consensus panel report on those aspects. The first panel’s workshop, The Scope, Components, and Characteristics of a 21st Century Data Infrastructure, was held on December 9 and 16, 2021.2 This workshop explored recent data infrastructure initiatives in the federal government; presented examples of using private-sector data for statistical purposes; and discussed legal, privacy, and access issues in using alternative data sources for official statistics. Box 1-2 reproduces the seven key attributes for a new data infrastructure from the report on the first workshop. The third scheduled workshop, and possible additional future workshops, will delve more deeply into practical and legal considerations for obtaining access to data, information technology aspects of an infrastructure that draws on multiple data sources, and protecting the privacy of entities supplying data and the confidentiality of the data that are supplied.

The panel for this, the second of the three reports, was specifically directed to concentrate on issues relating to The Implications of Using Multiple Data Sources for Major Survey Programs. Which programs might benefit from the use of alternative data sources? How might non-survey data—data such as administrative records that are collected for purposes other than creating official statistics—supplement survey and census data to provide a more accurate, complete, and timely picture of U.S. residents, households, and businesses?

This report builds on previous CNSTAT reports about using multiple data sources to produce statistics and enhance research, including:

  • Modernizing Crime Statistics (NASEM, 2016a, 2018);
  • Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy (NASEM, 2017c);
  • Federal Statistics, Multiple Data Sources, and Privacy Protection: Next Steps (NASEM, 2017a);
  • Improving Crop Estimates by Integrating Multiple Data Sources (NASEM, 2017b);
  • A Satellite Account to Measure the Retail Transformation: Organizational, Conceptual, and Data Foundations (NASEM, 2021a);
  • A Vision and Roadmap for Education Statistics (NASEM, 2022a);
  • Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies (NASEM, 2022e);
  • Modernizing the Consumer Price Index for the 21st Century (NASEM, 2022d); and
  • Toward a 21st Century National Data Infrastructure: Mobilizing Data for the Common Good (NASEM, 2023), the report of the first panel in the project “Toward a Vision for a New Data Infrastructure for Federal Statistics and Social and Economic Research in the 21st Century” (Report 1 in Box 1-1).

___________________

2 Video and presentations from the first workshop are available at https://www.nationalacademies.org/event/12-09-2021/the-scope-components-and-key-characteristics-of-a-21st-century-data-infrastructure-workshop-1a and https://www.nationalacademies.org/event/12-16-2021/the-scope-components-and-key-characteristics-of-a-21st-century-data-infrastructure-workshop-1b

This report examines current practice and potential for using data originating from administrative records, private-sector organizations, sensors and satellites, and other sources to enhance the timeliness, detail, and accuracy of information currently collected through surveys. The use of multiple data sources can promote data equity, through providing more accurate representation of population subgroups that have historically been underrepresented or misrepresented in the data ecosystem, as discussed in Chapter 3.

The chapter begins with an example that sets the context for the report and a brief discussion of what makes data fit for use. Section 1.1 describes the potential of combined data sources to improve evidence-based policymaking and gives an example in which using multiple data sources to investigate childhood lead exposure resulted in new information that was used to change policy. Section 1.2 discusses frameworks for evaluating the quality of statistics calculated from single and multiple data sources. Section 1.3 describes panel activities and approach to gathering information, and Section 1.4 provides a roadmap to the rest of the report.

1.1 AN EXAMPLE OF ENHANCING SURVEY DATA FOR POLICYMAKING

The U.S. Commission on Evidence-Based Policymaking was formed as the result of bipartisan legislation, the Evidence-Based Policymaking Commission Act of 2016 (U.S. Congress, 2016). One of the explicit charges to the Commission was to “[d]etermine the optimal arrangement for which administrative data, survey data, and related statistical data series may be integrated and made available for evidence building while protecting privacy and confidentiality” (U.S. Commission on Evidence-Based Policymaking, 2017, p. 7). The Commission’s final report stated: “There are many barriers to the effective use of government data to generate evidence. Better access to these data holds the potential for substantial gains for society. The Commission’s recommendations recognize that the country’s laws and practices are not currently optimized to support the use of data for evidence building, nor in a manner that best protects privacy” (p. 1). It also noted: “The strategy outlined in the Commission’s report simultaneously improves privacy protections and makes better use of data the government already collects to support policymaking” (p. 3).

The ensuing Foundations for Evidence-Based Policymaking Act of 2018 (U.S. Congress, 2019) has become the cornerstone for many projected improvements in U.S. statistics. The Commission’s report included a number of examples that demonstrated “the promise of evidence-based policymaking,” specifically noting that “administrative data, collected in the first instance to serve routine program operation purposes, also can be used to assess how well programs are achieving their intended goals” (U.S. Commission on Evidence-Based Policymaking, 2017, p. 9). Examples included using administrative records to study permanent supportive housing for chronic homelessness, substance abuse education, and workforce investment. The Commission also pointed to the value of reducing the burden on survey respondents:

Respondents have become less willing to participate in surveys and are increasingly reluctant to respond to questions about income. When they do answer questions about income, they are providing less accurate responses. The burden on respondents could be reduced and the accuracy of the data improved if statistical agencies were able to rely more on the income data the government already maintains to administer tax, income support, and social insurance programs (U.S. Commission on Evidence-Based Policymaking, 2017, p. 25).

Mirel (2022) reported on an example that showed how using multiple data sources could promote evidence-based policymaking for improving public health. In children, even small amounts of lead exposure can cause serious and irreversible mental and physical health problems; high levels can be fatal. But childhood lead poisoning can be prevented. Large declines “in blood lead levels occurred from the 1970s to the 1990s following the elimination of lead in motor-vehicle gasoline, the ban on lead paint for residential use, removal of lead from solder in food cans, bans on the use of lead pipes and plumbing fixtures and other limitations on the uses of lead” (President’s Task Force on Environmental Health Risks and Safety Risks to Children, 2016, p. 5).

These declines are known to have occurred because the National Health and Nutrition Examination Survey (NHANES), a nationally representative survey initiated in 1960 to assess the health and nutrition status of adults and children in the United States, began measuring blood lead levels in 1976. According to NHANES data, the median blood lead level in children aged 1–5 dropped from 15 micrograms per deciliter in 1976–1980 to 0.6 micrograms per deciliter in 2017–2018, with most of the reduction occurring before 1990 (EPA, 2022).

Despite this progress, “lead exposure remains an important public health problem among children particularly for those in high-risk groups” (Egan et al., 2021, p. 10).3 A major source of childhood lead exposure in the United States “is lead-based paint and lead-contaminated dust found in buildings built before 1978.”4 Using data from the American Healthy Homes Survey, the U.S. Department of Housing and Urban Development (HUD) estimated that, in 2019, approximately 35 million housing units contained lead-based paint somewhere in the building, with about 90 percent of those units built before 1978 (HUD, 2021). Households receiving government housing assistance had statistically significantly lower levels of lead-based paint hazards than those not receiving assistance (11 percent versus 20 percent).5

___________________

3 Egan et al. (2021), analyzing NHANES data between 1976 and 2016, found that higher childhood blood lead levels were associated with non-Hispanic Black race and ethnicity, having family income below 130 percent of the poverty level, and living in older housing. See Rabin (1989) for a history of childhood lead poisoning in the United States.

4 https://www.cdc.gov/nceh/tracking/topics/ChildhoodLeadPoisoning.htm

While there were indications that HUD-assisted housing units had lower levels of lead hazards, no single dataset included both designations of HUD-assisted housing and information on children’s blood lead levels, which would enable evaluating associations between children’s health and living in HUD-assisted housing. The NHANES data contained blood lead levels and other health information about respondents, but no information on whether respondents lived in assisted housing. HUD’s annual data about participants in housing-assistance programs (administrative records collected through the local housing authorities that administer the programs) had no information on tenants’ health.

To study health characteristics (including blood lead levels) of children and adults living in HUD-assisted housing, HUD collaborated with the U.S. National Center for Health Statistics (NCHS), which administers the NHANES (Mirel et al., 2019a). Data from the 1999–2012 NHANES were linked to records for the same households in the HUD tenant data (with strict controls over access to those linked data).6

Researchers analyzing the linked data found that children living in HUD-assisted housing from 2005 to 2012 had lower blood lead levels than comparable children who did not receive housing assistance (see Ahrens et al., 2016). HUD used evidence from this and other observational research conducted on the linked NCHS-HUD data “to support the continued removal of lead-based paint hazards in HUD homes” and “cited this evidence in a proposed rule to lower the threshold for elevated blood lead level determination to align with CDC [Centers for Disease Control and Prevention] standards” (Mirel, 2022, slide 6).

___________________

5 The Residential Lead-Based Paint Hazard Reduction Act of 1992 (U.S. Congress, 1992) and other legislation instituted requirements for lead-based paint notification, evaluation, and reduction for housing receiving federal assistance.

6 Lloyd et al. (2017) described the linkage process (also see NCHS, 2022c). To be eligible for linkage to HUD data, a NHANES participant must have consented for their data to be linked and provided sufficient data elements (including full or partial Social Security Number, full name, and month and year of birth) for the linkage to be attempted. About 65 percent of the 1999–2012 NHANES medical examination participants were eligible for linkage, and about 13 percent of those were linked to the HUD data (Lloyd et al., 2017, p. 14). In analyses using the linked data, NHANES participants who were matched with a record in the HUD data were considered to be receiving housing assistance, and linkage-eligible NHANES participants who could not be matched with a record in the HUD data were considered to be not receiving housing assistance. See Chapters 2, 3, and 6.


By linking administrative records from HUD with survey data from NHANES, investigators could identify children in the NHANES dataset who lived in federally assisted housing. Linking the two datasets produced information not available from either source by itself, without requiring additional data collection. The editors of the volume Evidence Works concluded: “Combining data can produce valuable insights” (Hart & Yohannes, 2019, p. 121).
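
The general logic of the linkage described in footnote 6 (restrict to linkage-eligible survey participants, attempt a match to the administrative file, and treat matched participants as assisted and unmatched participants as not assisted) can be illustrated with a minimal sketch. The Python example below uses the pandas library and entirely invented records. The column names (`person_id`, `consented`, `has_identifiers`, `blood_lead_ugdl`) and the use of a single exact-match key are assumptions made for illustration; they do not reproduce the NCHS-HUD linkage procedure, which relied on several identifying data elements (Lloyd et al., 2017).

```python
import pandas as pd

# Invented survey records standing in for survey participants.
survey = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "consented": [True, True, False, True, True],
    "has_identifiers": [True, True, True, False, True],
    "blood_lead_ugdl": [0.8, 1.9, 1.1, 2.4, 0.6],
})

# Invented administrative records standing in for a tenant file.
admin = pd.DataFrame({"person_id": [2, 5, 9]})

# Step 1: keep only participants who are eligible for linkage.
eligible = survey[survey["consented"] & survey["has_identifiers"]]

# Step 2: left-join on the (hypothetical) shared key; the "_merge" column
# added by indicator=True records whether a match was found.
linked = eligible.merge(admin, on="person_id", how="left", indicator=True)

# Step 3: matched participants are classified as assisted; eligible but
# unmatched participants are classified as not assisted.
linked["housing"] = [
    "assisted" if m == "both" else "not assisted" for m in linked["_merge"]
]

# Compare the outcome across the two groups.
summary = linked.groupby("housing")["blood_lead_ugdl"].mean()
print(summary)
```

With these invented records, three participants are eligible for linkage and two of them match an administrative record. Note that the classification in Step 3 treats a failed match as evidence of no assistance, so any linkage errors carry over directly into the comparison.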

1.2 PRODUCING STATISTICS THAT ARE FIT FOR USE

The U.S. Office of Management and Budget’s (OMB) Statistical Policy Directive No. 1 states: “It is the responsibility of Federal statistical agencies and recognized statistical units to produce and disseminate relevant and timely information; conduct credible, accurate, and objective statistical activities; and protect the trust of information providers by ensuring confidentiality and exclusive statistical use of their responses…” (OMB, 2014, p. 71614). In 2021, OMB also issued guidance for implementing the Foundations for Evidence-Based Policymaking Act of 2018, which is part of a large collection of laws and regulations governing data sharing within the federal statistical system and with the public.7 The guidance specified that the data be “fit for use” or “fit for purpose”:

Underlying all of the methodological approaches outlined here are the data collected and used in Federal evidence-building activities. Ensuring that those data are reliable, high-quality, and fit for their intended purpose is essential to restoring trust in Government (OMB, 2021, p. 11).

OMB’s Federal Data Strategy was designed to create “a framework of operational principles and best practices that help agencies deliver on the promise of data in the 21st century” (OMB, 2019a, p. 1). In addition to desiring that agencies implement ethical governance and create a learning culture, the strategy specifically addressed four elements of “conscious design”:

  • Ensure Relevance: Protect the quality and integrity of the data. Validate that data are appropriate, accurate, objective, accessible, useful, understandable, and timely.
  • Harness Existing Data: Identify data needs to inform priority research and policy questions; reuse data if possible and acquire additional data if needed.
  • Anticipate Future Uses: Create data thoughtfully, considering fitness for use by others; plan for reuse and build in interoperability from the start.
  • Demonstrate Responsiveness: Improve data collection, analysis, and dissemination with ongoing input from users and stakeholders. The feedback process is cyclical; establish a baseline, gain support, collaborate, and refine continuously (OMB, 2019a, pp. 2–3).

___________________

7 Principles and Practices for a Federal Statistical Agency (NASEM, 2021b, Appendix A) lists laws and standards that govern federal data collection and sharing.

These OMB guidelines emphasize the importance of validating the quality of data and ensuring that they are fit for use—not just for the immediate purpose but also for possible future reuse. Groves and Lyberg (2010, p. 873) noted: “Because statistics are of little importance without their use being specified, ‘fitness for use’ is of growing importance as a quality concept.… It is relatively common for national statistical agencies to refer to their quality frameworks as a means to achieve fitness for use.”

Traditionally, statistics from probability surveys have been accompanied by margins of error or confidence intervals that provide a measure of their accuracy. Modern data-quality frameworks, however, argue that quality is multidimensional:

Quality is defined as “fitness for use” in terms of user needs. This definition is broader than has been customary [sic] used in the past when quality was equated with accuracy. It is now generally recognised that there are other important dimensions. Even if data is accurate, they cannot be said to be of good quality if they are produced too late to be useful, or cannot be easily accessed, or appear to conflict with other data. Thus, quality is viewed as a multi-faceted concept. The quality characteristics of most importance depend on user perspectives, needs and priorities, which vary across groups of users…. [T]he OECD views quality in terms of seven dimensions: relevance; accuracy; credibility; timeliness; accessibility; interpretability; and coherence (Organisation for Economic Co-operation and Development, 2012, p. 7).

More recent statements on data quality have kept the same seven basic dimensions of quality but have added guidelines for assessing the quality of integrated data sources (Federal Committee on Statistical Methodology, 2018, 2020; Statistics Canada, 2019, 2022; Eurostat, 2021; see also the review of international quality standards in Czajka & Stange, 2018).

The Federal Committee on Statistical Methodology (2020, p. 2) also added a dimension of public trust to earlier ideas of “fitness for use,” defining data quality as “the degree to which data capture the desired information using appropriate methodology in a manner that sustains public trust.” Their data-quality framework, reproduced in Figure 1-1, encompasses 11 dimensions, categorized within the broader headings of utility, objectivity, and integrity.


As Brackstone (1999, p. 140) noted, dimensions of quality “are not independent of each other…. Accuracy and timeliness often have to be traded off against each other. Coherence and relevance can sometimes be in conflict as the needs of current relevance and historical consistency compete. Information provided to ensure information is interpretable will also serve to define its coherence.” For example, making efforts to obtain complete data, studying measurement properties, cleaning data, and evaluating sources of uncertainty in individual and combined data sources all contribute to increased accuracy but also increase the amount of time needed to produce statistics.

The use of alternative data sources such as administrative records complicates the assessment of quality because of the many types of data sources and the many paths that can be taken to integrate data and statistics (see Chapter 2). Each individual data source has its own quality profile with respect to the dimensions in Figure 1-1. When multiple data sources are combined, quality assessments must consider the quality of each source as well as the quality of the combined data.

Paths for using multiple data sources, and possible implications for data quality, include:

  • Using administrative records directly to give a picture of the population found in the administrative records system (see Chapter 4). In some situations, an administrative data source may replace a survey; in such cases, it is important to ensure that statistics produced by the administrative data can be compared with previous statistics produced by the survey.
  • Using administrative records or other data sources as input to statistical models developed to estimate population characteristics, as in the U.S. Census Bureau’s Small Area Income and Poverty Estimates program (see Box 2-2) or the National Agricultural Statistics Service’s Crops County Estimates Program (see Chapter 8). Quality assessment involves evaluating the performance of both the statistical models and the individual data sources.
  • Linking administrative records or private-sector data records with records from a survey or the decennial census, to extend the number of attributes known about the entities in the survey or census. When individual records from a survey are linked with those from an administrative records dataset (as in the blood lead example discussed in Section 1.1), the accuracy of statistics calculated from the linked data depends on the quality of each individual data source, the accuracy of the data linkage, and the characteristics of the linked dataset.
  • Merging datasets or integrating statistics calculated from separate datasets to compensate for the underrepresentation of certain population subgroups in some of the data sources. For example, information from a survey of the civilian noninstitutional population might be combined with information collected from institutions such as prisons and nursing homes. The quality of estimates depends on that of each source and on the alignment of the data sources (sometimes the same entity appears in multiple sources and duplication must be identified when producing population statistics). In addition, the concepts might be measured differently in the data sources, and possible consequences of the measurement differences need to be investigated.
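
The duplication issue noted in the last path can be made concrete with a small sketch. The Python example below uses the pandas library and invented rosters. It assumes that both sources carry a common identifier (the hypothetical column `entity_id`), which is often not the case in practice, and it ignores survey weights entirely. It is meant only to show why entities appearing in more than one source must be identified before counting.

```python
import pandas as pd

# Invented roster from a household-based source.
household = pd.DataFrame({
    "entity_id": ["a1", "a2", "a3", "a4"],
    "source": "household_survey",
})

# Invented roster from an institutional source (for example, nursing homes).
institution = pd.DataFrame({
    "entity_id": ["a4", "b1", "b2"],
    "source": "institution_list",
})

# Stack the two sources into one file.
combined = pd.concat([household, institution], ignore_index=True)

# Flag every record whose entity appears in more than one source.
combined["duplicate"] = combined.duplicated(subset="entity_id", keep=False)

# Keep one record per entity before producing a population count.
deduplicated = combined.drop_duplicates(subset="entity_id", keep="first")

print(len(combined), len(deduplicated))
```

Here the entity `a4` appears in both rosters, so a naive count of the stacked file would overstate the population by one. Which of the duplicate records to keep is a design decision that depends on the relative quality of the sources.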

For all of these paths, the resulting integrated datasets and statistics must be of sufficient quality to meet user needs. The National Academies (NASEM, 2017a, p. 109) emphasized that “the quality of administrative and private-sector data sources needs careful examination before being used for federal statistics,” because of “the relatively recent novelty of the simultaneous use of multiple data sources and the fact that some potential new sources of data present new issues of data quality.” The United Nations Inter-Secretariat Working Group on Household Surveys (2022) and Chen (2022) emphasized the importance of establishing a “total quality framework” for data integration.

One important aspect of fitness for use involves regularly produced statistics that are used to monitor aspects of society. Consistent measurement of statistics such as monthly unemployment rates or annual crime rates facilitates comparisons across time periods and geographic locations. Switching to administrative records or combined data sources may affect the time series for these indicators, and these potential effects need to be thoroughly investigated.

The use of multiple data sources can help improve the quality of data collected in surveys, even if the data are not combined. For instance, linking records for two sources that each measure wage income can provide information that can be used to improve income measurement. Non-survey data can also improve the quality of probability surveys by augmenting the sampling frame or providing information that can be used to adjust for nonresponse.
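
One way auxiliary information can be used to adjust for nonresponse is a weighting-class adjustment, in which respondents' weights are inflated so that they also represent nonrespondents in the same class. The sketch below, written in Python with the pandas library, is an illustration under stated assumptions rather than a description of any agency's procedure. The data are invented, and the class variable `admin_class` is a hypothetical attribute assumed to be known for every sampled unit from a non-survey source.

```python
import pandas as pd

# Invented sample: base weights, a class variable known for every sampled
# unit from an auxiliary (non-survey) source, and a response indicator.
sample = pd.DataFrame({
    "base_weight": [10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0, 20.0],
    "admin_class": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "responded": [True, True, False, False, True, True, True, False],
})

# Weight totals for all sampled units and for respondents, by class.
total_by_class = sample.groupby("admin_class")["base_weight"].sum()
resp_by_class = (
    sample[sample["responded"]].groupby("admin_class")["base_weight"].sum()
)

# Adjustment factor: inverse of the weighted response rate in each class.
factor = total_by_class / resp_by_class

# Apply the class-specific factor to each respondent's base weight.
respondents = sample[sample["responded"]].copy()
respondents["adj_weight"] = (
    respondents["base_weight"] * respondents["admin_class"].map(factor)
)

print(respondents[["admin_class", "base_weight", "adj_weight"]])
```

After the adjustment, the respondents' weights in each class sum to the weight total of all sampled units in that class. The adjustment reduces nonresponse bias only to the extent that respondents and nonrespondents within a class are similar on the survey outcomes, which is why the choice of auxiliary variables matters.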

1.3 STUDY APPROACH AND INFORMATION GATHERING

Between December 2021 and September 2022, the panel held nine closed virtual meetings to organize the 1.5-day workshop, decide on the study conclusions, and discuss drafts of the report.

Three early panel decisions defined the scope of the project:

  • The federal government collects data on thousands of topics every year, from seat belt use to welfare of veterans to household energy consumption to adult literacy.8 No report of reasonable length could possibly cover the implications of using multiple data sources for each of these surveys. The panel decided to focus on a small set of “use cases” that represent different ways that multiple data sources are, or could be, exploited and that illustrate the types of challenges to be faced.
    • Statistical agencies and researchers in the areas of income and health statistics have done extensive work on methods for linking survey and administrative records datasets. The panel decided to devote a workshop session to recent data-linkage projects involving income and health data that illustrate the current “state of the art” and show the potential for data linkage in other subject areas. These projects involved both cross-sectional datasets, which contain information for one time point, and longitudinal datasets, which follow individuals or businesses over time.
    • Crime statistics published by the Federal Bureau of Investigation are compiled from information submitted by individual law enforcement agencies (data submission is usually coordinated through state programs). The data collection is intended to be a census of the more than 18,000 law enforcement agencies in the United States. Challenges include missing data and ensuring consistency in the measurement of crime across agencies and across time.
    • Survey data about agriculture can be enhanced using information from administrative records, satellites, and sensors. In this application, survey data are collected on farm operations, as opposed to individual persons, and some of the issues faced are similar to those in other establishment surveys. Challenges include aligning geographic units in the data sources and developing models to produce crop estimates for small geographic areas.
  • The panel was tasked with examining the implications of using multiple data sources “to assess and enhance representativeness” and “for population subgroup coverage” (see Box 1-1). The panel decided to address these issues through the lens of data equity, examining how multiple data sources might affect the representation of population subgroups that have historically been underrepresented or misrepresented in the data record.
  • The panel decided to exclude or de-emphasize topics that, while essential for the development of a new data infrastructure that uses multiple sources of data, were delineated in the Statement of Task (Box 1-1) as primary focuses of Reports 1 and 3. Thus, this workshop and report do not include extensive discussions of:
    • Legal agreements needed for data sharing;
    • Computer infrastructure for blended data;
    • Methods for providing public access to data; and
    • Methods for protecting the privacy and confidentiality of people, businesses, and other entities whose data are used.

    The panel recognizes, however, that these issues are crucial considerations and that the work ahead must integrate them into the vision for a new data infrastructure.

___________________

8 Many surveys are conducted by the 13 principal U.S. statistical agencies (see NASEM, 2021b, Appendix B): the Bureau of Economic Analysis (U.S. Department of Commerce), Bureau of Justice Statistics (U.S. Department of Justice), Bureau of Labor Statistics (U.S. Department of Labor), Bureau of Transportation Statistics (U.S. Department of Transportation), Census Bureau (U.S. Department of Commerce), Economic Research Service (U.S. Department of Agriculture), Energy Information Administration (U.S. Department of Energy), National Agricultural Statistics Service (U.S. Department of Agriculture), National Center for Education Statistics (U.S. Department of Education), National Center for Health Statistics (U.S. Department of Health and Human Services), National Center for Science and Engineering Statistics (National Science Foundation), Office of Research, Evaluation, and Statistics (Social Security Administration), and Statistics of Income (U.S. Department of the Treasury). Other federal agencies also collect data; for example, the National Highway Traffic Safety Administration (U.S. Department of Transportation) collects data on traffic crashes and seat belt use.

The public virtual workshop on Implications of Using Multiple Data Sources for Major Survey Programs was held on May 16 and 18, 2022. The five sessions of the workshop were organized according to the decisions outlined above, with an overview session followed by the use cases and a final session on data equity:

  1. Opportunities for Using Multiple Data Sources to Enhance Major Survey Programs
  2. Measuring Crime in the 21st Century: A Panel Discussion
  3. Improving Agriculture Statistics with New Data Sources
  4. Data Linkage for Income and Health Statistics
  5. Issues in Data Equity

The full agenda for the workshop is provided in Appendix A, and video and presentation slides are available online.9 The panel asked workshop participants to explore how using alternative data sources such as administrative records, health records, satellite and sensor data, and private-sector data can improve the quality, granularity, timeliness, and equity of data in major survey programs.

___________________

9https://www.nationalacademies.org/event/05-16-2022/the-implications-of-using-multiple-data-sources-for-major-survey-programs-workshop

This report relies on information presented by experts from federal and state governments, academic institutions, and international statistical organizations who participated in the workshop; public comments made during the workshop; and comments from the report’s reviewers. In addition, panel members reviewed more than 800 books, research articles, technical reports, and informational websites to provide additional examples and background for the discussion. The report reflects the information available as of the fall of 2022, when the panel completed the bulk of the work on this report.

1.4 ORGANIZATION OF THE REPORT

The remainder of the report proceeds as follows. Chapter 2 discusses various data types and their sources: probability samples; administrative records; private-sector data; satellite, sensor, and location data; convenience samples; and data obtained from social media, webscraping, and crowdsourcing. It also outlines some of the methods that can be used to combine data from multiple sources, such as linking data records, combining statistics from multiple sources, and using statistical models to predict values for missing data and to merge information from separate data sources.

Chapter 3 introduces the key theme of data equity. The chapter starts by defining aspects of data equity and then looks at examples of how using multiple data sources can improve the representation of population groups that have historically been underrepresented, unmeasured, or mismeasured in the data record. It also explores how misuses of available data sources might exacerbate data inequity.

Chapter 4 focuses on examples in which administrative records are used directly to produce statistics, largely bypassing surveys. The chapter begins with a description of three longitudinal databases assembled by the U.S. Census Bureau to study economic activity and population dynamics. The chapter then describes the Frames project, under way at the U.S. Census Bureau, which is intended to link information from the Bureau’s various databases to improve the accuracy and inclusiveness of the population and business listings maintained for drawing probability samples and other purposes. The National Vital Statistics System, coordinated by the National Center for Health Statistics (NCHS), is a model for cooperation in building an administrative data system based on data submissions by states. State-level systems of linked administrative records demonstrate both the promise of integrated data and the challenges of harmonizing data concepts across sources.

Chapters 5–8 concentrate on four subject areas (income, health, crime, and agriculture), each with a different experience in the use of administrative records and other non-survey data. Chapters 5 and 6 focus on the extensive programs of data linkage that have been implemented or are in progress for improving income and health statistics, respectively. Chapter 5 emphasizes the use of administrative data to study properties of income measurement, while Chapter 6 focuses on the ability to add data about health outcomes and expenditures to the records of survey participants. Chapter 7 discusses challenges in measuring crime as the Uniform Crime Reporting Program, which collects data on criminal offenses from law enforcement agencies, has migrated from a system that measured only counts of offenses to a system that records detailed information about the victims, offenders, and characteristics of incidents, but with fewer law enforcement agencies providing data to the federal government. The chapter discusses the potential for using statistical modeling and linkage to provide increased geographic and subpopulation detail and more timely statistics. Chapter 8 focuses on agricultural statistics, where external data sources including administrative and satellite data are already being used to improve crop estimates.

Chapter 9 concludes the report with a discussion of common themes for the case studies and opportunities and challenges for moving forward.

Much of the statistical information currently produced by federal statistical agencies - information about economic, social, and physical well-being that is essential for the functioning of modern society - comes from sample surveys. In recent years, there has been a proliferation of data from other sources, including data collected by government agencies while administering programs, satellite and sensor data, private-sector data such as electronic health records and credit card transaction data, and massive amounts of data available on the internet. How can these data sources be used to enhance the information currently collected on surveys, and to provide new frontiers for producing information and statistics to benefit American society?

Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources, the second report in a series funded by the National Science Foundation, discusses how use of multiple data sources can improve the quality of national and subnational statistics while promoting data equity. This report explores implications of combining survey data with other data sources through examples relating to the areas of income, health, crime, and agriculture.
