Much of the statistical information produced by federal statistical agencies since the 1950s—information about economic, social, and physical well-being that is essential for the functioning of modern society—has come from sample surveys. Data from these surveys have been used to inform economic, social, and health policies; evaluate the effects of those policies; monitor the health and economic circumstances of the population; inform decisionmaking by businesses and individuals; and produce vast quantities of economic, health, and social research that informs the public and can lead to societal benefits. As the National Academies of Sciences, Engineering, and Medicine report Principles and Practices for a Federal Statistical Agency stated: “It is impossible to capture the full economic and societal value of having reliable data on economic, social, health, agricultural, industrial, and environmental characteristics of the country” (NASEM, 2021b, p. 14).
At the time they were established, many sample surveys represented the only way to obtain reliable, accurate, and regularly updated information about the population and businesses of the United States. But surveys have faced a number of challenges in recent years, including decreasing response rates, increasing costs, and user demand for more timely and more granular data and statistics. At the same time, there has been a proliferation of data from other sources, including data collected by government agencies while administering programs (administrative records), satellite and sensor data, private-sector data such as electronic health records and credit card transaction data, and massive amounts of data available on the internet. How can these new data sources be used to supplement or replace some of the information currently collected on surveys, and to provide new frontiers for producing information and statistics to benefit American society?
To answer those questions, the National Academies, with funding from the National Science Foundation, appointed three consensus panels to develop a vision for a new data infrastructure for national statistics and social and economic research in the 21st century. Each panel was asked to examine a separate aspect of the new data infrastructure. The first panel’s report (NASEM, 2023) discussed legal, privacy, and access issues related to using alternative data sources for official statistics, and it identified seven key attributes for a new data infrastructure.
The Statement of Task for this second panel, the Panel on the Implications of Using Multiple Data Sources for Major Survey Programs, directed the panel to examine how survey programs might be affected by the use of alternative data sources, including:
- Addressing changes in measurement with new data sources;
- Approaches for linking alternative data sources to universe frames to assess and enhance representativeness; and
- Implications of new data sources for population subgroup coverage and life-course longitudinal data.
A diverse panel—with expertise spanning areas of statistics, survey methodology, economics, sociology, psychology, public policy, equity analytics, public health, geography, and demography—was formed to study these issues. The panel convened a 1.5-day virtual public workshop to seek input from external experts about survey programs that might benefit from use of non-survey data sources, and about how these data sources might be used to produce more accurate, detailed, and timely information. Realizing that no single workshop or report could possibly cover the implications of using multiple data sources for each of the thousands of federal data collections, the panel decided to focus on a small set of “use cases”—from the areas of income, health, crime, and agricultural statistics—that represent different ways in which multiple data sources are, or could be, exploited and that illustrate the types of challenges to be faced. Examples from these areas anchor the discussion of the report’s themes.
Use of multiple data sources can add value for the production of official statistics as well as for research. However, combining information across data sources must be done carefully, with deep understanding of the properties of each component dataset and the statistics resulting from their combination. The process begins by evaluating the quality of each data source through assessing how well each source meets the needs it is asked to address (fitness for use). Additional evaluations are needed of the quality of the data resources and of the statistics generated from combined datasets. Frameworks exist for evaluating the quality of data from probability samples; standards for the quality of integrated data and statistics would promote sound practices and help federal statistical agencies and data users understand these new data products.
CONCLUSION 2-2: Numerous data sources, including probability samples, administrative records, and private-sector data, could be used to produce official statistics if they meet standards for quality. Each data source has specific tradeoffs in terms of timeliness, population coverage, amount of geographic or subgroup detail, concepts measured, accuracy, and continuing availability. Relying on multiple sources can take advantage of the strengths of each source while compensating for its weaknesses.
CONCLUSION 9-1: The quality of statistics produced from multiple data sources depends on properties of the individual sources as well as the methods used to combine them. A new framework of quality standards and guidelines is needed for evaluating such data sources’ fitness for use.
The use of multiple data sources can benefit data equity—promoting the collection and use of data in which all populations, and especially those that have been historically underrepresented or misrepresented in the data record, are visible and accurately portrayed. Alternative data sources can advance data equity by identifying data gaps or misrepresentations, providing information about population members underrepresented in surveys (for example, persons experiencing homelessness or in institutions such as nursing homes), and producing statistics that are disaggregated by race, ethnicity, education, disability status, and other characteristics of interest.
CONCLUSION 3-1: Many data sources include or represent only part of the population of interest. Multiple data sources can be used to assess and improve the coverage of underrepresented groups, and to enable the production of disaggregated statistics. It is important to examine the representativeness and coverage of combined data sources to ensure data equity.
CONCLUSION 3-3: Data equity is an essential aspect of any data system. Documentation of equity aspects, including a discussion of the decisions to include or exclude population subgroup information and an evaluation of data quality for subpopulations of interest, will promote transparency. Development of standards for data equity, and procedures for regularly reviewing equity implications of statistical programs, would enhance efforts to improve data equity across the federal statistical system.
This report discusses four main ways that multiple data sources could improve national statistics, provide new resources for social and economic research, and promote data equity. These improvements range from providing information for improving current surveys to having the option of replacing surveys altogether. Use of multiple data sources could:
- Provide information for evaluating and improving quality of data sources. Administrative and privately held data sources can identify subpopulations that are underrepresented in a sampling frame (a population list from which the sample is drawn) or that are especially prone to nonresponse. Standard survey practice involves comparing estimates of subpopulation sizes calculated from the survey with estimates from an external data source. If records can be linked across sources such that it is possible to identify which (if any) record in source B belongs to the same entity as a record in source A, the linkage can be used to identify, and add, records missing from the frame. This report discusses examples in which non-survey data sources are used to investigate demographic characteristics and socioeconomic status of nonrespondents to income and health surveys (see Chapters 5, 6), to obtain estimates of the number of people killed by law enforcement actions (see Chapter 2), and to identify small urban agricultural operations that are missing from the sampling frame of farms (see Chapters 3, 8). In some cases, information from an administrative source can be used to impute (fill in values using information from a statistical model or similar data records) data items that are missing in a survey.
Linking records can also identify differences in the measurement of concepts across data sources. Chapter 5 discusses studies that compare income items self-reported on surveys with the same categories from linked tax or earnings records. Such studies are an important prelude to greater use of administrative data to supplement or replace information from surveys.
- Obtain additional information about survey respondents. Linking survey records with administrative data sources can provide information not measured in the survey, such as earnings histories and participation in food- or housing-assistance programs (see Chapters 1, 5, 6). Linkage can also provide information about life-course outcomes that occur after the survey, such as subsequent medical expenditures or mortality (see Chapter 6).
- Produce statistics for small populations. Survey sample sizes are typically insufficient to produce statistics for small demographic groups or geographic areas with small populations. Administrative datasets may have large sample sizes but lack information (such as race or ethnicity) that would allow the production of statistics for those groups. Linking records across sources allows statistics to be produced from the administrative records information for groups whose membership is defined in survey or decennial census data. In other situations, information about relationships between race and ethnicity and other variables can be used to impute group membership for administrative data records (see Chapter 3).
Multiple data sources can also be used to produce statistics for small groups without the need to link individual records. This report discusses examples in which statistical models, relying on summary statistics computed from surveys, administrative data, and other sources, are used to produce statistics about income, poverty, health insurance, crime, and agriculture for counties or small demographic groups (see Chapters 2, 3, 7, 8).
- Create data products and produce statistics directly from administrative data. In some cases, after thorough research, surveys can be bypassed and statistics produced directly from administrative data sources. Chapter 4 discusses examples of U.S. Census Bureau and state-level projects that link records from various administrative data sources to create new data products.
Some data sources used to produce statistics have relied on administrative data supplied by state and local governments since their inception. These include the National Vital Statistics System, which tracks births and deaths (see Chapter 4), and the Uniform Crime Reporting Program, which provides estimates of crimes known to the police (see Chapter 7). This report describes the federal-state cooperation that enables creating these datasets, as well as possible modifications that could lead to more timely statistics.
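As an illustration of the frame-coverage check described in the first item of the list above, the following minimal sketch compares a sampling frame with an administrative source and returns the administrative records that have no counterpart in the frame. It is not a method taken from this report. The field name `entity_id`, the function name, and the example records are all hypothetical, and the sketch assumes that both sources share a clean common identifier, which is rarely true in practice; real linkage usually relies on probabilistic matching of names, addresses, and other fields.

```python
def find_missing_from_frame(frame_records, admin_records, key="entity_id"):
    """Return admin records whose identifier does not appear in the frame.

    Assumes (hypothetically) that both sources carry the same exact identifier.
    """
    frame_ids = {record[key] for record in frame_records}
    return [record for record in admin_records if record[key] not in frame_ids]


# Hypothetical example data: a frame of farms and an administrative list.
frame = [
    {"entity_id": "F001", "type": "rural"},
    {"entity_id": "F002", "type": "rural"},
]
admin = [
    {"entity_id": "F002", "type": "rural"},
    {"entity_id": "F003", "type": "urban"},
]

# Records present in the administrative source but absent from the frame.
missing = find_missing_from_frame(frame, admin)
print(missing)
```

Records returned by such a check are candidates for addition to the frame, subject to review of whether they truly belong to the target population.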
These methods show promise for enhancing data products of the federal statistical system, but care is needed to ensure that the resulting datasets and statistics are of high quality. Administrative and private data sources used to produce statistics should be dependable and continuing sources of accurate information, with consistent measurement of concepts, to ensure that statistics can be compared across times and locations.
CONCLUSION 4-4: Administrative records are a valuable source of information for official statistics and social and economic research. Each administrative records dataset considered for use in creating national statistics needs to be understood in terms of both its original and its proposed uses. This includes assessing the dataset’s fitness for use, timeliness, continuing availability, population coverage, measurement of key concepts, and equity aspects.
Statistical methods used to combine information can provide new insights from data, but each method also has the potential to introduce errors. Models used to produce statistics for small geographic areas or to impute missing data values rely on assumptions about relationships among variables that might not apply uniformly across population subgroups. These assumptions need to be carefully investigated and documented for data users.
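One common way in which models combine sources for small geographic areas is a composite (shrinkage) estimator, which weights an area's direct survey estimate against a model-based prediction. The sketch below illustrates that general idea under stated assumptions; it is not the specific method used by any program discussed in this report. The function name and all numeric values are hypothetical, and the model variance is treated as known, although in practice it must be estimated. The weight formula also shows where the modeling assumption enters: if the model fits some subgroups or areas poorly, the prediction it contributes will be biased for them.

```python
def composite_estimate(direct, sampling_var, model_pred, model_var):
    """Combine a direct survey estimate with a model-based prediction.

    weight = model_var / (model_var + sampling_var)
    A noisy direct estimate (large sampling_var) receives little weight,
    so the result leans on the model prediction, and vice versa.
    """
    weight = model_var / (model_var + sampling_var)
    return weight * direct + (1 - weight) * model_pred, weight


# Hypothetical county: the survey gives a poverty rate of 0.20 with a large
# sampling variance, and a model using administrative data predicts 0.15.
estimate, weight = composite_estimate(
    direct=0.20, sampling_var=0.0004, model_pred=0.15, model_var=0.0001
)
print(round(weight, 3), round(estimate, 3))
```

With these illustrative inputs the direct estimate receives a weight of about 0.2, so the composite estimate lies much closer to the model prediction than to the survey value.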
Accurate record linkage provides additional information about populations and individual entities. However, when a record from Source A is mistakenly linked to a record from Source B that belongs to a different entity, the linked dataset record has erroneous information. Some data records contain insufficient identifying information to enable linkage across datasets, and some subpopulations are more likely to have missed links than others (see Chapters 2, 3, 6). While record linkage can promote data equity by allowing calculation of statistics for small population groups, the method must be rigorously evaluated to identify unintended consequences for measurement and for the communities being measured.
CONCLUSION 3-2: Record linkage can merge information from separate data sources and add variables that are needed to produce disaggregated statistics. But linkage procedures may also introduce biases because linkage errors can disproportionately affect members of some population subgroups. It is important to assess data-equity implications of record-linkage methods.
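A simple first diagnostic for the concern raised in Conclusion 3-2 is to compare the share of records that were successfully linked across population subgroups. The sketch below is illustrative only: the field names `group` and `linked`, the function name, and the example records are hypothetical. Link rates reveal differential missed links but not false links, which require a separate evaluation such as clerical review of a sample of linked pairs.

```python
from collections import defaultdict


def link_rate_by_group(records, group_key="group", linked_key="linked"):
    """Return the share of records in each subgroup that obtained a link."""
    totals = defaultdict(int)
    linked = defaultdict(int)
    for record in records:
        totals[record[group_key]] += 1
        if record[linked_key]:
            linked[record[group_key]] += 1
    return {group: linked[group] / totals[group] for group in totals}


# Hypothetical survey records flagged by whether an administrative match was found.
records = [
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": True},
    {"group": "A", "linked": False},
    {"group": "B", "linked": True},
    {"group": "B", "linked": True},
    {"group": "B", "linked": False},
    {"group": "B", "linked": False},
]

rates = link_rate_by_group(records)
print(rates)
```

A gap in link rates such as the one between these two hypothetical groups would signal that statistics computed only from linked records may underrepresent the group with the lower rate.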
The first report in this series concluded that “[t]rust in a new data infrastructure requires transparency of operations and accountability of the operators, with ongoing engagement of stakeholders” (NASEM, 2023, p. 8). Many of the data products discussed in the current report are new, and the methods used to produce them may be new or unfamiliar. Documentation of all steps in the data-collection and production processes is needed to ensure that data users understand the properties and limitations of the statistics produced.
CONCLUSION 9-2: Transparency and documentation of component datasets and of methods used to combine datasets are essential for producing trust in information created from multiple data sources, particularly as new types of data are used.
Creating useful statistics and data products from combined data sources requires new skills. A new data infrastructure requires investment not only in data sources but also in the people who can work with those data. Beyond the technical challenges of developing new statistical methods, there are challenges for promoting data equity and public trust in integrated data. To take advantage of new data resources, it will be important for statistical agencies to invest in personnel, training, and cyberinfrastructure.
CONCLUSION 9-3: Use of multiple data sources is expected to play a major role in the future production of statistical information in the United States, but additional technical expertise and resources are needed to address the challenges involved in producing and assessing the quality of integrated data and statistics.
Probability surveys have provided the nation with useful statistics on numerous topics for more than 80 years, and the panel anticipates that they will continue to be used for producing statistics in many topic areas. Some statistics, such as the percentage of persons who were looking for work last week or the percentage of criminal victimizations that are reported to the police, rely on information that can only be provided by individuals in the population—a probability survey may still be the best method for collecting information on such topics. But there are many opportunities for enhancing survey information with data from other sources, or for reducing burden on survey respondents by obtaining information elsewhere. For some topics and for some parts of the population, administrative records or other data sources can provide more timely, accurate, or granular information than surveys, and at reduced cost.
For all individual data sources that feed into combined datasets and ultimately a new data infrastructure, continued investments in improving the quality of the underlying data are essential for ensuring that the resulting statistics are valid and reliable. This is particularly important given that, as discussed above, data-quality concerns do not affect all population groups, geographic areas, or administrative units equally. A new data infrastructure, and ultimately data users, would benefit from changes to the underlying data sources that would facilitate data linkages. These changes could include revised consent forms or the addition of new data items.
There is much work to be done. The first report in this series (NASEM, 2023) discussed challenges related to data infrastructure governance and data sharing, and the work needed to overcome those challenges. Many challenges to creating and sustaining a new data infrastructure have not yet been addressed by this or the previous report, and they will be studied in future reports in this series. These include the crucial issues of establishing cyberinfrastructure tailored to integrated data, sharing the benefits of enhanced data resources with researchers and the public while protecting the confidentiality of information contained in the data, investigating issues of data ownership, involving data users and community members in data decisions, and ensuring transparency. The panel believes that these challenges can be met and that a new data infrastructure can be developed to produce improved statistical information for the public good.