Consensus Study Report
NATIONAL ACADEMIES PRESS, 500 Fifth Street, NW, Washington, DC 20001
This activity was supported by a grant from the National Science Foundation to the National Academy of Sciences. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-69675-3
International Standard Book Number-10: 0-309-69675-5
Digital Object Identifier: https://doi.org/10.17226/26804
Library of Congress Control Number: 2023940168
This publication is available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.
Copyright 2023 by the National Academy of Sciences. National Academies of Sciences, Engineering, and Medicine and National Academies Press and the graphical logos for each are all trademarks of the National Academy of Sciences. All rights reserved.
Printed in the United States of America.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2023. Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources. Washington, DC: The National Academies Press. https://doi.org/10.17226/26804.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.
The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president.
The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.
The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.
Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study’s statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee’s deliberations. Each report has been subjected to a rigorous and independent peer-review process and it represents the position of the National Academies on the statement of task.
Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies.
Rapid Expert Consultations published by the National Academies of Sciences, Engineering, and Medicine are authored by subject-matter experts on narrowly focused topics that can be supported by a body of evidence. The discussions contained in rapid expert consultations are considered those of the authors and do not contain policy recommendations. Rapid expert consultations are reviewed by the institution before release.
For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.
PANEL ON THE IMPLICATIONS OF USING MULTIPLE DATA SOURCES FOR MAJOR SURVEY PROGRAMS
SHARON L. LOHR (Chair), School of Mathematical and Statistical Sciences, Arizona State University (Emerita)
JEAN-FRANÇOIS BEAUMONT, Statistics Canada
LAWRENCE D. BOBO, Office of the Dean of Social Science, Harvard University
MICK P. COUPER, Institute for Social Research, University of Michigan
HILARY HOYNES, Goldman School of Public Policy at the University of California, Berkeley
KIMBERLYN LEARY, Harvard Medical School/McLean Hospital and Department of Health Policy and Management, Harvard T.H. Chan School of Public Health
DAVID MANCUSO, Washington State Department of Social and Health Services
JUDITH A. SELTZER, Department of Sociology, University of California, Los Angeles
ELIZABETH A. STUART, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health
SHAOWEN WANG, Department of Geography and Geographic Information Science, University of Illinois Urbana-Champaign
Study Staff
DANIEL H. WEINBERG, Study Director (until December 2022)
KRISZTINA MARTON, Study Director (from December 2022)
JOSHUA LANG, Senior Program Assistant
COMMITTEE ON NATIONAL STATISTICS
ROBERT M. GROVES (Chair), Office of the Provost, Georgetown University
LAWRENCE D. BOBO, Department of Sociology, Harvard University
ANNE C. CASE, School of Public and International Affairs, Princeton University (Emeritus)
MICK P. COUPER, Institute for Social Research, University of Michigan
DIANA FARRELL, President and Chief Executive Officer, JPMorgan Chase Institute
ROBERT GOERGE, Chapin Hall at the University of Chicago
ERICA L. GROSHEN, School of Industrial and Labor Relations, Cornell University
DANIEL E. HO, Law School, Stanford University
HILARY HOYNES, Goldman School of Public Policy, University of California, Berkeley
DANIEL KIFER, Department of Computer Science and Engineering, The Pennsylvania State University
SHARON LOHR, School of Mathematical and Statistical Sciences, Arizona State University (Emerita)
JEROME P. REITER, Department of Statistical Science, Duke University
NELA RICHARDSON, Senior Vice President and Chief Economist, ADP Research Institute
JUDITH A. SELTZER, Department of Sociology, University of California, Los Angeles
C. MATTHEW SNIPP, School of the Humanities and Sciences, Stanford University
ELIZABETH A. STUART, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health
MELISSA CHIU, Director
BRIAN HARRIS-KOJETIN, Senior Scholar
CONSTANCE F. CITRO, Senior Scholar
Reviewers
This Consensus Study Report was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published report as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.
We thank the following individuals for their review of this report:
Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations of this report nor did they see the final draft before its release. The review of this report was overseen by CYNTHIA CLARK, independent consultant, and KATHLEEN MULLAN HARRIS, Department of Sociology, University of North Carolina. They were responsible for making certain that an independent examination of this report was carried out in accordance with the standards of the National Academies and that all review comments were carefully considered. Responsibility for the final content rests entirely with the authoring committee and the National Academies.
Acknowledgments
This report of the Panel on the Implications of Using Multiple Data Sources for Major Survey Programs is the product of contributions from many colleagues, whom we thank for sharing their time and expertise. The panel was funded by the National Science Foundation, which has been a true partner in this endeavor, and we are especially indebted to Alan Tomkins, Daniel Goroff, Rayvon Fouché, and Cheryl Eavey for their support and for valuable discussions about the panel’s goals and activities. Cheryl Eavey opened the workshop with comments about how the panel’s activities complement other efforts at the National Science Foundation on enhancing data for social and economic research.
The panel benefited greatly from the presentations provided during the virtual public workshop held on May 16 and 18, 2022. The experts the panel heard from can be clustered into the following perspectives and areas of expertise (see Appendix A for the workshop agenda and Appendix B for biographies of the workshop presenters):
- Keynote speakers and discussants: Robert Santos, Director, U.S. Census Bureau; Anil Arora, Chief Statistician of Canada; Joseph Salvo, University of Virginia; and Haoyi Chen, United Nations.
- Experts on crime statistics: Janet Lauritsen, University of Missouri-St. Louis; Ramiro Martinez, Jr., Northeastern University; Erica Smith, U.S. Bureau of Justice Statistics; and Derek Veitenheimer, State of Wisconsin.
- Experts on agricultural statistics: Linda Young, U.S. National Agricultural Statistics Service; Herbert Nkwimi-Tchahou, Statistics Canada; Martin Mendez-Costabel, Bayer Crop Science; and Michael Goodchild, University of California, Santa Barbara.
- Experts on income and health statistics: Jonathan Rothbaum, U.S. Census Bureau; Lisa Mirel, U.S. National Center for Health Statistics; Jessica Faul, University of Michigan; and Helen Levy, University of Michigan.
- Experts on data-equity issues: Steven Brown, Urban Institute; Randall Akee, University of California, Los Angeles; Frauke Kreuter, Ludwig-Maximilians-University of Munich and University of Maryland; Clarence Wardell, Chief Data and Equitable Delivery Officer, Executive Office of the President; and Margaret Levenstein, University of Michigan.
We would also like to thank the Chair of the Committee on National Statistics, Robert M. Groves, for his leadership and his insightful comments about a new vision for national statistics in the final workshop session.
The panel could not have conducted its work without the capable staff at the National Academies of Sciences, Engineering, and Medicine. Brian Harris-Kojetin, Director of the Committee on National Statistics, and Melissa Chiu, Deputy Director, provided invaluable support throughout the panel’s activities, and their insightful comments improved the workshop and report. Joshua Lang did a magnificent job of organizing the panel meetings, ensuring the smooth operation of the workshop and other panel activities, and assisting with the report. Neeti Pokhriyal (Mirzayan Science Technology Policy Fellow) helped with literature reviews, and Constance Citro, Daniel Cork, David Johnson, and Nancy Kirkendall provided helpful input for the report. Kirsten Sampson-Snyder organized the review process, and Susan Debad’s thorough editing improved the readability and accessibility of the report. We are grateful to all of them for their contributions and help.
The crew at Spark Street Digital ensured that the technological aspects of the virtual workshop worked flawlessly and produced the video of the event. We appreciate their help in familiarizing participants with the webcast features and their behind-the-scenes support during the workshop.
Finally, we thank the members of the Panel on the Implications of Using Multiple Data Sources for Major Survey Programs, listed on page v. As can be seen from the biographies in Appendix B, the panel members brought an impressive array of expertise, and they generously volunteered their time to organize the workshop, gather evidence, and work on the report. The final report reflects the commitment and expertise of all panel members.
Sharon L. Lohr (Chair)
Daniel H. Weinberg (Study Director)
Krisztina Marton (Study Director)
Contents
1 The Promise of Integrated Data
1.1 An Example of Enhancing Survey Data for Policymaking
1.2 Producing Statistics That Are Fit for Use
1.3 Study Approach and Information Gathering
1.4 Organization of the Report
2 Types of Data and Methods for Combining Them
Administrative Records Collected by Government Agencies
Records Collected by Private-Sector Organizations
Satellite, Sensor, and Location Data
Nonprobability or Convenience Samples
Data from Social Media, Webscraping, and Crowdsourcing
2.2 Methods for Combining Data
Combining Statistics Calculated from Independent Data Sources
2.3 Opportunities and Challenges for Combining Data from Multiple Sources
3 Using Multiple Data Sources to Enhance Data Equity
3.2 Investigate or Improve Coverage of a Survey
3.3 Enable Finer Data Disaggregation
3.4 Produce Model-Based Estimates for Small Subpopulations
3.5 Assess and Reduce Measurement Error
3.6 Add Features to the Data Through Data Linkage
Adding Variables to a Dataset from Records Linked in Another Source
Linkage Errors and Data Equity
Additional Equity Considerations for Data Linkage
3.7 Add Features to the Data Through Imputation
Imputing Information Needed for Disaggregation
Equity Considerations for Imputation
4 Creating New Data Resources with Administrative Records
4.1 Creating Longitudinal Databases from Existing Records
Longitudinal Business Database
Longitudinal Employer-Household Dynamics Database
Decennial Census Digitization and Linkage Project
4.3 The National Vital Statistics System
4.4 Linking Data at the State or Regional Level
Illinois Integrated Database of Child and Family Programs
Washington State Department of Social and Health Services
4.5 Using Administrative Records to Produce Statistics
5 Data Linkage to Improve Income Measurement
5.1 Income Data Collection on Surveys
Current Population Survey Annual Social and Economic Supplement
5.2 Administrative Records Sources for Income Data
Data from the Internal Revenue Service
Data from the Social Security Administration
Administrative Data from Other Government Agencies
5.3 Using Administrative Data with Income Surveys
5.4 Studying Measurement of Income and Program Participation
5.5 Using Linked Income Data to Improve Income Statistics
Comprehensive Income Dataset Project
National Experimental Wellbeing Statistics Project
Using Administrative Records to Improve Income Measures
6 Data Linkage to Supplement Health Surveys
6.1 Surveys from the U.S. National Center for Health Statistics
National Health Interview Survey
National Health and Nutrition Examination Survey
Strengths and Limitations of Health Survey Data
6.2 Sources of Administrative Data on Health
6.3 Data Linkage at the U.S. National Center for Health Statistics
Linkages to Examine Accuracy of Health Data
Linkages to Study Health Outcomes and Associations
Investigating and Documenting Properties of Linked Survey Data
6.5 Linkage of Longitudinal Health Surveys
7 Combining Multiple Data Sources to Measure Crime
7.1 The Uniform Crime Reporting Program
7.2 National Crime Victimization Survey
7.3 Other National Data Sources About Crime
National Vital Statistics System
Data Collected by Regulatory Agencies
Data from Crowdsourcing and Webscraping
7.5 Combining Statistics Computed from Multiple Data Sources
7.6 Linking Individual Records Across Data Sources
Linkage to Add Variables About Crime Incidents, Victims, or Offenders
Linkage to Study Crime Measurement or Law Enforcement Procedures
7.7 Improving the Quality of Crime Data
Improve Population and Crime Coverage
Enable Production of Disaggregated Statistics
Improve Cooperation for Data Collection
8 Using Multiple Data Sources for County-Level Crop Estimates
8.1 Data Sources for Crop Estimates
Satellite, Aerial Imagery, and Sensor Data
Data from Social Media, Webscraping, and Crowdsourcing
8.2 Modeling Crops County Estimates in the United States
8.3 Modeling Crop Estimates in Canada
8.4 Opportunities for Improving Agricultural Statistics
9 Combining Data Sources for National Statistics: Next Steps
Multiple Data Sources Can Add Value for Official Statistics and Research
Quality of Integrated Data and Statistics
Transparency and Documentation
Boxes, Figures, and Tables
BOXES
1-2 Seven Attributes of a 21st Century National Data Infrastructure Vision
2-1 Deterministic and Probabilistic Record Linkage
2-2 The Small Area Income and Poverty Estimates Program
3-1 Artificial Intelligence and Data Equity
3-2 Measuring Coverage of the 2020 Census
3-3 Measuring Race and Ethnicity in the United States
3-4 Privacy, Confidentiality, and Data Equity
3-5 Informed Consent and Data Ownership
4-1 Historical Uses of Administrative Records for Statistical Purposes: Selected Examples
FIGURES
1-1 Dimensions of data quality
2-1 Response rates for selected surveys, 2000–2022
3-1 Statistics Canada Disaggregated Data Action Plan
3-2 Ethnicity and race questions in the 2020 Census
4-1 The U.S. Census Bureau’s Frames project
5-1 American Community Survey income questions, 2022
5-2 Item nonresponse for selected income types, American Community Survey, 2000–2021
TABLES
Acronyms and Abbreviations
| Acronym | Definition |
| --- | --- |
| ACS | American Community Survey |
| AIAN | American Indian or Alaska Native |
| ASEC | Annual Social and Economic Supplement [of the Current Population Survey] |
| BJS | Bureau of Justice Statistics |
| CAPS | County Agricultural Production Survey |
| CDC | Centers for Disease Control and Prevention |
| CID | Comprehensive Income Dataset |
| CNSTAT | Committee on National Statistics |
| CPS | Current Population Survey |
| FBI | Federal Bureau of Investigation |
| FSA | Farm Service Agency |
| HRS | Health and Retirement Study |
| HUD | Department of Housing and Urban Development |
| ICDR | Integrated Client Data Repository [State of Washington] |
| IRS | Internal Revenue Service |
| JAS | June Area Survey |
| LEHD | Longitudinal Employer-Household Dynamics |
| MAF | Master Address File |
| NASEM | National Academies of Sciences, Engineering, and Medicine |
| NASS | National Agricultural Statistics Service |
| NCHS | National Center for Health Statistics |
| NCVS | National Crime Victimization Survey |
| NDI | National Death Index |
| NEWS | National Experimental Well-being Statistics |
| NHANES | National Health and Nutrition Examination Survey |
| NHIS | National Health Interview Survey |
| NIBRS | National Incident-Based Reporting System [of Uniform Crime Reports] |
| NVSS | National Vital Statistics System |
| OMB | Office of Management and Budget |
| PIK | Protected Identification Key |
| RMA | Risk Management Agency |
| SAIPE | Small Area Income and Poverty Estimates |
| SIPP | Survey of Income and Program Participation |
| SNAP | Supplemental Nutrition Assistance Program |
| SRS | Summary Reporting System [of Uniform Crime Reports] |
| SSA | Social Security Administration |
| SSN | Social Security Number |
| TIGER | Topologically Integrated Geographic Encoding and Referencing |
| UCR | Uniform Crime Reports/Reporting |
| USDA | U.S. Department of Agriculture |