Consensus Study Report
NATIONAL ACADEMIES PRESS 500 Fifth Street, NW Washington, DC 20001
This activity was supported by award number SES-2114583 between the National Academy of Sciences and the National Science Foundation. Support for the work of the Committee on National Statistics is provided by a consortium of federal agencies through a grant from the National Science Foundation, a National Agricultural Statistics Service cooperative agreement, and several individual contracts. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-69274-8
International Standard Book Number-10: 0-309-69274-1
Digital Object Identifier: https://doi.org/10.17226/26688
Library of Congress Control Number: 2022950596
This publication is available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.
Copyright 2023 by the National Academy of Sciences. National Academies of Sciences, Engineering, and Medicine and National Academies Press and the graphical logos for each are all trademarks of the National Academy of Sciences. All rights reserved.
Printed in the United States of America.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2023. Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good. Washington, DC: The National Academies Press. https://doi.org/10.17226/26688.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.
The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president.
The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.
The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.
Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study’s statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee’s deliberations. Each report has been subjected to a rigorous and independent peer-review process and it represents the position of the National Academies on the statement of task.
Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies.
Rapid Expert Consultations published by the National Academies of Sciences, Engineering, and Medicine are authored by subject-matter experts on narrowly focused topics that can be supported by a body of evidence. The discussions contained in rapid expert consultations are considered those of the authors and do not contain policy recommendations. Rapid expert consultations are reviewed by the institution before release.
For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.
PANEL ON THE SCOPE, COMPONENTS, AND KEY CHARACTERISTICS OF A 21ST CENTURY DATA INFRASTRUCTURE
ROBERT M. GROVES (Chair), Office of the Provost, Georgetown University
DANAH BOYD, Microsoft Research and Data & Society
ANNE C. CASE, School of Public and International Affairs, Princeton University, Emerita
JANET M. CURRIE, School of Public and International Affairs; Co-Director, Center for Health and Wellbeing, Princeton University
ERICA L. GROSHEN, Cornell University School of Industrial and Labor Relations; Upjohn Institute for Employment Research
MARGARET C. LEVENSTEIN, Inter-university Consortium for Political and Social Research, University of Michigan
TED McCANN, American Idea Foundation
HELEN NISSENBAUM,1 Cornell Tech, Cornell University
C. MATTHEW SNIPP, Department of Sociology, Stanford University
PATRICIA SOLÍS, School of Geographical Sciences and Urban Planning, Arizona State University
THOMAS MESENBOURG, Study Director
MICHAEL SIRI, Associate Program Officer
KATELYN STENGER, Associate Program Officer
JOSHUA LANG, Senior Program Assistant
1 Resigned from panel on January 15, 2022.
COMMITTEE ON NATIONAL STATISTICS
ROBERT M. GROVES (Chair), Office of the Provost, Georgetown University
LAWRENCE D. BOBO, Department of Sociology, Harvard University
ANNE C. CASE, School of Public and International Affairs, Princeton University, Emerita
MICK P. COUPER, Institute for Social Research, University of Michigan
DIANA FARRELL, JPMorgan Chase Institute, Washington, DC
ROBERT GOERGE, Chapin Hall at the University of Chicago
ERICA L. GROSHEN, School of Industrial and Labor Relations, Cornell University
DANIEL E. HO, Stanford Law School, Stanford University
HILARY HOYNES, Goldman School of Public Policy, University of California, Berkeley
DANIEL KIFER, Department of Computer Science and Engineering, The Pennsylvania State University
SHARON LOHR, School of Mathematical and Statistical Sciences, Arizona State University, Emerita
JEROME P. REITER, Department of Statistical Science, Duke University
NELA RICHARDSON, ADP Research Institute, Roseland, NJ
JUDITH A. SELTZER, Department of Sociology, University of California, Los Angeles
C. MATTHEW SNIPP, School of the Humanities and Sciences, Stanford University
ELIZABETH A. STUART, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health
BRIAN HARRIS-KOJETIN, Director
MELISSA CHIU, Deputy Director
CONSTANCE F. CITRO, Senior Scholar
This report is the product of contributions from many colleagues whom we thank for their time and expert guidance. The National Science Foundation (NSF) funded the project, and we are indebted to Daniel Goroff, Alan Tompkins, and Cheryl Eavey at NSF for valuable discussions and their support of the study.
To address gaps in the literature and to provide a forum for public comment, the panel convened two public workshop sessions in December 2021. Employees from statistical agencies in the United States and Europe, researchers, and private sector representatives described the impediments they confronted while attempting to blend nontraditional (usually private sector) data sources to improve national statistics.
The panel thanks the following individuals for presenting at these sessions: Cheryl Eavey (NSF) provided an overview of the project, highlighting areas where the panel’s contributions could be especially impactful. Emilda Rivers (National Center for Science and Engineering Statistics) and Andrew Reamer (George Washington University) described federal statistical system initiatives that are examining issues similar to those that the panel studied. Ivan DeLoatch (Federal Geographic Data Committee), Matthew Shapiro (University of Michigan), and John Stevens (Federal Reserve Board of Governors) all offered valuable comments on these initiatives.
Antonio Chessa (Statistics Netherlands), Sarah Henry (U.K. Office for National Statistics), and Geoff Bowlby (Statistics Canada) shared their experiences in gathering and using private sector data in the production of national statistics, notably describing the challenges they faced and overcame. Stephanie Studds (Census Bureau), Matt Gee (Brighthive), John Haltiwanger (University of Maryland), and John Stevens (Federal Reserve Board of Governors) informed the panel on the use of private sector transaction data at U.S. statistical agencies. In a session on federal statistical agencies and nonprofits’ use of private sector health data, Mary Bohman (Bureau of Economic Analysis), Brian Moyer (National Center for Health Statistics), and Niall Brennan (The Health Care Cost Institute) pointed out unique issues in obtaining and working with these data.
The workshop sessions concluded with a discussion of issues that arise when using private sector data for official statistics and research, especially the relationship between the data subject and data holder, as well as the changing legal, regulatory, and privacy landscape regarding private sector data. Nathan Persily (Stanford Law School) described his experiences with Social Science One and drafted legislation that could facilitate learning more about the benefits and limitations of private sector data. Salome Viljoen (Columbia Law School) provided the panel with an overview of the philosophical and legal underpinnings regarding notions of privacy. Panel member danah boyd offered her thoughts on the concept of “data as a gift” and how this theory informs data exchange agreements. Kadija Ferryman (Johns Hopkins Public Health), Frank Nothaft (CoreLogic), DJ Patil (Harvard University), Katherine Wallman (U.S. Office of Management and Budget [retired]), and Maurine Haver (Haver Analytics) provided excellent commentary on the issues set forth by Persily, Viljoen, and boyd.
The panel could not have conducted its work efficiently without the capable staff of the National Academies of Sciences, Engineering, and Medicine: Brian Harris-Kojetin, director of the Committee on National Statistics, provided institutional leadership and substantive contributions during meetings. Kirsten Sampson-Snyder, Division of Behavioral and Social Sciences and Education, expertly coordinated the review process, and Susan Debad and Bea Porter provided thorough final editing that improved the readability of the report for a wide audience. We also thank Rebecca Krone and Joshua Lang for well-organized and efficient logistical support of the panel’s meetings, and Katelyn Stenger, who provided valuable research support throughout the project. On behalf of the panel, I thank the study directors, Thomas Mesenbourg and Michael Siri, for their excellent management of the panel’s work. The quality and timeliness of this report would not have been possible without their contributions.
Finally, and most importantly, a note of appreciation is in order for my fellow panel members. This report reflects their collective expertise and commitment.
This Consensus Study Report was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published report as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.
The panel thanks the following individuals for their review of this report: Richard D. Alba (Department of Sociology, The Graduate Center, City University, New York), Claire McKay Bowen (Statistical Methods Group, Center on Labor, Human Services, and Population, Technology and Data Science, Urban Institute), Alan Butler (Office of President, Electronic Privacy Information Center, Washington, D.C.), Kathleen Cagney (Institute for Social Research and Department of Sociology, University of Michigan), Laura DeNardis (School of Communication, American University), Kenneth E. Poole (Office of the President, Center for Regional Economic Competitiveness, Arlington, VA), Kosali Simon (O’Neill School of Public and Environmental Affairs, Indiana University, Bloomington), and Timothy D. Wilson (Department of Psychology, University of Virginia).
Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions, nor did they see the final draft of the report before its release. The review of the report was overseen by Cynthia Clark, independent consultant, McLean, VA, and Kathleen Mullan Harris, Department of Sociology, University of North Carolina. Appointed by the National Research Council’s Report Review Committee, they were responsible for making certain that the independent examination of this report was carried out per institutional procedures and that all review comments were carefully considered. Responsibility for the final content of the report rests entirely with the authoring panel and the National Academies of Sciences, Engineering, and Medicine.
Robert M. Groves, Chair
Panel on the Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure
2 The United States Needs a New National Data Infrastructure
Producing National Statistics: Declining Response Rates and Increased Costs
The Digital Data Revolution Presents Opportunities and Challenges
Current Efforts to Use Digital Data to Repair Weaknesses in National Statistics Demonstrate the Possibilities and Limitations of Alternative Data Sources
RECENT CONGRESSIONAL DATA-RELATED INITIATIVES: NECESSARY BUT NOT SUFFICIENT
3 A Vision for a New National Data Infrastructure
VISION FOR A 21ST CENTURY NATIONAL DATA INFRASTRUCTURE
Outcomes of a New Data Infrastructure
Key Attributes of a New National Data Infrastructure
ATTRIBUTES OF A NEW DATA INFRASTRUCTURE
Attribute 1: Safeguards and Advanced Privacy-Enhancing Practices to Minimize Possible Individual Harm
Attribute 2: Statistical Uses Only, for Common-Good Information, with Statistical Aggregates Freely Shared with All
Attribute 3: Mobilization of Relevant National Digital Data Assets, Blended in Statistical Aggregates to Provide Benefits to Data Holders, with Societal Benefits Proportionate to Possible Costs and Risks
Attribute 4: Reformed Legal Authorities Protecting All Parties’ Interests
Attribute 5: Governance Framework and Standards Effectively Supporting Operations
Attribute 6: Transparency to the Public Regarding Analytical Operations Using the Infrastructure
Attribute 7: State-of-the-Art Practices for Access, Statistical, Coordination, and Computational Activities; Continuously Improved to Efficiently Create Increasingly Secure and Useful Information
APPENDIX 3A: LAWS AND OFFICE OF MANAGEMENT AND BUDGET GUIDANCE ON CONFIDENTIALITY AND PRIVACY PROTECTION
APPENDIX 3B: EXAMPLES OF STANDARDS THAT WOULD BE USEFUL TO ANY NEW DATA INFRASTRUCTURE
4 Blended Data: Implications for a New National Data Infrastructure and Its Organization
KEY DATA HOLDERS FOR A 21ST CENTURY NATIONAL DATA INFRASTRUCTURE
Principal Federal Statistical Agencies and Units
Federal Program and Administrative Agencies
Nonprofit and Academic Institutions
Crowdsourced or Citizen-Science Data Holders
WHICH DATA SHOULD BE INCLUDED?
Fitness-for-Use to Produce Key Information for the Country
Data Minimized to Satisfy Pre-Specified Purposes
Data Access and Use Respect Data Holders’ and Data Subjects’ Interests and Privacy
Prioritize Easily Acquired Data That Provide Tangible Benefits
Available, Usable Metadata Is Essential for Statistical Purposes
BLENDED DATA REQUIRE NEW STATISTICAL METHODS
BLENDED DATA REQUIRE NEW STATISTICAL DESIGNS
BLENDED DATA REQUIRE NEW DATA INFRASTRUCTURE CAPABILITIES
BLENDED DATA POSE NEW PRIVACY AND ETHICAL CHALLENGES
MULTIPLE ORGANIZATIONAL STRUCTURES CAN SUPPORT A NEW DATA INFRASTRUCTURE
Organizational Models to Facilitate Cross-Sector Data Access and Use
5 Building a 21st Century National Data Infrastructure Requires Identifying Short- and Medium-Term Activities
ATTRIBUTE 1: SAFEGUARDS AND ADVANCED PRIVACY-ENHANCING PRACTICES TO MINIMIZE POSSIBLE INDIVIDUAL HARM
ATTRIBUTE 2: STATISTICAL USES ONLY, FOR COMMON-GOOD INFORMATION, WITH STATISTICAL AGGREGATES FREELY SHARED WITH ALL
ATTRIBUTE 3: MOBILIZATION OF RELEVANT DIGITAL DATA ASSETS, BLENDED IN STATISTICAL AGGREGATES TO PROVIDE BENEFITS TO DATA HOLDERS, WITH SOCIETAL BENEFITS PROPORTIONATE TO POSSIBLE COSTS AND RISKS
ATTRIBUTE 4: REFORMED LEGAL AUTHORITIES PROTECTING ALL PARTIES’ INTERESTS
ATTRIBUTE 5: GOVERNANCE FRAMEWORK AND STANDARDS EFFECTIVELY SUPPORTING OPERATIONS
ATTRIBUTE 6: TRANSPARENCY TO THE PUBLIC REGARDING ANALYTICAL OPERATIONS USING THE DATA INFRASTRUCTURE
ATTRIBUTE 7: STATE-OF-THE-ART PRACTICES FOR ACCESS, STATISTICAL, COORDINATION, AND COMPUTATIONAL ACTIVITIES; CONTINUOUSLY IMPROVED TO EFFICIENTLY CREATE INCREASINGLY SECURE AND USEFUL INFORMATION
NEW PARTNERSHIPS MUST BE FORMED
A BIOGRAPHICAL SKETCHES OF PANEL MEMBERS
Boxes, Figures, and Tables
S-1 Seven Attributes of a 21st Century National Data Infrastructure Vision
2-1 Terminated and Threatened Statistical Programs
3-1 Outcomes of a New Data Infrastructure
3-2 Seven Attributes of a 21st Century National Data Infrastructure Vision
3-3 Designated U.S. Federal Statistical Agencies and Units
3-4 Data Holders Proposed to Share Data in a 21st Century National Data Infrastructure
3-5 Advisory Committee on Data for Evidence Building Recommendations for Regulatory Action (Year 1 Report)
3-6 Data Governance Components
4-1 Updated CEP Examples of Selected Administrative Data Assets
4-2 Data Holders Proposed to Share Data in a 21st Century National Data Infrastructure
4-3 Key Characteristics of Data Assets to Be Included in a 21st Century National Data Infrastructure
4-4 Capabilities Needed for a 21st Century National Data Infrastructure
4-5 Illustrative Organizational Entity Questions
5-1 Properties of First Additions to a New Data Infrastructure for Statistical Purposes
5-2 Data Governance Components
5-1 Steps in building a regulatory environment supporting private sector data sharing for national statistical purposes
5-2 Entity actions to build partnerships with data-sharing private sector data holders
2-1 Selected Household Survey Response Rates
4-1 Dimensions of Data Quality
5-1 Short- and Medium-Term Tasks for a 21st Century National Data Infrastructure
Acronyms and Abbreviations
|ACDEB|Advisory Committee on Data for Evidence Building|
|ACS|American Community Survey|
|ADC|America’s DataHub Consortium|
|AEAStat|American Economic Association Committee on Statistics|
|BEA|Bureau of Economic Analysis|
|BJS|Bureau of Justice Statistics|
|BLS|U.S. Bureau of Labor Statistics|
|CDC|Centers for Disease Control and Prevention|
|CDO|(Federal) Chief Data Officer|
|CE|Consumer Expenditure Survey|
|CEP|U.S. Commission on Evidence-Based Policymaking|
|CIPSEA|Confidential Information Protection and Statistical Efficiency Act|
|CNSTAT|Committee on National Statistics|
|CPI|Consumer Price Index|
|CPS|Current Population Survey|
|CSDA|Common Statistical Data Architecture|
|CSPA|Common Statistical Production Architecture|
|DDI|Data Documentation Initiative|
|DHS|Department of Homeland Security|
|EHR|electronic health record|
|EIA|Energy Information Administration|
|FCSM|Federal Committee on Statistical Methodology|
|FFRDC|federally funded research and development center|
|FGDC|Federal Geographic Data Committee|
|FISMA|Federal Information Security Management Act|
|FITARA|Federal Information Technology Acquisition Reform Act|
|FSRDC|federal statistical research data center|
|GAO|U.S. Government Accountability Office|
|GSBPM|Generic Statistical Business Process Model|
|GSIM|Generic Statistical Information Model|
|HHS|U.S. Department of Health and Human Services|
|HIPAA|Health Insurance Portability and Accountability Act|
|ICPSR|Inter-university Consortium for Political and Social Research|
|ICSP|Interagency Council on Statistical Policy|
|IEC|International Electrotechnical Commission|
|IRB|institutional review board|
|IRS|Internal Revenue Service|
|ISO|International Organization for Standardization|
|MEPS HC|Medical Expenditure Panel Survey, Household Component|
|MOUs|memoranda of understanding|
|NARA|National Archives and Records Administration|
|NASEM|National Academies of Sciences, Engineering, and Medicine|
|NIEM|National Information Exchange Model|
|NIH|National Institutes of Health|
|NIST|National Institute of Standards and Technology|
|NRC|National Research Council|
|NSDS|National Secure Data Service|
|NSF|National Science Foundation|
|OECD|Organisation for Economic Co-operation and Development|
|OMB|U.S. Office of Management and Budget|
|QCEW|Quarterly Census of Employment and Wages|
|SAP|Standard Application Process|
|SBA|Small Business Administration|
|SDMX|Statistical Data and Metadata eXchange|
|SHIELD|Stop Hacks and Improve Electronic Data Security Act|
|SORNs|system of records notices|
|SSA|Social Security Administration|
|UNECE|United Nations Economic Commission for Europe|
|USDA|U.S. Department of Agriculture|
Glossary of Select Terms
Data Equity
No common definition exists within the federal government; neither the Equitable Data Working Group in its recent report (The White House, 2022b) nor the U.S. Census Bureau (https://www.census.gov/about/what/data-equity.html) defines the term. In this report, data equity “refers to the consideration, through an equity lens, of the ways in which data is collected, analyzed, interpreted, and distributed” (Lee-Ibarra, 2021).
Data Infrastructure
Data assets; the technologies used to discover, access, share, process, use, analyze, manage, store, preserve, protect, and secure those assets; the people, capacity, and expertise needed to manage, use, interpret, and understand data; the guidance, standards, policies, and rules that govern data access, use, and protection; the organizations and entities that manage, oversee, and govern the data infrastructure; and the communities and data subjects whose data is shared and used for statistical purposes and may be impacted by decisions that are made using those data assets.
Equitable Data Working Group
Executive Order 13985, “Advancing Racial Equity and Support for Underserved Communities Through the Federal Government,” issued by President Biden in January 2021, formed the Equitable Data Working Group. It is tasked with identifying inadequacies and areas of improvement within federal data and outlining a strategy for increasing data available for measuring equity and representing the diversity of the American people and their experiences (The White House, 2021b).
Evidence Act
Also referred to as the “Foundations for Evidence-Based Policymaking Act of 2018.” This law requires agency data to be accessible and requires agencies to plan to develop statistical evidence to support policymaking (U.S. Congress, 2019).
Standard Application Process (SAP)
The federal statistical system is currently developing an SAP for applying for access to confidential data assets. When fully built, the SAP will serve as a “front door” through which to apply for permission to use protected data from any of the 16 federal statistical agencies and units for evidence building (https://ncses.nsf.gov/about/standard-application-process). Testing for the portal will occur in September 2022, with the expectation that the site will be operational by the end of 2022. The current portal is at: https://www.researchdatagov.org/