Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good
Robert M. Groves, Thomas Mesenbourg, and Michael Siri, Editors
Panel on the Scope, Components, and Key Characteristics of a 21st-Century Data Infrastructure
Committee on National Statistics
Division of Behavioral and Social Sciences and Education
Consensus Study Report
Prepublication Copy - Uncorrected Proofs
NATIONAL ACADEMIES PRESS 500 Fifth Street, NW, Washington, DC 20001 This activity was supported by award number SES-2114583 between the National Academy of Sciences and the National Science Foundation. Support for the work of the Committee on National Statistics is provided by a consortium of federal agencies through a grant from the National Science Foundation, a National Agricultural Statistics Service cooperative agreement, and several individual contracts. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project. International Standard Book Number-13: 978-0-309-XXXXX-X International Standard Book Number-10: 0-309-XXXXX-X Digital Object Identifier: https://doi.org/10.17226/26688 This publication is available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu. Copyright 2022 by the National Academy of Sciences. National Academies of Sciences, Engineering, and Medicine and National Academies Press and the graphical logos for each are all trademarks of the National Academy of Sciences. All rights reserved. Printed in the United States of America. Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2022. Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good. Washington, DC: The National Academies Press. https://doi.org/10.17226/26688.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president. The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president. The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president. The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine. Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study's statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee's deliberations. Each report has been subjected to a rigorous and independent peer-review process and it represents the position of the National Academies on the statement of task. Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies. Rapid Expert Consultations published by the National Academies of Sciences, Engineering, and Medicine are authored by subject-matter experts on narrowly focused topics that can be supported by a body of evidence. The discussions contained in rapid expert consultations are considered those of the authors and do not contain policy recommendations. Rapid expert consultations are reviewed by the institution before release. For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.
PANEL ON THE SCOPE, COMPONENTS, AND KEY CHARACTERISTICS OF A 21ST-CENTURY DATA INFRASTRUCTURE

ROBERT M. GROVES (Chair), Office of the Provost, Georgetown University
DANAH BOYD, Microsoft Research and Data & Society
ANNE C. CASE, School of Public and International Affairs, Princeton University, Emeritus
JANET M. CURRIE, School of Public and International Affairs; and Co-Director, Center for Health and Wellbeing, Princeton University
ERICA L. GROSHEN, Cornell University School of Industrial and Labor Relations; and Upjohn Institute for Employment Research
MARGARET C. LEVENSTEIN, Inter-university Consortium for Political and Social Research, University of Michigan
TED MCCANN, American Idea Foundation
HELEN NISSENBAUM*, Cornell Tech, Cornell University
C. MATTHEW SNIPP, Department of Sociology, Stanford University
PATRICIA SOLÍS, School of Geographical Sciences and Urban Planning, Arizona State University

THOMAS MESENBOURG, Study Director
MICHAEL SIRI, Associate Program Officer
KATELYN STENGER, Associate Program Officer
JOSHUA LANG, Senior Program Assistant

*Resigned from panel on January 15, 2022.
COMMITTEE ON NATIONAL STATISTICS

ROBERT M. GROVES (Chair), Office of the Provost, Georgetown University
LAWRENCE D. BOBO, Department of Sociology, Harvard University
ANNE C. CASE, School of Public and International Affairs, Princeton University, Emeritus
MICK P. COUPER, Institute for Social Research, University of Michigan
JANET M. CURRIE, School of Public and International Affairs, Princeton University
DIANA FARRELL, JPMorgan Chase Institute
ROBERT GOERGE, Chapin Hall at the University of Chicago
ERICA L. GROSHEN, School of Industrial and Labor Relations, Cornell University
HILARY HOYNES, Goldman School of Public Policy, University of California-Berkeley
DANIEL KIFER, Department of Computer Science and Engineering, The Pennsylvania State University
SHARON LOHR, School of Mathematical and Statistical Sciences, Arizona State University, Emerita
JEROME P. REITER, Department of Statistical Science, Duke University
JUDITH A. SELTZER, Department of Sociology, University of California-Los Angeles
C. MATTHEW SNIPP, School of the Humanities and Sciences, Stanford University
ELIZABETH A. STUART, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health
JEANNETTE WING, Data Science Institute and Computer Science Department, Columbia University

BRIAN HARRIS-KOJETIN, Director
MELISSA CHIU, Deputy Director
CONSTANCE F. CITRO, Senior Scholar
Acknowledgments

This report is the product of contributions from many colleagues whom we thank for their time and expert guidance. The project was funded by the National Science Foundation (NSF), and we are indebted to Daniel Goroff, Alan Tompkins, and Cheryl Eavey at NSF for valuable discussions and their support of the study. To address gaps in the literature, and to provide a forum for public comment, the panel convened two public workshop sessions in December 2021. Employees from statistical agencies in the U.S. and Europe, researchers, and private sector representatives described the impediments they confronted while attempting to blend nontraditional (usually private sector) data sources to improve national statistics. The panel thanks the following individuals for presenting at these sessions: Cheryl Eavey (NSF) provided an overview of the project, highlighting areas where the panel's contributions could be especially impactful. Emilda Rivers (National Center for Science and Engineering Statistics) and Andrew Reamer (George Washington University) described federal statistical system initiatives that are examining issues similar to those that the panel studied. Ivan Deloach (Federal Geographic Data Committee), Mathew Shapiro (University of Michigan), and John Stevens (Federal Reserve Board of Governors) all offered valuable comments on these initiatives. Antonio Chessa (Statistics Netherlands), Sarah Henry (UK Office for National Statistics), and Geoff Bowlby (Statistics Canada) offered their experiences in gathering and using private sector data in the production of national statistics, notably describing the challenges they faced and overcame. Stephanie Studds (Census Bureau), Matt Gee (Brighthive), John Haltiwanger (University of Maryland), and John Stevens (Federal Reserve Board of Governors) informed the panel on the use of private sector transaction data at U.S. statistical agencies.
In a session on the use of private sector health data by federal statistical agencies and nonprofits, Mary Bohman (Bureau of Economic Analysis), Brian Moyer (National Center for Health Statistics), and Niall Brennan (The Health Care Cost Institute) pointed out unique issues in obtaining and working with these data. The workshop sessions concluded with a discussion of issues that arise when using private sector data for official statistics and research, especially the relationship between data subject and data holder, as well as the changing legal, regulatory, and privacy landscape regarding private sector data. Nathan Persily (Stanford Law School) described his experiences with Social Science One and draft legislation that could facilitate learning more about the benefits and limitations of private sector data. Salome Viljoen (Columbia Law School) provided the panel with an overview of the philosophical and legal underpinnings regarding notions of privacy. Panel member danah boyd offered her thoughts on the concept of "data as a gift" and how this theory informs data exchange agreements. Kadija Ferryman (Johns Hopkins Public Health), Frank Nothaft (CoreLogic), DJ Patil (Harvard University), Katherine Wallman (U.S. Office of Management and Budget (retired)), and Maurine Haver (Haver Analytics) provided excellent commentary on the issues set forth by Persily, Viljoen, and boyd.
The panel could not have conducted its work efficiently without the capable staff of the National Academies of Sciences, Engineering, and Medicine: Brian Harris-Kojetin, director of the Committee on National Statistics, provided institutional leadership and substantive contributions during meetings. Kirsten Sampson-Snyder, Division of Behavioral and Social Sciences and Education, expertly coordinated the review process; and Susan Debad provided thorough final editing that improved the readability of the report for a wide audience. We also thank Rebecca Krone and Joshua Lang for well-organized and efficient logistical support of the panel's meetings, and Katelyn Stenger, who provided valuable research support throughout the project. On behalf of the panel, I thank the study directors, Thomas Mesenbourg and Michael Siri, for their excellent management of the panel's work. The quality and timeliness of this report would not have been possible without their contributions. Finally, and most importantly, a note of appreciation is in order for my fellow panel members. This report reflects their collective expertise and commitment.

This Consensus Study Report was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published report as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.

The panel thanks the following individuals for their review of this report: Richard D. Alba (Department of Sociology, The Graduate Center, City University, New York), Claire McKay Bowen (Statistical Methods Group, Center on Labor, Human Services, and Population, Technology and Data Science, Urban Institute), Alan Butler (Office of President, Electronic Privacy Information Center, Washington, DC), Kathleen Cagney (Institute for Social Research and Department of Sociology, University of Michigan), Laura DeNardis (School of Communication, American University), Kenneth E. Poole (Office of the President, Center for Regional Economic Competitiveness, Arlington, VA), Kosali Simon (O'Neill School of Public and Environmental Affairs, Indiana University, Bloomington), and Timothy D. Wilson (Department of Psychology, University of Virginia).

Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions, nor did they see the final draft of the report before its release. The review of the report was overseen by Cynthia Clark, independent consultant, McLean, VA, and Kathleen Mullan Harris, Department of Sociology, University of North Carolina. Appointed by the National Research Council's Report Review Committee, they were responsible for making certain that the independent examination of this report was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of the report rests entirely with the authoring panel and the National Academies of Sciences, Engineering, and Medicine.

Robert M. Groves
Chair
Panel on the Scope, Components, and Key Characteristics of a 21st-Century Data Infrastructure
Contents

Boxes, Figures, and Tables
Acronyms
Glossary of Select Terms
Summary

1 Introduction
    Background
    Interpretation of the Charge
    Evidence Base for Report
    Report Structure

2 The United States Needs a New National Data Infrastructure
    Background
    Motivation
    Producing National Statistics: Declining Response Rates and Increased Costs
    The Digital Data Revolution Presents Opportunities and Challenges
    Current Efforts to Use Digital Data to Repair Weaknesses in National Statistics Demonstrate the Possibilities and Limitations of Alternative Data Sources
    Reports Recommend the Use of Blended Data
    Recent Congressional Data-Related Initiatives: Necessary But Not Sufficient
    Summary

3 A Vision for a New National Data Infrastructure
    Vision for a 21st-Century National Data Infrastructure
    Outcomes of a Data Infrastructure
    Key Attributes of a New National Data Infrastructure
    Attributes of a New Data Infrastructure
    Attribute 1: Safeguards and Advanced Privacy-Enhancing Practices, to Minimize Possible Individual Harm
    Attribute 2: Statistical Uses Only, for Common-Good Information, with Statistical Aggregates Freely Shared with All
    Attribute 3: Mobilization of Relevant National Digital Data Assets, Blended in Statistical Aggregates to Provide Benefits to Data Holders, with Societal Benefits Proportionate to Possible Costs and Risks
    Attribute 4: Reformed Legal Authorities Protecting All Parties' Interests
    Attribute 5: Governance Framework and Standards Effectively Supporting Operations
    Attribute 6: Transparency to the Public Regarding Analytical Operations Using the Infrastructure
    Attribute 7: State-of-the-Art Practices for Access, Statistical, Coordination, and Computational Activities; Continuously Improved to Efficiently Create More Secure, More Useful Information
    Summary
    Appendix 3A: Laws and OMB Guidance on Confidentiality and Privacy Protection
    Appendix 3B: Examples of Standards That Would Be Useful to Any Data Infrastructure

4 Blended Data: Implications for a New Data Infrastructure and its Organization
    Key Data Holders for a 21st-Century Data Infrastructure
    Principal Federal Statistical Agencies and Units
    Federal Program and Administrative Agencies
    State, Tribal, Territory, and Local Governments
    Private Sector Enterprises
    Data Brokers
    Nonprofit and Academic Institutions
    Crowdsourced or Citizen-Science Data Holders
    Which Data Should Be Included?
    Fitness-for-Use to Produce Key Information for the Country
    Data Minimized to Satisfy Pre-Specified Purposes
    Data Access and Use Respect Data Holders and Data Subjects Interests and Privacy
    Prioritize Easily Acquired Data That Provide Tangible Benefits
    Available, Usable Metadata Is Essential for Statistical Purposes
    Blended Data Require New Statistical Methods
    Blended Data Require New Statistical Designs
    Blended Data Requires New Data Infrastructure Capabilities
    Blended Data Poses New Privacy and Ethical Challenges
    Multiple Organizational Structures Can Support the New Data Infrastructure
    Organizational Models to Facilitate Cross-Sector Data Access and Use
    Summary

5 Building a New Data Infrastructure Requires Identifying Short- and Medium-Term Activities
    Attribute 1: Safeguards and Advanced Privacy-Enhancing Practices to Minimize Possible Individual Harm
    Attribute 2: Statistical Uses Only, for Common Good Information, with Statistical Aggregates Freely Shared with All
    Attribute 3: Mobilization of Digital Data Assets, Blended in Statistical Aggregates, to Provide Benefits to Data Holders, with Societal Benefits Proportionate to Possible Costs and Risks
    Attribute 4: Reformed Legal Authorities Protecting All Parties' Interests
    Attribute 5: Governance Framework and Standards Effectively Supporting Operations
    Attribute 6: Transparency to the Public About Analytical Operations Using a New Data Infrastructure
    Attribute 7: State-of-the-Art Practices for Access, Statistical, Coordination, and Computational Activities, Continuously Improved to Efficiently Create Increasingly Secure and Useful Information
    New Partnerships Must Be Formed
    Summary

References

Appendixes
    A Biographical Sketches of Panel Members
    B Workshop Agendas
Boxes, Figures, and Tables

BOXES
S-1 Seven Attributes of a 21st-Century National Data Infrastructure Vision
1-1 Statement of Task
2-1 Terminated and Threatened Statistical Programs
3-1 Outcomes of a New Data Infrastructure
3-2 Seven Attributes of a 21st-Century National Data Infrastructure Vision
3-3 Designated U.S. Federal Statistical Agencies and Units
3-4 Data Holders Proposed to Share Data in a 21st-Century National Data Infrastructure
3-5 Advisory Committee on Data for Evidence Building Recommendations for Regulatory Action (Year 1 Report)
3-6 Data Governance Components
4-1 Updated CEP Examples of Selected Administrative Data Assets
4-2 Data Holders Proposed to Share Data in a 21st-Century National Data Infrastructure
4-3 Key Characteristics of Data Assets to Be Included in a 21st-Century National Data Infrastructure
4-4 Capabilities Needed for a 21st-Century National Data Infrastructure Requiring Enhancement Over Current Practices
4-5 Illustrative Organizational Entity Questions
5-1 Properties of First Additions to a New Data Infrastructure for Statistical Purposes
5-2 Data Governance Components

FIGURES
5-1 Steps in building a regulatory environment supporting private-sector data sharing for national statistical purposes.
5-2 Entity actions to build partnerships with data-sharing private-sector data holders.

TABLES
2-1 Selected Household Survey Response Rates
4-1 Dimensions of Data Quality
5-1 Short- and Medium-Term Tasks for a 21st-Century National Data Infrastructure
Acronyms

ACDEB  Advisory Committee on Data for Evidence Building
ACS  American Community Survey
ADC  America's DataHub Consortium
AEAStat  American Economic Association Committee on Statistics
AHRQ  Agency for Healthcare Research and Quality
BEA  Bureau of Economic Analysis
BJS  Bureau of Justice Statistics
BLS  U.S. Bureau of Labor Statistics
CDC  Centers for Disease Control and Prevention
CDO  (federal) chief data officer
CE  Consumer Expenditure Survey
CEP  U.S. Commission on Evidence-Based Policymaking
CIPSEA  Confidential Information Protection and Statistical Efficiency Act
CNSTAT  Committee on National Statistics
CPI  Consumer Price Index
CPS  Current Population Survey
CSDA  Common Statistical Data Architecture
CSPA  Common Statistical Production Architecture
DDI  Data Documentation Initiative
DHS  Department of Homeland Security
EHR  electronic health records
EIA  Energy Information Administration
FCSM  Federal Committee on Statistical Methodology
FERPA  Family Educational Rights and Privacy Act
FFRDC  federally funded research and development center
FGDC  Federal Geographic Data Committee
FISMA  Federal Information Security Management Act
FITARA  Federal Information Technology Acquisition Reform Act
FSRDC  federal statistical research data center
GAO  U.S. Government Accountability Office
GSBPM  Generic Statistical Business Process Model
GSIM  Generic Statistical Information Model
HHS  U.S. Department of Health and Human Services
HIPAA  Health Insurance Portability and Accountability Act
ICPSR  Inter-university Consortium for Political and Social Research
ICSP  Interagency Council on Statistical Policy
IEC  International Electrotechnical Commission
IRB  institutional review board
IRS  Internal Revenue Service
ISO  International Organization for Standardization
IT  information technology
MEPS HC  Medical Expenditure Panel Survey, Household Component
MOUs  memoranda of understanding
NARA  National Archives and Records Administration
NASEM  National Academies of Sciences, Engineering, and Medicine
NIA  National Institutes of Health
NIEM  National Information Exchange Model
NIST  National Institute of Standards and Technology
NRC  National Research Council
NSDS  National Secure Data Service
NSF  National Science Foundation
OECD  Organisation for Economic Co-operation and Development
OMB  U.S. Office of Management and Budget
QCEW  Quarterly Census of Employment and Wages
SAP  Standard Application Process
SBA  Small Business Administration
SDMX  Statistical Data and Metadata eXchange
SHIELD  Stop Hacks and Improve Electronic Data Security Act
SORNs  system of records notices
SSA  Social Security Administration
UNECE  United Nations Economic Commission for Europe
USDA  U.S. Department of Agriculture
Glossary of Select Terms

Data Equity: No common definition exists within the federal government; neither the Equitable Data Working Group in its recent report (The White House, 2022b) nor the U.S. Census Bureau (https://www.census.gov/about/what/data-equity.html) defines the term. In this report, data equity "refers to the consideration, through an equity lens, of the ways in which data is collected, analyzed, interpreted, and distributed" (Lee-Ibarra, 2021).

Data Infrastructure: Data assets; the technologies used to discover, access, share, process, use, analyze, manage, store, preserve, protect, and secure those assets; the people, capacity, and expertise needed to manage, use, interpret, and understand data; the guidance, standards, policies, and rules that govern data access, use, and protection; the organizations and entities that manage, oversee, and govern the data infrastructure; and the communities and data subjects whose data is shared and used for statistical purposes and may be impacted by decisions that are made using those data assets.

Equitable Data Working Group: Executive Order 13985, "Advancing Racial Equity and Support for Underserved Communities Through the Federal Government," issued by President Biden in January 2021, formed the Equitable Data Working Group. It is tasked to identify inadequacies and areas of improvement within federal data and outline a strategy for increasing data available for measuring equity and representing the diversity of the American people and their experiences (The White House, 2021b).

Evidence Act: Also referred to as "Evidence-Based Policymaking Act of 2018." This act requires agency data to be accessible and requires agencies to plan to develop statistical evidence to support policymaking (U.S. Congress, 2019).

Standard Application Process (SAP): The federal statistical system is currently developing an SAP for applying for access to confidential data assets. When fully built, the SAP will serve as a "front door" through which to apply for permission to use protected data from any of the 16 federal statistical agencies and units for evidence building (https://ncses.nsf.gov/about/standard-application-process). Testing for the portal will occur in September 2022, with the expectation that the site will be operational by the end of 2022. The current portal is at: https://www.researchdatagov.org/.