Consensus Study Report
NATIONAL ACADEMIES PRESS 500 Fifth Street, NW Washington, DC 20001
This activity was supported by a contract between the National Academy of Sciences and the U.S. Census Bureau (#1333LB21D0000003/1333LB21F00000248). Support of the work of the Committee on National Statistics is provided by a consortium of federal agencies through a grant from the National Science Foundation (No. 1560294) and several individual contracts. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-70710-7
International Standard Book Number-10: 0-309-70710-2
Digital Object Identifier: https://doi.org/10.17226/27169
Library of Congress Control Number: 2023952292
This publication is available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.
Copyright 2024 by the National Academy of Sciences. National Academies of Sciences, Engineering, and Medicine and National Academies Press and the graphical logos for each are all trademarks of the National Academy of Sciences. All rights reserved.
Printed in the United States of America.
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2024. A Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation. Washington, DC: The National Academies Press. https://doi.org/10.17226/27169.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.
The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president.
The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.
The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.
Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study’s statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee’s deliberations. Each report has been subjected to a rigorous and independent peer-review process and it represents the position of the National Academies on the statement of task.
Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies.
Rapid Expert Consultations published by the National Academies of Sciences, Engineering, and Medicine are authored by subject-matter experts on narrowly focused topics that can be supported by a body of evidence. The discussions contained in rapid expert consultations are considered those of the authors and do not contain policy recommendations. Rapid expert consultations are reviewed by the institution before release.
For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.
PANEL TO CREATE A ROADMAP FOR DISCLOSURE AVOIDANCE IN THE SURVEY OF INCOME AND PROGRAM PARTICIPATION
TRIVELLORE RAGHUNATHAN (Chair), University of Michigan
SCOTT H. HOLAN, University of Missouri
V. JOSEPH HOTZ, Duke University
THOMAS KRENZKE, Westat
FANG LIU, University of Notre Dame
ROBERT A. MOFFITT, Johns Hopkins University
AMY PIENTA, Inter-university Consortium for Political and Social Research
NATALIE SHLOMO, University of Manchester
ALEKSANDRA (SEŠA) SLAVKOVIĆ, The Pennsylvania State University
HEEJU SOHN, Emory University
SALIL VADHAN, Harvard School of Engineering and Applied Sciences
JENNIFER VAN HOOK, The Pennsylvania State University
Staff
BRADFORD CHANEY, Study Director
DAVID JOHNSON, Senior Program Officer
NANCY KIRKENDALL, Senior Program Officer
MADELEINE GOEDICKE, Senior Program Assistant
JOSHUA LANG, Senior Program Assistant
COMMITTEE ON NATIONAL STATISTICS
KATHARINE ABRAHAM (Chair), Department of Economics, University of Maryland, College Park
MICK P. COUPER, Institute for Social Research, University of Michigan
DIANA FARRELL, JPMorgan Chase Institute, Washington, DC
ROBERT GOERGE, Chapin Hall at the University of Chicago
ERICA L. GROSHEN, School of Industrial and Labor Relations, Cornell University
DANIEL E. HO, Stanford Law School, Stanford University
HILARY HOYNES, Goldman School of Public Policy, University of California, Berkeley
DANIEL KIFER, Department of Computer Science and Engineering, The Pennsylvania State University
SHARON LOHR, School of Mathematical and Statistical Sciences, Arizona State University, Emerita
NELA RICHARDSON, ADP Research Institute, Roseland, NJ
C. MATTHEW SNIPP, School of the Humanities and Sciences, Stanford University
ELIZABETH A. STUART, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health
Staff
MELISSA CHIU, Director
BRIAN HARRIS-KOJETIN, Senior Scholar
CONSTANCE F. CITRO, Senior Scholar
Reviewers
This Consensus Study Report was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published report as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.
We thank the following individuals for their review of this report:
Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations of this report, nor did they see the final draft before its release. The review of this report was overseen by JOHN L. CZAJKA, Independent Consultant, and WILLIAM W. STEAD, Vanderbilt University Medical Center. They were responsible for making certain that an independent examination of this report was carried out in accordance with the standards of the National Academies and that all review comments were carefully considered. Responsibility for the final content rests entirely with the authoring committee and the National Academies.
Acknowledgments
This Consensus Study Report reflects the invaluable contributions of many colleagues, whom the panel thanks for their generous time, effort, and expert guidance. On behalf of the panel, I extend my deepest appreciation to the sponsor of this work: the Census Bureau within the U.S. Department of Commerce. Without the Census Bureau’s support, including through briefings and responses to the panel’s information requests, this study would not have been completed. In particular, the panel thanks David Waddington, Division Chief of the Social, Economic, and Housing Statistics Division; Jason Fields, Senior Researcher for Demographic Programs and Survey of Income and Program Participation (SIPP); and Holly Fee of the Social, Economic, and Housing Statistics Division. The panel also thanks all of those who provided briefings on key issues to the panel. These include Gary Benedetto, Steve Clark, Aref Dajani, Holly Fee, Jason Fields, Benjamin Gurrentz, Adriana Hernández-Viver, Yerís H. Mayol-García, Robert Munk, Rolando Rodriguez, Rachel Shattuck, Phyllis Singer, Jordan Stanley, Sam Szelepka, Evan Totty, and Ashley Westra, all of the Census Bureau; Jerry Reiter, Duke University; danah boyd, Microsoft Research and Georgetown University; and Lars Vilhuber, Cornell University.
The panel also extends its gratitude to members of the staff of the National Academies of Sciences, Engineering, and Medicine for their significant contributions to this report. Kirsten Sampson Snyder and Bea Porter masterfully shepherded the report through the review and production process, and Marc DeFrancis provided useful editorial advice that streamlined the report. Joshua Lang and Madeleine Goedicke provided administrative and logistical support for numerous panel meetings.
Brian Harris-Kojetin, senior scholar and former director of the Committee on National Statistics, and Melissa Chiu, current director of the Committee on National Statistics, had key roles in the original study design and selection and recruitment of the study panel, along with ongoing support of the panel and the preparation of the report. Bradford Chaney, study director and senior program officer, assisted in leading the panel and acquiring needed resources. Nancy Kirkendall and David Johnson, both senior program officers, provided valuable assistance based on their past experience with SIPP and the Census Bureau.
To my colleagues on the panel, I appreciate your diligence and expertise in examining the difficult issues raised in this study, and your spirit of cooperation in coming together to reach a consensus. Your shared wisdom from across a wide range of expertise areas, team spirit, and generosity of time brought innovative ideas to the discussions and produced this report. It was a great pleasure to work with you all. Thank you.
Trivellore Raghunathan, Chair
Panel to Create a Roadmap for Disclosure Avoidance in the Survey of Income and Program Participation
Contents
PREVIOUS COMMITTEE ON NATIONAL STATISTICS STUDIES RELATING TO SIPP
An Interim Assessment of SIPP (1989)
The Future of the Survey of Income and Program Participation (1993)
Reengineering the Survey of Income and Program Participation (2009)
The 2014 Redesign of the Survey of Income and Program Participation: An Assessment (2018)
2 Overview: Survey of Income and Program Participation (SIPP) and Its Uses
SAMPLING DESIGN AND METHODOLOGY FOR SIPP
THE CONTENT COLLECTED THROUGH SIPP
Selected Studies Show the Range and Depth of SIPP
DISCLOSURE AVOIDANCE IN SIPP FROM THE 1990s TO THE PRESENT
Administrative Data Added to SIPP
3 Measuring of Disclosure Risk and Ways of Assessing It
Differential Privacy and Disclosure Risk
Comparing the Measures of Disclosure Risk
Assessing Disclosure Risk Through Analytic or Empirical Approaches
Quantifying Risks of Re-identification
EXAMINING DISCLOSURE RISKS FACED BY SIPP
The Level of Detail in SIPP Presents a Challenge
Choosing Which External Databases to Compare with SIPP
The Census Bureau’s Measurement of Risk Through a Re-identification Study
RECOMMENDATIONS TO THE CENSUS BUREAU FOR ONGOING DISCLOSURE RISK ASSESSMENT STRATEGY FOR SIPP
Limitations of the Census Bureau’s Re-identification Study
Longitudinal Data Increase the Risk of Disclosure
Making Disclosure Analysis an Ongoing Activity
CONCLUSIONS AND RECOMMENDATIONS
4 Overview of Disclosure Limitation Approaches
PROTECTING CONFIDENTIALITY THROUGH OFFERING MULTIPLE MODES OF ACCESS
METHODS OF DISCLOSURE LIMITATION INVOLVING CHANGES TO THE FILE
5 Disclosure Limitation Approaches: Secure Online Data Access (SODA)
PRIVACY PROTECTIONS IN THE SODA
REQUIREMENTS FOR ORGANIZATIONS MANAGING A CENSUS BUREAU SODA
CONCLUSIONS AND RECOMMENDATION
6 Disclosure Limitation Approaches: Synthetic Data
STATISTICAL EVALUATION OF DATA UTILITY
Validation and Verification Systems
CHALLENGES IN SIPP DATA SYNTHESIS AND UTILITY EVALUATION
CONCLUSIONS AND RECOMMENDATIONS
7 Disclosure Limitation Approaches: Flexible Table Generator and Remote Analysis Platforms
8 Disclosure Limitation Approaches: Geography Variables
USE 1: IDENTIFYING SPECIFIC GEOGRAPHIES IS SOMETIMES UNNECESSARY
USE 2: MAKING SUBNATIONAL ESTIMATES
Incorporating Spatial Dependence and Correlated Random Effects
Leveraging Auxiliary Data and Applying Disclosure Avoidance
9 Maintaining Usability While Preserving Confidentiality: Potential Strategies
Use 1: Analyses Relying on Unique SIPP Content
Use 3: Analyses Relying on Granular Data and Complex Recoded Data
Use 4: Causal Effects of Public Policies
Use 5: Analyses Relying on Administrative Record Linkages
How the Introduction of SODA and Reinterpretation of Title 13 Would Affect Accessibility
CONCLUSIONS AND RECOMMENDATIONS
10 Conclusions and Recommendations
SIPP AND THE RISK OF DISCLOSURE AVOIDANCE
CURRENT SIPP APPROACHES TO DISCLOSURE AVOIDANCE
TIERS OF ACCESS AS A TOOL FOR PROTECTING PRIVACY
ADDITIONAL TOOLS TO SUPPORT DISCLOSURE AVOIDANCE
TITLE 13 AND ITS REQUIREMENTS FOR DISCLOSURE AVOIDANCE
THE BENEFITS OF ENHANCED COMMUNICATION
LINKING SIPP DATA WITH ALTERNATIVE DATA SOURCES
FUTURE RESEARCH TO ADDRESS CURRENT KNOWLEDGE GAPS
CENSUS BUREAU RESOURCES FOR ADDRESSING DISCLOSURE AVOIDANCE
Appendix A Technical Details on Measuring Disclosure Risk
Appendix B Inferences Based on Multiple Synthetic Data
Appendix C Technical Details for Differential Privacy Table Builder
Appendix D Technical Details for Geography Variables
Appendix E Data Collection Report
Boxes, Figures, and Tables
BOXES
S-2 Methods of Adjusting the Data to Protect Confidentiality
2-1 Relationship Categories Used in SIPP
2-2 SIPP Disclosure Avoidance Procedures
4-1 Sample Data Usage Agreement
9-1 Illustration of Feasibility: Descriptive Analysis Using Unique SIPP Content
9-2 Illustration of Feasibility: Longitudinal Analysis with Household Relational Data
FIGURES
S-1a Stages of disclosure avoidance
S-1b Disclosure avoidance approaches and tiers of access
6-2 Selected variables are synthetic
6-3 Variables are synthesized for selected respondents
6-4 How a validation server works
9-1 Uses of SIPP data in the most cited and recent studies (percentage)
9-2 Unique SIPP content used in the most cited and recent studies (percentage)
9-3 Number of respondents reporting use of various SIPP modules
E-1 Characteristics of respondents to call for information
E-2 Number of different SIPP data sources used by respondents to the call for information
E-3 Types of respondents to the call for information that used each format of SIPP data file
E-4 Number of respondents to the call for information who used each module or topic area within SIPP
E-5 Number of modules or topic areas used by respondents to the call for information
E-6 Types of analysis performed by respondents to the call for information
E-9 Impact of encountering difficulties with accessing SIPP data
E-10 How the results from SIPP data were used
E-11 Fields in which SIPP data findings were published
E-12 Degree to which SIPP findings could be met by standardized tables
TABLES
1-1 List of Briefings Provided to the Panel
2-1 Data Collected in SIPP 2020, by Broad Category
2-2 SIPP Bibliographic References, 2000–2014, by Topic
3-4 Key Areas in Which Three Commercial Databases Have Data That Correspond to SIPP Data
9-1 Matrix for Evaluating Feasibility with the Context of Various Modes of Access
9-2 Example of Evaluation of Accessibility by Mode of Access and User Type (1 = low to 4 = high)
Acronyms
ACS | American Community Survey
ASA/SRM | American Statistical Association’s Survey Research Methods
BLS | Bureau of Labor Statistics
CNSTAT | Committee on National Statistics
CPS | Current Population Survey
DHHS | U.S. Department of Health and Human Services
FISMA | Federal Information Security Modernization Act of 2014
FSRDC | Federal Statistical Research Data Center
GAN | generative adversarial network
ICAR | intrinsic conditional autoregressive
id | identifier
IRB | Institutional Review Board
IRS | Internal Revenue Service
LBD | Longitudinal Business Databases
MINT | Modeling Income in the Near Term
NF | normalizing flows
NSDS | National Secure Data Service
PIK | Protected Identification Key
PSID | Panel Study of Income Dynamics
QIDs | quasi-identifiers
RAP(s) | Remote Analysis Platform(s)
RDC | Restricted Data Center
SAE | small area estimation
SCHIP | State Children’s Health Insurance Program
SDL | statistical disclosure limitation
SIPP | Survey of Income and Program Participation
SNAP | Supplemental Nutrition Assistance Program
SODA | secure online data access
SSA | Social Security Administration
SSB | SIPP Synthetic Beta
SSI | Supplemental Security Income
SUDA | Special Uniques Detection Algorithm
TANF | Temporary Assistance for Needy Families
USDA | U.S. Department of Agriculture
VAE | variational autoencoders
VDE | virtual data enclave
WIC | Special Supplemental Nutrition Program for Women, Infants, and Children
Glossary
Added noise for privacy protection The altering of survey responses (e.g., by adding or subtracting some amount), which may vary across responses and may be randomized both in terms of which data are altered and in how much the data are altered.
Bottom-coding Setting a minimum value that may be released; for example, all values at or below $1,000 are set to $1,000.
Data perturbation Changes to the data to protect confidentiality; these include adding noise and data swapping.
Data suppression Reducing the amount of data that are released, such as by completely eliminating some measures, modifying the measures to make them less specific (e.g., top-coding, bottom-coding, and collapsing a continuous variable to become a categorical variable), and modifying what data are released (e.g., suppressed table cells based on fewer than three observations).
Data swapping Data items are swapped between two or more comparable respondents in order to protect confidentiality and to provide deniability if someone claims to have identified a respondent—for example, swapping the state of residence for two respondents. The swapping may be directed (i.e., designed to address a particular disclosure risk for a respondent) or random. The purpose is to maintain the same overall totals (and hopefully similar statistical relationships) while protecting the confidentiality of who gave what response.
Data Use Agreement Specifies limitations on how data may be used and publicly released; for example, the data must be used only for statistical purposes and not to identify individuals.
Differential privacy Differential privacy is the leading form of formal privacy used by government agencies and by researchers on privacy methods. It is a framework for both quantifying the level of disclosure risk (under several related metrics) and developing disclosure limitation methods that control the risk of single and multiple releases under those metrics.
Disclosure Review Board A committee that sets limits on what data may be released—for example, by limiting what variables may be included in a public-use file or restricting what tables may be published.
Federal Statistical Research Data Center (FSRDC) FSRDCs are created through partnerships between federal statistical agencies and leading research institutions. They provide secure access to restricted data, either on-site or virtually. Access requires approval of both the individuals seeking access and the project to be performed. There are also financial costs involved.
Formal privacy Formal privacy refers to any rigorous and unambiguous framework for quantifying disclosure risk in an internally consistent statistical framework that bounds the success probability of a wide class of potential attacks on privacy.
Gold Standard File A file containing original (nonsynthesized) data created by the Census Bureau as a step toward producing the synthetic data. It is also used to verify whether statistics based on the synthetic data are consistent with those using the original data. It is not a master file of all Survey of Income and Program Participation (SIPP) original data but rather was created specifically for SIPP synthetic data.
Institutional Review Board (IRB) A committee that reviews potential research studies on human subjects and monitors them to ensure that they comply with applicable regulations, meet commonly accepted ethical standards, follow institutional policies, and adequately protect research participants.
Microdata Data at the level of individual persons or respondents (as differentiated from summary statistical data).
National Secure Data Service (NSDS) The creation of an NSDS has been proposed by the Commission on Evidence-Based Policymaking to support statistical evidence building through data sharing and linking, providing a pathway for those desiring data access and expertise. Currently the National Science Foundation is carrying out a demonstration project to inform whether and how an NSDS will be established in the future.1
Privacy budget A limit on the total amount of data that may be released within the context of formal privacy; each published statistic draws from this budget, and at some point either no additional statistics may be released or the privacy budget must be changed.
Public-use file A file containing microdata that can be downloaded by anyone and may be analyzed and reported on without limitations.
Quasi-identifier A data value that doesn’t directly identify a person but that might be used to identify a person. For example, while a name or address would identify a person directly and be an identifier, a zip code would provide highly specific information that might help to identify a person and would therefore be a quasi-identifier.
Recoding Often used in disclosure avoidance to reduce the number of discrete values that appear. For example, a continuous variable such as household income might be converted into a categorical measure with only a few categories, or a categorical variable such as the state of residence might be converted to a measure of geographic region. Recoding is also used to make two different databases more consistent with each other.
Restricted-use file A file in which there are limitations in how the data may be analyzed and reported on. The restrictions may range from clicking on a user agreement concerning how the data will be used (with the data remaining available to anyone consenting to the user agreement) to a file in which an application process is designed to control who is allowed access and in which strong controls may be in place on what data can be accessed and what can be reported.
Secure online data access (SODA) A mechanism through which data may be accessed virtually (online) with controls to protect respondents’ privacy, such as a process to gain permission to work with the data, controls on which data are accessible, and controls on what data may be publicly released. Also called a virtual data enclave, the term is used here to differentiate it from virtual data access through FSRDCs, with less stringent controls on access and potentially less complete access than would be available through an FSRDC.
___________________
1 https://ncses.nsf.gov/about/national-secure-data-service-demo
SIPP Synthetic Beta (SSB) A synthetic data product created by the Census Bureau that combines selected data from SIPP with administrative tax and benefit data. Early versions were partially synthetic; the latest version is fully synthetic.
Synthetic data Data that are created through a statistical modeling process to have the same statistical properties as the original data. The intention is to allow researchers to perform the kinds of statistical calculations and get results that are similar to what would be produced from the original data without allowing access to the original data. The data may be fully synthetic (all of the records are generated from the model) or partially synthetic (some records or variables are generated from the model, while others are identical to the original data).
Top-coding Setting a maximum value that may be released—for example, by setting all values at or above $10,000 to $10,000.
Verification and validation system A system designed to measure whether the results using altered data (due to disclosure avoidance procedures, particularly as applied to synthetic data) are comparable to those from the unaltered data and that provides researchers with validated results that may include added noise to protect confidentiality.
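Several of the glossary entries above (bottom-coding, top-coding, recoding, data swapping, and privacy budget) describe simple operations on data values. The following Python fragment is a minimal sketch of those operations for illustration only. The function names, the income cut points, and the additive accounting used for the privacy budget are all assumptions made for this example; they do not describe the Census Bureau's actual procedures for SIPP.

```python
def bottom_code(values, floor):
    """Bottom-coding: every value at or below `floor` is released as `floor`."""
    return [max(v, floor) for v in values]


def top_code(values, cap):
    """Top-coding: every value at or above `cap` is released as `cap`."""
    return [min(v, cap) for v in values]


def recode(value, cut_points):
    """Recoding: map a continuous value to a category index.

    The index is the number of cut points the value has reached, so
    cut points [25000, 50000] (hypothetical) give categories 0, 1, and 2.
    """
    return sum(1 for c in cut_points if value >= c)


def swap_values(records, key, i, j):
    """Data swapping: exchange one item between records i and j.

    Returns new records; overall totals for `key` are unchanged.
    """
    swapped = [dict(r) for r in records]
    swapped[i][key], swapped[j][key] = swapped[j][key], swapped[i][key]
    return swapped


class PrivacyBudget:
    """Privacy budget: each released statistic draws from a fixed total.

    Assumes simple additive accounting; real systems may account differently.
    """

    def __init__(self, total):
        self.remaining = total

    def spend(self, amount):
        if amount > self.remaining + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.remaining -= amount
        return self.remaining
```

With the dollar amounts used in the glossary, `bottom_code([500, 1000, 2500], 1000)` leaves only values of at least $1,000, and `top_code([5000, 10000, 25000], 10000)` leaves only values of at most $10,000. Once the budget is spent, `spend` raises an error, which mirrors the point in the definition that no additional statistics may be released unless the budget is changed.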